The benefits of multi-model serving where latency matters. By Alejandro Lince and Steven Ross.
Machine Learning (ML) inference, defined as the process of deploying a trained model and serving live queries with it, is an essential component of many deployed ML systems and often accounts for a significant portion of their total cost. Costs can grow even faster when hardware accelerators such as GPUs are involved.
Many modern user-focused applications critically depend on ML to substantially improve the user experience (by providing recommendations or filling in text, for example). Accelerators such as GPUs allow for even more complex models to still run with reasonable latencies, but come at a cost.
The article then deals with:
- Benefits of multi-model serving
- Analyzing single versus multi-model serving latency
- Costs versus latency
Multi-model serving enables lower cost while maintaining high availability and acceptable latency by making better use of the RAM capacity of large VMs. While it is common and simple to deploy only one model per server, the article suggests instead loading many models onto a large, low-latency VM, which should offer acceptable latency at a lower cost. These cost savings also apply to serving on accelerators such as GPUs. Good read!
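To make the cost argument concrete, here is a minimal back-of-the-envelope sketch. It is not taken from the article: the function names, model count, RAM sizes, and prices are all hypothetical, and it ignores latency effects and the extra replicas needed for high availability. It only shows how packing models by RAM changes the number of VMs you pay for.

```python
import math


def vms_needed(num_models: int, model_ram_gb: int, vm_ram_gb: int) -> int:
    """VMs required when each VM holds as many models as fit in its RAM."""
    models_per_vm = vm_ram_gb // model_ram_gb
    if models_per_vm < 1:
        raise ValueError("a single model does not fit in this VM's RAM")
    return math.ceil(num_models / models_per_vm)


def hourly_cost_cents(num_vms: int, vm_price_cents_per_hour: int) -> int:
    """Total hourly cost in cents (integers avoid float rounding noise)."""
    return num_vms * vm_price_cents_per_hour


# Hypothetical numbers, chosen only for illustration.
NUM_MODELS = 100
MODEL_RAM_GB = 2
SMALL_VM = {"ram_gb": 4, "price_cents": 10}    # hosts one model per VM
LARGE_VM = {"ram_gb": 64, "price_cents": 120}  # hosts many models per VM

# Single-model serving: one small VM per model.
single_cost = hourly_cost_cents(NUM_MODELS, SMALL_VM["price_cents"])

# Multi-model serving: pack models onto large VMs by RAM.
large_vms = vms_needed(NUM_MODELS, MODEL_RAM_GB, LARGE_VM["ram_gb"])
multi_cost = hourly_cost_cents(large_vms, LARGE_VM["price_cents"])

print(f"single-model: {NUM_MODELS} VMs, {single_cost} cents/hour")
print(f"multi-model:  {large_vms} VMs, {multi_cost} cents/hour")
```

With these made-up figures the packed deployment needs far fewer VMs. Whether the latency stays acceptable under that packing is the question the article's latency analysis addresses.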