Use it when
- You want a serving framework that supports a wide range of ML frameworks.
- You want a model serving solution that provides tight integration with the hardware stack, model versioning, and dynamic batching.
- You want to autoscale GPU instances.
- You want to use Kubernetes on GPU instances.
- You want to minimize model initialization times in inference workloads.
- You want built-in model monitoring features.
- You want to optimize GPU instance utilization.
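Dynamic batching, mentioned above, generally works by holding incoming requests for a short window so they can be executed together as one batch, trading a little latency for much better GPU utilization. A minimal, framework-agnostic sketch of that idea (the `dynamic_batch` function and its `max_batch_size`/`max_wait_s` parameters are illustrative names, not this tool's API):

```python
import time
from queue import Queue, Empty

def dynamic_batch(requests: Queue, max_batch_size: int = 8, max_wait_s: float = 0.01) -> list:
    """Collect requests until the batch is full or the wait budget expires."""
    batch = [requests.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # wait budget spent: serve a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break  # no more requests arrived in time
    return batch  # hand the whole batch to the model in one forward pass

# Usage: three queued requests are served as a single batch.
q = Queue()
for request_id in range(3):
    q.put(request_id)
print(dynamic_batch(q))  # → [0, 1, 2]
```

Real serving frameworks run this loop per model on a scheduler thread and tune the batch size and wait window per deployment.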
Watch out
- Requires supported NVIDIA GPUs.
Example stacks
Airflow + MLflow stack