Model serving

NVIDIA Triton Inference Server is open-source inference serving software that standardizes model deployment and execution, delivering fast and scalable AI in production.

Use it when

  • You want a serving framework that supports a wide range of ML frameworks.
  • You want a model serving solution that provides tight integration with the hardware stack, model versioning, and dynamic batching.
  • You want to autoscale GPU instances.
  • You want to use Kubernetes on GPU instances.
  • You want to minimize model initialization times in inference workloads.
  • You want built-in model monitoring features.
  • You want to optimize GPU instance utilization.
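
As a concrete illustration of the model versioning and dynamic batching mentioned above, here is a minimal sketch of a Triton `config.pbtxt`. The model name, platform, and tensor shapes are assumptions for illustration; Triton expects this file in a model repository laid out as `<repo>/<model-name>/config.pbtxt` with numbered version subdirectories (e.g. `<repo>/<model-name>/1/model.onnx`).

```protobuf
# Hypothetical model "my_classifier" served via the ONNX Runtime backend.
name: "my_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 8

input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 16 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]

# Dynamic batching: Triton groups individual requests into larger
# batches server-side, waiting up to the queue delay to fill them.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}

# Version policy: serve only the two most recent model versions.
version_policy: { latest: { num_versions: 2 } }
```

Dynamic batching is one of the main levers for the GPU utilization goals listed above: larger batches amortize kernel launch and memory transfer overhead at the cost of a small, bounded queueing delay.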

Watch out

  • Requires supported NVIDIA GPUs.

Example stacks

Airflow + MLflow stack