Use it when
- You are working with big data (large datasets).
- You would like to parallelize computation across multiple machines.
- You want fast large-scale data processing.
- You want a machine learning-specific API and many operators that facilitate transforming data.
Watch out
- Requires clusters with higher RAM since it stores datasets in memory.
- Higher infrastructure and setup costs.
Example stacks
Airflow + MLflow stack