
Apache Spark

Runtime engine

Apache Spark is a distributed computing framework for processing large datasets across a cluster of machines.

Use it when

  • You are working with big data (datasets too large to process comfortably on a single machine).
  • You would like to parallelize computation across multiple machines.
  • You want fast large-scale data processing.
  • You want a machine learning API (MLlib) and a rich set of operators for transforming data.

Watch out

  • Requires clusters with ample RAM, since Spark keeps datasets in memory.
  • Higher infrastructure and setup costs.

Example stacks

Airflow + MLflow stack