Apache Airflow

Pipeline orchestration

Airflow is a platform created by the community to author, schedule, and monitor workflows programmatically.

Use it when

  • You want to visualize pipelines running in production.
  • You want to generate pipelines dynamically as Python code.
  • You want ETL pipelines to extract batch data from multiple sources and run data transformations.
  • You want to automate machine learning model training.
  • You want to create workflows as DAGs (Directed Acyclic Graphs), where each node is a task.
  • You want to configure pipelines as Python code.
  • You want to run a task on multiple workers managed by Celery, Dask, or Kubernetes.
  • You want to define time limits for tasks or workflows to highlight anomalies or inefficiencies.
  • You want a developer-friendly environment and UI.

Watch out

  • Designed to orchestrate batch workflows that rely on time-based scheduling.
  • Airflow is not built to manage event-based or streaming jobs.
  • Airflow does not offer versioning for pipelines.
  • Requires significant user customization to work safely in production workloads.
  • By default, Airflow uses SQLite as its metadata database, which doesn't support concurrent access and risks data loss in production.
  • By default, Airflow runs one task at a time with the SequentialExecutor.
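
The two defaults above are typically the first things overridden for production. A hedged sketch of the relevant `airflow.cfg` entries (the connection string is illustrative; the `[database]` section assumes Airflow 2.3+, where `sql_alchemy_conn` moved out of `[core]`):

```ini
; airflow.cfg overrides for production (values are illustrative)
[core]
; replace the default SequentialExecutor to run tasks in parallel
executor = LocalExecutor

[database]
; replace the default SQLite metadata database with PostgreSQL
sql_alchemy_conn = postgresql+psycopg2://user:pass@localhost/airflow
```

The same settings can be supplied as environment variables (e.g. `AIRFLOW__CORE__EXECUTOR`), which is common in containerized deployments.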

Example stacks

Airflow + MLflow stack


pip install apache-airflow