Apache Airflow

by Apache Software Foundation

Open-source data workflow orchestration platform enabling APAC data engineering teams to author, schedule, monitor, and debug complex data pipelines as Python DAGs — with a rich operator ecosystem spanning SQL warehouses, cloud services, ML platforms, and external APIs.

AIMenta verdict
Recommended
5/5

"Apache Airflow is the open-source data pipeline orchestrator for APAC data engineering teams — Python DAGs scheduling and monitoring batch data workflows across warehouses, lakes, and APIs. Best for APAC teams orchestrating complex multi-step ELT and ML training pipelines."

What it does

Key features

  • Python DAG definition — pipelines as version-controlled Python code for APAC data engineering team collaboration
  • Rich operator ecosystem — 1,000+ operators for BigQuery, Snowflake, Spark, dbt, SageMaker, and APIs for APAC data stacks
  • Dependency management — DAG-level task dependency enforcement with retries, SLA monitoring, and timeouts for APAC pipelines
  • Dynamic DAG generation — programmatic DAG creation for APAC multi-tenant or parameter-driven pipeline patterns
  • Web UI — real-time DAG monitoring, task log access, and manual trigger for APAC data operations teams
  • Managed deployment — Amazon MWAA, Google Cloud Composer, and Astronomer managed Airflow for APAC cloud teams
  • KubernetesPodOperator — run each Airflow task in an isolated Kubernetes pod, giving APAC teams per-task resource isolation and dependency management
When to reach for it

Best for

  • APAC data engineering teams orchestrating complex multi-step ELT pipelines across heterogeneous data sources and warehouse targets
  • Engineering organisations running ML training pipelines that require scheduled feature engineering, model training, and evaluation steps
  • APAC data platform teams needing code-based pipeline definition with Git workflow, code review, and CI/CD integration
  • Teams managing APAC batch data workflows at scale where task-level monitoring, retry logic, and SLA tracking are operational requirements
Don't get burned

Limitations to know

  • ! Airflow is not a streaming engine — Airflow orchestrates batch tasks on schedules or triggers; APAC real-time streaming pipelines on Kafka or Flink require complementary tools, not Airflow
  • ! Significant operational overhead for self-hosted deployment — running Airflow on Kubernetes (scheduler, webserver, workers, metadata database) requires APAC platform engineering investment; managed offerings reduce but do not eliminate this cost
  • ! Python DAG complexity at scale — large APAC Airflow deployments with 1,000+ DAGs and many dynamic DAGs create scheduler performance challenges; Airflow 2.x improved this but scaling requires careful DAG optimisation
  • ! Not designed for data quality monitoring — Airflow schedules tasks but does not natively monitor data quality; APAC teams integrate Great Expectations or Soda Core operators into their DAGs as data quality gates
Context

About Apache Airflow

Apache Airflow is an open-source workflow orchestration platform used by APAC data engineering teams to define, schedule, monitor, and debug complex data pipelines as Python Directed Acyclic Graphs (DAGs). Each node in a DAG is an operator — a discrete task such as running a SQL query, triggering a dbt run, invoking an ML training job, or calling an external API — and the edges define dependencies between tasks, letting Airflow determine the correct execution order and parallelise independent tasks.
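The execution-order behaviour described above — edges determine order, independent tasks run in parallel — can be illustrated with a small topological sort from Python's standard library. This is a simplified sketch of the scheduling idea, not Airflow's actual scheduler; the task names are hypothetical:

```python
from graphlib import TopologicalSorter

# Upstream dependencies for a hypothetical pipeline: each task maps to the
# set of tasks that must finish before it may run (the DAG's edges).
deps = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform": {"extract_orders", "extract_users"},
    "load_warehouse": {"transform"},
}

ts = TopologicalSorter(deps)
ts.prepare()

batches = []                        # each batch can run in parallel
while ts.is_active():
    ready = sorted(ts.get_ready())  # tasks whose upstreams are all done
    batches.append(ready)
    ts.done(*ready)

print(batches)
# → [['extract_orders', 'extract_users'], ['transform'], ['load_warehouse']]
```

The two independent extracts land in the same batch, which is exactly the parallelism a scheduler can exploit; `transform` only becomes ready once both complete.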

Airflow's DAG model — pipelines defined as Python code in `.py` files and committed to Git like application code — enables APAC data engineering teams to apply software engineering practices (version control, code review, automated testing, staged deployment) to data pipeline development, rather than managing pipelines through GUI-based scheduling tools that lack code-level auditability.

Airflow's operator ecosystem — a rich library of operators covering BigQuery, Snowflake, Redshift, Databricks, Spark, dbt, Great Expectations, Soda Core, AWS (S3, Glue, EMR, SageMaker), GCP (Dataflow, Vertex AI), Azure (Data Factory, Synapse), and HTTP APIs — enables APAC data engineering teams to orchestrate data flows across heterogeneous cloud and on-premise systems from a single DAG definition, without building custom integration code for each data source or destination.

Airflow's task dependency and retry model — where tasks define upstream dependencies that Airflow enforces at runtime, combined with configurable retry policies (retry on failure with exponential backoff), SLA monitoring, and task-level timeout enforcement — enables APAC data teams to build resilient pipelines that handle transient failures (network timeouts, temporary API unavailability) automatically without manual intervention.
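The retry timing described above can be sketched as a generic exponential-backoff schedule. This is an illustration of the concept only — Airflow's `retry_exponential_backoff` option behaves in this spirit, but its exact formula is an implementation detail — and the helper name is hypothetical:

```python
from datetime import timedelta


def backoff_schedule(retry_delay: timedelta, retries: int,
                     max_delay: timedelta) -> list[timedelta]:
    """Double the wait after each failed attempt, capped at max_delay.

    Generic exponential-backoff sketch -- not Airflow's exact internal formula.
    """
    return [min(retry_delay * (2 ** attempt), max_delay)
            for attempt in range(retries)]


delays = backoff_schedule(timedelta(minutes=5), retries=4,
                          max_delay=timedelta(minutes=30))
# waits of 5, 10, 20, and 30 minutes before retries 1-4 (40 capped to 30)
```

Spacing retries out this way gives transient failures — the network timeouts and temporary API outages mentioned above — progressively more time to clear before the task is attempted again.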

Airflow's managed deployment options — where Apache Airflow runs as a self-hosted Docker/Kubernetes deployment (a Helm chart is available), or as a fully managed service through Amazon MWAA (Managed Workflows for Apache Airflow), Google Cloud Composer, or Astronomer — enable APAC organisations to choose between operational control (self-hosted) and managed convenience (cloud-managed) based on their platform engineering capacity and cloud strategy.
