Geoff Ruddock

The best way to manage dependencies between DAGs in Airflow

Airflow provides a few different sensors and operators which enable you to coordinate scheduling between different DAGs, including:

- ExternalTaskSensor
- TriggerDagRunOperator
- SubDagOperator

Which one is the best to use?

I have previously written about how to use ExternalTaskSensor in Airflow but have since realized that this is not always the best tool for the job. Depending on your specific decision criteria, one of the other approaches may be more suitable to your problem.

Use cases

I need the ability to sometimes run dag_B independently of dag_A, but I want to share state (run history) between them.

Using SubDagOperator creates a tidy parent–child relationship between your DAGs. The sub-DAGs will not appear in the top-level UI of Airflow, but rather nested within the parent DAG, accessible via a Zoom into Sub DAG button. This is a nice feature if those DAGs are always run together. However, if you sometimes need to run the sub-DAG on its own, you will need to initialize it as its own top-level DAG, and that top-level DAG will not share state (run history) with the nested sub-DAG.

In this scenario, you are better off using either ExternalTaskSensor or TriggerDagRunOperator.
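
As a rough sketch of the sensor approach (assuming Airflow 2.x-style imports, and a hypothetical `final_task` as the last task of `dag_A`), `dag_B` can wait on `dag_A` like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="dag_B",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Wait for dag_A's final task to succeed in the same schedule interval.
    wait_for_dag_A = ExternalTaskSensor(
        task_id="wait_for_dag_A",
        external_dag_id="dag_A",
        external_task_id="final_task",  # hypothetical last task of dag_A
        timeout=60 * 60,                # give up after waiting one hour
        mode="reschedule",              # release the worker slot between pokes
    )

    do_work = BashOperator(task_id="do_work", bash_command="echo running dag_B")

    wait_for_dag_A >> do_work
```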

My local development or test environment uses SQLite rather than a Postgres DB.

SQLite does not support concurrent write operations, which forces Airflow to use the SequentialExecutor, meaning only one task can run at any given time. An ExternalTaskSensor will occupy that single worker slot while “waiting” for the upstream task, so your Airflow instance will deadlock.

In this case, it is preferable to use SubDagOperator, since the whole pipeline can run under a single worker. Astronomer.io has some good documentation on how to use sub-DAGs in Airflow.
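
As a minimal sketch (assuming Airflow 2.x-style imports, with made-up DAG and task names), nesting `dag_B` inside `dag_A` as a sub-DAG looks roughly like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.subdag import SubDagOperator

default_args = {"start_date": datetime(2021, 1, 1)}


def make_subdag(parent_dag_id, child_dag_id, args):
    """Factory for the child DAG; its dag_id must be '<parent>.<child>'."""
    with DAG(
        dag_id=f"{parent_dag_id}.{child_dag_id}",
        default_args=args,
        schedule_interval="@daily",
    ) as subdag:
        BashOperator(task_id="child_task", bash_command="echo running dag_B")
    return subdag


with DAG(
    dag_id="dag_A",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=False,
) as dag:

    upstream = BashOperator(task_id="upstream_task", bash_command="echo dag_A work")

    # Appears as a single task in dag_A; its tasks are visible via "Zoom into Sub DAG".
    run_dag_B = SubDagOperator(
        task_id="dag_B",
        subdag=make_subdag("dag_A", "dag_B", default_args),
    )

    upstream >> run_dag_B
```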

I want dag_B to run only sometimes, depending on some conditional logic.

If you want to include conditional logic, you can pass a Python function to TriggerDagRunOperator which determines whether the downstream DAG is actually triggered (if at all).
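
For example, a rough sketch using the Airflow 1.10-style API (where TriggerDagRunOperator accepted a `python_callable`; that parameter was removed in Airflow 2.x, where you would gate the trigger with an upstream ShortCircuitOperator instead) might look like this, with the Monday-only condition as an arbitrary stand-in:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator


def decide_whether_to_trigger(context, dag_run_obj):
    """Return the dag_run_obj to trigger dag_B, or None to skip the trigger."""
    if context["execution_date"].weekday() == 0:  # arbitrary example: Mondays only
        dag_run_obj.payload = {"triggered_by": "dag_A"}
        return dag_run_obj
    return None  # returning None means dag_B is not triggered


with DAG(
    dag_id="dag_A",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    maybe_trigger_dag_B = TriggerDagRunOperator(
        task_id="maybe_trigger_dag_B",
        trigger_dag_id="dag_B",
        python_callable=decide_whether_to_trigger,
    )
```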

