What orchestrator to use for your ETL jobs?

In today’s data-driven world, organizations are facing the ever-increasing complexity of data pipelines. The need to handle diverse data formats, sizes, sources and processing requirements can result in the need for different types of transformation blocks, ranging from SQL to PySpark to low/no-code solutions. When complex table dependencies also enter the picture, the requirements quickly become multidimensional.

Orchestrators can help you in these situations. An orchestrator is a tool that manages and coordinates the execution of various data processing tasks. It typically runs a flow of steps that are executed one after the other or in parallel. An orchestrator also allows you to add a schedule for when the flow needs to be executed. You can schedule this however you want, but typical schedules range from once an hour to once a day.

Choosing the right orchestrator to effectively manage these complex pipelines has become a daunting task for organizations, especially given the wide range of orchestrator offerings. In this blog post, we dive into a detailed comparison of three popular orchestrators: Apache Airflow, Databricks Workflows, and Azure/Fabric Data Factory.

Each of these orchestrators offers a distinct set of features and capabilities that cover different requirements and use cases. In the following sections, we discuss how each orchestrator performs on five criteria: Ease of Use, Connectors, Flexibility, Monitoring and Alerting, and Git Support.

Ease of use

Data Factory and Workflows are the orchestrators that are easiest to use and to get started with. Data Factory allows you to drag and drop multiple activities into your pipeline. Adding copy activities to extract data from many sources, transforming the data using dataflows, or executing Databricks notebooks is really easy, and most activities only require some basic input parameters. Creating a Workflow in Databricks is also really easy: you just select which notebooks you want to execute, in which order, and how frequently the pipeline should run.
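For teams that prefer code over clicking, the same Workflow definition can also be created through the Databricks Jobs API. The sketch below is only an illustration of that idea; the workspace URL, token, cluster ID, notebook path and cron expression are placeholders you would replace with your own values:

```python
import requests

# Placeholders: your workspace URL, a personal access token,
# an existing cluster ID and the notebook you want to schedule.
host = "https://<workspace>.azuredatabricks.net"
token = "<personal-access-token>"

job_definition = {
    "name": "daily_etl",
    "tasks": [
        {
            "task_key": "transform",
            "notebook_task": {"notebook_path": "/Repos/etl/transform"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00
        "timezone_id": "UTC",
    },
}

response = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_definition,
)
response.raise_for_status()
print(response.json())  # contains the job_id of the new Workflow
```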

For a large part of the audience, Airflow will probably be the most difficult one to start with. It doesn’t provide a clean UI where you can select the steps you want to execute. Instead, it requires you to create your pipeline (DAG) through Python code, which can be a bigger step for less technical teams. In addition, it can be difficult to set up and maintain the Airflow environment itself. For this reason, some teams decide to use managed Airflow services, like ADF Managed Airflow or GCP Cloud Composer, to facilitate this part.
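To give an idea of what that looks like in practice, below is a minimal sketch of an Airflow DAG (Airflow 2.x is assumed; the DAG name, tasks and commands are purely illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal daily pipeline with two sequential steps (names are illustrative).
with DAG(
    dag_id="daily_etl",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",  # run once a day
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract data'")
    transform = BashOperator(task_id="transform", bash_command="echo 'transform data'")

    extract >> transform  # transform only runs after extract succeeds
```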

Tie: Data Factory & Workflows

Connectors

In complex architectures, there are often many source systems, as well as multiple tools that are used for the transformations. In those scenarios, you want your orchestrator to include connectors that make it easy to interact with these services.

Airflow offers a wide range of connectors that allow seamless integration with various systems and services. These connectors enable task execution on, and data exchange with, external systems. Some popular connectors include Databricks, Azure Data Factory, email, Kubernetes, databases (Azure SQL DB, PostgreSQL, …) and cloud storage services. Most of these connectors are pretty easy to set up, although for some the configuration can be a bit more complicated.
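As an illustration, triggering an existing Databricks job from an Airflow DAG could look roughly like the sketch below (it assumes the apache-airflow-providers-databricks package is installed; the connection ID and job ID are placeholders):

```python
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

# Trigger an existing Databricks job from within an Airflow DAG.
run_databricks_job = DatabricksRunNowOperator(
    task_id="run_databricks_job",
    databricks_conn_id="databricks_default",  # Airflow connection holding workspace URL and token
    job_id=12345,                             # hypothetical Databricks job ID
)
```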

Currently, Databricks Workflows itself doesn’t include many connectors. Most connections to data systems are made inside the notebooks triggered by the Workflows, especially since Spark offers a large range of connectors (JDBC, Kafka, Pub/Sub, …). In addition, Databricks partners with different companies to facilitate connections: Fivetran for data ingestion and Power BI for visualizations are examples of this. The recent acquisition of Arcion will also result in extra connectors becoming available, with first examples probably to be expected in 2024.
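For example, a notebook triggered by a Workflow could pull in a source table over JDBC with a few lines of PySpark (the connection URL, credentials and table names below are placeholders):

```python
# Inside a Databricks notebook, `spark` is already available as the SparkSession.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<database>")  # placeholder connection URL
    .option("dbtable", "public.customers")                      # placeholder source table
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

df.write.mode("overwrite").saveAsTable("bronze.customers")  # placeholder target table
```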

Azure Data Factory also provides many connectors, known as Linked Services (or Connections in Fabric). Setting up connectors in Data Factory is user-friendly, due to the intuitive UI and comprehensive documentation. Users can easily configure the necessary connection details and leverage the built-in connectors for popular data platforms, simplifying data movement and transformation within pipelines.

Tie: Data Factory & Airflow

Flexibility

When it comes to flexibility, Airflow is the most flexible orchestrator. Thanks to its many connectors, as well as the possibility to write the DAG logic in Python, Airflow offers a lot of freedom. In particular, its sensors and data-aware scheduling make it easy to handle dependencies between pipelines, something that is less straightforward with the other orchestrators. It should be noted that while it is possible to create advanced DAGs, it is not always desirable: the first choice should always be to keep them as simple as possible.
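As a sketch of that data-aware scheduling (Airflow 2.4 or later is assumed; the dataset URI, DAG names and commands are illustrative), a consumer DAG can run whenever a producer DAG updates a dataset, instead of on a fixed schedule:

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.bash import BashOperator

# Hypothetical dataset that links a producer DAG to a consumer DAG.
customers_table = Dataset("warehouse://bronze/customers")

# Producer: marks the dataset as updated, which triggers downstream DAGs.
with DAG(dag_id="load_customers", start_date=datetime(2023, 1, 1), schedule="@hourly", catchup=False):
    BashOperator(task_id="load", bash_command="echo 'load customers'", outlets=[customers_table])

# Consumer: runs whenever the dataset above has been updated.
with DAG(dag_id="build_customer_report", start_date=datetime(2023, 1, 1), schedule=[customers_table], catchup=False):
    BashOperator(task_id="report", bash_command="echo 'build report'")
```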

Data Factory also offers quite some flexibility due to its connectors. However, while it offers activities that allow you to implement some logic in your pipelines (e.g. the If Condition and ForEach activities), these quickly run into limitations once the logic becomes more complicated. There is also a limit of 40 activities per pipeline. For more complex pipelines this can become an issue, but it can easily be fixed by making the pipelines more modular.

Workflows offers more than enough flexibility as long as the goal is to execute Databricks notebooks. Once you start interacting with other services, you might face some limitations for which you’ll need to find a workaround. However, as you are triggering Databricks notebooks or scripts, you typically want to add the complex logic in there anyway. A nice additional feature of Workflows is the support for streaming jobs, something that is less well supported in tools like Airflow and Data Factory. Thanks to the many new features that are released every few months, this part will probably improve a lot in the coming months and years.
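As a small illustration of such a streaming job, a notebook run by a Workflow could contain a Structured Streaming query along these lines (the Kafka broker, topic, checkpoint path and target table are placeholders):

```python
# Inside a Databricks notebook; `spark` is the active SparkSession.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<broker>:9092")  # placeholder broker
    .option("subscribe", "orders")                        # placeholder topic
    .load()
)

(
    stream.selectExpr("CAST(value AS STRING) AS payload")
    .writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/orders")  # placeholder checkpoint path
    .toTable("bronze.orders")                                  # placeholder target table
)
```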

Winner: Airflow

Monitoring and Alerting

Workflows offers basic alerting features. You can enable notifications that send a message via email, Slack or another tool when a job starts, fails or succeeds. For monitoring, some dashboards are available within Databricks itself. These show the typical things, such as whether a job succeeded and how long it took. One very nice feature of Workflows is how easy it makes spotting anomalies in run duration. As Databricks focuses a lot on providing an API for everything, it is also possible to extract job details through the API and build dashboards yourself.
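As a sketch of what that could look like, the Jobs API exposes the run history of a job, which you can pull into your own monitoring. The workspace URL, token and job ID below are placeholders:

```python
import requests

host = "https://<workspace>.azuredatabricks.net"  # placeholder workspace URL
token = "<personal-access-token>"                 # placeholder token

# List recent runs of a job and print their result state and duration.
response = requests.get(
    f"{host}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"job_id": 12345, "limit": 25},  # hypothetical job ID
)
response.raise_for_status()

for run in response.json().get("runs", []):
    if not run.get("end_time"):  # skip runs that are still in progress
        continue
    duration_min = (run["end_time"] - run["start_time"]) / 60000
    print(run["run_id"], run["state"].get("result_state"), f"{duration_min:.1f} min")
```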

In Data Factory, you can create alert rules based on a large list of criteria and metrics. The pipeline monitoring overview can sometimes make it difficult to find a specific pipeline run when many pipelines run each night; luckily, adding good tags makes this easier. Just like with Databricks Workflows, it is possible to send your logs to Azure Monitor for different views.

Airflow also gives a nice overview of each DAG run, immediately showing whether a run failed and which tasks caused the problem. By default, there is already some pretty good logging, and on top of that you can add a lot of logging yourself. By adding it to your code, you can follow up on everything that happened during a DAG run. Just like the other orchestrators discussed, Airflow can send emails and Slack messages when there are issues with a DAG.
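For example, adding your own log lines from inside a task is as simple as using Python’s standard logging module; the DAG and task below are hypothetical and only there to show where the logging ends up:

```python
import logging
from datetime import datetime

from airflow.decorators import dag, task

logger = logging.getLogger(__name__)

@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def logging_example():
    @task
    def transform_orders():
        # Anything logged here shows up in the task's log in the Airflow UI.
        logger.info("Starting transformation of the orders table")
        row_count = 42  # placeholder for the real transformation result
        logger.info("Finished transformation, wrote %s rows", row_count)

    transform_orders()

logging_example()
```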

Winner: Airflow

Git Support

For Airflow this part can be really short. As all DAGs are built using Python code, this code can be easily kept under version control.

Data Factory pipelines are created with a drag-and-drop UI, but under the hood everything is stored as JSON files. Within the UI, users can easily enable version control and link to a Git repository. The UI also allows you to create feature branches and merge them into a main branch. Merge conflicts can sometimes be annoying, as the user needs to open a tool like VSCode to fix them by editing the underlying JSON files.

Within Workflows, it has for a long time not been possible to keep manually created Workflows under version control. Previously, the only options were DBX, a tool that (among other features) allows users to define workflows in YAML files, and Terraform. While these are nice alternatives, they were not always straightforward to use, especially DBX. However, at the most recent Databricks Data & AI Summit, Databricks Asset Bundles (DAB) were announced, which allow users to easily define Workflow configurations in a YAML file and deploy them to multiple environments. In 2023 Q4, developing Workflows using Python will be in private preview, so we can probably expect this to be generally available in 2024. These highly requested features result in full Git support.

Tie: Airflow & Workflows

Conclusion

Based on the criteria discussed, an overview of the comparison can be found below.

| Criterion | Azure/Fabric Data Factory | Databricks Workflows | Airflow |
| --- | --- | --- | --- |
| Ease of Use | ●●●●● | ●●●●● | ●●●○○ |
| Connectors | ●●●●● | ●●●○○ | ●●●●● |
| Flexibility | ●●●●○ | ●●●○○ | ●●●●● |
| Monitoring/Alerting | ●●●●○ | ●●●●○ | ●●●●● |
| Git Support | ●●●●○ | ●●●●● | ●●●●● |

While Airflow scores really well on most criteria, it should be kept in mind that configuring and maintaining the platform itself is a fair bit harder compared to Data Factory and Workflows.

The fact that Data Factory and Airflow have both been around since 2015 should also be considered. Workflows was introduced later, with most development efforts happening only in the past few years. Judging by the features Databricks adds every few months, this orchestrator can be expected to keep expanding in the coming years. Especially for triggering Databricks notebooks, it is more than adequate for scheduling your jobs.

All of the discussed criteria should play a role in the decision. Next to these criteria, the technical capabilities of the team, the use cases, and future, more complex additions should also be taken into account.

A final recommendation: make an informed decision on which orchestrator to use, and then try to stick with that one orchestrator. If all pipelines are scheduled by a single orchestrator, you will avoid many headaches compared to combining multiple ones, especially as the number of dependencies increases.

Jasper Puts

Consultant @ Aivix