Data Platform Orchestration: Apache Airflow vs Databricks Jobs

Written By: Georgios - 25 January 2024

In the age of data-driven business, keeping information in sync and providing the analysis required for critical decisions is a major concern for data teams. For business intelligence to work effectively, analysts rely on the precise scheduling implemented by the data engineering team or provided by the business intelligence platform. In practice, Apache Airflow, the leading open-source orchestration and workflow management platform, often plays a significant role in such infrastructure. But as cloud data platforms based on the Databricks ecosystem and Lakehouse architecture become increasingly popular, is a dedicated orchestration service even still required?

Databricks Jobs is a built-in scheduling service that can be used as an alternative to or in conjunction with Airflow. Today, we give an overview of the two services’ strengths and weaknesses and discuss in which situations one may be favored over the other.

A small introduction to Databricks Jobs

Databricks is a cloud-based data analysis platform that provides all the services needed for modern data management, from storage through highly scalable processing to machine learning application development. Built around the Delta table format for efficient storage of and access to huge amounts of data, Databricks has built a mature ecosystem and has established itself as one of the most refined platform solutions available today.

Databricks Jobs is a component of the Databricks platform that extends its capabilities with a mechanism for scheduling and orchestrating data processing tasks. These tasks can be any of the supported types available to Databricks users, such as notebooks, SQL queries, Python scripts, Apache Spark jobs, and even JAR files. Because it is part of the Databricks platform, users can orchestrate and run data engineering, machine learning and analytics workflows on their Databricks clusters. Databricks Jobs offers a user interface to schedule the frequency of job execution, configure job-related parameters and specify dependencies between tasks. Finally, job execution can be monitored at variable granularity, giving access to job status as well as the actual log messages.
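To give a feel for what such a job definition looks like outside the UI, here is a minimal sketch that creates a scheduled notebook job through the Jobs REST API (version 2.1). The workspace URL, token, notebook path and cluster settings are placeholders you would replace with your own values.

```python
import requests

# Placeholder workspace URL and personal access token - replace with your own values.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# A minimal job definition: one notebook task on a small job cluster,
# triggered every morning at 06:00 UTC via a Quartz cron expression.
job_definition = {
    "name": "daily-ingest-example",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Shared/ingest_example"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "UTC",
    },
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_definition,
)
response.raise_for_status()
print("Created job with id:", response.json()["job_id"])
```

The same definition can of course be created through the Jobs UI instead; the API route mainly becomes interesting once job definitions are kept under version control.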

[Figure: Databricks Jobs - notebook code and the resulting task graph]

A small introduction to Airflow

Airflow was first developed by Airbnb as an internal orchestration platform and was later contributed as a fully open-source project to the Apache Software Foundation. Apache Airflow provides a Python code-based interface to schedule, manage, and scale workflows, plus a user interface to monitor the status of all workflows and tasks. Due to its code-first nature, Airflow is highly customizable and extensible, with the ability for users to add their own operators and hooks. As a system of microservices, Apache Airflow can be scaled to suit any number of workflows and can improve resource efficiency and workflow runtime through parallel execution of tasks. Though often used to orchestrate data processing pipelines, Airflow is generally agnostic about the kinds of tasks it schedules and can be used for practically any purpose where digital workloads need scheduling and orchestration.
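As a minimal illustration of this code-first approach, the following sketch defines a small daily workflow with two dependent tasks. The DAG id, schedule and task contents are made up for illustration and assume a recent Airflow 2.x installation.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform():
    """Placeholder for an arbitrary Python transformation step."""
    print("transforming data ...")


# A daily workflow with two dependent tasks: extract (shell command) -> transform (Python).
with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting data ...'")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract >> transform_task
```

Because the workflow is plain Python, it can be generated dynamically, parameterized and versioned like any other piece of code.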

[Figure: Apache Airflow - DAG code and the resulting graph view]

Similarities

As we have seen, processing tasks can be scheduled and orchestrated both with Apache Airflow and with the Databricks Jobs component. Both systems offer mechanisms for scaling processing power: in Airflow in various ways depending on workflow design, in Databricks by scaling the Apache Spark resources used for a job. Both provide integration with a plethora of popular third-party systems, interfacing with different databases, storage systems or external services used for processing. Finally, both environments offer monitoring and notification capabilities that cover the usual needs of process owners to keep track of process status.

Differences

While Apache Airflow and Databricks serve the same purposes to some degree, understanding the conceptual and technological differences is key to deciding in which situation one would rather use one over the other. First on this list of differences is the fact that Airflow covers a much wider spectrum of tasks and workflow types that it can schedule and orchestrate. If you need to orchestrate tasks that are not natively supported by Databricks, e.g. triggering system operations routines such as backups or clean-ups, Airflow should be your choice.

If your scheduled tasks are actually implemented in Databricks but require orchestration with external pre- or post-conditions, Airflow might again be the better-suited option as the leading system. Combining both systems is actually a viable solution for these kinds of scenarios, enabled by the Databricks provider package for Apache Airflow. A variety of Airflow operators are available for communication with Databricks: existing jobs can be triggered with the DatabricksRunNowOperator class, while one-off runs can be submitted with the DatabricksSubmitRunOperator.
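A rough sketch of such a hybrid setup could look like the following, assuming the Databricks provider package is installed and a Databricks connection is configured in Airflow. The connection id, job id, notebook path and cluster settings below are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksRunNowOperator,
    DatabricksSubmitRunOperator,
)

with DAG(
    dag_id="databricks_from_airflow",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Trigger a job that already exists in the Databricks workspace by its job id.
    run_existing_job = DatabricksRunNowOperator(
        task_id="run_existing_job",
        databricks_conn_id="databricks_default",
        job_id=12345,  # placeholder job id
    )

    # Submit a one-time run whose definition lives entirely on the Airflow side.
    submit_one_time_run = DatabricksSubmitRunOperator(
        task_id="submit_one_time_run",
        databricks_conn_id="databricks_default",
        json={
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "notebook_task": {"notebook_path": "/Shared/post_processing_example"},
        },
    )

    run_existing_job >> submit_one_time_run
```

In this setup, Airflow remains the leading system that handles the external pre- and post-conditions, while Databricks executes the actual data processing.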

 

[Figure: Airflow DAG code alongside the corresponding job run in the Databricks UI]

Databricks has a much wider scope than Airflow when it comes to housing the interactive development of data processing and data analysis, including AI and machine learning scenarios. When these processes are your main focus and you are looking for ways to transition from manual execution to routine operations, Databricks Jobs is the obvious choice to manage your workflows. By this nature, Databricks Jobs are closer to the data they process than what you would achieve with native Apache Airflow components. Implementing and scheduling data pipelines natively in Databricks may be a perfect fit if the bulk of your data is already part of a Databricks-powered Lakehouse environment or if jobs mainly ingest data for it. The number of jobs that can be scheduled in a Databricks workspace is technically limited to roughly 1,000, though, so very high-scale scheduling needs might require external scheduling or a mixed approach.


In scenarios where you have a multi-cloud infrastructure, Airflow can orchestrate all the existing workflows across the various domains thanks to a wide array of connector modules. Central management and monitoring can be a strong argument to use such an existing service for scheduling rather than Databricks’ own.

Finally, Databricks Jobs offers native support for continuously running jobs and, through the Delta Live Tables feature, for real-time processing pipelines. Similar behavior could be achieved with high-frequency triggering through Airflow, but that would introduce more latency and communication overhead into the process.
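For illustration, a continuously running job can be declared by replacing the cron schedule with a continuous trigger in the job definition. The fragment below follows the Jobs API 2.1 format and uses placeholder names.

```python
# Fragment of a Jobs API 2.1 job definition for a continuously running job:
# instead of a cron schedule, a new run starts automatically whenever the previous one ends.
continuous_job_definition = {
    "name": "continuous-ingest-example",
    "tasks": [
        {
            "task_key": "stream_ingest",
            "notebook_task": {"notebook_path": "/Shared/stream_ingest_example"},
        }
    ],
    "continuous": {"pause_status": "UNPAUSED"},
}
# The closest Airflow equivalent would be a DAG scheduled at a very short interval,
# which adds scheduling latency and communication overhead between runs.
```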

Comparison factors

Below, we compare Databricks Jobs and Apache Airflow factor by factor.

Pricing

The cost of Databricks Jobs depends on several factors:

  • the plan you select (Standard, Premium or Enterprise),
  • the cloud provider you choose (AWS, Azure or Google Cloud), and
  • the region of deployment (e.g. US East or West, North or West Europe).

Databricks Jobs are billed based on active compute resource usage during job execution, at roughly $0.15 per Databricks Unit* (DBU) - around one fifth of the rate for normal, interactively used compute resources on Databricks.

Job execution thus comes at a discount compared to the interactive development that usually precedes it, and to the general operating costs of the Databricks platform itself, which you need to run in the first place before its scheduling subsystem is even an option.

* DBU is an abstract, normalized unit of processing power; 1 DBU roughly corresponds to using a single compute node with 4 CPU cores and 16 GB of memory for one hour.
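As a rough back-of-the-envelope calculation based on these approximate figures (actual rates vary by plan, cloud provider and region, and the cloud provider’s VM costs come on top):

```python
# Rough cost estimate for a single job run, using the approximate figures above.
dbu_rate_jobs = 0.15           # ~USD per DBU for Jobs compute (varies by plan, cloud, region)
nodes = 4                      # cluster size used by the job
hours = 2                      # wall-clock runtime of the run
dbus_consumed = nodes * hours  # 1 node-hour of the reference node size ~ 1 DBU

estimated_cost = dbus_consumed * dbu_rate_jobs
print(f"~{dbus_consumed} DBUs -> ~${estimated_cost:.2f} per run (plus cloud infrastructure costs)")
# ~8 DBUs -> ~$1.20 per run (plus cloud infrastructure costs)
```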

The operational cost of Apache Airflow depends on a few factors, the most important being infrastructure: it needs servers or cloud resources to host it, as well as expert knowledge and support from a service provider like NextLytics if these cannot be provided internally.

Supported programming languages

Databricks supports many languages, primarily those compatible with the underlying Apache Spark distributed compute engine, including Python, Scala, R, Java, and SQL.

When you create Databricks Jobs, you can choose the language that best suits your needs for each specific task.

Apache Airflow’s core programming language is Python, which is used for defining and executing workflows. Although Python is Airflow’s main language, each task within a workflow can run scripts or commands in other languages. This flexibility allows users to write code in SQL, Java, Go, Rust and more.

Apache Airflow is language-agnostic at the task level: it can run any command in any language that is compatible with the environment where Airflow runs.
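For example, inside a DAG definition like the one shown earlier, tasks can simply shell out to tools written in other languages. The binary and the SQL script below are made-up placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="polyglot_tasks_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered manually, for illustration
) as dag:
    # A task that executes a non-Python workload: a hypothetical compiled binary.
    run_exporter = BashOperator(
        task_id="run_exporter",
        bash_command="/opt/tools/my_exporter --date {{ ds }}",  # placeholder binary
    )
    # A SQL script executed through a command-line client installed on the worker.
    run_sql_script = BashOperator(
        task_id="run_sql_script",
        bash_command="psql -f /opt/sql/daily_report.sql",  # placeholder script and client
    )

    run_exporter >> run_sql_script
```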

Target Group

Databricks Jobs is designed for programmers working in the data science field and is a perfect fit for developers leveraging Apache Spark. As part of the unified Databricks platform, it suits companies and organizations that deal with big data analytics and machine learning use cases and are looking for a robust workflow orchestration solution.

Apache Airflow caters to data engineers, DevOps professionals, and workflow automation teams. It provides a flexible and extensible platform for orchestrating and scheduling complex workflows across various systems and technologies. Airflow is ideally suited to teams looking for a reliable open-source solution.

 

Performance

Databricks offers high-performance query execution thanks to its optimized query engine and distributed computing capabilities. It is suitable for processing large-scale data and complex analytics workloads.

Apache Airflow performs well at orchestrating and managing workflows: it supports parallel execution of tasks, and its distributed architecture allows it to scale with demand.

Community

Databricks has a strong user community with active forums, knowledge sharing, and resources. The Databricks Community Edition is a free option that allows users to explore the platform and get support from the community, while paid plans offer additional support and services. Databricks’ official documentation is vast and easily accessible, providing technical references alongside practical examples for many use cases.

As a popular, widely adopted open-source project, Apache Airflow boasts a thriving community of users and contributors. With an ever-expanding user base, Airflow provides a dedicated documentation site that offers comprehensive instructions and valuable information for both beginners and experienced users. Additionally, the fast-growing community can be easily reached on platforms such as GitHub, Slack, and Stack Overflow. These channels foster collaboration, provide support, and facilitate knowledge exchange among users.

 

Data Platform Orchestration - Our Conclusion

Apache Airflow and Databricks Jobs both offer robust solutions for orchestrating data workflows, but each tool’s unique strengths lie in different areas. On the one hand, Databricks Jobs is an excellent solution for companies that have already invested in Databricks, since it integrates seamlessly with the platform and provides an easily accessible scheduling mechanism. On the other hand, Apache Airflow’s open-source nature and its extensive library of operators make it an all-round choice for orchestrating diverse tasks across various platforms. Ultimately, the choice between Databricks Jobs and Apache Airflow depends on your specific needs and preconditions - even a combination of both may provide the best possible solution for a particular case.

If you are not sure which option is right for you, we’re happy to discuss and help you find the perfect fit. Simply get in touch with us - we look forward to exchanging ideas with you!


Topics: Machine Learning, Apache Airflow
