The number of open-source Big Data tools has exploded in just a few years. A wide range of tools and platforms for storing, processing, and visualizing data has emerged, changing the data engineering landscape.
Apache Airflow is one platform that has quickly become the de facto standard for data workflow management. Since Airbnb first presented it in 2015, Airflow has gained a lot of popularity thanks to its stability and versatility, helped by the boom in Python usage across the industry.
The primary advantage of Airflow is that it uses code to design its workflows. Its users have complete control over the code that runs at each stage of the pipeline. Because Airflow does not put any constraints on how your workflows should perform, the options are unlimited when working with it.
Airflow has also established itself as a heavyweight in the data engineering world thanks to its robust, well-equipped user interface, which makes it simple to keep track of jobs, rerun them, and configure the platform. In this post, I provide an introduction to why Apache Airflow should be your next tool.
The History of Apache Airflow
Airbnb ran into an issue in 2015. The company was growing rapidly, and so was the volume of its data. To realize their goal of becoming a fully data-driven company, they needed to expand their workforce of data engineers, data scientists, and analysts, all of whom had to automate operations on a regular basis by developing batch tasks.
Maxime Beauchemin built and open-sourced Airflow to meet the demand for a sophisticated scheduling tool, with the goal of allowing these teams to swiftly author, iterate on, and monitor their batch data pipelines.
Airflow has come a long way since Maxime’s initial commit. In April of 2016, the project entered the Apache Software Foundation’s Incubator, where it stayed and grew until January 8, 2019, when it graduated as a top-level project. Airflow 2.0 was released on December 17th, 2020, with big enhancements and powerful new features.
Thousands of Data Engineering teams use Airflow around the world, and it continues to increase in popularity as the community grows.
How is Airflow Used?
Apache Airflow is used to schedule and orchestrate data pipelines, also called workflows. Data pipeline orchestration refers to the sequencing, coordination, scheduling, and management of complex data pipelines that draw on several sources.
These data pipelines produce data sets that may be consumed by business intelligence and data science applications, as well as machine learning models that support big data applications.
In Airflow, these workflows are represented as Directed Acyclic Graphs (DAGs). To further grasp what a workflow/DAG is, consider baking a cake.
Workflows typically have an end objective, such as producing charts of the previous day’s sales figures. A DAG shows how each step depends on other steps that must be completed first. To bake the cake, for example, you need flour, milk, vanilla, sugar, and butter on hand before you can mix the batter.
Similarly, you’ll need to move your data from relational databases to a data warehouse before you can build your visualization of yesterday’s sales.
The comparison above also demonstrates that some tasks, such as mixing the wet ingredients and mixing the dry ingredients, can be done simultaneously because they do not depend on each other.
Likewise, you may need to load data from many sources in order to generate your reports. Here’s an example of a DAG that generates visualizations from the previous day’s sales.
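As a rough sketch of the sales example, the dependencies can be written down and grouped into stages with Python’s standard graphlib module (Python 3.9+). This is plain Python rather than Airflow itself, and the task names are hypothetical; the point is only to show which steps could run side by side.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "load_orders": set(),
    "load_customers": set(),
    "join_sales": {"load_orders", "load_customers"},
    "build_charts": {"join_sales"},
}


def parallel_stages(deps):
    """Group tasks into stages; tasks within one stage do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose upstream work is done
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = parallel_stages(dependencies)
for number, stage in enumerate(stages, start=1):
    print(f"stage {number}: {', '.join(stage)}")
```

The two load tasks land in the same stage because neither depends on the other, just like mixing the wet and dry ingredients.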
Data scientists may construct better tuned and more accurate machine learning models using efficient, cost-effective, and well-orchestrated data pipelines, because those models have been trained using whole data sets rather than small samples.
Because Airflow integrates readily with big data systems like Hive, Presto, and Spark, it’s a great framework for orchestrating operations that execute on any of these engines. Businesses are increasingly using Airflow to orchestrate ETL/ELT processes.
The Benefits of Apache Airflow
Airflow’s main advantages are listed below:
- Dynamic: Airflow pipelines are configured as code (Python), so users can write code that generates pipelines dynamically.
- Elegant: Airflow pipelines are simple and explicit. Parameterizing your scripts is built into Airflow through the Jinja templating engine.
- Extensible: Airflow allows you to define your own operators and executors and to extend the library so that it fits your environment.
- Scalable: Airflow has a modular design and uses a message queue to communicate with and orchestrate an arbitrary number of workers.
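To illustrate the “Dynamic” point, here is a minimal sketch in plain Python (not the Airflow API) in which a loop over a list of sources generates the tasks and their dependencies. The source names and the `build_pipeline` helper are hypothetical; in an Airflow DAG file, a similar loop would create operator objects instead of dictionary entries.

```python
# Hypothetical source systems; in practice this list might come from a config file.
SOURCES = ["orders", "customers", "refunds"]


def build_pipeline(sources):
    """Return a mapping of task name -> set of upstream task names."""
    pipeline = {"publish_report": set()}
    for source in sources:
        extract = f"extract_{source}"
        load = f"load_{source}"
        pipeline[extract] = set()              # extract tasks have no upstream work
        pipeline[load] = {extract}             # each load runs after its extract
        pipeline["publish_report"].add(load)   # the report waits for every load
    return pipeline


pipeline = build_pipeline(SOURCES)
for task, upstream in pipeline.items():
    print(f"{task} <- {sorted(upstream)}")
```

Adding a fourth source to the list is enough to add two more tasks to the pipeline; no task has to be declared by hand.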
Another advantage of maintaining pipelines as code is that it allows versioning and accountability for changes. Airflow supports roll-forward and roll-back better than many alternatives, and it provides more detail about changes over time. Even if not everyone uses it this way, Airflow can adapt with you as your needs change.
Another benefit is that Airflow is open source and has a large developer community that supports and contributes to the project. You can tweak and add your own code if you need a new operator or functionality, making it highly extendable.
Apache Airflow Use Cases
Because of Airflow’s adaptability, you can use it to orchestrate any kind of workflow. Airflow can handle ad hoc workloads that aren’t bound to any schedule or interval. However, it works best for pipelines:
- That change slowly
- Related to a specific time interval
- Scheduled on time
By “slowly changing,” we mean that once launched, the pipeline’s structure is expected to change only occasionally (over days or weeks rather than hours or minutes), because Airflow itself does not keep versions of a pipeline. “Related to a specific time interval” refers to Airflow’s suitability for processing data in intervals, such as one day of sales at a time. Airflow also performs best when pipelines are scheduled to run at specified times, although they can also be started manually or by external triggers (for example, via the REST API).
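The idea of a pipeline that is “related to a specific time interval” can be sketched in plain Python. The `daily_runs` helper below is hypothetical and only mimics the behavior of a daily schedule in Airflow 2, where a scheduled run that covers a given data interval starts once that interval has ended.

```python
from datetime import datetime, timedelta


def daily_runs(first_day, last_day):
    """List (interval_start, interval_end, earliest_run_time) for a daily schedule."""
    runs = []
    start = first_day
    while start <= last_day:
        end = start + timedelta(days=1)
        # The run that processes [start, end) can only begin once the interval is over.
        runs.append((start, end, end))
        start = end
    return runs


runs = daily_runs(datetime(2024, 1, 1), datetime(2024, 1, 3))
for interval_start, interval_end, run_after in runs:
    print(
        f"data {interval_start:%Y-%m-%d} .. {interval_end:%Y-%m-%d} "
        f"-> run no earlier than {run_after:%Y-%m-%d %H:%M}"
    )
```

This is why a report on “yesterday’s sales” is a natural fit: each run handles exactly one closed day of data.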
Apache Airflow can be used to schedule:
- ETL pipelines that extract data from multiple sources and run Spark jobs or any other data transformations
- Training machine learning models
- Report generation
- Backups and similar DevOps operations
Apache Airflow Important Concepts
A data pipeline specified in Python code is known as a Directed Acyclic Graph, or DAG. Each DAG represents a group of tasks you want to run and is laid out in Airflow’s UI to show the relationships between those tasks. Breaking down the properties of a DAG shows why it is so useful:
- Directed: If there are many tasks/jobs with dependencies, each one must have at least one upstream or downstream task declared.
- Acyclic: Tasks are not permitted to depend on themselves, either directly or through a chain of other tasks. This avoids infinite loops.
- Graph: All tasks are planned out in a logical order, with operations taking place at defined points and with predetermined links to other tasks.
For example, the figure below depicts an acceptable DAG on the left with a few simple dependencies, as opposed to an invalid, non-acyclic DAG on the right.
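The difference between a valid and an invalid graph can also be checked in code. The following minimal sketch uses Python’s standard graphlib module (Python 3.9+), not Airflow’s own validation, and the single-letter task names are placeholders.

```python
from graphlib import CycleError, TopologicalSorter


def is_valid_dag(deps):
    """Return True if the task -> upstream-tasks mapping contains no cycle."""
    try:
        # static_order() raises CycleError while iterating if a cycle exists.
        tuple(TopologicalSorter(deps).static_order())
        return True
    except CycleError:
        return False


# "d" waits for "b" and "c", which both wait for "a": no cycle.
valid = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}

# "a" waits for "c", "c" waits for "b", "b" waits for "a": a loop with no starting point.
invalid = {"a": {"c"}, "b": {"a"}, "c": {"b"}}

print("valid graph accepted:", is_valid_dag(valid))
print("cyclic graph accepted:", is_valid_dag(invalid))
```

In the cyclic case there is no task that could run first, which is exactly why a scheduler has to reject it.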
Each node in a DAG represents a task. Tasks are visual representations of the work being done at each phase of the workflow, with operators defining the actual work that they represent.
Operators are the building blocks of Airflow, and they determine what work actually gets done. They’re like a wrapper around a single task, or node in a DAG, that describes how that task will be executed. Operators define the work that must be done at each step of the process, whereas DAGs ensure that operators are scheduled and run in a specific order.
Operators are divided into three categories:
- Action Operators execute a function, like the PythonOperator or BashOperator
- Transfer Operators move data from a source to a destination, like the S3ToRedshiftOperator
- Sensor Operators wait for something to happen, like the ExternalTaskSensor
Operators are defined independently of one another, although they can use XComs (short for “cross-communications”) to pass data to other operators.
The integrated system of DAGs, operators, and tasks looks like this at a high level:
Hooks are Airflow’s way of interacting with third-party systems. They make it possible to connect to external APIs and databases such as Hive, S3, GCS, MySQL, Postgres, and others, and they serve as building blocks for operators. Secure information, such as authentication credentials, is kept out of hooks and is instead stored via Airflow connections in the encrypted metadata database that backs your Airflow instance.
Providers are community-maintained packages that contain all of the core Operators and Hooks for a given service (e.g. Amazon, Google, Salesforce, etc.). As of Airflow 2.0, they are delivered as separate but related packages that can be installed directly into an Airflow environment.
Airflow plugins are a collection of Hooks and Operators that can be used to do a specific activity, such as moving data from Salesforce to Redshift. If you want to see if a plugin you need has already been built by the community, go to our open-source library of Airflow plugins.
Connections are where Airflow stores information that allows you to connect to external systems, such as authentication credentials or API tokens. This is managed directly from the UI and the actual information is encrypted and stored as metadata in Airflow’s underlying Postgres or MySQL database.
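Besides the UI, Airflow can also read a connection from an environment variable named `AIRFLOW_CONN_<CONN_ID>` that holds a URI. The sketch below uses only Python’s standard library to show which connection fields such a URI carries; the URI, its credentials, and the `describe_connection` helper are hypothetical, and this is not how Airflow parses connections internally.

```python
from urllib.parse import unquote, urlsplit

# Hypothetical connection URI, as it might be stored in an environment
# variable such as AIRFLOW_CONN_SALES_DB. "%40" is a URL-encoded "@".
CONNECTION_URI = "postgres://report_user:s3cr%40t@db.example.com:5432/sales"


def describe_connection(uri):
    """Split a connection URI into the fields a connection stores."""
    parts = urlsplit(uri)
    return {
        "conn_type": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "login": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),
        "schema": parts.path.lstrip("/"),
    }


connection = describe_connection(CONNECTION_URI)
for field, value in connection.items():
    shown = "***" if field == "password" else value  # never print secrets
    print(f"{field}: {shown}")
```

A hook only needs the connection ID; the host, login, and password stay in the connection rather than in pipeline code.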
To summarize, Airflow is a robust framework for automating workflow management and scheduling. Airflow can be used in a variety of situations, including:
- Data warehousing: cleaning, organizing, and storing data in a data warehouse.
- Machine learning: automating and monitoring multiple machine learning workflows.
- Email reporting: automating the process of creating and sending reports by email, and so on.
Whatever you do in the data space, Airflow is worth a look. It is a front-runner in this space, and large companies such as Bloomberg, Alibaba, and Lyft use it to its full potential.
Thanks to all the features mentioned and the ease with which you can configure an Airflow server, it’s no surprise that Airflow quickly managed to establish itself as the go-to workflow management platform of this generation.
Interested in more? Check out our product, Vantage Point. Vantage Point (VP) is a no-code, click & go business acceleration tool which enables data driven decisions across your business. It drives interactivity across all parts of your organization by communicating value (KPIs), autogenerating tasks with cutting-edge ML/AI technology and enabling users to combine VP’s ML/AI recommendations with their own analysis. You can finally track the exact ROI impact throughout your entire business with Vantage Point.
Senior Technology Architect