Apache Airflow is a powerful open-source platform for orchestrating complex workflows and data pipelines. A must-have tool for the modern data engineer, it lets you programmatically author, schedule, and monitor workflows. This guide covers everything you need to set up Airflow and to create and manage your very first DAG.
What is Apache Airflow?
Apache Airflow is an open-source framework designed to orchestrate complex workflows and manage data pipelines. It allows users to programmatically define and schedule workflows, making it a powerful tool in the hands of a data engineer or analyst. At its core, it uses Directed Acyclic Graphs (DAGs) to describe workflows: each node represents a task, while the edges convey the relationships between tasks.
What is a Directed Acyclic Graph (DAG) in Apache Airflow?
A Directed Acyclic Graph (DAG) is a core concept in Apache Airflow, serving as the foundation for coordinating work. It is a collection of tasks organized to show their dependencies and relationships, which makes execution and monitoring straightforward.
As with any graph, a DAG has the following properties:
Directed: The edges of the graph have a direction, which indicates the order in which tasks must execute. For example, if Task A needs to finish before Task B can start, this relationship is represented as an edge directed from A to B.
Acyclic: The graph contains no cycles, so no path ever leads back to a task that has already run. This guarantees workflows can be executed without falling into an endless loop.
Tasks: Each node in the DAG represents a task, such as executing a script, querying a database, or sending an email.
Dependencies: The relationships between tasks are expressed as dependencies, which ensure that two or more tasks run in a particular order. If Task A must run before Task B starts, that dependency is captured directly in the DAG, as the code sketch below illustrates.
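To make the directed, acyclic idea concrete, here is a minimal sketch (the DAG id and task names are illustrative only, not part of the tutorial below) showing a single directed edge between two placeholder tasks:
python
# Minimal sketch: task_a must finish before task_b starts (a directed edge A -> B).
# Because no edge ever points back to task_a, the graph stays acyclic.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime

with DAG(dag_id='dag_properties_sketch', start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    task_a = DummyOperator(task_id='task_a')
    task_b = DummyOperator(task_id='task_b')
    task_a >> task_b  # directed dependency: A runs before B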
Key Features of Apache Airflow:
Dynamic workflow management: Workflows are created programmatically in Python, which makes it possible to build tasks dynamically based on external conditions (illustrated in the sketch after this list).
Extensibility: Users can write their own operators and plugins to integrate Airflow with other systems.
Robust scheduling: Built-in scheduling features let users define exactly when workflows run.
Friendly user interface: The status of tasks, logs, and the execution history of every workflow can be accessed through a web-based interface.
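As a small illustration of the dynamic-workflow point above, the sketch below builds one task per entry in an ordinary Python list; the DAG id and table names are hypothetical, chosen only for the example:
python
# Sketch: the workflow's shape is produced by regular Python code at parse time,
# so adding an item to the list adds a task to the DAG.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

with DAG(dag_id='dynamic_example', start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    previous = None
    for table in ['users', 'orders', 'payments']:  # hypothetical table names
        task = BashOperator(
            task_id=f'process_{table}',
            bash_command=f'echo processing {table}',
        )
        if previous is not None:
            previous >> task  # chain the tasks in the order they were created
        previous = task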
Configuring Apache Airflow
To run Apache Airflow, follow these steps:
Step 1: Installation
You can install Airflow using pip:
bash
pip install apache-airflow
For a more isolated environment, consider using Docker.
Step 2: Initialize the Database
After installation, initialize the metadata database:
bash
airflow db init
Step 3: Start the Web Server and Scheduler
Run the following commands in separate terminal windows:
bash
airflow webserver --port 8080
airflow scheduler
The web server presents a user interface at http://localhost:8080/.
Creating Your First DAG
Creating a DAG in Airflow means defining tasks and their dependencies. Here’s how to set up your first DAG:
Step 1: Import Required Modules
Import the required classes from the Airflow library
python
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
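Note that dummy_operator and bash_operator are the older module paths; on Airflow 2.x they typically still work but are deprecated. On recent releases the equivalent imports look like this (DummyOperator was renamed EmptyOperator in Airflow 2.4):
python
# Equivalent imports on newer Airflow 2.x releases (the rest of this tutorial keeps the older names).
from airflow import DAG
from airflow.operators.empty import EmptyOperator  # successor to DummyOperator
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta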
Step 2: Default Arguments
Create a dictionary to hold your DAG’s default parameters
python
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2024, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
Step 3: DAG Creation
Define your DAG object with a unique identifier
python
dag = DAG(
    dag_id='my_first_dag',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False,
)
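The schedule_interval accepts presets such as @daily as well as standard cron expressions. As an alternative sketch (not required for this tutorial), the same DAG can be written in the context-manager style, which attaches tasks automatically without passing dag=dag:
python
# Alternative sketch: '0 0 * * *' is the cron equivalent of '@daily'.
with DAG(
    dag_id='my_first_dag',
    default_args=default_args,
    schedule_interval='0 0 * * *',
    catchup=False,
) as dag:
    pass  # tasks defined inside this block are attached to the DAG automatically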
Step 4: Tasks
Define the tasks within your DAG. For example:
python
start_task = DummyOperator(task_id='start', dag=dag)

create_database = BashOperator(
    task_id='create_database',
    bash_command='hive -f /path/to/create_database.hql',
    dag=dag,
)

end_task = DummyOperator(task_id='end', dag=dag)
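The bash_command above assumes a Hive client is installed and that the .hql script exists at that path. If you only want to watch the DAG run end to end, you can swap in any shell command; the stand-in below is purely hypothetical:
python
# Hypothetical stand-in for environments without Hive: runs a plain shell command instead.
create_database = BashOperator(
    task_id='create_database',
    bash_command='echo creating database',
    dag=dag,
)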
Step 5: Define Task Dependencies
You use the bit-shift operator >> to define dependencies between tasks:
python
start_task >> create_database >> end_task
This configuration means that create_database runs after start_task completes, and end_task runs after create_database finishes.
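The same chain can also be written with the explicit set_downstream/set_upstream methods, and lists let tasks fan out or fan in; a brief sketch of the alternatives (use one style, not both, for the same dependency):
python
# Equivalent to start_task >> create_database >> end_task, written with explicit methods.
start_task.set_downstream(create_database)
create_database.set_downstream(end_task)

# Fan-out/fan-in sketch with hypothetical extra tasks:
# start_task >> [task_one, task_two] >> end_task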
Run Your DAG
After you have defined your DAG and saved it to the dags/ directory (for example, my_first_dag.py), you can run it from the UI:
Go to http://localhost:8080/.
Find your DAG in the list and click on it.
Click the “Trigger DAG” button.
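If you prefer to exercise the DAG without the web UI, newer Airflow releases (roughly 2.5 and later) provide a dag.test() helper; as a hedged sketch, appending the lines below to my_first_dag.py lets you run the whole DAG once with a plain python my_first_dag.py:
python
# Run the DAG once, in-process, without the scheduler or webserver (Airflow 2.5+).
if __name__ == '__main__':
    dag.test()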