
Building (Better) Data Pipelines using Apache Airflow

Sid Anand (@r39132)


QCon.AI 2018

1
About Me
Work [ed | s] @

Co-Chair for

Maintainer of

Spare time
2
Apache Airflow
What is it?

3
Apache Airflow : What is it?

In a nutshell:

Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs, or Directed Acyclic Graphs)
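
To make the "acyclic" part concrete, here is a minimal sketch that uses only Python's standard library (`graphlib`, Python 3.9+), not Airflow itself. The task names and the `is_valid_dag` helper are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter, CycleError

def is_valid_dag(dependencies):
    """Return True if the task graph has no cycles.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    try:
        # prepare() raises CycleError if any cycle exists.
        TopologicalSorter(dependencies).prepare()
        return True
    except CycleError:
        return False

# Hypothetical workflow: extract -> transform -> load
workflow = {"transform": {"extract"}, "load": {"transform"}}
# Making "extract" depend on "load" closes a loop, so this is not a DAG.
broken = {**workflow, "extract": {"load"}}

print(is_valid_dag(workflow))
print(is_valid_dag(broken))
```

A workflow scheduler can only order tasks when the graph passes a check like this, since a cycle means no task in the loop can ever start first.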

4
Apache Airflow
UI Walk-Through

5
Apache Airflow : UI Walk-through

6
Airflow - Authoring DAGs
Airflow: Visualizing a DAG

7
Airflow - Authoring DAGs
Airflow: Author DAGs in Python! No need to bundle many XML files!
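
One practical benefit of authoring workflows in a programming language rather than static XML is that tasks can be generated with ordinary loops. The sketch below is plain Python and deliberately does not use the Airflow API; the task names and the `build_workflow` helper are assumptions made up for this example.

```python
def build_workflow(enterprises):
    """Build a task -> upstream-tasks mapping with ordinary Python code."""
    deps = {"report": set()}
    for name in enterprises:
        fetch, score = f"fetch_{name}", f"score_{name}"
        deps[fetch] = set()          # no upstream tasks
        deps[score] = {fetch}        # scoring waits for its fetch
        deps["report"].add(score)    # the report waits for every score
    return deps

workflow = build_workflow(["A", "B", "C"])
for task, upstream in sorted(workflow.items()):
    print(task, "<-", sorted(upstream))
```

Adding a fourth enterprise is a one-item change to the list, where a static format would need new task definitions written out by hand.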

8
Airflow - Authoring DAGs
Airflow: The Tree View offers a view of DAG Runs over time!

9
Airflow - Performance Insights
Airflow: Gantt charts reveal the slowest tasks for a run!
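
The idea behind a Gantt view can be sketched in a few lines: take each task's start and end time, compute durations, and draw bars on a shared time axis. The records below are invented sample data, and this is not how Airflow's UI is implemented.

```python
from datetime import datetime

# Hypothetical (task, start, end) records for one DAG run.
run = [
    ("fetch",  datetime(2018, 4, 10, 9, 0),  datetime(2018, 4, 10, 9, 5)),
    ("score",  datetime(2018, 4, 10, 9, 5),  datetime(2018, 4, 10, 9, 35)),
    ("import", datetime(2018, 4, 10, 9, 35), datetime(2018, 4, 10, 9, 45)),
]

def durations_in_minutes(records):
    """Map each task name to its run time in minutes."""
    return {task: (end - start).total_seconds() / 60 for task, start, end in records}

durations = durations_in_minutes(run)
slowest = max(durations, key=durations.get)

# Draw one text bar per task, offset by its start time (1 character = 1 minute).
run_start = min(start for _, start, _ in run)
for task, start, end in run:
    offset = int((start - run_start).total_seconds() // 60)
    width = int(durations[task])
    print(f"{task:<7}" + " " * offset + "#" * width)
print("slowest:", slowest)
```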

10
Airflow - Performance Insights
Airflow: …And we can easily see performance trends over time

11
Apache Airflow
Why use it?

12
Apache Airflow : Why use it?
When would you use a Workflow Scheduler like Airflow?

• ETL Pipelines

• Machine Learning Pipelines

• Predictive Data Pipelines
  • Fraud Detection, Scoring/Ranking, Classification, Recommender Systems, etc.

• General Job Scheduling (e.g. Cron)
  • DB Back-ups, Scheduled code/config deployment
13
Apache Airflow : Why use it?

What should a Workflow Scheduler do well?


• Schedule a graph of dependencies
  • where Workflow = A DAG of Tasks

• Handle task failures

• Report / Alert on failures

• Monitor performance of tasks over time

• Enforce SLAs
  • E.g. Alerting if time or correctness SLAs are not met

• Easily scale for growing load
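
Several of the requirements above (dependency ordering, failure handling, alerting, time SLAs) can be illustrated with a toy runner. This is a minimal sketch under stated assumptions, not Airflow's scheduler: `run_workflow`, its parameters, and the status strings are hypothetical, and it runs tasks one at a time in a single process.

```python
import time
from graphlib import TopologicalSorter

def run_workflow(tasks, deps, retries=2, sla_seconds=1.0, alert=print):
    """Run callables in dependency order; retry failures; alert on problems.

    Returns a dict of task name -> "success", "failed" or "upstream_failed".
    """
    status = {}
    for name in TopologicalSorter(deps).static_order():
        # Skip a task if anything it depends on did not succeed.
        if any(status.get(up) != "success" for up in deps.get(name, ())):
            status[name] = "upstream_failed"
            continue
        started = time.monotonic()
        for attempt in range(1, retries + 2):   # first try plus retries
            try:
                tasks[name]()
                status[name] = "success"
                break
            except Exception as exc:
                alert(f"{name}: attempt {attempt} failed: {exc}")
        else:
            status[name] = "failed"             # every attempt raised
        if time.monotonic() - started > sla_seconds:
            alert(f"{name}: missed SLA of {sla_seconds}s")
    return status

calls = {"flaky": 0}

def flaky():
    """Fail on the first call only, to exercise the retry path."""
    calls["flaky"] += 1
    if calls["flaky"] == 1:
        raise RuntimeError("transient error")

def always_fails():
    raise RuntimeError("bad input")

tasks = {"extract": flaky, "transform": always_fails, "load": lambda: None}
deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

alerts = []
result = run_workflow(tasks, deps, retries=1, alert=alerts.append)
print(result)
```

Here `extract` recovers on its retry, `transform` exhausts its attempts, and `load` is never started because its upstream task failed.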


14
Apache Airflow : Why use it?
What Does Apache Airflow Add?

• Configuration-as-code

• Usability - Stunning UI / UX

• Centralized configuration

• Resource Pooling

• Extensibility
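
Resource pooling means capping how many tasks may touch a shared resource at once, regardless of how many workers are available. The sketch below shows the idea with a semaphore from the standard library; the `Pool` class and the slot count are assumptions for illustration and do not reproduce Airflow's own pool feature.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class Pool:
    """Cap how many tasks may use a shared resource at the same time."""

    def __init__(self, slots):
        self._slots = threading.BoundedSemaphore(slots)
        self._lock = threading.Lock()
        self.running = 0
        self.max_running = 0

    def run(self, task):
        with self._slots:                 # blocks until a slot is free
            with self._lock:
                self.running += 1
                self.max_running = max(self.max_running, self.running)
            try:
                return task()
            finally:
                with self._lock:
                    self.running -= 1

def query():
    time.sleep(0.05)   # stand-in for a call to a rate-limited database
    return "ok"

db_pool = Pool(slots=2)                   # hypothetical pool with 2 slots
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(lambda _: db_pool.run(query), range(8)))

print(results.count("ok"), "tasks ran, at most", db_pool.max_running, "at once")
```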

15
Use-Case : Message Scoring
Batch Pipeline Architecture

16
Use-Case : Message Scoring
[Diagram: enterprise A, enterprise B, enterprise C → S3]

S3 uploads every 15 minutes

17
Use-Case : Message Scoring

[Diagram: enterprise A, enterprise B, enterprise C → S3]

Airflow kicks off a Spark message scoring job every hour
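
With uploads arriving every 15 minutes and the scoring job running hourly, each run has four upload windows to process. The sketch below only illustrates that arithmetic; the function names and the S3 key layout are hypothetical, since the talk does not describe the real ones.

```python
from datetime import datetime, timedelta

def upload_windows(run_start, run_length=timedelta(hours=1),
                   upload_every=timedelta(minutes=15)):
    """Return the start time of each upload window inside one hourly run."""
    windows = []
    current = run_start
    while current < run_start + run_length:
        windows.append(current)
        current += upload_every
    return windows

def prefix_for(enterprise, window_start):
    # Hypothetical key layout; the talk does not describe the real one.
    return f"{enterprise}/{window_start:%Y/%m/%d/%H%M}/"

run_start = datetime(2018, 4, 10, 9, 0)
for window in upload_windows(run_start):
    print(prefix_for("enterprise-a", window))
```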

18
Use-Case : Message Scoring

[Diagram: enterprise A, enterprise B, enterprise C → S3 → S3]

Spark job writes scored messages and stats to another S3 bucket

19
Use-Case : Message Scoring

[Diagram: enterprise A, enterprise B, enterprise C → S3 → S3 → SNS → SQS]

This triggers SNS/SQS message events
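
A consumer reading these events from SQS has to unwrap two layers of JSON: the SNS notification envelope and the S3 event inside its `Message` field. The sketch below assumes the default (non-raw) SNS delivery format, which the slides do not confirm; the bucket name, key, and `new_objects` helper are hypothetical, and the sample payload is hand-written rather than captured from AWS.

```python
import json
from urllib.parse import unquote_plus

def new_objects(sqs_body):
    """Extract (bucket, key) pairs from an SQS message body.

    Assumes the body is an SNS notification whose "Message" field holds an
    S3 event notification as a JSON string.
    """
    envelope = json.loads(sqs_body)
    event = json.loads(envelope["Message"])
    objects = []
    for record in event.get("Records", []):
        if record.get("eventName", "").startswith("ObjectCreated:"):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 event notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects

# Hand-written sample with a hypothetical bucket and key.
s3_event = {"Records": [{
    "eventSource": "aws:s3",
    "eventName": "ObjectCreated:Put",
    "s3": {"bucket": {"name": "scored-messages"},
           "object": {"key": "enterprise-a/part+0001.json"}},
}]}
sqs_body = json.dumps({"Type": "Notification", "Message": json.dumps(s3_event)})

print(new_objects(sqs_body))
```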

20
Use-Case : Message Scoring

[Diagram: enterprise A, enterprise B, enterprise C → S3 → S3 → SNS → SQS → ASG Importers]

An Autoscale Group (ASG) of Importers spins up when it detects SQS messages
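
The scaling decision can be thought of as simple arithmetic on the queue backlog. The sketch below is only that arithmetic, with made-up numbers; in practice an Auto Scaling Group is driven by scaling policies and alarms configured in AWS, and the talk does not give the thresholds used.

```python
import math

def desired_importers(queue_depth, per_importer=500, min_size=0, max_size=20):
    """How many importers to run for a given backlog (all numbers hypothetical)."""
    wanted = math.ceil(queue_depth / per_importer)
    return max(min_size, min(max_size, wanted))

for depth in (0, 1, 1200, 50000):
    print(depth, "->", desired_importers(depth))
```

An empty queue maps to zero importers, which matches the slide's point that the group spins up only when messages appear.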

21
Use-Case : Message Scoring

[Diagram: enterprise A, enterprise B, enterprise C → S3 → S3 → SNS → SQS → ASG Importers → DB]

The importers rapidly ingest scored messages and aggregate statistics into the DB
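
As a rough illustration of the importer step, the sketch below batch-inserts scored messages and computes per-enterprise statistics. It uses an in-memory SQLite database as a stand-in; the schema, the sample rows, and the 0.5 "untrusted" threshold are all assumptions, since the talk does not describe the real database.

```python
import sqlite3

# Hypothetical scored messages: (enterprise, message_id, score).
scored = [("A", "m1", 0.92), ("A", "m2", 0.15), ("B", "m3", 0.40)]

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE messages (enterprise TEXT, message_id TEXT PRIMARY KEY, score REAL)"
)

with db:  # one transaction for the whole batch
    # OR REPLACE keeps the import idempotent if a batch is delivered twice.
    db.executemany("INSERT OR REPLACE INTO messages VALUES (?, ?, ?)", scored)

# Per enterprise: total messages and how many fall below the assumed threshold.
stats = db.execute(
    "SELECT enterprise, COUNT(*), SUM(score < 0.5) FROM messages "
    "GROUP BY enterprise ORDER BY enterprise"
).fetchall()
print(stats)
```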

22
Use-Case : Message Scoring

[Diagram: enterprise A, enterprise B, enterprise C → S3 → S3 → SNS → SQS → ASG Importers → DB]

Users receive alerts of untrusted emails & can review them in the web app

23
Use-Case : Message Scoring

[Diagram: enterprise A, enterprise B, enterprise C → S3 → S3 → SNS → SQS → ASG Importers → DB]

Airflow manages the entire process

24
Airflow DAG

25
Apache Airflow
Incubating

26
Apache Airflow : Incubating
Timeline
• Airflow was created @ Airbnb in 2015 by Maxime Beauchemin
• Max launched it @ Hadoop Summit in Summer 2015
• On 3/31/2016, Airflow entered the Apache Incubator

Today
• 2400+ Forks
• 7600+ GitHub Stars
• 430+ Contributors
• 150+ companies officially using it!
• 14 Committers/Maintainers (we’re growing here)
27
Thank You!

28
Apache Airflow
Behind the Scenes

29
Apache Airflow : Behind the Scenes

Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)

It ships with a
• DAG Scheduler
• Web application (UI)
• Powerful CLI
• Celery Workers!

30
Apache Airflow : Behind the Scenes
[Diagram: Webserver, Scheduler, Meta DB, Celery / RabbitMQ, Worker, Worker, Worker]

1. A user schedules / manages DAGs using the Airflow UI!

2. Airflow’s webserver stores scheduling metadata in the metadata DB

3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ

4. Airflow workers pick up Airflow tasks over Celery
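
The flow above (state in a metadata store, a scheduler that enqueues work, workers that consume it) can be mimicked in a single process. In this sketch a dict stands in for the metadata DB and `queue.Queue` stands in for Celery / RabbitMQ; the task ids and state names are hypothetical, and none of this is Airflow's actual code.

```python
import queue
import threading

# Stand-ins: a dict for the metadata DB, a Queue for Celery / RabbitMQ.
meta_db = {"score_messages": "scheduled", "import_stats": "scheduled"}
broker = queue.Queue()
db_lock = threading.Lock()

def scheduler():
    """Pick up scheduled work from the metadata store and enqueue it."""
    for task_id, state in list(meta_db.items()):
        if state == "scheduled":
            with db_lock:
                meta_db[task_id] = "queued"
            broker.put(task_id)

def worker():
    """Pull task ids off the broker until told to stop."""
    while True:
        task_id = broker.get()
        if task_id is None:        # sentinel: shut down
            break
        with db_lock:
            meta_db[task_id] = "success"   # a real worker would run the task here

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
scheduler()
for _ in workers:
    broker.put(None)               # one sentinel per worker
for w in workers:
    w.join()
print(meta_db)
```

Because the workers only talk to the broker and the metadata store, more of them can be added without changing the scheduler.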
31
Thank You!

35
