
Continuous Processing with Apache Flink

Stephan Ewen
@stephanewen

Streaming technology is enabling the obvious:
continuous processing on data that is continuously produced
Continuous Apps before Streaming

[Diagram: a scheduler periodically kicks off batch jobs (file 1 → Job 1, file 2 → Job 2, file 3 → Job 3) whose results feed a serving layer, repeating over time.]
Continuous Apps with Lambda

[Diagram: the same scheduler-driven batch jobs (file 1 → Job 1, file 2 → Job 2) feed the serving layer, complemented by a parallel streaming job for low-latency results.]
Continuous Apps with Streaming

[Diagram: a single streaming pipeline: collect → log → analyze → serve & store.]
Continuous Data Sources

[Diagram: a partitioned log. A job can process a period of historic data, process the latest data with low latency (the tail of the log), or reprocess the stream: historic data first, catching up with real-time data.]
Continuous Data Sources

[Diagram: the same stream of events viewed two ways: as Apache Kafka partitions, or as a stream view over a sequence of files (2016-3-1 12:00 am, 1:00 am, 2:00 am, … 2016-3-11 10:00 pm, 11:00 pm, 2016-3-12 12:00 am, 1:00 am, 2:00 am, 3:00 am).]
Continuous Processing: Time & State

Enter Apache Flink
Apache Flink Stack

Libraries
DataStream API (Stream Processing)    DataSet API (Batch Processing)
Runtime: Distributed Streaming Data Flow

Streaming and batch as first-class citizens.
Programs and Dataflows

val lines: DataStream[String] =
  env.addSource(new FlinkKafkaConsumer09(…))     // Source

val events: DataStream[Event] =
  lines.map((line) => parse(line))               // Transformation

val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))                   // Transformation
  .apply(new MyAggregationFunction())

stats.addSink(new RollingSink(path))             // Sink

[Diagram: the resulting streaming dataflow runs in parallel: Source [1]/[2] → map() [1]/[2] → keyBy()/window()/apply() [1]/[2] → Sink [1].]
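What this dataflow computes can be sketched in plain Scala (no Flink dependency); the "sensor,value,timestamp" line format and the `parse` helper here are assumptions for illustration only.

```scala
// Plain-Scala sketch of the dataflow above: parse lines, key by sensor,
// aggregate per 5-second tumbling window.
case class Event(sensor: String, value: Double, ts: Long)

def parse(line: String): Event = {
  val Array(s, v, t) = line.split(",")   // hypothetical "sensor,value,ts" format
  Event(s, v.toDouble, t.toLong)
}

val lines = Seq("a,1.0,1000", "a,2.0,3000", "b,5.0,4000", "a,4.0,6000")

val stats = lines.map(parse)
  .groupBy(e => (e.sensor, e.ts / 5000))   // keyBy("sensor") + 5s window
  .map { case ((sensor, win), evs) => (sensor, win, evs.map(_.value).sum) }
  .toSet
```

In the real dataflow the grouping and windowing run continuously and in parallel; this sketch only shows the per-window aggregation logic.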
What makes Flink flink?

 Low latency & high throughput
 Well-behaved flow control (back pressure)
 Works on real-time and historic data
 True streaming with event time
 Windows & user-defined state
 Stateful APIs and streaming libraries (Complex Event Processing)
 Exactly-once semantics for fault tolerance
 Globally consistent savepoints
 Flexible windows (time, count, session, roll-your-own)

Make more sense of data
(It's) About Time
Different Notions of Time

[Diagram: events flow from Event Producer → Message Queue → Flink Data Source → Flink Window Operator (over partitions 1 and 2). Event time is assigned at the producer, storage ingestion time at the message queue, Flink ingestion time at the data source, and window processing time at the window operator.]
Event Time vs. Processing Time

Episode:    IV     V      VI     I      II     III    VII
Released:   1977   1980   1983   1999   2002   2005   2015

The episodes' event time is their story order (I–VII); their processing time is the order in which they were released (1977–2015).
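The ordering joke above can be made concrete with a small sketch: sorting by episode number gives event-time order, sorting by release year gives processing-time order.

```scala
// Event time = when the event happened (story order);
// processing time = when it was observed (release order).
case class Episode(number: Int, releaseYear: Int)

val saga = Seq(
  Episode(4, 1977), Episode(5, 1980), Episode(6, 1983),
  Episode(1, 1999), Episode(2, 2002), Episode(3, 2005), Episode(7, 2015))

val processingOrder = saga.sortBy(_.releaseYear).map(_.number)
val eventOrder      = saga.sortBy(_.number).map(_.number)
```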
Batch: Implicit Treatment of Time

[Diagram: hourly (1h) batch jobs feed a serving layer.]

Time is treated outside of your application.
Data is grouped by storage ingestion time.
Streaming: Windows

Aggregates on streams are scoped by windows:
• Time-driven, e.g. the last X minutes
• Data-driven, e.g. the last X records
Streaming: Windows

[Diagram: a window sliding over the stream: "Average over the last 5 minutes"]
Event Time Windows

Event time windows reorder the events into their event-time order.
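A minimal sketch of how tumbling event-time windows achieve this: each event is assigned to a window by its timestamp, so arrival order is irrelevant (the window size of 10 is arbitrary here, and this is plain Scala, not Flink's window API).

```scala
// Each event lands in the window containing its timestamp,
// regardless of the order in which events arrive.
def windowStart(ts: Long, size: Long): Long = ts - (ts % size)

val arrivalOrder = Seq(7L, 3L, 12L, 5L, 14L)   // out of timestamp order

val windows = arrivalOrder
  .groupBy(windowStart(_, 10L))
  .map { case (start, evs) => start -> evs.sorted }
```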
Processing Time

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment

env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)

val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
Ingestion Time

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment

env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)

val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
Event Time

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val stream: DataStream[Event] = env.addSource(…)

val tsStream = stream.assignTimestampsAndWatermarks(
  new MyTimestampsAndWatermarkGenerator())

tsStream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
The Power of Event Time

 Batch Processors: event time in ingestion-time batches
  (a mix of data-driven and wall-clock time)
  • Stable across re-executions
  • Wrong grouping at batch boundaries

 Traditional Stream Processors: processing time
  (purely wall-clock time)
  • Results depend on when the program runs (different on re-execution)
  • Results affected by network speed and delays

 Event-Time Stream Processors: event time
  (purely data-driven time)
  • Stable across re-executions
  • No incorrect results at batch boundaries
Event Time Progress: Watermarks

[Diagram: an in-order stream (event timestamps 23 21 20 19 18 17 15 14 11 10 9 9 7) and an out-of-order stream (21 19 20 17 22 12 17 14 12 9 15 11 7), each carrying watermarks W(20)/W(17) and W(11) interleaved with the events. A watermark W(t) declares that event time has progressed to t, i.e. no further events with timestamp ≤ t are expected.]
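One common way to generate such watermarks, sketched in plain Scala (not Flink's actual API; the class name is made up): let the watermark trail the highest timestamp seen so far by a fixed out-of-orderness bound.

```scala
// Bounded-out-of-orderness watermarks: the watermark lags the maximum
// observed timestamp by a fixed bound, tolerating that much disorder.
class BoundedWatermarks(maxOutOfOrderness: Long) {
  private var maxTs = Long.MinValue
  def onEvent(ts: Long): Unit = maxTs = math.max(maxTs, ts)
  def currentWatermark: Long = maxTs - maxOutOfOrderness
}

val gen = new BoundedWatermarks(3L)
Seq(7L, 11L, 9L).foreach(gen.onEvent)   // out-of-order timestamps
val wm = gen.currentWatermark           // trails the max timestamp 11 by 3
```

Events with timestamps at or below the current watermark are considered late under this scheme.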
Bounding the Latency for Results

 Trigger on combinations of event time and processing time
 See previous talks by Tyler Akidau & Kenneth Knowles on Apache Beam (incub.)
 The concepts apply almost 1:1 to Apache Flink
 The syntax varies
Matters of State
Batch vs. Continuous

Batch Jobs                           Continuous Programs
• No state across batches            • Continuous state across time
• Fault tolerance within a job       • Fault tolerance guards state
• Re-processing starts empty         • Re-processing starts stateful
Continuous State

[Diagram: overlapping user sessions over time — there is no stateless point in time.]
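The sessions in the picture are data-driven: a session stays open as long as events keep arriving, and closes when the gap to the next event exceeds a threshold. A plain-Scala sketch of that grouping (not Flink's session-window API):

```scala
// Group sorted timestamps into sessions separated by gaps larger than `gap`.
def sessions(timestamps: Seq[Long], gap: Long): Seq[Seq[Long]] =
  timestamps.sorted.foldLeft(Vector.empty[Vector[Long]]) {
    case (init :+ current, ts) if ts - current.last <= gap =>
      init :+ (current :+ ts)        // extend the open session
    case (acc, ts) =>
      acc :+ Vector(ts)              // gap exceeded: start a new session
  }
```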
Re-processing data (in batch)

[Diagram: hourly files from 2016-3-1 12:00 am through 7:00 am are re-processed as independent batches; state that spans batch boundaries yields wrong / corrupt results.]


Streaming: Savepoints

[Diagram: Savepoint A … Savepoint B drawn along the running stream.]

A savepoint is a globally consistent point-in-time snapshot of the streaming application.
Re-processing data (continuous)

[Diagram: a new job starts from Savepoint A and re-processes the stream from there.]
Re-processing data (continuous)

 Draw savepoints at times that you will want to start new jobs from (daily, hourly, …)
 Re-process by starting a new job from a savepoint
  • Defines the start position in the stream (for example, Kafka offsets)
  • Initializes pending state (like partial sessions)

[Diagram: run the new streaming program from the savepoint.]
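The mechanics can be sketched in plain Scala (no Flink; all names are illustrative): model a savepoint as a stream position plus operator state, and observe that resuming from it yields the same result as running from scratch.

```scala
// A toy savepoint: input offset + word-count state. Resuming replays only
// the events after the stored offset, starting from the stored state.
case class Savepoint(offset: Int, counts: Map[String, Int])

def run(events: Vector[String], from: Savepoint): Savepoint =
  events.drop(from.offset).foldLeft(from) { (sp, e) =>
    Savepoint(sp.offset + 1, sp.counts.updated(e, sp.counts.getOrElse(e, 0) + 1))
  }

val events  = Vector("a", "b", "a", "c")
val scratch = run(events, Savepoint(0, Map.empty))          // full run
val sp      = run(events.take(2), Savepoint(0, Map.empty))  // run up to a savepoint
val resumed = run(events, sp)                               // resume from it
```

In Flink the "offset" would be, for example, Kafka partition offsets, and the "counts" would be the operators' checkpointed state.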
Forking and Versioning Applications

[Diagram: App A takes periodic savepoints; App B and App C are forked from two of those savepoints and continue independently.]
Conclusion
Wrap up

 Streaming is the architecture for continuous processing
 Continuous processing makes data applications
  • Simpler: fewer moving parts
  • More correct: no broken state at any boundaries
  • More flexible: re-process data and fork applications via savepoints
 It requires a powerful stream processor, like Apache Flink
Upcoming Features

 Dynamic scaling, resource elasticity
 Stream SQL
 CEP enhancements
 Incremental & asynchronous state snapshotting
 Mesos support
 More connectors, end-to-end exactly-once
 API enhancements (e.g., joins, slowly changing inputs)
 Security (data encryption, Kerberos with Kafka)
Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org
We are hiring!
data-artisans.com/careers
