
Continuous Processing with Apache Flink

Stephan Ewen
@stephanewen

Streaming technology is enabling the obvious:
continuous processing on data that is continuously produced
Continuous Apps before Streaming

[Diagram: a scheduler periodically kicks off batch jobs (file 1 → Job 1, file 2 → Job 2, file 3 → Job 3) whose results feed a serving layer, repeating over time.]
Continuous Apps with Lambda

[Diagram: the same scheduler-driven batch jobs (file 1 → Job 1, file 2 → Job 2) feed the serving layer, complemented by a parallel streaming job for low-latency results.]
Continuous Apps with Streaming

[Diagram: a single streaming pipeline: collect → log → analyze → serve & store.]
Continuous Data Sources

[Diagram: a partitioned log. A job can process a period of historic data, process the latest data with low latency (the tail of the log), or reprocess the stream: historic data first, catching up with real-time data.]
Continuous Data Sources

[Diagram: the same stream of events viewed two ways: as Apache Kafka partitions, or as a stream view over a sequence of files (2016-3-1 12:00 am, 1:00 am, 2:00 am, … 2016-3-11 10:00 pm, 11:00 pm, 2016-3-12 12:00 am, 1:00 am, 2:00 am, 3:00 am).]
Continuous Processing: Time & State

Enter Apache Flink
Apache Flink Stack

Libraries
DataStream API (Stream Processing)    DataSet API (Batch Processing)
Runtime: Distributed Streaming Data Flow

Streaming and batch as first-class citizens.
Programs and Dataflows

val lines: DataStream[String] =
  env.addSource(new FlinkKafkaConsumer09(…))     // Source

val events: DataStream[Event] =
  lines.map((line) => parse(line))               // Transformation

val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))                   // Transformation
  .apply(new MyAggregationFunction())

stats.addSink(new RollingSink(path))             // Sink

[Diagram: the resulting streaming dataflow runs in parallel: Source [1]/[2] → map() [1]/[2] → keyBy()/window()/apply() [1]/[2] → Sink [1].]
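What this dataflow computes can be sketched in plain Scala (no Flink dependency); the "sensor,value,timestamp" line format and the `parse` helper here are assumptions for illustration only.

```scala
// Plain-Scala sketch of the dataflow above: parse lines, key by sensor,
// aggregate per 5-second tumbling window.
case class Event(sensor: String, value: Double, ts: Long)

def parse(line: String): Event = {
  val Array(s, v, t) = line.split(",")   // hypothetical "sensor,value,ts" format
  Event(s, v.toDouble, t.toLong)
}

val lines = Seq("a,1.0,1000", "a,2.0,3000", "b,5.0,4000", "a,4.0,6000")

val stats = lines.map(parse)
  .groupBy(e => (e.sensor, e.ts / 5000))   // keyBy("sensor") + 5s window
  .map { case ((sensor, win), evs) => (sensor, win, evs.map(_.value).sum) }
  .toSet
```

In the real dataflow the grouping and windowing run continuously and in parallel; this sketch only shows the per-window aggregation logic.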
What makes Flink flink?

 Low latency & high throughput
 Well-behaved flow control (back pressure)
 Works on real-time and historic data
 True streaming with event time
 Windows & user-defined state
 Stateful APIs and streaming libraries (Complex Event Processing)
 Exactly-once semantics for fault tolerance
 Globally consistent savepoints
 Flexible windows (time, count, session, roll-your-own)

Make more sense of data
(It's) About Time
Different Notions of Time

[Diagram: events flow from Event Producer → Message Queue → Flink Data Source → Flink Window Operator (over partitions 1 and 2). Event time is assigned at the producer, storage ingestion time at the message queue, Flink ingestion time at the data source, and window processing time at the window operator.]
Event Time vs. Processing Time

Episode:    IV     V      VI     I      II     III    VII
Released:   1977   1980   1983   1999   2002   2005   2015

The episodes' event time is their story order (I–VII); their processing time is the order in which they were released (1977–2015).
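The ordering joke above can be made concrete with a small sketch: sorting by episode number gives event-time order, sorting by release year gives processing-time order.

```scala
// Event time = when the event happened (story order);
// processing time = when it was observed (release order).
case class Episode(number: Int, releaseYear: Int)

val saga = Seq(
  Episode(4, 1977), Episode(5, 1980), Episode(6, 1983),
  Episode(1, 1999), Episode(2, 2002), Episode(3, 2005), Episode(7, 2015))

val processingOrder = saga.sortBy(_.releaseYear).map(_.number)
val eventOrder      = saga.sortBy(_.number).map(_.number)
```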
Batch: Implicit Treatment of Time

[Diagram: hourly (1h) batch jobs feed a serving layer.]

Time is treated outside of your application.
Data is grouped by storage ingestion time.
Streaming: Windows

Aggregates on streams are scoped by windows:
• Time-driven, e.g. the last X minutes
• Data-driven, e.g. the last X records
Streaming: Windows

[Diagram: a window sliding over the stream: "Average over the last 5 minutes"]
Event Time Windows

Event time windows reorder the events into their event-time order.
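A minimal sketch of how tumbling event-time windows achieve this: each event is assigned to a window by its timestamp, so arrival order is irrelevant (the window size of 10 is arbitrary here, and this is plain Scala, not Flink's window API).

```scala
// Each event lands in the window containing its timestamp,
// regardless of the order in which events arrive.
def windowStart(ts: Long, size: Long): Long = ts - (ts % size)

val arrivalOrder = Seq(7L, 3L, 12L, 5L, 14L)   // out of timestamp order

val windows = arrivalOrder
  .groupBy(windowStart(_, 10L))
  .map { case (start, evs) => start -> evs.sorted }
```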
Processing Time

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment

env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)

val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
Ingestion Time

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment

env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)

val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
Event Time

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val stream: DataStream[Event] = env.addSource(…)

val tsStream = stream.assignTimestampsAndWatermarks(
  new MyTimestampsAndWatermarkGenerator())

tsStream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
The Power of Event Time

 Batch Processors: event time in ingestion-time batches
  (a mix of data-driven and wall-clock time)
  • Stable across re-executions
  • Wrong grouping at batch boundaries

 Traditional Stream Processors: processing time
  (purely wall-clock time)
  • Results depend on when the program runs (different on re-execution)
  • Results affected by network speed and delays

 Event-Time Stream Processors: event time
  (purely data-driven time)
  • Stable across re-executions
  • No incorrect results at batch boundaries
Event Time Progress: Watermarks

[Diagram: an in-order stream (event timestamps 23 21 20 19 18 17 15 14 11 10 9 9 7) and an out-of-order stream (21 19 20 17 22 12 17 14 12 9 15 11 7), each carrying watermarks W(20)/W(17) and W(11) interleaved with the events. A watermark W(t) declares that event time has progressed to t, i.e. no further events with timestamp ≤ t are expected.]
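One common way to generate such watermarks, sketched in plain Scala (not Flink's actual API; the class name is made up): let the watermark trail the highest timestamp seen so far by a fixed out-of-orderness bound.

```scala
// Bounded-out-of-orderness watermarks: the watermark lags the maximum
// observed timestamp by a fixed bound, tolerating that much disorder.
class BoundedWatermarks(maxOutOfOrderness: Long) {
  private var maxTs = Long.MinValue
  def onEvent(ts: Long): Unit = maxTs = math.max(maxTs, ts)
  def currentWatermark: Long = maxTs - maxOutOfOrderness
}

val gen = new BoundedWatermarks(3L)
Seq(7L, 11L, 9L).foreach(gen.onEvent)   // out-of-order timestamps
val wm = gen.currentWatermark           // trails the max timestamp 11 by 3
```

Events with timestamps at or below the current watermark are considered late under this scheme.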
Bounding the Latency for Results

 Trigger on combinations of event time and processing time
 See previous talks by Tyler Akidau & Kenneth Knowles on Apache Beam (incub.)
 The concepts apply almost 1:1 to Apache Flink
 The syntax varies
Matters of State
Batch vs. Continuous

Batch Jobs                           Continuous Programs
• No state across batches            • Continuous state across time
• Fault tolerance within a job       • Fault tolerance guards state
• Re-processing starts empty         • Re-processing starts stateful
Continuous State

[Diagram: overlapping user sessions over time — there is no stateless point in time.]
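The sessions in the picture are data-driven: a session stays open as long as events keep arriving, and closes when the gap to the next event exceeds a threshold. A plain-Scala sketch of that grouping (not Flink's session-window API):

```scala
// Group sorted timestamps into sessions separated by gaps larger than `gap`.
def sessions(timestamps: Seq[Long], gap: Long): Seq[Seq[Long]] =
  timestamps.sorted.foldLeft(Vector.empty[Vector[Long]]) {
    case (init :+ current, ts) if ts - current.last <= gap =>
      init :+ (current :+ ts)        // extend the open session
    case (acc, ts) =>
      acc :+ Vector(ts)              // gap exceeded: start a new session
  }
```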
Re-processing data (in batch)

[Diagram: hourly files from 2016-3-1 12:00 am through 7:00 am are re-processed as independent batches; state that spans batch boundaries yields wrong / corrupt results.]


Streaming: Savepoints

[Diagram: Savepoint A … Savepoint B drawn along the running stream.]

A savepoint is a globally consistent point-in-time snapshot of the streaming application.
Re-processing data (continuous)

[Diagram: a new job starts from Savepoint A and re-processes the stream from there.]
Re-processing data (continuous)

 Draw savepoints at times that you will want to start new jobs from (daily, hourly, …)
 Re-process by starting a new job from a savepoint
  • Defines the start position in the stream (for example, Kafka offsets)
  • Initializes pending state (like partial sessions)

[Diagram: run the new streaming program from the savepoint.]
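The mechanics can be sketched in plain Scala (no Flink; all names are illustrative): model a savepoint as a stream position plus operator state, and observe that resuming from it yields the same result as running from scratch.

```scala
// A toy savepoint: input offset + word-count state. Resuming replays only
// the events after the stored offset, starting from the stored state.
case class Savepoint(offset: Int, counts: Map[String, Int])

def run(events: Vector[String], from: Savepoint): Savepoint =
  events.drop(from.offset).foldLeft(from) { (sp, e) =>
    Savepoint(sp.offset + 1, sp.counts.updated(e, sp.counts.getOrElse(e, 0) + 1))
  }

val events  = Vector("a", "b", "a", "c")
val scratch = run(events, Savepoint(0, Map.empty))          // full run
val sp      = run(events.take(2), Savepoint(0, Map.empty))  // run up to a savepoint
val resumed = run(events, sp)                               // resume from it
```

In Flink the "offset" would be, for example, Kafka partition offsets, and the "counts" would be the operators' checkpointed state.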
Forking and Versioning Applications

[Diagram: App A takes periodic savepoints; App B and App C are forked from two of those savepoints and continue independently.]
Conclusion
Wrap up

 Streaming is the architecture for continuous processing
 Continuous processing makes data applications
  • Simpler: fewer moving parts
  • More correct: no broken state at any boundaries
  • More flexible: re-process data and fork applications via savepoints
 It requires a powerful stream processor, like Apache Flink
Upcoming Features

 Dynamic scaling, resource elasticity
 Stream SQL
 CEP enhancements
 Incremental & asynchronous state snapshotting
 Mesos support
 More connectors, end-to-end exactly-once
 API enhancements (e.g., joins, slowly changing inputs)
 Security (data encryption, Kerberos with Kafka)
Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org
We are hiring!
data-artisans.com/careers
