Robert Hryniewicz
Data Evangelist
@RobHryniewicz
Data
[Chart: data volume growth, 2006–2020]
Solution
→ Distribute data over large clusters
Difficulty
→ How to split work across machines?
→ Spark module for structured data processing (e.g. ORC, Parquet, Avro, MySQL)
→ Two ways to manipulate data:
– DataFrame/Dataset API
– SQL query
What is it?
→ Main entry point for Spark functionality
[Diagram: a DataFrame is a table of rows and named columns (Col1, Col2, …, ColN); Spark SQL builds DataFrames from sources such as Avro, CSV, JSON, and Hive]
Example
val path = "examples/flights.json"
val flights = spark.read.json(path)
Example
flights.createOrReplaceTempView("flightsView")
DataFrame API
flights.select("Origin", "Dest", "DepDelay")
.filter($"DepDelay" > 15).show(5)
SQL API
SELECT Origin, Dest, DepDelay
FROM flightsView
WHERE DepDelay > 15 LIMIT 5

Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action
Analyze after landing… Analyze in motion…
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming
Overview
→ Extension of the Spark Core API
→ Note: the ZeroMQ and MQTT receivers are no longer supported in Spark 2.x
Window Operations
→ Apply transformations over a sliding window of data, e.g. a rolling average
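The sliding-window idea can be illustrated outside Spark. A minimal pure-Python sketch of a rolling average (the window length and the generator shape are illustrative, not the Spark Streaming API):

```python
from collections import deque

def rolling_average(stream, window_size=3):
    """Yield the average of the last `window_size` values as each new one arrives."""
    window = deque(maxlen=window_size)  # oldest value falls off automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

# Each output averages the current sliding window of at most 3 values.
print(list(rolling_average([10, 20, 30, 40])))  # → [10.0, 15.0, 20.0, 30.0]
```

In Spark Streaming the same shape is expressed with windowed transformations over a DStream rather than an explicit deque.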
→ Consistency
→ Fault tolerance
→ Out-of-order data
→ High-Level APIs – DataFrames, Datasets and SQL; the same in streaming and in batch
→ Event-time Processing – native support for working with out-of-order and late data
→ End-to-end Exactly Once – transactional both in processing and output
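Event-time processing means grouping events by when they happened, not when they arrived. A toy pure-Python sketch of event-time windowing (window size and tuple layout are illustrative, not the Structured Streaming API):

```python
from collections import defaultdict

def window_counts(events, window_sec=10):
    """Count events per event-time window of `window_sec` seconds.

    Because grouping uses the event's own timestamp, out-of-order
    arrival does not change the result.
    """
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_sec) * window_sec
        counts[window_start] += 1
    return dict(counts)

# Events arrive out of order, but land in the right event-time window.
events = [(12, "a"), (3, "b"), (17, "c"), (8, "d")]
print(window_counts(events))  # → {10: 2, 0: 2}
```

Structured Streaming adds watermarks on top of this idea to bound how long state for late data is kept.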
[Diagram: ML Pipelines – Train: Input DataFrame (TRAIN) → Pipeline → fit() → Pipeline Model; Predict: Input DataFrame (TEST) → Pipeline Model → transform() → Output DataFrame (PREDICTIONS). Example stages: feature transform 1 → feature transform 2 → combine features → Linear Regression]
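The fit()/transform() pattern in the diagram can be sketched in a few lines of plain Python. These are toy stand-ins (ToyScaler, ToyPipeline), not the pyspark.ml classes:

```python
class ToyScaler:
    """Toy stage: learns the column max on fit(), divides by it on transform()."""
    def fit(self, data):
        self.max_ = max(data)
        return self
    def transform(self, data):
        return [x / self.max_ for x in data]

class ToyPipeline:
    """Chains stages: fit() learns each stage in order, transform() applies them."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return self  # the fitted pipeline acts as the "model"
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

model = ToyPipeline([ToyScaler()]).fit([2.0, 4.0, 8.0])  # train
print(model.transform([4.0, 8.0]))                       # predict → [0.5, 1.0]
```

The real pyspark.ml Pipeline follows the same contract over DataFrames: fit() returns a PipelineModel whose transform() appends prediction columns.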
[Diagram: Train: Input DataFrame → Pipeline → fit() → exported Model; Predict: Input DataFrame → Pipeline Model → transform() → Output DataFrame]
# PySpark ML pipeline (the earlier stage definitions are elided in the original)
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

indexer = …
parser = …
hashingTF = …
vecAssembler = …
rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
→ PageRank
→ Connected components
→ Label propagation
→ SVD++
→ Strongly connected components
→ Triangle count
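As a flavor of what these graph algorithms compute, here is a minimal pure-Python PageRank by power iteration. This is a conceptual sketch, not the GraphX implementation; the damping factor and iteration count are illustrative defaults:

```python
def pagerank(edges, num_iter=50, d=0.85):
    """Power-iteration PageRank over a directed edge list [(src, dst), ...]."""
    nodes = {n for e in edges for n in e}
    out_links = {n: [dst for src, dst in edges if src == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(num_iter):
        # Every node keeps the teleport share, then receives from its in-links.
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                new[dst] += d * rank[src] / len(targets)
        rank = new
    return rank

# A links to B and C; B links to C – C accumulates the highest rank.
r = pagerank([("A", "B"), ("A", "C"), ("B", "C")])
print(max(r, key=r.get))  # → C
```

GraphX runs the same iteration as a distributed Pregel-style computation over partitioned edge data.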
Web-based notebook that enables interactive data analytics.
[Diagram: a notebook author and collaborators/report viewers share a Zeppelin notebook, which connects to a cluster back end – Spark | Hive | HBase, any of 30+ back ends]
[Diagram: Collect → ETL/Process → Analysis → Data Product]
→ Livy is the open source REST interface for interacting with Apache Spark from anywhere
→ Installed as a Spark Ambari Service
[Diagram: Zeppelin (Shiro/LDAP authentication) → Spark interpreter group → Livy APIs (SPNego: Kerberos) → Livy Server → Spark driver on YARN (Kerberos)]
→ End-to-end identity propagation
– Send user identity from Zeppelin > Livy > Spark on YARN
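Livy's REST workflow boils down to two POSTs: create a session, then submit statements to it. The `/sessions` and `/sessions/{id}/statements` endpoints and the `proxyUser` field are part of Livy's documented API, but the host, user name, and code string below are assumptions for illustration:

```python
import json

LIVY_URL = "http://livy-server:8998"  # hypothetical host; 8998 is Livy's default port

# 1. Create an interactive PySpark session (POST {LIVY_URL}/sessions).
#    proxyUser is how the propagated identity reaches Spark on YARN.
create_session = {"kind": "pyspark", "proxyUser": "tommy"}

# 2. Run code in the session (POST {LIVY_URL}/sessions/<id>/statements).
run_statement = {"code": "spark.range(100).count()"}

# With the `requests` library installed, the calls would look like:
#   requests.post(f"{LIVY_URL}/sessions", json=create_session)
#   requests.post(f"{LIVY_URL}/sessions/0/statements", json=run_statement)
print(json.dumps(create_session))
```

Polling GET `/sessions/{id}/statements/{stmtId}` then returns the statement's state and output.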
[Diagram: Livy Server session sharing – Client 1 and Client 2 share Session-1 (SparkSession-1 with its SparkContext); Client 3 uses Session-2 (SparkSession-2 with its SparkContext)]
[Diagram: a user (e.g. Tommy Callahan) authenticated via LDAP accesses the HDFS cluster – Data Store: HDFS/Ozone; Column DB: HBase/Cassandra]
[Diagram: IoT Devices → IoT Edge (single node) → Data Center (on prem/cloud)]
Spark 2.x & HDP 2.x
What’s New?
DSX
20k+
Registered Users
45k+
Answers
100k+
Technical Assets