Robert Hryniewicz
Data Evangelist
@RobHryniewicz
Data
[Chart: data volume growth, 2006–2020]
Solution
→ Distribute data over large clusters
Difficulty
→ How to split work across machines?
→ Spark module for structured data processing (e.g. ORC, Parquet, Avro, MySQL)
→ Two ways to manipulate data:
– DataFrame/Dataset API
– SQL query
What is it?
→ Main entry point for Spark functionality
[Diagram: a DataFrame is a table of rows and named columns (Col1, Col2, …, ColN); Spark SQL builds DataFrames from sources such as Avro, CSV, JSON, and Hive]
Example
val path = "examples/flights.json"
val flights = spark.read.json(path)
Example
flights.createOrReplaceTempView("flightsView")
DataFrame API
flights.select("Origin", "Dest", "DepDelay")
.filter($"DepDelay" > 15).show(5)
SQL API
SELECT Origin, Dest, DepDelay
FROM flightsView
WHERE DepDelay > 15 LIMIT 5

Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action
Analyze after landing… Analyze in motion…
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming
Overview
→ Extension of the Spark Core API
→ Note: the ZeroMQ and MQTT receivers are no longer supported in Spark 2.x
Window Operations
→ Apply transformations over a sliding window of data, e.g. a rolling average
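The sliding-window idea can be illustrated outside Spark. A minimal pure-Python sketch of a rolling average (the window length and the generator shape are illustrative, not the Spark Streaming API):

```python
from collections import deque

def rolling_average(stream, window_size=3):
    """Yield the average of the last `window_size` values as each new one arrives."""
    window = deque(maxlen=window_size)  # oldest value falls off automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

# Each output averages the current sliding window of at most 3 values.
print(list(rolling_average([10, 20, 30, 40])))  # → [10.0, 15.0, 20.0, 30.0]
```

In Spark Streaming the same shape is expressed with windowed transformations over a DStream rather than an explicit deque.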
→ Consistency
→ Fault tolerance
→ Out-of-order data
→ High-Level APIs – DataFrames, Datasets and SQL; the same in streaming and in batch
→ Event-time Processing – native support for working with out-of-order and late data
→ End-to-end Exactly Once – transactional both in processing and output
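Event-time processing means grouping events by when they happened, not when they arrived. A toy pure-Python sketch of event-time windowing (window size and tuple layout are illustrative, not the Structured Streaming API):

```python
from collections import defaultdict

def window_counts(events, window_sec=10):
    """Count events per event-time window of `window_sec` seconds.

    Because grouping uses the event's own timestamp, out-of-order
    arrival does not change the result.
    """
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_sec) * window_sec
        counts[window_start] += 1
    return dict(counts)

# Events arrive out of order, but land in the right event-time window.
events = [(12, "a"), (3, "b"), (17, "c"), (8, "d")]
print(window_counts(events))  # → {10: 2, 0: 2}
```

Structured Streaming adds watermarks on top of this idea to bound how long state for late data is kept.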
[Diagram: ML Pipelines – Train: Input DataFrame (TRAIN) → Pipeline → fit() → Pipeline Model; Predict: Input DataFrame (TEST) → Pipeline Model → transform() → Output DataFrame (PREDICTIONS). Example stages: feature transform 1 → feature transform 2 → combine features → Linear Regression]
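The fit()/transform() pattern in the diagram can be sketched in a few lines of plain Python. These are toy stand-ins (ToyScaler, ToyPipeline), not the pyspark.ml classes:

```python
class ToyScaler:
    """Toy stage: learns the column max on fit(), divides by it on transform()."""
    def fit(self, data):
        self.max_ = max(data)
        return self
    def transform(self, data):
        return [x / self.max_ for x in data]

class ToyPipeline:
    """Chains stages: fit() learns each stage in order, transform() applies them."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return self  # the fitted pipeline acts as the "model"
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

model = ToyPipeline([ToyScaler()]).fit([2.0, 4.0, 8.0])  # train
print(model.transform([4.0, 8.0]))                       # predict → [0.5, 1.0]
```

The real pyspark.ml Pipeline follows the same contract over DataFrames: fit() returns a PipelineModel whose transform() appends prediction columns.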
[Diagram: Train: Input DataFrame → Pipeline → fit() → exported Model; Predict: Input DataFrame → Pipeline Model → transform() → Output DataFrame]
# PySpark ML pipeline (the earlier stage definitions are elided in the original)
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

indexer = …
parser = …
hashingTF = …
vecAssembler = …
rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
→ PageRank
→ Connected components
→ Label propagation
→ SVD++
→ Strongly connected components
→ Triangle count
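As a flavor of what these graph algorithms compute, here is a minimal pure-Python PageRank by power iteration. This is a conceptual sketch, not the GraphX implementation; the damping factor and iteration count are illustrative defaults:

```python
def pagerank(edges, num_iter=50, d=0.85):
    """Power-iteration PageRank over a directed edge list [(src, dst), ...]."""
    nodes = {n for e in edges for n in e}
    out_links = {n: [dst for src, dst in edges if src == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(num_iter):
        # Every node keeps the teleport share, then receives from its in-links.
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                new[dst] += d * rank[src] / len(targets)
        rank = new
    return rank

# A links to B and C; B links to C – C accumulates the highest rank.
r = pagerank([("A", "B"), ("A", "C"), ("B", "C")])
print(max(r, key=r.get))  # → C
```

GraphX runs the same iteration as a distributed Pregel-style computation over partitioned edge data.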
Web-based notebook that enables interactive data analytics.
[Diagram: a notebook author and collaborators/report viewers share a Zeppelin notebook, which connects to a cluster back end – Spark | Hive | HBase, any of 30+ back ends]
[Diagram: Collect → ETL/Process → Analysis → Data Product]
→ Livy is the open source REST interface for interacting with Apache Spark from anywhere
→ Installed as a Spark Ambari Service
[Diagram: Zeppelin (Shiro/LDAP authentication) → Spark interpreter group → Livy APIs (SPNego: Kerberos) → Livy Server → Spark driver on YARN (Kerberos)]
→ End-to-end identity propagation
– Send user identity from Zeppelin > Livy > Spark on YARN
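Livy's REST workflow boils down to two POSTs: create a session, then submit statements to it. The `/sessions` and `/sessions/{id}/statements` endpoints and the `proxyUser` field are part of Livy's documented API, but the host, user name, and code string below are assumptions for illustration:

```python
import json

LIVY_URL = "http://livy-server:8998"  # hypothetical host; 8998 is Livy's default port

# 1. Create an interactive PySpark session (POST {LIVY_URL}/sessions).
#    proxyUser is how the propagated identity reaches Spark on YARN.
create_session = {"kind": "pyspark", "proxyUser": "tommy"}

# 2. Run code in the session (POST {LIVY_URL}/sessions/<id>/statements).
run_statement = {"code": "spark.range(100).count()"}

# With the `requests` library installed, the calls would look like:
#   requests.post(f"{LIVY_URL}/sessions", json=create_session)
#   requests.post(f"{LIVY_URL}/sessions/0/statements", json=run_statement)
print(json.dumps(create_session))
```

Polling GET `/sessions/{id}/statements/{stmtId}` then returns the statement's state and output.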
[Diagram: Livy Server session sharing – Client 1 and Client 2 share Session-1 (SparkSession-1 with its SparkContext); Client 3 uses Session-2 (SparkSession-2 with its SparkContext)]
[Diagram: a user (e.g. Tommy Callahan) authenticated via LDAP accesses the HDFS cluster – Data Store: HDFS/Ozone; Column DB: HBase/Cassandra]
[Diagram: IoT Devices → IoT Edge (single node) → Data Center (on prem/cloud)]
Spark 2.x & HDP 2.x
What’s New?
DSX
20k+
Registered Users
45k+
Answers
100k+
Technical Assets