
Spark Streaming

What is Spark Streaming?

• Receive data streams from input sources, process them in a cluster, and push results out to databases/dashboards
• Scalable, fault-tolerant

[Diagram: input sources (Kafka, Flume, Kinesis, HDFS, Twitter) feed Spark Streaming, which pushes results to HDFS, databases, and dashboards]
How does Spark Streaming work?

> Chop up data streams into batches of a few seconds
> Spark treats each batch of data as RDDs and processes them using RDD operations
> Processed results are pushed out in batches
[Diagram: live data streams are received by Spark Streaming, chopped into batches of RDDs, and processed by Spark into results as RDDs]
Spark Streaming Programming Model
> Discretized Stream (DStream)
  - Represents a stream of data
  - Implemented as a sequence of RDDs

> DStreams API very similar to RDD API
  - Functional APIs in Scala, Java
  - Create input DStreams from different sources
  - Apply parallel operations
Example – Get hashtags from Twitter
val ssc = new StreamingContext(sparkContext, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)

[Diagram: the Twitter Streaming API feeds the tweets input DStream, which is divided into batches (batch @ t, t+1, t+2) stored in memory as RDDs]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream

[Diagram: flatMap is applied to every batch of the tweets DStream to produce the transformed hashTags DStream; new RDDs (e.g. [#cat, #dog, …]) are created for every batch]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: to push data to external storage

[Diagram: each batch of the hashTags DStream is saved to HDFS via the save output operation]
Transformations

• Stateless transformations
  • The processing of each batch does not depend on the data of its previous batches.
  • Examples: map(), filter(), reduceByKey(), etc.

• Stateful transformations
  • Use data or intermediate results from previous batches to compute the results of the current batch.
  • They include transformations based on sliding windows and on tracking state across time.
  • Example: updateStateByKey() (see the sketch below)
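A minimal sketch of updateStateByKey, assuming the hashTags DStream from the Twitter examples; the checkpoint directory path is an assumption, and checkpointing is required for stateful transformations.

// Keep a running count per hashtag across batches (state is checkpointed).
ssc.checkpoint("checkpoint/")   // checkpoint directory path is an assumption

def updateCount(newCounts: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(runningCount.getOrElse(0) + newCounts.sum)

val totalTagCounts = hashTags.map(tag => (tag, 1)).updateStateByKey[Int](updateCount _)
totalTagCounts.print()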
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreachRDD(hashTagRDD => { ... })

foreachRDD: do whatever you want with the processed data

[Diagram: foreachRDD is applied to every batch of the hashTags DStream: write to a database, update an analytics UI, do whatever you want]
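A minimal sketch of such an output operation, assuming the hashTags DStream above; createConnection and sendToDatabase are hypothetical helpers standing in for whatever database client you use.

// Push each batch of hashtags to an external store, one connection per partition.
hashTags.foreachRDD { hashTagRDD =>
  hashTagRDD.foreachPartition { tags =>
    val connection = createConnection()                   // hypothetical helper
    tags.foreach(tag => sendToDatabase(connection, tag))  // hypothetical helper
    connection.close()
  }
}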
Window-based Transformations
val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

[Diagram: a sliding window operation over the DStream, defined by a window length and a sliding interval]
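The same windowed count can also be expressed with reduceByKeyAndWindow over (tag, 1) pairs; a minimal sketch, assuming the hashTags DStream above.

// Count hashtags over a 1-minute window, recomputed every 5 seconds.
val windowedTagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(1), Seconds(5))
windowedTagCounts.print()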
Fault-tolerance
> Batches of input data are replicated in memory for fault-tolerance
> Data lost due to worker failure can be recomputed from the replicated input data
> All transformations are fault-tolerant, and exactly-once transformations

[Diagram: the tweets RDD (input data) is replicated in memory; lost partitions of the hashTags RDD are recomputed from it on other workers via flatMap]
Input Sources
• Out of the box, we provide
  - Kafka, Flume, Kinesis, raw TCP sockets, HDFS, etc.

• Very easy to write a custom receiver
  - Define what to do when the receiver is started and stopped (see the sketch below)

• Also, you can generate your own sequence of RDDs and push them in as a “stream”
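A minimal sketch of a custom receiver, assuming a plain TCP source; the host, port, and read loop are illustrative, not a specific connector.

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical custom receiver: reads lines from a socket and stores them in Spark.
class LineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a thread that receives data over the connection.
    new Thread("Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do: receive() exits when the socket closes or the receiver is stopped.
  }

  private def receive(): Unit = {
    val socket = new Socket(host, port)
    val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
    var line = reader.readLine()
    while (!isStopped() && line != null) {
      store(line)              // hand the record to Spark Streaming
      line = reader.readLine()
    }
    reader.close()
    socket.close()
  }
}

// Usage: val lines = ssc.receiverStream(new LineReceiver("localhost", 9999))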
Output Sinks

• HDFS, S3, etc. (Hadoop API-compatible filesystems)

• Cassandra (using the Spark-Cassandra connector)

• HBase (integrated support coming to Spark soon)

• Directly push the data anywhere


Example:

import org.apache.spark._
import org.apache.spark.streaming._

// Create a StreamingContext with a 5-second batch interval from the existing SparkContext.
val ssc = new StreamingContext(sc, Seconds(5))
// Receive lines of text over a TCP socket.
val lines = ssc.socketTextStream("localhost", 7777)
val words = lines.flatMap(_.split(" "))
val wc = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
wc.print()
ssc.start()
ssc.awaitTermination()

In another terminal, feed the socket with netcat:
> nc -lk 7777
> Ram hai sita ram gita ram
Machine Learning with MLlib

• Designed to run in parallel on clusters.


• MLlib contains a variety of learning algorithms and is accessible from
all of Spark’s programming languages.
• Example:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD

points = ...  # create an RDD of LabeledPoint
model = LinearRegressionWithSGD.train(points, iterations=200, intercept=True)
print("weights: %s, intercept: %s" % (model.weights, model.intercept))
Steps that occur when you run a Spark
application on a cluster:
• The user submits an application using spark-submit.
• spark-submit launches the driver program and invokes the main()
method specified by the user.
• The driver program contacts the cluster manager to ask for resources to
launch executors.
• The cluster manager launches executors on behalf of the driver
program.
• The driver process runs through the user application. Based on the RDD
actions and transformations in the program, the driver sends work to
executors in the form of tasks.
• Tasks are run on executor processes to compute and save results.
• If the driver’s main() method exits or it calls SparkContext.stop(), it will terminate the executors and release resources from the cluster manager.
Deploying Applications with spark-submit

• Running spark-submit with no flags runs the application locally
bin/spark-submit my_script.py

• spark-submit running on Spark’s Standalone mode
bin/spark-submit --master spark://host:7077 my_script.py

• spark-submit running on YARN
bin/spark-submit --master yarn my_script.py

Example:
bin/spark-submit --master local --class WordCount wordCount.jar
Possible values for the --master flag in spark-submit

Value               Explanation
spark://host:port   Connect to a Spark Standalone cluster at the specified port. By default Spark Standalone masters use port 7077.
mesos://host:port   Connect to a Mesos cluster master at the specified port. By default Mesos masters listen on port 5050.
yarn                Connect to a YARN cluster. When running on YARN you’ll need to set the HADOOP_CONF_DIR environment variable to point to the location of your Hadoop configuration directory, which contains information about the cluster.
local               Run in local mode with a single core.
local[N]            Run in local mode with N cores.
local[*]            Run in local mode and use as many cores as the machine has.
Common flags for spark-submit

Flag                Explanation

--master            Indicates the cluster manager to connect to.
--class             The “main” class of your application if you’re running a Java or Scala program.
--executor-memory   The amount of memory to use for executors, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
--driver-memory     The amount of memory to use for the driver process, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
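For example, a submission combining these flags might look like the following; the script name and memory sizes are illustrative:

bin/spark-submit --master yarn --executor-memory 2g --driver-memory 1g my_script.py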
