
Spark Streaming

What is Spark Streaming?

• Receive data streams from input sources, process them in a cluster, and push results out to databases/dashboards
• Scalable, fault-tolerant

[Diagram: input sources (Kafka, Flume, Kinesis, HDFS, Twitter) feed Spark Streaming, which pushes results to HDFS, databases, and dashboards]
How does Spark Streaming work?

> Chop up data streams into batches of a few seconds
> Spark treats each batch of data as RDDs and processes them using RDD operations
> Processed results are pushed out in batches
[Diagram: live data streams are received by Spark Streaming, chopped into batches of RDDs, and processed by Spark into results as RDDs]
Spark Streaming Programming Model
> Discretized Stream (DStream)
  - Represents a stream of data
  - Implemented as a sequence of RDDs

> DStreams API very similar to RDD API
  - Functional APIs in Scala, Java
  - Create input DStreams from different sources
  - Apply parallel operations
Example – Get hashtags from Twitter
val ssc = new StreamingContext(sparkContext, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)

[Diagram: the Twitter Streaming API feeds the tweets input DStream, which is divided into batches (batch @ t, t+1, t+2) stored in memory as RDDs]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream

[Diagram: flatMap is applied to every batch of the tweets DStream to produce the transformed hashTags DStream; new RDDs (e.g. [#cat, #dog, …]) are created for every batch]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: to push data to external storage

[Diagram: each batch of the hashTags DStream is saved to HDFS via the save output operation]
Transformations

• Stateless transformations
  • The processing of each batch does not depend on the data of its previous batches.
  • Examples: map(), filter(), reduceByKey(), etc.

• Stateful transformations
  • Use data or intermediate results from previous batches to compute the results of the current batch.
  • They include transformations based on sliding windows and on tracking state across time.
  • Example: updateStateByKey() (see the sketch below)
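A minimal sketch of updateStateByKey, assuming the hashTags DStream from the Twitter examples; the checkpoint directory path is an assumption, and checkpointing is required for stateful transformations.

// Keep a running count per hashtag across batches (state is checkpointed).
ssc.checkpoint("checkpoint/")   // checkpoint directory path is an assumption

def updateCount(newCounts: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(runningCount.getOrElse(0) + newCounts.sum)

val totalTagCounts = hashTags.map(tag => (tag, 1)).updateStateByKey[Int](updateCount _)
totalTagCounts.print()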
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreachRDD(hashTagRDD => { ... })

foreachRDD: do whatever you want with the processed data

[Diagram: foreachRDD is applied to every batch of the hashTags DStream: write to a database, update an analytics UI, do whatever you want]
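A minimal sketch of such an output operation, assuming the hashTags DStream above; createConnection and sendToDatabase are hypothetical helpers standing in for whatever database client you use.

// Push each batch of hashtags to an external store, one connection per partition.
hashTags.foreachRDD { hashTagRDD =>
  hashTagRDD.foreachPartition { tags =>
    val connection = createConnection()                   // hypothetical helper
    tags.foreach(tag => sendToDatabase(connection, tag))  // hypothetical helper
    connection.close()
  }
}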
Window-based Transformations
val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

[Diagram: a sliding window operation over the DStream, defined by a window length and a sliding interval]
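The same windowed count can also be expressed with reduceByKeyAndWindow over (tag, 1) pairs; a minimal sketch, assuming the hashTags DStream above.

// Count hashtags over a 1-minute window, recomputed every 5 seconds.
val windowedTagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(1), Seconds(5))
windowedTagCounts.print()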
Fault-tolerance
> Batches of input data are replicated in memory for fault-tolerance
> Data lost due to worker failure can be recomputed from the replicated input data
> All transformations are fault-tolerant, and exactly-once transformations

[Diagram: the tweets RDD (input data) is replicated in memory; lost partitions of the hashTags RDD are recomputed from it on other workers via flatMap]
Input Sources
• Out of the box, we provide
  - Kafka, Flume, Kinesis, raw TCP sockets, HDFS, etc.

• Very easy to write a custom receiver
  - Define what to do when the receiver is started and stopped (see the sketch below)

• Also, you can generate your own sequence of RDDs and push them in as a “stream”
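A minimal sketch of a custom receiver, assuming a plain TCP source; the host, port, and read loop are illustrative, not a specific connector.

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical custom receiver: reads lines from a socket and stores them in Spark.
class LineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a thread that receives data over the connection.
    new Thread("Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do: receive() exits when the socket closes or the receiver is stopped.
  }

  private def receive(): Unit = {
    val socket = new Socket(host, port)
    val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
    var line = reader.readLine()
    while (!isStopped() && line != null) {
      store(line)              // hand the record to Spark Streaming
      line = reader.readLine()
    }
    reader.close()
    socket.close()
  }
}

// Usage: val lines = ssc.receiverStream(new LineReceiver("localhost", 9999))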
Output Sinks

• HDFS, S3, etc. (Hadoop API-compatible filesystems)

• Cassandra (using the Spark-Cassandra connector)

• HBase (integrated support coming to Spark soon)

• Directly push the data anywhere


Example:

import org.apache.spark._
import org.apache.spark.streaming._

// Create a StreamingContext with a 5-second batch interval from the existing SparkContext.
val ssc = new StreamingContext(sc, Seconds(5))
// Receive lines of text over a TCP socket.
val lines = ssc.socketTextStream("localhost", 7777)
val words = lines.flatMap(_.split(" "))
val wc = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
wc.print()
ssc.start()
ssc.awaitTermination()

In another terminal, feed the socket with netcat:
> nc -lk 7777
> Ram hai sita ram gita ram
Machine Learning with MLlib

• Designed to run in parallel on clusters.


• MLlib contains a variety of learning algorithms and is accessible from
all of Spark’s programming languages.
• Example:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD

points = ...  # create an RDD of LabeledPoint
model = LinearRegressionWithSGD.train(points, iterations=200, intercept=True)
print("weights: %s, intercept: %s" % (model.weights, model.intercept))
Steps that occur when you run a Spark
application on a cluster:
• The user submits an application using spark-submit.
• spark-submit launches the driver program and invokes the main()
method specified by the user.
• The driver program contacts the cluster manager to ask for resources to
launch executors.
• The cluster manager launches executors on behalf of the driver
program.
• The driver process runs through the user application. Based on the RDD
actions and transformations in the program, the driver sends work to
executors in the form of tasks.
• Tasks are run on executor processes to compute and save results.
• If the driver’s main() method exits or it calls SparkContext.stop(), it will terminate the executors and release resources from the cluster manager.
Deploying Applications with spark-submit

• Running spark-submit with no flags runs the application locally
bin/spark-submit my_script.py

• spark-submit running on Spark’s Standalone mode
bin/spark-submit --master spark://host:7077 my_script.py

• spark-submit running on YARN
bin/spark-submit --master yarn my_script.py

Example:
bin/spark-submit --master local --class WordCount wordCount.jar
Possible values for the --master flag in spark-submit

Value               Explanation
spark://host:port   Connect to a Spark Standalone cluster at the specified port. By default Spark Standalone masters use port 7077.
mesos://host:port   Connect to a Mesos cluster master at the specified port. By default Mesos masters listen on port 5050.
yarn                Connect to a YARN cluster. When running on YARN you’ll need to set the HADOOP_CONF_DIR environment variable to point to the location of your Hadoop configuration directory, which contains information about the cluster.
local               Run in local mode with a single core.
local[N]            Run in local mode with N cores.
local[*]            Run in local mode and use as many cores as the machine has.
Common flags for spark-submit

Flag                Explanation

--master            Indicates the cluster manager to connect to.
--class             The “main” class of your application if you’re running a Java or Scala program.
--executor-memory   The amount of memory to use for executors, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
--driver-memory     The amount of memory to use for the driver process, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
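For example, a submission combining these flags might look like the following; the script name and memory sizes are illustrative:

bin/spark-submit --master yarn --executor-memory 2g --driver-memory 1g my_script.py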
