[Diagram: input sources such as Kafka, Flume, Kinesis, Twitter, and HDFS feed Spark Streaming, which writes results to HDFS, databases, and dashboards.]
How does Spark Streaming work?
> Incoming data streams are divided into small batches, which are processed as RDDs; the results of each batch are likewise produced as RDDs.
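The micro-batch model above can be sketched in plain Scala (no Spark required); the object and method names below are illustrative assumptions, not Spark API:

```scala
// Minimal plain-Scala sketch of micro-batching (illustrative only, not
// Spark API): the stream is sliced into small batches, and the same
// per-batch computation runs on each batch, yielding one result per batch.
object MicroBatchSketch {
  // Slice a stream of records into fixed-size batches. Spark Streaming
  // slices by time interval; we slice by count purely for illustration.
  def toBatches[A](stream: Seq[A], batchSize: Int): Seq[Seq[A]] =
    stream.grouped(batchSize).toSeq

  // Apply the same computation to every batch, as Spark Streaming applies
  // a DStream's transformations to each underlying RDD.
  def processEachBatch[A, B](batches: Seq[Seq[A]])(f: Seq[A] => B): Seq[B] =
    batches.map(f)
}
```

For example, slicing Seq(1, 2, 3, 4, 5) into batches of 2 and summing each batch yields one result per batch, mirroring how each RDD of a DStream produces a result RDD.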
Spark Streaming Programming Model
> Discretized Stream (DStream)
- Represents a continuous stream of data
- Implemented as a sequence of RDDs
[Diagram: the input DStream (e.g. the tweets DStream) is stored in memory as a sequence of RDDs, one per batch interval.]
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
[Diagram: flatMap turns each batch of the tweets DStream into a batch of the hashTags DStream; every resulting batch can then be saved to HDFS.]
Transformations
• Stateless transformations
• The processing of each batch does not depend on the data of its previous batches.
• Examples: map(), filter(), reduceByKey(), etc.
• Stateful transformations
• Use data or intermediate results from previous batches to compute the results of the current batch.
• They include transformations based on sliding windows and on tracking state across time.
• Example: updateStateByKey()
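The update function passed to updateStateByKey() is ordinary Scala and can be sketched on its own; the name updateRunningCount and the Int running count below are illustrative assumptions:

```scala
// For each key, updateStateByKey calls a function like this with the new
// values from the current batch and the previous state for that key;
// returning None would drop the key's state. (Name and types illustrative.)
def updateRunningCount(newValues: Seq[Int],
                       runningCount: Option[Int]): Option[Int] =
  Some(runningCount.getOrElse(0) + newValues.sum)

// In a streaming job it would be wired up roughly as:
//   pairs.updateStateByKey[Int](updateRunningCount _)
// which also requires ssc.checkpoint(...) so the state survives failures.
```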
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreachRDD(hashTagRDD => { ... })
[Diagram: foreachRDD applies the supplied function to each RDD (batch) of the hashTags DStream.]
Sliding window
> A window operation is applied over a DStream of data and is defined by two parameters:
- window length: the duration of the window
- sliding interval: the interval at which the window operation is performed
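The interplay of window length and sliding interval can be sketched in plain Scala, counting in batches rather than seconds; slidingWindows below is an illustrative helper, not Spark API (in Spark the analogous call is hashTags.window(Seconds(60), Seconds(10))):

```scala
// Illustrative sketch (not Spark API): given a sequence of micro-batches,
// emit one window every `slide` batches, each window covering `windowLen`
// consecutive batches. Spark expresses both parameters as durations.
def slidingWindows[A](batches: Seq[Seq[A]],
                      windowLen: Int,
                      slide: Int): Seq[Seq[A]] =
  (0 to batches.length - windowLen by slide)
    .map(start => batches.slice(start, start + windowLen).flatten)
```

With four batches, a window length of 2 batches and a slide of 1 batch yields three overlapping windows.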
Fault-tolerance
> Batches of input data (e.g. the tweets RDDs) are replicated in memory for fault-tolerance.
> Data lost due to worker failure can be recomputed from the replicated input data; lost partitions (e.g. of the hashTags RDDs produced by flatMap) are recomputed on other workers.
> All transformations are fault-tolerant and provide exactly-once semantics.
Input Sources
• Out of the box, we provide
- Kafka, Flume, Kinesis, Raw TCP sockets, HDFS, etc.
import org.apache.spark._
import org.apache.spark.streaming._

// Create a StreamingContext with a 5-second batch interval
val ssc = new StreamingContext(sc, Seconds(5))
// Read lines of text from a TCP socket
val lines = ssc.socketTextStream("localhost", 7777)
val words = lines.flatMap(_.split(" "))
val wc = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
// Print the first few counts of each batch
wc.print()
ssc.start()
ssc.awaitTermination()
> $ nc -lk 7777
> Ram hai sita ram gita ram
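What the map/reduceByKey pair computes for one micro-batch of the sample input above can be reproduced in plain Scala (no Spark required):

```scala
// Per-batch word count in plain Scala, mirroring
// words.map(x => (x, 1)).reduceByKey((x, y) => x + y) for a single batch.
// Note that "Ram" and "ram" are distinct keys: no lowercasing is applied.
val batchLine = "Ram hai sita ram gita ram"
val wordCounts: Map[String, Int] =
  batchLine.split(" ")
    .map(w => (w, 1))
    .groupBy(_._1)
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
```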
Machine Learning with MLlib
• Applications are submitted with the spark-submit script:
bin/spark-submit my_script.py
Example:
bin/spark-submit --master local --class WordCount wordCount.jar
Possible values for the --master flag in spark-submit

Value              Explanation
spark://host:port  Connect to a Spark Standalone cluster at the specified port. By default Spark Standalone masters use port 7077.
mesos://host:port  Connect to a Mesos cluster master at the specified port. By default Mesos masters listen on port 5050.
yarn               Connect to a YARN cluster. When running on YARN you'll need to set the HADOOP_CONF_DIR environment variable to point to the location of your Hadoop configuration directory, which contains information about the cluster.
local              Run in local mode with a single core.
local[*]           Run in local mode and use as many cores as the machine has.
Common flags for spark-submit

Flag               Explanation
--class            The "main" class of your application if you're running a Java or Scala program.
--executor-memory  The amount of memory to use for executors, in bytes. Suffixes can be used to specify larger quantities such as "512m" (512 megabytes) or "15g" (15 gigabytes).
--driver-memory    The amount of memory to use for the driver process, in bytes. Suffixes can be used to specify larger quantities such as "512m" (512 megabytes) or "15g" (15 gigabytes).