Shivaram Venkataraman
Zongheng Yang
Fast, Scalable, Interactive
Statistics, Packages, Plots
RDD Actions
count
collect
save
Q: How can I use a loop to [...insert task here...]?
A: Don't. Use one of the apply functions.
From: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
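The advice in the quote can be illustrated with a small base R comparison. This is only a sketch: the task (squaring the elements of a vector) and the variable names are made up for illustration and do not come from the slides.

```r
# Hypothetical task: square each element of a numeric vector.
xs <- c(1, 2, 3, 4)

# Loop version: preallocate the result, then fill it element by element.
squares_loop <- numeric(length(xs))
for (i in seq_along(xs)) {
  squares_loop[i] <- xs[i]^2
}

# Apply version: state the per-element function once and let sapply iterate.
squares_apply <- sapply(xs, function(x) x^2)

# vapply is a stricter variant that checks the type of every result.
squares_vapply <- vapply(xs, function(x) x^2, numeric(1))

# All three approaches produce the same vector.
stopifnot(identical(squares_loop, squares_apply))
stopifnot(identical(squares_loop, squares_vapply))
```

The apply form names no index and no accumulator, which is the shape that carries over to applying a function across the elements of a distributed collection.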
R + RDD =
R2D2
R + RDD = RRDD
lapply
lapplyPartition
groupByKey
reduceByKey
sampleRDD
collect
cache
broadcast
includePackage
textFile
parallelize
Example: Word Count
lines <- textFile(sc, args[[2]])
words <- flatMap(lines, function(line) {
  strsplit(line, " ")[[1]]
})
wordCount <- lapply(words, function(word) {
  list(word, 1L)
})
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
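To see what each step of the word count computes, the same pipeline can be mimicked on one machine in base R. This is a rough local sketch, not the distributed API: the in-memory `lines` vector stands in for `textFile`, and the grouping with `split` and `Reduce` is an assumed stand-in for what `reduceByKey` does per key.

```r
# Local, single-machine sketch of the word-count steps (no Spark involved).
# `lines` is an in-memory character vector standing in for textFile().
lines <- c("a b a", "b c")

# flatMap step: split every line on spaces and flatten the pieces.
words <- unlist(lapply(lines, function(line) strsplit(line, " ")[[1]]))

# lapply step: pair each word with the integer 1.
wordCount <- lapply(words, function(word) list(word, 1L))

# reduceByKey step (emulated): group the values by key, then fold with "+".
keys <- vapply(wordCount, function(kv) kv[[1]], character(1))
vals <- vapply(wordCount, function(kv) kv[[2]], integer(1))
counts <- lapply(split(vals, keys), function(v) Reduce(`+`, v))

print(counts)
```

Here `counts` is a named list with one total per distinct word. In the slide's version the third argument to `reduceByKey` (`2L`) has no local counterpart, so it is left out of this sketch.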
Demo
MNIST
Dataflow
[Diagram. Local: R, Spark Context (R), JNI, Java Spark Context. Worker: Spark Executor, exec, R.]
From http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/
Pipelined RDD
words <- flatMap(lines, ...)
wordCount <- lapply(words, ...)
[Diagram. Two Spark Executors, each paired with an R process through its own exec, followed by the same pair sharing a single exec.]
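One way to picture pipelining is as function composition: consecutive per-partition transformations are fused into one function, so a partition is handed to R once rather than once per transformation. The sketch below is an assumption-laden illustration in base R. The helper `pipeline` and the functions `splitWords` and `pairWithOne` are hypothetical names, not part of the API on the slides.

```r
# Hypothetical helper: fuse two per-partition functions into one.
pipeline <- function(f, g) {
  function(partition) g(f(partition))
}

# Per-partition versions of the two word-count transformations.
splitWords <- function(part) {
  unlist(lapply(part, function(line) strsplit(line, " ")[[1]]))
}
pairWithOne <- function(part) {
  lapply(part, function(word) list(word, 1L))
}

partition <- c("a b", "c")

# Unpipelined: two separate passes with an intermediate result in between.
intermediate <- splitWords(partition)
twoPass <- pairWithOne(intermediate)

# Pipelined: one fused function, applied to the partition in a single call.
fused <- pipeline(splitWords, pairWithOne)
onePass <- fused(partition)

# Both routes give the same (word, 1L) pairs.
stopifnot(identical(twoPass, onePass))
```

The result is the same either way. What changes is how many times the partition crosses the boundary between the executor and R.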
Alpha developer release
Daemon R processes
RDDs as distributed lists
Re-use R packages
Data: RDD[Array[Byte]]
Func: Array[Byte]
Spark Context
rJava
R Shell
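The "Data: RDD[Array[Byte]]" and "Func: Array[Byte]" lines suggest that both R values and R functions travel as byte arrays. Base R's `serialize` and `unserialize` show how that can work in principle. This is a minimal sketch under that assumption, with hypothetical names (`makeAdder`, `addTwo`), and it does not claim to reproduce the project's actual wire format.

```r
# A closure that captures a variable from its defining environment.
makeAdder <- function(n) {
  force(n)  # evaluate n now so the captured value is stored in the closure
  function(x) x + n
}
addTwo <- makeAdder(2)

# serialize(..., connection = NULL) returns a raw vector, i.e. a byte array.
funcBytes <- serialize(addTwo, connection = NULL)
dataBytes <- serialize(list(1, 2, 3), connection = NULL)

# On the receiving side, unserialize restores the function and the data.
restoredFunc <- unserialize(funcBytes)
restoredData <- unserialize(dataBytes)

# The restored closure still carries its captured variable n.
result <- lapply(restoredData, restoredFunc)
```

Because the closure's environment is serialized along with the function body, the captured `n` survives the round trip.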