
Data Science in Spark with sparklyr Cheat Sheet

Intro

sparklyr is an R interface for Apache Spark. It provides a complete dplyr backend and the option to query directly using Spark SQL statements. With sparklyr, you can orchestrate distributed machine learning using either Spark's MLlib or H2O Sparkling Water.

Starting with version 1.044, RStudio Desktop, Server and Pro include integrated support for the sparklyr package. You can create and manage connections to Spark clusters and local Spark instances from inside the IDE.

RStudio Integrates with sparklyr

From the IDE's connections pane you can: open the connection log, disconnect, open the Spark UI, browse Spark & Hive tables, and preview the first 1K rows of a table.

Data Science Toolchain with Spark + sparklyr
(following R for Data Science, Grolemund & Wickham: Import - Tidy - Understand - Communicate)

Import       Export an R DataFrame, read a file, or read an existing Hive table into a Spark DataFrame
Tidy         dplyr verbs, direct Spark SQL (DBI), or SDF functions (Scala API)
Understand   Transform: Transformer functions; Visualize: collect data into R for plotting; Model: Spark MLlib, H2O Extension
Communicate  Collect data into R; share plots, documents, and apps

Getting Started

Local Mode
Easy setup; no cluster required.
1. Install a local version of Spark: spark_install("2.0.1")
2. Open a connection: sc <- spark_connect(master = "local")

On a YARN Managed Cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes, preferably an edge node.
2. Locate the path to the cluster's Spark Home directory; it is normally /usr/lib/spark.
3. Open a connection (a concrete sketch follows this list):
   spark_connect(master = "yarn-client", version = "1.6.2",
                 spark_home = [Cluster's Spark path])

On a Mesos Managed Cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes.
2. Locate the path to the cluster's Spark directory.
3. Open a connection:
   spark_connect(master = "[mesos URL]", version = "1.6.2",
                 spark_home = [Cluster's Spark path])

On a Spark Standalone Cluster
1. Install RStudio Server or RStudio Pro on one of the existing nodes or on a server in the same LAN.
2. Install a local version of Spark: spark_install(version = "2.0.1")
3. Open a connection:
   spark_connect(master = "spark://host:port", version = "2.0.1",
                 spark_home = spark_home_dir())

Using Livy (Experimental)
1. The Livy REST application should be running on the cluster.
2. Connect to the cluster:
   sc <- spark_connect(master = "http://host:port", method = "livy")
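As a concrete version of the YARN connection above, a minimal sketch; it assumes the cluster's Spark home is exported in the SPARK_HOME environment variable (e.g. /usr/lib/spark), which is not part of the original sheet:

library(sparklyr)
# Assumption: SPARK_HOME points at the cluster's Spark directory, e.g. /usr/lib/spark
sc <- spark_connect(master = "yarn-client",
                    version = "1.6.2",
                    spark_home = Sys.getenv("SPARK_HOME"))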
Using sparklyr
A brief example of a data analysis using Apache Spark, R and sparklyr in local mode:

library(sparklyr); library(dplyr); library(ggplot2); library(tidyr)
set.seed(100)

spark_install("2.0.1")                               # Install Spark locally
sc <- spark_connect(master = "local")                # Connect to local version

import_iris <- copy_to(sc, iris, "spark_iris",       # Copy data to Spark memory
                       overwrite = TRUE)

partition_iris <- sdf_partition(                     # Partition data
  import_iris, training = 0.5, testing = 0.5)

sdf_register(partition_iris,                         # Create a Hive metadata entry for each partition
             c("spark_iris_training", "spark_iris_test"))

tidy_iris <- tbl(sc, "spark_iris_training") %>%      # Wrangle
  select(Species, Petal_Length, Petal_Width)

model_iris <- tidy_iris %>%                          # Spark ML decision tree model
  ml_decision_tree(response = "Species",
                   features = c("Petal_Length", "Petal_Width"))

test_iris <- tbl(sc, "spark_iris_test")              # Create a reference to the Spark table

pred_iris <- sdf_predict(model_iris, test_iris) %>%  # Bring data back into R memory for plotting
  collect

pred_iris %>%
  inner_join(data.frame(prediction = 0:2,
                        lab = model_iris$model.parameters$labels)) %>%
  ggplot(aes(Petal_Length, Petal_Width, col = lab)) +
  geom_point()

spark_disconnect(sc)                                 # Disconnect
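A small follow-up sketch, not part of the original example: since pred_iris has already been collected into R, the predictions can be cross-tabulated against the true species, assuming sdf_predict kept the original Species column:

# Follow-up sketch: confusion counts of predicted vs. actual species
# (assumes pred_iris still contains the original Species column)
pred_iris %>%
  inner_join(data.frame(prediction = 0:2,
                        lab = model_iris$model.parameters$labels)) %>%
  count(Species, lab)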
Cluster Deployment

Cluster Deployment Options:
- Managed Cluster: the driver node submits work to the worker nodes through a cluster manager (YARN or Mesos).
- Stand Alone Cluster: the driver node connects directly to the worker nodes.

Tuning Spark

Example Configuration:
config <- spark_config()
config$spark.executor.cores <- 2
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client",
                    config = config, version = "2.0.1")

Important Tuning Parameters (with defaults):
spark.yarn.am.cores
spark.yarn.am.memory              512m
spark.network.timeout             120s
spark.executor.memory             1g
spark.executor.cores              1
spark.executor.extraJavaOptions
spark.executor.instances
spark.executor.heartbeatInterval  10s
sparklyr.shell.executor-memory
sparklyr.shell.driver-memory
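A minimal sketch of setting the sparklyr.shell.* parameters from the list above through spark_config(); the memory values are only illustrative, not recommendations:

# Sketch: sparklyr.shell.* entries are passed to spark-submit when the connection starts
config <- spark_config()
config$`sparklyr.shell.driver-memory`   <- "2G"   # illustrative value
config$`sparklyr.shell.executor-memory` <- "2G"   # illustrative value
sc <- spark_connect(master = "local", config = config, version = "2.0.1")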
Reading & Writing from Apache Spark

- From R into Spark: dplyr::copy_to, sdf_copy_to, DBI::dbWriteTable
- From the file system into Spark: spark_read_<fmt>
- From Spark into R: dplyr::collect, sdf_collect, sdf_read_column
- From Spark into the file system: spark_write_<fmt>
- Work with tables already in Spark: dplyr::tbl, tbl_cache

Import

Copy a DataFrame into Spark:
sdf_copy_to(sc, iris, "spark_iris")
DBI::dbWriteTable(sc, "spark_iris", iris)

sdf_copy_to(sc, x, name, memory, repartition, overwrite)
DBI::dbWriteTable(conn, name, value)

Import into Spark from a File
Arguments that apply to all functions:
sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE

CSV      spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE, delimiter = ",",
                        quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
JSON     spark_read_json()
PARQUET  spark_read_parquet()
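A small sketch of the file import above; the file path and table name are hypothetical:

# Sketch: read a CSV file into Spark and register it as the table "flights"
# (hdfs:///data/flights.csv is a hypothetical path)
flights_tbl <- spark_read_csv(sc, name = "flights",
                              path = "hdfs:///data/flights.csv",
                              header = TRUE, infer_schema = TRUE)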
From a table in Hive
my_var <- tbl_cache(sc, name = "hive_iris")
tbl_cache(sc, name, force = TRUE)   Loads the table into memory

my_var <- dplyr::tbl(sc, name = "hive_iris")
dplyr::tbl(sc, ...)   Creates a reference to the table without loading it into memory

Wrangle

Spark SQL via dplyr verbs
Translates into Spark SQL statements:
my_table <- my_var %>%
  filter(Species == "setosa") %>%
  sample_n(10)

Direct Spark SQL commands
my_table <- DBI::dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
DBI::dbGetQuery(conn, statement)
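A small sketch, not on the original sheet, to inspect the Spark SQL that dplyr generates for a pipeline; my_var is the Hive-backed table from above, and dplyr::show_query() is assumed to be available for remote tables:

# Sketch: print the SQL statement generated for a dplyr pipeline
my_var %>%
  filter(Species == "setosa") %>%
  dplyr::show_query()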
Scala API via SDF functions

sdf_mutate(.data)   Works like the dplyr mutate function
sdf_partition(x, ..., weights = NULL, seed = sample(.Machine$integer.max, 1))
  e.g. sdf_partition(x, training = 0.5, test = 0.5)
sdf_register(x, name = NULL)   Gives a Spark DataFrame a table name
sdf_sample(x, fraction = 1, replacement = TRUE, seed = NULL)
sdf_sort(x, columns)   Sorts by >=1 columns in ascending order
sdf_with_unique_id(x, id = "id")   Adds a unique ID column
sdf_predict(object, newdata)   Returns a Spark DataFrame with predicted values
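A small sketch combining two of the SDF functions above; it assumes the import_iris table from the front-page example, and the sample table name is illustrative:

# Sketch: sample 10% of a Spark DataFrame and register the result as a table
iris_sample <- sdf_sample(import_iris, fraction = 0.1,
                          replacement = FALSE, seed = 100)
sdf_register(iris_sample, "spark_iris_sample")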
ML Transformers

ft_binarizer(my_table, input.col = "Petal_Length", output.col = "petal_large",
             threshold = 1.2)

Arguments that apply to all functions: x, input.col = NULL, output.col = NULL

ft_binarizer(threshold = 0.5)   Assigns values based on a threshold
ft_bucketizer(splits)   Numeric column to discretized column
ft_discrete_cosine_transform(inverse = FALSE)   Time domain to frequency domain
ft_elementwise_product(scaling.col)   Element-wise product between two columns
ft_index_to_string()   Index labels back to labels as strings
ft_one_hot_encoder()   Continuous to binary vectors
ft_quantile_discretizer(n.buckets = 5L)   Continuous to binned categorical values
ft_sql_transformer(sql)
ft_string_indexer(params = NULL)   Column of labels into a column of label indices
ft_vector_assembler()   Combines vectors into a single row-vector
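A small sketch chaining two of the transformers above on the Spark copy of iris from the front-page example; the output column names and the split points are illustrative:

# Sketch: index the Species labels and bucket Petal_Length into three ranges
import_iris %>%
  ft_string_indexer(input.col = "Species", output.col = "species_idx") %>%
  ft_bucketizer(input.col = "Petal_Length", output.col = "petal_bin",
                splits = c(0, 2, 4, 7))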
Visualize & Communicate

Download data to R memory:
r_table <- collect(my_table)
plot(Petal_Width ~ Petal_Length, data = r_table)

dplyr::collect(x)   Downloads a Spark DataFrame to an R DataFrame
sdf_read_column(x, column)   Returns the contents of a single column to R

Save from Spark to File System
Arguments that apply to all functions: x, path

CSV      spark_write_csv(header = TRUE, delimiter = ",", quote = "\"", escape = "\\",
                         charset = "UTF-8", null_value = NULL)
JSON     spark_write_json(mode = NULL)
PARQUET  spark_write_parquet(mode = NULL)
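A small sketch of the write path; the output path is hypothetical and import_iris is the Spark table from the front-page example:

# Sketch: persist a Spark DataFrame as Parquet
# (hdfs:///output/iris_parquet is a hypothetical path)
spark_write_parquet(import_iris, path = "hdfs:///output/iris_parquet")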
Extensions

Create an R package that calls the full Spark API and provides interfaces to Spark packages.

Core Types
spark_connection()   Connection between R and the Spark shell process
spark_jobj()         Instance of a remote Spark object
spark_dataframe()    Instance of a remote Spark DataFrame object

Call Spark from R
invoke()          Call a method on a Java object
invoke_new()      Create a new object by invoking a constructor
invoke_static()   Call a static method on an object

Machine Learning Extensions
ml_create_dummy_variables()
ml_options()
ml_prepare_dataframe()
ml_model()
ml_prepare_response_features_intercept()
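A small sketch of the invoke functions above, calling JVM methods directly from R; it assumes an open connection sc and the import_iris table from the front-page example:

# Sketch: call methods on remote JVM objects from R
count <- import_iris %>%
  spark_dataframe() %>%   # get the underlying Spark DataFrame object
  invoke("count")         # call DataFrame.count()

invoke_static(sc, "java.lang.Math", "hypot", 3, 4)   # static method call, returns 5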
Model (MLlib)

ml_decision_tree(my_table, response = "Species",
                 features = c("Petal_Length", "Petal_Width"))

ml_als_factorization(x, rating.column = "rating", user.column = "user", item.column = "item",
  rank = 10L, regularization.parameter = 0.1, iter.max = 10L, ml.options = ml_options())
ml_decision_tree(x, response, features, max.bins = 32L, max.depth = 5L,
  type = c("auto", "regression", "classification"), ml.options = ml_options())
  Same options for: ml_gradient_boosted_trees
ml_generalized_linear_regression(x, response, features, intercept = TRUE,
  family = gaussian(link = "identity"), iter.max = 100L, ml.options = ml_options())
ml_kmeans(x, centers, iter.max = 100, features = dplyr::tbl_vars(x),
  compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options())
ml_lda(x, features = dplyr::tbl_vars(x), k = length(features), alpha = (50/k) + 1,
  beta = 0.1 + 1, ml.options = ml_options())
ml_linear_regression(x, response, features, intercept = TRUE, alpha = 0, lambda = 0,
  iter.max = 100L, ml.options = ml_options())
  Same options for: ml_logistic_regression
ml_multilayer_perceptron(x, response, features, layers, iter.max = 100,
  seed = sample(.Machine$integer.max, 1), ml.options = ml_options())
ml_naive_bayes(x, response, features, lambda = 0, ml.options = ml_options())
ml_one_vs_rest(x, classifier, response, features, ml.options = ml_options())
ml_pca(x, features = dplyr::tbl_vars(x), ml.options = ml_options())
ml_random_forest(x, response, features, max.bins = 32L, max.depth = 5L, num.trees = 20L,
  type = c("auto", "regression", "classification"), ml.options = ml_options())
ml_survival_regression(x, response, features, intercept = TRUE, censor = "censor",
  iter.max = 100L, ml.options = ml_options())
ml_binary_classification_eval(predicted_tbl_spark, label, score, metric = "areaUnderROC")
ml_classification_eval(predicted_tbl_spark, label, predicted_lbl, metric = "f1")
ml_tree_feature_importance(sc, model)
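A small sketch using ml_kmeans from the list above on the Spark copy of iris from the front-page example; the number of clusters is illustrative:

# Sketch: k-means on the petal measurements of the Spark copy of iris
kmeans_model <- import_iris %>%
  select(Petal_Length, Petal_Width) %>%
  ml_kmeans(centers = 3)

# Score the same table and bring the results back into R
sdf_predict(kmeans_model, import_iris) %>% collect()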
RStudio is a trademark of RStudio, Inc. CC BY RStudio info@rstudio.com 844-448-1212 rstudio.com Learn more at spark.rstudio.com package version 0.5 Updated: 12/21/16
