
Spark Distributed Analytic Framework

Description and Overview
Apache Spark is a fast and general engine for large-scale data processing.
How to Use Spark
Because of its high memory and I/O bandwidth requirements, we recommend
you run your Spark jobs on Cori.
Follow the steps below to use Spark; note that the order of the commands
matters. DO NOT load the spark module until you are inside a batch job.
Interactive mode
Submit an interactive batch job with at least 2 nodes:
salloc -N 2 -t 30
You need to use at least 2 nodes because the driver runs on the head node by
itself and the executors run on all the other nodes (if you would like to change
this behavior, see "Running an Executor on the Same Node as the Driver" below).
Wait for the job to start. Once it does, you will be on a compute node and you
will need to load the spark module:
module load spark
You can start Spark with this command:
start-all.sh
To connect to the Python Spark Shell, do:
pyspark
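Once the shell is up, you can run a quick sanity check. For example (a minimal
sketch; sc is the SparkContext that the pyspark shell creates for you):

>>> # square the integers 0..999 in parallel, then sum the results
>>> sc.parallelize(range(1000)).map(lambda x: x * x).sum()
332833500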
To connect to the Scala Spark Shell, do:
spark-shell
To shut down the Spark cluster, do:
stop-all.sh

Batch mode
Below are example batch scripts for Cori and Edison. You can change the number
of nodes/time/queue accordingly (so long as the number of nodes is greater than
1). On Cori you can use the debug queue for short debugging jobs and the
regular queue for long jobs.
Here's an example script for Cori called run.sl:
#!/bin/bash

#SBATCH -p regular
#SBATCH -N 2
#SBATCH -t 00:30:00
#SBATCH -e mysparkjob_%j.err
#SBATCH -o mysparkjob_%j.out

module load spark
start-all.sh
spark-submit $SPARK_EXAMPLES/python/pi.py
stop-all.sh
To submit the job:
sbatch run.sl
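You are not limited to the bundled pi.py example; any self-contained PySpark
script can be submitted the same way. Here is a minimal word-count sketch (the
file name myapp.py and the input path mydata.txt are hypothetical placeholders):

from pyspark import SparkContext

# spark-submit supplies the master URL, so only an application name is needed here
sc = SparkContext(appName="WordCountSketch")

# count the occurrences of each word in a text file (placeholder path)
counts = (sc.textFile("mydata.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
sc.stop()

To run it, replace the spark-submit line in the batch script above with:
spark-submit myapp.py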
Running an Executor on the Same Node as the Driver
If you would like one of the executors to run on the same node as the driver
(which lets you use Spark on a single node, or, when using multiple nodes, gives
you as many executors as nodes instead of one fewer), set this variable before
loading the spark module:
export SPARK_CLUSTER_MODE=ON
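For example, the beginning of a batch script using this mode would look like
this (a sketch of the required ordering, not a complete script):

export SPARK_CLUSTER_MODE=ON   # must be set before the module load
module load spark
start-all.sh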
Monitoring Your Spark Application
Running the History Server
The history server allows you to visualize the information provided by the event
logs in a nice interactive web interface. Here are instructions for running it on Cori
(note that the history server is independent of any running Spark job, so there is
no need to start a Spark job to run the server):
Run the following commands on a login node (do not run them on a compute node):
module load spark/hist-server
run_history_server.sh

This command will return a URL that will look something like
this: http://120.44.234.30:18080
Open the returned address in a browser on your local machine.
Alternatively, if you are in an NX session or an X11-forwarded SSH session, you
can enter
firefox
which will open a Firefox browser from the login node. From there, enter
"localhost:18080" as the URL to see the history server.
Initially, the page will display "No completed applications found!" until all logs
are processed. This processing can take anywhere from a minute to ten minutes,
depending on how many event logs you have accumulated.
To get a quicker turnaround time, consider using an event logs directory with
fewer event logs (if possible!).
Make sure to stop the server when you are done. To stop the history server:
run_history_server.sh --stop
Note that you must be on the same login node where you started the history
server in order to stop it.
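The history server only has something to display if your jobs write event logs.
If your jobs do not already produce them, you can turn on Spark's standard
event-logging settings at submit time (a sketch; the log directory shown is a
placeholder, and the spark module may already configure event logging for you):

# the event log directory must exist before Spark writes to it
mkdir -p $SCRATCH/spark/eventlogs
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=$SCRATCH/spark/eventlogs \
  $SPARK_EXAMPLES/python/pi.py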

Troubleshooting
Module Load Errors
A successful module load spark has four steps and looks like this:
Creating Directory SPARK_WORKER_DIR /global/cscratch1/sd/racah/spark/1054167
Creating /global/cscratch1/sd/racah/spark/1054167/slaves file
Determining the master node name...
Master node is nid00092
Module load error outputs look like this:
spark/1.6.0(137):ERROR:102: Tcl command execution failed: if { [ module-info mode load ] } {
puts stderr "Creating Directory SPARK_WORKER_DIR $env(SPARK_WORKER_DIR)"
puts stderr "Creating $env(SPARK_WORKER_DIR)/slaves file"
puts stderr "Determining the master node name..."
set master [exec $root/myfindmaster.sh]
puts stderr "Master node is $master"
exec /bin/mkdir -p $env(SPARK_WORKER_DIR)
exec $root/myfindslaves.sh $master $env(SPARK_WORKER_DIR)/slaves
setenv SPARKURL spark://$master:7077
setenv SPARKMASTER $master
}
If the module load error comes after
Master node is nid00092
or after
Determining the master node name...
then check your ~/.bashrc.ext and ~/.bash_profile.ext files to make sure they
produce no errors upon login like:
ModuleCmd_Load.c(226):ERROR:105: Unable to locate a modulefile for
If you receive this error:
SLURM_JOBID not set, please run this module inside of a batch job
then you have called module load either on a login node or on a compute node
other than the initial one you were placed on when your job started.
Memory Limits
Sometimes jobs can start with incorrect user memory limits. You can add
"ulimit -s unlimited" to your .bashrc.ext file to avoid this.
Further Spark documentation is available from the Apache Spark web page.
Availability
Spark is available on Edison and Cori. However, we recommend you run on Cori
because of its larger-memory nodes and faster scratch connection.
All entries below are the package "Spark" in category applications/debugging.

On Cori:

Version          Module                 Install Date  Date Made Default  Description
1.5-instru       spark/1.5-instru       2015-12-04    -                  Spark data analytic framework
1.5.1            spark/1.5.1            2016-07-18    -                  Spark data analytic framework
1.5.1-mkl        spark/1.5.1-mkl        2016-03-25    -                  Spark data analytic framework
1.5.1-sc-instru  spark/1.5.1-sc-instru  2016-03-30    -                  Spark data analytic framework
1.6.0            spark/1.6.0            2016-07-18    -                  Spark data analytic framework
2.0.0            spark/2.0.0            2016-08-17    -                  Spark data analytic framework
hist-server      spark/hist-server      2016-02-01    -                  Spark history server for monitoring jobs after the fact

On Edison:

Version        Module               Install Date  Date Made Default  Description
1.0.0          spark/1.0.0          2014-07-10    -                  Spark data analytic framework
1.0.2          spark/1.0.2          2014-08-16    2014-08-16         Spark data analytic framework
1.1.0          spark/1.1.0          2014-09-21    2014-10-31         Spark data analytic framework
1.1.0-shm      spark/1.1.0-shm      2014-10-28    -                  Spark data analytic framework
1.2.1          spark/1.2.1          2015-02-17    2015-02-18         Spark data analytic framework
1.2.1-breeze   spark/1.2.1-breeze   2015-03-13    -                  Spark data analytic framework
1.2.1-scratch  spark/1.2.1-scratch  2015-03-31    2015-04-02         Spark data analytic framework
1.3.1          spark/1.3.1          2015-04-27    2015-05-15         Spark data analytic framework
1.3.1-scratch  spark/1.3.1-scratch  2015-08-12    -                  Spark data analytic framework
1.4.1          spark/1.4.1          2015-08-05    2015-08-11         Spark data analytic framework
1.5-rc1-inst   spark/1.5-rc1-inst   2015-09-17    -                  Spark data analytic framework
1.5.0          spark/1.5.0          2015-09-25    2015-11-13         Spark data analytic framework
scratch        spark/scratch        2015-03-30    -                  Spark data analytic framework
hist-server    spark/hist-server    2016-06-28    -                  Spark history server for monitoring jobs after the fact