Description and Overview
Apache Spark is a fast and general engine for large-scale data processing.
How to Use Spark
Because of its high memory and I/O bandwidth requirements, we recommend
you run your Spark jobs on Cori.
Follow the steps below to use Spark; note that the order of the commands
matters. DO NOT load the spark module until you are inside a batch job.
Interactive mode
Submit an interactive batch job with at least 2 nodes:
salloc -N 2 -t 30
You need to use at least 2 nodes because the driver runs on the head node by
itself and the executors run on all the other nodes (if you would like to change
this behavior, see "Running an Executor on the Same Node as the Driver" below).
Wait for the job to start. Once it does you will be on a compute node and you will
need to load the spark module:
module load spark
You can start Spark with this command:
start-all.sh
To connect to the Python Spark Shell, do:
pyspark
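Once the Python shell is up, a quick smoke test is to run a small computation; the shell predefines the SparkContext as sc. A sample session might look like:

```
>>> rdd = sc.parallelize(range(100))
>>> rdd.sum()
4950
```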
To connect to the Scala Spark Shell, do:
spark-shell
To shut down the Spark cluster, do:
stop-all.sh
Batch mode
Below are example batch scripts for Cori and Edison. You can change the number
of nodes/time/queue accordingly (as long as the number of nodes is greater than
1). On Cori you can use the debug queue for short debugging jobs and the
regular queue for long jobs.
Here's an example script for Cori called run.sl:
#!/bin/bash
#SBATCH -p regular
#SBATCH -N 2
#SBATCH -t 00:30:00
#SBATCH -e mysparkjob_%j.err
#SBATCH -o mysparkjob_%j.out

module load spark
start-all.sh
spark-submit $SPARK_EXAMPLES/python/pi.py
stop-all.sh
To submit the job:
sbatch run.sl
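The bundled pi.py example estimates pi by Monte Carlo sampling: it scatters random points over the unit square and counts the fraction that land inside the quarter circle. A plain-Python sketch of that idea (not the Spark script itself; the function name and sample count here are illustrative):

```python
import random

def estimate_pi(num_samples=100000, seed=42):
    """Estimate pi: the probability that a uniform random point in the
    unit square lies inside the quarter circle is pi/4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi())
```

The Spark version distributes the sampling loop across the executors and sums the per-partition counts, which is why it makes a good first test of a working cluster.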
Running an Executor on the Same Node as the Driver
If you would like one of the executors to run on the same node as the driver
(which lets you use Spark on a single node, or, when using multiple nodes, gives
you as many executors as nodes instead of one fewer), set this variable before
loading the spark module:
export SPARK_CLUSTER_MODE=ON
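With SPARK_CLUSTER_MODE=ON, a single-node batch job becomes possible. A sketch of how the Cori batch script above would change (queue and filenames are the same illustrative choices as before):

```shell
#!/bin/bash
#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 00:30:00
#SBATCH -e mysparkjob_%j.err
#SBATCH -o mysparkjob_%j.out

# Must be exported before the module load, since the order of commands matters
export SPARK_CLUSTER_MODE=ON
module load spark
start-all.sh
spark-submit $SPARK_EXAMPLES/python/pi.py
stop-all.sh
```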
Monitoring Your Spark Application
Running the History Server
The history server allows you to visualize the information provided by the event
logs in a nice interactive web interface. Here are instructions to run it on Cori
(note the history server is independent of running a Spark job, so there is no
need to start a Spark job to run the server):
Run the following commands on a login node (do not run them on a compute node):
module load spark/hist-server
run_history_server.sh
This command will return a url that will look something like
this: http://120.44.234.30:18080
Go to the address returned in your browser on your local machine.
Alternatively, if you are in an NX session or an X11-forwarded SSH session, you
can enter
firefox
which will open a Firefox browser from the login node. From there, enter
"localhost:18080" as the URL to see the history server.
Initially, the page will display "No completed applications found!" until all logs
are processed. This processing can take anywhere from a minute to ten minutes
depending on how many event logs you have accumulated.
To get a quicker turnaround time, consider using an event logs directory with
fewer event logs (if possible!).
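The history server can only display applications that wrote event logs. The Spark modules are typically configured to do this for you, but if your runs do not show up, event logging is controlled by two standard Spark properties in spark-defaults.conf (the directory below is purely illustrative; point it at the directory your history server reads):

```
spark.eventLog.enabled  true
spark.eventLog.dir      file:///global/cscratch1/sd/<user>/spark/eventlogs
```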
Make sure to stop the server when you are done. To stop the history server:
run_history_server.sh --stop
Note that you must be on the same login node where you started the history
server in order to stop it.
Troubleshooting
Module Load Errors
A successful module load spark has four steps and looks like this:
Creating Directory SPARK_WORKER_DIR
/global/cscratch1/sd/racah/spark/1054167
Creating /global/cscratch1/sd/racah/spark/1054167/slaves file
Determining the master node name...
Master node is nid00092
Module load error outputs look like this:
spark/1.6.0(137):ERROR:102: Tcl command execution failed: if { [ module-info
mode load ] } {
puts stderr "Creating Directory SPARK_WORKER_DIR $env(SPARK_WORKER_DIR)"
puts stderr "Creating $env(SPARK_WORKER_DIR)/slaves file"
puts stderr "Determining the master node name..."
set master [exec $root/myfindmaster.sh]
puts stderr "Master node is $master"
exec /bin/mkdir -p $env(SPARK_WORKER_DIR)
exec $root/myfindslaves.sh $master $env(SPARK_WORKER_DIR)/slaves
setenv SPARKURL spark://$master:7077
setenv SPARKMASTER $master
}
If the module load error comes after
Availability
All modules below belong to the Spark package in the applications/debugging
category.

Cori:

Version          Module                 Install Date  Date Made Default
1.5-instru       spark/1.5-instru       2015-12-04
1.5.1            spark/1.5.1            2016-07-18
1.5.1-mkl        spark/1.5.1-mkl        2016-03-25
1.5.1-sc-instru  spark/1.5.1-sc-instru  2016-03-30
1.6.0            spark/1.6.0            2016-07-18
2.0.0            spark/2.0.0            2016-08-17
hist-server      spark/hist-server      2016-02-01

Edison:

Version        Module               Install Date  Date Made Default
1.0.0          spark/1.0.0          2014-07-10
1.0.2          spark/1.0.2          2014-08-16    2014-08-16
1.1.0          spark/1.1.0          2014-09-21    2014-10-31
1.1.0-shm      spark/1.1.0-shm      2014-10-28
1.2.1          spark/1.2.1          2015-02-17    2015-02-18
1.2.1-scratch  spark/1.2.1-scratch  2015-03-31    2015-04-02
1.3.1          spark/1.3.1          2015-04-27    2015-05-15
1.3.1-scratch  spark/1.3.1-scratch  2015-08-12
1.4.1          spark/1.4.1          2015-08-05    2015-08-11
1.5-rc1-inst   spark/1.5-rc1-inst   2015-09-17
1.5.0          spark/1.5.0          2015-09-25    2015-11-13
hist-server    spark/hist-server    2016-06-28
scratch        spark/scratch        2015-03-30