For more details visit this link

https://www.gangboard.com/big-data-training/apache-spark-training
• Apache Spark is a cluster computing technology designed for fast
computation. It builds on the Hadoop MapReduce model and extends it to
efficiently support more types of computation, including interactive
queries and stream processing. Spark's key feature is in-memory cluster
computing, which increases the processing speed of an application.

• Spark is designed to cover a wide range of workloads, such as batch
applications, iterative algorithms, interactive queries, and streaming.
Besides supporting all of these workloads in a single system, it reduces
the administrative burden of maintaining separate tools.
Spark began in 2009 as a research project at UC Berkeley's AMPLab, started
by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated
to the Apache Software Foundation in 2013, and has been a top-level Apache
project since February 2014.
Apache Spark has the following features.

• Speed: Spark can run an application on a Hadoop cluster up to 100 times
faster than Hadoop MapReduce in memory, and up to 10 times faster on disk.
It achieves this by reducing the number of disk read/write operations and
keeping intermediate processing data in memory.

• Supports multiple languages: Spark provides built-in APIs in Java, Scala,
and Python, so you can write applications in any of these languages. Spark
also comes with over 80 high-level operators for interactive querying.
• Advanced analytics: Spark supports more than just "Map" and "Reduce". It
also supports SQL queries, streaming data, machine learning (ML), and
graph algorithms.
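
The speed feature above comes from keeping intermediate results in memory instead of re-reading them from disk. The following is a minimal single-machine sketch of that idea in plain Python. It does not use the Spark API: `ToyDataset`, `read_source`, and the `reads` counter are hypothetical names invented for illustration, and the "disk read" is only simulated.

```python
from functools import reduce


class ToyDataset:
    """Toy, single-machine stand-in for a lazily evaluated dataset.

    Hypothetical class for illustration only; it is not part of Spark.
    """

    def __init__(self, compute):
        self._compute = compute    # zero-argument function returning a list
        self._use_cache = False    # set by cache()
        self._cached = None        # materialized data, once computed

    def map(self, fn):
        # Nothing runs yet: we only describe how to derive the new dataset.
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return ToyDataset(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        # Ask for the result to be kept in memory after the first computation.
        self._use_cache = True
        return self

    def collect(self):
        if not self._use_cache:
            return self._compute()
        if self._cached is None:
            self._cached = self._compute()
        return self._cached

    def reduce(self, fn):
        return reduce(fn, self.collect())


reads = {"count": 0}


def read_source():
    """Pretend to read from disk and count how often that happens."""
    reads["count"] += 1
    return list(range(1, 11))


# The intermediate dataset is cached, so two later jobs share one source read.
squares = ToyDataset(read_source).map(lambda x: x * x).cache()
total = squares.reduce(lambda a, b: a + b)
evens = squares.filter(lambda x: x % 2 == 0).collect()
```

Without the `cache()` call, each of the two jobs would trigger its own call to `read_source`, which is the repeated disk traffic that in-memory processing is meant to avoid.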

The following illustration shows the different components of Spark.

• Apache Spark Core
Spark Core is the underlying general execution engine for the Spark
platform, on which all other functionality is built. It provides in-memory
computing and can reference datasets in external storage systems.

• Spark SQL
Spark SQL is a component on top of Spark Core that introduced a data
abstraction called SchemaRDD (renamed DataFrame as of Spark 1.3), which
provides support for structured and semi-structured data.
• Spark Streaming
Spark Streaming uses Spark Core's fast scheduling capability to perform
streaming analytics. It ingests data in mini-batches and performs Resilient
Distributed Dataset (RDD) transformations on those mini-batches.
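
The mini-batch idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Spark Streaming code: the helper names `mini_batches` and `count_words` are hypothetical, and batches are cut by item count for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def mini_batches(stream, batch_size):
    """Yield successive lists of up to batch_size items from an iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_words(batch):
    """Batch-style transformation: word counts for one mini-batch of lines."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


# A finite list stands in for an unbounded stream of text lines.
lines = ["spark streaming", "spark core", "spark sql", "mllib", "graphx spark"]

# The same batch transformation is applied to every mini-batch in turn.
results = [count_words(batch) for batch in mini_batches(lines, 2)]
```

The point of the design is that `count_words` knows nothing about streaming. The stream is reduced to a sequence of small batch jobs, so ordinary batch transformations can be reused unchanged.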

• MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark that
benefits from Spark's distributed, memory-based architecture. According to
benchmarks run by the MLlib developers on Alternating Least Squares (ALS)
implementations, Spark MLlib is nine times as fast as the Hadoop disk-based
version of Apache Mahout (before Mahout gained a Spark interface).
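
For readers unfamiliar with Alternating Least Squares, here is a small plain-Python sketch of the idea. It is not MLlib's implementation: it runs on one machine, uses a single latent factor per user and per item, and the function name `als_rank1`, the regularization value, and the toy ratings are all assumptions made for illustration.

```python
def als_rank1(ratings, n_users, n_items, iterations=20, reg=0.01):
    """Fit ratings[(u, i)] ~ user[u] * item[i] by alternating least squares.

    ratings maps (user index, item index) to an observed rating.
    """
    user = [1.0] * n_users
    item = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed; each user factor has a closed-form solution.
        for u in range(n_users):
            seen = [(i, r) for (v, i), r in ratings.items() if v == u]
            num = sum(r * item[i] for i, r in seen)
            den = reg + sum(item[i] ** 2 for i, r in seen)
            user[u] = num / den
        # Hold user factors fixed; solve for each item factor the same way.
        for i in range(n_items):
            seen = [(u, r) for (u, j), r in ratings.items() if j == i]
            num = sum(r * user[u] for u, r in seen)
            den = reg + sum(user[u] ** 2 for u, r in seen)
            item[i] = num / den
    return user, item


# Toy data: ratings generated from known factors, with one entry held out.
true_user = [1.0, 2.0, 3.0]
true_item = [2.0, 1.0, 0.5]
ratings = {
    (u, i): true_user[u] * true_item[i]
    for u in range(3)
    for i in range(3)
    if (u, i) != (2, 2)
}

user, item = als_rank1(ratings, n_users=3, n_items=3)
predicted = user[2] * item[2]   # estimate for the held-out rating (true 1.5)
```

Each half-step is an independent least-squares problem per user or per item, which is why the method parallelizes well across a cluster and benefits from keeping the factors in memory between iterations.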
• GraphX
GraphX is a distributed graph-processing framework on top of Spark. It
provides an API for expressing graph computation that can model
user-defined graphs using the Pregel abstraction API. It also provides an
optimized runtime for this abstraction.
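
The Pregel abstraction organizes a graph computation into supersteps: in each one, every vertex that received messages updates its own state and sends messages along its outgoing edges. The sketch below shows that pattern for single-source shortest paths in plain Python. It is not the GraphX API: the function name `pregel_shortest_paths` and the sample graph are hypothetical, and edge weights are assumed to be positive.

```python
def pregel_shortest_paths(edges, source):
    """Single-source shortest paths in a Pregel-like superstep loop.

    edges maps each vertex to a list of (neighbor, weight) pairs.
    """
    vertices = set(edges) | {dst for outs in edges.values() for dst, _ in outs}
    distance = {v: float("inf") for v in vertices}
    inbox = {source: [0.0]}    # messages delivered at the start of a superstep
    while inbox:               # halt when no vertex has pending messages
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            # Vertex program: keep the shorter distance, then notify neighbors.
            if best < distance[vertex]:
                distance[vertex] = best
                for neighbor, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox         # next superstep sees only the new messages
    return distance


graph = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 2.0)],
    "C": [("D", 1.0)],
}
distances = pregel_shortest_paths(graph, "A")
```

A vertex that does not improve its distance sends nothing, so the message traffic shrinks as the computation converges and the loop stops on its own.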
Thank you

