
Apache Spark


Unified Analytics Engine for Large-Scale Distributed Data Processing and Machine Learning

WHAT IS APACHE SPARK?


Apache Spark is a powerful unified analytics engine for large-scale distributed data processing and machine learning. On top of the Spark core
data processing engine are libraries for SQL, machine learning, graph computation, and stream processing. These libraries can be used together
in many stages in modern data pipelines and allow for code reuse across batch, interactive, and streaming applications. Spark is useful for
ETL processing, analytics and machine learning workloads, and for batch and interactive processing of SQL queries, machine learning inference,
and artificial intelligence applications.

The Power of Data Pipelines


Much of Spark's power lies in its ability to combine very different techniques and processes into a single, coherent whole. Outside Spark, the discrete tasks of selecting data, transforming that data
in various ways, and analyzing the transformed results might easily require a series of separate processing frameworks, tied together by a workflow scheduler such as Apache Oozie. Spark, on the other hand, offers the ability to
combine these, crossing boundaries between batch, streaming, and interactive workflows in ways that make the user more productive.

Spark jobs perform multiple operations consecutively, in memory, only spilling to disk when required by memory limitations. Spark simplifies the management of these disparate processes, offering
an integrated whole – a data pipeline that is easier to configure, run, and maintain. In use cases such as ETL, these pipelines can become extremely rich and complex, combining large numbers of
inputs and a wide range of processing steps into a unified whole that consistently delivers the desired result.

Predicting Flight Delays with Apache Spark Machine Learning


Learn more about Apache Spark's MLlib, which makes machine learning scalable and easier
with ML pipelines built on top of DataFrames.

In this video, you will see an example from the eBook Getting Started with Apache Spark 2.x.

WHY APACHE SPARK?

CHALLENGES WITH PREVIOUS TECHNOLOGIES

Before Spark, there was MapReduce, a scalable, resilient distributed processing framework that enabled Google to index the exploding volume of content on the web across large clusters of commodity servers.

With MapReduce, iterative algorithms require chaining multiple MapReduce jobs together. This causes a lot of reading and writing to disk. For each MapReduce job, data is read from a distributed file block into a map process, written to and read from a file in between, and then written to an output file from a reducer process.

The MapReduce Java API is not easy to program with, although Pig and Hive make this somewhat easier.

MapReduce, Pig, and Hive are only for batch ETL, and data sources are limited to Hadoop.

ADVANTAGES OF SPARK

Apache Spark began life in 2009 as a project within the AMPLab at the University of California, Berkeley. The goal of the Spark project was to keep the benefits of MapReduce's scalable, distributed, fault-tolerant processing framework while making it more efficient and easier to use. Spark is designed for speed:

Spark runs multi-threaded lightweight tasks inside of JVM processes, providing fast job startup and parallel multi-core CPU utilization.

Spark caches data in memory across multiple parallel operations, making it especially fast for parallel processing of distributed data with iterative algorithms.

Spark provides a rich functional programming model and comes packaged with higher-level libraries for SQL, machine learning, streaming, and graphs.

Spark's Structured API provides the same API for batch and real-time streaming. Spark's architecture supports tight integration with a number of leading storage solutions in the Hadoop ecosystem and beyond, including Apache HDFS, MapR XD Distributed File and Object Store, Apache HBase, MapR Database JSON, Apache Kafka, and Apache Hive.
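
The difference between chaining MapReduce jobs through disk and caching the working set in memory can be illustrated with a toy model. The sketch below is plain Python, not Spark or Hadoop code: `CountingStore`, `run_chained`, and `run_cached` are hypothetical names invented for this illustration, and the "distributed file system" is just a dictionary that counts reads and writes.

```python
class CountingStore:
    """Stand-in for a distributed file system that counts I/O operations."""

    def __init__(self, data):
        self._files = {"input": list(data)}
        self.reads = 0
        self.writes = 0

    def read(self, name):
        self.reads += 1
        return list(self._files[name])

    def write(self, name, records):
        self.writes += 1
        self._files[name] = list(records)


def step(records):
    """One iteration of a toy iterative algorithm: halve every value."""
    return [x / 2 for x in records]


def run_chained(store, iterations):
    """MapReduce style: each iteration is a separate job that reads its
    input from the store and writes its output back to the store."""
    name = "input"
    for i in range(iterations):
        records = store.read(name)
        name = f"iter-{i}"
        store.write(name, step(records))
    return store.read(name)


def run_cached(store, iterations):
    """Caching style: read the input once and keep the working set in memory."""
    records = store.read("input")
    for _ in range(iterations):
        records = step(records)
    return records


chained_store = CountingStore([8.0, 16.0])
cached_store = CountingStore([8.0, 16.0])
chained_result = run_chained(chained_store, 3)
cached_result = run_cached(cached_store, 3)

print(chained_result == cached_result)
print("chained:", chained_store.reads, "reads,", chained_store.writes, "writes")
print("cached: ", cached_store.reads, "reads,", cached_store.writes, "writes")
```

Both functions compute the same result, but the chained version touches the store on every iteration while the cached version reads the input only once. The gap widens as the number of iterations grows, which is why iterative algorithms benefit most from in-memory caching.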

KEY BENEFITS OF APACHE SPARK

EVENT STREAM PROCESSING

From log files to sensor data, application developers increasingly have to cope with streams of data. This data arrives in a steady stream, often from multiple sources simultaneously. While it is possible to store these data streams on disk and analyze them retrospectively, it is sometimes necessary to process and act upon the data as it arrives. Streams of data related to financial transactions, for example, can be processed in real time to identify, and refuse, potentially fraudulent transactions.

MACHINE LEARNING

As data volumes grow, machine learning approaches become more feasible and increasingly accurate. Software can be trained to identify and act upon triggers within well-understood datasets before applying the same solutions to new and unknown data. Spark’s ability to store data in memory and rapidly run repeated queries makes it a good choice for training machine learning algorithms. Running broadly similar queries again and again, at scale, significantly reduces the time required to go through a set of possible solutions in order to find the most efficient algorithms.
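
The fraud-screening scenario from the event stream processing discussion can be sketched as a function that decides on each transaction as it arrives instead of storing the stream and analyzing it later. This is a minimal plain-Python illustration, not Spark Streaming code: the `Transaction` type, the `screen` function, and the running-total rule are all assumptions made for the example, and a real system would typically score events with a trained model.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Transaction:
    account: str
    amount: float


def screen(stream: Iterable[Transaction], limit: float) -> Iterator[Tuple[Transaction, bool]]:
    """Yield (transaction, accepted) pairs one at a time, as events arrive.

    Hypothetical rule: refuse a transaction when the running total of
    accepted amounts for its account would exceed `limit`.
    """
    totals = {}
    for tx in stream:
        new_total = totals.get(tx.account, 0.0) + tx.amount
        accepted = new_total <= limit
        if accepted:
            # Only accepted transactions count toward the running total.
            totals[tx.account] = new_total
        yield tx, accepted


events = [
    Transaction("A", 40.0),
    Transaction("B", 500.0),
    Transaction("A", 50.0),
    Transaction("A", 20.0),
]
decisions = [accepted for _, accepted in screen(events, limit=100.0)]
print(decisions)
```

Because `screen` is a generator, each decision is available as soon as its event has been read, which mirrors the "act upon the data as it arrives" requirement.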

INTERACTIVE ANALYTICS

Rather than running pre-defined queries to create static dashboards of sales or production line productivity or stock prices, business analysts and data scientists want to explore their data by asking a question, viewing the result, and then either altering the initial question slightly or drilling deeper into results. This interactive query process requires systems like Spark that are able to respond and adapt quickly.

DATA INTEGRATION

Data produced by different systems across a business is rarely clean or consistent enough to be simply and easily combined for reporting or analysis. ETL processes are often used to pull data from different systems, clean and standardize it, and then load it into a separate system for analysis. Spark (and Hadoop) are increasingly being used to reduce the cost and time required for this ETL process.
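
The extract, clean and standardize, then load sequence described under data integration might look like the following in miniature. This is a plain-Python sketch under stated assumptions, not Spark code: the two source systems (a CRM feed and a billing feed), their field names, and the `extract`, `transform`, and `load` functions are all hypothetical.

```python
def extract():
    """Pull raw rows from two hypothetical source systems with different schemas."""
    crm = [{"Email": " Ann@Example.com ", "Country": "us"}]
    billing = [
        {"email_address": "BOB@example.com", "country_code": "DE"},
        {"email_address": "", "country_code": "FR"},
    ]
    return crm, billing


def transform(crm, billing):
    """Map both feeds onto one schema, normalize values, and drop rows with no email."""
    unified = []
    for row in crm:
        unified.append({"email": row["Email"], "country": row["Country"]})
    for row in billing:
        unified.append({"email": row["email_address"], "country": row["country_code"]})

    cleaned = []
    for row in unified:
        email = row["email"].strip().lower()
        if email:
            cleaned.append({"email": email, "country": row["country"].strip().upper()})
    return cleaned


def load(rows, warehouse):
    """Append the standardized rows to the target table."""
    warehouse.extend(rows)
    return warehouse


warehouse = load(transform(*extract()), [])
print(warehouse)
```

Keeping the three stages as separate functions makes it possible to test the cleaning rules on their own, independent of where the data comes from or where it ends up.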

WHY SPARK MATTERS TO YOU


Spark enables developers, data engineers, and data scientists to collaborate and combine SQL, streaming data, machine learning, and graph processing into modern data pipelines to rapidly
access, transform, and analyze big data at scale.

DEVELOPERS AND DATA ENGINEERS

Easier, faster data pipelines:

Develop and deploy applications that run 10-100x faster in production environments with in-memory processing of data

Build complex ETL pipelines that can speed up data ingestion and deliver superior performance

Spark SQL's Structured Data API simplifies the complexity of data access, transformation, and storage across distributed file systems, different file formats, streaming data, and NoSQL data stores

Combine event streams with machine learning to handle the logistics of machine learning in a flexible way by:

Making input and output data available to independent consumers

Managing and evaluating multiple models and easily deploying new models

DATA SCIENTISTS

Easier, faster time to insight:

Provides a uniform set of high-level machine learning pipeline APIs built on top of DataFrames to make machine learning scalable with the ease of SQL for data manipulation

Integrated distributed machine learning algorithms for classification, regression, collaborative filtering, clustering, dimensionality reduction, and frequent pattern mining

Leverage Spark and deep learning with external libraries including BigDL, Spark Deep Learning Pipelines, TensorFlowOnSpark, dist-keras, H2O Sparkling Water, PyTorch, Caffe, and MXNet

Free Hadoop Training: Spark Essentials


Get a glimpse of what free Hadoop on-demand training is like in this preview of the course "DEV
360 - Introduction to Apache Spark (Spark v2.1)."

If you're interested in this free on-demand course, learn more about it here.

WHY IS SPARK ON MAPR BETTER?

Everything on One Cluster Accessing Data In-Place

A confluence of several different technology shifts has dramatically changed machine learning applications. The combination of distributed computing, streaming analytics, and machine learning
is accelerating the development of next-generation intelligent applications, which take advantage of modern computational paradigms powered by modern computational infrastructure. The MapR
Data Platform integrates global event streaming, real-time database capabilities, and scalable enterprise storage with Hadoop, Spark, Apache Drill, and other ML libraries to power this new
generation of data processing pipelines and intelligent applications. Diverse and open APIs allow all types of analytics workflows to run on the data in place.

The MapR XD Distributed File and Object Store is designed to store data at exabyte scale, support trillions of files, and combine analytics and operations into a single platform. MapR XD supports
industry-standard protocols and APIs, including POSIX, NFS, S3, and HDFS. Unlike Apache HDFS, which follows a write-once, append-only paradigm, the MapR Data Platform delivers a true read-write,
POSIX-compliant file system. Support for the HDFS API enables Spark and Hadoop ecosystem tools, for both batch and streaming, to interact with MapR XD. Support for POSIX enables Spark and
all non-Hadoop libraries to read and write to the distributed data store as if the data were mounted locally, which greatly expands the possible use cases for next-generation applications. Support
for an S3-compatible API means MapR XD can also serve as the foundation for Spark applications that leverage object storage.

The MapR Event Store for Apache Kafka is the first big-data-scale streaming system built into a unified data platform and the only big data streaming system to support global event replication
reliably at IoT scale. Support for the Kafka API enables Spark streaming applications to interact with data in real time in a unified data platform, which minimizes maintenance and data copying.

MapR Database is a high-performance NoSQL database built into the MapR Data Platform. MapR Database is multi-model: wide-column, key-value with the HBase API, or JSON (document) with
the OJAI API. Spark connectors are integrated for both HBase and OJAI APIs, enabling real-time and batch pipelines with MapR Database:

The MapR Database Connector for Apache Spark enables you to use MapR Database as a sink for Spark Structured Streaming or Spark Streaming.
The Spark MapR Database Connector enables users to perform complex SQL queries and updates on top of MapR Database, while applying critical techniques such as projection and filter
pushdown, custom partitioning, and data locality.
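
Projection and filter pushdown mean that the data store, rather than the client, drops unneeded columns and rows before anything is shipped. The toy model below illustrates the idea in plain Python. It does not use the MapR Database connector or any Spark API: `ToyTable`, its `scan` method, and the sample rows are hypothetical and exist only to count how many field values cross the wire.

```python
class ToyTable:
    """Stand-in for a remote data store that counts how many field values it ships."""

    def __init__(self, rows):
        self._rows = rows
        self.values_shipped = 0

    def scan(self, columns=None, predicate=None):
        """Return rows; when given, apply the filter and projection at the source."""
        out = []
        for row in self._rows:
            if predicate is not None and not predicate(row):
                continue
            if columns is not None:
                row = {c: row[c] for c in columns}
            self.values_shipped += len(row)
            out.append(row)
        return out


rows = [
    {"id": 1, "state": "TX", "depth": 900, "notes": "n/a"},
    {"id": 2, "state": "CO", "depth": 1200, "notes": "n/a"},
    {"id": 3, "state": "TX", "depth": 1500, "notes": "n/a"},
]

# Without pushdown: ship every row and column, then filter and project on the client.
naive = ToyTable(rows)
naive_result = [{"id": r["id"]} for r in naive.scan() if r["state"] == "TX"]

# With pushdown: the source applies the filter and projection before shipping.
pushed = ToyTable(rows)
pushed_result = pushed.scan(columns=["id"], predicate=lambda r: r["state"] == "TX")

print(naive_result == pushed_result)
print(naive.values_shipped, pushed.values_shipped)
```

Both queries return the same answer, but the pushed-down scan ships far fewer values, which is the saving that pushdown aims for on large tables.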

MapR builds the key technologies essential to achieving high scale and high reliability into a fully distributed architecture that spans on-premises, cloud, and multi-cloud deployments, including edge-first
IoT, while dramatically lowering both the hardware and operational costs of your most important applications and data.

“We are very excited about the new features [in MapR]. Spark structured streaming allows us to use advanced analytics on
real-time oil well data while Drill allows us to explore the same data using SQL. This helps us make operational decisions
faster.”

—  Eric Keister, advanced analytics and emerging technologies manager at Anadarko




© 2019 MapR Technologies, Inc. All Rights Reserved
