Using Business Dashboards to Summarize a Billion Records in Seconds
Contributors
Devendran S, Software Engineer, Syncfusion Big Data Platform Team: Devendran is passionate about his
work and is an expert on all things big data, including the Apache Hadoop and Spark platforms.
Rajendran SP, Product Manager, Big Data: Rajendran manages the Big Data Platform Team at
Syncfusion. He has been deeply involved with the design and implementation of the Syncfusion Big Data
Platform.
Contents
Abstract
Introduction
Resources used
    Software
    Hardware
    The data set: New York City taxi data
Overview
Create table with New York City taxi data set
Initial performance tuning and optimization
Query execution using Spark SQL
    Default query execution performance
    Caching
    Partitioning
Dashboard
Conclusion
    Consolidated metrics for easy review
Appendix: Spark 2.0.0 preview release version testing
Abstract
The key takeaway is that you can now process and display massive amounts of data using commodity
hardware and the Syncfusion Dashboard and Big Data Platforms. No other software is required.
This white paper assumes a basic working knowledge of Apache Spark, the Syncfusion Big Data Platform,
and the Syncfusion Dashboard Platform.
Syncfusion publishes a free e-book on Apache Spark that can be downloaded from
https://www.syncfusion.com/resources/techportal/details/ebooks/spark.
The Syncfusion Big Data Platform is covered in this YouTube playlist:
https://www.youtube.com/playlist?list=PLDzXQPWT8wEATq33AeaF6HkOLuicbcCrA. A full trial
can be downloaded from https://www.syncfusion.com/products/big-data.
The Syncfusion Dashboard Platform is covered in this YouTube playlist:
https://www.youtube.com/playlist?list=PLDzXQPWT8wECbo2lLciJAK-4cMESbuG4L. A full trial
can be downloaded from https://www.syncfusion.com/products/dashboard.
Introduction
The Syncfusion Dashboard Platform is a complete solution for creating and sharing interactive
dashboards. The Dashboard Designer can directly connect to the Syncfusion Big Data Platform and
interactively transform huge amounts of data into attractive dashboards.
The Syncfusion Big Data Platform is a complete Apache Hadoop distribution designed for Windows. The
included Cluster Manager makes provisioning easy, allowing us to manage and monitor Hadoop clusters
on Windows. Apache Spark is tightly integrated into the Syncfusion Big Data Platform.
Resources used
Software
Syncfusion Big Data Platform
Syncfusion Dashboard Platform
Overview
We formed a five-node Syncfusion Big Data Platform cluster: two name nodes and three data nodes.
Apache Spark ships preconfigured with the Syncfusion Big Data Platform, so no additional work is
needed to get it up and running. We uploaded the New York City taxi data set into HDFS, created a
table for access through Apache Spark SQL, and then accessed the data from the Syncfusion
Dashboard Platform for display and sharing.
Create table with New York City taxi data set
1. Download the archives of the data set, extract them into a single folder, and then upload it into
HDFS using the following command.
2. Once uploaded, open the Apache Spark shell provided inside the Syncfusion Big Data Studio and
execute the queries listed in the Query Text table that follows.
sqlContext.sql("query_text_goes_here")
Alternatively, we can execute the queries directly against the Apache Spark Thrift server using
the Syncfusion.ThriftHive.Base assembly that ships with the Syncfusion Big Data Platform. We
have prepared a simple console application that demonstrates executing queries directly against
the Apache Spark Thrift server. This application is available at
https://github.com/syncfusion/spark-sql-with-dashboard/tree/master/Syncfusion.Bigdata.ThriftApplication.
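As a rough illustration of step 2, the sketch below wraps a single query in a timer. It assumes the PySpark flavor of the Spark shell, where `sqlContext.sql(...)` returns a lazy DataFrame and `collect()` forces execution. The helper name `run_timed` is ours and not part of the platform, and the query text stays a placeholder.

```python
from timeit import default_timer


def run_timed(sql_context, query_text):
    """Run one query and return (rows, elapsed_seconds).

    sql_context is assumed to behave like the shell's sqlContext:
    .sql(text) returns an object whose .collect() returns the rows.
    """
    start = default_timer()
    # .sql() only builds the query; .collect() triggers the actual work.
    rows = sql_context.sql(query_text).collect()
    elapsed = default_timer() - start
    return rows, elapsed


# Hypothetical usage inside the Spark shell:
#   rows, seconds = run_timed(sqlContext, "query_text_goes_here")
```

Timing around `collect()` rather than around `sql()` matters, because building the query alone does not read any data. Note that `collect()` brings every result row back to the driver, so it suits small aggregate results only.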
Query Text
Initial performance tuning and optimization
Apache Spark is very resource intensive, so we had to allocate YARN and Spark resources carefully
to achieve the expected performance. Each Spark application consists of a single driver process and
a set of executor processes scattered across the nodes on the cluster. Each Spark executor in an
Our Syncfusion Big Data Platform nodes had a total of 60 cores, excluding the two name nodes and the
driver, which runs on the active name node. We estimated that the OS and other Apache Hadoop services
use 10 cores, which leaves 50 cores available solely for running Spark executors. We therefore
allocated five cores per executor and set the total number of executor instances to 10.
spark.driver.memory 28g
spark.executor.memory 28g
spark.executor.instances 10
spark.executor.cores 5
We also removed the spark.cleaner.ttl property from the configuration information to avoid clearing
cached data.
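The core arithmetic behind these settings can be written out as a small sketch. The function name `plan_executors` is hypothetical. The inputs (60 total cores, 10 reserved, five cores per executor) come from the text above. Memory sizing is left out because it depends on hardware details not covered here.

```python
def plan_executors(total_cores, reserved_cores, cores_per_executor):
    """Derive executor settings from a core budget.

    reserved_cores covers the OS and other Hadoop services.
    Leftover cores that do not fill a whole executor stay unused.
    """
    usable = total_cores - reserved_cores
    if usable <= 0 or cores_per_executor <= 0:
        raise ValueError("no cores left for executors")
    return {
        "spark.executor.cores": cores_per_executor,
        "spark.executor.instances": usable // cores_per_executor,
    }


# The figures from this cluster: (60 - 10) // 5 = 10 executors.
settings = plan_executors(total_cores=60, reserved_cores=10, cores_per_executor=5)
for name in sorted(settings):
    print("{0} {1}".format(name, settings[name]))
```

The integer division makes the trade-off visible: with seven cores per executor, for example, only seven executors fit and one core would sit idle.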
Query execution using Spark SQL
Default query execution performance
Given the size of data involved and the scalable nature of the processing system, this performance is
decent, but we could do better. Much better.
Caching
Apache Spark supports loading data into a cluster-wide in-memory cache. Caching computes and
materializes a Spark resilient distributed dataset (RDD) in memory while keeping track of its lineage, which can
Caching the RDD in our test took about 9.3 minutes. Once cached, the data could be processed much
faster, as the metrics in the following table indicate.
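One way to trigger this kind of caching from the shell is Spark SQL's `CACHE TABLE` statement, sketched below under a few assumptions: the PySpark flavor of the shell, a table name `nyc_taxi` that we made up for illustration, and a helper name `cache_table` of our own. To our understanding, `CACHE TABLE` is eager in Spark SQL 1.2 and later, so the call returns once the table has been materialized, while `CACHE LAZY TABLE` defers the work to the first query.

```python
from timeit import default_timer


def cache_table(sql_context, table_name, lazy=False):
    """Issue a CACHE TABLE statement and return the elapsed seconds.

    sql_context is assumed to expose .sql(text) like the shell's sqlContext.
    With lazy=True the statement returns quickly and the first query pays
    the caching cost instead.
    """
    keyword = "CACHE LAZY TABLE" if lazy else "CACHE TABLE"
    statement = "{0} {1}".format(keyword, table_name)
    start = default_timer()
    sql_context.sql(statement)
    return default_timer() - start


# Hypothetical usage inside the Spark shell (table name is an assumption):
#   seconds = cache_table(sqlContext, "nyc_taxi")
```

Measuring the eager variant separately keeps the one-time caching cost apart from the per-query timings that follow it.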
Partitioning
As a next step, we applied partitioning to improve performance further. Table partitioning is a
common optimization approach used in systems such as Apache Spark SQL and Hive. Each table can
have one or more partition keys, which determine how the data is stored. Partitions allow the system to
efficiently identify the rows that satisfy a specified criterion. For example, in our use case, New York
City taxi data sets are available from the year 2009 to 2015. If the data is partitioned by year,
a query restricted to a particular year (2009, for example) is performed only on the 2009 partition
of the table, which speeds up the analysis significantly.
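The effect of a partition key can be shown with a toy model in plain Python. This is not Spark code: it only mimics how storing records by year lets a year-restricted query skip every other year. The record layout and the sample trips are made up for illustration.

```python
from collections import defaultdict


def partition_by_year(records):
    """Group records into one bucket per year, like one partition per key."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record["year"]].append(record)
    return partitions


def count_for_year(partitions, year):
    """Answer the query by reading only the matching partition."""
    return len(partitions.get(year, []))


def count_for_year_full_scan(records, year):
    """Unpartitioned baseline: every record is inspected."""
    return sum(1 for record in records if record["year"] == year)


# Made-up sample trips; only the "year" field matters here.
trips = [{"year": 2009}, {"year": 2009}, {"year": 2012}, {"year": 2015}]
by_year = partition_by_year(trips)

# Both approaches agree, but the partitioned lookup never touches 2012 or 2015.
assert count_for_year(by_year, 2009) == count_for_year_full_scan(trips, 2009)
```

The work for the partitioned query grows with the size of one year's data, while the full scan grows with the size of the whole table.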
By executing the following commands in the Spark shell, we can partition the New York City taxi data set
based on year.
After partitioning, we achieved significantly improved performance, as shown in the following table.
Dashboard
We then created a dashboard visualization using the Syncfusion Dashboard Designer, which ships as part
of our Dashboard Platform. To do this, we simply connected to the Syncfusion Big Data Platform as
the data source and used the previously cached table as our data.
We configured three bar charts to display the record count, passenger count, and total amount,
grouped by vendor ID and payment type. The following screenshot shows the dashboard we
designed. It takes just seconds to load the data into the charts.
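The three charts amount to one grouped aggregation. The plain-Python sketch below shows what is computed per (vendor ID, payment type) pair. The field names `vendor_id`, `payment_type`, `passenger_count`, and `total_amount` are assumptions about the table's columns, and the sample rows are invented. On the cluster the same summary would come from a Spark SQL query rather than from a loop like this.

```python
from collections import defaultdict


def summarize(trips):
    """Return record count, passenger count, and total amount per group."""
    totals = defaultdict(lambda: {"records": 0, "passengers": 0, "amount": 0.0})
    for trip in trips:
        key = (trip["vendor_id"], trip["payment_type"])
        group = totals[key]
        group["records"] += 1
        group["passengers"] += trip["passenger_count"]
        group["amount"] += trip["total_amount"]
    return dict(totals)


# Invented sample rows, for illustration only.
sample = [
    {"vendor_id": "A", "payment_type": "card", "passenger_count": 2, "total_amount": 10.0},
    {"vendor_id": "A", "payment_type": "card", "passenger_count": 1, "total_amount": 5.5},
    {"vendor_id": "B", "payment_type": "cash", "passenger_count": 3, "total_amount": 7.0},
]
summary = summarize(sample)
```

Each chart then plots one of the three values from `summary` against the group keys.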
Conclusion
We have demonstrated that querying a massive 173 GB data set with over a billion records and displaying
it in a rich dashboard interface takes only a few seconds with the Syncfusion Dashboard and Big Data
Platforms.
There is no comparable solution on the market today. A comparable dashboard platform without access
to a big data back end costs hundreds of thousands of dollars in annual licensing fees alone. The entire
Syncfusion Data Platform can be licensed for unlimited organization-wide deployment with an unlimited
number of nodes starting at just $3,995 per year. Contact us today to get started!
Appendix: Spark 2.0.0 preview release version testing
The following table shows the metrics for query execution using Spark 2.0. We will be shipping a preview
version of the Syncfusion Big Data Platform with Spark 2.0 soon.