Using Business Dashboards to Summarize a Billion Records in Seconds

Published: June 2016

Contributors
Devendran S, Software Engineer, Syncfusion Big Data Platform Team: Devendran is passionate about his
work and is an expert on all things big data, including the Apache Hadoop and Spark platforms.

Rajendran SP, Product Manager, Big Data: Rajendran manages the Big Data Platform Team at
Syncfusion. He has been deeply involved with the design and implementation of the Syncfusion Big Data
Platform.

Syncfusion | Using Business Dashboards to Summarize a Billion Records in Seconds


Table of Contents

Abstract
Introduction
Resources used
    Software
    Hardware
    The data set: New York City taxi data
Overview
Create table with New York City taxi data set
Initial performance tuning and optimization
Query execution using Spark SQL
    Default query execution performance
    Caching
    Partitioning
Dashboard
Conclusion
    Consolidated metrics for easy review
Appendix: Spark 2.0.0 preview release version testing

Abstract
This white paper demonstrates interactively processing and displaying information extracted from a
billion data records using the Syncfusion Dashboard and Big Data Platforms. The records processed are
from the well-known New York City taxi data set. Data processing is performed by the Syncfusion Big Data
Platform running Apache Spark. We analyze overall performance and tune the system based on factors
such as memory and cores available on the cluster. We also improve query performance by applying
partitioning and caching techniques.

The key takeaway is that you can now process and display massive amounts of data using commodity
hardware and the Syncfusion Dashboard and Big Data Platforms. No other software is required.

This white paper assumes a basic working knowledge of Apache Spark, the Syncfusion Big Data Platform,
and the Syncfusion Dashboard Platform.

Syncfusion publishes a free e-book on Apache Spark that can be downloaded from
https://www.syncfusion.com/resources/techportal/details/ebooks/spark.
The Syncfusion Big Data Platform is covered in this YouTube playlist:
https://www.youtube.com/playlist?list=PLDzXQPWT8wEATq33AeaF6HkOLuicbcCrA. A full trial
can be downloaded from https://www.syncfusion.com/products/big-data.
The Syncfusion Dashboard Platform is covered in this YouTube playlist:
https://www.youtube.com/playlist?list=PLDzXQPWT8wECbo2lLciJAK-4cMESbuG4L. A full trial
can be downloaded from https://www.syncfusion.com/products/dashboard.

Introduction
The Syncfusion Dashboard Platform is a complete solution for creating and sharing interactive
dashboards. The Dashboard Designer can directly connect to the Syncfusion Big Data Platform and
interactively transform huge amounts of data into attractive dashboards.

The Syncfusion Big Data Platform is a complete Apache Hadoop distribution designed for Windows. The
included Cluster Manager makes provisioning easy, allowing us to manage and monitor Hadoop clusters
on Windows. Apache Spark is tightly integrated into the Syncfusion Big Data platform.

Resources used
Software
Syncfusion Big Data Platform
Syncfusion Dashboard Platform

Hardware
Node type: Name node, running the Syncfusion Big Data Platform
Number of nodes: 2
Machine specs: Azure VM instance type D4 Standard; RAM 28 GB; hard disk 400 GB; 8 cores;
OS Windows Server 2012

Node type: Data node, running the Syncfusion Big Data Platform
Number of nodes: 3
Machine specs: Azure VM instance type D15 Standard; RAM 140 GB; hard disk 1 TB; 20 cores;
OS Windows Server 2012

The data set: New York City taxi data


The New York City Taxi & Limousine Commission released a detailed historical data set covering over 1.1
billion individual taxi trips in the city from January 2009 through June 2015. We used this data set to
explore the capabilities of the Syncfusion Dashboard and Big Data Platforms. You can download the data
set from the following link:
https://data.cityofnewyork.us/data?agency=Taxi+and+Limousine+Commission+%28TLC%29&cat=&type=new_view&browseSearch=&scope=

Refer to the following link for detailed schema information:
http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

Overview
We formed a Syncfusion Big Data Platform cluster with five nodes: two name nodes and three data
nodes. Apache Spark is shipped and configured with the Syncfusion Big Data Platform. No additional
work is needed to get it up and running. We then uploaded the New York City taxi data set into HDFS
and created a table for access using Apache Spark SQL. We then accessed the data using the Syncfusion
Dashboard Platform for display and sharing.

We will now look into each step in detail.

Create table with New York City taxi data set

1. Download the archives of the data set, extract them into a single folder, and then upload the
folder to HDFS using the following command.

> hdfs dfs -put <localfolderpath> /SparkSQLDemo/

We can also use the Syncfusion Big Data Studio IDE to connect with the cluster and perform
upload operations without using shell commands.

2. Once uploaded, open the Apache Spark shell provided inside the Syncfusion Big Data Studio and
execute the queries listed in the Query Text table that follows.

sqlContext.sql("query_text_goes_here")

Alternatively, we can execute the queries directly against the Apache Spark Thrift server using
the Syncfusion.ThriftHive.Base assembly that ships with the Syncfusion Big Data Platform. We
have prepared a simple console application that demonstrates executing queries directly against
the Apache Spark Thrift server. This application is available at
https://github.com/syncfusion/spark-sql-with-dashboard/tree/master/Syncfusion.Bigdata.ThriftApplication.

Query Text

CREATE EXTERNAL TABLE nyctrips (
    vendor_id string, pickup_datetime timestamp, dropoff_datetime timestamp,
    passenger_count double, trip_distance double, pickup_longitude double,
    pickup_latitude double, rate_code double, store_and_fwd_flag string,
    dropoff_longitude double, dropoff_latitude double, payment_type string,
    fare_amount double, surcharge double, mta_tax double, tip_amount double,
    tolls_amount double, total_amount double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

LOAD DATA INPATH '/SparkSQLDemo/yellow_tripdata_2009.csv' INTO TABLE nyctrips
LOAD DATA INPATH '/SparkSQLDemo/yellow_tripdata_2010.csv' INTO TABLE nyctrips
LOAD DATA INPATH '/SparkSQLDemo/yellow_tripdata_2011.csv' INTO TABLE nyctrips
LOAD DATA INPATH '/SparkSQLDemo/yellow_tripdata_2012.csv' INTO TABLE nyctrips
LOAD DATA INPATH '/SparkSQLDemo/yellow_tripdata_2013.csv' INTO TABLE nyctrips
LOAD DATA INPATH '/SparkSQLDemo/nyc_taxi_data.csv' INTO TABLE nyctrips
LOAD DATA INPATH '/SparkSQLDemo/yellow_tripdata_2015-01-06.csv' INTO TABLE nyctrips

Initial performance tuning and optimization


At the outset we used the configuration settings listed in the following table.

Configuration file: yarn-site.xml
    yarn.nodemanager.resource.memory-mb = 122880
    yarn.scheduler.minimum-allocation-mb = 8192
    yarn.scheduler.maximum-allocation-mb = 32678

Apache Spark is resource intensive, so we had to allocate YARN and Spark resources carefully to
achieve the expected performance. Each Spark application consists of a single driver process and a set
of executor processes scattered across the nodes of the cluster. Each executor has a number of cores
and a heap size. To obtain high throughput, we explicitly set the number of instances, cores, and heap
size used by executors.

Our three data nodes provided a total of 60 cores, excluding the two name nodes and the driver, which
runs on the active name node. We estimated that the OS and other Apache Hadoop services consume 10 of
those cores, leaving 50 cores available solely for running Spark executors. Given this, we allocated
five cores per executor and set the total number of executor instances to 10.

spark.driver.memory 28g
spark.executor.memory 28g
spark.executor.instances 10
spark.executor.cores 5
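The executor-sizing arithmetic above can be sketched as a small helper. This is illustrative plain
Python, independent of Spark; the function name and its parameters are assumptions made for this
example.

```python
def plan_executors(data_nodes, cores_per_node, reserved_cores, cores_per_executor):
    """Derive a Spark executor layout from raw cluster capacity."""
    total_cores = data_nodes * cores_per_node     # 3 * 20 = 60 cores on data nodes
    available = total_cores - reserved_cores      # 60 - 10 = 50 cores for executors
    instances = available // cores_per_executor   # 50 // 5 = 10 executors
    return {
        "spark.executor.instances": instances,
        "spark.executor.cores": cores_per_executor,
    }

# The cluster in this paper: three D15 data nodes with 20 cores each,
# and 10 cores reserved for the OS and other Hadoop services.
layout = plan_executors(data_nodes=3, cores_per_node=20,
                        reserved_cores=10, cores_per_executor=5)
print(layout)  # {'spark.executor.instances': 10, 'spark.executor.cores': 5}
```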

We also removed the spark.cleaner.ttl property from the configuration information to avoid clearing
cached data.

Query execution using Spark SQL


Once the configuration changes were completed, we ran a battery of sample queries that compute the
total record count, passenger count, and total amount grouped by payment type and vendor ID to obtain
performance information.

Default query execution performance


The metrics we obtained are listed below.

Total number of records (run time: 2.4 minutes):
    select trips.payment_type, trips.vendor_id, count(*)
    from (select * from nyctrips) trips
    where trips.payment_type in ('CRD','CSH')
    group by trips.payment_type, trips.vendor_id

Passenger count (run time: 2.4 minutes):
    select trips.payment_type, trips.vendor_id, sum(trips.passenger_count)
    from (select * from nyctrips) trips
    where trips.payment_type in ('CRD','CSH')
    group by trips.payment_type, trips.vendor_id

Total amount (run time: 3.0 minutes):
    select trips.payment_type, trips.vendor_id, sum(trips.total_amount)
    from (select * from nyctrips) trips
    where trips.payment_type in ('CRD','CSH')
    group by trips.payment_type, trips.vendor_id

Given the size of data involved and the scalable nature of the processing system, this performance is
decent, but we could do better. Much better.

Caching
Apache Spark supports loading data into a cluster-wide in-memory cache. Caching computes and
materializes a Spark resilient distributed data set in memory while keeping track of its lineage,
which can be used to re-create the data set from its original sources as needed. To cache a table, we
used the following command.

cache table tablename

In our case, we used the following:

cache table nyctrips
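Conceptually, caching trades a one-time materialization cost for fast repeated reads. The following
plain-Python sketch (no Spark required; all names here are assumptions for illustration) mimics that
behavior: the first read pays the full load cost, and later reads are served from memory.

```python
class CachedTable:
    """Minimal sketch of cache-on-first-read semantics."""

    def __init__(self, load_fn):
        self._load_fn = load_fn   # lineage: how to re-create the data from its source
        self._rows = None         # materialized copy; empty until first read

    def rows(self):
        if self._rows is None:                 # cold: pay the full scan cost once
            self._rows = list(self._load_fn())
        return self._rows                      # warm: served from memory

def load_from_source():
    # Stand-in for scanning the CSV files in HDFS.
    return [("CRD", 2), ("CSH", 1), ("CRD", 3)]

table = CachedTable(load_from_source)
first = table.rows()    # materializes the cache (the slow, one-time step)
second = table.rows()   # returns the same in-memory list, no re-scan
```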

Caching the resilient distributed data set took about 9.3 minutes in our test. Once cached, we could
process the data much faster, as the following metrics indicate.

Total number of records (run time: 40 seconds):
    select trips.payment_type, trips.vendor_id, count(*)
    from (select * from nyctrips) trips
    where trips.payment_type in ('CRD','CSH')
    group by trips.payment_type, trips.vendor_id

Passenger count (run time: 41 seconds):
    select trips.payment_type, trips.vendor_id, sum(trips.passenger_count)
    from (select * from nyctrips) trips
    where trips.payment_type in ('CRD','CSH')
    group by trips.payment_type, trips.vendor_id

Total amount (run time: 49 seconds):
    select trips.payment_type, trips.vendor_id, sum(trips.total_amount)
    from (select * from nyctrips) trips
    where trips.payment_type in ('CRD','CSH')
    group by trips.payment_type, trips.vendor_id

Could we do any better? Read on!

Partitioning
As a next step, we deployed partitioning techniques to improve performance. Table partitioning is a
common optimization approach used in systems such as Apache Spark SQL and Hive. Each table can
have one or more partition keys which determine how the data is stored. Partitions allow the system to
efficiently identify the tuples that satisfy a given criterion. For example, in our use case, New York
City taxi data is available for the years 2009 through 2015. If the data is partitioned by year, then a
query on a particular year (2009, for example) is performed only against the 2009 partition of the
table, speeding up the analysis significantly.
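The pruning idea can be illustrated with a plain-Python sketch (no Spark required; the sample rows and
names are assumptions for this example): rows are bucketed by the partition key up front, so a query
filtered on one year scans only the matching bucket instead of the whole data set.

```python
from collections import defaultdict

# A tiny stand-in for the taxi data set.
trips = [
    {"year": 2009, "total_amount": 12.5},
    {"year": 2009, "total_amount": 7.0},
    {"year": 2015, "total_amount": 22.0},
]

# Build partitions keyed by year (the one-time cost paid at write/repartition time).
partitions = defaultdict(list)
for row in trips:
    partitions[row["year"]].append(row)

# A query filtered on 2009 touches only the 2009 partition; the 2015
# rows are never scanned.
scanned = partitions[2009]
total_2009 = sum(r["total_amount"] for r in scanned)
print(len(scanned), total_2009)  # 2 19.5
```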

By executing the following commands in the Spark shell, we can partition the New York City taxi data set
based on year.

// Load the nyctrips table into a DataFrame
var nycTripsData = sqlContext.sql("select * from nyctrips")

// Repartition the data by the year of the dropoff_datetime column
var partitionByYear = nycTripsData.repartition(year($"dropoff_datetime"))

// Save the partitioned result into a new table, nyctrips_partitionbyyear
partitionByYear.saveAsTable("nyctrips_partitionbyyear")

Partitioning the data set by year took about 42 minutes.

After partitioning, we achieved significantly improved performance, as the following metrics show.

Total number of records (run time: 19 seconds):
    select trips.payment_type, trips.vendor_id, count(*)
    from (select * from nyctrips_partitionbyyear) trips
    where trips.payment_type in ('CRD','CSH')
    group by trips.payment_type, trips.vendor_id

Passenger count (run time: 13 seconds):
    select trips.payment_type, trips.vendor_id, sum(trips.passenger_count)
    from (select * from nyctrips_partitionbyyear) trips
    where trips.payment_type in ('CRD','CSH')
    group by trips.payment_type, trips.vendor_id

Total amount (run time: 12 seconds):
    select trips.payment_type, trips.vendor_id, sum(trips.total_amount)
    from (select * from nyctrips_partitionbyyear) trips
    where trips.payment_type in ('CRD','CSH')
    group by trips.payment_type, trips.vendor_id

Dashboard
We then created a dashboard visualization using the Syncfusion Dashboard Designer shipped as part of
our Dashboard Platform. In order to do this, we simply connected to the Syncfusion Big Data Platform as
the data source and used the previously cached table as our data.

We configured three bar charts to display the record count, passenger count, and the total amount
grouped by each vendor ID and the payment type. The following screenshot shows the dashboard we
designed. It takes just seconds to load the data into the charts.

You can find out more about the Syncfusion Dashboard Platform at
http://help.syncfusion.com/dashboard-platform/overview.

Conclusion
We have demonstrated that querying a massive 173 GB data set with over a billion records and displaying
it in a rich dashboard interface takes only a few seconds with the Syncfusion Dashboard and Big Data
Platforms.

Consolidated metrics for easy review


Query                 Without tuning    With RDD caching    With partitioning by year
Total record count    2.4 minutes       40 seconds          19 seconds
Passenger count       2.4 minutes       41 seconds          13 seconds
Total amount          3.0 minutes       49 seconds          12 seconds

There is no comparable solution on the market today. A similar dashboard platform without access to a
big data back end costs hundreds of thousands of dollars in annual licensing fees alone. The entire
Syncfusion Data Platform can be licensed for unlimited organization-wide deployment with an unlimited
number of nodes, starting at just $3,995 per year. Contact us today to get started!

Download a free trial at https://www.syncfusion.com/downloads/bigdata
Call 1-888-9DOTNET
Email sales@syncfusion.com

Appendix: Spark 2.0.0 preview release version testing


Spark 2.0 delivers speedups of five to ten times on some workloads. We processed the New York City
taxi data set using the Spark 2.0.0 preview release and the Syncfusion Big Data Platform with the same
configuration used in this white paper. Query execution took less than 15 seconds.

The following table shows the metrics for query execution using Spark 2.0. We will be shipping a preview
version of the Syncfusion Big Data Platform with Spark 2.0 soon.

Query                 Without tuning    With RDD caching    With partitioning by year
Total record count    1.9 minutes       15 seconds          5 seconds
Passenger count       1.9 minutes       13 seconds          5 seconds
Total amount          2.8 minutes       12 seconds          6 seconds
