Deriving Intelligence from Large Data ‐
Using Hadoop and Applying Analytics
Abstract
This white paper discusses the challenges facing large scale data processing,
along with the approaches and solutions for managing, structuring, and
applying analytics to this voluminous data in order to draw valuable insights
and business intelligence. The paper also elaborates on using the Hadoop
ecosystem for deriving structure from large scale data.
Impetus Technologies, Inc.
www.impetus.com
October 2010
Table of Contents
Introduction
Challenges facing large data management
Finding solutions to large data challenges
BI implementation approaches over large data
Approach 1: Analytics over Hadoop with MPP DWs
Approach 2: Indirect Analytics over Hadoop
Approach 3: Direct Analytics over Hadoop
Case Study
Summary
Introduction
Large Data can be described as data that occupies very large storage space in
the file system, which in the present context can range from one terabyte to
petabytes. The world is currently witnessing an upsurge in digital data, and
organizations require solutions that can help them extract valuable intelligence
and insights from this data.
In this scenario, the role of Business Intelligence and Analytics in drawing
insights from large scale data has grown enormously. Take the instance of
financial data monitoring systems. Enormous volumes of financial transactions
take place every single hour. For an organization that provides a fraud
detection system for financial transactions, it would be nearly impossible to
detect suspicious transactions across this flood of data using conventional
means.
Figure 1 Deriving Intelligence from Unstructured Data
For such enterprises, it is imperative to find the best solutions to:
• Store Data
• Cleanse and Transform Data
• Apply Analytics over this Set
The data has to be collected over a long period of time, possibly months or
even years, from various distributed geographical locations. Such enterprises
require an excellent, cost-effective framework for processing this data in a
distributed manner on commodity hardware. The data has to be processed and
summarized as far as possible, and then revisited and reprocessed time and
again. The system must also support mechanisms such as generating alerts on
any dubious transactions. This is where BI and analytics come into the picture.
Organizations today are also driving their businesses based on the feelings and
emotions of their customers and potential customers. While they have huge
amounts of unstructured data available in the form of tweets, comments,
posts, blogs, and so on, this data needs to be properly mined and analyzed to
gauge customer sentiment and identify compliments, comments, and
grievances. Based on this feedback, organizations can decide the course of
their business roadmaps. They can also carry out predictive analysis by
identifying patterns in customer behavior and preferences.
Clearly, every business domain can put the data it already has, or wants to
gather, to myriad uses and turn it into new business opportunities and ideas.
Challenges facing large data management
Large data is one of the biggest challenges facing organizations today, and
applying analytics over it is an even more difficult job. Internet applications
are showing nearly 100% growth every year, and enterprise applications are
also gathering momentum. Addressing the concerns related to large data has
therefore become a top priority.
Figure 2 Analytics Challenges
The main challenges associated with Large Enterprise Data Warehouses are as
follows:
• Disparate Sources: Data comes from many different sources; companies
need to identify these sources and the issues related to them, perform
cleansing operations to obtain 'sensible' data, and establish mechanisms
for drawing conclusions from it.
• Identification of Useful Data: The Extraction, Transformation and
Loading (ETL) process is very resource intensive. Companies need to
throw away a lot of garbage data, correct missing information, and find
replacements for certain fields that may be more useful for internal
processing.
• Structure Data: A structure has to be derived from the raw data so that
analytics can be applied to it.
• Store Data: One of the biggest challenges with Large Data is storing it.
Deriving any sort of intelligence from data of this magnitude requires
effective mechanisms for storing it.
• Use of RDBMS: A lot of traditional data warehouses are built around
RDBMSs. With data warehouses growing to terabytes and beyond,
conventional relational database management systems are no longer
the best option for managing such large data sets.
• Minimizing Response Time: The time needed to perform analytics over
large data is very high, and it must be reduced for companies to derive
timely conclusions.
In order to overcome these challenges, organizations must choose an
appropriate Business Intelligence strategy, based on factors such as ease and
cost of implementation. Since gathering the right information from the data
can be an expensive proposition, companies must select the right BI solution
for their business needs in order to lower the impact on their IT budgets. They
must also check how easily the solution can be implemented and start giving
returns.
At the same time, organizations must identify whether the data being
gathered and stored needs to be analyzed in real time, using high-touch
queries, or whether it can be processed in batches, using a framework like
Hadoop, to produce results in an offline manner.
Companies will additionally need to gauge the best strategy for storing their
large data, based on its size and usage. Data that is available in different
formats, structured or unstructured, might intersect at some level and need to
be combined to provide more useful business insights.
The other challenges to consider are complex computations, data security,
scalability, accessibility, and fault tolerance.
Based on their business needs, companies have to weigh each of these
challenges and identify the BI strategy that best suits them. Properly utilized,
large data can emerge as a differentiator, giving organizations an edge over
the competition. By effectively managing and storing terabytes to petabytes of
data, performing rich analysis on these massive data sets, and doing it all at
ultrafast speeds, organizations can transform their voluminous data into a
business asset.
Finding solutions to large data challenges
Although there are various solutions available for storing large amounts of
data, Hadoop is one of the best options. Hadoop is a flexible infrastructure for
large scale computation and data processing on a network of commodity
hardware; it is a common infrastructure pattern extracted from building
distributed systems. Hadoop takes a large piece of data, breaks it up into
smaller pieces, and distributes them across the nodes of a cluster. The nodes
then process the pieces in parallel and independently, with intermediate
results feeding into subsequent stages. They follow the MapReduce
programming paradigm, a patented software framework introduced by Google
to support distributed computing on large data sets across computer clusters.
MapReduce is significant because it allows developers to create a large variety
of parallel programs without having to worry about intra-cluster
communication, task monitoring, or task failure handling. Previously, creating
programs that handled these issues could consume literally months of
programming time. MapReduce programs have been created for everything
from text tokenization, indexing, and search to data mining and machine
learning algorithms.
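As an illustration of the paradigm, the following is a minimal word-count job
written against the standard Hadoop MapReduce Java API; the input and
output paths are supplied on the command line, and the class names are only
illustrative:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every token in the input split
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sum the counts emitted for each word
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The framework handles splitting the input, scheduling the map and reduce
tasks across the cluster, and re-executing tasks that fail; the developer writes
only the two functions above.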
The best part about Hadoop is that it is an open source Apache project,
created taking inspiration from Google's MapReduce and GFS papers. Today,
Yahoo! is one of the largest contributors to the evolution of Hadoop and is
largely responsible for bringing it to its current state. Hadoop is used
extensively by businesses because of advantages such as:
• The ability to scale linearly to thousands of nodes
• The ability to use commodity hardware, leading to a huge cost advantage
• Flexibility, which gives Hadoop its real power, allowing arbitrary map
and reduce processes to be implemented to solve particular problems
• A simple programming and execution environment that accommodates
failures
• A distributed file system (DFS) for storing data, paired with the
MapReduce execution paradigm
• The availability of tools built around Hadoop, such as Hive and Pig, that
simplify writing MapReduce jobs
Hive, for instance, is a database or data warehouse infrastructure that
functions on top of Hadoop. It removes most of the complications of working
with Hadoop directly and provides programmers with an easy interface to it.
Hive is an effective tool for enabling easy ETL, generating summarizations, and
putting structure on the data. It also has the capability to query and analyze
large data sets stored in Hadoop files.
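As a sketch of how Hive puts structure on raw files and summarizes them, the
following Java snippet issues HiveQL through Hive's JDBC driver. The server
location, table name, and file layout are illustrative assumptions, not taken
from any particular deployment:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveSummarization {
      public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver and connect to a Hive server
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // Put a structure on raw files already sitting in HDFS
        stmt.executeQuery("CREATE EXTERNAL TABLE IF NOT EXISTS raw_txns "
            + "(txn_id STRING, account STRING, amount DOUBLE, ts STRING) "
            + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
            + "LOCATION '/data/raw/transactions'");

        // Summarize: total transaction amount per account
        // (Hive compiles this into MapReduce jobs behind the scenes)
        ResultSet rs = stmt.executeQuery(
            "SELECT account, SUM(amount) FROM raw_txns GROUP BY account");
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
        con.close();
      }
    }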
When Hadoop is used as an ETL tool, it enables the storage of huge amounts of
unstructured data and scales up massively over time as the inflow of data
increases. Inputs arriving in raw form from various sources can be collected
and transformed using Hadoop ETL, and it is possible to create a daisy chain of
Hadoop jobs to crunch a huge data set. The structured data can then be used
for performing analytics, reporting, and deriving conclusions.
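A typical cleansing step can be expressed as a map-only Hadoop job. The
sketch below assumes a hypothetical pipe-delimited record layout and simply
discards malformed records while filling in a missing field:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only cleansing step: keep well-formed records, discard the rest.
    // The pipe-delimited layout and field count are illustrative assumptions.
    public class CleanseMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
      private final Text out = new Text();

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\\|");
        if (fields.length != 5) {
          return;                 // garbage record: throw it away
        }
        String amount = fields[3].trim();
        if (amount.isEmpty()) {
          amount = "0.0";         // replace a missing field with a default
        }
        out.set(fields[0].trim() + "\t" + fields[1].trim() + "\t" + amount);
        context.write(out, NullWritable.get()); // structured, tab-delimited output
      }
    }

Configuring the job with zero reduce tasks makes it a pure filter-and-transform
pass; the tab-delimited output can then feed the next job in the chain.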
When choosing a BI strategy for your organization, there are a few things you
should consider:
1. Overall ease and cost of implementation: Gathering the right
information from the data can be an expensive proposition, and a BI
solution chosen without prudent thought will definitely impact IT
budgets. Another consideration is the ease with which the solution can
be implemented and start giving returns.
2. Real-time analysis vs. batch analysis: One needs to identify whether
the data being gathered and stored must be analyzed in real time,
using high-touch queries, or whether it can be processed in batches,
using a framework like Hadoop, to produce results in an offline
manner.
3. End-user ad hoc analytics: Every enterprise today has the ability to
determine and define its own requirements and needs. Service
providers that merely react to those needs can only play catch-up
rather than act as innovators. Supporting ad hoc analytics also
provides a competitive advantage, since smart customers then have
the option of gathering the data they need themselves.
Based on your business needs, you have to weigh each of these considerations
and identify the right BI strategy for yourself. The next section elaborates on
the popular approaches generally used for achieving BI implementations over
large data.
BI implementation approaches over large data
Approach 1: Analytics over Hadoop with MPP DWs
Today, many options are available in the market that allow the integration of
MPP DWs with Hadoop. This approach is worth considering for users who still
have a large amount of data even after applying summarization over it. Using
Hadoop for cleaning and transforming the data into a structured form allows
them to load the data into any of the available MPP DWs. Once the data is
loaded, they can write UDFs to perform database-level analytics and then
integrate the warehouse with BI solutions using ODBC/JDBC connectivity for
end-user analytics and reporting.
Figure 3 Analytics over Hadoop with MPP DW
Using MPP DWs also allows companies to deploy various performance
enhancement techniques, such as index compression, materialized views,
result-set caching, and I/O sharing. In addition, some MPP DWs provide a
robust framework that supports the execution of MR jobs within their own
clusters at MPP levels, giving organizations two levels of parallel processing.
This feature works very well for high-touch queries and provides an excellent
framework for end-user ad hoc analytics.
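A simple (though not the fastest) way to move summarized Hadoop output
into an MPP DW is a JDBC batch load, sketched below. The vendor JDBC URL,
table, and file names are placeholders; in practice most MPP DWs ship bulk
loaders that outperform row-by-row inserts:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Loads Hadoop-summarized output into an MPP DW over JDBC.
    // "jdbc:vendor://..." stands in for the warehouse vendor's real URL.
    public class MppLoader {
      public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
            "jdbc:vendor://mpp-host:5432/dw", "user", "secret");
        PreparedStatement ps = con.prepareStatement(
            "INSERT INTO txn_summary (account, total) VALUES (?, ?)");

        // Read the summarization job's output straight from HDFS
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader in = new BufferedReader(new InputStreamReader(
            fs.open(new Path("/data/summary/part-r-00000"))));
        String line;
        while ((line = in.readLine()) != null) {
          String[] f = line.split("\t");   // account <tab> total
          ps.setString(1, f[0]);
          ps.setDouble(2, Double.parseDouble(f[1]));
          ps.addBatch();                   // batch inserts to cut round trips
        }
        ps.executeBatch();
        in.close();
        con.close();
      }
    }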
Approach 2: Indirect Analytics over Hadoop
Another interesting approach is to use Hadoop for cleaning and transforming
the data into a structured form and then loading it into RDBMS databases.
Hadoop can efficiently move data between RDBMS sources and Hadoop
systems through its DBInputFormat and DBOutputFormat interfaces. Once the
unstructured data is processed, it can be pushed to an RDBMS database, which
can subsequently act as a data source for any BI solution.
Figure 4 Indirect Analytics over Hadoop
This approach provides the end user with the flexibility of Hadoop's parallel
processing and an SQL interface at the summarized data level. It works well
when the summarized data is not too big to pose a challenge for the RDBMS
database being used, and it is not as expensive as the first approach. It is also
suitable for high-touch queries where the user wants to perform real-time ad
hoc analytics, as most RDBMS databases come with a comprehensive set of
performance enhancement techniques.
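The sketch below shows how a job can be wired so that its reduce output
lands directly in an RDBMS table via DBOutputFormat; the JDBC driver,
connection URL, and table layout are illustrative assumptions:

    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBOutputFormat;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    public class SummaryToRdbms {

      // One summarized row, mapped onto the target table's columns
      public static class SummaryRecord implements DBWritable {
        String account;
        double total;

        public void write(PreparedStatement stmt) throws SQLException {
          stmt.setString(1, account);
          stmt.setDouble(2, total);
        }
        public void readFields(ResultSet rs) throws SQLException {
          account = rs.getString(1);
          total = rs.getDouble(2);
        }
      }

      // Wire a job so its reduce output is inserted into an RDBMS table
      public static void configure(JobConf job) {
        DBConfiguration.configureDB(job,
            "com.mysql.jdbc.Driver",     // JDBC driver on the job classpath
            "jdbc:mysql://db-host:3306/analytics", "user", "secret");
        // Each emitted key (a SummaryRecord) becomes one INSERT
        DBOutputFormat.setOutput(job, "txn_summary", "account", "total");
        job.setOutputKeyClass(SummaryRecord.class);
        job.setOutputValueClass(NullWritable.class);
      }
    }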
Approach 3: Direct Analytics over Hadoop
We can also apply analytics directly over a Hadoop system without moving the
data to any RDBMS database. Analyzing the data directly from the Hadoop file
system can prove to be a very effective practice in scenarios where the
processed and summarized data is itself very large and already resides on the
Hadoop system, and we do not want to get into the complications of moving it
out to an MPP DW or RDBMS. This can be done by using Hive as an interface to
the data present on Hadoop systems.
Figure 5 Direct Analytics over Hadoop
This approach allows you to do batch and asynchronous analytics over the
same data present on the Hadoop system. It is a very cost-effective approach,
as it does not involve the expense of managing a separate data source beyond
your existing Hadoop system. It also provides you with the flexibility of scaling
to any level with your summarized data.
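An ad hoc analytic query in this setup can be issued straight against the data in
HDFS through Hive's JDBC interface. The table and columns below are
hypothetical, and since each query compiles to MapReduce jobs, results arrive
in batch rather than interactive time:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DirectHiveAnalytics {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // Example: top ten accounts by total transaction amount for a month
        ResultSet rs = stmt.executeQuery(
            "SELECT account, SUM(amount) AS total FROM txn_summary "
            + "WHERE month = '2010-09' GROUP BY account "
            + "ORDER BY total DESC LIMIT 10");
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
        con.close();
      }
    }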
The Large Data Processing and BI Strategy Matrix below can be used as a
guideline for choosing the right BI strategy for your business needs.
Figure 6 The Large Scale BI Strategy
Case Study
One of the customers of Intellicus, a global leader in digital marketing
intelligence, was facing the challenge of processing and aggregating large
volumes of statistical advertising data. It wanted to use this data to mine
behavioral insights that would help its clients better understand their own
customers and leverage and profit from the rapidly evolving World Wide Web
and mobile arena.
The problem was not limited to creating a storage system for this data; the
customer also needed to run analytics over it on a monthly basis. The intent
was to extract specific patterns and then aggregate them based on complex
parameters and weighting systems.
Impetus quickly realized that for deriving intelligence from data of this
magnitude, a conventional relational database management system would be
inadequate in the long run. We therefore developed and deployed an
optimized Hadoop-Java implementation of the product, tuning the Hadoop
infrastructure configuration to get the maximum throughput. Intellicus also
used MapReduce and Tokyo Cabinet, a Berkeley DB-like file storage system, to
handle the processing of large metadata (>1.2 GB). The Java-based
implementation helped Intellicus achieve optimum results on the given small
cluster.
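For a flavor of the kind of fast, local key-value lookups Tokyo Cabinet provides
(for example, from within map tasks), here is a minimal sketch using its Java
binding; the file name, keys, and values are purely illustrative and not taken
from the actual project:

    import tokyocabinet.HDB;

    // Minimal Tokyo Cabinet hash-database usage: open a local file,
    // store a metadata record, and read it back.
    public class MetadataStore {
      public static void main(String[] args) {
        HDB hdb = new HDB();
        // Open (or create) a hash database file for the job's metadata
        if (!hdb.open("metadata.tch", HDB.OWRITER | HDB.OCREAT)) {
          System.err.println("open error: " + hdb.errmsg(hdb.ecode()));
          return;
        }
        hdb.put("campaign:1001", "{\"segment\":\"mobile\",\"weight\":0.7}");
        String value = hdb.get("campaign:1001"); // fast local lookup
        System.out.println(value);
        hdb.close();
      }
    }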
Summary
Impetus believes that organizations have to choose the solutions that best suit
their needs in order to solve their Large Data challenges. At the same time,
they also have to select a BI strategy based on their business requirements.
Nevertheless, it is desirable to have all the possible options available under
the same roof, as this helps reduce the complications that arise when dealing
with multiple alternatives to achieve a common goal. According to Impetus,
the ideal solution is to provide easy interfacing between the data present on
Hadoop, the RDBMS, and MPP DWs, and to create a common platform for
doing BI and analytics over that data. When it comes to batch processing,
having out-of-the-box scheduling and asynchronous execution of MR jobs and
BI analytics components, such as dashboards and reports, is critical.
About Impetus
Impetus Technologies offers Product Engineering and Technology R&D services for software product development.
With ongoing investments in research and application of emerging technology areas, innovative business models, and
an agile approach, we partner with our client base comprising large scale ISVs and technology innovators to deliver
cutting‐edge software products. Our expertise spans the domains of Data Analytics, Large Data Management, SaaS,
Cloud Computing, Mobility Solutions, Testing, Performance Engineering, and Social Media among others.
Impetus Technologies, Inc.
5300 Stevens Creek Boulevard, Suite 450, San Jose, CA 95129, USA
Tel: 408.252.7111 | Email: inquiry@impetus.com
Regional Development Centers ‐ INDIA: • New Delhi • Indore • Hyderabad
Disclaimers
The information contained in this document is the proprietary and exclusive property of Impetus Technologies Inc. except as otherwise indicated. No part of
this document, in whole or in part, may be reproduced, stored, transmitted, or used for design purposes without the prior written permission of Impetus
Technologies Inc.