Teaching Material by
Dr.T.Ramkumar
Associate Professor
School of Information Technology & Engineering
Vellore Institute of Technology
Vellore – 632 014
ramkumar.thirunavukarasu@vit.ac.in
What is Big Data?
“Big Data”
“Big Data” refers to datasets whose size is beyond the ability of typical
database software tools to capture, store, manage and analyze.
“Big Data”
Big data is data that exceeds the processing capacity of conventional
database systems. The data is too big, moves too fast, or doesn’t fit the
strictures of your database architectures
“Big Data”
Data whose scale, diversity, and complexity require new architectures,
techniques, algorithms, and analytics to manage it and extract value
and hidden knowledge from it.
“Big Data”
Big data is the realization of greater business intelligence by storing,
processing, and analyzing data that was previously ignored due to the
limitations of traditional data management technologies.
(Source: Harness the Power of Big Data: The IBM Big Data Platform)
Why Big Data? & What makes Big Data?
Key enablers for the growth of “Big Data” are
Availability of data
Facts about Big Data
1 Bit = Binary Digit
8 Bits = 1 Byte
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)
The Changing model of data
The model of generating/consuming data has changed
Old Model - Few companies are generating data; all others are consuming data
Examples:
– Yahoo – over 170PB of data
– Facebook – over 30PB of data
– eBay – over 5PB of data
Examples:
1. E-Promotions: based on your current location, your
purchase history, and what you like, send promotions
right now for the store next to you
2. Healthcare monitoring: sensors monitor your
activities and body; any abnormal measurements
require immediate reaction
Complexity - Variety
Various formats, types, and structures
Supervised Learning
Un-supervised Learning
Ensemble Methods
Text Analysis
Collaborative filtering
Supervised Learning
Un-Supervised Learning
Ensemble Methods
Text Analysis
Collaborative Filtering
Collaborative Filtering Types
• User-based collaborative filtering predicts a test user’s
interest in a test item based on rating information from
similar user profiles
• Item-based collaborative filtering is a form of
collaborative filtering based on the similarity between
items, calculated using people's ratings of those items
• Many different similarity measures are used
o Cosine, Pearson, …
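The two similarity measures named above can be sketched in a few lines of Python. This is a minimal illustration, not a production recommender: the `ratings` table, the user names and the helper names (`co_rated`, `item_column`) are made up for the example, and similarities are computed only over co-rated items, which is one common convention among several.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}. Values are illustrative only.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 2, "C": 3},
    "carol": {"A": 1, "B": 5, "C": 2},
}

def cosine(x, y):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def pearson(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    n = len(x)
    if n == 0:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    return cosine([a - mx for a in x], [b - my for b in y])

def co_rated(u, v):
    """User-based view: both users' ratings on the items they have in common."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    return [ratings[u][i] for i in common], [ratings[v][i] for i in common]

def item_column(item):
    """Item-based view: one item's ratings across all users who rated it."""
    return [r[item] for _, r in sorted(ratings.items()) if item in r]

print(pearson(*co_rated("alice", "bob")))             # close to +1: same taste
print(pearson(*co_rated("alice", "carol")))           # negative: opposite taste
print(pearson(item_column("A"), item_column("B")))    # item-to-item similarity
```

On this toy data, cosine similarity is positive for both user pairs because all ratings are positive, while Pearson separates them because it removes each user's average rating first.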
Evolution of Analytic Tools
Nature of Analysis
– Batch processing
– Parallel execution
– Spread data over a cluster of servers and take the
computation to the data
Analysis conducted at lower cost
Analysis conducted in less time
Greater flexibility
Linear scalability
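The idea of spreading data over servers and taking the computation to the data can be sketched as follows. This is a single-machine illustration only: threads stand in for cluster nodes, the `logs` sample is made up, and the function names (`partition`, `local_count`, `cluster_word_count`) are hypothetical rather than part of any Hadoop API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(records, n_parts):
    """Spread the records over n_parts 'servers' (round-robin)."""
    return [records[i::n_parts] for i in range(n_parts)]

def local_count(part):
    """Runs where the data lives: count words in one partition only."""
    counts = Counter()
    for line in part:
        counts.update(line.lower().split())
    return counts

def cluster_word_count(records, n_parts=3):
    """Count in parallel per partition, then merge the small partial results."""
    parts = partition(records, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(local_count, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total

# Hypothetical input lines.
logs = ["big data moves fast", "data is big", "fast data"]
print(cluster_word_count(logs).most_common(3))
```

Only the per-partition counts travel to the merge step, not the raw records, which is why adding partitions lets the same job cover more data.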
What analysis is possible with Hadoop?
Text Mining
Index Building
Pattern Recognition
Collaborative filtering
Model Prediction
Sentiment Analysis
Risk Assessments
Modeling true risk
Challenge:
How much risk exposure does an organization really have
with each customer?
Solution with Hadoop:
Source and aggregate disparate data sources to build a data
picture, e.g. credit card records, call recordings, chat
sessions, e-mails, banking activity
Structure and analyze: sentiment analysis, graph creation,
pattern recognition
Typical Industry:
Financial services - banks, insurance companies
Customer Churn Analysis
Challenge:
Why are organizations really losing customers?
Solution with Hadoop:
Rapidly build a behavioral model from disparate
data sources
Structure and analyze with Hadoop – traversing,
graph creation, pattern recognition
Typical Industry:
Telecommunication, Financial Services
Recommendation Engine / Ad targeting
Challenge:
Using user data to predict which products to
recommend
Solution with Hadoop:
Batch processing framework – allows execution in
parallel over large data sets
Collaborative filtering – collect taste information
from many users and use it to predict
what similar users like
Typical Industry:
E-Commerce, Manufacturing, Retail, Advertising
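A small sketch of how taste information from many users can be turned into a recommendation. It assumes a user-based approach with cosine similarity over co-rated items and a similarity-weighted average as the predicted rating. The `ratings` data and the function names are hypothetical, and a real system would run this as a batch job over far larger data.

```python
from math import sqrt

# Hypothetical taste data: user -> {product: rating}.
ratings = {
    "u1": {"book": 5, "phone": 4, "lamp": 1},
    "u2": {"book": 4, "phone": 5, "tent": 5},
    "u3": {"book": 1, "lamp": 5, "tent": 2},
}

def cosine_sim(a, b):
    """Cosine similarity of two users over the products both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = sqrt(sum(a[i] ** 2 for i in common))
    nb = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (na * nb)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        weight = cosine_sim(ratings[user], theirs)
        num += weight * theirs[item]
        den += weight
    return num / den if den else None

def recommend(user, k=1):
    """Top-k products the user has not rated, ranked by predicted rating."""
    unseen = {i for r in ratings.values() for i in r} - set(ratings[user])
    scored = [(predict(user, i), i) for i in unseen]
    scored = [(s, i) for s, i in scored if s is not None]
    return [i for s, i in sorted(scored, reverse=True)[:k]]

print(recommend("u1"))  # ['tent']
```

Here u1's predicted rating for the tent leans towards u2's rating of 5 rather than u3's rating of 2, because u1's past ratings resemble u2's much more closely.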
Point of sale transaction analysis
Challenge:
Analyzing Point of Sale (PoS) data to target
promotions and manage operations
Sources are complex and data volumes grow across
chains of stores and other sources
Solution with Hadoop:
Batch processing framework – allows execution in
parallel over large data sets
Pattern recognition
Optimizing over multiple data sources
Utilizing information to predict demand
Interactive analysis of big data
Challenge:
Analyzing real-time data series from a network of sensors
Calculating average frequency over time is extremely tedious because of the
need to analyze terabytes of data
Solution with Hadoop:
Take the computation to the data
Expand from simple scans to more complex data mining
Better understand how the network reacts to fluctuations
Discrete anomalies may, in fact, be interconnected
Identify the leading indicators of component failure
Typical Industry:
Telecommunication, Data centers
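One way to compute an average over data that lives on several nodes is to have each node reduce its raw readings to (sum, count) pairs and merge only those pairs. The sketch below assumes hypothetical sensor readings of the form (sensor id, hour, frequency) held on two nodes. The names `node_a`, `node_b`, `local_aggregate` and `merge` are illustrative and not taken from any Hadoop API.

```python
from collections import defaultdict

# Hypothetical readings: (sensor_id, hour, frequency_hz), stored on two nodes.
node_a = [("s1", 0, 50.0), ("s1", 0, 50.2), ("s2", 0, 49.8)]
node_b = [("s1", 0, 49.9), ("s2", 0, 50.4), ("s2", 1, 47.0)]

def local_aggregate(readings):
    """Runs on each node: reduce raw readings to [sum, count] per (sensor, hour)."""
    acc = defaultdict(lambda: [0.0, 0])
    for sensor, hour, value in readings:
        acc[(sensor, hour)][0] += value
        acc[(sensor, hour)][1] += 1
    return acc

def merge(partials):
    """Combine the per-node [sum, count] pairs, then divide once at the end."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (s, n) in partial.items():
            total[key][0] += s
            total[key][1] += n
    return {key: s / n for key, (s, n) in total.items()}

averages = merge([local_aggregate(node_a), local_aggregate(node_b)])
for key in sorted(averages):
    print(key, round(averages[key], 3))
```

Sums and counts are carried instead of per-node averages because averaging the averages would weight each node equally, regardless of how many readings it holds.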
Threat Analysis / Trade Surveillance
Challenge:
Detecting threats in the form of fraudulent activity or
attacks
Large data volumes involved
Like looking for a needle in a haystack
Some excellent tools are available that mine data from the web
and transform it for use in a big data analytics platform
Some are free, and some are available for a fee
https://www.programmableweb.com/api/extractiv
(Automatically converts unstructured text into structured
semantic data)
https://www.mozenda.com/
(Quickly turn web page content into structured data)
openrefine.org
http://commoncrawl.org/
(Maintain an open repository of web crawl data that can be
accessed and analyzed by anyone)
https://books.google.com/ngrams/
(Optimized for quick inquiries into the usage of small sets
of phrases)
Data Analytics Life Cycle
Phase – 1 Discovery
• Learning the business domain
• Availability of resources to support the project in terms of
people, technology, data and time
• Framing the business problem as an analytical challenge
• Identifying the key stakeholders
• Interviewing the analytics sponsor
• Developing initial hypotheses
• Identifying potential data sources
Phase – 2 Data preparation