22.02.16
2
Predictive Analytics:
What? Where? Why?
What is Predictive Analytics?
The semantics
Source: Presentation on “Predictive Big Data Analytics” by Jussi Ahola @IBM
Source: http://www.predictiveanalyticstoday.com/what-is-predictive-analytics/
Evolution of Analytics
• Analytics of Things
Speed
More sophisticated / granular insights
Cost savings
Typical applications in Retail, Sales, Marketing
Predictive analytics span a variety of core business objectives
Growing areas of applications
Evidence-based medicine
• Improving patient care and satisfaction
• Reducing costs through optimized allocation of resources
• Measuring and improving patient outcomes
Supply chain management
• Increasing visibility into virtually all areas of the supply chain
• Decreasing downtime and unpredictability
• Improving customer satisfaction
Scientific investigation
Marketing and sales analysis currently lead the way. The top use cases for predictive analytics among the active group include direct marketing (58%), cross-sell and upsell (55%), and retention analysis (55%). In fact, predictive analytics is currently being used primarily in marketing (64%) and sales
Source: Halper, F. (2014): Predictive Analytics for Business Advantage
Source: TDWI’s best practices report 2016 “Operationalizing and Embedding Analytics for Action”
More is better?
Source: The 2015 Nordic survey on Big Data and Hadoop (by SAS Institute and Intel)
What are your top predictive analytics challenges? Select three or fewer.
[Bar chart comparing the Active and Investigating groups. Challenges legible on the slide: lack of skilled personnel; lack of understanding of predictive analytics technology. Bar values legible: 24%, 19%, 23%, 30%.]
Figure 5. Based on 126 respondents in the active group and 195 in the investigating group.
Source: Halper, F. (2014): Predictive Analytics for Business Advantage
Organizing to execute.
subject was raised in discussions with respondents about challenges. Some organizations have hired
Current users: What are the most popular techniques for predictive analytics in your organization? Those in the investigation phase: Which techniques are you looking at? Select 3 or fewer.
Technique                   Active   Investigating
Linear regression           59%      47%
Decision trees              57%      57%
Cluster analysis            51%      40%
Time series models          47%      45%
Logistic regression         30%      28%
Other regression            17%      7%
Neural networks             16%      18%
Association rule learning   12%      11%
Naive Bayes                 11%      10%
Support vector machines     6%       2%
Survival analysis           5%       7%
Ensemble learning           5%       6%
Figure 12. Based on 126 respondents in the active group and 195 in the investigating group.
Clustering and time series are also popular. As indicated in Figure 12, clustering and time series analysis are also popular techniques for predictive analytics. In fact, time series analysis seems to be more popular than clustering among those investigating the technology than by those actually using it. This might be because clustering is very useful in market segmentation, and marketing and sales are
Source: Halper, F. (2014): Predictive Analytics for Business Advantage
Predictive Analytics and Big Data
What is Big Data?
Internet of Things → Big Data
Internet of Things (IoT) =
Adoption and future directions
Source: The 2015 Nordic survey on Big Data and Hadoop (by SAS Institute and Intel)
Where is predictive analytics done?
happens to have analytics embedded. In addition, Figure 14 shows that if respondents stick to their plans, more than 80% of the active-group respondents will be using analytics platforms in the next three years. Slightly more than 60% will use an appliance of some kind.
What infrastructure do you have in place for predictive analytics? Now? Three years from now?
[Stacked bar chart; only the legend entry "Using today and will keep using" is legible. Segment values as extracted:]
Data warehouse/data marts: 89%, 7%, 3%, 1%
Desktop application only: 78%, 15%, 6%, 1%
Flat files on servers: 75%, 4%, 15%, 6%
Likewise, those investigating the technology (not shown) are also looking at analytics platforms (73%) and appliances (46%) over the next three years to help support their predictive analytics needs.
Source: Halper, F. (2014): Predictive Analytics for Business Advantage
Hadoop is gaining momentum among those using predictive analytics. Perhaps not surprisingly, 61% of the active group has plans for utilizing Hadoop over the next three years as part of their analytics efforts.
The top two ... for Hadoop in analytics architectures are preparing data for analysis and supporting ... applications.
21% Data processing (parsing, transformation)
17% Hosts sandboxes for ad hoc analysis
17% Hosts sandboxes for data mining and machine learning
13% Data staging area
11% Enterprise data repository
10% Analytical operating system (supporting multiple engines)
10% Data pump (feeding downstream systems)
9% Online archiving
Source: Enterprise Hadoop Benchmark Survey, conducted by Wayne Eckerson for TechTarget; based on responses from 1,053 IT and business professionals. Respondents were asked to select all items that applied.
... content in cloud collaboration systems. Here are some of their concerns:
45% Legal requirements for data privacy and location
35% Ownership of content and future migration
25% Not under records management retention rules
9% Percentage of information managers who would put 100% of their content in the cloud
Source: AIIM's 2014 report "Content Collaboration and Processing in a Cloud and Mobile World"
... Institute's big data maturity model are using the cloud to manage pools of big data.
7% Yes, public cloud
8% Yes, private cloud
4% Yes, a mix of public and private clouds
35% No, and no plans to
6% Don't know
(One further response, ending in "considering it", is not fully legible.)
Source: The Data Warehousing Institute; based on responses collected online from 222 IT and business professionals between October 2013 and May 2014
... organizations were using public, private or hybrid clouds
Inputs:
• Company financials
• Sentiment indicators
• Macro-economic & market data
→ Predictive model →
What is the probability that the company files for bankruptcy?
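One common way to turn inputs like these into a probability is a logistic model. The following is a minimal sketch only: the feature names, weights and bias are hypothetical values chosen for illustration and do not come from the slides.

```python
import math

# Hypothetical, hand-picked weights for illustration only; a real model
# would estimate them from historical data.
WEIGHTS = {"debt_to_equity": 1.2, "negative_sentiment": 0.8, "gdp_growth": -0.5}
BIAS = -2.0

def bankruptcy_probability(features):
    """Logistic model: squash a weighted sum of the features into (0, 1)."""
    score = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

# One made-up company described by the three assumed features.
company = {"debt_to_equity": 2.5, "negative_sentiment": 0.6, "gdp_growth": 1.0}
p = bankruptcy_probability(company)
print(f"Estimated probability of bankruptcy: {p:.2f}")
```

With positive weights on debt and negative sentiment, raising either input raises the estimated probability; the negative weight on growth lowers it.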
Instances and features
Data = collection of data instances & their features
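As a small illustration of this view of data, the sketch below stores each instance as one record and each feature as one field. The company names and values are made up.

```python
# Each dict is one instance (a company); each key is a feature. Values are made up.
instances = [
    {"name": "Company A", "debt_to_equity": 0.8, "employees": 120},
    {"name": "Company B", "debt_to_equity": 2.4, "employees": 45},
    {"name": "Company C", "debt_to_equity": 1.1, "employees": 300},
]

feature_names = ["debt_to_equity", "employees"]

# Feature matrix: one row per instance, one column per feature.
X = [[instance[name] for name in feature_names] for instance in instances]

for instance, row in zip(instances, X):
    print(instance["name"], row)
```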
Supervised vs unsupervised learning
Supervised learning
Learning from selected training data (examples provided by the user / developer), e.g., an agent observes some example input-output pairs (labeled data) and learns a function that maps from input to output
Unsupervised learning
The agent learns patterns in the input data even though it is not given information about the "correct" output values (i.e. no explicit feedback)
Semi-supervised learning
The agent is given some labeled data and a large amount of unlabeled data
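The difference between the first two settings can be sketched on one-dimensional toy data. The data, the labels "low"/"high" and both helper functions are assumptions for illustration: the supervised part learns a cut-off from labeled pairs, the unsupervised part only looks for structure in unlabeled inputs.

```python
# Supervised: labeled (input, output) pairs are available.
labeled = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]

def fit_threshold(pairs):
    """Learn a cut-off halfway between the largest 'low' and smallest 'high' input."""
    max_low = max(x for x, y in pairs if y == "low")
    min_high = min(x for x, y in pairs if y == "high")
    return (max_low + min_high) / 2

threshold = fit_threshold(labeled)

def predict(x):
    """Map a new input to an output label using the learned cut-off."""
    return "high" if x > threshold else "low"

# Unsupervised: only inputs, no labels; look for structure (here, the widest gap).
unlabeled = sorted([1.2, 0.9, 8.5, 1.7, 9.1, 8.8])

def split_at_widest_gap(values):
    """Split sorted values into two groups at the largest gap between neighbours."""
    gaps = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return values[:cut], values[cut:]

group_a, group_b = split_at_widest_gap(unlabeled)
print(predict(7.0), group_a, group_b)
```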
Views of Learning
Learning is the removal of our remaining uncertainty
- We may not know the function's form, but if we have enough examples (prior
knowledge), we can learn it?
Top 10 algorithms in Data Mining
Ref: Wu et al. (2008): "Top 10 algorithms in data mining", Knowl. Inf. Syst.
(2) K-Means
• A clustering algorithm which partitions observations into k clusters
• Each observation belongs to the cluster with the nearest mean
• Unsupervised
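The two K-Means bullets above correspond to the two alternating steps of Lloyd's algorithm. A minimal sketch on made-up 2-D points, assuming the first k points are used as initial centroids (real implementations usually initialize more carefully):

```python
def kmeans(points, k, iterations=10):
    """Lloyd's algorithm on 2-D points; initial centroids are the first k points."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```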
Top 10 algorithms in Data Mining
(4) Apriori
• One of the most popular approaches to find frequent itemsets from a transaction dataset and derive
association rules
(6) PageRank
• Provides a static ranking of web pages by using the link structure as an indicator of an individual
page’s quality
(7) AdaBoost
• An ensemble learning algorithm which uses multiple learners to solve a problem (e.g. a classification
task)
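As an illustration of item (6), the link-based ranking behind PageRank can be sketched as a power iteration. The four-page link graph is made up, and the damping factor of 0.85 and the even spreading of rank from pages without outgoing links are assumptions of this sketch, not details given on the slide.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the pages it links to; returns page -> rank."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Made-up web graph; every link target must also appear as a key.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph page C collects links from three pages and ends up ranked highest, while D, which nothing links to, keeps only the baseline share.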
Top 10 algorithms in Data Mining
(8) kNN (k-nearest neighbor classification)
• Finds a group of k objects in the training set that are closest to the test object, and
bases the assignment of a label on the predominance of a particular class in this
neighborhood
(9) Naive Bayes ("independent feature model")
• Given a set of objects (each corresponding to a vector of features) with known
classifications, the algorithm uses Bayes' theorem to construct a rule which classifies
future objects based on their feature vectors
• The name of the method follows from the assumption that the features are
conditionally independent given the class
(10) CART (Classification And Regression Trees)
• A non-parametric decision tree learning algorithm
• The trees are classification trees if the dependent variable is nominal
• The trees are regression trees if the dependent variable is continuous
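As an illustration of item (8), the kNN rule described above can be sketched in a few lines. The training points, the labels "A"/"B" and the choice of k = 3 with squared Euclidean distance are assumptions made for this example.

```python
from collections import Counter

def knn_predict(training, query, k=3):
    """training: list of ((x, y), label); return the majority label of the k nearest."""
    by_distance = sorted(
        training,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2,
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up training set with two well-separated classes.
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]

print(knn_predict(training, (2, 2)))
print(knn_predict(training, (6, 5)))
```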
Challenges in predictive modeling
• What are good hypothesis spaces?
- Hypothesis space = space of all hypotheses that can be output by a learning
algorithm
- Hypothesis = proposed function that is believed to be similar to the true function
(i.e. target concept)
• What algorithms can work with the hypothesis space?
- Can we find any general design principles for machine learning algorithms?
• Can we optimize predictive accuracy?
- How to avoid overfitting?
• When can we be confident in the results?
- How much training data is needed to be able to find accurate hypotheses?
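The overfitting question can be made concrete by comparing error on the training data with error on held-out data. A minimal sketch on made-up data, assuming two hypothetical models: one that memorizes the training points (1-nearest-neighbour) and one least-squares line.

```python
def mse(model, data):
    """Mean squared error of model(x) against y over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def fit_memorizer(train):
    """1-nearest-neighbour regressor: returns the y of the closest training x."""
    def predict(x):
        return min(train, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def fit_line(train):
    """Ordinary least-squares line through the training points."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

# Made-up data: y = 2x, with noise of +1 / -1 added to the training set only.
train = [(0, 1), (1, 1), (2, 5), (3, 5), (4, 9), (5, 9)]
held_out = [(0.4, 0.8), (1.4, 2.8), (2.4, 4.8), (3.4, 6.8), (4.4, 8.8)]

for name, fit in [("memorizer", fit_memorizer), ("line", fit_line)]:
    model = fit(train)
    print(name, "train MSE:", round(mse(model, train), 3),
          "held-out MSE:", round(mse(model, held_out), 3))
```

Under these assumptions the memorizer fits the training points exactly but does worse than the line on the held-out points, which is the pattern the term overfitting refers to.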
Review point
• In your own words, explain what predictive analytics is and
how it differs from retrospective/descriptive analytics.
• List at least 3 application areas where predictive analytics
can be used. Think how analytics could help in your current
organization.
• List at least 5 algorithms that are widely used in predictive
modeling.
• Define the 4Vs of Big Data.
• Describe what are considered the main challenges in the job of a
“data scientist”.