
Introduction to Predictive Analytics

Pekka Malo, Assist. Prof. (statistics)


Aalto BIZ / Department of Information and Service Economy
Learning Objectives for this week
•  Able to explain what predictive analytics is and where it is used
•  Knows basic concepts in predictive modeling (incl. awareness of commonly used algorithms)
•  Awareness of current trends (evolution) in data analytics
•  Tutorial and assignment: Able to apply the C5.0 decision tree algorithm to commonly encountered classification problems

22.02.16
2
Predictive Analytics:
What? Where? Why?
What is Predictive Analytics?
The semantics

Technology that learns from experience (data) to predict the future behavior of individuals

Descriptive analytics vs. Predictive analytics vs. Prescriptive analytics

[Figure: analytics capabilities and the questions they answer]
•  What is happening? Discovery and exploration
•  Why did it happen? Reporting, analysis, content management
•  What could happen? Predictive analytics and modeling
•  What action should I take? Decision management
•  What did I learn, what's best? Cognitive analytics

Source: Presentation on “Predictive Big Data Analytics” by Jussi Ahola @IBM
Source: http://www.predictiveanalyticstoday.com/what-is-predictive-analytics/
Evolution of Analytics

•  Descriptive statistics from data with “high information density”
•  Strong commercial platforms
•  High volume, high velocity and high variety of information (but lower information density)
•  Focus on building the capabilities for processing large data
•  Support for current operations, improved efficiency

•  Predictive analytics (relations, causal effects)
•  Analytics of Things
•  Large sets of data with varying information density
•  Focus on creating new ways of doing business, instead of polishing current operations
•  Fast and agile insights: analytics as part of decision and operational processes

Source: iianalytics.com
What do we expect from data and analytics?
•  Insights
•  Speed
•  More sophisticated / granular insights
•  Cost savings

Typical applications in Retail, Sales, Marketing

Predictive analytics spans a variety of core business objectives
Growing areas of applications

Evidence-based medicine
•  Improving patient care and satisfaction
•  Reducing costs through optimized allocation of resources
•  Measuring and improving patient outcomes

Human capital management
•  Acquiring, growing and retaining employees
•  Helping ensure optimal staff levels
•  Increasing performance, efficiency and engagement

Crime prediction and prevention
•  Identifying predictors of threat and fraud
•  Optimizing force deployment
•  Anticipating and visualizing crime hot spots

Supply chain management
•  Increasing visibility into virtually all areas of the supply chain
•  Decreasing downtime and unpredictability
•  Improving customer satisfaction

Process optimization
•  Improving accurate responses at the point of impact
•  Decreasing costs through operational efficiency
•  Transforming threat and fraud identification processes

Source: Presentation on “Predictive Big Data Analytics” by Jussi Ahola @IBM
What is predictive analytics being used for in your company? Now? Three years from now?

                                         Using today and   Will use within   No plans   N/A or
                                         will keep using   3 years                      don't know
Direct marketing                               58%              13%            20%          9%
Cross-sell/upsell/propensity to spend          55%              21%            16%          8%
Retention analysis                             55%              17%            17%         11%
Portfolio analysis/prediction                  47%              23%            18%         12%
Optimization                                   46%              31%            15%          8%
Risk analysis                                  43%              26%            17%         14%
Econometric forecasting                        34%              31%            21%         14%
Fraud detection                                30%              19%            32%         19%
Quality assurance                              24%              28%            27%         21%
Scientific investigation                       20%              16%            35%         29%
Loan default                                   15%               9%            45%         31%

Figure 2. Based on 126 active respondents.

Retention analysis is a top use case among those currently deploying predictive analytics.

Marketing and sales analysis currently lead the way. The top use cases for predictive analytics among the active group include direct marketing (58%), cross-sell and upsell (55%), and retention analysis (55%). In fact, predictive analytics is currently being used primarily in marketing (64%) and sales ...

Source: Halper, F. (2014): Predictive Analytics for Business Advantage
Source: TDWI’s best practices report 2016 “Operationalizing and Embedding Analytics for Action”
More is better?

Source: The 2015 Nordic survey on Big Data and Hadoop (by SAS Institute and Intel)
What are your top predictive analytics challenges? Select three or fewer.

                                                              Active   Investigating
Lack of skilled personnel                                       24%        19%
Lack of understanding of predictive analytics technology        23%        30%
Inability to assemble necessary data (integration issues)       13%         9%
Not enough budget                                               13%        10%
Business case not strong enough                                 13%        21%
Inability to assemble necessary data (cultural issues)           8%         5%
(label missing in source)                                        4%         3%
The technology is too hard to use                                3%         3%

Figure 5. Based on 126 respondents in the active group and 195 in the investigating group.

Need for education

Source: Halper, F. (2014): Predictive Analytics for Business Advantage
Current users: What are the most popular techniques for predictive analytics in your organization? Those in the investigation phase: Which techniques are you looking at? Select 3 or fewer.

                               Active   Investigating
Linear regression                59%        47%
Decision trees                   57%        57%
Cluster analysis                 51%        40%
Time series models               47%        45%
Logistic regression              30%        28%
Other regression                 17%         7%
Neural networks                  16%        18%
Association rule learning        12%        11%
Naive Bayes                      11%        10%
Support vector machines           6%         2%
Survival analysis                 5%         7%
Ensemble learning                 5%         6%

Figure 12. Based on 126 respondents in the active group and 195 in the investigating group.

Clustering and time series are also popular. As indicated in Figure 12, clustering and time series analysis are also popular techniques for predictive analytics. In fact, time series analysis seems to be more popular than clustering among those investigating the technology than by those actually using it. This might be because clustering is very useful in market segmentation, and marketing and sales are ...

Source: Halper, F. (2014): Predictive Analytics for Business Advantage
Predictive Analytics and Big Data
What is Big Data?

Internet of Things → Big Data

Internet of Things (IoT) =

Source: IoT Infographic by Goldman Sachs

Adoption and future directions

Source: The 2015 Nordic survey on Big Data and Hadoop (by SAS Institute and Intel)

Where is predictive analytics done?

What infrastructure do you have in place for predictive analytics? Now? Three years from now?

                              Using today and   Will use within   No plans   N/A or
                              will keep using   3 years                      don't know
Flat files on servers              75%               4%            15%          6%
An analytics platform              55%              28%             9%          8%
Appliance                          41%              21%            21%         17%
Hadoop                             26%              35%            25%         14%
Public cloud                       18%              29%            38%         15%

Data warehouse/data marts: 89% using today and will keep using; remaining segments 7%, 3% and 1%
Desktop application only: 78% using today and will keep using; remaining segments 15%, 6% and 1%

Figure 14. Based on 126 active respondents.

Figure 14 shows that if respondents stick to their plans, more than 80% of the active-group respondents will be using analytics platforms in the next three years. Slightly more than 60% will use an appliance of some kind.

Likewise, those investigating the technology (not shown) are also looking at analytics platforms (73%) and appliances (46%) over the next three years to help support their predictive analytics needs.

Hadoop is gaining momentum among those using predictive analytics. Perhaps not surprisingly, 61% of the active group has plans for utilizing Hadoop over the next three years as part of their analytics efforts.

Source: Halper, F. (2014): Predictive Analytics for Business Advantage
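
The shares quoted in the text can be checked by adding the first two segments of Figure 14 ("using today and will keep using" plus "will use within 3 years"). A minimal Python sketch of that arithmetic; the dictionary and function names are illustrative and not taken from the report:

```python
# Figure 14 values (percent of the 126 active respondents) for the three
# technologies discussed in the text: (using today, will use within 3 years).
figure_14 = {
    "An analytics platform": (55, 28),
    "Appliance": (41, 21),
    "Hadoop": (26, 35),
}


def expected_in_three_years(using_today, within_three_years):
    """Share expected to use the technology if respondents stick to their plans."""
    return using_today + within_three_years


projected = {
    name: expected_in_three_years(*segments) for name, segments in figure_14.items()
}

for name, share in projected.items():
    print(f"{name}: {share}%")
```

The sums match the wording of the report excerpt: "more than 80%" for analytics platforms, "slightly more than 60%" for appliances, and 61% for Hadoop.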
The Role of Hadoop?

21%  Data processing (parsing, transformation)
17%  Hosts sandboxes for ad hoc analysis
17%  Hosts sandboxes for data mining and machine learning
13%  Data staging area
11%  Enterprise data repository
10%  Analytical operating system (supporting multiple engines)
10%  Data pump (feeding downstream systems)
 9%  Online archiving

The top two uses for Hadoop in analytics architectures are preparing data for analysis and then supporting applications.

Source: Enterprise Hadoop Benchmark Survey, conducted by Wayne Eckerson for TechTarget; based on responses from 1,053 IT and business professionals. Respondents were asked to select all items that applied.

The Scoop on Hadoop

Are users buying vendors’ Hadoop whoop? Some organizations have, and others are showing interest. But some still need convincing.

40%  No plans
37%  Under consideration
11%  Under development
 9%  Partially deployed
 4%  Fully deployed

Source: Enterprise Hadoop Benchmark Survey, conducted by Wayne Eckerson for TechTarget; based on responses from 1,053 IT and business professionals

Partly Cloudy Conditions

Less than 20% of the organizations that have assessed themselves against The Data Warehousing Institute’s big data maturity model are using the cloud to manage pools of big data.

40%  No, but considering it
 7%  Yes, public cloud
 8%  Yes, private cloud
 4%  Yes, a mix of public and private clouds
 6%  Don’t know
35%  No, and no plans to

Source: The Data Warehousing Institute; based on responses collected online from 222 IT and business professionals between October 2013 and May 2014

... organizations were using public, private or hybrid clouds to support big data applications. Another 40% said cloud ...

... content in cloud collaboration systems. Here are some of their concerns:

55%  Exposure of confidential/private data
45%  Legal requirements for data privacy and location
35%  Ownership of content and future migration
25%  Not under records management retention rules

9%: percentage of information managers who would put 100% of their content in the cloud

Source: AIIM’s 2014 report “Content Collaboration and Processing in a Cloud and Mobile World”

Terms, concepts, and commonly used techniques
Predictive modeling
= making predictions about the future based on historical data (past & present)

Example: How to predict a company’s bankruptcy filing?

Input variables (predictors): company financials, sentiment indicators, macro-economic & market data
→ Predictive model (hypothesis ~ a proposed function that is similar to the true function, i.e. the target concept)
→ Target variable: What is the probability that the company files for bankruptcy?
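
As a rough illustration of what a hypothesis can look like, the sketch below maps three input variables to a bankruptcy probability with a logistic function. The feature names, weights and bias are invented for illustration only; in practice a learning algorithm would estimate them from historical data.

```python
import math

# Hypothetical weights and bias; these are assumptions of this sketch,
# not values from the lecture or from any fitted model.
WEIGHTS = {"debt_to_equity": 1.2, "negative_sentiment": 0.8, "gdp_growth": -0.5}
BIAS = -3.0


def bankruptcy_probability(features):
    """Hypothesis h(x): logistic function of a weighted sum of the inputs."""
    score = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


# Two made-up companies described by the same input variables (predictors).
healthy = {"debt_to_equity": 0.5, "negative_sentiment": 0.1, "gdp_growth": 2.0}
distressed = {"debt_to_equity": 4.0, "negative_sentiment": 0.9, "gdp_growth": -1.0}

print(round(bankruptcy_probability(healthy), 3))
print(round(bankruptcy_probability(distressed), 3))
```

The output is the target variable of the slide: a number between 0 and 1 that can be read as the estimated probability of a bankruptcy filing.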

Instances and features
Data = collection of data instances & their features

Features (a.k.a. attributes, fields, characteristics): the columns
Objects (a.k.a. instances, records, cases, entities): the rows

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes


Source: Tan, Steinbach & Kumar: Introduction to data mining
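
One possible way to hold such a table in code is a list of records, one per instance. The sketch below is only an illustration: the column names are lower-cased by hand, and treating Cheat as the target variable (with the other columns except Tid as features) is an assumption of this example.

```python
from collections import Counter

# The ten instances from the table; income is stored in thousands.
COLUMNS = ("tid", "refund", "marital_status", "taxable_income_k", "cheat")
ROWS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Each instance (row) becomes a dict keyed by feature name (column).
instances = [dict(zip(COLUMNS, row)) for row in ROWS]

# Split every instance into a feature vector X and a target label y.
feature_names = ("refund", "marital_status", "taxable_income_k")
X = [tuple(inst[name] for name in feature_names) for inst in instances]
y = [inst["cheat"] for inst in instances]

class_counts = Counter(y)
print(class_counts)
```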


Artificial Intelligence?
Different people approach AI with different goals in mind:
•  Are you concerned with thinking or behavior?
•  Do you want to model humans or work from an ideal standard?
AI is the study and design of intelligent agents, where an
intelligent agent is a system that perceives its environment
and takes actions that maximize its chances of success.
- Wikipedia

“Intelligence is concerned mainly with rational action. Ideally, an intelligent agent takes the best possible action in a situation.”
- Russell & Norvig

Ref: Russell, S. & Norvig, P.: “Artificial Intelligence: A Modern Approach”


Machine Learning?
A field of artificial intelligence where the purpose is to develop algorithms that can learn to perform tasks based on empirical data
Emphasis on two aspects:
•  Automatic recognition of complex patterns within data
•  Ability to generalize beyond the already observed empirical data

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

- The famous definition of learning by Tom M. Mitchell

Supervised vs unsupervised learning
Supervised learning: Learning from selected training data (examples provided by the user / developer), e.g., an agent observes some example input-output pairs (labeled data) and learns a function that maps from input to output

Unsupervised learning: The agent learns patterns in the input data even though it is not given information about the “correct” output values (i.e. no explicit feedback)

Semi-supervised learning: The agent is given some labeled data and a large amount of unlabeled data
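
The difference between the first two settings can be sketched on the same toy data. The supervised learner below is given labels and computes one mean per class; the unsupervised one (a small 1-D k-means) sees only the values and has to find the groups itself. The data, the labels and the choice of initial centroids are all made up for this illustration.

```python
def nearest_centroid_fit(values, labels):
    """Supervised: use the labels to compute one mean (centroid) per class."""
    sums, counts = {}, {}
    for value, label in zip(values, labels):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def nearest_centroid_predict(centroids, value):
    """Predict the class whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def kmeans_1d(values, initial_centroids, iterations=10):
    """Unsupervised: k-means groups the values without ever seeing a label."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster with the nearest mean.
        clusters = [[] for _ in centroids]
        for value in values:
            idx = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[idx].append(value)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["low", "low", "low", "high", "high", "high"]  # only used when supervised

class_centroids = nearest_centroid_fit(values, labels)
print(nearest_centroid_predict(class_centroids, 4.0))

found_centroids, found_clusters = kmeans_1d(values, initial_centroids=[1.0, 12.0])
print(found_centroids, found_clusters)
```

On this well-separated data both approaches end up with the same two group means, but only the supervised model can attach the names "low" and "high" to them.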

Views of Learning
Learning is the removal of our remaining uncertainty
-  We may not know the function's form, but if we have enough examples (prior knowledge), we can learn it?

Learning requires guessing a good, small hypothesis class
-  Start with a small class and enlarge it until it fits the data?

But when are we wrong?
-  Prior knowledge might be wrong
-  Guess of hypothesis class can be wrong (the smaller the class, the more likely it is wrong)

Top 10 algorithms in Data Mining
Ref: Wu et al. (2008): ”Top 10 algorithms in data mining”, Knowl. Inf. Syst.

(1) C4.5 / C5.0 (decision tree algorithms)
•  Given a set of training data, builds a decision tree using information entropy; the tree can then be used for classification tasks
•  Divide-and-conquer: Use hierarchical, joint-variable conditions while breaking solution space into subspaces

(2) K-Means
•  A clustering algorithm which partitions observations into k clusters
•  Each observation belongs to the cluster with the nearest mean
•  Unsupervised

(3) Support-vector Machines (SVM)
•  The aim of SVM is to find the best (maximum margin) classification function to distinguish between members of different classes in the training data
•  Considered to be one of the most robust and accurate methods among all well-known algorithms
•  Supervised
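
To make "builds a decision tree using information entropy" more concrete, the sketch below computes the entropy of a class label and the information gain of two candidate splits. It is not the C4.5/C5.0 implementation: C4.5 ranks splits by gain ratio, a normalized variant, and plain information gain is used here only for brevity. The ten rows are the Refund / Marital Status / Cheat example from Tan, Steinbach & Kumar; function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(rows, feature_index, label_index=-1):
    """Entropy of the labels minus the weighted entropy after splitting."""
    labels = [row[label_index] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[label_index])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# (refund, marital status, cheat)
rows = [
    ("Yes", "Single", "No"),
    ("No", "Married", "No"),
    ("No", "Single", "No"),
    ("Yes", "Married", "No"),
    ("No", "Divorced", "Yes"),
    ("No", "Married", "No"),
    ("Yes", "Divorced", "No"),
    ("No", "Single", "Yes"),
    ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

gain_refund = information_gain(rows, feature_index=0)
gain_marital = information_gain(rows, feature_index=1)
print(round(gain_refund, 3), round(gain_marital, 3))
```

A tree builder would split on the feature with the larger value, then repeat the same calculation inside each resulting subset (the divide-and-conquer step).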

Top 10 algorithms in Data Mining
(4) Apriori
•  One of the most popular approaches to find frequent itemsets from a transaction dataset and derive association rules

(5) EM algorithm (Expectation-Maximization)
•  Finite mixture distributions are a flexible tool for modeling and clustering data observed on random phenomena; the EM algorithm fits such mixtures by maximum likelihood

(6) PageRank
•  Provides a static ranking of web pages by using the link structure as an indicator of an individual page’s quality

(7) AdaBoost
•  An ensemble learning algorithm which uses multiple learners to solve a problem (e.g. a classification task)
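
The Apriori idea from item (4) can be sketched as a level-wise search: an itemset of size k is only counted if all of its subsets of size k-1 were already frequent. The version below is a simplified illustration, not an optimized implementation; the five market baskets and the support threshold are made up.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for all itemsets meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    items = sorted({item for basket in baskets for item in basket})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        # Count how many baskets contain each candidate itemset.
        counts = {c: sum(1 for b in baskets if c <= b) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

frequent = frequent_itemsets(transactions, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
print(confidence(frequent, {"diapers"}, {"beer"}))
```

Association rules are then derived from the frequent itemsets by comparing support counts, as the `confidence` helper does for the rule diapers -> beer.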

Top 10 algorithms in Data Mining
(8) kNN (k-nearest neighbor classification)
•  Finds a group of k objects in the training set that are closest to the test object, and bases the assignment of a label on the predominance of a particular class in this neighborhood
(9) Naive Bayes (“independent feature model”)
•  Given a set of objects (each corresponding to a vector of features) with known classifications, the algorithm uses Bayes' theorem to construct a rule which classifies future objects based on their feature vectors
•  The name of the method follows from the (naive) assumption that the features are independent given the class
(10) CART (Classification And Regression Trees)
•  A non-parametric decision tree learning algorithm
•  The trees are classification trees if the dependent variable is nominal
•  The trees are regression trees if the dependent variable is continuous
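
The kNN rule in item (8) is short enough to write out in full. A minimal sketch, assuming numeric feature vectors, Euclidean distance and a simple majority vote; the six training points and their labels are invented for illustration.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Label a query point by majority vote among its k nearest training points.

    training: list of (feature_vector, label) pairs.
    """
    by_distance = sorted(training, key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"),
    ((7.0, 7.0), "B"),
    ((6.5, 5.5), "B"),
]

print(knn_classify(training, (2.0, 2.0)))
print(knn_classify(training, (6.0, 7.0)))
```

There is no training step beyond storing the examples; all the work happens when a new object has to be classified.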

Challenges in predictive modeling
•  What are good hypothesis spaces?
-  Hypothesis space = space of all hypotheses that can be output by a learning algorithm
-  Hypothesis = proposed function that is believed to be similar to the true function (i.e. target concept)
•  What algorithms can work with the hypothesis space?
-  Can we find any general design principles for machine learning algorithms?
•  Can we optimize predictive accuracy?
-  How to avoid overfitting?
•  When can we be confident in the results?
-  How much training data is needed to be able to find accurate hypotheses?
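
One common way to notice overfitting is to compare accuracy on the training data with accuracy on held-out data. The sketch below contrasts a memorizing model (1-nearest-neighbor) with a simpler threshold rule. The data are made up, one training label is deliberately noisy, and the threshold of 50 is an assumption of this example rather than a learned value.

```python
def one_nn_predict(training, x):
    """Memorizing model: return the label of the closest training example."""
    _, nearest_label = min(training, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def threshold_predict(x, threshold=50):
    """Simpler hypothesis: a single cut-off (the value 50 is assumed)."""
    return 1 if x >= threshold else 0


def accuracy(predict, data):
    """Fraction of (x, label) pairs that the predictor gets right."""
    return sum(1 for x, label in data if predict(x) == label) / len(data)


# (30, 1) is a deliberately noisy label in the training set.
train = [(10, 0), (20, 0), (30, 1), (40, 0), (60, 1), (70, 1), (80, 1), (90, 1)]
test = [(15, 0), (24, 0), (32, 0), (45, 0), (55, 1), (65, 1), (85, 1)]


def nn_predict(x):
    return one_nn_predict(train, x)


print("1-NN       train/test:", accuracy(nn_predict, train), accuracy(nn_predict, test))
print("threshold  train/test:", accuracy(threshold_predict, train), accuracy(threshold_predict, test))
```

The memorizing model is perfect on the training data because it reproduces the noisy label, and that is exactly what costs it accuracy on the held-out examples.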

Review point
•  In your own words, explain what predictive analytics is and how it differs from retrospective/descriptive analytics.
•  List at least 3 application areas where predictive analytics can be used. Think how analytics could help in your current organization.
•  List at least 5 algorithms that are widely used in predictive modeling.
•  Define the 4Vs of Big Data.
•  Describe what are considered challenges in the job of a “data scientist”.
