Machine Learning - Practitioner Basic - v1.0

Machine Learning for Practitioners
Level Basic
What we plan to do today
Introduce Data Science
Classification
Continuous
Clustering
Time Series Analysis
Decision Trees
Introduction to Data Science
Machine Learning - Introduction
Field of study that gives computers the ability to learn without being explicitly programmed.
Machine Learning Vs Reporting
Systematic study and extraction of knowledge from data.
Unlike database querying, which asks "What data satisfies this pattern (query)?", discovery asks "What patterns satisfy this data?"
Data Driven Insights
Till Now:
Implement an idea based on an established theory.
Collect data to validate the theory.
Future:
Look at the data and ask the right question to obtain
useful insights.
Keep in mind: simple is not just good, it's great!
Skill Sets of a Data Scientist
Math & Statistics
Domain Knowledge & Soft Skills
Programming & Database
Communication & Visualization
The Machine Learning Process
1. Identify the Business Problem
2. Gather the Data
3. Transform and Sanitize the Data
4. Explore the Data
5. Perform Statistical Analysis
6. Interpret and Improve the Results
7. Business Interpretation of Results
Classification
Classification Problems - An Overview
Common types of classification problems are as follows:

Classifying Email as SPAM or Not SPAM
Classifying tumor as Malignant or Benign
Classifying online transaction as Fraudulent or Not
Classifying Test Match Outcomes as Win, Loss or Draw
The classification problem could be Binomial (two classes) or Multinomial (more than two classes).
Data sources and data types
There are all types of data sources and data types:

Data Sources:
We have websites, databases or files
Data Types:
Structured: Numeric, Categories
Semi-Structured: Books, Brochures/Documents, E-Mails, etc.
Un-Structured: Social Media, Images, Voice, etc.
We will look at only loading text and csv files during this course. These are the most generic forms for a dataset.
Let's Look at a Case Study: Titanic
On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive.
How do we predict survival possibility for such an incident?
"Well boys, do your best for the women and children, and look out for yourselves."
Probability Theory
Chance of an Outcome
Conditional Probability
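A minimal Python sketch of the two ideas on this slide (the course itself works in R). The records are a hypothetical toy sample, not values read from the Titanic file, and the helper names are made up for illustration.

```python
# Hypothetical toy records of (sex, survived); NOT taken from the Titanic data.
passengers = [
    ("female", 1), ("female", 1), ("female", 0),
    ("male", 0), ("male", 0), ("male", 1), ("male", 0), ("male", 0),
]

def probability(records, event):
    """Chance of an outcome: matching records divided by all records."""
    return sum(1 for r in records if event(r)) / len(records)

def conditional_probability(records, event, given):
    """P(event | given): keep only records where `given` holds, then count."""
    subset = [r for r in records if given(r)]
    return probability(subset, event)

p_survived = probability(passengers, lambda r: r[1] == 1)
p_survived_given_female = conditional_probability(
    passengers, lambda r: r[1] == 1, lambda r: r[0] == "female"
)
print(p_survived, p_survived_given_female)
```

In this toy sample, knowing the passenger is female raises the estimated chance of survival, which is the kind of pattern the case study looks for.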
Let's Look at the Data

Columns and their types:
PassengerId: Integer
Survived: Integer
PClass: Integer
Name: Factor
Sex: Factor
Age: Numeric
SibSp: Integer
Parch: Integer
Ticket: Factor
Fare: Numeric
Cabin: Factor
Embarked: Factor

Sample rows (Name; Sex; Age; Ticket; Fare; Cabin; Embarked):
Braund, Mr. Owen Harris; male; 22; A/5 21171; 7.25; (blank); S
Cumings, Mrs. John Bradley; female; 38; PC 17599; 71.2833; C85; C
Heikkinen, Miss. Laina; female; 26; STON/O2. 3101282; 7.925; (blank); S
Futrelle, Mrs. Jacques Heath; female; 35; 113803; 53.1; C123; S

Column descriptions:
PassengerId: Id of the row. It's a reference.
Survived: 0 means Not Survived, 1 means Survived.
PClass: Class of the passenger. 3 classes present: 1st class carrying the value 1, 2nd class with value 2 and 3rd class with value 3.
Name: Name of the passenger.
Sex: Male/Female.
Age: Specifies the age of the individual.
SibSp: No: of siblings/spouses aboard.
Parch: No: of parents/children aboard.
Ticket: Ticket Number.
Fare: Passenger Fare.
Cabin: Cabin Number.
Embarked: Port on which embarked. C = Cherbourg; Q = Queenstown; S = Southampton.
A brief look at different types of structured data
Data Types
Categorical
Nominal: Variables with no ranking or order. Ex: Gender
Ordinal: Variables with ordered values. Ex: Performance (Good, Ok, Bad)
Numeric
Discrete: Count of something. Ex: Parch (Parents and Children)
Continuous: Information measured on some scale. Can take decimal values. Ex: Age
Food For Thought
What is a dummy variable?

A dummy variable or indicator variable is an artificial variable created to represent an attribute with two or more distinct categories/levels.
Why do we need this?

A category can be nominal or ordinal. Ordinal categories are ordered and nominal are not. Dummy variables are needed when working with nominal data.
Dummy Variables in the Titanic Data Set
Which variables need to be converted into dummy among the following?
Embarked
PClass
Sex
SibSp
Parch
How does a dummy variable look?

Before Transformation: Sex, Embarked
After Transformation: Sexmale, Sexfemale, EmbarkedS, EmbarkedC, EmbarkedQ
Sample values of Sex before transformation: male, female, female, female
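The transformation can be sketched in a few lines of Python (an illustration only; the course uses R). The function name `make_dummies` is hypothetical, and the Embarked values are assumed sample values; the output column names follow the "After Transformation" names on the slide.

```python
def make_dummies(values, prefix, levels):
    """Turn one categorical column into 0/1 indicator columns, one per level."""
    return [{f"{prefix}{lvl}": int(v == lvl) for lvl in levels} for v in values]

sex = ["male", "female", "female", "female"]
embarked = ["S", "C", "S", "S"]  # assumed sample values for illustration

sex_dummies = make_dummies(sex, "Sex", ["male", "female"])
embarked_dummies = make_dummies(embarked, "Embarked", ["S", "C", "Q"])

# Merge the indicator columns row by row.
rows = [{**s, **e} for s, e in zip(sex_dummies, embarked_dummies)]
for row in rows:
    print(row)
```

Each row now carries exactly one 1 among the Sex columns and exactly one 1 among the Embarked columns.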
Let's study the data
Intuitively, which columns are relevant in predicting the survival of the passenger (Survived)?
Pclass
Sex
Age
SibSp (Maybe)
Parch (Maybe)
Fare
Embarked (Maybe)
Are Any of These Columns Related?

PClass and Fare will be related to one another.
Embarked and Pclass could have a combined impact on Fare.
Some common descriptive statistics used to explain data
Mean
Sum of all values / total number of values. Works when values are not skewed.
A mother 6 feet tall and a son 2 feet tall cross a river 3 feet deep.
On average, the pair making the crossing are 4 feet tall: (6 + 2)/2 = 8/2 = 4 feet.
Going by the mean they don't have a problem, yet the son is shorter than the river is deep. The mean hides the extreme value.
Median
Middle value of a dataset.
Odd number of values: pick the middle one.
Even number of values: take the average of the two values that lie in the middle.
What is the average salary of the attendees in this classroom?
What if Frank joins us in this classroom?
What should I do first with the dataset in order to manually pick the median?
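A small Python sketch of the salary question, using the standard `statistics` module. The salary figures are hypothetical, and "Frank" is simply assumed to earn far more than everyone else in the room.

```python
import statistics

# Hypothetical salaries (in thousands) of the attendees.
salaries = [40, 45, 50, 55, 60]
mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)

# Frank joins with a far higher salary (assumed value).
with_frank = salaries + [5000]
mean_after = statistics.mean(with_frank)
median_after = statistics.median(with_frank)  # median() sorts the data internally

print(mean_before, median_before)
print(mean_after, median_after)
```

One extreme value drags the mean a long way while the median barely moves, which is why the median is preferred for skewed data.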
Mode
The most frequent value in a dataset. Useful when the data is of categorical nature.
Food For Thought
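What are the different ways to replace missing values?
Mean imputation
Median imputation
Mode imputation
Forward propagation (forward fill: carry the last observed value forward)
Backward propagation (backward fill: carry the next observed value backward)
Decision tree based imputation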
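The simpler options in this list can be sketched in plain Python as follows. The values are hypothetical, `None` stands in for a missing value, and the helper names are made up for illustration; decision tree based imputation is not shown.

```python
import statistics

def fill_constant(values, fill):
    """Replace every missing value (None) with one fixed value."""
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def backward_fill(values):
    """Carry the next observed value backward (forward fill on the reversed list)."""
    return forward_fill(values[::-1])[::-1]

ages = [22, None, 26, 35, None, 54]   # hypothetical Age column
observed = [a for a in ages if a is not None]
mean_filled = fill_constant(ages, statistics.mean(observed))
median_filled = fill_constant(ages, statistics.median(observed))

embarked = ["S", None, "C", "S"]      # hypothetical categorical column
mode_filled = fill_constant(
    embarked, statistics.mode([e for e in embarked if e is not None])
)
print(mean_filled, median_filled, mode_filled)
print(forward_fill(ages), backward_fill(ages))
```

Mean and median suit numeric columns, mode suits categorical columns, and the fill methods make most sense when the row order is meaningful.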
Looking closely at the age feature
The outlier concept
What is an outlier?
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
How to handle an outlier?

1. Examine the data in the context of the outcome predicted.
2. Remove unusual observations that are far removed from general observations.
Pick the outlier from the image in this slide.
Understanding box plots - Five Point Summary
Maximum Value
75th Percentile Value
Median (50th Percentile Value)
25th Percentile Value
Minimum Value
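A sketch of the five point summary in Python, using `statistics.quantiles` with the "inclusive" method (other quartile conventions give slightly different values). The ages are hypothetical. The 1.5 x IQR fence used to flag outliers is a common box plot convention and is an assumption here, not something stated on the slide.

```python
import statistics

ages = [2, 18, 22, 26, 30, 35, 38, 54, 80]  # hypothetical ages

# Quartiles: 25th, 50th (median) and 75th percentile values.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
five_point_summary = (min(ages), q1, median, q3, max(ages))

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

print(five_point_summary, outliers)
```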
Null Hypothesis and Alternate Hypothesis
Null Hypothesis
The null hypothesis (H0) is a hypothesis which the researcher tries to disprove, reject or nullify. The 'null' often refers to the common view of something, while the alternative hypothesis is what the researcher really thinks is the cause of a phenomenon.
Alternate Hypothesis
The alternative hypothesis, denoted by H1 or Ha, is the hypothesis that sample observations are influenced by some nonrandom cause.
Mathematical understanding of variable importance
We have categorical data and continuous data.
Both the target variable and the predictor variables can be any of these two types.
In order to mathematically establish the importance of a predictor in accurately predicting values in the target variable (continuous/categorical), we need to perform one of the following:
Test: Left Variable, Right Variable
ANOVA: Continuous, Categorical
Chi-Square: Categorical, Categorical
Pearson's Coeff: Continuous, Continuous
Let's look at the Titanic case study example:
Target Variable = Survived (Categorical Variable)
Categorical Predictors: Pclass, Sex, SibSp, Parch, Embarked
Checking the importance of these variables:
The Chi-Square Test
What is the Chi-Square Test?

It is a test conducted across multiple samples to identify whether they are different. Here the target variable and the independent variables should be categorical (discrete) in nature.
What does the Chi-Square test do?
It compares the observed counts in each combination of categories with the counts expected if the variables were unrelated. The larger the gap between observed and expected counts, the stronger the evidence that the samples differ.
When do I use Chi-Square?

To test goodness of fit, or to test association between categorical variables.
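A minimal Python sketch of the chi-square statistic for a test of association. The contingency table holds hypothetical counts (rows: Sex, columns: Survived), not the real Titanic counts, and the function name is made up for illustration. Turning the statistic into a p-value needs the chi-square distribution, which is left out here.

```python
def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all cells, where
    expected = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    return statistic

# Hypothetical counts: [not survived, survived] for female and male.
table = [
    [20, 80],   # female
    [80, 20],   # male
]
chi2 = chi_square_statistic(table)
degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
print(chi2, degrees_of_freedom)
```

A large statistic relative to the degrees of freedom suggests the predictor and the target are associated.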
The concept of sample and population
Population
All possible data points that can be used to identify a pattern for a specific objective
Example: All the job applications received by Cognizant
Sample
A representative subset of the population
Example: All the job applications received by Cognizant from applicants who have more than 4 years of experience
Why are we doing this?
Sampling Techniques
Simple Random Sample

A random selection of a subset of data. Each element from the population has an equal chance of being selected while choosing a sufficiently large sample.
Stratified Random Sampling

Selection of elements based on the distributions observed in the critical parameters, maintaining the same ratios in both the sample and the population.
Cluster Sampling
With cluster sampling one should:
divide the population into groups (clusters);
obtain a simple random sample of so many clusters from all possible clusters;
obtain data on every sampling unit in each of the randomly selected clusters.
Simple Random Sampling
Simple random sampling is the basic sampling technique where we select a group of subjects (a sample) for study from a larger group (a population). Each individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample.
Stratified Sampling
Stratified sampling is a divide-and-conquer sampling strategy. The population is divided into groups called strata. The strata are chosen so that similar cases are grouped together, then a second sampling method, usually simple random sampling, is employed within each stratum. Stratified sampling is especially useful when the cases in each stratum are very similar with respect to the outcome of interest.
Derived from https://www.openintro.org/stat/
Cluster Sampling
A cluster sample is much like a two-stage simple random sample. We break up the
population into many groups, called clusters. Then we sample a fixed number of
clusters and collect a simple random sample within each cluster. This technique is
similar to stratified sampling in its process, except that there is no requirement in
cluster sampling to sample from every cluster. Stratified sampling requires
observations to be sampled from every stratum.
Derived from https://www.openintro.org/stat/
Model Validation Process
Titanic Data
Training Data: Data for training the model, 60% of the overall data available
Cross Validation Data: Data for validating the model, 20% of the overall data available
Test Data: Data for testing the model, 20% of the overall data available
Applying stratified sampling to Titanic dataset
Original Population: 891 Data Points, 65% Males, 35% Females
Stratified Sampling using Sex variable
Training ~ 534: 65% Males, 35% Females
Cross Validation ~ 178: 65% Males, 35% Females
Test ~ 178: 65% Males, 35% Females
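A Python sketch of a 60/20/20 stratified split on Sex (the course itself uses R). The population is synthetic: 891 placeholder records with 577 males and 314 females, roughly the 65%/35% ratio quoted on the slide. The function name, the seed and the rounding rule are illustrative assumptions, so the split sizes differ by a record or two from the approximate figures on the slide.

```python
import random
from collections import defaultdict

def stratified_split(records, key, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle each stratum separately and cut it into train/cv/test parts,
    so every part keeps roughly the same ratio between strata."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    train, cv, test = [], [], []
    for members in strata.values():
        rng.shuffle(members)
        n_train = int(len(members) * fractions[0])
        n_cv = int(len(members) * fractions[1])
        train += members[:n_train]
        cv += members[n_train:n_train + n_cv]
        test += members[n_train + n_cv:]   # remainder goes to the test set
    return train, cv, test

# Synthetic population standing in for the Titanic rows.
population = [{"id": i, "Sex": "male" if i < 577 else "female"} for i in range(891)]
train, cv, test = stratified_split(population, key=lambda r: r["Sex"])

def male_share(part):
    return sum(1 for r in part if r["Sex"] == "male") / len(part)

print(len(train), len(cv), len(test))
print(round(male_share(train), 3), round(male_share(cv), 3), round(male_share(test), 3))
```

Every part ends up with close to 65% males, which is the point of stratifying on the Sex variable.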
Types of Models
There are many ways to perform classification in R:

Logistic Regression (Basic Algorithm)
Decision Trees
Random Forest
Support Vector Machines
Neural Networks
Let's take a look at logistic regression.
Model Diagnostics Reduction In Variance
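Rows: Predictive Result/Classification. Columns: Actual Result/Classification.
Predicted Yes, Actual Yes: True Positive
Predicted Yes, Actual No: False Positive
Predicted No, Actual Yes: False Negative
Predicted No, Actual No: True Negative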
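The four cells can be counted with a short Python sketch. The labels are hypothetical (1 = Yes, 0 = No); accuracy, precision and recall are added as common summaries of the table and are not named on the slide.

```python
def confusion_counts(actual, predicted):
    """Count the four cells of the table for binary labels (1 = Yes, 0 = No)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical actual outcomes and model predictions.
actual =    [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)   # of the predicted Yes, how many were right
recall = tp / (tp + fn)      # of the actual Yes, how many were found
print(tp, fp, fn, tn, accuracy, precision, recall)
```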
Scoring the model
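Area Under the Curve (AUC)

The area under the ROC curve, which plots the true positive rate against the false positive rate as the classification threshold varies. It summarizes how accurately the logistic regression separates the values of the target variable: 1.0 is a perfect ranking and 0.5 is no better than chance.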
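One way to compute AUC without drawing the curve uses its ranking interpretation: the share of (positive, negative) pairs in which the positive case gets the higher score, with ties counting half. The Python sketch below assumes hypothetical labels and predicted probabilities; the function name is made up for illustration.

```python
def auc_from_scores(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical outcomes and predicted probabilities of survival.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
auc = auc_from_scores(labels, scores)
print(auc)
```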
Continuous
Continuous Data Prediction A Case Study
Singapore House Price Prediction

Singapore has a well developed house leasing market, where the features of a house play a key role in determining its price.
We are going to look at features like size of the house, no: of bedrooms, house type, etc. to predict the right lease price.
In this challenge, we ask you to predict the right price point for a house in Singapore.
As usual, let's look at the data

Sample rows (Area; Block; Address; Floor; SQMS; Lease Commence Date; Approval Date; Flat Type; Resale Price; lat; long):
Bedok; 520; Bedok Nth Ave 1; 11 to 15; 67; 1979; 4/1/2013; New Generation; 370000; 1.3213587; 103.931801
Bedok; 105; Bedok Nth Ave 4; 06 to 10; 67; 1977; 4/1/2013; New Generation; 361000; 1.3213587; 103.931801
Bedok; 74; Bedok Nth Rd; 06 to 10; 70; 1978; 4/1/2013; Improved; 358000; 1.3213587; 103.931801
Bedok; 77; Bedok Nth Rd; 11 to 15; 60; 1986; 4/1/2013; Improved; 323000; 1.3213587; 103.931801

Column descriptions:
Area: Location of the property
Block: Pin Code of the property location
Address: Street Name for the property
Floor: A floor group within which the property lies
SQMS: Area/Size of the property
Lease Commence Date: Start date of the current lease contract
Approval Date: Latest approval date for the property
Flat Type: Types of construction
Resale Price: Price to be predicted. The target variable
lat: Latitude of the building location
long: Longitude of the building location
In Class Exercise
Place the variables in the right bucket.
Buckets:
Categorical: Nominal, Ordinal
Numeric: Discrete, Continuous
Variables:
Area
Block
Address
Floor
SQMS
Flat Type
Lease Commence Date
Resale Price
Approval Date
long
lat
Food For Thought
What is feature engineering?

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.
Identify a couple of features for the Singapore Price Prediction Problem

1. Lease Period Value <- (Typical Lease Period - Lease Period Expired) / Typical Lease Period
2. Approved Period <- (Approval Period - Time Expired Since Approval)
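The first feature can be sketched in Python as below. The 99-year typical lease period and the choice to measure the expired lease at the approval year are assumptions made for illustration; the function name is hypothetical.

```python
def lease_period_value(lease_commence_year, approval_year, typical_lease_years=99):
    """Fraction of the lease still remaining at the approval date.
    typical_lease_years=99 is an assumed default, not a value from the slides."""
    lease_expired = approval_year - lease_commence_year
    return (typical_lease_years - lease_expired) / typical_lease_years

# Lease commenced in 1979, approval date in 2013 (as in the first sample row).
value = lease_period_value(1979, 2013)
print(round(value, 3))
```

A value near 1 means a nearly new lease and a value near 0 means the lease is close to expiry, so the feature puts all flats on one comparable scale.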
ANOVA - Analysis of Variance
What is ANOVA?
It is a test conducted across multiple samples to identify whether they are different. Here the target variable should be continuous in nature and the independent variables should be discrete in nature.
What does ANOVA do?
It compares how much the samples vary between each other with how much they vary within themselves, expressed as a ratio. This metric tells us how distinct or homogeneous the samples are.
Total Variation = Between Group Variation + Within Group Variation
When do I use ANOVA?
When I want to find out whether the discrete independent variable is significant in determining the continuous variable value.
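The ratio described above is the F statistic of a one-way ANOVA. A minimal Python sketch follows, with hypothetical toy groups and a made-up function name; converting F into a p-value needs the F distribution and is not shown.

```python
def one_way_anova_f(groups):
    """F = (between group variation / (k - 1)) / (within group variation / (n - k))."""
    all_values = [v for g in groups for v in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f_stat, ss_between, ss_within

# Hypothetical values of a continuous target for three categories.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, ss_between, ss_within = one_way_anova_f(groups)
print(f_stat, ss_between, ss_within)
```

A large F means the group means differ by more than the spread inside the groups would explain, so the categorical variable looks significant.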
ANOVA Visual Explanation
Homogeneous Groups: Significant
(Group 1, Group 2, Group 3)
Heterogeneous Groups: Not Significant
(Group 1, Group 2, Group 3)
Pearson's Coefficient
What is Pearson's Coefficient?
Pearson's coefficient is a measure of the linear relationship between two continuous variables.
What values will it take?

It can take any value between -1 and +1.
What is the interpretation of the values?

+1 : Perfect positive correlation
0 : No linear correlation
-1 : Perfect negative correlation
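A short Python sketch of the coefficient, applied to the SQMS and Resale Price values of the four sample rows shown earlier. Four rows are far too few for a real conclusion; the function name is made up and the point is only to show the calculation.

```python
import math

def pearson(xs, ys):
    """Covariance of x and y divided by the product of their spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

sqms = [67, 67, 70, 60]
resale_price = [370000, 361000, 358000, 323000]
r = pearson(sqms, resale_price)
print(round(r, 3))
```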
Handling Multicollinearity
What is multicollinearity?
Multicollinearity (also collinearity) is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.
Why should I worry about this?
Severe multicollinearity is a problem because it can increase the variance of the coefficient estimates and make the estimates very sensitive to minor changes in the model. The result is that the coefficient estimates are unstable and difficult to interpret.
Visualizing the output RMSE
Clustering
Clustering - An Overview
What is clustering?
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
Why do I need clustering?
Cluster analysis serves as a tool to gain insight into the distribution of data to observe characteristics of each cluster.
Types of clustering analysis
Following are the most popular methods of clustering:
K-means clustering
Hierarchical clustering
A short description of k-means clustering

K-means clusters data through distances measured using the Euclidean formula. It works best on continuous variables. In the scenario of categorical variables a special approach has to be adopted that converts the categorical values of the variable into dummy variables.
A short description of hierarchical clustering

A method which builds a hierarchy of clusters like a tree structure.
K-means clustering
Input:
K: No: of clusters to be formed
Centroids: Anchor points for the cluster groups to be created
Methodology:
The k-means clustering technique will calculate the centroid of the clusters through an iterative process.
The iterative process calculates new means (centroid values) in each iteration. The iterations stop once the centroid values freeze.
Output:
Each row (record) will be classified to a specific cluster
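The input, methodology and output above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the points and starting centroids are hypothetical, the function name is made up, and an empty cluster simply keeps its old centroid (an assumption).

```python
import math

def k_means(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, move each centroid to the
    mean of its points, and stop once the centroids no longer change."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # assumption: empty cluster stays put
        if new_centroids == centroids:     # centroid values have frozen
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical (memory consumption, CPU cycles) readings and K = 2 anchors.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, centroids=[(0, 0), (10, 10)])
print(labels, centroids)
```

The returned `labels` list is the output described above: one cluster number per row.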
Scaling and Weighting
What is scaling?
Determining ability to service a debt based on Income and No: of children:
Table 1 columns: Income, No: of children (sample Income values: 10000, 12000)
The Income data is bigger than the No: of children data in Table 1 by an order of thousands. Is Income more important than No: of children by that factor?
What is weighting?
Determining whether to offer a credit card or not based on No: of cards currently held by the client and No: of cars owned:
Table 2 columns: No: of cards, No: of cars
Are the cards and cars variables equally important in determining the outcome?
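One common way to scale is min-max scaling, sketched below in Python. The slide does not name a specific method, so this choice is an assumption, and the income and children values are hypothetical.

```python
def min_max_scale(values):
    """Rescale values to the range 0..1 so that no column dominates a
    distance calculation merely because of its units."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]   # constant column carries no information
    return [(v - low) / (high - low) for v in values]

income = [10000, 12000, 11000]   # hypothetical
children = [1, 3, 2]             # hypothetical

scaled_income = min_max_scale(income)
scaled_children = min_max_scale(children)
print(scaled_income, scaled_children)
```

After scaling, both columns span the same range. Weighting is then a separate, deliberate step: multiply a scaled column by a factor if it should count for more.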
K-Means Process
[Figures: scatter plots of the data points with centroids c1, c2 and c3. Axes: Memory Consumption and CPU Cycles.]
Finding the shortest distance: for each data point, the distances D1, D2 and D3 to the centroids c1, c2 and c3 are compared and the point is assigned to the closest centroid. For example, with D1 = 10, D2 = 15 and D3 = 20 the closest centroid is c1.
Repositioning the centroids: the average of all the data points assigned to a centroid gives its new position. This results in the reposition of c1, c2 and c3.
Iteration continues till the point the centroids reach the optimal position where cost is minimum.
Cost = (1/m) x sum of squared distances from each of the m data points to its assigned centroid
Choosing the optimal cluster set
The plot on the left is the Total Sum of Squares captured for increasing number of centroids.
X axis: No: of centroids
Y axis: Total Sum of Squares within each cluster, summed across all clusters
The trajectory changes from a steep drop to a moderate decline past 5 clusters.
Hence, choosing 5 despite a tradeoff where a better minimum is possible at 9 clusters.
Time Series Analysis
Time Series Analysis
What is time series analysis?
A time series is a phenomenon observed over time, measured as a sequence of data points, typically consisting of successive measurements made over a time interval.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data.
Why do I need time series analysis?
To identify inherent trend and seasonal patterns in data and predict outcomes accordingly.
Decomposing time series data
Time series data usually consists of 3 components:
Seasonal
A seasonal pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the year, the month, or day of the week). Seasonality is always of a fixed and known period.
Trend
A trend exists when there is a long-term increase or decrease in the data. It does not have to be linear.
Cyclic
A cyclic pattern exists when data exhibit rises and falls that are not of fixed period.
Additionally, random fluctuations may also occur.
Time series analysis techniques
Smoothing Methods
Moving Average
Moving average/Centered moving average: average of the most recent n data values in the time series.
The n-period moving average builds a forecast by averaging the observations in the most recent n periods:
$A_t = \frac{x_t + x_{t-1} + \dots + x_{t-n+1}}{n}$
where $x_t$ represents the observation made in period t, and $A_t$ denotes the moving average calculated after making the observation in period t.
Exponential Smoothing
Weighted average of previous values, with recent values weighted more strongly.
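Both smoothing methods can be sketched in Python. The moving average follows the n-period definition above. For exponential smoothing the sketch assumes the common simple form S_t = alpha * x_t + (1 - alpha) * S_(t-1), started at the first observation; the symbol alpha, the function names and the series itself are illustrative assumptions.

```python
def moving_average(xs, n):
    """A_t = mean of the n most recent observations, for every t with
    at least n observations available."""
    return [sum(xs[t - n + 1:t + 1]) / n for t in range(n - 1, len(xs))]

def exponential_smoothing(xs, alpha):
    """Assumed simple form: S_t = alpha * x_t + (1 - alpha) * S_(t-1),
    starting from the first observation. Larger alpha favours recent values."""
    smoothed = [xs[0]]
    for x in xs[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

series = [10, 12, 14, 16, 18]   # hypothetical observations
ma = moving_average(series, n=3)
es = exponential_smoothing(series, alpha=0.5)
print(ma, es)
```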
ARIMA Modelling
ARIMA Overview:
ARIMA stands for Auto-Regressive Integrated Moving Average.
Key Requirement for ARIMA Modeling: Stationarity
A time series exhibits stationarity if the mean and variance of the series are constant over time.
ARIMA Process:
Auto Regressive: The next term is a weighted sum of some past terms.
Integrated: (1/0) Whether the time series needs to be differenced to ensure stationarity.
Moving Average: A weighted sum of random shocks over time.
The order and seasonality are specified, which set how many past data points the model looks back on.
Decision Trees
Decision Trees - An Overview
What is a decision tree?

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
What are its advantages?

1. Requires minimal data preparation
2. Missing value problems are handled
3. Key predictor variables are identified
4. Easy to interpret and explain to business stakeholders
The Decision Tree The Building Process
Find the Split: Choose the best split
Grow the Tree: Grow the tree until criteria is met
Prune the Tree: Create subtrees and choose
Generate Rule Set: Interpret output
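The "choose the best split" step can be illustrated with a Python sketch for a single numeric feature. Gini impurity is used as the split criterion here; the slides do not say which criterion is meant, so this is an assumption, and the data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class shares (0 means a pure node)."""
    if not labels:
        return 0.0
    shares = [labels.count(c) / len(labels) for c in set(labels)]
    return 1.0 - sum(s * s for s in shares)

def best_split(values, labels):
    """Try the midpoint between each pair of neighbouring distinct values and
    return (threshold, weighted impurity) of the best candidate."""
    distinct = sorted(set(values))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or impurity < best[1]:
            best = (threshold, impurity)
    return best

# Hypothetical feature values and binary outcomes.
feature = [1, 2, 3, 4]
outcome = [1, 1, 0, 0]
split = best_split(feature, outcome)
print(split)
```

Growing the tree repeats this search inside each resulting branch until a stopping criterion is met.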
Random Forest - An Overview
What is a random forest?

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time.
Why is random forest a powerful technique?
A decision tree is a structured approach to identifying the characteristics in data that help in identifying values for the target variable.
By building a bag of trees, much of the noise is averaged out and the probability of identifying the optimal values increases.
Variable Importance
Random forests identify the variables of highest significance in calculating the outcomes.
