Level: Basic
Introduction to Data Science
Classification
Continuous
Clustering
Time Series Analysis
Decision Trees
Unlike database querying, which asks "What data satisfies this pattern (query)?", discovery asks "What patterns satisfy this data?"
Till Now:
Implement an idea based on an established theory.
Collect data to validate the theory.
Future:
Look at the data and ask the right question to obtain
useful insights.
Keep in mind: simple is not just good. It's great!
Data Scientist
Math & Statistics
Domain Knowledge & Soft Skills
Programming & Database
Communication & Visualization
Identify the Business Problem
Gather the Data
Transform and Sanitize the Data
Explore the Data
Perform Statistical Analysis
Interpret and Improve the Results
Business Interpretation of Results
Classification
Probability Theory
Chance of an Outcome
Conditional Probability
Titanic data: columns and types
PassengerId: Integer
Survived: Integer
PClass: Integer
Name: Factor
Sex: Factor
Age: Numeric
SibSp: Integer
Parch: Integer
Ticket: Factor
Fare: Numeric
Cabin: Factor
Embarked: Factor

Sample rows
Sex | Age | Ticket | Fare | Cabin | Embarked
male | 22 | A/5 21171 | 7.25 | | S
female | 38 | PC 17599 | 71.2833 | C85 | C
female | 26 | STON/O2. 3101282 | 7.925 | | S
female | 35 | 113803 | 53.1 | C123 |

Data dictionary
PassengerId
Survived
PClass
Name
Sex: Male/Female
Age
SibSp
Parch
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked
Data Types
Categorical
Nominal: Variables with no ranking or order. Ex: Gender
Ordinal: Variables with ordered values. Ex: Performance (Good, Ok, Bad)
Numeric
Discrete: Count of something. Ex: Parch (Parents and Children)
Continuous: Information measured on some scale; can take decimal values. Ex: Age
Sex
SibSp
Parch
After Transformation
Sex becomes Sexmale and Sexfemale.
Embarked becomes EmbarkedS, EmbarkedC and EmbarkedQ.
Sex values in the sample rows: male, female, female, female
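The transformation above can be sketched in a few lines of Python. This is a minimal illustration, not the tool used to produce the slide: the helper name `one_hot` is hypothetical, and the column names simply join the variable name and its level, as in Sexmale and Sexfemale.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 indicator column per distinct value."""
    levels = sorted(set(values))
    columns = [f"{prefix}{level}" for level in levels]
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return columns, rows


# Sex values of the four sample passengers shown on the slide.
sex = ["male", "female", "female", "female"]
sex_columns, sex_rows = one_hot(sex, "Sex")

print(sex_columns)
for row in sex_rows:
    print(row)
```

Each categorical column is replaced by as many indicator columns as it has distinct values, so models that expect numbers can use it.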
Pclass
Sex
Age
SibSp (Maybe)
Parch (Maybe)
Fare
Embarked (Maybe)
Mean
Sum of all values / total number of values. Works when the values are not skewed.
A mother 6 feet tall and a son 2 feet tall cross a river 3 feet deep.
They don't have a problem, as on average the pair making the crossing is 4 feet tall.
(6 + 2)/2 = 8/2 = 4 feet
The average hides the fact that the son, at 2 feet, is shorter than the river is deep.
Median
Middle value of a dataset.
Odd number of values: pick the middle one.
Even number of values: take the average of the two values that lie in the middle.
Mode
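The three measures can be checked with Python's standard `statistics` module. This is only a sketch: the heights come from the river example, the ages and sexes from the four sample passengers, and the mode is taken in its usual sense of the most frequent value.

```python
import statistics

# Mean: the mother and son from the river example.
heights = [6, 2]
mean_height = statistics.mean(heights)

# Median: an even number of values, so the two middle values are averaged.
ages = [22, 38, 26, 35]
median_age = statistics.median(ages)

# Mode: the most frequent value.
sexes = ["male", "female", "female", "female"]
mode_sex = statistics.mode(sexes)

print(mean_height, median_age, mode_sex)
```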
Mean imputation
Median imputation
Mode imputation
Forward Propagation
Backward Propagation
Decision Tree based imputation
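A minimal sketch of some of these imputation methods follows. It assumes missing values are represented as `None`, and it reads "Forward Propagation" as carrying the last observed value forward, which is an interpretation rather than something the slide states. The helper names and the position of the gaps are made up for illustration.

```python
import statistics


def impute_constant(values, fill):
    """Replace every missing value (None) with the same fill value."""
    return [fill if v is None else v for v in values]


def forward_fill(values):
    """Replace each missing value with the most recent observed value before it."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


ages = [22, None, 26, 35, None]  # illustrative ages with two gaps
observed = [v for v in ages if v is not None]

mean_imputed = impute_constant(ages, statistics.mean(observed))
median_imputed = impute_constant(ages, statistics.median(observed))
forward_filled = forward_fill(ages)

print(mean_imputed)
print(median_imputed)
print(forward_filled)
```

Backward propagation would be the same loop run from the end of the list towards the start.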
What is an outlier?
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
Box plot (top to bottom):
Maximum Value
75th Percentile Value
Median (50th Percentile)
25th Percentile Value
Lowest Value
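One common way to turn these percentiles into an outlier check is the 1.5 × IQR rule. The slide does not name a rule, so treat the rule, the factor `k=1.5`, the function name `iqr_outliers` and the sample values below as assumptions made for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below the 25th or above the 75th percentile."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up values: eight similar observations and one that is far away.
values = [7, 8, 8, 9, 10, 11, 12, 13, 100]
outliers = iqr_outliers(values)
print(outliers)
```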
Null Hypothesis
The null hypothesis (H0) is a hypothesis which the researcher tries to disprove, reject or nullify. The 'null' often refers to the common view of something, while the alternative hypothesis is what the researcher really thinks is the cause of a phenomenon.
Alternate Hypothesis
Test | Variable 1 | Variable 2
ANOVA | Continuous | Categorical
Chi-Square | Categorical | Categorical
Pearson's Coefficient | Continuous | Continuous
ANOVA checks how much the samples vary between each other as a ratio of the variance within each sample. This metric tells us how distinct or homogeneous the samples are.
Population
All possible data points that can be used to identify a pattern for a specific objective.
Example: All the job applications received by Cognizant.
Sample
A representative subset of the population.
Example: All the job applications received by Cognizant from applicants who have more than 4 years of experience.
Why are we doing this??
Sampling Techniques
Cluster Sampling
With cluster sampling, one should:
divide the population into groups (clusters);
obtain a simple random sample of so many clusters from all possible clusters;
obtain data on every sampling unit in each of the randomly selected clusters.
Stratified Sampling
Cluster Sampling
A cluster sample is much like a two-stage simple random sample. We break up the
population into many groups, called clusters. Then we sample a fixed number of
clusters and collect a simple random sample within each cluster. This technique is
similar to stratified sampling in its process, except that there is no requirement in
cluster sampling to sample from every cluster. Stratified sampling requires that
observations be sampled from every stratum.
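The difference between the two techniques can be sketched as follows. The function names and the toy groups are hypothetical. The cluster version keeps every unit of the selected clusters (the one-stage variant described earlier), while the stratified version draws from every group.

```python
import random


def cluster_sample(clusters, n_clusters, rng):
    """Pick n_clusters groups at random and keep every unit in the chosen groups."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


def stratified_sample(strata, per_stratum, rng):
    """Draw a simple random sample of per_stratum units from every stratum."""
    return {name: rng.sample(units, per_stratum) for name, units in strata.items()}


# Toy population of 16 units split into four groups.
groups = {
    "A": [1, 2, 3, 4],
    "B": [5, 6, 7, 8],
    "C": [9, 10, 11, 12],
    "D": [13, 14, 15, 16],
}
rng = random.Random(0)  # fixed seed so the run is repeatable

picked_clusters = cluster_sample(groups, 2, rng)
picked_strata = stratified_sample(groups, 2, rng)

print(picked_clusters)
print(picked_strata)
```

In the cluster sample some groups are absent altogether, whereas the stratified sample always contains units from all four.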
Titanic Data
Training Data: data for training the model, 60% of the overall data available
Cross Validation Data: data for validating the model, 20% of the overall data available
Test Data: data for testing the model, 20% of the overall data available
Original Population: 891 data points (65% males, 35% females)
Stratified sampling using the Sex variable:
Training: ~534 (65% males, 35% females)
Cross Validation: ~178 (65% males, 35% females)
Test: ~178 (65% males, 35% females)
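A stratified 60/20/20 split that preserves the male/female mix can be sketched as below. The function name `stratified_split`, the record layout and the 100-row toy population are assumptions for illustration. Only the 65%/35% mix and the 60/20/20 proportions come from the slides.

```python
import random


def stratified_split(rows, key, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split rows into train/validation/test, applying the fractions within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)

    train, validation, test = [], [], []
    for members in strata.values():
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_val = round(len(members) * fractions[1])
        train += members[:n_train]
        validation += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]  # whatever remains goes to the test set
    return train, validation, test


# Toy population with the same 65% / 35% mix as the slide.
passengers = [{"id": i, "sex": "male" if i < 65 else "female"} for i in range(100)]
train, validation, test = stratified_split(passengers, key=lambda r: r["sex"])

for name, part in (("train", train), ("validation", validation), ("test", test)):
    males = sum(r["sex"] == "male" for r in part)
    print(name, len(part), males / len(part))
```

Because the fractions are applied inside each stratum, every subset keeps roughly the same share of males and females as the original population.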
Types of Models
Predictive
Classification
Result \ Classification | Yes | No
Yes | True Positive | False Positive
No | False Negative | True Negative
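The four cells can be counted directly from actual and predicted labels. In this sketch the labels are made up, the function name is hypothetical, and accuracy is added as one common summary that the slide itself does not mention.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Made-up labels for six cases.
actual = ["Yes", "Yes", "No", "No", "Yes", "No"]
predicted = ["Yes", "No", "No", "Yes", "Yes", "No"]

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)

print(counts)
print(accuracy)
```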
Continuous
Area | Block | Address | Floor | SQMS | Lease Commence Date | Approval Date | Flat Type | Resale Price | lat | long
Bedok | 520 | | 11 to 15 | 67 | 1979 | 4/1/2013 | New Generation | 370000 | 1.3213587 | 103.931801
Bedok | 105 | | 06 to 10 | 67 | 1977 | 4/1/2013 | New Generation | 361000 | 1.3213587 | 103.931801
Bedok | 74 | Bedok Nth Rd | 06 to 10 | 70 | 1978 | 4/1/2013 | Improved | 358000 | 1.3213587 | 103.931801
Bedok | 77 | Bedok Nth Rd | 11 to 15 | 60 | 1986 | 4/1/2013 | Improved | 323000 | 1.3213587 | 103.931801
Area
Block
Approval Date
Address
Flat Type: Types of construction
Floor
SQMS
Resale Price
Lease Commence Date
lat
long
In Class Exercise
Data Types: Categorical (Nominal, Ordinal), Discrete, Continuous
Variables to classify: Area, Block, Address, Floor, SQMS, Flat Type, Lease Commence Date, Resale Price, Approval Date, long, lat
1. Lease Period Value <- (Typical Lease Period - Lease Period Expired) / Typical Lease Period
2. Approved Period <- (Approval Period - Time Expired Since Approval)
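The first derived feature could be computed along these lines. The slide gives no values for the typical lease period or for how the expired period is measured, so the 99-year default and the "approval year minus lease commence year" definition below are both assumptions, and the function name is hypothetical.

```python
TYPICAL_LEASE_YEARS = 99  # assumed value; the slide does not state it


def lease_period_value(lease_commence_year, approval_year, typical_lease=TYPICAL_LEASE_YEARS):
    """Fraction of the typical lease still remaining at the approval date."""
    expired = approval_year - lease_commence_year  # assumed definition of "Lease Period Expired"
    return (typical_lease - expired) / typical_lease


# First sample row: lease commenced in 1979, resale approved in 2013.
value = lease_period_value(1979, 2013)
print(round(value, 3))
```

The result is a single number between 0 and 1 that replaces two raw date columns.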
What is ANOVA?
It is a test conducted across multiple samples to identify whether they are different.
Here the target variable should be continuous in nature and the independent variables
should be discrete in nature.
Total Variation = Between Group Variation + Within Group Variation
Use it when you want to find out whether a discrete independent variable is significant in determining the value of a continuous variable.
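The split of total variation into between-group and within-group parts, and the resulting F ratio, can be sketched in plain Python. The function name and the three toy groups are made up. A real analysis would also compare F with the F distribution to obtain a p-value, which this sketch leaves out.

```python
def one_way_anova(groups):
    """Return (between-group SS, within-group SS, F statistic) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_stat


# Three made-up groups of a continuous target, one per level of a discrete variable.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ss_between, ss_within, f_stat = one_way_anova(groups)
print(ss_between, ss_within, f_stat)
```

A large F means the group means differ by much more than the spread inside each group would explain.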
Pearson's Coefficient
What is Pearson's Coefficient?
Pearson's coefficient is a measure of the linear relationship between two continuous variables.
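The coefficient can be computed from its standard definition: the covariance of the two variables divided by the product of their standard deviations. The function name is hypothetical, and the example pairs the SQMS and Resale Price values from the four sample flats.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


sqms = [67, 67, 70, 60]
resale_price = [370000, 361000, 358000, 323000]
print(pearson_r(sqms, resale_price))
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.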
Handling Multicollinearity
What is multicollinearity?
Clustering
Clustering An Overview
What is clustering?
K-means clustering
Hierarchical clustering
K-means clustering
Input:
K: Number of clusters to be formed
Centroids: Anchor points for the cluster groups to be created
Methodology:
Output:
What is scaling?
Determining the ability to service a debt based on Income and No. of children:
The Income data is bigger than the No. of children data in Table 1 by an order of thousands. Is Income more important than No. of children by that factor?
Table 1 columns: Income | No. of children
Income values shown: 10000, 12000
What is weighting?
Determining whether to offer a credit card or not based on the No. of cards currently held by the client and the No. of cars owned:
Table columns: No. of cards | No. of cars
Are the cards and cars variables equally important in determining the outcome?
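Both ideas can be sketched briefly. Min-max scaling is used here as one possible scaling method, which the slides do not specify. The third income, the children counts and the 0.7/0.3 weights are made-up numbers for illustration, and both function names are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def weighted_score(features, weights):
    """Combine features into one score, letting the weights express relative importance."""
    return sum(f * w for f, w in zip(features, weights))


incomes = [10000, 12000, 15000]  # first two values from the slide, third made up
children = [1, 3, 2]             # made up

scaled_incomes = min_max_scale(incomes)
scaled_children = min_max_scale(children)
print(scaled_incomes, scaled_children)

# Weighting: cards count for more than cars in this made-up scheme.
score = weighted_score([2, 1], [0.7, 0.3])  # 2 cards, 1 car
print(score)
```

After scaling, both variables lie between 0 and 1, so Income no longer dominates a distance calculation simply because of its units.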
K-Means Process
[Figure sequence: data points plotted by CPU Cycles and Memory Consumption, with three centroids c1, c2 and c3.]
For each point, the distance to every centroid is computed and the point is assigned to the closest centroid.
First example: D1 = 10, D2 = 15, D3 = 20
Second example: D1 = 30, D2 = 10, D3 = 15
Iteration continues until the centroids reach the optimal position where the cost is minimum.
Cost = 1/m
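The assign-and-update loop can be sketched in plain Python. The points and starting centroids are made up, and the function names are hypothetical. The cost function assumes that the truncated "Cost = 1/m" stands for the mean squared distance between each point and its assigned centroid, which is a common choice but is not confirmed by the slide.

```python
import math


def assign(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points]


def update(points, labels, centroids):
    """Move each centroid to the mean of its points (unchanged if it has none)."""
    moved = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            moved.append(old)
    return moved


def cost(points, labels, centroids):
    """Assumed cost: mean squared distance from each point to its centroid."""
    return sum(math.dist(p, centroids[k]) ** 2 for p, k in zip(points, labels)) / len(points)


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        moved = update(points, labels, centroids)
        if moved == centroids:
            break
        centroids = moved
    return centroids, labels


# Made-up (CPU cycles, memory consumption) readings forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, [(0, 0), (10, 10)])
print(centroids, labels, cost(points, labels, centroids))
```

The result depends on the starting centroids, so in practice the algorithm is usually run several times from different starting positions.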
Seasonal
Trend
A trend exists when there is a long-term increase or
decrease in the data. It does not have to be linear.
Cyclic
A cyclic pattern exists when data exhibit rises and
falls that are not of fixed period.
Additionally, random fluctuations may also occur.
Smoothing Methods
Moving Average
Moving average/Centered moving average: average of the most recent n data values in the time series.
The n-period moving average builds a forecast by averaging the observations in the most recent n periods:
$A_t = \frac{x_t + x_{t-1} + \cdots + x_{t-n+1}}{n}$
where $x_t$ represents the observation made in period t, and $A_t$ denotes the moving average calculated after making the observation in period t.
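A minimal sketch of the n-period moving average follows, using a made-up series. The function name is hypothetical. Each output value averages the current observation and the n - 1 observations before it, so the first n - 1 periods produce no value.

```python
def moving_average(series, n):
    """Return the n-period moving average for every period that has n observations."""
    return [sum(series[t - n + 1:t + 1]) / n for t in range(n - 1, len(series))]


sales = [10, 12, 14, 16, 18]  # made-up observations
averages = moving_average(sales, 3)
print(averages)
```

The last value in the list is the one that would serve as the forecast for the next period.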
Exponential Smoothing
Average of previous values, with recent values weighted more strongly:
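The slide's formula did not survive, so the sketch below uses the common simple exponential smoothing recurrence, in which each smoothed value is alpha times the new observation plus (1 - alpha) times the previous smoothed value. That form, the starting value, the function name and the sample series are all assumptions.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing; alpha in (0, 1] sets how strongly recent values count."""
    smoothed = [series[0]]  # assumed start: the first observation itself
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


demand = [10, 20, 30]  # made-up observations
smoothed = exponential_smoothing(demand, alpha=0.5)
print(smoothed)
```

A larger alpha makes the smoothed series react faster to recent changes, while a smaller alpha gives a smoother curve.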
ARIMA Modelling
ARIMA Overview:
ARIMA stands for Auto-Regressive Integrated Moving Average.
Key Requirement for ARIMA Modeling: Stationarity
A time series exhibits stationarity if the mean and variance of the series are constant over time.
ARIMA Process:
Auto Regressive: The next term is a sum of some past terms.
Integrated: (1/0) Whether the time series needs to be differenced to ensure stationarity.
Moving Average: A sum of random shocks over time.
The order and seasonality are specified; they determine how many past data points the model looks back on.
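The "Integrated" part of ARIMA refers to differencing: replacing each value with its change from the previous value. A small sketch with a made-up, steadily rising series shows how one round of differencing removes a linear trend. The function name is hypothetical.

```python
def difference(series, lag=1):
    """Return the series of changes between each value and the one lag steps earlier."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


trend = [3, 5, 7, 9, 11]  # mean rises over time, so the series is not stationary
differenced = difference(trend)
print(differenced)
```

The differenced series has a constant mean, which is the property that the autoregressive and moving average parts rely on.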
Decision Trees
Find the Split: choose the best split
Grow the Tree: grow the tree until the criteria are met
Prune the Tree: create subtrees and choose
Generate Rule Set: interpret output
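The "find the split" step can be sketched with Gini impurity as the split criterion. The slides do not say which criterion is used, so Gini is an assumption, as are the function names, the four toy rows and their labels.

```python
def gini(labels):
    """Gini impurity: 0 for a pure group, higher for a more mixed one."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def split_impurity(rows, labels, feature, value):
    """Size-weighted impurity after splitting on feature == value versus the rest."""
    left = [l for r, l in zip(rows, labels) if r[feature] == value]
    right = [l for r, l in zip(rows, labels) if r[feature] != value]
    n = len(labels)
    return sum(len(side) / n * gini(side) for side in (left, right) if side)


def best_split(rows, labels):
    """Return the (feature, value) pair whose split gives the lowest impurity."""
    candidates = sorted({(f, r[f]) for r in rows for f in r})
    return min(candidates, key=lambda c: split_impurity(rows, labels, *c))


# Toy rows and survival labels, made up for illustration.
rows = [
    {"Sex": "male", "Pclass": "3"},
    {"Sex": "female", "Pclass": "1"},
    {"Sex": "female", "Pclass": "3"},
    {"Sex": "female", "Pclass": "1"},
]
labels = [0, 1, 1, 1]

print(best_split(rows, labels))
```

Growing the tree repeats this search inside each resulting group until a stopping criterion is met.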
Why is random forest a powerful technique?
Variable Importance