Just Give Me The Codes Lecture 5: Data Preprocessing II

GOALS
• Normality (multivariate/bivariate/univariate distribution)
• Outlier detection and removal
Recap & Step 25
 From last lecture:
 Created a new df ‘Norway’
 From ‘Norway’, created another df ‘selection’ (3 numerical variables)
 This lecture (Step 25):
 Warnings are a nuisance
 Follow Step 25 to view the dimensions of ‘selection’ and to suppress warnings
 Place a hashtag (#) before warnings.filterwarnings() if you prefer to read the warnings
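A minimal sketch of what Step 25 might look like. The lecture's own code and data are not reproduced here, so the values in `selection` are synthetic placeholders, and the 23 rows are an assumption borrowed from the sample-size example used in this lecture. Only the three column names come from the lecture.

```python
import warnings

import numpy as np
import pandas as pd

# Suppress all warnings; put a '#' in front of this line to see them again.
warnings.filterwarnings("ignore")

# Placeholder data standing in for the lecture's 'selection' DataFrame
# (assumed shape: 23 rows, 3 numerical variables).
rng = np.random.default_rng(seed=0)
selection = pd.DataFrame(
    rng.normal(loc=2.0, scale=0.3, size=(23, 3)),
    columns=["TFR_foreign", "TFR_native", "Overall_TFR"],
)

# .shape returns (number of rows, number of columns).
print(selection.shape)
```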
Normality – The Assumption of Normality
 Data needs to follow a normal distribution for many statistical tests

 Referred to as the Assumption of Normality
 critical for sample sizes <30
 Choose appropriate statistical test for your sample size
 Example: Sample size 23
 Royston test for multivariate normality
 Fail to reject the null at the 5% level (data follows a multivariate normal distribution)
 Henze-Zirkler test for multivariate normality

 Reject the null at the 5% level (data does not follow a multivariate normal distribution)
 Refer to the links at the end of the lecture for more information on normality tests
Step 26 – Install and import Pingouin
 Python offers few options for multivariate normality tests
 The pingouin package covers univariate, bivariate and multivariate normality
 Follow Step 26 to install and import the pingouin package
 Refer to links at the end of the lecture for more
information on pingouin
Step 27 – Shapiro-Wilk normality test
 The null hypothesis for the Shapiro-Wilk normality test states that the data is normally distributed
 The alternative hypothesis for the Shapiro-Wilk normality test states that the data is not normally distributed
 Follow Step 27 to determine the normality of
each variable
 All 3 numerical variables have univariate normal distributions at significance level 0.05
 However, when testing for multivariate normality it is good practice to visualize your dataset to diagnose any deviation from multivariate normality
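The per-variable test in Step 27 can be sketched as follows. This is an illustration, not the lecture's code: it calls `scipy.stats.shapiro` directly instead of pingouin (pingouin's `normality()` function, linked at the end of the lecture, defaults to the same Shapiro-Wilk test), and the data are synthetic placeholders, so the p-values will not match the lecture's results.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Placeholder data; only the column names come from the lecture.
rng = np.random.default_rng(seed=1)
selection = pd.DataFrame(
    rng.normal(loc=2.0, scale=0.3, size=(23, 3)),
    columns=["TFR_foreign", "TFR_native", "Overall_TFR"],
)

alpha = 0.05  # significance level used throughout the lecture
rows = []
for column in selection.columns:
    # Null hypothesis: the column is normally distributed.
    w_stat, p_value = stats.shapiro(selection[column])
    rows.append(
        {
            "variable": column,
            "W": w_stat,
            "pval": p_value,
            # True means we fail to reject the null at level alpha.
            "normal": bool(p_value > alpha),
        }
    )

results = pd.DataFrame(rows).set_index("variable")
print(results)
```

A `normal` value of True only means the test found no evidence against normality; it does not prove the variable is normally distributed.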
Step 28 – Visually inspect TFR_foreign
 Step 28
 Adding a k after -o in the format string draws the line and markers in black
 For example: plt.plot(a, fit, '-ok')
Step 29 – Visually inspect TFR_native
 Step 29
Step 30 – Visually inspect Overall_TFR
 Step 30
Step 31 – Skewness & Kurtosis
 Measuring skewness:
 skewness = 0: symmetrical distribution (as for a normal distribution)
 skewness > 0: longer right tail; mass of distribution is concentrated on the left of the figure
 skewness < 0: longer left tail; mass of distribution is concentrated on the right of the figure
 Measuring kurtosis (excess kurtosis, i.e. relative to a normal distribution):
 kurtosis = 0: tails match those of a normal distribution
 kurtosis > 0: distribution’s tails are heavier than for a normal distribution
 kurtosis < 0: distribution’s tails are lighter than for a normal distribution
 Results from Step 31 show TFR_foreign to be moderately
skewed, whilst TFR_native and Overall_TFR are fairly
symmetrical. TFR_foreign is heavy-tailed whilst TFR_native is
light-tailed. Overall_TFR has a kurtosis value consistent with a
normal distribution.
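A small sketch of how skewness and kurtosis can be computed, assuming the lecture reports excess kurtosis (which the "kurtosis = 0 for a normal distribution" convention suggests). It uses the pandas methods `skew()` and `kurt()`, which apply a small-sample bias correction. The three columns are made-up demonstration values, not the lecture's data.

```python
import pandas as pd

# Made-up columns chosen to show each sign of skewness.
demo = pd.DataFrame(
    {
        "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
        "right_skewed": [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 12.0],
        "left_skewed": [-12.0, -3.0, -2.0, -2.0, -1.0, -1.0, -1.0],
    }
)

skewness = demo.skew()         # > 0: longer right tail, < 0: longer left tail
excess_kurtosis = demo.kurt()  # > 0: heavier tails, < 0: lighter tails than normal

summary = pd.DataFrame({"skewness": skewness, "excess_kurtosis": excess_kurtosis})
print(summary)
```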
Step 32: Multivariate normal distribution
 The null hypothesis for the Henze-Zirkler multivariate normality test states that the data follows a multivariate normal distribution
 The alternative hypothesis for the Henze-Zirkler multivariate normality test states that the data does not follow a multivariate normal distribution
 The dataset (‘selection’) does NOT have a multivariate normal distribution at significance level 0.05
 We will try removing one of the variables (by establishing bivariate normality for each variable pair)
Steps 33-34: Bivariate normal distribution
 TFR_foreign and Overall_TFR:
 DO NOT satisfy the bivariate normality assumption at the significance level 0.05.
 TFR_foreign and TFR_native:
 Satisfy the bivariate normality assumption at the significance level 0.05.
 TFR_native and Overall_TFR:
 Satisfy the bivariate normality assumption at the significance level 0.05.
 In case you weren’t aware:
 Bivariate = 2 variables
 Multivariate = 2 or more variables
Steps 35-36: IQR and outliers
 Follow Steps 35-36 to determine the IQR for each column and, accordingly, the number of outliers per column
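One common way to do this is sketched below. The 1.5 × IQR fences are an assumption (the usual box-plot rule); the slide does not say which multiplier Steps 35-36 use. The DataFrame `demo` and its values are made up for illustration.

```python
import pandas as pd

# Made-up data: column 'a' contains one obvious outlier (50.0).
demo = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0],
        "b": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0],
    }
)

# Interquartile range per column.
q1 = demo.quantile(0.25)
q3 = demo.quantile(0.75)
iqr = q3 - q1

# Assumed rule: a value is an outlier if it lies more than 1.5 * IQR
# below the first quartile or above the third quartile.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (demo < lower) | (demo > upper)

outliers_per_column = is_outlier.sum()
print(iqr)
print(outliers_per_column)
```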
Step 37 – Position of outliers
 Follow Step 37 to view the positions of the outliers
 Remember, index=0 is position 1
 Therefore the outliers are at positions 13 and 23
 Steps 35-37 checked for univariate outliers
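The index-versus-position point can be illustrated with a short sketch. The series `values` is made up, and the 1.5 × IQR rule is again an assumption rather than something the slide states.

```python
import numpy as np
import pandas as pd

# Made-up series with two outliers: 9.0 and -5.0.
values = pd.Series([2.0, 2.1, 1.9, 9.0, 2.2, 2.0, 1.8, 2.1, -5.0, 2.0], name="demo")

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Python counts from 0, so add 1 to get the human-readable position.
zero_based_index = np.flatnonzero(is_outlier.to_numpy())
one_based_position = zero_based_index + 1

print("index:", zero_based_index.tolist())       # [3, 8]
print("position:", one_based_position.tolist())  # [4, 9]
```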
Step 38 – Concat() & import seaborn
 Concatenating in this direction:

 x→y
 y→z
 x→z
Steps 39-41: PairGrid
 Bivariate relationships:
 g = TFR_foreign & TFR_native
 h = TFR_native & Overall_TFR
 i = TFR_foreign & Overall_TFR
Steps 39-41: PairGrid plots
Step 42 – Pearson’s correlation coefficient
 pairwise_corr() function is part of the pingouin package

 Pearson’s correlation coefficient hypothesis:
 NULL: No linear relationship exists
 ALTERNATIVE: A linear relationship does exist
 There IS a significant linear relationship between TFR_native and Overall_TFR
 There is NO significant linear relationship between TFR_foreign & Overall_TFR
 There is NO significant linear relationship between TFR_foreign & TFR_native
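The hypothesis test behind these statements can be sketched per variable pair. This illustration uses `scipy.stats.pearsonr` rather than pingouin's `pairwise_corr()`, and the columns `x`, `y` and `z` are synthetic, so the output says nothing about the TFR variables.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: y depends linearly on x, z is independent noise.
rng = np.random.default_rng(seed=2)
x = rng.normal(size=23)
demo = pd.DataFrame(
    {
        "x": x,
        "y": 2.0 * x + rng.normal(scale=0.1, size=23),
        "z": rng.normal(size=23),
    }
)

alpha = 0.05
rows = []
for first, second in combinations(demo.columns, 2):
    # Null hypothesis: no linear relationship between the two columns.
    r, p_value = stats.pearsonr(demo[first], demo[second])
    rows.append(
        {
            "X": first,
            "Y": second,
            "r": r,
            "pval": p_value,
            "significant": bool(p_value < alpha),
        }
    )

results = pd.DataFrame(rows)
print(results)
```

A p-value below 0.05 leads to rejecting the null, i.e. concluding that a linear relationship exists for that pair.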
What does all this mean?
 TFR_foreign is the cause of the outliers
 Deleting the outliers would reduce the dataset by 20%
 Options: deleting, imputing or transforming
 No guarantee that all outliers are removed at the first attempt
 You may then be faced with deleting or imputing new outliers
 The originality of the dataset is reduced even further
 No universal method for outlier detection and removal; the choice comes with experience
Steps 43-47: If the 5 outliers were deleted…
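A hedged sketch of row-wise deletion, not the lecture's Steps 43-47. The helper `iqr_outlier_mask` is a hypothetical name, the 1.5 × IQR rule is an assumption, and the data are made up. The name `no_outliers` follows the dataset name used later in the lecture.

```python
import pandas as pd


def iqr_outlier_mask(frame):
    """Return a boolean DataFrame marking values outside the 1.5 * IQR fences."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    return (frame < q1 - 1.5 * iqr) | (frame > q3 + 1.5 * iqr)


# Made-up data with one outlier in each column (50.0 and -40.0).
demo = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0, 4.5],
        "b": [10.0, -40.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 14.5],
    }
)

mask = iqr_outlier_mask(demo)

# Drop every row that contains an outlier in any column.
no_outliers = demo[~mask.any(axis=1)]
share_removed = 1 - len(no_outliers) / len(demo)
print(f"rows kept: {len(no_outliers)}, share removed: {share_removed:.0%}")

# Quartiles change after deletion, so the check has to be run again.
print(iqr_outlier_mask(no_outliers).sum())
```

Because the quartiles are recomputed on the reduced data, values that were inside the fences before can fall outside them afterwards, which is why deletion may need more than one pass.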
Steps 48-52 – Imputing outliers with the median & new MVN test
 Note: a p-value > 0.05 for the ‘impute’ and ‘no_outliers’ datasets does not imply that no outliers remain
 Outlier tests need to be conducted again
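Median imputation followed by a repeat outlier check can be sketched like this. It is an illustration under stated assumptions: outliers are flagged with the 1.5 × IQR rule, each outlier is replaced by the median of its own column (computed with the outlier still included), the helper `iqr_outlier_mask` is a hypothetical name, and the data are made up. The multivariate normality test itself is not reproduced here.

```python
import pandas as pd


def iqr_outlier_mask(frame):
    """Return a boolean DataFrame marking values outside the 1.5 * IQR fences."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    return (frame < q1 - 1.5 * iqr) | (frame > q3 + 1.5 * iqr)


# Made-up data: column 'a' contains one outlier (50.0).
demo = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0],
        "b": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0],
    }
)

mask = iqr_outlier_mask(demo)

# Work on a copy so the original data stay untouched.
impute = demo.copy()
for column in demo.columns:
    impute.loc[mask[column], column] = demo[column].median()

# Conduct the outlier check again on the imputed data.
remaining = iqr_outlier_mask(impute).sum()
print(impute)
print(remaining)
```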
End of Lecture 5
 Well done! You have gained intermediate skills in Data Preprocessing!

 Where to go from here? Lecture 6 of course! But things to consider:
 Read up on normality tests
 Read up on Pingouin
 A great place to start:
 Link to Pingouin: https://pingouin-stats.org/index.html
 Pingouin univariate normality: https://pingouin-stats.org/generated/pingouin.normality.html
 Pingouin multivariate normality: https://pingouin-stats.org/generated/pingouin.multivariate_normality.html
 Link to article on Normality tests: https://www.nrc.gov/docs/ML1714/ML17143A100.pdf
