GOALS
• Normality (multivariate/bivariate/univariate distribution)
• Outlier detection and removal
Recap & Step 25
Refer to the links at the end of the lecture for more information on normality tests.
Step 26 – Install and import Pingouin
Step 28
One could additionally place a k after -o in the format string to draw the line and markers in black. For example:
plt.plot(a, fit, '-ok')
Step 29 – Visually inspect TFR_native
Step 30 – Visually inspect Overall_TFR
Step 31 – Skewness & Kurtosis
Measuring skewness:
skewness = 0: symmetrical distribution (as for a normal distribution)
skewness > 0: longer right tail; the mass of the distribution is concentrated on the left of the figure
skewness < 0: longer left tail; the mass of the distribution is concentrated on the right of the figure
Measuring kurtosis (excess kurtosis, where a normal distribution scores 0):
kurtosis = 0: tails comparable to those of a normal distribution
kurtosis > 0: tails heavier than those of a normal distribution
kurtosis < 0: tails lighter than those of a normal distribution
Results from Step 31 show TFR_foreign to be moderately skewed, whilst TFR_native and Overall_TFR are fairly symmetrical. TFR_foreign is heavy-tailed whilst TFR_native is light-tailed. Overall_TFR has a kurtosis value consistent with a normal distribution.
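The sign conventions above can be checked with a small sketch. It computes population (biased) skewness and excess kurtosis from the central moments using only the standard library. The sample values are made up for illustration and are not the lecture's TFR data. The slides do not say which function Step 31 uses, and library routines such as pandas' `.skew()` and `.kurt()` apply bias corrections, so their values can differ slightly from these.

```python
def skewness(values):
    """Population skewness: third standardized moment (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


# Made-up illustrative samples (assumption: NOT the lecture's data).
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]

print(skewness(symmetric))         # zero: the sample is symmetric
print(skewness(right_tailed))      # positive: longer right tail
print(excess_kurtosis(symmetric))  # negative: lighter tails than a normal
```

Subtracting 3 in `excess_kurtosis` is what makes a normal distribution score 0, matching the kurtosis thresholds listed above.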
Step 32: Multivariate normal distribution
The null hypothesis for the Henze-Zirkler multivariate normality test states that the data follow a multivariate normal distribution.
The alternative hypothesis for the Henze-Zirkler multivariate normality test states that the data do not follow a multivariate normal distribution.
The dataset (‘selection’) does NOT have a multivariate normal distribution at the 0.05 significance level.
We will try removing one variable at a time, i.e. test for bivariate normality between each pair of variables.
Steps 33-34: Bivariate normal distribution
Follow Step 37 to view the position of the outliers.
Remember, index=0 is position 1. The outliers are therefore at positions 13 and 23.
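The index-to-position conversion can be written out as a short sketch. The index values 12 and 22 are an assumption, chosen because they map to the positions quoted above. The slide text itself only gives the positions.

```python
# Hypothetical zero-based row indices flagged as outliers (assumed values,
# chosen so that they map to the positions quoted on the slide).
outlier_indices = [12, 22]

# A zero-based index i corresponds to the (i + 1)-th observation.
outlier_positions = [i + 1 for i in outlier_indices]

print(outlier_positions)  # [13, 23]
```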
Steps 35-37 checked for univariate outliers.
Step 38 – concat() & import seaborn
Bivariate relationships:
g = TFR_foreign & TFR_native
h = TFR_native & Overall_TFR
i = TFR_foreign & Overall_TFR
Steps 39-41: PairGrid plots
Step 42 – Pearson’s correlation coefficient
Note: a p-value > 0.05 for the ‘impute’ and ‘no_outliers’ datasets does not imply that there are no outliers; the outlier tests need to be conducted again.
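For reference, the coefficient computed in Step 42 can be sketched from its definition using only the standard library. This is an illustration, not necessarily the routine the lecture uses. The helper name `pearson_r` and the sample values are hypothetical. The sketch does not compute a p-value; a library routine such as `scipy.stats.pearsonr` reports one alongside the coefficient.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations (numerator).
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (denominator terms).
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up illustrative values (assumption: NOT the lecture's TFR data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]      # exact linear increase
y_rev = [10.0, 8.0, 6.0, 4.0, 2.0]  # exact linear decrease

print(pearson_r(x, y))      # perfect positive linear relationship
print(pearson_r(x, y_rev))  # perfect negative linear relationship
```

Because the coefficient is built from deviations about the mean, a single extreme observation can shift it noticeably, which is one reason to re-run the outlier checks.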
End of Lecture 5