Lusi Yang
University of Toronto
Abstract
Although the number of under-five deaths worldwide has declined from 12.7 million
in 1990 to 5.9 million in 2015, many children are still dying in the poor and unsanitary regions
of Africa. This paper had two aims: to study the associations between the response variable,
under-five mortality rate and the predictors, which are selected explanatory factors of child
mortality; and to assess the full multiple linear model by using Box-Cox transformation for
the skewed response variable. The associations were studied based on the results of the
multiple linear regression both before and after the Box-Cox transformation. The
associations in the transformed model are consistent with numerous sources such as the WHO. To
justify the adequacy of the Box-Cox transformation, histograms of the residuals and the
normal Q-Q plots of both models were plotted. The transformed model satisfied not only
the normality-of-residuals assumption of linear regression but also its other assumptions,
including the linear relationship between dependent and
independent variables and homoscedasticity. The Box-Cox transformed model also had a
lower AIC than the non-transformed model. Thus, Box-Cox transformation can be a tool for
correcting the non-normality of residuals and fulfilling other linear regression assumptions.
Child Mortality Reduction: A Closer View
Introduction
Children are key to shaping future demographics. Thus, for the well-being of
children and the stability of the international economy, world leaders must set goals to
reduce the child mortality rate. According to the World Health Organization (WHO), the number
of under-five deaths worldwide declined from 12.7 million in 1990 to 5.9 million in 2015,
that is, about 16,000 deaths every day compared with 35,000 in 1990 (WHO, 2016). According to the
United Nations Children's Fund (UNICEF), Sub-Saharan African children suffer the highest
under-five mortality rate in the world (UNICEF, 2016). This paper therefore takes
a closer look at under-five mortality rates around the world through a
statistical lens. This study aimed to investigate the associations between the dependent
variable, under-five mortality rate, and the selected explanatory factors as the independent
variables by a multiple linear regression, and to assess the full multiple linear model by using
a Box-Cox transformation for the skewed response variable.
The independent variables were selected based on published studies from sources such as the
WHO, the Alan Guttmacher Institute (AGI), Our World in Data, and the United Nations (UN).
For instance, a report on family planning by AGI listed several explanatory factors for infant
mortality including births to adolescents, closely spaced births, high fertility rates, less-
educated women, and lack of government spending on health (AGI, 2002). Moreover, the
WHO (2016) listed overcrowded conditions, unsafe drinking water and food, and poor
hygiene practices as major explanatory factors of under-five mortality. This paper
then checks, from a statistical perspective, whether the estimated associations agree with these previous studies.
In this study, the data were gathered from Gapminder through various sources
including the UN, the World Bank, the WHO, OECD, and UNICEF. There were 14 independent
variables and 1 dependent variable for 181 nations. These countries, out of a total of 196 nations,
were assumed to be representative of the world. However, there are several
concerns about the dataset. The first concern is that many developing nations did not have
the resources to gather reliable data; thus, this study must assume the data contain no
measurement errors. The second concern is that since some countries did not have data
for certain variables, missing data are expected in the dataset. One last limitation of the
dataset is that it is not possible to collect all the explanatory factors that are associated with
under-five mortality.
Figure 1 presents the 15 variables in this study of under-five mortality rate. Please refer each
transformation.
Suppose there are m selected explanatory factors to be analyzed through
a multiple linear regression. Let the explanatory variables be $X_1, X_2, \ldots, X_m$, and let the response
variable be $y$. In matrix form the model is

$$y = X\beta + \varepsilon, \qquad (1)$$

where $n$ is the number of observations, $m$ is the
number of independent variables, and the $n \times (m+1)$ matrix $X$ includes the intercept. $\beta$ is an $(m+1) \times 1$
vector of coefficients, $\varepsilon$ is an $n \times 1$ vector of errors, and $y$ is an $n \times 1$ vector of
observations of the dependent variable. The coefficients $\beta$ are estimated by minimizing the
sum of squared residuals. A multiple linear regression can be estimated through lm() in R. A
linear regression model assumes a linear relationship between the response and predictors,
normality of the residuals, and homoscedasticity. To study the association between the
ith independent variable $X_i$ and the dependent variable $y$, there are three possible cases for the ith
coefficient: $\beta_i = 0$ means there is no linear association between $y$ and $X_i$; $\beta_i > 0$ means that
there is a positive linear association between $y$ and $X_i$; and $\beta_i < 0$ means there is a negative
linear association between $y$ and $X_i$.
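As a minimal sketch of equation (1), the following R code fits a multiple linear regression to simulated data and reads off the sign of each coefficient. The predictors x1, x2, x3 and all numeric values are hypothetical and are not the study data; the sketch only illustrates how lm() obtains the least-squares estimate and how the three cases for a coefficient appear in the output.

```r
# Simulated data for illustration only (assumption: not the study data)
set.seed(1)
n  <- 200
x1 <- rnorm(n)   # hypothetical predictor with a positive effect on y
x2 <- rnorm(n)   # hypothetical predictor with a negative effect on y
x3 <- rnorm(n)   # hypothetical predictor with no effect on y
y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)

# Least-squares fit of y = X beta + e
fit <- lm(y ~ x1 + x2 + x3)

# The same estimate from the normal equations (X'X) beta = X'y
X        <- model.matrix(fit)   # n x (m + 1); the first column is the intercept
beta_hat <- solve(crossprod(X), crossprod(X, y))
stopifnot(isTRUE(all.equal(unname(coef(fit)), as.vector(beta_hat))))

# Estimated sign and p-value of each coefficient
print(round(summary(fit)$coefficients[, c("Estimate", "Pr(>|t|)")], 4))
```

In this simulation the estimates for x1 and x2 recover the positive and negative signs used to generate the data, while the estimate for x3 is expected to be close to zero.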
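To make the errors closer to i.i.d. normally distributed, Box and Cox developed a
method for choosing the best transformation from a set of power transformations of the
response. For a transformation parameter $\lambda$, the transformed response is

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \log y, & \lambda = 0. \end{cases} \qquad (2)$$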
A $\lambda$ value that maximizes the log-likelihood or minimizes the sum of squared residuals would
be most appropriate. There are several steps involved in finding the optimal $\lambda$.
Step 1. Load the boxcox function from the R package MASS and apply it to the R lm
object. The boxcox function also displays the log-likelihood vs $\lambda$ plot to visually determine
the optimal $\lambda$.
Step 2. Extract the $\lambda$ with the largest log-likelihood from the boxcox output, bc$x[which.max(bc$y)].
Step 3. After finding $\lambda$, we apply it as the exponent of the response variable and run
the multiple linear regression again.
Step 4. Use the diagnostic plots in R, plot(fit1), and then examine the Normal Q-Q plot to see
whether the residuals are closer to normally distributed.
Step 5. Compare the AIC of the two models to see the improvement after transformation.
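The steps above can be sketched in R as follows. This is an illustration under stated assumptions, not the study's analysis: the data are simulated, the names dat, x1, x2, fit0, and qq_cor are hypothetical, and the Q-Q correlation is used here only as a numeric stand-in for inspecting the plot. The sketch also adds one thing the steps do not mention: because AIC values computed on y and on a power of y are on different response scales, it reports an AIC for the transformed model adjusted by the change-of-variables (Jacobian) term, so that the two values are comparable.

```r
library(MASS)  # provides boxcox()

# Simulated right-skewed, strictly positive response (assumption: not the study data)
set.seed(42)
n   <- 200
dat <- data.frame(x1 = runif(n, 1, 5),   # hypothetical predictor
                  x2 = runif(n, 0, 3))   # hypothetical predictor
# sqrt(y) is linear in x1 and x2, so the true lambda is 0.5
dat$y <- (3 + 1.5 * dat$x1 - 0.8 * dat$x2 + rnorm(n, sd = 0.4))^2

# Model before the transformation
fit0 <- lm(y ~ x1 + x2, data = dat)

# Steps 1-2: profile log-likelihood on a lambda grid, then take the maximizer.
# plotit = FALSE skips the plot and returns the grid (x) and log-likelihood (y).
bc     <- boxcox(y ~ x1 + x2, data = dat,
                 lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Step 3: refit with the response raised to the power lambda
fit1 <- lm(I(y^lambda) ~ x1 + x2, data = dat)

# Step 4: Normal Q-Q coordinates of the standardized residuals; a correlation
# closer to 1 means a straighter Q-Q plot (plot(fit1, which = 2) draws the plot)
qq0    <- qqnorm(rstandard(fit0), plot.it = FALSE)
qq1    <- qqnorm(rstandard(fit1), plot.it = FALSE)
qq_cor <- c(before = cor(qq0$x, qq0$y), after = cor(qq1$x, qq1$y))

# Step 5: AIC. For z = y^lambda (lambda != 0) the log-likelihood on the scale
# of y equals the log-likelihood of z plus sum(log(|lambda| * y^(lambda - 1))).
jacobian <- nobs(fit1) * log(abs(lambda)) + (lambda - 1) * sum(log(dat$y))
aic <- c(before         = AIC(fit0),
         after_raw      = AIC(fit1),                 # on the scale of y^lambda
         after_adjusted = AIC(fit1) - 2 * jacobian)  # on the scale of y

print(lambda)
print(round(qq_cor, 4))
print(round(aic, 2))
```

Raising y to the power $\lambda$, as in Step 3, differs from equation (2) only by a linear rescaling when $\lambda \neq 0$, so both forms lead to the same fitted model up to scale. The sketch does not count $\lambda$ as an estimated parameter in the AIC.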
Results
The goals of this study were to identify the associations between the dependent and
independent variables and to assess the full multiple linear model by using Box-Cox
transformation for the response variable. The results of the multiple linear regression model
before and after the transformation are presented in Table 1. The resulting $\lambda$ that minimizes
the sum of squared residuals from R for the Box-Cox transformation is 0.3838384.
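The reported value 0.3838384 appears to be a point on a grid rather than a continuous optimum. The short sketch below checks this under one assumption, stated from the behaviour of MASS as I understand it and not from the paper: when boxcox() is called with its default settings and draws its plot, it returns the log-likelihood interpolated at 100 equally spaced values of $\lambda$ between -2 and 2.

```r
# Assumed default grid returned by boxcox(): 100 equally spaced values on [-2, 2]
grid     <- seq(-2, 2, length.out = 100)
reported <- 0.3838384                        # value reported in the Results
idx      <- which.min(abs(grid - reported))  # index of the nearest grid point
spacing  <- diff(grid)[1]                    # distance between neighbouring grid points

print(c(index = idx, grid_value = grid[idx], spacing = spacing))
```

If this assumption holds, $\lambda$ is resolved only to within about half the grid spacing, and passing a finer grid through the lambda argument of boxcox(), for example seq(-2, 2, by = 0.001), would refine the estimate.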
Before the transformation, fertility (fer), teen fertility (teenfer), GDP per capita
(gdppc), government spending (gvt), and agriculture (agr) had a positive association with
under-five mortality rate; school (sch), water, sanitation (san), and life expectancy (le) had a
negative association with under-five mortality rate; and population had no linear association
with under-five mortality rate. Based on the p-values, only fertility and life expectancy were
statistically significant. On the other hand, after the Box-Cox transformation, both GDP per
capita and government spending changed signs, so the associations became negative. Based
on the p-values, fertility, school, and life expectancy were statistically significant.
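To justify the adequacy of the Box-Cox transformation, we plotted histograms of the standardized residuals and
compared the Normal Q-Q plots of both models to check the normality of residuals.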
Figure 2. Histogram and Normal Q-Q plot of standardized residuals before the transformation
The above histogram of standardized residuals before transformation suggests that the
residuals are not normally distributed because there are several extreme positive and
negative residuals. The corresponding Normal Q-Q plot also suggests residuals are not
normally distributed because of the very high and very low points (outliers) relative to the
linear trend. The overall AIC for this model was 908.59.
Figure 3. Histogram and Normal Q-Q plot of standardized residuals after the Box-Cox transformation
Since point 4 (Angola) and point 144 (Sierra Leone) were outliers, they were removed
from the dataset. In Figure 3, the histogram after Box-Cox transformation looks more
normal than the histogram before the transformation. The corresponding Normal Q-Q plot
also looks straighter, although there are still some deviations at the tails. The overall AIC
after the transformation was 110.98. Based on the diagnostic plots in the second row of
Figure 4, all other linear regression assumptions, including the linear relationship and
homoscedasticity, were satisfied.
Figure 4. The Diagnostic Plots of the model before and after the Box-Cox transformation
Discussion
The positive associations between under-five mortality rate and fertility, teen fertility,
and agriculture of the Box-Cox transformed model can be confirmed by numerous studies
(AGI, 2002; BBC, 2013; WHO, 2017; Roser, 2017). It is interesting to see that the positive
economies, hazardous pesticides and child labour impose serious risks to children (BOI, 2015;
WHO, 2017). The negative associations between under-five mortality rate and GDP per
capita, government spending, school, water, sanitation, and life expectancy can also be
confirmed by several studies (AGI, 2002; Günther & Fink, 2011; O'Hare et al., 2013; Ezeh et al., 2014;
Maruthappu et al., 2015; WHO, 2017). Thus, the coefficient signs estimated by the Box-Cox model agree
with the literature.
In addition, the normality of the residuals improved after the Box-Cox
transformation. The histogram in Figure 3 is closer to normal than the histogram in Figure 2.
Moreover, the Normal Q-Q plot helped identify the outliers in the dataset and became more linear,
even though it still shows some outlying observations. The AIC of the before-transformation
model was 908.59 and the AIC of the after-transformation model was 110.98. In the
second row of Figure 4, the Scale-Location plot shows that the residuals are randomly spread
out, which is consistent with the homoscedasticity assumption; the Residuals vs Leverage plot indicates
that there are no influential cases because all points lie within the Cook's distance lines; and the
Residuals vs Fitted plot shows that the residuals are spread evenly around the
horizontal line, so the linear relationship between the dependent and independent variables
appears to hold.
Conclusion
Based on this study, the associations between child mortality rate and
the chosen explanatory factors agree with previous studies. Thus, policy makers aiming to
reduce the child mortality rate can direct their policies along the directions of these associations. For
the skewed response variable, we applied the Box-Cox transformation to improve the
normality of the residuals. The transformation was adequate because it improved not just
the normality of the residuals but also the other multiple linear regression
assumptions: the linear relationship between the response and predictor variables and
homoscedasticity. As a result, this study shows that Box-Cox transformation can be a tool
for correcting the non-normality of residuals and fulfilling other linear regression assumptions.
Appendices
Appendix A: References
AGI. (2002). Family Planning Can Reduce High Infant Mortality Levels. Retrieved
December 26, 2016, from
https://www.guttmacher.org/report/family-planning-can-reduce-high-infant-mortality-levels
Ezeh, O. K., Agho, K. E., Dibley, M. J., Hall, J., & Page, A. N. (2014, September). The Impact of
Water and Sanitation on Childhood Mortality in Nigeria: Evidence from
Demographic and Health Surveys, 2003-2013. Retrieved April 26, 2017, from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4199018/
Findings on the Worst Forms of Child Labor - Côte d'Ivoire. (2016, December 07). Retrieved
April 26, 2017, from
https://www.dol.gov/agencies/ilab/resources/reports/child-labor/c%C3%B4te-dIvoire
Gapminder: Unveiling the beauty of statistics for a fact based world view. Retrieved December 26, 2016,
from https://www.gapminder.org/
Günther, I., & Fink, G. (2011). Water and Sanitation to Reduce Child Mortality: The Impact and
Cost of Water and Sanitation Infrastructure (Rep.). Washington D.C.: The World
Bank.
Maruthappu, M., Ng, K. Y., Williams, C., Atun, R., & Zeltner, T. (2015, April 01). Government
Health Care Spending and Child Mortality. Retrieved April 26, 2017, from
http://pediatrics.aappublications.org/content/135/4/e887
O'Hare, B., Makuta, I., Chiwaula, L., & Bar-Zeev, N. (2013, October). Income and child
mortality in developing countries: a systematic review and meta-analysis.
Retrieved April 26, 2017, from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3791093/
The cost of a polluted environment: 1.7 million child deaths a year, says WHO. (2017).
Retrieved April 26, 2017, from
http://www.who.int/mediacentre/news/releases/2017/pollution-child-death/en/
WHO. (2016, September). Children: reducing mortality. Retrieved December 26, 2016, from
http://www.who.int/mediacentre/factsheets/fs178/en/
Young mothers 'risk factor for early childhood death' (2013, September 30). Retrieved April
26, 2017, from http://www.bbc.com/news/health-24296960
X3   Income per capita (GDP/cap, PPP$ inflation-adjusted)
     Description: Gross Domestic Product per capita by Purchasing Power Parities (in international dollars, fixed 2011 prices). The inflation and differences in the cost of living between countries have been taken into account.
     Source: Compiled by Mattias Lindgren, Gapminder

X4   Per capita government expenditure on health at average exchange rate (US$)
     Description: Per capita general government expenditure on health expressed at average exchange rate for that year in US dollars. Current prices.
     Source: World Health Organization, http://www.who.int

X5   Agriculture, value added (% of GDP)
     Description: Agriculture corresponds to ISIC divisions 1-5 and includes forestry, hunting, and fishing, as well as cultivation of crops and livestock production.
     Source: World Bank National Accounts Data, and OECD National Accounts Data, http://data.worldbank.org/indicator

X6   Mean years in school, women 15-44
     Description: The average number of years of school attended by all people in the age and gender group specified, including primary, secondary and tertiary education.
     Source: Institute for Health Metrics and Evaluation, http://www.healthdata.org/

X7   Improved water source, overall %
     Description: The percentage of the total population who use any of the following types of water supply for drinking: piped water into dwelling, plot or yard; public tap/standpipe; borehole/tube well; protected dug well; protected spring; rainwater collection and bottled water.
     Source: The United Nations site for the MDG Indicators, http://mdgs.un.org/unsd/mdg/Data.aspx

X8   Improved sanitation, overall %
     Description: Access to improved sanitation facilities refers to the percentage of the population with at least adequate access to excreta disposal facilities that can effectively prevent human, animal, and insect contact with excreta. Improved facilities range from simple but protected pit latrines to flush toilets with a sewerage connection. To be effective, facilities must be correctly constructed and properly maintained.
     Source: World Development Indicators, http://data.worldbank.org/indicator/SH.STA.ACSN

X9   Life expectancy
     Description: The average number of years a newborn child would live if current mortality patterns were to stay the same.
     Source: Various sources

X10  Total population
     Description: Total population of both sexes; data after 2010 are based on the medium estimates from the UN population division.
     Source: Mattias Lindgren, Gapminder

X11  DTP3 immunized, % of one-year-olds
     Description: One-year-olds immunized with three doses of diphtheria tetanus toxoid and pertussis (DTP3) (%).
     Source: UNICEF and WHO, https://www.unicef.org/

X12  Contraceptive use, % of women ages 15-49
     Description: Contraceptive prevalence rate is the percentage of women who are practicing, or whose sexual partners are practicing, any form of contraception. It is usually measured for married women ages 15-49 only.
     Source: World Bank, http://data.worldbank.org/indicator

X13  CO2 per capita (metric tons per person)
     Description: Carbon dioxide emissions from the burning of fossil fuels.
     Source: Gapminder

X14  Pneumonia deaths in newborn, per 1,000 births
     Description: Pneumonia deaths in newborn (per 1,000 births).
     Source: Gapminder
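# Load the full dataset (the path is specific to the author's machine).
# attach() is not needed because every variable is assigned explicitly below.
study = read.csv("/Users/Lusi/Desktop/study.csv")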
# Rename Variables
mor = study$underFiveMortality
fer = study$totalFertilityRate
teenfer = study$teenFertility
gdppc = study$GDPPerCapita
gvt = study$govtExOnHealthPerCapita
agr = study$agriPercentGDP
sch = study$womenMeanYearsInSchool
water = study$improvedDrinkingWaterSourcesInPercentage
san = study$improvedSanitationFacilitiesInPercentage
le = study$lifeExpectancy
pop = study$totalPopulation
dtp3 = study$DTP3ImmunizedInPercentage
contra = study$contraceptivePrevalenceInPercentage
co2 = study$CO2
pne = study$pneumoniaDeathsInNewborns
hist(le)
hist(pop)
hist(dtp3)
hist(contra)
hist(co2)
hist(pne)
# Points 4 (Angola) and 144 (Sierra Leone) were outliers in the original dataset;
# study1.csv is the dataset with these two observations removed
study1 = read.csv("/Users/Lusi/Desktop/study1.csv")
mor = study1$underFiveMortality
fer = study1$totalFertilityRate
teenfer = study1$teenFertility
gdppc = study1$GDPPerCapita
gvt = study1$govtExOnHealthPerCapita
agr = study1$agriPercentGDP
sch = study1$womenMeanYearsInSchool
water = study1$improvedDrinkingWaterSourcesInPercentage
san = study1$improvedSanitationFacilitiesInPercentage
le = study1$lifeExpectancy
pop = study1$totalPopulation
dtp3 = study1$DTP3ImmunizedInPercentage
contra = study1$contraceptivePrevalenceInPercentage
co2 = study1$CO2
pne = study1$pneumoniaDeathsInNewborns
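# Apply the Box-Cox transformation to the response to correct the non-normality of the residuals
library(MASS)
bc = boxcox(mor ~ fer + teenfer + gdppc + gvt + agr + sch + water + san + le + pop,
            na.action = na.exclude)
# lambda with the largest log-likelihood on the grid returned by boxcox()
lambda = bc$x[which.max(bc$y)]
lambda   # 0.3838384 in this study
# Refit the model with the response raised to the power lambda
fit1 = lm(I(mor^lambda) ~ fer + teenfer + gdppc + gvt + agr + sch + water + san + le + pop)
# plot() on an lm object draws four diagnostic panels, so use a 2 x 2 layout
par(mfrow = c(2, 2))
plot(fit1)
# Histogram of the residuals scaled to mean 0 and standard deviation 1
par(mfrow = c(1, 1))
std_res = scale(residuals(fit1))
hist(std_res, xlab = "Standardized Residuals", main = "Histogram of Standardized Residuals")
AIC(fit1)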