1 Introduction
The central limit theorem (CLT) is widely regarded as one of the most important results in
probability and mathematical statistics, and it underpins much of the theory of statistical
estimation and hypothesis testing. Briefly, the CLT states that the arithmetic mean of a
sufficiently large number of independent and identically distributed random variables with
finite variance is approximately normally distributed, regardless of the probability
distribution of the original population. A mathematical statement of the CLT is as follows.
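If $X_1, X_2, \ldots, X_n$ are independent and identically distributed random variables with
$E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$, and if we define
$$Z_n = \frac{\bar{Y} - \mu}{\sigma / \sqrt{n}}, \qquad \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} X_i,$$
then the distribution of $Z_n$ converges to the standard normal distribution $N(0, 1)$ as $n \to \infty$.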
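As a small illustration of this statement, the following R sketch draws repeated samples from an exponential population with rate 1 (so that $\mu = 1$ and $\sigma = 1$), standardizes each sample mean into $Z_n$, and compares an empirical probability with the standard normal value. The function name `standardized_means`, the seed and the reduced number of replications are assumptions made for this illustration; they are not part of the original simulation.

```r
# Minimal sketch (illustrative; not the code used for the figures).
# Draws nsim samples of size n from rdist and returns the standardized
# sample means Z_n = (mean - mu) / (sigma / sqrt(n)).
standardized_means <- function(rdist, n, mu, sigma, nsim = 20000) {
  means <- replicate(nsim, mean(rdist(n)))
  (means - mu) / (sigma / sqrt(n))
}

set.seed(1)
# Exponential(rate = 1) population: mu = 1, sigma = 1
z <- standardized_means(rexp, n = 100, mu = 1, sigma = 1)

# If the normal approximation is good, these two values should be close
empirical <- mean(z <= 1.96)
theoretical <- pnorm(1.96)
c(empirical = empirical, theoretical = theoretical)
```

The same helper can be applied to any population for which $\mu$ and $\sigma$ are known, for example `standardized_means(function(k) rchisq(k, df = 2), n = 100, mu = 2, sigma = 2)`.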
2 Method
The simulation was performed in R. To compare the simulation results, I selected four types
of population distribution as the sample sources: normal, exponential, t and chi-squared.
To show the influence of sample size, four sample sizes (n = 5, 10, 50, 100) were used for
each population distribution. For each sample size, the sampling process was repeated
1,000,000 times, and the mean of every sample was calculated and stored in the corresponding
row of a matrix created for that population distribution. The sampling distribution of the
mean was then plotted for every sample size of each population distribution from the sample
means stored in the matrix. Within each plot, the density of the corresponding population
distribution was added for reference and comparison. To better assess the deviation of the
sampling distribution of the mean from the normal distribution, a Q-Q plot was constructed
for every simulation against a sample of the same size (1,000,000) drawn from the standard
normal distribution. The closer the points in a Q-Q plot lie to a straight line, the more
similar the distribution is to a normal distribution. For the t and chi-squared distributions,
two different degrees of freedom were used (t distribution: df = 5, 10; chi-squared
distribution: df = 2, 10). The source code is attached in the appendix.
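The procedure described above can be sketched compactly as follows. This is a reduced version with an assumed helper `simulate_means` and far fewer replications (10,000 instead of 1,000,000) so that it runs in seconds; the names `populations` and `means` are likewise introduced here for illustration only.

```r
# Minimal sketch of the simulation design (illustrative only).
# simulate_means is a hypothetical helper, not part of the appendix code.
simulate_means <- function(rdist, sizes, nsim) {
  # One row per sample size, one column per simulated sample
  out <- matrix(0, nrow = length(sizes), ncol = nsim)
  for (j in seq_along(sizes)) {
    out[j, ] <- replicate(nsim, mean(rdist(sizes[j])))
  }
  out
}

set.seed(42)
sizes <- c(5, 10, 50, 100)
nsim <- 10000  # reduced from 1,000,000 to keep the sketch fast

populations <- list(
  normal      = function(k) rnorm(k),
  exponential = function(k) rexp(k),
  t_df5       = function(k) rt(k, df = 5),
  chisq_df2   = function(k) rchisq(k, df = 2)
)

# A list of 4 x nsim matrices of sample means, one per population
means <- lapply(populations, simulate_means, sizes = sizes, nsim = nsim)

# Average of the sample means for each population and sample size
sapply(means, rowMeans)
```

Each row of a matrix in `means` can then be passed to `density()` or `qqplot()` in the same way as in the appendix code.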
3 Results
3.1 Normal distribution
Fig. 1 Simulation results for the sampling distribution of the mean from a standard normal
population. (Upper panel) The sample size n equals 5, 10, 50 and 100, respectively, from left to
right. In each plot, the blue line represents the density of the sampling distribution of the mean
and the red line represents the population distribution (the standard normal distribution). (Lower
panel) The corresponding Q-Q plots of the sampling distribution at each sample size against the
standard normal distribution.
The simulation result for samples from the standard normal distribution is shown in Fig. 1.
The sampling distribution of the mean shows the familiar bell shape of the normal
distribution even at a small sample size (n = 5). In fact, no convergence is needed here:
whatever the sample size, the mean of normal samples is itself exactly normally distributed.
This can also be seen in the four Q-Q plots, in which the points fall on a straight line
through the origin. Because the means are not standardized in the appendix code, the slope
of that line is $1/\sqrt{n}$ rather than 1. More generally, for a normal population with mean
$\mu$ and variance $\sigma^2$, the sampling distribution of the mean is normal with mean $\mu$
and variance $\sigma^2/n$, where n is the sample size. As the sample size goes to infinity, the
variance of the sampling distribution goes to zero and the distribution approaches a
degenerate distribution concentrated at $\mu$. The mean of the sampling distribution does not
depend on n and equals the population mean. In contrast, the variance of the sampling
distribution decreases as the sample size increases, which makes the density of the sampling
distribution more and more sharply peaked (Fig. 1).
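The mean and variance of the sampling distribution quoted above follow directly from the independence of the observations. Writing $\bar{Y}$ for the mean of $n$ independent observations $X_1, \ldots, X_n$ with mean $\mu$ and variance $\sigma^2$ (notation assumed here for illustration):

```latex
\begin{align*}
E(\bar{Y}) &= \frac{1}{n} \sum_{i=1}^{n} E(X_i) = \frac{n\mu}{n} = \mu, \\
\mathrm{Var}(\bar{Y}) &= \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i)
                       = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.
\end{align*}
```

These two identities hold for any population with finite variance. For a normal population, a linear combination of independent normal variables is itself normal, so $\bar{Y} \sim N(\mu, \sigma^2/n)$ exactly for every n. For the standard normal population this gives a standard deviation of $1/\sqrt{5} \approx 0.447$ at n = 5 and $1/\sqrt{100} = 0.1$ at n = 100.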
3.2 Exponential distribution
Fig. 2 Simulation results for the sampling distribution of the mean from an exponential
population. (Upper panel) The sample size n equals 5, 10, 50 and 100, respectively, from left to
right. In each plot, the blue line represents the density of the sampling distribution of the mean
and the red line represents the population distribution (the exponential distribution). (Lower
panel) The corresponding Q-Q plots of the sampling distribution at each sample size against the
standard normal distribution.
3.3 t-distribution
Fig. 3 Simulation results for the sampling distribution of the mean from a population with the
t-distribution (df = 5). (Upper panel) The sample size n equals 5, 10, 50 and 100, respectively,
from left to right. In each plot, the blue line represents the density of the sampling distribution
of the mean and the red line represents the population distribution (the t-distribution, df = 5).
(Lower panel) The corresponding Q-Q plots of the sampling distribution at each sample size against
the standard normal distribution.
Fig. 4 Simulation results for the sampling distribution of the mean from a population with the
t-distribution (df = 10). (Upper panel) The sample size n equals 5, 10, 50 and 100, respectively,
from left to right. In each plot, the blue line represents the density of the sampling distribution
of the mean and the red line represents the population distribution (the t-distribution, df = 10).
(Lower panel) The corresponding Q-Q plots of the sampling distribution at each sample size against
the standard normal distribution.
3.4 $\chi^2$ distribution
Fig. 5 Simulation results for the sampling distribution of the mean from a population with the
$\chi^2$ distribution (df = 2). (Upper panel) The sample size n equals 5, 10, 50 and 100,
respectively, from left to right. In each plot, the blue line represents the density of the sampling
distribution of the mean and the red line represents the population distribution (the $\chi^2$
distribution, df = 2). (Lower panel) The corresponding Q-Q plots of the sampling distribution at
each sample size against the standard normal distribution.
The degrees of freedom (df) strongly influence the shape of the chi-squared density (Fig. 5
and Fig. 6). As df goes to infinity, the chi-squared distribution, suitably standardized,
converges to a normal distribution. In this report, two degrees of freedom (df = 2, 10) were
selected to show how the sampling distribution depends on the population parameters. As
shown in Fig. 5, the population distribution is highly skewed when df = 2; in fact, the
chi-squared distribution with df = 2 is an exponential distribution with mean 2. However,
the sampling distribution is still close to normal at a large sample size (n = 100). In
contrast, when df = 10 the population distribution, although still noticeably asymmetric, is
already much closer to the normal distribution (Fig. 6). As can be seen from the sampling
distributions and Q-Q plots, the mean converges to the normal distribution considerably
faster than it does for the chi-squared distribution with df = 2 (Fig. 5 and Fig. 6). As for
the other three types of population, the variance of the sampling distribution declines as
the sample size increases.
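One way to make the difference between df = 2 and df = 10 concrete is through skewness. The sum of n independent chi-squared variables with df degrees of freedom is chi-squared with n * df degrees of freedom, and a chi-squared distribution with k degrees of freedom has skewness $\sqrt{8/k}$, so the mean of n observations has skewness $\sqrt{8/(n \cdot df)}$. The sketch below tabulates this value for the settings used here; the function name `skew_of_mean` is an assumption made for this illustration.

```r
# Theoretical skewness of the mean of n iid chi-squared(df) observations.
# skew_of_mean is a hypothetical helper introduced for illustration.
skew_of_mean <- function(n, df) sqrt(8 / (n * df))

sizes <- c(5, 10, 50, 100)
skew_table <- rbind(
  df2  = skew_of_mean(sizes, df = 2),
  df10 = skew_of_mean(sizes, df = 10)
)
colnames(skew_table) <- paste0("n=", sizes)
round(skew_table, 3)

# The chi-squared distribution with df = 2 coincides with the
# exponential distribution with rate 1/2 (mean 2)
x <- c(0.5, 1, 2, 5)
all.equal(pchisq(x, df = 2), pexp(x, rate = 0.5))
```

Under this measure, the mean of 10 observations with df = 10 has the same skewness as the mean of 50 observations with df = 2 (both $\sqrt{8/100} \approx 0.283$), which is consistent with the faster convergence observed for df = 10.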
Fig. 6 Simulation results for the sampling distribution of the mean from a population with the
$\chi^2$ distribution (df = 10). (Upper panel) The sample size n equals 5, 10, 50 and 100,
respectively, from left to right. In each plot, the blue line represents the density of the sampling
distribution of the mean and the red line represents the population distribution (the $\chi^2$
distribution, df = 10). (Lower panel) The corresponding Q-Q plots of the sampling distribution at
each sample size against the standard normal distribution.
4 Discussion
This report summarizes simulation results for the central limit theorem using four different
types of population distribution. The sampling distribution of the mean was plotted at each
sample size, and Q-Q plots were used to assess how close each sampling distribution is to
the normal distribution. The results show that the type of the original population
substantially influences the rate at which the sampling distribution of the mean converges.
If the original population is normally or approximately normally distributed, the
corresponding sampling distribution converges to the normal distribution quickly, so that
the Q-Q plot shows a linear pattern even at a small sample size. Consistent with the
statement of the CLT that the standardized mean is asymptotically normal as $n \to \infty$,
the distribution of the mean is closer to the normal distribution at larger sample sizes in
all the simulations, regardless of the original population. The variance of the sampling
distribution also decreases as the sample size increases, in line with the CLT approximation
that the sample mean $\bar{Y}$ is approximately normally distributed with variance
$\sigma^2/n$ for a sufficiently large sample size.
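The $\sigma^2/n$ scaling of the variance can be checked numerically. The sketch below uses an exponential population with rate 1 (variance 1) and a reduced number of replications; the seed, the replication count and the variable names are assumptions made for this illustration.

```r
# Compare the empirical variance of the sample mean with sigma^2 / n
# for an Exponential(rate = 1) population, for which sigma^2 = 1.
set.seed(7)
sizes <- c(5, 10, 50, 100)
nsim <- 20000  # reduced from the 1,000,000 replications used in the report

empirical_var <- sapply(sizes, function(k) var(replicate(nsim, mean(rexp(k)))))
theoretical_var <- 1 / sizes

data.frame(n = sizes, empirical = empirical_var, theoretical = theoretical_var)
```

The two columns should agree up to Monte Carlo error, and the agreement does not rely on the sampling distribution being normal, since the variance identity holds for any sample size.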
Larger sample sizes could be used to better approximate the limit in which n goes to infinity.
In addition, more parameter values and degrees of freedom (where applicable) could be used
to better demonstrate the rate of convergence for different types of population. Furthermore,
it should be possible to formulate a quantitative measure of the rate of convergence in order
to evaluate the asymptotic behavior of different sampling distributions.
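One possible quantitative measure, offered here only as a suggestion, is the Kolmogorov-Smirnov distance between the standardized sample means and the standard normal distribution, that is, the largest vertical gap between the empirical distribution function and $\Phi$. The sketch below computes it with `ks.test` from base R for an exponential population; the helper name `ks_distance`, the seed and the replication count are assumptions made for this illustration.

```r
# Kolmogorov-Smirnov distance between standardized sample means and N(0, 1).
# ks_distance is a hypothetical helper, not part of the original report.
ks_distance <- function(rdist, n, mu, sigma, nsim = 20000) {
  means <- replicate(nsim, mean(rdist(n)))
  z <- (means - mu) / (sigma / sqrt(n))
  unname(ks.test(z, "pnorm")$statistic)
}

set.seed(123)
sizes <- c(5, 10, 50, 100)
# Exponential(rate = 1) population: mu = 1, sigma = 1
d_exp <- sapply(sizes, function(k) ks_distance(rexp, k, mu = 1, sigma = 1))
names(d_exp) <- paste0("n=", sizes)
d_exp
```

A smaller distance indicates a sampling distribution closer to normal. With a finite number of replications the distance cannot fall much below the Monte Carlo noise level, which is on the order of $1/\sqrt{nsim}$, so comparisons at large n need correspondingly more replications.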
Appendix
Source code
# Sample sizes and number of simulated samples per sample size
n <- c(5, 10, 50, 100)
nsim <- 1000000

# normal distribution
# Row j holds the nsim sample means obtained with sample size n[j]
normal.bar <- matrix(0, nrow = 4, ncol = nsim)
for (j in 1:4)
{
  for (i in 1:nsim)
  {
    y1 <- rnorm(n[j])
    normal.bar[j, i] <- mean(y1)
  }
}
par(mfrow = c(2, 4))

# Upper panel: density of the sample mean (blue) and of the population (red)
for (p in 1:4)
{
  plot(density(normal.bar[p, ]), xlim = c(-2, 2), lwd = 1.5,
       col = "blue", main = "", xlab = "x", font.lab = 2, font.axis = 2)
  lines(density(rnorm(nsim)), lwd = 1.5, col = "red")
}

# Lower panel: Q-Q plot of the sample means against draws from N(0, 1)
for (q in 1:4)
{
  qqplot(rnorm(nsim), normal.bar[q, ], cex = 0.8,
         xlab = "Theoretical quantiles of N(0,1)",
         ylab = "Sample quantiles", font.lab = 2, font.axis = 2)
}

##################################################

n <- c(5, 10, 50, 100)
nsim <- 1000000

# exponential distribution (rate = 1, the rexp default)
exp.bar <- matrix(0, nrow = 4, ncol = nsim)
for (j in 1:4)
{
  for (i in 1:nsim)
  {
    y1 <- rexp(n[j])
    exp.bar[j, i] <- mean(y1)
  }
}
par(mfrow = c(2, 4))

# Upper panel: density of the sample mean (blue) and of the population (red)
for (p in 1:4)
{
  plot(density(exp.bar[p, ]), xlim = c(-1, 3), lwd = 1.5,
       col = "blue", main = "", xlab = "x", font.lab = 2, font.axis = 2)
  lines(density(rexp(nsim)), lwd = 1.5, col = "red")
}

# Lower panel: Q-Q plot of the sample means against draws from N(0, 1)
for (q in 1:4)
{
  qqplot(rnorm(nsim), exp.bar[q, ], cex = 0.8,
         xlab = "Theoretical quantiles of N(0,1)",
         ylab = "Sample quantiles", font.lab = 2, font.axis = 2)
}

##################################################

n <- c(5, 10, 50, 100)
nsim <- 1000000

# T distribution, df = 10 as an example
t.bar <- matrix(0, nrow = 4, ncol = nsim)
for (j in 1:4)
{
  for (i in 1:nsim)
  {
    y1 <- rt(n[j], 10)
    t.bar[j, i] <- mean(y1)
  }
}
par(mfrow = c(2, 4))

# Upper panel: density of the sample mean (blue) and of the population (red)
for (p in 1:4)
{
  plot(density(t.bar[p, ]), xlim = c(-2, 2), lwd = 1.5,
       col = "blue", main = "", xlab = "x", font.lab = 2, font.axis = 2)
  lines(density(rt(nsim, 10)), lwd = 1.5, col = "red")
}

# Lower panel: Q-Q plot of the sample means against draws from N(0, 1)
# (this call is truncated in the source; the remaining arguments are
# restored by analogy with the other blocks)
for (q in 1:4)
{
  qqplot(rnorm(nsim), t.bar[q, ], cex = 0.8,
         xlab = "Theoretical quantiles of N(0,1)",
         ylab = "Sample quantiles", font.lab = 2, font.axis = 2)
}

##################################################

n <- c(5, 10, 50, 100)
nsim <- 1000000

# chi-squared distribution, df = 10 as an example
# (the simulation loop is missing from the source and is reconstructed
# here by analogy with the blocks above)
chisq.bar <- matrix(0, nrow = 4, ncol = nsim)
for (j in 1:4)
{
  for (i in 1:nsim)
  {
    y1 <- rchisq(n[j], df = 10)
    chisq.bar[j, i] <- mean(y1)
  }
}
par(mfrow = c(2, 4))

# Upper panel: density of the sample mean (blue) and of the population (red)
for (p in 1:4)
{
  plot(density(chisq.bar[p, ]), xlim = c(-1, 20), lwd = 1.5,
       col = "blue", main = "", xlab = "x", font.lab = 2, font.axis = 2)
  lines(density(rchisq(nsim, df = 10)), lwd = 1.5, col = "red")
}

# Lower panel: Q-Q plot of the sample means against draws from N(0, 1)
for (q in 1:4)
{
  qqplot(rnorm(nsim), chisq.bar[q, ], cex = 0.8,
         xlab = "Theoretical quantiles of N(0,1)",
         ylab = "Sample quantiles", font.lab = 2, font.axis = 2)
}