
Studies of the sampling distribution given different population distributions and sample sizes


Xiou Cao

1 Introduction

The central limit theorem (CLT) is arguably the most important result in probability and
mathematical statistics, and it underlies much of the theory of statistical estimation and
hypothesis testing. Briefly, the CLT states that the arithmetic mean of a sufficiently large
number of independent and identically distributed random variables with finite variance is
approximately normally distributed, regardless of the probability distribution of the original
population. In mathematical terms, the CLT can be stated as follows.
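If $X_1, X_2, \ldots, X_n$ are independent and identically distributed random variables with
$$E(X_i) = \mu, \qquad \mathrm{Var}(X_i) = \sigma^2,$$
and if we define
$$Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}, \qquad \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,$$
then the distribution function of $Z_n$ converges to the standard normal distribution function as $n \to \infty$.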
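
As a minimal illustration of this statement (a sketch, not part of the original simulation), the following R code standardizes the means of exponential samples and checks that the result has mean near 0, variance near 1, and roughly 95% of its mass within 1.96 of zero. The helper name `standardized_means`, the seed, the reduced number of replicates, and the choice of an Exp(1) population (for which the mean and standard deviation both equal 1) are assumptions made for this example.

```r
# Hypothetical helper (not from the report): draw `nsim` samples of size `n`
# from Exp(1), for which mu = 1 and sigma = 1, and return the standardized
# means Z_n = (mean - mu) / (sigma / sqrt(n)).
standardized_means <- function(n, nsim = 10000, mu = 1, sigma = 1) {
  x <- matrix(rexp(n * nsim), nrow = nsim, ncol = n)  # one sample per row
  (rowMeans(x) - mu) / (sigma / sqrt(n))
}

set.seed(1)  # arbitrary seed, for reproducibility only
z <- standardized_means(n = 100)
c(mean = mean(z), var = var(z))  # should be close to 0 and 1
mean(abs(z) <= 1.96)             # should be close to 0.95 under the CLT
```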


The CLT highlights the importance and universality of the normal distribution and provides a
way of estimating the parameters of a population from a sufficiently large sample. In this
report, I used the statistical programming software R to simulate the CLT with different
population distributions and sample sizes. These simulations not only illustrate the validity
of the CLT and its independence of the type of the original population, but also demonstrate
qualitatively that the rate at which the sampling distribution converges to the normal
distribution depends on both the sample size and the original population. The results give a
more intuitive understanding of the CLT and of its limitations in statistical analysis.

2 Method

The simulation was performed in R. To compare the results, I selected four types of population
distribution as sample sources: the normal, exponential, t and chi-squared distributions. To
show the influence of the sample size, four sample sizes (n = 5, 10, 50, 100) were used for
each population distribution. For each sample size, the sampling process was repeated 1,000,000
times, and the mean of every sample was calculated and stored in the corresponding row of a
matrix created for that population distribution (one row per sample size). The density of the
sampling distribution was then plotted for every sample size of each population distribution
from the sample means stored in the matrix. Within each image, the density of the corresponding
population distribution was added for reference and comparison. To assess the deviation of the
sampling distribution of the mean from the normal distribution, a Q-Q plot was constructed for
every simulation against 1,000,000 draws from the standard normal distribution, the same number
as the simulated sample means. The closer the points in a Q-Q plot lie to a straight line, the
closer the distribution is to a normal distribution. For the t and chi-squared distributions,
two different degrees of freedom were used (t distribution: df = 5, 10; chi-squared
distribution: df = 2, 10). The source code is given in the Appendix.
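
The procedure described above can be sketched compactly as follows. This is an illustrative, vectorized variant rather than the code used for the report: the helper name `sample_means` is hypothetical, the seed is arbitrary, and the number of replicates is reduced from 1,000,000 to 2,000 so that the example runs quickly.

```r
# Hypothetical helper: returns a matrix with one row per sample size, where
# row j holds the means of `nsim` samples of size n[j] drawn with `rdist`.
sample_means <- function(rdist, n = c(5, 10, 50, 100), nsim = 2000, ...) {
  t(sapply(n, function(nj) {
    draws <- matrix(rdist(nj * nsim, ...), nrow = nsim, ncol = nj)
    rowMeans(draws)  # one mean per simulated sample
  }))
}

set.seed(42)  # arbitrary seed, for reproducibility only
exp.bar   <- sample_means(rexp)            # exponential population, rate 1
chisq.bar <- sample_means(rchisq, df = 2)  # chi-squared population, df = 2
dim(exp.bar)  # rows = sample sizes, columns = replicates
```

Each row can then be passed to `plot(density(...))` and `qqplot(rnorm(nsim), ...)` in the same way as in the Appendix.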

3 Results

3.1 Normal distribution

Fig. 1 Simulation results for the sampling distribution of the mean when the population follows the
standard normal distribution. (Upper panel) The sample size n equals 5, 10, 50 and 100, respectively,
from left to right. In each figure, the blue line represents the density of the sampling distribution of
the mean and the red line represents the population distribution (the standard normal distribution).
(Lower panel) The corresponding Q-Q plot of the sampling distribution at each sample size against the
standard normal distribution.

The simulation results for samples from the standard normal distribution are shown in Fig. 1.
The sampling distribution of the mean shows the familiar bell shape of the normal distribution
even at a small sample size (n = 5). In fact, no convergence is needed in this case: the mean
of a normal sample is exactly normally distributed for every sample size. This can also be seen
in the four Q-Q plots, in which the points fall on a straight line. Because the sample means
are not standardized, the slope of that line is $1/\sqrt{n}$ rather than 1. More generally, for
a normal population with mean $\mu$ and variance $\sigma^2$, the sampling distribution of the
mean is a normal distribution with mean $\mu$ and variance $\sigma^2/n$, where n is the sample
size. As the sample size goes to infinity, the variance of the sampling distribution goes to
zero, which gives the so-called degenerate distribution. The mean of the sampling distribution
is constant and equal to the population mean. In contrast, the variance of the sampling
distribution decreases as the sample size increases, which makes the density of the sampling
distribution increasingly narrow and peaked (Fig. 1).
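
The two moments quoted above follow directly from the independence of the observations. A short derivation, valid for any population with finite variance and written with $\bar{X}_n$ for the mean of n observations (notation assumed for this note), is:

```latex
\begin{align*}
E(\bar{X}_n) &= \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \mu, \\
\mathrm{Var}(\bar{X}_n) &= \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{\sigma^2}{n}.
\end{align*}
```

For a normal population, $\bar{X}_n$ is a linear combination of independent normal variables and is therefore exactly $N(\mu, \sigma^2/n)$ for every n, not only in the limit.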

3.2 Exponential distribution

Fig. 2 Simulation results for the sampling distribution of the mean when the population follows the
exponential distribution. (Upper panel) The sample size n equals 5, 10, 50 and 100, respectively, from
left to right. In each figure, the blue line represents the density of the sampling distribution of the
mean and the red line represents the population distribution (the exponential distribution). (Lower
panel) The corresponding Q-Q plot of the sampling distribution at each sample size against the standard
normal distribution.

The exponential distribution is highly skewed compared with the normal distribution. However,
even at a small sample size (n = 5), the sampling distribution of the mean already tends toward
symmetry to some extent (Fig. 2). As the sample size increases, the density of the sampling
distribution becomes narrower and more symmetric. It is difficult to tell the difference in
shape between the sampling distributions obtained from the normal and exponential populations
when the sample size is sufficiently large (n = 50, 100 in Fig. 1 and Fig. 2), which implies
that the sample size plays an important role in determining how reliably the CLT can be applied
in statistical analysis. As with the normal population, the variance of the sampling
distribution diminishes as the sample size increases (Fig. 2). In addition, the Q-Q plots show
the sampling distribution approaching normality as the sample size n grows (Fig. 2).
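
The gradual loss of asymmetry can be made concrete. The mean of n independent exponential draws follows a Gamma distribution with shape n, whose skewness is $2/\sqrt{n}$, so the skewness falls from about 0.89 at n = 5 to 0.2 at n = 100. The sketch below compares this value with the skewness of simulated means; the `skewness` helper, the seed, and the reduced number of replicates are assumptions made for this example and are not part of the report's code.

```r
# Sample skewness (third standardized moment); defined here because base R
# has no built-in skewness function.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(7)    # arbitrary seed, for reproducibility only
n <- c(5, 10, 50, 100)
nsim <- 20000  # reduced from the report's 1,000,000 to keep the run short
empirical <- sapply(n, function(nj) {
  skewness(rowMeans(matrix(rexp(nj * nsim), nrow = nsim, ncol = nj)))
})
theoretical <- 2 / sqrt(n)  # skewness of a Gamma distribution with shape n
round(rbind(n, empirical, theoretical), 3)
```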

3.3 t-distribution

Fig. 3 Simulation results for the sampling distribution of the mean when the population follows the
t-distribution (df = 5). (Upper panel) The sample size n equals 5, 10, 50 and 100, respectively, from
left to right. In each figure, the blue line represents the density of the sampling distribution of the
mean and the red line represents the population distribution (the t-distribution, df = 5). (Lower
panel) The corresponding Q-Q plot of the sampling distribution at each sample size against the standard
normal distribution.

Fig. 4 Simulation results for the sampling distribution of the mean when the population follows the
t-distribution (df = 10). (Upper panel) The sample size n equals 5, 10, 50 and 100, respectively, from
left to right. In each figure, the blue line represents the density of the sampling distribution of the
mean and the red line represents the population distribution (the t-distribution, df = 10). (Lower
panel) The corresponding Q-Q plot of the sampling distribution at each sample size against the standard
normal distribution.

Student's t-distribution is applied in statistical estimation and hypothesis testing when the
variance of the population is unknown and is replaced by the sample variance. I chose two
degrees of freedom, 5 and 10, for the simulation. The sampling distribution obtained from the
t-distribution is similar to that obtained from the normal distribution; however, it has
heavier tails, implying a larger variance of the sampling distribution obtained from the
t-distribution (Fig. 1 and Fig. 3). Changing the degrees of freedom from 5 to 10 markedly
increases the rate at which the sampling distribution converges to the normal distribution, as
the Q-Q plots in Fig. 3 and Fig. 4 clearly show. When df = 10, the sampling distribution is
almost identical to a normal distribution even at a small sample size (Fig. 4). Consistent with
the observations in Sections 3.1 and 3.2, the variance of the sampling distribution decreases
for both degrees of freedom as the sample size increases.
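
The faster convergence at df = 10 is consistent with the moments of the t-distribution. As a supporting calculation (standard results, not taken from the simulation output), write $\nu$ for the degrees of freedom and $\gamma_2$ for the excess kurtosis, which is 0 for a normal distribution; both symbols are introduced only for this note.

```latex
\begin{align*}
\mathrm{Var}(X_i) &= \frac{\nu}{\nu - 2} \quad (\nu > 2), &
\gamma_2(X_i) &= \frac{6}{\nu - 4} \quad (\nu > 4), &
\gamma_2(\bar{X}_n) &= \frac{\gamma_2(X_i)}{n}. \\
\nu = 5:\quad \mathrm{Var}(X_i) &= \tfrac{5}{3}, &
\gamma_2(X_i) &= 6, &
\gamma_2(\bar{X}_5) &= 1.2. \\
\nu = 10:\quad \mathrm{Var}(X_i) &= \tfrac{5}{4}, &
\gamma_2(X_i) &= 1, &
\gamma_2(\bar{X}_5) &= 0.2.
\end{align*}
```

At n = 5 the mean of a df = 10 sample therefore has one sixth of the excess kurtosis of the mean of a df = 5 sample, which matches the nearly straight Q-Q plot described for df = 10.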

3.4 $\chi^2$ distribution

Fig. 5 Simulation results for the sampling distribution of the mean when the population follows the
$\chi^2$ distribution (df = 2). (Upper panel) The sample size n equals 5, 10, 50 and 100, respectively,
from left to right. In each figure, the blue line represents the density of the sampling distribution of
the mean and the red line represents the population distribution (the $\chi^2$ distribution, df = 2).
(Lower panel) The corresponding Q-Q plot of the sampling distribution at each sample size against the
standard normal distribution.

The degrees of freedom (df) strongly influence the shape of the chi-squared density (Fig. 5 and
Fig. 6). As df goes to infinity, the standardized chi-squared distribution converges to a
normal distribution. In this report, two degrees of freedom (df = 2, 10) were selected to show
the dependence of the sampling distribution on the population parameters. As shown in Fig. 5,
the population distribution is highly skewed when df = 2, a scenario comparable to the
exponential distribution (the chi-squared distribution with df = 2 is in fact an exponential
distribution with mean 2). Nevertheless, the sampling distribution still approaches the normal
distribution at a large sample size (n = 100). In contrast, when df = 10, the population
distribution is already similar to the normal distribution, although it still shows noticeable
asymmetry (Fig. 6). As can be seen from the density of the sampling distribution and the Q-Q
plots, its rate of convergence to the normal distribution is considerably higher than that of
the chi-squared distribution with df = 2 (Fig. 5 and Fig. 6). As with the other three types of
population, the variance of the sampling distribution declines as the sample size increases.
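
The difference between the two degrees of freedom can also be expressed through skewness. The sum of n independent chi-squared variables with df degrees of freedom is chi-squared with n * df degrees of freedom, so the mean has skewness $\sqrt{8/(n \cdot df)}$. The short sketch below tabulates this value for the sample sizes used in the report; the function name `skew_mean` is an assumption made for this example.

```r
n <- c(5, 10, 50, 100)

# Skewness of the mean of n i.i.d. chi-squared(df) draws. The sum is
# chi-squared with n * df degrees of freedom, whose skewness is
# sqrt(8 / (n * df)); rescaling the sum to a mean leaves skewness unchanged.
skew_mean <- function(n, df) sqrt(8 / (n * df))

round(rbind(n, df2 = skew_mean(n, 2), df10 = skew_mean(n, 10)), 3)
```

Under this measure, the mean of n draws at df = 10 is as skewed as the mean of 5n draws at df = 2, which is one way to read the faster convergence seen in Fig. 6.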

Fig. 6 Simulation results for the sampling distribution of the mean when the population follows the
$\chi^2$ distribution (df = 10). (Upper panel) The sample size n equals 5, 10, 50 and 100, respectively,
from left to right. In each figure, the blue line represents the density of the sampling distribution of
the mean and the red line represents the population distribution (the $\chi^2$ distribution, df = 10).
(Lower panel) The corresponding Q-Q plot of the sampling distribution at each sample size against the
standard normal distribution.

4 Discussion

This report summarizes simulations of the central limit theorem using four different types of
population distribution. The sampling distribution of the mean was plotted at each sample size,
and Q-Q plots were used to assess how close each sampling distribution is to the normal
distribution. The results show that the type of the original population substantially
influences the rate of convergence of the sampling distribution of the mean. If the original
population is normally or approximately normally distributed, the corresponding sampling
distribution converges to the normal distribution quickly, so that the Q-Q plot shows a linear
pattern even at a small sample size. Consistent with the statement of the CLT that the sampling
distribution approaches the normal distribution as $n \to \infty$, the distribution of the mean
is closer to the normal distribution at larger sample sizes in all the simulations, regardless
of the original population. The variance of the sampling distribution also diminishes as the
sample size increases, in agreement with the CLT, under which the sample mean $\bar{X}$ is
approximately normally distributed with variance $\sigma^2/n$ for a sufficiently large sample
size.
Larger sample sizes could be used to better approximate the limit in which n goes to infinity.
In addition, more parameter values and degrees of freedom (where applicable) could be examined
to better demonstrate the rate of convergence for different types of population. Furthermore,
it should be possible to formulate a quantitative measure of the rate of convergence in order
to evaluate the asymptotic behavior of different sampling distributions.
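
One possible quantitative measure, offered here only as a sketch and not as a method used in this report, is the Kolmogorov-Smirnov distance between the standardized sample means and the standard normal distribution. The Berry-Esseen theorem suggests that this distance should shrink on the order of $1/\sqrt{n}$ for populations with a finite third absolute moment. The function name `ks_distance`, its arguments, the seed, and the reduced number of replicates are all assumptions made for this example.

```r
# Kolmogorov-Smirnov distance between standardized sample means and N(0, 1).
# `rdist` generates the population draws; `mu` and `sigma` are the population
# mean and standard deviation. All names are illustrative.
ks_distance <- function(rdist, mu, sigma, n, nsim = 20000, ...) {
  xbar <- rowMeans(matrix(rdist(n * nsim, ...), nrow = nsim, ncol = n))
  z <- (xbar - mu) / (sigma / sqrt(n))
  unname(ks.test(z, "pnorm")$statistic)
}

set.seed(123)  # arbitrary seed, for reproducibility only
n <- c(5, 10, 50, 100)
d_exp  <- sapply(n, function(nj) ks_distance(rexp,  mu = 1, sigma = 1, n = nj))
d_norm <- sapply(n, function(nj) ks_distance(rnorm, mu = 0, sigma = 1, n = nj))
round(rbind(n, exponential = d_exp, normal = d_norm), 4)
```

For the normal population the distance reflects only simulation noise, whereas for the exponential population it should decrease as n grows, which gives a single number per sample size in place of a visual reading of the Q-Q plot.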

Appendix

Source code
n <- c(5, 10, 50, 100)  # sample sizes
nsim <- 1000000         # number of simulated samples per sample size

# normal distribution
# Row j of the matrix holds the nsim sample means for sample size n[j].
normal.bar <- matrix(0, nrow = 4, ncol = nsim)
for (j in 1:4) {
  for (i in 1:nsim) {
    y1 <- rnorm(n[j])
    normal.bar[j, i] <- mean(y1)
  }
}
par(mfrow = c(2, 4))

# Upper panel: density of the sample means (blue) and of the population (red).
for (p in 1:4) {
  plot(density(normal.bar[p, ]), xlim = c(-2, 2), lwd = 1.5,
       col = "blue", main = "", xlab = "x", font.lab = 2, font.axis = 2)
  lines(density(rnorm(nsim)), lwd = 1.5, col = "red")
}

# Lower panel: Q-Q plot of the sample means against draws from N(0, 1).
for (q in 1:4) {
  qqplot(rnorm(nsim), normal.bar[q, ], cex = 0.8,
         xlab = "Theoretical quantiles of N(0,1)",
         ylab = "Sample quantiles", font.lab = 2, font.axis = 2)
}

##################################################

n <- c(5, 10, 50, 100)  # sample sizes
nsim <- 1000000         # number of simulated samples per sample size

# exponential distribution (default rate = 1)
# Row j of the matrix holds the nsim sample means for sample size n[j].
exp.bar <- matrix(0, nrow = 4, ncol = nsim)
for (j in 1:4) {
  for (i in 1:nsim) {
    y1 <- rexp(n[j])
    exp.bar[j, i] <- mean(y1)
  }
}
par(mfrow = c(2, 4))

# Upper panel: density of the sample means (blue) and of the population (red).
for (p in 1:4) {
  plot(density(exp.bar[p, ]), xlim = c(-1, 3), lwd = 1.5,
       col = "blue", main = "", xlab = "x", font.lab = 2, font.axis = 2)
  lines(density(rexp(nsim)), lwd = 1.5, col = "red")
}

# Lower panel: Q-Q plot of the sample means against draws from N(0, 1).
for (q in 1:4) {
  qqplot(rnorm(nsim), exp.bar[q, ], cex = 0.8,
         xlab = "Theoretical quantiles of N(0,1)",
         ylab = "Sample quantiles", font.lab = 2, font.axis = 2)
}

##################################################

n <- c(5, 10, 50, 100)  # sample sizes
nsim <- 1000000         # number of simulated samples per sample size

# t distribution, df = 10 as an example
# Row j of the matrix holds the nsim sample means for sample size n[j].
t.bar <- matrix(0, nrow = 4, ncol = nsim)
for (j in 1:4) {
  for (i in 1:nsim) {
    y1 <- rt(n[j], 10)
    t.bar[j, i] <- mean(y1)
  }
}
par(mfrow = c(2, 4))

# Upper panel: density of the sample means (blue) and of the population (red).
for (p in 1:4) {
  plot(density(t.bar[p, ]), xlim = c(-2, 2), lwd = 1.5,
       col = "blue", main = "", xlab = "x", font.lab = 2, font.axis = 2)
  lines(density(rt(nsim, 10)), lwd = 1.5, col = "red")
}

# Lower panel: Q-Q plot of the sample means against draws from N(0, 1).
for (q in 1:4) {
  qqplot(rnorm(nsim), t.bar[q, ], cex = 0.8,
         xlab = "Theoretical quantiles of N(0,1)",
         ylab = "Sample quantiles", font.lab = 2, font.axis = 2)
}

##################################################

n <- c(5, 10, 50, 100)  # sample sizes
nsim <- 1000000         # number of simulated samples per sample size

# chi-squared distribution, df = 10 as an example
# Row j of the matrix holds the nsim sample means for sample size n[j].
chisq.bar <- matrix(0, nrow = 4, ncol = nsim)
for (j in 1:4) {
  for (i in 1:nsim) {
    y1 <- rchisq(n[j], df = 10)
    chisq.bar[j, i] <- mean(y1)
  }
}
par(mfrow = c(2, 4))

# Upper panel: density of the sample means (blue) and of the population (red).
for (p in 1:4) {
  plot(density(chisq.bar[p, ]), xlim = c(-1, 20), lwd = 1.5,
       col = "blue", main = "", xlab = "x", font.lab = 2, font.axis = 2)
  lines(density(rchisq(nsim, df = 10)), lwd = 1.5, col = "red")
}

# Lower panel: Q-Q plot of the sample means against draws from N(0, 1).
for (q in 1:4) {
  qqplot(rnorm(nsim), chisq.bar[q, ], cex = 0.8,
         xlab = "Theoretical quantiles of N(0,1)",
         ylab = "Sample quantiles", font.lab = 2, font.axis = 2)
}
