This Reading Assignment examines how data in a sample can be collected and then
used to provide information on the wider population. Many of the examples use the
mean of the sample to estimate the population mean, a practice often used in
finance. The Central Limit Theorem allows us to make probability statements about
a population mean based on sample data. It is imperative that you understand the
concept and calculation of confidence intervals for the population mean, and when
to use the z-statistic or the t-statistic.

There are different ways of selecting a sample from a population. The basic type of sample is a simple
random sample. In this sample each item or person in the population has an equal probability of
being included.

Example -1: Simple random sample

Put each member of a population in a sequence and identify each member by a number. Then use
random number tables to select as many numbers as the required sample size. Match these numbers
to the members of the population to identify the sample.
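
The procedure in this example can be sketched in a few lines of Python. This is only an illustration: the population of 500 numbered members, the sample size of 20 and the function name are assumptions, and a seeded random number generator stands in for the random number tables.

```python
import random

# Hypothetical population: 500 members identified by the numbers 1..500.
population_ids = list(range(1, 501))

def simple_random_sample(ids, sample_size, seed=None):
    """Draw sample_size distinct ids; every id is equally likely to be chosen."""
    rng = random.Random(seed)  # stands in for a random number table
    return rng.sample(ids, sample_size)

# The selected numbers are then matched back to the population members.
sample_ids = simple_random_sample(population_ids, 20, seed=1)
```

Because the draw is made without replacement, no member can appear in the sample twice.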

When it is not practical to assign a number to each item in a population, we might use
systematic random sampling. In this case the items are arranged in order and every nth item is
included in the sample. This assumes that there is no pattern in the way the items are arranged.

Example 11-2: Systematic random sample

A chocolate bar manufacturer selects every 100th chocolate bar coming off a conveyor belt for
inclusion in a sample to test the weights of chocolate bars being produced.
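
A minimal sketch of this selection rule, assuming a hypothetical production run of 1,000 bars numbered in the order they leave the belt (the run length and function name are illustrative only):

```python
def systematic_sample(items, step):
    """Return every step-th item (the step-th, 2*step-th, ...) in arrival order."""
    return items[step - 1::step]

# Hypothetical run: bars numbered 1..1000 in the order they come off the belt.
bars = list(range(1, 1001))
selected = systematic_sample(bars, 100)  # bars 100, 200, ..., 1000
```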

Although a random sample will reflect the characteristics of the population in an unbiased way, there
is likely to be a difference between the estimate from the sample and the actual population value.

Sampling error is defined as the difference between the observed value of a sample statistic and the
quantity that it is being used to estimate from the population.

Example -3: Sampling error

A chocolate bar manufacturer selects every 100th chocolate bar coming off a conveyor belt for
inclusion in a sample to test the weights of chocolate bars being produced. The mean weight of
chocolate bars in the sample is 105 grams, the mean population weight is 100 grams, and sampling
error is therefore 5 grams.

A sampling distribution of a statistic is the distribution of all possible distinct values that the
statistic can assume when samples of the same size are randomly taken from the population.

For example the sampling distribution of the sample mean is the distribution of all possible
sample means of a given sample size and the probability of occurrence of each sample mean.

Example -4: Sampling distribution

The four employees of a firm have worked for the firm for 3, 7, 8 and 12 years. To calculate the
sampling distribution of the sample mean for samples of two workers calculate the means for all
possible samples:

Employees in Sample (years worked)    Sample Mean (in years)
3 and 7                               5.0
3 and 8                               5.5
3 and 12                              7.5
7 and 8                               7.5
7 and 12                              9.5
8 and 12                              10.0

Therefore the sampling distribution of the sample mean is:

Sample Mean (in years)    Probability
5.0                       0.167
5.5                       0.167
7.5                       0.333
9.5                       0.167
10.0                      0.167

We can see from the example that the mean of the sample means (7.5 years) is the same as the
population mean, and that the standard deviation of the distribution of sample means is less than
that of the population.
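
The sampling distribution for the four employees can be checked by enumerating every sample of two. The sketch below uses only the figures from the example; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from statistics import mean, pstdev

years = [3, 7, 8, 12]  # years worked by the four employees

# Means of all possible samples of two workers (order ignored, no replacement).
sample_means = [mean(pair) for pair in combinations(years, 2)]

# Probability of each distinct sample mean.
counts = Counter(sample_means)
distribution = {m: Fraction(c, len(sample_means)) for m, c in sorted(counts.items())}

population_mean = mean(years)             # 7.5
mean_of_sample_means = mean(sample_means) # also 7.5
population_sd = pstdev(years)
sd_of_sample_means = pstdev(sample_means) # smaller than population_sd
```

There are six possible samples, and a sample mean of 7.5 occurs twice, which is why its probability is 2/6.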

Another method of taking a sample is stratified random sampling. In this case we divide the
population into subgroups (or strata) and select a sample from each subgroup. If it is a proportional
sample, then the number of items selected from each subgroup is proportional to the size of the
subgroup relative to the total population.

Example -5: Stratified random sampling

If we wish to study the usage of cars by a population of car owners we might decide to divide car
owners into three subgroups by age as shown below.

Age                  Percentage of car owners    Number in sample
Under 25 years       15%                         300
25 to 55 years       60%                         1,200
55 years and over    25%                         500
Total                100%                        2,000

The number from each group selected for the sample is based on the percentage of car owners in that
group.
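
The proportional allocation in the table can be reproduced as follows. The function name is an assumption made for illustration; the shares and the total sample size of 2,000 come from the example.

```python
def proportional_allocation(shares, total_sample_size):
    """Allocate a sample across strata in proportion to each stratum's population share."""
    return {stratum: round(share * total_sample_size) for stratum, share in shares.items()}

# Share of car owners in each age group, from the example table.
shares = {"Under 25": 0.15, "25 to 55": 0.60, "55 and over": 0.25}
allocation = proportional_allocation(shares, 2000)
```

With less convenient percentages, rounding each stratum separately may leave the total one or two items away from the target, so the allocation may need a small manual adjustment.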

Two different forms of data are:

1. Time-series data

Time-series data is a sequence of observations collected at discrete and equally spaced time
intervals, for example historic monthly stock returns.

2. Cross-sectional data

This is data collected on a characteristic of a group, which might be a group of individuals or
companies, at a single point in time. Last year's closing prices for stocks that trade on the
NYSE are an example of cross-sectional data.

Central Limit Theorem

For a population with a mean of μ and a variance of σ², the sampling distribution of the sample
mean (x̄) of all possible samples of size n will be approximately normally distributed with a mean
of μ and a variance of σ²/n (assuming n is large, say 30 or over).

To summarize:

• Even if the distribution of the population is not normal, the sampling distribution of the
sample mean, x̄, is approximately a normal distribution.

• The mean of the distribution of x̄ will be equal to the mean of the population.

• The variance of the distribution of x̄ will be equal to the variance of the population divided by
the sample size.
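
A small simulation can illustrate these points. The sketch below is an assumption-laden example rather than part of the reading: it draws samples of size 30 from an exponential population (clearly non-normal, with mean 1 and variance 1) and looks at the mean and variance of 5,000 sample means.

```python
import random
from statistics import mean, pvariance

rng = random.Random(42)  # fixed seed so the run is repeatable

def exponential_sample_mean(n):
    """Mean of n draws from an exponential population with mean 1 and variance 1."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

n = 30
sample_means = [exponential_sample_mean(n) for _ in range(5000)]

mean_of_means = mean(sample_means)           # CLT: close to the population mean, 1
variance_of_means = pvariance(sample_means)  # CLT: close to sigma^2 / n = 1/30
```

The simulated values will not match the theoretical ones exactly, but they should be close, and a histogram of sample_means would look roughly bell-shaped even though the population is skewed.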

The standard error of the sample mean is

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

This is the standard deviation of the sampling distribution of the sample mean.

If the population standard deviation (σ) is not known, then we can use the sample standard deviation,
s, to estimate the standard error, which is then denoted by:

$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$


Example -6: Standard error of the sample mean

If the standard deviation of a population is 10 and a sample of 49 items is taken from the population,
then the standard error of the sample mean is:

$$\sigma_{\bar{x}} = \frac{10}{\sqrt{49}} = \frac{10}{7} \approx 1.43$$

Estimating a Population Parameter
The formulae that we use to calculate a sample statistic are estimators. The particular value that we
calculate using an estimator is an estimate.

A point estimate is a single estimate calculated from a sample which is used to estimate the
population parameter. An example of this would be a sample mean being calculated as a point
estimate of the population mean.

Another approach is to make an interval estimate of the parameter; this means we find an interval
that will include the population parameter with a certain level of probability. This is a confidence
interval.

The three desirable properties of an estimator (or estimation formula) are:

1. Unbiased - the expected value (the mean of its sampling distribution) is the same as the
parameter it is intended to estimate.

2. Efficient - there is no other unbiased estimator of the same parameter with a sampling
distribution of smaller variance.

3. Consistent - the probability of accurate estimates increases as the sample size increases.

Confidence Intervals
A confidence interval is an interval within which the population parameter lies with a specified
probability (1 - α). This probability is the degree of confidence. The interval is called the
100(1 - α)% confidence interval for the parameter.

The end points of the interval are called the lower and upper confidence limits.

A 95% confidence interval can be interpreted by considering the case when we take a large number of
samples from the population and construct a confidence interval for each sample. We expect 95% of
these confidence intervals to include the population mean. Following on, we can say that we are 95%
confident that a single confidence interval includes the population mean.
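
This repeated-sampling interpretation can be illustrated with a simulation. The sketch below rests on assumptions that are not in the reading: a normal population with mean 100 and known standard deviation 10, samples of 49 items, and each interval taken as the sample mean plus or minus 1.960 standard errors.

```python
import math
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

MU, SIGMA, N = 100.0, 10.0, 49  # hypothetical population mean, sd and sample size
Z_95 = 1.960                     # reliability factor for 95% confidence
half_width = Z_95 * SIGMA / math.sqrt(N)

def interval_covers_mu():
    """Draw one sample and report whether its 95% interval contains MU."""
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = sum(sample) / N
    return x_bar - half_width <= MU <= x_bar + half_width

trials = 4000
coverage = sum(interval_covers_mu() for _ in range(trials)) / trials
```

The value of coverage should come out near 0.95: roughly 95% of the intervals include the population mean, while any single interval either does or does not.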

Constructing a confidence interval

A confidence interval is defined by:

$$\text{point estimate} \pm \text{reliability factor} \times \text{standard error}$$

where

point estimate = a point estimate of the parameter

reliability factor = a number based on the assumed distribution of the point estimate and the degree
of confidence for the interval

standard error = standard error of the sample statistic providing the point estimate

Applying this to the case where we are estimating the population mean from a sample taken from a
normally distributed population with known variance, the confidence interval is given by:

$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$

where

x̄ = sample mean, which is the point estimate of the population mean

σ = population standard deviation

n = sample size

zα/2 = reliability factor, the point where α/2 of the probability is in the right tail

Using the characteristics of a normal distribution, we can see that:

• z0.05 is used for 90% confidence intervals, since it is the point where 5% of the probability is
in the right tail and 5% is in the left tail. z0.05 is 1.645.

• z0.025, which is used for 95% confidence intervals, is 1.960.

• z0.005, which is used for 99% confidence intervals, is 2.575.

Another way of saying this is that:

• 90% of the sample means will be within 1.645 standard errors of the population mean.

• 95% of the sample means will be within 1.960 standard errors of the population mean.

• 99% of the sample means will be within 2.575 standard errors of the population mean.
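
The reliability factors can be recovered from the standard normal distribution rather than read from a table. This is a sketch using Python's standard library; the function name is an assumption made for illustration.

```python
from statistics import NormalDist

def z_reliability_factor(confidence):
    """z value that leaves (1 - confidence) / 2 of the probability in each tail."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

z_90 = z_reliability_factor(0.90)  # about 1.645
z_95 = z_reliability_factor(0.95)  # about 1.960
z_99 = z_reliability_factor(0.99)  # about 2.576
```

The 99% value comes out closer to 2.576; printed tables commonly round it to 2.575 or 2.58, and the difference has little effect on the resulting interval.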

For any distribution, if we do not know the population variance and it is a large sample, we can use

$$\bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}$$

where

s = sample standard deviation

x̄ = sample mean

n = sample size

• The 90% confidence interval for the mean is x̄ ± 1.645 s/√n

• The 95% confidence interval for the mean is x̄ ± 1.960 s/√n

• The 99% confidence interval for the mean is x̄ ± 2.575 s/√n

Example -7: Confidence intervals

A sample of 81 observations is taken from a normal population; the sample mean is 20 and the
standard deviation is 3.

The 90% confidence interval is 20 ± (1.645 × 3)/9, which is 19.45 up to 20.55.

This means we can be 90% confident that the population mean lies between 19.45 and 20.55.

The 95% confidence interval is 20 ± (1.960 × 3)/9, which is 19.35 up to 20.65.

The 99% confidence interval is 20 ± (2.575 × 3)/9, which is 19.14 up to 20.86.
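
The three intervals in this example can be verified with a short calculation. The function and variable names below are illustrative; the inputs (mean 20, standard deviation 3, sample size 81) and the reliability factors are those quoted in the text.

```python
import math

def z_confidence_interval(sample_mean, std_dev, n, z):
    """Return (lower, upper) for sample_mean +/- z * std_dev / sqrt(n)."""
    half_width = z * std_dev / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Reliability factors as quoted in the text.
Z = {"90%": 1.645, "95%": 1.960, "99%": 2.575}

intervals = {level: z_confidence_interval(20, 3, 81, z) for level, z in Z.items()}
rounded = {level: (round(lo, 2), round(hi, 2)) for level, (lo, hi) in intervals.items()}
```

The interval widens as the degree of confidence rises, because a larger reliability factor multiplies the same standard error of 3/9.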

Student's t-Distribution
An alternative method for constructing confidence intervals is to use the t-distribution. It is a more
conservative method, giving wider intervals, and ideally is used whenever the population variance is
unknown, even when it is a large sample. When it is a small sample (fewer than 30 observations) and
we do not know the population variance, it is essential to use the t-distribution approach.

The t-distribution is a symmetrical probability distribution defined by a single parameter, the number
of degrees of freedom (df).

Degrees of freedom are the number of independent observations used.

The t-distribution with a mean of 0 and (n - 1) degrees of freedom is given by:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

It is not normal since there are two random variables, the sample mean and the sample standard
deviation. However, as the number of degrees of freedom increases, the t-distribution approaches the
normal distribution, as shown below:

[Figure: t-distributions for increasing degrees of freedom compared with the normal distribution]

Confidence Intervals for the Population Mean
If we are considering a population with unknown variance and either

• the sample is large, or

• the sample is small but the population is normally distributed, then

the 100(1 - α)% confidence interval is given by:

$$\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}}$$

where the number of degrees of freedom for tα/2 is (n - 1), with a sample size of n.

In order to answer hypothesis questions you may be required to read t-distribution tables to find the
critical value of t. We show an excerpt from the tables below. Note that these are for one-tailed tests,
so for α = 0.05 you would use p = 0.05, whereas for a two-tailed test you would need to use p = 0.025,
which is half the significance level.

For example, with 5 degrees of freedom and α = 0.05, the critical t-value for a one-tailed test
would be 2.015. For a two-tailed test (p = 0.025) it would be 2.571.

df    p = 0.10    p = 0.05    p = 0.025
1     3.078       6.314       12.706
2     1.886       2.920       4.303
3     1.638       2.353       3.182
4     1.533       2.132       2.776
5     1.476       2.015       2.571


Example -8: Confidence intervals

An investor is looking at the quarterly returns from a mutual fund portfolio, which are assumed to be
normally distributed and have a sample mean of 3% and a sample standard deviation of 2%. He looks at
3 years' data (12 quarterly observations) and wishes to compute the 95% confidence interval. Since
the sample is small he uses Equation 3-12.

He will need to use t-distribution tables to look up t0.025 for 11 degrees of freedom (since the sample
size is 12); this is 2.201.

The confidence interval is

$$3\% \pm 2.201 \times \frac{2\%}{\sqrt{12}} = 3\% \pm 1.27\%$$

The 95% confidence interval is between 1.73% and 4.27%.

The investor can be confident, at the 95% level, that this range includes the population mean.
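
The investor's calculation can be checked as follows. This is a sketch: the t value of 2.201 is taken from the text rather than computed, the function name is illustrative, and the z-based interval is included only to show how much narrower it would be for the same data.

```python
import math

def mean_confidence_interval(sample_mean, sample_sd, n, reliability_factor):
    """Return (lower, upper) for sample_mean +/- reliability_factor * sample_sd / sqrt(n)."""
    half_width = reliability_factor * sample_sd / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

T_0025_DF11 = 2.201  # t value for 11 degrees of freedom, as quoted in the text
Z_0025 = 1.960       # z value for 95% confidence, for comparison only

# Returns are in percent: mean 3, sample standard deviation 2, 12 quarters.
t_lower, t_upper = mean_confidence_interval(3.0, 2.0, 12, T_0025_DF11)
z_lower, z_upper = mean_confidence_interval(3.0, 2.0, 12, Z_0025)
```

The t-based interval (about 1.73% to 4.27%) is wider than the z-based one, which reflects the extra uncertainty from estimating the standard deviation with only 12 observations.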

In summary, the table below shows which statistic to use for different samples.

Distribution    Variance    Small Sample     Large Sample
Normal          Known       z                z
Normal          Unknown     t                z or t
Nonnormal       Known       Not Available    z
Nonnormal       Unknown     Not Available    z or t
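
The summary table can be expressed as a small decision function. The function name and signature are assumptions made for illustration, and the cutoff of 30 for a large sample follows the rule of thumb used earlier in the text.

```python
def reliability_statistic(population_is_normal, variance_known, n, large_sample_cutoff=30):
    """Return "z", "t", "z or t", or None (not available), following the summary table."""
    large = n >= large_sample_cutoff
    if variance_known:
        # Known variance: z applies unless the sample is small and nonnormal.
        return "z" if (population_is_normal or large) else None
    if large:
        # Unknown variance, large sample: either statistic may be used.
        return "z or t"
    # Unknown variance, small sample: t applies only for a normal population.
    return "t" if population_is_normal else None

choice = reliability_statistic(population_is_normal=True, variance_known=False, n=12)
```

For the investor with 12 quarterly returns from a normal population with unknown variance, the function returns "t".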

If a larger sample size is taken then the confidence interval will be narrower, because the standard
error is lower. As you would expect, a larger sample gives more precise results.

Biases Impacting on Data Selected

Data-Snooping Bias
This is the bias that occurs if you use the empirical results of other analysts' research, or focus on
patterns that may have been identified by other research. Ideally you would study new data but
unfortunately this may not be practical in financial markets where much of the research is based on
historic data.

Data-Mining Bias
This is when forecasting models are derived from searching through historic data for patterns/trading
rules. The problems occur when a large number of models are tested but only the successful ones
are reported.

Sample Selection Bias

This occurs when certain data is excluded from the analysis, possibly because the data was not
available.

Survivorship Bias
This is one type of sample selection bias, which occurs when companies that have gone bankrupt, or
funds or portfolios that have been liquidated, are not included in the analysis.
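
The effect of survivorship bias on a sample mean can be shown with a toy calculation. The fund returns below are entirely hypothetical and chosen only to make the point; they are not data from any study.

```python
from statistics import mean

# Hypothetical annual returns (%) for eight funds; the flag marks funds still operating.
funds = [
    (12.0, True), (8.0, True), (9.5, True), (11.0, True), (7.0, True),
    (-15.0, False), (-22.0, False), (-9.0, False),  # liquidated funds
]

all_funds_mean = mean(r for r, _ in funds)               # every fund that existed
survivors_mean = mean(r for r, alive in funds if alive)  # survivors only
```

Leaving out the liquidated funds raises the average return in this made-up sample from about 0.2% to 9.5%, which is the kind of overstatement survivorship bias produces.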

Look-Ahead Bias
This is when a test uses information that was not available at the test date. An example of this is
when the success of valuation ratios is considered but investors may not yet have had access to the
accounting data incorporated in the valuation ratio at the test date.

Time-Period Bias
This is when the test period used does not match the conclusion being drawn, for example when
short-term data is applied to provide long-term forecasts.