
Concepts

(Review of Probability)
In probability, we assume that the population and its
parameters are known and compute the probability of
drawing a particular sample.
In statistics, we assume that the population and its
parameters are unknown and the sample is used to infer the
values of the parameters.
Sampling variability: different samples give different
estimates of population parameters.
Sampling variability leads to sampling error.

Probability is deductive (general -> particular)
Statistics is inductive (particular -> general)
Probability Concepts
Random experiment: a procedure whose outcome cannot be
predicted in advance. E.g. toss a coin twice.
Sample space (S): a mutually exclusive, collectively exhaustive
listing of all possible outcomes.
S = {(H,H), (H,T), (T,H), (T,T)}
Event (A): a set of outcomes (a subset of S). E.g. no heads: A = {(T,T)}
Union (or): E.g. A = heads on first, B = heads on second;
A ∪ B = {(H,T), (H,H), (T,H)}
Intersection (and): E.g. A = heads on first, B = heads on second;
A ∩ B = {(H,H)}
Complement of event A: the set of all outcomes not in A. E.g.
A = {(T,T)}, Ac = {(H,H), (H,T), (T,H)}
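The set operations above map directly onto Python sets. The following is a minimal sketch for the two-toss example; the variable names are illustrative and not part of the original slides.

```python
from itertools import product

# Sample space for tossing a coin twice: ordered pairs of H/T.
S = set(product("HT", repeat=2))

A = {s for s in S if s[0] == "H"}   # event: heads on first toss
B = {s for s in S if s[1] == "H"}   # event: heads on second toss
no_heads = {("T", "T")}             # event: no heads at all

union = A | B                        # A or B
intersection = A & B                 # A and B
complement_no_heads = S - no_heads   # all outcomes not in "no heads"

print("S      =", sorted(S))
print("A u B  =", sorted(union))
print("A n B  =", sorted(intersection))
print("Ac     =", sorted(complement_no_heads))
```

Because the four outcomes are equally likely, the probability of any event here is simply its size divided by 4.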

Probability Model
Example: There is a 95% chance
that the sale of apples on Monday will be
120 kg.
Estimate = ______
Probability of error = ______
Interpretation:
To reach this kind of judgement we
need data and a fitted probability model.

Process of getting a probability model
1. Specify the experiment.
2. Recognize all outcomes (the sample space).
3. Assign a number to each outcome (the random variable x).
4. Determine the probability for each value of x.
[Figure: example chart, 1st Qtr to 4th Qtr]
The determination of the probability distribution
completes the process of describing the probability
model.
Some definitions
Random Experiment
Sample Point
Sample Space
Random Variable
Event: any subset of the sample space
is called an event.
A random sample gives every unit of the
population a non-zero (equal) chance
of entering the sample.



Examples:
Experiment: Flip a coin
Outcomes: two discrete outcomes: H, T
Sample space: discrete and finite
Random variable: define {x = 1} if H occurs
and {x = 0} if T occurs
Experiment: Taking an exam
Outcomes: grades A, B, C, D, E, F
Sample space: discrete and finite
Random variable: {y = 4} if the grade is A,
{y = 3} if the grade is B, etc.

Random Variables
Continuous Random Variables
Cumulative Distribution Function
Two Famous Theorems
iid: independent and identically distributed
Simply stating:
The Law of Large Numbers (LLN) says that as
$n \to \infty$, the sample mean converges to the
population mean, i.e.,

$$ \bar{x} - \mu \to 0 \quad \text{as } n \to \infty $$
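The Law of Large Numbers can be illustrated with a small simulation. The sketch below assumes a fair coin coded as 0/1 (population mean 0.5); the seed, checkpoints, and function name are arbitrary choices for illustration, not part of the slides.

```python
import random

def running_sample_means(n_max, checkpoints, seed=0):
    """Flip a fair 0/1 coin n_max times; record the sample mean at each checkpoint."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0
    means = {}
    for n in range(1, n_max + 1):
        total += rng.randint(0, 1)
        if n in checkpoints:
            means[n] = total / n
    return means

checkpoints = {10, 100, 1_000, 10_000, 100_000}
means = running_sample_means(100_000, checkpoints)

for n in sorted(means):
    # The gap |sample mean - 0.5| tends to shrink as n grows.
    print(f"n={n:>7}  sample mean={means[n]:.4f}  gap={abs(means[n] - 0.5):.4f}")
```

The gap need not shrink at every single step, but it becomes small for large n.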
Probability Distribution
A probability distribution is defined for a random variable x
which takes values $x_1, x_2, \ldots, x_n$ with probabilities
$P(x_1), P(x_2), \ldots, P(x_n)$.
The function $P(x_i)$ is called the probability mass function; this
symbol is used if the variable is discrete. $f(x)$ is called the
probability density function; the notation $f(x)$ is used if the
variable is continuous.

The summary of the distinct values $x_i$ of a random variable X
together with their probabilities P(x) or f(x) is known as the
probability distribution of the random variable.

Discrete prob. Distributions: Binomial, Poisson
Continuous Prob. Distribution : Normal
Selected Discrete Distributions
$$ {}^{n}C_{x} = \frac{n!}{x!\,(n-x)!} $$
Binomial Rule
P(two sixes from 3 dice)
n = 3 trials
Define success: occurrence of a six
Define failure: non-occurrence of a six
r = 2 successes
Formula: if the probability of success in
any one trial is p, the probability of r
successes in n trials is

$$ P(r) = \frac{n!}{r!\,(n-r)!}\; p^{r} (1-p)^{n-r} $$
Solve Following Questions
P(2 sixes from 3 dice)
P(3 sixes from 5 throws)
P(fewer than 2 sixes from 4 dice)
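The three questions can be checked with a short script. This is a sketch assuming fair six-sided dice (p = 1/6); exact fractions are used so no rounding is involved, and the function name is illustrative.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

p_six = Fraction(1, 6)  # assumption: fair dice

q1 = binomial_pmf(2, 3, p_six)                         # exactly 2 sixes from 3 dice
q2 = binomial_pmf(3, 5, p_six)                         # exactly 3 sixes from 5 throws
q3 = sum(binomial_pmf(r, 4, p_six) for r in range(2))  # 0 or 1 six from 4 dice

for label, value in [("Q1", q1), ("Q2", q2), ("Q3", q3)]:
    print(label, value, f"{float(value):.4f}")
```

"Fewer than 2" is read here as 0 or 1 successes, so the third answer is a sum of two binomial terms.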

Binomial Theorem
Pascal's Triangle
Poisson Distribution
The Poisson distribution can be used to determine the
probability of a designated number of events occurring, when
these events occur in a continuum of time.
A long-run mean ($\lambda$) number of events for the specific time
interval of interest is required to find the probability of a
designated number of events.
The probability of x successes in a Poisson distribution is
given by

$$ P(x \mid \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!} $$

The mean of a Poisson process is always proportional to the
length of time; therefore, if the mean is available for one length of
time, the mean for any other required time period can be
determined.

Question on Poisson
On average, 12 people per hour ask questions of a
decorating consultant in a fabric store. What is the
probability that three or more will approach the
consultant with questions during a 10-minute period?
Solution: Average per hour = 12
10 min = 1/6 of an hour
Average per 10 min = 1/6 × 12 = 2
$P(x \ge 3 \mid \lambda = 2) = P(x = 3 \mid \lambda = 2) + P(x = 4 \mid \lambda = 2) + \cdots$
$= 0.1804 + 0.0902 + 0.0361 + 0.0120 + \cdots \approx 0.3233$
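The same calculation can be sketched in code. Rather than summing the infinite tail, the sketch uses the complement, 1 − P(0) − P(1) − P(2); the function and variable names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(x | lam) = lam^x * e^(-lam) / x!"""
    return lam**x * exp(-lam) / factorial(x)

rate_per_hour = 12
lam = rate_per_hour * (10 / 60)  # the mean scales with the length of the interval

# P(x >= 3) = 1 - [P(0) + P(1) + P(2)]
p_at_least_3 = 1 - sum(poisson_pmf(x, lam) for x in range(3))

print(f"lambda for 10 min = {lam:.1f}")
print(f"P(x = 3) = {poisson_pmf(3, lam):.4f}")
print(f"P(x >= 3) = {p_at_least_3:.4f}")
```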

Selected Discrete Distributions
(cont)
Selected Continuous Distributions
Normal Distribution
Many years ago I called the Laplace-Gauss curve the NORMAL
curve, which name, while it avoids an international question of
priority, has the disadvantage of leading people to believe that all
other distributions of frequency are in one sense or another
ABNORMAL.

That belief is, of course, not justifiable.

Karl Pearson, 1920
Normal Distribution
(Bell-curve, Gaussian)
Normal Distribution
Transformation of Normal Random
Variables
Finding probabilities using standard
normal distribution
Finding values of standard normal
random variable
Excel Functions:
NORMSDIST(z): returns the area to the left of z under the standard
normal distribution.
NORMSINV(p): returns the value of z on the standard normal
distribution for cumulative probability p.
NORMINV(p,m,s): returns the value x of a normal variable with mean m
and SD s whose cumulative probability is p.
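For readers without Excel, the same three lookups can be sketched with Python's standard-library `statistics.NormalDist`. The input values (1.96, 0.975, mean 50, SD 10) are chosen only for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

# Counterpart of NORMSDIST(1.96): area to the left of z.
area_left = std_normal.cdf(1.96)

# Counterpart of NORMSINV(0.975): z with cumulative probability p.
z_value = std_normal.inv_cdf(0.975)

# Counterpart of NORMINV(0.975, 50, 10): x for a normal with mean 50, SD 10.
x_value = NormalDist(mu=50, sigma=10).inv_cdf(0.975)

print(f"area left of 1.96   = {area_left:.4f}")
print(f"z for p = 0.975     = {z_value:.4f}")
print(f"x for p = 0.975     = {x_value:.2f}")
```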
Transformation of a normal random variable to a standard normal
variable
We move the distribution from its centre of 50 to a centre of 0. This is done by
subtracting 50 from all the values of X. Thus, we shift the distribution 50 units
back so that its new centre is 0. The second thing we need to do is to make the
width of the distribution, the standard deviation, equal to 1. This is done by
squeezing the width down from 10 to 1. Because the total probability under the
curve must remain 1.00, the distribution must grow upwards to maintain the same
area. Mathematically, squeezing the curve to make the width 1 is equivalent to
dividing the random variable by the standard deviation. The area under the curve
adjusts so that the total remains the same. All probabilities adjust accordingly.
Mathematically, $Z = (X - \mu)/\sigma$
[Figure: Transformation of X ($\mu$ = 50, $\sigma$ = 10) to Z (centre 0, SD 1.0): subtraction of $\mu$, then division by $\sigma$]
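A quick numerical check of the transformation: the sketch below uses the mean of 50 and SD of 10 from the example, and a hypothetical value x = 65 that is not in the slides. The probability to the left of x under X should match the probability to the left of z under the standard normal.

```python
from statistics import NormalDist

mu, sigma = 50, 10  # centre and width from the example

def to_z(x, mu, sigma):
    """Shift by the mean, then scale by the standard deviation."""
    return (x - mu) / sigma

x = 65  # hypothetical value chosen for illustration
z = to_z(x, mu, sigma)

p_from_x = NormalDist(mu, sigma).cdf(x)  # area left of x under N(50, 10)
p_from_z = NormalDist().cdf(z)           # area left of z under N(0, 1)

print(f"z = {z}")
print(f"P(X <= {x}) = {p_from_x:.4f}")
print(f"P(Z <= {z}) = {p_from_z:.4f}")
```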
Normal Probability Plots
After collecting data, a common problem is
deciding whether a population or random
variable is normally distributed.
Since the distribution of a random sample from a
population will approximate the distribution
of the population (a larger sample provides a
better approximation), if the population is
normally distributed, then a graph of a
sample should reflect it.
A sensitive graphical technique called the
normal probability plot helps.
Normal Probability Plots
A normal probability plot is a plot of
sample data versus normal scores.
Normal scores are the data we would
expect to get by taking a sample of the
same size from the standard normal
distribution.
If the sample is from a normal
population, then the normal probability
plot should be linear (or roughly
linear).
Guidelines for Inference from
Normal Probability Plot
These guidelines should be interpreted loosely for small
samples, but can be interpreted rather strictly for large samples.
If the plot is roughly linear, then accept as reasonable that the
population is approximately normally distributed.
If the plot shows systematic deviations from linearity
(e.g. if it displays significant curvature), then conclude that the
population is probably not approximately normally distributed.
** Shapes of these plots are based on ideal situations, i.e. large
samples from exact distributions.
Exhibit-1
The Internal Revenue Service publishes data
on federal individual income tax returns in
Statistics of Income, Individual Income Tax
Returns. A sample of 12 returns reveals the
adjusted gross incomes, in thousands of
dollars, shown below.
9.7 93.1 33.0 21.2
81.4 51.1 43.5 10.6
12.8 7.8 18.1 12.7
a) Construct a normal probability plot for these data.
b) Assess the normality of adjusted gross
income.
The normal probability plot for these data displays significant
curvature. Evidently, adjusted gross incomes are not approximately
normally distributed.
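The coordinates of such a plot can be computed without plotting software. The sketch below pairs the sorted incomes with normal scores obtained from Blom's approximation, (i − 0.375)/(n + 0.25); that formula is an assumption here, and printed tables of normal scores may differ slightly.

```python
from statistics import NormalDist

# Adjusted gross incomes (thousands of dollars) from Exhibit-1.
incomes = [9.7, 93.1, 33.0, 21.2, 81.4, 51.1, 43.5, 10.6, 12.8, 7.8, 18.1, 12.7]

def normal_scores(n):
    """Approximate expected standard-normal order statistics (Blom's formula, an assumption)."""
    std = NormalDist()
    return [std.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

# Each pair is one point of the normal probability plot.
pairs = list(zip(normal_scores(len(incomes)), sorted(incomes)))

for score, value in pairs:
    print(f"{score:6.2f}  {value:6.1f}")
```

Plotting the second column against the first gives the normal probability plot; points that bend away from a straight line suggest non-normality.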
Detecting Outliers using Normal
probability Plot
The Department of Agriculture publishes
data on chicken consumption. Last
year's chicken consumption, in kg,
for randomly selected people is
displayed in the table below. Use a normal
probability plot to discuss the distribution
of chicken consumption and to detect
any outliers.
47 39 62 49 50 70 59 45 72
53 55 0 65 63 53 51 50
On removing the outlier
0 kg from the data set, it
appears plausible that,
among people who eat
chicken, the amounts they
consume annually are
approximately normally
distributed.
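One way to put a number on "roughly linear" is the correlation between the sorted data and their normal scores. The sketch below compares that correlation with and without the 0 kg observation; Blom's approximation for the normal scores and the use of a correlation coefficient are assumptions added for illustration, not the slides' stated method.

```python
from statistics import NormalDist, mean

def normal_scores(n):
    """Approximate normal scores via Blom's formula (an assumption)."""
    std = NormalDist()
    return [std.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def plot_correlation(data):
    """Pearson correlation between sorted data and their normal scores."""
    ys = sorted(data)
    xs = normal_scores(len(ys))
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

chicken = [47, 39, 62, 49, 50, 70, 59, 45, 72, 53, 55, 0, 65, 63, 53, 51, 50]

r_all = plot_correlation(chicken)
r_trimmed = plot_correlation([c for c in chicken if c != 0])

print(f"correlation with the 0 kg value:    {r_all:.3f}")
print(f"correlation without the 0 kg value: {r_trimmed:.3f}")
```

A correlation closer to 1 after dropping the 0 kg value is consistent with the slide's conclusion that the remaining data look approximately normal.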
Statistics- The Easiest Subject
Sir Ronald A. Fisher
(1890-1962)

Wrote the first books
on statistical methods
(1926 & 1936):
"A student should not
be made to read
Fisher's books unless
he has read them
before."

George W. Snedecor
(1882-1974)

Taught at Iowa State
Univ., where he wrote a
college textbook (1937):
"Thank God for
Snedecor; now we can
understand Fisher."
(named the distribution
for Fisher)


Sampling
Procedure by which some members of
the defined population are selected as
representative of the entire population.
Sampling Methods
Example 1: I know that the market for product X is Chinese
villages, is my assumption about them right?
Example 2: I know the product is for a peculiar population. Is
Delhi a right place to market the product?
Stages: Population -> Sample (or sampling distribution, using C.L.T.)
-> Analysis of data collected from the previous step
Questions to ask along the way:
What do you know about the population?
How can you be sure that your sample is fit for analysis?
What kind of analysis shall we carry out? Why?
What can you say about the population now?
Why do we use samples?
Get information from large populations
At minimal cost.
At maximum speed (Least Time)
At increased accuracy.



Using enhanced tools

Sampling Terminology
Population: the relevant target group for study.
Census: data collection from entire population.
Sample: a subset of target population, selected to
represent population.
Sample Unit: elements of the targeted population
available for selection during sampling.
Sample Frame: a list or other way of identifying
units from which a sample is to be drawn.

Sampling Terminology
Sample Representativeness: degree to which
sample is similar to target population in terms of key
characteristics.
Incidence Rate: percentage of people in the general
population or on a list who fit the qualifications of
those the researcher wishes to describe.
Sampling Error: discrepancies between data
generated from a sample and the actual population
data, arising from sampling instead of taking a census.
Non-Sampling Error: all other biases at any stage,
including an inaccurately defined population, an old sampling frame,
error in measurement, etc., that can occur regardless of
whether a sample or a census is used.
Steps in planning sample study
Step 1: Define the target population (key
characteristics)
Step 2: Define the data collection method
and, margin of error (o)
Step 3: Obtain the designated sampling frame.
Step 4: Determine the sampling method.
Considering Time/ Area/Budget/ precision
Non probability / probability method of sampling
Step 5: Determine sample size.
Step 6: Develop operational procedure
Types of Sampling Methods
Non-Probability Samples: Convenience, Snowball, Judgment, Quota
Probability Samples: Simple Random, Systematic, Stratified, Cluster,
Multi-Stage Sampling
Stratified Random Sampling
Separates the population into mutually exclusive sets
(strata), and then draws simple random samples from each
stratum.

Strata are similar to blocks in an experiment.

With this procedure we can acquire information about:
the whole population
each stratum
the relationships among strata.

Examples of strata:
Sex: male, female
Age: under 20, 20-30, 31-40, 41-50
Occupation: professional, clerical, blue-collar
Stratified Random Sampling
There are several ways to build a
stratified sample. For example, keep
the proportion of each stratum in the
population.

Stratum  Income           Population proportion  Stratum size
1        under $15,000    25%                    250
2        15,000-29,999    40%                    400
3        30,000-50,000    30%                    300
4        over $50,000     5%                     50
Total                                            1,000
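Proportional allocation is a one-line calculation per stratum. The sketch below assumes the total of 1,000 is the overall sample size to be split across strata in line with the population proportions; the function name is illustrative.

```python
def proportional_allocation(total, percent_by_stratum):
    """Split `total` across strata in proportion to each stratum's population share (in %)."""
    return {name: round(total * pct / 100) for name, pct in percent_by_stratum.items()}

# Population proportions from the income table, in percent.
population_share = {
    "under $15,000": 25,
    "15,000-29,999": 40,
    "30,000-50,000": 30,
    "over $50,000": 5,
}

allocation = proportional_allocation(1000, population_share)

for stratum, size in allocation.items():
    print(f"{stratum:<15} {size}")
print("Total", sum(allocation.values()))
```

With proportions that do not divide evenly, simple rounding can leave the total off by one or two units, so the result should be checked against the intended total.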
Determining Sample Size

NON STATISTICAL APPROACH

Arbitrary: a chosen % of the population.
Conventional: suggested by past
research or industry standards.
Constraint-driven: set by cost or time constraints.
Determining Sample Size
Statistical Approach- Using Confidence Intervals
3 Factors in Determining Sample Size
Confidence Intervals (Confidence in estimates)




Sampling Error: Precision or tolerance for error around
estimate stated in percentage points.
Estimated Standard Deviation: Estimate of variability of
population characteristics based on prior information

Confidence level  Z
90%               1.65
95%               1.96
98%               2.33
99%               2.58
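The Z values in the table are two-sided standard normal quantiles and can be reproduced as a sketch with the standard library. Note that the exact 90% value is about 1.645, which the table rounds to 1.65.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided critical value: leaves (1 - level) / 2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}  z = {z_for_confidence(level):.3f}")
```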
Determining Sample Size-
3 Questions
1. How close do you want your sample estimate to be to the
unknown parameter? The answer to this question is
denoted by e, the desired accuracy range.
2. What do you want the confidence level to be so that the
distance between the estimate and the parameter is less
than or equal to e?
3. What is your estimate of the variance (or standard deviation)
of the population in question?
Ans: this is often unknown and we need to estimate it
using the range (if you are sure no outliers are present), $\sigma \approx$
(Range/6); or, if the population is approximately normal
and you can get the 95% bounds on values in the
population, divide the difference between the upper and
lower bounds by 4; or conduct a pilot survey to estimate
$\sigma$.
Sample size formula for means
(Interval or Ratio data)
$$ n = \frac{z^{2} \sigma^{2}}{e^{2}} \times g $$
Where,
n = required sample size
z = Z value for the desired confidence level
$\sigma$ = estimated SD for the population mean
e = desired accuracy range
g = design effect
Sample size formula for proportions
(Nominal or Ordinal data)
$$ n = \frac{z^{2} (P \times Q)}{e^{2}} \times g $$
where,
n = required sample size
z = the Z value for the desired confidence level
P = estimate of the population proportion
Q = (100 - P)
e = desired accuracy range
g = design effect
Exhibit1:
A market research firm wants to conduct a survey to
estimate the average amount spent on entertainment by
each person visiting a popular resort. The people who
plan the survey would like to be able to determine the
average amount spent by all people visiting the resort to
within $120, with 95% confidence. From past operation
of the resort, an estimate of the population standard
deviation is $400. What is the minimum required sample
size?
Taking g = 1:
$$ n = \frac{z^{2} \sigma^{2}}{e^{2}} \times g = \frac{(1.96)^{2} \times 160000}{120^{2}} = 42.684 \approx 43 $$
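The calculation for the resort survey can be sketched in a few lines. The function name is illustrative, and the design effect g is assumed to be 1, since the exhibit does not state one.

```python
from math import ceil

def sample_size_for_mean(z, sigma, e, g=1.0):
    """n = (z^2 * sigma^2 / e^2) * g, before rounding."""
    return (z**2 * sigma**2 / e**2) * g

# Resort survey: 95% confidence, SD $400, accuracy within $120, g assumed to be 1.
n_raw = sample_size_for_mean(z=1.96, sigma=400, e=120)
n_required = ceil(n_raw)  # round up so the accuracy target is still met

print(f"n (raw) = {n_raw:.3f}")
print(f"minimum sample size = {n_required}")
```

Rounding up rather than to the nearest integer keeps the achieved margin of error no larger than the requested one.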
Exhibit 2:
The manufacturer of a sports car wants to estimate the
proportion of people in a given income bracket who are
interested in the model. The company wants to know the
population proportion p to within 0.10 with 99%
confidence. Current company records indicate that the
proportion p is about 0.25. What is the minimum required
sample size for this survey?
Taking g = 1:
$$ n = \frac{z^{2} p q}{e^{2}} \times g = \frac{(2.576)^{2} \times (0.25)(0.75)}{0.10^{2}} = 124.42 \approx 125 $$
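The sports car calculation follows the same pattern with p·q in place of the variance. This is a sketch with an illustrative function name; g is again assumed to be 1, and proportions are entered as fractions rather than percentages.

```python
from math import ceil

def sample_size_for_proportion(z, p, e, g=1.0):
    """n = (z^2 * p * (1 - p) / e^2) * g, before rounding."""
    return (z**2 * p * (1 - p) / e**2) * g

# Sports car survey: 99% confidence, p about 0.25, accuracy within 0.10.
n_raw = sample_size_for_proportion(z=2.576, p=0.25, e=0.10)
n_required = ceil(n_raw)

print(f"n (raw) = {n_raw:.2f}")
print(f"minimum sample size = {n_required}")
```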
Sample Size Calculations
Simple random / systematic sampling:
$$ n = \frac{z^{2}\, p\, q}{d^{2}} = \frac{1.96^{2} \times 0.15 \times 0.85}{0.03^{2}} = 544 $$
Cluster sampling:
$$ n = g \times \frac{z^{2}\, p\, q}{d^{2}} = 2 \times \frac{1.96^{2} \times 0.15 \times 0.85}{0.03^{2}} = 1088 $$
Where,
z = Z score associated with the level of significance
p = expected prevalence
q = 1 - p
d = absolute precision
g = design effect
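The effect of the design effect g can be seen by running the same formula twice. The sketch below uses the slide's inputs (z = 1.96, p = 0.15, d = 0.03) with g = 1 for simple random or systematic sampling and g = 2 for cluster sampling; results are rounded to the nearest integer to match the slide, and the function name is illustrative.

```python
def sample_size(z, p, d, g=1.0):
    """n = g * z^2 * p * (1 - p) / d^2, before rounding."""
    return g * z**2 * p * (1 - p) / d**2

z, p, d = 1.96, 0.15, 0.03

n_simple = sample_size(z, p, d, g=1)   # simple random / systematic sampling
n_cluster = sample_size(z, p, d, g=2)  # cluster sampling with design effect 2

print(f"simple random / systematic: {n_simple:.2f} -> {round(n_simple)}")
print(f"cluster (g = 2):            {n_cluster:.2f} -> {round(n_cluster)}")
```

A design effect of 2 simply doubles the required sample, reflecting the information lost when sampled units within a cluster resemble each other.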
Sampling Cost
Sampling Cost = Fixed Cost + Variable Cost
Fixed cost is independent of sample size, e.g. the cost of
planning and organizing the sampling experiment.
Variable cost increases with sample size;
it includes the cost of selection, measurement and
recording of each sampled item.
Error Cost: more difficult to estimate than
sampling cost. Usually it increases more rapidly
than linearly with the amount of error. Often
a quadratic formula is used.
Total Cost = Sampling Cost + Error Cost
Sampling Distributions
Definitions and Key Concepts
A sample statistic used to estimate an
unknown population parameter is called an
estimate.
The discrepancy between the estimate and
the true parameter value is known as
sampling error.
A statistic is a random variable with a
probability distribution, called the sampling
distribution, which is generated by repeated
sampling.



Distribution of Sample Means
How do the sample mean and variance vary in
repeated samples of size n drawn from the
population?
Generally, the exact distribution is difficult to calculate.
What can be said about the distribution of the
sample mean when the sample is drawn from an
arbitrary population?
In many cases we can approximate the distribution of
the sample mean when n is large by a normal
distribution.
The famous Central Limit Theorem

Central Limit Theorem
Estimators and their properties
Unbiased: an estimator is unbiased if its expected value is equal to the
population parameter it estimates. The sample mean is an unbiased
estimator of the population mean.
Efficiency: an estimator is efficient if it has a relatively small
variance (not S.D.).
Consistency: an estimator is consistent if its probability of being close to
the parameter it estimates increases as the sample size increases.
Sufficient: an estimator is said to be sufficient if it contains all
the information in the data about the parameter it estimates.
Estimators of the population mean include the sample mean and the sample median.

$S^2$ is an unbiased estimator of $\sigma^2$, but the SD ($S$) is not an unbiased
estimator of the population SD $\sigma$. We still use $S$ as an estimator, ignoring
the small bias, relying on the fact that $S^2$ is an unbiased estimator of
$\sigma^2$.


              MEAN                             MEDIAN
UNBIASED      YES                              YES
EFFICIENCY    HIGHER THAN MEDIAN
SUFFICIENCY   USES ALL VALUES FOR CALCULATION  USES ONLY POSITION
Degree of freedom
When we calculate the sample variance, the
deviations are taken from $\bar{x}$ and not from $\mu$.
The reason is simple: when sampling, the population
mean is almost always unknown and
we have to estimate $\mu$ using $\bar{x}$. This reduces
our degrees of freedom from n to (n-1). But
taking squared deviations from $\bar{x}$ induces a
downward bias in the deviations.
Dividing the sum of squared deviations by only
its d.f. will yield an unbiased estimate of the
population variance.
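The downward bias can be verified exactly on a tiny population. The sketch below enumerates every ordered sample of size 3 drawn with replacement (so draws are iid) from the five basketball heights used in the exhibit on sampling error, and averages the variance estimate under both divisors. The choice of n = 3 and sampling with replacement are assumptions made for illustration.

```python
from itertools import product

population = [76, 78, 79, 81, 86]  # heights from the sampling-error exhibit
N = len(population)
mu = sum(population) / N
sigma_sq = sum((x - mu) ** 2 for x in population) / N  # true population variance

def average_variance_estimate(n, ddof):
    """Average of SS / (n - ddof) over all ordered samples drawn with replacement."""
    samples = list(product(population, repeat=n))
    total = 0.0
    for s in samples:
        xbar = sum(s) / n
        ss = sum((x - xbar) ** 2 for x in s)  # squared deviations from the sample mean
        total += ss / (n - ddof)
    return total / len(samples)

divide_by_n = average_variance_estimate(3, ddof=0)
divide_by_df = average_variance_estimate(3, ddof=1)

print(f"population variance      = {sigma_sq:.4f}")
print(f"average of SS / n        = {divide_by_n:.4f}  (biased downward)")
print(f"average of SS / (n - 1)  = {divide_by_df:.4f}  (matches)")
```

Dividing by n understates the variance by the factor (n − 1)/n on average, which is exactly what dividing by the degrees of freedom corrects.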
Exhibit:
Sampling Error & need of Sampling Distribution
Sampling Error: error resulting from using a sample, instead of a
census, to estimate a population quantity.
Suppose the population of interest consists of the heights (in inches) of the
five starting players on a men's basketball team.
76 78 79 81 86 ($\mu$ = 80)
(i) Determine the sampling distribution of the mean for a random
sample of (a) two heights, (b) four heights, from the population of
five heights.
(ii) Make some observations about sampling error when the mean of a
random sample of (a) two heights, (b) four heights, is used to
estimate the population mean $\mu$.
(iii) Employ the sampling distribution of the mean obtained above
to find the probability that the sampling error made in
estimating the population mean, $\mu$, by the sample mean, $\bar{x}$, will be
at most 1 inch; that is, determine the probability that the sample
mean will be within 1 inch of $\mu$.

Considering a sample size of two
Sample  76,78  76,79  76,81  76,86  78,79  78,81  78,86  79,81  79,86  81,86
Mean    77.0   77.5   78.5   81.0   78.5   79.5   82.0   80.0   82.5   83.5
[Figure: dot plot of the sample means]

Probability of any one sample being selected = 1/10 = 0.1
Probability distribution of the random variable $\bar{x}$ (the sampling
distribution of the mean):
Mean         77.0  77.5  78.5  79.5  80.0  81.0  82.0  82.5  83.5
Probability  0.1   0.1   0.2   0.1   0.1   0.1   0.1   0.1   0.1
(ii) Using the above results we can make some simple but
significant observations about sampling error when the mean, $\bar{x}$, of
a random sample of two heights is used to estimate the population
mean $\mu$.
It is unlikely that the mean of the sample selected will be equal to 80.
In fact only 1 of the 10 samples has a mean of 80; thus in this case
the chances are only 0.1 that $\bar{x}$ will equal $\mu$; some sampling error is
likely.

(iii) Since $\mu$ = 80 inches, we need to find:

$$ P(79 \le \bar{x} \le 81) = P(\bar{x} = 79.5,\ 80.0,\ \text{or}\ 81.0) = P(79.5) + P(80.0) + P(81.0) = 0.1 + 0.1 + 0.1 = 0.3 $$

If we take a random sample of two heights, there is a 30% chance
that the mean of the sample will be within 1 inch of the population mean.
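The enumeration for samples of two heights can be reproduced with a short script. This is a sketch using exact fractions, with samples drawn without replacement as in the exhibit; the function name is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

heights = [76, 78, 79, 81, 86]
mu = Fraction(sum(heights), len(heights))  # population mean, 80

def sampling_distribution(population, n):
    """Map each possible sample mean to its probability (samples without replacement)."""
    samples = list(combinations(population, n))
    counts = Counter(Fraction(sum(s), n) for s in samples)
    return {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

dist = sampling_distribution(heights, 2)
p_within_1 = sum(p for m, p in dist.items() if abs(m - mu) <= 1)

for m, p in dist.items():
    print(f"mean {float(m):5.1f}  probability {float(p):.1f}")
print("P(sample mean within 1 inch of mu) =", p_within_1)
```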
Considering a sample size of four

Sample  76,78,79,81  76,78,79,86  76,78,81,86  76,79,81,86  78,79,81,86
Mean    78.5         79.75        80.25        80.50        81.00
[Figure: dot plot of the sample means]

Probability of any one sample being selected = 1/5 = 0.2
Probability distribution of the random variable $\bar{x}$ (the sampling
distribution of the mean):
Mean         78.5  79.75  80.25  80.50  81.00
Probability  0.2   0.2    0.2    0.2    0.2
(ii) Using the above results we observe that none of the samples of
four heights has a mean equal to the population mean 80; thus
when the mean, $\bar{x}$, of a random sample of four heights is used to
estimate the population mean, $\mu$, some sampling error is certain.

(iii) Since $\mu$ = 80 inches, we need to find:

$$ P(79 \le \bar{x} \le 81) = P(\bar{x} = 79.75,\ 80.25,\ 80.50,\ \text{or}\ 81.0) = P(79.75) + P(80.25) + P(80.50) + P(81.0) = 0.2 + 0.2 + 0.2 + 0.2 = 0.8 $$

If we take a random sample of four heights, there is an 80% chance
that the mean of the sample will be within 1 inch of the population mean.

Hence, as the sample size $n \to \infty$, S.E. $\to 0$.
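The comparison between samples of two and four heights can be summarized in one sketch. It reports both the chance of landing within 1 inch of the population mean and the standard deviation of the sample mean over all possible samples; the latter measure is added here for illustration and is not computed in the slides.

```python
from fractions import Fraction
from itertools import combinations
from math import sqrt

heights = [76, 78, 79, 81, 86]
mu = Fraction(sum(heights), len(heights))  # population mean, 80

def summarize(n):
    """Return (P(|mean - mu| <= 1), SD of the sample mean) over all samples of size n."""
    means = [Fraction(sum(s), n) for s in combinations(heights, n)]
    p_within = Fraction(sum(abs(m - mu) <= 1 for m in means), len(means))
    sd = sqrt(sum((m - mu) ** 2 for m in means) / len(means))
    return p_within, sd

p2, se2 = summarize(2)
p4, se4 = summarize(4)

print(f"n = 2: P(within 1 inch) = {float(p2):.1f}, SD of sample mean = {se2:.3f}")
print(f"n = 4: P(within 1 inch) = {float(p4):.1f}, SD of sample mean = {se4:.3f}")
```

The spread of the sample mean shrinks as the sample grows, which is the pattern the exhibit's closing line points to.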
