SAMPLING AND SAMPLING

DISTRIBUTION
Dr. Prabhuram Tripathy
Exponential Probability Distribution
• The exponential probability distribution is
useful in describing the time it takes to
complete a task.
• Exponential random variables can be used to describe:
-Time between vehicle arrivals at a toll booth
-Time required to complete a questionnaire
-The time between goals scored in a World Cup
soccer match
Exponential Probability
• Cumulative Probabilities:
P(x ≤ x0) = 1 – e^(-x0/μ)
• x0 = some specific value of x
• μ = mean time to complete the task
Example:
• HP’s Full-Service Pump: The time between arrivals of cars at HP’s full-service gas pump follows an exponential probability distribution with a mean time between arrivals of 3 minutes. HP would like to know the probability that the time between two successive arrivals will be 2 minutes or less.

Solution:
• P(x ≤ 2) = 1 – e^(-2/3) = 1 – .5134 = .4866
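• As a quick check, this calculation can be reproduced in a few lines of Python (a minimal sketch using only the standard library):

    import math

    # Exponential CDF: P(x <= x0) = 1 - e^(-x0/mu), with mean mu = 3 minutes
    mu = 3.0
    x0 = 2.0
    probability = 1 - math.exp(-x0 / mu)
    print(round(probability, 4))   # 0.4866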
What is a sample?
• A sample is a finite part of a statistical population whose properties are studied to gain information about the whole. It can also be defined as a set of respondents (people) selected from a larger population for the purpose of a survey.
• A population is a group of individuals (persons, objects, or items) from which samples are taken for measurement, for example a population of books or students.
What is sampling?
• Sampling is the act, process, or technique of
selecting a suitable sample, or a representative
part of a population for the purpose of
determining parameters or characteristics of the
whole population.
What is the purpose of sampling?
• To draw conclusions about populations from
samples, we must use inferential statistics which
enables us to determine a population's
characteristics by directly observing only a
portion (or sample) of the population.
• We obtain a sample rather than a complete
enumeration (a census ) of the population for
many reasons.
• It is cheaper to observe a part rather than the
whole, but we should prepare ourselves to cope
with the dangers of using samples.
What Is a Sampling Frame?
• When developing a research study, one of the first
things that you need to do is clarify all of the units that
you are interested in studying. Units could be people,
organizations, or existing documents.
• In research, these units make up the population of
interest. When defining the population, it's really
important to be as specific as possible.
• Prior to selecting a sample you need to define
a sampling frame, which is a list of all the units of the
population of interest. A sampling frame is a list or
database from which a sample can be drawn.
• You can only apply your research findings to the
population defined by the sampling frame.
Sampling Frame
• A sampling frame generally includes the
respondents’ names and appropriate contact
details (so that they can be contacted to take part
in the research), but may also include other
significant known information that may be drawn
upon in the analysis stage of the research such as
age, location or customer segmentation data.
• This information is often stored in an Excel
spreadsheet, or similar document.
• A sampling frame can be a list of just about anything. For example, if the population is “All infectious diseases in the United States,” the corresponding frame is a list of every known infectious disease in the country.
TYPES OF SAMPLES
Non-probability (non-random) samples:
• These samples focus on volunteers, easily available units, or
those that just happen to be present when the research is
done. Non-probability samples are useful for quick and
cheap studies, for case studies, for qualitative research, for
pilot studies, and for developing hypotheses for future
research.
• Convenience sample: also called an "accidental" or "man-in-the-street" sample. The researcher selects units that are convenient, close at hand, easy to reach, etc.
• Purposive sample: the researcher selects the units with
some purpose in mind, for example, students who live in
dorms on campus, or experts on urban development.
• Quota sample: the researcher constructs quotas for
different types of units. For example, to interview a fixed
number of shoppers at a mall, half of whom are male and
half of whom are female.
Non-probability (non-random) samples
• Judgmental or Purposive Sampling: In judgmental sampling, the samples are selected based purely on the researcher's knowledge and credibility. In other words, researchers choose only those participants whom they feel are a right fit (with respect to attributes and representation of the population) to take part in the research study.
• Snowball Sampling: Snowball sampling helps researchers find a sample when subjects are difficult to locate. Researchers use this technique when potential respondents are few and not easily available. It works like a referral program: once the researchers find suitable subjects, those subjects are asked to help locate similar subjects so that a reasonably sized sample can be formed.

• Other samples that are usually constructed with non-probability methods include library research, participant observation, marketing research, consulting with experts, and comparing organizations, nations, or governments.
TYPES OF SAMPLES
Probability-based (random) samples:
• These samples are based on probability theory. Every
unit of the population of interest must be identified, and
all units must have a known, non-zero chance of being
selected into the sample.
• Simple random sample: Each unit in the population is
identified, and each unit has an equal chance of being in
the sample. The selection of each unit is independent of
the selection of every other unit. Selection of one unit
does not affect the chances of any other unit.
• Example—A teacher puts students' names in a hat and chooses without looking to get a sample of students.
• Why it's good: Random samples are usually fairly
representative since they don't favor certain members.
Random samples
• Stratified random sample: The population is first
split into groups. The overall sample consists of
some members from every group. The members
from each group are chosen randomly.
• Example—A student council surveys 100 students by getting random samples of 25 freshmen, 25 sophomores, 25 juniors, and 25 seniors.
• Why it's good: A stratified sample guarantees that
members from each group will be represented in
the sample, so this sampling method is good
when we want some members from every group.
Random samples
• Cluster random sample: The population is first split
into groups. The overall sample consists of every
member from some of the groups. The groups are
selected at random.
• Example—An airline company wants to survey its customers one day, so they randomly select 5 flights that day and survey every passenger on those flights.
• Why it's good: A cluster sample gets every member
from some of the groups, so it's good when each group
reflects the population as a whole.
Random samples
• Systematic random sample: Members of the population are put in some order. A starting point is selected at random, and every n-th member after it is selected to be in the sample.
• Example—A principal takes an alphabetized list of student names, picks a random starting point, and then selects every n-th name from that point on.
What is a 'Sampling Distribution'
• A sampling distribution acts as a frame of
reference for statistical decision making.
• It is a theoretical probability distribution of
the possible values of some sample statistic
that would occur if we were to draw all
possible samples of a fixed size from a given
population.
• The sampling distribution allows us to determine whether, given the variability among all possible sample means, the one we observed is a common outcome or a rare outcome.
'Sampling Distribution'
A statistic is a random variable with a probability
distribution - called the sampling distribution -
which is generated by repeated sampling.
The sampling distribution has a mean and a standard deviation. The standard deviation of the sampling distribution is called the standard error.
Sampling distributions are used to calculate the
probability that sample statistics could have
occurred by chance and thus to decide whether
something that is true of a sample statistic is also
likely to be true of a population parameter.
Central Limit Theorem
• The Central Limit Theorem (CLT) is a statistical theorem stating that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population.
Central Limit Theorem
• The Central Limit Theorem also tells us what the shape of the distribution of sample means will be when we draw repeated samples from a given population. Specifically, as the sample sizes get larger, the distribution of means calculated from repeated sampling will approach normality.
Central Limit Theorem
Three different components of the central limit
theorem
(1) Successive sampling from a population
(2) Increasing sample size
(3) Population distribution
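• A minimal simulation sketch in Python (assuming NumPy is available) that ties these three components together: we repeatedly sample from a skewed (exponential) population and watch the distribution of the sample means settle around the population mean as the sample size grows.

    import numpy as np

    rng = np.random.default_rng(0)
    population_mean = 3.0                      # exponential population: clearly non-normal

    for n in (2, 10, 50):                      # increasing sample size
        # successive sampling: draw 10,000 samples of size n and record each sample mean
        sample_means = [rng.exponential(population_mean, n).mean() for _ in range(10_000)]
        print(n, round(np.mean(sample_means), 3), round(np.std(sample_means), 3))

    # The average of the sample means stays near 3, their spread shrinks roughly like
    # sigma/sqrt(n), and a histogram of sample_means looks increasingly normal.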
Distribution of the sample proportion (p̄)
• The sample mean is derived from measured variables, whereas the sample proportion is derived from counts or frequency data.
• In many situations the use of the sample proportion is
easier and more reliable because, unlike the mean, the
proportion does not depend on the population
variance, which is usually an unknown quantity.
• We will represent the sample proportion by p̄ and the population proportion by p.
• In many sources, the sample proportion is represented using p and the population proportion is represented using π.
Properties of the sample proportion
• Construction of the sampling distribution of
the sample proportion is done in a manner
similar to that of the mean and the difference
between two means.
• When the sample size is large, the distribution
of the sample proportion is approximately
normally distributed because of the central
limit theorem.
The Sample Proportion and Standard
Deviation of the Number of Successes
• The sampling distribution of the sample proportion gives you information about the population proportion, p. For example, you might want to know the proportion of the population (p) who use Facebook. You can't survey everyone on the planet, so you take a sample, compute the sample proportion P̄, and use that as an estimator for p.
• When studying the sampling distribution of the sample
proportion, you’ll also see a lowercase p̄. The
lowercase version refers to a single value (i.e. a single
estimate).
Useful Formulas for Sampling
Distribution of the Sample Proportion
• Expected value of the sampling distribution of P̄: E(p̄ ) = p.
• Variance for the sampling distribution of P̄: p(1-p) / n.
• Standard Error(SE) of the Sample Proportion: √ (p(1-p) / n).
• Note: as the sample size increases, the standard error
decreases.
• You can use the normal distribution if the following two conditions hold:
A. np≥5
B. n(1-p)≥5.
• Z Score for sample proportion: z = (P̄ – p) / SE
For example,
• if you had a sample size (n) of 50 and a
proportion of 30%, then:
n * p = 50 * .3 = 15
50(1-.3) = 50(.7) = 35.
These are both larger than 5, so you can use the
normal distribution.
• You can transform P̄ into a z-score with the
following formula:
Z Score for sample proportion: z = (p̄ – p) / SE.
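• The formulas above translate directly into code. The helpers below are an illustrative sketch (the function names are ours, not from any particular library):

    import math

    def sample_proportion_se(p, n):
        """Standard error of the sample proportion: sqrt(p(1-p)/n)."""
        return math.sqrt(p * (1 - p) / n)

    def z_for_sample_proportion(p_bar, p, n):
        """z = (p_bar - p) / SE; valid when n*p >= 5 and n*(1-p) >= 5."""
        return (p_bar - p) / sample_proportion_se(p, n)

    # Normality check from the example above: n = 50, p = 0.30
    print(50 * 0.30, 50 * (1 - 0.30))   # 15.0 35.0 -> both at least 5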
Example Question:
• A certain company’s customer base is made up of 43% women and 57% men. An aggressive marketing campaign results in an increase of women customers to 46%, according to a sample survey of 50 customers. If the company hadn’t run the campaign, how likely is it that 46% of customers are women? Was the campaign worth it?
• Note that you’re looking for the probability that P̄
is greater than or equal to 46%.
Solution:
• Step 1: Check that your sample size is large enough:
n * p = 50 * .43 = 21.5
50(1-.43) = 28.5.
Both are above 5, so we can use the normal
distribution.
• Step 2: Find the standard error(SE):
√ (p(1-p) / n) = √ (0.43(1-0.43) / 50) = 0.07.
• Step 3: Find the z-score, using the SE you calculated in Step 2:
z = (P̄ – p) / SE = (0.46 – 0.43)/0.07 = 0.43
• Step 4: Look up 0.43 in the z-table to find P(Z ≥ 0.43). The probability is 0.3336, or 33.36%.
• At a probability of 33.36%, it’s fairly likely that the
proportion of women would have been 46% without a
campaign. It’s unlikely that the marketing campaign
made much of a difference.
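• The same steps can be verified numerically; here is a short sketch using Python's standard-library NormalDist in place of a z-table:

    import math
    from statistics import NormalDist

    p, p_bar, n = 0.43, 0.46, 50
    se = math.sqrt(p * (1 - p) / n)        # ~0.0700
    z = (p_bar - p) / se                   # ~0.43
    prob = 1 - NormalDist().cdf(z)         # P(Z >= z), ~0.33
    print(round(se, 4), round(z, 2), round(prob, 4))

(The small difference from 0.3336 comes from rounding z to two decimals before the table lookup.)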
Example
• In the mid-1970s, according to a report by the National Center for Health Statistics, 19.4 percent of the adult U.S. male population was obese. What is the probability that in a simple random sample of size 150 from this population fewer than 15 percent will be obese?
Solution
• (1) Write the given information:
n = 150
p = .194
• (2) We want P(p̄ < .15).
• (3) Compute the standard error and the z-score:
SE = √(p(1-p)/n) = √(0.194 × 0.806 / 150) ≈ 0.0323
z = (0.15 – 0.194)/0.0323 ≈ -1.36
• (4) Find the appropriate value in the z-table:
A value of z = -1.36 gives an area of .0869, which is the probability:
P(z < -1.36) = .0869
The probability that p̄ < .15 is .0869.
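• Again, the table lookup can be replaced by a couple of lines of Python (a sketch; NormalDist is in the standard library from Python 3.8 on):

    import math
    from statistics import NormalDist

    p, n = 0.194, 150
    se = math.sqrt(p * (1 - p) / n)          # ~0.0323
    z = (0.15 - p) / se                      # ~ -1.36
    print(round(NormalDist().cdf(z), 4))     # ~ .086; the rounded z-table value is .0869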
Statistical inference
• Statistical inference aims at learning characteristics of the population from a sample; the population characteristics are parameters and the sample characteristics are statistics.
• Estimation —Using observed data to make
informed “guesses” about unknown parameters
• Estimation represents ways or a process of
learning and determining the population
parameter based on the model fitted to the data.
• Point estimation and interval estimation, and
hypothesis testing are three main ways of
learning about the population parameter from
the sample statistic.
Statistical inference
• Point estimation = a single value that
estimates the parameter.
• Point estimates are single values calculated
from the sample
• EXAMPLE: Based on sample results, we
estimate that p, the proportion of all U.S.
adults who are in favor of stricter gun control,
is 0.6.
Statistical inference
• Confidence intervals give a range of values for the parameter. Interval estimates are intervals within which the parameter is expected to fall, with a certain degree of confidence. In interval estimation, we estimate an unknown parameter using an interval of values that is likely to contain the true value of that parameter (and state how confident we are that this interval indeed captures the true value of the parameter).
• EXAMPLE: Based on sample results, we are 95%
confident that p, the proportion of all U.S. adults
who are in favor of stricter gun control, is
between 0.57 and 0.63.
Statistical inference
• Hypothesis tests = tests for a specific value(s) of the
parameter.
• In hypothesis testing, we begin with a claim about the population (which we will call the null hypothesis), and we check whether or not the data obtained from the sample provide evidence AGAINST this claim.
EXAMPLE:
• It is claimed that among drivers 18-23 years of age (our
population) there is no relationship between drunk driving
and gender.
• A roadside survey collected data from a random sample of
5,000 drivers and recorded their gender and whether they
were drunk.
• The collected data showed roughly the same percent of
drunk drivers among males and among females. These
data, therefore, do not give us any reason to reject the
claim that there is no relationship between drunk driving
and gender.
Confidence Intervals

• Statisticians use a confidence interval to express the precision and uncertainty associated with a particular sampling method.
• A confidence interval consists of three parts:
• A confidence level.
• A statistic.
• A margin of error.
Confidence Intervals
Confidence Level
• The probability part of a confidence interval is called
a confidence level. The confidence level describes the
likelihood that a particular sampling method will produce
a confidence interval that includes the true population
parameter.
• A 95% confidence level means that 95% of the intervals
contain the true population parameter; a 90% confidence
level means that 90% of the intervals contain the
population parameter; and so on.
• The level of confidence can be any number
between 0 and 100%, but the most common values are
probably 90% (α=0.10), 95% (α=0.05), and 99% (α=0.01).
Confidence Intervals
• The confidence level describes the uncertainty of a sampling method, whereas the statistic and the margin of error define an interval estimate that describes the precision of the method.
• The interval estimate of a confidence interval is defined by the sample statistic ± margin of error.
For example
• Suppose we compute an interval estimate of a population parameter. We might describe this interval estimate as a 95% confidence interval. This means that if we used the same sampling method to select different samples and compute different interval estimates, the true population parameter would fall within a range defined by the sample statistic ± margin of error 95% of the time.
• Confidence intervals are preferred to point estimates,
because confidence intervals indicate (a) the precision
of the estimate and (b) the uncertainty of the estimate.
Confidence Intervals
Margin of Error
• In a confidence interval, the range of values above and
below the sample statistic is called the margin of error.
• For example, suppose the local newspaper conducts an
election survey and reports that the independent
candidate will receive 30% of the vote. The newspaper
states that the survey had a 5% margin of error and a
confidence level of 95%. These findings result in the
following confidence interval: We are 95% confident
that the independent candidate will receive between
25% and 35% of the vote.
Note:
• Many public opinion surveys report interval
estimates, but not confidence intervals. They
provide the margin of error, but not the
confidence level. To clearly interpret survey
results you need to know both! We are much
more likely to accept survey findings if the
confidence level is high (say, 95%) than if it is
low (say, 50%).
• For computing the 95% confidence interval for the population mean:
point estimate ± Z (SE), i.e. x̄ ± Z (σ/√n)
where σ is the population standard deviation and n is the sample size.
• For a proportion, the lower and upper confidence limits are p̄ ± Z √(p̄(1-p̄)/n).
Confidence Level    Z Score
90%                 1.645
95%                 1.96
99%                 2.576
• A sample of size 49 has sample mean 35 and
sample standard deviation 14. Construct
a 98% confidence interval for the population
mean using this information.
• For confidence level 98%, z = 2.326.
• 35±2.326(14/√49)=35±4.652≈35±4.7
• We are 98% confident that the population
mean μ lies in the interval [30.3,39.7]
• A random sample of 120 students from a large university yields mean GPA 2.71 with sample standard deviation 0.51. Construct a 90% confidence interval for the mean GPA of all students at the university.
• 2.71 ± 1.645(0.51/√120) = 2.71 ± 0.0766
• One may be 90% confident that the true average GPA of all students at the university is contained in the interval (2.71 − 0.08, 2.71 + 0.08) = (2.63, 2.79).
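• Both intervals can be reproduced with a small helper function (an illustrative sketch, not a library routine):

    import math

    def z_confidence_interval(mean, sd, n, z):
        """Return (lower, upper) limits of mean +/- z * sd/sqrt(n)."""
        half_width = z * sd / math.sqrt(n)
        return mean - half_width, mean + half_width

    # 98% CI: n = 49, mean = 35, sd = 14, z = 2.326
    print(z_confidence_interval(35, 14, 49, 2.326))       # ~ (30.3, 39.7)

    # 90% CI: n = 120, mean GPA = 2.71, sd = 0.51, z = 1.645
    print(z_confidence_interval(2.71, 0.51, 120, 1.645))  # ~ (2.63, 2.79)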
Hypotheses Testing
• A hypothesis test involves collecting data from
a sample and evaluating the data. Then, the
statistician makes a decision as to whether or
not the data supports the claim that is made
about the population.
Hypotheses Testing
• The purpose of the hypothesis test is to
decide between two explanations:
1. The difference between the sample and
the population can be explained by
sampling error
2. The difference between the sample and
the population is too large to be
explained by sampling error
Null Hypothesis and Alternative
Hypothesis
• Null Hypothesis H0: This is the assumed or maintained hypothesis. If we do not find sufficient evidence to the contrary, this null hypothesis is held to be true.
• Alternative Hypothesis H1: This hypothesis is contrary to the null hypothesis. After testing, if the null hypothesis is found to be false, then the alternative hypothesis is held to be true.
Example
Ho: No more than 30% of the registered voters
in Santa Clara County voted in the primary
election.
Ha: More than 30% of the registered voters in
Santa Clara County voted in the primary
election.
Errors in Hypothesis Tests
Two types of error are possible.
• Type I Error: The rejection of a true null
hypothesis.
• Type II Error: The failure to reject a false null
hypothesis.
• Significance Level: The probability of rejecting a
null hypothesis that is true. This probability is
sometimes expressed as a percentage, so a test
of significance level α is referred to as a 100α%-
level test.
Errors in Hypothesis Tests
• α = probability of a Type I error = P(Type I
error) = probability of rejecting the null
hypothesis when the null hypothesis is true.
• β = probability of a Type II error = P(Type II
error) = probability of not rejecting the null
hypothesis when the null hypothesis is false.
Example
• Suppose the null hypothesis, Ho, is: Frank’s rock climbing
equipment is safe.
• Type I error: Frank concludes that his rock climbing
equipment may not be safe when, in fact, it really is safe.
Type II error: Frank concludes that his rock climbing
equipment is safe when, in fact, it is not safe.
• α = probability that Frank thinks his rock climbing
equipment may not be safe when, in fact, it really is. β =
probability that Frank thinks his rock climbing equipment is
safe when, in fact, it is not.
• Notice that, in this case, the error with the greater
consequence is the Type II error. (If Frank thinks his rock
climbing equipment is safe, he will go ahead and use it.)
One-Tailed and Two-Tailed Tests
• A hypothesis may predict not only that the sample mean will be different from the population mean, but that it will be different in a specific direction (for example, lower). Such a test is called a directional or one-tailed test because the region of rejection is entirely within one tail of the distribution.
One-Tailed and Two-Tailed Tests
• Some hypotheses predict only that one value
will be different from another, without
additionally predicting which will be higher.
The test of such a hypothesis
is nondirectional or two‐tailed because an
extreme test statistic in either tail of the
distribution (positive or negative) will lead to
the rejection of the null hypothesis of no
difference.
One-Tailed and Two-Tailed Tests
• Suppose that you suspect that a particular class's
performance on a proficiency test is not
representative of those people who have taken
the test. The national mean score on the test is
74.
• The research hypothesis is:
• The mean score of the class on the test is not 74.
In notation: H a : μ ≠ 74
• The null hypothesis is:
• The mean score of the class on the test is 74.
In notation: H 0: μ = 74
SOME HYPOTHESIS TESTING EXAMPLES
• ILLUSTRATION - ONE TAILED (UPPER TAILED)
• An insurance company is reviewing its current policy rates. When originally setting the rates they believed that the average claim amount would be at most Rs 180000. They are concerned that the true mean is actually higher than this, because they could potentially lose a lot of money. They randomly select 40 claims, and calculate a sample mean of Rs 195000. Assuming that the standard deviation of claims is Rs 50000 and setting α = .05, test to see if the insurance company should be concerned or not.
SOLUTION
• Step 1: Set the null and alternative hypotheses
H0 : μ ≤ 180000
H1 : μ > 180000
• Step 2: Calculate the test statistic
z = (x̄ – μ)/(σ/√n) = (195000 – 180000)/(50000/√40) = 1.897
• Step 3: Set the rejection region
Reject H0 if z > 1.645 (upper-tail critical value for α = .05)
• Step 4: Conclude
We can see that 1.897 > 1.645, so the test statistic falls in the rejection region. Therefore we reject the null hypothesis. The insurance company should be concerned about its current policies.
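• The arithmetic in Steps 2-4 can be checked with a few lines of Python (a sketch of this particular z test, not a general routine):

    import math

    mu0, x_bar, sigma, n = 180000, 195000, 50000, 40
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    z_critical = 1.645                        # upper-tail critical value for alpha = .05
    print(round(z, 3), z > z_critical)        # 1.897 True -> reject H0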
ILLUSTRATION
ONE TAILED (LOWER TAILED)
• Trying to encourage people to stop driving to campus, the university claims that on average it takes at least 30 minutes to find a parking space on campus. I don’t think it takes that long to find a spot. In fact I have a sample of the last five times I drove to campus, and I calculated x̄ = 20. Assuming that the time it takes to find a parking spot is normal, and that σ = 6 minutes, perform a hypothesis test at level α = 0.10 to see if my claim is correct.
SOLUTION
• Step 1: Set the null and alternative hypotheses
H0 : μ ≥ 30
H1 : μ < 30
• Step 2: Calculate the test statistic
z = (x̄ – μ)/(σ/√n) = (20 – 30)/(6/√5) = -3.727
• Step 3: Set the rejection region
Reject H0 if z < -1.28 (lower-tail critical value for α = 0.10)
• Step 4: Conclude
We can see that -3.727 < -1.28 (the absolute value of the test statistic is larger than the critical value), so the test statistic is in the rejection region. Therefore we reject the null hypothesis in favor of the alternative and conclude that the mean time to find a parking space is significantly less than 30 minutes.
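• For the lower-tailed version, the same calculation can also report a p-value instead of comparing against a critical value (a sketch using only the standard library):

    import math
    from statistics import NormalDist

    mu0, x_bar, sigma, n = 30, 20, 6, 5
    z = (x_bar - mu0) / (sigma / math.sqrt(n))   # ~ -3.727
    p_value = NormalDist().cdf(z)                # lower-tail p-value, ~ 0.0001
    print(round(z, 3), p_value < 0.10)           # True -> reject H0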
Two-Tailed
• Rejection region for the two-tailed Z test (H1: μ ≠ μ0) with α = 0.05
• The decision rule is: Reject H0 if Z < -1.960 or if Z > 1.960.
TESTING HYPOTHESES ABOUT THE DIFFERENCE BETWEEN TWO POPULATION MEANS
• We assume that the populations are normally distributed.
• The null hypothesis is H0 : μ1 = μ2, i.e. H0 : μ1 - μ2 = 0
• Z = (x̄1 - x̄2) / √(σ1²/n1 + σ2²/n2)
• In case σ1² and σ2² are not known, then s1² and s2² can be used.
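• As a sketch of how this statistic is computed (the numbers below are hypothetical, chosen only to exercise the formula):

    import math

    def two_sample_z(x1_bar, x2_bar, var1, var2, n1, n2):
        """Z for H0: mu1 = mu2 with known variances (or s1^2, s2^2 for large samples)."""
        return (x1_bar - x2_bar) / math.sqrt(var1 / n1 + var2 / n2)

    # hypothetical inputs: means 52 and 50, variances 16 and 25, sample sizes 40 and 50
    print(round(two_sample_z(52.0, 50.0, 16.0, 25.0, 40, 50), 3))   # 2.108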
POPULATION NORMAL, POPULATION INFINITE, SAMPLE SIZE SMALL (30 OR LESS) AND VARIANCE OF POPULATION UNKNOWN: t statistic
• t = (x̄ – μ)/(σs/√n) with degrees of freedom = (n - 1)
• σs = √( ∑(X - x̄)² / (n - 1) )