
Data Collection & Sampling
Business Analytics 07
Tools of Business Statistics

• Descriptive statistics
• Collecting, presenting, and describing data

• Inferential statistics
• Drawing conclusions and/or making decisions concerning a
population based only on sample data
Recall…
• Statistics is a tool for converting data into information:
Data → Statistics → Information

But where then does data come from? How is it gathered?


How do we ensure it is accurate? Is the data reliable? Is it
representative of the population from which it was drawn?
This chapter explores some of these issues.
Methods of Collecting Data
• There are many methods used to collect or obtain data for statistical analysis

• Three of the most popular methods are:


• Direct observation (E.g. Number of customers entering a bank per hour)
• Experiments (E.g. new ways to produce things to minimize costs)
• Surveys
Essence of Sampling
• Sampling is the process of selecting a subset of observations from a population
to make inferences about population parameters such as the mean,
proportion, standard deviation, etc.
• It is an important step in inferential statistics since an incorrect sample may
lead to wrong inference about the population
Sampling is necessary when it is difficult or expensive to collect data on the entire
population. Inferences about the population are made from the sample that was
collected, so an incorrect sample may lead to incorrect inferences about the population.
Statistical Sampling
• Sampling is the foundation of statistical analysis.
• Sampling plan - a description of the approach that is used to obtain
samples from a population prior to any data collection activity.
• A sampling plan states:
• Its objectives
• Target population
• Population Frame (the list from which the sample is selected)
• Operational procedures for collecting data
• Statistical tools for data analysis
Illustrations: Populations and Samples

• Population - all items of interest for a particular decision or investigation


- all married cricket players over 25 years old
- all subscribers of Netflix
• Sample - a subset of the population
- a list of married cricket players over 25 years old who have scored a century abroad
- a list of individuals who rented a comedy from Netflix in the past year
Why Sample?
• Less time consuming than a census

• Less costly to administer than a census

• It is possible to obtain statistical results of a sufficiently high precision
based on samples.
Introduction to Sampling Methods
[Figure: a population of 26 items (a to z) and a sample drawn from it (b, c, g, i, n, o, r, u, y)]
Population Parameters
• Measures such as mean and standard deviation calculated using the entire
population are called population parameters

• The population parameters mean and standard deviation are usually denoted
using the symbols μ and σ, respectively
Inferential Statistics
[Figure: a sample (known) drawn from a population (unknown, but can be estimated from the sample)]
Sample Statistic
• When a measure such as the mean or standard deviation is calculated from a sample
in order to estimate a population parameter, it is called a sample statistic or statistic

• The sample statistics are denoted using the symbols X̄ (for the mean) and S or s
(for the standard deviation)
Inferential Statistics
• Estimation
• e.g., Estimate the population mean
weight using the sample mean weight

• Hypothesis Testing
• e.g., Use sample evidence to test the
claim that the population mean weight
is 120 pounds
Drawing conclusions and/or making decisions
concerning a population based on sample results.
Sampling

Definition:
• The process of identifying a subset from a population of elements (aka
observations or cases) is called the sampling process or simply sampling

Steps Needed:
• Identify the target population that is important for the problem under study
• Decide the sampling frame
• Determine the sample size
• Choose the sampling method
Sampling Methods
• Subjective Methods
• Judgment sampling – expert judgment is used to select the sample
• Convenience sampling – samples are selected based on the ease with which the data
can be collected
• Probabilistic Sampling
• Simple random sampling involves selecting items from a population so that every
subset of a given size has an equal chance of being selected
Random Sampling
• Shewhart (1931) defines a random sample as a ‘sample drawn under conditions such that the
law of large numbers applies’

• Random sampling is usually carried out without replacement, that is, an observation which
is selected in the sample is removed from the population and cannot be selected again

• Random samples can also be created with replacement, that is, an observation which is
selected for inclusion in the sample can again be considered since it is replaced (not
removed) in the population.
Simple Random Samples
• Every object in the population has an equal chance of being selected
• Objects are selected independently
• Samples can be obtained from a table of random numbers or computer random number
generators

• A simple random sample is the ideal against which other sample methods are compared
• Patients and length of stay (LoS) in days

Patient   1   2   3   4   5   6   7   8   9  10
LoS       4  20  12  13  15  17  16  20   9  17

• The corresponding samples (length of stay of the patients selected in each sample)

Random numbers (patient)    Corresponding sample (LoS value)
3 4 5 1 8                   12 13 15 4 20
1 7 9 1 3                   4 16 9 4 12
8 4 7 3 5                   20 13 16 12 15
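A minimal sketch of how such samples could be drawn with a computer random number generator, using the patient table above. The seed value and the variable names are illustrative assumptions, not part of the original slides.

```python
import random

# Length of stay (days) for patients 1..10, taken from the table above.
los = {1: 4, 2: 20, 3: 12, 4: 13, 5: 15, 6: 17, 7: 16, 8: 20, 9: 9, 10: 17}

rng = random.Random(42)  # fixed seed (arbitrary choice) so the run is repeatable
patients = sorted(los)

# Without replacement: each patient can appear at most once.
ids_without = rng.sample(patients, k=5)

# With replacement: the same patient may be drawn more than once.
ids_with = rng.choices(patients, k=5)

sample_without = [los[i] for i in ids_without]
sample_with = [los[i] for i in ids_with]

print("without replacement:", ids_without, "->", sample_without)
print("with replacement:   ", ids_with, "->", sample_with)
```

The second row of the table (patients 1, 7, 9, 1, 3) contains patient 1 twice, so it can only have come from sampling with replacement.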
Additional Probabilistic Sampling Methods
• Systematic (periodic) sampling – a sampling plan that selects every nth item from the
population.
• Stratified sampling – applies to populations that are divided into natural subsets (called strata)
and allocates the appropriate proportion of samples to each stratum.
• Cluster sampling - based on dividing a population into mutually exclusive subgroups (clusters),
sampling a set of clusters, and (usually) conducting a complete census within the clusters sampled
• Sampling from a continuous process
• Select a time at random; then select the next n items produced after that time.
• Select n times at random; then select the next item produced after each of these times.
Systematic Sampling
• If a sample size of n is desired from a population containing N elements, we
might sample one element for every N/n elements in the population
• We randomly select one of the first N/n elements from the population list
• We then select every (N/n)th element that follows in the population list
• This method has the properties of a simple random sample, especially if the list
of the population elements is a random ordering
• Example: Selecting every 100th listing in a telephone book after the first
randomly selected listing
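The procedure can be sketched as follows, assuming the sampling interval is taken as the integer k = N // n (population size divided by sample size). The function name and the 1,000-entry list are hypothetical, chosen only to mirror the telephone book example.

```python
import random

def systematic_sample(items, n, rng=None):
    """Pick one of the first k = N // n items at random, then every k-th item after it."""
    rng = rng or random.Random()
    N = len(items)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and the population size")
    k = N // n                   # sampling interval
    start = rng.randrange(k)     # random starting index within the first k items
    return [items[start + j * k] for j in range(n)]

listings = list(range(1, 1001))  # hypothetical directory of 1,000 listings
sample = systematic_sample(listings, n=10, rng=random.Random(7))
print(sample)                    # 10 listings, spaced 100 apart
```

Only the starting point is random; every later selection is fixed by it, which is why the ordering of the list matters.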
Stratified Sampling

• The population is first divided into groups of elements called strata.


• Each element in the population belongs to one and only one stratum.
• Best results are obtained when the elements within each stratum are as
much alike as possible (i.e. a homogeneous group).
• A simple random sample is taken from each stratum.
Examples of Stratified Sampling
• Amount of time spent by male and female users in sending messages in a day. Here
the strata are male and female users.
• Efficacy of a prescribed medicine among different age groups. Age can be
classified into categories such as 40 or under, 41 to 60, and over 60 years
• Performance of children in school and the parents’ marital status. Here, marital
status can be (a) Single, (b) Married, (c) Divorced. In this case we assume that the
parents’ marital status may influence children’s academic performance.
• Television rating points for a program across different geographical regions of a
country. For India, geographical regions could be different states of the country.
Steps for creating Stratified Samples
• Identify the factor that can be used for creating strata (for example: factor = Age; Stratum
1: age 40 or under; Stratum 2: age 41 to 60; and Stratum 3: age over 60)
• Calculate the proportion of each stratum in the population (say p1, p2, and p3 for the three
strata identified in step 1).
• Decide the total sample size (say n). The sample sizes for strata 1, 2, and 3 identified in step
1 are p1 × n, p2 × n, and p3 × n, respectively.
• Use a random sampling procedure to generate a random sample in each stratum.
• Combine samples from each stratum to create the final sample.
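The steps above can be sketched as proportional stratified sampling. The population of 100 people, the `age_group` helper, and the age cut-offs are made up for illustration; rounding each stratum size to a whole number is also an assumption, and it means the stratum sizes may not always add up exactly to the requested total.

```python
import random
from collections import Counter, defaultdict

def age_group(age):
    """Assign an age to one of three strata (cut-offs are illustrative)."""
    if age <= 40:
        return "40 or under"
    if age <= 60:
        return "41 to 60"
    return "over 60"

def stratified_sample(population, stratum_of, n, rng=None):
    """Draw a simple random sample from each stratum, sized in proportion to the stratum."""
    rng = rng or random.Random()
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for units in strata.values():
        p = len(units) / len(population)   # proportion of this stratum in the population
        n_stratum = round(p * n)           # rounded stratum sample size
        sample.extend(rng.sample(units, n_stratum))
    return sample

# Hypothetical population: (person id, age) pairs with strata of size 50, 30 and 20.
ages = [25] * 50 + [50] * 30 + [70] * 20
population = list(enumerate(ages))

sample = stratified_sample(population, lambda unit: age_group(unit[1]), n=10,
                           rng=random.Random(3))
print(Counter(age_group(age) for _, age in sample))
```

Every stratum contributes to the final sample, in the same 50/30/20 proportion as the population.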
Cluster Sampling
Cluster Sampling methods
• The population is first divided into separate groups of elements
called clusters
• Ideally, each cluster is a representative small-scale version of the
population (i.e. heterogeneous group)
• A simple random sample of the clusters is then taken.
• All elements within each sampled (chosen) cluster form the
sample.
Example for Cluster Sampling
• Identify the clusters (example: different models of smart phones sold by a
manufacturer, customers from different geographical locations).
• Using random sampling select the clusters.
• Select all units in the clusters selected in step 2 and form the sample. If the size is
too large, a random sampling within the clusters identified in step 2 may be used
for final sample.
• Stratified sampling and cluster sampling are similar; the major difference is that in a
stratified sample, all strata will be represented in the sample, whereas in a cluster sampling,
not all clusters will be represented
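A minimal sketch of one-stage cluster sampling, assuming the clusters are already identified. The region names and customer IDs are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, rng=None):
    """Randomly choose whole clusters, then keep every unit in the chosen clusters."""
    rng = rng or random.Random()
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [unit for name in chosen for unit in clusters[name]]
    return chosen, sample

# Hypothetical customers grouped by geographical region.
clusters = {
    "North": ["n1", "n2", "n3"],
    "South": ["s1", "s2"],
    "East": ["e1", "e2", "e3", "e4"],
    "West": ["w1", "w2"],
}

chosen, sample = cluster_sample(clusters, n_clusters=2, rng=random.Random(1))
print(chosen, sample)
```

Unlike the stratified case, the regions that were not chosen contribute no units at all to the sample.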
Non-Probability Sampling

• In non-probability sampling, the selection of sample units from the
population does not follow any probability distribution
• Sample units are selected based on
convenience and/or on voluntary
basis.
Convenience Sampling
• Convenience sampling is a non-probability sampling technique in which the sample
units are selected for the ease with which data can be collected from them, rather than
according to a probability distribution
Voluntary Sampling

• In voluntary sampling, the data is collected from people who volunteer to provide it.
• Voluntary samples can be biased, because people who volunteer may differ
systematically from those who do not
Sampling Distributions

• A sampling distribution is the distribution of all possible values of a
statistic, such as the sample mean or sample standard deviation, computed
from samples of a given size selected from a population
Definitions
• An element is the entity on which data are collected.
• A population is a collection of all the elements of interest
• A sample is a subset of the population.
• The sampled population is the population from which the sample
is drawn.
• A frame is a list of the elements that the sample will be selected
from.
Reasons for Sampling
• The reason we select a sample is to collect data to answer a research
question about a population
• The sample results provide only estimates of the values of the
population characteristics
• The reason is simply that the sample contains only a portion of the
population
• With proper sampling methods, the sample results can provide
“good” estimates of the population characteristics.
Sampling Distributions of Sample Means

Types of sampling distributions:
• Sampling distribution of the sample mean
• Sampling distribution of the sample proportion
• Sampling distribution of the sample variance
Developing a Sampling Distribution
• Assume there is a population of four individuals: A, B, C, D
• Population size N = 4
• Random variable, X, is the age of individuals
• Values of X: 18, 20, 22, 24 (years)
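Developing a Sampling Distribution – summary measures of the population distribution

$$\mu = \frac{\sum X_i}{N} = \frac{18 + 20 + 22 + 24}{4} = 21$$

$$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} = 2.236$$

[Figure: uniform population distribution, P(x) = .25 for each of x = 18, 20, 22, 24 (individuals A, B, C, D)]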
Developing Sampling Distributions

16 possible samples (sampling with replacement)

1st Obs   2nd Observation
          18      20      22      24
18        18,18   18,20   18,22   18,24
20        20,18   20,20   20,22   20,24
22        22,18   22,20   22,22   22,24
24        24,18   24,20   24,22   24,24

16 sample means

1st Obs   2nd Observation
          18   20   22   24
18        18   19   20   21
20        19   20   21   22
22        20   21   22   23
24        21   22   23   24
Expected Value of Sample Mean

• Let X1, X2, . . . Xn represent a random sample from a population

• The sample mean value of these observations is defined as

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
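Summary Measures of the Sampling Distribution

$$E(\bar{X}) = \frac{\sum \bar{X}_i}{N} = \frac{18 + 19 + 20 + 21 + \cdots + 24}{16} = 21 = \mu$$

$$\sigma_{\bar{X}} = \sqrt{\frac{\sum (\bar{X}_i - \mu)^2}{N}} = \sqrt{\frac{(18 - 21)^2 + (19 - 21)^2 + \cdots + (24 - 21)^2}{16}} = 1.58$$

In these two formulas N = 16 is the number of possible samples, not the population size.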
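The four-person age example is small enough to enumerate in full. The sketch below lists all 16 ordered samples of size 2 (with replacement) and recomputes the summary measures; the variable names are illustrative.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

ages = [18, 20, 22, 24]   # population values from the example
n = 2                     # sample size

# All ordered samples of size n drawn with replacement: 4 ** 2 = 16 of them.
samples = list(product(ages, repeat=n))
sample_means = [mean(s) for s in samples]

mu = mean(ages)                      # population mean
sigma = pstdev(ages)                 # population standard deviation (divides by N)
mean_of_means = mean(sample_means)   # mean of the sampling distribution
se = pstdev(sample_means)            # standard deviation of the sampling distribution

print(len(samples), mu, round(sigma, 3), mean_of_means, round(se, 2))
print(round(sigma / sqrt(n), 2))     # sigma / sqrt(n), for comparison with se
```

The mean of the sample means equals the population mean, and their standard deviation matches σ/√n, which holds here because the samples are drawn with replacement.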
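Comparing the Population with its Sampling Distribution

Population (N = 4): μ = 21, σ = 2.236
Sample means distribution (n = 2): μ_X̄ = 21, σ_X̄ = 1.58

[Figure: two bar charts. Left: population distribution P(X) over X = 18, 20, 22, 24 (A, B, C, D). Right: distribution of the sample means P(X̄) over X̄ = 18, 19, 20, 21, 22, 23, 24.]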
Standard Error of the Mean
• Different samples of the same size from the same population will yield different sample
means
• A measure of the variability in the mean from sample to sample is given by the Standard
Error of the Mean:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
• Note that the standard error of the mean decreases as the sample size increases
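As a quick check, the formula can be applied to the four-person age example, where σ = 2.236. The values for n = 4 and n = 16 are added here only to illustrate the trend, and the calculation assumes sampling with replacement.

```latex
\begin{aligned}
n = 2:  \quad & \sigma_{\bar{X}} = \frac{2.236}{\sqrt{2}}  \approx 1.58 \\
n = 4:  \quad & \sigma_{\bar{X}} = \frac{2.236}{\sqrt{4}}  \approx 1.12 \\
n = 16: \quad & \sigma_{\bar{X}} = \frac{2.236}{\sqrt{16}} \approx 0.56
\end{aligned}
```

Quadrupling the sample size halves the standard error.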
Central Limit Theorem
• Let S1, S2, …, Sk be samples of size n whose observations are independent and
identically distributed, drawn from a population with mean μ and standard deviation σ.

• Let X̄1, X̄2, ..., X̄k be the sample means (of the samples S1, S2, …, Sk).

• For sufficiently large n, the sampling distribution of the mean will approximately follow
a normal distribution with mean μ (same as the mean of the population) and standard
deviation σ/√n
Central Limit Theorem
• As the sample size gets large enough (n↑), the sampling distribution becomes almost normal
regardless of the shape of the population
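A small simulation can make this concrete. The sketch below assumes an exponential population with mean 1 and standard deviation 1 (a clearly non-normal, skewed shape picked for illustration) and compares the spread of simulated sample means with σ/√n. The seed, the number of repetitions k, and the function name are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed (arbitrary) for repeatability

def sample_means(n, k=2000):
    """Means of k samples of size n from an exponential population (mean 1, sd 1)."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(k)]

for n in (2, 10, 30, 100):
    means = sample_means(n)
    print(f"n={n:>3}  mean of means={mean(means):.3f}  "
          f"sd of means={pstdev(means):.3f}  sigma/sqrt(n)={1 / sqrt(n):.3f}")
```

In each row the mean of the sample means should sit close to 1 and their standard deviation close to σ/√n; plotting a histogram of `means` for increasing n would show the shape moving toward a bell curve.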
Why Study the Central Limit Theorem?
• The central limit theorem is the basis for hypothesis tests such as the Z test and t test. In many
cases, we will have access to only a sample, and the inference about the population has to be
made based on the sample statistic.

• An important assumption of the CLT is that the random variables have to be independent
and identically distributed.
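If the Population is not Normal (continued)

Sampling distribution properties:
• Central tendency: μ_x̄ = μ
• Variation: σ_x̄ = σ/√n
• The sampling distribution becomes normal as n increases

[Figure: a non-normal population distribution with mean μ, shown above sampling distributions of x̄ centered at μ_x̄ for a smaller and a larger sample size; the curve for the larger sample size is narrower.]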
An alternative version of the CLT can be stated as follows:

Let X1, X2, …, Xn be n random variables that are independent and identically
distributed with mean μ and standard deviation σ. Then for large n, the mean

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

approximately follows a normal distribution with mean μ and standard error σ/√n
How Large is Large Enough?
• If the population is normal, then the sample mean X̄ is normally distributed for all values of n.

• For normal population distributions, the sampling distribution of the mean is always normally
distributed

• If the population is non-normal, then X̄ is approximately normal only for larger values of n.

• In most practical situations, a sample size n > 30 is large enough to allow us to use the
normal distribution as an approximation for the sampling distribution of X̄.
Estimation of Population Parameters
• Estimation is a process used for making inferences about population parameters
based on samples

• Point Estimate: A point estimate of a population parameter is the single
value (or specific value) calculated from the sample (thus called a statistic).

• Interval Estimate: Instead of a specific value of the parameter, in an
interval estimate the parameter is said to lie in an interval (say between
points a and b) with a certain probability (or confidence).
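To make the distinction concrete, the sketch below treats the ten length-of-stay values from the patient table as a sample and computes a point estimate and an interval estimate for the population mean. The interval rule used here, x̄ ± 1.96 × s/√n (a normal approximation at roughly 95% confidence), is an assumption for illustration and is not stated in these slides; for a sample this small a t-based interval would usually be preferred.

```python
from math import sqrt
from statistics import mean, stdev

# LoS values (days) from the patient table, treated here as a sample.
sample = [4, 20, 12, 13, 15, 17, 16, 20, 9, 17]

n = len(sample)
x_bar = mean(sample)     # point estimate of the population mean
s = stdev(sample)        # sample standard deviation (divides by n - 1)
se = s / sqrt(n)         # estimated standard error of the mean

z = 1.96                 # assumed multiplier: normal approximation, about 95% confidence
a, b = x_bar - z * se, x_bar + z * se   # interval estimate (a, b)

print(f"point estimate: {x_bar:.1f} days")
print(f"interval estimate: ({a:.1f}, {b:.1f}) days")
```

The point estimate is a single number, while the interval estimate reports a range whose width depends on the standard error and on the chosen confidence level.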
Introduction to
Confidence Interval
Business Analytics 08
