Inferential Statistics

Statistics and Research Methods
Topics: Statistics
• Descriptive Statistics
• Probability Theory and Probability
Distributions
• Hypothesis Testing
• Confidence Interval
• Analysis Of Variance (ANOVA)
• Regression and Correlation
• Chi-Squared
Topics: Research Methods
• Research Design
• Literature Review
• Sampling
• Data Collection Methods
• Ethical Issues in Research Resource
• IT role in research & Formatting
Assessment
• One test – 15%

• One Individual Assignment – 10%
• One Group Assignment – 15%
• Final Exam – 60%
• Note: Pass Mark is B (50%)

References
• Probability and Statistics for Engineering and the Sciences, by
Jay L. Devore, Monterey, California.
• Basic Business Statistics, by M.L. Berenson, D.M. Levine and
T.C. Krehbiel
• Research Methodology, Methods and Techniques, by C.R.
Kothari
• Research Methods for Business students, by Mark Saunders,
Philip Lewis and Adrian Thornhill
• Plenty of Websites
Introduction and Descriptive Statistics
The Science of Statistics
• Statistics is the science of data. This involves
collecting, classifying, summarizing, organizing,
analyzing and interpreting numerical information.
[Diagram: Statistics branches into Descriptive Statistics and Inferential Statistics]
Types of Statistical Applications
• Descriptive statistics utilizes numerical and

graphical methods to look for patterns in a data
set, to summarize the information revealed in a
data set and to present that information in a
convenient form.
• Inferential statistics utilizes sample data to make
estimates, decisions, predictions or other
generalizations about a larger set of data.
Descriptive Statistics
Descriptive statistics utilizes numerical and graphical
methods to look for patterns in a data set, to
summarize the information revealed in a data set and
to present that information in a convenient form.
• Collect data
– e.g. Survey
• Present data
– e.g. Tables and graphs
• Characterize data
– e.g. Sample mean = $\bar{X} = \frac{\sum X_i}{n}$
Inferential Statistics
Inferential statistics utilizes sample data to make estimates,
decisions, predictions or other generalizations about a larger
set of data.
• Estimation
– e.g.: Estimate the population mean weight
using the sample mean weight
• Hypothesis testing
– e.g.: Test the claim that the population
mean weight is 120 pounds
Drawing conclusions and/or making decisions
concerning a population based on sample results.
Fundamental Elements of Statistics
• An experimental unit is an object about which
we collect data.
– Person
– Place
– Thing
– Event
• A population is a set of units in which we are
interested.
– Typically, there are too many experimental units in
a population to consider every one.
• If we can examine every single one, we conduct a
census.
• A sample is a subset of the population.
• A variable is a characteristic or property of
an individual unit.
– The values of these characteristics will, not
surprisingly, vary.
• A measure of reliability is a statement about
the degree of uncertainty associated with a
statistical inference. (Based on our analysis, we
think 56% of soda drinkers prefer Pepsi to Coke,
± 5%.)
Descriptive Statistics
• The population or sample of interest
• One or more variables to be investigated
• Tables, graphs or numerical summary tools
• Identification of patterns in the data
Inferential Statistics
• Population of interest
• One or more variables to be investigated
• The sample of population units
• The inference about the population based on the sample data
• A measure of reliability of the inference
Types of Data
• Quantitative Data are measurements that are
recorded on a naturally occurring numerical
scale. e.g. Age, GPA, Salary, Cost of books this
semester
• Categorical (Qualitative) Data are measurements
that cannot be recorded on a natural numerical
scale, but are recorded in categories e.g. Live
on/off campus, Major, Gender
Methods for Describing Sets of Data
Data Presentation
[Diagram: Numerical Data can be presented as an Ordered Array, a Stem-&-Leaf Display, or Frequency Distributions (Histogram, Polygon, Ogive)]
Ordered Array
• 1. Organizes Data to Focus on Major Features
• 2. Data Placed in Rank Order
– Smallest to Largest
• 3. Data in Raw Form (as Collected)
– 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
• 4. Data in Ordered Array
– 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Stem-and-Leaf Display
• A Stem-and-Leaf Display shows the number of
observations that share a common value (the stem)
and the precise value of each observation (the leaf)
Data: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Stem | Leaves
2 | 144677
3 | 028
4 | 1
For example, the observation 26 has stem 2 and leaf 6.
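A minimal Python sketch of how such a display could be built for two-digit data, using the tens digit as the stem and the units digit as the leaf. The function name `stem_and_leaf` is illustrative only and does not come from the slides.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group integer values by tens digit (stem) and collect units digits (leaves)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 26 -> stem 2, leaf 6
        groups[stem].append(leaf)
    # Join the leaves of each stem into one string, stems in ascending order
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(groups.items())}


data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(f"{stem} | {leaves}")
```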
Frequency Distribution Table
Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
Class Frequency
15 but < 25 3
25 but < 35 5
35 but < 45 2
Frequency Distribution Table Steps
• 1. Determine Range
• 2. Select Number of Classes
– Usually Between 5 & 15 Inclusive
• 3. Compute Class Intervals (Width)
• 4. Determine Class Boundaries (Limits)
• 5. Compute Class Midpoints
• 6. Count Observations & Assign to Classes
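The steps above can be sketched in Python as follows. This is an illustration under stated assumptions: the first lower boundary (15), class width (10) and number of classes (3) are taken from the small example in the neighbouring slides, which uses fewer classes than the usual 5 to 15 because it has only ten observations. The helper name `frequency_table` is hypothetical.

```python
def frequency_table(data, lower, width, n_classes):
    """Build classes of the form 'lo but < hi' and count observations in each."""
    rows = []
    for k in range(n_classes):
        lo = lower + k * width          # lower class boundary
        hi = lo + width                 # upper class boundary
        count = sum(1 for x in data if lo <= x < hi)
        rows.append({"class": (lo, hi),
                     "midpoint": (lo + hi) / 2,
                     "frequency": count})
    return rows


raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]
data_range = max(raw) - min(raw)        # step 1: determine the range
table = frequency_table(raw, lower=15, width=10, n_classes=3)
for row in table:
    lo, hi = row["class"]
    print(f"{lo} but < {hi}: midpoint {row['midpoint']}, frequency {row['frequency']}")
```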
Frequency Distribution Table
Example
Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
Class          Midpoint   Frequency
15 but < 25    20         3
25 but < 35    30         5
35 but < 45    40         2
Midpoint = (Upper + Lower Boundaries) / 2
Width = Upper Boundary – Lower Boundary (here 25 – 15 = 10)
Relative Frequency &
% Distribution Tables
Class          Relative Frequency (Prop.)   Percentage (%)
15 but < 25    .3                           30.0
25 but < 35    .5                           50.0
35 but < 45    .2                           20.0
class relative frequency = $\frac{\text{class frequency}}{n}$
class percentage = (class relative frequency) × 100
Cumulative Percentage
Distribution Table
Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
Class          Cumulative Percentage (Percentage Less than Lower Class Boundary)
15 but < 25    0.0
25 but < 35    30.0
35 but < 45    80.0 (30% + 50%)
45 but < 55    100.0 (80% + 20%)
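As a rough illustration, the relative frequency, percentage and cumulative percentage columns can be derived from the class frequencies like this. The variable names are illustrative, and the class labels are simply the ones used in the slides.

```python
frequencies = {"15 but < 25": 3, "25 but < 35": 5, "35 but < 45": 2}
n = sum(frequencies.values())

# class relative frequency = class frequency / n
relative = {c: f / n for c, f in frequencies.items()}
# class percentage = relative frequency * 100 (computed as 100 * f / n)
percent = {c: 100 * f / n for c, f in frequencies.items()}

# Cumulative percentage: share of observations below each lower class boundary.
# The final entry belongs to the extra class "45 but < 55" shown in the slides.
cumulative = []
running = 0.0
for c in frequencies:
    cumulative.append(running)
    running += percent[c]
cumulative.append(running)

print(relative)
print(percent)
print(cumulative)
```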
Histogram
• Histograms are graphs of the frequency or relative
frequency of a variable.
– Class intervals make up the horizontal axis
– The frequencies or relative frequencies are displayed on the
vertical axis.
Class          Freq.
15 but < 25    3
25 but < 35    5
35 but < 45    2
[Figure: histogram of the table above. Vertical axis: Count (frequency, relative frequency or percent), 0 to 5. Horizontal axis: lower class boundaries 15, 25, 35, 45, 55. The bars touch.]
Polygon
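Class          Freq.
15 but < 25    3
25 but < 35    5
35 but < 45    2
[Figure: frequency polygon of the table above. Vertical axis: Count (frequency, relative frequency or percent), 0 to 5. Horizontal axis: class midpoints 10, 20, 30, 40, 50, 60, with a fictitious class marked on the axis.]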
Cumulative % Polygon (Ogive)
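Class          Cum. %
15 but < 25    0%
25 but < 35    30%
35 but < 45    80%
45 but < 55    100%
[Figure: cumulative % polygon (ogive) of the table above. Vertical axis: Cumulative %, 0% to 100%. Horizontal axis: lower class boundaries 0, 15, 25, 35, 45, 55, with a fictitious class marked on the axis.]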
Categorical Data Presentation
[Diagram: Categorical Data are summarized in a Summary Table and displayed as a Pie Chart, Bar Chart or Pareto Diagram]
Summary Table
• 1. Lists Categories & No. Elements in Category

• 2. Obtained by Tallying Responses in Category
• 3. May Show Frequencies (Counts), % or Both
Major          Count
Accounting     130
Economics      20
Management     50
Total          200
(Each row is a category; counts are obtained by tallying, e.g. |||| ||||.)
Bar Chart
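[Figure: horizontal bar chart of Major (Acct., Econ., Mgmt.). Horizontal axis starts at a zero point: 0, 50, 100, 150 (frequency; percent is also used).]
• Horizontal bars for categorical variables
• Bar length shows frequency or %
• Equal bar widths
• Space between bars is 1/2 to 1 bar width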
Pie Chart
• 1. Shows Breakdown of Total Quantity into Categories
• 2. Useful for Showing Relative Differences
• 3. Angle Size
– (360°)(Percent)
– e.g. Econ.: (360°) (10%) = 36°
[Figure: pie chart of Majors: Acct. 65%, Mgmt. 25%, Econ. 10% (36°)]
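A short Python sketch of the angle calculation, applied to the Major counts from the summary table slide (130, 20 and 50 out of 200). The variable names are illustrative.

```python
counts = {"Accounting": 130, "Economics": 20, "Management": 50}
total = sum(counts.values())

# Angle of each slice = 360 degrees * (category share of the total)
angles = {major: 360 * count / total for major, count in counts.items()}

for major, angle in angles.items():
    share = 100 * counts[major] / total
    print(f"{major}: {share:.0f}% -> {angle:.0f} degrees")
```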
Pareto Diagram
[Figure: Pareto diagram of Major (Acct., Mgmt., Econ.). Vertical axis always in %: 0%, 33%, 67%, 100%.]
• Vertical bar chart with bars in descending order
• Equal bar widths
• Cumulative percent polygon (ogive) plotted at the bar midpoints
Numerical Descriptive
Measures
Summary Measures
• Central Tendency: Mean, Median, Mode, Geometric Mean
• Variation: Range, Quartile, Variance, Standard Deviation, Coefficient of Variation
Measures of Central Tendency
• Various ways to describe the central,

most common or middle value in a
distribution or set of data
– The Mean (Arithmetic Mean)
– The Median
– The Mode
– The Geometric Mean
Numerical Measures of
Central Tendency
• Summarizing data sets numerically

– Are there certain values that seem more
typical for the data?
– How typical are they?
Central Tendency
• Central tendency is the value or values around
which the data tend to cluster
• Variability shows how strongly the data
cluster around that (those) value(s)
Central Tendency
• The mean of a set of quantitative data is the sum
of the observed values divided by the number of
values
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Central Tendency
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
• The mean of a sample is typically denoted by x-bar,
but the population mean is denoted by the Greek
symbol μ.
Central Tendency
• If x1 = 1, x2 = 2, x3 = 3 and x4 = 4,
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ = (1 + 2 + 3 + 4)/4 = 10/4 = 2.5
Central Tendency
• The median of a set of quantitative data is the

value which is located in the middle of the data,
arranged from lowest to highest values (or vice
versa), with 50% of the observations above and
50% below.
Central Tendency
[Figure: ordered data from Lowest Value to Highest Value with the Median in the middle; 50% of the observations lie on each side]
Central Tendency
• Finding the Median, M:

– Arrange the n measurements from smallest to
largest
• If n is odd, M is the middle number
• If n is even, M is the average of the middle two
numbers
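The procedure above can be sketched in Python as follows. The function name `median` is illustrative; the example data are the ordered array used earlier in these slides.

```python
def median(values):
    """Return the middle value (odd n) or the average of the middle two (even n)."""
    ordered = sorted(values)            # arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # n is odd: the middle number
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # n is even: average of middle two


print(median([24, 26, 24, 21, 27, 27, 30, 41, 32, 38]))  # n = 10 (even)
print(median([3, 1, 2]))                                  # n = 3 (odd)
```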
Central Tendency
• The mode is the most frequently observed
value.
• The modal class is the class with the highest
relative frequency; for grouped data, the mode is
often taken to be the midpoint of that class.
Geometric Mean
• Equals the nth root of the product of all
observations or values
• For a set of values: x1, x2, x3, ........., xn
• Geometric mean = $\sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$
Example
Jim has 20 problems to do for homework. Some are harder
than others and take more time to solve. We take a
random sample of 9 problems.
Find the mean (arithmetic and geometric), median and mode
for the number of minutes Jim spends on his homework.
Problem #    Time Spent (Minutes)
1            12
2            4
3            3
4            8
5            7
6            5
7            4
8            9
9            11
Solution: Mean
Sample size (n) = 9
Problems 1 through 9 = x1, x2, x3 … x9, respectively.
Problem #:             1   2  3  4  5  6  7  8  9
Time Spent (Minutes):  12  4  3  8  7  5  4  9  11
Σx = (12 + 4 + 3 + 8 + 7 + 5 + 4 + 9 + 11) = 63 minutes
Σx/n = 63/9 = 7 minutes
Solution: Geometric Mean
Sample size (n) = 9
Problems 1 through 9 = x1, x2, x3 … x9, respectively.
Problem #:             1   2  3  4  5  6  7  8  9
Time Spent (Minutes):  12  4  3  8  7  5  4  9  11
GM = $\sqrt[9]{12 \cdot 4 \cdot 3 \cdot 8 \cdot 7 \cdot 5 \cdot 4 \cdot 9 \cdot 11}$ ≈ 6.31 minutes
Solution: Median
Place the data in ascending order: 3, 4, 4, 5, 7, 8, 9, 11, 12
(n+1)/2 = (9+1)/2 = 5
The 5th ordered observation is 7 and so is the Median.
Solution: Mode
The data are already arranged in order from smallest to largest: 3, 4, 4, 5, 7, 8, 9, 11, 12
Only the value 4 occurs more than once.
The Mode is 4.
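The four answers for the homework example can be checked with Python's standard `statistics` module. This is only a verification sketch; `statistics.geometric_mean` requires Python 3.8 or later.

```python
import statistics

# Minutes Jim spent on each of the 9 sampled problems
times = [12, 4, 3, 8, 7, 5, 4, 9, 11]

mean = statistics.mean(times)             # arithmetic mean: sum / n
gm = statistics.geometric_mean(times)     # nth root of the product
med = statistics.median(times)            # middle ordered value
mode = statistics.mode(times)             # most frequent value

print(f"mean={mean}, geometric mean={gm:.2f}, median={med}, mode={mode}")
```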
Approximating the Mean from a
Frequency Distribution
• Used when the only source of data is a
frequency distribution
$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n}$
n = sample size
c = number of classes in the frequency distribution
$m_j$ = midpoint of the jth class
$f_j$ = frequency of the jth class
Example
$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n}$
n = sample size
c = number of classes in the frequency distribution
$m_j$ = midpoint of the jth class
$f_j$ = frequency of the jth class
Class          MP    Freq.
10 but < 20    15    3
20 but < 30    25    6
30 but < 40    35    5
40 but < 50    45    4
50 but < 60    55    2
Total                20
X = ((15*3) + (25*6) + (35*5) + (45*4) + (55*2))/20
= (45 + 150 + 175 + 180 + 110)/20
= 660/20
= 33
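The same approximation can be sketched in a few lines of Python, using the midpoints and frequencies from the example table. The variable names are illustrative.

```python
# (midpoint m_j, frequency f_j) for each class in the example table
classes = [(15, 3), (25, 6), (35, 5), (45, 4), (55, 2)]

n = sum(f for _, f in classes)                       # sample size
grouped_mean = sum(m * f for m, f in classes) / n    # sum of m_j * f_j, divided by n

print(n, grouped_mean)
```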
Central Tendency
• Perfectly symmetric data set:
– Mean = Median = Mode
• Extremely high value in the data set:
– Mean > Median > Mode
(Rightward skewness)
• Extremely low value in the data set:
– Mean < Median < Mode
(Leftward skewness)
Central Tendency
• A data set is skewed if one tail of the
distribution has more extreme observations
than the other tail.
Variability
• The mean, median and mode give us an idea
of the central tendency, or where the
“middle” of the data is
• Variability gives us an idea of how spread out
the data are around that middle
Measures of Variation
[Diagram: measures of Variation: Range, Interquartile Range, Variance (Population Variance, Sample Variance), Standard Deviation (Population Standard Deviation, Sample Standard Deviation), Coefficient of Variation]
Variability
• The range is equal to the largest measurement
minus the smallest measurement
– Easy to compute, but not very informative
– Considers only two observations (the smallest and
largest)
Quartiles
• Quartiles Split Ordered Data into 4 equal
portions
[Figure: ordered data divided into four 25% portions by Q1, Q2 and Q3]
• Q1 and Q3 are Measures of Non-central Location
– Q2 = the Median
Quartiles
• Each Quartile has position and value
– With the data in an ordered array, the position of Qi
is: $\text{Position of } Q_i = \frac{i(n+1)}{4}$
– The value of Qi is the value associated with that
position in the ordered array
• Example:
Data in Ordered Array: 11 12 13 16 16 17 18 21 22
Position of $Q_1 = \frac{1(9+1)}{4} = 2.5$, so $Q_1 = \frac{12 + 13}{2} = 12.5$
Quartiles Example
Find the 1st and 3rd Quartiles in the ordered observations: 3, 4, 4, 5, 7, 8, 9, 11, 12
Position of Q1 = 1(9+1)/4 = 2.5
The 2.5th observation = (4+4)/2 = 4
Position of Q3 = 3(9+1)/4 = 3 × (position of Q1) = 7.5
The 7.5th observation = (9+11)/2 = 10
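A minimal Python sketch of the position rule i(n+1)/4. One assumption goes beyond the slides: positions that fall between two observations are resolved by linear interpolation, which reproduces the "average of the two neighbours" used for the .5 positions here but may differ from other textbook rounding rules for .25 or .75 positions. The function name `quartile` is illustrative, and the sketch assumes the computed position lies inside the ordered array.

```python
def quartile(values, i):
    """Value of the i-th quartile (i = 1, 2, 3) using position i*(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    pos = i * (n + 1) / 4          # 1-based position in the ordered array
    lower = int(pos)               # observation just at or below the position
    frac = pos - lower             # how far towards the next observation
    if frac == 0:
        return float(ordered[lower - 1])
    # Assumption: interpolate linearly between the two neighbouring observations
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


data = [3, 4, 4, 5, 7, 8, 9, 11, 12]
q1, q3 = quartile(data, 1), quartile(data, 3)
print(q1, q3, q3 - q1)             # Q1, Q3 and the interquartile range
```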
Interquartile Range (IQR)
• The difference between Q3 and Q1: IQR = Q3 – Q1
– Spans the middle 50% of the values
– Also known as the midspread
– Resistant to extreme values
• Example:
– Data in Ordered Array: 11 12 13 16 16 17 17 18 21
– Q1 = 12.5, Q3 = 17.5
– IQR = 17.5 – 12.5 = 5
Range and IQR Example
Find the Range and the Interquartile Range in this distribution: 3, 4, 4, 5, 7, 8, 9, 11, 12
Range = largest – smallest = 12 – 3 = 9.
Position of Q1 = 1(9+1)/4 = 2.5, so Q1 = (4+4)/2 = 4
Position of Q3 = 3(9+1)/4 = 7.5, so Q3 = (9+11)/2 = 10
IQR = 10 – 4 = 6
Variability
• The sample variance, $s^2$, for a sample of n
measurements is equal to the sum of the
squared distances from the mean, divided
by (n – 1).
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Variability
• The sample standard deviation, s, for a
sample of n measurements is equal to the
square root of the sample variance.
$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Numerical Measures of Variability
• Say a small data set consists of the measurements 1, 2
and 3.
$\bar{x} = 2$
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \left[(3 - 2)^2 + (2 - 2)^2 + (1 - 2)^2\right] / (3 - 1)$
$s^2 = (1^2 + 0^2 + 1^2) / 2 = 2 / 2 = 1$
$s = \sqrt{s^2} = \sqrt{1} = 1$
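A small Python sketch of the sample variance and standard deviation formulas, checked against the 1, 2, 3 example. The function names are illustrative; the standard library's `statistics.variance` and `statistics.stdev` compute the same (n – 1) versions.

```python
import math


def sample_variance(values):
    """Sum of squared distances from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


print(sample_variance([1, 2, 3]), sample_std([1, 2, 3]))
```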
Numerical Measures of Variability
• As before, Greek letters are used for
populations and Roman letters for samples
$s^2$ = sample variance
s = sample standard deviation
$\sigma^2$ = population variance
σ = population standard deviation
Comparing Standard Deviations
• Greater S (or σ) = more dispersion of data
[Figure: three dot plots, Data A, Data B and Data C, on a common axis from 11 to 21]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = .9258
Data C: Mean = 15.5, s = 4.57
Interpreting the Standard Deviation
• Chebyshev’s Rule
• The Empirical Rule
Both tell us something about where
the data will be relative to the mean.
• Chebyshev’s Rule
– Valid for any data set
– For any number k > 1, at least $(1 - 1/k^2) \times 100\%$ of the
observations will lie within k standard
deviations of the mean
k    k²    1/k²     (1 – 1/k²) × 100%
2    4     .25      75%
3    9     .11      89%
4    16    .0625    93.75%
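The bound in Chebyshev's Rule is easy to tabulate. The sketch below uses a hypothetical helper name, `chebyshev_lower_bound`, and reproduces the k = 2, 3, 4 rows.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule requires k > 1")
    return (1 - 1 / k**2) * 100


for k in (2, 3, 4):
    print(f"k={k}: at least {chebyshev_lower_bound(k):.2f}%")
```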
The Bienayme-Chebyshev Rule
• At least (≥) 75% of the observations must be
contained within distances of 2 SD around the
mean
• At least (≥) 88.89% of the observations must
be contained within distances of 3 SD around
the mean
• At least (≥) 93.75% of the observations must
be contained within distances of 4 SD around
the mean
The Bienayme-Chebyshev Rule
[Figure: distribution centred on µ with nested intervals: ≥ 75% within µ ± 2sd, ≥ 88.89% within µ ± 3sd, ≥ 93.75% within µ ± 4sd]
• The Empirical Rule
– Useful for mound-shaped, symmetrical distributions
– If not perfectly mounded and symmetrical, the values are
approximations
• For a perfectly symmetrical and mound-shaped distribution,
– ~68% will be within the range $(\bar{x} - s, \bar{x} + s)$
– ~95% will be within the range $(\bar{x} - 2s, \bar{x} + 2s)$
– ~99.7% will be within the range $(\bar{x} - 3s, \bar{x} + 3s)$
• Hummingbirds beat their wings in flight an average of 55
times per second.
• Assume the standard deviation is 10, and that the distribution
is symmetrical and mounded.
– Approximately what percentage of hummingbirds beat their
wings between 45 and 65 times per second?
– Between 55 and 65?
– Less than 45?

• Approximately what percentage of hummingbirds beat their
wings between 45 and 65 times per second?
Since 45 and 65 are exactly one standard deviation below
and above the mean, the empirical rule says that about
68% of the hummingbirds will be in this range.

• Between 55 and 65?
This range of numbers is from the mean to one standard
deviation above it, or one-half of the range in the previous
question. So, about one-half of 68%, or 34%, of the
hummingbirds will be in this range.
• Less than 45?
Half of the entire data set lies above the mean, and ~34% lie
between 45 and 55 (between one standard deviation below
the mean and the mean), so ~84% (~34% + 50%) are above
45, which means ~16% are below 45.
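The three hummingbird answers can be cross-checked numerically. The sketch below makes an assumption stronger than the slides: it treats the wingbeat distribution as exactly normal with mean 55 and standard deviation 10, using `statistics.NormalDist` from the Python standard library (3.8+). The empirical rule's ~68%, ~34% and ~16% are rounded versions of these values.

```python
from statistics import NormalDist

# Assumption: wingbeats per second are exactly normally distributed
wingbeats = NormalDist(mu=55, sigma=10)

p_45_65 = wingbeats.cdf(65) - wingbeats.cdf(45)    # within one SD of the mean
p_55_65 = wingbeats.cdf(65) - wingbeats.cdf(55)    # mean to one SD above
p_below_45 = wingbeats.cdf(45)                     # more than one SD below

print(f"{p_45_65:.3f} {p_55_65:.3f} {p_below_45:.3f}")
```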
Exercise
A manufacturer of automobile batteries claims that the average length of
life of its grade A battery is 60 months. However, the guarantee on this
brand is for just 36 months. Suppose the standard deviation of the life
length is known to be 10 months and the frequency distribution of the
life-length data is known to be mound shaped.
• Approximately what percentage of the manufacturer’s grade A batteries
will last more than 50 months, assuming that the manufacturer’s claim is
true?
• Approximately what percentage of the manufacturer’s batteries will last
less than 40 months, assuming that the manufacturer’s claim is true?
• Suppose your battery lasts 37 months. What could you infer about the
manufacturer’s claim?
Coefficient of Variation
• Measure of Relative Variation
• Shows Variation Relative to the Mean
• Used to Compare Two or More Sets of Data Measured in
Different Units
$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$
S = Sample Standard Deviation
$\bar{X}$ = Sample Mean
Comparing Coefficient
of Variation
• Stock A:
– Average price last year = $50
– Standard deviation = $5
– $CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$
• Stock B:
– Average price last year = $100
– Standard deviation = $5
– $CV_B = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$
• Both stocks have the same standard deviation, but stock B is less
variable relative to its price
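A brief Python sketch of the coefficient of variation, applied to the two stocks. The function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(std_dev=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean=100)   # Stock B
print(f"CV_A = {cv_a}%, CV_B = {cv_b}%")
```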
Numerical Measures of Relative Standing
• The z-score tells us how many standard
deviations above or below the mean a
particular measurement is.
• Sample z-score: $z = \frac{x - \bar{x}}{s}$
• Population z-score: $z = \frac{x - \mu}{\sigma}$
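• Hummingbirds beat their wings in flight an average of 55
times per second.
• Assume the standard deviation is 10, and that the distribution
is symmetrical and mounded.
• An individual hummingbird is measured with 75 beats per
second. What is this bird’s z-score?
$z = \frac{x - \bar{x}}{s} = \frac{75 - 55}{10} = 2.0$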
Z Scores
Example:
• If the mean is 14.0 and the standard deviation is 3.0, what is
the Z score for the value 18.5?
$Z = \frac{X - \bar{X}}{S} = \frac{18.5 - 14.0}{3.0} = 1.5$
• The value 18.5 is 1.5 standard deviations above the mean
• (A negative Z-score would mean that a value is less than the
mean)
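The z-score calculation can be sketched as a one-line Python function, checked here against the two worked examples (18.5 with mean 14.0 and standard deviation 3.0, and the hummingbird measured at 75 beats per second). The function name is illustrative.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


print(z_score(18.5, 14.0, 3.0))   # value from the Z Scores example
print(z_score(75, 55, 10))        # hummingbird example
print(z_score(45, 55, 10))        # a value below the mean gives a negative z
```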
• Since ~95% of all the measurements will be within 2
standard deviations of the mean, only ~5% will be more
than 2 standard deviations from the mean.
• About half of this 5% will be far below the mean, leaving only
about 2.5% of the measurements at least 2 standard
deviations above the mean.
Numerical Measures of Relative Standing
• Z scores are related to the empirical rule:

For a perfectly symmetrical and mound-shaped
distribution,
– ~68 % will have z-scores between -1 and 1
– ~95 % will have z-scores between -2 and 2
– ~99.7% will have z-scores between -3 and 3
Methods for Determining Outliers
• An outlier is a measurement that is unusually

large or small relative to the other values.
• Three possible causes:
– Observation, recording or data entry error
– Item is from a different population
– A rare, chance event
The Box Plot (“Box-and-Whisker”)
• The box plot is a graph representing
information about certain percentiles for a
data set and can be used to identify outliers
• 5 number summary
– Median, Q1, Q3, Xsmallest, Xlargest
• Box Plot
– Graphical display of data using 5-number summary
[Figure: box plot on an axis marked 4, 6, 8, 10, 12, showing X smallest, Q1, Median (Q2), Q3 and X largest]
Distribution Shape & Box Plot
[Figure: box plots for Left-Skewed, Symmetric and Right-Skewed distributions, each marked with Q1, Q2 and Q3]
Methods for Determining Outliers
• Outliers and z-scores
– The chance that a z-score is between -3 and +3 is
over 99%.
– Any measurement with |z| > 3 is considered an
outlier.
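A minimal sketch of the |z| > 3 rule in Python. The data below are made up purely for illustration (they are not from the slides), and the function name `z_outliers` is hypothetical. Note that with the sample standard deviation a small sample cannot produce |z| > 3 at all, so the rule only becomes useful for moderately large data sets.

```python
import statistics


def z_outliers(values, threshold=3.0):
    """Return the measurements whose sample z-score exceeds the threshold in absolute value."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)          # sample standard deviation (n - 1)
    return [x for x in values if abs((x - mean) / s) > threshold]


# Hypothetical measurements: fifteen typical values and one extreme value
sample = [50, 52, 55, 53, 57, 54, 56, 51, 58, 55, 53, 54, 56, 52, 55, 120]
print(z_outliers(sample))
```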
Correlation Coefficient
• Correlation Coefficient = r
– Unit Free
– Measures the strength of the linear relationship
between 2 quantitative variables
• Ranges between –1 and 1
– The Closer to –1, the stronger the negative linear
relationship becomes
– The Closer to 1, the stronger the positive linear
relationship becomes
– The Closer to 0, the weaker any linear relationship
becomes
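The slides do not give a formula for r. As an illustration, the sketch below uses the standard Pearson product-moment formula, r = Sxy / sqrt(Sxx · Syy), with made-up data chosen to show the two extreme cases. The function name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4]
print(pearson_r(x, [2, 4, 6, 8]))   # perfect positive linear relationship
print(pearson_r(x, [8, 6, 4, 2]))   # perfect negative linear relationship
```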
Scatter Plots of Data with Various Correlation Coefficients
• Scattergram (or scatterplot) shows the relationship between two
quantitative variables
[Figure: five scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = .6 and r = 1]
Distorting the Truth with Deceptive
Statistics
• Distortions
– Stretching the axis (and the truth)
– Is average relevant?
• Mean, median or mode?
• What about the spread?