
Statistics and Research Methods

1
Topics: Statistics

• Descriptive Statistics
• Probability Theory and Probability
Distributions
• Hypothesis Testing
• Confidence Interval
• Analysis Of Variance (ANOVA)
• Regression and Correlation
• Chi-Squared
2
Topics: Research Methods

• Research Design
• Literature Review
• Sampling
• Data Collection Methods
• Ethical Issues in Research Resource
• IT role in research & Formatting

3
Assessment

• One test – 15%


• One Individual Assignment – 10%
• One Group Assignment – 15%
• Final Exam – 60%

• Note: Pass Mark is B (50%)


4
References
• Probability and Statistics for Engineering and the Sciences, by Jay L. Devore (Monterey, California)
• Basic Business Statistics, by M.L. Berenson, D.M. Levine and T.C. Krehbiel
• Research Methodology: Methods and Techniques, by C.R. Kothari
• Research Methods for Business Students, by Mark Saunders, Philip Lewis and Adrian Thornhill

• Plenty of Websites

5
6
Introduction and Descriptive Statistics

7
The Science of Statistics
• Statistics is the science of data. This involves
collecting, classifying, summarizing, organizing,
analyzing and interpreting numerical information.

Statistics has two branches:
• Descriptive Statistics
• Inferential Statistics
8
Types of Statistical Applications

• Descriptive statistics utilizes numerical and
graphical methods to look for patterns in a data
set, to summarize the information revealed in a
data set and to present that information in a
convenient form.
• Inferential statistics utilizes sample data to make
estimates, decisions, predictions or other
generalizations about a larger set of data.

9
Descriptive Statistics
Descriptive statistics utilizes numerical and graphical
methods to look for patterns in a data set, to
summarize the information revealed in a data set and
to present that information in a convenient form.

• Collect data
– e.g. Survey
• Present data
– e.g. Tables and graphs
• Characterize data
– e.g. Sample mean = $\bar{X} = \frac{\sum X_i}{n}$
10
Inferential Statistics
Inferential statistics utilizes sample data to make estimates,
decisions, predictions or other generalizations about a larger
set of data.
• Estimation
– e.g.: Estimate the population mean weight
using the sample mean weight
• Hypothesis testing
– e.g.: Test the claim that the population
mean weight is 120 pounds

Drawing conclusions and/or making decisions
concerning a population based on sample results.
11
Fundamental Elements of Statistics
• An experimental unit is an object about which
we collect data.
– Person
– Place
– Thing
– Event

12
Fundamental Elements of Statistics
• A population is a set of units in which we are
interested.
– Typically, there are too many experimental units in
a population to consider every one.
• If we can examine every single one, we conduct a
census.

13
Fundamental Elements of Statistics
• A sample is a subset of the population.
• A variable is a characteristic or property of
an individual unit.
– The values of these characteristics will, not
surprisingly, vary.
• A measure of reliability is a statement about
the degree of uncertainty associated with a
statistical inference. (For example: based on our
analysis, we think 56% of soda drinkers prefer
Pepsi to Coke, ± 5%.)

14
Fundamental Elements of Statistics
Descriptive Statistics
• The population or sample of interest
• One or more variables to be investigated
• Tables, graphs or numerical summary tools
• Identification of patterns in the data

Inferential Statistics
• Population of interest
• One or more variables to be investigated
• The sample of population units
• The inference about the population based on the sample data
• A measure of reliability of the inference

15
Types of Data
• Quantitative Data are measurements that are
recorded on a naturally occurring numerical
scale. e.g. Age, GPA, Salary, Cost of books this
semester
• Categorical (Qualitative) Data are measurements
that cannot be recorded on a natural numerical
scale, but are recorded in categories e.g. Live
on/off campus, Major, Gender

16
Methods for Describing Sets of Data

17
Data Presentation
Numerical data can be presented as:
• Ordered Array
– Stem-and-Leaf Display
• Frequency Distributions
– Histogram
– Polygon
– Ogive
18
Ordered Array
• 1. Organizes Data to Focus on Major Features
• 2. Data Placed in Rank Order
– Smallest to Largest
• 3. Data in Raw Form (as Collected)
– 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
• 4. Data in Ordered Array
– 21, 24, 24, 26, 27, 27, 30, 32, 38, 41

19
Stem-and-Leaf Display
• A Stem-and-Leaf Display shows the number of
observations that share a common value (the stem)
and the precise value of each observation (the leaf)

Stem | Leaves
2    | 1 4 4 6 7 7
3    | 0 2 8
4    | 1

(For example, stem 2 with leaf 6 represents the observation 26.)

Data: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
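The ordered array and the stem-and-leaf display can be reproduced with a few lines of Python. This is only a minimal sketch that assumes two-digit whole-number data; the function name `stem_and_leaf` is hypothetical and not part of the course material.

```python
from collections import defaultdict

# Raw data as collected (same values as on the slides)
raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]


def stem_and_leaf(values):
    """Group values by tens digit (stem) and collect the units digits (leaves)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    return dict(groups)


ordered = sorted(raw)          # the ordered array
display = stem_and_leaf(raw)   # stem -> list of leaves

for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

Each printed row corresponds to one stem, so the length of a row shows how many observations share that stem.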
20
Frequency Distribution Table

Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38

Class Frequency
15 but < 25 3
25 but < 35 5
35 but < 45 2

21
Frequency Distribution Table Steps
• 1. Determine Range
• 2. Select Number of Classes
– Usually Between 5 & 15 Inclusive
• 3. Compute Class Intervals (Width)
• 4. Determine Class Boundaries (Limits)
• 5. Compute Class Midpoints
• 6. Count Observations & Assign to Classes
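The steps above can be sketched in Python as follows. The starting boundary (15), class width (10) and number of classes (3) are taken from the small example on the slides rather than computed from the range, and the helper name `frequency_table` is hypothetical.

```python
raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]


def frequency_table(values, start, width, num_classes):
    """Build classes of the form 'lower but < upper' and count the observations in each."""
    n = len(values)
    rows = []
    for k in range(num_classes):
        lower = start + k * width
        upper = lower + width
        count = sum(1 for v in values if lower <= v < upper)
        rows.append({
            "class": f"{lower} but < {upper}",
            "midpoint": (lower + upper) / 2,   # (upper + lower boundaries) / 2
            "frequency": count,
            "relative": count / n,             # class frequency / n
        })
    return rows


table = frequency_table(raw, start=15, width=10, num_classes=3)
for row in table:
    print(row["class"], row["midpoint"], row["frequency"], row["relative"])
```

With real data the number of classes would normally be chosen first (usually between 5 and 15) and the width derived from the range.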
22
Frequency Distribution Table
Example
Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
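Class          Midpoint   Frequency
15 but < 25    20         3
25 but < 35    30         5
35 but < 45    40         2

• Boundaries: the lower and upper limits of each class (e.g. 15 and 25)
• Width: the distance between the boundaries of a class (e.g. 25 – 15 = 10)
• Midpoint = (Upper + Lower Boundaries) / 2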
23
Relative Frequency &
% Distribution Tables
Relative Frequency Distribution

Class          Prop.
15 but < 25    .3
25 but < 35    .5
35 but < 45    .2

Percentage Distribution

Class          %
15 but < 25    30.0
25 but < 35    50.0
35 but < 45    20.0

class relative frequency = (class frequency) / n

class percentage = (class relative frequency) × 100


24
Cumulative Percentage
Distribution Table
Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
The cumulative percentage is the percentage of observations less than the lower class boundary.

Class          Cumulative Percentage
15 but < 25    0.0
25 but < 35    30.0
35 but < 45    80.0    (30% + 50%)
45 but < 55    100.0   (80% + 20%)
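The running totals in the cumulative table can be checked with a short Python sketch. The class percentages are those of the percentage distribution on the earlier slide; the variable names are illustrative only.

```python
from itertools import accumulate

lower_boundaries = [15, 25, 35, 45]
class_percentages = [30.0, 50.0, 20.0]   # % in 15-25, 25-35, 35-45

# Percentage of observations below each lower class boundary:
# nothing lies below 15, then the class percentages accumulate.
cumulative = [0.0] + list(accumulate(class_percentages))

ogive_points = list(zip(lower_boundaries, cumulative))
print(ogive_points)
```

These (boundary, cumulative %) pairs are the points that would be joined to draw the ogive.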

25
Histogram
• Histograms are graphs of the frequency or relative
frequency of a variable.
– Class intervals make up the horizontal axis
– The frequencies or relative frequencies are displayed on the
vertical axis.
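[Figure: histogram of the frequency distribution in the table that follows. Horizontal axis: lower class boundaries (0, 15, 25, 35, 45, 55). Vertical axis: count (frequency, relative frequency or percent), from 0 to 5. The bars touch.]

Class          Freq.
15 but < 25    3
25 but < 35    5
35 but < 45    2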
26
Polygon

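[Figure: frequency polygon of the distribution in the table that follows. Horizontal axis: class midpoints (0, 10, 20, 30, 40, 50, 60). Vertical axis: count (frequency, relative frequency or percent), from 0 to 5. A fictitious class with zero frequency closes the polygon on the axis.]

Class          Freq.
15 but < 25    3
25 but < 35    5
35 but < 45    2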
27
Cumulative % Polygon (Ogive)

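[Figure: cumulative percentage polygon (ogive). Horizontal axis: lower class boundaries (0, 15, 25, 35, 45, 55). Vertical axis: cumulative % (0%, 25%, 50%, 75%, 100%). A fictitious class (45 but < 55) is added so that the curve reaches 100%.]

Class          Cum. %
15 but < 25    0%
25 but < 35    30%
35 but < 45    80%
45 but < 55    100%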
28
Categorical Data Presentation
Categorical data are first tallied in a Summary Table, which can then be graphed as:
• Pie Chart
• Bar Chart
• Pareto Diagram
29
Summary Table

• 1. Lists Categories & No. Elements in Category
• 2. Obtained by Tallying Responses in Category
• 3. May Show Frequencies (Counts), % or Both

Each row is a category; the count is obtained by tallying (|||| ||||).

Major          Count
Accounting     130
Economics      20
Management     50
Total          200
30
Bar Chart

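[Figure: horizontal bar chart of Major (Acct., Econ., Mgmt.) against frequency, on an axis marked 0, 50, 100, 150.]
• Horizontal bars are used for categorical variables
• Bar length shows frequency or %
• Bars have equal widths
• Bars are separated by 1/2 to 1 bar width
• The frequency axis starts at a zero point
• Percent may be used instead of frequency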
31
Pie Chart

• 1. Shows Breakdown of Total Quantity into Categories
• 2. Useful for Showing Relative Differences
• 3. Angle Size
– (360°)(Percent)
– e.g. (360°)(10%) = 36°

[Figure: pie chart of Majors: Acct. 65%, Mgmt. 25%, Econ. 10% (a 36° slice).]
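The angle rule can be applied to every category of the summary table at once. The sketch below uses the counts from the Major table (130, 20, 50); the variable names are illustrative.

```python
# Counts from the summary table of majors
major_counts = {"Accounting": 130, "Economics": 20, "Management": 50}
total = sum(major_counts.values())

# Angle of each slice = 360 degrees x (share of the total)
angles = {major: count * 360 / total for major, count in major_counts.items()}

for major, angle in angles.items():
    share = 100 * major_counts[major] / total
    print(f"{major}: {share:.0f}% -> {angle:.0f} degrees")
```

The angles add up to 360°, which is a quick check that no category has been left out.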

32
Pareto Diagram

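[Figure: Pareto diagram of Major (Acct., Mgmt., Econ.). Vertical axis: percent (0%, 33%, 67%, 100%).]
• Vertical bar chart with the bars in descending order
• Equal bar widths
• Cumulative percent polygon (ogive) plotted at the bar midpoints
• The vertical axis is always in %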
33
Numerical Descriptive
Measures

34
Summary Measures

Central Tendency
• Mean
• Median
• Mode
• Geometric Mean

Quartile

Variation
• Range
• Variance
• Standard Deviation
• Coefficient of Variation
35
Measures of Central Tendency

• Various ways to describe the central,
most common or middle value in a
distribution or set of data
– The Mean (Arithmetic Mean)
– The Median
– The Mode
– The Geometric Mean

36
Numerical Measures of
Central Tendency

• Summarizing data sets numerically


– Are there certain values that seem more
typical for the data?
– How typical are they?

37
Numerical Measures of
Central Tendency
• Central tendency is the value or values around
which the data tend to cluster
• Variability shows how strongly the data
cluster around that (those) value(s)

38
Numerical Measures of
Central Tendency

• The mean of a set of quantitative data is the sum
of the observed values divided by the number of
values

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

39
Numerical Measures of
Central Tendency
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad \mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

• The mean of a sample is typically denoted by x-bar ($\bar{x}$),
but the population mean is denoted by the Greek
symbol μ.

40
Numerical Measures of
Central Tendency

• If x1 = 1, x2 = 2, x3 = 3 and x4 = 4,

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1 + 2 + 3 + 4}{4} = \frac{10}{4} = 2.5$$

41
Numerical Measures of
Central Tendency

• The median of a set of quantitative data is the
value which is located in the middle of the data,
arranged from lowest to highest values (or vice
versa), with 50% of the observations above and
50% below.

42
Numerical Measures of
Central Tendency

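[Figure: ordered data from the lowest value to the highest value, with the median in the middle: 50% of the observations lie below it and 50% above.]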

43
Numerical Measures of
Central Tendency

• Finding the Median, M:
– Arrange the n measurements from smallest to
largest
• If n is odd, M is the middle number
• If n is even, M is the average of the middle two
numbers

44
Numerical Measures of
Central Tendency
• The mode is the most frequently observed
value.
• The modal class is the class with the highest
relative frequency; for grouped data, the mode is
taken as the midpoint of that class.

45
Geometric Mean

• Equals the nth root of the product of all
observations or values
• For a set of values: x1, x2, x3, ..., xn

• Geometric mean = $\sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$

46
Example
Jim has 20 problems to do for homework. Some are harder
than others and take more time to solve. We take a
random sample of 9 problems.

Find the mean (arithmetic and geometric), median and mode
for the number of minutes Jim spends on his homework.

Problem #   Time Spent (Minutes)
1           12
2           4
3           3
4           8
5           7
6           5
7           4
8           9
9           11

47
Solution: Mean
Sample size (n) = 9

Problems 1 through 9 = x1, x2, x3 … x9, respectively.

Σx = (12 + 4 + 3 + 8 + 7 + 5 + 4 + 9 + 11) = 63 minutes

Mean = Σx/n = 63/9 = 7 minutes

48
Solution: Geometric Mean
Sample size (n) = 9

Problems 1 through 9 = x1, x2, x3 … x9, respectively.

GM = $\sqrt[9]{12 \cdot 4 \cdot 3 \cdot 8 \cdot 7 \cdot 5 \cdot 4 \cdot 9 \cdot 11} \approx 6.31$ minutes

49
Solution: Median
Place the data in ascending order: 3, 4, 4, 5, 7, 8, 9, 11, 12

Position of the median = (n+1)/2 = (9+1)/2 = 5

The 5th ordered observation is 7, and so 7 is the Median.

50
Solution: Mode
The data are already arranged in order from smallest to
largest, so we keep them that way: 3, 4, 4, 5, 7, 8, 9, 11, 12

Only the value 4 occurs more than once.

The Mode is 4.
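The four answers for Jim's homework sample can be cross-checked with Python's standard `statistics` module. This is a sketch for checking the arithmetic, not a replacement for working the example by hand; `geometric_mean` assumes Python 3.8 or later.

```python
import statistics

# Minutes Jim spent on each of the 9 sampled homework problems
minutes = [12, 4, 3, 8, 7, 5, 4, 9, 11]

arithmetic_mean = statistics.mean(minutes)            # sum of the values divided by n
geometric_mean = statistics.geometric_mean(minutes)   # nth root of the product of the values
median = statistics.median(minutes)                   # middle value of the ordered data
mode = statistics.mode(minutes)                       # most frequently observed value

print("Mean:", arithmetic_mean)
print("Geometric mean:", round(geometric_mean, 2))
print("Median:", median)
print("Mode:", mode)
```

The geometric mean is smaller than the arithmetic mean here, which is always the case for positive values that are not all equal.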

51
Approximating the Mean from a
Frequency Distribution
• Used when the only source of data is a
frequency distribution

$$\bar{X} \approx \frac{\sum_{j=1}^{c} m_j f_j}{n}$$

n = sample size
c = number of classes in the frequency distribution
m_j = midpoint of the jth class
f_j = frequency of the jth class
52
Example

Class          MP    Freq.
10 but < 20    15    3
20 but < 30    25    6
30 but < 40    35    5
40 but < 50    45    4
50 but < 60    55    2
Total                20

$$\bar{X} \approx \frac{\sum_{j=1}^{c} m_j f_j}{n}$$

X = ((15*3) + (25*6) + (35*5) + (45*4) + (55*2))/20
= (45 + 150 + 175 + 180 + 110)/20
= 660/20
= 33
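The same approximation takes only a few lines of Python. The (midpoint, frequency) pairs are those of the example table; the variable names are illustrative.

```python
# (class midpoint m_j, class frequency f_j) for each class in the example
classes = [(15, 3), (25, 6), (35, 5), (45, 4), (55, 2)]

n = sum(frequency for _, frequency in classes)   # sample size = total frequency
weighted_sum = sum(midpoint * frequency for midpoint, frequency in classes)

approx_mean = weighted_sum / n
print(n, weighted_sum, approx_mean)
```

The result is only an approximation of the true mean, because every observation in a class is treated as if it sat at the class midpoint.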
53
Numerical Measures of
Central Tendency
• Perfectly symmetric data set:
– Mean = Median = Mode
• Extremely high value in the data set:
– Mean > Median > Mode
(Rightward skewness)
• Extremely low value in the data set:
– Mean < Median < Mode
(Leftward skewness)

54
Numerical Measures of
Central Tendency
• A data set is skewed if one tail of the
distribution has more extreme observations
than the other tail.

55
Numerical Measures of
Variability
• The mean, median and mode give us an idea
of the central tendency, or where the
“middle” of the data is
• Variability gives us an idea of how spread out
the data are around that middle

56
Measures of Variation
Measures of variation:
• Range
• Interquartile Range
• Variance
– Population Variance
– Sample Variance
• Standard Deviation
– Population Standard Deviation
– Sample Standard Deviation
• Coefficient of Variation

57
Numerical Measures of
Variability
• The range is equal to the largest measurement
minus the smallest measurement
– Easy to compute, but not very informative
– Considers only two observations (the smallest and
largest)

58
Quartiles
• Quartiles Split Ordered Data into 4 equal
portions

25% | Q1 | 25% | Q2 | 25% | Q3 | 25%

• Q1 and Q3 are Measures of Non-central Location
– Q2 = the Median

59
Quartiles
• Each Quartile has position and value
– With the data in an ordered array, the position of Qi
is:

$$\text{Position of } Q_i = \frac{i(n+1)}{4}$$

– The value of Qi is the value associated with that
position in the ordered array
• Example:
Data in Ordered Array: 11 12 13 16 16 17 18 21 22

$$\text{Position of } Q_1 = \frac{1(9+1)}{4} = 2.5 \qquad Q_1 = \frac{12 + 13}{2} = 12.5$$
60
Quartiles Example
Find the 1st and 3rd Quartiles in the ordered
observations: 3, 4, 4, 5, 7, 8, 9, 11, 12

Position of Q1 = 1(9+1)/4 = 2.5
The 2.5th observation = (4+4)/2 = 4

Position of Q3 = 3(9+1)/4 = 3 × (position of Q1) = 7.5
The 7.5th observation = (9+11)/2 = 10

61
Interquartile Range (IQR)
• The difference between Q3 and Q1: IQR = Q3 – Q1
– The spread of the middle 50% of the values
– Also known as the midspread
– Resistant to extreme values
• Example:
– Data: 11 12 13 16 16 17 17 18 21
– Q1 = 12.5, Q3 = 17.5
– IQR = 17.5 – 12.5 = 5
62
Range and IQR Example
Find the Range and the Interquartile Range in this distribution: 3, 4, 4, 5, 7, 8, 9, 11, 12

Range = largest – smallest = 12 – 3 = 9.

Position of Q1 = 1(9+1)/4 = 2.5
The 2.5th observation = (4+4)/2 = 4

Position of Q3 = 3(9+1)/4 = 7.5
The 7.5th observation = (9+11)/2 = 10

IQR = 10 – 4 = 6
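The position rule i(n+1)/4 can be sketched in Python as follows. The slides only show positions ending in .5, where the two neighbouring values are averaged; for other fractional positions this sketch interpolates linearly between the neighbours, which is an assumption on our part. The function name `quartile` is hypothetical.

```python
def quartile(ordered, i):
    """Value of the i-th quartile (i = 1, 2, 3) of an ordered list with at least 3 values."""
    n = len(ordered)
    position = i * (n + 1) / 4        # 1-based position in the ordered array
    below = int(position)             # whole part of the position
    fraction = position - below
    if fraction == 0:
        return float(ordered[below - 1])
    # Assumption: interpolate between the two neighbouring observations.
    # For a position such as 2.5 this is the average of the 2nd and 3rd values.
    lower_value = ordered[below - 1]
    upper_value = ordered[below]
    return lower_value + fraction * (upper_value - lower_value)


data = sorted([12, 4, 3, 8, 7, 5, 4, 9, 11])

q1 = quartile(data, 1)
q3 = quartile(data, 3)
data_range = data[-1] - data[0]   # largest minus smallest
iqr = q3 - q1                     # spread of the middle 50%

print(q1, q3, data_range, iqr)
```

Other textbooks and software packages use slightly different quartile conventions, so results for small samples may differ from one tool to another.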

63
Numerical Measures of
Variability

• The sample variance, s², for a sample of n
measurements is equal to the sum of the
squared distances from the mean, divided
by (n – 1).

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

64
Numerical Measures of
Variability

• The sample standard deviation, s, for a
sample of n measurements is equal to the
square root of the sample variance.

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

65
Numerical Measures of Variability

• Say a small data set consists of the measurements 1, 2
and 3.

$$\bar{x} = 2$$

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(3 - 2)^2 + (2 - 2)^2 + (1 - 2)^2}{3 - 1} = \frac{1^2 + 0^2 + 1^2}{2} = \frac{2}{2} = 1$$

$$s = \sqrt{s^2} = \sqrt{1} = 1$$

66
Numerical Measures of Variability

• As before, Greek letters are used for
populations and Roman letters for samples
s² = sample variance
s = sample standard deviation
σ² = population variance
σ = population standard deviation
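The difference between the sample and population versions can be seen with the small data set 1, 2, 3. The sketch below computes the sample variance by hand and compares it with the standard `statistics` module; the helper name `sample_variance` is hypothetical.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared distances from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


data = [1, 2, 3]

s2 = sample_variance(data)            # sample variance, divides by n - 1
s = math.sqrt(s2)                     # sample standard deviation
sigma2 = statistics.pvariance(data)   # population variance, divides by N

print(s2, s, sigma2)
```

Treating the same three values as a whole population gives a smaller variance, because the sum of squares is divided by N = 3 instead of n – 1 = 2.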

67
Comparing Standard Deviations
• Greater S (or σ) = more dispersion of data
[Figure: three dot plots on a common axis from 11 to 21, each with the same mean but a different spread.]
• Data A: Mean = 15.5, s = 3.338
• Data B: Mean = 15.5, s = .9258
• Data C: Mean = 15.5, s = 4.57
68
Interpreting the Standard Deviation

• Chebyshev’s Rule
• The Empirical Rule

Both tell us something about where


the data will be relative to the mean.

69
Interpreting the Standard Deviation

• Chebyshev’s Rule
– Valid for any data set
– For any number k > 1, at least (1 – 1/k²) × 100% of the
observations will lie within k standard
deviations of the mean

k    k²    1/k²     (1 – 1/k²) × 100%
2    4     .25      75%
3    9     .11      89%
4    16    .0625    93.75%

70
The Bienayme-Chebyshev Rule
• At least (≥) 75% of the observations must be
contained within distances of 2 SD around the
mean
• At least (≥) 88.89% of the observations must
be contained within distances of 3 SD around
the mean
• At least (≥) 93.75% of the observations must
be contained within distances of 4 SD around
the mean
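The three percentages follow directly from the bound 1 – 1/k². A minimal Python sketch, with a hypothetical function name:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule only applies for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.2%} of the observations")
```

Because the rule holds for any data set, these are guaranteed minimums; real data usually have a larger share of observations inside each interval.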

71
The Bienayme-Chebyshev Rule

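[Figure: distribution centred on µ with nested intervals marked on the horizontal axis.]
• ≥ 75% of the observations within µ ± 2sd
• ≥ 88.89% of the observations within µ ± 3sd
• ≥ 93.75% of the observations within µ ± 4sd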

72
Interpreting the Standard Deviation
• The Empirical Rule
– Useful for mound-shaped, symmetrical distributions
– If not perfectly mounded and symmetrical, the values are
approximations
• For a perfectly symmetrical and mound-shaped distribution,
– ~68% will be within the range $(\bar{x} - s, \bar{x} + s)$
– ~95% will be within the range $(\bar{x} - 2s, \bar{x} + 2s)$
– ~99.7% will be within the range $(\bar{x} - 3s, \bar{x} + 3s)$

73
Interpreting the Standard Deviation

• Hummingbirds beat their wings in flight an average of 55
times per second.
• Assume the standard deviation is 10, and that the distribution
is symmetrical and mounded.
– Approximately what percentage of hummingbirds beat their
wings between 45 and 65 times per second?
– Between 55 and 65?
– Less than 45?

74
Interpreting the Standard Deviation

• Between 45 and 65 times per second: since 45 and 65 are exactly
one standard deviation below and above the mean of 55, the
empirical rule says that about 68% of the hummingbirds will
be in this range.

75
Interpreting the Standard Deviation

• Between 55 and 65 times per second: this range of numbers is from
the mean to one standard deviation above it, or one-half
of the range between 45 and 65. So, about one-half
of 68%, or 34%, of the hummingbirds will be in this
range.

76
Interpreting the Standard Deviation
• Less than 45 times per second: half of the entire data set lies
above the mean of 55, and ~34% lie between 45 and 55 (between
one standard deviation below the mean and the mean), so
~84% (~34% + 50%) are above 45, which means ~16% are
below 45.
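The three hummingbird answers can be compared with an exact normal model. The empirical rule does not require normality, so using `statistics.NormalDist` here is our own assumption: it is one common example of a symmetrical, mound-shaped distribution.

```python
from statistics import NormalDist

# Assumption: wingbeats per second follow a normal distribution
# with mean 55 and standard deviation 10.
wingbeats = NormalDist(mu=55, sigma=10)

p_45_to_65 = wingbeats.cdf(65) - wingbeats.cdf(45)   # within one SD of the mean
p_55_to_65 = wingbeats.cdf(65) - wingbeats.cdf(55)   # from the mean to one SD above
p_below_45 = wingbeats.cdf(45)                       # more than one SD below the mean

print(f"{p_45_to_65:.1%}  {p_55_to_65:.1%}  {p_below_45:.1%}")
```

Under this model the values agree with the empirical-rule answers of about 68%, 34% and 16% once rounded to whole percentages.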

77
Exercise

A manufacturer of automobile batteries claims that the average length of
life of its grade A battery is 60 months. However, the guarantee on this
brand is for just 36 months. Suppose the standard deviation of the life
length is known to be 10 months and the frequency distribution of the life-
length data is known to be mound shaped.
• Approximately what percentage of the manufacturer’s grade A batteries
will last more than 50 months, assuming that the manufacturer’s claim is
true?
• Approximately what percentage of the manufacturer’s batteries will last
less than 40 months, assuming that the manufacturer’s claim is true?
• Suppose your battery lasts 37 months. What could you infer about the
manufacturer’s claim?

78
Coefficient of Variation
• Measure of Relative Variation
• Shows Variation Relative to the Mean
• Used to Compare Two or More Sets of Data Measured in
Different Units
$$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$$

S = Sample Standard Deviation
$\bar{X}$ = Sample Mean

79
Comparing Coefficient
of Variation
• Stock A:
– Average price last year = $50
– Standard deviation = $5

$$CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$$

• Stock B:
– Average price last year = $100
– Standard deviation = $5

$$CV_B = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$$

Both stocks have the same standard deviation, but stock B is less
variable relative to its price.
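The stock comparison can be reproduced with a one-line function. This is a minimal sketch; the function name is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(std_dev=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean=100)   # Stock B

print(f"Stock A: {cv_a}%  Stock B: {cv_b}%")
```

Because the result is a percentage with no units, the same function can compare data sets measured in different units.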
80
Numerical Measures of Relative Standing

• The z-score tells us how many standard
deviations above or below the mean a
particular measurement is.
• Sample z-score: $z = \frac{x - \bar{x}}{s}$
• Population z-score: $z = \frac{x - \mu}{\sigma}$

81
Interpreting the Standard Deviation
• Hummingbirds beat their wings in flight an average of 55
times per second.
• Assume the standard deviation is 10, and that the distribution
is symmetrical and mounded.

An individual hummingbird is measured with 75 beats per
second. What is this bird’s z-score?

$$z = \frac{x - \bar{x}}{s} = \frac{75 - 55}{10} = 2.0$$

82
Z Scores
Example:
• If the mean is 14.0 and the standard deviation is 3.0, what is
the Z score for the value 18.5?

$$Z = \frac{X - \bar{X}}{S} = \frac{18.5 - 14.0}{3.0} = 1.5$$
• The value 18.5 is 1.5 standard deviations above the mean
• (A negative Z-score would mean that a value is less than the
mean)
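Both z-score examples use the same calculation, sketched here in Python with a hypothetical function name:

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


z_value = z_score(18.5, mean=14.0, std_dev=3.0)   # the value 18.5 from this example
z_bird = z_score(75, mean=55, std_dev=10)         # the hummingbird with 75 beats per second
z_low = z_score(45, mean=55, std_dev=10)          # a value below the mean gives a negative z

print(z_value, z_bird, z_low)
```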

83
Interpreting the Standard Deviation

• Since ~95% of all the


measurements will be within 2
standard deviations of the
mean, only ~5% will be more
than 2 standard deviations
from the mean.
• About half of this 5% will be far
below the mean, leaving only
about 2.5% of the
measurements at least 2
standard deviations above the
mean.

84
Numerical Measures of Relative Standing

• Z scores are related to the empirical rule:


For a perfectly symmetrical and mound-shaped
distribution,
– ~68 % will have z-scores between -1 and 1
– ~95 % will have z-scores between -2 and 2
– ~99.7% will have z-scores between -3 and 3

85
Methods for Determining Outliers

• An outlier is a measurement that is unusually


large or small relative to the other values.
• Three possible causes:
– Observation, recording or data entry error
– Item is from a different population
– A rare, chance event

86
The Box Plot (“Box-and-Whisker”)
• The box plot is a graph representing
information about certain percentiles for a
data set and can be used to identify outliers
• 5 number summary
– Median, Q1, Q3, Xsmallest, Xlargest
• Box Plot
– Graphical display of data using 5-number summary

[Figure: box plot on an axis from 4 to 12, marking X smallest, Q1, Q2 (Median), Q3 and X largest.]
87
Distribution Shape & Box Plot

[Figure: three box plots showing the positions of Q1, Q2 and Q3 for a left-skewed, a symmetric and a right-skewed distribution.]

88
Methods for Determining Outliers
• Outliers and z-scores
– For mound-shaped data, the chance that a z-score is
between -3 and +3 is over 99%.
– Any measurement with |z| > 3 is considered an
outlier.
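The |z| > 3 rule can be sketched as a small filter. The readings below are made-up illustrative values, not course data, and the function name is hypothetical. Note that a sample needs more than 10 observations before any single value can reach |z| > 3 when s is computed from the sample itself.

```python
import statistics


def z_score_outliers(values, threshold=3.0):
    """Return the values whose sample z-score exceeds the threshold in absolute value."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (divides by n - 1)
    return [x for x in values if abs((x - mean) / s) > threshold]


# Hypothetical readings: fifteen values near 50 and one suspicious value of 120
readings = [48, 49, 50, 50, 51, 52, 49, 51, 50, 50, 48, 52, 50, 51, 49, 120]

outliers = z_score_outliers(readings)
print(outliers)
```

A flagged value is only a candidate: it still has to be checked against the three possible causes of outliers before it is corrected or removed.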

89
Correlation Coefficient
• Correlation Coefficient = r
– Unit Free
– Measures the strength of the linear relationship
between 2 quantitative variables
• Ranges between –1 and 1
– The Closer to –1, the stronger the negative linear
relationship becomes
– The Closer to 1, the stronger the positive linear
relationship becomes
– The Closer to 0, the weaker any linear relationship
becomes
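The slides do not give a formula for r. A common definition is the sum of the products of the deviations of x and y from their means, divided by the square root of the product of the two sums of squared deviations. The sketch below implements that definition on made-up illustrative data; the function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data chosen to show the three extreme cases
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # perfect positive linear relationship
falling = [10, 8, 6, 4, 2]   # perfect negative linear relationship
flat = [2, 1, 3, 1, 2]       # no linear relationship

print(pearson_r(x, rising), pearson_r(x, falling), pearson_r(x, flat))
```

Because the deviations are divided by their own spread, r has no units and does not change if either variable is rescaled.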
90
Scatter Plots of Data with Various Correlation Coefficients
• Scattergram (or scatterplot) shows the relationship between two
quantitative variables

[Figure: five scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = .6 and r = 1.]
91
Distorting the Truth with Deceptive
Statistics
• Distortions
– Stretching the axis (and the truth)
– Is average relevant?
• Mean, median or mode?
– Is average relevant?
• What about the spread?

92
