1
Topics: Statistics
• Descriptive Statistics
• Probability Theory and Probability
Distributions
• Hypothesis Testing
• Confidence Interval
• Analysis Of Variance (ANOVA)
• Regression and Correlation
• Chi-Squared
2
Topics: Research Methods
• Research Design
• Literature Review
• Sampling
• Data Collection Methods
• Ethical Issues in Research Resource
• IT role in research & Formatting
3
Assessment
• Plenty of Websites
5
6
Introduction and Descriptive Statistics
7
The Science of Statistics
• Statistics is the science of data. This involves
collecting, classifying, summarizing, organizing,
analyzing and interpreting numerical information.
[Diagram: Statistics branches into Descriptive Statistics and Inferential Statistics]
8
Types of Statistical Applications
9
Descriptive Statistics
Descriptive statistics utilizes numerical and graphical
methods to look for patterns in a data set, to
summarize the information revealed in a data set and
to present that information in a convenient form.
• Collect data
– e.g. Survey
• Present data
– e.g. Tables and graphs
• Characterize data
– e.g. Sample mean = $\bar{X} = \frac{\sum X_i}{n}$
10
Inferential Statistics
Inferential statistics utilizes sample data to make estimates,
decisions, predictions or other generalizations about a larger
set of data.
• Estimation
– e.g.: Estimate the population mean weight
using the sample mean weight
• Hypothesis testing
– e.g.: Test the claim that the population
mean weight is 120 pounds
12
Fundamental Elements of Statistics
• A population is a set of units in which we are
interested.
– Typically, there are too many experimental units in
a population to consider every one.
• If we can examine every single one, we conduct a
census.
13
Fundamental Elements of Statistics
• A sample is a subset of the population.
• A variable is a characteristic or property of
an individual unit.
– The values of these characteristics will, not
surprisingly, vary.
• A measure of reliability is a statement about
the degree of uncertainty associated with a
statistical inference. (Based on our analysis, we
think 56% of soda drinkers prefer Pepsi to Coke,
± 5%.)
14
Fundamental Elements of Statistics
Descriptive Statistics
• The population or sample of interest
• One or more variables to be investigated
• Tables, graphs or numerical summary tools
• Identification of patterns in the data
Inferential Statistics
• Population of interest
• One or more variables to be investigated
• The sample of population units
• The inference about the population based on the sample data
• A measure of reliability of the inference
15
Types of Data
• Quantitative Data are measurements that are
recorded on a naturally occurring numerical
scale. e.g. Age, GPA, Salary, Cost of books this
semester
• Categorical (Qualitative) Data are measurements
that cannot be recorded on a natural numerical
scale, but are recorded in categories e.g. Live
on/off campus, Major, Gender
16
Methods for Describing Sets of Data
17
Data Presentation
[Diagram: numerical data can be presented as an ordered array, a stem-and-leaf display, or frequency distributions (histogram, polygon, ogive)]
18
Ordered Array
• 1. Organizes Data to Focus on Major Features
• 2. Data Placed in Rank Order
– Smallest to Largest
• 3. Data in Raw Form (as Collected)
– 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
• 4. Data in Ordered Array
– 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
19
Stem-and-Leaf Display
• A Stem-and-Leaf Display shows the number of
observations that share a common value (the stem)
and the precise value of each observation (the leaf)
Data: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Stem | Leaf
2 | 144677
3 | 028
4 | 1
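As a rough illustration of the display above, the following sketch builds the stems and leaves for the slide's data. It assumes two-digit whole numbers, with the tens digit as the stem and the units digit as the leaf; the function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by tens digit (stem) and collect units digits (leaves)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 24 -> stem 2, leaf 4
        groups[stem].append(leaf)
    # Join the leaves of each stem into one string, as on the slide
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in groups.items()}

data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(stem, "|", leaves)
```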
20
Frequency Distribution Table
Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
Class Frequency
15 but < 25 3
25 but < 35 5
35 but < 45 2
21
Frequency Distribution Table Steps
• 1. Determine Range
• 2. Select Number of Classes
– Usually Between 5 & 15 Inclusive
• 3. Compute Class Intervals (Width)
• 4. Determine Class Boundaries (Limits)
• 5. Compute Class Midpoints
• 6. Count Observations & Assign to Classes
22
Frequency Distribution Table
Example
Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
Class          Midpoint   Frequency
15 but < 25    20         3
25 but < 35    30         5
35 but < 45    40         2
Boundaries: 15, 25, 35, 45
Width = upper boundary – lower boundary of a class (here 10)
Midpoint = (Upper + Lower Boundaries) / 2
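The table-building steps can be sketched in a few lines. This is only an illustration: the starting boundary (15), width (10) and number of classes (3) are read off the example table, and the helper name `frequency_table` is hypothetical.

```python
def frequency_table(values, lower, width, classes):
    """Count observations in classes of the form 'low but < high'."""
    table = []
    for k in range(classes):
        low = lower + k * width
        high = low + width
        midpoint = (low + high) / 2  # (upper + lower boundaries) / 2
        count = sum(1 for v in values if low <= v < high)
        table.append((low, high, midpoint, count))
    return table

raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]
table = frequency_table(raw, lower=15, width=10, classes=3)
for low, high, midpoint, count in table:
    print(f"{low} but < {high}  midpoint {midpoint:g}  frequency {count}")
```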
23
Relative Frequency &
% Distribution Tables
Relative Frequency Distribution and Percentage Distribution
$\text{class relative frequency} = \frac{\text{class frequency}}{n}$
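A small sketch of the relative frequency formula, applied to the class frequencies from the earlier frequency table (3, 5, 2). The percentage is taken here as the relative frequency times 100; the variable names are illustrative only.

```python
frequencies = {"15 but < 25": 3, "25 but < 35": 5, "35 but < 45": 2}
n = sum(frequencies.values())  # total number of observations

# class relative frequency = class frequency / n
relative = {label: f / n for label, f in frequencies.items()}
# percentage distribution = relative frequency expressed out of 100
percent = {label: 100 * f / n for label, f in frequencies.items()}

for label in frequencies:
    print(f"{label}: {relative[label]:.2f} ({percent[label]:.0f}%)")
```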
25
Histogram
• Histograms are graphs of the frequency or relative
frequency of a variable.
– Class intervals make up the horizontal axis
– The frequencies or relative frequencies are displayed on the
vertical axis.
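Class          Freq.
15 but < 25    3
25 but < 35    5
35 but < 45    2
[Figure: histogram of the table above. Horizontal axis: lower class boundary (15 to 55). Vertical axis: count (0 to 5), which may also be shown as frequency, relative frequency or percent. Bars touch.]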
26
Polygon
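Class          Freq.
15 but < 25    3
25 but < 35    5
35 but < 45    2
[Figure: frequency polygon of the table above. Horizontal axis: class midpoint (10 to 60). Vertical axis: count (0 to 5), which may also be shown as frequency, relative frequency or percent. A fictitious class is added at each end.]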
27
Cumulative % Polygon (Ogive)
Class          Cum. %
15 but < 25    0%
25 but < 35    30%
35 but < 45    80%
45 but < 55    100%
[Figure: ogive plotting cumulative % (0% to 100%) against lower class boundary (15 to 55), with a fictitious class]
28
Categorical Data Presentation
[Diagram: categorical data are presented in a summary table]
Summary Table
Major         Count
Accounting    130
Economics     20
Management    50
Total         200
Each row is a category; the counts come from a tally.
30
Bar Chart
[Figure: bar chart of majors (Acct., Mgmt., Econ.) with equal bar widths and gaps of 1/2 to 1 bar width between bars]
Pie Chart
• 1. Shows Breakdown of Total Quantity into Categories
• 2. Useful for Showing Relative Differences
• 3. Angle Size
– (360°)(Percent)
– e.g. (360°)(10%) = 36°
[Figure: pie chart of majors: Acct. 65%, Mgmt. 25%, Econ. 10% (36°)]
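The angle rule (360°)(Percent) can be checked against the summary table of majors. A minimal sketch, using the counts from that table; the variable names are illustrative.

```python
counts = {"Accounting": 130, "Economics": 20, "Management": 50}
total = sum(counts.values())

# Angle of each slice = 360 degrees times that category's share of the total
angles = {major: 360 * count / total for major, count in counts.items()}

for major, angle in angles.items():
    share = 100 * counts[major] / total
    print(f"{major}: {share:.0f}% -> {angle:.0f} degrees")
```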
32
Pareto Diagram
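[Figure: Pareto diagram of majors: a vertical bar chart with equal bar widths and bars in descending order (Acct., Mgmt., Econ.). Vertical axis: percent (0% to 67%). Horizontal axis: Major.]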
33
Numerical Descriptive
Measures
34
Summary Measures
[Diagram: summary measures, including the geometric mean, variance, standard deviation and coefficient of variation]
35
Measures of Central Tendency
36
Numerical Measures of
Central Tendency
• Central tendency is the value or values around
which the data tend to cluster
• Variability shows how strongly the data
cluster around that (those) value(s)
38
Numerical Measures of
Central Tendency
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
39
Numerical Measures of
Central Tendency
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
40
Numerical Measures of
Central Tendency
• If x1 = 1, x2 = 2, x3 = 3 and x4 = 4,
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = (1 + 2 + 3 + 4)/4 = 10/4 = 2.5$
41
Numerical Measures of
Central Tendency
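• The median is the middle value of the ordered data: 50% of the
observations lie below it and 50% lie above it.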
43
Numerical Measures of
Central Tendency
• The mode is the most frequently observed
value.
• The modal class is the class with the highest
relative frequency; its midpoint is often taken as
the mode of grouped data.
45
Geometric Mean
• Geometric mean = $(x_1 \times x_2 \times \cdots \times x_n)^{1/n}$
46
Example
Jim has 20 problems to do for homework. Some are harder
than others and take more time to solve. We take a
random sample of 9 problems.
Find the mean (arithmetic and geometric), median and mode
for the number of minutes Jim spends on his homework.
Problem #   Time Spent (Minutes)
1           12
2           4
3           3
4           8
5           7
6           5
7           4
8           9
9           11
47
Solution: Mean
Sample size (n) = 9
Problems 1 through 9 = x1, x2, x3 … x9, respectively.
Σx = (12 + 4 + 3 + 8 + 7 + 5 + 4 + 9 + 11) = 63 minutes
Σx/n = 63/9 = 7 minutes
48
Solution: Geometric Mean
Sample size (n) = 9
Problems 1 through 9 = x1, x2, x3 … x9, respectively.
GM = $(12 \times 4 \times 3 \times 8 \times 7 \times 5 \times 4 \times 9 \times 11)^{1/9}$ = 6.31 minutes
49
Solution: Median
Place the data in ascending order: 3, 4, 4, 5, 7, 8, 9, 11, 12
(n+1)/2 = (9+1)/2 = 5
The 5th ordered observation is 7, so the Median is 7.
50
Solution: Mode
Keep the data arranged in order from smallest to largest: 3, 4, 4, 5, 7, 8, 9, 11, 12
Only the value 4 occurs more than once.
The Mode is 4.
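The four solutions for Jim's homework sample can be cross-checked with Python's standard `statistics` module. This is just a verification sketch; `statistics.geometric_mean` assumes Python 3.8 or later.

```python
import statistics

# Minutes spent on the 9 sampled problems
minutes = [12, 4, 3, 8, 7, 5, 4, 9, 11]

arithmetic_mean = statistics.mean(minutes)            # sum / n
geometric_mean = statistics.geometric_mean(minutes)   # nth root of the product
median = statistics.median(minutes)                   # middle ordered value
mode = statistics.mode(minutes)                       # most frequent value

print("mean:", arithmetic_mean)
print("geometric mean:", round(geometric_mean, 2))
print("median:", median)
print("mode:", mode)
```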
51
Approximating the Mean from a
Frequency Distribution
• Used when the only source of data is a
frequency distribution
$\bar{X} \approx \frac{\sum_{j=1}^{c} m_j f_j}{n}$
n = sample size
c = number of classes in the frequency distribution
$m_j$ = midpoint of the jth class
$f_j$ = frequency of the jth class
52
Example
$\bar{X} \approx \frac{\sum_{j=1}^{c} m_j f_j}{n}$
Class          MP    Freq.
10 but < 20    15    3
20 but < 30    25    6
30 but < 40    35    5
40 but < 50    45    4
50 but < 60    55    2
Total                20
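The slide with the worked answer is not in this extract, so here is a short sketch that applies the formula to the example table (midpoints and frequencies as given). The variable names are illustrative.

```python
# (midpoint m_j, frequency f_j) for each class in the example table
classes = [(15, 3), (25, 6), (35, 5), (45, 4), (55, 2)]

n = sum(f for _, f in classes)            # sample size
weighted_sum = sum(m * f for m, f in classes)
approx_mean = weighted_sum / n            # sum of m_j * f_j, divided by n

print("n =", n)
print("sum of m_j * f_j =", weighted_sum)
print("approximate mean =", approx_mean)
```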
54
Numerical Measures of
Central Tendency
• A data set is skewed if one tail of the
distribution has more extreme observations
than the other tail.
55
Numerical Measures of
Variability
• The mean, median and mode give us an idea
of the central tendency, or where the
“middle” of the data is
• Variability gives us an idea of how spread out
the data are around that middle
56
Measures of Variation
Variation
57
Numerical Measures of
Variability
• The range is equal to the largest measurement
minus the smallest measurement
– Easy to compute, but not very informative
– Considers only two observations (the smallest and
largest)
58
Quartiles
• Quartiles Split Ordered Data into 4 equal
portions
25% 25% 25% 25%
Q1 Q2 Q3
59
Quartiles
• Each Quartile has position and value
– With the data in an ordered array, the position of Qi
is:
$\text{Position of } Q_i = \frac{i(n+1)}{4}$
– The value of Qi is the value associated with that
position in the ordered array
• Example:
Data in Ordered Array: 11 12 13 16 16 17 18 21 22
$\text{Position of } Q_1 = \frac{1(9+1)}{4} = 2.5 \qquad Q_1 = \frac{12 + 13}{2} = 12.5$
60
Quartiles Example
Find the 1st and 3rd Quartiles in the ordered
observations: 3, 4, 4, 5, 7, 8, 9, 11, 12
Position of Q1 = 1(9+1)/4 = 2.5
The 2.5th observation = (4+4)/2 = 4
Position of Q3 = 3(9+1)/4 = 3 × 2.5 = 7.5
The 7.5th observation = (9+11)/2 = 10
61
Interquartile Range (IQR)
• The difference between Q3 and Q1: IQR = Q3 – Q1
– The middle 50% of the values
– Also Known as Midspread
– Resistant to extreme values
• Example:
– Data: 11 12 13 16 16 17 17 18 21
– Q1 = 12.5, Q3 = 17.5
– IQR = 17.5 – 12.5 = 5
62
Range and IQR Example
Find the Range and the Interquartile Range in this distribution:
3, 4, 4, 5, 7, 8, 9, 11, 12
Range = largest – smallest = 12 – 3 = 9.
Position of Q1 = 1(9+1)/4 = 2.5
The 2.5th observation = (4+4)/2 = 4
Position of Q3 = 3(9+1)/4 = 3 × 2.5 = 7.5
The 7.5th observation = (9+11)/2 = 10
IQR = 10 – 4 = 6
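The position rule i(n+1)/4 used in these examples can be sketched as a small function. Two assumptions go beyond the slides: fractional positions are handled by linear interpolation between the neighbouring ordered values (the slides only show the halfway case), and the sample has at least 3 observations so both positions fall inside the array. The name `quartile` is made up for this example.

```python
import math

def quartile(values, i):
    """Value of Q_i using the 1-based position i * (n + 1) / 4."""
    ordered = sorted(values)
    n = len(ordered)
    position = i * (n + 1) / 4
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    # Interpolate between the two neighbouring ordered observations
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

data = [3, 4, 4, 5, 7, 8, 9, 11, 12]
q1 = quartile(data, 1)
q3 = quartile(data, 3)
iqr = q3 - q1
data_range = max(data) - min(data)
print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr, "Range =", data_range)
```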
63
Numerical Measures of
Variability
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
64
Numerical Measures of
Variability
$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
65
Numerical Measures of Variability
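$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(3 - 2)^2 + (2 - 2)^2 + (1 - 2)^2}{3 - 1}$
$s^2 = \frac{1^2 + 0^2 + 1^2}{2} = \frac{2}{2} = 1$
$s = \sqrt{s^2} = \sqrt{1} = 1$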
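The worked example (observations 3, 2, 1 with mean 2) can be reproduced directly from the formula. A minimal sketch; the function name `sample_variance` is illustrative, and the result is compared with the standard library's `statistics.variance`.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

sample = [3, 2, 1]
s2 = sample_variance(sample)   # sample variance
s = math.sqrt(s2)              # sample standard deviation

print("s^2 =", s2, "s =", s)
print("matches statistics.variance:", s2 == statistics.variance(sample))
```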
66
Numerical Measures of Variability
67
Comparing Standard Deviations
• Greater S (or σ) = more dispersion of data
[Figure: three dot plots on a common axis from 11 to 21]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = .9258
Data C: Mean = 15.5, s = 4.57
68
Interpreting the Standard Deviation
• Chebyshev’s Rule
• The Empirical Rule
69
Interpreting the Standard Deviation
• Chebyshev’s Rule
– Valid for any data set
– For any number k > 1, at least (1 – 1/k²) × 100% of the
observations will lie within k standard
deviations of the mean
k    k²    1/k²     (1 – 1/k²) × 100%
2    4     .25      75%
3    9     .11      89%
4    16    .0625    93.75%
70
The Bienayme-Chebyshev Rule
• At least (≥) 75% of the observations must be
contained within distances of 2 SD around the
mean
• At least (≥) 88.89% of the observations must
be contained within distances of 3 SD around
the mean
• At least (≥) 93.75% of the observations must
be contained within distances of 4 SD around
the mean
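The three percentages follow from the bound 1 – 1/k². A short sketch that reproduces them; the function name `chebyshev_bound` is made up for this example.

```python
def chebyshev_bound(k):
    """Minimum fraction of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule requires k > 1")
    return 1 - 1 / k**2

bounds = {k: chebyshev_bound(k) for k in (2, 3, 4)}
for k, bound in bounds.items():
    print(f"within {k} SD: at least {100 * bound:.2f}%")
```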
71
The Bienayme-Chebyshev Rule
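[Figure: distribution with nested intervals around the mean: ≥ 75% within 2 SD, ≥ 88.89% within 3 SD, ≥ 93.75% within 4 SD]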
72
Interpreting the Standard Deviation
• The Empirical Rule
– Useful for mound-shaped, symmetrical distributions
– If not perfectly mounded and symmetrical, the values are
approximations
• For a perfectly symmetrical and mound-shaped distribution,
– ~68% will be within the range $(\bar{x} - s, \bar{x} + s)$
– ~95% will be within the range $(\bar{x} - 2s, \bar{x} + 2s)$
– ~99.7% will be within the range $(\bar{x} - 3s, \bar{x} + 3s)$
73
Interpreting the Standard Deviation
• Hummingbirds beat their wings in flight an average of 55 times per
second.
• Assume the standard deviation is 10, and that the distribution is
symmetrical and mounded.
– Approximately what percentage of hummingbirds beat their
wings between 45 and 65 times per second?
– Between 55 and 65?
– Less than 45?
Half of the entire data set lies above the mean, and ~34% lie
between 45 and 55 (between one standard deviation below
the mean and the mean), so ~84% (~34% + 50%) are above
45, which means ~16% are below 45.
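All three hummingbird questions can be answered from the ~68% figure of the Empirical Rule plus symmetry. The sketch below only does that arithmetic; it assumes the rule's rounded value of 68% rather than an exact normal-curve area, and the variable names are illustrative.

```python
WITHIN_ONE_SD = 0.68  # Empirical Rule: ~68% lie within one standard deviation

mean, sd = 55, 10
low, high = mean - sd, mean + sd   # 45 and 65 beats per second

between_low_and_high = WITHIN_ONE_SD      # between 45 and 65
between_mean_and_high = WITHIN_ONE_SD / 2 # between 55 and 65, by symmetry
below_low = 0.5 - WITHIN_ONE_SD / 2       # half lie below the mean; remove 45 to 55

print(f"between {low} and {high}: ~{between_low_and_high:.0%}")
print(f"between {mean} and {high}: ~{between_mean_and_high:.0%}")
print(f"below {low}: ~{below_low:.0%}")
```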
77
Exercise
78
Coefficient of Variation
• Measure of Relative Variation
• Shows Variation Relative to the Mean
• Used to Compare Two or More Sets of Data Measured in
Different Units
$CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\%$
S = Sample Standard Deviation
$\bar{X}$ = Sample Mean
79
Comparing Coefficient
of Variation
• Stock A:
– Average price last year = $50
– Standard deviation = $5
$CV_A = \left( \frac{S}{\bar{X}} \right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$
• Stock B:
– Average price last year = $100
– Standard deviation = $5
$CV_B = \left( \frac{S}{\bar{X}} \right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less
variable relative to its price.
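The two stock calculations can be reproduced with a one-line function. A minimal sketch; the name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(std_dev=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean=100)   # Stock B

print(f"CV of stock A: {cv_a:.0f}%")
print(f"CV of stock B: {cv_b:.0f}%")
```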
80
Numerical Measures of Relative Standing
81
Interpreting the Standard Deviation
• Hummingbirds beat their wings in flight an average of 55
times per second.
• Assume the standard deviation is 10, and that the distribution
is symmetrical and mounded.
An individual hummingbird is measured with 75 beats per
second. What is this bird’s z-score?
$z = \frac{x - \bar{x}}{s} = \frac{75 - 55}{10} = 2.0$
82
Z Scores
Example:
• If the mean is 14.0 and the standard deviation is 3.0, what is
the Z score for the value 18.5?
$Z = \frac{X - \bar{X}}{S} = \frac{18.5 - 14.0}{3.0} = 1.5$
• The value 18.5 is 1.5 standard deviations above the mean
• (A negative Z-score would mean that a value is less than the
mean)
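Both z-score examples in this section (the hummingbird at 75 beats per second and the value 18.5) can be checked with the same small function. A minimal sketch; the name `z_score` is illustrative.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

z_bird = z_score(75, mean=55, std_dev=10)
z_value = z_score(18.5, mean=14.0, std_dev=3.0)
z_below = z_score(9.5, mean=14.0, std_dev=3.0)  # a value below the mean

print(z_bird, z_value, z_below)
```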
83
Interpreting the Standard Deviation
84
Numerical Measures of Relative Standing
85
Methods for Determining Outliers
86
The Box Plot (“Box-and-Whisker”)
• The box plot is a graph representing
information about certain percentiles for a
data set and can be used to identify outliers
• 5 number summary
– Median, Q1, Q3, Xsmallest, Xlargest
• Box Plot
– Graphical display of data using 5-number summary
[Figure: box plot on an axis from 4 to 12, marking X smallest, Q1, Median (Q2), Q3 and X largest]
87
Distribution Shape & Box Plot
[Figure: box plots for three distribution shapes, each marking Q1, Q2 and Q3]
88
Methods for Determining Outliers
• Outliers and z-scores
– The chance that a z-score is between -3 and +3 is
over 99%.
89
Correlation Coefficient
• Correlation Coefficient = r
– Unit Free
– Measures the strength of the linear relationship
between 2 quantitative variables
• Ranges between –1 and 1
– The Closer to –1, the stronger the negative linear
relationship becomes
– The Closer to 1, the stronger the positive linear
relationship becomes
– The Closer to 0, the weaker any linear relationship
becomes
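The slides do not give a formula for r, so the sketch below assumes the usual Pearson product-moment definition (sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations). The data are hypothetical, chosen so the points fall exactly on a line.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]                       # hypothetical data
r_up = pearson_r(xs, [2, 4, 6, 8, 10])     # perfect positive linear relationship
r_down = pearson_r(xs, [10, 8, 6, 4, 2])   # perfect negative linear relationship

print("r (increasing line):", r_up)
print("r (decreasing line):", r_down)
```

Because r is unit free, rescaling either variable (for example, measuring in different units) leaves it unchanged.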
90
Scatter Plots of Data with Various Correlation Coefficients
• Scattergram (or scatterplot) shows the relationship between two
quantitative variables
[Figure: five scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = .6 and r = 1]
91
Distorting the Truth with Deceptive
Statistics
• Distortions
– Stretching the axis (and the truth)
– Is average relevant?
• Mean, median or mode?
• What about the spread?
92