Chapter 1
Statistics
Statistics is a science that involves the extraction of
information from numerical data obtained during an
experiment or from a sample. It involves the design
of the experiment or sampling procedure, the
collection and analysis of the data, and making
inferences (statements) about the population based
upon information in a sample.
STATISTICS
So put another way
Statistics is the Science of Reasoning
with Data. It is the science of
collecting, organizing, summarizing,
and analyzing information to draw
conclusions or answer questions.
Variable
any characteristic of an individual
can take different values for different
individuals
SAS Institute Inc. is gathering data related to the attendees of this course …
… provide the following information. This data is being gathered for purpose …
… certain concepts within this course and will not be used for any other purp…
Data
___________ First and Last Initials
___________ Gender (F = Female, M = Male)
___________ State of Birth (2-character abbreviation)
___________ Number of Years with Present Employer
___________ Profession
So what is Data?
Data is the Collection of
values of a set of
characteristics for a set
of individuals, collected
for a particular study.
BPS - 5th Ed.
Executive
Finance/Accounting
Business Analyst
Engineering
Sales/Marketing
Production
R&D
Statistician
Information Systems
Consultant
Trainer/Educator
Full-time Student
Other
SAS and all other SAS Institute Inc. product or service names are registered trademarks of SAS Institute Inc.
Row: Single observation

Initials  Gender  State  Years  Profession
JM        M       GA     17     F
TM        M       CT     5      E
GS        F       FL     2      A
DW        F       AL     10     D
BW        F       NC     12     B
.         .       .      .      .
.         .       .      .      .
A population:
Is the group to be studied.
Includes all of the individuals in the group.
A sample:
Is a subset of the population.
Is often used in analyses because getting access to the
entire population is impractical.
Descriptive Statistics: Statements about the Sample.
Organizing and summarizing the information collected.
Inferential Statistics: Generalization of sample results to
statements about the Population.
Process of Statistics
Descriptive Statistics
After we collect the raw data (from a sample survey or
a designed experiment), we can:
Describe the data using visual methods (Chapter 1)
Describe the data using numeric methods (Chapter
2)
Different methods are appropriate for different types
of data.
Descriptive Statistics: Statements about the Sample
Focus is on interpreting and presenting the data that has been
collected via the sample.
No attempt is made to generalize to other (larger) groups, such
as the population.
Case Study
The Effect of Hypnosis
on the
Immune System
reported in Science News, Sept. 4, 1993, p. 153
Objective:
To determine if hypnosis strengthens the
disease-fighting capacity of immune cells.
65 college students.
33 easily hypnotized
32 not easily hypnotized
Students randomly assigned to one
of three conditions
subjects hypnotized, given mental
exercise
subjects relaxed in sensory deprivation
tank
control group (no treatment)
white blood cell counts re-measured after
one week
the two white blood cell counts are
compared for each group
results
hypnotized group showed larger jump in white
blood cells
easily hypnotized group showed largest
immune enhancement
Variables measured
categorical
quantitative
Case Study
Weight Gain Spells
Heart Risk for Women
Weight, weight change, and coronary heart disease
in women. W.C. Willett et al., Journal of the American
Medical Association, vol. 273(6), Feb. 8, 1995.
(Reported in Science News, Feb. 4, 1995, p. 108)
Objective:
To recommend a range of body mass index
(a function of weight and height) in terms of
coronary heart disease (CHD) risk in women.
Study started in 1976 with 115,818
women aged 30 to 55 years and
without a history of previous CHD.
Each woman's weight (body mass)
was determined.
Each woman was asked her weight at
age 18.
The cohort of women was followed
for 14 years.
The number of CHD (fatal and
nonfatal) cases was counted (1292
cases).
Variables measured
quantitative
categorical
Distribution
Tells what values a variable takes
and how often it takes these values
Can be a table, graph, or function
Quantitative variables
Histograms
Stemplots (stem-and-leaf plots)
Examples of Displaying
Data
The next 6 slides will show different
ways to display the same
information.
The first 3 slides show different ways
to display information on the Class
Make-up on the First Day of Class
The next 3 slides display information
on U.S. solid waste.
Class       Count   Percent
Freshman    18      41.9%
Sophomore   10      23.3%
Junior      6       14.0%
Senior      9       20.9%
Total       43      100.1%
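The percent column can be checked with a few lines of Python. This is a minimal sketch: the Junior and Senior counts (6 and 9) are inferred from the listed percents and the total of 43, so treat them as an assumption. It also suggests why the percents total 100.1% rather than 100%: each percent is rounded separately.

```python
# Class make-up counts; the Junior and Senior counts are inferred (assumption).
counts = {"Freshman": 18, "Sophomore": 10, "Junior": 6, "Senior": 9}

total = sum(counts.values())

# Percent for each class, rounded to one decimal place as in the table.
percents = {level: round(100 * n / total, 1) for level, n in counts.items()}

for level, pct in percents.items():
    print(f"{level:<10} {counts[level]:>3} {pct:>6.1f}%")

# Summing the already-rounded percents gives slightly more than 100.
rounded_total = round(sum(percents.values()), 1)
print(f"{'Total':<10} {total:>3} {rounded_total:>6.1f}%")
```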
Material            Weight (million tons)   Percent of total
Food scraps         25.9                    11.2 %
Glass               12.8                    5.5 %
Metals              18.0                    7.8 %
Paper, paperboard   86.7                    37.4 %
Plastics            24.7                    10.7 %
…                   15.8                    6.8 %
Wood                12.7                    5.5 %
Yard trimmings      27.7                    11.9 %
Other               7.5                     3.2 %
Total               231.9                   100.0 %
Frequency/Relative Frequency
A frequency distribution lists:
Each of the categories.
The frequency, or the count, of the observations that belong to each category.
Frequency: counts. Relative frequency: proportions (or percents).
Example
Consider the following data set:
blue, blue, green, red, red, blue, red, blue
The frequency distribution for this qualitative
data is:
Color
Frequency
Blue
Green
Red
Frequency
Table
Example
Consider the following data set:
blue, blue, green, red, red, blue, red, blue
The frequency distribution for this qualitative data is:

Frequency Table
Color   Frequency
Blue    4
Green   1
Red     3
Example (cont.)
The relative frequency distribution is computed as follows:
Sum of all frequencies = _____
Blue has a relative frequency of __________________
Green has a relative frequency of _________________
Red has a relative frequency of ___________________

Relative Frequency Table
Color   Relative Frequency (Proportion)   Relative Frequency (Percent)
Blue
Green
Red
Example (cont.)
The relative frequency distribution is computed as follows:
Sum of all frequencies = 8
Blue has a relative frequency of 4/8 or 0.5 or 50%
Green has a relative frequency of 1/8 or 0.125 or 12.5%
Red has a relative frequency of 3/8 or 0.375 or 37.5%

Relative Frequency Table
Color   Relative Frequency (Proportion)   Relative Frequency (Percent)
Blue    4/8 or 1/2                        50%
Green   1/8                               12.5%
Red     3/8                               37.5%
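The same frequency and relative frequency distributions can be tallied in code. The following is a minimal Python sketch using the eight colors from the example; the variable names are illustrative and not part of the course materials.

```python
from collections import Counter

# The data set from the example.
colors = ["blue", "blue", "green", "red", "red", "blue", "red", "blue"]

# Frequency: the count of observations in each category.
frequency = Counter(colors)

# Relative frequency: each count divided by the sum of all frequencies.
n = sum(frequency.values())
relative_frequency = {color: count / n for color, count in frequency.items()}

print(f"Sum of all frequencies = {n}")
for color in sorted(frequency):
    rel = relative_frequency[color]
    print(f"{color:<6} {frequency[color]}  {rel:.3f}  {rel:.1%}")
```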
Bar Graphs
A bar graph displays a frequency/relative frequency table for qualitative data.
Bar Graphs (cont.)
Good practices in constructing bar graphs:
The horizontal scale:
The categories should be equally spaced.
The rectangles should have the same widths and have some space
between them.
The qualitative variable associated with the categories should be
identified via a meaningful label.
Blood Type   Count   Percent
A            18      36.00
AB           4       8.00
B            6       12.00
O            22      44.00
N=           50
MTB > chart c1;
SUBC> title "Distribution of Blood Types for the 50 Patients";
SUBC> percent;
SUBC> bar.
[Bar graph: Distribution of Blood Types for the 50 Patients. Horizontal axis: Blood Type; vertical axis: Percent. Percent within all data.]
In MINITAB, double-click any title or label to change.
Default label in MINITAB is Percent instead of Relative Frequency.
[Bar graph of the blood type counts. Horizontal axis: Blood Type; vertical axis: Count.]
Default label in MINITAB is Count instead of Frequency.
Categories arranged so bars are decreasing in height.
Pareto Chart: A Bar Graph whose bars are drawn in decreasing order of
frequency or relative frequency.
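The ordering step behind a Pareto chart can be sketched in Python. This example uses the blood type counts from the table in this chapter and prints a rough text bar for each category; it is an illustration only and does not reproduce MINITAB output.

```python
# Blood type counts for the 50 patients.
blood_type_counts = {"A": 18, "AB": 4, "B": 6, "O": 22}

# Pareto ordering: categories sorted by decreasing frequency.
pareto_order = sorted(
    blood_type_counts.items(), key=lambda item: item[1], reverse=True
)

# One text bar per category, tallest first.
for blood_type, count in pareto_order:
    print(f"{blood_type:<2} {'#' * count} {count}")
```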
where:
cy is the column containing the data
Pie Charts
A pie chart is a circle divided into sections, one for each
category.
The area (angle) of each sector is proportional to the
frequency/relative frequency of that category.
Pie charts are useful for showing the relative proportions of each
category, compared to the whole.
Blood Type   Frequency   Relative Frequency   Degree Measure
A            18
AB           4
B            6
O            22
Total:       50          1.000                360
Blood Type   Frequency   Relative Frequency   Degree Measure
A            18          0.36                 360*0.36=129.6, 130
AB           4           0.08                 360*0.08=28.8, 29
B            6           0.12                 360*0.12=43.2, 43
O            22          0.44                 360*0.44=158.4, 158
Total:       50          1.000                360
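The degree measure for each pie sector is 360 times the category's relative frequency. A minimal Python sketch of that arithmetic follows, using the blood type frequencies (A 18, AB 4, B 6, O 22) from this chapter; the variable names are illustrative.

```python
# Blood type frequencies for the 50 patients.
frequencies = {"A": 18, "AB": 4, "B": 6, "O": 22}
total = sum(frequencies.values())

# Sector angle = 360 * relative frequency, rounded to the nearest degree.
degrees = {}
for blood_type, count in frequencies.items():
    relative = count / total
    degrees[blood_type] = round(360 * relative)
    print(f"{blood_type:<2} {count:>2} {relative:.2f} {degrees[blood_type]:>3}")

print(f"Total degrees: {sum(degrees.values())}")
```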
MTB > piechart c1;
SUBC> title "Distribution of Blood Types for the 50 Patients";
SUBC> slabel;
SUBC> pcategory;
SUBC> percent.
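[Pie chart: Distribution of Blood Types for the 50 Patients. Slices are labeled with category and percent, for example AB 8.0% and B 12.0%.]
In MINITAB, double-click title to change.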
Discrete Quantitative Data
Discrete quantitative data can be presented in tables
and bar graphs in several of the same ways as
qualitative data.
Frequency/Relative Frequency Distribution
Values listed in a table - use the discrete values instead
of the category names.
List frequencies or relative frequencies.
Wendy's Example
Frequency and relative frequency distributions:
MTB > tally c1;
SUBC> counts;
SUBC> percents.
Count   Percent
1       2.50
6       15.00
1       2.50
4       10.00
7       17.50
11      27.50
5       12.50
2       5.00
2       5.00
1       2.50
N= 40
Wendy's Example
[Bar graphs of the Wendy's data. Vertical axes: Number of Intervals and Proportion of Intervals.]
Continuous Quantitative Data
Continuous data cannot be put directly into frequency
tables or displayed in histograms because continuous
data do not have any obvious categories.
Categories are created using classes, or intervals of
numbers. (No predefined categories.)
The continuous data is then grouped into the classes.
Just as for discrete data:
The classes and the number (or proportion) of values in
each can be put into a table to form a frequency (or
relative frequency) distribution.
A histogram can be created from the frequency/relative
frequency distribution.
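Grouping continuous values into classes can be sketched in Python as follows. The data values, the first lower class limit (6), and the class width (2) are all hypothetical choices made for illustration; none of them comes from a specific example in these slides.

```python
# Hypothetical continuous measurements (illustrative values only).
values = [7.8, 8.1, 8.3, 8.6, 8.8, 9.1, 9.4, 10.2, 11.0, 11.7]

first_lower_limit = 6    # assumed lower limit of the first class
class_width = 2          # assumed class width
number_of_classes = 3    # classes: 6 to <8, 8 to <10, 10 to <12

# Count how many values fall into each class.
counts = [0] * number_of_classes
for value in values:
    index = int((value - first_lower_limit) // class_width)
    counts[index] += 1

# Print the frequency and relative frequency distribution.
for i, count in enumerate(counts):
    lower = first_lower_limit + i * class_width
    upper = lower + class_width
    print(f"{lower:>2} to <{upper:<2}  {count}  {count / len(values):.2f}")
```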
Classes
Definitions:
Classes (cont.)
Class Width: Difference between consecutive lower class
limits (with the exception of open-ended classes).
For the class 30-39, the class width = 40 - 30 = 10.
(All the classes (20-29, 30-39, 40-49, 50-59) have the same width.)
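The class width rule can be checked with a short Python sketch. It takes the lower class limits of the four classes named above and subtracts each from the next; the variable names are illustrative.

```python
# Lower class limits of the classes 20-29, 30-39, 40-49, 50-59.
lower_limits = [20, 30, 40, 50]

# Class width = difference between consecutive lower class limits.
widths = [nxt - cur for cur, nxt in zip(lower_limits, lower_limits[1:])]

# True when every class has the same width.
same_width = len(set(widths)) == 1

print(widths, same_width)
```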
Frequency Tables
The classes and the number of values in each can be put
into a table (called a frequency table):

Class          Frequency
30-39          1147
40-49          1090
50-59          493
60 and older   110
Histograms
A histogram is a "picture" of a frequency/relative frequency table for
quantitative data.
Classes
0-1.99
2-3.99
...
12-13.99
14-15.99
To construct a histogram:
1.
2.
3.
4.
5.
Histograms (cont.)
Important points of histogram construction:
Plot and label only the lower class limits, in between the
bars.
Provide a descriptive title.
Generic title: "Distribution of [Name of the Variable] for
[Describe the Items]."
(You must substitute the correct words for the problem at
hand for the bracketed items.)
[Histograms of the stock volume data. Horizontal axis: Volume (in millions); vertical axis: Frequency, relabeled Number of Days.]
Default label in MINITAB is Frequency.
Stock Example (cont.)
Double-click any tick mark label; opens this dialog box.
Double-click x-axis or tick mark label; opens this dialog box.
MINITAB Histogram
MTB > histogram cy;
SUBC> cutpoint limit_1:limit_2/class_width.
where:
cy is the column containing the data.
limit_1 is the lower limit of the first class.
limit_2 is the next lower class limit if one more class were to
exist (or the last lower class limit plus class width).
class_width is the class width.
Use percent subcommand for relative frequency histogram.
MINITAB Histogram (cont.)
To make a histogram with tick marks you define, you can use
the subcommand cutpoint.
However, to use the cutpoint subcommand, you must figure
out values for limit_1, limit_2, and class_width; MINITAB
does not compute these for you. (Use DESCRIBE command
to find sample size, minimum, and maximum.)
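One way to work out limit_1 and limit_2 by hand is sketched below in Python. The helper name cutpoint_limits is hypothetical, and the minimum, maximum, and class width are illustrative inputs; the sketch only mirrors the arithmetic, it is not a MINITAB feature.

```python
import math


def cutpoint_limits(minimum, maximum, class_width):
    """Return (limit_1, limit_2) for classes of the given width.

    Hypothetical helper: limit_1 is the largest multiple of class_width
    not above the minimum; limit_2 is the lower limit that one more class
    beyond the class holding the maximum would have.
    """
    limit_1 = math.floor(minimum / class_width) * class_width
    number_of_classes = math.floor((maximum - limit_1) / class_width) + 1
    limit_2 = limit_1 + number_of_classes * class_width
    return limit_1, limit_2


# Illustrative inputs: minimum 0.05, maximum 14.48, class width 2.
limit_1, limit_2 = cutpoint_limits(0.05, 14.48, 2)
print(f"cutpoint {limit_1}:{limit_2}/2")
```

With these inputs the classes run from 0 up to 16 in steps of 2, so the maximum of 14.48 falls in the last class.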
Additional Example
a) Determine the class width.
b) Identify the classes.
c) Which class has the highest
frequency?
d) What MINITAB "cutpoint"
subcommand would you use
to generate this histogram?
e) Approximately how many
states had between 200 and
399 traffic fatalities?
f) Approximately what percent
of states had less than 200
traffic fatalities?
Additional Example
a) Determine the class width. Answer: 200
b) Identify the classes. Answer: 0-199, 200-399, 400-599, 600-799,
800-999, 1000-1199, 1200-1399, 1400-1599, 1600-1799
c) Which class has the highest frequency? Answer: 0-199
d) What MINITAB "cutpoint" subcommand would you use
to generate this histogram?
Answer: SUBC> cutpoint 0:1800/200
e) Approximately how many states had between 200 and
399 traffic fatalities? Answer: 15
f) Approximately what percent of states had fewer than 200
traffic fatalities? Answer: 20/50 = 40%
Case Study
Weight Data
Introductory Statistics class
Spring, 1997
Virginia Commonwealth University
Weight Data
[Histogram of the weight data. Horizontal axis: Weight, with ticks from 100 to 280; vertical axis: Number of students.]
Stem-and-Leaf Plot
A stem-and-leaf plot is a different way to represent
quantitative data that is similar to a histogram.
To draw a stem-and-leaf plot, each data value must be
broken up into two components. In the simplest
scenario:
The stem consists of all the digits except for the
right-most one.
The leaf consists of the right-most digit.
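The simplest stem-and-leaf construction can be sketched in Python. The ten values below are used only as an illustration, and each is assumed to have exactly one decimal place, so the right-most digit is the tenths digit.

```python
from collections import defaultdict

# Illustrative values, each with one decimal place (assumption).
values = [7.8, 8.1, 8.3, 8.6, 8.8, 9.1, 9.4, 10.2, 11.0, 11.7]

# Split each value into stem (all digits but the last) and leaf (last digit).
leaves_by_stem = defaultdict(list)
for value in sorted(values):
    tenths = round(value * 10)      # 7.8 -> 78
    stem, leaf = divmod(tenths, 10)  # 78 -> stem 7, leaf 8
    leaves_by_stem[stem].append(leaf)

# List every stem from smallest to largest, even one with no leaves.
lines = []
for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
    leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
    lines.append(f"{stem:>2} | {leaves}")

print("\n".join(lines))
```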
Raw Data
5.6 6.5 7.3 7.8 7.8
. . . . . . . . . .
16.8 17.0 17.6 17.8 18.0
Stem-and-Leaf: Manual Solution
Sorted data:
7.8 8.1 8.3 8.6 8.8 9.1 9.4 10.2 11.0 11.7
Modifying Data
Minimum = 0.05
Maximum = 14.48
Panels: Truncated (MTB > stem c1) and Rounded.
Leaf: include zeros.

Depth   Stem   Leaf
  1       0    1
  2       1    0
  5       2    339
  7       3    17
 10       4    139
 11       5    4
 19       6    00114578
 (2)      7    89
 19       8    23369
 14       9    1489
 10      10    1139
  6      11    79
  4      12    15
  2      13    9
  1      14    5
Split Stems
Include stem even when no leaves.
Modifications to Stem-and-Leaf Plots
If we wanted to compare two sets of data, we could draw
two stem-and-leaf plots using the same stem, with leaves
going left (for one set of data) and right (for the other set).
There are cases where constructing a descending stem-and-leaf
plot could also be appropriate (for test scores, for example).
Weight Data: Stemplot (Stem & Leaf Plot)
Key: 20|3 means 203 pounds
Stems = 10s
Leaves = 1s
Values entered: 192, 152, 135

10 |
11 |
12 |
13 | 5
14 |
15 | 2
16 |
17 |
18 |
19 | 2
20 |
21 |
22 |
23 |
24 |
25 |
26 |
Dot Plot
Number of arrivals at Wendy's Data
A dot plot is a graph where a
dot is placed over the value
each time it is observed.
(Used with discrete data and
small sets of continuous data.)
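A dot plot can be approximated in plain text with a short Python sketch. The arrival counts below are hypothetical stand-ins, not the actual Wendy's data, and an asterisk plays the role of each dot.

```python
from collections import Counter

# Hypothetical arrival counts (not the actual Wendy's data).
arrivals = [2, 3, 3, 5, 4, 3, 2, 6, 4, 3, 5, 4]

counts = Counter(arrivals)
tallest = max(counts.values())
axis_values = range(min(arrivals), max(arrivals) + 1)

# Build the plot from the top row down: one "*" per observation of a value.
rows = []
for level in range(tallest, 0, -1):
    rows.append(" ".join("*" if counts[v] >= level else " " for v in axis_values))

axis = " ".join(str(v) for v in axis_values)
print("\n".join(rows))
print(axis)
```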
Uniform
A variable has a uniform distribution
when:
Each of the values tends to occur with
the same frequency.
The histogram looks flat.
Bell-Shaped
A variable has a bell-shaped (or
mound-shaped) distribution when:
Most of the values fall in the middle.
The frequencies tail off to the left and to
the right.
It is symmetric (i.e., left half mirror
image of right half).
[Figure: Symmetric, Bell-Shaped]
[Figure: Symmetric, Mound-Shaped]
Right-Skewed
A variable has a skewed right
distribution when:
The distribution is not symmetric.
The tail to the right is longer than the
tail to the left.
The arrow from the middle to the long
tail points right.
[Figure: Asymmetric, Skewed to the Right]
Left-Skewed
A variable has a skewed left
distribution when:
The distribution is not symmetric.
The tail to the left is longer than the tail
to the right.
The arrow from the middle to the long
tail points left.
[Figure: Asymmetric, Skewed to the Left]
Time Plots
A time plot shows behavior over time.
Time is always on the horizontal axis, and the
variable being measured is on the vertical
axis.
Look for an overall pattern (trend), and
deviations from this trend. Connecting the
data points by lines may emphasize this trend.
Look for patterns that repeat at known regular
intervals (seasonal variations).
Time Plots
Time-Series Data: Variable is measured at different points in
time.
Time-Series Plot: Time-series data (vertical axis) plotted
against time (horizontal axis). Lines are then drawn
connecting the points.
Identify long term trends.
Identify regularly occurring patterns with time (seasonality).
[Time plot: Percent of Class That Are Freshmen, 1985 to 1993. Vertical axis: 0% to 50%.]
Outliers
Extreme values that fall outside the
overall pattern
May occur naturally
May occur due to error in recording
May occur due to error in measuring
Observational unit may be
fundamentally different