
Copyright The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

CHAPTER 3

Data Description

Objectives

Summarize data using measures of central tendency, such as the mean, median, mode, and midrange.

Describe data using measures of variation, such as the range, variance, and standard deviation.

Identify the position of a data value in a data set using various measures of position, such as percentiles, deciles, and quartiles.

Objectives (contd.)

Use the techniques of exploratory data analysis, including stem-and-leaf plots, boxplots, and five-number summaries, to discover various aspects of data.

Introduction

Statistical methods can be used to summarize data.

Measures of average are also called measures of central tendency and include the mean, median, mode, and midrange.

Measures that determine the spread of data values are called measures of variation or measures of dispersion and include the range, variance, and standard deviation.

Introduction (contd.)

Measures of position tell where a specific data value falls within the data set or its relative position in comparison with other data values.

The most common measures of position are percentiles, deciles, and quartiles.

Introduction (contd.)

The measures of central tendency, variation, and position are part of what is called traditional statistics. Statistics of this type are typically used to confirm conjectures about the data.

Introduction (contd.)

Another type of statistics is called exploratory data analysis. These techniques include the stem-and-leaf plot, the boxplot, and the five-number summary. They can be used to explore data to see what they show.

Basic Vocabulary

A statistic is a characteristic or measure obtained by using the data values from a sample.

A parameter is a characteristic or measure obtained by using all the data values for a specific population.

When the data in a data set are ordered, the result is called a data array.

General Rounding Rule

In statistics, the basic rounding rule is that intermediate results of a computation should not be rounded; rounding should be done only when the final answer is calculated.

Central Tendency: The Mean

The mean, also known as the arithmetic average, is the sum of the values divided by the total number of values.

Rounding rule: the mean should be rounded to one more decimal place than occurs in the raw data.

The type of mean that considers an additional factor is called the weighted mean.
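
The two definitions above can be sketched in a few lines of Python. The function names and the sample data are illustrative assumptions, not part of the slides; the weights stand for whatever "additional factor" applies (here, hypothetical credit hours).

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical whole-number data, so the mean is reported to one decimal place.
scores = [7, 8, 10, 12, 14]
print(round(mean(scores), 1))  # 10.2

# Hypothetical grade points weighted by credit hours.
grade_points = [4, 3, 2]
credits = [3, 4, 2]
print(round(weighted_mean(grade_points, credits), 1))  # 3.1
```

Rounding is applied only when printing, in line with the general rounding rule.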

Central Tendency: The Mean (contd.)

One computes the mean by using all the values of the data.

The mean varies less than the median or mode when samples are taken from the same population and all three measures are computed for these samples.

The mean is used in computing other statistics, such as variance.

Central Tendency: The Mean (contd.)

The mean for the data set is unique, and not necessarily one of the data values.

The mean cannot be computed for an open-ended frequency distribution.

The mean is affected by extremely high or low values and may not be the appropriate average to use in these situations.

Central Tendency: Median and Mode

The median is the halfway point in a data set. The symbol for the median is MD.

The median is found by arranging the data in order and selecting the middle point.

The value that occurs most often in a data set is called the mode.

For grouped data, the mode is the modal class, that is, the class with the highest frequency.
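
As a minimal sketch, both measures are available in Python's standard `statistics` module. The data below are hypothetical. Note that for an even number of values `median` averages the two middle values, which is one common way of "selecting the middle point".

```python
from statistics import median, mode

# Hypothetical ages; median() orders the data internally before picking the middle.
ages = [21, 25, 25, 28, 30, 34]
print(median(ages))  # 26.5 (average of the two middle values, 25 and 28)
print(mode(ages))    # 25 (the value that occurs most often)

# With an odd number of values the median is the single middle value.
print(median([7, 3, 9]))  # 7
```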

Central Tendency: The Median

The median is used when one must find the center or middle value of a data set.

The median is used when one must determine whether the data values fall into the upper half or lower half of the distribution.

The median is used to find the average of an open-ended distribution.

The median is affected less than the mean by extremely high or extremely low values.

Central Tendency: The Mode

The mode is used when the most typical case is desired.

The mode is the easiest average to compute.

The mode can be used when the data are nominal, such as religious preference, gender, or political affiliation.

The mode is not always unique. A data set can have more than one mode, or the mode may not exist for a data set.
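
The cases listed above (one mode, several modes, no mode) can be illustrated with a small helper. This is a sketch under one assumed convention: when no value repeats, the data set is treated as having no mode. The function name and data are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency.

    Assumed convention: if no value occurs more than once, there is no mode
    and an empty list is returned.
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []
    return [value for value, count in counts.items() if count == top]


# Works for nominal data as well as numbers.
print(modes(["Democrat", "Republican", "Democrat", "Independent"]))  # ['Democrat']
print(modes([1, 1, 2, 2, 3]))  # [1, 2]  (two modes)
print(modes([4, 5, 6]))        # []      (no mode)
```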

Central Tendency: The Midrange

The midrange is defined as the sum of the lowest and highest values in the data set divided by 2.

The symbol for midrange is MR.

Central Tendency: The Midrange (contd.)

The midrange is easy to compute.

The midrange gives the midpoint.

The midrange is affected by extremely high or low values in a data set.
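
A short sketch of the midrange and of its sensitivity to a single extreme value. The function name and the data are illustrative assumptions.

```python
def midrange(values):
    """MR = (lowest value + highest value) / 2."""
    return (min(values) + max(values)) / 2


# Hypothetical measurements.
readings = [2, 3, 6, 8, 4, 1]
print(midrange(readings))         # 4.5

# One extremely high value moves the midrange a long way.
print(midrange(readings + [50]))  # 25.5
```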

Distribution Shapes

In a positively skewed or right skewed distribution, the majority of the data values fall to the left of the mean and cluster at the lower end of the distribution.

Distribution Shapes (contd.)

In a symmetrical distribution, the data values are evenly distributed on both sides of the mean.

Distribution Shapes (contd.)

When the majority of the data values fall to the right of the mean and cluster at the upper end of the distribution, with the tail to the left, the distribution is said to be negatively skewed or left skewed.

The Range

The range is the highest value minus the lowest value in a data set.

The symbol R is used for the range.

Variance and Standard Deviation

The variance is the average of the squares of the distance each value is from the mean. The symbol for the population variance is σ².

The standard deviation is the square root of the variance. The symbol for the population standard deviation is σ.

Rounding rule: The final answer should be rounded to one more decimal place than the original data.
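
The population formulas described above can be sketched directly from their definitions. The function names and data are hypothetical; the standard library's `statistics.pvariance` and `statistics.pstdev` compute the same population quantities.

```python
from math import sqrt


def population_variance(values):
    """Average of the squared distances of each value from the mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def population_std(values):
    """Square root of the population variance."""
    return sqrt(population_variance(values))


# Hypothetical whole-number data, so results are reported to one decimal place.
data = [10, 60, 50, 30, 40, 20]
print(round(population_variance(data), 1))  # 291.7
print(round(population_std(data), 1))       # 17.1
```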

Coefficient of Variation

The coefficient of variation is the standard deviation divided by the mean. The result is expressed as a percentage.

The coefficient of variation is used to compare standard deviations when the units are different for the two variables being compared.
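
As a hedged illustration, the comparison of two variables measured in different units might look like this. The figures (cars sold versus commissions in dollars) are made up for the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation divided by the mean, expressed as a percentage."""
    return std_dev / mean * 100


# Hypothetical: cars sold (mean 87, standard deviation 5)
# versus commissions in dollars (mean 5225, standard deviation 773).
cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)
print(round(cv_sales, 1))        # 5.7
print(round(cv_commissions, 1))  # 14.8
```

The larger percentage indicates the more variable quantity, here the commissions, even though the two standard deviations cannot be compared directly.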

Variance and Standard Deviation

Variances and standard deviations can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. The information is useful in comparing two or more data sets to determine which is more variable.

The measures of variance and standard deviation are used to determine the consistency of a variable.

Variance and Standard Deviation (contd.)

The variance and standard deviation can be used to estimate the percentage of data values that fall within a specified interval in a distribution.

The variance and standard deviation are used quite often in inferential statistics.

Chebyshev's Theorem

The proportion of values from a data set that will fall within k standard deviations of the mean will be at least 1 − 1/k², where k is a number greater than 1.

This theorem applies to any distribution regardless of its shape.
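
The bound 1 − 1/k² is simple to evaluate. A minimal sketch, with a hypothetical function name:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2


print(chebyshev_lower_bound(2))            # 0.75, i.e. at least 75% of the values
print(round(chebyshev_lower_bound(3), 3))  # 0.889, i.e. at least about 88.9%
```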

Empirical Rule for Normal Distributions

Approximately 68% of the data values fall within one standard deviation of the mean.

Approximately 95% of the data values fall within two standard deviations of the mean.

Approximately 99.7% of the data values fall within three standard deviations of the mean.
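
To see what the rule implies for a concrete case, the sketch below turns a mean and standard deviation into the three intervals. The function name and the figures (mean 480, standard deviation 90) are assumptions for illustration, and the result is only meaningful for roughly bell-shaped data.

```python
def empirical_rule_intervals(mean, std_dev):
    """Return {k: (low, high, approximate share)} for k = 1, 2, 3 standard deviations."""
    shares = {1: 0.68, 2: 0.95, 3: 0.997}
    return {
        k: (mean - k * std_dev, mean + k * std_dev, share)
        for k, share in shares.items()
    }


# Hypothetical bell-shaped scores with mean 480 and standard deviation 90.
for k, (low, high, share) in empirical_rule_intervals(480, 90).items():
    print(f"within {k} standard deviation(s): {low} to {high}, about {share:.1%}")
```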

Standard Scores

A standard score or z score is used when direct comparison of raw scores is impossible.

A standard score or z score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation.
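
A minimal sketch of the z score calculation, using made-up exam results from two courses whose raw scores are not directly comparable:

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


# Hypothetical: 65 on a test with mean 50 and standard deviation 10,
# and 30 on a test with mean 25 and standard deviation 5.
print(z_score(65, 50, 10))  # 1.5
print(z_score(30, 25, 5))   # 1.0
```

The higher z score (1.5) marks the relatively better result, although the raw scores alone would not show this.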

Percentiles

Percentiles are position measures used in educational and health-related fields to indicate the position of an individual in a group.

Percentiles divide the data set into 100 equal parts.

The Pth percentile is a value where P% of the data values are less than or equal to the value.
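
Read literally, the last definition suggests a direct check: count the values at or below a given value. The sketch below does exactly that; the function name and data are hypothetical, and textbooks and software often use slightly different percentile formulas.

```python
def percent_at_or_below(data, value):
    """Percentage of data values that are less than or equal to `value`."""
    count = sum(1 for x in data if x <= value)
    return 100 * count / len(data)


# Hypothetical test scores.
scores = [2, 3, 5, 6, 8, 10, 12, 15, 18, 20]
print(percent_at_or_below(scores, 12))  # 70.0, so 12 sits at the 70th percentile here
```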

Quartiles and Deciles

Quartiles divide the distribution into four groups. The quartiles are denoted by Q1, Q2, and Q3. Note that Q1 is the same as the 25th percentile; Q2 is the same as the 50th percentile or the median; and Q3 corresponds to the 75th percentile.

Deciles divide the distribution into 10 groups. They are denoted by D1, D2, etc.
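
One common way to find quartiles by hand is to take the median of the whole data set (Q2) and then the medians of the lower and upper halves (Q1 and Q3). The sketch below assumes that convention; other conventions exist and can give slightly different values. The function name and data are illustrative.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (an assumption)."""
    ordered = sorted(data)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    lower = ordered[:half]   # values below the middle position
    upper = ordered[-half:]  # values above it (middle value excluded when n is odd)
    return median(lower), median(ordered), median(upper)


print(quartiles([5, 6, 12, 13, 15, 18, 22, 50]))  # (9.0, 14.0, 20.0)
```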

Outliers

An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.

Outliers can be the result of measurement or observational error.

When a distribution is normal or bell-shaped, data values that are beyond three standard deviations of the mean can be considered suspected outliers.
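
The three-standard-deviation check can be sketched as follows. It assumes roughly bell-shaped data and treats the data set as a population (hence `pstdev`); the function name and data are hypothetical.

```python
from statistics import mean, pstdev


def suspected_outliers(data, limit=3):
    """Values lying more than `limit` standard deviations from the mean."""
    mu = mean(data)
    sigma = pstdev(data)
    return [x for x in data if abs(x - mu) > limit * sigma]


# Hypothetical measurements clustered near 50, with one extreme reading.
readings = [48, 50, 52, 49, 51, 50, 47, 53, 50, 49,
            51, 50, 48, 52, 50, 49, 51, 50, 50, 95]
print(suspected_outliers(readings))  # [95]
```

A flagged value is only a suspect; whether it is an error or a genuine observation still has to be judged from the context.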

Exploratory Data Analysis

The purpose of exploratory data analysis is to examine data in order to find out what information can be discovered. For example:

Are there any gaps in the data?

Can any patterns be discerned?

Stem-and-Leaf Plots

A stem-and-leaf plot is a data plot that uses part of a data value as the stem and part of the data value as the leaf to form groups or classes.

It has the advantage over a grouped frequency distribution of retaining the actual data while showing them in graphic form.
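
A small sketch of how such a plot can be built. It assumes non-negative whole numbers, with the tens part as the stem and the units digit as the leaf; the function names and data are illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by stem (tens) with their leaves (units digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


def render(plot):
    """Format the plot as text; empty stems are kept so gaps stay visible."""
    lines = []
    for stem in range(min(plot), max(plot) + 1):
        leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return "\n".join(lines)


data = [12, 15, 21, 23, 23, 38, 40, 47]  # hypothetical
print(render(stem_and_leaf(data)))
# 1 | 2 5
# 2 | 1 3 3
# 3 | 8
# 4 | 0 7
```

Every original value can be read back from the plot (for example, stem 2 with leaf 3 is 23), which is the advantage over a grouped frequency distribution.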

Boxplots and Five-Number Summaries

Boxplots are graphical representations of a five-number summary of a data set. The five specific values that make up a five-number summary are:

The lowest value of the data set (minimum)

Q1 (or 25th percentile)

The median (or 50th percentile)

Q3 (or 75th percentile)

The highest value of the data set (maximum)
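
The five values listed above can be collected with a short function. This sketch assumes Q1 and Q3 are taken as the medians of the lower and upper halves of the ordered data, which is one common convention; the function name and data are hypothetical.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for at least two values."""
    ordered = sorted(data)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    q1 = median(ordered[:half])   # median of the lower half
    q3 = median(ordered[-half:])  # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]


# Hypothetical data.
counts = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
print(five_number_summary(counts))  # (30, 47, 83.5, 164, 296)
```

In a boxplot, the box runs from Q1 to Q3 with a line at the median, and the whiskers extend to the minimum and maximum.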

Summary

Some basic ways to summarize data include measures of central tendency, measures of variation or dispersion, and measures of position.

The three most commonly used measures of central tendency are the mean, median, and mode. The midrange is also used to represent an average.

Summary (contd.)

The three most commonly used measurements of variation are the range, variance, and standard deviation.

The most common measures of position are percentiles, quartiles, and deciles.

Data values are distributed according to Chebyshev's theorem and, in special cases, the empirical rule.

Summary (contd.)

The coefficient of variation is used to describe the standard deviation in relationship to the mean.

These methods are commonly called traditional statistics.

Other methods, such as the stem-and-leaf plot, the boxplot, and the five-number summary, are part of exploratory data analysis; they are used to examine data to see what they reveal.

Conclusions

By combining these techniques, the student is now able to collect, organize, summarize, and present data.
