
STATISTICS

Statistics is the study of collecting, organizing, and analyzing data.


Those taking both levels of the test must know some statistics. Those
taking Level 1 should study the first part of this section: Measures of
Location. Those taking Level 2 should also study the next section:
Measures of Variability.
Measures of Location
Measures of location are used to represent the central value of the data.
There are three common measures of central location: mean, median,
and mode. The one that is typically the most useful is the arithmetic
mean, which is computed by adding up all of the individual data values
and dividing by the number of values.
The mean of a list of n numbers is equal
to the sum of the numbers divided by n.
Example 1
A researcher wishes to determine the average (arithmetic mean) amount
of time a particular prescription drug remains in the bloodstream of
users. She examines five people who have taken the drug and determines
the amount of time the drug has remained in each of their bloodstreams.
In hours, these times are: 24.3, 24.6, 23.8, 24.0, and 24.3. What is the
mean number of hours that the drug remains in the bloodstream of these
experimental participants?
Solution To find the mean, we begin by adding up all of the measured
values. In this case,
24.3 + 24.6 + 23.8 + 24.0 + 24.3 = 121.
We then divide by the number of participants (five) and obtain
121/5 = 24.2
as the mean.
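
The same calculation can be sketched in a few lines of Python. This is only an illustration; the function name mean and the variable names are chosen here and are not part of the original example.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty list is undefined")
    return sum(values) / len(values)

# Times (in hours) from Example 1.
hours = [24.3, 24.6, 23.8, 24.0, 24.3]
example_1_mean = mean(hours)
print(round(example_1_mean, 2))  # 24.2
```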

Example 2
Suppose the participant with the 23.8-hour measurement had actually
been measured incorrectly, and a measurement of 11.8 hours obtained
instead. What would the mean number of hours have been?
In this case, the sum of the data values is
24.3 + 24.6 + 11.8 + 24.0 + 24.3 = 109,
and the mean becomes

109/5 = 21.8.

This example shows that a single incorrect measurement can throw the
mean off considerably. Similarly, one measurement that is
unusually large or unusually small can have a great impact upon the mean.
A measure of location that is not impacted as much by extreme values is
called the median. The median of a group of numbers is simply the value
in the middle when the data values are arranged in numerical order. In
the event that there is an even number of data values, we find the median
by computing the number halfway between the two values in the middle
(that is, we find the mean of the two middle values).
The median of a list of numbers is the number in the middle
when the numbers are ordered from least to greatest or from
greatest to least. When there is an even number of values, the
median is equal to the mean of the two middle numbers.

This numerical measure is sometimes used in place of the mean
when we wish to minimize the impact of extreme values.

Example 3
What is the median value of the data from Example 1?
What is the median value of the modified data from Example 2?
Solution
Arrange both sets of data in increasing order:
23.8, 24.0, 24.3, 24.3, 24.6;
11.8, 24.0, 24.3, 24.3, 24.6;
In both cases, the median is 24.3. Clearly, the median was not impacted
by the one unusually small observation.
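
As a quick check, the comparison can be sketched in Python using the standard library's statistics module. The variable names original and modified are chosen here for illustration.

```python
from statistics import mean, median

original = [24.3, 24.6, 23.8, 24.0, 24.3]  # data from Example 1
modified = [24.3, 24.6, 11.8, 24.0, 24.3]  # data from Example 2 (one faulty value)

for label, data in (("original", original), ("modified", modified)):
    # sorted() shows the ordering used to locate the middle value.
    print(label, sorted(data),
          "mean:", round(mean(data), 2),
          "median:", median(data))

# With an even number of values, median() averages the two middle values.
even_case = median([1, 2, 3, 4])
```

The mean drops from 24.2 to 21.8 when the faulty value is introduced, while the median stays at 24.3.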
Another measure of location is called the mode. The mode is simply the
most frequently occurring value in a series of data.
A mode of a list of numbers is a number that occurs most
often in the list.
For example, 7 is the mode of 2, 7, 5, 8, 7, and 12.
The list 2, 4, 2, 8, 2, 4, 7, 4, 9, and 11 has two modes, 2 and 4.
In Examples 1 and 2, the mode is 24.3. The mode is the measure to use
when we wish to know which outcome has occurred most often.
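
One way to find every mode of a list, including the case of two modes, is to count the occurrences of each value. The sketch below is a minimal illustration; the function name modes is chosen here.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs most often, in increasing order."""
    if not values:
        raise ValueError("an empty list has no mode")
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 7, 5, 8, 7, 12]))               # [7]
print(modes([2, 4, 2, 8, 2, 4, 7, 4, 9, 11]))   # [2, 4]
print(modes([24.3, 24.6, 23.8, 24.0, 24.3]))    # [24.3]
```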

Measures of Variability
Measures of location provide only information about the middle value.
They tell us nothing, however, about the spread or the variability of the
data. Yet sometimes knowing the variability of a set of data is very
important. To see why, examine the example below.
Example 3
Consider an individual who has the choice of getting to work using
either public transportation or her own car. Suppose that over the period
of several weeks, the individual used both modes of transportation and
recorded the amount of travel time associated with these two different
ways of getting to work.
Travel time using a car:
28, 28, 29, 29, 30, 30, 31, 31, 32, 32 (min)
Travel time using public transportation:
24, 25, 26, 27, 28, 29, 30, 33, 36, 42 (min)
It turns out that both methods of transportation average 30 minutes.
Even though the average travel time is the same (30 minutes), do the
alternatives possess the same degree of reliability?
Solution
At first glance, it might appear that both alternatives offer the
same service. However, for most people, the variability exhibited for
public transportation would be of concern. To protect against arriving
late, one would have to allow for 42 minutes of travel time using public
transportation, but with a car one would only have to allow a maximum
of 32 minutes. Also of concern are the wide extremes that must be
expected when using public transportation.

Thus, we can see that when we look at a set of data, we may wish to
consider not only the average value of the data but also the variability of the
data. The easiest way to measure the variability of the data is to
determine the difference between the greatest and the least values. This
is called the range.
The range of the data in Example 1 is 24.6 - 23.8 = 0.8.
The range of the data in Example 2 is 24.6 - 11.8 = 12.8.
Note how the one faulty measurement in Example 2 has changed the
range. For this reason, it is usually desirable to use another more reliable
measure of variability, called the standard deviation.
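
To make the comparison concrete, the sketch below computes the range and the standard deviation for the two sets of travel times given earlier. It assumes the population form of the standard deviation (the square root of the average squared distance from the mean), which is what Python's statistics.pstdev computes; the text itself does not say which form is intended.

```python
from statistics import mean, pstdev

# Travel times in minutes from the commuting example.
car = [28, 28, 29, 29, 30, 30, 31, 31, 32, 32]
public = [24, 25, 26, 27, 28, 29, 30, 33, 36, 42]

def data_range(values):
    """Difference between the greatest and the least values."""
    return max(values) - min(values)

for label, times in (("car", car), ("public transportation", public)):
    print(label,
          "mean:", mean(times),
          "range:", data_range(times),
          "standard deviation:", round(pstdev(times), 2))
```

Both lists have a mean of 30, but the car times have a range of 4 and a standard deviation of about 1.41, while the public transportation times have a range of 18 and a standard deviation of about 5.29.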
The standard deviation is an extremely important measure of variability;
however, it is rather complicated to compute. To understand the meaning
of the standard deviation, suppose you have a set of data
that has a mean of 120 and a standard deviation of 10. As long as these
data are approximately normally distributed (as many real-world data sets are), we can
conclude that approximately 68 percent of the data values lie within one
standard deviation of the mean. This means, in this case, that about 68 percent
of the data values lie between
120 - 10 = 110 and 120 + 10 = 130.
Similarly, about 95 percent of the data values will lie within two
standard deviations from the mean; that is, in this case, between
120 - 2 × 10 = 100 and 120 + 2 × 10 = 140.
Finally, about 99.7 percent (which is to say, virtually all) of the data will
lie within three standard deviations from the mean. In this case, this
means that almost all of the data values will fall between
120 - 3 × 10 = 90 and 120 + 3 × 10 = 150.
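
These three percentages can be checked against an exactly normal distribution. The sketch below uses NormalDist from Python's statistics module (Python 3.8 or later); the helper name share_within is chosen here, and real data will match these figures only approximately.

```python
from statistics import NormalDist

# A normal distribution with mean 120 and standard deviation 10, as in the text.
dist = NormalDist(mu=120, sigma=10)

def share_within(k):
    """Return (low, high, fraction of values within k standard deviations of the mean)."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return low, high, dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    low, high, share = share_within(k)
    print(f"within {k} standard deviation(s): {low:.0f} to {high:.0f}, about {share:.1%}")
```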
....

Problems
Problem 1
During the twelve months of 1998, an executive charged 4, 1, 5, 6, 3, 5,
1, 0, 5, 6, 4, and 3 business luncheons at the Wardlaw Club. What was
the mean monthly number of luncheons charged by the executive?
Solution
The mean number of luncheons charged was
(4 + 1 + 5 + 6 + 3 + 5 + 1 + 0 + 5 + 6 + 4 + 3)/12 = 43/12 ≈ 3.58

Problem 2
Brian got grades of 92, 89, and 86 on his first three math tests. What
grade must he get on his final test to have an overall average of 90?
Solution
Let G = the grade on the final test. Then,
(92 + 89 + 86 + G)/4 = 90

Multiply by 4.
92 + 89 + 86 + G = 360
267 + G = 360
G = 93.
Brian must get a 93 on the final test.
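
The same reasoning, (mean)(number) = total, can be written as a small Python function. This is a sketch only; the name required_score is chosen here for illustration.

```python
def required_score(previous_scores, target_mean):
    """Score needed on one more test so that the overall mean equals target_mean."""
    total_tests = len(previous_scores) + 1
    # The required total is the target mean times the number of tests.
    return target_mean * total_tests - sum(previous_scores)

needed = required_score([92, 89, 86], 90)
print(needed)  # 93
```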

Problem 3
In order to determine the expected mileage for a particular car, an
automobile manufacturer conducts a factory test on five of these cars.
The results, in miles per gallon, are 25.3, 23.6, 24.8, 23.0, and 24.3.
What is the mean mileage? What is the median mileage?
Solution
The mean mileage is
(25.3 + 23.6 + 24.8 + 23.0 + 24.3)/5 = 121/5 = 24.2
miles per gallon. Arranged in order, the values are 23.0, 23.6, 24.3, 24.8,
and 25.3, so the median mileage is 24.3 miles per gallon.
Problem 4
In problem 3 above, suppose the car with the 23.6 miles per gallon had a
faulty fuel injection system and obtained a mileage of 12.8 miles per
gallon instead. What would have been the mean mileage? What would
have been the median mileage?
Solution
The mean mileage would have been
(25.3 + 12.8 + 24.8 + 23.0 + 24.3)/5 = 110.2/5 = 22.04

miles per gallon. The median mileage would have been 24.3 miles per
gallon, which is the same as it was in problem 3.
Problem 5
In a recent survey, fifteen people were asked for their favorite
automobile color. The results were: red, blue, white, white, black, red,
red, blue, gray, blue, black, green, white, black, and red. What was the
modal choice?
Solution
The modal choice is red, which was chosen by four people.
Problem 6
An elevator is designed to carry a maximum weight of 3,000 pounds. Is
it overloaded if it carries 17 passengers with a mean weight of 140
pounds?
Solution
Since the mean is the total of the data divided by the number of pieces of
data, that is, mean = total/number, we have (mean)(number) = total. Thus, the
weight of the people on the elevator totals (17)(140) = 2,380 pounds. It is
therefore not overloaded.
Problem 7
The annual incomes of five families living on Larchmont Road are
$32,000, $35,000, $37,500, $39,000, and $320,000. What is the range of
the annual incomes?
Solution
The range is $320,000 - $32,000 = $288,000. It can be seen that the
range is not a particularly good measure of variability, since four of the
five values are within $7,000 of each other.
Problem 8
The average length of time required to complete a jury questionnaire is
40 minutes, with a standard deviation of 5 minutes. What is the
probability that it will take a prospective juror between 35 and 45
minutes to complete the questionnaire?
Solution
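About 68%, assuming the completion times are normally distributed: 35 and
45 minutes lie one standard deviation below and above the mean of 40.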

Problem 9
Using the information in problem 8, what is the probability that it will
take a prospective juror between 30 and 50 minutes to complete the
questionnaire?
Solution
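About 95%, assuming a normal distribution: 30 and 50 minutes lie two
standard deviations below and above the mean of 40.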
Problem 10
The scores on a standardized admissions test are normally distributed
with a mean of 500 and a standard deviation of 100. What is the
probability that a randomly selected student will score between 400 and
600 on the test?
Solution
About 68%, since 400 and 600 lie one standard deviation below and above
the mean of 500.
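
The answers to problems 8 through 10 can be checked numerically. The sketch below assumes the quantities are normally distributed (problem 10 states this; problems 8 and 9 do not, so it is an assumption there) and uses NormalDist from Python's statistics module. The function name probability_between is chosen here.

```python
from statistics import NormalDist

def probability_between(mean, std_dev, low, high):
    """Probability that a normally distributed value falls between low and high."""
    dist = NormalDist(mu=mean, sigma=std_dev)
    return dist.cdf(high) - dist.cdf(low)

p8 = probability_between(40, 5, 35, 45)        # problem 8
p9 = probability_between(40, 5, 30, 50)        # problem 9
p10 = probability_between(500, 100, 400, 600)  # problem 10

for name, p in (("problem 8", p8), ("problem 9", p9), ("problem 10", p10)):
    print(name, f"{p:.1%}")
```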

Calculating Quartiles
1. Arrange the data in order from least to greatest. The median of
the data is the second quartile, Q2.
2. Now consider the lower half of the data. The median of these
data is the first (lower) quartile, Q1.*
3. Next, consider the upper half of the data. The median of these
data is the third (upper) quartile, Q3.*
4. Finally, the interquartile range (IQR) is equal to Q3 - Q1.
*Note: If the number of data points is odd, exclude Q2, the median of
the entire data set, before separating it into halves to calculate Q1 or Q3.
Example 4 Consider the list 1, 2, 4, 5, 5, 5, 5, 7, and 9. Determine the
mean, the mode, the median, and the quartiles.
Solution
The mean is 43/9 ≈ 4.78. The mode is 5. The median is 5.
Q2 is the median, or 5. Q1 = (2 + 4)/2 = 3. Q3 = (5 + 7)/2 = 6.
The IQR = Q3 - Q1 = 3.
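
The quartile procedure described above (median of each half, excluding the overall median when the number of data points is odd) can be sketched in Python as follows. The function name quartiles is chosen here; note that library routines such as statistics.quantiles use other conventions and may give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    half = n // 2
    lower = data[:half]
    # For an odd count, skip the middle value (Q2) when forming the upper half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

q1, q2, q3 = quartiles([1, 2, 4, 5, 5, 5, 5, 7, 9])
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

For the list in Example 4 this gives Q1 = 3, Q2 = 5, Q3 = 6, and an interquartile range of 3.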

Correlation
Very often, researchers need to determine whether any relationship
exists between two variables that they are measuring. For example, they
may wish to determine whether an increase in one variable implies that a
second variable is likely to have increased as well, or whether an
increase in one variable implies that another variable is likely to have
decreased.
The correlation coefficient is a number that can be used to measure the
degree of the relationship between two variables. The value of a
correlation coefficient can range between -1 and +1. A correlation of +1
indicates a perfect positive correlation; the two variables under
consideration increase and decrease together. A correlation of -1 is a
perfect negative correlation; when one variable increases, the other
decreases, and vice versa. If the correlation is 0, there is no linear
relationship between the behavior of the variables.
Consider a correlation coefficient that is a positive fraction. Such a
correlation represents a positive relationship; as one variable increases,
the other will tend to increase. The closer that correlation coefficient is
to +1, the stronger the relationship will be. Now, consider a correlation
coefficient that is a negative fraction. Such a correlation represents a
negative relationship; as one variable increases, the other will tend to
decrease. The closer the correlation coefficient is to -1, the stronger this
inverse relationship will be.
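
The text does not say how the coefficient is computed. Assuming the usual Pearson correlation coefficient is meant, a minimal Python sketch looks like this. The function name pearson_r and the small data lists are made up here purely for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 values each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of the deviations from each mean.
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return co_dev / (dev_x * dev_y)

# Hypothetical data, not taken from the text.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # increases with xs
falling = [10, 8, 6, 4, 2]    # decreases as xs increases

print(round(pearson_r(xs, rising), 4))   # 1.0
print(round(pearson_r(xs, falling), 4))  # -1.0
```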

As an example, consider the relationship between height and weight in
human beings. Since weight tends to increase as height increases, you
might expect that the correlation coefficient for the variables of height
and weight would be near +1. On the other hand, consider the
relationship between maximum pulse rate and age. In general, maximum
pulse rate decreases with age, so you might expect that the correlation
coefficient for these two variables would be near -1.
The following graphs are three scatterplots depicting the relationships
between two variables. In the first, the plotted points almost lie on a
straight line going up to the right. This is indicative of a strong positive
correlation (a correlation near +1). The second scatterplot depicts a
strong negative correlation (a correlation near -1), and the final
scatterplot depicts two variables that are unrelated and probably have a
correlation that is close to 0.

One common mistake in the interpretation of correlation coefficients that
you should avoid is the assumption that a high coefficient indicates a
cause-and-effect relationship. This is not always the case. An example
that is frequently given in statistics classes is the fact that there is a high
correlation between gum chewing and crime in the United States. That is
to say, as the number of gum chewers went up, there was a similar
increase in the number of crimes committed. Obviously, this does not
mean that there is any cause and effect between chewing gum and
committing a crime. The fact is, simply, that as the population of the
United States increased, both gum chewing and crime increased.
