
Ways of describing the data

1. Measure of central tendency


2. Measure of dispersion
3. Measure of shape
4. Outliers
5. Changes over time
Central tendency: Locate the “middle” of the data
set
Measure of Central Tendency:
Mode

The mode is the score with the highest frequency of occurrences.

It is the easiest score to spot in a distribution.

It is the only way to express the central tendency of a nominal


level variable.
Measure of Central Tendency:
Mode
• Most frequently occurring value
[Figure: frequency (f) bar charts of exam grades (A, B, C, D, F), political affiliation (Rep, Dem, Ind), and # of presentations (5 to 21)]
Measure of Central Tendency:
Mode
Mode
• The data entry that occurs with the greatest frequency.
• If no entry is repeated the data set has no mode.
• If two entries occur with the same greatest frequency,
each entry is a mode (bimodal).

a) 5.40 1.10 0.42 0.73 0.48 1.10 → Mode is 1.10

b) 27 27 27 55 55 55 88 88 99 → Bimodal: 27 and 55

c) 1 2 3 6 7 8 9 10 → No mode
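The three cases above can be checked with a short Python sketch. It relies on `statistics.multimode` from the standard library (Python 3.8+). The helper name `modes` and the choice to return an empty list when no entry is repeated are assumptions made here to mirror the rule stated on the slide.

```python
from collections import Counter
from statistics import multimode


def modes(data):
    """Return the list of modes; an empty list if no entry is repeated."""
    counts = Counter(data)
    if max(counts.values()) == 1:
        return []            # no entry is repeated: the data set has no mode
    return multimode(data)   # every value tied for the greatest frequency


a = [5.40, 1.10, 0.42, 0.73, 0.48, 1.10]
b = [27, 27, 27, 55, 55, 55, 88, 88, 99]
c = [1, 2, 3, 6, 7, 8, 9, 10]

print(modes(a))  # one mode
print(modes(b))  # two modes (bimodal)
print(modes(c))  # no mode
```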
The mode
• Calculation of the mode from a frequency
distribution
The observation with the largest frequency is
the mode
Example
A group of 13 real estate agents were asked how many houses they
had sold in the past month. Find the mode.
Number of houses sold F
0 2
1 5
2 6
Total 13

The observation with the largest frequency (6) is 2. Hence the mode of
these data is 2.
The mode
• Calculation of the mode from a grouped frequency
distribution
– It is not possible to calculate the exact value
of the mode of the original data from a
grouped frequency distribution
– The class interval with the largest frequency
is called the modal class

$Mo = L + \frac{d_1}{d_1 + d_2} \times i$
Where
L = the real lower limit of the modal class
d1 = the frequency of the modal class minus the
frequency of the previous class
d2 = the frequency of the modal class minus the
frequency of the next class above the modal class
i = the length of the class interval of the modal class
The median
The median is the middle observation in a set

50% of the data have a value less than the median, and 50% of the data
have a value greater than the median.

Calculation of the median from raw data


Let n = the number of observations

If n is odd, the median $\tilde{x}$ is the $\frac{n+1}{2}$th observation.
If n is even, the median is the mean of the $\frac{n}{2}$th observation and the $\left(\frac{n}{2} + 1\right)$th observation.
The median
• Calculation of the median from a frequency distribution
– This involves constructing an extra column
(cf) in which the frequencies are cumulated
Number of pieces   Frequency f   Cumulative frequency cf
1                  10            10
2                  12            22
3                  16            38

Σf = 38

– Since n = 38 is even, the median is the average of the 19th and 20th observations
– From the cf column, both of these observations are 2, so the median is 2
The median
• Calculation of the median from a grouped frequency
distribution
– It is possible to make an estimate of the
median
– The class interval that contains the median is
called the median class

$\tilde{x} = L + \frac{\frac{n}{2} - C}{f} \times i$
Where
$\tilde{x}$ = the median
L = the real lower limit of the median class
n = Σf = the total number of observations in the set
C = the cumulative frequency in the class
immediately before the median class
f = the frequency of the median class
i = the length of the real class interval of the
median class
Measure of Central Tendency:
Median

The median is the middle-ranked score (50th percentile).

If there is an even number of scores, it is the arithmetic


average of the two middle scores.

The median is unchanged by outliers. Even if Bill Gates


were deleted from the U.S. economy, the median asset of
U.S. citizens would remain (more or less) the same.
• Median
Midpoint of a data set
– values ½ smaller, ½ larger
[Figure: two number lines with ticks from 10 to 90]
Median
Median: The value of the data that occupies the middle position when
the data are ranked in order according to size

Notes:
• Denoted by “x tilde”: $\tilde{x}$
• The population median, M (uppercase mu, Greek alphabet), is the data value in the middle position of the entire population

To find the median:


1. Rank the data
2. Determine the depth of the median: $d(\tilde{x}) = \frac{n + 1}{2}$
3. Determine the value of the median
Example
• Example: Find the median for the set of data:
{4, 8, 3, 8, 2, 9, 2, 11, 3}
Solution:
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11
2. Find the depth: $d(\tilde{x}) = (9 + 1)/2 = 5$
3. The median is the fifth number from either end in the ranked data: $\tilde{x} = 4$
Suppose the data set is {4, 8, 3, 8, 2, 9, 2, 11, 3, 15}:
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11, 15
2. Find the depth: $d(\tilde{x}) = (10 + 1)/2 = 5.5$
3. The median is halfway between the fifth and sixth observations: $\tilde{x} = (4 + 8)/2 = 6$
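The depth procedure can be sketched in Python as follows. The function name `median_by_depth` is an illustrative choice, not part of the slides; the standard library's `statistics.median` is used only as a cross-check.

```python
from statistics import median


def median_by_depth(data):
    """Median via the depth d = (n + 1) / 2 applied to the ranked data."""
    ranked = sorted(data)                  # step 1: rank the data
    depth = (len(ranked) + 1) / 2          # step 2: depth of the median
    if depth.is_integer():
        return ranked[int(depth) - 1]      # odd n: the value at that depth
    lower = ranked[int(depth) - 1]         # value just below the depth
    upper = ranked[int(depth)]             # value just above the depth
    return (lower + upper) / 2             # even n: halfway between them


odd_set = [4, 8, 3, 8, 2, 9, 2, 11, 3]
even_set = odd_set + [15]

print(median_by_depth(odd_set), median(odd_set))
print(median_by_depth(even_set), median(even_set))
```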
Determine the mode.
The Mean

The mean is the arithmetic average of the scores.

Mean = (Sum of all observations) / (Number of observations)

$\bar{X} = \frac{\sum_i X_i}{N}$
Statistical Notation
• Formula for mean: $\bar{X} = \frac{\sum X}{N}$
• Σ: summation
– add all that follows
• X: observation
– value of an observation
• N: number of observations
– Or data points
Example
• Example: The following data represents the number of accidents in each of the last 8 years at a dangerous intersection. Find the mean number of accidents: 8, 9, 3, 5, 2, 6, 4, 5.

Solution: $\bar{x} = \frac{1}{8}(8 + 9 + 3 + 5 + 2 + 6 + 4 + 5) = 5.25$

• In the data above, change 6 to 26:

Solution: $\bar{x} = \frac{1}{8}(8 + 9 + 3 + 5 + 2 + 26 + 4 + 5) = 7.75$

Note: The mean can be greatly influenced by outliers
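A quick way to see the effect of the outlier is to compute the mean and the median side by side. This sketch uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

accidents = [8, 9, 3, 5, 2, 6, 4, 5]
with_outlier = [26 if x == 6 else x for x in accidents]  # change 6 to 26

# The mean moves with the outlier; the median of these data does not.
print("original:    ", mean(accidents), median(accidents))
print("with outlier:", mean(with_outlier), median(with_outlier))
```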


The Mean

The mean is the arithmetic average of the scores.

The mean is the center of gravity of a distribution. Deleting


Bill Gates’ assets would change the national mean income.

$\bar{X} = \frac{\sum_i X_i}{N}$
The mean
• Calculation of the mean from a frequency distribution
– It is useful to be able to calculate a mean
directly from a frequency table

– The calculation of the mean is found from


the formula:

$\bar{x} = \frac{\sum fx}{\sum f}$
where
Σf = the sum of the frequencies
Σfx = the sum of each observation multiplied by its
frequency
The mean
• Calculation of the mean from a grouped frequency
distribution
– The mean can only be estimated from a grouped frequency distribution
– Assume that the observations are spread evenly throughout each class
interval

$\bar{x} = \frac{\sum fm}{\sum f}$
where:
m = the midpoint of a class interval
Σfm = the sum of the products of each class interval’s midpoint and that class interval’s frequency
Σf = the sum of the frequencies
1. Determine the mean, median, mode
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62,
64, 65, 65, 67, 68, 68, 70

2. Make a grouped frequency distribution out of the data using a class width of 5. Determine the mean, median, and mode of the grouped frequency distribution.
The mean
• Weighted means
– Weighted arithmetic mean or weighted mean
is calculated by assigning weights (or
measures of relative importance) to the
observations to be averaged
– If observation x is assigned weight w, the
formula for the weighted mean is:

$\bar{x} = \frac{\sum xw}{\sum w}$
– The weights are usually expressed as
percentages or fractions
• A student's grade in a Psychology course is
comprised of tests (40%), quizzes (20%),
papers (20%), and a final project (20%). His
scores for each of the categories are 85 (tests),
100 (quizzes), 92 (papers) and 84 (final
project). Calculate his overall grade.
• In a History course, a student's grade is
composed of papers (40%), tests (40%) and a
final exam (20%). The student has earned a 90
value on all papers and 80 value on all tests.
What is the minimum score the student needs to
earn on the final exam to achieve an overall
grade of 87.0?
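Both grade problems are direct applications of the weighted mean formula. The sketch below is one possible way to work them in Python; `weighted_mean` and the variable names are illustrative, and the weights are entered as percentages.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of x * w divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


# Psychology course: tests, quizzes, papers, final project
psych_grade = weighted_mean([85, 100, 92, 84], [40, 20, 20, 20])

# History course: solve (40*90 + 40*80 + 20*final) / 100 = 87 for the final
target = 87.0
points_so_far = 40 * 90 + 40 * 80
needed_final = (target * 100 - points_so_far) / 20

print(psych_grade, needed_final)
```

Plugging `needed_final` back into `weighted_mean([90, 80, needed_final], [40, 40, 20])` is a simple way to confirm the target grade is reached.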
Central Tendency
• Describes most typical values
– Depends on level of measurement
• Mode (all levels)
– Most frequently occurring value
• Median (only ordinal & interval/ratio)
– value where ½ observations above & ½ below
• Mean (only interval/ratio)
– Arithmetic average
• Determine the mode, the median, the mean
Choosing the “Best Average”
• The shape of your data and the existence of any
outliers may help you choose the best average:
Shapes of Distributions
Symmetry in Data Sets
The analysis of a data set often depends on whether the
distribution is symmetric or non-symmetric.
Symmetric distribution: the pattern of frequencies from a
central point is the same (or nearly so) from the left and right.
Symmetry in Data Sets
Non-symmetric distribution: the patterns from a central
point from the left and right are different.
Skewed to the left: a tail extends out to the left.
Skewed to the right: a tail extends out to the right.
Measures of Dispersion
• Measures of central tendency alone cannot
completely characterize a set of data. Two very
different data sets may have similar measures of
central tendency.

• Measures of dispersion are used to describe the


spread, or variability, of a distribution

• Common measures of dispersion: range, variance,


and standard deviation
Measures of Variation (“Spread”)
Another important characteristic of
quantitative data is how much the data varies,
or is spread out.

The two most common methods of measuring spread are:
1. Range
2. Standard deviation and variance
Range
Range: The difference in value between the highest-valued (H) and the lowest-valued (L) pieces of data:
range = H − L

• Other measures of dispersion are based on the following quantity

Deviation from the Mean: A deviation from the mean, $x - \bar{x}$, is the difference between the value of x and the mean $\bar{x}$
Example: Finding the Range
The wait time to see a bank teller is studied at 2 banks.
Bank A has multiple lines, one for each teller.
Bank B has a single wait line for the 1st available teller.

5 wait times (in minutes) are sampled from each bank:
Bank A: 5.2 6.2 7.5 8.4 9.2
Bank B: 6.6 6.8 7.5 7.7 7.9
Solution: Finding the Range
• Bank A: Range = ?
• Bank B: Range = ?

• Note: The range is easy to compute, but


only uses 2 values. Do the following 2 sets
vary the same?

– Set A: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
– Set B: 1, 10, 10, 10, 10, 10, 10, 10, 10, 10
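The ranges asked for above can be computed with a one-line helper. This is a minimal sketch; `data_range` is an illustrative name, and the results are rounded to one decimal place only to hide floating-point noise.

```python
def data_range(data):
    """Range = highest value (H) minus lowest value (L)."""
    return max(data) - min(data)


bank_a = [5.2, 6.2, 7.5, 8.4, 9.2]
bank_b = [6.6, 6.8, 7.5, 7.7, 7.9]
set_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
set_b = [1, 10, 10, 10, 10, 10, 10, 10, 10, 10]

print(round(data_range(bank_a), 1), round(data_range(bank_b), 1))
# Sets A and B share the same range even though their values are
# distributed very differently, because only two values are used.
print(data_range(set_a), data_range(set_b))
```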
Example
• Example: Consider the sample {12, 23, 17, 15, 18}.
Find 1) the range and 2) each deviation from the mean.

Solutions:

1) range = H − L = 23 − 12 = 11
2) $\bar{x} = \frac{1}{5}(12 + 23 + 17 + 15 + 18) = 17$

Data x    Deviation from mean, x − x̄
_________________________
12        −5
23        6
17        0
15        −2
18        1
Mean Absolute Deviation
Note: $\sum (x - \bar{x}) = 0$ (Always!)

Mean Absolute Deviation: The mean of the absolute values of the deviations from the mean:

Mean absolute deviation $= \frac{1}{n} \sum |x - \bar{x}|$

For the previous example:

$\frac{1}{n} \sum |x - \bar{x}| = \frac{1}{5}(5 + 6 + 0 + 2 + 1) = \frac{14}{5} = 2.8$
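As a rough check of the deviations and the mean absolute deviation for the sample {12, 23, 17, 15, 18}, the following Python sketch may help. The function names are illustrative.

```python
from statistics import mean


def deviations(data):
    """Deviation of each value from the mean of the data."""
    m = mean(data)
    return [x - m for x in data]


def mean_absolute_deviation(data):
    """Mean of the absolute values of the deviations from the mean."""
    return mean([abs(d) for d in deviations(data)])


sample = [12, 23, 17, 15, 18]
print(deviations(sample))               # the deviations sum to zero
print(mean_absolute_deviation(sample))
```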
Standard Deviation and Variance
Measures the typical amount data deviates
from the mean.

Sample Variance, $s^2$:

• $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Sample Standard Deviation, s:

• $s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Interpreting Standard Deviation

• Standard deviation is a measure of the


typical amount an entry deviates from the
mean.
• The more the entries are spread out, the
greater the standard deviation.
Finding Sample Variance & Standard Deviation
1. Find the mean of the sample data set: $\bar{x} = \frac{\sum x}{n}$
2. Find the deviation of each entry: $x - \bar{x}$
3. Square each deviation: $(x - \bar{x})^2$
4. Add to get the sum of the deviations squared: $\sum (x - \bar{x})^2$
5. Divide by n – 1 to get the sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
6. Find the square root to get the sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
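The six steps translate almost line for line into code. The sketch below is one possible Python rendering, applied to the two sets of bank wait times introduced earlier; the function names are illustrative.

```python
from math import sqrt


def sample_variance(data):
    n = len(data)
    x_bar = sum(data) / n                    # step 1: mean
    devs = [x - x_bar for x in data]         # step 2: deviation of each entry
    squares = [d ** 2 for d in devs]         # step 3: square each deviation
    total = sum(squares)                     # step 4: sum of squared deviations
    return total / (n - 1)                   # step 5: divide by n - 1


def sample_std(data):
    return sqrt(sample_variance(data))       # step 6: square root


bank_a = [5.2, 6.2, 7.5, 8.4, 9.2]   # wait times in minutes
bank_b = [6.6, 6.8, 7.5, 7.7, 7.9]

for name, waits in (("Bank A", bank_a), ("Bank B", bank_b)):
    print(f"{name}: s^2 = {sample_variance(waits):.3f}, s = {sample_std(waits):.3f} min")
```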
Notes
• The shortcut formula for the sample variance:

$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$

• The unit of measure for the standard deviation is the same as the unit of measure for the data
Find the Standard Deviation and Variance for
Bank A (multi-line)
$\bar{x} = \frac{\sum x}{n} = \frac{36.5}{5} = 7.3$ min

Wait time, x (in min)   Deviation: x – x̄      Squares: (x – x̄)²
5.2                     5.2 – 7.3 = –2.1      (–2.1)² = 4.41
6.2                     6.2 – 7.3 =           ( )² =
7.5                     7.5 – 7.3 =           ( )² =
8.4                     8.4 – 7.3 =           ( )² =
9.2                     9.2 – 7.3 =           ( )² =
Σx = 36.5                                     Σ(x – x̄)² =

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} =$

$s = \sqrt{s^2} =$
• Round to one more decimal than the data.
• Don’t round until the end.
• Include the appropriate units.
Find the Standard Deviation and Variance for
Bank B (1 wait line)
$\bar{x} = \frac{\sum x}{n} = \frac{36.5}{5} = 7.3$ min

Wait time, x (in min)   Deviation: x – x̄      Squares: (x – x̄)²
6.6
6.8
7.5
7.7
7.9
Σx = 36.5                                     Σ(x – x̄)² =

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} =$

$s = \sqrt{s^2} =$
• Round to one more decimal than the data.
• Don’t round until the end.
• Include the appropriate units.
Example
• Example: Find the 1) variance and 2) standard deviation for the data {5, 7, 1, 3, 8}:
Solutions:
First: $\bar{x} = \frac{1}{5}(5 + 7 + 1 + 3 + 8) = 4.8$

x          x – x̄     (x – x̄)²
5          0.2       0.04
7          2.2       4.84
1          –3.8      14.44
3          –1.8      3.24
8          3.2       10.24
Sum: 24    0         32.80

1) $s^2 = \frac{1}{4}(32.8) = 8.2$
2) $s = \sqrt{8.2} = 2.86$
You grow 20 crystals from a solution and
measure the length of each crystal in
millimeters. Here is your data: 9, 2, 5, 4, 12,
7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4
Calculate the sample standard deviation of the
length of the crystals.
Sample versus Population
Standard Deviation
Note: Unlike $\bar{x}$ and µ, the formulas for s and σ are not mathematically the same:
Sample Standard Deviation

• $s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Population Standard Deviation

• $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
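Python's standard `statistics` module exposes both versions: `stdev` divides by n − 1 and `pstdev` divides by N. The sketch below applies them to the data {5, 7, 1, 3, 8} from the earlier example, treating the same five values first as a sample and then, purely for comparison, as a whole population.

```python
from statistics import pstdev, stdev

data = [5, 7, 1, 3, 8]

s = stdev(data)        # sample standard deviation: divides by n - 1
sigma = pstdev(data)   # population standard deviation: divides by N

# For the same values, dividing by N gives the smaller result.
print(f"s = {s:.4f}, sigma = {sigma:.4f}")
```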
Standard Deviation: Key Points

• s ≥ 0 (When would s = 0?)

• The standard deviation is a measure of variation of all values from the mean. The larger s is, the more the data varies.
• The units of the standard deviation s are the same as the units of the original data values. (The variance has squared units.)

• The value of the standard deviation s can increase dramatically with the inclusion of one or more outliers (data values far away from all others)
Using Technology

The gas mileage of 2 cars is sampled over


various conditions:

Car A: 21.1 21.2 20.8 19.8 23.8 (mpg)


Car B: 25.2 19.1 18.0 24.4 20.3 (mpg)

Which car do you think gets “better” mpg?

Use a calculator to find the mean and standard


deviation for each to justify your choice.
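In place of a calculator, a few lines of Python give the same summary statistics. This sketch uses `statistics.mean` and `statistics.stdev` (the sample standard deviation); which car counts as "better" is left to the reader, since it depends on whether a slightly higher mean or a more consistent mileage matters more.

```python
from statistics import mean, stdev

car_a = [21.1, 21.2, 20.8, 19.8, 23.8]   # mpg
car_b = [25.2, 19.1, 18.0, 24.4, 20.3]   # mpg

for name, mpg in (("Car A", car_a), ("Car B", car_b)):
    print(f"{name}: mean = {mean(mpg):.2f} mpg, s = {stdev(mpg):.2f} mpg")
```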
Standard Deviation and “Spread”

How does “s” show how much the data varies?


Three methods:
1. Range Rule of Thumb
2. Chebyshev’s Theorem
3. The Empirical Rule
The Range Rule of Thumb

Range Rule: For most data sets, the majority of the data lies
within 2 standard deviations of the mean.
Recall: Range = High – Low
Estimate: Range ≈ 4s

Alternatively, if the range is known, you can use the range rule to estimate the standard deviation:

$s \approx \frac{\text{Range}}{4}$
Using the Range Rule of Thumb

A sample of women’s heights has a mean of 64 inches and a


standard deviation of 2.5 inches.

Using the range rule, “most” women fall within what


heights?

What would be an “unusual” height?


Chebyshev’s Theorem
Chebyshev’s Theorem: The proportion of any distribution that lies within k standard deviations of the mean is at least $1 - \frac{1}{k^2}$, where k is any positive number larger than 1. This theorem applies to all distributions of data.
Illustration:
[Figure: distribution with at least $1 - \frac{1}{k^2}$ of the data lying between $\bar{x} - ks$ and $\bar{x} + ks$]
Chebyshev’s Theorem

• For k = 2, at least 3/4 (or 75%) of all values lie within 2 standard deviations of the mean

• For k = 3, at least 8/9 (or 89%) of all values lie within 3 standard deviations of the mean
Using Chebyshev’s Theorem

A sample of salaries at an elementary school has a mean of


$32,000 and a standard deviation of $3000.
Use Chebyshev’s Theorem to describe how the salaries are
spread out.
Would a salary of $28,000 be “unusual?”
Would a salary of $45,000 be “unusual”?
Example
• Example: At the close of trading, a random sample of 35 technology stocks was selected. The mean selling price was 67.75 and the standard deviation was 12.3. Use Chebyshev’s theorem (with k = 2, 3) to describe the distribution.
Solutions:
Using k = 2: At least 75% of the observations lie within 2 standard deviations of the mean:
$(\bar{x} - 2s, \bar{x} + 2s) = (67.75 - 2(12.3), 67.75 + 2(12.3)) = (43.15, 92.35)$

Using k = 3: At least 89% of the observations lie within 3 standard deviations of the mean:
$(\bar{x} - 3s, \bar{x} + 3s) = (67.75 - 3(12.3), 67.75 + 3(12.3)) = (30.85, 104.65)$
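The same calculation can be wrapped in a small helper. This is a sketch only; the function name `chebyshev` and its return layout (minimum proportion, lower bound, upper bound) are illustrative choices.

```python
def chebyshev(mean, s, k):
    """Return (minimum proportion, lower bound, upper bound) for k > 1."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    proportion = 1 - 1 / k ** 2
    return proportion, mean - k * s, mean + k * s


# Technology stock example: mean 67.75, standard deviation 12.3
for k in (2, 3):
    p, low, high = chebyshev(67.75, 12.3, k)
    print(f"k = {k}: at least {p:.0%} of prices lie in ({low:.2f}, {high:.2f})")
```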
Empirical Rule
Empirical Rule: If a variable is normally distributed, then:
1. Approximately 68% of the observations lie within 1 standard
deviation of the mean
2. Approximately 95% of the observations lie within 2 standard
deviations of the mean
3. Approximately 99.7% of the observations lie within 3 standard
deviations of the mean

Notes:
• The empirical rule is more informative than Chebyshev’s theorem since we know more about the distribution (normally distributed)
• Also applies to populations
• Can be used to check whether a distribution is approximately normally distributed
The Empirical Rule
Example
• Example: A random sample of plum tomatoes was selected from a local grocery store and their weights recorded. The mean weight was 6.5 ounces with a standard deviation of 0.4 ounces. If the weights are normally distributed:
1) What percentage of weights fall between 5.7 and 7.3?
2) What percentage of weights fall above 7.7?

Solutions:
1) $(\bar{x} - 2s, \bar{x} + 2s) = (6.5 - 2(0.4), 6.5 + 2(0.4)) = (5.7, 7.3)$
Approximately 95% of the weights fall between 5.7 and 7.3
2) $(\bar{x} - 3s, \bar{x} + 3s) = (6.5 - 3(0.4), 6.5 + 3(0.4)) = (5.3, 7.7)$
Approximately 99.7% of the weights fall between 5.3 and 7.7
Approximately 0.3% of the weights fall outside (5.3, 7.7)
Approximately (0.3/2)=0.15% of the weights fall above 7.7
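The empirical rule's percentages are rounded values of normal-curve areas. As an optional cross-check, the sketch below uses `statistics.NormalDist` (Python 3.8+) with the sample mean and standard deviation plugged in as if they were the true parameters, which is an assumption made only for illustration.

```python
from statistics import NormalDist

weights = NormalDist(mu=6.5, sigma=0.4)   # ounces

between = weights.cdf(7.3) - weights.cdf(5.7)   # within 2 standard deviations
above = 1 - weights.cdf(7.7)                    # more than 3 standard deviations above

print(f"between 5.7 and 7.3: {between:.2%}")
print(f"above 7.7: {above:.3%}")
```

The exact areas differ slightly from the rounded 95% and 0.15% figures, which is expected for a rule of thumb.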
A Note about the Empirical Rule
Note: The empirical rule may be used to determine whether or
not a set of data is approximately normally distributed

1. Find the mean and standard deviation for the data

2. Compute the actual proportion of data within 1, 2, and 3


standard deviations from the mean

3. Compare these actual proportions with those given by the


empirical rule

4. If the proportions found are reasonably close to those of the


empirical rule, then the data is approximately normally
distributed
Example: Using the Empirical Rule
A sample of IQs has a symmetric distribution with a mean of 100 and a standard deviation of 15.
1. Sketch the distribution.
2. 68% of people have an IQ between what 2 values?
3. What percent of people have an IQ between 70 and 130?
4. What percent of people have an IQ between 100 and 115?
5. What percent of people have an IQ above 145?
Mean & Standard
Deviation of Frequency Distribution
• If the data is given in the form of a frequency
distribution, we need to make a few changes to
the formulas for the mean, variance, and standard
deviation

• Complete the extension table in order to find


these summary statistics
To Calculate
• In order to calculate the mean, variance, and standard
deviation for data:
1. In an ungrouped frequency distribution, use the
frequency of occurrence, f, of each observation

2. In a grouped frequency distribution, we use the frequency of occurrence associated with each class midpoint:

$\bar{x} = \frac{\sum xf}{\sum f}$

$s^2 = \frac{\sum x^2 f - \frac{(\sum xf)^2}{\sum f}}{\sum f - 1}$
Example
• Example: A survey of students in the first grade at a local school asked for the number of brothers and/or sisters for each child. The results are summarized in the table below. Find 1) the mean, 2) the variance, and 3) the standard deviation:
Solutions:
First:
x       f      xf     x²f
0       15     0      0
1       17     17     17
2       23     46     92
4       5      20     80
5       2      10     50
Sum:    62     93     239

1) $\bar{x} = 93/62 = 1.5$
2) $s^2 = \frac{239 - \frac{(93)^2}{62}}{62 - 1} = 1.63$
3) $s = \sqrt{1.63} = 1.28$
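The extension table can be reproduced in a few lines of Python. This sketch stores the frequency distribution as a dictionary mapping each value x to its frequency f; the variable names are illustrative.

```python
from math import sqrt

# Number of siblings (x) mapped to the number of children reporting it (f)
table = {0: 15, 1: 17, 2: 23, 4: 5, 5: 2}

n = sum(table.values())                               # sum of f
sum_xf = sum(x * f for x, f in table.items())         # sum of x*f
sum_x2f = sum(x ** 2 * f for x, f in table.items())   # sum of x^2*f

mean = sum_xf / n
variance = (sum_x2f - sum_xf ** 2 / n) / (n - 1)      # shortcut formula
std_dev = sqrt(variance)

print(n, sum_xf, sum_x2f)
print(f"mean = {mean:.2f}, s^2 = {variance:.2f}, s = {std_dev:.2f}")
```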
Problem
• Find the mean and the variance for this
grouped frequency distribution:
Class Boundaries f
2–6 7
6 – 10 15
10 – 14 22
14 – 18 14
18 – 22 2
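One way to approach this problem in code is sketched below. It estimates the mean and variance from the class midpoints, and also applies the grouped median and mode formulas given earlier in these slides. All function and variable names are illustrative, and every result is an estimate because the raw observations are not available.

```python
# (lower boundary, upper boundary, frequency) for each class
classes = [(2, 6, 7), (6, 10, 15), (10, 14, 22), (14, 18, 14), (18, 22, 2)]

freqs = [f for _, _, f in classes]
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
n = sum(freqs)

sum_fm = sum(f * m for f, m in zip(freqs, midpoints))
sum_fm2 = sum(f * m ** 2 for f, m in zip(freqs, midpoints))

mean_est = sum_fm / n
variance_est = (sum_fm2 - sum_fm ** 2 / n) / (n - 1)


def grouped_median(classes):
    """Median estimate: L + ((n/2 - C) / f) * i for the median class."""
    total = sum(f for _, _, f in classes)
    cumulative = 0                      # C: cumulative frequency before the class
    for lo, hi, f in classes:
        if cumulative + f >= total / 2:
            return lo + (total / 2 - cumulative) / f * (hi - lo)
        cumulative += f


def grouped_mode(classes):
    """Mode estimate: L + (d1 / (d1 + d2)) * i for the modal class.

    Assumes the modal class is strictly larger than at least one neighbour,
    otherwise d1 + d2 would be zero.
    """
    fs = [f for _, _, f in classes]
    j = fs.index(max(fs))               # index of the modal class
    lo, hi, f = classes[j]
    d1 = f - (fs[j - 1] if j > 0 else 0)
    d2 = f - (fs[j + 1] if j < len(fs) - 1 else 0)
    return lo + d1 / (d1 + d2) * (hi - lo)


print(f"mean = {mean_est:.2f}, variance = {variance_est:.2f}")
print(f"median = {grouped_median(classes):.2f}, mode = {grouped_mode(classes):.2f}")
```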
z-Score
z-Score: The position a particular value of x has relative to the mean, measured in standard deviations. The z-score is found by the formula:
$z = \frac{\text{value} - \text{mean}}{\text{st. dev.}} = \frac{x - \bar{x}}{s}$

Notes:
• Typically, the calculated value of z is rounded to the nearest hundredth
• The z-score measures the number of standard deviations above/below, or away from, the mean
• z-scores typically range from -3.00 to +3.00
• z-scores may be used to make comparisons of raw scores

Example
• Example: A certain data set has mean 35.6 and standard deviation 7.1. Find the z-scores for 46 and 33:

Solutions:

$z = \frac{x - \bar{x}}{s} = \frac{46 - 35.6}{7.1} = 1.46$
46 is 1.46 standard deviations above the mean

$z = \frac{x - \bar{x}}{s} = \frac{33 - 35.6}{7.1} = -0.37$
33 is 0.37 standard deviations below the mean.
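The two z-scores can be verified with a one-line function. This is a minimal sketch; `z_score` is an illustrative name, and results are rounded to the nearest hundredth as the notes suggest.

```python
def z_score(x, mean, s):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / s


z_46 = round(z_score(46, 35.6, 7.1), 2)
z_33 = round(z_score(33, 35.6, 7.1), 2)
print(z_46, z_33)
```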
Quartiles
Quartiles: Values of the variable that divide the ranked data into
quarters; each set of data has three quartiles
1. The first quartile, Q1, is a number such that at most 25% of
the data are smaller in value than Q1 and at most 75% are
larger
2. The second quartile, Q2, is the median
3. The third quartile, Q3, is a number such that at most 75%
of the data are smaller in value than Q3 and at most 25%
are larger
Ranked data, increasing order

25% 25% 25% 25%


L Q1 Q2 Q3 H
Percentiles
Percentiles: Values of the variable that divide a set of ranked
data into 100 equal subsets; each set of data has 99 percentiles.
The kth percentile, Pk, is a value such that at most k% of the data is smaller in value than Pk and at most (100 − k)% of the data is larger.
[Diagram: ranked data from L to H, with at most k% below Pk and at most (100 − k)% above]

Notes:
• The 1st quartile and the 25th percentile are the same: Q1 = P25
• The median, the 2nd quartile, and the 50th percentile are all the same: $\tilde{x}$ = Q2 = P50
Finding Pk (and Quartiles)
• Procedure for finding Pk (and quartiles):
1. Rank the n observations, lowest to highest
2. Compute A = (nk)/100
3. If A is an integer:
– d(Pk) = A.5 (depth)
– Pk is halfway between the value of the data in the Ath
position and the value of the next data

If A is a fraction:
– d(Pk) = B, the next larger integer
– Pk is the value of the data in the Bth position
Example
 Example: The following data represents the pH levels of a
random sample of swimming pools in a California
town. Find: 1) the first quartile, 2) the third quartile,
and 3) the 37th percentile:
5.6 5.6 5.8 5.9 6.0
6.0 6.1 6.2 6.3 6.4
6.7 6.8 6.8 6.8 6.9
7.0 7.3 7.4 7.4 7.5

Solutions:
1) k = 25: (20) (25) / 100 = 5, depth = 5.5, Q1 = 6
2) k = 75: (20) (75) / 100 = 15, depth = 15.5, Q3 = 6.95

3) k = 37: (20) (37) / 100 = 7.4, depth = 8, P37 = 6.2
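The procedure for finding Pk can be sketched in Python as follows. The function name `percentile` is illustrative, and the sketch assumes k is a whole number from 1 to 99 so that integer arithmetic can decide whether A = nk/100 is an integer.

```python
def percentile(data, k):
    """kth percentile by the depth procedure (k a whole number, 1 to 99)."""
    ranked = sorted(data)                       # step 1: rank the observations
    a, remainder = divmod(len(ranked) * k, 100) # step 2: A = nk / 100
    if remainder == 0:
        # A is an integer: depth is A.5, halfway between positions A and A + 1
        return (ranked[a - 1] + ranked[a]) / 2
    # A is a fraction: depth is B, the next larger integer (position a + 1)
    return ranked[a]


ph = [5.6, 5.6, 5.8, 5.9, 6.0, 6.0, 6.1, 6.2, 6.3, 6.4,
      6.7, 6.8, 6.8, 6.8, 6.9, 7.0, 7.3, 7.4, 7.4, 7.5]

print(percentile(ph, 25), percentile(ph, 75), percentile(ph, 37))
```

Other textbooks and software packages use slightly different percentile conventions, so results from library functions may not match this procedure exactly.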


Midquartile
Midquartile: The numerical value midway between the first and third quartile:
$\text{midquartile} = \frac{Q_1 + Q_3}{2}$

• Example: Find the midquartile for the 20 pH values in the previous example:
$\text{midquartile} = \frac{Q_1 + Q_3}{2} = \frac{6 + 6.95}{2} = \frac{12.95}{2} = 6.475$

Note: The mean, median, midrange, and midquartile are all measures
of central tendency. They are not necessarily equal. Can you
think of an example when they would be the same value?
5-Number Summary
5-Number Summary: The 5-number summary is composed of:
1. L, the smallest value in the data set
2. Q1, the first quartile (also P25)
3. $\tilde{x}$, the median (also P50 and 2nd quartile)
4. Q3, the third quartile (also P75)
5. H, the largest value in the data set

Notes:
 The 5-number summary indicates how much the data is
spread out in each quarter
 The interquartile range is the difference between the first and third
quartiles. It is the range of the middle 50% of the data
Box-and-Whisker Display
Box-and-Whisker Display: A graphic representation of the
5-number summary:
• The five numerical values (smallest, first quartile, median, third
quartile, and largest) are located on a scale, either vertical or
horizontal
• The box is used to depict the middle half of the data that lies
between the two quartiles
• The whiskers are line segments used to depict the other half of the
data
• One line segment represents the quarter of the data that is smaller
in value than the first quartile
• The second line segment represents the quarter of the data that is larger in value than the third quartile
Example
 Example: A random sample of students in a sixth grade class
was selected. Their weights are given in the table
below. Find the 5-number summary for this data and
construct a boxplot:
63 64 76 76 81 83
85 86 88 89 90 91
92 93 93 93 94 97
99 99 99 101 108 109
112
Solution:

L = 63, Q1 = 85, $\tilde{x}$ = 92, Q3 = 99, H = 112
Boxplot for Weight Data
[Figure: boxplot titled “Weights from Sixth Grade Class” on a Weight axis from 60 to 110, marking L, Q1, $\tilde{x}$, Q3, and H]
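The 5-number summary for the weight data can be reproduced with the depth procedure for percentiles described earlier in these slides. The sketch below is one possible Python version; the function names are illustrative, and it also reports the interquartile range, Q3 − Q1.

```python
def value_at_percentile(ranked, k):
    """kth percentile of already-ranked data (k a whole number, 1 to 99)."""
    a, remainder = divmod(len(ranked) * k, 100)
    if remainder == 0:
        return (ranked[a - 1] + ranked[a]) / 2   # depth A.5
    return ranked[a]                             # depth B, the next larger integer


def five_number_summary(data):
    """Return (L, Q1, median, Q3, H)."""
    ranked = sorted(data)
    return (ranked[0],
            value_at_percentile(ranked, 25),
            value_at_percentile(ranked, 50),
            value_at_percentile(ranked, 75),
            ranked[-1])


weights = [63, 64, 76, 76, 81, 83, 85, 86, 88, 89, 90, 91, 92,
           93, 93, 93, 94, 97, 99, 99, 99, 101, 108, 109, 112]

low, q1, med, q3, high = five_number_summary(weights)
iqr = q3 - q1   # range of the middle 50% of the data
print(low, q1, med, q3, high, iqr)
```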
Quartiles
• Quartiles divide data into four equal parts

– First quartile—Q1
• 25% of observations are below Q1 and 75% above Q1
• Also called the lower quartile

– Second quartile—Q2
• 50% of observations are below Q2 and 50% above Q2
• This is also the median

– Third quartile—Q3
• 75% of observations are below Q3 and 25% above Q3
• Also called the upper quartile
• Calculating quartiles
Example
The sorted observations are:
25, 29, 31, 39, 43, 48, 52, 63, 66, 90

Find the first and third quartile


Solution
The number of observations n = 10
Since $\frac{n}{4} = \frac{10}{4} = 2.5$, we define m = 3. Therefore,
Lower quartile = 3rd observation from the lower end = 31

Upper quartile = 3rd observation from the upper end = 63
Quartiles
• Calculation of the quartiles from a grouped frequency
distribution
– The class interval that contains the relevant quartile is called the quartile class

$Q_1 = L + \frac{\frac{n}{4} - C}{f} \times i$

$Q_3 = L + \frac{\frac{3n}{4} - C}{f} \times i$
where:
L = the real lower limit of the quartile class (containing Q1 or
Q3)
n = Σf = the total number of observations in the entire data set
C = the cumulative frequency in the class immediately before
the quartile class
f = the frequency of the relevant quartile class
i = the length of the real class interval of the relevant quartile
class
Deciles, percentiles and fractiles
• Further division of a distribution into a number of equal parts is
sometimes used; the most common of these are deciles, percentiles,
and fractiles

• Deciles divide the sorted data into 10 sections

• Percentiles divide the distribution into 100 sections

• A percentile can also be expressed as a fractile

– For example, the 30th percentile is the 0.30 fractile
The geometric mean
• When dealing with quantities that change over a period,
we would like to know the mean rate of change
• Examples include
– The mean growth rate of savings over
several years
– The mean ratios of prices from one
year to the next

• Geometric mean of n observations x1, x2, x3, …, xn is given by:
$\text{Geometric mean} = \sqrt[n]{x_1 x_2 x_3 \cdots x_n}$
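As an illustration, the sketch below computes a geometric mean two ways: directly from the formula, and with `statistics.geometric_mean` (Python 3.8+). The growth factors are hypothetical values invented for this example and do not come from the slides.

```python
from math import prod
from statistics import geometric_mean

# Hypothetical yearly growth factors (1.10 means +10%); illustrative only
factors = [1.10, 1.20, 0.95]

manual = prod(factors) ** (1 / len(factors))   # n-th root of the product
library = geometric_mean(factors)

# Growing by the geometric mean every year gives the same overall growth
# as applying the individual factors one after another.
print(f"mean growth factor = {manual:.4f}")
```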
Skewness
• The skewness of a distribution is measured by comparing the
relative positions of the mean, median and mode

• Distribution is symmetrical
Mean = Median = Mode

• Distribution skewed right


– Median lies between mode and mean, and mode is less than mean

• Distribution skewed left


– Median lies between mode and mean, and mode is greater than mean
Skewness
• There are two measures commonly associated with the shapes of a
distribution — Kurtosis and skewness

• Kurtosis is the degree to which a distribution is peaked

• Measured as excess kurtosis (kurtosis minus 3), the kurtosis for a normal distribution is zero

• If the distribution is more sharply peaked than a normal distribution, the kurtosis is positive

• If it is flatter than a normal distribution, the kurtosis is negative


Summary
• We have looked at calculating the mode,
median and mean from grouped and
ungrouped data

• We also looked at calculating quartiles,


deciles, percentiles and fractiles

• We have discussed calculating and


interpreting the geometric mean

• Lastly we determined the significance of the


skewness of a distribution
Percentiles and Percentile Ranks
• A percentile is the score at which a specified
percentage of scores in a distribution fall below
– To say a score 53 is in the 75th percentile is to say
that 75% of all scores are less than 53
• The percentile rank of a score indicates the percentage
of scores in the distribution that fall at or below that
score.
– Thus, for example, to say that the percentile rank of
53 is 75, is to say that 75% of the scores on the
exam are less than 53.
Psy 427 - Cal State Northridge 96
Percentile
• Scores which divide distributions into specific
proportions
– Percentiles = hundredths
P1, P2, P3, … P97, P98, P99
– Quartiles = quarters
Q1, Q2, Q3
– Deciles = tenths
D1, D2, D3, D4, D5, D6, D7, D8, D9
• Percentiles are the SCORES
Percentile Rank
• What percent of the scores fall below a particular score?
$PR = \frac{\text{Rank} - .5}{N} \times 100$
• Percentile Ranks are the Ranks not the scores



Example: Percentile Rank

• Ranking no ties – just number them


Score: 1 3 4 5 6 7 8 10
Rank: 1 2 3 4 5 6 7 8

• Ranking with ties - assign midpoint to ties


Score: 1 3 4 6 6 8 8 8
Rank: 1 2 3 4.5 4.5 7 7 7



Steps to Calculating Percentile Ranks

        Step 1   Step 2   Step 3             Step 4
Data    Order    Number   Assign Midpoint    Percentile Rank
                          to Ties            (Apply Formula)
9       1        1        1                  2.381
5       2        2        2                  7.143
2       3        3        4                  16.667
3       3        4        4                  16.667
3       3        5        4                  16.667
4       4        6        7                  30.952
8       4        7        7                  30.952
9       4        8        7                  30.952
1       5        9        10                 45.238
7       5        10       10                 45.238
4       5        11       10                 45.238
8       6        12       12                 54.762
3       7        13       14                 64.286
7       7        14       14                 64.286
6       7        15       14                 64.286
5       8        16       17.5               80.952
7       8        17       17.5               80.952
4       8        18       17.5               80.952
5       8        19       17.5               80.952
8       9        20       20.5               95.238
8       9        21       20.5               95.238

• Example:
$PR_3 = \frac{\text{Rank}_3 - .5}{N} \times 100 = \frac{4 - .5}{21} \times 100 = 16.667$
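The steps for calculating percentile ranks can be sketched in Python as follows. The function names are illustrative; ties receive the midpoint (average) of the positions they occupy, and the formula PR = (Rank − .5) / N × 100 is then applied.

```python
def midpoint_ranks(scores):
    """Map each distinct score to its rank, averaging positions for ties."""
    positions = {}
    for position, score in enumerate(sorted(scores), start=1):
        positions.setdefault(score, []).append(position)
    return {score: sum(p) / len(p) for score, p in positions.items()}


def percentile_rank(score, scores):
    """PR = (Rank - .5) / N * 100, using midpoint ranks for ties."""
    rank = midpoint_ranks(scores)[score]
    return (rank - 0.5) / len(scores) * 100


data = [9, 5, 2, 3, 3, 4, 8, 9, 1, 7, 4, 8, 3, 7, 6, 5, 7, 4, 5, 8, 8]

for score in sorted(set(data)):
    print(score, midpoint_ranks(data)[score], round(percentile_rank(score, data), 3))
```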
Percentile
$X_P = (p)(n + 1)$
• Where $X_P$ is the position of the score at the desired percentile, p is the desired percentile (a number between 0 and 1) and n is the number of scores
• If the number is an integer, then the desired percentile is the score in that position
• If the number is not an integer then you can either round or interpolate; for this class we’ll just round (round up when p is below .50 and down when p is above .50)
Percentile
• Apply the formula $X_P = (p)(n + 1)$
1. You’ll get a number like 7.5 (think of it as place1.proportion)
2. Start with the value indicated by place1 (e.g. for 7.5, start with the value in the 7th place)
3. Find place2, which is the next highest place number (e.g. the 8th place), and subtract the value in place1 from the value in place2; this is distance1
4. Multiply the proportion number by the distance1 value; this is distance2
5. Add distance2 to the value in place1, and that is the interpolated value
Example: Percentile
• Example 1: 25th percentile:
{1, 4, 9, 16, 25, 36, 49, 64, 81}
• X25 = (.25)(9+1) = 2.5
– place1 = 2, proportion = .5
– Value in place1 = 4
– Value in place2 = 9
– distance1 = 9 – 4 = 5
– distance2 = 5 * .5 = 2.5
– Interpolated value = 4 + 2.5 = 6.5
– 6.5 is the 25th percentile
Example: Percentile
• Example 2: 75th percentile
{1, 4, 9, 16, 25, 36, 49, 64, 81}
• X75 = (.75)(9+1) = 7.5
– place1 = 7, proportion = .5
– Value in place1 = 49
– Value in place2 = 64
– distance1 = 64 – 49 = 15
– distance2 = 15 * .5 = 7.5
– Interpolated value = 49 + 7.5 = 56.5
– 56.5 is the 75th percentile
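The interpolation steps used in both examples can be collected into one function. This is a sketch only: the name `interpolated_percentile` is illustrative, and clamping to the smallest or largest score when the position falls outside 1..n is an assumption added here, since the slides do not cover that case.

```python
def interpolated_percentile(scores, p):
    """Percentile at position p * (n + 1), interpolating between places."""
    ranked = sorted(scores)
    n = len(ranked)
    position = p * (n + 1)             # e.g. 2.5 means place1 = 2, proportion = .5
    place1 = int(position)
    proportion = position - place1
    if place1 < 1:                     # assumption: clamp below the first place
        return ranked[0]
    if place1 >= n:                    # assumption: clamp at the last place
        return ranked[-1]
    value1 = ranked[place1 - 1]        # value in place1
    value2 = ranked[place1]            # value in place2
    distance1 = value2 - value1
    distance2 = proportion * distance1
    return value1 + distance2


squares = [1, 4, 9, 16, 25, 36, 49, 64, 81]
print(interpolated_percentile(squares, 0.25))
print(interpolated_percentile(squares, 0.75))
```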
Quartiles
• To calculate Quartiles you simply find the scores that correspond to the 25th, 50th and 75th percentiles.
• Q1 = P25, Q2 = P50, Q3 = P75

