
Simple Descriptive Statistics

Review and Examples


You will likely make use of all three measures of central
tendency (mode, median, and mean), as well as some
key measures of dispersion (standard deviation, z-scores,
and the coefficient of variation), along with the
statistics that describe the shape of a distribution
(skewness and kurtosis), at some point if you work with
numeric data sets in an academic or research context.
In this lecture, we will review the procedures for
calculating these statistics and work through an
example for each of them (using a small data set,
smaller than those typically found in research
applications).
David Tenenbaum GEOG 090 UNC-CH Spring 2005

Measures of Central Tendency - Review


1. Mode: the most frequently occurring
value in the distribution
2. Median: the value of a variable such
that half of the observations are above and
half are below it, i.e. the value that divides
the distribution into two groups of equal size
3. Mean: a.k.a. the average, the most commonly
used measure of central tendency


Measures of Central Tendency - Review


1. Mode: the most frequently occurring value in
the distribution
Procedure for finding the mode of a data set:
1) Sort the data, putting the values in ascending order
2) Count the instances of each value (if this is continuous
data with a high degree of precision and many decimal
places, this may be quite tedious)
3) Find the value that has the most occurrences; this is
the mode (if more than one value occurs an equal
number of times and these counts exceed all others, we
have multiple modes)
Use the mode for multi-modal or nominal data sets

Measures of Central Tendency - Review


2. Median: half of the values are above this value and half are below it
Procedure for finding the median of a data set:
1) Sort the data, putting the values in ascending order
2) Find the value with an equal number of values above
and below it (if there is an even number of values, you
will need to average two values together):
Odd number of observations: take the value [(n-1)/2]+1 places from
the lowest, e.g. n=19 gives [(19-1)/2]+1 = the 10th value
Even number of observations: average the (n/2)th and
[(n/2)+1]th values, e.g. n=20 means averaging the 10th and 11th values
Use the median with asymmetric distributions, when you
suspect outliers are present, or with ordinal data

Measures of Central Tendency - Review


3. Mean: a.k.a. the average, the most commonly used
measure of central tendency
Procedure for finding the mean of a data set:
1) Sum all the values in the data set
2) Divide the sum by the number of values in the data set

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Use the mean when you have interval or ratio data sets
with a large sample size, few (or no) outliers, and a
reasonably symmetric unimodal distribution

Measures of Central Tendency - Review


An example data set: daily low temperatures recorded
in Chapel Hill from January 18, 2005 through January
31, 2005, in degrees Fahrenheit:
Jan. 18   11 degrees      Jan. 25   25 degrees
Jan. 19   11 degrees      Jan. 26   33 degrees
Jan. 20   25 degrees      Jan. 27   22 degrees
Jan. 21   29 degrees      Jan. 28   18 degrees
Jan. 22   27 degrees      Jan. 29   19 degrees
Jan. 23   14 degrees      Jan. 30   30 degrees
Jan. 24   11 degrees      Jan. 31   27 degrees
For these 14 values, we will calculate all three measures
of central tendency: the mode, median, and mean

Measures of Central Tendency - Review


1. Mode: find the most frequently occurring value
1) Sort the data, putting the values in ascending order:
11, 11, 11, 14, 18, 19, 22, 25, 25, 27, 27, 29, 30, 33
2) Count the instances of each value:
Value:  11  14  18  19  22  25  27  29  30  33
Count:  3x  1x  1x  1x  1x  2x  2x  1x  1x  1x
3) Find the value that has the most occurrences:
In this case, the mode is 11 degrees Fahrenheit, but is
this a good measure of the central tendency of this data?
Had there only been two days with a recorded
temperature of 11 degrees, what would be the mode?
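
The counting procedure can be sketched in Python. This is a minimal illustration, not part of the original lecture: the list `temps` (the 14 lows in date order) and the helper name `modes` are assumptions introduced here. The helper returns every value tied for the highest count, so it also covers the multi-modal case raised in the question above.

```python
from collections import Counter

# Daily lows for Jan. 18-31, 2005, in date order (degrees F).
temps = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]

def modes(values):
    """Return all values tied for the highest count, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes(temps))          # [11]

# Hypothetical variation: only two days at 11 degrees (drop one 11).
two_elevens = sorted(temps)[1:]
print(modes(two_elevens))    # [11, 25, 27]
```

With one fewer 11-degree day, three values tie at two occurrences each, so the data set would have three modes.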

Measures of Central Tendency - Review


2. Median: half of the values are above this value and half are below it
1) Sort the data, putting the values in ascending order:
11, 11, 11, 14, 18, 19, 22, 25, 25, 27, 27, 29, 30, 33
2) Find the value with an equal number of values above
and below it (if there is an even number of values, you
will need to average two values together):
Even number of observations: average the (n/2)th and
[(n/2)+1]th values
Here, n=14, so average the (14/2)th and [(14/2)+1]th values,
i.e. the 7th and 8th values: (22+25)/2 = 23.5 degrees F
Here, the median is 23.5 degrees F. Is this a good
measure of central tendency for this data?
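
The odd/even rule can be written as a short Python sketch. The function name `median` and the list `temps` are assumptions made for illustration. Because Python indexes from 0, the [(n-1)/2]+1 th value sits at index n // 2, and the (n/2)th and [(n/2)+1]th values sit at indices n // 2 - 1 and n // 2.

```python
# Daily lows for Jan. 18-31, 2005, in date order (degrees F).
temps = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]

def median(values):
    """Middle value of the sorted data; average the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle values

print(median(temps))   # 23.5
```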

Measures of Central Tendency - Review


3. Mean: a.k.a. the average, the most commonly used
measure of central tendency
1) Sum all the values in the data set:

$$\sum_{i=1}^{n} x_i = 11 + 11 + 11 + 14 + 18 + 19 + 22 + 25 + 25 + 27 + 27 + 29 + 30 + 33 = 302$$

2) Divide the sum by the number of values in the data set:
Here, n=14, so calculate the mean using 302/14 = 21.57
The mean is 21.57 degrees F. Is this a good measure
of central tendency for this data set?
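
The same two steps as a brief Python check. The variable names are assumptions for illustration; the full-precision mean is kept in `mean`, and rounding is applied only for display.

```python
# Daily lows for Jan. 18-31, 2005, in date order (degrees F).
temps = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]

total = sum(temps)          # step 1: sum all the values
mean = total / len(temps)   # step 2: divide by the number of values

print(total, round(mean, 2))   # 302 21.57
```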

Measures of Dispersion - Review


1. Standard Deviation: the most frequently used
measure of dispersion, because it has the same units as
the values and their mean
2. Z-scores: these express the difference of an individual
value from the mean in terms of standard deviations,
and thus can be compared to z-scores drawn from other
data sets or distributions
3. Coefficient of Variation: an overall measure of
dispersion that is normalized with respect to the mean
of the same distribution, and thus is comparable to
coefficients of variation from other data sets

Measures of Dispersion - Review


1. Standard Deviation: standard deviation is
calculated by taking the square root of the variance:

Population standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

Why do we prefer standard deviation over
variance as a measure of dispersion? Its magnitude
and units match those of the values and their mean.


Measures of Dispersion - Review


1. Standard Deviation: the most frequently used
measure of dispersion, because it has the same units as
the values and their mean (unlike variance)
Procedure for finding the standard deviation of a data set:
1) Calculate the mean
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
3) Square each of the statistical distances: $(x_i - \bar{x})^2$
4) Sum the squared statistical distances to get the sum of squares
5) Divide the sum of squares by N for a population or by
(n-1) for a sample; this gives you the variance
6) Take the square root of the variance to get the standard
deviation


Measures of Dispersion - Review


2. Z-scores: these express the difference of an individual
value from the mean in terms of standard deviations,
and thus can be compared to z-scores drawn from other
data sets or distributions
Procedure for finding the z-score of an observation:
1) Calculate the mean
2) Calculate the statistical distance $(x_i - \bar{x})$ for each value
for which we wish to find the z-score
3) Calculate the standard deviation
4) Calculate the z-score using the formula

$$\text{Z-score} = \frac{x_i - \bar{x}}{S}$$

Measures of Dispersion - Review


3. Coefficient of Variation: an overall measure of
dispersion that is normalized with respect to the mean
of the same distribution, and thus is comparable to
coefficients of variation from other data sets
Procedure for finding the coefficient of variation for a data set:
1) Calculate the mean
2) Calculate the standard deviation
3) Calculate the coefficient of variation using the formula

$$\text{Coefficient of variation} = \frac{S}{\bar{x}} \quad \text{or} \quad \frac{S}{\bar{x}} \times 100\%$$

Measures of Dispersion - Review


We will use the same example data set: daily low Chapel Hill
temperatures, Jan. 18-31, 2005, in degrees F:
Jan. 18   11 degrees      Jan. 25   25 degrees
Jan. 19   11 degrees      Jan. 26   33 degrees
Jan. 20   25 degrees      Jan. 27   22 degrees
Jan. 21   29 degrees      Jan. 28   18 degrees
Jan. 22   27 degrees      Jan. 29   19 degrees
Jan. 23   14 degrees      Jan. 30   30 degrees
Jan. 24   11 degrees      Jan. 31   27 degrees
For these 14 values, we will calculate the three measures
of dispersion listed above: the standard deviation, some
z-scores, and the coefficient of variation for this data set

Measures of Dispersion - Review


1. Standard Deviation: the most frequently used
measure of dispersion, because it has the same units as
the values and their mean (unlike variance)
1) Calculate the mean
We have previously found the mean = 21.57 degrees F
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
Jan. 18   (11 - 21.57) = -10.57      Jan. 25   (25 - 21.57) = 3.43
Jan. 19   (11 - 21.57) = -10.57      Jan. 26   (33 - 21.57) = 11.43
Jan. 20   (25 - 21.57) = 3.43        Jan. 27   (22 - 21.57) = 0.43
Jan. 21   (29 - 21.57) = 7.43        Jan. 28   (18 - 21.57) = -3.57
Jan. 22   (27 - 21.57) = 5.43        Jan. 29   (19 - 21.57) = -2.57
Jan. 23   (14 - 21.57) = -7.57       Jan. 30   (30 - 21.57) = 8.43
Jan. 24   (11 - 21.57) = -10.57      Jan. 31   (27 - 21.57) = 5.43
I have rounded the values for display here to 2 decimal places; ideally
you want to do as little rounding as possible


Measures of Dispersion - Review


1. Standard Deviation cont.
3) Square each of the statistical distances: $(x_i - \bar{x})^2$
Jan. 18   (-10.57)^2 = 111.76      Jan. 25   (3.43)^2 = 11.76
Jan. 19   (-10.57)^2 = 111.76      Jan. 26   (11.43)^2 = 130.61
Jan. 20   (3.43)^2 = 11.76         Jan. 27   (0.43)^2 = 0.18
Jan. 21   (7.43)^2 = 55.18         Jan. 28   (-3.57)^2 = 12.76
Jan. 22   (5.43)^2 = 29.47         Jan. 29   (-2.57)^2 = 6.61
Jan. 23   (-7.57)^2 = 57.33        Jan. 30   (8.43)^2 = 71.04
Jan. 24   (-10.57)^2 = 111.76      Jan. 31   (5.43)^2 = 29.47
4) Sum the squared statistical distances to get the sum of squares:

$$\text{Sum of Squares} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = 751.43$$


Measures of Dispersion - Review


1. Standard Deviation cont.
5) Divide the sum of squares by N for a population or by
(n-1) for a sample; this gives you the variance
Here, our sample size is n = 14, so the variance is 751.43/(14-1) = 57.8
6) Take the square root of the variance to calculate the
standard deviation
Taking the square root of our variance (57.8) gives us
the standard deviation for our data set: $\sqrt{57.8} = 7.6$
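
Steps 1 through 6 can be checked with a small Python sketch. The helper names `sum_of_squares` and `std_dev` and the `sample` flag are assumptions introduced for illustration. The code works with unrounded intermediate values and cross-checks the result against `statistics.stdev` from the standard library, which also divides by n-1.

```python
import math
import statistics

# Daily lows for Jan. 18-31, 2005, in date order (degrees F).
temps = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]

def sum_of_squares(values):
    """Sum of squared statistical distances from the mean (steps 1-4)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def std_dev(values, sample=True):
    """Square root of the variance; divide by n-1 (sample) or N (population)."""
    divisor = len(values) - 1 if sample else len(values)
    return math.sqrt(sum_of_squares(values) / divisor)

print(round(sum_of_squares(temps), 2))   # 751.43
print(round(std_dev(temps), 2))          # 7.6
# The standard library's sample standard deviation gives the same value.
print(math.isclose(std_dev(temps), statistics.stdev(temps)))   # True
```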


Measures of Dispersion - Review


2. Z-scores: we will calculate z-scores for the lowest and
highest temperatures in our sample (11 and 33 degrees)
1) Calculate the mean
We have previously found the mean = 21.57 degrees F
2) Calculate the statistical distance $(x_i - \bar{x})$ for each value
for which we wish to find the z-score
We have already calculated these statistical distances:
Jan. 18   (11 - 21.57) = -10.57
Jan. 26   (33 - 21.57) = 11.43
3) Calculate the standard deviation
We have already calculated the standard deviation for
our data set and found it to be 7.6 degrees F

Measures of Dispersion - Review


2. Z-scores cont.
4) Calculate the z-score using the formula

$$\text{Z-score} = \frac{x_i - \bar{x}}{S}$$

i.e. divide the statistical distances by the standard deviation
Jan. 18   -10.57 / 7.6 = -1.39
Jan. 26   11.43 / 7.6 = 1.50

If we had another set of minimum temperatures from a
previous January (from 2004, for example), we could
calculate the z-scores for values from that data set and
make a reasonable comparison to these values
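
A minimal Python sketch of the z-score calculation, applied to every value in the data set. The function name `z_scores` is an assumption for illustration, and the sample standard deviation (n-1) is used to match the worked example.

```python
import statistics

# Daily lows for Jan. 18-31, 2005, in date order (degrees F).
temps = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]

def z_scores(values):
    """Statistical distance of each value from the mean, in standard deviations."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (n-1)
    return [(x - mean) / s for x in values]

z = z_scores(temps)
# z-scores of the lowest (11) and highest (33) temperatures
print(round(min(z), 2), round(max(z), 2))   # -1.39 1.5
```

Running the same function on a second data set (such as the hypothetical 2004 temperatures mentioned above) would put both sets of values on a common, unitless scale.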


Measures of Dispersion - Review


3. Coefficient of Variation: a normalized measure
of dispersion for the variation throughout a data set
1) Calculate the mean
We have previously found the mean = 21.57 degrees F
2) Calculate the standard deviation
We have previously found the std. dev. = 7.6 degrees F
3) Calculate the coefficient of variation using the formula

$$\text{Coefficient of variation} = \frac{S}{\bar{x}} \quad \text{or} \quad \frac{S}{\bar{x}} \times 100\%$$

Using the example values: 7.6/21.57 = 0.3524 or 35.24%
This value could be compared with that from 2004, etc.
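
The ratio can be reproduced with a short Python sketch. The function name `coefficient_of_variation` is an assumption for illustration. The calculation uses the unrounded mean and sample standard deviation, and formats the result both as a proportion and as a percentage.

```python
import statistics

# Daily lows for Jan. 18-31, 2005, in date order (degrees F).
temps = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    return statistics.stdev(values) / statistics.mean(values)

cv = coefficient_of_variation(temps)
print(f"{cv:.4f} or {cv:.2%}")   # 0.3524 or 35.24%
```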

Skewness and Kurtosis - Review


1. Skewness: this statistic measures the
degree of asymmetry exhibited by the data
(i.e. whether there are more observations on
one side of the mean than the other)
2. Kurtosis: this statistic measures the
degree to which the distribution is flat or
peaked


Skewness and Kurtosis - Review


1. Skewness: this statistic measures the degree of
asymmetry exhibited by the data (i.e. whether
there are more observations on one side of the
mean than the other):

$$\text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n S^3}$$

Because the exponent in this moment is odd,
skewness can be positive or negative; positive
skewness has more observations below the mean
than above it (and negative skewness vice versa)


Skewness and Kurtosis - Review


1. Skewness: this statistic measures the degree of
asymmetry exhibited by the data
Procedure for finding the skewness of a data set:
1) Calculate the mean
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
3) Cube each of the statistical distances: $(x_i - \bar{x})^3$
4) Sum the cubed statistical distances to get the sum of cubes
(i.e. the numerator in the skewness formula)
5) Divide the sum of cubes by the sample size multiplied
by the standard deviation cubed (i.e. the denominator is
$n S^3$ in $\left[\sum (x_i - \bar{x})^3\right] / \left[n S^3\right]$)

Skewness and Kurtosis - Review


2. Kurtosis: this statistic measures how flat or
peaked the distribution is, and is formulated as:

$$\text{Kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n S^4} - 3$$

The 3 is subtracted in this formula so that
the kurtosis of a normal distribution has
the value 0 (a distribution with this property
is termed mesokurtic)

Skewness and Kurtosis - Review


2. Kurtosis: this statistic measures how flat or peaked
the distribution is
Procedure for finding the kurtosis of a data set:
1) Calculate the mean
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
3) Raise each of the statistical distances to the 4th power,
i.e. $(x_i - \bar{x})^4$
4) Sum the statistical distances raised to the 4th power: $\sum (x_i - \bar{x})^4$
5) Divide the sum by the sample size multiplied by the
standard deviation raised to the 4th power (i.e. the
denominator is $n S^4$ in $\left[\sum (x_i - \bar{x})^4\right] / \left[n S^4\right]$)
6) Subtract 3 from $\left[\sum (x_i - \bar{x})^4\right] / \left[n S^4\right]$


Skewness & Kurtosis - Review


We will use the same example data set: daily low Chapel Hill
temperatures, Jan. 18-31, 2005, in degrees F:
Jan. 18   11 degrees      Jan. 25   25 degrees
Jan. 19   11 degrees      Jan. 26   33 degrees
Jan. 20   25 degrees      Jan. 27   22 degrees
Jan. 21   29 degrees      Jan. 28   18 degrees
Jan. 22   27 degrees      Jan. 29   19 degrees
Jan. 23   14 degrees      Jan. 30   30 degrees
Jan. 24   11 degrees      Jan. 31   27 degrees
Using these 14 values, we will calculate the two
distribution shape descriptive statistics listed above, the
skewness and kurtosis for this data set

Skewness & Kurtosis - Review


1. Skewness: this statistic measures the degree of
asymmetry exhibited by the data
1) Calculate the mean
We have previously found the mean = 21.57 degrees F
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
We have previously calculated the statistical distances:
Jan. 18   (11 - 21.57) = -10.57      Jan. 25   (25 - 21.57) = 3.43
Jan. 19   (11 - 21.57) = -10.57      Jan. 26   (33 - 21.57) = 11.43
Jan. 20   (25 - 21.57) = 3.43        Jan. 27   (22 - 21.57) = 0.43
Jan. 21   (29 - 21.57) = 7.43        Jan. 28   (18 - 21.57) = -3.57
Jan. 22   (27 - 21.57) = 5.43        Jan. 29   (19 - 21.57) = -2.57
Jan. 23   (14 - 21.57) = -7.57       Jan. 30   (30 - 21.57) = 8.43
Jan. 24   (11 - 21.57) = -10.57      Jan. 31   (27 - 21.57) = 5.43

Skewness & Kurtosis - Review


1. Skewness cont.
3) Cube each of the statistical distances: $(x_i - \bar{x})^3$
Jan. 18   (-10.57)^3 = -1181.41      Jan. 25   (3.43)^3 = 40.30
Jan. 19   (-10.57)^3 = -1181.41      Jan. 26   (11.43)^3 = 1492.71
Jan. 20   (3.43)^3 = 40.30           Jan. 27   (0.43)^3 = 0.08
Jan. 21   (7.43)^3 = 409.94          Jan. 28   (-3.57)^3 = -45.55
Jan. 22   (5.43)^3 = 159.98          Jan. 29   (-2.57)^3 = -17.00
Jan. 23   (-7.57)^3 = -434.04        Jan. 30   (8.43)^3 = 598.77
Jan. 24   (-10.57)^3 = -1181.41      Jan. 31   (5.43)^3 = 159.98
4) Sum the cubed statistical distances to get the sum of cubes:

$$\text{Sum of cubes} = \sum_{i=1}^{n} (x_i - \bar{x})^3 = -1138.78$$


Skewness & Kurtosis - Review


1. Skewness cont.
5) Divide the sum of cubes (-1138.78) by $n S^3$ (S = 7.6 from
above):

$$\frac{\sum (x_i - \bar{x})^3}{n S^3} = \frac{-1138.78}{14 \times (7.6)^3} = \frac{-1138.78}{14 \times 438.98} = \frac{-1138.78}{6145.72} = -0.1851$$

(The result -0.1851 reflects the unrounded standard deviation; the
rounded intermediate values shown here give -0.1853.)
The negative value of skewness indicates that our sample
distribution has greater frequencies at the higher values of
temperature (although interpreting skewness with a sample
this small and a distribution that is not really normally
shaped is somewhat of a stretch)
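
The five skewness steps can be sketched in Python as follows. The function name `skewness` is an assumption for illustration, and the code follows the formula used in this lecture (sum of cubes divided by n times the cubed sample standard deviation). Statistical packages often use slightly different definitions, so their output may not match this value exactly.

```python
import statistics

# Daily lows for Jan. 18-31, 2005, in date order (degrees F).
temps = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]

def skewness(values):
    """Sum of cubed distances from the mean, divided by n * S**3."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (n-1)
    sum_of_cubes = sum((x - mean) ** 3 for x in values)
    return sum_of_cubes / (n * s ** 3)

print(round(skewness(temps), 4))   # -0.1851
```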

Skewness & Kurtosis - Review


2. Kurtosis: this statistic measures the degree to which
the distribution is flat or peaked
1) Calculate the mean
We have previously found the mean = 21.57 degrees F
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
We have previously calculated the statistical distances:
Jan. 18   (11 - 21.57) = -10.57      Jan. 25   (25 - 21.57) = 3.43
Jan. 19   (11 - 21.57) = -10.57      Jan. 26   (33 - 21.57) = 11.43
Jan. 20   (25 - 21.57) = 3.43        Jan. 27   (22 - 21.57) = 0.43
Jan. 21   (29 - 21.57) = 7.43        Jan. 28   (18 - 21.57) = -3.57
Jan. 22   (27 - 21.57) = 5.43        Jan. 29   (19 - 21.57) = -2.57
Jan. 23   (14 - 21.57) = -7.57       Jan. 30   (30 - 21.57) = 8.43
Jan. 24   (11 - 21.57) = -10.57      Jan. 31   (27 - 21.57) = 5.43

Skewness & Kurtosis - Review


2. Kurtosis cont.
3) Raise each of the statistical distances to the 4th power:
$(x_i - \bar{x})^4$
Jan. 18   (-10.57)^4 = 12489.20      Jan. 25   (3.43)^4 = 138.18
Jan. 19   (-10.57)^4 = 12489.20      Jan. 26   (11.43)^4 = 17059.56
Jan. 20   (3.43)^4 = 138.18          Jan. 27   (0.43)^4 = 0.03
Jan. 21   (7.43)^4 = 3045.24         Jan. 28   (-3.57)^4 = 162.69
Jan. 22   (5.43)^4 = 868.44          Jan. 29   (-2.57)^4 = 43.72
Jan. 23   (-7.57)^4 = 3286.33        Jan. 30   (8.43)^4 = 5046.80
Jan. 24   (-10.57)^4 = 12489.20      Jan. 31   (5.43)^4 = 868.44
4) Sum the statistical distances raised to the 4th power:

$$\text{Sum of 4th powers} = \sum_{i=1}^{n} (x_i - \bar{x})^4 = 68125.24$$


Skewness & Kurtosis - Review


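2. Kurtosis cont.
5) Divide the sum of 4th powers (68125.24) by $n S^4$ (S = 7.6
from above):

$$\frac{\sum (x_i - \bar{x})^4}{n S^4} = \frac{68125.24}{14 \times (7.6)^4} = \frac{68125.24}{14 \times 3341.1} = \frac{68125.24}{46775.32} = 1.4564$$

(The value 3341.1 is $S^4$ computed from the unrounded standard
deviation; 7.6 raised to the 4th power is 3336.2.)
6) Subtract 3 from $\left[\sum (x_i - \bar{x})^4\right] / \left[n S^4\right]$
Using our values, the kurtosis is 1.4564 - 3 = -1.5436
Because this kurtosis is < 0, this sample has a
platykurtic distribution, meaning the curve is flatter
than a normal curve (but caveats to interpretation apply)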
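
The six kurtosis steps can be sketched the same way. The function name `kurtosis` is an assumption for illustration, and the code follows this lecture's formula (sum of 4th powers divided by n times the 4th power of the sample standard deviation, minus 3). Other software may apply different corrections and report a somewhat different value.

```python
import statistics

# Daily lows for Jan. 18-31, 2005, in date order (degrees F).
temps = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]

def kurtosis(values):
    """Sum of 4th-power distances divided by n * S**4, minus 3."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (n-1)
    sum_of_fourths = sum((x - mean) ** 4 for x in values)
    return sum_of_fourths / (n * s ** 4) - 3

k = kurtosis(temps)
print(round(k, 4))                                  # -1.5436
print("platykurtic" if k < 0 else "not platykurtic")  # platykurtic
```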
