Simple Descriptive Statistics

Review and Examples

If you work with numeric data sets in an academic or research
context, you will likely make use of all three measures of central
tendency (mode, median, and mean), some key measures of dispersion
(standard deviation, z-scores, and the coefficient of variation),
and the statistics that describe the shape of a distribution
(skewness and kurtosis).
In this lecture, we will review the procedures for calculating
these statistics and work through an example of each, using a
small data set (smaller than those typically found in research
applications).
David Tenenbaum GEOG 090 UNC-CH Spring 2005
Measures of Central Tendency - Review

1. Mode: the most frequently occurring value in the distribution
2. Median: the value of a variable such that half of the
observations are above and half are below it, i.e. this value
divides the distribution into two groups of equal size
3. Mean: a.k.a. the average, the most commonly used measure of
central tendency

1. Mode: the most frequently occurring value in the distribution
Procedure for finding the mode of a data set:
1) Sort the data, putting the values in ascending order
2) Count the instances of each value (if this is continuous
data with a high degree of precision and many decimal
places, this may be quite tedious)
3) Find the value that has the most occurrences; this is
the mode (if more than one value occurs an equal
number of times and these counts exceed all others, we
have multiple modes)
Use the mode for multi-modal or nominal data sets
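
The counting procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `find_modes` and the short list of values are made up for the example, and the input is assumed to be non-empty.

```python
from collections import Counter

def find_modes(values):
    """Return every value tied for the highest count, in ascending order."""
    counts = Counter(values)        # step 2: count the instances of each value
    top = max(counts.values())      # the largest count observed
    # step 3: keep each value whose count equals the largest count
    return sorted(v for v, c in counts.items() if c == top)

# Made-up values: 2 and 3 each occur twice, so there are two modes.
print(find_modes([3, 1, 2, 3, 2, 4]))
```

Returning a list rather than a single number keeps the multi-modal case explicit instead of silently picking one of the tied values.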

2. Median: half of the values are above and half are below this value
Procedure for finding the median of a data set:
1) Sort the data, putting the values in ascending order
2) Find the value with an equal number of values above
and below it (if there is an even number of values, you
will need to average two values together):
Odd number of observations: take the value [(n-1)/2]+1 positions
from the lowest, e.g. n=19 gives [(19-1)/2]+1 = the 10th value
Even number of observations: average the (n/2)th and
[(n/2)+1]th values, e.g. n=20 means averaging the 10th and 11th
Use the median with asymmetric distributions, when you
suspect outliers are present, or with ordinal data
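
As a rough sketch of the odd/even rule above, the positions translate to Python list indices as follows. The function name `find_median` and the test values are assumptions for illustration; note that the 1-based positions on the slide become 0-based indices in code.

```python
def find_median(values):
    """Median by the sort-and-pick-the-middle procedure (non-empty input assumed)."""
    ordered = sorted(values)        # step 1: ascending order
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: 1-based position [(n-1)/2]+1 is 0-based index (n-1)//2
        return ordered[(n - 1) // 2]
    # Even n: average the (n/2)th and [(n/2)+1]th values (0-based: n//2 - 1 and n//2)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(find_median([5, 1, 3]))       # odd number of made-up values
print(find_median([4, 1, 3, 2]))    # even number of made-up values
```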

3. Mean: a.k.a. the average, the most commonly used
measure of central tendency
Procedure for finding the mean of a data set:
1) Sum all the values in the data set
2) Divide the sum by the number of values in the data set
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Use the mean when you have interval or ratio data sets
with a large sample size, few (or no) outliers, and a
reasonably symmetric unimodal distribution
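
To see why outliers matter for the mean, here is a small sketch comparing the mean and the median on two made-up lists that differ in a single value. The name `mean_of` and both lists are hypothetical; `statistics.median` is from the Python standard library.

```python
import statistics

def mean_of(values):
    """Arithmetic mean: sum the values, then divide by how many there are."""
    return sum(values) / len(values)

symmetric = [10, 12, 14, 16, 18]       # made-up values with no outlier
with_outlier = [10, 12, 14, 16, 100]   # same values, largest replaced by an outlier

print(mean_of(symmetric), statistics.median(symmetric))
print(mean_of(with_outlier), statistics.median(with_outlier))
```

In this sketch the single outlier moves the mean from 14 to 30.4, while the median stays at 14.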

An example data set: Daily low temperatures recorded
in Chapel Hill from January 18, 2005 through January
31, 2005 in degrees Fahrenheit:
Jan. 18   11 degrees      Jan. 25   25 degrees
Jan. 19   11 degrees      Jan. 26   33 degrees
Jan. 20   25 degrees      Jan. 27   22 degrees
Jan. 21   29 degrees      Jan. 28   18 degrees
Jan. 22   27 degrees      Jan. 29   19 degrees
Jan. 23   14 degrees      Jan. 30   30 degrees
Jan. 24   11 degrees      Jan. 31   27 degrees
For these 14 values, we will calculate all three measures
of central tendency - the mode, median, and mean

1. Mode: find the most frequently occurring value
1) Sort the data, putting the values in ascending order:
11, 11, 11, 14, 18, 19, 22, 25, 25, 27, 27, 29, 30, 33
2) Count the instances of each value:
Value:  11   14   18   19   22   25   27   29   30   33
Count:  3x   1x   1x   1x   1x   2x   2x   1x   1x   1x
3) Find the value that has the most occurrences:
In this case, the mode is 11 degrees Fahrenheit, but is
this a good measure of the central tendency of this data?
Had there only been two days with a recorded
temperature of 11 degrees, what would be the mode?

2. Median: half of the values are above and half are below this value
1) Sort the data, putting the values in ascending order:
11, 11, 11, 14, 18, 19, 22, 25, 25, 27, 27, 29, 30, 33
2) Find the value with an equal number of values above
and below it (if there is an even number of values, you
will need to average two values together):
Even number of observations: average the (n/2)th and
[(n/2)+1]th values
Here, n=14, so average the (14/2)th and [(14/2)+1]th values,
i.e. the 7th and 8th values: (22+25)/2 = 23.5 degrees F
Here, the median is 23.5 degrees F. Is this a good
measure of central tendency for this data?

3. Mean: a.k.a. the average, the most commonly used
measure of central tendency
1) Sum all the values in the data set:
$$\sum_{i=1}^{n} x_i = 11 + 11 + 11 + 14 + 18 + 19 + 22 + 25 + 25 + 27 + 27 + 29 + 30 + 33 = 302$$
2) Divide the sum by the number of values in the data set:
Here, n=14, so calculate the mean using 302/14 = 21.57
The mean is 21.57 degrees F. Is this a good measure
of central tendency for this data set?
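
The three hand calculations can be cross-checked with Python's standard `statistics` module. This is only a verification sketch: the variable name `temps` is an assumption, and `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

# Daily low temperatures, Jan. 18-31, 2005, in degrees F (in date order)
temps = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]

modes = statistics.multimode(temps)   # list of every value tied for most frequent
med = statistics.median(temps)        # averages the two middle values when n is even
avg = statistics.mean(temps)

print(modes, med, round(avg, 2))      # prints: [11] 23.5 21.57
```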
Measures of Dispersion - Review

1. Standard Deviation: the most frequently used
measure of dispersion because it has the same units as
the values and their mean
2. Z-scores: these express the difference of an individual value
from the mean in terms of standard deviations,
and thus can be compared to z-scores drawn from other
data sets or distributions
3. Coefficient of Variation: an overall measure of
dispersion that is normalized with respect to the mean
of the same distribution, and thus is comparable to
coefficients of variation from other data sets
Measures of Dispersion - Review

1. Standard Deviation: standard deviation is
calculated by taking the square root of the variance:
Population standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
Why do we prefer standard deviation over
variance as a measure of dispersion? Its magnitude
and units match those of the values and their mean.
Measures of Dispersion - Review

1. Standard Deviation: the most frequently used measure of
dispersion because it has the same units as
the values and their mean (unlike variance)
Procedure for finding the standard deviation of a data set:
1) Calculate the mean
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
3) Square each of the statistical distances, $(x_i - \bar{x})^2$
4) Sum the squared statistical distances to get the sum of squares
5) Divide the sum of squares by N for a population or by
(n-1) for a sample; this gives you the variance
6) Take the square root of the variance to get the standard
deviation
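
The six steps map directly onto code. The following is a minimal sketch, assuming a hypothetical function name `standard_deviation` and a `sample` flag (not part of the lecture) to switch between the (n-1) and N denominators; it is applied to the daily low temperatures from the central tendency example.

```python
import math

def standard_deviation(values, sample=True):
    """Standard deviation by the six-step procedure."""
    n = len(values)
    mean = sum(values) / n                            # step 1: the mean
    distances = [x - mean for x in values]            # step 2: statistical distances
    sum_of_squares = sum(d ** 2 for d in distances)   # steps 3-4: square, then sum
    divisor = n - 1 if sample else n                  # step 5: (n-1) for a sample, N for a population
    variance = sum_of_squares / divisor
    return math.sqrt(variance)                        # step 6: square root of the variance

# Daily low temperatures, Jan. 18-31, 2005, in degrees F
temps = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]
print(standard_deviation(temps))                  # treating the 14 days as a sample
print(standard_deviation(temps, sample=False))    # treating them as a whole population
```

The sample result is always somewhat larger than the population result for the same values, because the same sum of squares is divided by a smaller number.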

2. Z-scores: these express the difference of an individual value
from the mean in terms of standard deviations,
and thus can be compared to z-scores drawn from other
data sets or distributions
Procedure for finding the z-score of an observation:
1) Calculate the mean
2) Calculate the statistical distance $(x_i - \bar{x})$ for the value
where we wish to find the z-score
3) Calculate the standard deviation
4) Calculate the z-score using the formula
$$Z\text{-score} = \frac{x_i - \bar{x}}{S}$$
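
A short sketch of the z-score procedure, using the standard library for the mean and the sample standard deviation. The function name `z_score` and the three made-up values are assumptions chosen so the arithmetic is easy to follow (mean 4, sample standard deviation 2).

```python
import statistics

def z_score(x, values):
    """Distance of x from the mean of values, in units of sample standard deviations."""
    mean = statistics.mean(values)     # step 1
    distance = x - mean                # step 2
    s = statistics.stdev(values)       # step 3: sample standard deviation (n - 1)
    return distance / s                # step 4

data = [2, 4, 6]                       # made-up values: mean 4, S = 2
print(z_score(6, data))                # one standard deviation above the mean
print(z_score(2, data))                # one standard deviation below the mean
```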

3. Coefficient of Variation: an overall measure of
dispersion that is normalized with respect to the mean
of the same distribution, and thus is comparable to
coefficients of variation from other data sets
Procedure for finding the coef. of variation for a data set:
1) Calculate the mean
2) Calculate the standard deviation
3) Calculate the coefficient of variation using the formula
$$\text{Coefficient of variation} = \frac{\sigma}{\mu} \ \text{or} \ \frac{S}{\bar{x}} \quad (\times 100\%)$$
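
The sketch below illustrates what "normalized with respect to the mean" buys us: rescaling every value (for example, changing units from metres to centimetres) changes the mean and the standard deviation but leaves the coefficient of variation unchanged. The function name, the `as_percent` flag, and the values are all made up for illustration, and a non-zero mean is assumed.

```python
import statistics

def coefficient_of_variation(values, as_percent=False):
    """Sample standard deviation divided by the mean (mean assumed non-zero)."""
    cv = statistics.stdev(values) / statistics.mean(values)
    return cv * 100 if as_percent else cv

metres = [2, 4, 6]                          # made-up lengths
centimetres = [x * 100 for x in metres]     # the same lengths in different units

print(coefficient_of_variation(metres))
print(coefficient_of_variation(centimetres))
print(coefficient_of_variation(metres, as_percent=True))
```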

We will use the same example data set: Daily low CH
temps. Jan. 18-31, 2005 in degrees F:
Jan. 18   11 degrees      Jan. 25   25 degrees
Jan. 19   11 degrees      Jan. 26   33 degrees
Jan. 20   25 degrees      Jan. 27   22 degrees
Jan. 21   29 degrees      Jan. 28   18 degrees
Jan. 22   27 degrees      Jan. 29   19 degrees
Jan. 23   14 degrees      Jan. 30   30 degrees
Jan. 24   11 degrees      Jan. 31   27 degrees
For these 14 values, we will calculate the three measures
of dispersion listed above - the standard deviation, some
z-scores, and the coefficient of variation for this data set

1. Standard Deviation: the most frequently used measure of
dispersion because it has the same units as
the values and their mean (unlike variance)
1) Calculate the mean
We have previously found the mean = 21.57 degrees F
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
Jan. 18   (11 - 21.57) = -10.57      Jan. 25   (25 - 21.57) = 3.43
Jan. 19   (11 - 21.57) = -10.57      Jan. 26   (33 - 21.57) = 11.43
Jan. 20   (25 - 21.57) = 3.43        Jan. 27   (22 - 21.57) = 0.43
Jan. 21   (29 - 21.57) = 7.43        Jan. 28   (18 - 21.57) = -3.57
Jan. 22   (27 - 21.57) = 5.43        Jan. 29   (19 - 21.57) = -2.57
Jan. 23   (14 - 21.57) = -7.57       Jan. 30   (30 - 21.57) = 8.43
Jan. 24   (11 - 21.57) = -10.57      Jan. 31   (27 - 21.57) = 5.43
I have rounded the values for display here to 2 decimal places; ideally
you want to do as little rounding as possible

1. Standard Deviation cont.
3) Square each of the statistical distances, $(x_i - \bar{x})^2$
Jan. 18   (-10.57)^2 = 111.76      Jan. 25   (3.43)^2 = 11.76
Jan. 19   (-10.57)^2 = 111.76      Jan. 26   (11.43)^2 = 130.61
Jan. 20   (3.43)^2 = 11.76         Jan. 27   (0.43)^2 = 0.18
Jan. 21   (7.43)^2 = 55.18         Jan. 28   (-3.57)^2 = 12.76
Jan. 22   (5.43)^2 = 29.47         Jan. 29   (-2.57)^2 = 6.61
Jan. 23   (-7.57)^2 = 57.33        Jan. 30   (8.43)^2 = 71.04
Jan. 24   (-10.57)^2 = 111.76      Jan. 31   (5.43)^2 = 29.47
4) Sum the squared statistical distances, the sum of squares
$$\text{Sum of Squares} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = 751.43$$

1. Standard Deviation cont.
5) Divide the sum of squares by N for a population or by
(n-1) for a sample; this gives you the variance
Here, our sample n = 14, so the variance is 751.43/(14-1) = 57.8
6) Take the square root of the variance to calculate the
standard deviation
Taking the square root of our variance (57.8) gives us
the standard deviation for our data set: $\sqrt{57.8} = 7.6$ degrees F

2. Z-scores: we will calculate z-scores for the lowest and
highest temperatures in our sample (11 and 33 degrees)
1) Calculate the mean
2) Calculate the statistical distance $(x_i - \bar{x})$ for each value
where we wish to find the z-score
We have already calculated these statistical distances:
Jan. 18   (11 - 21.57) = -10.57
Jan. 26   (33 - 21.57) = 11.43
3) Calculate the standard deviation
We have already calculated the standard deviation for
our data set and found it to be 7.6 degrees F

2. Z-scores cont.
4) Calculate the z-score using the formula
$$Z\text{-score} = \frac{x_i - \bar{x}}{S}$$
i.e. divide the statistical distances by the standard deviation
Jan. 18   -10.57 / 7.6 = -1.39
Jan. 26   11.43 / 7.6 = 1.50
If we had another set of minimum temperatures from a
previous January (from 2004, for example), we could
calculate the z-scores for values from that data set, and
make a reasonable comparison to these values

3. Coefficient of Variation: a normalized measure
of dispersion for the variation throughout a data set
1) Calculate the mean
We have previously found the mean = 21.57 degrees F
2) Calculate the standard deviation
We have previously found the std. dev. = 7.6 degrees F
3) Calculate the coefficient of variation using the formula
$$\text{Coefficient of variation} = \frac{\sigma}{\mu} \ \text{or} \ \frac{S}{\bar{x}} \quad (\times 100\%)$$
Using the example values: 7.6/21.57 = 0.3524 or 35.24%
This value could be compared with that from 2004, etc.
Skewness and Kurtosis - Review

1. Skewness: this statistic measures the
degree of asymmetry exhibited by the data
(i.e. whether there are more observations on
one side of the mean than the other)
2. Kurtosis: this statistic measures the
degree to which the distribution is flat or
peaked

1. Skewness: this statistic measures the degree of
asymmetry exhibited by the data (i.e. whether
there are more observations on one side of the
mean than the other):
$$\text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n S^3}$$
Because the exponent in this moment is odd,
skewness can be positive or negative; positive
skewness has more observations below the mean
than above it (and negative skewness vice versa)

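1. Skewness: this statistic measures the degree of
asymmetry exhibited by the data
Procedure for finding the skewness of a data set:
1) Calculate the mean
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
3) Cube each of the statistical distances, $(x_i - \bar{x})^3$
4) Sum the cubed statistical distances, the sum of cubes
(i.e. this is the numerator in the skewness formula)
5) Divide the sum of cubes by the sample size multiplied
by the standard deviation cubed (i.e. the denominator is
$n S^3$ in $\left[\sum (x_i - \bar{x})^3\right] / \left[n S^3\right]$)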
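
The skewness procedure can be sketched as follows. This follows the lecture's formula, with the sample standard deviation S in the denominator; the function name `skewness` is an assumption, and library routines often use a different denominator convention (for example the population standard deviation), so their results may differ slightly from this one.

```python
import statistics

def skewness(values):
    """Sum of cubed distances from the mean, divided by n * S^3."""
    n = len(values)
    mean = statistics.mean(values)                        # step 1
    s = statistics.stdev(values)                          # sample standard deviation (n - 1)
    sum_of_cubes = sum((x - mean) ** 3 for x in values)   # steps 2-4
    return sum_of_cubes / (n * s ** 3)                    # step 5

# Daily low temperatures, Jan. 18-31, 2005, in degrees F
temps = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]
print(skewness(temps))
print(skewness([1, 2, 3]))   # made-up symmetric values: cubed distances cancel to zero
```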

2. Kurtosis: this statistic measures how flat or
peaked the distribution is, and is formulated as:
$$\text{Kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n S^4} - 3$$
The 3 is subtracted in this formula so that
the kurtosis of a normal distribution
has the value 0 (a distribution with this
kurtosis is termed mesokurtic)

2. Kurtosis: this statistic measures how flat or peaked
the distribution is
Procedure for finding the kurtosis of a data set:
1) Calculate the mean
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
3) Raise each of the statistical distances to the 4th power,
i.e. $(x_i - \bar{x})^4$
4) Sum the statistical distances raised to the 4th power, $\sum (x_i - \bar{x})^4$
5) Divide the sum by the sample size multiplied by the
standard deviation raised to the 4th power (i.e. the
denominator is $n S^4$ in $\left[\sum (x_i - \bar{x})^4\right] / \left[n S^4\right]$)
6) Subtract 3 from $\left[\sum (x_i - \bar{x})^4\right] / \left[n S^4\right]$
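
A matching sketch for kurtosis, again following the lecture's formula with the sample standard deviation S. The function name `kurtosis` is an assumption, and as with skewness, other software may use a different denominator or omit the subtraction of 3, so values are only comparable when the same convention is used.

```python
import statistics

def kurtosis(values):
    """Sum of 4th-power distances divided by n * S^4, minus 3."""
    n = len(values)
    mean = statistics.mean(values)                          # step 1
    s = statistics.stdev(values)                            # sample standard deviation (n - 1)
    sum_of_fourths = sum((x - mean) ** 4 for x in values)   # steps 2-4
    return sum_of_fourths / (n * s ** 4) - 3                # steps 5-6

# Daily low temperatures, Jan. 18-31, 2005, in degrees F
temps = [11, 11, 25, 29, 27, 14, 11, 25, 33, 22, 18, 19, 30, 27]
result = kurtosis(temps)
print(result, "platykurtic" if result < 0 else "not platykurtic")
```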
Skewness & Kurtosis - Review

We will use the same example data set: Daily low CH
temps. Jan. 18-31, 2005 in degrees F:
Jan. 18   11 degrees      Jan. 25   25 degrees
Jan. 19   11 degrees      Jan. 26   33 degrees
Jan. 20   25 degrees      Jan. 27   22 degrees
Jan. 21   29 degrees      Jan. 28   18 degrees
Jan. 22   27 degrees      Jan. 29   19 degrees
Jan. 23   14 degrees      Jan. 30   30 degrees
Jan. 24   11 degrees      Jan. 31   27 degrees
Using these 14 values, we will calculate the two
distribution shape descriptive statistics listed above, the
skewness and kurtosis for this data set

1. Skewness: this statistic measures the degree of
asymmetry exhibited by the data
1) Calculate the mean
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
We have previously calculated the statistical distances:
Jan. 18   (11 - 21.57) = -10.57      Jan. 25   (25 - 21.57) = 3.43
Jan. 19   (11 - 21.57) = -10.57      Jan. 26   (33 - 21.57) = 11.43
Jan. 20   (25 - 21.57) = 3.43        Jan. 27   (22 - 21.57) = 0.43
Jan. 21   (29 - 21.57) = 7.43        Jan. 28   (18 - 21.57) = -3.57
Jan. 22   (27 - 21.57) = 5.43        Jan. 29   (19 - 21.57) = -2.57
Jan. 23   (14 - 21.57) = -7.57       Jan. 30   (30 - 21.57) = 8.43
Jan. 24   (11 - 21.57) = -10.57      Jan. 31   (27 - 21.57) = 5.43

1. Skewness cont.
3) Cube each of the statistical distances, $(x_i - \bar{x})^3$
Jan. 18   (-10.57)^3 = -1181.41      Jan. 25   (3.43)^3 = 40.3
Jan. 19   (-10.57)^3 = -1181.41      Jan. 26   (11.43)^3 = 1492.71
Jan. 20   (3.43)^3 = 40.3            Jan. 27   (0.43)^3 = 0.08
Jan. 21   (7.43)^3 = 409.94          Jan. 28   (-3.57)^3 = -45.55
Jan. 22   (5.43)^3 = 159.98          Jan. 29   (-2.57)^3 = -17
Jan. 23   (-7.57)^3 = -434.04        Jan. 30   (8.43)^3 = 598.77
Jan. 24   (-10.57)^3 = -1181.41      Jan. 31   (5.43)^3 = 159.98
4) Sum the cubed statistical distances, the sum of cubes
$$\text{Sum of cubes} = \sum_{i=1}^{n} (x_i - \bar{x})^3 = -1138.78$$

1. Skewness cont.
5) Divide the sum of cubes (-1138.78) by $n S^3$ (S = 7.6 from
above):
$$\frac{\sum (x_i - \bar{x})^3}{n S^3} = \frac{-1138.78}{14 \times (7.6)^3} = \frac{-1138.78}{14 \times 438.98} = \frac{-1138.78}{6145.72} = -0.1851$$
(The final figure reflects the unrounded standard deviation; dividing
the rounded intermediate values shown gives about -0.1853)
The negative value of skewness indicates that our sample
distribution has greater frequencies at the higher values of
temperature (although interpreting skewness with a sample
this small and a distribution that is not really normally
shaped is somewhat of a stretch)

2. Kurtosis: this statistic measures the degree to which
the distribution is flat or peaked
1) Calculate the mean
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
We have previously calculated the statistical distances:
Jan. 18   (11 - 21.57) = -10.57      Jan. 25   (25 - 21.57) = 3.43
Jan. 19   (11 - 21.57) = -10.57      Jan. 26   (33 - 21.57) = 11.43
Jan. 20   (25 - 21.57) = 3.43        Jan. 27   (22 - 21.57) = 0.43
Jan. 21   (29 - 21.57) = 7.43        Jan. 28   (18 - 21.57) = -3.57
Jan. 22   (27 - 21.57) = 5.43        Jan. 29   (19 - 21.57) = -2.57
Jan. 23   (14 - 21.57) = -7.57       Jan. 30   (30 - 21.57) = 8.43
Jan. 24   (11 - 21.57) = -10.57      Jan. 31   (27 - 21.57) = 5.43

2. Kurtosis cont.
3) Raise each of the statistical distances to the 4th power,
$(x_i - \bar{x})^4$
Jan. 18   (-10.57)^4 = 12489.2      Jan. 25   (3.43)^4 = 138.18
Jan. 19   (-10.57)^4 = 12489.2      Jan. 26   (11.43)^4 = 17059.56
Jan. 20   (3.43)^4 = 138.18         Jan. 27   (0.43)^4 = 0.03
Jan. 21   (7.43)^4 = 3045.24        Jan. 28   (-3.57)^4 = 162.69
Jan. 22   (5.43)^4 = 868.44         Jan. 29   (-2.57)^4 = 43.72
Jan. 23   (-7.57)^4 = 3286.33       Jan. 30   (8.43)^4 = 5046.8
Jan. 24   (-10.57)^4 = 12489.2      Jan. 31   (5.43)^4 = 868.44
4) Sum the statistical distances raised to the 4th power
$$\text{Sum of 4th powers} = \sum_{i=1}^{n} (x_i - \bar{x})^4 = 68125.24$$

2. Kurtosis cont.
5) Divide the sum of 4th powers (68125.24) by $n S^4$ (S = 7.6
from above):
$$\frac{\sum (x_i - \bar{x})^4}{n S^4} = \frac{68125.24}{14 \times (7.6)^4} = \frac{68125.24}{14 \times 3341.1} = \frac{68125.24}{46775.32} = 1.4564$$
(Here 3341.1 is $S^4$ computed from the unrounded standard deviation)
6) Subtract 3 from $\left[\sum (x_i - \bar{x})^4\right] / \left[n S^4\right]$
Using our values, the kurtosis is 1.4564 - 3 = -1.5436
Because this kurtosis is < 0, this sample has a
platykurtic distribution, meaning the curve is flatter
than a normal curve (but caveats to interpretation apply)
