Descriptive Statistics
Introduction to Statistics:
1. Statistics: Science of learning from data. It deals with collection, organization, analysis,
and interpretation of data.
Ex. (a) Observe gender variable among all students in this class: Male/ female
(b) Observe infant birth weights in a hospital: 7 lbs., 6 lbs 2 oz, 9 lbs etc.
2. Collect data.
4. Draw conclusion/inference.
10. To determine average parking time of patrons who park at the mall, a statistician monitors
and records parking times of 105 patrons on July 6, 2012.
Any data set is a list of observations for a variable (numerical or non numerical).
Variable
Examples:
Nominal variable: (a) color: “red”, “blue”, “green”, etc (b) nationality: “American”,
“Canadian”, “Australian” etc, (c) Gender: “male”, “female”, (d) zip codes
Ordinal variable: (a) letter grades: “A”, “B”, “C”, “D”, and “F”, (b) 'completely agree',
'mostly agree', 'mostly disagree', 'completely disagree' when measuring opinion.
Discrete variable: Number of objects (cars, books, bottles, pens, players, committee
members, coins, students, cellular phones etc.)
Continuous variable: Height, weight, distance, volume, temperature, cholesterol level, age etc.
Ex. Organize the following data by constructing a frequency distribution and relative frequency
distribution for the colors of M&Ms:
Br Y G O Bl R R Bl O G Y Br G Y Br O Bl R G O R Br Y
G Br Y O R O R Y Br R Br Br Y Y R Br R Br Br Y Y Br
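As a sketch of how this tally could be done in R, the colors can be entered as a character vector and counted with table(). The vector name mm and the object names freq and rel_freq are illustrative choices, not part of the original notes.

```r
# M&M colors exactly as listed above
# (Br = brown, Y = yellow, G = green, O = orange, Bl = blue, R = red)
mm = c("Br","Y","G","O","Bl","R","R","Bl","O","G","Y","Br","G","Y","Br",
       "O","Bl","R","G","O","R","Br","Y",
       "G","Br","Y","O","R","O","R","Y","Br","R","Br","Br","Y","Y","R",
       "Br","R","Br","Br","Y","Y","Br")

freq = table(mm)              # frequency distribution
rel_freq = prop.table(freq)   # relative frequency distribution (sums to 1)

# side-by-side table of counts and proportions
cbind(Frequency = freq, RelFreq = round(rel_freq, 3))
```

The counts obtained this way agree with the frequencies used in the bar graph code that follows (12 brown, 10 yellow, 9 red, 6 orange, 3 blue, 5 green).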
1. Bar graph: A graph constructed by labeling each category of data on the x-axis or the y-axis
and frequency/ relative frequency on the other axis.
• Bars are rectangles separate from one another and of equal width.
• Bars represent categories and their heights represent frequencies/ relative frequencies.
Ex. Using the same data for M&Ms, construct a bar graph of the frequency distribution for
M&M colors:
freq = c(12,10,9,6,3,5)
Gems = matrix(freq, ncol=1, byrow=T)
rownames(Gems) = c("Brown","Yellow","Red","Orange","Blue", "Green")
colnames(Gems) = c("Frequency")
Gems = as.table(Gems)
Gems = as.data.frame(Gems) # reading as data frame
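The code above prepares the data but the plotting call itself is not shown in these notes. One possible way to draw the bar graph, assuming base R graphics are acceptable, is sketched here; the names color_names and mids are hypothetical.

```r
freq = c(12, 10, 9, 6, 3, 5)
color_names = c("Brown", "Yellow", "Red", "Orange", "Blue", "Green")

# barplot() draws one separated bar of equal width per category and
# returns the x-coordinates of the bar midpoints
mids = barplot(freq, names.arg = color_names,
               main = "bar graph of M&M colors",
               xlab = "color", ylab = "frequency")
```

Passing freq / sum(freq) instead of freq would give the relative frequency version of the same graph.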
2. Pie chart: A pie chart is a circle divided into sectors. The sectors represent the categories and
the area of a sector is proportional to relative frequency of the category.
Ex. The following data represent the marital status of US residents (in millions) 18 years of age or
older in 2006. Calculate the relative frequencies in percentage.
freq = c(125,290,30,55)
MaritalStatus = matrix(freq, ncol=1, byrow=T)
rownames(MaritalStatus) = c("Never Married", "Married", "Widowed", "Divorced")
colnames(MaritalStatus) = c("Frequency")
MaritalStatus = as.table(MaritalStatus)
MaritalStatus = as.data.frame(MaritalStatus)
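A minimal sketch of the percentage calculation and of a pie chart for these frequencies follows. It assumes base R's pie() function; the names status and pct are illustrative.

```r
freq = c(125, 290, 30, 55)
status = c("Never Married", "Married", "Widowed", "Divorced")

# relative frequency in percent = 100 * frequency / total
pct = 100 * freq / sum(freq)
names(pct) = status
pct

# each sector's area is proportional to the category's relative frequency
pie(freq, labels = paste0(status, " (", pct, "%)"),
    main = "marital status")
```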
For a discrete variable, if the possible values are relatively few then consider each value as a
category.
Ex. The following discrete data represent the number of cars in a household based on a random
sample of 50 households. Construct a frequency and relative frequency distribution.
3 0 1 2 1 1 1 2 0 2 4 2
2 2 1 2 2 0 2 4 1 1 3 2
4 1 2 1 2 2 3 3 2 1 2 2
0 3 2 2 2 3 2 1 2 2 1 1
3 5
Frequency Distribution:
# ======R codes for histogram & barplot of car data ======
d = c(3,0,1,2,1,1,1,2,0,2,4,2,2,2,1,2,2,0,2,4,1,1,3,2,4,1,2,1,2,2,3,3,2,1,
2,2,0,3,2,2,2,3,2,1,2,2,1,1,3,5)
hist(d, main="histogram of car data", xlab="number of cars", ylab="number of households",
yaxt="n", breaks=c(0,1,2,3,4,5,6), right=FALSE)
axis(2, at=c(0:22), tck=-.025, las=0)
cars = table(d) # frequency of each value; note this name masks R's built-in cars data set
cars = as.data.frame(cars) # reading as data frame
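Because the possible values are few, each value (0 to 5 cars) can be treated as a category. The following sketch builds the frequency and relative frequency distribution for the household data with table(); the object names are illustrative.

```r
d = c(3,0,1,2,1,1,1,2,0,2,4,2,2,2,1,2,2,0,2,4,1,1,3,2,4,1,2,1,2,2,3,3,2,1,
      2,2,0,3,2,2,2,3,2,1,2,2,1,1,3,5)

freq = table(d)               # households with 0, 1, 2, ... cars
rel_freq = prop.table(freq)   # proportion of the 50 households

cbind(Frequency = freq, RelFreq = rel_freq)

# bar graph treating each number of cars as its own category
barplot(freq, main = "barplot of car data",
        xlab = "number of cars", ylab = "number of households")
```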
• For continuous data, or for discrete data with a relatively large number of possible
values, first construct a grouped frequency distribution to draw a histogram.
1. Find the range of the data, where range = (highest data value - lowest data value)
The common class length, denoted d, is called the class width (or bin width). Choose the class width
and the number of classes such that their product is slightly larger than the range.
3. Pick a starting point (lower limit) smaller than the lowest data value and add the class width to
get the lower limit of the next class. Ensure the upper limit of the first and the lower limit of the
second class do not overlap (classes are disjoint) and classes cover the entire range of data.
Ex. The following data represent integer scores for a statistics exam.
60 47 82 95 88 72 67 66 68 98 90 77
86 58 64 95 74 72 88 74 77 39 90 63
68 97 70 64 70 70 58 78 89 44 55 85
82 83 72 77 72 86 50 94 92 80 91 75
76 78
2. Select number of classes and class width (for the first table),
number of classes = 7 and class width = 10. (7*10 = 70 works because 70 is larger than the range, 98 - 39 = 59)
3. Pick a starting point smaller than the lowest data value 39 (30 which is 1st lower limit).
Next, add the class width 10 to 30 to get the lower limit 40 of the 2nd class.
Important: For a grouped frequency distribution, classes must be disjoint and must cover the
entire range.
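The steps above can be sketched in R with cut(), which assigns each score to a class. This assumes left-closed classes [30,40), [40,50), and so on, which for integer scores is the same as 30-39, 40-49, etc.; the object names are illustrative.

```r
scores = c(60,47,82,95,88,72,67,66,68,98,90,77,
           86,58,64,95,74,72,88,74,77,39,90,63,
           68,97,70,64,70,70,58,78,89,44,55,85,
           82,83,72,77,72,86,50,94,92,80,91,75,
           76,78)

diff(range(scores))               # step 1: range = 98 - 39 = 59

breaks = seq(30, 100, by = 10)    # steps 2-3: 7 classes of width 10 from 30

# right = FALSE makes the classes disjoint: each includes its lower limit only
classes = cut(scores, breaks = breaks, right = FALSE)
freq = table(classes)
cbind(Frequency = freq, RelFreq = prop.table(freq))

hist(scores, breaks = breaks, right = FALSE,
     main = "histogram of exam scores", xlab = "score")
```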
REMARK: The shape of a histogram (distribution) for a data set changes as the class width (d) or
the number of classes $k$ changes. If $n$ = sample size, then one way to find $k$ is to use the
Sturges formula: $k = \lceil \log_2 n \rceil + 1$ (from binomial assuming normality). For $n = 50$, $k = 7$.
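As a quick check of the value quoted for n = 50, the formula can be evaluated directly. This sketch reads the formula as a ceiling of the base-2 logarithm plus one; nclass.Sturges() is the helper that hist() uses by default in base R.

```r
n = 50
log2(n)                        # about 5.64

k = ceiling(log2(n)) + 1       # 6 + 1
k

# base R's helper takes a data vector and uses only its length
nclass.Sturges(numeric(n))
```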
Draw a histogram of the same data with 13 classes & class width 5:
Identifying the shape of a distribution: (The shape describes a quantitative variable/ data)
2. Stem and Leaf Plots: This is a sorting/ graphing technique sometimes used in computer
applications when the data sets are small.
In this plot, each data value is split into a “stem” and a “leaf” where the “leaf” is usually the
last/rightmost digit of the number and the other digits to the left of the “leaf” form the “stem”.
Sorting the data first helps in drawing this plot.
Ex. Construct a stem and leaf plot of the following data (ages of people).
12 20 23 32 35 38 38 39 41 43 43 50
51 52 53 53 55 58 59 59 85
Stem | Leaf
1 | 2
2 | 03
3 | 25889
4 | 133
5 | 012335899
8 | 5
NOTE: The shape of the stem and leaf plot, if rotated 90 degrees anticlockwise, resembles the
shape of a histogram. Usually there is no need to sort the leaves, although computer packages
typically do.
age = c(12,20,23,32,35,38,38,39,41,43,43,50,51,52,53,53,55,58,59,59,85)
stem(age, scale =2)
a) Sample Mean ($\bar{x}$): Let $x_1, x_2, \ldots, x_n$ be a set of sample values from a population. The
sample mean is: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
$\bar{x} = 2$
b) Sample Median ($\tilde{x}$): The value that lies in the middle of a data set after arranging the data set
in ascending or descending order.
For odd sample size n: median ($\tilde{x}$) = $\left(\frac{n+1}{2}\right)$th value in the sorted data.
For even sample size n: median ($\tilde{x}$) = average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th values in the sorted data.
c) Sample Mode: The value(s) of the data that occurs most frequently in the data set.
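To illustrate the three measures of center, the sketch below applies them to the ages data from the stem and leaf example. Base R has mean() and median() but no built-in function for the statistical mode, so the mode is found from a frequency table; the name modes is illustrative.

```r
age = c(12,20,23,32,35,38,38,39,41,43,43,50,51,52,53,53,55,58,59,59,85)

mean(age)      # sum of the 21 values divided by 21
median(age)    # n = 21 is odd, so this is the (21+1)/2 = 11th sorted value

# mode: every value whose frequency equals the largest frequency
tab = table(age)
modes = as.numeric(names(tab)[tab == max(tab)])
modes
```

Here four ages occur twice each, so the data set has more than one mode.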
Sample Percentiles: After arranging a dataset in increasing order, percentiles divide the data into
100 equal parts.
The $k$th percentile $P_k$ of a data set is a value such that at least $k$% of the observations are less than
or equal to $P_k$ and at least $(100 - k)$% of the observations are greater than or equal to $P_k$.
Alternative notation: $100p$th (i.e., $k$th) percentile, where $0 < p = \frac{k}{100} < 1$.
Remark: Just like population percentiles, sample percentiles may not be unique.
Ex. Find 75th, 40th , and 50th percentile of rainfall (inches) in Boston during the month of April
for 15 years:
9.6, 2.5, 3.9, 4.1, 5.9, 1.1, 2.7, 4, 4.7, 6.1, 1.8, 3.4, 4, 5.2, 6.2
Ordered data = 1.1, 1.8, 2.5, 2.7, 3.4, 3.9, 4.0, 4.0, 4.1, 4.7, 5.2, 5.9, 6.1, 6.2, 9.6.
𝑟 = 15 ∗ 0.75 = 11.25; ceiling of 𝑟 = 12; 75th percentile = 5.9
𝑟 = 15 ∗ 0.4 = 6; integer; 40th percentile = average of 3.9 and 4 = 3.95
𝑟 = 15 ∗ 0.5 = 7.5; ceiling 𝑜𝑓 𝑟 = 8; 50th percentile = 8th value = 4.0
rain = c(9.6,2.5,3.9,4.1,5.9,1.1,2.7,4,4.7,6.1,1.8,3.4,4,5.2,6.2)
quantile(rain, probs=c(0.75, 0.40, 0.50) )
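R output (by default, quantile() interpolates linearly between order statistics, so its values can differ from the hand calculation above):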
75% 40% 50%
5.55 3.96 4.00
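The hand rule used in the example (compute r = n·k/100; if r is an integer, average the r-th and (r+1)-th sorted values, otherwise take the value at position ceiling of r) can be written as a small function. This is only a sketch for 0 < k < 100, and the function name percentile_by_rank is hypothetical.

```r
rain = c(9.6,2.5,3.9,4.1,5.9,1.1,2.7,4,4.7,6.1,1.8,3.4,4,5.2,6.2)

percentile_by_rank = function(x, k) {
  xs = sort(x)
  r = length(xs) * k / 100
  if (abs(r - round(r)) < 1e-9) {
    r = round(r)
    mean(xs[c(r, r + 1)])   # r is an integer: average r-th and (r+1)-th values
  } else {
    xs[ceiling(r)]          # otherwise: value at position ceiling(r)
  }
}

# 75th, 40th and 50th percentiles by the hand rule
sapply(c(75, 40, 50), function(k) percentile_by_rank(rain, k))
```

R's quantile() offers several definitions through its type argument; type = 2 is documented as averaging at discontinuities and would be the natural candidate to compare against this rule.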
Sample Quartiles: After arranging a data set in ascending order, 3 quartiles: first quartile $Q_1$,
second quartile $Q_2$, and third quartile $Q_3$ divide the data into 4 equal parts. $Q_2$ = median.
Thus: $Q_1 = P_{25}$, $Q_2 = P_{50}$, $Q_3 = P_{75}$. There are alternative ways to find sample quartiles. For
example, first order a data set to find the median that divides the dataset into 2 halves. Then $Q_1$ is
the median of left half (with smaller values) and $Q_3$ is the median of right half (with larger
values).
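The median-of-halves approach can be sketched as follows for the rainfall data. One assumption is made that the notes do not state: when n is odd, the middle value is left out of both halves.

```r
rain = c(9.6,2.5,3.9,4.1,5.9,1.1,2.7,4,4.7,6.1,1.8,3.4,4,5.2,6.2)

xs = sort(rain)
n = length(xs)

lower_half = xs[1:floor(n / 2)]          # values below the median position
upper_half = xs[(ceiling(n / 2) + 1):n]  # values above the median position

q1 = median(lower_half)
q2 = median(xs)
q3 = median(upper_half)
c(Q1 = q1, Q2 = q2, Q3 = q3)
```

For this data set the results agree with the 25th, 50th and 75th percentiles found by the rank rule in the percentile example.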
• Measures of dispersion are numerical values that describe the spread or variability in the data.
a) Range (R) = largest data value – smallest data value (not very informative and not robust, since it depends only on the two extreme values)
𝑅 = 9 – (−3) = 12
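Sample variance ($S^2$): $S^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \dots + (x_n-\bar{x})^2}{n-1}$
OR $S^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}$
Sample standard deviation (S) = $\sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}}$ has the same unit of measurement as the $x_i$'s.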
• The greater the values of $S^2$ or $S$, the greater the spread of the data and vice versa.
• Interquartile Range (IQR): $Q_3 - Q_1$ represents the spread of the data between $P_{25}$ and $P_{75}$, i.e.,
the middle 50% of the data. A high value of IQR indicates the data is more spread out and vice
versa.
• Outliers: Data points outside the lower/ upper fences can be defined to be outliers.
• If outliers don’t affect a statistic substantially, it is considered resistant/ robust.
• Five number summary: Min value, $Q_1$, median $Q_2$, $Q_3$, and max value are called the 5 number
summary.
• Box Plot: Graphical representation of the five number summary, the upper and lower fences, and
outliers (if they exist).
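The sample variance can be computed from the deviations about the mean or from the sums of the values and of their squares; both forms use the divisor n - 1 and should agree with R's var(). The sketch below checks this on the rainfall data used in the percentile example; the object names are illustrative.

```r
rain = c(9.6,2.5,3.9,4.1,5.9,1.1,2.7,4,4.7,6.1,1.8,3.4,4,5.2,6.2)
n = length(rain)

# definition: squared deviations from the mean, divided by n - 1
var_def = sum((rain - mean(rain))^2) / (n - 1)

# computational form: sum of squares minus (sum)^2 / n, divided by n - 1
var_short = (sum(rain^2) - sum(rain)^2 / n) / (n - 1)

s = sqrt(var_def)              # standard deviation, in inches like the data

c(var_def, var_short, var(rain))
c(s, sd(rain))
```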
Ex. The following data are the interest rates charged by ten credit card companies.
6.5% 12% 14.4% 14.4% 14.3% 13% 13.3% 13.9% 9.9% 14.5%
Sorted data: 6.5, 9.9, 12.0, 13.0, 13.3, 13.9, 14.3, 14.4, 14.4, 14.5
Step 5: Data value less than lower fence: 6.5 Data value greater than upper fence: None
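The result in Step 5 can be reproduced with a short sketch. It assumes the common convention that the fences lie 1.5 × IQR below $Q_1$ and above $Q_3$ (the notes do not show the fence formula), and it takes the quartiles from fivenum(), whose hinges coincide with the medians of the two halves for these ten values.

```r
interest = c(6.5,12,14.4,14.4,14.3,13,13.3,13.9,9.9,14.5)

five = fivenum(interest)   # min, lower hinge, median, upper hinge, max
five

q1 = five[2]
q3 = five[4]
iqr = q3 - q1

# assumed convention: fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# data values outside the fences are flagged as outliers
outliers = interest[interest < lower_fence | interest > upper_fence]
outliers
```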
interest = c(6.5,12,14.4,14.4,14.3,13,13.3,13.9,9.9,14.5)
boxplot(interest, main="boxplot of interest rate data",
horizontal=TRUE)
axis(1, at=c(6:15), tck=-.025, las=0)
Using Box plots to describe distributions:
Based on the hand drawn box plot above, what type of skewness does the distribution have?
$\gamma_1 = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{S^3}$
For a normal distribution $\gamma_1 = 0$. Note that in computing $\gamma_1$, the standard deviation $S$ is computed
using n instead of n-1. There are other measures of skewness, such as the adjusted Fisher-Pearson
coefficient of skewness, the Galton skewness, and the Pearson 2 skewness coefficient (see
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm)
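A minimal sketch of this skewness coefficient is given below, reading the formula as the average cubed deviation divided by the cube of the standard deviation computed with divisor n. The function name skew_g1 is hypothetical; base R has no built-in skewness function.

```r
skew_g1 = function(x) {
  n = length(x)
  dev = x - mean(x)
  s_n = sqrt(sum(dev^2) / n)   # standard deviation with divisor n, not n - 1
  (sum(dev^3) / n) / s_n^3
}

interest = c(6.5,12,14.4,14.4,14.3,13,13.3,13.9,9.9,14.5)
skew_g1(interest)     # negative: the long tail is on the left

skew_g1(c(1, 2, 3))   # perfectly symmetric data give 0
```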
That is why the data sets that have bell-shaped histograms are often called the normal data sets.
For the normal data sets we have the Empirical Rule that follows from the properties of the
normal distribution.
Ex. SAT math scores have a bell-shaped distribution with mean 515 and standard deviation 114.
a) What percentage of SAT scores is less than 401 or greater than 629?
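For part (a), note that 401 and 629 are exactly one standard deviation below and above the mean. Under the Empirical Rule, roughly 68% of bell-shaped data lie within one standard deviation of the mean, so roughly 32% lie outside. The sketch below shows the arithmetic and, for comparison, the corresponding area under an exactly normal curve using pnorm().

```r
mu = 515
sigma = 114

z_low  = (401 - mu) / sigma    # -1: one standard deviation below the mean
z_high = (629 - mu) / sigma    #  1: one standard deviation above the mean

within_1sd = 68                # Empirical Rule approximation, in percent
outside_1sd = 100 - within_1sd
outside_1sd

# area outside one standard deviation for an exactly normal distribution
exact = 100 * (pnorm(401, mu, sigma) + (1 - pnorm(629, mu, sigma)))
exact
```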
Examples: (a) shoe size and weight; (b) height and weight of a student; (c) the amount of time
spent studying for statistics exam and the score on the exam.
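# add the fitted least-squares line to the current scatter plot
# (data is assumed to hold x in column 1 and y in column 2)
abline(lm(data[,2] ~ data[,1]))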
(a) Linear: Points may follow an imaginary line. (b) Nonlinear: Points may follow a curve.
Notice: warmer weather goes with more sales; there seems to be a linear relationship between
the temperature and sales variables.
Sample correlation coefficient (r) measures the strength and direction of linear relationship
between two numerical variables.
• r is a unitless measure such that −1 ≤ 𝑟 ≤ 1. Examine the scatter diagrams below:
If $\bar{x}$ and $\bar{y}$ are the sample means and $S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$ and $S_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}$
are sample standard deviations for n data pairs $(x_i, y_i)$, then:
Ex. Find r for the data (x, y) = (0, 4), (3, 6), (4, 7), (5, 7)
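The formula for r did not survive in these notes, so the sketch below assumes the standard form $r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{S_x}\right)\left(\frac{y_i-\bar{y}}{S_y}\right)$ and compares the result with R's built-in cor().

```r
x = c(0, 3, 4, 5)
y = c(4, 6, 7, 7)
n = length(x)

# standardize each variable, multiply pairwise, sum, divide by n - 1
zx = (x - mean(x)) / sd(x)
zy = (y - mean(y)) / sd(y)
r = sum(zx * zy) / (n - 1)
r

cor(x, y)   # built-in Pearson correlation for comparison
```

By hand, the means are 3 and 6, the sum of cross products of deviations is 9, and the sums of squared deviations are 14 and 6, so r = 9 / sqrt(14 × 6), a strong positive linear relationship.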
Properties of r:
(a) is 2.61
(d) is 0.61 using inches and pounds, but converting inches to centimeters would
Important Note: Association doesn't imply causation, i.e., the fact that two variables are associated does not
necessarily mean that one variable causes the other.
Wrong conclusion: High correlation implies that the firemen cause the fire.
Ex 2. “The faster windmills are observed to rotate, the more wind is observed to be.”
Wrong conclusion: wind is caused by the rotation of windmills. In practice, it is rather the other
way around.
Ex 3. “Children that watch a lot of TV are the most violent. So, TV makes children more
violent.” Not necessarily true. This could easily be the other way round; that is, violent children
like watching more TV than less violent ones.
Ex 4. Study shows that there is a significant correlation (0.791) between a country’s consumption
of chocolate and the number of Nobel prizes (averaged per person).