Descriptive Statistics
Introduction to Statistics:

1. Statistics: Science of learning from data. It deals with collection, organization, analysis,
and interpretation of data.

2. Data: A list of observations for a variable (may be numerical or non-numerical).

Ex. (a) Observe gender variable among all students in this class: Male/ female

(b) Observe infant birth weights in a hospital: 7 lbs., 6 lbs 2 oz, 9 lbs etc.

3. The process of Statistics:

1. Identify the research objective.

2. Collect data.

3. Organize, summarize, and analyze data.

4. Draw conclusion/inference.

4. A population: The entire group of individuals that is of our interest.

5. A sample: A subset of the population being studied.

Ex. To investigate “what percentage of students at NISER is female?”, randomly select 50 students near the University Union.

Population: all NISER students;

Sample: those 50 students selected for the study.

6. A parameter: Numerical characteristic of a population or a model (typically unknown).

Ex. In the example above, the true proportion of female students at NISER.

7. A statistic: Numerical summary of data (typically estimates an unknown parameter).

Ex. Proportion of female students among those selected in the sample (sample proportion).

8. Variable: A characteristic observed on each individual. Ex. Gender, infant birth weight, etc.



Define the following terms in the study/ poll described below:

10. To determine average parking time of patrons who park at the mall, a statistician monitors
and records parking times of 105 patrons on July 6, 2012.

a) Variable: Parking times of patrons who park at the mall

b) Population: All patrons who parked at the mall.

c) Sample: 105 patrons who parked at the mall on July 6, 2012

Describing and summarizing data sets

Any data set is a list of observations for a variable (numerical or non numerical).

Variable
    Qualitative/Categorical (math operations NOT meaningful)
        Nominal (ordering NOT meaningful)
        Ordinal (ordering meaningful)
    Quantitative/Numerical (math operations meaningful)
        Discrete
        Continuous

Examples:

Nominal variable: (a) color: “red”, “blue”, “green”, etc (b) nationality: “American”,
“Canadian”, “Australian” etc, (c) Gender: “male”, “female”, (d) zip codes

Ordinal variable: (a) letter grades: “A”, “B”, “C”, “D”, and “F”, (b) 'completely agree',
'mostly agree', 'mostly disagree', 'completely disagree' when measuring opinion.

Discrete variable: Number of objects (cars, books, bottles, pens, players, committee
members, coins, students, cellular phones etc.)
Continuous variable: Height, weight, distance, volume, temperature, cholesterol level, age etc.

Data can be summarized: Graphically and Numerically.

Organization and Visual Representation of Categorical Data

Categorical Data: Observations corresponding to a categorical variable.

Frequency: Number of occurrences of a category in the data.

$$\text{Relative frequency of a category} = \frac{\text{frequency of the category}}{\text{sum of all frequencies}}$$

Frequency distribution/table: Lists categories of data and the corresponding frequencies of each category.

Relative Frequency Distribution/table: Lists categories of data and the corresponding relative frequencies of each category.

Ex. Organize the following data by constructing a frequency distribution and relative frequency
distribution for the colors of M&Ms:

Br Y G O Bl R R Bl O G Y Br G Y Br O Bl R G O R Br Y

G Br Y O R O R Y Br R Br Br Y Y R Br R Br Br Y Y Br

Frequency/Relative Frequency Table (relative frequencies rounded to 3 decimal places):

Category (Color)   Tally             Frequency   Relative Frequency
Brown              ||||| ||||| ||    12          12/45 ≈ 0.267
Yellow             ||||| |||||       10          10/45 ≈ 0.222
Red                ||||| ||||        9           9/45 = 0.200
Orange             ||||| |           6           6/45 ≈ 0.133
Blue               |||               3           3/45 ≈ 0.067
Green              |||||             5           5/45 ≈ 0.111
Total                                45          1
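
The tally above can also be checked in R. The following is a minimal sketch, assuming the raw colors are typed in with the same abbreviations as in the data listing (Br, Y, G, O, Bl, R); the variable names are illustrative only.

```r
# M&M colors exactly as listed in the example
# (Br = Brown, Y = Yellow, G = Green, O = Orange, Bl = Blue, R = Red)
mm_colors = c("Br","Y","G","O","Bl","R","R","Bl","O","G","Y","Br","G","Y","Br",
              "O","Bl","R","G","O","R","Br","Y",
              "G","Br","Y","O","R","O","R","Y","Br","R","Br","Br","Y","Y","R",
              "Br","R","Br","Br","Y","Y","Br")

freq_table = table(mm_colors)                    # frequency distribution
rel_freq   = round(prop.table(freq_table), 3)    # relative frequencies, 3 decimals

freq_table
rel_freq
```

Here `table()` counts the occurrences of each category and `prop.table()` divides each count by the total, which is the relative frequency formula given earlier.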

1. Bar graph: A graph constructed by labeling each category of data on the x-axis or the y-axis
and frequency/ relative frequency on the other axis.

• Bars are rectangles separate from one another and of equal width.
• Bars represent categories and their heights represent frequencies/ relative frequencies.

Ex. Using the same M&M data, construct a bar graph of the frequency distribution of
M&M colors:

# ===== R code for Bar Plot of M&M data =====

freq = c(12,10,9,6,3,5)
Gems = matrix(freq, ncol=1, byrow=T)
rownames(Gems) = c("Brown","Yellow","Red","Orange","Blue", "Green")
colnames(Gems) = c("Frequency")
Gems = as.table(Gems)
Gems = as.data.frame(Gems) # convert to a data frame with columns Var1, Var2, Freq

barplot(Gems$Freq, main="bar graph of M & M colors", xlab="Colors", ylab="Frequency",
        names.arg = Gems$Var1, col=rainbow(length(freq)))

Pareto chart: A bar graph in which the bars are drawn in decreasing order of frequency or
relative frequency.

# ===== R code for Pareto Plot of M&M data =====

After the previous code… type:

Gems_sorted = Gems[order(Gems$Freq, decreasing=TRUE), ] # sort the rows by frequency
Gems_sorted
barplot(Gems_sorted$Freq, main="pareto graph of M & M colors", xlab="Colors",
        ylab="Frequency", names.arg = Gems_sorted$Var1,
        col=rainbow(length(freq)))

2. Pie chart: A pie chart is a circle divided into sectors. The sectors represent the categories and
the area of a sector is proportional to relative frequency of the category.

Ex. The following data represent the marital status of US residents (in millions) 18 years of age or
older in 2006. Calculate the relative frequencies as percentages.

# ===== R code for Pie chart of Marital Status data =====

freq = c(125,290,30,55)
MaritalStatus = matrix(freq, ncol=1, byrow=T)
rownames(MaritalStatus) = c("Never Married", "Married", "Widowed", "Divorced")
colnames(MaritalStatus) = c("Frequency")
MaritalStatus = as.table(MaritalStatus)
MaritalStatus = as.data.frame(MaritalStatus)

rel.freq = round(freq/sum(freq)*100, digits = 1)
lbls = paste(MaritalStatus$Var1," (", rel.freq,"%)", sep = "")
pie(freq, labels = lbls, col=rainbow(length(freq)))

Organization and Visual Representation of Quantitative/Numerical Data

Quantitative data: Observations corresponding to a quantitative variable.

For a discrete variable, if the possible values are relatively few then consider each value as a
category.

Ex. The following discrete data represent the number of cars in a household based on a random
sample of 50 households. Construct a frequency and relative frequency distribution.

3 0 1 2 1 1 1 2 0 2 4 2
2 2 1 2 2 0 2 4 1 1 3 2
4 1 2 1 2 2 3 3 2 1 2 2
0 3 2 2 2 3 2 1 2 2 1 1
3 5

Frequency Distribution:

Category (No. of Cars)   Tally                          Frequency
0                        ||||                           4
1                        ||||| ||||| |||                13
2                        ||||| ||||| ||||| ||||| ||     22
3                        ||||| ||                       7
4                        |||                            3
5                        |                              1


# ====== R code for histogram & barplot of car data ======
d = c(3,0,1,2,1,1,1,2,0,2,4,2,2,2,1,2,2,0,2,4,1,1,3,2,4,1,2,1,2,2,3,3,2,1,
      2,2,0,3,2,2,2,3,2,1,2,2,1,1,3,5)
hist(d, main="histogram of car data", xlab="number of cars", ylab="number of households",
     yaxt="n", breaks=c(0,1,2,3,4,5,6), right=FALSE)
axis(2, at=c(0:22), tck=-.025, las=0)

freq = c(sum(d==0), sum(d==1), sum(d==2), sum(d==3), sum(d==4), sum(d==5))
cars = matrix(freq, ncol=1, byrow=TRUE)
rownames(cars) = c("no car","1 car","2 cars","3 cars","4 cars","5 cars")
colnames(cars) = c("Frequency")
cars = as.table(cars)
cars = as.data.frame(cars) # convert to a data frame

barplot(cars$Freq, main="bar graph of number of cars", xlab="number of cars",
        ylab="number of households", names.arg = cars$Var1,
        col=rainbow(length(freq)))

1. Histogram: A graph representing the frequency distribution of numerical data. A
histogram consists of rectangles of equal width that touch one another. In a histogram,
classes are marked on the horizontal axis and frequencies are represented by heights on the
vertical axis.

Ex. Using the car data given above, construct a histogram:

• For continuous data, or for discrete data with a relatively large number of possible
values, first construct a grouped frequency distribution to draw a histogram.

Steps to construct a grouped frequency distribution and draw a histogram:

1. Find the range of the data, where range = (highest data value - lowest data value)

2. Select number of classes (k) and create classes/bins/intervals of equal width.

The common class length, denoted d, is called the class width (or bin width). Choose the class width
and the number of classes such that their product is slightly larger than the range.

3. Pick a starting point (lower limit) smaller than the lowest data value and add the class width to
get the lower limit of the next class. Ensure the upper limit of the first and the lower limit of the
second class do not overlap (classes are disjoint) and classes cover the entire range of data.

Ex. The following data represent integer scores for a statistics exam.

60 47 82 95 88 72 67 66 68 98 90 77
86 58 64 95 74 72 88 74 77 39 90 63
68 97 70 64 70 70 58 78 89 44 55 85
82 83 72 77 72 86 50 94 92 80 91 75
76 78

(a) Construct a histogram for the given data.

1. Range of the data = highest data value − lowest data value = 98 − 39 = 59

2. Select number of classes and class width (for the first table):

number of classes = 7 and class width = 10 (7 × 10 = 70 works because 70 is larger than the range, 59).

3. Pick a starting point smaller than the lowest data value, 39; here 30, which is the lower limit of the 1st class.

Next, add the class width 10 to 30 to get the lower limit, 40, of the 2nd class.

Important: For a grouped frequency distribution, classes must be disjoint and must cover the
entire range.
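
The grouped frequency distribution for the exam scores can be tabulated in R. This is a minimal sketch, assuming the classes chosen in the steps above (30 to 39, 40 to 49, ..., 90 to 99), written as left-closed intervals via `right = FALSE`; the variable names are illustrative.

```r
score = c(60,47,82,95,88,72,67,66,68,98,90,77,86,58,64,95,74,72,88,74,77,39,90,
          63,68,97,70,64,70,70,58,78,89,44,55,85,82,83,72,77,72,86,50,94,92,80,
          91,75,76,78)

class_limits = seq(30, 100, by = 10)     # lower limits 30, 40, ..., 90 (and the end point 100)
# cut() assigns each score to a class; right = FALSE gives [30,40), [40,50), ...
score_class = cut(score, breaks = class_limits, right = FALSE)

grouped = table(score_class)             # grouped frequency distribution
grouped
round(prop.table(grouped), 3)            # grouped relative frequency distribution
```

Because the intervals are left-closed and right-open, the classes are disjoint and together cover the whole range of the data, as required above.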

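# ======== R Code for drawing a histogram of exam score data =====

score = c(60,47,82,95,88,72,67,66,68,98,90,77,86,58,64,95,74,72,88,74,77,39,90,
          63,68,97,70,64,70,70,58,78,89,44,55,85,82,83,72,77,72,86,50,94,92,80,91,75,76,78)
# explicit breaks reproduce the 7 classes of width 10 chosen above
# (nclass=7 is only a suggestion to hist() and does not guarantee these classes)
hist(score, breaks=seq(30, 100, by=10), right=FALSE,
     main="Histogram of scores with class width=10", xlab="scores",
     ylab="frequency", col="blue")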

REMARK: The shape of a histogram (distribution) for a data set changes as the class width (d) or
the number of classes $k$ changes. If $n$ = sample size, then one way to find $k$ is to use the
Sturges formula: $k = \lceil \log_2 n \rceil + 1$ (from binomial assuming normality). For $n = 50$, $k = 7$.

Draw a histogram of the same data with 13 classes & class width 5:
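
One possible way to draw this in R is sketched below. The starting point 35 is an assumption (any starting point below 39 for which 13 classes of width 5 cover the data would do), and `nclass.Sturges()` from base R is shown only to check the Sturges count for these data.

```r
score = c(60,47,82,95,88,72,67,66,68,98,90,77,86,58,64,95,74,72,88,74,77,39,90,
          63,68,97,70,64,70,70,58,78,89,44,55,85,82,83,72,77,72,86,50,94,92,80,
          91,75,76,78)

k_sturges = nclass.Sturges(score)     # number of classes suggested by the Sturges formula

breaks5 = seq(35, 100, by = 5)        # 14 break points, i.e. 13 classes of width 5
h = hist(score, breaks = breaks5, right = FALSE,
         main = "Histogram of scores with class width=5",
         xlab = "scores", ylab = "frequency", col = "blue")
h$counts                              # class frequencies used for the bar heights
```

Comparing this plot with the width-10 histogram shows how the apparent shape depends on the choice of class width.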

Identifying the shape of a distribution: (The shape describes a quantitative variable/ data)

2. Stem and Leaf Plots: This is a sorting/ graphing technique sometimes used in computer
applications when the data sets are small.

In this plot, each data value is split into a “stem” and a “leaf” where the “leaf” is usually the
last/rightmost digit of the number and the other digits to the left of the “leaf” form the “stem”.
Sorting the data first helps in drawing this plot.

Ex. Construct a stem and leaf plot of the following data (ages of people).

12 20 23 32 35 38 38 39 41 43 43 50
51 52 53 53 55 58 59 59 85

Stem | Leaf
1    | 2
2    | 03
3    | 25889
4    | 133
5    | 012335899
8    | 5

NOTE: The shape of the stem and leaf plot, if rotated 90 degrees anticlockwise, resembles the
shape of a histogram. Usually there is no need to sort the leaves, although computer packages
typically do.

# ======== stem and leaf plot of age data =======

age = c(12,20,23,32,35,38,38,39,41,43,43,50,51,52,53,53,55,58,59,59,85)
stem(age, scale =2)

Measures of Central Tendency


Measures of central tendency are numerical values that locate, in some sense, the center of the
data.

a) Sample Mean ($\bar{x}$): Let $x_1, x_2, \ldots, x_n$ be a set of sample values from a population. The
sample mean is:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Ex. Find the sample mean of 1, −3, 3, 5, 0, 6

$\bar{x} = (1 - 3 + 3 + 5 + 0 + 6)/6 = 12/6 = 2$

b) Sample Median ($\tilde{x}$): The value that lies in the middle of a data set after arranging the data set
in ascending or descending order.

For odd sample size n: median ($\tilde{x}$) = the $\frac{n+1}{2}$th value in the sorted data.

For even sample size n: median ($\tilde{x}$) = average of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th values in the sorted data.

Ex 1. Find the sample median of 4, 1, 8, 6, 5

Ordered data: 1, 4, 5, 6, 8 Median = 5

Ex 2. Find the sample median of 9, 7, 10, 9, 6, 8

Ordered data: 6, 7, 8, 9, 9, 10 Median = (8+9)/2 = 8.5

R Code: “mean(x)” and “median(x)” where x is the vector of data values.

c) Sample Mode: The value(s) of the data that occurs most frequently in the data set.

Ex. Find the mode(s) of the following data sets.

(a) 0, 1, 2, 3 , 3, −3, 3, 6 ; Mode =

(b) {-1, 1, 1.5 , 1.5, 3, 4, 4.5, 4.5, 6}; Modes =

(c) {-1, 1, 2, 3, 4}; Mode =
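
Base R has no built-in function for the sample mode (`mode()` reports the storage type of an object, not the most frequent value). The helper below, `sample_modes`, is a hypothetical name introduced here as a minimal sketch; it assumes the convention that a data set in which every value occurs exactly once has no mode.

```r
# Hypothetical helper (not part of base R): returns every value that attains
# the highest frequency, or an empty vector if each value occurs only once.
sample_modes = function(x) {
  counts = table(x)                      # frequency of each distinct value
  if (max(counts) == 1) {
    return(numeric(0))                   # all values distinct: no mode
  }
  as.numeric(names(counts)[counts == max(counts)])
}

modes_a = sample_modes(c(0, 1, 2, 3, 3, -3, 3, 6))
modes_b = sample_modes(c(-1, 1, 1.5, 1.5, 3, 4, 4.5, 4.5, 6))
modes_c = sample_modes(c(-1, 1, 2, 3, 4))
```

Printing `modes_a`, `modes_b`, and `modes_c` gives a way to check the answers to the three exercises above.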




Relation between mean/ median and skewness of distribution:

Other Measures of Location: Percentiles and Quartiles

Sample Percentiles: After arranging a dataset in increasing order, percentiles divide the data into
100 equal parts.

The $k$th percentile $P_k$ of a data set is a value such that at least $k\%$ of the observations are less than
or equal to $P_k$ and at least $(100 - k)\%$ of the observations are greater than or equal to $P_k$.

Alternative notation: $100p$th (i.e., $k$th) percentile, where $0 < p = \frac{k}{100} < 1$.

Remark: Just like population percentiles, sample percentiles may not be unique.

An algorithm to compute the sample $100p$th percentile:

1. Arrange the data in increasing order and compute $r = np$.
2. If $r$ is not an integer: round it up to the nearest integer greater than $r$, i.e., the ceiling of $r$.
The rounded integer is the position (index) of the sample $100p$th percentile in the ordered list of
numbers.
3. If $r$ is an integer: the sample $100p$th percentile is found as the average of the data values in
positions $r$ and $r + 1$.

Ex. Find the 75th, 40th, and 50th percentiles of rainfall (inches) in Boston during the month of April
for 15 years:
9.6, 2.5, 3.9, 4.1, 5.9, 1.1, 2.7, 4, 4.7, 6.1, 1.8, 3.4, 4, 5.2, 6.2

Ordered data = 1.1, 1.8, 2.5, 2.7, 3.4, 3.9, 4.0, 4.0, 4.1, 4.7, 5.2, 5.9, 6.1, 6.2, 9.6.
$r = 15 \times 0.75 = 11.25$; ceiling of $r$ = 12; 75th percentile = 12th value = 5.9
$r = 15 \times 0.4 = 6$; integer; 40th percentile = average of the 6th and 7th values = (3.9 + 4.0)/2 = 3.95
$r = 15 \times 0.5 = 7.5$; ceiling of $r$ = 8; 50th percentile = 8th value = 4.0

# ========= R code for finding percentiles of rainfall data ===========

rain = c(9.6,2.5,3.9,4.1,5.9,1.1,2.7,4,4.7,6.1,1.8,3.4,4,5.2,6.2)
quantile(rain, probs=c(0.75, 0.40, 0.50) )

R output:
75% 40% 50%
5.55 3.96 4.00
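
The R output differs from the hand computation because `quantile()` by default (its `type = 7`) interpolates linearly between neighbouring order statistics. The sketch below implements the three-step algorithm directly; `sample_percentile` is a hypothetical helper name, and the comparison with `type = 2` relies on that type averaging at the integer positions and taking the next order statistic otherwise.

```r
# Hypothetical helper implementing the algorithm above (assumes 0 < p < 1)
sample_percentile = function(x, p) {
  x_sorted = sort(x)                          # step 1: order the data
  r = length(x_sorted) * p                    #         and compute r = n * p
  if (isTRUE(all.equal(r, round(r)))) {       # step 3: r is an integer
    r = round(r)
    (x_sorted[r] + x_sorted[r + 1]) / 2       #         average positions r and r + 1
  } else {
    x_sorted[ceiling(r)]                      # step 2: position = ceiling of r
  }
}

rain = c(9.6,2.5,3.9,4.1,5.9,1.1,2.7,4,4.7,6.1,1.8,3.4,4,5.2,6.2)
probs = c(0.75, 0.40, 0.50)

by_hand = sapply(probs, function(p) sample_percentile(rain, p))
by_type2 = quantile(rain, probs = probs, type = 2)

by_hand
by_type2
```

Both vectors should reproduce the hand-computed percentiles, which illustrates the earlier remark that sample percentiles are not unique: different conventions give slightly different values.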

Sample Quartiles: After arranging a data set in ascending order, the 3 quartiles (first quartile $Q_1$,
second quartile $Q_2$, and third quartile $Q_3$) divide the data into 4 equal parts. $Q_2$ = median.

Thus: $Q_1 = P_{25}$, $Q_2 = P_{50}$, $Q_3 = P_{75}$. There are alternative ways to find sample quartiles. For
example, first order a data set to find the median that divides the data set into 2 halves. Then $Q_1$ is
the median of the left half (with smaller values) and $Q_3$ is the median of the right half (with larger
values).

Measures of Dispersion (Range, Variance, Standard deviation, IQR)

• Measures of dispersion are numerical values that describe the spread or variability in the data.

a) Range (R) = largest data value – smallest data value (not very informative and not robust)

Ex. Find the range of the data set −3, 0, −2, 9, 5, 0, −2

𝑅 = 9 – (−3) = 12

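b) Sample variance ($S^2$): If $x_1, x_2, \ldots, x_n$ are data from a population,

$$S^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2}{n-1} \quad \text{OR} \quad S^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n-1}$$

Sample standard deviation: $S = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}$, which has the same unit of measurement as the $x_i$'s.

• The greater the value of $S^2$ or $S$, the greater the spread of the data, and vice versa.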

R code: for variance: var(x), standard deviation: sd(x)
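
As a check on the two variance formulas, the following sketch computes both by hand for the data set used in the range example (−3, 0, −2, 9, 5, 0, −2) and compares them with `var()` and `sd()`; the variable names are illustrative.

```r
x = c(-3, 0, -2, 9, 5, 0, -2)
n = length(x)

# definition: sum of squared deviations from the mean, divided by n - 1
s2_definition = sum((x - mean(x))^2) / (n - 1)

# computational form: (sum of squares - (sum)^2 / n), divided by n - 1
s2_shortcut = (sum(x^2) - sum(x)^2 / n) / (n - 1)

c(s2_definition, s2_shortcut, var(x))     # all three agree
c(sqrt(s2_definition), sd(x))             # standard deviation, same unit as x
```

Here the mean is 1 and the sum of squared deviations is 116, so the variance is 116/6.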

• Interquartile Range (IQR): $Q_3 - Q_1$ represents the spread of the data between $P_{25}$ and $P_{75}$, i.e.,
the middle 50% of the data. A high value of the IQR indicates the data is more spread out, and vice
versa.

Fences:
Lower fence = $Q_1 - 1.5\,(IQR)$
Upper fence = $Q_3 + 1.5\,(IQR)$

• Outliers: Data points outside the lower/ upper fences can be defined to be outliers.
• If outliers don’t affect a statistic substantially, it is considered resistant/ robust.

• As a measure of central tendency, ______________ is least affected by outliers, and thus
it is a more robust measure than ________________.

• As a measure of dispersion, ______________ is a more robust measure than ________________.

• Five number summary: The min value, $Q_1$, median $Q_2$, $Q_3$, and max value are called the 5 number
summary.

• Box Plot: Graphical representation of the five number summary, the upper and lower fences, and
outliers (if they exist).
Ex. The following are the interest rates charged by ten credit card companies.

6.5% 12% 14.4% 14.4% 14.3% 13% 13.3% 13.9% 9.9% 14.5%

Sorted data: 6.5, 9.9, 12.0, 13.0, 13.3, 13.9, 14.3, 14.4, 14.4, 14.5

Step 1: Interquartile range (IQR): $Q_3 - Q_1 = 14.4 - 12 = 2.4$

Lower Fence = $Q_1 - 1.5\,IQR = 12 - 3.6 = 8.4$,

Upper Fence = $Q_3 + 1.5\,IQR = 14.4 + 3.6 = 18$

Step 4: Smallest value larger than the lower fence: 9.9

Largest value smaller than the upper fence: 14.5

Step 5: Data value less than lower fence: 6.5 Data value greater than upper fence: None

# ========= R code for boxplot of interest rate data ==========

interest = c(6.5,12,14.4,14.4,14.3,13,13.3,13.9,9.9,14.5)
boxplot(interest, main="boxplot of interest rate data",
horizontal=TRUE)
axis(1, at=c(6:15), tck=-.025, las=0)
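
The fences and outliers from the hand computation can be reproduced numerically. This is a sketch only; it assumes `quantile(..., type = 2)` as the quartile convention because that type matches the percentile algorithm given earlier, and it uses `boxplot.stats()` to show which points the box plot itself flags.

```r
interest = c(6.5,12,14.4,14.4,14.3,13,13.3,13.9,9.9,14.5)

q = quantile(interest, probs = c(0.25, 0.75), type = 2)   # Q1 and Q3
q1 = q[[1]]
q3 = q[[2]]
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# data values outside the fences are flagged as outliers
outliers = interest[interest < lower_fence | interest > upper_fence]

c(Q1 = q1, Q3 = q3, IQR = iqr, lower = lower_fence, upper = upper_fence)
outliers
boxplot.stats(interest)$out    # points drawn individually by boxplot()
```

For other data sets the quartiles used by `boxplot()` (hinges) may differ slightly from a given percentile convention, so the two sets of flagged points need not always coincide.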

Using Box plots to describe distributions:

Based on the hand drawn box plot above, what type of skewness does the distribution have?

Measures of Skewness: If $x_1, x_2, \ldots, x_n$ are data from a population, then the Fisher-Pearson
moment coefficient of skewness is:

$$\gamma_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{S^3}$$

For a normal distribution $\gamma_1 = 0$. Note that in computing $\gamma_1$, the standard deviation $S$ is computed
using n instead of n−1. There are other measures of skewness, such as the adjusted Fisher-Pearson
coefficient of skewness, the Galton skewness, and the Pearson 2 skewness coefficient (see
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm)
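
Base R does not provide a skewness function, so the coefficient can be computed directly from its definition. The helper name `moment_skewness` below is hypothetical, and the rainfall data from the percentile example are reused purely for illustration.

```r
# Hypothetical helper: Fisher-Pearson moment coefficient of skewness,
# with the standard deviation computed using divisor n (not n - 1)
moment_skewness = function(x) {
  n = length(x)
  deviations = x - mean(x)
  s_n = sqrt(sum(deviations^2) / n)      # standard deviation with divisor n
  (sum(deviations^3) / n) / s_n^3
}

rain = c(9.6,2.5,3.9,4.1,5.9,1.1,2.7,4,4.7,6.1,1.8,3.4,4,5.2,6.2)
gamma1_rain = moment_skewness(rain)      # positive: long right tail (the 9.6 value)
gamma1_sym = moment_skewness(1:5)        # symmetric data give 0

gamma1_rain
gamma1_sym
```

A positive coefficient indicates a longer right tail, a negative one a longer left tail, and values near 0 indicate approximate symmetry.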

Normal Data Sets


Many real data sets have bell-shaped histograms. A “bell-shaped” histogram often refers to the
ideal bell-shaped curve, the normal curve that defines a normal distribution (discussed later).
That is why data sets that have bell-shaped histograms are often called normal data sets.
For normal data sets we have the Empirical Rule, which follows from the properties of the
normal distribution.

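Empirical Rule for a Bell-shaped Distribution:

• Approximately 68% of the data lie within 1 standard deviation of the mean.
• Approximately 95% of the data lie within 2 standard deviations of the mean.
• Approximately 99.7% of the data lie within 3 standard deviations of the mean.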

Ex. SAT math scores have a bell-shaped distribution with mean 515 and standard deviation 114.

a) What percentage of SAT scores is less than 401 or greater than 629?

b) What percentage of SAT scores is greater than 743?

c) What percentage of SAT scores fall between 287 and 515?

d) 99.7% of SAT scores fall between _______ and _______.
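
The cut-offs in these questions are all of the form mean ± k standard deviations, which is what makes the Empirical Rule (about 68%, 95%, and 99.7% of the data within 1, 2, and 3 standard deviations of the mean) applicable. The sketch below tabulates those intervals for the SAT example and, as an assumption beyond the rule itself, compares the rule's percentages with the exact normal probabilities from `pnorm()`.

```r
mu = 515        # mean of SAT math scores
sigma = 114     # standard deviation

k = 1:3
lower = mu - k * sigma
upper = mu + k * sigma
rule = c(0.68, 0.95, 0.997)               # Empirical Rule proportions

data.frame(k, lower, upper, rule)

# exact proportions within k standard deviations for a normal distribution
exact = pnorm(k) - pnorm(-k)
round(exact, 3)
```

By symmetry of the bell shape, the proportion outside an interval splits equally between the two tails, and the proportion inside splits equally on either side of the mean.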

Paired data sets and the Sample correlation coefficient


Two numerical variables are often studied together to investigate any possible relationship
between the two variables.

Examples: (a) shoe size and weight; (b) height and weight of a student; (c) the amount of time
spent studying for statistics exam and the score on the exam.

Such data sets consist of pairs of observations $(x_i, y_i)$ for $i = 1, 2, \ldots, n$.

Example: Ice Cream Sales

The local ice cream shop keeps track of how much ice cream they sell in a day and the
noon temperature on that day. Observations for the last 12 days are recorded in the table.

In this example, $x_1, \ldots, x_{12}$ are the 12 temperatures in °C and $y_1, \ldots, y_{12}$ are the 12 sales in dollars.

Scatter diagram/plot: A 2-dimensional graph where the x-axis and y-axis represent data
values from the x-variable and y-variable respectively.

Scatter plot of temperature versus sales (in dollars)

R code to draw scatter plot:

data = read.table("icecream.txt")
plot(data[,1], data[,2], xlab="temperature", ylab="sales", type="p", col="red", lwd=2)
abline(lm(data[,2] ~ data[,1])) # add the fitted least-squares line

Points on the scatter plot may create some patterns:

(a) Linear: Points may follow an imaginary line. (b) Nonlinear: Points may follow a curve.

Notice: warmer weather is associated with more sales; there seems to be a linear relationship between
the temperature and sales variables.

• Is there a numerical measure that determines linear relationship between variables?

Sample correlation coefficient (r) measures the strength and direction of linear relationship
between two numerical variables.

• r is a unitless measure such that −1 ≤ r ≤ 1. Examine the scatter diagrams below:

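If $\bar{x}$ and $\bar{y}$ are the sample means and $S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$ and $S_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}$
are the sample standard deviations for n data pairs $(x_i, y_i)$, then:

Sample Correlation Coefficient (Pearson):

$$r = \frac{1}{(n-1)S_x S_y}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \, \sum_{i=1}^{n}(y_i-\bar{y})^2}}$$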

Ex. Find r for the data (x, y) = (0, 4), (3, 6), (4, 7), (5, 7)
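
A short sketch for checking this exercise in R: it evaluates both forms of the formula for r step by step and compares them with the built-in `cor()`. The intermediate variable names are illustrative.

```r
x = c(0, 3, 4, 5)
y = c(4, 6, 7, 7)
n = length(x)

# sum of cross-products of deviations from the means
sxy = sum((x - mean(x)) * (y - mean(y)))

# first form: divide by (n - 1) * Sx * Sy
r_first = sxy / ((n - 1) * sd(x) * sd(y))

# second form: divide by the square root of the product of the sums of squares
r_second = sxy / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

c(r_first, r_second, cor(x, y))    # all three agree
```

Here the means are 3 and 6, the sum of cross-products is 9, and the sums of squares are 14 and 6, so r = 9 / sqrt(84).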

Properties of r:

• r measures the strength of linear relationships only.
• −1 ≤ r ≤ 1, where r > 0: positive linear relation, r < 0: negative linear relation
• r near 0: weak linear relationship, and r near +/− 1: points lie close to a straight line.
• If we interchange x and y, does the correlation change?
• Correlation is not affected by a change of units, i.e., by a linear transformation ax + b of either variable with a > 0 (a negative a only flips the sign of r).

Ex: What’s wrong with these statements?

The correlation between height and weight of Computer Science students

(a) is 2.61

(b) is 0.61 inches per pound

(c) is 0.61, so the correlation between weight and height is -0.61

(d) is 0.61 using inches and pounds, but converting inches to centimeters would
make r > 0.61 (since an inch equals about 2.54 centimeters)
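
Several of these statements can be tested numerically. The sketch below reuses the four (x, y) pairs from the earlier exercise as stand-in data (they are not heights and weights) and treats multiplication by 2.54 as a change of units.

```r
x = c(0, 3, 4, 5)
y = c(4, 6, 7, 7)

r_xy = cor(x, y)
r_yx = cor(y, x)                    # interchange the two variables
r_rescaled = cor(2.54 * x, y)       # change the units of x (e.g. inches to cm)
r_shifted = cor(x + 10, 3 * y)      # shift x, rescale y by a positive factor
r_flipped = cor(-x, y)              # multiply x by a negative constant

c(r_xy, r_yx, r_rescaled, r_shifted, r_flipped)
```

The first four values coincide and lie between −1 and 1, while the last one has the opposite sign, in line with the properties of r listed above.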



Important Note: Association doesn't imply causation, i.e., the fact that two variables are associated does not
necessarily mean that one variable causes the other.

Ex 1. “The number of firemen and the severity of a fire are highly correlated.”

Wrong conclusion: High correlation implies that the firemen cause the fire.

Ex 2. “The faster windmills are observed to rotate, the more wind is observed.”

Wrong conclusion: Wind is caused by the rotation of windmills. In practice, it is rather the other
way around.

Ex 3. “Children that watch a lot of TV are the most violent. So, TV makes children more
violent.” Not necessarily true. This could easily be the other way round; that is, violent children
like watching more TV than less violent ones.

Ex 4. Study shows that there is a significant correlation (0.791) between a country’s consumption
of chocolate and the number of Nobel prizes (averaged per person).
