economists
Spring 2018 C-D
Department of Statistics
NCT 1-2; (16.1)
Today
Describe data (variables) by means of
● Tables
– Distributions, frequency distributions (frequency = number of)
● Diagrams, graphics
– One variable at a time, univariate
– Categorical and numerical variables
● Numerically
– Location (“where are the values typically located”)
– Variation, dispersion (“how spread out are they”)
Classification of variables
Variables
● Categorical (qualitative)
● Numerical (quantitative)
– Discrete
– Continuous
Scale level
● Ratio: differences and ratios are well-defined, a true zero exists (numerical, quantitative data)
● Interval: differences are well-defined but not ratios, a true zero does not exist (numerical, quantitative data)
● Ordinal: ordered categories, ranking order (categorical, qualitative data)
● The entire set of all $n_k$ over all possible $k$ is called the
frequency distribution (Swedish: frekvensfördelningen)
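As a minimal sketch of how a frequency distribution can be tabulated, the snippet below counts how often each value $k$ occurs. The data are the twelve observations used in the mean example of these slides; the variable names are illustrative only.

```python
from collections import Counter

# Observations borrowed from the mean example in these slides.
observations = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3]

# n_k = number of observations equal to k
frequencies = Counter(observations)
n = len(observations)

# Print the frequency distribution together with relative frequencies.
for k in sorted(frequencies):
    n_k = frequencies[k]
    print(f"k={k}  n_k={n_k}  relative frequency={n_k / n:.3f}")
```

The frequencies always sum to $n$, which is a quick sanity check on any frequency table.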
[Figure: Pareto chart (bars ordered by frequency) of flavors: Hallon (raspberry), Blåbär (blueberry), Choklad (chocolate), Vanilj (vanilla), Lakrits (licorice); frequency axis from 0 to 30]
[Figure: bar chart of the number of apartments (frequencies) by number of rooms per apartment (1 to 6), ordered by magnitude]
● Open classes (e.g. >65) are indicated with e.g. dotted lines
– we don’t know where it ends and thus neither the area!
[Figure: histogram of frequencies (Frekvens), then the same histogram overlaid with the cumulative percentages (Kumulativa procenttal) drawn as a red ogive]
Note! The red ogive is an increasing function (never decreasing) and never goes higher than 100 %.
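A small sketch of how the points of an ogive can be computed from class frequencies. The frequencies below are invented purely for illustration and are not taken from the slides.

```python
from itertools import accumulate

# Hypothetical class frequencies, invented for illustration only.
frequencies = [5, 12, 20, 9, 4]
total = sum(frequencies)

# Cumulative percentage after each class: the heights of the ogive.
cumulative_pct = [100 * c / total for c in accumulate(frequencies)]
print(cumulative_pct)
```

Because every frequency is non-negative, the cumulative percentages can never decrease, and the last one is 100 % by construction.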
[Figure: histogram of relative frequencies (classes 0-20, 21-40, 41-60, 61-80, 81-100, 101-120, 121-140, > 140) beside the ogive of cumulative percentages (horizontal axis 0 to 160); labels indicate that the height of a histogram bar (A, B) equals the increase of the ogive over the same class]
● Numerical, discrete, few values: bar chart, one bar for each discrete value
Stem-and-Leaf Displays
● Swedish: stambladdiagram
● Provides exact values and visualizes the distribution
● In this example
[Figure: example display not recoverable; the residue contains the pairs 8 8, 7 3, 6 3 and a horizontal axis from 1975 to 1982]
● Denote the variable by $x$ (or some other letter $y$, $z$, $u$, $v$, …)
● Ex. 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3 yields
$$\bar{x} = \sum_{k=0}^{3} \frac{n_k}{n}\,k = \frac{1}{12}\,(3\cdot 0 + 2\cdot 1 + 3\cdot 2 + 4\cdot 3) = \frac{20}{12} = \frac{5}{3} \approx 1{,}6667$$
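The calculation above can be checked with a short sketch that computes the mean once from the frequencies and once directly from the raw observations. Exact fractions are used so that the result 5/3 appears without rounding; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

observations = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3]
frequencies = Counter(observations)   # n_k for each value k
n = len(observations)

# Mean as a frequency-weighted sum: sum over k of (n_k / n) * k
mean_from_frequencies = sum(Fraction(n_k, n) * k for k, n_k in frequencies.items())

# Mean computed directly from the raw observations
mean_direct = Fraction(sum(observations), n)

print(mean_from_frequencies, float(mean_from_frequencies))
```

Both routes give the same value, since the frequency table only groups identical terms of the sum.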
● Q1 = 1st quartile
25% of observations below, 75% above
● Q3 = 3rd quartile
75 % of observations below, 25 % above
Ex. {11, 12, 14, 15, 17, 18, 20, 21, 21, 23, 30, 40}, $n = 12$
40th percentile: $(n + 1)\cdot\frac{40}{100} = (12 + 1)\cdot 0{,}4 = 5{,}2 \Rightarrow a = 5$ and $b = 0{,}2$
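A hedged sketch of the percentile calculation. The slides give only the position $(n+1)\cdot p/100$ and its split into $a$ and $b$; the interpolation step $x_{(a)} + b\,(x_{(a+1)} - x_{(a)})$ used below is the usual convention for this rule and is an assumption here, as are the function name and the clamping at the ends of the data.

```python
import math

def percentile_n_plus_1(values, p):
    """p-th percentile located at position (n + 1) * p / 100 in the sorted data.

    Assumption: linear interpolation between the a-th and (a+1)-th ordered values.
    Assumption: positions outside 1..n are clamped to the smallest/largest value.
    """
    data = sorted(values)
    n = len(data)
    position = (n + 1) * p / 100
    a = math.floor(position)   # integer part
    b = position - a           # fractional part
    if a < 1:
        return data[0]
    if a >= n:
        return data[-1]
    # data is 0-indexed, so the a-th ordered value is data[a - 1]
    return data[a - 1] + b * (data[a] - data[a - 1])

data = [11, 12, 14, 15, 17, 18, 20, 21, 21, 23, 30, 40]
p40 = percentile_n_plus_1(data, 40)
print(round(p40, 4))
```

With $a = 5$ and $b = 0{,}2$ the result lies one fifth of the way from the 5th ordered value (17) to the 6th (18).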
=MIN(A1:A12)
=QUARTILE.EXC(A1:A12;1)
=MEDIAN(A1:A12)
=QUARTILE.EXC(A1:A12;3)
=MAX(A1:A12)
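For readers without Excel, a rough Python counterpart of the five formulas above. It assumes that cells A1:A12 hold the twelve observations of the percentile example, which the slides do not state explicitly. The `method="exclusive"` option of `statistics.quantiles` places quartiles at positions based on $n + 1$ and should therefore correspond to `QUARTILE.EXC`.

```python
import statistics

# Assumed contents of A1:A12 (the data set from the percentile example).
data = [11, 12, 14, 15, 17, 18, 20, 21, 21, 23, 30, 40]

# Quartiles located at positions (n + 1) * p, i.e. the "exclusive" method.
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")

five_number_summary = (min(data), q1, median, q3, max(data))
print(five_number_summary)
```

These five numbers are exactly what a box plot displays.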
[Figure: box plot with Min, Q1, Median, Q3 and Max marked, plus extreme values; axis from 20 to 80]
Variability: Variance
● Average squared distance to the mean
● Standard deviation:
– Restores unit of measurement: $s_x = \sqrt{s_x^2}$, $\sigma_x = \sqrt{\sigma_x^2}$
– Swedish: standardavvikelse
Excel: ’=VAR.S(…)’
● Population variance
$$\sigma_x^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2 - N\mu^2}{N}$$
Excel: ’=VAR.P(…)’
                                                     Σ
$x_i$                   2       3       5       8       18
$x_i - \bar{x}$         −2,5    −1,5    0,5     3,5     0
$(x_i - \bar{x})^2$     6,25    2,25    0,25    12,25   21
$x_i^2$                 4       9       25      64      102
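The sums for the observations 2, 3, 5, 8 can be reproduced with a short sketch. It computes the variance both with divisor $n - 1$ (the convention of `VAR.S`) and with divisor $N$ (the convention of `VAR.P`, which treats the four values as a complete population). Variable names are illustrative.

```python
import math
import statistics

x = [2, 3, 5, 8]
n = len(x)
mean = sum(x) / n

# Sum of squared deviations from the mean
sum_sq_dev = sum((xi - mean) ** 2 for xi in x)

# Shortcut: sum of squares minus n times the squared mean
sum_sq = sum(xi ** 2 for xi in x)
shortcut = sum_sq - n * mean ** 2

sample_variance = sum_sq_dev / (n - 1)   # divisor n - 1
population_variance = sum_sq_dev / n     # divisor N
sample_std = math.sqrt(sample_variance)  # back in the original unit

print(mean, sum_sq_dev, shortcut, sample_variance, population_variance)
```

The standard library functions `statistics.variance` and `statistics.pvariance` follow the same two conventions and can serve as a cross-check.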
Rule         μ ± σ          μ ± 2σ         μ ± 3σ          Conditions
Empirical:   approx. 68 %   approx. 95 %   approx. 100 %   Under some conditions ("bell-shaped")
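As an illustration of where the empirical percentages come from, the sketch below evaluates the share of a normal distribution within one, two and three standard deviations of the mean. Reading "bell-shaped" as exactly normal is an assumption made here; real data only follow these shares approximately.

```python
from statistics import NormalDist

# Assumption: the "bell-shaped" distribution is modelled as a standard normal.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of the distribution within mu ± k*sigma for k = 1, 2, 3
shares = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within mu ± {k} sigma: {share:.1%}")
```

The three shares are roughly 68 %, 95 % and 99,7 %, the last of which the rule rounds to about 100 %.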
$i$                    1    2    3    4    5    6    7    8    9   10     Σ
$x_i$                  5    2    3    6    5    2    5    3    5    4    40
$x_i - \bar{x}$        1   −2   −1    2    1   −2    1   −1    1    0     0
$(x_i - \bar{x})^2$    1    4    1    4    1    4    1    1    1    0    18
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{10}\,(5 + 2 + \dots + 4) = \frac{40}{10} = 4{,}0$
Variance: $s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{9}\,(1 + 4 + 1 + \dots + 0) = \frac{18}{9} = 2{,}0$
$i$        1    2    3    4    5    6    7    8    9   10      Σ
$x_i$      5    2    3    6    5    2    5    3    5    4     40
$x_i^2$   25    4    9   36   25    4   25    9   25   16    178
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{10}\,(5 + 2 + \dots + 4) = \frac{40}{10} = 4{,}0$
alt. formula:
Variance: $s_x^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) = \frac{1}{9}\,(178 - 10\cdot 4^2) = \frac{178 - 160}{9} = 2{,}0$
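A brief sketch confirming that the definition and the alternative formula agree for the ten observations above. The names `definition` and `shortcut` are illustrative.

```python
x = [5, 2, 3, 6, 5, 2, 5, 3, 5, 4]
n = len(x)
mean = sum(x) / n

# Definition: squared deviations from the mean, divided by n - 1
definition = sum((xi - mean) ** 2 for xi in x) / (n - 1)

# Alternative formula: (sum of squares - n * mean^2) / (n - 1)
sum_of_squares = sum(xi ** 2 for xi in x)
shortcut = (sum_of_squares - n * mean ** 2) / (n - 1)

print(mean, sum_of_squares, definition, shortcut)
```

The alternative formula is convenient by hand because it needs only two running sums. In floating-point arithmetic it can lose precision when the mean is large relative to the spread, so the definition is often the safer choice in code.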
$(i)$        1    2    3    4    5    6    7    8    9   10     Σ
$x_{(i)}$    2    2    3    3    4    5    5    5    5    6    40