Chapter 3
Index
4. Measures of Shape
• Skewness
• Kurtosis
• Box and Whisker plots
5. Measures of Association
• Correlation
We can use single numbers called "summary statistics" to describe
characteristics of a data set. Two of these characteristics are particularly
important to decision makers:
1. Central tendency
2. Dispersion
Central Tendency:
Central tendency is the middle point of a distribution. Measures of
central tendency are also known as Measures of location. Measures of
central tendency yield information about the center, or middle part, of a
group of numbers. They do not focus on the span of the data set or how far
values are from the middle numbers.
Dispersion:
Dispersion is the spread of the data in a distribution, that is, the extent
to which the observations are scattered.
Objectives:
Mode:
The mode is a measure of central tendency. It is the most common value in a distribution.
E.g. the mode of 3, 4, 4, 5, 5, 5, 8 is 5, because 5 occurs more often than any other value.
When to use: Use the mode when the data is non-numeric or when asked to choose
the most popular item.
• Advantages:
• Extreme values (outliers) do not affect the mode.
• Disadvantages:
• Not as popular as mean and median.
• Not necessarily unique - may be more than one answer
• When no values repeat in the data set, the mode is every value and is useless.
• When there is more than one mode, it is difficult to interpret and/or compare.
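The points above about uniqueness can be checked with a short Python sketch. It uses `statistics.multimode` from the standard library (Python 3.8 or later), which returns every most-frequent value. The first list is the example from the text; the other two lists are hypothetical data added only for illustration.

```python
from statistics import multimode

data = [3, 4, 4, 5, 5, 5, 8]   # example from the text
print(multimode(data))         # [5]

bimodal = [1, 1, 2, 3, 3]      # hypothetical data with two modes
print(multimode(bimodal))      # [1, 3]

no_repeats = [1, 2, 3]         # hypothetical data in which no value repeats
print(multimode(no_repeats))   # [1, 2, 3] (every value is a mode)
```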
Median
The data must be ranked (sorted in ascending order) first. The median is the number in the
middle.
Several formulas can be used to find the depth (position) of the median in the ranked data.
The one that we will use is:
Depth of median = 0.5 * (n + 1)
When to use: Use the median to describe the middle of a set of data that does have an outlier.
• Advantages:
• Extreme values (outliers) do not affect the median as strongly as they do the mean.
• Useful when comparing sets of data.
• It is unique - there is only one answer.
• Disadvantages:
• Not as popular as mean.
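A minimal sketch of the depth formula in Python follows. The function name `median_by_depth` is hypothetical, and the handling of a depth ending in .5 (averaging the two neighbouring values) is the usual convention, assumed here because the text does not spell it out.

```python
def median_by_depth(values):
    """Median found at depth 0.5 * (n + 1) in the ranked data (1-based position)."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ranked = sorted(values)                 # the data must be ranked first
    depth = 0.5 * (len(ranked) + 1)
    lower = ranked[int(depth) - 1]          # value at the whole-number part of the depth
    if depth.is_integer():
        return lower                        # odd n: the depth points at one value
    upper = ranked[int(depth)]              # even n: depth ends in .5
    return (lower + upper) / 2              # assumed convention: average the neighbours

print(median_by_depth([3, 4, 4, 5, 5, 5, 8]))   # depth 4 -> 5
```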
Mean:-
The mean is the average of a group of numbers and is computed by summing all
the numbers and dividing by the number of numbers. The population mean is
represented by the Greek letter µ. The sample mean is represented by x̄. The
formulas for computing the population mean and the sample mean are given below.
• Population mean:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
• Sample mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
When to use: Use the mean to describe the middle of a set of data that does not have an outlier.
• Advantages:
• Most popular measure in fields such as business, engineering and computer science.
• It is unique - there is only one answer.
• Useful when comparing sets of data.
• Disadvantages:
• Affected by extreme values (outliers)
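The sensitivity of the mean to outliers can be illustrated with a small Python sketch. The base data set reuses the mode example from the text; the outlier value 100 is hypothetical and added only to show the effect.

```python
from statistics import median

def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

data = [3, 4, 4, 5, 5, 5, 8]       # example data from the text
with_outlier = data + [100]        # hypothetical extreme value

print(round(mean(data), 3), median(data))          # 4.857 5
print(mean(with_outlier), median(with_outlier))    # 16.75 5.0
```

The mean moves from about 4.86 to 16.75 while the median stays at 5, which is why the median is preferred when outliers are present.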
Percentiles:
• They are measures of central tendency that divide a group of data
into 100 parts
• At least n% of the data lie below the nth percentile, and at most
(100 - n)% of the data lie above the nth percentile
• Example: 90th percentile indicates that at least 90% of the data lie
below it, and at most 10% of the data lie above it
• The median and the 50th percentile have the same value.
• Applicable for ordinal, interval, and ratio data
• Not applicable for nominal data
For Calculation:
• Organize the data into an ascending ordered array.
• Calculate the percentile location:
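$$i = \frac{P}{100}(n)$$
where P is the percentile of interest, i is the percentile location, and n is the number of values in the data set.
• If i is a whole number, the percentile is the average of the values at locations i and i + 1. If i is not a whole number, the percentile is the value at the whole-number part of i + 1.
• Example: for the 30th percentile of an ordered array of n = 8 values,
$$i = \frac{30}{100}(8) = 2.4$$
• The location index, i, is not a whole number; i + 1 = 2.4 + 1 = 3.4; the whole-number
portion is 3; the 30th percentile is at the 3rd location of the array;
the 30th percentile is 13.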
Quartiles
• Measures of central tendency that divide a group of data into four subgroups
• Q1: 25% of the data set is below the first quartile
• Q2: 50% of the data set is below the second quartile
• Q3: 75% of the data set is below the third quartile
• Q1 is equal to the 25th percentile
• Q2 is located at 50th percentile and equals the median
• Q3 is equal to the 75th percentile
• Quartile values are not necessarily members of the data set
E.g.
• Ordered array: 106, 109, 114, 116, 121, 122, 125, 129
• Q1:
$$i = \frac{25}{100}(8) = 2, \qquad Q_1 = \frac{109 + 114}{2} = 111.5$$
• Q2:
$$i = \frac{50}{100}(8) = 4, \qquad Q_2 = \frac{116 + 121}{2} = 118.5$$
• Q3:
$$i = \frac{75}{100}(8) = 6, \qquad Q_3 = \frac{122 + 125}{2} = 123.5$$
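The percentile location procedure used for these quartiles can be sketched in Python as below. The function name `percentile` is hypothetical. The rule it applies is the one the worked examples follow: when the location i is a whole number, average the values at locations i and i + 1; otherwise take the value at the whole-number part of i + 1. Other textbooks and software packages use different interpolation rules, so results may not match library functions exactly.

```python
def percentile(values, p):
    """p-th percentile (0 < p < 100) using the location rule i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)               # ascending ordered array
    i = p * len(ordered) / 100             # percentile location
    if i == int(i):                        # whole number: average locations i and i + 1
        i = int(i)
        return (ordered[i - 1] + ordered[i]) / 2
    position = int(i + 1)                  # otherwise: whole-number part of i + 1
    return ordered[position - 1]           # convert 1-based location to 0-based index

array = [106, 109, 114, 116, 121, 122, 125, 129]   # ordered array from the text
print(percentile(array, 25))   # 111.5
print(percentile(array, 50))   # 118.5
print(percentile(array, 75))   # 123.5
```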
Measures of Variability:
Ungrouped Data
• Measures of variability describe the spread or the dispersion of a set of data.
3.1 RANGE:
The range is the difference between the highest and lowest observed values.
Advantages of range:
• It is easy to understand and to find
• It is used in quality assurance, where the range is used to construct
control charts.
Disadvantages of range:
• Its usefulness as a measure of dispersion is limited.
• It considers only the highest and lowest values of a distribution.
• It is heavily affected by extreme values.
• It is not used in open ended series.
Example:
The interquartile range (IQR) is the difference between the values of the third and first
quartiles. It is the range of the middle 50% of the scores in a distribution, so it is less
affected by extremes than the range.
It is computed as follows:
IQR = Q3 – Q1
E.g. if the 75th percentile is 8 and the 25th percentile is 6, the interquartile
range is 8 – 6 = 2.
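As a quick check, the range and the interquartile range can be computed in Python. The sketch below reuses the ordered array and the quartile values Q1 = 111.5 and Q3 = 123.5 from the quartile example earlier in the text; the function names are hypothetical.

```python
def value_range(values):
    """Range: highest observed value minus lowest observed value."""
    return max(values) - min(values)

def interquartile_range(q1, q3):
    """IQR = Q3 - Q1, the span of the middle 50% of the data."""
    return q3 - q1

array = [106, 109, 114, 116, 121, 122, 125, 129]
print(value_range(array))                  # 23
print(interquartile_range(111.5, 123.5))   # 12.0
print(interquartile_range(6, 8))           # 2
```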
3.2 VARIANCE:
Variance in population:
Variability can also be defined in terms of how close the scores in the distribution are to
the middle of the distribution. Using the mean as the measure of the middle of the distribution,
the variance is defined as the average squared difference of the scores from the mean.
Example:
Variance in Sample:
If the variance in a sample is used to estimate the variance in a population, then the previous
formula underestimates the variance and the following formula should be used:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
where s² is the estimate of the population variance. Since, in practice, the variance is usually
computed in a sample, this formula is most often used.
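The difference between the two variance formulas can be seen in a short Python sketch. The population version divides the sum of squared deviations by the number of values, following the definition of variance as the average squared difference from the mean; the sample version divides by n − 1. The data set is hypothetical, chosen so the arithmetic is easy to follow, and the function names are illustrative.

```python
import math

def population_variance(values):
    """Average squared difference of the values from their mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Estimate of the population variance from a sample (divide by n - 1)."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical data, mean = 5
print(population_variance(data))           # 4.0
print(round(sample_variance(data), 3))     # 4.571
print(math.sqrt(population_variance(data)))  # 2.0, the population standard deviation
```

The sample formula always gives the larger value, which compensates for the underestimate described above.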
Standard deviation:
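The standard deviation is the square root of the variance.
• Population standard deviation:
$$\sigma = \sqrt{\sigma^2}$$
• Sample standard deviation:
$$s = \sqrt{s^2}$$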
• Useful in describing how far individual items in a distribution depart from the
mean of the distribution.
• Indicator of financial risk
• Quality Control in construction of quality control charts & process capability
studies
• Comparing populations for household incomes in two cities & employee
absenteeism at two plants
EMPIRICAL RULE:
• If the population is assumed normal, this rule gives a quick, rough estimate of the
probability of a value from its distance from the mean, measured in standard deviations.
It also serves as a simple test for outliers (if the population is assumed normal) and as
a normality test (if the population is potentially not normal).
Range        Population in range
μ ± 1σ       68%
μ ± 2σ       95%
μ ± 3σ       99.7%
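The percentages in the empirical rule are rounded values of the normal distribution. As a sketch, they can be reproduced in Python from the standard library error function, using the fact that for a normal population the proportion within k standard deviations of the mean is erf(k / √2). The function name is hypothetical.

```python
import math

def normal_proportion_within(k):
    """Proportion of a normal population lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    # Prints roughly 68.3, 95.4 and 99.7 percent for k = 1, 2, 3
    print(f"mu +/- {k} sigma: {100 * normal_proportion_within(k):.1f}%")
```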
CHEBYSHEV’S THEOREM
Number of Standard Deviations | Distance From the Mean | Minimum Proportion of Values Falling Within Distance
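Chebyshev's theorem states that, for any distribution, at least 1 − 1/k² of the values fall within k standard deviations of the mean, for k > 1. The Python sketch below computes that minimum proportion for a few values of k; the function name is hypothetical and the chosen k values are only illustrative.

```python
def chebyshev_minimum_proportion(k):
    """Minimum proportion of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4):
    # k = 2 -> 0.75, k = 3 -> about 0.889, k = 4 -> 0.9375
    print(k, round(chebyshev_minimum_proportion(k), 4))
```

Unlike the empirical rule, this bound does not assume a normal population, which is why its proportions are lower.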
6. z Scores:-
If the z score is negative, the raw value (x) is below the mean. If
the z score is positive, the raw value (x) is above the mean.
For example, for a data set that is normally distributed with a mean of
50 and a standard deviation of 10, suppose a statistician wants to
determine the z score for a value of 70. The value is 20 units above the
mean, so the z value is z = (70 − 50) / 10 = 2.
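A z score expresses how many standard deviations a raw value lies from the mean. A minimal Python sketch of the example (mean 50, standard deviation 10, value 70) follows; the function name `z_score` is hypothetical, and the value 40 is an extra illustrative input.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

print(z_score(70, 50, 10))   # 2.0  (above the mean)
print(z_score(40, 50, 10))   # -1.0 (below the mean)
```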
7. Coefficient of Variation:-
The Coefficient of variation is a statistic that is the ratio of the
standard deviation to the mean expressed in percentage.
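As a sketch, the definition translates directly into Python. The function name is hypothetical, and the inputs reuse the mean of 50 and standard deviation of 10 from the z score example purely for illustration.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean

print(coefficient_of_variation(10, 50))   # 20.0 (percent)
```

Because it is a ratio, the coefficient of variation allows the spread of data sets with different means or units to be compared.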
• Class frequencies are the weights
$$\mu = \frac{\sum fM}{\sum f} = \frac{\sum fM}{N} = \frac{f_1 M_1 + f_2 M_2 + f_3 M_3 + \cdots + f_i M_i}{f_1 + f_2 + f_3 + \cdots + f_i}$$
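The frequency-weighted formula can be sketched in Python as below. It assumes that f is the frequency of each class and M is the value representing that class (typically the class midpoint); the text does not define M, so that reading is an assumption. The frequencies and midpoints are hypothetical.

```python
def grouped_mean(frequencies, midpoints):
    """Mean of grouped data: sum of f * M divided by the sum of f."""
    if len(frequencies) != len(midpoints):
        raise ValueError("need one frequency per class midpoint")
    weighted_sum = sum(f * m for f, m in zip(frequencies, midpoints))
    return weighted_sum / sum(frequencies)

frequencies = [2, 5, 3]     # hypothetical class frequencies (the weights)
midpoints = [5, 15, 25]     # hypothetical class midpoints (assumed meaning of M)
print(grouped_mean(frequencies, midpoints))   # (10 + 75 + 75) / 10 = 16.0
```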
4. Measures of Shape
Skewness
When they are displayed graphically, some distributions of data have many more observations on one
side of the graph than the other. Distributions with most of their observations on the left (toward
lower values) are said to be skewed right; and distributions with most of their observations on the
right (toward higher values) are said to be skewed left.
1. Arithmetic Mean:
• The arithmetic mean of a set of data is the sum of the data values
divided by the number of observations.
If the data set is from a sample, then the sample mean, $\bar{X}$, is:
$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$
n = sample size and Σ means "to add"
If the data set is from a population, then the population mean, µ, is:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
• The Weighted Mean: