
Data Analysis

Kulwant Singh Kapoor


Data Structure
The process of arranging data in groups or
classes according to resemblances and
similarities is technically called
classification.
Types of Classification:
Geographical
Chronological
Qualitative
Quantitative

Geographical Data
In geographical classification, data are classified on the
basis of place.
Example: geographical distribution of National Income
COUNTRY INCOME IN US DOLLARS
Canada 7950
USA 7880
West Germany 7510
France 6730
USSR 2800
India 500
Chronological Data
When data are classified on the basis of time, the classification
is chronological; such data are also known as a time series.
Example: production of polio vaccine by a company X.
YEAR No. of Vaccines
2005 12,800
2006 15,600
2007 18,200
2008 16,600
2009 20,000
2010 20,800
Qualitative Data
When data are classified on the basis of descriptive
characteristics or attributes, the classification is qualitative.
Examples:
Male/Female
Strongly Agree/Agree/Disagree/Strongly Disagree
Low/Medium/High
Diabetic/Non-Diabetic
Hypertensive/Mildly Hypertensive/Non-Hypertensive
Quantitative Classification
Classification is quantitative when it is based on characteristics
which are capable of quantitative measurement.
Example:
Height/Weight
Income/Expenditure
Blood Pressure
Body Temperature
Blood Count

Quantitative Data
Ungrouped: Raw data
Grouped: Discrete data, Continuous data
Mean
Median
Mode
Quartile
Percentile
MEASURES OF CENTRAL TENDENCY
MEAN
Arithmetic mean of a given set of observations is
their sum divided by the number of observations.
For example, if $X_1, X_2, X_3, \ldots, X_n$ are the given n
observations, then their arithmetic mean, denoted
by $\bar{X}$, is

$$\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

EXAMPLE 1
MARKS OF 24
STUDENTS
12 43 54 67 87 98 65 43
54 67 89 90 98 76 54 56
54 98 89 78 90 98 99 87
TOTAL 1746
# OF OBSERVATIONS 24
MEAN 72.75
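
The calculation in Example 1 can be reproduced with a few lines of Python. This is only a sketch of the definition (sum divided by count); the variable names are illustrative and not part of the original slides.

```python
from statistics import mean

# Marks of the 24 students from Example 1
marks = [12, 43, 54, 67, 87, 98, 65, 43,
         54, 67, 89, 90, 98, 76, 54, 56,
         54, 98, 89, 78, 90, 98, 99, 87]

total = sum(marks)            # sum of all observations
n = len(marks)                # number of observations
arithmetic_mean = total / n   # definition of the arithmetic mean

# The standard library gives the same value
print(total, n, arithmetic_mean, mean(marks))
```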
Arithmetic Mean for Ungrouped
Series
Employee Income X-A
1 1000 -
2 1500 -
3 800 -
4 1200 -
5 900 -



For discrete data, the mean is calculated with
respect to frequencies.
In the case of continuous data, the value of X is
taken as the mid value of the corresponding
class.

$$\bar{X} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$$

EXAMPLE 2 NUMBER OF STUDENTS ABSENT IN A YEAR


X f Xf
1 8 8
2 9 18
3 21 63
4 32 128
5 12 60
6 22 132
7 24 168
8 37 296
9 15 135
10 20 200
TOTAL 200 1208
MEAN 6.04
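
A minimal sketch of the frequency-weighted mean, using the X and f columns of Example 2. The variable names are assumptions made for illustration.

```python
# Values of X and their frequencies f from Example 2
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
freqs = [8, 9, 21, 32, 12, 22, 24, 37, 15, 20]

N = sum(freqs)                                           # total frequency
weighted_total = sum(f * x for f, x in zip(freqs, values))  # sum of f * X
discrete_mean = weighted_total / N

print(N, weighted_total, discrete_mean)
```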

Marks Students X-40
X f d f*d
20 8 - -
30 12 - -
40 20 - -
50 10 - -
60 6 - -
70 4 - -
Total 60
EXAMPLE 3 DISTRIBUTION OF NUMBER OF
PROCESSED ARTICLES PER DAY
PER PERSON
LIMITS f X fX
80-100 7 90 630
100-120 50 110 5500
120-140 80 130 10400
140-160 60 150 9000
160-180 3 170 510
TOTAL 200 26040
MEAN 130.2
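
For continuous data the same weighted mean is applied to the class midpoints. The sketch below rebuilds the X and fX columns of Example 3 from the class limits; the tuple layout is an assumption chosen for this illustration.

```python
# (lower limit, upper limit, frequency) for each class in Example 3
classes = [(80, 100, 7), (100, 120, 50), (120, 140, 80),
           (140, 160, 60), (160, 180, 3)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]  # X = mid value of class
freqs = [f for _, _, f in classes]

N = sum(freqs)
weighted_total = sum(f * x for f, x in zip(freqs, midpoints))
grouped_mean = weighted_total / N

print(midpoints, N, weighted_total, grouped_mean)
```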
Mathematical Properties of
Arithmetic Mean
Property 1 The Algebraic sum of the
deviations of the given set of
observations from their arithmetic
mean is zero
Property 2 If the sizes and the means
of two component series are known, then
the mean of the resultant series obtained
on combining the given series can be
found
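
Property 1 follows directly from the definition of the mean. A short derivation, written with the same symbols as the mean formula (n observations $x_i$ with mean $\bar{X}$):

```latex
\sum_{i=1}^{n} \left(x_i - \bar{X}\right)
  = \sum_{i=1}^{n} x_i - n\bar{X}
  = n\bar{X} - n\bar{X}
  = 0,
\qquad \text{since } \bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i .
```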
Merits and demerits of
Arithmetic Mean
Merits:
i. It is rigidly defined.
ii. It is easy to calculate and understand.
iii. It is based on all the observations
iv. It is suitable for further mathematical
treatment.
v. Of all the averages, the arithmetic mean is
affected least by fluctuations of sampling;
in other words, the arithmetic mean is a stable average.
Demerits:

i. It is affected by extreme observations.
ii. It cannot be used in the case of open end classes such as less than 10
and more than 70, etc.
iii. It cannot be determined by inspection, nor can it be located
graphically.
iv. It cannot be used in dealing with qualitative characteristics.
v. It cannot be obtained if a single observation is missing or lost.
vi. In extremely skewed distributions it is not representative of the
distribution and hence is not a suitable measure of location.
vii. It may lead to wrong conclusions if the details of the data from which
it is obtained are not available.
viii. The arithmetic mean may not be one of the values which the variable
actually takes; it is then termed a fictitious mean.

Mean For Combined Data
If $\bar{X}_1$ is the mean for $n_1$ observations and
$\bar{X}_2$ is the mean for $n_2$ observations,
the combined mean is given by

$$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$$
Example
The mean height of 25 male workers in a
factory is 61 inches and the mean height of 35
female workers in the same factory is 58
inches. Find the combined mean height of the 60
workers.
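
One way to work this example is to apply the combined-mean formula directly. The helper name `combined_mean` is hypothetical and only used for this sketch.

```python
def combined_mean(n1, mean1, n2, mean2):
    """Combined mean of two groups: (n1*mean1 + n2*mean2) / (n1 + n2)."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)

# 25 male workers with mean height 61 in, 35 female workers with mean height 58 in
height = combined_mean(25, 61, 35, 58)
print(height)  # (1525 + 2030) / 60
```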

Median
Median is that value of the variable which
divides the group in two equal parts, one
part comprising all the values greater and
the other, all the values less than the
median.

Median is only a positional average, i.e., its
value depends on the position occupied by
a value in the frequency distribution.
Calculation of Median
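Case I: Ungrouped data: If the number of observations is odd,
then the median is the middle value after the observations
have been arranged in ascending or descending order of
magnitude. If the number of observations is even, the median
is conventionally taken as the arithmetic mean of the two middle values.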

Case II: Discrete distribution: In the case of a frequency
distribution where the variable takes the values $X_1, X_2, \ldots, X_n$
with respective frequencies $f_1, f_2, \ldots, f_n$ and $\sum f_i = N$, the total
frequency, the median is the size of the (N+1)/2th item or
observation. In this case the use of the cumulative frequency
(c.f.) distribution facilitates the calculations.

EXAMPLE 4
MARKS OF 10
STUDENTS ARE
4 7 6 8 9 4 3 2 7 8
IN ORDER 2 3 4 4 6 7 7 8 8 9
MEDIAN 6.5
MARKS OF 11
STUDENTS ARE
4 7 6 8 9 4 3 2 7 8 4
IN ORDER 2 3 4 4 4 6 7 7 8 8 9
MEDIAN 6
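
Both parts of Example 4 can be checked with the standard library, which sorts the data and, for an even count, averages the two middle values. The list names are illustrative.

```python
from statistics import median

marks_10 = [4, 7, 6, 8, 9, 4, 3, 2, 7, 8]      # even number of observations
marks_11 = [4, 7, 6, 8, 9, 4, 3, 2, 7, 8, 4]   # odd number of observations

median_10 = median(marks_10)   # mean of the 5th and 6th sorted values
median_11 = median(marks_11)   # the 6th sorted value

print(sorted(marks_10), median_10)
print(sorted(marks_11), median_11)
```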
8 COINS ARE TOSSED AND THE NUMBER OF HEADS IS NOTED
THE EXPERIMENT IS REPEATED 256 TIMES
# HEADS FREQUENCY
X f CF Xf
0 1 1 0
1 9 10 9
2 26 36 52
3 59 95 177
4 72 167 288
5 52 219 260
6 29 248 174
7 7 255 49
8 1 256 8
TOTAL 256 1017
N/2 128
MEDIAN 4
MEAN 3.972656
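
A sketch of the cumulative-frequency procedure for the coin-tossing table: build the CF column, then take the first value of X whose CF reaches the (N+1)/2th item. Variable names are assumptions for illustration.

```python
from itertools import accumulate

heads = list(range(9))                       # X = 0, 1, ..., 8
freqs = [1, 9, 26, 59, 72, 52, 29, 7, 1]     # f

cum_freqs = list(accumulate(freqs))          # CF column
N = cum_freqs[-1]
position = (N + 1) / 2                       # the (N+1)/2th item

# First X whose cumulative frequency reaches the required position
median_heads = next(x for x, cf in zip(heads, cum_freqs) if cf >= position)
mean_heads = sum(x * f for x, f in zip(heads, freqs)) / N

print(cum_freqs, median_heads, mean_heads)
```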
Case III: Continuous distribution:
Compute the cumulative frequency (cf)
Find N/2
Find the cf just greater than N/2
The corresponding class contains the median value
and is called the median class

$$\text{Median} = l + \frac{h}{f}\left(\frac{N}{2} - C\right)$$

where l is the lower limit of the median class
f is the frequency of the median class
h is the magnitude (width) of the median class
N is the total frequency
C is the cf of the class preceding the median class
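
As an illustration (not part of the original slides), the formula can be applied to the table of Example 3. There N = 200, so N/2 = 100; the cumulative frequencies are 7, 57, 137, 197, 200, so the median class is 120-140 with l = 120, h = 20, f = 80 and C = 57:

```latex
\text{Median} = l + \frac{h}{f}\left(\frac{N}{2} - C\right)
             = 120 + \frac{20}{80}\,(100 - 57)
             = 120 + 10.75
             = 130.75
```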



Merits and Demerits of Median
Merits:

i. It is rigidly defined.
ii. It is easy to understand and calculate, even for a non-medical
person.
iii. It is not affected by extreme observations and as such is very
useful in the case of skewed distributions.
iv. It can be computed for distributions with open
end classes.
v. It can sometimes be located by simple inspection and can
also be computed graphically.
vi. It is the only average to be used while dealing with qualitative
characteristics which cannot be measured quantitatively but
can still be arranged in ascending or descending order of
magnitude.
Demerits:
i. In the case of an even number of observations of
ungrouped data it cannot be determined
exactly.
ii. It is not based on each and every item of the
distribution.
iii. It is not suitable for further mathematical
treatment.
iv. It is relatively less stable than the mean, particularly
for small samples.
Quartile
The values which divide the given data into four
equal parts are known as quartiles. Therefore,
there will be only three such points $Q_1$, $Q_2$ and $Q_3$,
such that $Q_1 \le Q_2 \le Q_3$, termed as the three quartiles.
$Q_1$, known as the lower or first quartile, is the value
which has 25% of the items of the distribution
below it and consequently 75% of the items
greater than it. $Q_2$, the second quartile, coincides
with the median and has an equal number of
observations above and below it. $Q_3$, the upper or third
quartile, has 75% of the observations below it and
consequently 25% of the observations above it.

$$Q_1 = l + \frac{h}{f}\left(\frac{N}{4} - C\right)$$

$$Q_3 = l + \frac{h}{f}\left(\frac{3N}{4} - C\right)$$

Here l, h, f and C are defined as for the median, but refer to the
class containing the quartile.
Percentile
Percentiles are the values which divide the
series into 100 equal parts. So, there are 99
percentiles $P_1, P_2, \ldots, P_{99}$ such that $P_1 \le P_2 \le \cdots \le P_{99}$.
The i-th percentile value is:

$$P_i = l + \frac{h}{f}\left(\frac{iN}{100} - C\right)$$
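
The median, quartile and percentile formulas differ only in the fraction of N they target (1/2, 1/4, 3/4, i/100). A minimal sketch of one shared routine follows, assuming contiguous classes of positive frequency; the function name `grouped_quantile` and the use of the Example 3 table as input are choices made for this illustration.

```python
from itertools import accumulate

def grouped_quantile(classes, fraction):
    """Return l + (h / f) * (fraction * N - C) for the class holding the quantile.

    classes  -- list of (lower, upper, frequency), contiguous and sorted
    fraction -- 0.25 for Q1, 0.75 for Q3, i / 100 for the i-th percentile
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must lie strictly between 0 and 1")
    cum = list(accumulate(f for _, _, f in classes))
    target = fraction * cum[-1]              # N/4, 3N/4, iN/100, ...
    for i, (lower, upper, f) in enumerate(classes):
        if cum[i] >= target:                 # first class whose cf reaches the target
            C = cum[i - 1] if i > 0 else 0   # cf of the preceding class
            h = upper - lower                # class width
            return lower + (h / f) * (target - C)
    raise ValueError("no class reaches the target")

# Class limits and frequencies taken from Example 3
classes = [(80, 100, 7), (100, 120, 50), (120, 140, 80),
           (140, 160, 60), (160, 180, 3)]
q1 = grouped_quantile(classes, 0.25)
q3 = grouped_quantile(classes, 0.75)
p90 = grouped_quantile(classes, 0.90)
print(round(q1, 2), round(q3, 2), round(p90, 2))
```

Using `>=` rather than a strict comparison gives the same value when a cumulative frequency equals the target exactly, because the result then falls on the shared class boundary.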
MODE
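Mode is the value which has the
greatest frequency density.
Mode for a continuous distribution is
given by

$$\text{Mode} = l + \frac{h\,(f_1 - f_0)}{(f_1 - f_0) - (f_2 - f_1)}$$

where l is the lower limit of the modal class, h is its width,
$f_1$ is the frequency of the modal class, and $f_0$ and $f_2$ are the
frequencies of the classes preceding and succeeding it.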

EXAMPLE 7
CLASS f x xf
10-20 4 15 60
20-30 6 25 150
30-40 5 35 175
40-50 10 45 450
50-60 20 55 1100
60-70 22 65 1430
70-80 21 75 1575
80-90 6 85 510
90-100 2 95 190
100-110 1 105 105
TOTAL 97 5745
MEAN 59.2268
f1=22 f0=20 f2=21 l=60 h=10
MODE 66.6667
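
The mode in Example 7 can be recomputed as follows. The sketch assumes equal class widths, so the class with the greatest frequency is also the one with the greatest frequency density; the variable names are illustrative.

```python
# (lower limit, upper limit, frequency) for each class in Example 7
classes = [(10, 20, 4), (20, 30, 6), (30, 40, 5), (40, 50, 10),
           (50, 60, 20), (60, 70, 22), (70, 80, 21), (80, 90, 6),
           (90, 100, 2), (100, 110, 1)]
freqs = [f for _, _, f in classes]

m = freqs.index(max(freqs))                      # position of the modal class
l, upper, f1 = classes[m]
f0 = freqs[m - 1] if m > 0 else 0                # frequency of preceding class
f2 = freqs[m + 1] if m < len(freqs) - 1 else 0   # frequency of succeeding class
h = upper - l                                    # class width

mode = l + h * (f1 - f0) / ((f1 - f0) - (f2 - f1))
print(l, h, f0, f1, f2, round(mode, 4))
```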
Measures of Dispersion
Range
Quartile deviation
Mean Deviation
Variance
Standard deviation
RANGE


$$\text{Range} = X_{\max} - X_{\min}$$
Range is the difference between the two extreme
observations of distribution
OR
It is the difference between the greatest (maximum) and the
smallest (minimum) observation of the distribution.
It is the simplest but a crude measure of dispersion. It is
rigidly defined, readily comprehensible and easiest to
compute, requiring very little calculation.
EXAMPLE
MARKS OF STUDENTS
ROLL NO. MARKS SORTED
123 98 52
125 95 56
126 96 56
127 87 66
128 56 78
134 52 87
135 89 89
136 78 95
137 56 96
138 66 98
RANGE 98-52= 46
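
The same result in code, as a small sketch using the MARKS column above (the name `marks` is an assumption):

```python
# MARKS column of the example, in roll-number order
marks = [98, 95, 96, 87, 56, 52, 89, 78, 56, 66]

data_range = max(marks) - min(marks)   # X_max - X_min
print(sorted(marks), data_range)
```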
Merits and Demerits of Range
It is not based on the entire set of data.
Its value varies very widely from sample to
sample.
If $X_{\max}$ and $X_{\min}$ remain unaltered and all the
other values are replaced by a different set of observations,
the range of the distribution remains the same.
It cannot be used when dealing with open end
classes.
It is not suitable for mathematical treatment.
It is very sensitive to the size of the sample.
It is too indefinite to be used as a practical
measure of dispersion.
QUARTILE DEVIATION


$$\text{Quartile Deviation} = \frac{Q_3 - Q_1}{2}$$

It is a measure of dispersion based on the upper quartile
$Q_3$ and the lower quartile $Q_1$.

Inter-quartile Range = $Q_3 - Q_1$

Quartile Deviation is obtained from the inter-quartile range
on dividing by 2.
Merits and Demerits of Quartile Deviation
Merits:

It is quite easy to understand and calculate.
It makes use of 50% of the data and as such is
a better measure than the range.
As it ignores 25% of the data from the beginning and
25% from the top end, it is not affected at all by
extreme observations.
It can be computed from a frequency
distribution with open end classes.
Demerits:

It is not based on all observations.
It is affected considerably by
fluctuations of sampling.
It is not suitable for further
mathematical treatment.
EXAMPLE
DISTRIBUTION OF MONTHLY EARNING
MONTH EARNING
1 10239
2 10250
3 10251
4 10251
5 10257
6 10258
7 10260
8 10261
9 10262
10 10262
11 10273
12 10275
Q1 10251
Q3 10262
QUARTILE DEVIATION 5.5
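
The slides do not say which quartile convention was used. The sketch below assumes the (n+1)p positional rule, which is the default `"exclusive"` method of `statistics.quantiles` (Python 3.8+); under that assumption it reproduces the values in the table.

```python
from statistics import quantiles

# Monthly earnings from the example, already in ascending order
earnings = [10239, 10250, 10251, 10251, 10257, 10258,
            10260, 10261, 10262, 10262, 10273, 10275]

# n=4 returns the three cut points Q1, Q2, Q3
q1, q2, q3 = quantiles(earnings, n=4)
quartile_deviation = (q3 - q1) / 2

print(q1, q3, quartile_deviation)
```

Other conventions can give slightly different quartiles on small samples, so the method should be stated whenever quartiles are reported.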
MEAN DEVIATION


Average or mean deviation is the average amount of scatter
of the items in a distribution from either the mean or the
median, ignoring the signs of the deviations. The average that is
taken of the scatter is an arithmetic mean, which accounts for
the fact that this measure is often called the mean deviation.

For ungrouped data

$$\text{Mean Deviation} = \frac{1}{n}\sum \left| X_i - \bar{X} \right|$$

For grouped data

$$\text{Mean Deviation} = \frac{1}{N}\sum f_i \left| X_i - \bar{X} \right|$$


EXAMPLE
DISTRIBUTION OF SERIES OF DAILY RENTS
HOUSE RENT |X-MEAN|
1 3000 1819
2 3000 1819
3 3000 1819
4 3750 1069
5 4000 819.4
6 4000 819.4
7 4000 819.4
8 4500 319.4
9 4750 69.44
10 5000 180.6
11 5000 180.6
12 5000 180.6
13 5250 430.6
14 5250 430.6
15 5500 680.6
16 6250 1431
17 6500 1681
18 9000 4181
TOTAL 86750 18750
MEAN 4819.4
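
The table stops at the total of the absolute deviations; dividing that total by the number of houses gives the mean deviation. A minimal sketch with illustrative variable names:

```python
# RENT column of the example
rents = [3000, 3000, 3000, 3750, 4000, 4000, 4000, 4500, 4750,
         5000, 5000, 5000, 5250, 5250, 5500, 6250, 6500, 9000]

n = len(rents)
mean_rent = sum(rents) / n
abs_devs = [abs(x - mean_rent) for x in rents]   # |X - mean|, signs ignored
mean_deviation = sum(abs_devs) / n

print(round(mean_rent, 1), round(sum(abs_devs), 1), round(mean_deviation, 2))
```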
EXAMPLE
DISTRIBUTION OF HEIGHTS OF STUDENTS
HEIGHT # OF STUDENTS
X f fX f|X-MEAN|
158 15 2370 49.1667
159 20 3180 45.5556
160 32 5120 40.8889
161 35 5635 9.72222
162 33 5346 23.8333
163 21 3423 36.1667
164 10 1640 27.2222
165 8 1320 29.7778
166 6 996 28.3333
TOTAL 180 29030 290.667
MEAN 161.278
MD 1.61481
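
For a frequency table, each absolute deviation is weighted by its frequency. The sketch below recomputes the heights example; names are assumptions for illustration.

```python
heights = [158, 159, 160, 161, 162, 163, 164, 165, 166]   # X
freqs = [15, 20, 32, 35, 33, 21, 10, 8, 6]                # f

N = sum(freqs)
mean_height = sum(f * x for f, x in zip(freqs, heights)) / N
# Sum of f * |X - mean| over all rows, then divide by N
weighted_abs_dev = sum(f * abs(x - mean_height) for f, x in zip(freqs, heights))
mean_deviation = weighted_abs_dev / N

print(N, round(mean_height, 3), round(weighted_abs_dev, 3), round(mean_deviation, 5))
```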
STANDARD DEVIATION
It is defined as the positive square root of the
mean of the squares of the deviations of the given
observations from their mean.

For ungrouped data

$$\sigma = \text{Standard Deviation} = \sqrt{\frac{1}{n}\sum \left(X_i - \bar{X}\right)^2}$$

For grouped data

$$\sigma = \text{Standard Deviation} = \sqrt{\frac{1}{N}\sum f_i \left(X_i - \bar{X}\right)^2}$$
VARIANCE


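It is the square of the standard deviation and is denoted
by $\sigma^2$.

For ungrouped data

$$\sigma^2 = \text{Variance} = \frac{1}{n}\sum \left(X_i - \bar{X}\right)^2$$

For grouped data

$$\sigma^2 = \text{Variance} = \frac{1}{N}\sum f_i \left(X_i - \bar{X}\right)^2$$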
PROPERTIES OF STANDARD
DEVIATION
PROPERTY 1
SD is independent of change of origin but not of scale
PROPERTY 2
SD is the minimum value of the root mean square deviation, attained when the deviations are taken from the mean
PROPERTY 3
SD is suitable for further mathematical treatment
PROPERTY 4
SD < Range
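
Property 1 can be illustrated numerically: adding a constant to every observation leaves the SD unchanged, while multiplying by a constant multiplies the SD by the same factor. The data set and the constants 100 and 3 below are hypothetical, chosen only for this sketch.

```python
from math import isclose
from statistics import pstdev   # population SD, i.e. division by n

data = [12, 15, 24, 12, 13]             # hypothetical observations
shifted = [x + 100 for x in data]       # change of origin
scaled = [3 * x for x in data]          # change of scale

sd = pstdev(data)
sd_shifted = pstdev(shifted)
sd_scaled = pstdev(scaled)

print(isclose(sd_shifted, sd), isclose(sd_scaled, 3 * sd))
```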

MERITS AND DEMERITS OF SD
It is the most important and widely used
measure of dispersion.
It is based on all the observations.
The squaring of the deviations removes the
drawback of ignoring the signs of deviations
in computing the mean deviation.
It is affected least by fluctuations of
sampling.

EXAMPLE
X (X-MEAN)^2
12 13.69
15 0.49
24 68.89
12 13.69
13 7.29
15 0.49
14 2.89
12 13.69
16 0.09
24 68.89
TOTAL 157 190.1
MEAN 15.7
VARIANCE 19.01
SD 4.36
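
The slides divide by n, which corresponds to the population variance. A short check using `statistics.pvariance` and `statistics.pstdev`, with an illustrative variable name for the data:

```python
from statistics import pstdev, pvariance

# X column of the example
x = [12, 15, 24, 12, 13, 15, 14, 12, 16, 24]

mean_x = sum(x) / len(x)
squared_devs = [(v - mean_x) ** 2 for v in x]   # (X - MEAN)^2 column
variance = pvariance(x)                          # sum of squared deviations / n
sd = pstdev(x)                                   # positive square root of the variance

print(mean_x, round(sum(squared_devs), 2), round(variance, 2), round(sd, 2))
```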
EXAMPLE
# LETTERS IN WORD (X), FREQUENCY (f), d = X-MEAN
X f fX d fd^2
1 3 3 -3.277 32.208
2 8 16 -2.277 41.463
3 9 27 -1.277 14.667
4 10 40 -0.277 0.765
5 5 25 0.723 2.617
6 4 24 1.723 11.880
7 3 21 2.723 22.251
8 1 8 3.723 13.864
9 3 27 4.723 66.932
10 1 10 5.723 32.757
TOTAL 47 201 239.404
MEAN 4.277
VARIANCE 5.094
EXAMPLE
CLASS f BOUNDARIES x xf d^2 fd^2
30-39 1 29.5-39.5 34.5 34.5 1128.96 1128.96
40-49 4 39.5-49.5 44.5 178 556.96 2227.84
50-59 14 49.5-59.5 54.5 763 184.96 2589.44
60-69 20 59.5-69.5 64.5 1290 12.96 259.2
70-79 22 69.5-79.5 74.5 1639 40.96 901.12
80-89 12 79.5-89.5 84.5 1014 268.96 3227.52
90-99 2 89.5-99.5 94.5 189 696.96 1393.92
TOTAL 75 5107.5 11728
MEAN 68.1
VARIANCE 156
SD 12.5
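
A sketch of the grouped calculation in this last example: midpoints are taken from the class boundaries, and squared deviations are weighted by frequency. The tuple layout and names are assumptions for illustration.

```python
from math import sqrt

# (lower boundary, upper boundary, frequency) for each class
classes = [(29.5, 39.5, 1), (39.5, 49.5, 4), (49.5, 59.5, 14),
           (59.5, 69.5, 20), (69.5, 79.5, 22), (79.5, 89.5, 12),
           (89.5, 99.5, 2)]

mids = [(lo + hi) / 2 for lo, hi, _ in classes]   # x column
freqs = [f for _, _, f in classes]

N = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, mids)) / N
sum_fd2 = sum(f * (x - mean) ** 2 for f, x in zip(freqs, mids))  # total of fd^2
variance = sum_fd2 / N
sd = sqrt(variance)

print(N, round(mean, 1), round(sum_fd2), round(variance, 2), round(sd, 1))
```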
CORRELATION
When the relationship between variables is of a quantitative
nature, the appropriate statistical tool for
discovering and measuring the relationship
and expressing it in a brief formula is known
as correlation.

It is defined as an analysis of the
covariation between two or more variables.



Types of Correlation
a) Positive and negative correlation
b) Linear and non-linear correlation


METHODS OF STUDYING
CORRELATION
1. Scatter diagram
2. Karl Pearson's coefficient of
correlation
3. Bi-variate correlation method
4. Rank correlation

Scatter Diagram
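Karl Pearson's Coefficient of
Correlation
It is a numerical measure of the linear
relationship between two variables X and Y and is
defined as the ratio of the covariance
between X and Y to the product of their
standard deviations.

$$r = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y}$$

$$r = \frac{\frac{1}{n}\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\frac{1}{n}\sum (x - \bar{x})^2 \cdot \frac{1}{n}\sum (y - \bar{y})^2}}$$

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$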



EXAMPLE
ADVERTISING EXPENSES (x) AND SALES (y)
x y dx=x-mx dy=y-my dx^2 dy^2 dxdy
39 47 -26 -19 676 361 494
65 53 0 -13 0 169 0
62 58 -3 -8 9 64 24
90 86 25 20 625 400 500
82 62 17 -4 289 16 -68
75 68 10 2 100 4 20
25 60 -40 -6 1600 36 240
98 91 33 25 1089 625 825
36 51 -29 -15 841 225 435
78 84 13 18 169 324 234
TOTAL 650 660 0 0 5398 2224 2704
mx= 65
my= 66
r= 0.78
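
The value of r can be checked with both the deviation form and the raw-sum form of the formula, which should agree. This is a sketch with illustrative variable names, not the method used to build the table.

```python
from math import sqrt

x = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]   # advertising expenses
y = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]   # sales
n = len(x)

# Deviation form: sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r_dev = sxy / sqrt(sxx * syy)

# Raw-sum form: (n*sum(xy) - sum(x)*sum(y)) / sqrt([...][...])
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
r_raw = (n * sum_xy - sum(x) * sum(y)) / sqrt(
    (n * sum_x2 - sum(x) ** 2) * (n * sum_y2 - sum(y) ** 2))

print(mx, my, sxy, sxx, syy, round(r_dev, 2), round(r_raw, 2))
```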