Mathematical Statistics
Biostatistics
Vital statistics
Ongoing collection by government agencies of data relating to events such as
births, deaths, marriages, divorces and adoptions.
Uses of biostatistics
To define and quantify the nature and extent of illness and death in the
community.
For comparison
For research
Data
Data are the basic building blocks of statistics and refer to the individual
values presented, measured or observed.
(HIS / HMIS/DHIS)
Health resources
Outcome of a service
Financial reports.
For research
1. Census
Held every 10th year; information is collected about the demographic and
socio-economic characteristics of the population.
Methods
Enumerative- Pakistan / USA
Questionnaire - England
Combination
types
De facto
Person is counted at the place he / she is found at the time of
counting
De jure
Minus
(deaths + emigrants)
Example:
$130.6 \times (1 + 2.2/100)^{6}$
≈ 148.8 million
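The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch that assumes the example uses constant compound growth (a base of 130.6 million, 2.2% growth per year, 6 years); the function name `project_population` is illustrative and not part of the notes.

```python
def project_population(base, annual_growth_pct, years):
    """Project a population forward assuming constant compound annual growth."""
    return base * (1 + annual_growth_pct / 100) ** years


# Figures from the example: 130.6 million, 2.2% per year, 6 years
projected = project_population(130.6, 2.2, 6)
print(f"{projected:.1f} million")
```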
3. Notification of diseases
4. Hospital records
5. Disease registries
Sample: A part of a population is called a sample, and the process of selecting it is called
sampling.
a. Quantifiable
Policies
b. Non- Quantifiable
Laws etc
a. Primary Data: Data collected specifically for the problem under study is called
primary data.
b. Secondary Data: Data that were already collected for another purpose and are reused for
research are called secondary data, e.g. HMIS, desk review
a. Un- Grouped Data: Original information or raw material of an enquiry is called
un-grouped data
For example
63,64,70,70,70,71,65,64,64,63,61,62
b. Grouped Data: Data arranged into groups (classes), for example ages arranged into age groups, are called grouped data
Nominal Data: Data are divided into named categories, e.g. Male and Female,
Black/White. Nominal data that fall into only two groups are called dichotomous data.
Ordinal Data: Data can be placed in a meaningful order, like 1st, 2nd, 3rd in class.
Interval Scale Data: Like ordinal data, but in addition the intervals between values are
meaningful. There is no absolute zero. E.g. on the Celsius scale (°C) the difference between
100° and 90° is the same as the difference between 50° and 40°. However, because the interval
scale has no absolute zero, 100 °C is not twice as hot as 50 °C: 0 °C does not indicate a
complete absence of heat.
Ratio Scale Data: Have the same properties as interval scale data, but with an absolute
zero. Most biomedical variables form a ratio scale, e.g. weight in grams or pounds, time
in seconds or days, BP in mm of Hg, pulse rate in beats/min. A zero pulse rate indicates an
absolute lack of heartbeat. Therefore it is correct to say that a pulse rate of 120 beats/min
is twice a pulse rate of 60 beats/min.
b. Quantitative or Numerical Data: Measured Data is called Quantitative or numerical data
Variable: Any numerical value which varies from one individual to another,
e.g. the height and weight of individuals are variables.
- A characteristic of a person, object or phenomenon that can take on different values.
Variables are represented by letters X, Y, Z
Constant: Any value which is fixed is called a constant.
Constants are represented by letters a, b, c
Example of a constant: π ≈ 22/7
Classification of Variables
• Quantitative variables (where we measure): Expression of the numerical
value of a variable (age, weight, height, parity)
• Qualitative variables (where we count): Expression of the quality of a
variable (sex, color, occupation, race etc.)
• a. Independent Variable (input variable)
Variables that are used to describe the factors that are assumed to cause or
influence problems.
e.g. Smoking causes lung cancer
In which smoking is independent while lung cancer is dependent
b. Dependent Variable (outcome variable):
The variable that gets modified under the influence of independent variable
a. Continuous Variable: Any variable which can assume any value in a
given range is a continuous variable e.g. weight, height, speed of a car (0-
150km/h)
b. Discontinuous Variable (Discrete variable): A variable which can assume
only specific values (none in-between) is called a discontinuous variable
For example: Number of rooms in a house
Family size (number of family members)
These cannot be fractions
Class interval    Frequency    Percentage (%)
60-61             1            8.3
62-63             3            25.0
64-65             4            33.3
66-67             0            -
68-69             0            -
70-71             4            33.3
Total             12           100
Occupation Frequency
Business 50
Business 50 0.1
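Mean: denoted by $\bar{x}$
For ungrouped data:
$\bar{x} = \frac{\sum x}{n}$
For grouped data:
$\bar{x} = \frac{\sum f x}{\sum f}$
Where
$x$ = Mid points of the class intervals
Example:
$\bar{x}$ = 67/11
= 6.1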
Mean, Median, Mode are called averages
Mean is the best average among these
Generally when word average is spoken, it means “Mean”
Median: The central value which divides the distribution (data) into two equal
halves. One half is greater and the other half is less than the median
Median
For an even number of values:
Arrange the data in ascending or descending order
$\text{Median} = \frac{\text{Sum of the central two values}}{2}$
For Example:
The values (an even number of them) are
10, 8, 7, 16, 17, 20
Arrange the data in order
7, 8, 10, 16, 17, 20
The central two values are 10 and 16
$\text{Median} = \frac{10 + 16}{2}$
Median = 13
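The same steps can be sketched in Python: sort the values, average the two central ones, and compare with the standard library's `statistics.median`. The variable names are illustrative.

```python
import statistics

values = [10, 8, 7, 16, 17, 20]

# Step 1: arrange the data in order
ordered = sorted(values)

# Step 2: for an even number of values, average the two central values
mid = len(ordered) // 2
manual_median = (ordered[mid - 1] + ordered[mid]) / 2

# Cross-check with the standard library
library_median = statistics.median(values)
print(manual_median, library_median)
```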
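Median (for grouped data)
$\text{Median} = l + \frac{h}{f}\left(\frac{n}{2} - c\right)$
Where
$l$ = lower limit of the median group, $h$ = class interval, $f$ = frequency of the median group,
$n$ = total frequency, $c$ = cumulative frequency of the group before the median group
Example:
First calculate $n/2$
15/2 = 7.5
The group whose cumulative frequency first reaches 7.5 is called the Median group
(the group with the minimum cumulative frequency that includes 7.5)
Median
= 4 + 3/3
= 4 + 1
= 5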
Quartile:
If you divide the data into four equal parts, the numerical values at the dividing points are
called “quartiles”.
Q1 Q2 Q3 Q4
Q1 =
Q2 =
Q3 =
Q4 =
Decile: If data is divided into 10 equal parts, the numerical values are called
deciles.
D1, D2, ..., D10
D1 =
D2 =
.
.
.
D10 =
Percentile: If data is divided into 100 equal parts, the numerical values are called
percentiles.
P1, P2, ..., P100
P1 =
P2 =
.
.
.
P100 =
Mode = 60
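Mode (for grouped data) = $l + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h$
Where $l$ is the lower limit of the modal class, $f_m$ is the frequency of the modal class,
$f_1$ and $f_2$ are the frequencies of the classes before and after it, and
h is the class interval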
Example:
50----------------------60 10
Mode = 44.6
Measures of dispersions include
1) Range
2) Quartile deviation
3) Mean deviation
4) Variance
5) Standard deviation
Range: Difference between the maximum and the minimum values
Un-grouped data:
$R = x_{max} - x_{min}$
Where $x_{max}$ = maximum value and $x_{min}$ = minimum value
Example:
60, 65, 68, 70, 72
R = 72 - 60 = 12
Range = upper limit of the last class interval – lower limit of the 1st class
interval
R = 70 – 10
= 60
Quartile Deviation:
$Q.D = \frac{Q_3 - Q_1}{2}$
$Q_3$ = Third quartile
$Q_1$ = First quartile
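Mean Deviation:
$M.D = \frac{\sum |x - \bar{x}|}{n}$
The sum of all the deviation scores from the mean, $\sum (x - \bar{x})$, is always zero, so the
absolute values of the deviations are used.
Example:
2, 4, 6, 8, 10
n = 5
$\bar{x}$ = 30/5
= 6
x       x - x̄          |x - x̄|
2       2-6 = -4        4
4       4-6 = -2        2
6       6-6 = 0         0
8       8-6 = 2         2
10      10-6 = 4        4
        ∑(x - x̄) = 0   ∑|x - x̄| = 12
M.D = 12/5
= 2.4
For grouped data:
$M.D = \frac{\sum f\,|x - \bar{x}|}{\sum f}$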
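The worked example for the mean deviation of 2, 4, 6, 8, 10 can be reproduced with a short sketch. It assumes the un-grouped definition (mean of the absolute deviations from the mean); the variable names are illustrative.

```python
values = [2, 4, 6, 8, 10]

mean = sum(values) / len(values)

# Signed deviations from the mean always sum to zero
deviations = [x - mean for x in values]

# Mean deviation: average of the absolute deviations
mean_deviation = sum(abs(d) for d in deviations) / len(values)

print(mean, sum(deviations), mean_deviation)
```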
Standard Deviation:
It is the positive square root of the mean of the squared deviations from the mean.
It measures variation from the central point.
It is the square root of variance.
$S.D = \sqrt{\text{Variance}}$
Variance: The mean of the squares of all the deviation scores in the distribution
is called variance.
For un-grouped data:
$S.D = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$
$\text{Variance} = \frac{\sum (x - \bar{x})^2}{n}$
Deviation score = $x - \bar{x}$
Variance is small when the majority of the values are close to the mean
Variance is large when the majority of the values are far from the mean
Example:
Standard deviation of un-grouped data
The IQs of 10 students are as follows
115, 140, 133, 125, 120, 126, 136, 124, 132, 129
Use the first formula if the number of values is greater than 30, and the second
if it is 30 or fewer
$S.D = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$
$S.D = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
$\bar{x}$ = 1280/10 = 128
x       x - x̄                (x - x̄)²
115     115 – 128 = -13       169
140     140 – 128 = +12       144
133     133 – 128 = +5        25
125     125 – 128 = -3        9
120     120 – 128 = -8        64
126     126 – 128 = -2        4
136     136 – 128 = +8        64
124     124 – 128 = -4        16
132     132 – 128 = +4        16
129     129 – 128 = +1        1
        ∑(x - x̄) = 0         ∑(x - x̄)² = 512
$S.D = \sqrt{\frac{512}{10 - 1}} = \sqrt{56.89} = 7.542$
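The IQ example can be verified numerically. The sketch below computes the sum of squared deviations by hand and then both versions of the standard deviation (dividing by n and by n - 1); it assumes the n - 1 form is the one intended for this small sample, which agrees with the value 7.542 used later in the t-test example.

```python
import math
import statistics

iq = [115, 140, 133, 125, 120, 126, 136, 124, 132, 129]

mean = sum(iq) / len(iq)
sum_of_squares = sum((x - mean) ** 2 for x in iq)

# Dividing by n (large samples) versus n - 1 (small samples)
sd_n = math.sqrt(sum_of_squares / len(iq))
sd_n_minus_1 = math.sqrt(sum_of_squares / (len(iq) - 1))

print(mean, sum_of_squares, round(sd_n, 3), round(sd_n_minus_1, 3))

# statistics.stdev uses the n - 1 denominator
library_sd = statistics.stdev(iq)
```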
Correlation:
The concurrence of two variables more often than would be expected by
chance is called association.
The relationship between two or more series or groups is called
correlation. Technically it is the interdependence of one group on another.
Positive correlation: if the values of one group increase as the values of the other group
increase, then we say the correlation is positive.
Example: relationship of age and weight
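$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$
Example:
$\bar{x}$ = 30/5 = 6, $\bar{y}$ = 25/5 = 5
X     Y     x - x̄    y - ȳ    (x - x̄)²    (y - ȳ)²    (x - x̄)(y - ȳ)
2     1     -4        -4        16           16           16
4     3     -2        -2        4            4            4
6     5     0         0         0            0            0
8     7     2         2         4            4            4
10    9     4         4         16           16           16
                                ∑ = 40       ∑ = 40       ∑ = 40
$r = \frac{40}{\sqrt{40 \times 40}}$
r = 40/40
r = 1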
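The correlation example can be recomputed step by step. This is a minimal sketch of the product-moment formula, assuming that is the formula intended here (the table columns match its terms); the variable names are illustrative.

```python
import math

x = [2, 4, 6, 8, 10]
y = [1, 3, 5, 7, 9]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Sums of squares and cross-products of the deviations from the means
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

r = sxy / math.sqrt(sxx * syy)
print(sxx, syy, sxy, r)
```

A value of r = 1 means every point lies exactly on a rising straight line (here y = x - 1).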
Presentation of data
Tables
  Simple (cities and population)
  Double or complex
Graphs and Diagrams
  Frequency Polygon
  Pictogram
  Histogram
Symmetrical
Bar- Charts:
Merits: It provides quick glance over observation and shows interval where
main concentration lies.
[Figure: bar charts; axes labelled Population / frequency and Blood Groups]
Pictogram:
[Figure: pictogram comparing City A, City B and City C]
Histogram:
It consists of a set of rectangles whose bases are marked by the class intervals along the X-axis and whose heights
are proportional to the frequencies of the respective classes (the rectangles are closely packed, just like cells in histology).
[Figure: histogram; X-axis: Height (0 to 25), Y-axis: Frequency (0 to 20)]
Pie- Diagram:
For example:
Category         Percentage    Angle
Professionals    13.2%         48°
Skilled          22.8%         82°
Unskilled        63.8%         230°
Frequency Polygon:
[Figure: frequency polygon; X-axis: class values (5 to 25), Y-axis: frequency (10 to 20)]
Line Diagram:
[Figure: line diagram of cases of malaria; X-axis: Time (Years), Y-axis: Cases]
Scattered diagram:
[Figure: scatter diagram of temperature and weekly deaths due to respiratory infections; X-axis: mean weekly temperature (26 to 48), Y-axis: number of deaths (100 to 600)]
Each plotted data point represents one observation. A straight line drawn to fit the
plotted points as closely as possible is called the “line of best fit”.
Symmetrical Curves
Bimodal
Skewed curves: Here the frequencies tend to pile up at one end or the other end of the
distribution.
[Figure: skewed curves showing the positions of the mode, median and mean]
Sampling:
Population or Universe: The universe is a “defined whole” about which
information is desired. It may consist of individuals, families, households, objects
etc.
Advantages of Sampling:
Representative sample:
A representative sample is one from which valid inferences can be drawn
regarding the population parameters. Parameters are the values that describe the
universe under study. The values calculated from samples are statistics, which help in
inferring the parameters.
A sample is representative if it closely resembles the population from which it
is drawn.
Types of sampling techniques:
Probability sampling:
When each element in the population has a known chance of being included
in the sample.
Simple random: An important sampling technique in which each sampling
unit of a population has an equal probability of being included in the
sample.
Procedure:
1) Prepare a sampling frame list showing all the units
2) Decide on the number to be selected (sample size)
3) Select the required number of units through,
Drawing lots “lottery method” when sample is small
Use random tables especially if the sample size is large.
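The procedure above can be sketched in Python, with the standard library's random number generator standing in for drawing lots or a random number table. The frame of 100 numbered units, the sample size of 10 and the function name `simple_random_sample` are all hypothetical, chosen only for illustration.

```python
import random


def simple_random_sample(frame, size, seed=None):
    """Draw `size` units without replacement; every unit has an equal chance."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(frame, size)


# 1) Sampling frame: a list of all units (here, hypothetical units numbered 1..100)
frame = list(range(1, 101))

# 2) Sample size
sample_size = 10

# 3) Select the required number of units at random
sample = simple_random_sample(frame, sample_size, seed=1)
print(sorted(sample))
```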
1 = Absolute certainty
If a random sample of 10 people were drawn an infinite number of times from a
population of 100 people, then the probability of each person being included in the
sample would be 10/100 = 0.1.
Standard error:
It is a measure of the extent to which the sample means deviate from the true
population mean.
If we draw a large number of random samples of equal size from the same
population, the means of all these samples are not the same because of
“sampling error”. If we draw a curve (distribution) of these sample means, they
spread out to form a distribution called the “random sampling distribution of means”.
The random sampling distribution of means will always tend to be normal (normal
distribution curve) as the sample size increases, irrespective of the shape of the
population distribution from which the samples were drawn. This is called the
“Central limit theorem”. The theorem also states that the mean of the random
sampling distribution of means ($\mu_{\bar{x}}$) is equal to the mean of the original
population ($\mu$).
$\mu_{\bar{x}} = \mu$
Standard error ($\sigma_{\bar{x}}$) = $\frac{\sigma}{\sqrt{n}}$
Where
$\sigma$ = standard deviation of the population
n = sample size
Standard error is inversely related to the square root of the sample size. So the
larger the sample size becomes, the more closely the sample mean will represent
the true population mean. This is the reason why the results of large studies or
surveys are more trusted than the results of small ones.
When the population standard deviation is not known, the standard error is estimated
from the sample:
$s_{\bar{x}} = \frac{S.D}{\sqrt{n}}$
Where
S.D = standard deviation of the sample
n = Sample size
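A short sketch shows how the estimated standard error shrinks as the sample grows. The standard deviation 7.542 and n = 10 come from the IQ example elsewhere in these notes; the second sample size of 40 is a hypothetical value chosen to show that quadrupling n halves the standard error.

```python
import math


def standard_error(sd, n):
    """Estimated standard error of the mean: S.D divided by the square root of n."""
    return sd / math.sqrt(n)


se_10 = standard_error(7.542, 10)
se_40 = standard_error(7.542, 40)  # hypothetical larger sample, same S.D

print(round(se_10, 3), round(se_40, 3))
```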
Tests of significance:
The information obtained from a sample is used to make decisions about the
population. For example, on the basis of a sample we may be required to decide
whether a certain drug is effective in curing a particular disease or not, or a
medical researcher might be required to decide on the basis of experimental
evidence whether a certain vaccine is superior to another which is already on the
market. We use certain rules and procedures for this. These are called “TESTS OF
SIGNIFICANCE” or “TESTS OF HYPOTHESIS”.
More than two means (Is one sample mean significantly different from
more than one other sample mean?): ANOVA with F-test
P-Value    Confidence level
0.05       95%
0.01       99%
0.1        90%
[Figure: normal curve, two-tailed test; the central 95% is the region of acceptance, with 2.5% in each tail]
In testing a hypothesis, two types of error can be committed.
1) We reject the null hypothesis when it is true (type I error or α-error)
(Probability of rejecting the null hypothesis when it is in fact true)
2) We accept the null hypothesis when it is false (type II error or β-error)
(Probability of not rejecting the null hypothesis when it is in fact false)
Chi-Square
Level of significance ($\alpha$) = 5%
Then the test statistic is
$\chi^2 = \sum \frac{(O - E)^2}{E}$
Where
O = observed values
E = Expected values
∑ = summation
[Figure: chi-square curve; table value = 3.84, calculated value = 16.66, which lies in the region of rejection]
The calculated value of chi-square (16.66) is greater than the table value (3.84), so it falls in
the region of rejection. So we reject the null hypothesis and accept the
alternate hypothesis.
P-Value is < 0.05
Conclusion:
There is a significant relationship between sex and smoking
(There is a statistically significant difference in smoking by gender)
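The chi-square calculation can be sketched as follows. The observed counts in the original table did not survive, so the 2 x 2 table of sex by smoking status used here is hypothetical; the expected values are assumed to be computed in the usual way as (row total x column total) / grand total, and 3.84 is the table value quoted above for a 5% level of significance.

```python
def chi_square(observed):
    """Chi-square statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected count
            statistic += (o - e) ** 2 / e
    return statistic


# Hypothetical counts: rows = male, female; columns = smoker, non-smoker
observed = [[30, 20],
            [10, 40]]

TABLE_VALUE = 3.84  # table value at the 5% level, as quoted in the notes

statistic = chi_square(observed)
reject_null = statistic > TABLE_VALUE
print(round(statistic, 2), reject_null)
```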
P-Value > 0.05: We accept the null hypothesis (the calculated value of the test of
significance falls in the region of acceptance) and reject the alternate hypothesis.
Chance cannot be excluded as an explanation of the results: the difference between the
sample mean ($\bar{x}$) and the hypothesized population mean ($\mu_{hyp}$) is insignificant,
and so the results are insignificant.
Student t-test:
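$t = \frac{\bar{x} - \mu_{hyp}}{s_{\bar{x}}}$
Where
$\bar{x}$ = Mean of the sample
$\mu_{hyp}$ = Hypothesized population mean
$s_{\bar{x}}$ = Estimated standard error of the mean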
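Example: The Principal of a medical college states that the college’s students are
a highly intelligent group with an average IQ of 135. This claim constitutes a
hypothesis that can be tested.
$H_0$: $\mu_{hyp} = \mu = 135$ (Hypothesized Population mean (Stated mean) is equal to 135)
The IQs of a sample of 10 students are
115, 140, 133, 125, 126, 136, 124, 132, 129, 120
Sample mean $\bar{x}$ = 1280/10 = 128
$s_{\bar{x}} = \frac{S.D}{\sqrt{n}}$
Where
S.D is calculated with n - 1 = 10 - 1 = 9 in the denominator, giving S.D = 7.542
$s_{\bar{x}}$ = 7.542/√10
= 2.385
$t = \frac{\bar{x} - \mu_{hyp}}{s_{\bar{x}}}$
= (128 - 135)/2.385
= -2.935
d.f = n - 1
= 10 - 1 = 9
For d.f = 9 at the 5% level of significance (two-tailed), the table value of t is 2.262.
The calculated value (-2.935) lies beyond -2.262, so it falls outside the region of
acceptance and the null hypothesis is rejected.
So the average IQ of the medical college students is significantly different from 135;
the sample does not support the Principal's claim.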
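The t-test example can be reproduced with the standard library. This is a sketch under two assumptions: the standard deviation uses the n - 1 denominator (which reproduces the 7.542 above), and the two-tailed 5% table value for 9 degrees of freedom is 2.262, a standard t-table figure supplied here for illustration.

```python
import math
import statistics

iq = [115, 140, 133, 125, 126, 136, 124, 132, 129, 120]
hypothesized_mean = 135

n = len(iq)
sample_mean = statistics.mean(iq)
sd = statistics.stdev(iq)              # n - 1 denominator
standard_error = sd / math.sqrt(n)

t = (sample_mean - hypothesized_mean) / standard_error
degrees_of_freedom = n - 1

T_TABLE_VALUE = 2.262  # assumed: two-tailed, 5% level, 9 d.f.
reject_null = abs(t) > T_TABLE_VALUE

print(sample_mean, round(standard_error, 3), round(t, 3), reject_null)
```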
z- test:
$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$
Where
$\bar{x}$ = Mean of sample
$\mu$ = Mean of population
$\sigma_{\bar{x}}$ = Standard error
Study questions
$H_0$: $\mu_{hyp}$ = 65 inches (Hypothesized Population mean is equal to 65 inches)
Any curve which is smooth, symmetrical and bell shaped is called “normal curve”.
[Figure: normal curve with the mean, median and mode at the same central point]
However there is one standard normal curve. The characteristics of this curve are:
1) Bell shaped
2) Mean, Median and Mode are all at the same point
3) Mean is zero
4) Tails go to infinity
5) Area under the curve is equal to 1 (or 100%)
6) Mean ± 1 S.D covers 68.3% of the area
The values 1 S.D or 2 S.D on each side of the mean are called confidence limits. The interval
between them is called the confidence interval. We can say with confidence that 68%
of the distribution lies within approximately 1 S.D of the Mean.
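The area figures for the standard normal curve can be checked with `statistics.NormalDist` from the Python standard library. The helper name `area_within` is illustrative; the ± 2 S.D figure is computed here as an extra check and is not quoted in the notes.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Proportion of the area lying within k S.D on each side of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_1sd = area_within(1)
within_2sd = area_within(2)
print(round(within_1sd, 3), round(within_2sd, 3))
```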
Life Table:
Requirements:
1) Age specific death rates
2) Life expectancy rate
They are taken from record of survey.
Construction:
Column I: Frequency by age
Column II: No of persons surviving that age of life, or of no. of persons
on which table was started.
Column III: Probability of dying within the given year
Column IV: No of persons dying in successive years of age in each
successive age group.
Advantages:
1) It is used by life insurance companies (to work out how much premium will be
received from a given age onward and what the life expectancy is)
2) It is used by health authorities for long term planning
3) The number of years added to life by surgical or medical techniques, or the efficacy of
prevention and control programs, can also be determined.
4) It gives guidance on whether or not to continue a program
Detailed calculations and column definitions
A standard abridged life table is presented in Example 2. This section goes through
The number of years in each age interval. For example for the <1 age group n = 1, for
the 1-4 age group n = 4 and for all other age groups including 85+ n = 5.
Usually it is assumed that death occurs uniformly across time and that on average
people will live 0.5 of the interval before death. However, there are some cases
where we know that death does not occur uniformly across time within age groups.
For example, for those aged under 1 we assume that the average proportion of the
or
n * nMx
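Probability of surviving the interval: 1 – probability of dying
or
${}_n p_x = 1 - {}_n q_x$
Number alive at the start of the interval
or
$l_x = l_{x-n} \times {}_n p_{x-n}$
Number dying during the interval
or
$l_x - l_{x+n}$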
average proportion of year lived by those who die*number of deaths during interval)
or
At age 85+ everybody dies during the interval so an adjustment has to be made.
Whatever is used as an estimate of the number of years lived has little impact on
$L_{85+} = \frac{l_{85}}{M_{85+}}$
This is the ‘number of person years lived through the interval’ column summed from
the bottom.
or
$T_x = T_{x+n} + {}_n L_x$
Expectation of life ($e_x$)
or
$e_x = \frac{T_x}{l_x}$
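The column calculations can be sketched in Python. Several things here are assumptions, not taken from the notes: the conversion from death rate to probability of dying uses the common form q = n·M / (1 + n·(1 − a)·M), a single value a = 0.5 is used for every interval (the notes point out that this does not hold for those aged under 1), and the four age intervals and death rates are hypothetical toy values. The function name `abridged_life_table` is illustrative.

```python
def abridged_life_table(widths, death_rates, a=0.5, radix=100_000):
    """Build life-table columns from age-specific death rates.

    widths      -- years in each age interval (the last, open-ended one is ignored)
    death_rates -- age-specific death rate (M) for each interval
    a           -- assumed average proportion of the interval lived by those who die
    radix       -- starting cohort size
    """
    k = len(death_rates)
    q, l = [], [float(radix)]
    for n, m in zip(widths[:-1], death_rates[:-1]):
        qx = n * m / (1 + n * (1 - a) * m)  # probability of dying (assumed form)
        q.append(qx)
        l.append(l[-1] * (1 - qx))          # number alive at start of next interval
    q.append(1.0)                           # everybody dies in the open-ended interval

    # Number dying in each interval
    d = [l[i] - l[i + 1] for i in range(k - 1)] + [l[-1]]

    # Person years lived through each interval
    L = [widths[i] * (l[i + 1] + a * d[i]) for i in range(k - 1)]
    L.append(l[-1] / death_rates[-1])       # open-ended interval: l / M

    # T: the L column summed from the bottom
    T, running = [], 0.0
    for person_years in reversed(L):
        running += person_years
        T.append(running)
    T.reverse()

    e = [T[i] / l[i] for i in range(k)]     # expectation of life
    return {"q": q, "l": l, "d": d, "L": L, "T": T, "e": e}


# Hypothetical toy table with four intervals; the last one is open-ended
table = abridged_life_table(widths=[1, 4, 5, 5],
                            death_rates=[0.02, 0.001, 0.005, 0.15])
print([round(v, 1) for v in table["e"]])
```

In the open-ended interval the expectation of life reduces to 1 / M, which gives a quick sanity check on the last row.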
Z-score: The location of any element in a normal distribution can be
expressed in terms of how many S.D it lies above or below the mean of
the distribution.
$z = \frac{x - \mu}{\sigma}$
Table of Z-score:
Table of z-score states what proportion of any normal distribution lies
above or below any given Z-score.
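In place of a printed table, the proportion below or above a given Z-score can be computed with `statistics.NormalDist`. The values used (an observation of 130 from a distribution with mean 100 and S.D 15) are hypothetical and chosen only so that the Z-score comes out as a round number.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Hypothetical values: observation 130, mean 100, S.D 15
z = z_score(130, 100, 15)

proportion_below = NormalDist().cdf(z)   # area of the standard normal curve below z
proportion_above = 1 - proportion_below

print(z, round(proportion_below, 4), round(proportion_above, 4))
```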