Lecture Notes Community Biostats Typed Without Errors-1

Basic Definitions and concepts
Statistics: The science of systematic collection, classification, tabulation, presentation, analysis and interpretation of data.
It is a science of facts and figures.
Branches of statistics:
1. Mathematical statistics
2. Applied statistics
   a. Agricultural statistics
   b. Socioeconomic statistics
   c. Biostatistics (medical statistics)
Biostatistics (Medical Statistics): The basic science of collection, classification, analysis, quantification and interpretation of data in relation to vital events.
Biostatistics: The branch of statistics that deals with vital events in a human population. These vital events are births, deaths, sickness, marriage, divorce etc.
It covers statistics on health and on disease-related states and events.
Vital statistics
Ongoing collection by government agencies of data relating to events such as births, deaths, marriages, divorces and adoptions.
Uses of biostatistics
To define and quantify the nature and extent of illness and death in the
community.
To establish the causes of health problems.
To plan health measures (health programs).
To evaluate the outcome of health measures.
For comparison
For research
Data
Data are the basic building blocks of statistics and refer to the individual values presented, measured or observed.
Data: The facts and figures you collect are called data.
Data are something assumed as fact and made the basis of reasoning or calculation.
Information: When data are arranged in such a form that they become meaningful, the result is called information.
Health information system (HIS, HMIS, DHIS)
A mechanism for the collection, processing, analysis and transmission of information required for organizing and operating services and research.
Information is needed about:
Demography (science of population) and vital statistics
Environmental health
Health status indicators
Health resources
Health services utilization.
Outcome of a service
Financial reports.
Uses of health information
To measure the health status of a community
For comparison and conclusion
For planning and management
To see performance of a health care programme
To assess satisfaction of consumer
For research
Sources of health data
1. Census
Held every 10th year and information is collected about demographic and
socio-economic characteristics of population.
Methods
Enumerative: Pakistan / USA
Questionnaire: England
Combination
Types
De facto: A person is counted at the place where he / she is found at the time of counting.
De jure: A person is counted at the place of his / her usual (normal) residence.
Intercensal population estimation
1. Natural increase method
Estimated population = (previous census + births + immigrants) - (deaths + emigrants)
2. Arithmetic progression method
Estimated population = base population x [1 + (gr/100) x no. of years], where gr is the annual growth rate in percent.
Example:
Population on 1-7-1998 = 130.6 million and gr = 2.2%
Estimated population on 1-7-2004 = 130.6 x [1 + (2.2/100) x 6] = 147.8 million
3. Geometric progression method
Estimated population = base population x [1 + gr/100]^(no. of years)
= 130.6 x [1 + 2.2/100]^6
= 148.8 million
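The three estimation methods above can be sketched in a few lines of Python. This is only an illustration: the function names and parameter names are assumptions made for this sketch, and the figures are the ones used in the worked example (130.6 million, gr = 2.2%, 6 years).

```python
def natural_increase(previous_census, births, immigrants, deaths, emigrants):
    """Natural increase method: additions minus losses since the last census."""
    return (previous_census + births + immigrants) - (deaths + emigrants)


def arithmetic_estimate(base_population, growth_rate_pct, years):
    """Arithmetic progression: yearly growth is applied to the base population only."""
    return base_population * (1 + growth_rate_pct / 100 * years)


def geometric_estimate(base_population, growth_rate_pct, years):
    """Geometric progression: growth compounds from year to year."""
    return base_population * (1 + growth_rate_pct / 100) ** years


if __name__ == "__main__":
    # Worked example from the notes (populations in millions).
    print(round(arithmetic_estimate(130.6, 2.2, 6), 1))  # 147.8
    print(round(geometric_estimate(130.6, 2.2, 6), 1))   # 148.8
```

The geometric estimate is slightly larger because each year's growth is applied to an already enlarged population.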
2. Registration of vital events
Births, deaths, marriages, divorces, adoptions etc.
Union council -> tehsil council -> district council
3. Notification of diseases
4. Hospital records
5. Disease registries
6. Record linkage: "Record linkage" is the term used by statisticians, epidemiologists, and historians, among others, to describe the process of joining records from one data source with another that describe the same entity.
Large administrative databases are increasingly being used to compare mortality across hospitals. In one reported example, Automatch (record linkage software) running on a Pentium microcomputer was used to link patient records in a hospital's database (N = 253,836) with mortality files from California (N = 1,312,779) and the U.S. Death records were identified for 88% of 494 patients with cancer metastatic to the liver, 84% of 164 patients with pancreatic cancer, and 91% of 126 patients with CD4 counts of less than 50. Hospital data can be accurately linked with state and national vital statistics using commercial record linkage software.
7. Health facility records
8. Health manpower statistics
9. Environmental health statistics
10. Population based epidemiological studies
11. Demographic surveys
12. Economic surveys
13. Non-quantifiable information, e.g. policies, laws etc.

Population: The number of people living in a specific area. In statistics, however, a population is the whole group of individuals or things about which information is wanted.
Sample: A part of a population is a sample, and the process of taking a sample is called sampling.
Statistic: Any numerical value computed from a sample is known as a statistic.
Parameter: Any numerical value computed from the population is known as a parameter.
Classification of Data
a. Quantifiable
   Qualitative (nominal, ordinal, interval scale, ratio scale)
   Quantitative (continuous, discontinuous)
b. Non-quantifiable
   Policies, laws etc.
a. Primary Data: Data collected specifically for the problem under study are called primary data.
b. Secondary Data: Data already collected, or taken from other data sources, and used for research purposes are called secondary data, e.g. HMIS, desk review.
a. Ungrouped Data: The original information or raw material of an enquiry is called ungrouped data.
For example:
Heights of the 4th year MBBS class in inches
63, 64, 70, 70, 70, 71, 65, 64, 64, 63, 61, 62
b. Grouped Data: Data arranged into groups (classes), for example age groups, are called grouped data.
Class Interval    Frequency
60-62             3
62-64             5
64-66             8
66-68             5
68-70             6
70-72             1
a. Qualitative or Categorical Data (Counted)
It is further divided into:
• Nominal data: Data are divided into named categories, e.g. male and female, black/white. Nominal data that fall into only two groups are called dichotomous data.
• Ordinal data: Data can be placed in a meaningful order, like 1st, 2nd, 3rd in class.
• Interval scale data: Like ordinal data, but in addition the intervals between values are meaningful; there is no absolute zero. E.g. on the Celsius scale (°C) the difference between 100° and 90° is the same as the difference between 50° and 40°. However, because an interval scale has no absolute zero, 100 °C is not twice as hot as 50 °C: 0 °C does not indicate a complete absence of heat.
• Ratio scale data: Have the same properties as interval scale data, but with an absolute zero. Most biomedical variables form a ratio scale, e.g. weight in grams or pounds, time in seconds or days, BP in mm of Hg, pulse rate in beats/min. A zero pulse rate indicates an absolute lack of heartbeat. Therefore it is correct to say that a pulse rate of 120 beats/min is twice a pulse rate of 60 beats/min.
b. Quantitative or Numerical Data: Measured data are called quantitative or numerical data.
Variable: Any characteristic whose value varies from one individual to another, e.g. the height and weight of individuals are variables.
A variable is a characteristic of a person, object or phenomenon that can take on different values.
Variables are represented by the letters X, Y, Z.
Constant: Any value which is fixed is called a constant.
Constants are represented by the letters a, b, c.
Example of a constant: π ≈ 22/7
Classification of Variables
• Quantitative variables (where we measure): Expression of the numerical value of a variable (age, weight, height, parity)
• Qualitative variables (where we count): Expression of the quality of a variable (sex, color, occupation, race etc.)
a. Independent variable (input variable): A variable used to describe a factor that is assumed to cause or influence the problem.
b. Dependent variable (outcome variable): The variable that is modified under the influence of the independent variable.
E.g. smoking causes lung cancer: smoking is the independent variable while lung cancer is the dependent variable.
a. Continuous variable: Any variable which can assume any value in a given range, e.g. weight, height, speed of a car (0-150 km/h).
b. Discontinuous variable (discrete variable): A variable which can assume only specific values, with none in between.
For example: number of rooms in a house, family size (number of family members). These cannot be fractions.
Frequency distribution table: An arrangement of the data according to size or magnitude, showing how often each value or class occurs, is known as a frequency distribution.
For example:
Class interval (age group)    Frequency    Percentage
0-4                           4            33.3
5-9                           4            33.3
10-14                         2            16.7
15-19                         1            8.3
20-24                         1            8.3
Total                         12           100
Make a frequency distribution of the following data.
63, 64, 70, 70, 70, 71, 65, 64, 64, 63, 61, 62
Using class boundaries (each class includes its lower boundary and excludes its upper boundary):
Class interval    Frequency    Percentage
60-62             1            8.3
62-64             3            25.0
64-66             4            33.3
66-68             0            -
68-70             0            -
70-72             4            33.3
Total             12           100
Using class limits:
Class interval    Frequency    Percentage
60-61             1            8.3
62-63             3            25.0
64-65             4            33.3
66-67             0            -
68-69             0            -
70-71             4            33.3
Total             12           100
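The tallying in the exercise above can be checked with a short Python sketch. The function name `frequency_table` and its parameters are assumptions for illustration; each class is treated as including its lower boundary and excluding its upper boundary, which is the convention that reproduces the counts in the table.

```python
# Heights (inches) from the exercise in the notes.
heights = [63, 64, 70, 70, 70, 71, 65, 64, 64, 63, 61, 62]


def frequency_table(values, start, width, n_classes):
    """Return rows of (lower, upper, frequency, percentage).

    A value v is counted in a class when lower <= v < upper.
    """
    total = len(values)
    rows = []
    for i in range(n_classes):
        lower = start + i * width
        upper = lower + width
        frequency = sum(1 for v in values if lower <= v < upper)
        rows.append((lower, upper, frequency, round(100 * frequency / total, 1)))
    return rows


if __name__ == "__main__":
    for lower, upper, frequency, pct in frequency_table(heights, 60, 2, 6):
        print(f"{lower}-{upper}  {frequency}  {pct}")
```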
Classification: Arranging data into classes is called classification.
Tabulation: Presentation of data in the form of a table is called tabulation.
Table: A systematic arrangement of data in columns and rows is called a table.
Analysis: Application of statistical tests (tests of significance etc.) to the data.
Interpretation: Drawing results and conclusions from the analysis.
Factor: A fact or circumstance that helps to bring about a result.
Various frequency distributions
Frequency
• Number of pieces of data in a given category or class of qualitative data
Frequency distribution
• Listing of the frequency of each category or class in tabular form
Relative frequency (probability)
• Ratio of the frequency of a given class to the total number of observations made
• It is denoted by f/N
Cumulative frequency
• Running total of the frequencies up to and including a given class
Frequency distribution table of a qualitative variable
Occupation             Frequency
Government employee    100
Business               50
Factory worker         150
Farmer                 200
Total                  500
Distribution table of relative frequency
Formula for relative frequency = f/N
Occupation             Frequency    Relative frequency
Government employee    100          0.2
Business               50           0.1
Factory worker         150          0.3
Farmer                 200          0.4
Total                  500          1.0
Frequency, relative frequency and percentage distribution table
Educational status     Frequency    RF     %
No formal education    100          0.2    20.0
Primary                50           0.1    10.0
Middle                 150          0.3    30.0
Matric                 200          0.4    40.0
Total                  500          1.0    100.0
Percentage = relative frequency x 100, or
Percentage = (f/N) x 100
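As a quick check of the relative frequency and percentage columns, here is a minimal Python sketch using the occupation counts from the table. The function name `relative_frequency_table` is hypothetical and chosen only for this illustration.

```python
# Occupation counts from the table in the notes.
occupations = {
    "Government employee": 100,
    "Business": 50,
    "Factory worker": 150,
    "Farmer": 200,
}


def relative_frequency_table(counts):
    """Return rows of (category, f, relative frequency f/N, percentage)."""
    total = sum(counts.values())
    return [
        (name, f, round(f / total, 3), round(100 * f / total, 1))
        for name, f in counts.items()
    ]


if __name__ == "__main__":
    for row in relative_frequency_table(occupations):
        print(row)
```

The relative frequencies always add up to 1 and the percentages to 100, which is a useful check on any hand-made table.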
BIOSTATISTICS (MEDICAL STATISTICS)
Descriptive Statistics.
These merely describe, organize and summarize data; they refer only to the actual data available. Example: mean blood pressure of a group of patients.
Inferential Statistics.
These involve making inferences that go beyond the actual data, i.e. generalizing to a population after having observed only a sample.
Examples:
Estimating the mean blood pressure of all Americans; predicting the effectiveness of a new drug for all patients with a particular disease after it has been tested on only a small sample of the patients.
Researchers use the information provided by the sample to draw conclusions about the population.
Uses of Inferential Statistics.
1. Estimation: how a statistic (such as the mean of the sample) can be used to estimate a parameter (such as the mean of the population) with a known degree of confidence.
2. Hypothesis testing, which is the more important use.
Measures of central tendency (averages, central values)
Data tend to cluster in the centre.
An entire distribution can be characterized by one typical measure that represents all the observations, called a measure of central tendency.
These measures include:
1) Arithmetic mean or mean
2) Median
3) Mode
Mean: Sum of all values divided by the total number of values. The mean is denoted by $\bar{x}$.
For ungrouped data: $\bar{x} = \frac{\sum x}{n}$
For grouped data: $\bar{x} = \frac{\sum fx}{\sum f}$
Where
$x$ = individual values (ungrouped data) or mid points of the class intervals (grouped data)
$\sum f = n$ (sum of all the frequencies)
Example:
Class interval    Frequency (f)    Mid point (x)    fx
2-4               2                3                6
4-6               3                5                15
6-8               4                7                28
8-10              2                9                18
Total             n = 11                            $\sum fx$ = 67
$\bar{x}$ = 67/11
= 6.1
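The grouped-data mean in the example above can be reproduced with a small Python sketch. The function name `grouped_mean` and the (lower, upper, frequency) tuple layout are assumptions made for this illustration.

```python
# Class intervals and frequencies from the worked example.
classes = [(2, 4, 2), (4, 6, 3), (6, 8, 4), (8, 10, 2)]


def grouped_mean(class_rows):
    """Mean of grouped data: sum(f * midpoint) / sum(f)."""
    sum_f = sum(f for _, _, f in class_rows)
    sum_fx = sum(f * (lower + upper) / 2 for lower, upper, f in class_rows)
    return sum_fx / sum_f


if __name__ == "__main__":
    print(round(grouped_mean(classes), 1))  # 67 / 11 = 6.1
```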
• Mean, median and mode are called averages.
• The mean is the best average among these.
• Generally, when the word "average" is used, it means the mean.
Median: The central value which divides the distribution (data) into two equal halves. One half of the values is greater than the median and the other half is less.
For ungrouped data:
Arrange the data in ascending or descending order.
For an odd number of values: Median = the $\left(\frac{n+1}{2}\right)$th value
For an even number of values: Median = (sum of the central two values) / 2
Example (odd number of values):
2, 3, 5, 3, 1, 7, 9
Total values = 7
Arrange the data in order:
1, 2, 3, 3, 5, 7, 9
Median = 4th value = 3
Example (even number of values):
10, 8, 7, 16, 17, 20
Arrange the data in order:
7, 8, 10, 16, 17, 20
The central two values are 10 and 16.
Median = (10 + 16)/2
Median = 13
For grouped data:
$\text{Median} = l + \frac{h}{f}\left(\frac{n}{2} - c\right)$
Where
l = lower boundary of the median group
h = width of the class interval
f = frequency of the median group
n = total number of observations (sum of the frequencies)
c = cumulative frequency preceding the median group
Example:
Class interval    Frequency    Cumulative frequency
0-2               2            2
2-4               4            6 (cumulative frequency preceding the median group)
4-6               3            9 (median group)
6-8               5            14
8-10              1            15
n = 15
First calculate n/2 = 15/2 = 7.5
The median group is the first group whose cumulative frequency reaches 7.5, here 4-6.
Median = 4 + (2/3)(7.5 - 6)
= 4 + 3/3
= 4 + 1
= 5
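Both median calculations can be sketched in Python as follows. The function names are hypothetical, and the grouped version assumes the usual interpolation formula l + (h/f)(n/2 - c), which reproduces the result 5 obtained in the example.

```python
def median_ungrouped(values):
    """Median of raw data: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def median_grouped(class_rows):
    """Median of grouped data; class_rows holds (lower, upper, frequency)."""
    n = sum(f for _, _, f in class_rows)
    if n == 0:
        raise ValueError("no observations")
    cumulative = 0  # cumulative frequency preceding the current class
    for lower, upper, f in class_rows:
        if cumulative + f >= n / 2:
            # Interpolate inside the median group.
            return lower + (n / 2 - cumulative) * (upper - lower) / f
        cumulative += f


if __name__ == "__main__":
    print(median_ungrouped([2, 3, 5, 3, 1, 7, 9]))       # 3
    print(median_ungrouped([10, 8, 7, 16, 17, 20]))      # 13.0
    print(median_grouped([(0, 2, 2), (2, 4, 4), (4, 6, 3), (6, 8, 5), (8, 10, 1)]))  # 5.0
```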
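Quartile:
If the data are divided into four equal parts, the numerical values at the dividing points are called quartiles.
Q1, Q2, Q3, Q4
For grouped data the quartiles take the same form as the median formula, with kn/4 in place of n/2 (Q2 is the median):
$Q_k = l + \frac{h}{f}\left(\frac{kn}{4} - c\right)$, k = 1, 2, 3, 4
Decile: If the data are divided into 10 equal parts, the numerical values at the dividing points are called deciles.
D1, D2, ..., D10
$D_k = l + \frac{h}{f}\left(\frac{kn}{10} - c\right)$, k = 1, 2, ..., 10
Percentile: If the data are divided into 100 equal parts, the numerical values at the dividing points are called percentiles.
P1, P2, ..., P100
$P_k = l + \frac{h}{f}\left(\frac{kn}{100} - c\right)$, k = 1, 2, ..., 100
In each formula l, h, f and c refer to the group that contains the required quartile, decile or percentile.
What is the difference between percentile and percentage?
Percentile: When data are divided into 100 equal parts, each dividing value is called a percentile.
Percentage: It is a proportion: how many out of 100.
3% of the data lie below the 3rd percentile and 97% lie above it.
4% of the data lie below the 4th percentile and 96% lie above it.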
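Mode:
The mode is the value in a data set or distribution which occurs most frequently. It is the most popular value in the data (the highest point on the frequency polygon curve).
For ungrouped data:
Example:
60, 62, 61, 60, 69, 68, 60
Mode = 60
For grouped data:
$\text{Mode} = l + \frac{f_m - f_1}{2f_m - f_1 - f_2} \times h$
Where
l = lower boundary of the mode group
$f_m$ = frequency of the mode group
$f_1$ = frequency of the group preceding the mode group
$f_2$ = frequency of the group following the mode group
h is the class interval
Example:
Class interval    Frequency
10-20             5
20-30             7
30-40             11
40-50             16 (mode group)
50-60             10
Mode = 40 + (16 - 11)/(2 x 16 - 11 - 10) x 10
= 40 + (5/11) x 10
= 40 + 4.5
= 44.5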
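A short Python sketch of the mode for raw and grouped data follows. The function names are hypothetical, and the grouped version assumes the standard interpolation formula l + (f_m - f_1)/(2 f_m - f_1 - f_2) x h, which may differ in rounding from a hand calculation.

```python
from collections import Counter


def mode_ungrouped(values):
    """Most frequently occurring value in raw data."""
    return Counter(values).most_common(1)[0][0]


def mode_grouped(class_rows):
    """Mode of grouped data; class_rows holds (lower, upper, frequency)."""
    freqs = [f for _, _, f in class_rows]
    i = freqs.index(max(freqs))                        # position of the mode group
    lower, upper, f_m = class_rows[i]
    f_1 = freqs[i - 1] if i > 0 else 0                 # group before the mode group
    f_2 = freqs[i + 1] if i < len(freqs) - 1 else 0    # group after the mode group
    return lower + (f_m - f_1) / (2 * f_m - f_1 - f_2) * (upper - lower)


if __name__ == "__main__":
    print(mode_ungrouped([60, 62, 61, 60, 69, 68, 60]))
    rows = [(10, 20, 5), (20, 30, 7), (30, 40, 11), (40, 50, 16), (50, 60, 10)]
    print(round(mode_grouped(rows), 1))
```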
Measures of dispersion (measures of variability)
Dispersion means the variation of the individual values around the central point, or the spread of the data.
Data set A = 4, 5, 6, 6, 9            Data set B = 1, 2, 6, 6, 15
Mean, median, mode = 6                Mean, median, mode = 6
Range = 9 - 4 = 5                     Range = 15 - 1 = 14
The two data sets A and B have the same mean, median and mode. Despite these similarities, the two data sets are obviously different; therefore describing a data set in terms of measures of central tendency alone is clearly inadequate. They differ in their variability: the extent to which their scores (individual values) are clustered together or scattered about.
Measures of dispersion include:
1) Range
2) Quartile deviation
3) Mean deviation
4) Variance
5) Standard deviation
Range: Difference between the maximum and the minimum values
Ungrouped data:
$R = x_{\max} - x_{\min}$
Where $x_{\max}$ = maximum value and $x_{\min}$ = minimum value
Example:
60, 65, 68, 70, 72
R = 72 - 60 = 12
For grouped data:
Class interval    Frequency
10-20             2
20-30             5
30-40             7
40-50             6
50-60             4
60-70             1
Range = upper limit of the last class interval - lower limit of the 1st class interval
R = 70 - 10
= 60
Quartile Deviation:
$Q.D = \frac{Q_3 - Q_1}{2}$
$Q_3$ = third quartile
$Q_1$ = first quartile
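Mean Deviation:
$M.D = \frac{\sum |x - \bar{x}|}{n}$
The sum of all the deviation scores from the mean, $\sum (x - \bar{x})$, is always zero, so the absolute values are used.
Example:
2, 4, 6, 8, 10
$\bar{x}$ = (2 + 4 + 6 + 8 + 10)/5
= 30/5
= 6
x      $x - \bar{x}$      $|x - \bar{x}|$
2      2 - 6 = -4         4
4      4 - 6 = -2         2
6      6 - 6 = 0          0
8      8 - 6 = 2          2
10     10 - 6 = 4         4
       $\sum (x - \bar{x})$ = 0      $\sum |x - \bar{x}|$ = 12
| | = absolute value, which means ignoring signs
M.D = 12/5
= 2.4
For grouped data:
$M.D = \frac{\sum f|x - \bar{x}|}{\sum f}$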
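The mean deviation example (2, 4, 6, 8, 10) can be verified with a few lines of Python. The function name `mean_deviation` is an assumption for this sketch.

```python
def mean_deviation(values):
    """Mean of the absolute deviations from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


if __name__ == "__main__":
    data = [2, 4, 6, 8, 10]
    mean = sum(data) / len(data)
    print(sum(x - mean for x in data))   # signed deviations cancel out: 0.0
    print(mean_deviation(data))          # 12 / 5 = 2.4
```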
Standard Deviation:
It is the positive square root of the mean of the squared deviations from the mean.
It measures variation from the central point.
It is the square root of the variance.
$S.D = \sqrt{\text{variance}}$
Variance: The mean of the squares of all the deviation scores in the distribution is called the variance.
For ungrouped data:
$S.D = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$
$\text{Variance} = \frac{\sum (x - \bar{x})^2}{n}$
Deviation score = $x - \bar{x}$
Variance is small if the majority of the values are close to the mean.
Variance is large when the majority of the values are far from the mean.
Example:
Standard deviation of ungrouped data
The IQs of 10 students turn out to be as follows:
115, 140, 133, 125, 120, 126, 136, 124, 132, 129
Use this formula if the number of values is greater than 30:
$S.D = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$
When the sample size is less than 30, the formula for S.D is:
$S.D = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Here $\bar{x}$ = 1280/10 = 128.
x      $x - \bar{x}$         $(x - \bar{x})^2$
115    115 - 128 = -13       169
140    140 - 128 = +12       144
133    133 - 128 = +5        25
125    125 - 128 = -3        9
120    120 - 128 = -8        64
126    126 - 128 = -2        4
136    136 - 128 = +8        64
124    124 - 128 = -4        16
132    132 - 128 = +4        16
129    129 - 128 = +1        1
       $\sum (x - \bar{x})$ = 0      $\sum (x - \bar{x})^2$ = 512
$S.D = \sqrt{\frac{512}{10 - 1}} = \sqrt{\frac{512}{9}} = 7.54$
S.D for grouped data is:
$S.D = \sqrt{\frac{\sum f(x - \bar{x})^2}{\sum f}}$
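The IQ example can be checked in Python. This is a minimal sketch: the function names and the `sample` flag (dividing by n - 1 instead of n) are assumptions introduced here to mirror the two formulas in the notes.

```python
import math

# IQ scores of 10 students from the worked example.
iq = [115, 140, 133, 125, 120, 126, 136, 124, 132, 129]


def variance(values, sample=True):
    """Mean squared deviation; divides by n - 1 for a sample, n otherwise."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    return sum_sq / (n - 1 if sample else n)


def standard_deviation(values, sample=True):
    """Positive square root of the variance."""
    return math.sqrt(variance(values, sample))


if __name__ == "__main__":
    print(round(standard_deviation(iq), 2))                # sqrt(512 / 9)  = 7.54
    print(round(standard_deviation(iq, sample=False), 2))  # sqrt(512 / 10) = 7.16
```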
Correlation:
Association: The concurrence of two variables more often than would be expected by chance is called association.
Correlation: A relationship between two or more series or groups is called correlation. Technically it is the interdependence of one group on another.
Positive correlation: If the values of one group increase and those of the other group also increase, then we say the correlation is positive.
Example: relationship of age and weight.
Negative correlation: If the values of one group increase and those of the other group decrease, then we say the relationship is negative. Example: relationship of dose of insulin and blood glucose level.
Zero correlation: If the values of one group increase or decrease and the other group shows no change, then we say that there is no relationship (zero relationship).
Correlation is measured by the coefficient of correlation, denoted by "r".
The value of "r" ranges from -1 to +1.
$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$
Example: calculate the correlation coefficient for
X: 2, 4, 6, 8, 10 ($\bar{x}$ = 6)
Y: 1, 3, 5, 7, 9 ($\bar{y}$ = 5)
X     Y     $x - \bar{x}$    $y - \bar{y}$    $(x - \bar{x})(y - \bar{y})$    $(x - \bar{x})^2$    $(y - \bar{y})^2$
2     1     -4               -4               16                              16                   16
4     3     -2               -2               4                               4                    4
6     5     0                0                0                               0                    0
8     7     2                2                4                               4                    4
10    9     4                4                16                              16                   16
Total                                         40                              40                   40
$r = \frac{40}{\sqrt{40 \times 40}}$
r = 40/40
= +1 (complete positive correlation)
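The correlation example can be reproduced with the deviation-score formula in Python. The function name `pearson_r` is an assumption for this sketch; the data are the X and Y values from the table.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient from deviation scores about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


if __name__ == "__main__":
    x = [2, 4, 6, 8, 10]
    y = [1, 3, 5, 7, 9]
    print(pearson_r(x, y))  # 40 / sqrt(40 * 40) = 1.0
```

Reversing one of the lists (so that Y falls as X rises) gives r = -1, a complete negative correlation.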
Presentation of data
1. Tables
   a. Simple (e.g. cities and population)
   b. Double or complex
2. Graphs
   a. Frequency polygon
   b. Cumulative frequency polygon
3. Diagrams
   a. Bar chart (simple, double or multiple, component)
   b. Pictogram
   c. Histogram
   d. Pie chart (pie diagram)
4. Special curves
   a. Symmetrical
   b. Skewed
Bar charts:
Merits: A bar chart gives a quick overview of the observations and shows the interval where the main concentration lies.
[Figure: simple bar chart of frequency (population) by blood group]
[Figure: simple bar chart of population by city (Lahore, Multan, Karachi)]
Presentation of qualitative data:
• Bar charts
• Map diagram
• Pictogram
• Pie diagram
Presentation of quantitative data:
1. Frequency polygon
2. Cumulative frequency polygon
3. Line diagram
4. Scatter diagram
5. Histogram
Pictogram:
It consists of pictures or small symbolic figures.
[Figure: pictogram comparing the populations of City A, City B and City C; key: one symbol = population of 10,000]
Histogram:
It consists of a set of rectangles whose bases are marked by the class intervals along the X-axis and whose heights are proportional to the frequencies of the respective classes (closely packed, just like cells in histology).
Class interval (height)    Frequency
0-5                        18
5-10                       15
10-15                      10
15-20                      15
20-25                      5
[Figure: histogram of frequency (Y-axis, 0 to 20) against height (X-axis, 0 to 25)]
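Pie diagram:
Data are presented as sectors of a circle. The angle of each sector is proportional to the frequency: angle = (frequency / total) x 360°.
For example:
Occupation       Frequency    Angle    Percentage
Professionals    25           48°      13.30%
Skilled          43           82°      22.87%
Unskilled        120          230°     63.83%
Total            188          360°     100%
[Figure: pie chart with sectors Professionals (48°, 13.3%), Skilled (82°, 22.9%), Unskilled (230°, 63.8%)]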
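The sector angles and percentages of the pie diagram follow directly from the frequencies, as this small Python sketch shows. The function name `pie_sectors` is hypothetical; the counts are those of the occupation table.

```python
# Occupation counts from the pie diagram example.
occupation_counts = {"Professionals": 25, "Skilled": 43, "Unskilled": 120}


def pie_sectors(frequencies):
    """Map each category to (angle in whole degrees, percentage to 2 decimals)."""
    total = sum(frequencies.values())
    return {
        name: (round(360 * f / total), round(100 * f / total, 2))
        for name, f in frequencies.items()
    }


if __name__ == "__main__":
    for name, (angle, pct) in pie_sectors(occupation_counts).items():
        print(f"{name}: {angle} degrees, {pct}%")
```

Because each angle is rounded separately, the rounded angles need not always add up to exactly 360; in this example they do.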
Frequency polygon:
It is a graph of the frequencies plotted against the mid points of the class intervals.
It is made by connecting the mid points of the tops of the rectangles in a histogram.
Class interval    Frequency
0-5               18
5-10              15
10-15             10
15-20             15
20-25             5
[Figure: frequency polygon of frequency (Y-axis) against class mid points (X-axis, 0 to 25)]
Line diagram:
A line diagram is used to show trends of events with the passage of time.
[Figure: line diagram of cases of malaria (Y-axis) against time in years, 2000 to 2004 (X-axis)]
Scatter diagram:
A scatter diagram shows the relationship between two variables.
[Figure: scatter diagram of temperature and weekly deaths due to respiratory infections; Y-axis: no. of deaths (100 to 600); X-axis: mean weekly temperature (26 to 48)]
Each plotted data point represents one observation. A straight line drawn to fit the plotted points as closely as possible is called the "line of best fit".
Different types of frequency curves:
1) Symmetrical distribution or curves:
A curve is said to be symmetrical if values equidistant from the central point have equal frequencies. The curve can be folded along the central point in such a way that the two halves of the curve coincide.
[Figure: symmetrical curves: normal curve and bimodal curve]
2) Skewed distribution or curves:
A curve is said to be skewed when it departs from symmetry. Here the frequencies tend to pile up at one end or the other end of the distribution.
[Figure: types of skewed curves: negatively skewed and positively skewed (with the positions of mode, median and mean marked), J-shaped and L-shaped]
Sampling:
Population or Universe: The universe is a "defined whole" about which the information is desired. It may consist of individuals, families, households, objects etc.
Sampling: A subset or part of the population is a sample from that population. The process of selecting the sample is called sampling.
Advantages of Sampling:
• By drawing a sample from a large population, the cost of the study is reduced considerably.
• The study can be done with greater speed, i.e. in less time, and with greater accuracy.
• Drawing a sample can increase the scope of a study since several aspects can be studied.
• Some tests which are not possible in the whole population can be applied to a smaller population, and important information can be obtained.
• The study can be in depth and more suitable.
• Study of the entire universe is unnecessary and is not cost effective.
Sampling Units:
These form the basis of sampling procedures. They are the breakdown of the population into smaller parts which are distinct, unambiguous and non-overlapping, so that each element of the population belongs to one and only one sampling unit, e.g. individuals who are 15-20 years of age, households, schools, industrial units etc.
When a list of all the individuals, households, schools, industries etc. is drawn up, this is called the sampling frame.
For example: 20 households out of a total of 100 were selected randomly to be sprayed with an anti-malarial spray. The list of all the households formed the sampling frame while each household was the sampling unit.
An element is the item under study, e.g. blood pressure, cholesterol.
Representative sample:
A representative sample is one from which valid inferences can be drawn regarding the population parameters. Parameters are the values of the universe under study. The values of samples are statistics, which help in inferring the parameters.
A sample is representative if it closely resembles the population from which it is drawn.
Types of sampling techniques:
There are two major types of sampling procedures.
Probability sampling:
• Simple random
• Systematic
• Cluster
• Stratified
Non-probability sampling:
• Convenience
• Purposive
• Quota sampling
• Snowball sampling
Probability sampling:
When each element in the population has a known chance of being included in the sample.
Simple random: An important sampling technique in which each sampling unit of a population has an equal probability of being included in the sample.
Procedure:
1) Prepare a sampling frame: a list showing all the units
2) Decide on the number to be selected (sample size)
3) Select the required number of units through:
• Drawing lots ("lottery method") when the sample is small
• Using random number tables, especially if the sample size is large
Systematic sampling: A pre-determined system is used for this type of sampling. Steps are as follows:
1) List the total number of units in the population (N)
2) Decide the sample size (n)
3) Calculate the sampling ratio, i.e. k = N/n (example: a sample of 100 out of 1000 gives k = 1000/100 = 10)
4) Select the first unit randomly out of the first k units (here the first ten) and then interview every k-th unit, i.e. every 10th. When no closed sampling frame exists, take every k-th arrival, e.g. every 10th or 4th patient visiting a clinic or OPD.
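The systematic sampling steps above can be sketched in Python. This is an illustration under stated assumptions: the function name `systematic_sample` and the optional `rng` parameter are hypothetical, the sampling frame is simply a list of unit numbers, and the interval k is taken as the whole-number part of N/n.

```python
import random


def systematic_sample(frame, n, rng=None):
    """Pick a random start within the first k units, then every k-th unit."""
    if not 0 < n <= len(frame):
        raise ValueError("sample size must be between 1 and the frame size")
    rng = rng or random.Random()
    k = len(frame) // n          # sampling interval, e.g. 1000 // 100 = 10
    start = rng.randrange(k)     # random position among the first k units
    return [frame[start + i * k] for i in range(n)]


if __name__ == "__main__":
    frame = list(range(1, 1001))           # units numbered 1 to 1000
    sample = systematic_sample(frame, 100)
    print(sample[:5])                      # five units, each 10 apart
```

Only the first unit is chosen at random; all later units are fixed by the interval, which is why the method is easy to apply to a queue of patients.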
Cluster sampling: Selection is made of clusters or groups such as mohallas, buildings, villages, housing units etc. Each cluster is treated as a single unit in the selection process. One can select (randomly) only a few clusters.
Example: We want to select 100 students from the medical colleges of Lahore. It is more economical to select a random set of groups or clusters, such as a random set of 10 medical colleges, and then interview 10 students in each of those 10 medical colleges.
Stratified sampling: Sometimes the population contains highly variable material and a simple random sample fails to represent it. The population is then divided into a number of groups (strata) of units in such a way that the units within each group are as similar as possible. Simple random techniques are then applied within the groups. The process of dividing the population is called stratification and the groups are called strata. This technique is called stratified random sampling.
Non-probability sampling: In non-probability sampling, the selection of elements (units) is not based on probability theory; personal judgment plays a role in the selection of the sample. There is no assurance that each element will have the same chance of being included in the sample.
Convenience sampling: The sample is selected in a haphazard fashion. This may be because of convenience, lower cost, etc.
Quota sampling: Information is collected from a specified number of individuals in each category, e.g. a quota of old and young people. It is used in market surveys.
Purposive sampling: This sampling is done on the basis of some pre-determined idea (clinical knowledge etc.).
E.g. for a study on eye complications of diabetes mellitus, I shall take patients who have been diabetic for at least 10 years.
Snowball sampling (or chain sampling, chain-referral sampling, referral sampling) is a non-probability sampling technique where existing study subjects recruit future subjects from among their acquaintances. Thus the sample group appears to grow like a rolling snowball. As the sample builds up, enough data are gathered to be useful for research. This sampling technique is often used in hidden populations which are difficult for researchers to access; example populations would be drug users or sex workers.
Probability sampling                           Non-probability sampling
1) Time consuming                              Less time consuming
2) Minimal bias                                Maximum bias
3) More authentic                              Authenticity debatable
4) Results can be generalized                  Results cannot be generalized
   over the population
5) Expensive                                   Economical
6) Inconvenient                                Convenient
7) Technically skilled operator required       Less skilled operator required
8) Every unit has the same chance              Choice plays a role
Life Table:
Tabular display of life expectancy in which a group of people born on the same day and at the same place is taken, and their mortality by age and sex is represented on a hypothetical basis.
It is assumed that 10,000 or 100,000 infants are born on the same day at one place, and that their deaths are observed and recorded as they pass through each year of life, dying at the same rates as experienced by the general population of the same place during the same time period.
Requirements:
1) Age specific death rates
2) Life expectancy rate
They are taken from survey records.
Construction:
Column I: Frequency by age
Column II: No. of persons surviving to that age of life, or the no. of persons with which the table was started
Column III: Probability of dying within the given year
Column IV: No. of persons dying in each successive year of age or age group
Column V: Life expectancy at each age group
Advantages:
1) It is used by life insurance companies (how much premium will be received from this age onward and what the life expectancy is).
2) It is used by health authorities for long-term planning.
3) The number of years added to life by surgical or medical techniques, and the efficacy of prevention and control programs, can also be determined.
4) It gives guidance on whether or not to continue a program.
Probability (Relative Frequency)
The probability of an event is a quantitative measure of the proportion of all possible, equally likely outcomes that are favourable to the event. It is denoted by p.
Probability of an event = number of favourable outcomes / number of possible outcomes
Example (Ludo dice): probability of a six = number of outcomes showing a six / number of possible outcomes = 1/6
Probabilities are usually expressed as decimal fractions, not as percentages, and lie between 0 and 1.
0 = zero probability
1 = absolute certainty
Example: If a fair coin were tossed an infinite number of times, then
Probability of heads = 0.50
Probability of tails = 0.50
If a random sample of 10 people were drawn an infinite number of times from a population of 100 people, then the probability of each person being included in the sample would be
P = 10/100 = 0.10
Example: The pulse of a group of normal healthy persons was 72 with S.D of 72. What is the probability that a male chosen at random would be found to have a pulse of 80 or more?
Standard error:
It is a measure of the extent to which the sample means deviate from the true population mean.
If we draw a large number of random samples of equal size from the same population, the means of all these samples are not the same because of "sampling error". If we draw the curve (distribution) of these sample means, they spread out to form a distribution called the "random sampling distribution of means".
The random sampling distribution of means will always tend to be normal (normal distribution curve) for sufficiently large samples, irrespective of the shape of the population distribution from which the samples were drawn. This is called the "central limit theorem". The theorem also states that the mean of the random sampling distribution of means ($\mu_{\bar{x}}$) is equal to the mean of the original population ($\mu$):
$\mu_{\bar{x}} = \mu$
The standard deviation of the "random sampling distribution of means" is called the "standard error".
Standard error: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Where
$\sigma$ = standard deviation of the population
n = sample size
The standard error is inversely related to the square root of the sample size. So the larger the sample size becomes, the more closely the sample mean will represent the true population mean. This is the reason why the results of large studies or surveys are more trusted than the results of small ones.
We do the study on a sample, so $\sigma$ (the population standard deviation) will not be known. The standard error therefore cannot be calculated; instead it is estimated using data that are available from the sample alone. This is called the "ESTIMATED STANDARD ERROR OF THE MEAN" and is symbolized by $S_{\bar{x}}$:
$S_{\bar{x}} = \frac{S}{\sqrt{n}}$
Where
S = standard deviation of the sample
n = sample size
The estimated standard error is simply called the standard error in many research articles.
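As an illustration of the estimated standard error S / sqrt(n), the sketch below reuses the ten IQ scores that appear in the standard deviation example of these notes. The function name `estimated_standard_error` is an assumption, and the sample standard deviation is computed with n - 1 in the denominator.

```python
import math

# IQ scores of 10 students (same data as the standard deviation example).
iq_scores = [115, 140, 133, 125, 120, 126, 136, 124, 132, 129]


def estimated_standard_error(values):
    """Estimated standard error of the mean: sample S.D divided by sqrt(n)."""
    n = len(values)
    mean = sum(values) / n
    sample_sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return sample_sd / math.sqrt(n)


if __name__ == "__main__":
    print(round(estimated_standard_error(iq_scores), 2))  # 7.54 / sqrt(10) = 2.39
```

Quadrupling the sample size (with the same S.D) would halve the standard error, which is the inverse square root relationship described above.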
Tests of significance:
The information obtained from the sample is used to make decisions about the population. For example, on the basis of a sample we may be required to decide whether a certain drug is effective in curing a particular disease or not, or a medical researcher might be required to decide on the basis of experimental evidence whether a certain vaccine is superior to another which is already on the market. To make such decisions we use certain rules and procedures, called "TESTS OF SIGNIFICANCE" or "TESTS OF HYPOTHESIS".
For example: chi-square test, Student t-test, Z-test etc.
Null hypothesis:
It is a statement which is to be tested for possible rejection. It is denoted by H0. It is also called the hypothesis of no difference.
For example: H0 = No difference between teaching methods A and B.
Alternate hypothesis: It is another statement; if we reject the null hypothesis then we accept the alternate hypothesis. It is denoted by H1 or HA.
HA = There is a difference between teaching methods A and B.
Choosing tests of significance/Co-relational technique
Question concerning | Nominal data | Ordinal data | Interval or ratio scale data
Difference between two proportions | Chi-square | |
One or two means (What is the true mean of the population? Is one sample mean significantly different from the other sample mean?) | | | t test, or Z-test if sample is more than 100
More than two means (Is one sample mean significantly different from more than one other sample mean?) | | | ANOVA with F-test
Variances (Are the variances in two samples significantly different?) | | | F-test
Association (To what degree are two variables correlated?) | | Spearman ρ | Pearson r
Predicting the value of one variable on the basis of another variable | | | Regression
Level of significance:
It is denoted by α. It is the probability of rejecting the Null hypothesis when it is
true. It is a pre-determined small value, commonly 0.05, 0.01 or 0.1.
α = 0.05 (P-value cut-off 0.05): It means there are 5 chances out of 100 that we reject the Null
hypothesis when it is true, and we are 95% confident of accepting the Null hypothesis when it is true.
α = 0.01 (P-value cut-off 0.01): Only a 1% chance of rejecting the Null hypothesis when it is true, and we are 99%
confident of accepting the Null hypothesis when it is true.
P value
Accepted probability of error in decisions
Internationally accepted value is 5%, or 0.05 as a fraction
It can be set below or above 0.05 depending upon the study sample
It is also known as α error, type I error or significance level
In a two-tailed test it is α/2 = 0.025 in each tail of the normal curve
Confidence level
It is equal to 1 − α (0.95 or 95%)
It is the probability of making a correct decision when the null
hypothesis is true (not rejecting it)
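As a hedged illustration of how α splits across the tails, the sketch below uses `NormalDist` from Python's standard `statistics` module to look up the cut-off z values on the standard normal curve for α = 0.05. The variable names are chosen for illustration only.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()  # standard normal curve: mean 0, SD 1

# Two-tailed test: alpha/2 = 0.025 lies in each tail
two_tail_z = std_normal.inv_cdf(1 - alpha / 2)
# One-tailed test: the whole of alpha lies in one tail
one_tail_z = std_normal.inv_cdf(1 - alpha)
confidence_level = 1 - alpha

print(round(two_tail_z, 2), round(one_tail_z, 3), round(confidence_level, 2))
```

The two-tailed cut-off is about 1.96 and the one-tailed cut-off about 1.645, with a confidence level of 0.95.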
[Figure: normal curve, two-tailed test. The central 95% is the region of acceptance; 2.5% in each tail is the region of rejection.]
[Figure: normal curve, one-tailed test. 95% is the area of acceptance; 5% (0.05) in one tail is the region of rejection.]
Regions of acceptance and rejection:
Possible results of a sampling experiment can be divided into two groups.
1) Results leading us to accept the hypothesis (region of acceptance)
2) Results leading us to reject the hypothesis (region of rejection)
When the calculated value of the test of significance falls in the region of
rejection, we reject the Null hypothesis. When the calculated value of the test
of significance falls in the region of acceptance, we accept the Null
hypothesis.
Type I Error and Type II Error:
In the theory of testing hypotheses, two types of error can be committed.
1) We reject the hypothesis when it is true (type I error or α-error)
(Probability of rejecting the null hypothesis when it is in fact true)
2) We accept the hypothesis when it is false (type II error or β-error)
(Probability of not rejecting the null hypothesis when it is in fact false)
K is the observed values (Sample value)

General procedures of testing hypothesis:
It involves the following 6 steps
1) State the Null hypothesis and the Alternate hypothesis
2) Decide the level of significance (α)
Normally 5% is used
3) State the test statistic to be used. Important tests of significance are
a) Z-test (if n > 30)
b) Student t-test
c) Chi-Square test etc.
4) Compute the value of the test statistic (test of significance) in order to
decide whether to reject or accept the Null hypothesis.
5) If the result falls in the region of rejection, we reject the Null hypothesis.
If the result falls in the region of acceptance, we accept the Null
hypothesis.
6) State the conclusion in words.
Chi-square test:
Example: Is there any relationship between sex and smoking?
Suppose we select a random sample of 100 people and find the
results as follows.
              Male     Female   Total
Smoking       30 (a)   10 (b)   40
Non smoking   20 (c)   40 (d)   60
Total         50       50       100
To test this question, we shall state the Null hypothesis.
H0 = There is no relationship between sex and smoking (Null hypothesis).
(No difference of smoking by sex)
HA = There is a relationship between sex and smoking (alternate hypothesis).
(There is a difference of smoking by sex)
E value = (horizontal total × vertical total) / grand total
E value of cell (a) = (40 × 50) / 100 = 20
E value of cell (b) = (40 × 50) / 100 = 20
E value of cell (c) = (60 × 50) / 100 = 30
E value of cell (d) = (60 × 50) / 100 = 30
Chi-Square
O values   E values   O − E   (O − E)²   (O − E)²/E
30         20         10      100        100/20 = 5.00
10         20         -10     100        100/20 = 5.00
20         30         -10     100        100/30 = 3.33
40         30         10      100        100/30 = 3.33
                                         ∑ = 16.66
Choose level of significance (α)
(α) = 5%
Then the test statistic is
Chi-Square: $\chi^2 = \sum \frac{(O - E)^2}{E}$
Where
O = observed values
E = expected values
∑ = summation
Calculated value of chi-square is 16.66

Degrees of freedom = (rows − 1) (columns − 1)
df = (r − 1) (c − 1)
= (2 − 1) (2 − 1)
= 1
Table value of chi-square at 1 degree of freedom (α = 5%) is 3.84
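[Figure: chi-square distribution. The region of acceptance lies below the table value 3.84 and the region of rejection above it; the calculated value 16.66 falls in the region of rejection.]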
The calculated value of chi-square is greater than the table value, so it falls in
the region of rejection. So we reject the Null hypothesis and accept the
alternate hypothesis.
P-value is < 0.05
Conclusion:
There is a significant relationship between sex and smoking
(There is a statistically significant difference of smoking by gender)
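The chi-square working above can be retraced with a short Python sketch. It uses only the 2 × 2 table and the table value 3.84 given in the notes; the variable names are illustrative. Because the code does not round the individual terms, it gives 16.67 rather than the 16.66 obtained by adding the rounded terms by hand.

```python
# Observed counts from the example: rows are smoking / non smoking,
# columns are male / female
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # E value = horizontal total * vertical total / grand total
        e = row_totals[i] * col_totals[j] / grand_total
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 3.84  # table value at df = 1, alpha = 5% (taken from the notes)
reject_null = chi_square > critical_value

print(round(chi_square, 2), df, reject_null)
```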
If P is less than or equal to 0.05 it is regarded as statistically significant, if we choose (α) = 5%.

P-value ≤ 0.05: We reject the null hypothesis (calculated value of the test of
significance falls in the region of rejection) and accept the alternate hypothesis.
The result was unlikely to have occurred by chance. The difference is not merely due
to chance but is statistically significant (significant difference). It means the likelihood of
the results (difference) having occurred by chance is 0.05 or less, and we are 95% confident
that the results were not obtained by chance. However, there is still a chance of up to 5% that the null
hypothesis is in fact true although it is being rejected.
P-value > 0.05: We accept the null hypothesis (calculated value of the test of
significance falls in the region of acceptance) and reject the alternate hypothesis.
Chance cannot be excluded as an explanation of the results; the difference between the sample mean ($\bar{X}$) and the
hypothesized population mean ($\mu_{hyp}$) is insignificant, and so the results (difference)
are insignificant.
Student t-test:
$t = \frac{\bar{X} - \mu_{hyp}}{S_{\bar{x}}}$
Where
$\bar{X}$ = sample mean
$\mu_{hyp}$ = hypothesized population mean (stated population mean)
$S_{\bar{x}}$ = estimated standard error
Example: The Principal of a medical college states that the college’s students are a
highly intelligent group with an average IQ of 135. This claim constitutes a
hypothesis that can be tested.
H0: $\mu = \mu_{hyp} = 135$ (Hypothesized population mean (stated mean) is equal to 135)
(There is no difference between sample and population being compared.)
HA: $\mu \neq 135$ (Population mean is not equal to 135)
Choose level of significance (α)
Let’s choose (α) = 5%
We take a sample of 10 students. Their IQs turned out to be as follows.
115, 140, 133, 125, 126, 136, 124, 132, 129, 120
Sample Mean ($\bar{X}$) = 128
Calculating the standard deviation of the sample
X      $X - \bar{X}$        $(X - \bar{X})^2$
115    115 − 128 = −13      169
140    140 − 128 = +12      144
133    133 − 128 = +5       25
125    125 − 128 = −3       9
120    120 − 128 = −8       64
126    126 − 128 = −2       4
136    136 − 128 = +8       64
124    124 − 128 = −4       16
132    132 − 128 = +4       16
129    129 − 128 = +1       1
                            ∑ = 512
$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{512}{10 - 1}} = 7.542$ (This is the standard deviation of the sample)
Calculating the estimated standard error of the sample mean
$S_{\bar{x}} = S / \sqrt{n}$
Where
S = standard deviation of the sample
n = sample size
$S_{\bar{x}} = 7.542 / \sqrt{10} = 2.385$
$t = \frac{\bar{X} - \mu_{hyp}}{S_{\bar{x}}} = \frac{128 - 135}{2.385} = -2.935$
Degrees of freedom for the t-test:
d.f = n − 1
= 10 − 1 = 9
Table value from the t-table at d.f = 9 and P = 0.05 (two-tailed) is 2.262
So:
[Figure: t distribution. The region of acceptance lies between the table values t-crit = −2.262 and t-crit = +2.262, with a region of rejection beyond each; the calculated value t-calc = −2.935 falls in the lower region of rejection.]
The calculated value falls in the region of rejection.
So we reject the Null hypothesis and accept the alternate hypothesis.

Conclusion:
$\mu \neq 135$ (the population mean is not equal to 135)
So the claim that the medical college students have an average IQ of 135 is rejected;
their average IQ is significantly different from 135.
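The whole t-test working for the IQ example can be checked with a minimal Python sketch using the standard `statistics` module. The data, the hypothesized mean of 135 and the table value 2.262 are taken from the notes; the variable names are illustrative.

```python
import math
import statistics

iq = [115, 140, 133, 125, 126, 136, 124, 132, 129, 120]
mu_hyp = 135  # hypothesized (stated) population mean

n = len(iq)
mean = statistics.mean(iq)
s = statistics.stdev(iq)     # sample standard deviation, divides by n - 1
se = s / math.sqrt(n)        # estimated standard error of the mean
t = (mean - mu_hyp) / se

df = n - 1
t_crit = 2.262  # table value at d.f = 9, two-tailed P = 0.05 (from the notes)
reject_null = abs(t) > t_crit

print(mean, round(s, 3), round(se, 3), round(t, 3), df, reject_null)
```

The comparison uses the absolute value of t because the alternate hypothesis is two-sided (mean not equal to 135).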
z- test:
$z = \frac{\bar{X} - \mu}{\sigma_{\bar{x}}}$
Where
$\bar{X}$ = mean of sample
$\mu$ = mean of population
$\sigma_{\bar{x}}$ = standard error
Study questions
Are the mean heights of girls and boys different?

Example: The Principal of a medical college states that the college’s students’
average height is 65 inches. This claim constitutes a hypothesis that can be tested.
H0: $\mu = \mu_{hyp} = 65$ inches (Hypothesized population mean is equal to 65 inches)
(There is no difference between sample and population being compared.)
(There is no difference between stated mean and estimated mean)
HA: $\mu \neq 65$ inches (Population mean is not equal to 65 inches)
(There is a significant difference between sample and population being compared.)
(There is a significant difference between stated mean and estimated mean)

Example: The Principal of a medical college states that the mean height of boy
students is the same as the mean height of girl students. This claim constitutes a
hypothesis that can be tested.
H0: $\mu_{boys} = \mu_{girls}$ (There is no difference between the mean height of boy and girl students.)
HA: $\mu_{boys} \neq \mu_{girls}$ (There is a significant difference between the mean height of boy and girl
students.)
Concept of normal distribution curve
Most physiological variables in nature are distributed in such a way that the
majority of the values lie around the mean and very few values appear at the
extremes (normal distribution pattern).
Any curve which is smooth, symmetrical and bell shaped is called a “normal curve”.
[Figure: normal curve with mean, median and mode at the same central point]
However, there is one standard normal curve. The characteristics of this curve are:
1) Bell shaped
2) Mean, median and mode are all at the same point
3) Mean is zero (and the standard deviation is one)
4) Tails go to infinity
5) Area under the curve is equal to 1 (or 100%)
6) Mean ± 1 S.D = 68.3% of the area
Mean ± 2 S.D = 95.4% of the area
Mean ± 3 S.D = 99.7% of the area

7) This curve is the basis of the theory of probability.
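The areas listed in point 6 can be reproduced from the standard normal curve. The following is a minimal sketch using `NormalDist` from Python's standard `statistics` module; the dictionary name `areas` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # standard normal curve

# Percentage of the area lying within k standard deviations of the mean
areas = {}
for k in (1, 2, 3):
    areas[k] = 100 * (std_normal.cdf(k) - std_normal.cdf(-k))
    print(f"Mean ± {k} S.D = {areas[k]:.1f}%")
```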
Confidence limit OR confidence Interval:
The values 1 S.D or 2 S.D on each side of the mean are called confidence limits. The interval
between them is called the confidence interval. We can say with confidence that about 68%
of the distribution lies within 1 S.D of the mean and about 95% within 2 S.D of the mean.
Life Table:
A tabular display of life expectancy in which a group of people born on the same
day and at the same place is taken, and mortality by age and sex is represented on a
hypothetical basis.
It is assumed that 10,000 or 100,000 infants are born on the same day at one
place, and their deaths are observed and recorded as they pass through each
year of life, dying at the same rate as experienced at those ages by the general
population of the same place during the same time period.
Requirements:
1) Age specific death rates
2) Life expectancy rate
They are taken from record of survey.
Construction:
Column I: Frequency by age
Column II: No. of persons surviving to that age of life, starting from the no. of persons
with which the table was started.
Column III: Probability of dying within the given year
Column IV: No. of persons dying in each successive year of age or in each
successive age group.
Column V: Life expectancy at each age group
Advantages:
1) It is used by life insurance companies (how much premium will be
received from this age onward and what the life expectancy is).
2) It is used by health authorities for long term planning.
3) The no. of years added to life by surgical or medical techniques, or the efficacy of a prevention
and control program, can also be determined.
4) It gives guidance on whether or not to continue a program.
Detailed calculations and column definitions
A standard abridged life table is presented in Example 2. This section goes through
the calculation of each individual column.
Width of the interval (n)
The number of years in each age interval. For example for the <1 age group n = 1, for
the 1-4 age group n = 4 and for all other age groups including 85+ n = 5.
Average proportion of the year lived by those who die (nax)
Usually it is assumed that death occurs uniformly across time and that on average
people will live 0.5 of the interval before death. However, there are some cases
where we know that death does not occur uniformly across time within age groups.
For example, for those aged under 1 we assume that the average proportion of the
year lived by those who die is 0.1.
The probability of dying (nqx)
(Number of years in interval * age-specific death rate) / (1 + number of years in interval * (1 – average proportion of year lived by those who die) * age-specific death rate)
or
nqx = (n * nMx) / (1 + n * (1 - nax) * nMx)
The probability of surviving (npx)
1 – probability of dying
or
1 - nqx
Number of persons alive at the start of the interval (lx)

This is a hypothetical population, in this case 100,000 alive/born at age 0. Those
alive at age 1-4 in this case are:
Probability of surviving the previous interval * population alive at start of previous
interval
or
lx = l(x-n) * np(x-n)
Number of deaths during interval (ndx)
Population alive at start of interval – population alive at start of next interval
or
ndx = lx - l(x+n)
Number of person years lived through the interval (nLx)
Number of years in interval * (number of persons alive at start of next interval +
average proportion of year lived by those who die * number of deaths during interval)
or
nLx = n * (l(x+n) + nax * ndx)
At age 85+ everybody dies during the interval, so an adjustment has to be made.
Whatever is used as an estimate of the number of years lived has little impact on
overall life expectancy; however, it is usual to use the following estimate:
L85+ = l85 / M85+
Total number of person years lived after the interval (Tx)
This is the ‘number of person years lived through the interval’ column summed from
the bottom.
or
Tx = T(x+n) + nLx
Expectation of life (ex)
This is the number of years a person aged x can be expected to live.
Total number of person years lived after the interval / number of persons alive at the start of the interval
or
ex = Tx / lx
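The column definitions above can be put together in a short Python sketch. This is only an illustration under stated assumptions: the function name `abridged_life_table`, the three toy age groups (<1, 1-4 and an open-ended 5+ group instead of the 85+ group in the notes) and the death rates are all hypothetical, not data from the notes or from Example 2.

```python
def abridged_life_table(widths, ax, mx, radix=100_000):
    """Build life-table columns from per-interval n, nax and nMx.

    The last interval is treated as open-ended (everybody dies in it).
    """
    k = len(mx)
    # nqx = n * nMx / (1 + n * (1 - nax) * nMx); set to 1 in the last interval
    qx = [n * m / (1 + n * (1 - a) * m) for n, a, m in zip(widths, ax, mx)]
    qx[-1] = 1.0
    # lx: persons alive at the start of each interval
    lx = [float(radix)]
    for q in qx[:-1]:
        lx.append(lx[-1] * (1 - q))
    # ndx = lx - l(x+n); in the last interval all remaining persons die
    dx = [lx[i] - lx[i + 1] for i in range(k - 1)] + [lx[-1]]
    # nLx = n * (l(x+n) + nax * ndx); the open-ended interval uses l / M
    Lx = [widths[i] * (lx[i + 1] + ax[i] * dx[i]) for i in range(k - 1)]
    Lx.append(lx[-1] / mx[-1])
    # Tx: the nLx column summed from the bottom
    Tx, running = [0.0] * k, 0.0
    for i in reversed(range(k)):
        running += Lx[i]
        Tx[i] = running
    # ex = Tx / lx
    ex = [t / l for t, l in zip(Tx, lx)]
    return {"qx": qx, "lx": lx, "dx": dx, "Lx": Lx, "Tx": Tx, "ex": ex}

# Hypothetical toy input: age groups <1, 1-4 and open-ended 5+
table = abridged_life_table(
    widths=[1, 4, 5],        # the width of the open-ended group is not used
    ax=[0.1, 0.5, 0.5],      # 0.1 for those aged under 1, 0.5 otherwise
    mx=[0.05, 0.005, 0.02],  # made-up age-specific death rates
)
for name, column in table.items():
    print(name, [round(v, 3) for v in column])
```

The first entry of the `ex` column is the life expectancy at birth for this made-up population.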
Z-score:- The location of any element in a normal distribution can be
expressed in terms of how many S.D it lies above or below the mean of
the distribution.
$z = \frac{X - \mu}{\sigma}$
Table of Z-score:
Table of z-score states what proportion of any normal distribution lies
above or below any given Z-score.
Z (a) Area between mean and z (b) Area beyond z
0.00 0.0000 0.5000
0.01 0.0040 0.4960
0.02 0.0080 0.4920
: : :
1.00 0.3413 * 0.1587
: : :
2.00 0.4772 + 0.0228
: : :
3.00 0.4987 # 0.0013

Example:-
Resting heart rate of 4th Year MBBS students of SIMS is
normally distributed with a mean of 70 and an SD of 10.
a) What is the location of an element of 85 beats/min?
b) What proportion of the population has a heart rate above 85 beats/min?
c) What is the location of an element of 65 beats/min?
d) What proportion of the population has a heart rate below 65 beats/min?
e) What heart rate divides the fastest beating 5% of the
population from the remaining 95%?
ANSWERS
a) (85 − 70)/10 = +1.5 SD (1.5 SD above the mean)
b) Table value from Z-table (area beyond 1.5 SD is 0.0668)
= 6.68% (6.68% of the population has a heart rate above 85
beats/min, or we can say the probability of one randomly selected
person from this population having a heart rate above 85 beats/min
is 6.68% or 0.0668)
c) (65 − 70)/10 = −0.5 SD (0.5 SD below the mean)
d) Table value from Z-table (area beyond 0.5 SD is 0.3085)
= 30.85% (or we can say the probability of one randomly selected
person from this population having a heart rate below 65 beats/min
is 30.85% or 0.3085)
e) The Z-score that divides the top 5% of the area under the curve
from the remaining area (95%)
= nearest figure to 5% (0.05) in the Z-table = 0.0495
0.0495 corresponds to a z-score of 1.65
The corresponding heart rate therefore lies 1.65 S.D above the
mean.
= Mean + 1.65 SD
= μ + 1.65 SD
= 70 + 1.65 × 10
= 86.5 beats/min
A heart rate of 86.5 beats/min divides the fastest beating 5% of the
population from the remaining 95%, or we conclude that the
fastest beating 5% of this population has a heart rate above
86.5 beats/min.
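The answers to this example can be cross-checked without a printed Z-table. The sketch below uses `NormalDist` from Python's standard `statistics` module, with the mean of 70 and SD of 10 from the example; the variable names are illustrative. The exact cut-off for the top 5% comes out at about 86.4 beats/min, slightly below the 86.5 obtained from the nearest table entry (z = 1.65).

```python
from statistics import NormalDist

heart_rate = NormalDist(mu=70, sigma=10)

# a) and c): location of an element in S.D units
z_85 = (85 - 70) / 10
z_65 = (65 - 70) / 10

# b) proportion above 85 beats/min, d) proportion below 65 beats/min
above_85 = 1 - heart_rate.cdf(85)
below_65 = heart_rate.cdf(65)

# e) heart rate separating the fastest 5% from the remaining 95%
top_5_cutoff = heart_rate.inv_cdf(0.95)

print(z_85, round(above_85, 4), z_65, round(below_65, 4), round(top_5_cutoff, 1))
```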
Analysis by type of data

Qualitative variables
Frequencies
• Simple frequency
• Relative frequency
• Cumulative frequency
Percentages
• Cumulative percentage
• Rates (Prevalence rate & Incidence rate)
Test of significance
• Chi-square test
Quantitative Variables
A) Central values
• Mean
• Median
• Mode
• 50th percentile
B) Dispersions
• Range
• Mean deviation
• Standard deviation
• Variance
• Percentiles
C)
Statistics:
• Descriptive
• Inferential
