
Basic Definitions and concepts

Statistics: The science of systematic collection, classification, tabulation, presentation, analysis and interpretation of data.

It is the science of facts and figures.

Branches of statistics:
1. Mathematical statistics
2. Applied statistics
   a. Agricultural statistics
   b. Socioeconomic statistics
   c. Biostatistics

Biostatistics (Medical Statistics)

Biostatistics (medical statistics): The basic science of collection, classification, analysis, quantification and interpretation of data in relation to vital events.

It is the branch of statistics that deals with vital events in a human population. These vital events are births, deaths, sickness, marriages, divorces etc.

It covers statistics on health- and disease-related states and events.

Vital statistics
Ongoing collection by government agencies of data relating to events such as births, deaths, marriages, divorces and adoptions.
Uses of biostatistics

To define and quantify the nature and extent of illness and death in the
community.

To establish causation for the existence of health problems.

To plan health measures (health programs)

To evaluate outcome of health measures

For comparison

For research

Data
Data are the basic building blocks of statistics and refer to the individual values presented, measured or observed.

Data: The facts and figures you collect are called data.

Data are something assumed as fact and made the basis of reasoning or calculation.

Information: When data are arranged in such a form that they become meaningful, they are called information.

Health information system (HIS / HMIS / DHIS)

A mechanism for the collection, processing, analysis and transmission of information required for organizing and operating services and for research.

Information is needed about:

Demography (the science of population) and vital statistics
Environmental health

Health status indicators

Health resources

Health services utilization.

Outcome of a service

Financial reports.

Uses of health information

To measure the health status of a community

For comparison and conclusion

For planning and management

To see performance of a health care programme

To assess satisfaction of consumer

For research

Sources of health data

1. Census
Held every 10th year; information is collected on the demographic and socio-economic characteristics of the population.

Methods
Enumerative: Pakistan / USA
Questionnaire: England
Combination

Types

De facto
A person is counted at the place where he / she is found at the time of counting.

De jure
A person is counted at the place of his / her usual (normal) residence.

Intercensal population estimation

1. Natural increase method
(previous census + births + immigrants) minus (deaths + emigrants)

2. Arithmetic progression method

Base population x [1 + gr/100 x no. of years]

where gr is the annual growth rate in percent.

Example:

Population on 1-7-1998 = 130.6 million and gr = 2.2%

Estimated population on 1-7-2004 =

130.6 x [1 + 2.2/100 x 6] = 130.6 x 1.132 = 147.8 million

3. Geometric progression method

Base population x [1 + gr/100]^(no. of years)

130.6 x [1 + 2.2/100]^6 = 130.6 x 1.1395

= 148.8 million
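The three estimation methods above can be sketched in a few lines of Python. This is only an illustration: the function and parameter names are made up for this example, and the growth rate is assumed to be an annual percentage.

```python
def natural_increase(previous_census, births, immigrants, deaths, emigrants):
    """(previous census + births + immigrants) minus (deaths + emigrants)."""
    return (previous_census + births + immigrants) - (deaths + emigrants)


def arithmetic_estimate(base_population, growth_rate_pct, years):
    """Arithmetic progression: base x [1 + gr/100 x years]."""
    return base_population * (1 + growth_rate_pct / 100 * years)


def geometric_estimate(base_population, growth_rate_pct, years):
    """Geometric progression: base x [1 + gr/100] ** years."""
    return base_population * (1 + growth_rate_pct / 100) ** years


# Worked example from the notes: 130.6 million in 1998, gr = 2.2%, 6 years later.
arithmetic_2004 = arithmetic_estimate(130.6, 2.2, 6)
geometric_2004 = geometric_estimate(130.6, 2.2, 6)
print(round(arithmetic_2004, 1))  # 147.8
print(round(geometric_2004, 1))   # 148.8
```

The geometric estimate is slightly higher because growth is compounded every year rather than applied to the base population only.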

2. Registration of vital events

Births, deaths, marriages, divorces, adoptions etc.

Union council → tehsil council → district council

3. Notification of diseases

4. Hospital records

5. Disease registries

6. Record linkage: "Record linkage" is the term used by statisticians, epidemiologists and historians, among others, to describe the process of joining records from one data source with records from another that describe the same entity.

Large administrative databases are increasingly being used to compare mortality across hospitals. In one reported example, Automatch (record linkage software) running on a Pentium microcomputer was used to link patient records in a hospital's database (N = 253,836) with mortality files from California (N = 1,312,779) and the U.S. Death records were identified for 88% of 494 patients with cancer metastatic to the liver, 84% of 164 patients with pancreatic cancer, and 91% of 126 patients with CD4 counts of less than 50. Hospital data can thus be accurately linked with state and national vital statistics using commercial record linkage software.

7. Health facility records

8. Health manpower statistics

9. Environmental health statistics

10. Population based epidemiological studies

11. Demographic surveys

12. Economic surveys

13. Non-quantifiable information, e.g. policies, laws etc.


Population: The number of people living in a specific area; in statistics, however, a population means the complete group of individuals or things under study.

Sample: A part of a population is a sample, and the process of taking a sample is called sampling.

Statistic: Any numerical value computed from a sample is known as a statistic.

Parameter: Any numerical value computed from the population is known as a parameter.
Classification of Data

a. Quantifiable
   Qualitative (nominal, ordinal)
   Quantitative (interval scale, ratio scale; continuous or discontinuous)

b. Non-quantifiable
   Policies, laws etc.

a. Primary Data: Data collected specifically for the problem under study are called primary data.

b. Secondary Data: Data already collected, or taken from other data sources, and reused for research purposes are called secondary data, e.g. HMIS, desk review.

a. Un-grouped Data: The original information or raw material of an enquiry is called un-grouped data.

For example

Height of 4th year MBBS class in inches

63,64,70,70,70,71,65,64,64,63,61,62

b. Grouped Data: Data arranged into groups (class intervals), for example age groups, are called grouped data.

Class Interval    Frequency
60-62             3
62-64             5
64-66             8
66-68             5
68-70             6
70-72             1

a. Qualitative or Categorical Data (Counted)

It is further divided into

• Nominal data: Data are divided into named categories, e.g. male and female, black / white. Nominal data that fall into only two groups are called dichotomous data.
• Ordinal data: Data can be placed in a meaningful order, like 1st, 2nd, 3rd in class.

b. Quantitative or Numerical Data (Measured): Measured data are called quantitative or numerical data. They are recorded on one of two scales:

• Interval scale data: Like ordinal data, but in addition the intervals between values are meaningful; there is no absolute zero. E.g. on the Celsius scale (°C) the difference between 100° and 90° is the same as the difference between 50° and 40°. However, because the scale has no absolute zero, 100 °C is not twice as hot as 50 °C: 0 °C does not indicate a complete absence of heat.
• Ratio scale data: Have the same properties as interval scale data but also have an absolute zero. Most biomedical variables form a ratio scale, e.g. weight in grams or pounds, time in seconds or days, BP in mm of Hg, pulse rate in beats/min. A zero pulse rate indicates an absolute lack of heartbeat. Therefore it is correct to say that a pulse rate of 120 beats/min is twice a pulse rate of 60 beats/min.
Variable: Any characteristic whose value varies from one individual to another, e.g. the height and weight of individuals are variables.
• A characteristic of a person, object or phenomenon that can take on different values.
Variables are represented by the letters X, Y, Z.
Constant: Any value which is fixed is called a constant.
Constants are represented by the letters a, b, c.
Example of a constant: π ≈ 22/7
Classification of Variables
• Quantitative variables (where we measure): expression of the numerical value of a variable (age, weight, height, parity)
• Qualitative variables (where we count): expression of the quality of a variable (sex, color, occupation, race etc.)
a. Independent Variable (input variable):
Variables used to describe the factors that are assumed to cause or influence the problem.
E.g. smoking causes lung cancer: smoking is the independent variable while lung cancer is the dependent variable.
b. Dependent Variable (outcome variable):
The variable that is modified under the influence of the independent variable.
a. Continuous Variable: Any variable which can assume any value within a given range is a continuous variable, e.g. weight, height, speed of a car (0-150 km/h).
b. Discontinuous Variable (discrete variable): A variable which can assume only specific values (none in between) is called a discontinuous variable.
For example: number of rooms in a house, family size (number of family members).
These cannot be fractions.

Frequency distribution table: An arrangement of the data according to size or magnitude is known as a frequency distribution.
For example:
Class Interval (age group)    Frequency    Percentage
0-4                           4            33.3
5-9                           4            33.3
10-14                         2            16.7
15-19                         1            8.3
20-24                         1            8.3
Total                         12           100%

Make a frequency distribution of the following data.

63,64,70,70,70,71,65,64,64,63,61,62

Using class boundaries (each class includes its lower boundary but not its upper boundary):

Class Interval    Frequency    %age
60-62             1            8.3
62-64             3            25.0
64-66             4            33.3
66-68             0            -
68-70             0            -
70-72             4            33.3
Total             12           100%

Using class limits:

Class Interval    Frequency    %age
60-61             1            8.3
62-63             3            25.0
64-65             4            33.3
66-67             0            -
68-69             0            -
70-71             4            33.3
Total             12           100%
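The tally for the height data can be reproduced with a short Python sketch. The function name and its parameters (start, width, n_classes) are illustrative choices, and each class is assumed to include its lower boundary but not its upper boundary.

```python
from collections import Counter

heights = [63, 64, 70, 70, 70, 71, 65, 64, 64, 63, 61, 62]


def frequency_distribution(values, start, width, n_classes):
    """Count how many values fall in each class [lower, lower + width)."""
    counts = Counter((v - start) // width for v in values)
    table = []
    for i in range(n_classes):
        lower = start + i * width
        freq = counts.get(i, 0)
        pct = round(100 * freq / len(values), 1)
        table.append((lower, lower + width, freq, pct))
    return table


table = frequency_distribution(heights, start=60, width=2, n_classes=6)
for lower, upper, freq, pct in table:
    print(f"{lower}-{upper}: frequency {freq}, {pct}%")
```

Values outside the listed classes are simply not shown, so the classes should be chosen to cover the whole range of the data.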

Classification: Arranging data into classes is called classification.

Tabulation: Presentation of data in the form of a table is called tabulation.
Table: A systematic arrangement of data in columns and rows is called a table.
Analysis: Application of a statistical test to the data (test of significance etc.)
Interpretation: Drawing results and conclusions.

Factor: A fact or circumstance that helps to bring about a result.

Various frequency distributions

Frequency
• Number of pieces of data in a given category or class of qualitative data
Frequency distribution
• Listing of the frequency of each category or class in tabular form
Relative frequency (Probability)
• Ratio of the frequency of a given class to the total number of observations made
• It is denoted by (f/N)
Cumulative frequency
• Running total of the frequencies up to and including a given class

Frequency distribution table of a qualitative variable

Occupation             Frequency
Government employee    100
Business               50
Factory worker         150
Farmer                 200
Total                  500

Distribution table of relative frequency

Occupation        Frequency    Relative frequency
Govt. employee    100          0.2
Business          50           0.1
Factory worker    150          0.3
Farmer            200          0.4
Total             500          1.0

Formula for relative frequency = f/N

Frequency, relative frequency and percentage distribution table

Educational status     Freq    RF     %
No formal education    100     0.2    20.0
Primary                50      0.1    10.0
Middle                 150     0.3    30.0
Matric                 200     0.4    40.0
Total                  500     1.0    100.0

Percentage = Relative frequency x 100, or

Percentage = f/N x 100
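As a small illustration of the two formulas, the relative frequency and percentage columns can be computed from the raw counts. The dictionary and function names below are made up for this sketch; the counts are those of the occupation table.

```python
occupation_counts = {
    "Govt. employee": 100,
    "Business": 50,
    "Factory worker": 150,
    "Farmer": 200,
}


def relative_frequencies(counts):
    """Return {category: (f, f/N, f/N x 100)} for a frequency table."""
    total = sum(counts.values())  # N
    return {name: (f, f / total, 100 * f / total) for name, f in counts.items()}


rf_table = relative_frequencies(occupation_counts)
for name, (f, rf, pct) in rf_table.items():
    print(f"{name}: f = {f}, RF = {rf:.1f}, {pct:.1f}%")
```

The relative frequencies always add up to 1 and the percentages to 100, which is a quick check on any such table.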
BIOSTATISTICS (MEDICAL STATISTICS)
Descriptive Statistics.
These merely describe, organize and summarize data; they refer only to the actual data available. Example: mean blood pressure of a group of patients.
Inferential Statistics.
These involve making inferences that go beyond the actual data, i.e. generalizing to a population after having observed only a sample.
Examples:
Estimating the mean blood pressure of all Americans from a sample; predicting the effectiveness of a new drug for all patients with a particular disease after it has been tested on only a small sample of the patients.
Researcher: Uses information provided by the sample to draw conclusions about the population.
Uses of Inferential Statistics.
1. Estimating a parameter (such as the mean of the population) from a statistic (such as the mean of the sample) with a known degree of confidence.
2. Hypothesis testing, which is the more important use.

Measures of central tendency (averages) (central values)

Data tend to cluster in the centre.

An entire distribution can be characterized by one typical measure that represents all the observations, called a measure of central tendency.
These measures include:
1) Arithmetic Mean or Mean
2) Median
3) Mode
Mean: Sum of all values divided by the total number of values. The mean is denoted by $\bar{x}$.
For ungrouped data:

$\bar{x} = \frac{\sum x}{n}$

For grouped data:

$\bar{x} = \frac{\sum fx}{\sum f}$

Where
$x$ = mid points of the class intervals
$\sum f = n$ (sum of all the frequencies)

Example:

Class Interval    Frequency (f)    Mid point (x)    fx
2-4               2                3                6
4-6               3                5                15
6-8               4                7                28
8-10              2                9                18
Total             n = 11                            ∑fx = 67

$\bar{x}$ = 67/11
= 6.1
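The grouped-data calculation can be checked with a minimal Python sketch. The function name and the (lower, upper, frequency) tuple layout are assumptions made for this example.

```python
def grouped_mean(classes):
    """Mean of grouped data: sum(f * midpoint) / sum(f).

    classes is a list of (lower, upper, frequency) tuples.
    """
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return total_fx / total_f


example = [(2, 4, 2), (4, 6, 3), (6, 8, 4), (8, 10, 2)]
print(round(grouped_mean(example), 1))  # 6.1
```

Because every value in a class is replaced by the class mid point, the result is an approximation of the mean of the original un-grouped values.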
• Mean, median and mode are called averages.
• The mean is the most commonly used of these averages, although it is sensitive to extreme values.
• Generally, when the word "average" is used, it means the mean.

Median: The central value which divides the distribution (data) into two equal halves. One half of the values is greater than the median and the other half is less.

For un-grouped data:

• Arrange the data in ascending or descending order.

For an odd number of values:

Median = value of the $\left(\frac{n+1}{2}\right)$th item

For example:
2, 3, 5, 3, 1, 7, 9
Total values = 7
Arrange the data in order
1, 2, 3, 3, 5, 7, 9
Median = (7 + 1)/2 = 4th value = 3

For an even number of values:

Median = (sum of the central two values) / 2

For example:
10, 8, 7, 16, 17, 20
Arrange the data in order
7, 8, 10, 16, 17, 20
The central two values are 10 and 16
Median = (10 + 16) / 2
Median = 13

For grouped data:

$\text{Median} = l + \frac{h}{f}\left(\frac{n}{2} - c\right)$

Where
$l$ = lower limit of the median group
$h$ = width of the class interval
$f$ = frequency of the median group
$n$ = sum of all the frequencies
$c$ = cumulative frequency preceding the median group

Example:

Class Interval    Frequency
0-2               2
2-4               4
4-6               3
6-8               5
8-10              1
                  n = 15

First calculate n/2:

15/2 = 7.5
The group whose cumulative frequency first reaches or exceeds 7.5 is called the median group.

Class Interval    Frequency    Cumulative Frequency
0-2               2            2
2-4               4            6     (cumulative frequency preceding the median group)
4-6               3            9     (median group)
6-8               5            14
8-10              1            15

Median = 4 + (2/3)(7.5 - 6)
= 4 + 3/3
= 4 + 1
= 5
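The search for the median group and the interpolation inside it can be sketched as follows. This is an illustrative implementation of the usual grouped-median formula l + (h/f)(n/2 - c); the function name and tuple layout are assumptions made for this example.

```python
def grouped_median(classes):
    """Median of grouped data: l + (h / f) * (n / 2 - c).

    classes is a list of (lower, upper, frequency) tuples in ascending order.
    """
    n = sum(f for _, _, f in classes)
    cumulative = 0  # cumulative frequency preceding the current class (c)
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= n / 2:
            # This is the median group: interpolate inside it.
            return lower + (upper - lower) / f * (n / 2 - cumulative)
        cumulative += f
    raise ValueError("no observations")


example = [(0, 2, 2), (2, 4, 4), (4, 6, 3), (6, 8, 5), (8, 10, 1)]
print(round(grouped_median(example), 2))  # 5.0
```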
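Quartile:
If the data are divided into four equal parts, the values at the points of division are called quartiles.

Q1, Q2, Q3 (Q2 is the median; Q4 corresponds to the upper end of the data)

For grouped data, by analogy with the median:

$Q_k = l + \frac{h}{f}\left(\frac{kn}{4} - c\right), \quad k = 1, 2, 3$

where $l$, $h$, $f$ and $c$ refer to the group containing the $k$th quartile.

Decile: If the data are divided into 10 equal parts, the values at the points of division are called deciles.
D1, D2, ..., D10

$D_k = l + \frac{h}{f}\left(\frac{kn}{10} - c\right)$

Percentile: If the data are divided into 100 equal parts, the values at the points of division are called percentiles.
P1, P2, ..., P100

$P_k = l + \frac{h}{f}\left(\frac{kn}{100} - c\right)$

• What is the difference between percentile and percentage?

Percentile: When the data are divided into 100 equal parts, each point of division is called a percentile.
Percentage: It is a proportion: how many out of 100.
3% of the data lie below the 3rd percentile and 97% lie above it.
4% of the data lie below the 4th percentile and 96% lie above it.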
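Mode:
The mode is the value in a data set or distribution which occurs most frequently. It is the most popular value in the data (the highest point on the frequency polygon curve).
For un-grouped data:
Example:
60, 62, 61, 60, 69, 68, 60

Mode = 60

For grouped data:

$\text{Mode} = l + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h$

Where
$l$ = lower limit of the mode group
$f_m$ = frequency of the mode group
$f_1$ = frequency of the group preceding the mode group
$f_2$ = frequency of the group following the mode group
$h$ is the class interval
Example:

Class Interval    Frequency
10-20             5
20-30             7
30-40             11
40-50             16    (mode group)
50-60             10

Mode = 40 + (16 - 11) / [(16 - 11) + (16 - 10)] x 10
= 40 + (5/11) x 10
= 40 + 4.5
= 44.5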
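A short Python sketch of the grouped mode, assuming the standard interpolation formula l + (fm - f1) / ((fm - f1) + (fm - f2)) x h. The function name and the tuple layout are illustrative; with the example table it gives about 44.5.

```python
def grouped_mode(classes):
    """Mode of grouped data: l + (fm - f1) / ((fm - f1) + (fm - f2)) * h.

    classes is a list of (lower, upper, frequency) tuples in ascending order.
    Neighbouring frequencies outside the table are taken as 0.
    """
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))  # first class with the highest frequency
    lower, upper, fm = classes[i]
    f1 = freqs[i - 1] if i > 0 else 0
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0
    denominator = (fm - f1) + (fm - f2)
    if denominator == 0:
        # Flat neighbourhood: fall back to the mid point of the mode group.
        return (lower + upper) / 2
    return lower + (fm - f1) / denominator * (upper - lower)


example = [(10, 20, 5), (20, 30, 7), (30, 40, 11), (40, 50, 16), (50, 60, 10)]
print(round(grouped_mode(example), 1))  # 44.5
```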

Measures of dispersion (measures of variability)

Dispersion means the variation of the individual values about the central point, or the spread of the data.

                      Data set A       Data set B
Values                4, 5, 6, 6, 9    1, 2, 6, 6, 15
Mean, median, mode    6                6
Range                 9 - 4 = 5        15 - 1 = 14

The two data sets A and B have the same mean, median and mode. Despite these similarities the two data sets are obviously different, so describing a data set in terms of measures of central tendency alone is clearly inadequate. They differ in their variability: the extent to which their scores (individual values) are clustered together or scattered about.

Measures of dispersion include
1) Range
2) Quartile deviation
3) Mean deviation
4) Variance
5) Standard deviation
Range: Difference between the maximum and the minimum values
Un-grouped data:
$R = x_{max} - x_{min}$
Where
$x_{max}$ = largest value, $x_{min}$ = smallest value

Example:
60, 65, 68, 70, 72
R = 72 - 60 = 12

For grouped data:

Class Interval    Frequency
10-20             2
20-30             5
30-40             7
40-50             6
50-60             4
60-70             1

Range = upper limit of the last class interval - lower limit of the 1st class interval
R = 70 - 10
= 60

Quartile Deviation:

$Q.D = \frac{Q_3 - Q_1}{2}$

Q3 = third quartile
Q1 = first quartile

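Mean Deviation:

$M.D = \frac{\sum |x - \bar{x}|}{n}$

The sum of all the deviation scores from the mean (taken with their signs) is always zero, so absolute values are used.

Example:
2, 4, 6, 8, 10

$\bar{x}$ = (2 + 4 + 6 + 8 + 10) / 5
= 30/5
= 6

x     x - x̄          |x - x̄|
2     2 - 6 = -4     4
4     4 - 6 = -2     2
6     6 - 6 = 0      0
8     8 - 6 = 2      2
10    10 - 6 = 4     4
      ∑ = 0          ∑ = 12

| | = absolute value, i.e. ignoring signs

M.D = 12/5

= 2.4
For grouped data:

$M.D = \frac{\sum f|x - \bar{x}|}{\sum f}$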
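The range and the mean deviation for un-grouped data can be sketched in Python as follows. The function names are illustrative; the data are the examples used in the notes.

```python
def value_range(values):
    """Range: maximum value minus minimum value."""
    return max(values) - min(values)


def mean_deviation(values):
    """Mean deviation: average of the absolute deviations from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


data_set_a = [4, 5, 6, 6, 9]
data_set_b = [1, 2, 6, 6, 15]
print(value_range(data_set_a), value_range(data_set_b))  # 5 14
print(mean_deviation([2, 4, 6, 8, 10]))                  # 2.4
```

Data sets A and B share the same mean, yet the second has a much larger range, which is exactly the difference a measure of dispersion is meant to capture.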

Standard Deviation:
It is the positive square root of the mean of the squared deviations from the mean.
It measures variation about the central point.
It is the square root of the variance.

Variance: The mean of the squares of all the deviation scores in the distribution is called the variance.
For un-grouped data:

$\text{Variance} = \frac{\sum (x - \bar{x})^2}{n}$

$S.D = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$

Deviation score = $x - \bar{x}$
The variance is small if the majority of the values are close to the mean.
The variance is large when the majority of the values are far from the mean.
Example:
Standard deviation of un-grouped data
The IQs of 10 students turn out to be as follows
115, 140, 133, 125, 120, 126, 136, 124, 132, 129

Use this formula if the number of values is greater than 30:

$S.D = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$

When the sample size is less than 30, the formula for S.D is

$S.D = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Here n = 10 and $\bar{x}$ = 1280/10 = 128.

x      x - x̄               (x - x̄)²
115    115 - 128 = -13     169
140    140 - 128 = +12     144
133    133 - 128 = +5      25
125    125 - 128 = -3      9
120    120 - 128 = -8      64
126    126 - 128 = -2      4
136    136 - 128 = +8      64
124    124 - 128 = -4      16
132    132 - 128 = +4      16
129    129 - 128 = +1      1
       ∑ = 0               ∑ = 512

S.D = √(512 / (10 - 1)) = √(512/9) = √56.89 = 7.54

S.D for grouped data is

$S.D = \sqrt{\frac{\sum f(x - \bar{x})^2}{\sum f}}$
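The IQ example can be verified with a small Python sketch. The helper function is illustrative; `statistics.stdev` from the standard library uses the same n - 1 divisor and is shown only as a cross-check.

```python
import math
import statistics

iq = [115, 140, 133, 125, 120, 126, 136, 124, 132, 129]


def standard_deviation(values, sample=True):
    """Square root of the mean squared deviation from the mean.

    sample=True divides by n - 1 (small samples), sample=False divides by n.
    """
    mean = sum(values) / len(values)
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    divisor = len(values) - 1 if sample else len(values)
    return math.sqrt(sum_of_squares / divisor)


print(round(standard_deviation(iq), 2))                # 7.54
print(round(statistics.stdev(iq), 2))                  # 7.54
print(round(standard_deviation(iq, sample=False), 2))  # 7.16
```

Dividing by n - 1 gives a slightly larger value than dividing by n; the difference shrinks as the sample grows.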
Correlation:
Association: The concurrence of two variables more often than would be expected by chance is called association.
Correlation: A relationship between two or more series or groups is called correlation. Technically it is the interdependence of one group and another.
Positive correlation: If the values of one group increase as the values of the other group increase, the correlation is positive.
Example: relationship of age and weight

Negative correlation: If the values of one group increase as the values of the other group decrease, the relationship is negative. Example: relationship of dose of insulin and blood glucose level.
Zero correlation: If the values of one group increase or decrease while the other group shows no change, there is no relationship (zero correlation).
Correlation is measured by the coefficient of correlation, denoted by "r".
The value of "r" ranges from -1 to +1.

$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$

Example:

X     Y
2     1
4     3
6     5
8     7
10    9

Calculate the correlation coefficient ($\bar{x}$ = 6, $\bar{y}$ = 5)

X     Y    x - x̄    y - ȳ    (x - x̄)²    (y - ȳ)²    (x - x̄)(y - ȳ)
2     1    -4       -4       16          16          16
4     3    -2       -2       4           4           4
6     5    0        0        0           0           0
8     7    2        2        4           4           4
10    9    4        4        16          16          16
                             ∑ = 40      ∑ = 40      ∑ = 40

r = 40 / √(40 x 40)

r = 40/40

= +1 (complete positive correlation)
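The correlation coefficient for the example can be computed with a few lines of Python. The function name is illustrative and the formula assumed is the usual product-moment (Pearson) coefficient.

```python
import math


def pearson_r(xs, ys):
    """Sum of products of deviations divided by the square root of the
    product of the two sums of squared deviations."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


x = [2, 4, 6, 8, 10]
y = [1, 3, 5, 7, 9]
print(pearson_r(x, y))        # 1.0
print(pearson_r(x, y[::-1]))  # -1.0
```

Reversing the order of Y makes it fall as X rises, which turns the complete positive correlation into a complete negative one.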

Presentation of data

1. Tables
   a. Simple (e.g. cities and population)
   b. Double or complex
2. Graphs
   a. Frequency polygon
   b. Cumulative frequency polygon
3. Diagrams
   a. Bar chart (simple bar chart, double or multiple bar chart, component bar chart)
   b. Pictogram
   c. Histogram
   d. Pie chart (pie diagram)
4. Special curves
   a. Symmetrical
   b. Skewed

Bar charts:
Merits: They provide a quick glance over the observations and show the interval where the main concentration lies.

[Figure: bar chart of frequency (population) by blood group]

[Figure: simple bar chart of population by city (Lahore, Multan, Karachi)]

Qualitative data:
• Bar charts
• Map diagram
• Pictogram
• Pie diagram

Quantitative data:
1. Frequency polygon
2. Cumulative frequency polygon
3. Line diagram
4. Scatter diagram
5. Histogram

Pictogram:

Consists of pictures or small symbolic figures, each representing a fixed number of units.

[Figure: pictogram of the populations of City A, City B and City C. Key: one symbol = population of 10,000]

Histogram:
It consists of a set of rectangles whose bases are marked by the class intervals along the X-axis and whose heights are proportional to the frequencies of the respective classes (closely packed, just like cells in histology).

Class Interval    Frequency
0-5               18
5-10              15
10-15             10
15-20             15
20-25             5

[Figure: histogram of the table above; X-axis: height (class intervals 0 to 25), Y-axis: frequency (0 to 20)]

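Pie diagram:

Data are presented as sectors of a circle. The angle of each sector is proportional to its frequency: angle = (frequency / total) x 360°.

For example:

Occupation       Frequency    Angle    %age
Professionals    25           48°      13.30%
Skilled          43           82°      22.87%
Unskilled        120          230°     63.83%
Total            188          360°     100%

[Figure: pie chart with sectors Professionals (48°, 13.3%), Skilled (82°, 22.9%) and Unskilled (230°, 63.8%)]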
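The sector angles and percentages of a pie diagram follow directly from the frequencies: angle = f / total x 360 and percentage = f / total x 100. A minimal sketch, with an illustrative function name and the counts from the occupation table:

```python
def pie_sectors(counts):
    """Return {category: (angle in whole degrees, percentage to 2 decimals)}."""
    total = sum(counts.values())
    return {
        name: (round(360 * f / total), round(100 * f / total, 2))
        for name, f in counts.items()
    }


occupations = {"Professionals": 25, "Skilled": 43, "Unskilled": 120}
sectors = pie_sectors(occupations)
for name, (angle, pct) in sectors.items():
    print(f"{name}: {angle} degrees, {pct}%")
```

Because each angle is rounded separately, the rounded angles of other data sets may not add up to exactly 360; here they do (48 + 82 + 230).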

Frequency Polygon:

It is a graphical plot of the frequencies against the mid-points of the class intervals.

It is made by connecting the mid-points of the tops of the rectangles in a histogram.

Class Interval    Frequency
0-5               18
5-10              15
10-15             10
15-20             15
20-25             5

[Figure: frequency polygon of the table above; X-axis: class intervals (0 to 25), Y-axis: frequency]

Line Diagram:

A line diagram is used to show trends of events with the passage of time.

[Figure: line diagram of cases of malaria by year; X-axis: time (years 2000 to 2004), Y-axis: cases]

Scatter diagram:

A scatter diagram shows the relationship between two variables.

[Figure: scatter diagram of temperature and weekly deaths due to respiratory infections; X-axis: mean weekly temperature (26 to 48), Y-axis: no. of deaths (100 to 600)]

Each plotted point represents one observation. The diagram shows how well a straight line, called the "line of best fit", could fit the plotted points.

Different types of frequency curves:

1) Symmetrical distributions or curves:

A curve is said to be symmetrical if values equidistant from the central point have the same frequency.
The curve can be folded along the central point in such a way that the two halves of the curve coincide.

[Figure: symmetrical curves: normal curve and bimodal curve]

2) Skewed distributions or curves:

A curve is said to be skewed when it departs from symmetry. Here the frequencies tend to pile up at one end or the other end of the distribution. In a negatively skewed curve the mean typically lies to the left of the median and mode; in a positively skewed curve it typically lies to their right.

[Figure: types of skewed curves: negatively skewed and positively skewed curves with the positions of the mode, median and mean marked; J-shaped and L-shaped curves]

Sampling:
Population or Universe: The universe is a "defined whole" about which information is desired. It may consist of individuals, families, households, objects etc.

Sample: A subset or part of the population is a sample from that population.

The process of selecting the sample is called sampling.

Advantages of Sampling:

• By drawing a sample from a large population, the cost of the study is reduced considerably.
• The study can be done with greater speed, i.e. in less time, and with greater accuracy.
• Drawing a sample can increase the scope of a study, since several aspects can be studied.
• Some tests which are not possible in the whole population can be applied to a smaller group, and important information can be obtained.
• The study can be in greater depth and more suitable.
• Study of the entire universe is unnecessary and is not cost effective.

Sampling Units:
These form the basis of sampling procedures. They are the breakdown of the population into smaller parts which are distinct, unambiguous and non-overlapping, so that each element of the population belongs to one and only one sampling unit, e.g. individuals who are 15-20 years of age, households, schools, industrial units etc.
The list of all individuals, households, schools, industries etc. is called the sampling frame.
For example: 20 households, selected randomly out of a total of 100 houses, had to be sprayed with an anti-malarial spray. The list of all 100 households formed the sampling frame, while each household was a sampling unit.
An element is the item under study, e.g. blood pressure, cholesterol.

Representative sample:
A representative sample is one from which valid inferences can be drawn regarding the population parameters. Parameters are the summary values of the universe under study. The values computed from samples are statistics, which help in inferring the parameters.
A sample is representative if it closely resembles the population from which it is drawn.
Types of sampling techniques:

There are two major types of sampling procedures

Probability:
• Simple random
• Systematic
• Cluster
• Stratified

Non-probability:
• Convenience
• Purposive
• Quota sampling
• Snowball sampling

Probability sampling:
Each element in the population has a known chance of being included in the sample.
Simple random: An important sampling technique in which each sampling unit of a population has an equal probability of being included in the sample.
Procedure:
1) Prepare a sampling frame listing all the units
2) Decide on the number to be selected (sample size)
3) Select the required number of units by
• Drawing lots ("lottery method") when the sample is small
• Using random number tables, especially if the sample size is large

Systematic sampling: A pre-determined system is used for this type of sampling. Steps are as follows:
1) List the total number of units in the population (N)
2) Decide the sample size (n)
3) Calculate the sampling ratio (interval) k = N/n (example: a sample of 100 out of 1000 gives k = 1000/100 = 10)
4) Select the first unit randomly from the first k units (here the first ten) and then take every k-th unit thereafter, i.e. every 10th. When no complete sampling frame is available, the same rule can be applied to every k-th patient visiting a clinic or OPD.
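Both procedures can be sketched with Python's standard `random` module. The function names and the `seed` parameter are illustrative, and the sampling frame is assumed to be a simple list of unit identifiers.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Lottery method: every unit has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


def systematic_sample(frame, n, seed=None):
    """Random start within the first k units, then every k-th unit (k = N // n)."""
    k = len(frame) // n
    rng = random.Random(seed)
    start = rng.randrange(k)  # position of the first unit, 0 .. k - 1
    return [frame[start + i * k] for i in range(n)]


frame = list(range(1, 1001))  # sampling frame: units numbered 1 to 1000
systematic = systematic_sample(frame, 100, seed=1)
lottery = simple_random_sample(frame, 100, seed=1)
print(systematic[:5])  # five units spaced 10 apart
print(len(lottery))    # 100
```

Passing a seed only makes the draw repeatable for checking; in a real study the random start would be left unseeded.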

Cluster sampling: Selection is made of clusters or groups such as mohallas, buildings, villages, housing units etc. Each cluster is treated as a single unit in the selection process. One can randomly select only a few clusters.

Example: We want to select 100 students from the medical colleges of Lahore.

It is more economical to select a random set of groups or clusters, such as a random set of 10 medical colleges, and then interview 10 students in each of those 10 medical colleges.

Stratified sampling: Sometimes the population contains highly variable material and a simple random sample fails to represent it. The population is then divided into a number of groups (strata) of units in such a way that the units within each group are as similar as possible. Simple random sampling is then applied within each group. The process of dividing the population is called stratification and the groups are called strata. This technique is called stratified random sampling.
Non-probability sampling: In non-probability sampling, the selection of elements (units) is not based on probability theory; personal judgment plays a role in the selection of the sample. There is no assurance that each element will have the same chance of being included in the sample.
Convenience sampling: The sample is selected in a haphazard fashion, for reasons such as convenience or lower cost.
Quota sampling: Information is collected from a specified number of individuals in each category, e.g. a quota of old and of young people. It is used in market surveys.

Purposive sampling: Sampling is done on the basis of some pre-determined idea (clinical knowledge etc.).

E.g. for a study on eye complications of diabetes mellitus, select patients who have had diabetes for at least 10 years.
Snowball sampling (or chain sampling, chain-referral sampling, referral sampling) is a non-probability sampling technique in which existing study subjects recruit future subjects from among their acquaintances. Thus the sample group appears to grow like a rolling snowball. As the sample builds up, enough data are gathered to be useful for research. This sampling technique is often used in hidden populations which are difficult for researchers to access; example populations would be drug users or sex workers.

Probability sampling                        Non-probability sampling
1) Time consuming                           Less time consuming
2) Minimal bias                             Maximum bias
3) More authentic                           Authenticity debatable
4) Results can be generalized               Results cannot be generalized
   over the population
5) Expensive                                Economical
6) Inconvenient                             Convenient
7) Technically skilled operator required    Less skilled operator required
8) Same chance                              Choice plays a role

Life Table:

A tabular display of life expectancy in which a group of people born on the same day and at the same place is taken, and mortality by age and sex is represented on a hypothetical basis.
It is assumed that 10,000 or 100,000 infants are born on the same day at one place, and their deaths are observed and recorded as they pass through each year of life, at the same rates as experienced at those ages by the general population of the same place during the same time period.

Requirements:
1) Age specific death rates
2) Life expectancy rate
They are taken from records or surveys.
Construction:
Column I: Frequency by age
Column II: No. of persons surviving to that age of life, out of the no. of persons with which the table was started.
Column III: Probability of dying within the given year
Column IV: No. of persons dying in each successive year of age or age group.

Column V: Life expectancy at each age group

Advantages:
1) It is used by life insurance companies (to decide how much premium will be received from a given age onward and what the life expectancy is).
2) It is used by health authorities for long term planning.
3) The no. of years added to life by surgical or medical techniques, or the efficacy of a prevention and control program, can also be determined.
4) It gives guidance on whether or not to continue a program.

Probability (Relative Frequency)

The probability of an event is a quantitative measure of the proportion of all possible equally likely outcomes that are favourable to the event. It is denoted by p.

Probability of an event = Number of favourable outcomes / Number of possible outcomes

Ludo dice: probability of a six = no. of outcomes showing a six / no. of possible outcomes = 1/6

Probabilities are usually expressed as decimal fractions, not as percentages, and always lie between 0 and 1.

Zero = zero probability (the event cannot occur)

1 = absolute certainty

Example: If a fair coin were tossed an infinite number of times, then

Probability of heads = 0.50

Probability of tails = 0.50

If a random sample of 10 people were drawn an infinite number of times from a population of 100 people, then the probability of each person being included in the sample would be

P = 10/100 = 0.10

Example: The mean pulse of a group of normal healthy persons was 72 with an S.D of 72. What is the probability that a male chosen at random would be found to have a pulse of 80 or more?

Standard error:
It is a measure of the extent to which the sample means deviate from the true population mean.

If we draw a large number of random samples of equal size from the same population, the means of these samples will not all be the same because of "sampling error". If we plot the distribution of these sample means, they spread out to form a distribution called the "random sampling distribution of means".

The random sampling distribution of means will always tend to be normal (normal distribution curve) as the sample size increases, irrespective of the shape of the population distribution from which the samples were drawn. This is called the "central limit theorem". The theorem also states that the mean of the random sampling distribution of means ($\mu_{\bar{x}}$) is equal to the mean of the original population ($\mu$).

$\mu_{\bar{x}} = \mu$

The standard deviation of the "random sampling distribution of means" is called the "standard error".

$\text{Standard error } (\sigma_{\bar{x}}) = \frac{\sigma}{\sqrt{n}}$

Where

$\sigma$ = standard deviation of the population

n = sample size

Standard error is inversely related to the square root of the sample size. So the larger the sample size becomes, the more closely the sample mean will represent the true population mean. This is the reason why the results of large studies or surveys are more trusted than the results of small ones.

We study a sample, so $\sigma$ (the population standard deviation) will not be known.

So the standard error cannot be calculated exactly; instead it is estimated using the data available from the sample alone. This is called the "ESTIMATED STANDARD ERROR OF THE MEAN", symbolized by $S_{\bar{x}}$. In many research articles the estimated standard error is simply called the standard error.

$S_{\bar{x}} = \frac{S}{\sqrt{n}}$

Where

S = standard deviation of the sample

n = sample size
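The estimated standard error is easy to compute from a sample. The sketch below uses Python's standard `statistics.stdev` (which divides by n - 1) for the sample standard deviation; the function name is illustrative and the data are the ten IQ scores used earlier in these notes.

```python
import math
import statistics


def estimated_standard_error(sample):
    """Sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


iq = [115, 140, 133, 125, 120, 126, 136, 124, 132, 129]
print(round(estimated_standard_error(iq), 2))  # 2.39
```

With the same standard deviation, a sample four times as large would halve the standard error, since the divisor is the square root of n.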

Tests of significance:

The information obtained from the sample is used to make decisions about the
population. For example, on the basis of a sample we may be required to decide
whether a certain drug is effective in curing a particular disease or not, OR a
medical researcher might be required to decide on the basis of experimental
evidence whether a certain vaccine is superior to another which is already in the
market. To do so we use certain rules and procedures. These are called “TESTS OF
SIGNIFICANCE” or “TESTS OF HYPOTHESIS”.

For example: Chi-Square test, Student t-test, Z test etc.


Null hypothesis:

It is a statement which is to be tested for possible rejection. It is denoted by “H0”.
It is also called the hypothesis of no difference.

For example: H0 = No difference between teaching method A and B.

Alternate Hypothesis: It is the competing statement; if we reject the Null
hypothesis then we accept the alternate hypothesis. It is denoted by H1 or HA.

HA = There is a difference between teaching method A and B.


Choosing tests of significance/Co-relational technique

Question concerning | Nominal data | Ordinal data | Interval or ratio scale data
Difference between two proportions | Chi-square | - | -
One or two means (What is the true mean of the population? Is one sample mean significantly different from the other sample mean?) | - | - | t test, or Z-test if sample is more than 100
More than two means (Is one sample mean significantly different from more than one other sample mean?) | - | - | ANOVA with F-test
Variances (Are the variances in two samples significantly different?) | - | - | F-test
Association (To what degree are two variables correlated?) | - | Spearman ρ | Pearson r
Predicting the value of one variable on the basis of another variable | - | - | Regression

Level of significance:

It is denoted by α. It is the probability of rejecting the Null hypothesis when it is
true. It is a pre-determined small value. Commonly used values are:

α = 0.05

α = 0.01

α = 0.1

α = 0.05: It means there is a 5% chance of rejecting the Null hypothesis when it
is true, and we are 95% confident of accepting the Null hypothesis when it is true.

α = 0.01: Only a 1% chance of rejecting the null hypothesis when it is true, and we
are 99% confident that we accept the Null hypothesis when it is true.
P value cut-off (α)

Accepted probability of error in decisions

Conventionally accepted as 5%, or 0.05 as a fraction

It can be set below or above 0.05 depending upon the study and sample

The cut-off is also known as the α error, the type I error probability or the
significance level

On a two-tailed normal curve it is α/2 = 0.025 on each side

Confidence level

It is equal to 1 - α (0.95 or 95%)

It is the probability of making a correct decision when the null hypothesis is
true (not rejecting it)
[Figure: two-tailed normal curve. The central 95% is the region of acceptance; 2.5% in each tail is the region of rejection.]

[Figure: one-tailed normal curve. 95% is the area of acceptance; 5% (0.05) in one tail is the region of rejection.]

Regions of acceptance and rejection:

Possible results of a sampling experiment can be divided into two groups.

1) Results leading us to accept the hypothesis


2) Results leading us to reject the hypothesis
When the calculated value of the test of significance falls in the region of
rejection, we reject the Null hypothesis. When the calculated value of the test
of significance falls in the region of acceptance, we accept (that is, fail to
reject) the Null hypothesis.
[Figure legend: region of rejection; region of acceptance]
Type I Error and Type II Error:

In the theory of testing hypotheses, two types of error can be committed.
1) We reject the hypothesis when it is true (type I error or α-error)
(Probability of rejecting the null hypothesis when it is in fact true)
2) We accept the hypothesis when it is false (type II error or β-error)
(Probability of not rejecting the null hypothesis when it is in fact false)

K is the observed value (sample value)


General procedures of testing hypothesis:

It involves the following 6 steps:

1) State the Null hypothesis and the Alternate hypothesis
2) Decide the level of significance (α)
Normally 5% is used
3) State the test statistic to be used. Important tests of significance are
a) Z-test (if n>30)
b) Student t-test
c) Chi-Square test etc.
4) Compute the value of the test statistic (test of significance) in order to
decide whether to reject or accept the null Hypothesis.
5) If the result falls in the region of rejection, we reject the Null hypothesis.
If the result falls in the region of acceptance, we accept the Null
hypothesis.
6) State the conclusion in words.
Chi-square test:
Example: Is there any relationship between sex and smoking?
Suppose we select a random sample of 100 people and find the
results as follows.

               Male      Female    Total
Smoking        30 (a)    10 (b)    40
Non smoking    20 (c)    40 (d)    60
Total          50        50        100

To test this question, we shall state the Null hypothesis.

H0 (Null hypothesis) = There is no relationship between sex and smoking.
(no difference of smoking by sex)

HA (alternate hypothesis) = There is a relationship between sex and smoking.
(There is a difference of smoking by sex)

E value = (horizontal total × vertical total) / Grand total

E value of cell (a) = (40 × 50) / 100 = 20

E value of cell (b) = (40 × 50) / 100 = 20

E value of cell (c) = (60 × 50) / 100 = 30

E value of cell (d) = (60 × 50) / 100 = 30

Chi-Square calculation

O values   E values   O - E   (O - E)²   (O - E)² / E
30         20          10     100        100/20 = 5.00
10         20         -10     100        100/20 = 5.00
20         30         -10     100        100/30 = 3.33
40         30          10     100        100/30 = 3.33
                                         ∑ = 16.66

Choose level of significance (α)

(α) = 5%

Then the test statistic is

Chi-Square (χ²) = ∑ [ (O - E)² / E ]

Where

O = observed values
E = Expected values
∑ = summation
Calculated value of chi-square is 16.66

Degree of freedom = (row - 1) (column - 1)
df = (r - 1) (c - 1)
= (2 - 1) (2 - 1)
= 1
Table value of chi-square at 1 degree of freedom and (α) = 5% is 3.84

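[Figure: chi-square distribution. The region of acceptance lies below the table value 3.84; the calculated value 16.66 lies beyond it, in the region of rejection.]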
The calculated value of chi-square is greater than the table value, so it falls in
the region of rejection. So we reject the Null hypothesis and accept the
alternate hypothesis.
P-value is < 0.05
Conclusion:
There is a significant relationship between sex and smoking
(There is a statistically significant difference of smoking by gender)
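
The chi-square working above can be reproduced in a few lines of code. This is a minimal sketch using only the Python standard library; the variable names are illustrative, and the P-value shortcut with `math.erfc` is valid for 1 degree of freedom only.

```python
import math

observed = [[30, 10],   # smoking:     male, female
            [20, 40]]   # non smoking: male, female

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # E value = (horizontal total x vertical total) / grand total
        e = row_totals[i] * col_totals[j] / grand_total
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)

# For 1 degree of freedom only: P(chi-square > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_square / 2))

print(round(chi_square, 2), df, p_value < 0.05)
```

Without intermediate rounding the statistic comes out at about 16.67; the 16.66 in the hand calculation arises from adding terms already rounded to 3.33. Either way it is far above the table value of 3.84.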

If P is less than or equal to 0.05 it is regarded as statistically significant, if we choose (α) = 5%.

P-value ≤ 0.05: We reject the null hypothesis (calculated value of test of
significance falls in the region of rejection) and accept the alternate hypothesis.
The result was unlikely to have occurred by chance. The difference is not merely due
to chance but statistically significant (significant difference). It means that, if the
null hypothesis were true, the likelihood of results (difference) this large having
occurred by chance is 0.05 or less. However, a risk of type I error remains: when
the null hypothesis is in fact true, it will still be rejected in up to 5% of such tests.

P-value > 0.05: We accept (fail to reject) the null hypothesis (calculated value of
test of significance falls in the region of acceptance) and reject the alternate
hypothesis. Chance cannot be excluded as an explanation of the results; the
difference between the sample mean (X̄) and the hypothesized population mean
(μhyp) is insignificant, and so the results (difference) are insignificant.

Student t-test:

t = (X̄ - μhyp) / Sx̄

Where

X̄ = Mean of sample

μhyp = Hypothesized population mean (Stated population Mean)

Sx̄ = Estimated standard error

Example: The Principal of a medical college states that the college’s students are
a highly intelligent group with an average IQ of 135. This claim constitutes a
hypothesis that can be tested.

H0: μ = μhyp = 135 (Population mean is equal to the hypothesized (stated) mean of 135)

(There is no difference between sample and population being compared.)

HA: μ ≠ 135 (Population Mean is not equal to 135)

Choose level of significance (α)

Let’s choose (α) = 5%

We take a sample of 10 students. Their IQs turned out to be as follows.

115, 140, 133, 125, 126, 136, 124, 132, 129, 120

Sample Mean (X̄) = 128

Calculating estimated standard error of sample

Sx̄ = S / √n

Where

S = standard deviation of the sample

n = Sample size

X      X - X̄              (X - X̄)²
115    115 - 128 = -13    169
140    140 - 128 = +12    144
133    133 - 128 = +5     25
125    125 - 128 = -3     9
120    120 - 128 = -8     64
126    126 - 128 = -2     4
136    136 - 128 = +8     64
124    124 - 128 = -4     16
132    132 - 128 = +4     16
129    129 - 128 = +1     1
                          ∑ = 512

S.D = √( ∑(X - X̄)² / (n - 1) ) = √( 512 / (10 - 1) )

= 7.542 This is the standard deviation of the Sample

Sx̄ = S / √n

= 7.542/√10

= 2.385

t = (X̄ - μhyp) / Sx̄

= (128 - 135) / 2.385

= -2.935

Degree of freedom for t-test is

d.f = n - 1

= 10 - 1 = 9

The table value of t at d.f = 9 and P = 0.05 (two-tailed) is 2.262

So:

[Figure: two-tailed t distribution. The region of acceptance lies between t-crit = -2.262 and t-crit = +2.262 (table values); the calculated value t-calc = -2.935 lies in the left region of rejection.]

The calculated value falls in the region of rejection.

So

We reject the Null hypothesis and accept the alternate Hypothesis.

Conclusion:

μ ≠ 135 (not equal to 135)

So the claim that the medical college students have an average IQ of 135 is
rejected: the mean IQ of the students differs significantly from 135.
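
The same one-sample t-test can be sketched in code to check the arithmetic. This is an illustrative example using only the Python standard library; the variable names are assumptions, and the critical value 2.262 is simply copied from the t-table quoted above rather than computed.

```python
import math
import statistics

iq = [115, 140, 133, 125, 126, 136, 124, 132, 129, 120]
mu_hyp = 135      # hypothesized (stated) population mean
t_crit = 2.262    # table value for d.f = 9, two-tailed, alpha = 0.05

n = len(iq)
sample_mean = statistics.mean(iq)
s = statistics.stdev(iq)          # sample S.D, divides by n - 1
se = s / math.sqrt(n)             # estimated standard error
t = (sample_mean - mu_hyp) / se

# Two-tailed decision: reject H0 when |t| exceeds the table value
reject_null = abs(t) > t_crit
print(sample_mean, round(s, 3), round(se, 3), round(t, 3), reject_null)
```

The values agree with the hand calculation: mean 128, S.D 7.542, estimated standard error 2.385 and t = -2.935, which lies beyond -2.262, so the null hypothesis is rejected.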

z- test:

z = (X̄ - μ) / σx̄

Where

X̄ = Mean of sample

μ = Mean of population

σx̄ = Standard error

Study questions

Is the mean height of girls and boys different?

Example: The Principal of a medical college states that the college’s students’
average height is 65 inches. This claim constitutes a hypothesis that can be tested.

H0: μ = μhyp = 65 inches (Population mean is equal to the hypothesized mean of 65 inches)

(There is no difference between sample and population being compared.)

(There is no difference between stated mean and estimated mean)

HA: μ ≠ 65 inches (Population Mean is not equal to 65 inches)

(There is a significant difference between sample and population being compared.)

(There is a significant difference between stated mean and estimated mean)

Example: The Principal of a medical college states that the mean height of boy
students is the same as the mean height of girl students. This claim constitutes a
hypothesis that can be tested.

H0: μboys = μgirls (There is no difference between mean height of Boys and Girls students.)

HA: μboys ≠ μgirls (There is a significant difference between mean height of Boys and Girls
students.)
Concept of normal distribution curve
Almost all physiological variables in nature are distributed in such a way that
majority of the values lie around mean and very few values appear on the
extremes. (normal distribution pattern)

Any curve which is smooth, symmetrical and bell shaped is called “normal curve”.

[Figure: normal curve, with mean, median and mode at the same central point]

However there is one standard normal curve. The characteristics of this curve are:

1) Bell shaped
2) Mean, Median, Mode all are at the same point
3) Mean is zero (and standard deviation is one)
4) Tails go to infinity
5) Area of the curve is equal to 1 (or 100%)
6) Mean ± 1 S.D = 68.3%

Mean ± 2 S.D = 95.4%

Mean ± 3 S.D = 99.7%

7) This curve is the basis of the theory of probability.
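
The areas quoted in point 6 can be verified numerically. The sketch below is illustrative; the helper name `area_within` is an assumption, and it relies on the standard identity that the proportion of a normal distribution within k S.D of the mean equals erf(k / √2).

```python
import math

def area_within(k):
    """Proportion of a normal distribution lying within k S.D of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * area_within(k), 1))
```

The three percentages come out at about 68.3, 95.4 and 99.7, matching the list above.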

Confidence limit OR confidence Interval:

The values 1 S.D or 2 S.D on each side of the mean are called confidence limits. The
interval between them is called the confidence interval. We can say with confidence
that 68% of the distribution lies within approximately 1 S.D of the Mean, and about
95% within 2 S.D.

Life Table:

Tabular display of life expectancy in which a group of people born on the same
day and at the same place is taken, and mortality in respect of age and sex is
represented on a hypothetical basis.
It is assumed that 10,000 or 100,000 infants are born on the same day at a
place, and their deaths are observed and recorded as they pass through each
year of life, dying at the same rates as experienced at those ages by the general
population of the same place during the same time period.

Requirements:
1) Age specific death rates
2) Life expectancy rate
They are taken from record of survey.
Construction:
Column I: Frequency by age
Column II: No of persons surviving that age of life, or of no. of persons
on which table was started.
Column III: Probability of dying within the given year
Column IV: No of persons dying in successive years of age in each
successive age group.

Column V: Life expectancy at each age group

Advantages:
1) It is used by life insurance companies (how much premium will be
received from a given age onward, and what the life expectancy is)
2) It is used by health authorities for long term planning
3) The number of years added to life by surgical or medical techniques, or the
efficacy of a prevention and control program, can also be determined.
4) It gives guidance on whether or not to continue a program
Detailed calculations and column definitions
A standard abridged life table is presented in Example 2. This section goes through
the calculation of each individual column.

Width of the interval (n)

The number of years in each age interval. For example for the <1 age group n = 1, for
the 1-4 age group n = 4 and for all other age groups including 85+ n = 5.

Average proportion of the year lived by those who die (nax)

Usually it is assumed that death occurs uniformly across time and that on average
people will live 0.5 of the interval before death. However, there are some cases
where we know that death does not occur uniformly across time within age groups.
For example, for those aged under 1 we assume that the average proportion of the
year lived by those who die is 0.1.

The probability of dying (nqx)

nqx = (Number of years in interval × age-specific death rate) /
(1 + number of years in interval × (1 - average proportion of year lived by those
who die) × age-specific death rate)

or

nqx = (n × nMx) / (1 + n × (1 - nax) × nMx)

The probability of surviving (npx)

npx = 1 - probability of dying

or

npx = 1 - nqx

Number of persons alive at the start of the interval (lx)

This is a hypothetical population, in this case 100,000 alive/born at age 0. Those
alive at age 1-4 in this case are:

lx = Probability of surviving the previous interval × population alive at start of
previous interval

or

lx = lx-n × npx-n

Number of deaths during interval (ndx)

ndx = Population alive at start of interval - population alive at start of next interval

or

ndx = lx - lx+n

Number of person years lived through the interval (nLx)

nLx = Number of years in interval × (number of persons alive at start of next
interval + average proportion of year lived by those who die × number of deaths
during interval)

or

nLx = n × (lx+n + nax × ndx)

At age 85+ everybody dies during the interval so an adjustment has to be made.
Whatever is used as an estimate of the number of years lived has little impact on
overall life expectancy, however, it is usual to use the following estimate:

L85+ = l85 / M85+

Total number of person years lived after the interval (Tx)

This is the ‘number of person years lived through the interval’ column summed from
the bottom.

or

Tx = Tx+n + nLx
Expectation of life (ex)

This is the number of further years a person aged x can be expected to live.

ex = Total number of person years lived after the interval /
Number of persons alive at the start of the interval

or

ex = Tx / lx
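
The column definitions above can be chained together in a short program. The following is a sketch under stated assumptions, not a reference implementation: the function name `life_table`, the three-interval toy schedule and its death rates are hypothetical and chosen only to keep the example small (a real abridged table would run from <1 to 85+ with observed age-specific death rates).

```python
def life_table(intervals, radix=100_000):
    """Build life-table columns from (label, n, nMx, nax) tuples.

    The last tuple is treated as the open-ended age group, where everybody
    dies and person years are estimated as lx / Mx.
    """
    rows = []
    lx = float(radix)
    last = len(intervals) - 1
    for i, (label, n, m, a) in enumerate(intervals):
        if i == last:
            q = 1.0                      # everybody dies in the open interval
            d = lx
            person_years = lx / m        # L85+ = l85 / M85+
        else:
            q = n * m / (1 + n * (1 - a) * m)          # nqx
            d = lx * q                                 # ndx
            person_years = n * ((lx - d) + a * d)      # nLx
        rows.append({"age": label, "lx": lx, "nqx": q,
                     "ndx": d, "nLx": person_years})
        lx -= d                          # survivors start the next interval
    total = 0.0
    for row in reversed(rows):           # Tx: nLx summed from the bottom
        total += row["nLx"]
        row["Tx"] = total
        row["ex"] = total / row["lx"]    # ex = Tx / lx
    return rows

# Hypothetical death rates, for illustration only
toy = [("<1", 1, 0.010, 0.1), ("1-4", 4, 0.001, 0.5), ("5+", 5, 0.020, 0.5)]
for row in life_table(toy):
    print(row["age"], round(row["lx"]), round(row["nqx"], 5), round(row["ex"], 2))
```

Each row is computed from the survivors of the previous row, and the Tx and ex columns are filled in a second pass from the bottom of the table upward, mirroring the order of the definitions.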
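Z-score: The location of any element in a normal distribution can be
expressed in terms of how many S.D it lies above or below the mean of
the distribution.

z = (X - μ) / σ

Where X = value of the element, μ = mean of the distribution and
σ = standard deviation of the distribution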
Table of Z-score:
Table of z-score states what proportion of any normal distribution lies
above or below any given Z-score.

Z       (a) Area between mean and z    (b) Area beyond z
0.00    0.0000                         0.5000
0.01    0.0040                         0.4960
0.02    0.0080                         0.4920
:       :                              :
1.00    0.3413                         0.1587
:       :                              :
2.00    0.4772                         0.0228
:       :                              :
3.00    0.4987                         0.0013


Example:-
Resting Heart Rate of 4th Year MBBS students of SIMS is
normally distributed with a Mean of 70 and SD of 10.
a) What is the location of an element of 85 beats/min?
b) What proportion of the population has a heart rate above 85 beats/min?
c) What is the location of an element of 65 beats/min?
d) What proportion of the population has a heart rate below 65 beats/min?

ANSWERS

a) (85 - 70) / 10 = +1.5 SD (1.5 SD above the mean)

b) Table value from Z-table (area beyond 1.5 SD is 0.0668)
= 6.68% (6.68% of the population have a heart rate above 85
beats/min, or we can say the probability of one randomly selected
person from this population having a heart rate above 85
beats/min is 6.68% or 0.0668)
c) (65 - 70) / 10 = -0.5 SD (0.5 SD below the mean)
d) Table value from Z-table (area beyond 0.5 SD is 0.3085)
= 30.85% (30.85% of the population have a heart rate below 65
beats/min, or we can say the probability of one randomly selected
person from this population having a heart rate below 65
beats/min is 30.85% or 0.3085)

e) What heart rate divides the fastest beating 5% of the
population from the remaining 95%?

Answer e) We need the Z-score that divides the top 5% of the area under
the curve from the remaining area (95%)
= nearest figure to 5% (0.05) in the "area beyond z" column of the Z-table = 0.0495
0.0495 corresponds to a z-score of 1.65
The corresponding heart rate therefore lies 1.65 S.D above the
mean.
= Mean + 1.65 SD
= μ + 1.65 σ
= 70 + 1.65 × 10
= 86.5 beats/min
A heart rate of 86.5 beats/min divides the fastest beating 5% of the
population from the remaining 95%, or we conclude that the
fastest beating 5% of this population has a heart rate above
86.5 beats/min.
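
The table look-ups in this example can be cross-checked with the `NormalDist` class from Python's standard `statistics` module. This is an illustrative sketch; the variable names are assumptions.

```python
from statistics import NormalDist

heart_rate = NormalDist(mu=70, sigma=10)

# a) and c): location of an element in S.D units
z_85 = (85 - heart_rate.mean) / heart_rate.stdev
z_65 = (65 - heart_rate.mean) / heart_rate.stdev

# b) and d): proportion of the population beyond those values
p_above_85 = 1 - heart_rate.cdf(85)
p_below_65 = heart_rate.cdf(65)

# e): heart rate that cuts off the fastest beating 5%
top_5_cutoff = heart_rate.inv_cdf(0.95)

print(z_85, round(p_above_85, 4), z_65, round(p_below_65, 4), round(top_5_cutoff, 2))
```

The cut-off computed this way is about 86.45 beats/min, slightly below the 86.5 obtained by hand, because the printed Z-table forces a choice of the nearest tabulated z-score (1.65) whereas the exact value is closer to 1.645.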

Analysis by type of data


Qualitative variables
Frequencies
• Simple frequency
• Relative frequency
• Cumulative frequency
Percentages
• Cumulative percentage
• Rates (Prevalence rate & Incidence rate)
Test of significance
• Chi-square test
Quantitative Variables
A) Central values
• Mean
• Median
• Mode
• 50th percentile
B) Dispersions
• Range
• Mean deviation
• Standard deviation
• Variance
• Percentiles
C)
Statistics:
• Descriptive
• Inferential
