Gynecology
Preamble:
What is Data?
Data are measurable characteristics of a sampling unit (or subject) of a
population that yield information about the population.
Types of Data:
Broadly, data are of two main types: Categorical or Numerical.
Categorical Data:
The simplest type of observation that is made on a subject who comes to
the clinic is the allocation (indeed the classification) of the subject to one of
only two categories that relate to the presence or absence of some
attribute.
Examples:
• Pregnant/Not Pregnant
• Married/Single
• Hypertensive/Normotensive
• Diabetic/Non-Diabetic.
Numerical Data:
There are two main types viz: Discrete and continuous.
Examples: Height, weight, age, body temperature, blood pressure, serum
cholesterol, etc.
Right Censored (Suspended): These are the cases (of life data) in which the
subjects did not fail during the period of observation.
Example: If, out of 8 breast cancer cases, 5 had failed by the end of the experiment, then the
remaining 3 would be regarded as suspended (right censored) data.
Left Censored Data: In this case, failure time is only known to be before a
certain time.
Example: Suppose an experiment scheduled for inspection after 12 hours is
found to have failed before inspection. Thus, what is known is that the
experiment failed sometime before 12 hours (i.e. between 0 and 12 hours) but
not exactly when.
Variable:
• A variable is any attribute, phenomenon or event that can have different
values.
• A variable can be either quantitative or qualitative.
• A quantitative variable describes a characteristic in terms of a numerical
value. The value may vary from subject to subject or from time to time in
the same subject. The value is expressed in units of measurement.
Examples: Height in meters, blood pressure in mmHg, weight in kilograms,
etc.
• A qualitative variable describes the attribute of a characteristic (by
classifying it into categories to which the subject either belongs or does not
belong).
Examples: State of origin, Tribe or Ethnic group, etc.
Continuous Variable:
A variable with a potentially infinite number of possible values in any interval. It
can assume either integral or fractional values and can be measured to
different levels of accuracy. A continuous variable is realized through actual
measurement.
Examples: Weights of babies delivered in a health facility could be 3.14, 2.98,
2.94, 3.10 kg.
Discrete Variable:
A variable that can take only a finite number of values in any interval. The values are invariably whole
numbers, that is, integers. A discrete variable is usually realized through
counting.
Examples: Number of children in a family, number of clinics in a community,
number of children delivered within a given period in a Teaching Hospital, etc.
Example: (Part of Patient’s Form)
Patient’s Number:
Date of Registration:
Date of Birth:
Marital status:
Religion:
Ethnic group/Tribe:
Height (m):
Weight (kg):
Blood pressure: Syst.    Diast.
Number of Pregnancies:
Number of Deliveries:
Number of Abortions:
Examples:
• An investigation of the effects of FGM on complications during delivery
• An investigation of breastfeeding practices among women who registered a
birth in the previous year.
• A study to investigate whether the use of hormonal contraceptives affects
the fertility status of the users.
Steps in the Planning of a Survey
Analysis of Data:
The general methodology for the analysis of data (in O & G) is of two types; viz:
Descriptive and Inferential.
Descriptive Statistics:
Descriptive statistics are the statistical tools for the organization and
summarization of data. They describe a set of data, which eventually provides a
basis for a generalization about a population when only a sample is observed.
Descriptive statistics point up a characteristic of the population being studied;
they simply summarize a mass of data into a few simple ideas.
In data analysis, descriptive statistics are presented in tables which provide
summary statistics for continuous, numeric variables. The summary statistics
include:
• measures of central tendency such as mean, median and mode
• measures of dispersion (spread of the distribution) such as range and
standard deviation (including variance of the distribution)
• measures of distribution such as skewness and kurtosis which indicate
how much a distribution varies from a normal distribution.
Tabular Presentation
This is the presentation of data in tables so as to organize them into a compact
and readily comprehensible form. For example, a frequency distribution table
gives the number of observations at different values or classes of the variable.
Tabular presentation could be handled as:
(a) Single variable frequencies:
• For a qualitative variable (such as the distribution of the state of origin of
100 women who visited the ANC in the past year).
• For a large data set of a quantitative variable requiring grouping of the
data into classes (such as the distribution of the weight of new born babies
in a Teaching Hospital)
(b) Cross-tabulation:
• Two dimensional tables, in which two variables are cross-tabulated (such
as the cross-classification of weight of babies at birth and economic status
of their parents).
• Three-dimensional tables, in which three variables are cross-classified
(such as outcome of treatment by sex and by age group).
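As a rough illustration of the two kinds of table described above, the sketch below builds a single-variable frequency table and a two-way cross-tabulation using only Python's standard library. The ten patient records are invented purely for illustration and do not come from any real clinic.

```python
from collections import Counter

# Hypothetical records: (marital status, blood pressure status) for ten attendees.
records = [
    ("Married", "Hypertensive"), ("Married", "Normotensive"),
    ("Single", "Normotensive"), ("Married", "Normotensive"),
    ("Single", "Hypertensive"), ("Married", "Hypertensive"),
    ("Single", "Normotensive"), ("Married", "Normotensive"),
    ("Married", "Normotensive"), ("Single", "Normotensive"),
]

# (a) Single-variable frequencies: count each category of one variable.
marital_freq = Counter(status for status, _ in records)

# (b) Cross-tabulation: count each combination of the two variables.
cross_tab = Counter(records)

print("Marital status frequencies")
for status, count in sorted(marital_freq.items()):
    print(f"{status:<10}{count:>3}  ({100 * count / len(records):.0f}%)")

print("Marital status by blood pressure status")
for (status, bp), count in sorted(cross_tab.items()):
    print(f"{status:<10}{bp:<14}{count:>3}")
```

A third variable could be handled the same way by counting three-element tuples, which corresponds to the three-dimensional table mentioned above.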
Diagrammatic presentation
Diagrammatic presentation is the use of a diagram to show the distribution of
data. The methods of diagrammatic presentation of data are:
(a) Qualitative or Categorical Data
Pie Charts
A circle is divided into sectors with areas proportional to the frequencies or the
relative frequencies of the categories of the variable.
Bar Charts
The bars are constructed to show the frequency or relative frequency for each
category of the attribute. The bars are usually equal in width. It is important
that the vertical scale should start at zero; otherwise the heights of the bars
will not be proportional to the frequencies.
Measures of Location:
One of the first statistics usually computed for a set of data is a measure of
central tendency such as the Mean, Median and the Mode.
The Mean:
Most frequently used in data analysis. The Mean may be considered as the
center of gravity of the distribution.
Mean:
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{(raw data)} \]
\[ \bar{X} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} \quad \text{(grouped data)} \]
The Median:
It is the point in the distribution with 50% of the measures or scores on each
side of it; that is, it is the midpoint of the distribution. For an even number of
observations, the median occupies the point between the (n/2)th and ((n+2)/2)th
positions when the values of the observations are arranged in order of
magnitude. When the number of observations is odd, the median occupies the
((n+1)/2)th position in the ordered arrangement. For the grouped data case, the
median is estimated by using the expression:
\[ \text{Median} = L_1 + \left( \frac{\frac{n}{2} - C_f}{f_i} \right) C_i \]
Where
L_1 = lower class boundary of the median class
n = number of observations
C_f = cumulative frequency of the class just before the median class
C_i = median class interval
f_i = frequency of the median class
The Mode:
This is simply the value that occurs most frequently in the distribution. For
the grouped frequency case, the Mode is estimated by using the expression:
\[ \text{Mode} = L_1 + \frac{(f - f_b) \times C}{(f - f_a) + (f - f_b)} \]
Where
L_1 = lower class boundary of the modal class
f = modal frequency
f_a = frequency of the class after the modal class
f_b = frequency of the class before the modal class
C = modal class interval
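The grouped-data estimates of the median and the mode can be sketched in a few lines of Python. The class boundaries and frequencies below are hypothetical birth weights chosen only to exercise the formulas, and the mode uses the usual form in which the numerator is the modal frequency minus the frequency of the class before the modal class.

```python
def grouped_median(boundaries, freqs):
    """Median = L1 + ((n/2 - Cf) / fi) * Ci for a grouped frequency table.

    boundaries holds the class boundaries, so it has one more entry than freqs.
    """
    n = sum(freqs)
    cumulative = 0  # Cf: cumulative frequency before the current class
    for i, f in enumerate(freqs):
        if cumulative + f >= n / 2:  # first class to reach n/2 is the median class
            width = boundaries[i + 1] - boundaries[i]
            return boundaries[i] + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("empty frequency table")


def grouped_mode(boundaries, freqs):
    """Mode = L1 + (f - fb) * C / ((f - fa) + (f - fb))."""
    i = max(range(len(freqs)), key=freqs.__getitem__)  # index of the modal class
    f = freqs[i]
    f_before = freqs[i - 1] if i > 0 else 0
    f_after = freqs[i + 1] if i + 1 < len(freqs) else 0
    width = boundaries[i + 1] - boundaries[i]
    return boundaries[i] + (f - f_before) * width / ((f - f_after) + (f - f_before))


# Hypothetical birth weights (kg): classes 2.0-2.5, 2.5-3.0, 3.0-3.5, 3.5-4.0.
boundaries = [2.0, 2.5, 3.0, 3.5, 4.0]
freqs = [5, 12, 20, 8]

median_kg = grouped_median(boundaries, freqs)
mode_kg = grouped_mode(boundaries, freqs)
print(f"median = {median_kg:.4f} kg, mode = {mode_kg:.4f} kg")
```

Here n = 45, so the median class is 3.0 to 3.5 kg with 17 observations below it. The sketch does not guard against a flat table in which the modal class and both neighbours have equal frequencies, where the mode formula is undefined.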
Variance:
This is the mean of the squared differences (deviations) between the mean and
each observed value. It is mathematically expressed as:
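\[ S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1} \quad \text{(raw data)} \]
\[ S^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{X})^2}{\sum_{i=1}^{k} f_i - 1} \quad \text{(grouped data)} \]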
Standard Deviation:
The square root of the variance:
\[ S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}} \]
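For raw data, the measures of location and dispersion described above are available in Python's standard `statistics` module. The sketch below uses five illustrative birth weights (assumed values, not study data) and also recomputes the variance by hand to show that `statistics.variance` uses the n − 1 divisor given in the formula.

```python
import statistics

# Illustrative birth weights in kg (assumed values).
weights_kg = [3.14, 2.98, 2.94, 3.10, 2.98]

mean = statistics.mean(weights_kg)
median = statistics.median(weights_kg)      # middle value of the ordered data
mode = statistics.mode(weights_kg)          # most frequent value
variance = statistics.variance(weights_kg)  # sample variance, divisor n - 1
sd = statistics.stdev(weights_kg)           # square root of the sample variance

# Direct application of the raw-data variance formula, for comparison.
n = len(weights_kg)
variance_by_hand = sum((x - mean) ** 2 for x in weights_kg) / (n - 1)

print(f"mean={mean:.3f} median={median:.2f} mode={mode:.2f}")
print(f"variance={variance:.5f} sd={sd:.5f}")
```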
Inferential Statistics:
Usually when samples are studied, the investigator will be interested in going
beyond the sample and would want to make inference about the population
from which the sample was drawn. Thus, from the knowledge of the
descriptive statistics such as the mean and variance from sample values,
inferences about the same traits in the population are made. The use of
inferential statistics is basic to medical research. The tools of inferential
statistics include: Confidence Intervals, Tests of Hypothesis, Contingency Tables,
Nonparametric Tests, Regression and Correlation analysis, ANOVA, etc.
Confidence Interval:
Confidence Interval combines the features of estimates from a sample with
known properties of the normal distribution to get an idea about the
uncertainty associated with a single sample estimate of the population
parameter. A confidence interval gives a range of values that one can be
confident includes the true value.
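The 100(1 − α)% CI for the difference of two means:
\[ \bar{X}_1 - \bar{X}_2 \pm Z(\alpha/2) \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]
OR
\[ \bar{X}_1 - \bar{X}_2 \pm t_{n_1 + n_2 - 2}(\alpha/2) \, S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]
The 100(1 − α)% CI for a single proportion:
\[ P \pm Z(\alpha/2) \sqrt{\frac{p_0 q_0}{n}} \]
The 100(1 − α)% CI for the difference of two proportions:
\[ P_1 - P_2 \pm Z(\alpha/2) \sqrt{\frac{P_1 (1 - P_1)}{n_1} + \frac{P_2 (1 - P_2)}{n_2}} \]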
• In medical research, tests of significance allow us to decide whether the sample estimates,
or differences between estimates are within their normal biological variation, commonly
called variability due to chance.
Procedure for testing statistical hypothesis
• State the null hypothesis
• State the alternative hypothesis (indicate one-tailed or two-tailed)
• State the level of significance (explain type 2 errors)
• Choose the test statistic (explain parametric and non-parametric tests)
• Compute the numerical value of the statistic from the observed data
• Compare the calculated value of test statistic with tabulated values in
appropriate standard distribution tables at a specified probability level of
significance
• Decide whether or not to reject the null hypothesis according to the p-value
Test for Single Mean:
\[ T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \]
Test statistics are created along the lines given for the test for single mean, and
the decisions follow accordingly.
Finally, tests of proportions are handled by the use of the Z-test for large samples
or by the use of the t-test for small samples.
Contingency Tables:
Test for association between two categorical variables is by the use of the χ²
distribution.
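A minimal sketch of the contingency-table test follows. It uses the standard Pearson statistic, the sum over cells of (observed − expected)² / expected, with expected counts computed from the row and column totals. The 2 × 2 counts are hypothetical, and the 5% critical value for 1 degree of freedom (3.841) is the usual tabulated figure.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the hypothesis of no association.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Hypothetical counts: rows = exposed / not exposed, columns = diseased / not diseased.
observed = [[30, 20],
            [15, 35]]

chi2, df = chi_square_statistic(observed)
CRITICAL_5_PERCENT_1_DF = 3.841  # tabulated chi-square value for 1 d.f. at the 5% level
reject_h0 = chi2 > CRITICAL_5_PERCENT_1_DF
print(f"chi-square = {chi2:.3f} on {df} d.f.; reject H0: {reject_h0}")
```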
Nonparametric Tests:
In the tests for means, proportions and association, there is a fundamental
assumption of the knowledge of the distribution of the test statistics and
indeed the knowledge of the functional form of the distribution of the variables
under consideration. When there is no knowledge of the functional form of the
basic density function of the variables, then it is usually good to resort to the
nonparametric tests such as:
• The Wilcoxon (Rank sum) test
• The Mann-Whitney U – test
• The Median test
• The Sign test
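Wilcoxon rank sum test: with samples of sizes m and n, N = m + n, and S_W the sum of the
ranks (in the combined sample) of the observations in the sample of size m,
\[ \text{Reject } H_0 \text{ when } |Z| = \left| \frac{S_W - \frac{m(N+1)}{2}}{\sqrt{\frac{mn(N+1)}{12}}} \right| > Z(\alpha/2) \]
Mann-Whitney U test:
\[ \text{Test statistic: } U = S_W - \frac{m(m+1)}{2} \]
where S_W is as in the Wilcoxon test.
\[ \text{Reject } H_0 \text{ when } |Z| = \left| \frac{U - \frac{mn}{2}}{\sqrt{\frac{mn(N+1)}{12}}} \right| > Z(\alpha/2) \]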
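The two rank-based statistics above can be computed side by side. The sketch below assumes there are no tied observations (ties would need average ranks, which are not handled here), and the two small samples are hypothetical. Because U differs from the rank sum only by the constant m(m + 1)/2, both standardized statistics come out identical.

```python
import math


def rank_sum_test(x, y):
    """Return (S_W, U, Z from S_W, Z from U) for two samples without ties."""
    m, n = len(x), len(y)
    N = m + n
    combined = sorted(x + y)
    if len(set(combined)) != N:
        raise ValueError("this sketch assumes no tied observations")
    rank = {value: i + 1 for i, value in enumerate(combined)}  # rank 1 = smallest
    s_w = sum(rank[value] for value in x)  # rank sum of the sample of size m
    u = s_w - m * (m + 1) / 2
    sd = math.sqrt(m * n * (N + 1) / 12)
    z_from_sw = (s_w - m * (N + 1) / 2) / sd
    z_from_u = (u - m * n / 2) / sd
    return s_w, u, z_from_sw, z_from_u


# Hypothetical measurements for two independent groups.
group_x = [12, 15, 18, 24, 30]
group_y = [9, 11, 14, 20, 22, 26]

s_w, u, z_w, z_u = rank_sum_test(group_x, group_y)
reject_h0 = abs(z_w) > 1.96  # two-sided test at the 5% level
print(f"S_W = {s_w}, U = {u}, Z = {z_w:.4f}, reject H0: {reject_h0}")
```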
Correlation:
Correlation is the method of analysis used when studying the measure of
relationship (association) between two continuous variables, e.g. percentage
of body fat and age of normal adults. The actual measure of the association is
done by calculating the correlation coefficient r. The correlation coefficient r
can take any value between –1 and +1.
\[ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \]
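The correlation coefficient can be computed directly from its definition. The sketch below applies the formula to the eleven complete before/after weight pairs that appear in Worked Example 2 later in these notes; the function name `pearson_r` is simply a label chosen for this illustration.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient r from the deviation form of the formula."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Weights (kg) before and after contraceptive use, from Worked Example 2
# (the woman with the missing "after" value is left out).
before = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
after = [61, 61, 59, 71, 80, 76, 90, 106, 98, 100, 114]

r = pearson_r(before, after)
print(f"r = {r:.4f}")
```

For these data r is close to +1, which indicates a strong positive linear association between the two weights.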
Regression:
Linear regression describes the linear relationship between variables and can be
used to predict the value of one variable for an individual when we only know
the other variable. Consider a simple case of fetal weight (kg) and
non-pregnant maternal weight. Here we consider the fetal weight as the response
(or outcome) variable while the maternal weight is the predictor variable.
These are also called the dependent and independent variables respectively.
The linear relationship between the dependent (Y) and the independent (X)
variables is given as:
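\[ Y = \alpha + \beta X \]
The parameters are estimated by:
\[ \hat{\beta} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \]
\[ \hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X} \]
Hence,
\[ \hat{Y} = \hat{\alpha} + \hat{\beta} X \]
which is used for prediction.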
Multiple Regression:
\[ Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \]
e.g. obesity, smoking and snoring:
\[ Y_{\text{Snoring}} = \alpha + \beta_1 X_{\text{Smoking}} + \beta_2 X_{\text{Obesity}} \]
Logistic Regression:
Suitable for prediction when the outcome variable is dichotomous (has only two categories).
ANOVA TABLE
S. V.       d. f.      SS     MS                      F-Ratio
Treatment   k – 1      SStr   MStr = SStr/(k – 1)     Fcal = MStr/MSE
Error       k(n – 1)   SSE    MSE = SSE/[k(n – 1)]
Total       kn – 1     SST
5. Under H0 and the assumptions in (3) being correct, Fcal under F-Ratio in
the ANOVA table has an F distribution with k – 1 and k(n – 1) degrees of
freedom. Hence, we find the critical point by reading off F_{k–1, k(n–1)}(α)
from the F-distribution table for the appropriate level of significance.
6. Compare the value of Fcal from the ANOVA table with F_{k–1, k(n–1)}(α) from
the statistical table.
If Fcal > F_{k–1, k(n–1)}(α) then reject the null hypothesis.
7. Draw a conclusion.
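The ANOVA table entries and the final comparison can be sketched as follows for the balanced case (k treatments with n observations each). The three small groups are hypothetical, and the critical value 4.26 is the usual tabulated 5% point of the F distribution with 2 and 9 degrees of freedom.

```python
def one_way_anova(groups):
    """Return the ANOVA table entries for k groups of equal size n."""
    k = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal sample sizes")
    grand_mean = sum(sum(g) for g in groups) / (k * n)
    means = [sum(g) / n for g in groups]
    ss_tr = n * sum((m - grand_mean) ** 2 for m in means)               # treatment SS
    ss_e = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)  # error SS
    ms_tr = ss_tr / (k - 1)
    ms_e = ss_e / (k * (n - 1))
    return {
        "SStr": ss_tr, "SSE": ss_e, "SST": ss_tr + ss_e,
        "df_tr": k - 1, "df_e": k * (n - 1),
        "MStr": ms_tr, "MSE": ms_e, "F": ms_tr / ms_e,
    }


# Hypothetical responses under k = 3 treatments, n = 4 subjects each.
groups = [[4, 5, 6, 5],
          [7, 8, 6, 7],
          [9, 10, 8, 9]]

table = one_way_anova(groups)
F_CRITICAL = 4.26  # tabulated F(2, 9) at the 5% level
reject_h0 = table["F"] > F_CRITICAL
print(f"Fcal = {table['F']:.2f} on ({table['df_tr']}, {table['df_e']}) d.f.; "
      f"reject H0: {reject_h0}")
```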
Remark
When the sample sizes (i.e. the number of observations in each
treatment) are not all equal, necessary adjustment must be made in the
computation of sums of squares.
Example
Six patients each were tested on four types of oral contraceptive to
investigate the average reaction time.
Risk Estimation:
                   Disease
Exposure     Yes      No       Total
Yes          a        b        a + b
No           c        d        c + d
Total        a + c    b + d    n = a + b + c + d
Relative Risk (RR): This measures the risk of disease in
the exposed group relative to those who are not exposed. It is the
ratio of the incidence of disease in the exposed group to the
corresponding incidence of disease in the non-exposed group.
Thus,
\[ RR = \frac{a/(a+b)}{c/(c+d)} = \frac{a}{a+b} \cdot \frac{c+d}{c} = \frac{a(c+d)}{c(a+b)} \]
Remarks:
1. RR of 1.0 indicates that the incidence rates of disease in the
exposed and non-exposed groups are identical and thus
indicates that there is no association observed between the
exposure and the disease.
2. A value of RR greater than 1.0 indicates a positive association
or an increased risk among those exposed (to a factor).
3. Analogously, an RR less than 1.0 means that there is an inverse
association or a decreased risk among those exposed.
4. RR may change (in some cases) with time e.g. RR for 1 year
exposure might be different from RR for 10 years exposure.
\[ OR \equiv \frac{a/c}{b/d} = \frac{ad}{bc} \]
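The relative risk and odds ratio can be checked numerically from the four cell counts of the exposure-by-disease table. The counts used below are hypothetical and serve only to illustrate the arithmetic.

```python
def relative_risk(a, b, c, d):
    """RR = [a / (a + b)] / [c / (c + d)]: incidence in exposed over non-exposed."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """OR = (a / c) / (b / d) = ad / bc."""
    return (a * d) / (b * c)


# Hypothetical 2 x 2 table:
#               Disease: Yes   No
# Exposed                a=30  b=70
# Not exposed            c=10  d=90
a, b, c, d = 30, 70, 10, 90

rr = relative_risk(a, b, c, d)
odds = odds_ratio(a, b, c, d)
print(f"RR = {rr:.2f}, OR = {odds:.2f}")
```

With these counts the incidence is 30% among the exposed and 10% among the non-exposed, so the RR is 3; an RR above 1.0 points to a positive association, in line with the remarks above.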
Worked Examples
Example 1: Blood pressure levels were measured in 100 diabetic and 100
non-diabetic women aged 40 – 49 years. Mean systolic blood pressures were
146.4 mm Hg (with standard deviation of 18.5) among the diabetics and 140.4
mm Hg (with standard deviation of 16.8) among the non-diabetics. By making
the necessary assumptions, calculate the 95% confidence interval for the
difference of means of the blood pressures of the two groups of women.
Solution: Assume that the blood pressures of each of the two groups of
women are normally distributed. Hence, assume that the difference of means
of the blood pressures is also normally distributed.
For a 95% confidence interval, α = 0.05 ⇒ α/2 = 0.025
The formula for the 100(1 − α)% CI for the difference of two means is:
\[ \bar{X}_1 - \bar{X}_2 \pm Z(\alpha/2) \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \]
\[ 146.4 - 140.4 \pm 1.96 \sqrt{\frac{18.5^2}{100} + \frac{16.8^2}{100}} \]
i.e. 6 ± 1.96 × 2.498979792
i.e. 6 ± 4.898
(1.102, 10.898)
∴ The 95% confidence interval for the difference of means is: 1.1 to 10.9
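The arithmetic of Example 1 can be reproduced with a short function. This is only a sketch of the large-sample formula used in the solution; the function name and its parameters are labels chosen for this illustration.

```python
import math


def ci_difference_of_means(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Large-sample CI for a difference of means: (mean1 - mean2) +/- z * SE."""
    diff = mean1 - mean2
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    margin = z * standard_error
    return diff - margin, diff + margin


# Values from Example 1: diabetic versus non-diabetic women, 100 in each group.
lower, upper = ci_difference_of_means(146.4, 18.5, 100, 140.4, 16.8, 100)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

Since the interval excludes zero, the data suggest a real difference in mean systolic blood pressure between the two groups at the 5% level.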
Example 2: A team of medical researchers wished to measure the level of
weight gained by users of oral contraceptives. The weights of 12 women were
taken before and after the use of the contraceptive over a one-year interval.
Unfortunately, one of the women died before the end of the year, and
therefore there was no result for her (this is indicated by * in the data set).
Estimate the after-use weight of the woman who died before the experiment was
concluded.
Weights of Women
Before (X) After (Y)
50 61
55 61
60 59
65 71
70 80
75 76
79.5 *
80 90
85 106
90 98
95 100
100 114
x y x² y² xy
50 61 2500 3721 3050
55 61 3025 3721 3355
60 59 3600 3481 3540
65 71 4225 5041 4615
70 80 4900 6400 5600
75 76 5625 5776 5700
80 90 6400 8100 7200
85 106 7225 11236 9010
90 98 8100 9604 8820
95 100 9025 10000 9500
100 114 10000 12996 11400
825 916 64625 80076 71790
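\[ \hat{\beta} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2} = \frac{11(71790) - (825)(916)}{11(64625) - 825^2} = \frac{33990}{30250} = 1.1236 \]
\[ \hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X} = -0.9973 \]
\[ \therefore \hat{Y} = \hat{\alpha} + \hat{\beta} X = -0.9973 + 1.1236 X \]
Hence, when X = 79.5 we have
\[ \hat{Y} = -0.9973 + 1.1236 \times 79.5 = 88.3289 \]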
That is, the estimated weight of the woman that died (after one year) would
have been 88.33kg.
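The estimate in Example 2 can be verified with a few lines of Python applied to the eleven complete pairs. One detail worth noting: the intercept of −0.9973 in the solution comes from using the slope rounded to 1.1236; carrying full precision gives an intercept of −1.0000, and the predicted weight still rounds to 88.33 kg.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept using the computational formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    beta = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    alpha = sum_y / n - beta * sum_x / n
    return alpha, beta


# Before (X) and after (Y) weights in kg for the eleven women with complete data.
before = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
after = [61, 61, 59, 71, 80, 76, 90, 106, 98, 100, 114]

alpha, beta = fit_line(before, after)
predicted = alpha + beta * 79.5  # woman whose "after" weight is missing
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}, predicted Y = {predicted:.2f} kg")
```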
Solution:
H0: µ = µ0 = 120
H1: µ ≠ µ0 = 120
The test statistic is
\[ t = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \]
Let α = 0.05
Since we have a two-sided test we put α/2 = 0.025 in each tail of the
distribution.
∴ we find t14(0.025) = 2.1448 (obtained from statistical table)
Computed t:
\[ t = \frac{96 - 120}{35 / \sqrt{15}} = -2.66 \]
∴ |t| = 2.66
Decision rule:
Since |t| = 2.66 > t14(0.025) = 2.1448,
we shall reject the null hypothesis.
Conclusion: Based on the given data we shall conclude that the mean of
the population from which the sample came is not 120.
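The steps of this test can be traced in code. The sketch below takes the summary figures used in the solution (sample mean 96, standard deviation 35, n = 15, hypothesized mean 120) and the tabulated critical value quoted there; the function name is a label chosen for this illustration.

```python
import math


def one_sample_t(sample_mean, mu0, sd, n):
    """t = (sample mean - mu0) / (S / sqrt(n)), with n - 1 degrees of freedom."""
    return (sample_mean - mu0) / (sd / math.sqrt(n))


t_value = one_sample_t(sample_mean=96, mu0=120, sd=35, n=15)
T_CRITICAL = 2.1448  # t with 14 d.f. at alpha/2 = 0.025, as quoted from the table

# Two-sided decision rule: reject H0 when |t| exceeds the critical value.
reject_h0 = abs(t_value) > T_CRITICAL
print(f"t = {t_value:.3f}, reject H0: {reject_h0}")
```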
Exercise:
At admission two groups of women on two different family planning methods in
clinical trials show the following characteristics.
                 Mean     SD     n
Height (cm)
Cycloprovera     155.86   5.17   42
HRP 102          155.83   6.39   48
Age (years)
Cycloprovera     27.71    4.10   42
HRP 102          28.46    4.66   48