
Survey Study of The Requirement of A Teller Machine Within The University Premises For The Public Usage

Author: R.M.K.T. Rathnayaka

Supervisor: Dr. L.A.L.W. Jayasekara

Special Degree Part II (Research Project)


Department of Mathematics
University of Ruhuna
Matara.
Declaration

This thesis is submitted in partial fulfillment of the requirements for the Special Degree
in Mathematics, Part II (2002), by R.M.K.T. Rathnayaka under the supervision and guidance
of Dr. L.A.L.W. Jayasekara, Senior Lecturer, Department of Mathematics, University of
Ruhuna, Sri Lanka.

————————————–
Dr. L.A.L.W. Jayasekara

————————————–
R.M.K.T. Rathnayaka
(Index No: 2002/S/5043)

—————————————
Date

ACKNOWLEDGEMENT

My sincere thanks go to my supervisor, Dr. L.A.L.W. Jayasekara, for invaluable help and
encouragement during the preparation of this thesis.

I would like to thank Mr. Rathnayaka, Head of the Department of Mathematics,
University of Ruhuna, for giving me the opportunity to carry out my research work in
the Department. Equally, I wish to express my gratitude to the senior lecturers in the
Department of Mathematics, to Miss B.B.U.P. Perera, Assistant Lecturer in the Department
of Mathematics, University of Ruhuna, and to all the members of the Department for their
kind cooperation.

Finally, I would like to thank my parents, the demonstrators and my friends, whose
support helped me to complete this project successfully.

Name :
R.M.K.T. Rathnayaka (2002/S/5043)

Department Of Mathematics
University Of Ruhuna
Matara
Sri Lanka

Date : 04/04/2008

Abstract

For a long time, people have set aside essentials such as food, clothing and money,
beyond what they need for daily use, with a view to using them in the future. With the
growing use of money, the concept of banking gradually became popular all over the world
and improved rapidly with the new outcomes of the technological revolution, so that banks
introduced teller machine and credit card facilities for the convenience of the public.

For example, many people at our university deal with the banking sector, and most of
them prefer to keep their accounts in government banks rather than private banks. A
considerable number of these people use teller machines and credit cards. However, most
university people face a lot of trouble because these teller machines are installed in
the urban areas. My main intention is therefore to investigate this matter.

About 3500 people come into the university premises daily for their academic and official
work. The goal of this study is to survey the requirement for a teller machine within
the university premises for public usage. We prepared questionnaires to collect data and
interviewed more than 500 people around the university area in the Wellamadama premises,
comprising academic and non-academic staff, internal and external students, and security
staff.

Finally, to analyse these data we used the statistical software packages “MINITAB” and
“R”.

Contents

Introduction

1 Simple Regression Analysis
1.1 Using Simple Regression To Describe A Linear Relationship
1.2 Least Squares Estimation
1.3 Inferences From A Simple Regression Analysis
1.4 Model And Parameter Estimation
1.5 Multiple Linear Regression Model
1.6 Least-Square Procedures For Model Fitting
1.7 Polynomial Model Of Degree p

2 Univariate Logistic Regression Model
2.1 Why Use Logistic Regression Rather Than Ordinary Linear Regression
2.2 The Simple Logistic Model
2.3 The Importance Of The Logistic Transformation
2.4 Fitting The Logistic Regression Model
2.5 Fitting The Logistic Regression Model By Using Maximum Likelihood Method
2.6 Testing For The Significance Of The Coefficients
2.7 Testing For The Significance Of The Coefficients For The Logistic Regression
2.8 Confidence Interval Estimation

3 Multiple Logistic Regression Model
3.1 The Multiple Logistic Regression Model
3.2 Fitting The Multiple Logistic Regression Model
3.3 Testing For The Significance Of The Model
3.4 Likelihood Ratio Test For Testing For The Significance Of The Model
3.5 Wald Test For Testing For The Significance
3.6 Confidence Interval Estimation

4 Interpretation Of The Fitted Logistic Regression Model
4.1 Dichotomous Independent Variable
4.2 Polychotomous Independent Variable
4.3 Deviation From Means Coding Method
4.4 Continuous Independent Variable
4.5 The Multivariable Model
4.6 Interaction And Confounding
4.7 Estimation Of Odds Ratios In The Presence Of Interaction

5 Model-Building Strategies And Methods For Logistic Regression
5.1 Variable Selection
5.2 Fractional Polynomial

6 Descriptive Data Analysis

7 Discussion

8 Conclusion
8.1 Results
8.2 Conclusion

9 Appendix

An Introduction to Regression Analysis

Introduction
Advances in technology, including computers, scanners, and telecommunications equipment,
have buried present-day managers under a mountain of data. Although the purpose of
these data is to assist managers in the decision-making process, corporate executives who
face the task of juggling data on many variables may find themselves at a loss when
attempting to make sense of such information. The decision-making process is further
complicated by the dynamic elements in the business environment and the complex
interrelationships among these elements.

This text has been prepared to give managers (and future managers) tools for examining
possible relationships between two or more variables. For example, sales and advertising
are two variables commonly thought to be related. When a soft drink company
increases advertising expenditures by paying professional athletes millions of dollars to do
its advertisements, it expects this outlay to increase sales. In general, when decisions on
advertising expenditures of millions of dollars are involved, it is comforting to have some
evidence that, in the past, increased advertising expenditures indeed led to increased sales.

Another example is the relationship between the selling price of a house and its square
footage. When a new house is listed for sale, how should the price be determined? Is a
4000-square-foot house worth twice as much as a 2000-square-foot house? What other
factors might be involved in the pricing of houses, and how should these factors be included
in the determination of the price?

In a study of absenteeism at a large manufacturing plant, management may feel that
several variables have an impact. These variables might include job complexity, base pay,
the number of years the worker has been with the plant, and the age of that worker. If
absenteeism can cost the company thousands of dollars, then the importance of identifying
its associated factors becomes clear.

Perhaps the most important analytic tool for examining the relationships between two
or more variables is regression analysis. Regression analysis is a statistical technique for
developing an equation describing the relationship between two or more variables. One
variable is specified to be the dependent variable, or the variable to be explained. The
other one or more variables are called the independent or explanatory variables. Using the
previous examples, the soft drink firm would identify sales as the dependent variable and
advertising expenditures as the explanatory variable. The real estate firm would choose
selling price as the dependent variable and size as the explanatory variable to explain
variation in selling price from house to house.

There are several reasons business researchers might want to know how certain
variables are related. The retail firm may want to know how much advertising is necessary
to achieve a certain level of sales. An equation expressing the relationship between
sales and advertising is useful in answering this question. For the real estate firm, the
relationship might be used in assigning prices to houses coming on to the market. To try
to lower the absenteeism rate, the management of the manufacturing firm wants to know
what variables are most highly related to absenteeism. Reasons for wanting to develop
an equation relating two or more variables can be classified as follows.

(a) To describe the relationship.

(b) For control purposes (what value of the explanatory variable is needed to produce a
certain level of the dependent variable?).

(c) For prediction.

Much statistical analysis is a multistage process of trial and error. A good deal of
exploratory work must be done to select appropriate variables for study and to determine
relationships between or among them. This requires that a variety of statistical tests and
procedures be performed and sound judgments be made before one arrives at satisfactory
choices of dependent and explanatory variables. The emphasis in this text is on this
multistage process rather than on the computations themselves or an in-depth study of the
theory behind the techniques presented. In this sense, the text is directed at the applied
researcher or the consumer of statistics.

Except for a few preparatory examples, it is assumed that a computer is available to
the reader to perform the actual computations. The use of statistical software frees the
user to concentrate on the multistage “model-building” process. Most examples use
illustrative computer output to present the results. The two software packages used are
“MINITAB” and “Microsoft Excel 2000”. MINITAB is included because it is widely used
as a teaching tool in universities and is also used in industry.

Chapter 1

Simple Regression Analysis

1.1 Using Simple Regression To Describe A Linear Relationship

Regression analysis is a statistical technique used to describe relationships among
variables. The simplest case to examine is one in which a variable y, referred to as the
dependent variable, may be related to another variable x, called an independent or
explanatory variable. If the relationship between y and x is believed to be linear, then the
equation expressing this relationship is the equation for a line:

$$y = \beta_0 + \beta_1 x \tag{1.1}$$

If a graph of all the (x, y) pairs is constructed, then $\beta_0$ represents the y intercept, the
point where the line crosses the vertical (y) axis, and $\beta_1$ represents the slope of the line.
Consider the data shown in Table 1.1. A graph of the (x, y) pairs would appear as shown
in Figure 1.1. Regression analysis is not needed to obtain the equation expressing the
relationship between these two variables. In equation form:

$$y = 1 + 2x$$
This is an exact or deterministic linear relationship. Exact linear relationships are
sometimes encountered in business environments. For example, from accounting,

$$\text{Total Costs} = \text{Fixed Costs} + \text{Variable Costs}$$

Other exact relationships may be encountered in various science courses (for example,
physics or chemistry). In the social sciences (for example, psychology or sociology) and
in business and economics, exact linear relationships are the exception rather than the
rule. Data encountered in a business environment are more likely to appear as in Table
1.2. These data graph as shown in Figure 1.2.

X    1    2    3    4    5    6
Y    3    5    7    9   11   13

Table 1.1: Example Data set-I

Figure 1.1: Example Data set-I

X    1    2    3    4    5    6
Y    3    2    8    8   11   13

Table 1.2: Example Data set-II

It appears that x and y may be linearly related, but it is not an exact relationship.
Still, it may be desirable to describe the relationship in equation form. This can
be done by drawing what appears to be the “best-fitting” line through the points and
estimating (guessing) what the values of $\beta_0$ and $\beta_1$ are for this line. This has been
done in Figure 1.2. A good guess might be the following equation:

$$\hat{y} = -1 + 2.5x$$

Figure 1.2: Example Data set-II

[Plot for Figure 1.3: axes x and y; marked values x* = 3, y* = 8 and ŷ* = 7, with the residual y* − ŷ* shown between them.]

Figure 1.3: Motivation for the Least-squares Regression line

The drawbacks to this method of fitting the line should be clear. For example, if the
(x, y) pairs graphed in Figure 1.2 were given to two people, each would probably guess
different values for the intercept and slope of the best-fitting line. Furthermore, there is no
way to assess who would be more correct. To make line fitting more precise, a definition
of what it means for a line to be the “best” is needed. The criterion for a best-fitting line
that we will use might be called the “minimum sum of squared errors” criterion or, as it
is more commonly known, the least-squares criterion.

In Figure 1.3, the (x, y) pairs from Table 1.2 have been plotted and an arbitrary line drawn
through the points. Consider the pair of values denoted $(x^*, y^*)$. The actual y value is
indicated as $y^*$; the value predicted to be associated with $x^*$ if the line shown were used
is indicated as $\hat{y}^*$. The difference between the actual y value and the predicted y value
at the point $x^*$ is called a residual and represents the “error” involved. This error is denoted
$y^* - \hat{y}^*$. If the line is to fit the data points as accurately as possible, these errors should be
minimized. This should be done not just for the single point $(x^*, y^*)$, but for all the points
on the graph. There are several ways to approach this task.

(1) Use the line that minimizes the sum of errors, $\sum_{i=1}^{n} (y_i - \hat{y}_i)$.
The problem with this approach is that, for any line that passes through the point
$(\bar{x}, \bar{y})$,

$$\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0,$$

so there are an infinite number of lines satisfying this criterion, some of which
obviously do not fit the data well. For example, in Figure 1.4, lines A and B have both
been constructed so that

$$\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0.$$

But line A obviously fits the data better than line B; that is, it keeps the distances
$y_i - \hat{y}_i$ small.

[Plot for Figure 1.4: axes x and y; marked values x = 3.5 and y = 7.5.]
Figure 1.4: Lines A and B both satisfying the criterion $\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0$

(2) Use the line that minimizes the sum of the absolute values of the errors,

$$\sum_{i=1}^{n} |y_i - \hat{y}_i|.$$

This is called the minimum sum of absolute errors criterion. The resulting line is called
the least absolute value (LAV) regression line. Although use of this criterion is
gaining popularity in many situations, it is not the one that we use in this text. Finding
the line that satisfies the minimum sum of absolute errors criterion requires solving
a fairly complex problem by a technique called linear programming.

(3) Use the line that minimizes the sum of squared errors.
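
The three criteria listed above can be compared numerically. The following is a minimal sketch, not part of the original analysis: it uses the data of Table 1.2 and two hypothetical lines through $(\bar{x}, \bar{y})$ whose slopes (2 and −1) are chosen here only for illustration and are not the lines A and B drawn in Figure 1.4.

```python
# Data from Table 1.2.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 2, 8, 8, 11, 13]


def residuals(b0, b1):
    """Residuals y_i - (b0 + b1 * x_i) for the line with intercept b0 and slope b1."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def criteria(b0, b1):
    """Return (sum of errors, sum of absolute errors, sum of squared errors)."""
    e = residuals(b0, b1)
    return sum(e), sum(abs(v) for v in e), sum(v * v for v in e)


x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)


def line_through_mean(slope):
    """(intercept, slope) of the line with the given slope through (x_bar, y_bar)."""
    return y_bar - slope * x_bar, slope


# Hypothetical slopes, assumed only for this illustration.
line_a = line_through_mean(2.0)
line_b = line_through_mean(-1.0)

for name, (b0, b1) in [("A", line_a), ("B", line_b)]:
    total, total_abs, total_sq = criteria(b0, b1)
    print(f"line {name}: sum = {total:.2f}, "
          f"sum |e| = {total_abs:.2f}, sum e^2 = {total_sq:.2f}")
```

Both lines give a sum of errors of zero, so criterion (1) cannot tell them apart, while the absolute and squared criteria clearly favour the first line.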

1.2 Least Squares Estimation


The parameters $\beta_0$ and $\beta_1$ are estimated by the method of least squares. The reasoning
behind this method is quite simple. From the many straight lines that can be drawn
through a scattergram we wish to pick the one that “best fits” the data. The fit is “best”
in the sense that the values of $b_0$ and $b_1$ chosen are those that minimize the sum of the
squares of the residuals. In this way we are essentially picking the line that comes as close
as it can to all data points simultaneously. For example, if we consider the sample of five
data points shown in Figure 1.5, then the least-squares procedure selects that line which
causes $e_1^2 + e_2^2 + e_3^2 + e_4^2 + e_5^2$ to be as small as possible.

The sum of squares of the errors about the estimated regression line is given by

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$

[Plot for Figure 1.5: five data points with residuals e1, ..., e5 about the line labelled μ_{Y|x} = b0 + b1 x.]

Figure 1.5: The least-squares procedure minimizes the sum of the squares of the residuals $e_i$

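Differentiating SSE with respect to $b_0$ and $b_1$, we obtain

$$\frac{\partial SSE}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)$$

$$\frac{\partial SSE}{\partial b_1} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) x_i$$

We now set these partial derivatives equal to 0 and use the rules of summation to
obtain the equations

$$n b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$$

$$b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$$

These equations are called the normal equations. They can be solved easily to obtain these
estimates for $\beta_0$ and $\beta_1$:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2} \tag{1.2}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{1.3}$$

Equivalently, the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize SSE can be written as

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$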
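
The step from the normal equations to the slope estimate in equation 1.2 is not spelled out in the text. A short sketch of the elimination, using only the symbols already defined ($b_0$, $b_1$, $\bar{x}$, $\bar{y}$, and all sums running over $i = 1, \ldots, n$), might read:

```latex
\begin{aligned}
b_0 &= \bar{y} - b_1 \bar{x}
  && \text{(first normal equation divided by } n\text{)} \\
(\bar{y} - b_1 \bar{x}) \sum x_i + b_1 \sum x_i^2 &= \sum x_i y_i
  && \text{(substituted into the second normal equation)} \\
b_1 \Bigl( \sum x_i^2 - \bar{x} \sum x_i \Bigr) &= \sum x_i y_i - \bar{y} \sum x_i \\
b_1 &= \frac{\sum x_i y_i - \frac{1}{n} \bigl( \sum x_i \bigr) \bigl( \sum y_i \bigr)}
            {\sum x_i^2 - \frac{1}{n} \bigl( \sum x_i \bigr)^2}
  && \text{(since } \bar{x} = \tfrac{1}{n} \sum x_i,\ \bar{y} = \tfrac{1}{n} \sum y_i \text{)}
\end{aligned}
```

The first line is also the intercept formula in equation 1.3, so both estimates follow from the same two equations.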

i       x_i    y_i    x_i y_i    x_i^2
1        1      3        3         1
2        2      2        4         4
3        3      8       24         9
4        4      8       32        16
5        5     11       55        25
6        6     13       78        36
Sums    21     45      196        91

Table 1.3: Computations for finding β0 and β1

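A computationally simpler form of $\hat{\beta}_1$, as in equation 1.2, is

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2} \tag{1.4}$$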
n
As an example of the use of these formulas, consider again the data in Table 1.2. The
intermediate computations necessary for finding $\hat{\beta}_0$ and $\hat{\beta}_1$ are shown in Table 1.3. The
slope, $\hat{\beta}_1$, can be computed using the formula in equation 1.2:

$$\hat{\beta}_1 = \frac{196 - \frac{1}{6}(21)(45)}{91 - \frac{1}{6}(21)^2} = \frac{38.5}{17.5} = 2.2$$

The intercept, $\hat{\beta}_0$, is computed as in equation 1.3:

$$\hat{\beta}_0 = 7.5 - 2.2(3.5) = -0.2$$

because

$$\bar{x} = \frac{21}{6} = 3.5 \quad \text{and} \quad \bar{y} = \frac{45}{6} = 7.5$$

The least-squares regression line for these data is

$$\hat{y} = -0.2 + 2.2x$$
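
The arithmetic above can be checked mechanically. The sketch below is an illustration only, written in Python rather than the MINITAB or Excel used in the text. It recomputes the estimates from the Table 1.2 data with exact fractions, confirms that the computational and deviation forms of the slope agree, and compares the SSE of the fitted line with that of the earlier guessed line $\hat{y} = -1 + 2.5x$.

```python
from fractions import Fraction

# Data from Table 1.2.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 2, 8, 8, 11, 13]
n = len(xs)

# Column sums, as in Table 1.3.
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

x_bar = Fraction(sum_x, n)
y_bar = Fraction(sum_y, n)

# Slope from the computational form (equation 1.2), intercept from equation 1.3.
b1 = (sum_xy - Fraction(sum_x * sum_y, n)) / (sum_x2 - Fraction(sum_x ** 2, n))
b0 = y_bar - b1 * x_bar

# Slope from the deviation form; it should agree with b1 exactly.
b1_dev = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))


def sse(intercept, slope):
    """Sum of squared errors of the line y = intercept + slope * x on the data."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


print("slope:", float(b1), "intercept:", float(b0))
print("SSE of least-squares line:", float(sse(b0, b1)))
print("SSE of guessed line:", float(sse(Fraction(-1), Fraction(5, 2))))
```

Exact fractions are used so that the comparison of the two slope formulas is not blurred by floating-point rounding.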
Summary

There is no longer any guesswork associated with computing the best-fitting line once
a criterion has been stated that defines “best”. Using the criterion of minimum sum
of squared errors, the regression line we computed provides the best description of the
relationship between the variables x and y.

1.3 Inferences From A Simple Regression Analysis


Thus far, regression analysis has been viewed as a way to describe the relationship between
two variables. The regression equation obtained can be viewed in this manner simply as
a descriptive statistic. However, the power of the technique of least-squares regression is
not in its use as a descriptive measure for one particular sample, but in its ability to draw
inferences or generalizations about the relationship for the entire population of values for
the variables x and y.

[Panels of Figure 1.6: a direct relationship; an inverse relationship; no linear relationship; a curvilinear relationship.]

Figure 1.6: Example of possible population regression lines

To draw inferences from a simple regression, we must make some assumptions about how
x and y are related in the population. These initial assumptions describe an “ideal” situation.
Later, each of these assumptions is relaxed and we demonstrate modifications to the basic
least-squares approach that provide a model that is still suitable for statistical inference.

Assume that the relationship between the variables x and y is represented by a
population regression line. The equation of this line is written as

$$\mu_{y|x} = \beta_0 + \beta_1 x \tag{1.5}$$

where $\mu_{y|x}$ is the conditional mean of y given x, $\beta_0$ is the y intercept for the population
regression line, and $\beta_1$ is the slope of the population regression line. Examples of possible
relationships are shown in Figure 1.6.

Suppose that we are developing a model to describe the temperature of the water off
the continental shelf. Since the temperature depends in part on the depth of the
water, two variables are involved. These are X, the water depth, and Y, the water
temperature. We are not interested in making inferences on the depth of the water. Rather,
we want to describe the behavior of the water temperature under the assumption that the
depth of the water is known precisely in advance. Even if the depth of the water is fixed
at some value x, the water temperature will still vary due to other random influences.
For example, if several temperature measurements are taken at various places, each at a
depth of x = 1000 feet, these measurements will vary in value. For this reason, we must
admit that for a given x we are really dealing with a “conditional” random variable,
which we denote by Y|x (Y given that X = x).

[Panels (a), (b) and (c) of Figure 1.7: water temperature (vertical axis) against depth of water (horizontal axis); panel (b) shows the line μ_{Y|x} = β0 + β1 x; panel (c) marks a point (x_i, y_i) and the deviation ε.]

Figure 1.7: Depth of water vs. water temperature

This conditional random variable has a mean denoted by $\mu_{Y|x}$. Clearly, we should not
expect the average temperature at x = 1000 feet to be the same as that at x = 5000 feet.
That is, it is reasonable to assume that $\mu_{Y|x}$ is a function of x. We call the graph of this
function the curve of regression of Y on X. Since we assume that the value of X is known in
advance and that the value assumed by Y depends in part on the particular value of X under
consideration, Y is called the dependent or response variable. The variable X, whose
value is used to help predict the behaviour of Y|x, is called the independent or predictor
variable or the regressor.

My immediate problem is to estimate the form of $\mu_{Y|x}$ based on data obtained at some
selected values $x_1, x_2, x_3, \ldots, x_n$ of the predictor variable X. The actual values used to
develop the model are not overly important. If a functional relationship exists, it should
become apparent regardless of which X values are used to discover it. However, to be
of practical use, these values should represent a fairly wide range of possible values of the
independent variable X. Sometimes the values used can be preselected. For example, in
studying the relationship between water temperatures and depths, we might know that
our model is to be used to predict water temperature for depths from 1000 to 5000 feet.
We can choose to measure water temperatures at any water depth that we wish within
this range. For example, we might take measurements at 1000-foot increments. In this
way we preset our X values at $x_1 = 1000$, $x_2 = 2000$, $x_3 = 3000$, $x_4 = 4000$, $x_5 = 5000$ feet.
When the X values used to develop the regression equation are preselected, the study is
said to be controlled.

Regardless of how the X values for study are selected, my random sample is properly
viewed as taking the form

$$\{(x_1, Y|x_1), (x_2, Y|x_2), (x_3, Y|x_3), \ldots, (x_n, Y|x_n)\}$$

The first member of each ordered pair denotes a value of the independent variable X; it
is a real number. The second member of each pair is a random variable.

Now we estimate the curve of regression of Y on X when the regression is considered
to be linear. In this case the equation for $\mu_{Y|x}$ is given by

$$\mu_{Y|x} = \beta_0 + \beta_1 x \tag{1.6}$$

where $\beta_0$ and $\beta_1$ denote real numbers.

1.4 Model And Parameter Estimation


Description Of Model
Recall from elementary algebra that the equation for a straight line is y = b + mx, where b
denotes the y intercept and m denotes the slope of the line. The simple linear regression
model is

$$\mu_{Y|x} = \beta_0 + \beta_1 x$$

where $\beta_0$ denotes the intercept and $\beta_1$ the slope of the regression line. We must find a
logical way to estimate the theoretical parameters $\beta_0$ and $\beta_1$.

In conducting a regression study, we shall be observing the variable X at n points $x_1$,
$x_2, x_3, \ldots, x_n$. These points are assumed to be measured without error. When they are
preselected by the experimenter, we say that the study is a controlled study; when they
are observed at random, the study is called an observational study. Both situations
are handled in the same way mathematically. In either case we shall be concerned with
the n random variables $Y|x_1, Y|x_2, Y|x_3, \ldots, Y|x_n$. Recall that a random variable varies
about its mean value. Let $E_i$ denote the random difference between $Y|x_i$ and its mean,
$\mu_{Y|x_i}$. That is, let

$$E_i = Y|x_i - \mu_{Y|x_i} \tag{1.7}$$

Solving this equation for $Y|x_i$, we conclude that

$$Y|x_i = \mu_{Y|x_i} + E_i$$

In this expression it is assumed that the random difference $E_i$ has mean 0. Since we are
assuming that the regression is linear, we can conclude that $\mu_{Y|x_i} = \beta_0 + \beta_1 x_i$. Substituting,
we see that

$$Y|x_i = \beta_0 + \beta_1 x_i + E_i$$

It is customary to drop the conditional notation and to denote $Y|x_i$ by $Y_i$. Thus an
alternative way to express the simple linear regression model is

$$Y_i = \beta_0 + \beta_1 x_i + E_i \tag{1.8}$$

where $E_i$ is assumed to be a random variable with mean 0.


Our data consist of a collection of n pairs $(x_i, y_i)$, where $x_i$ is an observed value of the
variable X and $y_i$ is the corresponding observation for the random variable Y. The
observed value of a random variable usually differs from its mean value by some random
amount. This idea is expressed mathematically by writing

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \tag{1.9}$$

In this equation $\varepsilon_i$ is a realization of the random variable $E_i$ that appears in the alternative
model for simple linear regression, equation 1.8.
In a regression study it is useful to plot the data points in the xy plane. Such a plot is
called a “scattergram”. We do not expect these points to lie exactly on a straight line.
However, if linear regression is applicable, then they should exhibit a linear trend.
Once $\beta_0$ and $\beta_1$ have been approximated from the available data, we can replace these
theoretical parameters by their estimated values in the regression model. Letting $b_0$ and
$b_1$ denote the estimates for $\beta_0$ and $\beta_1$, respectively, the estimated line of regression takes
the form

$$\hat{\mu}_{Y|x} = b_0 + b_1 x \tag{1.10}$$

Just as the data points do not all lie on the theoretical line of regression, they also do not
all lie on this estimated regression line. If we let $e_i$ denote the vertical distance from a
point $(x_i, y_i)$ to the estimated regression line, then each data point satisfies the equation

$$y_i = b_0 + b_1 x_i + e_i$$

The term $e_i$ is called the residual. Figure 1.8 illustrates this idea and points out the
difference between $\varepsilon_i$ and $e_i$ graphically.
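
The distinction between the unobservable errors $\varepsilon_i$ and the residuals $e_i$ can be made concrete with a small simulation. The sketch below is only an illustration: the "true" parameter values, the error standard deviation and the use of normally distributed errors are assumptions made here, not values from the study, and the depths are the five preselected values from the water-temperature example.

```python
import random

random.seed(0)  # fixed seed so that the run is repeatable

# Hypothetical "true" parameters (assumed for illustration; unknown in practice).
beta0, beta1, sigma = 25.0, -0.003, 0.5

# Preselected depths in feet, as in the controlled-study example.
depths = [1000, 2000, 3000, 4000, 5000]

# epsilon_i: realizations of the random errors E_i (normal errors are an assumption).
errors = [random.gauss(0.0, sigma) for _ in depths]

# Observed temperatures y_i = beta0 + beta1 * x_i + epsilon_i.
temps = [beta0 + beta1 * x + eps for x, eps in zip(depths, errors)]

# Least-squares estimates b0 and b1 (deviation form of the formulas).
n = len(depths)
x_bar = sum(depths) / n
y_bar = sum(temps) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(depths, temps))
      / sum((x - x_bar) ** 2 for x in depths))
b0 = y_bar - b1 * x_bar

# e_i: residuals about the estimated line, which can be computed from the data.
residuals = [y - (b0 + b1 * x) for x, y in zip(depths, temps)]

for x, eps, e in zip(depths, errors, residuals):
    print(f"x = {x}: epsilon = {eps:+.3f}, e = {e:+.3f}")
```

The errors are measured from the theoretical line, which is known here only because the data were simulated; the residuals are measured from the fitted line and sum to zero up to rounding.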

1.5 Multiple Linear Regression Model


In the previous section we studied the simple linear regression model. This model
expresses the idea that the mean of a response variable Y depends on the value assumed
by a single predictor variable X. In this section I extend the concepts studied earlier to
cases in which the model becomes more complex. In particular, we distinguish between
two basic models: the polynomial model, in which the single predictor variable can appear
to a power greater than 1, and the multiple linear regression model, in which more than
one distinct variable can be used.

1.6 Least-Square Procedures For Model Fitting


In this section we develop the least-squares estimators for the parameters in both the
polynomial and multiple regression models. Before introducing these models specifically,
let us note that each of them is a special case of what is called the general linear model.
These are models in which the mean value of a response variable Y is assumed to depend
on the values assumed by one or more predictor variables. As in the case of simple linear
regression, the predictor variables $X_1, X_2, \ldots, X_k$ are not treated as random variables.
However, for a given set of numerical values $x_1, x_2, \ldots, x_k$ for these variables, the response
variable, denoted by $Y|x_1, x_2, \ldots, x_k$, is assumed to be a random variable. The general linear
model expresses the mean value of this conditional random variable as a function of
$x_1, x_2, \ldots, x_k$. The model takes the following form:

General linear model

$$\mu_{Y|x_1, x_2, \ldots, x_k} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \tag{1.11}$$

In this model $x_1, x_2, \ldots, x_k$ denote known real numbers; $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ denote unknown
parameters. Our main task is to estimate the values of these parameters from a data set.

Example 1.6.1
Suppose that we want to develop an equation with which we can predict the gasoline
mileage of an automobile based on its weight and the temperature at the time of
operation. We might pose the model

$$\mu_{Y|x_1, x_2} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

Here the response variable is Y, the mileage obtained. There are two independent or
predictor variables. These are $X_1$, the weight of the car, and $X_2$, the temperature. The
values assumed by these variables are denoted by $x_1$ and $x_2$, respectively. For example,
we might want to predict the gas mileage for a car that weighs 1.6 tons when it is being
driven in 85°F weather. Here $x_1 = 1.6$ and $x_2 = 85$. The unknown parameters in the model
are $\beta_0$, $\beta_1$, and $\beta_2$. Their values are to be estimated from the data gathered.
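
To make the estimation task in this example concrete, here is a minimal sketch of a least-squares fit for the general linear model, obtained by solving the normal equations. Everything numerical in it is hypothetical: the coefficients (40, −10, −0.05) and the five (weight, temperature) pairs are made up for illustration, and the mileage values are generated without error so that the fit should recover the chosen coefficients. The helper names `solve` and `fit_linear_model` are likewise inventions of this sketch.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column into the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_model(rows, ys):
    """Least-squares estimates (b0, b1, ..., bk) from the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 carries the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data: (weight in tons, temperature in degrees F).
rows = [(1.0, 60.0), (1.2, 85.0), (1.6, 70.0), (2.0, 50.0), (1.4, 95.0)]
true_betas = (40.0, -10.0, -0.05)  # assumed for illustration only
mileage = [true_betas[0] + true_betas[1] * x1 + true_betas[2] * x2 for x1, x2 in rows]

b0, b1, b2 = fit_linear_model(rows, mileage)
predicted = b0 + b1 * 1.6 + b2 * 85.0  # the case x1 = 1.6, x2 = 85 from the example
print(b0, b1, b2, predicted)
```

With real data the mileage values would contain random error, and the estimates would only approximate the underlying parameters.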

It is possible to treat the polynomial and multiple regression models simultaneously from
a mathematical standpoint. However, they differ enough in a practical sense to justify
considering them separately. We begin with a description of the general polynomial model.

Figure 1.8: Quadratic model: $\mu = \beta_0 + \beta_1 x + \beta_2 x^2$

1.7 Polynomial Model Of Degree p


The general polynomial regression model of degree p expresses the mean of the response
variable Y as a polynomial function of one predictor variable X. It takes the form

$$\mu_{Y|x} = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p \tag{1.12}$$

where p is a positive integer. If we let $x_1 = x, x_2 = x^2, x_3 = x^3, \ldots, x_p = x^p$, then
the model can be rewritten in the general linear form of equation 1.11.
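
The substitution $x_j = x^j$ can be sketched in a few lines. The helper names and the quadratic coefficients below are hypothetical, chosen only to show that the polynomial mean and the general linear form give the same value once the powers of x are treated as separate predictors.

```python
def power_row(x, p):
    """Map one predictor value x to the derived predictors (x, x**2, ..., x**p)."""
    return [x ** j for j in range(1, p + 1)]


def linear_mean(betas, row):
    """General linear form: beta0 + beta1 * x1 + ... + betak * xk."""
    return betas[0] + sum(b * v for b, v in zip(betas[1:], row))


def polynomial_mean(betas, x):
    """Polynomial form: beta0 + beta1 * x + ... + betap * x**p."""
    return sum(b * x ** j for j, b in enumerate(betas))


betas = [1.0, -2.0, 0.5]  # hypothetical coefficients of a quadratic model (p = 2)

for x in [0.0, 1.0, 2.5, -3.0]:
    row = power_row(x, len(betas) - 1)
    print(x, row, linear_mean(betas, row), polynomial_mean(betas, x))
```

Because the model is linear in the parameters, the same least-squares machinery used for several distinct predictors applies to the powers of a single predictor.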
Scattergrams are useful in determining when a polynomial model might be appropriate.
The pattern shown in Figure 1.8 suggests the quadratic model,

$$\mu_{Y|x} = \beta_0 + \beta_1 x + \beta_2 x^2 \tag{1.13}$$

and that of Figure 1.9 points to the cubic model,

$$\mu_{Y|x} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 \tag{1.14}$$

Once we decide that a polynomial is appropriate, we are faced with the problem of
estimating the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$. To apply the method of least squares, we first
express the polynomial in the form

$$Y|x = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p + E \tag{1.15}$$

where $Y|x$ denotes the response variable when the predictor variable assumes the value
x, and E denotes the random difference between $Y|x$ and its mean value,
$\mu_{Y|x} = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p$. A random sample of size n takes the form
$\{(x_1, Y|x_1), (x_2, Y|x_2), \ldots, (x_n, Y|x_n)\}$, where the first member of each ordered pair
denotes a real number, and the second a random variable. As in the case of simple linear
regression, it is customary to drop the conditional notation. The sample itself becomes
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where for each $i = 1, 2, \ldots, n$,

$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_p x_i^p + E_i \tag{1.16}$$

Figure 1.9: Cubic model: $\mu = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$

Once again, we must assume that the random errors $E_1, E_2, \dots, E_n$ are independent random variables, each with mean 0 and variance $\sigma^2$.
The estimated mean response, or estimated value of Y for a given x, is given by

$$\hat{y} = \hat{\mu}_{Y|x} = b_0 + b_1 x + b_2 x^2 + \dots + b_p x^p \tag{1.17}$$

where $b_0, b_1, \dots, b_p$ are the least-squares estimates for $\beta_0, \beta_1, \dots, \beta_p$. To find these estimates, we minimize the sum of squares of the residuals.
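
The least-squares step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fit_polynomial` and the data are hypothetical, and the normal equations are solved by plain Gauss-Jordan elimination rather than by any particular statistical package.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares estimates b0..bp for y = b0 + b1*x + ... + bp*x**p."""
    p = degree + 1
    # Each design-matrix row is [1, x, x**2, ..., x**degree].
    rows = [[x ** j for j in range(p)] for x in xs]
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    aug = [
        [sum(r[j] * r[k] for r in rows) for k in range(p)]
        + [sum(r[j] * y for r, y in zip(rows, ys))]
        for j in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[j][p] / aug[j][j] for j in range(p)]


# Hypothetical data lying exactly on y = 1 + 2x + 3x^2 (not from the survey).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
coeffs = fit_polynomial(xs, ys, degree=2)
print(coeffs)
```

Because the illustrative data contain no random error, the residual sum of squares is zero at the minimum and the estimates reproduce the coefficients used to generate the data.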

Chapter 2

Univariate Logistic Regression Model

Regression methods have become an integral component of any data analysis concerned with describing the relationship between a response variable and one or more explanatory variables. It is often the case that the outcome variable is discrete, taking on two or more possible values. Over the last decade the logistic regression model has become, in many fields, the standard method of analysis in this situation.

Logistic regression analysis is the most popular regression technique available for modeling dichotomous dependent variables. In this chapter I describe the univariate logistic regression model and several of its key features, particularly how an odds ratio can be estimated from it. We also demonstrate how logistic regression may be applied, using a real-life data set.

Before beginning a study of logistic regression it is important to understand that the goal of an analysis using this method is the same as that of any model-building technique used in statistics: to find the best fitting and most parsimonious, yet biologically reasonable, model to describe the relationship between an outcome (dependent or response) variable and a set of independent (predictor or explanatory) variables. These independent variables are often called covariates. The most common example of modeling, and one assumed to be familiar to the readers of this text, is the usual linear regression model where the outcome variable is assumed to be continuous.

2.1 Why Use Logistic Regression Rather Than Ordinary Linear Regression

In the past, statisticians used ordinary linear regression for their data analyses, even with a binary outcome, rather than logistic regression. As statistics has developed, however, most statisticians and psychologists now use logistic regression, for the following reasons.

(i) The outcome variable in logistic regression is binary or dichotomous.

(ii) If linear regression is used, the predicted values will become greater than one or less than zero if we move far enough along the x-axis. Such values are theoretically inadmissible.

(iii) One of the assumptions of regression is that the variance of Y is constant across values of X. This cannot be the case with a binary variable, because the variance is PQ, where P is the proportion of ones and Q = 1 − P.

Example 2.1: When 50 percent of the subjects have the outcome (P = 0.5), the variance is 0.25, its maximum value. As we move to more extreme values, the variance decreases. When P = 0.1, the variance (P × Q) is 0.1 × 0.9 = 0.09, so as P approaches 1 or zero, the variance approaches zero.
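
The arithmetic in Example 2.1 can be checked with a short sketch. The function name `binary_variance` is a hypothetical helper introduced only for this illustration.

```python
def binary_variance(p):
    """Variance P*Q of a 0/1 variable with Pr(Y = 1) = p, where Q = 1 - p."""
    return p * (1.0 - p)


# The variance peaks at P = 0.5 and shrinks toward zero at the extremes.
for p in (0.5, 0.3, 0.1, 0.01):
    print(f"P = {p:<4}  variance = {binary_variance(p):.4f}")
```

Since the variance depends on P, and P changes with x, the constant-variance assumption of ordinary linear regression cannot hold for a binary outcome.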

This difference between logistic and linear regression is reflected both in the choice of parametric model and in the assumptions. Once this difference is accounted for, the methods employed in an analysis using logistic regression follow the same general principles used in linear regression. Thus, the techniques used in linear regression analysis will motivate our approach to logistic regression. We illustrate both the similarities and differences between logistic regression and linear regression with an example.

Example 2.2

Table 2.1: Age and coronary heart disease (CHD) status of 100 subjects.
ID AGE AGRP CHD ID AGE AGRP CHD
1 20 1 0 21 34 2 0
2 23 1 0 22 34 2 0
3 24 1 0 23 34 2 1
4 25 1 0 24 34 2 0
5 25 1 1 25 34 2 0
6 26 1 0 26 35 3 0
7 26 1 0 27 35 3 0
8 28 1 0 28 35 3 0
9 28 1 0 29 36 3 1
10 29 1 0 30 36 3 0
11 30 2 0 31 37 3 0
12 30 2 0 32 37 3 1
13 30 2 0 33 37 3 0
14 30 2 0 34 38 3 0
15 30 2 0 35 38 3 0
16 30 2 1 36 39 3 0
17 32 2 0 37 39 3 1
18 30 2 0 38 40 4 0
19 33 2 0 39 40 4 1
20 33 2 0 40 41 4 0

41 41 4 0 71 53 6 1
42 42 4 0 72 54 6 1
43 42 4 0 73 55 6 1
44 42 4 0 74 55 7 1
45 42 4 1 75 55 7 0
46 43 4 0 76 56 7 1
47 43 4 0 77 56 7 1
48 43 4 1 78 56 7 1
49 44 4 0 80 57 7 1
50 44 4 0 81 57 7 0
52 44 4 1 82 57 7 0
53 45 5 1 83 57 7 1
54 45 5 0 84 57 7 1
55 46 5 1 85 57 7 1
56 46 5 0 86 58 7 1
57 47 5 1 87 58 7 0
58 47 5 0 88 58 7 1
59 47 5 0 89 59 7 1
60 48 5 1 90 59 7 1
61 48 5 0 91 60 8 0
62 48 5 1 92 60 8 1
63 49 5 1 93 61 8 1
64 49 5 0 94 62 8 0
65 49 5 0 95 62 8 1
66 50 6 1 96 63 8 1
67 51 6 0 97 64 8 1
68 52 6 1 98 64 8 1
69 52 6 0 99 65 8 1
70 53 6 0 100 69 8 1
Table 2.1 lists age in years (AGE) and presence or absence of evidence of significant coronary heart disease (CHD) for 100 subjects selected to participate in a study. The table also contains an identifier variable (ID) and an age group variable (AGRP). The outcome variable is CHD, which is coded with a value of zero to indicate that CHD is absent, or 1 to indicate that it is present in the individual.

It is of interest to explore the relationship between age and the presence or absence of CHD in this study population. Had our outcome variable been continuous rather than binary, we probably would begin by forming a scatterplot of the outcome versus the independent variable. We would use this scatterplot to provide an impression of the nature and strength of any relationship between the outcome and the independent variable. A scatterplot of the data in Table 2.1 is given in Figure 2.1.

Figure 2.1: Scatterplot of CHD by AGE for 100 subjects.

In this scatterplot all points fall on one of two parallel lines representing the absence
of CHD (y = 0) and the presence of CHD (y = 1). There is some tendency for the
individuals with no evidence of CHD to be younger than those with evidence of CHD.
While this plot does depict the dichotomous nature of the outcome variable quite clearly,
it does not provide a clear picture of the nature of the relationship between CHD and age.
Another problem with Figure 2.1 is that the variability in CHD at all ages is large. This makes it difficult to describe the functional relationship between age and CHD.

One common method of removing some variation while still maintaining the structure of the relationship between the outcome and the independent variable is to create intervals for the independent variable and compute the mean of the outcome variable within each group. Here this strategy is carried out by using the age group variable, AGRP, which categorizes the age data of Table 2.1. Table 2.2 contains, for each age group, the frequency of occurrence of each outcome as well as the mean (or proportion with CHD present) for each group.

Now we can plot the proportion of individuals with CHD against age for each age interval (Figure 2.2). By examining this plot, a clearer picture of the relationship begins to emerge. It appears that as age increases, the proportion of individuals with evidence of CHD increases.

While this provides considerable insight into the relationship between CHD and age in this study, a functional form for this relationship needs to be described. The plot in Figure 2.2 is similar to what one might obtain if this same process of grouping and averaging were performed in a linear regression. We will note two important differences.

Age Group     n    Absent   Present   Mean (Proportion)
20-29        10       9        1           0.10
30-34        15      13        2           0.13
35-39        12       9        3           0.25
40-44        15      10        5           0.33
45-49        13       7        6           0.46
50-54         8       3        5           0.63
55-59        17       4       13           0.76
60-69        10       2        8           0.80
Total       100      57       43           0.43

Table 2.2: Frequency table of age group by CHD

Figure 2.2: Plot of the percentage of subjects with CHD in each age group.

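
The group means in the frequency table of age group by CHD can be recomputed directly from the absent and present counts. The sketch below is illustrative only; the dictionary layout and the name `chd_proportions` are assumptions made for this example.

```python
# Counts (absent, present) per age group, copied from the frequency table.
age_groups = {
    "20-29": (9, 1), "30-34": (13, 2), "35-39": (9, 3), "40-44": (10, 5),
    "45-49": (7, 6), "50-54": (3, 5), "55-59": (4, 13), "60-69": (2, 8),
}


def chd_proportions(groups):
    """Return {age group: proportion of subjects with CHD present}."""
    return {label: present / (absent + present)
            for label, (absent, present) in groups.items()}


proportions = chd_proportions(age_groups)
total_present = sum(present for _, present in age_groups.values())
total_n = sum(absent + present for absent, present in age_groups.values())
overall = total_present / total_n

for label, prop in proportions.items():
    print(f"{label}: {prop:.2f}")
print(f"Total: {overall:.2f}")
```

Each group mean is simply the proportion of ones in that group, which is why it can serve as an estimate of the conditional mean of a 0/1 outcome.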

The first difference concerns the nature of the relationship between the outcome and independent variable. In any regression problem the key quantity is the mean value of the outcome variable, given the value of the independent variable. This quantity is called the conditional mean and will be expressed as E(Y|x), where Y denotes the outcome variable and x denotes a value of the independent variable. The quantity E(Y|x) is read "the expected value of Y, given the value x". In linear regression, we assume that this mean may be expressed as an equation linear in x,

$$E(Y|x) = \beta_0 + \beta_1 x \tag{2.1}$$

This expression implies that it is possible for E(Y|x) to take on any value as x ranges between −∞ and ∞.

The column labeled "Mean" in Table 2.2 provides an estimate of E(Y|x). We will assume, for purposes of exposition, that the estimated values plotted in Figure 2.2 are close enough to the true values of E(Y|x) to provide a reasonable assessment of the relationship between CHD and AGE. With dichotomous data, the conditional mean must be greater than or equal to zero and less than or equal to 1 [0 ≤ E(Y|x) ≤ 1].

The change in E(Y|x) per unit change in x becomes progressively smaller as the conditional mean gets closer to 0 or 1. The curve is said to be S-shaped. It resembles a plot of the cumulative distribution of a random variable. It should not seem surprising that some well-known cumulative distributions have been used to provide a model for E(Y|x) in the case when Y is dichotomous. The model we will use is that of the logistic distribution.

Many distribution functions have been proposed for use in the analysis of a dichotomous outcome variable. Cox and Snell (1989) discuss some of these. There are two primary reasons for choosing the logistic distribution.

First, from a mathematical point of view, it is an extremely flexible and easily used function, and second, it lends itself to a clinically meaningful interpretation.

2.2 The Simple Logistic Model


Logistic regression is a mathematical modeling approach that can be used to describe the relationship of several variables X to a dichotomous dependent variable Y, where Y is typically coded as 1 or 0 for its two possible categories. The logistic model describes the expected value of Y (i.e. E(Y)) in terms of the following "logistic formula":

$$E(Y) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \tag{2.2}$$

or, equivalently,

$$E(Y) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \tag{2.3}$$
For a (0, 1) random variable such as Y, it follows from basic statistical principles about expected values that E(Y) is equivalent to the probability Pr(Y = 1):

$$E(Y|x) = [0 \times \Pr(Y = 0)] + [1 \times \Pr(Y = 1)] = \Pr(Y = 1) \tag{2.4}$$

The logistic model can therefore be written in a form that describes the probability of occurrence of one of the two possible outcomes of Y, as follows:

$$P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \tag{2.5}$$
In order to simplify notation, we use the quantity,

π(x) = E(Y |x) (2.6)

to represent the conditional mean of Y given x when the logistic distribution is used. The specific form of the logistic regression model we use is

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \tag{2.7}$$

A transformation of π(x) that is central to my study of logistic regression is the logit transformation.

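This transformation is defined, in terms of π(x), as

$$g(x) = \ln\left[\frac{\pi(x)}{1 - \pi(x)}\right] \tag{2.8}$$

Applying equation 2.7, we have

$$g(x) = \ln\left[\frac{\dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}}{\dfrac{1}{1 + e^{\beta_0 + \beta_1 x}}}\right] = \ln\left[e^{\beta_0 + \beta_1 x}\right] = \beta_0 + \beta_1 x \tag{2.9}$$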
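
The relationship between the logistic function and the logit transformation can be illustrated numerically. In the sketch below the function names and the coefficient values are arbitrary assumptions chosen for the example; they are not estimates from the survey data.

```python
import math


def logistic(x, b0, b1):
    """pi(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def logit(p):
    """g = ln[p / (1 - p)], defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Arbitrary illustrative coefficients.
b0, b1 = 0.2, 0.4
for x in (-5.0, 0.0, 3.0, 10.0):
    p = logistic(x, b0, b1)
    # Applying the logit to pi(x) recovers the linear predictor b0 + b1*x.
    print(x, round(p, 4), round(logit(p), 4), round(b0 + b1 * x, 4))
```

The probability stays strictly between 0 and 1 for every x, while its logit is linear in x and unbounded.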

2.3 The Importance Of The Logistic Transformation


(1) The logit transformation g(x) has many of the desirable properties of a linear re-
gression model.

(2) The logit, g(x), is linear in its parameters, may be continuous, and may range from −∞ to +∞, depending on the range of x.

(3) A further difference between the linear and logistic regression models concerns the conditional distribution of the outcome variable. In the linear regression model we assume that an observation of the outcome variable may be expressed as

$$Y = E(Y|x) + \varepsilon$$

The quantity ε is called the error and expresses an observation's deviation from the conditional mean. The most common assumption is that ε follows a normal distribution with mean zero and some variance that is constant across levels of the independent variable. It follows that the conditional distribution of the outcome variable given x will be normal with mean E(Y|x) and a variance that is constant. This is not the case with a dichotomous outcome variable. In this situation we may express the value of the outcome variable given x as y = π(x) + ε. Here the quantity ε may assume one of two possible values. If y = 1 then ε = 1 − π(x) with probability π(x), and if y = 0 then ε = −π(x) with probability 1 − π(x). Thus, ε has a distribution with mean zero and variance equal to π(x)[1 − π(x)]. That is, the conditional distribution of the outcome variable follows a binomial distribution with probability given by the conditional mean, π(x).
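
The stated mean and variance of the error term follow from its two-point distribution, and can be verified with a small sketch. The helper name `error_moments` is hypothetical.

```python
def error_moments(pi):
    """Mean and variance of the error for a 0/1 outcome with Pr(Y = 1) = pi.

    The error equals 1 - pi with probability pi (when y = 1)
    and -pi with probability 1 - pi (when y = 0).
    """
    outcomes = [(1.0 - pi, pi), (-pi, 1.0 - pi)]  # (value, probability)
    mean = sum(value * prob for value, prob in outcomes)
    variance = sum((value - mean) ** 2 * prob for value, prob in outcomes)
    return mean, variance


for pi in (0.1, 0.3, 0.5, 0.9):
    mean, variance = error_moments(pi)
    print(pi, round(mean, 12), round(variance, 4), round(pi * (1 - pi), 4))
```

For every value of π(x) the mean is zero and the variance equals π(x)[1 − π(x)], so the error variance changes with x rather than staying constant.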

Summary:

We have seen that in a regression analysis when the outcome variable is dichotomous:
(1) The conditional mean of the regression equation must be formulated to be bounded between zero and 1. We have stated that the logistic regression model, π(x), given in equation 2.7, satisfies this constraint.

(2) The binomial, not the normal, distribution describes the distribution of the errors and will be the statistical distribution upon which the analysis is based.

(3) The principles that guide an analysis using linear regression will also guide us in logistic regression.

2.4 Fitting The Logistic Regression Model


Suppose we have a sample of n independent observations of the pair $(x_i, y_i)$, $i = 1, 2, \dots, n$, where $y_i$ denotes the value of a dichotomous outcome variable and $x_i$ is the value of the independent variable for the ith subject. Furthermore, assume that the outcome variable has been coded as 0 or 1, representing the absence or the presence of the characteristic, respectively. This coding for a dichotomous outcome is used throughout the text. To fit the logistic regression model in equation 2.7 to a set of data requires that we estimate the values of $\beta_0$ and $\beta_1$, the unknown parameters.

In linear regression, the method used most often for estimating unknown parameters is least squares. In that method we choose those values of $\beta_0$ and $\beta_1$ which minimize the sum of squared deviations of the observed values of Y from the predicted values based upon the model. Under the usual assumptions for linear regression, the method of least squares yields estimators with a number of desirable statistical properties. Unfortunately, when the method of least squares is applied to a model with a dichotomous outcome, the estimators no longer have these same properties.

2.5 Fitting The Logistic Regression Model By Using Maximum Likelihood Method

The general method of estimation that leads to the least squares function under the linear regression model (when the error terms are normally distributed) is called maximum likelihood. This method will provide the foundation for our approach to estimation with the logistic regression model. In a very general sense, the method of maximum likelihood yields values for the unknown parameters which maximize the probability of obtaining the observed set of data.

In order to apply this method, we must first construct a function called the likelihood function. This function expresses the probability of the observed data as a function of the unknown parameters. The maximum likelihood estimators of these parameters are chosen to be those values that maximize this function. We now describe how to find these values for the logistic regression model.

If Y is coded as 0 or 1, then the expression for π(x) given in equation 2.7 provides (for an arbitrary value of $\beta = (\beta_0, \beta_1)$) the conditional probability that Y is equal to 1 given x. This will be denoted P(Y = 1|x). It follows that the quantity 1 − π(x) gives the conditional probability that Y is equal to zero given x, P(Y = 0|x). Thus, for those pairs $(x_i, y_i)$ where $y_i = 1$ the contribution to the likelihood function is $\pi(x_i)$, and for those pairs where $y_i = 0$ the contribution to the likelihood function is $1 - \pi(x_i)$, where the quantity $\pi(x_i)$ denotes the value of π(x) computed at $x_i$:

$y_i = 1$: contribution to the likelihood function = $\pi(x_i)$

$y_i = 0$: contribution to the likelihood function = $1 - \pi(x_i)$

A convenient way to express the contribution to the likelihood function for the pair $(x_i, y_i)$ is through the expression

$$\pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i} \tag{2.10}$$

Since the observations are assumed to be independent, the likelihood function is obtained as the product of the terms given in expression 2.10, as follows:

$$l(\beta) = \prod_{i=1}^{n} \pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i} \tag{2.11}$$

The principle of maximum likelihood states that we use as our estimate of β the value which maximizes the expression in 2.11. However, it is easier mathematically to work with the log of equation 2.11. This expression, the log likelihood, is defined as

$$L(\beta) = \ln[l(\beta)] = \sum_{i=1}^{n} \left\{ y_i \ln[\pi(x_i)] + (1 - y_i) \ln[1 - \pi(x_i)] \right\} \tag{2.12}$$
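
The log likelihood just defined translates almost directly into code. The following sketch is illustrative only: the function names and the small data set are hypothetical, and the data are not taken from the CHD table.

```python
import math


def pi_x(x, b0, b1):
    """Logistic probability pi(x) for coefficients b0 and b1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def log_likelihood(xs, ys, b0, b1):
    """Sum over subjects of y*ln(pi) + (1 - y)*ln(1 - pi)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = pi_x(x, b0, b1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Hypothetical data: ages and a 0/1 outcome.
xs = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

# With both coefficients zero every pi(x) is 0.5, so L = n * ln(0.5).
print(log_likelihood(xs, ys, 0.0, 0.0))
# A positive slope fits these data better, so the log likelihood is larger.
print(log_likelihood(xs, ys, -4.0, 0.1))
```

Maximum likelihood estimation amounts to searching for the pair of coefficients that makes this function as large as possible.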


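Equivalently, collecting the terms in $y_i$,

$$L(\beta) = \ln[l(\beta)] = \sum_{i=1}^{n} \left\{ y_i \ln\left[\frac{\pi(x_i)}{1 - \pi(x_i)}\right] + \ln[1 - \pi(x_i)] \right\} \tag{2.13}$$

By equation 2.7,

$$\pi(x_i) = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}$$

so that

$$1 - \pi(x_i) = \frac{1}{1 + e^{\beta_0 + \beta_1 x_i}} \tag{2.14}$$

Dividing equation 2.7 by equation 2.14 gives

$$\frac{\pi(x_i)}{1 - \pi(x_i)} = e^{\beta_0 + \beta_1 x_i} \tag{2.15}$$

and taking logarithms of both sides,

$$\ln\left[\frac{\pi(x_i)}{1 - \pi(x_i)}\right] = \beta_0 + \beta_1 x_i \tag{2.16}$$

Substituting equations 2.14 and 2.16 into equation 2.13,

$$L(\beta) = \ln[l(\beta)] = \sum_{i=1}^{n} \left\{ y_i(\beta_0 + \beta_1 x_i) + \ln\left[\frac{1}{1 + e^{\beta_0 + \beta_1 x_i}}\right] \right\} \tag{2.17}$$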

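Differentiating with respect to $\beta_0$,

$$\frac{\partial}{\partial \beta_0} \ln[l(\beta)] = \sum_{i=1}^{n} \left\{ y_i - \frac{1}{1 + e^{\beta_0 + \beta_1 x_i}} \frac{\partial}{\partial \beta_0}\left(1 + e^{\beta_0 + \beta_1 x_i}\right) \right\} = \sum_{i=1}^{n} \left\{ y_i - \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \right\}$$

$$\frac{\partial}{\partial \beta_0} \ln[l(\beta)] = \sum_{i=1}^{n} [y_i - \pi(x_i)] \tag{2.18}$$

Differentiating with respect to $\beta_1$,

$$\frac{\partial}{\partial \beta_1} \ln[l(\beta)] = \sum_{i=1}^{n} \left\{ y_i x_i - \frac{1}{1 + e^{\beta_0 + \beta_1 x_i}} \frac{\partial}{\partial \beta_1}\left(1 + e^{\beta_0 + \beta_1 x_i}\right) \right\} = \sum_{i=1}^{n} \left\{ y_i x_i - x_i \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \right\} = \sum_{i=1}^{n} x_i [y_i - \pi(x_i)]$$

To find the value of β that maximizes L(β), we set both derivatives equal to zero. The resulting equations, known as the likelihood equations, are:

$$\sum_{i=1}^{n} [y_i - \pi(x_i)] = 0 \tag{2.19}$$

$$\sum_{i=1}^{n} x_i [y_i - \pi(x_i)] = 0 \tag{2.20}$$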
The value of β given by the solution to equations 2.19 and 2.20 is called the maximum likelihood estimate and is denoted $\hat{\beta}$. In general, the use of the symbol "^" denotes the maximum likelihood estimate of the respective quantity. For example, $\hat{\pi}(x_i)$ is the maximum likelihood estimate of $\pi(x_i)$. This quantity provides an estimate of the conditional probability that Y is equal to 1, given that x is equal to $x_i$. As such, it represents the fitted or predicted value for the logistic regression model. An interesting consequence of equation 2.19 is that

$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{\pi}(x_i) \tag{2.21}$$

That is, the sum of the observed values of y is equal to the sum of the predicted (expected) values.
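
The likelihood equations have no closed-form solution and must be solved iteratively. The sketch below uses Newton-Raphson iteration, which is an assumption of this example (the text does not say which algorithm the software uses); the function names and the data are hypothetical.

```python
import math


def fit_logistic(xs, ys, iterations=25):
    """Solve the two likelihood equations by Newton-Raphson iteration."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        # Left-hand sides of the two likelihood equations (the score).
        u0 = sum(y - p for y, p in zip(ys, ps))
        u1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # Information matrix built from the weights p * (1 - p).
        ws = [p * (1.0 - p) for p in ps]
        i00 = sum(ws)
        i01 = sum(w * x for w, x in zip(ws, xs))
        i11 = sum(w * x * x for w, x in zip(ws, xs))
        det = i00 * i11 - i01 * i01
        # Newton step: beta <- beta + inverse(information) * score.
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1


def fitted_probabilities(xs, b0, b1):
    return [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]


# Hypothetical data (not the CHD data): ages and a 0/1 outcome.
xs = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
b0_hat, b1_hat = fit_logistic(xs, ys)
fitted = fitted_probabilities(xs, b0_hat, b1_hat)
print(b0_hat, b1_hat)
print(sum(ys), sum(fitted))  # the two sums agree at the solution
```

At convergence both likelihood equations are satisfied, so the sum of the observed outcomes matches the sum of the fitted probabilities.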

As an example, consider the data given in Table 2.1. Use of the statistical software package Minitab, with AGE as the independent variable, produced the output in Table 2.3. The maximum likelihood estimates of $\beta_0$ and $\beta_1$ are thus seen to be $\hat{\beta}_0 = -5.309$ and $\hat{\beta}_1 = 0.111$. The fitted values are given by the equation

$$\hat{\pi}(x) = \frac{e^{-5.309 + 0.111 \times \text{AGE}}}{1 + e^{-5.309 + 0.111 \times \text{AGE}}} \tag{2.22}$$

and the estimated logit, $\hat{g}(x)$, is given by the equation

$$\hat{g}(x) = -5.309 + 0.111 \times \text{AGE} \tag{2.23}$$

The log likelihood of the fitted model is the value of equation 2.12 computed using $\hat{\beta}_0$ and $\hat{\beta}_1$.

Variable    Coeff     Std.Err    Z       P > |Z|
AGE          0.111    0.0241      4.61   0.001
Constant    -5.309    1.1337     -4.68   0.001

Table 2.3: Results of fitting the logistic regression model to the data in Table 2.1

2.6 Testing For The Significance Of The Coefficients


The general method for assessing significance of variables is easily illustrated in the linear regression model, and its use there will motivate the approach used for logistic regression. A comparison of the two approaches will highlight the differences between modeling continuous and dichotomous response variables. In linear regression, the assessment of the significance of the slope coefficient is approached by forming what is referred to as an analysis of variance table. This table partitions the total sum of squared deviations of observations about their mean into two parts:
(1) The sum of squared deviations of observations about the regression line, SSE (or residual sum of squares).
(2) The sum of squares of predicted values, based on the regression model, about the mean of the dependent variable, SSR (or due-to-regression sum of squares).
In linear regression, the comparison of observed and predicted values is based on the square of the distance between the two. If $y_i$ denotes the observed value and $\hat{y}_i$ denotes the predicted value for the ith individual under the model, then the statistic used to evaluate this comparison is

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{2.24}$$

Under the model not containing the independent variable in question, the only parameter is $\beta_0$, and $\hat{\beta}_0 = \bar{y}$, the mean of the response variable. In this case, $\hat{y}_i = \bar{y}$ and SSE is equal to the total variance. When we include the independent variable in the model, any decrease in SSE will be due to the fact that the slope coefficient for the independent variable is not zero. The change in the value of SSE is due to the regression source of variability, denoted SSR. That is,

$$SSR = \left[\sum_{i=1}^{n} (y_i - \bar{y})^2\right] - \left[\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\right] \tag{2.25}$$

2.7 Testing For The Significance Of The Coefficients For The Logistic Regression

The guiding principle with logistic regression is the same. In logistic regression, comparison of observed to predicted values is based on the log likelihood function defined in equation 2.12:

$$L(\beta) = \ln[l(\beta)] = \sum_{i=1}^{n} \left\{ y_i \ln[\pi(x_i)] + (1 - y_i) \ln[1 - \pi(x_i)] \right\} \tag{2.26}$$

To better understand this comparison, it is helpful conceptually to think of an observed value of the response variable as also being a predicted value resulting from a saturated model. A saturated model is one that contains as many parameters as there are data points. (A simple example of a saturated model is fitting a linear regression model when there are only two data points, n = 2.)

The comparison of observed to predicted values using the likelihood function is based on the following expression:

$$D = -2 \ln\left[\frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}\right] \tag{2.27}$$

The quantity inside the large brackets in the expression above is called the likelihood ratio. Using minus twice its log is necessary to obtain a quantity whose distribution is known and can therefore be used for hypothesis testing purposes. Such a test is called the likelihood ratio test. Using equation 2.12, the log of the ratio is the difference of the two log likelihoods,

$$D = -2\left\{ \sum_{i=1}^{n} \left[ y_i \ln(\hat{\pi}_i) + (1 - y_i)\ln(1 - \hat{\pi}_i) \right] - \sum_{i=1}^{n} \left[ y_i \ln(y_i) + (1 - y_i)\ln(1 - y_i) \right] \right\}$$

where $\hat{\pi}_i = \hat{\pi}(x_i)$, so that

$$D = -2 \sum_{i=1}^{n} \left[ y_i \ln\left(\frac{\hat{\pi}_i}{y_i}\right) + (1 - y_i) \ln\left(\frac{1 - \hat{\pi}_i}{1 - y_i}\right) \right] \tag{2.28}$$
The statistic D in equation 2.28 is called the deviance by some authors, and plays a central role in some approaches to assessing goodness of fit. The deviance for logistic regression plays the same role that the residual sum of squares plays in linear regression.
Furthermore, in a setting such as the one shown in Table 2.1, where the values of the outcome variable are either 0 or 1, the likelihood of the saturated model is 1:

$$l(\text{saturated model}) = \prod_{i=1}^{n} y_i^{y_i} \times (1 - y_i)^{(1 - y_i)} = 1 \tag{2.29}$$

so that

$$D = -2 \ln(\text{likelihood of the fitted model}) \tag{2.30}$$

For the purpose of assessing the significance of an independent variable, we compare the value of D with and without the independent variable in the equation. The change in D due to the inclusion of the independent variable in the model is obtained as

$$G = D(\text{model without the variable}) - D(\text{model with the variable}) \tag{2.31}$$

The statistic G plays the same role in logistic regression as the numerator of the partial F test does in linear regression. Because the likelihood of the saturated model is common to both values of D being differenced to compute G, it can be expressed as

$$G = -2 \ln\left[\frac{\text{likelihood without the variable}}{\text{likelihood with the variable}}\right] \tag{2.32}$$

For the model without the variable, the only parameter is $\beta_0$ and the likelihood is

$$l(\text{without variable}) = \prod_{i=1}^{n} \left(\frac{e^{\hat{\beta}_0}}{1 + e^{\hat{\beta}_0}}\right)^{y_i} \left(1 - \frac{e^{\hat{\beta}_0}}{1 + e^{\hat{\beta}_0}}\right)^{1 - y_i} = \prod_{i=1}^{n} \left(\frac{e^{\hat{\beta}_0}}{1 + e^{\hat{\beta}_0}}\right)^{y_i} \left(\frac{1}{1 + e^{\hat{\beta}_0}}\right)^{1 - y_i} = \prod_{i=1}^{n} \frac{(e^{\hat{\beta}_0})^{y_i}}{1 + e^{\hat{\beta}_0}}$$

so that

$$\ln\{l(\text{without variable})\} = \sum_{i=1}^{n} \left[ y_i \hat{\beta}_0 - \ln(1 + e^{\hat{\beta}_0}) \right]$$

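Differentiating with respect to $\hat{\beta}_0$,

$$\frac{\partial}{\partial \beta_0} \ln\{l(\text{without variable})\} = \sum_{i=1}^{n} \left[ y_i - \frac{e^{\hat{\beta}_0}}{1 + e^{\hat{\beta}_0}} \right] \tag{2.33}$$

Setting this derivative equal to zero gives

$$\sum_{i=1}^{n} y_i = n \left(\frac{e^{\hat{\beta}_0}}{1 + e^{\hat{\beta}_0}}\right)$$

Writing $n_1 = \sum_{i=1}^{n} y_i$ for the number of subjects with y = 1,

$$\frac{n_1}{n} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{e^{\hat{\beta}_0}}{1 + e^{\hat{\beta}_0}} \tag{2.34}$$

Similarly, with

$$n_0 = \sum_{i=1}^{n} (1 - y_i) \tag{2.35}$$

$$\frac{n_0}{n} = \frac{\sum_{i=1}^{n} (1 - y_i)}{n} = \frac{1}{1 + e^{\hat{\beta}_0}} \tag{2.36}$$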

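In this case, the value of G is

$$G = -2 \ln\left[\frac{\left(\dfrac{n_1}{n}\right)^{n_1} \left(\dfrac{n_0}{n}\right)^{n_0}}{\prod_{i=1}^{n} \hat{\pi}_i^{y_i} (1 - \hat{\pi}_i)^{(1 - y_i)}}\right] \tag{2.37}$$

so the test statistic can be written as

$$G = 2\left\{ \sum_{i=1}^{n} \left[ y_i \ln(\hat{\pi}_i) + (1 - y_i)\ln(1 - \hat{\pi}_i) \right] - \left[ n_1 \ln(n_1) + n_0 \ln(n_0) - n \ln(n) \right] \right\} \tag{2.38}$$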

Hypothesis:
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
Under the hypothesis that $\beta_1$ is equal to zero, the statistic G follows a chi-square distribution with 1 degree of freedom. Additional mathematical assumptions are also needed; however, for the above case they are rather nonrestrictive and involve having a sufficiently large sample size n.
We use the symbol $\chi^2(\nu)$ to denote a chi-square random variable with ν degrees of freedom.

As an example, we consider the model fit to the data in Table 2.1, whose estimated coefficients are given in Table 2.3 and whose log likelihood is −53.677. For these data, $n_1 = 43$ and $n_0 = 57$; thus evaluating G, as shown in equation 2.38, yields

G = 2{−53.677 − [43 ln(43) + 57 ln(57) − 100 ln(100)]}
  = 2[−53.677 − (−68.331)]
  = 29.31

Using this notation, the p-value associated with this test is $P[\chi^2(1) > 29.31] < 0.001$. Thus, we have convincing evidence that AGE is a significant variable in predicting CHD. This is merely a statement of the statistical evidence for this variable.
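
The value of G in this example can be reproduced from the reported log likelihood and the two outcome counts. The sketch below is illustrative: the function names are hypothetical, and the upper-tail probability uses the identity between a chi-square variable with 1 degree of freedom and the square of a standard normal variable.

```python
import math


def likelihood_ratio_statistic(log_lik_fitted, n1, n0):
    """G = 2 * {L(fitted) - [n1*ln(n1) + n0*ln(n0) - n*ln(n)]}."""
    n = n1 + n0
    log_lik_without = n1 * math.log(n1) + n0 * math.log(n0) - n * math.log(n)
    return 2.0 * (log_lik_fitted - log_lik_without)


def chi_square_1df_upper_tail(g):
    """P[chi-square(1) > g], using chi-square(1) = Z**2 for standard normal Z."""
    return math.erfc(math.sqrt(g / 2.0))


# Values reported in the text: log likelihood -53.677, n1 = 43, n0 = 57.
G = likelihood_ratio_statistic(-53.677, n1=43, n0=57)
p_value = chi_square_1df_upper_tail(G)
print(round(G, 2), p_value < 0.001)
```

Only the fitted log likelihood and the counts of ones and zeros are needed, because the model without the variable has the closed-form solution derived above.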

Other important factors to consider before concluding that the variable is clinically important would include the appropriateness of the fitted model, as well as the inclusion of other potentially important variables.

Wald test
The assumptions needed for the Wald test and the Score test are the same as those of the likelihood ratio test in equation 2.38. A more complete discussion of these tests and their assumptions may be found in Rao (1973).

The Wald test statistic is obtained by comparing the maximum likelihood estimate of the slope parameter, $\hat{\beta}_1$, to an estimate of its standard error.

Hypothesis:
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
The resulting ratio, under the hypothesis that $\beta_1 = 0$, will follow a standard normal distribution.

Test statistic:

$$W = \frac{\hat{\beta}_1}{\widehat{SE}(\hat{\beta}_1)} \tag{2.39}$$

For example, the Wald test for the logistic regression model in Table 2.3 is provided in the column headed Z and is

$$W = \frac{0.111}{0.0241} = 4.61$$

and the two-tailed P-value, from Table 2.3, is P(|Z| > 4.61) < 0.001, where Z denotes a random variable following the standard normal distribution. Hauck and Donner (1977) examined the performance of the Wald test and found that it behaved in an aberrant manner, often failing to reject the null hypothesis when the coefficient was significant. They recommended that the likelihood ratio test be used.
Jennings (1986) has also looked at the adequacy of inferences in logistic regression based on Wald statistics. His conclusions are similar to those of Hauck and Donner. Both the likelihood ratio test, G, and the Wald test, W, require the computation of the maximum likelihood estimate for $\beta_1$.
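
The Wald statistics in Table 2.3 can be reproduced from the reported coefficients and standard errors. This is a minimal sketch; the function name `wald_test` is hypothetical, and the two-tailed P-value is computed from the standard normal distribution via the complementary error function.

```python
import math


def wald_test(estimate, std_err):
    """Return (W, two-tailed P-value) for H0: coefficient = 0."""
    w = estimate / std_err
    p_value = math.erfc(abs(w) / math.sqrt(2.0))  # P(|Z| > |w|)
    return w, p_value


# Coefficients and standard errors as reported in Table 2.3.
w_age, p_age = wald_test(0.111, 0.0241)
w_const, p_const = wald_test(-5.309, 1.1337)
print(round(w_age, 2), p_age < 0.001)
print(round(w_const, 2), p_const < 0.001)
```

Unlike the likelihood ratio test, this calculation needs only the fitted coefficient and its standard error, which is why most software prints it by default.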

Score test
In the univariate case, this test is based on the conditional distribution of the derivative in equation 2.19. In this case, we can write down an expression for the score test. The test uses the value of equation 2.20 computed using $\beta_0 = \ln(n_1/n_0)$ and $\beta_1 = 0$. As noted earlier, under these parameter values, $\hat{\pi} = n_1/n = \bar{y}$. Thus, the left-hand side of equation 2.20 becomes $\sum_{i=1}^{n} x_i (y_i - \bar{y})$. It may be shown that the estimated variance is $\bar{y}(1 - \bar{y}) \sum_{i=1}^{n} (x_i - \bar{x})^2$. The test statistic for the score test (ST) is

$$ST = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sqrt{\bar{y}(1 - \bar{y}) \sum_{i=1}^{n} (x_i - \bar{x})^2}} \tag{2.40}$$

Hypothesis:
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$

As an example of the Score test, consider the model fit to the data in Table 2.1. The value of the test statistic for this example is

$$ST = \frac{296.66}{\sqrt{3333.7342}} = 5.14$$

and the two-tailed P-value is P(|Z| > 5.14) < 0.001.
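
The score statistic requires no iterative fitting, as the following sketch shows. The function name `score_test` and the small data set are hypothetical; the last lines simply redo the arithmetic of the example from the two reported quantities.

```python
import math


def score_test(xs, ys):
    """ST = sum x_i (y_i - ybar) / sqrt( ybar (1 - ybar) sum (x_i - xbar)^2 )."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum(x * (y - y_bar) for x, y in zip(xs, ys))
    variance = y_bar * (1.0 - y_bar) * sum((x - x_bar) ** 2 for x in xs)
    return numerator / math.sqrt(variance)


# Hypothetical data (not the CHD data): ages and a 0/1 outcome.
xs = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
st_demo = score_test(xs, ys)

# Arithmetic of the example in the text, from the reported numerator and variance.
st_text = 296.66 / math.sqrt(3333.7342)
p_text = math.erfc(abs(st_text) / math.sqrt(2.0))
print(round(st_demo, 3), round(st_text, 2), p_text < 0.001)
```

Only sample means and sums of squares are involved, which is the practical attraction of the score test over the likelihood ratio and Wald tests.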

2.8 Confidence interval estimation


The basis for construction of the interval estimators is the same statistical theory we used to formulate the tests for significance of the model. For the slope coefficient,

$$\hat{\beta}_1 \sim N(\beta_1, \mathrm{var}(\hat{\beta}_1))$$

$$\frac{\hat{\beta}_1 - \beta_1}{\sqrt{\mathrm{var}(\hat{\beta}_1)}} \sim N(0, 1)$$

where $\widehat{SE}(\hat{\beta}_1)$ is the positive square root of the variance estimator. Hence

$$\Pr\left(-z_{1-\alpha/2} \le \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\mathrm{var}(\hat{\beta}_1)}} \le z_{1-\alpha/2}\right) = 1 - \alpha$$

The end points of a 100(1 − α)% confidence interval for the slope coefficient are

$$\hat{\beta}_1 \pm z_{1-\alpha/2}\, \widehat{SE}(\hat{\beta}_1) \tag{2.41}$$

and, for the intercept, they are

$$\hat{\beta}_0 \pm z_{1-\alpha/2}\, \widehat{SE}(\hat{\beta}_0) \tag{2.42}$$

where $z_{1-\alpha/2}$ is the upper 100(1 − α/2)% point from the standard normal distribution and $\widehat{SE}(\cdot)$ denotes a model-based estimator of the standard error of the respective parameter estimator.
As an example, consider the model fit to the data in Table 2.1 regressing AGE on the presence or absence of CHD. The results are presented in Table 2.3.
The end points of a 95% confidence interval for the slope coefficient from 2.41 are

0.111 ± 1.96 × 0.0241

giving the interval (0.064, 0.158). Briefly, the results suggest that the change in the log-odds of CHD per one-year increase in age is 0.111, and the change could be as little as 0.064 or as much as 0.158 with 95 percent confidence.
The logit is the linear part of the logistic regression model and, as such, is most like the fitted line in a linear regression model. The estimator of the logit is

$$\hat{g}(x) = \hat{\beta}_0 + \hat{\beta}_1 x \tag{2.43}$$

The estimator of the variance of the estimator of the logit requires obtaining the variance of a sum. In this case it is

$$\widehat{\mathrm{var}}(\hat{g}(x)) = \widehat{\mathrm{var}}(\hat{\beta}_0) + x^2\, \widehat{\mathrm{var}}(\hat{\beta}_1) + 2x\, \widehat{\mathrm{cov}}(\hat{\beta}_0, \hat{\beta}_1) \tag{2.44}$$

In general, the variance of a sum is equal to the sum of the variances of each term plus twice the covariance of each possible pair of terms formed from the components of the sum. For the Wald-based confidence interval for the logit,

$$\hat{g}(x) \sim N(g(x), \mathrm{var}[\hat{g}(x)])$$

$$\frac{\hat{g}(x) - g(x)}{SE[\hat{g}(x)]} \sim N(0, 1)$$

$$\Pr\left(-z_{1-\alpha/2} \le \frac{\hat{g}(x) - g(x)}{SE[\hat{g}(x)]} \le z_{1-\alpha/2}\right) = 1 - \alpha$$

so the end points of a 100(1 − α)% confidence interval for the logit are

$$\hat{g}(x) \pm z_{1-\alpha/2}\, \widehat{SE}[\hat{g}(x)] \tag{2.45}$$

where $\widehat{SE}[\hat{g}(x)]$ is the positive square root of the variance estimator in 2.44.

The estimated logit for the fitted model in Table 2.3 is shown in 2.23. In order to evaluate 2.44 for a specific age we need the estimated covariance matrix. This matrix can be obtained from the output of all logistic regression software packages.

The estimated logit from 2.23 for a subject of age 50 is

ĝ(50) = −5.309 + 0.111 × 50
      = 0.24

The estimated variance using equation 2.44 and the results in Table 2.4 is

v̂ar[ĝ(50)] = 1.28517 + (50)² × (0.000579) + 2 × 50 × (−0.026677) = 0.0650

and the estimated standard error is ŜE[ĝ(50)] = 0.2550. Thus the end points of a 95 percent confidence interval for the logit at age 50 are

0.240 ± 1.96 × 0.2550 = (−0.260, 0.740)

              AGE          Constant
AGE            0.000579
Constant      -0.026677    1.28517

Table 2.4: Estimated covariance matrix of the estimated coefficients in Table 2.3

The estimator of the logit and its confidence interval provide the basis for the estimator
of the fitted value, in this case the logistic probability, and its associated confidence
interval. In particular, using equation 2.24 at age 50 the estimated logistic probability is
\[ \hat{\pi}(50) = \frac{e^{\hat{g}(50)}}{1 + e^{\hat{g}(50)}} = \frac{e^{-5.31 + 0.111 \times 50}}{1 + e^{-5.31 + 0.111 \times 50}} = 0.560 \]

and the end points of a 95 percent confidence interval are obtained from the respective end
points of the confidence interval for the logit. The end points of the 100(1 − α)% confidence
interval for the fitted value are
\[ \frac{e^{\hat{g}(x) \pm z_{1-\alpha/2}\,\widehat{SE}[\hat{g}(x)]}}{1 + e^{\hat{g}(x) \pm z_{1-\alpha/2}\,\widehat{SE}[\hat{g}(x)]}} \]
Using the example at age 50 to demonstrate the calculations, the lower limit is
\[ \frac{e^{-0.260}}{1 + e^{-0.260}} = 0.435 \]
and the upper limit is
\[ \frac{e^{0.740}}{1 + e^{0.740}} = 0.677 \]
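
The age 50 calculation can be retraced numerically. The following is a minimal sketch, assuming the coefficients and covariance entries quoted in the text and in Table 2.4; the variable names and the helper `expit` are illustrative choices, not part of the original analysis.

```python
import math

# Values quoted in the text: fitted coefficients and the estimated
# covariance matrix of Table 2.4 (AGE is the only covariate).
b0, b1 = -5.31, 0.111
var_b0, var_b1, cov_b0_b1 = 1.28517, 0.000579, -0.026677
z = 1.96  # approximate 0.975 quantile of N(0, 1), for a 95% interval


def expit(t):
    """Inverse of the logit: e^t / (1 + e^t)."""
    return math.exp(t) / (1.0 + math.exp(t))


age = 50
logit = b0 + b1 * age                                        # equation 2.43
var_logit = var_b0 + age**2 * var_b1 + 2 * age * cov_b0_b1   # equation 2.44
se_logit = math.sqrt(var_logit)

# Wald interval on the logit scale (equation 2.45) ...
logit_lo, logit_hi = logit - z * se_logit, logit + z * se_logit
# ... mapped to the probability scale through the logistic function.
p_hat, p_lo, p_hi = expit(logit), expit(logit_lo), expit(logit_hi)

print(f"logit = {logit:.3f}, SE = {se_logit:.4f}")
print(f"95% CI for logit: ({logit_lo:.3f}, {logit_hi:.3f})")
print(f"pi_hat = {p_hat:.3f}, 95% CI: ({p_lo:.3f}, {p_hi:.3f})")
```

Because the interval is built on the logit scale and then transformed, its end points always lie between 0 and 1, which would not be guaranteed for an interval built directly on the probability scale.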
We have found that a major mistake often made by persons new to logistic regression modeling
is to try to apply estimates on the probability scale to individual subjects. The fitted
value π̂(50) is analogous to a particular point on the line obtained from a linear
regression. In linear regression each point on the fitted line provides an estimate of the
mean of the dependent variable in a population of subjects with covariate value “x”.

Thus the value of 0.56 for π̂(50) is an estimate of the mean (i.e., proportion) of 50 year
old subjects in the population sampled that have evidence of CHD. Each individual 50 year
old subject either does or does not have evidence of CHD. The confidence interval suggests
that this mean could be between 0.435 and 0.677 with 95% confidence.

Chapter 3

Multiple Logistic Regression Model

In the previous chapter we introduced the logistic regression model in the univariate
context. As in the case of linear regression, the strength of a modeling technique lies in its
ability to model many variables, some of which may be on different measurement scales. In
this chapter we will generalize the logistic model to the case of more than one independent
variable. This will be referred to as the “multivariable case”. Central to the consideration
of multiple logistic models will be estimation of the coefficients in the model and testing
for their significance.

3.1 The Multiple Logistic Regression Model


Consider a collection of p independent variables denoted by the vector X′ = (x1, x2, ..., xp).
For the moment we will assume that each of these variables is at least interval scale. Let
the conditional probability that the outcome is present be denoted by P(Y = 1|X) = π(X).
The logit of the multiple logistic regression model is given by the equation
\[ g(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \tag{3.1} \]
in which case the logistic regression model is
\[ \pi(X) = \frac{e^{g(X)}}{1 + e^{g(X)}} \tag{3.2} \]
If some of the independent variables are discrete, nominal scale variables such as race,
sex, treatment group, and so forth, it is inappropriate to include them in the model as
if they were interval scale variables. The numbers used to represent the various levels
of these nominal scale variables are merely identifiers, and have no numeric significance. In
this situation the method of choice is to use a collection of design variables (or dummy
variables).

Suppose, for example, that one of the independent variables is race, which has been
coded as “White”, “Black”, and “Other”. In this case two design variables are necessary.

RACE     D1 (Design variable)   D2 (Design variable)
White    0                      0
Black    1                      0
Other    0                      1

Table 3.1: An example of the coding of the design variables for race coded at three levels
[Figure 3.1: Design variables.]

One possible coding strategy is that when the respondent is “White”, the two design
variables, D1 and D2, would both be set equal to zero; when the respondent is “Black”,
D1 would be set equal to 1 while D2 would still equal 0; when the race of the respondent
is “Other”, we would use D1 = 0 and D2 = 1. Table 3.1 illustrates this coding of the
design variables.

Most logistic regression software will generate design variables, and some programs have
a choice of several different methods. The different strategies for creation and interpretation
of design variables are discussed in detail in the next chapter. In general, if a nominal scaled
variable has k possible values, then k − 1 design variables will be needed. This is true
since, unless stated otherwise, all of our models have a constant term. To illustrate the
notation used for design variables in this text, suppose that the jth independent variable
x_j has k_j levels. The k_j − 1 design variables will be denoted as D_jl and the coefficients
for these design variables will be denoted as β_jl, l = 1, 2, ..., k_j − 1.

Thus, the logit for a model with p variables and the jth variable being discrete would be
\[ g(X) = \beta_0 + \beta_1 x_1 + \cdots + \sum_{l=1}^{k_j - 1} \beta_{jl} D_{jl} + \beta_p x_p \tag{3.3} \]

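Using the coding in Table 3.1, with one continuous variable x1 and race as the second
variable,
\[ g(X) = \beta_0 + \beta_1 x_1 + \sum_{l=1}^{2} \beta_{2l} D_{2l} = \beta_0 + \beta_1 x_1 + \beta_{21} D_{21} + \beta_{22} D_{22} \]

(1) White (D21 = 0, D22 = 0):
\[ g_1(X) = \beta_0 + \beta_1 x_1 + \beta_{21}(0) + \beta_{22}(0) = \beta_0 + \beta_1 x_1 \]

(2) Black (D21 = 1, D22 = 0):
\[ g_2(X) = \beta_0 + \beta_1 x_1 + \beta_{21}(1) + \beta_{22}(0) = \beta_0 + \beta_1 x_1 + \beta_{21} = \beta_3 + \beta_1 x_1, \qquad \beta_3 = \beta_0 + \beta_{21} \]

(3) Other (D21 = 0, D22 = 1):
\[ g_3(X) = \beta_0 + \beta_1 x_1 + \beta_{21}(0) + \beta_{22}(1) = \beta_0 + \beta_1 x_1 + \beta_{22} = \beta_4 + \beta_1 x_1, \qquad \beta_4 = \beta_0 + \beta_{22} \]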

We can plot these equations.

The three equations are parallel to each other; only the intercepts differ. When
discussing the multiple logistic regression model we will, in general, suppress the summation
and double subscripting needed to indicate when design variables are being used.
The exception to this will be the discussion of modeling strategies, when we need to use
the specific values of the coefficients for any design variables in the model.
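
The coding of Table 3.1 and the resulting parallel logits can be sketched in a few lines. This is only an illustration: the function names and the coefficient values are hypothetical, and the first level in the list is assumed to be the reference group.

```python
def design_variables(value, levels):
    """Reference cell coding: the first level is the reference group (all
    zeros); every other level gets its own 0/1 design variable."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]


RACE_LEVELS = ["White", "Black", "Other"]          # k = 3 levels -> k - 1 = 2 variables
coding = {race: design_variables(race, RACE_LEVELS) for race in RACE_LEVELS}


def logit(x1, race, beta0, beta1, beta2):
    """g(X) = beta0 + beta1*x1 + beta21*D21 + beta22*D22."""
    d = design_variables(race, RACE_LEVELS)
    return beta0 + beta1 * x1 + sum(b * dv for b, dv in zip(beta2, d))


# Hypothetical coefficients, chosen only to show that the three lines are parallel.
beta0, beta1, beta2 = 0.5, -0.02, [1.0, 0.4]
for race in RACE_LEVELS:
    shift = logit(100, race, beta0, beta1, beta2) - logit(100, "White", beta0, beta1, beta2)
    print(race, coding[race], round(shift, 6))
```

For any value of x1 the logit for “Black” exceeds that for “White” by β21 and the logit for “Other” exceeds it by β22, which is why the design variables only move the intercept.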

3.2 Fitting The Multiple Logistic Regression Model


Assume that we have a sample of n independent observations (x_i, y_i), i = 1, 2, ..., n. As
in the univariate case, fitting the model requires that we obtain estimates of the vector
β′ = (β0, β1, ..., βp). The method of estimation used in the multivariable case will be the
same as in the univariate situation: maximum likelihood. There will be p + 1 likelihood
equations, obtained by differentiating the log likelihood function with respect to the
p + 1 coefficients.

Suppose we have a sample of n independent observations (x_{i1}, x_{i2}, ..., x_{ip}, y_i),
i = 1, ..., n, where y_i denotes the value of a dichotomous outcome variable and
x_i = (x_{i1}, ..., x_{ip}) is the vector of values of the independent variables for the ith
subject. Furthermore, assume that the outcome variable has been coded as 0 or 1, representing
the absence or the presence of the characteristic, respectively. The conditional probability
that Y is equal to 1 given x is denoted π(x) = P(Y = 1|x). It follows that the quantity
1 − π(x) gives the conditional probability that Y is equal to zero given x, P(Y = 0|x).
Thus, for those observations where y_i = 1 the contribution to the likelihood function is
π(x_i), and for those where y_i = 0 the contribution is 1 − π(x_i), where π(x_i) denotes the
value of π(x) computed at x_i:
y_i = 1: contribution to the likelihood function = π(x_i)
y_i = 0: contribution to the likelihood function = 1 − π(x_i)
A convenient way to express the contribution to the likelihood function for the observation
(x_i, y_i) is through the expression
\[ \pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i} \]
Since the observations are assumed to be independent, the likelihood function is obtained
as the product of the terms given by this expression:
\[ l(\beta) = \prod_{i=1}^{n} \pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i} \tag{3.4} \]
where β′ = (β0, β1, ..., βp). As noted above, β is estimated by maximum likelihood, which
yields p + 1 likelihood equations.

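By the principle of maximum likelihood we maximize the log likelihood,
\[ \ln l(\beta) = \sum_{i=1}^{n} \left\{ y_i \ln \pi(x_i) + (1 - y_i) \ln[1 - \pi(x_i)] \right\} = \sum_{i=1}^{n} \left\{ y_i \ln\frac{\pi(x_i)}{1 - \pi(x_i)} + \ln[1 - \pi(x_i)] \right\} \]
By equation 3.2,
\[ \pi(x_i) = \frac{e^{g(x_i)}}{1 + e^{g(x_i)}} \quad \text{and} \quad 1 - \pi(x_i) = \frac{1}{1 + e^{g(x_i)}} \]
so we get
\[ \frac{\pi(x_i)}{1 - \pi(x_i)} = e^{g(x_i)}, \qquad \ln\frac{\pi(x_i)}{1 - \pi(x_i)} = \ln\big(e^{g(x_i)}\big) = g(x_i) \]
Substituting,
\[ \ln l(\beta) = \sum_{i=1}^{n} \left\{ y_i\, g(x_i) + \ln\frac{1}{1 + e^{g(x_i)}} \right\} \tag{3.5} \]
\[ = \sum_{i=1}^{n} \left\{ y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) + \ln\frac{1}{1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}} \right\} \tag{3.6} \]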
Differentiating with respect to β0, and writing g(x_i) = β0 + β1 x_{i1} + ... + βp x_{ip},
\[ \frac{\partial}{\partial \beta_0} \ln l(\beta) = \sum_{i=1}^{n} \left\{ y_i - \frac{e^{g(x_i)}}{1 + e^{g(x_i)}} \right\} = \sum_{i=1}^{n} [y_i - \pi(x_i)] \tag{3.7} \]
Now differentiating equation 3.5 with respect to β_j, j = 1, ..., p,
\[ \frac{\partial}{\partial \beta_j} \ln l(\beta) = \sum_{i=1}^{n} \left\{ y_i x_{ij} - x_{ij}\frac{e^{g(x_i)}}{1 + e^{g(x_i)}} \right\} = \sum_{i=1}^{n} x_{ij}\,[y_i - \pi(x_i)] \tag{3.8} \]
Setting the derivatives in 3.7 and 3.8 equal to zero,
\[ \frac{\partial}{\partial \beta_0} \ln l(\beta) = 0 \quad \text{and} \quad \frac{\partial}{\partial \beta_j} \ln l(\beta) = 0, \]
we obtain the likelihood equations
\[ \sum_{i=1}^{n} [y_i - \pi(x_i)] = 0 \tag{3.9} \]
\[ \sum_{i=1}^{n} x_{ij}\,[y_i - \pi(x_i)] = 0 \tag{3.10} \]
for j = 1, ..., p. As in the univariate model, the solution of the likelihood equations
requires special software that is available in most, if not all, statistical packages. Let
β̂ denote the solution to these equations. Thus, the fitted values for the multiple logistic
regression model are π̂(x_i), the value of the expression in equation 3.2 computed using
β̂ and x_i.
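
To make the likelihood equations 3.9 and 3.10 concrete, the sketch below solves them by Newton-Raphson iteration with step halving on a small made-up data set. The data, the function names and the choice of Newton-Raphson are assumptions for illustration; the text only states that statistical packages solve these equations, not which algorithm they use.

```python
import math


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def solve(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * u[k] for k in range(r + 1, n))
        u[r] = (M[r][n] - s) / M[r][r]
    return u


def log_lik(X, y, beta):
    """Log likelihood in the form of equation 3.5."""
    total = 0.0
    for xi, yi in zip(X, y):
        g = sum(b * v for b, v in zip(beta, xi))
        total += yi * g - math.log(1.0 + math.exp(g))
    return total


def score(X, y, beta):
    """Left-hand sides of equations 3.9 and 3.10 (x_i0 = 1 gives 3.9)."""
    s = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        resid = yi - expit(sum(b * v for b, v in zip(beta, xi)))
        for j in range(len(beta)):
            s[j] += xi[j] * resid
    return s


def information(X, beta):
    """Negative matrix of second derivatives of the log likelihood."""
    q = len(beta)
    info = [[0.0] * q for _ in range(q)]
    for xi in X:
        p = expit(sum(b * v for b, v in zip(beta, xi)))
        for j in range(q):
            for l in range(q):
                info[j][l] += xi[j] * xi[l] * p * (1 - p)
    return info


def fit(X, y, max_iter=50, tol=1e-10):
    beta = [0.0] * len(X[0])
    for _ in range(max_iter):
        step = solve(information(X, beta), score(X, y, beta))
        t, current = 1.0, log_lik(X, y, beta)
        # Halve the step while it would lower the log likelihood.
        while t > 1e-8 and log_lik(X, y, [b + t * s for b, s in zip(beta, step)]) < current - 1e-12:
            t /= 2
        beta = [b + t * s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Hypothetical data: each row is (1, x1, x2); y is the 0/1 outcome.
X = [[1, 1.0, 0], [1, 2.0, 1], [1, 3.0, 0], [1, 4.0, 1],
     [1, 5.0, 0], [1, 6.0, 1], [1, 2.5, 1], [1, 4.5, 0]]
y = [0, 0, 1, 0, 1, 1, 1, 0]

beta_hat = fit(X, y)
fitted = [expit(sum(b * v for b, v in zip(beta_hat, xi))) for xi in X]
print("beta_hat:", [round(b, 4) for b in beta_hat])
print("score at beta_hat:", [round(s, 8) for s in score(X, y, beta_hat)])
```

At the solution every component of the score is numerically zero, and equation 3.9 implies that the fitted probabilities sum to the observed number of outcomes.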

In the previous chapter only a brief mention was made of the method for estimating
the standard errors of the estimated coefficients. Now that the logistic regression model
has been generalized both in concept and notation to the multivariable case, we consider
estimation of standard errors in more detail.

The method of estimating the variances of the estimated coefficients follows from the
well-developed theory of maximum likelihood estimation. This theory states that the
estimators are obtained from the matrix of second partial derivatives of the log likelihood
function. These partial derivatives have the following form,

\[ \frac{\partial^2 L(\beta)}{\partial \beta_j^2} = -\sum_{i=1}^{n} x_{ij}^2\, \pi_i (1 - \pi_i) \tag{3.11} \]
and
\[ \frac{\partial^2 L(\beta)}{\partial \beta_j\, \partial \beta_l} = -\sum_{i=1}^{n} x_{ij} x_{il}\, \pi_i (1 - \pi_i) \tag{3.12} \]
for j, l = 0, 1, 2, ..., p, where π_i denotes π(x_i), x_{i0} = 1 and L(β) denotes the log
likelihood. Let the (p + 1) × (p + 1) matrix containing the negative of the terms given in
equations 3.11 and 3.12 be denoted as I(β). This matrix is called the observed information
matrix. The variances and covariances of the estimated coefficients are obtained from the
inverse of this matrix, which we denote as Var(β) = I⁻¹(β). We will use the notation
Var(β_j) to denote the jth diagonal element of this matrix, which is the variance of β̂_j,
and Cov(β_j, β_l) to denote an arbitrary off-diagonal element, which is the covariance of
β̂_j and β̂_l. The estimators of the variances and covariances, which will be denoted by
V̂ar(β̂), are obtained by evaluating Var(β) at β̂. We will use V̂ar(β̂_j) and Ĉov(β̂_j, β̂_l),
j, l = 0, 1, ..., p, to denote the values in this matrix. We will mostly use the estimated
standard errors of the estimated coefficients, which we will denote as
\[ \widehat{SE}(\hat{\beta}_j) = \left[ \widehat{\mathrm{Var}}(\hat{\beta}_j) \right]^{1/2} \tag{3.13} \]

for j = 0, 1, 2, ..., p. We will use this notation in developing methods for coefficient
testing and confidence interval estimation.
A formulation of the information matrix which will be useful when discussing model
fitting and assessment of fit is Î(β̂) = X′VX, where X is an n by p + 1 matrix containing
the data for each subject, and V is an n by n diagonal matrix with general element
π̂_i(1 − π̂_i). That is, the matrix X is
\[ X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \tag{3.14} \]
and the matrix V is
\[ V = \begin{pmatrix} \hat{\pi}_1(1 - \hat{\pi}_1) & 0 & \cdots & 0 \\ 0 & \hat{\pi}_2(1 - \hat{\pi}_2) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & \hat{\pi}_n(1 - \hat{\pi}_n) \end{pmatrix} \tag{3.15} \]
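
As a small check of the formulation Î(β̂) = X′VX, the sketch below builds the information matrix for the CHD data cross-classified by AGED (the counts of Table 4.2 in Chapter 4) and inverts it. It relies on one fact not spelled out in the text: with a single 0/1 covariate the fitted probabilities equal the observed proportions in each group. Variable names are illustrative.

```python
import math

# Counts keyed by (x, y), taken from Table 4.2: x = AGED, y = CHD.
counts = {(1, 1): 21, (1, 0): 6, (0, 1): 22, (0, 0): 51}

# Fitted probabilities; for one dichotomous covariate these are the
# observed proportions with y = 1 in each group (assumption stated above).
pi_hat = {x: counts[(x, 1)] / (counts[(x, 1)] + counts[(x, 0)]) for x in (0, 1)}

# X'VX = sum over subjects of v_i * x_i x_i', with row x_i = (1, x).
info = [[0.0, 0.0], [0.0, 0.0]]
for (x, _y), n_subjects in counts.items():
    v = pi_hat[x] * (1 - pi_hat[x])
    row = (1.0, float(x))
    for j in range(2):
        for l in range(2):
            info[j][l] += n_subjects * v * row[j] * row[l]

# Invert the 2 x 2 matrix to get the estimated covariance matrix.
det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
cov = [[info[1][1] / det, -info[0][1] / det],
       [-info[1][0] / det, info[0][0] / det]]

se_b0 = math.sqrt(cov[0][0])
se_b1 = math.sqrt(cov[1][1])
print(f"SE(beta0) = {se_b0:.3f}, SE(beta1) = {se_b1:.3f}")
```

The standard error obtained for the slope agrees with the value 0.529 used for the odds ratio confidence interval in Chapter 4.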
Before proceeding further we present an example that illustrates the formulation of a
multiple logistic regression model and the estimation of its coefficients using a subset of
the variables from the data for the low birth weight study. The code sheet for the full data
set is given in Table 2.6. The goal of this study was to identify risk factors associated
with giving birth to a low birth weight baby (weighing less than 2500 grams).
Data were collected on 189 women, n1 = 59 of whom had low birth weight babies and
n0 = 130 of whom had normal birth weight babies. Four variables thought to be important
were:
(i) Age
(ii) Weight of the mother at her last menstrual period
(iii) Race
(iv) Number of physician visits during the first trimester of pregnancy.
In this example, the variable race has been recoded using the two design variables in
Table 3.1. The results of fitting the logistic regression model to these data are shown
in Table 3.2. In Table 3.2 the estimated coefficients for the two design variables for
race are indicated by RACE2 and RACE3. The estimated logit is given by the following
expression:
ĝ(x) = 1.295 − 0.024 × AGE − 0.014 × LWT + 1.004 × RACE2 + 0.433 × RACE3 − 0.049 × FTV
The fitted values are obtained using the estimated logit, ĝ(x).

3.3 Testing For The Significance Of The Model


There are three methods for testing the significance of the model:
(1) Likelihood ratio test
(2) Wald test
(3) Score test

Variable    Coeff.    Std. Err.    z        P > |z|
AGE         -0.024    0.0337       -0.71    0.480
LWT         -0.014    0.0065       -2.18    0.029
RACE2        1.004    0.4979        2.02    0.044
RACE3        0.433    0.3622        1.20    0.232
FTV         -0.049    0.1672       -0.30    0.768
Constant     1.295    1.0714        1.21    0.227

Table 3.2: Estimated coefficients for a multiple logistic regression model using
the variables AGE, weight at last menstrual period (LWT), RACE and number of first
trimester physician visits (FTV) for the low birth weight study.

3.4 Likelihood Ratio Test For Testing For The Significance Of The Model

Once we have fit a particular multiple (multivariable) logistic regression model, we begin
the process of model assessment. As in the univariate case presented in Chapter 2, the first
step in this process is usually to assess the significance of the variables in the model.
The likelihood ratio test for overall significance of the p coefficients for the independent
variables in the model is performed in exactly the same manner as in the univariate case.
The test is based on the statistic G given in equation 2.32. The only difference is that the
fitted values, π̂, under the model are based on the vector containing p + 1 parameters, β̂.
Under the null hypothesis that the p “slope” coefficients for the covariates in the model
are equal to zero, the distribution of G will be chi-square with p degrees of freedom.
Consider the fitted model whose estimated coefficients are given in Table 3.2; for that
model, the value of the log likelihood was calculated using Minitab software.

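Test statistic:
\[ G = 2\left\{ \sum_{i=1}^{n} \left[ y_i \ln(\hat{\pi}_i) + (1 - y_i)\ln(1 - \hat{\pi}_i) \right] - \left[ n_1 \ln(n_1) + n_0 \ln(n_0) - n \ln(n) \right] \right\} \tag{3.16} \]

Hypothesis:

H0: β_j = 0 for all j = 1, ..., p
H1: β_j ≠ 0 for at least one j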

The log likelihood of the fitted model in Table 3.2 is L = −111.286. The log likelihood for
the constant only model may be obtained by evaluating the numerator of equation 2.37 or by
fitting the constant only model. Here
n = 189
n0 = 130

Variable    Coeff.    Std. Err.    z        P > |z|
LWT         -0.015    0.0064       -2.31    0.018
RACE2        1.081    0.4881        2.22    0.027
RACE3        0.481    0.3567        1.35    0.178
Constant     0.806    0.8452        0.95    0.340

Table 3.3: Estimated coefficients for a multiple logistic regression model using the variables
LWT and RACE from the low birth weight study.

n1 = 59
Thus the value of the likelihood ratio test statistic from equation 3.16 is

G = 2{−111.286 − [59 ln(59) + 130 ln(130) − 189 ln(189)]} = 12.099

Under the null hypothesis G follows a chi-square distribution with p = 5 degrees of freedom.

The P-value for the test is

P[χ²(5) > 12.099] = 0.034 < 0.05

which is significant at the α = 0.05 level.

Conclusion
We reject the null hypothesis in this case and conclude that at least one and perhaps all p
coefficients are different from zero, an interpretation analogous to that in multiple linear
regression.
If our goal is to obtain the best fitting model while minimizing the number of parameters,
the next logical step is to fit a reduced model containing only those variables thought to
be significant, and compare it to the full model containing all the variables. The results of
fitting the reduced model are given in Table 3.3. The difference between the two models
is the exclusion of the variables AGE and FTV from the full model. The likelihood ratio
test comparing these models is obtained using the definition of G given in equation 2.37.
It will have a distribution that is chi-square with 2 degrees of freedom under the hypothesis
that the coefficients for the variables excluded are equal to zero.

Hypothesis:

H0: β_AGE = β_FTV = 0
H1: at least one of β_AGE and β_FTV is not equal to 0

The value of the test statistic comparing the models in Tables 3.2 and 3.3 is

G = −2[(−111.630) − (−111.286)] = 0.688

which, with 2 degrees of freedom, has a P-value of P[χ²(2) > 0.688] = 0.709. Since the
P-value is large, exceeding 0.05, we conclude that the reduced model is as good as the full
model. Thus there is no advantage in including AGE and FTV in the model. However,
we must not base our models entirely on tests of statistical significance.
Whenever a categorical independent variable is included in (or excluded from) a model, all of
its design variables should be included (or excluded); to do otherwise implies that we have
recoded the variable.

For example, if we only include design variable D1 as defined in Table 3.1, then RACE is
entered into the model as a dichotomous variable coded as black or not black. If k is the
number of levels of a categorical variable, then the contribution to the degrees of freedom
for the likelihood ratio test for the exclusion of this variable will be k − 1. For example,
if we exclude race from the model, and race is coded at three levels using the design
variables shown in Table 3.1, then there would be 2 degrees of freedom for the test, one
for each design variable.
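
Both likelihood ratio tests of this section can be retraced from the log likelihoods quoted in the text. The sketch below is illustrative only: `chi2_sf` is a hand-written closed-form upper tail probability for integer degrees of freedom (an assumption of this sketch; a statistical package would normally supply it), and the variable names are not from the original analysis.

```python
import math


def chi2_sf(x, df):
    """P[chi-square(df) > x] for a positive integer df, via the standard
    finite series (even df) or erfc plus a finite series (odd df)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    total, term = 0.0, math.sqrt(x)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total


# Quantities quoted in the text.
n1, n0 = 59, 130
n = n1 + n0
L_full = -111.286      # log likelihood, model of Table 3.2
L_reduced = -111.630   # log likelihood, model of Table 3.3

# Log likelihood of the constant only model.
L_constant = n1 * math.log(n1) + n0 * math.log(n0) - n * math.log(n)

# Overall test of the five slope coefficients (equation 3.16).
G_overall = 2 * (L_full - L_constant)
p_overall = chi2_sf(G_overall, 5)

# Test of AGE and FTV: reduced model against full model.
G_reduced = -2 * (L_reduced - L_full)
p_reduced = chi2_sf(G_reduced, 2)

print(f"G = {G_overall:.3f}, p = {p_overall:.3f}")
print(f"G = {G_reduced:.3f}, p = {p_reduced:.3f}")
```

The first test rejects the hypothesis that all slopes are zero, while the second gives no evidence that AGE and FTV improve on the reduced model.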

3.5 Wald Test For Testing For The Significance


Because of the multiple degrees of freedom we must be careful in our use of the Wald (W)
statistics to assess the significance of the coefficients. For example, if the W statistics
for both coefficients exceed 2, then we could conclude that the design variables are
significant. The multivariable analog of the Wald test is obtained from the following
vector-matrix calculation.

Test statistic:
\[ W = \hat{\beta}'\,[\widehat{\mathrm{Var}}(\hat{\beta})]^{-1}\,\hat{\beta} = \hat{\beta}'\,(X'VX)\,\hat{\beta} \]

Here β̂ is the (p + 1) × 1 vector with β̂′ = (β̂0, β̂1, ..., β̂p), X is the n × (p + 1) data
matrix in 3.14 and V is the n × n diagonal matrix in 3.15 with general element π̂_i(1 − π̂_i).
The product X′VX has dimension ((p + 1) × n)(n × n)(n × (p + 1)) = (p + 1) × (p + 1), so W
is a scalar.

Hypothesis:
H0: β_j = 0 for all j = 0, 1, ..., p
H1: β_j ≠ 0 for at least one j
Under the hypothesis that each of the p + 1 coefficients is equal to zero, W will be
distributed as chi-square with p + 1 degrees of freedom. Tests for just the p slope
coefficients are obtained by eliminating β̂0 from β̂ and the relevant row (first or last) and
column (first or last) from (X′VX). Since evaluation of this test requires the capability to
perform vector-matrix operations and to obtain β̂, there is no gain over the likelihood ratio
test of the significance of the model.
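
The quadratic form above can be applied to any subset of coefficients by using the matching block of the covariance matrix. As a hedged illustration, the sketch below applies it to the two race design variables only, using the estimates of Table 3.3 and the corresponding entries of Table 3.4; restricting the test to this subset, and the function name `wald_2x2`, are choices made for this example rather than a calculation reported in the text.

```python
import math

# Estimates for the two race design variables, as quoted in Table 3.3.
beta = [1.081, 0.481]                  # RACE2, RACE3
# Matching 2 x 2 block of the estimated covariance matrix in Table 3.4.
cov = [[0.2382, 0.0532],
       [0.0532, 0.1272]]


def wald_2x2(beta, cov):
    """W = beta' [Var(beta)]^{-1} beta for two coefficients."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return sum(beta[j] * inv[j][l] * beta[l] for j in range(2) for l in range(2))


W = wald_2x2(beta, cov)
# For 2 degrees of freedom the chi-square upper tail is exp(-W / 2).
p_value = math.exp(-W / 2)
print(f"W = {W:.2f}, p = {p_value:.3f}")
```

Note that the joint statistic uses the covariance between the two design variables, so it is not simply the sum of the squared individual z values.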

3.6 Confidence interval estimation


We discussed confidence interval estimators for the coefficients, logit and logistic
probabilities for the simple logistic regression model. The methods used for confidence
interval estimators for a multiple variable model are essentially the same.

The confidence interval estimator for the logit is a bit more complicated for the multiple
variable model than the result presented in 2.45. The basic idea is the same, only
there are now more terms involved in the summation. It follows from 3.1 that a general
expression for the estimator of the logit for a model containing p covariates is
\[ \hat{g}(x) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \tag{3.17} \]
An alternative way to express the estimator of the logit in 3.17 is through the use of
vector notation as ĝ(x) = x′β̂, where the vector β̂′ = (β̂0, β̂1, ..., β̂p) denotes the estimator
of the p + 1 coefficients and the vector x′ = (x0, x1, ..., xp) represents the constant and a
set of values of the p covariates in the model, where x0 = 1.

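The estimator of the variance of the estimator of the logit in 3.17 is
\[ \widehat{\mathrm{Var}}[\hat{g}(x)] = \widehat{\mathrm{Var}}[\hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p] = \sum_{j=0}^{p} x_j^2\, \widehat{\mathrm{Var}}(\hat{\beta}_j) + \sum_{j=0}^{p} \sum_{k=j+1}^{p} 2 x_j x_k\, \widehat{\mathrm{Cov}}(\hat{\beta}_j, \hat{\beta}_k) \]
We can express this result much more concisely by using the matrix expression for the
estimator of the variance of the estimator of the coefficients. From the expression for the
observed information matrix, we have that
\[ \widehat{\mathrm{Var}}(\hat{\beta}) = (X'VX)^{-1} \tag{3.18} \]
It follows from 3.18 that an equivalent expression for the estimator of the variance of the
logit in 3.17 is
\[ \widehat{\mathrm{Var}}[\hat{g}(x)] = x'\,\widehat{\mathrm{Var}}(\hat{\beta})\,x = x'(X'VX)^{-1}x \tag{3.19} \]
where x′ = (x0, x1, ..., xp) is the vector of covariate values with x0 = 1 and X is the data
matrix in 3.14.
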


Fortunately, all good logistic regression software packages provide the option for the user
to create a new variable containing the estimated values of 3.19 or the standard error for
all subjects in the data set. This feature eliminates the computational burden associated
with the matrix calculations in 3.19 and allows the user to routinely calculate fitted
values and confidence interval estimates. However, it is useful to illustrate the details of
the calculations.

            LWT          RACE2      RACE3      Constant
LWT         0.000041
RACE2      -0.000647     0.2382
RACE3       0.000036     0.0532     0.1272
Constant   -0.005211     0.0226    -0.1035     0.7143

Table 3.4: Estimated covariance matrix of the estimated coefficients in Table 3.3

Using the model in Table 3.3, the estimated logit for a 150 pound white woman is
ĝ(LWT = 150, RACE = White) = 0.806 − 0.015 × LWT + 1.081 × RACE2 + 0.481 × RACE3
= 0.806 − 0.015 × (150) + 1.081 × 0 + 0.481 × 0
= −1.444
and the estimated logistic probability is
\[ \hat{\pi}(LWT = 150, RACE = White) = \frac{e^{-1.444}}{1 + e^{-1.444}} = 0.191 \]
Conclusion

The interpretation of the fitted value is that the estimated proportion of low birth weight
babies among 150 pound white women is 0.191. In order to use 3.19 to estimate the
variance of this estimated logit we need the estimated covariance matrix shown
in Table 3.4. Thus the estimated variance of the logit is,
ĝ(LWT, RACE) = β̂0 + β̂1 × LWT + β̂2 × RACE2 + β̂3 × RACE3

V̂ar[ĝ(LWT = 150, RACE = White)]
= V̂ar(β̂0) + (150)² V̂ar(β̂1) + 0² V̂ar(β̂2) + 0² V̂ar(β̂3)
+ 2 × 150 × Ĉov(β̂0, β̂1) + 2 × 0 × Ĉov(β̂0, β̂2) + 2 × 0 × Ĉov(β̂0, β̂3)
+ 2 × 150 × 0 × Ĉov(β̂1, β̂2) + 2 × 150 × 0 × Ĉov(β̂1, β̂3) + 2 × 0 × 0 × Ĉov(β̂2, β̂3)

= 0.7143 + (150)² × 0.000041 + 0 × 0.2382 + 0 × 0.1272
+ 2 × 150 × (−0.0052) + 2 × 0 × 0.0226 + 2 × 0 × (−0.1035)
+ 2 × 150 × 0 × (−0.000647) + 2 × 150 × 0 × 0.000036 + 2 × 0 × 0 × 0.0532
= 0.0768

The standard error is
\[ \widehat{SE}[\hat{g}(LWT = 150, RACE = White)] = \sqrt{\widehat{\mathrm{Var}}[\hat{g}(LWT = 150, RACE = White)]} = \sqrt{0.0768} = 0.2771 \]

Treating ĝ(x) as approximately normal with estimated standard error 0.2771, the 95%
confidence interval for the estimated logit is

−1.444 ± 1.96 × (0.2771) = (−1.988, −0.901)

and the associated confidence interval for the fitted value is (0.120, 0.289).
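
The matrix form 3.19 gives the same variance with much less bookkeeping. The sketch below is a minimal illustration using the coefficients of Table 3.3 and the covariance matrix of Table 3.4, ordered as (Constant, LWT, RACE2, RACE3); it uses the rounded covariance −0.0052 that appears in the worked calculation (the table lists −0.005211), and all variable names are illustrative.

```python
import math

# Order of coefficients: Constant, LWT, RACE2, RACE3 (Table 3.3).
beta = [0.806, -0.015, 1.081, 0.481]
# Symmetric covariance matrix from Table 3.4; Cov(Constant, LWT) is rounded
# to -0.0052 as in the worked calculation in the text.
cov = [
    [0.7143, -0.0052, 0.0226, -0.1035],
    [-0.0052, 0.000041, -0.000647, 0.000036],
    [0.0226, -0.000647, 0.2382, 0.0532],
    [-0.1035, 0.000036, 0.0532, 0.1272],
]
x = [1.0, 150.0, 0.0, 0.0]   # 150 pound white woman: x0 = 1, RACE2 = RACE3 = 0
z = 1.96


def expit(t):
    return math.exp(t) / (1.0 + math.exp(t))


logit = sum(b * v for b, v in zip(beta, x))                               # x' beta
var_logit = sum(x[j] * cov[j][l] * x[l] for j in range(4) for l in range(4))  # x' Var x
se_logit = math.sqrt(var_logit)

lo, hi = logit - z * se_logit, logit + z * se_logit
p_hat, p_lo, p_hi = expit(logit), expit(lo), expit(hi)

print(f"logit = {logit:.3f}, Var = {var_logit:.4f}, SE = {se_logit:.4f}")
print(f"pi_hat = {p_hat:.3f}, 95% CI: ({p_lo:.3f}, {p_hi:.3f})")
```

Changing `x` to another covariate pattern, for example `[1.0, 120.0, 1.0, 0.0]`, reuses the same two lines for the logit and its variance.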

Chapter 4

Interpretation Of The Fitted Logistic Regression Model

Introduction
In Chapters 2 and 3 I discussed the methods for fitting and testing for the significance of
the logistic regression model. After fitting a model the emphasis shifts from the computation
and assessment of significance of the estimated coefficients to the interpretation of their
values. Strictly speaking, an assessment of the adequacy of the fitted model should precede
any attempt at interpreting it. Thus, we begin this chapter assuming that a logistic
regression model has been fit, that the variables in the model are significant in either a
clinical or statistical sense, and that the model fits according to some statistical measure
of fit.

The interpretation of any fitted model requires that we be able to draw practical inferences
from the estimated coefficients in the model. The question being addressed is: What
do the estimated coefficients in the model tell us about the research questions that
motivated the study?

For most models this involves the estimated coefficients for the independent variables
in the model. On occasion, the intercept coefficient is of interest; but this is the
exception, not the rule.

The estimated coefficients for the independent variables represent the slope (i.e., rate of
change) of a function of the dependent variable per unit of change in the independent
variable. Thus interpretation involves two issues:

(1) Determining the functional relationship between the dependent variable and the
independent variable.

(2) Appropriately defining the unit of change for the independent variable.

The first step is to determine what function of the dependent variable yields a linear
function of the independent variables. This is called the link function. In the case of a
linear regression model, it is the identity function since the dependent variable, by
definition, is linear in the parameters. (For those unfamiliar with the term “identity
function”, it is the function y = y.) In the logistic regression model the link function is
the logit transformation,
\[ g(x) = \ln\left[ \frac{\pi(x)}{1 - \pi(x)} \right] = \beta_0 + \beta_1 x \]

For a linear regression model recall that the slope coefficient, β1, is equal to the difference
between the value of the dependent variable at x + 1 and the value of the dependent variable
at x, for any value of x. For example, if y(x) = β0 + β1x, it follows that β1 = y(x + 1) − y(x).
In this case, the interpretation of the coefficient is relatively straightforward as it
expresses the resulting change in the measurement scale of the dependent variable for a unit
change in the independent variable. For example, if in a regression of weight on height of
male adolescents the slope is 5, then we would conclude that an increase of 1 inch in height
is associated with an increase of 5 pounds in weight.

In the logistic regression model, the slope coefficient represents the change in the logit
corresponding to a change of one unit in the independent variable (i.e., β1 = g(x + 1) − g(x)).
Proper interpretation of the coefficient in a logistic regression model depends on being
able to place meaning on the difference between two logits. Interpretation of this difference
is discussed in detail on a case-by-case basis as it relates directly to the definition
and meaning of a one-unit change in the independent variable.

4.1 Dichotomous Independent Variable


We begin our consideration of the interpretation of logistic regression coefficients with the
situation where the independent variable is nominal scale and dichotomous (i.e., measured
at two levels). This case provides the conceptual foundation for all other situations.
We assume that the independent variable, x, is coded as either zero or one. The difference
in the logit for a subject with x = 1 and x = 0 is
\[ g(1) - g(0) = (\beta_0 + \beta_1 \cdot 1) - (\beta_0 + \beta_1 \cdot 0) = \beta_1 \]
The algebra shown in this equation is rather straightforward. We present it in this level of
detail to emphasize that the first step in interpreting the effect of a covariate in a model is
to express the desired logit difference in terms of the model. In this case the logit
difference is equal to β1. In order to interpret this result we need to introduce and discuss
a measure of association termed the odds ratio.

Outcome         Independent Variable (X)
Variable (Y)    x = 1                                   x = 0
y = 1           π(1) = e^{β0+β1} / (1 + e^{β0+β1})      π(0) = e^{β0} / (1 + e^{β0})
y = 0           1 − π(1) = 1 / (1 + e^{β0+β1})          1 − π(0) = 1 / (1 + e^{β0})
Total           1.0                                     1.0

Table 4.1: Values of the logistic regression model when the independent variable is
dichotomous

The possible values of the logistic probabilities may be conveniently displayed in a 2 × 2
table as shown in Table 4.1. The odds of the outcome being present among individuals
with x = 1 is defined as π(1)/(1 − π(1)). Similarly, the odds of the outcome being present
among individuals with x = 0 is defined as π(0)/(1 − π(0)). With
\[ \pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \tag{4.1} \]
the odds ratio, denoted OR, is defined as the ratio of the odds for x = 1 to the odds for
x = 0, and is given by the equation
\[ OR = \frac{\pi(1)/[1 - \pi(1)]}{\pi(0)/[1 - \pi(0)]} \tag{4.2} \]
Substituting the expressions from Table 4.1,
\[ OR = \frac{\left( \dfrac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}} \right) \Big/ \left( \dfrac{1}{1 + e^{\beta_0 + \beta_1}} \right)}{\left( \dfrac{e^{\beta_0}}{1 + e^{\beta_0}} \right) \Big/ \left( \dfrac{1}{1 + e^{\beta_0}} \right)} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1} \]
so that ln(OR) = β1.

Hence, for logistic regression with a dichotomous independent variable coded 1 and 0, the
relationship between the odds ratio and the regression coefficient is
\[ OR = e^{\beta_1} \tag{4.3} \]
This simple relationship between the coefficient and the odds ratio is the fundamental reason
why logistic regression has proven to be such a powerful analytic research tool.
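
The identity OR = e^{β1} can be checked numerically by computing the odds from the logistic probabilities of Table 4.1. In the sketch below the slope 2.094 is the estimate quoted later in this chapter, while the intercept −0.84 is a hypothetical value chosen only for illustration; the odds ratio does not depend on it.

```python
import math


def pi(x, b0, b1):
    """Logistic probability of equation 4.1."""
    g = b0 + b1 * x
    return math.exp(g) / (1 + math.exp(g))


b0, b1 = -0.84, 2.094   # b0 is hypothetical; b1 is the estimate quoted in the text

odds_x1 = pi(1, b0, b1) / (1 - pi(1, b0, b1))
odds_x0 = pi(0, b0, b1) / (1 - pi(0, b0, b1))
odds_ratio = odds_x1 / odds_x0          # equation 4.2

print(f"odds ratio from probabilities: {odds_ratio:.4f}")
print(f"exp(b1):                       {math.exp(b1):.4f}")

# A different intercept changes both odds but leaves their ratio unchanged.
other = (pi(1, 0.5, b1) / (1 - pi(1, 0.5, b1))) / (pi(0, 0.5, b1) / (1 - pi(0, 0.5, b1)))
```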
The odds ratio is a measure of association which has found wide use, especially in
epidemiology, as it approximates how much more likely (or unlikely) it is for the outcome to
be present among those with x = 1 than among those with x = 0.

Outcome        AGED (X)
CHD (Y)        ≥ 55 (1)    < 55 (0)    Total
Present (1)    21          22          43
Absent (0)     6           51          57
Total          27          73          100

Table 4.2: Cross-classification of AGE dichotomized at 55 years and CHD for 100 subjects

For example, if y denotes the presence or absence of lung cancer and if x denotes whether
the person is a smoker, then ÔR = 2 estimates that lung cancer is twice as likely to occur
among smokers as among nonsmokers in the study population. As another example,
suppose y denotes the presence or absence of heart disease and x denotes whether or not
the person engages in regular strenuous physical exercise.

If the estimated odds ratio is ÔR = 0.5, then the occurrence of heart disease is one half
as likely among those who exercise as among those who do not in the study population.
The interpretation given for the odds ratio is based on the fact that in many instances
it approximates a quantity called the relative risk. This parameter is equal to the ratio
π(1)/π(0). It follows from 4.2 that the odds ratio approximates the relative risk if
\[ \frac{1 - \pi(0)}{1 - \pi(1)} \approx 1. \]
This holds when π(x) is small for both x = 1 and x = 0.
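
How well the odds ratio approximates the relative risk depends on how rare the outcome is. The sketch below compares the two for a hypothetical rare outcome and for the group proportions implied by the counts in Table 4.2; the function names and the rare-outcome probabilities are illustrative assumptions.

```python
def relative_risk(p1, p0):
    """pi(1) / pi(0)."""
    return p1 / p0


def odds_ratio(p1, p0):
    """[pi(1) / (1 - pi(1))] / [pi(0) / (1 - pi(0))]."""
    return (p1 / (1 - p1)) / (p0 / (1 - p0))


# Hypothetical rare outcome: the correction factor (1 - p0) / (1 - p1) is close to 1.
rare_rr = relative_risk(0.02, 0.01)
rare_or = odds_ratio(0.02, 0.01)

# Proportions with CHD in the two age groups of Table 4.2: the outcome is common.
p1, p0 = 21 / 27, 22 / 73
chd_rr = relative_risk(p1, p0)
chd_or = odds_ratio(p1, p0)

print(f"rare outcome: RR = {rare_rr:.3f}, OR = {rare_or:.3f}")
print(f"CHD example:  RR = {chd_rr:.2f}, OR = {chd_or:.2f}")
```

For the rare outcome the two measures nearly coincide, whereas for the CHD data the odds ratio is roughly three times the relative risk, so reading it as “times more likely” overstates the effect on the probability scale.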

An example may help to clarify what the odds ratio is and how it is computed from the
results of a logistic regression program or from a 2 × 2 table. In many examples of logistic
regression encountered in the literature we find that a continuous variable has been
dichotomized at some biologically meaningful cutpoint.

Example: We consider the previous example, with the data displayed in Table 1.1, and create
a new variable, AGED, which takes on the value 1 if the age of the subject is greater
than or equal to 55 and zero otherwise. The result of cross-classifying the dichotomized
age variable with the outcome variable CHD is presented in Table 4.2. The likelihood
function for the pairs (x_i, y_i) is the product
\[ l(\beta) = \prod_{i=1}^{n} \pi(x_i)^{y_i}\,[1 - \pi(x_i)]^{1 - y_i} \tag{4.4} \]
The data in Table 4.2 tell us that there were 21 subjects with values (x = 1, y = 1), 22
with (x = 0, y = 1), 6 with (x = 1, y = 0), and 51 with (x = 0, y = 0). Hence, for these
data, the likelihood function shown in 4.4 simplifies to
\[ l(\beta) = \pi(1)^{21} \times [1 - \pi(1)]^{6} \times \pi(0)^{22} \times [1 - \pi(0)]^{51} \tag{4.5} \]

A logistic regression program can be used to obtain the estimates of β0 and β1; the estimated
slope is β̂1 = 2.094. The estimate of the odds ratio is ÔR = e^{2.094} = 8.1. Readers who
have had some previous experience with the odds ratio undoubtedly wonder why a logistic
regression package was used to obtain the maximum likelihood estimate of the odds ratio, when
it could have been obtained directly from the cross-product ratio from Table 4.2, namely,
\[ \widehat{OR} = \frac{21/6}{22/51} = 8.11, \qquad \hat{\beta}_1 = \ln\left[ \frac{21/6}{22/51} \right] = 2.094 \]
We emphasize here that logistic regression is, in fact, regression even in the simplest case
possible. The fact that the data may be formulated in terms of a contingency table provides
the basis for interpretation of estimated coefficients as the log of odds ratios.
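
The cross-product ratio is easy to compute directly from the four cell counts. The helper below is a small sketch; the function name and argument order are assumptions made for this example.

```python
import math


def odds_ratio_2x2(y1_x1, y0_x1, y1_x0, y0_x0):
    """Cross-product ratio of a 2 x 2 table:
    (odds of y = 1 when x = 1) / (odds of y = 1 when x = 0)."""
    return (y1_x1 / y0_x1) / (y1_x0 / y0_x0)


# Cell counts from Table 4.2: CHD present/absent by AGED >= 55 / < 55.
or_hat = odds_ratio_2x2(21, 6, 22, 51)
beta1_hat = math.log(or_hat)

print(f"OR = {or_hat:.2f}, ln(OR) = {beta1_hat:.3f}")
```

The logarithm of the cross-product ratio reproduces the slope estimate reported by the logistic regression program.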

Along with the point estimation of a parameter, it is good idia to use a confidence interval
estimate of provide additional information about parameter value.

The odds ratio, OR, is usually the parameter of interest in a logistic regression due to its
ease of interpretation. However, its estimate, ÔR, tends to have a skewed distribution.
The skewness of the sampling distribution of ÔR is due to the fact that its possible
values range between 0 and ∞, with the null value equaling 1.

In theory, for large enough sample sizes, the distribution of ÔR is normal.
Unfortunately, this sample size requirement typically exceeds that of most studies. Hence,
inferences are usually based on the sampling distribution of ln(ÔR) = β̂1, which tends to
follow a normal distribution for much smaller sample sizes,

\[ \ln(\widehat{OR}) \sim N\big(\beta_1, [SE(\hat{\beta}_1)]^2\big) \]

A 100(1 − α)% confidence interval (CI) estimate for the odds ratio is obtained by first
calculating the end points of a confidence interval for the coefficient β1 and then
exponentiating these values,

\[ \exp\big[\hat{\beta}_1 \pm z_{1-\alpha/2} \times \widehat{SE}(\hat{\beta}_1)\big] \]

As an example, consider the estimation of the odds ratio for the dichotomized variable
AGED. The point estimate is ÔR = 8.1 and the end points of a 95% confidence interval are

\[ \exp(2.094 \pm 1.96 \times 0.529) = (2.9,\ 22.9) \]
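
The interval above can be reproduced with a few lines of Python. This is only a sketch: the helper name `odds_ratio_ci` is an assumption made for illustration, and the inputs are the coefficient 2.094 and standard error 0.529 quoted in the text.

```python
import math


def odds_ratio_ci(beta_hat, se, z=1.96):
    """Wald interval for an odds ratio: exponentiate the limits for the coefficient."""
    lower = math.exp(beta_hat - z * se)
    upper = math.exp(beta_hat + z * se)
    return lower, upper


lower, upper = odds_ratio_ci(2.094, 0.529)
print(round(lower, 1), round(upper, 1))
```

Because the limits are symmetric on the log scale, the exponentiated interval is not symmetric about the point estimate.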

Conclusion
This interval is typical of the confidence intervals seen for odds ratios when the point
estimate exceeds 1: the confidence interval is skewed to the right. It suggests that CHD
among those 55 and older in the study population could be as little as 2.9 times or as
much as 22.9 times more likely than among those under 55, at the 95 percent level of
confidence.

We illustrate these computations in detail, as they demonstrate the general method for
computing estimates of odds ratios in logistic regression. The estimate of the log of the
odds ratio for any independent variable at two different levels, say x = a versus x = b, is
the difference between the estimated logits computed at these two values.

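\[
\begin{aligned}
\ln[\widehat{OR}(a, b)] &= \hat{g}(x = a) - \hat{g}(x = b) \\
&= (\hat{\beta}_0 + \hat{\beta}_1 \times a) - (\hat{\beta}_0 + \hat{\beta}_1 \times b) \\
&= \hat{\beta}_1 (a - b)
\end{aligned}
\tag{4.6}
\]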


The estimate of the odds ratio is obtained by exponentiating the logit difference,

\[ \widehat{OR}(a, b) = \exp[\hat{\beta}_1 \times (a - b)] \tag{4.7} \]

This expression is equal to exp(β̂1) only when (a − b) = 1. In 4.6 and 4.7 the notation
ÔR(a, b) is used to represent the odds ratio

\[ \widehat{OR}(a, b) = \frac{\hat{\pi}(x = a)/[1 - \hat{\pi}(x = a)]}{\hat{\pi}(x = b)/[1 - \hat{\pi}(x = b)]} \tag{4.8} \]

and when a = 1 and b = 0 we let ÔR = ÔR(1, 0). Some software packages offer a choice
of methods for coding design variables. The “zero-one” coding used so far in this section
is frequently referred to as reference cell coding.

There are two methods for coding the cells.

(1) Reference cell coding method

(2) Deviation from means coding method

Reference Cell Coding Method


The reference cell method typically assigns the value of zero to the lower code for x and
one to the higher code.
Example: If sex was coded as 1 = male and 2 = female, then the resulting design variable
under this method, D, would be coded 0 = male and 1 = female, and the variable SEX is
then treated as if it were interval scaled.

Sex (code)     Design variable
Male (1)        0
Female (2)      1

Table 4.3: Illustration of the coding of the design variable using the reference cell method.

Sex (code)     Design variable
Male (1)       -1
Female (2)      1

Table 4.4: Illustration of the coding of the design variable using the deviation from means
method.

Deviation From Means Coding Method


Another coding method is frequently referred to as deviation from means coding. This
method assigns the value of −1 to the lower code and the value 1 to the higher code. The
coding for the variable SEX discussed above is shown in Table 4.4. Suppose we wish to
estimate the odds ratio of female versus male when deviation from means coding is used.
We do this by using the general method shown in 4.6 and 4.7: the log odds ratio is the
difference between the estimated logits computed at the two values of the design variable.

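\[
\begin{aligned}
\ln[\widehat{OR}(\text{female}, \text{male})] &= \hat{g}(x = \text{female}) - \hat{g}(x = \text{male}) \\
&= \hat{g}(D = 1) - \hat{g}(D = -1) \\
&= [\hat{\beta}_0 + \hat{\beta}_1 \times (D = 1)] - [\hat{\beta}_0 + \hat{\beta}_1 \times (D = -1)] \\
&= 2\hat{\beta}_1
\end{aligned}
\]

and the estimated odds ratio is

\[ \widehat{OR}(\text{female}, \text{male}) = \exp(2\hat{\beta}_1) \]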

Thus, if we had simply exponentiated the coefficient from the computer output we would
have obtained the wrong estimate of the odds ratio. This points out quite clearly that we
must pay close attention to the method used to code the design variables.
The method of coding also influences the calculation of the end points of the confidence
interval. For the above example, using the deviation from means coding, the estimated
standard error needed for confidence interval estimation is ŜE(2β̂1), which is 2 × ŜE(β̂1).
Thus the end points of the confidence interval are

\[ \exp\big[2\hat{\beta}_1 \pm z_{1-\alpha/2}\, 2\,\widehat{SE}(\hat{\beta}_1)\big] \]

In general, the end points of the confidence interval for the odds ratio given in 4.7 are

\[ \exp\big[\hat{\beta}_1 (a - b) \pm z_{1-\alpha/2}\, |a - b|\, \widehat{SE}(\hat{\beta}_1)\big] \]

CHD status    White   Black          Hispanic       Other          Total
Present       5       20             15             10             50
Absent        20      10             10             10             50
Total         25      30             25             20             100
Odds ratio    1       8              6              4
95% CI        -       (2.3, 27.6)    (1.7, 21.3)    (1.1, 14.9)
ln(ÔR)        0.0     2.08           1.79           1.39

Table 4.5: Cross-classification of hypothetical data on RACE and CHD status for 100
subjects.

where |a − b| is the absolute value of (a − b).
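
The general expression can be illustrated with a short Python sketch. The function name `odds_ratio_between` is hypothetical. The 0/1 inputs are the AGED values quoted earlier; the −1/+1 inputs are an assumption made only for illustration, based on the fact that recoding a 0/1 variable as −1/+1 halves both the coefficient and its standard error.

```python
import math


def odds_ratio_between(beta_hat, se, a, b, z=1.96):
    """Odds ratio for x = a versus x = b with its Wald interval.

    Implements exp[beta*(a - b) +/- z * |a - b| * SE(beta)].
    """
    log_or = beta_hat * (a - b)
    half_width = z * abs(a - b) * se
    return (math.exp(log_or),
            math.exp(log_or - half_width),
            math.exp(log_or + half_width))


# Reference cell coding (0/1): a - b = 1, so the coefficient is the log odds ratio.
ref = odds_ratio_between(2.094, 0.529, a=1, b=0)

# Deviation from means coding (-1/+1) of the same variable (illustrative assumption):
# a - b = 2, so the logit difference is twice the halved coefficient.
dev = odds_ratio_between(2.094 / 2, 0.529 / 2, a=1, b=-1)

# Exponentiating the deviation-coded coefficient alone gives a different number.
naive = math.exp(2.094 / 2)

print([round(v, 1) for v in ref], [round(v, 1) for v in dev], round(naive, 2))
```

Both codings lead to the same odds ratio and interval once the logit difference is formed correctly, while `naive` shows what is obtained if the coding is ignored.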


Since we can control how we code our dichotomous variables, we recommend that, in most
situations, they be coded as 0 or 1 for analysis purposes. Each dichotomous variable is
then treated as an interval scale variable.

Summary:
For a dichotomous variable the parameter of interest is the odds ratio. An estimate of this
parameter may be obtained from the estimated logistic regression coefficient, regardless
of how the variable is coded. This relationship between the logistic regression coefficient
and the odds ratio provides the foundation for the interpretation of all logistic regression
results.

4.2 Polychotomous Independent Variable


Suppose that instead of two categories the independent variable has k > 2 distinct values.
For example, we may have variables that denote the county of residence within a state,
the clinic used for primary health care within a city, or race. Each of these variables has
a fixed number of distinct values and the scale of measurement is nominal.

In this section we present methods for creating design variables for polychotomous
independent variables. The choice of a particular method depends to some extent on the
goals of the analysis and the stage of model development.

We begin by extending the method presented in Table 4.1 for a dichotomous variable.
For example, suppose that in a study of CHD the variable RACE is coded at four levels, and
that the cross-classification of RACE by CHD yields the data in Table 4.5.

These data are hypothetical and have been formulated for ease of computation. The
extension to a situation where the variable has more than four levels is not conceptually
different, so all examples in this section use k = 4.

RACE (code)     RACE(2)   RACE(3)   RACE(4)
White (1)       0         0         0
Black (2)       1         0         0
Hispanic (3)    0         1         0
Other (4)       0         0         1

Table 4.6: Specification of the design variables for RACE using reference cell coding with
White as the reference group.

Variable    Coefficient   Std. Err.   z       P > |z|
RACE(2)     2.079         0.6325      3.29    0.001
RACE(3)     1.792         0.6466      2.78    0.006
RACE(4)     1.386         0.6708      2.07    0.039
CONSTANT    -1.386        0.5000      -2.77   0.006
Log likelihood = -62.2937

Table 4.7: Results of fitting the logistic regression model to the data in Table 4.5 using
the design variables in Table 4.6.

At the bottom of Table 4.5, the odds ratio is given for each race, using White as the
reference group. For example, for Hispanic the estimated odds ratio is

\[ \frac{15 \times 20}{5 \times 10} = 6 \]

The reference group is indicated by a value of 1 for the odds ratio.

These same estimates of the odds ratio may be obtained from a logistic regression program
with an appropriate choice of design variables. The method for specifying the design
variables involves setting all of them equal to zero for the reference group, and then
setting a single design variable equal to 1 for each of the other groups. This is
illustrated in Table 4.6. Use of any logistic regression program with design variables
coded as shown in Table 4.6 yields the estimated logistic regression coefficients given
in Table 4.7.
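
The coding rule just described, and the coefficients it leads to, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper `reference_cell_coding` is a hypothetical name, and the closed-form coefficients rely on the fact that a model with one categorical covariate reproduces the observed group logits, so no iterative fitting routine is used here.

```python
import math

# Counts from Table 4.5: (CHD present, CHD absent) for each race group.
counts = {
    "White": (5, 20),
    "Black": (20, 10),
    "Hispanic": (15, 10),
    "Other": (10, 10),
}


def reference_cell_coding(levels, reference):
    """Map each level to 0/1 design variables; the reference level is all zeros."""
    others = [lv for lv in levels if lv != reference]
    return {lv: tuple(1 if lv == o else 0 for o in others) for lv in levels}


def logit(present, absent):
    """Observed log odds of the outcome in one group."""
    return math.log(present / absent)


design = reference_cell_coding(list(counts), reference="White")

# Constant = logit of the reference group; each coefficient = logit difference.
constant = logit(*counts["White"])
coefs = {g: logit(*counts[g]) - constant for g in counts if g != "White"}

# Log likelihood evaluated at the observed group proportions.
log_lik = sum(p * math.log(p / (p + a)) + a * math.log(a / (p + a))
              for p, a in counts.values())

print(design)
print(round(constant, 3), {g: round(b, 3) for g, b in coefs.items()})
print(round(log_lik, 4))
```

The printed values can be compared with the coefficient column and log likelihood of Table 4.7.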

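\[
\begin{aligned}
\hat{g}(x) &= \hat{\beta}_0 + \sum_{l=1}^{3} \hat{\beta}_{jl} D_{jl} \\
&= \hat{\beta}_0 + (\hat{\beta}_{j1} D_{j1} + \hat{\beta}_{j2} D_{j2} + \hat{\beta}_{j3} D_{j3}) \\
&= \hat{\beta}_0 + (\hat{\beta}_1\,\mathrm{RACE(2)} + \hat{\beta}_2\,\mathrm{RACE(3)} + \hat{\beta}_3\,\mathrm{RACE(4)})
\end{aligned}
\]

\[
\begin{aligned}
\ln[\widehat{OR}(\text{Black}, \text{White})] &= \hat{g}(\text{Black}) - \hat{g}(\text{White}) \\
&= \big[\hat{\beta}_0 + \hat{\beta}_1(\mathrm{RACE(2)} = 1) + \hat{\beta}_2(\mathrm{RACE(3)} = 0) + \hat{\beta}_3(\mathrm{RACE(4)} = 0)\big] \\
&\quad - \big[\hat{\beta}_0 + \hat{\beta}_1(\mathrm{RACE(2)} = 0) + \hat{\beta}_2(\mathrm{RACE(3)} = 0) + \hat{\beta}_3(\mathrm{RACE(4)} = 0)\big] \\
&= \hat{\beta}_1
\end{aligned}
\]

also

\[
\begin{aligned}
\ln[\widehat{OR}(\text{Hispanic}, \text{White})] &= \hat{g}(\text{Hispanic}) - \hat{g}(\text{White}) \\
&= \big[\hat{\beta}_0 + \hat{\beta}_1(\mathrm{RACE(2)} = 0) + \hat{\beta}_2(\mathrm{RACE(3)} = 1) + \hat{\beta}_3(\mathrm{RACE(4)} = 0)\big] \\
&\quad - \big[\hat{\beta}_0 + \hat{\beta}_1(\mathrm{RACE(2)} = 0) + \hat{\beta}_2(\mathrm{RACE(3)} = 0) + \hat{\beta}_3(\mathrm{RACE(4)} = 0)\big] \\
&= \hat{\beta}_2
\end{aligned}
\]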

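\[
\begin{aligned}
\ln[\widehat{OR}(\text{Black}, \text{White})] &= \hat{g}(\text{Black}) - \hat{g}(\text{White}) = \hat{\beta}_1 \\
\widehat{OR}(\text{Black}, \text{White}) &= \frac{20 \times 20}{5 \times 10} = 8 \\
\hat{\beta}_1 &= \ln 8 = 2.079
\end{aligned}
\]

\[
\begin{aligned}
\ln[\widehat{OR}(\text{Hispanic}, \text{White})] &= \hat{g}(\text{Hispanic}) - \hat{g}(\text{White}) = \hat{\beta}_2 \\
\widehat{OR}(\text{Hispanic}, \text{White}) &= \frac{15 \times 20}{5 \times 10} = 6 \\
\hat{\beta}_2 &= \ln 6 = 1.792
\end{aligned}
\]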

RACE (code)     RACE(2)   RACE(3)   RACE(4)
White (1)       -1        -1        -1
Black (2)       1         0         0
Hispanic (3)    0         1         0
Other (4)       0         0         1

Table 4.8: Specification of the design variables for RACE using deviation from means coding.

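\[
\begin{aligned}
\ln[\widehat{OR}(\text{Other}, \text{White})] &= \hat{g}(\text{Other}) - \hat{g}(\text{White}) = \hat{\beta}_3 \\
\widehat{OR}(\text{Other}, \text{White}) &= \frac{10 \times 20}{5 \times 10} = 4 \\
\hat{\beta}_3 &= \ln 4 = 1.386
\end{aligned}
\]

For the first design variable (Black versus White), the estimated variance and standard
error of the coefficient are

\[
\begin{aligned}
\widehat{var}(\hat{\beta}_1) &= \frac{1}{5} + \frac{1}{20} + \frac{1}{20} + \frac{1}{10} = 0.40 \\
\widehat{SE}(\hat{\beta}_1) &= [\widehat{var}(\hat{\beta}_1)]^{1/2} = 0.6325
\end{aligned}
\]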

We begin by computing the confidence limits for the log odds ratio (the logistic regression
coefficient) and then exponentiate these limits to obtain limits for the odds ratio. In
general, the limits for a 100(1 − α)% confidence interval for a coefficient are of the form

\[ \hat{\beta}_j \pm z_{1-\alpha/2} \times \widehat{SE}(\hat{\beta}_j) \]
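
The following Python sketch applies these two steps to the counts of Table 4.5. It is only an illustration: the helper names are assumptions, and the standard error is computed as the square root of the sum of reciprocal cell counts, the same form used above for Black versus White.

```python
import math


def log_or_se(a, b, c, d):
    """Standard error of a log odds ratio from the four cell counts of a 2 x 2 table."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)


def wald_ci(beta_hat, se, z=1.96):
    """Limits for the coefficient, exponentiated to give limits for the odds ratio."""
    return math.exp(beta_hat - z * se), math.exp(beta_hat + z * se)


white = (5, 20)  # (present, absent) for the reference group
groups = {"Black": (20, 10), "Hispanic": (15, 10), "Other": (10, 10)}

results = {}
for name, (present, absent) in groups.items():
    beta_hat = math.log((present * white[1]) / (absent * white[0]))
    se = log_or_se(present, absent, white[0], white[1])
    results[name] = (beta_hat, se, wald_ci(beta_hat, se))

for name, (beta_hat, se, (low, high)) in results.items():
    print(name, round(math.exp(beta_hat), 1), round(low, 1), round(high, 1))
```

The printed intervals can be checked against the 95% CI row of Table 4.5.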

4.3 Deviation From Means Coding Method


The second method of coding design variables is called deviation from means coding.
This coding expresses an effect as the deviation of the “group mean” from the “overall mean”.

The estimated coefficients obtained using deviation from means coding may be used to
estimate the odds ratio for one category relative to a reference category. The equation
for the estimate is more complicated than the one obtained using reference cell coding.
However, it provides an excellent example of the basic principle of using the logit
difference to compute an odds ratio. To illustrate this we calculate the log odds ratio
of Black versus White using the coding for design variables given in Table 4.8. The logit
difference is as follows.

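\[
\begin{aligned}
\hat{g}(x) &= \hat{\beta}_0 + \sum_{l=1}^{3} \hat{\beta}_{jl} D_{jl} \\
&= \hat{\beta}_0 + (\hat{\beta}_{j1} D_{j1} + \hat{\beta}_{j2} D_{j2} + \hat{\beta}_{j3} D_{j3}) \\
&= \hat{\beta}_0 + (\hat{\beta}_1\,\mathrm{RACE(2)} + \hat{\beta}_2\,\mathrm{RACE(3)} + \hat{\beta}_3\,\mathrm{RACE(4)})
\end{aligned}
\]

\[
\begin{aligned}
\ln[\widehat{OR}(\text{Black}, \text{White})] &= \hat{g}(\text{Black}) - \hat{g}(\text{White}) \\
&= \big[\hat{\beta}_0 + \hat{\beta}_1(\mathrm{RACE(2)} = 1) + \hat{\beta}_2(\mathrm{RACE(3)} = 0) + \hat{\beta}_3(\mathrm{RACE(4)} = 0)\big] \\
&\quad - \big[\hat{\beta}_0 + \hat{\beta}_1(\mathrm{RACE(2)} = -1) + \hat{\beta}_2(\mathrm{RACE(3)} = -1) + \hat{\beta}_3(\mathrm{RACE(4)} = -1)\big] \\
&= 2\hat{\beta}_1 + \hat{\beta}_2 + \hat{\beta}_3
\end{aligned}
\tag{4.9}
\]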
To obtain a confidence interval we must estimate the variance of the sum of the
coefficients in 4.9. In this example, the estimator is

\[
\begin{aligned}
\widehat{var}\{\ln[\widehat{OR}(\text{Black}, \text{White})]\} &= 4\,\widehat{var}(\hat{\beta}_1) + \widehat{var}(\hat{\beta}_2) + \widehat{var}(\hat{\beta}_3) + 4\,\widehat{cov}(\hat{\beta}_1, \hat{\beta}_2) \\
&\quad + 4\,\widehat{cov}(\hat{\beta}_1, \hat{\beta}_3) + 2\,\widehat{cov}(\hat{\beta}_2, \hat{\beta}_3)
\end{aligned}
\]

The evaluation of 4.9 for the current example gives

\[ \ln[\widehat{OR}(\text{Black}, \text{White})] = 2(0.765) + 0.477 + 0.072 = 2.079 \]

The estimate of the variance is obtained by evaluating the variance estimator above which,
for the current example, yields

\[
\begin{aligned}
\widehat{var}\{\ln[\widehat{OR}(\text{Black}, \text{White})]\} &= 4(0.351)^2 + (0.362)^2 + (0.385)^2 + 4(-0.031) \\
&\quad + 4(-0.040) + 2(-0.0444) \\
&= 0.40
\end{aligned}
\]

and the standard error is

\[ \widehat{SE}\{\ln[\widehat{OR}(\text{Black}, \text{White})]\} = 0.6325 \]

We note that the values of the estimated log odds ratio, 2.079, and the estimated standard
error, 0.6325, are identical to the values of the estimated coefficient and standard error
for the first design variable in Table 4.7. This is expected, since the design variables
used to obtain the estimated coefficients in Table 4.7 were formulated specifically to
yield the log odds ratio relative to the White race category.

Variable    Coefficient   Std. Err.   z       P > |z|
RACE(2)     0.765         0.3506      2.18    0.029
RACE(3)     0.477         0.3623      1.32    0.188
RACE(4)     0.072         0.3846      0.19    0.852
CONSTANT    -0.072        0.2189      -0.33   0.742
Log likelihood = -62.2937

Table 4.9: Results of fitting the logistic regression model to the data in Table 4.5 using
the design variables in Table 4.8.
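
A short Python sketch may help to connect Table 4.9 with the counts in Table 4.5. It rests on an assumption stated here rather than in the text: under the coding of Table 4.8 the four group logits sum to four times the constant, so the constant is the average group logit and each coefficient is a group logit minus that average. The variance and covariance values passed to `variance_of_log_or` (a hypothetical helper name) are the ones quoted in the text.

```python
import math

# Observed group logits from Table 4.5: ln(present / absent).
logits = {
    "White": math.log(5 / 20),
    "Black": math.log(20 / 10),
    "Hispanic": math.log(15 / 10),
    "Other": math.log(10 / 10),
}

# Deviation from means coding: constant = average logit,
# coefficient = group logit minus the average.
constant = sum(logits.values()) / len(logits)
b1 = logits["Black"] - constant
b2 = logits["Hispanic"] - constant
b3 = logits["Other"] - constant

# Log odds ratio of Black versus White, as in 4.9.
log_or_black_white = 2 * b1 + b2 + b3


def variance_of_log_or(v1, v2, v3, c12, c13, c23):
    """Variance of 2*b1 + b2 + b3 by the rule for the variance of a sum."""
    return 4 * v1 + v2 + v3 + 4 * c12 + 4 * c13 + 2 * c23


var_hat = variance_of_log_or(0.351 ** 2, 0.362 ** 2, 0.385 ** 2,
                             -0.031, -0.040, -0.0444)
se_hat = math.sqrt(var_hat)

print(round(constant, 3), round(b1, 3), round(b2, 3), round(b3, 3))
print(round(log_or_black_white, 3), round(var_hat, 2), round(se_hat, 2))
```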

4.4 Continuous Independent Variable


When a logistic regression model contains a continuous independent variable, the
interpretation of the estimated coefficient depends on how the variable is entered into
the model and on the particular units of the variable. For the purpose of developing the
method to interpret the coefficient for a continuous variable, we assume that the logit
is linear in the variable.

Under the assumption that the logit is linear in the continuous variable, x, the equation
for the logit is g(x) = β0 + β1x. It follows that the slope coefficient, β1, gives the
change in the log odds for an increase of 1 unit in x, that is,

\[
\begin{aligned}
g(x) &= \beta_0 + \beta_1 x \\
g(x + 1) &= \beta_0 + \beta_1 (x + 1) \\
g(x + 1) - g(x) &= \beta_1
\end{aligned}
\]
Most often the value of “1” is not clinically interesting. For example, a 1 year increase
in age or a 1 mmHg increase in systolic blood pressure may be too small to be considered
important, but a change of 10 years or 10 mmHg might be considered more useful.

On the other hand, if the range of x is from zero to 1, then a change of 1 is too large
and a change of 0.01 may be more realistic. Hence, to provide a useful interpretation for
continuous scale covariates we need to develop a method for point and interval estimation
for an arbitrary change of “c” units in the covariate.

The log odds ratio for a change of c units in x is obtained from the logit difference,

\[
\begin{aligned}
g(x) &= \beta_0 + \beta_1 x \\
g(x + c) &= \beta_0 + \beta_1 (x + c) \\
g(x + c) - g(x) &= c \times \beta_1
\end{aligned}
\]

and the associated odds ratio is obtained by exponentiating this logit difference,

\[ OR(c) = OR(x + c, x) = \exp(c\beta_1) \]

An estimate may be obtained by replacing β1 with its maximum likelihood estimate β̂1.
An estimate of the standard error needed for confidence interval estimation is obtained
by multiplying the estimated standard error of β̂1 by c. Hence the end points of the
100(1 − α)% confidence interval estimate of OR(c) are

\[ \exp\big[c \times \hat{\beta}_1 \pm z_{1-\alpha/2} \times c \times \widehat{SE}(\hat{\beta}_1)\big] \]

Since both the point estimate and the end points of the confidence interval depend on the
choice of c, the particular value of c should be clearly specified in all tables and
calculations. The rather arbitrary nature of the choice of c may be troubling to some.

To provide the reader of our analysis with a clear indication of how the risk of the
outcome being present changes with the variable in question, changes in multiples of 5
or 10 may be most meaningful and easily understood.

As an example, consider the univariable model in Table 1.3. In that example a logistic
regression of AGE on CHD status using the data of Table 2.1 was reported. The
resulting estimated logit was

\[ \hat{g}(\mathrm{AGE}) = -5.310 + 0.111 \times \mathrm{AGE} \tag{4.10} \]

The estimated odds ratio for an increase of 10 years in age is

\[ \widehat{OR}(10) = \exp(10 \times 0.111) = 3.03 \]

This indicates that for every increase of 10 years in age, the risk of CHD increases 3.03
times.

The validity of such a statement is questionable in this example, since the additional
risk of CHD for a 40 year old compared to a 30 year old may be quite different from the
additional risk of CHD for a 60 year old compared to a 50 year old. This is an unavoidable
dilemma when continuous covariates are modeled linearly in the logit. If it is believed
that the logit is not linear in the covariate, then grouping and use of dummy variables
should be considered.

The end points of a 95% confidence interval for this odds ratio are

\[ \exp(10 \times 0.111 \pm 1.96 \times 10 \times 0.024) = (1.90,\ 4.86) \]
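
To show how strongly the reported odds ratio depends on the chosen change c, the sketch below tabulates the point estimate and 95% interval for a few values of c, using the coefficient 0.111 and standard error 0.024 from the example. The function name and the particular values of c other than 10 are assumptions made for illustration.

```python
import math


def odds_ratio_for_change(beta_hat, se, c, z=1.96):
    """Odds ratio and Wald interval for a change of c units in a continuous covariate."""
    log_or = c * beta_hat
    half_width = z * abs(c) * se
    return (math.exp(log_or),
            math.exp(log_or - half_width),
            math.exp(log_or + half_width))


table = {c: odds_ratio_for_change(0.111, 0.024, c) for c in (1, 5, 10)}

for c, (or_c, low, high) in table.items():
    print(c, round(or_c, 2), round(low, 2), round(high, 2))
```

The row for c = 10 corresponds to the estimate and interval given in the text; the other rows show why the value of c has to be stated alongside any reported odds ratio.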

Results similar to these may be placed in tables displaying the results of a fitted logistic
regression model.

In summary, the interpretation of the estimated coefficient for a continuous variable is
similar to that of a nominal scale variable: an estimated log odds ratio. The primary
difference is that a meaningful change must be defined for the continuous variable.

4.5 The Multivariable Model
In the previous sections we discussed the interpretation of an estimated logistic
regression coefficient in the case when there is a single variable in the fitted model.
Now we consider a multivariable analysis for a more comprehensive modeling of the data.
One goal of such an analysis is to statistically adjust the estimated effect of each
variable in the model for differences in the distributions of and associations among the
other independent variables. Applying this concept to a multivariable logistic regression
model, we may surmise that each estimated coefficient provides an estimate of the log odds
adjusting for all other variables included in the model.

A full understanding of the estimates of the coefficients from a multivariable logistic
regression model requires that we have a clear understanding of what is actually meant by
the term adjusting, statistically, for other variables. We begin by examining adjustment in
the context of a linear regression model, and then extend the concept to logistic regression.

The multivariable situation we examine is one in which the model contains two
independent variables,

(1) one dichotomous

(2) one continuous

but primary interest is focused on the effect of the dichotomous variable. This situation
is frequently encountered in epidemiologic research when an exposure to a risk factor is
recorded as being either present or absent, and we wish to adjust for a variable such as
age. The analogous situation in linear regression is called analysis of covariance.

For example, suppose we wish to compare the mean weight of two groups of boys. It is known
that weight is associated with many characteristics, one of which is age. Assume that on
all characteristics except age the two groups have nearly identical distributions. If the
age distribution is also the same for the two groups, then a univariate analysis would
suffice and we could compare the mean weight of the two groups. This comparison would
provide us with a correct estimate of the difference in weight between the two groups.
The statistical model that describes the situation in Figure 4.1 states that the value of
weight, w, may be expressed as

\[ w = \beta_0 + \beta_1 x + \beta_2 a \tag{4.11} \]

where x = 0 for group 1, x = 1 for group 2, and a denotes age.

In this model the parameter β1 represents the true difference in weight between the two
groups and β2 represents the rate of change in weight per year of age. Suppose that the
mean age of group 1 is ā1 and the mean age of group 2 is ā2.
This situation is described graphically in Figure 4.1. In this figure it is assumed that
the relationship between age and weight is linear, with the same significant nonzero slope
in each group.

[Figure: Weight (w) plotted against Age (a)]

Figure 4.1: Comparison of the weight of two groups of boys with different distributions of
age.

            Group 1               Group 2
Variable    Mean     Std. Dev.    Mean     Std. Dev.
PHY         0.36     0.485        0.80     0.404
AGE         39.60    5.272        47.34    5.259

Table 4.10: Descriptive statistics for two groups of 50 men on AGE and whether they
had seen a physician (PHY) (1 = Yes, 0 = No) within the last months.

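Comparison of the mean weight of group 1 to the mean weight of group 2 amounts to a
comparison of w1 to w2. In terms of the model this difference is

\[
\begin{aligned}
w_2 &= \beta_0 + \beta_1 + \beta_2 \bar{a}_2 \\
w_1 &= \beta_0 + \beta_2 \bar{a}_1 \\
(w_2 - w_1) &= \beta_1 + \beta_2 (\bar{a}_2 - \bar{a}_1)
\end{aligned}
\]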

Thus the comparison involves not only the true difference between the groups, β1, but also
a component, β2(ā2 − ā1), which reflects the difference between the ages of the groups.
The process of statistically adjusting for age involves comparing the two groups at some
common value of age. The value usually used is the mean of the two groups which, for the
example, is denoted by ā in Figure 4.1. In terms of the model this yields a comparison of
w4 to w3,

\[
\begin{aligned}
w_4 &= \beta_0 + \beta_1 + \beta_2 \bar{a} \\
w_3 &= \beta_0 + \beta_2 \bar{a} \\
w_4 - w_3 &= \beta_1 + \beta_2 (\bar{a} - \bar{a}) \\
&= \beta_1
\end{aligned}
\]

which is the true difference between the two groups.

In theory any common value of age would yield the same difference between the two lines.
The choice of the overall mean makes sense for two reasons: it is biologically reasonable,
and it lies within the range for which we believe that the association between age and
weight is linear and constant within each group.

Consider the same situation shown in Figure 4.1, but instead of weight being the dependent
variable, assume it is a dichotomous variable and that the vertical axis denotes the logit.
That is, under the model the logit is given by the equation

\[ g(x, a) = \beta_0 + \beta_1 x + \beta_2 a \]

A univariate comparison obtained from the 2 × 2 table cross-classifying outcome and group
would yield a log odds ratio approximately equal to β1 + β2(ā2 − ā1). This would
incorrectly estimate the effect of group due to the difference in the distribution of age.

Comparing the two groups at the common age ā, the logit difference is
g(x = 1, ā) − g(x = 0, ā) = β1. Thus, the coefficient β1 is the log odds ratio that we
would expect to obtain from a univariate comparison if the two groups had the same
distribution of age.

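Using the proportions in Table 4.10, the univariate log odds ratio for group 2 versus
group 1 is

\[
\begin{aligned}
\ln(\widehat{OR}) &= \ln(0.80/0.20) - \ln(0.36/0.64) \\
&= 1.962 \\
\widehat{OR} &= 7.11
\end{aligned}
\]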

We can also see that there is a considerable difference in the age distribution of the two
groups, the men in group 2 being on average more than 7 years older than those in group
1. We would guess that much of the apparent difference in the proportion of men seeing a
physician might be due to age. Analyzing the data with a bivariate model, using a coding
of GROUP = 0 for group 1 and GROUP = 1 for group 2, yields the estimated logistic
regression coefficients shown in Table 4.11. The age-adjusted odds ratio is
ÔR = exp(1.263) = 3.54. Thus, much of the apparent difference between the two groups is,
in fact, due to differences in age.

Variable    Coefficient   Std. Err.   z       P > |z|
GROUP       1.263         0.5361      2.36    0.018
AGE         0.107         0.0465      2.31    0.021
CONSTANT    -4.866        1.9020      -2.56   0.011
Log likelihood = -54.8292

Table 4.11: Results of fitting the logistic regression model to the data summarized in
Table 4.10.

Let us examine this adjustment in more detail using Figure 4.1. An approximation to the
unadjusted odds ratio is obtained by exponentiating the difference w2 − w1. In terms of
the fitted logistic regression model shown in Table 4.11, this difference is

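\[
\begin{aligned}
w_2 - w_1 &= (\beta_0 + \beta_1 + \beta_2 \bar{a}_2) - (\beta_0 + \beta_2 \bar{a}_1) \\
&= \beta_1 + \beta_2 (\bar{a}_2 - \bar{a}_1) \\
&= [-4.866 + 1.263 + 0.107(47.34)] - [-4.866 + 0.107(39.60)] \\
&= 1.263 + 0.107(47.34 - 39.60)
\end{aligned}
\]

The value of this odds ratio is

\[ e^{1.263 + 0.107(47.34 - 39.60)} = 8.09 \]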

The discrepancy between 8.09 and the actual unadjusted odds ratio, 7.11, arises because
the above comparison is based on the difference in the average logit, while the crude odds
ratio is approximately equal to a calculation based on the average estimated logistic
probability for the two groups.
The age-adjusted odds ratio is obtained by exponentiating the difference w4 − w3, which
is equal to the estimated coefficient for GROUP. In the example the difference is

\[
\begin{aligned}
w_4 - w_3 &= (\beta_0 + \beta_1 + \beta_2 \bar{a}) - (\beta_0 + \beta_2 \bar{a}) \\
&= \beta_1 \\
&= [-4.866 + 1.263 + 0.107(43.47)] - [-4.866 + 0.107(43.47)] \\
&= 1.263
\end{aligned}
\]
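
The three odds ratios discussed in this example (crude, approximated from the fitted model, and age-adjusted) can be recomputed with the Python sketch below. The variable and function names are illustrative assumptions; the numbers are the proportions and mean ages of Table 4.10 and the coefficients of Table 4.11.

```python
import math

# Proportions seeing a physician and mean ages (Table 4.10).
p1, p2 = 0.36, 0.80
age1, age2 = 39.60, 47.34

# Coefficients of the fitted bivariate model (Table 4.11).
b0, b_group, b_age = -4.866, 1.263, 0.107


def logit(group, age):
    """Fitted logit for a given group indicator (0 or 1) and age."""
    return b0 + b_group * group + b_age * age


# Crude odds ratio from the observed proportions.
crude_or = (p2 / (1 - p2)) / (p1 / (1 - p1))

# Logit difference at each group's own mean age (w2 - w1).
approx_crude_or = math.exp(logit(1, age2) - logit(0, age1))

# Logit difference at the common mean age (w4 - w3).
mean_age = (age1 + age2) / 2
adjusted_or = math.exp(logit(1, mean_age) - logit(0, mean_age))

print(round(crude_or, 2), round(approx_crude_or, 2), round(adjusted_or, 2))
```

Any common age could replace `mean_age` in the last step without changing `adjusted_or`, because the age term cancels in the difference.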

Bachand and Hosmer (1999) compare two different sets of criteria for defining a covariate
to be a confounder. They show that the numerical approach used in this section, examining
the change in the magnitude of the coefficient for the risk factor from logistic regression
models fit with and without the potential confounder, is appropriate when the logistic
regression model containing both risk factor and confounder is not fully S-shaped. The
method of adjustment when the variables are all dichotomous, polychotomous, continuous or
a mixture of these is identical to that just described for the dichotomous-continuous
variable case. For example, suppose that instead of treating age as continuous it was
dichotomized using a cutpoint of 45 years. To obtain the age-adjusted effect of group we
fit the bivariate model containing the two dichotomous variables and calculate a logit
difference at the two levels of group and a common value of the dichotomous variable for
age. This procedure is similar for any number and mix of variables. Adjusted odds ratios
are obtained by comparing individuals who differ only in the characteristic of interest
and have the values of all other variables constant.

4.6 Interaction And Confounding


In the last section we saw how the inclusion of additional variables in a model provides
a way of statistically adjusting for potential differences in their distributions. The
term confounder is used by epidemiologists to describe a covariate that is associated with
both the outcome variable of interest and a primary independent variable or risk factor.
When both associations are present, the relationship between the risk factor and the
outcome variable is said to be confounded.

In this section we introduce the concept of interaction and show how we can control
for its effect in the logistic regression model. In addition, we illustrate with an example
how confounding and interaction may affect the estimated coefficients in the model.

If the association between the covariate (i.e. age) and the outcome variable is the same
within each level of the risk factor (i.e. group), then there is no interaction between
the covariate and the risk factor.

Graphically, the absence of interaction yields a model with two parallel lines, one for
each level of the risk factor variable. In general, the absence of interaction is
characterized by a model that contains no second or higher order terms involving two or
more variables. When interaction is present, the association between the risk factor and
the outcome variable differs, or depends in some way on the level of the covariate. That
is, the covariate modifies the effect of the risk factor. Epidemiologists use the term
effect modifier to describe a variable that interacts with a risk factor. In the previous
example, if the logit is linear in age for the men in group 1, then interaction implies
that the logit does not follow a line with the same slope for the second group. In theory,
the association in group 2 could be described by almost any model except one with the
same slope as the logit for group 1.

Figure 4.2 presents the graphs of three different logits. In this graph, 4 has been added
to each of the logits to make plotting more convenient. The graphs of these logits are
used to explain what is meant by interaction. Consider an example where the outcome
variable is the presence or absence of CHD, the risk factor is sex, and the covariate is
age. Suppose that the line labeled l1 corresponds to the logit for females as a function
of age. Line l2 represents the logit for males.

[Figure: lines l1, l2 and l3 plotted as Log Odds + 4 against AGE, for ages 35 to 65]

Figure 4.2: Plot of the logits under three different models showing the presence and
absence of interaction.
Model   Constant   SEX      AGE      SEX × AGE   Deviance   G
1       0.060      1.981                         419.816
2       -3.3374    1.356    0.082                407.78     12.036
3       -4.216     4.239    0.103    -0.062      406.392    1.388

Table 4.12: Estimated logistic regression coefficients, deviance, and the likelihood ratio
test statistic (G) for an example showing evidence of confounding but no interaction
(n = 400).

These two lines are parallel to each other, indicating that the relationship between age
and CHD is the same for males and females. In this situation there is no interaction and
the log odds ratio for sex (male versus female), controlling for age, is given by the
difference between lines l2 and l1, l2 − l1. This difference is equal to the vertical
distance between the two lines, which is the same for all ages.
Suppose instead that the logit for males is given by the line l3. This line is steeper than
the line l1 for females, indicating that the relationship between age and CHD among
males is different from that among females. When this occurs we say there is an
interaction between age and sex. The estimate of the log odds ratio for sex (male versus
female), controlling for age, is still given by the vertical distance between the lines,
l3 − l1, but this difference now depends on the age at which the comparison is being made.
Thus, we cannot estimate the odds ratio for sex without first specifying the age at which
the comparison is being made. In other words, age is an effect modifier.

Model   Constant   SEX       AGE      SEX × AGE   Deviance   G
1       0.201      2.386                          376.712
2       -6.672     1.2774    0.166                338.688    38.024
3       -4.825     -7.838    0.121    0.205       330.654    8.034

Table 4.13: Estimated logistic regression coefficients, deviance, and the likelihood ratio
test statistic (G) for an example showing evidence of confounding and interaction
(n = 400).

Results of Table 4.12 and Table 4.13


Tables 4.12 and 4.13 present the results of fitting a series of logistic regression models
to two different sets of hypothetical data. The variables in each of the data sets are the
same: SEX, AGE and the outcome variable CHD. In addition to the estimated coefficients, the
deviance for each model is given. Recall that the change in the deviance may be used to
test for the significance of coefficients for variables added to the model. An interaction
is added to the model by creating a variable that is equal to the product of the value of
SEX and the value of AGE. Some programs have syntax that automatically creates interaction
variables in a statistical model, while others require the user to create them through a
data modification step.

Examining the results in Table 4.12, we see that the estimated coefficient for the variable
SEX changed from 1.981 in model 1 to 1.356, a 46 percent decrease, when AGE was
added in model 2. Hence, there is clear evidence of a confounding effect due to AGE.
When the interaction term “SEX × AGE” is added in model 3, we see that the change in
the deviance is only 1.388 which, when compared to the chi-square distribution with 1
degree of freedom, yields a p-value of 0.24, which is clearly not significant.

Note that the coefficient for SEX changed from 1.356 to 4.239. This is not surprising
since the inclusion of an interaction term, especially when it involves a continuous
variable, usually produces fairly marked changes in the estimated coefficients of
dichotomous variables involved in the interaction. Thus, when an interaction term is
present in the model we cannot assess confounding via the change in a coefficient. For
these data we would prefer to use model 2, which suggests that age is a confounder but
not an effect modifier.

The results in Table 4.13 show evidence of both confounding and interaction due to age.
Comparing model 1 and model 2 we see that the coefficient for SEX changes from 2.386 to
1.2774, an 87 percent decrease. When the age by sex interaction is added to the model we
see that the change in the deviance is 8.034, with a p-value of 0.005. Since the change
in the deviance is significant we prefer model 3 to model 2 and should regard age as both
a confounder and an effect modifier. The net result is that any estimate of the odds ratio
for sex should be made with reference to a specific age.
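
The p-values and the percent change quoted for Tables 4.12 and 4.13 can be checked with the Python sketch below. It uses the identity that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its upper tail probability can be written with the complementary error function; the function names are assumptions made for illustration.

```python
import math


def chi2_sf_1df(g):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(g / 2.0))


def percent_change(b_without, b_with):
    """Change in a coefficient relative to its value in the larger model, in percent."""
    return 100.0 * (b_without - b_with) / b_with


p_interaction_412 = chi2_sf_1df(1.388)   # model 3 versus model 2, Table 4.12
p_interaction_413 = chi2_sf_1df(8.034)   # model 3 versus model 2, Table 4.13
change_sex_412 = percent_change(1.981, 1.356)

print(round(p_interaction_412, 2), round(p_interaction_413, 3), round(change_sex_412))
```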

4.7 Estimation Of Odds Ratios In The Presence Of
Interaction
In the previous section we showed that when there is interaction between a risk factor and
another variable, the estimate of the odds ratio for the risk factor depends on the value
of the variable that is interacting with it. In this situation we may not be able to
estimate the odds ratio by simply exponentiating an estimated coefficient. One approach
that will always yield the correct model-based estimate is to

(1) Write down the expressions for the logit at the two levels of the risk factor being
compared.

(2) Algebraically simplify the difference between the two logits and compute the value.

(3) Exponentiate the value obtained in step 2.

As a first example, we develop the method for a model containing only two variables and
their interaction.

In this model, denote the risk factor by F, the covariate by X, and their interaction by
F×X. The logit at F=f and X=x is,
g(f, x) = β0 + β1 f + β2 x + β3 f × x (4.12)
Assume we want the odds ratio comparing two levels of F, F=f1 versus F=f0, at
X=x. Following the three-step procedure, we first evaluate the expressions for the two
logits, yielding

g(f1 , x) = β0 + β1 f1 + β2 x + β3 f1 × x
g(f0 , x) = β0 + β1 f0 + β2 x + β3 f0 × x

Second, we compute and simplify their difference to obtain the log-odds ratio, yielding

ln[OR(F = f1 , F = f0 , X = x)] = g(f1 , x) − g(f0 , x)
= (β0 + β1 f1 + β2 x + β3 f1 × x)
−(β0 + β1 f0 + β2 x + β3 f0 × x)
= β1 (f1 − f0 ) + β3 × x(f1 − f0 )

Third, we obtain the odds ratio by exponentiating the difference obtained in step 2, yielding

OR = exp[β1 (f1 − f0 ) + β3 × x(f1 − f0 )] (4.13)

The expression for the log-odds ratio in 4.13 does not simplify to a single coefficient. Instead
it involves two coefficients, the difference in the values of the risk factor, and the value of the
interaction variable. The estimator of the log-odds ratio is obtained by replacing the parameters in
4.13 with their estimators. To obtain a confidence interval, we calculate the end points of the
confidence interval for the log-odds ratio and then exponentiate them. The basic building block is
the estimator of the variance of the estimator of the log-odds ratio in 4.13. Using methods for
calculating the variance of a sum we obtain the following estimator.

\begin{align}
\widehat{\mathrm{var}}\{\ln[\widehat{OR}(F = f_1, F = f_0, X = x)]\} &= (f_1 - f_0)^2 \times \widehat{\mathrm{var}}(\hat\beta_1) \tag{4.14}\\
&\quad + [x(f_1 - f_0)]^2 \times \widehat{\mathrm{var}}(\hat\beta_3) \tag{4.15}\\
&\quad + 2x(f_1 - f_0)^2 \times \widehat{\mathrm{cov}}(\hat\beta_1, \hat\beta_3) \tag{4.16}
\end{align}

The end points of a 100(1 − α)% confidence interval estimator for the log-odds ratio are,

\[
[\hat\beta_1 (f_1 - f_0) + \hat\beta_3 \times x(f_1 - f_0)] \pm z_{1-\alpha/2}\, \widehat{SE}\{\ln[\widehat{OR}(F = f_1, F = f_0, X = x)]\} \tag{4.17}
\]

where the standard error in 4.17 is the positive square root of the variance estimator in
4.14 to 4.16. We obtain the end points of the confidence interval estimator for the odds ratio by
exponentiating the end points in 4.17.

The estimators for the log-odds ratio and its variance simplify in the case when F is a
dichotomous risk factor. If we let f1 = 1 and f0 = 0, then the estimator of the log-odds ratio
is,

ln[ÔR(F = 1, F = 0, X = x)] = β̂1 + β̂3 x (4.18)

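The estimator of the variance is,

\[
\widehat{\mathrm{var}}\{\ln[\widehat{OR}(F = 1, F = 0, X = x)]\} = \widehat{\mathrm{var}}(\hat\beta_1) + x^2\, \widehat{\mathrm{var}}(\hat\beta_3) + 2x\, \widehat{\mathrm{cov}}(\hat\beta_1, \hat\beta_3) \tag{4.19}
\]

and the end points of the confidence interval are,

\[
(\hat\beta_1 + \hat\beta_3 x) \pm z_{1-\alpha/2}\, \widehat{SE}\{\ln[\widehat{OR}(F = 1, F = 0, X = x)]\} \tag{4.20}
\]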

Example:
We consider a logistic regression model using the low birth weight data described in section
1.6, containing the variable AGE and a dichotomous variable, LWD, based on the weight
of the mother at the last menstrual period. This variable takes the value 1 if LWT < 110
pounds, and is zero otherwise. The results of fitting a series of logistic regression models
are given in Table 4.14. Using the estimated coefficient for LWD in model 1, we estimate
the odds ratio as exp(1.054) = 2.87. The results shown in Table 4.14 indicate that AGE
is not a strong confounder (Δβ̂% = 4.2 percent), but it does interact with LWD (p = 0.076).
Thus, to assess the risk of low weight at the last menstrual period correctly, we must
include the interaction of this variable with the woman's age, because the odds ratio is not
constant over age.

An effective way to see the presence of interaction is via a graph of the estimated logit
under model 3 in Table 4.14. This is shown in Figure 4.3.

Model   Constant   LWD      AGE      LWD × AGE   ln[l(β)]   G      P
0       -0.790                                   -117.34
1       -1.054     1.054                         -113.12    8.44   0.004
2       -0.027     1.010    -0.044               -112.14    1.96   0.160
3        0.774    -1.944    -0.080    0.132      -110.57    3.14   0.076

Table 4.14: Estimated logistic regression coefficients, log-likelihood, the likelihood
ratio test statistic (G), and the p-value for the change for models containing LWD
and AGE from the low birth weight data (n = 189)
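
The G and P columns of Table 4.14 can be reproduced from the log-likelihood column. The following is a minimal sketch, assuming each model adds exactly one parameter to the previous one (so the reference distribution is chi-square with 1 degree of freedom); the function name `likelihood_ratio_test` is ours and is not taken from any statistical package.

```python
import math


def likelihood_ratio_test(loglik_reduced, loglik_full):
    """Return (G, p) for two nested models that differ by one parameter."""
    g = 2.0 * (loglik_full - loglik_reduced)
    # Survival function of the chi-square distribution with 1 degree of
    # freedom: P(X > g) = erfc(sqrt(g / 2)).
    p = math.erfc(math.sqrt(g / 2.0))
    return g, p


# Log-likelihoods of models 0 to 3 as reported in Table 4.14.
loglik = [-117.34, -113.12, -112.14, -110.57]

for k in range(1, len(loglik)):
    g, p = likelihood_ratio_test(loglik[k - 1], loglik[k])
    print(f"model {k} vs model {k - 1}: G = {g:.2f}, p = {p:.3f}")
```

Because the tabulated log-likelihoods are rounded to two decimals, the printed values may differ from the table in the last digit.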

            Constant      LWD          AGE          LWD × AGE
Constant     0.828
LWD         -0.828        2.975
AGE         -0.352E-02   -0.353E-01    0.157E-02
LWD × AGE   -0.352E-01   -0.128       -0.157E-02    0.573E-02

Table 4.15: Estimated covariance matrix for the estimated parameters in model 3 of Table
4.14.

The upper line in Figure 4.3 corresponds to the estimated logit for women with LWD=1
and the lower line is for women with LWD=0. Separate plotting symbols have been used
for the two LWD groups. The estimated log-odds ratio for LWD=1 versus LWD=0 at
AGE=x from 4.18 is equal to the vertical distance between the two lines at AGE=x. We see in
Figure 4.3 that none of the women in the low weight group, LWD=1, are older than about
33 years. Thus we should restrict our estimates of the effect of low weight to the range
of 14 to 33 years. Based on these observations we estimate the effect of low weight at
15, 20, 25 and 30 years of age.

Using 4.18 and the results for model 3, the estimated log-odds ratio for low weight at the
last menstrual period for a woman of AGE = a is,
ln[ÔR(LWD = 1, LWD = 0, AGE = a)] = −1.944 + 0.132 × a (4.21)
In order to obtain the estimated variance we must first obtain the estimated covariance
matrix. Since the matrix is symmetric, most logistic regression software packages print the
results in a form similar to that shown in Table 4.15.

The estimated variance of the log-odds ratio given in 4.21 is obtained from 4.19 and is
v̂ar{ln[ÔR(LWD = 1, LWD = 0, AGE = a)]} = 2.975 + a² × 0.0057 + 2 × a × (−0.128) (4.22)

Age      15          20          25          30
OR       1.04        2.01        3.90        7.55
95% CI   0.29-3.79   0.91-4.44   1.71-8.88   1.95-29.19

Table 4.16: Estimated odds ratios and 95 percent confidence intervals for LWD, controlling
for AGE.

Values of the estimated odds ratio and 95% confidence interval (CI), computed using 4.21 and
4.22 for several ages, are given in Table 4.16. The results shown in Table 4.16 demonstrate that
the effect of LWD on the odds of having a low birth weight baby increases exponentially with age.
The results also show that the increase in risk is significant for low weight women 25 years
and older. In particular, low weight women of age 30 are estimated to have a risk that is
about 7.5 times that of women of the same age who are not low weight. The increase in risk
could be as little as two times or as much as 29 times, with 95% confidence.
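
The calculation behind Table 4.16 can be sketched in a few lines. This is only an illustration, assuming the rounded coefficients of model 3 in Table 4.14, the rounded entries of Table 4.15, and z = 1.96 for a 95% interval; the function name `odds_ratio_with_interaction` is hypothetical. Because the inputs are rounded, the output may differ slightly from the values printed in Table 4.16.

```python
import math


def odds_ratio_with_interaction(b1, b3, var_b1, var_b3, cov_b13, x, z=1.96):
    """Odds ratio and confidence limits for F = 1 versus F = 0 at X = x."""
    log_or = b1 + b3 * x                                   # equation 4.18
    variance = var_b1 + x ** 2 * var_b3 + 2 * x * cov_b13  # equation 4.19
    se = math.sqrt(variance)
    lower = log_or - z * se                                # equation 4.20
    upper = log_or + z * se
    return math.exp(log_or), math.exp(lower), math.exp(upper)


# Rounded estimates for LWD and LWD x AGE (Table 4.14, model 3) and the
# corresponding variances and covariance (Table 4.15).
B_LWD, B_LWD_AGE = -1.944, 0.132
VAR_LWD, VAR_LWD_AGE, COV_LWD_LWD_AGE = 2.975, 0.00573, -0.128

for age in (15, 20, 25, 30):
    odds_ratio, low, high = odds_ratio_with_interaction(
        B_LWD, B_LWD_AGE, VAR_LWD, VAR_LWD_AGE, COV_LWD_LWD_AGE, age
    )
    print(f"AGE {age}: OR = {odds_ratio:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```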

Chapter 5

Model-Building Strategies And Methods For Logistic Regression

In the previous chapters we focused on estimating, testing, and interpreting the coefficients
in a logistic regression model. The examples discussed were characterized by having
few independent variables, and there was perceived to be only one possible model. While
there may be situations where this is the case, it is more typical that there are many
independent variables that could potentially be included in the model. Hence, we need to
develop a strategy and associated methods for handling these more complex situations.

The goal of any method is to select those variables that result in a “best” model within
the scientific context of the problem. In order to achieve this goal we must have:
(1) A basic plan for selecting the variables for the model

(2) A set of methods for assessing the adequacy of the model both in terms of its
individual variables and its overall fit.
We suggest a general strategy that considers both of these areas. Successful modeling of a
complex data set is part science, part statistical methods, and part experience and common
sense. It is our goal to provide the reader with a paradigm that, when applied thoughtfully,
yields the best possible model within the constraints of the available data.

5.1 Variable selection


The criteria for including a variable in a model may vary from one problem to the next
and from one scientific discipline to another. The traditional approach to statistical model
building involves seeking the most parsimonious model that still explains the data. The
rationale for minimizing the number of variables in the model is that the resultant model
is more likely to be numerically stable, and is more easily generalized. The more variables
included in a model, the greater the estimated standard errors become, and the more
dependent the model becomes on the observed data. Epidemiologic methodologists suggest
including all clinically and intuitively relevant variables in the model, regardless of their
“statistical significance”. The rationale for this approach is to provide as complete
control of confounding as possible within the given dataset. This is based on the fact that
it is possible for individual variables not to exhibit strong confounding, but when taken
collectively, considerable confounding can be present in the data.

The major problem with this approach is that the model may be “overfit”, producing
numerically unstable estimates. Overfitting is typically characterized by unrealistically large
estimated coefficients and/or estimated standard errors. This may be especially troublesome
in problems where the number of variables in the model is large relative to the
number of subjects and/or when the overall proportion responding (y = 1) is close to
either 0 or 1.

There are several steps one can follow to aid in the selection of variables for a
logistic regression model. The process of model building is quite similar to the one used in
linear regression.
(1) The selection process should begin with a careful univariable analysis of each variable.
For nominal, ordinal, and continuous variables with few integer values, we suggest
this be done with a contingency table of outcome (y = 0, 1) versus the k levels of
the independent variable. The likelihood ratio chi-square test with k−1 degrees of
freedom is exactly equal to the value of the likelihood ratio test for the significance
of the coefficients for the k−1 design variables in a univariable logistic regression model
that contains that single independent variable. Since the Pearson chi-square test is
asymptotically equivalent to the likelihood ratio chi-square test, it may also be used.
In addition to the overall test, it is a good idea, for those variables exhibiting at least
a moderate level of association, to estimate the individual odds ratios (along with
confidence limits) using one of the levels as the reference group.

For continuous variables, the most desirable univariable analysis involves fitting
a univariable logistic regression model to obtain the estimated coefficient, the
estimated standard error, the likelihood ratio test for the significance of the
coefficient, and the univariable Wald statistic. An alternative analysis, which is equivalent
at the univariable level, may be based on the two-sample t-test. Descriptive statistics
available from the two-sample t-test analysis generally include group means, standard
deviations, the t-statistic, and its p-value. The similarity of this approach to the
logistic regression analysis follows from the fact that the univariable linear discriminant
function estimate of the logistic regression coefficient is

\[
\frac{\bar{x}_1 - \bar{x}_0}{s_p^2} = \frac{t}{s_p}\sqrt{\frac{1}{n_1} + \frac{1}{n_0}} \tag{5.1}
\]

and that the linear discriminant function and the maximum likelihood estimate of
the logistic regression coefficient are usually quite close when the independent variable
is approximately normally distributed within each of the outcome groups, y = 0, 1.
Thus the univariable analysis based on the t-test should be useful in determining
whether the variable should be included in the model, since the p-value should be
of the same order of magnitude as that of the Wald statistic, or likelihood ratio test
from logistic regression.
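
The identity in 5.1 can be checked numerically. The sketch below is only an illustration: the two small samples are hypothetical values invented for the check, and the function name `discriminant_slope_and_t` is ours. It computes the pooled standard deviation, the two-sample t-statistic, and the linear discriminant function estimate of the slope.

```python
import math
from statistics import mean, variance


def discriminant_slope_and_t(x1, x0):
    """Return (slope, t, s_p, n1, n0) for the groups y = 1 (x1) and y = 0 (x0)."""
    n1, n0 = len(x1), len(x0)
    # Pooled variance of the two outcome groups.
    sp2 = ((n1 - 1) * variance(x1) + (n0 - 1) * variance(x0)) / (n1 + n0 - 2)
    sp = math.sqrt(sp2)
    diff = mean(x1) - mean(x0)
    t = diff / (sp * math.sqrt(1 / n1 + 1 / n0))   # two-sample t-statistic
    slope = diff / sp2                              # left-hand side of 5.1
    return slope, t, sp, n1, n0


# Hypothetical covariate values for subjects with y = 1 and y = 0.
group_1 = [52, 61, 58, 66, 70]
group_0 = [41, 39, 55, 48, 50, 44]

slope, t, sp, n1, n0 = discriminant_slope_and_t(group_1, group_0)
rhs = t / sp * math.sqrt(1 / n1 + 1 / n0)           # right-hand side of 5.1
print(f"slope = {slope:.4f}, t-based form = {rhs:.4f}")
```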

For continuous covariates, we may wish to supplement the evaluation of the univariable
logistic fit with some sort of smoothed scatterplot. This plot is helpful, not
only in ascertaining the potential importance of the variable and the possible presence
and effect of extreme (large or small) observations, but also its appropriate
scale. One simple and easily computed form of a smoothed scatterplot was
illustrated in Figure 1.2.

Other, more complicated methods that have greater precision are available.

Kay and Little (1987) illustrate the use of a method proposed by Copas (1983).
This method requires computing a smoothed value for the response variable for each
subject that is a weighted average of the values of the outcome variable over all subjects.
The weight for each subject is a continuous decreasing function of the distance
of the value of the covariate for the subject under consideration from the value of
the covariate for all other cases. For example, for covariate x for the ith subject we
compute

\[
\bar{y}_i = \frac{\sum_{j=i_l}^{i_u} w(x_i, x_j)\, y_j}{\sum_{j=i_l}^{i_u} w(x_i, x_j)} \tag{5.2}
\]

where w(x_i, x_j) represents a particular weight function. For example, if we use
STATA's scatterplot smooth command, ksm, with the weight option and bandwidth
k, then

\[
w(x_i, x_j) = \left[1 - \left(\frac{|x_i - x_j|}{\Delta}\right)^3\right]^3 \tag{5.3}
\]

where Δ is defined so that the maximum value for the weight is ≤ 1 and the two
indices defining the summation, i_l and i_u, include the k percent of the n subjects
with x values closest to x_i. Other weight functions are possible, as well as additional
smoothing using locally weighted least squares regression, called lowess in some
packages.
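
A weighted-average smooth of this kind can be sketched as follows. This is a simplified illustration and not the STATA implementation: the function name `tricube_smooth` is hypothetical, the bandwidth k is given as a fraction rather than a percent, the scaling constant 1.0001 used for Δ is an assumption of this sketch, and the ages and outcomes are invented values.

```python
def tricube_smooth(x, y, k=0.5):
    """Smoothed outcome for each subject, using the fraction k of the
    subjects whose covariate values are closest to that subject's value."""
    n = len(x)
    m = max(2, min(n, round(k * n)))      # number of subjects in each window
    smoothed = []
    for xi in x:
        # Indices of the m subjects closest to xi (the subject itself included).
        nearest = sorted(range(n), key=lambda j: abs(x[j] - xi))[:m]
        # Delta slightly exceeds the largest distance in the window, so every
        # weight lies in (0, 1]. The factor 1.0001 is an assumption.
        delta = 1.0001 * max(abs(x[j] - xi) for j in nearest)
        if delta == 0.0:
            weights = [1.0] * len(nearest)
        else:
            weights = [(1.0 - (abs(x[j] - xi) / delta) ** 3) ** 3 for j in nearest]
        numerator = sum(w * y[j] for w, j in zip(weights, nearest))
        smoothed.append(numerator / sum(weights))
    return smoothed


# Hypothetical ages and binary outcomes (y = 0 or 1).
ages = [20, 23, 25, 30, 34, 41, 47, 52, 58, 63]
outcome = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

for age, value in zip(ages, tricube_smooth(ages, outcome, k=0.5)):
    print(age, round(value, 3))
```

Each smoothed value is a proportion between 0 and 1, so it can be converted to the logit scale before plotting against the covariate.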
(2) Upon completion of the univariable analysis, we select variables for the multivariable
analysis. Any variable whose univariable test has a p-value < 0.25 is a candidate for
the multivariable model, along with all variables of known clinical importance. Once
the variables have been identified, we begin with a model containing all of the
selected variables.

Our recommendation that the 0.25 level be used as a screening criterion for variable
selection is based on the work by Bendel and Afifi (1977) on linear regression and
on the work by Mickey and Greenland (1989) on logistic regression. These authors
show that use of a more traditional level (such as 0.05) often fails to identify
variables known to be important. Use of the higher level has the disadvantage of
including variables that are of questionable importance at the model building stage.
For this reason, it is important to review all variables added to a model critically
before a decision is reached regarding the final model.

As noted above, the issue of variable selection is made complicated by different
analytic philosophies as well as different statistical methods. One school of thought
argues for the inclusion of all scientifically relevant variables in the multivariable
model regardless of the results of univariable analyses. In general, the
appropriateness of the decision to begin the multivariable model with all possible
variables depends on the overall sample size and the number in each outcome group
relative to the total number of candidate variables. When the data are adequate
to support such an analysis, it may be useful to begin the multivariable modeling
from this point. However, when the data are inadequate, this approach can produce
a numerically unstable multivariable model. In this case the
Wald statistics should not be used to select variables because of the unstable nature
of the results. Instead, we should select a subset of variables based on results of the
univariable analyses and refine the definition of “scientifically relevant”.

Another approach to variable selection is to use a stepwise method in
which variables are selected either for inclusion in or exclusion from the model in a
sequential fashion based solely on statistical criteria. There are two main versions of
the stepwise procedure.

(a) Forward selection with a test for backward elimination

(b) Backward elimination followed by a test for forward selection.

Specific algorithms are used to define these procedures in logistic regression. The stepwise
approach is useful and intuitively appealing in that it builds models in a sequential
fashion and it allows for the examination of a collection of models which might not
otherwise have been examined.
“Best subsets selection” is a selection method that has not been used extensively in
logistic regression.

Stepwise, best subsets, and other mechanical selection procedures have been criticized
because they can yield a biologically implausible model [Greenland (1989)] and can
select irrelevant, or noise, variables [Flack and Chang (1987), Griffiths and Pope (1987)].
The problem is not the fact that the computer can select such models, but rather
that the analyst fails to scrutinize the resulting model carefully, and reports such a
result as the final, best model. The wide availability and ease with which stepwise
methods can be used has undoubtedly reduced some analysts to the role of assisting
the computer in model selection rather than the more appropriate alternative. It is
only when the analyst understands the strengths, and especially the limitations, of
the methods that these methods can serve as useful tools in the model-building
process. The analyst, not the computer, is ultimately responsible for the review
and evaluation of the model.

(3) Following the fit of the multivariable model, the importance of each variable included
in the model should be verified. This should include (a) an
examination of the Wald statistic for each variable and (b) a comparison of each
estimated coefficient with the coefficient from the model containing only that variable.
Variables that do not contribute to the model based on these criteria should
be eliminated and a new model should be fit. The new model should be compared to the
old, larger model using the likelihood ratio test. Also, the estimated coefficients for
the remaining variables should be compared to those from the full model. In particular,
we should be concerned about variables whose coefficients have
changed markedly in magnitude. This indicates that one or more of the excluded
variables was important in the sense of providing a needed adjustment of the effect
of the variable that remained in the model. This process of deleting, refitting, and
verifying continues until it appears that all of the important variables are included
in the model and those excluded are clinically and/or statistically unimportant.

At this point, we suggest that any variable not selected for the original multivariable
model be added back into the model. This step can be helpful in identifying
variables that, by themselves, are not significantly related to the outcome but make
an important contribution in the presence of other variables.

We refer to the model at the end of step (3) as the preliminary main effects model.

(4) Once we have obtained a model that we feel contains the essential
variables, we should look more closely at the variables in the model. The question
of the appropriate categories for discrete variables should have been addressed at
the univariable stage. For continuous variables we should check the assumption of
linearity in the logit.

Assuming linearity in the logit at the variable selection stage is common practice and is
consistent with the goal of determining whether a particular variable
should be in the model. Graphs for several different relationships between the
logit and a continuous independent variable are shown in Figure 5.1. Figure 5.1
illustrates the cases when the logit is

(a) Linear
(b) Quadratic
(c) Some other nonlinear continuous relationship
(d) Binary, where there is a cutpoint above and below which the logit is constant.

Figure 5.1: Different types of models for the relationship between the logit and a continuous
variable. Curves: (a) linear, (b) quadratic, (c) other nonlinear, (d) binary. Horizontal axis:
covariate x. Vertical axis: log-odds or logit.

In each of the situations described in Figure 5.1, fitting a linear model would yield a
significant slope. Once the variable is identified as important, we can obtain the correct
parametric relationship or scale in the model refinement stage. The exception to this would be
the rare instance where the function is U-shaped. We refer to the model at the
end of step (4) as the main effects model.

(5) Once we have refined the main effects model and ascertained that each of the
continuous variables is scaled correctly, we check for interactions among the variables
in the model. In any model, an interaction between two variables implies that the
effect of one of the variables is not constant over levels of the other. For example,
an interaction between sex and age implies that the slope coefficient for age is
different for males and females. The final decision as to whether an interaction term
should be included in a model should be based on statistical as well as practical
considerations. Any interaction term in the model must make sense from a clinical
perspective.

We address the clinical plausibility issue by creating a list of possible pairs of
variables in the model that have some scientific basis to interact with each other. The
interaction variables are created as the arithmetic product of the pairs of main effect
variables. We add the interaction variables, one at a time, to the model containing
all the main effects and assess their significance using a likelihood ratio test. We
feel that interactions must contribute to the model at traditional levels of statistical
significance. Inclusion of an interaction term in the model that is not significant
typically increases the estimated standard errors without changing the point estimates.
In general, for an interaction term to alter both point and interval estimates, the
estimated coefficient for the interaction term must be statistically significant.

In step (1) we mentioned that one way to examine the scale of the
covariate is to use a scatterplot smooth, plotting the results on the logit scale.
Unfortunately, scatterplot smoothing methods are not easily extended to multivariable
models and thus have limited applicability in the model refinement step. However,
it is possible to extend the grouping type of smooth shown in Figure 2.2 to multivariable
models.

The procedure is easily implemented within any logistic regression package and is
based on the following observation. The difference, adjusted for other model covariates,
between the logits for two different groups is equal to the value of an estimated
coefficient from a fitted logistic regression model that treats the grouping variable
as categorical. We have found that the following implementation of the grouped
smooth is usually adequate for purposes of visually checking the scale of a continuous
covariate.

First, using the descriptive statistics capabilities of most any statistical package,
obtain the quartiles of the distribution of the variable. Next, create a categorical
variable with 4 levels using three cutpoints based on the quartiles. Other grouping
strategies can be used, but one based on quartiles seems to work well in practice. Fit
the multivariable model, replacing the continuous variable with the 4-level categorical
variable. To do this, 3 design variables must be used, with the lowest quartile
serving as the reference group. Following the fit of the model, plot the estimated
coefficients versus the midpoints of the groups. In addition, plot a coefficient equal to
zero at the midpoint of the first quartile. To aid in the interpretation we connect the
four plotted points. Visually inspect the plot and choose the most logical parametric
shape(s) for the scale of the variable.
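
The data preparation for this grouped smooth can be sketched as below. This is an illustration under stated assumptions: the function name `quartile_design` and the age values are hypothetical, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the quartile definition of a given statistical package. Fitting the model itself is left to the logistic regression package.

```python
from statistics import quantiles


def quartile_design(values):
    """Return (groups, design, midpoints) for a continuous covariate.

    groups    : quartile group of each value, coded 0 (lowest) to 3
    design    : three 0/1 design variables per value, lowest quartile as reference
    midpoints : midpoint of each of the four quartile groups
    """
    q1, q2, q3 = quantiles(values, n=4)          # three cutpoints
    cutpoints = [q1, q2, q3]
    # The group code is the number of cutpoints that the value exceeds.
    groups = [sum(v > c for c in cutpoints) for v in values]
    design = [[int(g == level) for level in (1, 2, 3)] for g in groups]
    edges = [min(values), q1, q2, q3, max(values)]
    midpoints = [(edges[i] + edges[i + 1]) / 2 for i in range(4)]
    return groups, design, midpoints


# Hypothetical ages of 20 subjects.
ages = list(range(21, 61, 2))
groups, design, midpoints = quartile_design(ages)
print("midpoints:", midpoints)
print("first subject:", groups[0], design[0])
print("last subject:", groups[-1], design[-1])
```

The three estimated coefficients of the design variables, together with a zero for the reference group, would then be plotted against `midpoints`.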

The next step is to refit the model using the possible parametric forms suggested
by the plot and choose one that is significantly different from the linear model and
makes clinical sense. It is possible that two or more different parameterizations of the
covariate will yield similar models in the sense that they are significantly different
from the linear model. However, it is our experience that one of the possible models
will be more appealing clinically, thus yielding more easily interpreted parameter
estimates.

Another more analytic approach is to use the method of fractional polynomials.

5.2 Fractional polynomial


The method of fractional polynomials was developed by Royston and Altman (1994) to suggest
transformations. We wish to determine what value of x^p yields the best model for the covariate.
In theory, we could incorporate the power, p, as an additional parameter in the estimation
procedure. However, this greatly increases the complexity of the estimation problem.
Royston and Altman propose replacing full maximum likelihood estimation of the power
by a search through a small but reasonable set of possible values. Hosmer and Lemeshow
(1999) provide a brief introduction to the use of fractional polynomials when fitting a
proportional hazards regression model. This material provides the basis for our discussion
of its application to logistic regression.

The method of fractional polynomials may be used with a multivariable logistic regression
model, but for sake of simplicity, we describe the procedure using a model with a single
continuous covariate.

The logit that is linear in the covariate is

g(x, β) = β0 + xβ1 (5.4)
where β denotes the vector of model coefficients. One way to generalize this function is to
specify it as

\[
g(x, \beta) = \beta_0 + \sum_{j=1}^{J} F_j(x)\, \beta_j \tag{5.5}
\]

The functions F_j(x) are a particular type of power function. The value of the first function
is F_1(x) = x^{p_1}. In theory, the power, p_1, could be any number, but in most applied
settings it makes sense to try to use something simple. Royston and Altman (1994) propose
restricting the power to be among those in the set Ω = {−2, −1, −0.5, 0, 0.5, 1, 2, 3}, where
p_1 = 0 denotes the log of the variable. The remaining functions are defined as,

\[
F_j(x) = \begin{cases} x^{p_j}, & p_j \neq p_{j-1} \\ F_{j-1}(x)\, \ln(x), & p_j = p_{j-1} \end{cases}
\]

for j = 2, 3, ..., restricting the powers to those in Ω. For example, if we choose two
functions (J = 2) with p_1 = 0 and p_2 = −0.5, then the logit, using 5.5, is
g(x, β) = β0 + F1(x)β1 + F2(x)β2
with F1(x) = ln(x) and F2(x) = 1/√x, that is,

\[
g(x, \beta) = \beta_0 + \beta_1 \ln(x) + \beta_2 \frac{1}{\sqrt{x}}
\]

Variable   Coeff.   Std.Err   ÔR      95% CI          G       P
AGE         0.018   0.0153    1.20*   (0.89, 1.62)    1.40    0.237
BECK       -0.008   0.0103    0.96+   (0.87, 1.06)    0.64    0.425
NDRGTX     -0.075   0.0247    0.93    (0.88, 0.97)   11.84    0.001
IVHX2      -0.481   0.2657    0.62    (0.37, 1.04)     -       -
IVHX3      -0.775   0.2166    0.46    (0.30, 0.70)   13.35    0.001
RACE        0.459   0.2110    1.58    (1.04, 2.39)    4.62    0.032
TREAT       0.437   0.1931    1.55    (1.06, 2.26)    5.18    0.023
SITE        0.264   0.2034    1.30    (0.87, 1.94)    1.67    0.197

Table 5.1: Univariable logistic regression models for the USI (n = 575)

As another example, if we choose two functions (J = 2) with p_1 = 2 and p_2 = 2, then the
logit is,

g(x, β) = β0 + β1 x² + β2 x² ln(x)
The model is quadratic in x when J = 2 with p_1 = 1 and p_2 = 2. Again, we could allow the
covariate to enter the model with any number of functions, J, but in most applied settings
an adequate transformation may be found if we use J = 1 or 2.
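
The construction of the functions F_j(x) can be sketched directly from the definition. This is a minimal illustration: the names `fp_terms` and `fp_logit` are hypothetical, the covariate is assumed to be strictly positive, and the coefficients in the usage lines are arbitrary values chosen only to show the call.

```python
import math

# Candidate powers proposed by Royston and Altman (1994); 0 denotes ln(x).
OMEGA = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_terms(x, powers):
    """Return [F_1(x), ..., F_J(x)] for x > 0 and a sequence of powers."""
    if x <= 0:
        raise ValueError("fractional polynomials require x > 0")
    terms = []
    previous_power = None
    for p in powers:
        if p == previous_power:
            # Repeated power: multiply the previous function by ln(x).
            term = terms[-1] * math.log(x)
        else:
            term = math.log(x) if p == 0 else x ** p
        terms.append(term)
        previous_power = p
    return terms


def fp_logit(x, powers, beta):
    """g(x, beta) = beta_0 + sum over j of F_j(x) * beta_j."""
    return beta[0] + sum(b * f for b, f in zip(beta[1:], fp_terms(x, powers)))


print(fp_terms(4.0, (0, -0.5)))   # ln(x) and 1/sqrt(x)
print(fp_terms(3.0, (2, 2)))      # x^2 and x^2 ln(x)
print(fp_logit(2.0, (1, 2), beta=(0.5, 0.1, -0.02)))  # quadratic in x
```

A search over pairs of powers in `OMEGA` would refit the model for each pair and keep the one with the largest log-likelihood; that fitting step is not shown here.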

Example:
As an example of the model-building process, consider the analysis of the UMARU IMPACT
study (USI). The study is described in section 2.6 and a code sheet for the data
is shown in Table 2.8. Briefly, the goal of the analysis is to determine whether there is a
difference between the two treatment programs after adjusting for potential confounding
and interaction variables.

* Odds ratio for a 10-year increase in AGE.

+ Odds ratio for a 5-point increase in BECK.

One outcome of considerable public health interest is whether or not a subject remained
drug free for at least one year from randomization to treatment (DFREE in Table 2.8).
A total of 147 of the 575 subjects (25.57%) considered in the analyses in this text remained
drug free for at least one year. The analyses in this chapter are primarily designed to
demonstrate specific aspects of logistic model building.
The results of fitting the univariable logistic regression models to these data are given in
Table 5.1. In this table we present, for each variable listed in the first column, the following
information.
(1) The estimated slope coefficient(s) for the univariable logistic regression model
containing only this variable
(2) The estimated standard error of the estimated slope coefficient

Variable         Coeff.      Std.Err   z       P > |z|
AGE               0.050      0.0173     2.91    0.004
NDRGTX           -0.062      0.0256    -2.40    0.016
IVHX2            -0.603      0.2873    -2.10    0.036
IVHX3            -0.733      0.2523    -2.90    0.004
RACE              0.226      0.2233     1.01    0.311
TREAT             0.443      0.1993     2.22    0.026
SITE              0.149      0.2172     0.68    0.494
Constant         -204        0.5548    -4.34   <0.001
Log likelihood   -309.6241

Table 5.2: Results for a multivariable model containing the covariates significant at the
0.25 level in Table 5.1.

(3) The estimated odds ratio, which is obtained by exponentiating the estimated
coefficient. For the variable AGE the odds ratio is for a 10-year increase, and for BECK
it is for a 5-point increase. This was done since a change of 1 year or 1 point would
not be clinically meaningful.
(4) The 95% CI for the odds ratio.
(5) The likelihood ratio test statistic, G, for the hypothesis that the slope coefficient
is zero. Under the null hypothesis, this quantity follows the chi-square distribution
with 1 degree of freedom, except for the variable IVHX, where it has 2 degrees of
freedom.
(6) The significance level for the likelihood ratio test.
With the exception of the Beck score, there is evidence that each of the variables has some
association (p < 0.25) with the outcome, remaining drug free for at least one year (DFREE).
The covariate recording history of intravenous drug use (IVHX) is modeled via two design
variables using “1=Never” as the reference code. Thus its likelihood ratio test has two
degrees of freedom. We begin the multivariable model with all but BECK. The results of
fitting the multivariable model are given in Table 5.2. The results in Table 5.2, when
compared to Table 5.1, indicate weaker associations for some covariates when controlling
for other variables. In particular, the significance level for the Wald test for the coefficient
for SITE is p = 0.494 and for RACE is p = 0.311. Strict adherence to conventional levels
of statistical significance would dictate that we consider a smaller model deleting these
covariates. However, due to the fact that subjects were randomized to treatment within
site, we keep SITE in the model. On consultation with our colleagues we were advised that
race is an important control variable. Thus, on the basis of subject matter considerations,
we keep RACE in the model.
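
The screening rule used to choose the starting multivariable model can be written as a short filter. The sketch below uses the p-values of the likelihood ratio tests in Table 5.1 (one value for the two-degree-of-freedom test of IVHX); the function name `screen` and its `forced` argument, which stands for variables kept on clinical grounds, are hypothetical.

```python
# p-values of the univariable likelihood ratio tests from Table 5.1.
univariable_p = {
    "AGE": 0.237,
    "BECK": 0.425,
    "NDRGTX": 0.001,
    "IVHX": 0.001,
    "RACE": 0.032,
    "TREAT": 0.023,
    "SITE": 0.197,
}


def screen(p_values, level=0.25, forced=()):
    """Variables whose p-value is below the screening level, plus any
    variables forced into the model because of known clinical importance."""
    return [name for name, p in p_values.items() if p < level or name in forced]


print(screen(univariable_p))          # every variable except BECK
print(screen(univariable_p, 0.05))    # a stricter level drops AGE and SITE too
```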

The next step in the modeling process is to check the scale of the continuous covariates
in the model, AGE and NDRGTX in this case. One approach to developing the order in
which to check for scale is to rank the continuous variables by their respective significance
levels. Results in Table 5.2 suggest that we consider AGE and then NDRGTX.
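
The ranking can be reproduced from the Wald statistics. The sketch below assumes the rounded coefficients and standard errors of Table 5.2 and a two-sided normal reference distribution; the function name `wald_test` is hypothetical, and the computed z values may differ from the table in the second decimal because the inputs are rounded.

```python
import math


def wald_test(coef, se):
    """Wald statistic z = coef / se and its two-sided normal p-value."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# (coefficient, standard error) of the continuous covariates in Table 5.2.
continuous = {"AGE": (0.050, 0.0173), "NDRGTX": (-0.062, 0.0256)}

# Smallest p-value first: the order in which to check the scale.
ranked = sorted(continuous, key=lambda name: wald_test(*continuous[name])[1])
for name in ranked:
    z, p = wald_test(*continuous[name])
    print(f"{name}: z = {z:.2f}, p = {p:.3f}")
```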

Chapter 6

Descriptive Data Analysis

Data analysis and the results obtained


Detailed data analysis is used to present the results and basic characteristics of a
distribution in an accurate and attractive form, so that a reader can easily get a rough idea
of the corresponding distribution by examining the report.

First, I studied the reasons for the requirement of installing a teller machine within
the university premises for public usage, and here I perform a data analysis
of the collected data.

To do this, I have chosen the entire academic and non-academic staff as the
population of my study.
I therefore begin by giving a rough idea of the distribution of the
lecturers, students, and non-academic staff of the Wellamadama premises.
Situation                          Size of the population
Academic staff                     185
Temporary & demonstration staff    110
Non-academic staff                 399
Security staff                     45
Internal students                  5109
External students                  704
Total                              6552
Now, let us analyse this distribution in detail according to the distinct
faculties and disciplines.
First, let us have a close look at the structure of the academic staff. Basically, it can be
divided into two parts.
(1.) Lecturers
(2.) Others

Figure 6.1: Size of the University population

All lecturers working in each faculty are put into the first category, and
all the remaining permanent and temporary academics fall into the second category.

First, consider the lecturers of the faculties of Science, Arts,
Management & Finance, and Fisheries & Marine Science. They are shown in the
following table according to their faculties.

Faculty                       Academic staff    Percentage
H & SS                                    80         43.24
Management & Finance                      28         15.14
Science                                   67         36.22
Fisheries & Marine Science                10          5.41
Total                                    185           100

These figures show clearly that a considerable number
of lecturers work in the Arts and Science faculties. The reason is that most
of the students of the university study in these two faculties. On the other
hand, only a few lecturers work in the Management & Finance and Fisheries & Marine
Science faculties, because these faculties have fewer students.
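
The percentage column of the academic staff table can be reproduced directly from the counts. The following is a minimal sketch, not part of the original analysis; the function name `percentage_shares` and the dictionary layout are illustrative assumptions, and only the counts come from the table.

```python
# Academic staff counts per faculty, copied from the table above.
academic_staff = {
    "H & SS": 80,
    "Management & Finance": 28,
    "Science": 67,
    "Fisheries & Marine Science": 10,
}


def percentage_shares(counts):
    """Return each category's share of the total, in percent (2 decimals)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Hypothetical helper applied to the table data.
shares = percentage_shares(academic_staff)
for faculty, share in shares.items():
    print(f"{faculty:<28}{academic_staff[faculty]:>5}{share:>8.2f}")
```

The same helper can be applied to any of the count tables in this chapter to check a percentage column against its counts.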

Next, we can study the distribution of the temporary and demonstrator staff of the
university.

Faculty                       Demonstrator staff    Percentage
H & SS                                        17         15.45
Management & Finance                           5          4.55
Science                                       75         68.18
Fisheries & Marine Science                    13         11.82
Total                                        110           100

Figure 6.2: Academic staff

When we study the above graph, it is clear that the distribution of the temporary and
demonstrator staff is completely different from that of the academic
staff. Here the temporary academic staff of the Science faculty is much larger
than that of the other faculties of the university.

The reason is that the Science faculty needs a considerable number of temporary
staff members to guide the students in their practical sessions.
Then let us analyse the distribution of the non-academic staff of the university.

Faculty                       Non-academic staff
Administration                               262
H & SS                                        39
Management & Finance                          10
Science                                       84
Fisheries & Marine Science                     4
Total                                        399

If we analyse the above graph, we can see that most of the non-academic members
work under the administration branch. Within the Finance branch, the Administration
branch, the Library and the faculties, they are distributed as

(1.) Administrative officers

(2.) Clerical staff

(3.) Technical officers

(4.) Minor and other staff members

Figure 6.3: Temporary and Demonstrator staff

The students of the university are its major group, and it is important
to study their distribution within the university. Basically we can divide them into two
parts as follows.
(1.) Internal students
(2.) External students
Internal students study in the Arts, Science, Management and Fisheries & Marine
Science faculties, and most of them stay in the university hostels. Moreover,
most of them are of the same age.
External students, by contrast, have their own places of residence and belong to
different age groups, job categories and social positions.
Up to this point, I have described the distribution and nature of the academic staff,
the non-academic staff and the students of the Wellamadama premises.

But my main intention was to study the reasons for requiring the installation of
a teller machine within the university premises for public use. To do this, I
chose a sample of 600 entities out of a total collection of 7500. In doing so, I was
careful to choose them from each and every faculty.
Here I considered the savings accounts of the customers and their connection with the
government and private banks.

Owns a savings account    No. of savings accounts    Percentage
Yes                                           484          96.8
No                                             16           3.2
Total                                         500

Figure 6.4: Nonacademic staff

Figure 6.5: Number of saving accounts for the sample taken out of university premises.

Here the data were divided into students and academic & non-academic staff for
convenience of analysis.

Has a savings account         Yes     No    Total
Student                       297     15      312
Academic & Non-academic       187      1      188
Total                         484     16      500

According to the above graphs, it is clear that about 97% of the sample maintain
accounts in government or private banks. Some of them maintain their accounts
in the People’s Bank and Bank of Ceylon branches attached to the university. Now I
analyse these data.

Figure 6.6: Analysis of savings accounts for the sample taken out of university premises.

First I separated the savings account holders from my sample, and the resulting set
was divided into holders of university People’s Bank accounts and holders of university
Bank of Ceylon accounts. Furthermore, those sets were divided into academic staff,
non-academic staff and students. Following these steps gave the analysis a clearer shape.

University bank     No. of accounts    Percentage
People’s Bank                   210          42
Bank of Ceylon                   68          13.6
Both                            140          28
No                               82          16.4
Total                           500         100

Now let us consider the savings accounts maintained by the students.

University bank             People’s Bank    Bank of Ceylon    Both     No    Total
Student                               144                20      30    118
Academic & Non-academic                66                48      52     22
Total                                 210                68      82    140      500

Among them, a considerable number of students maintain their accounts in the People’s
Bank branch attached to the university rather than in the Bank of Ceylon. The
reason is that they have to collect their bursaries and Mahapola scholarship
payments through the People’s Bank.

It is also striking to see the connection between the academic and non-academic staff
and the banks attached to the university: the staff are the major customers of both
banks. We can see this clearly from the following facts.
To obtain them, I took all the staff members who keep their savings accounts in these
banks.

Figure 6.7: University staff & students who maintain their accounts with the university
People’s Bank & Bank of Ceylon, according to my sample.

Bank              Accounts         Money (Rs)
People’s Bank          215      49,449,300.00
Bank of Ceylon         228       3,954,310.00
Total                  443      53,430,610.00

According to the above graphs, the larger number of staff members maintain
their savings accounts in the Bank of Ceylon.

Now let us get some idea of the deposits that these accounts bring to the banks.
Comparing the above two graphs, we can say that the People’s Bank accounts for
the larger part of the money in circulation.

So I decided to study this further and divided the above set into two parts,
academic and non-academic.

                   People’s Bank    Bank of Ceylon
Academic                     122                58
Non-academic                  93               170
Total                        215               228

According to the above graphs, most of the academic staff members
maintain their accounts in the People’s Bank, and most of the non-academic staff
members maintain their accounts in the Bank of Ceylon branch attached to the university.

                   People’s Bank     Bank of Ceylon
Academic            3,653,590.00       1,957,860.00
Non-academic        1,335,750.00       1,996,450.00
Total              49,499,300.00       3,954,310.00

Figure 6.8: Analysis of university staff & students who maintain their accounts with the
university People’s Bank & Bank of Ceylon, according to my sample.

Generally the academic salary scale is a little higher than the non-academic salary scale.
Given this, it is natural to observe a larger amount of money in the People’s Bank
relative to the Bank of Ceylon.

Now I study the savings accounts that the sample members maintain
in banks other than the branches attached to the university.

Bank               No. of accounts    Percentage
People’s Bank                   95          19.0
Bank of Ceylon                  72          14.4
Commercial                      52          10.4
Seylan                          14           2.8
Sampath                         55          11.0
Other                          212          42.4
Total                          500         100

Dividing this sample further (into students and academic & non-academic staff) gives a
clearer picture of the distribution.

Bank                   People’s    BOC    Commercial    Seylan    Sampath    Other    Total
Student                      73     57            36         7         25      115
Academic & Non-aca.          22     15            16         7         30       97
Total                        95     72            52        14         55      212      500

According to the above information, most of them maintain savings
accounts in the government banks as well as in the private banks. This is
especially marked among the students. Strong competition among the banks and their
efforts to introduce new accounts for the younger generation are the major reasons for this.

Figure 6.9: Number of accounts obtained in university banks

The private sector, especially the Commercial and Sampath banks, has a well-developed
computer network, so money can be withdrawn at any time in any part of
the island. Since most of the university students stay away from their homes,
they tend to maintain their accounts in the private banks.

At present, even though the Seylan Bank is very popular among the business
community, it is not so popular among ordinary people, as
the above graphs show.
So far I have studied the customers who maintain accounts in the government
and private banks. But about 3% of the sample members do not maintain any
kind of savings account: 15 students and one non-academic staff
member out of the whole sample. To investigate this situation, I first studied the
effect of their place of residence.

Place of living                      No. of members
In the Matara urban area                         10
Outside the Matara urban area                     4
Hostel inside the university                      2
Hostel outside the university                    16

According to the above graphs, most of them (10) live around Matara
town, and they may take pocket money from home to meet their daily needs. Because
of this, they may not have wanted to maintain a savings account.

Figure 6.10: How the money of payments is divided between the university banks.

Moreover, our non-academic member was unaware of these savings
accounts. So it seems that, if such members are made aware of these accounts, they
are likely to open savings accounts of their own.
Next I analyse the data further. But before that I should
recall how the sample was divided.

First I randomly selected a sample of 500 entities out of a total collection of 7000.
Then the sample was divided into two parts: members who maintain savings accounts
and members who do not. Their preference for holding teller cards differs from one
to another, so I decided to investigate the reasons for this.

The following graphs show the number of sample members who hold and who do not
hold teller cards.

Obtained teller card    No. of people    Percentage
Yes                               380         78.51
No                                104         21.49
Total                             484        100.00

To study their requirements in detail and more conveniently, the sample was
divided into students and academic & non-academic staff.

Figure 6.11: Number of accounts obtained in university banks

Obtained teller card          Yes     No    Total
Student                       252     45
Academic & Non-academic       128     59
Total                         380    104      484

According to the above graphs, most of the sample members have obtained the teller
facility. Relative to the academic and non-academic staff members, a larger share of the
students have obtained it. Most of the students stay away from their
homes, have to make their bursary and pocket money last through the month, and
deposit their money in the banks for security, so it is reasonable for them
to use teller cards.
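
The comparison between students and staff can be checked by turning the counts of the teller card table into rates per group. This is a small illustrative sketch; the names `teller_card` and `holding_rate` are assumptions introduced here, and only the counts are taken from the table.

```python
# Teller card holders among savings account holders, copied from the table above.
teller_card = {
    "Student": {"Yes": 252, "No": 45},
    "Academic & Non-academic": {"Yes": 128, "No": 59},
}


def holding_rate(row):
    """Percentage of a group that holds a teller card."""
    return 100 * row["Yes"] / (row["Yes"] + row["No"])


# Rate within each group, and the overall rate across both groups.
rates = {group: holding_rate(row) for group, row in teller_card.items()}
overall = 100 * sum(r["Yes"] for r in teller_card.values()) / sum(
    r["Yes"] + r["No"] for r in teller_card.values()
)

for group, rate in rates.items():
    print(f"{group:<26}{rate:6.2f}%")
print(f"{'Overall':<26}{overall:6.2f}%")
```

Comparing rates rather than raw counts removes the effect of the two groups having different sizes in the sample.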

But when we consider the non-academic staff, we can see a
drop in teller card usage, and there may be a variety of reasons behind this.
One possibility is that they think that, since money can be withdrawn at any
time through the teller machine, it could easily be wasted. On the other hand, most of
them are afraid of new technology.

Then I decided to analyse this case according to the places where the teller machines
have been established.

Bank                         Obtained teller card
University People’s Bank                      261
Other bank                                     78
Both                                           41
No card                                       104
Total                                         484

Figure 6.12: How the money of payments is divided between the university banks.

Figure 6.13: University staff and students who maintain their accounts with the
government & private banks, according to my sample.

To study this case further, I divided the above sample into students and academic &
non-academic staff.

                            Uni. People’s Bank    Other bank    Both    No card    Total
Student                                    163            53      20         59
Academic & Non-academic                     98            25      21         45
Total                                      261            78      41        104      484

According to the above graphs, it is clear that academic as well as non-academic staff
members use the teller facilities of the People’s Bank branch attached to the
university. On the other hand, most of the students hold a private bank
teller card because of the facilities those banks provide, and they use their People’s
Bank savings accounts only to collect their bursaries and Mahapola scholarships. Most
of them wish to close these accounts when they pass out from the university, which is
not good from the point of view of the government banks.

Figure 6.14: University staff and students who maintain their accounts with the
government & private banks, according to my sample.

Figure 6.15: Number of sample members who have no savings account.

From the above analysis I found that, even though most of the university population
uses the teller facilities provided by the People’s Bank branch attached to the
university, this is not sufficient. This situation may arise for the following
reasons.

(1.) The teller machine is established outside the university premises, so people
have to go outside to withdraw money. These people are very busy,
and this is not convenient at all.

(2.) The machine fails frequently.

(3.) The journey from withdrawing the money to bringing it back to the university is
not secure.

(4.) The teller machine is a little far from the university premises,
and clashes between the university students and the
young men living in the village are frequent. So most of the university people
are somewhat afraid to withdraw money in the evening. For example, a few months ago
someone attempted to commit a robbery there.

Figure 6.16: Sample members who have obtained the teller facility

Because of these things, I decided to
investigate the requirement for installing a teller machine within the university
premises.
The following are the views expressed by my sample members about their needs.

At what place are you interested?    No. interested
Inside the university                           472
Outside the university                           28
Total                                           500

To analyse this further, I divided my sample into students and academic &
non-academic staff.

                            Inside the university    Outside the university    Total
Student                                       297                        15
Academic & Non-academic                       175                        13
Total                                         472                        28      500

According to the above graphs, it is clear that most of them want a
teller machine installed within the university premises, and there are several reasons
behind this.
(1.) It is very convenient for withdrawing money.

(2.) It saves time.

(3.) It is much more secure.

Figure 6.17: The sample members have obtained Teller facility or not.

Figure 6.18: Where the sample members have obtained teller facility.

Because of these things, the female students who stay within the university premises
could also withdraw money without any hesitation.

While the majority of the sample want a teller machine installed within
the university premises, there are a few who do not wish for such a facility. This minority
thinks that, with this facility available, their expenses will increase beyond their
control. From my point of view, holding on to such ideas does a lot of harm to the
majority.

In my study, I paid special attention to the Bank of Ceylon branch situated within
the university premises. Now I analyse the sample members who would be ready
to use the facility if a Bank of Ceylon teller machine were available.

Figure 6.19: Where the sample members have obtained the teller facility.

Figure 6.20: New teller machine to be within or outside the university area.

Figure 6.21: New teller machine to be within or outside the university area.

Figure 6.22: The number of customers who are interested in opening a new account.

Are you interested in a new account?    No. of newcomers
Interested                                           196
Having                                               140
Not interested                                       164
Total                                                500

If we analyse the above graphs, we can clearly see that about 140 customers have
dealt with the Bank of Ceylon for a long time and wish to get the teller facility.
Moreover, there are 196 prospective new customers who are willing to open savings accounts.
This is very beneficial for the bank.

Figure 6.23: Are you interested in opening a new account?

Figure 6.24: Who are interested in the new teller facility

Are you interested in
opening a new account?    Interested    Having    Not interested    Total
Student                          141        56               115      312
Academic staff                    11        11                 7       29
Temporary staff                   14         4                30       48
Non-academic staff                31        69                11      111
Total                                                                 500

The above graph also indicates the members who do not wish to deal with the bank. Most
of them are final-year students and temporary academic staff. They are going to leave the
university very soon, so they do not want to maintain accounts any further.

Are you interested in
opening a new account?    Uses    Not uses    Total
Student                    197         115
Academic staff              22           7
Temporary staff             17          31
Non-academic staff         100          11
Total                      336         164      500

Installing a teller machine within the university premises will bring many indirect
benefits to the banking sector. The majority of the academic and non-academic
members are already ready to obtain credit card facilities from the bank once the
teller machine mentioned above is available. Since most of their salary scales
are good enough, it is possible to issue credit cards with higher limits.

So it is worth studying the customers who are willing to deal with the
banks. As an example, let us observe the column chart for the fisheries biology faculty.
It is clear that most of them wish to have the teller facility within the university
premises, because most of them are somewhat afraid to go to Matara town.

So finally, considering every aspect that I have analysed, we can conclude that
it is appropriate to install a teller machine within the university premises.

Chapter 7

Discussion

Since the beginning of civilization, man has used diverse strategies to meet his basic
needs such as food, clothing and housing.
Although people practised the exchange of goods and services in the past, the situation
changed with the introduction of the unit of exchange called “money”. Money
subsequently became the unit of exchange for both goods and services.

The industrial revolution gave rise to diverse services and, as a result, the demand
for money also increased. This high demand for money led to the rapid spread of banks and
other financial services all around the world.
With the help of modern technology, the current electronic banking systems are capable of
providing money to anyone, anywhere in the world.

In Sri Lanka, some recently established private banks have made vast changes in the banking
and financial sector of the country. The competition among these banks to get ahead of
one another has resulted in many benefits for the customers.
The University of Ruhuna, Sri Lanka, is situated in peaceful and elegant premises close
to Matara town in the southern region.
A large number of resident and non-resident people from all over the country enter the
university premises daily for both academic and non-academic purposes.

For example, some of the daily visitors come from distant districts like Jaffna and
Kilinochchi, while others come from the Badulla, Anuradhapura and Polonnaruwa districts.
Most of these regular visitors (particularly students) reside in the university hostels,
while some others live in private lodgings around the university.

To meet the financial and banking needs of this diverse community of the University of
Ruhuna, both government and private banks operate in this context.
Being a student of the University of Ruhuna, and therefore a member of its
community, I was interested in carrying out research on the banking needs of this
community, which could produce guidance for a more sophisticated banking system within
the university premises.

Figure 7.1: Number of accounts obtained in university banks

As an approach to the current study, it was worth studying how the
early banking needs of the University of Ruhuna, which was established 28 years ago in
Wellamadama, Matara, were met.

Initially, the university placed its trust in the government banks, opening an internal
branch of the “Bank of Ceylon” within the university premises.

Both academics and non-academics then started dealing with this internal Bank of Ceylon
branch for their financial purposes.
By the 1980s, another Sri Lankan bank, the “People’s Bank”, had achieved a significant
role in the Sri Lankan economy by launching branches throughout the country, and it also
established an external People’s Bank branch adjacent to the university premises.

As a result of the more attractive banking services and more advantageous account systems
of this newly established branch, people of the university community who had previously
dealt with the internal Bank of Ceylon branch started opening at least one savings account
with the external People’s Bank branch. This new trend has diverted most of the academics
and non-academics from the internal Bank of Ceylon branch to the external People’s Bank
branch. This situation is illustrated in the following charts and figures. Figure 7.1 seems
to show that the Bank of Ceylon still holds most of the university accounts, but
Figure 7.2 shows the reality: the sharing of money between the two banks clearly
illustrates the later decline of the Bank of Ceylon within the university banking context.

Figure 7.2: How the money of payments is divided between the university banks.

In particular, the more efficient banking services of the People’s Bank have moved many of
the academics’ accounts from the internal Bank of Ceylon branch to the external People’s
Bank branch. The Bank of Ceylon is experiencing a great loss of income because academic
accounts carry large sums of money, the academics belonging to the highest salary
scale within the university community.

Therefore, the Bank of Ceylon can increase its profit if it regains the academics’ trust.
As the second part, I studied the financial transactions of the university students,
who are the next prominent component of the university community.

The interviews with students revealed that they have placed their trust mostly in private
banks, for example the “Commercial” and “Sampath” banks, which are rich in the latest
technology. The reason for this attraction is that the two banks offer
facilities for quick money transfer anywhere in the country, an essential need
for students who depend on money that their parents credit to their
accounts from distant areas. Many students use the People’s Bank branch only to
obtain their Mahapola scholarship installments and bursaries, because it is the only
bank with authority over Mahapola and bursary transactions.

In my view, the two government banks have been unfortunate in not being
able to hold the trust of the students, who are the lifeblood of the country.
In my opinion, studying the reasons why the People’s Bank overtook the Bank of
Ceylon within the university premises may guide future improvements in university
banking and financial services. In this context, the behaviour of the university
community matters: most of the academics, non-academics and students
are fairly busy with their daily work, so they cannot spend their working
hours on banking. Their working hours overlap with the bank opening hours,
and they needed a 24-hour banking service, at least for
withdrawing money from their accounts.

Figure 7.3: Number of accounts obtained in university banks

The People’s Bank focused on this issue and established a teller machine at the
external People’s Bank branch several years ago, allowing easy withdrawal of money
both day and night.
Although this was a real help for students, some other issues associated with this teller
machine are also worth considering.
Conflicts between students and the village community are frequent, and during such
periods it is unsafe for the students to move about outside the university premises and
hostels. Such conflicts therefore limit the students’ access to the outside teller machine.
This is particularly so in the evenings, when the students mostly use the teller machine
and when they can also be attacked by the village boys.

Robberies of withdrawn money are also not rare on the way between the
university premises and the outside teller machine. Since the teller machine is established
close to the Matara - Kataragama road, many outsiders also compete with the students
to withdraw money. The majority in the Matara area is Sinhalese, and therefore the students
who have come from the northern and eastern areas (where some conflicts are going on) are
wary of leaving the university premises for the external teller machine in the evenings,
because they may be questioned by either villagers or security personnel.

Figure 7.4: Number of accounts obtained in university banks

Figure 7.5: University staff and students who maintain their accounts with the
government & private banks, according to my sample.

Figure 7.6: Who are interested in the new teller facility

The above-mentioned reasons imply the need to install an internal teller machine within
the university premises.
As the first step of my study, the university community of around 5000 was selected as the
study population. The population was then divided into different sectors (as described
under data analysis) and a sample of 500 was selected for the study. Except for 16 of the
sample, all the others were currently dealing with one bank or another. The study shows
that the best place to establish an internal teller machine is the internal Bank of Ceylon
branch. By establishing this internal teller machine, the Bank of Ceylon branch may
re-attract around 7000 of the university community. Except for 164 of my sample of 500,
all the others agreed to start banking with the internal Bank of Ceylon branch. The
majority of those who disagreed were final-year students, who are to complete their
university education in the near future and then leave the university.
Also, the academics who disagreed were mostly temporary tutors whose appointments will be
terminated at the end of the year.

Therefore, I suggest establishing a teller machine at the internal Bank of Ceylon branch,
in line with the views of the majority of the current study sample.
The methodology used, called “logistic regression”, is capable of isolating the most
effective factors out of several factors affecting a given issue.
For example, in my study the following customer characteristics were considered as factors
that may potentially affect the establishment of an internal teller machine within the
university premises.
• Nationality
• Gender
• Lodging

• Banking with other banks and satisfaction with their services

• Satisfaction with the teller machine services they currently experience

• Whether establishing an internal teller machine would be easier, safer and more
time efficient.
Then the logistic regression was fitted, the factors with a high p-value were
removed, and the remaining factors were studied.

Thus, in my study, GENDER had p = 0.4562 > 0.25, so this factor was removed and
the model was refitted with the remaining factors. Second, the model was refitted in
Minitab and the TC factor, with p = 0.2562 > 0.25, was removed. This factor reduction
process is called the “step-down” (backward elimination) method, and the factors that
ultimately remain, having a p-value less than 0.025, are considered to affect the study
problem significantly.
The results of the study will be further described under the conclusion section.

The logistic transformation equation obtained at the end of the data analysis can be
used to obtain clearer and more reasonable results.
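
The removal loop described above can be sketched as follows. This is only an illustration of the general step-down idea under stated assumptions, not the Minitab procedure used in the study: the function `backward_eliminate`, the callable `p_values_for` and the stub `stub_fit` are hypothetical names, and every p-value in the stub other than those quoted for GENDER and TC is made up purely for demonstration. A real analysis would refit the logistic model at each step, so the p-values would change between iterations.

```python
def backward_eliminate(predictors, p_values_for, threshold=0.25):
    """Drop the least significant predictor until all p-values are <= threshold.

    predictors   -- list of candidate variable names
    p_values_for -- hypothetical callable that refits the model on the given
                    predictors and returns {name: p_value}
    threshold    -- removal cut-off (0.25 is the value quoted in the text)
    """
    kept = list(predictors)
    removed = []
    while kept:
        p_values = p_values_for(kept)
        worst = max(kept, key=lambda name: p_values[name])
        if p_values[worst] <= threshold:
            break  # every remaining predictor passes the cut-off
        kept.remove(worst)
        removed.append(worst)
    return kept, removed


# Stub standing in for a real refit: fixed, illustrative p-values only.
# GENDER and TC use the values quoted in the text; the others are invented.
ILLUSTRATIVE_P = {"GENDER": 0.4562, "TC": 0.2562, "HABOC": 0.001, "PLOUH": 0.010}


def stub_fit(names):
    """Return the illustrative p-values for the requested predictors."""
    return {name: ILLUSTRATIVE_P[name] for name in names}


kept, removed = backward_eliminate(list(ILLUSTRATIVE_P), stub_fit)
print("kept:", kept)
print("removed:", removed)
```

Passing the fitting step in as a callable keeps the loop independent of any particular statistics package.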

Chapter 8

Conclusion

8.1 Results
For a long time, people have set aside what remains of their main requirements, such as
food, clothing and money, after daily use, with a view to using it in the future. Step by
step, with the use of money, the concept of banking became popular all over the world and
improved rapidly with the new outcomes of the technological revolution, so that banks
introduced teller and credit card facilities for the convenience of the public.

As an example, many people in our university also keep
in contact with the banking sector, and most of them prefer to keep their accounts in the
government banks rather than the private banks. A considerable number of these people
use teller machines and credit cards. But most of the university people face
a lot of trouble because these teller machines are installed in the urban areas. So my
main intention is to investigate this matter.

About 3500 people come into the university premises daily for their academic and official
work. The goal of this study is to survey the requirement for a teller machine within
the university premises for public use. We prepared questionnaires to collect
data and interviewed more than 500 people around the university area in the Wellamadama
premises, comprising academic staff, non-academic staff, internal and external students
and security staff.

The goal of the analysis was to identify who was interested in an ATM facility within the
university premises, and why they were interested in this facility. The following variables
were recorded in a computer file called “INT-ATM”.

Variable                  Code                          Abbreviation
Identification code       ID-Student                    IDS
                          ID-Non Academic               IDNA
                          ID-Academic                   IDA
                          ID-Temporary Academic         IDTA
Place of living           Inside the Matara area        PLMA
                          Outside the Matara area       PLOM
                          Hostel inside University      PLIUH
                          Hostel outside University     PLOUH
Gender                    1-Female, 0-Male              GENDER
Having savings account    1-Yes, 0-No                   HABOC
in BOC
Having teller facility    1-Yes, 0-No                   TC
Interested in BOC ATM     1-Yes, 0-No                   INTEREST
facility in University
premises

Table 8.1: INTERESTED Vs IDA, IDS, IDTA, PLOM, GENDER, PLIUH, PLOUH, HABOC, TC

The variable “PLACE” was treated as a categorical variable in the regression analysis, so
three dummy variables were needed to distinguish the four places of living. These variables
were defined so that “Living inside the Matara urban area” (PLMA) was the referent
group, as follows:

\[
PLMA = \begin{cases} 1 & \text{if living inside the Matara area} \\ 0 & \text{otherwise} \end{cases}
\qquad
PLOM = \begin{cases} 1 & \text{if living outside the Matara area} \\ 0 & \text{otherwise} \end{cases}
\]

\[
PLIUH = \begin{cases} 1 & \text{if living in a hostel inside the university} \\ 0 & \text{otherwise} \end{cases}
\qquad
PLOUH = \begin{cases} 1 & \text{if living in a hostel outside the university} \\ 0 & \text{otherwise} \end{cases}
\]
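The variable “ID” was also treated as a categorical variable, so dummy variables were
needed to distinguish the four sections to which the members belong. These variables were
defined so that “Students” (IDS) was the referent group, as follows:

\[
IDS = \begin{cases} 1 & \text{if student} \\ 0 & \text{otherwise} \end{cases}
\qquad
IDA = \begin{cases} 1 & \text{if academic} \\ 0 & \text{otherwise} \end{cases}
\]

\[
IDNA = \begin{cases} 1 & \text{if non-academic} \\ 0 & \text{otherwise} \end{cases}
\qquad
IDTA = \begin{cases} 1 & \text{if temporary academic} \\ 0 & \text{otherwise} \end{cases}
\]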
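
The dummy coding defined above can be illustrated with a short sketch. It assumes each respondent is recorded with one place code and one identification code from Table 8.1; the function names `dummy_code` and `encode_respondent` are hypothetical and do not come from the original analysis. The referent level of each variable gets no column of its own, so a respondent in the referent group has zeros in all of that variable's dummies.

```python
# Category levels from Table 8.1; the first entry of each list is the referent group.
PLACE_LEVELS = ["PLMA", "PLOM", "PLIUH", "PLOUH"]
ID_LEVELS = ["IDS", "IDNA", "IDA", "IDTA"]


def dummy_code(value, levels):
    """Return 0/1 indicators for every non-referent level (levels[0] is the referent)."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {level: int(value == level) for level in levels[1:]}


def encode_respondent(place, member_id):
    """Build the dummy-variable row for one respondent (hypothetical helper)."""
    row = {}
    row.update(dummy_code(member_id, ID_LEVELS))
    row.update(dummy_code(place, PLACE_LEVELS))
    return row


# A student (referent for ID) living in a hostel inside the university.
example = encode_respondent("PLIUH", "IDS")
print(example)
```

With four levels per variable this produces three dummies each, which matches the six ID and PLACE terms that enter the logit model.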
The following block of edited computer output comes from fitting Minitab’s logistic
procedure for the dichotomous outcome variable INTEREST on the predictors GENDER,
TC, ID_j for j = 1, 2, 3, 4 and PLACE_j for j = 1, 2, 3, 4.

The logit form of the model being fitted is given as

\[
\begin{aligned}
\operatorname{logit}[\Pr(Y = 1)] = {} & \beta_0 + \beta_1(IDNA) + \beta_2(IDA) + \beta_3(IDTA) \\
& + \beta_4(PLOM) + \beta_5(PLIUH) + \beta_6(PLOUH) \\
& + \beta_7(GENDER) + \beta_8(HABOC) + \beta_9(TC)
\end{aligned}
\]

where Y denotes the dependent variable INTEREST.

Variable                  Code                          Abbreviation
Identification code       ID-Student                    IDS
                          ID-Non Academic               IDNA
                          ID-Academic                   IDA
                          ID-Temporary Academic         IDTA
Place of living           Inside the Matara area        PLMA
                          Outside the Matara area       PLOM
                          Hostel inside University      PLIUH
                          Hostel outside University     PLOUH
Gender                    1-Female, 0-Male              GENDER
Having savings account    1-Yes, 0-No                   HABOC
in BOC
Having teller facility    1-Yes, 0-No                   TC
Interested in BOC ATM     1-Yes, 0-No                   INTEREST
facility in University
premises

Table 8.2: INTERESTED Vs IDA, IDS, IDTA, PLOM, GENDER, PLIU, PLOU, HABOC, TC

The results of fitting the univariable logistic regression models to these data are given
in the results table. In this table we present,
for each variable listed in the first column, the following information.
(1) The estimated slope coefficient (s) for the univariable logistic regression model con-
taining only this variable
(2) The estimated standad error of the estimated slop coefficient
(3) Normal values of variables.
(4) The significance level for the likelihood ratio test.
(5) The estimated odds ratio, which is obtained by exponentiating the estimated coeffi-
cient. For example, the variable AGE the odds ratio is for a 5−point increase. This
wos done since a change of 1 year or 1 point would not be clinically meaningful.

112
(6) The 95% CI for the odds ratio.
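Items (5) and (6) of the list can be illustrated with a short Python sketch. It assumes a Wald-type interval, exp(coefficient ± 1.96 × standard error), which is a common way of obtaining a 95% CI for an odds ratio; the function name is hypothetical. The coefficient 0.9230 is the IDNA estimate of the reduced model reported in this chapter, while the standard error 0.37 is an assumed value used only for illustration and is not taken from the Minitab output.

```python
import math


def odds_ratio_with_ci(coef, se, z=1.96):
    """Return (odds ratio, lower limit, upper limit) for one coefficient.

    The odds ratio is exp(coef). The Wald-type limits are
    exp(coef - z * se) and exp(coef + z * se); z = 1.96 gives a 95% interval.
    """
    return (
        math.exp(coef),
        math.exp(coef - z * se),
        math.exp(coef + z * se),
    )


# 0.9230: IDNA coefficient of the reduced model in this chapter.
# 0.37:   assumed standard error, for illustration only.
or_, lower, upper = odds_ratio_with_ci(0.9230, 0.37)
print(f"OR = {or_:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Because the limits are obtained by exponentiating a symmetric interval for the coefficient, the interval for the odds ratio is not symmetric around the odds ratio itself.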

As before, Y denotes the dependent variable INTEREST. Using the given table, we now focus
on the information provided under the heading "Analysis of Maximum Likelihood
Estimates." From this information, we can see that the ML coefficients obtained for the
fitted model are

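\[
\begin{aligned}
&\hat{\beta}_0 = 1.6810, \quad \hat{\beta}_1 = 0.9152, \quad \hat{\beta}_2 = 0.2731, \quad \hat{\beta}_3 = -1.0998, \quad \hat{\beta}_4 = -0.1135, \\
&\hat{\beta}_5 = -0.6987, \quad \hat{\beta}_6 = -1.4646, \quad \hat{\beta}_7 = -0.2563, \quad \hat{\beta}_8 = 1.5276, \quad \hat{\beta}_9 = -0.8361
\end{aligned}
\]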
so that the fitted model is given (in logit form) by

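\[
\begin{aligned}
\operatorname{logit}[\Pr(Y = 1)] = {} & 1.6810 + 0.9152(\mathrm{IDNA}) + 0.2731(\mathrm{IDA}) - 1.0998(\mathrm{IDTA}) \\
& - 0.1135(\mathrm{PLOM}) - 0.6987(\mathrm{PLIUH}) - 1.4646(\mathrm{PLOUH}) \\
& - 0.2563(\mathrm{GENDER}) + 1.5276(\mathrm{HABOC}) - 0.8361(\mathrm{TC})
\end{aligned}
\]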

Based on this fitted model and the information provided in the computer output, we can
compute the estimated odds ratios for the factors associated with interest in an ATM
within the university premises for public usage.
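As a rough check on this step, the odds ratios can be obtained by exponentiating each coefficient of the fitted model. The Python sketch below copies the coefficients from the fitted logit equation above; the dictionary and function names are hypothetical and serve only this illustration.

```python
import math

# Coefficients of the fitted full model, copied from the logit equation above.
FULL_MODEL = {
    "IDNA": 0.9152,
    "IDA": 0.2731,
    "IDTA": -1.0998,
    "PLOM": -0.1135,
    "PLIUH": -0.6987,
    "PLOUH": -1.4646,
    "GENDER": -0.2563,
    "HABOC": 1.5276,
    "TC": -0.8361,
}


def odds_ratios(coefficients):
    """Exponentiate each coefficient to obtain its estimated odds ratio."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}


for name, value in odds_ratios(FULL_MODEL).items():
    print(f"{name:7s} OR = {value:.2f}")
```

A positive coefficient gives an odds ratio above 1 (higher odds of interest than the referent group), and a negative coefficient gives an odds ratio below 1.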

Based on this fitted model and the information provided in the computer output, we can also
compute the estimated p-values used to find the factors associated with interest in the
ATM facility within the university premises for public usage. We do this using the
previously stated rule (forward selection with a test for backward elimination) for the
adjusted p-value of each (0-1) variable.

With the exception of GENDER (p-value = 0.293 > 0.25), there is evidence that each of the
variables has some association (p-value < 0.25) with the outcome. GENDER is therefore dropped,
and the remaining variables are kept as factors of interest in the ATM within the university
premises for public usage.
The results in Table 8.3, when compared to Table 8.2, indicate weaker associations for
some covariates when controlling for other variables.

The new reduced model is written in logit form as

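\[
\begin{aligned}
\operatorname{logit}[\Pr(Y = 1)] = {} & \beta_0 + \beta_1(\mathrm{IDNA}) + \beta_2(\mathrm{IDA}) + \beta_3(\mathrm{IDTA}) \\
& + \beta_4(\mathrm{PLOM}) + \beta_5(\mathrm{PLIUH}) \\
& + \beta_6(\mathrm{PLOUH}) + \beta_7(\mathrm{HABOC}) + \beta_8(\mathrm{TC})
\end{aligned}
\]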

By using the computer output we can see that the ML coefficients obtained for the new
fitted model are,

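\[
\begin{aligned}
&\hat{\beta}_0 = 1.510, \quad \hat{\beta}_1 = 0.9230, \quad \hat{\beta}_2 = 0.3067, \quad \hat{\beta}_3 = -1.1366, \quad \hat{\beta}_4 = -0.1302, \\
&\hat{\beta}_5 = -0.7866, \quad \hat{\beta}_6 = -1.4239, \quad \hat{\beta}_7 = 1.5263, \quad \hat{\beta}_8 = -0.8070
\end{aligned}
\]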

Variable                  Code                         Abbreviation
Identification code       ID-Student                   IDS
                          ID-Non Academic              IDNA
                          ID-Academic                  IDA
                          ID-Temporary Academic        IDTA
Place of living           Inside the Matara area       PLMA
                          Outside the Matara area      PLOM
                          Hostel inside University     PLIUH
                          Hostel outside University    PLOUH
Gender                    1-Female, 0-Male             GENDER
Having savings account    1-Yes, 0-No                  HABOC
in BOC
Having teller facility    1-Yes, 0-No                  TC
Interested in BOC ATM     1-Yes, 0-No                  INTEREST
facility in University
premises

Table 8.3: INTEREST vs. IDA, IDS, IDTA, PLOM, PLIUH, PLOUH, HABOC, TC

so that the new fitted model is given (in logit form) by


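\[
\begin{aligned}
\operatorname{logit}[\Pr(Y = 1)] = {} & 1.510 + 0.9230(\mathrm{IDNA}) + 0.3067(\mathrm{IDA}) - 1.1366(\mathrm{IDTA}) \\
& - 0.1302(\mathrm{PLOM}) - 0.7866(\mathrm{PLIUH}) \\
& - 1.4239(\mathrm{PLOUH}) + 1.5263(\mathrm{HABOC}) - 0.8070(\mathrm{TC})
\end{aligned}
\]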
By using equations 2.4, 2.5 and 2.6, we get

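\[
\ln\!\left(\frac{\hat{p}_i}{1 - \hat{p}_i}\right) = \beta_0 + \beta_1(\mathrm{IDNA}) + \beta_2(\mathrm{IDA}) + \beta_3(\mathrm{IDTA}) + \beta_4(\mathrm{PLOM}) + \beta_5(\mathrm{PLIUH}) + \beta_6(\mathrm{PLOUH}) + \beta_7(\mathrm{HABOC}) + \beta_8(\mathrm{TC})
\]

\[
\frac{\hat{p}_i}{1 - \hat{p}_i} = \exp\{\beta_0 + \beta_1(\mathrm{IDNA}) + \beta_2(\mathrm{IDA}) + \beta_3(\mathrm{IDTA}) + \beta_4(\mathrm{PLOM}) + \beta_5(\mathrm{PLIUH}) + \beta_6(\mathrm{PLOUH}) + \beta_7(\mathrm{HABOC}) + \beta_8(\mathrm{TC})\}
\]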
Solving for the probability gives the logistic transformation,
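\[
\hat{p}_i = \frac{\exp\{\beta_0 + \beta_1(\mathrm{IDNA}) + \beta_2(\mathrm{IDA}) + \beta_3(\mathrm{IDTA}) + \beta_4(\mathrm{PLOM}) + \beta_5(\mathrm{PLIUH}) + \beta_6(\mathrm{PLOUH}) + \beta_7(\mathrm{HABOC}) + \beta_8(\mathrm{TC})\}}{1 + \exp\{\beta_0 + \beta_1(\mathrm{IDNA}) + \beta_2(\mathrm{IDA}) + \beta_3(\mathrm{IDTA}) + \beta_4(\mathrm{PLOM}) + \beta_5(\mathrm{PLIUH}) + \beta_6(\mathrm{PLOUH}) + \beta_7(\mathrm{HABOC}) + \beta_8(\mathrm{TC})\}}
\]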
The fitted values, using equation 2.6, are given by
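\[
\hat{\pi}(x) = \frac{\exp\{\beta_0 + \beta_1(\mathrm{IDNA}) + \beta_2(\mathrm{IDA}) + \beta_3(\mathrm{IDTA}) + \beta_4(\mathrm{PLOM}) + \beta_5(\mathrm{PLIUH}) + \beta_6(\mathrm{PLOUH}) + \beta_7(\mathrm{HABOC}) + \beta_8(\mathrm{TC})\}}{1 + \exp\{\beta_0 + \beta_1(\mathrm{IDNA}) + \beta_2(\mathrm{IDA}) + \beta_3(\mathrm{IDTA}) + \beta_4(\mathrm{PLOM}) + \beta_5(\mathrm{PLIUH}) + \beta_6(\mathrm{PLOUH}) + \beta_7(\mathrm{HABOC}) + \beta_8(\mathrm{TC})\}}
\]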

By using the previous results, we can substitute the estimated values into the logistic transformation:

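\[
\hat{\pi}(x) = \frac{\exp\{1.510 + 0.9230(\mathrm{IDNA}) + 0.3067(\mathrm{IDA}) - 1.1366(\mathrm{IDTA}) - 0.1302(\mathrm{PLOM}) - \cdots + 1.5263(\mathrm{HABOC}) - 0.8070(\mathrm{TC})\}}{1 + \exp\{1.510 + 0.9230(\mathrm{IDNA}) + 0.3067(\mathrm{IDA}) - 1.1366(\mathrm{IDTA}) - 0.1302(\mathrm{PLOM}) - \cdots + 1.5263(\mathrm{HABOC}) - 0.8070(\mathrm{TC})\}}
\]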
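To see what the substituted logistic transformation yields for a particular respondent, the following Python sketch evaluates the fitted probability from the reduced-model coefficients reported above. The function name and the example profiles are hypothetical; variables that are not supplied are treated as 0, which corresponds to the referent groups.

```python
import math

# Reduced-model estimates, copied from the fitted logit equation above.
INTERCEPT = 1.510
REDUCED_MODEL = {
    "IDNA": 0.9230,
    "IDA": 0.3067,
    "IDTA": -1.1366,
    "PLOM": -0.1302,
    "PLIUH": -0.7866,
    "PLOUH": -1.4239,
    "HABOC": 1.5263,
    "TC": -0.8070,
}


def fitted_probability(x):
    """Return the fitted probability of INTEREST = 1 for one respondent.

    x maps variable names to 0 or 1; names that are missing count as 0.
    The value 1 / (1 + exp(-g)) is algebraically equal to exp(g) / (1 + exp(g)).
    """
    logit = INTERCEPT + sum(
        beta * x.get(name, 0) for name, beta in REDUCED_MODEL.items()
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical profiles, chosen only to illustrate the calculation.
baseline = fitted_probability({})                # all dummies equal to 0
boc_holder = fitted_probability({"HABOC": 1})    # referent groups, BOC account
print(f"baseline:   {baseline:.3f}")
print(f"BOC holder: {boc_holder:.3f}")
```

Comparing the two profiles shows how a single 0-1 covariate shifts the fitted probability while all other covariates are held at their referent values.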

8.2 Conclusion
Compared with the other subjects, the non-academic staff (IDNA) had about two and a half times
the odds of being interested in a teller machine facility within the university premises
for public usage, with an odds ratio (OR) of 2.52 and a 95% confidence interval (CI) of
1.21 to 5.24. Holders of an account in the "Bank of Ceylon" (HABOC) had more than four and a
half times the odds of being interested (OR = 4.61; CI = 2.46 to 8.63). In contrast, the
temporary and demonstrator staff had about one third of the odds of being interested in this
"BOC" teller machine (OR = 0.33; CI = 0.17 to 0.66). The academic staff and students were also
less interested; they were, however, interested in a "People's Bank" teller machine facility
within the university premises for public usage.

Chapter 9

Appendix

List of Figures

1.1 Example Data set-I
1.2 Example Data set-II
1.3 Motivation for the least-squares regression line
1.4 Lines A and B both satisfying the criterion $\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0$
1.5 The least-squares procedure minimizes the sum of the squares of the residuals $e_i$
1.6 Example of possible population regression lines
1.7 Depth of water vs. water temperature
1.8 Quadratic model: $\mu = \beta_0 + \beta_1 x + \beta_2 x^2$
1.9 Cubic model: $\mu = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$

2.1 Scatterplot of CHD by AGE for 100 subjects
2.2 Plot of the percentage of subjects with CHD in each age group

3.1 Design variables

4.1 Comparison of the weight of two groups of boys with different distributions of age
4.2 Plot of the logits under three different models showing the presence and absence of interaction

5.1 Different types of models for the relationship between the logit and a continuous variable

6.1 Size of the University population
6.2 Academic staff
6.3 Temporary and demonstrator staff
6.4 Non-academic staff
6.5 Number of savings accounts for the sample taken out of the university premises
6.6 Analysis of account holding for the sample taken out of the university premises

6.7 University staff and students maintaining their accounts with the university People's Bank and Bank of Ceylon, according to my sample
6.8 Analysis of university staff and students maintaining their accounts with the university People's Bank and Bank of Ceylon, according to my sample
6.9 Number of accounts obtained in university banks
6.10 How the money of payments is divided between the university banks
6.11 Number of accounts obtained in university banks
6.12 How the money of payments is divided between the university banks
6.13 University staff and students maintaining their accounts with government and private banks, according to my sample
6.14 University staff and students maintaining their accounts with government and private banks, according to my sample
6.15 Number of sample members who have not obtained savings accounts
6.16 Sample members who have obtained the teller facility
6.17 Whether or not the sample members have obtained the teller facility
6.18 Where the sample members have obtained the teller facility
6.19 Where the sample members have obtained the teller facility
6.20 New teller machine to be within or outside the university area
6.21 New teller machine to be within or outside the university area
6.22 The number of customers who are interested in opening a new account
6.23 Are you interested in opening a new account?
6.24 Who are interested in the new teller facility

7.1 Number of accounts obtained in university banks
7.2 How the money of payments is divided between the university banks
7.3 Number of accounts obtained in university banks
7.4 Number of accounts obtained in university banks
7.5 University staff and students maintaining their accounts with government and private banks, according to my sample
7.6 Who are interested in the new teller facility

List of Tables

1.1 Example Data set-I
1.2 Example Data set-II
1.3 Computations for finding $\beta_0$ and $\beta_1$

2.1 Frequency table of AGE group by CHD
2.2
2.3 Results of fitting the logistic regression model to the data in Table 2.1
2.4 Estimated covariance matrix of the estimated coefficients in Table 2.3

3.1 An example of the coding of the design variables for RACE coded at three levels
3.2 Estimated coefficients for a multiple logistic regression model using the variables AGE, weight at last menstrual period (LWT), RACE and number of first-trimester physician visits (FTV) for the low birth weight study
3.3 Estimated coefficients for a multiple logistic regression model using the variables LWT and RACE from the low birth weight study
3.4 Estimated covariance matrix of the estimated coefficients in Table 3.3

4.1 Values of the logistic regression model when the independent variable is dichotomous
4.2 Cross-classification of AGE dichotomized at 55 years and CHD for 100 subjects
4.3 Illustration of the coding of the design variable using the reference cell method
4.4 Illustration of the coding of the design variable using the deviation from means method
4.5 Cross-classification of hypothetical data on RACE and CHD status for 100 subjects
4.6 Specification of the design variables for RACE using reference cell coding with white as the reference group

4.7 Results of fitting the logistic regression model to the data in Table 4.5 using the design variables in Table 4.6
4.8 Specification of the design variables for RACE using deviation from means coding
4.9 Results of fitting the logistic regression model to the data in Table 4.5 using the design variables in Table 4.8
4.10 Descriptive statistics for two groups of 50 men on AGE and whether they had seen a physician (PHY) (1 = Yes, 0 = No) within the last months
4.11 Results of fitting the logistic regression model to the data summarized in Table 4.10
4.12 Estimated logistic regression coefficients, deviance, and the likelihood ratio test statistic (G) for an example showing evidence of confounding but no interaction (n = 400)
4.13 Estimated logistic regression coefficients, deviance, and the likelihood ratio test statistic (G) for an example showing evidence of confounding but no interaction (n = 400)
4.14 Estimated logistic regression coefficients, deviance, the likelihood ratio test statistic (G), and the p-value for the change for models containing LWD and AGE from the low birth weight data (n = 189)
4.15 Estimated covariance matrix for the estimated parameters in model 3 of Table 4.14
4.16 Estimated odds ratios and 95 percent confidence intervals for LWD, controlling for AGE

5.1 Univariable logistic regression models for the USI (n = 575)
5.2 Results for a multivariable model containing the covariates significant at the level of Table 5.1

8.1 INTEREST vs. IDA, IDS, IDTA, PLOM, GENDER, PLIUH, PLOUH, HABOC, TC
8.2 INTEREST vs. IDA, IDS, IDTA, PLOM, GENDER, PLIUH, PLOUH, HABOC, TC
8.3 INTEREST vs. IDA, IDS, IDTA, PLOM, PLIUH, PLOUH, HABOC, TC
