
University Premises For The Public Usage

Department of Mathematics
University of Ruhuna
Matara.

Declaration

This thesis is submitted as a partial fulfilment of the Special Degree in Mathematics, Part (II) (2002), by R.M.K.T. Rathnayaka, under the supervision and guidance of Dr. L.A.L.W. Jayasekara, Senior Lecturer, Department of Mathematics, University of Ruhuna, Sri Lanka.

————————————–

Dr. L.A.L.W. Jayasekara

————————————–

R.M.K.T. Rathnayaka

(Index No: 2002/S/5043)

—————————————

Date


ACKNOWLEDGEMENT

I wish to thank my supervisor, Dr. L.A.L.W. Jayasekara, for his guidance and encouragement during the preparation of this thesis. I am also grateful to the Department of Mathematics, University of Ruhuna, for giving me the opportunity to carry out my research work in the Department. Equally, I wish to express my gratitude to the senior lecturers in the Department of Mathematics, to Miss B.B.U.P. Perera, Assistant Lecturer in the Department of Mathematics, University of Ruhuna, and to all the members of the Department for their kind cooperation.

Finally, I would like to thank my parents, demonstrators and friends who provided support to successfully complete this project.

Name :

R.M.K.T. Rathnayaka (2002/S/5043)

Department Of Mathematics

University Of Ruhuna

Matara

Sri Lanka

Date : 04/04/2008


Abstract

For a long time, people have saved essentials such as food, clothing and money left over after their daily use, with a view to using them in the future. Step by step, with the use of money, the concept of banking became popular all over the world, and it improved rapidly with the outcomes of the technological revolution, so that banks introduced teller machine and credit card facilities for the convenience of the public.

Many of the people in our university also deal with the banking sector, and most of them prefer to keep their accounts in government banks rather than private banks. A considerable number of these people use teller machines and credit cards. However, most university people face a lot of trouble, because these teller machines are installed in urban areas. My main intention is to investigate this matter.

About 3500 people come daily into the university premises for their academic and official work. The goal of this study is to survey the requirement of a teller machine within the university premises for public usage. We prepared questionnaires to collect data and interviewed more than 500 people around the university area in the Wellamadama premises, comprising academic and non-academic staff, internal and external students, and security staff.

Finally, in order to analyse these data, we used the "MINITAB" and "R" statistical software packages.

Contents

Introduction
1.1 Using Simple Regression To Describe A Linear Relationship
1.2 Least Squares Estimation
1.3 Inferences From A Simple Regression Analysis
1.4 Model And Parameter Estimation
1.5 Multiple Linear Regression Model
1.6 Least-Squares Procedures For Model Fitting
1.7 Polynomial Model Of Degree p
2.1 Why Use Logistic Regression Rather Than Ordinary Linear Regression
2.2 The Simple Logistic Model
2.3 The Importance Of The Logistic Transformation
2.4 Fitting The Logistic Regression Model
2.5 Fitting The Logistic Regression Model By Using Maximum Likelihood Method
2.6 Testing For The Significance Of The Coefficients
2.7 Testing For The Significance Of The Coefficients For The Logistic Regression
2.8 Confidence Interval Estimation
3.1 The Multiple Logistic Regression Model
3.2 Fitting The Multiple Logistic Regression Model
3.3 Testing For The Significance Of The Model
3.4 Likelihood Ratio Test For Testing For The Significance Of The Model
3.5 Wald Test For Testing For The Significance
3.6 Confidence Interval Estimation
4.1 Dichotomous Independent Variable
4.2 Polychotomous Independent Variable
4.3 Deviation From Means Coding Method
4.4 Continuous Independent Variable
4.5 The Multivariable Model
4.6 Interaction And Confounding
4.7 Estimation Of Odds Ratios In The Presence Of Interaction
5.1 Variable Selection
5.2 Fractional Polynomial
7 Discussion
8 Conclusion
8.1 Results
8.2 Conclusion
9 Appendix

An Introduction to Regression Analysis

Introduction

Advances in technology, including computers, scanners, and telecommunications equipment, have buried present-day managers under a mountain of data. Although the purpose of these data is to assist managers in the decision-making process, corporate executives who face the task of juggling data on many variables may find themselves at a loss when attempting to make sense of such information. The decision-making process is further complicated by the dynamic elements in the business environment and the complex interrelationships among these elements.

This text has been prepared to give managers (and future managers) tools for examining possible relationships between two or more variables. For example, sales and advertising are two variables commonly thought to be related. When a soft drink company increases advertising expenditures by paying professional athletes millions of dollars to do its advertisements, it expects this outlay to increase sales. In general, when decisions on advertising expenditures of millions of dollars are involved, it is comforting to have some evidence that, in the past, increased advertising expenditures indeed led to increased sales.

Another example is the relationship between the selling price of a house and its square footage. When a new house is listed for sale, how should the price be determined? Is a 4000-square-foot house worth twice as much as a 2000-square-foot house? What other factors might be involved in the pricing of houses, and how should these factors be included in the determination of the price?

A third example is worker absenteeism at a manufacturing plant, on which several variables may have an impact. These variables might include job complexity, base pay, the number of years the worker has been with the plant, and the age of that worker. If absenteeism can cost the company thousands of dollars, then the importance of identifying its associated factors becomes clear.

Perhaps the most important analytic tool for examining the relationships between two or more variables is regression analysis. Regression analysis is a statistical technique for developing an equation describing the relationship between two or more variables. One variable is specified to be the dependent variable, or the variable to be explained. The other one or more variables are called the independent or explanatory variables. Using the previous examples, the soft drink firm would identify sales as the dependent variable and advertising expenditures as the explanatory variable. The real estate firm would choose selling price as the dependent variable and size as the explanatory variable to explain the variation in selling price from house to house.

There are several reasons business researchers might want to know how certain variables are related. The retail firm may want to know how much advertising is necessary to achieve a certain level of sales. An equation expressing the relationship between sales and advertising is useful in answering this question. For the real estate firm, the relationship might be used in assigning prices to houses coming onto the market. To try to lower the absenteeism rate, the management of the manufacturing firm wants to know what variables are most highly related to absenteeism. Reasons for wanting to develop an equation relating two or more variables can be classified as follows.

(b) For control purposes (What value of the explanatory variable is needed to produce a certain level of the dependent variable?)

Much statistical analysis is a multistage process of trial and error. A good deal of exploratory work must be done to select appropriate variables for study and to determine relationships between or among them. This requires that a variety of statistical tests and procedures be performed and sound judgments be made before one arrives at satisfactory choices of dependent and explanatory variables. The emphasis in this text is on this multistage process rather than on the computations themselves or an in-depth study of the theory behind the techniques presented. In this sense, the text is directed at the applied researcher or the consumer of statistics.

Statistical software performs the actual computations, so it is not necessary for the reader to perform them by hand. The use of statistical software frees the user to concentrate on the multistage "model-building" process. Most examples use illustrative computer output to present the results. The two software packages used are "MINITAB" and "Microsoft Excel 2000". MINITAB is included because it is widely used as a teaching tool in universities and is also used in industry.

Chapter 1

Using Simple Regression To Describe A Linear Relationship

Regression analysis is a statistical technique used to describe relationships among variables. The simplest case to examine is one in which a variable y, referred to as the dependent variable, may be related to another variable x, called an independent or explanatory variable. If the relationship between y and x is believed to be linear, then the equation expressing this relationship is the equation for a line:

y = β0 + β1 x  (1.1)

If a graph of all the (x, y) pairs is constructed, then β0 represents the y intercept, the point where the line crosses the vertical (y) axis, and β1 represents the slope of the line. Consider the data shown in Table 1.1. A graph of the (x, y) pairs would appear as shown in Figure 1.1. Regression analysis is not needed to obtain the equation expressing the relationship between these two variables. In equation form:

y = 1 + 2x

This is an exact or deterministic linear relationship. Exact linear relationships are sometimes encountered in business environments. For example, from accounting,

Total Costs = Fixed Costs + Variable Costs

Other exact relationships may be encountered in various science courses (for example, physics or chemistry). In the social sciences (for example, psychology or sociology) and in business and economics, exact linear relationships are the exception rather than the rule. Data encountered in a business environment are more likely to appear as in Table 1.2. These data graph as shown in Figure 1.2.

Table 1.1: Example data set I
X: 1  2  3  4  5  6
Y: 3  5  7  9  11  13

[Figure 1.1: Example data set I]

Table 1.2: Example data set II
X: 1  2  3  4  5  6
Y: 3  2  8  8  11  13

It appears that x and y may be linearly related, but it is not an exact relationship. Still, it may be desirable to describe the relationship between x and y in equation form. This can be done by drawing what appears to be the "best-fitting" line through the points and estimating (guessing) what the values of β0 and β1 are for this line. This has been done in Figure 1.2. For the line drawn there, a good guess might be the following equation:

ŷ = −1 + 2.5x

[Figure 1.3: An arbitrary line drawn through the data points of Table 1.2. At x∗ = 3 the actual value y∗ = 8 and the predicted value ŷ∗ = 7 are marked, with the residual y∗ − ŷ∗ indicated between them.]

The drawbacks to this method of fitting the line should be clear. For example, if the (x, y) pairs graphed in Figure 1.2 were given to two people, each would probably guess different values for the intercept and slope of the best-fitting line. Furthermore, there is no way to assess who would be more correct. To make line fitting more precise, a definition of what it means for a line to be the "best" is needed. The criterion for a best-fitting line that we will use might be called the "minimum sum of squared errors" criterion or, as it is more commonly known, the least-squares criterion.

In Figure 1.3, the (x, y) pairs from Table 1.2 have been plotted and an arbitrary line drawn through the points. Consider the pair of values denoted (x∗, y∗). The actual y value is indicated as y∗; the value predicted to be associated with x∗ if the line shown were used is indicated as ŷ∗. The difference between the actual y value and the predicted y value at the point x∗ is called a residual and represents the "error" involved. This error is denoted y∗ − ŷ∗. If the line is to fit the data points as accurately as possible, these errors should be minimized. This should be done not just for the single point (x∗, y∗), but for all the points on the graph. There are several ways to approach this task.

(1) Use the line that minimizes the sum of errors, ∑ᵢ₌₁ⁿ (yi − ŷi).

The problem with this approach is that, for any line that passes through the point (x̄, ȳ),

∑ᵢ₌₁ⁿ (yi − ŷi) = 0

so there are an infinite number of lines satisfying this criterion, some of which obviously do not fit the data well. For example, in Figure 1.4, lines A and B have both been constructed so that

∑ᵢ₌₁ⁿ (yi − ŷi) = 0

But line A obviously fits the data better than line B; that is, it keeps the distances yi − ŷi small.
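The defect of criterion (1) is easy to verify numerically: for the Table 1.2 data, any line forced through the point (x̄, ȳ) has residuals that sum to zero, whatever its slope. A minimal sketch follows (Python is used purely for illustration; the thesis itself works with MINITAB and R):

```python
# Any line through (x-bar, y-bar) has residuals summing to zero,
# regardless of its slope. Data are from Table 1.2.
x = [1, 2, 3, 4, 5, 6]
y = [3, 2, 8, 8, 11, 13]

x_bar = sum(x) / len(x)   # 3.5
y_bar = sum(y) / len(y)   # 7.5

for slope in (0.0, 1.0, 2.2, 5.0):        # four very different lines...
    intercept = y_bar - slope * x_bar     # ...each forced through (x-bar, y-bar)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    print(slope, round(sum(residuals), 10))   # sum of residuals is always 0
```

So the sum-of-errors criterion cannot distinguish among these lines, which is exactly why a different criterion is needed.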

[Figure 1.4: Lines A and B, both passing through the point (x̄, ȳ) = (3.5, 7.5) and hence both satisfying the criterion ∑ᵢ₌₁ⁿ (yi − ŷi) = 0.]

(2) Use the line that minimizes the sum of the absolute values of the errors, ∑ᵢ₌₁ⁿ |yi − ŷi|.

This is called the minimum sum of absolute errors criterion. The resulting line is called the least absolute value (LAV) regression line. Although use of this criterion is gaining popularity in many situations, it is not the one that we use in this text. Finding the line that satisfies the minimum sum of absolute errors criterion requires solving a fairly complex problem by a technique called linear programming.

(3) Use the line that minimizes the sum of squared errors.

The parameters β0 and β1 are estimated by the method of least squares. The reasoning behind this method is quite simple. From the many straight lines that can be drawn through a scattergram, we wish to pick the one that "best fits" the data. The fit is "best" in the sense that the values of b0 and b1 chosen are those that minimize the sum of the squares of the residuals. In this way we are essentially picking the line that comes as close as it can to all data points simultaneously. For example, if we consider the sample of five data points shown in Figure 1.5, then the least-squares procedure selects the line which causes e1² + e2² + e3² + e4² + e5² to be as small as possible.

The sum of squares of the errors about the estimated regression line is given by

SSE = ∑ᵢ₌₁ⁿ ei² = ∑ᵢ₌₁ⁿ (yi − b0 − b1 xi)²
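The criterion can be made concrete by evaluating SSE for two candidate lines through the Table 1.2 data: the guessed line ŷ = −1 + 2.5x and the least-squares line ŷ = −0.2 + 2.2x obtained later in this chapter. A short Python sketch (illustrative only; the thesis uses MINITAB and R):

```python
# SSE for two candidate lines through the Table 1.2 data.
x = [1, 2, 3, 4, 5, 6]
y = [3, 2, 8, 8, 11, 13]

def sse(b0, b1):
    """Sum of squared errors about the line y-hat = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

print(round(sse(-1.0, 2.5), 2))   # guessed line: 10.75
print(round(sse(-0.2, 2.2), 2))   # least-squares line: 8.8 (smaller)
```

No other choice of intercept and slope can produce an SSE smaller than that of the least-squares line.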

[Figure 1.5: The least-squares procedure minimizes the sum of the squares of the residuals ei about the fitted line μ̂Y|x = b0 + b1 x.]

∂SSE/∂b0 = −2 ∑ᵢ₌₁ⁿ (yi − b0 − b1 xi)

∂SSE/∂b1 = −2 ∑ᵢ₌₁ⁿ (yi − b0 − b1 xi) xi

We now set these partial derivatives equal to 0 and use the rules of summation to obtain the equations,

n b0 + b1 ∑ᵢ₌₁ⁿ xi = ∑ᵢ₌₁ⁿ yi

b0 ∑ᵢ₌₁ⁿ xi + b1 ∑ᵢ₌₁ⁿ xi² = ∑ᵢ₌₁ⁿ xi yi

These equations are called the normal equations. They can be solved easily to obtain these estimates for β0 and β1:

β̂1 = [∑ᵢ₌₁ⁿ xi yi − (1/n)(∑ᵢ₌₁ⁿ xi)(∑ᵢ₌₁ⁿ yi)] / [∑ᵢ₌₁ⁿ xi² − (1/n)(∑ᵢ₌₁ⁿ xi)²]  (1.2)

β̂0 = ȳ − β̂1 x̄  (1.3)

Equivalently, the values of β0 and β1 that minimize SSE can be written as

β̂1 = ∑ᵢ₌₁ⁿ (xi − x̄)(yi − ȳ) / ∑ᵢ₌₁ⁿ (xi − x̄)²

β̂0 = ȳ − β̂1 x̄

Table 1.3: Intermediate computations for the data of Table 1.2

i     xi   yi   xi yi   xi²
1      1    3     3      1
2      2    2     4      4
3      3    8    24      9
4      4    8    32     16
5      5   11    55     25
6      6   13    78     36
Sums  21   45   196     91

β̂1 = [∑ᵢ₌₁ⁿ xi yi − (1/n)(∑ᵢ₌₁ⁿ xi)(∑ᵢ₌₁ⁿ yi)] / [∑ᵢ₌₁ⁿ xi² − (1/n)(∑ᵢ₌₁ⁿ xi)²]  (1.4)

As an example of the use of these formulas, consider again the data in Table 1.2. The intermediate computations necessary for finding β̂0 and β̂1 are shown in Table 1.3. The slope, β̂1, can be computed using the formula in equation 1.2,

β̂1 = (196 − (1/6)(21)(45)) / (91 − (1/6)(21)²) = 38.5 / 17.5 = 2.2

The intercept, β̂0, is computed as in equation 1.3,

β̂0 = 7.5 − 2.2(3.5) = −0.2

because

x̄ = 21/6 = 3.5 and ȳ = 45/6 = 7.5

The least-squares regression line for these data is

ŷ = −0.2 + 2.2x
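The arithmetic of equations 1.2 and 1.3 can be checked with a few lines of code. Python is used here only for illustration (the thesis's own analyses are done in MINITAB and R):

```python
# Least-squares estimates for the Table 1.2 data, following
# equations 1.2 and 1.3 (sums as tabulated in Table 1.3).
x = [1, 2, 3, 4, 5, 6]
y = [3, 2, 8, 8, 11, 13]
n = len(x)

sum_x = sum(x)                                   # 21
sum_y = sum(y)                                   # 45
sum_xy = sum(xi * yi for xi, yi in zip(x, y))    # 196
sum_x2 = sum(xi * xi for xi in x)                # 91

b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)  # slope (eq. 1.2)
b0 = sum_y / n - b1 * sum_x / n                                # intercept (eq. 1.3)

print(f"y-hat = {b0:.1f} + {b1:.1f}x")           # y-hat = -0.2 + 2.2x
```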

Summary

There is no longer any guesswork associated with computing the best-fitting line once a criterion has been stated that defines "best". Using the criterion of minimum sum of squared errors, the regression line we computed provides the best description of the relationship between the variables x and y.

Thus far, regression analysis has been viewed as a way to describe the relationship between two variables. The regression equation obtained can be viewed in this manner simply as a descriptive measure. The real usefulness of regression analysis, however, lies not in its use as a descriptive measure for one particular sample, but in its ability to draw inferences or generalizations about the relationship for the entire population of values of the variables x and y.

To draw inferences from a simple regression, we must make some assumptions about how x and y are related in the population. These initial assumptions describe an "ideal" situation. Later, each of these assumptions is relaxed, and we demonstrate modifications to the basic least-squares approach that provide a model that is still suitable for statistical inference.

Assume that the relationship between the variables x and y is represented by a population regression line. The equation of this line is written as

μY|x = β0 + β1 x  (1.5)

where μY|x is the conditional mean of y given x, β0 is the y intercept for the population regression line, and β1 is the slope of the population regression line. Examples of possible relationships are shown in Figure 1.6.

Suppose that we are developing a model to describe the temperature of the water off the continental shelf. Since the temperature depends in part on the depth of the water, two variables are involved. These are X, the water depth, and Y, the water temperature.

[Figure 1.6: Examples of possible population regression lines μY|x = β0 + β1 x for temperature versus depth of water: (a) a direct relationship, (b) an inverse relationship, (c) observed (xi, yi) pairs scattered about the line, with the random error ε indicated.]

We are not interested in making inferences on the depth of the water. Rather, we want to describe the behavior of the water temperature under the assumption that the depth of the water is known precisely in advance. Even if the depth of the water is fixed at some value x, the water temperature will still vary due to other random influences. For example, if several temperature measurements are taken at various places, each at a depth of x = 1000 feet, these measurements will vary in value. For this reason, we must admit that for a given x we are really dealing with a "conditional" random variable, which we denote by Y|x (Y given that X = x).

This conditional random variable has a mean denoted by μY|x. There is no reason to expect the average temperature at x = 1000 feet to be the same as that at x = 5000 feet. That is, it is reasonable to assume that μY|x is a function of x. We call the graph of this function the curve of regression of Y on X. Since we assume that the value of X is known in advance and that the value assumed by Y depends in part on the particular value of X under consideration, Y is called the dependent or response variable. The variable X, whose value is used to help predict the behaviour of Y|x, is called the independent or predictor variable, or the regressor.

The model is developed from observations taken at selected values x1, x2, x3, ...., xn of the predictor variable X. The actual values used to develop the model are not overly important. If a functional relationship exists, it should become apparent regardless of which X values are used to discover it. However, to be of practical use, these values should represent a fairly wide range of possible values of the independent variable X. Sometimes the values used can be preselected. For example, in studying the relationship between water temperatures and depths, we might know that our model is to be used to predict water temperature for depths from 1000 to 5000 feet. We can choose to measure water temperatures at any water depth that we wish within this range. For example, we might take measurements at 1000-foot increments. In this way we preselect our X values as x1 = 1000, x2 = 2000, x3 = 3000, x4 = 4000, x5 = 5000 feet. When the X values used to develop the regression equation are preselected, the study is said to be controlled.

Regardless of how the X values for the study are selected, the random sample is properly viewed as taking the form

{(x1, Y|x1), (x2, Y|x2), ....., (xn, Y|xn)}

The first member of each ordered pair denotes a value of the independent variable X; it is a real number. The second member of each pair is a random variable.

In simple linear regression, the curve of regression of Y on X is assumed to be linear. In this case the equation for μY|x is given by

μY|x = β0 + β1 x  (1.6)

where β0 and β1 denote real numbers.

Description Of Model

Recall from elementary algebra that the equation for a straight line is y = b + mx, where b denotes the y intercept and m denotes the slope of the line. In the simple linear regression model,

μY|x = β0 + β1 x

β0 denotes the intercept and β1 the slope of the regression line. We must find a logical way to estimate the theoretical parameters β0 and β1.

The model is developed from data gathered at the points x1, x2, x3, ....., xn. These points are assumed to be measured without error. When they are preselected by the experimenter, we say that the study is a controlled study; when they are observed at random, the study is called an observational study. Both situations are handled in the same way mathematically. In either case we shall be concerned with the n random variables Y|x1, Y|x2, Y|x3, ....., Y|xn. Recall that a random variable varies about its mean value. Let Ei denote the random difference between Y|xi and its mean, μY|xi. That is, let

Ei = Y|xi − μY|xi  (1.7)

Solving this equation for Y|xi, we conclude that

Y|xi = μY|xi + Ei

In this expression it is assumed that the random difference Ei has mean 0. Since we are assuming that the regression is linear, we can conclude that μY|xi = β0 + β1 xi. Substituting, we see that

Y|xi = β0 + β1 xi + Ei

It is customary to drop the conditional notation and to denote Y|xi by Yi. Thus an alternative way to express the simple linear regression model is

Yi = β0 + β1 xi + Ei  (1.8)

Our data consist of a collection of n pairs (xi, yi), where xi is an observed value of the variable X and yi is the corresponding observation for the random variable Y. The observed value of a random variable usually differs from its mean value by some random amount. This idea is expressed mathematically by writing

yi = β0 + β1 xi + εi  (1.9)

In this equation εi is a realization of the random variable Ei that appears in the alternative model for simple linear regression 1.8.

In a regression study it is useful to plot the data points in the xy plane. Such a plot is called a "scattergram". We do not expect these points to lie exactly on a straight line. However, if linear regression is applicable, then they should exhibit a linear trend.
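Model 1.9 can be simulated to see why a scattergram shows a trend rather than an exact line. In the Python sketch below, the parameter values and the error distribution are illustrative assumptions, not quantities taken from the thesis:

```python
import random

random.seed(1)  # fixed seed, so the illustration is reproducible

beta0, beta1 = -0.2, 2.2        # illustrative parameter values
x_values = [1, 2, 3, 4, 5, 6]

# y_i = beta0 + beta1*x_i + e_i, with e_i drawn from a mean-0 distribution
y_values = [beta0 + beta1 * xi + random.gauss(0, 1.0) for xi in x_values]

for xi, yi in zip(x_values, y_values):
    print(f"x = {xi}, y = {yi:.2f}")  # scattered about the line, not on it
```

Plotting these pairs would give a scattergram with a clear linear trend, but no line passes exactly through all of the points.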

Once β0 and β1 have been approximated from the available data, we can replace these theoretical parameters by their estimated values in the regression model. Letting b0 and b1 denote the estimates for β0 and β1, respectively, the estimated line of regression takes the form

μ̂Y|x = b0 + b1 x  (1.10)

Just as the data points do not all lie on the theoretical line of regression, they also do not all lie on this estimated regression line. If we let ei denote the vertical distance from a point (xi, yi) to the estimated regression line, then each data point satisfies the equation

yi = b0 + b1 xi + ei

The term ei is called the residual. Figure 1.8 illustrates this idea and points out the difference between εi and ei graphically.

In the previous sections we studied the simple linear regression model. This model expresses the idea that the mean of a response variable Y depends on the value assumed by a single predictor variable X. In this section we extend the concepts studied earlier to cases in which the model becomes more complex. In particular, we distinguish between two basic models: the polynomial model, in which the single predictor variable can appear to a power greater than 1, and the multiple linear regression model, in which more than one distinct variable can be used.

In this section we develop the least-squares estimators for the parameters in both the polynomial and multiple regression models. Before introducing these models specifically, let us note that each of them is a special case of what is called the general linear model. These are models in which the mean value of a response variable Y is assumed to depend on the values assumed by one or more predictor variables. As in the case of simple linear regression, the predictor variables X1, X2, ....., Xk are not treated as random variables. However, for a given set of numerical values for these variables x1, x2, ..., xk, the response variable, denoted by Y|x1, x2, ..., xk, is assumed to be a random variable. The general linear model expresses the mean value of this conditional random variable as a function of x1, x2, ..., xk. The model takes the following form:

μY|x1,x2,...,xk = β0 + β1 x1 + β2 x2 + ...... + βk xk  (1.11)

In this model x1, x2, ..., xk denote known real numbers; β0, β1, β2, ....., βk denote unknown parameters. Our main task is to estimate the values of these parameters from a data set.

Example 1.6.1

Suppose that we want to develop an equation with which we can predict the gasoline mileage of an automobile based on its weight and the temperature at the time of operation. We might pose the model

μY|x1,x2 = β0 + β1 x1 + β2 x2

Here the response variable is Y, the mileage obtained. There are two independent or predictor variables. These are X1, the weight of the car, and X2, the temperature. The values assumed by these variables are denoted by x1 and x2, respectively. For example, we might want to predict the gas mileage for a car that weighs 1.6 tons when it is being driven in 85°F weather. Here x1 = 1.6 and x2 = 85. The unknown parameters in the model are β0, β1, and β2. Their values are to be estimated from the data gathered.
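For a model with two predictors, the least-squares estimates solve the normal equations (XᵀX)b = Xᵀy, the multi-predictor analogue of the normal equations derived earlier for the simple model. The sketch below uses invented data (not the mileage data of the example), constructed to lie exactly on the plane y = 1 + 2x1 + 3x2, so the solve must recover those coefficients:

```python
# Hypothetical data lying exactly on y = 1 + 2*x1 + 3*x2.
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]  # leading 1s for beta0
y = [1, 3, 4, 6, 8]

# Form the normal equations: (X^T X) b = X^T y
k = len(X[0])
XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]

# Solve the k-by-k system by Gauss-Jordan elimination with partial pivoting.
A = [row[:] + [rhs] for row, rhs in zip(XtX, Xty)]   # augmented matrix
for col in range(k):
    piv = max(range(col, k), key=lambda r: abs(A[r][col]))
    A[col], A[piv] = A[piv], A[col]
    for r in range(k):
        if r != col:
            f = A[r][col] / A[col][col]
            A[r] = [a - f * c for a, c in zip(A[r], A[col])]
b = [A[i][k] / A[i][i] for i in range(k)]

print([round(v, 6) for v in b])  # recovers [1.0, 2.0, 3.0]
```

In practice a statistical package (such as the MINITAB and R software used in this thesis) performs this solve internally.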

It is possible to treat the polynomial and multiple regression models simultaneously from a mathematical standpoint. However, they differ enough in a practical sense to justify considering them separately. We begin with a description of the general polynomial model.


The general polynomial regression model of degree p expresses the mean of the response variable Y as a polynomial function of one predictor variable X. It takes the form

μY|x = β0 + β1 x + β2 x² + ...... + βp xᵖ  (1.12)

By letting x1 = x, x2 = x², ..., xp = xᵖ, the model can be rewritten in the general linear form 1.11. Scattergrams are useful in determining when a polynomial model might be appropriate. The pattern shown in Figure 1.10 suggests the quadratic model

μY|x = β0 + β1 x + β2 x²  (1.13)

while a pattern with an additional bend suggests the cubic model

μY|x = β0 + β1 x + β2 x² + β3 x³.  (1.14)

Once we decide that a polynomial model is appropriate, we are faced with the problem of estimating the parameters β0, β1, β2, ......, βp. To apply the method of least squares, we first express the polynomial in the form

Y|x = β0 + β1 x + β2 x² + ...... + βp xᵖ + E  (1.15)

where Y|x denotes the response variable when the predictor variable assumes the value x, and E denotes the random difference between Y|x and its mean value, μY|x = β0 + β1 x + β2 x² + ...... + βp xᵖ. A random sample of size n takes the form {(x1, Y|x1), (x2, Y|x2), ....., (xn, Y|xn)}, where the first member of each ordered pair denotes a real number, and the second, a random variable. As in the case of simple linear regression, it is customary to drop the conditional notation. The sample itself becomes (x1, y1), (x2, y2), ...., (xn, yn), where yi denotes the observed value of Y|xi for each i = 1, 2, ...., n.

[Figure 1.10: A scattergram pattern suggesting a quadratic model.]

Once again, we must assume that the random errors E1 , E2 , ......, En are independent random variables, each with mean 0 and variance σ 2.

The estimated mean response, the estimated value of Y for a given x, is given by μ̂Y |x = b0 + b1 x + b2 x^2 + ...... + bp x^p, where b0 , b1 , ....., bp are the least-squares estimates for β0 , β1 , ...., βp . To find these estimates, we minimize the sum of squares of the residuals.
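The least-squares estimation just described can be sketched in code. This is a minimal illustration, not part of the thesis: it fits a degree-p polynomial by solving the normal equations, with invented data chosen to lie exactly on a known quadratic so the estimates are easy to check.

```python
# Least-squares fit of a polynomial of degree p (a sketch of the estimation
# problem for equation 1.15; the data below are invented for illustration).

def polyfit_ls(xs, ys, p):
    """Fit mu_{Y|x} = b0 + b1*x + ... + bp*x^p by least squares,
    solving the normal equations (X'X) b = X'y with Gaussian elimination."""
    m = p + 1
    # Entries of X'X and X'y, where the design matrix has X[i][j] = xs[i]**j.
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    v = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(m)]
    # Forward elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= f * a[col][c]
            v[r] -= f * v[col]
    # Back substitution.
    b = [0.0] * m
    for i in reversed(range(m)):
        b[i] = (v[i] - sum(a[i][j] * b[j] for j in range(i + 1, m))) / a[i][i]
    return b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]   # exact quadratic, so b -> (1, 2, 3)
b = polyfit_ls(xs, ys, 2)
```

Because the invented points lie exactly on y = 1 + 2x + 3x^2, the fitted coefficients recover (1, 2, 3) up to rounding.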


Chapter 2

Regression methods have become an integral component of any data analysis concerned with describing the relationship between a response variable and one or more explanatory variables. It is often the case that the outcome variable is discrete, taking on two or more possible values. Over the last decade the logistic regression model has become, in many fields, the standard method of analysis in this situation.

Logistic regression analysis is the most popular regression technique available for modeling dichotomous dependent variables. In this chapter I describe the univariate logistic regression model and several of its key features, particularly how an odds ratio can be estimated from it. We also demonstrate how logistic regression may be applied, using a real-life data set.

Before beginning a study of logistic regression it is important to understand that the goal of an analysis using this method is the same as that of any model-building technique used in statistics: to find the best fitting and most parsimonious, yet biologically reasonable, model to describe the relationship between an outcome (dependent or response) variable and a set of independent (predictor or explanatory) variables. These variables are often called covariates. The most common example of modeling, and one assumed to be familiar to the readers of this text, is the usual linear regression model, where the outcome variable is assumed to be continuous.

Ordinary Linear Regression

Early statisticians used ordinary linear regression for their data analyses; they did not use logistic regression with a binary outcome. As statistics developed, however, most statisticians and psychologists came to prefer logistic regression for a binary outcome, for the following reasons.


(i) The outcome variable in logistic regression is binary or dichotomous.

(ii) If you use linear regression, the predicted values can become greater than one or less than zero. Such values are theoretically inadmissible.

(iii) One of the assumptions of regression is that the variance of Y is constant across levels of X. This cannot be the case with a binary variable, because the variance is P Q.

Example 2.1: When 50 percent of the people have the characteristic, the variance is 0.25, its maximum value. As we move to more extreme values, the variance decreases: when p = 0.1, the variance (p × q) is 0.1×0.9 = 0.09. So as p approaches 1 or zero, the variance approaches zero.
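The arithmetic in Example 2.1 can be checked directly. A tiny sketch, not part of the thesis:

```python
# Variance of a 0/1 (Bernoulli) variable is p*q = p*(1 - p): it peaks at
# p = 0.5 and shrinks toward zero as p approaches 0 or 1 (Example 2.1).
def binary_variance(p):
    return p * (1 - p)

v_half = binary_variance(0.5)    # 0.25, the maximum
v_tenth = binary_variance(0.1)   # 0.1 * 0.9 = 0.09
```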

This difference between logistic and linear regression is reflected both in the choice of parametric model and in the assumptions. Once this difference is accounted for, the methods employed in an analysis using logistic regression follow the same general principles used in linear regression. Thus, the techniques used in linear regression analysis will motivate our approach to logistic regression. We illustrate both the similarities and differences between logistic regression and linear regression with an example.

Example 2.2

Table 2.1: Age and coronary heart disease (CHD) status of 100 subjects.

ID AGE AGRP CHD ID AGE AGRP CHD

1 20 1 0 21 34 2 0

2 23 1 0 22 34 2 0

3 24 1 0 23 34 2 1

4 25 1 0 24 34 2 0

5 25 1 1 25 34 2 0

6 26 1 0 26 35 3 0

7 26 1 0 27 35 3 0

8 28 1 0 28 35 3 0

9 28 1 0 29 36 3 1

10 29 1 0 30 36 3 0

11 30 2 0 31 37 3 0

12 30 2 0 32 37 3 1

13 30 2 0 33 37 3 0

14 30 2 0 34 38 3 0

15 30 2 0 35 38 3 0

16 30 2 1 36 39 3 0

17 32 2 0 37 39 3 1

18 30 2 0 38 40 4 0

19 33 2 0 39 40 4 1

20 33 2 0 40 41 4 0


41 41 4 0 71 53 6 1

42 42 4 0 72 54 6 1

43 42 4 0 73 55 6 1

44 42 4 0 74 55 7 1

45 42 4 1 75 55 7 0

46 43 4 0 76 56 7 1

47 43 4 0 77 56 7 1

48 43 4 1 78 56 7 1

49 44 4 0 80 57 7 1

50 44 4 0 81 57 7 0

52 44 4 1 82 57 7 0

53 45 5 1 83 57 7 1

54 45 5 0 84 57 7 1

55 46 5 1 85 57 7 1

56 46 5 0 86 58 7 1

57 47 5 1 87 58 7 0

58 47 5 0 88 58 7 1

59 47 5 0 89 59 7 1

60 48 5 1 90 59 7 1

61 48 5 0 91 60 8 0

62 48 5 1 92 60 8 1

63 49 5 1 93 61 8 1

64 49 5 0 94 62 8 0

65 49 5 0 95 62 8 1

66 50 6 1 96 63 8 1

67 51 6 0 97 64 8 1

68 52 6 1 98 64 8 1

69 52 6 0 99 65 8 1

70 53 6 0 100 69 8 1

Table 2.1 lists age in years (AGE), and presence or absence of evidence of significant coronary heart disease (CHD), for 100 subjects selected to participate in a study. The table also contains an identifier variable (ID) and an age group variable (AGRP). The outcome variable is CHD, which is coded with a value of zero to indicate that CHD is absent, or 1 to indicate that it is present in the individual.

It is of interest to explore the relationship between age and the presence or absence of CHD in this study population. Had the outcome variable CHD been continuous rather than binary, we would probably begin by forming a scatterplot of the outcome versus the independent variable. Such a scatterplot of the data in Table 2.1 is given in Figure 2.1.


Figure 2.1: Scatterplot of CHD by AGE for 100 subjects.

In this scatterplot all points fall on one of two parallel lines representing the absence of CHD (y = 0) and the presence of CHD (y = 1). There is some tendency for the individuals with no evidence of CHD to be younger than those with evidence of CHD. While this plot does depict the dichotomous nature of the outcome variable quite clearly, it does not provide a clear picture of the nature of the relationship between CHD and age. Another problem with Figure 2.1 is that the variability in CHD at all ages is large. This makes it difficult to describe the functional relationship between age and CHD.

One common method of removing some variation, while still maintaining the structure of the relationship between the outcome and the independent variable, is to create intervals for the independent variable and compute the mean of the outcome variable within each group. This strategy is carried out here by using the age group variable, AGRP, which categorizes the age data of Table 2.1. Table 2.2 contains, for each age group, the frequency of occurrence of each outcome as well as the mean (or proportion with CHD present) for each group.
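The grouping strategy behind Table 2.2 can be sketched as follows. The records and interval breaks here are invented for illustration and are not the study data:

```python
# Group a binary outcome by intervals of the predictor and compute the
# proportion with the outcome present in each interval (as in Table 2.2).
records = [(23, 0), (27, 0), (31, 0), (33, 1), (36, 1), (38, 1), (42, 1), (44, 1)]

def group_proportions(records, breaks):
    """Return {(lo, hi): proportion with outcome 1} for intervals [lo, hi)."""
    groups = {}
    for age, chd in records:
        for lo, hi in zip(breaks, breaks[1:]):
            if lo <= age < hi:
                groups.setdefault((lo, hi), []).append(chd)
                break
    return {iv: sum(ys) / len(ys) for iv, ys in groups.items()}

props = group_proportions(records, [20, 30, 40, 50])
# For these invented records the proportion rises with age,
# mirroring the pattern in Table 2.2.
```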

Now we can plot the proportion of individuals with CHD versus the midpoint of each age interval. On examining this plot, a clearer picture of the relationship begins to emerge. It appears that as age increases, the proportion of individuals with evidence of CHD increases.

While this provides considerable insight into the relationship between CHD and age in this study, a functional form for this relationship needs to be described. The plot in this figure is similar to what one might obtain if this same process of grouping and averaging were performed in a linear regression. We note two important differences.


Age Group n Absent Present Mean(Proportion)

20-29 10 9 1 0.10

30-34 15 13 2 0.13

35-39 12 9 3 0.25

40-44 15 10 5 0.33

45-49 13 7 6 0.46

50-54 8 3 5 0.63

55-59 17 4 13 0.76

60-69 10 2 8 0.80

Total 100 57 43 0.43

Figure 2.2: Plot of the percentage of subjects with CHD in each age group.

Table 2.2: Frequency of CHD within each age group.

The first difference concerns the nature of the relationship between the outcome and independent variable. In any regression problem the key quantity is the mean value of the outcome variable, given the value of the independent variable. This quantity is called the conditional mean and will be expressed as E(Y |x), where Y denotes the outcome variable and x denotes a value of the independent variable. The quantity E(Y |x) is read "the expected value of Y , given the value x". In linear regression, we assume that this mean may be expressed as an equation linear in x,

E(Y |x) = β0 + β1 x (2.1)

This expression implies that it is possible for E(Y |x) to take on any value as x ranges between −∞ and ∞.

The column labeled "Mean" in Table 2.2 provides an estimate of E(Y |x). We will assume, for purposes of exposition, that the estimated values plotted in Figure 2.2 are close enough to the true values of E(Y |x) to provide a reasonable assessment of the relationship between CHD and AGE. With dichotomous data, the conditional mean must be greater than or equal to zero and less than or equal to 1 [0 ≤ E(Y |x) ≤ 1].

The change in E(Y |x) per unit change in x becomes progressively smaller as the conditional mean gets closer to 0 or 1. The curve is said to be S-shaped. It resembles a plot of the cumulative distribution of a random variable. It should not seem surprising that some well-known cumulative distributions have been used to provide a model for E(Y |x) in the case when Y is dichotomous. The model we will use is that of the logistic distribution. Many distribution functions have been proposed for use in the analysis of a dichotomous outcome variable. Cox and Snell (1989) discuss some of these. There are two primary reasons for choosing the logistic distribution.

First, from a mathematical point of view, it is an extremely flexible and easily used function; and second, it lends itself to a clinically meaningful interpretation.

Logistic regression is a mathematical modeling approach that can be used to describe the relationship of several variables X to a dichotomous dependent variable Y , where Y is typically coded as 1 or 0 for its two possible categories. The logistic model describes the expected value of Y (i.e., E(Y )) in terms of the following "logistic formula":

E(Y ) = 1/(1 + e^(−(β0 + β1 x))) (2.2)

or equivalently,

E(Y ) = e^(β0 + β1 x) /(1 + e^(β0 + β1 x)) (2.3)

For a (0, 1) random variable such as Y , it follows from basic statistical principles about expected values that E(Y ) is equivalent to the probability P r(Y = 1). The model can therefore be written in a form that describes the probability of occurrence of one of the two possible outcomes of Y , as follows:

P (Y = 1) = 1/(1 + e^(−(β0 + β1 x))) (2.5)

In order to simplify notation, we use the quantity

π(x) = E(Y |x) (2.6)

to represent the conditional mean of Y given x when the logistic distribution is used. The specific form of the logistic regression model we use is,

π(x) = e^(β0 + β1 x) /(1 + e^(β0 + β1 x)) (2.7)

A transformation of π(x) that is central to my study of logistic regression is the logit transformation,

g(x) = ln[π(x)/(1 − π(x))] (2.8)

g(x) = ln[ (e^(β0 + β1 x) /(1 + e^(β0 + β1 x))) / (1/(1 + e^(β0 + β1 x))) ] = ln[e^(β0 + β1 x) ] = β0 + β1 x (2.9)
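The algebra of equations 2.7-2.9 can be verified numerically: applying the logit to π(x) recovers the linear form exactly. A small sketch, using the coefficient values that are the fitted estimates reported later in Table 2.3:

```python
import math

# pi(x) of equation 2.7 and the logit transformation of equation 2.8;
# applying the logit to pi(x) recovers beta0 + beta1*x, as equation 2.9
# shows algebraically.
def logistic(x, b0, b1):
    return math.exp(b0 + b1 * x) / (1 + math.exp(b0 + b1 * x))

def logit(p):
    return math.log(p / (1 - p))

b0, b1 = -5.309, 0.111       # fitted values from Table 2.3
p50 = logistic(50, b0, b1)   # about 0.56
g50 = logit(p50)             # recovers b0 + b1*50
```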

(1) The logit transformation g(x) has many of the desirable properties of a linear regression model.

(2) The logit, g(x), is linear in its parameters, may be continuous, and may range from −∞ to +∞, depending on the range of x.


(3) The third difference concerns the conditional distribution of the outcome variable. In the linear regression model we assume that an observation of the outcome variable may be expressed as

Y = E(Y |x) + ε

The quantity ε is called the error and expresses an observation's deviation from the conditional mean. The most common assumption is that ε follows a normal distribution with mean zero and some variance that is constant across levels of the independent variable. It follows that the conditional distribution of the outcome variable given x will be normal with mean E(Y |x), and a variance that is constant. This is not the case with a dichotomous outcome variable. In this situation we may express the value of the outcome variable given x as y = π(x) + ε. Here the quantity ε may assume one of two possible values: if y = 1 then ε = 1 − π(x) with probability π(x), and if y = 0 then ε = −π(x) with probability 1 − π(x). Thus, ε has a distribution with mean zero and variance equal to π(x)[1 − π(x)]. That is, the conditional distribution of the outcome variable follows a binomial distribution with probability given by the conditional mean, π(x).

Summary:

We have seen that in a regression analysis when the outcome variable is dichotomous:

(1) The conditional mean of the regression equation must be formulated to be bounded between zero and 1. We have stated that the logistic regression model, π(x), given in equation 2.7, satisfies this constraint.

(2) The binomial, not the normal, distribution describes the distribution of the errors and will be the statistical distribution upon which the analysis is based.

(3) The principles that guide an analysis using linear regression will also guide us in logistic regression.

Suppose we have a sample of n independent observations of the pair (xi , yi ), i = 1, 2, ..., n, where yi denotes the value of a dichotomous outcome variable and xi is the value of the independent variable for the ith subject. Furthermore, assume that the outcome variable has been coded as 0 or 1, representing the absence or the presence of the characteristic, respectively. This coding for a dichotomous outcome is used throughout this text. To fit the logistic regression model in equation 2.7 to a set of data requires that we estimate the values of β0 and β1 , the unknown parameters.

In linear regression, the method used most often for estimating unknown parameters is least squares. In that method we choose those values of β0 and β1 which minimize the sum of squared deviations of the observed values of Y from the predicted values based upon the model, under the usual assumptions for linear regression. The method of least squares yields estimators with a number of desirable statistical properties. Unfortunately, when the method of least squares is applied to a model with a dichotomous outcome, the estimators no longer have these same properties.

Maximum Likelihood Method

The general method of estimation that leads to the least squares function under the linear regression model (when the error terms are normally distributed) is called maximum likelihood. This method will provide the foundation for our approach to estimation with the logistic regression model. In a very general sense, the method of maximum likelihood yields values for the unknown parameters which maximize the probability of obtaining the observed set of data.

In order to apply this method we must first construct a function, called the likelihood function. This function expresses the probability of the observed data as a function of the unknown parameters. The maximum likelihood estimators of these parameters are chosen to be those values that maximize this function. We now describe how to find these values for the logistic regression model.

If Y is coded as 0 or 1, then the expression for π(x) given in equation 2.7 provides (for an arbitrary value of β = (β0 , β1 )) the conditional probability that Y is equal to 1 given x. This will be denoted as P (Y = 1|x). It follows that the quantity 1 − π(x) gives the conditional probability that Y is equal to zero given x, P (Y = 0|x). Thus, for those pairs (xi , yi ) where yi = 1, the contribution to the likelihood function is π(xi ), and for those pairs where yi = 0, the contribution to the likelihood function is 1 − π(xi ), where the quantity π(xi ) denotes the value of π(x) computed at xi . A convenient way to express the contribution to the likelihood function for the pair (xi , yi ) is through the expression,

contribution to the likelihood function for the pair (xi , yi ) = π(xi )^yi [1 − π(xi )]^(1−yi) (2.10)


Since the observations are assumed to be independent, the likelihood function is obtained as the product of the terms given in expression 2.10, as follows:

l(β) = Π_{i=1}^{n} π(xi )^yi [1 − π(xi )]^(1−yi) (2.11)

The principle of maximum likelihood states that we use as our estimate of β the value which maximizes the expression in 2.11. However, it is easier mathematically to work with the log of equation 2.11. This expression, the log likelihood, is defined as,

L(β) = ln[l(β)] = Σ_{i=1}^{n} { yi ln[π(xi )] + (1 − yi ) ln[1 − π(xi )] } (2.12)

L(β) = ln[l(β)] = Σ_{i=1}^{n} { yi ln[π(xi )/(1 − π(xi ))] + ln[1 − π(xi )] } (2.13)

By equation 2.7,

π(x) = e^(β0 + β1 x) /(1 + e^(β0 + β1 x))

so we obtain

1 − π(xi ) = 1/(1 + e^(β0 + β1 xi)) (2.14)

Dividing equation 2.7 by equation 2.14,

π(xi )/(1 − π(xi )) = e^(β0 + β1 xi) (2.15)

ln[π(xi )/(1 − π(xi ))] = β0 + β1 xi (2.16)

Substituting 2.16 and 2.14 into 2.13,

L(β) = ln[l(β)] = Σ_{i=1}^{n} { yi (β0 + β1 xi ) − ln(1 + e^(β0 + β1 xi)) } (2.17)

Differentiating with respect to β0 ,

∂L(β)/∂β0 = Σ_{i=1}^{n} { yi − e^(β0 + β1 xi) /(1 + e^(β0 + β1 xi)) } = Σ_{i=1}^{n} [yi − π(xi )] (2.18)

Now differentiating with respect to β1 ,

∂L(β)/∂β1 = Σ_{i=1}^{n} { yi xi − xi e^(β0 + β1 xi) /(1 + e^(β0 + β1 xi)) } = Σ_{i=1}^{n} xi [yi − π(xi )]

To ﬁnd the value of β that maximizes L(β), we diﬀerentiate L(β) with respect to β0

and β1 and set the resulting expressions equal to zero. These equations, known as the

likelihood equations, are:

Σ_{i=1}^{n} [yi − π(xi )] = 0 (2.19)

Σ_{i=1}^{n} xi [yi − π(xi )] = 0 (2.20)

The value of β given by the solution to equations 2.19 and 2.20 is called the maximum likelihood estimate and is denoted as β̂. In general, the use of the symbol "ˆ" denotes the maximum likelihood estimate of the respective quantity. For example, π̂(xi ) is the maximum likelihood estimate of π(xi ). This quantity provides an estimate of the conditional probability that Y is equal to 1, given that x is equal to xi . As such, it represents the fitted or predicted value for the logistic regression model. An interesting consequence of equation 2.19 is that,

Σ_{i=1}^{n} yi = Σ_{i=1}^{n} π̂(xi ) (2.21)

That is, the sum of the observed values of y is equal to the sum of the predicted (expected) values.

As an example, consider the data given in Table 2.1. Use of a statistical software package called Minitab, with AGE as the independent variable, produced the output in Table 2.3. The maximum likelihood estimates of β0 and β1 are thus seen to be β̂0 = −5.309 and β̂1 = 0.111. The fitted values are given by the equation

π̂(x) = e^(−5.309 + 0.111 × AGE) /(1 + e^(−5.309 + 0.111 × AGE)) (2.22)

and the estimated logit, ĝ(x), is given by the equation

ĝ(x) = −5.309 + 0.111 × AGE (2.23)

The log likelihood given in Table 2.3 is the value of equation 2.12 computed using β̂0 and β̂1 .


Variable Coeﬀ Std.Err Z P > |Z|

AGE 0.111 0.0241 4.61 0.001

Constant -5.309 1.1337 -4.68 0.001

Table 2.3: Results of ﬁtting the logistic regression model to the data in Table 2.1
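The likelihood equations 2.19 and 2.20 have no closed-form solution, so software such as Minitab solves them iteratively. The sketch below uses Newton-Raphson on a small invented eight-subject sample (not the Table 2.1 data); the key check is that the fitted probabilities satisfy the likelihood equations, including the consequence in equation 2.21:

```python
import math

# Maximum likelihood for the univariate logistic model via Newton-Raphson
# on the likelihood equations 2.19 and 2.20. Illustrative only.
def fit_logistic(xs, ys, iters=50):
    b0 = b1 = 0.0
    for _ in range(iters):
        ps = [1 / (1 + math.exp(-(b0 + b1 * x))) for x in xs]
        # Gradient: the left-hand sides of equations 2.19 and 2.20.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # Observed information matrix entries, with weights pi*(1 - pi).
        w = [p * (1 - p) for p in ps]
        h00 = sum(w)
        h01 = sum(wi * x for wi, x in zip(w, xs))
        h11 = sum(wi * x * x for wi, x in zip(w, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: (b0, b1) += H^{-1} (g0, g1) for the 2x2 matrix H.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

xs = [25, 30, 35, 40, 45, 50, 55, 60]   # ages (invented)
ys = [0, 0, 0, 1, 0, 1, 1, 1]           # outcomes (invented)
b0, b1 = fit_logistic(xs, ys)
ps = [1 / (1 + math.exp(-(b0 + b1 * x))) for x in xs]
```

At the solution, the observed and predicted totals agree, as equation 2.21 requires.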

The general method for assessing the significance of variables is easily illustrated in the linear regression model, and its use there will motivate the approach for logistic regression. A comparison of the two approaches will highlight the differences between modeling continuous and dichotomous response variables. In linear regression, the assessment of the significance of the slope coefficient is approached by forming what is referred to as an analysis of variance table. This table partitions the total sum of squared deviations about the mean into two parts:

(1) The sum of squared deviations of observations about the regression line, SSE (or residual sum-of-squares).

(2) The sum of squares of predicted values, based on the regression model, about the mean of the dependent variable, SSR (or due-regression sum-of-squares).

In linear regression, the comparison of observed and predicted values is based on the square of the distance between the two. If yi denotes the observed value and ŷi the predicted value for the ith individual under the model, then the statistic used to evaluate this comparison is,

SSE = Σ_{i=1}^{n} (yi − ŷi )^2 (2.24)

Under the model not containing the independent variable in question, the only parameter is β0 , and β̂0 = ȳ, the mean of the response variable. In this case, ŷi = ȳ and SSE is equal to the total variance. When we include the independent variable in the model, any decrease in SSE will be due to the fact that the slope coefficient for the independent variable is not zero. The change in the value of SSE is due to the regression source of variability, denoted SSR. That is,

SSR = Σ_{i=1}^{n} (yi − ȳ)^2 − Σ_{i=1}^{n} (yi − ŷi )^2 (2.25)

For the Logistic Regression

The guiding principle with logistic regression is the same: the comparison of observed to predicted values is based on the log likelihood function defined in equation 2.12,

L(β) = ln[l(β)] = Σ_{i=1}^{n} { yi ln[π(xi )] + (1 − yi ) ln[1 − π(xi )] } (2.26)

To better understand equation 2.26, it helps to think of an observed value of the response variable as also being a predicted value resulting from a saturated model. A saturated model is one that contains as many parameters as there are data points. (A simple example of a saturated model is fitting a linear regression model when there are only two data points, n = 2.)

The comparison of observed to predicted values using the likelihood function is based on the following expression:

D = −2 ln[ (likelihood of the fitted model) / (likelihood of the saturated model) ] (2.27)

The quantity inside the large brackets in the expression above is called the likelihood ratio. Using minus twice its log is necessary to obtain a quantity whose distribution is known and can therefore be used for hypothesis testing purposes. Such a test is called the likelihood ratio test. Using equation 2.12, equation 2.27 becomes

D = −2 Σ_{i=1}^{n} { yi ln(π̂i /yi ) + (1 − yi ) ln[(1 − π̂i )/(1 − yi )] } (2.28)

The statistic D in equation 2.28 is called the deviance by some authors, and plays a central role in some approaches to assessing goodness of fit. The deviance for logistic regression plays the same role that the residual sum of squares plays in linear regression.

Furthermore, in a setting such as that shown in Table 2.1, where the values of the outcome variable are either 0 or 1, the likelihood of the saturated model is 1:

l(saturated model) = Π_{i=1}^{n} yi^yi × (1 − yi )^(1−yi) = 1 (2.29)

So that,

D = −2 ln(likelihood of the fitted model) (2.30)

For purposes of assessing the significance of an independent variable, we compare the value of D with and without the independent variable in the equation. The change in D due to the inclusion of the independent variable in the model is obtained as:

G = D(model without the variable) − D(model with the variable) (2.31)


This statistic plays the same role in logistic regression as the numerator of the partial F test does in linear regression. Because the likelihood of the saturated model is common to both values of D being differenced to compute G, G can be expressed as,

G = −2 ln[ (likelihood without the variable) / (likelihood with the variable) ] (2.32)

l(without the variable) = Π_{i=1}^{n} [e^β̂0 /(1 + e^β̂0 )]^yi [1 − e^β̂0 /(1 + e^β̂0 )]^(1−yi)

= Π_{i=1}^{n} [e^β̂0 /(1 + e^β̂0 )]^yi [1/(1 + e^β̂0 )]^(1−yi)

= Π_{i=1}^{n} (e^β̂0 )^yi /(1 + e^β̂0 )

so that

ln[l(without the variable)] = Σ_{i=1}^{n} yi β̂0 − n ln(1 + e^β̂0 )

∂/∂β0 ln[l(without the variable)] = Σ_{i=1}^{n} yi − n e^β̂0 /(1 + e^β̂0 ) (2.33)

Setting this derivative equal to zero, and writing n1 = Σ_{i=1}^{n} yi , gives

n1 /n = e^β̂0 /(1 + e^β̂0 ) (2.34)

Similarly, writing

n0 = Σ_{i=1}^{n} (1 − yi ) (2.35)

we obtain

n0 /n = 1/(1 + e^β̂0 ) (2.36)


In this case, the value of G is,

G = −2 ln[ (n1 /n)^n1 (n0 /n)^n0 / Π_{i=1}^{n} π̂i^yi (1 − π̂i )^(1−yi) ] (2.37)

which reduces to

G = 2 { Σ_{i=1}^{n} [yi ln(π̂i ) + (1 − yi ) ln(1 − π̂i )] − [n1 ln(n1 ) + n0 ln(n0 ) − n ln(n)] } (2.38)

Hypothesis:

H0 : β1 = 0

H1 : β1 ≠ 0

Under the hypothesis that β1 is equal to zero, the statistic G follows a chi-square distribution with 1 degree of freedom. Additional mathematical assumptions are also needed; however, for the above case they are rather nonrestrictive, and involve having a sufficiently large sample size n.

We use the symbol χ2 (ν) to denote a chi-square random variable with ν degrees of freedom. As an example, we consider the model fit to the data in Table 2.1, whose estimated coefficients and log likelihood are given in Table 2.3. For these data, n1 = 43 and n0 = 57; thus, evaluating G as shown in equation 2.38 yields

G = 2[−53.677 − (−68.331)] = 29.31

Using this notation, the p-value associated with this test is P [χ2 (1) > 29.31] < 0.001. Thus, we have convincing evidence that AGE is a significant variable in predicting CHD. This is merely a statement of the statistical evidence for this variable.
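The tail probability P [χ2 (1) > G] can be computed without tables, since for one degree of freedom it equals erfc(√(G/2)). A sketch using the two log likelihoods quoted above:

```python
import math

# Chi-square (1 df) upper tail: P(chi2_1 > g) = P(Z^2 > g) = erfc(sqrt(g/2)).
def chi2_1_sf(g):
    return math.erfc(math.sqrt(g / 2))

# Log likelihoods of the fitted and constant-only models, as quoted in the text.
G = 2 * (-53.677 - (-68.331))
p_value = chi2_1_sf(G)        # far below 0.001
```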

Other important factors to consider before concluding that the variable is clinically important would include the appropriateness of the fitted model, as well as the inclusion of other potentially important variables.

Wald test

The assumptions needed for this test are the same as those of the likelihood ratio test in equation 2.12. A more complete discussion of these tests and their assumptions may be found in Rao (1973). The Wald test statistic is obtained by comparing the maximum likelihood estimate of the slope parameter, β̂1 , to an estimate of its standard error.

Hypothesis:

H0 : β1 = 0

H1 : β1 ≠ 0

The resulting ratio, under the hypothesis that β1 = 0, will follow a standard normal distribution.

Test statistic:

W = β̂1 /ŜE(β̂1 ) (2.39)

For example, the Wald test for the logistic regression model in Table 2.3 is provided in the column headed Z and is,

W = 0.111/0.0241 = 4.61

and the two-tailed p-value, given in Table 2.3, is P (|Z| > 4.61) < 0.001, where Z denotes a random variable following the standard normal distribution. Hauck and Donner (1977) examined the performance of the Wald test and found that it behaved in an aberrant manner, often failing to reject the null hypothesis when the coefficient was significant. They recommended that the likelihood ratio test be used.

Jennings (1986) has also looked at the adequacy of inferences in logistic regression based on Wald statistics. His conclusions are similar to those of Hauck and Donner. Both the likelihood ratio test, G, and the Wald test, W , require the computation of the maximum likelihood estimate of β1 .
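Equation 2.39 and its two-tailed p-value are easy to reproduce from the Table 2.3 estimates; the normal tail is again obtained through erfc:

```python
import math

# Wald statistic of equation 2.39, with two-tailed normal p-value
# P(|Z| > W) = erfc(|W| / sqrt(2)). Estimates are those in Table 2.3.
beta1_hat, se_beta1 = 0.111, 0.0241
W = beta1_hat / se_beta1                    # about 4.61
p_value = math.erfc(abs(W) / math.sqrt(2))  # well below 0.001
```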

Score test

In the univariate case, this test is based on the conditional distribution of the derivative in equation 2.19. The test uses the value of equation 2.20 computed using β0 = ln(n1 /n0 ) and β1 = 0. As noted earlier, under these parameter values, π̂ = n1 /n = ȳ. Thus, the left-hand side of equation 2.20 becomes Σ_{i=1}^{n} xi (yi − ȳ). It may be shown that the estimated variance of this quantity is ȳ(1 − ȳ) Σ_{i=1}^{n} (xi − x̄)^2 . The test statistic for the score test (ST) is,

ST = Σ_{i=1}^{n} xi (yi − ȳ) / √( ȳ(1 − ȳ) Σ_{i=1}^{n} (xi − x̄)^2 ) (2.40)

Hypothesis:

H0 : β1 = 0

H1 : β1 ≠ 0


As an example of the score test, consider the model fit to the data in Table 2.1. The value of the test statistic for this example is,

ST = 296.66/√3333.7342 = 5.14

and the two-tailed p-value is P (|Z| > 5.14) < 0.001.
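The score statistic uses only sums that can be formed without fitting the model. Plugging in the two summary quantities quoted in the text (note that the quotient evaluates to about 5.14):

```python
import math

# Score statistic of equation 2.40 from the summary sums reported in the text.
num = 296.66                  # sum of x_i * (y_i - ybar)
den = math.sqrt(3333.7342)    # sqrt of ybar*(1 - ybar)*sum (x_i - xbar)^2
ST = num / den
p_value = math.erfc(abs(ST) / math.sqrt(2))   # two-tailed normal p-value
```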

The basis for construction of the interval estimators is the same statistical theory we used to formulate the tests of significance of the model. Since

(β̂1 − β1 )/ŜE(β̂1 ) ∼ N (0, 1),

P r( −z_{1−α/2} ≤ (β̂1 − β1 )/ŜE(β̂1 ) ≤ z_{1−α/2} ) = 1 − α.

The endpoints of a 100(1 − α)% confidence interval for the slope coefficient are,

β̂1 ± z_{1−α/2} ŜE(β̂1 ) (2.41)

and, for the intercept, they are,

β̂0 ± z_{1−α/2} ŜE(β̂0 ) (2.42)

where z_{1−α/2} is the upper 100(1 − α/2)% point from the standard normal distribution and ŜE denotes a model-based estimate of the standard error of the respective parameter estimator.

As an example, consider the model fit to the data in Table 2.1, relating AGE to the presence or absence of CHD. The results are presented in Table 2.3. The endpoints of a 95% confidence interval for the slope coefficient, from equation 2.41, are,

0.111 ± 1.96 × 0.0241

giving the interval (0.064, 0.158). Briefly, the results suggest that the change in the log-odds of CHD per one year increase in age is 0.111, and the change could be as little as 0.064 or as much as 0.158, with 95 percent confidence.
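The interval follows directly from equation 2.41; a one-line check using the Table 2.3 estimates:

```python
# 95% Wald confidence interval for the slope (equation 2.41),
# using the Table 2.3 estimates.
beta1_hat, se = 0.111, 0.0241
z = 1.96                      # upper 97.5% point of the standard normal
lower = beta1_hat - z * se
upper = beta1_hat + z * se    # interval is about (0.064, 0.158)
```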

The logit is the linear part of the logistic regression model and, as such, it is most like the fitted line in a linear regression model. The estimator of the logit is,

ĝ(x) = β̂0 + β̂1 x (2.43)

The estimator of the variance of the estimator of the logit requires obtaining the variance of a sum. In this case it is,

v̂ar(ĝ(x)) = v̂ar(β̂0 ) + x^2 v̂ar(β̂1 ) + 2x ĉov(β̂0 , β̂1 ). (2.44)

In general, the variance of a sum is equal to the sum of the variances of each term plus twice the covariance of each possible pair of terms formed from the components of the sum.

Since ĝ(x) ∼ N (g(x), v̂ar[ĝ(x)]),

(ĝ(x) − g(x))/ŜE[ĝ(x)] ∼ N (0, 1)

P r( −z_{1−α/2} ≤ (ĝ(x) − g(x))/ŜE[ĝ(x)] ≤ z_{1−α/2} ) = 1 − α

so the endpoints of a 100(1 − α)% Wald-based confidence interval for the logit are,

ĝ(x) ± z_{1−α/2} ŜE[ĝ(x)] (2.45)

where ŜE[ĝ(x)] is the positive square root of the variance estimator in 2.44.

The estimated logit for the fitted model in Table 2.3 is shown in equation 2.23. In order to evaluate 2.44 for a specific age, we need the estimated covariance matrix. This matrix can be obtained from the output of all logistic regression software packages. The estimated logit from 2.23 for a subject of age 50 is,

ĝ(50) = −5.309 + 0.111 × 50 = 0.240

The estimated variance, using equation 2.44 and the results in Table 2.4, is,

v̂ar[ĝ(50)] = 1.28517 + (50)^2 × (0.000579) + 2 × 50 × (−0.026677) = 0.0650

and the estimated standard error is ŜE[ĝ(50)] = 0.255. Thus the endpoints of a 95 percent confidence interval for the logit at age 50 are,

0.240 ± 1.96 × 0.255 = (−0.260, 0.740)


AGE Constant

AGE 0.000579

Constant -0.026677 1.28517

Table 2.4: Estimated covariance matrix of the estimated coefficients in Table 2.3

The estimate of the logit and its confidence interval provide the basis for the estimate of the fitted value, in this case the logistic probability, and its associated confidence interval. In particular, using equation 2.22 at age 50, the estimated logistic probability is,

π̂(50) = e^ĝ(50) /(1 + e^ĝ(50) ) = e^(−5.309 + 0.111×50) /(1 + e^(−5.309 + 0.111×50) ) = 0.560

and the endpoints of a 95 percent confidence interval are obtained from the respective endpoints of the confidence interval for the logit. The endpoints of the 100(1 − α)% confidence interval for the fitted value are,

e^( ĝ(x) ± z_{1−α/2} ŜE[ĝ(x)] ) / ( 1 + e^( ĝ(x) ± z_{1−α/2} ŜE[ĝ(x)] ) )

Using the example at age 50 to demonstrate the calculations, the lower limit is,

e^(−0.260) /(1 + e^(−0.260) ) = 0.435

and the upper limit is,

e^(0.740) /(1 + e^(0.740) ) = 0.677

We have found that a major mistake often made by persons new to logistic regression modeling is to try to apply estimates on the probability scale to individual subjects. The fitted value computed in π̂(50) is analogous to a particular point on the line obtained from a linear regression. In linear regression, each point on the fitted line provides an estimate of the mean of the dependent variable in a population of subjects with covariate value "x".

Thus the value of 0.56 in π̂(50) is an estimate of the mean (i.e., proportion) of 50 year old subjects in the population sampled that have evidence of CHD. Each individual 50 year old subject either does or does not have evidence of CHD. The confidence interval suggests that this mean could be between 0.435 and 0.677 with 95% confidence.
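The age-50 calculation can be reproduced end to end: build the Wald interval on the logit scale with the Table 2.4 variances, then push each endpoint through the logistic function:

```python
import math

# Confidence interval for the fitted probability at age 50: form the Wald
# interval on the logit scale (equation 2.45), then transform the endpoints.
g50 = -5.309 + 0.111 * 50                 # estimated logit, about 0.24
var_g = 1.28517 + 50 ** 2 * 0.000579 + 2 * 50 * (-0.026677)   # Table 2.4
se_g = math.sqrt(var_g)

lo_logit = g50 - 1.96 * se_g              # about -0.260
hi_logit = g50 + 1.96 * se_g              # about  0.740

def expit(t):
    return math.exp(t) / (1 + math.exp(t))

lo_p, hi_p = expit(lo_logit), expit(hi_logit)   # about (0.435, 0.677)
```

Transforming the logit endpoints (rather than building a symmetric interval on the probability scale directly) keeps both endpoints inside (0, 1).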


Chapter 3

In the previous chapter we introduced the logistic regression model in the univariate context. As in the case of linear regression, the strength of a modeling technique lies in its ability to model many variables, some of which may be on different measurement scales. In this chapter we generalize the logistic model to the case of more than one independent variable. This will be referred to as the "multivariable case". Central to the consideration of multiple logistic models will be the estimation of the coefficients in the model and testing for their significance.

Consider a collection of p independent variables denoted by the vector X′ = (x1, x2, x3, ..., xp). For the moment we will assume that each of these variables is at least interval scale. Let the conditional probability that the outcome is present be denoted by P(Y = 1|X) = π(X). The logit of the multiple logistic regression model is given by the equation,

g(X) = β0 + β1x1 + β2x2 + ⋯ + βpxp   (3.1)

in which case the model is,

π(X) = e^{g(X)} / (1 + e^{g(X)})   (3.2)

If some of the independent variables are discrete, nominal scale variables such as race, sex, treatment group, and so forth, it is inappropriate to include them in the model as if they were interval scale variables. The numbers used to represent the levels of these nominal scale variables are merely identifiers, and have no numeric significance. In this situation the method of choice is to use a collection of design variables (or dummy variables).

Suppose, for example, that one of the independent variables is race, which has been coded as "White", "Black", and "Other". In this case two design variables are necessary.


RACE D1 (Design Variable) D2 (Design variable)

White 0 0

Black 1 0

Other 0 1

Table 3.1: An example of the coding the design variables for race coded at three levels

[Figure: plot of three parallel logit lines with different intercepts β0, β0′, and β0″.]

One possible coding strategy is that when the respondent is "White", the two design variables, D1 and D2, would both be set equal to zero; when the respondent is "Black", D1 would be set equal to 1, while D2 would still equal 0; and when the race of the respondent is "Other", we would use D1 = 0 and D2 = 1. Table 3.1 illustrates this coding of the design variables.

Most logistic regression software will generate design variables, and some programs have a choice of several different methods. The different strategies for creation and interpretation of design variables are discussed in detail in the next chapter. In general, if a nominal scaled variable has k possible values, then k − 1 design variables will be needed. This is true since, unless stated otherwise, all of our models have a constant term. To illustrate the notation used for design variables in this text, suppose that the jth independent variable xj has kj levels. The kj − 1 design variables will be denoted as Djl and the coefficients for these design variables will be denoted as βjl, l = 1, 2, ..., kj − 1.

Thus, the logit for a model with p variables and the jth variable being discrete would be

g(X) = β0 + β1x1 + β2x2 + ⋯ + Σ_{l=1}^{kj−1} βjl Djl + ⋯ + βpxp   (3.3)

For example, with race coded as in Table 3.1 and a single additional interval scale covariate x1, the logit is

g(X) = β0 + β1x1 + Σ_{l=1}^{2} β2l D2l = β0 + β1x1 + β21D21 + β22D22

(1) White:  g(X) = β0 + β1x1
(2) Black:  g(X) = β0 + β1x1 + β21 = (β0 + β21) + β1x1
(3) Other:  g(X) = β0 + β1x1 + β22 = (β0 + β22) + β1x1

The three equations are parallel to each other; only the intercepts are different. When discussing the multiple logistic regression model we will, in general, suppress the summation and double subscripting needed to indicate when design variables are being used. The exception to this will be the discussion of modeling strategies, where we need to use the specific value of the coefficients for any design variable in the method.
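The reference cell coding of Table 3.1 is easy to generate directly; a minimal sketch (the level names follow the race example, and the function name is ours):

```python
def design_variables(values, levels=("White", "Black", "Other")):
    """Reference cell coding: a k-level nominal variable becomes k-1
    design variables; the first level (the reference) gets all zeros."""
    non_reference = levels[1:]
    return [[1 if v == lvl else 0 for lvl in non_reference] for v in values]

race = ["White", "Black", "Other", "Black"]
print(design_variables(race))   # [[0, 0], [1, 0], [0, 1], [1, 0]]
```

Each row of the result supplies the pair (D1, D2) that enters the logit in place of the nominal variable.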

Assume that we have a sample of n independent observations (xi, yi), i = 1, 2, 3, ..., n. As in the univariate case, fitting the model requires that we obtain estimates of the vector β′ = (β0, β1, ..., βp). The method of estimation used in the multivariable case will be the same as in the univariate situation: maximum likelihood. There will be p + 1 likelihood equations that are obtained by differentiating the log-likelihood function with respect to the p + 1 coefficients.

Suppose we have a sample of n independent observations (x1i, x2i, x3i, ..., xpi, yi), i = 1, ..., n, where yi denotes the value of a dichotomous outcome variable and xi is the vector of independent variables for the ith subject. Furthermore, assume that the outcome variable has been coded as 0 or 1, representing the absence or the presence of the characteristic, respectively. The conditional probability that Y equals one given x will be denoted by P(Y = 1|x) = π(x). It follows that the quantity 1 − π(x) gives the conditional probability that Y is equal to zero given x, P(Y = 0|x). Thus, for those pairs (x1i, x2i, ..., xpi, yi) where yi = 1 the contribution to the likelihood function is π(xi), and for those pairs where yi = 0 the contribution to the likelihood function is 1 − π(xi), where the quantity π(xi) denotes the value of π(x) computed at xi:

yi = 1: contribution to likelihood function = π(xi)
yi = 0: contribution to likelihood function = 1 − π(xi)

A convenient way to express the contribution to the likelihood function for the pair (x1i, x2i, ..., xpi, yi) is through the expression,

π(xi)^{yi} [1 − π(xi)]^{1−yi}

Since the observations are assumed to be independent, the likelihood function is obtained as the product of the terms given by this expression,

l(β) = ∏_{i=1}^{n} π(xi)^{yi} [1 − π(xi)]^{1−yi}   (3.4)

where β′ = (β0, β1, ..., βp). The method of estimation used in the multivariable case will be the same as in the univariate situation: maximum likelihood.

There will be p + 1 likelihood equations that are obtained by differentiating the log-likelihood function with respect to the p + 1 coefficients.

ln l(β) = Σ_{i=1}^{n} { yi ln(π(xi)) + (1 − yi) ln(1 − π(xi)) }

        = Σ_{i=1}^{n} { yi ln[π(xi)/(1 − π(xi))] + ln(1 − π(xi)) }

By equation 3.2,

π(xi) = e^{g(xi)} / (1 + e^{g(xi)})   and   1 − π(xi) = 1 / (1 + e^{g(xi)})

so we can get,

π(xi) / (1 − π(xi)) = e^{g(xi)}

ln[π(xi)/(1 − π(xi))] = ln(e^{g(xi)}) = g(xi)

Therefore,

ln l(β) = Σ_{i=1}^{n} { yi g(xi) + ln[1/(1 + e^{g(xi)})] }   (3.5)

        = Σ_{i=1}^{n} { yi (β0 + β1x1i + β2x2i + ⋯ + βpxpi) + ln[1/(1 + e^{β0+β1x1i+β2x2i+⋯+βpxpi})] }   (3.6)

Differentiating with respect to β0,

∂/∂β0 [ln l(β)] = Σ_{i=1}^{n} { yi − e^{β0+β1x1i+⋯+βpxpi} / (1 + e^{β0+β1x1i+⋯+βpxpi}) }

                = Σ_{i=1}^{n} [yi − π(xi)]   (3.7)

Now differentiating equation 3.5 with respect to βj, j = 1, ..., p,

∂/∂βj [ln l(β)] = Σ_{i=1}^{n} { yi xij − xij e^{β0+β1x1i+⋯+βpxpi} / (1 + e^{β0+β1x1i+⋯+βpxpi}) }

                = Σ_{i=1}^{n} xij [yi − π(xi)]   (3.8)

Setting 3.7 and 3.8 equal to zero,

∂/∂β0 [ln l(β)] = 0   and   ∂/∂βj [ln l(β)] = 0,

we obtain the likelihood equations,

Σ_{i=1}^{n} [yi − π(xi)] = 0   (3.9)

Σ_{i=1}^{n} xij [yi − π(xi)] = 0   (3.10)

for j = 1, ..., p. As in the univariate model, the solution of the likelihood equations requires special software that is available in most, if not all, statistical packages. Let β̂ denote the solution to these equations. Thus, the fitted values for the multiple logistic regression model are π̂(xi), the value of the expression in equation 3.2 computed using β̂ and xi.
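The likelihood equations 3.9 and 3.10 can be solved with a few lines of code rather than a statistical package; a minimal Newton-Raphson sketch (the toy data are hypothetical, and numpy is assumed):

```python
import numpy as np

def fit_logistic(X, y, iterations=25):
    """Solve the likelihood equations X'(y - pi) = 0 (3.9-3.10) by
    Newton-Raphson, using the observed information matrix X'VX."""
    Xd = np.column_stack([np.ones(len(y)), X])   # prepend the constant term
    beta = np.zeros(Xd.shape[1])
    for _ in range(iterations):
        pi = 1.0 / (1.0 + np.exp(-Xd @ beta))
        XtV = Xd.T * (pi * (1 - pi))             # X'V without forming V
        beta = beta + np.linalg.solve(XtV @ Xd, Xd.T @ (y - pi))
    return Xd, beta

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
Xd, beta_hat = fit_logistic(x, y)
pi_hat = 1.0 / (1.0 + np.exp(-Xd @ beta_hat))
# At the solution both likelihood equations are (numerically) zero:
print(np.allclose(Xd.T @ (y - pi_hat), 0.0))   # True
```

The update step is exactly the iteratively reweighted least squares iteration that most packages use internally.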

In the previous chapter only a brief mention was made of the method for estimating the standard errors of the estimated coefficients. Now that the logistic regression model has been generalized both in concept and notation to the multivariable case, we consider estimation of standard errors in more detail.

The method of estimating the variances of the estimated coefficients follows from the well-developed theory of maximum likelihood estimation. This theory states that the estimators are obtained from the matrix of second partial derivatives of the log-likelihood function. These partial derivatives have the following form,

∂²L(β)/∂βj² = −Σ_{i=1}^{n} xij² πi (1 − πi)   (3.11)

∂²L(β)/∂βj ∂βl = −Σ_{i=1}^{n} xij xil πi (1 − πi)   (3.12)

for j, l = 0, 1, ..., p, where πi denotes π(xi). Let the (p + 1) × (p + 1) matrix containing the negatives of the terms given in equations 3.11 and 3.12 be denoted by I(β). This matrix is called the observed information matrix. The variances and covariances of the estimated coefficients are obtained from the inverse of this matrix, which we denote by Var(β) = I⁻¹(β). We will use the notation Var(βj) to denote the jth diagonal element of this matrix, which is the variance of β̂j, and Cov(βj, βl) to denote an arbitrary off-diagonal element, which is the covariance of β̂j and β̂l. The estimators of the variances and covariances, which will be denoted by V̂ar(β̂), are obtained by evaluating Var(β) at β̂. We will use V̂ar(β̂j) and Ĉov(β̂j, β̂l), j, l = 0, 1, ..., p, to denote the values in this matrix. The estimated standard errors of the estimated coefficients will be denoted as,

ŜE(β̂j) = [V̂ar(β̂j)]^{1/2}   (3.13)

for j = 0, 1, ..., p. We will use this notation in developing methods for coefficient testing and confidence interval estimation.

A formulation of the information matrix which will be useful when discussing model fitting and assessment of fit is Î(β̂) = X′VX, where X is an n by (p + 1) matrix containing the data for each subject, and V is an n by n diagonal matrix with general element π̂(xi)(1 − π̂(xi)). That is, the matrix X is,

X = ⎛ 1  x11  x12  ⋯  x1p ⎞
    ⎜ 1  x21  x22  ⋯  x2p ⎟
    ⎜ ⋮   ⋮    ⋮   ⋱   ⋮  ⎟
    ⎝ 1  xn1  xn2  ⋯  xnp ⎠   (3.14)

and the matrix V is,

V = ⎛ π̂1(1 − π̂1)      0       ⋯       0      ⎞
    ⎜      0      π̂2(1 − π̂2)  ⋯       0      ⎟
    ⎜      ⋮           ⋮       ⋱       ⋮      ⎟
    ⎝      0           0       ⋯  π̂n(1 − π̂n) ⎠   (3.15)
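With X and V in hand, the standard errors of 3.13 follow from a couple of matrix operations; a minimal sketch (numpy assumed; the design matrix and coefficient values are hypothetical):

```python
import numpy as np

def coefficient_standard_errors(Xd, beta_hat):
    """SE(beta_j) = square root of the j-th diagonal element of
    (X'VX)^{-1}, i.e. equations 3.13-3.15."""
    pi = 1.0 / (1.0 + np.exp(-Xd @ beta_hat))
    info = (Xd.T * (pi * (1 - pi))) @ Xd     # observed information X'VX
    cov = np.linalg.inv(info)                # Var(beta_hat) = I^{-1}
    return np.sqrt(np.diag(cov)), cov

Xd = np.column_stack([np.ones(6), np.arange(1.0, 7.0)])  # constant + one covariate
se, cov = coefficient_standard_errors(Xd, np.array([-1.0, 0.3]))
print(se.shape, np.allclose(cov, cov.T))   # (2,) True
```

The off-diagonal elements of `cov` are the Ĉov(β̂j, β̂l) terms needed later for confidence intervals on the logit.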

Before proceeding further we present an example that illustrates the formulation of a multiple logistic regression model and the estimation of its coefficients using a subset of the variables from the data for the low birth weight study. The code sheet for the full data set is given in Table 2.6. The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, n1 = 59 of whom had low birth weight babies and n0 = 130 of whom had normal birth weight babies. Four variables thought to be important were,

(i) Age

(ii) Weight of the mother at her last menstrual period

(iii) Race

(iv) Number of physician visits during the ﬁrst trimester of pregnancy.

In this example, the variable race has been recoded using the two design variables in Table 3.1. The results of fitting the logistic regression model to these data are shown in Table 3.2. In Table 3.2 the estimated coefficients for the two design variables for race are indicated by RACE2 and RACE3. The estimated logit is given by the following expression,

ĝ(x) = 1.295 − 0.024×AGE − 0.014×LWT + 1.004×RACE2 + 0.433×RACE3 − 0.049×FTV

The fitted values are obtained using the estimated logit, ĝ(x).
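The fitted values from Table 3.2 can be computed directly from the estimated logit; a minimal sketch (the covariate values chosen for the illustration are hypothetical):

```python
import math

# Estimated coefficients from Table 3.2.
coef = {"const": 1.295, "AGE": -0.024, "LWT": -0.014,
        "RACE2": 1.004, "RACE3": 0.433, "FTV": -0.049}

def fitted_probability(age, lwt, race2, race3, ftv):
    g = (coef["const"] + coef["AGE"] * age + coef["LWT"] * lwt +
         coef["RACE2"] * race2 + coef["RACE3"] * race3 + coef["FTV"] * ftv)
    return math.exp(g) / (1 + math.exp(g))

# A 25-year-old white woman (RACE2 = RACE3 = 0), LWT = 120, no visits:
print(round(fitted_probability(25, 120, 0, 0, 0), 3))   # 0.272
```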

There are three methods for testing the significance of the coefficients:

(1) likelihood ratio test
(2) Wald test
(3) score test

Variable Coeﬀ. Std.Err Z P > |z|

AGE -0.024 0.0337 -0.71 0.480

LWT -0.014 0.0065 -2.18 0.029

RACE2 1.004 0.4979 2.02 0.044

RACE3 0.433 0.3622 1.20 0.232

FTV -0.049 0.1672 -0.30 0.768

Constant 1.295 1.0714 1.21 0.227

Table 3.2: Estimated coefficients for a multiple logistic regression model using the variables AGE, weight at last menstrual period (LWT), RACE, and number of first trimester physician visits (FTV) for the low birth weight study.

Testing For The Significance Of The Model

Once we have fit a particular multiple (multivariable) logistic regression model, we begin the process of model assessment. As in the univariate case presented in Chapter 2, the first step in this process is to assess the overall significance of the variables in the model. The likelihood ratio test for overall significance of the p coefficients for the independent variables in the model is performed in exactly the same manner as in the univariate case. The test is based on the statistic G given in equation 2.32. The only difference is that the fitted values, π̂, under the model are based on the vector containing p + 1 parameters, β̂. Under the null hypothesis that the p "slope" coefficients for the covariates in the model are equal to zero, the distribution of G will be chi-square with p degrees of freedom. Consider the fitted model whose estimated coefficients are given in Table 3.2. For that model, the value of the log-likelihood was calculated using Minitab software.

Test statistic

Hypothesis:
H0: βj = 0, j = 1, ..., p
H1: βj ≠ 0 for at least one j

For the model in Table 3.2 the value of the log-likelihood, shown at the bottom of the table, is L = −111.286. The log-likelihood for the constant-only model may be obtained by evaluating the numerator of equation 2.37 or by fitting the constant-only model, with

n = 189
n0 = 130

Variable Coeﬀ Std.Err Z P > |Z|

LWT -0.015 0.0064 -2.31 0.018

RACE2 1.081 0.4881 2.22 0.027

RACE3 0.481 0.3567 1.35 0.178

Constant 0.806 0.8452 0.95 0.340

Table 3.3: Estimated coefficients for a multiple logistic regression model using the variables LWT and RACE from the low birth weight study.

n1 = 59

Thus the value of the likelihood ratio test statistic from equation 2.44 is G = 12.099. We compare this value to the chi-square distribution with p = 5 degrees of freedom.

Conclusion

We reject the null hypothesis in this case and conclude that at least one, and perhaps all, of the p coefficients are different from zero, an interpretation analogous to that in multiple linear regression.
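The value G = 12.099 can be reproduced from the quantities above: the log-likelihood of the constant-only model is the standard result n1·ln(n1/n) + n0·ln(n0/n), and G is twice the difference in log-likelihoods (the tiny discrepancy below comes from the rounding of L = −111.286):

```python
import math

n1, n0 = 59, 130                      # low / normal birth weight counts
n = n1 + n0
L0 = n1 * math.log(n1 / n) + n0 * math.log(n0 / n)   # constant-only model
L1 = -111.286                         # log-likelihood from Table 3.2
G = -2 * (L0 - L1)
print(round(G, 2))                    # 12.1, matching the reported 12.099
```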

If our goal is to obtain the best fitting model while minimizing the number of parameters, the next logical step is to fit a reduced model containing only those variables thought to be significant, and compare it to the full model containing all the variables. The results of fitting the reduced model are given in Table 3.3. The difference between the two models is the exclusion of the variables AGE and FTV from the full model. The likelihood ratio test comparing these two models is obtained using the definition of G given in equation 2.37. It will have a distribution that is chi-square with 2 degrees of freedom under the hypothesis that the coefficients for the variables excluded are equal to zero.

Hypothesis:
H0: the coefficients for AGE and FTV are both equal to zero
H1: at least one of these coefficients is not equal to zero

The value of the test statistic comparing the models in Tables 3.2 and 3.3 is G = 0.688,

which, with 2 degrees of freedom, has a P-value of P[χ²(2) > 0.688] = 0.709. Since the P-value is large, exceeding 0.05, we conclude that the reduced model is as good as the full model. Thus there is no advantage to including AGE and FTV in the model. However, we must not base our models entirely on tests of statistical significance.
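The quoted P-value is easy to verify: for 2 degrees of freedom the chi-square upper-tail probability has the closed form P[χ²(2) > x] = e^{−x/2}:

```python
import math

G = 0.688                        # likelihood ratio statistic, 2 df
p_value = math.exp(-G / 2)       # exact chi-square(2) upper tail
print(round(p_value, 3))         # 0.709
```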

Whenever a categorical independent variable is included in (or excluded from) a model, all of its design variables should be included (or excluded); to do otherwise implies that we have recoded the variable. For example, if we only include the design variable D1 as defined in Table 3.1, then RACE is entered into the model as a dichotomous variable coded as black or not black. If k is the number of levels of a categorical variable, then the contribution of this variable to the degrees of freedom of the test will be k − 1. For example, if we exclude race from the model, and race is coded at three levels using the design variables shown in Table 3.1, then there would be 2 degrees of freedom for the test, one for each design variable.

Because of the multiple degrees of freedom we must be careful in our use of the Wald (W) statistic to assess the significance of the coefficients. For example, if the W statistics for both coefficients exceed 2, then we could conclude that the design variables are significant. The multivariable analog of the Wald test is obtained from the following vector-matrix calculation:

Test statistic

W = β̂′ [V̂ar(β̂)]⁻¹ β̂ = β̂′ (X′VX) β̂,

where β̂′ = (β̂0, β̂1, ..., β̂p), the matrix X is as given in 3.14, and V is the n by n diagonal matrix with general element π̂i(1 − π̂i).

Hypothesis

H0: βj = 0, j = 1, 2, 3, ..., p
H1: βj ≠ 0 for at least one j

It will be distributed as chi-square with p + 1 degrees of freedom under the hypothesis that each of the p + 1 coefficients is equal to zero. Tests for just the p slope coefficients are obtained by eliminating β̂0 from β̂ and the relevant row (first or last) and column (first or last) from (X′VX)⁻¹. Since evaluation of this test requires the capability to perform vector-matrix operations and to obtain β̂, there is no gain over the likelihood ratio test of the significance of the model.
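The vector-matrix form of the Wald statistic is a one-liner once V̂ar(β̂) is available; a minimal sketch with hypothetical numbers (numpy assumed):

```python
import numpy as np

def wald_statistic(beta_hat, cov_hat):
    """W = beta' [Var(beta)]^{-1} beta, compared to a chi-square."""
    return float(beta_hat @ np.linalg.solve(cov_hat, beta_hat))

beta_hat = np.array([1.2, -0.7])               # hypothetical slope estimates
cov_hat = np.array([[0.25, 0.02],
                    [0.02, 0.16]])             # hypothetical Var(beta_hat)
W = wald_statistic(beta_hat, cov_hat)
print(round(W, 2), W > 5.99)   # 9.76 True (5.99 = 0.05 critical value, 2 df)
```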

We discussed confidence interval estimators for the coefficients, the logit, and the logistic probabilities for the simple logistic regression model. The methods used for confidence interval estimators for a multiple variable model are essentially the same. The confidence interval estimator for the logit is a bit more complicated for the multiple variable model than the result presented in 2.45. The basic idea is the same, only there are now more terms involved in the summation. It follows from 3.2 that a general expression for the estimator of the logit for a model containing p covariates is,

ĝ(x) = β̂0 + β̂1x1 + β̂2x2 + ⋯ + β̂pxp

An alternative way to express the estimator of the logit is through the use of vector notation as ĝ(x) = x′β̂, where the vector β̂′ = (β̂0, β̂1, ..., β̂p) denotes the estimators of the p + 1 coefficients and the vector x′ = (x0, x1, ..., xp) represents the covariates in the model, with x0 = 1. The estimator of the variance of the estimator of the logit is,

V̂ar[ĝ(x)] = Σ_{j=0}^{p} xj² V̂ar(β̂j) + Σ_{j=0}^{p} Σ_{k=j+1}^{p} 2 xj xk Ĉov(β̂j, β̂k)   (3.17)

We can express this result much more concisely by using the matrix expression for the estimator of the variance of the estimated coefficients. From the expression for the observed information matrix, we have that,

V̂ar(β̂) = (X′VX)⁻¹   (3.18)

It follows from 3.18 that an equivalent expression for the estimator in 3.17 is,

V̂ar[ĝ(x)] = x′ V̂ar(β̂) x = x′ (X′VX)⁻¹ x   (3.19)

Fortunately, all good logistic regression software packages provide the option for the user to create a new variable containing the estimated values of 3.19, or the standard error, for

          LWT        RACE2    RACE3    Constant
LWT       0.000041
RACE2    -0.000647   0.2382
RACE3     0.000036   0.0532   0.1272
Constant -0.005211   0.0226  -0.1035   0.7143

Table 3.4: Estimated covariance matrix of the estimated coefficients in Table 3.3

all subjects in the data set. This feature eliminates the computational burden associated with the matrix calculations in 3.19 and allows the user to routinely calculate fitted values and confidence interval estimates. However, it is useful to illustrate the details of the calculations.

Using the model in Table 3.3, the estimated logit for a 150 pound white woman is,

ĝ(LWT = 150, RACE = White) = 0.806 − 0.015 × LWT + 1.081 × RACE2 + 0.481 × RACE3
                           = 0.806 − 0.015 × (150) + 1.081 × 0 + 0.481 × 0
                           = −1.444

and the estimated logistic probability is,

π̂(LWT = 150, RACE = White) = e^{−1.444} / (1 + e^{−1.444}) = 0.191

Conclusion

The interpretation of the fitted value is that the estimated proportion of low birth weight babies among 150 pound white women is 0.191. In order to use 3.17 to estimate the variance of this estimated logit we need to obtain the estimated covariance matrix shown in Table 3.4. Thus the estimated variance of the logit is,

V̂ar[ĝ(LWT = 150, RACE = White)] = V̂ar(β̂0) + (150)² V̂ar(β̂1) + (0)² V̂ar(β̂2) + (0)² V̂ar(β̂3)
    + 2 × 150 × Ĉov(β̂0, β̂1) + 2 × 0 × Ĉov(β̂0, β̂2) + 2 × 0 × Ĉov(β̂0, β̂3)
    + 2 × 150 × 0 × Ĉov(β̂1, β̂2) + 2 × 150 × 0 × Ĉov(β̂1, β̂3) + 2 × 0 × 0 × Ĉov(β̂2, β̂3)
  = 0.0768

The standard error is,

ŜE[ĝ(LWT = 150, RACE = White)] = { V̂ar[ĝ(LWT = 150, RACE = White)] }^{1/2} = 0.2771

and the 95% confidence interval for the estimated logit is,

−1.444 ± 1.96 × (0.2771) = (−1.987, −0.901)
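These hand calculations can be replicated with the covariance matrix of Table 3.4 (numpy assumed; the lower-triangular table is symmetrized, and because its entries are rounded the result differs slightly from the 0.0768 and 0.2771 obtained in the text from unrounded values):

```python
import numpy as np

# Table 3.4, ordered (Constant, LWT, RACE2, RACE3), symmetrized.
cov = np.array([
    [ 0.7143,   -0.005211,  0.0226,   -0.1035  ],
    [-0.005211,  0.000041, -0.000647,  0.000036],
    [ 0.0226,   -0.000647,  0.2382,    0.0532  ],
    [-0.1035,    0.000036,  0.0532,    0.1272  ],
])
x = np.array([1.0, 150.0, 0.0, 0.0])    # constant, LWT = 150, white
g_hat = -1.444
var_g = float(x @ cov @ x)              # x' Var(beta) x, equation 3.19
se_g = var_g ** 0.5
ci = (g_hat - 1.96 * se_g, g_hat + 1.96 * se_g)
print(round(var_g, 4), round(se_g, 3))  # 0.0735 0.271
```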

Chapter 4

Interpretation Of The Logistic Regression Model

Introduction

In Chapters 2 and 3 we discussed the methods for fitting and testing for the significance of the logistic regression model. After fitting a model the emphasis shifts from the computation and assessment of significance of the estimated coefficients to the interpretation of their values. Strictly speaking, an assessment of the adequacy of the fitted model should precede any attempt at interpreting it. Thus, we begin this chapter assuming that a logistic regression model has been fit, that the variables in the model are significant in either a clinical or statistical sense, and that the model fits according to some statistical measure of fit.

The interpretation of any fitted model requires that we be able to draw practical inferences from the estimated coefficients in the model. The question being addressed is: What do the estimated coefficients in the model tell us about the research questions that motivated the study?

For most models this involves the estimated coefficients for the independent variables in the model. On occasion, the intercept coefficient is of interest; but this is the exception, not the rule. The estimated coefficients for the independent variables represent the slope (i.e., rate of change) of a function of the dependent variable per unit of change in the independent variable. Thus interpretation involves two issues:

(1) Determining the functional relationship between the dependent variable and the independent variable.

(2) Appropriately defining the unit of change for the independent variable.

The first step is to determine what function of the dependent variable yields a linear function of the independent variables. This is called the link function. In the case of a linear regression model, it is the identity function since the dependent variable, by definition, is linear in the parameters. (For those unfamiliar with the term "identity function", it is the function y = y.) In the logistic regression model the link function is the logit transformation,

g(x) = ln[π(x)/(1 − π(x))] = β0 + β1x

For a linear regression model recall that the slope coefficient, β1, is equal to the difference between the value of the dependent variable at x + 1 and the value of the dependent variable at x, for any value of x. For example, if y(x) = β0 + β1x, it follows that β1 = y(x + 1) − y(x). In this case, the interpretation of the coefficient is relatively straightforward as it expresses the resulting change in the measurement scale of the dependent variable for a unit change in the independent variable. For example, if in a regression of weight on height of male adolescents the slope is 5, then we would conclude that an increase of 1 inch in height is associated with an increase of 5 pounds in weight.

In the logistic regression model, the slope coefficient represents the change in the logit corresponding to a change of one unit in the independent variable (i.e., β1 = g(x + 1) − g(x)). Proper interpretation of the coefficient in a logistic regression model depends on being able to place meaning on the difference between two logits. Interpretation of this difference is discussed in detail on a case-by-case basis as it relates directly to the definition and meaning of a one-unit change in the independent variable.

We begin our consideration of the interpretation of logistic regression coefficients with the situation where the independent variable is nominal scale and dichotomous (i.e., measured at two levels). This case provides the conceptual foundation for all other situations. We assume that the independent variable, x, is coded as either zero or one. The difference in the logit for a subject with x = 1 and x = 0 is,

g(x) = β0 + β1x

g(1) − g(0) = (β0 + β1) − β0 = β1

The algebra shown in this equation is rather straightforward. We present it in this level of detail to emphasize that the first step in interpreting the effect of a covariate in a model is to

Outcome          Independent Variable (X)
Variable (Y)     x = 1                                 x = 0
y = 1            π(1) = e^{β0+β1}/(1 + e^{β0+β1})       π(0) = e^{β0}/(1 + e^{β0})
y = 0            1 − π(1) = 1/(1 + e^{β0+β1})           1 − π(0) = 1/(1 + e^{β0})
Total            1.0                                   1.0

Table 4.1: values of the logistic regression model when the independent variable is dichotomous

express the desired logit difference in terms of the model. In this case the logit difference is equal to β1. In order to interpret this result we need to introduce and discuss a measure of association termed the odds ratio.

The possible values of the logistic probabilities may be conveniently displayed in a 2 × 2 table, as shown in Table 4.1. The odds of the outcome being present among individuals with x = 1 is defined as π(1)/(1 − π(1)). Similarly, the odds of the outcome being present among individuals with x = 0 is defined as π(0)/(1 − π(0)). The odds ratio, denoted OR, is defined as the ratio of the odds for x = 1 to the odds for x = 0, and is given by the equation,

π(x) = e^{β0+β1x} / (1 + e^{β0+β1x})   (4.1)

OR = [π(1)/(1 − π(1))] / [π(0)/(1 − π(0))]   (4.2)

Substituting the expressions in Table 4.1,

OR = [ (e^{β0+β1}/(1 + e^{β0+β1})) / (1/(1 + e^{β0+β1})) ] / [ (e^{β0}/(1 + e^{β0})) / (1/(1 + e^{β0})) ]

   = e^{β0+β1} / e^{β0}

   = e^{β1}

Hence, for logistic regression with a dichotomous independent variable coded 1 and 0, the relationship between the odds ratio and the regression coefficient is,

OR = e^{β1}   (4.3)

This simple relationship between the coefficient and the odds ratio is the fundamental reason why logistic regression has proven to be such a powerful analytic research tool.

The odds ratio is a measure of association which has found wide use, especially in epidemiology, as it approximates how much more likely (or unlikely) it is for the outcome to be

Outcome AGED(X)

CHD(Y) ≥ 55(1) < 55(0) Total

Present(1) 21 22 43

Absent(0) 6 51 57

Total 27 73 100

Table 4.2: cross-classiﬁcation of AGE dichotomized at 55 years and CHD for 100 subjects

present among those with x = 1 than among those with x = 0. For example, if y denotes the presence or absence of lung cancer and x denotes whether the person is a smoker, then ÔR = 2 estimates that lung cancer is twice as likely to occur among smokers as among nonsmokers in the study population. As another example, suppose y denotes the presence or absence of heart disease and x denotes whether or not the person engages in regular strenuous physical exercise.

If the estimated odds ratio is ÔR = 0.5, then the occurrence of heart disease is one half as likely among those who exercise as among those who do not in the study population.

The interpretation given for the odds ratio is based on the fact that in many instances it approximates a quantity called the relative risk. This parameter is equal to the ratio π(1)/π(0). It follows from 4.2 that the odds ratio approximates the relative risk if

[1 − π(0)] / [1 − π(1)] ≈ 1.

This holds when π(x) is small for both x = 1 and x = 0.

An example may help to clarify what the odds ratio is and how it is computed from the results of a logistic regression program or from a 2 × 2 table. In many examples of logistic regression encountered in the literature we find that a continuous variable has been dichotomized at some biologically meaningful cutpoint.

Example: We consider the previous example with the data displayed in Table 1.1, and create a new variable, AGED, which takes on the value 1 if the age of the subject is greater than or equal to 55 and zero otherwise. The result of cross-classifying the dichotomized age variable with the outcome variable CHD is presented in Table 4.2. The likelihood function is the product of the terms,

l(β) = ∏_{i=1}^{n} π(xi)^{yi} [1 − π(xi)]^{1−yi}   (4.4)

The data in Table 4.2 tell us that there were 21 subjects with values (x = 1, y = 1), 22 with (x = 0, y = 1), 6 with (x = 1, y = 0), and 51 with (x = 0, y = 0). Hence, for these data, the likelihood function in 4.4 simplifies to,

l(β) = π(1)²¹ [1 − π(1)]⁶ π(0)²² [1 − π(0)]⁵¹

A logistic regression program may be used to obtain the estimates of β0 and β1. The estimate of the odds ratio is ÔR = e^{2.094} = 8.1. Readers who have had some previous experience with the odds ratio undoubtedly wonder why a logistic regression package was used to obtain the maximum likelihood estimate of the odds ratio, when it could have been obtained directly from the cross-product ratio from Table 4.2, namely,

ÔR = (21/6) / (22/51) = 8.1

and

β̂1 = ln[(21/6) / (22/51)] = 2.094

We emphasize here that logistic regression is, in fact, regression, even in this simplest case. The fact that the data may be formulated in terms of a contingency table provides the basis for the interpretation of estimated coefficients as the log of the odds ratio.
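The correspondence between the cross-product ratio and the fitted coefficient is easy to verify numerically:

```python
import math

# Cell counts from Table 4.2: (x=1,y=1), (x=1,y=0), (x=0,y=1), (x=0,y=0).
a, b, c, d = 21, 6, 22, 51
or_hat = (a / b) / (c / d)        # cross-product ratio
beta1_hat = math.log(or_hat)      # equals the logistic regression slope
print(round(or_hat, 1), round(beta1_hat, 3))   # 8.1 2.094
```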

Along with the point estimate of a parameter, it is a good idea to use a confidence interval estimate to provide additional information about the parameter value. The odds ratio, OR, is usually the parameter of interest in a logistic regression due to its ease of interpretation. However, its estimate, ÔR, tends to have a distribution that is skewed. The skewness of the sampling distribution of ÔR is due to the fact that its possible values range between 0 and ∞, with the null value equaling 1.

In theory, for large enough sample sizes, the distribution of ÔR is normal. Unfortunately, this sample size requirement typically exceeds that of most studies. Hence, inferences are usually based on the sampling distribution of ln(ÔR) = β̂1, which tends to follow a normal distribution for much smaller sample sizes. A 100(1 − α)% confidence interval (CI) estimate for the odds ratio is obtained by first calculating the end points of a confidence interval for the coefficient β1,

β̂1 ± z_{1−α/2} ŜE(β̂1),

and then exponentiating these values. As an example, consider the estimation of the odds ratio for the dichotomized variable AGED. The point estimate is ÔR = 8.1 and the end points of a 95% confidence interval are (2.9, 22.9).
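The interval can be reproduced from the 2 × 2 table: for such a table, ŜE(β̂1) may be estimated by the square root of the sum of the reciprocals of the four cell counts:

```python
import math

a, b, c, d = 21, 6, 22, 51                    # cell counts from Table 4.2
beta1 = math.log((a / b) / (c / d))           # 2.094
se = math.sqrt(1/a + 1/b + 1/c + 1/d)         # SE of the log odds ratio
lower = math.exp(beta1 - 1.96 * se)
upper = math.exp(beta1 + 1.96 * se)
print(round(lower, 1), round(upper, 1))       # 2.9 22.9
```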

Conclusion

This interval is typical of the confidence intervals seen for odds ratios when the point estimate exceeds 1. The confidence interval is skewed to the right. The confidence interval suggests that CHD among those 55 and older in the study population could be as little as 2.9 times or as much as 22.9 times more likely than among those under 55, at the 95 percent level of confidence.

We illustrate these computations in detail, as they demonstrate the general method for computing estimates of odds ratios in logistic regression. The estimate of the log of the odds ratio for any independent variable at two different levels, say x = a versus x = b, is the difference between the estimated logits computed at these two values,

ln[ÔR(a, b)] = ĝ(x = a) − ĝ(x = b) = (β̂0 + β̂1 × a) − (β̂0 + β̂1 × b) = β̂1(a − b)   (4.6)

The estimate of the odds ratio is obtained by exponentiating the logit difference,

ÔR(a, b) = exp[β̂1(a − b)]   (4.7)

This expression is equal to exp(β̂1) only when (a − b) = 1. In 4.6 & 4.7 the notation ÔR(a, b) is used to represent the odds ratio,

ÔR(a, b) = [π̂(x = a)/(1 − π̂(x = a))] / [π̂(x = b)/(1 − π̂(x = b))]   (4.8)

and when a = 1 and b = 0 we let ÔR = ÔR(1, 0). Some software packages offer a choice

of methods for coding design variables. The "zero-one" coding used so far in this section is frequently referred to as reference cell coding.

The reference cell method typically assigns the value of zero to the lower code for x and one to the higher code.

Example: If sex were coded as 1 = male and 2 = female, then the resulting design variable under this method, D, would be coded 0 = male and 1 = female, and the variable SEX would then be treated as if it were interval scaled.

Sex(code) Design variable

Male(1) 0

Female(2) 1

Table 4.3: Illustration of the coding of the design variable using the reference cell method.

Sex(code) Design variable

Male(1) -1

Female(2) 1

Table 4.4: Illustration of the coding of the design variable using the deviation from means

method.

Another coding method is frequently referred to as deviation from means coding. This method assigns the value of −1 to the lower code, and a value of 1 to the higher code. The coding for the variable SEX discussed above is shown in Table 4.4. Suppose we wish to estimate the odds ratio of female versus male when deviation from means coding is used. We do this by using the general method shown in 4.6 and 4.7: the log odds ratio is the difference between the estimated logits computed at the two levels of the design variable,

ln[ÔR(female, male)] = ĝ(D = 1) − ĝ(D = −1)
                     = (β̂0 + β̂1 × 1) − (β̂0 + β̂1 × (−1))
                     = 2β̂1
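A quick check with hypothetical estimates shows why exponentiating the raw coefficient is wrong under deviation from means coding:

```python
import math

b0, b1 = -0.40, 0.65                 # hypothetical fitted coefficients
g = lambda d: b0 + b1 * d            # estimated logit at design value d
log_or = g(1) - g(-1)                # female (D = 1) versus male (D = -1)
print(math.isclose(log_or, 2 * b1))  # True: OR = exp(2*b1), not exp(b1)
```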

Thus, if we had exponentiated the coefficient from the computer output we would have obtained the wrong estimate of the odds ratio. This points out quite clearly that we must pay close attention to the method used to code the design variables.

The method of coding also influences the calculation of the end points of the confidence interval. For the above example, using the deviation from means coding, the estimated standard error needed for confidence interval estimation is ŜE(2β̂1), which is 2 × ŜE(β̂1). Thus the end points of the confidence interval are

exp[2β̂1 ± z(1 − α/2) × 2ŜE(β̂1)]

In general, the end points of the confidence interval for an odds ratio obtained from a logit difference are found by exponentiating

ln(ÔR) ± z(1 − α/2) × ŜE[ln(ÔR)]


CHD status White Black Hispanic Other Total

Present 5 20 15 10 50

Absent 20 10 10 10 50

Total 25 30 25 20 100

Odds Ratio 1 8 6 4

95% CI — (2.3, 27.6) (1.7, 21.3) (1.1, 14.9)

ln(ÔR) 0.0 2.08 1.79 1.39

Table 4.5: cross-classiﬁcation of hypothetical data on RACE and CHD status for 100

subjects.

Since we can control how we code our dichotomous variables, we recommend that, in most situations, they be coded as 0 or 1 for analysis purposes. Each dichotomous variable is then treated as an interval scale variable.

Summary: For a dichotomous variable the parameter of interest is the odds ratio. An estimate of this parameter may be obtained from the estimated logistic regression coefficient, regardless of how the variable is coded. This relationship between the logistic regression coefficient and the odds ratio provides the foundation for the interpretation of all logistic regression results.
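As a minimal illustration (plain Python, with a hypothetical coefficient value), the sketch below shows that the two codings imply the same odds ratio: exp(β̂1) under zero-one coding and exp(2β̂1) under deviation from means coding.

```python
import math

# Hypothetical estimated coefficient under 0/1 ("reference cell") coding;
# under this coding the odds ratio is exp(beta1).
beta1_ref = 2.079               # e.g. ln(8)
or_ref = math.exp(beta1_ref)

# Under -1/+1 ("deviation from means") coding the same data yield a
# coefficient half as large, and the logit difference is 2*beta1.
beta1_dev = beta1_ref / 2
or_dev = math.exp(2 * beta1_dev)

print(round(or_ref, 2), round(or_dev, 2))
```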

Suppose that instead of two categories the independent variable has k > 2 distinct values. For example, we may have variables that denote the county of residence within a state, the clinic used for primary health care within a city, or race. Each of these variables has a fixed number of distinct values and the scale of measurement is nominal.

In this section we present methods for creating design variables for polychotomous independent variables. The choice of a particular method depends to some extent on the goals of the analysis and the stage of model development.

We begin by extending the method presented in Table 4.1 for dichotomous variables. For example, suppose that in a study of CHD the variable RACE is coded at four levels, and that the cross-classification of RACE by CHD yields the data in Table 4.5.

These data are hypothetical and have been formulated for ease of computation. The

extension to a situation where the variable has more than four levels is not conceptually

diﬀerent, so all examples in this section use k = 4.


RACE(code) Race(2) RACE(3) RACE(4)

white(1) 0 0 0

Black(2) 1 0 0

Hispanic(3) 0 1 0

Other(4) 0 0 1

Table 4.6: speciﬁcation of the design variables for RACE using reference cell coding with

white as the reference group.
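The reference cell scheme of Table 4.6 can be generated programmatically. The helper below is a hypothetical sketch (plain Python, not from the text) that builds the k − 1 design variables for a RACE code, with white (code 1) as the reference group.

```python
def reference_cell_design(race_code, reference=1, levels=(1, 2, 3, 4)):
    """Return the k-1 zero/one design variables; all zero for the reference."""
    return [1 if race_code == level else 0
            for level in levels if level != reference]

print(reference_cell_design(1))  # White    -> [0, 0, 0]
print(reference_cell_design(3))  # Hispanic -> [0, 1, 0]
```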

variable Coeﬃcients Std.Err z P > |z|

RACE(2) 2.079 0.6325 3.29 0.001

RACE(3) 1.792 0.6466 2.78 0.006

RACE(4) 1.386 0.6708 2.07 0.039

CONSTANT -1.386 0.5000 -2.77 0.006

Log likelihood=-62.2937

Table 4.7: Results of fitting the logistic regression model to the data in Table 4.5 using the design variables in Table 4.6.

At the bottom of Table 4.5 the odds ratio is given for each race, using white as the reference group. For example, for Hispanic the estimated odds ratio is

ÔR(Hispanic, White) = (15 × 20)/(10 × 5) = 6
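Each odds ratio in Table 4.5 is a cross-product ratio of the cell counts relative to the white reference group; a short Python check:

```python
import math

# Cell counts from Table 4.5 (CHD present / absent, by race).
present = {"White": 5, "Black": 20, "Hispanic": 15, "Other": 10}
absent = {"White": 20, "Black": 10, "Hispanic": 10, "Other": 10}

def odds_ratio(group, reference="White"):
    """Cross-product ratio of the 2x2 sub-table for group vs. the reference."""
    return (present[group] * absent[reference]) / (absent[group] * present[reference])

for g in ("Black", "Hispanic", "Other"):
    print(g, odds_ratio(g), round(math.log(odds_ratio(g)), 2))
```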

These same estimates of the odds ratio may be obtained from a logistic regression program with an appropriate choice of design variables. The method for specifying the design variables involves setting all of them equal to zero for the reference group, and then setting a single design variable equal to 1 for each of the other groups. This is illustrated in Table 4.6. Use of any logistic regression program with design variables coded as shown in Table 4.6 yields the estimated logistic regression coefficients given in Table 4.7.

Using reference cell coding, the estimated logit is

ĝ(x) = β̂0 + β̂1 RACE(2) + β̂2 RACE(3) + β̂3 RACE(4)

The estimated log odds ratio for Black versus White is the logit difference

ln[ÔR(Black, White)] = ĝ(Black) − ĝ(White)
= [β̂0 + β̂1 (1) + β̂2 (0) + β̂3 (0)] − [β̂0 + β̂1 (0) + β̂2 (0) + β̂3 (0)]
= β̂1

and similarly the logit difference for Hispanic versus White is β̂2, and for Other versus White it is β̂3. These coefficients equal the log odds ratios computed directly from the data in Table 4.5:

ÔR(Black, White) = (20 × 20)/(10 × 5) = 8, so β̂1 = ln(8) = 2.079

ÔR(Hispanic, White) = (15 × 20)/(10 × 5) = 6, so β̂2 = ln(6) = 1.792

RACE(code) Race(2) RACE(3) RACE(4)

white(1) -1 -1 -1

Black(2) 1 0 0

Hispanic(3) 0 1 0

Other(4) 0 0 1

Table 4.8: Specification of the design variables for RACE using deviation from means coding.

Finally, for Other versus White,

ÔR(Other, White) = (10 × 20)/(10 × 5) = 4, so β̂3 = ln(4) = 1.386

The estimated standard errors in Table 4.7 may also be computed directly from the cell counts. For example, for Black versus White,

v̂ar(β̂1) = 1/20 + 1/10 + 1/5 + 1/20 = 0.4

ŜE(β̂1) = [v̂ar(β̂1)]^(1/2) = 0.6325

We begin by computing the confidence limits for the log odds ratio (the logistic regression coefficient) and then exponentiate these limits to obtain limits for the odds ratio. In general, the limits for a 100(1 − α)% confidence interval for the coefficient are of the form

β̂j ± z(1 − α/2) × ŜE(β̂j)
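As a sketch, these limits can be computed for the RACE(2) coefficient in Table 4.7 (plain Python, z = 1.96 for a 95% interval); exponentiating recovers the interval for Black versus White shown in Table 4.5.

```python
import math

def or_confint(beta, se, z=1.96):
    """Confidence interval for the odds ratio exp(beta)."""
    return math.exp(beta - z * se), math.exp(beta + z * se)

# RACE(2) coefficient and standard error from Table 4.7.
lo, hi = or_confint(2.079, 0.6325)
print(round(lo, 1), round(hi, 1))   # approximately (2.3, 27.6)
```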

The second method of coding design variables is called deviation from means coding. This coding expresses an effect as the deviation of the "group mean" from the "overall mean". The estimated coefficients obtained using deviation from means coding may be used to estimate the odds ratio for one category relative to a reference category. The equation for the estimate is more complicated than the one obtained using reference cell coding. However, it provides an excellent example of the basic principle of using the logit difference to compute an odds ratio. To illustrate this we calculate the log odds ratio of Black versus White using the coding for the design variables given in Table 4.8. The logit difference is as follows.

ĝ(x) = β̂0 + β̂1 RACE(2) + β̂2 RACE(3) + β̂3 RACE(4)

ln[ÔR(Black, White)] = ĝ(Black) − ĝ(White)
= [β̂0 + β̂1 (1) + β̂2 (0) + β̂3 (0)] − [β̂0 + β̂1 (−1) + β̂2 (−1) + β̂3 (−1)]
= 2β̂1 + β̂2 + β̂3 (4.9)

To obtain a confidence interval we must estimate the variance of the sum of the coefficients in 4.9. In this example, the estimator is

v̂ar{ln[ÔR(Black, White)]} = 4 v̂ar(β̂1) + v̂ar(β̂2) + v̂ar(β̂3) + 4 ĉov(β̂1, β̂2) + 4 ĉov(β̂1, β̂3) + 2 ĉov(β̂2, β̂3)

The evaluation of 4.9 for the current example gives

ln[ÔR(Black, White)] = 2(0.765) + 0.477 + 0.072 = 2.079

The estimate of the variance is obtained by evaluating this expression which, for the current example, yields

v̂ar{ln[ÔR(Black, White)]} = 4(0.351)² + (0.362)² + (0.385)² + 4(−0.031) + 4(−0.040) + 2(−0.0444)
= 0.40
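This variance calculation can be checked numerically with the estimates quoted above (plain Python sketch):

```python
# Variance of the log odds ratio 2*b1 + b2 + b3 under deviation from means
# coding, using the standard errors and covariances quoted in the text.
var_b1, var_b2, var_b3 = 0.351**2, 0.362**2, 0.385**2
cov12, cov13, cov23 = -0.031, -0.040, -0.0444

var_log_or = (4 * var_b1 + var_b2 + var_b3
              + 4 * cov12 + 4 * cov13 + 2 * cov23)
print(round(var_log_or, 2))   # approximately 0.40
```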

and the standard error is

ŜE{ln[ÔR(Black, White)]} = 0.6325

We note that the values of the estimated log odds ratio, 2.079, and the estimated standard error, 0.6325, are identical to the values of the estimated coefficient and standard error for the first design variable in Table 4.7. This is expected, since the design variables used to obtain the estimated coefficients in Table 4.7 were formulated specifically to yield the log odds ratio relative to the white race category.


variable Coeﬃcients Std.Err z P > |z|

RACE(2) 0.765 0.3506 2.18 0.029

RACE(3) 0.477 0.3623 1.32 0.188

RACE(4) 0.072 0.3846 0.19 0.852

CONSTANT -0.072 0.2189 -0.33 0.742

Log likelihood=-62.2937

Table 4.9: Results of fitting the logistic regression model to the data in Table 4.5 using the design variables in Table 4.8.

When a logistic regression model contains a continuous independent variable, interpretation of the estimated coefficient depends on how the variable is entered into the model and on the particular units of the variable. For the purpose of developing the method to interpret the coefficient for a continuous variable, we assume that the logit is linear in the variable.

Under the assumption that the logit is linear in the continuous variable, x, the equation for the logit is g(x) = β0 + β1 x. It follows that the slope coefficient, β1, gives the change in the log odds for an increase of 1 unit in x, that is,

g(x) = β0 + β1 x

g(x + 1) = β0 + β1 (x + 1)

g(x + 1) − g(x) = β1

Most often the value of "1" is not clinically interesting. For example, a 1 year increase in age or a 1 mmHg increase in systolic blood pressure may be too small to be considered important, while a change of 10 years or 10 mmHg might be considered more useful.

On the other hand, if the range of x is from zero to 1, then a change of 1 is too large and a change of 0.01 may be more realistic. Hence, to provide a useful interpretation for continuous scale covariates we need to develop a method for point and interval estimation for an arbitrary change of c units in the covariate.

The log odds ratio for a change of c units in x is obtained from the logit difference.

g(x) = β0 + β1 x

g(x + c) = β0 + β1 (x + c)

g(x + c) − g(x) = c × β1

and the associated odds ratio is obtained by exponentiating the logit difference,

OR(c) = OR(x + c, x) = exp(cβ1 )


An estimate may be obtained by replacing β1 with its maximum likelihood estimate, β̂1. An estimate of the standard error needed for confidence interval estimation is obtained by multiplying the estimated standard error of β̂1 by c. Hence the end points of the 100(1 − α)% confidence interval estimate of OR(c) are

exp[c × β̂1 ± z(1 − α/2) × c × ŜE(β̂1)]
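A small sketch of the point and interval estimate for a change of c units (plain Python; the slope and standard error below are hypothetical values chosen for illustration):

```python
import math

def or_change(beta, se, c, z=1.96):
    """Odds ratio and CI for a change of c units in a continuous covariate."""
    return (math.exp(c * beta),
            math.exp(c * beta - z * c * se),
            math.exp(c * beta + z * c * se))

# Hypothetical AGE slope of 0.111 with SE 0.024, for a 10 year change.
point, lo, hi = or_change(0.111, 0.024, 10)
print(round(point, 2), round(lo, 2), round(hi, 2))
```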

Since both the point estimate and the end points of the confidence interval depend on the choice of c, the particular value of c should be clearly specified in all tables and calculations. The rather arbitrary nature of the choice of c may be troubling to some. To provide the reader of an analysis with a clear indication of how the risk of the outcome being present changes with the variable in question, changes in multiples of 5 or 10 may be most meaningful and easily understood.

As an example, consider the univariable model in Table 1.3. In that example a logistic regression of CHD status on AGE using the data of Table 2.1 was reported. The resulting estimated logit indicates that for every increase of 10 years in age the odds of CHD are estimated to increase 3.03 times.

The validity of such a statement is questionable in this example, since the additional risk of CHD for a 40 year old compared to a 30 year old may be quite different from the additional risk of CHD for a 60 year old compared to a 50 year old. This is an unavoidable dilemma when continuous covariates are modeled linearly in the logit. If it is believed that the logit is not linear in the covariate, then grouping and the use of dummy variables should be considered.

The end points of a 95% confidence interval for this odds ratio may be computed in the same way, and results similar to these may be placed in tables displaying the results of a fitted logistic regression model. In summary, the interpretation of the estimated coefficient for a continuous variable is similar to that of a nominal scale variable: an estimated log odds ratio. The primary difference is that a meaningful change must be defined for the continuous variable.


4.5 The Multivariable Model

In the previous sections we discussed the interpretation of an estimated logistic regression coefficient in the case where there is a single variable in the fitted model. Now we consider a multivariable analysis for a more comprehensive modeling of the data. One goal of such an analysis is to statistically adjust the estimated effect of each variable in the model for differences in the distributions of, and associations among, the other independent variables. Applying this concept to a multivariable logistic regression model, we may surmise that each estimated coefficient provides an estimate of the log odds ratio adjusting for all other variables included in the model.

Correct interpretation of the coefficients in a multivariable logistic regression model requires that we have a clear understanding of what is actually meant by the term adjusting, statistically, for other variables. We begin by examining adjustment in the context of a linear regression model, and then extend the concept to logistic regression.

The multivariable situation we examine is one in which the model contains two independent variables:

(1) one dichotomous variable

(2) one continuous variable

but primary interest is focused on the effect of the dichotomous variable. This situation is frequently encountered in epidemiologic research when an exposure to a risk factor is recorded as being either present or absent, and we wish to adjust for a variable such as age. The analogous situation in linear regression is called analysis of covariance.

For example, suppose we wish to compare the mean weight of two groups of boys. It is known that weight is associated with many characteristics, one of which is age. Assume that on all characteristics except age the two groups have nearly identical distributions. If the age distribution were also the same for the two groups, then a univariate analysis would suffice and we could compare the mean weights of the two groups directly. This comparison would provide us with a correct estimate of the difference in weight between the two groups. The statistical model that describes the situation in Figure 4.1 states that the value of weight, w, may be expressed as

w = β0 + β1 x + β2 a (4.11)

where x is a dichotomous variable identifying the group and a denotes age. In this model the parameter β1 represents the true difference in weight between the two groups and β2 represents the rate of change in weight per year of age. Suppose that the



Figure 4.1: Comparison of the weight of two groups of boys with diﬀerent distribution of

age.

Group1 Group2

Variable Mean Std.Dev Mean Std.Dev

PHY 0.36 0.485 0.80 0.404

AGE 39.60 5.272 47.34 5.259

Table 4.10: Descriptive statistics for two groups of 50 men on AGE and on whether they had seen a physician (PHY) (1 = Yes, 0 = No) within the last month.

mean age of group 1 is ā1 and the mean age of group 2 is ā2. This situation is described graphically in Figure 4.1. In this figure it is assumed that the relationship between age and weight is linear, with the same significant nonzero slope in each group.

Comparison of the mean weight of group 1 to the mean weight of group 2 amounts to a comparison of w1 to w2 (here x = 1 for group 2 and x = 0 for group 1). In terms of the model this difference is

w2 = β0 + β1 + β2 ā2
w1 = β0 + β2 ā1
(w2 − w1) = β1 + β2 (ā2 − ā1)

Thus the comparison involves not only the true difference between the groups, β1, but also a component, β2(ā2 − ā1), which reflects the difference between the mean ages of the groups.

The process of statistically adjusting for age involves comparing the two groups at some common value of age. The value usually used is the combined mean of the two groups which, for the example, is denoted by ā in Figure 4.1. In terms of the model this yields a comparison of w4 to w3,

w4 = β0 + β1 + β2 ā
w3 = β0 + β2 ā
w4 − w3 = β1 + β2 (ā − ā)
= β1
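The adjustment can be verified numerically; the sketch below uses made-up values of β0, β1, β2 and the group mean ages to show that comparing at the common mean age recovers β1 alone:

```python
# Hypothetical parameters: intercept, group effect, and age slope.
b0, b1, b2 = 10.0, 4.0, 1.5
a1, a2 = 39.6, 47.3            # group mean ages
abar = (a1 + a2) / 2           # common (overall) mean age

# Comparison at the group-specific mean ages: b1 + b2*(a2 - a1).
unadjusted = (b0 + b1 + b2 * a2) - (b0 + b2 * a1)
# Comparison at the common mean age: b1 alone.
adjusted = (b0 + b1 + b2 * abar) - (b0 + b2 * abar)

print(unadjusted, adjusted)
```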

In theory any common value of age would yield the same difference between the two lines. The choice of the overall mean makes sense for two reasons: it is biologically reasonable, and it lies within the range for which we believe that the association between age and weight is linear and constant within each group.

Consider the same situation shown in Figure 4.1, but instead of weight being the dependent variable, assume it is dichotomous and that the vertical axis denotes the logit. That is, under the model the logit is given by the equation

g(x, a) = β0 + β1 x + β2 a

A univariate comparison obtained from the 2 × 2 table cross-classifying outcome and group would yield a log odds ratio approximately equal to β1 + β2(ā2 − ā1). This would incorrectly estimate the effect of group due to the difference in the distributions of age.

The age adjusted log odds ratio is given by the logit difference at the common age, g(x = 1, ā) − g(x = 0, ā) = β1. Thus, the coefficient β1 is the log odds ratio that we would expect to obtain from a univariate comparison if the two groups had the same distribution of age.

For the data summarized in Table 4.10, the crude (unadjusted) log odds ratio for seeing a physician in group 2 versus group 1 is

ln(ÔR) = 1.962, so ÔR = 7.11

We can also see that there is a considerable difference in the age distributions of the two groups, the men in group 2 being on average more than 7 years older than those in group 1. We would guess that much of the apparent difference in the proportion of men seeing a physician might be due to age. Analyzing the data with a bivariate model using a coding of GROUP = 0 for group 1 and GROUP = 1 for group 2 yields the estimated logistic regression shown in Table 4.11. The age adjusted odds ratio is ÔR = exp(1.263) = 3.54. Thus, much of the apparent difference between the two groups is, in fact, due to differences in age.

Let us examine this adjustment in more detail using Figure 4.1. An approximation to the unadjusted odds ratio may be obtained by exponentiating the difference w2 − w1 computed from

variable Coeﬃcients Std.Err z P > |z|

GROUP 1.263 0.5361 2.36 0.018

AGE 0.107 0.0465 2.31 0.021

CONSTANT -4.866 1.9020 -2.56 0.011

Log likelihood=-54.8292

Table 4.11: Results of fitting the logistic regression model to the data summarized in Table 4.10.

the fitted logistic regression model shown in Table 4.11. This difference is

w2 − w1 = (β̂0 + β̂1 + β̂2 ā2) − (β̂0 + β̂2 ā1)
= β̂1 + β̂2 (ā2 − ā1)
= [−4.866 + 1.263 + 0.107(47.34)] − [−4.866 + 0.107(39.60)]
= 1.263 + 0.107(47.34 − 39.60)

and exp[1.263 + 0.107(47.34 − 39.60)] = 8.09. The discrepancy between 8.09 and the actual unadjusted odds ratio, 7.11, arises because the value 8.09 is based on the difference in the average logits, while the crude odds ratio is approximately equal to a calculation based on the average estimated logistic probabilities for the two groups.

The age adjusted odds ratio is obtained by exponentiating the difference w4 − w3, which is equal to the estimated coefficient for GROUP. In the example the difference is

w4 − w3 = (β̂0 + β̂1 + β̂2 ā) − (β̂0 + β̂2 ā)
= β̂1
= [−4.866 + 1.263 + 0.107(43.47)] − [−4.866 + 0.107(43.47)]
= 1.263
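Both quantities can be reproduced from the coefficients in Table 4.11 and the mean ages in Table 4.10 (plain Python sketch):

```python
import math

b_group, b_age = 1.263, 0.107        # coefficients from Table 4.11
a1, a2 = 39.60, 47.34                # group mean ages from Table 4.10

# Logit difference at the group-specific mean ages (not age adjusted):
crude_logit_diff = b_group + b_age * (a2 - a1)
# Logit difference at a common age (age adjusted):
adjusted_logit_diff = b_group

print(round(math.exp(crude_logit_diff), 2))      # approximately 8.09
print(round(math.exp(adjusted_logit_diff), 2))   # approximately 3.54
```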

Bachand and Hosmer (1999) compare two different sets of criteria for defining a covariate to be a confounder, based on examining the change in the magnitude of the coefficient for the risk factor between logistic regression models fit with and without the potential confounder. The method of adjustment when the variables are all dichotomous, polychotomous, continuous, or a mixture of these is identical to that just described for the dichotomous-continuous variable case. For example, suppose that instead of treating age as continuous it was dichotomized using a cutpoint of 45 years. To obtain the age-adjusted effect of group we fit the bivariate model containing the two dichotomous variables and calculate a logit difference at the two levels of group and a common value of the dichotomous variable for age. This procedure is similar for any number and mix of variables. Adjusted odds ratios are obtained by comparing individuals who differ only in the characteristic of interest, with the values of all other variables held constant.

In the last section we saw how the inclusion of additional variables in a model provides a way of statistically adjusting for potential differences in their distributions. The term confounder is used by epidemiologists to describe a covariate that is associated with both the outcome variable of interest and a primary independent variable or risk factor. When both associations are present, the relationship between the risk factor and the outcome variable is said to be confounded.

In this section we introduce the concept of interaction and show how we can control for its effect in the logistic regression model. In addition, we illustrate with an example how confounding and interaction may affect the estimated coefficients in the model.

If the association between the covariate (e.g., age) and the outcome variable is the same within each level of the risk factor (e.g., group), then there is no interaction between the covariate and the risk factor.

Graphically, the absence of interaction yields a model with two parallel lines, one for each level of the risk factor variable. In general, the absence of interaction is characterized by a model that contains no second or higher order terms involving two or more variables. When interaction is present, the association between the risk factor and the outcome variable differs, or depends in some way, on the level of the covariate. That is, the covariate modifies the effect of the risk factor. Epidemiologists use the term effect modifier to describe a variable that interacts with a risk factor. In the previous example, if the logit is linear in age for the men in group 1, then interaction implies that the logit for group 2 does not follow a line with the same slope. In theory, the association in group 2 could be described by almost any model except one with the same slope as the logit for group 1.

Figure 4.2 presents the graphs of three different logits. In this graph, 4 has been added to each of the logits to make plotting more convenient. The graphs of these logits are used to explain what is meant by interaction. Consider an example where the outcome variable is the presence or absence of CHD, the risk factor is sex, and the covariate is age. Suppose that the line labeled l1 corresponds to the logit for females as a function of age, and line l2 represents the logit for males.


Figure 4.2: Plot of the logits under three diﬀerent models showing the presence and

absence of interaction.

Model Constant SEX AGE SEX × AGE Deviance G

1 0.060 1.981 — — 419.816 —

2 −3.374 1.356 0.082 — 407.780 12.036

3 −4.216 4.239 0.103 −0.062 406.392 1.388

Table 4.12: Estimated logistic regression coefficients, deviance, and the likelihood ratio test statistic (G) for an example showing evidence of confounding but no interaction (n = 400).

These two lines are parallel to each other, indicating that the relationship between age and CHD is the same for males and females. In this situation there is no interaction, and the log odds ratio for sex (male versus female), controlling for age, is given by the difference between lines l2 and l1, l2 − l1. This difference is equal to the vertical distance between the two lines, which is the same for all ages.

Suppose instead that the logit for males is given by the line l3. This line is steeper than line l1 for females, indicating that the relationship between age and CHD among males is different from that among females. When this occurs we say there is an interaction between age and sex. The estimate of the log odds ratio for sex (male versus female) controlling for age is still given by the vertical distance between the lines, l3 − l1, but this difference now depends on the age at which the comparison is made. Thus, we cannot estimate the odds ratio for sex without first specifying the age at which the comparison is being made. In other words, age is an effect modifier.


Model Constant SEX AGE SEX × AGE Deviance G

1 0.201 2.386 — — 376.712 —

2 −6.672 1.277 0.166 — 338.688 38.024

3 −4.825 −7.838 0.121 0.205 330.654 8.034

Table 4.13: Estimated logistic regression coefficients, deviance, and the likelihood ratio test statistic (G) for an example showing evidence of both confounding and interaction (n = 400).

Tables 4.12 and 4.13 present the results of fitting a series of logistic regression models to two different sets of hypothetical data. The variables in both data sets are the same: SEX, AGE, and the outcome variable CHD. In addition to the estimated coefficients, the deviance for each model is given. Recall that the change in the deviance is used to assess the significance of coefficients for variables added to the model. An interaction term is added to the model by creating a variable equal to the product of the values of SEX and AGE. Some programs have syntax that automatically creates interaction variables in a statistical model, while others require the user to create them through a data modification step.

Examining the results in Table 4.12 we see that the estimated coefficient for the variable SEX changed from 1.981 in model 1 to 1.356, roughly a 32 percent decrease, when AGE was added in model 2. Hence, there is clear evidence of a confounding effect due to AGE. When the interaction term SEX × AGE is added in model 3 we see that the change in the deviance is only 1.388 which, when compared to the chi-square distribution with 1 degree of freedom, yields a p-value of 0.24, which is clearly not significant.
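The likelihood ratio statistic and its p-value can be reproduced from the deviances in Table 4.12 (plain Python sketch; the 1 degree of freedom chi-square upper tail is computed via math.erfc):

```python
import math

D_main, D_interaction = 407.780, 406.392    # deviances from Table 4.12
G = D_main - D_interaction                  # likelihood ratio statistic, 1 df
p = math.erfc(math.sqrt(G / 2))             # chi-square(1) upper tail probability

print(round(G, 3), round(p, 2))             # approximately 1.388 and 0.24
```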

Note that the coefficient for SEX changed from 1.356 in model 2 to 4.239 in model 3. This is not surprising, since the inclusion of an interaction term, especially one involving a continuous variable, usually produces fairly marked changes in the estimated coefficients of the dichotomous variables involved in the interaction. Thus, when an interaction term is present in the model we cannot assess confounding via the change in a coefficient. For these data we would prefer model 2, which suggests age is a confounder but not an effect modifier.

The results in Table 4.13 show evidence of both confounding and interaction due to age. Comparing model 1 and model 2 we see that the coefficient for SEX changes from 2.386 to 1.277, roughly a 46 percent decrease. When the AGE by SEX interaction is added to the model we see that the change in the deviance is 8.034 with a p-value of 0.005. Since the change in the deviance is significant, we prefer model 3 to model 2 and should regard age as both a confounder and an effect modifier. The net result is that any estimate of the odds ratio for sex should be made with reference to a specific age.


4.7 Estimation Of Odds Ratios In The Presence Of

Interaction

In the previous section we showed that when there is interaction between a risk factor and another variable, the estimate of the odds ratio for the risk factor depends on the value of the variable that is interacting with it. In this situation we may not be able to estimate the odds ratio by simply exponentiating an estimated coefficient. One approach that always yields the correct model-based estimate is to:

(1) Write down the expressions for the logit at the two levels of the risk factor being compared.

(2) Algebraically simplify the difference between the two logits and compute its value.

(3) Exponentiate the resulting logit difference to obtain the odds ratio.

As a first example, we develop the method for a model containing only two variables and their interaction. In this model, denote the risk factor by F, the covariate by X, and their interaction by F × X. The logit at F = f and X = x is

g(f, x) = β0 + β1 f + β2 x + β3 f × x (4.12)

Assume we want the odds ratio comparing two levels of F, F = f1 versus F = f0, at X = x. Following the three step procedure, first we evaluate the expressions for the two logits, yielding

g(f1 , x) = β0 + β1 f1 + β2 x + β3 f1 × x

g(f0 , x) = β0 + β1 f0 + β2 x + β3 f0 × x

Second, we compute and simplify their difference to obtain the log odds ratio, yielding

ln[ÔR(F = f1, F = f0, X = x)] = (β0 + β1 f1 + β2 x + β3 f1 × x) − (β0 + β1 f0 + β2 x + β3 f0 × x)
= β1 (f1 − f0) + β3 × x(f1 − f0) (4.13)

Third, we obtain the odds ratio by exponentiating the difference obtained at step 2, yielding

ÔR(F = f1, F = f0, X = x) = exp[β1 (f1 − f0) + β3 × x(f1 − f0)]

The expression for the log odds ratio in 4.13 does not simplify to a single coefficient. Instead it involves two coefficients, the difference in the values of the risk factor, and the value of the interaction variable. The estimator of the log odds ratio is obtained by replacing the parameters in 4.13 with their estimates. To obtain a confidence interval we calculate the end points for the log odds ratio and then exponentiate them. This requires an estimator of the variance of the estimator of the log odds ratio in 4.13. Using methods for calculating the variance of a sum, we obtain the following estimator.

" #

v̂ar ln[ÔR(F = f1 , F = f0 , X = x)] = (f1 − f0 )2 × v̂ar(β̂1 ) (4.14)

+[x(f1 − f0 )]2 var(β̂3 ) (4.15)

2

+2x(f1 − f0 ) × cov(β1 , β3 ) (4.16)

The end points of a 100(1 − α)% confidence interval estimator for the log odds ratio are

[β̂1 (f1 − f0) + β̂3 × x(f1 − f0)] ± z(1 − α/2) × ŜE{ln[ÔR(F = f1, F = f0, X = x)]} (4.17)

where the standard error in 4.17 is the positive square root of the variance estimator in 4.14. We obtain the end points of the confidence interval estimator for the odds ratio by exponentiating the end points in 4.17.

The estimators of the log odds ratio and its variance simplify when F is a dichotomous risk factor. If we let f1 = 1 and f0 = 0, then the estimator of the log odds ratio is β̂1 + β̂3 x, and the end points of the confidence interval are

(β̂1 + β̂3 x) ± z(1 − α/2) × ŜE{ln[ÔR(F = 1, F = 0, X = x)]} (4.20)
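A sketch of these computations as a reusable helper (plain Python; the coefficient and covariance values in the call are hypothetical, for illustration only):

```python
import math

def interaction_or(b1, b3, x, var1, var3, cov13, z=1.96):
    """Odds ratio for a 0/1 risk factor at covariate value x, with CI,
    from a model containing an F x X interaction term."""
    log_or = b1 + b3 * x
    se = math.sqrt(var1 + x ** 2 * var3 + 2 * x * cov13)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical values: b1 = 0.5, b3 = 0.1, evaluated at x = 5.
point, lo, hi = interaction_or(0.5, 0.1, 5, 0.25, 0.01, -0.02)
print(round(point, 2), round(lo, 2), round(hi, 2))
```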

Example: We consider a logistic regression model using the low birth weight data described in Section 1.6, containing the variable AGE and a dichotomous variable, LWD, based on the weight of the mother at the last menstrual period. This variable takes on the value 1 if LWT < 110 pounds and is zero otherwise. The results of fitting a series of logistic regression models are given in Table 4.14. Using the estimated coefficient for LWD in model 1, we estimate the odds ratio as exp(1.054) = 2.87. The results shown in Table 4.14 indicate that AGE is not a strong confounder (the change in the LWD coefficient is only about 4.2 percent), but it does interact with LWD (P = 0.076). Thus, to assess the risk of low weight at the last menstrual period correctly, we must include the interaction of this variable with the woman's age, because the odds ratio is not constant over age.

An effective way to see the presence of interaction is via a graph of the estimated logits under model 3 in Table 4.14. This is shown in Figure 4.3.


Model Constant LWD AGE LWD × AGE ln[L(β̂)] G P

0 −0.790 — — — −117.34 — —

1 −1.054 1.054 — — −113.12 8.44 0.004

2 −0.027 1.010 −0.044 — −112.14 1.96 0.160

3 0.774 −1.944 −0.080 0.132 −110.57 3.14 0.076

Table 4.14: Estimated logistic regression coefficients, log likelihoods, the likelihood ratio test statistic (G), and the p-value for the change, for models containing LWD and AGE from the low birth weight data (n = 189).

            Constant   LWD       AGE        LWD × AGE

Constant    0.828

LWD         −0.828     2.975

AGE         −0.00352   −0.0353   0.00157

LWD × AGE   −0.0352    −0.128    −0.00157   0.00573

Table 4.15: Estimated covariance matrix for the estimated parameters in model 3 of Table 4.14.

The upper line in Figure 4.3 corresponds to the estimated logit for women with LWD = 1 and the lower line is for women with LWD = 0. Separate plotting symbols have been used for the two LWD groups. The estimated log odds ratio for LWD = 1 versus LWD = 0 at AGE = x from 4.18 is equal to the vertical distance between the two lines at AGE = x. We see in Figure 4.3 that none of the women in the low weight group, LWD = 1, are older than about 33 years. Thus we should restrict our estimates of the effect of low weight to the range of 14 to 33 years. Based on these observations we estimate the effect of low weight at 15, 20, 25, and 30 years of age.

Using 4.18 and the results for model 3, the estimated log odds ratio for low weight at the last menstrual period for a woman of AGE = a is

ln[ÔR(LWD = 1, LWD = 0, AGE = a)] = −1.944 + 0.132 × a (4.21)

In order to obtain the estimated variance we must first obtain the estimated covariance matrix. Since the matrix is symmetric, most logistic regression software packages print the results in a form similar to that shown in Table 4.15.

The estimated variance of the log odds ratio given in 4.21 is

v̂ar{ln[ÔR(LWD = 1, LWD = 0, AGE = a)]} = 2.975 + a² × 0.00573 + 2a × (−0.128) (4.22)


AGE 15 20 25 30

ÔR 1.04 2.01 3.90 7.55

95% CI 0.29–3.79 0.91–4.44 1.71–8.88 1.95–29.19

Table 4.16: Estimated odds ratios and 95 percent confidence intervals for LWD, controlling for AGE.

Values of the estimated odds ratio and 95% confidence interval (CI) using (4.22) for several ages are given in Table 4.16. The results shown in Table 4.16 demonstrate that the effect of LWD on the odds of having a low birth weight baby increases exponentially with age. The results also show that the increase in risk is significant for low weight women 25 years and older. In particular, low weight women of age 30 are estimated to have a risk that is about 7.5 times that of women of the same age who are not low weight. The increase in risk could be as little as two times or as much as 29 times with 95% confidence.
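The arithmetic behind Table 4.16 can be sketched in a few lines, using the model 3 coefficients and the covariance entries quoted above; small differences from the tabulated intervals arise because the inputs are rounded.

```python
import math

# Model 3 coefficient estimates (LWD and LWD x AGE interaction)
b_lwd, b_inter = -1.944, 0.132
# Entries of the estimated covariance matrix (Table 4.15)
var_lwd, var_inter, cov = 2.975, 0.00573, -0.128

for age in (15, 20, 25, 30):
    log_or = b_lwd + b_inter * age                       # equation (4.21)
    var = var_lwd + age**2 * var_inter + 2 * age * cov   # equation (4.22)
    se = math.sqrt(var)
    lower = math.exp(log_or - 1.96 * se)
    upper = math.exp(log_or + 1.96 * se)
    print(f"AGE={age}: OR={math.exp(log_or):.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```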


Chapter 5

Methods For Logistic Regression

In the previous chapters we focused on estimating, testing, and interpreting the coefficients in a logistic regression model. The examples discussed were characterized by having few independent variables, and there was perceived to be only one possible model. While there may be situations where this is the case, it is more typical that there are many independent variables that could potentially be included in the model. Hence, we need to develop a strategy and associated methods for handling these more complex situations.

The goal of any method is to select those variables that result in a "best" model within the scientific context of the problem. In order to achieve this goal we must have:

(1) A basic plan for selecting the variables for the model

(2) A set of methods for assessing the adequacy of the model both in terms of its

individual variables and its overall ﬁt.

We suggest a general strategy that considers both of these areas. Successful modeling of a complex data set is part science, part statistical methods, and part experience and common sense. It is our goal to provide the reader with a paradigm that, when applied thoughtfully, yields the best possible model within the constraints of the available data.

The criteria for including a variable in a model may vary from one problem to the next and from one scientific discipline to another. The traditional approach to statistical model building involves seeking the most parsimonious model that still explains the data. The rationale for minimizing the number of variables in the model is that the resultant model is more likely to be numerically stable, and is more easily generalized. The more variables included in a model, the greater the estimated standard errors become, and the more dependent the model becomes on the observed data. Epidemiologic methodologists suggest


including all clinically and intuitively relevant variables in the model, regardless of their "statistical significance". The rationale for this approach is to provide as complete control of confounding as possible within the given dataset. This is based on the fact that it is possible for individual variables not to exhibit strong confounding but, when taken collectively, considerable confounding can be present in the data.

The major problem with this approach is that the model may be "overfit", producing numerically unstable estimates. Overfitting is typically characterized by unrealistically large estimated coefficients and/or estimated standard errors. This may be especially troublesome in problems where the number of variables in the model is large relative to the number of subjects and/or when the overall proportion responding (y = 1) is close to either 0 or 1.

There are several steps one can follow to aid in the selection of variables for a logistic regression model. The process of model building is quite similar to the one used in linear regression.

(1) The selection process should begin with a careful univariable analysis of each variable. For nominal, ordinal, and continuous variables with few integer values, we suggest this be done with a contingency table of outcome (y = 0, 1) versus the k levels of the independent variable. The likelihood ratio chi-square test with k−1 degrees-of-freedom is exactly equal to the value of the likelihood ratio test for the significance of the coefficients for the k−1 design variables in a univariable logistic regression model that contains that single independent variable. Since the Pearson chi-square test is asymptotically equivalent to the likelihood ratio chi-square test, it may also be used. In addition to the overall test, it is a good idea, for those variables exhibiting at least a moderate level of association, to estimate the individual odds ratios (along with confidence limits) using one of the levels as the reference group.
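The likelihood ratio chi-square statistic mentioned above is easy to compute directly from a contingency table. A minimal sketch follows (the function name lr_chisq is mine, not from the text); it agrees asymptotically with the Pearson chi-square.

```python
import math

def lr_chisq(table):
    """Likelihood ratio chi-square G = 2 * sum O * ln(O / E) for an r x c
    contingency table, with (r-1)(c-1) degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # independence model
            if observed > 0:
                g += observed * math.log(observed / expected)
    return 2 * g
```

For a binary outcome versus a k-level covariate, G computed this way equals the likelihood ratio statistic for the k−1 design variables in the corresponding univariable logistic regression.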

For continuous variables, the most desirable univariable analysis involves fitting a univariable logistic regression model to obtain the estimated coefficient, the estimated standard error, the likelihood ratio test for the significance of the coefficient, and the univariable Wald statistic. An alternative analysis, which is equivalent at the univariable level, may be based on the two-sample t-test. Descriptive statistics available from a two-sample t-test analysis generally include group means, standard deviations, the t-statistic, and its p-value. The similarity of this approach to the logistic regression analysis follows from the fact that the univariable linear discriminant function estimate of the logistic regression coefficient is

(x̄1 − x̄0)/s²p = (t/sp) √(1/n1 + 1/n0)    (5.1)

and that the linear discriminant function and the maximum likelihood estimate of the logistic regression coefficient are usually quite close when the independent variable is approximately normally distributed within each of the outcome groups, y = 0, 1.


Thus the univariable analysis based on the t-test should be useful in determining whether the variable should be included in the model, since the p-value should be of the same order of magnitude as that of the Wald statistic or likelihood ratio test from logistic regression.
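The identity in (5.1) can be checked numerically; the summary statistics below are hypothetical and chosen only to show that the two expressions agree.

```python
import math

# Hypothetical two-group summaries for a continuous covariate
x1_bar, x0_bar = 24.5, 22.0    # means in the y=1 and y=0 groups
s_p = 5.3                      # pooled standard deviation
n1, n0 = 59, 130               # group sizes

# Linear discriminant function estimate of the logistic slope
beta_hat = (x1_bar - x0_bar) / s_p**2

# The same estimate written through the two-sample t statistic, as in (5.1)
t = (x1_bar - x0_bar) / (s_p * math.sqrt(1 / n1 + 1 / n0))
beta_via_t = (t / s_p) * math.sqrt(1 / n1 + 1 / n0)

print(beta_hat, beta_via_t)  # the two expressions are algebraically identical
```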

For continuous covariates, we may wish to supplement the evaluation of the univariable logistic fit with some sort of smoothed scatterplot. This plot is helpful not only in ascertaining the potential importance of the variable and the possible presence and effect of extreme (large or small) observations, but also its appropriate scale. One simple and easily computed form of a smoothed scatterplot was illustrated in Figure 1.2.

Other more complicated methods that have greater precision are available. Kay and Little (1987) illustrate the use of a method proposed by Copas (1983). This method requires computing a smoothed value of the response variable for each subject that is a weighted average of the values of the outcome variable over all subjects. The weight for each subject is a continuous decreasing function of the distance of the value of the covariate for the subject under consideration from the value of the covariate for all other cases. For example, for covariate x for the ith subject we

compute

ȳi = [ Σ(j=1..n) w(xi, xj) yj ] / [ Σ(j=1..n) w(xi, xj) ]    (5.2)

where w(xi, xj) represents a particular weight function. For example, if we use STATA's scatterplot smooth command, ksm, with the weight option and bandwidth k, then

w(xi, xj) = [1 − (|xi − xj|/Δ)³]³    (5.3)

where Δ is defined so that the maximum value for the weight is 1, and the two indices defining the summation include the k percent of the n subjects with x values closest to xi. Other weight functions are possible, as well as additional smoothing using locally weighted least squares regression, called lowess in some packages.
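A minimal pure-Python version of the smooth defined by (5.2) and (5.3); the function name, its bandwidth default, and the nearest-neighbour rule are my own sketch rather than STATA's ksm implementation.

```python
import math

def tricube_smooth(xs, ys, k=0.8):
    """For each x_i, a weighted average of y over the k*n subjects with x
    values closest to x_i, using the tricube weights of equation (5.3)."""
    n = len(xs)
    m = max(1, int(k * n))
    smoothed = []
    for i in range(n):
        # indices of the m nearest subjects; they define the bandwidth Delta
        nearest = sorted(range(n), key=lambda j: abs(xs[j] - xs[i]))[:m]
        delta = max(abs(xs[j] - xs[i]) for j in nearest) or 1.0
        num = den = 0.0
        for j in nearest:
            w = (1 - (abs(xs[j] - xs[i]) / delta) ** 3) ** 3  # tricube weight
            num += w * ys[j]
            den += w
        smoothed.append(num / den)  # den >= 1: subject i itself has weight 1
    return smoothed
```

Plotting the smoothed values (or their logits) against x gives a visual check of the appropriate scale of the covariate.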

(2) Upon completion of the univariable analyses, we select variables for the multivariable analysis. Any variable whose univariable test has a p-value < 0.25 is a candidate for the multivariable model, along with all variables of known clinical importance. Once the variables have been identified, we begin with a model containing all of the selected variables.

Our recommendation that the 0.25 level be used as a screening criterion for variable selection is based on the work by Bendel and Afifi (1977) on linear regression and on the work by Mickey and Greenland (1989) on logistic regression. These works


show that use of a more traditional level (such as 0.05) often fails to identify variables known to be important. Use of the higher level has the disadvantage of including variables that are of questionable importance at the model building stage. For this reason, it is important to review all variables added to a model critically before a decision is reached regarding the final model.

Different practitioners bring different analytic philosophies as well as different statistical methods to this problem. One school of thought argues for the inclusion of all scientifically relevant variables in the multivariable model regardless of the results of the univariable analyses. In general, the appropriateness of the decision to begin the multivariable model with all possible variables depends on the overall sample size and the number in each outcome group relative to the total number of candidate variables. When the data are adequate to support such an analysis, it may be useful to begin the multivariable modeling from this point. However, when the data are inadequate, this approach can produce a numerically unstable multivariable model. In this case the Wald statistics should not be used to select variables, because of the unstable nature of the results. Instead, we should select a subset of variables based on the results of the univariable analyses and refine the definition of "scientifically relevant".

An alternative approach is stepwise selection, in which variables are selected either for inclusion in or exclusion from the model in a sequential fashion based solely on statistical criteria. There are two main versions of the stepwise procedure:

(a) Forward selection followed by a test for backward elimination.

(b) Backward elimination followed by a test for forward selection.

Well-defined algorithms implement these procedures in logistic regression. The stepwise approach is useful and intuitively appealing in that it builds models in a sequential fashion and allows for the examination of a collection of models which might not otherwise have been examined.

“Best subsets selection” is a selection method that has not been used extensively in

logistic regression.

Stepwise, best subsets, and other mechanical selection procedures have been criticized because they can yield a biologically implausible model [Greenland (1989)] and can select irrelevant, or noise, variables [Flack and Chang (1987), Griffiths and Pope (1987)]. The problem is not the fact that the computer can select such models, but rather that the analyst fails to scrutinize the resulting model carefully, and reports such a result as the final, best model. The wide availability and ease with which stepwise methods can be used has undoubtedly reduced some analysts to the role of assisting the computer in model selection rather than the more appropriate alternative. It is


only when the analyst understands the strengths, and especially the limitations, of the methods that these methods can serve as useful tools in the model-building process. The analyst, not the computer, is ultimately responsible for the review and evaluation of the model.

(3) Following the fit of the multivariable model, the importance of each variable included in the model should be verified. This should include (a) an examination of the Wald statistic for each variable and (b) a comparison of each estimated coefficient with the coefficient from the model containing only that variable. Variables that do not contribute to the model based on these criteria should be eliminated and a new model should be fit. The new model should be compared to the old, larger, model using the likelihood ratio test. Also, the estimated coefficients for the remaining variables should be compared to those from the larger model. In particular, we should be concerned about variables whose coefficients have changed markedly in magnitude. This indicates that one or more of the excluded variables was important in the sense of providing a needed adjustment of the effect of the variables that remained in the model. This process of deleting, refitting, and verifying continues until it appears that all of the important variables are included in the model and those excluded are clinically and/or statistically unimportant.

At this point, we suggest that any variable not selected for the original multivariable model be added back into the model. This step can be helpful in identifying variables that, by themselves, are not significantly related to the outcome but make an important contribution in the presence of other variables.

We refer to the model at the end of step (3) as the preliminary main effects model.

(4) Once we have obtained a model that we feel contains the essential variables, we should look more closely at the variables in the model. The question of the appropriate categories for discrete variables should have been addressed at the univariable stage. For continuous variables we should check the assumption of linearity in the logit.

Checking linearity helps determine the form in which a continuous variable should be in the model. Graphs of several different relationships between the logit and a continuous independent variable are shown in Figure 5.1, which illustrates the case when the logit is

(a) Linear

(b) Quadratic

(c) Some other nonlinear continuous relationship

(d) Binary


[Figure 5.1 plots the log-odds (logit), y, against the covariate, x, with curves labeled (a) Linear, (b) Quadratic, (c) Other Nonlinear, and (d) Binary.]

Figure 5.1: Different types of models for the relationship between the logit and a continuous variable.

In the binary case there is a cutpoint above and below which the logit is constant. In each of the situations described in Figure 5.1, fitting a linear model would yield a significant slope. Once the variable is identified as important, we can obtain the correct parametric relationship or scale in the model refinement stage. The exception to this would be the rare instance where the function is U-shaped. We refer to the model at the end of step (4) as the main effects model.

(5) Once we have refined the main effects model and ascertained that each of the continuous variables is scaled correctly, we check for interactions among the variables in the model. In any model, an interaction between two variables implies that the effect of one of the variables is not constant over levels of the other. For example, an interaction between sex and age implies that the slope coefficient for age is different for males and females. The final decision as to whether an interaction term should be included in a model should be based on statistical as well as practical considerations. Any interaction term in the model must make sense from a clinical perspective.

We address the clinical plausibility issue by creating a list of possible pairs of vari-

ables in the model that have some scientiﬁc basis to interact with each other. The

interaction variables are created as the arithmetic product of the pairs of main eﬀect

variables. We add the interaction variables, one at a time, to the model containing


all the main effects and assess their significance using a likelihood ratio test. We feel that interactions must contribute to the model at traditional levels of statistical significance. Inclusion of an interaction term in the model that is not significant typically increases the estimated standard errors without changing the point estimates. In general, for an interaction term to alter both point and interval estimates, the estimated coefficient for the interaction term must be statistically significant.
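Given the log-likelihoods of the models with and without one interaction term, the likelihood ratio comparison described above is a one-line computation; the sketch below uses the 1 degree-of-freedom chi-square tail, and the log-likelihood values in the usage are hypothetical.

```python
import math

def lr_test_1df(loglik_reduced, loglik_full):
    """Likelihood ratio test G = -2 * (L_reduced - L_full) for one added
    term; its p-value comes from the chi-square distribution with 1 df,
    which equals the two-sided normal tail erfc(sqrt(G / 2))."""
    g = max(-2.0 * (loglik_reduced - loglik_full), 0.0)  # G >= 0 for nested fits
    p = math.erfc(math.sqrt(g / 2.0))
    return g, p
```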

In step (1) we mentioned that one way to examine the scale of a covariate is to use a scatterplot smooth, plotting the results on the logit scale. Unfortunately, scatterplot smoothing methods are not easily extended to multivariable models and thus have limited applicability in the model refinement step. However, it is possible to extend the grouping type of smooth shown in Figure 2.2 to multivariable models.

This extension is based on the following observation. The difference, adjusted for the other model covariates, between the logits for two different groups is equal to the value of an estimated coefficient from a fitted logistic regression model that treats the grouping variable as categorical. We have found that the following implementation of the grouped smooth is usually adequate for purposes of visually checking the scale of a continuous covariate.

First, using the descriptive statistics capabilities of most any statistical package, obtain the quartiles of the distribution of the variable. Next, create a categorical variable with 4 levels using three cutpoints based on the quartiles. Other grouping strategies can be used, but one based on quartiles seems to work well in practice. Fit the multivariable model, replacing the continuous variable with the 4-level categorical variable. To do this, 3 design variables must be used with the lowest quartile serving as the reference group. Following the fit of the model, plot the estimated coefficients versus the midpoints of the groups. In addition, plot a coefficient equal to zero at the midpoint of the first quartile. To aid in the interpretation we connect the four plotted points. Visually inspect the plot and choose the most logical parametric shape(s) for the scale of the variable.
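For a single covariate, the grouped quartile smooth described above can be sketched using empirical group logits in place of fitted design-variable coefficients. This is a simplification of the multivariable procedure in the text, and the function name is my own.

```python
import math

def quartile_logit_plot_points(x, y):
    """Univariable sketch of the grouped quartile smooth: empirical
    log-odds of each quartile group, referenced to the lowest quartile.
    (In the multivariable case these points would be the three fitted
    design-variable coefficients plus a zero for the reference group.)"""
    xs = sorted(x)
    n = len(xs)
    cuts = [xs[n // 4], xs[n // 2], xs[3 * n // 4]]   # three quartile cutpoints
    counts = [[0, 0] for _ in range(4)]               # [events, total] per group
    for xi, yi in zip(x, y):
        g = sum(xi > c for c in cuts)                 # group index 0..3
        counts[g][0] += yi
        counts[g][1] += 1
    # empirical logit per group, with a 0.5 correction to avoid log(0)
    logits = [math.log((e + 0.5) / (t - e + 0.5)) for e, t in counts]
    return [l - logits[0] for l in logits]            # zero at the first quartile
```

Plotting these four points against the group midpoints and connecting them suggests a parametric shape for the covariate.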

The next step is to refit the model using the possible parametric forms suggested by the plot and choose one that is significantly different from the linear model and makes clinical sense. It is possible that two or more different parameterizations of the covariate will yield similar models, in the sense that each is significantly different from the linear model. However, it is our experience that one of the possible models will be more appealing clinically, thus yielding more easily interpreted parameter estimates.


Another, more analytic, approach is to use the method of fractional polynomials, developed by Royston and Altman (1994), to suggest transformations. We wish to determine what value of x^p yields the best model for the covariate. In theory, we could incorporate the power, p, as an additional parameter in the estimation procedure. However, this greatly increases the complexity of the estimation problem. Royston and Altman propose replacing full maximum likelihood estimation of the power by a search through a small but reasonable set of possible values. Hosmer and Lemeshow (1999) provide a brief introduction to the use of fractional polynomials when fitting a proportional hazards regression model. That material provides the basis for our discussion of its application to logistic regression.

The method of fractional polynomials may be used with a multivariable logistic regression model, but for the sake of simplicity, we describe the procedure using a model with a single continuous covariate.

g(x, β) = β0 + xβ1 (5.4)

where β denotes the vector of model coefficients. One way to generalize this function is to specify it as

g(x, β) = β0 + Σ(j=1..J) Fj(x) βj    (5.5)

The functions Fj(x) are a particular type of power function. The value of the first function is F1(x) = x^p1. In theory, the power, p1, could be any number, but in most applied settings it makes sense to try to use something simple. Royston and Altman (1994) propose restricting the power to be among those in the set Ω = {−2, −1, −0.5, 0, 0.5, 1, 2, 3}, where p1 = 0 denotes the log of the variable. The remaining functions are defined as

Fj(x) = x^pj              for pj ≠ pj−1

Fj(x) = Fj−1(x) ln(x)     for pj = pj−1

for j = 2, 3, ..., restricting powers to those in Ω. For example, if we choose J = 2 with p1 = 0 and p2 = −0.5, then, using (5.5), the logit is

g(x, β) = β0 + F1(x)β1 + F2(x)β2

with F1(x) = ln(x) and F2(x) = 1/√x, that is,

g(x, β) = β0 + β1 ln(x) + β2/√x


Variable   Coeff.   Std.Err   ÔR      95% CI          G       P

AGE        0.018    0.0153    1.20*   (0.89, 1.62)    1.40    0.237

BECK      -0.008    0.0103    0.96+   (0.87, 1.06)    0.64    0.425

NDRGTX    -0.075    0.0247    0.93    (0.88, 0.97)    11.84   0.001

IVHX2     -0.481    0.2657    0.62    (0.37, 1.04)    -       -

IVHX3     -0.775    0.2166    0.46    (0.30, 0.70)    13.35   0.001

RACE       0.459    0.2110    1.58    (1.04, 2.39)    4.62    0.032

TREAT      0.437    0.1931    1.55    (1.06, 2.26)    5.18    0.023

SITE       0.264    0.2034    1.30    (0.87, 1.94)    1.67    0.197

Table 5.1: Results of fitting the univariable logistic regression models to the UIS data (n = 575).

As another example, if we choose J = 2 with p1 = 2 and p2 = 2, then the logit is

g(x, β) = β0 + x²β1 + x² ln(x)β2

The model is quadratic in x when J = 2 with p1 = 1 and p2 = 2. Again, we could allow the covariate to enter the model with any number of functions, J, but in most applied settings an adequate transformation may be found using J = 1 or 2.
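The recursive definition of the functions Fj(x) can be written as a small helper; fp_terms is a hypothetical name for this sketch.

```python
import math

def fp_terms(x, powers):
    """Fractional polynomial functions F_1(x), ..., F_J(x) of a positive
    covariate x: power 0 means ln(x), and a power repeated from the
    previous term multiplies that term by ln(x)."""
    terms = []
    prev_p = object()                     # sentinel: no previous power yet
    for p in powers:
        if p == prev_p:
            f = terms[-1] * math.log(x)   # F_j(x) = F_{j-1}(x) * ln(x)
        elif p == 0:
            f = math.log(x)               # power 0 denotes the log
        else:
            f = x ** p
        terms.append(f)
        prev_p = p
    return terms
```

For example, powers [0, -0.5] give the ln(x) and 1/√x terms of the logit displayed earlier, and [2, 2] give x² and x² ln(x).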

Example:

As an example of the model-building process, consider the analysis of the UMARU IMPACT Study (UIS). The study is described in Section 2.6 and a code sheet for the data is shown in Table 2.8. Briefly, the goal of the analysis is to determine whether there is a difference between the two treatment programs after adjusting for potential confounding and interaction variables.

One outcome of considerable public health interest is whether or not a subject remained drug free for at least one year from randomization to treatment (DFREE in Table 2.8). A total of 147 of the 575 subjects (25.57%) considered in the analyses in this text remained drug free for at least one year. The analyses in this chapter are primarily designed to demonstrate specific aspects of logistic model building.

The results of fitting the univariable logistic regression models to these data are given in Table 5.1. In this table we present, for each variable listed in the first column, the following information:

(1) The estimated slope coefficient(s) for the univariable logistic regression model containing only this variable

(2) The estimated standard error of the estimated slope coefficient


Variable   Coeff.   Std.Err   z       P > |z|

AGE        0.050    0.0173    2.91    0.004

NDRGTX    -0.062    0.0256   -2.40    0.016

IVHX2     -0.603    0.2873   -2.10    0.036

IVHX3     -0.733    0.2523   -2.90    0.004

RACE       0.226    0.2233    1.01    0.311

TREAT      0.443    0.1993    2.22    0.026

SITE       0.149    0.2172    0.68    0.494

Constant  -2.408    0.5548   -4.34    <0.001

Log likelihood = -309.6241

Table 5.2: Results for a multivariable model containing the covariates significant at the 0.25 level in Table 5.1.

(3) The estimated odds ratio, which is obtained by exponentiating the estimated coefficient. For the variable AGE the odds ratio is for a 5-point increase. This was done since a change of 1 year or 1 point would not be clinically meaningful.

(4) The 95% CI for the odds ratio.

(5) The likelihood ratio test statistic, G, for the hypothesis that the slope coefficient is zero. Under the null hypothesis, this quantity follows the chi-square distribution with 1 degree of freedom, except for the variable IVHX, where it has 2 degrees of freedom.

(6) The significance level for the likelihood ratio test.

With the exception of the Beck score, there is evidence that each of the variables has some association (P < 0.25) with the outcome, remaining drug free for at least one year (DFREE). The covariate recording history of intravenous drug use (IVHX) is modeled via two design variables using "1 = Never" as the reference code. Thus its likelihood ratio test has two degrees-of-freedom. We begin the multivariable model with all variables but BECK. The results of fitting the multivariable model are given in Table 5.2. The results in Table 5.2, when compared to Table 5.1, indicate weaker associations for some covariates when controlling for the other variables. In particular, the significance level for the Wald test for the coefficient for SITE is p = 0.494 and for RACE is p = 0.311. Strict adherence to conventional levels of statistical significance would dictate that we consider a smaller model deleting these covariates. However, due to the fact that subjects were randomized to treatment within site, we keep SITE in the model. On consultation with our colleagues we were advised that race is an important control variable. Thus, on the basis of subject matter considerations, we keep RACE in the model.

The next step in the modeling process is to check the scale of the continuous covariates in the model, AGE and NDRGTX in this case. One approach to developing the order in which to check for scale is to rank the continuous variables by their respective significance levels. Results in Table 5.2 suggest that we consider AGE and then NDRGTX.


Chapter 6

Detailed data analysis is used to represent the results and basic characteristics of a distribution in a more accurate and attractive form, so that someone can get a rough idea about the corresponding distribution very easily by investigating the distribution report.

First, I studied the reasons for the requirement of installing a teller machine within the university premises for public usage, and here I wish to perform a data analysis on these collected data. To do this, I have chosen the entire academic and non-academic staff as the population of my study. Because of this, I decided to give a rough idea about the distribution structure of the lecturers, students, and non-academics of the Wellamadama premises.

Situation                        Size of the population

Academic staff                   185

Temporary & demonstrator staff   110

Non-academic staff               399

Security staff                   45

Internal students                5109

External students                704

Total                            6552

Now, let us analyse this entire distribution structure in detail according to the distinct faculties and disciplines. First, let us have a close look at the structure of the academic staff. Basically this can be divided into two parts:

(1.) Lecturers

(2.) Others


Figure 6.1: Size of the University population

All lecturers who are working at each faculty can be put into the first category, and all the remaining permanent and temporary academics fall into the second category. First, let us choose the set of lecturers of the faculties of Science, Arts, Management & Finance, and Fisheries & Marine Science. This set can be shown as in the following table according to their faculties.

Faculty                      Academic staff   Percentage

H & SS                       80               43.24

Management & Finance         28               15.14

Science                      67               36.22

Fisheries & Marine Science   10               5.41

Total                        185              100

When these data are investigated, we can clearly see that a considerable number of lecturers are working at the Arts and Science faculties. The reason for this is that most of the students of the university are studying in these two faculties. On the other hand, only a few lecturers are working at the Management & Finance and Fisheries & Marine Science faculties due to the smaller numbers of students.

Then, we can study the distribution of the temporary and demonstration staﬀ of the

university.

Faculty                      Demonstrator staff   Percentage

H & SS                       17                   15.45

Management & Finance         5                    4.55

Science                      75                   68.18

Fisheries & Marine Science   13                   11.82

Total                        110                  100


Figure 6.2: Academic staﬀ

When we study the above graph, it is clear that the distribution of the temporary and demonstration staff is completely different from the previous distribution of the academic staff. Here we can see that the other academic staff of the Science faculty is much larger than in the other faculties of the university. The reason for this is that the Science faculty needs a considerable number of temporary staff members to guide the students at their practical sessions.

Then let us analyse the distribution of the non-academic staﬀ of the university.

Faculty/Branch               Non-academic staff

Administration               262

H & SS                       39

Management & Finance         10

Science                      84

Fisheries & Marine Science   4

Total                        399

If we analyse the above graph, we can see that most of the non-academic members are working under the administration branch. They are distributed


Figure 6.3: Temporary and Demonstrator staff

within the Finance branch, Administration branch, Library, and the other faculties.

Students of the university are considered the major entity, and it is very important to study their distribution within the university. Basically we can divide them into two parts as follows.

(1.) Internal students

(2.) External students

Internal students are studying in the Arts, Science, Management and Fisheries & Marine Science faculties, and most of them are staying at the university hostels. Moreover, we can see that most of them are of the same age. But external students have their own residences and belong to different age groups, job categories and social statuses.

Up to this point, I have described the distribution and nature of the academic, non-academic and student entities of the Wellamadama premises. But my main intention was to study the reasons for the requirement of installing a teller machine within the university premises for public usage. To do this, I have chosen a sample of 600 entities out of the total collection of 7500. At this point, I was careful to choose them from each and every faculty. Here I considered the savings accounts of the customers and their connections with the government and private banks.

Owns a savings account   No. of savings accounts   Percentage

Yes                      484                       96.8

No                       16                        3.2

Total                    500                       100


Figure 6.4: Nonacademic staﬀ

Figure 6.5: Number of savings accounts for the sample taken out of the university premises.

Here the data were divided into the academic & non-academic sector and students for the convenience of the analyzing process.

Has a savings account     Yes   No   Total

Students                  297   15   312

Academic & Non-academic   187   1    188

Total                     484   16   500

According to the above graphs, it is clear that about 97% of the total maintain their accounts in both government and private banks. Some of them maintain their accounts in the People's Bank and the Bank of Ceylon, which are incorporated with the university. Now I try to analyse these data.


Figure 6.6: Analysis of savings accounts for the sample taken out of the university premises.

First I separated the savings account holders from my sample, and the obtained set was divided into university People's Bank and university Bank of Ceylon account holders. Furthermore, those sets were divided into academic, non-academic and students. Following these steps, I was able to give a much clearer aspect to my work.

University Bank   No. of accounts   Percentage

People's Bank     210               42

Bank of Ceylon    68                13.6

Both              140               28

No                82                16.4

Total             500               100

Now let us consider the savings accounts which are maintained by the students.

University Bank           People's Bank   Bank of Ceylon   Both   No

Students                  144             20               30     118

Academic & Non-academic   66              48               52     22

Total                     210             68               82     140

Among them, a considerable number of students maintain their accounts in the People's Bank, which is incorporated with the university, rather than in the Bank of Ceylon. The reason for this is that they have to retrieve their bursaries and Mahapola scholarship funds through the People's Bank.

Now consider the university staff members and the banks incorporated with the university. The staff are the major customers for both these banks. We can see this clearly by studying the following facts. To do this, I have chosen all the staff members who keep their savings accounts in these banks.


Figure 6.7: University staff & students maintaining their accounts with the university People's Bank & Bank of Ceylon, according to my sample.

Bank             No. of accounts   Capital (Rs.)

People's Bank    215               49,449,300.00

Bank of Ceylon   228               3,954,310.00

Total            443               53,430,610.00

According to the above graphs, we can say that the largest number of staff members maintain their savings accounts in the Bank of Ceylon. Now let us get some idea about the capital held by these banks due to these accounts. By comparing the above two graphs, we can say that the People's Bank is the one which contributes the biggest part of the circulation of money. So I decided to study this further, and divided the above set into two parts as academic and non-academic.

                 People's Bank   Bank of Ceylon

Academic         122             58

Non-academic     93              170

Total            215             228

According to the above graphs, we can say that most of the academic staff members maintain their accounts in the People's Bank, and most of the non-academic staff members maintain their accounts in the Bank of Ceylon incorporated with the university.

                 People's Bank (Rs.)    Bank of Ceylon (Rs.)
Academic            3,653,590.00            1,957,860.00
Non-academic        1,335,750.00            1,996,450.00
Total              49,499,300.00            3,954,310.00


Figure 6.8: Analysis of university staff & students maintaining their accounts with the university People's Bank & Bank of Ceylon, according to my sample.

Generally, the academic salary scale is a little higher than the non-academic salary scale. Due to these facts, it is natural to observe this kind of large capital in the People's Bank relative to the Bank of Ceylon.

Now I try to study the nature of the savings accounts maintained by the sample members in banks other than those incorporated with the university.

Bank              No. of Accounts    Percentage
People's Bank           95               18.6
Bank of Ceylon          72               14.4
Commercial              52               10.4
Seylan                  14                2.8
Sampath                 55               11.0
Other                  212               42.8
Total                  500              100.0

By dividing this sample further (into academic, non-academic and students), the distribution becomes more informative.

                          People's Bank   Bank of Ceylon   Commercial   Seylan   Sampath   Other
Student                        73               57              36          7        25       115
Academic & Non-academic        22               15              16          7        30        97
Total                          95               72              52         14        55       212

Figure 6.9: Number of accounts held in university banks.

According to the above information, we can say that most of them maintain savings accounts in the government banks as well as in the private banks. We can see this situation very sharply among the students. High competition and the banks' efforts to introduce new accounts for the young generation are the major reasons for this.

The private sector has well-developed computer networks, especially the Commercial and Sampath banks. So it is possible to receive any amount of money at any time, in any part of the island. Since most of the university students are staying away from their homes, they have become used to maintaining their accounts in the private banks.

At present, even though the Seylan Bank is very popular among the business community, it is not so popular among ordinary people. We can see this fact from the above graphs.

So far I have studied the customers who maintain their accounts in the government and private banks. But about 4% of the sample members do not maintain any kind of savings account. This amounts to 16 students and one non-academic staff member out of the whole sample. In order to investigate this situation, I first studied the effect of their place of residence.

Place of living                       No. obtained
Inside the Matara urban area               10
Outside the Matara urban area               4
Hostel inside the university                2
Hostel outside the university              16

Figure 6.10: How payments are divided among the university banks.

According to the above graphs, we can say that most of them (10) live around the Matara town and may take pocket money from home to fulfill their daily needs. Because of these facts, sometimes they did not want to maintain a savings account.

Moreover, in the case of our non-academic member, he was unaware of these savings accounts. So it is clear that if it is possible to make such people aware of these accounts, they will definitely open their own savings accounts.

Next I would like to further analyse the data according to the teller card facilities obtained by the sample members. But before that, I have to recall how the sample was divided.

First, I randomly selected a sample of 500 entities out of the total collection of 7000. Then the sample was divided into two parts: members who maintain savings accounts and members who do not. Their preference for maintaining teller cards differs from one person to another, so I decided to investigate the reasons for this.
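The sampling step described above can be sketched as follows; the numeric IDs and the seed are illustrative stand-ins, not the actual population register:

```python
import random

# Drawing a simple random sample of 500 members from a population of
# 7000, as described above. IDs and seed are illustrative placeholders.
random.seed(2002)                         # arbitrary seed, for reproducibility
population = list(range(1, 7001))         # stand-in for the 7000-member register
sample = random.sample(population, 500)   # sampling without replacement
print(len(sample), len(set(sample)))      # 500 distinct members
```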

The following table shows the number of sample members who maintain and who do not maintain teller card facilities.

Obtained teller card    No. of People    Percentage
Yes                          380             78.51
No                           104             21.49
Total                        484            100.00

In order to study their requirements in more detail and in a more convenient way, the sample was divided into academic staff, non-academic staff and students.


Figure 6.11: Number of accounts held in university banks.

                          Obtained   Not obtained   Total
Student                      252          45         297
Academic & Non-academic      128          59         187
Total                        380         104         484

According to the above graphs, most of the sample members have obtained the teller facility. Relative to the academic and non-academic staff members, a considerable number of students have obtained this facility, since most of them are staying away from home. They have to manage their bursaries and pocket money throughout the month, and they have become used to depositing money in the banks for security, so it is reasonable for them to use teller cards.

But when we consider the case of the non-academic staff, we can see some kind of collapse in their teller card usage, and there may be a variety of reasons behind this. One possibility is that they may think that, since it is possible to withdraw money at any time via the teller machine, the facility could turn into a waste. On the other hand, most of them are afraid of new technology.

Then I decided to analyse this case according to the places where these teller machines have been established.

Bank                        Obtained Teller Card
University People's Bank            261
Other bank                           78
Both                                 41
No card                             104
Total                               484


Figure 6.12: How payments are divided among the university banks.

Figure 6.13: University staff and students maintaining their accounts with the government & private banks, according to my sample.

In order to study this case further, I divided the above sample into students and academic & non-academic staff.

                          University People's Bank   Other Bank   Both   No card   Total
Student                             163                  53         20      59      295
Academic & Non-academic              98                  25         21      45      189
Total                               261                  78         41     104      484

Figure 6.14: University staff and students maintaining their accounts with the government & private banks, according to my sample.

According to the above graphs, it is clear that the academic as well as the non-academic staff members tend to use the teller facilities of the People's Bank incorporated with the university. On the other hand, most of the students maintain a private bank teller card due to the facilities those banks provide, and they use their People's Bank savings accounts only to receive their bursaries and Mahapola scholarships. But most of them wish to close these accounts when they graduate from the university, which is not good from the point of view of the government banks.

From the above analysis, I found that even though most of the university population use the teller facilities provided by the People's Bank incorporated with the university, this is not sufficient. This situation may occur due to the following reasons.

(1.) Since the teller machine is installed outside the university premises, they have to go outside whenever they want to withdraw money. But these people are very busy, and it is not convenient at all.

(3.) The journey between withdrawing money and returning to the university is not so secure.


Figure 6.16: Sample members who have obtained the teller facility.

(4.) This teller machine is installed a little far away from the university premises, and one can see frequent clashes between the university students and the young boys living in the village. So most of the university people are a little scared to withdraw money in the evening. As one example, a few months ago someone attempted to commit a robbery here. Because of these things, I decided to investigate the requirements for the installation of a teller machine within the university premises.

The following are the ideas revealed by my sample members about their needs.

Preferred place of installation    No. interested
Inside the university                   472
Outside the university                   28
Total                                   500

With a view to analysing this further, I divided my sample into academic staff, non-academic staff and students.

                          Inside the university   Outside the university   Total
Student                            297                       15             312
Academic & Non-academic            175                       13             188
Total                              472                       28             500

According to the above graphs, it is clear that most of them need the installation of a teller machine within the university premises, and there are many reasons behind this.

(1.) It is very convenient to withdraw money.


Figure 6.17: Whether the sample members have obtained the teller facility.

Figure 6.18: Where the sample members have obtained the teller facility.

Because of these things, girls who are staying within the university premises can also withdraw money without any hesitation. While the majority of the sample want the installation of a teller machine within the university premises, there are a few who do not wish for such a facility. This minority thinks that, with the availability of this facility, their expenses will increase without any control. From my point of view, holding these kinds of ideas harms the majority a great deal.

In my study, I have looked specially at the Bank of Ceylon branch situated within the university premises. Now I try to analyse the sample members who would be ready to use the facilities offered by a Bank of Ceylon teller machine.


Figure 6.19: Where the sample members have obtained the teller facility.


Figure 6.21: Whether the new teller machine should be within or outside the university area.

Figure 6.22: The number of customers who are interested in opening a new account.

Response                        No. of People
Interested                           196
Already have an account              140
Not interested                       164
Total                                500

If we analyse the above graphs, we can clearly see that there are about 140 customers who have been dealing with the Bank of Ceylon for a long time and who wish to get the teller facilities. Moreover, there are about 160 new customers who are going to open savings accounts. This is very beneficial for the bank.


Figure 6.23: Are you interested in opening a new account?

                        Interested   Already have   Not interested   Total
Student                     141           56             115          312
Academic staff               11           11               7           29
Temporary staff              14            4              30           48
Non-academic staff           31           69              11          111
Total                       197          140             163          500

The above graph indicates the members who do not like to deal with the bank. Most of them are final-year students and temporary academic staff. They are going to leave the university very soon, so they do not want to maintain accounts any longer.

Are you interested in opening a new account?

                        Will use   Will not use
Student                    197          115
Academic staff              22            7
Temporary staff             17           31
Non-academic staff         100           11
Total                      336          164       (500)


The installation of a teller machine within the university premises will provide a lot of indirect benefits for the banking sector. Already the majority of the academic and non-academic members are ready to obtain credit card facilities from the bank with the availability of the teller machine mentioned above. Since most of their salary scales are good enough, it is possible to issue credit cards with higher limits.

So it is very worthwhile to study the customers who are willing to deal with the banks. As an example, let us observe the column chart for the fisheries biology section. It is clear that most of them wish to get the teller facilities within the university premises, because most of them are a little scared to go to the Matara town.

So finally, considering every aspect I have analysed, we can conclude that it is very suitable to install a teller machine within the university premises.


Chapter 7

Discussion

Man has used diverse strategies for providing his basic needs, such as food, clothing and housing, since the dawn of civilization. Although people practised the exchange of goods and services in the past, the situation changed with the introduction of the exchange unit called "money". This unit subsequently became the medium of exchange for both goods and services.

The industrial revolution resulted in diverse services and, as a result, the demand for money also increased. This high demand for money led to the rapid spread of banks and other financial services all around the world. With the help of modern technology, current electronic banking systems are capable of providing money for anyone, anywhere in the world.

In Sri Lanka, some recently established private banks have made vast changes in the banking and financial sector of the country. The competition among these banks for advancement over one another has resulted in many benefits for the customers.

The University of Ruhuna, Sri Lanka, is situated on peaceful and elegant premises close to Matara town, in the southern region. A large number of residential and non-residential people from all over the country access the university premises daily for both academic and non-academic purposes.

For example, the daily visitors represent distant districts like Jaffna and Kilinochchi, while some others come from the Badulla, Anuradhapura and Polonnaruwa districts. Most of these regular visitors (particularly students) reside in university hostels, while some others stay in private lodgings around the university.

To accomplish the financial and banking needs of this diverse community of the University of Ruhuna, both government and private banks are functioning in this context. Being a student of the University of Ruhuna, which also means being a member of its community, I am really interested in carrying out research on the banking needs of this community, which will produce guidance for a more sophisticated banking system within the university premises.

As an approach to the current study, it was worth studying the set-up for serving the early banking needs of the University of Ruhuna, which was established 28 years ago in Wellamadama, Matara.

Initially, the university kept its trust in government banks, opening an internal branch of the Bank of Ceylon within the university premises. Both academics and non-academics then started dealing with this internal Bank of Ceylon branch for their financial purposes.

By the 1980s, another Sri Lankan bank, the People's Bank, achieved a significant involvement in the Sri Lankan economy by launching branches throughout the country, and it established an external People's Bank branch adjacent to the university premises.

As a result of the more interesting banking services and more advantageous account systems of this newly established bank, people of the university community who were previously dealing with the internal Bank of Ceylon branch started opening at least one savings account with this external People's Bank. This new trend has diverted most academics and non-academics from the internal Bank of Ceylon branch to the external People's Bank branch.

This situation is clearly illustrated in the following charts and figures. Figure 7.1 seems to show that the Bank of Ceylon still holds much of the university's accounts, although Figure 7.2 shows the reality, where the profit sharing between the two banks clearly illustrates the later failure of the Bank of Ceylon within the university banking context.


Figure 7.2: How payments are divided among the university banks.

In particular, the more efficient banking services of the People's Bank have moved many of the academics' accounts from the internal Bank of Ceylon branch to the external People's Bank. The Bank of Ceylon is experiencing a great loss of income because academic accounts exchange large sums of money, since the academics belong to the highest salary scale within the university community. Therefore, the Bank of Ceylon can increase its profit if it recovers the academics' trust in the bank.

As the second part, I studied the financial exchanges of the university students, who are the next prominent component of the university community.

The interviews with students revealed that they have kept their trust mostly in private banks, for example the Commercial and Sampath banks, which are rich in the latest technology. The reason for this particular attraction is that these two banks have offered facilities for quick money transfer to anywhere in the country, which is an essential need for students who depend on money from their parents, credited to the students' accounts from distant areas. Many students use the People's Bank branch only for obtaining their Mahapola scholarship installments and bursaries, because it is the only bank with authority over Mahapola and bursary transactions.

To my mind, the two government banks are highly unlucky that they were not capable of holding the trust of the students, who make up the life-blood of a living country.

In my opinion, studying the reasons why the People's Bank overcame the Bank of Ceylon within the university premises may be a guide for future advancements in university banking and financial services. In this context, the behaviour of the university community comes into play: most of the academics, non-academics and students are fairly busy with their daily work, so they are unable to spend their working hours on banking. The overlap of their working hours with the bank opening hours caused this problem, and they were in need of a 24-hour banking service, at least for the withdrawal of money from their accounts.

The People's Bank focused on this issue and established a teller machine at the external People's Bank branch several years ago, facilitating easy withdrawal of money both day and night. Although this was a real help for students, some other issues associated with this teller machine are also worth considering.

Some conflicts between students and the village community are frequent, and during such conflict periods it is unsafe for the students to move around outside the university premises and hostels. Such conflicts therefore limit the students' access to the outside teller machine. This situation particularly applies to evenings, when the students mostly use the teller machine and when they can also be attacked by the village boys.

Also, robberies of withdrawn money are not rare on the way between the university premises and the outside teller machine. Since the teller machine is established so close to the Matara-Kataragama road, many outsiders also compete with students for the withdrawal of money. The majority of the Matara area is Sinhalese, and therefore the students who have come from the northern and eastern areas (where some conflicts are going on) are hesitant to leave the university premises for the external teller machine in the evenings.


Figure 7.5: University staff and students maintaining their accounts with the government & private banks, according to my sample.


The above-mentioned reasons imply the need to install an internal teller machine within the university premises. As the first step of my study, the university community, comprising around 5000 people, was selected as the study population. The population was then divided into different sectors (as described under the data analysis) and a sample of 500 was selected for the study. Except for 16 of the sample, all the others were currently dealing with one of the banks. The study shows that the best place to establish an internal teller machine is the internal Bank of Ceylon branch. By establishing this internal teller machine, the Bank of Ceylon branch may re-attract around 7000 of the university community. Except for 164 of my 500-member sample, all the others agreed to start banking with the internal Bank of Ceylon branch. The majority of the people who disagreed were final-year students, who are to complete their university education in the near future and then leave the university.

Also, the academics who disagreed were mostly temporary tutors whose appointments will be terminated at the end of the year. Therefore, I suggest establishing a teller machine at the internal Bank of Ceylon branch, approving the ideas of the majority of the current study population.

The methodology called "logistic regression", which is capable of isolating the most effective factors out of several factors affecting a given issue, was used. For example, in my study the following customer characteristics were considered as factors that may potentially affect the establishment of an internal teller machine within the university premises:

• Nationality

• Gender

• Lodging place

The question was whether establishing an internal teller machine would be easier, safer and more time-efficient. Then the logistic regression was conducted; the factors having high p-values were removed and the remaining factors were studied.

Thus, in my study, the factor GENDER had p = 0.4562 > 0.25, so this factor was removed and the model was refitted. Next, the analysis was carried out in Minitab and the factor TC (p = 0.2562) was examined in the same way. This factor-reduction process is called the "step-down" (backward elimination) method, and the ultimately remaining factors, with p-values below the threshold, are considered to affect the study problem significantly.
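The step-down screening described above can be sketched as a loop that repeatedly drops the predictor with the largest p-value above the 0.25 threshold. The GENDER p-value is the one quoted in the text; the other p-values are illustrative placeholders, and the refit after each removal (done in Minitab in the real analysis) is only stubbed out:

```python
# Minimal sketch of the "step-down" screening: drop the worst predictor
# whose p-value exceeds the threshold, refit, and repeat. Here the
# p-values are fixed instead of being recomputed by a refit.
THRESHOLD = 0.25

def step_down(p_values, threshold=THRESHOLD):
    """Return the predictors that survive the screening."""
    kept = dict(p_values)
    while kept:
        worst = max(kept, key=kept.get)          # largest p-value
        if kept[worst] <= threshold:
            break                                # all survivors pass the screen
        del kept[worst]                          # remove; a refit would follow
    return kept

# GENDER's p-value is from the text; the rest are placeholders.
p_values = {"GENDER": 0.4562, "HABOC": 0.0001, "IDNA": 0.0030, "TC": 0.0590}
surviving = step_down(p_values)
print(sorted(surviving))                         # GENDER is screened out
```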

The results of the study are further described under the conclusion section.

The logistic transformation equation obtained at the end of the data analysis can be used for clearer and more reasonable results.


Chapter 8

Conclusion

8.1 Results

For a long time, people have set aside their main requirements, such as food, clothing and money left over after their daily usage, with a view to using them in the future. Step by step, with the use of money, the concept of banking became popular all over the world and improved rapidly with the outcomes of the technological revolution, so that banks introduced teller and credit card facilities for the convenience of the public.

For example, it is clear that a lot of the people in our university also keep contact with the banking sector, and most of them like to keep their accounts in the government banks rather than in the private banks. A considerable number of these people use teller machines and credit cards. But most of the university people have faced a lot of trouble because these teller machines are installed within the urban areas. So my main intention is to investigate this matter.

About 3500 people come daily into the university premises for their academic and official work. The goal of this study is to survey the requirement for a teller machine within the university premises for public usage. We prepared questionnaires to collect data and interviewed more than 500 people around the university area in the Wellamadama premises, comprising academic and non-academic staff, internal and external students and security staff.

The goal of the analysis was to identify who was interested in an ATM facility on the university premises, and why they were interested in this facility. The following variables were tracked in a computer file called "INT-ATM".

Variable                           Code                               Abbreviation
Identification code                ID-Student                         IDS
                                   ID-Non-academic                    IDNA
                                   ID-Academic                        IDA
                                   ID-Temporary Academic              IDTA
Place of living                    Inside the Matara area             PLMA
                                   Outside the Matara area            PLOM
                                   Hostel inside the university       PLIUH
                                   Hostel outside the university      PLOUH
Gender                             1 - Female, 0 - Male               GENDER
Having a savings account in BOC    1 - Yes, 0 - No                    HABOC
Having the teller facility         1 - Yes, 0 - No                    TC
Interested in a BOC ATM facility   1 - Yes, 0 - No                    INTEREST
on the university premises

The variable "PLACE" was treated as a categorical variable in the regression analysis, so three dummy variables were needed to distinguish the four places of living. These variables were defined so that "living inside the Matara urban area" (PLMA) was the referent group, as follows:

PLMA = 1 if PLMA, 0 otherwise        PLOM = 1 if PLOM, 0 otherwise
PLIUH = 1 if PLIUH, 0 otherwise      PLOUH = 1 if PLOUH, 0 otherwise

Also, the variable "ID" was treated as a categorical variable, so dummy variables were needed to distinguish the sections where the members work or study. These variables were defined so that "students" (IDS) was the referent group, as follows:

IDS = 1 if IDS, 0 otherwise          IDA = 1 if IDA, 0 otherwise
IDNA = 1 if IDNA, 0 otherwise        IDTA = 1 if IDTA, 0 otherwise
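The dummy coding above can be sketched in Python; the level names are the abbreviations from the variable table, with PLMA as the referent:

```python
# Dummy (indicator) coding for the four-level PLACE variable with
# PLMA ("inside the Matara urban area") as the referent group: three
# 0/1 indicators are enough to encode four categories.
PLACE_LEVELS = ["PLMA", "PLOM", "PLIUH", "PLOUH"]   # PLMA = referent

def place_dummies(place):
    """Return the dummy variables PLOM, PLIUH, PLOUH for one subject."""
    if place not in PLACE_LEVELS:
        raise ValueError(f"unknown place of living: {place}")
    return {level: int(place == level) for level in PLACE_LEVELS[1:]}

print(place_dummies("PLIUH"))   # {'PLOM': 0, 'PLIUH': 1, 'PLOUH': 0}
print(place_dummies("PLMA"))    # referent: all three dummies are 0
```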

The following block of edited computer output comes from fitting Minitab's logistic procedure for the dichotomous outcome variable INTEREST on the predictors GENDER, HABOC, TC, the ID dummies and the PLACE dummies:

logit[Pr(Y = 1)] = β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(GENDER) + β8(HABOC) + β9(TC)

where Y denotes the dependent variable INTEREST. The results of fitting the univariable


logistic regression models to these data are given in the results table. In this table we present, for each variable listed in the first column, the following information.

(1) The estimated slope coefficient(s) for the univariable logistic regression model containing only this variable.

(2) The estimated standard error of the estimated slope coefficient.

(3) The values of the variables.

(4) The significance level for the likelihood ratio test.

(5) The estimated odds ratio, which is obtained by exponentiating the estimated coefficient. (For a continuous variable such as AGE, the odds ratio would be computed for a 5-point increase, since a change of 1 year or 1 point would not be clinically meaningful.)


(6) The 95% CI for the odds ratio.
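Points (5) and (6) can be made concrete: the odds ratio is exp(β̂), and the usual 95% CI is exp(β̂ ± 1.96·SE). The coefficient below is the IDNA estimate quoted later in this chapter; the standard error is an illustrative placeholder, since it is not reproduced in this excerpt:

```python
import math

# Odds ratio and 95% confidence interval from a logistic regression
# coefficient and its standard error: OR = exp(beta),
# CI = (exp(beta - 1.96*se), exp(beta + 1.96*se)).
def odds_ratio_ci(beta, se, z=1.96):
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# beta is the IDNA estimate from the reduced model; se is a placeholder.
or_hat, lower, upper = odds_ratio_ci(beta=0.9230, se=0.3730)
print(f"OR = {or_hat:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```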

Using the given table, we now focus on the information provided under the heading "Analysis of Maximum Likelihood Estimates". From this information, we can see that the ML coefficients obtained for the fitted model are

β̂0 = 1.6810, β̂1 = 0.9152, β̂2 = 0.2731, β̂3 = −1.0998, β̂4 = −0.1135, β̂5 = −0.6987

β̂6 = −1.4646, β̂7 = −0.2563, β̂8 = 1.5276, β̂9 = −0.8361

So the fitted model is given (in logit form) by

logit[Pr(Y = 1)] = 1.6810 + 0.9152(IDNA) + 0.2731(IDA) − 1.0998(IDTA) − 0.1135(PLOM) − 0.6987(PLIUH) − 1.4646(PLOUH) − 0.2563(GENDER) + 1.5276(HABOC) − 0.8361(TC)

Based on this fitted model and the information provided in the computer output, we can compute the estimated odds ratios for the factors of interest in the ATM machine within the university premises for public usage.

Based on this fitted model and the information provided in the computer output, we can also compute the estimated p-values for the factors of interest in the ATM machine facility within the university premises for public usage. We do this using the previously stated rule (forward selection with a test for backward elimination), with adjusted p-values for the (0-1) variables.

With the exception of GENDER (p-value = 0.293 > 0.25), there is evidence that each of the variables has some association with the outcome, so they remain among the factors of interest for the ATM machine within the university premises for public usage.

The results in Table 9.3, when compared to Table 9.2, indicate weaker associations for some covariates when controlling for other variables.

logit[Pr(Y = 1)] = β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(HABOC) + β8(TC)

By using the computer output, we can see that the ML coefficients obtained for the new fitted model are

β̂0 = 1.510, β̂1 = 0.9230, β̂2 = 0.3067, β̂3 = −1.1366, β̂4 = −0.1302, β̂5 = −0.7866, β̂6 = −1.4239, β̂7 = 1.5263, β̂8 = −0.8070



logit[Pr(Y = 1)] = 1.510 + 0.9230(IDNA) + 0.3067(IDA) − 1.1366(IDTA) − 0.1302(PLOM) − 0.7866(PLIUH) − 1.4239(PLOUH) + 1.5263(HABOC) − 0.8070(TC)

By using equations 2.4, 2.5 and 2.6, we can get

ln[p̂i / (1 − p̂i)] = β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(HABOC) + β8(TC)

p̂i / (1 − p̂i) = e^{β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(HABOC) + β8(TC)}

The logit transformation is given by

p̂i = e^{β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(HABOC) + β8(TC)} / (1 + e^{β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(HABOC) + β8(TC)})

The fitted values, using 2.6, are given by

π̂(x) = e^{β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(HABOC) + β8(TC)} / (1 + e^{β0 + β1(IDNA) + β2(IDA) + β3(IDTA) + β4(PLOM) + β5(PLIUH) + β6(PLOUH) + β7(HABOC) + β8(TC)})


By using the previous results, we can substitute the estimated values into the logistic transformation:

π̂(x) = e^{1.510 + 0.9230(IDNA) + 0.3067(IDA) − 1.1366(IDTA) − 0.1302(PLOM) − 0.7866(PLIUH) − 1.4239(PLOUH) + 1.5263(HABOC) − 0.8070(TC)} / (1 + e^{1.510 + 0.9230(IDNA) + 0.3067(IDA) − 1.1366(IDTA) − 0.1302(PLOM) − 0.7866(PLIUH) − 1.4239(PLOUH) + 1.5263(HABOC) − 0.8070(TC)})
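The fitted logistic transformation can be evaluated directly. A sketch: the coefficients are the reduced-model ML estimates quoted above, and the example subject (a non-academic staff member with a BOC savings account, no teller card, living in the referent place category) is illustrative:

```python
import math

# Predicted probability pi_hat(x) = e^eta / (1 + e^eta) from the
# reduced fitted model; coefficients are the ML estimates above.
COEFS = {
    "IDNA": 0.9230, "IDA": 0.3067, "IDTA": -1.1366,
    "PLOM": -0.1302, "PLIUH": -0.7866, "PLOUH": -1.4239,
    "HABOC": 1.5263, "TC": -0.8070,
}
INTERCEPT = 1.510

def predicted_probability(x):
    eta = INTERCEPT + sum(COEFS[name] * value for name, value in x.items())
    return math.exp(eta) / (1 + math.exp(eta))

# Illustrative subject: non-academic (IDNA=1), BOC account (HABOC=1),
# referent place (all place dummies 0), no teller card (TC=0).
subject = {"IDNA": 1, "IDA": 0, "IDTA": 0, "PLOM": 0, "PLIUH": 0,
           "PLOUH": 0, "HABOC": 1, "TC": 0}
prob = predicted_probability(subject)
print(f"{prob:.3f}")
```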

8.2 Conclusion

Compared with the other subjects, the non-academic staff (IDNA) were about two and a half times more interested in the requirement of a teller machine facility within the university premises for public usage, with an odds ratio (OR) of 2.52 and a confidence interval (CI) of 1.21 to 5.24. Also, the current account holders at the Bank of Ceylon were more than four times as interested, with an odds ratio (OR) of 4.61 and a confidence interval (CI) of 2.46 to 8.63. But the temporary and demonstrator staff were about three times less interested in the requirement of this BOC teller machine (OR = 0.33; CI = 0.17-0.66). The academic staff and students were also less interested; however, they were interested in the requirement of a People's Bank teller machine facility within the university premises for public usage.


Chapter 9

Appendix


List of Figures

1.2 Example Data Set II . . . 4
1.3 Motivation for the least-squares regression line . . . 5
1.4 Lines A and B both satisfying the criterion Σ (yi − ŷi) = 0 . . . 6
1.5 The least-squares procedure minimizes the sum of the squares of the residuals ei . . . 7
1.6 Examples of possible population regression lines . . . 9
1.7 Depth of water vs. water temperature . . . 10
1.8 Quadratic model: μ = β0 + β1x + β2x² . . . 14
1.9 Cubic model: μ = β0 + β1x + β2x² + β3x³ . . . 15
2.2 Plot of the percentage of subjects with CHD in each age group . . . 20
4.1 Comparison of the weights of two groups of boys with different distributions of age . . . 63
4.2 Plot of the logits under three different models showing the presence and absence of interaction . . . 67
5.1 Different types of models for the relationship between the logit and a continuous variable . . . 78
6.2 Academic staff . . . 86
6.3 Temporary and demonstrator staff . . . 87
6.4 Non-academic staff . . . 88
6.5 Number of savings accounts for the sample taken from the university premises . . . 88
6.6 Analysis of savings accounts for the sample taken from the university premises . . . 89

6.7 University staﬀ & student maintain their accounts with the university Peo-

ple’s bank & BanK Of Ceylon according my sample. . . . . . . . . . . . . . 90

6.8 Analysis of university staﬀ & student maintain their accounts with the

university People’s bank & Banl Of Ceylon according my sample. . . . . . 91

6.9 Number of account obtain in university Banks . . . . . . . . . . . . . . . . 92

6.10 How the money of payments is divided among the University banks. . . . . 93

6.11 Number of accounts held in University Banks . . . . . . . . . . . . . . . . . 94

6.12 How the money of payments is divided among the University banks. . . . . 95

6.13 University staff and students maintaining their accounts with the government & private banks according to my sample. . . . . . . . . . . . . . . . . . 95

6.14 University staff and students maintaining their accounts with the government & private banks according to my sample. . . . . . . . . . . . . . . . . . 96

6.15 Number of sample members without savings accounts. . . . . . . . . . . . . 96

6.16 Sample members who have obtained the teller facility . . . . . . . . . . . . . 97

6.17 Whether the sample members have obtained the teller facility or not. . . . . 98

6.18 Where the sample members have obtained the teller facility. . . . . . . . . . 98

6.19 Where the sample members have obtained the teller facility. . . . . . . . . . 99

6.20 New teller machine to be within or outside of the university area. . . . . . . 99

6.21 New teller machine to be within or outside of the university area. . . . . . . 100

6.22 The number of customers who are interested in opening a new account. . . 100

6.23 Are you interested in opening a new account? . . . . . . . . . . . . . . . . . 101

6.24 Who are interested in the new teller facility . . . . . . . . . . . . . . . . . . 101

7.2 How the money of payments is divided among the University banks. . . . . 105

7.3 Number of accounts held in University Banks . . . . . . . . . . . . . . . . . 106

7.4 Number of accounts held in University Banks . . . . . . . . . . . . . . . . . 107

7.5 University staff and students maintaining their accounts with the government & private banks according to my sample. . . . . . . . . . . . . . . . . . 107

7.6 Who are interested in the new teller facility . . . . . . . . . . . . . . . . . . 108


List of Tables

1.2 Example Data set-II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Computations for ﬁnding β0 and β1 . . . . . . . . . . . . . . . . . . . . . . 8

2.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.3 Results of ﬁtting the logistic regression model to the data in Table 2.1 . . . 27

2.4 Estimated covariance matrix of the estimated coefficients in Table 2.3 . . . 34

3.1 An example of the coding the design variables for race coded at three levels 36

3.2 Estimated coefficients for a multiple logistic regression model using the variables AGE, weight at last menstrual period (LWT), RACE, and number of first trimester physician visits (FTV) for the low birth weight study. . . . . . 42

3.3 Estimated coefficients for a multiple logistic regression model using the variables LWT and RACE from the low birth weight study. . . . . . . . . . . . 43

3.4 Estimated covariance matrix of the estimated coeﬃcients in Table 3.3 . . . 46

4.1 Values of the logistic regression model when the independent variable is dichotomous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.2 Cross-classification of AGE dichotomized at 55 years and CHD for 100 subjects 51

4.3 Illustration of the coding of the design variable using the reference cell

method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.4 Illustration of the coding of the design variable using the deviation from

means method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.5 Cross-classification of hypothetical data on RACE and CHD status for 100 subjects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4.6 Specification of the design variables for RACE using reference cell coding with white as the reference group. . . . . . . . . . . . . . . . . . . . . . . . 56


4.7 Results of fitting the logistic regression model to the data in Table 4.5 using the design variables in Table 4.6. . . . . . . . . . . . . . . . . . . . . . . . . 56

4.8 Specification of the design variables for RACE using deviation from means coding. 58

4.9 Results of fitting the logistic regression model to the data in Table 4.5 using the design variables in Table 4.8 . . . . . . . . . . . . . . . . . . . . . . . . 60

4.10 Descriptive statistics for two groups of 50 men on AGE and whether they had seen a physician (PHY) (1=Yes, 0=No) within the last months. . . . . . 63

4.11 Results of fitting the logistic regression model to the data summarized in Table 4.10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.12 Estimated logistic regression coefficients, deviance, and the likelihood ratio test statistic (G) for an example showing evidence of confounding but no interaction (n=400) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.13 Estimated logistic regression coefficients, deviance, and the likelihood ratio test statistic (G) for an example showing evidence of confounding but no interaction (n=400) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.14 Estimated logistic regression coefficients, deviance, the likelihood ratio test statistic (G), and the p-value for the change for models containing LWD and AGE from the low birth weight data (n=189) . . . . . . . . . . . . . . 71

4.15 Estimated covariance matrix for the estimated parameters in model 3 of

Table 4.14. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.16 Estimated odds ratios and 95 percent confidence intervals for LWD, controlling for AGE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.2 Results for a multivariable model containing the covariates significant at the level of Table 5.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

8.2 INTERESTED Vs IDA, IDS, IDTA, PLOM, GENDER, PLIU, PLOU, HABOC, TC . . . 112

8.3 INTEREST Vs IDA, IDS, IDTA, PLOM, PLIU, PLOU, HABOC, TC . . . . 114


Bibliography

[1] Albright, Winston, and Zappe, Theory and Problems of Business Statistics, Schaum's Outline Series, McGraw-Hill

[2] Amemiya, T. and Powell, Applied Statistics with Microsoft Excel, Duxbury Thomson Learning

[3] Berk, K. and Carey, Wayne L. Winston, Christopher Zappe, Data Analysis and Decision Making with Microsoft Excel, Thomson Brooks/Cole

[4] Carver, Data Analysis with MINITAB 12, Duxbury Thomson Learning

[5] Statistics and the Law, New York, NY: Wiley, 1986

[6] David W. Hosmer and Stanley Lemeshow, Applied Logistic Regression (3rd ed.), John Wiley and Sons

[7] Freund, R. and Littell, SAS System for Regression Analysis (2nd ed.), Duxbury Thomson Learning

[8] Hildebrand, Statistical Thinking for Managers (4th ed.), Duxbury Thomson Learning

[9] Kleinbaum, Kupper & Muller, Applied Regression Analysis and Other Multivariable Methods (2nd ed.), Duxbury Press, Cole Publishing Company

[10] Kleinbaum, Kupper & Muller, Applied Regression Analysis and Other Multivariable Methods (3rd ed.), Duxbury Press, Cole Publishing Company

[11] Lehmann, Zeitz, Statistical Explorations with Microsoft Excel, Duxbury Press, 1990

[12] Hildebrand, Statistical Thinking for Managers (4th ed.), Duxbury Thomson Learning


[13] Kleinbaum, Kupper & Muller, Applied Regression Analysis and Other Multivariable Methods (2nd ed.), Duxbury Press, Cole Publishing Company

[14] MINITAB User's Guide, Release 12 for Windows, State College, PA: Minitab, Inc.

[15] Shiffler, Adams, Introductory Business Statistics with Computer Applications (2nd ed.), Duxbury Thomson Learning, 1995

[16] Terry Dielman, Applied Regression Analysis for Business and Economics, Thomson Brooks/Cole

