
Owing to the outliers observed in the dataset, the values of each independent variable were standardised to have a mean of zero and a standard deviation of one, so that all variables contribute evenly to the model, the variables can be compared on a common scale, and the results of the logistic regression are easier to interpret (Hosmer & Lemeshow, 2000).

According to Christensen (1997), where a dataset is standardised, the residuals in binary logistic regression are expected to lie mostly within ±2; residuals outside this range indicate that outliers are still present. Because such outliers in the residuals can, according to Cook (1998), unduly influence the results of the logistic analysis and lead to incorrect inferences, robust pooled logistic regression was used to obtain the final, most efficient result for interpretation.
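A minimal sketch of this estimation step in Python with statsmodels, assuming a pandas DataFrame df holding a binary outcome y and numeric predictors x1 and x2 (placeholder names, not the study's variables):

# Standardise the predictors, then fit a logit whose covariance matrix is
# estimated with a heteroscedasticity-robust (sandwich) estimator.
import pandas as pd
import statsmodels.api as sm

predictors = ["x1", "x2"]                                     # placeholder names
X = df[predictors].apply(lambda c: (c - c.mean()) / c.std())  # mean 0, sd 1
X = sm.add_constant(X)                                        # intercept term

ml_fit = sm.Logit(df["y"], X).fit(disp=0)                      # usual ML covariance
robust_fit = sm.Logit(df["y"], X).fit(disp=0, cov_type="HC1")  # robust covariance

print(ml_fit.summary())
print(robust_fit.bse)   # robust standard errors for comparison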

logit(p) = log[odds(p)] = log[p/(1 − p)]

Pr(Yi = 1) = G(β0 + β1X1i + β2X2i), where G(·) is a function taking on values strictly between zero and one: 0 < G(z) < 1 for any real z.

Baum, C. F. (2013). Limited Dependent Variables. Birmingham Business School. Available at http://www.birmingham.ac.uk/Documents/college-social-sciences/business/economics/kit-baum-workshops/Bham13P4slides.pdf

Advice
In summary, my personal advice (and I have respect for conflicting opinions) is

I never worry about whether (1) is true. I assume the logit link is OK.

If I think the model is reasonably specified, I use the ML variance estimator for logistic
regression.

Only if I have good reason to believe that the model is poorly specified would I use the robust
variance estimator. That is, if the model fails goodness-of-fit tests, etc. Sometimes one just has
to live with missing predictors and badly fitting models because data were collected for only a
few predictors. In this case, I'd use the robust variance estimator.

And, obviously, I'd use the robust variance estimator if I had clustered data.

This recommendation is in contrast to the advice I'd give for linear regression, for which I'd
say: always use the robust variance estimator.
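For the clustered-data case mentioned in the advice, a minimal sketch with the same placeholder names (df, y, x1, x2) plus a hypothetical cluster identifier firm_id:

# Cluster-robust variance estimator for a logit; the point estimates are
# unchanged, only the covariance matrix (and hence the standard errors) differs.
import statsmodels.api as sm

X = sm.add_constant(df[["x1", "x2"]])
clustered_fit = sm.Logit(df["y"], X).fit(
    disp=0, cov_type="cluster", cov_kwds={"groups": df["firm_id"]}
)
print(clustered_fit.summary())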

Linear predictor function

The basic idea of logistic regression is to use the mechanism already developed for linear
regression by modeling the probability pi using a linear predictor function, i.e. a linear combination of
the explanatory variables and a set of regression coefficients that are specific to the model at hand but
the same for all trials. The linear predictor function for a particular data point i is written as:

f(i) = β0 + β1x1,i + · · · + βmxm,i,

where β0, ..., βm are regression coefficients indicating the relative effect of a particular explanatory variable
on the outcome.

The model is usually put into a more compact form as follows:

The regression coefficients β0, β1, ..., βm are grouped into a single vector β of size m + 1.

For each data point i, an additional explanatory pseudo-variable x0,i is added, with a fixed value
of 1, corresponding to the intercept coefficient β0.

The resulting explanatory variables x0,i, x1,i, ..., xm,i are then grouped into a single vector Xi of
size m + 1.

This makes it possible to write the linear predictor function as follows:

f(i) = β · Xi,

using the notation for a dot product between two vectors.
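Concretely, a tiny numerical sketch of the dot-product form (the numbers are invented):

# f(i) = beta . X_i, with the intercept carried by the pseudo-variable x_{0,i} = 1.
import numpy as np

beta = np.array([-1.5, 0.8, 0.3])    # beta_0 (intercept), beta_1, beta_2
X_i = np.array([1.0, 2.0, -0.5])     # x_{0,i} = 1, then x_{1,i}, x_{2,i}
f_i = beta @ X_i                     # linear predictor
p_i = 1.0 / (1.0 + np.exp(-f_i))     # logistic link maps f(i) into (0, 1)
print(f_i, p_i)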

As a generalized linear model


The particular model used by logistic regression, which distinguishes it from standard linear
regression and from other types of regression analysis used for binary-valued outcomes, is
the way the probability of a particular outcome is linked to the linear predictor function:

logit(pi) = ln[pi/(1 − pi)] = β0 + β1x1,i + · · · + βmxm,i

Written using the more compact notation described above, this is:

logit(pi) = β · Xi

This formulation expresses logistic regression as a type of generalized linear model,


which predicts variables with various types of probability distributions by fitting a linear
predictor function of the above form to some sort of arbitrary transformation of the
expected value of the variable.

The intuition for transforming using the logit function (the natural log of the odds) was
explained above. It also has the practical effect of converting the probability (which is
bounded between 0 and 1) to a variable that ranges over (−∞, +∞), thereby matching
the potential range of the linear predictor function on the right-hand side of the equation.

Note that both the probabilities pi and the regression coefficients are unobserved, and
the means of determining them is not part of the model itself. They are typically
determined by some sort of optimization procedure, e.g. maximum likelihood
estimation, that finds values that best fit the observed data (i.e. that give the most
accurate predictions for the data already observed), usually subject
to regularization conditions that seek to exclude unlikely values, e.g. extremely large
values for any of the regression coefficients. The use of a regularization condition is
equivalent to doing maximum a posteriori (MAP) estimation, an extension of
maximum likelihood. (Regularization is most commonly done using a squared
regularizing function, which is equivalent to placing a zero-mean Gaussian prior
distribution on the coefficients, but other regularizers are also possible.) Whether or
not regularization is used, it is usually not possible to find a closed-form solution;
instead, an iterative numerical method must be used, such as iteratively reweighted
least squares (IRLS) or, more commonly these days, a quasi-Newton method such as
the L-BFGS method.
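As an illustration of the fitting procedure just described, a short sketch on simulated data: the L2 penalty plays the role of the zero-mean Gaussian prior (MAP estimation) and the optimizer is L-BFGS. scikit-learn is used here purely for convenience; any maximum likelihood routine would do:

# Regularised (MAP) logistic regression on simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_beta = np.array([1.0, -2.0])
p = 1.0 / (1.0 + np.exp(-(0.5 + X @ true_beta)))
y = rng.binomial(1, p)

# C is the inverse regularisation strength: smaller C = tighter Gaussian prior.
model = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs").fit(X, y)
print(model.intercept_, model.coef_)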

The interpretation of the βj parameter estimates is as the additive effect on the log of
the odds for a unit change in the jth explanatory variable. In the case of a
dichotomous explanatory variable, for instance gender, e^β is the estimate of the odds of
having the outcome for, say, males compared with females.

An equivalent formula uses the inverse of the logit function, which is the logistic
function, i.e.:

pi = 1/(1 + e^(−β · Xi))

The formula can also be written as a probability distribution (specifically, using
a probability mass function):

Pr(Yi = y | Xi) = pi^y (1 − pi)^(1−y),   for y in {0, 1}

Menard, Scott W. (2002). Applied Logistic Regression (2nd ed.). SAGE. ISBN 978-0-7619-2208-7.

The linear probability model (LPM) is by far the simplest way of dealing with binary
dependent variables, and it is based on an assumption that the probability of an
event occurring, Pi, is linearly related to a set of explanatory variables x2i, x3i, ..., xki (Brooks,
2008:512):

Pi = p(yi = 1) = β1 + β2x2i + β3x3i + · · · + βkxki + ui,   i = 1, ..., N   (11.1)

The actual probabilities cannot be observed, so we would estimate a
model where the outcomes, yi (the series of zeros and ones), would be the dependent
variable. This is then a linear regression model and would be estimated by OLS. The
set of explanatory variables could include either quantitative variables or dummies or
both. The fitted values from this regression are the estimated probabilities for yi = 1
for each observation i. The slope estimates for the linear probability model can be
interpreted as the change in the probability that the dependent variable will equal 1
for a one-unit change in a given explanatory variable, holding the effect of all other
explanatory variables fixed. Suppose, for example, that we wanted to model the
probability that a firm i will pay a dividend (yi = 1) as a function of its market
capitalisation (x2i, measured in millions of US dollars).

While the linear probability model is simple to estimate and intuitive to interpret, the
diagram should immediately signal a problem with this setup. For any firm whose
value is less than $25m, the model-predicted probability of dividend payment is
negative, while for any firm worth more than $88m, the probability is greater than
one. Clearly, such predictions cannot be allowed to stand, since the probabilities
should lie within the range (0,1). An obvious solution is to truncate the probabilities at
0 or 1, so that a probability of 0.3, say, would be set to zero, and a probability of,
say, 1.2 would be set to 1. However, there are at least two reasons why this is still not
adequate: (1) The process of truncation will result in too many observations for which
the estimated probabilities are exactly zero or one. (2) More importantly, it is simply
not plausible to suggest that the firm's probability of paying a dividend is either
exactly zero or exactly one. Are we really certain that very small firms will definitely
never pay a dividend and that large firms will always make a payout? Probably not, so
a different kind of model is usually used for binary dependent variables -- either a
logit or a probit specification. These
approaches will be discussed in the following sections. But before moving on, it is
worth noting that the LPM also suffers from a couple of more standard econometric
problems that we have examined in previous chapters. First, since the dependent
variable takes on only one of two values, for given (fixed in repeated samples) values of
the explanatory variables, the disturbance term will also take on only one of two
values. Consider again equation (11.1). If yi = 1, then by definition ui = 1 − β1 − β2x2i − · · · − βkxki, while if yi = 0, then ui = −β1 − β2x2i − · · · − βkxki.
Hence the error term cannot plausibly be assumed to be normally distributed. Since
ui changes systematically with the explanatory variables, the disturbances will also
be heteroscedastic. It is therefore essential that heteroscedasticity-robust standard
errors are always used in the context of limited dependent variable models.

Both the logit and probit model approaches are able to overcome the limitation of the
LPM that it can produce estimated probabilities that are negative or greater than one.
They do this by using a function that effectively transforms the regression model so
that the fitted values are bounded within the (0,1) interval. Visually, the fitted
regression model will appear as an S-shape rather than a straight line, as was the
case for the LPM. This is shown in figure 11.2. Brooks, 514
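A small simulation along the lines of Brooks' dividend example (all numbers invented) makes the contrast visible: the OLS (LPM) fitted values can stray outside the unit interval, while the logit fitted values cannot:

# LPM vs logit fitted probabilities on simulated "dividend" data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
mktcap = rng.uniform(5, 150, size=400)                   # firm size in $m (invented)
p_true = 1.0 / (1.0 + np.exp(-(mktcap - 55.0) / 8.0))
pays_dividend = rng.binomial(1, p_true)

X = sm.add_constant(mktcap)
lpm_fitted = sm.OLS(pays_dividend, X).fit().fittedvalues         # unbounded
logit_fitted = sm.Logit(pays_dividend, X).fit(disp=0).predict()  # in (0, 1)

print("LPM fitted range:  ", lpm_fitted.min(), lpm_fitted.max())
print("Logit fitted range:", logit_fitted.min(), logit_fitted.max())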

Log [p/(1-p)] = a + bx

Log [p/(1-p)] = a + b1x1+ b2x2 + b3x3 + ... + bnxn.

Wooldridge, J. M. Introductory Econometrics: A Modern Approach, pp. 529–534.

Brooks, C. (2008). Introductory Econometrics for Finance, 2nd edition. Cambridge University Press, New York.

As Wooldridge (p. 533) puts it, for estimating limited dependent variable models, "maximum likelihood methods are indispensable."

Robust Standard Errors for Nonlinear Models


André Richter wrote to me from Germany, commenting on the reporting of robust standard
errors in the context of nonlinear models such as Logit and Probit. He said he'd been led to
believe that this doesn't make much sense. I told him that I agree, and that this is another of
my "pet peeves"!

Yes, I do get grumpy about some of the things I see so-called "applied econometricians" doing
all of the time. For instance, see my Gripe of the Day post back in 2011. Sometimes I feel as if I
could produce a post with that title almost every day!

Anyway, let's get back to André's point.

The following facts are widely known (e.g., check any recent edition of Greene's text) and it's
hard to believe that anyone could get through a grad. level course in econometrics and not be
aware of them:
In the case of a linear regression model, heteroskedastic errors render the OLS estimator, b, of
the coefficient vector, β, inefficient. However, this estimator is still unbiased and weakly consistent.
In this same linear model, and still using OLS, the usual estimator of the covariance matrix of b
is an inconsistent estimator of the true covariance matrix of b. Consequently, if the standard errors of
the elements of b are computed in the usual way, they will be inconsistent estimators of the true standard
deviations of the elements of b.
For this reason, we often use White's "heteroskedasticity consistent" estimator for the covariance
matrix of b, if the presence of heteroskedastic errors is suspected.
This covariance estimator is still consistent, even if the errors are actually homoskedastic.
In the case of the linear regression model, this makes sense. Whether the errors are
homoskedastic or heteroskedastic, both the OLS coefficient estimators and White's standard errors are
consistent.
However, in the case of a model that is nonlinear in the parameters:
The MLE of the parameter vector is biased and inconsistent if the errors are heteroskedastic
(unless the likelihood function is modified to correctly take into account the precise form of
heteroskedasticity).
This stands in stark contrast to the situation above, for the linear model.
The MLE of the asymptotic covariance matrix of the MLE of the parameter vector is also
inconsistent, as in the case of the linear model.
Obvious examples of this are Logit and Probit models, which are nonlinear in the parameters, and
are usually estimated by MLE.
I've made this point in at least one previous post. The results relating to nonlinear models are really
well-known, and this is why it's extremely important to test for model mis-specification (such as
heteroskedasticity) when estimating models such as Logit, Probit, Tobit, etc. Then, if need be, the model
can be modified to take the heteroskedasticity into account before we estimate the parameters. For
more information on such tests, and the associated references, see this page on my professional
website.

Unfortunately, it's unusual to see "applied econometricians" pay any attention to this! They tend to just
do one of two things. They either
1. use Logit or Probit, but report the "heteroskedasticity-consistent" standard errors that their
favourite econometrics package conveniently (but misleadingly) computes for them. This involves a
covariance estimator along the lines of White's "sandwich estimator". Or, they
2. estimate a "linear probability model" (i.e., just use OLS, even though the dependent variable is a
binary dummy variable) and report the "het.-consistent" standard errors.
If they follow approach 2, these folks defend themselves by saying that "you get essentially the same
estimated marginal effects if you use OLS as opposed to Probit or Logit." I've said my piece about this
attitude previously (here, here, here, and here), and I won't go over it again here.

My concern right now is with approach 1 above.

The "robust" standard errors are being reported to cover the possibility that the model's errors may be
heteroskedastic. But if that's the case, the parameter estimates are inconsistent. What use is a
consistent standard error when the point estimate is inconsistent? Not much!!
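For concreteness, this is what approach 1 amounts to in practice (df, y, x1, x2 are placeholder names): the robust option changes only the reported standard errors, never the point estimates, which is exactly the problem:

# Logit with ML and with "robust" (sandwich) standard errors.
import statsmodels.api as sm

X = sm.add_constant(df[["x1", "x2"]])
ml_fit = sm.Logit(df["y"], X).fit(disp=0)                        # usual ML covariance
sandwich_fit = sm.Logit(df["y"], X).fit(disp=0, cov_type="HC0")  # sandwich covariance

print(ml_fit.params)        # identical point estimates in both fits
print(ml_fit.bse)           # ML standard errors
print(sandwich_fit.bse)     # "robust" standard errors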

This point is laid out pretty clearly in Greene (2012, pp. 692-693), for example. Here's what he
has to say:
"...the probit (Q-) maximum likelihood estimator is not consistent in the presence of any form
of heteroscedasticity, unmeasured heterogeneity, omitted variables (even if they are
orthogonal to the included ones), nonlinearity of the form of the index, or an error in the
distributional assumption [ with some narrow exceptions as described by Ruud (198)]. Thus, in
almost any case, the sandwich estimator provides an appropriate asymptotic covariance
matrix for an estimator that is biased in an unknown direction." (My underlining; DG.) "White
raises this issue explicitly, although it seems to receive very little attention in the
literature.".........."His very useful result is that if the QMLE converges to a probability limit,
then the sandwich estimator can, under certain circumstances, be used to estimate the
asymptotic covariance matrix of that estimator. But there is no guarantee that the
QMLE will converge to anything interesting or useful. Simply computing a robust covariance
matrix for an otherwise inconsistent estimator does not give it redemption. Consequently, the
virtue of a robust covariance matrix in this setting is unclear."
Back in July 2006, on the R Help feed, Robert Duval had this to say:
"This discussion leads to another point which is more subtle, but more important...
You can always get Huber-White (a.k.a robust) estimators of the standard errors even in non-linear
models like the logistic regression. However, if you believe your errors do not satisfy the standard
assumptions of the model, then you should not be running that model as this might lead to biased
parameter estimates.
For instance, in the linear regression model you have consistent parameter estimates independently of
whether the errors are heteroskedastic or not. However, in the case of non-linear models it is usually the
case that heteroskedasticity will lead to biased parameter estimates (unless you fix it explicitly
somehow).
Stata is famous for providing Huber-White std. errors in most of their regression estimates, whether
linear or non-linear. But this is nonsensical in the non-linear models since in these cases you would be
consistently estimating the standard errors of inconsistent parameters.
This point and potential solutions to this problem are nicely discussed in Wooldridge's Econometric Analysis
of Cross Section and Panel Data."
Amen to that!

Regrettably, it's not just Stata that encourages questionable practices in this respect. These
same options are also available in EViews, for example.

Reference

Greene, W. H. (2012). Econometric Analysis, 7th edition. Prentice Hall, Upper Saddle River, NJ.

Greene (2012: 389):
(In the panel data context, this is also called the population averaged model under
the assumption that any latent heterogeneity has been averaged out.) In this form, if
the remaining assumptions of the classical model are met (zero conditional mean of εit,
homoscedasticity, independence across observations i, and strict exogeneity of xit), then
no further analysis beyond the results of Chapter 4 is needed. Ordinary least squares
is the efficient estimator and inference can reliably proceed along the lines developed
in Chapter 5.


A standardized variable (sometimes called a z-score or a standard score) is a variable


that has been rescaled to have a mean of zero and a standard deviation of one. For a
standardized variable, each case's value on the standardized variable indicates its
difference from the mean of the original variable in number of standard deviations (of
the original variable). For example, a value of 0.5 indicates that the value for that
case is half a standard deviation above the mean, while a value of -2 indicates that a
case has a value two standard deviations lower than the mean. Variables are
standardized for a variety of reasons, for example, to make sure all variables
contribute evenly to a scale when items are added together, or to make it easier to
interpret results of a regression or other analysis.

Standardizing a variable is a relatively straightforward procedure. First, the mean is


subtracted from the value for each case, resulting in a mean of zero. Then, the
difference between the individual's score and the mean is divided by the standard
deviation, which results in a standard deviation of one. If we start with a variable x,
and generate a variable x*, the process is:

x* = (x-m)/sd

Where m is the mean of x, and sd is the standard deviation of x.

To illustrate the process of standardization, we will use the High School and Beyond
dataset (hsb2). We will create standardized versions of three variables, math,
science, and socst. These variables contain students' scores on tests of knowledge of
mathematics (math), science (science), and social studies (socst). First, we will use the
summarize command (abbreviated as sum below) to get the mean and standard
deviation for each variable.
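A Python equivalent of these steps, assuming the hsb2 data have already been loaded into a pandas DataFrame named hsb2 (the Stata summarize and generate commands in the original do the same thing):

# Create z-scored versions of math, science and socst: x* = (x - m) / sd.
import pandas as pd

for var in ["math", "science", "socst"]:
    m, sd = hsb2[var].mean(), hsb2[var].std()     # the "summarize" step
    hsb2["z_" + var] = (hsb2[var] - m) / sd       # the standardisation step

# Each new variable should now have mean ~0 and standard deviation ~1.
print(hsb2[["z_math", "z_science", "z_socst"]].agg(["mean", "std"]).round(3))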

There are many reasons for transformation. The list here is not
comprehensive.

1. Convenience
2. Reducing skewness
3. Equal spreads
4. Linear relationships
5. Additive relationships

If you are looking at just one variable, 1, 2 and 3 are relevant, while
if you are looking at two or more variables, 4 and 5 are more important.
However, transformations that achieve 4 and 5 very often achieve 2 and 3.

1. Convenience A transformed scale may be as natural as the original


scale and more convenient for a specific purpose (e.g. percentages
rather than original data, sines rather than degrees).

One important example is standardisation, whereby values are adjusted for


differing level and spread. In general

value - level
standardised value = -------------.
spread

Standardised values have level 0 and spread 1 and have no units: hence
standardisation is useful for comparing variables expressed in different
units. Most commonly a standard score is calculated using the mean and
standard deviation (sd) of a variable:

x - mean of x
z = -------------.
sd of x
Standardisation makes no difference to the shape of a distribution.

3. Equal spreads A transformation may be used to produce approximately


equal spreads, despite marked variations in level, which again makes data
easier to handle and interpret. Each data set or subset having about the
same spread or variability is a condition called homoscedasticity: its
opposite is called heteroscedasticity. (The spelling -sked- rather than
-sced- is also used.)

Dear hyojoung,

--- "Hyojoung Kim" <hyojoung@u...> wrote:


> i am running a logistic regression and have a compelling reason to
> show how serious the problem of heteroscedasticity is in this
> regression. are you aware of an equivalent of - hettest - for
> logistic regression analysis?

What you could do is estimate a model with -hetprob- and -probit- and
do a likelihood ratio test (-lrtest-). This is a test for
heteroscedasticity in probit regression, which is very close to
logistic regression, except you don't get the nice odds ratios.

> alternatively, would it be acceptable if a - hettest - is run in


> OLS and use it as an indirect evidence for the presence of
> heteroscedasticity?

Logistic and probit regression are so close that the choice between
them is often based on practical grounds and tradition within the
discipline and not on substantial grounds. The presence of -hetprob-
would be such a practical reason why you might want to switch to
probit in this case. If you really want to use logit, and want to put
up with indirect evidence, then the comparison of -hetprob- and
-probit- would in my eyes be more convincing indirect evidence than
-hettest- on a linear probability model.

Maarten
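A rough Python analogue of the -hetprob- versus -probit- comparison Maarten suggests, using a Harvey-style multiplicative-heteroscedasticity probit on simulated data (a sketch only, not Stata's implementation):

# Standard probit vs heteroscedastic probit, Pr(y=1|x,z) = Phi(x'b / exp(g*z)),
# compared with a likelihood-ratio test of g = 0.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
z = rng.normal(size=n)                       # variable driving the error variance
y = (0.5 + x + np.exp(0.6 * z) * rng.normal(size=n) > 0).astype(float)
X = np.column_stack([np.ones(n), x])

def negll_probit(b):
    p = np.clip(stats.norm.cdf(X @ b), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def negll_hetprobit(theta):
    b, g = theta[:2], theta[2]
    p = np.clip(stats.norm.cdf((X @ b) / np.exp(g * z)), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit0 = optimize.minimize(negll_probit, np.zeros(2), method="BFGS")
fit1 = optimize.minimize(negll_hetprobit, np.zeros(3), method="BFGS")

lr = 2 * (fit0.fun - fit1.fun)               # one restriction: g = 0
print("LR =", lr, "p-value =", stats.chi2.sf(lr, df=1))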

Dave Giles Robust Standard Errors Logit

http://www.statalist.org/forums/forum/general-stata-discussion/general/5564-
heteroskedasticity-test-for-logit-and-logistic-models
http://www.stata.com/manuals13/rhetprobit.pdf

related to issues and PR on score/LM and conditional moment tests

two blog articles for R


http://www.r-bloggers.com/the-problem-with-testing-for-heteroskedasticity-in-probit-
models/
https://diffuseprior.wordpress.com/2012/11/19/the-heteroskedastic-probit-model/
Shazam for LM test Davidson-MacKinnon with Shazam output (useful for unit test)
http://www.econometrics.com/intro/logit3.htm
data: http://www.econometrics.com/intro/school.htm

another blog article in favor of LM test


http://davegiles.blogspot.ca/2011/05/gripe-of-day.html
but he has things mostly in "unreadable" Eviews
workfiles http://web.uvic.ca/~dgiles/downloads/binary_choice/index.html

For logistic regression with one or two predictor variables, it is relatively


simple to identify outlying cases with respect to their X or Y values by means
of scatter plots of residuals and to study whether they are influential in
affecting the fitted linear predictor. When more than two predictor variables
are included in the logistic regression model, however, the identification of
outlying cases by simple graphical methods becomes difficult. In such a case,
traditional standardized residual plots reveal little about outliers, whereas
derived statistics built from the basic building blocks, plotted with lowess
smooths and bubble plots, have the potential to detect outliers and influential
cases (Kutner et al., 2005; Hosmer and Lemeshow, 2000).

There are three ways that an observation can be considered as unusual,


namely as an outlier, an influential observation, or a high-leverage point. In logistic regression,
observations whose values deviate from the expected range and produce
extremely large residuals, possibly indicating a sample peculiarity, are called
outliers. These outliers can unduly influence the results of the analysis and
lead to incorrect inferences. An observation is said to be influential if
removing the observation substantially changes the estimate of coefficients.
Influence can be thought of as the product of leverage and outliers. An
observation with an extreme value on a predictor variable is called a point
with high leverage. Leverage is a measure of how far an independent variable
deviates from its mean. In fact, the leverage indicates the geometric
extremeness of an observation in the multi-dimensional covariate space.
These leverage points can have an unusually large effect on the estimate of
logistic regression coefficients (Cook, 1998).

Christensen (1997) suggested that if the residuals in binary logistic regression
have been standardized in some fashion, then one would expect most of them
to have values within ±2. Standardized residuals outside of this range are
potential outliers. Thus studentized residuals less than −2 or greater than +2
definitely deserve closer inspection. In that situation, the lack of fit can be
attributed to outliers and the large residuals will be easy to find in the plot.
But analysts may attempt to find groups of points that are not well fit by the
model rather than concentrating on individual points. Techniques for judging
the influence of a point on a particular aspect of the fit such as those
developed by Pregibon (1981) seem more justified than outlier detection (Jennings,
1986).
Detection of outliers and influential cases, and their corresponding treatment, is a
crucial task in any modeling exercise. A failure to detect outliers and hence
influential cases can severely distort the validity of the inferences
drawn from such a modeling exercise. It would be reasonable to use diagnostics
to check whether the model can be improved when the Correct Classification Rate
(CCR) is smaller than 100%. The main focus in this study is to detect outliers
and influential cases that have a substantial impact on the fitted logistic
regression model through appropriate graphical method including smoothing
technique.
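One way to operationalise these diagnostics, sketched with numpy/statsmodels and assuming a design matrix X (including the constant) and a binary outcome y as numpy arrays from an earlier fit; the ±2 cut-off follows Christensen (1997):

# Standardised Pearson residuals and leverage for a fitted logit.
import numpy as np
import statsmodels.api as sm

fit = sm.Logit(y, X).fit(disp=0)
p_hat = fit.predict()
w = p_hat * (1 - p_hat)                               # Var(y_i | x_i)

# Leverage: diagonal of H = W^(1/2) X (X'WX)^(-1) X' W^(1/2)
WX = X * w[:, None]
hat = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ WX), WX)

pearson = (y - p_hat) / np.sqrt(w)                    # Pearson residuals
std_pearson = pearson / np.sqrt(1 - hat)              # standardised version

flagged = np.where(np.abs(std_pearson) > 2)[0]        # potential outliers
print("Cases with |standardised residual| > 2:", flagged)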

The logistic regression model

The "logit" model solves these problems:

ln[p/(1-p)] = a + BX + e or

[p/(1-p)] = exp(a + BX + e)

where:

ln is the natural logarithm, i.e. the logarithm to base e, where e = 2.71828...

p is the probability that the event Y occurs, p(Y=1)

p/(1-p) is the "odds"

ln[p/(1-p)] is the log odds, or "logit"

all other components of the model are the same.

The logistic regression model is simply a non-linear transformation of the linear


regression. The "logistic" distribution is an S-shaped distribution function which
is similar to the standard-normal distribution (which results in a probit
regression model) but easier to work with in most applications (the probabilities
are easier to calculate). The logistic function constrains the estimated
probabilities to lie between 0 and 1.

For instance, the estimated probability is:

p = 1/[1 + exp(-a - BX)]

With this functional form:

if you let a + BX =0, then p = .50


as a + BX gets really big, p approaches 1

as a + BX gets really small, p approaches 0.
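A quick numerical check of these three statements:

# p = 1 / (1 + exp(-(a + BX))) at a few values of the linear index a + BX.
import numpy as np

def p(index):
    return 1.0 / (1.0 + np.exp(-index))

print(p(0.0))     # exactly 0.5 when a + BX = 0
print(p(10.0))    # approaches 1 as a + BX gets very large
print(p(-10.0))   # approaches 0 as a + BX gets very small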
