Regression Analysis
A common question is whether there is a minimum acceptable value for R-squared. Must an R-squared exceed some arbitrary level (often 50 percent)? Is a regression model with a higher R-squared categorically preferable to one with a lower R-squared? In this article, we examine whether there is indeed a minimum threshold for R-squared by reviewing the R-squared values reported in a sample of peer-reviewed economics studies. If such a threshold exists, it should be evident in those studies. To the contrary, we find that many empirical studies make no mention of R-squared at all; among the articles that do report an R-squared value, approximately half report an R-squared below 0.5 (or 50 percent), and approximately 7 percent report an R-squared below 0.1 (or 10 percent).
One way to examine the question of whether there is a minimum threshold for R-squared
for an econometric model is to review the reported R-squared values in highly regarded,
peer-reviewed economics journals. We reviewed 315 economics articles in which a regression analysis was undertaken, drawn from three journals published in 2014 and 2015.[6] If
there was indeed a minimum level of R-squared, we would expect to see this threshold in
the peer-reviewed economics articles. To the contrary, we do not observe a minimum
threshold and instead observe a wide range of values for R-squared in these published
articles. A reasonable interpretation is that there is no consensus on what constitutes an acceptably high (or unacceptably low) R-squared to warrant inclusion in (or exclusion from) a peer-reviewed economics journal. Another interesting observation is that almost half of the empirical articles (146 out of 315 papers, or 46 percent) do not even provide an R-squared statistic.
The chart below shows all reported instances for R-squared in the reviewed articles,
including adjusted R-squared and pseudo R-squared.[7] Notably, the majority of the reported models exhibit fairly “low” (and occasionally negative) R-squared values.
The second figure below shows (unadjusted) R-squared statistics from articles where the
estimation methodology is or appears to be ordinary least squares (OLS).[8]
To further strengthen comparability, the third figure below shows the distribution of standard R-squared statistics in instances where the authors explicitly identified that the model was OLS. As
we tighten the criteria for comparability, the distribution of R-squared statistics is
increasingly skewed towards the lower end of the spectrum.
Interestingly, among the articles where the OLS methodology was explicit, only 6 percent comment on the R-squared statistic within the text, and those comments are invariably brief, generally doing no more than stating the R-squared value.
We also have not identified any article from our sample that explicitly cites the R-squared as
a criterion for preferring one model specification over another. Generally, the R-squared is
provided as one of several statistics describing the regression results with no discussion of
its significance within the text. Moreover, the value of the R-squared does not appear to be
a criterion for determining whether a regression analysis is fit for publication in peer-
reviewed economics journals.
The fact that the surveyed articles fail to show a minimum acceptable level of R-squared
should not be surprising for economists who recognize the limitations of interpreting the R-
squared value. As noted in a standard econometrics textbook, “low R-squareds in
regression equations are not uncommon, especially for cross-sectional analysis ... [A]
seemingly low R-squared does not necessarily mean that an OLS regression equation is
useless.”[9] In particular, a “high” R-squared is merely an indication that the model fits the
existing data well. By itself, it does not provide insight into whether the model is
economically or statistically meaningful, and it might in fact only reflect a strong but spurious
correlation (and not causation) between the dependent and explanatory variables.
Various examples of high R-squared values illustrate this point, such as the relationship
between annual attendance at Fenway Park (the stadium of the Boston Red Sox) and the
annual number of U.S. patent applications filed from 1995 to 2014, which has an R-squared
of 0.8 (or 80 percent).[10] While the play of the Boston Red Sox may have inspired its share
of inventors in the Boston area in recent decades, it is obviously not a significant driver of
U.S. patent applications. In this case, the high R-squared value is economically
meaningless and reflects spurious correlation.[11]
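The mechanics behind such a spurious fit can be sketched with synthetic data. The two series below are made up for illustration (they are not the actual attendance or patent figures); each simply trends upward over time, and regressing one on the other produces a high R-squared even though neither drives the other:

```python
import random

def simple_ols(x, y):
    # Closed-form simple OLS of y on x; returns (slope, intercept, R-squared).
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    ss_res = sum((y[i] - (intercept + slope * x[i])) ** 2 for i in range(n))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

random.seed(42)
years = range(20)  # twenty annual observations, analogous to 1995-2014
series_a = [100 + 5 * t + random.gauss(0, 3) for t in years]  # steadily trending series
series_b = [50 + 8 * t + random.gauss(0, 5) for t in years]   # unrelated series that also trends

_, _, r2 = simple_ols(series_a, series_b)
print(round(r2, 2))  # high R-squared, yet neither series drives the other
```

Because both series are dominated by their common upward drift, the regression "explains" most of the variation while telling us nothing about causation.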
The value of R-squared will vary depending upon the type of economic analysis. For
example, time series analysis tends to result in a higher R-squared than cross-sectional
analysis, and it would be improper to compare R-squared values across disparate
regressions.[12] If the dependent variable is a nonstationary time series,[13] an R-squared
value close to one is unimpressive, and if very close to one, may be a troubling sign that
there are significant time patterns in the errors because a large part of the explanatory
power of the regression may rely on the time trend as opposed to economically interesting
variables. Alternatively, if the dependent variable is a stationary time series, then an R-
squared of 0.25 (or 25 percent), for example, may be reasonable so long as the model is
properly specified.[14]
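This distinction can be sketched with synthetic data (series with deterministic time trends, made up for illustration): regressing one trending series on another, unrelated trending series yields an R-squared near one in levels, while running the same regression on first differences, which removes the trend, yields an R-squared near zero.

```python
import random

def r_squared(x, y):
    # R-squared from a simple OLS regression of y on x.
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    ss_res = sum((y[i] - (intercept + slope * x[i])) ** 2 for i in range(n))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

random.seed(1)
T = 200
x = [0.5 * t + random.gauss(0, 1) for t in range(T)]  # trending series
y = [0.3 * t + random.gauss(0, 1) for t in range(T)]  # unrelated series, also trending

r2_levels = r_squared(x, y)                        # near 1: the trend does the work
dx = [x[t] - x[t - 1] for t in range(1, T)]        # first differences remove the trend
dy = [y[t] - y[t - 1] for t in range(1, T)]
r2_diffs = r_squared(dx, dy)                       # near 0: no real relationship remains
print(round(r2_levels, 3), round(r2_diffs, 3))
```

The near-perfect fit in levels comes entirely from the shared time pattern, not from any economic relationship between the two series.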
It is important to bear in mind that often an analyst is not attempting to explain all or a high
proportion of the variation in the dependent variable. Rather, the analyst is often seeking to
assess whether the relationship between the explanatory and dependent variables is economically material and statistically significant.[15] Given sufficient data, it may still be
possible to reliably estimate the impact of individual explanatory variables even though the
model has a “low” R-squared.[16] For example, an econometric study analyzing the
relationship between selected variables and economic growth may yield a low R-squared,
but can reveal an economically meaningful (and potentially important) relationship between
one of the observed variables and economic growth, even if it leaves a large amount of
variation in the data unexplained.[17] It also is important to understand that R-squared is not
a proper measure of whether important explanatory variables have been omitted or whether
there is omitted variable bias.[18]
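This point can be sketched with simulated data (made up for illustration): when the true effect of x on y is 2 but most of the variation in y is unexplained noise, a simple OLS fit still recovers a slope near 2 even though the R-squared is low.

```python
import random

random.seed(7)
n = 500
x = [random.uniform(0, 1) for _ in range(n)]
# True effect of x on y is 2, but most of y's variation is noise.
y = [2 * x[i] + random.gauss(0, 2) for i in range(n)]

# Simple OLS by the closed-form formulas
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((x[i] - xbar) * (y[i] - ybar) for i in range(n))
slope = sxy / sxx
intercept = ybar - slope * xbar

ss_res = sum((y[i] - (intercept + slope * x[i])) ** 2 for i in range(n))
ss_tot = sum((yi - ybar) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot
print(round(slope, 2), round(r2, 2))  # slope near 2, R-squared well below 0.5
```

With enough observations, the estimated coefficient is reliable and economically meaningful even though the model leaves most of the variation in y unexplained.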
Finally, it would be improper to categorically favor models simply because one specification
yielded a higher R-squared than an alternative, and potentially warranted, specification. A focus on a “high” R-squared could lead an analyst to unduly discard a theoretically sound model in favor of another that achieves a higher R-squared by including arbitrarily chosen variables. Variables should be added to a model only with proper consideration of the cause-and-effect assumptions implicit in those variables, and one
must be careful to examine how the additional variables change the estimated coefficients
of other variables. Targeting a high R-squared could make the interpretation of the overall
model more difficult, mask relationships between key economic variables, and raise
questions about the overall reliability of the model.[19]
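The mechanical problem with targeting a high R-squared can be sketched with synthetic data: adding a pure-noise regressor can only increase (never decrease) the conventional R-squared. The generic OLS helper below is written from scratch for this illustration, and all data are made up.

```python
import random

def ols_r2(y, X):
    # Fit y on the columns of X (plus an intercept) by solving the normal
    # equations with Gaussian elimination; return the R-squared.
    n, k = len(y), len(X[0])
    A = [[1.0] + row for row in X]  # prepend intercept column
    p = k + 1
    XtX = [[sum(A[i][a] * A[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    Xty = [sum(A[i][a] * y[i] for i in range(n)) for a in range(p)]
    for c in range(p):  # elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(XtX[r][c]))
        XtX[c], XtX[piv] = XtX[piv], XtX[c]
        Xty[c], Xty[piv] = Xty[piv], Xty[c]
        for r in range(c + 1, p):
            f = XtX[r][c] / XtX[c][c]
            for cc in range(c, p):
                XtX[r][cc] -= f * XtX[c][cc]
            Xty[r] -= f * Xty[c]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):  # back substitution
        beta[r] = (Xty[r] - sum(XtX[r][c] * beta[c] for c in range(r + 1, p))) / XtX[r][r]
    yhat = [sum(A[i][j] * beta[j] for j in range(p)) for i in range(n)]
    ybar = sum(y) / n
    ss_res = sum((y[i] - yhat[i]) ** 2 for i in range(n))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def adjusted(r2, n, k):
    # Adjusted R-squared: applies a penalty for each additional regressor.
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

random.seed(0)
n = 100
x1 = [random.gauss(0, 1) for _ in range(n)]
junk = [random.gauss(0, 1) for _ in range(n)]  # pure noise, unrelated to y
y = [2 * x1[i] + random.gauss(0, 1) for i in range(n)]

r2_small = ols_r2(y, [[x1[i]] for i in range(n)])
r2_big = ols_r2(y, [[x1[i], junk[i]] for i in range(n)])
print(round(r2_small, 3), round(r2_big, 3))  # R-squared never falls when a regressor is added
print(round(adjusted(r2_small, n, 1), 3), round(adjusted(r2_big, n, 2), 3))
```

The unadjusted R-squared of the larger model is at least as high as that of the smaller model even though the added variable is pure noise, which is precisely why a higher R-squared alone cannot justify one specification over another.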
—By William Choi, Pablo Florian and Stuart Miller, AlixPartners LLP
William Choi, Ph.D., is a managing director in AlixPartners’ San Francisco office. Pablo
Florian is a vice president in the firm's Chicago office. Stuart Miller is a vice president in the
firm's Dallas office.
The opinions expressed are those of the author(s) and do not necessarily reflect the views
of the firm, its clients, or Portfolio Media Inc., or any of its or their respective affiliates. This
article is for general information purposes and is not intended to be and should not be taken
as legal advice.
[1] The interpretation of R-squared only applies for results from the ordinary least squares
(OLS) regression, which is the most commonly used regression model, particularly in
litigation.
[2] Specifically, it is the percentage of the variation in the dependent variable that is
explained by the regression model.
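In code, that percentage is one minus the ratio of the residual sum of squares to the total sum of squares; a minimal computation with made-up numbers:

```python
y     = [3.0, 5.0, 7.0, 6.0, 9.0]   # observed values (hypothetical)
y_hat = [3.4, 4.8, 6.6, 6.8, 9.4]   # fitted values from some model (hypothetical)

ybar = sum(y) / len(y)
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # unexplained variation
ss_tot = sum((yi - ybar) ** 2 for yi in y)                 # total variation
r2 = 1 - ss_res / ss_tot
print(round(r2, 2))  # → 0.94
```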
[3] Some brief comments on the purpose of regression models in litigation may be
appropriate. Suppose that, for example, the analyst wishes to assess the impact of a legal
dispute on a firm’s sales or the prices it pays for an input, such as due to a breach of
contract or anti-competitive behavior. Unfortunately, there may have been other factors
affecting sales or prices, so that the analyst cannot simply compare prices/sales before,
during and (if appropriate) after the dispute. The challenge for the analyst is therefore to
disentangle the effects of the legal dispute from other factors which may have affected sales
or prices, and this is where regression models can help. In this case, the dependent (or
response) variable would be sales or prices, and the explanatory (or independent or
predictor) variables would be those which potentially may have affected sales/prices (such
as costs and demand) as well as the impact of the dispute.
[4] Mary P. Valentino v. United States Postal Service. 511 F.Supp. 917 (1981) and Brenda
S. Griffin, et al. v. Board of Regents of Regency Universities, et al. 795 F.2d 1281 (1986).
[5] John J. Calandra, Michael D. Hall, and Sandra B. Saunders (2013), “U.S. Supreme
Court Again Strikes Down a Regression Model Offered for Class Certification: Is More
Rigorous Scrutiny on the Way?”, Bloomberg BNA: Expert Evidence Report.
[6] We reviewed papers in the American Economic Review, the American Economic Journal:
Applied Economics, and American Economic Journal: Economic Policy. Given that the May issues of the American Economic Review contain the Papers and Proceedings of the Annual Meeting of the American Economic Association, we excluded these issues, because they tend to contain articles considering similar issues or topics. In total, our data reflect the content of 430 papers, of which 315 included a regression analysis. Including the May issues of the AER would have added 224 articles to the database.
[7] An R-squared less than zero is possible in certain types of estimation techniques that
are not ordinary least-squares regressions, as well as for R-squared statistics that adjust for
the number of explanatory variables in the model.
[8] These include R-squared statistics for those instances where the authors did not specify
a non-OLS estimation methodology and the specification of the regression is such that OLS
is likely to have been employed.
[15] The p-value is the probability of observing data at least as extreme as the available data if the null hypothesis were true, where the null hypothesis tends to be that there is no effect, i.e., that the true value for the coefficient in question is zero. For example, if our null hypothesis is that a cartel had no effect on prices and the alternative hypothesis is that it did, then a p-value of 3 percent for the coefficient meant to capture the cartel effect means that the likelihood of observing data at least as extreme as the available data, if the cartel had no effect, is 3 percent, i.e., very unlikely, and the coefficient is held to be statistically different from zero.
[18] Three specific tests are the likelihood ratio (LR) test, the Wald test, and the Lagrange multiplier (LM) test.