Multiple benefits
It identifies significant relationships between the dependent variable and
the independent variables.
It indicates the strength of the impact of multiple independent variables on
a dependent variable.
Regression analysis also allows us to compare the effects of variables
measured on different scales, such as the effect of price changes and
the number of promotional activities. These benefits help market
researchers, data analysts, and data scientists evaluate candidate
variables and eliminate weak ones, leaving the best set of variables for
building predictive models.
Important Points:
There must be a linear relationship between the independent and dependent
variables.
Multiple regression can suffer from multicollinearity, autocorrelation, and
heteroskedasticity.
Linear regression is very sensitive to outliers. They can badly distort the
regression line and, in turn, the forecasted values.
Multicollinearity can increase the variance of the coefficient estimates
and make the estimates very sensitive to minor changes in the model.
The result is that the coefficient estimates are unstable.
With multiple independent variables, we can use forward selection,
backward elimination, or a stepwise approach to select the most
significant independent variables.
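The effect of multicollinearity on coefficient variance is often summarized by the variance inflation factor, VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others. The sketch below covers only the two-predictor case, where that R^2 is the squared correlation between the two predictors. The data and variable names are hypothetical, and the VIF > 10 cutoff mentioned in the comment is a common rule of thumb, not a claim from the slides.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors, R^2 is the squared correlation."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical predictors: promo_a nearly duplicates price, promo_b is unrelated to it.
price = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
promo_a = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
promo_b = [3.0, 1.0, 2.0, 2.0, 1.0, 3.0]

vif_collinear = vif_two_predictors(price, promo_a)  # large; above the rule-of-thumb cutoff of 10
vif_unrelated = vif_two_predictors(price, promo_b)  # close to 1, meaning no inflation
```

A large VIF for a predictor suggests that it carries little information beyond the other predictors, which is one reason to drop it during variable selection.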
Important Points:
Widely used for classification problems.
Logistic regression can handle various types of relationships because it
applies a non-linear log transformation (the logit) to the predicted odds.
To avoid overfitting and underfitting, we should include all
significant variables. A good way to ensure this is to use
a stepwise method to estimate the logistic regression.
It requires large sample sizes because maximum likelihood estimates
are less powerful at small sample sizes than ordinary least squares estimates.
Modified minimum chi-square is an alternative estimation method with a long
and rich history.
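The log transformation of the odds can be made concrete with a small sketch. The functions below are illustrative helpers, not part of any package named in these slides, and the intercept and slope values in the usage note are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, intercept, slope):
    """Logistic regression with one predictor: the log-odds are linear in x."""
    return sigmoid(intercept + slope * x)
```

For example, with a hypothetical intercept of -1.0 and slope of 0.5, predict_probability(2.0, -1.0, 0.5) gives a log-odds of zero, which corresponds to a probability of one half.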
$y = a + bx + cx^2 + \cdots$
Best fit is quadratic, cubic, quartic,...
Important Point:
There is often a temptation to fit a higher-degree polynomial to get a lower
error, but this can result in overfitting. Always plot the relationships to
see the fit, and make sure that the curve fits the nature of
the problem.
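The overfitting risk can be illustrated by fitting a straight line and a degree-5 polynomial to the same six points and comparing errors on held-out points. This is a minimal sketch with hypothetical data (a linear trend plus small fixed noise). The least-squares fit uses the normal equations with a hand-written solver, which is adequate for a tiny example but is not how a statistical package would do it.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest power first)."""
    k = degree + 1
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def polyval(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

def mse(coeffs, xs, ys):
    """Mean squared prediction error of the polynomial on (xs, ys)."""
    return sum((polyval(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: the true relationship is y = 1 + 2x plus small fixed noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.3, 2.6, 5.2, 6.7, 9.4, 10.8]
test_x = [6.0, 7.0]   # held-out points on the noise-free line
test_y = [13.0, 15.0]

linear = polyfit(train_x, train_y, 1)
quintic = polyfit(train_x, train_y, 5)

train_mse_linear = mse(linear, train_x, train_y)
train_mse_quintic = mse(quintic, train_x, train_y)
test_mse_linear = mse(linear, test_x, test_y)
test_mse_quintic = mse(quintic, test_x, test_y)
```

The degree-5 curve passes through all six training points, so its training error is lower than the line's, yet it chases the noise and predicts the held-out points far worse.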
Important Points:
The assumptions of this regression are the same as those of least squares
regression, except that normality is not assumed.
It can shrink coefficients to exactly zero, which helps in
feature selection.
If a group of predictors is highly correlated, Lasso tends to pick only one of
them and shrinks the others to zero.
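The exact zeros come from the soft-thresholding operation at the heart of the Lasso. As a sketch under a simplifying assumption: if the predictors are orthonormal and the objective is half the residual sum of squares plus lam times the sum of absolute coefficients, each Lasso coefficient is the soft-thresholded least squares coefficient. The coefficient values and the penalty lam below are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; any z with |z| <= lam becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical least squares coefficients from an orthonormal design.
ols_coefs = [2.0, -1.5, 0.3, -0.2]
lam = 0.5  # hypothetical penalty weight

lasso_coefs = [soft_threshold(b, lam) for b in ols_coefs]
```

The two small coefficients are set to exactly zero, while the two large ones are pulled toward zero by lam. This is how the Lasso performs feature selection.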
I. The Model
The model and the data are the starting points of an econometric
project.
The first step in formulating a model is to select a topic of interest
and to consider the model’s scope and purpose.
State and understand objectives of the study, what boundaries to
place on the topic, what hypotheses might be tested, what variables
might be predicted, and what policies might be evaluated.
Close attention must be paid, however, to the availability of adequate
data. In particular the model must involve causal relations among
measurable variables.
The topic might be a particular market (the market for Pitzer graduates, the market for
economists, the market for ice cream, the markets for private education), a
process (economic development, inflation, unemployment), demographic
phenomena (birth rates, death rates), environmental phenomena (water
quality, air quality), political phenomena (elections, voting behavior of
legislatures), some combination of these, or some other topic.
"Air Pollution and Population"
"Birth Rates, Death Rates, and Economic Growth in Developing
Economies"
"Demand for and Supply of Higher Education"
"Differential Growth in Indian Cities"
After both the model and data have been developed, the next step is
to utilize econometric techniques to estimate the model.
We can use STATA 14 or any other statistical package for the
statistical analysis. Basic statistics packages include Minitab and
Excel. For careful work in econometrics we will want to use EViews,
STATA, SAS, TSP, LimDep, SPSS or Shazam.
Make sure that we have enough observations for all the variables and
that the dependent and independent variables show some variation
over the observations.
Define and discuss the specification of the selected model. What variables
are included in the model? Explain why we chose those variables and the
role they play in the model. Have we included all the important variables
in the model? What are the expected signs of all the coefficients?
V. Data Description
Provide complete description of all the data, their sources, refinements
used, and their possible biases or other possible weaknesses.
Present the estimates of the model and its related statistics, such as
standard errors, t statistics, and the $R^2$. Discuss which coefficients are
significant at the 5% and 1% levels. If relevant, discuss possible
serial correlation, heteroscedasticity, and multicollinearity, along with
the correction applied for each. Estimate alternative models to test the
robustness of the results.
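The statistics named above can be computed by hand for a regression with one predictor. This is a minimal sketch with hypothetical data. In practice a package such as STATA reports these quantities directly, and a multiple regression needs matrix formulas that are not shown here.

```python
import math

def simple_ols(x, y):
    """Fit y = intercept + slope * x and return the usual summary statistics."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sigma2 = sse / (n - 2)  # residual variance with n - 2 degrees of freedom
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + x_bar ** 2 / sxx))
    return {
        "slope": slope,
        "intercept": intercept,
        "se_slope": se_slope,
        "se_intercept": se_intercept,
        "t_slope": slope / se_slope,
        "t_intercept": intercept / se_intercept,
        "r_squared": 1.0 - sse / sst,
    }

# Hypothetical observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_ols(x, y)
```

Each t statistic is then compared with the critical value of a t distribution with n - 2 degrees of freedom to judge significance at the 5% or 1% level.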
Discuss the signs and magnitudes of the estimated coefficients and their
comparisons to predicted or theoretical signs and magnitudes. What have
we learned? Consider how the model might be reformulated in future
studies, and implications for future econometric research.
IX. Bibliography
Include complete citations of all items referred to in the paper.
X. Data
If reasonable, provide a table of all the data used. At a minimum, provide
the summary statistics for the data.
Each of these three methods has various tools and techniques
that fall under it, and each is appropriate in different kinds of
circumstances.
Causal methods typically involve regression analysis, including the
specialized types of regression analysis that are useful in particular
circumstances.
Time series methods often involve various forms of trend
analysis, such as exponential smoothing and trend prediction.
Qualitative methods involve using surveys and other subjective, ad hoc
methods of gathering data in order to make predictions. In causal
forecasting we rely on relationships between variables.
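As an illustration of a time series method, simple exponential smoothing updates a running level as a weighted average of the newest observation and the previous level. The sketch below assumes the first smoothed value is set to the first observation, which is one common convention among several. The series and the smoothing weight alpha are hypothetical.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s_t = alpha * y_t + (1 - alpha) * s_(t-1)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # assumption: initialize with the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1.0 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical demand series.
demand = [10.0, 12.0, 14.0]
smoothed = exponential_smoothing(demand, alpha=0.5)
forecast = smoothed[-1]  # the one-step-ahead forecast is the last smoothed level
```

A larger alpha makes the forecast react faster to recent observations, while a smaller alpha gives a smoother and slower-moving forecast.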
A Case Study
Krishna Rama-Murthy
Master’s Thesis, Virginia Polytechnic Institute & State University, 2006
Based on the figure above, the fare model for the Unrestricted
Coach Class (Y) behaves similarly to the Business Class fare model.
Hence, Unrestricted Coach Class fares were combined with Business
Class fares for the analysis.
Bimal Sinha (UMBC) Regression Analysis December , 2017 51 / 68
The final clusters of fare classes used to develop the
models are given below:
a. Business Class - Unrestricted First Class (F), Restricted First Class (G),
Unrestricted Business Class (C), Restricted Business Class (D), and
Unrestricted Coach Class (Y)
b. Non-Business Class - Restricted Coach Class (X)
The fare flow models were then tested for dissimilarity using a nonparametric
statistical test. The Wilcoxon
Rank Sum Test is a nonparametric alternative to the two-sample t-test
that is based solely on the order in which the observations from the two
samples fall. The results of the Wilcoxon Rank Sum Test performed on the
“fare flow models” indicate that the models are dissimilar and
independent of each other. The p-values imply that the differences between
the models are statistically significant.
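The rank-based logic of the Wilcoxon Rank Sum Test can be sketched as follows. This is not the thesis's computation: it uses the large-sample normal approximation without a tie or continuity correction, and the two fare samples are hypothetical.

```python
import math

def midranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def rank_sum_test(sample_a, sample_b):
    """Return (W, z, two-sided p) using the normal approximation for W."""
    n1, n2 = len(sample_a), len(sample_b)
    ranks = midranks(list(sample_a) + list(sample_b))
    w = sum(ranks[:n1])  # sum of the ranks belonging to sample_a
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    var_w = n1 * n2 * (n1 + n2 + 1) / 12.0
    z = (w - mean_w) / math.sqrt(var_w)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return w, z, p_value

# Hypothetical fares for two fare-class groups.
business = [210, 220, 230, 240, 250, 260, 270, 280]
coach = [110, 120, 130, 140, 150, 160, 170, 180]
w, z, p_value = rank_sum_test(business, coach)
```

Only the ordering of the pooled observations enters the statistic, so the test does not depend on the fares being normally distributed.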
Parameter Estimate for Coach Fare Class Regression, Average Coach Fare
where
$fb_{ij}$: annual average round-trip fare for business class between i and j
$d_{ij}$: round-trip distance in statute miles between i and j
$h_i$: Herfindahl Index at the origin airport i
$h_j$: Herfindahl Index at the destination airport j
$pc_{ij}$: annual coach class passenger flows between i and j
$o_i$: origin airport type (i) [1, 2, 3, or 4]
$d_j$: destination airport type (j) [1, 2, 3, or 4]
$\beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5, \beta_6$: model parameters to be estimated
$e_{ij}$: residual
Results