
Introduction

Linear regression is a basic and commonly used type of predictive analysis. The
overall idea of regression is to examine two things: (1) does a set of predictor
variables do a good job of predicting an outcome (dependent) variable? (2) Which
variables in particular are significant predictors of the outcome variable, and in
what way, indicated by the magnitude and sign of the beta estimates, do they
impact the outcome variable? These regression estimates are used to explain the
relationship between one dependent variable and one or more independent
variables. The simplest form of the regression equation, with one dependent and
one independent variable, is y = c + b*x, where y is the estimated dependent
variable score, c is the constant, b is the regression coefficient, and x is the score
on the independent variable.
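
For concreteness, here is a minimal sketch of fitting this equation by ordinary least squares in Python. The data values are invented for illustration, and NumPy's polyfit is just one of several ways to obtain c and b.

import numpy as np

# Made-up scores for the independent (x) and dependent (y) variables.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# np.polyfit with deg=1 returns the slope first, then the constant term,
# so this recovers b and c in y = c + b*x.
b, c = np.polyfit(x, y, deg=1)

y_hat = c + b * x  # estimated dependent variable scores
print(f"c (constant) = {c:.3f}, b (regression coefficient) = {b:.3f}")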

There are many names for a regression’s dependent variable. It may be called an
outcome variable, criterion variable, endogenous variable, or regressand. The
independent variables can be called exogenous variables, predictor variables, or
regressors.

Three major uses for regression analysis are (1) determining the strength of
predictors, (2) forecasting an effect, and (3) trend forecasting.
First, the regression might be used to identify the strength of the effect that the
independent variable(s) have on a dependent variable. Typical questions are: what
is the strength of the relationship between dose and effect, between sales and
marketing spending, or between age and income?

Second, it can be used to forecast effects or impact of changes. That is, the
regression analysis helps us to understand how much the dependent variable
changes with a change in one or more independent variables. A typical question is,
“how much additional sales income do I get for each additional $1000 spent on
marketing?”
Third, regression analysis predicts trends and future values. The regression
analysis can be used to get point estimates. A typical question is, “what will the
price of gold be in 6 months?”
Simple Linear Regression

The simple linear regression model for n observations can be written as

yi = β0 + β1xi + εi,   i = 1, 2, ..., n.

The designation simple indicates that there is only one x to predict the response
y, and linear means that the model is linear in β0 and β1. [Actually, it is the
assumption E(yi) = β0 + β1xi that is linear; see assumption 1 below.] For
example, a model such as yi = β0 + β1xi² + εi is linear in β0 and β1, whereas the
model yi = β0 + e^(β1xi) + εi is not linear.

Here we assume that yi and εi are random variables and that the values of xi are
known constants, which means that the same values x1, x2, ..., xn would be used
in repeated sampling. The case in which the x variables are random variables is
treated in Chapter 10.
To complete the model, we make the following additional assumptions:

1. E(εi) = 0 for all i = 1, 2, ..., n, or, equivalently, E(yi) = β0 + β1xi.

2. var(εi) = σ² for all i = 1, 2, ..., n, or, equivalently, var(yi) = σ².

3. cov(εi, εj) = 0 for all i ≠ j, or, equivalently, cov(yi, yj) = 0.

Assumption 1 states that the model is correct, implying that yi depends only
on xi and that all other variation in yi is random. Assumption 2 asserts that the
variance of εi or yi does not depend on the values of xi. (Assumption 2 is also
known as the assumption of homoscedasticity, homogeneous variance, or constant
variance.) Under assumption 3, the ε variables (or the y variables) are uncorrelated
with each other. In Section 6.3, we will add a normality assumption, and the y (or
the ε) variables will thereby be independent as well as uncorrelated. Each
assumption has been stated in terms of the ε's or the y's. For example, if var(εi) =
σ², then var(yi) = E[yi − E(yi)]² = E(yi − β0 − β1xi)² = E(εi²) = σ².

Any of these assumptions may fail to hold with real data. A plot of the data will
often reveal departures from assumptions 1 and 2 (and to a lesser extent
assumption 3).
The least squares estimates developed below do not require any distributional
assumptions (for maximum likelihood estimators based on normality, see Section 7.6.2).
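
The remark above, that a plot of the data will often reveal departures from assumptions 1 and 2, can be illustrated with a short sketch. The data below are simulated (all values are invented) so that the assumptions hold; in practice, a curved pattern in the residual plot would suggest a violated assumption 1, and a funnel shape would suggest non-constant variance.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, size=n)       # the x values are treated as known constants
eps = rng.normal(0.0, 1.0, size=n)   # E(eps) = 0, constant variance, uncorrelated
y = 2.0 + 0.5 * x + eps              # assumption 1: E(y) = b0 + b1*x

b1, b0 = np.polyfit(x, y, 1)         # fitted slope and intercept
residuals = y - (b0 + b1 * x)

plt.scatter(x, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.title("Residuals vs x")
plt.savefig("residuals_vs_x.png")    # under assumptions 1-2 the points scatter
                                     # evenly around zero, with no trend or funnel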
The simple linear regression model

We consider modeling the relationship between the dependent variable and one
independent variable. When there is only one independent variable in the linear
regression model, the model is generally termed a simple linear regression model.
When there is more than one independent variable in the model, the linear model
is termed a multiple linear regression model.

Linear regression is a linear approach to modeling the relationship between a scalar
response (or dependent variable) and one or more explanatory
variables (or independent variables). The case of one explanatory variable is
called simple linear regression. For more than one explanatory variable, the
process is called multiple linear regression.[1] This term is distinct
from multivariate linear regression, where multiple correlated dependent variables
are predicted, rather than a single scalar variable.[2]
In linear regression, the relationships are modeled using linear predictor
functions whose unknown model parameters are estimated from the data. Such
models are called linear models.[3] Most commonly, the conditional mean of the
response given the values of the explanatory variables (or predictors) is assumed to
be an affine function of those values; less commonly, the conditional median or
some other quantile is used. Like all forms of regression analysis, linear regression
focuses on the conditional probability distribution of the response given the values
of the predictors, rather than on the joint probability distribution of all of these
variables, which is the domain of multivariate analysis.
Linear regression was the first type of regression analysis to be studied rigorously,
and to be used extensively in practical applications.[4] This is because models
which depend linearly on their unknown parameters are easier to fit than models
which are non-linearly related to their parameters and because the statistical
properties of the resulting estimators are easier to determine.
Linear regression has many practical uses. Most applications fall into one of the
following two broad categories:
• If the goal is prediction, forecasting, or error reduction, linear regression can
be used to fit a predictive model to an observed data set of values of the
response and explanatory variables. After developing such a model, if
additional values of the explanatory variables are collected without an
accompanying response value, the fitted model can be used to make a
prediction of the response (a small sketch follows this list).
• If the goal is to explain variation in the response variable that can be attributed
to variation in the explanatory variables, linear regression analysis can be
applied to quantify the strength of the relationship between the response and the
explanatory variables, and in particular to determine whether some explanatory
variables may have no linear relationship with the response at all, or to identify
which subsets of explanatory variables may contain redundant information
about the response.
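
As a small sketch of the prediction use case in the first bullet (all numbers are invented), a line is fitted to the observed (x, y) pairs and then applied to new explanatory values that arrive without a response:

import numpy as np

# Observed data set of response and explanatory values (illustrative only).
x_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_obs = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
b, c = np.polyfit(x_obs, y_obs, deg=1)   # slope b and constant c

# New explanatory values collected without an accompanying response.
x_new = np.array([6.0, 7.5])
y_pred = c + b * x_new                   # predicted responses
print(y_pred)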

Linear regression models are often fitted using the least squares approach, but they
may also be fitted in other ways, such as by minimizing the "lack of fit" in some
other norm (as with least absolute deviations regression), or by minimizing a
penalized version of the least squares cost function as in ridge regression (L2-norm
penalty) and lasso (L1-norm penalty). Conversely, the least squares approach can
be used to fit models that are not linear models. Thus, although the terms "least
squares" and "linear model" are closely linked, they are not synonymous.
Interpretation

A fitted linear regression model can be used to identify the relationship between a
single predictor variable xj and the response variable y when all the other predictor
variables in the model are "held fixed". Specifically, the interpretation of βj is
the expected change in y for a one-unit change in xj when the other covariates are
held fixed—that is, the expected value of the partial derivative of y with respect
to xj. This is sometimes called the unique effect of xj on y. In contrast,
the marginal effect of xj on y can be assessed using a correlation
coefficient or simple linear regression model relating only xj to y; this effect is
the total derivative of y with respect to xj.
Care must be taken when interpreting regression results, as some of the regressors
may not allow for marginal changes (such as dummy variables, or the intercept
term), while others cannot be held fixed (for example, in a model that includes
both t and t² as regressors, it would be impossible to "hold t fixed" and at the same
time change the value of t²).
It is possible that the unique effect can be nearly zero even when the marginal
effect is large. This may imply that some other covariate captures all the
information in xj, so that once that variable is in the model, there is no contribution
of xj to the variation in y. Conversely, the unique effect of xj can be large while its
marginal effect is nearly zero. This would happen if the other covariates explained
a great deal of the variation of y, but they mainly explain variation in a way that is
complementary to what is captured by xj. In this case, including the other variables
in the model reduces the part of the variability of y that is unrelated to xj, thereby
strengthening the apparent relationship with xj.

The data sets in Anscombe's quartet are designed to have approximately the same
linear regression line (as well as nearly identical means, standard deviations, and
correlations) but are graphically very different. This illustrates the pitfalls of relying solely
on a fitted model to understand the relationship between variables.
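
Returning to the unique versus marginal effect distinction above, a short simulation (with invented data) shows how the simple-regression slope of y on x1 can be large while the multiple-regression coefficient of x1 is near zero once a correlated covariate is included.

import numpy as np

rng = np.random.default_rng(2)
n = 1000
x2 = rng.normal(size=n)
x1 = x2 + 0.1 * rng.normal(size=n)             # x1 is nearly a copy of x2
y = 2.0 * x2 + rng.normal(scale=0.5, size=n)   # y really depends only on x2

# Marginal effect: slope from a simple regression of y on x1 alone.
slope_marginal = np.polyfit(x1, y, 1)[0]

# Unique effect: coefficient of x1 when x2 is also in the model.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"marginal slope of x1:     {slope_marginal:.2f}")  # close to 2
print(f"unique coefficient of x1: {coef[1]:.2f}")         # close to 0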


The meaning of the expression "held fixed" may depend on how the values of the
predictor variables arise. If the experimenter directly sets the values of the
predictor variables according to a study design, the comparisons of interest may
literally correspond to comparisons among units whose predictor variables have
been "held fixed" by the experimenter. Alternatively, the expression "held fixed"
can refer to a selection that takes place in the context of data analysis. In this case,
we "hold a variable fixed" by restricting our attention to the subsets of the data that
happen to have a common value for the given predictor variable. This is the only
interpretation of "held fixed" that can be used in an observational study.
The notion of a "unique effect" is appealing when studying a complex system
where multiple interrelated components influence the response variable. In some
cases, it can literally be interpreted as the causal effect of an intervention that is
linked to the value of a predictor variable. However, it has been argued that in
many cases multiple regression analysis fails to clarify the relationships between
the predictor variables and the response variable when the predictors are correlated
with each other and are not assigned following a study design. Commonality
analysis may be helpful in disentangling the shared and unique impacts of
correlated independent variables.

Multiple linear regression

The very simplest case of a single scalar predictor variable x and a single scalar
response variable y is known as simple linear regression. The extension to multiple
and/or vector-valued predictor variables (denoted with a capital X) is known
as multiple linear regression, also known as multivariable linear regression. Nearly
all real-world regression models involve multiple predictors, and basic descriptions
of linear regression are often phrased in terms of the multiple regression model.
Note, however, that in these cases the response variable y is still a scalar. Another
term, multivariate linear regression, refers to cases where y is a vector, i.e., the
same as general linear regression.
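
A minimal sketch of this setup, with invented data: the predictors are collected into a matrix X (an intercept column plus one column per predictor), the response y remains a single scalar per observation, and the coefficients are obtained by least squares.

import numpy as np

rng = np.random.default_rng(3)
n = 50
predictors = rng.normal(size=(n, 3))              # three explanatory variables
X = np.column_stack([np.ones(n), predictors])     # add an intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n) # y is still a scalar response

# Least squares fit; solves the same problem as the normal equations (X'X)b = X'y.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))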

Estimation of the parameters by least squares


Let ŷi = β̂0 + β̂1xi be the prediction for Y based on the ith value of X. Then
ei = yi − ŷi represents the ith residual.

• We define the residual sum of squares (RSS) as RSS = e1² + e2² + ··· + en², or
equivalently as RSS = (y1 − β̂0 − β̂1x1)² + (y2 − β̂0 − β̂1x2)² + ... + (yn − β̂0 − β̂1xn)².

• The least squares approach chooses β̂0 and β̂1 to minimize the RSS. The
minimizing values can be shown to be

β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,   β̂0 = ȳ − β̂1 x̄,

where the sums run over i = 1, ..., n, and ȳ = (1/n) Σ yi and x̄ = (1/n) Σ xi are the
sample means.
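
The closed-form expressions above translate directly into code; the data values below are illustrative only.

import numpy as np

# Illustrative observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.0])

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# RSS is the quantity these estimates minimize.
residuals = y - (beta0_hat + beta1_hat * x)
rss = np.sum(residuals ** 2)
print(f"beta0_hat = {beta0_hat:.3f}, beta1_hat = {beta1_hat:.3f}, RSS = {rss:.3f}")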

INFERENCE OF LINEAR REGRESSION


Linear regression attempts to model the relationship between two variables by
fitting a linear equation to observed data. Every value of the independent
variable x is associated with a value of the dependent variable y. The variable y is
assumed to be normally distributed with mean μy and variance σ². The least-
squares regression line y = b0 + b1x is an estimate of the true population regression
line, μy = β0 + β1x. This line describes how the mean response μy changes with x.
The observed values for y vary about their means μy and are assumed to have the
same standard deviation σ. The fitted values b0 and b1 estimate the true intercept
and slope of the population regression line.

Since the observed values for y vary about their means μy, the statistical model
includes a term for this variation. In words, the model is expressed as DATA = FIT +
RESIDUAL, where the "FIT" term represents the expression β0 + β1x. The
"RESIDUAL" term represents the deviations of the observed values y from their
means μy, which are normally distributed with mean 0 and variance σ². The
notation for the model deviations is ε.

In formal terms, the model for linear regression is the following:
given n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn), the observed response
is yi = β0 + β1xi + εi.

In the least-squares model, the best-fitting line for the observed data is calculated by
minimizing the sum of the squares of the vertical deviations from each data point to
the line (if a point lies on the fitted line exactly, then its vertical deviation is 0).
Because the deviations are first squared, then summed, there are no cancellations
between positive and negative values. The least-squares estimates b0 and b1 are
usually computed by statistical software. They are expressed by the following
equations:

b1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,   b0 = ȳ − b1 x̄.

The computed values for b0 and b1 are unbiased estimators of β0 and β1, and are
normally distributed with standard deviations that may be estimated from the data.

The values fit by the equation b0 + b1xi are denoted ŷi, and the residuals ei are equal
to yi − ŷi, the difference between the observed and fitted values. The sum of the
residuals is equal to zero.

The variance σ² may be estimated by s² = Σ ei² / (n − 2), also known as the mean-squared
error (or MSE).
The estimate of the standard error s is the square root of the MSE.
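
Continuing with invented numbers, the MSE and the usual standard-error formulas for the slope and intercept (s/√Σ(xi − x̄)² and s·√(1/n + x̄²/Σ(xi − x̄)²), standard results not derived in this text) can be computed as follows.

import numpy as np

# Illustrative observations (same invented values as the earlier sketch).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.0])
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / sxx
b0 = y_bar - b1 * x_bar

residuals = y - (b0 + b1 * x)
s2 = np.sum(residuals ** 2) / (n - 2)   # mean-squared error, estimates sigma^2
s = np.sqrt(s2)                         # estimate of the standard error

se_b1 = s / np.sqrt(sxx)                          # estimated std. dev. of the slope
se_b0 = s * np.sqrt(1.0 / n + x_bar ** 2 / sxx)   # estimated std. dev. of the intercept
print(f"s = {s:.3f}, SE(b0) = {se_b0:.3f}, SE(b1) = {se_b1:.3f}")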

Example

The dataset "Healthy Breakfast" contains, among other variables, the Consumer
Reports ratings of 77 cereals and the number of grams of sugar contained in each
serving. (Data source: Free publication available in many grocery stores. Dataset
available through the Statlib Data and Story Library (DASL).) The correlation between the two
variables is -0.760, indicating a strong negative association. A scatterplot of the two
variables (not reproduced here) indicates a linear relationship.

Using the
MINITAB "REGRESS" command with "sugar" as an explanatory variable and
"rating" as the dependent variable gives the following result:
Regression Analysis

The regression equation is


Rating = 59.3 - 2.40 Sugars
A plot of the data with the regression line added (not reproduced here) shows the fitted line following the downward trend of the points.

After fitting the regression line, it is important to investigate the residuals to
determine whether or not they appear to fit the assumption of a normal distribution.
A plot of the residuals y − ŷ on the vertical axis against the corresponding explanatory
values on the horizontal axis (not reproduced here) was examined. The residuals do
not seem to deviate from a random sample from a normal distribution in any
systematic manner, so we may retain the assumption of normality.

The MINITAB output provides a great deal of information. Under the equation for the
regression line, the output provides the least-squares estimate for the constant b0 and
the slope b1. Since b1 is the coefficient of the explanatory variable "Sugars," it is listed
under that name. The calculated standard deviations for the intercept and slope are
provided in the second column.
Predictor      Coef     StDev        T       P
Constant     59.284     1.948    30.43   0.000
Sugars      -2.4008    0.2373   -10.12   0.000

S = 9.196   R-Sq = 57.7%   R-Sq(adj) = 57.1%
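
For readers working in Python rather than MINITAB, a comparable table can be produced with statsmodels. The file name "cereals.csv" and the column names "Sugars" and "Rating" below are assumptions about how the DASL data might be stored, not part of the original example.

import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names for the DASL cereal data.
cereals = pd.read_csv("cereals.csv")
X = sm.add_constant(cereals["Sugars"])     # adds the intercept term
model = sm.OLS(cereals["Rating"], X).fit()

# The summary reports the coefficients, their standard errors, t statistics,
# p-values, and R-squared, analogous to the MINITAB table above.
print(model.summary())
print(model.mse_resid ** 0.5)              # comparable to MINITAB's S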

THANK YOU
