
2 Bivariate Regression Model

Bivariate regression analysis is used to analyse the cause and effect relationship between two variables under study. In other words, the purpose of regression analysis is to study the dependence of one variable (the dependent variable) on another variable (the independent variable) and to estimate the expected values of the dependent variable with the help of known values of the independent variable.
In regression analysis the researcher is interested in the dependence among variables. The dependent variable is considered a stochastic variable, whereas the independent variables are deterministic in nature. For example, the dependent variable crop production is stochastic in nature but may depend upon temperature, rainfall, sunshine and fertilizer.
Regression analysis deals with the dependence of one variable on other variables, but it does not necessarily imply causation. The logic of the cause and effect relationship should come from some theory or other.
The correlation between two variables indicates the strength of the linear association between them; in regression analysis, however, we try to estimate or predict the average value of one variable on the basis of the fixed values of other variables.
In a bivariate regression model the number of independent variables is one, whereas in a multiple regression model the number of independent variables is more than one.
The Bivariate regression model can be expressed as:

Yᵢ = β₀ + β₁Xᵢ + uᵢ

where Y is the dependent variable, X is the independent variable, β₀ is the intercept (the hypothetical value of Y when X = 0), β₁ is the slope coefficient (the rate of change of Y with respect to a change in X), and uᵢ is the error term.
Method of Least Squares


For a regression model with one dependent stochastic variable Y and one independent deterministic variable X, the unknown regression coefficients (β₀, β₁) are estimated in such a way that the sum of squared differences between the predicted values of Y (lying on the straight line) and the actual values of Y is minimised, i.e.

Σ eᵢ² = Σ (Yᵢ − Ŷᵢ)²

should be minimum. This method is known as the method of least squares.
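As a minimal sketch of this estimation in R, using a small hypothetical data set (the names x and y and the values are purely illustrative, not from this chapter's example), lm() chooses the intercept and slope that minimise the residual sum of squares:

# Hypothetical data: y is assumed to depend linearly on x
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)
fit <- lm(y ~ x)        # least-squares estimates of the intercept and slope
coef(fit)               # estimated coefficients
sum(residuals(fit)^2)   # the minimised sum of squared residuals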

Assumptions of Regression Model


1. The regression model should be linear in the parameters. This means the power of the intercept and slope coefficients must be 1.

2. The independent variable X is a non-stochastic variable: its values are repeatable and deterministic in nature. The dependent variable (Y), however, depends upon X and is stochastic in nature.

3. The difference between the actual observation of Y and the predicted value of Y at the same observation of X is known as the error or residual. The residual terms in the regression model should follow a normal distribution with zero mean and constant variance.

4. Homoscedasticity, or equal variance of the error terms: the variance of the error terms should be similar across observations.

5. The error terms for different observations of X should not be correlated. If they are found to be correlated, the model suffers from the problem of autocorrelation. This problem can be detected with the help of the Durbin-Watson statistic.

6. The error terms should not be correlated with either the dependent variable (Y) or the independent variable (X).

7. The number of observations n must be greater than the number of parameters to be estimated.

8. The X values in a given sample must not all be the same. In other words, the variance of X should be a finite positive number.

9. The regression model should be correctly specified. In other words, there should be no specification bias or error in the model. This means that the selection of the dependent and independent variables in the regression model should be based on theoretical background.

10. If there is more than one independent variable in the regression model, these variables should be independent of each other. This means that the independent variables should not be highly correlated with each other. If the independent variables are found to be highly correlated, the model suffers from the problem of multicollinearity. Multicollinearity can be analysed with the help of the VIF, Tolerance, Condition Index and Eigenvalue statistics.
If all the above mentioned assumptions are fulfilled, then the estimated regression coefficients are known as BLUE (Best Linear Unbiased Estimators). This means:
1. The regression coefficients are linear functions of the dependent variable Y in the regression model.
2. They are unbiased, that is, their average or expected value is equal to the true value.
3. They have minimum variance.
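Several of these assumptions can be checked informally in R once a model has been fitted. The sketch below assumes a hypothetical fitted model object fit and data frame mydata; dwt() and vif() come from the car package:

library(car)                        # dwt() and vif()
fit <- lm(y ~ x, data = mydata)     # hypothetical bivariate model
hist(residuals(fit))                # normality of the error terms (assumption 3)
plot(fitted(fit), residuals(fit))   # homoscedasticity: residual spread should be roughly constant (assumption 4)
dwt(fit)                            # Durbin-Watson test for autocorrelation (assumption 5)
# vif() checks multicollinearity (assumption 10) but applies only to models with two or more predictors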
Illustration of Regression Analysis:
Regression analysis is largely concerned with estimating the expected (average) value of the dependent variable at given known values of X (the independent variable). Consider the data given in table 10.1. The data consist of two variables collected from 60 companies for a particular month. The dependent variable is Sales (Rs Crore) and the independent variable is Advertisement Expenditure (Rs Crore). The purpose of applying the regression model is to analyse the impact of advertisement expenditure on sales.
Table 10.1: Data for practice

Sales (Rs Crore)  Advertisement (Rs Crore)  Sales (Rs Crore)  Advertisement (Rs Crore)  Sales (Rs Crore)  Advertisement (Rs Crore)
86                5.5                       217               10.8                      425               13.5
90                6                         227               11.3                      414               13.7
91                6.5                       239               11.5                      424               14
93                7                         254               11.2                      434               15.2
97                7.5                       245               11.2                      456               15.7
108               6.5                       254               11                        478               16
120               7                         256               11.6                      498               16.2
130               7.4                       267               11.8                      509               13.7
134               8                         286               12.4                      518               14.4
142               8.5                       293               11                        524               15.5
146               8.8                       308               11.5                      528               16.5
152               7.9                       323               12                        529               17.5
156               8.4                       324               13                        554               18.9
189               9                         356               13.5                      589               15
176               9.4                       345               13.4                      645               15.2
190               9.8                       350               12                        656               17.5
201               8                         365               13.6                      675               18.5
210               9.3                       367               14                        690               19.1
214               9.5                       387               14.4                      702               19.3
219               10.3                      397               14.2                      780               21.2

The output of regression analysis and its interpretation are explained below:

Table 10.2: Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .962a   .926       .925                50.22180                     .906
a. Predictors: (Constant), Advertisement
b. Dependent Variable: Sales
Table 10.2 presents the summary of the regression model. The interpretation of the different statistics shown in the table is explained below:
(a) R: This indicates the coefficient of multiple correlation. In a bivariate regression model it represents the correlation between Y and X; in a multiple regression model the value of R represents the correlation between the predicted values of the dependent variable (Y) and the actual values of Y. The range of R values is -1 to +1. Higher values of R indicate that the predicted values of Y are closer to the actual values, which means the regression model is a good fit. In the example the value of R is 0.962, which is close to 1. This indicates that the predicted values of Sales are close to the actual sales values, and hence that the regression model is a good fit.
(b) R Square: R Square is the square of the value of R and is known as the coefficient of determination. The value of R Square indicates the percentage of variance of the dependent variable which can be explained with the help of variations in the independent variable. In the example the R Square value of 0.926 indicates that 92.6 percent of the variation in the behaviour of Sales can be explained with the help of variations in Advertisement. This means that the explained variance of sales is 92.6 percent and the rest of the variance in sales is unexplained.
(c) Adjusted R Square: This statistic is used to compare different models. The issue is that when we add a new variable to the regression model, it brings some information to explain the dependent variable, because of which the R Square value always increases. The problem with adding a new variable, however, is that the degrees of freedom of the regression model decrease. The R Square value should therefore be adjusted for the degrees of freedom in order to analyse the real contribution of a new variable to the model. Hence, if the Adjusted R Square increases after the addition of a new variable to the regression model, it means that the variable brings significant information to the regression model, and the model with the new variable is a better fit. On the other side, if the Adjusted R Square decreases after the addition of a new variable, it means that the variable is not useful for the regression model.
(d) Durbin-Watson: One of the assumptions of the regression model is that the error terms should not be correlated; violation of this assumption is known as the autocorrelation problem. The Durbin-Watson statistic measures the presence of autocorrelation in the regression model. The ideal value of the Durbin-Watson statistic is 2. A DW value lower than 2 indicates the presence of positive autocorrelation, and a value of more than 2 indicates the presence of negative autocorrelation. In the example the low value of the Durbin-Watson statistic (0.906) indicates that the error terms suffer from positive autocorrelation.
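The statistics of Table 10.2 can be reproduced in R from a fitted model. A sketch, assuming the data of Table 10.1 are stored in a data frame called ad_data with columns Sales and Advertisement (hypothetical names), and using dwt() from the car package:

library(car)                                      # for dwt()
fit <- lm(Sales ~ Advertisement, data = ad_data)  # bivariate regression of Sales on Advertisement
s <- summary(fit)
sqrt(s$r.squared)     # R, the correlation between the fitted and actual values of Sales
s$r.squared           # R Square (coefficient of determination)
s$adj.r.squared       # Adjusted R Square
s$sigma               # Std. Error of the Estimate
dwt(fit)              # Durbin-Watson statistic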
Table 10.3: ANOVA(a)

Model              Sum of Squares   df   Mean Square    F         Sig.
1   Regression     1833721.637      1    1833721.637    727.024   .000b
    Residual       146289.296       58   2522.229
    Total          1980010.933      59
a. Dependent Variable: Sales
b. Predictors: (Constant), Advertisement

In order to explain the results shown in table 10.3, we have to understand the types of deviation in a regression model. At every observation of X there are three types of deviation:
(a) Total Deviation: The difference between the actual value of Y and the grand mean of Y (the grand mean is the single value representing the mean of all actual Y values) is known as the total deviation. The sum of squares of the total deviations over all observations of X is known as the Total Sum of Squares (TSS):

TSS = Σ (Yᵢ − Ȳ)²

The degrees of freedom of TSS are n − 1.

(b) Deviation between Actual and Predicted Y (Residual): The difference between the actual value of Y and the predicted value of Y is the error, or residual. The sum of squares of these error terms is known as the Residual Sum of Squares (RSS):

RSS = Σ (Yᵢ − Ŷᵢ)² = Σ eᵢ²

The degrees of freedom of RSS are n − k, where k is the number of estimated parameters.

(c) Deviation between Predicted Y and the Grand Mean (Explained Deviation): The difference between the predicted value of Y and the grand mean of Y is the explained deviation. The sum of squares of the explained deviations over all observations of X is known as the Explained (Estimated) Sum of Squares (ESS):

ESS = Σ (Ŷᵢ − Ȳ)²

The degrees of freedom of ESS are k − 1.

[Figure: for a given value of X, the total deviation of the actual Y from the grand mean Ȳ is split into the residual (actual Y minus predicted Y) and the explained deviation (predicted Y minus Ȳ).]

The relation between TSS, ESS and RSS can be expressed as:

TSS = ESS + RSS

The R Square, which is the square of the multiple correlation (R), can also be expressed as:

R² = ESS / TSS = 1 − RSS / TSS
F Statistic: The F statistic reported in the ANOVA table is a measure of the overall statistical fit of the regression model: the higher the value of the F statistic, the better the model fits. The F statistic in the regression model can be calculated by using the following formula:

F = (ESS / (k − 1)) / (RSS / (n − k))

Assuming a 95 percent confidence level, if the p-value of the F statistic is found to be more than 0.05, the regression model is not a good fit. In the table, the F statistic is 727.024 and its p-value is less than the 5 percent level of significance. Hence, at the 95 percent confidence level, the null hypothesis of poor fit cannot be accepted, and it can be concluded that the regression model is a good fit.
Rule of F greater than 4: As the F distribution with one numerator degree of freedom is the square of the t distribution, when the number of observations is more than 30 and the confidence level is 95 percent, any value of the F statistic greater than 4 is sufficient for rejecting the null hypothesis that the model is not a good fit. Hence, for any F statistic greater than 4, the model is said to be a good fit, and higher values of the F statistic represent a better fit.
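The same F statistic and its p-value can be computed directly from the sums of squares in R. A sketch, again assuming the hypothetical fitted model fit and the quantities ess and rss from the previous sketch:

n <- length(residuals(fit))   # number of observations
k <- length(coef(fit))        # number of estimated parameters (2 in the bivariate case)
f_stat <- (ess / (k - 1)) / (rss / (n - k))
f_stat                                                     # should match the F value in the ANOVA table
pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)   # p-value (Sig.) of the F test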
Table 10.4: Regression Coefficients(a)

Model               Unstandardized Coefficients    Standardized Coefficients   t          Sig.
                    B            Std. Error        Beta
1   (Constant)      -235.310     22.083                                        -10.656    .000
    Advertisement   46.635       1.730             .962                        26.963     .000
a. Dependent Variable: Sales
In a regression model there are two types of coefficients:

(a) Intercept
(b) Slope coefficient
Intercept (β₀): The intercept (β₀) is the hypothetical value of Y when X is zero. Geometrically, it is the point on the Y axis through which the regression line passes. The intercept is often no more than an adjusting factor for the regression model. In the example the value of the intercept is -235.310. This is the hypothetical value of sales when advertisement expenditure is zero. Since sales cannot be negative, this value is purely hypothetical and has no practical significance.
Slope coefficient (β₁): The slope coefficient (β₁) represents the impact of X on Y. It can be defined as the rate of change of Y per unit change in X.
There are two types of slope coefficients in table 10.4:
(a) Standardised Beta
(b) Unstandardised Beta
Standardised Beta: When the regression model is estimated on the z-scores of Y and X, the calculated slope coefficients are reported as standardised beta coefficients. In a regression model with standardised slope coefficients, the value of the intercept (β₀) is zero, and the regression line therefore passes through the origin.
Unstandardised Beta: When the regression model is estimated on the original observations of Y and X, the estimated coefficients are known as unstandardised slope coefficients, as shown in table 10.4. In the example the unstandardised beta is 46.635, which means that, on average, sales will increase by Rs 46.635 crore if advertisement expenditure increases by Rs 1 crore.

t Statistic: The t statistic shown in table 10.4 tests the null hypothesis that there is no significant impact of X on Y, i.e. β₁ = 0. If the p-value of the t statistic is less than 0.05, then at the 95 percent confidence level the null hypothesis cannot be accepted and a significant impact of X on Y can be concluded. In the example the t statistic is 26.963 and its p-value is less than 0.05. Hence, at the 95 percent confidence level, the null hypothesis β₁ = 0 (no impact of advertisement on sales) cannot be accepted, and it can be concluded that advertisement expenditure has a significant impact on sales.
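In R the values of Table 10.4 can be obtained from the model summary, and the standardised beta either with lm.beta() from the QuantPsyc package or by re-fitting the model on z-scores. A sketch, using the same hypothetical ad_data, Sales and Advertisement names as above:

library(QuantPsyc)                                        # for lm.beta()
fit <- lm(Sales ~ Advertisement, data = ad_data)
summary(fit)$coefficients                                 # unstandardised B, Std. Error, t value and Sig. (p-value)
lm.beta(fit)                                              # standardised beta coefficient
lm(scale(Sales) ~ scale(Advertisement), data = ad_data)   # standardised fit: intercept is zero, slope equals beta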

3. Multiple Regression with R


A multiple regression model can be represented as:

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + u

where Y is the dependent stochastic variable; β₀ is the intercept, the hypothetical value of Y when all Xᵢ are zero; β₁, β₂, β₃ and β₄ are the partial slope coefficients, each representing the rate of change of Y with respect to the corresponding Xᵢ keeping all other Xᵢ constant; X₁, ..., X₄ are the independent deterministic variables; and u is the residual (error) term.

Regression Plane: A bivariate regression equation can be represented by a straight line; a regression equation with two independent variables, however, represents a regression plane in a three-dimensional diagram.
R: In a multiple regression model the R statistic in the output is the coefficient of multiple correlation. It represents the correlation between the predicted (fitted) values of the dependent variable (Y) and its observed (actual) values.
R Square: R Square is known as the coefficient of determination. It represents the percentage of variance in Y that can be explained by the regression model.

# Set the working directory and read the SPSS data file
setwd("C:/users/akchauhan/Desktop/R data")
library(foreign)
data <- read.spss("Demand.sav", use.value.labels=TRUE, to.data.frame=TRUE)
data

# Bivariate regressions of Y on each independent variable
reg1 <- lm(Y ~ X2, data=data)
reg2 <- lm(Y ~ X3, data=data)
reg3 <- lm(Y ~ X4, data=data)
reg4 <- lm(Y ~ X5, data=data)
reg5 <- lm(Y ~ X6, data=data)

# Multiple regression of Y on all independent variables
reg6 <- lm(Y ~ X2 + X3 + X4 + X5 + X6, data=data)

# Model summary: R Square, Adjusted R Square, coefficients, t and F statistics
summary(reg1)

library("car")         # vif() and dwt()
library("QuantPsyc")   # lm.beta()
library("boot")

lm.beta(reg1)          # standardised beta coefficients
confint(reg1)          # confidence intervals for the coefficients
anova(reg1)            # ANOVA table (sums of squares, F statistic)
coefficients(reg1)     # unstandardised coefficients
fitted(reg1)           # predicted values of Y
residuals(reg1)        # residuals (actual Y minus predicted Y)
influence(reg1)        # influence measures for the observations
vif(reg6)              # variance inflation factors (requires two or more predictors, hence reg6)
dwt(reg1)              # Durbin-Watson test for autocorrelation
round(data, digits=3)  # display the data rounded to three decimals (assumes all columns are numeric)
plot(reg1)             # diagnostic plots of the fitted model
