Linear Regression

Decision Sciences II
Mid-Term Examination
Friday 19 October, 2018
Time: 180 minutes
Total No. of Pages: 21
Total No. of Questions: 3
Total marks: 40
Name ________________________
Roll No. ________________________
Section ________________________
Instructions
1. This is a closed book exam. You are NOT allowed to use text book and class notes.
2. Answer all questions only in the space provided following the question.
3. Show all work and give adequate explanations to get full credit.
4. You may use the backside of the last page for rough work only if needed. Do NOT attach any
rough work/sheets.
5. Encircle or underline your final answer for each part.
6. No clarifications will be made during the exam.
7. Assume 95% confidence level if necessary (α = 0.05).
8. Use approximate critical values for Z, t and F tests if the exact value is not available in the tables
attached with the question paper.
Question Number   Q1   Q2   Q3   Total
Max Marks          8   16   16      40
Marks Scored
Question 1 (8 points)
Data on Life Expectancy at birth (measured in years) and Infant Mortality (deaths per thousand
live births) is collected for 20 countries in South and East Asia and the Pacific Region.
The descriptive statistics related to the data are given in the table below:
        Life Expectancy   Infant Mortality
Mean              68.90              36.30
SD                 7.86              31.20
Min                  54                  3
Max                  81                102
A simple linear regression model is developed to predict Life Expectancy using Infant Mortality
as the only explanatory variable. The regression outputs obtained are given below:
SUMMARY OUTPUT

Regression Statistics
Multiple R
R Square
Adjusted R Square
Standard Error
Observations

ANOVA
             df    SS          MS    F         Significance F
Regression         1040.828          140.893   6.01375E-10
Residual
Total              1173.800

                   Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept          77.511         0.946            81.900   0.000     75.523      79.500
Infant Mortality                                                      -0.279      -0.195
Use these regression results to answer the following questions:

Question 1.1
What is the change in Life Expectancy when Infant Mortality increases by one unit? (1 point)
Question 1.2
Calculate Life Expectancy for India given that it has an Infant Mortality of 51. (1 point)
Question 1.3
What percentage of variation in Life Expectancy can be explained by Infant Mortality? (1 point)
Question 1.4
At 95% confidence level, calculate the maximum value of Life Expectancy for Vietnam which
has an Infant Mortality of 34. (2 points)
Question 1.5
At 95% confidence level, test the hypothesis that Life Expectancy for New Zealand is 77, given
that Infant Mortality is 5. (2 points)
Question 1.6
Can the regression result be used to predict the Life Expectancy of a country in South Asia with
an Infant Mortality of 110? Why or why not? (1 point)
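For reference, the blank cells in the Question 1 output are tied together by the usual ANOVA identities for simple linear regression. The following is a minimal Python sketch, assuming that 1040.828 and 1173.800 are the regression and total sums of squares, that there are 20 observations and that there is one explanatory variable. The function name `anova_summary` is hypothetical and is used only for illustration.

```python
import math

# Figures printed in the Question 1 output (assumed to be sums of squares)
SS_REGRESSION = 1040.828
SS_TOTAL = 1173.800
N_OBSERVATIONS = 20   # 20 countries
N_PREDICTORS = 1      # Infant Mortality only


def anova_summary(ss_reg, ss_tot, n_obs, n_predictors):
    """Derive the remaining ANOVA cells from SS(regression) and SS(total)."""
    df_reg = n_predictors
    df_res = n_obs - n_predictors - 1
    ss_res = ss_tot - ss_reg
    ms_reg = ss_reg / df_reg
    ms_res = ss_res / df_res
    return {
        "df_regression": df_reg,
        "df_residual": df_res,
        "ss_residual": ss_res,
        "ms_residual": ms_res,
        "f_stat": ms_reg / ms_res,
        "r_square": ss_reg / ss_tot,
        "standard_error": math.sqrt(ms_res),
    }


summary = anova_summary(SS_REGRESSION, SS_TOTAL, N_OBSERVATIONS, N_PREDICTORS)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The F statistic recomputed this way can be compared with the printed value of 140.893 as a consistency check on the assumed layout of the table.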
Question 2 (16 points)
Consider the following data on sericulture (production of silk and rearing silk worms), collected
from 561 farmers in Karnataka, where the variables are listed in Table 2.1.
Table 2.1 Variables in the data

S.No  Variable Name        Description                                  Code used in SPSS
1     Income per acre      Income earned per acre                       Income
2     Loan Amount          Loan taken by the farmer for sericulture     Loan Amount
3     Rearing Cost         Total cost of establishing and maintaining   Rearing Cost
                           sericulture farm
4     Crop Insured         Categorical variable that captures           1 = insured
                           whether the crop is insured                  0 = not insured
5     Years of Experience  Years of experience in sericulture for the   Experience
                           farmer
6     Training             Categorical variable that captures           1 = Attended a training program
                           whether the farmer has undergone any         0 = Did not attend training program
                           training in that year
The following descriptive statistics of these variables are shown in Table 2.2 below.
Table 2.2 Descriptive Statistics

                      n     Minimum       Maximum         Mean         Std. Deviation
Income                561   -79646.0177   155000.0000     49476.8672   44371.3894
Crop insured          561   0             1               .09          .290
Rearing cost          561   .000000       110000.000000   37140.2852   21479.2195
Experience            561   0             40              16.82        10.398
Training              561   0             1               .30          .458
Loan amount           561   .000000       250000.000000   47698.4491   64761.24981
Valid n (list wise)   561
Model 1
A linear regression model was developed between income per acre as the dependent variable (Y)
and years of experience as the independent variable. The SPSS model outcome is shown in
Tables 2.3 to 2.5.
Table 2.3 Model Summary(b)
Model   R   R Square   Adjusted R Square   Std. Error of the Estimate
1                                          44043.3492
a. Predictors: (Constant), Experience
b. Dependent Variable: Income
Table 2.4 ANOVA(a)
Model 1      Sum of Squares      df    Mean Square       F       Sig.
Regression   18181829555.883     1     18181829555.883   9.373   .002(b)
Residual     1084357485250.006   559   1939816610.465
Total                            560
a. Dependent Variable: Income
b. Predictors: (Constant), Experience
Table 2.5 Coefficient Values
Model 1      Unstandardized Coefficients   t        Sig.
             B          Std. Error
(Constant)              3538.000           11.380   .000
Experience   547.976    178.988
Answer questions 2.1-2.4 based on model 1.

Question 2.1 (1 point)
What proportion of the variation in income per acre is not explained by the variable experience?
Question 2.2
Is it possible to claim that the income per acre increases by at least 200 rupees for every one-year
increase in experience at a 5% significance level? (2 points)
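One way to set up the test in Question 2.2 is as a one-sided test on the slope of experience. The sketch below assumes the hypotheses H0: slope <= 200 against H1: slope > 200, and approximates the t distribution with 559 residual degrees of freedom by the standard normal. Both choices are assumptions made for illustration, and the function name `one_sided_slope_test` is hypothetical.

```python
from statistics import NormalDist

# Figures printed in Table 2.5
B_EXPERIENCE = 547.976
SE_EXPERIENCE = 178.988


def one_sided_slope_test(b, se, b0, alpha):
    """Test H0: slope <= b0 against H1: slope > b0.

    The critical value comes from the standard normal, used here as an
    approximation to the t distribution (reasonable only for large df).
    """
    t_stat = (b - b0) / se
    critical = NormalDist().inv_cdf(1 - alpha)
    return t_stat, critical, t_stat > critical


t_stat, critical, reject = one_sided_slope_test(
    B_EXPERIENCE, SE_EXPERIENCE, b0=200.0, alpha=0.05
)
print(f"t = {t_stat:.3f}, critical = {critical:.3f}, reject H0: {reject}")
```

With tables, the same comparison would use the approximate critical t value for a large number of degrees of freedom, as permitted by instruction 8.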
Question 2.3
What is the average revenue earned by the first-time farmers? (1 point)
Question 2.4 (2 points)

What is the probability that a specific farmer with 20 years of experience in sericulture will earn
at least 50,000 rupees (income per acre)? What assumptions are made in this calculation?
A second model is developed based on whether a farmer has undergone any training or not and
the results obtained are displayed in Table 2.6.
Table 2.6 Coefficients(a)
Model 2      Unstandardized Coefficients   Standardized Coefficients   t        Sig.
             B           Std. Error        Beta
(Constant)   56936.976   2161.365                                      26.343   .000
Training                 3961.424          -.258                       -6.326
Question 2.5
Is the variable “training” statistically significant at a 1% significance level? (1 point)
Question 2.6
Is it possible to conclude that the farmers should not attend the training program? State your
reasons clearly. (1 point)
Model 3 is developed using both training and experience as independent variables, and the
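corresponding SPSS output is shown in Table 2.7.

Table 2.7 Coefficients(a)
Model 3      Unstandardized Coefficients   Standardized Coefficients   t        Sig.
             B            Std. Error       Beta
(Constant)   50428.002    3848.024                                     13.105   .000
Training     -23582.580   4016.019         -.243                       -5.872   .000
Experience   360.897      176.752          .085                        2.042    .042
a. Dependent Variable: Income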
Question 2.7
Comment on the type of correlation (positive or negative) that exists between training and
experience. Clearly show all the steps. (2 points)
Model 4 is developed using the following independent variables: (1) Training; (2) Experience;
and (3) TrainExp, which is an interaction variable between Training and Experience, and the
SPSS output for this model is recorded in Table 2.8 below.

Table 2.8 Coefficients
Model 4      Unstandardized Coefficients   Standardized Coefficients   t        Sig.
             B            Std. Error       Beta
(Constant)   46229.978    4347.769                                     10.633   .000
Training     -12542.529   7101.003         -.119                       -1.766
Experience   593.661      209.558          .139                        2.833    .005
TrainExp     -795.315     387.361          -.151                       -2.053   .041
Question 2.8
Can we say that farmers who have attended the training program will always earn less than those
who did not attend the training program, irrespective of their experience? (2 points)
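In Model 4 the predicted difference between a trained and an untrained farmer with the same experience depends on experience through the interaction term. The sketch below evaluates that difference from the point estimates in Table 2.8 over the experience range reported in Table 2.2 (0 to 40 years). It deliberately ignores sampling uncertainty, which is a simplifying assumption, and the function name `training_gap` is hypothetical.

```python
# Point estimates printed in Table 2.8
B_TRAINING = -12542.529
B_TRAIN_EXP = -795.315


def training_gap(experience_years):
    """Predicted income of a trained farmer minus that of an untrained
    farmer with the same experience, using Model 4 point estimates only."""
    return B_TRAINING + B_TRAIN_EXP * experience_years


for years in (0, 10, 20, 40):
    print(f"experience = {years:2d} years, gap = {training_gap(years):.3f}")
```

Because both coefficients are negative, the estimated gap cannot change sign for non-negative experience. Whether the Training coefficient itself is statistically distinguishable from zero is a separate question that the sketch does not address.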
Model 5 is developed using stepwise regression with all variables included, and the results are
documented in Table 2.9 below.

Table 2.9 Coefficients(a)
Model 5        Unstandardized Coefficients   Standardized Coefficients   t        Sig.
               B            Std. Error       Beta
(Constant)     53710.641    3909.911                                     13.737   .000
Training       -19570.055   4066.697                                     -4.812   .000
Loan amount    -.089        .028                                         -3.198   .001
Crop insured   -20004.165   6375.008                                     -3.138   .002
Rearing cost   .207         .083                                         2.497    .013
Question 2.9
Among the continuous variables used in Model 5, which independent variable has the highest
impact on the income per acre? (1 point)
Model 6 is developed after adding a new variable “chawki_bivoltine”, which is a dummy variable
that captures whether the farmer used a hybrid variety called “bivoltine” (1 implies that the
farmer used the hybrid variety and 0 implies otherwise). The outputs are shown in Table 2.10-
2.11.
Table 2.10 Model Summary(g)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .311(a)   .097       .095                42213.5119132141
2       .353(b)   .125       .122                41583.5513778100
3       .368(c)   .136       .131                41366.0371499883
4       .379(d)   .144       .137                41209.9652100806
5       .386(e)   .149       .141                41114.5470540031
6       .389(f)   .151       .142                41099.2361349924
a. Predictors: (Constant), chawki_bivoltine
b. Predictors: (Constant), chawki_bivoltine, TrainExp
c. Predictors: (Constant), chawki_bivoltine, TrainExp, crop_insured
d. Predictors: (Constant), chawki_bivoltine, TrainExp, crop_insured, loan_amount
e. Predictors: (Constant), chawki_bivoltine, TrainExp, crop_insured, loan_amount, rearing_cost
f. Predictors: (Constant), chawki_bivoltine, TrainExp, crop_insured, loan_amount, rearing_cost, TrainRear
g. Dependent Variable: Income

Table 2.11 Coefficients(a)
Model   Variable           Unstandardized Coefficients   Standardized Coefficients   t        Sig.
                           B            Std. Error       Beta
6       (Constant)         56944.065    4155.765                                     13.702   .000
        chawki_bivoltine   -20881.352   4194.800         -.216                       -4.978   .000
        TrainExp           -591.775     282.168          -.113                       -2.097   .036
        crop_insured       -15349.878   6329.923         -.100                       -2.425   .016
        loan_amount        -.065        .028             -.094                       -2.349   .019
        rearing_cost       .201         .091             .097                        2.221    .027
        TrainRear          -.154        .130             -.071                       -1.189   .235
Question 2.10 (2 points)

Comment on whether it is worth adding the variable “rearing cost” at a 1% significance level, after
adding the variables (1) Chawki bivoltine, (2) TrainExp, (3) Crop insured and (4) Loan amount.
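One common way to judge whether an added variable is worthwhile is a partial F-test built from the change in R Square between two nested steps. The sketch below applies it to steps 4 and 5 of Table 2.10, assuming n = 561 and using the rounded R Square values printed there, so the result is approximate. The critical value uses the fact that an F distribution with 1 numerator degree of freedom and a very large denominator is close to the square of a standard normal, which is an approximation rather than an exact table value. The function name `partial_f` is hypothetical.

```python
from statistics import NormalDist


def partial_f(r2_full, r2_reduced, n_obs, k_full, q_added):
    """Partial F statistic for q_added variables entering a model that
    then has k_full predictors in total."""
    df_residual = n_obs - k_full - 1
    return ((r2_full - r2_reduced) / q_added) / ((1 - r2_full) / df_residual)


# Rounded R Square values for steps 5 and 4 in Table 2.10
f_stat = partial_f(r2_full=0.149, r2_reduced=0.144, n_obs=561, k_full=5, q_added=1)

# Large-sample approximation to the 1% critical value of F(1, df)
alpha = 0.01
critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2

print(f"partial F = {f_stat:.3f}, approximate critical value = {critical:.3f}")
```

The partial F with one added variable equals the square of that variable's t statistic at the step where it enters, so the two routes should agree up to rounding.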
Question 2.11 (1 point)

The normal probability plot of Model 6 is shown in Figure 2.1. Comment on the validity of the
model.
Figure 2.1
Question 3 (16 points)
The Election Commission (EC) and the Supreme Court have of late been trying to improve the
election process by looking at various data related to winners from different constituencies in the
recent elections. Some of this data comes from the EC, while the rest is gathered directly from the
winners. One of the questions that the EC is trying to understand is whether the margin of victory
depends on: (a) the financial strength of the candidate; (b) the candidate’s involvement in various
court cases and; (c) other demographic data related to the candidate. This would enable the EC to
better plan security and other arrangements for the next election cycle.
In this context, the EC uses the following data, which is given in Table 3.1 below.
Table 3.1 Variables used in the data

Variable       Description
Win Margin     The percentage of votes that the winner won by
Assets1over2   (Winner assets – Runner up assets)/Runner up assets (expressed as %)
Liab diff      Winner Financial liabilities – Runner up Financial liabilities
There were other categorical and dummy variables that the EC was interested in:
Gp1, Gp2, Gp3 are three Groups from which the winners come (e.g., General, SC, ST)
Cases Win gt Loser = 1 if the winner was involved in more court cases than the runner up; 0
otherwise
Serious W gt L = 1 if the winner was involved in more serious court cases than the runner up; 0
otherwise.
Grad, PG, Professional, Ph D, None are all 0-1 variables which take a value of 1 if the winner
had finished a graduate degree, a post graduate degree, a professional degree, Ph D or None of
these respectively, and 0 otherwise. The highest educational qualification obtained by the winner
was taken. Some winners did not go to college.
Various regressions were tried and the results obtained for some of them are given below.
Regression 1 Output
Multiple R          0.176394898
R Square            0.03111516
Adjusted R Square   0.023788999
Standard Error      1.133067839
Observations        534

ANOVA
             df    SS            MS            F             Significance F
Regression   4     21.81058803   5.452647008   4.247130033   0.00215867
Residual     529   679.1528032   1.283842728
Total        533   700.9633912

               Coefficients   Standard Error   t Stat     P-value
Intercept      1.89774        0.09752          19.46022   0.00000
Grad           -0.36673       0.14378          -2.55057   0.01104
Professional   -0.39892       0.14783          -2.69844   0.00719
PG             -0.51749       0.13485          -3.83750   0.00014
PhD            -0.11008       0.22278          -0.49412   0.62143
Regression 2 Output
Multiple R     0.685085
R Square       0.469342
Observations   534

ANOVA
             df    SS         MS         F         Significance F
Regression   1     328.9912   328.9912   470.528   3.18E-75
Residual     532   371.9722   0.699196
Total        533   700.9634

                  Coefficients   Standard Error   t Stat     P-value
Intercept         1.471912       0.036599         40.21676   2E-163
Assets 1 over 2   0.017381       0.000801         21.69166   3.18E-75
Regression 3 Output
Multiple R     0.69933132
R Square       0.489064296
Observations   534

ANOVA
             df    SS         MS         F          Significance F
Regression   4     342.8162   85.70404   126.5888   9.51E-76
Residual     529   358.1472   0.677027
Total        533   700.9634

                  Coefficients   Standard Error   t Stat     P-value
Intercept         1.694965067    0.064215         26.39504   1.38E-98
Assets 1 over 2   0.017204116    0.00079          21.78539   1.39E-75
Grad              -0.244591858   0.099814         -2.45049   0.014589
Professional      -0.28809465    0.102863         -2.80075   0.005285
PG                -0.407917914   0.092979         -4.38722   1.39E-05
Regression 4 Output
Multiple R     0.129845
R Square       0.01686
Observations   534

ANOVA
             df    SS         MS        F         Significance F
Regression   2     11.81811   5.90905   4.55304   0.01095
Residual     531   689.1453   1.29783
Total        533   700.9634

            Coefficients   Standard Error   t Stat     P-value
Intercept   1.66155        0.056678         29.31541   4.5E-113
Gp2         -0.41222       0.136612         -3.01748   0.002671
Gp3         -0.06565       0.177274         -0.37036   0.711265
Identify the most appropriate regression model and give proper explanations for your
answers to the following questions.
3.1 Are people who attended college likely to win with a greater margin? (2 points)
3.2 Given that the values of the variable “Assets1over2” are within 3 standard deviations of its
mean, what is the maximum margin of victory? (3 points)
3.3 What is the best inference you can make about the relationship between education levels and
Assets1over2 ? (2 points)
3.4 Rank the Groups in terms of margin of victory. (2 points)

The interaction effects between Asset1over2 and the other variables are added and
Stepwise Regression is carried out using all the variables. The output is given below.
Regression 5 Output
Coefficients(a)
Model   Variable            Unstandardized Coefficients   Standardized Coefficients   t        Sig.
                            B        Std. Error           Beta
1       (Constant)          1.472    0.037                                            40.217   0
        Assets1over2        0.017    0.001                0.685                       21.692   0
2       (Constant)          1.325    0.036                                            37.265   0
        Assets1over2        0.017    0.001                0.659                       23.051   0
        Serious W gt L      1.024    0.092                0.317                       11.102   0
3       (Constant)          1.32     0.035                                            37.356   0
        Assets1over2        0.016    0.001                0.623                       20.332   0
        Serious W gt L      1.04     0.092                0.322                       11.342   0
        Interact Asset_PG   0.006    0.002                0.093                       3.037    0.003
4       (Constant)          1.392    0.041                                            34.256   0
        Assets1over2        0.016    0.001                0.616                       20.232   0
        Serious W gt L      1.024    0.091                0.317                       11.28    0
        Interact Asset_PG   0.007    0.002                0.109                       3.572    0
        PG                  -0.253   0.073                -0.099                      -3.478   0.001
5       (Constant)          1.427    0.043                                            33.338   0
        Assets1over2        0.016    0.001                0.616                       20.334   0
        Serious W gt L      1.001    0.091                0.31                        11.012   0
        Interact Asset_PG   0.007    0.002                0.107                       3.496    0.001
        PG                  -0.241   0.072                -0.094                      -3.325   0.001
        Gp2                 -0.223   0.088                -0.071                      -2.523   0.012
Model Summary(f)
                                                                              Change Statistics
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1
1       .685(a)   .469       .468                .8361                        .469              470.528    1
2                                                                             .100              123.260    1
3       .759(c)   .577       .574                .7482                        .007              9.226      1
4                                                                                                          1
5       .769(e)   .591       .587                .7367                                          6.368      1
Descriptive Statistics
Mean Std. Deviation N
1.59105 1.146790 534

6.85440 45.20085 534
-2533605.17041 86610730.66986 534
.884 .3207 534
.157 .3644 534
.086 .2808 534
.184 .3875 534
.148 .3554 534
.215 .4115 534
.195 .3964 534
.277 .4480 534
.060 .2376 534
Use Regression 5 Outputs to answer the following questions:
3.5 The EC is concerned about constituencies where the margin of victory is the lowest. In what
type of constituencies does this happen? What are the attributes of the winners in constituencies
where this happens? Assume that the winner had 15% less Assets than the runner up. (3 points)
3.6 Which of the following is true? (2 points)

(i) The gap in the Margin of Victory between PG and non PG winners remains constant as
Asset1over2 increases
(ii) The gap in the Margin of Victory between PG and non PG winners goes down as
Asset1over2 increases
(iii) The gap in the Margin of Victory between PG and non PG winners goes up as
Asset1over2 increases
(iv) Nothing can be said about this
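The options above turn on how the PG effect varies with Assets1over2 once the interaction term is in the model. The sketch below evaluates the predicted difference between a PG winner and an otherwise identical non PG winner using the step 5 point estimates from Regression 5. It assumes the interaction variable is the product of PG and Assets1over2, which the paper implies but does not state outright, and the function name `pg_gap` is hypothetical.

```python
# Step 5 point estimates printed in the Regression 5 output
B_PG = -0.241
B_INTERACT_ASSET_PG = 0.007


def pg_gap(assets1over2):
    """Predicted Win Margin of a PG winner minus that of a non PG winner
    with the same Assets1over2 and the same other attributes."""
    return B_PG + B_INTERACT_ASSET_PG * assets1over2


# Value of Assets1over2 at which the predicted difference is zero
break_even = -B_PG / B_INTERACT_ASSET_PG

for value in (-15, 0, 50, 100):
    print(f"Assets1over2 = {value:4d}, PG minus non PG = {pg_gap(value):.3f}")
print(f"break-even Assets1over2 = {break_even:.2f}")
```

The slope of this difference with respect to Assets1over2 is the interaction coefficient, so its sign determines the direction in which the difference moves.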
3.7 What is the change in R Square value due to the entering variable between Step 4 and 5 in
Regression Output 5? (2 points)
ROUGH SHEET
