
CSUS/Spring 2018/ DS 101/ HW-4

Dummy Variable and Multiple Regressions Analysis

Due Date: December 16th, 2018


Name: Demi To [Maximum Points= 20]

Q1. Develop an estimated regression equation that relates the risk of a stroke to the person's age,
blood pressure, smoking habit, gender and ethnicity, where: dummy smoke is defined as 1 for
smoker and 0 for non-smoker, dummy gender is defined as 1 for female and 0 for male, and
dummy ethnicity is defined as 1 for white and 0 for non-white.
a. Write the model:
$Risk = \beta_0 + \beta_1\,Age + \beta_2\,Blood\ Pressure + \beta_3\,Smoking + \beta_4\,Gender + \beta_5\,Ethnicity$

b. Are smoking and gender significant factors in the risk of a stroke? Use
alpha = 0.05. Explain the significance of the Smoking and Gender variables.

Smoking is a significant factor: its p-value (0.0153) is less than alpha = 0.05. Gender is not
significant: its p-value (0.8884) is greater than 0.05, so at the 5% significance level we cannot
conclude that gender is related to the risk of a stroke.
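
The decision rule used in this answer can be sketched in Python. This is only an illustration: the p-values are copied from the regression output for the Risk model in this document, and the variable names are made up for the example.

```python
# P-values copied from the regression output for the stroke-risk model.
p_values = {
    "Age": 0.0000,
    "Pressure": 0.0002,
    "Smoker": 0.0153,
    "Gender": 0.8884,
    "Ethnicity": 0.5066,
}
ALPHA = 0.05  # significance level given in the question

# A coefficient is significant when its p-value is below alpha.
significant = {name: p < ALPHA for name, p in p_values.items()}

for name, is_significant in significant.items():
    verdict = "significant" if is_significant else "not significant"
    print(f"{name}: p = {p_values[name]:.4f} -> {verdict}")
```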

c. What is the probability of a stroke over the next 10 years for Mrs. Smith, a 68-year-old
smoker who has a blood pressure of 175, is female and is of white ethnicity?

The probability of a stroke over the next 10 years for Mrs. Smith is given by the estimated
regression equation:
Risk = -94.0283 + 1.08415*Age + 0.270265*Pressure + 8.87355*Smoker + 0.614361*Gender - 2.8532*Ethnicity

d. Calculate the probability/risk of a stroke for Mrs. Smith.

Risk = -94.0283 + 1.08415*68 + 0.270265*175 + 8.87355*1 + 0.614361*1 - 2.8532*1 = 33.62

The predicted probability of a stroke for Mrs. Smith is about 33.62%.
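
The arithmetic above can be checked with a short Python sketch. The coefficients are the estimates from the regression output in this document; the dictionary names are chosen only for this illustration.

```python
# Coefficients from the estimated regression equation for Risk.
coefficients = {
    "Intercept": -94.0283,
    "Age": 1.08415,
    "Pressure": 0.270265,
    "Smoker": 8.87355,
    "Gender": 0.614361,
    "Ethnicity": -2.8532,
}

# Mrs. Smith: 68 years old, blood pressure 175, smoker, female, white.
mrs_smith = {"Age": 68, "Pressure": 175, "Smoker": 1, "Gender": 1, "Ethnicity": 1}

# Intercept plus each coefficient times the matching input value.
risk = coefficients["Intercept"] + sum(
    coefficients[name] * value for name, value in mrs_smith.items()
)
print(f"Predicted risk: {risk:.2f}%")
```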

e. What action might the physician recommend to this patient?

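The physician should recommend that she quit smoking. Smoking is a significant factor that she
can change: the model estimates that it adds about 8.87 percentage points to her risk. Her
remaining risk is based on the other variables: age, gender and blood pressure.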
[ 5 points]

[Figure: Plot of Risk, observed vs. predicted]

Multiple Regression - Risk


Dependent variable: Risk
Independent variables:
Age
Pressure
Smoker
Gender
Ethnicity
Number of observations: 20

Parameter    Estimate    Standard Error   T Statistic   P-Value
CONSTANT     -94.0283    16.2594          -5.78301      0.0000
Age          1.08415     0.174021         6.23002       0.0000
Pressure     0.270265    0.0530679        5.09283       0.0002
Smoker       8.87355     3.21345          2.76137       0.0153
Gender       0.614361    4.299            0.142908      0.8884
Ethnicity    -2.8532     4.18612          -0.681586     0.5066

Analysis of Variance
Source          Sum of Squares   Df   Mean Square   F-Ratio   P-Value
Model           3683.98          5    736.797       20.35     0.0000
Residual        506.967          14   36.2119
Total (Corr.)   4190.95          19

R-squared = 87.9033 percent


R-squared (adjusted for d.f.) = 83.583 percent
Standard Error of Est. = 6.01764
Mean absolute error = 4.039
Durbin-Watson statistic = 1.33126 (P=0.0746)
Lag 1 residual autocorrelation = 0.278987
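
As a consistency check on the output above, each t statistic should equal the estimate divided by its standard error, and the F-ratio should equal the model mean square divided by the residual mean square. The Python sketch below recomputes both from the reported numbers; it illustrates the relationships and is not how Statgraphics produced the output.

```python
# Estimates and standard errors copied from the parameter table.
rows = {
    "CONSTANT": (-94.0283, 16.2594),
    "Age": (1.08415, 0.174021),
    "Pressure": (0.270265, 0.0530679),
    "Smoker": (8.87355, 3.21345),
    "Gender": (0.614361, 4.299),
    "Ethnicity": (-2.8532, 4.18612),
}

# t statistic = estimate / standard error
t_stats = {name: est / se for name, (est, se) in rows.items()}

# Sums of squares and degrees of freedom from the analysis of variance.
ss_model, ss_resid = 3683.98, 506.967
df_model, df_resid = 5, 14

# F-ratio = mean square (model) / mean square (residual)
f_ratio = (ss_model / df_model) / (ss_resid / df_resid)

for name, t in t_stats.items():
    print(f"{name}: t = {t:.5f}")
print(f"F-ratio = {f_ratio:.2f}")
```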

Q2. Run a multiple regression model using the following formulation and conduct a hypothesis
test. The model is $Sales_t = \beta_0 + \beta_1\,Sales_{t-1} + \beta_2\,Advertise_{t-2} + \varepsilon$ and the data set is
PINKHAM. Answer the following questions.

a. State the null and alternate hypothesis for both slopes.

H0: βi = 0    Ha: βi ≠ 0    (for i = 1, 2)

b. P-value and comparison at the α = 0.05 significance level for β1, β2 and the model

The p-values for β1 (0.0000), β2 (0.0001) and the overall model (0.0000) are all less than
alpha = 0.05. Lagged sales and lagged advertising therefore both have a significant relationship
with sales at the 5% significance level.

c. What is r squared value and its interpretation?

The R-squared is 89.92%. About 89.92% of the variability in sales is explained by the model.
This is considered a relatively good model.
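
The R-squared value can be recovered from the sums of squares in the analysis of variance table for the sales model. The sketch below is a minimal illustration with the values copied from that table; the variable names are assumptions made for the example.

```python
# Values copied from the analysis of variance table for the sales model.
ss_resid = 1.9835e6    # residual sum of squares
ss_total = 1.96748e7   # total (corrected) sum of squares
n_obs = 52             # number of observations
n_predictors = 2       # lag(sales,1) and lag(advertise,2)

# R-squared: share of total variability explained by the model.
r_squared = 1 - ss_resid / ss_total

# Adjusted R-squared penalizes for the number of predictors.
adj_r_squared = 1 - (ss_resid / (n_obs - n_predictors - 1)) / (ss_total / (n_obs - 1))

print(f"R-squared = {100 * r_squared:.4f} percent")
print(f"Adjusted R-squared = {100 * adj_r_squared:.4f} percent")
```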

d. What conclusions have you derived after performing hypothesis tests?

Lagged sales and lagged advertising are both significantly related to current sales at the 5%
significance level, so the null hypothesis is rejected for both slopes in this model.

e. What will you state to your Sales Director after running the hypothesis test?
The hypothesis tests show that the model is useful for predicting sales: both sales in the
previous period and advertising two periods back are significant predictors. The model also
quantifies the relationship between the variables, and the constant (243.057) is the predicted
sales when both predictors are equal to zero.

[Figure: Plot of sales, observed vs. predicted]

Multiple Regression - sales
Dependent variable: sales
Independent variables:
lag (sales,1)
lag (advertise,2)
Number of observations: 52

Parameter           Estimate     Standard Error   T Statistic   P-Value
CONSTANT            243.057      89.4894          2.71604       0.0091
lag (sales,1)       1.10492      0.062965         17.5482       0.0000
lag (advertise,2)   -0.454662    0.106287         -4.27767      0.0001

Analysis of Variance
Source          Sum of Squares   Df   Mean Square   F-Ratio   P-Value
Model           1.76913E7        2    8.84567E6     218.52    0.0000
Residual        1.9835E6         49   40479.6
Total (Corr.)   1.96748E7        51

R-squared = 89.9186 percent


R-squared (adjusted for d.f.) = 89.5071 percent
Standard Error of Est. = 201.195
Mean absolute error = 154.699
Durbin-Watson statistic = 1.61715 (P=0.0508)
Lag 1 residual autocorrelation = 0.181097

sales = 243.057 + 1.10492*lag (sales,1) - 0.454662*lag (advertise,2)
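
To show how the fitted equation would be applied, here is a small Python sketch. The two short series are hypothetical and are NOT taken from the PINKHAM data set; only the coefficients come from the output above.

```python
def predict_sales(sales_lag1: float, advertise_lag2: float) -> float:
    """Apply the fitted equation: sales one period back, advertising two periods back."""
    return 243.057 + 1.10492 * sales_lag1 - 0.454662 * advertise_lag2


# Hypothetical series, oldest value first (illustration only).
sales = [1800.0, 1900.0, 2000.0]
advertise = [900.0, 1000.0, 1100.0]

# Forecast for the next period t uses sales at t-1 and advertising at t-2.
forecast = predict_sales(sales[-1], advertise[-2])
print(f"Forecast for next period: {forecast:.3f}")
```

Note that the advertising value used is the second-to-last one, because the model lags advertising by two periods.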

[ 5 points]

Q3. In real estate, housing prices often depend on the cost of construction, location,
and various other attributes of the house such as number of bedrooms, bathrooms, size of lot, and
proximity to various locations (employment, freeway, and other amenities).

a. Use the housing data to develop a relationship among price, square feet, bedrooms,
bathrooms, three-car garage (dummy variable with 1 for a house with a three-car garage and 0 for
a smaller garage), and pool (dummy variable with 1 for a house with a pool and 0 without).
b. Is a three car garage a significant factor in the estimation of house price? Explain.

Yes, after running the data in Statgraphics, a three-car garage has a significant positive impact on
price, since its p-value is less than alpha = 0.05 (the 5% significance level).

c. What is the expected price of a house if you are considering buying a house in this
neighborhood with the following characteristics: 2500 square foot, 5 bedrooms, 3 baths,
three car garage and a pool?

Price = 14.4487 + 0.055348*Sq ft + 0.50722*Bedrooms + 4.39513*Bathrooms + 33.7154*3 Car Garage + 10.0184*Pool

14.4487 + 0.055348*2500 + 0.50722*5 + 4.39513*3 + 33.7154*1 + 10.0184*1 = 212.274

With price measured in thousands of dollars, the expected price is about $212,274.
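
The price calculation can be verified with a short Python sketch. The coefficients are those given in the equation above; treating the result as thousands of dollars is an assumption inferred from the dollar figure reported in the answer, and the names are illustrative.

```python
# Coefficients from the estimated price equation.
coefficients = {
    "Intercept": 14.4487,
    "SqFt": 0.055348,
    "Bedrooms": 0.50722,
    "Bathrooms": 4.39513,
    "ThreeCarGarage": 33.7154,
    "Pool": 10.0184,
}

# House described in the question: 2500 sq ft, 5 bedrooms, 3 baths,
# three-car garage (dummy = 1) and a pool (dummy = 1).
house = {"SqFt": 2500, "Bedrooms": 5, "Bathrooms": 3, "ThreeCarGarage": 1, "Pool": 1}

price_thousands = coefficients["Intercept"] + sum(
    coefficients[name] * value for name, value in house.items()
)
# Assumption: the model's price unit is thousands of dollars.
print(f"Expected price: about ${price_thousands * 1000:,.0f}")
```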

[ 5 points]

Q4. Define the terms dummy variable and multicollinearity (MC). How will you use the dummy
variable? Describe an example. When does the problem of MC occur and how will you address
the problem?

Multicollinearity occurs when an explanatory variable does not provide unique information, because
the same information is already provided by other explanatory variables. The problem occurs when
two or more explanatory variables are highly correlated, and the solution is to remove one of the
duplicated variables from the model.
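
One way to screen for multicollinearity between two predictors is to compute their correlation and the variance inflation factor, VIF = 1 / (1 - r^2). The Python sketch below uses hypothetical square-footage and bedroom values (not the housing data from Q3). The threshold of 10 is a commonly quoted rule of thumb, used here as an assumption rather than a fixed standard.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical predictor values, for illustration only.
square_feet = [1500, 1800, 2100, 2400, 2700, 3000]
bedrooms = [2, 3, 3, 4, 4, 5]

r = pearson_r(square_feet, bedrooms)
# With only two predictors, the VIF of each one depends on r alone.
vif = 1 / (1 - r ** 2)

print(f"correlation = {r:.3f}, VIF = {vif:.1f}")
if vif > 10:  # rule-of-thumb threshold (assumption)
    print("High multicollinearity: consider dropping one of the two predictors.")
```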

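A dummy variable is used to incorporate a qualitative variable into the analysis. It takes the
value 1 when a characteristic is present and 0 when it is absent. For example, for the qualitative
variable garage, the dummy variable can be expressed as x = 1 if the house has a three-car garage
and x = 0 otherwise.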
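
The coding of a dummy variable can be sketched in Python as follows. The house records are hypothetical, and treating three or more garage spaces as a "three-car garage" is an assumption made for the example.

```python
# Hypothetical house records, for illustration only.
houses = [
    {"garage_spaces": 3, "pool": "yes"},
    {"garage_spaces": 2, "pool": "no"},
    {"garage_spaces": 1, "pool": "yes"},
]


def garage_dummy(garage_spaces: int) -> int:
    """1 for a three-car garage (assumed: 3 or more spaces), 0 otherwise."""
    return 1 if garage_spaces >= 3 else 0


def pool_dummy(pool: str) -> int:
    """1 if the house has a pool, 0 otherwise."""
    return 1 if pool == "yes" else 0


# Each qualitative attribute becomes a 0/1 column usable in a regression.
encoded = [
    {"garage": garage_dummy(h["garage_spaces"]), "pool": pool_dummy(h["pool"])}
    for h in houses
]
print(encoded)
```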

[5 points]
*****************
