
Assignment 9: How much for that car?

Raymond Guo
2020-04-13

Exercise 1
i. The other continuous variable is mileage.
cars %>%
  gather(Mileage, Liter, key = "category", value = "value") %>%
  ggplot() +
  geom_point(mapping = aes(x = value, y = Price)) +
  facet_wrap(~category, scales = "free_x") +
  labs(title = "Relationship of Price with Liter and Mileage")

[Figure: faceted scatterplots of Price (y) against value (x), one panel for Liter and one for Mileage.]
Exercise 2

continuous_model <- lm(Price ~ Mileage + Liter, data = cars)

continuous_model %>%
tidy()

term             estimate     std.error  statistic  p.value
(Intercept)  9426.6014688  1095.0777745   8.608157  0.0e+00
Mileage        -0.1600285     0.0349084  -4.584237  5.3e-06
Liter        4968.2781155   258.8011436  19.197280  0.0e+00

continuous_model %>%
glance() %>%
select(r.squared)

  r.squared
1 0.3291279
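
As a reminder of what this number measures, R² can be computed by hand as 1 − SSE/SST. A minimal base-R sketch on simulated data (the cars data itself is not reproduced here):

```r
# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
set.seed(1)
x <- runif(50)
y <- 3 * x + rnorm(50)
fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)   # unexplained variation
sst <- sum((y - mean(y))^2)    # total variation in y
r2  <- 1 - sse / sst

all.equal(r2, summary(fit)$r.squared)  # TRUE
```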

The r.squared is much closer to 0 than to 1, which means this model does a poor job of
capturing the variability of Price.

Exercise 3
# predict model plane over sensible grid of values
lit <- unique(cars$Liter)
mil <- unique(cars$Mileage)
grid <- with(cars, expand.grid(lit, mil))
d <- setNames(data.frame(grid), c("Liter", "Mileage"))
vals <- predict(continuous_model, newdata = d)

# form surface matrix and give to plotly


m <- matrix(vals, nrow = length(unique(d$Liter)), ncol = length(unique(d$Mileage)))
p <- plot_ly() %>%
add_markers(
x = ~cars$Mileage,
y = ~cars$Liter,
z = ~cars$Price,
marker = list(size = 1)
) %>%
add_trace(
x = ~mil, y = ~lit, z = ~m, type="surface",
colorscale=list(c(0,1), c("yellow","yellow")),
showscale = FALSE
) %>%
layout(
scene = list(
xaxis = list(title = "mileage"),
yaxis = list(title = "liters"),
zaxis = list(title = "price")
)
)
if (!is_pdf) {p}

This plane is consistent with the data from Exercise 1. I find it hard to judge the three
model assumptions from this 3D view; the 2D diagnostic plots are much easier to
interpret than the 3D one.

Exercise 4

continuous_df <- cars %>%
  add_predictions(continuous_model) %>%
  add_residuals(continuous_model)

ggplot(continuous_df) +
geom_point(mapping = aes(x = pred, y = Price)) +
geom_abline(
slope = 1,
intercept = 0,
color = "red",
size = 1
) +
labs(title="Observed vs Predicted of Price",
x = "Predicted Price",
y = "Observed Price")

[Figure: observed vs predicted Price, with the y = x reference line in red.]

This graph shows only a weak linear relationship between the predicted and the
observed prices.
ggplot(continuous_df) +
  geom_point(mapping = aes(pred, resid)) +
  geom_ref_line(h = 0) +
  labs(title = "Residual vs Predicted", x = "Predicted", y = "Residual")

[Figure: residuals vs predicted values, with a reference line at zero.]

The plot looks a little odd because many points cluster along the lower edge, while a
few points near the top look like outliers. Overall, though, the residuals show roughly
constant variability.
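
One way to put a number on "roughly constant variability" is to compare the residual spread across slices of the predicted range. A base-R sketch on simulated data (not the actual cars fit):

```r
# Compare residual standard deviations across thirds of the fitted range.
set.seed(2)
x <- runif(300)
y <- 5 * x + rnorm(300)   # constant-variance noise by construction
fit <- lm(y ~ x)

bins <- cut(fitted(fit), breaks = 3)
tapply(residuals(fit), bins, sd)  # similar values => roughly constant variance
```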
ggplot(data = continuous_df) +
geom_qq(mapping = aes(sample = resid)) +
geom_qq_line(mapping = aes(sample = resid)) +
labs(title="Theoretical Residuals vs Actual Residuals")
[Figure: normal QQ plot of the residuals with the reference line.]

The residuals clearly do not follow a bell-shaped (normal) curve.

Exercise 5


cars %>%
ggplot() +
geom_boxplot(aes(x = reorder(Make, Price, FUN=median), y = Price)) +
labs(x = "Make of car", title = "Effect of make of car on price")

[Figure: boxplots of Price by Make (Saturn, Chevrolet, Pontiac, Buick, SAAB, Cadillac), ordered by median price.]

Based on these box plots, about half of the makes have outliers only on the high side.
Those high outliers stretch the upper tail of the price distribution well above the rest
of the box.
i. Cadillac
ii. Cadillac
iii. Chevrolet
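
The ordering that reorder(Make, Price, FUN = median) produces can be computed directly with tapply. A tiny sketch on made-up prices (not the real data):

```r
# Median price per make, sorted ascending - the same ranking reorder() uses.
toy <- data.frame(
  Make  = c("Saturn", "Saturn", "Cadillac", "Cadillac", "SAAB"),
  Price = c(12000, 14000, 40000, 52000, 30000)
)
sort(tapply(toy$Price, toy$Make, median))
```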

Exercise 6

cars %>%
gather(Model:Cylinder, Doors:Leather, key="original_column", value="value") %>%
ggplot() +
geom_boxplot(aes(x = reorder(value, Price, FUN=median), y = Price)) +
facet_wrap(~original_column, scales = "free_x") +
labs(title = "Boxplot of All Categorical Variables") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90, hjust = 1))

[Figure: faceted boxplots of Price for each categorical variable (Cruise, Cylinder, Doors, Leather, Model, Sound, Trim, Type), with levels reordered by median price.]

Exercise 7

cars_factor_df <- cars %>%
  mutate(Cylinder = as.factor(Cylinder))

mixed_model <- lm(Price ~ Mileage + Liter + Cylinder + Make + Type,
                  data = cars_factor_df)

mixed_model %>%
tidy()

term              estimate      std.error   statistic    p.value
(Intercept)   1.885018e+04   892.4119413   21.122738  0.0000000
Mileage -1.861764e-01 0.0106433 -17.492387 0.0000000
Liter 5.697442e+03 342.7322419 16.623596 0.0000000
Cylinder6 -3.312544e+03 619.9683651 -5.343086 0.0000001
Cylinder8 -3.672597e+03 1246.2162662 -2.946998 0.0033032
MakeCadillac 1.450444e+04 517.9855224 28.001635 0.0000000
MakeChevrolet -2.270807e+03 355.9736337 -6.379145 0.0000000
MakePontiac -2.355468e+03 363.9063301 -6.472731 0.0000000
MakeSAAB 9.905074e+03 450.2011112 22.001443 0.0000000
MakeSaturn -2.090266e+03 470.8305609 -4.439529 0.0000103
TypeCoupe -1.163869e+04 464.7055454 -25.045297 0.0000000
TypeHatchback -1.172638e+04 545.3936364 -21.500769 0.0000000
TypeSedan -1.178618e+04 411.1021489 -28.669707 0.0000000
TypeWagon -8.156551e+03 500.6379995 -16.292312 0.0000000

Yes: every non-baseline level of Cylinder, Make, and Type gets its own coefficient, so
each categorical variable shifts the predicted price up or down relative to its baseline
level.
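
What those coefficients mean can be seen in how lm() encodes factors: model.matrix() expands each factor into 0/1 indicator columns, dropping the first level as the baseline. A small sketch with a made-up Make column:

```r
# Dummy coding: the first factor level is absorbed into the intercept;
# the remaining levels become 0/1 offset columns.
toy <- data.frame(
  Price = c(10, 20, 30, 40),
  Make  = factor(c("Buick", "Cadillac", "SAAB", "Buick"))
)
mm <- model.matrix(Price ~ Make, data = toy)
colnames(mm)  # "(Intercept)" "MakeCadillac" "MakeSAAB"
```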


mixed_model %>%
  glance() %>%
  select(r.squared)

  r.squared
1 0.9389165

Exercise 8

mixed_df <- cars_factor_df %>%
  add_predictions(mixed_model) %>%
  add_residuals(mixed_model)

ggplot(mixed_df) +
geom_point(mapping = aes(x = pred, y = Price)) +
geom_abline(
slope = 1,
intercept = 0,
color = "red",
size = 1
) +
labs(title="Observed vs Predicted of Price",
x = "Predicted Price",
y = "Observed Price")

[Figure: observed vs predicted Price for the mixed model, with the y = x reference line in red.]

ggplot(mixed_df) +
  geom_point(mapping = aes(pred, resid)) +
  geom_ref_line(h = 0) +
  labs(title = "Residual vs Predicted", x = "Predicted", y = "Residual")

[Figure: residuals vs predicted values for the mixed model, with a reference line at zero.]
ggplot(data = mixed_df) +
geom_qq(mapping = aes(sample = resid)) +
geom_qq_line(mapping = aes(sample = resid)) +
labs(title="Theoretical Residuals vs Actual Residuals")
[Figure: normal QQ plot of the mixed-model residuals with the reference line.]
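
Besides eyeballing the QQ plot, residual normality can be checked numerically with base R's shapiro.test, where a small p-value is evidence against normality. A sketch on simulated residuals (the actual model residuals are not reproduced here):

```r
# Shapiro-Wilk test: small p-value => evidence against normality.
set.seed(3)
normal_resid <- rnorm(100)
skewed_resid <- exp(rnorm(100))   # lognormal, clearly non-normal

shapiro.test(normal_resid)$p.value
shapiro.test(skewed_resid)$p.value  # much smaller
```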

Exercise 9
i. The value of r.squared is much closer to 1 than it is for the two-variable model. The
observed vs predicted graph shows a clear linear relationship, and the variability of
points around the line is close to constant. The two-variable model meets these
requirements only weakly. The QQ plot of the mixed model also follows the reference
line far better than the first model's did.
ii. The second model is the better one: all three conditions need to be satisfied for a
reliable model, and the second model satisfies them more convincingly than the first.
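
This comparison can be made concrete with a nested-model F-test via anova(). A sketch on simulated data standing in for cars (here small plays the role of the two-variable model and big the mixed model):

```r
# Nested-model comparison: does adding the categorical term help?
set.seed(42)
n  <- 200
x1 <- runif(n)
g  <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
y  <- 2 * x1 + 3 * (g == "b") + 6 * (g == "c") + rnorm(n)

small <- lm(y ~ x1)       # misses the categorical structure
big   <- lm(y ~ x1 + g)   # includes it

summary(small)$r.squared  # lower
summary(big)$r.squared    # higher
anova(small, big)         # F-test: the extra terms are worth keeping
```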
