
Building and evaluation of a linear regression model

Name: GuruPrasad Bhat
Roll No: 1713009

Acknowledgement

I would like to thank Mrs. Jayshree Kundargi for offering the internship “Data Science with R” and for her extensive support and guidance throughout. I would also like to thank my college, my parents and my colleagues for their support.

Description of selected dataset:

Appliances energy prediction Data Set

The dataset comes from a house whose nearest airport weather station is Chievres Airport, Belgium. It contains temperature and humidity readings from various rooms of the house, together with the energy used by appliances and lights. The temperature and humidity conditions in the house were monitored with a ZigBee wireless sensor network, in which each wireless node transmitted its readings about every 3.3 minutes. The dataset has 19735 observations of 29 attributes. The attributes are as follows:
1. date, time in year-month-day hour:minute:second
2. Appliances, energy use in Wh
3. lights, energy use of light fixtures in the house in Wh
4. T1, Temperature in kitchen area, in Celsius
5. RH_1, Humidity in kitchen area, in %
6. T2, Temperature in living room area, in Celsius
7. RH_2, Humidity in living room area, in %
8. T3, Temperature in laundry room area
9. RH_3, Humidity in laundry room area, in %
10. T4, Temperature in office room, in Celsius
11. RH_4, Humidity in office room, in %
12. T5, Temperature in bathroom, in Celsius
13. RH_5, Humidity in bathroom, in %
14. T6, Temperature outside the building (north side), in Celsius
15. RH_6, Humidity outside the building (north side), in %
16. T7, Temperature in ironing room, in Celsius
17. RH_7, Humidity in ironing room, in %
18. T8, Temperature in teenager room 2, in Celsius
19. RH_8, Humidity in teenager room 2, in %
20. T9, Temperature in parents room, in Celsius
21. RH_9, Humidity in parents room, in %
22. To, Temperature outside (from Chievres weather station), in Celsius
23. Pressure (from Chievres weather station), in mm Hg
24. RH_out, Humidity outside (from Chievres weather station), in %
25. Wind speed (from Chievres weather station), in m/s
26. Visibility (from Chievres weather station), in km
27. Tdewpoint (from Chievres weather station), in °C
28. rv1, Random variable 1, nondimensional
29. rv2, Random variable 2, nondimensional

As the name suggests, appliance and light energy consumption was recorded along with various physical environmental factors such as temperature, humidity, visibility, wind speed and dew point. For such a dataset, I expected the dependent variables to be Appliances and lights, with the remaining attributes as independent variables.

Data Analysis:

First, I expected appliance energy use to depend on all factors (independent variables) and light energy use to depend on visibility alone, so I drew scatter plots for all of these combinations. Energy use by lights appears fairly uniform across the other factors, although there is some hint of a relation between lights and visibility.

For appliances, I considered two major groups of factors that could affect energy use: the parameters measured inside the building and those measured outside. The scatter plots show that some factors are roughly linearly related to appliance energy use, while others look closer to a normal distribution. There is also a visible difference between the indoor parameters and the outdoor ones.
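
As a complement to reading scatter plots by eye, the strength of a linear association can be summarised with a correlation coefficient. The following is a minimal sketch on simulated data, not on the actual dataset: the data frame `toy` and its columns `T_in` and `RH_in` are hypothetical stand-ins for an indoor temperature and an indoor humidity.

```r
# Minimal sketch on simulated data (all names and values are hypothetical).
set.seed(1)
n <- 200
toy <- data.frame(
  T_in  = rnorm(n, mean = 21, sd = 2),   # stand-in for an indoor temperature
  RH_in = rnorm(n, mean = 40, sd = 4)    # stand-in for an indoor humidity
)
# The response depends linearly on T_in only, plus noise.
toy$Appliances <- 50 + 5 * toy$T_in + rnorm(n, mean = 0, sd = 20)

# Pearson correlation of each predictor with the response.
cors <- sapply(toy[c("T_in", "RH_in")],
               function(x) cor(x, toy$Appliances))
print(round(cors, 2))
```

A predictor with a correlation near zero shows up as a shapeless cloud in the scatter plot, while a larger absolute value corresponds to a visible linear trend.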

Next, box plots of the temperatures, the humidities and the energy use were drawn separately. In the box plot of energy use, appliance energy use was more widely spread and higher than the energy used by lights.

In the box plot of temperatures, T6 and T_out clearly differ from the rest. These are the outdoor temperatures, and as expected they vary more than the indoor temperatures. The indoor temperatures hardly differ from one another.

The box plot of humidities leads to similar conclusions. The outdoor humidity values (RH_6 and RH_out) differ from the indoor humidity values. There is also a slight difference in RH_5, the humidity in the bathroom, which is somewhat higher than the other indoor humidity values.

Building of regression models:

I considered the following three hypotheses:

1. Energy usage by lights is linearly dependent on visibility.
2. Energy usage by appliances is linearly dependent on outdoor factors such as temperature (T6, T_out), pressure, humidity (RH_6, RH_out) and dew point.
3. Energy usage by appliances is linearly dependent on indoor factors.

Hence the 3 linear models were built.
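
The general pattern of splitting the data 70/30 and fitting a model on the training part can be sketched as follows. This is an illustration on simulated data: the data frame `toy` and its values are hypothetical, and the fixed seed is an assumption added here so that the split is reproducible.

```r
# Minimal sketch on simulated data (values are hypothetical).
set.seed(123)  # assumption: fixed seed so the random split can be repeated
n <- 1000
toy <- data.frame(
  Visibility = runif(n, min = 1, max = 66),
  T_out      = rnorm(n, mean = 7, sd = 5)
)
toy$lights     <- rpois(n, lambda = 0.4) * 10            # steps of 10
toy$Appliances <- 100 - 2 * toy$T_out + rnorm(n, sd = 80)

# 70% of the rows for training, the rest for testing.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# One model per hypothesis, always fitted on the training rows only.
fit_lights  <- lm(lights ~ Visibility, data = train)
fit_outdoor <- lm(Appliances ~ T_out, data = train)

print(dim(train))
print(dim(test))
```

Fitting only on `train` keeps the `test` rows unseen, so an error measured on them reflects how the model behaves on new data.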

Presentation and interpretation of model parameters:

1. Linear model between lights and visibility:

Call:
lm(formula = lights ~ Visibility, data = train)

Residuals:
    Min      1Q  Median      3Q     Max
 -4.155  -3.817  -3.752  -3.565  56.183

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.296695   0.227959  14.462   <2e-16 ***
Visibility  0.013001   0.005679   2.289   0.0221 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.886 on 13812 degrees of freedom
Multiple R-squared:  0.0003793, Adjusted R-squared:  0.0003069
F-statistic: 5.24 on 1 and 13812 DF,  p-value: 0.02208

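The intercept and the Visibility coefficient carry three stars and one star respectively, so the slope is statistically significant at the 5% level (p = 0.022). However, the R-squared value of about 0.0004 means that visibility explains practically none of the variation in lights energy use, and the p-value is far higher than for the other two models. The model is therefore a poor fit, and I drop the hypothesis that lights energy use depends linearly on visibility in any useful way.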
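
With many thousands of observations, a slope can be statistically significant and still explain almost nothing. The following minimal sketch illustrates this on simulated data; the variables `x` and `y` and the slope of 0.05 are hypothetical and are not taken from the dataset.

```r
# Minimal sketch on simulated data: significant slope, negligible R-squared.
set.seed(42)
n <- 20000
x <- rnorm(n)
y <- 0.05 * x + rnorm(n)   # hypothetical weak true effect buried in noise

fit <- lm(y ~ x)
s   <- summary(fit)

p_slope <- coef(s)["x", "Pr(>|t|)"]  # p-value of the slope
r2      <- s$r.squared               # share of variance explained

cat("slope p-value:", signif(p_slope, 3), "\n")
cat("R-squared:    ", signif(r2, 3), "\n")
```

The p-value answers whether the slope is distinguishable from zero, while R-squared answers how much of the response the model accounts for; the two need to be read together.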

2. Linear model between appliances and outdoor parameters:

Call:
lm(formula = Appliances ~ T6 + RH_6 + T_out + RH_out + Press_mm_hg +
    Windspeed + Tdewpoint, data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-132.25  -47.58  -25.92   -0.49  982.08

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 802.58284  102.42989   7.835 5.01e-15 ***
T6            8.45033    0.66272  12.751  < 2e-16 ***
RH_6          0.40459    0.05103   7.929 2.38e-15 ***
T_out       -18.58404    1.78429 -10.415  < 2e-16 ***
RH_out       -3.62803    0.37592  -9.651  < 2e-16 ***
Press_mm_hg  -0.54370    0.12424  -4.376 1.22e-05 ***
Windspeed     1.05528    0.41874   2.520   0.0117 *
Tdewpoint    10.64888    1.70706   6.238 4.56e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 101 on 13806 degrees of freedom
Multiple R-squared:  0.04486, Adjusted R-squared:  0.04438
F-statistic: 92.64 on 7 and 13806 DF,  p-value: < 2.2e-16

All seven predictors are statistically significant (Windspeed at the 5% level, the others at the 0.1% level), and the very low overall p-value shows that the outdoor parameters jointly have a significant linear relationship with appliance energy use. The R-squared value shows that they explain about 4.5% of the variance in appliance energy use. I therefore conclude that the hypothesis is supported, although the explanatory power of the model is modest.
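
The R-squared value reported by `summary()` is the share of the variance in the response that the model accounts for. The sketch below recomputes it by hand on simulated data; the data frame `d` and its columns `temp`, `hum` and `energy` are hypothetical.

```r
# Minimal sketch on simulated data: R-squared computed by hand.
set.seed(7)
n <- 500
d <- data.frame(
  temp = rnorm(n, mean = 10, sd = 5),
  hum  = rnorm(n, mean = 70, sd = 10)
)
d$energy <- 100 + 2 * d$temp - 0.5 * d$hum + rnorm(n, sd = 50)

fit <- lm(energy ~ temp + hum, data = d)

rss <- sum(residuals(fit)^2)               # residual sum of squares
tss <- sum((d$energy - mean(d$energy))^2)  # total sum of squares
r2_manual <- 1 - rss / tss

cat("manual R-squared: ", r2_manual, "\n")
cat("summary R-squared:", summary(fit)$r.squared, "\n")
```

Read this way, an R-squared of 0.04486 means that roughly 95.5% of the variation in the response is left unexplained by the predictors.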

3. Linear model between appliances and indoor parameters:

Call:
lm(formula = Appliances ~ T1 + T2 + T3 + T4 + T5 + T7 + T8 +
    T9 + RH_1 + RH_2 + RH_3 + RH_4 + RH_5 + RH_7 + RH_8 +
    RH_9, data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-283.74  -44.32  -21.49    4.96  978.87

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  63.8354    19.7489   3.232  0.00123 **
T1           -1.4049     2.1204  -0.663  0.50760
T2          -16.1087     1.8141  -8.880  < 2e-16 ***
T3           24.1927     1.2368  19.560  < 2e-16 ***
T4            0.7976     1.1678   0.683  0.49462
T5           -3.6027     1.3990  -2.575  0.01003 *
T7           -0.8608     1.5285  -0.563  0.57333
T8            9.9922     1.1551   8.651  < 2e-16 ***
T9          -16.4420     1.9733  -8.332  < 2e-16 ***
RH_1         17.8805     0.8215  21.766  < 2e-16 ***
RH_2        -15.0268     0.9052 -16.600  < 2e-16 ***
RH_3          3.0286     0.7613   3.978 6.98e-05 ***
RH_4          3.9187     0.7279   5.384 7.41e-08 ***
RH_5          0.1371     0.1047   1.310  0.19033
RH_7         -1.0468     0.4917  -2.129  0.03326 *
RH_8         -6.1958     0.4400 -14.081  < 2e-16 ***
RH_9         -1.5083     0.4887  -3.086  0.00203 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 95.97 on 13797 degrees of freedom
Multiple R-squared:  0.1386, Adjusted R-squared:  0.1376
F-statistic: 138.7 on 16 and 13797 DF,  p-value: < 2.2e-16

Several predictors (for example T2, T3, T8, T9, RH_1, RH_2 and RH_8) are highly significant, while others (T1, T4, T7 and RH_5) are not significant at the 5% level. The overall p-value is very low, so the indoor parameters jointly have a statistically significant relationship with the response. The R-squared value shows that they explain about 13.9% of the variance in appliance energy use, the highest of the three models. To conclude, I feel this model is a reasonably good fit.
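
The adjusted R-squared in the output can be checked against the standard definition, where $n$ is the number of training rows and $p$ the number of predictors. This is a worked check using only the figures printed above ($n = 13814$, $p = 16$, $R^2 = 0.1386$):

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
          = 1 - (1 - 0.1386)\,\frac{13813}{13797}
          \approx 1 - 0.8614 \times 1.00116
          \approx 0.1376
```

The denominator $n - p - 1 = 13797$ matches the residual degrees of freedom in the output, and because $n$ is so much larger than $p$ the adjustment barely changes the value.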

Evaluation of root mean square error (RMSE) on test data:

1. For the lights vs visibility model, the RMSE was 8.046562. This is fairly low considering that the lights values are recorded in steps of 10 units (i.e. 10, 20, 30, ...).
2. For the appliances vs outdoor parameters model, the RMSE was 98.76715. This value is acceptable; the predicted values were reasonably close to the actual values.
3. For the appliances vs indoor parameters model, the RMSE was 93.88502. As in the previous case, the error can be considered tolerable.
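
The RMSE calculation can be wrapped in a small helper, and one way to judge whether a value is "low" is to compare it with the RMSE of always predicting the mean. The sketch below uses made-up numbers; the helper name `rmse` and the vectors `actual` and `predicted` are hypothetical, and the mean-prediction baseline is a suggestion added here, not part of the original analysis.

```r
# Minimal sketch with made-up values (not taken from the dataset).
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))  # same as sqrt(sum(...) / length(...))
}

actual    <- c(60, 50, 70, 430, 40)   # hypothetical observed energy use, Wh
predicted <- c(90, 80, 100, 130, 95)  # hypothetical model predictions, Wh

model_rmse    <- rmse(actual, predicted)
# Baseline: predict the mean of the observed values for every row.
baseline_rmse <- rmse(actual, rep(mean(actual), length(actual)))

cat("model RMSE:   ", model_rmse, "\n")
cat("baseline RMSE:", baseline_rmse, "\n")
```

A model whose RMSE is only slightly below the baseline is adding little beyond the average, which is consistent with a low R-squared.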

Conclusion:

Of the three models, the lights vs visibility model was not convincing even though its RMSE was low. This is a surprising result, since one would expect the energy used by lights to depend linearly on the visibility on that day. The appliances vs outdoor parameters model was good enough, and its RMSE supported the model's credibility. The main reason for choosing this hypothesis was the slight variation I saw in the scatter plots of the response variable against the outdoor physical parameters. The appliances vs indoor parameters model proved to be the most credible, with the highest R-squared and the lowest RMSE of the two appliance models. A peculiarity of the indoor parameters is that the coefficients for rooms such as the living room, parents' room and teenager's room were stronger than those for areas such as the ironing room. So it is possible that appliances are used mainly in the places where people spend the most time. This makes sense, since changes in the temperature and humidity of unoccupied rooms would hardly affect the use of any appliances. This conclusion can help in designing more energy-efficient buildings, or even in getting a rough idea of the energy use of a building that is yet to be constructed, purely from its intended purpose.

Code:

# Load the data and check for missing values
mydata <- read.csv("D:\\DataScience using R\\Assignment 5\\energydata_complete")
anyNA(mydata)

# Scatter plots: temperatures vs appliance energy use
plot(mydata$T1, mydata$Appliances, xlab = "Temperature in kitchen area", ylab = "Appliances energy use in Wh")
plot(mydata$T2, mydata$Appliances, xlab = "Temperature in living room area", ylab = "Appliances energy use in Wh")
plot(mydata$T3, mydata$Appliances, xlab = "Temperature in laundry room area", ylab = "Appliances energy use in Wh")
plot(mydata$T4, mydata$Appliances, xlab = "Temperature in office room area", ylab = "Appliances energy use in Wh")
plot(mydata$T5, mydata$Appliances, xlab = "Temperature in bathroom area", ylab = "Appliances energy use in Wh")
plot(mydata$T6, mydata$Appliances, xlab = "Temperature in outside building(north side)", ylab = "Appliances energy use in Wh")
plot(mydata$T7, mydata$Appliances, xlab = "Temperature in ironing room", ylab = "Appliances energy use in Wh")
plot(mydata$T8, mydata$Appliances, xlab = "Temperature in teenager's room", ylab = "Appliances energy use in Wh")
plot(mydata$T9, mydata$Appliances, xlab = "Temperature in parent's room", ylab = "Appliances energy use in Wh")
plot(mydata$T_out, mydata$Appliances, xlab = "Temperature outside", ylab = "Appliances energy use in Wh")

# Box plots: energy use, temperatures and humidities
boxplot(mydata$Appliances, mydata$lights)
boxplot(mydata$T1, mydata$T2, mydata$T3, mydata$T4, mydata$T5, mydata$T6, mydata$T7, mydata$T8,
        mydata$T9, mydata$T_out, xlab = "Temperature across different areas")
boxplot(mydata$RH_1, mydata$RH_2, mydata$RH_3, mydata$RH_4, mydata$RH_5, mydata$RH_6, mydata$RH_7,
        mydata$RH_8, mydata$RH_9, mydata$RH_out, xlab = "Humidity across different areas")

# Outdoor temperature vs appliances, with a fitted regression line
plot(mydata$Appliances, mydata$T_out)
plot(mydata$T_out, mydata$Appliances)
abline(lm(Appliances ~ T_out, data = mydata))

# Scatter plots: humidities, pressure and visibility vs energy use
plot(mydata$RH_1, mydata$Appliances, xlab = "Humidity in kitchen area", ylab = "Appliances energy use in Wh")
plot(mydata$RH_2, mydata$Appliances, xlab = "Humidity in living room area", ylab = "Appliances energy use in Wh")
plot(mydata$RH_3, mydata$Appliances, xlab = "Humidity in laundry room area", ylab = "Appliances energy use in Wh")
plot(mydata$RH_4, mydata$Appliances, xlab = "Humidity in office room area", ylab = "Appliances energy use in Wh")
plot(mydata$RH_5, mydata$Appliances, xlab = "Humidity in bathroom area", ylab = "Appliances energy use in Wh")
plot(mydata$RH_6, mydata$Appliances, xlab = "Humidity outside building(north side)", ylab = "Appliances energy use in Wh")
plot(mydata$RH_7, mydata$Appliances, xlab = "Humidity in ironing room", ylab = "Appliances energy use in Wh")
plot(mydata$RH_8, mydata$Appliances, xlab = "Humidity in teenager's room", ylab = "Appliances energy use in Wh")
plot(mydata$RH_9, mydata$Appliances, xlab = "Humidity in parent's room", ylab = "Appliances energy use in Wh")
plot(mydata$RH_out, mydata$Appliances, xlab = "Humidity outside", ylab = "Appliances energy use in Wh")
plot(mydata$Press_mm_hg, mydata$Appliances, xlab = "Pressure outside", ylab = "Appliances energy use in Wh")
plot(mydata$Visibility, mydata$lights, xlab = "Visibility outside", ylab = "Lights energy use in Wh")

# 70/30 train/test split
library(dplyr)
n <- nrow(mydata)
trainIndex <- sample(1:n, size = round(0.7 * n), replace = FALSE)
train <- mydata[trainIndex, ]
test <- mydata[-trainIndex, ]
dim(mydata)  # [1] 19735 29
dim(train)   # [1] 13814 29
dim(test)    # [1] 5921 29

# Model 3: appliances vs indoor parameters
linmod <- lm(Appliances ~ T1 + T2 + T3 + T4 + T5 + T7 + T8 + T9 + RH_1 + RH_2 + RH_3 + RH_4 +
               RH_5 + RH_7 + RH_8 + RH_9, data = train)
summary(linmod)

# Model 2: appliances vs outdoor parameters, with RMSE on the test data
linmod2 <- lm(Appliances ~ T6 + RH_6 + T_out + RH_out + Press_mm_hg + Windspeed + Tdewpoint, data = train)
summary(linmod2)
new_data_test <- data.frame(T6 = test$T6, RH_6 = test$RH_6, T_out = test$T_out, RH_out = test$RH_out,
                            Press_mm_hg = test$Press_mm_hg, Windspeed = test$Windspeed,
                            Tdewpoint = test$Tdewpoint)
pred <- predict(linmod2, new_data_test)
rmse <- sqrt(sum((pred - test$Appliances)^2) / length(test$T6))
print(rmse)

# Model 1: lights vs visibility, with RMSE on the test data
linmod1 <- lm(lights ~ Visibility, data = train)
summary(linmod1)
new_data_test1 <- data.frame(Visibility = test$Visibility)
pred1 <- predict(linmod1, new_data_test1)
rmse1 <- sqrt(sum((pred1 - test$lights)^2) / length(test$Visibility))
rmse1

# RMSE of model 3 on the test data
new_data_test2 <- data.frame(T1 = test$T1, T2 = test$T2, T3 = test$T3, T4 = test$T4, T5 = test$T5,
                             T7 = test$T7, T8 = test$T8, T9 = test$T9, RH_1 = test$RH_1,
                             RH_2 = test$RH_2, RH_3 = test$RH_3, RH_4 = test$RH_4, RH_5 = test$RH_5,
                             RH_7 = test$RH_7, RH_8 = test$RH_8, RH_9 = test$RH_9)
pred2 <- predict(linmod, new_data_test2)
rmse2 <- sqrt(sum((pred2 - test$Appliances)^2) / length(test$T1))
rmse2
