Prescriptive Analytics
Regression in R
Here, we develop the regression equation with temperature as the independent variable and distress_ct as the dependent variable.
We use the covariance and variance functions available in R, cov() and var(), to calculate the coefficients of the regression equation.
Temperature data is as shown below:
spaceShuttle$temperature
## [1] 66 70 69 68 67 72 73 70 57 63 70 78 67 53 67 75 70 81 76 79 75 76 58
Distress_ct data is as shown below:
spaceShuttle$distress_ct
## [1] 0 1 0 0 0 0 0 0 1 1 1 0 0 2 0 0 0 0 0 0 0 0 1
Now, we evaluate the covariance and variance required to calculate the regression coefficients. The slope b is the covariance of the two variables divided by the variance of temperature:
b <- cov(spaceShuttle$temperature, spaceShuttle$distress_ct) / var(spaceShuttle$temperature)
b
## [1] -0.05746032
Next, we evaluate the intercept a of the regression equation as follows:
a <- mean(spaceShuttle$distress_ct) - b * mean(spaceShuttle$temperature)
a
## [1] 4.301587
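As a cross-check on the manual calculation, the same coefficients can be obtained from R's built-in lm() function. The sketch below is self-contained: it re-enters the temperature and distress_ct values printed above as plain vectors (the names temperature, distress_ct and fit are chosen here for illustration) rather than assuming the spaceShuttle data frame is loaded.

```r
# Values copied from the printed spaceShuttle columns above
temperature <- c(66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                 67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58)
distress_ct <- c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
                 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1)

# Manual estimates: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
b <- cov(temperature, distress_ct) / var(temperature)
a <- mean(distress_ct) - b * mean(temperature)

# The same simple regression fitted by least squares with lm()
fit <- lm(distress_ct ~ temperature)
coef(fit)

# TRUE when the manual intercept and slope agree with lm()
all.equal(unname(coef(fit)), c(a, b))
```

If the two approaches agree, coef(fit) reports the intercept first and the slope second, matching a and b.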
Thus, our regression line, y = a + bx with intercept a and slope b, can be plotted as below:
plot(spaceShuttle$temperature, spaceShuttle$distress_ct,xlab = "Temperature", ylab = "Distress_CT", main
abline(a,b)
[Figure: scatterplot of Distress_CT against Temperature with the fitted regression line]
Correlation
Now, we will find how closely the variables temperature and distress_ct are related to each other using the correlation formula.
rho<-cov(spaceShuttle$temperature,spaceShuttle$distress_ct)/(sd(spaceShuttle$temperature)*sd(spaceShuttl
rho
## [1] -0.725671
An alternative is to use the cor() function available in R.
rho_1<-cor(spaceShuttle$temperature, spaceShuttle$distress_ct)
rho_1
## [1] -0.725671
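The correlation and the regression slope are closely linked: for a simple regression, the slope equals the correlation multiplied by the ratio of the standard deviations, sd(y) / sd(x). The following sketch assumes the usual Pearson formula (covariance divided by the product of the two standard deviations) and re-enters the printed data as plain vectors so that it runs on its own; the names x, y and slope_from_rho are illustrative.

```r
# Values copied from the printed spaceShuttle columns
x <- c(66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
       67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58)   # temperature
y <- c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1)              # distress_ct

# Pearson correlation: covariance scaled by both standard deviations
rho <- cov(x, y) / (sd(x) * sd(y))

# The regression slope recovered from the correlation
slope_from_rho <- rho * sd(y) / sd(x)

rho
slope_from_rho

# Both checks should be TRUE
all.equal(rho, cor(x, y))
all.equal(slope_from_rho, cov(x, y) / var(x))
```

Because the two standard deviations are positive, the slope and the correlation always share the same sign; here both are negative.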
Medical Expenses Prediction
Here, we are going to predict medical expenses using the dataset “insurance.csv”.
First, we load the dataset.
medication <- read.csv("insurance.csv", header = TRUE, stringsAsFactors = TRUE)
Now, we will summarize the data as before, using the str() function
str(medication)
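To see what the stringsAsFactors argument changes in the structure reported by str(), the sketch below reads a tiny made-up CSV from a string. The three rows and their values are hypothetical and are not taken from insurance.csv; only the column names are borrowed from the variables used in this report.

```r
# Hypothetical three-row sample; the values are invented for illustration only
csv_text <- "age,sex,smoker,charges
25,female,no,3000
40,male,yes,20000
33,male,no,4500"

# Text columns become factors (as in the call used for the insurance data)
as_factors <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

# Text columns stay as character vectors
as_strings <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

str(as_factors)
str(as_strings)

# Factor levels are sorted alphabetically by default
levels(as_factors$sex)
```

Factors matter later on, because lm() converts each factor into indicator (dummy) variables automatically.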
[Figure: histogram titled "Medication Charges Distribution"; x-axis: Charges, y-axis: Frequency]
Next, we will use the table() function to summarize the categorical variable region of the medical data.
table(medication$region)
##
## northeast northwest southeast southwest
##       324       325       364       325
Correlation Matrix
We will first see how the independent variables are related to the dependent variable using the correlation
matrix.
cor(medication[c("age", "bmi", "children", "charges")])
Scatterplot Matrix
We can show the scatterplot of various variables simultaneously using the following function available in R.
pairs(medication[c("age", "bmi", "children", "charges")])
[Figure: scatterplot matrix of age, bmi, children and charges]
Multiple Regression
Now, we will fit a multiple linear regression model to the data.
reg_model<-lm(charges~ age + children + bmi + sex + smoker + region, data=medication)
To see the estimated beta coefficients of the independent variables, type the name of the model object.
reg_model
##
## Call:
## lm(formula = charges ~ age + children + bmi + sex + smoker +
## region, data = medication)
##
## Coefficients:
##     (Intercept)              age         children              bmi
##        -11938.5            256.9            475.5            339.2
##         sexmale        smokeryes  regionnorthwest  regionsoutheast
##          -131.3          23848.5           -353.0          -1035.0
## regionsouthwest
##          -960.1
We can also summarize the model generated as follows.
summary(reg_model)
##
## Call:
## lm(formula = charges ~ age + children + bmi + sex + smoker +
## region, data = medication)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11304.9  -2848.1   -982.1   1393.9  29992.8
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
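The last three lines of the summary are linked by simple arithmetic. As a rough check, the sketch below recomputes the adjusted R-squared and the F-statistic from the rounded values printed above, using the standard textbook formulas. Because the inputs are rounded, the results are expected to match the printed output only to the displayed precision. The variable names are illustrative.

```r
# Values read from the printed summary
r_squared <- 0.7509   # Multiple R-squared
df_resid  <- 1329     # residual degrees of freedom
p         <- 8        # number of predictors (numerator DF of the F-statistic)

# Number of observations: residual DF + predictors + intercept
n <- df_resid + p + 1
n   # 1338, which equals the total of the region table (324 + 325 + 364 + 325)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

# F-statistic: explained variance per predictor over residual variance per DF
f_stat <- (r_squared / p) / ((1 - r_squared) / df_resid)

round(adj_r_squared, 4)   # 0.7494
round(f_stat, 1)          # 500.8
```

An R-squared of 0.7509 means the model accounts for roughly 75 percent of the variation in charges.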