
Practitioners Course in Descriptive, Predictive and Prescriptive Analytics
Regression in R

Set the working directory


#setwd("your project working directory")

Space Shuttle Challenger dataset


First, we import the Space Shuttle Challenger dataset.
spaceShuttle <- read.csv("challenger.csv", header = TRUE)

Next, we inspect the structure of the whole dataset with the str() function.


str(spaceShuttle)

## 'data.frame': 23 obs. of 5 variables:
## $ o_ring_ct : int 6 6 6 6 6 6 6 6 6 6 ...
## $ distress_ct: int 0 1 0 0 0 0 0 0 1 1 ...
## $ temperature: int 66 70 69 68 67 72 73 70 57 63 ...
## $ pressure : int 50 50 50 50 50 50 100 100 200 200 ...
## $ launch_id : int 1 2 3 4 5 6 7 8 9 10 ...

Regression Equation for two variables

Here, we develop the regression equation with temperature as the independent variable and distress_ct as
the dependent variable.
We use the covariance and variance functions available in R to calculate the coefficients of the regression
equation.
Temperature data is as shown below:
spaceShuttle$temperature

## [1] 66 70 69 68 67 72 73 70 57 63 70 78 67 53 67 75 70 81 76 79 75 76 58
Distress_ct data is as shown below:
spaceShuttle$distress_ct

## [1] 0 1 0 0 0 0 0 0 1 1 1 0 0 2 0 0 0 0 0 0 0 0 1
Now, we compute the slope b of the regression line as the covariance of the two variables divided by the
variance of temperature.
b <- cov(spaceShuttle$temperature, spaceShuttle$distress_ct) / var(spaceShuttle$temperature)
b

## [1] -0.05746032

Next, we compute the intercept a of the regression equation as follows:
a <- mean(spaceShuttle$distress_ct) - b * mean(spaceShuttle$temperature)
a

## [1] 4.301587
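As a cross-check, the hand-computed coefficients can be compared with R's built-in lm() function. The sketch below re-enters the 23 temperature and distress_ct values printed above as plain vectors, so that it runs without challenger.csv; the names temperature, distress_ct and fit are chosen here for illustration only.

```r
# Values copied from the printed spaceShuttle columns above
temperature <- c(66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                 67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58)
distress_ct <- c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
                 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1)

# Hand-computed slope and intercept, as in the text
b <- cov(temperature, distress_ct) / var(temperature)
a <- mean(distress_ct) - b * mean(temperature)

# Least-squares fit of the same line with lm()
fit <- lm(distress_ct ~ temperature)

# Both approaches should agree up to floating-point error
print(c(intercept = a, slope = b))
print(coef(fit))
```

Both print statements should show the same intercept and slope, since cov(x, y) / var(x) is the closed-form least-squares slope for a single predictor.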
Thus, our regression line y = a + bx, with intercept a and slope b, can be plotted over the data as below.
plot(spaceShuttle$temperature, spaceShuttle$distress_ct, xlab = "Temperature", ylab = "Distress_CT", main = "Regression Line and Data")
abline(a, b)

[Figure: "Regression Line and Data". Scatterplot of Distress_CT (y-axis, 0.0 to 2.0) against Temperature (x-axis, 55 to 80) with the fitted regression line.]
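To illustrate how the fitted line is used, the following sketch evaluates it at a few temperatures. It assumes the rounded coefficients reported above; the temperatures 60, 70 and 80 are hypothetical values picked inside the observed range (53 to 81), not part of the original analysis.

```r
# Coefficients as reported in the text (rounded)
a <- 4.301587      # intercept
b <- -0.05746032   # slope

# Hypothetical launch temperatures inside the observed range
new_temperature <- c(60, 70, 80)

# Expected distress count read off the fitted line
predicted <- a + b * new_temperature
print(data.frame(temperature = new_temperature,
                 predicted_distress = predicted))
```

The predicted count falls as temperature rises and becomes slightly negative at 80, which is a reminder that a straight line is only a rough approximation for count data.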

Correlation

Now, we measure how closely the variables Temperature and Distress_ct are related using the correlation
formula: the covariance divided by the product of the two standard deviations.
rho <- cov(spaceShuttle$temperature, spaceShuttle$distress_ct) / (sd(spaceShuttle$temperature) * sd(spaceShuttle$distress_ct))

rho

## [1] -0.725671
Alternatively, we can use the cor() function available in R.
rho_1<-cor(spaceShuttle$temperature, spaceShuttle$distress_ct)

rho_1

## [1] -0.725671
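The correlation and the regression slope are closely linked: the slope equals the correlation multiplied by sd(y) / sd(x), and for a single predictor the squared correlation equals the R-squared of the fitted line. The sketch below checks both relations on the values printed earlier; the vector names and the helper variables b_from_rho and r_squared are illustrative assumptions.

```r
# Values copied from the printed spaceShuttle columns above
temperature <- c(66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                 67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58)
distress_ct <- c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
                 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1)

# Pearson correlation
rho <- cor(temperature, distress_ct)

# Slope recovered from the correlation and the two standard deviations
b_from_rho <- rho * sd(distress_ct) / sd(temperature)

# R-squared of the simple linear regression
fit <- lm(distress_ct ~ temperature)
r_squared <- summary(fit)$r.squared

print(c(rho = rho, slope = b_from_rho,
        rho_squared = rho^2, r_squared = r_squared))
```

Under these values, roughly half of the variance in distress_ct (rho squared is about 0.53) is accounted for by temperature.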

Medical Expenses Prediction
Here, we predict medical expenses using the dataset "insurance.csv".
First, we load the dataset.
medication <- read.csv("insurance.csv", header = TRUE, stringsAsFactors = TRUE)

Now, we inspect the structure of the data as before, using the str() function.
str(medication)

## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ charges : num 16885 1726 4449 21984 3867 ...
Here, the dependent variable is charges. The summary() function reports its five-number summary
(minimum, quartiles, maximum) together with the mean, as below.
summary(medication$charges)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1122    4740    9382   13270   16640   63770

The histogram of "charges", generated below, shows that the distribution is right-skewed.

hist(medication$charges, main = "Medication Charges Distribution", xlab = "Charges", ylab = "Frequency")

[Figure: "Medication Charges Distribution". Histogram of Charges (x-axis, 0 to 60000) against Frequency (y-axis, 0 to 350).]
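The skew can also be checked numerically from the summary() output alone. The sketch below uses the rounded values printed above and computes the quartile (Bowley) skewness, (Q3 + Q1 - 2 * median) / (Q3 - Q1); this measure is not used in the original analysis and is offered here only as a quick sanity check.

```r
# Rounded values copied from summary(medication$charges)
q1       <- 4740
med      <- 9382
mean_chg <- 13270
q3       <- 16640

# A mean well above the median suggests a long right tail
mean_above_median <- mean_chg > med

# Quartile (Bowley) skewness: positive values indicate right skew
bowley <- (q3 + q1 - 2 * med) / (q3 - q1)

print(mean_above_median)
print(round(bowley, 3))
```

Both indicators point the same way as the histogram: the mean exceeds the median and the quartile skewness is positive.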
Next, we use the table() function to count the observations in each level of the categorical variable region.
table(medication$region)

##
## northeast northwest southeast southwest
## 324 325 364 325
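Counts are often easier to compare as percentages. The sketch below re-enters the four counts printed above as a named vector (region_counts is an illustrative name) and converts them to shares of the total.

```r
# Counts copied from the table(medication$region) output above
region_counts <- c(northeast = 324, northwest = 325,
                   southeast = 364, southwest = 325)

# The total should match the 1338 observations reported by str()
total <- sum(region_counts)

# Share of each region in percent, rounded to one decimal
region_share <- round(100 * region_counts / total, 1)

print(total)
print(region_share)
```

With the data frame loaded, prop.table(table(medication$region)) should give the same proportions directly.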

Correlation Matrix

We will first see how the independent variables are related to the dependent variable using the correlation
matrix.
cor(medication[c("age", "bmi", "children", "charges")])

##                age       bmi   children    charges
## age      1.0000000 0.1092719 0.04246900 0.29900819
## bmi      0.1092719 1.0000000 0.01275890 0.19834097
## children 0.0424690 0.0127589 1.00000000 0.06799823
## charges  0.2990082 0.1983410 0.06799823 1.00000000
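Here, age (0.30) and bmi (0.20) show weak-to-moderate positive correlations with charges, while children
(0.07) is almost uncorrelated with charges. The independent variables are only weakly correlated with each other.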

Scatterplot Matrix

We can show the scatterplot of various variables simultaneously using the following function available in R.
pairs(medication[c("age", "bmi", "children", "charges")])

[Figure: scatterplot matrix of age, bmi, children and charges.]

Multiple Regression

Train the model

Now, we fit a multiple linear regression model to the data, with charges as the dependent variable.
reg_model<-lm(charges~ age + children + bmi + sex + smoker + region, data=medication)

To see the estimated beta coefficients of the independent variables, type the name of the model object.
reg_model

##
## Call:
## lm(formula = charges ~ age + children + bmi + sex + smoker +
## region, data = medication)
##
## Coefficients:
##     (Intercept)              age         children              bmi
##        -11938.5            256.9            475.5            339.2
##         sexmale        smokeryes  regionnorthwest  regionsoutheast
##          -131.3          23848.5           -353.0          -1035.0
## regionsouthwest
##          -960.1
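To see how these coefficients combine, the following sketch computes a prediction by hand. It assumes the rounded coefficients printed above and a hypothetical person (30 years old, BMI 30, no children, female, non-smoker, northeast). The levels female, no and northeast have no coefficient of their own because R treats the first level of each factor as the baseline, so their dummy variables are 0.

```r
# Coefficients copied from the printed reg_model output (rounded)
beta <- c("(Intercept)" = -11938.5, age = 256.9, children = 475.5,
          bmi = 339.2, sexmale = -131.3, smokeryes = 23848.5,
          regionnorthwest = -353.0, regionsoutheast = -1035.0,
          regionsouthwest = -960.1)

# Hypothetical person; baseline factor levels are encoded as 0
x <- c("(Intercept)" = 1, age = 30, children = 0, bmi = 30,
       sexmale = 0, smokeryes = 0, regionnorthwest = 0,
       regionsoutheast = 0, regionsouthwest = 0)

# Prediction = sum of coefficient * value over all terms
predicted_charges <- sum(beta * x[names(beta)])
print(predicted_charges)

# The same person as a smoker: only the smokeryes dummy changes
x_smoker <- x
x_smoker["smokeryes"] <- 1
predicted_smoker <- sum(beta * x_smoker[names(beta)])
print(predicted_smoker)
```

The two predictions differ by exactly the smokeryes coefficient, which is how a dummy-variable coefficient is interpreted. With the model object available, predict(reg_model, newdata = ...) performs the same calculation using the unrounded coefficients.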
We can also summarize the model generated as follows.
summary(reg_model)

##
## Call:
## lm(formula = charges ~ age + children + bmi + sex + smoker +
## region, data = medication)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11304.9  -2848.1   -982.1   1393.9  29992.8
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
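Several figures in this summary can be reproduced from the others, which helps in reading the output. The sketch below, using only the rounded values printed above, recomputes the adjusted R-squared from the multiple R-squared and the t value for age from its estimate and standard error; the variable names are illustrative.

```r
n <- 1338            # observations, from str(medication)
p <- 8               # estimated coefficients excluding the intercept
r_squared <- 0.7509  # Multiple R-squared from the summary

# Residual degrees of freedom: n - p - 1
df_residual <- n - p - 1

# Adjusted R-squared penalises the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_residual

# t value = estimate / standard error (shown here for age)
t_age <- 256.9 / 11.9

print(df_residual)
print(round(adj_r_squared, 4))
print(round(t_age, 2))
```

The residual degrees of freedom and adjusted R-squared match the summary (1329 and 0.7494). The t value comes out close to, but not exactly, the printed 21.587 because the inputs here are rounded.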

