Rcodes2 RegressionInR

Practitioners Course In Descriptive, Predictive And Prescriptive Analytics
Regression in R
Set the working directory

#setwd("your project working directory")
Space Shuttle Challenger dataset

First, we import the Space Shuttle Challenger dataset.
spaceShuttle <- read.csv("challenger.csv", header = TRUE)
Now inspect the structure of the whole dataset with the str() function.

str(spaceShuttle)
## 'data.frame': 23 obs. of 5 variables:

## $ o_ring_ct : int 6 6 6 6 6 6 6 6 6 6 ...
## $ distress_ct: int 0 1 0 0 0 0 0 0 1 1 ...
## $ temperature: int 66 70 69 68 67 72 73 70 57 63 ...
## $ pressure : int 50 50 50 50 50 50 100 100 200 200 ...
## $ launch_id : int 1 2 3 4 5 6 7 8 9 10 ...
Regression Equation for two variables
Here, we are going to develop the regression equation with temperature as the independent variable
and distress_ct as the dependent variable.
We use the covariance and variance functions available in R, cov() and var(), to calculate the coefficients of
the regression equation.
Temperature data is as shown below:
spaceShuttle$temperature
## [1] 66 70 69 68 67 72 73 70 57 63 70 78 67 53 67 75 70 81 76 79 75 76 58
Distress_ct data is as shown below:
spaceShuttle$distress_ct
## [1] 0 1 0 0 0 0 0 0 1 1 1 0 0 2 0 0 0 0 0 0 0 0 1
Now, we evaluate the covariance and variance required to calculate the regression
coefficients. The slope b is the covariance of the two variables divided by the variance of temperature:
b <- cov(spaceShuttle$temperature, spaceShuttle$distress_ct) / var(spaceShuttle$temperature)
b
## [1] -0.05746032
Now, we will evaluate the intercept a of the regression equation as follows:
a <- mean(spaceShuttle$distress_ct) - (b * mean(spaceShuttle$temperature))
a
## [1] 4.301587
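As a cross-check, the hand-computed coefficients can be compared with what lm() returns for the same two variables. The sketch below is self-contained: it retypes the temperature and distress_ct values printed above instead of reading challenger.csv, and the names b_manual, a_manual and fit are illustrative, not part of the original script.

```r
# Data copied from the printed vectors above (23 launches)
temperature <- c(66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                 67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58)
distress_ct <- c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
                 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1)

# Manual estimates: slope = cov(x, y) / var(x),
# intercept = mean(y) - slope * mean(x)
b_manual <- cov(temperature, distress_ct) / var(temperature)
a_manual <- mean(distress_ct) - b_manual * mean(temperature)

# The same simple regression fitted with lm()
fit <- lm(distress_ct ~ temperature)
coef(fit)

# Both approaches should agree up to floating-point error
all.equal(unname(coef(fit)), c(a_manual, b_manual))
```

In practice lm() is the more convenient route, since it also provides standard errors and diagnostics; the manual formulas are mainly useful for seeing where the numbers come from.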
Thus, our regression line, in the form y = a + bx with intercept a and slope b, can be plotted as below.
plot(spaceShuttle$temperature, spaceShuttle$distress_ct, xlab = "Temperature", ylab = "Distress_CT", main = "Regression Line and Data")
abline(a,b)
[Figure: "Regression Line and Data". Scatterplot of Distress_CT (y-axis, 0.0 to 2.0) against Temperature (x-axis, 55 to 80) with the fitted regression line.]
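The fitted line can also be used numerically, not only drawn. The following sketch recomputes a and b from the printed vectors, derives the fitted value and residual for each launch, and evaluates the line at a hypothetical temperature of 60 (new_temp is chosen purely for illustration and does not come from the dataset description).

```r
# Data copied from the printed vectors above
temperature <- c(66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                 67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58)
distress_ct <- c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
                 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1)

# Slope and intercept, as computed earlier in this section
b <- cov(temperature, distress_ct) / var(temperature)
a <- mean(distress_ct) - b * mean(temperature)

# Fitted value on the line for every observed temperature
fitted_ct <- a + b * temperature

# Residual = observed value minus fitted value
residual <- distress_ct - fitted_ct

# For a least-squares line with an intercept, residuals sum to zero
# (up to floating-point rounding)
sum(residual)

# Value of the line at a hypothetical temperature (illustrative only)
new_temp <- 60
predicted_ct <- a + b * new_temp
predicted_ct
```

Evaluating the line far outside the observed temperature range (53 to 81) would be extrapolation and should be treated with caution.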
Correlation
Now, we will find how closely the variables temperature and distress_ct are related to each other using
the correlation formula: the covariance divided by the product of the two standard deviations.
rho <- cov(spaceShuttle$temperature, spaceShuttle$distress_ct) / (sd(spaceShuttle$temperature) * sd(spaceShuttle$distress_ct))
rho
## [1] -0.725671
Alternatively, use the cor() function available in R.
rho_1<-cor(spaceShuttle$temperature, spaceShuttle$distress_ct)
rho_1
## [1] -0.725671
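The correlation and the regression slope are closely linked: the slope equals the correlation rescaled by the ratio of the standard deviations, and in a simple regression the squared correlation equals R-squared. The sketch below checks both identities on the printed data; rho_from_slope and r_squared are illustrative names.

```r
# Data copied from the printed vectors above
temperature <- c(66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                 67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58)
distress_ct <- c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
                 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1)

# Pearson correlation and regression slope
rho <- cor(temperature, distress_ct)
b <- cov(temperature, distress_ct) / var(temperature)

# Identity 1: rho = b * sd(x) / sd(y)
rho_from_slope <- b * sd(temperature) / sd(distress_ct)
all.equal(rho, rho_from_slope)

# Identity 2: in simple linear regression, R-squared = rho^2
r_squared <- summary(lm(distress_ct ~ temperature))$r.squared
all.equal(rho^2, r_squared)
```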
Medical Expenses Prediction
Here, we are going to predict medical expenses using the dataset "insurance.csv".
First, we load the dataset.
medication <- read.csv("insurance.csv", header = TRUE, stringsAsFactors = TRUE)
Now, we will inspect the structure of the data as before, using the str() function.
str(medication)
## 'data.frame': 1338 obs. of 7 variables:

## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ charges : num 16885 1726 4449 21984 3867 ...
Here, the dependent variable is charges. So, we will use summary() to get its five-number summary
(minimum, quartiles, maximum) together with the mean, as below.
summary(medication$charges)
## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 1122 4740 9382 13270 16640 63770
The mean is well above the median, and the histogram of charges below confirms that the data is right-skewed.
hist(medication$charges, main = "Medication Charges Distribution", xlab = "Charges", ylab = "Frequency")
[Figure: "Medication Charges Distribution". Histogram of Charges (x-axis, 0 to 60000) against Frequency (y-axis).]
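A quick numerical sign of right skew is a mean that exceeds the median, which is what the summary of charges shows (13270 versus 9382). The sketch below illustrates this on simulated data only: charges_sim is a hypothetical log-normal sample, not the insurance data, and its parameters are arbitrary.

```r
# Simulated right-skewed sample (hypothetical; NOT the insurance data)
set.seed(42)
charges_sim <- rlnorm(1000, meanlog = 9, sdlog = 1)

# summary() reports the five-number summary plus the mean
summary(charges_sim)

# In a right-skewed distribution the long upper tail pulls the mean
# above the median
mean(charges_sim) > median(charges_sim)

# The same shape shows up in a histogram
hist(charges_sim, main = "Simulated right-skewed sample",
     xlab = "Value", ylab = "Frequency")
```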
Next, we will use the table() function to summarize the categorical variable region of the medical data.
table(medication$region)
##
## northeast northwest southeast southwest
## 324 325 364 325
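Counts are often easier to compare as proportions. The sketch below retypes the region counts printed above into a named vector (region_counts is an illustrative name) and converts them to percentages with prop.table(); with the data loaded, the same idea would apply to the result of table(medication$region).

```r
# Region counts copied from the table() output above
region_counts <- c(northeast = 324, northwest = 325,
                   southeast = 364, southwest = 325)

# The counts add up to the number of observations reported by str()
sum(region_counts)

# Share of each region, as a percentage rounded to one decimal place
region_pct <- round(100 * prop.table(region_counts), 1)
region_pct
```

The four regions are close to evenly represented, with the southeast slightly larger than the others.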
Correlation Matrix
We will first see how the independent variables are related to the dependent variable using the correlation
matrix.
cor(medication[c("age", "bmi", "children", "charges")])
## age bmi children charges

## age 1.0000000 0.1092719 0.04246900 0.29900819
## bmi 0.1092719 1.0000000 0.01275890 0.19834097
## children 0.0424690 0.0127589 1.00000000 0.06799823
## charges 0.2990082 0.1983410 0.06799823 1.00000000
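Here, age (0.30) and bmi (0.20) show a weak-to-moderate positive correlation with charges, while children (0.07) is almost uncorrelated with charges.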
Scatterplot Matrix
We can show the scatterplot of various variables simultaneously using the following function available in R.
pairs(medication[c("age", "bmi", "children", "charges")])
[Figure: scatterplot matrix of age, bmi, children and charges.]
Multiple Regression
Train the model
Now, we will fit a multiple linear regression model to the data.
reg_model<-lm(charges~ age + children + bmi + sex + smoker + region, data=medication)
To see the estimated beta coefficients of the independent variables, type the name of the model object.
reg_model
##
## Call:
## lm(formula = charges ~ age + children + bmi + sex + smoker +
## region, data = medication)
##
## Coefficients:
## (Intercept) age children bmi
## -11938.5 256.9 475.5 339.2
## sexmale smokeryes regionnorthwest regionsoutheast
## -131.3 23848.5 -353.0 -1035.0
## regionsouthwest
## -960.1
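The coefficient names sexmale, smokeryes and regionnorthwest arise because lm() expands each factor into dummy (0/1) variables, leaving the first level out as the reference category. The sketch below shows this expansion with model.matrix() on a tiny hypothetical data frame (toy is made up for illustration and is not the insurance data).

```r
# Tiny hypothetical data frame with the same factor levels as above
toy <- data.frame(
  smoker = factor(c("no", "yes", "no", "yes")),
  region = factor(c("northeast", "northwest", "southeast", "southwest"))
)

# Levels are sorted alphabetically; the first one is the reference level
levels(toy$smoker)
levels(toy$region)

# model.matrix() builds the design matrix that lm() would use:
# one 0/1 column per non-reference level, plus the intercept
design <- model.matrix(~ smoker + region, data = toy)
colnames(design)
design
```

Each dummy coefficient is therefore read relative to the reference level: regionsoutheast, for example, is the estimated difference in charges between the southeast and the northeast, holding the other variables fixed.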
We can also summarize the model generated as follows.
summary(reg_model)
##
## Call:
## lm(formula = charges ~ age + children + bmi + sex + smoker +
## region, data = medication)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## bmi 339.2 28.6 11.860 < 2e-16 ***
## sexmale -131.3 332.9 -0.394 0.693348
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
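To see how the estimates combine into a prediction, the rounded coefficients from the output above can be applied by hand. The sketch below does this for a hypothetical person (30 years old, 2 children, BMI 30, male, living in the northeast, which is the reference region); the person and the object names are invented for illustration. With the fitted model available, predict(reg_model, newdata = ...) would do the same job without rounding.

```r
# Rounded coefficients copied from the summary output above
# (region terms omitted because northeast is the reference level)
coefs <- c(intercept = -11938.5, age = 256.9, children = 475.5,
           bmi = 339.2, sexmale = -131.3, smokeryes = 23848.5)

# Hypothetical person: age 30, 2 children, BMI 30, male, non-smoker
person <- c(intercept = 1, age = 30, children = 2,
            bmi = 30, sexmale = 1, smokeryes = 0)

# Prediction = sum of coefficient * predictor value
predicted_non_smoker <- sum(coefs * person)
predicted_non_smoker

# Same person as a smoker: only the smokeryes dummy changes
person_smoker <- replace(person, "smokeryes", 1)
predicted_smoker <- sum(coefs * person_smoker)

# The difference equals the smokeryes coefficient
predicted_smoker - predicted_non_smoker
```

This makes the interpretation of smokeryes concrete: with all other predictors held fixed, the model estimates charges about 23848.5 higher for a smoker.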
