
Q1. Code for cluster analysis.

library(class)

library(MASS)

library(ISLR)

library(car)

library(leaps)

#Get the data in R

pharma <- read.csv("C:/Users/Pranav Naik/Desktop/Pharmaceuticals.csv")

#Normalize the numeric variables first

#The first and second columns of the data are names, so to get the numeric data matrix, delete the first two columns

#Note: pharma[-1:2] fails because -1:2 expands to -1, 0, 1, 2; pharma[-(1:2)] would drop both columns in one step

pharma.num <- pharma[-1]

pharma.final <- pharma.num[-1]

pharma.scaled <- scale(pharma.final)

#Elbow Method for finding the optimal number of cluster

set.seed(100)

# Compute and plot wss for k = 1 to k = 10.

k.max = 10

data = pharma.scaled

wss = sapply(1:k.max,function(k){kmeans(data, k, nstart=50,iter.max = 15 )$tot.withinss})

wss

plot(1:k.max, wss, type="b", pch = 19, frame = FALSE,

xlab="Number of clusters K",

ylab="Total within-clusters sum of squares")
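The elbow can also be read numerically instead of by eye. The sketch below is a minimal illustration that uses the four numeric columns of R's built-in iris data as a stand-in for the pharmaceuticals data (an assumption, since the CSV is not available here); the names demo.scaled and drop.pct are illustrative only.

```r
# Stand-in data: numeric columns of the built-in iris set (an assumption;
# substitute pharma.scaled once the CSV has been read in)
demo.scaled <- scale(iris[, 1:4])

set.seed(100)
k.max <- 10

# Total within-cluster sum of squares for k = 1, ..., k.max
wss <- sapply(1:k.max, function(k) {
  kmeans(demo.scaled, centers = k, nstart = 50, iter.max = 15)$tot.withinss
})

# Percentage reduction in wss gained by each additional cluster;
# the elbow is roughly where these gains level off
drop.pct <- round(100 * -diff(wss) / wss[-k.max], 1)
data.frame(k = 2:k.max, wss = round(wss[-1], 1), drop.pct = drop.pct)
```

Because the data are standardized, wss for k = 1 equals (n - 1) times the number of columns, which gives a quick sanity check on the scaling step.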


Q2. A. Code for the classification

# Fitting Classification Trees

install.packages("tree")

install.packages("ISLR")

library(tree)

library(ISLR)

heart <- read.csv("C:/Users/Pranav Naik/Desktop/Heart.csv")

attach(heart)

#Fit the classification trees

tree.hearts=tree(hd~.,heart)

tree.hearts

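#The summary() function lists the variables used in the tree, the number of terminal nodes, and the misclassification error rate: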

summary(tree.hearts)

#Plot the tree

plot(tree.hearts)

plot(tree.hearts, uniform=TRUE,margin=0.2)

text(tree.hearts, use.n=TRUE, all=TRUE, cex=.5)


From the above classification tree it can be seen that the tree has 19 terminal nodes.

2b. For pruning, we first divide the data into training and test sets. (We have assumed 300 samples for
training and the remaining 162 samples for testing.)

#Breaking the data into training and test sets

set.seed(2)

train=sample(1:nrow(heart), 300)

test=heart[-train,]

We then fit the tree on the training data.

#Fitting the tree on the training dataset

tree.hearts=tree(hd~.,heart,subset=train)

tree.pred=predict(tree.hearts,test,type="class")

table(tree.pred,test$hd)

The cv.tree() function reports the number of terminal nodes of each tree considered ('size'), the corresponding cross-validation error ('dev'), and the value of the cost-complexity parameter ('k').

set.seed(3)

cv.heart=cv.tree(tree.hearts,FUN=prune.misclass)

names(cv.heart)

cv.heart

We then plot the error rate as a function of 'size'.

plot(cv.heart$size,cv.heart$dev,type="b")
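The best size can also be picked programmatically rather than from the plot. The sketch below assumes a list with the same 'size' and 'dev' components that cv.tree() returns; the numbers in cv.demo are made up for illustration and are not results from Heart.csv.

```r
# Hypothetical cv.tree()-style output; these values are invented for illustration
cv.demo <- list(size = c(12, 8, 5, 3, 1),
                dev  = c(60, 55, 55, 62, 80))

# Smallest cross-validation error, and the smallest tree that achieves it
min.dev   <- min(cv.demo$dev)
best.size <- min(cv.demo$size[cv.demo$dev == min.dev])
best.size
```

Taking the smallest size among ties is one reasonable design choice, since it favours the simpler tree when two sizes give the same cross-validation error. With real output, cv.heart would take the place of cv.demo.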

Finally, we apply the prune.misclass() function to prune the tree to the best size.

prune.hearts=prune.misclass(tree.hearts,best=9)

plot(prune.hearts)

text(prune.hearts,pretty=0)

We then test the performance of the pruned classification tree on the test data.

tree.pred=predict(prune.hearts,test,type="class")

table(tree.pred,test$hd)

The best pruned tree has 11 terminal nodes.
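A confusion table such as the ones produced by table() can be summarised as an accuracy or misclassification rate. The following is a minimal sketch with hypothetical predicted and actual labels (not taken from Heart.csv); the names pred, actual and conf.mat are illustrative.

```r
# Hypothetical predicted and actual class labels, invented for illustration
pred   <- factor(c("Yes", "No", "No", "Yes", "No", "Yes", "No", "No"),
                 levels = c("No", "Yes"))
actual <- factor(c("Yes", "No", "Yes", "Yes", "No", "No", "No", "No"),
                 levels = c("No", "Yes"))

# Rows are predictions, columns are the true classes
conf.mat <- table(Predicted = pred, Actual = actual)
conf.mat

# Correct predictions lie on the diagonal of the table
accuracy   <- sum(diag(conf.mat)) / sum(conf.mat)
error.rate <- 1 - accuracy
c(accuracy = accuracy, error.rate = error.rate)
```

Comparing this rate for the unpruned and pruned trees on the same test set is a simple way to check whether pruning helped.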

Q3. Scatter plot shows the correlation between two variables.

a. Correlation between Price and Age:


As the scatter plot shows, a downward-sloping regression line fits the points well. Hence
Price and Age appear to have a comparatively strong negative correlation.
b. Correlation between Age and Weight
The best-fitting line through the points is almost vertical, which indicates that there is low or
no correlation between the Age and Weight parameters of the given data.
c. Correlation between Weight and KM
The best-fitting line through the points is horizontal, which also indicates that there is low to
no correlation between the Weight and KM parameters of the given data.
d. Correlation between Price and Weight
The points on the scatter plot slope neither diagonally upward nor diagonally downward, so
there seems to be no positive or negative correlation between the two variables.
e. Correlation between Price and KM
A curved (roughly exponential) line fits the points better than a straight line, so no clear
positive or negative linear correlation between the two variables is evident.
f. Correlation between Age and KM
An upward-sloping regression line (with a slope steeper than 45 degrees) fits the points on
the scatter plot well. Hence the two parameters can be said to have a high degree of positive
correlation.
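Visual impressions from scatter plots can be checked against correlation coefficients. The sketch below uses R's built-in mtcars data purely as a stand-in (an assumption; the car data discussed here is not available), so the variable names differ from Price, Age, KM and Weight.

```r
# Stand-in data: four numeric columns from the built-in mtcars set (an assumption)
vars <- mtcars[, c("mpg", "wt", "hp", "qsec")]

# Pearson correlation matrix: values near -1 or 1 indicate a strong linear
# relationship, values near 0 indicate little or no linear relationship
cor.mat <- round(cor(vars), 2)
cor.mat

# Spearman's rank correlation captures monotone but curved relationships,
# which Pearson's coefficient can understate
cor(mtcars$mpg, mtcars$hp, method = "spearman")
```

Note that the steepness of a fitted line depends on the units of the axes, whereas the correlation coefficient does not, so the coefficient is the safer measure of strength.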

3 b. 1. The given boxplot shows the price for each fuel type. The bottom edge of the box
(rectangle) shows the value of the first quartile, the middle line shows the value of the second
quartile (the median, not the mean) and the top edge shows the value of the third quartile.

The bottom-most line of the boxplot (the end of the lower whisker, not a part of the box) indicates the
lowest value, while the top-most line (the end of the upper whisker) indicates the highest value of the
parameter, apart from any outliers drawn as separate points.

2. From the given boxplot we can see that the first quartile is almost the same for CNG and
Diesel and slightly higher for Petrol.

3. Similarly, the 2nd quartile value is higher for Petrol than for CNG and Diesel, which shows that
the median price of Petrol is slightly higher than that of CNG and Diesel.

4. On similar lines, the 3rd quartile value is higher for Diesel than for CNG and Petrol, which shows
that the third-quartile price of Diesel is slightly higher than that of CNG and Petrol.

5. Diesel has both the lowest and the highest price among all the fuel types; however, its median
price is lower than that of Petrol.
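The values a boxplot draws can be inspected directly. The following sketch uses a small hypothetical price vector (invented for illustration) to show that the middle line of the box is the median, which can differ markedly from the mean, and that extreme points are reported separately from the whiskers.

```r
# Hypothetical prices with one extreme value, invented for illustration
price <- c(1, 2, 3, 4, 100)

median(price)  # the middle line of the box
mean(price)    # not drawn by boxplot(); pulled upward by the extreme value

# boxplot.stats() returns the five values that boxplot() draws
bs <- boxplot.stats(price)
bs$stats  # lower whisker, lower hinge (about Q1), median, upper hinge (about Q3), upper whisker
bs$out    # points beyond the whiskers, drawn individually as outliers
```

For grouped data, tapply() with median as the function would give one median per fuel type, assuming columns for price and fuel type exist in the data frame.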
