SOLUTION
1. Data Preparation
For the assignment, we are told to use the mtcars data set, which ships with R and describes 32 cars by 11 numeric variables.
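The steps below all work on a variable called mydata. Its construction is not shown in this extract; a minimal sketch, assuming the common preparation of standardizing the mtcars columns:
data(mtcars)
mydata <- scale(mtcars)   # assumption: standardize columns so no single variable dominates the distances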
2. Number of Clusters
One way to decide how many clusters to use is to plot the within-groups sum of squares against the number of clusters. We can do this in R by:
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
The resulting graph will be:
[Figure: within-groups sum of squares vs. number of clusters]
From the graph we can determine the best number of clusters for the data set using the Elbow Method: we look for the point on the graph where adding another cluster no longer reduces the within-groups sum of squares by much, and take that as the optimum number of clusters (the point where the curve forms an 'elbow'). On our graph, that point is at 5 clusters.
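As a quick numeric check (my own addition, not part of the assignment steps), the marginal improvement gained by each extra cluster can be inspected directly; it should shrink sharply past the elbow:
drop <- -diff(wss)   # reduction in within-groups sum of squares per added cluster
round(drop, 1)       # the reductions flatten out after the elbow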
3. Clustering Algorithms
1. Partitioning with kmeans()
We can create the clusters with the k-means algorithm in R by:
# K-Means Cluster Analysis
fitk <- kmeans(mydata, 5) # 5 cluster solution
# get cluster means
aggregate(mydata,by=list(fitk$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, fitk$cluster)
Note that we set the number of clusters to be generated to 5, according to the graph we found earlier. 'fitk' is the variable that stores our clustering result.
We can see the cluster assignment of each observation with:
fitk$cluster
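2. Hierarchical Agglomerative with hclust()
The hierarchical fit fitha used below comes from a step not shown in this extract. A minimal sketch of how it was presumably produced (the Euclidean distance and Ward linkage are my assumptions; the distance matrix is stored in a variable named dist because that name is reused in step 6):
dist <- dist(mydata)                     # distance matrix between observations (Euclidean by default)
fitha <- hclust(dist, method="ward.D2")  # agglomerative hierarchical clustering
plot(fitha)                              # dendrogram
rect.hclust(fitha, k=5, border="red")    # outline the 5-cluster cut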
We can show the cluster groups of the hierarchical result, just like we did with kmeans, using:
cutree(fitha, k = 5) #cutting the tree to 5 clusters
We can also draw a graph that shows the p-values of each hierarchical cluster. We do this using the pvclust library. However, this library clusters by columns, not rows, and our 'mydata' is arranged by rows, so before using the library we must transpose mydata with:
mydatat = t(mydata)
Then we run pvclust on the transposed data:
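(The pvclust call itself is missing from this extract; a minimal sketch, where the variable name fitp and the distance/linkage choices are my assumptions:)
library(pvclust)
fitp <- pvclust(mydatat, method.hclust="ward.D2", method.dist="euclidean")
plot(fitp)                # dendrogram annotated with bootstrap p-values
pvrect(fitp, alpha=0.95)  # box the clusters with strong support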
3. Model Based with Mclust()
For the model-based algorithm, we can also check the classification of each observation, like before, using:
fitmb$classification
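For reference, fitmb itself comes from the model-based step, which is also missing from this extract; presumably something like the following, where the defaults let the algorithm pick the number of clusters by BIC (matching the two-cluster result noted next):
library(mclust)
fitmb <- Mclust(mydata)   # model-based clustering; BIC selects the number of clusters
summary(fitmb)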
Note that the algorithm decides that only two clusters are needed for our data. However, we can 'force' Mclust() to generate any number of clusters we want. I set it to 5 clusters so we can compare with our previous two algorithms:
fitmb5 <- Mclust(mydata, G=5)
fitmb5$classification
6. Investigation
Using the within sum of square graph, how do you determine the number of clusters?
As I mentioned in step 2, we can use the sum-of-squares graph to find the number of clusters that gives a good result. From the graph we determine the best number of clusters for the data set using the Elbow Method: the point on the graph where adding another cluster no longer reduces the within-groups sum of squares by much is the optimum number of clusters (the point where the curve forms an 'elbow'). On our graph, that point is at 5 clusters.
Are the solutions from K-means always the same? Why?
No. In the k-means algorithm, the k initial cluster centers are randomly selected each time it is run, so running the same algorithm twice on the same data set with the same k value will most probably generate different but similar results. This is known as the initialization problem.
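A quick way to see this (my own illustration, not part of the original solution):
k1 <- kmeans(mydata, 5)
k2 <- kmeans(mydata, 5)
table(k1$cluster, k2$cluster)   # cross-tabulate the two partitions; labels are arbitrary,
                                # so mixed rows/columns mean the groupings differ
# Fixing the seed, or restarting from many random initializations, stabilizes the result:
set.seed(42)
k3 <- kmeans(mydata, 5, nstart=25)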
Compare the clustering results. Explain your findings.
Using the information from step 5, I ran cluster.stats() on the 4 clustering results obtained earlier: K-means with 5 clusters, Hierarchical Agglomerative cut at 5 clusters, Model Based with 2 clusters, and Model Based forced to 5 clusters. Recall that
a = fitk$cluster            # kmeans
b = cutree(fitha, k = 5)    # Hierarchical cut to 5 clusters
c = fitmb$classification    # Model Based
d = fitmb5$classification   # Model Based forced to 5 clusters
To analyze and compare the clusters, I computed the statistics of each clustering (cluster.stats() comes from the fpc package; 'dist' is the distance matrix from before):
library(fpc)
cluster.stats(dist, a)
cluster.stats(dist, b)
cluster.stats(dist, c)
cluster.stats(dist, d)
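The properties compared below can be pulled out of the returned list; for example, for the k-means result:
sa <- cluster.stats(dist, a)
sa$cluster.size      # size of each cluster
sa$average.between   # mean distance between points in different clusters
sa$average.within    # mean distance between points in the same cluster
sa$entropy           # entropy of the cluster membership distribution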
The results are shown and summarized in a table (not reproduced here). The properties below are the ones I used to compare and analyze the clustering algorithms.
Cluster Size: the more similar the cluster sizes, the better the quality of the result. For a 5-cluster result, K-means has the most similar sizes across all 5 clusters.
Average Between: refers to the inter-cluster distance between the 5 clusters. The higher the distance, the better the quality. From the table, the highest average goes to Hierarchical Agglomerative (2.485788) and the lowest to K-Means (2.463476).
Average Within: refers to the intra-cluster distance between the data points within a cluster. The smaller the intra-cluster distance, the better the quality. From the table, the lowest average goes to the Model Based algorithm with 5 clusters (2.135135) and the highest to the Model Based algorithm with 2 clusters (2.354167).
Entropy: measures the disorder of the cluster membership distribution; the lower the entropy, the better the clustering result. The lowest entropy according to the table is the Model Based algorithm with 2 clusters (0.6931472), while the highest is K-means (1.548269).
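For reference (my own note), the entropy reported by cluster.stats() is the Shannon entropy of the cluster-size shares, which a few lines of R make concrete:
entropy <- function(cl) {
  p <- table(cl) / length(cl)   # share of observations in each cluster
  -sum(p * log(p))
}
# Two equal-sized clusters give log(2) = 0.6931472, which matches the Model Based
# (2 clusters) value above, so its two clusters split the 32 cars evenly.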
What are your personal thoughts about the three different clustering algorithms?
The k-means algorithm divides the data into groups only once, so it performs well at runtime. Because of this it seems to be a quick and efficient way to cluster a large data set. However, to make sure the result is useful, the right number of clusters must be chosen properly, so the cluster quality is not as good as the other available algorithms.
Hierarchical Agglomerative takes longer to run because it builds the clusters across more than one level, so for a very large data set it will take more time. But the quality of the clustered data is better than k-means.
The Model Based algorithm feels more like an approximate clustering based on the model fitted to the data. During this lab it decided that 2 clusters were best, but we saw from the sum-of-squares graph that 5 is the better choice of cluster number. So its data quality is not as good as Hierarchical Agglomerative's.