Your Code
setwd("E:/Amrit/Dataset")
# Read file
mydata = read.csv(file.choose(), header = TRUE)
# Prepare Data
mydata.features <- na.omit(mydata)        # listwise deletion of missing
# mydata.features <- scale(mydata.features) # standardize variables
To see how many data points belong to a given cluster (e.g. cluster 1), type:
sum(result$cluster == 1)
To select only the variables in columns 2 and 3:
mydata[, 2:3]
First, let's plot the data. Since you have a multidimensional data frame and a
standard plot can only show two dimensions, select the variables you want to
plot, for example variables 2 and 3 (columns 2 and 3).
table(mydata$household_key, fit$cluster)
plot(mydata[c("DAY", "WEEK_NO")], col = mydata$household_key)
library(cluster)
clusplot(mydata.features,
         fit$cluster,
         color = TRUE,
         shade = TRUE,
         labels = 2,
         lines = 0)
3. Algorithm
Let n be the number of clusters you want
Let S be the set of feature vectors (|S| is the size of the set)
Let A be the set of associated clusters for each feature vector
Let sim(x,y) be the similarity function
Let c[n] be the vectors for our clusters
Init:
Let S' = S
// choose n random vectors to start our clusters
for i = 1 to n
    j = rand(|S'|)
    c[i] = S'[j]
    S' = S' - {c[i]}  // remove that vector from S' so we can't choose it again
end
// assign initial clusters
for i = 1 to |S|
    A[i] = argmax(j = 1 to n) { sim(S[i], c[j]) }
end
Run:
Let change = true
while change
    change = false  // assume there is no change
    // reassign feature vectors to clusters
    for i = 1 to |S|
        a = argmax(j = 1 to n) { sim(S[i], c[j]) }
        if a != A[i]
            A[i] = a
            change = true  // a vector changed affiliations -- so we need to
                           // recompute our cluster vectors and run again
        end
    end
    // recalculate cluster locations if a change occurred
    if change
        for i = 1 to n
            mean = 0
            count = 0
            for j = 1 to |S|
                if A[j] == i
                    mean = mean + S[j]
                    count = count + 1
                end
            end
            c[i] = mean / count
        end
    end
end
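The loop above can be sketched directly in R. This is only a minimal
illustration, not the built-in kmeans(): the function name kmeans_sketch is
made up, similarity is taken as negative squared Euclidean distance (so the
argmax over similarity becomes which.min over distance), and the sketch
assumes no cluster ever becomes empty.

```r
kmeans_sketch <- function(S, n, max_iter = 100) {
  # S: numeric matrix, one feature vector per row; n: number of clusters
  c_vecs <- S[sample(nrow(S), n), , drop = FALSE]  # n distinct random starts
  A <- rep(0L, nrow(S))                            # cluster assignments
  for (iter in 1:max_iter) {
    # assign each vector to the closest (most similar) cluster vector
    newA <- apply(S, 1, function(x)
      which.min(colSums((t(c_vecs) - x)^2)))
    if (all(newA == A)) break   # no vector changed affiliation: done
    A <- newA
    # recompute each cluster vector as the mean of its members
    # (assumes every cluster keeps at least one member)
    for (i in 1:n)
      c_vecs[i, ] <- colMeans(S[A == i, , drop = FALSE])
  }
  list(cluster = A, centers = c_vecs)
}
```

With two well-separated groups of points, the sketch recovers the grouping in
a handful of iterations.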
4. Worked Example
As a simple illustration of the k-means algorithm, consider the following data
set consisting of the scores of two variables, A and B, for each of seven
individuals:
Subject    A      B
1          1.0    1.0
2          1.5    2.0
3          3.0    4.0
4          5.0    7.0
5          3.5    5.0
6          4.5    5.0
7          3.5    4.5
This data set is to be grouped into two clusters. As a first step in finding a
sensible initial partition, let the A and B values of the two individuals
furthest apart (using the Euclidean distance measure) define the initial
cluster means, giving:
           Individual    Mean Vector (centroid)
Group 1    1             (1.0, 1.0)
Group 2    4             (5.0, 7.0)
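As a quick check, dist() can confirm that individuals 1 and 4 are indeed the
furthest-apart pair. This is a sketch, assuming the seven (A, B) scores from
the table above are entered as a matrix named scores (a name not used
elsewhere in these notes):

```r
scores <- matrix(c(1.0, 1.0,
                   1.5, 2.0,
                   3.0, 4.0,
                   5.0, 7.0,
                   3.5, 5.0,
                   4.5, 5.0,
                   3.5, 4.5), ncol = 2, byrow = TRUE)
d <- as.matrix(dist(scores))        # pairwise Euclidean distances
which(d == max(d), arr.ind = TRUE)  # the furthest-apart pair: rows 1 and 4
```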
The remaining individuals are now examined in sequence and allocated to the cluster to which
they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is
recalculated each time a new member is added. This leads to the following series of steps:
        Cluster 1                              Cluster 2
Step    Individual    Mean Vector              Individual    Mean Vector
                      (centroid)                             (centroid)
1       1             (1.0, 1.0)               4             (5.0, 7.0)
2       1, 2          (1.2, 1.5)               4             (5.0, 7.0)
3       1, 2, 3       (1.8, 2.3)               4             (5.0, 7.0)
4       1, 2, 3       (1.8, 2.3)               4, 5          (4.2, 6.0)
5       1, 2, 3       (1.8, 2.3)               4, 5, 6       (4.3, 5.7)
6       1, 2, 3       (1.8, 2.3)               4, 5, 6, 7    (4.1, 5.4)
Now the initial partition has changed, and at this stage the two clusters have
the following characteristics:
            Individual     Mean Vector (centroid)
Cluster 1   1, 2, 3        (1.8, 2.3)
Cluster 2   4, 5, 6, 7     (4.1, 5.4)
But we cannot yet be sure that each individual has been assigned to the right
cluster. So, we compare each individual's distance to its own cluster mean and
to that of the opposite cluster. We find:
Individual   Distance to mean           Distance to mean
             (centroid) of Cluster 1    (centroid) of Cluster 2
1            1.5                        5.4
2            0.4                        4.3
3            2.1                        1.8
4            5.7                        1.8
5            3.2                        0.7
6            3.8                        0.6
7            2.8                        1.1
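The distance table can be reproduced with a few lines of R. This is a sketch;
the object names scores, d1, and d2 are illustrative, and the two centroids
are the ones computed above:

```r
scores <- matrix(c(1.0, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 7.0,
                   3.5, 5.0, 4.5, 5.0, 3.5, 4.5), ncol = 2, byrow = TRUE)
# Euclidean distance of each individual to each of the two centroids
d1 <- sqrt(rowSums((scores - matrix(c(1.8, 2.3), 7, 2, byrow = TRUE))^2))
d2 <- sqrt(rowSums((scores - matrix(c(4.1, 5.4), 7, 2, byrow = TRUE))^2))
round(cbind(d1, d2), 1)
# individual 3 is the only member of cluster 1 that is nearer centroid 2
```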
Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2)
than to its own (Cluster 1). In other words, each individual's distance to its
own cluster mean should be smaller than its distance to the other cluster's
mean, which is not the case for individual 3. Thus, individual 3 is relocated
to Cluster 2, resulting in the new partition:
            Individual        Mean Vector (centroid)
Cluster 1   1, 2              (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7     (3.9, 5.1)
The iterative relocation would now continue from this new partition until no more relocations
occur. However, in this example each individual is now nearer its own cluster mean than that of
the other cluster and the iteration stops, choosing the latest partitioning as the final cluster
solution.
Also, it is possible that the k-means algorithm will not converge to a final
solution. In that case it is a good idea to stop the algorithm after a
pre-chosen maximum number of iterations.
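In R's built-in kmeans() this safeguard already exists: the iter.max argument
(default 10) caps the number of iterations. A minimal sketch on random data:

```r
set.seed(42)
x <- matrix(rnorm(100), ncol = 2)             # 50 random points in 2-D
fit <- kmeans(x, centers = 3, iter.max = 20)  # give up after 20 iterations
fit$iter                                      # iterations actually performed
```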
setwd("E:/Amrit/Dataset")
# Read file
mydata = read.csv(file.choose(), header = TRUE)
head(mydata)
# Prepare Data
mydata.features <- na.omit(mydata)        # listwise deletion of missing
# mydata.features <- scale(mydata.features) # standardize variables
table(mydata$household_key, fit$cluster)
plot(mydata[c("DAY", "WEEK_NO")], col = mydata$household_key)
library(cluster)
clusplot(mydata.features,
         fit$cluster,
         color = TRUE,
         shade = TRUE,
         labels = 2,
         lines = 0)
plot(mydata[mydata.5means$cluster == 1, ], col = "red",
     xlim = c(min(mydata[, 1]), max(mydata[, 1])),
     ylim = c(min(mydata[, 2]), max(mydata[, 2])))
points(mydata[mydata.5means$cluster == 2, ], col = "blue")
points(mydata[mydata.5means$cluster == 3, ], col = "green")
points(mydata.5means$centers, pch = 2, col = "green")
Amrit results
#show data:
head(mydata)
             [,1]       [,2]       [,3]       [,4] [,5]
[1,]  1.242642087  0.7389868 -0.1503366 -1.5326992    1
[2,]  0.877972725 -0.7260345 -2.1590601 -0.9633446    2
[3,] -1.059693719 -1.0189481 -1.9863722 -0.9259022    3
[4,]  1.620849035  2.0455703  0.5787970  0.4458540    4
[5,] -0.006106217  0.7298301  0.2319037 -2.0492439    5
[6,] -0.710845765  0.4860186 -0.4223030  0.3213001    6
>
> # Prepare Data
> mydata <- na.omit(mydata)
>
> # Determine number of clusters
> wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
> for (i in 2:20) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
> plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
>
> # K-Means Cluster Analysis
> fit <- kmeans(mydata, 3)
> # get cluster means
> mydata <- aggregate(mydata,by=list(fit$cluster),FUN=mean)
> # append cluster assignment
> mydata.features <- data.frame(mydata, fit$cluster)
Error in data.frame(mydata, fit$cluster) :
arguments imply differing number of rows: 3, 50
>
> bindResults <- cbind(fit$cluster, mydata.features)
Error in cbind(fit$cluster, mydata.features) :
object 'mydata.features' not found
>
> write.csv(bindResults, file ="bindResults4.csv",row.names=FALSE)
Error in is.data.frame(x) : object 'bindResults' not found
>
> mydata2 <- read.csv("bindResults4.csv", TRUE)
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'bindResults4.csv': No such file or directory
> View(mydata2)
Error in View : object 'mydata2' not found

The chain of errors above starts at the aggregate() call: it overwrites mydata
with the 3-row table of cluster means, so data.frame(mydata, fit$cluster)
fails (3 rows vs. 50 cluster labels), and every later step fails because its
input object was never created. Assigning the aggregate() result to a new
name avoids this.
>
# add ID to data
mydata <- cbind(mydata, 1:50)
# show data:
head(mydata)
# Prepare Data
mydata.features <- na.omit(mydata)        # listwise deletion of missing
# mydata.features <- scale(mydata.features) # standardize variables