
1. Your Code

setwd("E:/Amrit/Dataset")

# Read file
mydata = read.csv(file.choose(), header = TRUE)

# View the file
View(mydata)

mydata$CURR_SIZE_OF_PRODUCT <- NULL

# Create a new data set containing only the feature columns
mydata.features = mydata
mydata.features$household_key <- NULL
View(mydata.features)

# Prepare Data
mydata.features <- na.omit(mydata.features) # listwise deletion of missing
#mydata.features <- scale(mydata.features) # standardize variables

# Determine number of clusters (elbow method): compute the total
# within-group sum of squares for k = 1..20 and look for the "elbow"
wss <- (nrow(mydata.features)-1)*sum(apply(mydata.features,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(mydata.features, centers=i)$withinss)

plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

# K-Means Cluster Analysis
fit <- kmeans(mydata.features, 3)
# get cluster means
aggregate(mydata.features,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata.features <- data.frame(mydata.features, fit$cluster)

# Note: fit$cluster was appended to mydata.features above, so drop that
# column again before re-running kmeans; otherwise the cluster label itself
# is used as a feature.
results <- kmeans(mydata.features[, names(mydata.features) != "fit.cluster"], 3)
results

# To see which data points belong to, say, cluster 1, type:
mydata.features[results$cluster==1, ]
# This outputs all rows (with their values) assigned to cluster 1.

# Count the observations in cluster 1:
sum(results$cluster==1)

# First, let's plot the data. Since this is a multidimensional data frame and
# a standard plot can only show two dimensions, select the two variables you
# want to plot, for example variables 2 and 3 (columns 2 and 3):
plot(mydata.features[,2:3], col=results$cluster,
     main="Affiliation of observations")
points(results$centers[,2:3], col=1:4, pch="x", cex=3)

table(mydata$household_key, results$cluster)

#write.csv(exporttable, file ="table.csv",row.names=FALSE)

plot(mydata[c("DAY","WEEK_NO")], col= results$cluster)

plot(mydata[c("DAY","WEEK_NO")], col=mydata$household_key)

library(cluster)
clusplot(mydata.features, fit$cluster, color=TRUE, shade=TRUE,
         labels=2, lines=0)

# How many distinct values of DAY are there? (renamed to avoid shadowing
# the base function unique())
unique_days <- unique(mydata$DAY)
length(unique_days)

2. K-means is a clustering method.

When you apply a clustering method to your dataset, it separates your data
into groups that maximize the similarity between data points within the same
group and maximize the dissimilarity between data points in different
groups. The number of groups, k, is an input parameter of the problem, that
is, you choose it. K-means groups the data and returns k centroids, i.e. k
vectors that represent the center points of the groups, together with a
vector that assigns each sample in your dataset to a group.
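A minimal, self-contained illustration of this in R (using the built-in iris data purely as an example dataset; the variable names here are mine, not part of the original script):

# Minimal k-means illustration on the built-in iris measurements
data(iris)
features <- iris[, 1:4]              # numeric columns only
km <- kmeans(features, centers = 3)  # k = 3 is our chosen input parameter
km$centers                           # the 3 centroids (one row per group)
head(km$cluster)                     # group assignment for each sample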

3. Algorithm

Let n be the number of clusters you want
Let S be the set of feature vectors (|S| is the size of the set)
Let A be the set of associated clusters for each feature vector
Let sim(x, y) be the similarity function
Let c[1..n] be the vectors for our clusters

Init:
  Let S' = S
  // choose n random vectors to start our clusters
  for i = 1 to n
    j = rand(|S'|)
    c[i] = S'[j]
    S' = S' - {c[i]}  // remove that vector from S' so we can't choose it again
  end
  // assign initial clusters
  for i = 1 to |S|
    A[i] = argmax(j = 1 to n) { sim(S[i], c[j]) }
  end

Run:
  Let change = true
  while change
    change = false  // assume there is no change
    // reassign feature vectors to clusters
    for i = 1 to |S|
      a = argmax(j = 1 to n) { sim(S[i], c[j]) }
      if a != A[i]
        A[i] = a
        change = true  // a vector changed affiliation, so we need to
                       // recompute our cluster vectors and run again
      end
    end
    // recalculate cluster locations if a change occurred
    if change
      for i = 1 to n
        mean = 0; count = 0
        for j = 1 to |S|
          if A[j] == i
            mean = mean + S[j]
            count = count + 1
          end
        end
        c[i] = mean/count
      end
    end
  end
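To make the pseudocode concrete, here is a compact R translation (my own sketch, not part of the original answer; it uses negative squared Euclidean distance as the similarity function sim, so the argmax over similarities becomes a which.min over distances):

simple_kmeans <- function(S, n, max_iter = 100) {
  # S: numeric matrix with one feature vector per row; n: number of clusters
  c_mat <- S[sample(nrow(S), n), , drop = FALSE]  # n random starting centers
  assign_clusters <- function() {
    # nearest center index for every row of S
    apply(S, 1, function(x) which.min(colSums((t(c_mat) - x)^2)))
  }
  A <- assign_clusters()
  for (iter in 1:max_iter) {
    # recompute each cluster center as the mean of its current members
    # (for simplicity this sketch assumes no cluster becomes empty)
    for (i in 1:n) c_mat[i, ] <- colMeans(S[A == i, , drop = FALSE])
    A_new <- assign_clusters()
    if (all(A_new == A)) break  # no vector changed affiliation: converged
    A <- A_new
  }
  list(cluster = A, centers = c_mat)
}

Calling, say, simple_kmeans(as.matrix(iris[, 1:4]), 3) returns a list shaped like kmeans()'s $cluster and $centers, though the cluster labels may be permuted between runs because of the random start.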

4. k-Means: Step-By-Step Example

As a simple illustration of a k-means algorithm, consider the following data set consisting of the
scores of two variables on each of seven individuals:
Subject      A      B
   1        1.0    1.0
   2        1.5    2.0
   3        3.0    4.0
   4        5.0    7.0
   5        3.5    5.0
   6        4.5    5.0
   7        3.5    4.5

This data set is to be grouped into two clusters. As a first step in finding a sensible initial
partition, let the A and B values of the two individuals furthest apart (using the Euclidean
distance measure) define the initial cluster means, giving:
           Individual    Mean Vector (centroid)
Group 1        1              (1.0, 1.0)
Group 2        4              (5.0, 7.0)

The remaining individuals are now examined in sequence and allocated to the cluster to which
they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is
recalculated each time a new member is added. This leads to the following series of steps:
                 Cluster 1                        Cluster 2
Step    Individual    Mean Vector        Individual    Mean Vector
                      (centroid)                       (centroid)
  1     1             (1.0, 1.0)         4             (5.0, 7.0)
  2     1, 2          (1.2, 1.5)         4             (5.0, 7.0)
  3     1, 2, 3       (1.8, 2.3)         4             (5.0, 7.0)
  4     1, 2, 3       (1.8, 2.3)         4, 5          (4.2, 6.0)
  5     1, 2, 3       (1.8, 2.3)         4, 5, 6       (4.3, 5.7)
  6     1, 2, 3       (1.8, 2.3)         4, 5, 6, 7    (4.1, 5.4)

Now the initial partition has changed, and the two clusters at this stage have the following
characteristics:
            Individual     Mean Vector (centroid)
Cluster 1   1, 2, 3            (1.8, 2.3)
Cluster 2   4, 5, 6, 7         (4.1, 5.4)

But we cannot yet be sure that each individual has been assigned to the right cluster. So, we
compare each individual's distance to its own cluster mean and to that of the opposite
cluster. We find:

Individual    Distance to mean            Distance to mean
              (centroid) of Cluster 1     (centroid) of Cluster 2
    1               1.5                         5.4
    2               0.4                         4.3
    3               2.1                         1.8
    4               5.7                         1.8
    5               3.2                         0.7
    6               3.8                         0.6
    7               2.8                         1.1

Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to its own
(Cluster 1). In other words, each individual's distance to its own cluster mean should be
smaller than the distance to the other cluster's mean (which is not the case with individual
3). Thus, individual 3 is relocated to Cluster 2, resulting in the new partition:
            Individual        Mean Vector (centroid)
Cluster 1   1, 2                  (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7         (3.9, 5.1)

The iterative relocation would now continue from this new partition until no more relocations
occur. In this example, however, each individual is now nearer to its own cluster mean than to
that of the other cluster, so the iteration stops, and the latest partitioning is taken as the
final cluster solution.
Also, it is possible that the k-means algorithm will not converge to a final solution. In that
case it is a good idea to stop the algorithm after a pre-chosen maximum number of iterations.
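For readers who want to verify the example, here is a short R check (my addition, not part of the original; note that R's kmeans() uses a batch algorithm rather than the one-at-a-time allocation described above, but starting it from individuals 1 and 4 reproduces the same final partition):

# The seven subjects from the table above
A <- c(1.0, 1.5, 3.0, 5.0, 3.5, 4.5, 3.5)
B <- c(1.0, 2.0, 4.0, 7.0, 5.0, 5.0, 4.5)
pts <- cbind(A, B)

# Use the two individuals furthest apart (subjects 1 and 4) as initial centers
km <- kmeans(pts, centers = pts[c(1, 4), ])
km$cluster  # expected: subjects 1-2 in one cluster, 3-7 in the other
km$centers  # expected centroids: approximately (1.3, 1.5) and (3.9, 5.1)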

5. Changes as per your code

setwd("E:/Amrit/Dataset")

# Read file
mydata = read.csv(file.choose(), header = TRUE)
head(mydata)

# Show the data and check whether groupings are visible
# View the file
View(mydata)
# or plot it:
plot(mydata)

mydata$CURR_SIZE_OF_PRODUCT <- NULL

# Create a new data set containing only the feature columns
mydata.features = mydata
mydata.features$household_key <- NULL
View(mydata.features)

# Prepare Data
mydata.features <- na.omit(mydata.features) # listwise deletion of missing
#mydata.features <- scale(mydata.features) # standardize variables

# Determine number of clusters (elbow method)
wss <- (nrow(mydata.features)-1)*sum(apply(mydata.features,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(mydata.features, centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

# As you said, distinct groups exist; fit k-means
# (the original comment mentions 5 means, but the plotting code below only
# handles 3 clusters, so we request centers = 3)
mydata.5means <- kmeans(mydata, centers = 3)
# Show the centers
mydata.5means$centers
# Show the cluster assignments (the component is $cluster, not $clusters)
mydata.5means$cluster
# K-Means Cluster Analysis
fit <- kmeans(mydata.features, 3)
# get cluster means
aggregate(mydata.features,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata.features <- data.frame(mydata.features, fit$cluster)

# As above, drop the appended fit.cluster column before re-running kmeans,
# otherwise the cluster label itself is treated as a feature.
results <- kmeans(mydata.features[, names(mydata.features) != "fit.cluster"], 3)
results

table(mydata$household_key, results$cluster)

#write.csv(exporttable, file ="table.csv",row.names=FALSE)

plot(mydata[c("DAY","WEEK_NO")], col= results$cluster)

plot(mydata[c("DAY","WEEK_NO")], col=mydata$household_key)

library(cluster)
clusplot(mydata.features, fit$cluster, color=TRUE, shade=TRUE,
         labels=2, lines=0)

plot(mydata[mydata.5means$cluster==1,], col="red",
     xlim=c(min(mydata[,1]), max(mydata[,1])),
     ylim=c(min(mydata[,2]), max(mydata[,2])))
points(mydata[mydata.5means$cluster==2,], col="blue")
points(mydata[mydata.5means$cluster==3,], col="green")

# Plot the centers on the plot (pch=2 draws triangles)
points(mydata.5means$centers, pch=2, col="green")

# How many distinct values of DAY are there?
unique_days <- unique(mydata$DAY)
length(unique_days)


Amrit's results (R console transcript):

# Generate sample data: 50 rows x 4 columns of N(0,1) values
mydata <- matrix(data=rnorm(200,0,1), 50, 4)
# Add an ID column
mydata <- cbind(mydata, 1:50)

#show data:
head(mydata)
             [,1]       [,2]       [,3]       [,4] [,5]
[1,]  1.242642087  0.7389868 -0.1503366 -1.5326992    1
[2,]  0.877972725 -0.7260345 -2.1590601 -0.9633446    2
[3,] -1.059693719 -1.0189481 -1.9863722 -0.9259022    3
[4,]  1.620849035  2.0455703  0.5787970  0.4458540    4
[5,] -0.006106217  0.7298301  0.2319037 -2.0492439    5
[6,] -0.710845765  0.4860186 -0.4223030  0.3213001    6
>
> # Prepare Data
> mydata <- na.omit(mydata)
>
> # Determine number of clusters
> wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
> for (i in 2:20) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
> plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
>
> # K-Means Cluster Analysis
> fit <- kmeans(mydata, 3)
> # get cluster means
> mydata <- aggregate(mydata,by=list(fit$cluster),FUN=mean)
> # append cluster assignment
> mydata.features <- data.frame(mydata, fit$cluster)
Error in data.frame(mydata, fit$cluster) :
arguments imply differing number of rows: 3, 50
>
> bindResults <- cbind(fit$cluster, mydata.features)
Error in cbind(fit$cluster, mydata.features) :
object 'mydata.features' not found
>
> write.csv(bindResults, file ="bindResults4.csv",row.names=FALSE)
Error in is.data.frame(x) : object 'bindResults' not found
>
> mydata2 <- read.csv("bindResults4.csv", TRUE)
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'bindResults4.csv': No such file or directory
> View(mydata2)
Error in View : object 'mydata2' not found
>

> mydata2$fit.cluster.1 <- NULL
Error in mydata2$fit.cluster.1 <- NULL : object 'mydata2' not found
>
> #Sort the data by Product_ID
> sortbycluster <- mydata2[order(mydata2$fit.cluster), ]
Error: object 'mydata2' not found
>
> View(sortbycluster)
Error in View : object 'sortbycluster' not found
>
> write.csv(sortbycluster, file ="bindResults5.csv",row.names=FALSE)
Error in is.data.frame(x) : object 'sortbycluster' not found
>
> mydata3 <- read.csv("bindResults5.csv", TRUE)
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'bindResults5.csv': No such file or directory
> View(mydata3)
Error in View : object 'mydata3' not found
>
> summary(mydata3$fit.cluster)
Error in summary(mydata3$fit.cluster) : object 'mydata3' not found
>
> fit$cluster
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2
[30] 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
> #library(cluster)
> # clusplot(mydata3, mydata3$household_key, color=TRUE, shade=TRUE, labels=2,
lines=0)
>
> table <- tbl_df(fit$cluster)
Error: could not find function "tbl_df"
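The cascade of errors above has a single root cause: the line mydata <- aggregate(mydata, by=list(fit$cluster), FUN=mean) overwrote the 50-row data with the 3-row table of cluster means, so data.frame(mydata, fit$cluster) failed with "differing number of rows: 3, 50", mydata.features was never created, and every later step that depended on it failed in turn. (tbl_df additionally requires library(dplyr), and in recent dplyr versions tibble::as_tibble() is preferred.) A corrected version of that session might look like this (my sketch; the file name is the one used above):

# Generate sample data with an ID column
mydata <- matrix(rnorm(200, 0, 1), 50, 4)
mydata <- cbind(mydata, 1:50)
mydata <- na.omit(mydata)

fit <- kmeans(mydata, 3)

# Store the cluster means in a NEW object instead of overwriting mydata
cluster.means <- aggregate(mydata, by = list(fit$cluster), FUN = mean)

# Append the cluster assignment; row counts now match (50 and 50)
mydata.features <- data.frame(mydata, fit.cluster = fit$cluster)

# Sort by cluster and export
sortbycluster <- mydata.features[order(mydata.features$fit.cluster), ]
write.csv(sortbycluster, file = "bindResults5.csv", row.names = FALSE)
mydata3 <- read.csv("bindResults5.csv", header = TRUE)
summary(mydata3$fit.cluster)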

