
1. Your Code

setwd("E:/Amrit/Dataset")

# Read file
mydata = read.csv(file.choose(), header = TRUE)

# View the file
View(mydata)

mydata$CURR_SIZE_OF_PRODUCT <- NULL

# Create a new data set containing only the feature columns
mydata.features = mydata
mydata.features$household_key <- NULL
View(mydata.features)

# Prepare Data
mydata.features <- na.omit(mydata.features) # listwise deletion of missing
#mydata.features <- scale(mydata.features) # standardize variables

# Determine number of clusters (elbow method): compute the total
# within-group sum of squares for k = 1..20 and look for the "elbow"
wss <- (nrow(mydata.features)-1)*sum(apply(mydata.features,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(mydata.features, centers=i)$withinss)

plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

# K-Means Cluster Analysis
fit <- kmeans(mydata.features, 3)
# get cluster means
aggregate(mydata.features,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata.features <- data.frame(mydata.features, fit$cluster)

# Note: fit$cluster was appended to mydata.features above, so drop that
# column again before re-running kmeans; otherwise the cluster label itself
# is used as a feature.
results <- kmeans(mydata.features[, names(mydata.features) != "fit.cluster"], 3)
results

# To see which data points belong to, say, cluster 1, type:
mydata.features[results$cluster==1, ]
# This outputs all rows (with their values) assigned to cluster 1.

# Count the observations in cluster 1:
sum(results$cluster==1)

# First, let's plot the data. Since this is a multidimensional data frame and
# a standard plot can only show two dimensions, select the two variables you
# want to plot, for example variables 2 and 3 (columns 2 and 3):
plot(mydata.features[,2:3], col=results$cluster,
     main="Affiliation of observations")
points(results$centers[,2:3], col=1:4, pch="x", cex=3)

table(mydata$household_key, results$cluster)

#write.csv(exporttable, file ="table.csv",row.names=FALSE)

plot(mydata[c("DAY","WEEK_NO")], col= results$cluster)

plot(mydata[c("DAY","WEEK_NO")], col=mydata$household_key)

library(cluster)
clusplot(mydata.features, fit$cluster, color=TRUE, shade=TRUE,
         labels=2, lines=0)

# How many distinct values of DAY are there? (renamed to avoid shadowing
# the base function unique())
unique_days <- unique(mydata$DAY)
length(unique_days)

2. K-means is a clustering method.

When you apply a clustering method to your dataset, it separates your data
into groups that maximize the similarity between data points within the same
group and maximize the dissimilarity between data points in different
groups. The number of groups, k, is an input parameter of the problem, that
is, you choose it. K-means groups the data and returns k centroids, i.e. k
vectors that represent the center points of the groups, together with a
vector that assigns each sample in your dataset to a group.
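A minimal, self-contained illustration of this in R (using the built-in iris data purely as an example dataset; the variable names here are mine, not part of the original script):

# Minimal k-means illustration on the built-in iris measurements
data(iris)
features <- iris[, 1:4]              # numeric columns only
km <- kmeans(features, centers = 3)  # k = 3 is our chosen input parameter
km$centers                           # the 3 centroids (one row per group)
head(km$cluster)                     # group assignment for each sample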

3. Algorithm

Let n be the number of clusters you want
Let S be the set of feature vectors (|S| is the size of the set)
Let A be the set of associated clusters for each feature vector
Let sim(x, y) be the similarity function
Let c[1..n] be the vectors for our clusters

Init:
  Let S' = S
  // choose n random vectors to start our clusters
  for i = 1 to n
    j = rand(|S'|)
    c[i] = S'[j]
    S' = S' - {c[i]}  // remove that vector from S' so we can't choose it again
  end
  // assign initial clusters
  for i = 1 to |S|
    A[i] = argmax(j = 1 to n) { sim(S[i], c[j]) }
  end

Run:
  Let change = true
  while change
    change = false  // assume there is no change
    // reassign feature vectors to clusters
    for i = 1 to |S|
      a = argmax(j = 1 to n) { sim(S[i], c[j]) }
      if a != A[i]
        A[i] = a
        change = true  // a vector changed affiliation, so we need to
                       // recompute our cluster vectors and run again
      end
    end
    // recalculate cluster locations if a change occurred
    if change
      for i = 1 to n
        mean = 0; count = 0
        for j = 1 to |S|
          if A[j] == i
            mean = mean + S[j]
            count = count + 1
          end
        end
        c[i] = mean/count
      end
    end
  end
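To make the pseudocode concrete, here is a compact R translation (my own sketch, not part of the original answer; it uses negative squared Euclidean distance as the similarity function sim, so the argmax over similarities becomes a which.min over distances):

simple_kmeans <- function(S, n, max_iter = 100) {
  # S: numeric matrix with one feature vector per row; n: number of clusters
  c_mat <- S[sample(nrow(S), n), , drop = FALSE]  # n random starting centers
  assign_clusters <- function() {
    # nearest center index for every row of S
    apply(S, 1, function(x) which.min(colSums((t(c_mat) - x)^2)))
  }
  A <- assign_clusters()
  for (iter in 1:max_iter) {
    # recompute each cluster center as the mean of its current members
    # (for simplicity this sketch assumes no cluster becomes empty)
    for (i in 1:n) c_mat[i, ] <- colMeans(S[A == i, , drop = FALSE])
    A_new <- assign_clusters()
    if (all(A_new == A)) break  # no vector changed affiliation: converged
    A <- A_new
  }
  list(cluster = A, centers = c_mat)
}

Calling, say, simple_kmeans(as.matrix(iris[, 1:4]), 3) returns a list shaped like kmeans()'s $cluster and $centers, though the cluster labels may be permuted between runs because of the random start.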

4. k-Means: Step-By-Step Example

As a simple illustration of a k-means algorithm, consider the following data set consisting of the
scores of two variables on each of seven individuals:
Subject      A      B
   1        1.0    1.0
   2        1.5    2.0
   3        3.0    4.0
   4        5.0    7.0
   5        3.5    5.0
   6        4.5    5.0
   7        3.5    4.5

This data set is to be grouped into two clusters. As a first step in finding a sensible initial
partition, let the A and B values of the two individuals furthest apart (using the Euclidean
distance measure) define the initial cluster means, giving:
           Individual    Mean Vector (centroid)
Group 1        1              (1.0, 1.0)
Group 2        4              (5.0, 7.0)

The remaining individuals are now examined in sequence and allocated to the cluster to which
they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is
recalculated each time a new member is added. This leads to the following series of steps:
                 Cluster 1                        Cluster 2
Step    Individual    Mean Vector        Individual    Mean Vector
                      (centroid)                       (centroid)
  1     1             (1.0, 1.0)         4             (5.0, 7.0)
  2     1, 2          (1.2, 1.5)         4             (5.0, 7.0)
  3     1, 2, 3       (1.8, 2.3)         4             (5.0, 7.0)
  4     1, 2, 3       (1.8, 2.3)         4, 5          (4.2, 6.0)
  5     1, 2, 3       (1.8, 2.3)         4, 5, 6       (4.3, 5.7)
  6     1, 2, 3       (1.8, 2.3)         4, 5, 6, 7    (4.1, 5.4)

Now the initial partition has changed, and the two clusters at this stage have the following
characteristics:
            Individual     Mean Vector (centroid)
Cluster 1   1, 2, 3            (1.8, 2.3)
Cluster 2   4, 5, 6, 7         (4.1, 5.4)

But we cannot yet be sure that each individual has been assigned to the right cluster. So, we
compare each individual's distance to its own cluster mean and to that of the opposite
cluster. We find:

Individual    Distance to mean            Distance to mean
              (centroid) of Cluster 1     (centroid) of Cluster 2
    1               1.5                         5.4
    2               0.4                         4.3
    3               2.1                         1.8
    4               5.7                         1.8
    5               3.2                         0.7
    6               3.8                         0.6
    7               2.8                         1.1

Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to its own
(Cluster 1). In other words, each individual's distance to its own cluster mean should be
smaller than the distance to the other cluster's mean (which is not the case with individual
3). Thus, individual 3 is relocated to Cluster 2, resulting in the new partition:
            Individual        Mean Vector (centroid)
Cluster 1   1, 2                  (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7         (3.9, 5.1)

The iterative relocation would now continue from this new partition until no more relocations
occur. In this example, however, each individual is now nearer to its own cluster mean than to
that of the other cluster, so the iteration stops, and the latest partitioning is taken as the
final cluster solution.
Also, it is possible that the k-means algorithm will not converge to a final solution. In that
case it is a good idea to stop the algorithm after a pre-chosen maximum number of iterations.
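For readers who want to verify the example, here is a short R check (my addition, not part of the original; note that R's kmeans() uses a batch algorithm rather than the one-at-a-time allocation described above, but starting it from individuals 1 and 4 reproduces the same final partition):

# The seven subjects from the table above
A <- c(1.0, 1.5, 3.0, 5.0, 3.5, 4.5, 3.5)
B <- c(1.0, 2.0, 4.0, 7.0, 5.0, 5.0, 4.5)
pts <- cbind(A, B)

# Use the two individuals furthest apart (subjects 1 and 4) as initial centers
km <- kmeans(pts, centers = pts[c(1, 4), ])
km$cluster  # expected: subjects 1-2 in one cluster, 3-7 in the other
km$centers  # expected centroids: approximately (1.3, 1.5) and (3.9, 5.1)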

5. Changes as per your code

setwd("E:/Amrit/Dataset")

# Read file
mydata = read.csv(file.choose(), header = TRUE)
head(mydata)

# Show the data and check whether groupings are visible
# View the file
View(mydata)
# or plot it:
plot(mydata)

mydata$CURR_SIZE_OF_PRODUCT <- NULL

# Create a new data set containing only the feature columns
mydata.features = mydata
mydata.features$household_key <- NULL
View(mydata.features)

# Prepare Data
mydata.features <- na.omit(mydata.features) # listwise deletion of missing
#mydata.features <- scale(mydata.features) # standardize variables

# Determine number of clusters (elbow method)
wss <- (nrow(mydata.features)-1)*sum(apply(mydata.features,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(mydata.features, centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

# As you said, distinct groups exist; fit k-means
# (the original comment mentions 5 means, but the plotting code below only
# handles 3 clusters, so we request centers = 3)
mydata.5means <- kmeans(mydata, centers = 3)
# Show the centers
mydata.5means$centers
# Show the cluster assignments (the component is $cluster, not $clusters)
mydata.5means$cluster
# K-Means Cluster Analysis
fit <- kmeans(mydata.features, 3)
# get cluster means
aggregate(mydata.features,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata.features <- data.frame(mydata.features, fit$cluster)

# As above, drop the appended fit.cluster column before re-running kmeans,
# otherwise the cluster label itself is treated as a feature.
results <- kmeans(mydata.features[, names(mydata.features) != "fit.cluster"], 3)
results

table(mydata$household_key, results$cluster)

#write.csv(exporttable, file ="table.csv",row.names=FALSE)

plot(mydata[c("DAY","WEEK_NO")], col= results$cluster)

plot(mydata[c("DAY","WEEK_NO")], col=mydata$household_key)

library(cluster)
clusplot(mydata.features, fit$cluster, color=TRUE, shade=TRUE,
         labels=2, lines=0)

plot(mydata[mydata.5means$cluster==1,], col="red",
     xlim=c(min(mydata[,1]), max(mydata[,1])),
     ylim=c(min(mydata[,2]), max(mydata[,2])))
points(mydata[mydata.5means$cluster==2,], col="blue")
points(mydata[mydata.5means$cluster==3,], col="green")

# Plot the centers on the plot (pch=2 draws triangles)
points(mydata.5means$centers, pch=2, col="green")

# How many distinct values of DAY are there?
unique_days <- unique(mydata$DAY)
length(unique_days)


Amrit's results (R console transcript):

# Generate sample data: 50 rows x 4 columns of N(0,1) values
mydata <- matrix(data=rnorm(200,0,1), 50, 4)
# Add an ID column
mydata <- cbind(mydata, 1:50)

#show data:
head(mydata)
             [,1]       [,2]       [,3]       [,4] [,5]
[1,]  1.242642087  0.7389868 -0.1503366 -1.5326992    1
[2,]  0.877972725 -0.7260345 -2.1590601 -0.9633446    2
[3,] -1.059693719 -1.0189481 -1.9863722 -0.9259022    3
[4,]  1.620849035  2.0455703  0.5787970  0.4458540    4
[5,] -0.006106217  0.7298301  0.2319037 -2.0492439    5
[6,] -0.710845765  0.4860186 -0.4223030  0.3213001    6
>
> # Prepare Data
> mydata <- na.omit(mydata)
>
> # Determine number of clusters
> wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
> for (i in 2:20) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
> plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
>
> # K-Means Cluster Analysis
> fit <- kmeans(mydata, 3)
> # get cluster means
> mydata <- aggregate(mydata,by=list(fit$cluster),FUN=mean)
> # append cluster assignment
> mydata.features <- data.frame(mydata, fit$cluster)
Error in data.frame(mydata, fit$cluster) :
arguments imply differing number of rows: 3, 50
>
> bindResults <- cbind(fit$cluster, mydata.features)
Error in cbind(fit$cluster, mydata.features) :
object 'mydata.features' not found
>
> write.csv(bindResults, file ="bindResults4.csv",row.names=FALSE)
Error in is.data.frame(x) : object 'bindResults' not found
>
> mydata2 <- read.csv("bindResults4.csv", TRUE)
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'bindResults4.csv': No such file or directory
> View(mydata2)
Error in View : object 'mydata2' not found
>

> mydata2$fit.cluster.1 <- NULL
Error in mydata2$fit.cluster.1 <- NULL : object 'mydata2' not found
>
> #Sort the data by Product_ID
> sortbycluster <- mydata2[order(mydata2$fit.cluster), ]
Error: object 'mydata2' not found
>
> View(sortbycluster)
Error in View : object 'sortbycluster' not found
>
> write.csv(sortbycluster, file ="bindResults5.csv",row.names=FALSE)
Error in is.data.frame(x) : object 'sortbycluster' not found
>
> mydata3 <- read.csv("bindResults5.csv", TRUE)
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'bindResults5.csv': No such file or directory
> View(mydata3)
Error in View : object 'mydata3' not found
>
> summary(mydata3$fit.cluster)
Error in summary(mydata3$fit.cluster) : object 'mydata3' not found
>
> fit$cluster
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2
[30] 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
> #library(cluster)
> # clusplot(mydata3, mydata3$household_key, color=TRUE, shade=TRUE, labels=2,
lines=0)
>
> table <- tbl_df(fit$cluster)
Error: could not find function "tbl_df"
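The cascade of errors above has a single root cause: the line mydata <- aggregate(mydata, by=list(fit$cluster), FUN=mean) overwrote the 50-row data with the 3-row table of cluster means, so data.frame(mydata, fit$cluster) failed with "differing number of rows: 3, 50", mydata.features was never created, and every later step that depended on it failed in turn. (tbl_df additionally requires library(dplyr), and in recent dplyr versions tibble::as_tibble() is preferred.) A corrected version of that session might look like this (my sketch; the file name is the one used above):

# Generate sample data with an ID column
mydata <- matrix(rnorm(200, 0, 1), 50, 4)
mydata <- cbind(mydata, 1:50)
mydata <- na.omit(mydata)

fit <- kmeans(mydata, 3)

# Store the cluster means in a NEW object instead of overwriting mydata
cluster.means <- aggregate(mydata, by = list(fit$cluster), FUN = mean)

# Append the cluster assignment; row counts now match (50 and 50)
mydata.features <- data.frame(mydata, fit.cluster = fit$cluster)

# Sort by cluster and export
sortbycluster <- mydata.features[order(mydata.features$fit.cluster), ]
write.csv(sortbycluster, file = "bindResults5.csv", row.names = FALSE)
mydata3 <- read.csv("bindResults5.csv", header = TRUE)
summary(mydata3$fit.cluster)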

