Assignment-4
Durgesh Kalwar
SC19M077
Elbow Method-
1. Compute the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of squares (WSS).
3. Plot the curve of WSS against the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.
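The steps above can be sketched as follows, using scikit-learn's KMeans (assumed available); the toy blob data is illustrative, not the assignment's data, and the WSS for each k is KMeans' `inertia_` attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs (illustrative only).
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)  # total within-cluster sum of squares for this k

# WSS always decreases with k; the "elbow" is where the decrease slows sharply.
for k, w in zip(range(1, 11), wss):
    print(k, round(w, 2))
```

Plotting `wss` against k (e.g. with matplotlib) shows the knee directly.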
Average Silhouette Method-
1. Compute the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters.
2. For each k, calculate the average silhouette of the observations (avg.sil).
3. Plot the curve of avg.sil against the number of clusters k.
4. The location of the maximum is considered the appropriate number of clusters.
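A minimal sketch of this method with scikit-learn's `silhouette_score` (assumed available); the two-blob toy data is illustrative, so the maximum should occur at k = 2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Toy data: two well-separated blobs (illustrative only).
X = np.vstack([rng.normal(c, 0.3, size=(60, 2)) for c in ([0, 0], [6, 0])])

avg_sil = {}
for k in range(2, 8):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    avg_sil[k] = silhouette_score(X, labels)  # average silhouette over all points

best_k = max(avg_sil, key=avg_sil.get)  # k that maximizes the average silhouette
print(best_k)
```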
Gap Statistic Method-
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimate of the optimal number of clusters is the value that maximizes the gap statistic (i.e., that yields the largest gap). This means that the clustering structure is far away from a random uniform distribution of points.
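A rough sketch of the gap statistic under these definitions: compare log(WSS) on the data against the average log(WSS) of uniform reference samples drawn over the data's bounding box. The function names and toy data are illustrative assumptions, not from the assignment.

```python
import numpy as np
from sklearn.cluster import KMeans

def wss(X, k, seed=0):
    # Total within-cluster sum of squares for a k-means clustering of X.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_

def gap_statistic(X, k_max=6, n_refs=5, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)  # bounding box of the data
    gaps = {}
    for k in range(1, k_max + 1):
        # Expected log(WSS) under the uniform null reference distribution.
        ref_log_w = [np.log(wss(rng.uniform(lo, hi, size=X.shape), k))
                     for _ in range(n_refs)]
        gaps[k] = np.mean(ref_log_w) - np.log(wss(X, k))
    return gaps

rng = np.random.default_rng(2)
# Toy data: three tight blobs (illustrative only), so the gap should peak at k = 3.
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in ([0, 0], [5, 0], [0, 5])])
gaps = gap_statistic(X)
print(max(gaps, key=gaps.get))
```

The original method also uses the standard error of the reference samples to pick the smallest acceptable k; the argmax shown here is the simplified rule stated in the text.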
If we know the actual labels of the clusters (ground truth), then we can use Extrinsic Methods, which compare the clustering against the ground truth and measure the agreement; otherwise we have to use Intrinsic Methods, which evaluate the goodness of a clustering by considering how well the clusters are separated.
Extrinsic Methods
Here we assign a score Q(C, Cg) to a clustering C, given the ground truth Cg. This depends on: Cluster Homogeneity, which measures the purity of the clustering; Cluster Completeness: if any two objects belong to the same category according to the ground truth, then they should be assigned to the same cluster; and Rag Bag: if some points cannot be merged with others, the clustering should put them in a rag-bag category.
Intrinsic Methods
Silhouette Coefficient - For clusters C-1, C-2, …, C-k, for every object o we calculate a(o) as the average distance between o and all other objects in the cluster to which o belongs. Similarly, b(o) is the minimum average distance from o to the objects of each cluster to which o does not belong. The silhouette coefficient is then defined as
S(o) = (b(o) - a(o)) / max{b(o), a(o)}
The value of the silhouette coefficient is between -1 and 1. The value of a(o) reflects the compactness of the cluster to which o belongs: the smaller the value, the more compact the cluster. The value of b(o) captures the degree to which o is separated from other clusters: the larger b(o) is, the more separated o is from other clusters. Therefore, when the silhouette coefficient of o approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case. However, when the silhouette coefficient is negative (i.e., b(o) < a(o)), this means that, in expectation, o is closer to the objects of another cluster than to the objects in its own cluster, indicating a poor clustering.
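The definitions of a(o), b(o), and S(o) above can be implemented directly; this is a small numpy sketch on tiny illustrative data, not the assignment's code.

```python
import numpy as np

def silhouette(X, labels):
    X, labels = np.asarray(X, float), np.asarray(labels)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distance matrix
    s = np.empty(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        # a(o): average distance to the other objects in o's own cluster.
        a = D[i, own & (np.arange(len(X)) != i)].mean()
        # b(o): minimum average distance to the objects of any other cluster.
        b = min(D[i, labels == c].mean() for c in set(labels) - {labels[i]})
        s[i] = (b - a) / max(a, b)
    return s

X = [[0, 0], [0, 1], [5, 0], [5, 1]]   # two compact, well-separated pairs
labels = [0, 0, 1, 1]
s = silhouette(X, labels)
print(s)  # all values well above 0: compact, well-separated clusters
```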
Davies–Bouldin index
For k clusters, let c_i be the centroid of cluster i, sigma_i the average distance of the objects in cluster i to c_i, and d(c_i, c_j) the distance between two centroids. The index is calculated as
DB = (1/k) * sum_{i=1}^{k} max_{j != i} (sigma_i + sigma_j) / d(c_i, c_j)
A lower DB index indicates better clustering.
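The DB index is available directly in scikit-learn (assumed available); a sketch on illustrative three-blob toy data, where the index should be lowest at the true k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(3)
# Toy data: three well-separated blobs (illustrative only).
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

db = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=3).fit_predict(X)
    db[k] = davies_bouldin_score(X, labels)  # lower is better
    print(k, round(db[k], 3))
```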
Q.1.c) KSOM Visualization-
Once the SOM is trained using the input data, the final map is not expected to have any twists.
If the map is twist-free, the distance between the codebook vectors of neighboring neurons
gives an approximation of the distance between different parts of the underlying data. When
such distances are depicted in a grayscale image, light colors depict closely spaced node
codebook vectors and darker colors indicate more widely separated node codebook vectors.
Thus, groups of light colors can be considered as clusters, and the dark parts as the boundaries
between the clusters. This representation can help to visualize the clusters in high-dimensional spaces, or to recognize them automatically using relatively simple image-processing techniques.
Q.2.a)
The number of data points is 8. This problem is solved programmatically.
Finding the best k using the Davies–Bouldin index:

No. of clusters   DB Index
2                 0.8058
3                 0.4767
4                 0.384
5                 0.248
6                 0.289
7                 0.168

After k = 3 the DB index does not change much, so we can say the best k is 3.
Q.3
k=2
Q.5.b)
In the k-means algorithm the distortion function J(c, mu) is not a convex function, so it is very sensitive to the initial values of the k centroids and may converge to a local minimum rather than the global minimum. To resolve this problem, we randomly initialize the k centroids from the data points 5-10 times and find the converged centroids for which the distortion function has the minimum value.
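The restart strategy described above can be sketched as follows; the k-means implementation and toy data are illustrative, and the run with the lowest distortion J is kept.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=100):
    # Initialize the k centroids from randomly chosen data points.
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=-1), axis=1)
        # Move each centroid to the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    J = np.sum((X - centers[labels]) ** 2)  # distortion J(c, mu)
    return J, centers, labels

rng = np.random.default_rng(4)
# Toy data: three blobs (illustrative only).
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in ([0, 0], [5, 0], [0, 5])])

# 10 random restarts; keep the minimum-distortion solution.
best_J, best_centers, best_labels = min((kmeans(X, 3, rng) for _ in range(10)),
                                        key=lambda t: t[0])
print(round(best_J, 3))
```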
Q.5.c) Plotting the clusters for Data 1 and Data 2 obtained by k-means, k-medoids, hierarchical clustering (divisive clustering), and KSOM.
[Figures: Data 1 (left), Data 2 (right)]
Q.5.d) Plot of J(c, mu) vs. k
[Figures: Data 1 (left), Data 2 (right)]
Q.5.e)
KSOM visualization techniques-
[Figures: Data 1 (left), Data 2 (right)]
Q.6.a)
The technique used to find the number of clusters is the elbow method.
[Figures: Data 3 (left), Data 4 (right)]
Q.6.b)
Number of clusters for Data 3:
From the Data 3 elbow figure, we can see the elbow forms at about k = 4 or 5, so the number of clusters is 5.
Q.6.c)
Quality of Clusters-
Q.13
Q.14
Finding the optimal dimension k of the subspace onto which the data can be projected:

Average squared projection error: (1/N) * sum_{i=1}^{N} ||x_i - x_i^approx||^2
Total variation in the data: (1/N) * sum_{i=1}^{N} ||x_i||^2  (the data are centralized, so the mean is zero)

We pick the smallest k for which the ratio of the projection error to the total variation is below a bound: a bound of 0.01 means 99% of the variance is retained in the approximated data (a bound of 0.03 would mean 97%).

Taking the SVD of X^T X, SVD(X^T X) = U S V^T, where S is a diagonal matrix containing the eigenvalues of X^T X (a symmetric matrix), the condition for 99% retained variance becomes

1 - (sum_{i=1}^{k} S_ii) / (sum_{i=1}^{n} S_ii) <= 0.01
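This criterion can be sketched in numpy: the singular values of the symmetric matrix X^T X are its eigenvalues, so the smallest k retaining 99% of the variance is found from their cumulative sum. The random data here is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
# Illustrative data: 200 samples in 10 correlated dimensions.
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
X = X - X.mean(axis=0)  # centralize: mean is zero

# Singular values of X^T X = its eigenvalues, returned in descending order.
S = np.linalg.svd(X.T @ X, compute_uv=False)
retained = np.cumsum(S) / np.sum(S)          # variance fraction kept by the first k
k = int(np.argmax(retained >= 0.99)) + 1     # smallest k with 1 - ratio <= 0.01
print(k, retained[k - 1])
```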
Q.15.a)
Total variation in the data: (1/N) * sum_{i=1}^{N} ||x_i||^2  (the data are centralized, so the mean is zero)
To retain 99% of the variance in the approximated data (a bound of 0.03 would retain 97%), take SVD(X^T X) = U S V^T, where S is a diagonal matrix containing the eigenvalues of X^T X (a symmetric matrix); then

1 - (sum_{i=1}^{k} S_ii) / (sum_{i=1}^{n} S_ii) <= 0.01
Q.15.d) Classification on the new arrhythmia data: this is a multi-class classification problem; the LDA algorithm is applied and the accuracy obtained is 90.92%.
Regression on the new communities-and-crime data: the mean squared error is 0.00885371.
Q.16.a) &b)
Q.17.a) Description of the experimental settings: Iris data, a multi-class classification problem with 3 classes. The LDA algorithm is used for classification.
One vs All-
This data consists of 3 classes, so 3 models are developed by considering one class as positive and the other two as the negative class for each model. Each model then becomes a two-class classification problem, for which the LDA algorithm is used.
All vs All-
Here C(n,2) models are made, where n = 3, so 3 models: M1 with C-1 and C-2, M2 with C-2 and C-3, and M3 with C-1 and C-3. Voting is carried out over the models, and for each data point the class with the majority of votes is declared the winner.
Q.17.b) Performance measure between the two multiclass techniques-