
Data Mining

Assignment 4

Durgesh kalwar
SC19M077

Q.1.a) Techniques for determining the number of clusters-

Elbow Method-

1. Run the clustering algorithm (e.g., k-means clustering) for different values of k, for
instance by varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of squares (WSS).
3. Plot the curve of WSS against the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered an indicator of the
appropriate number of clusters (see the sketch below).
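
A minimal sketch of the elbow method, assuming scikit-learn's KMeans and placeholder data in X:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                 # placeholder data
ks = range(1, 11)
wss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)                # total within-cluster sum of squares for this k

plt.plot(ks, wss, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('WSS')
plt.show()                                 # look for the bend (knee) in the curve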

Average silhouette method-

1. Run the clustering algorithm (e.g., k-means clustering) for different values of k, for
instance by varying k from 2 to 10 clusters.
2. For each k, calculate the average silhouette of the observations (avg.sil).
3. Plot the curve of avg.sil against the number of clusters k.
4. The location of the maximum is taken as the appropriate number of clusters (a sketch follows).
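
A minimal sketch of the average-silhouette method, assuming scikit-learn's silhouette_score and placeholder data in X:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)                 # placeholder data
ks = range(2, 11)                          # the silhouette needs at least 2 clusters
avg_sil = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    avg_sil.append(silhouette_score(X, labels))

plt.plot(ks, avg_sil, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('average silhouette')
plt.show()                                 # the maximum indicates the appropriate k
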
Gap statistic method-

The gap statistic compares the total within-cluster variation for different values of k
with its expected value under a null reference distribution of the data. The estimate of the
optimal number of clusters is the value of k that maximizes the gap statistic (i.e., the value
that yields the largest gap). This means that the clustering structure is far away from a random
uniform distribution of points (a sketch follows).
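
A minimal sketch of the gap statistic, assuming k-means and a uniform reference distribution drawn from the bounding box of the data:

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    # log within-cluster dispersion on the real data
    log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    # expected log dispersion under the null (uniform) reference distribution
    ref_log_wk = []
    for _ in range(n_refs):
        ref = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
        ref_log_wk.append(np.log(KMeans(n_clusters=k, n_init=10,
                                        random_state=seed).fit(ref).inertia_))
    return np.mean(ref_log_wk) - log_wk    # Gap(k)

X = np.random.rand(200, 2)                 # placeholder data
gaps = [gap_statistic(X, k) for k in range(1, 11)]
best_k = int(np.argmax(gaps)) + 1          # k that maximizes the gap statistic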

Q.1.b) Techniques for measuring the quality of clusters-

If the actual cluster labels (ground truth) are known, we can use extrinsic methods, which
compare the clustering against the ground truth; otherwise we have to use intrinsic methods,
which evaluate the goodness of a clustering by considering how well the clusters are
separated.
Extrinsic Methods
Here we assign a score Q(C, Cg) to a clustering C, given the ground truth Cg. This score depends
on cluster homogeneity, which measures the purity of the clustering; cluster completeness, which
requires that if two objects belong to the same category according to the ground truth, they
should be assigned to the same cluster; and the rag bag criterion, which says that points that
cannot be merged with others should be put into a "rag bag" category.
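
As a rough illustration, scikit-learn exposes extrinsic scores for homogeneity and completeness (the rag bag criterion has no direct counterpart there); a minimal sketch, assuming toy label vectors:

from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

truth = [0, 0, 1, 1, 2, 2]    # ground truth Cg
pred  = [0, 0, 1, 2, 2, 2]    # clustering C being evaluated
print(homogeneity_score(truth, pred))    # each cluster contains mostly one category
print(completeness_score(truth, pred))   # objects of one category mostly share a cluster
print(v_measure_score(truth, pred))      # harmonic mean of the two
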
Intrinsic Methods
Silhouette Coefficient - For clusters C1, C2, …, Ck and every object o, we calculate a(o) as
the average distance between o and all other objects in the cluster to which o belongs. Similarly,
b(o) is the minimum average distance from o to the objects of each cluster to which o does not
belong. The silhouette coefficient is then defined as
S(o) = (b(o) – a(o)) / max{b(o), a(o)}
The value of the silhouette coefficient is between -1 and 1. The value of a(o) reflects the
compactness of the cluster to which o belongs: the smaller the value, the more compact the
cluster. The value of b(o) captures the degree to which o is separated from other clusters:
the larger b(o) is, the more separated o is from other clusters. Therefore, when the silhouette
coefficient of o approaches 1, the cluster containing o is compact and o is far away from
other clusters, which is the preferable case. However, when the silhouette coefficient is
negative (i.e., b(o) < a(o)), o is, in expectation, closer to the objects in another cluster than
to the objects in its own cluster, which is undesirable.

Davies–Bouldin index
For a clustering with k clusters it is calculated as
DB = (1/k) ∑_{i=1}^{k} max_{j≠i} (σ_i + σ_j) / d(c_i, c_j)
where σ_i is the average distance of the objects in cluster i to its centroid c_i and d(c_i, c_j)
is the distance between the centroids of clusters i and j. Lower values indicate better clustering.
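
A minimal sketch of computing the index, assuming scikit-learn's davies_bouldin_score and a k-means labeling of placeholder data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.random.rand(200, 2)                                    # placeholder data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))                        # lower is better
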
Q.1.c) KSOM Visualization-

Once the SOM is trained using the input data, the final map is not expected to have any twists.
If the map is twist-free, the distance between the codebook vectors of neighboring neurons
gives an approximation of the distance between different parts of the underlying data. When
such distances are depicted in a grayscale image, light colors depict closely spaced node
codebook vectors and darker colors indicate more widely separated node codebook vectors.
Thus, groups of light colors can be considered as clusters, and the dark parts as the boundaries
between the clusters. This representation can help to visualize the clusters in the high-
dimensional spaces, or to automatically recognize them using relatively simple image
processing techniques.
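
A minimal sketch of this grayscale (U-matrix) visualization, assuming the third-party minisom package; its distance_map() returns, for each node, the normalized average distance to the codebook vectors of neighbouring nodes:

import numpy as np
import matplotlib.pyplot as plt
from minisom import MiniSom

X = np.random.rand(500, 2)                       # placeholder data
som = MiniSom(10, 10, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 1000)

plt.imshow(som.distance_map().T, cmap='gray_r')  # light = close codebook vectors, dark = boundaries
plt.colorbar()
plt.show()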

Q.2.a)
The number of data points is 8. This problem is solved by a program.
Finding Best k using Davies Bouldin Index-

No. of clusters    DB Index
2                  0.8058
3                  0.4767
4                  0.384
5                  0.248
6                  0.289
7                  0.168

After k = 3 the DB index does not change much, so we can say the best k is 3.

Q.2.b) J(c, mu) vs. number of clusters

(Figures: J(c, mu) vs. number of clusters, using Euclidean distance (left) and Manhattan distance (right).)
The clusters are as follows (a sketch reproducing this result follows the listing):

C0 = { [5,6], [6,9] }
C1 = { [6,14], [8,13], [9,12] }
C2 = { [10,8], [11,9], [12,8] }
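
A minimal sketch reproducing the Q.2 workflow, assuming the eight points listed above and scikit-learn's KMeans (cluster numbering may differ between runs):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.array([[5, 6], [6, 9], [6, 14], [8, 13], [9, 12],
              [10, 8], [11, 9], [12, 8]], dtype=float)

for k in range(2, 8):                                   # DB index sweep as in Q.2.a
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(davies_bouldin_score(X, labels), 4))

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)                                           # assignment of each point for k = 3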

Q.3
k=2

The cluster centers are {[21], [45]}.


Q.5.a)
K-means cluster centers for data 1:

mu1 = [ 9.6901731e-01  9.7551245e-01]
mu2 = [-1.7655450e-04  5.0115570e+00]
mu3 = [ 3.9892065e+00  6.0057065e+00]

K-means cluster centers for data 2:

mu1 = [ 5.53186536  -6.1081348    3.15937707]
mu2 = [16.543127    13.016817     7.9750388 ]
mu3 = [-6.579707    -8.267905    22.96139   ]

K-medoid cluster centers for data 1:

mu1 = [0.037553, 5.0372 ]
mu2 = [3.9089,   5.8664 ]
mu3 = [0.91102,  0.95461]

K-medoid cluster centers for data 2:

mu1 = [16.843,  11.201,   7.2582]
mu2 = [-6.0975, -8.6116, 21.826 ]
mu3 = [ 4.5076, -7.3343,  1.6398]

Q.5.b)
In the k-means algorithm the distortion function J(c, mu) is not a convex function, so the result
is very sensitive to the initial choice of the k centroids, and the algorithm may converge to a
local minimum rather than the global minimum. To resolve this problem we randomly initialize the
k centroids from the data points 5-10 times and keep the run whose converged centroids give the
minimum value of the distortion function (see the sketch below).
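
A minimal sketch of this restart strategy, assuming scikit-learn's KMeans with initial centroids drawn at random from the data points:

import numpy as np
from sklearn.cluster import KMeans

def best_of_random_restarts(X, k, n_restarts=10, seed=0):
    rng = np.random.default_rng(seed)
    best_J, best_km = np.inf, None
    for _ in range(n_restarts):
        init_idx = rng.choice(len(X), size=k, replace=False)      # centroids from data points
        km = KMeans(n_clusters=k, init=X[init_idx], n_init=1).fit(X)
        if km.inertia_ < best_J:                                  # inertia_ is the distortion J
            best_J, best_km = km.inertia_, km
    return best_km

X = np.random.rand(300, 2)                 # placeholder data
km = best_of_random_restarts(X, k=3)
print(km.cluster_centers_)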

Q.5.c) Plots of the clusters for data 1 and data 2 obtained by k-means, k-medoids, hierarchical
clustering (divisive clustering), and KSOM.

(Figures: Data 1 (left) and Data 2 (right).)
Q.5.d) J(c, mu) vs. k

(Figures: Data 1 (left) and Data 2 (right).)

Q.5.e)
KSOM visualization techniques-

(Figures: Data 1 (left) and Data 2 (right).)
Q.6.a)
The technique used to find the number of clusters is the elbow method.

(Figures: elbow plots for Data 3 (left) and Data 4 (right).)

Q.6.b)
Number of clusters for data 3:
From the data 3 elbow figure, we can see the elbow forming at about k = 4 or 5, so the number of
clusters is taken as 5.

Number of clusters for data 4:
From the data 4 elbow figure, we can see the elbow forming at k = 3, so the number of clusters is 3.

Q.6.c)
Quality of Clusters-

For data 3, the DB index is 0.812.
For data 4, the DB index is 0.525.

Q.6.d)

(Figures: cluster plots for Data 3 (left) and Data 4 (right).)

The class labeled -2 represents outliers.


Q.7. Quality of clusters-
Algorithm         No. of Clusters    Quality Measure (DB Index)
K-means           3                  0.72
K-medoid          3                  0.4048
KSOM              3                  0.867
DBSCAN            2                  2.7495
Divisive Clus.    3                  1.156

Q.8- matrix norm 1-

Q.9 Page Ranking-


Q.10) Page Ranking-

Q.11) Page Ranking-


Q.12)

Q.13
Q.14
Finding the optimal dimension k of the subspace to which the data can be projected-
Average squared projection error:  (1/N) ∑_{i=1}^{N} ‖X_i − X_i,approx‖²

Total variation in the data:  (1/N) ∑_{i=1}^{N} ‖X_i‖²   (the data are centralized, so the mean is zero)

Typically, choose k to be the smallest value such that

[ (1/N) ∑_{i=1}^{N} ‖X_i − X_i,approx‖² ] / [ (1/N) ∑_{i=1}^{N} ‖X_i‖² ]  ≤  0.01   --------- equation 1

This means 99% of the variance is retained in the approximated data; if the bound is 0.03 instead,
97% of the variance is retained.

SVD(X^T X) = U S V^T

where S is a diagonal matrix containing the eigenvalues of X^T X (a symmetric matrix).

For finding the best value of k, we can also use

1 − ( ∑_{i=1}^{k} S_ii ) / ( ∑_{i=1}^{n} S_ii )  ≤  0.01
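
A minimal sketch of this criterion, assuming a mean-centered data matrix X (the singular values of X^T X are its eigenvalues S_ii):

import numpy as np

X = np.random.rand(200, 30)                       # placeholder data matrix
X = X - X.mean(axis=0)                            # centralize the data (mean zero)

S = np.linalg.svd(X.T @ X, compute_uv=False)      # eigenvalues of X^T X, in descending order
retained = np.cumsum(S) / np.sum(S)               # (sum_{i<=k} S_ii) / (sum_{i<=n} S_ii)
k = int(np.argmax(1 - retained <= 0.01)) + 1      # smallest k retaining at least 99% of the variance
print(k)
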
Q.15.a)

Choosing the number of principal components (k)-


Average squared projection error:  (1/N) ∑_{i=1}^{N} ‖X_i − X_i,approx‖²

Total variation in the data:  (1/N) ∑_{i=1}^{N} ‖X_i‖²   (the data are centralized, so the mean is zero)

Typically, choose k to be the smallest value such that

[ (1/N) ∑_{i=1}^{N} ‖X_i − X_i,approx‖² ] / [ (1/N) ∑_{i=1}^{N} ‖X_i‖² ]  ≤  0.01   --------- equation 1

This means 99% of the variance is retained in the approximated data; if the bound is 0.03 instead,
97% of the variance is retained.

SVD(X^T X) = U S V^T

where S is a diagonal matrix containing the eigenvalues of X^T X (a symmetric matrix).

For finding the best value of k, we can also use

1 − ( ∑_{i=1}^{k} S_ii ) / ( ∑_{i=1}^{n} S_ii )  ≤  0.01

(Figures: Arrhythmia data (left) and Communities and Crime data (right).)
Q.15.b) The basis of the optimal subspace for the Arrhythmia data is stored in Arr_Basis.csv.
The basis of the optimal subspace for the Communities and Crime data is stored in C&C_Basis.csv.

Q.15.c) The new data points for the Arrhythmia data are stored in Arr_Newdata.csv.
The new data points for the Communities and Crime data are stored in C&C_Newdata.csv.

Q.15.d) Classification on the new Arrhythmia data: this is a multi-class classification problem;
the LDA algorithm is applied and the accuracy obtained is 90.92%.

Regression on the new Communities and Crime data: the mean squared error is 0.00885371.
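
A minimal sketch of this evaluation, assuming the projected-data files named in Q.15.c with the label/target in the last column (the train/test split and column layout are assumptions, so the exact numbers will differ):

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error

# multi-class classification on the projected Arrhythmia data
arr = pd.read_csv('Arr_Newdata.csv')
Xa, ya = arr.iloc[:, :-1].values, arr.iloc[:, -1].values      # assumes the label is the last column
Xtr, Xte, ytr, yte = train_test_split(Xa, ya, test_size=0.3, random_state=0)
lda = LinearDiscriminantAnalysis().fit(Xtr, ytr)
print('LDA accuracy:', accuracy_score(yte, lda.predict(Xte)))

# regression on the projected Communities and Crime data
cc = pd.read_csv('C&C_Newdata.csv')
Xc, yc = cc.iloc[:, :-1].values, cc.iloc[:, -1].values        # assumes the target is the last column
Xtr, Xte, ytr, yte = train_test_split(Xc, yc, test_size=0.3, random_state=0)
reg = LinearRegression().fit(Xtr, ytr)
print('MSE:', mean_squared_error(yte, reg.predict(Xte)))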

Q.15.e) Classification on the approximated Arrhythmia data: the accuracy is 4.64%.

Regression on the Communities and Crime data: the MSE is 1.09587179.

Q.16.a) &b)

Features selected     Experimental Result (Accuracy)
F = [1]               0.9045
F = [1, 2]            0.9097
F = [1, 2, 9]         0.91216
F = [1, 2, 9, 0]      0.91216

Preprocessing of the data: missing values in an attribute column are replaced by the mean value
of that attribute column (a sketch of the imputation and the forward-selection loop follows).
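
A minimal sketch of the mean imputation together with a greedy forward feature-selection loop; the base classifier (LDA) and the cross-validation setup are assumptions, not taken from the original run:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def mean_impute(X):
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)                # mean of each attribute column
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]                  # replace missing values by the column mean
    return X

def forward_select(X, y, n_features):
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = [(np.mean(cross_val_score(LinearDiscriminantAnalysis(),
                                           X[:, selected + [f]], y, cv=5)), f)
                  for f in remaining]
        best_score, best_f = max(scores)             # feature giving the best cross-validated accuracy
        selected.append(best_f)
        remaining.remove(best_f)
        print(selected, round(best_score, 5))
    return selected

# usage: feats = forward_select(mean_impute(X), y, n_features=4)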

Q.17.a) Experimental settings: Iris data, a multi-class classification problem with 3 classes.
The LDA algorithm is used for classification.

One vs All-
The data consist of 3 classes, so 3 models are built, each treating one class as the positive class
and the other two classes as the negative class. Each model is then a two-class classification
problem, which is solved with the LDA algorithm.

All vs All -
Here nC2 models are built, where n = 3, so there are 3 models: M1 with C1 and C2, M2 with C2 and C3,
and M3 with C1 and C3. Voting is carried out across the models, and for each data point the class
with the majority of votes is declared the winner (see the sketch below).
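
A minimal sketch of both schemes on the Iris data, using scikit-learn's OneVsRestClassifier and OneVsOneClassifier wrappers around LDA (the split is an assumption, so accuracies will differ from those reported below):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

ovr = OneVsRestClassifier(LinearDiscriminantAnalysis()).fit(Xtr, ytr)   # 3 one-vs-all models
ovo = OneVsOneClassifier(LinearDiscriminantAnalysis()).fit(Xtr, ytr)    # 3C2 = 3 pairwise models
print('one vs all:', accuracy_score(yte, ovr.predict(Xte)))
print('all vs all:', accuracy_score(yte, ovo.predict(Xte)))
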
Q.17.b) Performance comparison between the two multiclass techniques:

one vs all - accuracy is 74%
all vs all - accuracy is 98%
