
Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programmes Division


Second Semester 2018-2019

Mid-Semester Test
(EC-2 Makeup)

Course No.     : IS ZC415
Course Title   : DATA MINING
Nature of Exam : Closed Book
Weightage      : 30%                No. of Pages = 2
Duration       : 2 Hours            No. of Questions = 6
Date of Exam   : 09/03/2019 (AN)
Note:
1. Please follow all the Instructions to Candidates given on the cover page of the answer book.
2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.
3. Assumptions made if any, should be stated clearly at the beginning of your answer.
4. Unnecessarily long answers may attract negative marks.

Q.1. For the data set given below in Table 1, fill in the missing values. [5]
Table 1
Sl. No.  Fet1  Fet2  Fet3  Fet4
1          3     4     4     3
2          0    12    13    13
3         10    15    12    18
4          0    23    17     0
5          4     7     6     2
6          2    14     7    12

The missing values can be filled using, for example, the upper quartile = 13, or the lower quartile = 2, or the mode = 0, or the mean = 12.5.

Q.2. Assume the following dataset is given: (1,2,1), (4,4,2), (3,5,1), (1,6,2), (8,8,2), (7,9,2),
(0,4,2), (4,0,2). K-means is used with k = 3 to cluster the dataset, with Euclidean
distance as the distance function between centroids and objects. The initial
clusters C1, C2, and C3 are as follows:
C1: {(1,2,1), (4,4,2), (1,6,2)}
C2: {(3,5,1), (0,4,2),(4,0,2)}
C3: {(8,8,2), (7,9,2)}

Now K-means is run for a single iteration; what are the new clusters and what are
their centroids? [5]

Centroids of the initial clusters:
C1 = (2, 4, 1.667)
C2 = (2.333, 3, 1.667)
C3 = (7.5, 8.5, 2)

Distance matrix (object to centroid):

       C1     C2     C3
O1    2.33   1.80   9.25
O2    2.03   1.97   5.70
O3    1.56   2.21   5.79
O4    2.26   3.30   6.96
O5    7.22   7.56   0.71
O6    7.08   7.61   0.71
O7    2.03   2.56   8.75
O8    4.48   3.45   9.19

Assigning each object to its nearest centroid gives:

Cluster 1: O3, O4, O7
Cluster 2: O1, O2, O8
Cluster 3: O5, O6

New centroids:
C1 = (1.333, 5, 1.667)
C2 = (3, 2, 1.667)
C3 = (7.5, 8.5, 2)
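The single iteration worked above can be reproduced with a short script (a sketch; the helper names are mine, not part of the question):

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Dataset and initial clusters from Q.2 (objects O1..O8).
points = [(1, 2, 1), (4, 4, 2), (3, 5, 1), (1, 6, 2),
          (8, 8, 2), (7, 9, 2), (0, 4, 2), (4, 0, 2)]
initial = {
    "C1": [points[0], points[1], points[3]],   # (1,2,1), (4,4,2), (1,6,2)
    "C2": [points[2], points[6], points[7]],   # (3,5,1), (0,4,2), (4,0,2)
    "C3": [points[4], points[5]],              # (8,8,2), (7,9,2)
}

def centroid(cluster):
    """Component-wise mean of a list of 3-dimensional points."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(3))

# Step 1: centroids of the initial clusters.
cents = {name: centroid(cl) for name, cl in initial.items()}

# Step 2: reassign every object to its nearest centroid.
new_clusters = {name: [] for name in cents}
for idx, p in enumerate(points, start=1):
    nearest = min(cents, key=lambda name: dist(p, cents[name]))
    new_clusters[nearest].append((f"O{idx}", p))

# Step 3: recompute centroids from the new assignments.
new_cents = {name: centroid([p for _, p in cl])
             for name, cl in new_clusters.items()}

for name in ("C1", "C2", "C3"):
    members = [label for label, _ in new_clusters[name]]
    print(name, members, tuple(round(c, 3) for c in new_cents[name]))
```

Running it prints the same assignments (C1: O3, O4, O7; C2: O1, O2, O8; C3: O5, O6) and the same new centroids as derived above.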

Q.3. Consider the following dataset [5]


A = {80, 16, 96, 10, 10, 35, 10, 10, 30, 10, 10, 10, 39, 10, 16, 33}
A. Calculate the frequency and mode for the above dataset.

Value  Frequency
10         8
16         2
30         1
33         1
35         1
39         1
80         1
96         1

Mode: 10
B. Calculate the standard deviation for the above dataset.
Sample standard deviation ≈ 26.30
C. Calculate the variance for the above dataset.
Sample variance ≈ 691.60
D. Calculate the 25th, 50th, and 75th percentiles for the above dataset.
25th percentile (Q1) = 10
50th percentile (median) = (10 + 16)/2 = 13
75th percentile (Q3) = (33 + 35)/2 = 34
(Quartiles are taken as the medians of the lower and upper halves of the sorted data.)
E. Use the data to create a box-and-whisker plot.
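The statistics in parts A-D can be reproduced with Python's statistics module (a sketch; the quartile convention of taking medians of the two halves is assumed, since it is the one that yields 10, 13, and 34):

```python
import statistics

A = [80, 16, 96, 10, 10, 35, 10, 10, 30, 10, 10, 10, 39, 10, 16, 33]

# A. Frequency table and mode.
freq = {v: A.count(v) for v in sorted(set(A))}
mode = statistics.mode(A)

# B/C. Sample standard deviation and sample variance (n - 1 in the
# denominator), which is what the answers above correspond to.
sd = statistics.stdev(A)
var = statistics.variance(A)

# D. Quartiles as medians of the lower and upper halves of the sorted data.
s = sorted(A)
n = len(s)
q1 = statistics.median(s[: n // 2])
q2 = statistics.median(s)
q3 = statistics.median(s[(n + 1) // 2:])

print(freq, mode)
print(round(sd, 2), round(var, 2))   # 26.3 691.6
print(q1, q2, q3)                    # 10.0 13.0 34.0
```

The five-number summary this produces (min 10, Q1 10, median 13, Q3 34, max 96) is exactly what part E's box-and-whisker plot is drawn from.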

Q.4 (a) What are the common properties of a metric?


a. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
b. d(p, q) = d(q, p) for all p and q. (Symmetry)
c. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)



Q.4 (b) What is the difference between noise and outliers?
Outlier: a data object that does not comply with the general behavior of the data.
Noise: meaningless or corrupted data; it unnecessarily increases the amount of storage space.

Q.4 (c) Show that the set difference metric given by
d(A, B) = size(A − B) + size(B − A)
satisfies the metric properties, where A and B are sets and A − B is the set difference.
[1.5 + 1.5 + 3 = 6]
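The three properties can be verified as follows (a proof sketch for finite sets, writing the metric as the size of the symmetric difference):

```latex
% d(A,B) = |A - B| + |B - A| = |A \triangle B|.
%
% Positive definiteness: |A - B| and |B - A| are non-negative, so
% d(A,B) >= 0; and d(A,B) = 0 forces A - B = B - A = \emptyset, i.e. A = B.
%
% Symmetry: d(A,B) = |A - B| + |B - A| = |B - A| + |A - B| = d(B,A).
%
% Triangle inequality: for any x in A - C, either x in B (then x in B - C)
% or x not in B (then x in A - B). Hence
\[
A - C \subseteq (A - B) \cup (B - C), \qquad
C - A \subseteq (C - B) \cup (B - A),
\]
\[
d(A,C) = |A - C| + |C - A|
\le |A - B| + |B - C| + |C - B| + |B - A|
= d(A,B) + d(B,C).
\]
```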

Q.5 (a) Sketch a histogram of the data below. [2 + 3 = 5]


X={9,6,6,4,6,8,5,4,4,4,4,8,5,6,5,4,6,8,6,4,4,6,4,4,8}
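The bar heights for the histogram are just the value frequencies, which can be tallied with a few lines of Python (a sketch):

```python
from collections import Counter

X = [9, 6, 6, 4, 6, 8, 5, 4, 4, 4, 4, 8, 5, 6, 5,
     4, 6, 8, 6, 4, 4, 6, 4, 4, 8]

# Frequency of each value; each count is the height of one histogram bar.
counts = Counter(X)
for value in sorted(counts):
    print(f"{value}: {'*' * counts[value]} ({counts[value]})")
```

This gives frequencies 4 → 10, 5 → 3, 6 → 7, 8 → 4, 9 → 1 (25 observations in total), so the histogram has its tallest bar at 4 and a single-count bar at 9.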

Q.5 (b) Prove that the mean of n1 = (a+b)/2 and n2 = (a−b)/2 is a/2, and the variance is a*b/4.
Mean: (n1 + n2)/2 = ((a+b)/2 + (a−b)/2)/2 = a/2, as claimed.
Variance: the deviations from the mean are n1 − a/2 = b/2 and n2 − a/2 = −b/2, so
Var = ((b/2)^2 + (−b/2)^2)/2 = b^2/4.
The variance is therefore b^2/4, not a*b/4, so the second claim cannot be proved as stated (it holds only when a = b).
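A quick numeric check of both claims, using the hypothetical values a = 6, b = 2 (chosen so that a ≠ b):

```python
# Counterexample to the variance claim in Q.5 (b).
a, b = 6, 2
n1, n2 = (a + b) / 2, (a - b) / 2                 # n1 = 4.0, n2 = 2.0

mean = (n1 + n2) / 2                              # = a/2 = 3.0
var = ((n1 - mean) ** 2 + (n2 - mean) ** 2) / 2   # population variance of the pair

print(mean == a / 2)       # True: the mean claim holds
print(var, b ** 2 / 4)     # 1.0 1.0 -> variance equals b^2/4 ...
print(var == a * b / 4)    # False: ... not a*b/4 (which is 3.0 here)
```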



Q.6. Write down the advantages of the hierarchical clustering approach over the partition
clustering approach. Perform the single-link hierarchical clustering method with the
similarity matrix given in Table 3 and show the results using a dendrogram. Show all
the steps. [4]

Table 3: Euclidean distance

       O1      O2      O3      O4      O5
O1    0.00   13.22   15.56   91.99   19.11
O2   13.22    0.00   16.64  112.29  140.53
O3   15.56   16.64    0.00  117.44   28.81
O4   91.99  112.29  117.44    0.00   81.83
O5   19.11  140.53   28.81   81.83    0.00

Advantages of the hierarchical clustering approach over the partition clustering approach:

1. It does not assume a particular value of k, as k-means clustering does.
2. The generated tree may correspond to a meaningful taxonomy.
3. Only a distance or "proximity" matrix is needed to compute the hierarchical clustering.

Step 1: the minimum distance is d(O1, O2) = 13.22, so merge O1 and O2.

           O1,O2     O3      O4      O5
O1,O2       0.00   15.56   91.99   19.11
O3         15.56    0.00  117.44   28.81
O4         91.99  117.44    0.00   81.83
O5         19.11   28.81   81.83    0.00

Step 2: the minimum distance is d({O1, O2}, O3) = 15.56, so merge O3 into {O1, O2}.

            O1,O2,O3     O4      O5
O1,O2,O3       0.00    91.99   19.11
O4            91.99     0.00   81.83
O5            19.11    81.83    0.00

Step 3: the minimum distance is d({O1, O2, O3}, O5) = 19.11 (smaller than d(O4, O5) = 81.83), so merge O5 into {O1, O2, O3}.

               O1,O2,O3,O5     O4
O1,O2,O3,O5         0.00     81.83
O4                 81.83      0.00

Step 4: the two remaining clusters merge at distance 81.83, completing the dendrogram.
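The merge sequence above can be checked with a naive single-link agglomeration over the Table 3 distances (a sketch; the frozenset bookkeeping and helper names are my own):

```python
# Pairwise distances from Table 3 (upper triangle; the matrix is symmetric).
labels = ["O1", "O2", "O3", "O4", "O5"]
D = {
    ("O1", "O2"): 13.22, ("O1", "O3"): 15.56, ("O1", "O4"): 91.99,
    ("O1", "O5"): 19.11, ("O2", "O3"): 16.64, ("O2", "O4"): 112.29,
    ("O2", "O5"): 140.53, ("O3", "O4"): 117.44, ("O3", "O5"): 28.81,
    ("O4", "O5"): 81.83,
}

def d(a, b):
    """Distance between two objects, regardless of pair order."""
    return D.get((a, b)) or D.get((b, a))

def single_link(c1, c2):
    """Single link: minimum pairwise distance between two clusters."""
    return min(d(a, b) for a in c1 for b in c2)

clusters = [frozenset([l]) for l in labels]
merges = []
while len(clusters) > 1:
    # Find the closest pair of clusters and merge them.
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
    height = single_link(clusters[i], clusters[j])
    merged = clusters[i] | clusters[j]
    merges.append((sorted(merged), height))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

for members, height in merges:
    print(members, height)
```

The printed merge heights 13.22, 15.56, 19.11, 81.83 are the levels at which the dendrogram's four joins occur, matching Steps 1 to 4.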

*********

