
Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programmes Division


Second Semester 2018-2019

Mid-Semester Test
(EC-2 Makeup)

Course No.     : IS ZC415
Course Title   : DATA MINING
Nature of Exam : Closed Book
Weightage      : 30%                No. of Pages = 2
Duration       : 2 Hours            No. of Questions = 6
Date of Exam   : 09/03/2019 (AN)
Note:
1. Please follow all the Instructions to Candidates given on the cover page of the answer book.
2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.
3. Assumptions made if any, should be stated clearly at the beginning of your answer.
4. Unnecessarily long answers may attract negative marks.

Q.1. For the data set given below in Table 1, fill in the missing values. [5]
Table 1
Sl. No.  Fet1  Fet2  Fet3  Fet4
1          3     4     4     3
2          0    12    13    13
3         10    15    12    18
4          0    23    17     0
5          4     7     6     2
6          2    14     7    12

The missing values can be filled using, for example, the upper quartile = 13, or the lower quartile = 2, or the mode = 0, or the mean = 12.5.

Q.2. Assume the following dataset is given: (1,2,1), (4,4,2), (3,5,1), (1,6,2), (8,8,2), (7,9,2),
(0,4,2), (4,0,2). K-means is used with k = 3 to cluster the dataset, with Euclidean
distance as the distance function between centroids and objects. The initial
clusters C1, C2, and C3 are as follows:
C1: {(1,2,1), (4,4,2), (1,6,2)}
C2: {(3,5,1), (0,4,2),(4,0,2)}
C3: {(8,8,2), (7,9,2)}

Now K-means is run for a single iteration; what are the new clusters and what are
their centroids? [5]

Centroids of the initial clusters:
C1 = (2, 4, 1.667)
C2 = (2.333, 3, 1.667)
C3 = (7.5, 8.5, 2)

Distance matrix (object to centroid):

       C1     C2     C3
O1    2.33   1.80   9.25
O2    2.03   1.97   5.70
O3    1.56   2.21   5.79
O4    2.26   3.30   6.96
O5    7.22   7.56   0.71
O6    7.08   7.61   0.71
O7    2.03   2.56   8.75
O8    4.48   3.45   9.19

Assigning each object to its nearest centroid gives:

Cluster 1: O3, O4, O7
Cluster 2: O1, O2, O8
Cluster 3: O5, O6

New centroids:
C1 = (1.333, 5, 1.667)
C2 = (3, 2, 1.667)
C3 = (7.5, 8.5, 2)
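The single iteration worked above can be reproduced with a short script (a sketch; the helper names are mine, not part of the question):

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Dataset and initial clusters from Q.2 (objects O1..O8).
points = [(1, 2, 1), (4, 4, 2), (3, 5, 1), (1, 6, 2),
          (8, 8, 2), (7, 9, 2), (0, 4, 2), (4, 0, 2)]
initial = {
    "C1": [points[0], points[1], points[3]],   # (1,2,1), (4,4,2), (1,6,2)
    "C2": [points[2], points[6], points[7]],   # (3,5,1), (0,4,2), (4,0,2)
    "C3": [points[4], points[5]],              # (8,8,2), (7,9,2)
}

def centroid(cluster):
    """Component-wise mean of a list of 3-dimensional points."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(3))

# Step 1: centroids of the initial clusters.
cents = {name: centroid(cl) for name, cl in initial.items()}

# Step 2: reassign every object to its nearest centroid.
new_clusters = {name: [] for name in cents}
for idx, p in enumerate(points, start=1):
    nearest = min(cents, key=lambda name: dist(p, cents[name]))
    new_clusters[nearest].append((f"O{idx}", p))

# Step 3: recompute centroids from the new assignments.
new_cents = {name: centroid([p for _, p in cl])
             for name, cl in new_clusters.items()}

for name in ("C1", "C2", "C3"):
    members = [label for label, _ in new_clusters[name]]
    print(name, members, tuple(round(c, 3) for c in new_cents[name]))
```

Running it prints the same assignments (C1: O3, O4, O7; C2: O1, O2, O8; C3: O5, O6) and the same new centroids as derived above.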

Q.3. Consider the following dataset [5]


A = {80, 16, 96, 10, 10, 35, 10, 10, 30, 10, 10, 10, 39, 10, 16, 33}
A. Calculate the frequency and mode for the above dataset.

Value  Frequency
10         8
16         2
30         1
33         1
35         1
39         1
80         1
96         1

Mode: 10
B. Calculate the standard deviation for the above dataset.
Sample standard deviation ≈ 26.30
C. Calculate the variance for the above dataset.
Sample variance ≈ 691.60
D. Calculate the 25th, 50th, and 75th percentiles for the above dataset.
25th percentile (Q1) = 10
50th percentile (median) = (10 + 16)/2 = 13
75th percentile (Q3) = (33 + 35)/2 = 34
(Quartiles are taken as the medians of the lower and upper halves of the sorted data.)
E. Use the data to create a box-and-whisker plot.
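The statistics in parts A-D can be reproduced with Python's statistics module (a sketch; the quartile convention of taking medians of the two halves is assumed, since it is the one that yields 10, 13, and 34):

```python
import statistics

A = [80, 16, 96, 10, 10, 35, 10, 10, 30, 10, 10, 10, 39, 10, 16, 33]

# A. Frequency table and mode.
freq = {v: A.count(v) for v in sorted(set(A))}
mode = statistics.mode(A)

# B/C. Sample standard deviation and sample variance (n - 1 in the
# denominator), which is what the answers above correspond to.
sd = statistics.stdev(A)
var = statistics.variance(A)

# D. Quartiles as medians of the lower and upper halves of the sorted data.
s = sorted(A)
n = len(s)
q1 = statistics.median(s[: n // 2])
q2 = statistics.median(s)
q3 = statistics.median(s[(n + 1) // 2:])

print(freq, mode)
print(round(sd, 2), round(var, 2))   # 26.3 691.6
print(q1, q2, q3)                    # 10.0 13.0 34.0
```

The five-number summary this produces (min 10, Q1 10, median 13, Q3 34, max 96) is exactly what part E's box-and-whisker plot is drawn from.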

Q.4 (a) What are the common properties of a metric?


a. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
b. d(p, q) = d(q, p) for all p and q. (Symmetry)
c. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)



Q.4 (b) What is the difference between noise and outliers?
Outlier: a data object that does not comply with the general behavior of the data.
Noise: meaningless or corrupted data; it unnecessarily increases the amount of storage space.

Q.4 (c) Show that the set difference metric given by
d(A, B) = size(A − B) + size(B − A)
satisfies the metric properties, where A and B are sets and A − B is the set difference.
[1.5 + 1.5 + 3 = 6]
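The three properties can be verified as follows (a proof sketch for finite sets, writing the metric as the size of the symmetric difference):

```latex
% d(A,B) = |A - B| + |B - A| = |A \triangle B|.
%
% Positive definiteness: |A - B| and |B - A| are non-negative, so
% d(A,B) >= 0; and d(A,B) = 0 forces A - B = B - A = \emptyset, i.e. A = B.
%
% Symmetry: d(A,B) = |A - B| + |B - A| = |B - A| + |A - B| = d(B,A).
%
% Triangle inequality: for any x in A - C, either x in B (then x in B - C)
% or x not in B (then x in A - B). Hence
\[
A - C \subseteq (A - B) \cup (B - C), \qquad
C - A \subseteq (C - B) \cup (B - A),
\]
\[
d(A,C) = |A - C| + |C - A|
\le |A - B| + |B - C| + |C - B| + |B - A|
= d(A,B) + d(B,C).
\]
```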

Q.5 (a) Sketch a histogram of the data below. [2 + 3 = 5]


X={9,6,6,4,6,8,5,4,4,4,4,8,5,6,5,4,6,8,6,4,4,6,4,4,8}
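The bar heights for the histogram are just the value frequencies, which can be tallied with a few lines of Python (a sketch):

```python
from collections import Counter

X = [9, 6, 6, 4, 6, 8, 5, 4, 4, 4, 4, 8, 5, 6, 5,
     4, 6, 8, 6, 4, 4, 6, 4, 4, 8]

# Frequency of each value; each count is the height of one histogram bar.
counts = Counter(X)
for value in sorted(counts):
    print(f"{value}: {'*' * counts[value]} ({counts[value]})")
```

This gives frequencies 4 → 10, 5 → 3, 6 → 7, 8 → 4, 9 → 1 (25 observations in total), so the histogram has its tallest bar at 4 and a single-count bar at 9.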

Q.5 (b) Prove that the mean of n1 = (a+b)/2 and n2 = (a−b)/2 is a/2, and the variance is a*b/4.
Mean: (n1 + n2)/2 = ((a+b)/2 + (a−b)/2)/2 = a/2, as claimed.
Variance: the deviations from the mean are n1 − a/2 = b/2 and n2 − a/2 = −b/2, so
Var = ((b/2)^2 + (−b/2)^2)/2 = b^2/4.
The variance is therefore b^2/4, not a*b/4, so the second claim cannot be proved as stated (it holds only when a = b).
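A quick numeric check of both claims, using the hypothetical values a = 6, b = 2 (chosen so that a ≠ b):

```python
# Counterexample to the variance claim in Q.5 (b).
a, b = 6, 2
n1, n2 = (a + b) / 2, (a - b) / 2                 # n1 = 4.0, n2 = 2.0

mean = (n1 + n2) / 2                              # = a/2 = 3.0
var = ((n1 - mean) ** 2 + (n2 - mean) ** 2) / 2   # population variance of the pair

print(mean == a / 2)       # True: the mean claim holds
print(var, b ** 2 / 4)     # 1.0 1.0 -> variance equals b^2/4 ...
print(var == a * b / 4)    # False: ... not a*b/4 (which is 3.0 here)
```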



Q.6. Write down the advantages of the hierarchical clustering approach over the partition
clustering approach. Perform the single-link hierarchical clustering method with the
similarity matrix given in Table 3 and show the results using a dendrogram. Show all
the steps. [4]

Table 3: Euclidean distance

       O1      O2      O3      O4      O5
O1    0.00   13.22   15.56   91.99   19.11
O2   13.22    0.00   16.64  112.29  140.53
O3   15.56   16.64    0.00  117.44   28.81
O4   91.99  112.29  117.44    0.00   81.83
O5   19.11  140.53   28.81   81.83    0.00

Advantages of the hierarchical clustering approach over the partition clustering approach:

1. It does not assume a particular value of k, as k-means clustering does.
2. The generated tree may correspond to a meaningful taxonomy.
3. Only a distance or "proximity" matrix is needed to compute the hierarchical clustering.

Step 1: the minimum distance is d(O1, O2) = 13.22, so merge O1 and O2.

           O1,O2     O3      O4      O5
O1,O2       0.00   15.56   91.99   19.11
O3         15.56    0.00  117.44   28.81
O4         91.99  117.44    0.00   81.83
O5         19.11   28.81   81.83    0.00

Step 2: the minimum distance is d({O1, O2}, O3) = 15.56, so merge O3 into {O1, O2}.

            O1,O2,O3     O4      O5
O1,O2,O3       0.00    91.99   19.11
O4            91.99     0.00   81.83
O5            19.11    81.83    0.00

Step 3: the minimum distance is d({O1, O2, O3}, O5) = 19.11 (smaller than d(O4, O5) = 81.83), so merge O5 into {O1, O2, O3}.

               O1,O2,O3,O5     O4
O1,O2,O3,O5         0.00     81.83
O4                 81.83      0.00

Step 4: the two remaining clusters merge at distance 81.83, completing the dendrogram.
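The merge sequence above can be checked with a naive single-link agglomeration over the Table 3 distances (a sketch; the frozenset bookkeeping and helper names are my own):

```python
# Pairwise distances from Table 3 (upper triangle; the matrix is symmetric).
labels = ["O1", "O2", "O3", "O4", "O5"]
D = {
    ("O1", "O2"): 13.22, ("O1", "O3"): 15.56, ("O1", "O4"): 91.99,
    ("O1", "O5"): 19.11, ("O2", "O3"): 16.64, ("O2", "O4"): 112.29,
    ("O2", "O5"): 140.53, ("O3", "O4"): 117.44, ("O3", "O5"): 28.81,
    ("O4", "O5"): 81.83,
}

def d(a, b):
    """Distance between two objects, regardless of pair order."""
    return D.get((a, b)) or D.get((b, a))

def single_link(c1, c2):
    """Single link: minimum pairwise distance between two clusters."""
    return min(d(a, b) for a in c1 for b in c2)

clusters = [frozenset([l]) for l in labels]
merges = []
while len(clusters) > 1:
    # Find the closest pair of clusters and merge them.
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
    height = single_link(clusters[i], clusters[j])
    merged = clusters[i] | clusters[j]
    merges.append((sorted(merged), height))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

for members, height in merges:
    print(members, height)
```

The printed merge heights 13.22, 15.56, 19.11, 81.83 are the levels at which the dendrogram's four joins occur, matching Steps 1 to 4.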

*********

