Data Warehousing and Data Mining

Code No: 2321204 Set No. 1
III B.Tech II Semester Regular Examinations, April/May 2009
DATA WAREHOUSING AND DATA MINING
(Information Technology)
Time: 3 hours Max Marks: 80
Answer any FIVE Questions
All Questions carry equal marks
⋆⋆⋆⋆⋆
1. (a) Explain the major issues in data mining.
(b) Explain the three-tier data warehousing architecture. [8+8]
2. (a) Briefly discuss the data smoothing techniques.
(b) Explain concept hierarchy generation for categorical data. [8+8]
3. (a) Explain the syntax for task-relevant data specification.
(b) Explain the syntax for specifying the kind of knowledge to be mined. [8+8]
4. (a) Write and explain the basic algorithm for attribute-oriented induction.
(b) What are the differences between concept description in large databases and
OLAP? [8+8]
5. Explain the Apriori algorithm with an example. [16]
6. (a) Why is tree pruning useful in decision tree induction? What is a drawback
of using a separate set of samples to evaluate pruning? Explain.
(b) How are the rough set and fuzzy set approaches useful for classification?
[8+8]
7. (a) What are the categories of major clustering methods? Explain.
(b) Explain outlier analysis. [6+10]
8. An e-mail database is a database that stores a large number of electronic mail
messages. It can be viewed as a semistructured database consisting mainly of text
data. Discuss the following.
(a) How can such an e-mail database be structured so as to facilitate
multidimensional search, such as by sender, by receiver, by subject, by time, and
so on?
(b) What can be mined from such an e-mail database?
(c) Suppose you have roughly classified a set of your previous e-mail messages as
junk, unimportant, normal, or important. Describe how a data mining system
may take this as the training set to automatically classify new e-mail messages
or unclassified ones. [5+5+6]
⋆⋆⋆⋆⋆
Code No: 2321204 Set No. 2
⋆⋆⋆⋆⋆
1. (a) Discuss data mining on data warehousing.
(b) Discuss the various types of warehouse servers for OLAP processing. [8+8]
2. Explain various data reduction techniques. [16]
3. (a) Explain the syntax for concept hierarchy specification.
(b) Explain the syntax for specifying the kind of knowledge to be mined. [8+8]
4. (a) What is concept description? Explain.
(b) What are the differences between concept description in large databases and
OLAP? [8+8]
5. (a) Which algorithm is an influential algorithm for mining frequent itemsets for
Boolean association rules? Explain.
(b) Discuss association mining using correlation rules. [8+8]
6. The following table consists of training data from an employee database. The data
have been generalized. For a given row entry, count represents the number of data
tuples having the values for department, status, age, and salary given in that row:

Department   Status   Age       Salary      Count
Sales        Senior   31...35   46K...50K   30
Sales        Junior   26...30   26K...30K   40
Sales        Junior   31...35   31K...35K   40
Systems      Junior   21...25   46K...50K   20
Systems      Senior   31...35   66K...70K   5
Systems      Junior   26...30   46K...50K   3
Marketing    Senior   36...40   46K...50K   10
Marketing    Junior   31...35   41K...45K   4
Secretary    Senior   46...50   36K...40K   4
Secretary    Junior   26...30   26K...30K   6

Let salary be the class label attribute.
Given a data sample with the values “systems”, “junior”, and “26...30” for the
attributes department, status, and age, respectively, what would a naive Bayesian
classification of the salary for the sample be? [16]
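For reference, the computation asked for in Question 6 can be sketched in Python as below. This is a minimal illustration, not a model answer: it assumes plain relative-frequency estimates weighted by the count column, with no Laplacian correction, and the names `ROWS` and `naive_bayes_scores` are hypothetical.

```python
from collections import defaultdict

# Generalized training tuples from the table in Question 6:
# (department, status, age, salary, count)
ROWS = [
    ("sales", "senior", "31...35", "46K...50K", 30),
    ("sales", "junior", "26...30", "26K...30K", 40),
    ("sales", "junior", "31...35", "31K...35K", 40),
    ("systems", "junior", "21...25", "46K...50K", 20),
    ("systems", "senior", "31...35", "66K...70K", 5),
    ("systems", "junior", "26...30", "46K...50K", 3),
    ("marketing", "senior", "36...40", "46K...50K", 10),
    ("marketing", "junior", "31...35", "41K...45K", 4),
    ("secretary", "senior", "46...50", "36K...40K", 4),
    ("secretary", "junior", "26...30", "26K...30K", 6),
]


def naive_bayes_scores(rows, sample):
    """Return P(class) * product of P(attribute value | class) per salary class."""
    class_totals = defaultdict(int)
    for *_, salary, count in rows:
        class_totals[salary] += count
    grand_total = sum(class_totals.values())

    scores = {}
    for salary, class_total in class_totals.items():
        score = class_total / grand_total  # prior P(salary)
        for position, value in enumerate(sample):
            # Count-weighted number of class tuples that share this attribute value.
            matching = sum(
                r[4] for r in rows if r[3] == salary and r[position] == value
            )
            score *= matching / class_total  # conditional P(value | salary)
        scores[salary] = score
    return scores


sample = ("systems", "junior", "26...30")
scores = naive_bayes_scores(ROWS, sample)
prediction = max(scores, key=scores.get)
print(prediction, scores[prediction])
```

Under these assumptions only the class 46K...50K receives a nonzero score (about 0.0025), because every other salary class has a zero count for at least one of the three attribute values. A Laplacian correction would avoid the zero probabilities but is not applied here.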
7. (a) Categorize the major clustering methods.
(b) Explain the OPTICS algorithm.
(c) What is an outlier? Why is outlier mining important? Briefly discuss
statistical-based outlier detection. [4+4+8]
8. An e-mail database is a database that stores a large number of electronic mail
messages. It can be viewed as a semistructured database consisting mainly of text
data. Discuss the following.
(a) How can such an e-mail database be structured so as to facilitate
multidimensional search, such as by sender, by receiver, by subject, by time, and
so on?
(b) What can be mined from such an e-mail database?
(c) Suppose you have roughly classified a set of your previous e-mail messages as
junk, unimportant, normal, or important. Describe how a data mining system
may take this as the training set to automatically classify new e-mail messages
or unclassified ones. [5+5+6]
⋆⋆⋆⋆⋆
Code No: 2321204 Set No. 3
⋆⋆⋆⋆⋆
1. (a) Explain advanced database systems and advanced database applications.
(b) Draw the integrated OLAM and OLAP architecture. Explain. [8+8]
2. Suppose that the data for analysis include the attribute age. The age values for
the data tuples are (in increasing order):
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35,
36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of
3. Illustrate your steps. Comment on the effect of the technique for the given
data.
(b) How might you determine outliers in the data?
(c) What other methods are there for data smoothing? [16]
3. (a) Briefly discuss task-relevant data specification.
(b) Explain the syntax for task-relevant data specification. [8+8]
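For reference, the smoothing by bin means asked for in Question 2(a) can be sketched in Python as below. This is a minimal illustration that assumes equal-depth (equal-frequency) partitioning of the sorted values into consecutive bins of three; the function name `smooth_by_bin_means` is hypothetical.

```python
from statistics import mean

# Age values from Question 2, already in increasing order.
AGES = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def smooth_by_bin_means(sorted_values, depth):
    """Partition sorted values into bins of `depth` items and replace
    every value in a bin by that bin's mean."""
    smoothed = []
    for start in range(0, len(sorted_values), depth):
        bin_values = sorted_values[start:start + depth]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


smoothed = smooth_by_bin_means(AGES, depth=3)

# Show each bin next to the mean that replaces its members.
for start in range(0, len(AGES), 3):
    print(AGES[start:start + 3], "->", round(smoothed[start], 2))
```

With 27 values and a depth of 3 this yields nine bins. The last bin (46, 52, 70) is replaced by 56, which shows how a single extreme value pulls a bin mean upward.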
4. Suppose that the data for analysis include the attribute age. The age values for
the data tuples are (in increasing order):
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35,
36, 40, 45, 46, 52, 70.
(a) What is the mean of the data?
(b) What is the median?
(c) What is the mode of the data? Comment on the data’s modality.
(d) What is the midrange of the data?
(e) Can you find (roughly) the first quartile (Q1) and third quartile (Q3) of the
data?
(f) Give the five-number summary of the data.
(g) Show a box plot of the data.
(h) How is the quantile-quantile plot different from a quantile plot? [16]
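For reference, the descriptive statistics in parts (a) to (f) of Question 4 can be sketched with Python's standard `statistics` module as below. This is a minimal illustration; quartile conventions vary between textbooks and tools, and the sketch assumes the module's default "exclusive" method, so the Q1 and Q3 values should be read as rough estimates.

```python
from statistics import mean, median, multimode, quantiles

# Age values from Question 4, already in increasing order.
AGES = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

avg = mean(AGES)                          # (a) arithmetic mean
med = median(AGES)                        # (b) middle value of the sorted data
modes = multimode(AGES)                   # (c) all most-frequent values
midrange = (min(AGES) + max(AGES)) / 2    # (d) average of the extremes
q1, _, q3 = quantiles(AGES, n=4)          # (e) quartiles, "exclusive" method
five_number = (min(AGES), q1, med, q3, max(AGES))  # (f) min, Q1, median, Q3, max

print(f"mean={avg:.2f} median={med} modes={modes} midrange={midrange}")
print("five-number summary:", five_number)
```

Under these assumptions the data have two modes (25 and 35), so they are bimodal, and the five-number summary is 13, 20, 25, 35, 70.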
5. (a) Which algorithm is an influential algorithm for mining frequent itemsets for
Boolean association rules? Explain.
(b) Discuss association mining using correlation rules. [8+8]
6. The following table consists of training data from an employee database. The data
have been generalized. For a given row entry, count represents the number of data
tuples having the values for department, status, age, and salary given in that row:

Department   Status   Age       Salary      Count
Sales        Senior   31...35   46K...50K   30
Sales        Junior   26...30   26K...30K   40
Sales        Junior   31...35   31K...35K   40
Systems      Junior   21...25   46K...50K   20
Systems      Senior   31...35   66K...70K   5
Systems      Junior   26...30   46K...50K   3
Marketing    Senior   36...40   46K...50K   10
Marketing    Junior   31...35   41K...45K   4
Secretary    Senior   46...50   36K...40K   4
Secretary    Junior   26...30   26K...30K   6

Let salary be the class label attribute.
Given a data sample with the values “systems”, “junior”, and “26...30” for the
attributes department, status, and age, respectively, what would a naive Bayesian
classification of the salary for the sample be? [16]
7. (a) What are the types of data in cluster analysis? Explain.
(b) Explain partitioning methods in detail. [8+8]
8. (a) What kinds of association can be mined in multimedia data? What are the
differences between mining association rules in multimedia databases versus
transactional databases?
(b) How does latent semantic indexing reduce the size of the term frequency
matrix? Explain.
(c) Describe the construction of a multilayered web information base. [3+3+6+4]
⋆⋆⋆⋆⋆
Code No: 2321204 Set No. 4
⋆⋆⋆⋆⋆
1. (a) Draw and explain the architecture for on-line analytical mining.
(b) Briefly discuss the data warehouse applications. [8+8]
2. Briefly discuss the discretization and concept hierarchy techniques. [16]
3. The four major types of concept hierarchies are: schema hierarchies, set-grouping
hierarchies, operation-derived hierarchies, and rule-based hierarchies.
(a) Briefly define each type of hierarchy.
(b) For each hierarchy type, provide an example. [16]
4. Write short notes on the following in detail:
(a) Attribute-oriented induction.
(b) Efficient implementation of attribute-oriented induction. [8+8]
5. (a) Explain the basic concept of association rule mining and give a road map of it.
(b) Briefly explain constraint-based association mining. [8+8]
6. (a) Write an algorithm for k-nearest neighbor classification given k and n, the
number of attributes describing each sample.
(b) What is linear regression? Give an example of linear regression using the
method of least squares. [8+8]
7. (a) Given the following measurements for the variable age:
16, 25, 28, 46, 29, 44, 38, 37, 54, 27
Standardize the variable by the following:
i. Compute the mean absolute deviation of age.
ii. Compute the z-score for the first four measurements.
(b) Explain the Clustering Using Representatives (CURE) algorithm with an example.
(c) Write an algorithm for DBSCAN and give an example of DBSCAN. [4+4+4+4]
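For reference, the standardization in Question 7(a) can be sketched in Python as below. This is a minimal illustration that assumes the z-score is scaled by the mean absolute deviation from part (i), as the ordering of the two parts suggests; scaling by the standard deviation instead would give different values.

```python
# Age measurements from Question 7(a).
AGES = [16, 25, 28, 46, 29, 44, 38, 37, 54, 27]

# Mean of the variable.
mean_age = sum(AGES) / len(AGES)

# (i) Mean absolute deviation: average distance of each value from the mean.
mad = sum(abs(x - mean_age) for x in AGES) / len(AGES)

# (ii) Standardized measurement (z-score) for the first four values,
# using the mean absolute deviation as the scale (an assumption, see above).
z_scores = [(x - mean_age) / mad for x in AGES[:4]]

print(f"mean={mean_age:.1f} mean absolute deviation={mad:.1f}")
print([round(z, 3) for z in z_scores])
```

Under this assumption the mean is 34.4 and the mean absolute deviation is 9.4, so the second measurement (25) lies exactly one deviation below the mean.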
8. (a) What are the different approaches for similarity-based retrieval in image
databases?
(b) Define similarity search. Explain similarity search in time-series analysis.
(c) Write a note on mining the World Wide Web. [4+6+6]
⋆⋆⋆⋆⋆