PART – A (2 marks)
18. What are the business applications of reporting and query tools?
19. State why, for the integration of multiple heterogeneous information sources, many
companies in industry prefer the update-driven approach (which constructs and uses data
warehouses), rather than the query-driven approach (which applies wrappers and
integrators). Describe situations where the query-driven approach is preferable over the
update-driven approach.
30. When can we implement the top-down approach? Mention its advantages.
31. Write a brief outline of the bottom-up approach.
32. Consider the following multifeature cube query: Grouping by all subsets of {item, region, month},
find the minimum shelf life in 2014 for each group, and the fraction of the total sales due to tuples
whose price is less than $100, and whose shelf life is between 1.25 and 1.5 of the minimum shelf
life. Is this a distributive multifeature cube? Comment on it.
67. What is a star schema? Explain with an example. Or: Define star schema.
(Nov/Dec 2014) (May/Jun 2012) (Nov/Dec 2016)
68. Give an overview of SYBASE IQ.
69. What is bitmapped indexing? (Apr/May 2011)
70. How could we summarize the concept of Data cardinality?
71. What is Data Transformation? Give example. (Apr/May 2011)
72. Write short notes on Data replication tools.
73. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient,
and the two measures count and charge, where charge is the fee that a doctor charges a
patient for a visit. Write an SQL query assuming the data is stored in a relational database
with the schema fee (day, month, year, doctor, hospital, patient, count, charge).
74. List out the features of Metacenter?
75. List out the benefits of the Integrity tool.
Metadata
76. What is metadata dictionary?
77. Define Metadata with an example. (Apr/May 2015) (Nov/Dec 2014)
78. List the contents of the metadata repository. (May/Jun 2016)
79. Draw the framework of Metadata Interchange Framework.
80. List the components of Metadata interchange frameworks.
(b) Starting with the base cuboid [student; course; semester; instructor], what specific OLAP
operations (e.g., roll-up from semester to year) should one perform in order to list the
average grade of CS courses for each Big University student.
(c) If each dimension has five levels (including all), such as “student < major < status <
university < all”, how many cuboids will this cube contain (including the base and apex
cuboids)? (13)
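For part (c), a common way to count cuboids in a cube with concept hierarchies is to multiply (L_i + 1) over all dimensions, where L_i is the number of levels of dimension i excluding the virtual top level "all". The following is a minimal sketch, assuming the four dimensions of the base cuboid above and reading "five levels (including all)" as L_i = 4; the function name `total_cuboids` is illustrative only.

```python
from math import prod


def total_cuboids(levels_excluding_all):
    """Count cuboids as the product of (L_i + 1) over all dimensions."""
    return prod(levels + 1 for levels in levels_excluding_all)


# Four dimensions (student, course, semester, instructor). Each has five
# levels including "all", i.e. four levels excluding it (assumed reading).
print(total_cuboids([4, 4, 4, 4]))  # 625
```

The count includes the base cuboid (every dimension at its lowest level) and the apex cuboid (every dimension at "all").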
11. Design a data warehouse for a regional weather bureau. The weather bureau has about 1,000
probes, which are scattered throughout various land and ocean locations in the region to
collect basic weather data, including air pressure, temperature, and precipitation at each hour.
All data are sent to the central station, which has collected such data for over 10 years. Your
design should facilitate efficient querying and on-line analytical processing, and derive
general weather patterns in multidimensional space. (13)
12. Suppose that a data warehouse consists of the four dimensions, date, spectator, location, and
game, and the two measures, count and charge, where charge is the fare that a spectator pays
when watching a game on a given date. Spectators may be students, adults, or seniors, with
each category having its own charge rate.
(a) Draw a star schema diagram for the data warehouse.
(b) Starting with the base cuboid [date; spectator; location; game], what specific OLAP
operations should one perform in order to list the total charge paid by student spectators
at GM Place in 2014?
(c) Bitmap indexing is useful in data warehousing. Taking this cube as an example,
briefly discuss advantages and problems of using a bitmap index structure. (13)
1. Describe the differences between the following approaches for the integration of a data
mining system with a database or data warehouse system: no coupling, loose coupling,
semitight coupling, and tight coupling. State which approach you think is the most
popular, and why. (15)
2. Briefly describe the following advanced database systems and applications: object-
relational databases, spatial databases, text databases, multimedia databases, the World
Wide Web. (15)
3. State why for the integration of multiple heterogeneous information sources many
companies in industry prefer the update-driven approach (which constructs and uses
data warehouses), rather than the query-driven approach (which applies wrappers and
integrators). Describe situations where the query-driven approach is preferable over the
update-driven approach. (15)
65. Point-of-sales data and sales made via call-center or the web are stored in different
locations and formats. It would be a time-consuming process for an executive to obtain OLAP
reports such as – What are the most popular products purchased by customers between
the ages 15 to 30?
66. Mention the vendor approaches for deploying tools in web.
67. What was the main idea of HTML Publishing and helper application approach in web?
68. Write about Server-centric components, Java and ActiveX applications approaches in
web.
OLAP Tools and the Internet:
69. How would you summarize Arbor Essbase Web?
70. Define OLAP tool. (Apr/May 2010)
71. Comment on OLAP Tools on Internet. (Nov/Dec 2016)
72. What is Virtual warehouse? (Nov/Dec 2014)
73. Define Micro Strategy DSS Web?
74. What is Brio Technology?
75. Mention the advantages and disadvantages of MOLAP.
76. List out the advantages and disadvantages of ROLAP.
77. What are the reasons to build a query and reporting environment?
78. Define PowerBuilder.
79. What do you think about application painter and Window painter?
80. What is meant by DataWindow painter?
81. Write about Database painter and structure painter.
82. Write a note on function painter and user object painter.
10. (i) Design a multi-dimensional data model for a hospital data warehouse consisting of the three
dimensions time, doctor, and patient and the two measures count and charge, where
charge is the fee that a doctor charges a patient for a visit. (3+3)
(1) Enumerate three classes of schema that are popularly used for modeling
data warehouses.
(2) Draw a schema diagram for the above data warehouse using all of the
schema classes listed in (1).
(ii) How to reduce the size of the fact table? Explain with an example.
(Nov/Dec 2014) (7)
11. Regarding the computation of measures in a data cube: (13)
(a) Enumerate three categories of measures, based on the kind of aggregate functions
used in computing a data cube.
(b) For a data cube with the three dimensions time, location, and item, which category
does the function variance belong to? Describe how to compute it if the cube is
partitioned into many chunks.
Hint: The formula for computing variance is $\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$, where $\bar{x}$ is the average
of the N values $x_i$.
(c) Suppose the function is "top 10 sales." Discuss how to efficiently compute this
measure in a data cube.
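For part (b), variance can be treated as an algebraic measure: it is derivable from a fixed number of distributive aggregates (count, sum, and sum of squares), so each chunk can be summarized independently and the summaries merged. The sketch below is one possible illustration, not a prescribed answer; the class name `ChunkStats` and the sample chunks are assumptions made for the example, and it computes the population variance as sum(x^2)/N minus the squared mean.

```python
from dataclasses import dataclass


@dataclass
class ChunkStats:
    """Per-chunk summary: count, sum, and sum of squares."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    @classmethod
    def from_values(cls, values):
        values = list(values)
        return cls(len(values),
                   float(sum(values)),
                   float(sum(v * v for v in values)))

    def merge(self, other):
        # Each component is distributive, so merging is plain addition.
        return ChunkStats(self.count + other.count,
                          self.total + other.total,
                          self.total_sq + other.total_sq)

    def variance(self):
        # Population variance: E[x^2] - (E[x])^2
        mean = self.total / self.count
        return self.total_sq / self.count - mean * mean


# Hypothetical chunks of one cell's values, processed independently.
chunks = [[2, 4], [4, 4, 5], [5, 7, 9]]
combined = ChunkStats()
for chunk in chunks:
    combined = combined.merge(ChunkStats.from_values(chunk))

print(combined.variance())  # 4.0
```

The sum-of-squares form can lose precision when values are large relative to their spread; a merge based on per-chunk means and squared deviations is a more numerically stable alternative.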
12. With relevant examples discuss multidimensional online analytical processing and
multirelational online analytical processing. (13)
1. Design a data warehouse for a regional weather bureau. The weather bureau has about
1,000 probes, which are scattered throughout various land and ocean locations in the
region to collect basic weather data, including air pressure, temperature, and precipitation
at each hour. All data are sent to the central station, which has collected such data for
over 10 years. Your design should facilitate efficient querying and on-line analytical
processing, and derive general weather patterns in multidimensional space. (15)
2. What are hypercubes? How do they apply in an OLAP system? (15)
3. A popular data warehouse implementation is to construct a multidimensional database,
known as a data cube. Unfortunately, this may often generate a huge, yet very sparse
multidimensional matrix. Present an example illustrating such a huge and sparse data
cube. (15)
4. Suppose that a data warehouse contains 20 dimensions, each with about five levels of
granularity.
(a) Users are mainly interested in four particular dimensions, each having three
frequently accessed levels for rolling up and drilling down. How would you design a data
cube structure to efficiently support this preference?
(b) At times, a user may want to drill through the cube, down to the raw data for one or
two particular dimensions. How would you support this feature? (15)
5. For class characterization, what are the major differences between a data cube based
implementation and a relational implementation such as attribute-oriented induction?
Discuss which method is most efficient and under what conditions this is so. (15)
PART- A (2 MARKS)
Introduction:
1. What are the evolutionary paths in the development of database systems?
2. What motivated data mining? Why is it so important?
Data Mining:
3. Define a data mining system and draw its architecture.
4. Is data mining a simple transformation of technology developed from databases, statistics,
and machine learning?
5. Present an example where data mining is crucial to the success of a business. What data
mining functions does this business need? Can they be performed alternatively by data
query processing or simple statistical analysis?
6. What is KDD? What are the steps involved in KDD?
7. List some of the data mining techniques?
29. Illustrate the concept of Web usage mining or web log mining?
30. Use the two methods below to normalize the following group of data: 200, 300, 400, 600,
1000. (a) min-max normalization by setting min = 0 and max = 1; (b) z-score normalization.
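The two normalizations named in this question can be sketched as follows. This is an illustrative sketch only: the function names are made up for the example, and the z-score uses the population standard deviation (`statistics.pstdev`), which is an assumption; using the sample standard deviation would give different values.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [min, max] linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]


def z_score(values):
    """Z-score normalization using the population standard deviation (assumed)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


data = [200, 300, 400, 600, 1000]
print(min_max(data))                         # [0.0, 0.125, 0.25, 0.5, 1.0]
print([round(z, 3) for z in z_score(data)])  # [-1.061, -0.707, -0.354, 0.354, 1.768]
```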
31. Summarize the web search services of data mining?
32. List out the primitives that satisfy a data mining task.
Data Mining Functionalities:
33. List out the data mining functionalities. (April/May 2015)
34. How would you summarize class/concept descriptions and give an example?
35. How to derive the concept / class descriptions?
36. Define Data characterization and data discrimination?
37. Illustrate the concept of frequent patterns and structured pattern?
38. Use a flow chart to summarize the stepwise forward selection procedures for attribute
subset selection.
39. Write in your own words about Multidimensional association rules?
40. Use a flow chart to summarize the stepwise backward elimination procedures for
attribute subset selection.
41. Write Short notes on Classification and Prediction.
42. Define the concept of Decision tree and Neural Network.
43. Illustrate what you think about regression analysis and relevance analysis.
44. Use a flow chart to summarize a combination of forward selection and backward
elimination procedures for attribute subset selection.
45. Define the concept of cluster analysis?
46. Write two methods that can be used to detect outliers and discuss which one is more
reliable?
Interestingness of Patterns
47. Define the term interestingness of patterns. (Nov/Dec 2014)
48. Explain what is meant by subjective interestingness measures?
49. How are the buckets determined and the attribute values partitioned?
50. Can a data mining system generate only interesting patterns? Can it generate all of
the interesting patterns? Discuss.
Classification of Data Mining Systems
51. List out the categories of data mining systems?
52. Define a pattern. (Nov/Dec 2015)
53. Write in your own words about Meta learning?
54. What are the classifications of Data mining systems?
55. What is descriptive and predictive data mining?
56. What is data mining query?
Data mining task primitives:
57. List the primitives for specification of a data mining task. (Apr/May 2017)
58. What do you think about DMQL?
1. (i) What is data preprocessing? Explain the various data reduction techniques.
Or
Why do we need to preprocess data? What are the different forms of preprocessing?
(Apr/May 2017) (9)
(ii) Explain the basic methods for data cleaning. (May/June 2016) (4)
9. Describe in detail data mining functionalities and the different kinds of patterns that can
be mined. (Apr/May17) (13)
10. Outline the major research challenges of data mining in one specific application domain,
such as stream/sensor data analysis, spatiotemporal data analysis, or bioinformatics.
(13)
11. Recent applications pay special attention to spatiotemporal data streams. A
Spatiotemporal data stream contains spatial information that changes over time, and is
in the form of stream data, i.e., the data flow in-and-out like possibly infinite streams.
(a) Present three application examples of spatiotemporal data streams.
(b) Discuss what kind of interesting knowledge can be mined from such data streams,
with limited time and resources.
(c) Identify and discuss the major challenges in spatiotemporal data mining.
(d) Using one application example, sketch a method to mine one kind of knowledge from
such stream data efficiently. (13)
6. Present an example where data mining is crucial to the success of a business. What data
mining functions does this business need? Can they be performed alternatively by data
query processing or simple statistical analysis? (15)
7. What is data mining? In your answer, address the following:
(a) Is it another hype?
(b) Is it a simple transformation of technology developed from databases, statistics, and
machine learning?
(c) Explain how the evolution of database technology led to data mining.
(d) Describe the steps involved in data mining when viewed as a process of knowledge
discovery. (15)
8. Data quality can be assessed in terms of accuracy, completeness, and consistency.
Propose two other dimensions of data quality. (DATA PREPROCESSING) (15)
9. Give three additional commonly used statistical measures (i.e., not illustrated in this
chapter) for the characterization of data dispersion, and discuss how they can be
computed efficiently in large databases. (15)
10. Propose and outline a level-shared mining approach to mining multilevel association
rules in which each item is encoded by its level position, and an initial scan of the
database collects the count for each item at each concept level, identifying frequent and
subfrequent items. Comment on the processing cost of mining multilevel associations
with this method in comparison to mining single-level associations. (15)
FREQUENT PATTERNS:
1. What are frequent patterns? (Nov 2007)
2. Prove that all nonempty subsets of a frequent itemset must also be frequent.
3. What is market basket analysis? (May 2009)
4. Write the frequency notation of an itemset.
5. When is an itemset X said to be closed?
6. Mention the criteria for classifying frequent pattern mining.
7. Explain frequent pattern mining based on the completeness of patterns.
8. How would you mine frequent patterns based on the levels of abstraction?
9. Explain frequent pattern mining based on data dimensions?
10. Give an example of frequent pattern mining based on the kinds of rules?
11. Suppose that frequent itemsets are saved for a large transaction database, DB. Discuss how
to efficiently mine the (global) association rules under the same minimum support threshold
if a set of new transactions, denoted as ∆DB, is (incrementally) added in.
ASSOCIATION RULES:
12. Prove that the support of any nonempty subset s′ of itemset s must be at least as great as the
support of s.
13. What is the main idea of structural pattern matching?
14. What is rule-based classification? (Nov 2011)
15. How to mine association rules from large databases? (Nov 2007)
16. List the interesting measures for association rules.
(Apr/May 2008,2009) (Nov 2012)
APRIORI ALGORITHM:
17. State the Apriori property.
18. What is an antimonotone?
19. Give the properties of Apriori algorithm.
20. Outline the prune step based on the Apriori property.
21. List the methods to improve Apriori's efficiency. (Nov 2016)
22. What is conditional probability?
23. Give a note on hash based techniques.
24. What is Partitioning? Give example.
25. Illustrate the main idea of local frequent itemset?
26. Write the definition of Sampling?
27. What is dynamic item set counting?
FP TREE:
28. What is frequent-pattern growth? (May 2010)
29. Generalize the special features of frequent pattern tree and closed frequent item sets?
30. The price of each item in a store is nonnegative. For each of the following cases, identify the
kinds of constraint they represent and briefly discuss how to mine such association rules
efficiently.
(a) Containing at least one Nintendo game
(b) Containing items the sum of whose prices is less than $150
66. How do you choose best split while constructing a decision tree? (May 2014)
67. Elucidate the two phases involved in decision tree induction. (Nov 2016)
68. It is important to calculate the worst-case computational complexity of the decision tree
algorithm. Given data set D, the number of attributes n, and the number of training tuples
|D|, show that the computational cost of growing a tree is at most n × |D| × log(|D|).
69. List the conditions for terminating recursive partitioning.
SELECTION MEASURES:
70. Give the formula for gain ratio and gini index.
71. What is the use of pruning in decision tree construction? (May 2013 / May2016)
72. What is the drawback of using a separate set of tuples to evaluate pruning?
73. Differentiate between prepruning and postpruning.
74. What is a support vector machine? (May 2011)
75. Develop a scalable SVM algorithm for efficient SVM classification in large datasets. (S).
BAYESIAN CLASSIFICATION:
76. What is naïve Bayesian classification? How is it different from Bayesian classification?
(May 2012)
77. State Bayes' theorem. (May 2016)
78. What is lazy learner? Give an example. (Nov 2014) (Apr 2017)
79. How do you evaluate accuracy of a classifier? (Apr 2017)
80. Design an efficient method that performs effective naive Bayesian classification over an
infinite data stream.
81. Compare the advantages and disadvantages of eager classification (e.g., decision tree,
Bayesian, neural network) versus lazy classification (e.g., k-nearest neighbor, case-based
reasoning).
82. Write about BOAT.
83. Define pessimistic pruning.
PART-B (13 Marks)
1. a. Distinguish classification and prediction. State the issues regarding classification and
prediction. (May 2016) (4)
b. Give the algorithm for Decision Tree Induction and explain with an example.
(Nov 2012)(Nov 2011)(May 2012)(May2016) (9)
2. a. Explain about classification by Backpropagation in detail. (7)
b. Discuss in detail constraint-based association mining.
(Apr 2017) (May 2012) (6)
3. a. Write and explain algorithm for mining frequent itemsets without candidate
generation. (May 2014) (7)
5 a, c
6 b, c
7 a, c
8 a, b, c, e
9 a, b, c
4. Find all frequent item sets for the given training set using apriori and FP growth
respectively. Compare the efficiency of the two mining processes. (Nov 2016) (13)
TID Items_bought
T100 {M,O,N,K,E,Y}
T200 {D,O,N,K,E,Y}
T300 {M,A,K,E}
T400 {M,U,C,K,Y}
T500 {C,O,O,K,I,E}
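The Apriori part of this question can be explored with a small level-wise sketch such as the one below. The question as printed here does not state a minimum support, so the threshold of 3 transactions (60%) is purely an assumption for illustration; the function name `apriori` and the encoding of each transaction as a string of item letters are also choices made for this example. Repeated items in a transaction (the two O's in T500) are counted once.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count the support of every candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))}
    return frequent


# Transactions T100..T500 from the table above, one letter per item.
db = ["MONKEY", "DONKEY", "MAKE", "MUCKY", "COOKIE"]
result = apriori(db, min_count=3)  # assumed threshold: 3 of 5 transactions

for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

Under this assumed threshold the only frequent 3-itemset is {E, K, O} with support count 3. An FP-growth run should produce the same itemsets while avoiding explicit candidate generation, which is the basis for the efficiency comparison the question asks for.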
5. Apply the Apriori algorithm for discovering frequent itemsets to the following dataset.
(Nov 2011, May 2013, May2015, Nov/Dec15) (13)
6. Discuss the Apriori algorithm for mining frequent itemset with an example in detail.
(Nov 2014, May2011, May 2010, May 2012) (May 2016) (13)
Or
11. a. What is classification? With an example explain how support vector machines can be
used for classification. (Nov 2011) (9)
1. Give a short example to show that items in a strong association rule may actually be
negatively correlated. (15)
2. Sequential patterns can be mined in methods similar to the mining of association rules.
Design an efficient algorithm to mine multilevel sequential patterns from a transaction
database. An example of such a pattern is the following: “A customer who buys a PC will
buy Microsoft software within three months”, on which one may drill down to find a
more refined version of the pattern, such as “A customer who buys a Pentium PC will
buy Microsoft Office within three months”. (15)
3. In many applications, new data sets are incrementally added to the existing large data
sets. Thus an important consideration for computing descriptive data summary is whether
a measure can be computed efficiently in incremental manner. Use count, standard
deviation, and median as examples to show that a distributive or algebraic measure
facilitates efficient incremental computation, whereas a holistic measure does not. (15)
4. Suppose that you are in the market to purchase a data mining system.
(a) Regarding the coupling of a data mining system with a database and/or data
warehouse system, what are the differences between no coupling, loose coupling,
semi-tight coupling, and tight coupling?
(b) What is the difference between row scalability and column scalability?
(c) Which feature(s) from those listed above would you look for when selecting a
data mining system? (15)
5. Write pseudocode for the automatic generation of a concept hierarchy for numeric data
based on the equal frequency partitioning rule. (15)
OUTLIER ANALYSIS
76. What is an outlier? Give an example for the library management system.
Nov'15, May'13, Nov'12, May'11, Nov'11, Nov'16
77. How may outliers be detected by clustering? May/Jun'15
78. Classify Outlier detection approaches. Mention the applications of outlier detections.
Nov/Dec'11
DATA MINING APPLICATIONS
79. List some applications of Data Mining. May/Jun'11 '17, Nov/Dec'16
80. Point out the role of Data mining in financial data analysis.
81. How is Data mining useful for the retail industry?
82. Explain the application of Data mining in telecommunication industry.
83. Why do we need Data mining in biological data analysis?
PART – B(13 Marks)
1. Outliers are often discarded as noise. However, one person’s garbage could be another’s
treasure. For example, exceptions in credit card transactions can help us detect the
fraudulent use of credit cards. Taking fraudulence detection as an example, propose two
methods that can be used to detect outliers and discuss which one is more reliable. (15)
2. What are the differences between visual data mining and data visualization? Data
visualization may suffer from the data abundance problem. For example, it is not easy to
visually discover interesting properties of network connections if a social network is
huge, with complex and dense connections. Propose a data mining method that may help
people see through the network topology to the interesting features of the social network.
(15)
3. What is a collaborative recommender system? In what ways does it differ from a
customer- or product- based clustering system? How does it differ from a typical
classification or predictive modeling system? Outline one method of collaborative
filtering. Discuss why it works and what its limitations are in practice. (15)
4. What are the major challenges faced in bringing data mining research to market?
Illustrate one data mining research issue that, in your view, may have a strong impact on
the market and on society. Discuss how to approach such a research issue. (15)
5. Give an example of how specific clustering methods may be integrated, for example,
where one clustering algorithm is used as a preprocessing step for another. In addition,
provide reasoning on why the integration of two methods may sometimes lead to
improved clustering quality and efficiency. (15)