Department of Computer Sc. & Engg.

B.I.T., Mesra
Sub.: CS 8031

BE_VII>DMDW.M04

Tutorial Sheet-I

Data Mining & Data Warehousing


Class BE(CS)-VIII Sem.

Module - 1
1. What is data mining? In your answer, address the following:
a) Is it another hype?
b) Is it a simple transformation of technology developed from databases,
statistics, and machine learning?
c) Explain how the evolution of database technology led to data mining.
d) Describe the steps involved in data mining when viewed as a process
of knowledge discovery.

2. Present an example where data mining is crucial to the success of a
business. What data mining functions does this business need? Can they
be performed alternatively by data query processing or simple
statistical analysis?

3. How is a data warehouse different from a database? How are they
similar?

4. Briefly describe the following advanced database systems and
applications: object-oriented databases, spatial databases, text
databases, multimedia databases, the World Wide Web.

5. Define each of the following data mining functionalities:
characterization, discrimination, association, classification,
prediction, clustering, and evolution analysis. Give examples of each
data mining functionality, using a real-life database that you are
familiar with.

6. What is the difference between discrimination and classification?
Between characterization and clustering? Between classification and
prediction? For each of these pairs of tasks, how are they similar?

7. Based on your observation, describe another possible kind of knowledge
that needs to be discovered by data mining methods but has not been
listed in this chapter. Does it require a mining methodology that is
quite different from those outlined in this chapter?

8. Describe three challenges to data mining regarding data mining
methodology and user interaction issues.

9. Describe two challenges to data mining regarding performance issues.

Module - 2
10. State why, for the integration of multiple heterogeneous information
sources, many companies in industry prefer the update-driven approach
(which constructs and uses data warehouses) rather than the
query-driven approach (which applies wrappers and integrators).
Describe situations where the query-driven approach is preferable over
the update-driven approach.
11. Briefly compare the following concepts. You may use an example to
explain your point(s).
a) Snowflake schema, fact constellation, starnet query model
b) Data cleaning, data transformation, refresh
c) Discovery-driven cube, multifeature cube, virtual warehouse

303166908.doc

12. Suppose that a data warehouse consists of the three dimensions time,
doctor, and patient, and the two measures count and charge, where
charge is the fee that a doctor charges a patient for a visit.
a) Enumerate three classes of schemas that are popularly used for
modeling data warehouses.
b) Draw a schema diagram for the above data warehouse using one of the
schema classes listed in (a).
c) Starting with the base cuboid [day, doctor, patient], what specific
OLAP operations should be performed in order to list the total
fee collected by each doctor in 2000?
d) To obtain the same list, write an SQL query assuming the data is
stored in a relational database with the schema fee (day, month,
year, doctor, hospital, patient, count, charge).
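
The slice and roll-up behind 12(c) and the aggregate query of 12(d) can be sketched in plain Python. This is only a minimal sketch under stated assumptions: the rows, doctor names, and the helper name total_fee_per_doctor are hypothetical and are not data from this sheet. It slices on year = 2000 and rolls the time and patient dimensions up to "all" by summing charge per doctor.

```python
from collections import defaultdict

# Hypothetical rows of the relation
# fee(day, month, year, doctor, hospital, patient, count, charge).
fee = [
    (1, 1, 2000, "Dr. A", "H1", "P1", 1, 50.0),
    (2, 1, 2000, "Dr. A", "H1", "P2", 1, 70.0),
    (5, 3, 2000, "Dr. B", "H2", "P1", 1, 40.0),
    (9, 6, 1999, "Dr. B", "H2", "P3", 1, 90.0),  # outside the year-2000 slice
]


def total_fee_per_doctor(rows, year):
    """Slice on `year`, then roll time and patient up to 'all' per doctor."""
    totals = defaultdict(float)
    for day, month, yr, doctor, hospital, patient, count, charge in rows:
        if yr == year:
            totals[doctor] += charge
    return dict(totals)


totals_2000 = total_fee_per_doctor(fee, 2000)
print(totals_2000)
```

In SQL terms this corresponds to filtering on year and grouping by doctor with a SUM over charge.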

13. Suppose that a data warehouse for Big-University consists of the
following four dimensions: student, course, semester, and instructor,
and two measures count and avg_grade. When at the lowest conceptual
level (e.g., for a given student, course, semester, and instructor
combination), the avg_grade measure stores the actual course grade of
the student. At higher conceptual levels, avg_grade stores the average
grade for the given combination.
a) Draw a snowflake schema diagram for the data warehouse.
b) Starting with the base cuboid [student, course, semester,
instructor], what specific OLAP operations (e.g., roll-up from
semester to year) should one perform in order to list the average
grade of CS courses for each Big-University student?
c) If each dimension has five levels (including all), such as student <
major < status < university < all, how many cuboids will this
cube contain (including the base and apex cuboids)?
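
For 13(c), a commonly used counting rule is that the number of cuboids is the product, over all dimensions, of the number of levels of that dimension with the top level "all" included. The sketch below assumes that rule and reads "five levels (including all)" as five per dimension; the function name is illustrative only.

```python
from math import prod


def count_cuboids(levels_including_all):
    """Number of cuboids = product over dimensions of (levels incl. 'all')."""
    return prod(levels_including_all)


# Four dimensions (student, course, semester, instructor), five levels each.
n_cuboids = count_cuboids([5, 5, 5, 5])
print(n_cuboids)  # 5 ** 4 = 625
```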

14. Regarding the computation of measures in a data cube:
a) Enumerate three categories of measures, based on the kind of
aggregate functions used in computing a data cube.
b) For a data cube with the three dimensions time, location, and
product, which category does the function variance belong to?
Describe how to compute it if the cube is partitioned into many
chunks.
c) Suppose the function is "top 10 sales". Discuss how to efficiently
compute this measure in a data cube.
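
One way to approach 14(b) is to note that variance depends only on three quantities that can each be accumulated chunk by chunk: the count, the sum, and the sum of squares. The sketch below assumes the population variance and uses hypothetical chunk contents; each chunk is summarized on its own and the summaries are then merged.

```python
def summarize(chunk):
    """Distributive summary of one chunk: (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def variance_from_chunks(chunks):
    """Merge per-chunk summaries and derive the population variance."""
    n = s = ss = 0
    for chunk in chunks:
        c_n, c_s, c_ss = summarize(chunk)
        n, s, ss = n + c_n, s + c_s, ss + c_ss
    mean = s / n
    return ss / n - mean * mean


# Hypothetical measure values stored in three chunks of the cube.
chunks = [[2.0, 4.0], [4.0, 4.0, 5.0], [5.0, 7.0, 9.0]]
var = variance_from_chunks(chunks)
print(var)
```

The sum-of-squares form is the simplest to merge but can lose precision when values are large relative to their spread.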

15. In data warehouse technology, a multidimensional view can be
implemented by a relational database technique (ROLAP), or by a
multidimensional database technique (MOLAP), or by a hybrid database
technique (HOLAP).
a) Briefly describe each implementation technique.
b) For each technique, explain how each of the following functions may
be implemented:
i) The generation of a data warehouse (including aggregation)
ii) Roll-up
iii) Drill-down
iv) Incremental updating
c) Which implementation techniques do you prefer, and why?

16. Suppose that a data warehouse contains 20 dimensions, each with about
five levels of granularity.
a) Users are mainly interested in four particular dimensions, each
having three frequently accessed levels for rolling up and drilling
down. How would you design a data cube structure to support this
preference efficiently?
b) At times, a user may want to drill through the cube, down to the raw
data for one or two particular dimensions. How would you support
this feature?

17. Consider the following multifeature cube query: Grouping by all
subsets of [item, region, month], find the minimum shelf life in 2000
for each group, and the fraction of the total sales due to tuples
whose price is less than $100, and whose shelf life is within 25% of
the minimum shelf life, and within 50% of the minimum shelf life.
a) Draw the multifeature cube graph for the query.
b) Express the query in extended SQL.
c) Is this a distributive multifeature cube? Why or why not?

18. What are the differences between the three main types of data
warehouse usage: information processing, analytical processing, and
data mining? Discuss the motivation behind OLAP mining (OLAM).

Module - 3
19. Data quality can be assessed in terms of accuracy, completeness, and
consistency. Propose two other dimensions of data quality.

20. In real-world data, tuples with missing values for some attributes are
a common occurrence. Describe various methods for handling this
problem.

21. Suppose that the data for analysis include the attribute age. The age
values for the data tuples are (in increasing order): 13, 15, 16, 19,
20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 53, 70.
a) Use smoothing by bin means to smooth the above data, using a bin
depth of 3. Illustrate your steps. Comment on the effect of this
technique for the given data.
b) How might you determine outliers in the data?
c) What other methods are there for data smoothing?
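
Smoothing by bin means, as asked for in 21(a), can be sketched as follows. The input list is a small illustrative one rather than the age data of the exercise, and the helper name is hypothetical: the sorted values are split into equal-depth bins of three, and every value is replaced by the mean of its bin.

```python
def smooth_by_bin_means(values, depth):
    """Equal-depth binning, then replace each value with its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        bin_values = ordered[start:start + depth]  # last bin may be shorter
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


# Illustrative data (assumed, not taken from the exercise).
sample = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(sample, depth=3)
print(smoothed)
```

When the number of values is not a multiple of the bin depth, this sketch simply leaves the last bin shorter; other conventions are possible.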

22. Discuss issues to consider during data integration.

23. Propose an algorithm, in pseudocode or in your favorite programming
language, for the following:
a) The automatic generation of a concept hierarchy for categorical data
based on the number of distinct values of attributes in the
given schema.
b) The automatic generation of a concept hierarchy for numeric data
based on the equiwidth partitioning rule.
c) The automatic generation of a concept hierarchy for numeric data
based on the equidepth partitioning rule.

24. List and describe the five primitives for specifying a data mining task.
25. Describe why concept hierarchies are useful in data mining.
Module - 4
26. The four major types of concept hierarchies are: schema hierarchies,
set-grouping hierarchies, operation-derived hierarchies, and rule-based
hierarchies.
a) Briefly define each type of hierarchy.
b) For each hierarchy type, provide an example that was not presented
in this chapter.

27. Suppose that the university course database for Big-University
includes the following attributes describing students: name, address,
status (e.g., undergraduate or graduate), major, and GPA (cumulative
grade point average).
a) Propose a concept hierarchy for the attributes address, status,
major, and GPA.
b) For each concept hierarchy that you have proposed above, what type
of concept hierarchy is it?
c) Define each hierarchy using DMQL syntax.
d) Write a DMQL query to find the characteristics of students who have
an excellent GPA.
e) Write a DMQL query to compare students majoring in science
with students majoring in arts.
f) Write a DMQL query to find associations involving course
instructors, student grades, and some other attribute of your
choice. Use a metarule to specify the format of associations you
would like to find. Specify minimum thresholds for the confidence
and support of the association rules reported.
g) Write a DMQL query to predict student grades in "Computing Science
101" based on student GPA and course instructor.

TUTORIAL SHEET II
MODULE - 4
28. Discuss the importance of establishing a standardized data mining query language. What are
some of the potential benefits and challenges involved in such a task? List a few of the recent
proposals in this area.
29. Describe the differences between the following architectures for the integration of a data
mining system with a database or data warehouse system: no coupling, loose coupling, semitight
coupling, and tight coupling. State which architecture you think is the most popular, and why.
MODULE-1
30. (a) What is a relational database?
(b) What is a transactional database?
(c) What is online analytical processing (OLAP)?
MODULE-2
31. A popular data warehouse implementation is to construct a multidimensional database, known as
a data cube. Unfortunately, this may often generate a huge, yet very sparse multidimensional matrix.
(a) Present an example illustrating such a huge and sparse data cube.
(b) Design an implementation method that can elegantly overcome this sparse matrix
problem. Note that you need to explain your data structures in detail and discuss the space
needed, as well as how to retrieve data from your structures.
(c) Modify your design in (b) to handle incremental data updates. Give the reasoning behind
your new design.
MODULE-3
32. Use a flowchart to summarize the following procedures for attribute subset selection:
(a) Stepwise forward selection.
(b) Stepwise backward elimination.
(c) A combination of backward elimination and forward selection.
Module-5
(33) For class characterization, what are the major differences between a data cube-based
implementation and a relational implementation such as attribute-oriented induction?
Discuss which method is most efficient and under what conditions this is so.
(34) Suppose that the following table is derived by attribute-oriented
induction.
Class        birth_place   count
Programmer   Canada        180
Programmer   Others        120
DBA          Canada        20
DBA          Others        80
(a) Transform the table into a crosstab showing the associated t-weights and
d-weights.
(b) Map the class Programmer into a quantitative descriptive rule, for example,
∀X, Programmer(X) ⇔ (birth_place(X) = "Canada" ∧ ...) [t: x%, d: y%] ∨ (...)
[t: w%, d: z%]
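
The t-weights and d-weights needed for the crosstab in 34(a) are simple ratios of the counts. A minimal sketch, assuming the counts read as Programmer/Canada 180, Programmer/Others 120, DBA/Canada 20, and DBA/Others 80: the t-weight divides a count by the total of its class, and the d-weight divides it by the total for the same birth place across both classes. The function name is illustrative.

```python
# Counts from the generalized relation: (class, birth_place) -> count.
counts = {
    ("Programmer", "Canada"): 180,
    ("Programmer", "Others"): 120,
    ("DBA", "Canada"): 20,
    ("DBA", "Others"): 80,
}


def weights(counts):
    """Return {(class, place): (t_weight, d_weight)} as fractions in [0, 1]."""
    class_total, place_total = {}, {}
    for (cls, place), n in counts.items():
        class_total[cls] = class_total.get(cls, 0) + n
        place_total[place] = place_total.get(place, 0) + n
    return {
        (cls, place): (n / class_total[cls], n / place_total[place])
        for (cls, place), n in counts.items()
    }


w = weights(counts)
for key, (t, d) in w.items():
    print(key, f"t = {t:.0%}, d = {d:.0%}")
```
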
(35) Suppose that the data for analysis include the attribute age. The age values
for the data tuples are
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52,
70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data's modality.
(c) What is the midrange of the data?
(d) Can you find the first quartile (Q1) and third quartile (Q3) of the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
(g) How is a quantile-quantile plot different from a quantile plot?
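
The descriptive measures in (35) can be checked with a few lines of Python. This is a sketch under stated assumptions: the age list is copied as printed in the exercise, and the quartiles use one particular convention (medians of the lower and upper halves), which is an assumption since several quartile definitions are in use and may give slightly different values.

```python
from statistics import mean, median, multimode

# Age values as printed in exercise (35).
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 30, 33, 33,
        35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def quartiles(sorted_values):
    """Q1 and Q3 as medians of the lower and upper halves (one convention)."""
    half = len(sorted_values) // 2
    lower = sorted_values[:half]
    upper = sorted_values[-half:]
    return median(lower), median(upper)


data = sorted(ages)
q1, q3 = quartiles(data)
summary = {
    "mean": mean(data),
    "median": median(data),
    "modes": multimode(data),
    "midrange": (data[0] + data[-1]) / 2,
    "five_number": (data[0], q1, median(data), q3, data[-1]),
}
print(summary)
```
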
Module 6
(36) The Apriori algorithm makes use of prior knowledge of subset support properties.
(a) Prove that all non-empty subsets of a frequent itemset must also be frequent.
(b) Prove that the support of any non-empty subset S' of an itemset S must be at least as great as
the support of S.
(c) Given a frequent itemset L and a subset S of L, prove that the confidence of the rule
S' ⇒ (L − S') cannot be more than the confidence of S ⇒ (L − S), where S' is a
subset of S.
(d) A partitioning variation of Apriori subdivides the transactions of a database D into
N non-overlapping partitions. Prove that any itemset that is frequent in D must be
frequent in at least one partition of D.
(37) A database has four transactions. Let min_sup = 60% and min_conf = 80%.
TID    date       items_bought
T100   10/15/99   {K, A, D, B}
T200   10/15/99   {D, A, C, E, B}
T300   10/15/99   {C, A, B, E}
T400   10/15/99   {B, A, D}
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the
efficiency of the two mining processes.
(b) List all of the strong association rules matching the following metarule, where X is
a variable representing customers, and item_i denotes variables representing items (e.g., A,
B, etc.):
∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [s, c]
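
The Apriori half of 37(a) can be sketched in a few lines. This is a minimal, unoptimized sketch rather than a reference implementation: it reads min_sup = 60% as "contained in at least 60% of the four transactions", generates candidates level by level with a join step and a prune step, and does not cover FP-growth or rule generation. The function names are illustrative.

```python
from itertools import combinations

# Transactions from exercise (37).
transactions = [
    {"K", "A", "D", "B"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]


def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all frequent itemsets."""
    n = len(transactions)

    def support(items):
        return sum(items <= t for t in transactions)

    current = {frozenset([i]) for t in transactions for i in t}
    frequent, k = {}, 1
    while current:
        # Keep candidates whose relative support reaches min_sup.
        level = {c: support(c) for c in current if support(c) / n >= min_sup}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        current = {c for c in joined
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))}
    return frequent


frequent = apriori(transactions, min_sup=0.6)
for itemset, count in sorted(frequent.items(),
                             key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), count)
```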

(38) Suppose that frequent itemsets are saved for a large transaction database, DB. Discuss
how to efficiently mine the (global) association rules under the same minimum support
threshold, if a set of new transactions, denoted as ΔDB, is (incrementally) added in.
(39) Propose and outline a level-shared mining approach to mining multilevel association
rules in which each item is encoded by its level position, and an initial scan of the database
collects the count for each item at each concept level, identifying frequent and
subfrequent items. Comment on the processing cost of mining multilevel associations with
this method in comparison to mining single-level associations.
(40) When mining cross-level association rules, suppose it is found that the itemset {IBM
desktop computer, printer} does not satisfy minimum support. Can this information be
used to prune the mining of a descendant itemset such as {IBM desktop computer, b/w
printer}? Give a general rule explaining how this information may be used for pruning
the search space.
(41) Prove that each entry in the following table correctly characterizes its corresponding rule
constraint for frequent itemset mining.
Rule constraint       Antimonotone   Monotone      Succinct
(a) v ∈ S             no             yes           yes
(b) S ⊆ V             yes            no            yes
(c) min(S) ≤ v        no             yes           yes
(d) range(S) ≤ v      yes            no            no
(e) variance(S) ≤ v   convertible    convertible   no
Module-7
(42) Briefly outline the major steps of decision tree classification.
(43) Why is tree pruning useful in decision tree induction? What is a drawback of using a
separate set of samples to evaluate pruning?
(44) The following table shows the midterm and final exam grades obtained for students in a
database course.
X (midterm exam)   Y (final exam)
72                 84
50                 63
81                 77
74                 78
94                 90
86                 75
59                 49
83                 79
65                 77
33                 52
88                 74
81                 90
(a) Plot the data. Do X and Y seem to have a linear relationship?
(b) Use the method of least squares to find an equation for the prediction of a student's
final exam grade based on the student's midterm grade in the course.
(c) Predict the final exam grade of a student who received an 86 on the midterm exam.
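
Parts (b) and (c) of (44) can be checked with a short least-squares sketch. It assumes the X and Y values pair up in the order in which they are listed in the sheet (first midterm with first final, and so on); the function name is illustrative.

```python
# Midterm (X) and final (Y) grades, paired in listing order (an assumption).
midterm = [72, 50, 81, 74, 94, 86, 59, 83, 65, 33, 88, 81]
final = [84, 63, 77, 78, 90, 75, 49, 79, 77, 52, 74, 90]


def least_squares(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


intercept, slope = least_squares(midterm, final)
predicted = intercept + slope * 86
print(f"y = {intercept:.3f} + {slope:.4f} x; "
      f"prediction for x = 86: {predicted:.1f}")
```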

(45) What is boosting? State why it may improve the accuracy of decision tree induction.
(46) Show that accuracy is a function of sensitivity and specificity, that is, prove that
accuracy = sensitivity × pos / (pos + neg) + specificity × neg / (pos + neg).
Module-8
(47) Briefly outline how to compute the dissimilarity between objects described by the
following types of variables:
(a) Asymmetric binary variables.
(b) Nominal variables.
(c) Ratio-scaled variables.
(d) Numerical variables.
(48) Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q = 3.
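
All three distances in (48) are instances of the Minkowski distance, with order 1 giving Manhattan and order 2 giving Euclidean. A minimal sketch, with an illustrative function name:

```python
def minkowski(p, q, order):
    """Minkowski distance; order=1 is Manhattan, order=2 is Euclidean."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1 / order)


x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

euclidean = minkowski(x, y, 2)  # sqrt(2^2 + 1^2 + 6^2 + 2^2) = sqrt(45)
manhattan = minkowski(x, y, 1)  # 2 + 1 + 6 + 2 = 11
cubic = minkowski(x, y, 3)      # (8 + 1 + 216 + 8) ** (1/3) = 233 ** (1/3)
print(euclidean, manhattan, cubic)
```
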
(49) Suppose that the data mining task is to cluster the following eight points (with (x, y)
representing location) into three clusters:
A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).
The distance function is Euclidean distance. Suppose we initially assign A1, B1, and C1 as the
center of each cluster, respectively. Use the k-means algorithm to show only
(a) the three cluster centers after the first round of execution, and
(b) the final three clusters.
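
The k-means rounds in (49) can be traced with the sketch below. It is a bare-bones version under stated assumptions: it stops when the assignment no longer changes, and it does not handle empty clusters or ties, which is enough for this small data set but not for general use. The function names are illustrative.

```python
from math import dist

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}


def assign(centers):
    """Assign every point to the index of its nearest center."""
    clusters = [[] for _ in centers]
    for name, p in points.items():
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[nearest].append(name)
    return clusters


def recompute(clusters):
    """New center of each cluster = mean of its member points."""
    return [tuple(sum(points[n][d] for n in c) / len(c) for d in range(2))
            for c in clusters]


centers = [points["A1"], points["B1"], points["C1"]]
clusters = assign(centers)
first_round_centers = recompute(clusters)

# Iterate until the assignment no longer changes.
centers = first_round_centers
while True:
    new_clusters = assign(centers)
    if new_clusters == clusters:
        break
    clusters, centers = new_clusters, recompute(new_clusters)

print(first_round_centers)
print(clusters)
```
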
(50) Data cubes and multidimensional databases contain categorical, ordinal, and numerical data
in hierarchical or aggregate forms. Based on what you have learned about the clustering methods,
design a clustering method that finds clusters in large data cubes effectively and efficiently.
Module - 9
(51) Suppose that a restaurant chain would like to mine customers' consumption behavior
related to major sport events, such as "Every time there is a Canucks hockey game on TV, the
sales of Kentucky Fried Chicken will go up 20% one hour before the match."
(a) Describe a method to find such patterns efficiently.
(b) Most time-related association mining algorithms use Apriori-like algorithms to mine such
patterns. An alternative database projection-based frequent pattern (FP) growth method
is efficient at mining frequent itemsets. Can you extend the FP-growth method to find such
time-related patterns efficiently?
(52) Suppose that a power station stores data about power consumption levels by time and by
region, and power usage information per customer in each region. Discuss how to solve the
following problems in such a time-series database.
(a) Find similar power consumption curve fragments for a given region on Fridays.
(b) Every time a power consumption curve rises sharply, what may happen within 20 minutes?
(c) How can we find the most influential features that distinguish a stable power consumption
region from an unstable one?


(53) Each scientific or engineering discipline has its own subject index classification standard
that is often used for classifying documents in its discipline.
(a) Design a Web document classification method that can take such a subject index to classify
a set of Web documents automatically.
(b) Discuss how to use Web linkage information to improve the quality of such classification.
(c) Discuss how to use Web usage information to improve the quality of such classification.
--------- x --------
