
DBMiner: A System for Mining Knowledge in Large Relational Databases

Jiawei Han Yongjian Fu Wei Wang Jenny Chiang Wan Gong Krzysztof Koperski Deyi Li
Yijun Lu Amynmohamed Rajan Nebojsa Stefanovic Betty Xia Osmar R. Zaiane
Data Mining Research Group, Database Systems Research Laboratory
School of Computing Science, Simon Fraser University, British Columbia, Canada V5A 1S6
E-mail: {han, yongjian, weiw, ychiang, wgong, koperski, dli, yijunl, arajan, nstefano, bxia, zaiane}@cs.sfu.ca
URL: http://db.cs.sfu.ca/ (for research group) http://db.cs.sfu.ca/DBMiner (for system)

Abstract

A data mining system, DBMiner, has been developed for interactive mining of multiple-level knowledge in large relational databases. The system implements a wide spectrum of data mining functions, including generalization, characterization, association, classification, and prediction. By incorporating several interesting data mining techniques, including attribute-oriented induction, statistical analysis, progressive deepening for mining multiple-level knowledge, and meta-rule guided mining, the system provides a user-friendly, interactive data mining environment with good performance.

Introduction

With the upsurge of research and development activities on knowledge discovery in databases (Piatetsky-Shapiro & Frawley 1991; Fayyad et al. 1996), a data mining system, DBMiner, has been developed based on our studies of data mining techniques and our experience in the development of an early system prototype, DBLearn. The system integrates data mining techniques with database technologies, and discovers various kinds of knowledge at multiple concept levels from large relational databases efficiently and effectively. The system has the following distinct features:

1. It incorporates several interesting data mining techniques, including attribute-oriented induction (Han, Cai, & Cercone 1993; Han & Fu 1996), statistical analysis, progressive deepening for mining multiple-level rules (Han & Fu 1995; 1996), and meta-rule guided knowledge mining (Fu & Han 1995). It also implements a wide spectrum of data mining functions including generalization, characterization, association, classification, and prediction.

2. It performs interactive data mining at multiple concept levels on any user-specified set of data in a database using an SQL-like Data Mining Query Language, DMQL, or a graphical user interface. Users may interactively set and adjust various thresholds, control a data mining process, perform roll-up or drill-down at multiple concept levels, and generate different forms of outputs, including generalized relations, generalized feature tables, multiple forms of generalized rules, visual presentations of rules, charts, curves, etc.

3. Efficient implementation techniques have been explored using different data structures, including generalized relations and multiple-dimensional data cubes. The implementations have been integrated smoothly with relational database systems.

4. The data mining process may utilize user- or expert-defined set-grouping or schema-level concept hierarchies, which can be specified flexibly, adjusted dynamically based on data distribution, and generated automatically for numerical attributes. Concept hierarchies are taken as an integrated component of the system and are stored as a relation in the database.

5. Both UNIX and PC (Windows/NT) versions of the system adopt a client/server architecture. The latter may communicate with various commercial database systems for data mining using the ODBC technology.

The system has been tested on several large relational databases, including the NSERC (Natural Sciences and Engineering Research Council of Canada) research grant information system, with satisfactory performance. Additional data mining functionalities are being designed and will be added incrementally to the system along with the progress of our research.

Research was supported in part by the grant NSERC-OPG003723 from the Natural Sciences and Engineering Research Council of Canada, the grant NCE:IRIS/Precarn-HMI-5 from the Networks of Centres of Excellence of Canada, and grants from B.C. Advanced Systems Institute, MPR Teltech Ltd., and Hughes Research Laboratories.
Architecture and Functionalities

The general architecture of DBMiner, shown in Figure 1, tightly integrates a relational database system, such as a Sybase SQL server, with a concept hierarchy module and a set of knowledge discovery modules. The discovery modules of DBMiner, shown in Figure 2, include the characterizer, discriminator, classifier, association rule finder, meta-rule guided miner, predictor, evolution evaluator, deviation evaluator, and some planned future modules.
Figure 1: General architecture of DBMiner (a graphical user interface on top of the SQL server and discovery modules, with the data and concept hierarchy stored in the database)

Figure 2: Knowledge discovery modules of DBMiner (characterizer, discriminator, classifier, association rule finder, meta-rule guided miner, predictor, evolution evaluator, deviation evaluator, and future modules)

The functionalities of the knowledge discovery modules are briefly described as follows:

- The characterizer generalizes a set of task-relevant data into a generalized relation which can then be used for extraction of different kinds of rules or be viewed at multiple concept levels from different angles. In particular, it derives a set of characteristic rules which summarizes the general characteristics of a set of user-specified data (called the target class). For example, the symptoms of a specific disease can be summarized by a characteristic rule.

- A discriminator discovers a set of discriminant rules which summarize the features that distinguish the class being examined (the target class) from other classes (called contrasting classes). For example, to distinguish one disease from others, a discriminant rule summarizes the symptoms that discriminate this disease from others.

- A classifier analyzes a set of training data (i.e., a set of objects whose class label is known) and constructs a model for each class based on the features in the data. A set of classification rules is generated by such a classification process, which can be used to classify future data and develop a better understanding of each class in the database. For example, one may classify diseases and provide the symptoms which describe each class or subclass.

- An association rule finder discovers a set of association rules (in the form of "A1 ∧ ... ∧ Ai → B1 ∧ ... ∧ Bj") at multiple concept levels from the relevant set(s) of data in a database. For example, one may discover a set of symptoms often occurring together with certain kinds of diseases and further study the reasons behind them.

- A meta-rule guided miner is a data mining mechanism which takes a user-specified meta-rule form, such as "P(x, y) ∧ Q(y, z) → R(x, z)", as a pattern to confine the search for desired rules. For example, one may specify the discovered rules to be in the form of "major(s : student, x) ∧ P(s, y) → gpa(s, z)" in order to find the relationships between a student's major and his/her gpa in a university database.

- A predictor predicts the possible values of some missing data or the value distribution of certain attributes in a set of objects. This involves finding the set of attributes relevant to the attribute of interest (by some statistical analysis) and predicting the value distribution based on the set of data similar to the selected object(s). For example, an employee's potential salary can be predicted based on the salary distribution of similar employees in the company.
- A data evolution evaluator evaluates the data evolution regularities for certain objects whose behavior changes over time. This may include characterization, classification, association, or clustering of time-related data. For example, one may find the general characteristics of the companies whose stock price has gone up over 20% in the last year, or evaluate the trend or particular growth patterns of certain stocks.

- A deviation evaluator evaluates the deviation patterns for a set of task-relevant data in the database. For example, one may discover and evaluate a set of stocks whose behavior deviates from the trend of the majority of stocks during a certain period of time.

Another important function module of DBMiner is the concept hierarchy, which provides essential background knowledge for data generalization and multiple-level data mining. Concept hierarchies can be specified based on the relationships among database attributes (called schema-level hierarchies) or by set groupings (called set-grouping hierarchies), and be stored in the form of relations in the same database. Moreover, they can be adjusted dynamically based on the distribution of the set of data relevant to the data mining task. Also, hierarchies for numerical attributes can be constructed automatically based on data distribution analysis (Han & Fu 1994).
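To make the hierarchy storage and the automatic generation for numerical attributes concrete, here is a minimal Python sketch; the (child, parent) table layout, the sample concepts, and the equal-width binning are illustrative assumptions rather than DBMiner's actual representation or its data-distribution analysis.

from collections import OrderedDict

# A minimal sketch (not DBMiner's actual format): a set-grouping concept
# hierarchy stored as a relation of (child_concept, parent_concept) pairs,
# plus a simple binning routine for numerical attributes.
concept_hierarchy = [
    ("British Columbia", "Western Canada"),
    ("Alberta",          "Western Canada"),
    ("Ontario",          "Central Canada"),
    ("Quebec",           "Central Canada"),
    ("Western Canada",   "Canada"),
    ("Central Canada",   "Canada"),
]
parent = dict(concept_hierarchy)

def generalize(value, levels=1):
    """Climb the hierarchy a given number of levels (roll-up)."""
    for _ in range(levels):
        value = parent.get(value, value)   # stop at the top concept
    return value

def numeric_hierarchy(values, bins=4):
    """Assumed equal-width binning for a numerical attribute; DBMiner uses
    data-distribution analysis (Han & Fu 1994), which is more refined."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1
    return [f"{lo + i*width:.0f}-{lo + (i+1)*width:.0f}" for i in range(bins)]

print(generalize("Ontario"))              # -> Central Canada
print(generalize("Ontario", levels=2))    # -> Canada
print(numeric_hierarchy([5, 12, 18, 33, 41, 57], bins=3))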
DMQL and Interactive Data Mining

DBMiner offers both an SQL-like data mining query language, DMQL, and a graphical user interface for interactive mining of multiple-level knowledge.

Example 1. To characterize CS grants in the NSERC96 database related to discipline code and amount category in terms of count% and amount%, the query is expressed in DMQL as follows:

    use NSERC96
    find characteristic rules for "CS Discipline Grants"
    from award A, grant_type G
    related to disc_code, amount, count(*)%, amount(*)%
    where A.grant_code = G.grant_code
    and A.disc_code = "Computer Science"

The query is processed as follows: the system collects the relevant set of data by processing a transformed relational query, generalizes the data by attribute-oriented induction, and then presents the outputs in different forms, including generalized relations, generalized feature tables, multiple (including visual) forms of generalized rules, pie/bar charts, curves, etc.

A user may interactively set and adjust various kinds of thresholds to control the data mining process. For example, one may adjust the generalization threshold for an attribute to allow more or fewer distinct values in this attribute. A user may also roll-up or drill-down the generalized data at multiple concept levels.
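As an illustration of the first processing step, the sketch below issues a plausible transformed relational query for Example 1 through Python's sqlite3 on a toy stand-in schema; the table layouts and sample rows are assumptions, not the real NSERC96 schema.

import sqlite3

# A toy stand-in for the NSERC96 tables referenced in Example 1.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE award (grant_code TEXT, disc_code TEXT, amount INTEGER);
CREATE TABLE grant_type (grant_code TEXT, description TEXT);
INSERT INTO award VALUES ('G1', 'Computer Science', 32000),
                         ('G1', 'Computer Science', 18500),
                         ('G2', 'Physics', 40000);
INSERT INTO grant_type VALUES ('G1', 'Research'), ('G2', 'Equipment');
""")

# Assumed translation of the DMQL "from ... where ..." clauses: collect the
# task-relevant attributes before attribute-oriented induction is applied.
rows = con.execute("""
    SELECT A.disc_code, A.amount
    FROM award A JOIN grant_type G ON A.grant_code = G.grant_code
    WHERE A.disc_code = 'Computer Science'
""").fetchall()
print(rows)   # the task-relevant tuples handed to attribute-oriented induction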
A data mining query language such as DMQL facilitates the standardization of data mining functions, systematic development of data mining systems, and integration with standard relational database systems. Various kinds of graphical user interfaces can be developed based on such a data mining query language. Such interfaces have been implemented in DBMiner on three platforms: Windows/NT, UNIX, and Netscape. A graphical user interface facilitates interactive specification and modification of data mining queries, concept hierarchies, and various kinds of thresholds; selection and change of output forms; roll-up or drill-down; and dynamic control of a data mining process.
Implementation of DBMiner

Data structures: Generalized relation vs. multi-dimensional data cube

Data generalization is a core function of DBMiner. Two data structures, the generalized relation and the multi-dimensional data cube, can be considered in the implementation of data generalization.

A generalized relation is a relation which consists of a set of (generalized) attributes (storing generalized values of the corresponding attributes in the original relation) and a set of "aggregate" (measure) attributes (storing the values resulting from executing aggregate functions, such as count, sum, etc.), and in which each tuple is the result of generalizing a set of tuples in the original data relation. For example, a generalized relation award may store a set of tuples such as "award(AI, 20-40k, 37, 835900)", which represents the generalized data whose discipline code is "AI" and whose amount category is "20-40k"; such data takes 37 in count and $835,900 in (total) amount.
mentation of data generalization. Both data structures have been explored in the DB-
A generalized relation is a relation which consists Miner implementations: the generalized relation struc-
of a set of (generalized) attributes (storing generalized ture is adopted in version 1.0, and a multi-dimensional
values of the corresponding attributes in the original data cube structure in version 2.0. A more exible
relation) and a set of \aggregate" (measure) attributes implementation is to consider both structures, adopt
(storing the values resulted from executing aggregate the multi-dimensional data cube structure when the
functions, such as count, sum, etc.), and in which each size of the data cube is reasonable, and switch to the
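Below is a small sketch of such a two-dimensional award cube with count and sum cells and the derivation of count% and amount%; the dimension values and figures are illustrative assumptions.

# A 2-D "award" cube sketch: cube[d][a] holds [count, sum_of_amount] for
# discipline d and amount category a; dimension totals give count%/amount%.
disciplines = ["AI", "Databases", "Theory"]
amount_cats = ["0-20k", "20-40k", "40-60k", "60k-"]

cube = {d: {a: [0, 0] for a in amount_cats} for d in disciplines}

def add(disc, cat, amount):
    cell = cube[disc][cat]
    cell[0] += 1          # count measure
    cell[1] += amount     # sum measure

for disc, cat, amt in [("AI", "20-40k", 23000), ("AI", "20-40k", 35000),
                       ("Databases", "40-60k", 52000), ("Theory", "0-20k", 15000)]:
    add(disc, cat, amt)

total_count = sum(c[0] for row in cube.values() for c in row.values())
total_amount = sum(c[1] for row in cube.values() for c in row.values())

count, amount = cube["AI"]["20-40k"]
print(f"count% = {100 * count / total_count:.1f}%")     # share of all tuples
print(f"amount% = {100 * amount / total_amount:.1f}%")  # share of total dollars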
In comparison with the generalized relation structure, a multi-dimensional data cube structure has the following advantages. First, it may often save storage space, since only the measurement attribute values need to be stored in the cube and the generalized (dimensional) attribute values serve only as dimensional indices to the cube. Second, it leads to fast access to particular cells (or slices) of the cube using indexing structures. Third, it usually costs less to produce a cube than a generalized relation in the process of generalization, since the right cell in the cube can be located easily. However, if a multi-dimensional data cube structure is quite sparse, the storage space of the cube is largely wasted, and the generalized relation structure should be adopted to save the overall storage space.

Both data structures have been explored in the DBMiner implementations: the generalized relation structure is adopted in version 1.0, and a multi-dimensional data cube structure in version 2.0. A more flexible implementation is to consider both structures: adopt the multi-dimensional data cube structure when the size of the data cube is reasonable, and switch to the generalized relation structure (by dynamic allocation) otherwise (this can be estimated based on the number of dimensions being considered and the attribute threshold of each dimension). Such an alternative will be considered in our future implementation.
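A rough sketch of the kind of size estimate such a hybrid scheme could use, choosing the cube only while the estimated number of cells stays within a budget; the threshold values and the cell budget are assumptions for illustration.

# Estimate cube size as the product of the attribute (generalization)
# thresholds of the chosen dimensions, then pick a structure.
def choose_structure(attribute_thresholds, max_cells=1_000_000):
    """attribute_thresholds: maximum number of distinct generalized values
    allowed per dimension (one entry per dimension considered)."""
    cells = 1
    for t in attribute_thresholds:
        cells *= t
    return ("multi-dimensional data cube" if cells <= max_cells
            else "generalized relation"), cells

structure, cells = choose_structure([20, 8, 12])          # 1,920 cells
print(structure, cells)                                   # cube is affordable
structure, cells = choose_structure([200, 150, 90, 60])   # 162,000,000 cells
print(structure, cells)                                   # fall back to relation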
Besides designing good data structures, efficient implementation of each discovery module has been explored, as discussed below.
Multiple-level characterization

Data characterization summarizes and characterizes a set of task-relevant data, usually based on generalization. For mining multiple-level knowledge, progressive deepening (drill-down) and progressive generalization (roll-up) techniques can be applied.

Progressive generalization starts with a conservative generalization process which first generalizes the data to concept levels slightly higher than those of the primitive data in the relation. Further generalizations can be performed on it progressively by selecting appropriate attributes for step-by-step generalization. Strong characteristic rules can be discovered at multiple abstraction levels by filtering out, during rule generation, generalized tuples with weak support or weak confidence (based on the corresponding thresholds at different levels of generalization).

Progressive deepening starts with a relatively high-level generalized relation and selectively and progressively specializes some of the generalized tuples or attributes to lower abstraction levels.

Conceptually, a top-down, progressive deepening process is preferable, since it is natural to first find general data characteristics at a high concept level and then follow certain interesting paths to step down to specialized cases. However, from the implementation point of view, it is easier to perform generalization than specialization, because generalization replaces low-level tuples by high-level ones through ascension of a concept hierarchy. Since generalized tuples do not register the detailed original information, it is difficult to get such information back when specialization is required later.

Our technique for facilitating specialization on generalized relations is to save a "minimally generalized relation/cube" in an early stage of generalization. That is, each attribute in the relevant set of data is generalized to minimally generalized concepts (which can be done in one scan of the data relation), and then identical tuples in such a generalized relation/cube are merged together, which derives the minimally generalized relation. After that, both progressive deepening and interactive up-and-down can be performed with reasonable efficiency: if the data at the current abstraction level is to be further generalized, generalization can be performed directly on it; on the other hand, if it is to be specialized, the desired result can be derived by generalizing the minimally generalized relation/cube to the appropriate level(s).
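A compact sketch of the minimally generalized relation idea: generalize each attribute one level in a single scan, merge identical tuples with their counts, and answer later roll-up (and drill-down from higher levels) by re-generalizing this small relation instead of rescanning the raw data; the hierarchy and data are illustrative assumptions.

from collections import Counter

# Illustrative two-level concept hierarchy for one attribute.
level1 = {"burnaby": "Greater Vancouver", "surrey": "Greater Vancouver",
          "kamloops": "Interior B.C."}
level2 = {"Greater Vancouver": "British Columbia", "Interior B.C.": "British Columbia"}

raw = ["burnaby", "surrey", "surrey", "kamloops", "burnaby"]

# One scan: build the minimally generalized relation (tuple -> count).
min_gen = Counter(level1[v] for v in raw)
print(dict(min_gen))   # {'Greater Vancouver': 4, 'Interior B.C.': 1}

# Later roll-up (progressive generalization) works on the small relation,
# not the raw data; drill-down from a higher level re-generalizes min_gen.
def roll_up(gen_relation, mapping):
    out = Counter()
    for value, count in gen_relation.items():
        out[mapping.get(value, value)] += count
    return out

print(dict(roll_up(min_gen, level2)))   # {'British Columbia': 5}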
Discovery of discriminant rules

The discriminator of DBMiner finds a set of discriminant rules which distinguish the general features of a target class from those of the contrasting class(es) specified by a user. It is implemented as follows.

First, the set of relevant data in the database is collected by query processing and partitioned into a target class and one or a set of contrasting class(es). Second, attribute-oriented induction is performed on the target class to extract a prime target relation/cube, where a prime target relation is a generalized relation in which each attribute contains no more distinct values than, but close to, the threshold value of the corresponding attribute. Then the concepts in the contrasting class(es) are generalized to the same level as those in the prime target relation/cube, forming the prime contrasting relation/cube. Finally, the information in these two classes is used to generate qualitative or quantitative discriminant rules.

Moreover, interactive drill-down and roll-up can be performed synchronously in both the target class and the contrasting class(es), in a similar way to that explained in the previous subsection (characterization). These functions have been implemented in the discriminator.
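To make the final comparison step concrete, the sketch below contrasts per-tuple counts of a prime target relation and a prime contrasting relation generalized to the same level; the relevance measure used here (the share of a generalized tuple's occurrences falling in the target class) is an illustrative choice, not necessarily the statistic DBMiner reports.

from collections import Counter

# Generalized tuples (already at the same concept level) for two classes,
# e.g. (discipline, amount_category) for "B.C. grants" vs. "Ontario grants".
target      = Counter({("AI", "20-40k"): 30, ("Databases", "40-60k"): 10})
contrasting = Counter({("AI", "20-40k"): 6,  ("Databases", "40-60k"): 44})

# Contrast the two prime relations: a tuple that occurs mostly in the target
# class is a good basis for a discriminant rule.
for tup in sorted(set(target) | set(contrasting)):
    t, c = target[tup], contrasting[tup]
    share = t / (t + c)
    print(f"{tup}: target={t} contrasting={c} target-share={share:.2f}")
# ('AI', '20-40k') has a high target share, so it discriminates the target class.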
Multiple-level association

Based on many studies of efficient mining of association rules (Agrawal & Srikant 1994; Srikant & Agrawal 1995; Han & Fu 1995), a multiple-level association rule finder has been implemented in DBMiner.

Different from mining association rules in transaction databases, a relational association rule miner may find two kinds of associations: nested association and flat association, as illustrated in the following example.

Example 2. Suppose the "course_taken" relation in a university database has the following schema:

    course_taken = (student_id, course, semester, grade).

Nested association is association between a data object and a set of attributes in a relation, viewing the data in this set of attributes as a nested relation. For example, one may find the associations between students and their course performance by viewing "(course, semester, grade)" as a nested relation associated with student_id.

Flat association is association among different attributes in a relation without viewing any attribute(s) as a nested relation. For example, one may find the relationships between course and grade in the course_taken relation, such as "the courses in computing science tend to have good grades", etc.

The two kinds of associations require different data mining techniques.
For mining nested associations, a data relation can be transformed into a nested relation in which the tuples that share the same values on the unnested attributes are merged into one. For example, the course_taken relation can be folded into a nested relation with the schema

    course_taken = (student_id, course_history)
    course_history = (course, semester, grade).

By such a transformation, it is easy to derive association rules like "90% of senior CS students tend to take at least three CS courses at the 300-level or above in each semester". Since the nested tuples (or values) can be viewed as data items in the same transaction, the methods for mining association rules in transaction databases, such as (Han & Fu 1995), can be applied to such transformed relations in relational databases.
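A minimal sketch of the folding step, grouping course_taken tuples by student_id so that each student's course history can be treated like one transaction; the sample rows are made up for illustration.

from collections import defaultdict

# course_taken = (student_id, course, semester, grade)
course_taken = [
    ("s1", "CMPT 354", "95-3", "A"),
    ("s1", "CMPT 310", "95-3", "B+"),
    ("s2", "CMPT 354", "95-3", "A-"),
    ("s2", "PHYS 120", "96-1", "B"),
]

# Fold into course_taken = (student_id, course_history): one nested tuple
# per student, which can then be mined like a transaction of items.
nested = defaultdict(list)
for student_id, course, semester, grade in course_taken:
    nested[student_id].append((course, semester, grade))

for student_id, course_history in nested.items():
    print(student_id, course_history)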
The multi-dimensional data cube structure facilitates efficient mining of multi-level flat association rules. A count cell of a cube stores the number of occurrences of the corresponding multi-dimensional data values, whereas a dimension count cell stores the sum of the counts over the whole dimension. With this structure, it is straightforward to calculate measurements such as the support and confidence of association rules. A set of such cubes, ranging from the least generalized cube to rather high-level cubes, facilitates the mining of association rules at multiple concept levels.
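The sketch below shows how the support and confidence of a flat association between two dimensions can be read off count cells and dimension count cells; the cube contents are illustrative.

# cube[(course_area, grade_level)] -> count cell; dimension counts are sums.
cube = {
    ("computing", "good"): 180, ("computing", "average"): 60,
    ("physics",   "good"): 70,  ("physics",   "average"): 90,
}
total = sum(cube.values())                       # overall count

def dim_count(dim_index, value):
    """Dimension count cell: sum of counts over the other dimension."""
    return sum(c for key, c in cube.items() if key[dim_index] == value)

# Rule: course_area = computing  =>  grade_level = good
support    = cube[("computing", "good")] / total
confidence = cube[("computing", "good")] / dim_count(0, "computing")
print(f"support={support:.2f} confidence={confidence:.2f}")  # 0.45 and 0.75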
Meta-rule guided mining

Since there are many ways to derive association rules in relational databases, it is preferable to let users specify interesting constraints to guide the data mining process. Such constraints can be specified in a meta-rule (or meta-pattern) form (Shen et al. 1996), which confines the search to specific forms of rules. For example, a meta-rule "P(x, y) → Q(x, y, z)", where P and Q are predicate variables matching different properties in a database, can be used as a rule-form constraint in the search.

In principle, a meta-rule can be used to guide the mining of many kinds of rules. Since association rules have a form similar to logic rules, we have first studied meta-rule guided mining of association rules in relational databases (Fu & Han 1995). Different from the study by (Shen et al. 1996), where a meta-predicate may match any relation predicates, deductive predicates, attributes, etc., we confine the search to the predicates corresponding to the attributes of one relation. One such example is illustrated as follows.

Example 3. A meta-rule guided data mining query can be specified in DMQL as follows, for mining a specific form of rules related to the set of attributes "major, gpa, status, birth_place, address" in the relation student, for those born in Canada in a university database:

    find association rules in the form of
    major(s : student, x) ∧ Q(s, y) → R(s, z)
    related to major, gpa, status, birth_place, address
    from student
    where birth_place = "Canada"

Multi-level association rules can be discovered in such a database, as illustrated below:

    major(s, "Science") ∧ gpa(s, "Excellent") → status(s, "Graduate") (60%)
    major(s, "Physics") ∧ status(s, "M.Sc") → gpa(s, "3.8-4.0") (76%)

The mining of such multi-level rules can be implemented in a similar way to mining multiple-level association rules in a multi-dimensional data cube.
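As a small illustration of how a meta-rule confines the search, the sketch below enumerates only the candidate rule forms that instantiate the template major(s, x) ∧ Q(s, y) → R(s, z) over the attributes of one relation; the attribute list mirrors Example 3, and the enumeration strategy is an assumption, not the algorithm of (Fu & Han 1995).

from itertools import permutations

# Attributes of the student relation that predicate variables may match.
attributes = ["major", "gpa", "status", "birth_place", "address"]

# Meta-rule: major(s, x) ^ Q(s, y) -> R(s, z), with Q and R predicate
# variables; only rule forms matching this template are searched.
fixed = "major"
candidate_forms = [
    (fixed, q, r)
    for q, r in permutations([a for a in attributes if a != fixed], 2)
]

for p, q, r in candidate_forms:
    print(f"{p}(s, x) ^ {q}(s, y) -> {r}(s, z)")
# Each form would then be evaluated against the data (support/confidence),
# e.g. major(s, x) ^ gpa(s, y) -> status(s, z).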
Classification

Data classification is to develop a description or model for each class in a database, based on the features present in a set of class-labeled training data.

Many data classification methods have been studied, including decision-tree methods such as ID3 and C4.5 (Quinlan 1993), statistical methods, neural networks, rough sets, etc. Recently, some database-oriented classification methods have also been investigated (Mehta, Agrawal, & Rissanen 1996).

Our classifier adopts a generalization-based decision-tree induction method which integrates attribute-oriented induction with a decision-tree induction technique, by first performing attribute-oriented induction on the set of training data to generalize attribute values in the training set, and then performing decision-tree induction on the generalized data.

Since a generalized tuple comes from the generalization of a number of original tuples, count information is associated with each generalized tuple and plays an important role in classification. To handle noise and exceptional data and to facilitate statistical analysis, two thresholds, the classification threshold and the exception threshold, are introduced. The former helps justify the classification at a node when a significant portion of the examples belong to the same class, whereas the latter helps ignore a node in classification if it contains only a negligible number of examples.
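A minimal sketch of how the two thresholds might steer node handling during generalization-based decision-tree induction: the counts attached to generalized tuples at a node decide whether the node is labeled with a class, ignored as an exception, or split further; the threshold values and data are illustrative assumptions.

# Generalized training tuples at one decision-tree node: (class_label, count).
node_examples = [("approved", 46), ("rejected", 3), ("approved", 11)]

CLASSIFICATION_THRESHOLD = 0.85   # assumed: label the node if one class dominates
EXCEPTION_THRESHOLD = 5           # assumed: ignore nodes with too few examples

def decide(node_examples):
    total = sum(c for _, c in node_examples)
    if total <= EXCEPTION_THRESHOLD:
        return "ignore node (exception threshold)"
    by_class = {}
    for label, count in node_examples:
        by_class[label] = by_class.get(label, 0) + count
    best_label, best_count = max(by_class.items(), key=lambda kv: kv[1])
    if best_count / total >= CLASSIFICATION_THRESHOLD:
        return f"label node as '{best_label}' (classification threshold met)"
    return "split node further on another generalized attribute"

print(decide(node_examples))   # 57/60 approved -> label node as 'approved'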
There are several alternatives for performing generalization before classification: a data set can be generalized to a minimally generalized concept level, an intermediate concept level, or a rather high concept level. Too low a concept level may result in scattered classes, bushy classification trees, and difficulty in concise semantic interpretation, whereas too high a level may result in the loss of classification accuracy.

Currently, we are testing several alternatives for the integration of generalization and classification in databases, such as: (1) generalizing data to some medium concept levels; (2) generalizing data to intermediate concept level(s) and then performing node merging and splitting for better class representation and classification accuracy; and (3) performing multi-level classification and selecting a desired level by comparing the classification quality at different levels. Since all three classification processes are performed on relatively small, compressed, generalized relations, they are expected to result in efficient classification algorithms for large databases.
Prediction

A predictor predicts data values or value distributions on the attributes of interest based on similar groups of data in the database. For example, one may predict the amount of research grant that an applicant may receive based on data about similar groups of researchers.

The power of data prediction should be confined to ranges of numerical data or to nominal data generalizable to only a small number of categories. It is unlikely that one can give a reasonable prediction of a person's name or social insurance number based on other persons' data.

For successful prediction, the factors (or attributes) which strongly influence the values of the attributes of interest should be identified first. This can be done by analysis of data relevance or correlations using statistical methods or decision-tree classification techniques, or simply based on expert judgement. To analyze attribute correlation, our predictor constructs a contingency table, followed by association coefficient calculation based on the χ²-test, by analyzing minimally generalized data in the database. The attribute correlation associated with each attribute of interest is precomputed and stored in a special relation in the database.

When a prediction query is submitted, the set of data relevant to the requested prediction is collected, where the relevance is based on the attribute correlations derived by the query-independent analysis. The set of data which matches or is close to the query condition can be viewed as the similar group(s) of data. If this set is big enough (i.e., sufficient evidence exists), its value distribution on the attribute of interest can be taken as the predicted value distribution. Otherwise, the set should be appropriately enlarged, by generalization on less relevant attributes to a sufficiently high concept level, to collect enough evidence for a trustworthy prediction.
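A small sketch of the relevance-analysis step: build a contingency table between a candidate attribute and the attribute of interest over minimally generalized data and compute a χ² statistic; the counts are made up, and this shows only the test statistic, not DBMiner's full association-coefficient computation.

# Contingency table over minimally generalized data:
# rows = candidate attribute (e.g. province), columns = attribute of interest
# (e.g. grant amount category); entries are tuple counts.
table = [
    [30, 10],   # B.C.:     0-40k, 40k+
    [12, 28],   # Ontario:  0-40k, 40k+
]

def chi_square(table):
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    grand = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / grand
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# A large chi-square value suggests the candidate attribute is relevant to the
# attribute of interest and should be kept for forming similar groups.
print(f"chi-square = {chi_square(table):.2f}")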
Further Development of DBMiner

The DBMiner system is currently being extended in several directions, as outlined below.

- Further enhancement of the power and efficiency of data mining in relational database systems, including the improvement of system performance and rule discovery quality for the existing functional modules, and the development of techniques for mining new kinds of rules, especially on time-related data.

- Integration, maintenance, and application of discovered knowledge, including incremental update of discovered rules, removal of redundant or less interesting rules, merging of discovered rules into a knowledge base, intelligent query answering using discovered knowledge, and the construction of multiple-layered databases.

- Extension of data mining techniques towards advanced and/or special-purpose database systems, including extended-relational, object-oriented, text, spatial, temporal, and heterogeneous databases. Currently, two such data mining systems, GeoMiner and WebMiner, for mining knowledge in spatial databases and the Internet information base respectively, are under design and construction.

References

Agrawal, R., and Srikant, R. 1994. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, 487-499.

Fayyad, U. M.; Piatetsky-Shapiro, G.; Smyth, P.; and Uthurusamy, R. 1996. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press.

Fu, Y., and Han, J. 1995. Meta-rule-guided mining of association rules in relational databases. In Proc. 1st Int'l Workshop on Integration of Knowledge Discovery with Deductive and Object-Oriented Databases, 39-46.

Han, J., and Fu, Y. 1994. Dynamic generation and refinement of concept hierarchies for knowledge discovery in databases. In Proc. AAAI'94 Workshop on Knowledge Discovery in Databases (KDD'94), 157-168.

Han, J., and Fu, Y. 1995. Discovery of multiple-level association rules from large databases. In Proc. 1995 Int. Conf. Very Large Data Bases, 420-431.

Han, J., and Fu, Y. 1996. Exploration of the power of attribute-oriented induction in data mining. In Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P.; and Uthurusamy, R., eds., Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press. 399-421.

Han, J.; Cai, Y.; and Cercone, N. 1993. Data-driven discovery of quantitative rules in relational databases. IEEE Trans. Knowledge and Data Engineering 5:29-40.

Mehta, M.; Agrawal, R.; and Rissanen, J. 1996. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. on Extending Database Technology (EDBT'96).

Piatetsky-Shapiro, G., and Frawley, W. J. 1991. Knowledge Discovery in Databases. AAAI/MIT Press.

Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.

Shen, W.; Ong, K.; Mitbander, B.; and Zaniolo, C. 1996. Metaqueries for data mining. In Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P.; and Uthurusamy, R., eds., Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press. 375-398.

Srikant, R., and Agrawal, R. 1995. Mining generalized association rules. In Proc. 1995 Int. Conf. Very Large Data Bases, 407-419.
