1 Introduction

Clustering is an important descriptive data-mining task. Its goal is to divide a given set of examples into several subsets (clusters) such that examples of the same cluster are similar, while examples of different clusters are not. For conceptual clustering, it is also required that a (symbolic) description is given for each cluster. Several good conceptual clustering algorithms exist in the attribute-value representation (see e.g. Cobweb [11], AutoClass [4]).
However, these algorithms assume the attributes of the data to be independent within each cluster. This assumption does not hold in all cases. There can be redundant information, known by users and therefore uninteresting to them. Also, the redundancy can be caused by the language in which the information is represented.
In this paper we propose a method to deal with these problems by allowing attributes to be dependent on each other and by allowing the user to specify his prior knowledge.
Recent machine learning algorithms (such as Tilde [2], ICL [8], Warmr [9]) use a more expressive first order (or multiple relation) representation. Some systems that do conceptual clustering on examples represented in first order logic exist, but often still need some information in propositional form. E.g. the system TIC [3] builds clustering trees but uses a propositional distance measure between examples and/or clusters.
The search for a good clustering is typically guided by one of two heuristics: an evaluation measure (also called objective function) on clusters or a distance function. Both heuristics are in the propositional case a function of values of attributes. However, in the first order case, there is no fixed number of attributes, and so one of these two heuristics would have to be upgraded. While distance measures between first order objects exist [19], [10], they are often quite ad hoc, representation-dependent and computationally expensive.
In this paper, we propose an alternative approach. We use first order queries as attributes. Because of the generality relations between these queries, the obtained attributes are highly dependent on each other. Therefore, we use our method to deal with dependent attributes. This has the advantage that no distance is needed; only the objective function has to be upgraded, in a representation-independent way.
The paper is structured as follows: Section 2 describes a method to deal with non-independent attributes. Section 3 describes the extension to multiple relations. Next, section 4 describes an implementation where we incorporate these methods in the Cobweb algorithm. Section 5 shows some experimental validation of our method. Section 6 discusses some related work. Finally, section 7 gives some conclusions and ideas for further work.
customer         1 2 3 4 5 6 7 8 9
income           H H L H L L H L H
social class     H H L H L L L L H
has a car        y y n y n n n n y
education level  H H L H L L H L H
buys pizza       n y n y n y n y n
buys beer        n y n y n y n y n

Table 1. Supermarket example
Most clustering algorithms will produce two clusters: the cluster {1, 2, 4, 7, 9}, which could be labeled as the set of the 'rich' people, and the cluster {3, 5, 6, 8} of 'poor' people. Perhaps an institution doing demographic research would be happy with this. However, the supermarket already knows this and is more interested in less trivial data, e.g. the fact that pizza and beer are often bought together.
P(H|D) = \frac{P(D|H) \, P_H(H)}{P_D(D)}    (1)

Assuming the examples E_1, ..., E_n are independent given the hypothesis H,

P(D|H) = \prod_{i=1}^{n} P(E_i|H)    (2)

P(E_i|H) = P(E_i|C_{j_i})    (3)

where C_{j_i} is the cluster E_i is assigned to. Let E_i = (e_{i1}, ..., e_{ia}) with a the number of attributes. If the attributes are assumed to be independent within each cluster, this probability can be written as

P(E_i|C_j) = \prod_{l=1}^{a} P(A_l = e_{il}|C_j)    (4)

where P(A_l = e_{il}|C_j) is the probability that the l-th attribute has the value e_{il} in the cluster C_j. Combining (1), (2), (3) and (4) and taking the logarithm gives

\log P(H|D) = \log(P_H(H)/P_D(D)) + \sum_{i=1}^{n} \sum_{l=1}^{a} \log P(A_l = e_{il}|C_{j_i})
This can be rewritten as

\log P(H|D) = P_{prior} + \sum_{j=1}^{k} \sum_{l=1}^{a} \sum_{s=1}^{v_l} count(A_l = V_{ls}|C_j) \log P(A_l = V_{ls}|C_j)

where V_{ls}, s = 1..v_l, are the possible values of the l-th attribute, count(A_l = V_{ls}|C_j) is the number of examples with l-th attribute value V_{ls} in C_j, and P_{prior} = \log(P_H(H)/P_D(D)). This can be rewritten as

\log P(H|D) = P_{prior} + n \cdot \sum_{j=1}^{k} P(C_j) \sum_{l=1}^{a} \sum_{s=1}^{v_l} P(A_l = V_{ls}|C_j) \log P(A_l = V_{ls}|C_j).
It can be concluded that P(H|D) is a linear function of the objective function presented in [5]:

\frac{1}{k} \sum_{j=1}^{k} P(C_j) \sum_{l=1}^{a} \sum_{s=1}^{v_l} \Big[ P(A_l = V_{ls}|C_j) \log P(A_l = V_{ls}|C_j) - P(A_l = V_{ls}) \log P(A_l = V_{ls}) \Big]    (5)
except for the factor 1/k, which is introduced because we prefer simple hypotheses with few clusters. This can be seen as an information-theoretic analog of the well-known Partition Utility [12], which is

\frac{1}{k} \sum_{j=1}^{k} P(C_j) \sum_{l=1}^{a} \sum_{s=1}^{v_l} \Big[ P(A_l = V_{ls}|C_j)^2 - P(A_l = V_{ls})^2 \Big]    (6)
In this paper, however, we will use equation (5), as it is more straightforward to extend to belief network probabilities.
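To make the computation concrete, here is a minimal Python sketch of equation (5) for propositional data, with probabilities estimated as relative frequencies. The function and variable names (objective, dist) are ours and do not come from any system described in this paper.

from collections import Counter
from math import log

def objective(clusters, attributes):
    # Equation (5): (1/k) * sum_j P(C_j) * sum_l sum_s
    #   [ P(A_l=V_ls|C_j) log P(A_l=V_ls|C_j) - P(A_l=V_ls) log P(A_l=V_ls) ]
    # `clusters` is a list of clusters, each a list of examples;
    # an example is a dict mapping attribute name -> value.
    examples = [e for c in clusters for e in c]
    n, k = len(examples), len(clusters)

    def dist(exs, attr):
        counts = Counter(e[attr] for e in exs)
        return {v: cnt / len(exs) for v, cnt in counts.items()}

    total = 0.0
    for cluster in clusters:
        p_c = len(cluster) / n
        for attr in attributes:
            within = dist(cluster, attr)
            overall = dist(examples, attr)
            for v, p_all in overall.items():
                p_in = within.get(v, 0.0)
                inside = p_in * log(p_in) if p_in > 0 else 0.0
                total += p_c * (inside - p_all * log(p_all))
    return total / k

On the supermarket data of table 1, such a function can be used to compare alternative partitions under the independence assumption, in the spirit of the comparison discussed around table 2 below.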
Fig. 1. Belief network representing the prior knowledge of the supermarket (nodes include social class).
where P(A_l = e_{il}|anc(A_l)) is the probability of the l-th attribute having the value e_{il} given the values of the attributes in anc(A_l). Given a set of examples D, a belief network G(V,R) on D, and k belief networks G_j(V,R) on the k clusters C_j of D, equation (4) becomes

P(E_i|C_j) = \prod_{l=1}^{a} P(A_l = e_{il}|C_j, anc_j(A_l))    (7)
where anc_j(A_l) is the set of direct ancestors of the l-th attribute in the belief network G_j(V,R). The expression (5) becomes

\frac{1}{k} \sum_{j=1}^{k} P(C_j) \sum_{l=1}^{a} \sum_{s=1}^{v_l} \Big[ P(A_l = V_{ls}|C_j, anc_j(A_l)) \log P(A_l = V_{ls}|C_j, anc_j(A_l)) - P(A_l = V_{ls}|anc(A_l)) \log P(A_l = V_{ls}|anc(A_l)) \Big]    (8)
It is hard to automatically infer the optimal belief network structure for a set of examples. Therefore, we use the same structure for the networks G_j(V,R) as the structure of G(V,R) and adapt the probabilities. This is sufficient for a good objective function in our applications. An algorithm that could improve the structure of the belief network of a cluster such that the estimated probability of the examples increases would provide additional indications of similarity between the examples in that cluster, and would probably give even better results.
Table 2 illustrates the effect on our small supermarket example. Two possible partitions of the database are considered. The left column gives the values for equation (5), assuming independent attributes. The right column gives the values for equation (8), taking into account the prior knowledge of the supermarket, represented in figure 1. When the prior knowledge is taken into account, other clusters will be formed. These new clusters can give more insight into the things the supermarket really wants to know.
i.e., "there exists a customer X who is parent of a coke buyer Y". If indeed such X, Y, Z are found in database D, query Q1 succeeds w.r.t. D; otherwise, the query fails w.r.t. D. Notice that query Q1 succeeds with answer set {X = a, Y = b}.
Query Q1 can be interpreted as a boolean attribute Q_1^A of customers: each customer either is or is not "parent of a coke buyer". To find Q1 for a particular customer, we substitute the variable X in Q1 with the customer's identifier, and evaluate against the database. For instance, query

customer(a) ∧ parent_of(a,Y) ∧ buys(Y,coke)

i.e., "customer a is parent of a coke buyer Y", succeeds with Y = b. Hence attribute Q_1^A has value 1 for customer a. Observe that substitutions of X with the three remaining customer identifiers b, c, d all result in failing queries, such that for these customers attribute Q_1^A has value 0.
In general, we can view conjunctive queries as attributes of examples (here customers) if they contain a set of variables (here the singleton {X}) such that each substitution of those variables corresponds to an identifier of the example. The most straightforward way to achieve this is to (1) have a separate relation in the database with example identifiers (here Customer), and (2) make the corresponding atom (here customer(X)) obligatory in each conjunctive query.
In that case, we can roughly define the frequency of a conjunctive query as the number of example identifiers for which the query succeeds.
Given the above interpretation of conjunctive queries as boolean attributes, and given three more conjunctive queries with the obligatory customer(X) atom:

Q2: customer(X) ∧ parent_of(X,Y) ∧ buys(Y,wine)
Q3: customer(X) ∧ parent_of(X,Y)
Q4: customer(X) ∧ buys(X,Y)

we can generate the attribute-value description of our relational database D shown in table 4. Obviously, table Customer_avl is not equivalent to database D, but for some purposes, for instance clustering customers, we may judge it comes sufficiently close. We come back to this issue in paragraph 3.3.
Fig. 2. Belief network over the query attributes Q_1^A, Q_2^A, Q_3^A, Q_4^A, with their conditional probability tables.
To make the link between conjunctive queries (lattices) and attributes (belief networks) operational, we still have to solve the problem of how to select queries such as Q1-Q4 from an infinite space of possible queries on database D. Part of the solution is offered by a so-called declarative language bias formalism, well known in relational learning. Such a formalism allows the user to constrain the space of queries to sensible ones, e.g., by imposing type and mode constraints on variables. We further assume that the language bias definitions provided by the user indirectly, via the query lattice, determine the structure of the belief network used throughout the clustering process. As explained immediately below, extra cluster-specific constraints will be used to "suppress" some of the nodes in this possibly gigantic (infinite) network.
Within particular clusters, we use an additional constraint based on the observation that the influence of rarely succeeding, hence called infrequent, queries on clustering is negligible. Observe in that respect that the influence of a conjunctive query is at best proportional to r · log(r), with r the relative frequency of the query. Therefore, when setting the parameters of the belief network for a particular cluster, we can safely ignore all nodes whose relative frequency is below a pre-defined threshold t, i.e., assume they have frequency 0. Since we only set and use the parameters associated with queries whose frequency exceeds t, we need an algorithm that selects these frequent queries from the user-defined language of queries. Warmr [9] is such an algorithm.
Warmr is an instance of the family of levelwise frequent pattern discovery algorithms [16] that look at a level of the lattice at a time, starting from the most general pattern. Warmr iterates between candidate generation and candidate evaluation phases: in candidate generation, the lattice structure is used for pruning non-frequent queries from the next level; in the candidate evaluation phase, frequencies of candidates are computed w.r.t. the database. Pruning is based on monotonicity of the generality-under-θ-subsumption relation w.r.t. frequency: if a query is not frequent then none of its specializations is frequent. So, while generating candidates for the next level (this is essentially done by adding atoms to frequent queries of the last level) all the queries that are specializations of infrequent patterns can be pruned.
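A levelwise search in this spirit can be sketched in a few lines of Python. This is an Apriori-style skeleton only, reusing the succeeds evaluator sketched in section 3; it glosses over Warmr's declarative bias and its θ-subsumption-based candidate filtering.

def levelwise(atoms, examples, min_freq, max_level):
    # `atoms` are the literals the language bias allows us to add;
    # every query starts from the obligatory identifier atom.
    key = ("customer", ("X",))
    level, frequent = [[key]], []
    for _ in range(max_level):
        survivors = []
        for q in level:
            freq = sum(succeeds(q, {"X": e}) for e in examples) / len(examples)
            if freq >= min_freq:      # an infrequent query is dropped here,
                survivors.append(q)   # so none of its specializations is
        frequent.extend(survivors)    # ever generated (monotonicity)
        # candidate generation: extend frequent queries of the last level
        level = [q + [a] for q in survivors for a in atoms if a not in q]
    return frequent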
4 The Remind system

We implemented the ideas in the previous sections in a system called Remind (RElational clustering MINus Dependencies). On the highest level, Fisher's Cobweb algorithm was used. Essentially, it incrementally creates a hierarchical clustering, starting with an empty tree and updating it each time an example is processed. The algorithm sorts examples down the tree, at each step choosing the best of four possible operations using an objective function.
5 Experiments

In this section we report on some experimental results with our Remind system. Evaluation of clustering systems is difficult. One of the most frequently used methods is to use clustering to predict unknown features of examples, and use the accuracy as performance measure. As we explain in section 6.2, we do not expect a large performance gain for normal propositional problems. Therefore, in this section we focus on showing that our technique can be used in situations where the normal clustering algorithms fail.

We used some synthetic datasets for our experiments. These can be obtained from http://www.cs.kuleuven.ac.be/~ml/.
5.1 Benchmarks

We do not use the standard Cobweb objective function partition utility, but the information-theoretic equivalent. [12] proposes this as a good alternative, but does not report experimental results. Because of this, we first ran our algorithm on some standard UCI benchmarks (soybeans, voting, breast), for both objective functions. We observed that Cobweb induces similar trees for both objective functions. Also, the trees gave similar class prediction accuracies in the leave-one-out cross-validations we did.
Fig. 3. A taxonomy (node labels include 'drink').
5.3 Bongard

We also did some experiments on Bongard datasets. A Bongard dataset contains scenes of figures (triangles, circles, ...), in certain relations to each other, e.g. the query

Q: triangle(X), circle(Y), circle(Z), not(Y = Z), in(X,Y)

succeeds for a scene if there are two circles and a triangle that is in one of the circles. We generated our database by randomly generating positive examples (where there is a circle in a triangle) and negative examples (where a triangle is in a circle).
Fig. 4. Bongard clusterings: a. Remind, b. standard Cobweb (cluster labels in the figure include 'positive', 'negative', 'few circles', 'many circles', 'few triangles', 'many triangles').

We ran both standard Cobweb and Remind. It could be observed that:
- Cobweb generated clusters with few figures and clusters with many figures (see e.g. figure 4b), with positive and negative examples mixed.
- Remind generated clusters with positive examples and clusters with negative examples (see figure 4a).
When we used the clusterings generated by both systems to predict the class value, the Cobweb clustering scored 59% while the Remind clustering scored 100%.
This result can be intuitively explained as follows. The query Qt1: triangle(X) is more general than Qt2: triangle(X), triangle(Y), not(X = Y). Also, Qt2 is more general than the query Qt3: triangle(X), triangle(Y), not(X = Y), triangle(Z), not(X = Z), not(Y = Z). For standard Cobweb, the queries Qti, saying that there should be i different triangles, are all different attributes. Since there is a wide range of possible numbers of triangles, the number of triangles becomes very important for Cobweb. On the other hand, since these queries are dependent on each other, their total weight will be much smaller for Remind. In fact, Remind approximately acts as if there were one (multi-valued) attribute number_of_triangles(N).
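One way to make this intuition precise (the derivation is ours, not spelled out in the text): Qti succeeds exactly when the scene contains at least i triangles, so Qti is false whenever Qt(i-1) is false, and all queries below Qt(i-1) are true whenever Qt(i-1) is true. For N the number of triangles and m queries, the chain rule of entropy then gives

H(Q_{t1}, \dots, Q_{tm}) = \sum_{i=1}^{m} H(Q_{ti} \mid Q_{t1}, \dots, Q_{t,i-1}) = \sum_{i=1}^{m} H(Q_{ti} \mid Q_{t,i-1}) = H(\min(N, m))

so the chain of dependent boolean attributes carries exactly the information of the single multi-valued attribute, whereas treating the Qti as independent charges the larger sum of marginal entropies \sum_i H(Q_{ti}), which grows with the range of triangle counts.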
5.4 Mutagenesis

We also did experiments on the mutagenesis database [20]. In this dataset, an example is a molecule that is either mutagenic or non-mutagenic. There is a relation describing each atom, a relation describing each bond, and a relation describing some global chemical properties of the molecule. The conjunctive queries that can be generated from these relations are massively dependent. We generated all queries with minimal relative frequency of 5% and maximal length 4.
We ran both standard Cobweb and Remind on a dataset with the truth values of these queries as attributes. We could observe that, because of the massive dependencies between attributes, Cobweb produced a very unbalanced tree, 83 levels deep. Remind created more balanced trees.
However, it turned out to be difficult to use these clusterings to predict the mutagenicity of the molecules. This can be caused by the fact that similarity of structure does not necessarily mean that molecules are similar in mutagenicity. Another reason is that the simple Bayesian network from the query lattice does not cover all dependencies between queries. We did preliminary experiments with more refined versions of the belief network and these give more promising results (67% accuracy in predicting mutagenicity). For instance, given queries Q1: atom(X), bond(X,Y), Q2: atom(X), bond(X,Y), element(Y,c) and Q3: atom(X), bond(X,Y), element(Y,o), and the fact that Q1 succeeds: the fact that Q2 does not succeed now makes the success of Q3 more probable (atom Y is of exactly one element type: carbon, hydrogen, oxygen, ...). A more thorough study of this is part of further work.
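A toy computation, with invented probabilities and a single bonded atom Y, illustrates the kind of dependency meant above:

# The element of Y is exactly one of carbon, hydrogen, oxygen;
# probabilities are invented for illustration only.
p = {"c": 0.5, "h": 0.3, "o": 0.2}

p_q3_given_q1 = p["o"]                         # P(Q3 | Q1)       = 0.20
p_q3_given_q1_notq2 = p["o"] / (1 - p["c"])    # P(Q3 | Q1, ¬Q2)  = 0.40

# Knowing that Q2 fails doubles the probability that Q3 succeeds,
# a dependency the plain lattice-based network does not capture.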
6 Related work

6.1 Promotion of interesting features

Other work already exists that uses dependencies between attributes to improve clustering. It is interesting to look at the differences. One possibility is described in [21], where the attributes are selected that are most predictive of the other attributes. This means that dependent features are given a larger weight, which is the converse of what we do. This method is used in the hope that irrelevant features will disturb the clustering process less when given a small weight. To understand the difference between this approach and ours, it is necessary to make a distinction between two kinds of dependencies:
- Dependencies caused by e.g. the language bias, which should not be discovered by the clustering system. Our approach is to try to neutralize them.
- Dependencies in the data, unknown to the user, which are hoped to be discovered. It can be interesting to try to promote them.
In fact, it could be useful to try to combine both methods.
6.4 Multinets

The probability tables of the basic Bayesian networks we use for ILP problems are sparse. In fact, other representations such as Bayesian multinets [14] could be used to represent the dependencies in our networks. These representations allow one to represent probability tables with many zeros in a more elegant way. For simplicity, in this paper only the better known standard Bayesian net representation was used.
6.5 Distances

The method described in this paper could also be used to make a distance measure that adapts itself to the application domain (by adapting the network structure and weights). Indeed, the utility of taking two examples together in a cluster is a measure of their similarity. This measure is less representation-dependent than other proposed measures (e.g. [19]) and is in many cases faster to compute. We hope that this measure can also perform well in distance-based applications, e.g. for instance-based learning. Experiments on this are part of further work.
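Such a utility-based similarity could, for instance, be defined as the gain in the objective from merging two singleton clusters. The sketch below reuses the objective function sketched in section 2; the exact normalisation is our choice, not the paper's.

def similarity(e1, e2, attributes):
    # Gain from putting e1 and e2 together rather than apart. On a
    # two-example universe the "together" partition scores 0, so this
    # reduces to the negated cost of keeping the examples apart:
    # identical examples score 0, differing ones score below 0.
    together = objective([[e1, e2]], attributes)
    apart = objective([[e1], [e2]], attributes)
    return together - apart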
References

1. R. Agrawal and R. Srikant. Mining generalised association rules. In Proceedings of the 21st VLDB Conference, 1995.
2. H. Blockeel and L. De Raedt. Top-down induction of first order logical decision trees. Artificial Intelligence, 101(1-2):285-297, June 1998.
3. H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of clustering trees. In Proceedings of the 15th International Conference on Machine Learning, pages 55-63, 1998.
4. P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, 1996.
5. J. Corter and M. Gluck. Explaining basic categories: feature predictability and information. Psychological Bulletin, (111):291-303, 1992.
6. J. Cussens. Loglinear models for first order probabilistic reasoning. In Proc. of UAI, 1999.
7. L. De Raedt. Logical settings for concept learning. Artificial Intelligence, 95:187-201, 1997.
8. L. De Raedt and W. Van Laer. Inductive constraint logic. In Klaus P. Jantke, Takeshi Shinohara, and Thomas Zeugmann, editors, Proceedings of the Sixth International Workshop on Algorithmic Learning Theory, volume 997 of Lecture Notes in Artificial Intelligence, pages 80-94. Springer-Verlag, 1995.
9. L. Dehaspe and H. Toivonen. Discovery of frequent datalog patterns. Data Mining and Knowledge Discovery, 3(1):7-36, 1999.
10. W. Emde and D. Wettschereck. Relational instance-based learning. In L. Saitta, editor, Proceedings of the Thirteenth International Conference on Machine Learning, pages 122-130. Morgan Kaufmann, 1996.
11. D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
12. D. H. Fisher. Iterative optimization and simplification of hierarchical clusterings. Journal of Artificial Intelligence Research, 4:147-179, 1996.
13. P.A. Flach and N. Lachiche. 1BC: a first order Bayesian classifier. In D. Page, editor, Proceedings of the Ninth International Workshop on Inductive Logic Programming, volume 1634, pages 93-103. Springer-Verlag, 1999.
14. D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 82:45-74, 1996.
15. S. Kramer, B. Pfahringer, and C. Helma. Stochastic propositionalisation of non-determinate background knowledge. In Proceedings of the Eighth International Conference on Inductive Logic Programming, volume 1446 of Lecture Notes in Artificial Intelligence, pages 80-94. Springer-Verlag, 1998.
16. H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241-258, 1997.
17. S.-H. Nienhuys-Cheng and R. Wolf. Foundations of Inductive Logic Programming, volume 1228 of Lecture Notes in Computer Science and Lecture Notes in Artificial Intelligence. Springer-Verlag, New York, NY, USA, 1997.
18. G. Plotkin. A note on inductive generalization. In B. Meltzer and D. Michie, editors, Machine Intelligence, volume 5, pages 153-163. Edinburgh University Press, 1970.
19. J. Ramon and M. Bruynooghe. A framework for defining distances between first-order logic objects. In Proceedings of the Eighth International Conference on Inductive Logic Programming, Lecture Notes in Artificial Intelligence, pages 271-280. Springer-Verlag, 1998.
20. A. Srinivasan, S.H. Muggleton, R.D. King, and M.J.E. Sternberg. Mutagenesis: ILP experiments in a non-determinate biological domain. In S. Wrobel, editor, Proceedings of the Fourth International Workshop on Inductive Logic Programming, volume 237 of GMD-Studien, pages 217-232. Gesellschaft für Mathematik und Datenverarbeitung MBH, 1994.
21. L. Talavera. Feature selection as a preprocessing step for hierarchical clustering. In Proceedings of the 16th International Conference on Machine Learning, pages 389-397. Morgan Kaufmann, 1999.