
Using Belief Networks to Neutralize Known
Dependencies in Conceptual Clustering

Jan Ramon and Luc Dehaspe
{Jan.Ramon,Luc.Dehaspe}@cs.kuleuven.ac.be
Department of Computer Science, Katholieke Universiteit Leuven
Celestijnenlaan 200A, B-3001 Leuven, Belgium

Abstract. Conceptual clustering is an important descriptive data-mining
task. Several good conceptual clustering algorithms exist in the attribute-
value representation. These algorithms typically do not take into account
dependencies known to the user. This is a problem if the user does not
want these dependencies to influence the clustering process. A typical ex-
ample is where a taxonomy over attributes is known to exist and should
not be rediscovered. In this paper we propose to solve this problem by
using belief networks to describe the clusters. This solution has an impor-
tant application in relational clustering. It allows the use of massively
dependent first order queries as attributes. We present an implemen-
tation of our approach as an extension of the Cobweb algorithm and
evaluate our method on propositional and relational data.

1 Introduction

Clustering is an important descriptive data-mining task. Its goal is to divide
a given set of examples into several subsets (clusters) such that examples of
the same cluster are similar, while examples of different clusters are not. For
conceptual clustering, it is also required that a (symbolic) description is given for
each cluster. Several good conceptual clustering algorithms exist in the attribute-
value representation (see e.g. Cobweb [11], AutoClass [4]).
However, these algorithms assume the attributes of the data to be indepen-
dent within each cluster. This assumption does not hold in all cases. There can
be redundant information, known by users and therefore uninteresting to them.
Also, the redundancy can be caused by the language in which the information
is represented.
In this paper we propose a method to deal with these problems by allowing
attributes to be dependent on each other and by allowing the user to specify his
prior knowledge.
Recent machine learning algorithms (such as Tilde [2], ICL [8], Warmr
[9]) use a more expressive first order (or multiple relation) representation. Some
systems that do conceptual clustering on examples represented in first order
logic exist, but they often still need some information in propositional form. E.g. the
system TIC [3] builds clustering trees but uses a propositional distance measure
between examples and/or clusters.
The search for a good clustering is typically guided by one of two heuristics:
an evaluation measure (also called objective function) on clusters or a distance
function. In the propositional case, both heuristics are functions of the values of
attributes. However, in the first order case, there is no fixed number of attributes,
and so one of these two heuristics would have to be upgraded. While distance
measures between first order objects exist [19], [10], they are often quite ad hoc,
representation-dependent and computationally expensive.
In this paper, we propose an alternative approach. We use first order queries
as attributes. Because of the generality relations between these queries, the ob-
tained attributes are highly dependent on each other. Therefore, we use our
method to deal with dependent attributes. This has the advantage that no dis-
tance is needed; only the objective function has to be upgraded, in a representation-
independent way.
The paper is structured as follows: Section 2 describes a method to deal
with non-independent attributes. Section 3 describes the extension to multiple
relations. Next, Section 4 describes an implementation where we incorporate
these methods in the Cobweb algorithm. Section 5 shows some experimental
validation of our method. Section 6 discusses some related work. Finally, Section
7 gives some conclusions and ideas for further work.

2 Using known dependencies

Many clustering algorithms assume the attributes of the data to be independent.
This assumption does not hold in all cases. In this section, we first describe some
examples where this assumption does not hold. Next, we briefly review Bayesian
clustering and in the last part we describe the extension of the cluster description
to belief networks.

2.1 Uninteresting dependencies known by the user

There can be dependencies in the data, known by the users and therefore unin-
teresting to them. Suppose that a supermarket has the database shown in table
1.

customer         1 2 3 4 5 6 7 8 9
income           H H L H L L H L H
social class     H H L H L L L L H
has a car        y y n y n n n n y
education level  H H L H L L H L H
buys pizza       n y n y n y n y n
buys beer        n y n y n y n y n

Table 1. Supermarket example
Most clustering algorithms will produce two clusters: the cluster {1, 2, 4, 7,
9}, which could be labeled as the set of the 'rich' people, and the cluster {3, 5,
6, 8} of 'poor' people. Perhaps an institution doing demographic research would
be happy with this. However, the supermarket already knows this and is more
interested in less trivial information, e.g. the fact that pizza and beer are often
bought together.

2.2 Dependencies caused by the representation

Dependencies can also be caused by the language in which the examples are
represented. One of the simplest situations where this can happen is in the
case of attributes that are not applicable for some examples. For instance, an
insurance company can have a database of clients, with information on whether
they have a life insurance, a car insurance, etc. If they have a life insurance, other
information is known, e.g. the amount and the chosen type of the agreement.
For those who don't have such an insurance, these attributes are not applicable.
The fact that two examples both have `not applicable' for some attribute does
not affect their similarity.
Other examples of dependencies caused by the representation are
when a taxonomy over the attributes is available (see Section 5.2) and when
truth values of non-independent queries are used as attributes (see Section 3.2).

2.3 Propositional clustering objective function

In this section we briefly review some elements from clustering in the proposi-
tional case. In the Bayesian clustering approach, a hypothesis is a description
of a clustering of the set of all examples D. In the Bayesian clustering setting
(see [4]), this typically means that the hypothesis contains for each cluster a
probability distribution for the examples occurring in that cluster, and also a
probability distribution giving the probabilities of an example belonging to each
cluster. The clustering task is then to find the posterior distribution of the hy-
potheses given the examples. Typically, only the most probable hypothesis is
generated and presented to the user.
Let D = \{E_1, \ldots, E_N\} contain N examples. Let \mathcal{D} be the (unknown) prob-
ability distribution from which the examples are drawn. Let \mathcal{H} be the set of all
possible hypotheses. The probability that some H \in \mathcal{H} explains the data D is
P(H|D) and satisfies the rule of Bayes:

    P(H|D) = \frac{P(D|H) \, P_H(H)}{P_D(D)}    (1)

with P_H(H) the prior probability of H. P_H(H) can depend on the hypothesis
complexity (a more complex hypothesis is less likely) or on some other prior
knowledge. P_D(D), the prior probability of the data, can be treated as a normal-
isation constant and doesn't need to be calculated explicitly. All examples are
assumed to be independent, so

    P(D|H) = \prod_{i=1}^{N} P(E_i|H)    (2)
Let H partition D into k clusters C_1, \ldots, C_k. Let P(E_i \in C_j) be the proba-
bility that the example E_i belongs to the cluster C_j according to the hypothesis
H. Then

    P(E_i|H) = \sum_{j=1}^{k} P(E_i \in C_j) \, P(E_i|C_j)

where P(E_i|C_j) is the probability of drawing E_i from the distribution of the
cluster C_j. Since we want to use our method to make an extension to Cobweb
and since Cobweb assigns each example to exactly one class (unlike e.g. Auto-
Class), we can simplify this formula: P(E_i \in C_j) equals 1 iff C_j is the cluster
E_i belongs to and 0 otherwise:

    P(E_i|H) = P(E_i|C^i)    (3)

where C^i is the cluster E_i is assigned to. Let E_i = (e_{i1}, \ldots, e_{ia}) with a the
number of attributes. If the attributes are assumed to be independent within
each cluster, this probability can be written as

    P(E_i|C_j) = \prod_{l=1}^{a} P(A_l = e_{il}|C_j)    (4)

where P(A_l = e_{il}|C_j) is the probability that the l-th attribute has the value e_{il}
in the cluster C_j. Combining (1), (2), (3) and (4) and taking the logarithm gives

    \log P(H|D) = \log(P_H(H)/P_D(D)) + \sum_{i=1}^{N} \sum_{l=1}^{a} \log P(A_l = e_{il}|C^i)
This can be rewritten as

    \log P(H|D) = P_{prior} + \sum_{j=1}^{k} \sum_{l=1}^{a} \sum_{s=1}^{v_l} count(A_l = V_{ls}|C_j) \log P(A_l = V_{ls}|C_j)

where V_{ls}, s = 1..v_l, are the possible values of the l-th attribute, count(A_l =
V_{ls}|C_j) is the number of examples with l-th attribute V_{ls} in C_j, and P_{prior} =
\log(P_H(H)/P_D(D)). This can be rewritten as

    \log P(H|D) = P_{prior} + N \sum_{j=1}^{k} P(C_j) \sum_{l=1}^{a} \sum_{s=1}^{v_l} P(A_l = V_{ls}|C_j) \log P(A_l = V_{ls}|C_j)
It can be concluded that P(H|D) is a linear function of the objective function
presented in [5]:

    \frac{1}{k} \sum_{j=1}^{k} P(C_j) \sum_{l=1}^{a} \sum_{s=1}^{v_l} \Big[ P(A_l = V_{ls}|C_j) \log P(A_l = V_{ls}|C_j)
        - P(A_l = V_{ls}) \log P(A_l = V_{ls}) \Big]    (5)

except for the factor 1/k, which is introduced because we prefer simple hypothe-
ses with few clusters. This can be seen as an information-theoretic analog to the
well-known Partition Utility [12], which is

    \frac{1}{k} \sum_{j=1}^{k} P(C_j) \sum_{l=1}^{a} \sum_{s=1}^{v_l} \Big[ P(A_l = V_{ls}|C_j)^2 - P(A_l = V_{ls})^2 \Big]    (6)

In this paper, however, we will use equation (5) as it is more straightforward to
extend to represent belief network probabilities.
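As a concrete illustration, the following minimal sketch (Python, with invented toy data; it assumes categorical attributes and hard cluster assignments as in Cobweb, and is not part of the actual system) evaluates objective (5) for a candidate partition:

    from collections import Counter
    from math import log

    def objective(clusters):
        """Information-theoretic objective, cf. equation (5): for each cluster C_j,
        the expected gain of log P(A_l = V_ls | C_j) over log P(A_l = V_ls),
        weighted by P(C_j) and averaged over the k clusters."""
        examples = [e for c in clusters for e in c]
        n, k, n_attr = len(examples), len(clusters), len(examples[0])
        score = 0.0
        for cluster in clusters:
            p_c = len(cluster) / n                               # P(C_j)
            for l in range(n_attr):
                overall = Counter(e[l] for e in examples)        # counts of A_l in D
                within = Counter(e[l] for e in cluster)          # counts of A_l in C_j
                for v, cnt in overall.items():
                    p_v = cnt / n                                # P(A_l = v)
                    p_v_c = within[v] / len(cluster)             # P(A_l = v | C_j)
                    in_cluster = p_v_c * log(p_v_c) if p_v_c > 0 else 0.0
                    score += p_c * (in_cluster - p_v * log(p_v))
        return score / k

    # Toy usage: each example is a tuple of attribute values (income, buys_pizza).
    rich_cluster = [("H", "n"), ("H", "y"), ("H", "n")]
    poor_cluster = [("L", "y"), ("L", "n"), ("L", "y")]
    print(objective([rich_cluster, poor_cluster]))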

2.4 Using belief networks as models

In the normal setting for clustering, the description of each cluster consists of
an independent probability distribution for each attribute. To avoid problems
that can occur when attributes are not independent, we propose to use the
more expressive description language of belief networks. We first introduce some
definitions.

Definition 1 (ancestor, successor). If u \in V is a node in a graph G(V, R),
we define the set of all ancestors anc^*(u) = \{v \in V \setminus \{u\} \mid (v, u) \in R\}, the set of
direct ancestors anc(u) = anc^*(u) \setminus \bigcup_{v \in anc^*(u)} anc^*(v), the set of all successors
succ^*(u) = \{v \in V \mid (u, v) \in R\}, and the set of all direct successors
succ(u) = succ^*(u) \setminus \bigcup_{v \in succ^*(u)} succ^*(v).

Definition 2 (belief network). A belief network (also called a Bayesian net-
work) is a directed acyclic graph G(V, R) where the nodes are random variables.
Associated with each node, there is a probability table indicating the probability
distribution of the corresponding variable given the values of its direct ancestors.

An edge (u, v) in a belief network has the meaning "if the state of v is not
known for sure, knowledge about the state of u may influence the knowledge
about the state of v, no matter what else is known". Figure 1 shows an example
of a belief network, which could be a representation of the prior knowledge of
the supermarket in our example from Section 2.1.
According to a belief network, the probability of an example E_i is

    P(E_i) = \prod_{l=1}^{a} P(A_l = e_{il} \mid anc(A_l))
Fig. 1. Known dependencies: a belief network over the attributes income, social
class, education level and has a car, with a conditional probability table attached
to each node (giving its distribution given the values of its direct ancestors).

where P(A_l = e_{il} \mid anc(A_l)) is the probability of the l-th attribute having the
value e_{il} given the values of the attributes in anc(A_l). Given a set of examples
D, a belief network G(V, R) on D, and k belief networks G_j(V, R) on the k clusters
C_j of D, equation (4) becomes

    P(E_i|C_j) = \prod_{l=1}^{a} P(A_l = e_{il} \mid C_j, anc_j(A_l))    (7)

where anc_j(A_l) is the set of direct ancestors of the l-th attribute in the belief
network G_j(V, R). The expression (5) becomes

    \frac{1}{k} \sum_{j=1}^{k} P(C_j) \sum_{l=1}^{a} \sum_{s=1}^{v_l} \Big[ P(A_l = V_{ls} \mid C_j, anc_j(A_l)) \log P(A_l = V_{ls} \mid C_j, anc_j(A_l))
        - P(A_l = V_{ls} \mid anc(A_l)) \log P(A_l = V_{ls} \mid anc(A_l)) \Big]    (8)
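To make equation (7) concrete, the following minimal sketch (Python; the encoding of the network and the probability values are illustrative assumptions, not taken from the system) computes the probability of one example under a cluster's belief network; every cluster keeps the same structure but has its own tables:

    def p_example_given_cluster(example, network):
        """P(E_i | C_j) = prod_l P(A_l = e_il | C_j, anc_j(A_l)), cf. equation (7)."""
        p = 1.0
        for attr, (ancestors, cpt) in network.items():
            anc_values = tuple(example[a] for a in ancestors)
            p *= cpt[anc_values][example[attr]]
        return p

    # Illustrative cluster-specific network with invented probabilities, mirroring
    # part of figure 1: social class depends on income.
    cluster_net = {
        "income":       ((), {(): {"H": 0.9, "L": 0.1}}),
        "social_class": (("income",), {("H",): {"H": 0.8, "L": 0.2},
                                       ("L",): {"H": 0.0, "L": 1.0}}),
    }
    print(p_example_given_cluster({"income": "H", "social_class": "H"}, cluster_net))
    # prints 0.9 * 0.8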

It is hard to automatically infer the optimal belief network structure for a
set of examples. Therefore, we use the same structure for the networks G_j(V, R)
as the structure of G(V, R) and adapt the probabilities. This is sufficient for a
good objective function in our applications. An algorithm that could improve the
structure of the belief network of a cluster such that the estimated probability
of the examples increases would provide additional indications for similarity be-
tween the examples in that cluster, and would probably give even better results.
Table 2 illustrates the effect on our small supermarket example. Two possible
partitions of the database are considered. The left column gives the values for
equation (5), assuming independent attributes. The right column gives the values
for equation (8), taking into account the prior knowledge of the supermarket,
represented in figure 1. When the prior knowledge is taken into account, other
clusters will be formed. These new clusters can give more insight into the things
the supermarket really wants to know.

clustering                       indep. attr.  belief net

{1, 2, 4, 7, 9}, {3, 5, 6, 8}       -1.38        -1.18
{1, 3, 5, 7, 9}, {2, 4, 6, 8}       -1.97        -0.64

Table 2. Value of different ways of clustering with and without belief network

3 Upgrading to first order logic

In this section we further upgrade our approach to first order logic in a way
similar to [13]. Another approach would be to extend the models describing the
clusters further, from Bayesian nets to probabilistic logic models. However, though
several proposals for such models have been made (see e.g. [6]), many problems are
still open in this area (e.g. learning the models), which makes that alternative
problematic.
Due to space limitations, we briefly illustrate rather than extensively intro-
duce the first-order logic concepts used in the rest of the paper. These concepts
are taken mostly from the "learning from interpretations" setting of Inductive
Logic Programming. Formal aspects of this induction paradigm are discussed in
[7].

3.1 Attributes and conjunctive queries

In this section, we derive attributes from a relational database. In contrast to
[15], we do this exhaustively, which allows us in the next section to determine
the dependencies between them in an easy way.
Consider a relational database D that consists of the three relations shown in
table 3 and the conjunctive query¹

Q1: customer(X) ∧ parent_of(X, Y) ∧ buys(Y, coke)

¹ In practice we use Prolog to represent data and patterns. This means for instance
that we can add recursively defined predicates to the database and use function
symbols in the queries. In this paper we restrict ourselves to relational algebra merely
to simplify the discussion.
Customer    Buys            Parent Of
Id          Id   Item       Id   Id
a           a    wine       a    b
b           b    coke       a    c
c           b    pizza      c    d
d           d    wine

Table 3. A relational database D

i.e., "there exists a customer X who is parent of a coke buyer Y". If indeed
such X, Y are found in database D, query Q1 succeeds w.r.t. D; otherwise, the
query fails w.r.t. D. Notice query Q1 succeeds with answer set {X = a, Y = b}.
Query Q1 can be interpreted as a boolean attribute Q^A_1 of customers: each
customer either is or is not "parent of a coke buyer". To find Q1 for a particular
customer, we substitute the variable X in Q1 with the customer's identifier, and
evaluate against the database. For instance, the query

customer(a) ∧ parent_of(a, Y) ∧ buys(Y, coke)

i.e., "customer a is parent of a coke buyer Y", succeeds with Y = b. Hence
attribute Q^A_1 has value 1 for customer a. Observe that substitutions of X with the
three remaining customer identifiers b, c, d all result in failing queries, such that
for these customers attribute Q^A_1 has value 0.
In general, we can view conjunctive queries as attributes of examples (here
customers) if they contain a set of variables (here the singleton {X}) such that
each substitution of those variables corresponds to an identifier of an example.
The most straightforward way to achieve this is to (1) have a separate relation
in the database with example identifiers (here Customer), and (2) make the
corresponding atom (here customer(X)) obligatory in each conjunctive query.
In that case, we can roughly define the frequency of a conjunctive query as the
number of example identifiers for which the query succeeds.
Given the above interpretation of conjunctive queries as boolean attributes,
and given three more conjunctive queries with the obligatory customer(X) atom:

Q2: customer(X) ∧ parent_of(X, Y) ∧ buys(Y, wine)
Q3: customer(X) ∧ parent_of(X, Y)
Q4: customer(X) ∧ buys(X, Y)

we can generate the attribute-value description of our relational database D
shown in table 4. Obviously, table Customer_avl is not equivalent to database
D, but for some purposes, for instance clustering customers, we may judge it
comes sufficiently close. We come back to this issue in Section 3.3.
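This propositionalisation of table 3 into table 4 can be sketched as follows (a toy Python sketch for illustration only; the actual system represents data and queries in Prolog, cf. footnote 1):

    # Relational database D from table 3, represented as sets of tuples.
    customers = {"a", "b", "c", "d"}
    buys = {("a", "wine"), ("b", "coke"), ("b", "pizza"), ("d", "wine")}
    parent_of = {("a", "b"), ("a", "c"), ("c", "d")}

    # Each conjunctive query becomes a boolean attribute of a customer x:
    # substitute x for X and check whether the remaining conjunction is satisfiable.
    queries = {
        "Q1": lambda x: any((x, y) in parent_of and (y, "coke") in buys
                            for y in customers),                   # parent of a coke buyer
        "Q2": lambda x: any((x, y) in parent_of and (y, "wine") in buys
                            for y in customers),                   # parent of a wine buyer
        "Q3": lambda x: any((x, y) in parent_of for y in customers),  # parent of someone
        "Q4": lambda x: any(b == x for (b, _) in buys),               # buys something
    }

    # Attribute-value description of D, cf. table 4.
    for x in sorted(customers):
        print(x, [int(q(x)) for q in queries.values()])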

3.2 Belief networks and query lattices


Customer_avl
Id   Q^A_1  Q^A_2  Q^A_3  Q^A_4
a      1      0      1      1
b      0      0      0      1
c      0      1      1      0
d      0      0      0      1

Table 4. Attribute-value representation of D

The customer data are now in a format similar to that in table 1 and we can apply
the ideas introduced above. If for instance a dependency is assumed between the
attributes "is parent of a coke buyer" (Q^A_1) and "is parent of a wine buyer" (Q^A_2),
then this assumption can be expressed as before.
On top of these subjective dependencies, however, a number of objective de-
pendencies between the attributes Q^A_i exist, based on the logical "more-general-than"
relation between the corresponding conjunctive queries Q_i. For efficiency reasons, we
here concentrate on a stronger variant of logical implication called θ-subsumption
[18]. The advantage of θ-subsumption is that it can be computed directly from
the syntax of queries: we say query Q_g is more general under θ-subsumption than
query Q_s if and only if there exists a substitution θ of the variables in Q_g such
that all atoms in Q_g θ (i.e., the result of applying substitution θ to query Q_g)
occur in query Q_s. For instance, by this definition, query Q3 is more general than
both queries Q1 and Q2, in both cases with the empty substitution. No other
generality relations exist between queries Q1-Q4. Observe in particular that no
substitution can be found that makes Q4 a subset of Q1 or Q2. This agrees with
the intuition that, e.g., "parent of a coke buyer" is not a special case of "buyer
of something".
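A brute-force version of this test can be sketched as follows (Python; a naive enumeration of substitutions, affordable only for the tiny queries used here and much simpler than the subsumption tests used in real ILP systems):

    from itertools import product

    # An atom is (predicate, args); variables start with an uppercase letter.
    Q1 = [("customer", ("X",)), ("parent_of", ("X", "Y")), ("buys", ("Y", "coke"))]
    Q3 = [("customer", ("X",)), ("parent_of", ("X", "Y"))]
    Q4 = [("customer", ("X",)), ("buys", ("X", "Y"))]

    def is_var(term):
        return term[0].isupper()

    def apply_subst(theta, atom):
        pred, args = atom
        return (pred, tuple(theta.get(a, a) if is_var(a) else a for a in args))

    def theta_subsumes(general, specific):
        """True iff a substitution theta exists with (general)theta a subset of specific."""
        vars_g = sorted({a for _, args in general for a in args if is_var(a)})
        terms_s = sorted({a for _, args in specific for a in args})
        for combo in product(terms_s, repeat=len(vars_g)):   # try every mapping
            theta = dict(zip(vars_g, combo))
            if all(apply_subst(theta, atom) in specific for atom in general):
                return True
        return False

    print(theta_subsumes(Q3, Q1))   # True: Q3 is more general than Q1
    print(theta_subsumes(Q4, Q1))   # False: Q4 does not subsume Q1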

Fig. 2. Objectively known dependencies: the belief network over attributes
Q^A_1-Q^A_4 induced by the query lattice. Q^A_3 and Q^A_4 are root nodes (with
P(Q^A_3 = 1) = 0.50 and P(Q^A_4 = 1) = 0.75); Q^A_1 and Q^A_2 are children of
Q^A_3, each with probability 0.50 of being true when Q^A_3 is true and probability
0.00 when Q^A_3 is false.


The generality-under--subsumption relation indu es a latti e stru ture on a
set of onjun tive queries Qi , as explained, for instan e, in [17℄. This stru ture
trivially maps to a belief network over the orresponding boolean attributes
QA A
i , su h that ea h dire t an estor of node Qi orresponds to a maximally
spe i generalization under -subsumption of query Qi . The probability tables
asso iated with ea h node an be omputed dire tly from the data. The belief
network for attributes QA A
1 -Q4 is shown in Figure 2.

3.3 Attribute selection

To make the link between conjunctive queries (lattices) and attributes (belief
networks) operational, we still have to solve the problem of how to select queries
such as Q1-Q4 from an infinite space of possible queries on database D. Part of
the solution is offered by a so-called declarative language bias formalism, well
known in relational learning. Such a formalism allows the user to constrain the
space of queries to sensible ones, e.g., by imposing type and mode constraints
on variables. We further assume that the language bias definitions provided
by the user indirectly (via the query lattice) determine the structure of the
belief network used throughout the clustering process. As explained immediately
below, extra cluster-specific constraints will be used to "suppress" some of the
nodes in this possibly gigantic (infinite) network.
Within particular clusters, we use an additional constraint based on the ob-
servation that the influence of rarely succeeding, hence called infrequent, queries
on clustering is negligible. Observe in that respect that the influence of a con-
junctive query is at best proportional to r log(r), with r the relative frequency
of the query. Therefore, when setting the parameters of the belief network for
a particular cluster, we can safely ignore all nodes whose relative frequency is
below a pre-defined threshold t, i.e., assume they have frequency 0. Since we only
set and use the parameters associated with queries whose frequency exceeds t,
we need an algorithm that selects these frequent queries from the user-defined
language of queries. Warmr [9] is such an algorithm.
Warmr is an instance of the family of levelwise frequent pattern discovery
algorithms [16] that look at one level of the lattice at a time, starting from the most
general pattern. Warmr iterates between candidate generation and candidate
evaluation phases: in candidate generation, the lattice structure is used for prun-
ing non-frequent queries from the next level; in the candidate evaluation phase,
frequencies of candidates are computed w.r.t. the database. Pruning is based on
monotonicity of the generality-under-θ-subsumption relation w.r.t. frequency: if
a query is not frequent then none of its specializations is frequent. So, while gen-
erating candidates for the next level (this is essentially done by adding atoms
to frequent queries of the last level), all the queries that are specializations of
infrequent patterns can be pruned.
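The levelwise loop can be sketched as follows (a generic, simplified Python sketch, not the actual Warmr implementation; the interfaces refine, giving the direct specializations of a query, and frequency, giving its relative frequency in the database, are assumptions):

    def levelwise_frequent(root, refine, frequency, min_freq):
        """Generic levelwise search in the query lattice (cf. Warmr [9], simplified):
        evaluate one level at a time and refine only its frequent queries, so that
        specializations of infrequent queries are pruned without being evaluated."""
        frequent, level = [], [root]
        while level:
            kept = [q for q in level if frequency(q) >= min_freq]   # candidate evaluation
            frequent.extend(kept)
            level = [r for q in kept for r in refine(q)]            # candidate generation
        return frequent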
4 The Remind system

We implemented the ideas in the previous sections in a system called Remind
(RElational clustering MINus Dependencies). On the highest level, Fisher's Cobweb algo-
rithm was used. Essentially, it incrementally creates a hierarchical clustering,
starting with an empty tree and updating it each time an example is processed.
The algorithm sorts examples down the tree, at each step choosing the best of
four possible operations using an objective function.

4.1 The Cobweb algorithm

The four possible operations in Cobweb at each node are: adding the example
to a subnode, creating a new subnode with the new example as its only element,
merging two subnodes and adding the example to one of them, and splitting a subnode
into its subnodes and adding the example to one of them. More details can be found
in [11].

4.2 Incremental Warmr

Formula (8) was used as objective function. To calculate the value of this objec-
tive function, the frequency of the different values of all attributes is needed.
These statistics are stored for each node in the tree, and incrementally updated
as new examples are added.
When attributes are independent, the best one can do is to count the fre-
quency of each value of each attribute. However, when there is a relation between
the attributes (e.g. a taxonomy (see below for an example), or a query lattice
in a relational setting), the data is often sparse and a more efficient approach is
possible.
We extended the Warmr algorithm to work incrementally. Given are a set
of already processed examples and a set of frequent patterns on these exam-
ples with their frequencies. When a new example has to be added, the query
frequencies are updated in a level-wise order. If a query does not succeed for
the example, none of the frequencies of specialisations of that query have to be
updated.
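The incremental update can be sketched as follows (a minimal Python sketch; the interfaces roots, children, giving the direct specialisations of a query in the lattice, and succeeds, testing a query on one example, are assumptions and not the actual Remind code):

    def add_example(example, counts, roots, children, succeeds):
        """Level-wise incremental frequency update for one new example.
        If a query fails on the example, none of its specialisations can succeed,
        so that part of the lattice is skipped entirely."""
        frontier, seen = list(roots), set(roots)
        while frontier:
            query = frontier.pop()
            if succeeds(query, example):
                counts[query] = counts.get(query, 0) + 1
                for child in children(query):
                    if child not in seen:      # a query can have several parents
                        seen.add(child)
                        frontier.append(child)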

5 Experiments

In this section we report on some experimental results with our Remind system.
Evaluation of clustering systems is difficult. One of the most frequently used
methods is to use the clustering to predict unknown features of examples, and use
the accuracy as performance measure. As we explain in Section 6.2, we do not
expect a large performance gain for normal propositional problems. Therefore,
in this section we focus on showing that our technique can be used in situations
where normal clustering algorithms fail.
We used some synthetic datasets for our experiments. These can be obtained
from http://www.cs.kuleuven.ac.be/~ml/.
5.1 Benchmarks

We do not use the standard Cobweb objective function, partition utility, but
its information-theoretic equivalent. [12] proposes this as a good alternative, but
does not report experimental results. Because of this, we first ran our algorithm
on some standard UCI benchmarks (soybeans, voting, breast) with both objective
functions. We observed that Cobweb induces similar trees for both objective
functions. Also, the trees gave similar class prediction accuracies in the leave-
one-out cross-validations we did.

5.2 A transaction database

In a transaction database, each example is the set of all items some customer
bought in one transaction. Large transaction databases typically contain many
attributes and are very sparse. In these applications, to keep an overview of the
data, it is important to define a taxonomy (see also e.g. [1]). A taxonomy is an
is_a hierarchy, an example of which is given in figure 3.

drink
├── non-alcoholic drink
│   ├── water
│   └── coke
└── alcoholic drink
    ├── wine
    └── beer

Fig. 3. A taxonomy

set   items   taxonomy nodes

1      729        270
2      144         36
3      729         30
4     2916        121

Table 5. Some parameters of the transaction datasets

We created 4 synthetic transaction databases. Each database contains trans-
actions from three groups of customers with different characteristics. Table 5
gives for each database the number of items and the number of taxonomy nodes.
Then three different versions of Cobweb were run:
1. Standard Cobweb on the examples without the taxonomy (naive). In this
setting, each example has one attribute for each item that can be bought,
giving the number of items the customer bought.
2. Standard Cobweb on the database with all ancestors in the taxonomy added
as attributes (taxonomy); the derivation of these ancestor attributes is
sketched after this list. E.g. the attribute drink from figure 3 would be 1
iff some customer bought either one coke, one beer, one wine or one water.
3. Remind, which makes use of the knowledge in the taxonomy (Remind).
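The ancestor attributes of the taxonomy setting could be derived as follows (a small Python sketch; the parent-map encoding of the taxonomy of figure 3 is an assumption made for illustration):

    # Taxonomy of figure 3 as a child -> parent map.
    parent = {"water": "non-alcoholic drink", "coke": "non-alcoholic drink",
              "wine": "alcoholic drink", "beer": "alcoholic drink",
              "non-alcoholic drink": "drink", "alcoholic drink": "drink"}

    def ancestor_attributes(transaction):
        """Extend a transaction (set of bought items) with all taxonomy ancestors,
        so buying a coke also sets 'non-alcoholic drink' and 'drink'."""
        attrs = set(transaction)
        for item in transaction:
            node = item
            while node in parent:
                node = parent[node]
                attrs.add(node)
        return attrs

    print(sorted(ancestor_attributes({"coke"})))
    # ['coke', 'drink', 'non-alcoholic drink']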
Each time, the test examples were sorted down the tree and the class was pre-
dicted in a leave-one-out cross-validation. The results obtained are summarized
in table 6.
We observed that taxonomy performs best when a large part of the taxonomy
is over the attributes important for the clustering (the attributes we defined dif-
ferent distributions for in the different clusters), such as in set 1 (in this dataset,
189 of the 270 nodes in the taxonomy are generalisations of attributes impor-
tant for clustering). Taxonomy performs worse when there is a taxonomy over
irrelevant (in this case pure noise) attributes, such as in set 3, even if the taxon-
omy is rather small. Naive performs badly when there are too many items to
see similarities between transactions, such as in set 4 where we generated more
items (while customers kept buying on average the same number of items).
Therefore, we expect naive to do even worse when the databases are scaled up
further to real-world size. For all sets, Remind is about as good as the better of
the first two. We can conclude that Remind is a more robust system.

        naive  taxonomy  Remind

set 1    38%     43%      42%
set 2    76%     65%      74%
set 3    80%     51%      81%
set 4    44%     58%      60%

Table 6. Results on the transaction databases

5.3 Bongard

We also did some experiments on Bongard datasets. A Bongard dataset contains
scenes of figures (triangles, circles, ...), in certain relations to one another. E.g.
the query

Q: triangle(X), circle(Y), circle(Z), not(Y = Z), in(X, Y)

succeeds for a scene if there are two circles and a triangle that is in one of the
circles. We generated our database by randomly generating positive examples
(where there is a circle in a triangle) and negative examples (where a triangle is
in a circle).
Fig. 4. Bongard clusterings: a. Remind (clusters labeled positive and negative);
b. standard Cobweb (clusters characterized by the numbers of circles and
triangles in the scenes, e.g. few circles / many triangles).

We ran both standard Cobweb and Remind. It could be observed that

- Cobweb generated clusters with few figures and clusters with many figures
(see e.g. figure 4b), with positive and negative examples mixed.
- Remind generated clusters with positive examples and clusters with negative
examples (see figure 4a).
When we used the clusterings generated by both systems to predict the class
value, the Cobweb clustering scored 59% while the Remind clustering scored
100%.
This result can be intuitively explained as follows. The query Qt1: triangle(X)
is more general than Qt2: triangle(X), triangle(Y), not(X = Y). Also, Qt2 is
more general than the query Qt3: triangle(X), triangle(Y), not(X = Y),
triangle(Z), not(X = Z), not(Y = Z). For standard Cobweb, the queries Qti,
saying that there should be i different triangles, are all different attributes. Since
there is a wide range of possible numbers of triangles, the number of triangles
becomes very important for Cobweb. On the other hand, since these queries are
dependent on one another, their total weight will be much smaller for Remind. In
fact, Remind approximately acts as if there were one (multi-valued) attribute
number_of_triangles(N).

5.4 Mutagenesis

We also did experiments on the mutagenesis database [20]. In this dataset, an
example is a molecule that is either mutagenic or non-mutagenic. There is a
relation describing each atom, a relation describing each bond, and a relation
describing some global chemical properties of the molecule. The conjunctive
queries that can be generated from these relations are massively dependent. We
generated all queries with a minimal relative frequency of 5% and a maximal
length of 4.
We ran both standard Cobweb and Remind on a dataset with the truth
values of these queries as attributes. We could observe that, because of the massive
dependencies between attributes, Cobweb produced a very unbalanced tree of 83
levels deep. Remind created more balanced trees.
However, it turned out to be difficult to use these clusterings to predict the
mutagenicity of the molecules. This can be caused by the fact that similarity
of structure does not necessarily mean that molecules are similar in mutagenic-
ity. Another reason is that the simple Bayesian network derived from the query lattice
does not cover all dependencies between queries. We did preliminary experi-
ments with more refined versions of the belief network and these give more
promising results (67% accuracy in predicting mutagenicity). For instance, consider the
queries Q1: atom(X), bond(X, Y), Q2: atom(X), bond(X, Y), element(Y, c) and
Q3: atom(X), bond(X, Y), element(Y, o), and suppose that Q1 succeeds. The fact
that Q2 does not succeed now makes the success of Q3 more probable (atom Y
is of exactly one element type: carbon, hydrogen, oxygen, ...). A more thorough
study of this is part of further work.

6 Related work

6.1 Promotion of interesting features

Other work already exists that uses dependencies between attributes to improve
clustering. It is interesting to look at the differences. One possibility is described
in [21], where those attributes are selected that are most predictive for the other
attributes. This means that dependent features are given a larger weight, which
is the converse of what we do. This method is used in the hope that irrelevant
features will disturb the clustering process less when given a small weight. To
understand the difference between this approach and ours, it is necessary to
make a distinction between two kinds of dependencies:
- Dependencies caused by e.g. the language bias, which should not be discov-
ered by the clustering system. Our approach is to try to neutralize them.
- Dependencies in the data, unknown to the user, which one hopes will be
discovered. It can be interesting to try to promote them.
In fact, it could be useful to try to combine both methods.

6.2 The naive Bayes assumption and performance

In the propositional setting, the naive Bayes assumption holds in many cases.
This means that for classification, it does not hurt much to assume all attributes
to be independent. To some extent, a similar assumption can be made for (naive)
Bayesian clustering. Indeed, it does not hurt too much if some attributes acci-
dentally happen to be a little dependent. Dependencies unknown to the user can
be supposed to occur randomly and to cancel each other out, at least partially.
As long as intra-cluster correlations are strong enough, the clustering will not
be influenced much. Therefore, we do not expect large performance gains
for attribute-value problems without systematic dependencies as described in
Section 2. Our approach intends to be a functional extension, making clustering
possible in domains such as ILP where the language bias causes dependencies.
Also, when the user has some prior knowledge, our approach could shift the fo-
cus away from this knowledge, in the hope that other, unknown relations will be
discovered. Since these relations are expected to be weaker, one can in that case
even expect a loss in performance as measured by the classical metrics. It would
be interesting to evaluate the usefulness of this last aspect in a larger iterative
knowledge discovery process involving cycles and user interaction.

6.3 Bayesian classifier

The method discussed in this paper could also be used to develop a "less naive"
Bayesian classifier that compensates for the violation of the independence as-
sumption of a naive Bayesian classifier (see e.g. [13]) caused by the ILP context.

6.4 Multinets

The probability tables of the basic Bayesian networks we use for ILP problems
are sparse. In fact, other representations such as Bayesian multinets [14] could
be used to represent the dependencies in our networks. These representations
allow one to represent probability tables with many zeros in a more elegant
way. For simplicity, in this paper only the better known standard Bayesian net
representation was used.

6.5 Distances

The method described in this paper could also be used to build a distance
measure that adapts itself to the application domain (by adapting the network
structure and weights). Indeed, the utility of putting two examples together in
a cluster is a measure of their similarity. This measure is less representation-
dependent than other proposed measures (e.g. [19]) and is in many cases faster
to compute. We hope that this measure can also perform well in distance-based
applications, e.g. for instance-based learning. Experiments on this are part of
further work.

7 Conclusions and further work

We have proposed the use of the more expressive models of belief networks to
describe clusters and presented the changes needed to the objective functions
to take this into account. We argued that this has several advantages. First, it
allows more expressivity in the description of clusters. Second, even if no
algorithm for the automatic inference of the structure of belief networks can
be used, it allows one to compensate for dependencies in the data caused by e.g.
the language bias. Next, we have shown that this can be applied to do clustering on
examples in multiple relations using the truth values of non-independent queries.
Finally, we did some experiments to evaluate this approach.
The limitations of our system are inherited from the Warmr and Cobweb
systems. Level-wise pattern discovery systems such as Warmr can only be used
if one is able to define a space of patterns and a lower frequency bound such that
the search for frequent patterns is finite. Cobweb is an incremental system, which
has both advantages and disadvantages. However, the same approach of using
belief networks to describe clusters could be applied to other existing clustering
algorithms with different properties.
There are several possible directions for further work. First, one could use
an algorithm to learn the optimal structure of the belief networks describing
the clusters. This would further increase the expressivity. Second, one could
further investigate possible applications of this method (e.g. in databases having
taxonomies) and its relation to other clustering aspects such as pruning methods.
Third, further research could be done on structuring the first order language in such
a way that the dependencies between the queries can be found more easily.

Acknowledgements. Jan Ramon is supported by the Flemish Institute for
the Promotion of Science and Technological Research in Industry (IWT). Luc
Dehaspe is supported by K.U.Leuven research grant PDM/98/89.

References
1. R. Agrawal and R. Srikant. Mining generalised association rules. In Proceedings
of the 21st VLDB Conference, 1995.
2. H. Blockeel and L. De Raedt. Top-down induction of first order logical decision
trees. Artificial Intelligence, 101(1-2):285-297, June 1998.
3. H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of clustering trees.
In Proceedings of the 15th International Conference on Machine Learning, pages
55-63, 1998.
4. P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results.
In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy
Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI
Press/MIT Press, 1996.
5. J. Corter and M. Gluck. Explaining basic categories: feature predictability and
information. Psychological Bulletin, (111):291-303, 1992.
6. J. Cussens. Loglinear models for first order probabilistic reasoning. In Proc. of
UAI, 1999.
7. L. De Raedt. Logical settings for concept learning. Artificial Intelligence, 95:187-
201, 1997.
8. L. De Raedt and W. Van Laer. Inductive constraint logic. In Klaus P. Jantke,
Takeshi Shinohara, and Thomas Zeugmann, editors, Proceedings of the Sixth Inter-
national Workshop on Algorithmic Learning Theory, volume 997 of Lecture Notes
in Artificial Intelligence, pages 80-94. Springer-Verlag, 1995.
9. L. Dehaspe and H. Toivonen. Discovery of frequent datalog patterns. Data Mining
and Knowledge Discovery, 3(1):7-36, 1999.
10. W. Emde and D. Wettschereck. Relational instance-based learning. In L. Saitta,
editor, Proceedings of the Thirteenth International Conference on Machine Learn-
ing, pages 122-130. Morgan Kaufmann, 1996.
11. D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Ma-
chine Learning, 2:139-172, 1987.
12. D. H. Fisher. Iterative optimization and simplification of hierarchical clusterings.
Journal of Artificial Intelligence Research, 4:147-179, 1996.
13. P.A. Flach and N. Lachiche. 1BC: a first order Bayesian classifier. In D. Page, editor,
Proceedings of the Ninth International Workshop on Inductive Logic Programming,
volume 1634, pages 93-103. Springer-Verlag, 1999.
14. D. Geiger and D. Heckerman. Knowledge representation and inference in similarity
networks and Bayesian multinets. Artificial Intelligence, 82:45-74, 1996.
15. S. Kramer, B. Pfahringer, and C. Helma. Stochastic propositionalisation of non-
determinate background knowledge. In Proceedings of the Eighth International
Conference on Inductive Logic Programming, volume 1446 of Lecture Notes in Ar-
tificial Intelligence, pages 80-94. Springer-Verlag, 1998.
16. H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge
discovery. Data Mining and Knowledge Discovery, 1(3):241-258, 1997.
17. S.-H. Nienhuys-Cheng and R. de Wolf. Foundations of Inductive Logic Programming,
volume 1228 of Lecture Notes in Computer Science and Lecture Notes in Artificial
Intelligence. Springer-Verlag, New York, NY, USA, 1997.
18. G. Plotkin. A note on inductive generalization. In B. Meltzer and D. Michie, edi-
tors, Machine Intelligence, volume 5, pages 153-163. Edinburgh University Press,
1970.
19. J. Ramon and M. Bruynooghe. A framework for defining distances between first-
order logic objects. In Proceedings of the Eighth International Conference on In-
ductive Logic Programming, Lecture Notes in Artificial Intelligence, pages 271-280.
Springer-Verlag, 1998.
20. A. Srinivasan, S.H. Muggleton, R.D. King, and M.J.E. Sternberg. Mutagenesis:
ILP experiments in a non-determinate biological domain. In S. Wrobel, editor,
Proceedings of the Fourth International Workshop on Inductive Logic Program-
ming, volume 237 of GMD-Studien, pages 217-232. Gesellschaft für Mathematik
und Datenverarbeitung MBH, 1994.
21. L. Talavera. Feature selection as a preprocessing step for hierarchical clustering.
In Proceedings of the 16th International Conference on Machine Learning, pages
389-397. Morgan Kaufmann, 1999.
