
Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra. Since the concept of a set is quite intuitive, the Boolean model provides a framework which is easy to grasp by a common user of an IR system.

Furthermore, the queries are specified as Boolean expressions which have precise semantics. Given its inherent simplicity and neat formalism, the Boolean model received great attention in past years and was adopted by many of the early commercial bibliographic systems.

26 MODELING

Figure 2.3 The three conjunctive components for the query [q = k_a ∧ (k_b ∨ ¬k_c)].

Most users find it difficult and awkward to express their query requests in terms of Boolean expressions. The Boolean expressions actually formulated by users often are quite simple (see Chapter 10 for a more thorough discussion on this issue). Despite these drawbacks, the Boolean model is still the dominant model with commercial document database systems and provides a good starting point for those new to the field.

The index term weights are all binary, i.e., w_{i,j} ∈ {0,1}. A query q is composed of index terms linked by three connectives: not, and, or. Thus, a query is essentially a conventional Boolean expression which can be represented as a disjunction of conjunctive vectors (i.e., in disjunctive normal form, DNF). For instance, the query [q = k_a ∧ (k_b ∨ ¬k_c)] can be written in disjunctive normal form as [q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)], where each of the components is a binary weighted vector associated with the tuple (k_a, k_b, k_c). These binary weighted vectors are called the conjunctive components of q_dnf. Figure 2.3 illustrates the three conjunctive components for the query q.

Definition: For the Boolean model, the index term weight variables are all binary, i.e., w_{i,j} ∈ {0,1}. A query q is a conventional Boolean expression. Let q_dnf be the disjunctive normal form for the query q. Further, let q_cc be any of the conjunctive components of q_dnf. The similarity of a document d_j to the query q is defined as

    sim(d_j, q) = 1  if ∃ q_cc | (q_cc ∈ q_dnf) ∧ (∀ k_i, g_i(d_j) = g_i(q_cc))
    sim(d_j, q) = 0  otherwise

where g_i(d_j) returns the weight of the index term k_i in the vector for d_j.

CLASSIC INFORMATION RETRIEVAL 27

If sim(d_j, q) = 1 then the Boolean model predicts that the document d_j is relevant to the query q (it might not be). Otherwise, the prediction is that the document is not relevant.
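The all-or-nothing DNF evaluation above can be sketched as follows. This is a minimal illustration, not the book's code: the document names and their binary term vectors over (k_a, k_b, k_c) are made up, and the DNF components are the ones derived for the example query q = k_a ∧ (k_b ∨ ¬k_c).

```python
# Conjunctive components of q_dnf for q = ka AND (kb OR NOT kc),
# over the tuple (ka, kb, kc), as derived in the text.
QDNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def sim(doc_vector):
    """Boolean similarity: 1 if the document's binary term-incidence
    vector equals one of the conjunctive components, else 0."""
    return 1 if tuple(doc_vector) in QDNF else 0

# Hypothetical documents represented as binary weight vectors.
docs = {
    "d1": (1, 1, 0),  # contains ka and kb, not kc -> predicted relevant
    "d2": (0, 1, 0),  # contains kb only           -> predicted non-relevant
    "d3": (1, 1, 1),  # contains all three terms   -> predicted relevant
}

relevant = [name for name, vec in docs.items() if sim(vec) == 1]
print(relevant)  # ['d1', 'd3']
```

Note that d2 matches the query term k_b yet is rejected outright: there is no partial credit, which is exactly the drawback the text discusses next.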

The Boolean model predicts that each document is either relevant or non-relevant. There is no notion of a partial match to the query conditions.

The vector model recognizes that the use of binary weights is too limiting and proposes a framework in which partial matching is possible. This is accomplished by assigning non-binary weights to index terms in queries and in documents. These term weights are ultimately used to compute the degree of similarity between each document stored in the system and the user query. By sorting the retrieved documents in decreasing order of this degree of similarity, the vector model takes into consideration documents which match the query terms only partially. The main resultant effect is that the ranked document answer set is a lot more precise (in the sense that it better matches the user information need) than the document answer set retrieved by the Boolean model.

Definition: For the vector model, the weight w_{i,j} associated with a pair (k_i, d_j) is positive and non-binary. Further, the index terms in the query are also weighted. Let w_{i,q} be the weight associated with the pair [k_i, q], where w_{i,q} ≥ 0. Then, the query q and the document d_j are represented by t-dimensional weight vectors, and the degree of similarity of d_j with regard to q is evaluated as the correlation between these two vectors. This correlation can be quantified, for instance, by the cosine of the angle between them:

    sim(d_j, q) = (d_j · q) / (|d_j| × |q|)
                = ( Σ_{i=1}^{t} w_{i,j} × w_{i,q} ) / ( sqrt(Σ_{i=1}^{t} w_{i,j}²) × sqrt(Σ_{i=1}^{t} w_{i,q}²) )


where |d_j| and |q| are the norms of the document and query vectors. The factor |q| does not affect the ranking because it is the same for all documents. The factor |d_j| provides a normalization in the space of the documents. Since w_{i,j} ≥ 0 and w_{i,q} ≥ 0, sim(d_j, q) varies from 0 to +1.
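A small sketch of this cosine ranking, with made-up non-binary weights over a hypothetical three-term vocabulary:

```python
import math

def cosine_sim(d, q):
    """sim(dj, q) = (dj . q) / (|dj| * |q|); 0 when either vector is zero."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# Illustrative non-negative weight vectors (not from the book).
query = [0.8, 0.0, 0.5]
docs = {"d1": [1.0, 0.2, 0.0], "d2": [0.0, 1.0, 0.0]}

# Sort the documents in decreasing order of their degree of similarity.
ranking = sorted(docs, key=lambda name: cosine_sim(docs[name], query), reverse=True)
print(ranking)  # ['d1', 'd2']
```

Here d1 shares a weighted term with the query and ranks first, while d2 matches no query term and scores 0; unlike the Boolean model, a partial match would still receive a score strictly between 0 and 1.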

Here we concentrate on elucidating the main idea behind the most effective term-weighting techniques. This idea is related to the basic principles which support clustering techniques, as follows.

A vague description of a set A means that we do not have complete information for deciding precisely which objects are and which are not in the set A. For instance, one might be looking for a set A of cars which have a price comparable to that of a Lexus 400. Since it is not clear what the term comparable means exactly, there is not a precise (and unique) description of the set A. More sophisticated clustering algorithms might separate the objects of a collection into several classes, for instance, patients classified as diagnosed and healthy. Again, the possible class descriptions might be imprecise (and not unique) and the problem is one of deciding to which of these classes each object should be assigned. For information retrieval, the two-class version suffices: deciding which documents are predicted to be relevant and which are not (with regard to a given user query).


Thus, the IR problem can be reduced to the problem of determining which documents are in the set A and which ones are not (i.e., the IR problem can be viewed as a clustering problem). Two issues then arise. First, one needs to determine the features which better describe the objects in the set A. Second, one needs to determine the features which better distinguish the objects in the set A from the remaining objects in the collection C. The first set of features provides for quantification of intra-cluster similarity, while the second set of features provides for quantification of inter-cluster dissimilarity.
In the vector model, intra-cluster similarity is quantified by measuring the raw frequency of a term k_i inside a document d_j. Such term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents. Inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term k_i among the documents in the collection. This factor is usually referred to as the inverse document frequency or the idf factor.

The motivation for usage of an idf factor is that terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one. As with good clustering algorithms, the most effective term-weighting schemes for IR try to balance these two effects.

Definition: Let N be the total number of documents in the system and n_i be the number of documents in which the index term k_i appears. Let freq_{i,j} be the raw frequency of term k_i in the document d_j (i.e., the number of times the term k_i is mentioned in the text of the document d_j). Then, the normalized frequency f_{i,j} of term k_i in document d_j is given by

    f_{i,j} = freq_{i,j} / max_l freq_{l,j}    (2.1)

where the maximum is computed over all terms which are mentioned in the text of the document d_j. If the term k_i does not appear in the document d_j then f_{i,j} = 0.

Further, let idf_i, the inverse document frequency for k_i, be given by

    idf_i = log (N / n_i)    (2.2)

The best known term-weighting schemes use weights which are given by

    w_{i,j} = f_{i,j} × log (N / n_i)    (2.3)


or by a variation of this formula. Such term-weighting strategies are called tf-idf schemes.

Several variations of the above expression for the weight w_{i,j} are described in an interesting paper by Salton and Buckley which appeared in 1988 [696]. However, in general, the above expression should provide a good weighting scheme for many collections.

For the query term weights, Salton and Buckley suggest

    w_{i,q} = (0.5 + 0.5 × freq_{i,q} / max_l freq_{l,q}) × log (N / n_i)    (2.4)

where freq_{i,q} is the raw frequency of the term k_i in the text of the information request q.
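Equations (2.1)-(2.4) can be sketched as follows. The toy collection and query are invented for illustration; the functions simply transcribe the formulas above.

```python
import math
from collections import Counter

# A tiny made-up collection of tokenized documents.
docs = [
    "information retrieval models".split(),
    "boolean model and vector model".split(),
    "probabilistic retrieval".split(),
]
N = len(docs)
# n_i: the number of documents containing each index term k_i.
n = Counter(term for doc in docs for term in set(doc))

def doc_weight(term, doc):
    """w_{i,j} = f_{i,j} * log(N / n_i), with f_{i,j} normalized by the
    maximum raw frequency in the document (equations 2.1 and 2.3)."""
    freqs = Counter(doc)
    if term not in freqs:
        return 0.0  # f_{i,j} = 0 when k_i does not appear in d_j
    f = freqs[term] / max(freqs.values())
    return f * math.log(N / n[term])

def query_weight(term, query):
    """w_{i,q} as suggested by Salton and Buckley (equation 2.4)."""
    freqs = Counter(query)
    if term not in freqs or term not in n:
        return 0.0
    f = 0.5 + 0.5 * freqs[term] / max(freqs.values())
    return f * math.log(N / n[term])

print(doc_weight("retrieval", docs[0]))   # term in 2 of 3 docs: modest idf
print(query_weight("probabilistic", "probabilistic retrieval".split()))
```

Note how "retrieval", which appears in two of the three documents, receives a smaller idf factor than a term appearing in only one document, which is exactly the balancing of intra-cluster similarity against inter-cluster dissimilarity described above.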

The main advantages of the vector model are: (1) its term-weighting scheme improves retrieval performance; (2) its partial matching strategy allows retrieval of documents that approximate the query conditions; and (3) its cosine ranking formula sorts the documents according to their degree of similarity to the query. Theoretically, the vector model has the disadvantage that index terms are assumed to be mutually independent. However, in practice, consideration of term dependencies might be a disadvantage: due to the locality of many term dependencies, their indiscriminate application to all the documents in the collection might in fact hurt the overall performance.

In this section, we describe the classic probabilistic model introduced in 1976 by Robertson and Sparck Jones [677], which later became known as the binary independence retrieval (BIR) model. Our discussion is intentionally brief; with this in mind, we do not detain ourselves in subtleties regarding the binary independence assumption for the model. The section on bibliographic discussion points to references which cover these subtleties.

The probabilistic model attempts to capture the IR problem within a probabilistic framework. The fundamental idea is as follows. Given a user query, there is a set of documents which contains exactly the relevant documents and no others. Let us refer to this set of documents as the ideal answer set.


Given the description of this ideal answer set, we would have no problems in retrieving its documents. Thus, we can think of the querying process as a process of specifying the properties of an ideal answer set (which is analogous to interpreting the IR problem as a clustering problem). The problem is that these properties are not known at query time, so an effort has to be made at initially guessing what they could be. This initial guess allows us to generate a preliminary probabilistic description of the ideal answer set which is used to retrieve a first set of documents. An interaction with the user is then initiated, in which the user examines the retrieved documents and indicates which ones are relevant and which are not (in truth, only the top-ranked documents need to be examined). The system then uses this information to refine the description of the ideal answer set. By repeating this process many times, it is expected that such a description will evolve and become closer to the real description of the ideal answer set. Thus, one should always have in mind the need to guess at the beginning the description of the ideal answer set. Furthermore, a conscious effort is made to model this description in probabilistic terms.

Given a user query q and a document d_j in the collection, the probabilistic model tries to estimate the probability that the user will find the document d_j interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. Further, the model assumes that there is a subset of all documents which the user prefers as the answer set for the query q. Such an ideal answer set is labeled R and should maximize the overall probability of relevance to the user. Documents in the set R are predicted to be relevant to the query; documents not in this set are predicted to be non-relevant.
The probabilistic model assigns to each document d_j, as a measure of its similarity to the query, the ratio P(d_j relevant to q) / P(d_j non-relevant to q), which computes the odds of the document d_j being relevant to the query q. Taking the odds of relevance as the rank minimizes the probability of an erroneous judgement [282, 785].

Definition: For the probabilistic model, the index term weight variables are all binary, i.e., w_{i,j} ∈ {0,1} and w_{i,q} ∈ {0,1}. A query q is a subset of index terms. Let R be the set of documents known (or initially guessed) to be relevant, and let R̄ be the complement of R (i.e., the set of non-relevant documents). Let P(R | d_j) be the probability that the document d_j is relevant to the query q, and P(R̄ | d_j) be the probability that d_j is non-relevant to q. The similarity sim(d_j, q) of the document d_j to the query q is then defined as the ratio

    sim(d_j, q) = P(R | d_j) / P(R̄ | d_j)

Using Bayes' rule,

    sim(d_j, q) = [ P(d_j | R) × P(R) ] / [ P(d_j | R̄) × P(R̄) ]

P(d_j | R) stands for the probability of randomly selecting the document d_j from the set R of relevant documents. Further, P(R) stands for the probability that a document randomly selected from the entire collection is relevant. The meanings attached to P(d_j | R̄) and P(R̄) are analogous and complementary. Since P(R) and P(R̄) are the same for all the documents in the collection, we write

    sim(d_j, q) ~ P(d_j | R) / P(d_j | R̄)

Assuming independence of index terms,

    sim(d_j, q) ~ [ Π_{g_i(d_j)=1} P(k_i | R) × Π_{g_i(d_j)=0} P(¬k_i | R) ]
                / [ Π_{g_i(d_j)=1} P(k_i | R̄) × Π_{g_i(d_j)=0} P(¬k_i | R̄) ]

P(k_i | R) stands for the probability that the index term k_i is present in a document randomly selected from the set R. P(¬k_i | R) stands for the probability that the index term k_i is not present in a document randomly selected from the set R. The probabilities associated with the set R̄ have meanings which are analogous to the ones just described.

Taking logarithms, recalling that P(k_i | R) + P(¬k_i | R) = 1, and ignoring factors which are constant for all documents in the context of the same query, we can finally write

    sim(d_j, q) ~ Σ_{i=1}^{t} w_{i,q} × w_{i,j} × ( log [ P(k_i|R) / (1 − P(k_i|R)) ] + log [ (1 − P(k_i|R̄)) / P(k_i|R̄) ] )

which is a key expression for ranking computation in the probabilistic model. Since we do not know the set R at the beginning, it is necessary to devise a method for initially computing the probabilities P(k_i | R) and P(k_i | R̄). There are many alternatives for such computation. We discuss a couple of them below.

In the very beginning (i.e., immediately after the query specification), there are no retrieved documents. Thus, one has to make simplifying assumptions, such as: (a) assume that P(k_i | R) is constant for all index terms k_i (typically, equal to 0.5), and (b) assume that the distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection. These two assumptions yield

    P(k_i | R) = 0.5
    P(k_i | R̄) = n_i / N

where, as already defined, n_i is the number of documents which contain the index term k_i and N is the total number of documents in the collection. Given this initial guess, we can retrieve documents which contain query terms and provide an initial probabilistic ranking for them. After that, this initial ranking is improved as follows.

Let V be a subset of the documents initially retrieved and ranked by the probabilistic model, for instance, the top r ranked documents where r is a previously defined threshold. Further, let V_i be the subset of V composed of the documents in V which contain the index term k_i. For simplicity, we also use V and V_i to refer to the number of elements in these sets (it should always be clear when the used variable refers to the set or to the number of its elements). For improving the probabilistic ranking, we need to improve our guesses for P(k_i | R) and P(k_i | R̄). This can be accomplished with the following assumptions: (a) we can approximate P(k_i | R) by the distribution of the index term k_i among the documents retrieved so far, and (b) we can approximate P(k_i | R̄) by assuming that all the non-retrieved documents are not relevant. With these assumptions, we can write

    P(k_i | R) = V_i / V
    P(k_i | R̄) = (n_i − V_i) / (N − V)

This process can then be repeated recursively, allowing us to improve our guesses for P(k_i | R) and P(k_i | R̄) without any assistance from a human subject (contrary to the idea originally conceived). However, the last formulas pose problems for small values of V and V_i (such as V = 1 and V_i = 0). To circumvent these problems, an adjustment factor of 0.5 is often added, which yields

    P(k_i | R) = (V_i + 0.5) / (V + 1)
    P(k_i | R̄) = (n_i − V_i + 0.5) / (N − V + 1)


An adjustment factor which is constant and equal to 0.5 is not always satisfactory. An alternative is to take the fraction n_i/N as the adjustment factor, which yields

    P(k_i | R) = (V_i + n_i/N) / (V + 1)
    P(k_i | R̄) = (n_i − V_i + n_i/N) / (N − V + 1)
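The initial estimates and the recursive re-estimation above can be sketched as follows. The collection, the query, and the choice of threshold r = 2 are illustrative assumptions; the code transcribes the ranking formula and the two estimation steps.

```python
import math

# Hypothetical binary term-incidence vectors over a toy vocabulary.
docs = {
    "d1": {"a": 1, "b": 1, "c": 1},
    "d2": {"a": 1, "b": 0, "c": 1},
    "d3": {"a": 0, "b": 0, "c": 1},
    "d4": {"a": 0, "b": 0, "c": 1},
}
query = {"a", "b"}
N = len(docs)
n = {t: sum(d[t] for d in docs.values()) for t in "abc"}  # n_i

def term_weight(p_rel, p_nonrel):
    # log[P(ki|R)/(1 - P(ki|R))] + log[(1 - P(ki|Rbar))/P(ki|Rbar)]
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def rank(p):
    """p maps each query term to the pair (P(ki|R), P(ki|Rbar))."""
    scores = {
        name: sum(term_weight(*p[t]) for t in query if d[t])
        for name, d in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Initial guesses: P(ki|R) = 0.5 and P(ki|Rbar) = ni/N.
p0 = {t: (0.5, n[t] / N) for t in query}
first = rank(p0)

# Re-estimate from the top r = 2 ranked documents (the set V),
# using the 0.5 adjustment factor to avoid degenerate values.
V = first[:2]
Vi = {t: sum(docs[name][t] for name in V) for t in query}
p1 = {t: ((Vi[t] + 0.5) / (len(V) + 1),
          (n[t] - Vi[t] + 0.5) / (N - len(V) + 1)) for t in query}
print(rank(p1))  # ['d1', 'd2', 'd3', 'd4']
```

In this toy run the rare query term "b" dominates the initial ranking, and after one re-estimation round the documents containing more query terms are pushed further up, without any human relevance judgements, as described above.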

This completes our discussion of the probabilistic model.

The main advantage of the probabilistic model, in theory, is that documents are ranked in decreasing order of their probability of being relevant. The disadvantages include: (1) the need to guess the initial separation of documents into relevant and non-relevant sets; (2) the fact that the method does not take into account the frequency with which an index term occurs inside a document (i.e., all weights are binary); and (3) the adoption of the independence assumption for index terms. However, as discussed for the vector model, it is not clear