
Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra. Since the concept of a set is quite intuitive, the Boolean model provides a framework which is easy to grasp by a common user of an IR system. Furthermore, the queries are specified as Boolean expressions which have precise semantics. Given its inherent simplicity and neat formalism, the Boolean model received great attention in past years and was adopted by many of the early commercial bibliographic systems.

26 MODELING

Figure 2.3 The three conjunctive components for the query $q = k_a \wedge (k_b \vee \neg k_c)$.

Unfortunately, the Boolean model suffers from major drawbacks. First, its retrieval strategy is based on a binary decision criterion (i.e., a document is predicted to be either relevant or non-relevant) without any notion of a grading scale, which prevents good retrieval performance. Thus, the Boolean model is in reality much more a data (instead of information) retrieval model. Second, while Boolean expressions have precise semantics, frequently it is not simple to translate an information need into a Boolean expression. In fact, most users find it difficult and awkward to express their query requests in terms of Boolean expressions. The Boolean expressions actually formulated by users often are quite simple (see Chapter 10 for a more thorough discussion on this issue). Despite these drawbacks, the Boolean model is still the dominant model with commercial document database systems and provides a good starting point for those new to the field.

The Boolean model considers that index terms are present or absent in a document. As a result, the index term weights are assumed to be all binary, i.e., $w_{i,j} \in \{0,1\}$. A query $q$ is composed of index terms linked by three connectives: not, and, or. Thus, a query is essentially a conventional Boolean expression which can be represented as a disjunction of conjunctive vectors (i.e., in disjunctive normal form, or DNF). For instance, the query $q = k_a \wedge (k_b \vee \neg k_c)$ can be written in disjunctive normal form as $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$, where each of the components is a binary weighted vector associated with the tuple $(k_a, k_b, k_c)$. These binary weighted vectors are called the conjunctive components of $\vec{q}_{dnf}$. Figure 2.3 illustrates the three conjunctive components for the query $q$.
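The conjunctive components of a small query can be enumerated by brute force. The following is a minimal sketch, assuming the query is supplied as an ordinary Python predicate over binary term weights; the names `conjunctive_components` and `q` are illustrative and not part of the original text.

```python
from itertools import product

def conjunctive_components(query, num_terms):
    """Return every binary vector over the index terms that satisfies `query`."""
    return [bits for bits in product((1, 0), repeat=num_terms) if query(*bits)]

# q = ka AND (kb OR NOT kc), written as a plain predicate (illustrative encoding)
def q(ka, kb, kc):
    return bool(ka and (kb or not kc))

q_dnf = conjunctive_components(q, 3)
print(q_dnf)  # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
```

Enumeration is exponential in the number of terms, so this only suits short queries such as the one in the text.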

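Definition For the Boolean model, the index term weight variables are all binary, i.e., $w_{i,j} \in \{0,1\}$. A query $q$ is a conventional Boolean expression. Let $\vec{q}_{dnf}$ be the disjunctive normal form for the query $q$. Further, let $\vec{q}_{cc}$ be any of the conjunctive components of $\vec{q}_{dnf}$. The similarity of a document $d_j$ to the query $q$ is defined as

$$sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases}$$

where $g_i(\cdot)$ denotes the weight associated with the index term $k_i$ in the given vector.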

CLASSIC INFORMATION RETRIEVAL 27

If $sim(d_j, q) = 1$ then the Boolean model predicts that the document $d_j$ is relevant to the query $q$ (it might not be). Otherwise, the prediction is that the document is not relevant.

The Boolean model predicts that each document is either relevant or non-relevant. There is no notion of a partial match to the query conditions. For instance, let $d_j$ be a document for which $\vec{d}_j = (0,1,0)$. Document $d_j$ includes the index term $k_b$ but is considered non-relevant to the query $q = k_a \wedge (k_b \vee \neg k_c)$.
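The all-or-nothing behaviour in this example can be checked directly. The sketch below assumes document vectors are restricted to the query terms $(k_a, k_b, k_c)$ and that the conjunctive components are already known; `boolean_sim` is an illustrative name.

```python
def boolean_sim(doc_vector, q_dnf):
    """Return 1 if the document matches some conjunctive component exactly, else 0."""
    return 1 if tuple(doc_vector) in set(q_dnf) else 0

# Conjunctive components of q = ka AND (kb OR NOT kc), taken from the text
q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

print(boolean_sim((0, 1, 0), q_dnf))  # 0: contains kb but lacks ka
print(boolean_sim((1, 1, 0), q_dnf))  # 1: exact match with a component
```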

The main advantages of the Boolean model are the clean formalism behind the model and its simplicity. The main disadvantages are that exact matching may lead to retrieval of too few or too many documents (see Chapter 10). Today, it is well known that index term weighting can lead to a substantial improvement in retrieval performance. Index term weighting brings us to the vector model.

2.5.3 Vector Model

The vector model [697, 695] recognizes that the use of binary weights is too limiting and proposes a framework in which partial matching is possible. This is accomplished by assigning non-binary weights to index terms in queries and in documents. These term weights are ultimately used to compute the degree of similarity between each document stored in the system and the user query. By sorting the retrieved documents in decreasing order of this degree of similarity, the vector model takes into consideration documents which match the query terms only partially. The main resultant effect is that the ranked document answer set is a lot more precise (in the sense that it better matches the user information need) than the document answer set retrieved by the Boolean model.

Definition For the vector model, the weight $w_{i,j}$ associated with a pair $(k_i, d_j)$ is positive and non-binary. Further, the index terms in the query are also weighted. Let $w_{i,q}$ be the weight associated with the pair $[k_i, q]$, where $w_{i,q} \geq 0$. Then, the query vector $\vec{q}$ is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$ where $t$ is the total number of index terms in the system. As before, the vector for a document $d_j$ is represented by $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.

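Therefore, a document $d_j$ and a user query $q$ are represented as t-dimensional vectors as shown in Figure 2.4. The vector model proposes to evaluate the degree of similarity of the document $d_j$ with regard to the query $q$ as the correlation between the vectors $\vec{d}_j$ and $\vec{q}$. This correlation can be quantified, for instance, by the cosine of the angle between these two vectors. That is,

$$sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$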


Figure 2.4 The cosine of $\theta$ is adopted as $sim(d_j, q)$.

where $|\vec{d}_j|$ and $|\vec{q}|$ are the norms of the document and query vectors. The factor $|\vec{q}|$ does not affect the ranking (i.e., the ordering of the documents) because it is the same for all documents. The factor $|\vec{d}_j|$ provides a normalization in the space of the documents.
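The cosine measure can be sketched in a few lines. This is a minimal illustration, assuming both vectors are given as equal-length lists of non-negative weights; returning 0.0 for a zero vector is a convention chosen here, not something the text specifies.

```python
import math

def cosine_sim(doc_weights, query_weights):
    """Cosine of the angle between a document vector and a query vector."""
    dot = sum(wd * wq for wd, wq in zip(doc_weights, query_weights))
    norm_d = math.sqrt(sum(w * w for w in doc_weights))
    norm_q = math.sqrt(sum(w * w for w in query_weights))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_d * norm_q)

print(cosine_sim([1, 0], [1, 0]))  # 1.0: same direction
print(cosine_sim([1, 0], [0, 1]))  # 0.0: no shared terms
```

Because the document norm appears in the denominator, a long document with large weights does not score higher than a short one pointing in the same direction.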

Since $w_{i,j} \geq 0$ and $w_{i,q} \geq 0$, $sim(d_j, q)$ varies from 0 to +1. Thus, instead of attempting to predict whether a document is relevant or not, the vector model ranks the documents according to their degree of similarity to the query. A document might be retrieved even if it matches the query only partially. For instance, one can establish a threshold on $sim(d_j, q)$ and retrieve the documents with a degree of similarity above that threshold. But to compute rankings we need first to specify how index term weights are obtained.
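Threshold-based retrieval over already computed similarities might look as follows. This is a sketch under the assumption that the similarities are held in a dictionary keyed by document identifier; the identifiers and scores are made up for illustration.

```python
def retrieve_above(scores, threshold):
    """Return (doc_id, similarity) pairs above the threshold, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, sim) for doc_id, sim in ranked if sim > threshold]

# Hypothetical similarity values for three documents
scores = {"d1": 0.82, "d2": 0.10, "d3": 0.47}
print(retrieve_above(scores, 0.3))  # [('d1', 0.82), ('d3', 0.47)]
```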

Index term weights can be calculated in many different ways. The work by Salton and McGill [698] reviews various term-weighting techniques. Here, we do not discuss them in detail. Instead, we concentrate on elucidating the main idea behind the most effective term-weighting techniques. This idea is related to the basic principles which support clustering techniques, as follows.

Given a collection $C$ of objects and a vague description of a set $A$, the goal of a simple clustering algorithm might be to separate the collection $C$ of objects into two sets: a first one which is composed of objects related to the set $A$ and a second one which is composed of objects not related to the set $A$. Vague description here means that we do not have complete information for deciding precisely which objects are and which are not in the set $A$. For instance, one might be looking for a set $A$ of cars which have a price comparable to that of a Lexus 400. Since it is not clear what the term comparable means exactly, there is not a precise (and unique) description of the set $A$. More sophisticated clustering algorithms might attempt to separate the objects of a collection into various clusters (or classes) according to their properties. For instance, patients of a doctor specializing in cancer could be classified into five classes: terminal, advanced, metastasis, diagnosed, and healthy. Again, the possible class descriptions might be imprecise (and not unique) and the problem is one of deciding to which of these classes a new patient should be assigned. In what follows, however, we only discuss the simpler version of the clustering problem (i.e., the one which considers only two classes) because all that is required is a decision on which documents are predicted to be relevant and which ones are predicted to be not relevant (with regard to a given user query).


To view the IR problem as one of clustering, we refer to the early work of Salton. We think of the documents as a collection $C$ of objects and think of the user query as a (vague) specification of a set $A$ of objects. In this scenario, the IR problem can be reduced to the problem of determining which documents are in the set $A$ and which ones are not (i.e., the IR problem can be viewed as a clustering problem). In a clustering problem, two main issues have to be resolved. First, one needs to determine what are the features which better describe the objects in the set $A$. Second, one needs to determine what are the features which better distinguish the objects in the set $A$ from the remaining objects in the collection $C$. The first set of features provides for quantification of intra-cluster similarity, while the second set of features provides for quantification of inter-cluster dissimilarity. The most successful clustering algorithms try to balance these two effects.

In the vector model, intra-clustering similarity is quantified by measuring the raw frequency of a term $k_i$ inside a document $d_j$. Such term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents (i.e., intra-document characterization). Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term $k_i$ among the documents in the collection. This factor is usually referred to as the inverse document frequency or the idf factor. The motivation for usage of an idf factor is that terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one. As with good clustering algorithms, the most effective term-weighting schemes for IR try to balance these two effects.

Definition Let $N$ be the total number of documents in the system and $n_i$ be the number of documents in which the index term $k_i$ appears. Let $freq_{i,j}$ be the raw frequency of term $k_i$ in the document $d_j$ (i.e., the number of times the term $k_i$ is mentioned in the text of the document $d_j$). Then, the normalized frequency $f_{i,j}$ of term $k_i$ in document $d_j$ is given by

$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}} \qquad (2.1)$$

where the maximum is computed over all terms which are mentioned in the text of the document $d_j$. If the term $k_i$ does not appear in the document $d_j$ then $f_{i,j} = 0$. Further, let $idf_i$, inverse document frequency for $k_i$, be given by

$$idf_i = \log \frac{N}{n_i} \qquad (2.2)$$

The best known term-weighting schemes use weights which are given by

$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i} \qquad (2.3)$$


or by a variation of this formula. Such term-weighting strategies are called tf-idf schemes.
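Equations 2.1 to 2.3 can be traced on a toy collection. The sketch below assumes documents are already tokenized into lists of terms and uses the natural logarithm, since the text does not fix a base; `tf_idf_weights` is an illustrative name.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: list of token lists. Return one {term: w_ij} dict per document."""
    n_docs = len(docs)
    # n_i: number of documents in which each term appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)                      # raw frequencies freq_ij
        max_freq = max(freq.values()) if freq else 1
        weights.append({
            term: (count / max_freq) * math.log(n_docs / doc_freq[term])
            for term, count in freq.items()
        })
    return weights

docs = [["a", "a", "b"], ["a", "c"], ["a"]]
weights = tf_idf_weights(docs)
print(weights[0])
```

In this toy collection the term "a" occurs in every document, so its idf factor is log(3/3) = 0 and it receives zero weight everywhere, which is the effect the idf factor is meant to produce.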

Several variations of the above expression for the weight $w_{i,j}$ are described in an interesting paper by Salton and Buckley which appeared in 1988 [696]. However, in general, the above expression should provide a good weighting scheme for many collections.

For the query term weights, Salton and Buckley suggest

$$w_{i,q} = \left(0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i} \qquad (2.4)$$

where $freq_{i,q}$ is the raw frequency of the term $k_i$ in the text of the information request $q$.
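A direct transcription of equation 2.4 is sketched below. It assumes the document frequencies $n_i$ are supplied as a dictionary, uses the natural logarithm, and silently skips query terms that never occur in the collection; that last choice is made here to avoid a division by zero and is not prescribed by the text.

```python
import math
from collections import Counter

def query_weights(query_terms, doc_freq, n_docs):
    """Return {term: w_iq} following equation 2.4."""
    freq = Counter(query_terms)
    if not freq:
        return {}
    max_freq = max(freq.values())
    return {
        term: (0.5 + 0.5 * count / max_freq) * math.log(n_docs / doc_freq[term])
        for term, count in freq.items()
        if doc_freq.get(term, 0) > 0  # assumed handling of unseen terms
    }

# Hypothetical statistics: 4 documents, "a" in 1 of them, "b" in 2
w_q = query_weights(["a", "b", "b"], {"a": 1, "b": 2}, 4)
print(w_q)
```

The 0.5 offset keeps every query term at no less than half of its idf value, even when it appears only once in a long information request.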

The main advantages of the vector model are: (1) its term-weighting scheme improves retrieval performance; (2) its partial matching strategy allows retrieval of documents that approximate the query conditions; and (3) its cosine ranking formula sorts the documents according to their degree of similarity to the query. Theoretically, the vector model has the disadvantage that index terms are assumed to be mutually independent (equation 2.3 does not account for index term dependencies). However, in practice, consideration of term dependencies might be a disadvantage. Due to the locality of many term dependencies, their indiscriminate application to all the documents in the collection might in fact hurt the overall performance.

Despite its simplicity, the vector model is a resilient ranking strategy with general collections. It yields ranked answer sets which are difficult to improve upon without query expansion or relevance feedback (see Chapter 5) within the framework of the vector model. A large variety of alternative ranking methods have been compared to the vector model but the consensus seems to be that, in general, the vector model is either superior or almost as good as the known alternatives. Furthermore, it is simple and fast. For these reasons, the vector model is a popular retrieval model nowadays.

2.5.4 Probabilistic Model

In this section, we describe the classic probabilistic model introduced in 1976 by Robertson and Sparck Jones [677] which later became known as the binary independence retrieval (BIR) model. Our discussion is intentionally brief and focuses mainly on highlighting the key features of the model. With this purpose in mind, we do not detain ourselves in subtleties regarding the binary independence assumption for the model. The section on bibliographic discussion points to references which cover these details.

The probabilistic model attempts to capture the IR problem within a probabilistic framework. The fundamental idea is as follows. Given a user query, there is a set of documents which contains exactly the relevant documents and no other. Let us refer to this set of documents as the ideal answer set. Given the description of this ideal answer set, we would have no problems in retrieving its documents. Thus, we can think of the querying process as a process of specifying the properties of an ideal answer set (which is analogous to interpreting the IR problem as a problem of clustering). The problem is that we do not know exactly what these properties are. All we know is that there are index terms whose semantics should be used to characterize these properties. Since these properties are not known at query time, an effort has to be made at initially guessing what they could be. This initial guess allows us to generate a preliminary probabilistic description of the ideal answer set which is used to retrieve a first set of documents. An interaction with the user is then initiated with the purpose of improving the probabilistic description of the ideal answer set. Such interaction could proceed as follows.

The user takes a look at the retrieved documents and decides which ones are relevant and which ones are not (in truth, only the first top documents need to be examined). The system then uses this information to refine the description of the ideal answer set. By repeating this process many times, it is expected that such a description will evolve and become closer to the real description of the ideal answer set. Thus, one should always have in mind the need to guess at the beginning the description of the ideal answer set. Furthermore, a conscious effort is made to model this description in probabilistic terms.

The probabilistic model is based on the following fundamental assumption.

Assumption (Probabilistic Principle) Given a user query $q$ and a document $d_j$ in the collection, the probabilistic model tries to estimate the probability that the user will find the document $d_j$ interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. Further, the model assumes that there is a subset of all documents which the user prefers as the answer set for the query $q$. Such an ideal answer set is labeled $R$ and should maximize the overall probability of relevance to the user. Documents in the set $R$ are predicted to be relevant to the query. Documents not in this set are predicted to be non-relevant.

This assumption is quite troublesome because it does not state explicitly how to compute the probabilities of relevance. In fact, not even the sample space which is to be used for defining such probabilities is give...

Given a query $q$, the probabilistic model assigns to each document $d_j$, as a measure of its similarity to the query, the ratio $P(d_j \text{ relevant-to } q) / P(d_j \text{ non-relevant-to } q)$ which computes the odds of the document $d_j$ being relevant to the query $q$. Taking the odds of relevance as the rank minimizes the probability of an erroneous judgement [282, 785].

Definition For the probabilistic model, the index term weight variables are all binary, i.e., $w_{i,j} \in \{0,1\}$, $w_{i,q} \in \{0,1\}$. A query $q$ is a subset of index terms. Let $R$ be the set of documents known (or initially guessed) to be relevant. Let $\bar{R}$ be the complement of $R$ (i.e., the set of non-relevant documents). Let $P(R|\vec{d}_j)$ be the probability that the document $d_j$ is relevant to the query $q$ and $P(\bar{R}|\vec{d}_j)$ be the probability that $d_j$ is non-relevant to $q$. The similarity $sim(d_j, q)$ of the document $d_j$ to the query $q$ is defined as the ratio

$$sim(d_j, q) = \frac{P(R|\vec{d}_j)}{P(\bar{R}|\vec{d}_j)}$$

Using Bayes' rule,

$$sim(d_j, q) = \frac{P(\vec{d}_j|R) \times P(R)}{P(\vec{d}_j|\bar{R}) \times P(\bar{R})}$$

$P(\vec{d}_j|R)$ stands for the probability of randomly selecting the document $d_j$ from the set $R$ of relevant documents. Further, $P(R)$ stands for the probability that a document randomly selected from the entire collection is relevant. The meanings attached to $P(\vec{d}_j|\bar{R})$ and $P(\bar{R})$ are analogous and complementary.

Since $P(R)$ and $P(\bar{R})$ are the same for all the documents in the collection, we write,

$$sim(d_j, q) \sim \frac{P(\vec{d}_j|R)}{P(\vec{d}_j|\bar{R})}$$

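Assuming independence of index terms,

$$sim(d_j, q) \sim \frac{\left(\prod_{g_i(\vec{d}_j)=1} P(k_i|R)\right) \times \left(\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|R)\right)}{\left(\prod_{g_i(\vec{d}_j)=1} P(k_i|\bar{R})\right) \times \left(\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|\bar{R})\right)}$$

Here $g_i(\vec{d}_j)$ denotes the binary weight $w_{i,j}$ of the index term $k_i$ in $\vec{d}_j$. $P(k_i|R)$ stands for the probability that the index term $k_i$ is present in a document randomly selected from the set $R$. $P(\bar{k}_i|R)$ stands for the probability that the index term $k_i$ is not present in a document randomly selected from the set $R$. The probabilities associated with the set $\bar{R}$ have meanings which are analogous to the ones just described.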

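Taking logarithms, recalling that $P(k_i|R) + P(\bar{k}_i|R) = 1$, and ignoring factors which are constant for all documents in the context of the same query, we can finally write

$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$$

which is a key expression for ranking computation in the probabilistic model.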
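Since all weights are binary, the sum in the ranking expression runs only over terms that occur in both the query and the document. A minimal sketch follows, assuming documents and queries are sets of terms and that the two probability tables are given with values strictly between 0 and 1; the names and the sample probabilities are illustrative.

```python
import math

def bir_score(doc_terms, query_terms, p_rel, p_nonrel):
    """Log-odds ranking score of a document under the BIR model.

    p_rel[k]    stands for P(k|R)
    p_nonrel[k] stands for P(k|not R)
    """
    score = 0.0
    for term in query_terms & doc_terms:  # w_iq * w_ij is 1 only for shared terms
        p, u = p_rel[term], p_nonrel[term]
        score += math.log(p / (1 - p)) + math.log((1 - u) / u)
    return score

# Hypothetical probability estimates for two query terms
p_rel = {"a": 0.5, "b": 0.8}
p_nonrel = {"a": 0.1, "b": 0.5}
print(bir_score({"a", "c"}, {"a", "b"}, p_rel, p_nonrel))
```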

Since we do not know the set $R$ at the beginning, it is necessary to devise a method for initially computing the probabilities $P(k_i|R)$ and $P(k_i|\bar{R})$. There are many alternatives for such computation. We discuss a couple of them below.

In the very beginning (i.e., immediately after the query specification), there are no retrieved documents. Thus, one has to make simplifying assumptions such as: (a) assume that $P(k_i|R)$ is constant for all index terms $k_i$ (typically, equal to 0.5) and (b) assume that the distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection. These two assumptions yield

$$P(k_i|R) = 0.5$$

$$P(k_i|\bar{R}) = \frac{n_i}{N}$$

where, as already defined, $n_i$ is the number of documents which contain the index term $k_i$ and $N$ is the total number of documents in the collection. Given this initial guess, we can then retrieve documents which contain query terms and provide an initial probabilistic ranking for them. After that, this initial ranking is improved as follows.
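Under the initial guess, the per-term contribution to the log-odds ranking score simplifies: the first logarithm is log(0.5/0.5) = 0, leaving log((N - n_i)/n_i). The sketch below checks this numerically, assuming 0 < n_i < N so that both logarithms are defined; `initial_term_weight` is an illustrative name.

```python
import math

def initial_term_weight(n_i, n_docs):
    """Per-term score contribution with P(k|R) = 0.5 and P(k|not R) = n_i / N."""
    p = 0.5
    u = n_i / n_docs
    return math.log(p / (1 - p)) + math.log((1 - u) / u)

print(initial_term_weight(10, 1000))   # close to log(990 / 10)
print(initial_term_weight(600, 1000))  # negative: term occurs in most documents
```

The initial ranking therefore behaves like an idf-style weighting, with the side effect that a term found in more than half of the documents lowers the score.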

Let $V$ be a subset of the documents initially retrieved and ranked by the probabilistic model. Such a subset can be defined, for instance, as the top $r$ ranked documents where $r$ is a previously defined threshold. Further, let $V_i$ be the subset of $V$ composed of the documents in $V$ which contain the index term $k_i$. For simplicity, we also use $V$ and $V_i$ to refer to the number of elements in these sets (it should always be clear when the used variable refers to the set or to the number of elements in it). For improving the probabilistic ranking, we need to improve our guesses for $P(k_i|R)$ and $P(k_i|\bar{R})$. This can be accomplished with the following assumptions: (a) we can approximate $P(k_i|R)$ by the distribution of the index term $k_i$ among the documents retrieved so far, and (b) we can approximate $P(k_i|\bar{R})$ by considering that all the non-retrieved documents are not relevant. With these assumptions, we can write,

$$P(k_i|R) = \frac{V_i}{V}$$

$$P(k_i|\bar{R}) = \frac{n_i - V_i}{N - V}$$

This process can then be repeated recursively. By doing so, we are able to improve on our guesses for the probabilities $P(k_i|R)$ and $P(k_i|\bar{R})$ without any assistance from a human subject (contrary to the original idea). However, we can also use assistance from the user for definition of the subset $V$ as originally conceived.

The last formulas for $P(k_i|R)$ and $P(k_i|\bar{R})$ pose problems for small values of $V$ and $V_i$ which arise in practice (such as $V = 1$ and $V_i = 0$). To circumvent these problems, an adjustment factor is often added in which yields

$$P(k_i|R) = \frac{V_i + 0.5}{V + 1}$$

$$P(k_i|\bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}$$
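The whole procedure, from initial guess to repeated re-estimation without a human in the loop, can be sketched as below. This is an illustration under stated assumptions rather than a reference implementation: documents are sets of terms, every query term occurs in at least one but not all documents, and the parameters `r` (size of the subset V) and `rounds` are hypothetical names chosen here.

```python
import math

def rank(docs, query, p_rel, p_nonrel):
    """Order document ids by the BIR log-odds score, best first."""
    def score(terms):
        return sum(
            math.log(p_rel[k] / (1 - p_rel[k]))
            + math.log((1 - p_nonrel[k]) / p_nonrel[k])
            for k in query & terms
        )
    return sorted(docs, key=lambda d: score(docs[d]), reverse=True)

def feedback_ranking(docs, query, r=2, rounds=3):
    """docs: {doc_id: set of terms}; query: set of terms."""
    n_docs = len(docs)
    n = {k: sum(k in terms for terms in docs.values()) for k in query}
    # Initial guess: P(k|R) = 0.5 and P(k|not R) = n_i / N
    p_rel = {k: 0.5 for k in query}
    p_nonrel = {k: n[k] / n_docs for k in query}
    ranking = rank(docs, query, p_rel, p_nonrel)
    for _ in range(rounds):
        top = ranking[:r]                          # the subset V
        v = len(top)
        for k in query:
            v_k = sum(k in docs[d] for d in top)   # V_i
            # Re-estimation with the 0.5 adjustment factor
            p_rel[k] = (v_k + 0.5) / (v + 1)
            p_nonrel[k] = (n[k] - v_k + 0.5) / (n_docs - v + 1)
        ranking = rank(docs, query, p_rel, p_nonrel)
    return ranking

# Toy collection made up for illustration
docs = {
    "d1": {"a", "b"}, "d2": {"a"}, "d3": {"b", "c"},
    "d4": {"c"}, "d5": {"b"}, "d6": {"c", "d"},
}
result = feedback_ranking(docs, {"a", "b"})
print(result[:2])
```

Because the top-ranked documents are treated as relevant without verification, early ranking errors can reinforce themselves; this is the cost of dropping the user from the loop.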


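An adjustment factor which is constant and equal to 0.5 is not always satisfactory. An alternative is to take the fraction $n_i/N$ as the adjustment factor which yields

$$P(k_i|R) = \frac{V_i + \frac{n_i}{N}}{V + 1}$$

$$P(k_i|\bar{R}) = \frac{n_i - V_i + \frac{n_i}{N}}{N - V + 1}$$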
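The two adjustment factors can be compared on the problematic case V = 1, V_i = 0 mentioned in the text. The sketch below is a minimal illustration; the function name `reestimate` and the collection statistics are assumptions made for the example.

```python
def reestimate(n_i, n_docs, v_i, v, adjust):
    """Return (P(k|R), P(k|not R)) using the given adjustment factor."""
    p_rel = (v_i + adjust) / (v + 1)
    p_nonrel = (n_i - v_i + adjust) / (n_docs - v + 1)
    return p_rel, p_nonrel

# Hypothetical statistics: N = 100 documents, term in n_i = 10 of them
n_i, n_docs, v_i, v = 10, 100, 0, 1

constant = reestimate(n_i, n_docs, v_i, v, adjust=0.5)
proportional = reestimate(n_i, n_docs, v_i, v, adjust=n_i / n_docs)
print(constant)      # P(k|R) is pulled to 0.25
print(proportional)  # P(k|R) stays near the collection rate
```

Without any adjustment, $V_i/V$ would be 0 here and the logarithm in the ranking expression would be undefined. With the constant 0.5, a term absent from the single examined document still gets $P(k_i|R) = 0.25$, whereas the $n_i/N$ factor keeps the estimate closer to the term's overall frequency in the collection.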

This completes our discussion of the probabilistic model.

The main advantage of the probabilistic model, in theory, is that documents are ranked in decreasing order of their probability of being relevant. The disadvantages include: (1) the need to guess the initial separation of documents into relevant and non-relevant sets; (2) the fact that the method does not take into account the frequency with which an index term occurs inside a document (i.e., all weights are binary); and (3) the adoption of the independence assumption for index terms. However, as discussed for the vector model, it is not clear that independence of index terms is a bad assumption in practical situations.

2.5.5 Brief Comparison of Classic Models

In general, the Boolean model is considered to be the weakest classic method. Its main problem is the inability to recognize partial matches which frequently leads to poor performance. There is some controversy as to whether the probabilistic model outperforms the vector model. Croft performed some experiments and suggested that the probabilistic model provides a better retrieval performance. However, experiments done afterwards by Salton and Buckley refute that claim. Through several different measures, Salton and Buckley showed that the vector model is expected to outperform the probabilistic model with general collections. This also seems to be the dominant thought among researchers, practitioners, and the Web community, where the popularity of the vector model runs high.