Data Mining

Spring 2015
Lucene: An Open Source Toolkit
for Information Retrieval

Lucene: an Open Source IR Toolkit


We are going to learn:
How to index documents using Lucene
How to represent a query using Lucene
How to process a query using Lucene

Reference
Lucene in Action, 2nd Edition (by Michael
McCandless, Erik Hatcher, Otis Gospodnetić)

Lucene is freely available at
http://lucene.apache.org/core/ under the Apache
open source license.

Lucene Structure

Indexing path: Document (field1, field2, field3) -> addDocument() -> IndexWriter -> Lucene Index
Search path: Query -> search() -> IndexSearcher (reads the Lucene Index) -> Hits (Matching Docs)

Indexing in Lucene
The Lucene indexer is based on the inverted index
concept (remember posting lists).
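The inverted index idea can be sketched in a few lines of plain Java. This is only an illustration under simplifying assumptions, not Lucene's implementation: the class InvertedIndexSketch and its methods are hypothetical, tokens are produced by lowercasing and splitting on non-letters, and document IDs are simply assigned in insertion order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical teaching class, not part of Lucene.
public class InvertedIndexSketch {
    // term -> posting list (IDs of the documents containing the term, in increasing order)
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Tokenizes a document and appends its ID to the posting list of each distinct term.
    public int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            List<Integer> list = postings.computeIfAbsent(token, t -> new ArrayList<>());
            // IDs only grow, so checking the last entry is enough to avoid duplicates.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
        return docId;
    }

    // Returns the posting list for a term (empty if the term was never indexed).
    public List<Integer> postingsFor(String term) {
        return postings.getOrDefault(term.toLowerCase(), new ArrayList<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("Lucene in Action");             // doc 0
        index.addDocument("Lucene for Dummies");           // doc 1
        index.addDocument("Managing Gigabytes");           // doc 2
        index.addDocument("The Art of Computer Science");  // doc 3
        System.out.println("lucene   -> " + index.postingsFor("lucene"));   // [0, 1]
        System.out.println("computer -> " + index.postingsFor("computer")); // [3]
    }
}
```

Looking up a term is a single map access followed by a walk over its posting list, which is why search does not need to scan every document.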

Indexing in Lucene (Example)


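import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class IndexingExample { // class name is illustrative
    public static void main(String[] args) throws IOException {
        Directory directory = new RAMDirectory(); // in-memory index
        Analyzer analyzer = new SimpleAnalyzer(); // lowercases and splits on non-letters
        // true = create a new index, overwriting any existing one
        IndexWriter writer = new IndexWriter(directory, analyzer, true,
                IndexWriter.MaxFieldLength.LIMITED);
        addDoc(writer, "Lucene in Action"); // first document
        addDoc(writer, "Lucene for Dummies"); // second document
        addDoc(writer, "Managing Gigabytes"); // third document
        addDoc(writer, "The Art of Computer Science"); // fourth document
        writer.close();
    }

    private static void addDoc(IndexWriter writer, String value) throws IOException {
        Document doc = new Document();
        doc.add(new Field("contents", value, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }
}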

Field Options
Field.Index.ANALYZED: Breaks the field's
value into a stream of separate tokens and makes
each token searchable. This option is useful for
normal text fields (body, title, abstract, etc.).
Field.Index.NOT_ANALYZED: Indexes the field,
but keeps the entire field's value as a single token.
Useful for fields such as URLs.
Field.Index.NO: Doesn't make this field's value
available for searching.
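The difference between the three options is easiest to see by looking at which searchable terms a value would produce. The following plain-Java sketch models that behavior under stated assumptions; it does not call Lucene. The class FieldIndexSketch, the enum IndexMode, and the method indexedTerms are hypothetical, and the "analysis" step is a simple lowercase-and-split stand-in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical teaching class, not part of Lucene.
public class FieldIndexSketch {
    // Stand-ins for the three Field.Index options discussed above.
    public enum IndexMode { ANALYZED, NOT_ANALYZED, NO }

    // Returns the searchable terms a field value would contribute under each mode.
    public static List<String> indexedTerms(String value, IndexMode mode) {
        switch (mode) {
            case ANALYZED: {
                // Split into lowercase letter-only tokens; each one becomes searchable.
                List<String> terms = new ArrayList<>();
                for (String t : value.toLowerCase().split("[^a-z]+")) {
                    if (!t.isEmpty()) terms.add(t);
                }
                return terms;
            }
            case NOT_ANALYZED:
                // The whole value is one term, so only an exact match finds it.
                return Collections.singletonList(value);
            default:
                // NO: nothing is searchable.
                return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        String url = "http://lucene.apache.org/core/";
        System.out.println(indexedTerms(url, IndexMode.ANALYZED));     // [http, lucene, apache, org, core]
        System.out.println(indexedTerms(url, IndexMode.NOT_ANALYZED)); // [http://lucene.apache.org/core/]
        System.out.println(indexedTerms(url, IndexMode.NO));           // []
    }
}
```

The URL case shows why NOT_ANALYZED suits identifiers: analysis would break the URL into fragments that no longer identify it.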

Deleting Documents
Lucene's IndexWriter provides a variety of methods for
deleting indexed documents.
deleteDocuments(Term): deletes all documents
containing the provided term.
deleteDocuments(Term[]): deletes all documents
containing any of the terms in the provided array.
deleteDocuments(Query): deletes all documents
matching the provided query.
deleteDocuments(Query[]): deletes all documents
matching any of the queries in the provided array.
deleteAll(): deletes all documents in the index.

Deleting Single Document


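If you want to delete a single document, the
best solution is to store a unique ID for each
document in a separate field and delete by that ID.

IndexWriter writer = getWriter();
// Deletes every document whose "contents" field contains the term "computer".
// To delete exactly one document, pass a Term on its unique ID field instead.
writer.deleteDocuments(new Term("contents", "computer"));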

Expressing Query in Lucene


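public static void main(String[] args) throws ParseException
{
    String querystr = "hello world";
    // "contents" is the default field, searched when the query names no field.
    // analyzer is the Analyzer that was used for indexing.
    // Lucene 3.x constructors take a Version first, e.g. Version.LUCENE_30.
    QueryParser parser = new QueryParser("contents", analyzer);
    Query query = parser.parse(querystr);
}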

Query Expression
The default operator in Lucene is OR.
The query "hello world" is equivalent to
"hello OR world".
The class QueryParser parses the query
expression.
In Lucene we can specify the following kinds
of query expressions.
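The effect of the default OR operator can be sketched on posting lists: OR is a union, while an explicit AND is an intersection. The sketch below is plain Java and does not use Lucene; the class BooleanQuerySketch and the posting lists for "hello" and "world" are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical teaching class, not part of Lucene.
public class BooleanQuerySketch {
    // OR: a document matches if it appears in either posting list (set union).
    public static Set<Integer> or(List<Integer> a, List<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // AND: a document matches only if it appears in both posting lists (set intersection).
    public static Set<Integer> and(List<Integer> a, List<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Made-up posting lists for the two query terms.
        List<Integer> hello = Arrays.asList(0, 2, 5);
        List<Integer> world = Arrays.asList(2, 3);
        System.out.println("hello OR world  -> " + or(hello, world));  // [0, 2, 3, 5]
        System.out.println("hello AND world -> " + and(hello, world)); // [2]
    }
}
```

With OR as the default, adding words to a query can only widen the set of matching documents; ranking then decides which of them come first.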

Processing Query
After expressing the query, we need to create a Searcher to search the index. The call search(query, 10) then collects the 10 highest-scoring hits (Lucene uses a TopScoreDocCollector for this).
Searcher searcher = new IndexSearcher(directory);
TopDocs topDocs = searcher.search(query, 10);
System.out.println(topDocs.totalHits + " total results");
// scoreDocs holds at most 10 entries, while totalHits counts every match,
// so loop over scoreDocs.length to avoid running past the end of the array.
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    ScoreDoc match = topDocs.scoreDocs[i];
    Document doc = searcher.doc(match.doc);
    System.out.println(doc.get("contents"));
}
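Collecting only the top 10 hits means the searcher never has to sort every matching document. One common way to do this, sketched here in plain Java, is a bounded min-heap that always evicts the weakest candidate. This is an illustration of the general technique, not Lucene's collector code; the class TopKSketch, the method topK, and the scores are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical teaching class, not part of Lucene.
public class TopKSketch {
    // Returns the indices (document IDs) of the k highest scores, best first.
    public static List<Integer> topK(float[] scores, int k) {
        // Min-heap ordered by score: the weakest of the current top k sits at the head.
        PriorityQueue<Integer> heap =
                new PriorityQueue<>((a, b) -> Float.compare(scores[a], scores[b]));
        for (int doc = 0; doc < scores.length; doc++) {
            heap.add(doc);
            if (heap.size() > k) {
                heap.poll(); // evict the lowest-scoring candidate
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll()); // comes out in ascending score order
        }
        Collections.reverse(result); // best first
        return result;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f}; // made-up scores for docs 0..3
        System.out.println(topK(scores, 2));        // [1, 3]
    }
}
```

The heap never holds more than k + 1 entries, which mirrors why the number of returned hits is capped at the requested count even when many more documents match.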

How Lucene Scores Documents


By default, Lucene's ranking strategy is based
on the vector space model.
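In the vector space model, the query and each document are represented as term-weight vectors and ranked by how closely they point in the same direction. The sketch below uses textbook tf-idf weights and cosine similarity in plain Java. It is a simplified illustration and an assumption about the general model, not Lucene's actual scoring formula, which includes further factors such as length normalization and boosts. The class VectorSpaceSketch and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical teaching class, not part of Lucene.
public class VectorSpaceSketch {
    // Builds a tf-idf vector: weight(term) = tf * ln(N / df).
    public static Map<String, Double> tfIdf(List<String> tokens, Map<String, Integer> df, int numDocs) {
        Map<String, Double> vec = new HashMap<>();
        for (String t : tokens) vec.merge(t, 1.0, Double::sum); // raw term frequency
        for (Map.Entry<String, Double> e : vec.entrySet()) {
            int d = df.getOrDefault(e.getKey(), 0);
            double idf = (d == 0) ? 0.0 : Math.log((double) numDocs / d);
            e.setValue(e.getValue() * idf);
        }
        return vec;
    }

    // Cosine of the angle between two sparse vectors (0.0 if either has zero length).
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "in", "action"),
                Arrays.asList("lucene", "for", "dummies"),
                Arrays.asList("managing", "gigabytes"));
        // Document frequency: number of documents containing each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String t : new HashSet<>(doc)) df.merge(t, 1, Integer::sum);
        }
        Map<String, Double> query = tfIdf(Arrays.asList("lucene", "action"), df, docs.size());
        for (int i = 0; i < docs.size(); i++) {
            double score = cosine(query, tfIdf(docs.get(i), df, docs.size()));
            System.out.println("doc " + i + " score = " + score);
        }
    }
}
```

For the query "lucene action", the first document scores highest because it shares both terms, and the rarer term "action" carries more weight than "lucene", which appears in two documents.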

Explaining Rank Scores


With the Explanation class, Lucene can explain the rank score of each document by breaking it down into its contributing factors.
Searcher searcher = new IndexSearcher(directory);
TopDocs topDocs = searcher.search(query, 10);
System.out.println(topDocs.totalHits + " total results");
// Loop over the returned hits only; totalHits may exceed scoreDocs.length.
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    ScoreDoc match = topDocs.scoreDocs[i];
    Explanation explanation = searcher.explain(query, match.doc);
    Document doc = searcher.doc(match.doc);
    System.out.println(doc.get("contents"));
    System.out.println(explanation.toString());
}
