Spring 2015
Lucene: An Open-Source Toolkit for Information Retrieval
Reference
Lucene in Action, 2nd Edition (by Michael McCandless, Erik Hatcher, Otis Gospodnetić)
Lucene Structure
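[Figure: A Document made up of fields (field1, field2, field3) is passed to IndexWriter via addDocument() and written to the Lucene Index. A Query is passed to IndexSearcher via search(), which reads the Lucene Index and returns Hits (the matching documents).]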
Indexing in Lucene
The Lucene indexer is based on the inverted index concept (recall posting lists):
each term maps to the list of documents that contain it.
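As a reminder of what a posting list looks like, here is a minimal sketch in plain Java. It is a toy model, not Lucene's actual index format; the class name InvertedIndexSketch and its methods are made up for illustration, and tokenization is assumed to be a simple lowercase split on non-alphanumeric characters.

```java
import java.util.*;

// Toy inverted index: each term maps to a sorted posting list of document IDs.
// Illustrative only; this is not how Lucene stores its index.
class InvertedIndexSketch {
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    // Lowercase the text, split it into terms, and record docId under each term.
    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the posting list for a term (empty if the term was never indexed).
    List<Integer> postingList(String term) {
        return new ArrayList<>(
            postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>()));
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "Lucene is a search library");
        index.addDocument(2, "An inverted index speeds up search");
        System.out.println(index.postingList("search")); // [1, 2]
        System.out.println(index.postingList("lucene")); // [1]
    }
}
```

Looking up a term is a single map access, which is why an inverted index avoids scanning every document at query time.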
Field Options
Field.Index.ANALYZED: Breaks the field's
value into a stream of separate tokens and makes
each token searchable. This option is useful for
normal text fields (body, title, abstract, etc.).
Field.Index.NOT_ANALYZED: Indexes the field, but
treats the entire field value as a single token. Useful
for fields such as URLs that must match exactly.
Field.Index.NO: Does not make this field's value
available for searching.
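The difference between the three options can be sketched with a toy model in plain Java. This is not Lucene code: the enum IndexOption and the method indexedTerms are hypothetical names, and the "analyzer" is assumed to be nothing more than lowercasing plus a whitespace split.

```java
import java.util.*;

// Toy model of the three Field.Index options; not Lucene code.
class FieldIndexSketch {
    enum IndexOption { ANALYZED, NOT_ANALYZED, NO }

    // Return the terms that would become searchable for a field value.
    static List<String> indexedTerms(String value, IndexOption option) {
        switch (option) {
            case ANALYZED:
                // Assumed stand-in for an analyzer: lowercase, split on whitespace.
                return Arrays.asList(value.toLowerCase(Locale.ROOT).trim().split("\\s+"));
            case NOT_ANALYZED:
                // The whole value is kept as one unmodified token.
                return Collections.singletonList(value);
            default:
                // NO: nothing is searchable.
                return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        System.out.println(indexedTerms("Lucene in Action", IndexOption.ANALYZED));
        // [lucene, in, action]
        System.out.println(indexedTerms("http://example.com/docs/index.html", IndexOption.NOT_ANALYZED));
        // [http://example.com/docs/index.html]
        System.out.println(indexedTerms("internal note", IndexOption.NO));
        // []
    }
}
```

With NOT_ANALYZED, only a query for the complete, exact value matches, which is the behavior wanted for identifiers such as URLs.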
Deleting Documents
IndexWriter provides several methods for
deleting indexed documents.
deleteDocuments(Term): delete all documents
containing the provided term.
deleteDocuments(Term[]): delete all documents
containing any of the terms in the provided array.
deleteDocuments(Query): delete all documents
matching the provided query.
deleteDocuments(Query[]): delete all documents
matching any of the queries in the provided array.
deleteAll(): delete all documents in the index.
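The "any of the terms" semantics of the array variants can be illustrated with a small toy model in plain Java. It only mimics the observable behavior; the class DeleteByTermSketch and its methods are hypothetical, terms are plain strings rather than Lucene Term objects, and nothing here reflects how Lucene actually performs deletions internally.

```java
import java.util.*;

// Toy model of delete-by-term semantics; not Lucene code.
class DeleteByTermSketch {
    // docId -> set of terms in that document (stand-in for an index).
    private final Map<Integer, Set<String>> docs = new TreeMap<>();

    void addDocument(int docId, String... terms) {
        docs.put(docId, new HashSet<>(Arrays.asList(terms)));
    }

    // Delete every document containing ANY of the given terms;
    // return how many documents were removed.
    int deleteDocuments(String... terms) {
        int before = docs.size();
        List<String> wanted = Arrays.asList(terms);
        docs.values().removeIf(docTerms -> !Collections.disjoint(docTerms, wanted));
        return before - docs.size();
    }

    // Remove every document.
    void deleteAll() {
        docs.clear();
    }

    // IDs of the documents that are still present.
    Set<Integer> liveDocs() {
        return new TreeSet<>(docs.keySet());
    }

    public static void main(String[] args) {
        DeleteByTermSketch index = new DeleteByTermSketch();
        index.addDocument(1, "lucene", "search");
        index.addDocument(2, "index", "search");
        index.addDocument(3, "query");
        System.out.println(index.deleteDocuments("lucene"));         // 1
        System.out.println(index.liveDocs());                        // [2, 3]
        System.out.println(index.deleteDocuments("index", "query")); // 2
        System.out.println(index.liveDocs());                        // []
    }
}
```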
Query Expression
The default operator in Lucene is OR.
The query hello world is equivalent to
hello OR world.
The QueryParser class parses the query
expression into a Query object.
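What the default operator means for the result set can be sketched in plain Java over hand-built posting lists. This is a toy model, not QueryParser: the class DefaultOperatorSketch and its search method are hypothetical, the query is assumed to be a bare list of whitespace-separated terms, and explicit operators such as OR are not parsed.

```java
import java.util.*;

// Toy model of a default boolean operator; not Lucene's QueryParser.
class DefaultOperatorSketch {
    enum Operator { OR, AND }

    // Combine the posting lists of the query terms with the default operator:
    // OR takes the union of the lists, AND takes their intersection.
    static Set<Integer> search(Map<String, Set<Integer>> postings,
                               String query, Operator defaultOp) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term seeds the result
            } else if (defaultOp == Operator.OR) {
                result.addAll(docs);            // union
            } else {
                result.retainAll(docs);         // intersection
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        postings.put("hello", new TreeSet<>(Arrays.asList(1, 2)));
        postings.put("world", new TreeSet<>(Arrays.asList(2, 3)));
        System.out.println(search(postings, "hello world", Operator.OR));  // [1, 2, 3]
        System.out.println(search(postings, "hello world", Operator.AND)); // [2]
    }
}
```

With OR as the default, a document needs to contain only one of the terms to match. In Lucene itself the default can be changed on the QueryParser with setDefaultOperator.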
Lucene's query syntax supports several kinds
of query expressions, such as term, phrase, wildcard, fuzzy, range, and boolean queries.
Processing Query
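After expressing the query, we create an IndexSearcher to search the index and ask it for the 10 top-scoring hits.
The search(query, 10) call used here returns them as a TopDocs object; a TopScoreDocCollector can be instantiated explicitly instead.
// directory (the index Directory) and query (the parsed Query) are assumed to exist already.
IndexSearcher searcher = new IndexSearcher(directory);
TopDocs topDocs = searcher.search(query, 10);   // keep at most the 10 best hits
System.out.println(topDocs.totalHits + " total results");
// totalHits counts every match, but scoreDocs holds at most 10 entries,
// so iterate over scoreDocs rather than up to totalHits.
for (ScoreDoc match : topDocs.scoreDocs) {
    Document doc = searcher.doc(match.doc);     // load the stored fields of this hit
    System.out.println(doc.get("contents"));
}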
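To show why the number of returned hits can be smaller than the total number of matches, here is a minimal sketch of top-N collection in plain Java. It illustrates the general bounded min-heap technique only and does not claim to be Lucene's implementation; the names TopHitsSketch, Hit, and topN are hypothetical.

```java
import java.util.*;

// Toy top-N collector using a bounded min-heap; not Lucene code.
class TopHitsSketch {
    // A scored hit: document ID plus relevance score.
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Keep only the n best hits, returned from highest to lowest score.
    static List<Hit> topN(List<Hit> allHits, int n) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (Hit hit : allHits) {
            heap.offer(hit);
            if (heap.size() > n) {
                heap.poll(); // evict the current lowest-scoring hit
            }
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort(byScore.reversed());
        return best;
    }

    public static void main(String[] args) {
        List<Hit> matches = Arrays.asList(
            new Hit(1, 0.2f), new Hit(2, 0.9f), new Hit(3, 0.5f), new Hit(4, 0.7f));
        List<Hit> top = topN(matches, 2);
        System.out.println(matches.size() + " total results"); // 4 total results
        for (Hit hit : top) {
            System.out.println(hit.doc); // 2, then 4
        }
    }
}
```

Here four documents match but only two are kept, which mirrors the gap between the total hit count and the length of the returned result array.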