Lucene
Doug Cutting
Created in 1999
Donated to Apache in 2001
Features
Highly scalable
Java (1.4)
Ports to many other languages
No crawler
No document parsing
No PageRank
Powered by Lucene
IBM Omnifind Y! Edition
Technorati
Wikipedia
Internet Archive
LinkedIn
monster.com
Indexing
Logical structure
An index is a collection of documents
Documents are a collection of fields
Fields are the content
Stored: stored verbatim for retrieval with results
Indexed: tokenized and made searchable
Physical structure
Multiple documents (with all fields) stored in segments
All segments together make up the index
mergeFactor
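Inverted index (figure): each term in the dictionary lists the ids of the documents that contain it
Documents: 0 Little Red Riding Hood, 1 Robin Hood, 2 Little Women
Terms and postings:
aardvark
hood 0, 1
little 0, 2
red 0
riding 0
robin 1
women 2
zoo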
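The term-to-document mapping in the figure above can be sketched in a few lines. This is a toy illustration, not Lucene's implementation: the document ids 0, 1, 2 are assumed, and the analysis is reduced to lowercasing plus whitespace splitting.

```python
from collections import defaultdict

# Titles from the slide; the numeric ids are an assumption for illustration.
docs = {
    0: "Little Red Riding Hood",
    1: "Robin Hood",
    2: "Little Women",
}


def build_inverted_index(documents):
    """Map each lowercased term to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort terms alphabetically, as in a term dictionary.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


index = build_inverted_index(docs)
for term, ids in index.items():
    print(term, ids)
```

Looking up a term is then a dictionary access: `index["hood"]` gives the ids of both titles containing "Hood".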
Analysis
Extract tokens from text (tokenizer)
Whitespace Hyphens
Input: LexCorp BFG-9000
WhitespaceTokenizer: LexCorp | BFG-9000
WordDelimiterFilter (catenateWords=1): Lex Corp LexCorp | BFG 9000
LowercaseFilter: lex corp lexcorp | bfg 9000
Searching
Query Creation
Query parser
Manual query construction from terms
title:Bell author:Hemingway^3.0
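Document text: LexCorp BFG-9000
WhitespaceTokenizer: LexCorp | BFG-9000
WordDelimiterFilter (catenateWords=1): Lex Corp LexCorp | BFG 9000
LowercaseFilter: lex corp lexcorp | bfg 9000
Query text: Lex corp bfg9000
WhitespaceTokenizer: Lex | corp | bfg9000
WordDelimiterFilter (catenateWords=1): Lex | corp | bfg 9000
LowercaseFilter: lex | corp | bfg 9000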
A Match!
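A rough sketch of why the two texts match once both pass through the same analysis chain. The functions below only imitate the behaviour shown on the slide for WhitespaceTokenizer, WordDelimiterFilter (catenateWords=1) and LowercaseFilter; they are simplified stand-ins, not the Lucene or Solr classes, and the match test (every query term appears among the indexed terms) ignores positions and scoring.

```python
import re

# Splits a token on case changes, letter/digit boundaries and punctuation.
PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def whitespace_tokenize(text):
    return text.split()


def word_delimiter(tokens, catenate_words=True):
    """Emit sub-word parts; optionally also emit runs of word parts joined together."""
    out = []
    for token in tokens:
        run = []  # consecutive alphabetic parts of the current token
        for part in PART.findall(token) + [None]:  # None flushes the last run
            if part is not None and part.isalpha():
                run.append(part)
                out.append(part)
                continue
            if catenate_words and len(run) > 1:
                out.append("".join(run))
            run = []
            if part is not None:
                out.append(part)
    return out


def analyze(text):
    return [t.lower() for t in word_delimiter(whitespace_tokenize(text))]


index_terms = analyze("LexCorp BFG-9000")
query_terms = analyze("Lex corp bfg9000")
is_match = set(query_terms) <= set(index_terms)
print(index_terms, query_terms, is_match)
```

The point of the design is that index-time and query-time analysis produce terms from the same vocabulary, so differently written inputs can still meet in the index.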
Proximity
"quick fox"~4
Wildcard and prefix
pla?e (plate or place or plane)
practic* (practice or practical or practically)
Range
date:[05072007 TO 05232007] (inclusive)
author:{king TO mason} (exclusive)
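The wildcard and range semantics above can be illustrated with plain string matching. This is a sketch of the matching rules only, not of how Lucene evaluates these queries; the vocabulary and the author name "lewis" are hypothetical, and range bounds are assumed to compare as plain strings.

```python
from fnmatch import fnmatchcase

# Hypothetical term dictionary for illustration.
terms = ["place", "plane", "plate", "player",
         "practical", "practically", "practice", "pragmatic"]


def wildcard(pattern, vocabulary):
    """'?' matches exactly one character, '*' matches any run of characters."""
    return [t for t in vocabulary if fnmatchcase(t, pattern)]


def in_range(value, low, high, inclusive):
    """[low TO high] includes both bounds; {low TO high} excludes them."""
    if inclusive:
        return low <= value <= high
    return low < value < high


single_char = wildcard("pla?e", terms)
prefix = wildcard("practic*", terms)

authors = ["king", "lewis", "mason"]
exclusive = [a for a in authors if in_range(a, "king", "mason", inclusive=False)]
inclusive = [a for a in authors if in_range(a, "king", "mason", inclusive=True)]
print(single_char, prefix, exclusive, inclusive)
```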
Lucene Sub-projects
Nutch
Web crawler with document parsing
Hadoop
Distributed data processor
Implements MapReduce
Solr
Solr
Yonik Seeley
Developed at CNET
Donated to Apache in 2006
Features
Servlet
Web Administration Interface
XML/HTTP, JSON Interfaces
Faceting
Schema to define types and fields
Highlighting
Caching
Index Replication (Master / Slaves)
Pluggable
Java 5
Powered by Solr
Netflix
CNET
Smithsonian
AOL: sports and music
RightNow ??
Drupal module
GameSpot
Configuration (solrconfig.xml)
<mainIndex>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>10000</maxFieldLength>
</mainIndex>
<requestHandler name="standard" class="solr.StandardRequestHandler" />
<requestHandler name="custom" class="your.package.CustomRequestHandler" />
<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>1000</maxTime>
</autoCommit>
<queryResponseWriter name="xml" class="org.apache.solr.request.XMLResponseWriter" default="true"/>
Schema (schema.xml)
Fields
<uniqueKey>id</uniqueKey>
<field name="products" type="text" indexed="true" stored="true"/>
<field name="keywords" type="text_ws" indexed="true" stored="true"/>
<field name="keywordsSorted" type="text_sorted" indexed="true" stored="false"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>
<dynamicField name="*_i" type="integer" indexed="true" stored="true"/>
<dynamicField name="desc_*" type="string" indexed="true" stored="false"/>
<copyField source="keywords" dest="keywordsSorted"/>
Analyzers
<fieldtype name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldtype>
<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
  </analyzer>
</fieldtype>
Insertion
<add>
</add>
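The body of the `<add>` message did not survive on the slide. As a hedged illustration, Solr's XML update format wraps each document in a `<doc>` element with one `<field name="...">` per value; the sketch below builds such a message with the standard library. The field names come from the schema shown earlier and the id from the delete example, while the field values are made up.

```python
import xml.etree.ElementTree as ET


def build_add_message(documents):
    """Serialize a list of {field name: value} dicts as an <add> update message."""
    add = ET.Element("add")
    for fields in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; only the field names are taken from the schema slide.
message = build_add_message([
    {"id": "05991", "products": "ipod", "keywords": "music player"},
])
print(message)
```

The resulting string would be sent to the update handler over HTTP, followed by a commit so that the change becomes visible to searches.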
Update / Delete
Inserting a document whose uniqueKey is already present replaces the original
Deleting
By uniqueKey field
<delete><id>05991</id></delete>
By query
<delete><query>name:Anthony</query></delete>
<commit/>
<optimize/>
Search
Core parameters
qt: query type (request handler)
wt: writer type (response writer)
Common parameters
q: query
sort
start
rows
fq: filters
fl: return fields
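A small sketch of how these parameters combine into a request URL. The base URL is the one used in the faceting example; the helper name `build_select_url` and the field names `price` and `name` are hypothetical, and the values are illustrative only.

```python
from urllib.parse import urlencode


def build_select_url(base, **params):
    """Append URL-encoded query parameters to the select endpoint."""
    return base + "?" + urlencode(params, doseq=True)


url = build_select_url(
    "http://localhost:8983/solr/select",
    q="ipod",             # the query
    fq="inStock:true",    # filter that restricts results
    sort="price asc",     # hypothetical sort field
    start=0,              # offset of the first result
    rows=10,              # number of results to return
    fl="id,name",         # fields to return (hypothetical names)
    wt="xml",             # response writer
)
print(url)
```

Passing a list as a value (for example two `fq` filters) produces a repeated parameter, because `doseq=True` expands sequences.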
Faceting
Available in StandardRequestHandler and DisMaxRequestHandler
http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=1&facet.field=cat&facet.mincount=1&facet.field=inStock

<response>
  <responseHeader>
    <status>0</status>
    <QTime>3</QTime>
  </responseHeader>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="cat">
        <int name="music">1</int>
        <int name="connector">2</int>
        <int name="electronics">3</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>
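On the client side, the facet counts in a response like the one above can be pulled out with a standard XML parser. The sketch below embeds a trimmed copy of that response; the function name `facet_counts` is hypothetical, and fetching the response over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response shown on the slide.
RESPONSE = """<response>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="cat">
        <int name="music">1</int>
        <int name="connector">2</int>
        <int name="electronics">3</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>"""


def facet_counts(xml_text):
    """Return {facet field: {value: count}} from a Solr XML response."""
    root = ET.fromstring(xml_text)
    fields = root.find("lst[@name='facet_counts']/lst[@name='facet_fields']")
    return {
        field.get("name"): {v.get("name"): int(v.text) for v in field.findall("int")}
        for field in fields.findall("lst")
    }


counts = facet_counts(RESPONSE)
print(counts)
```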
Replication
Master / Slave architecture for load balancing and backups
More-like-this
Easy to add RequestHandlers and ResponseWriters
Responses in many formats
Hit highlighting
Sources