Searching made easy with Apache Lucene 4.3

Lucene is a full-text search engine library written in Java that can lend powerful search capabilities to any application. At the heart of Lucene lies a file-based full-text index. Lucene provides APIs to create this index and to add content to it and delete content from it. It also allows search and retrieval of information from the index using powerful search algorithms. The data stored can be pulled from disparate sources such as databases, filesystems and websites. Before beginning, let us consider a few terms.

Inverted Index

An inverted index is a data structure that maps content, such as words, to the locations of the objects that contain that content. To make this clearer, here are some examples:

  1. Book Index – The index of a book lists the important words and the pages that contain those words. So a book index helps us navigate to the pages that contain a particular word.
  2. Listing of wines by price range – The price range is the content and the wine name is the object that has that price range.
  3. Web Index – Listing of website addresses by keyword, for example a list of all webpages containing the keywords “Apache Lucene”.
  4. Shopping Cart – Listing of items in a shopping cart by category.
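
The idea behind these examples can be sketched in a few lines of plain Java. This is only a simplified illustration of the word-to-documents mapping, not how Lucene stores its index on disk; the class name TinyInvertedIndex and the sample titles are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical, simplified inverted index: it maps each word to the ids
// of the documents that contain it. Not Lucene's actual implementation.
class TinyInvertedIndex {

    // term -> sorted set of ids of the documents containing that term
    private final Map<String, Set<Integer>> postings = new TreeMap<String, Set<Integer>>();

    // Lowercases the text, splits it on non-alphanumeric characters
    // and records the document id under every resulting term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            Set<Integer> ids = postings.get(token);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(token, ids);
            }
            ids.add(docId);
        }
    }

    // Returns the ids of the documents containing the term, in ascending order.
    List<Integer> lookup(String term) {
        Set<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<Integer>() : new ArrayList<Integer>(ids);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Apache Lucene in Action");
        index.add(2, "Learning Apache Solr");
        index.add(3, "Lucene for Beginners");
        System.out.println(index.lookup("lucene")); // [1, 3]
        System.out.println(index.lookup("apache")); // [1, 2]
    }
}
```

A lookup is a single map access, which is why searching by word stays fast even when the number of documents grows.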

Faceted Search

Any object can have multiple properties; each of these properties is a facet of that object. Faceted search allows us to search for a collection of objects based on multiple facets. It is also known as faceted navigation or faceted browsing, and it lets us search information that is organized according to a faceted classification.

Consider an item in a shopping cart. The item can have multiple facets such as category, title, price, color and weight. A faceted search would allow us to find all the items that are in the garden category, are red in color and are priced between Rs.30 and Rs.40.
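
The shopping cart example can be sketched in plain Java as a filter over several facets at once. This is a minimal illustration only: it scans a list linearly, whereas a search engine such as Lucene answers the same question from its index. The Item class, the search method and the sample data are assumptions made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a faceted lookup over in-memory items.
class FacetFilterSketch {

    // Each field is one facet of the item.
    static class Item {
        final String title;
        final String category;
        final String color;
        final double price; // in Rs.

        Item(String title, String category, String color, double price) {
            this.title = title;
            this.category = category;
            this.color = color;
            this.price = price;
        }
    }

    // Returns the items that match the category, the color and the price range (inclusive).
    static List<Item> search(List<Item> items, String category, String color,
                             double minPrice, double maxPrice) {
        List<Item> matches = new ArrayList<Item>();
        for (Item item : items) {
            if (item.category.equals(category)
                    && item.color.equals(color)
                    && item.price >= minPrice
                    && item.price <= maxPrice) {
                matches.add(item);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Item> cart = Arrays.asList(
                new Item("Watering can", "garden", "red", 35.0),
                new Item("Garden hose", "garden", "green", 38.0),
                new Item("Coffee mug", "kitchen", "red", 32.0),
                new Item("Flower pot", "garden", "red", 55.0));
        // Only "Watering can" satisfies all three facet constraints.
        for (Item item : search(cart, "garden", "red", 30.0, 40.0)) {
            System.out.println(item.title);
        }
    }
}
```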

Lucene provides us an API to

  1. Create an inverted index.
  2. Store information according to a faceted classification.
  3. Retrieve information using faceted search.

Together, these make Lucene a fast search engine that returns highly relevant results.

Lucene Features

  1. Relevance-ranked search.
  2. Phrase, proximity and wildcard search.
  3. Pluggable analyzers.
  4. Faceted search.
  5. Field-based sorting.
  6. Range queries.
  7. Multiple-index searching.
  8. Fast indexing (150GB/hour, as quoted on the features page linked below).
  9. Easy backup and restore.
  10. Small RAM requirements.
  11. Incremental additions and fast searches.

For the full list, visit:

http://lucene.apache.org/core/features.html

Lucene Concepts and Terminologies

  1. Indexing – Indexing involves adding documents to the Lucene index with the help of a class called “IndexWriter”.
  2. Searching – Searching involves retrieving documents from the Lucene index with the help of a class called “IndexSearcher”.
  3. Document – A Lucene Document is the single unit of indexing and search, for example an item in a shopping cart. A Lucene index can contain millions of documents.
  4. Fields – Fields are the properties of a document. In other words, fields are the facets of the object that the document represents, for example the category of an item in a shopping cart. Each document can have multiple fields.
  5. Queries – Lucene has its own query language, which allows us to search for documents based on multiple fields. We can assign a weight to a field and combine clauses with boolean operators such as AND and OR. For example: return all items in the cart which belong to category garden or home, have color red and have a price less than Rs.1000.
  6. Analyzers – Before the text of a field is indexed, it needs to be converted into its most basic form. The text is first tokenized, and the tokens are then typically lowercased and stripped of punctuation; depending on the analyzer, common stop words may be dropped or words reduced to a root form. These tasks are performed by Analyzers. Analyzers are complicated and using them well requires deeper study. When the built-in analyzers do not suffice for a requirement, we can create a new one. For this tutorial we will be using StandardAnalyzer, as it contains most of the basic features we require.
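
To make the analysis step more concrete, here is a rough plain-Java sketch of what "tokenize, lowercase, drop punctuation and stop words" means. It is a simplified illustration and not the behaviour of Lucene's StandardAnalyzer; the class name SimpleAnalysisSketch and the tiny stop-word list are assumptions chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of an analysis chain: lowercase, tokenize, filter stop words.
class SimpleAnalysisSketch {

    // Tiny stop-word list made up for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "of", "and"));

    // Turns raw field text into the list of terms that would be indexed.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        // Splitting on non-alphanumeric characters also removes punctuation.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prints [art, computer, programming, vol, 1]
        System.out.println(analyze("The Art of Computer Programming, Vol. 1"));
    }
}
```

The same chain has to be applied to the query text at search time, otherwise a query for "Art" would never match the indexed term "art".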

Tutorial objective

  1. Create a Lucene index.
  2. Insert book records into it.
  3. Perform various kinds of searches on this index.

The book item will have the following facets

  1. Book Title (String)
  2. Book Author (String)
  3. Book Category (String)
  4. #Pages (int)
  5. Price (float)

The code for this tutorial has been committed to SVN. It can be checked out from

https://www.assembla.com/code/weblog4j/subversion/nodes/24/SpringDemos/trunk

This is an extended project with more tutorials. The Lucene classes are in the com.aranin.spring.lucene package

  1. LuceneUtil – This class contains utility methods to create the index, the IndexWriter and the IndexSearcher.
  2. MySearcherManager – This class uses LuceneUtil and performs searches on the index.
  3. MyWriterManager – This class uses LuceneUtil and performs writes on the index.

Step by step walk-through

1. Dependencies – The dependencies can be added via Maven

<dependency>
  <artifactId>lucene-core</artifactId>
  <groupId>org.apache.lucene</groupId>
  <type>jar</type>
  <version>${lucene-version}</version>
</dependency>

<dependency>
  <artifactId>lucene-queries</artifactId>
  <groupId>org.apache.lucene</groupId>
  <type>jar</type>
  <version>${lucene-version}</version>
</dependency>

<dependency>
  <artifactId>lucene-queryparser</artifactId>
  <groupId>org.apache.lucene</groupId>
  <type>jar</type>
  <version>${lucene-version}</version>
</dependency>

<dependency>
  <artifactId>lucene-analyzers-common</artifactId>
  <groupId>org.apache.lucene</groupId>
  <type>jar</type>
  <version>${lucene-version}</version>
</dependency>

<dependency>
  <artifactId>lucene-facet</artifactId>
  <groupId>org.apache.lucene</groupId>
  <type>jar</type>
  <version>${lucene-version}</version>
</dependency>

2. Creating the index – The index can be created by creating an IndexWriter in create mode.

public void createIndex() throws Exception {

    boolean create = true;
    File indexDirFile = new File(this.indexDir);
    if (indexDirFile.exists() && indexDirFile.isDirectory()) {
       create = false;
    }

    Directory dir = FSDirectory.open(indexDirFile);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43, analyzer);

    if (create) {
       // Create a new index in the directory, removing any
       // previously indexed documents:
       iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    }

    IndexWriter writer = new IndexWriter(dir, iwc);
    writer.commit();
    writer.close(true);
 }
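  • indexDir is the directory where you want to create your index.
  • Directory is a flat list of files used for storing the index. It can be a RAMDirectory, an FSDirectory or a DB-based directory.
  • FSDirectory implements Directory and saves the index as files in the file system.
  • IndexWriterConfig.OpenMode determines whether the writer is opened in CREATE, APPEND or CREATE_OR_APPEND mode. CREATE creates a new index or overwrites an existing one, APPEND opens an existing index, and CREATE_OR_APPEND creates the index if it does not exist and otherwise appends to it. The method above sets CREATE only when the index directory does not exist yet; otherwise the configuration keeps its default mode, CREATE_OR_APPEND, so an existing index is not overwritten.
  • Calling the above method on a new directory creates an empty index.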

3. Writing to the index – Once the index is created we can write documents to it. This can be done as follows.

public void createIndexWriter() throws Exception {

     boolean create = true;
     File indexDirFile = new File(this.indexDir);

     Directory dir = FSDirectory.open(indexDirFile);
     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
     IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43, analyzer);
     iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
     this.writer = new IndexWriter(dir, iwc);

    }

The above method creates a writer in CREATE_OR_APPEND mode. In this mode an existing index is not overwritten. Note that this method does not close the writer; it just creates it and stores it in an instance variable. Creating an IndexWriter is a costly operation, so we should not create a writer every time we have to write a document to the index. In addition, only one IndexWriter can be open on an index directory at a time, because the writer holds a write lock on it. Since IndexWriter is thread-safe, the usual approach is to create a single writer and share it among the threads that write to the index, rather than keeping a pool of writers.

public void addBookToIndex(BookVO bookVO) throws Exception {
     Document document = new Document();
     document.add(new StringField("title", bookVO.getBook_name(), Field.Store.YES));
     document.add(new StringField("author", bookVO.getBook_author(), Field.Store.YES));
     document.add(new StringField("category", bookVO.getCategory(), Field.Store.YES));
     document.add(new IntField("numpage", bookVO.getNumpages(), Field.Store.YES));
     document.add(new FloatField("price", bookVO.getPrice(), Field.Store.YES));
     IndexWriter writer =  this.luceneUtil.getIndexWriter();
     writer.addDocument(document);
     writer.commit();
 }

We don't create a writer in the code while inserting. Instead we use a pre-created writer that was stored as an instance variable.

4. Searching the index – This is again done in two steps: 1. creating an IndexSearcher, and 2. creating a query and running the search.

public void createIndexSearcher(){
    IndexReader indexReader = null;
    IndexSearcher indexSearcher = null;
    try{
         File indexDirFile = new File(this.indexDir);
         Directory dir = FSDirectory.open(indexDirFile);
         indexReader  = DirectoryReader.open(dir);
         indexSearcher = new IndexSearcher(indexReader);
    }catch(IOException ioe){
        ioe.printStackTrace();
    }

    this.indexSearcher = indexSearcher;
 }

Note – The Analyzer used while searching should be the same as the one used while indexing, since the analyzer determines how the data is stored in the index. Opening the underlying IndexReader is a costly operation, so it makes sense to create the IndexSearcher once and reuse it, as with the IndexWriter. IndexSearcher is thread-safe, so a single instance can be shared across threads; Lucene also provides a SearcherManager class for managing a shared searcher.

public List<BookVO> getBooksByField(String value, String field, IndexSearcher indexSearcher){
     List<BookVO> bookList = new ArrayList<BookVO>();
     Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
     QueryParser parser = new QueryParser(Version.LUCENE_43, field, analyzer);

     try {
         BooleanQuery query = new BooleanQuery();
         query.add(new TermQuery(new Term(field, value)), BooleanClause.Occur.MUST);

        // Alternative: Query query = parser.parse(value); (throws ParseException)
        int numResults = 100;
        ScoreDoc[] hits =   indexSearcher.search(query,numResults).scoreDocs;
        for (int i = 0; i < hits.length; i++) {
             Document doc = indexSearcher.doc(hits[i].doc);
             bookList.add(getBookVO(doc));
        }

     } catch (IOException e) {
         e.printStackTrace(); 
     }

     return bookList;
}

The IndexSearcher was pre-created and passed to the method. The main part of searching is query formation. Lucene supports many different kinds of queries.

  1. TermQuery
  2. BooleanQuery
  3. WildcardQuery
  4. PhraseQuery
  5. PrefixQuery
  6. MultiPhraseQuery
  7. FuzzyQuery
  8. RegexpQuery
  9. TermRangeQuery
  10. NumericRangeQuery
  11. ConstantScoreQuery
  12. DisjunctionMaxQuery
  13. MatchAllDocsQuery

You can choose the appropriate queries for your searches. The query language syntax is described here:

http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.pdf

References

  1. http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.pdf
  2. http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/all/org/apache/lucene/index/IndexWriterConfig.OpenMode.html
  3. http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/store/FSDirectory.html
  4. https://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html
  5. http://www.lucenetutorial.com/lucene-query-syntax.html
  6. http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/Query.html

Summary

Search remains a backbone of any content-driven application. Traditional DB-driven searches are not very powerful and leave a lot to be desired, so there is a need for a fast, accurate and powerful search solution that can be easily incorporated into application code. Lucene fills that gap: it makes search a breeze and is backed by a powerful array of search features such as relevance ranking and phrase, wildcard, proximity and range searches. It is also space and memory efficient. No wonder so many applications have been built on top of Lucene. This article is meant as a basic tutorial that gives readers the tools for getting started with Lucene. There is a lot more to be said, but then don’t you want to explore some on your own :-) ?

If you find this article useful please drop a comment or two.

Warm Regards

Niraj


About Niraj Singh

I am CEO and co-founder of a startup, Aranin Software Private Limited, Bangalore. I graduated in 2002 as an Aerospace Engineer from IIT Kharagpur. I love working on new ideas and projects, and I recently released my first open source project, JaiomServer (http://jaiomserver.org). I have 9 years of experience in the IT industry, most of which I have spent developing community applications for various clients using Java. Some of the sites I have been actively involved with are hgtv.com, food.com, foodnetwork.com, pickle.com and diynetwork.com.