A simple Java wrapper for Apache Lucene

In this post I’m sharing an implementation of a simple Java wrapper class for the information retrieval library Apache Lucene, along with some usage examples.

About Apache Lucene

Apache Lucene is a popular and widely used open-source information retrieval library.
Lucene is regarded as a powerful and flexible library and is used to build many kinds of search engines, mostly for its full-text indexing and querying capabilities. In Lucene, each document is modeled as a set of fields containing text. This model makes Lucene independent of the specific file format, so web pages, PDFs, Microsoft Word documents, etc. can all be indexed, as long as their textual content can be extracted.
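
To make the "set of fields" model a bit more concrete, here is a toy sketch of a field-aware inverted index written with plain Java collections. It is only an illustration of the idea, not how Lucene is implemented: the class name ToyFieldIndex, its methods, and the naive lowercase-and-split analysis are all assumptions made up for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy illustration only: maps "field:term" to the ids of the documents containing it. */
final class ToyFieldIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Indexes one document, given as a map from field name to extracted text. */
    void add(int docId, Map<String, String> fields) {
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Naive analysis: lowercase, then split on anything that is not a letter or digit.
            String text = field.getValue().toLowerCase(Locale.ROOT);
            for (String term : text.split("[^\\p{L}\\p{N}]+")) {
                if (term.isEmpty()) continue;
                postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>())
                        .add(docId);
            }
        }
    }

    /** Returns the ids of the documents whose given field contains the term. */
    Set<Integer> lookup(String field, String term) {
        Set<Integer> ids = postings.get(field + ":" + term.toLowerCase(Locale.ROOT));
        return ids == null ? Collections.<Integer>emptySet() : ids;
    }
}
```

Where the text comes from (a PDF, a web page, a Word file) doesn't matter to the index: once it has been extracted into field values, every document looks the same.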

Using Lucene: issues and restrictions

Using the Lucene library correctly isn’t trivial, both because useful documentation is scarce and because a number of technical issues must be taken into account.

The lifecycle of the index, and of all the API components used to read and write it, must be properly managed to keep index operations consistent, efficient and reliable. Lucene uses two major components to access the index: IndexWriter and IndexSearcher. The first is used to create the index and to add and delete documents. The second is used to perform search queries on it. One important restriction is that no more than one IndexWriter at a time can be opened on a given index directory, as pointed out in the IndexWriter documentation. IndexWriter instances are thread-safe, so multiple threads can share a single instance to modify the index concurrently.

The IndexSearcher component relies on an IndexReader instance, which offers a “frozen” view of the index. This implies that whenever the index is updated, the IndexReader must be reopened to see the changes, and a new IndexSearcher must be obtained from it. This can lead to some issues, because the old IndexReader instances must be closed to avoid memory leaks and an excessive number of open files on the file system. However, in a concurrent environment, closing an old IndexReader that is still referenced by another thread may prevent that thread from searching the index, since it might run into an AlreadyClosedException (see also How to safely close an IndexReader). Lucene 3.5 introduced a new SearcherManager component, which automatically manages the reopening and recycling of IndexReaders in a concurrent setting.
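
The acquire/release discipline behind this can be pictured with a small reference-counting sketch in plain Java. This is a simplified stand-in and not Lucene's actual SearcherManager code: RefCountedView and ViewManager are hypothetical names, and "closing" is reduced to flipping a flag. The point it tries to show is that an old view is only closed after the last thread using it has released it.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a reader: closed only when the last user releases it. */
final class RefCountedView {
    private final String name;
    // Starts at 1: that reference belongs to the manager that published this view.
    private final AtomicInteger refCount = new AtomicInteger(1);
    private volatile boolean closed = false;

    RefCountedView(String name) { this.name = name; }

    /** Takes a reference; returns false if the view was already fully released. */
    boolean tryIncRef() {
        int count;
        while ((count = refCount.get()) > 0) {
            if (refCount.compareAndSet(count, count + 1)) return true;
        }
        return false;
    }

    /** Drops a reference and closes the view when none are left. */
    void decRef() {
        if (refCount.decrementAndGet() == 0) closed = true; // real code would free files here
    }

    boolean isClosed() { return closed; }
    String name() { return name; }
}

/** Hypothetical manager: hands out the current view and swaps in new ones. */
final class ViewManager {
    private volatile RefCountedView current;

    ViewManager(RefCountedView initial) { current = initial; }

    /** Returns the current view with its count incremented; pair with release(). */
    RefCountedView acquire() {
        while (true) {
            RefCountedView view = current;
            if (view.tryIncRef()) return view;
            // The view was swapped out and closed in between: retry with the new one.
        }
    }

    void release(RefCountedView view) { view.decRef(); }

    /** Publishes a fresh view; the old one closes once its last user releases it. */
    synchronized void refresh(RefCountedView fresh) {
        RefCountedView old = current;
        current = fresh;
        old.decRef(); // drop the manager's own reference
    }
}
```

In this sketch, as with SearcherManager, every acquire() should be matched by a release() in a finally block, otherwise old views are never closed.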

Queries in Lucene can be built in two different ways: by using a QueryParser object or by using Lucene’s query API. The QueryParser generates a Query object from a query string. That’s the simplest way to build a query, but it has some limitations: for instance, you cannot use it to query a numeric field.

Another tricky aspect of the Lucene API is that documents aren’t assigned a stable ID. Each document does get a sequential number, but that number can change when other documents are deleted. Therefore, it’s up to the programmer to assign a unique ID to each document. Being able to access a document by its ID is a must in many application contexts.
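
As a rough sketch of what application-level ID assignment can look like, the snippet below draws random non-negative long values and retries on collisions. DocIdGenerator is a hypothetical helper written for this example, and the in-memory set is only a stand-in for looking the ID up in the index. It clears the sign bit with a mask rather than calling Math.abs, because Math.abs(Long.MIN_VALUE) is still negative.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

/** Sketch of application-level id assignment; the set stands in for an index lookup. */
final class DocIdGenerator {
    private final Set<Long> usedIds = new HashSet<>();

    /** Clears the sign bit, so the result is never negative. */
    static long toNonNegative(long bits) {
        return bits & Long.MAX_VALUE;
    }

    /** Draws random non-negative ids until one is unused, then reserves it. */
    synchronized long nextId() {
        long id;
        do {
            id = toNonNegative(UUID.randomUUID().getLeastSignificantBits());
        } while (!usedIds.add(id)); // add() returns false if the id was already taken
        return id;
    }
}
```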

IndexWrapper – a wrapper class for Lucene

This class provides methods for:

  • Adding new documents to the index, assigning each a unique ID
  • Searching the index, supporting boolean and phrase queries on string and integer fields
  • Deleting documents from the index
  • Getting the number of indexed documents

The query parameters are passed to the search method via a SearchParams object, which contains an arbitrary set of string and integer fields. The method generates a boolean OR query for each unquoted string field and a phrase query for each double-quoted string field. A NumericRangeQuery is generated for each integer field. The string fields are preprocessed through an analyzer, which usually includes a language-specific stop-word list and a stemmer. The default analyzer can be changed using the setAnalyzer(Analyzer) method.
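
The quoted-versus-unquoted rule can be sketched with plain strings, without any Lucene classes. FieldQuerySketch and its methods are hypothetical names used only for this illustration; the output imitates Lucene's textual query syntax, where a quoted value is a phrase and space-separated terms are treated as OR by the classic query parser's default settings.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the quoted-versus-unquoted rule, using strings instead of Lucene queries. */
final class FieldQuerySketch {
    /** True when the whole value is wrapped in double quotes. */
    static boolean isPhrase(String value) {
        return value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"");
    }

    /** Splits a value into whitespace-separated tokens, dropping surrounding quotes if any. */
    static List<String> tokens(String value) {
        String text = isPhrase(value) ? value.substring(1, value.length() - 1) : value;
        List<String> result = new ArrayList<>();
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) result.add(tok);
        }
        return result;
    }

    /** Renders the field query as text: one quoted phrase, or one clause per term. */
    static String describe(String field, String value) {
        if (isPhrase(value)) {
            return field + ":\"" + String.join(" ", tokens(value)) + "\"";
        }
        List<String> clauses = new ArrayList<>();
        for (String tok : tokens(value)) {
            clauses.add(field + ":" + tok);
        }
        return String.join(" ", clauses);
    }
}
```

For example, the value "open source" (with the quotes) for the field title is rendered as a single phrase clause, while the unquoted value open source becomes two optional term clauses on the same field.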

This class is designed to be thread-safe, so multiple threads can use it to search the index and add/delete documents.
Moreover, this class favors strict consistency over performance, since the index is committed every time a document is added or deleted, and the SearcherManager is refreshed before each query is executed.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.it.ItalianAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 *
 * @author Stefano Scerra (www.stefanoscerra.it)
 *
 * A wrapper class for Apache Lucene 4.9+ index management.
 * This class is designed to work with a file system-resident document index.
 * Provides methods to add new documents, perform searching, deletion, and more.
 * It automatically manages the index lifecycle, creating the index when
 * it doesn't exist, and efficiently reusing IndexSearcher/IndexReader instances
 * by using a SearcherManager.
 *
 */
public final class IndexWrapper
{    
    private static final Map<String, IndexWrapper> instancePool = new HashMap<String, IndexWrapper>();
    
    private final String indexPath;   
    // set the default analyzer
    private Analyzer analyzer = new ItalianAnalyzer(Version.LUCENE_4_9);    
    private IndexWriter indexWriter;
    private SearcherManager searchMgr;
    private boolean isOpen = false;
    
    /**
     *  Contains a set of string and integer fields to use in a search query.
     */
    public static class SearchParams
    {
        private Map<String, String> strings = new HashMap<String, String>();
        private Map<String, Integer> integers = new HashMap<String, Integer>();        

        public String getString(String fieldName)
        {
            return strings.get(fieldName);
        }

        public int getInt(String fieldName)
        {
            return integers.get(fieldName);
        }       
        
        public void putInt(String fieldName, int value)
        {
            integers.put(fieldName, value);
        }
        
        public void putString(String fieldName, String value)
        {
            strings.put(fieldName, value);
        }
        
        public Set<Map.Entry<String, String>> stringSet()
        {
            return strings.entrySet();
        }
        
        public Set<Map.Entry<String, Integer>> intSet()
        {
            return integers.entrySet();
        }

        public boolean isEmpty()
        {
            return strings.isEmpty() && integers.isEmpty();
        }
    }

     /**
     * Returns a unique instance of IndexWrapper for a given index.
     * @param indexPath the lucene index path on the file system
     * @return
     */
    public static synchronized IndexWrapper getInstance(String indexPath) throws IOException
    {
        IndexWrapper indexWrapper = instancePool.get(indexPath);
        if(indexWrapper != null) return indexWrapper;
        indexWrapper = new IndexWrapper(indexPath);
        instancePool.put(indexPath, indexWrapper);
        return indexWrapper;
    }
    
    private IndexWrapper(String indexPath) throws IOException
    {
        this.indexPath = indexPath;
        openIndex();
        initSearcherManager();
        isOpen = true;
    }
    
    
    public boolean isOpen()
    {
        return isOpen;
    }
    
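    /**
     * Sets the analyzer to be used for indexing and querying.
     * Queries use the new analyzer right away; the IndexWriter keeps the
     * analyzer it was opened with until close() and reopen() are called.
     * @param analyzer the analyzer to use
     */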
    public void setAnalyzer(Analyzer analyzer)
    {
        this.analyzer = analyzer;
    }   
    
     /**
     * Opens the index on the file system.
     * If the index does not exist, it is created.
     * Must be called only once.
     * @throws IOException
     */
    private void openIndex() throws IOException
    {
        Directory dir = FSDirectory.open(new File(indexPath));
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
        cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        IndexWriter iw = new IndexWriter(dir, cfg);
        iw.commit();
        indexWriter = iw;
    }
    
    /**
     * Initializes the index's SearcherManager component, which
     * automatically manages the IndexSearcher instances
     * @throws IOException
     */
    private void initSearcherManager() throws IOException
    {
        searchMgr = new SearcherManager(indexWriter, true, null);
    }
    
    /**
     * Finds and returns the document with the given id.
     * @param id
     * @return
     * @throws IOException
     */
    public Document getDocumentById(long id) throws IOException
    {        
        searchMgr.maybeRefreshBlocking();        
        IndexSearcher searcher = searchMgr.acquire();
        try
        {
            NumericRangeQuery q = NumericRangeQuery.newLongRange("id", id, id, true, true);
            TopDocs topDocs = searcher.search(q, 1);
            if(topDocs.scoreDocs.length == 0) return null;
            ScoreDoc scoreDoc = topDocs.scoreDocs[0];
            Document doc = searcher.doc(scoreDoc.doc);
            return doc;
        }
        finally
        {
            searchMgr.release(searcher);
        }
    }
    
    /**
     * Generates a new unique id for a document.
     * @throws IOException
     */
    private long generateDocID() throws IOException
    {
        long id;
        do
        {
            // clear the sign bit; Math.abs would stay negative for Long.MIN_VALUE
            id = UUID.randomUUID().getLeastSignificantBits() & Long.MAX_VALUE;
        }
        while (getDocumentById(id) != null); // we want a unique id
        
        return id;
    }
    
    /**
     * Adds a document to the index.
     * @param doc
     * @throws IOException
     */
    public void addDocument(Document doc) throws IOException
    {
        // generate a unique id for the document and add it to the index
        doc.add(new LongField("id", generateDocID(), Field.Store.YES));
        indexWriter.addDocument(doc);
        indexWriter.commit();
    }
    
    /**
     * Generates a boolean or a phrase query for a given field.
     * @param fieldName
     * @param value
     */
    private static Query createStringFieldQuery(String fieldName, String value)
    {
        if (value.matches("\".*\""))
        {
            // create phrase query
            String unquotedStr = value.substring(1, value.length() - 1);
            PhraseQuery phrase = new PhraseQuery();
            for (String tok : unquotedStr.trim().split("\\s+"))
            {
                phrase.add(new Term(fieldName, tok));
            }
            return phrase;
        }
        else
        {
            // create a boolean OR query
            BooleanQuery boolQuery = new BooleanQuery();
            for (String tok : value.trim().split("\\s+"))
            {
                boolQuery.add(new TermQuery(new Term(fieldName, tok)), BooleanClause.Occur.SHOULD);
            }

            return boolQuery;
        }
    }
    
    /**
     * Performs a search on the index. The query parameters are passed via a
     * SearchParams object.
     * @param params
     * @return A list of the top-10 documents matching the query
     * @throws IOException
     * @throws ParseException
     */
    public List<Document> search(SearchParams params) throws IOException, ParseException
    {        
        searchMgr.maybeRefreshBlocking();        
        IndexSearcher searcher = searchMgr.acquire();
        try
        {
            List<Document> results = new ArrayList<Document>();
            BooleanQuery query = new BooleanQuery();

            // create query for string fields
            BooleanQuery stringQuery = new BooleanQuery();
            for (Map.Entry<String, String> stringField : params.stringSet())
            {
                stringQuery.add(createStringFieldQuery(stringField.getKey(), stringField.getValue()),
                        BooleanClause.Occur.SHOULD);
            }
            if (!stringQuery.clauses().isEmpty())
            {
                // re-parse only the string part, so that its terms go through the analyzer
                QueryParser qp = new QueryParser(Version.LUCENE_4_9, "", analyzer);
                query.add(qp.parse(stringQuery.toString()), BooleanClause.Occur.SHOULD);
            }
            // create query for int fields; these are added as-is, because the
            // classic QueryParser would turn them into textual range queries
            for (Map.Entry<String, Integer> intField : params.intSet())
            {
                query.add(NumericRangeQuery.newIntRange(intField.getKey(), intField.getValue(),
                        intField.getValue(), true, true), BooleanClause.Occur.SHOULD);
            }

            TopDocs topDocs = searcher.search(query, 10);
            
            for (ScoreDoc doc : topDocs.scoreDocs)
            {
                results.add(searcher.doc(doc.doc));
            }

            return results;
        }
        finally
        {
            searchMgr.release(searcher);
        }
    }
    
    /**
     * Deletes the document with the given id from the index.
     * @param id
     * @throws IOException
     */
    public void deleteDocument(long id) throws IOException
    {    
        // delete from index
        Query deleteQuery = NumericRangeQuery.newLongRange("id", id, id, true, true);
        indexWriter.deleteDocuments(deleteQuery);
        indexWriter.commit();
    }
    
    /**
     * Returns the number of documents currently indexed.
     * @throws IOException
     */
    public int getNumDocs() throws IOException
    {       
        searchMgr.maybeRefreshBlocking();        
        IndexSearcher searcher = searchMgr.acquire();
        try
        {
            return searcher.getIndexReader().numDocs();
        }
        finally
        {
            searchMgr.release(searcher);
        }       
    }
    
    /**
     * Closes the index along with IndexWriter and SearcherManager components.
     * Call this method after you're done working with the index.
     * @throws IOException
     */
    public void close() throws IOException
    {
        synchronized(this)
        {
            if(searchMgr != null) searchMgr.close();
            if(indexWriter != null) indexWriter.close();
            searchMgr = null;
            indexWriter = null;
            isOpen = false;
        }
    }
    
    /**
     *  Reopens the index.
     *  close() must be called first.
     * @throws IOException
     */
    public void reopen() throws IOException
    {
        synchronized(this)
        {
            openIndex();
            initSearcherManager();
            isOpen = true;
        }
    }
    
}

Examples of usage

A stateless session bean that uses the IndexWrapper class to add a PDF document to the Lucene index

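import java.io.File;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.ejb.Stateless;
import javax.ejb.TransactionAttribute;
import javax.ejb.TransactionAttributeType;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.pdfbox.pdmodel.PDDocument;
// PDFBox 1.8.x package; in PDFBox 2.x this class lives in org.apache.pdfbox.text
import org.apache.pdfbox.util.PDFTextStripper;

@Stateless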
@TransactionAttribute(TransactionAttributeType.NOT_SUPPORTED)
public class IndexManager implements IndexManagerLocal
{
    public static final String ARCHIVE_PATH = "C:\\archive";
    public static final String INDEX_PATH = "C:\\index";
    
    private Document createDocument(String author, String title, int year, String fileName, String content)
    {
        Document doc = new Document();
        
        doc.add(new TextField("author", author, Field.Store.YES));
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new TextField("content", content, Field.Store.NO));
        doc.add(new IntField("year", year, Field.Store.YES));
        doc.add(new StoredField("fileName", fileName));
        
        return doc;
    }
    
    @Override
    public void addDocument(String author, String title, int year, String fileName)
    {
        try
        {
            // get an IndexWrapper instance for this index
            IndexWrapper indexWrapper = IndexWrapper.getInstance(INDEX_PATH);
            
            // get PDF text content via the PDFBox library
            File file = new File(ARCHIVE_PATH + "\\" + fileName);
            PDFTextStripper textStripper = new PDFTextStripper();
            PDDocument pdDoc = PDDocument.load(file);
            String content;
            try
            {
                content = textStripper.getText(pdDoc);
            }
            finally
            {
                // always release the PDF, even if text extraction fails
                pdDoc.close();
            }

            // create the Document object and add it to the index
            Document doc = createDocument(author, title, year, fileName, content);
            indexWrapper.addDocument(doc);
        }
        catch (Exception ex)
        {
            Logger.getLogger(IndexManager.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}

Performing a search on the index

// choose some query parameters
String author, content;
int year;

[...]

IndexWrapper indexWrapper = IndexWrapper.getInstance(INDEX_PATH);
// SearchParams is a nested class of IndexWrapper
IndexWrapper.SearchParams params = new IndexWrapper.SearchParams();

params.putString("author", author);
params.putString("content", content);
params.putInt("year", year);

List<Document> docs = indexWrapper.search(params);