Wednesday, April 27, 2011

Lucene In-Memory Text Search Example

The Lucene text search engine library (from the Apache Jakarta project) provides fast and flexible search capabilities that can be easily integrated into many kinds of applications. Lucene provides a number of advanced capabilities “out of the box”, and can be extended to accommodate special needs.
For large text collections, you will almost always want to use disk-based indices that can be updated and reused across multiple executions of an application. For small collections, especially when running in an unsigned applet or WebStart application where disk access is not permitted, Lucene provides a mechanism for maintaining an in-memory index.

Prerequisites of the program:
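To compile and run this class, you will need to include the Lucene jar file (downloaded from http://jakarta.apache.org/lucene/) in your classpath. The code is written against the early Lucene 1.x API (Field.Text, Field.UnIndexed, Hits, and the static QueryParser.parse method), which later releases deprecated or removed, so use a jar from that era. The Output section at the end of this post shows the output from the class.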

Flow of program

The example below provides a simple illustration of this capability.  At a minimum, using Lucene typically involves the following steps:

  1. Build an index using IndexWriter
    1. For file-based indexes, a directory name can be passed to the IndexWriter constructor. In this example, however, we use the RAMDirectory class to maintain an in-memory index.
    2. Add Document objects representing each object to be searched to the IndexWriter. A Document is a collection of Field objects. Different subclasses of Field support indexed or unindexed content.
    3. Optimize and close the IndexWriter object.

  2. Update the index, by either rebuilding it from scratch or deleting (and, where appropriate, re-adding) Documents. Somewhat unintuitively, deleting Documents from an index is done with an IndexReader object, while adding them is done with an IndexWriter.
  3. Search the index using an IndexSearcher object.
    1. As with IndexWriters, IndexSearchers can be constructed with a directory name for file-based indexes. In this example, we pass in the RAMDirectory object that we created when the index was built.
    2. A Query object encapsulates the search query. These can be created using the QueryParser class.
    3. The Query object is passed to the IndexSearcher’s search(…) method, which returns a Hits object that provides access to the Document objects that match the query.
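
To make the build / update / search flow above concrete without depending on any particular Lucene version, here is a minimal sketch of a toy in-memory inverted index in plain Java. It is not Lucene's API or implementation: the class ToyIndex and its add, delete and search methods are hypothetical names invented for this illustration, and the tokenizer is a crude split on non-alphanumeric characters. It does mirror one idea from the example in this post: the title is stored but not indexed, while the content is indexed but not stored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy in-memory index for illustration only; NOT Lucene's API. */
public class ToyIndex {
    // Stored but not indexed: document id -> title (null once deleted).
    private final List<String> titles = new ArrayList<>();
    // Indexed but not stored: lower-cased token -> ids of documents containing it.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Step 1 (build): add a document and return its id. */
    public int add(String title, String content) {
        int id = titles.size();
        titles.add(title);
        for (String token : content.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    /** Step 2 (update): delete a document by id. */
    public void delete(int id) {
        titles.set(id, null);
        for (TreeSet<Integer> ids : postings.values()) {
            ids.remove(id);
        }
    }

    /** Step 3 (search): titles of documents containing the exact token. */
    public List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        TreeSet<Integer> ids = postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
        for (int id : ids) {
            hits.add(titles.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.add("Ayn Rand", "To be free, a man must be free of his brothers.");
        int gandhi = idx.add("Mohandas Gandhi",
            "Freedom is not worth having if it does not connote freedom to err.");
        System.out.println(idx.search("freedom")); // [Mohandas Gandhi]
        idx.delete(gandhi);
        System.out.println(idx.search("freedom")); // []
        System.out.println(idx.search("free"));    // [Ayn Rand]
    }
}
```

A real index also handles scoring, segment merging and persistence, which is why the Lucene classes are the right tool beyond toy cases.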

Sample Code
There are ways to customize practically every aspect of Lucene. The example below illustrates a minimal usage of the library.
/**
 * A simple example of an in-memory search using Lucene.
 */
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class InMemoryExample {

    public static void main(String[] args) {
        // Construct a RAMDirectory to hold the in-memory representation
        // of the index.
        RAMDirectory idx = new RAMDirectory();

        try {
            // Make a writer to create the index
            IndexWriter writer =
                new IndexWriter(idx, new StandardAnalyzer(), true);

            // Add some Document objects containing quotes
            writer.addDocument(createDocument("Theodore Roosevelt",
                "It behooves every man to remember that the work of the " +
                "critic is of altogether secondary importance, and that, " +
                "in the end, progress is accomplished by the man who does " +
                "things."));
            writer.addDocument(createDocument("Friedrich Hayek",
                "The case for individual freedom rests largely on the " +
                "recognition of the inevitable and universal ignorance " +
                "of all of us concerning a great many of the factors on " +
                "which the achievements of our ends and welfare depend."));
            writer.addDocument(createDocument("Ayn Rand",
                "There is nothing to take a man's freedom away from " +
                "him, save other men. To be free, a man must be free " +
                "of his brothers."));
            writer.addDocument(createDocument("Mohandas Gandhi",
                "Freedom is not worth having if it does not connote " +
                "freedom to err."));

            // Optimize and close the writer to finish building the index
            writer.optimize();
            writer.close();

            // Build an IndexSearcher using the in-memory index
            Searcher searcher = new IndexSearcher(idx);

            // Run some queries
            search(searcher, "freedom");
            search(searcher, "free");
            search(searcher, "progress or achievements");

            searcher.close();
        }
        catch(IOException ioe) {
            // In this example we aren't really doing any I/O, so this
            // exception should never actually be thrown.
            ioe.printStackTrace();
        }
        catch(ParseException pe) {
            pe.printStackTrace();
        }
    }

    /**
     * Make a Document object with an un-indexed title field and an
     * indexed content field.
     */
    private static Document createDocument(String title, String content) {
        Document doc = new Document();

        // Add the title as an unindexed field...
        doc.add(Field.UnIndexed("title", title));

        // ...and the content as an indexed field. Note that indexed
        // Text fields are constructed using a Reader. Lucene can read
        // and index very large chunks of text, without storing the
        // entire content verbatim in the index. In this example we
        // can just wrap the content string in a StringReader.
        doc.add(Field.Text("content", new StringReader(content)));

        return doc;
    }

    /**
     * Searches for the given string in the "content" field
     */
    private static void search(Searcher searcher, String queryString)
        throws ParseException, IOException {

        // Build a Query object
        Query query = QueryParser.parse(
            queryString, "content", new StandardAnalyzer());

        // Search for the query
        Hits hits = searcher.search(query);

        // Examine the Hits object to see if there were any matches
        int hitCount = hits.length();
        if (hitCount == 0) {
            System.out.println(
                "No matches were found for \"" + queryString + "\"");
        }
        else {
            System.out.println("Hits for \"" +
                queryString + "\" were found in quotes by:");

            // Iterate over the Documents in the Hits object
            for (int i = 0; i < hitCount; i++) {
                Document doc = hits.doc(i);

                // Print the value that we stored in the "title" field. Note
                // that this Field was not indexed, but (unlike the
                // "content" field) was stored verbatim and can be
                // retrieved.
                System.out.println("  " + (i + 1) + ". " + doc.get("title"));
            }
        }
        System.out.println();
    }
}



Output
Hits for "freedom" were found in quotes by:
  1. Mohandas Gandhi
  2. Ayn Rand
  3. Friedrich Hayek

Hits for "free" were found in quotes by:
  1. Ayn Rand

Hits for "progress or achievements" were found in quotes by:
  1. Theodore Roosevelt
  2. Friedrich Hayek
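
A likely explanation for two details of this output: "free" does not match the quotes that only contain "freedom", and the word "or" in the last query does not seem to act as a search term. As far as I know, StandardAnalyzer lower-cases tokens and drops common English stop words (including "or") but does no stemming, and the query parser combines the remaining terms with OR by default. The sketch below imitates that behavior in plain Java. ToyAnalyzer, its abbreviated stop list and the matches method are assumptions made up for illustration, not Lucene classes, and the regular-expression tokenizer is far cruder than the real one.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Simplified stand-in for text analysis; NOT Lucene's StandardAnalyzer. */
public class ToyAnalyzer {
    // Hypothetical, abbreviated stop list; a real analyzer's list is longer.
    private static final Set<String> STOP_WORDS =
        new LinkedHashSet<>(Arrays.asList("a", "and", "of", "or", "the", "to"));

    /** Lower-case the text, split it into tokens, and drop stop words. */
    public static Set<String> tokens(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    /** True if the text shares at least one whole token with the query. */
    public static boolean matches(String query, String text) {
        Set<String> docTokens = tokens(text);
        for (String q : tokens(query)) {
            if (docTokens.contains(q)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String gandhi = "Freedom is not worth having if it does not connote freedom to err.";
        String rand = "To be free, a man must be free of his brothers.";
        System.out.println(tokens("progress or achievements")); // [progress, achievements]
        System.out.println(matches("free", gandhi));    // false: "freedom" is a different token
        System.out.println(matches("free", rand));      // true
        System.out.println(matches("freedom", gandhi)); // true
    }
}
```

Matching is on whole tokens, so finding "freedom" with the query "free" would need a stemming analyzer or a wildcard query such as free*.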
