Paul Joseph Davis

CouchDB Lucene Indexing

Notice

I rewrote couchdb-lucene pretty thoroughly last night. I decided that it was quick enough that instead of spamming my own blog I'll just spam your feed reader.

Overview

I updated couchdb-lucene today to work with trunk. I changed the behavior of the indexer to index views generated as per normal CouchDB semantics with a few minor constraints.

Indexing Strategy

The basics now revolve around a _design/lucene document in your database. Any view defined in this document will be indexed by Lucene. For a view to be indexed properly, make sure that every emit(key, value) call uses doc._id as the key.
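As a rough sketch of what such a document might look like, here is one way to build it in JavaScript. The view names foo and bar and the fields title and body are made up for illustration; the only things taken from the rules above are the _design/lucene id and doc._id as the emit key.

```javascript
// Hypothetical views: the names (foo, bar) and the document fields
// (title, body) are assumptions for illustration only.
// emit() is supplied by CouchDB's view server when the map function runs.
const views = {
  foo: function (doc) {
    // The key is always doc._id, as the indexer expects.
    if (doc.title) emit(doc._id, doc.title);
  },
  bar: function (doc) {
    if (doc.body) emit(doc._id, doc.body);
  },
};

// CouchDB stores map functions as source strings inside the design document.
const designDoc = {
  _id: "_design/lucene",
  language: "javascript",
  views: Object.fromEntries(
    Object.entries(views).map(([name, fn]) => [name, { map: fn.toString() }])
  ),
};

// This JSON is what you would save into the database as the design document.
console.log(JSON.stringify(designDoc, null, 2));
```

Writing the map functions as real functions and serializing them afterwards keeps them readable and lets you run them locally against a sample document before saving anything.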

In the future I plan on adding more configuration options to this document so that indexing can be controlled from Futon etc. At the moment the interaction is limited to just specifying views to be indexed.

The reasons for changing semantics from any specified views in any design document to all views in a single document are twofold. First, it makes the index reset semantics a lot easier to think about. Second, it allows us to extend the awesomeness of Lucene querying a crapload further.

For instance, if your _design/lucene document specifies two views, foo and bar, you can use the standard Lucene syntax like foo:plankton AND bar:goat to get back the intersection of those two views. In the future I can see adding support for numbers to support numeric range queries. Technically, if you emit string-sortable dates, you can already do date ranges.
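To make the date trick concrete, here is a small sketch that formats dates as YYYYMMDD strings, which sort the same way as the dates themselves, and builds a Lucene range clause from them. The view name created and both helper functions are hypothetical, and whether the range matches as expected depends on how the indexer analyzes the emitted value.

```javascript
// Format a Date as YYYYMMDD (UTC) so that string order equals date order.
// Assumes four-digit years.
function toSortableDate(date) {
  const year = String(date.getUTCFullYear());
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  const day = String(date.getUTCDate()).padStart(2, "0");
  return year + month + day;
}

// Build an inclusive range clause in standard Lucene query syntax.
// `field` is the name of a view in _design/lucene ("created" is made up here).
function dateRangeClause(field, from, to) {
  return `${field}:[${toSortableDate(from)} TO ${toSortableDate(to)}]`;
}

const start = new Date(Date.UTC(2008, 0, 1)); // 1 January 2008
const end = new Date(Date.UTC(2008, 11, 31)); // 31 December 2008

// Combine a plain term query on one view with a date range on another.
const query = "foo:plankton AND " + dateRangeClause("created", start, end);
console.log(query);
```

The map function for the created view would emit the same YYYYMMDD string as its value, so the query and the index agree on the format.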

Caveats

At the moment, Lucene indexes are reset every time the _design/lucene revision changes. Obviously this is sub-par and will change eventually, but I didn't have the brain power to consider all that was necessary to track when coding that bit.

Querying

After the rewrite, querying should be a crapload more efficient. I'm caching all of the Lucene objects in an LRU cache, so things should stay pretty quick.
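For anyone who hasn't run into the idea, an LRU cache keeps a fixed number of entries and throws out whichever one was used least recently. The real cache lives in the Java indexer; what follows is only a minimal JavaScript sketch of the eviction idea, with a made-up LruCache class and placeholder objects standing in for the Lucene ones.

```javascript
// Minimal LRU cache built on Map, which iterates keys in insertion order.
// Re-inserting a key on every access keeps the least recently used key first.
class LruCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = new Map();
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Move the key to the "most recently used" end.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used one.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }

  has(key) {
    return this.entries.has(key);
  }
}

// Placeholder values stand in for expensive per-database search objects.
const cache = new LruCache(2);
cache.set("db_a", { name: "searcher for db_a" });
cache.set("db_b", { name: "searcher for db_b" });
cache.get("db_a"); // db_a is now the most recently used entry
cache.set("db_c", { name: "searcher for db_c" }); // evicts db_b
console.log([...cache.entries.keys()]);
```

The point of the pattern is that databases being queried often keep their expensive objects open, while idle ones get dropped and are reopened on demand.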

For Lucene People

Right now I'm not using any of the extra fancy features in Lucene, and the few conversations I've had about Lucene internals make me realize that I'm probably doing some fairly nasty things. If you know Lucene and have some time, please take a look at org.apache.couchdb.lucene.Index and send me any comments about things I should be doing to make stuff suck less.

For Java People

My Java is probably less than awesome. If any of you out there feel like helping out with this project, please for the love of god start sending me pull requests on GitHub. That is all.