I really don't like Java. Not one bit. I spent entirely too long fighting with it to get a Lucene full text indexer for CouchDB running. Somewhere during that work I stumbled across some Python bindings for Hyper Estraier that looked pretty awesome. In about three hours I managed to duplicate about forty hours of Java work (estimates adjusted for personal bias). Now I present you with HyperCouch.
Just to start you off with some indication of what's currently available in HyperCouch:
The basic steps to get HyperCouch up and running are covered on the GitHub page. Hopefully those directions are sufficient for you to get the necessary bits installed. I feel a bit bad about requiring so many projects to get things working, but hopefully the install process doesn't cause too many issues. Feel more than free to email me directly at paul.joseph.davis@gmail.com if you're having issues.
Once it appears to be up and running, the only thing you should need to do is add an ft_index function to one or more of your _design/documents in CouchDB. An ft_index function acts very much like a normal CouchDB view function in that it takes a single document as input and produces some output. Unlike view functions, ft_index functions don't use an emit(key, value) function to communicate results. Instead they use index(data) and property(name, value) functions to specify data that should be indexed.
index(data) calls should specify textual data that is intended for searching via full text queries like 'foo AND bar'.
property(name, value) calls specify properties of the document that should be indexed for operations such as filtering and sorting of full text search results.
{
"_id": "_design/my_awesome_doc",
"_rev": "024244112",
"ft_index" : "function(doc) {if(doc.body) index(doc.body); if(doc.baz) property(\"baz_prop\", doc.baz)}"
}
This stuff is pretty straightforward. You can specify an ft_index on as many _design/documents as you want. The functions are additive, in that index(data) and property(name, value) calls from every function attach data to the same document object. There will probably be weirdness if your function doesn't compile as JavaScript right now. Proper error reporting for that is on the agenda.
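To make the additive behaviour concrete, here is a minimal sketch that runs two ft_index-style functions against one document. The `makeCollector` and `runFtIndexes` helpers are hypothetical stand-ins written for this example, not HyperCouch's internals, and `index` and `property` are passed in as arguments only so the sketch runs on its own. The second function (`titleIndexer`) is also made up for illustration.

```javascript
// Hypothetical stand-ins for the index() and property() helpers that
// HyperCouch exposes to ft_index functions. They only record what was passed.
function makeCollector() {
  const collected = { texts: [], properties: {} };
  return {
    collected,
    index(data) { collected.texts.push(String(data)); },
    property(name, value) { collected.properties[name] = value; },
  };
}

// Run every ft_index function against one document, sharing one collector,
// so all calls attach data to the same indexed document object.
function runFtIndexes(ftIndexFns, doc) {
  const collector = makeCollector();
  for (const fn of ftIndexFns) {
    fn(doc, collector.index, collector.property);
  }
  return collector.collected;
}

// The ft_index body from the design document above, written out readably.
const bodyIndexer = (doc, index, property) => {
  if (doc.body) index(doc.body);
  if (doc.baz) property("baz_prop", doc.baz);
};

// A second, made-up ft_index function from another design document.
const titleIndexer = (doc, index, property) => {
  if (doc.title) index(doc.title);
};

const result = runFtIndexes(
  [bodyIndexer, titleIndexer],
  { title: "Hello", body: "foo bar", baz: 3 }
);
console.log(JSON.stringify(result));
```

Both functions contribute text, and the property set by the first one sits alongside it on the same record.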
Without further ado, the list of URL parameters for querying HyperCouch is as follows:
q - The full text query. (Can use AND, OR, etc. See the Hyper Estraier Search documentation.)
matching - Specifies different Hyper Estraier query processing types. (Default is most applicable.)
limit - Limit the number of returned documents.
skip - Skip a number of documents in the result set.
order - Specify ordering of results on arbitrary parameters. See the Hyper Estraier docs.
highlight - Receive HTML highlights from the documents returned. (Currently only supports highlight=html.)
I'm not sure exactly what all the applications of the different matching types are, but I've included support for them, mostly because I was curious what they did. I still can't really figure out the difference.
Supported matching types:
I ran into ordering-related issues while testing this bit. There were a few oddities with results being returned in different orders between requests. I know that Hyper Estraier calls quicksort internally, which isn't a stable sort, so technically that could be part of the issue. Let it be said: I only test limit/offset with a specified ordering.
You can specify 'order=prop_name [STRA|STRD|NUMA|NUMD]' to receive results in 'string ascending', 'string descending', 'number ascending', or 'number descending' order. From testing, Hyper Estraier appears to do an internal conversion, so you shouldn't have to worry about proper typing, though you might get unexpected results if you number sort something that can't be converted to a number.
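As a small sketch of building the order value, the helper below checks the mode against the four documented options and puts the property name first, which is the form used in the order=wicked_prop_name+NUMD example URL. `orderParam` is a hypothetical helper name, not part of HyperCouch.

```javascript
// The four ordering modes described above.
const ORDER_MODES = new Set(["STRA", "STRD", "NUMA", "NUMD"]);

// Hypothetical helper: build the value of the order parameter.
function orderParam(propName, mode) {
  if (!ORDER_MODES.has(mode)) {
    throw new RangeError(`unknown order mode: ${mode}`);
  }
  return `${propName} ${mode}`;
}

// URLSearchParams encodes the space as "+", matching the example URLs.
const orderQuery = new URLSearchParams({
  q: "witty",
  order: orderParam("wicked_prop_name", "NUMD"),
});
console.log(orderQuery.toString()); // q=witty&order=wicked_prop_name+NUMD
```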
If you specify 'highlight=html' in the query string each returned document will contain a highlight member that is an HTML snippet of the indexed document. It was easy to implement and isn't thoroughly tested, but it's there.
You can filter on arbitrary properties using the operators specified in the Hyper Estraier Search docs. Each doc can have an arbitrary number of properties associated with it, and you can combine any number of property filters in one query. For those of you reading ahead in the Hyper Estraier docs, the proper format for the query string is property_name=operator argument. For example, if you called property("foo", doc.foo_value) in your ft_index function, you can specify 'foo=NUMLT 3' in the URL to receive only documents whose foo value is less than three. There are fifteen or so different operators you can use for limiting both string and numeric properties.
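Here is a minimal sketch of the property_name=operator argument format, using only the NUMLT and NUMBT operators that appear in this post. `propertyFilter` is a hypothetical helper, and it does not check the operator name against Hyper Estraier's list.

```javascript
// Hypothetical helper: join an operator and its arguments with spaces,
// e.g. propertyFilter("NUMLT", 3) gives "NUMLT 3".
function propertyFilter(operator, ...args) {
  if (args.length === 0) {
    throw new Error(`operator ${operator} needs at least one argument`);
  }
  return [operator, ...args].join(" ");
}

// Several property filters can be combined in one query string.
const filterQuery = new URLSearchParams({ q: "random doc" });
filterQuery.append("foo", propertyFilter("NUMLT", 3));
filterQuery.append("prop_awesome", propertyFilter("NUMBT", 50, 100000));
console.log(filterQuery.toString());
// q=random+doc&foo=NUMLT+3&prop_awesome=NUMBT+50+100000
```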
Operator types for property matching taken from the Hyper Estraier docs:
http://127.0.0.1:5984/db_name/_fti?q=foo+bar
http://127.0.0.1:5984/db_name/_fti?q=baz&limit=2
http://127.0.0.1:5984/db_name/_fti?q=homies&offset=19&limit=1
http://127.0.0.1:5984/db_name/_fti?q=no+ide&matching=rough
http://127.0.0.1:5984/db_name/_fti?q=*.**&my_property=NUMLT+2
http://127.0.0.1:5984/db_name/_fti?q=random+doc&prop_awesome=NUMBT+50+100000
http://127.0.0.1:5984/db_name/_fti?q=witty&order=wicked_prop_name+NUMD
http://127.0.0.1:5984/db_name/_fti?q=domain+universe&highlight=html
http://127.0.0.1:5984/db_name/_fti?q=which&skip=2
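URLs like the ones above can also be assembled in code rather than by hand. This is only a sketch: `buildFtiUrl` is a hypothetical helper, and the host, port, and database name are the placeholder values from the examples.

```javascript
// Hypothetical helper: build a HyperCouch _fti query URL from a plain object.
// Entries whose value is undefined or null are left out of the query string.
function buildFtiUrl(baseUrl, dbName, params) {
  const search = new URLSearchParams();
  for (const [name, value] of Object.entries(params)) {
    if (value !== undefined && value !== null) {
      search.append(name, String(value));
    }
  }
  return `${baseUrl}/${encodeURIComponent(dbName)}/_fti?${search.toString()}`;
}

const ftiUrl = buildFtiUrl("http://127.0.0.1:5984", "db_name", {
  q: "baz",
  limit: 2,
  skip: undefined, // omitted from the URL
});
console.log(ftiUrl); // http://127.0.0.1:5984/db_name/_fti?q=baz&limit=2
```

Letting URLSearchParams do the encoding keeps spaces in queries and operator arguments as "+" without manual escaping.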
The returned data should look something like this:
{
"total_rows": 2,
"rows": [
{"id": "doc_id", "prop1": "val1", "prop2": "val2"},
{"id": "doc_id", "prop3": "val_schrodinger"}
]
}
The structure is probably going to have some refinements and there are a few caveats in property names for indexing, but all in all it should be fairly easy to figure out.
Remember that total_rows is the number of documents matching the query. At the moment there is no way to get the total number of indexed documents for a given database.
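Since total_rows counts every match, it can be used to plan paging with skip and limit. The sketch below parses the sample response from above and works out the skip/limit pairs needed to walk the full result set. `pagePlan` is a hypothetical helper, and it assumes total_rows is unaffected by the limit you pass.

```javascript
// Hypothetical helper: list the skip/limit pairs that cover totalRows results.
function pagePlan(totalRows, pageSize) {
  if (!Number.isInteger(pageSize) || pageSize <= 0) {
    throw new RangeError("pageSize must be a positive integer");
  }
  const pages = [];
  for (let skip = 0; skip < totalRows; skip += pageSize) {
    pages.push({ skip, limit: pageSize });
  }
  return pages;
}

// The sample response shown earlier in this post.
const response = JSON.parse(`{
  "total_rows": 2,
  "rows": [
    {"id": "doc_id", "prop1": "val1", "prop2": "val2"},
    {"id": "doc_id", "prop3": "val_schrodinger"}
  ]
}`);

const ids = response.rows.map((row) => row.id);
const plan = pagePlan(response.total_rows, 1);
console.log(ids, plan);
```

Given the ordering oddities mentioned earlier, it is probably wise to pass an explicit order with every page request.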
Hopefully that's enough of a description to whet your appetite. I'll be adding more features and better error messages as I go along. Hopefully I can trick a few people into using it and sending me feedback to make it better. Like I said, feel free to email me with questions or suggestions.