I read another blog post about CouchDB today. I'm not going to link to it. This post isn't intended to be whiny drivel. Instead I thought I'd try and describe CouchDB from the point of view of someone who's been using and developing for it for a couple months. Hopefully I can manage to paint a bit of an overall picture that tries to help new comers understand some of the more basic ideas of CouchDB.
Notice the 'Nutshell'. This description is not intended to be complete or detailed. I'm merely trying to paint the overall picture. Moving on.
CouchDB is centered around two core ideas. A document store and Map/Reduce.
The central document store component is a flat name space of documents. A document is represented in JSON. JSON looks like this:
{
"_id": "me",
"_rev": "234999001",
"name": "Paul Davis",
"online": {
"email": "paul.joseph.davis@gmail.com",
"blog": "http://www.davispj.com",
"twitter": "http://twitter.com/davisp"
}
}
That's it. CRUD operations are performed via HTTP using the GET, PUT, POST, and DELETE verbs. Its dead simple.
The second major component is the Map/Reduce framework. For those of you out there who have never heard of Google, a brief description is: Map functions take an input and give an output consisting of Key/Value pairs. A reduce function takes a set of Key/Value pairs as input and produces a single output. Yes. It's that simple. Kind of. (More later).
A map functions looks like this:
function(doc)
{
for(var i in doc.online)
{
emit(doc.name, doc.online[i]);
}
}
Given the example JSON document from above, it should be clear that this is going to result in three Key/Value pairs. One for each of my online object members. That's it.
These next two sentences are very important, pay close attention: Map output is stored in ascending order according to the Key. Keys can be arbitrary JSON objects over which there is a very specific sort order. Now read those last two sentences three or four more times. Regardless of how many times you read them and think you understand them, chances are you're going to end up in a situation and be taken aback about the crafty ways you can use those two properties.
Now that you have that idea in your head, go read Christopher Lenz's blog post on CouchDB 'Joins.
This is one of the main ideas behind using CouchDB. Understanding your data and understanding how to sort it in such away that you can get what you need by taking a slice of that sorted list. More on this later.
While ignoring the rereduce parameter for now, a reduce function looks like this:
function(keys, values, rereduce)
{
if(rereduce)
{
return sum(values);
}
else
{
return values.length;
}
}
Reductions have quite a few parameters but suffice it to say, this function will give us two pieces of useable information. The total number of online places in our entire database as well as the number of places per unique name. In my experience, Reduce is used to get summary type information which makes a certain amount of obvious sense. They can be used for getting a sum of values as shown, or getting the top N results for each Key emitted by a Map. I won't say much more on the subject because use cases are either very straightforward or entirely too detailed for this post.
So we have a Map. We have a Reduce. What can't we do with these? As it turns out there are two specific types of limits. Both types of functions must produce identical output for identical input. Reduce functions must also produce identical output regardless of the order of input as well as operate on it's own output. The second type of limitation regards calculating values that require input from multiple documents. More on this shortly.
Map/Reduce functions are calculated incrementally. If you create or update 5 records and access the Map/Reduce view, only 5 records are processed. 0 records means no processing (other than data access). This is a fairly important point people like to confuse. They think that these possibly complex Map/Reduce jobs mean hours of waiting around when in reality, construction of the Map/Reduce view is amortized over every single request made to the server. Shake your SQL stick at them apples (No, caching is not the same thing).
For anyone who is dejected after reading that last section, think of this next one as your chocolate chip cookie. Consider the semi-recent release of Google's AppEngine. There was a lot of confusion over how it worked. I mean, how could the world's most hugest computing giant not support count(*), sum(), or return more than a 1K records per query? Well it turns out that things in SQL (and other systems) actually don't scale worth shit beyond a single node. And last I heard, Google doesn't run on a single node so my guess is they're probably on to something here.
By no means is CouchDB trying to directly emulate Google. Many might point at the Map/Reduce frame work and scream the big G. To those people, I say go design a multi-node query system that is simple and robust to errors. One of three things will happen: you'll realize Map/Reduce is an excellent match, you'll write something that is hugely overly complicated that everyone hates working with, or you'll have a huge break through and define an entirely new area of distributed computation.
I am not a core contributor. I speak for no one. I guarantee nothing. But this is my list of features I want bad enough in CouchDB that I'm planning on actually implementing them.
If it's not obvious, I've drunk my fair share of the CouchDB Kool-Aid. If you're still a bit uncertain, I suggest you take a peek over at Chris Anderson's excellent blog posts on the applications of the future. In my opinion, this is going to turn out to be a Very Big Deal™.