CouchDB from Far Away
=====================
I read another blog post about CouchDB [1] today. I'm not going to link to it.
This post isn't intended to be whiny drivel. Instead I thought I'd try and
describe CouchDB from the point of view of someone who's been using and
developing for it for a couple months. Hopefully I can manage to paint a bit of
an overall picture that tries to help new comers understand some of the more
basic ideas of CouchDB.
In a Nutshell
-------------
Notice the 'Nutshell'. This description is not intended to be complete or
detailed. I'm merely trying to paint the overall picture. Moving on.
CouchDB is centered around two core ideas. A document store and Map/Reduce.
Document Store
--------------
The central document store component is a flat name space of documents. A
document is represented in JSON. JSON looks like this:
{
"_id": "me",
"_rev": "234999001",
"name": "Paul Davis",
"online": {
"email": "paul.joseph.davis@gmail.com",
"blog": "http://www.davispj.com",
"twitter": "http://twitter.com/davisp"
}
}
That's it. CRUD operations are performed via HTTP using the GET, PUT, POST, and
DELETE verbs. Its dead simple.
Map/Reduce
----------
The second major component is the Map/Reduce framework. For those of you out
there who have never heard of Google, a brief description is: Map functions take
an input and give an output consisting of Key/Value pairs. A reduce function
takes a set of Key/Value pairs as input and produces a single output. Yes. It's
that simple. Kind of. (More later).
A map functions looks like this:
function(doc)
{
for(var i in doc.online)
{
emit(doc.name, doc.online[i]);
}
}
Given the example JSON document from above, it should be clear that this is
going to result in three Key/Value pairs. One for each of my online object
members. That's it.
Using Map
---------
These next two sentences are very important, pay close attention: Map output is
stored in ascending order according to the Key. Keys can be arbitrary JSON
objects over which there is a very specific sort order [2]. Now
read those last two sentences three or four more times. Regardless of how many
times you read them and think you understand them, chances are you're going to
end up in a situation and be taken aback about the crafty ways you can use those
two properties.
Now that you have that idea in your head, go read Christopher Lenz's blog post
on CouchDB Joins [3].
This is one of the main ideas behind using CouchDB. Understanding your data and
understanding how to sort it in such away that you can get what you need by
taking a slice of that sorted list. More on this later.
Using Reduce
------------
While ignoring the `rereduce` parameter for now, a reduce function looks like this:
function(keys, values, rereduce)
{
if(rereduce)
{
return sum(values);
}
else
{
return values.length;
}
}
Reductions have quite a few parameters but suffice it to say, this function will
give us two pieces of useable information. The total number of online places in
our entire database as well as the number of places per unique name. In my
experience, Reduce is used to get summary type information which makes a certain
amount of obvious sense. They can be used for getting a sum of values as shown,
or getting the top N results for each Key emitted by a Map. I won't say much
more on the subject because use cases are either very straightforward or
entirely too detailed for this post.
Gory Details
------------
So we have a Map. We have a Reduce. What *can't* we do with these? As it turns
out there are two specific types of limits. Both types of functions must produce
identical output for identical input. Reduce functions must also produce
identical output regardless of the order of input as well as operate on it's own
output. The second type of limitation regards calculating values that require
input from multiple documents. More on this shortly.
The reason?
-----------
Map/Reduce functions are calculated incrementally. If you create or update 5
records and access the Map/Reduce view, only 5 records are processed. 0 records
means no processing (other than data access). This is a fairly important point
people like to confuse. They think that these possibly complex Map/Reduce jobs
mean hours of waiting around when in reality, construction of the Map/Reduce
view is amortized over every single request made to the server. Shake your SQL
stick at them apples (No, caching is not the same thing).
Too Restrictive You Say?
------------------------
For anyone who is dejected after reading that last section, think of this next
one as your chocolate chip cookie. Consider the semi-recent release of Google's
AppEngine. There was a *lot* of confusion over how it worked. I mean, how could
the world's most hugest computing giant not support count(*), sum(), or return
more than a 1K records per query? Well it turns out that things in SQL (and
other systems) actually don't scale worth shit beyond a single node. And last I
heard, Google doesn't run on a single node so my guess is they're probably on to
something here.
By no means is CouchDB trying to directly emulate Google. Many might point at
the Map/Reduce frame work and scream the big G. To those people, I say go design
a multi-node query system that is simple and robust to errors. One of three
things will happen: you'll realize Map/Reduce is an excellent match, you'll
write something that is hugely overly complicated that everyone hates working
with, or you'll have a huge break through and define an entirely new area of
distributed computation.
Future Feature Fantasies
------------------------
I am not a core contributor. I speak for no one. I guarantee nothing. But this
is my list of features I want bad enough in CouchDB that I'm planning on
actually implementing them.
1. Extend the Map/Reduce framework to include recently published
[Map/Reduce/Merge][4] algorithms for distributed relational data
processing. M/R/M provides for the possibility of having incrementally
computed joins.
2. Transparent physical host spanning with fault tolerance as hosts enter and
exit the system at random. This part gets a bit tricky. Off the top of my
head, it'd require Paxos, some sort of doc-to-node hashing, a minimum copy
count algorithm etc.
3. Erlang full text indexing. I've spent a lot of time trying many different
solutions for full text indexing. The more I work through the different
options, I'm becoming more and more convinced that a pure Erlang
implementation is going to be required.
4. Erlang Plugins. Goes with the full text indexing but also applies to other
types of indexing that I'll be needing. Things like nested containment
lists etc. A good system that ends up being even easier than FireFox
extension installation is going to be necessary. This is a big
infrastructure and planning feature so it'll be mostly reliant on
community agreement rather than the code.
Kool-Aid
--------
If it's not obvious, I've drunk my fair share of the CouchDB Kool-Aid. If you're
still a bit uncertain, I suggest you take a peek over at Chris Anderson's
excellent blog posts on the applications [5] of the
future [6]. In my opinion, this is going to turn out to be a Very
Big Deal ™.
References
----------
[1]: http://couchdb.apache.org/
[2]: http://wiki.apache.org/couchdb/View_collation
[3]: http://www.cmlenz.net/archives/2007/10/couchdb-joins
[4]: http://portal.acm.org/citation.cfm?doid=1247480.1247602
[5]: http://jchris.mfdz.com/posts/128
[6]: http://jchris.mfdz.com/posts/129
Copyright Notice
----------------
Copyright 2008-2010 Paul Joseph Davis
License
-------
http://creativecommons.org/licenses/by/3.0/