Paul Joseph Davis

CouchDB Joins

Overview

There has been a fairly active thread on the couchdb-user about how to join couchdb documents. The authoritative documentation on such things so far has been a blog post by cmlenz. Its a very good introduction into the power of collation. Unfortunately a couple issues keep cropping up on the list generally related to avoiding denormalization in some form or another.

After thinking about the most recent thread I think I've managed to sit down and create a couple views that would fulfill the offered requirements. For reference, a short description of the setup:

The Documents

  1. User documents: A unique non-changing document_id with a cusotmizable username.
  2. Comment documents: Contain a reference to the user document id.

Goals

  1. Get a list of users that have comments
  2. Get a list of users with 5 comments (sorted on some criteria).

General Outline

All code available here.

Briefly, the idea here is to generate a Map/Reduce view that will list the users with comments and the number of comments per user. This is pretty easy but necessary to allow us to get the list of users with their top five comments while ignoring users with no comments.

There are two Map/Reduce views we'll be using, "with-comments" and "max-comments". "with-comments" will be responsible for listing the users with the number of their comments. Users with zero comments will not appear in this view. By paging through this view will allow us to walk through the "max-comments" view getting the data we need to display a listing of users with their top N comments. (Sorry, N not configurable at query time.)

I'm using couchview from couchrest to manage my views. Hence the '-map' and '-reduce' suffices on each view name. Both Map/Reduce views are assumed to be placed in a "users" design doc.

Generating data

First things first, here's a script to load some data into a db named "test" on the local host:

#! /usr/bin/env python

import random
import uuid

import couchdb

server = couchdb.Server('http://localhost:5984')
if 'test' not in server:
    server.create('test')
db = server['test']

def new_uuid():
    return uuid.uuid4().hex.upper()

users = 1000
docs = []
for i in range(users):
    uid = new_uuid()
    docs.append({'_id': uid, 'type': 'user', 'name': 'user-%06d' % i})
    for c in range(random.randint(0,25)):
        cid = new_uuid()
        docs.append({'_id': cid, 'type': 'comment', 'user_id': uid,
                    'position': c, 'text': "Comment %d %d" % (i, c)})

db.update(docs)

So, here we load 1000 users and a random number of comments (0 to 25). Its pretty straight forward. You'll notice it requires the couchdb-python module.

Use the group_level, Luke

Remember, the default behavior of CouchDB's reduce is to produce a single value. Reduces work by creating a 'shadow' tree on top of the view b+tree. A good description (with pretty pictures) is this article by Ricky Ho from Adobe. So in order to produce a reduce result that has multiple rows we must include a group=true or group_level=N query parameter.

The group=true parameter means that for every unique key, we will get a single reduce value. If we have a view that has keys that are JSON arrays, we can use group_level=N to reduce arrays that have N identical prefix elements. Ie:

for i in range(N):
    key1[i] == key2[i]

Since each of the reduces below is going to want multiple rows from the reduce functions each will use a group_level=1 to group keys for reduction.

Users with Number of Comments

This view is straight forward. Any comments are emitted, and then reduced to a sum. Notice users with zero comments will not show up in the reduce because only rows for each comment are emit()'ed.

with-comments-map.js

function(doc)
{
    if(doc.type == "comment")
    {
        emit([doc.user_id, doc._id], 1);
    }
}

with-comments-reduce.js

function(keys, values)
{
    return sum(values) ;
}

Users with Top N Comments

This is where things get a bit complicated.

The Map portion is fairly straight forward. For each comment we emit a user id, -1 pair. The -1 is to indicate that this is a user document. Thinking about it now, this is an artifact of an earlier iteration, but any value that is outside the range of comment id's would be acceptable. (Assuming you change the reduce appropriately)

The reduce function is a bit of wild one. The basic idea is that we're going to take a set of values and condense them into an object that lists the user_id and top N comments. We have to be careful on re-reduce steps that we can merge these summary objects correctly. A more detailed discussion follows the code.

max-comments-map.js:

function(doc)
{
    if(doc.type == "user")
    {
        emit([doc._id, -1], doc.name);
    }
    else if(doc.type == "comment")
    {
        emit([doc.user_id, doc._id], [doc.position, doc.text]);
    }
}

max-comments-reduce.js:

function(keys, values, rereduce)
{
    var MAX = 5 ;
    var obj = null ;

    function sort_comments(a, b)
    {
        return a[0] - b[0] ;
    }

    function add_comment(current, to_add)
    {
        current[current.length] = to_add ;
        current.sort(sort_comments) ;
        log("Current: " + current.toSource()) ;
        return current.slice(0,MAX) ;
    }

    try
    {
        if(rereduce)
        {
            for(var i = 0 ; i < values.length ; i++)
            {
                if(typeof(values[i]) == 'object' && values[i].type == 'reduction')
                {
                    if(obj == null)
                    {
                        obj = values[i] ;
                    }
                    else
                    {
                        if(obj.user_id == null)
                        {
                            obj.user_id = values[i].user_id ;
                        }

                        for(var j = 0 ; j < values[i].comments.length ; j++)
                        {
                            obj.comments = add_comment(obj.comments, values[i].comments[j]) ;
                        }
                    }
                }
            }
        }

        if(obj == null)
        {
            obj = {"comments": [], 'type': 'reduction', 'user_id': null} ;
        }

        if(keys == null)
        {
            return obj ;
        }

        for(var i = 0 ; i < keys.length ; i++)
        {
            if(typeof(values[i]) == 'object' && values[i].type == 'reduction')
            {
                continue ;
            }

            if(keys[i][0][1] == -1)
            {
                obj.user_id = values[i] ;
            }
            else
            {
                obj.comments = add_comment(obj.comments, values[i]) ;
            }
        }
    }
    catch(e)
    {
        log("Error: " + obj + " Rereduce: " + rereduce) ;
        log(e.stack) ;
    }

    return obj ;
}

First we define a couple functions for merging lists of comments. This is a fairly constrained function. The returned comments array must not grow quickly. And quickly includes linear relative to the number of documents. Ie, if you just append every comment to this array it most likely will not work. By keeping just a maximum number of rows we're alleviating that concern.

The second major section (lines 19-45) deals with when we're running a rereduce. Rereduce is important because we may be consuming the output of this function. So here we go through and merge all reduction objects we returned earlier. The biggest gotchya in this code is that user_id only exists on one reduction object initially (because we only emit()'ed it once). So we need to be sure we move it up as things are rereduced.

In the third section (lines 57-73) we are dealing with the original Map output. Ignoring a couple checks we're basically testing if the row was emit()'ed, then depending on the result adding the user_id or merging a new comment.

Hopefully that's all clear.

Using the Views

So the general idea here is to page through the with-comments reduce view and use the first and last user_id to get the data we need to page through the max-comments view.

Paging through the with-comments view as per normal suggestion, we would start like such:

curl 'http://localhost:5984/test/_view/users/with-comments-reduce?group_level=1&count=10'

Using the first and last row from this query, we have enough information to get the appropriate range in our max-comments view using the startkey and endkey query parameters with a GET request like:

curl 'http://localhost:5984/test/_view/users/max-comments-reduce?group_level=1&startkey=`blah`&endkey=`foo`'

Keeping in mind that we would have to filter any returned results that had zero comments.

Feedback

Hopefully this works for people other than me. If you try it out and have questions or comments feel free to email me here.