Paul Joseph Davis

CouchDB JSON Parser Timings

Overview

I've put together some work on integrating eep0018 into CouchDB as well as adding support for Spidermonkey 1.8.1. This is all still very experimental. I have the test suite passing for both branches except where Spidermonkey's JSON serialization differs from the JavaScript function previously used ([undefined] is serialized as [null]).

So, after getting those branches together I spent a bit of time and ran some tests to see what kind of speed differences I could get. Turns out the difference depends on the amount of data we give it, but there is a noticeable impact.

Branches

Caveats

Notice that I'm inserting 4KiB documents. If we shrank those down to a couple of bytes, these numbers would tend to even out. I know that the eep0018 code is hampered by moving across the VM boundary, so when we're dealing with _bulk_docs the key would be that eep0018 allows us to post more docs in a single request.
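To make the document-size and batch-size knobs concrete, here is a rough sketch of how the test data could be parameterized. The names make_docs and batches are hypothetical helpers, not part of couchdb-python, and the sizes in the sweep are arbitrary picks rather than anything I measured. It is written for Python 3 and never talks to a server; each batch is what you would hand to db.update().

```python
def make_docs(count, text_size):
    """Yield documents shaped like the benchmark's, with a configurable payload size."""
    for docid in range(count):
        yield {"_id": "%.10d" % docid, "integer": docid, "text": "a" * text_size}


def batches(docs, batch_size):
    """Group an iterable of documents into lists of at most batch_size items."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # trailing partial batch, if any


if __name__ == "__main__":
    # Hypothetical sweep: count how many _bulk_docs requests each combination needs.
    for text_size in (16, 1024, 4096):
        for batch_size in (100, 1000):
            requests = sum(1 for _ in batches(make_docs(10000, text_size), batch_size))
            print("text_size=%d batch_size=%d requests=%d" % (text_size, batch_size, requests))
```

Timing db.update(batch) inside that loop for each combination would show where the numbers start to even out.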

Other things people might want to play with are the numbers in couch_utils:should_flush/0 to see if we can tune how much data gets sent to the view server in one go.

So, there are a lot of permutations for different speed tests, not to mention just making sure that these branches aren't screwed beyond recognition in terms of actually working.

If you're bored and looking for something to do instead of clicking through Twitter or the current trendy social news site on a Monday morning, I invite you to grab one or two or all of the branches and run your own tests.

Benchmark Script

I realize this isn't the most sound measurement system, but I'm tired and didn't feel like being thorough. You can grab it here.

#! /usr/bin/env python
import time
import couchdb
server = couchdb.Server("http://127.0.0.1:5984/")
if "eep0018" in server:
    del server["eep0018"]
db = server.create("eep0018")

start = time.time()
updates = []
for docid in xrange(10000):
    doc = {"_id": "%.10d" % docid, "integer": docid, "text": "a" * 4096}
    updates.append(doc)
    if len(updates) >= 1000:
        db.update(updates)
        updates = []
if updates:
    db.update(updates)
end = time.time()
print "Inserting: %f" % (end - start)

start = time.time()
for row in db.query("function(doc) {emit(doc._id, doc.integer % 100);}"):
    pass
end = time.time()
print "Map only: %f" % (end - start)

start = time.time()
for row in db.query("function(doc) {emit(doc._id, doc.integer / 100);}",
            reduce_fun="function(keys, vals) {return sum(vals);}"
        ):
    pass
end = time.time()
print "With reduce: %f" % (end - start)

start = time.time()
for row in db.query("function(doc) {emit(doc._id, doc.integer * 2);}",
            reduce_fun="_sum"
        ):
    pass
end = time.time()
print "With erlang reduce: %f" % (end - start)
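The start/end bookkeeping in the script repeats four times, so one way to tighten it up would be a small helper along these lines. This is only a sketch: timed is a hypothetical name, it is written for Python 3 (unlike the script above), and it assumes the caller takes care of resetting any state between repeats, such as recreating the database before each insert run.

```python
import time


def timed(label, func, repeat=3):
    """Call func() `repeat` times and return the list of wall-clock durations in seconds."""
    durations = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    # Print every run so outliers stay visible instead of being averaged away.
    print("%s: %s" % (label, ", ".join("%f" % d for d in durations)))
    return durations
```

Each section of the benchmark would then be wrapped in a function and passed in, for example timed("Map only", run_map_only), where run_map_only is whatever callable performs the query loop.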

Results

This is the data that I got from running that test script against each of the three branches three times. The error bars are simple min/max notations. No fancy standard deviation shit going on here.
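For anyone repeating this, the min/max error bars could be computed roughly as below. I'm assuming the plotted value is the mean of the runs, which the text above doesn't actually spell out, and summarize is a hypothetical helper name. The function takes the list of timings for one test on one branch.

```python
def summarize(timings):
    """Return (mean, lower_error, upper_error) for a list of run times.

    lower_error is the distance from the mean down to the fastest run and
    upper_error is the distance up to the slowest run, which is the shape
    most plotting libraries expect for asymmetric error bars.
    """
    if not timings:
        raise ValueError("need at least one timing")
    mean = sum(timings) / len(timings)
    return mean, mean - min(timings), max(timings) - mean
```

With only three runs per branch this says little about variance, but it does show whether the ranges for two branches overlap.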

[Figure: CouchDB JSON Parsing Times]

Hand Waving

The results generally make sense. We get a speed bump during insertion when we switch to the eep0018 branch. The views are faster too. When we add the Spidermonkey 1.8.1 updates we get the same insert speed (because inserts don't touch the view server) and faster view computation.

For the more motivated timing people out there, if someone wants to play around with data sizes and look at timings for different scenarios that'd be pretty awesome. And more fancy number math probably wouldn't hurt.