I woke up this morning to an amusing blog post by Ryan Park in my Google alerts. I tend to read a lot of the "RDBMS's are awesome! No, NOSQL is moar awesome!" posts with great bemusement. Even though it's a year and a half old, it was amusing enough to motivate me to write out some of the thoughts I had while reading it.
Ryan spends a couple paragraphs talking about how horrible it is that non-RDBMS systems don't provide data constraints. In general he's pretty spot on here. One of the first things that tends to get left out of non-RDBMS systems is constraint enforcement.
There are two and a half points I'd like to make. First, constraints are usually the first to go because they're costly. Costly to implement and costly at runtime. Especially when the system is being designed with the ability to run on multiple machines.
Secondly, there are plenty of people that don't use constraints. Ryan falls pretty squarely into the "RDBMS's work for me, so they should work for you too" camp. He appears to know his stuff but what people like Ryan forget is that there are a lot of people that don't. They use an RDBMS because that's what the internet says to do. And then they plop an ORM on top of it and never actually use any of the RDBMS features that are so lauded. As it turns out, lots of these developers are super happy using a database that doesn't provide constraint enforcement.
I'll save the last half point for later, since it was suggested in a comment on Ryan's post and applies further on in the conversation.
Just reading the bullet point on this one, I knew I was in for some fun. I mean seriously, if that's not proof by assertion then I don't know what is.
Ryan is quite right that developers need to make some things appear consistent so as to not confuse users. I can't speak to the specifics of SimpleDB, but obviously someone's using it successfully so I'll assume that it's possible.
The first of two things I'd point out, though, is that consistency problems are not limited to those crazy non-RDBMS people. Even in a traditional three-tier web architecture, there's the issue with sessions. Basically, a client needs to be repeatedly routed to the same application server that's handling their session.
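One common way to get that "same server every time" behavior is to hash the session ID and use the result to pick a server. Here's a minimal sketch of the idea. The server names and the `route` function are made up for illustration, and a real load balancer has to deal with servers coming and going, which this ignores.

```python
import hashlib

# Hypothetical application servers; the names are placeholders.
APP_SERVERS = ["app1", "app2", "app3"]


def route(session_id, servers=APP_SERVERS):
    """Pick a server for a session by hashing the session ID.

    The same session ID always maps to the same server as long as the
    server list does not change.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


# Repeated requests for one session land on the same server.
first = route("session-1234")
second = route("session-1234")
print(first == second)
```

Note that the moment the server list changes, most sessions get remapped, which is its own little consistency problem.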
The other thing I'll point out is an interesting Facebook blog post I read a couple months ago. It's an interesting look at how Facebook added a second datacenter on the east coast. A datacenter based on MySQL no less. I'll draw your attention to the "Cache Consistency" section. And all I'm going to point out is that their solution required modifying MySQL's query parser. Seriously.
For the bullet point, yes, that's more or less true. The argument in support of this is pretty much non-existent. If Ryan really wanted to make an argument about aggregates, the best thing would be to go on about how a non-RDBMS requires you to know what type of aggregates you'll want up front and then do insert time calculations for these values. While that will work just fine, it makes ad-hoc queries harder. The ad-hoc issue is the next bullet point, but for some reason the connection wasn't made.
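To make the insert-time approach concrete, here's a rough sketch of a toy store that keeps per-customer totals up to date on every write. The `OrderStore` class and its fields are invented for this example and don't correspond to any particular database. It also assumes every `order_id` is new; handling overwrites would mean subtracting the old value first.

```python
from collections import defaultdict


class OrderStore:
    """Toy key-value store that maintains aggregates at insert time."""

    def __init__(self):
        self.orders = {}  # order_id -> (customer, amount)
        self.total_by_customer = defaultdict(float)
        self.count_by_customer = defaultdict(int)

    def insert(self, order_id, customer, amount):
        # Assumes order_id has not been inserted before.
        self.orders[order_id] = (customer, amount)
        # The aggregates are updated as part of the write...
        self.total_by_customer[customer] += amount
        self.count_by_customer[customer] += 1

    def average_for(self, customer):
        # ...so reading one is a lookup, not a scan over all orders.
        return self.total_by_customer[customer] / self.count_by_customer[customer]


store = OrderStore()
store.insert(1, "alice", 30.0)
store.insert(2, "alice", 10.0)
store.insert(3, "bob", 5.0)
print(store.average_for("alice"))  # 20.0
```

The catch is exactly the one mentioned above: if someone later wants totals per month instead of per customer, nothing was tracking that, and you're back to scanning everything.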
This was one of my favorite bullet points in the whole article. And by favorite I mean that it produced the most WTF's per word.
Firstly, Ryan points out that there are three general workloads for databases: (1) general queries that are used by the application, (2) more complicated queries run by staff for reporting, and (3) ad-hoc queries for debugging. I would pretty much agree with him there. But then he goes on to assert that points 2 and 3 are better served by SQL.
The entire second paragraph is some sort of weird twisted logic to bolster the argument that SQL makes reporting super easy. My favorite part of the whole thing is the quote right in the middle:
In my previous jobs, our reports often required hundreds
of lines of SQL to get the right information out of the
database. This is a lot of code, but it was required to
generate the data for our customers.
As far as I can tell, the argument is that SQL makes complex reports easy even though it still might take hundreds of lines to get the data required. And the other thing that's not mentioned: these reports can still take a substantial amount of time to generate. But obviously this is still better than the non-RDBMS systems where they don't even have SQL! Because obviously it's impossible that any imperative language could be as good as SQL... because... because... well I never figured that part out either.
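For what it's worth, a simple report in an imperative language isn't exactly rocket science. Here's a small sketch of a "revenue by region" report, roughly what a `GROUP BY` with an `ORDER BY` would give you. The rows and field names are made up for the example.

```python
from collections import defaultdict

# Made-up sample data standing in for records fetched from some store.
rows = [
    {"region": "east", "amount": 120.0},
    {"region": "west", "amount": 80.0},
    {"region": "east", "amount": 40.0},
    {"region": "north", "amount": 200.0},
]


def revenue_by_region(rows):
    """Sum amounts per region, largest total first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for region, total in revenue_by_region(rows):
    print(region, total)
```

Whether this beats the SQL version depends on the report, which is sort of the point.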
Yeah. This one is special.
RDBMSes are highly optimized for performing aggregate
operations across huge volumes of data. Fast algorithms
like the hash join, merge join, and indexed binary search
have been around for 20 years or more.
Ok. Breathe. I'm assuming that he forgot that joins are not aggregates. And a binary search, well, shit... I guess the non-RDBMS people really are screwed. And the second paragraph talks about how the client is going to need to scan the entire database thus incurring the huge network transfer to even try and compute an aggregate.
I'll just point out that there are non-RDBMS systems that provide aggregate functionality and anything that uses a b+tree probably uses binary search. Remember people, just because you can't think of a different solution to your problem doesn't mean it can't exist.
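As a quick illustration that binary search isn't an RDBMS exclusive, here's a sketch using Python's standard `bisect` module over a sorted list of keys, which is roughly what a lookup inside a single B+tree node looks like. The keys, values, and function names are made up, and a real B+tree has multiple levels and on-disk pages that this skips entirely.

```python
import bisect

# Sorted keys and their values, as they might sit in one tree node.
keys = [3, 8, 15, 23, 42, 57]
values = ["c", "h", "o", "w", "x", "z"]


def lookup(key):
    """Binary search for an exact key; return its value or None."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None


def count_in_range(lo, hi):
    """Count keys with lo <= key <= hi using two binary searches."""
    return bisect.bisect_right(keys, hi) - bisect.bisect_left(keys, lo)


print(lookup(23))             # w
print(lookup(4))              # None
print(count_in_range(8, 42))  # 4
```

The range count is a tiny aggregate computed without touching every key, which is the kind of thing a sorted index gives you regardless of whether SQL sits on top.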
Now this is just FUD. Sorry, but there's no better way to say it. If you've ever had to fit some random piece of data into your existing relational schema you'll probably agree that this is crap. Munging random data is hard. And if it's not random data then it's not really that important. And getting data out? Perhaps Ryan was being satirical?
Jan Lehnardt has a couple of thought-provoking arguments and Volker Mische provides some interesting fodder as well. Basically, to say something isn't fast requires you to define what fast is and, generally, no two people will ever agree on the same definition.
That said, Ryan does make an allusion to this situation when he mentions that SimpleDB probably needs a larger DB to be measured on. And he also points out that lots of databases probably fit into RAM.
There's an interesting article by one of the 37signals guys about buying more RAM instead of sharding. While definitely a valid approach, not everyone can go out and buy a single machine with 32 GiB of RAM (though obviously that's getting closer). Though I'm now curious what type of disks they have to keep up with the write load they might have.
I don't have a better response than the commenter jackson on the original blog post. Once an RDBMS is scaled to multiple machines, lots of the benefits are nullified and you're dealing with the same issues that the non-RDBMS folks are.
There is definitely a lot of noise in the echo chamber about scalability. Developers like to talk about needing hundreds of nodes to support their workload because that's just cool. But in reality, the issue isn't adding the hundredth node to a system, it's adding the second. Regardless of the database being used, if that second node isn't planned for it'll be painful. Non-RDBMS systems generally reduce that pain point by discouraging designs that exacerbate the problems when adding a second node.
I'll file this under the "No shit?" category. There are plenty of places that an RDBMS might be a better fit than any given non-RDBMS. And vice versa. The underlying issue that people seem to miss is being able to describe situations where one might be better than the other.
The bottom line to this whole "My database is better than your database!" argument is that "You're both right, so STFU!" Eventually people will calm down and start to realize that there are multiple solutions and the right one will depend as much on the problem domain as on the developer coding the solution. A better use of time would be finding personal projects and drawing up the arguments for and against the coded solution so that others might learn from past experience.