If you’ve been following the developer hangouts in the last few months, you’ve probably heard at least a little bit about NoSQL and document databases. You may also have read how they’re the best thing since sliced bread, and that NoSQL will be your new BFF. Contrariwise, you may have read some of the FUD surrounding the subject and have cough a less rose-tinted view of the things.
Last night, I attended the Cambridge <a href=”http://www.nxtgenug.net/ title=”NxtGen User Group”>NxtGenUG</a> meeting on this very topic, and I intend to distil some of what I learnt in this post; mostly it’ll be me riffing around the stuff that Neil covered. The talk was delivered by Neil Robbins, whose delivery was always energetic, interesting and informative. The 100mph demo at the end of the talk was both simple and powerful. He’s a great speaker even if his slides were a bit wordy (with one or two being excessively so). Check him out on Twitter.
Background and Context
Neil began with a bit of background, which proved useful in providing some context to the new NoSQL movement. RDBMS such as Oracle, SQL Server, et al. are excellent at mapping relational data, but much real-world data is not relational; in fact, the class of problems that RDBMS has grown in number, and is likely to continue to do so.
The classic customers-plus-orders example that you get in many “Databases 101”-type courses is not relational, as best exemplified by the OrderLine entity that is always added when normalising data. Does an order line exist in the real world? Well, ok, sort of — there are lines on an invoice, for example — but it’s a trick that needs to be learnt: it’s not intuitive. This is also one of the reasons that denormalisation is sometimes required to juice extra performance from a database. For some applications, such as Twitter, it makes no sense to enforce the relational model on the data, because the data is network-oriented and only semi-structured.
When you come to program your system in a good OO language like C#, Java, etc., you quickly encounter the Object-Relational impedance mismatch: your objects don’t look like your entities. For example, your Order object might have some form of collection of products and quantities, and that collection will likely look nothing at all like your OrderLine entity. You can get around this in a couple of ways: code your objects to look more like your entities (and subvert good OO programming practice in the process), or use an ORM such as NHibernate.
You’ve probably heard me mention ORMs and NHibernate before, and they’re a great way to solve some problems, but they add a huge amount of complexity in the process. For example, NHibernate requires you to map your objects to your entities via an XML file, or via C# in the case of Fluent NHibernate. When I’ve worked with NHibernate, I’ve found this to be a massive overhead in the process, particularly for new projects which are still evolving.
Which brings me neatly on the next point: RDBMS do not cope well with schema evolution. When you add a new column to a table, existing rows in the table utilise either NULL or the default value for the column (if a default has been specified). This means the table then has to be updated with information for the new column; if the new column is a foreign key, for example, this can quickly get very laborious.
Scalability is an interesting problem, particularly as applied to RDBMS. Vertical scaling — buying more memory, CPU, etc. — is very easy, but quite expensive: server hardware does not come at commodity prices. Horizontal scaling — splitting the system across separate nodes, either by data or function — gives greater flexibility, but is much more complex. Constraints that you might have been able to rely on the RDBMS to enforce now must be incorporated into the application tier so they are consistent across the various nodes. The two-phase commit technique used in distributed transactions to enforce consistency reduces the overall availability of the system: the figure is the product of the availability of all the individual nodes.
Instead of aiming for high consistency as RDBMS do, document databases opt for something called eventual consistency. This concept states that, across all the nodes in the system, at some point in the future the data will be consistent. As it turns out, this actually the natural order of things in many scenarios: Neil’s example was of a system with an important paper step such as filling out a form. When the form has been completed, the system is inconsistent: the form has not yet been entered into the computerised portion of the system. It is not until the form has left the agent’s briefcase, been passed to the data entry clerk, and entered into the electronic system that the system is consistent overall.
As it turns out, NoSQL implementations differ in their aims, so it’s important to pick the correct one for your situation. For example, MongoDB’s USP centres around performance: it’s blazingly fast. There is some commonality between the implementations, however: they are all aimed at large (huge, enormous, gigantic) datasets; they all target commodity hardware; they all aim for high availability. Here are the different classes of implementation:
- Key-value stores
- Hadoop, Redis, Voldemort, Dynamo
- Google BigTable, Cassandra
- CouchDB, MongoDB, Riak, Terrastore </dl> No doubt other implementations are available. Map-Reduce is another common feature of many of these kinds of databases. This, simply put, is parallel processing of your data. Work is pushed to the nodes (rather than data being pulled to them), which then emit results. This is the Map part of the operation. The Reduce part is an aggregation, which filters the results. Check out Google's explanation.
- Facebook Seattle Engineering Road Show: Mike Shroepfer on Engineering at Scale at Facebook This is a fascinating post all-round, and describes in some detail what Facebook had to do to get sufficient performance out of MySQL (including heavy use of memcached). They're now heavily investing in their own columnar storage system, called Cassandra.
- Building Scalable Databases: Are Relational Databases Compatible with Large Scale Websites?
CouchDBCouchDB, a project of the Apache Software Foundation, is an example of a document database. A document is no more strictly-defined than as a "self-contained thing". It can be an invoice; it can be an insurance quotation or policy; it can be an event or a business card. There are no set fields, and documents can have as many or as few fields as required, and the fields can be of complex types. Document databases are schemaless, so you could store an invoice, insurance quotation, and a business card in the one store. In CouchDB, documents are represented in JSON. This seems to be well on its way to becoming a standard in this space: MongoDB uses a binary-serialised form of JSON called BSON (I wish they'd called it BiSON). As such, documents look a little like this:
_all_dbs. A JSON-serialised list of all the databases in CouchDB is returned.
?include_docs=truequery string in order to retrieve the data. In practice, however, you are highly unlikely to ever need to retrieve information on all the documents: remember, these systems are designed to work with millions or even billions of documents.
--data-binarybit is an option to curl that indicates that the body of the HTTP request needs to be base64-encoded as it is binary data; the option requires an argument, in this case @'invoice.pdf'. Likewise,
-His an option, which allows you to specify extra header fields manually. In this case, we're supplying the MIME-type to indicate that our attachment is a PDF. The attachment can be retrieved in a couple of ways. Appending the query string
?attachments=trueto a GET request for a document returns the base64-encoded representation of the attachment. To view the attachment in a more useful format — in a PDF viewer in this case — add
</code> to the GET request. It's taken me WAAAY longer to write this post than I was expecting, so I'm going to wrap up with a couple of quick pointers to other resources. You can find a .NET library for CouchDB called Divan on GitHub, and there's an excellent (and free!) online version of the O'Reilly CouchDB book available from http://books.couchdb.org/relax. Do check out the further reading list below as well; there are some great posts there.
Further ReadingDare Obasanjo, aka Carnage4Life, posts extensively on web topics:
Rob Conery, ex-Microsoft (working on ASP.NET MVC), has recently jumped on the NoSQL meme with MongoDB. His blog is a custom engine powered by Mongo. The High Scalability Blog. A sort of best practice-oriented blog aiming to share information on how large sites build for scalability successfully.