CodeBork | Tales from the Codeface

The coding blog of Alastair Smith, a software developer based in Cambridge, UK. Interested in DevOps, Azure, Kubernetes, .NET Core, and VueJS.


If you’ve been following the developer hangouts in the last few months, you’ve probably heard at least a little bit about NoSQL and document databases. You may also have read how they’re the best thing since sliced bread, and that NoSQL will be your new BFF. Contrariwise, you may have read some of the FUD surrounding the subject and have a less rose-tinted view of things.

Last night, I attended the Cambridge NxtGenUG (http://www.nxtgenug.net/) meeting on this very topic, and I intend to distil some of what I learnt in this post; mostly it’ll be me riffing around the stuff that the speaker, Neil Robbins, covered. Neil’s delivery was energetic, interesting and informative throughout, and the 100mph demo at the end of the talk was both simple and powerful. He’s a great speaker even if his slides were a bit wordy (with one or two being excessively so). Check him out on Twitter.

Background and Context

Neil began with a bit of background, which proved useful in providing some context to the new NoSQL movement. RDBMS such as Oracle, SQL Server, et al. are excellent at mapping relational data, but much real-world data is not relational; in fact, the class of problems that RDBMS are a poor fit for has grown in number, and is likely to continue to do so.

The classic customers-plus-orders example that you get in many “Databases 101”-type courses is not relational, as best exemplified by the OrderLine entity that is always added when normalising data. Does an order line exist in the real world? Well, ok, sort of — there are lines on an invoice, for example — but it’s a trick that needs to be learnt: it’s not intuitive. This is also one of the reasons that denormalisation is sometimes required to juice extra performance from a database. For some applications, such as Twitter, it makes no sense to enforce the relational model on the data, because the data is network-oriented and only semi-structured.

When you come to program your system in a good OO language like C#, Java, etc., you quickly encounter the Object-Relational impedance mismatch: your objects don’t look like your entities. For example, your Order object might have some form of collection of products and quantities, and that collection will likely look nothing at all like your OrderLine entity. You can get around this in a couple of ways: code your objects to look more like your entities (and subvert good OO programming practice in the process), or use an ORM such as NHibernate.
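
To make the mismatch concrete, here is a minimal sketch in Python (chosen for brevity; the same shape applies in C# or Java). The `Order` class, its fields, and the `to_order_line_rows` helper are all hypothetical names invented for illustration, not taken from the talk or from any ORM. The object holds a single collection of products and quantities, and some translation step has to flatten it into the rows a normalised OrderLine table expects.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    """How the order looks in object-oriented code: one object, one collection."""
    order_id: int
    customer: str
    # product name -> quantity; the object model has no separate "order line" type
    items: dict[str, int] = field(default_factory=dict)


def to_order_line_rows(order: Order) -> list[tuple[int, int, str, int]]:
    """Flatten an Order into (order_id, line_no, product, quantity) rows,
    the shape a normalised OrderLine table would expect."""
    return [
        (order.order_id, line_no, product, quantity)
        for line_no, (product, quantity) in enumerate(order.items.items(), start=1)
    ]


order = Order(order_id=42, customer="Alice", items={"widget": 3, "gadget": 1})
rows = to_order_line_rows(order)
```

This hand-written translation is the work an ORM automates, and the mapping files are where you describe it.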

You’ve probably heard me mention ORMs and NHibernate before, and they’re a great way to solve some problems, but they add a huge amount of complexity in the process. For example, NHibernate requires you to map your objects to your entities via an XML file, or via C# in the case of Fluent NHibernate. When I’ve worked with NHibernate, I’ve found this to be a massive overhead in the process, particularly for new projects which are still evolving.

Which brings me neatly on to the next point: RDBMS do not cope well with schema evolution. When you add a new column to a table, existing rows in the table utilise either NULL or the default value for the column (if a default has been specified). This means the table then has to be updated with information for the new column; if the new column is a foreign key, for example, this can quickly get very laborious.
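
The NULL-or-default behaviour is easy to see for yourself. The sketch below uses Python’s built-in `sqlite3` module with an in-memory database, purely because it needs no setup; the `customers` table and its columns are made up for the example, and other RDBMS differ in the details of `ALTER TABLE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers (name) VALUES (?)", [("Alice",), ("Bob",)])

# New column without a default: existing rows read back as NULL (None in Python).
conn.execute("ALTER TABLE customers ADD COLUMN region_id INTEGER")
# New column with a default: existing rows read back as the default value.
conn.execute("ALTER TABLE customers ADD COLUMN status TEXT DEFAULT 'active'")

rows = conn.execute(
    "SELECT name, region_id, status FROM customers ORDER BY id"
).fetchall()

# The laborious part: every existing row now needs a real value for the new column.
# Here they all get the same placeholder; real data rarely backfills this easily.
conn.execute("UPDATE customers SET region_id = 1 WHERE region_id IS NULL")
backfilled = conn.execute(
    "SELECT name, region_id FROM customers ORDER BY id"
).fetchall()
```

With two rows the backfill is trivial; with millions of rows and a foreign key that has to point at something meaningful, it becomes a migration project in its own right.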

Scalability is an interesting problem, particularly as applied to RDBMS. Vertical scaling — buying more memory, CPU, etc. — is very easy, but quite expensive: server hardware does not come at commodity prices. Horizontal scaling — splitting the system across separate nodes, either by data or function — gives greater flexibility, but is much more complex. Constraints that you might have been able to rely on the RDBMS to enforce now must be incorporated into the application tier so they are consistent across the various nodes. The two-phase commit technique used in distributed transactions to enforce consistency reduces the overall availability of the system: the figure is the product of the availability of all the individual nodes.
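
The availability arithmetic can be worked through in a few lines. This sketch assumes node failures are independent and that a transaction needs every participating node to be up; the 99.9% figure is an illustrative number, not one from the talk.

```python
import math


def combined_availability(node_availabilities):
    """Availability of an operation that needs every listed node to be up,
    assuming the nodes fail independently of one another."""
    return math.prod(node_availabilities)


# Three nodes, each available 99.9% of the time.
three_nodes = combined_availability([0.999, 0.999, 0.999])  # roughly 0.997
```

Each extra node that must take part in the commit multiplies in another factor below 1, so the overall figure can only go down as the system is split further.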

Instead of aiming for high consistency as RDBMS do, document databases opt for something called eventual consistency. This concept states that, across all the nodes in the system, at some point in the future the data will be consistent. As it turns out, this is actually the natural order of things in many scenarios: Neil’s example was of a system with an important paper step such as filling out a form. When the form has been completed, the system is inconsistent: the form has not yet been entered into the computerised portion of the system. It is not until the form has left the agent’s briefcase, been passed to the data entry clerk, and entered into the electronic system that the system is consistent overall.
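
The briefcase example can be modelled as a toy simulation. Everything here (the `Replica` and `Cluster` classes and the `sync` step) is a hypothetical sketch of the idea, not how any particular NoSQL database implements replication; it ignores conflicts, ordering, and failures entirely.

```python
from collections import deque


class Replica:
    """One node holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Toy model: a write lands on one replica and reaches the others later."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.pending = deque()  # (target replica, key, value) not yet delivered

    def write(self, replica, key, value):
        # Accept the write locally straight away...
        replica.data[key] = value
        # ...and queue it for delivery to every other replica.
        for other in self.replicas:
            if other is not replica:
                self.pending.append((other, key, value))

    def is_consistent(self):
        first = self.replicas[0].data
        return all(r.data == first for r in self.replicas)

    def sync(self):
        # Deliver all queued updates (the form reaching the data entry clerk).
        while self.pending:
            target, key, value = self.pending.popleft()
            target.data[key] = value


briefcase = Replica("agent's briefcase")
office = Replica("office system")
cluster = Cluster([briefcase, office])

cluster.write(briefcase, "form-1", "completed")
before = cluster.is_consistent()  # the office has not seen the form yet
cluster.sync()
after = cluster.is_consistent()   # every replica now holds the same data
```

The write is accepted immediately and the system is temporarily out of step; consistency arrives only once the queued updates have been delivered.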

As it turns out, NoSQL implementations differ in their aims, so it’s important to pick the correct one for your situation. For example, MongoDB’s USP centres around performance: it’s blazingly fast. There is some commonality between the implementations, however: they are all aimed at large (huge, enormous, gigantic) datasets; they all target commodity hardware; they all aim for high availability. Here are the different classes of implementation:

Key-value stores: Hadoop, Redis, Voldemort, Dynamo
Network-oriented: Neo4J
Object: db4o
Columnar: Google BigTable, Cassandra
Document: CouchDB, MongoDB, Riak, Terrastore
