lrrr May 21, 2008 at 12:59 pm

DIY Map / Reduce - Apache CouchDb

Fair warning: my opinion makes no claim to objectivity. Relational databases have never, to put it mildly, inspired me.

No, I fully understand their place when an application really is focused on storing and processing large amounts of data: ERP systems, data warehouses of all kinds, statistics along the lines of "last month we sold one hundred thousand pencils, this month two hundred thousand."

On the other hand, in most desktop (or web) applications, where there is no need to churn through millions of primitive records and the application works with relatively high-level, complex objects, the essence of “database design and development” comes down to repeating two actions:

a) split these high-level objects into a bunch of simple fields (numbers, strings, and the complex dependencies between them) and scatter them across dozens of tables. Usually this is not very difficult, but some (or many?) data types do not fit this model pleasantly or naturally, for example tags on blog posts;

b) then stubbornly assemble these fields back into objects, using four-story JOINs, megabytes of wrapper code, and ORM layers of varying crookedness, depending on the developer's skills; in short, overcome the notorious O/R impedance mismatch by any means available. Handwritten JOINs show no miracles of performance or flexibility, and JOINs generated automatically by a clever wrapper layer even less so.

In principle, ORM libraries in dynamic languages (see SQLAlchemy) are quite pleasant to use; however, they do not elegantly solve another painful problem: schema upgrades.

In general, quite a few applications use databases to store complex data structures while rarely or never needing complex queries over the internal dependencies in that data (apart from the mega-JOINs needed just to pull those structures back out of the database). A conventional RDBMS does not seem to suit them very well: the problems mentioned above are painful to solve, while database developers spend millions of man-hours implementing other features that are useless to these applications.

One solution is object-oriented storage; such databases are becoming quite popular and deserve a separate discussion. They transparently solve the ORM problem, but if we talk about web applications (which interest us greatly in light of the promised new version of defun.ru :), object-oriented databases are not exactly what the doctor ordered: they do not solve the problems of horizontal scalability and data distribution, and since the web is primarily a lot of textual information, it would be nice to take that into account somehow.

So, CouchDb is a document-oriented database. It stores documents: objects consisting of a set of fields with arbitrary structure. Each document has only two required service fields, a name and a version (_id and _rev in the example below). Names are unique and live in a flat namespace; imagine a giant directory full of document files, each looking like this:

{ 
 "_id": "63086444D554D3094C080F96D5005B03",
 "_rev": "1837603925",
 "author": "lrrr",
 "tags": ["baz", "test", "ru"],
 "url": "http://incubator.apache.org/couchdb",
 "title": "couchdb home",
 "description": "boo boo ba ba",
 "type": "story",
 "comments": 1,
 "votes": 2
}
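Because the whole object comes back as a single JSON document, application code can read it without any joins. The snippet below is only a minimal sketch in Python: it parses a shortened copy of the example document from a string rather than fetching it from a server, and the variable names are my own.

```python
import json

# A shortened copy of the example document above, as the server would return it.
raw = """
{
  "_id": "63086444D554D3094C080F96D5005B03",
  "_rev": "1837603925",
  "author": "lrrr",
  "tags": ["baz", "test", "ru"],
  "url": "http://incubator.apache.org/couchdb",
  "title": "couchdb home",
  "type": "story",
  "comments": 1,
  "votes": 2
}
"""

doc = json.loads(raw)

# The two service fields are ordinary keys next to the application data.
doc_id, revision = doc["_id"], doc["_rev"]

# Tags arrive as a plain list: no join table and no second query needed.
tags = doc["tags"]
has_ru_tag = "ru" in tags
```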

Versions are needed to coordinate concurrent access to the database. Recall how your version control system works: to change a document, we simply fetch it, modify it, and try to put it back. If its version has not changed in the meantime, everything is fine; if it has, we can retry the same changes against the new document or merge somehow (depending on the application). This is called optimistic locking. Its main advantage is that nobody locks the document while it is being edited, so there is no waiting for a lock to be released. By the way, a similar mechanism is available in some modern RDBMSs, only at the level of table rows (see http://www.google.com/search?q=%22row+versioning%22 ).

The interface to CouchDb is exclusively HTTP, exclusively REST, and the server responds in JSON. At first this is somewhat alarming, since HTTP is not the most efficient protocol, but because high-level documents are stored whole, there is no need to make 5-10 database queries for each one. And there are plenty of advantages: first, any language can work with HTTP and JSON (and if it cannot, it is easy to teach); second, it is easy to debug; third, CouchDb understands HTTP ETag and If-None-Match, which means you can bolt an HTTP cache onto the database without much effort.
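The optimistic locking scheme described above can be sketched in a few lines. This is not CouchDb's API: TinyStore, Conflict, and update_with_retry are hypothetical names for an in-memory stand-in, and the revision is just a counter, but the fetch / modify / put-back / retry cycle is the same idea.

```python
import itertools


class Conflict(Exception):
    """Raised when a write is based on a stale revision."""


class TinyStore:
    """Hypothetical in-memory stand-in for a document store with revisions."""

    def __init__(self):
        self._docs = {}
        self._counter = itertools.count(1)

    def get(self, doc_id):
        # Return a copy so callers cannot mutate the stored document directly.
        return dict(self._docs[doc_id])

    def put(self, doc):
        current = self._docs.get(doc["_id"])
        # Reject the write if someone else saved a newer revision meanwhile.
        if current is not None and current["_rev"] != doc.get("_rev"):
            raise Conflict(doc["_id"])
        saved = dict(doc, _rev=str(next(self._counter)))
        self._docs[doc["_id"]] = saved
        return saved["_rev"]


def update_with_retry(store, doc_id, change, attempts=5):
    """Fetch, modify, and save; on conflict, refetch and reapply the change."""
    for _ in range(attempts):
        doc = store.get(doc_id)
        change(doc)
        try:
            return store.put(doc)
        except Conflict:
            continue
    raise Conflict(doc_id)


store = TinyStore()
store.put({"_id": "story-1", "votes": 2})

stale = store.get("story-1")  # the first client reads the document
# Meanwhile a second client successfully adds a vote.
update_with_retry(store, "story-1", lambda d: d.update(votes=d["votes"] + 1))

stale["votes"] += 1  # the first client edits its now outdated copy
try:
    store.put(stale)
    stale_write_rejected = False
except Conflict:
    stale_write_rejected = True

final_votes = store.get("story-1")["votes"]
```

Nobody ever waits for a lock here; the price is that the losing writer has to notice the conflict and repeat its change, which is exactly what update_with_retry does.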

Horizontal scaling, on the other hand, should work out fine: after all, Amazon SimpleDb and Google BigTable are built around a similar scheme. By an amazing coincidence, by the way, both SimpleDb and CouchDb are written in Erlang ;)

What distinguishes CouchDb from the Google and Amazon services is its more “advanced” functionality for querying data.

Naturally, less structured data is harder to process, and since we care so much about scalability, queries must also be easy to distribute across a cluster of database servers. To achieve this, CouchDb uses the map/reduce pattern described in a famous paper by Google engineers.

In practice it looks like this: view functions (actually map() and reduce()) are stored on the server in special documents; they transform the set of documents as needed and can be accessed through the same REST interface. Views are computed incrementally, with intermediate results preserved: if two documents are added or changed between two calls to a view, the function will be invoked only for those two. Views are written in JavaScript, but you can easily plug in python / ruby / something else instead.
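To make the idea concrete, here is a rough simulation of a view in Python. It is not how CouchDb implements views (those are JavaScript functions stored on the server); IncrementalView, map_tags, and reduce_sum are hypothetical names, and the sketch caches only the map output per document revision, whereas the text above says CouchDb preserves intermediate results more generally. The view counts how many stories carry each tag.

```python
from collections import defaultdict


def map_tags(doc):
    """Map step: emit one (tag, 1) pair per tag of a story document."""
    if doc.get("type") == "story":
        for tag in doc.get("tags", []):
            yield tag, 1


def reduce_sum(values):
    """Reduce step: combine all values emitted under one key."""
    return sum(values)


class IncrementalView:
    """Hypothetical view that caches map output per document revision."""

    def __init__(self, map_fn, reduce_fn):
        self.map_fn = map_fn
        self.reduce_fn = reduce_fn
        self._cache = {}    # doc id -> (revision, emitted pairs)
        self.map_calls = 0  # counts how often the map function really ran

    def query(self, docs):
        seen = set()
        for doc in docs:
            seen.add(doc["_id"])
            cached = self._cache.get(doc["_id"])
            # Re-run map only for documents that are new or have a new revision.
            if cached is None or cached[0] != doc["_rev"]:
                self._cache[doc["_id"]] = (doc["_rev"], list(self.map_fn(doc)))
                self.map_calls += 1
        # Forget documents that no longer exist.
        for doc_id in set(self._cache) - seen:
            del self._cache[doc_id]
        # Group emitted values by key, then reduce each group.
        grouped = defaultdict(list)
        for _, pairs in self._cache.values():
            for key, value in pairs:
                grouped[key].append(value)
        return {key: self.reduce_fn(values) for key, values in grouped.items()}


docs = [
    {"_id": "a", "_rev": "1", "type": "story", "tags": ["baz", "test", "ru"]},
    {"_id": "b", "_rev": "1", "type": "story", "tags": ["test"]},
]
view = IncrementalView(map_tags, reduce_sum)
first = view.query(docs)

# One new document arrives between two queries of the view.
docs.append({"_id": "c", "_rev": "1", "type": "story", "tags": ["ru"]})
second = view.query(docs)
```

After the second query map_calls is 3, not 5: the two unchanged documents were served from the cache and map ran only for the new one.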

As an added bonus, there is support for full-text search over documents using any external library (so far the authors have hooked the Apache Lucene search engine up to CouchDb).

* * *

At the end it is customary to kick the technology under discussion a little, but I feel sorry to kick CouchDb: it makes too good an impression. Of course, this is still only an alpha version, with all that implies (reduce, for example, appeared in trunk three days ago). Yes, it is very slow: for now it can handle only dozens of inserts per second (unless you use bulk update mode). And yes, it eats a lot of disk space, since all intermediate versions of a document are kept unless they are periodically removed with the special “Compact Database” function; this, however, can be done in parallel without stopping the application. Still, for an alpha the system is very stable and already has, among other things, a very nice and functional web interface for administration and development.

Original post

More links:

Official site of the project
Damien Katz - CouchDb Lead Programmer
Two more blogs: Christopher Lenz, Jan Lehnardt
Top 10 Reasons to Avoid Document Databases FUD - a good article on why there is no need to be afraid of non-relational databases
Amazon SimpleDB and CouchDB compared
Ajatus - a distributed CRM system using CouchDb
