Frenzy August 8, 2010 at 17:42

CouchDB today

What is CouchDB to you? Anyone even slightly interested in the now-popular NoSQL topic probably knows the general picture well: it is a nice toy with map / reduce queries written in JavaScript that you talk to by pushing JSON over HTTP, and you may also have heard that it is so fault-tolerant it barely breaks at all. Knowledge usually stops there, and CouchDB ends up bookmarked on Delicious alongside MongoDB, Cassandra, Hadoop and the rest.

Until recently I held roughly the same opinion. Then my current project hit a wall with its relational database, and there was an urgent need to rethink the architecture and move to a document database that supports map / reduce. After a closer look at CouchDB, I realized that it is unique in its class and should not be lumped together with the products mentioned above. The ideas built into CouchDB are so fundamental that they can upend the way you think about developing web applications.

Below the cut I will try to explain what impressed me so much.

I should say right away that if you already have some experience with CouchDB, you probably know all of this yourself and this article is not for you. For everyone else, reading it may bring both the desire and the opportunity to lie down on a red couch and relax, as the Couch developers recommend.

It should be said that the hype around CouchDB had died down somewhat until recently, and mentions of this database rarely come up on Habr, although over the past month Twitter has literally exploded with news about the release of CouchDB 1.0 and about Cloudant leaving closed beta (more on that below). Search shows that the 1.0 release was barely covered on Habr, while the release of MongoDB 1.6 got a post of its own. This injustice must be corrected: if you remember CouchDB as a raw alpha with slow inserts and voracious memory and CPU consumption, forget it, that is all in the distant past. Today it is a production-ready system with commercial support available and real production deployments at companies such as the BBC.

I will not describe the main features in detail; you can read about them on the project website if you wish. I will try to cover what was not obvious to me until I took a closer look. Because of that I did not give this product the attention it deserved earlier, when it could have saved me a lot of time and nerve cells.

map / reduce

It would seem hard to surprise anyone with this. Many NoSQL databases use this paradigm to access data that has no strictly defined schema. On top of that, the use of JavaScript to describe the map / reduce functions is alarming. The first impression is that it must be terribly slow, because at the very least map() has to be executed for every document in the database, and the SpiderMonkey engine is far behind V8 in speed. So what is the catch?

A keen eye will notice that CouchDB does not use map / reduce in its pure form but so-called incremental map / reduce. The idea is that CouchDB does not recompute its views (the results of functions like map() and reduce()) on every request. The computation happens only when a new view is first accessed. The result is then indexed, and the document IDs go into a B+-tree under the key we need, with additional information stored in the nodes (including intermediate results). After that everything is simple: when a new document enters the database (or an old one changes), map() is called for it once and the result is put into the index tree. In other words, there is no need to recompute the whole index; the tree is simply extended incrementally.

When we need the results, Couch simply returns what has already been computed, with no need to walk the documents again running map(). Instead of indexing several separate columns to speed up searches, as most databases do, we index all the results of the query in one go. Imagine a MySQL that does not need to “execute” the query at all, because it can fetch all the results straight from a single index.
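The incremental idea can be sketched in a few lines of Python. This is only a toy in-memory model under stated assumptions: the class `IncrementalView`, the map function `by_city`, and the sample documents are all made up for illustration, and a sorted list stands in for the B+-tree. It is not how CouchDB is implemented internally; it only shows that the map function runs once per new or changed document, while queries read what is already indexed.

```python
import bisect


class IncrementalView:
    """Toy in-memory model of an incrementally maintained view."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows = []       # sorted (key, doc_id, value) rows, standing in for the B+-tree
        self.by_doc = {}     # doc_id -> rows that document contributed
        self.map_calls = 0   # counts how often map_fn actually runs

    def update(self, doc):
        """Re-index a single new or changed document."""
        doc_id = doc["_id"]
        for row in self.by_doc.pop(doc_id, []):
            self.rows.remove(row)            # drop rows from the previous revision
        self.map_calls += 1
        emitted = [(key, doc_id, value) for key, value in self.map_fn(doc)]
        for row in emitted:
            bisect.insort(self.rows, row)    # keep the index sorted by key
        self.by_doc[doc_id] = emitted

    def query(self, key):
        """Read precomputed rows for a key; map_fn is not called here."""
        start = bisect.bisect_left(self.rows, (key,))
        result = []
        for row_key, doc_id, value in self.rows[start:]:
            if row_key != key:
                break
            result.append((doc_id, value))
        return result


def by_city(doc):
    """Hypothetical map function: emit (city, 1) for documents that have a city."""
    if "city" in doc:
        yield doc["city"], 1


view = IncrementalView(by_city)
for doc in [{"_id": "a", "city": "Riga"},
            {"_id": "b", "city": "Oslo"},
            {"_id": "c", "city": "Riga"}]:
    view.update(doc)

view.update({"_id": "b", "city": "Riga"})   # one changed document, one extra map call
print(view.query("Riga"))                   # [('a', 1), ('b', 1), ('c', 1)]
print(view.map_calls)                       # 4: three inserts plus one update
```

Querying any number of times leaves `map_calls` unchanged, which is the whole point: the cost of map() is paid at write time, not at read time.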

The only analogue that comes to mind is materialized views in “big” RDBMSs like Oracle, only much more lightweight. Keep in mind that only the index of results is stored, and only the values you need. The redundancy is not that large compared to regular databases with a pile of indexes, because only the map / reduce results are indexed: the data you need in the context of this query, not all columns. Besides, hard drives are now cheap relative to the potential this approach offers.

I need more power ©

The fact that map() runs only once for each new or changed document is very interesting. It lets you perform fairly heavy operations on a new document, since they are done only once anyway. And if the built-in JavaScript does not give you enough room to maneuver, another interesting feature of CouchDB comes into play: it is not tied to any specific language but uses an abstraction called a view server. In other words, you simply plug in the view server for your favorite programming language and write map() and reduce() in, for example, Python, using its rich standard library.

By and large, nothing stops you, when a new document is added, from geocoding your client's address right inside map() using an external service (the Google Maps API, for example) and computing a geo-index for it with yet another external library. Or you can simply index your document with Sphinx and get a very fast full-text index of your database (if the Apache Lucene integration is not good enough for you). In general, the scope for creativity is limited only by imagination.

If you need more speed, all your view functions can be rewritten in C, taking advantage of two facts: the view server interface is dead simple, and a view in CouchDB is itself just a document, or rather a design document with a special ID. So it does not have to contain the function code directly: you can put anything there, including an identifier that your view server will understand. In general, this unity of data and the instructions for processing it, stored as exactly the same kind of data, is noticeably akin to the Lisp ideology.
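To give a feeling for how simple that interface is, here is a minimal sketch of a view server in Python. It assumes the line-based JSON protocol in which CouchDB sends commands such as ["reset"], ["add_fun", source] and ["map_doc", doc] and reads one JSON reply per line. The class `ToyViewServer`, the helper `serve`, and the convention that the function source defines a Python function named `fun` are my own assumptions for illustration. A real view server also has to handle reduce, rereduce, logging and errors, all of which are left out here.

```python
import json


class ToyViewServer:
    """Handles a small subset of the view server commands."""

    def __init__(self):
        self.functions = []

    def handle(self, command):
        name, args = command[0], command[1:]
        if name == "reset":
            self.functions = []          # forget all previously registered functions
            return True
        if name == "add_fun":
            namespace = {}
            # Assumption: the source string defines a generator function called `fun`.
            # exec() on untrusted input is unsafe; acceptable only in a sketch.
            exec(args[0], namespace)
            self.functions.append(namespace["fun"])
            return True
        if name == "map_doc":
            doc = args[0]
            # One list of [key, value] pairs per registered map function.
            return [[[key, value] for key, value in fun(doc)]
                    for fun in self.functions]
        raise ValueError("unsupported command: %r" % (name,))


def serve(lines, out):
    """Read one JSON command per line and write one JSON reply per line."""
    server = ToyViewServer()
    for line in lines:
        out.write(json.dumps(server.handle(json.loads(line))) + "\n")
        out.flush()
```

Wired to a real process, this would be called as `serve(sys.stdin, sys.stdout)`. Because the function source is opaque to CouchDB, it could just as well be an identifier that the server maps to precompiled code.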

Some say that the map / reduce paradigm does not provide the power of SQL queries, but in practice you just need to learn to think in its terms. Take, for example, the claims that a JOIN cannot be done without SQL, that joins do not scale, and so on. These are not without merit, but the statement is far too general. First, a document can contain not only key-value pairs but also collections, nested objects, and anything else that can be described in JSON. Second, even more complex joins can be implemented once you know the basic patterns: the map / reduce implementation in CouchDB is very powerful and at the same time clear and logical.
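One such pattern relies on the fact that view rows are sorted by key. If a post emits the key [post_id, 0] and each of its comments emits [post_id, 1], the post and its comments end up next to each other in the index and can be read in one pass. The sketch below imitates this in plain Python with tuples and sorted(). The document fields (`type`, `post_id`, `title`, `text`) and function names are hypothetical, and Python's tuple ordering only approximates CouchDB's own collation rules.

```python
def map_post_with_comments(doc):
    """Emit posts and their comments under keys that sort next to each other."""
    if doc["type"] == "post":
        yield (doc["_id"], 0), doc["title"]      # the post itself sorts first
    elif doc["type"] == "comment":
        yield (doc["post_id"], 1), doc["text"]   # its comments follow it


docs = [
    {"_id": "c2", "type": "comment", "post_id": "p1", "text": "Nice post"},
    {"_id": "p2", "type": "post", "title": "Other"},
    {"_id": "p1", "type": "post", "title": "Hello"},
    {"_id": "c1", "type": "comment", "post_id": "p1", "text": "First!"},
]

# Stand-in for the sorted view index.
rows = sorted(row for doc in docs for row in map_post_with_comments(doc))


def post_with_comments(post_id):
    """Read one contiguous slice of the index: the post, then its comments."""
    return [value for (row_post_id, _), value in rows if row_post_id == post_id]


print(post_with_comments("p1"))   # ['Hello', 'First!', 'Nice post']
```

In a real view the same slice would be requested with a key range covering every key that starts with the post ID, so the "join" costs a single index read.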

The web as it is

Unlike most alternatives, CouchDB was designed primarily as a database for web applications. That is the root of such a non-trivial decision as accessing the database through a REST interface. Yes, plain HTTP has overhead, but the elegance of the solution fully compensates for it. First, you do not need to hunt for a driver for your favorite programming language: any HTTP client will do (most examples simply use curl from the command line). Moreover, that client can be a web browser. You can write a web application without any server-side middleware, working with the database from JavaScript via Ajax (for example with CouchApp, a jQuery-based framework from the creators of CouchDB).

Once started, Couch behaves like a normal HTTP server: you can open it in your browser and issue GET requests straight from the address bar, receiving JSON in response. Futon, the administrative interface of the CouchDB instance, is also available immediately. By the way, it is implemented entirely in JavaScript without server middleware, is extensible, and can do many interesting things despite its simplicity.
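Since any HTTP client will do, the Python standard library is enough. The sketch below only builds the requests and does not send them, so it runs without a server. It assumes a CouchDB instance on localhost at the default port 5984; the database name `shop`, the design document `stats`, the view `by_city`, and the helper function names are all hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:5984"   # assumption: a local instance on the default port


def put_document(db, doc_id, doc):
    """Build a PUT request that creates or updates a document."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        "%s/%s/%s" % (BASE, db, doc_id),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def get_document(db, doc_id):
    """Build a GET request for a single document."""
    return urllib.request.Request("%s/%s/%s" % (BASE, db, doc_id), method="GET")


def query_view(db, design, view, key):
    """Build a GET request for the view rows with the given key (JSON-encoded)."""
    params = urllib.parse.urlencode({"key": json.dumps(key)})
    url = "%s/%s/_design/%s/_view/%s?%s" % (BASE, db, design, view, params)
    return urllib.request.Request(url, method="GET")


request = query_view("shop", "stats", "by_city", "Riga")
print(request.get_method(), request.full_url)
```

With a server running, each request can be sent with `urllib.request.urlopen(request)` and the response body parsed with `json.load`.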

It hardly needs saying that HTTP is implemented correctly: caching is supported, and Couch knows when to return 304. Likewise, a document can contain binary attachments (the equivalent of BLOBs), which Couch stores alongside the document and serves just like static content. Design documents can also contain show() and list() functions, which let you transform the returned results however you like, for example into HTML pages, and serve them directly to the browser. And if you used to believe that storing user avatars and product pictures for your online store directly in your [relational] database is a bad idea, with CouchDB things are different: in some cases data without a rigid schema turns out to be more structured and coherent.

As the saying goes, everything ingenious is simple. CouchDB has many such simple things that were made specifically for web applications rather than for abstract data that has to be processed somehow. In the end everything looks so consistent that it is no longer clear how it could have been done any other way.

Scaling

It used to be said that CouchDB does not scale. Not so long ago that was indeed true, but only half true. First things first.

One of the key and most interesting features of CouchDB is its replication. The reader may wonder how replication can be interesting at all, seeing it as a kind of crutch. Not so with Couch: replication was part of the design from the very beginning. First, replication can be master-master, which makes all instances equally capable instead of dividing them into master and slave. The problem with such replication in “regular” databases is potential conflicts. Couch, with its MVCC design, saves all conflicting versions when a conflict occurs and can resolve them according to configured rules (or hand the matter over to your application; how you use this is limited only by your imagination).
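What handing the conflict over to your application might look like can be sketched as follows. The code assumes the application has already fetched all conflicting versions of a document as plain dictionaries with a `_rev` field of the form "generation-hash". The function names and the `tags` field are hypothetical. The winner rule here is a simplification of my own: it only illustrates that a deterministic rule lets every replica pick the same version without coordination, and it is not a description of CouchDB's exact algorithm.

```python
def pick_winner(revisions):
    """Choose one version deterministically, so all replicas agree.

    Simplified assumption: compare by (generation number, hash string)
    taken from a revision ID such as "3-abc".
    """
    def sort_key(doc):
        generation, _, digest = doc["_rev"].partition("-")
        return int(generation), digest

    return max(revisions, key=sort_key)


def merge_tags(revisions):
    """Application-specific rule: keep the winner, but union the tags of all versions."""
    merged = dict(pick_winner(revisions))
    merged["tags"] = sorted({tag for doc in revisions for tag in doc.get("tags", [])})
    return merged


conflicting = [
    {"_id": "note", "_rev": "2-aaa", "text": "edited on the phone", "tags": ["mobile"]},
    {"_id": "note", "_rev": "2-bbb", "text": "edited on the laptop", "tags": ["work"]},
]

print(pick_winner(conflicting)["text"])   # edited on the laptop
print(merge_tags(conflicting)["tags"])    # ['mobile', 'work']
```

The merged document would then be saved as a new revision and the losing revisions deleted, which is what actually clears the conflict.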

So you are riding public transport, writing something on Twitter from your Nexus One (did I mention that CouchDB is fully functional on Android phones, as well as on Maemo / MeeGo?), and then you enter a tunnel and the connection is lost. You can safely keep using the application, and when the connection returns it can pull in new messages and upload what you wrote with a single API call, without reinventing the wheel. For example, if you use Ubuntu One, you already use CouchDB this way.
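The API call in question is a POST to the `_replicate` endpoint with a JSON body naming a source and a target. The sketch below builds such requests without sending them. The server address, database name, remote URL, and helper names are assumptions for illustration, and note that one call copies changes in one direction, so a full two-way sync is shown here as a push plus a pull.

```python
import json
import urllib.request


def replication_request(server, source, target):
    """Build a POST to /_replicate that copies changes from source to target."""
    body = json.dumps({"source": source, "target": target}).encode("utf-8")
    return urllib.request.Request(
        server + "/_replicate",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def sync_requests(local_server, local_db, remote_db_url):
    """Two-way sync as a pair of one-way replications: push, then pull."""
    return [
        replication_request(local_server, local_db, remote_db_url),
        replication_request(local_server, remote_db_url, local_db),
    ]


# Hypothetical names: a local database on the phone and a remote one on a server.
push, pull = sync_requests(
    "http://localhost:5984", "tweets", "http://example.com:5984/tweets"
)
print(push.get_method(), push.full_url)
print(json.loads(push.data.decode("utf-8")))
```

When the connection comes back, sending both requests with `urllib.request.urlopen` is all the application has to do; conflict detection is handled by the database.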

But let us not stray from the topic. Such replication is certainly good (especially in light of CouchDB's HTTP nature: you can put an ordinary load balancer in front of a cluster of Couches and not worry), but it is not “real” scaling. Relational databases can scale through replication too (albeit not so elegantly), but what about sharding? After all, everyone wants to spread data across a cluster and, simply by adding a node, increase available disk space, maximum load, peak number of users and so on without losing speed. CouchDB cannot do this out of the box. Yet. But it is only a matter of time.

Cloudant, however, can. Cloudant is a new service that finished its beta test just a couple of days ago and is now open to everyone. It hosts your CouchDB database (in Amazon's cloud), and even your entire application, given that CouchDB can serve as middleware as well as a database. The team took advantage of the potential built into CouchDB and is developing its own fork (which has a chance of soon becoming part of trunk) that scales without limit through sharding. You can add a new node at any time, and it also provides redundancy: each document is stored in three copies on different nodes in case one of them goes down. Moreover, in addition to full support for the standard API, Cloudant lets you run map / reduce over the results of another map / reduce and has many more interesting features.
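To illustrate the general idea of sharding with three copies per document, here is a sketch based on consistent hashing. This is a textbook technique that I am using as a stand-in; I am not claiming it is how Cloudant places documents. The class `Ring`, the node names, and the parameters `replicas` and `points_per_node` are all hypothetical.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Consistent-hash ring that assigns each document to several distinct nodes."""

    def __init__(self, nodes, replicas=3, points_per_node=64):
        self.replicas = replicas
        # Each node owns many points on the ring to spread the load evenly.
        self.points = sorted(
            (_hash("%s#%d" % (node, i)), node)
            for node in nodes
            for i in range(points_per_node)
        )

    def nodes_for(self, doc_id):
        """Walk clockwise from the document's position, collecting distinct nodes."""
        start = bisect.bisect(self.points, (_hash(doc_id),))
        chosen = []
        for offset in range(len(self.points)):
            node = self.points[(start + offset) % len(self.points)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == self.replicas:
                break
        return chosen


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.nodes_for("customer-42"))   # three distinct node names
```

The useful property is what happens when a node is added: a document's first node either stays the same or becomes the new node, so only a fraction of the data has to move.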

All the code is open, so you can take Cloudant and run it in a private cloud. You can also register a free account (up to 250 megabytes of disk space, not counting old document revisions) and try CouchDB live: all the main features are available, including Futon.

Today

In July CouchDB reached version 1.0, with which the developers emphasize its stability and readiness for production use. Cloudant's launch is also a landmark for the entire CouchDB community. If you have not tried Couch, spend half an hour of your time on it. It may save you much more time in the end, although that will of course depend on the specifics of your project. The product does exactly what it was created for, and no more. So do not expect a miracle, but I hope that after reading this someone will relax, as the developers recommend (especially if there is a red couch nearby).
