AlexSecret October 6, 2012 at 22:29

Why you need to think 1000 times before using NoSQL

Why am I writing this article? First, I would like to help people understand the essence of NoSQL and why this type of storage should be chosen deliberately. Second, I will be glad to meet both like-minded people and opponents and, possibly, to have a debate. And if you like this article, I will be glad to get questions that I can cover in more detail in new articles :)

Even though there is now a multitude of NoSQL solutions, people are reluctant to switch to new types of storage. Are they right? In my opinion, yes. I will try to explain why, using the different NoSQL stores I have met on my professional path as examples.

The beginning of the story

Good day to all. This is my first attempt at writing for a large publication, and I hope it turns out interesting :) The term NoSQL is described in this article, so here we will turn to real-life examples and try to draw conclusions.

Consider the most popular DBMSs: MySQL, PostgreSQL, Oracle. There are many differences between them, but all three are mature relational databases with rich capabilities. They let you build document workflow systems, banking applications and business card sites for a small cafe. This is a universal solution for almost any of your tasks.

What problems does a novice developer encounter when meeting his first SQL database?

You need to learn SQL syntax
You need to understand the very essence of the relational model
You need to master a database client for your favorite development language

And that's it. After that, a person has mastered not just one database but a whole family of databases, and can easily switch, for example, from MySQL to Oracle (forget for a moment about PL/SQL and other important differences). And if you also use an ORM ... beautiful.

This apparent simplicity can play a cruel joke on you, for example when you are debugging a 5-line query in Oracle, trying to make it faster. That is when you begin to understand that free cheese is found only in a mousetrap.

And yet: the convenience of selecting data with the huge range of tools the query language offers, isn't that happiness?

Frankly, for more than 2 years I have not seriously considered MySQL or Oracle. Below I describe what distracted me and lured me away ...

Alfresco

Even though it needs an SQL database to work, I still consider Alfresco my first NoSQL database.

What does a person need to learn when sitting down for the first time to develop on this wonderful platform? Well, pretty much everything :)

It is completely different. Data structures in it are described in XML. Relationships are defined using so-called associations. For example, a post and the list of its comments are linked by associations. There is also model inheritance: one “table” can inherit from another.

There is an opinion that a NoSQL solution is necessarily fast storage. But Alfresco is a very slow repository. Very, very slow. Another shortcoming I can name is the query API. There are two ways to access the repository: fetching associations and objects by id through the Java API, and running more complex queries that select by attributes and associations through the Lucene query engine. The queries look scary, but I wrote a simple wrapper over the query engine that let me build queries roughly like this: Query.field(title).eq("Title").and(Query.field(text).like("*text*")); and life became better and more fun. The query is written from memory, but my colleagues will recognize it (hello! :))
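As an illustration only, a fluent wrapper of this kind might be sketched as follows. This is not the wrapper described above: the class layout, the `or` and `build` methods, the escaping rules and the generic `field:"value"` output format are all assumptions, and the resulting string has not been checked against Alfresco's own Lucene dialect (field naming is left entirely to the caller).

```java
// Hypothetical sketch of a fluent query builder that renders a Lucene-style
// query string. All names here are illustrative assumptions.
final class Query {
    private final String expr;

    private Query(String expr) {
        this.expr = expr;
    }

    /** Starts a condition on the given field name. */
    static Field field(String name) {
        return new Field(name);
    }

    /** Combines two conditions with AND, parenthesizing both sides. */
    Query and(Query other) {
        return new Query("(" + expr + ") AND (" + other.expr + ")");
    }

    /** Combines two conditions with OR, parenthesizing both sides. */
    Query or(Query other) {
        return new Query("(" + expr + ") OR (" + other.expr + ")");
    }

    /** Renders the accumulated query string. */
    String build() {
        return expr;
    }

    static final class Field {
        private final String name;

        private Field(String name) {
            this.name = name;
        }

        /** Exact match: the value is quoted as a phrase, with quotes and backslashes escaped. */
        Query eq(String value) {
            String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
            return new Query(name + ":\"" + escaped + "\"");
        }

        /** Wildcard match: the pattern is passed through unquoted so that * and ? stay active. */
        Query like(String pattern) {
            return new Query(name + ":" + pattern);
        }
    }
}
```

With this sketch, `Query.field("title").eq("Title").and(Query.field("text").like("*text*")).build()` yields the string `(title:"Title") AND (text:*text*)`. The point of such a wrapper is only that the scary string is assembled in one place instead of by hand at every call site.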

And still, it is a wonderful thing, because it is very convenient for writing document workflow systems with large and complex business processes, through which documents travel, “spending the night” with one user or another until they finally reach some outcome. For example, the resolution: done.

Back then version 3 was just starting out; version 4 was released in 2011. Many tasty things were added and performance probably improved, but by then I was too carried away by new storage systems ...

Cassandra

This is my love, and I remain faithful to it to this day. My colleagues were not very enthusiastic about it, but I still think that was all down to a lack of RAM on the servers. Naturally, with 500 million rows with blobs on a server you need more than 8 GB of RAM ... otherwise the node sometimes hangs.

But ... very fast writes and fast reads. Full control over the data, and confidence that the database will not choke on write or read speed. I still use it in my own projects and it has not let me down yet. A distinctive feature of this database is that it is hard to kill. I am never afraid that the server will go down hard and I will have to do a restore, as happens, for example, with MongoDB with default settings.

Requests to the database are made through the Thrift API, which looks very scary. It lacks basic conveniences such as a connection pool. We put in a set of bytes and, in effect, get a set of bytes back. I solved this problem the same way as with Alfresco, only on a larger scale: I had to write an ORM framework that sat on top of Thrift without imposing a performance penalty. There were open source alternatives to my reinvented wheel, but they all seemed inconvenient for the tasks at hand.
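To make the "bytes in, bytes out" point concrete, here is a minimal sketch of what the mapping layer of such an ORM has to do: turn an object's fields into named byte arrays and back. It is not the framework mentioned above and it calls no Thrift or Cassandra API; `ByteMapper`, `Post`, the supported field types and the encodings (UTF-8 strings, big-endian numbers) are all assumptions chosen for illustration.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: maps plain objects to "column name -> raw bytes" and back.
final class ByteMapper {
    /** Encodes every non-null, non-static field of the entity as one column. */
    static Map<String, byte[]> toColumns(Object entity) {
        Map<String, byte[]> columns = new LinkedHashMap<>();
        try {
            for (Field f : entity.getClass().getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) continue;
                f.setAccessible(true);
                Object value = f.get(entity);
                if (value != null) columns.put(f.getName(), encode(value));
            }
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
        return columns;
    }

    /** Rebuilds an entity from columns; the type needs a no-argument constructor. */
    static <T> T fromColumns(Class<T> type, Map<String, byte[]> columns) {
        try {
            Constructor<T> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            T entity = ctor.newInstance();
            for (Field f : type.getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) continue;
                byte[] raw = columns.get(f.getName());
                if (raw == null) continue; // column absent: keep the default value
                f.setAccessible(true);
                f.set(entity, decode(f.getType(), raw));
            }
            return entity;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    private static byte[] encode(Object value) {
        if (value instanceof String) return ((String) value).getBytes(StandardCharsets.UTF_8);
        if (value instanceof Long) return ByteBuffer.allocate(Long.BYTES).putLong((Long) value).array();
        if (value instanceof Integer) return ByteBuffer.allocate(Integer.BYTES).putInt((Integer) value).array();
        throw new IllegalArgumentException("Unsupported type: " + value.getClass());
    }

    private static Object decode(Class<?> type, byte[] raw) {
        if (type == String.class) return new String(raw, StandardCharsets.UTF_8);
        if (type == long.class || type == Long.class) return ByteBuffer.wrap(raw).getLong();
        if (type == int.class || type == Integer.class) return ByteBuffer.wrap(raw).getInt();
        throw new IllegalArgumentException("Unsupported type: " + type);
    }
}

// Hypothetical entity used only to demonstrate the round trip.
final class Post {
    String title;
    long createdAt;

    Post() {}

    Post(String title, long createdAt) {
        this.title = title;
        this.createdAt = createdAt;
    }
}
```

A round trip such as `ByteMapper.fromColumns(Post.class, ByteMapper.toColumns(new Post("Hello", 42L)))` gives back an equivalent object. A real framework would also need row keys, connection handling and far more types, which is where most of the effort (and the bug reports) would come from.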

Thanks to my team lead and my patient colleagues, who selflessly started using my product and immediately filed a ton of bug reports :))

And yet Cassandra still ate memory and hung when it ran out ...

Riak

My acquaintance with it was short. I read about it on Habr: cool. I read the website: cool. I installed it and started testing. First, I was put off by the lack of the query functionality I needed. Second, while writing 20 million rows the database behaved very strangely. It simply died. After a restart it behaved even more strangely: with 20 million rows on board it took 10 minutes to load, for some reason pegging only one core out of four at 100%.
This was my personal research, so I did not want to spend any more time on this database.

Hypertable

This database seemed like a salvation, since it was not very memory hungry at a billion records per server, and it was very fast at writes. Although, of course, the write speed there depends entirely on the chosen timeout for flushing to disk. After Cassandra, the Thrift API caused no problems; all that remained was to add Hypertable support to the ORM.

But this database was so erratic, and its logs so uninformative, that one could only wonder how the product could be called stable. Attempts to find fellow sufferers with the same problems online yielded nothing. You could simply restart it and never see the database come back up. Bringing it up took some dancing with a tambourine: reboot 2 times, delete the logs, reboot another 2-3 times. Or 5 times. The problems did not show up right away, though, and it very nearly made it to production. All in all, not an option ...

MySQL

(just for example)

Sad faces of colleagues, sad me. NoSQL did not solve our problems. Everything was in vain. Reluctantly, we tested MySQL on our tasks, and on 3 billion records it performed very well. This really upset me, with thoughts of “How can that be! It's NoSQL! Big data!” I had to use MySQL on real data. Naturally, without joins or complex relations. I must say that the real data changed the picture, and one of the tasks did not work out with MySQL. At all. A 4-second query is beyond acceptable. Even with a heavily optimized query, this time with joins and SQL features. But MySQL handled the other task quite well. The main thing is choosing the right number of rows per write batch.
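The remark about batch size can be illustrated with a small buffering helper that groups rows into fixed-size batches before handing them to whatever actually writes them. This is a sketch under assumptions: `BatchWriter` and its `sink` callback are hypothetical names, no database is touched, and the right batch size is something to measure on real data rather than a value this code can suggest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: collects rows and passes them on in batches of a fixed size.
final class BatchWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> buffer;

    BatchWriter(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        this.batchSize = batchSize;
        this.sink = sink;
        this.buffer = new ArrayList<>(batchSize);
    }

    /** Buffers one row and flushes automatically once the batch is full. */
    void add(T row) {
        buffer.add(row);
        if (buffer.size() >= batchSize) flush();
    }

    /** Sends the buffered rows to the sink; call once at the end for the last partial batch. */
    void flush() {
        if (buffer.isEmpty()) return;
        sink.accept(new ArrayList<>(buffer)); // copy, so the sink may keep the list
        buffer.clear();
    }
}
```

For example, feeding 2500 rows into `new BatchWriter<String>(1000, sink)` and then calling `flush()` invokes the sink three times, with 1000, 1000 and 500 rows. With JDBC, the sink would typically call `PreparedStatement.addBatch()` for each row and `executeBatch()` once per batch, so that the batch size becomes a single tunable number.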

In general, we were financially limited and could not buy many powerful servers. We used what we were given. And we tried to save wherever we could.

MongoDB

In parallel with the databases listed above, I have used and still use this one as well. It is also a favorite database; I have already used it in 6 projects. Among its conveniences are a handy ORM framework for Java, Morphia, rich querying capabilities, scalability and speed.

Of course, there are nuances here:

using mongo version > 2 is highly recommended
be careful with server reboots that do not shut mongo down first, unless you have configured it properly
read about mongorestore and journaling :)

In my opinion, this database is wonderful as a transition between SQL solutions and the NoSQL world. What are its advantages for me personally? It is schema-free, its queries are simple, it is document-oriented and it scales. I enjoy the very paradigm this database embodies.

And yet: of the 6 projects on Mongo, 3-4 could have been written on MySQL without any fuss. I wrote them on Mongo simply because I like Mongo.

Hadoop

I started using this thing recently, about 3 months ago, when I moved to a new job. Hadoop is an ecosystem of solutions for storing and processing huge amounts of data. Once you understand the essence of map-reduce and Hadoop, the simplicity of the algorithms and principles at the foundation of this solution is striking. And yet this simplicity lets you process 200 gigabytes of text as if it were a small article. The point is that a set of simple ideas yields a fast, simple solution. And if it seems that the data is not being processed fast enough, add a node to the cluster.
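The simplicity of the map-reduce idea can be shown with the classic word count, written here in plain Java without any Hadoop classes. This is only a single-process sketch of the three phases; the `WordCount` class and its method names are assumptions, and a real Hadoop job would express the same map and reduce steps through Hadoop's own API while the framework handles distribution and grouping.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of map, shuffle and reduce.
final class WordCount {
    /** Map phase: emits a (word, 1) pair for every word in one line of text. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    /** Shuffle phase: groups all emitted values by their key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sums the counts collected for one word. */
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) sum += count;
        return sum;
    }

    /** Runs all three phases over the input lines and returns word -> count. */
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }
}
```

Because `map` looks at one line at a time and `reduce` at one key at a time, both steps can be spread over many machines, which is why adding a node to the cluster helps.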

Of course, understanding the essence, studying the Hadoop source code and implementing the first computation jobs takes some time.

The main surprise of this solution for me was that a database may not be needed at all if you need to store and process really big data.

As a conclusion, I would like to express my personal opinion about all this:
There is no single solution for all tasks. The closest thing to one, whatever anyone says, is an SQL solution. Each NoSQL store is a tool that solves only a certain range of tasks, and it also requires hand-finishing, studying the internals and careful tuning, or even writing your own client.

Addition to the conclusion:
You need to think first of all because there is no silver bullet. And no matter how clear the database manual is, that will not reduce the number of surprises. Current NoSQL solutions are young tools and therefore not free of flaws. Nevertheless, some of them are quite ready for production use, for example: mongodb, redis, hbase, cassandra.

But the answer to the question of what to use, and in which case, is in my opinion something you have to reach yourself, by testing and researching solutions for your specific problem.
