Overview of NoSQL Systems

    Unprecedented amounts of data are driving developers and businesses to look at alternatives to the relational databases that have been in use for over thirty years. Collectively, these technologies are known as NoSQL databases.


    The main problem is that relational databases cannot cope with today's loads (we are talking about high-load projects). There are three specific areas of concern:
    • horizontal scaling for large amounts of data, as at Digg (3 terabytes for the green icons shown when a friend has dugg an article), Facebook (50 terabytes for inbox search), or eBay (2 petabytes in total)
    • performance of each individual server
    • inflexible logical schema design.
    Many companies need new ways to store and scale huge data sets. I recently published a translation of an article about the non-relational Riak store. In this article, we will look at the main non-relational databases and systems that make up the NoSQL movement.

    The term NoSQL was coined by Eric Evans of Rackspace when Johan Oskarsson of Last.fm wanted to host an event to discuss open source distributed databases.

    Some people dislike the term NoSQL because it sounds as if it defines us by what we don't want to do rather than by who we are. The NoSQL movement is not a movement against relational databases. NoSQL stands for "Not Only SQL", not "No SQL at all".

    The term NoSQL covers a large number of products with completely different designs, and in a discussion people may be talking about quite different systems. So I suggest comparing these systems along three axes: scalability, data and query model, and storage system.

    I have selected 10 NoSQL databases as examples. This is not an exhaustive list, but it is sufficient for evaluation.

    Scalability


    By scalability, some people mean replication, so to be clear: when we talk about scalability here, we mean automatic distribution of data across several servers. We call such systems distributed databases. These include Cassandra, HBase, Riak, Scalaris, and Voldemort. Such a system is your only choice if your volume of data cannot be handled by one machine and you do not want to manage the distribution manually.

    There are two things to look for in a distributed database: support for multiple data centers, and the ability to add new machines to a running cluster transparently to your applications.
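
    One common way to add machines without disturbing applications is consistent hashing. The sketch below is only an illustration of that idea: the HashRing class, the node names, and the use of MD5 to spread keys are assumptions made for this example, not the partitioning code of any database listed here.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (MD5 is used only to spread keys)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node owns several positions on the ring.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key: str) -> str:
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # grow the cluster
after = {k: ring.node_for(k) for k in keys}

# Keys whose owner changed when the new node joined.
moved = [k for k in keys if before[k] != after[k]]
```

    In this sketch only part of the keys change owner, and every key that moves lands on the new node, so the existing machines never shuffle data among themselves.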


    Non-distributed databases include CouchDB, MongoDB, Neo4j, Redis, and Tokyo Cabinet. These systems can serve as a storage layer for distributed systems: MongoDB provides limited sharding support, the separate Lounge project does the same for CouchDB, and Tokyo Cabinet can be used as a storage engine for Voldemort.

    Data and Query Model


    There is a huge variety of data models and query APIs in NoSQL databases.


    Query interfaces seen among these systems include Thrift, map/reduce, cursors, graph traversal, collections, nested hashes, and get/put.

    The column family model is used in Cassandra and HBase; the idea came to both from the paper describing Google's Bigtable (Cassandra departs from Bigtable slightly by introducing supercolumns). In both systems you have rows and columns, as you are used to seeing, but the rows are sparse: each row can have as many or as few columns as needed, and the columns do not have to be defined in advance.
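
    To make the idea of sparse rows concrete, here is a minimal in-memory sketch. The names users, insert, and get_slice are hypothetical and do not correspond to the Cassandra or HBase client APIs; the point is only that each row carries its own set of columns.

```python
from collections import defaultdict

# Hypothetical stand-in for one column family:
# row key -> {column name -> value}. No column is declared up front.
users = defaultdict(dict)


def insert(row_key, column, value):
    """Set a single column on a row, creating the row if needed."""
    users[row_key][column] = value


def get_slice(row_key, start, end):
    """Return the row's columns whose names fall in [start, end], sorted by name."""
    row = users.get(row_key, {})
    return {name: row[name] for name in sorted(row) if start <= name <= end}


insert("alice", "email", "alice@example.com")
insert("alice", "city", "Oslo")
insert("bob", "email", "bob@example.com")
insert("bob", "phone", "555-0100")
insert("bob", "twitter", "@bob")

alice_columns = sorted(users["alice"])    # ['city', 'email']
bob_columns = sorted(users["bob"])        # ['email', 'phone', 'twitter']
bob_contact = get_slice("bob", "e", "q")  # columns named between "e" and "q"
```

    The two rows share only the email column, and nothing had to be altered in a schema to give bob a twitter column.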

    The key/value model is simple and easy to implement, but it is inefficient if you only want to query or update part of a value. It is also difficult to build complex structures on top of distributed key/value systems.

    Document-oriented databases are essentially the next level up from key/value systems, allowing you to associate nested data with each key. Querying such data is more efficient than returning the entire BLOB every time.
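
    The difference between the two models can be sketched in a few lines. Everything here (kv_store, doc_store, and the helper functions) is a hypothetical in-memory stand-in, not the API of any real product; it only contrasts rewriting a whole blob with addressing one nested field.

```python
import json

# Plain key/value store: the value is an opaque blob to the database.
kv_store = {}
kv_store["user:42"] = json.dumps(
    {"name": "Ann", "address": {"city": "Oslo", "zip": "0150"}}
)


def kv_update_city(key, city):
    """The client must fetch, decode, modify, and rewrite the whole blob."""
    record = json.loads(kv_store[key])
    record["address"]["city"] = city
    kv_store[key] = json.dumps(record)


# Document-style store: the database understands the nested structure.
doc_store = {"user:42": {"name": "Ann", "address": {"city": "Oslo", "zip": "0150"}}}


def doc_get(key, path):
    """Return only the field at a dotted path such as 'address.city'."""
    node = doc_store[key]
    for part in path.split("."):
        node = node[part]
    return node


def doc_set(key, path, value):
    """Update one nested field in place without touching the rest."""
    *parents, leaf = path.split(".")
    node = doc_store[key]
    for part in parents:
        node = node[part]
    node[leaf] = value


kv_update_city("user:42", "Bergen")
doc_set("user:42", "address.city", "Bergen")
city = doc_get("user:42", "address.city")
```

    Both stores end up with the same data, but in the key/value case the whole record travels to the client and back for a one-field change.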

    Neo4j has a truly unique data model, storing objects and relationships as the nodes and edges of a graph. For queries that fit this model (for example, hierarchical data), it can be a thousand times faster than the alternatives.
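
    As a rough illustration of why hierarchical queries suit a graph model, the sketch below stores a small reporting hierarchy as nodes and edges and walks it directly. The edges dictionary, the MANAGES relationship name, and the descendants function are made up for this example and are not Neo4j's API.

```python
from collections import deque

# Hypothetical graph: each node maps to a list of (relationship, target) edges.
edges = {
    "ceo": [("MANAGES", "cto"), ("MANAGES", "cfo")],
    "cto": [("MANAGES", "dev1"), ("MANAGES", "dev2")],
    "cfo": [("MANAGES", "accountant")],
    "dev1": [],
    "dev2": [],
    "accountant": [],
}


def descendants(start, relationship):
    """Breadth-first traversal that follows one relationship type."""
    seen, order = {start}, []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for rel, target in edges.get(node, []):
            if rel == relationship and target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


reports = descendants("cto", "MANAGES")
everyone = descendants("ceo", "MANAGES")
```

    Each step follows an edge straight to the next node, whereas a relational schema would typically need one self-join per level of the hierarchy.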

    Scalaris is unique in using distributed transactions across multiple keys. A discussion of the trade-offs between consistency and availability is beyond the scope of this post, but this is another aspect to consider when evaluating distributed systems.

    Storage System


    By storage, I mean how data is stored inside the system.


    The storage system can tell us a lot about what kind of load the database can normally withstand.

    Databases that keep data in memory are very, very fast (Redis can perform up to 100,000 operations per second), but they cannot work with data sets larger than the available RAM. Durability (keeping data through a server failure or power outage) can also be a problem (an append-only log is planned for upcoming Redis versions). The amount of data that can be lost between flushes to disk is potentially large. Scalaris, another in-memory system, addresses durability through replication, but it does not support multiple data centers, so a power outage can still cause data loss.

    Memtables and SSTables: writes are buffered in memory (in a memtable) after first being written to a commit log for durability (more detail is in the Cassandra wiki: http://wiki.apache.org/cassandra/ArchitectureOverview). Once enough writes have accumulated, the memtable is sorted and written to disk as an SSTable. This gives performance close to that of in-memory systems while avoiding the durability problems of purely in-memory storage. (The procedure is described in more detail in sections 5.3 and 5.4 of the Bigtable paper, as well as in the paper "The log-structured merge-tree".)
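
    The write path described above can be sketched as a toy class. This is a simplified illustration under stated assumptions, not Cassandra's or HBase's implementation: the TinyLSM name, the JSON file format, and the three-entry flush threshold are invented here, flushed tables are also kept in memory for lookups, and there is no compaction or crash recovery.

```python
import bisect
import json
import os
import tempfile


class TinyLSM:
    """Hypothetical miniature of the commit log -> memtable -> SSTable path."""

    def __init__(self, directory, memtable_limit=3):
        self.directory = directory
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.sstables = []  # newest last; each is a sorted list of (key, value)
        self.log_path = os.path.join(directory, "commit.log")

    def put(self, key, value):
        # 1. Append to the commit log first so the write survives a crash.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps([key, value]) + "\n")
        # 2. Buffer the write in memory.
        self.memtable[key] = value
        # 3. Flush once enough writes have accumulated.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort the memtable and write it out sequentially as one immutable file.
        rows = sorted(self.memtable.items())
        path = os.path.join(self.directory, f"sstable-{len(self.sstables)}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(rows, f)
        self.sstables.append(rows)
        self.memtable = {}
        open(self.log_path, "w").close()  # flushed data no longer needs the log

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for rows in reversed(self.sstables):      # newest SSTable wins
            i = bisect.bisect_left(rows, (key,))  # binary search on sorted keys
            if i < len(rows) and rows[i][0] == key:
                return rows[i][1]
        return None


with tempfile.TemporaryDirectory() as tmp:
    db = TinyLSM(tmp, memtable_limit=3)
    for k, v in [("b", 1), ("a", 2), ("c", 3), ("a", 4)]:
        db.put(k, v)
    flushed = [key for key, _ in db.sstables[0]]  # keys come out sorted
    latest_a = db.get("a")                        # served from the memtable
    files = sorted(os.listdir(tmp))
```

    The third write triggers a flush, so the first SSTable holds the keys in sorted order, while the later write to "a" sits in the new memtable and shadows the older value on disk.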

    B-trees have been used in databases for a very long time. They provide robust indexing support, but performance is poor on machines with rotating magnetic disks (which are still the most cost-effective), because reading or writing data requires a large number of head seeks.

    An interesting variant is CouchDB's append-only B-trees (a B-tree to which changes are appended rather than written in place), which give good performance when writing data to disk.

    Conclusion


    The NoSQL movement grew sharply in 2009 as more and more companies began wrestling with large volumes of data. More and more systems are appearing that make it possible to organize, process, and manage huge amounts of data transparently. I hope this short article has shown you some of the strengths of NoSQL systems, and that you may go on to contribute to the development of this movement.

