I am writing a search engine (a virtual project): data storage

    Storage is perhaps the main bottleneck in projects like this. Depending on the tasks being solved, it must provide:
    - fast access to data;
    - fast data updates;
    - sufficient functionality with room for expansion.
    In systems serving a large stream of requests, keeping the processing time of an individual request short is the key to overall performance.
    If it is important that new data becomes accessible quickly (news systems), then the speed of updating the database comes to the fore.
    As data volumes grow, combining fast access with fast updates becomes almost impossible.

    Using a general-purpose DBMS in this situation is practically impossible (anyone who disagrees with this thesis is welcome to try convincing Yandex or Google of it). The volumes of data I process are much more modest (the data deployed on a separate server takes up about two hundred gigabytes), but they too forced me to look for other solutions.
    Well-known key-value NoSQL stores do not always help either. I tried updating several hundred million records in Berkeley DB; the result did not suit me, especially considering that this operation has to be performed daily.
    As a result, I settled on the following cocktail (the approaches described below are used on real tasks and have proven their effectiveness):
    - an ordered linear list with hash keys;
    - an ordered linear list with an index in the form of a binary tree;
    - Berkeley DB with the MemcacheDB layer on top.
    The first two options are used to store the main index; the last is used when the number of records requiring daily updates/additions does not exceed a few million and individual records need to be quickly added, deleted, or modified.
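    Since MemcacheDB exposes Berkeley DB through the ordinary memcached protocol, working with this third option looks like plain memcached calls. A rough sketch (the client library, key names, host and port here are my own example, not a prescription):

        from pymemcache.client.base import Client

        # MemcacheDB speaks the memcached protocol on top of Berkeley DB;
        # the host and port below are assumptions for the example.
        db = Client(("localhost", 21201))

        db.set("url:12345", "http://example.com/page")   # add or modify a record
        print(db.get("url:12345"))                       # read it back
        db.delete("url:12345")                           # remove it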
    The indexes are quite large, so keeping all the data in RAM is not possible.
    The fastest option is, of course, hash keys: given a value, we get the result immediately. Their only drawback is the extra storage they require; in my index the empty slots do not exceed 20% of the total. This is achieved because the hash is not computed from the looked-up value but is, in fact, the ID assigned to the record, and the IDs of records that retire over time are reused.
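    A minimal sketch of such a lookup (the flat file, the fixed 16-byte record layout, and the function names are invented for illustration; the real index format is different):

        import struct

        RECORD_SIZE = 16                      # illustrative layout: two 64-bit fields per record
        EMPTY = b"\x00" * RECORD_SIZE

        def get_by_id(index_file, record_id):
            """Direct access: the record's slot in the file is its ID, so one seek + one read."""
            index_file.seek(record_id * RECORD_SIZE)
            raw = index_file.read(RECORD_SIZE)
            if len(raw) < RECORD_SIZE or raw == EMPTY:
                return None                   # an empty slot, one of the ~20% "voids"
            return struct.unpack("<QQ", raw)

        def put_by_id(index_file, record_id, field_a, field_b):
            """Overwriting the slot of a retired ID is how IDs get reused."""
            index_file.seek(record_id * RECORD_SIZE)
            index_file.write(struct.pack("<QQ", field_a, field_b))

    With this layout a lookup never costs more than one disk access, at the price of keeping empty slots for unused IDs.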
    Unfortunately, such a scheme is not always possible. Sometimes, for example, the inverse operation is needed: given a value, find its ID. In that situation hashing can be too resource-intensive, so I use a binary-tree key instead. Its main disadvantage compared to a hash is that finding a single record takes log2(N) disk accesses instead of one; with more than 500 million records in the index, that is at least 28 reads, which is rather expensive. Of course, the operating system tries to help by caching the most frequently requested blocks, and the end result is quite acceptable. But rather than rely on the mercy of the system, I decided to make the next storage option a combined one: the first half of each key is kept in memory as a hash. With a small amount of RAM this reduces the number of disk accesses, and since the data on disk is ordered, the hash lets us read the key block once and then work with it in memory. This option has not been implemented yet, but I expect it to be nimbler than a binary tree.
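    A sketch of what this combined scheme might look like (the prefix length, the tab-separated block format, and the class name are my assumptions for the example): one hash lookup in RAM gives the block's position, one disk read fetches the ordered block, and the exact key is then found in memory.

        import bisect

        class CombinedIndex:
            """Key prefix -> block position in RAM; one disk read; binary search in RAM."""

            def __init__(self, data_path, prefix_len=4):
                self.data_path = data_path
                self.prefix_len = prefix_len
                # Maps a key prefix to the (offset, length) of its sorted block on disk.
                # In a real index this table is built while the data file is written.
                self.blocks = {}

            def lookup(self, key):
                """key is a bytes string; returns the ID or None."""
                pos = self.blocks.get(key[:self.prefix_len])
                if pos is None:
                    return None
                offset, length = pos
                with open(self.data_path, "rb") as f:
                    f.seek(offset)
                    raw = f.read(length)              # the single disk access
                # Block lines look like b"key\tid" and are sorted by key.
                entries = [line.split(b"\t", 1) for line in raw.splitlines()]
                keys = [k for k, _ in entries]
                i = bisect.bisect_left(keys, key)
                if i < len(keys) and keys[i] == key:
                    return int(entries[i][1])
                return None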

    This is how I try to provide fast access to the data. Now the update problem has to be solved.
    Four tables take part in the daily update, each holding from half a billion to a billion records. In addition, about one and a half million new records arrive at the input and are sifted through these tables as through a sieve.
    Take one of the operations as an example. Given:
    - a dictionary [key, ID], about 700 million entries;
    - source data [key, data], about 70 million records.
    The keys are text strings. The keys in the second table must be replaced with IDs, and new keys from the second table must be added to the first. The keys in the second table are not unique. For a relational DBMS a single query would not be enough, plus there is the overhead of importing and indexing the second table (the data arrives as a text file). How long the whole process would take is hard to imagine; work with smaller volumes gave unsatisfactory results.
    Key-value systems do not save the situation either. Quite tasty benchmarks have been published for MemcacheDB, but judging by the data volumes used in those experiments, the results were achieved without going beyond RAM, which is impossible in my case (the first table alone is 7 gigabytes of data even zipped).
    To solve this problem I use a merge of two sorted streams with subsequent splitting of the result (the second table is merged into the first while the replacement results are written out). The algorithm is quite simple, described long ago by D. Knuth among others, and quite effective. In various variations I use this scheme in many other processing steps.
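    A simplified sketch of this merge, assuming both inputs are already sorted by key and stored as tab-separated text (the file layout, function name, and ID-assignment details are assumptions for the example):

        import itertools

        def merge_update(dict_path, src_path, new_dict_path, resolved_path, next_id):
            """Merge two key-sorted tab-separated streams:
               dict_path: "key<TAB>id" with unique keys,
               src_path:  "key<TAB>data" where keys may repeat.
               Writes the updated dictionary and the source rows with keys replaced by IDs."""
            def parse(line):
                key, value = line.rstrip("\n").split("\t", 1)
                return key, value

            with open(dict_path) as d, open(src_path) as s, \
                 open(new_dict_path, "w") as nd, open(resolved_path, "w") as out:
                dict_iter = map(parse, d)
                dkey, did = next(dict_iter, (None, None))
                # Repeating source keys are grouped so each distinct key is resolved once.
                for skey, group in itertools.groupby(map(parse, s), key=lambda kv: kv[0]):
                    # Dictionary keys smaller than the current source key pass straight through.
                    while dkey is not None and dkey < skey:
                        nd.write(f"{dkey}\t{did}\n")
                        dkey, did = next(dict_iter, (None, None))
                    if dkey == skey:
                        rid = did                    # existing key: reuse its ID
                        nd.write(f"{dkey}\t{did}\n")
                        dkey, did = next(dict_iter, (None, None))
                    else:
                        rid = next_id                # new key: assign a fresh ID
                        nd.write(f"{skey}\t{rid}\n")
                        next_id += 1
                    for _, data in group:            # replace the key in every source row
                        out.write(f"{rid}\t{data}\n")
                while dkey is not None:              # flush the rest of the dictionary
                    nd.write(f"{dkey}\t{did}\n")
                    dkey, did = next(dict_iter, (None, None))
            return next_id

    Everything here is read and written strictly sequentially, which is what makes a single pass over hundreds of millions of records cheap compared to random-access updates.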
    The operation described above takes about 50 minutes (a server with two 2.6 GHz Opteron 2218 CPUs, 16 GB of RAM, and four 73 GB SCSI disks).
    Thus, the daily update task is solved in an acceptable time. I cannot say it fully satisfies me, but squeezing out more efficiency would require more complex development work, for which there is no time. There are a few sketches on paper, but they need to be verified "in the metal" first, so for now I will leave it at that.
