A little bit about designing databases for a search engine

    A project like this is impossible without a database, or rather without several fundamentally different ones. So I will devote some time to this question.

    So, at a minimum, we will need a DB that serves ordinary “flat” (“2D”) data, i.e. one where an identifier ID maps to a single data field.
    Why do I consider a single data field enough? Because:
    • records are selected only by the ID field; the data itself is never searched. Specialized indexes exist for that, and with such volumes of information anything else makes little sense
    • any number of fields can be packed into one; for this I wrote a set of small application libraries “on my knee”. In particular, a CRC of the data is stored when packing so that, God forbid, corrupted data is never used (see the sketch after this list)
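    A minimal sketch of such a packing scheme (the exact format here, with length-prefixed fields and a trailing CRC-32, is illustrative rather than the real library code; the names crc32, pack, and unpack are mine):

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Plain CRC-32 (polynomial 0xEDB88320), computed bit by bit for brevity.
uint32_t crc32(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Pack several length-prefixed fields into one blob; a CRC-32 of the
// payload is appended at the end.
std::vector<uint8_t> pack(const std::vector<std::string>& fields) {
    std::vector<uint8_t> blob;
    for (const std::string& f : fields) {
        uint32_t n = static_cast<uint32_t>(f.size());
        blob.insert(blob.end(), reinterpret_cast<uint8_t*>(&n),
                    reinterpret_cast<uint8_t*>(&n) + sizeof n);
        blob.insert(blob.end(), f.begin(), f.end());
    }
    uint32_t crc = crc32(blob.data(), blob.size());
    blob.insert(blob.end(), reinterpret_cast<uint8_t*>(&crc),
                reinterpret_cast<uint8_t*>(&crc) + sizeof crc);
    return blob;
}

// Unpack after verifying the CRC; returns false on corrupted data.
bool unpack(const std::vector<uint8_t>& blob, std::vector<std::string>& out) {
    if (blob.size() < sizeof(uint32_t)) return false;
    const size_t payload = blob.size() - sizeof(uint32_t);
    uint32_t stored;
    std::memcpy(&stored, blob.data() + payload, sizeof stored);
    if (crc32(blob.data(), payload) != stored) return false;  // corrupted
    size_t pos = 0;
    while (pos + sizeof(uint32_t) <= payload) {
        uint32_t n;
        std::memcpy(&n, blob.data() + pos, sizeof n);
        pos += sizeof n;
        if (n > payload - pos) return false;
        out.emplace_back(reinterpret_cast<const char*>(blob.data() + pos), n);
        pos += n;
    }
    return pos == payload;
}
```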

    Unless you set yourself the goal of minimizing the number of lines of code that work with the data, or of a little extra convenience, almost any task can be reduced to one where these two points are sufficient. And given such high demands on optimality and speed, in my opinion this is entirely justified.

    The main operations on such database tables (a minimal interface sketch follows the list) are:
    • select a record by ID
    • read the entire table sequentially (into memory or into a hash)
    • update a record by ID
    • insert a new record at the end and get back its ID
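    In code this amounts to a very small interface. The sketch below is hypothetical (the real code is not published; FlatTable, Record, and the method names are mine), but it mirrors the four operations exactly:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

using Record = std::vector<uint8_t>;

// Hypothetical interface mirroring the four operations above.
class FlatTable {
public:
    virtual ~FlatTable() = default;

    // Select by ID: fetch the record stored under this identifier.
    virtual bool get(uint64_t id, Record& out) = 0;

    // Sequential scan: visit every record in on-disk order
    // (e.g. to load the whole table into memory or into a hash).
    virtual void scan(const std::function<void(uint64_t, const Record&)>& visit) = 0;

    // Update the record with this ID in place.
    virtual bool update(uint64_t id, const Record& rec) = 0;

    // Append a new record at the end and return its freshly assigned ID.
    virtual uint64_t append(const Record& rec) = 0;
};
```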


    I found it optimal to use “page-type” tables, where data is stored, written, and read from disk in pages. Each page holds a fixed number of records. Things are simplest when I know the record size in advance and it is fixed: then the table works even faster. But even when record sizes vary, nothing in the processing changes significantly. Updates and appends are performed within a page in memory, and the page is then written to disk. In the table file, pages are stored sequentially.
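    A sketch of such a page, assuming fixed-size records (the sizes here are illustrative; the article does not give the real ones). Since pages are stored back to back, page N lives at byte offset N * PAGE_SIZE, and finding a record's page and slot is pure arithmetic:

```cpp
#include <cstdint>
#include <cstdio>

constexpr size_t RECORD_SIZE      = 256;    // illustrative
constexpr size_t RECORDS_PER_PAGE = 1024;   // illustrative
constexpr size_t PAGE_SIZE        = RECORD_SIZE * RECORDS_PER_PAGE;

struct Page {
    uint8_t bytes[PAGE_SIZE];
    uint8_t* record(size_t slot) { return bytes + slot * RECORD_SIZE; }
};

// Record ID -> page number and slot within the page.
inline uint64_t page_of(uint64_t id) { return id / RECORDS_PER_PAGE; }
inline size_t   slot_of(uint64_t id) { return id % RECORDS_PER_PAGE; }

// Whole pages are read and written in one go. A real implementation would
// use 64-bit seeks (fseeko/_fseeki64); plain fseek keeps the sketch short.
bool read_page(std::FILE* f, uint64_t page_no, Page& out) {
    if (std::fseek(f, static_cast<long>(page_no * PAGE_SIZE), SEEK_SET) != 0)
        return false;
    return std::fread(out.bytes, 1, PAGE_SIZE, f) == PAGE_SIZE;
}

bool write_page(std::FILE* f, uint64_t page_no, const Page& p) {
    if (std::fseek(f, static_cast<long>(page_no * PAGE_SIZE), SEEK_SET) != 0)
        return false;
    return std::fwrite(p.bytes, 1, PAGE_SIZE, f) == PAGE_SIZE;
}
```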

    The question arises: how do you update records in the middle of the table when their size changes? After all, if the whole table is 10-20-200 GB, copying half of it to a temporary file and back would take hours. I pushed this question onto the file system by splitting the pages into blocks. One block is one file on disk, and the number of files per table is unlimited. Each file stores, sequentially, a limited, fixed number of pages (the sketch below shows how a record ID maps to a file, page, and slot). If I then need to update a record in the middle of the table, I only have to rewrite one file of a much smaller and, usually, bounded size. The responsibility for not asking the file system to do something stupid, such as writing first at the beginning, then at the end, then at the beginning again, I took upon myself. And simply so as not to strain the server, I always write in batches; the corresponding functionality is optimized as far as possible, and everything happens in memory. Naturally, the whole system of search engine modules is built around the fact that writing 1000 records at the end is faster than writing 1 at the beginning, so when writing near the beginning it is sometimes easier to make a copy of the table.
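    How a record ID maps to a concrete file on disk, under the same illustrative constants (PAGES_PER_FILE and the file-name pattern are my assumptions, not the original scheme):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

constexpr uint64_t RECORDS_PER_PAGE = 1024;   // illustrative
constexpr uint64_t PAGES_PER_FILE   = 4096;   // illustrative

struct Location {
    uint64_t file_no;        // which block file of the table
    uint64_t page_in_file;   // page index inside that file
    uint64_t slot;           // record index inside that page
};

// Pure arithmetic: record ID -> (file, page, slot). Updating a record in the
// middle of the table touches only the one block file it lands in.
Location locate(uint64_t id) {
    uint64_t page = id / RECORDS_PER_PAGE;
    return { page / PAGES_PER_FILE, page % PAGES_PER_FILE,
             id % RECORDS_PER_PAGE };
}

// Hypothetical naming scheme for the block files of one table.
std::string block_file_name(const std::string& table, uint64_t file_no) {
    char suffix[32];
    std::snprintf(suffix, sizeof suffix, ".%05llu.blk",
                  static_cast<unsigned long long>(file_no));
    return table + suffix;
}
```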

    OK, that settles the ordinary tables. The database described here works very well: in particular, it serves 35 GB of text fragments during search, with arbitrary record selection.

    But there are limitations too. Keeping a mapping in such a table from each word to the list of documents in which that word occurs (along with additional information) is practically impossible: every word appears in a great many documents, so the volume would be huge.

    So what operations must such a database support?
    • sequential reading from the beginning of the list for any desired word, continuing for as long as you like
    • strictly speaking, we would also need to modify a word's document list, but here we can pull a feint: only ever insert at the end of the database


    How do you update such an index? Obviously, if the index is empty and we insert document lists into it starting from the first word and ending with the last, we only ever write at the end of the file. Whether or not to physically store a separate block on disk for each word is the developer's business; in either case you can simply remember where each block ends and how long it is, and save that in the simplest of lists. The sequential read procedure is then as follows: seek in the file to the beginning of the list for the desired word, then read sequentially until the list for the next word begins. One seek plus the minimum necessary number of reads: victory (here I deliberately do not consider the operations of the file system itself; their optimization can be dealt with separately). A lookup sketch follows below.
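    A sketch of that lookup, assuming the simplest possible offset list: one 64-bit start offset per word, with the next word's offset marking the end. DocEntry holding just a document ID is my simplification, since the real entries carry additional information:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct DocEntry { uint64_t doc_id; };  // the real entry carries more fields

// offsets has one entry per word plus a final end-of-file sentinel, so the
// list for word i occupies bytes [offsets[i], offsets[i + 1]).
std::vector<DocEntry> read_postings(std::FILE* index,
                                    const std::vector<uint64_t>& offsets,
                                    size_t word_id) {
    const uint64_t begin = offsets[word_id];
    const uint64_t end   = offsets[word_id + 1];
    std::vector<DocEntry> out((end - begin) / sizeof(DocEntry));
    // The single seek (64-bit seek omitted for brevity), then sequential reads.
    std::fseek(index, static_cast<long>(begin), SEEK_SET);
    std::fread(out.data(), sizeof(DocEntry), out.size(), index);
    return out;
}
```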

    Now, obviously, when we want to add information about newly indexed pages to the index, we can store the new information separately, open the current index for reading and a new one for writing, then read and write sequentially, merging in the information to be added, starting with the first word and ending with the last. The question of in what order to add documents to the lists I will consider a little later, when I talk about constructing the index. A merge sketch follows below.
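    The merge itself is the classic two-cursor walk over word IDs in ascending order. Below it is modeled on in-memory maps for brevity; in reality both sides would be streamed from disk word by word, and the output written strictly sequentially:

```cpp
#include <cstdint>
#include <map>
#include <vector>

using Postings = std::vector<uint64_t>;         // document IDs for one word
using Index    = std::map<uint64_t, Postings>;  // word ID -> postings, sorted

// Merge the delta (newly indexed pages) into the old index. Each word is
// visited once, in order, so the new index file is written end to end.
Index merge(const Index& old_index, const Index& delta) {
    Index merged;
    auto a = old_index.begin();
    auto b = delta.begin();
    while (a != old_index.end() || b != delta.end()) {
        if (b == delta.end() ||
            (a != old_index.end() && a->first < b->first)) {
            merged[a->first] = a->second;   // only in the old index: copy
            ++a;
        } else if (a == old_index.end() || b->first < a->first) {
            merged[b->first] = b->second;   // brand-new word: copy
            ++b;
        } else {                            // present in both: old docs
            Postings both = a->second;      // first, new ones appended
            both.insert(both.end(), b->second.begin(), b->second.end());
            merged[a->first] = std::move(both);
            ++a; ++b;
        }
    }
    return merged;
}
```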

    The full table of contents and the list of my articles about the search engine will be kept up to date here: http://habrahabr.ru/blogs/search_engines/123671/
