
Searching fast, even faster
I came across an interesting question in the Q&A section. My answer to it prompted this article, a more complete answer to the question "how do you organize a search over many parameters, as on Yandex.Market, for example?"
I know that on Habr, and elsewhere, there are plenty of noSQL advocates (I am not without sin myself), but I still prefer to think first and only then choose a solution.
So, what are we GIVEN?
- We have 120 checkboxes, each a 1/0 option
- We have 30 "radio" controls with a "yes / no / doesn't matter" choice
- We have 2-3 sliders to set a price range or some size or other
- We have the most important thing: 12 million records in the database.
- We have Select * From tovar Where (wifi = true) and (led = false) and (type = 3) and .... other parameters ...; with a runtime close to customer hysteria.
The hysteria came from the realization that more than 100 requests per second had to be handled, and for that we would have to sell a three-ruble note with a view of the Kremlin and buy more hardware.
So we start thinking about how to save a pile of money and pocket part of it as a bonus.
Let me note right away that the goal is not to get the final list of rows from the database IMMEDIATELY. We need a prefilter to speed up the search and filtering.
Checkboxes
To begin with, we take all our checkboxes, each of which has only 2 values, convert them to bits, and pack them into one "number". In our case this number is 120 bits wide (128 bits for round counting). This will be our first "index": with it we can always filter down to the products that match the conditions. In practice, an average query leaves no more than 10 thousand such products across the entire base, so the remaining parameters only need to be examined in those 10 thousand products instead of shoveling through the whole base.
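As a minimal sketch of this packing (Python, with made-up data and 8 checkboxes instead of 120 so the numbers stay readable):

```python
# Pack the yes/no checkbox values of a product into a single integer.
def pack_checkboxes(flags):
    """flags: a sequence of 0/1 values, one per checkbox."""
    bits = 0
    for i, flag in enumerate(flags):
        if flag:
            bits |= 1 << i  # bit i set means "checkbox i is checked"
    return bits

# Hypothetical catalog: each product is reduced to one packed number.
products = {
    "router-1": pack_checkboxes([1, 0, 1, 1, 0, 0, 0, 1]),
    "router-2": pack_checkboxes([1, 0, 0, 1, 0, 0, 0, 1]),
}

# The prefilter becomes a single integer comparison per product
# instead of a WHERE clause with 120 boolean conditions.
wanted = pack_checkboxes([1, 0, 1, 1, 0, 0, 0, 1])
print([name for name, bits in products.items() if bits == wanted])
# -> ['router-1']
```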
Ranges
Any range (height, weight, waist, bust) can be divided into intervals and numbered. Divide it into 256 parts; the figure is picked out of thin air. Either way, a range gives us 2 numbers. If we take $1 - $100 as the first interval, with a maximum of $25,500, then our range of $100 - $1200 can be written as 2 | 12 - in fact, a 16-bit number. We get one such number for each range. If there are only a few ranges, we can stop here; if there are 100-10000 of them, things get harder: we cannot search a 200-20000 bit key adequately fast. And we won't. We simply compute its md5 and, to speed things up, again reduce everything to a 128-bit number (some pathetic 16 bytes). Combined with the many "checkboxes" this gives an even more precise selection.
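A sketch of that bucketing and the md5 trick (the numbering convention is my assumption, chosen so that $100 - $1200 comes out as 2 | 12 as in the text):

```python
import hashlib

STEP = 100  # bucket width; 255 buckets of $100 cover up to $25,500

def pack_range(low, high):
    """Pack a price range into two 8-bit bucket numbers (16 bits total).
    The range is rounded inward to whole buckets, so $100 - $1200
    becomes buckets 2 and 12."""
    lo_bucket = low // STEP + 1   # first bucket fully above `low`
    hi_bucket = high // STEP      # last bucket fully below `high`
    return (lo_bucket << 8) | hi_bucket

print(hex(pack_range(100, 1200)))  # 0x20c: buckets 2 | 12

# With dozens of ranges the combined key grows to hundreds of bits,
# too long to compare quickly, so collapse it to 128 bits with md5.
def range_key(packed_ranges):
    raw = b"".join(r.to_bytes(2, "big") for r in packed_ranges)
    return hashlib.md5(raw).digest()  # 16 bytes = 128 bits

print(range_key([pack_range(100, 1200), pack_range(0, 500)]).hex())
```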
Yes / no / not important
The algorithm for these values, with the additional "not important" option, is almost the same as for the checkboxes, except that we write 1 into the index only for "yes" (and 0 when "not important" is chosen). Into the second index we put 0 for the "not important" option and 1 for anything else.
In total we get 2 indexes:
index 1: 01010000001010101010101₂
index 2: 01111101110111101110111₂
The homework: using Xor | And | Or, build a query that zeroes out all the insignificant bits according to index 2 and compares the result with index 1.
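For those who don't feel like doing the homework, one possible combination (a sketch; the sample record is made up):

```python
# index1 holds the desired values (1 = "yes"), index2 the significance
# mask (0 = "not important"). A record matches when it agrees with
# index1 on every significant bit.
def matches(record_bits, index1, index2):
    return (record_bits ^ index1) & index2 == 0  # Xor the values, And the mask

index1 = 0b01010000001010101010101
index2 = 0b01111101110111101110111
record = 0b01010010001010111010101  # differs from index1 only on "not important" bits
print(matches(record, index1, index2))  # True
```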
So, out of 12 million records we have found 0-100-500 that match the conditions. Only these records now need to be checked in full (the 120-bit checkbox field, the ranges, and the many Yes / No / Not Important values). You will agree that a full check over a couple of hundred rows is much faster than a full-blown search.
...
Profit!!!
Converting a long key into a short one, for example by hashing, is quite convenient for filtering data. And if we are working with SQL, the prefiltered selection can be stored in a MemTable for speed, with the full filtering then done over that data.
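Roughly how that could look (a sketch in Python, with sqlite3's in-memory database standing in for the MemTable; the ids, columns, and values are all made up):

```python
import sqlite3

# Suppose the bitwise prefilter left us ids 17, 42 and 108 out of 12 million.
mem = sqlite3.connect(":memory:")  # stands in for an in-memory SQL table
mem.execute("CREATE TABLE candidates (id INTEGER PRIMARY KEY,"
            " wifi INT, led INT, price INT)")
# In real life these rows would be copied from the main `tovar` table.
mem.executemany("INSERT INTO candidates VALUES (?, ?, ?, ?)",
                [(17, 1, 0, 900), (42, 1, 1, 300), (108, 1, 0, 1500)])

# The full, exact filter now runs over a few hundred rows, not 12 million.
rows = mem.execute("SELECT id FROM candidates WHERE wifi = 1 AND led = 0"
                   " AND price BETWEEN 100 AND 1200").fetchall()
print(rows)  # [(17,)]
```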
We use a similar algorithm in our shingle-based text filter, where we take 32 bits from the checksum, and those are faster to compare than 128 bits.
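For illustration, such a 32-bit key could be taken from the md5 checksum like this (keeping the first 4 bytes is my assumption):

```python
import hashlib

def shingle_key(shingle):
    """A 32-bit comparison key for a word shingle: 4 of the 16 md5
    bytes are kept, and a 32-bit compare is cheaper than a 128-bit one."""
    digest = hashlib.md5(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

words = "searching fast even faster than before".split()
shingles = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
print([hex(shingle_key(s)) for s in shingles])
```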
A similar technique also works in one specialized document storage system, which makes it possible to find, within microseconds, links to the documents containing a search word or phrase in a 2.2 GB text base. But more about that in another article.