I am writing a search engine (virtual project). Part 1. First bricks

    If reinventing the wheel does not interest you, please just skip this post rather than sniping at it.
    Anyone with something substantive to say is always welcome.
    Here I will go over the main issues I need to settle in order to scale the system.

    To scale successfully, the system, data included, must be divided into elementary "bricks", and the division should be as simple as possible: the simpler it is, the less confusion later. A good partition can also pay off in solving other problems. This has happened in my own practice, when a fine-grained data structure made it possible to solve new problems that nobody had thought of at the start of the project. To be honest, though, that careful design grew out of plain laziness: it was built so that one brick could easily be swapped for another without the rest of the structure noticing.
    End of the lyrical digression.
    In my opinion, the main unit of information in search should be the site. Presumably the large search engines already work this way; if not, I feel sorry for them. It is scary to imagine that a search within a section of the Yandex catalog is not a search over a group of sites but a filter applied to the global search results. Or that Google, when it filters results for China, does so not by switching off the (un)wanted sites but by thinning out the results. Then again, I would not be surprised if it simply builds a separate “Chinese” index.
    So, what do we gain by storing the index, and being able to access it, on a per-site basis?
    1. The ability to offer search as a service to individual sites. Large search engines can restrict a search to a single site, but for some reason sites do not use this and prefer to install local search engines. At the very least, this market (local search engines) exists, and it can be used to mutual benefit, as a platform for testing and breaking in functionality.
    2. The ability to search a group of sites, as in the Yandex catalog. The idea is not new, but it is unlikely ever to become irrelevant.
    3. The ability to exclude unwanted sites from the search. For example, a “family search” that children can use. Few parents would want their children to see porn sites in the results, even by accident.
    In other words, organizing the index site by site gives ample room for inclusive and exclusive filtering (including or excluding individual sites or whole groups).
    4. This thought is perhaps the most seditious: no backup needed! Instead of restoring from a backup, you rebuild the index from scratch. That takes longer than a restore, but it cuts hardware costs, since you do not have to keep what is effectively a second copy of the index. While you work with a single site, keeping a backup is not much of a burden, but as volumes grow, the problem of storing and maintaining backups grows at a similar pace.
    I am not going to give up backups entirely, but I will make them only for the critical parts: key reference tables and indexes. First, this data is much smaller in volume, and second, losing it would be a real disaster.
    5. Mobility. Moving part of the index to another server is quick and painless, which greatly simplifies upgrading the server fleet. That matters if we intend to develop the project over the long term.
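
    As a rough illustration of points 1-3 and 5, here is a minimal sketch of an index kept as one brick per site. Everything in it is an assumption made for the example: the names `SiteIndex`, `SearchEngine`, `drop_site` and the `include`/`exclude` parameters are hypothetical, and tokenization and ranking are reduced to the bare minimum. The point it tries to show is that filtering selects which bricks are consulted, rather than thinning a global result list.

```python
from collections import defaultdict


class SiteIndex:
    """One 'brick': an inverted index covering a single site (hypothetical)."""

    def __init__(self, site):
        self.site = site
        self.postings = defaultdict(set)  # term -> set of URLs on this site

    def add_page(self, url, text):
        # Naive whitespace tokenization, enough for the illustration.
        for term in text.lower().split():
            self.postings[term].add(url)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), ()))


class SearchEngine:
    """Holds independent per-site bricks; filtering picks bricks, not results."""

    def __init__(self):
        self.bricks = {}  # site name -> SiteIndex

    def add_page(self, site, url, text):
        self.bricks.setdefault(site, SiteIndex(site)).add_page(url, text)

    def drop_site(self, site):
        # Removing (or handing over) a site touches only its own brick.
        return self.bricks.pop(site, None)

    def search(self, term, include=None, exclude=()):
        # include=None means "all sites"; otherwise only the listed ones.
        sites = self.bricks.keys() if include is None else include
        hits = []
        for site in sorted(sites):
            if site in exclude or site not in self.bricks:
                continue
            hits.extend(self.bricks[site].search(term))
        return hits


engine = SearchEngine()
engine.add_page("a.example", "a.example/1", "kids cartoons")
engine.add_page("b.example", "b.example/1", "adult cartoons")
single_site = engine.search("cartoons", include=["a.example"])  # point 1
family = engine.search("cartoons", exclude={"b.example"})       # point 3
```

    In this sketch a group of sites (point 2) is just a longer `include` list, and moving a site to another server (point 5) amounts to calling `drop_site` here and re-registering the returned brick there.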

    How many such index bricks to place on a single server depends on the available resources, and that is the next topic.

    PS. I am not considering what to do if a site's index is too large for one server.
    First, there are not many such sites, and the question can be dealt with when we get to them.
    Second, the problem can be solved in parallel, without interfering with the operation of the main system and without having to rework it.

    The case where a site wants not only a site-wide search but also the ability to restrict the search to one or more of its sections fits the proposed structure without trouble. The scheme site - group_of_sites - group_of_groups - ... - everything is simply replaced by the scheme section - group_of_sections - group_of_groups - ... - site.
    Both go by the same name: a hierarchical structure. The main question is which basic limits to build in. How many nesting levels are allowed, and how many child nodes can a single section have? Having no limits gives flexibility but costs speed. Rigid limits let you work with fixed-length lists, which speeds things up. The main task is to choose limits that suit most cases. Thanks to Infanty for developing the idea.
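
    To make the trade-off concrete, here is a small sketch of a hierarchy with rigid limits. The limits `MAX_DEPTH` and `MAX_CHILDREN`, their values, and the `Node` class are all assumptions invented for the illustration, not figures proposed above. Each node keeps its children in a fixed-length slot list, and the leaves under a node form the scope of a search restricted to that node.

```python
MAX_DEPTH = 4      # assumed limit on nesting levels (root is level 0)
MAX_CHILDREN = 8   # assumed limit on child nodes per section


class Node:
    """A section or group; leaves are the searchable bricks (hypothetical)."""

    def __init__(self, name, depth=0):
        self.name = name
        self.depth = depth
        self.children = [None] * MAX_CHILDREN  # fixed-length slot list
        self.count = 0                         # number of slots in use

    def add_child(self, name):
        if self.depth + 1 >= MAX_DEPTH:
            raise ValueError("nesting limit reached")
        if self.count >= MAX_CHILDREN:
            raise ValueError("child limit reached")
        child = Node(name, self.depth + 1)
        self.children[self.count] = child
        self.count += 1
        return child

    def leaves(self):
        # All bricks under this node: the scope of a search limited to it.
        if self.count == 0:
            return [self.name]
        found = []
        for child in self.children[:self.count]:
            found.extend(child.leaves())
        return found


site = Node("site")
docs = site.add_child("docs")
site.add_child("blog")
docs.add_child("api")
docs.add_child("guides")
scope = docs.leaves()  # bricks consulted when the search is limited to "docs"
```

    With fixed limits like these, a node's position can be encoded in a fixed number of small integers, which is where the speed-up would come from; the price is that a section needing a ninth child or a fifth level has to be restructured.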
