Load distribution when parsing websites and connecting additional cloud resources

This post is about a library that registers nodes with itself and redirects incoming outside requests to a specific node.

Where did the idea for this project come from?

When I needed to parse websites in large quantities, I first tried Selenium Grid and then switched to Selenoid. Selenoid worked, but it came with a lot that I did not need, such as browser versions and options, and, most importantly, it has no auto scaling (to be fair, that is not what Selenoid is for). The cluster sits idle 90% of the time, and then a large load arrives that the server cannot cope with. The result is a lot of money spent on hardware that is idle almost all the time and still cannot handle the peaks. I thought it would be great if the number of running browsers grew as the load arrived and shrank again once the load was gone. Fortunately, this can be implemented, for example, with AWS EC2.

A little about the structure

  • Hub.

    The hub runs wherever suits you; only one instance is needed.
    When creating the Docker container for the hub, pass it the token environment variable.

    After that, it starts waiting for incoming connections from nodes and from users.
    The hub remembers routes. After exactly one minute of inactivity it deletes the route and releases the node for another client.
  • Node.

    The node can be used as a base container for auto scaling systems. For example, based on the average load on the container pool, you can add another container, or, in extreme cases, start a virtual server that launches this container at boot, provided that you pay only for the time the server is actually in use.

    When creating the Docker container for a node, pass it the token and server environment variables. server is the IP of the hub.
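
As a rough illustration of the configuration step, here is a minimal sketch of how a node could read its settings in Go. The variable names token and server come from the post; the Config type and the loadConfig function are hypothetical and are not taken from the project itself.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Config holds the settings a node needs (hypothetical type for illustration).
type Config struct {
	Token  string // shared secret, must match the token given to the hub
	Server string // IP address of the hub
}

// loadConfig reads the settings through getenv, so it can be exercised
// without touching the real process environment.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{Token: getenv("token"), Server: getenv("server")}
	if cfg.Token == "" {
		return Config{}, errors.New("environment variable token is not set")
	}
	if cfg.Server == "" {
		return Config{}, errors.New("environment variable server is not set")
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("connecting to hub at", cfg.Server)
}
```

Failing early on a missing variable keeps a misconfigured container from joining the cluster half-configured.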

Option number 1. Request from a node

The node makes a request to the hub with the token header set to the token from its environment variable. The hub checks the token from the request against its own and, if they match, remembers the node. The hub then pings this node every 4 seconds. If 5 ping attempts fail, the node is removed as having lost its connection. The node, in turn, sends its own ping to the hub once every 10 seconds in case the connection with the hub was lost. This way the cluster restores its state by itself after a broken connection.
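
The registration and ping bookkeeping described here could look roughly like the sketch below. The 4-second interval and the limit of 5 failed pings are from the post; the Registry type, its methods, and the choice to count consecutive failures (resetting on a successful ping) are assumptions made for illustration, not the project's actual code.

```go
package main

import (
	"fmt"
	"time"
)

const (
	pingInterval = 4 * time.Second // how often the hub pings a node (from the post)
	maxFailures  = 5               // failed pings before a node is dropped (from the post)
)

// Registry tracks registered nodes and their failed-ping counters.
// Hypothetical type; the real hub may store more per node.
type Registry struct {
	token    string
	failures map[string]int // node address -> consecutive failed pings
}

func NewRegistry(token string) *Registry {
	return &Registry{token: token, failures: make(map[string]int)}
}

// Register remembers a node only if its token matches the hub's token.
func (r *Registry) Register(addr, token string) bool {
	if token != r.token {
		return false
	}
	r.failures[addr] = 0
	return true
}

// RecordPing stores the result of one ping and removes the node once
// maxFailures is reached. It reports whether the node is still registered.
func (r *Registry) RecordPing(addr string, ok bool) bool {
	if _, known := r.failures[addr]; !known {
		return false
	}
	if ok {
		r.failures[addr] = 0 // assumption: a successful ping resets the counter
		return true
	}
	r.failures[addr]++
	if r.failures[addr] >= maxFailures {
		delete(r.failures, addr)
		return false
	}
	return true
}

func main() {
	reg := NewRegistry("secret")
	fmt.Println("registered:", reg.Register("10.0.0.2:4444", "secret"))
	fmt.Println("ping interval:", pingInterval)
	for i := 0; i < maxFailures; i++ {
		fmt.Println("still registered:", reg.RecordPing("10.0.0.2:4444", false))
	}
}
```

In a running hub, RecordPing would be called from a loop driven by a ticker with pingInterval; the sketch leaves the network part out so that the counting logic stays visible.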

Option number 2. Request from user

The user makes a request to the hub with the token and number headers set. The token is needed so that only trusted clients can use the cluster, and number lets us create different sessions from the same client IP. Each session has its own unique number.

For each request, the hub checks whether a route already exists. If it does, the request is simply redirected to the corresponding node. If it does not, the user's request joins a queue and waits for a node to be released. As soon as one of the nodes is released, the hub creates a route between the user session and the freed node. From then on, all requests for this session go to that specific node.
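
The routing logic in this paragraph can be sketched as a route table, a pool of free nodes, and a first-in-first-out queue of waiting sessions. Everything named here (Hub, Route, Release) is hypothetical and simplified: a real hub would block or hold the HTTP request while it waits, and would need locking for concurrent access.

```go
package main

import "fmt"

// Hub keeps session-to-node routes, the free nodes, and the waiting sessions.
type Hub struct {
	routes  map[string]string // session -> node
	free    []string          // nodes without a route
	waiting []string          // sessions waiting for a node, oldest first
}

func NewHub(nodes ...string) *Hub {
	return &Hub{routes: make(map[string]string), free: append([]string(nil), nodes...)}
}

// Route returns the node for a session. If there is no route yet and no
// free node, the session is queued and the second result is false.
func (h *Hub) Route(session string) (string, bool) {
	if node, found := h.routes[session]; found {
		return node, true
	}
	if len(h.free) == 0 {
		for _, s := range h.waiting {
			if s == session {
				return "", false // already in the queue
			}
		}
		h.waiting = append(h.waiting, session)
		return "", false
	}
	node := h.free[0]
	h.free = h.free[1:]
	h.routes[session] = node
	return node, true
}

// Release removes the session's route and gives the freed node to the
// oldest waiting session, or returns it to the free pool.
func (h *Hub) Release(session string) {
	node, found := h.routes[session]
	if !found {
		return
	}
	delete(h.routes, session)
	if len(h.waiting) > 0 {
		next := h.waiting[0]
		h.waiting = h.waiting[1:]
		h.routes[next] = node
		return
	}
	h.free = append(h.free, node)
}

func main() {
	hub := NewHub("node-a")
	fmt.Println(hub.Route("session-1"))
	fmt.Println(hub.Route("session-2"))
	hub.Release("session-1")
	fmt.Println(hub.Route("session-2"))
}
```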

One minute after the user closes the connection, the node is released and handed over to another user request.

Link to the project repository


The post turned out more like a user manual, but I still believe this project can be useful.

P.S. Some clarifications

This is the first project I have written in Go, so if anyone has suggestions or remarks, please leave them in the comments (I am not even counting on a PR, but that would be super cool!)
