Testing at Yandex: how to build a fault-tolerant grid out of thousands of browsers

    Anyone involved in testing web applications knows that most routine actions on a service can be automated with the Selenium framework. At Yandex, millions of autotests that use Selenium to drive browsers run every day, so we need thousands of different browsers available simultaneously, 24/7. And this is where the fun begins: Selenium with a large number of browsers has many scaling and fault-tolerance problems. After several attempts we arrived at an elegant, easy-to-maintain solution, and we want to share it with you. Our gridrouter project lets you organize a fault-tolerant Selenium grid out of any number of browsers. The code is open source and available on GitHub. Below, I'll describe which of Selenium's drawbacks we ran into, how we came to our solution, and explain how to configure it.

    Problem


    Selenium's architecture has changed dramatically more than once since its inception; the current one, called Selenium Grid, works like this.

    The cluster consists of two applications: a hub and nodes. The hub is an API layer that accepts user requests and dispatches them to nodes. A node is a request executor that launches browsers and performs test steps in them. In theory, an unlimited number of nodes can be connected to one hub, and each node can launch any of the supported browsers. But what happens in practice?

    • There is a single point of failure. The hub is the only access point to the browsers. If for any reason the hub process stops responding, all browsers become inaccessible. Naturally, the service also stops working if the data center hosting the hub has a network or power failure.
    • Selenium Grid does not scale well. Our long experience running Selenium on various hardware shows that under load one hub can handle no more than a few dozen connected nodes. If you keep adding nodes, at peak load the hub may stop responding over the network or process requests too slowly.
    • There are no quotas. You cannot create users or specify which browser versions each user may use.

    Solution


    To stop depending on a single hub, you can run several. But the usual Selenium client libraries are designed to work with only one hub, so you have to teach them to work with a distributed system.

    Client balancing


    Initially, we solved the problem of working with several hubs with a small library that was plugged into the test code and performed balancing on the client side.

    Here's how it works:

    1. Information about hosts with hubs and browser versions available on them is stored in the configuration file.
    2. The user plugs the library into their tests and requests a browser.
    3. A host is randomly selected from the list and an attempt is made to get a browser.
    4. If the attempt is successful, then the browser is given to the user and tests begin.
    5. If a browser could not be obtained, the next host is picked at random, and so on. Since different hubs may have different numbers of available browsers, hubs in the configuration file can be assigned different weights, and the random selection takes these weights into account. This approach gives a uniform load distribution.
    6. The user gets an error only if a browser could not be obtained from any of the hubs.
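    The algorithm above can be sketched roughly as follows. This is an illustrative sketch, not the actual library: the class and method names are made up, and the session-creation call is stubbed out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of client-side balancing: a hub is picked at random with
// probability proportional to its weight; a failed hub is excluded
// and the next one is tried until the list is exhausted.
public class HubBalancer {

    public static class Hub {
        final String url;
        final int weight; // roughly, the number of browsers behind this hub
        public Hub(String url, int weight) { this.url = url; this.weight = weight; }
    }

    private final Random random = new Random();

    /** Picks a hub at random, taking weights into account. */
    public Hub pick(List<Hub> hubs) {
        int total = 0;
        for (Hub h : hubs) total += h.weight;
        int r = random.nextInt(total);
        for (Hub h : hubs) {
            r -= h.weight;
            if (r < 0) return h;
        }
        throw new IllegalStateException("unreachable");
    }

    /** Tries hubs one by one until a browser is obtained or the list is empty. */
    public String obtainBrowser(List<Hub> hubs) {
        List<Hub> remaining = new ArrayList<>(hubs);
        while (!remaining.isEmpty()) {
            Hub hub = pick(remaining);
            String session = tryCreateSession(hub);
            if (session != null) return session;
            remaining.remove(hub); // this hub failed, exclude it and retry
        }
        throw new RuntimeException("No browser available on any hub");
    }

    // Stub: a real implementation would POST to /wd/hub/session on the hub.
    protected String tryCreateSession(Hub hub) {
        return "session-on-" + hub.url;
    }

    public static void main(String[] args) {
        HubBalancer balancer = new HubBalancer();
        List<Hub> hubs = List.of(new Hub("http://hub1:4444", 5), new Hub("http://hub2:4444", 10));
        System.out.println(balancer.obtainBrowser(hubs));
    }
}
```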

    Implementing such an algorithm is simple, but it requires integration with every Selenium client library. Suppose your tests obtain a browser with this code:

    WebDriver driver = new RemoteWebDriver(SELENIUM_SERVER_URL, capabilities);

    Here RemoteWebDriver is the standard class for working with Selenium from Java. To work in our infrastructure, it has to be wrapped in our own code that picks a hub:

    WebDriver driver = SeleniumHubFinder.find(capabilities);

    The test code no longer contains the Selenium server URL; it lives in the library configuration. This also means that the test code is now tied to SeleniumHubFinder and will not run without it. Moreover, if you have tests not only in Java but in other languages as well, you will have to write a client-side balancer for each of them, which can be expensive. It is much easier to move the balancing code to a server and put the server's address in the test code.

    Server balancing


    When designing the server, we set down the following natural requirements:

    1. The server must implement the Selenium API (the JSON Wire Protocol) so that tests can work with it as with a regular Selenium hub.
    2. Any number of server instances can be deployed in any data centers, balanced by a hardware or software load balancer (SLB).
    3. The server instances are completely independent of each other and store no shared state.
    4. The server provides quotas out of the box, that is, independent operation for several users.



    The resulting architecture looks like this:

    • A load balancer (SLB) distributes user requests across N server instances listening on the standard port (4444).
    • Each instance stores, as configuration, information about all available Selenium hubs.
    • When a browser request arrives, the server applies the balancing algorithm described in the previous section and obtains a browser.
    • In standard Selenium, each running browser gets a unique identifier called the session ID. The client sends this value to the hub with every request. When a browser is obtained, the server replaces its session ID with a new one that additionally encodes the hub where the session was created, and returns the session with this extended ID to the client.
    • On subsequent requests, the server extracts the hub's host address from the session ID and proxies the request to that host. Since all the information the server needs is contained in the request itself, the instances never have to synchronize state: each of them works independently.
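    A minimal sketch of such a self-describing session ID (the actual gridrouter wire format differs; the names here are illustrative): the hub's address is hex-encoded and prepended to the real session ID, so any instance can route a request without shared state.

```java
import java.nio.charset.StandardCharsets;

// Sketch: encode the hub address into the session ID on the way out,
// and recover it from incoming requests to know where to proxy.
public class SessionIdCodec {

    /** Prepends the hub address (hex-encoded) to the session ID returned by that hub. */
    public static String encode(String hubHost, String sessionId) {
        return toHex(hubHost) + "-" + sessionId;
    }

    /** Recovers the hub address from an extended session ID. */
    public static String decodeHost(String extendedId) {
        return fromHex(extendedId.substring(0, extendedId.indexOf('-')));
    }

    /** Recovers the original session ID to send to the hub. */
    public static String decodeSession(String extendedId) {
        return extendedId.substring(extendedId.indexOf('-') + 1);
    }

    private static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    private static String fromHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++)
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ext = encode("hub1.example.com:4444", "abc123");
        System.out.println(decodeHost(ext));    // hub1.example.com:4444
        System.out.println(decodeSession(ext)); // abc123
    }
}
```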

    Gridrouter


    We called the server gridrouter and decided to share its code with everyone. The server is written in Java using the Spring Framework; the source code of the project is available on GitHub. We have also prepared Debian packages that install the server.

    At the moment, gridrouter runs in production and serves different Yandex teams. The total number of browsers available in this grid is more than three thousand, and at peak load we serve approximately the same number of user sessions.

    How to configure gridrouter


    To configure gridrouter, you need to specify a list of users and a quota for each user. We did not aim for super-secure authentication with hash functions and salts, so we use plain HTTP basic authentication and store logins and passwords in clear text in the file /etc/grid-router/users.properties, of the form:

    user:password, user
    user2:password2, user
    

    Each line contains a username and a password separated by a colon, followed by a role, which for now is always user. Quotas are just as simple: each quota is a separate XML file /etc/grid-router/quota/<login>.xml, where <login> is the username. Inside, the file looks like this:


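    An illustrative quota file (host names, versions, and counts are placeholders; the namespace follows the open-source gridrouter project):

```xml
<qa:browsers xmlns:qa="urn:config.gridrouter.qatools.ru">
    <browser name="firefox" defaultVersion="33.0">
        <version number="33.0">
            <region name="dc-one">
                <host name="hub1.example.com" port="4444" count="5"/>
            </region>
            <region name="dc-two">
                <host name="hub2.example.com" port="4444" count="5"/>
            </region>
        </version>
    </browser>
</qa:browsers>
```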
    As you can see, the file defines the names and versions of the available browsers, which must exactly match those configured on the hubs. For each browser version, one or several regions are defined, that is, different data centers; for each region the host, port, and number of available browsers (the weight) are recorded. The region name can be arbitrary. Region information matters when one of the data centers becomes unavailable: in that case, after one failed attempt to get a browser in some region, gridrouter tries to get a browser from another region. This algorithm significantly increases the chances of obtaining a browser quickly.

    How to run tests


    Although we mainly write in Java, we have tested our server with Selenium tests in other programming languages. Usually the hub URL in tests looks something like this:

    http://example.com:4444/wd/hub
    

    Since we use HTTP basic authentication, URLs like the following should be used when working with gridrouter:

    http://username:password@example.com:4444/wd/hub
    
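    Under the hood, client libraries turn the userinfo part of such a URL into a standard HTTP Basic Authorization header. A minimal illustration with placeholder credentials (this is the mechanism, not gridrouter code):

```java
import java.net.URL;
import java.util.Base64;

// Demonstrates how the user:password part of the grid URL becomes
// an HTTP Basic Authorization header on each request.
public class AuthUrl {

    /** Builds the Authorization header value from the URL's userinfo part. */
    public static String basicAuthHeader(URL url) {
        String userInfo = url.getUserInfo(); // e.g. "username:password"
        return "Basic " + Base64.getEncoder().encodeToString(userInfo.getBytes());
    }

    public static void main(String[] args) throws Exception {
        URL grid = new URL("http://username:password@example.com:4444/wd/hub");
        System.out.println(basicAuthHeader(grid));
    }
}
```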

    If you run into any configuration problems, contact us by filing an issue on GitHub.

    Recommendations for setting up hubs and nodes


    We experimented with different configurations of hubs and nodes and concluded that, in terms of ease of operation, reliability, and ease of scaling, the following approach is the most practical. People usually install a single hub with many nodes connected to it, because there has to be a single entry point.



    With gridrouter you can run as many hubs as you like, so the easiest way is to configure, on each machine, one hub and several nodes connected to it at localhost:4444. This is especially convenient if everything is deployed on virtual machines. For example, we found that for a virtual machine with two vCPUs and 4 GB of memory, a combination of one hub and five nodes is optimal. We install only one browser version per virtual machine, because that makes it easy to measure memory consumption and translate the number of virtual machines into the number of available browsers.
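    With the Selenium server generation described here, such a layout can be started roughly like this (the jar file name is a placeholder for whatever standalone server version you use):

```shell
# One hub per machine, listening on the standard port
java -jar selenium-server-standalone.jar -role hub -port 4444 &

# Five nodes on the same machine, each registering with the local hub
for port in 5555 5556 5557 5558 5559; do
  java -jar selenium-server-standalone.jar -role node \
       -hub http://localhost:4444/grid/register -port $port &
done
```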
