Insect Life, or How We Catch Bugs in Antivirus Database Updates

    Unfortunately, everyone makes mistakes, and Kaspersky Anti-Virus is no exception. We have had “bugs” in updates, some of which caused users real trouble. We carefully investigate every such case, draw conclusions, and refine our testing technologies.

    So how are anti-virus updates actually tested?

    For obvious reasons, the technological details of testing are usually a closely guarded secret in the antivirus industry. Try searching the Internet - you will find almost no useful information on the subject.
    On the other hand, testing updates is a very interesting topic worthy of the reader’s attention. And we have something to share here.
    In the late 90s, Kaspersky Lab was one of the first in the industry to automate the process and has been constantly developing it for about 15 years.

    Testing anti-virus database updates

    First, some interesting numbers:
    • We release over 120 types of updates for more than 90 different product versions/builds;
    • 79 public sets: anti-virus databases for different platforms, anti-spam databases, anti-rootkit databases, anti-phishing databases, parental control databases, BSS databases, etc.;
    • 43 special database sets for our technology partners.

    Public anti-virus databases are released on average every 2 hours; anti-spam databases, as often as every 5 minutes. Everything is tested for false positives, missed detections (malware slipping through), crashes, system load, operability in each product, and a number of other criteria.

    The main criteria are false positives and missed detections: the quality of protection depends directly on them, so we approach this stage especially thoroughly. Databases are tested automatically against a large collection (clean software, malformed and corrupted files, etc.); on a regular computer this would take a full day. Naturally, we cannot afford such a luxury, so the tests run on a dedicated computing cluster under the control of our own distributed system, DDPS (Distributed Data Processing System), which can scan an 80-terabyte collection in 6 hours.
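
The fan-out idea behind a distributed scan like this can be sketched in a few lines. DDPS itself is proprietary, so everything below (the sharding scheme, the "signature" matching, the file names) is invented purely for illustration: split the collection into shards, scan each shard on a worker, merge the verdicts, and compare them against known-clean and known-malicious labels to find false positives and missed detections.

```python
# Toy sketch of fan-out scanning in the spirit of a distributed system such
# as DDPS (the real system is proprietary; all names here are made up).
from concurrent.futures import ThreadPoolExecutor

def scan_shard(shard, signatures):
    """Pretend-scan: a file is 'detected' if any signature is in its name."""
    return {f: any(sig in f for sig in signatures) for f in shard}

def run_distributed_scan(collection, signatures, workers=4):
    # Round-robin sharding: shard i gets every workers-th file.
    shards = [collection[i::workers] for i in range(workers)]
    verdicts = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(lambda s: scan_shard(s, signatures), shards):
            verdicts.update(result)
    return verdicts

# Known-clean and known-malicious labels let us classify test outcomes.
clean = ["notepad.exe", "calc.exe"]
malware = ["trojan_x.bin", "worm_y.bin"]
verdicts = run_distributed_scan(clean + malware, signatures=["trojan", "worm"])
false_positives = [f for f in clean if verdicts[f]]
missed = [f for f in malware if not verdicts[f]]
print(false_positives, missed)  # both empty: this database set passes
```

The real system differs in every detail, of course, but the shape is the same: the collection only becomes scannable in hours rather than days because the work is partitioned across many machines.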

    How the databases behave in a particular product under a specific OS is just as critical (who needs good databases in a non-working product?). This is tested by a dedicated robot on virtual machines covering every combination of supported product version and OS (Windows, various Unix/Linux flavors, Mac OS). The cluster holds more than 1300 virtual machines, so we can potentially check 1300 (sic!) combinations at the same time.
    A cluster is a cluster, even in Russia
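
Enumerating such a version-by-OS matrix is the easy part; here is a minimal sketch of the idea. The actual product and OS inventory is internal, so the lists below are illustrative placeholders only.

```python
# Hypothetical sketch of enumerating a product-version x OS test matrix;
# the real robot's inventory is internal, these lists are invented.
from itertools import product

product_versions = ["KAV 2012", "KAV 2013", "KIS 2013"]
operating_systems = ["Windows XP", "Windows 7", "Debian 6", "Mac OS X 10.8"]

def build_matrix(versions, oses):
    """Each (version, OS) pair maps to one virtual machine in the cluster."""
    return list(product(versions, oses))

matrix = build_matrix(product_versions, operating_systems)
print(len(matrix))  # 3 versions x 4 OSes = 12 VM configurations
```

With a real product line the matrix quickly grows into the hundreds, which is why a cluster of 1300 VMs is needed to cover the combinations in parallel.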

    Updatable modules (the anti-virus engine, anti-rootkit, script emulator, IDS, and others; more than 20 modules in all) are tested separately. This is the job of a dedicated team of testers in Moscow, St. Petersburg, and Beijing (10 people in total). The process is semi-automated: both robots and specialists work on each task.

    After testing, a release package is assembled and uploaded to public servers, of which there are more than 60 on every continent except Antarctica. The antivirus installed on your computer receives its updates from these servers. Uploading is again handled by our proprietary DRS (Distributed Replication System). Thanks to DRS the process is fast, multi-threaded, and highly reliable; to give one figure, replicating updates to all these servers takes only a few minutes, and we perform about 20 such rollouts per hour.
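
The reason multi-threaded replication finishes in minutes is simply that the 60 servers are pushed concurrently rather than one after another. DRS itself is proprietary; the sketch below invents everything (server names, payload file names, the retry policy) just to show the pattern: one push per thread, with a few retries before a mirror is reported as failed.

```python
# Minimal sketch of multi-threaded replication to many mirrors, loosely in
# the spirit of a system like DRS (proprietary; all names here are invented).
from concurrent.futures import ThreadPoolExecutor

def push(server, payload, attempts=3):
    """Try to deliver the payload, retrying a few times on failure."""
    for attempt in range(1, attempts + 1):
        try:
            server["files"] = payload        # stand-in for the real transfer
            return server["name"], attempt
        except Exception:
            continue
    return server["name"], None              # gave up after all retries

def replicate(servers, payload, threads=8):
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return dict(pool.map(lambda s: push(s, payload), servers))

mirrors = [{"name": f"update{i}.example.com", "files": None} for i in range(60)]
report = replicate(mirrors, payload=["bases.set", "daily.set"])
print(all(attempt == 1 for attempt in report.values()))  # True: all pushed
```

Wall-clock time then scales with the slowest mirror rather than with the sum of all 60 transfers, which is what makes roughly 20 rollouts per hour feasible.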

    Controlled Updates

    So what matters most in this whole machinery of producing, testing, and delivering updates?

    The results of an update are monitored by another tool of ours, the KSN cloud system. If an update contains a bug, KSN signals the problem an hour (or even several hours or days, depending on the nature of the bug) before the first reports from users reach technical support. This lets us respond faster and even resolve some incidents BEFORE they escalate.
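
The core idea behind such early warning can be illustrated very simply: compare the stream of error reports arriving right after a rollout with the pre-update baseline, and raise an alarm on a spike. KSN's actual logic is far more elaborate; the threshold, the metric, and the numbers below are all invented for this sketch.

```python
# Hedged sketch of cloud-side early warning: alarm when the post-update
# report rate spikes above the baseline. The real KSN logic is much more
# sophisticated; the factor and sample data here are invented.
def update_looks_suspicious(baseline_rates, post_update_rates, factor=3.0):
    """Alarm if the mean report rate after the update exceeds
    `factor` times the pre-update baseline mean."""
    baseline = sum(baseline_rates) / len(baseline_rates)
    current = sum(post_update_rates) / len(post_update_rates)
    return current > factor * baseline

normal_day = [10, 12, 9, 11]       # reports per minute before the update
bad_update = [40, 55, 60, 80]      # a spike right after the rollout
print(update_looks_suspicious(normal_day, normal_day))   # False
print(update_looks_suspicious(normal_day, bad_update))   # True
```

Because the cloud aggregates telemetry from millions of machines, even a rare crash becomes statistically visible long before individual users start filing support tickets.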

    The total cost of our update-testing infrastructure is about $3 million per year. That is not cheap, but the money is well spent: the quality of protection, the stability of the products, and, most importantly, the opinion and comfort of our users depend on it. Over the past year the testing system has identified and prevented 4 major incidents, as well as a fair number of false positives. We will keep developing this infrastructure; it is no place to economize.

    Recently we had 2 serious incidents with updates (in December and February). We analyzed them carefully, drew conclusions, and have already made several important changes, both organizational and technical.

    First, we tightened the rules for releasing “dangerous” updates that could potentially cause problems on the client side. The level of control has been raised: every manipulation of every file is now registered, so we can instantly see what was released, how, and by whom; and any deviation of the infrastructure from its normal state, however insignificant, as well as any incident under investigation, can halt an update (and thereby prevent a new incident).
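
The two controls described here, an audit trail of every file manipulation and a release gate that stays closed while anything is wrong, reduce to a small pattern. This is not Kaspersky's actual tooling; every name and field below is a placeholder for illustration.

```python
# Illustrative sketch (not the actual internal tooling) of an audit trail
# plus a release gate that blocks shipping while any anomaly is open.
import datetime

audit_log = []
open_anomalies = set()

def record_action(user, action, filename):
    """Every manipulation of every file is registered: what, how, by whom."""
    audit_log.append({
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "who": user, "what": action, "file": filename,
    })

def may_release():
    """An update may ship only while no anomaly or incident is open."""
    return len(open_anomalies) == 0

record_action("alice", "rebuild", "av_bases.set")
open_anomalies.add("replication lag on mirror 17")
print(may_release())  # False: the deviation blocks the update
open_anomalies.clear()
print(may_release())  # True: safe to release again
```

The useful property is that the gate fails closed: a release is blocked by default whenever the infrastructure deviates from its normal state, rather than requiring someone to notice the problem and intervene.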

    Second, new crash-resistance tests have been introduced into the update verification procedure. One of them uses a special collection of files whose scanning exercises the maximum amount of code in the anti-virus databases; the more code paths are executed, the more chances poor-quality code has to reveal itself. We also finished building the system for collecting, recording, attributing, and aggregating crash information (dumps) from the client side: now not a single database-related crash goes unnoticed, and we investigate every one of them. The first runs showed the procedure to be very effective; it has already caught one very unpleasant bug.
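
The aggregation and attribution step can be sketched as grouping incoming dumps by a crash signature, so that each distinct failure surfaces as one bucket with a count. The real pipeline and its dump format are internal; the field names and sample data below are invented.

```python
# Simplified sketch of client-side crash dump aggregation: group reports by
# a signature (crashing module + top stack frame + database release) so each
# distinct crash becomes one bucket to investigate. Fields are invented.
from collections import Counter

def signature(dump):
    return (dump["module"], dump["top_frame"], dump["db_release"])

def aggregate(dumps):
    return Counter(signature(d) for d in dumps)

dumps = [
    {"module": "engine.dll", "top_frame": "ParseRecord", "db_release": "0215"},
    {"module": "engine.dll", "top_frame": "ParseRecord", "db_release": "0215"},
    {"module": "emul.dll",   "top_frame": "RunScript",   "db_release": "0214"},
]
buckets = aggregate(dumps)
# The most frequent bucket is the first candidate for investigation.
worst, count = buckets.most_common(1)[0]
print(worst, count)  # ('engine.dll', 'ParseRecord', '0215') 2
```

Including the database release in the signature is what makes attribution possible: a bucket that appears only after a particular release points straight at the update that introduced it.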

    Third, we formulated requirements for future products to increase their resilience to various problems with updates. No details here, so as not to give away descriptions of future patents :-).

    Finally, we are launching a new corporate crisis management procedure that covers all departments involved and ensures maximum speed and transparency of information flow along the chain from the developer to the client.

    Important: the above is just a small part of the improvements we are working on this year.

    Despite the powerful testing system, errors still happen; no one is immune to them. We are only human, even though we command robots and automated systems. Errare humanum est. What matters is that the humanum draws conclusions from each errare and keeps improving, because, alas (or fortunately?), there is no limit to perfection.

    Posted by Nikolay Grebennikov, CTO Kaspersky Lab
