BASS - a framework for automatic synthesis of antivirus signatures

Original authors: Jonas Zaddach and Mariano Graziano

Translation

Hello. Fewer than ten days remain before the “Reverse Engineering” course starts, so we want to share another interesting translation on the topic. Let's go!

Overview

The threat landscape is changing rapidly: new cyberattacks appear constantly, and old ones become more sophisticated. Security professionals therefore face increasingly complex challenges. Every day they must process and analyze millions of previously unknown malware samples, develop effective antivirus signatures that describe entire malware families, and keep their tools scalable as the number of samples grows. All of this has to happen within the limited resources available to malware analysis automation tools. To help with these challenges, Talos offers a new open source framework called BASS.

BASS (pronounced like “bass”) is a framework for automatically generating antivirus signatures from samples in previously formed malware clusters. It aims to reduce the ClamAV engine's resource consumption by increasing the share of pattern-based signatures relative to hash signatures, and to simplify the work of analysts who write pattern-based signatures. Thanks to its use of Docker containers, the framework scales well.

Note that only an alpha version of BASS is available so far, and much remains to be done. The project is open source and under active development, so we welcome any feedback and suggestions for improvement from the community. The source code for BASS is available here.

The BASS project was announced in 2017 at the REcon conference in Montreal, Canada.

Relevance

Talos receives more than 1.5 million unique samples every day. Most of them are known threats that the malware scanner (ClamAV) detects right away. However, many files remain after scanning that need further analysis. We run them in a sandbox and perform dynamic analysis, which lets us classify them as malicious or benign. We then process the malicious samples to create ClamAV signatures, so that these threats can be filtered out earlier, at the scanning stage.

In the three months from February to April 2017, 560,000 new signatures were added to the ClamAV database, an increase of about 9,500 signatures per day. We generated a significant part of them automatically as hash signatures. Hash signatures have one major drawback compared with pattern-based or bytecode signatures (the two other types the ClamAV engine supports): one hash signature matches exactly one file. In addition, a growing number of hash signatures makes the ClamAV database take up more memory. That is why we prefer pattern-based signatures: they are much simpler and faster to maintain than bytecode signatures, and they can still describe entire clusters of files.
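The difference between the two signature types can be sketched in a few lines. This is only an illustration: the byte strings are made up, SHA-256 stands in for whichever hash a scanner uses, and a plain substring test stands in for real pattern matching.

```python
import hashlib

# Two made-up "samples" that share a code fragment but differ elsewhere.
shared_code = bytes.fromhex("558bec83ec10e8")
sample_a = b"\x00\x01" + shared_code + b"\xaa"
sample_b = b"\x00\x02" + shared_code + b"\xbb"

# A hash signature is tied to the exact contents of one file.
hash_signature = hashlib.sha256(sample_a).hexdigest()


def matches_hash(data: bytes) -> bool:
    return hashlib.sha256(data).hexdigest() == hash_signature


def matches_pattern(data: bytes) -> bool:
    # A pattern signature only needs the shared fragment to be present.
    return shared_code in data


hash_hits = [matches_hash(s) for s in (sample_a, sample_b)]
pattern_hits = [matches_pattern(s) for s in (sample_a, sample_b)]
print(hash_hits)     # [True, False]
print(pattern_hits)  # [True, True]
```

One pattern covers both variants, while covering them with hashes would take one signature per file.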

BASS

The BASS framework is designed to make it easier to create pattern-based ClamAV signatures. It generates them automatically from segments of binary executable code.

BASS takes malware clusters as input but does not include the means to create them, which keeps the framework convenient and flexible. We deliberately made the input interface generic so that it is easy to adapt to new cluster sources. We currently use several such sources: clusters based on indicators of compromise (IoC) from our sandbox, structural hashing (we start from a known-malicious executable and look for additional samples with a similar structure), and malware received from spam campaigns.

In the first stage, the malware samples pass through the ClamAV engine's unpackers. The engine can unpack archives of various formats and packed executables (for example, UPX), and can extract embedded objects (such as EXE files inside Word documents). The resulting artifacts are inspected and information about them is collected. At present, the next stage, filtering, uses their sizes and UNIX magic strings.

Next, the malware cluster is filtered. Files that do not meet the BASS requirements are removed from the cluster (for now the framework works only with PE executables, although support for ELF and Mach-O binaries would not be hard to add). If too few files remain, the cluster is rejected entirely.
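A minimal sketch of such a filtering step follows. It is not the BASS implementation: the size limit, the minimum cluster size, and the function names are assumptions, and the PE check reads the file header directly instead of using a magic-string database.

```python
import struct
from typing import Dict, List, Optional

MAX_SIZE = 10 * 1024 * 1024  # hypothetical upper size limit, in bytes
MIN_CLUSTER_SIZE = 2         # hypothetical minimum number of samples


def looks_like_pe(data: bytes) -> bool:
    """Check the DOS 'MZ' magic and the 'PE\\0\\0' signature it points to."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # The 32-bit little-endian offset of the PE header is stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"


def filter_cluster(samples: Dict[str, bytes]) -> Optional[List[str]]:
    """Return the names of usable samples, or None if the cluster is rejected."""
    kept = [name for name, data in samples.items()
            if len(data) <= MAX_SIZE and looks_like_pe(data)]
    return kept if len(kept) >= MIN_CLUSTER_SIZE else None


# A tiny fake PE header, just enough to pass the check above.
fake_pe = b"MZ" + b"\x00" * 0x3A + struct.pack("<I", 0x40) + b"PE\x00\x00"
cluster = {"a.exe": fake_pe, "b.exe": fake_pe + b"\x90", "readme.txt": b"hello"}
print(filter_cluster(cluster))                          # ['a.exe', 'b.exe']
print(filter_cluster({"a.exe": fake_pe, "x": b"hi"}))   # None
```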

The filtered cluster proceeds to the signature generation step. First, the binaries are disassembled. We use IDA Pro for this, but it could easily be replaced by another disassembler with similar capabilities, such as radare2.

After disassembly, the code that the samples have in common must be identified so that signatures can be generated from it. This step matters for two reasons. First, the signature generation algorithm is computationally expensive and works better on short code segments. Second, it is preferable to derive signatures from code that is similar not only syntactically but also semantically. To compare code we use BinDiff. It, too, is easy to replace, and we may integrate other comparison tools into the framework in the future.

If the cluster is small, BinDiff compares each executable with every other one. For larger clusters the scope of the comparison is reduced, because otherwise the process would take too long. From the results, a graph is built in which vertices represent functions and edges represent their similarity. To find a good common function, it is enough to find a connected subgraph with high overall similarity.

The subgraph ƒ1, ƒ2, ƒ4, ƒ6 with high similarity scores between its vertices (see the figure above) is an excellent candidate for a common function.
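The subgraph search can be sketched as follows. This is a simplified illustration, not the BASS code: the similarity scores are invented, the 0.8 threshold is an assumption, and ranking components by size and then by mean edge similarity is just one plausible reading of “high overall similarity”.

```python
from collections import defaultdict


def best_candidate(similarities, threshold=0.8):
    """similarities maps (func_a, func_b) to a score between 0 and 1."""
    # Keep only edges whose similarity reaches the threshold.
    graph = defaultdict(set)
    for (a, b), score in similarities.items():
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)

    seen, best, best_score = set(), [], 0.0
    for start in graph:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        # Mean similarity over the kept edges inside this component.
        edges = [s for (a, b), s in similarities.items()
                 if a in component and b in component and s >= threshold]
        score = sum(edges) / len(edges)
        if (len(component), score) > (len(best), best_score):
            best, best_score = sorted(component), score
    return best, best_score


# Invented scores for six functions f1..f6.
similarities = {
    ("f1", "f2"): 0.90, ("f2", "f4"): 0.95, ("f4", "f6"): 0.85,
    ("f1", "f6"): 0.90, ("f3", "f5"): 0.82, ("f1", "f3"): 0.30,
}
best, score = best_candidate(similarities)
print(best)  # ['f1', 'f2', 'f4', 'f6']
```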

Once several such candidates have been collected, we check them against a whitelist to avoid creating signatures from ordinary library functions that are statically linked into the sample. To do this, the functions are sent to a Kam1n0 instance whose database we have previously filled with functions from known-clean samples. If a clone of any function is found, the subgraph selection is repeated to pick the best of the remaining candidates. If the check finds nothing, the set of functions is passed to the next stage.
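The selection loop described above might look like the sketch below. The `is_known_clean` callable is a hypothetical stand-in for a clone search against a Kam1n0 instance; no real Kam1n0 API is used, and the function names are invented.

```python
from typing import Callable, Iterable, List, Optional


def pick_candidate(ranked_candidates: Iterable[List[str]],
                   is_known_clean: Callable[[str], bool]) -> Optional[List[str]]:
    """Return the first candidate set that contains no whitelisted function."""
    for functions in ranked_candidates:
        if not any(is_known_clean(f) for f in functions):
            return functions
    return None  # every candidate overlapped with known-clean code


# A plain set plays the role of the clean-function database here.
clean_functions = {"memcpy", "strlen"}
ranked = [["memcpy", "f1"], ["f2", "f4"]]
chosen = pick_candidate(ranked, clean_functions.__contains__)
print(chosen)  # ['f2', 'f4']
```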

Then signature generation proper begins. Pattern-based ClamAV signatures are designed to detect subsequences in binary data, so we apply a longest common subsequence (LCS) algorithm to all of the extracted functions.
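For reference, here is the textbook dynamic-programming LCS for two byte strings. It is a sketch of the classic algorithm only, not the heuristic variant BASS uses, and the input bytes are made up.

```python
def longest_common_subsequence(a: bytes, b: bytes) -> bytes:
    """Classic O(len(a) * len(b)) dynamic-programming LCS."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the bytes.
    out = bytearray()
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    out.reverse()
    return bytes(out)


func_a = bytes.fromhex("558bec83ec10e8")
func_b = bytes.fromhex("558bec81ec0001e8")
common = longest_common_subsequence(func_a, func_b)
print(common.hex())  # 558becece8
```

The table has one cell per pair of positions, which is why both time and memory grow with the product of the input lengths.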

Computationally, this algorithm is quite expensive even for two samples and becomes considerably more expensive for several, so we use a heuristic variant described by Christian Blichmann. The result might look something like this:
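As a purely hypothetical illustration of the shape such a result can take (this is not actual BASS output), the sketch below joins byte runs shared by all samples into a ClamAV extended-signature line of the form `Name:TargetType:Offset:HexSignature`, using `*` for a gap of any length, target type 1 for PE files, and offset `*` for any position. The signature name, the byte runs, and the regex emulation of the wildcard are assumptions made for the example.

```python
import re
from typing import List


def build_ndb_signature(name: str, runs: List[bytes]) -> str:
    """Join shared byte runs with '*' (any number of bytes) wildcards."""
    hex_pattern = "*".join(run.hex() for run in runs)
    return f"{name}:1:*:{hex_pattern}"


def runs_to_regex(runs: List[bytes]) -> "re.Pattern[bytes]":
    """Emulate the '*' wildcard with a regex, for local experiments only."""
    return re.compile(b".*".join(re.escape(run) for run in runs), re.DOTALL)


# Made-up byte runs that appear, in this order, in both made-up samples.
runs = [bytes.fromhex("558bec"), bytes.fromhex("8b4508"), bytes.fromhex("5dc20400")]
sample_1 = bytes.fromhex("558bec83ec108b4508905dc20400")
sample_2 = bytes.fromhex("558bec81ec000100008b45085dc20400")
clean_file = b"\x90" * 16

signature = build_ndb_signature("Hypothetical.Example.Sig-1", runs)
print(signature)  # Hypothetical.Example.Sig-1:1:*:558bec*8b4508*5dc20400

pattern = runs_to_regex(runs)
print([bool(pattern.search(d)) for d in (sample_1, sample_2, clean_file)])
# [True, True, False]
```

Running the same pattern over a set of clean files, as in the last line, is a small-scale version of a false-positive check.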

Finally, the signature must be tested before it is published. We automatically check it for false positives using our test suite. For additional confidence, we use Sigalyzer, a new feature of CASC, our IDA Pro plug-in for generating and analyzing ClamAV signatures (the plug-in will be updated later). Sigalyzer highlights the sections of a binary that match the ClamAV signature triggered on it, which gives a visual representation of the signature.

Architecture

BASS is implemented as a cluster of Docker containers. The framework is written in Python and communicates with all the tools it needs through web services. The architecture is modeled on the VxClass project, which also generated ClamAV signatures using IDA Pro and BinDiff but was later discontinued and, unlike BASS, was never available to the general public.

Limitations

BASS works only with binary executables, because the signature is generated from the sample's code. In addition, it analyzes only x86 and x86_64 executables. Support for other architectures may be added in the future.

So far, BASS does not cope well with file infectors, which insert small and highly varied code snippets into infected files, or with backdoors, which consist mainly of harmless (often stolen) binary code with malicious functions added. We are addressing these shortcomings by working on the clustering phase.

Once again, we want to remind you that BASS is in the alpha stage and not everything works smoothly yet. We hope that developing the framework as an open source project will benefit the community, and we welcome any ideas and criticism.

Appendix

Difference between the longest common substring and the longest common subsequence

The following illustration shows the difference between the longest common substring and the longest common subsequence. In this article, the longest common subsequence is abbreviated as LCS.
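The distinction can also be shown on a small made-up example: a substring must be contiguous, while a subsequence only has to keep the order of its elements. The strings below are invented for the purpose.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest run of characters that appears contiguously in both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def is_subsequence(candidate: str, text: str) -> bool:
    """True if candidate's characters occur in text in order, gaps allowed."""
    remaining = iter(text)
    return all(ch in remaining for ch in candidate)


a, b = "ABCXDEF", "ABCYDEF"
print(longest_common_substring(a, b))  # ABC
# "ABCDEF" is common to both as a subsequence, but not as a substring.
print(is_subsequence("ABCDEF", a), is_subsequence("ABCDEF", b))  # True True
print("ABCDEF" in a)  # False
```

Here the longest common substring has three characters, while the common subsequence "ABCDEF" has six, because it is allowed to skip the differing character in the middle.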

That's all. On June 20 you can learn about the course program in detail at the open day, which will be held as a webinar.
