Viruses. Viruses? Viruses! Part 1

    Shall we talk about computer viruses? No, not the one your antivirus caught yesterday. Not the one you downloaded disguised as yet another Photoshop installer. Not the rootkit that sits on your server pretending to be a system process. Not search bars, downloaders and other malware. Not code that does bad things on your behalf and wants your money. No, all that is commerce, no romance...

    We will talk about computer viruses as code that can produce copies of itself, changing from generation to generation. Code which, like its biological counterparts, needs a host file that works and keeps working in order to give life to new generations of the virus. Code which needs a fertile environment to breed: plenty of tasty executable files and plenty of careless, active users to run them. So the name “virus” is not just a pretty label for a malicious program: a computer virus, in its classical sense, is an entity very close to its biological counterpart. Humanity, as has been proved more than once, is capable of creating very sophisticated solutions, especially when it comes to creating something harmful to other people.

    So, a long time ago, after DOS came to the people and every programmer got his own small universe, where the address space was shared and the permissions on files were always rwx, the thought arose: can a program copy itself? “Of course it can!” said the programmer, and wrote code that copies its own executable file. The next thought was: “Can two programs merge into one?” “Of course they can!” said the programmer, and wrote the first infector. “But what for?” he thought, and that was the beginning of the era of computer viruses. As it turned out, making mischief on a computer while trying in every possible way to avoid detection is great fun, and creating viruses is very interesting from a system programmer's point of view.

    Well, that is quite enough lyricism for one article; let's get down to business. I want to talk about the classic virus: its structure, the basic concepts, the detection methods, and the algorithms both sides use to win.

    Virus anatomy

    We will talk about viruses that live in executable files of the PE and ELF formats, that is, viruses whose body is executable code for the x86 platform. In addition, let our virus not destroy the original file: it fully preserves the file's operability and correctly infects any suitable executable. Yes, breaking things is much easier, but we agreed to talk about proper viruses, right? To keep the material up to date, I will not spend time on infectors of the old COM format, although it was there that the first advanced techniques for working with executable code were tried out.

    The main parts of the virus code are the infector and the payload. The infector is the code that searches for files suitable for infection and injects the virus into them, trying to hide the fact of injection as well as possible without damaging the file's functionality. The payload is the code that performs the actions the virus writer actually wants: for example, sending spam, DoS-ing someone, or simply leaving a text file saying “Virya was here” on the machine. What exactly is inside the payload does not matter to us at all; what matters is that the virus writer tries in every way to hide its contents.

    Let's start with the properties of virus code. To make the code easier to inject, you don't want to keep code and data separate, so data is usually embedded directly in the executable code. For example, like this:
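        jmp message
    the_back:
        mov eax, 0x4        ; sys_write
        mov ebx, 0x1        ; stdout
        pop ecx             ; the address of "Hello, World!" is taken from the stack
        mov edx, 0xF        ; length: 13 characters + CR + LF
        int 0x80
        mov eax, 0x1        ; sys_exit, so execution does not fall through into the call below
        xor ebx, ebx
        int 0x80
    message:
        call the_back       ; pushes the "return" address, i.e. the address of the string that follows
        db "Hello, World!", 0Dh, 0Ah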

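    Or like this:
    push 0x68732f2f   ; "//sh" (bytes are stored little-endian)
    push 0x6e69622f   ; "/bin"
    mov ebx, esp      ; EBX now holds the address of the string "/bin//sh" on the stack
    mov al, 11        ; execve system call number
    int 0x80          ; simplified: a complete fragment would also terminate the string and set up ECX/EDX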

    Under certain conditions, any of these code variants can simply be copied into memory and started with a JMP to the first instruction. If such code is written correctly, with care taken over offsets, system calls, the state of the stack before and after execution, and so on, it can be embedded in a buffer containing someone else's code.

    Suppose the virus writer is able to write virus code in this style and now needs to embed it in an existing executable file. He has to take care of two things:
    • Where to put the virus? He has to find a place large enough for the virus, write it there without breaking the file if possible, and make sure that code execution is allowed in the area where the virus ends up.
    • How to transfer control to the virus? Simply placing the virus in the file is not enough: something must jump to its body, and when it has done its work it must return control to the victim program. Or in a different order, but in any case we agreed not to break anything, right?

    So, let's look at injection into a file. The modern executable formats for the x86 platform on Windows and Linux are PE (Portable Executable) and ELF (Executable and Linkable Format). Their specifications are easy to find in the system documentation, and if you deal with protecting executable code you certainly should not skip them. The executable formats and the system loader (the operating system code that launches an executable file) are among the “elephants” on which the operating system stands. Launching an .exe file is an algorithmically very complex process with a heap of nuances; it could fill a dozen articles, which you will certainly find yourself if the topic interests you. I will confine myself to a simplified account (please don't throw tomatoes at me for it) that is sufficient for a basic understanding of the startup process.

    An executable file (PE or ELF) consists of a header and a set of sections. Sections are aligned buffers containing code or data (see below). When the file is launched, memory is allocated for the sections and they are copied into it, and a section does not necessarily occupy any space on disk. The header contains the section layout: it tells the loader where the sections are located in the file on disk and how to place them in memory before control is transferred to the code inside the file. We are interested in three key parameters of each section: psize, vsize and flags. Psize (physical size) is the size of the section on disk. Vsize (virtual size) is the size of the section in memory after the file is loaded. Flags are the section's access attributes (rwx). Psize and vsize can differ significantly: for example, a section of uninitialized data may take up no space on disk at all and still have a substantial size in memory.
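
    To make the three parameters concrete, here is a minimal sketch in Python. The Section class, the section names and all the numbers are made up for illustration; real PE and ELF headers store this information in their own fields and encode the flags as bit masks rather than strings.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    psize: int   # bytes the section occupies in the file on disk
    vsize: int   # bytes the section occupies in memory after loading
    flags: str   # access attributes in the article's notation, e.g. "r_x" or "rw_"

    def zero_filled_bytes(self) -> int:
        """Bytes that exist only in memory (the loader fills them with zeros)."""
        return max(0, self.vsize - self.psize)


# Hypothetical layout of a small program; every number here is invented.
sections = [
    Section(".text", psize=0x1200, vsize=0x11F4, flags="r_x"),
    Section(".data", psize=0x0200, vsize=0x0180, flags="rw_"),
    Section(".bss",  psize=0x0000, vsize=0x4000, flags="rw_"),
]

for s in sections:
    print(f"{s.name:6} disk={s.psize:#07x} mem={s.vsize:#07x} "
          f"flags={s.flags} zero-filled={s.zero_filled_bytes():#x}")
```

    In this toy layout the uninitialized data section takes nothing on disk, while the code section is slightly larger on disk than in memory because of alignment padding.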

    The flags (access attributes) are assigned to the memory pages onto which the section is mapped. For example, a section with executable code has the r_x (read, execute) attributes, and a data section has rw_ (read, write). A processor that tries to execute code on a page without the execute flag raises an exception, and the same goes for an attempt to write to a page without the w attribute, so when placing the virus code the virus writer has to take into account the attributes of the memory pages it will live in. Until recently, standard areas of uninitialized data (for example, the program stack) had rwx (read, write, execute) attributes, which made it possible to copy code directly onto the stack and execute it there. Now this is considered unfashionable and unsafe, and in recent operating systems the stack area is intended for data only.

    The header also contains the Entry Point: the address of the first instruction, from which execution of the file begins.

    One more property of executable files that matters to virus writers is alignment. So that the file can be read from disk and mapped into memory efficiently, sections in executable files are aligned to boundaries that are powers of two, and the free space left over by the alignment (padding) is filled with whatever the compiler chooses. For example, it is logical to align sections to the size of a memory page: then a section can conveniently be copied into memory as a whole and given its attributes. I won't even try to list all these alignments: wherever there is a more or less standard piece of data or code, it is aligned (any programmer knows that there are exactly 1,024 meters in a kilometer). And for anyone who works on protecting executable code, the descriptions of the Portable Executable (PE) and Executable and Linkable Format (ELF) standards are books to keep on the desk.
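
    The alignment arithmetic is simple enough to show in a few lines of Python. This is only an illustrative sketch: the function names are mine, and the sizes (0x11F4 bytes of content, 0x200 file alignment, 0x1000-byte page) are assumed example values, not taken from any particular file.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the nearest multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (value + alignment - 1) & ~(alignment - 1)


def padding_after(content_size: int, alignment: int) -> int:
    """Number of filler bytes between the end of the content and the aligned end."""
    return align_up(content_size, alignment) - content_size


# Assumed example: 0x11F4 bytes of code, aligned to 0x200 in the file
# and to a 0x1000-byte page in memory.
print(hex(align_up(0x11F4, 0x200)), padding_after(0x11F4, 0x200))
print(hex(align_up(0x11F4, 0x1000)), padding_after(0x11F4, 0x1000))
```

    The second number on each line is the size of the padding, that is, the unused space that alignment leaves at the end of the section.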

    Since the addresses inside all these sections are interlinked, simply dropping a piece of code into the middle of a section and “stitching” it in with JMPs will not work: the original file will break. Therefore, the popular places for injecting virus code are:
    • the main code section (the virus overwrites the beginning of the executable code, starting at the Entry Point).
    • the padding between the end of the header and the first section. There is nothing there, and a small virus (or its loader) can quite possibly fit without breaking the file.
    • a new section, which can be added to the header and placed in the file after all the others. In this case no internal offsets break, and space is not a problem either. True, a last section in the file that allows execution will of course attract the attention of heuristics.
    • the padding between the end of a section's contents and its aligned end. This is much harder, since you first have to find that “end”, and there is no guarantee that you will be lucky and find enough space. But for some compilers this place can be found simply by characteristic bytes.

    There are other, trickier ways, some of which I will describe in the second article.

    Now about the transfer of control. For the virus to work, its code must somehow gain control. The most obvious way: the virus gets control first, and then, after it has done its work, the host program does. This is the simplest approach, but other options also have a right to exist: the virus may get control after the host finishes, for example, or in the middle of execution, “substituting” for some function. Here are a few techniques for transferring control (the term Entry Point, or EP, used from here on, means the address to which the system loader transfers control after it has prepared the executable for launch):
    1. The first bytes at the Entry Point are replaced with a JMP to the virus body. The virus keeps the overwritten bytes in its body and, when it has finished its own work, restores them and transfers control to the beginning of the restored buffer.
    2. A method similar to the previous one, but instead of raw bytes the virus saves several complete machine instructions from the Entry Point. Then it does not have to restore anything (it only has to leave the stack clean): after finishing its own work it executes the saved instructions and transfers control to the instruction following the “stolen” ones.
    3. As with injection, there are more cunning ways, but we will look at them below or leave them for the next article.

    All of these are ways of correctly inserting a buffer with code into an executable file. Items 2 and 3 imply functionality that can tell which bytes form instructions and where the boundaries between instructions lie. After all, we cannot “break” an instruction in half: everything would fall apart. Thus we smoothly move on to disassemblers in viruses. We will need an idea of how disassemblers work in order to look at all the serious techniques for working with executable code, so it's fine if I describe it briefly now.

    If we inject our code at a position exactly between instructions, we can save the context (stack, flags) and, after executing the virus code, restore everything and return control to the host program. Of course, this too can run into problems if code integrity checks, anti-debugging and so on are in use, but more on that in the second article. To find such a position we need to:
    • put the pointer exactly at the beginning of some instruction (you cannot just take a random place in an executable section and start disassembling from it: the same byte can be an instruction opcode or data)
    • determine the length of the instruction (on the x86 architecture instructions have different lengths)
    • move the pointer forward by that length, which puts us at the beginning of the next instruction
    • repeat until we decide to stop
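
    The walk described by the steps above can be sketched in Python. This is a toy, not a real disassembler: the length table covers only the four opcodes of a 16-byte sample modeled on the shell-call fragment, the function names are mine, and a real implementation also has to handle prefixes, ModRM/SIB bytes, displacements and immediates of different sizes.

```python
# Toy length table: opcode byte -> total instruction length in bytes.
# Only the opcodes used in the sample below are covered.
FIXED_LENGTHS = {
    0x68: 5,  # push imm32
    0xB0: 2,  # mov al, imm8
    0xCD: 2,  # int imm8
}


def instruction_length(code: bytes, pos: int) -> int:
    opcode = code[pos]
    if opcode in FIXED_LENGTHS:
        return FIXED_LENGTHS[opcode]
    # 8B /r = mov r32, r/m32; supported here only in register-to-register form
    # (ModRM mod bits == 0b11), which is always 2 bytes long.
    if opcode == 0x8B and pos + 1 < len(code) and code[pos + 1] >> 6 == 0b11:
        return 2
    raise ValueError(f"opcode {opcode:#04x} at offset {pos} is not in the toy table")


def instruction_starts(code: bytes, start: int = 0) -> list[int]:
    """Offsets of instruction starts, assuming `start` is itself a boundary."""
    offsets = []
    pos = start
    while pos < len(code):
        offsets.append(pos)
        pos += instruction_length(code, pos)
    return offsets


sample = bytes.fromhex("682f2f7368" "682f62696e" "8bdc" "b00b" "cd80")
print(instruction_starts(sample))
```

    Starting the walk at offset 1 instead of 0 lands in the middle of the first push, and the toy table simply refuses to continue. A real decoder would happily decode that byte as some other instruction, which is exactly why the first step matters.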

    This is the minimum functionality needed to avoid landing in the middle of an instruction; a function that takes a pointer to a byte string and returns the length of the instruction is called a length disassembler. For example, the infection algorithm may look like this:
    1. Select a tasty executable file (fat enough to hold the virus body, with a suitable section layout, and so on).
    2. Read our own code (the code of the virus body).
    3. Take the first few instructions from the victim file.
    4. Append them to the virus code (saving the information needed to restore operability).
    5. Append to the virus code a jump to the instruction that continues execution of the victim code. Thus, after executing its own code, the virus correctly executes the prologue of the victim code.
    6. Create a new section, write the virus code there and fix up the header.
    7. In place of those first instructions, put a jump to the virus code.

    This is a variant of a completely correct virus that can inject itself into an executable file, break nothing, covertly execute its code and return execution to the host program. Now, let's catch it.

    Detector anatomy

    Suddenly, out of nowhere, a knight appears on a white computer, with a debugger in his left hand and a disassembler in his right: an antivirus company programmer. Where did he come from? You guessed it, of course. Most likely he came from the “adjacent field”. The antivirus field is highly respected, in programming terms, by those in the know, because these guys have to deal with very sophisticated algorithms under rather cramped conditions. Judge for yourself: at the input you have hundreds of thousands of specimens of every kind of infection and one executable file, you have to work almost in real time, and the cost of an error is very high.

    For an antivirus, as for any automaton that makes a binary yes/no decision (infected/healthy), there are two types of error: false positive and false negative (a file mistakenly recognized as infected, and an infected file mistakenly let through). Clearly the total number of errors should be reduced in any case, but for an antivirus a false negative is much more unpleasant than a false positive. “After downloading the torrent, turn off the antivirus before installing the game”: sound familiar? That is a false positive: crack.exe, which writes something into an executable .exe file, looks like a virus to a sufficiently smart heuristic analyzer (see below). As the saying goes, better to be overcautious than not cautious enough.
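
    For clarity, here is how the two kinds of error are counted, as a small Python sketch. The function name and the five verdicts are invented for the example.

```python
def error_counts(verdicts, truth):
    """Count detector errors; both arguments are lists of booleans (True = infected)."""
    false_pos = sum(v and not t for v, t in zip(verdicts, truth))
    false_neg = sum(t and not v for v, t in zip(verdicts, truth))
    return false_pos, false_neg


# Made-up run over five files: what the detector said vs. what is actually true.
verdicts = [True, True, False, False, True]
truth    = [True, False, False, True, True]
fp, fn = error_counts(verdicts, truth)
print(f"false positives: {fp}, false negatives: {fn}")
```

    The second file is a clean file flagged as infected (false positive), and the fourth is an infected file that slipped through (false negative).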

    I don't think the components of a modern antivirus need describing; they all revolve around one piece of functionality, the antivirus detector. The monitor that checks files on the fly, disk scanning, checking of email attachments, quarantine, and remembering files that have already been checked are all wrapping around the main detection core. The second key component of an antivirus is the regularly updated signature database, without which the antivirus cannot be kept up to date. The third component, quite important but deserving a separate series of articles, is monitoring the system for suspicious activity.

    So (we are considering classical viruses), at the input we have an executable file and one of hundreds of thousands of potential viruses in it. Let's detect. Let this be a piece of the virus's executable code:
    XX XX XX XX XX XX       ; start of the virus, N bytes long . . .
    68 2F 2F 73 68          push 0x68732f2f   ; "//sh" (bytes are stored little-endian)
    68 2F 62 69 6E          push 0x6e69622f   ; "/bin"
    8B DC                   mov ebx, esp      ; EBX now holds the address of the string "/bin//sh"
    B0 0B                   mov al, 11        ; 11 decimal = 0x0B
    CD 80                   int 0x80
    XX XX XX XX             ; end of the virus, M bytes long . . .

    It is tempting to simply take the run of opcodes (68 2F 2F 73 68 68 2F 62 69 6E 8B DC B0 0B CD 80) and search for this byte string in the file. If it is found: gotcha, you bastard. But alas, it turns out that the same run of bytes also occurs in other files (you never know who else calls a shell), and there are a hundred thousand such strings to search for; if you search for each one, no optimization will help. The only fast and reliable way to check for such a string in a file is to check for it at a FIXED offset. Where do we get the offset from?

    Let's recall the “adjacent field”, in particular the parts about where the virus places itself and how it transfers control to itself:
    • the virus is embedded in the padding between the header and the beginning of the first section. In this case you can check for the byte string at the offset
      “header length” + N (where N is the number of bytes from the beginning of the virus to the byte string)
    • the virus lies in a new, separate section. In this case you can check for the byte string relative to the beginning of every section that contains code
    • the virus has settled in the padding between the end of the code and the end of the code section. You can use a negative offset from the end of the section: “end of the code section” - M - “byte string length” (where M is the number of bytes from the end of the byte string to the end of the virus code)

    Now, from the same place, about the transfer of control:
    • the virus wrote its instructions directly over the instructions at the Entry Point. In this case we look for the byte string simply at the offset “Entry Point” + N (where N is the number of bytes from the beginning of the virus to the byte string)
    • the virus wrote a JMP to its body at the Entry Point. In this case we first have to compute where this JMP points, and then look for the byte string at the offset “JMP target address” + N (where N is the number of bytes from the beginning of the virus to the byte string)
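
    Computing “where this JMP points” is a small piece of arithmetic: for a direct relative JMP, the displacement is counted from the end of the instruction. Below is a hedged Python sketch; the function name is mine, it handles only the E9 rel32 and EB rel8 forms, and it works with positions inside a byte buffer, leaving the mapping between file offsets and virtual addresses aside.

```python
import struct


def jmp_target(code: bytes, pos: int):
    """Return the offset a JMP at `pos` transfers control to, or None.

    Handles only the two direct relative forms:
      E9 rel32 (5 bytes) and EB rel8 (2 bytes).
    The displacement is relative to the end of the instruction.
    """
    opcode = code[pos]
    if opcode == 0xE9 and pos + 5 <= len(code):
        (rel,) = struct.unpack_from("<i", code, pos + 1)
        return pos + 5 + rel
    if opcode == 0xEB and pos + 2 <= len(code):
        (rel,) = struct.unpack_from("<b", code, pos + 1)
        return pos + 2 + rel
    return None


# Assumed toy image: a JMP at the entry point (offset 0) that lands 0x40 bytes in.
image = bytes([0xE9]) + struct.pack("<i", 0x40 - 5) + bytes(0x60)
entry_point = 0
target = jmp_target(image, entry_point)
print(hex(target))
```

    The detector would then look for the signature at target + N. Indirect jumps and chains of jumps need more work and are not covered by this sketch.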

    I'm getting tired of writing “byte string”. It has variable length, storing it in the database is inconvenient, and it is not even necessary, so instead of the byte string itself we will use its length plus its CRC32. Such a record is very short and the comparison is fast, since CRC32 is not a slow algorithm. There is no point in chasing collision resistance of the checksum, since the probability of a collision at fixed offsets is negligible. Besides, even if a collision does occur, the error will be a false positive, which is not so terrible. Summing up all of the above, here is an example structure of a record in our antivirus database:
    1. virus ID
    2. flags indicating where the offset is counted from (from the EP, from the end of the header, from the end of the first section, from the beginning of every section, from the target address of the JMP at the EP, etc.)
    3. offset
    4. signature length (Lsig)
    5. CRC32 of the signature (CRCsig)

    We optimize the input (keeping only the signatures that “fit” the given file, and preparing the set of required base offsets from the header right away), and then:
    { # for every suitable record
    -	compute the base offset in the file from the flags (start of the code section, entry point, etc.)
    -	add offset to it
    -	read Lsig bytes
    -	compute their CRC32
    -	if it matches CRCsig, we have caught a virus
    }
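
    The same loop as a runnable Python sketch. The SignatureRecord class, the scan function and the "entry_point" key are my own illustrative names: the record's flags are reduced here to a single string naming the base offset, and the base offsets are assumed to have been computed from the header beforehand. zlib.crc32 from the standard library is used as the CRC32 implementation.

```python
import zlib
from dataclasses import dataclass


@dataclass
class SignatureRecord:
    virus_id: str
    base: str      # which base offset to use: a key of the `bases` dict
    offset: int    # added to the base; may be negative
    length: int    # Lsig
    crc32: int     # CRCsig


def scan(data: bytes, bases: dict[str, int], records: list[SignatureRecord]) -> list[str]:
    """Return the IDs of all records whose signature matches at its fixed offset."""
    hits = []
    for rec in records:
        if rec.base not in bases:          # record does not apply to this file
            continue
        start = bases[rec.base] + rec.offset
        if start < 0 or start + rec.length > len(data):
            continue
        if zlib.crc32(data[start:start + rec.length]) == rec.crc32:
            hits.append(rec.virus_id)
    return hits


# Made-up demo: a 16-byte signature planted 8 bytes after a pretend entry point.
signature = bytes.fromhex("682f2f7368682f62696e8bdcb00bcd80")
record = SignatureRecord("Demo.A", "entry_point", 8, len(signature), zlib.crc32(signature))
infected = bytes(0x100) + bytes(8) + signature + bytes(32)
clean = bytes(len(infected))
bases = {"entry_point": 0x100}
print(scan(infected, bases, [record]), scan(clean, bases, [record]))
```

    Each record costs one slice and one CRC32 over a few bytes, which is why this approach scales to a large database where a full search for every byte string would not.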

    Hooray, here is our first antivirus. It is pretty good, because with a reasonably complete signature database, sensibly chosen flags and good optimization this detector can very quickly catch 95% of all infections (the vast majority of modern malware is just executable files with no ability to mutate). Then the game begins: “who updates the signature database faster” and “who gets sent a new specimen of some nasty thing sooner”.

    Collecting and cataloging this “nastiness” is a highly non-trivial task, but it is absolutely necessary for proper testing of the detector. Building a reference collection of executable files is not easy: try finding all specimens of infected files (several specimens each for the complex cases), cataloging them, mixing them with “clean” files and regularly running the detector over them to reveal detection errors. Such a collection is assembled over years and is a very valuable asset of antivirus companies. Perhaps I am mistaken and it really can be obtained (all sorts of online virus scanning services are quite capable of providing some analogue of it), but when I was working on this it was impossible to get anything of the kind (at least for Linux).

    Heuristic analyzer

    What a scary term, “heuristic analyzer”; these days you will not even see it in antivirus interfaces (it probably frightens users). It is one of the most interesting parts of an antivirus, since everything that does not fit into any of the engines (neither the signature engine nor the emulator) gets shoved into it, and it resembles a doctor who sees that the patient is coughing and sneezing but cannot identify the specific disease. It is code that checks the file for certain characteristic signs of infection. Examples of such signs:
    • an incorrect file header (corrupted by the virus, yet still functional)
    • a JMP right at the entry point
    • “rwx” on the code section

    And so on. Besides indicating the fact of infection, the heuristic can help decide whether a “heavier” analysis of the file is worth running. Each sign has its own weight, from “somewhat suspicious” to “I don't know with what, but the file is definitely infected”. It is these signs that produce most false positives. Let's also not forget that it is the heuristic that can supply the antivirus company with specimens of potential viruses. The heuristic fired but nothing specific was found? Then the file is definitely a candidate for sending to the antivirus company.
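
    A weighted set of signs can be sketched in Python like this. The weights, the thresholds and the sign names are invented for illustration; real products tune such numbers on large collections of files.

```python
# Hypothetical symptom weights.
WEIGHTS = {
    "odd_header": 30,          # header is inconsistent but still loadable
    "jmp_at_entry_point": 40,  # first instruction at the EP is a JMP
    "rwx_code_section": 50,    # code section is writable and executable
}
DEEP_SCAN_THRESHOLD = 40   # worth running heavier analysis
SUSPICIOUS_THRESHOLD = 80  # flag the file and submit it as a sample


def heuristic_verdict(symptoms: set[str]) -> tuple[int, str]:
    """Sum the weights of the observed symptoms and map the score to a verdict."""
    score = sum(WEIGHTS.get(name, 0) for name in symptoms)
    if score >= SUSPICIOUS_THRESHOLD:
        return score, "suspicious"
    if score >= DEEP_SCAN_THRESHOLD:
        return score, "deep-scan"
    return score, "clean"


print(heuristic_verdict({"jmp_at_entry_point"}))
print(heuristic_verdict({"jmp_at_entry_point", "rwx_code_section"}))
print(heuristic_verdict(set()))
```

    Lowering the thresholds catches more infections at the price of more false positives, which is the trade-off described above.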

    Interspecific Interaction and Evolution

    As we have seen, for a fast and accurate comparison the detector needs the signature bytes and their offset. Or, put another way, the contents of the code and the address of its location in the host file. So it is clear why the ideas for hiding the executable code of viruses evolved in two directions:
    • hiding the code of the virus itself;
    • hiding its entry point.

    Hiding the virus code led to the appearance of polymorphic engines, that is, engines that allow the virus to change its code in each new generation. In every newly infected file the virus body mutates in an attempt to hinder detection. This makes it difficult to define the contents of a signature.
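
    Why even a tiny mutation defeats a checksum-based signature can be shown in a few lines of Python. The example is my own: the two byte strings differ only in how “mov ebx, esp” is encoded (8B DC versus 89 E3, two valid encodings of the same instruction), and zlib.crc32 stands in for the detector's CRC32.

```python
import zlib

# Two encodings of the same fragment: the only difference is how
# "mov ebx, esp" is encoded (8B DC versus 89 E3). Behaviour is identical.
variant_a = bytes.fromhex("682f2f7368682f62696e" "8bdc" "b00bcd80")
variant_b = bytes.fromhex("682f2f7368682f62696e" "89e3" "b00bcd80")

signature_crc = zlib.crc32(variant_a)   # what the database was built from

for name, body in (("generation 1", variant_a), ("generation 2", variant_b)):
    matched = zlib.crc32(body) == signature_crc
    print(f"{name}: length={len(body)} matched={matched}")
```

    Same length, same behaviour, different checksum: a record built from the first variant no longer fires on the second.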

    Hiding the entry point (Entry Point Obscuring) in turn prompted the appearance of automatic disassemblers in virus engines, able to identify at least the control transfer instructions. The virus tries to hide the place from which control passes to its code, using whatever in the file eventually leads to a transfer: all kinds of JMP, CALL, RET, address tables and so on. This makes it difficult to specify the signature's offset.

    We will look at some algorithms of such engines and of the detector in more detail in the second article, which I plan to write in the near future.

    In parallel with virus engines and their detectors, commercial protection of executable files was actively developing. A huge number of small commercial programs appeared, and their developers needed engines that could take an EXE file and “wrap” it in an “envelope” that can generate a valid serial number in a protected way. And who knows how to hide executable code and inject it into executable files without loss of functionality? That's right, those same developers from the “adjacent field”. So writing a good polymorphic virus and writing a wrapper protection for an executable file are very similar tasks that use the same algorithms and tools. Analyzing viruses and creating signatures is likewise similar to cracking commercial software.

    There are several pages on the Internet on the topic of “classification of computer viruses”. But we agreed that a virus is something that can reproduce itself in a system and needs a host file to do so. So all sorts of trojans and rootkits are not viruses but kinds of payload that a virus may carry. For the technologies described in this article there can be only one classification of computer viruses: polymorphic and non-polymorphic, that is, changing from generation to generation or not.

    The detector considered in this article easily detects non-polymorphic (also called monomorphic) viruses. And the transition to polymorphic viruses is an excellent occasion to finally wrap up this article, with a promise to return to more interesting methods of hiding executable code in the second part.
