Alternative Application Tracing Methods


    Tracing is used in many kinds of software: emulators, dynamic unpackers, fuzzers. Traditional tracers work on one of four principles: instruction-set emulation (Bochs), binary translation (QEMU), patching binaries to redirect control flow (Pin), or driving the target under a debugger (PaiMei, built on top of IDA). But here we will talk about more interesting approaches.

    Why trace?


    The tasks solved with tracing can be divided into three groups, depending on what is being monitored: program execution (control flow), data flow, or interaction with the OS. Let's look at each in more detail.


    Control flow


    Tracking control flow helps you understand what a binary does at runtime. It is a good way to work with obfuscated code, and if you work with a fuzzer it helps with code-coverage analysis. Or take anti-virus software, where a tracer monitors the execution of a binary, builds a profile of its behavior and also helps with dynamic unpacking of the executable.
    Tracing can happen at different granularities: every instruction, basic blocks, or only selected functions. As a rule it is done by pre/post-instrumentation, that is, by patching the control flow at the most "interesting" places. Another method is simply to attach a debugger to the program under study and handle traps and breakpoints. However, there is another, not very common way: using features of the CPU itself. One interesting feature of Intel processors is the BTF flag in an MSR, which lets you track program execution at the basic-block level, that is, on branches. Here is what the documentation says about this flag:
    "When software sets the BTF flag in the MSR_DEBUGCTLA MSR and the TF flag in the EFLAGS register, the processor generates a debug exception only after it encounters a branch or an exception."
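    To make the idea concrete, here is a minimal sketch in Python that only computes the values involved (MSR address and bit positions per the Intel SDM); the actual WRMSR has to be issued from ring 0, for example from a driver, and EFLAGS.TF is normally set through the thread context by a debugger:

    # Bit values involved in branch-level stepping (per the Intel SDM).
    # Writing the MSR requires ring 0; this snippet only computes the values.
    IA32_DEBUGCTL = 0x1D9      # MSR address (called MSR_DEBUGCTLA on older CPUs)
    DEBUGCTL_BTF  = 1 << 1     # BTF: raise #DB on branches instead of every instruction
    EFLAGS_TF     = 1 << 8     # TF: trap flag in EFLAGS

    def branch_step_values(debugctl, eflags):
        """Values a tracer would write to enable branch-level stepping."""
        return debugctl | DEBUGCTL_BTF, eflags | EFLAGS_TF

    print([hex(v) for v in branch_step_values(0, 0x202)])  # ['0x2', '0x302']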

    Data flow


    In this scenario tracing is used to unpack code and to monitor the handling of valuable data: along the way you can detect incorrect use of objects, overflows and other bugs. It can also be used to save and restore context during tracing. Usually it is done as follows: the library under study is fully disassembled, all read/write instructions in it are located, and then, as the code executes, they are parsed and the destination address is determined. Another option is to set virtual memory protection through the appropriate API function and then monitor all access violations against it. A less common method is to modify the process page tables in memory.
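    As a quick illustration of the API-based variant, here is a minimal user-mode sketch for Windows using ctypes (the functions are the standard kernel32 ones; the exception handler that would actually log the faulting access is omitted):

    # Allocate a page, then revoke access to it with VirtualProtect; any later
    # read/write of the region raises an access violation that the tracer's
    # exception handler can catch, log and resolve by restoring protection.
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.VirtualAlloc.restype = wintypes.LPVOID
    kernel32.VirtualAlloc.argtypes = [wintypes.LPVOID, ctypes.c_size_t,
                                      wintypes.DWORD, wintypes.DWORD]
    kernel32.VirtualProtect.restype = wintypes.BOOL
    kernel32.VirtualProtect.argtypes = [wintypes.LPVOID, ctypes.c_size_t,
                                        wintypes.DWORD, ctypes.POINTER(wintypes.DWORD)]

    MEM_COMMIT, MEM_RESERVE = 0x1000, 0x2000
    PAGE_READWRITE, PAGE_NOACCESS = 0x04, 0x01

    base = kernel32.VirtualAlloc(None, 0x1000, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE)
    old = wintypes.DWORD(0)
    kernel32.VirtualProtect(base, 0x1000, PAGE_NOACCESS, ctypes.byref(old))
    print(hex(base), "is now PAGE_NOACCESS, previous protection:", hex(old.value))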

    Fig.  1. Translation of virtual addresses into physical addresses


    OS interaction


    Monitoring interaction with the OS lets you filter attempts to access the registry, watch file changes, track how the process uses various system resources, and trace calls to particular API functions. As a rule this is implemented by intercepting API functions: inserting trampolines, inline hooks, modifying the import table, setting breakpoints. Another option is to work at the level of the SYSCALL mechanism itself. After all, every API function that makes a change to the OS is, in essence, nothing more than a thin wrapper around a specific system call.
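    This is easy to see for yourself. A small sketch (Windows x64, Python with ctypes; NtCreateFile is just a convenient example) reads the first bytes of an ntdll stub and pulls out the system call ID it hands to the SYSCALL instruction:

    # On 64-bit Windows an ntdll stub starts with "mov r10, rcx; mov eax, <id>;
    # syscall", so the API "wrapper" is literally a few bytes around the ID.
    import ctypes, struct
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetModuleHandleW.restype = wintypes.HMODULE
    kernel32.GetModuleHandleW.argtypes = [wintypes.LPCWSTR]
    kernel32.GetProcAddress.restype = ctypes.c_void_p
    kernel32.GetProcAddress.argtypes = [wintypes.HMODULE, ctypes.c_char_p]

    ntdll = kernel32.GetModuleHandleW("ntdll.dll")
    stub = ctypes.string_at(kernel32.GetProcAddress(ntdll, b"NtCreateFile"), 8)

    # 4C 8B D1 = mov r10, rcx ; B8 xx xx xx xx = mov eax, imm32 (the syscall ID)
    if stub[:4] == b"\x4c\x8b\xd1\xb8":
        print("NtCreateFile syscall ID:", hex(struct.unpack("<I", stub[4:8])[0]))
    else:
        print("unexpected stub layout (hooked ntdll or a 32-bit process?)")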

    Fig.  2. Numbering of SYSCALL identifiers (ID) in Windows 8


    The SYSCALL mechanism is a fast way to switch the CPL (Current Privilege Level) from user mode to supervisor mode, so that a user-mode application can request changes from the OS (Fig. 4).

    Fig.  4. Processing of SYSCALL operations (per the Intel manual)


    Diving into the kernel


    To do the things described above, you have to go down to kernel level (ring 0). In supervisor mode, however, you already have access to some functions provided by the operating system: LoadNotify, ThreadNotify, ProcessNotify. Using them helps collect load and unload information for the target process: the list of modules, thread stack address ranges, the list of child processes and more.
    The second group of functions includes a memory dumper based on MDLs (memory descriptor lists), a process memory monitor based on VADs (Virtual Address Descriptors), and a system-interaction monitor built on intercepting nt!KiSystemCall64 and on handling memory and trap accesses via the IDT (Interrupt Descriptor Table).
    The memory monitor works with the VAD tree, an AVL tree used to store information about a process's address space. It is also consulted when a PTE (Page Table Entry) for a particular memory page needs to be initialized.

    Fig.  3. An example of a VAD tree
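    The VAD tree itself lives in the kernel, but the kind of per-region information it holds (base address, size, state, protection) is what VirtualQuery exposes in user mode. A minimal sketch that walks the current process's address space, just to illustrate what the memory monitor keeps track of:

    # Walk the regions of the current process; each entry roughly corresponds
    # to what a VAD node describes for the memory monitor.
    import ctypes
    from ctypes import wintypes

    class MEMORY_BASIC_INFORMATION(ctypes.Structure):
        _fields_ = [("BaseAddress", ctypes.c_void_p), ("AllocationBase", ctypes.c_void_p),
                    ("AllocationProtect", wintypes.DWORD), ("PartitionId", wintypes.WORD),
                    ("RegionSize", ctypes.c_size_t), ("State", wintypes.DWORD),
                    ("Protect", wintypes.DWORD), ("Type", wintypes.DWORD)]

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.VirtualQuery.restype = ctypes.c_size_t
    kernel32.VirtualQuery.argtypes = [ctypes.c_void_p,
                                      ctypes.POINTER(MEMORY_BASIC_INFORMATION),
                                      ctypes.c_size_t]

    addr, mbi = 0, MEMORY_BASIC_INFORMATION()
    while kernel32.VirtualQuery(addr, ctypes.byref(mbi), ctypes.sizeof(mbi)):
        print(hex(addr), "size", hex(mbi.RegionSize),
              "state", hex(mbi.State), "protect", hex(mbi.Protect))
        addr += mbi.RegionSize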


    As I mentioned above, memory access can be monitored through the memory protection mechanism (pardon the tautology), but implementing that in user mode with API functions can hurt performance too much. However, since memory protection is built on the MMU paging mechanism, there is a simpler way: change the page table from kernel mode, after which a memory-access violation is handled by the processor raising a PageFault exception and transferring control to the handler at IDT[PageFault]. Hooking the page-fault handler lets you quickly get a signal whenever the selected pages are accessed.
    This works because a process can only use memory pages marked as Valid (that is, present in memory); otherwise a PageFault exception is raised, which we catch. So if we deliberately clear the Valid flag (set it to 0) for a chosen memory page, every attempt to access that page will invoke the PageFault handler, which makes it easy to filter and process the corresponding request (by calling the tracer's callback and setting the Valid flag back for that PTE).

    Fig.  5. PTE flags
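    For reference, here is a toy decoder of the low PTE flag bits (bit positions per the Intel SDM; the example value is made up). Clearing bit 0 is exactly the trick described above:

    # Low PTE flag bits on x86/x64; the monitor clears bit 0 (Valid/Present)
    # so that the next access to the page raises #PF and reaches the hook.
    PTE_FLAGS = {0: "Valid (Present)", 1: "Write", 2: "User/Supervisor",
                 3: "Write-Through", 4: "Cache-Disable", 5: "Accessed",
                 6: "Dirty", 7: "PAT/Large-Page", 8: "Global"}

    def decode_pte(pte):
        return [name for bit, name in PTE_FLAGS.items() if pte & (1 << bit)]

    pte = 0x12345867                 # made-up example PTE value
    print(decode_pte(pte))           # bit 0 set: the page is valid
    print(decode_pte(pte & ~1))      # Valid cleared: the next access faults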


    Digging deeper: on to the VMM!


    In the previous section I suggested some rather dirty methods for kernel mode. In general, installing hooks is the wrong way, and I don't like it any more than the folks at Microsoft do: to fight such methods they went to the trouble of developing PatchGuard. Fortunately, there is another way to catch page faults, traps or SYSCALLs: a hypervisor. True, this option has both pros and cons.
    Cons:
    • It is not a single application that gets virtualized but the whole system, at the CPU level.
    • The switch(VMMExit) dispatch eats a bit of performance, as does the hypervisor code executed for each exit reason.

    Pros:
    • A privilege level higher than supervisor mode, plus a whole set of callbacks provided by the virtualization technology.

    At the same time, the VMM (Virtual Machine Monitor) itself can be minimalistic (a microVMM) and implement only the necessary handling while taking up a minimal amount of code.

    Fig.  6. Some callbacks provided by Intel VTx

    In addition, in this case, instead of hooking the IDT you can handle everything directly via the debug exception in the VMM. The same goes for catching page faults: via the PageFault exception in the VMM or through EPT (Extended Page Tables).

    Fig.  7. Enabling VMX exits for traps and faults

    VMM pitfalls


    Some key features of the described approach:
    • the target file remains virtually unchanged;
    • tracing (both single-step and branch-level) uses the TRAP flag;
    • address breakpoints via 0xCC or the DRx registers;
    • memory monitoring by modifying the process page table;
    • no need to patch the binary;
    • it can be used as a tracing module from another application;
    • several applications can be traced at once;
    • several threads of one application can be traced;
    • fast CPL-switching calls are implemented.

    Separating the tracer from the target process's address space into another process has several advantages: you can use it as a standalone module and make bindings for Python, Ruby and other languages. However, this solution also has a drawback: a serious performance hit (inter-process interaction: reading another process's memory, an event-driven waiting mechanism). To speed tracing up, the logic would have to be moved into the target process's address space, so that its resources (memory, stack, register contents) can be accessed quickly, and the VMM would optionally have to be dropped, since VMMExit processing hurts performance, falling back to hooks on traps and PageFault handlers. On the other hand, virtualization will probably become more efficient in future processors and will no longer have such a big impact on performance. Besides, virtualization can be used for tracing much more broadly than we cover in this article, so the advantages may well outweigh the performance loss.

    Kernel tracer


    As for the kernel tracer, the same principles apply:
    • trap (TRAP flag) tracing;
    • memory monitoring by modifying the page table;
    • the tracer's callbacks are delivered to user-level applications;
    • no need to patch the target application's binaries.

    The main feature of such tracers is that you do not need to patch the binary, and that tracing (including unpacking and fuzzing) can be driven from user level (for example, from a tracer written in Python), although performance-wise it is much more efficient to do it directly from kernel mode.
    On the other hand, you have to pay for all these possibilities:
    • the address space the driver runs in is not its own;
    • in-memory fuzzing is not such a simple matter;
    • an incorrect RIP, register or memory value: careless manipulation can end very badly;
    • you must clearly understand what exactly you are tracing or checking;
    • you have to keep the various IRQLs in mind throughout the tracing process;
    • exception handling.

    Separating the tracer from the target process, as well as encapsulating it in a module, gives high scalability and the ability to combine it with other modules into a more complex tool. So if you implement the tracer in, say, Python, you can use IDA Python, LLVM bindings, Dbghelp for debug symbols, disassemblers (the capstone and bea engines) and much more. To show how easily and quickly a Python tracer can be put together, here are a couple of examples.
    The first example catches three accesses (read/write/execute, RWE) to a given memory region:

    target = tracer.GetModule("codecoverme")           # module under test
    dis = CDisasm(tracer)                              # disassembler wrapper
    for i in range(0, 3):
        print("next access")
        tracer.SetMemoryBreakpoint(0x2340000, 0x400)   # watch 0x400 bytes at 0x2340000
        tracer.Go(tracer.GetIp())                      # resume until the region is touched
        inst = dis.Disasm(tracer.GetIp())
        print(hex(inst.VirtualAddr), " : ", inst.CompleteInstr)
        tracer.SingleStep(tracer.GetIp())              # step over the accessing instruction
    


    The next piece of code demonstrates tracing an application at the branch level while skipping the handling of branches outside the main module:

    for i in range(0, 0xffffffff):
      if (target.Begin > tracer.GetIp() or target.Begin + target.Size < tracer.GetIp()):
        # we left the main module: break on the return address and run until we are back
        ret = tracer.ReadPtr(tracer.GetRsp())
        tracer.SetAddressBreakpoint(ret)
        tracer.Go(tracer.GetIp())
        print("out-of-module-hook")
      inst = dis.Disasm(tracer.GetPrevIp())
      print(hex(inst.VirtualAddr), " : ", inst.CompleteInstr)
      tracer.BranchStep(tracer.GetIp())
    


    As you can see, the code is very concise and understandable.
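    For completeness: a disassembler wrapper like the CDisasm used above can be backed, for instance, by the capstone Python bindings. A minimal sketch on a few raw x64 bytes (the byte string and load address are arbitrary):

    from capstone import Cs, CS_ARCH_X86, CS_MODE_64

    code = b"\x48\x31\xc0\x48\xff\xc0\xc3"    # xor rax, rax ; inc rax ; ret
    md = Cs(CS_ARCH_X86, CS_MODE_64)
    for insn in md.disasm(code, 0x140001000):
        print(hex(insn.address), ":", insn.mnemonic, insn.op_str)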

    DbiFuzz framework


    I implemented all of the tracing approaches described above in the DbiFuzz framework, which demonstrates how the execution of an executable can be tracked using alternative methods. As already noted, some of the well-known approaches use instrumentation, which gives a quick solution but means serious interference with the target process and does not preserve the integrity of the binary. In contrast, DbiFuzz leaves the binary almost untouched, changing only PTEs, BTF and inserting the TRAP flag. The flip side of this approach is that every event of interest triggers an interrupt and a ring 3 to ring 0 and back to ring 3 transition. Since DbiFuzz intervenes directly in the context and control flow of the target process, it can be used to write your own tools (even in Python) that access the target binary and its resources.

    WWW

    You can learn more about the DbiFuzz framework on my website, on SlideShare and on the ZeroNights portal.
    The VAD tree is covered in Brendan Dolan-Gavitt's very interesting paper "The VAD tree: A process-eye view of physical memory".


    Show time


    Dynamic binary instrumentation can be useful for many of the problems solved with tracing. As for the DbiFuzz framework, it can be used in the following cases:
    • when you need to trace code on the fly;
    • when unpacking a binary or tracing a malware packer;
    • to monitor the handling of confidential data;
    • for in-memory fuzzing (it is easy to trace and modify a thread);
    • as part of other tools, not necessarily written in C.

    There is no problem launching DbiFuzz on the fly: just set the trap flag or an INT3 breakpoint. Since we do not touch the binary code of the target file, integrity checks pose no problem, and the TRAP flag can be replaced with MTF. Tracking valuable data is also straightforward: just set up the appropriate PTE and your monitor is ready. Python/Ruby/... tools? Just create the necessary bindings and go!
    Of course, the framework has its drawbacks, but overall it offers many useful features. And you can always play with DbiFuzz, use the tools it ships with for your own needs and trace whatever you like.

    To be continued


    As you can see, dynamic binary instrumentation is not the only tracing method. There are many alternatives, and most of them are implemented in the DbiFuzz framework. Even now some of the project's features help with kernel-level work, and in the future I plan to move the entire tracer into that space. In the meantime you can use the framework's source code, improve on the concept and experiment with new ideas...


    First published in Hacker magazine, issue 02/2014.
