Everything you need to know about the garbage collector in Python

Original author: Artem Golubin
  • Transfer
As a rule, you do not need to worry about the garbage collector and working with memory when you write code in Python. Once objects are no longer needed, Python automatically frees memory from under them. Despite this, understanding how GC works will help you write better code.

Memory manager


Unlike other popular languages, Python does not free all memory back to the operating system as soon as it deletes an object. Instead, it uses an additional memory manager, designed for small objects (the size of which is less than 512 bytes). To work with such objects, it allocates large blocks of memory, in which many small objects will be stored in the future.

As soon as one of the small objects is removed - the memory does not pass from under it to the operating system, Python leaves it for new objects with the same size. If there are no objects left in one of the allocated blocks of memory, then Python can release it to the operating system. As a rule, the release of blocks happens when the script creates a lot of temporary objects.

Thus, if a long-lived Python process begins to consume more memory over time, this does not mean that your code has a memory leak problem. If you want to learn more about the memory manager in Python, you can read about it in my other article (eng.).

Garbage collection algorithms


The standard Python interpreter (CPython) uses two algorithms at once, the reference counting and the generational garbage collector (hereinafter GC), better known as the standard Python gc module .

The reference counting algorithm is very simple and efficient, but it has one major drawback. He does not know how to identify circular links. It is because of this that there is an additional collector in python, referred to as generational GC, which tracks objects with potential circular references.

In Python, the reference counting algorithm is fundamental and cannot be disabled, whereas GC is optional and can be disabled.

Link counting algorithm


The reference counting algorithm is one of the easiest techniques for garbage collection. Objects are deleted as soon as they are no longer referenced.

In Python, variables do not store values, but act as references to objects. That is, when you assign a value to a new variable, an object is first created with this value, and only then the variable begins to refer to it. Multiple variables can refer to one object.

Each object in Python contains an additional field (reference count) in which the number of references to it is stored. As soon as someone refers to an object, this field is incremented by one. If for some reason the link disappears, then this field is reduced by one.

Examples when the number of links increases:

  • assignment operator
  • passing arguments
  • insert a new object into the list (the number of links for the object increases)
  • construction of the form foo = bar (foo starts referring to the same object as bar)

As soon as the reference count for a particular object reaches zero, the interpreter starts the process of destroying the object. If the deleted object contained references to other objects, these links are also deleted. Thus, the removal of one object may entail the removal of others.

For example, if a list is deleted, the reference count in all its elements is reduced by one. If all objects inside the list are not used anywhere else, they will also be deleted.

Variables that are declared outside functions, classes, and blocks are called global. Typically, the life cycle of such variables is equal to the life of the Python process. Thus, the number of references to objects referenced by global variables never drops to zero.

Variables that are declared inside a block (function, class) have local visibility (that is, they are visible only inside the block). As soon as the python interpreter exits the block, it destroys all references created by local variables inside it.

You can always check the number of links using the function sys.getrefcount.

An example of the operation of the reference counter:

foo = []
# 2 ссылки, одна от переменной foo и одна от getrefcount
print(sys.getrefcount(foo))
defbar(a):# 4 ссылки# от foo, аргумента функции (a), getrefcount и одна от стека вызова функции
    print(sys.getrefcount(a))
bar(foo)
# 2 ссылки, локальные ссылки внутри функции уничтожены
print(sys.getrefcount(foo))

The main reason why the standard interpreter (CPython) uses a reference count is historical. Currently, there are many debates about this approach. Some people believe that garbage collection can be much more efficient without reference counting. This algorithm has many problems, such as circular references, blocking threads, as well as additional overhead for memory and cpu.

The main advantage of this algorithm is that objects are deleted as soon as they are not needed.

Optional garbage collector


Why do we need an additional algorithm when we already have a reference count?

Unfortunately, the classic reference counting algorithm has one major drawback - it does not know how to find circular references. Cyclic links occur when one or more objects link to each other.

Two examples:

image

As you can see, the object lstrefers to itself, while object1and object2refer to each other. For such objects, the reference count will always be 1.

Python demo:

import gc
# используется ctypes для доступа к объектам по адресу памятиclassPyObject(ctypes.Structure):
    _fields_ = [("refcnt", ctypes.c_long)]
gc.disable()  # выключаем циклический GC
lst = []
lst.append(lst)
# сохраняем адрес списка lst
lst_address = id(lst)
# удаляем ссылку lstdel lst
object_1 = {}
object_2 = {}
object_1['obj2'] = object_2
object_2['obj1'] = object_1
obj_address = id(object_1)
# удаляем ссылкиdel object_1, object_2
# раскомментируйте для запуска ручной сборки объектов с циклическими ссылками# gc.collect()# проверяем счетчик ссылок
print(PyObject.from_address(obj_address).refcnt)
print(PyObject.from_address(lst_address).refcnt)

In the example above, the del instruction removes references to our objects (not the objects themselves). As soon as Python executes the del instruction, these objects become inaccessible from the Python code. However, with the gc module turned off, they will still remain in memory, since they had circular references and their counter is still equal to one. You can visually explore such links using the library objgraph.

In order to fix this problem, an additional algorithm was added to Python 1.5, known as the gc module . The only task of which is the removal of cyclic objects to which there is no longer access from the code.

Cyclic links can occur only in “container” objects. Those. in objects that can store other objects, such as lists, dictionaries, classes, and tuples. GC does not keep track of simple and immutable types, with the exception of tuples. Some tuples and dictionaries are also excluded from the list of surveillance when certain conditions are met. All other objects are guaranteed to handle reference counting.

When GC is triggered


Unlike the reference counting algorithm, cyclic GC does not work in real time and runs periodically. Each launch of the collector creates micropauses in the code, so CPython (the standard interpreter) uses different heuristics to determine the frequency of launching the garbage collector.

The cyclic garbage collector divides all objects into 3 generations (generations). New objects fall into the first generation. If a new object survives the garbage collection process, then it moves to the next generation. The higher the generation, the less often it is scanned for garbage. Since new facilities often have a very small lifespan (are temporary), it makes sense to interview them more often than those that have already gone through several stages of garbage collection.

In each generation there is a special counter and a trigger threshold, upon reaching which the process of garbage collection is triggered. Each counter stores the number of allocations minus the number of allocations in this generation. As soon as any container object is created in Python, it checks these counters. If the conditions work, then the garbage collection process begins.

If several or more generations crossed the threshold at once, then the older generation is chosen. This is due to the fact that older generations also scan all previous ones. To reduce the number of garbage collection pauses for long-lived objects, the oldest generation has an additional set of conditions .

The standard thresholds for generations are set to700, 10 и 10accordingly, but you can always change them using functions gc.get_threshold и gc.set_threshold.

Cycle search algorithm


For a full description of the cycle search algorithm, a separate article is required. In short, the GC iterates each object from the selected generations and temporarily removes all references from a single object (all references to which this object refers). After a full pass, all objects whose reference count less than two is considered inaccessible from the python and can be deleted.

For a deeper understanding, I recommend reading (approx. Translator: English material) the original description of the algorithm from Neil Schemenauer and the function collectfrom the CPython source . A description from Quora and a post about a garbage collector can also be useful.

It should be noted that the problem with destructors described in the original description of the algorithm has been fixed since Python 3.4 (more details in PEP 442 ).

Optimization Tips


Loops often occur in real-world tasks, they can be found in problems with graphs, linked lists or in data structures where you need to keep track of the relationships between objects. If your program has a high load and is demanding for delays, then, if possible, cycles are best avoided.

In places where you deliberately use circular links, you can use "weak" links. Weak links are implemented in the weakref module and, unlike regular links, do not affect the reference count in any way. If an object with weak links is deleted, it is returned instead None.

In some cases, it is useful to disable the automatic assembly module gc and call it manually. To do this, it is enough to call gc.disable()and then call gc.collect()manually.

How to find and debug circular references


Debugging loops can be painful, especially if your code uses a lot of third-party modules.

The gc module provides helper functions that can aid in debugging. If the GC parameters are set on the flag DEBUG_SAVEALL, then all inaccessible objects will be added to the list gc.garbage.

import gc
gc.set_debug(gc.DEBUG_SAVEALL)
print(gc.get_count())
lst = []
lst.append(lst)
list_id = id(lst)
del lst
gc.collect()
for item in gc.garbage:
    print(item)
    assert list_id == id(item)

Once you identify the problem location, you can visualize it with an objgraph .

image

Conclusion

The main garbage collection process is performed by a reference counting algorithm, which is very simple and has no settings. An additional algorithm is used only for finding and deleting objects with circular references.

You should not deal with premature code optimization for the garbage collector, in practice, serious problems with garbage collection are quite rare.

PS: I am the author of this article, you can ask any questions.

Also popular now: