Hunting for memory leaks in Python applications

Original author: https://medium.com/@wyau
Hello! We continue our series of publications dedicated to the launch of the course "Web Developer in Python", and today we are sharing a translation of another interesting article.

At Zendesk, we use Python to build machine learning products. One of the most common problems we have run into in machine learning applications is memory leaks and spikes. The Python code is typically executed in containers using distributed processing frameworks such as Hadoop, Spark, and AWS Batch. Each container is allocated a fixed amount of memory. As soon as the code exceeds that memory limit, the container is terminated with an out-of-memory error.

A quick fix is to allocate more memory. However, this can waste resources and hurt application stability because of unpredictable memory spikes. A memory leak can have the following causes:

  • Lingering large objects that are never released;
  • Reference cycles in the code;
  • Underlying libraries or C extensions that leak memory.
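
The second cause can be reproduced in isolation. The following is a minimal sketch, assuming CPython and a hypothetical `Node` class: two objects that reference each other are not freed by reference counting alone and are reclaimed only when the cyclic garbage collector runs.

```python
import gc
import weakref


class Node:
    """Hypothetical object that can point at a partner object."""

    def __init__(self, name):
        self.name = name
        self.partner = None


def make_cycle():
    """Create two nodes that reference each other; return a weak reference to one."""
    a = Node("a")
    b = Node("b")
    a.partner = b
    b.partner = a
    return weakref.ref(a)


gc.disable()  # keep automatic collection out of the picture for the demo
ref = make_cycle()
# The local names are gone, but the cycle keeps both reference counts above zero.
alive_before_collect = ref() is not None
gc.collect()  # an explicit pass of the cycle collector reclaims both nodes
alive_after_collect = ref() is not None
gc.enable()

print(alive_before_collect, alive_after_collect)
```

The weak reference is used only to observe the object without keeping it alive.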

It is good practice to profile the memory usage of your applications to better understand the space efficiency of the code and of the packages it uses.

This article discusses the following aspects:

  • Profiling application memory usage over time;
  • How to check memory usage in a specific part of the program;
  • Tips for debugging errors caused by memory issues.

Profiling memory over time

You can see how memory usage varies over time while a Python program runs by using the memory-profiler package.

# install the required packages
pip install memory_profiler
pip install matplotlib
# run the profiler to record the memory usage
# samples every 0.1s by default
mprof run --include-children python fantastic_model_building_code.py
# plot the recorded memory usage
mprof plot --output memory-profile.png

Figure A. Memory profiling as a function of time

The --include-children option includes the memory usage of any child processes spawned by the parent process. Figure A shows an iterative model training process, in which memory increases in cycles as batches of training data are processed. The objects are released when garbage collection runs.

If memory usage grows constantly, this points to a potential memory leak. Figure B shows what such a profile looks like.
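
A minimal sketch of this pattern, not the author's original script: a hypothetical module-level list `_cache` keeps every processed batch alive, so the memory tracked by the standard library's `tracemalloc` module grows with each iteration, while the otherwise identical clean version stays flat. Note that `tracemalloc` tracks Python-level allocations, which is not the same measurement that `mprof` plots.

```python
import tracemalloc

_cache = []  # hypothetical module-level list that is never cleared


def process_batch_leaky(batch_id, size=10_000):
    batch = [float(i) for i in range(size)]
    _cache.append(batch)  # the leak: every batch stays referenced
    return sum(batch)


def process_batch_clean(batch_id, size=10_000):
    batch = [float(i) for i in range(size)]
    return sum(batch)  # batch is released when the function returns


def traced_growth(process, n_batches=20):
    """Return the traced memory in bytes recorded after each batch."""
    tracemalloc.start()
    samples = []
    for batch_id in range(n_batches):
        process(batch_id)
        current, _peak = tracemalloc.get_traced_memory()
        samples.append(current)
    tracemalloc.stop()
    return samples


leaky = traced_growth(process_batch_leaky)
clean = traced_growth(process_batch_clean)
print("leaky:", leaky[0], "->", leaky[-1], "bytes")
print("clean:", clean[0], "->", clean[-1], "bytes")
```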

Figure B. Memory usage increasing over time

You can have the debugger break as soon as memory usage exceeds a certain threshold. To do this, use the --pdb-mmem option of memory-profiler, which is handy during troubleshooting.

A memory dump at a specific point in time

It helps to know in advance how many large objects the program is expected to hold, and whether they need to be duplicated and/or converted into different formats.

To analyze the objects in memory further, you can create a heap dump at specific lines of the program using muppy.

# install muppy
pip install pympler
# Add to leaky code within python_script_being_profiled.py
import pandas as pd
from pympler import muppy, summary
all_objects = muppy.get_objects()
sum1 = summary.summarize(all_objects)
# Prints out a summary of the large objects
summary.print_(sum1)
# Get references to certain types of objects such as dataframe
dataframes = [ao for ao in all_objects if isinstance(ao, pd.DataFrame)]
for d in dataframes:
    print(d.columns.values)
    print(len(d))

Figure C. An example of a heap dump summary

Another useful library for memory profiling is objgraph, which can generate object graphs for inspecting the lineage of objects.
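
objgraph presents this kind of information as a graph. As a rough standard-library stand-in (this is the `gc` module, not objgraph's API), the sketch below asks which containers still hold a reference to an object that was expected to be freed. `Model` and `registry` are hypothetical names used only for illustration.

```python
import gc


class Model:
    """Hypothetical object that we expected to be garbage collected."""


# Hypothetical long-lived container that silently keeps the object alive.
registry = {}

model = Model()
registry["current"] = model


def dict_referrers(obj):
    """Return every dict that directly references obj."""
    return [r for r in gc.get_referrers(obj) if isinstance(r, dict)]


# True when the registry is among the dicts holding on to the model.
held_by_registry = any(holder is registry for holder in dict_referrers(model))
print(held_by_registry)
```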

Useful pointers

A useful approach is to create a small “test case” that runs only the code that triggers the memory leak. Consider using a randomly sampled subset of the data if the complete input takes a long time to process.
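
A minimal sketch of drawing such a subset, assuming the input fits in a list. `sample_rows`, the 1% fraction, and the fixed seed are illustrative choices; the seed is fixed so that repeated profiling runs see the same data.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Return a reproducible random subset containing roughly `fraction` of rows."""
    rng = random.Random(seed)  # private generator, does not touch global state
    k = max(1, int(len(rows) * fraction))
    return rng.sample(rows, k)


full_input = list(range(100_000))  # stand-in for the real input records
subset = sample_rows(full_input, fraction=0.01)
print(len(subset))
```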

Perform tasks with high memory load in a separate process

Python does not necessarily release memory back to the operating system immediately. To make sure that memory is released after a piece of code has finished, that code needs to run in a separate process. You can learn more about the garbage collector in Python here.
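
A minimal sketch of running a memory-heavy step in a separate process, using the standard library's `subprocess` module to start a child interpreter and pass back only the small result. `HEAVY_TASK` and `run_in_child` are hypothetical names, and the task itself is a stand-in for real work.

```python
import subprocess
import sys

# Hypothetical memory-heavy task: builds a large list, prints only the result.
HEAVY_TASK = """
data = [float(i) for i in range(1_000_000)]
print(sum(data))
"""


def run_in_child(code):
    """Run `code` in a fresh Python interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with an error
    )
    return result.stdout.strip()


# The large list lives and dies inside the child process.
total = float(run_in_child(HEAVY_TASK))
print(total)
```

When the child exits, the operating system reclaims all of its memory, regardless of what the Python allocator was still holding.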

The debugger can add references to objects

If you use a breakpoint debugger such as pdb, any objects that you create or reference manually from the debugger will remain in memory. This can give a false impression of a memory leak, because those objects are not released in a timely manner.

Beware of packages that may cause memory leaks

Some Python libraries can potentially leak memory; for example, pandas has several known memory leak issues.

Happy leak hunting!

Let us know in the comments whether this article was useful to you. If you would like to learn more about our course, we invite you to the open day, which will be held on April 22.
