Sorting a million 32-bit ints in 2 megabytes of memory in Python

    My translation of an article by Guido van Rossum:

    Someone jokingly asked me whether I could sort a million 32-bit ints in 2 megabytes of memory in Python. While thinking it over, I realized that buffered I/O was the key.

    Of course, the question itself is a joke: in binary representation the data alone would take 4 megabytes! But there is a reasonable interpretation: given a file containing a million 32-bit ints, how do you sort them using a minimal amount of memory? It has to be some kind of merge sort, in which small chunks of the data are sorted and written to temporary files, after which the temporary files are merged to produce the final result.

    Here is my solution:

    #!/usr/bin/env python3.0
    import sys, array, tempfile, heapq
    assert array.array('i').itemsize == 4

    def intsfromfile(f):
        # Read the file in 4000-byte chunks, yielding the ints one at a time.
        while True:
            a = array.array('i')
            a.fromstring(f.read(4000))  # renamed frombytes() in Python 3.2+
            if not a:
                break
            for x in a:
                yield x

    # Read 40000-byte chunks of the input, sort each chunk in memory,
    # and write each sorted run to its own temporary file.
    iters = []
    while True:
        a = array.array('i')
        a.fromstring(sys.stdin.buffer.read(40000))
        if not a:
            break
        f = tempfile.TemporaryFile()
        array.array('i', sorted(a)).tofile(f)
        f.seek(0)
        iters.append(intsfromfile(f))

    # Merge the sorted runs, writing the output in batches of 1000 ints.
    a = array.array('i')
    for x in heapq.merge(*iters):
        a.append(x)
        if len(a) >= 1000:
            a.tofile(sys.stdout.buffer)
            del a[:]
    if a:
        a.tofile(sys.stdout.buffer)
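
    To try the script you need a binary input file. A minimal sketch for generating one (my own helper, not part of the original article; the file names below are made up):

    # gen_input.py - write a million random 32-bit ints in binary form
    import sys, array, random
    a = array.array('i', (random.randint(-2**31, 2**31 - 1)
                          for _ in range(1000000)))
    a.tofile(sys.stdout.buffer)

    Saving the sort itself as sort1m.py, you would then run something like "python3 gen_input.py > input.bin" followed by "python3 sort1m.py < input.bin > output.bin".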

    On my 3-year-old Linux box, it took 5.4 seconds to sort a million random 32-bit ints read from an input file. That is not so bad, considering that sorting the same data entirely in memory takes about 2 seconds:

    #!/usr/bin/env python3.0
    import sys, array
    # Read everything into memory, sort, and write the result back out.
    a = array.array('i', sys.stdin.buffer.read())
    a = list(a)
    a.sort()
    a = array.array('i', a)
    a.tofile(sys.stdout.buffer)

    Let's get back to our sort. The first three lines are straightforward:

    #!/usr/bin/env python3.0
    import sys, array, tempfile, heapq
    assert array.array('i').itemsize == 4

    The first line says that we are using Python 3.0. The second line imports the modules we need. The third line aborts with an assertion error on platforms where the array module's 'i' items are not 4 bytes wide; I was not trying to write portable code.
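
    If portability did matter, one way to pick a 4-byte typecode at run time instead of asserting would be a sketch like this (my own addition, not from the article):

    import array
    # Use whichever signed typecode happens to be 4 bytes on this platform.
    typecode = next(t for t in 'ilq' if array.array(t).itemsize == 4)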

    Then we define a function that reads ints from a file:

    def intsfromfile(f):
        while True:
            a = array.array('i')
            a.fromstring(f.read(4000))
            if not a:
                break
            for x in a:
                yield x

    This is where the buffering pays off: the function reads up to 1000 ints at a time and yields them one by one. At first I did not use buffering: the function simply read 4 bytes from the file, converted them to an integer, and yielded the result. But that version ran 4 times slower! Note that we cannot use a.fromfile(f, 1000) here, since fromfile() raises an error if there is not enough data left in the file, and I wanted the code to adapt to however many ints the file contains.
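
    The unbuffered version is not shown in the article; my reconstruction of it would be a sketch like this, decoding one int per 4-byte read with the struct module:

    import struct

    def intsfromfile_slow(f):
        # One read() call and one unpack per int - about 4 times slower.
        while True:
            data = f.read(4)
            if not data:
                break
            yield struct.unpack('i', data)[0]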

    Next comes the input loop. It repeatedly reads a chunk of 10,000 ints from the input file, sorts it in memory, and writes it to a temporary file. Each temporary file is then wrapped in an iterator, using the intsfromfile() function described above, and appended to a list of iterators that we will need at the final stage.

    iters = []
    while True:
        a = array.array('i')
        a.fromstring(sys.stdin.buffer.read(40000))
        if not a:
            break
        f = tempfile.TemporaryFile()
        array.array('i', sorted(a)).tofile(f)
        f.seek(0)
        iters.append(intsfromfile(f))

    Accordingly, for an input of a million ints, 100 temporary files of 10,000 values each will be created.

    Finally, we merge all these files (each of them already sorted) together. The heapq module contains a wonderful function for this job: heapq.merge(iter1, iter2, ...), which returns an iterator yielding all the elements of its inputs in sorted order, provided each input iterable is itself sorted (as is the case here).
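
    A quick illustration of heapq.merge() on ordinary lists:

    >>> import heapq
    >>> list(heapq.merge([1, 4, 7], [2, 5, 8], [3, 6, 9]))
    [1, 2, 3, 4, 5, 6, 7, 8, 9]

    With that, here is the final merge loop: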

    a = array.array('i')
    for x in heapq.merge(*iters):
        a.append(x)
        if len(a) >= 1000:
            a.tofile(sys.stdout.buffer)
            del a[:]
    if a:
        a.tofile(sys.stdout.buffer)

    This is another place where buffered I/O is essential: writing each individual value to the file as soon as it becomes available halves the speed of the merge step. Simply by buffering the input and the output, we made this program ten times faster!
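
    For comparison, the unbuffered output loop would presumably have looked something like this (again my reconstruction, not code from the article):

    # Write each int the moment it arrives - about half the speed of batching.
    for x in heapq.merge(*iters):
        array.array('i', [x]).tofile(sys.stdout.buffer)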
