Dictionary implementation in Python

    Hello everyone. The Algorithms for Developers course starts at OTUS on April 30, and today's material is published to mark the occasion. Let's get started.



    In this article, you'll learn how dictionaries are implemented in Python.
    Dictionaries are indexed by keys and can be thought of as associative arrays. Let's add three key/value pairs to a dictionary:

    >>> d = {'a': 1, 'b': 2}
    >>> d['c'] = 3
    >>> d
    {'a': 1, 'b': 2, 'c': 3}

    Values can be accessed as follows:

    >>> d['a']
    1
    >>> d['b']
    2
    >>> d['c']
    3
    >>> d['d']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: 'd'

    The key 'd' does not exist, so a KeyError is raised.

    Hash tables

    Dictionaries in Python are implemented using hash tables. A hash table is an array whose indices are computed by applying a hash function to the keys. The goal of a hash function is to distribute the keys evenly across the array. A good hash function minimizes the number of collisions, i.e. the likelihood that different keys will have the same hash. Python's hash functions are not of this kind: the most important ones (for strings and integers) produce very regular values in common cases:

    >>> map(hash, (0, 1, 2, 3))
    [0, 1, 2, 3]
    >>> map(hash, ("namea", "nameb", "namec", "named"))
    [-1658398457, -1658398460, -1658398459, -1658398462]
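
    The session above comes from Python 2, where map returns a list. As a small sketch for Python 3 (an assumption on my part, not part of the original session): small non-negative integers still hash to themselves in CPython, while string hashes are salted per process unless PYTHONHASHSEED is set, so the concrete string values above will not be reproduced.

```python
# CPython: small non-negative integers hash to themselves.
int_hashes = list(map(hash, (0, 1, 2, 3)))
print(int_hashes)  # [0, 1, 2, 3]

# String hashes are salted per process in Python 3, so only their
# stability within a single run is checked here, not concrete values.
keys = ("namea", "nameb", "namec", "named")
str_hashes = [hash(k) for k in keys]
print(str_hashes == [hash(k) for k in keys])  # True
```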

    For the rest of this article we will assume that strings are used as keys. The hash function for strings in Python is defined as follows:

    arguments: string object
    returns: hash
    function string_hash:
        if hash cached:
            return it
        set len to string's length
        initialize var p pointing to 1st char of string object
        set x to value pointed by p left shifted by 7 bits
        while len > 0:
            set var x to (1000003 * x) xor value pointed by p
            increment pointer p
            decrement len
        set x to x xor length of string object
        cache x as the hash so we don't need to calculate it again
        return x as the hash
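
    The pseudocode can be sketched in Python as follows. This is only an illustration under stated assumptions: a 64-bit build, one-byte characters, no hash randomization, and no caching (a plain function has nowhere to cache the result). The name string_hash mirrors the pseudocode. The empty-string shortcut and the final -1 adjustment are my additions from memory of CPython 2.7 and are not in the pseudocode.

```python
def string_hash(s: str) -> int:
    """Sketch of the string hash described above (64-bit, no randomization)."""
    data = s.encode("latin-1")      # assumption: one byte per character
    if not data:
        return 0                    # assumption: empty string hashes to 0
    mask = (1 << 64) - 1            # emulate 64-bit C arithmetic
    x = (data[0] << 7) & mask       # first char shifted left by 7 bits
    for byte in data:
        x = ((1000003 * x) ^ byte) & mask
    x ^= len(data)
    if x >= 1 << 63:                # reinterpret as a signed 64-bit value
        x -= 1 << 64
    if x == -1:                     # assumption: -1 is reserved for errors in C
        x = -2
    return x

print(string_hash("a"))  # 12416037344
```

    The printed value matches the one quoted in this article for hash('a') on a 64-bit machine.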

    If you run hash('a') in Python, string_hash() is executed and returns 12416037344. Here we assume a 64-bit machine.

    If an array of size X is used to store the key/value pairs, a mask equal to X-1 is used to calculate the index of a cell in the array. This makes calculating cell indices fast. The probability of finding an empty cell is quite high thanks to the resizing mechanism described below, so such a simple calculation makes sense in most cases. With an array of size 8, the index for 'a' is hash('a') & 7 = 0. The index for 'b' is 3, the index for 'c' is 2, and the index for 'z' is 3, the same as for 'b': here we get a collision.
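
    The index calculation can be checked with a few lines of Python. The hash values are the 64-bit ones quoted in this article; slot_index is a hypothetical helper name used only for illustration.

```python
# Hash values quoted in this article (64-bit build).
HASHES = {
    "a": 12416037344,
    "b": 12544037731,
    "c": 12672038114,
    "z": 15616046971,
}

def slot_index(hash_value: int, table_size: int) -> int:
    """Index of the first cell to try; table_size must be a power of two."""
    mask = table_size - 1
    return hash_value & mask

indices = {key: slot_index(h, 8) for key, h in HASHES.items()}
print(indices)  # {'a': 0, 'b': 3, 'c': 2, 'z': 3}
```

    Because the table size is a power of two, hash & mask gives the same result as hash % size but needs only a single bitwise operation.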



    As we can see, Python's hash function does its job well when the keys are consecutive, which is good, since such data is quite common. However, as soon as we add the key 'z', a collision occurs, because it is not consecutive with the previous keys.

    We could use a linked list to store the pairs that have the same hash, but this would increase the lookup time for colliding keys, since every lookup would have to walk the list. The following section describes the collision resolution method used for Python dictionaries.

    Open Addressing

    Open addressing is a collision resolution technique that uses probing. In the case of 'z', cell 3 is already in use, so we need to look for another index that is not yet used. Adding a key/value pair takes O(1) on average, as does a lookup.

    To find a free cell, a probing sequence is used. It is often described as quadratic probing, although it is actually generated by the following recurrence:

    j = (5*j) + 1 + perturb;
    perturb >>= PERTURB_SHIFT;
    use j % 2**i as the next table index;

    The recurrence on 5*j+1 quickly magnifies small differences in the bits that did not affect the initial index. The variable perturb brings the other bits of the hash code into play.

    Out of curiosity, let's look at the probing sequence for a table of size 32 and j = 3:

    3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2 ...

    You can learn more about this probing sequence in the source file dictobject.c. A detailed explanation of the probing mechanism can be found at the top of the file.
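
    The recurrence can be sketched as a Python generator. This is an illustration, not the CPython code: probe_sequence is a hypothetical name, PERTURB_SHIFT = 5 is assumed from dictobject.c, and perturb is truncated to 64 bits to imitate an unsigned C integer.

```python
from itertools import islice

PERTURB_SHIFT = 5  # assumed value, as in dictobject.c

def probe_sequence(hash_value, table_size):
    """Yield the table indices to try, starting with hash & mask."""
    mask = table_size - 1                        # table_size is a power of two
    perturb = hash_value & 0xFFFFFFFFFFFFFFFF    # assumed: unsigned 64-bit, as in C
    j = hash_value & mask
    yield j
    while True:
        j = (5 * j + 1 + perturb) & mask         # same as j % 2**i
        perturb >>= PERTURB_SHIFT
        yield j

# With all perturb bits zero, 5*j + 1 alone visits every cell once.
no_perturb = list(islice(probe_sequence(0, 8), 8))
print(no_perturb)  # [0, 1, 6, 7, 4, 5, 2, 3]

# The key 'z' from this article (hash 15616046971) in a table of size 8.
z_probes = list(islice(probe_sequence(15616046971, 8), 4))
print(z_probes)  # [3, 3, 3, 5]
```

    Once perturb has been shifted down to zero, the recurrence reduces to 5*j + 1, which cycles through all cells of a power-of-two table, so a free cell is eventually found if one exists.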



    Let's look at the Python source code with this example.

    C dictionary structures

    The following C structure is used to store a dictionary entry, i.e. a key/value pair. It holds the hash, the key and the value. PyObject is the base type for Python objects.

    typedef struct {
        Py_ssize_t me_hash;
        PyObject *me_key;
        PyObject *me_value;
    } PyDictEntry;

    The following structure represents a dictionary. ma_fill is the total number of used and inactive cells. A cell becomes inactive when its key/value pair is deleted. ma_used is the number of used (active) cells. ma_mask equals the size of the array minus 1 and is used to calculate the cell index. ma_table is the array, and ma_smalltable is the initial array of size 8.

    typedef struct _dictobject PyDictObject;
    struct _dictobject {
        PyObject_HEAD
        Py_ssize_t ma_fill;
        Py_ssize_t ma_used;
        Py_ssize_t ma_mask;
        PyDictEntry *ma_table;
        PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
        PyDictEntry ma_smalltable[PyDict_MINSIZE];
    };

    Initializing a Dictionary

    When you create a dictionary, the function PyDict_New() is called. I removed some lines and converted the C code to pseudocode to focus on the key concepts.

    Function PyDict_New(), which returns a new dictionary object:

    • Allocates a new dictionary object;
    • Clears the dictionary table;
    • Sets the number of used and inactive cells (ma_fill) to 0;
    • Sets the number of active cells (ma_used) to 0;
    • Sets the dictionary mask (ma_mask) to the table size minus 1, i.e. 8 - 1 = 7;
    • Sets the dictionary lookup function to lookdict_string;
    • Returns the allocated dictionary object.
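
    The initial state can be mirrored in Python. This is a rough model for illustration only: DictEntry, DictObject and py_dict_new are hypothetical names, None stands in for a NULL key, and the lookup function pointer is left out.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

PyDict_MINSIZE = 8  # size of the initial table

@dataclass
class DictEntry:
    me_hash: int = 0
    me_key: Optional[Any] = None     # None stands in for a NULL key (unused cell)
    me_value: Optional[Any] = None

@dataclass
class DictObject:
    ma_fill: int = 0                 # used + inactive cells
    ma_used: int = 0                 # active cells
    ma_mask: int = PyDict_MINSIZE - 1
    ma_table: List[DictEntry] = field(
        default_factory=lambda: [DictEntry() for _ in range(PyDict_MINSIZE)]
    )

def py_dict_new() -> DictObject:
    """Allocate an empty dictionary with a cleared table of 8 cells."""
    return DictObject()

d = py_dict_new()
print(d.ma_fill, d.ma_used, d.ma_mask, len(d.ma_table))  # 0 0 7 8
```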

    Adding an element

    When a new key/value pair is added, PyDict_SetItem() is called. This function takes a pointer to a dictionary object and a key/value pair as input. It checks whether the key is a string and either computes the hash or reuses the cached one if it exists. insertdict() is then called to add the new key/value pair, and the dictionary is resized if the number of used and inactive cells exceeds 2/3 of the size of the array.

    Why exactly 2/3? This ensures that the probing sequence can find free cells quickly enough. We will look at the resizing function later.

    arguments: dictionary, key, value
    returns: 0 if OK or -1
    function PyDict_SetItem:
        if key's hash cached:
            use hash
        else:
            calculate hash
        call insertdict with dictionary object, key, hash and value
        if key/value pair added successfully and capacity over 2/3:
            call dictresize to resize dictionary's table
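
    The 2/3 condition can be written with integer arithmetic only. A minimal sketch, assuming the check compares ma_fill * 3 with twice the table size (this is how I recall it from CPython 2.7; needs_resize is a hypothetical name):

```python
def needs_resize(ma_fill: int, ma_mask: int) -> bool:
    """True once used + inactive cells reach 2/3 of the table size."""
    table_size = ma_mask + 1
    return ma_fill * 3 >= table_size * 2   # assumed form of the check

for fill in range(4, 7):
    print(fill, needs_resize(fill, 7))
# 4 False
# 5 False
# 6 True
```

    For a table of 8 cells the threshold is reached when the sixth cell is filled, since 6 * 3 = 18 is greater than 8 * 2 = 16.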

    insertdict() uses the lookup function lookdict_string() to find a free cell. The same function is used to look up a key.

    lookdict_string() computes the cell index from the hash and the mask. If it cannot find the key at slot index = hash & mask, it starts probing with the loop described above until it finds a free cell. On the first probing attempt, if the key is null, it returns the inactive (previously deleted) cell if one was found during the first lookup. This gives priority to reusing previously deleted cells.

    We would like to add the following key/value pairs: {'a': 1, 'b': 2, 'z': 26, 'y': 25, 'c': 3, 'x': 24}. Here's what happens:

    The dictionary structure is allocated with a table size of 8.

    • PyDict_SetItem: key = 'a', value = 1
      • hash = hash ('a') = 12416037344
      • insertdict
        • lookdict_string
          • slot index = hash & mask = 12416037344 & 7 = 0
          • slot 0 is not used, return this cell
        • initialization of entry at index 0 with key, value and hash
        • ma_used = 1, ma_fill = 1
    • PyDict_SetItem: key = 'b', value = 2
      • hash = hash ('b') = 12544037731
      • insertdict
        • lookdict_string
          • slot index = hash & mask = 12544037731 & 7 = 3
          • slot 3 is not used, return this cell
        • initialization of entry at index 3 with key, value and hash
        • ma_used = 2, ma_fill = 2
    • PyDict_SetItem: key = 'z', value = 26
      • hash = hash ('z') = 15616046971
      • insertdict
        • lookdict_string
          • slot index = hash & mask = 15616046971 & 7 = 3
          • slot 3 is used, try another cell: 5 is free
        • initialization of entry at index 5 with key, value and hash
        • ma_used = 3, ma_fill = 3
    • PyDict_SetItem: key = 'y', value = 25
      • hash = hash ('y') = 15488046584
      • insertdict
        • lookdict_string
          • slot index = hash & mask = 15488046584 & 7 = 0
          • slot 0 is used, try another cell: 1 is free
        • initialization of entry at index 1 with key, value and hash
        • ma_used = 4, ma_fill = 4
    • PyDict_SetItem: key = 'c', value = 3
      • hash = hash ('c') = 12672038114
      • insertdict
        • lookdict_string
          • slot index = hash & mask = 12672038114 & 7 = 2
          • slot 2 is not used, return this cell
        • initialization of entry at index 2 with key, value and hash
        • ma_used = 5, ma_fill = 5
    • PyDict_SetItem: key = 'x', value = 24
      • hash = hash ('x') = 15360046201
      • insertdict
        • lookdict_string
          • slot index = hash & mask = 15360046201 & 7 = 1
          • slot 1 is used, try another cell: 7 is free
        • initialization of entry at index 7 with key, value and hash
        • ma_used = 6, ma_fill = 6
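
    The walkthrough can be replayed with a short simulation. It is a simplified sketch: find_slot is a hypothetical stand-in for lookdict_string() that ignores inactive cells and resizing, the hashes are the 64-bit values listed above, and PERTURB_SHIFT = 5 is assumed from dictobject.c.

```python
PERTURB_SHIFT = 5  # assumed value, as in dictobject.c

# Hash values quoted in the walkthrough (64-bit build).
HASHES = {"a": 12416037344, "b": 12544037731, "z": 15616046971,
          "y": 15488046584, "c": 12672038114, "x": 15360046201}

def find_slot(table, key, hash_value):
    """Return the index of the cell holding key, or of the first free cell."""
    mask = len(table) - 1
    j = hash_value & mask
    perturb = hash_value
    while table[j] is not None and table[j][0] != key:
        j = (5 * j + 1 + perturb) & mask
        perturb >>= PERTURB_SHIFT
    return j

table = [None] * 8
for key, value in [("a", 1), ("b", 2), ("z", 26), ("y", 25), ("c", 3), ("x", 24)]:
    slot = find_slot(table, key, HASHES[key])
    table[slot] = (key, value)

layout = {entry[0]: i for i, entry in enumerate(table) if entry is not None}
print(layout)  # {'a': 0, 'y': 1, 'c': 2, 'b': 3, 'z': 5, 'x': 7}
```

    Cells 4 and 6 stay free, which agrees with the six insertions traced above.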

    Here's what we get:



    Now 6 of the 8 cells are used, so more than 2/3 of the array's capacity is occupied. dictresize() is called to allocate a larger array. This function also copies the entries from the old table to the new one.

    In our case dictresize() is called with minused = 24, which is 4 * ma_used. 2 * ma_used is used instead when the number of used cells is very large (more than 50,000). Why 4 times the number of used cells? This reduces the number of resize steps and increases sparseness.

    The new table size must be greater than 24. It is calculated by shifting the current size left by 1 bit until it exceeds 24. The result is 32: 8 -> 16 -> 32.
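
    The size calculation can be sketched as follows. new_table_size is a hypothetical name, and starting the doubling from the minimum size of 8 is an assumption based on my reading of CPython 2.7; in this example it coincides with the current size.

```python
PyDict_MINSIZE = 8

def new_table_size(ma_used: int) -> int:
    """Smallest power of two (at least 8) strictly greater than minused."""
    minused = (2 if ma_used > 50000 else 4) * ma_used
    newsize = PyDict_MINSIZE          # assumed starting point
    while newsize <= minused:
        newsize <<= 1                 # 8 -> 16 -> 32 -> ...
    return newsize

print(new_table_size(6))  # 32
```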

    Here's what happens to our table during resizing: a new table of size 32 is allocated. The entries of the old table are inserted into the new table using the new mask value of 31. The result is as follows:
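
    The new positions can be derived from the hashes quoted earlier. This is a small sketch rather than the real reinsertion code: it only applies the new mask and relies on the fact that, for these six keys, no two indices coincide, so no probing is needed.

```python
# Hash values quoted in the walkthrough (64-bit build), in insertion order.
HASHES = {"a": 12416037344, "b": 12544037731, "z": 15616046971,
          "y": 15488046584, "c": 12672038114, "x": 15360046201}

new_mask = 32 - 1
new_slots = {key: h & new_mask for key, h in HASHES.items()}
print(new_slots)  # {'a': 0, 'b': 3, 'z': 27, 'y': 24, 'c': 2, 'x': 25}
```

    With 32 cells the keys that collided in the table of size 8 ('b' and 'z', 'a' and 'y', 'y' and 'x') now land in different cells.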



    Deleting items

    PyDict_DelItem() is called to delete an entry. The hash of the key is calculated, then the lookup function is called to return the entry. The cell is then marked as inactive rather than emptied.

    We want to remove the key 'c' from our dictionary. As a result, we get the following array:



    Note that deleting an element does not trigger a resize of the array, even if the number of used cells is much smaller than the total number of cells. However, when a key/value pair is added, the need to resize depends on the number of used and inactive cells, so an insertion can also shrink the array.
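
    The bookkeeping for a deletion can be sketched as follows. DUMMY and delete are hypothetical names, and the slot positions are the hash & 31 values of the six keys; the point is only that the cell is marked inactive and that ma_fill stays the same while ma_used drops.

```python
DUMMY = ("<dummy>", None)  # hypothetical marker for an inactive cell

# Table of size 32 after the resize: slot -> (key, value).
table = [None] * 32
for slot, entry in {0: ("a", 1), 2: ("c", 3), 3: ("b", 2),
                    24: ("y", 25), 25: ("x", 24), 27: ("z", 26)}.items():
    table[slot] = entry
ma_used = ma_fill = 6

def delete(table, slot):
    """Mark the cell as inactive instead of emptying it."""
    table[slot] = DUMMY

delete(table, 2)   # remove 'c'
ma_used -= 1       # one active cell fewer
# ma_fill is unchanged: it counts active + inactive cells
print(ma_used, ma_fill, table[2] is DUMMY)  # 5 6 True
```

    Keeping an inactive marker instead of an empty cell means that a probing sequence passing through this cell does not stop early when looking for keys stored further along.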

    That brings this publication to an end. As usual, we look forward to your comments and invite everyone to an open lesson, which will be held on April 18.
