"Deleting" objects in Django



    Sooner or later, developers are faced with the task of deleting unnecessary data. And the more complex the service, the more nuances must be taken into account. In this article, I will tell you how we implemented a “delete” in a database with a hundred links.

    Prehistory


    To monitor the performance of most Mail.ru Group and VKontakte projects , a proprietary service called Monitoring is used . Starting its history from the end of 2012, for 6 years the project has grown into a huge system, which has acquired a lot of functionality. Monitoring regularly checks the availability of servers and the correctness of responses to requests, collects statistics on memory usage, processor utilization, etc. When the parameters of the monitored server are beyond the allowed values, those responsible for the server receive notifications in the system and via SMS.

    All checks and incidents are logged to track the dynamics of server characteristics, so the database has reached the order of hundreds of millions of records. Periodically, new servers appear, and the old ones are no longer used. Information about unused servers must be removed from the Monitoring system in order to: a) not overload the interface with unnecessary information, and b) release the unique identifiers .

    Deletion


    It is not for nothing that in the title of the article the word “deletion” was written in quotes. You can remove an object from the system in several ways:

    • completely removing from the database;
    • marking objects as deleted and hiding from the interface. As a marker, you can use Boolean, or DateTime for more accurate logging.

    Iteration # 1


    Initially, the first approach was used when we simply executed object.delete()and the object was deleted with all dependencies. But over time, we had to abandon this approach, since one object could have dependencies with millions of other objects, and the cascading deletion tightly blocked the tables. And since the service performs thousands of checks every second and logs them, locking the tables led to a serious slowdown in the service, which was unacceptable for us.

    Iteration # 2


    To avoid long locks, we decided to delete the data in chunks. This would allow to record actual monitoring data during the intervals between deletions of objects. The list of all objects that will be deleted in cascade can be obtained using the method that is used in the admin panel when deleting an object (when confirming the deletion):

    from django.contrib.admin.util import NestedObjects
    from django.db import DEFAULT_DB_ALIAS
    collector = NestedObjects(using=DEFAULT_DB_ALIAS)
    collector.collect([obj])
    objects_to_delete = collector.nested()
    # Recursive delete objects

    The situation has improved: the load was distributed over time, new data began to be recorded faster. But we immediately ran into the next pitfall. The fact is that the list of deleted objects is formed at the very beginning of the deletion, and if in the process of “portioned” deletion new dependent objects are added, then the parent element cannot be deleted.

    We immediately abandoned the idea in case of an error in recursive deletion again to collect data on new dependencies or prohibit adding dependent entries during deletion, because a) you can go into an infinite loop or b) you have to find all the add code of all dependent objects .

    Iteration # 3


    We thought about the second type of deletion, when the data is marked and hidden from the interface. Initially, this approach was rejected, because finding all requests and adding a filter to the absence of a remote parent element seemed to be a task for at least a week. In addition, there was a high probability of missing the necessary code, which would lead to unpredictable consequences.

    Then we decided to use decorators to override the query manager. Further, it is better to see the code than to write a hundred words.

    defexclude_objects_for_deleted_hosts(*fields):"""
        Decorator that adds .exclude({field__}is_deleted=True)
        for model_class.objects.get_queryset
        :param fields: fields for exclude condition
        """defwrapper(model_class):defapply_filters(qs):for field in filter_fields:
                    qs = qs.exclude(**{
                        '{}is_deleted'.format('{}__'.format(field) if field else''): True,
                    })
                return qs
            model_class.all_objects = copy.deepcopy(model_class.objects)
            filter_fields = set(fields)
            get_queryset = model_class.objects.get_queryset
            model_class.objects.get_queryset = lambda: apply_filters(get_queryset())
            # save info about model decorator
            setattr(model_class, DECORATOR_DEL_HOST_ATTRIBUTE, filter_fields)
            return model_class
        return wrapper
    

    The decorator exclude_objects_for_deleted_hosts(fields)for the specified model fields fieldsautomatically adds a filter for each request exclude, which just removes the records that should not be displayed in the interface.

    Now it is enough for all models, which will be affected in any way by the removal, to add a decorator:

    @exclude_objects_for_deleted_hosts('host')classAlias(models.Model):
        host = models.ForeignKey(to=Host, verbose_name='Host', related_name='alias')
    

    Now, in order to remove an object Host, just change the attribute is_deleted:

    host.is_deleted = True# after this save the host and all related objects will be inaccessible
    host.save()

    All requests will automatically exclude records referencing deleted objects:

    # model decorator @exclude_objects_for_deleted_hosts('checker__monhost', 'alias__host')
    CheckerToAlias.objects.filter(
        alias__hostname__in=['cloud.spb.s', 'cloud.msk.s']
    ).values('id')
    

    It turns out such a SQL query:

    SELECT 
        monitoring_checkertoalias.id
    FROM
        monitoring_checkertoalias
            INNERJOIN
        monitoring_checker ON 
                (`monitoring_checkertoalias`.`checker_id` = monitoring_checker.`id`)
            INNERJOINHostsON (`monitoring_checker`.`monhost_id` = Hosts.`id`)
            INNERJOIN
        dcmap_alias ON (`monitoring_checkertoalias`.`alias_id` = dcmap_alias.`id`)
            INNERJOINHosts T5 ON (`dcmap_alias`.`host_id` = T5.`id`)
    WHERE (
        NOT (`Hosts`.`is_deleted` = TRUE)  -- раз, проверка для monitoring_checkerANDNOT (T5.`is_deleted` = TRUE) -- два, проверка для dcmap_aliasAND
        dcmap_alias.name IN ('dir1.server.p', 'dir2.server.p')
    );

    As you can see, in the request additional joines were added for the fields specified in the decorator and for checking `is_deleted` = TRUE.

    Little about numbers


    It is logical that additional joines and conditions increase query execution time. The study of this question showed that the degree of "complication" depends on the structure of the database, the number of records and the presence of indices.

    Specifically, in our case, for each level of dependency, the request is penalized by about 30%. This is the maximum penalty that we get on the largest table with millions of records; on a smaller table, the penalty is reduced to a few percent. Fortunately, we have the necessary indexes set up, and for the majority of critical queries, the necessary joines were already there, so we did not feel a big difference in performance.

    Unique Identifiers


    Before you delete the data, you may need to free the identifiers that you plan to use in the future, because this may cause a non-unique value error when creating a new object. Despite the fact that in the Django-application will not be visible deleted objects, they will still be in the database. Therefore, for deleted objects we add uuid to the identifier.

    host.hostname = '{}_{}'.format(host.hostname, uuid.uuid4())
    host.is_deleted = True
    host.save()
    

    Exploitation


    For each new model or dependency, you need to update the decorator if you need one. To simplify the search for dependent models, we wrote a “smart” test:

    deftest_deleted_host_decorator_for_models(self):defrecursive_host_finder(model, cache, path, filters):# cache for skipping looked models
            cache.add(model)
            # process all related modelsfor field in (f for f in model._meta.fields if isinstance(f, ForeignKey)):
                if field.related_model == Host:
                    filters.add(path + field.name)
                elif field.related_model notin cache:
                    recursive_host_finder(field.related_model, cache.copy(),
                                          path + field.name + '__', filters)
        # check all modelsfor current_model in apps.get_models():
            model_filters = getattr(current_model, DECORATOR_DEL_HOST_ATTRIBUTE, set())
            found_filters = set()
            if current_model == Host:
                found_filters.add('')
            else:
                recursive_host_finder(current_model, set(), '', found_filters)
            if found_filters or model_filters:
                try:
                    self.assertSetEqual(model_filters, found_filters)
                except AssertionError as err:
                    err.args = (
                        '{}\n !!! Fix decorator "exclude_objects_for_deleted_hosts" ''for model {}'.format(err.args[0], current_model),
                    )
                    raise err
    

    The test recursively checks all models for dependence on the model being deleted, then looks to see if the decorator has been set to the required fields for this model. If something is missing, the test will delicately tell you where to add the decorator.

    Epilogue


    Thus, with the help of a decorator, it was possible with a little blood to realize the “deletion” of data that has a large number of dependencies. All requests automatically receive a mandatory filter exclude. The imposition of additional conditions slows down the process of obtaining data, the degree of "complication" depends on the database structure, the number of records and the presence of indices. The proposed test will tell you which models need to add decorators, and in the future will monitor their consistency.

    Also popular now: