Parameterized queries and django orm performance

Published on August 05, 2012

Parameterized queries and django orm performance

    Faced with significant performance losses when using django orm, I began to look for a way out, considering different ways to use orm. What I did - see the roll.


    How is a regular piece of code written using django orm?

    As a rule, this piece enters a certain function, well, for example, view, receives parameters and forms a result based on these parameters.

    As an example, consider the following basic situation: we want to get a list of names of groups that the current user belongs to. The simplest and most obvious way that comes to mind in the first place is to get a list of groups through a relationship and find out their names:

    def myview(request):
      u = request.user
      a = [g.name for g in u.groups.all()]
      ...
    


    We’ll check what the performance of this piece will be, bearing in mind that the user object request.user is already received at the stage of preliminary processing of the request.

    Create the thetest group and join the very first user to it:
    >>> u = User.objects.all()[0]
    >>> g = Group(name='thetest')
    >>> g.save()
    >>> u.groups.add(g)
    >>> u.groups.all()
    [<Group: thetest>]
    


    I will use this case in all future tests. Since everything is done through the shell, I also use the variable u obtained at this stage in them.

    So, test number 1, we execute the conceived piece of code. Check if it really returns the search list:

    >>> a = [g.name for g in u.groups.all()]
    >>> a
    [u'thetest']
    


    To measure performance, run it 1000 times.

    >>> def test1():
    ...  import datetime
    ...  t1 = datetime.datetime.now()
    ...  for i in xrange(1000):
    ...   a = [g.name for g in u.groups.all()]
    ...  t2 = datetime.datetime.now()
    ...  print "%s" % (t2 - t1)
    ... 
    >>> test1()
    0:00:01.437324
    

    A thousand revolutions of our cycle took about one and a half seconds, which gives 1.5 milliseconds per request.

    Experienced jungle writers have probably already gathered to poke my nose at the fact that this piece is far from optimality. Indeed, you can write at first glance a more optimal piece of code that performs the same actions without constructing a group object and retrieving from the database only the data that we really need:

    >>> a = [g['name'] for g in u.groups.values('name')]
    >>> a
    [u'thetest']
    


    Well, measure this piece.

    >>> def test2():
    ...  import datetime
    ...  t1 = datetime.datetime.now()
    ...  for i in xrange(1000):
    ...   a = [g['name'] for g in u.groups.values('name')]
    ...  t2 = datetime.datetime.now()
    ...  print "%s" % (t2 - t1)
    ... 
    >>> test2()
    0:00:01.752529
    

    It seems unnatural, but is the second version of our code less optimal than the first?

    In fact, this is the way it is. Losses on calling values ​​() and additional analysis of the request turned out to be higher than the potential savings on the construction of the Group object and obtaining the values ​​of all its fields.

    But excuse me? And why, in fact, each time we re-design and analyze the request, if in our view we always execute the same request , and only the user object on which this request is executed will differ?

    Unfortunately, django does not initially allowprepare the request in advance, referring to the prepared request as necessary. There are no corresponding calls, and the query generation syntax implies the use of only specific values ​​as query parameters.

    We'll have to climb a little bit on the source. I would like to take this opportunity to thank the developers of django_extensions and their wonderful shell_plus team, which greatly facilitates the introspection of objects.

    It turns out that the QuerySet object (this is the one that is obtained for example at the time of accessing objects.all ()) has a query property, an object of the django.db.models.sql.query.Query class. Which in turn has a sql_with_params () method

    This method returns a set of parameters that is completely ready for passing to cursor.execute () - that is, a SQL expression string and additional parameters. The great thing is that these very advanced parameters are the parameters that are passed to the QuerySet when it is formed:

    >>> u.groups.all().values('name').query.sql_with_params()
    ('SELECT `auth_group`.`name` FROM `auth_group` INNER JOIN `auth_user_groups` ON (`auth_group`.`id` = `auth_user_groups`.`group_id`) WHERE `auth_user_groups`.`user_id` = %s ', (1,))
    


    Now, if we get a prepared SQL query and substitute various parameter values ​​into it, we can execute the query without wasting resources on preparing the query.

    To do this, create a special class that hides inside all the details of the hack that we are going to commit.

    from django.db import connection
    from django.db.models.query import QuerySet,ValuesQuerySet
    import django
    from threading import local
    class PQuery(local):
        def __init__(self,query,connection=connection,**placeholders):
            self.query = query
            self.connection = connection
            self.placeholders = placeholders
            self.replaces = {}
            sql = None
            try:
                sql = self.query.query.sql_with_params() # 1.4
            except AttributeError:
                sql = self.query.query.get_compiler(connection=self.connection).as_sql() # 1.3, lower?
            self.places = list(sql[1])
            self.sql = sql[0]
            self.is_values = isinstance(query,ValuesQuerySet)
            self.cursor = None
            for i in xrange(len(self.places)):
                x = self.places[i]
                found = False
                for p in self.placeholders:
                    v = self.placeholders[p]
                    if x == v:
                        found = True
                        if not p in self.replaces:
                            self.replaces[p] = []
                        self.replaces[p].append(i)
                if not found:
                    raise AttributeError("The placeholder %(ph)s not found, please add some_name=%(ph)s to the list of constructor parameters" % {
                        'ph':repr(x)
                    })
        def execute(self,**kw):
            try:
                for k in kw:
                    for i in self.replaces[k]:
                        self.places[i] = kw[k]
            except KeyError,ex:
                raise TypeError("No such placeholder: %s" % k)
            if not self.cursor:
                self.cursor = self.connection.cursor()
            self.cursor.execute(self.sql,self.places)
            if not hasattr(self,'fldnms'):
                self.fldnms = [col[0] for col in self.cursor.description]
            if self.is_values:
                return [dict(zip(self.fldnms,row)) for row in self.cursor.fetchall()]
            return [self.query.model(**dict(zip(self.fldnms,row))) for row in self.cursor.fetchall()]
        def __call__(self,**kw):
            return self.execute(**kw)
    ParametrizedQuery = PQuery # compatibility issue
    

    UPD: 2012-08-06 19:20:00 MSK - made corrections to the code regarding compatibility with multitrading, correcting minor bugs when performing complex queries and improving usability.
    Previous code version
    from django.db import connection
    from django.db.models.query import QuerySet,ValuesQuerySet
    class ParametrizedQuery:
        def __init__(self,query,connection=connection,**placeholders):
            self.query = query
            self.connection = connection
            self.placeholders = placeholders
            self.replaces = {}
            sql = self.query.query.sql_with_params()
            self.places = list(sql[1])
            self.sql = sql[0]
            self.is_values = isinstance(query,ValuesQuerySet)
            self.cursor = None
            for p in self.placeholders:
                v = self.placeholders[p]
                self.replaces[p] = self.places.index(v)
        def execute(self,**kw):
            for k in kw:
                self.places[self.replaces[k]] = kw[k]
            if not self.cursor:
                self.cursor = self.connection.cursor()
            self.cursor.execute(self.sql,self.places)
            if not hasattr(self,'fldnms'):
                self.fldnms = [col[0] for col in self.cursor.description]
            if self.is_values:
                return [dict(zip(self.fldnms,row)) for row in self.cursor.fetchall()]
            return [self.query.model(**dict(zip(self.fldnms,row))) for row in self.cursor.fetchall()]
    


    What does this class do? He receives a request and picks out prepared SQL and parameters from it. We can form such a request in which each of the parameters that we are going to substitute has a special value known to us in advance. We will use these values ​​to search for the place where we want to substitute the values ​​passed during execution.

    A few additional implementation details will also help us save resources.
    • The fldnms property contains an array of field names obtained during the first execution of the request. Subsequent calls will use the prepared array.
    • The replaces property contains the mapping of substitution names to parameter numbers.
    • Each object of our class will keep its own cursor. The potential acceleration from this step is the result of, first, the creation of the cursor is a rather expensive operation, and secondly, of the following phrase from the pyodbc description , which can be used as a backend of the database: “It is also be more efficient if you execute the same SQL repeatedly with different parameters. The SQL will be prepared only once. (pyodbc only keeps the last statement prepared, so if you switch between statements, each will be prepared multiple times.) "
    • The is_values ​​property will help us determine that a query should not return a model object, which will save on creating such an object when returning results.


    We’ll slightly modify the original request so that it’s convenient to slip the substitutions there:

    >>> q = Group.objects.filter(user__id=12345).values('name')
    >>> q.query.sql_with_params()
    ('SELECT `auth_group`.`name` FROM `auth_group` INNER JOIN `auth_user_groups` ON (`auth_group`.`id` = `auth_user_groups`.`group_id`) WHERE `auth_user_groups`.`user_id` = %s ', (12345,))
    


    We use the value 12345 as a substitution:

    >>> p = ParametrizedQuery(q,user_id=12345)
    >>> [g['name'] for g in p.execute(user_id=u.id)]
    [u'thetest']
    


    When the p.execute () request was executed, the real value of the user identifier was substituted for the 12345 substitution place.

    Now let's try to see how the code performance will change:

    >>> def test3():
    ...  import datetime
    ...  t1 = datetime.datetime.now()
    ...  for i in xrange(1000):
    ...   a = [g['name'] for g in p.execute(user_id=u.id)]
    ...  t2 = datetime.datetime.now()
    ...  print "%s" % (t2 - t1)
    ... 
    >>> test3()
    0:00:00.217270
    

    This is the result! The query execution time was reduced by 7 times .

    How to use it in real code?

    First, you need a place in which a prepared request could be stored. Secondly, at some point, you need to fill in this variable. For example, at the time of the first execution of the function code. And thirdly, of course, we use a call to a parameterized query instead of directly executing the query.

    def myview(request):
     if not hasattr(myview,'query'):
      myview.query = ParametrizedQuery(Group.objects.filter(user__id=12345).values('name'),user_id=12345)
     a = [g['name'] for g in myview.query.execute(user_id=request.user.id)]
     ...
    


    All code was executed on:
    - django.VERSION = (1, 4, 0, 'final', 0)
    - DBMS mysql (django.db.backends.mysql)
    - table engine = MYISAM
    - connection via localhost
    - Python 2.7.2+ (default, Oct 4 2011, 20:03:08) [GCC 4.6.1] on linux2
    - Linux host seva 3.0.0-22-generic # 36-Ubuntu SMP Tue Jun 12 17:13:04 UTC 2012 i686 athlon i386 GNU / Linux

    Specialist comments are welcome.