Notes for building effective Django-ORM requests in loaded projects

    It is written since another holivar arose in the comments on SQL vs ORM in the High-Load Project (HL)

    Preamble


    In the note you can find, in some places, commonplace things. Most of them are available in the documentation, but modern people often like to grab everything superficially. Yes, and many simply did not have the opportunity to test themselves in HL projects.
    As you read this article, remember:
    • You can never implement an HL project based on just one manipulation of ORM
    • Never put complicated things on the shoulders of the database. You need it to store the information, and not to count factorials!
    • If you cannot realize the idea you are interested in by simple means of ORM, do not use ORM to directly solve the problem. And even more so do not climb into a lower level, crutches will break. Find a more elegant solution.
    • Sorry for the mockingly humorous tone of the article. Otherwise boring :)
    • All information is based on Django version 1.3.4
    • Be easier!

    And-and-and-and-yes, the article will show errors in understanding ORM that I encountered in more than three years of working with Django.


    Not understood ORM


    I'll start with a classic mistake that haunted me for quite some time. Regarding beliefs in the tribe of Uruguay monkeys. I very much believed in the omnipotence of Django ORM, namely in
    Klass.objects.all()
    

    eg:
    all_result = Klass.objects.all()
    result_one = all_result.filter(condition_field=1)
    result_two = all_result.filter(condition_field=2)


    In my dreams, thinking was as follows:
    • I chose everything that interests me with a single request on the first line.
    • In the second line, I will no longer have a request, but I will work with the result by the first condition.
    • In the third line, I also will not have a query to the database, and I will have the output of interest to me with the second condition based on the results of the first query.

    You probably already guess that there are no magic monkeys and in this case we have three queries. But, I will disappoint you. In this case, we still have two queries, and to be even more precise, there aren’t a single query based on the results of this script (but in the future, of course, we won’t do this). Why, you ask?
    I explain in order. Let us prove that in this code there are three queries:
    • The first line, when calculating, is analog
      select * from table;
      

    • The second line, when calculating, is analog
      select * from table where condition_field=1;
      

    • The third line, in calculations, is analog
      select * from table where condition_field=2;
      


    Hurrah! We have proven that we have three requests. But the main phrase is “in the calculations”. In fact, we move on to the second part - the proof that we have only two requests.
    For this problem, the following understanding of ORM (in 2 sentences) will help us:
    • So far we have not calculated anything - we only form a request using ORM tools. As soon as we started to calculate, we calculate by the received generated request.

    So, in the first line, we designated the variable all_result with the query we are interested in - select all.
    In the second and third row, we specify our request for the selection of additional. conditions. Well, and therefore received 2 requests. Which should be proved.
    Attentive readers (why did you look in the previous paragraphs again?) Should already have guessed that we didn’t make any requests. And in the second and third lines, we just as well formed a request of interest to us, but we did not contact the database with it.
    So we were doing nonsense. And the calculations will begin, for example, from the first line of the downstream code:
    for result in result_one:
      print result.id
    


    Not always necessary functions and reasonable selections

    Let's try to play around with templates, and the __unicode __ () function that some people love .
    You know - a cool feature! In any place, at any time and under any circumstances, we can get the name that interests us. Super! And super until then, until we have in the output does not appear ForeignKey . As soon as appears, consider everything is gone.
    Consider a small example. We have news in one line. There are regions to which this news is tied:
    class RegionSite(models.Model):
        name = models.CharField(verbose_name="название", max_length=200,)
        def __unicode__(self):
            return "%s" % self.name
    class News(models.Model):
        region = models.ForeignKey(RegionSite, verbose_name="регион")
        date = models.DateField(verbose_name="дата", blank=True, null=True, )
        name = models.CharField(verbose_name="название", max_length=255)
        def __unicode__(self):
            return "%s (%s)" % (self.name, self.region)
    

    We need to print the 10 latest news, with the name as defined in News .__ unicode __ ()
    Uncover the sleeves, and write:
    news = News.objects.all().order_by("-date")[:10]
    

    In the template:
    {% for n in news %}
    {{ n }}
    {% endfor %}
    

    And here we dug a hole for ourselves. If this is not news, or there are not 10 - but 10 thousand, then be prepared for the fact that you will receive 10,000 requests + 1. And all because of the Mudblood of ForeignKey .
    An example of an extra 10 thousand requests (and say thank you for having a small model - all the fields and values ​​of the model would be chosen this way, whether it be 10 or 50 fields):
    SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite` WHERE `seo_regionsite`.`id` = 1 
    SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite` WHERE `seo_regionsite`.`id` = 1 
    SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite` WHERE `seo_regionsite`.`id` = 2
    SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite` WHERE `seo_regionsite`.`id` = 1 
    -- итп
    

    Why it happens? Everything is genital to simple. Each time you get the name of the news, we have a request to RegionSite to return its __unicode __ () value, and substitute in brackets to display the name of the region of the news.
    Similarly, a bad situation begins when we, for example, in the template using ORM try to get to the value we need, for example:
    {{ subgroup.group.megagroup.name }}
    

    You won’t believe what a tough request there might be :) I’m not saying that you can have dozens of such samples in the template!
    You just won’t take us so much - we sobbed and took advantage of the next great ORM feature - .values ​​() .
    Our line of code in a magic-keyboard way turns into:
    news = News.objects.all().values("name", "region__name").order_by("-date")[:10]
    

    A template:
    {% for n in news %}
    {{ n.name }} ({{ n.region__name }})
    {% endfor %}
    

    Note the double underscore. It will be useful to us soon. (For those who are not in the know - double underlining, as if the connection between the models, to put it bluntly)
    With these simple manipulations we got rid of 10 thousand queries and left only one. By the way, yes, it will work with JOIN and with the fields we have selected!
    SELECT `news_news`.`name`, `seo_regionsite`.`name` FROM `news_news` INNER JOIN `seo_regionsite` ON (`news_news`.`region_id` = `seo_regionsite`.`id`) LIMIT 10 
    

    We are madly glad! After all, we have just become ORM optimizers :) Something there! I'll tell you :) This optimization is optimization until we have 10 thousand news. But we can do it even faster!
    To do this, take into account our prejudices in terms of the number of requests and urgently increase the number of requests by 2 times! Namely, we will prepare the data:
    regions = RegionSite.objects.all().values("id", "name")
    region_info = {}
    for region in regions:
      region_info[region["id"]] = region["name"]
    news = News.objects.all().values("name", "region_id").order_by("-date")[:10]
    for n in news:
      n["name"] = "%s (%s)" % (n["name"], region_info[n["region_id"]])
    

    And then the output in the template of our freshly set variable:
    {% for n in news %}
    {{ n.name }}
    {% endfor %}
    

    Yes, I understand ... With these lines we violated the concept of MVT. But this is just an example that can easily be converted into lines that do not violate MVT standards.
    What have we done?
    1. We have prepared data on the regions and entered information about them in the dictionary:
      SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite`
      

    2. Choose from the news everything that interests us + pay attention to a single underline.
      SELECT `news_news`.`name`, `news_news`.`region_id` FROM `news_news` LIMIT 10 
      

      It was with a single underscore that we chose the direct meaning of the link in the database.
    3. We connected two models with python.

    Believe me, on single ForeignKeys you will hardly notice an increase in speed (especially if there are few selectable fields). However, if your model communicates through forging with more than one model, this is where the celebration of this decision begins.
    Let's continue to bother with double and single underlining.
    Consider a simple example to the point of banality:
    item.group_id vs. item.group.id
    

    Not only when building queries, but also when processing the results, you can run into this feature.
    Example:
    for n in News.objects.all():
        print n.region_id 
    

    There will be only one request - when selecting news
    Example 2:
    for n in News.objects.all():
        print n.region.id
    

    There will be 10 thousand requests + 1, because in each iteration we will have our request for id. It will be similar:
    SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite` WHERE `seo_regionsite`.`id` = 1 
    

    Here is such a difference because of one sign.
    Many advanced dzhangovody now poke a finger at the Voodoo doll with my code. And at the same time they ask me a question - what are you doing for the blizzard with data preparation, and where are values_list ("id", flat = True) ?
    Let's look at a wonderful example showing the need for accuracy in working with value_list :
    regions_id = RegionSite.objects.filter(id__lte=10).values_list("id", flat=True)
    for n in News.objects.filter(region__id__in=regions_id):
        print n.region_id
    

    With these lines of code, we:
    1. We prepare a list of the id-shniks of the regions of interest to us by some abstract condition.
    2. The resulting result is inserted into our news query and we get:
      SELECT `news_news`.`id`, `news_news`.`region_id`, `news_news`.`date`, `news_news`.`name` FROM `news_news` WHERE `news_news`.`region_id` IN (SELECT U0.`id` FROM `seo_regionsite` U0 WHERE U0.`id` <= 10 ) 
      

    Request in request! Uuuuh, I love :) Especially choose 10 thousand news with an embedded select with IN (10 thousand IDs)
    Do you certainly understand what this threatens? :) If not, then understand - nothing, absolutely nothing good!
    The solution to this issue is also simple to genius. Recall the beginning of our article - no query appears without evaluating a variable. And make a remark, for example, on the second line of code:
    for n in News.objects.filter(region__id__in=list(regions_id)):
    

    And with this solution we get 2 simple queries. No attachments.
    You still have not taken your breath from the bastards ORM in store for us? Then we’ll drop even deeper. Consider the code:
    regions_id = list(News.objects.all().values_list("region_id", flat=True))
    print RegionSite.objects.filter(id__in=regions_id)
    

    With these two lines, we select the list of regions for which we have news. Everything in this code is great, except for one point, namely the resulting request:
    SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite` WHERE `seo_regionsite`.`id` IN (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9) LIMIT 21
    

    Ahaha, ORM, stop it! What are you doing!
    Not only that, from all the news (I have 256 in my example, it seems), he selected the id of the regions and simply substituted them, so he took limit 21 from somewhere. Everything is simple about the limit - this is how print of a large number of array values ​​works (I I didn’t find another excuse), but here with the values ​​there is clearly an ambush.
    The solution, as in the previous example, is simple:
    print RegionSite.objects.filter(id__in=set(regions_id)).values("id", "name")
    

    Having removed the extra elements through set (), we received a quite adequate request, as expected:
    SELECT `seo_regionsite`.`id`, `seo_regionsite`.`name` FROM `seo_regionsite` WHERE `seo_regionsite`.`id` IN (1, 2, 3, 4, 9) LIMIT 21
    

    Everyone is happy, everyone is happy.
    With a little glance over the historically written code, I’ll highlight one more pattern that you should know about. And again, an example code:
    region = RegionSite.objects.get(id=1)
    t = datetime.datetime.now()
    for i in range(1000):
        list(News.objects.filter(region__in=[region]).values("id")[:10])
        # list(News.objects.filter(region__id__in=[region.id]).values("id")[:10])
        # list(News.objects.filter(region__in=[1]).values("id")[:10])
        # list(News.objects.filter(region__id__in=[1]).values("id")[:10])
    print datetime.datetime.now() - t
    

    Each of the iteration lines was sequentially included (so that only one worked). Total we can get the following approximate numbers:
    • 1 line - 6.362800 sec
    • 2nd line - 6.073090 sec
    • 3rd line - 6.431563 sec
    • 4th line - 6.126252 sec

    The discrepancies are minimal, but visible. Preferred 2 and 4 options (I mainly use 4m). The main waste of time is how quickly we create a request. Trivial, but significant, I think. Each reader will draw conclusions independently.
    And we end the article with a scary word - transaction .
    Special case:
    • You have InnoDB
    • You need to update the data in the table, in which customers do not write, but only read (for example, a list of goods)

    One-two-update / insertion is done
    1. We prepare 2 dictionaries - to insert data and to update data
    2. We throw each of the dictionaries into its function
    3. PROFIT!

    An example of a real update function:
    @transaction.commit_manually
    def update_region_price(item_prices):
        """
        Обновляем одним коммитом базу
        """
        from idea.catalog.models import CatalogItemInfo
        try:
            for ip in item_prices:
                CatalogItemInfo.objects.filter(
                    item__id=ip["item_id"], region__id=ip["region_id"]
                ).update(
                    kost=ip["kost"],
                    price=ip["price"],
                    excharge=ip["excharge"],
                    zakup_price=ip["zakup_price"],
                    real_zakup_price=ip["real_zakup_price"],
                    vendor=ip["vendor"],
                    srok=ip["srok"],
                    bonus=ip["bonus"],
                    rate=ip["rate"],
                    liquidity_factor=ip["liquidity_factor"],
                    fixed=ip["fixed"],
                )
        except Exception, e:
            print e
            transaction.rollback()
            return False
        else:
            transaction.commit()
            return True
    


    An example of a real add function:
    @transaction.commit_manually
    def insert_region_price(item_prices):
        """
        Добавляем одним коммитом базу
        """
        from idea.catalog.models import CatalogItemInfo
        try:
            for ip in item_prices:
                CatalogItemInfo.objects.create(**ip)
        except Exception, e:
            print e
            transaction.rollback()
            return False
        else:
            transaction.commit()
            return True
    

    Knowing these points, you can build effective applications using Django ORM, and not fit into the SQL code.

    Answers on questions:

    Since such a dance has gone, write when to use ORM, and when not to. (c) lvo
    I think that OPM should be used whenever it is simple. Do not put ORM on your shoulders, and even more so base queries like:
    User.objects.values('username', 'email').annotate(cnt=Count('id')).filter(cnt__gt=1).order_by('-cnt')
    

    Especially on HL-production. Get yourself a separate system server in which you are so hungry.
    If you don’t have the opportunity to write in simple “ORM queries”, then change the algorithm for solving the problem.
    For example, a client in IM has filtering by characteristics, using regulars. Cool flexible thing, until the site visitors are very numerous. Changed the approach, instead of the standard Client-ORM-Base-ORM-Client, rewrote it to Client-MongoDB-Python-Client. Data in MongoDB is generated by means of ORM on the system server. As stated earlier, HL cannot be achieved by manipulating ORM alone.

    I wonder why Django is. What are the benefits of this framework (and its ORM) compared to other frameworks / technologies. (c) anjensan
    Historically. Python began to study with Django. And I bring the knowledge in the technology of its use to the maximum. Now in parallel study of Pyramid. I can compare so far only with PHP, and their frameworks, cms. I’ll probably say a general phrase - I wasted my time ineffectively when I wrote in PHP .
    Now I can name a couple of serious flaws in Django 1.3.4:
    1. Permanent connection / disconnection with the base (corrected in older versions)
    2. The speed of the template processor. According to tests found on the network, it is small enough. Need to change :)

    In general, there is one cool trick on how to increase the speed of the template processor generation.
    Never pass variables to the template via locals () - with bulk functions and intermediate variables - you will get a silent slowly moving dying monster :)

    What kind of programmer is it that is difficult to write a SQL query? (c) andreynikishaev
    A programmer who values ​​his time on program code, and not on the means of interaction between the Base-Code data processing. You need to know SQL - very often I work directly with the database console. But in the code - ORM. ORM is easier and faster to change, or complement. And also, if you write reasonably easy requests, it is easy to read and understand.

    Sorry, everyone! (Blah blah ... waiting for comments, suggestions, questions, suggestions)

    Also popular now: