The intricacies of improving performance in C++, or how not to do it

    Once, many years ago, a client came to me and tearfully begged me to make sense of one wonderful project and speed it up.

    In short, the task was this: there is a crawler written in C++ that scrapes HTML pages and stores what it collects in a MySQL database. Plus a pile of extra functionality and a web front end on LAMP - but that has nothing to do with this story.

    On a 4-core Xeon in the cloud, the previous team had managed to reach a fantastic crawling speed of as much as 2 pages per second, with 100% CPU utilization of both the collector and the database, which ran on a separate server of the same kind.



    By the way, realizing that they could not pull it off, the "team of strong professionals" from Bangalore surrendered and scattered, so apart from a pile of source code there was "nothing! except the beads" (c).

    We will talk about the intricacies of putting things in order in PHP and in the database schema some other time; here I will give just one example of the craftsmanship we inherited.

    Getting down to the autopsy


    Such a serious database load was the first thing that interested me. I turn on detailed logging - and start tearing my hair out in every place at once.

    Tasks created from the web interface were, of course, stored in the database, and the robot polled it 50 times per second: has a new task appeared yet? Moreover, the data was naturally laid out in a way convenient for the interface, not for the robot. The result: three inner joins per query.

    First, I increase the polling interval to once per second. Then I get rid of the crazy query: I add a new three-column table, write a trigger on the web-facing tables so that it is filled automatically, and switch to a simple
    select * from new_table where status = 'Pending'
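
    To give an idea of the shape of that change, here is a minimal sketch of the once-per-second polling loop against the helper table, using the MySQL C API. The connection handling, table name and columns are illustrative assumptions, not the project's actual code.

    #include <mysql/mysql.h>
    #include <chrono>
    #include <string>
    #include <thread>
    #include <vector>

    // Hypothetical task record pulled from the trigger-filled helper table.
    struct Task {
        std::string id;
        std::string url;
    };

    // Poll the small helper table once per second instead of running
    // a triple-join query 50 times per second.
    void PollTasks(MYSQL* conn, std::vector<Task>& out, const bool& running) {
        while (running) {
            if (mysql_query(conn,
                    "SELECT id, url FROM new_table WHERE status = 'Pending'") == 0) {
                if (MYSQL_RES* res = mysql_store_result(conn)) {
                    while (MYSQL_ROW row = mysql_fetch_row(res)) {
                        out.push_back(Task{row[0] ? row[0] : "",
                                           row[1] ? row[1] : ""});
                    }
                    mysql_free_result(res);
                }
            }
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }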
    

    The new picture: the collector is still 100% busy, the database is at 2%, and we now get four pages per second.

    Picking up the profiler


    And suddenly it turns out that 80% of the runtime is spent inside the marvelous EnterCriticalSection and LeaveCriticalSection functions. And they are called (predictably) from the standard allocator of one well-known company.

    The evening stops being dull, but I realize there is a lot of work ahead and quite a bit will have to be rewritten wholesale.

    And of course, my predecessors preferred to parse HTML cowboy-style with a bunch of regular expressions, because SAX is oh so complicated.

    Time to get acquainted: what exactly had been "improved" for me?

    The dangers of premature optimization by the power of thought


    Seeing that the database was 100% loaded, the guys became firmly convinced that the bottleneck was inserting new URLs into the processing list.

    I find it hard to even understand what led them to optimize this particular piece of code. But the approach itself! "In theory it must be slow here, so let's go speed this up here."

    To that end, they came up with tricks like these:

    1. An asynchronous queue of insert requests.
    2. A huge self-written HashMap in memory, behind one giant lock, which remembered every URL ever processed. And since this was a server process, after such "optimizations" it had to be restarted regularly: they never got around to cleaning that cache.
    3. Lots of magic constants - for example, the next batch of URLs fetched from the database is capped at 400 records. Why 400? Picked by hand.
    4. A large number of "writers" into the database, each trying in a loop to shove in its own chunk in the hope of suddenly getting lucky.

    And of course, plenty of other gems were on offer.

    In general, watching the code evolve was very instructive. Their thriftiness did have one benefit - every previous attempt was carefully commented out and preserved. Like this:

    void GodClass::PlaceToDB(const Foo* bar, ...) {
    /* here is the code for variant number 1, commented out */
    /* here is variant number 2 - a copy-paste of the first with a few changes, commented out */
    /* here is variant number 3 - changed some more, not forgetting to copy variant number two, commented out */
    ....
    /* here is variant number N-1, which no longer has anything in common with the first, commented out */
    // and here, finally, is the variant that actually works
    }
    


    What did I do


    Of course, all these tricks were thrown away immediately: I went back to synchronous insertion and hung a UNIQUE constraint on the table to cut off duplicates (instead of the dance with the giant lock and the self-written hashmap).

    I also removed the auto-increment fields and inserted client-generated UUIDs instead (computing the next auto-increment value can sneak in an implicit table lock). At the same time I seriously slimmed down the table - at 20K per row it is no surprise the database was drowning.
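
    A minimal sketch of what the insert path might look like after that change, assuming Boost.UUID for ID generation and the MySQL C API; the table name, columns and error handling are illustrative assumptions, not the project's real code.

    #include <mysql/mysql.h>
    #include <boost/uuid/uuid.hpp>
    #include <boost/uuid/uuid_generators.hpp>
    #include <boost/uuid/uuid_io.hpp>
    #include <string>

    // Client-side ID generation instead of an auto-increment column.
    static std::string NewId() {
        thread_local boost::uuids::random_generator gen;  // one generator per thread
        return boost::uuids::to_string(gen());
    }

    // Plain synchronous insert; a UNIQUE constraint on `url` in the database
    // rejects duplicates, so no in-process hashmap or giant lock is needed.
    // (Escaping of `url` is omitted for brevity.)
    bool InsertUrl(MYSQL* conn, const std::string& url) {
        const std::string query =
            "INSERT INTO urls (id, url, status) VALUES ('" + NewId() +
            "', '" + url + "', 'Pending')";
        if (mysql_query(conn, query.c_str()) != 0) {
            // Error 1062 (ER_DUP_ENTRY) means the URL is already queued - that is fine.
            return mysql_errno(conn) == 1062;
        }
        return true;
    }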

    I also removed the magic constants; instead I made a normal thread pool with a shared task queue and a separate thread that keeps the queue topped up, so that it neither runs dry nor overflows (a sketch follows below).
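
    A rough sketch of that layout, using C++11 threading primitives purely for illustration (the original code was built on other facilities): a bounded queue, a filler thread that tops it up from the database, and a pool of workers that drain it. The names and bounds are assumptions; shutdown handling is omitted for brevity.

    #include <chrono>
    #include <condition_variable>
    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    // Bounded blocking queue shared by one filler thread and many workers.
    class TaskQueue {
    public:
        explicit TaskQueue(size_t capacity) : capacity_(capacity) {}

        void Push(std::string url) {
            std::unique_lock<std::mutex> lock(m_);
            not_full_.wait(lock, [&] { return q_.size() < capacity_; });
            q_.push(std::move(url));
            not_empty_.notify_one();
        }

        std::string Pop() {
            std::unique_lock<std::mutex> lock(m_);
            not_empty_.wait(lock, [&] { return !q_.empty(); });
            std::string url = std::move(q_.front());
            q_.pop();
            not_full_.notify_one();
            return url;
        }

    private:
        size_t capacity_;
        std::queue<std::string> q_;
        std::mutex m_;
        std::condition_variable not_empty_, not_full_;
    };

    // One filler thread keeps the queue topped up from the database,
    // a fixed pool of workers pulls URLs and processes pages.
    void RunCrawler(size_t workers, size_t capacity,
                    std::function<std::vector<std::string>()> fetch_pending,
                    std::function<void(const std::string&)> process) {
        TaskQueue queue(capacity);

        std::thread filler([&] {
            for (;;) {
                for (auto& url : fetch_pending())   // e.g. the simple SELECT above
                    queue.Push(std::move(url));
                std::this_thread::sleep_for(std::chrono::milliseconds(200));
            }
        });

        std::vector<std::thread> pool;
        for (size_t i = 0; i < workers; ++i)
            pool.emplace_back([&] {
                for (;;) process(queue.Pop());
            });

        filler.join();
        for (auto& t : pool) t.join();
    }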

    The result is 15 pages per second.

    However, re-profiling showed no breakthrough. Of course, a 7x speedup just from a saner architecture is nothing to sneeze at, but it was not enough. In essence, all the original flaws were still there; I had only removed the pieces that had been "optimized" to death.

    Regular expressions are a bad way to parse megabytes of structured text


    I keep studying what had been done before me, savoring the approach of the authors unknown to me.

    Me-e-edic!

    With the grace of a tractor, the guys solved the data-extraction problem like this (each step with its own set of regular expressions):

    • Cut out all the HTML comments
    • Cut out the JavaScript comments
    • Cut out the script tags
    • Cut out the style tags
    • Pulled two numbers out of the head
    • Cut out everything except the body
    • Then gathered up all the "a href" links and cut them out
    • In the body, cut out all the unneeded divs and tables, as well as the images
    • Then stripped the table layout
    • In what remained, removed the p, strong, em, i, b, g and other tags
    • And finally, pulled three more numbers out of the remaining plain text

    Surprisingly, even with this approach it managed to chew through as many as 2 pages per second.

    It goes without saying that I am not quoting the expressions themselves here - each one is a huge sheet of unreadable squiggles.

    That is not all - of course, the proper boost library was used, and every operation was performed on std::string (well, where else would you keep HTML? char* is not conceptual enough! Hardcore only!). Hence the insane number of memory allocations.

    I take a char* buffer and a simple SAX-style HTML parser, collect the numbers I need and pull out the URLs in the same pass. Two days of work, and behold.
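
    To illustrate the idea (not the actual parser used), here is a minimal sketch of a SAX-style single pass over a char* buffer: the tokenizer walks the text once and fires callbacks for tags and text nodes, and the handler picks out href attributes and whatever numbers it needs - no regex passes over the whole document and no wholesale copies of the page.

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // Callbacks fired as the parser walks the buffer once (SAX style).
    struct HtmlHandler {
        std::function<void(const char* tag, size_t len)> on_open_tag;
        std::function<void(const char* text, size_t len)> on_text;
    };

    // Extremely simplified single-pass tokenizer over a char* buffer.
    // Real HTML needs more care (comments, scripts, quoting); this only
    // shows the shape of the approach.
    void ParseHtml(const char* p, size_t n, const HtmlHandler& h) {
        const char* end = p + n;
        while (p < end) {
            if (*p == '<') {                                  // a tag
                const char* tag = ++p;
                while (p < end && *p != '>') ++p;
                if (h.on_open_tag && tag < p && *tag != '/')
                    h.on_open_tag(tag, static_cast<size_t>(p - tag));
                if (p < end) ++p;                             // skip '>'
            } else {                                          // a text node
                const char* text = p;
                while (p < end && *p != '<') ++p;
                if (h.on_text)
                    h.on_text(text, static_cast<size_t>(p - text));
            }
        }
    }

    // Usage sketch: collect link targets, copying only tag contents, not the page.
    std::vector<std::string> CollectLinks(const char* html, size_t n) {
        std::vector<std::string> links;
        HtmlHandler handler;
        handler.on_open_tag = [&](const char* tag, size_t len) {
            std::string t(tag, len);                          // e.g. a href="http://..."
            size_t pos = t.find("href=\"");
            if (pos != std::string::npos) {
                size_t endq = t.find('"', pos + 6);
                if (endq != std::string::npos)
                    links.push_back(t.substr(pos + 6, endq - pos - 6));
            }
        };
        handler.on_text = [](const char*, size_t) { /* scan for the needed numbers here */ };
        ParseHtml(html, n, handler);
        return links;
    }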

    The result is 200 pages per second.

    Already acceptable, but still not enough. Only a 100x improvement so far.

    One more approach to the barbell


    I turn to the results of the new profiling run. Things have improved, but there are still a lot of allocations, and for some reason Boost's to_lower() has climbed into first place.

    The first thing that catches the eye is a mighty URL class, ported verbatim from Java. Well, of course - it's C++, it will be faster anyway; never mind that the allocators are different. So a pile of copies and substring() calls is their answer to everything. And naturally to_lower is applied directly to URL::host - on every comparison and every mention, and Boost's version at that.

    I remove the excessive use of to_lower(), rewrite the URL class on top of char* without reallocations instead of std::string, and optimize a couple of loops along the way.
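
    As an illustration of the direction (the real class was more involved), here is a sketch of a URL that merely points into the original buffer instead of copying substrings, with a case-insensitive host comparison that avoids both to_lower() and any allocation; the field layout is an assumption.

    #include <cctype>
    #include <cstddef>

    // A non-owning view into the original page buffer: no copies, no allocations.
    struct StringView {
        const char* data;
        size_t len;
    };

    // Parsed URL whose parts simply point into the source buffer.
    struct Url {
        StringView scheme;   // e.g. "http"
        StringView host;     // e.g. "Example.COM"
        StringView path;     // e.g. "/index.html"
    };

    // Case-insensitive comparison without to_lower() and without copying.
    bool EqualsIgnoreCase(StringView a, StringView b) {
        if (a.len != b.len) return false;
        for (size_t i = 0; i < a.len; ++i) {
            if (std::tolower(static_cast<unsigned char>(a.data[i])) !=
                std::tolower(static_cast<unsigned char>(b.data[i])))
                return false;
        }
        return true;
    }

    // Usage sketch: compare the hosts of two parsed URLs.
    bool SameHost(const Url& x, const Url& y) {
        return EqualsIgnoreCase(x.host, y.host);
    }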

    The result is 300 pages per second.

    I stopped there: a 150x speedup overall, although there was still room for more. Even so, it ate more than two weeks.

    Conclusions


    The conclusions are, as always, a classic of the genre. Use tools when evaluating performance; do not guess from your head. Make wider use of ready-made libraries instead of reinventing everything by hand.

    And may Saint Connectius grant you high performance!
