Auditing one “slow” application at one large corporation

    Originally I just wanted to reply to this comment and cite, as an example, the unexpected results of a web application audit I once carried out, but the reply turned out far too unwieldy.
    So this article was born.

    Introduction


    The point was that in the corporate sector, under cover of contrived standards or standards imposed by the security department, you sometimes find completely unjustified and occasionally downright wild implementations of those standards, often bordering on impossible working conditions. For example, the CIO and others like him (or careless or simply lazy implementers), guided by such policies, may overdo things, and often have not found (or have not looked for) a better solution.

    As a result, we get antivirus software on servers, with all the consequences, because it simply has to be on every computer. Employees are forced to work as admin (aka MakeMeAdmin), because creating an admin account (tech-user) just for scripts and for restarting and debugging services is apparently quite impossible (and that's fine: there is an antivirus anyway). Policies allow executables to run from anywhere (network share, temporary directory, etc.), because some update service cannot work any other way (that's fine too: again, the antivirus is the argument). And so on and so forth.

    In fact, I understand perfectly well where these requirements come from. Quite often they really are conditions set by the customer's customers, requirements of the parent company or a partner company, or de facto industry standards that simply have to be met. In other words, there is no getting around them.
    But the fact is that some of these security measures are technically not justified at all; worse, they are not really “security”, and on top of that they hinder development and the productive work of that same client.
    And if, covering your backside behind false “standards”, you do decide to meet a dubious security requirement, then at least use your head while doing it, so that it does not affect (or only minimally affects) the company's productivity.

    Example: we once fought with a well-known antivirus (demonstrably buggy, probably somewhere around its analysis and real-time scanning queues) under very heavy server load (32 cores, >= 50% CPU load):
    • the antivirus blocks access to files (for anything from a few milliseconds to 30 seconds!) after they are renamed. As a result, a multi-threaded pipeline (asynchronous job queue) periodically crashes with “access denied” when, for example, a temporary file that was just created and then renamed is accessed from another thread that reports the end of the job. And that is while the working folder is supposedly completely excluded from real-time scanning.
    • or worse, it suspends child processes and forgets to “wake them up”, so the thread in the parent process waits forever for the sleeping child to finish.
    And of course disabling it on production servers is “not possible” (not even temporarily, as proof, since such an antivirus is usually exactly what they cover their backsides with).

    For example, it would be worth segmenting the network and putting that antivirus behind NAT. Or at least checking it for such defects and replacing it with another, more reliable one. Compare the cost of that with, say, a man-hour times 25,000 employees who spend minute after minute pointlessly waiting for an application to respond.
    As a result, the number of homegrown workarounds of the “safe_rename”, “real_delete” or “start_process_with_observe” kind across projects keeps growing. The same CIO would quickly reconsider his position if he (his department) were handed a collective bill for the total “downtime” (waiting time) of all employees.
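
    For illustration, a workaround of the “safe_rename” kind might look roughly like the Python sketch below. Only the function name comes from the text; the retry count, the delay and the idea of polling until the renamed file can be opened are assumptions, not the code used in those projects.

```python
import os
import time


def safe_rename(src, dst, attempts=10, delay=0.05):
    """Rename src to dst, then wait until dst can actually be opened.

    Hypothetical helper: attempts and delay are illustrative values.
    Returns the number of failed open attempts before success.
    """
    os.replace(src, dst)  # atomic rename, overwrites dst if it exists
    for attempt in range(attempts):
        try:
            with open(dst, "rb"):
                return attempt  # the file is accessible again
        except PermissionError:
            # Something (e.g. a scanner) still holds the file: back off and retry.
            time.sleep(delay * (attempt + 1))
    raise PermissionError(f"{dst} is still locked after {attempts} attempts")
```

    Every caller that hands the file to another thread would use this wrapper instead of a plain rename, which is exactly the kind of extra code that should not be necessary.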

    Audit


    I once had to audit a more or less large application at a very large corporation: a web application on the company intranet. According to rumors, it had never been particularly fast. And after one of the releases, everything became very, very slow.
    The vendor swore that everything was fine on their side: supposedly the logs did not show the application to be overloaded, but a stress test matching production could not be simulated. In the end the client ran out of patience, hence the audit ...

    It all started with writing analyzers for the logs that were sent over (thank God the logging in the application, or rather in the app server, could be made quite detailed). As a result, the huge logs were reduced to a more or less readable form. For anyone interested, an example of a trimmed log after the analyzer (and adjusted for the Habr parser) can be seen under the spoiler below. I'll say right away that the logs did not show anything obvious. I.e. the server (application) is, well, not fast (in some places SQL is slow, in others the storage or NAS), but on the whole it could in principle cope with a load ten times higher than the one analyzed.

    Example of an analysis log ...
    Analysis:
    Analysis time: 9338645 ms (155.644 min), 07:45:09 - 10:20:48
    Idle (ms):     6069160 (101.153 min)   Idle-AVG: 735.66   Count: 8250
    Busy (ms):     3269485 (54.491 min)    Busy-AVG: 396.3
    Total (ms):    4536133 (75.602 min)
    AVG (ms):      285.6
    Requests:      15883

    Users: 106
                      Total                       Avg (ms)                Min (ms)          Max (ms)
    WorkTime (ms)     521217468 (8686.958 min)    4917145 (81.952 min)    344 (0.006 min)   9322426 (155.374 min)
    ServerTime (ms)   4536133 (75.602 min)        42793 (0.713 min)       142 (0.002 min)   298511 (4.975 min)

    Name                 Time (ms)               AVG (ms)   Count
    /app/mailbox.htm     1135731 (18.929 min)    1108.03    1025
    /app/docmain.htm     770339 (12.839 min)     514.59     1497
    /port/result.htm     616371 (10.273 min)     1064.54    579
    /app/tree.htm        606983 (10.116 min)     211.35     2872
    /app/view.htm        286304 (4.772 min)      239.79     1194
    /app/docnavi.htm     255469 (4.258 min)      222.73     1147
    /port/docview.htm    173729 (2.895 min)      622.68     279
    /app/result.htm      135370 (2.256 min)      474.98     285
    /port/report2.htm    109145 (1.819 min)      5457.25    20
    /port/search.htm     72917 (1.215 min)       197.07     370
    /port/pdfview.htm    34346 (0.572 min)       602.56     57
    /app/action.htm      32499 (0.542 min)       172.87     188
    ...
    /app/empty.htm       0 (0.000 min)           0          1

    User-Times:
    UID            Time (ms)             AVG (ms)   Count   Worktime
    net\u101165    298511 (4.975 min)    2444.81    122     131.818 min, 07:55:26-10:07:16
    net\u144102    238661 (3.978 min)    269.07     887     117.066 min, 08:23:28-10:20:32
    net\u193619    168190 (2.803 min)    282.2      596     150.534 min, 07:46:35-10:17:07
    ...

    Requests of net\u101165:
    Name                 Time (ms)             AVG (ms)    Count
    /port/result.htm     280962 (4.683 min)    12771.00    22
    /port/docview.htm    11938 (0.199 min)     746.13      16
    /port/search.htm     3750 (0.063 min)      250         15
    /port/tree.htm       797 (0.013 min)       88.56       9
    /port/view.htm       640 (0.011 min)       40          16
    ...

    Requests of net\u144102:
    Name                 Time (ms)             AVG (ms)    Count
    /app/docmain.htm     89560 (1.493 min)     621.94      144
    /app/tree.htm        55327 (0.922 min)     278.03      199
    /app/docnavi.htm     42245 (0.704 min)     291.34      145
    /app/view.htm        22986 (0.383 min)     247.16      93
    /app/mailbox.htm     16126 (0.269 min)     1466.00     11
    ...

    Requests of net\u193619:
    Name                 Time (ms)             AVG (ms)    Count
    /app/mailbox.htm     53689 (0.895 min)     1677.78     32
    /app/docmain.htm     39019 (0.650 min)     750.37      52
    /app/tree.htm        22122 (0.369 min)     254.28      87
    /app/view.htm        15830 (0.264 min)     316.6       50
    /port/result.htm     9622 (0.160 min)      253.21      38
    ...

    Roughly speaking, even the summed idle time of all server workers shows this (101.153 min out of 155.644 min).
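
    The kind of reduction such an analyzer performs can be sketched in a few lines of Python. The input format here (“time uid url duration_ms”) and the sample lines are purely hypothetical; the real app-server logs looked different and are not reproduced.

```python
from collections import defaultdict


def summarize(lines):
    """Reduce request log lines to per-URL total time, average and count.

    Assumed (hypothetical) line format: "<time> <uid> <url> <duration_ms>".
    """
    total = defaultdict(int)
    count = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed format
        _, _, url, ms = parts
        total[url] += int(ms)
        count[url] += 1
    rows = [(url, total[url], total[url] / count[url], count[url]) for url in total]
    return sorted(rows, key=lambda row: row[1], reverse=True)  # most expensive first


# Invented sample lines, only to show the shape of the output.
sample = [
    "07:45:09 u1 /app/tree.htm 200",
    "07:45:10 u2 /app/mailbox.htm 1100",
    "07:45:12 u1 /app/tree.htm 220",
]
for url, ms, avg, n in summarize(sample):
    print(f"{url:20} {ms:8} {avg:10.2f} {n:6}")
```

    Sorting by total time rather than by average is a deliberate choice here: a page that is moderately slow but called thousands of times usually matters more than a rare report that takes seconds.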

    I will not burden the reader with everything else I ran into while hunting for the culprit who had pulled the handbrake.
    After some reading of coffee grounds, the list of suspects ranged from that same antivirus (where would we be without it) and load balancer problems to plain swapping on the client, browser extensions, or slow JavaScript agonizing in the browser in some asynchronous calls.
    Everything pointed to a hellish code audit ahead. But first I decided to check on site how the application behaves.

    They show me the application in action: it really is unbearably slow. The browser creaks along sluggishly, and on each click pages open frame by frame, slowly, as if over a modem. There is no swapping, and nothing is eating up resources.

    And after only an hour of searching I ended up in quiet horror (actually, a different word belongs here).

    But first things first: I squeezed my own proxy, with access logging enabled, between the browser and the server, after struggling with MITM, substituting the NTLM credentials (the proxy also ran under the user's account), rewriting absolute URLs to another port, and so on. I expected the logs to show that the application had nothing to do with it: either the antivirus, or something in the browser (scripts and the like).
    And after a few clicks in the application, I unexpectedly found in the tail of the log a good hundred mini-requests from the browser for every page requested (for every click). All static content, i.e. icons, static scripts and styles, was requested again and again with each click: one or two large requests answered with 200 and many (very many) small ones answered with 304 (Not Modified).
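
    A quick way to make such a pattern visible is to tally the status codes in the proxy's access log. The sketch below assumes lines in a Common Log Format style layout (quoted request followed by the status code); the sample lines, addresses and static paths are invented and do not come from the audited system.

```python
import re
from collections import Counter

# Matches the quoted request and the status code that follows it.
LINE = re.compile(r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3})')


def status_counts(lines):
    """Count responses per HTTP status code in access log lines."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if match:
            counts[match.group("status")] += 1
    return counts


# Invented sample: one page request plus two revalidated static files.
sample = [
    '10.0.0.5 - u1 [01/Jan/2000:10:00:00 +0000] "GET /app/tree.htm HTTP/1.1" 200 5120',
    '10.0.0.5 - u1 [01/Jan/2000:10:00:00 +0000] "GET /img/a.gif HTTP/1.1" 304 0',
    '10.0.0.5 - u1 [01/Jan/2000:10:00:01 +0000] "GET /css/main.css HTTP/1.1" 304 0',
]
counts = status_counts(sample)
print("200:", counts["200"], "304:", counts["304"])
```

    A ratio of one or two 200 responses to about a hundred 304 responses per click is the signature described above.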

    Yet the application, as expected, sent correct caching headers for static content (Cache-Control, Expires, etc.).
    In short, it turned out that the browser cache had simply been “turned off” (by policy)!
    Across the entire corporation!!!
    That is, the setting “Check for newer versions of stored pages” was set to “Every time I visit the webpage”.
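
    The effect of that setting can be illustrated with a deliberately simplified model of the cache decision. This is a sketch under assumptions, not the browser's actual algorithm: a cached response counts as fresh while its age is below max-age, and the “every visit” policy overrides freshness entirely. The figures in the example are hypothetical.

```python
def needs_request(age_s, max_age_s, check_every_visit):
    """Return True if a cached resource costs a round trip to the server.

    Simplified model (an assumption): a cached response is fresh while
    its age is below the max-age it was delivered with.
    """
    if check_every_visit:
        return True  # policy forces revalidation; the server answers 304
    return age_s >= max_age_s  # only stale entries are revalidated


# Hypothetical page: 100 static resources cached 60 s ago, max-age one day.
resources = [(60, 86400)] * 100
default_policy = sum(needs_request(age, max_age, False) for age, max_age in resources)
forced_policy = sum(needs_request(age, max_age, True) for age, max_age in resources)
print(default_policy, forced_policy)
```

    With correct headers and the default behaviour none of the hundred resources touches the network; with the forced check all of them do, on every click.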

    Add NTLM on top of all this (with its handshake back and forth and a request-response to the PDC from both sides), consider that the company has many web-based applications, multiply by the number of employees of the entire corporation, and voilà! You can go to the server room and enjoy the red glow of utterly overloaded network equipment.
    I.e. the browser not only sends and receives a pile of small request-response pairs, it also has to wait for the last resource on the page to confirm “I am unchanged” before it can finish rendering and show the page. It seems to me that the antivirus, constantly bombarded by the additional requests, also contributed to the overall picture (although I do not know how it reacts to, say, a script that came back as a 304).
    And as a bonus on top, a sluggish network: network administrators can tell long (usually obscene) stories about how a huge mass of mini-packets can brighten the life of a whole network segment in general and of a particular router in particular.
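
    A back-of-the-envelope estimate shows why the page cannot appear until the last 304 has arrived. The model below is an assumption: each conditional request costs one round trip, and the browser keeps a fixed number of connections busy in parallel. All figures are hypothetical and ignore the extra NTLM handshakes mentioned above.

```python
import math


def revalidation_delay_ms(resources, rtt_ms, parallel_connections):
    """Rough extra time per click when every static resource is revalidated.

    Assumes one round trip per conditional request and a fixed number
    of parallel connections; real browsers and networks are messier.
    """
    rounds = math.ceil(resources / parallel_connections)
    return rounds * rtt_ms


# Hypothetical figures: 100 resources, 20 ms round trip, 6 connections.
print(revalidation_delay_ms(100, 20, 6))
```

    Even under these friendly assumptions the user waits a noticeable fraction of a second on every click for responses that carry no new data at all.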

    By the way, I was lucky with the search, to say the least (and so was the client): in my proxy's settings the access log for static content happened to be turned on, although it is usually off (I had just been testing 304 and other responses with it and had not reverted the configuration). And nobody expected (the admins present turned blotchy red as well) that the staff responsible for the policies (in this case, the browser settings) could be so incompetent. So much so that, IMHO, it borders on grounds for dismissal.

    According to rumors, the cache was “turned off” because in a completely different project some odd version of some SAP components could not work any other way (perhaps Cache-Control, Expires, etc. or If-Modified-Since were handled incorrectly). The point is that the testers did not catch it, and the implementers who had rolled the buggy release out to production could not roll it back and found nothing better than to simply “disable” the Expires and Cache-Control checks. Once again: across the entire corporation! That is, for all applications and for web surfing in general.
    They could, for example, have put some kind of proxy in front of the application to change the headers only for those specific URLs, or hooked into the application server to do the “rewrite” directly in it. Offhand, you can come up with a dozen better solutions.
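
    As one possible shape of such a targeted fix, here is a minimal sketch of a WSGI middleware in Python that overrides the caching headers only for selected URL prefixes and leaves everything else untouched. The function name, the “/legacy/” prefix and the demo application are hypothetical; the real system was not Python-based as far as the text tells us, so this only illustrates the idea.

```python
def override_cache_headers(app, prefixes, cache_control="no-cache"):
    """WSGI middleware: replace Cache-Control/Expires only under given prefixes."""

    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if not path.startswith(tuple(prefixes)):
            return app(environ, start_response)  # all other URLs pass through

        def patched_start(status, headers, exc_info=None):
            # Drop the original caching headers and force revalidation instead.
            kept = [(name, value) for name, value in headers
                    if name.lower() not in ("cache-control", "expires")]
            kept.append(("Cache-Control", cache_control))
            return start_response(status, kept, exc_info)

        return app(environ, patched_start)

    return middleware


def demo_app(environ, start_response):
    """Hypothetical application that marks everything cacheable for a day."""
    start_response("200 OK", [("Cache-Control", "max-age=86400"),
                              ("Content-Type", "text/plain")])
    return [b"ok"]


def response_headers(app, path):
    """Call a WSGI app for one path and return its response headers as a dict."""
    seen = {}

    def start_response(status, headers, exc_info=None):
        seen.update(headers)

    app({"PATH_INFO": path}, start_response)
    return seen


wrapped = override_cache_headers(demo_app, ["/legacy/"])
print(response_headers(wrapped, "/legacy/x.js")["Cache-Control"])
print(response_headers(wrapped, "/app/x.js")["Cache-Control"])
```

    Only the problematic component is forced to revalidate; every other application keeps its cache.
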
    By the way, instead of admitting the failure, they reportedly argued, foaming at the mouth, that it was necessary and good, and that even Microsoft recommends these settings for IE. As a result, your humble servant also wrote, as an appendix to the progress report, a “dissertation” on the topic “Why bad is not good”.

    And to the ordinary users of that company I came across as a magician who in an hour solved a problem that the local IT specialists had been unable to find for months (or even years).
    Just imagine: suddenly not only that application but everything else in the company began to work much faster! Right down to the last network printer ...

    With that I take my leave, asking not to be kicked with words like “Programmers make mistakes, too”, and hoping that there are still responsible people (and implementers) out there who will listen to us and at least sometimes switch on their heads ... It helps a lot.
