The struggle for performance or who took CPU time

    It all started with Tumbler Dolls. About what it is and why it was needed is described here: http://habrahabr.ru/company/oktell/blog/108726/ . That is, we had four working call center servers, equally configured and with approximately the same configuration of external and internal lines, users and setting up the internal database. The telecom operator received a uniform ring load on each server, but the servers reacted differently to this!

    N Server Configuration and Operating System
    1 T1300 @ 1.66 1 GB RAM, Windows 2003 Standard Ed. R2 SP1 32 bit
    2 Intel Core Duo E8400 @ 3000 4 GB RAM, Windows 2003 Standard Ed. SP1 32 bit
    3 Intel Pentium 4 3GHz 2 GB RAM, Windows 2003 Standard Ed. SP2 32 bit
    4 E3400 @ 2.60 2 GB RAM, Windows 2003 Standard Ed. R2 SP2 32 bit

    Faced with complaints about the quality of communication. Moreover, they complained not about every call, but about "some." They complained about the “kwaki,” very characteristic of VoIP telephony. Quite quickly it was found out that the reason for the appearance of “Quaks” was an unpredictable increase in processor employment on one (first) of the call center servers with increasing load. And this despite the fact that other servers did not notice such a load at all, and there was no increase in CPU load with the same number of calls. Even despite the fact that the first server was significantly weaker than all the others, such a picture - growth in processor utilization to 100% - should not have been observed.

    It’s probably not worth saying that we went the standard way of “wiping the headlights”, “kicking the wheels”, etc. Finally, we came to the conclusion that neither the settings of the call center itself, nor its DBMS, affect the server behavior. The starting point for understanding the essence of the problem was the fact that in the task manager in the process list, none of the processes took up processor time, however, while monitoring the performance, the processor load history showed a continuous kernel processor load of 20%.

    image
    image

    The goal was to get an answer to the question what is the core busy with when there are no loads on all other services. Process Explorer - a standard utility from Microsoft - suggested that the main consumer of resources is “Hardware Interrupts”. For further analysis of the reasons for such consumption, another standard Microsoft utility, “Kernrate View,” was downloaded. As described in the recommendations for use, the command line executed "C: \ Program Files \ KrView \ Kernrates \ Kernrate_i386_XP.exe >> log.txt" and, after some time, pressing Ctrl-C, stopped it. We received a log.txt file containing information of the form:

    / =============================== \
    \ ======= ======================== /
    Date: 2010/12/08 Time: 1:09:14
    Machine Name: RESERVCC
    Number of Processors: 1
    PROCESSOR_ARCHITECTURE: x86
    PROCESSOR_LEVEL: 6
    PROCESSOR_REVISION: 0e08
    Physical Memory: 1015 MB
    Pagefile Total: 2450 MB
    Virtual Total: 2047 MB
    PageFile1: \ ?? \ C: \ pagefile.sys, 1524MB
    OS Version: 5.2 Build 3790 Service-Pack: 1.0
    WinDir : C: \ WINDOWS

    Kernrate User-Specified Command Line:
    Kernrate_i386_XP.exe

    Kernel Profile (PID = 0): Source = Time,
    Using Kernrate Default Rate of 25000 events / hit

    ------------ Overall Summary : --------------

    P0 K 0: 00: 36.671 (28.5%) U 0: 00: 09.671 (7.5%) I 0: 01: 22.343 (64.0%) DPC 0:00 : 29.484 (22.9%) Interrupt 0: 00: 00.281 (0.2%)
    Interrupts = 136809, Interrupt Rate = 1063 / sec.

    Total Profile Time = 128687 msec

    BytesStart BytesStop BytesDiff.
    Available Physical Memory, 443772928, 407339008, -36433920
    Available Pagefile (s), 2104172544, 2093707264, -10465280
    Available Virtual, 2132660224, 2131611648, -1048576
    Available Extended Virtual, 0, 0

    Total Avg. Rate
    Context Switches, 609407, 4736 / sec.
    System Calls, 5078088, 39461 / sec.
    Page Faults, 119817, 931 / sec.
    I / O Read Operations, 11671, 91 / sec.
    I / O Write Operations, 209479, 1628 / sec.
    I / O Other Operations, 229216, 1781 / sec.
    I / O Read Bytes, 39981700, 3426 / I / O
    I / O Write Bytes, 19240135, 92 / I / O
    I / O Other Bytes, 7130204725, 31107 / I / O

    - Results for Kernel Mode:
    - OutputResults: KernelModuleCount = 99
    Percentage in the following table is based on the Total Hits for the Kernel

    Time 44651 hits, 25000 events per hit - Module Hits msec% Total Events / Sec
    intelppm 27457 128685 61% 5334149
    hal 12284 128685 27% 2386447
    ntkrnlpa 2868 128685 6% 557174
    win32k 525 128685 1% 101993
    alder9xp 427 128685 0% 82954
    tcpip 254 128685 0% 49345
    Ntfs 251 128685 0% 48762
    afd 120 128685 0% 231128
    128115
    Rd115128 0% 21 175
    iaStor 97 128 685 0% 18 844
    NDIS 39 128 685 0% 7576
    RDPWD 31 128 685 0% 6022
    fltMgr 26 128 685 0% 5051
    amon 12 128685 0% 2331
    termdd 10 128685 0% 1942
    CLASSPNP 10 128685 0% 1942
    ftdisk 7 128685 0% 1359
    ipsec 3 128685 0% 582
    Npfs 2 128685 0% 388
    USBPORT 2 128685 0% 388
    volsnap 2 128685 0% 388
    TDTCP 1 128685 0% 194
    rdbss 1 128685 0% 194
    ws2ifsl 1 128685 0% 194
    netbt 1 128685 0% 194
    watchdog 1 128685 0% 194
    PartMgr 1 128685 0% 194

    ================ ================== END OF RUN =============================== ====

    Next, we determine which driver loads the “Hardware Interrupts” process. It will be the top in the Kernrate View log list, and its percentage of core occupancy will be shown next to the percentage. It is worth noting that the percentages do not show the total percentage of the system load, but the percentage of load on the server core by the drivers.
    We determined that this driver is Intelppm (Intel processor power manager). Next - google to help us. The Internet is large, powerful and unlimited. They quickly realized that a problem with Intelppm was occurring, not so often, however, we were not the only ones who faced such a disaster. The result was not slow to discover, there was an article that not only describes the problem itself, but also indicates the way to solve it (Permanent address of the original article here:http://www.osp.ru/text/print/302/5818429.html )
    Further, following the recommendations of Stephen Dougherty, we understand that intelppm is a processor power management driver that is not needed on a server where battery power is not used at all . Several solutions were proposed: reinstalling, updating, or stopping a faulty driver. What options you choose - see for yourself, it is quite logical here to follow the original recommendations of Doherty.
    We climbed into the registry. Intel processor driver data is located in the registry key HKEY_LOCAL_MACHINE / SYSTEM / Current Control Set / Services / intelppm. To disable the intelppm driver, the value of the Start parameter was changed from 1 to 4. Microsoft experts, of course, recommend that you make a backup copy of the registry, but we are Russian people, and change one parameter from 1 to 4.
    image
    Restart helped make sure that the download the processor is on average between 60-70% even with a full (!) load of the call center server with calls.
    At the same time, the device manager looks very interesting:
    image
    Complains, sneaks, cries: “The driver for this device has been disabled. Perhaps the necessary functions are performed by another driver. (Code 32), Click the “Diagnostics” button to start the diagnostic wizard for this device. ”But this does not affect flight safety ;-)

    Also popular now: