Forensic system administration

    Of all the duties of a system administrator, the most interesting, complex and productive, in my opinion, is detective work on an incident. Unlike real criminology, though, the system administrator has to be both the detective and the forensic expert at the same time.

    I am leaving out incidents involving deliberate malicious intent; that is a separate topic. This is about naturally occurring problems (the server crashed or hung, a virtual machine started slowing down and then stopped, the application lost 100,500 transactions and believes everything is fine).

    The essence of the incident


    Sometimes it is trivial ("the server spontaneously rebooted", or "the plane crashed"). Sometimes it is extremely hard to pin down ("customers complain that they cannot change the region", while every employee with a client account can change the region just fine). Most often, the farther the source of the complaint is from the system administrator, the vaguer the complaint becomes: "the client says that after ordering a teddy bear in the online store he cannot change the region in IE7 over an LTE connection via a USB modem, and he gets a 500 error when he tries to cancel the operation and press 'back'".

    It gets even more complicated when several problems merge together: "the server suddenly rebooted, and on another server there were database timeouts, and at the same time clients wrote that pictures were not loading." How many problems is that? One, two, three, or maybe more? Which of them should be silently combined (the database and the missing images), and which should be treated separately? And if at that moment a complaint arrives that a user cannot log in, is that the usual "forgot password" or another symptom? And if there are two such users? Or someone mentions in passing, "my mail doesn't seem to go through"?

    Unconsciously, once problems start, every new complaint immediately merges with the existing ones (and can lead in the wrong direction), and it sharply increases stress, because now you have to think about eight symptoms instead of three, say. And only about seven fit comfortably in your head. At the same time, it has happened in my practice that a "new" symptom easily led to the essence of a problem and its elimination... minus the fact that the serious problem (the one it all started with) had nothing to do with the joyfully and quickly fixed trifle. And the time was wasted.

    There is no simple advice for such a situation. In difficult situations I usually write out everything I hear and notice, not analyzing it, just recording the time.

    That is, the log (in sticky notes) looks something like this:
    • Monitoring fired for srv1 (22:05)
    • (name) mentioned problems with the mail (22:07)
    • Cannot log in to srv12 (22:08) / connection refused - got in at 22:16, dmesg is clean, uptime is long
    • Cannot log in to srv13 (22:10) (timeout) - the office wifi dropped (22:11)
    • The panel does not open (22:12)
    • Support writes that a client complains that nothing works (22:15)

    Do not get carried away (this is not the time for long typing), but the symptoms are worth writing down. Whether it is one incident or several, and whether these symptoms matter or not, will become clear later. I usually start writing things down after about the third distracting report.

    The second aspect of the problem is proving that the problem exists. The most hated phrase, and one that cannot be avoided:

    Everything works


    After Enia Airlines complained to the manufacturer that its planes sometimes crash, the developer checked that planes take off and land and closed the ticket as 'Unable to reproduce'. Enia Airlines support staff continue to collect statistics on crashes and are trying to learn how to reproduce a crash in the laboratory.


    Finding a problem is often more difficult than fixing it. The worst scenario imaginable is a stream of complaints from users against completely green monitoring. Sometimes you even look with your own eyes and hands, and everything is fine. Yet the complaints keep coming. That is, "there is a problem" and "there is no problem" at the same time. This is the most disgusting state, and if it concerns not a primitive "works / doesn't work" but something more subtle, then, if things go badly, it can drag on for weeks or even months. A real example from my practice: once every three to five months there was a 300-400 s (!) delay under ordinary load on a disk array. No other symptoms. Not reproducible in the lab. Debug that however you like.

    So, surprisingly, you feel joy when you learn to reliably reproduce a problem. A real example: at one time it took me more than a year to work out a scenario for hanging an enclosure with SATA disks. In production I saw utterly disgusting things, but I could not reproduce them on a test stand or with a synthetic load. Finally I managed, by pure chance. To LSI's shame, there has been no reaction to the bug report so far, but at least I got a reinforced-concrete argument for discussions of the (in)applicability of such enclosures under serious load. I must say that I still do not have an equivalent scenario for SAS disks, although I strongly suspect the problem reproduces there just as well. So you could say that for 8 years I have had the same bug with SAS and have still not learned to reliably reproduce it.

    Another problem of the same kind is interaction between departments. When there is no obvious reproducible evidence, it is very hard to complain about vague pains and glitches, because the other person sees that "everything works for him." On trust, he may make a few gestures toward checking and debugging, but when he keeps seeing that everything is fine, the outside claim that "really, everything is bad" rapidly loses credibility. This can and should be fought, but the phenomenon exists and is objective. You must suppress it in others and restrain it in yourself when someone else complains about problems.

    Another serious dilemma is the choice between fixing and studying.

    To fix


    It does not work. It must. It needs to be fixed. Right now. Unfortunately, fixing it often destroys all the evidence and circumstances (for example, a server reboot loses the state of the kernel, the processes and the contents of tmpfs). So, if at all possible, you should move the load to working servers and relieve the time pressure without destroying the evidence.

    To study


    This is my favorite mode of operation. The problem is localized, excluded from production (at least from its sensitive places), but not yet resolved. You can carefully and gently poke at it from all sides, feel out the problem more precisely, understand its scope, study the circumstances and all the direct symptoms, write out all the characteristic lines from all the consoles and describe the behavior for yourself and your descendants. After the patient has been thoroughly prodded and smoothed over, you can try to fix it, starting with the least intrusive methods and switching to harsher ones only if the gentler ones fail. On the output, this gives the maximum experience-per-incident ratio.

    Research depth


    After the cause has been found ("the plane crashed because the engines stopped working") and eliminated ("the plane was replaced, the flight is normal after the restart, there are no complaints from the new passengers, support is handling complaints and requests to identify the corpses of the previous passengers in normal mode"), a certain worm of doubt appears: has the cause really been established? The more complex the installation, the more connections between its components, the less obvious and deterministic the state of each of them, and the less unambiguous the link between the alleged effect and the alleged cause. Question: "why did the engines stop working?" Reading the nagios logs in the cockpit, we find the answer: "because they did not receive fuel." Are the engines bad? No, it seems they did everything right. No fuel, no thrust. The incident report starts to look more meaningful: "fuel stopped flowing into the engines, they stopped, and the plane crashed. The pilot redundancy system in the cockpit did not work (the reasons are being clarified)."

    How deep must one go in trying to understand "why"? Each subsequent step opens another abyss (investigating most subtle problems eventually comes down to reading source code), it is very exhausting and takes a lot of time. There is even a proverb (alas, not a joke): "Today I learned one more thing I did not need to know." It is usually said after reading yet another masterpiece of someone else's code.

    A living example. There is a virtualization monster called XenServer. A powerful corporation (Citrix), enterprise, millions in sales, thousand-dollar licenses, everything top-notch.

    ... Until an old, exhausted sysadmin discovers, in the code that serves virtual machine disks, the following construction: launching the ls program from python and parsing its output to get the list of files in a directory. He would much rather not have seen it.
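
    For illustration, a minimal sketch of what such a construction looks like (a reconstruction of the anti-pattern, not the actual XenServer code), next to the obvious alternative:

    # Hypothetical reconstruction of the anti-pattern: shelling out to ls
    # and parsing its stdout to get a directory listing.
    import os
    import subprocess

    def list_dir_bad(path):
        # Fragile: depends on ls output formatting and locale, breaks on
        # file names with whitespace, and forks a process on every call.
        out = subprocess.check_output(["ls", path])
        return [name for name in out.decode().splitlines() if name]

    def list_dir_ok(path):
        # The idiomatic equivalent: ask the OS directly.
        return os.listdir(path)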

    ... The senior pilot cringed nervously, remembering that the fuel hose is attached to the fuel tank through a perforated condom, because the tank outlet is 11 mm and the hose diameter is 19. The main thing is not to think about it. Maybe it will hold... But the takeoff deadline was not missed.


    Each step back into the past requires, first, a deeper understanding of how the next component works and, second, an ever greater breadth of coverage. Because most often each step back is a search for a superposition of effects from a growing number of components, often going beyond the scope of an ordinary investigation, and each of them has to be dug into deeply.

    Sometimes it is "turtles all the way down": the breadth of the search becomes too large, the volume of places to look too vast, and the logs grow thinner (and less and less relevant). In that case the search peters out under the growing complexity of the analysis and the lack of resources (time and enthusiasm).

    And sometimes the answer is found at the very bottom. If anyone has not read the previous story, a good example: Admin tales: in pursuit of tunnel fragmentation in the overlay network.

    But there is a third option: a "breaking change" is found, but questions remain. For example: "why?" (usually voiced in its obscene equivalent). Although at this point the process starts to resemble an existential search for the light of reason in humanity, the question actually matters. It could be stupidity, negligence, a mistake. But what if not?

    And then another layer of questions unfolds: why do they (the developers) want to do things that will not work for us? It is clear that they (the developers) are villains and scoundrels, and we (the sysadmins) are all princes on white horses. But what if not? What if we are the ones using it wrong?

    This, in turn, can lead to a major revision of the architecture. A minor, rare problem, a particular case, suddenly reveals a logical inconsistency or a violation of the promises and expectations between components. (Note: "fix it and forget it" gives you no chance to see exactly these things.) A real example: during "live" migration of a virtual machine, sometimes, very rarely, the migration failed with error 0x4c (without further explanation) and left the virtual machine inoperable in the middle of the migration. After curses on the heads of the developers, the drawing up of migration regulations and other nonsense, during one of the hangs we got deep into the system and began to study the "non-migrating" machine. Usually such a machine was repaired by rebooting it, but this time the stuck machine was "our own" and could be left in that state for a long time. Close analysis of the events in the logs showed that if the virtual machine's memory grew after the migration had started, but before a certain point (a window of about 2 s), the hypervisor got stuck, unable to copy the rest of the memory. Are the programmers to blame? Well, a little; they should have anticipated something like that. Are the architects of the final system to blame? Yes, and more than the programmers: there is no reason to arrange transients in the middle of a migration. A small fix that turned off memory management during migration, and error 0x4c no longer exists.

    Place of occurrence


    We return to the detective metaphor.

    The more servers in the farm, the harder it becomes to determine the "place". If you have a single lonely server in your household, then whatever happens (short of losing electricity / Internet / cooling), all problems are on the Server. The one and only. If you have several servers, the picture may get more complicated, but not radically. You can examine all the servers in turn and find the "crime scene".

    But what if there are several thousand servers?

    Let us assume that at this point it is at least clear what is happening. Next, you need to understand the workflow behind what is happening. In complex systems it can be very convoluted (off the top of my head: LB / CDN / cache / application server / database. The database is several servers, and you need to figure out whether it is the master or one of the slaves; applications may go to other servers; the CDN, in turn, may be tied to object storage (another few dozen servers) or to a storage system). It also happens that some API requests pass through several applications and terminate outside the servers altogether (IPMI, switches, requests to aliens for randomness via an HWRNG, and so on).

    So, the workflow can be unwound in reverse order (from the result), or on a hunch, from the alleged cause.

    The reverse order is the boring but reliably working method. We go to the load balancer or its logs and see where the requests go; we go to the cache and see whether the response was cached or not; we go to the corresponding application, figure out which instance it was, read its logs, see which database on which server it talked to, read the DB logs, find that the log is littered with errors, happy end. Long. Dozens of complex queries to the log collection system, hundreds of kilobytes out of which you have to dig up what you need, and so on.
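
    As a crude illustration of this "boring" path, here is a sketch that follows a single request id through the layers in turn. The file paths, the idea of a request id stitched through all layers, and the plain-file grep are assumptions for the example; in a real setup this would more likely be a query to the log collection system.

    #!/usr/bin/env python3
    # Toy "reverse order" walk: balancer -> cache -> application -> database.
    # All file paths are hypothetical; pass the request id as the argument.
    import sys

    LOG_CHAIN = [
        ("balancer",    "/var/log/haproxy.log"),
        ("cache",       "/var/log/nginx/cache.log"),
        ("application", "/var/log/app/app.log"),
        ("database",    "/var/log/postgresql/postgresql.log"),
    ]

    def grep(path, needle):
        try:
            with open(path, errors="replace") as f:
                return [line.rstrip() for line in f if needle in line]
        except OSError as e:
            return ["<cannot read {}: {}>".format(path, e)]

    if __name__ == "__main__":
        request_id = sys.argv[1]   # e.g. the request id shown on the error page
        for layer, path in LOG_CHAIN:
            print("=== {} ({})".format(layer, path))
            for line in grep(path, request_id):
                print(line)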

    There is another approach: intuition. "Something about this behavior reminds me of... It doesn't look like the database, it's probably the new application," and off we go to read logs (chosen arbitrarily, by gut feeling). For all the arbitrariness of the choice, experience and intuition, which can locate the problem by secondary but important signs, are often right. The secondary signs can be completely unexpected things: the time, for example (3:59? cron.daily on the database?), the peculiar formatting of the error page (looks strange; did the styles not load fully? let's look at the backends serving statics), the features of the message (if an strerr-style format shows through, you should dig directly on the host where it was produced, and if it resembles a python/java traceback, even a fragmentary one, it is better to read around the application), and so on.

    I usually make two or three attempts to find the problem by intuition, after which I fall back on the slow path. Sometimes, when it feels mid-way that "now I understand everything," I may jump ahead, but in the general case I have to grind my way to the right place.

    Event time


    For all the triviality of general relativity, in the space-time continuum, as in server administration, time is not at all a trivial thing to establish. First, not all events have a time. For example, an application starts writing frantic messages to an unexpected log file. The messages are important, but they have no timestamp (and the log does not go to logstash, because nobody thought such a log could exist (in /tmp)). Or there is a time, but in very broad strokes. For example, atop shows that the application exited somewhere between 12:20 and 12:30. Roughly.

    Sometimes, to establish the correct sequence, you need synchronized time from different servers with sub-second accuracy. Because it matters whether the request came first and then the trace, or the trace first and then an unanswered request that had nothing to do with the accident.

    And sub-second accuracy is already quite demanding. First, ntp itself can drift a little. A real example from practice: in the dark times of openvswitch 1.1 it had a bad habit of adding a large (random) latency to new flows (and of eating memory to the point of stupor, but that is a separate story). And ntp was running on all the hosts. And ntp decided that the jitter was large. And time on the hosts slowly began to diverge, by no less than 30-90 seconds. This despite the fact that all of them were synchronized from a single server in the LAN with a ping of 0.0001 s. The result was a separate nagios check that time on the servers does not diverge.
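
    The check itself is trivial. A minimal sketch of such a nagios plugin, assuming an ssh-based probe, an illustrative host list and made-up thresholds (the check we actually ran was organized differently):

    #!/usr/bin/env python3
    # Nagios-style check: compare each host's clock with the local one over ssh
    # and complain if the offset exceeds a threshold.
    import subprocess
    import sys
    import time

    HOSTS = ["srv1", "srv12", "srv13"]   # illustrative
    WARN, CRIT = 1.0, 5.0                # allowed offset, seconds

    def remote_offset(host):
        before = time.time()
        out = subprocess.check_output(["ssh", host, "date", "+%s"], timeout=10)
        after = time.time()
        # Use the middle of the ssh round-trip as the local reference point.
        return int(out.strip()) - (before + after) / 2.0

    worst = 0.0
    for host in HOSTS:
        try:
            worst = max(worst, abs(remote_offset(host)))
        except Exception as e:
            print("TIME UNKNOWN - {}: {}".format(host, e))
            sys.exit(3)

    if worst >= CRIT:
        print("TIME CRITICAL - max offset {:.2f}s".format(worst))
        sys.exit(2)
    if worst >= WARN:
        print("TIME WARNING - max offset {:.2f}s".format(worst))
        sys.exit(1)
    print("TIME OK - max offset {:.2f}s".format(worst))
    sys.exit(0)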

    Second, logging can set the timestamp post factum (for heavy traces the lag can be considerable). It is even more fun when one of the victims of the problem is ntpd itself, or simply the time on the server. In my practice there was a case when a program got its time mixed up (and stopped accepting signed messages). Forensics showed that the time written in the log lines differed from the modification time of the file, which became the key to finding the bug (it occurred at a daylight-saving time transition).
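
    A sketch of that kind of comparison (the leading "YYYY-mm-dd HH:MM:SS" timestamp in the log lines is an assumption; the incident above involved a different format):

    #!/usr/bin/env python3
    # Compare the timestamp written in the last log line with the file's mtime.
    # A large discrepancy hints that either logging lags badly or the system
    # clock was wrong at the moment of writing. Assumes a non-empty log file.
    import os
    import sys
    from datetime import datetime

    path = sys.argv[1]

    last = None
    with open(path, errors="replace") as f:
        for line in f:
            if line.strip():
                last = line

    logged = datetime.strptime(last[:19], "%Y-%m-%d %H:%M:%S")
    mtime = datetime.fromtimestamp(os.stat(path).st_mtime)

    print("last line says :", logged)
    print("file mtime says:", mtime)
    print("discrepancy    : {:.0f}s".format(abs((mtime - logged).total_seconds())))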

    It also happens that different sources have no ntp at all, or have a very rough idea of time (support said two hours ago that five minutes ago the client said that everything had gone down ten minutes ago).

    Under such conditions, when the chronology cannot be kept in one's head, you have to write things out again, this time as a chronological sequence.

    Motive


    Programs never do anything for no reason. Especially in the area of all kinds of bad branches. Why did the program hang on exit? Why did the kernel not return the IO? Why did the log not get written?

    For each actor involved, you need to write out its motive. What for? Why? If there is no motive but there is a suspicion of involvement, you must at least write out what it did and what it should have done.

    In an ideal world programs contain no errors; in the real world they do, and people work with them for years. Moreover, many errors "lie quietly, bothering nobody" in rare corners of the code, until it so happens that their turn comes. Sometimes the fix is not to find and correct the error, but to eliminate the circumstances that trigger the corresponding piece of code. In that case both the "error" mode and the "no error" mode are each more or less standard. The most striking example of this (not from my practice) is hitting the threshold of 512k prefixes in the BGP full view on some older routers (more).

    Actors


    Another important question: who is involved in creating the emergency? It often turns out that the culprit is not just the immediate suspect. From practice: a billing request failed to complete. A glance showed that mongodb, which stores temporary samples, was not running. Time pressure demanded "restart everything", so the state of the processes was lost. The subsequent proceedings showed that mongodb kept dying (entries in dmesg). At the same time, no OOMs. After a few minutes of digging through muddled logs it turned out that mongodb was exiting quite politely because it lacked the memory to answer a complex query.

    And atop's logs showed that a ceilometer-collector nearby was swelling up, squeezing mongodb out of memory. Had there been no atop, nobody would have guessed anything, because it was restarted together with mongo. Although mongo is also to blame, for not writing clearly what it did not like.

    Chain of events


    It often happens that the surprising behavior of one process is indeed an error, but one triggered by another cause, often quite unrelated.

    It also happens that the assumption of stateless behavior of the system is violated. Everyone assumes that restarting on the same data gives the same result, but in reality it turns out that the first launch and the subsequent ones are two very different things. In that case an accident happens on the first start, and after a restart everything passes and works as expected.
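
    A toy illustration of this trap, with purely made-up names and paths: the code below looks stateless, but it quietly builds an on-disk cache, so a failure seen on the very first run vanishes after a restart.

    import json
    import os

    CACHE = "/tmp/regions.cache"   # illustrative path

    def fetch_regions_from_api():
        # Stand-in for the slow call that can time out under load.
        return ["eu-west", "us-east"]

    def load_regions():
        if os.path.exists(CACHE):
            with open(CACHE) as f:
                return json.load(f)         # fast path: taken after any restart
        regions = fetch_regions_from_api()  # slow path: taken exactly once
        with open(CACHE, "w") as f:
            json.dump(regions, f)
        return regions

    print(load_regions())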

    The debugging chain becomes even more serious when the source of the problem is data in the database. Where did it come from? Data can lie around for a long time until the code trips over it. An example from recent practice: ceilometer started crashing on image metadata with a dot in the name. The image picked up the metadata while an instance snapshot was being created. And the instance picked up the metadata from the host where it had been launched (a year earlier). And the typo in the metadata had long been fixed. And the problem bit a year later.

    Building such chains is the most fascinating part of the process. At that moment all the connections in the system, under the gaze of a suspicious detective, become crystal clear, and they often horrify with their pointlessness (after something breaks, it becomes clear how bad it was). A simple example from practice: in my younger years a stunning decision was made: to make an nfs share across a group of servers of the same type for temporary files and one-liners. It is very convenient, because a file from one server is often needed on another, and it is very convenient to have them all in one place. That lasted until there were problems on the server hosting the nfs. After a reboot... yes, quite suddenly, it turned out that PATH included this directory, and an attempt to call any program led to a scan of the nfs share... at the login stage. Back then we broke in via ssh /bin/bash -i.

    The process of investigating the causes also often lets you suddenly and very deeply understand what is going on in the system from an architectural point of view. Because when everything works, it is taken for granted. And when it doesn't... then it becomes clear how things are connected to each other, how they work, and how they don't.

    In a sense, I do not have the feeling that I understand how a system works until I have seen how it breaks. Because it is at the moment of a partial breakdown that all the design errors, unfortunate decisions and lack of safety margin become visible. Everything breaks; the question is with what consequences. One solution will report a fatal error and stop, and another will find itself in a code branch where the handling of SUCH errors was never provided for, with consequences of the corresponding scale.

    Investigative report


    Writing it is an essential part of forensics. People have poor memory, especially for illogical and poorly structured data. The report is divided into four parts: characteristic signs (especially quotes from the logs), a description of what happened, a method of early (or at least some) diagnosis (most often, one more monitoring check), and recommendations for elimination. All this goes to the wiki (you do have a corporate wiki, right?) and is scattered into a set of tasks to be completed. There is another bad tendency to fight: "this has never happened before and will never happen again." If something strange happened, it should not be ignored. Monitoring, monitoring, monitoring. Plus writing out everything that happened. It happens that a couple of extremely rare (one-off) incidents suddenly line up into a coherent picture much later... if the fragments of the picture have been preserved. So do not be lazy about writing things out, do not be lazy about taking things apart, and cover it with tests.
