The site crashes. How to effectively organize support for web resources with third-party services

    How can a third party tell whether my sites and servers are working? Can it get that wrong? Who should learn about a problem, and when, so that action is taken in time? I will try to answer these questions by looking in detail at the instant crash alerts of the HostTracker website monitoring service, along with possible scenarios for escalating alerts and assigning roles.



    So, due to certain (usually unpleasant, sadly) circumstances, you have decided that it would be nice if someone besides you and your team kept an eye on the site. But questions arise. Some you will have to answer yourselves: are we ready to wake up at night for this site, how enthusiastically will colleagues greet the offer to respond to night-time SMS, how long can this site afford to be down if something happens, and, of course, who is to blame. With the remaining questions, we will try to help.

    Is it reliable?


    With a third-party monitoring service it is almost impossible to miss a problem. The exception is when the site is cached somewhere along the way, but in that case customers will still see the site, right? And if you look a little at the advanced settings, you can find ways to make the check reliable and unambiguous.

    An important parameter here is the monitoring interval. If the site is checked once every half hour, be prepared to learn about a problem up to half an hour after it starts.
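    To put a number on that, here is a back-of-the-envelope sketch. It is not part of the service: the function name and the assumption that an outage starts at a random moment between two checks are mine.

```python
def detection_delay_minutes(check_interval_min: float, alert_delay_min: float = 0.0):
    """Return (average, worst-case) minutes from outage start to the first alert.

    Assumption: the outage begins at a uniformly random moment between two
    checks, and any configured alert delay is simply added on top.
    """
    average = check_interval_min / 2 + alert_delay_min
    worst = check_interval_min + alert_delay_min
    return average, worst


if __name__ == "__main__":
    for interval in (1, 5, 30):
        avg, worst = detection_delay_minutes(interval)
        print(f"interval {interval} min: average {avg} min, worst case {worst} min")
```

    A shorter interval shrinks both figures, and this is the main lever you have over how late the bad news can arrive.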
    Fine, but what about the opposite: there is no problem, yet I get woken up? Or the sleep of your beloved boss gets disturbed?

    I do not want to worry in vain


    A perfectly reasonable requirement. First, the verification algorithm double-checks a failure from several servers. Second, if short-term glitches do occur that are too minor to count as real outages, the notification can be delayed until the situation becomes clear.



    With a delay of 3 minutes, for example, the site is checked again after 3 minutes, and the alarm is raised only if the problem has not resolved itself. Why can such glitches happen? Network lag, a reboot of network or server equipment, maintenance on the server, peak load, or even a ping time that suddenly shot up. You never know. No hosting provider guarantees a 100% SLA yet. This way, short-term failures are filtered out.
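    The two filters described above (confirmation from several servers, then a delayed re-check) can be sketched roughly as follows. This is only an illustration, not HostTracker's actual algorithm: the function names, the `min_failures` threshold and the way probe results are passed in are all assumptions.

```python
from typing import Callable, Sequence


def confirmed_down(probe_results: Sequence[bool], min_failures: int = 2) -> bool:
    """probe_results holds one entry per checking server: True = site reachable.

    The site counts as down only if at least `min_failures` servers
    (a hypothetical threshold) failed to reach it.
    """
    failures = sum(1 for ok in probe_results if not ok)
    return failures >= min_failures


def should_alert(first_check: Sequence[bool],
                 recheck: Callable[[], Sequence[bool]]) -> bool:
    """Alert only if the failure is confirmed now AND again on the re-check.

    In a real monitor the caller would wait for the configured delay
    (3 minutes in the example above) before `recheck` is invoked.
    """
    if not confirmed_down(first_check):
        return False  # a single failing location is treated as noise
    return confirmed_down(recheck())


if __name__ == "__main__":
    # Site failed from all three locations, but recovered before the re-check.
    print(should_alert([False, False, False], lambda: [True, True, True]))
```

    The re-check is passed in as a function so that the waiting itself stays outside the decision logic and the sketch can be tried without any network access.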

    Also important and interesting: this delay can be set individually for each contact. For example, a perfectly workable scheme:

    • The site administrator / developer is notified immediately
    • The head of department is notified after 30 minutes, when it is time to offer help if the problem is serious
    • The project manager is notified after 1 or 3 hours, when it is time to prepare explanations for clients if the problem is still unresolved

    That is, you can prudently set everything up so that motivating kicks and Valuable Instructions start arriving exactly when you really cannot manage without them.
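    As a rough illustration of such an escalation chain, here is a minimal sketch. The `Contact` class and `contacts_to_notify` are hypothetical names of mine, not the service's configuration format; the delays mirror the scheme above, with 60 minutes picked for the project manager.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Contact:
    role: str
    delay_min: int  # minutes of continuous downtime before this contact is alerted


# Delays taken from the example scheme; 60 is one of the "1 or 3 hours" options.
ESCALATION = (
    Contact("administrator", 0),
    Contact("head of department", 30),
    Contact("project manager", 60),
)


def contacts_to_notify(downtime_min: float,
                       contacts: Sequence[Contact] = ESCALATION) -> List[str]:
    """Return the roles whose personal delay has already elapsed."""
    return [c.role for c in contacts if downtime_min >= c.delay_min]


if __name__ == "__main__":
    for minutes in (0, 45, 180):
        print(minutes, contacts_to_notify(minutes))
```

    If the site comes back after ten minutes, only the administrator ever hears about it, which is exactly the point of per-contact delays.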

    By sleeping carefully in bed, you are helping the community


    Yes, there are companies and people who value employees' personal time, and that is very commendable. For such cases, you can set a working-hours schedule.



    This is very convenient if there is a “night admin” position (not necessarily an administrator: any IT specialist can restart a server), or if, for example, there are offices in different time zones and areas of responsibility can be divided by time of day.
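    A schedule check of this kind could look roughly like the sketch below. The `Shift` class, the fixed UTC offsets and the sample hours are assumptions made for the example; the service's own schedule settings may work differently.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone


@dataclass(frozen=True)
class Shift:
    name: str
    utc_offset_hours: int  # the contact's time zone as a fixed UTC offset
    start: time            # local start of working hours
    end: time              # local end of working hours (exclusive)


def on_duty(shift: Shift, now_utc: datetime) -> bool:
    """True if `now_utc` (timezone-aware) falls inside the contact's local shift."""
    tz = timezone(timedelta(hours=shift.utc_offset_hours))
    local = now_utc.astimezone(tz).time()
    if shift.start <= shift.end:
        return shift.start <= local < shift.end
    # Overnight shift, e.g. 18:00-09:00, wraps around midnight.
    return local >= shift.start or local < shift.end


if __name__ == "__main__":
    day = Shift("day admin", 2, time(9, 0), time(18, 0))
    night = Shift("night admin", 2, time(18, 0), time(9, 0))
    now = datetime(2024, 1, 15, 1, 30, tzinfo=timezone.utc)  # 03:30 local time
    print([s.name for s in (day, night) if on_duty(s, now)])
```

    An alert would then go only to contacts for whom `on_duty` is true at the moment of the failure, so the day admin sleeps while the night admin's phone rings.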

    Wake up at all costs


    For particularly critical systems, a repeat-alert function is provided. The alert is repeated until the site / server / service is back up, or until someone logs into the account and changes the settings. A repeated voice call is also possible. That is, not an SMS that will simply arrive and be ignored, but a persistent dialer that keeps calling until someone picks up the phone.
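    The repeat-alert logic boils down to a loop with two exit conditions. The sketch below is my own simplification: `repeat_alert` and its callbacks are hypothetical, and the `max_attempts` safety cap is an addition of mine that the description above does not mention.

```python
from typing import Callable


def repeat_alert(is_down: Callable[[], bool],
                 acknowledged: Callable[[], bool],
                 send: Callable[[int], None],
                 max_attempts: int = 10) -> int:
    """Keep alerting while the site is down and nobody has reacted.

    `acknowledged` stands in for "someone logged in and changed the settings".
    Returns the number of alerts sent. A real loop would sleep for the
    repeat interval between attempts; that is omitted here.
    """
    attempts = 0
    while attempts < max_attempts and is_down() and not acknowledged():
        attempts += 1
        send(attempts)
    return attempts


if __name__ == "__main__":
    states = iter([True, True, False])  # site recovers before the third attempt
    sent = repeat_alert(lambda: next(states), lambda: False,
                        lambda n: print(f"alert #{n}"))
    print("alerts sent:", sent)
```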



    But what if I miss something?


    You can always choose several notification methods, and set things up so that every little sneeze goes to email, while anything truly important arrives through more effective channels.
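    One way to picture this routing is a severity threshold per channel. Everything in this sketch is an assumption for illustration: the three severity levels, the channel names and the thresholds are mine, not the service's settings.

```python
from enum import IntEnum
from typing import Dict, List


class Severity(IntEnum):
    INFO = 1      # "every little sneeze"
    WARNING = 2
    CRITICAL = 3  # something truly important


# Hypothetical mapping: the minimum severity at which each channel is used.
CHANNEL_THRESHOLDS: Dict[str, Severity] = {
    "email": Severity.INFO,
    "sms": Severity.WARNING,
    "voice_call": Severity.CRITICAL,
}


def channels_for(severity: Severity) -> List[str]:
    """Return every channel whose threshold the event's severity reaches."""
    return [name for name, minimum in CHANNEL_THRESHOLDS.items()
            if severity >= minimum]


if __name__ == "__main__":
    for level in Severity:
        print(level.name, channels_for(level))
```

    Minor events stay in the inbox, while a critical outage uses every channel at once, so one missed message does not mean a missed incident.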
    In addition, everything is available in the logs.



    Similar scenarios are widely used by our customers and are updated as desired. Therefore, as always, we welcome all comments and suggestions.
