Gett support. How do we make everything work?

    Hello! My name is Vitaliy Kostousov, I work on the Global Tech Heroes team, and today I will talk about support, one of the most important components of any service. You can build a great application with slick visuals and chatbots that occasionally crack a decent joke. You can openly dump prices, luring customers in with a low-cost service at first. You can hire a wonderful SMM specialist you will never be ashamed of and who will not have to be replaced as often as an accountant in the nineties.

    But all of this can easily fall apart without sane support for your service. And I mean support in the broadest sense: from solving user problems to keeping the software and hardware running. Seriously, how long will people keep using an application that has been glitchy for a couple of weeks while the developers still have not responded properly, the support service sends canned robotic replies, and the best the call center offers is free classical music on hold? How everything is organized here, what we use to detect and solve problems, how many of us there are, and everything else is under the cut.

    We currently operate in three countries: Russia, the UK and Israel, and we have hundreds of thousands of active users; corporate customers alone number more than 20,000. Our applications receive plenty of requests every day. Then there are drivers and their requests. And on top of that, internal systems and monitoring. All of this has to work, and work well. For that we have a global technical support team, known internally as "Tech Heroes": R&D teams, escalation operators and engineers, as well as a Global Incident Manager. Here is what they face in their work.

    Team and users

    Let me clarify right away that by end users our team means not only customers and drivers (both private and corporate), who come first, but also marketing, support services and our internal departments. Customers write to support either through the application or on social networks. If the problem is technical, the ticket in SalesForce goes straight to us. People write not only about the application and the quality of its work as a whole, or some functions in particular, but also about the performance of the company's internal services. More than 1,000 Gett employees ask questions about software and process organization.

    Our team consists of 8 people distributed across three countries: Israel, Great Britain and Russia. The specialist from Russia works remotely; his responsibilities cover operational processes: monitoring and making changes to our main services. The remaining seven handle operational issues and much more: testing, bugs, specifications, quickly resolving escalations that come from operational specialists and managers, and monitoring all our databases, services and microservices. This team processes all tickets, from whatever country they arrive. Most of the work involves local problems, but when a serious bug hits the global services, the work switches into Global mode.

    You also need to keep in mind that we have many b2b clients around the world: the system has very flexible settings and supports business integration with a company's services. In other words, there are far more classes of cars than private users of the service ever see. It is important to understand that all of this affects both the operation of services and the number of transactional operations. B2B clients can also use a personal account on the company's website.


    There are several ticketing systems on the market that have already proved their worth: LiveAgent, ZenDesk, ZohoDesk and others. You can choose by convenience, by habit, or based on the software your colleagues already use, so as not to build a pile of extra layers and crutches (which would also have to be supported and maintained). We went with SalesForce, since it is used by the company's main operational areas (sales and support). This lets the creator of each case track its status. Cases are automatically prioritized based on the topic of the request. SalesForce is also integrated with Jira: if a task is created or a bug is filed for development, its status is also displayed in the case. This is how we achieve transparent communication between SalesForce support and development.

    A dedicated ticketing system allows us to track the SLA for each ticket that arrives.

    Tickets and requests

    Specifically, our team handles the application itself (for both drivers and passengers), the microservices that operational specialists work with, as well as testing and monitoring. On top of that, there are always requests for new reports and monitors that might be useful to colleagues from other departments. Some monitors are strictly for our team, when they concern only the technical parameters of services and databases. Others send alerts to us, the responsible team and support. If a problem concerns, say, the driver's application, support can react much faster and notify drivers if necessary. This shrinks the time to inform everyone to a few minutes.


    We have a lot of monitoring. As soon as one of the monitors fires, whether it is New Relic (service uptime), Grafana (monitoring of specific scenarios) or Datadog (infrastructure uptime), we immediately receive a notification in Slack and a phone call (thanks to PagerDuty). For each period of time one person is designated on call. Since this all happens automatically, it may well turn out that this particular person is currently unavailable or simply does not answer, in which case the call is forwarded further along the chain.
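That forwarding chain can be sketched in a few lines of Go. This is only a simulation of the escalation logic; in reality PagerDuty does the paging for us, and the names and availability below are made up:

```go
package main

import "fmt"

// notify simulates paging one engineer; `reached` tells the simulation whether
// that person picks up. In a real setup this would be a PagerDuty API call.
func notify(name string, reached map[string]bool) bool {
	fmt.Println("paging", name)
	return reached[name]
}

// escalate walks the on-call chain until someone acknowledges the alert,
// returning who did (or "" if nobody answered).
func escalate(chain []string, reached map[string]bool) string {
	for _, name := range chain {
		if notify(name, reached) {
			return name
		}
	}
	return ""
}

func main() {
	chain := []string{"on-call", "backup", "team-lead"}
	available := map[string]bool{"backup": true}
	fmt.Println("acknowledged by:", escalate(chain, available))
}
```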

    When alerts fire, we recheck the health of the systems and find out the cause of the failure (increased load, a burst of events or calls; anything can come flying in). If we determine that it really is a problem and needs to be resolved, we send an email to dedicated distribution groups for operational specialists.

    Therefore, we are always online.

    Incident management

    If your company provides services, there is no way around incident management. We work according to this scheme:

    • Timely detection of problems.
    • Notifying the responsible people about the problem.
    • Notifying stakeholders at all levels. That is, we explain the problem in business terms, so that everyone understands exactly how such problems affect the company and its profits.
    • Maintaining maximum transparency of the work.
    • Mandatory root cause analysis. After all, every problem has its origins, and the next one can be prevented. That is faster and more useful than solving the same thing all over again.

    The goal is to learn about problems at stage zero: when you, the person responsible for keeping things running, discover the problem yourself, not when a client tells you about it. That is why we actively use APM (Application Performance Monitoring) tooling. Here is what we monitor:


    • Monitoring all of our microservices and gateways
    • 50x / 4xx errors
    • Redis Apdex
    • Database Apdex

    New Relic dashboard (screenshot)

    Grafana: event monitoring (shows what exactly stopped working or is behaving differently from normal).

    DataDog: monitoring of the hardware components of our system (databases, load balancers).

    AirBrake: code exceptions for apps and microservices (there is a list of exceptions, for example when executing code or database queries; if something goes wrong and it is on the list, we track it).

    Kibana: monitoring of microservice and application logs (driver / client).

    And so that all of this serves not only detection but also timely notification (immediately; the faster, the better), it is wired to a number of notification channels, from Slack and PagerDuty to good old email. So the whole team learns about any anomaly right away. Alerts can be routed to different channels: monitors critical for application operation always alert the technical support team and, selectively, the channels of the development teams responsible for a specific feature or service. All of this helps optimize response time.

    Difficulties arose at the next step: after finding a problem, you have to quickly notify the person responsible for the service. That is not so simple when there are many processes and microservices, and thus just as many responsible people. And the alert may arrive late at night, when the last thing you want to do is figure out who is responsible for what.

    So we created a handy directory listing all service owners (across the entire company). As practice has shown, this alone helped us reduce the time to resolve each incident by about 20%.
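Conceptually, such a directory is just a mapping from a service to its owner. A sketch in Go; the services, names and channels below are made up for illustration:

```go
package main

import "fmt"

// Owner describes who to wake up when a service misbehaves.
type Owner struct {
	Team    string
	OnCall  string
	Channel string // Slack channel for the team's alerts
}

// owners maps service name to its owner; in practice such a directory would
// live somewhere shared (a wiki, a config repo, an internal service).
var owners = map[string]Owner{
	"payments": {Team: "billing", OnCall: "alice", Channel: "#billing-alerts"},
	"dispatch": {Team: "marketplace", OnCall: "bob", Channel: "#dispatch-alerts"},
}

// ownerOf returns the owner of a service and whether it is known, so an
// unknown service can be escalated to a default channel instead.
func ownerOf(service string) (Owner, bool) {
	o, ok := owners[service]
	return o, ok
}

func main() {
	if o, ok := ownerOf("payments"); ok {
		fmt.Printf("page %s in %s\n", o.OnCall, o.Channel)
	}
}
```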

    The surest recipe for a prolonged disaster here is to leave an incident without a person in charge.

    There is a dedicated person, the Global Incident Manager, who acts as a hub for serious incidents. He takes part in monitoring and in changing the core systems to eliminate errors that could seriously hurt the business, and he reports to the company's top officials, providing them with detailed root cause analysis reports.

    So, in short, the incident management process itself looks like this:

    1. We determine the causes of the incident.
    2. We find the person in charge.
    3. We are coordinating efforts with him to fix the problem as quickly as possible.
    4. We make all the necessary decisions during the incident.
    5. We inform the business, passing along all the problems.
    6. When the dust settles, we start the root cause analysis, RCA (Root Cause Analysis).

    We build incident reports in Jira; there is a corresponding module, Incidents, and we have added a number of extra fields to it.

    RCA has just three stages.

    1. Initial RCA

    This is the highest-level description of the cause of the problem (whether it was a problem with the database, the infrastructure, or the code). The report is prepared by the support engineer who managed the incident and must be completed within 24 hours after the incident ends.

    2. R&D RCA

    The most important part of the process, it must be completed within 48 hours after the incident ends. This is the full technical analysis of the root cause: why it happened, why it was not caught earlier (testers overlooked it, or there is no appropriate monitoring), whether there is a chance it will happen again, and what to do to prevent that.

    3. Actions

    Based on the second stage, corresponding subtasks are created, and the incident stays open until the last of these subtasks is closed. Nobody wants this task to hang on the kanban board for a long time, which motivates everyone to resolve things faster.
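The closing rule above is easy to express: an incident counts as closed only when every RCA subtask is done. A sketch in Go, not Jira's actual logic:

```go
package main

import "fmt"

// Subtask is one follow-up action produced by the R&D RCA stage.
type Subtask struct {
	Title string
	Done  bool
}

// incidentClosed mirrors the rule: the incident can be closed only when
// every RCA subtask is done. An empty list closes trivially.
func incidentClosed(subtasks []Subtask) bool {
	for _, s := range subtasks {
		if !s.Done {
			return false
		}
	}
	return true
}

func main() {
	tasks := []Subtask{
		{Title: "add missing monitor", Done: true},
		{Title: "fix migration script", Done: false},
	}
	fmt.Println(incidentClosed(tasks)) // false: one subtask is still open
}
```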

    That's how we at Gett work with incidents.

    Figures and Technologies

    We work, of course, 24/7 with an SLA of 99.99%. Our main stack is GoLang and Ruby, which gives us the speed needed for processing complex algorithms. There are more than 150 microservices in total, all of them also in GoLang and Ruby. For databases we use MySQL, Postgres and Presto. Storage lives on AWS.

    The heaviest load on our services falls on the New Year holidays and the two weeks before them. The state of competitors matters too: for example, if one of them has their application go down, our cars get ordered more often.

    There are also peaks of internal work that affect end users: for example, when we update a database, perform maintenance on the side of suppliers and vendors, deploy new services to production (not on Fridays, yes), or roll out features that immediately affect a large number of users or transactions.

    We are people too, and sometimes incorrect settings or manual intervention lead to operational errors, so we have developed a plan for such cases:

    No, not that. Here:

    • We check data in services, logs and audits.
    • We test and carry out update operations within our Scrum process.
    • We prepare a task for the team and monitor its execution in production.

    If you are interested in any details, feel free to ask questions in the comments; we will answer either here or in a separate detailed post.
