Startup Growth Problems - Monitoring
Time matters in business: orders, work and agreements have to be delivered when promised. Clients and partners expect predictable timing when they work with you. In a traditional business, this depends on the work of employees, the actions of suppliers, the geographical location of the company, the state of the equipment and much more. Pinning down and controlling all of this is a difficult task.
The situation is somewhat clearer in IT. Most processes can be automated and handed over to a program or script. It is even better if the product is a web service: the main task then comes down to keeping the product running and developing it further.
One of our company's main products is a web service. It started as two people implementing an idea, and then the company grew: a PM (who is also the web programmer), a database programmer, a system administrator (with a little development experience), a tester and two more people handling sales and SEO. The system consists of a web interface and the database behind it. Using the service involves depositing and withdrawing money, which is held in a "virtual currency". For security, the web server and the database are hosted in-house, on our own servers.
Since the service deals with money, user confidence in the system is important. First of all, this is a question of security and protection against hacking. However, it is difficult for an ordinary client (especially one unfamiliar with IT subtleties, secure protocols and encryption) to assess how secure the system is. As long as nothing bad (a hack) happens, the work put into security stays invisible.
Availability is far more obvious to the client: if a user transfers money to a virtual account and then repeatedly cannot reach the site, trust in the system is undermined and the client is probably lost (sometimes several at once, since bad news spreads quickly). And users who were hard to attract in the first place are unlikely to come back after hitting an error instead of the main page.
We did not grasp how serious this was right away. At first the number of users was small, server capacity and bandwidth were more than sufficient, and so crashes were rare. In addition, the functionality was being tested and reworked frequently at that stage, and the developer himself kept an eye on the service's servers during working hours. The number of users outside working hours was relatively small: there were practically no clients from other time zones at that time.
Gradually, the main part of the service was completed. Improvements became minor, so the developers' focus switched to another project. Responsibility for watching the system was handed to the system administrator, on top of his other duties. Email and SMS notifications to the administrator were set up for server crashes. At the time, these measures seemed sufficient.
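As a rough illustration of this kind of crash notification, here is a minimal Python sketch. It is not the setup we actually ran: the function names and the `notify` callback are hypothetical, and real email or SMS delivery would have to be plugged in through `notify`.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # HTTP errors such as 503 are raised as HTTPError, a subclass of
        # URLError; timeouts and refused connections land here as well.
        return False


def check_once(probe: Callable[[], bool],
               notify: Callable[[str], None],
               name: str = "web service") -> bool:
    """Run one probe and send an alert through `notify` if it fails."""
    ok = probe()
    if not ok:
        notify(f"ALERT: {name} is not responding")
    return ok
```

In practice something like `check_once(lambda: http_probe(url), send_sms)` would run on a schedule, where `url` and `send_sms` are placeholders for the real address and gateway. Note that a check like this only reports the failure; someone still has to read the message and react.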
The product began to gain momentum and the number of users grew. New ideas appeared; some were implemented immediately, some were postponed. The service was translated into other languages and gradually entered new markets, which, among other things, increased the number of users at night. The load on the server grew steadily, although it was still far from the limits of the hardware.
One day the calm came to an end. On the night from Friday to Saturday, users began to have trouble reaching the main page of the site and were often met with a 503 error. The problem itself was simple but, as these things go, the admin was unavailable on a Friday night, so the SMS went unread. Even so, the problem was resolved relatively painlessly. The developer had also received the SMS and managed to call and wake the administrator, and 3 hours later the problem was fixed. Total downtime was 5 hours.
On Monday we held a debriefing. An analysis of the site’s traffic data painted an unpleasant picture: on the “problematic” Friday, traffic fell by a third compared to the previous Friday. Even more unpleasant were the significant drops on Saturday and Sunday: despite there being no technical problems on those days, traffic was down by 15%.
This reinforced our understanding that round-the-clock monitoring was needed. On the software side we chose Zabbix, which the system administrator was to install and configure. This took about a week, since his other tasks had not gone away and everything was done in parallel. That left an organizational question: who exactly would do the monitoring?
At first we had to settle for the following: shifting the working hours of existing employees (those who understood the system, that is, the system administrator and the developer) so that they took turns watching the server at night.
It was a forced decision and it did not last long. Firstly, two people still cannot provide round-the-clock monitoring: there are gaps in coverage, and a failure is just as likely during them. Secondly, few people like working at night, so discontent was growing, and on top of that, pulling the programmer away practically stopped development. So a week later we abandoned the idea and went back to thinking.
Hiring additional staff for monitoring
Of course, this is the most uncompromising solution: constant supervision by dedicated people gives a good result. But working this way would mean finding 3 more system administrators. They would need to be qualified enough to solve the problems that arise, yet most of their time would be wasted: the company is small, there are few servers, and there would be almost nothing else to keep them busy. In addition, that many people also need to be managed, which would be an extra headache.
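The arithmetic behind that headcount can be sketched roughly as follows. This is a back-of-the-envelope estimate under assumptions that are not part of the story above: exactly one person on duty at any moment, a hypothetical 42-hour cap on each person's week, and no allowance for vacation or sick leave.

```python
import math

HOURS_PER_WEEK = 24 * 7  # 168 hours of round-the-clock coverage


def admins_needed(max_hours_per_person: float) -> int:
    """Smallest team that covers every hour with one person on duty."""
    return math.ceil(HOURS_PER_WEEK / max_hours_per_person)


team = admins_needed(42)   # assumed weekly limit per person
extra_hires = team - 1     # the company already has one administrator
print(team, extra_hires)
```

With a stricter 40-hour week the same formula asks for five people, so three extra hires is close to the minimum rather than a generous estimate.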
Neither option worked. We could not afford to concentrate effort and money on them. But the need for monitoring had not gone away. This is one of the problems of growth: a need arises that the company cannot meet on its own. The solution we arrived at was outsourcing.
Before making the switch we had doubts. The main ones were the security and confidentiality of information that becomes available to a third party, and whether the quality of service might actually get worse. But this is mostly a matter of finding a responsible contractor and signing an NDA.
So we switched, and from the technical side it turned out to be easy. A month later we decided to see how things were going by checking the logs on the servers. We were satisfied with the results: over the month there were three serious failures that could have brought the servers down again, but our partners resolved them within half an hour. All of the failures occurred between one and four in the morning, a result of the product's gradual geographical growth.
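A check like this can be partly automated. The sketch below is only an illustration under assumptions: it supposes the logs can be reduced to a list of timestamped up/down checks, and the sample data is invented rather than taken from our servers.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, bool]  # (time of check, was the service up?)


def find_outages(samples: List[Sample]) -> List[Tuple[datetime, datetime]]:
    """Return (start, end) pairs for each continuous run of failed checks.

    An outage starts at the first failed check and ends at the next
    successful one. An outage still open at the end of the data is
    dropped, because its end time is unknown.
    """
    outages = []
    start = None
    for when, ok in sorted(samples):
        if not ok and start is None:
            start = when
        elif ok and start is not None:
            outages.append((start, when))
            start = None
    return outages


# Hypothetical samples: one check every 10 minutes around a night-time failure.
base = datetime(2024, 1, 13, 1, 0)
samples = [(base + timedelta(minutes=10 * i), i not in (3, 4)) for i in range(8)]

for start, end in find_outages(samples):
    print(start.time(), "->", end.time(), "lasted", end - start)
```

Comparing each outage's duration against the agreed response time (half an hour in our case) shows at a glance whether the partner is keeping up.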
Our system administrator's work has changed and become calmer. No longer distracted by monitoring, he has concentrated on DevOps. We regained our focus and development sped up. Thanks to our partner, we were finally able to implement ideas that had been postponed for a long time.