ilnuribat November 15, 2017 at 12:23

How the Drimkas Cabinet cloud service copes with sudden load spikes

One afternoon, our site crashed. Right after the reboot, it went down again. We knew this was not a DDoS attack but organic traffic: the requests were perfectly ordinary, yet the servers could not keep up. Adding more powerful hardware did not help. It became clear that it was time to optimize our system.

Young startups may find it useful to see how we coped with growing load on server software that was still fragile.

Drimkas Cabinet: a cloud service for cash register owners

In a previous post, I talked about how Drimkas Cabinet uses webhooks. In this article, I will focus on how the Cabinet came about and what problems we had to solve as the service grew.

Implementing webhooks for interaction between third-party services and a cloud service

Drimkas produces online cash registers. In 2016, a new version of the law on cash registers was adopted. One of its main changes is that every cash register must send sales data to the tax office in real time. For us, this meant that all cash registers now had to be connected to the Internet in order to transmit receipts to the Federal Tax Service.

Since our cash registers send sales data anyway, it seemed logical to collect that data in the cloud so that the owner always has remote access to it. That is how Drimkas Cabinet appeared.

How the Cabinet was organized at the beginning

We started the implementation. The developers rejected the first idea, which was to create a single logical server that all cash registers would access directly.

The difficulty was that we had several Linux-based cash register models on the market, one on Windows 10, and fiscal registrars with no operating system at all. Other devices were planned too, but at the time nobody knew what they would run on. This meant supporting different protocol versions and data formats, which would have made developing new features excessively complicated.

We decided to create an intermediate node, the "Hub", which would support different versions, cash register models, communication transports and even encodings, if necessary. The site, "Drimkas Cabinet", would receive normalized data and a single generalized protocol for communicating with all devices.

Figure: The Hub normalizes data from different cash register models for the Cabinet. External systems communicate with the Cabinet via its API.
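To make the Hub's role more concrete, here is a minimal sketch of the normalization idea. It is not our actual code: the protocol names, field names and the `NormalizedTask` shape are all hypothetical and exist only to show one parser per device format feeding a single common structure.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class NormalizedTask:
    """The single shape the Cabinet would see, whatever device sent the data."""
    device_id: str
    kind: str  # e.g. "receipt", "shift_open", "shift_close"
    payload: Dict[str, Any]


def parse_linux_v1(raw: Dict[str, Any]) -> NormalizedTask:
    # Hypothetical format: already close to the normalized one.
    return NormalizedTask(raw["device"], raw["type"], raw["data"])


def parse_fiscal_registrar(raw: Dict[str, Any]) -> NormalizedTask:
    # Hypothetical format: short keys and a numeric task code.
    kinds = {1: "receipt", 2: "shift_open", 3: "shift_close"}
    return NormalizedTask(raw["sn"], kinds[raw["t"]], raw["body"])


# One parser per device protocol; supporting a new device means adding an entry here.
PARSERS: Dict[str, Callable[[Dict[str, Any]], NormalizedTask]] = {
    "linux/v1": parse_linux_v1,
    "fr/v1": parse_fiscal_registrar,
}


def normalize(protocol: str, raw: Dict[str, Any]) -> NormalizedTask:
    """Convert a device-specific message into the common task shape."""
    try:
        parser = PARSERS[protocol]
    except KeyError:
        raise ValueError(f"unsupported protocol: {protocol}") from None
    return parser(raw)
```

With this layout, the receiving side only ever handles `NormalizedTask` objects, and differences between devices stay inside the individual parsers.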

To test the hypothesis that users really needed such a service, we launched a first version. The feedback came quickly: users asked for reports, export to Excel, product management and an open API for integrators. The project turned out to be in demand, and we kept releasing new features.

The Hub and the Cabinet communicated over standard HTTPS requests with a timeout of 5 seconds:

Cash registers sent information about shift openings and closings and about new receipts to the Hub. Once a minute they checked whether there were new tasks for them.
The Hub received tasks from cash registers, stored them locally, and every 10 seconds sent a batch of 100 tasks to the Cabinet. It also received tasks from the Cabinet and passed them on to a cash register the next time that register polled it.
The Cabinet accepted and processed receipts and shifts, and sent the cash registers changes to existing products and newly created products.

If the Cabinet for some reason could not accept even one receipt from a batch, the Hub assumed that none of the tasks had been accepted and resent the entire batch until the Cabinet reported that the whole batch had been accepted.
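The all-or-nothing rule can be sketched as follows. This is an illustration rather than the Hub's real code: `send_next_batch` and the `deliver` callback are hypothetical names, and `deliver` stands in for the HTTPS request to the Cabinet.

```python
from collections import deque
from itertools import islice
from typing import Callable, Deque, List

BATCH_SIZE = 100  # from the article: 100 tasks per batch


def send_next_batch(pending: Deque[dict], deliver: Callable[[List[dict]], bool]) -> int:
    """Try to deliver one batch; return how many tasks left the Hub's store.

    `deliver` returns True only if the Cabinet accepted every task in the batch.
    """
    batch = list(islice(pending, BATCH_SIZE))
    if not batch:
        return 0
    if deliver(batch):
        # Tasks are removed from local storage only after full acceptance.
        for _ in batch:
            pending.popleft()
        return len(batch)
    # Any failure, including a timeout, keeps the whole batch for the next attempt.
    return 0
```

Note the consequence: a single problematic receipt, or a single slow response, makes the Hub resend all 100 tasks, including the ones the Cabinet may already have handled.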

Receipts are the most expensive of all task types. You have to go through every line item, create any new products in Postgres, and then store the whole receipt in MongoDB. Why we decided to keep receipts in MongoDB is a separate question with its own pros and cons, and it is especially relevant now that there is a lot of data and we need complex queries with aggregations.

The system worked this way as long as about two thousand cash registers were connected to the Cabinet and receipts made up no more than 15% of all tasks.

When the number of users grew, we could no longer handle the load

It is interesting how our users' unstable Internet connections affected us. When a store loses its connection, the cash register keeps selling and accumulates receipts. As soon as the connection returns, it sends all the data at once to the OFD (the fiscal data operator) and to the Hub. The load on our servers came in jumps. As the number of cash registers grew, this load pattern became a real threat.

We began to notice slowdowns during the day, and they got worse over time. Then the server simply crashed and could not come back up, because it was immediately flooded with new tasks.

The Hub had no problems: it wrote all incoming data to its database without any heavy logic and passed it on. The trouble was that a batch with a large share of receipts could take longer to process than the nginx request timeout. The Cabinet then returned a 500 error on timeout, and the Hub immediately tried to push the same data again, even though the Cabinet was still processing the previous copy. In effect, the Hub was running a DoS attack on the Cabinet until the Cabinet went down.
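A rough back-of-the-envelope model shows why this spirals. It rests on simplifying assumptions that are not from our measurements: the sender retries the instant a request times out, and each copy takes the same time no matter how many others are running. The function name and the 12-second figure are purely illustrative.

```python
import math


def copies_in_flight(processing_s: float, timeout_s: float) -> int:
    """Peak number of copies of one batch being processed at the same time.

    Assumes an immediate retry after every timeout and a fixed processing time
    per copy. In reality the copies slow each other down, so treat the result
    as a lower bound.
    """
    if processing_s <= 0 or timeout_s <= 0:
        raise ValueError("durations must be positive")
    return max(1, math.ceil(processing_s / timeout_s))


# A batch that needs 12 seconds behind a 5-second timeout:
copies_in_flight(12, 5)  # 3
```

So a batch that is only a little too slow does not cost a little extra: the same work is done several times over, which makes every copy slower still.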

We could increase the nginx timeout or reduce the number of tasks sent at a time, but that would only buy a short reprieve. In a month we would be back at the same problem, only with bigger consequences: more cash registers would be connected, and we would let down more people.

To avoid the crush, we set up a queue

Judging by average statistics, we had enough hardware capacity. There was not that much data overall, and peak loads during the day were followed by lulls. We needed to smooth the load so that we would not go down during the peaks.

The solution was a task queue: tasks can be added at any rate and in any bursts, and they are processed as fast as capacity allows. Since the problem was on the Cabinet's side, we decided to build the queue into the Cabinet. We chose RabbitMQ. All tasks from the Hub went into the queue one by one, and workers processed them.
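The smoothing effect can be shown with a small sketch. It uses Python's standard `queue` module as a stand-in for RabbitMQ, so none of this is RabbitMQ's API; the names `run_worker`, `tasks` and `processed` are made up for the example.

```python
import queue
import threading
from typing import List


def run_worker(tasks: "queue.Queue[object]", processed: List[object]) -> None:
    """Take tasks one at a time, at the worker's own pace, until a None sentinel arrives."""
    while True:
        task = tasks.get()
        if task is None:
            tasks.task_done()
            return
        processed.append(task)  # stand-in for the real, slow processing
        tasks.task_done()


tasks: "queue.Queue[object]" = queue.Queue()
processed: List[object] = []
worker = threading.Thread(target=run_worker, args=(tasks, processed))
worker.start()

# A burst: a store comes back online and dumps its backlog all at once.
for i in range(1000):
    tasks.put({"kind": "receipt", "id": i})  # returns immediately, no waiting on the worker

tasks.put(None)  # tell the worker there is nothing more to do
worker.join()
```

The producer never waits for processing, so there is no request to time out and nothing to resend. The burst simply sits in the queue until the worker gets to it.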

The idea worked so well that the Hub team wanted the same thing on their side, because the Cabinet could also sometimes send a hundred thousand tasks at once. So we decided to use RabbitMQ as the transport between the Hub and the Cabinet.

Previously, if the Hub was down when a user saved products on the site, we returned a "Try again later" error. Now the data is saved in the database regardless. Putting a task in the queue is much faster, because we do not wait for confirmation from the Hub, and the Hub picks the task up whenever it can. If the Cabinet has not finished processing a task and the connection with the worker is interrupted, the task appears in the queue again. RabbitMQ does all this out of the box; no error handling is required on our side.
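The redelivery behaviour is easier to see in a toy model. The class below is an in-memory illustration of a queue with acknowledgements, not RabbitMQ's API, and every name in it is hypothetical. In RabbitMQ this behaviour relies on workers acknowledging messages only after processing them rather than on delivery.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class AckQueue:
    """Toy in-memory model of a queue with acknowledgements."""

    def __init__(self) -> None:
        self._ready: Deque[Tuple[int, str]] = deque()
        self._unacked: Dict[int, str] = {}
        self._next_tag = 0

    def publish(self, task: str) -> None:
        self._next_tag += 1
        self._ready.append((self._next_tag, task))

    def get(self) -> Optional[Tuple[int, str]]:
        """Hand a task to a worker; it stays tracked until acknowledged."""
        if not self._ready:
            return None
        tag, task = self._ready.popleft()
        self._unacked[tag] = task
        return tag, task

    def ack(self, tag: int) -> None:
        """Worker finished: forget the task for good."""
        del self._unacked[tag]

    def connection_lost(self) -> None:
        """Worker vanished: return everything it had not acknowledged to the queue."""
        for tag, task in sorted(self._unacked.items(), reverse=True):
            self._ready.appendleft((tag, task))
        self._unacked.clear()

    def __len__(self) -> int:
        return len(self._ready)
```

A task leaves the system only on `ack`. If the worker dies halfway through, the task goes back to the front of the queue and the next worker picks it up, which also means a task may be processed more than once.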

Seeing how well it worked, we decided to find the upper limit of the transport and ran a stress test. Our office network held out for a while but eventually went down. RabbitMQ, meanwhile, simply accumulated all the tasks and waited for the workers to process them. Our conclusion: RabbitMQ's capacity is more than enough for us, as long as it gets some SSD space and plenty of RAM.

Conclusion

We would not say that the initial approach of exchanging batches of tasks was wrong. In the early stages of development, the main resource is time. If we had planned for too distant a future and optimized everything in advance, we might never have reached release.

The main thing is to recognize in time that your service has outgrown its old solutions, and that stable operation and further growth require slowing down feature development and revisiting the project's architecture.
