How to build a payment system with your own hands
Hi, Habr! We at RBKmoney have written a new payment processing system. From scratch. Isn't that a dream?
True, as always, most of the road to the dream had to be sailed along rivers full of pitfalls, and part of it ridden on home-made bicycles. Along the way we gained a lot of interesting and useful knowledge that we would like to share with you.
We will tell you how we built the entire RBKmoney Payments processing (that is what we named it): how we made it resistant to load and hardware failures, and how we came up with a way to scale it horizontally, almost linearly.
And, finally, how we took off with all of this without forgetting the comfort of those inside: our payment system was created with the idea of being interesting first of all to the developers who build it.
With this post we open a series of articles in which we will share both specific technical things, approaches and implementations, and our experience of developing large distributed systems in general. The first article is an overview: in it we mark the milestones that we will later cover in detail, and sometimes in great detail.
No less than five years have passed since the last publication in our blog. During this time our development team has changed noticeably, and new people are now at the helm of the company.
When you create a payment system, you have to take into account a heap of different things and come up with a lot of solutions: from a processing core capable of handling thousands of simultaneous parallel money-transfer requests to user-friendly interfaces. Trivial, if you ignore the small nuances.
The harsh reality is that on the other side of payment processing sit the paying organizations, which do not accept such traffic with open arms and sometimes even ask us to "send no more than 3 requests per second". And the interfaces are looked at by people who, perhaps for the first time in their lives, have decided to pay for something on the Internet. For them, any UX flaw, confusion or delay is a reason to panic.
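Such upstream caps are usually honored with client-side throttling. Here is a minimal token-bucket sketch (our own illustrative code, not part of the real RBKmoney codebase) that enforces a "no more than 3 requests per second" limit toward a bank:

```python
import time

class TokenBucket:
    """Client-side rate limiter for an upstream that asks for
    'no more than N requests per second'. Illustrative sketch only."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens proportionally to the elapsed time.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# The bank's "3 requests per second": 10 rapid-fire attempts,
# only the initial burst of 3 passes; the rest must wait.
bucket = TokenBucket(rate=3, capacity=3)
sent = sum(1 for _ in range(10) if bucket.allow())
print(sent)  # -> 3
```

Requests that do not pass are queued and retried once tokens refill, rather than dropped, so the merchant never sees the bank's limit.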
A shopping cart you can fill even during a tornado
Our approach to building payment processing is to make it possible to always initiate a payment. No matter what happens on our side: a server burned down, the admin got tangled in the networks, the power went out in the building / district / city, our diesel generator, um... failed. It doesn't matter. The service will still let you start the payment.
The approach sounds familiar, right?
Yes, we were inspired by the concept described in the Amazon Dynamo paper. The folks at Amazon also built everything so that a user could put a book in the cart no matter what horrors were happening on the other side of the monitor.
Of course, we do not violate the laws of physics and have not figured out how to refute the CAP theorem. It is not guaranteed that the payment will be made right away (there may be problems on the banks' side), but the service will create the request, and the user will see that everything worked. And, to be honest, there are still a dozen technical-debt items in our backlog, so occasionally we can answer with a 504.
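The pattern behind "the service will create the request" is accept-then-process: durably record the payment intent, answer the client immediately, and talk to the bank asynchronously with retries. A minimal sketch (all names here are ours for illustration, not the real RBKmoney API):

```python
import uuid
from collections import deque

journal = deque()   # stands in for a replicated, durable log
statuses = {}

def start_payment(invoice: dict) -> dict:
    """Always succeeds from the client's point of view:
    the intent is persisted before we answer."""
    payment_id = str(uuid.uuid4())
    journal.append((payment_id, invoice))   # persist the intent first
    statuses[payment_id] = "pending"
    return {"id": payment_id, "status": "pending"}

def process_next(bank_is_up: bool) -> None:
    """Background worker; if the bank side is down,
    the request stays in the journal for a later retry."""
    payment_id, invoice = journal.popleft()
    if bank_is_up:
        statuses[payment_id] = "captured"
    else:
        journal.append((payment_id, invoice))  # keep it for a retry

resp = start_payment({"amount": 1000, "currency": "RUB"})
process_next(bank_is_up=False)   # bank is down: intent survives
process_next(bank_is_up=True)    # bank is back: payment completes
print(statuses[resp["id"]])      # -> captured
```

The client got a successful answer at `start_payment` time; whether the bank was reachable at that exact moment is invisible to them.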
A look into the bunker while the tornado is outside the window
Our payment gateway had to be made always available. Whether the load has peaked, something has failed, or part of a data center has been taken down for maintenance, the end user should not notice any of it.
We achieved this by minimizing the places where the state of the system is stored: stateless applications are easy to scale out horizontally.
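What "stateless" means in practice: a handler keeps nothing of its own between requests, so any replica can serve any request. A toy sketch, with a plain dict standing in for the external store (in our case that role is played by Riak):

```python
# Every replica is identical and keeps no local state; all state
# lives in an external shared store. A dict stands in for it here.

store = {}  # external KV store shared by all replicas

def make_handler(replica_name: str):
    """Build a handler for one replica. The name is only a label:
    the behavior is identical for every replica."""
    def handle(session_id: str, amount: int) -> int:
        total = store.get(session_id, 0) + amount  # read state externally
        store[session_id] = total                  # write it back
        return total
    return handle

replica_a = make_handler("a")
replica_b = make_handler("b")

replica_a("sess-1", 100)        # first request lands on replica a
print(replica_b("sess-1", 50))  # -> 150: replica b sees a's work
```

Because the replicas are interchangeable, a load balancer can route each request anywhere, and adding capacity is just starting more containers.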
The applications themselves run in Docker containers, whose logs we reliably ship to a central Elasticsearch storage; they find each other through Service Discovery, and data is transmitted over IPv6 inside the Macroservice.
All the microservices assembled and working together, along with the accompanying services, form the Macroservice, which ultimately provides you with the payment gateway as you see it from the outside, in the form of our public API.
SaltStack keeps everything in order: it describes the entire state of the Macroservice.
We will come back later with a detailed description of this whole farm.
Applications are the easy part.
But if you do store state somewhere, it has to be in a database where the cost of losing some of the nodes is minimal, where there is no master node, and which responds to requests with predictable latency. A dream come true, you say? On top of that, it should require no maintenance, and our Erlang developers should like it.
Oh yes, have we mentioned yet that the entire online part of our processing is written in Erlang?
As many of you have probably already guessed, we didn't really have a choice.
All the state of the online part of our system is stored in Basho Riak. We will tell you later how to cook Riak without breaking your fingers (you will surely break your brain anyway), but for now let's continue.
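Riak inherits its masterless design from Dynamo: keys hash onto a ring, and each key is stored on the next N distinct nodes clockwise, so no coordinator exists and losing one node loses no data. A rough sketch of that placement scheme (our simplification, not Riak's actual implementation, which uses fixed-size vnode partitions):

```python
import bisect
import hashlib

def ring_position(name: str) -> int:
    """Map any string onto the hash ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # Sorted ring positions; tuples compare by hash first.
        self.points = sorted((ring_position(n), n) for n in nodes)

    def preference_list(self, key: str):
        """The distinct nodes responsible for a key, walking
        clockwise from the key's position on the ring."""
        start = bisect.bisect(self.points, (ring_position(key), ""))
        picked = []
        for i in range(len(self.points)):
            node = self.points[(start + i) % len(self.points)][1]
            if node not in picked:
                picked.append(node)
            if len(picked) == self.replicas:
                break
        return picked

ring = Ring(["riak1", "riak2", "riak3", "riak4", "riak5"])
owners = ring.preference_list("payment:42")
print(owners)  # three distinct nodes, chosen with no master involved
```

Any node can compute this list locally, which is exactly why there is no single node whose death takes the cluster down.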
Where is the money, Lebowski?
If you take an infinite amount of money, you can probably build infinitely reliable processing. But that is not certain. Besides, nobody gave us that much money, so our servers sit exactly at the "quality, but China" level.
Fortunately, this had positive effects. When you realize that, as a developer, you will have a hard time getting 40 physical cores addressing 512 GB of RAM, you have to wriggle out and write small applications. But you can deploy as many of them as you like: the servers are still inexpensive.
Back in our world, servers tend not to come back to life after a reboot, or they catch a power supply failure at the most inopportune moment.
With all these horrors in mind, we learned to build the system with the expectation that any part of it can suddenly break. It is hard to recall whether this approach caused any inconvenience in developing the online part of the processing. Perhaps this is somehow related to the Erlang philosophy and its famous Let It Crash concept?
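Let It Crash in miniature: a worker does not defend itself against every possible error; it simply crashes, and a supervisor restarts it from a clean state. A toy Python analogue of the idea (real Erlang/OTP supervisors are far richer, with restart strategies and intensity limits):

```python
# Simulate two transient faults before the work succeeds.
failures = {"remaining": 2}

def worker(payload):
    """Does no defensive error handling: on a fault it just crashes."""
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise RuntimeError("transient failure")
    return f"processed {payload}"

def supervise(task, payload, max_restarts=5):
    """Restart the crashed task until it succeeds or the
    restart budget (OTP would say 'intensity') is exhausted."""
    for _ in range(max_restarts):
        try:
            return task(payload)
        except RuntimeError:
            continue  # let it crash, then simply run it again
    raise RuntimeError("restart intensity exceeded")

print(supervise(worker, "payment:42"))  # -> processed payment:42
```

The error-handling logic lives in one place, the supervisor, instead of being smeared across every worker; that is what makes a system of thousands of small crashing processes manageable.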
But servers are the easy part.
We figured out where to run the applications: there are many of them, and they scale. The database is also distributed, there are no masters, and burned-out nodes are no great loss; we can quickly load a cart with servers, drive to the data center and pitchfork them into the racks.
But you cannot do that with disk arrays! The failure of even a small disk storage means the failure of part of the payment service, and that we cannot afford. Duplicate the storage? Too impractical.
And we do not want to pay for expensive branded disk arrays. If only out of a simple sense of beauty: they would not look right next to racks where no-name servers are packed in even rows. And all of it costs unjustifiably much.
In the end we decided not to use disk arrays at all. All our block devices run under Ceph on the same low-cost servers, which we can put into the racks in whatever quantity we need.
With network hardware the approach is not much different. We take mid-range gear and get equipment well suited to the task quite inexpensively. If a switch fails, the second one keeps working in parallel, and OSPF configured on the servers guarantees convergence.
Thus we get a convenient, fault-tolerant and versatile building block: a rack full of simple cheap servers and a few switches. The next rack. And so on.
Simple, convenient and, on the whole, very reliable.
Listen to the rules of conduct on board
We never wanted simply to come to the office, do the work and get paid. The financial component is very important, but it does not replace the pleasure of a job well done. We had already written payment systems, including at previous jobs, so we knew what we did not want to do. We did not want standard but boring proven solutions; we did not want dull enterprise.
So we decided to pull the freshest things into our work. In payment system development, new solutions are often restricted: why do you need Docker at all, they say, let's do without it. And in general, it's not secure. Forbidden.
We decided not to forbid anything but, on the contrary, to encourage everything new. So in production our Macroservice is built from a huge heap of applications in Docker containers, managed through SaltStack, with Riak clusters, Consul as Service Discovery, an original implementation of request tracing in a distributed system, and many other great technologies.
And all of this is secure enough that we can publish a bug bounty program on hackerone.com without any shame.
Of course, the very first steps along this road turned out to be littered with an absolutely indecent number of rakes. How we ran over them, we will definitely tell; we will also explain, for example, why we have no test environment and how all the processing can simply be deployed on a developer's laptop.
As well as a bunch of other interesting things.
Thank you for choosing our airline!
P.S. Original content! All photos in the post are scenes from the life of our office.