
Fear and Loathing in a Single Startup. Part 2: Hate
As a system administrator, I advise you to get the most expensive dedicated server, no support plan, RAID, plenty of storage for those special files, the flashiest website template you can find, and AdWords for at least two days.
In the previous part, I described the application's general architecture and some features of the infrastructure. Today I'd like to dwell on a few points in more detail and show what problems were created literally out of thin air. Along the way, I'll explain why some frankly dubious decisions were made (based on conversations with my predecessor).
Lack of monitoring
The platform had no monitoring whatsoever. Meanwhile, users constantly complained that parts of the site were slow. My predecessor solved this by horizontal scaling: every two or three months another server was simply rented and added to the Nginx config on the load balancer. Looking ahead, I'll say that once I started collecting capacity statistics, it turned out that 90% of the infrastructure was sitting idle. The server rental money was simply wasted. The reasoning behind this approach: "Well, if something stops working, the customers will tell us; why keep yet another daemon running?"
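For contrast, even a trivial probe like the sketch below would have shown which backends were slow and which were idle. This is a minimal Ruby example with invented host names and a made-up /health endpoint, not anything from the project itself:

require "net/http"
require "benchmark"

# A deliberately primitive availability/latency check: the kind of
# monitoring that was completely absent. Host names are invented.
HOSTS = %w[app01.example.com app02.example.com]

HOSTS.each do |host|
  begin
    elapsed = Benchmark.realtime do
      Net::HTTP.get_response(URI("http://#{host}/health"))
    end
    puts format("%s responded in %.0f ms", host, elapsed * 1000)
  rescue StandardError => e
    puts "#{host} is down: #{e.class}: #{e.message}"
  end
end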
Gentoo in production
Over my years in the industry, all distributions have personally merged into one for me. Earlier, when planning infrastructure, I would stick to one particular distribution simply because I had more experience with it (or because I wanted to try a new one); these days I'm guided more by the cost of supporting a given solution in a given situation.
In the project I'm describing, my predecessor had read somewhere that Gentoo scales very well to dozens of servers: once a package is built, it can simply be pushed to the other machines with rsync. The theory is beautiful (I have even seen such a setup working for admin workstations); in practice, nobody bothered to sync the portage tree even once a week, which over time made installing packages practically impossible. Security updates were out of the question. Over a couple of weeks I brought everything into decent shape and started thinking about moving to a binary distribution. I had no desire to spend several days every month updating packages and rebuilding reverse dependencies (hello, ZeroMQ broker implemented in Ruby via libffi).
Broker
Since I've brought up the broker, let me describe its problems. Health monitoring: there was none. Or rather, the broker code contained stub functions ping_service(), get_service_state(), get_stats(), and the like; the only one actually implemented, ping_broker(), worked from just a single service and could only be called from the Rails console: ServiceName.ping_broker(). That was it. Services did not know when the broker was down. Services did not know how to re-register after the broker restarted. The broker was stateless and accordingly "forgot" about all services after a restart, so you had to walk through the servers by hand, attach to the screen sessions, and restart every service and its event handlers. And as the cherry on top, the broker was responsible for assigning ports to services: its settings defined a min_port:max_port pool, a service asked the broker at startup which port to bind, and then tried to listen on that port. If the broker runs on one server and the service starts on another, the port the broker handed out may already be taken, and the service simply fails to start with an "Address already in use" error. Monitoring services under such a scheme was impossible.
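As an illustration of how the port race could have been avoided: the service binds to an ephemeral port itself and then reports it, instead of asking the broker to guess a free one. This is a hypothetical Ruby sketch; BrokerClient and its methods are stand-ins for the project's real API, not its actual code:

require "socket"

# Stand-in for the real broker client; only the shape matters here.
class BrokerClient
  def register(name:, host:, port:)
    puts "registered #{name} at #{host}:#{port}"
  end

  def alive?
    true # a real client would ping the broker over ZeroMQ
  end
end

broker = BrokerClient.new
server = TCPServer.new("0.0.0.0", 0) # port 0: the OS picks a free local port
port   = server.addr[1]              # the port we actually got

broker.register(name: "some_service", host: Socket.gethostname, port: port)

# A trivial heartbeat so the service notices a broker restart and
# re-registers itself, instead of waiting for a human with a screen session.
Thread.new do
  loop do
    broker.register(name: "some_service", host: Socket.gethostname, port: port) unless broker.alive?
    sleep 10
  end
end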
Synctool
For those interested, here is a link to the project: http://walterdejong.github.io/synctool/ . In principle it had a right to exist. But first, a pile of shell scripts plus rsync is not configuration management, and second, I had just discovered Ansible, which turned out to be far more flexible. There isn't much else to say: in a couple of days I moved all the logic from synctool to Ansible and forgot it like a bad dream. The reason synctool was used: "Well, I looked at Puppet, it seemed complicated to me, but with synctool you can solve everything with scripts." The man simply didn't know about Ansible or Chef.
Falcon
In the first part I mentioned Falcon but forgot to link to it; fixed: http://www.falconpl.org/ . It is a mix of a procedural and a functional scripting language, with multithreading support and its own virtual machine. In principle it's a powerful and interesting thing with a low barrier to entry, but why use it only to run ssh dba@db01 "echo 'SELECT stuff FROM table' | psql -U postgres app_db" is beyond my comprehension. The question "what the hell is this doing here?" regarding Falcon never did get an answer.
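For comparison, the same job in plain Ruby, which the stack already had on every machine; the host, user, and query are taken from the example above, and Open3 is just the standard library, no extra VM required:

require "open3"

# Run the same remote query without Falcon: shell out to ssh + psql.
cmd = %(ssh dba@db01 "echo 'SELECT stuff FROM table' | psql -U postgres app_db")

stdout, stderr, status = Open3.capture3(cmd)
abort "query failed: #{stderr}" unless status.success?
puts stdout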
Separating production and development environments in code
The last point for today. Rails has a wonderful mechanism that covers 99% of the cases where an application needs to be configured differently for production and development (a minimal sketch of it follows below). This mechanism was not used: the service host names, the Redis address, the database address and port, and the application's domain name were all hardcoded. At some point I had to migrate Redis and the database to other servers, and the platform was down for more than a day while I hunted for all such places in the code. The reasons were the development model and the programmer's not particularly high qualification. At first the project was written practically slapped together on the knee, new features were added and added, nobody ever refactored, and at some point it turned into what you see in the picture:
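Here is that sketch. Rails.application.config_for is a real Rails API (it returns the section of a YAML config matching Rails.env); the file name, keys, and hosts are invented for illustration:

# config/redis.yml (invented example)
# development:
#   host: localhost
#   port: 6379
# production:
#   host: redis01.internal
#   port: 6379

# config/initializers/redis.rb
require "redis"

# config_for picks the section for the current environment, so moving
# Redis to another server means editing one YAML line instead of
# grepping the whole codebase for hardcoded addresses.
# (Older Rails returns string keys here; newer versions allow symbols.)
redis_conf = Rails.application.config_for(:redis)
REDIS = Redis.new(host: redis_conf["host"], port: redis_conf["port"])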
In the final part I'll describe how the platform looks now, which technologies are used and why, how picking the right tools for the job saves money, and why the system administrator shouldn't write application code while the programmer shouldn't administer servers.