How we moved from 30 servers to 2: Go
When we released the first version of IronWorker about 3 years ago, it was written in Ruby, and the API was built on Rails. After a while the load started to grow rapidly, and we quickly hit the limits of our Ruby applications. In short, we switched to Go. If you want the details, read on.
First, a bit of history. We wrote the first version of IronWorker, originally called SimpleWorker (not a bad name, is it?), when we were a consulting company building applications for other companies. At that time there were 2 popular things: Amazon Web Services and Ruby on Rails. So we built applications using Ruby on Rails and AWS, and attracted new customers. We created IronWorker to scratch our own itch. We had several clients with devices that sent data around the clock, 24/7, and we had to receive this data and convert it into something useful. We solved the problem by running heavy scheduled processes that crunched the data every hour, every day and so on. We decided to build something we could use for all our customers, without having to set up and maintain separate infrastructure for each of them just to process their data. So we created a "worker as a service", which we first used for our own tasks. Then we decided that someone else might need a similar service, and we made it public. That is how IronWorker was born.
Sustained CPU usage on our servers was about 50-60%. When the load grew, we added more servers to keep CPU usage at about 50%. This was fine as long as we were willing to pay for that many servers. The bigger problem was how we handled load surges. A jump in traffic created a domino effect that could take down the whole cluster. During a surge only 50% above normal, our Rails servers hit 100% CPU and stopped responding. The load balancer then decided that the server had crashed and redistributed its traffic among the remaining servers. Since those servers now had to handle the crashed server's requests on top of the peak load, it usually did not take long until the next server went down, was removed from the balancer pool in turn, and so on. Pretty soon, all the servers were down. This phenomenon is also known as a colossal clusterf**k (+Blake Mizerany).
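The domino effect described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the actual IronWorker setup: the function name simulateCascade, the capacity numbers (80 to 109 arbitrary units per server), the base load of 55 units per server, and the idea that the balancer splits traffic evenly across healthy servers are all hypothetical.

```go
package main

import "fmt"

// simulateCascade is a hypothetical model of the failure described in the
// text. It splits totalLoad evenly across the healthy servers, removes every
// server whose share exceeds its capacity, and repeats until the pool is
// stable. It returns how many servers are still healthy at the end.
func simulateCascade(capacities []float64, totalLoad float64) int {
	// Work on a copy so the caller's slice is not modified.
	healthy := append([]float64(nil), capacities...)
	for len(healthy) > 0 {
		share := totalLoad / float64(len(healthy))
		survivors := healthy[:0] // filter in place, reusing the copy
		for _, c := range healthy {
			if c >= share {
				survivors = append(survivors, c)
			}
		}
		if len(survivors) == len(healthy) {
			break // nobody failed this round, the pool is stable
		}
		healthy = survivors // failed servers leave the balancer pool
	}
	return len(healthy)
}

func main() {
	// Assumed: 30 servers with slightly different capacities (80..109 units).
	capacities := make([]float64, 30)
	for i := range capacities {
		capacities[i] = 80 + float64(i)
	}

	normal := 30 * 55.0   // assumed base load: 55 units per server
	surge := normal * 1.5 // a surge 50% above normal

	fmt.Printf("normal load: %d healthy servers\n", simulateCascade(capacities, normal))
	fmt.Printf("surge load:  %d healthy servers\n", simulateCascade(capacities, surge))
}
```

With these made-up numbers, the normal load leaves all 30 servers healthy, while the surge first knocks out the 3 weakest servers, then 9 more, and then the remaining 18 at once. The exact figures depend entirely on the assumed capacities; the point is only that each removal raises the share of the survivors.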
With the applications we had at the time, the only way to avoid this was to keep a huge amount of spare capacity, so that our servers ran at a lower load and were ready for peaks. But that meant spending a huge amount of money. Something had to change.
We rewrote it
We decided to rewrite the API. It was an easy decision. Honestly, our Ruby on Rails API was not scalable. After many years of building similar things in Java, which could handle a lot of load with far fewer resources than Ruby on Rails, I knew we could do much better. So the decision came down to which language to use.
I was open to new ideas, as the last thing I wanted to do was go back to Java. Java is (was?) a wonderful language with many advantages, such as performance, but after several years of writing Ruby I was delighted with how productive I could be. Ruby is a convenient, understandable, and simple language.
When we first decided to try Go, it was a risky decision. The community was small, there were few open source projects, and there were no success stories of Go in production. We were also not sure that we could hire talented programmers if we chose Go. But we soon realized that we could hire the best programmers precisely because we chose Go. We were one of the first companies to publicly announce that we used Go in production, and the first company to post a job on the golang mailing list. After that, the best developers wanted to work for us, because they could use Go at work.
After we launched the Go version, we reduced the number of servers to two (the second one is mostly there for redundancy). Server load became minimal, as if nothing were using any resources. CPU usage was below 5%, and the application used a few hundred KB of memory at startup, compared to the Rails applications, which consumed ~50 MB at startup. Compare that to the JVM! A lot of time has passed since then, and we have never had another colossal clusterf**k.
We have grown a lot since then. We now have far more traffic, we have launched additional services (IronMQ and IronCache), and we use hundreds of servers to meet our customers' needs. The whole backend is written in Go. In retrospect, choosing Go was a great decision: it allowed us to build great products, grow and scale, and attract talented people. And I think this choice will continue to help us grow for the foreseeable future.
P.S. This translation was made with permission from Travis. If you have any questions, he and other Iron.io team members will be happy to answer them.