
Improving testing by using real traffic
TL;DR: The closer your test data is to reality, the better. Try Gor, a tool that automatically replays production traffic to your staging site in real time.
Here at Granify we process a huge amount of user-generated data; our business is built on it. We must be sure that this data is collected and processed correctly.
You can't even imagine how strange data coming from users can be. It may pass through proxy servers, come from browsers you have never heard of, be mangled by client-side errors, and so on.
No matter how many tests and fixtures you have, they simply cannot cover all the cases. Traffic from production will always be different than expected.
Moreover, a new release can simply break everything, even when all the tests pass. In my experience, this happens all the time.
There is a whole class of errors that is very difficult to catch with automated or manual testing: concurrency issues, server configuration mistakes, errors that occur only when commands are called in a certain order, and much more.
But we can do several things to simplify the search for such bugs and improve system stability:
We always test on staging
A staging environment is a must, and it should be identical to production. Tools such as Puppet or Chef make keeping the two in sync much easier.
Require developers to always test their code manually on staging. This catches the most obvious errors, but it is still very far from what happens with production traffic.
Testing on real data
There are several techniques that allow you to test your code on real data (I recommend using both):
1. Update only one of the production servers, so that part of your users is served by the new code. This technique has several drawbacks: some of your users may see errors, and you may have to use sticky sessions. It is quite similar to A/B testing.
2. Replaying production traffic (log replay)
Ilya Grigorik wrote a wonderful article about load testing using the log replay technique.
Every article I have read on this topic mentions log replay only as a means of load testing with real data. I want to show how to use the same technique for day-to-day testing and bug hunting.
Programs like JMeter, httperf, or Tsung support log replay, but that support is either in its infancy or focused on stress testing rather than on emulating real users. Feel the difference? A real user is not just a set of requests: the order of requests, the time between them, the various HTTP headers, and so on all matter. For load testing this is sometimes unimportant, but for finding bugs it is critical. On top of that, these tools are hard to configure and automate.
Developers are lazy. If you want them to use some program or service, it should be as automated as possible; better still, it should work so that nobody notices it at all.
Replaying production traffic automatically
I wrote a simple program called Gor.
Gor automatically replays production traffic on staging in real time, 24 hours a day, with minimal overhead. Your staging environment thus always receives a portion of real traffic.
Gor consists of two parts: a Listener and a Replay server. The Listener is installed on the production web servers and duplicates all traffic to the Replay server running on a separate machine, which in turn forwards it to the desired address. The principle of operation is shown in the diagram below:
Gor supports rate limiting. This is a very important setting, since staging usually has fewer machines than production; you can set the maximum number of requests per second that your staging environment can handle.
You can find detailed documentation on the project page.
Since Gor is written in Go, you can simply run one of the precompiled builds from the Downloads page.
At Granify, we have been using Gor in production for some time, and we are very pleased with the results.
Happy testing!