Passive fingerprinting to detect synthetic traffic

A brief introduction
Different operating systems' implementations of the TCP/IP stack use different default parameter values. This makes it possible to conclude, with a good degree of certainty, which operating system generated a given packet.
In this context, a set of operating-system-specific packet parameters is called an OS fingerprint. Since the method only observes passing traffic and sends no requests of its own, it is called passive OS fingerprinting.
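To make the idea concrete, here is a minimal Python sketch of the simplest fingerprint of all, the initial TTL. It assumes the commonly cited defaults (64 for Linux and most other Unix-like systems, 128 for Windows); the names `guess_initial_ttl` and `guess_os_family` are illustrative and not part of any tool mentioned in this article.

```python
# Commonly cited initial TTL values; real hosts can be reconfigured.
COMMON_INITIAL_TTLS = (32, 64, 128, 255)

# Assumption: only the two most common defaults are mapped to an OS family.
OS_BY_INITIAL_TTL = {
    64: "unix-like",
    128: "windows",
}


def guess_initial_ttl(observed_ttl: int) -> int:
    """Return the smallest common initial TTL that is >= the observed TTL."""
    if not 1 <= observed_ttl <= 255:
        raise ValueError("TTL must be in 1..255")
    return next(t for t in COMMON_INITIAL_TTLS if observed_ttl <= t)


def guess_os_family(observed_ttl: int) -> str:
    """Map an observed TTL to a rough OS family, or 'unknown'."""
    return OS_BY_INITIAL_TTL.get(guess_initial_ttl(observed_ttl), "unknown")
```

Each router on the path decrements the TTL by one, so a packet observed with TTL 54 most likely started at 64, and one observed with 118 most likely started at 128.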
I use nginx as the front-end server, and unlike Apache it has no mod_p0f, so marking requests by fingerprint in nginx is a difficult task, but a solvable one. Below I describe the solution that worked for me.
Solution
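The group that interests me is *nix machines that impersonate Windows. To catch them, nginx needs to know which OS a connection comes from. Linux and most other Unix-like systems send packets with an initial TTL of 64, while Windows uses 128, so a packet that arrives with a TTL below 64 is unlikely to come from a Windows machine with default settings. I decided to mark such connections by redirecting them to a separate nginx port based on the TTL: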
iptables -A PREROUTING -t nat -p tcp -m tcp --dport 80 -m ttl --ttl-lt 64 -j REDIRECT --to-ports 8123
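The effect of this rule can be sketched in a few lines of Python. This is only a model of the matching logic for reasoning about edge cases, not something that runs on the server; the function name `destination_port` is made up for illustration.

```python
NORMAL_PORT = 80
MARKED_PORT = 8123
TTL_THRESHOLD = 64  # mirrors "--ttl-lt 64": match when TTL is strictly less than 64


def destination_port(observed_ttl: int, dport: int = NORMAL_PORT) -> int:
    """Model the REDIRECT rule: port 80 traffic with TTL < 64 goes to 8123."""
    if dport == NORMAL_PORT and observed_ttl < TTL_THRESHOLD:
        return MARKED_PORT
    return dport
```

Under the usual defaults, a Linux client ten hops away arrives with TTL 54 and lands on port 8123, while a Windows client at the same distance arrives with TTL 118 and stays on port 80. Because the comparison is strict, a Unix-like host with no routers in between (TTL still 64) is not marked.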
After that, everything in nginx becomes quite simple.
Add an additional port:
listen 80;
listen 8123;
Set a variable that flags requests arriving on this dedicated port.
map $server_port $is_specialport {
default 0;
8123 1;
}
Next, flag proxied requests. There are many of them because of Opera Turbo and similar services.
map $http_x_forwarded_for $is_proxy {
default 0;
~^. 1;
}
Flag requests with a Windows user agent.
map $http_user_agent $is_windows {
default 0;
"~Windows" 1;
}
Finally, define a flag variable for requests that have a Windows user agent, are not proxied, and arrived with a low TTL:
map $is_windows$is_specialport$is_proxy $is_suspected {
default "";
"110" is_suspicious;
}
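To check the combined logic offline, the four maps can be mirrored in Python. This is a rough sketch, not a replacement for the nginx configuration; the function name `is_suspected` simply echoes the nginx variable, and the regex behavior is approximated with Python's `re` module.

```python
import re


def is_suspected(user_agent: str, server_port: int, x_forwarded_for: str) -> str:
    """Mirror the nginx maps: Windows UA + marked port + no proxy header."""
    # "~Windows" is a case-sensitive regex match in nginx.
    is_windows = "1" if re.search(r"Windows", user_agent) else "0"
    # Port 8123 is where the iptables rule sends low-TTL connections.
    is_specialport = "1" if server_port == 8123 else "0"
    # "~^." matches any non-empty X-Forwarded-For value.
    is_proxy = "1" if x_forwarded_for else "0"
    key = is_windows + is_specialport + is_proxy
    return "is_suspicious" if key == "110" else ""
```

Only the combination "110" is flagged: a Windows user agent on a low-TTL connection that is explained by a proxy header, for example, yields an empty string.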
Log the flag value for every request:
log_format custom '$remote_addr - $remote_user [$time_local] '
'"$request" $status $bytes_sent '
'"$http_referer" "$http_user_agent" "$upstream_addr" '
'"$gzip_ratio" "[$upstream_response_time]" "$upstream_cache_status" "$request_time" "$is_suspected"';
access_log /var/log/nginx/nginx.access.log custom buffer=128k;
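With the flag in the last quoted field of every log line, counting flagged requests is straightforward. The sketch below assumes the `custom` format shown above; the sample lines are invented for illustration, and `count_flags` is a hypothetical helper name.

```python
import re
from collections import Counter

# The last double-quoted field on the line is "$is_suspected".
LAST_FIELD = re.compile(r'"([^"]*)"\s*$')

# Invented sample lines in the "custom" format, for illustration only.
SAMPLE_LINES = [
    '203.0.113.7 - - [21/Oct/2014:12:00:00 +0400] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (Windows NT 6.1)" "127.0.0.1:8080" "-" "[0.004]" "-" '
    '"0.005" "is_suspicious"',
    '203.0.113.8 - - [21/Oct/2014:12:00:01 +0400] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)" "127.0.0.1:8080" "-" "[0.003]" "-" '
    '"0.004" ""',
]


def count_flags(lines):
    """Count flagged vs. other requests; lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        match = LAST_FIELD.search(line)
        if match is None:
            continue
        key = "suspicious" if match.group(1) == "is_suspicious" else "other"
        counts[key] += 1
    return counts
```

In practice the function would be fed an open log file instead of `SAMPLE_LINES`, for example `count_flags(open("/var/log/nginx/nginx.access.log"))`.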
Conclusions
Of course, I do not claim that the method is highly accurate, but examining the logs revealed:
- clients that sent requests exclusively to statistics counters
- bots that were built to parse VKontakte but wandered onto the site through a link from a social network
- other assorted nasties, none of which is a living user either
The hit rate is very good; the flagged traffic really was worth a closer look.
PS
Of course, I know that the defaults are easy to change and that TTL is not the only criterion that could be used in this mechanism.
UPDATE:
As suggested in the comments, a counter was placed in the body of the article. According to its data:
Referrer retention in the group in question
Total | No referrer (% of all) | Suspicious (% of all) | Suspicious without a referrer (% of all) | Suspicious without a referrer (% of requests without a referrer) |
---|---|---|---|---|
144623 | 2.12968% | 6.70156% | 0.407957% | 19.1558% |
Percentage of requests from known anonymous proxies in the group in question
According to MaxMind GeoIP2 data dated October 21, 2014
Total | Requests from anonymous proxies (AP) | Suspicious requests from AP | Share of the suspicious group among AP requests |
---|---|---|---|
144623 | 160 | 124 | 77.5% |
Raw data: gist.github.com/Snawoot/d1f6ce46099555c668ca
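The derived columns in both tables can be checked with a couple of divisions. The sketch below uses only the figures printed in the tables; `pct` is a helper name introduced here.

```python
def pct(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Second table: 124 of the 160 anonymous-proxy requests are in the suspicious group.
ap_share = pct(124, 160)

# First table: both inputs are percentages of all requests, so their ratio gives
# the share of suspicious requests among requests without a referrer.
no_referrer_share = pct(0.407957, 2.12968)

print(f"{ap_share:.1f}%")           # share among anonymous-proxy requests
print(f"{no_referrer_share:.4f}%")  # share among requests without a referrer
```

The suspicious group makes up about 6.7% of all requests but 77.5% of requests from known anonymous proxies, which is what makes the criterion interesting.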
The criterion identifies unhealthy traffic quite well, and if a lot of it comes from a single source of purchased traffic, that is something to think about.