YourChief October 23, 2014 at 18:26

Passive fingerprinting to detect synthetic traffic

For quite a while I have wanted to look at clients of a public web service whose browser sends a Windows User-Agent header but whose network stack shows all the signs of a *nix system. Presumably this group contains a high concentration of bots running on cheap hosting to inflate traffic or scrape the site.

Briefly about the subject

Different operating systems ship TCP/IP stack implementations with different default parameter values. This makes it possible to infer, with a good degree of confidence, which operating system generated a packet.
In this context, the set of OS-specific packet parameters is called an OS fingerprint. Because the method only observes passing traffic and sends no probes of its own, it is called passive OS fingerprinting.
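
As a rough illustration of the idea, the sketch below guesses the sender's initial TTL from the TTL observed on arrival. It assumes the commonly cited defaults (64 for Linux and most other *nix systems, 128 for Windows, 255 for some network equipment). The function names and the table are hypothetical and are not taken from p0f or any other tool.

```python
# Commonly cited default initial TTL values. This table is an assumption
# made for illustration; administrators can change these defaults.
COMMON_INITIAL_TTLS = (64, 128, 255)


def guess_initial_ttl(observed_ttl: int) -> int:
    """Return the smallest common initial TTL that is >= the observed TTL."""
    if not 1 <= observed_ttl <= 255:
        raise ValueError("TTL must be in the range 1..255")
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial
    raise AssertionError("unreachable: 255 covers every valid TTL")


def guess_os_family(observed_ttl: int) -> str:
    """Map the guessed initial TTL to a coarse OS family label."""
    families = {64: "nix", 128: "windows"}
    return families.get(guess_initial_ttl(observed_ttl), "unknown")


def estimate_hops(observed_ttl: int) -> int:
    """Estimate how many routers decremented the TTL along the way."""
    return guess_initial_ttl(observed_ttl) - observed_ttl
```

TTL alone is a very coarse fingerprint. Real tools combine it with other parameters such as the TCP window size and the set of TCP options.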

I use nginx as the front-end server, and unlike Apache it has no mod_p0f, so tagging requests by fingerprint is harder there, but it can be done. Below is the solution that got me the result.

Solution

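As mentioned above, the group that interests me is *nix machines that present themselves as Windows. To single them out, nginx needs to know which OS a connection comes from. Linux and most other *nix systems send packets with an initial TTL of 64 by default, while Windows uses 128, so a packet that arrives with a TTL below 64 most likely came from a *nix stack. I decided to mark such connections by redirecting them to a separate nginx port based on the TTL: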

iptables -A PREROUTING -t nat -p tcp -m tcp --dport 80 -m ttl --ttl-lt 64 -j REDIRECT --to-ports 8123
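
To make the effect of the rule explicit, here is a minimal sketch of the decision it makes for a new connection. The function name is hypothetical, and this only models the matching logic, not iptables itself. Note that `--ttl-lt 64` is a strict comparison, so a *nix host with no routers in between (TTL exactly 64) is not redirected.

```python
HTTP_PORT = 80        # port matched by --dport 80
SPECIAL_PORT = 8123   # port given in --to-ports 8123
TTL_THRESHOLD = 64    # value given in --ttl-lt 64


def effective_port(dport: int, ttl: int) -> int:
    """Return the local port a new TCP connection lands on after the rule."""
    # Only connections to port 80 with a TTL strictly below 64 are redirected.
    if dport == HTTP_PORT and ttl < TTL_THRESHOLD:
        return SPECIAL_PORT
    return dport
```

For example, a connection to port 80 arriving with TTL 57 ends up on port 8123, while one arriving with TTL 116 stays on port 80.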

On the nginx side everything then becomes quite simple.
Add the extra port:

        listen   80;
        listen   8123;

Record in a variable whether the request arrived on this dedicated port:

    map $server_port $is_specialport {
        default         0;
        8123            1;
    }

Next, flag proxied requests. There are many of them because of Opera Turbo and similar services.

    # Any non-empty X-Forwarded-For header counts as a proxied request.
    map $http_x_forwarded_for $is_proxy {
        default         0;
        ~^.             1;
    }

Flag requests with a Windows user agent:

    map $http_user_agent $is_windows {
        default         0;
        "~Windows"      1;
    }

Finally, define a flag variable for the case where the request has a Windows user agent, arrived with a low TTL, and is not proxied:

    # Key order: $is_windows, $is_specialport, $is_proxy.
    map $is_windows$is_specialport$is_proxy $is_suspected {
        default             "";
        "110"    is_suspicious;
    }
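
The three maps and the combined flag can be read as one small decision function. The sketch below mirrors that logic for illustration only. The function name and arguments are hypothetical, and nginx evaluates the maps itself.

```python
import re

SPECIAL_PORT = 8123  # port that low-TTL connections are redirected to


def is_suspected(user_agent: str, server_port: int, x_forwarded_for: str = "") -> str:
    """Return "is_suspicious" for a non-proxied, low-TTL Windows user agent."""
    # Mirrors "~Windows": a case-sensitive regex search in the User-Agent.
    is_windows = "1" if re.search(r"Windows", user_agent) else "0"
    # Mirrors the $server_port map: 1 only for the dedicated port.
    is_specialport = "1" if server_port == SPECIAL_PORT else "0"
    # Mirrors "~^.": any non-empty X-Forwarded-For header.
    is_proxy = "1" if x_forwarded_for else "0"
    # Same concatenated key as $is_windows$is_specialport$is_proxy.
    key = is_windows + is_specialport + is_proxy
    return "is_suspicious" if key == "110" else ""
```

Concatenating the three flags into one key is what lets a single `map` express a three-way AND without `if` blocks.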

Log the flag value for every request:

    log_format  custom  '$remote_addr - $remote_user [$time_local] '
                      '"$request" $status $bytes_sent '
                      '"$http_referer" "$http_user_agent" "$upstream_addr" '
                      '"$gzip_ratio" "[$upstream_response_time]" "$upstream_cache_status" "$request_time" "$is_suspected"';
    access_log  /var/log/nginx/nginx.access.log custom buffer=128k;
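
With the flag in the log, the suspicious requests can be pulled out with a short script. The sketch below assumes the `custom` format above, where `$is_suspected` is the last quoted field on the line. The sample lines and the function names are made up for illustration.

```python
import re
from collections import Counter

# The flag is the last double-quoted field on each log line.
LAST_FIELD = re.compile(r'"([^"]*)"\s*$')


def suspicion_flag(line: str) -> str:
    """Return the value of the trailing "$is_suspected" field, or ""."""
    match = LAST_FIELD.search(line)
    return match.group(1) if match else ""


def count_suspected_by_ip(lines) -> Counter:
    """Count flagged requests per client address (first field of the line)."""
    counts = Counter()
    for line in lines:
        if suspicion_flag(line) == "is_suspicious":
            counts[line.split(" ", 1)[0]] += 1
    return counts


# Hypothetical sample lines in the "custom" format (not real traffic).
SAMPLE_LINES = [
    '203.0.113.7 - - [23/Oct/2014:18:26:00 +0400] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (Windows NT 6.1)" "-" "-" "[-]" "-" "0.001" "is_suspicious"',
    '198.51.100.4 - - [23/Oct/2014:18:26:01 +0400] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)" "-" "-" "[-]" "-" "0.002" ""',
]

if __name__ == "__main__":
    for ip, hits in count_suspected_by_ip(SAMPLE_LINES).most_common():
        print(ip, hits)
```

Because the access log above is written with `buffer=128k`, the most recent requests may not be in the file yet when the script runs.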

Conclusions

Of course, I do not claim that the method is highly accurate, but a look through the logs turned up:

clients that sent requests exclusively to the statistics counters
bots built to scrape VKontakte that wandered onto the site through a link from the social network
other junk traffic of various kinds that is clearly not a live user either

The hit rate is very good, so this group really was worth a closer look.

PS
Of course, I know that defaults are easy to change and, of course, TTL is not the only criterion that could work in this mechanism.

UPDATE:
As suggested in the comments, I placed a counter in the body of the article. According to it:

Presence of a referrer in the group in question

Total	No referrer (% of all)	Suspicious (% of all)	Suspicious without a referrer (% of all)	Suspicious without a referrer (% of requests without a referrer)
144623	2.12968%	6.70156%	0.407957%	19.1558%

Share of requests from known anonymous proxies in the group in question
According to the MaxMind GeoIP2 database dated October 21, 2014

Total	Requests from anonymous proxies (AP)	Suspicious requests from AP	Share of the suspicious group among AP requests
144623	160	124	77.5%
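
The figures in both tables can be cross-checked with a few lines of arithmetic. The sketch below uses only the published numbers. The implied request counts are back-computed from the percentages and are not taken from the raw data, and the function names are hypothetical.

```python
TOTAL = 144623                          # total requests, from both tables
PCT_NO_REFERRER = 2.12968               # % of all requests
PCT_SUSPICIOUS = 6.70156                # % of all requests
PCT_SUSPICIOUS_NO_REFERRER = 0.407957   # % of all requests
AP_TOTAL = 160                          # requests from anonymous proxies
AP_SUSPICIOUS = 124                     # of those, flagged as suspicious


def implied_count(total: int, percent: float) -> int:
    """Back-compute a request count from a percentage of the total."""
    return round(total * percent / 100)


def conditional_share(joint_percent: float, marginal_percent: float) -> float:
    """Share (in %) of the joint group within the marginal group."""
    return 100 * joint_percent / marginal_percent


if __name__ == "__main__":
    share = conditional_share(PCT_SUSPICIOUS_NO_REFERRER, PCT_NO_REFERRER)
    print("suspicious among no-referrer requests, %:", round(share, 4))
    print("over-representation vs. all requests:", round(share / PCT_SUSPICIOUS, 2))
    print("suspicious among AP requests, %:", 100 * AP_SUSPICIOUS / AP_TOTAL)
```

Read this way, the suspicious group makes up under 7% of all requests but roughly 19% of the requests without a referrer and 77.5% of the requests from anonymous proxies.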

Raw data: gist.github.com/Snawoot/d1f6ce46099555c668ca
The criterion identifies unhealthy traffic fairly well, and if a lot of it comes from a single source of purchased traffic, that is something to think about.
