Passive fingerprinting to detect synthetic traffic

    For quite a long time I have wanted to look at clients of a public web service whose browser sends a Windows User-Agent header but whose traffic shows all the signs of a *nix network stack. Presumably, this group contains a high concentration of bots running on low-cost hosting to inflate traffic or scan the site.

    Briefly about the subject


    Different operating systems ship TCP/IP stack implementations with different default parameter values. This makes it possible to conclude, with a good degree of certainty, which operating system generated a packet.
    In this context, a set of operating-system-specific packet parameters is called an OS fingerprint. Since the method only observes passing traffic and sends no requests of its own, it is called passive OS fingerprinting.
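
    As an illustration of the idea, here is a minimal Python sketch that guesses the sender's initial TTL from the TTL observed on arrival. It assumes the commonly cited defaults (64 for Linux/BSD/macOS, 128 for Windows, 255 for some network gear) and that each router hop decrements the TTL by one. The table and the function names are hypothetical and not part of any fingerprinting tool.

```python
# Commonly cited default initial TTLs. This is an assumption for the
# sketch, not an exhaustive or authoritative fingerprint database.
INITIAL_TTLS = {
    64: "Linux/BSD/macOS",
    128: "Windows",
    255: "network gear / some Unix",
}


def guess_initial_ttl(observed_ttl):
    """Return the smallest well-known initial TTL >= the observed TTL."""
    if not 1 <= observed_ttl <= 255:
        raise ValueError("TTL must be in the range 1..255")
    for initial in sorted(INITIAL_TTLS):
        if observed_ttl <= initial:
            return initial


def guess_os_family(observed_ttl):
    """Map an observed TTL to the OS family that most likely sent it."""
    return INITIAL_TTLS[guess_initial_ttl(observed_ttl)]
```

    For example, a packet arriving with TTL 52 most likely started at 64 and crossed 12 hops, which points to a *nix sender.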

    I use nginx as the front-end server. Unlike Apache, it has no mod_p0f, so marking requests by fingerprint is a difficult task, but a solvable one. Below is the solution that got me the result.

    Solution


    As mentioned above, the group I am interested in is *nix machines that impersonate Windows. nginx therefore needs to know which OS a connection comes from. Windows sends packets with an initial TTL of 128 and Linux with 64, so a packet that arrives with a TTL below 64 most likely did not come from a Windows machine. I decided to mark such connections by redirecting them to a separate nginx port based on the TTL.
    iptables -A PREROUTING -t nat -p tcp -m tcp --dport 80 -m ttl --ttl-lt 64 -j REDIRECT --to-ports 8123
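
    To make the matching condition concrete: `--ttl-lt 64` matches packets whose TTL is strictly less than 64. The Python sketch below only mirrors that comparison; the function name is hypothetical, and the ports are the ones from the rule above.

```python
DEFAULT_PORT = 80    # port the client originally connected to
MARKED_PORT = 8123   # port the REDIRECT rule sends low-TTL traffic to
TTL_THRESHOLD = 64   # value passed to --ttl-lt


def destination_port(observed_ttl):
    """Return the port a connection lands on after the REDIRECT rule."""
    # --ttl-lt is a strict "less than" comparison.
    return MARKED_PORT if observed_ttl < TTL_THRESHOLD else DEFAULT_PORT
```

    Note that under this condition a *nix client with no router in between (TTL exactly 64) stays on port 80 and is not marked.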
    

    After that, everything in nginx becomes quite simple.
    Add the additional port:
            listen   80;
            listen   8123;
    

    Use a variable to mark requests that arrived on this dedicated port.
        map $server_port $is_specialport {
        default         0;
        8123            1;
        }
    

    Let's mark proxied requests. There are many of them because of Opera Turbo and the like.
        map $http_x_forwarded_for $is_proxy {
        default         0;
        ~^.            1;
        }
    

    A flag for a Windows User-Agent:
        map $http_user_agent $is_windows {
        default         0;
        "~Windows"      1;
        }
    

    And finally, we define a flag variable for requests that have a Windows User-Agent, are not proxied, and have a low TTL:
        map $is_windows$is_specialport$is_proxy $is_suspected {
        default             "";
        "110"    is_suspicious;
        }
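
    The chain of map blocks above can be read as a single decision function. Here is a minimal Python sketch of the same logic for checking it offline. The function name is hypothetical; an absent header is passed as an empty string, which is how nginx exposes a missing `$http_*` variable.

```python
import re

SPECIAL_PORT = 8123  # the dedicated port from the listen directive


def is_suspected(user_agent, x_forwarded_for, server_port):
    """Mirror the nginx maps: return "is_suspicious" or an empty string."""
    # "~Windows" is a case-sensitive regex match anywhere in the value.
    is_windows = "1" if re.search(r"Windows", user_agent) else "0"
    is_specialport = "1" if server_port == SPECIAL_PORT else "0"
    # "~^." matches any non-empty X-Forwarded-For value.
    is_proxy = "1" if re.search(r"^.", x_forwarded_for) else "0"
    # The last map concatenates the three flags and looks for "110".
    key = is_windows + is_specialport + is_proxy
    return "is_suspicious" if key == "110" else ""
```

    Only the combination of a Windows User-Agent, the low-TTL port, and no X-Forwarded-For header yields the flag. Every other combination falls through to the default empty value.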
    

    We log the flag value for all requests:
        log_format  custom  '$remote_addr - $remote_user [$time_local] '
                          '"$request" $status $bytes_sent '
                          '"$http_referer" "$http_user_agent" "$upstream_addr" '
                          '"$gzip_ratio" "[$upstream_response_time]" "$upstream_cache_status" "$request_time" "$is_suspected"';
        access_log  /var/log/nginx/nginx.access.log custom buffer=128k;
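
    With this log format, the flag is the last quoted field of each line, so flagged requests are easy to count per client address. The following is a minimal Python sketch under that assumption. The sample lines are made up for illustration (documentation-range IP addresses), and the function name is hypothetical.

```python
import re
from collections import Counter

# The last quoted field of the "custom" log format is "$is_suspected".
LAST_FIELD = re.compile(r'"([^"]*)"\s*$')

# Made-up sample lines in the "custom" format, for illustration only.
SAMPLE_LINES = [
    '192.0.2.10 - - [21/Oct/2014:12:00:00 +0400] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (Windows NT 6.1)" "-" "-" "[-]" "-" "0.001" "is_suspicious"',
    '198.51.100.7 - - [21/Oct/2014:12:00:01 +0400] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)" "-" "-" "[-]" "-" "0.002" ""',
    '192.0.2.10 - - [21/Oct/2014:12:00:02 +0400] "GET /a HTTP/1.1" 200 128 '
    '"-" "Mozilla/5.0 (Windows NT 6.1)" "-" "-" "[-]" "-" "0.001" "is_suspicious"',
]


def suspicious_by_ip(lines):
    """Count flagged requests per client address ($remote_addr)."""
    counts = Counter()
    for line in lines:
        match = LAST_FIELD.search(line)
        if match and match.group(1) == "is_suspicious":
            # $remote_addr is the first space-separated token of the line.
            counts[line.split(" ", 1)[0]] += 1
    return counts
```

    To run it against a real log, pass an open file object instead of `SAMPLE_LINES`, since the function only iterates over lines.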
    


    Conclusions


    Of course, I do not claim that the method is highly accurate, but watching the logs revealed:
    • clients whose requests went exclusively to statistics counters
    • bots aimed at parsing VKontakte that wandered onto the site via a link from the social network
    • assorted vermin of other kinds that are clearly not live users either

    The hit rate is very good, so the group was really worth a closer look.

    PS
    Of course, I know that the defaults are easy to change and that TTL is not the only criterion that could work in this mechanism.

    UPDATE:
    As suggested in the comments, a counter was placed in the body of the article. According to it:

    Presence of a referrer in the group in question
    Total requests                                                   144623
    No referrer (% of all)                                           2.12968%
    Suspicious (% of all)                                            6.70156%
    Suspicious without a referrer (% of all)                         0.407957%
    Suspicious without a referrer (% of requests without a referrer) 19.1558%

    Percentage of requests from known anonymous proxies (AP) in the group in question
    According to MaxMind GeoIP2 dated 10.21.2014
    Total requests                       144623
    Requests from AP                     160
    Suspicious requests from AP          124
    Group share among requests from AP   77.5%
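
    The shares in both tables can be cross-checked with simple arithmetic. The Python sketch below uses only the figures reported above; the variable and function names are hypothetical.

```python
# Figures taken from the two tables above.
NO_REFERRER_PCT = 2.12968              # % of all requests
SUSPICIOUS_NO_REFERRER_PCT = 0.407957  # % of all requests
AP_REQUESTS = 160                      # requests from anonymous proxies
SUSPICIOUS_AP_REQUESTS = 124           # of those, flagged as suspicious


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of the suspicious group among requests without a referrer.
suspicious_share_of_no_referrer = share(SUSPICIOUS_NO_REFERRER_PCT, NO_REFERRER_PCT)

# Share of the suspicious group among requests from anonymous proxies.
suspicious_share_of_ap = share(SUSPICIOUS_AP_REQUESTS, AP_REQUESTS)
```

    The first ratio reproduces the 19.1558% from the referrer table, and the second reproduces the 77.5% from the anonymous proxy table.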

    Raw data: gist.github.com/Snawoot/d1f6ce46099555c668ca
    The criterion identifies unhealthy traffic pretty well, and if a lot of it comes from a single paid traffic source, there is something to think about.
