How Elasticsearch can help you find suspicious activity on a site

I offer Habrahabr readers a translation of the article “Spotting bad actors: what your logs can tell you about protecting your business” from the official Elasticsearch blog. The article describes how Elasticsearch's capabilities can be used to analyze web server logs and detect suspicious activity on a site.

Let's think about what we do when someone tries to hack our site, and when we do it. First, we usually deal with the threat only after the attackers have already found a vulnerability and exploited it. Second, often the only operational tool for fighting intruders is blocking IP addresses, and that is not very effective unless we have detailed information about all the addresses from which the attack is being conducted.

But how much would the situation change if we could obtain detailed information in advance about all suspicious IP addresses and subnets, and block them? Sounds great, doesn't it?

We can easily do this with Elasticsearch.

This search engine's arsenal includes a wonderful plugin called Netrisk, which takes on all the hassle of log analysis and uses a Sankey diagram (see the picture) to show the volume and concentration of suspicious activity across different traffic segments.



Preparation


Let's start by installing Netrisk. To do this, run the following command from the Elasticsearch home directory (Elasticsearch 1.4.0 or later is required):

bin/plugin -install markharwood/netrisk

Great, the plugin is now installed. But that's not all: it expects an index with a correctly configured mapping for the field that stores the IP addresses. I will explain the details later; for now, just create an index and fill it with a small amount of test data by running the following shell script:

$ES_HOME/plugins/netrisk/exampleData/indexAnonData.sh

Attention! This script creates an index named “mylogs”; if you already have an index with that name, it will be deleted.
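To check that the index was created and filled, you can run a quick document count (a simple sanity check using the standard _count API; not part of the original walkthrough):

curl "http://localhost:9200/mylogs/_count?pretty"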

If you have followed all the instructions above, you can open the plugin page: http://localhost:9200/_plugin/netrisk/

Launch


If you look at the data generated by the script, it may seem too sparse for serious analysis: essentially, the only valuable field we have is the HTTP response status. In fact, even that is enough to detect suspicious behavior. Normally a web server returns statuses in the 2xx–3xx range, but when something goes wrong it responds with a 4xx or 5xx status, for example, when someone tries to access a page that does not exist. It turns out we can already get a list of all suspicious requests to the server using the query:

status:[400 TO 599]

Netrisk uses the standard Lucene query parser (the same one as Kibana), so you can extend the query that flags suspicious traffic with additional conditions via OR. For example, we can tell the system that requests arriving without a UserAgent should also be considered suspicious, as shown below.
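A possible variant of such a combined query (the agent field name here is an assumption — check how your own logs are mapped; a missing UserAgent is commonly logged as “-”):

status:[400 TO 599] OR agent:"-"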

It is important to understand that the query at this stage does not definitively establish that every matching log entry is malicious. We are merely marking entries that may be suspicious, so that Elasticsearch can then analyze the whole log for high concentrations of suspicious requests from specific IPs or subnets.

If we run the plugin with the query above, Netrisk will show a Sankey diagram of the suspicious traffic reaching your site. Here are a few pointers for reading the chart:

  • The line thickness represents the number of “bad” requests, but that is not the most important thing!
  • The line color matters much more: it depends on which requests prevail. Bright red means almost all requests are bad, while green means almost all requests can be considered good. Hovering over a line shows the actual numbers behind the color.
  • The diagram shows which IP addresses belong to each subnet. This can be very valuable for a webmaster trying to determine where the malicious traffic is coming from: specific IP addresses or an entire subnet?
  • Clicking on a specific IP address opens the Project Honey Pot website, where you can read other webmasters' comments about that address.


You have probably noticed that a line's color can shift from red to green as you move from left to right. This is because on the left side of the diagram each node usually represents a small group of IP addresses whose requests were flagged as suspicious, while the nodes further right represent subnets containing many IP addresses with varying behavior. However, some subnets will be completely red. Most likely this means that nobody in that region is interested in your site except cybercriminals (for example, if you run a Russian-language site and see red traffic coming from China, it most likely means Chinese hackers want to harm your resource).

How does it work?


A bit about data mapping


The queries Netrisk makes rely on statistics about the IP addresses and subnets stored in the index. For such analysis we cannot simply index each IP address as a single string; we have to split it into tokens representing the address itself and the subnets it belongs to (for example, the IP address 186.28.25.186 is split into the tokens 186.28.25.186, 186.28.25, 186.28 and 186). This can be implemented with the following mapping rule:

Mapping rule
curl -XPOST "http://localhost:9200/mylogs" -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "ip4_analyzer": {
                    "tokenizer": "ip4_hierarchy"
                }
            },
            "tokenizer": {
                "ip4_hierarchy": {
                    "type": "PathHierarchy",
                    "delimiter": "."
                }
            }
        }
    },
    "mappings": {
        "log": {
            "properties": {
                "remote_host": {
                    "type": "string",
                    "index": "not_analyzed",
                    "fields": {
                        "subs": {
                            "type": "string",
                            "index_analyzer": "ip4_analyzer",
                            "search_analyzer": "keyword"
                        }
                    }
                }
            }
        }
    }
}'


This approach lets us search quickly across all four levels of each IP address's hierarchy at once (the same idea can be adapted to IPv6 addresses).
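You can verify the tokenization with the standard _analyze API (a quick check, assuming the index with the analyzer above has been created):

curl "http://localhost:9200/mylogs/_analyze?analyzer=ip4_analyzer&text=186.28.25.186&pretty"

The response should contain exactly the four tokens listed above: 186, 186.28, 186.28.25 and 186.28.25.186.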

What's inside?


Netrisk takes a query from you that defines what counts as a “bad request” (or more precisely, a “potentially bad request”). After filtering the data, Netrisk uses the significant_terms aggregation to determine which IP addresses and subnets account for a disproportionate share of the suspicious requests. The query template looks like this:

curl -XGET "http://localhost:9200/mylogs/_search?search_type=count" -d'{
    "query": {
        "query_string": {
            "query": "status:[400 TO 599]"
        }
    },
    "aggs": {
        "sigIps": {
            "significant_terms": {
                "field": "remote_host.subs",
                "size": 50,
                "shard_size": 50000,
                "gnd": {}
            }
        }
    }
}'


This query selects the 50 most suspicious IP addresses and subnets. There are several points worth noting:

  1. To get accurate results we need a high shard_size value. This puts serious pressure on memory and the network, as well as on disk space when the index contains many unique terms. Not indexing full IP addresses in the remote_host.subs field would reduce the load, but also the depth of the results.
  2. By default Elasticsearch uses the JLH heuristic for significance scoring, but the GND heuristic suits our task much better. JLH shines when rare terms matter: for example, it is important to discover that the words “Nosferatu” and “Helsing” are associated with a set of documents about a Dracula film, while the common word “he” tells us nothing. When analyzing IP addresses we can do without what JLH offers and take advantage of the less discriminating but faster GND.


This single query performs a mass analysis and gives us basic information about intruders in our system, but it is worth gathering a bit more detail before deciding to block IP addresses. For that we can use the following query:

Query
{
    "query": {
        "terms": {
            "remote_host.subs": [
                "256.52",
                "186.34.56"
            ]
        }
    },
    "aggs": {
        "ips": {
            "filters": {
                "filters": {
                    "256.52": {
                        "term": {
                            "remote_host.subs": "256.52"
                        }
                    },
                    "186.34.56": {
                        "term": {
                            "remote_host.subs": "186.34.56"
                        }
                    }
                }
            },
            "aggs": {
                "badTraffic": {
                    "filter": {
                        "query": {
                            "query_string": {
                                "query": "status:[400 TO 599]"
                            }
                        }
                    }
                },
                "uniqueIps": {
                    "cardinality": {
                        "field": "remote_host"
                    }
                }
            }
        }
    }
}



For each suspicious IP or subnet we specify, we get back the following (a schematic response is shown after the list):
  1. The total number of requests (good and bad);
  2. The number of bad requests;
  3. The number of unique IP addresses in the subnet.
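Schematically, the response to the query above looks like this (the bucket structure follows from the filters, filter and cardinality aggregations; the numbers are purely illustrative):

{
    "aggregations": {
        "ips": {
            "buckets": {
                "256.52": {
                    "doc_count": 1523,
                    "badTraffic": { "doc_count": 1487 },
                    "uniqueIps": { "value": 36 }
                },
                "186.34.56": {
                    "doc_count": 211,
                    "badTraffic": { "doc_count": 198 },
                    "uniqueIps": { "value": 4 }
                }
            }
        }
    }
}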

Now that we have both the diagram and detailed statistics on the dubious IP addresses, we can make the final decision about blocking them.
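The blocking itself happens outside Elasticsearch. As a sketch, on a Linux server it could be done with iptables (the subnet below is the illustrative one from the query above; adapt this to your own firewall setup):

# Drop all traffic from a suspicious /24 subnet (illustrative address)
iptables -A INPUT -s 186.34.56.0/24 -j DROP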

Conclusion


Tracking the behavior of entities such as IP addresses by analyzing web server logs is a computationally demanding task, and the data we have obtained here is just the tip of the iceberg.

Here are a few more interesting examples of behavioral analysis that you can implement yourself:

  1. How much time do visitors spend on my site?
  2. Which IP addresses behave like bots (loading only the page itself, without its CSS and JavaScript)? See the sketch after this list.
  3. Which pages are most often the first or the last ones viewed during a visit?
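As a starting point for the second example, you can reuse the same significant_terms pattern: take requests for pages (rather than static assets) as the foreground set and look for hosts that are over-represented in it. A rough sketch, assuming the request path is indexed in a url field (that field name is an assumption; adapt it to your mapping):

curl -XGET "http://localhost:9200/mylogs/_search?search_type=count" -d'{
    "query": {
        "query_string": {
            "query": "*:* AND NOT (url:*.css OR url:*.js)"
        }
    },
    "aggs": {
        "botLikeIps": {
            "significant_terms": {
                "field": "remote_host.subs",
                "size": 50,
                "shard_size": 50000,
                "gnd": {}
            }
        }
    }
}'

Hosts that score highly here request pages but almost never the CSS and JavaScript that a real browser would load. Note that wildcard clauses like these are slow on large indices; in production it would be better to index an explicit “is this a static asset” flag per log entry.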
