DIY IP Blocklist
Recently I published an article about site security, in particular the problem of CAPTCHA and the big question: is it possible to get rid of it, and if so, how?
The discussion was lively and very productive. As often happens, after analyzing readers' comments I changed my mind on a wide range of the issues raised in that article.
Here I would like to wrap up this topic, which was so dear to me, and outline the next steps I am going to take to develop it. They concern building your own blacklist of IP addresses. As always, I am not claiming anything definitively, only offering options.
Analysis of mistakes
And there is one mistake, a significant one. It turned out that a robot which is difficult or even impossible to distinguish from a person is not just a theoretical possibility but a practical one. Such a robot can download the client-side JS code and click any button on the site. Since I had never met such animals in my practice, I was not sure they existed at all, and, to put it mildly, I wanted to experience their effect on myself (on my site). After the publication of that article, anonymous but undoubtedly kind people gave me such an opportunity. Though, when put to a full test, it failed.
What can I conclude after meeting these robots? I cannot say that everything is fine, nor that everything is hopeless. Those who said that everything depends on the spammer's motivation and on the value of the site being spammed were right. If the spammer does not mind spending time on the site, and their resources (time, money, desire) are therefore almost unlimited, then fighting such spam without a CAPTCHA is impossible or extremely unprofitable. What do I mean?
Attacks hit my site. I changed the name of the JS function that ran when the submit button was clicked, and the attacks stopped dead. Could that be because the robot was programmed to call the script by name? It could. Could it then be reprogrammed for something else, say, finding the button and parsing it? Yes, it could. Then, to fight this robot, I would have to hide more and more things, conceal the button, invent ever more laborious tricks, and they could still be overcome. On the one hand this is sad; on the other, it is much easier for me to put up a serious, impenetrable CAPTCHA and close the question reliably, albeit at the user's expense.
But note: the robot has not been reprogrammed! That means spamming my "spam collector" is not profitable. This clearly confirms the thesis that the attacked site must be of adequate value to the attacker.
Conclusion 1. It makes no sense to put very complex protection on sites that are only gaining popularity. Better to try the "ajax button" method that I described in the article, or any similar method suggested in the comments. Such an approach at least will not scare away the users you already have and will not hold back conversion. Only once attacks begin should you analyze the spammer's motivation and, based on that, look for countermeasures, the last of which, as I see it, is a complex CAPTCHA that is unkind to a blind person (like me).
Conclusion 2. My "let me in" method turned out to be largely useless. The same functionality can be implemented by much smaller and cheaper means.
Conclusion 3. I now understand why Yandex puts a CAPTCHA on every request in its keyword selection tool! I take back my offence at that CAPTCHA and virtually apologize (since I was also offended virtually).
What else can I say about the described virtual-browser attack on my spam collector? There is something, and it falls into the category of "good news". All the requests came from different IP addresses, and all of them were bad! What does "bad" mean? These were either addresses that had already appeared on my site and had been marked as suspicious or dangerous, or addresses that I spot-checked on my favourite site www.projecthoneypot.org, where many (most) were marked as dangerous.
Conclusion 4. Marking IP addresses as dangerous or suspicious can help combat spam. There are services that provide such data for free or for money. The free data is unlikely to save anyone, because it is limited in volume, but the paid data could bring real benefits. Those who, for various reasons, do not want to spend money on services of this kind could implement the service themselves. That is precisely what I would like to keep thinking about.
For which sites can such a service come in handy?
- In general, for everyone. Such a service keeps the site owner aware of what is happening on the site, and simply maintaining order is always useful. I draw this conclusion solely from my own experience: URI and IP accounting has repeatedly helped me fine-tune my site.
- For sites that offer genuinely useful online services, whose owners care both about how quickly users can reach those services and about reducing server load.
- For sites that are being spammed to the point where it is becoming a problem.
- For those programmers and site owners who are strongly motivated to collect such statistics, even if they sometimes find it hard to explain that motivation. I count myself among them.
What to do with labeled IP addresses?
IP addresses caught in harmful activity can have their rights curtailed. For example, they can be shown a CAPTCHA when filling out forms, since with high probability this is not a person. Posts from these IPs can be sent to moderation, and that moderation postponed until last. When moderating, provide a link that marks the client's address as a spammer. A one-time link for flagging the client can even be placed directly in the notification e-mail the site administrator receives about a new comment.
Requests from bad IP addresses can simply be left unprocessed: not normalized and not redirected to the correct addresses, which can noticeably speed up the site.
Let me explain the last statement with an example. There is a site that has existed for many years, an article site. Long ago the owner deliberately posted some articles on certain external resources, and a whole series of materials was also lifted with a link back to the original location. Those articles still hang on those resources, it is impossible to come to an agreement with their owners, and the links to the originals have since changed. Meanwhile visits at the old addresses keep coming, and they have to be analyzed and redirected. But once, while analyzing who exactly had to be redirected, I developed a very reasonable suspicion that the site was not working entirely for people: a large share of the visitors were robots, including robots you may not always want to let in. Here, too, marking IP addresses helps a lot, because we can mark addresses not only as "bad" or "suspicious" but also as "desirable bot", "unwanted bot" and so on.
The IP address database can be shared for free, and you can even make a business out of this activity.
On what basis should I mark IP addresses?
Everything I write below is based on my personal experience, which, as it turns out, is not all that universal; it reflects the particulars of my experimental site. But I hope interested readers can use the ideas presented here as a "source of inspiration".
Requests for critical files of engines that are not used on the site
For example, I do not use the WordPress engine on my site, yet I receive requests for its configuration file and its administrator login page.
Requests to server system folders like "/../../../.." or to files like "passwd"
Below is the actual list of patterns I use to catch bad IPs. It is a very short list, but it works quite well.
$patterns = array(
'#/wp-#',
'#/browser#',
'#/includ#',
'#/engin#',
'#admin#i',
'#system#',
'#/bitrix#',
'#/cat_newuser#', // registration page, removed more than 5 years ago (8)
'#/forum#',
'#/common#',
'#/plugins#',
'#\.mdb/?#',
'#\.aspx?/?#',
'#^/BingSiteAuth#',
'#passwd#',
);
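For illustration, here is a minimal sketch of how such a list might be applied on each request, assuming the $patterns array above. markIpAsBad() is a hypothetical helper that records the address in the blocklist; it is not part of my actual code.

<?php
// Sketch: run the requested URI through the pattern list above.
$uri = $_SERVER['REQUEST_URI'];
$ip  = $_SERVER['REMOTE_ADDR'];
foreach ($patterns as $pattern) {
    if (preg_match($pattern, $uri)) {
        markIpAsBad($ip, 'bad uri: ' . $uri); // hypothetical helper
        http_response_code(404);              // answer with a plain 404 and stop
        exit;
    }
}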
Requests to CGI scripts, if not used
Unfortunately, I have not found a way to catch such requests via .htaccess, and there may be no access to the virtual host configuration. But these requests do land in the error log, and the IP addresses that touched those files can be taken from there.
Requests for pages that fell out of use during a site reorganization or for some other reason
Here is an interesting trick that I first used by accident and then, after watching the results, decided could be interesting to use deliberately. Suppose there is a registration page. Its URI does not matter; let it be, say, "/reg-new-user/". At some point robots start using this page: either they try to register, if there is no protection such as the "ajax button", or the page simply starts getting far more hits than there are real registrations. Then we change the URI of this page and do not redirect from the old one to the new one. And the requests to the old URI keep coming and coming, for years; looking back, mine have been coming for about eight years. Logically, every IP that knocks on this address is immediately flagged as dangerous. The result is a trap for harmful robots. By the way, Bing's search robot keeps hammering addresses of this kind, and it is not a fake but the real one. Where does it pick up such addresses? Maybe it crawls and indexes secret hacker forums? A good question, to which I unfortunately do not know the answer.
IP trap based on analysis of submitted form data
To set up this trap you need a form field that is filled in by a client-side script. That is, a real user working in a real browser loads a script which, when the form button is clicked, fills a certain field with a very specific value. The field name should mean nothing to a robot, and the value should be encoded and, better yet, one-time; it can be derived, for example, from a timestamp processed by some algorithm. A robot then either leaves the field empty or fills it with the wrong value. An unexpected value in this field immediately leads to the IP being marked as dangerous. The method described in the previous article under the code name "let me in" works for me as exactly this kind of trap.
One thing should be stated clearly: this method does not work against virtual JS browsers, which can also be used to visit pages and send spam automatically.
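For illustration, here is a minimal server-side sketch of such a check. The field names (fx_ts, fx_token) and the transform are invented; this is not the actual "let me in" code, just one possible variant of a timestamp-based one-time value.

<?php
// Render two hidden fields with the form; a small client script (not shown)
// fills fx_token on button click with the hex form of (fx_ts * 31 + 7).
function renderTokenFields(): string
{
    return '<input type="hidden" name="fx_ts" value="' . time() . '">'
         . '<input type="hidden" name="fx_token" value="">';
}

// On submit: an empty or wrong value means the field was not filled by the script.
function tokenIsValid(array $post): bool
{
    $ts    = (int)($post['fx_ts'] ?? 0);
    $token = (string)($post['fx_token'] ?? '');
    $fresh = abs(time() - $ts) < 3600;                       // reject stale timestamps
    return $fresh && hash_equals(dechex($ts * 31 + 7), $token);
}
// A failed check is the signal to mark $_SERVER['REMOTE_ADDR'] as dangerous.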
Special traps
By this I mean traps created specifically for robots. For example, at some point I built a message board. The tool did not justify itself and was switched off, but for a while it had been listed in board catalogues and had acquired a broad and loyal "clientele" that has kept working for years. So as not to "offend" these customers, I switched the tool back on, turning it into a kind of berth for robots. Whoever goes there is most likely not a person and can safely be marked as a spam robot.
Unnatural Requests
We have gradually reached marking methods that are more theoretical than practical. Obviously, there are short bursts of requests from the same IP to various pages at high speed (up to several per second). Such bursts load the server heavily. They can come from harmful robots and bots as well as from spiders whose usefulness I have not yet figured out. These strange bots identify themselves in the user agent and do not stand on ceremony on your site. Quite recently I witnessed a real situation where a robot calling itself "ahrefs" sniffed out the filtered product-search URL and brought a rather large online store to a halt for a day, because that query was not optimized on the MySQL side. And had the robot not been shut out via .htaccess, it probably would never have stopped.
But how to catch such unnatural bursts at minimal cost? I have thought about it a lot and have not yet come up with anything beyond manual tagging and blocking based on reviewing the access logs. Since the logs of a well-visited site can reach tens of megabytes, that approach drifts into the realm of fantasy. Worst of all, the requested URLs can be real and handled honestly, so they never land in any bad-URI log. What remains is tagging IPs by user agent, which I will take up in a separate paragraph.
Labeling IPs by the contents of the User-Agent header
HTTP_USER_AGENT is a difficult header to rely on when catching bad clients. It is much easier to use it to flag good clients, for example the bots of the search engines you want. But fakes do occur. In any case, it is better to mark the IP addresses of search-engine robots all at once and by hand; whois information can be used to load pools of such IPs into the database. Then any client whose user agent says Googlebot but whose IP is not Google's can safely be flagged as dangerous.
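As a complement to manually collected whois pools (this is not the method described above, just a common alternative check), a client that claims to be Googlebot can be verified by a reverse DNS lookup followed by a forward lookup of the result:

<?php
// Sketch: verify a client whose User-Agent contains "Googlebot".
function looksLikeRealGooglebot(string $ip): bool
{
    $host = gethostbyaddr($ip);                               // reverse DNS (PTR)
    if ($host === false || $host === $ip) {
        return false;                                         // no PTR record
    }
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;                                         // PTR is not Google's
    }
    return gethostbyname($host) === $ip;                      // forward lookup must match
}
// A "Googlebot" that fails this check can be flagged as dangerous, as suggested above.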
Then there are real but unwanted or unidentified bots: through some incident, by reviewing the access logs, or in some other way, we discover a bot with a name in its user agent. Depending on how such a bot behaves on your site, we mark all of its IPs automatically or manually and keep that in mind. For example, you can deny it access to certain pages entirely. In my opinion, no robot whatsoever has any business on a store's catalogue search page with a complex filter.
There is also the idea of marking the IPs of all clients that have no HTTP_USER_AGENT header at all. It is quite unlikely that an ordinary, honest user would delete this header or tamper with it, though a misconfiguration or a technical failure on the client side remains possible. So I do not mark, and do not plan to mark, IPs with a missing user agent; but I am thinking about paying some attention to them after all.
Have there been attempts in my practice to plant a shell through the user agent? Yes! There was once an agent containing shell-planting code. But again, it is quite likely that this was an experiment by a specific individual from an ordinary IP address. There have also been agents with strange content, for example: "() { :;}; echo Content-type: text/plain; echo; echo; echo M`expr 1330 + 7`H; /bin/uname -a; echo @". It is hard to say, though, what it is and how it would be used.
Defining a robot by the contents of other headers
HTTP_REFERER in robots is usually synthetic: it may be absent, or it may coincide with the requested URI. In general this header is hard to analyze, but there can be situations on a site where it becomes useful. For example, it looks suspicious when, on the very first visit to a page with a form, all the parameter arrays are empty yet HTTP_REFERER equals the URI of the page itself, as if it had been reloaded.
Virtual JS browsers usually have normal headers that you cannot find fault with; by headers alone they are probably indistinguishable from people.
Calling a script with the .php extension
It seems useful to avoid file extensions in URLs completely, 100%. Then any request for a page with a .php extension signals the arrival of a robot, and most likely an unwanted one.
Distorted or incomplete requests, including requests without a trailing slash where one is required
If the trailing slash is missing, the custom is simply to add it and issue a 301 redirect. Is it worth tagging clients for this error? In my opinion, only for individual pages, at the site administrator's discretion. For example, calling an AJAX backend page without the trailing slash looks very suspicious.
There are also genuinely distorted requests containing a double slash. I thought for a long time about what to do with them and in the end decided not to correct or redirect them; it seems unlikely to me that this is a person. Alternatively, you can serve a special 404 page for such requests suggesting that the visitor check the address bar for a double slash and fix it.
There are also unfinished requests, for instance ones cut in half: "/some-category/arti" instead of "/some-category/article/12345/". Unfortunately, "good" bots commit this sin too, so if you mark such clients as "suspicious" at all, do it only by hand.
As the owner of one second-level domain and a large number of related third-level domains, I have noticed an interesting pattern in robot behaviour: having met a certain URI on one third-level domain, a robot automatically tries it on all the subdomains. I think such clients can be marked right away. A person is unlikely to substitute a URI onto every subdomain by hand, and the motivation for doing so is far from obvious.
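Here is a sketch of two of the simple checks from this section: a doubled slash in the path, and an AJAX backend called without the trailing slash. The /ajax/ prefix and markIpAsSuspicious() are assumptions for illustration, not my actual layout.

<?php
$path = strtok($_SERVER['REQUEST_URI'], '?');                 // path without the query string
$ip   = $_SERVER['REMOTE_ADDR'];

if (strpos($path, '//') !== false) {
    // A person is unlikely to type a double slash: flag, then show the special 404 page.
    markIpAsSuspicious($ip, 'double slash: ' . $path);        // hypothetical helper
}

if (preg_match('#^/ajax/[^/]+$#', $path)) {                   // assumed layout: backends under /ajax/.../
    markIpAsSuspicious($ip, 'ajax call without trailing slash');
}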
Requests in which the script name is URL-decoded
If a script is requested in fully or partially URL-decoded form, this is extremely suspicious and clearly not aimed at anything good. In such cases the client can be marked as dangerous right away. There have been no such cases in my practice.
Access Method Mismatch
There are usually situations on a site where the method used to call a script must be strictly defined. For example, it cannot be a human error to call a form-processing script with the GET method when the parameters are expected via POST, or to call an AJAX backend with GET.
Such cases are suspicious, but I cannot recall ever meeting one in reality.
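A minimal sketch of such a check at the top of a form handler that only expects POST; markIpAsSuspicious() is again a hypothetical helper.

<?php
// A GET request to a POST-only form handler is almost certainly not a human mistake.
if ($_SERVER['REQUEST_METHOD'] !== 'POST') {
    markIpAsSuspicious($_SERVER['REMOTE_ADDR'], 'GET on a POST-only handler'); // hypothetical helper
    http_response_code(405);
    exit;
}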
Direct AJAX backend call
The point is that a site may have scripts which are only ever called by other scripts and which a person never calls by typing the script's address into the browser's address bar. To catch this, the AJAX backend scripts must be separate files rather than a single index.php with a pile of parameters. Unfortunately, separate AJAX backend scripts are far from universal. I have implemented this approach and I do catch the cases described, but so far I have not caught a single event.
Requests that look like injections or attempts to open files through parameters
With the move to human-readable URLs (SEO links), or simply URL rewriting that hides page parameters, these attacks have become exotic. In a URL like "some-site.com/12345/" it is rather hard to inject something in place of "12345" and get anything sensible, which is probably why such experiments occur less and less often. Besides, in my practice such things were done by individual but real people from their real IP addresses, for example from a 3G modem, so whether to tag them is a big question. Incidentally, not long ago, just three years back, the IP addresses of mobile networks were heavily compromised (I do not know about now); it was simply hard to work, and I suffered with the Yota and MTS networks myself.
I tried to implement analysis of POST page parameters for suspicious words and characters and eventually abandoned it. The analysis became too complicated and therefore resource-intensive. In the end I reduced the list of suspicious words in parameters to quotation marks alone, and even that began to seem anecdotal. Besides, page parameters are filtered anyway, and telling dangerous quotes from harmless ones is difficult for an analysis that has to be instant.
Performance Issues
Client analysis must be approached with great care: in your zeal for request analysis you can end up not speeding the site up but freezing it completely. I do analyze, but only in the case of "bad URIs", that is, requests that do not correspond to any real page and cannot be rendered; in other words, client analysis is part of generating the 404 page. Even then I try not to load the server too much. When using regular expressions, especially when they are collected in an array and applied in a loop, prefer the simplest and fastest ones. When touching the database, try to do everything in a single query, plus whatever other code optimizations are possible. If an operation cannot be made fast enough, I leave it for manual review.
If client IP addresses are not only marked, but also blocked
Separately, I want to mention checking the client's IP against the block list. This check happens on every request from every user.
When checking whether an IP address is blocked, it is probably better not to search for it in a common table with thousands of addresses, but to keep a separate small table containing only the blocked IPs and search there. MySQL lookup speed depends quite a bit on the number of records. When querying the database, try to build the query so that the search is resolved from the index alone, without touching the row data.
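A sketch of this idea, with assumed table and column names: the blocked IPs live in their own small table, the address itself is the primary key, and the lookup is answered from that index alone.

<?php
// Assumed schema (MySQL):
//   CREATE TABLE blocked_ips (
//       ip VARBINARY(16) NOT NULL PRIMARY KEY   -- packed IPv4/IPv6
//   ) ENGINE=InnoDB;
function ipIsBlocked(PDO $db, string $ip): bool
{
    $stmt = $db->prepare('SELECT 1 FROM blocked_ips WHERE ip = ? LIMIT 1');
    $stmt->execute([inet_pton($ip)]);
    return (bool) $stmt->fetchColumn();
}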
One could also try looking up blocked IPs outside the database altogether. The idea is to keep a file on disk with a serialized array of blocked IP addresses; each script reads the file, turns the serialized string back into an array and works with that. I suspect this kind of access would be faster than connecting to and searching the database, but it needs to be measured, and for my site it is not that important given its moderate traffic.
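A sketch of this file-based variant: the list is kept on disk as a serialized array keyed by IP, so the check is a hash lookup rather than a search. The file path is an assumption.

<?php
function ipIsBlockedByFile(string $ip, string $file = '/var/data/blocked_ips.ser'): bool
{
    if (!is_readable($file)) {
        return false;                                         // no list yet: nobody is blocked
    }
    $blocked = unserialize(file_get_contents($file), ['allowed_classes' => false]);
    return is_array($blocked) && isset($blocked[$ip]);        // array keyed by IP address
}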
List of accumulated bad URIs and tagged IPs
On my site dedicated to programming I have made a service that shows the accumulated bad URIs. It is actually a piece of the admin panel, but it is open for viewing. Here is what you need to keep in mind when using it.
- The table records URIs per domain.
- Every day the table gains 100-300 new entries. The base grows very quickly and I have to prune it: I collapse URIs that are similar to each other and move them into a so-called pivot table. Information about the client that requested a given URI is lost in the process, but personally I do not need those URIs at all. At first I simply reviewed and deleted them; later I began storing them, if only to show to anyone who might be interested. The practical benefit of the bad URIs themselves is not very clear to me.
The reduction of a URI goes roughly like this: suppose there is a frequent URI of the form "/some-page/12345/", where "12345" is the variable part. Then all such URIs, and there can be a lot of them, are collapsed into "/some-page/###/". If it is letters rather than digits that vary, the shortened URI may look like "/some-page/ABCD/" or even "/some-page/_any_trifle_/" (see the sketch after this list).
- The service is under continuous development. A little later, within a week, I plan to make the tagged IP addresses freely available for download.
- The service has flaws. For example, the time of the last bad event for an IP is not stored. This is a significant drawback, because it makes it impossible to lift the danger mark from an IP once its last bad event is old enough. Manually re-marking a particular IP as good is not implemented either.
- At the moment, only the bad-URI log and the summary log are published. I am going to keep improving the service.
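The sketch promised above: a rough illustration of how a frequent URI can be reduced to its pivot form. The regular expressions are deliberately naive and are only an illustration, not the service's actual code.

<?php
function reduceUri(string $uri): string
{
    $uri = preg_replace('#/\d+/?$#', '/###/', $uri);             // numeric tail segment -> ###
    return preg_replace('#/[A-Za-z]{1,12}/?$#', '/ABCD/', $uri); // short letter tail -> ABCD
}
// reduceUri('/some-page/12345/') -> '/some-page/###/'
// reduceUri('/some-page/xYz/')   -> '/some-page/ABCD/'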
Please do not visit my site using the bad URIs shown in the service! If you do, your IP may be marked as bad. If you want to verify that the service works, enter a neutral nonexistent URI, such as /adfsadf/. The IP will then be recorded but not marked as bad.