vovagubin1987 July 31, 2013 at 11:59

To-do: Filtering everything and anything

From the sandbox

This article is more of a FAQ than a full manual. Much has already been written on Habr, and the tag search will find it, so there is no sense in rewriting everything anew.

Recently our state, fortunately or not, has taken an interest in the Internet and its contents.
Many will undoubtedly say that rights, freedoms and so on are being violated. Few people doubt that these laws were written by people with little understanding of how the Internet works, and that their main goal is not to protect us from what is out there. Still, if you are a responsible person, or are pushed by prosecutors as happens in some institutions, the question of restricting incoming information arises. Such institutions include schools, kindergartens, universities and the like. Businesses also need to take care of information security.
And our first step on the way to a local content filter is

Analysis of what the Internet is and how it works.

It's no secret that 99 percent of the Internet is HTTP. Each site has a name, page content, URLs and an IP address. Several sites can sit on one IP address, and one site can have several. URLs can be either dynamic or permanent.
And the content of a page is whatever is written on it. From here we conclude that sites can be monitored by:

Site name
Page URL
The content written on the page
IP address

Further, all content on the Internet can be divided into three groups:

The bad
The unknown
The good

From this, two ideological paths follow:

Allow only the good, and prohibit the bad and the unknown. This path is called the BIG (and sometimes small) White List.
Allow the good and the unknown, and prohibit only the bad. This path bears the proud name of the Black List.

And of course there is a middle ground between the two: prohibit the bad, allow the good, and analyze the unknown to decide on the fly whether it is good or bad.
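The three paths can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the names Verdict, classify and is_allowed, the policy strings and the analyze callback are all hypothetical and do not come from any of the tools mentioned later.

```python
from enum import Enum


class Verdict(Enum):
    GOOD = "good"
    BAD = "bad"
    UNKNOWN = "unknown"


def classify(host, whitelist, blacklist):
    """Sort a host into one of the three groups using the two lists."""
    if host in blacklist:
        return Verdict.BAD
    if host in whitelist:
        return Verdict.GOOD
    return Verdict.UNKNOWN


def is_allowed(host, whitelist, blacklist, policy, analyze=None):
    """Apply one of the three policies (hypothetical names):
    'whitelist' blocks the unknown, 'blacklist' allows the unknown,
    'hybrid' asks the analyze callback to decide about the unknown."""
    verdict = classify(host, whitelist, blacklist)
    if verdict is Verdict.GOOD:
        return True
    if verdict is Verdict.BAD:
        return False
    if policy == "whitelist":
        return False
    if policy == "blacklist":
        return True
    if policy == "hybrid":
        if analyze is None:
            raise ValueError("the hybrid policy needs an analyze callback")
        return bool(analyze(host))
    raise ValueError("unknown policy: %r" % policy)
```

The only difference between the policies is what happens to the unknown group, which is exactly where the real work (and the real cost) of a content filter lies.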

Means of implementation.

Again there are two ways:

We take a ready-made solution.

Such solutions come in three types: paid, free, and limited (until you pay).

Paid solutions are hardware (a box whose contents you do not know, but which does its job), hardware-software (the same box, but with a full OS and the corresponding applications) and software.
Free solutions are software only. There are exceptions, but they only confirm the rule.
Paid ones include Kaspersky antivirus products with the corresponding functionality, ideco.ru, netpolice, kerio, etc. They are easy to find because they are well advertised: just type something like "buy a content filter" into a search engine.
Free solutions have one drawback: they cannot do everything out of the box. They are also harder to find, so here is a list: PfSense, SmoothWall (comes in paid and free editions; the free one has limited functionality), UntangleGateway, Endian Firewall (also paid and free), IPCOP, Vyatta, ebox platform, Comixwall (a wonderful solution; you can download it from my site at 93.190.205.100/main/moya-biblioteka/comixwall ). All ready-made solutions share one drawback: they are limited.

We do everything with our own hands.

This path is the most difficult, but also the most flexible. It lets you build anything your heart desires (including a loophole).
There are a great many components. The most powerful and necessary are

Squid. You get nowhere without a proxy.
Dansguardian. This is the heart of the entire content filter. Its only free competitor (not counting its forks) is the POESIA filter (but it is very outdated).
DNS server Bind.
Clamav. Antivirus.
Squidguard, directors and similar redirectors for proxies.
Squidclamav.
Sslstrip. This utility turns encrypted https traffic into unencrypted http traffic: www.thoughtcrime.org/software/sslstrip . Its analogs are the proxy servers flipper and charly proxy, but they run on Windows, and the second is paid. Whoever needs them can run them under wine.
Black Lists. These lists can be taken from www.shallalist.de (1.7 million sites), www.urlblacklist.com (namely, the big version with more than 10 million sites), www.digincore.com (about 4 million), directories lists.
White lists. Here things are much tighter. The only decent (meaning large) Russian-language list comes from the Safe Internet League, and only in the form of the League's proxy or its program: www.ligainternet.ru/encyclopedia-of-security/parents-and-teachers/parents-and-teachers-detail.php?ID=532 . By the way, because the League's proxies use digest authorization, they cannot be hooked up to squid. If anyone knows how to use a proxy server with digest authentication as a parent proxy, please let me know.
DNS lists. There are two well-known options. The first is the skydns filter, www.skydns.ru .
The second is Yandex DNS, dns.yandex.ru .
Skydns is the more functional of the two.

Where the filtering takes place.

The following options are possible:

1. On user computers without centralized management, as a system component or application.
2. The same as the first, but with centralized management (for example, Kaspersky Administration Kit).
3. As a browser component. There are suitable plugins for Chrome and Firefox.
4. On a separate computer or cluster of computers (including on the gateway).
5. Distributed.

Options 1, 2 and 3 are the fastest in terms of filtering speed under heavy network use.
Options 1 and 3 are the most labor-intensive.
In terms of reliability, meaning the user cannot bypass the filter, option 4 takes first place.
Option 5 is a dream, but it is nowhere to be found.

Now the next question:

Reliability of filtering.

I think this one is clear. Protection must be multilevel, because what leaks through one level will be blocked by another.

Now let's talk about

The disadvantages of each protection level.

Lists

The Internet is a constantly and, most importantly, very rapidly changing environment. Clearly our lists will not keep up with it, even less so if we maintain them by hand. So take part in list-compiling communities and use not only list files but also list services that do everything for us (for example, skydns and yandex).
A list also gives no guarantee about individual pages: a site can look completely white and fluffy while one of its pages carries something bad.
Use multiple lists. What did not make it into one may be in another!
Programs that work on lists include Netpolice (http://netpolice.ru), Censor (http://icensor.ru/), Traffic Inspector for schools (http://www.smart-soft.ru/ru), etc. Usually, programs that can do lexical analysis can also work with lists.
Censor has an old database from 2008, but it is completely free. Netpolice comes in many versions, including a free but stripped-down one.
And do not forget: neither black nor white lists can protect you 100%. Only lexical analysis is capable of that.
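The advice to use multiple lists can be illustrated with a small Python sketch. It assumes the simplest possible list format, one domain per line with optional # comments; real lists (shallalist and others) ship in their own layouts, and the function names here are made up for the example.

```python
def merge_lists(*lists):
    """Merge several domain lists (iterables of text lines) into one set.
    Assumed format: one domain per line, '#' starts a comment line."""
    merged = set()
    for entries in lists:
        for line in entries:
            domain = line.strip().lower().rstrip(".")
            if domain and not domain.startswith("#"):
                merged.add(domain)
    return merged


def is_listed(host, domains):
    """True if the host or any of its parent domains is in the set,
    so 'www.bad.example' matches an entry for 'bad.example'."""
    labels = host.lower().rstrip(".").split(".")
    return any(".".join(labels[i:]) in domains for i in range(len(labels)))
```

Matching on parent domains matters because lists usually carry the bare domain, while users visit subdomains. It still says nothing about a single bad page on an otherwise clean site.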

Virus analysis.

Here the main problem is the anti-virus database. Again, use layers: one antivirus on the gateway, another at the workplace.

Analysis of the content written on the page.

Here the main problem is the lexical analysis of the text. Of course, no one has money for artificial intelligence, so a database of words and expressions with weight coefficients is used instead. The smaller the base, the less effective the filtering; the larger the base, the more effective the filtering, but also the more costly. For example, analyzing Jules Verne's The Mysterious Island from lib.ru takes 8 seconds with my base and dansguardian (Core 2 Duo 2.66). And the base has to come from somewhere. I had to build a decent base myself, and I share it with you at 93.190.205.100/main/dlya-dansguardian/spiski/view .
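To make the idea of a weighted database concrete, here is a minimal Python sketch. It is not dansguardian's actual algorithm, only an illustration of the principle: every occurrence of a phrase adds its weight to a page score, and the page is blocked when the score reaches a limit. The function names, the example weights and the limit are all assumptions of mine.

```python
import re


def score_text(text, weights):
    """Sum weight * number of occurrences over all phrases in the base.
    Matching is case-insensitive and by substring; negative weights
    can be given to 'good' words to offset false positives."""
    lowered = text.lower()
    total = 0
    for phrase, weight in weights.items():
        hits = len(re.findall(re.escape(phrase.lower()), lowered))
        total += hits * weight
    return total


def is_blocked(text, weights, limit):
    """Block the page when its score reaches the limit."""
    return score_text(text, weights) >= limit
```

The cost is visible right in the loop: every phrase in the base is searched in every page, which is why a large base on a long text takes seconds.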

The next question is

The ability of users to bypass content filtering.

This issue can be solved in two radical ways.

Deny direct access to the network, so that everything passes through the proxy server (on the proxy, the CONNECT method also needs to be limited to a list of domains and/or IP or MAC addresses). We do this either with iptables or simply by setting net.ipv4.ip_forward = 0 in sysctl.conf. But iptables is a subject for a separate article.
Prohibit users from installing anything on their workstations. It's simple: no program, no workaround.
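The CONNECT restriction from the first point boils down to a check like the one below. In practice it is expressed in the proxy's own access rules; this Python function is only a sketch of the logic, and its name, arguments and the default port set are my assumptions.

```python
def allow_connect(target, allowed_domains, allowed_ports=frozenset({443})):
    """Decide whether a CONNECT request to 'host:port' may pass.
    The port must be in allowed_ports and the host must equal an
    allowed domain or be one of its subdomains."""
    host, sep, port_text = target.rpartition(":")
    if not sep or not port_text.isdigit():
        return False  # malformed target or missing port
    if int(port_text) not in allowed_ports:
        return False
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
```

Without such a limit, CONNECT turns the proxy into a tunnel to anywhere, and the whole filter can be bypassed through it.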

The question of performance.

Here everything is more or less clear: more memory, more hertz, more cache. Those with modest hardware will find CFLAGS optimization very useful. All Linux distributions and BSDs allow it, but gentoo, calculate linux, slackware and freebsd are especially convenient.
If you have multi-core processors, use OpenMP (you can get a dansguardian build suited to it from me at 93.190.205.100/main/dlya-dansguardian ; by the way, it also fixes a bug that made it impossible to upload data to the Internet): CFLAGS="-fopenmp", LDFLAGS="-lgomp". Remember to include -O3 -mfpmath=sse+387. About autopatching here.

The question of the hierarchy of caches and proxies.

If you have many computers and can dedicate several to filtering, do so. On one machine, install the squid proxy server and list the parent caches with the round-robin parameter (http://habrahabr.ru/post/28063/). On each of the other machines, a dansguardian paired with squid acts as a parent (dansguardian cannot work without an upstream proxy). The parent squids run on the same machines as the dansguardians. A large cache makes no sense for the upstream proxies, while the first one should have the largest cache. Even with a single machine, still build the chain squid1 -> dansguardian -> squid2 -> provider with the same cache distribution. Give dansguardian nothing to do beyond analyzing what is written on pages, replacing content, headers and some URLs, and blocking MIME types. Under no circumstances hang the antivirus or blacklists on it, otherwise everything will slow down.

Let squid1 and squid2 handle list analysis.
Let squidclamav do the virus check through c-icap on squid2. Put the whitelists on squid1.
Everything on the white list should go directly to the Internet, bypassing the parent proxies!

Be sure to run your own DNS server and have it forward to skydns or Yandex DNS. If there are local provider resources, add a forward zone pointing to the provider's DNS. In the DNS server, also define a local zone for the necessary intranet resources (and to look good, you will need them). Set up nosslsearch for Google search. In the squid configs, be sure to point to our own DNS.
For everything we use the Webmin web interface and the command line. On Windows servers, we do everything with the mouse.
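The routing decision made on squid1 can be pictured with this Python sketch: whitelisted hosts go out directly, everything else is handed to the filtering parents in turn. Squid does this with its own configuration directives; the Router class, the "DIRECT" marker and the parent names below are hypothetical and exist only for illustration.

```python
from itertools import cycle


class Router:
    """Pick the next hop for a request: whitelisted hosts bypass the
    parents, all other hosts are spread over them round-robin."""

    def __init__(self, parents, whitelist):
        if not parents:
            raise ValueError("at least one parent proxy is required")
        self._parents = cycle(list(parents))
        self._whitelist = {h.lower() for h in whitelist}

    def route(self, host):
        if host.lower() in self._whitelist:
            return "DIRECT"  # no filtering needed, skip the parents
        return next(self._parents)
```

Sending the white list directly saves the expensive lexical analysis for the traffic that actually needs it.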

LAN setup

Use authentication by IP address if you are not a "serious" organization, with mandatory logging of all access.
Use logically separated networks within one continuous physical network. Assign IP addresses by MAC address. Block connections to the proxy server port if the machine's MAC address does not match the MAC address assigned to its IP address.
Configure iptables so that requests to any ports (3128, 80, 3130, 443) go through the proxy server port.
Configure automatic proxy configuration on the network through DNS and DHCP: www.lissyara.su/articles/freebsd/trivia/proxy_auto_configuration
Set up groups and filtering levels by IP address.
Alternatively, configure the proxy in the browser settings.
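Two of these points, the IP-to-MAC binding and the filtering groups by IP address, are easy to show in Python. This is a sketch of the checks only; in a real network they live in the DHCP server, the firewall and the filter configs. The addresses, group names and function names are invented for the example.

```python
import ipaddress


def binding_ok(ip, mac, bindings):
    """True only if this IP is registered and was issued to this MAC."""
    expected = bindings.get(ip)
    return expected is not None and expected.lower() == mac.lower()


def filter_group(ip, groups, default=None):
    """Return the filtering group of the first subnet containing the IP.
    groups is a list of (network, group name) pairs."""
    addr = ipaddress.ip_address(ip)
    for network, name in groups:
        if addr in network:
            return name
    return default


# Hypothetical layout: one subnet per filtering level.
BINDINGS = {"192.168.1.10": "00:11:22:33:44:55"}
GROUPS = [
    (ipaddress.ip_network("192.168.1.0/24"), "pupils"),
    (ipaddress.ip_network("192.168.2.0/24"), "teachers"),
]
```

Keeping each filtering level in its own subnet means the group can be derived from the address alone, with no per-user accounts.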

When we are being inspected.

In this case, set all the sliders to the maximum.
Additionally, block all video sites, VKontakte and other social networks, music portals, file hosting and file-sharing networks.
Block mp3.
Tick the safe search box in your SKYDNS account.
Be sure to get the paperwork in order!

Https filtering

To do this, wedge sslstrip in between squid2 and the provider. This utility turns encrypted https traffic into unencrypted http traffic: www.thoughtcrime.org/software/sslstrip . It is also possible to add rules on squid1 that allow or deny port 443 by domain.

A couple more tips.

Not all sites are filtered correctly. So use dansguardian's bypass feature for blocked pages. You can take a ready-made page from me at 93.190.205.100/main/dlya-dansguardian/stranichka-blokirovka-i-razblokirovka-dlya-dansguardian/view .
Always keep logs of site visits and keep statistics for a year. There will always be clever people who want to do something illegal on the Internet. Identification by IP is enough: it is adequate and it respects the law on the protection of personal data. Make the statistics open.
Some sites are beyond dansguardian's reach: those that use json, for example yandex.ru and video.yandex.ru. Set up password authentication for them through squid1.
Not all providers comply with the law and block what is on the federal list of extremist materials and on zapret-info.gov.ru . So for the former, read the list and fill in your database of words and expressions; for the latter, use the antizapret.info export.
Keep in mind that most prosecutors do not care who is to blame. If it is visible, it is visible. So at least do something.
Do not forget to install snort with snortsam. Security comes first, all the more so if the gateway has a public ("white") IP address.
Many search engines can filter their results. This is done by adding a special parameter to the request or through cookies. Lately more and more of them have been switching to cookies, so dansguardian needs to be configured accordingly. You can take the configs from me ; the settings are already in there. In addition, be sure to make the lists in 4 encodings (1251, utf8, koi8r, utf16) and choose the correct filtering method (more in the configs). For youtube we use edufilter .
A good squid configuration manual can be found here .
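Two of the tips above, the lists in four encodings and the safe-search request parameter, can be sketched in Python. Both functions are illustrations with hypothetical names. The byte order for utf16 (little-endian, no BOM) is my assumption, and the actual parameter name and value differ per search engine, so the caller supplies them.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Python codec names for 1251, utf8, koi8r and utf16.
# Assumption: UTF-16 little-endian without a byte order mark.
ENCODINGS = ("cp1251", "utf-8", "koi8-r", "utf-16-le")


def encode_variants(phrase):
    """Return the phrase as bytes in each of the four encodings."""
    return {name: phrase.encode(name) for name in ENCODINGS}


def force_query_param(url, name, value):
    """Rewrite a search URL so that the given parameter is always set
    to the given value, replacing any value the user supplied."""
    parts = urlsplit(url)
    pairs = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != name]
    pairs.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(pairs)))
```

The same Cyrillic word yields four different byte sequences, which is why a list kept in only one encoding silently misses pages served in the others.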

Tags:

content filtering