We proxy and save

On November 1 the world changed and will never be the same again: censorship arrived on the Russian Internet in the form of the well-known registry of banned sites. For some this is an important political topic, for others an occasion to study encryption and anonymity-protection technologies, and for others still just another strange law to be implemented on the fly. We will talk about the technological aspect.

In this tutorial we will learn how to quickly and easily make a working mirror of any site, which lets us change its IP address and assign it any domain name. We will even try to hide the domain in the URL, after which we can save a complete local copy of the site. All the exercises can be done on any virtual server; I personally use Hetzner hosting and Debian. And of course we will use the best web server of all time: NGINX!

By this point an inquisitive reader has already bought and configured a dedicated server of some sort, or simply booted Linux on an old computer under the table, and is running the latest version of nginx with its stub welcome page.


Before you begin, you must build nginx with the ngx_http_substitutions_filter_module (formerly known as substitutions4nginx).
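The module is not bundled with nginx, so it has to be compiled in. A minimal sketch of the build, assuming the module sources were cloned next to the nginx source tree; the version number and paths are illustrative, adjust them to your environment:

wget http://nginx.org/download/nginx-1.2.4.tar.gz
tar xzf nginx-1.2.4.tar.gz
cd nginx-1.2.4
# --add-module points at the checked-out substitutions module
./configure --prefix=/usr/local/nginx \
            --add-module=../ngx_http_substitutions_filter_module
make && make install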

Further configuration will be shown using www.6pm.com as the example. This is the site of a popular online store selling goods at good discounts, notable for its categorical unwillingness to serve customers from Russia. Isn't that the censorship of capitalism laid bare?

We already have a working nginx busy with useful things: it runs a site on the LiveStreet engine about the advantages of shopping abroad. To bring up the 6pm mirror, we create a DNS record named 6pm.pokupki-usa.ru that points at the server's IP. As you understand, the choice of subdomain name is completely arbitrary. This name will be sent in the Host header on every request to our new resource, which is what lets nginx do name-based virtual hosting.
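For reference, that record is a single A entry; a sketch in BIND zone-file notation, with 203.0.113.10 as a placeholder address:

6pm.pokupki-usa.ru.    IN    A    203.0.113.10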

In the http section of the nginx configuration we write an upstream pointing at the donor site, as we will call it from now on. In standard guides such a site is usually called the backend, and the reverse proxy the frontend.

http {
    ...
    upstream 6pm { server www.6pm.com; }
    ...
}


Next you need to create a server section; here is what it looks like:

    server {
        listen          80;
        server_name     6pm.pokupki-usa.ru;
        limit_conn  gulag 64;
        access_log   /var/log/nginx/6pm.access.log;
        error_log    /var/log/nginx/6pm.error.log;
        location / {
            root /var/www/6pm;
            try_files $uri @static;
        }
        location @static {
            include '6pm.conf';
            proxy_cookie_domain 6pm.com 6pm.pokupki-usa.ru;
            proxy_set_header Accept-Encoding "";
            proxy_set_header      Host     www.6pm.com;
            proxy_pass http://6pm;
            proxy_redirect http://www.6pm.com http://6pm.pokupki-usa.ru;
            proxy_redirect https://secure-www.6pm.com https://6pm.pokupki-usa.ru;
        }
    }


The standard listen and server_name directives determine the virtual host name for which this server section fires. Log files are best kept separate per mirror. The limit_conn directive refers to a connection-limiting zone named gulag, which must be declared in the http context, as in the sketch below.
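A minimal sketch of that declaration; the zone name and the 64-connection cap come from the config above, while the $binary_remote_addr key and the 10m size are my illustrative choices:

http {
    ...
    # shared-memory zone keyed by client address; the name "gulag"
    # must match the one referenced by limit_conn
    limit_conn_zone $binary_remote_addr zone=gulag:10m;
    ...
}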

We declare the root location and specify its storage path with root /var/www/6pm; then we use try_files. This is a very important nginx directive that lets you organize local storage for downloaded files: it first checks whether a file named $uri exists and, if it finds none, falls through to the named location @static.
$uri is the nginx variable that contains the path from the HTTP request.

The prefix “@” specifies a named location. This location is not used in normal request processing, but is intended only to redirect requests to it. Such locations cannot be nested and cannot contain nested locations.


In our case this construction is used only to substitute the robots.txt file, in order to forbid indexing of the mirror's content. But mirroring and caching in nginx are built on exactly the same pattern.
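Two sketches of what that pattern gives you, with illustrative contents and paths. The local robots.txt only has to exist under the root for try_files to serve it instead of proxying:

# /var/www/6pm/robots.txt - served locally, never proxied
User-agent: *
Disallow: /

And if you add proxy_store to the named location, every file fetched from the donor is written to disk under the same root, so the next request is answered locally; this is an assumption-laden sketch, not part of the original config:

location @static {
    ...
    proxy_pass http://6pm;
    # save the fetched response at the requested path
    proxy_store        /var/www/6pm$uri;
    proxy_store_access user:rw group:rw all:r;
}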

include '6pm.conf' - this pulls in the substitution module's logic.

proxy_cookie_domain is a new directive that appeared in nginx 1.1.15; before it existed you had to rewrite the cookie domain by hand. No more racking your brains: one line, and cookies just start working.

proxy_set_header Accept-Encoding ""; - a very important directive: it forces the donor site to send content uncompressed, otherwise the substitution module cannot perform its replacements.

proxy_set_header Host - another important directive: it puts the correct Host field into the request to the donor site. Without it the name of our proxy server would be substituted, and the request would fail.
proxy_pass - direct addressing does not work in a named location, which is why we registered the donor site's address in the upstream directive.
proxy_redirect - many sites use redirects for their own needs; each redirect must be caught and rewritten here, otherwise the request, and the client with it, will wander off beyond our cozy domain.

Now let's look at the contents of 6pm.conf. It was no accident that I moved the transformation logic into a separate file: thousands of replacement rules and hundreds of kilobytes of filters can live there without any loss of performance. In our case we only want to complete the proxying, so the file contains just five lines.

Change the Google Analytics codes:
subs_filter 'UA-8814898-13' 'UA-28370154-3' gi;
subs_filter "'.6pm.com']," "'6pm.pokupki-usa.ru']," gi; 

I assure you this is the most harmless prank possible: we get the visit statistics, and those visits vanish from the donor's reports.

Change all direct links to the new domain:
subs_filter "www.6pm.com" "6pm.pokupki-usa.ru" gi;
subs_filter "6pm.com" "6pm.pokupki-usa.ru" gi;


As a rule, on decent sites all the images live on CDN networks that don't bother checking where requests come from, so replacing links for the main domain alone is enough. In our case 6pm showed off and put some of the images on domains that refuse visitors from Russia. Fortunately the substitution module supports regular expressions, so it is easy to write one general rule for a whole group of links. Here we didn't even need a real regexp: we just changed two characters in the domain. It came out like this:

subs_filter "http://a..zassets.com" "http://l3.zassets.com" gi;


The only, but very serious, limitation of the substitution module is that it operates on a single line at a time. The restriction is architectural: the module runs at a stage when the page is only partially received (chunked transfer encoding), so a full-text regexp is simply impossible.
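In practice this means a pattern that spans a line break will never fire, because the filter sees the response one line at a time; a hypothetical illustration:

# never matches if the opening tag and the URL are wrapped
# across two lines in the source HTML
subs_filter '<a href="http://www.6pm.com' '<a href="http://pokupki-usa.ru' gi;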

That's it, you can look at the result: everything works, even paying for an order goes through without difficulty.

So, we have moved the site to a new IP address and a new domain. That was the simple task. Can the site be attached not to a new domain but to a subdirectory of an existing one? It can, but there are difficulties. First, recall what kinds of HTML links exist:
  1. Absolute links like "www.example.com/some/path"
  2. Root-relative links like "/some/path"
  3. Relative links like "some/path"


Case 1 is simple: we replace all such links with the new path including the subdirectory.
Case 3 is just as simple: we touch nothing and everything works by itself, as long as the base href attribute is not used. If it is used, which is extremely rare on modern sites, it is enough to replace it too, as sketched below, and everything will work.
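A hypothetical one-line rule for that rare base href case, in the same subs_filter style; the exact tag contents are illustrative, not taken from 6pm's markup:

subs_filter '<base href="http://www.6pm.com/">' '<base href="http://pokupki-usa.ru/6pm/">' gi;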

The real difficulty is case 2, because we would have to turn a great many occurrences of "/..." into "/subdirectory/...". Done head-on, this will most likely break the site completely: such a replacement mangles a great many constructions that merely use a slash, ruining almost every JavaScript file.

In theory you could write a sufficiently general regexp that picks out exactly the patterns that need replacing; in practice it is much easier to write a few simple regexps that translate the necessary links piece by piece.

Back to our patient:

        location /6pm {
            root /var/www/6pm;
            try_files $uri @6pm-static;
            access_log   /var/log/nginx/6pm.access.log;
        }
        location @6pm-static {
            include '6pm2.conf';
            proxy_cookie_domain 6pm.com pokupki-usa.ru;
            proxy_cookie_path / /6pm/;
            rewrite ^/6pm/(.*) /$1 break;
            proxy_set_header Accept-Encoding "";
            proxy_set_header      Host     www.6pm.com;
            proxy_pass http://6pm;
            proxy_redirect http://www.6pm.com http://pokupki-usa.ru/6pm;
            proxy_redirect http://www.6pm.com/login http://pokupki-usa.ru/6pm;
            proxy_redirect https://secure-www.6pm.com https://pokupki-usa.ru/6pm;
        }

The server configuration has undergone some changes.

First, all the logic moves from the server level down into a location. It is easy to guess that we decided to make a /6pm directory into which the proxied site is mapped.

proxy_cookie_path / /6pm/ - moves cookies from the site root into the subdirectory. This is not strictly necessary, but when there are many proxied sites their cookies can collide and overwrite one another.

rewrite ^/6pm/(.*) /$1 break; - this magic cuts the subdirectory we added out of the client request, so that proxy_pass sends the correct path to the donor server.
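To make the effect concrete, here is the life of one request through this location (the path is illustrative):

# client asks:   GET http://pokupki-usa.ru/6pm/mens-shoes
# rewrite gives: /6pm/mens-shoes -> /mens-shoes
# donor sees:    GET http://www.6pm.com/mens-shoes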

Catching redirects has become a little harder: now every root link has to be redirected into /6pm.

Let's look at the transformation logic:

subs_filter_types text/css text/javascript;
# Fix direct links
subs_filter "http://6pm.com" "http://pokupki-usa.ru/6pm" gi;
subs_filter "http://www.6pm.com" "http://pokupki-usa.ru/6pm" gi;
# Fix absolute links
subs_filter 'src="/' 'src="/6pm/' gi;
subs_filter 'href="/' 'href="/6pm/' gi;
subs_filter 'action="/' 'action="/6pm/' gi;
# Fix some js
subs_filter "\"/le.cgi" "\"/6pm/le.cgi" gi;
subs_filter "\"/track.cgi" "\"/6pm/track.cgi" gi;
subs_filter "\"/onload.cgi" "\"/6pm/onload.cgi" gi;
subs_filter "\"/karakoram" "\"/6pm/karakoram" gi;
subs_filter "/tealeaf/tealeaf.cgi" "/6pm/tealeaf/tealeaf.cgi" gi;
# Css and js path
subs_filter "script\('/" "script('/6pm/" gi;
subs_filter "url\(/" "url(/6pm/" gi;
subs_filter 'UA-8814898-13' 'UA-28370154-3' gi;
subs_filter "'.6pm.com']," "'pokupki-usa.ru/6pm']," gi;
subs_filter "http://a..zassets.com" "http://l3.zassets.com" gi;


First, we enabled filtering of CSS and JavaScript files (HTML parsing is on by default).
Second, we carefully hunt down and replace the various kinds of root-relative links. We got a site of medium complexity, where some of the scripts contain such paths.

As a result, it turned out like this: http://pokupki-usa.ru/6pm/

Unfortunately, I could not finish the filter for the subdirectory case. I never got as far as converting the dynamic requests of the shopping-cart scripts, although I have no doubt it can be solved; my JavaScript is simply not good enough for the necessary debugging. I would be glad of advice on how to get the cart working, which at the moment does not work in the example above.

In any case, this is probably the first guide to describe proxying into a subdirectory.
