Internet mapping

Sitting on the couch, I once again came up with a crazy idea. It belongs to the category of things that are global in scope, and I had never attempted anything like it at the hobby level, but the idea came to me anyway :).

Reasoning that whoever holds the information also holds the opportunity to work with the audience, I wondered why there are so few Internet search engines. There are Google, Yandex, Rambler, and a few others that are easy to count on one's fingers. Yet between them they capture the vast majority of Internet users. A huge number of users pass through them, and to a certain extent they decide where each user is directed. Companies, in turn, are promoted to a certain extent by tricky ways of influencing the bots of that same Google.

And what is the result? Does anyone have an idea of how many Russian-language resources exist? Is it possible to see them in lists ranked by frequency of use and separated by topic? People are trying to talk about the semantic web, but there seems to be no such order even at the most elementary level of structuring. Having told myself "who, if not us", I set out to work on this idea and on approaches to solving it. The main thing I understood was the main difficulty, which, as in many places, simply comes down to resources, in this case processor time. Those interested in the findings of a novice in this area, but one with a fresh look, please read on.

IP as the basis of identification

Well, what is so complicated, I said to myself: you just need to get a list of all the sites and then rank them, at least by the same PageRank from Google. So I wrote a simple C# program that pings port 80 by IP address and, if successful, gets the domain name and its country (using the GeoIPService web service). I launched my simple scanner bot and came back an hour later to see how many sites it had collected. I found almost 1 unique ... That turned everything around. I decided to do the calculation: as is known, there can be 256 * 256 * 256 * 256 = about 4.3 billion IP address variations. I had not expected that many. Then I looked at how long one ping takes; it turned out to be about 0.1 second. I limited the wait for a response to the same timeout, since the default is significantly larger. Now the calculation could be completed: 4971 days. Well, I don't have nearly 14 years for this, I told myself. At this point I could have marveled at the miracle of Google's technology, accepted that they are doing a great job, and decided not to compete with them alone. But perseverance prevailed :)
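The probe described above can be sketched roughly as follows. This is not the original C# program, just a minimal Python illustration under stated assumptions: "pinging port 80" is taken to mean a plain TCP connection attempt, the function names `port_open` and `reverse_name` are made up for this example, and the country lookup through GeoIPService is left out.

```python
import socket
from typing import Optional


def port_open(ip: str, port: int = 80, timeout: float = 0.1) -> bool:
    """Return True if a TCP connection to ip:port succeeds within `timeout` seconds.

    Assumption: the post's "ping" of port 80 is a TCP connect attempt.
    The 0.1 s default mirrors the timeout mentioned in the text.
    """
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False


def reverse_name(ip: str) -> Optional[str]:
    """Return the reverse-DNS name for `ip`, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None
```

A scan loop would call `port_open(ip)` for each address and look up `reverse_name(ip)` only for the hits, which is why the per-address cost is dominated by the timeout.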

Country Limit

Well, I told myself, Russian-language sites would be quite enough for me; the rest need not be analyzed. But how do you map an IP address to a country? Does the Internet have any kind of structure? These were the questions I had to deal with, because brute force did not suit me.

Having read what people do about this, I found that there is a whole business built around this simple task, with the beautiful name of geotargeting, and marveled at the articles on Habr: "Definition of a city by IP address", "GeoIP base - countries and cities", and a whole bunch of similar ones. There are even paid databases and so on; in general, a whole industry :)

But most importantly, all of this was secondary information, and I had neither the need nor the desire to use it. I wanted to understand which way the wind blows from: where is the primary information? Where does the information in such geolocation databases come from?

Network Regions

I had to work out who controls the Internet. Reading the popular information on Wikipedia, one can learn that from the moment the network was created until his death in 1998, Jon Postel was in charge of distributing addresses under an agreement with the US Department of Defense. Now the issuance of IP addresses is managed by a non-profit organization, the Internet Assigned Numbers Authority (IANA), which after Postel's death became part of ICANN, an organization founded by the US government that in turn received a contract from the US Department of Commerce. In general, the scheme is confusing, and there is still a dispute at the UN over having the United States hand control of the Internet to the UN, which the United States naturally refused to do. But all this interests us only in terms of whether there is any order in the distribution of IP addresses and which subnets search engines need not scan.

And here is the most important document, from IANA itself, which offers some hope of order: the IANA IPv4 Address Space Registry.

It describes who is responsible for (read "who owns") the regions of the network. From here on it is convenient to introduce our own terminology: we will call a region the set of IP addresses defined by the first number of the address, and a sector the set defined by the first two numbers.
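To make the terminology concrete, here is a small sketch of how a region and a sector could be derived from a dotted IPv4 address. The function names `region` and `sector` and the sample address are only illustrations of the post's own terms, not part of any standard.

```python
import ipaddress

# Sizes that follow directly from the definitions above.
ADDRESSES_PER_REGION = 256 ** 3   # first number fixed, three numbers free
ADDRESSES_PER_SECTOR = 256 ** 2   # first two numbers fixed, two numbers free


def region(ip: str) -> int:
    """Return the post's "region": the first number of the IPv4 address."""
    return ipaddress.IPv4Address(ip).packed[0]


def sector(ip: str) -> tuple:
    """Return the post's "sector": the first two numbers of the IPv4 address."""
    packed = ipaddress.IPv4Address(ip).packed
    return (packed[0], packed[1])
```

For the sample address `193.108.12.7`, `region` gives 193 and `sector` gives (193, 108). In network notation a region corresponds to a /8 block and a sector to a /16 block.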

From the document above it follows that regions 0, 10, and 127 are reserved by IANA for itself, and regions 224 through 255 are reserved for so-called multicast and future use. Beyond that, a substantial part belongs to large telecommunication and information companies in the USA and to the military of the USA and Great Britain: I counted 35 such regions.
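The exclusions listed above could be encoded as a simple filter, sketched below. The reserved and multicast ranges are taken from the text; the 35 company and military regions are not enumerated in the post, so `LEGACY_HOLDERS` is deliberately left as an empty placeholder to be filled in from the IANA registry, and only their count is used.

```python
# Regions the text says IANA reserves for itself.
IANA_RESERVED = {0, 10, 127}

# Regions 224..255: multicast and future use.
MULTICAST_AND_FUTURE = set(range(224, 256))

# Placeholder (assumption): the post counts 35 regions held by large
# companies and the military but does not list them, so none are
# invented here. Fill this set from the IANA registry.
LEGACY_HOLDERS = set()
LEGACY_HOLDER_COUNT = 35  # count given in the post


def should_scan(region: int) -> bool:
    """Return False for regions the text says there is no need to scan."""
    if not 0 <= region <= 255:
        raise ValueError("region must be in 0..255")
    skipped = IANA_RESERVED | MULTICAST_AND_FUTURE | LEGACY_HOLDERS
    return region not in skipped


# Total number of excluded regions according to the post's counts.
excluded = len(IANA_RESERVED) + len(MULTICAST_AND_FUTURE) + LEGACY_HOLDER_COUNT
remaining = 256 - excluded
```

With the post's counts this gives 3 + 32 + 35 = 70 excluded regions and 186 regions left for the regional organizations.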

In total, 70 regions out of 256 are not accessible to mere mortals, and there is no need to scan them. The rest are distributed among 5 regional zones: North America, South America, Africa, Indonesia with China (the Asia-Pacific zone), and Europe with part of Asia. Within each zone, addresses are handed out by a regional office, and the one we are interested in is the European RIPE NCC. The whois services, too, are provided by these regional organizations, each for its own zone.

The European organization has been assigned 35 regions of IP addresses to distribute to what we will loosely call providers (although they are licensed according to each country's rules), plus 4 more regions that, as I understand it, are under special administration.

Only at this level can we say with confidence which territorial region an IP address belongs to. Beyond that, it depends on what public information the regional organizations provide. Still, this is already better: instead of 14 years to scan 256 regions of the Internet, we need to scan 39 regions of Europe / Asia, which is just a little more than 2 years of work for one processor.
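The two estimates can be reproduced with a few lines of arithmetic. The sketch assumes, as the text does, a fixed cost of 0.1 second per address and strictly sequential probing; the helper name `scan_days` is made up for this example.

```python
SECONDS_PER_PROBE = 0.1          # per-address cost quoted in the text
ADDRESSES_PER_REGION = 256 ** 3  # one region = one value of the first number
SECONDS_PER_DAY = 86400


def scan_days(regions: int) -> float:
    """Days needed to probe `regions` whole regions one address at a time."""
    return regions * ADDRESSES_PER_REGION * SECONDS_PER_PROBE / SECONDS_PER_DAY


whole_internet = scan_days(256)  # all 256 regions
ripe_only = scan_days(39)        # the 35 + 4 regions assigned to RIPE NCC

print(f"256 regions: {whole_internet:.0f} days, {whole_internet / 365:.1f} years")
print(f" 39 regions: {ripe_only:.0f} days, {ripe_only / 365:.1f} years")
```

The first figure comes out at about 4971 days (roughly 13.6 years) and the second at about 757 days, which matches "a little more than 2 years".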

You can go lower. Unfortunately, there is no further order (and there are exceptions even at the regional level): sectors may belong to several countries. But you can download the current whois database from RIPE and find country information in it. Cities are sometimes there too, but they are hardly suitable for machine processing, because, as I understand it, these fields contain whatever careless network administrators enter when they receive IP subnets for their use, and the fields are often mixed up (for example, an address instead of the city) or not filled in at all. The country code, however, is stable.
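Extracting sectors by country from such a dump might look like the sketch below. It assumes the dump is a text file of blank-line-separated objects with `inetnum: <first> - <last>` and `country: <code>` attributes; the function names are made up, and this is one possible reading of the format rather than the exact processing used for the post.

```python
import ipaddress
from collections import defaultdict


def iter_objects(lines):
    """Group lines into whois objects separated by blank lines."""
    obj = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip():
            obj.append(line)
        elif obj:
            yield obj
            obj = []
    if obj:
        yield obj


def sectors_by_country(lines, wanted):
    """Map each wanted country code to the set of sectors (a, b) it appears in."""
    result = defaultdict(set)
    for obj in iter_objects(lines):
        rng, codes = None, set()
        for line in obj:
            if line[0] in " \t+#%":
                continue  # skip continuation lines and comments
            key, sep, value = line.partition(":")
            if not sep:
                continue
            key, value = key.strip().lower(), value.strip()
            if key == "inetnum":
                first, _, last = value.partition("-")
                try:
                    rng = (int(ipaddress.IPv4Address(first.strip())),
                           int(ipaddress.IPv4Address(last.strip())))
                except ValueError:
                    rng = None  # malformed range: ignore this object
            elif key == "country":
                codes.add(value.split("#")[0].strip().upper())
        if rng is None:
            continue
        for code in codes & wanted:
            # A range may span several sectors; record every one it touches.
            for prefix16 in range(rng[0] >> 16, (rng[1] >> 16) + 1):
                result[code].add((prefix16 >> 8, prefix16 & 0xFF))
    return result
```

Called as `sectors_by_country(open(path, encoding="latin-1"), {"RU", "LV"})`, where `path` and the encoding are assumptions, it returns one set of sectors per country. Because a sector is recorded for every country that has at least one range in it, the same sector can appear under several countries.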

Having processed their text file of about 3 GB and picked out the records belonging to Latvia and Russia, I got 2004 sectors for Russia and 307 for Latvia (the choice of Latvia is explained by it being my homeland :)). The arithmetic of addition is different here: 2004 + 307 = 2063 unique sectors rather than 2311, because 248 sectors appear in both lists. I.e., as mentioned earlier, sectors naturally overlap, and they may contain other European countries as well, but on the other hand we get an indicative estimate.

Namely, about 109 minutes to ping one sector, and about 156 days for 2063 sectors (a minimum estimate, because a little more time is needed to look up the domain when port 80 turns out to be open).

This is already manageable, even for me: if I load all 8 of my cores, in a month I will have a map of the Russian Internet.
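The sector-level numbers can be checked the same way. The sketch below assumes 0.1 second per address and, for the multi-core case, that the sectors split evenly across workers with no overhead, which is an idealization; the helper names are made up for this example.

```python
SECONDS_PER_PROBE = 0.1          # per-address cost quoted in the text
ADDRESSES_PER_SECTOR = 256 * 256


def sector_minutes() -> float:
    """Minutes needed to probe one sector sequentially."""
    return ADDRESSES_PER_SECTOR * SECONDS_PER_PROBE / 60


def total_days(sectors: int, workers: int = 1) -> float:
    """Days to probe `sectors` sectors, split evenly over `workers` (idealized)."""
    seconds = sectors * ADDRESSES_PER_SECTOR * SECONDS_PER_PROBE
    return seconds / 86400 / workers


print(f"one sector:            {sector_minutes():.1f} minutes")
print(f"2063 sectors, 1 core:  {total_days(2063):.1f} days")
print(f"2063 sectors, 8 cores: {total_days(2063, workers=8):.1f} days")
```

One sector takes about 109 minutes, 2063 sectors about 156 days on one core and a little under 20 days on eight, so "a month" leaves room for the domain lookups.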

And actually, why do you need this kind of Internet mapping?

Well, I don't want to limit your imagination here, and some ideas of what it might be needed for were outlined at the beginning. I want to emphasize that this time is required only for the first scan, which identifies all computers with port 80 open (i.e., potential candidates for providing web services), among which there are very few real domains (DNS names).

But we really do get all the domains as of 2015, which can later be analyzed, and who knows what else. Those who want to help me and see the need for this, or who simply don't mind sparing some processor time: write to me and I will give you a ready-made program that scans a sector; you send back the result, and together we create a public database, an Internet map.

And maybe a new Google is born :)

P.S. If the response is positive, we will create an appropriate resource for this.
