Rating of hubs and companies by posts / subscribers
At the moment there are about 350 hubs on Habrahabr. The site lets you sort them by name and by index, but not by other parameters such as the number of posts, and that is exactly what I wanted.
I was inspired by the article with a rating of Habr posts and decided to make something similar: a rating of the hubs themselves.
In the first half of the article I present the ratings of hubs and companies, with a short analysis of them. In the second half I describe in detail how I parsed Habrahabr's HTML pages in Java with the JSoup library, and what interesting phenomena and problems I ran into. At the end of the article I post the full source code of the program.
All 4 ratings (full) as a web page
Rating of hubs
By the number of posts
Closet 35 971
I'm PR 5 461
Web Development 4 011
Information Security 3 385
Google 2 770
Iron 2 733
Gadgets. Devices for geeks 2 375
Programming 2 293
Linux 2 235
Android 1 965
JavaScript 1 687
Apple 1 612
Habrahabr 1 568
.NET 1 485
PHP 1 465
System administration 1 454
DIY or Do it yourself 1 442
Development 1 331
Project management 1 261
Interfaces 1 257
Microsoft 1 237
Game Development 1 218
Open source 1 110
Smartphones and communicators 1 091
JAVA 1 020
Design in IT 996
Algorithms 991
Copyright 982
Social networks and communities 949
GTD 939
Windows 919
Learning process in IT 916
Python 866
Robotics 798
Development for Android 783
Development for iOS 777
Hosting 749
C ++ 711
Legislation and IT business new 677
Media 664
(...)
Tumblr 3
Cubrid 3
Industrial programming new 3
Julia new 2
Microsoft Access 2
Growth Hacking new 2
Google Checkout 0
MySpace 0
Xcode new 0
SCADA new 0
By the number of subscribers
Closet 124 521
I'm PR 101 864
Web Development 96 117
Android 95 361
Gadgets. Devices for geeks 95 020
Smartphones and communicators 94 376
Google 93 844
DIY or Do it yourself 92 322
Iron 91 959
Information Security 91 729
Linux 91 103
Robotics 89 721
Programming 89 668
Tablets 88 757
Google Chrome 88 197
Operating systems 88 098
Interfaces 88 064
Windows 87 695
iPhone 87 609
Algorithms 87 372
Web design 87 341
E-books 86 582
Design in IT 86 266
Perfect code 85 525
Browsers 85 443
iPad 85 290
Energy and batteries 84 866
Popular science 84 668
PHP 84 621
(...)
Backup new 3 503
Xcode new 2 823
Physics new 2 372
Raspberry Pi new 2 274
Industrial programming new 2 141
Development under e-commerce new 2 034
SCADA new 1 856
Laravel new 1 799
Growth Hacking new 1 063
Julia new 948
When I sorted the hubs, some interesting things showed up. For example, I did not know there were hubs with zero posts, and there are 4 of them! Moreover, each of them has more than 500 subscribers.
Three hubs, Closet (Chulan), I'm PR and Web Development, lead both by number of posts and by number of subscribers. Closet is in 1st place because the administration moves removed articles there. Next by number of posts comes Information Security, which is extremely popular on Habr.
Unfortunately, I still do not understand why the Habrahabr hub counts as off-topic. By number of posts it is in 13th place, and it has more than 80K subscribers. So writing on a site about that same site is a departure from the topic?
It was disappointing that the Java hub does not rank as high as I would like: it is only 25th by number of posts.
Rating of companies
Although initially I planned to build the rating only for hubs, a good idea came up in the comments to the article: do the same for companies. The code did not need many changes.
There are a lot of companies, 1343, so I will post only the top 30 and the last 10. An interesting point: for some reason Habr shows "All (1331)", although my program counted 1343, and that is in fact correct. If you count them manually, 67 pages of 20 companies plus 3 more gives 67 × 20 + 3 = 1343.
By the number of subscribers
Yandex 11 056
Google 10 999
Microsoft 6 797
Intel 5 463
Apple 4 124
Opera Software ASA 3 873
Hacker Magazine 3 034
Zfort Group 2 969
JetBrains 2 946
Mail.Ru Group 2 921
VimpelCom (Beeline) 2 730
IBM 2 655
Artemy Lebedev Studio 2 640
Nokia 2 542
TM 2 314
Simple Science 2 222
Samsung 2 222
2GIS 1 992
Adobe 1 878
ABBYY 1 847
Box Overview 1 844
ВКонтакте 1 841
HP 1 828
Мосигра 1 772
Skype 1 718
«Лаборатория Касперского» 1 667
ASUS Russia 1 615
Sony Mobile Communications 1 572
Apps4All 1 541
LinguaLeo 1 493
(...)
Angie 5
Photoplay 5
Флорист.ру 5
PlatOn 5
Polyvizor 5
Dulton Media LLC 5
bdl premium 4
GolovachCourses 4
timera inc. 4
Slon.ru 3
By the number of posts
Yandex 1 012
Microsoft 828
Intel 491
Google 422
Mail.Ru Group 317
Apps4All 292
Opera Software ASA 234
Samsung 215
ASUS Russia 209
ESET NOD32 200
ABBYY 197
IBM 190
HP 188
Evernote 186
Webnames.ru 169
MUK 154
Nokia 142
Zfort Group 134
Positive Technologies 131
Simple Science 127
EPAM Systems 127
Sony Mobile Communications 116
КРОК 115
Turbomilk 103
Селектел 101
REG.RU 97
Box Overview 96
Ciklum 96
SmartGadget 94
JetBrains 87
(...)
HotSupport -2
Worksection -2
Далее -2
МФИ Софт -2
NVIDIA Corporation -2
DeepArtment -2
RuTube -2
Самый Нужный ТЕЛЕФОН -3
«Студия — 8812» -3
590.com.ua -3
To begin with, I was surprised that there are 2 kinds of missing companies: "company is deactivated" and "page not found". I repeat, all companies were taken from the site's own list. I marked the first kind with a post count of -2; there are a lot of such companies. Three companies, whose names consist of numbers, lead to "page not found"; I marked them with -3. There are also plenty of companies with zero posts, for example Apple. I wonder why anyone would create a company account and never write from it.
In fact, if we drop the non-existent companies and the companies without posts from the 1343 registered on Habr, only 321 remain.
Development
For a very long time I tried to figure out the Habrahabr API. As it turned out, it is closed and still under development. However, in correspondence with support@habrahabr.ru I was told that they have nothing against parsing their pages. In fact, this is exactly how the Habr clients for Android work (at the moment).
When it comes to projects "for myself", I choose my beloved Java. It did not let me down this time either: the JSoup library let me get the necessary data from an HTML page in a few lines. But first, let's discuss how the hub pages work.
Pages with hubs are located at habrahabr.ru/hubs/pageN/, where N is a number starting from 1. So to get a complete list of all the hubs, we need to download and parse these pages until they run out. Each page contains a list of hubs. The list item format is simple and easy to parse: each hub is a div with the class hub that holds the title link, the subscriber count, the post statistics and the Habraindex.
Let's write a method that returns a list of all the hubs on the site:
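import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

static List<Hub> getAllHubs() {
    List<Hub> fullHubsList = new ArrayList<>();
    String urlHubsIncomplete = "http://habrahabr.ru/hubs/page";
    int pageNum = 1;
    do {
        String urlHubs = urlHubsIncomplete + pageNum;
        try {
            // Download and parse one page of the hub list
            Document doc = Jsoup.connect(urlHubs).get();
            Elements hubs = doc.select(".hub");
            if (hubs.size() == 0) {
                break; // no hubs on this page: we are past the last one
            }
            for (Element hubElem : hubs) {
                Hub hub = new Hub(hubElem);
                fullHubsList.add(hub);
            }
            pageNum++;
        } catch (Exception e) {
            // Network or parsing failure: report it and stop
            e.printStackTrace();
            break;
        }
    } while (true);
    return fullHubsList;
}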
We run an infinite loop, building a new URL on each iteration. Then Jsoup.connect(urlHubs).get() gives us the HTML document with the list of hubs and their parameters. The div with hub information has the class hub, so calling doc.select(".hub") returns the list of these elements. If the list is empty, we have gone past the last page and already processed all the hubs, so we exit the loop.
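The same stop-on-empty-page loop can be sketched without any network access. This is only an illustration under stated assumptions: PagedCollector, collectAll and the maxPages cap are hypothetical names that do not appear in the original program, and the fake page source stands in for Jsoup.connect(...).get(). The cap is an extra guard against a site that never returns an empty page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

final class PagedCollector {
    /**
     * Collects items from pages 1, 2, 3, ... until a page comes back empty
     * or maxPages pages have been read (a safety cap assumed by this sketch).
     */
    static <T> List<T> collectAll(IntFunction<List<T>> pageSource, int maxPages) {
        List<T> all = new ArrayList<>();
        for (int pageNum = 1; pageNum <= maxPages; pageNum++) {
            List<T> page = pageSource.apply(pageNum);
            if (page.isEmpty()) {
                break; // went past the last page
            }
            all.addAll(page);
        }
        return all;
    }

    public static void main(String[] args) {
        // Fake source: pages 1 and 2 hold two items each, page 3 is empty.
        IntFunction<List<String>> fake = n ->
                n <= 2 ? List.of("hub" + n + "a", "hub" + n + "b") : List.of();
        System.out.println(collectAll(fake, 1000)); // [hub1a, hub1b, hub2a, hub2b]
    }
}
```

Passing the page source as a function keeps the loop testable; in the real program that function would wrap the Jsoup download and the doc.select(".hub") call.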
Next, we go through all the hub elements and create an object of type Hub for each one, passing our org.jsoup.nodes.Element to the constructor. The element contains the HTML of one list item in the format described above. Now let's abstract away from everything else; this is what OOP is for. All we have in front of us is that piece of HTML and the class we need to fit it into. Let's write a skeleton for our class:
import org.jsoup.nodes.Element;

public class Hub {
    String title;
    int posts;
    boolean profiled;
    int membersCount;
    float habraindex;
    String url;

    public Hub(Element hubElem) {
    }
}
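Let's write the constructor. To start with the simplest thing, we get the data from the title. To do this, we first extract the div with the class title, then take the a tag inside it and read its text and its href attribute. The profiled flag is set when the hub element contains an element with the class profiled_hub: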
Element titleDiv = hubElem.select(".title").get(0);
Element tagA = titleDiv.getElementsByTag("a").get(0);
title = tagA.text();
url = tagA.attr("href");
profiled = (hubElem.select(".profiled_hub").size() != 0);
Next, we want to parse the number of subscribers and posts, the parameters we will actually sort by. But we immediately hit the first problem: the tag contains the string "91741 subscribers", which cannot simply be converted to an Integer because it contains letters. Regular expressions come to the rescue. We write a small method that takes a string, strips everything except digits, and converts the result to int. \D matches any character that is NOT a digit, and + means "one or more times". In other words, here we replace the letters with an empty string.
private int getNumbers(String str) {
    // Remove every run of non-digit characters, then parse what is left
    String numbers = str.replaceAll("\\D+", "");
    return Integer.parseInt(numbers);
}
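As a quick check of what the \D+ replacement does, here is a small standalone sketch. The class name DigitExtractor and the choice to return 0 for a string without digits are assumptions made for illustration; the getNumbers method above would throw a NumberFormatException on such input.

```java
final class DigitExtractor {
    /** Strips every non-digit character and parses the rest; returns 0 if no digits remain. */
    static int digitsOf(String str) {
        String digits = str.replaceAll("\\D+", "");
        return digits.isEmpty() ? 0 : Integer.parseInt(digits);
    }

    public static void main(String[] args) {
        System.out.println(digitsOf("91741 subscribers")); // 91741
        System.out.println(digitsOf("35 971 posts"));      // 35971
        System.out.println(digitsOf("no digits here"));    // 0
    }
}
```

Note that a minus sign is stripped along with the letters, so this approach only suits non-negative counts.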
Now we can get our values with peace of mind:
String membersCountFullStr = hubElem.select(".members_count").get(0).text();
membersCount = getNumbers(membersCountFullStr);
String statFullStr = hubElem.select(".stat").get(0).getAllElements().get(2).text();
posts = getNumbers(statFullStr);
In principle, we could stop here, but out of curiosity I decided to extract all the available information about a hub. This is where a very interesting second problem arose, the highlight of the article: how do you parse the Habraindex?
To begin with, you need to replace the comma with a period and remove the extra spaces. But that is not enough! The parser still throws an error if you copy the Habraindex from the page and paste it into the code: Double.valueOf("–1.11"). If you type the same number by hand, everything is fine. And in my IDEA the two look absolutely identical!
It turns out that the Habr designers used an en dash instead of a minus sign. It has a different character code, and the parser, of course, does not accept it. Take note. The essence of the problem is this:
System.out.println((int)'-');//45
System.out.println((int)'–');//8211
Once, in my article Tricky Java Tasks, I examined a catch where the letter l cannot be told apart from the digit 1. Now I have run into a similar problem.
Therefore, the code for retrieving the Habraindex is a little more complicated:
String rawHabraIndex = hubElem.select(".habraindex").get(0).text(); // e.g. "1 265,92"
char minus = 45;  // '-'
char dash = 8211; // '–'
String niceHabraIndex = rawHabraIndex.replaceAll(" ", "").replace(",", ".").replace(dash, minus); // "1265.92"
habraindex = Float.valueOf(niceHabraIndex);
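The clean-up steps can also be collected into one helper and tried on plain strings. The sketch below is an illustration under stated assumptions rather than the program's actual code: IndexParser and parseIndex are hypothetical names, and handling the Unicode minus sign (U+2212) and the non-breaking space (U+00A0) is an extra precaution that the article does not mention.

```java
final class IndexParser {
    /**
     * Converts text such as an en dash followed by "1 265,92" into a float.
     * Handles the en dash (U+2013), the Unicode minus sign (U+2212),
     * regular and non-breaking spaces, and a decimal comma.
     */
    static float parseIndex(String raw) {
        String cleaned = raw
                .replace('\u2013', '-')           // en dash -> ASCII hyphen-minus
                .replace('\u2212', '-')           // Unicode minus -> ASCII hyphen-minus
                .replaceAll("[\\s\\u00A0]+", "")  // drop spaces, including non-breaking ones
                .replace(',', '.');               // decimal comma -> decimal point
        return Float.parseFloat(cleaned);
    }

    public static void main(String[] args) {
        // The first literal starts with an en dash, the second contains a non-breaking space.
        System.out.println(parseIndex("\u20131,11"));
        System.out.println(parseIndex("1\u00A0265,92"));
    }
}
```

Writing the special characters as Unicode escapes keeps the source readable in any editor, which matters here because the en dash and the hyphen-minus look the same on screen.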
Next, we write the post comparator as a nested static class of Hub (it needs import java.util.Comparator):
public static class ComparePosts implements Comparator<Hub> {
    @Override
    public int compare(Hub o1, Hub o2) {
        return o2.posts - o1.posts; // descending: more posts first
    }
}
And sort with it somewhere in main:
List<Hub> hubs = getAllHubs();
Collections.sort(hubs, new Hub.ComparePosts());
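On Java 8 or later the same descending sort can be written without a hand-written comparator class. The sketch below is an illustration under stated assumptions: HubStat is a hypothetical stand-in for Hub with only two fields, and the tie-break by title is an addition. Comparator.comparingInt also avoids the integer overflow that subtraction-based comparators can hit with extreme values, although real post counts are nowhere near that range.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class HubSortDemo {
    /** Hypothetical stand-in for the article's Hub class. */
    static final class HubStat {
        final String title;
        final int posts;

        HubStat(String title, int posts) {
            this.title = title;
            this.posts = posts;
        }

        @Override
        public String toString() {
            return title + " " + posts;
        }
    }

    /** Sorts in place, most posts first; ties are broken by title. */
    static void sortByPostsDescending(List<HubStat> hubs) {
        hubs.sort(Comparator.comparingInt((HubStat h) -> h.posts).reversed()
                .thenComparing(h -> h.title));
    }

    public static void main(String[] args) {
        List<HubStat> hubs = new ArrayList<>();
        hubs.add(new HubStat("Android", 1965));
        hubs.add(new HubStat("Programming", 2293));
        hubs.add(new HubStat("Linux", 2235));
        sortByPostsDescending(hubs);
        System.out.println(hubs); // [Programming 2293, Linux 2235, Android 1965]
    }
}
```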
That's it, the task is done! Sorting by the number of subscribers works the same way. Next, I wrote code that prints the two lists to the console in a form that can be pasted straight into the article, which is what I did first.
Fetching all the hubs takes about 10 seconds. The source code can be downloaded here. Build and run it like this, remembering to install Jsoup and to replace the path with your own:
javac -cp .;"C:\prog\lib\jsoup-1.7.3.jar" com/kciray/habrahubs/Main.java
java -cp .;"C:\prog\lib\jsoup-1.7.3.jar" com.kciray.habrahubs.Main
In addition, I reworked the same classes to collect statistics on companies. Everything there seems similar; however, to find out the number of posts in a company's blog I had to load a separate page for each company, and that took about 5 minutes. I made the download multithreaded to speed things up, and found that Habr does not allow loading more than 5-7 pages at the same time. In the end I serialized the ArrayList and wrote it to a file. This 100-kilobyte file is included with the second set of sources, so you can work with it.
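One simple way to respect such a limit is a fixed-size thread pool, which never runs more than a chosen number of downloads at once. The following is a minimal sketch under stated assumptions, not the code from the downloadable sources: BoundedFetcher, fetchAll and the limit of 5 are illustrative, and the fetch function is passed in so that the sketch runs without touching the network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class BoundedFetcher {
    /**
     * Applies fetch to every URL using at most maxParallel threads
     * and returns the results in the same order as the input list.
     */
    static <R> List<R> fetchAll(List<String> urls, Function<String, R> fetch, int maxParallel)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<R> task = () -> fetch.apply(url);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until that download has finished
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real page download: returns the length of the URL string.
        List<Integer> lengths = fetchAll(List.of("a", "bb", "ccc"), String::length, 5);
        System.out.println(lengths); // [1, 2, 3]
    }
}
```

In a real run the fetch function would download one company page and extract its post count; the pool size is the only knob needed to stay under the site's limit.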
If you are interested in the full rating in a more compact form, I have posted it as a web page.