Rating of hubs and companies by posts / subscribers

    At the moment there are about 350 hubs on Habrahabr. The site lets you sort them by name and by index, but not by other parameters such as the number of posts, which is what I wanted.

    I was inspired by the article on rating Habr posts and decided to make something similar, but rating the hubs themselves.

    In the first half of the article I will present the ratings of hubs and companies, along with a short analysis of them. In the second half I will describe in detail how I parsed Habr's HTML pages in Java using the JSoup library, and what interesting phenomena and problems I encountered. At the end of the article I will post the full source code of the program.



    All 4 ratings (full) as a web page

    Hub rating


    When I sorted the hubs, interesting things showed up. For example, I did not know that there were hubs with zero posts. There turned out to be 4 of them! Moreover, more than 500 people are subscribed to each of them.

    Three hubs, Chulan (the Closet), I'm PR and Web development, lead both in number of posts and in number of readers. Chulan is in 1st place because the administration moves articles there. Next comes Information Security, which is extremely popular on Habr.

    Unfortunately, I still do not understand why the Habrahabr hub counts as off-topic. It would rank 13th by number of posts, and it has more than 80K subscribers. So writing on a site about that same site is a departure from the topic?

    It was disappointing that the Java hub does not rank as high as I would like.

    Company rating

    Although I initially planned to build the rating only for hubs, a good idea came up in the comments to the article: do the same for companies. The code did not have to change much.

    There are a lot of companies: 1343. Therefore, I will post only the top 30 and the last 10. Here is an interesting point: for some reason Habr shows "All (1331)", although my program counted 1343, and that is in fact the correct number. If you count them by hand, multiplying 67 pages by 20 companies and adding 3 more, you get 1343.


    To begin with, I was surprised that there are two kinds of missing company: "company is deactivated" and "page not found". I repeat, all the companies were taken from the site's own list. I marked the first kind with a post count of -2; there are a lot of such companies. Three companies whose names consist of digits lead to "page not found"; I marked them with -3. There are also plenty of companies with zero posts, for example Apple. I wonder why anyone would create a company account and never write from it?

    In fact, if we remove the nonexistent companies and those without posts from the 1343 registered on Habr, only 321 remain. So it goes.

    Development

    For a very long time I tried to make sense of the Habrahabr API. As it turned out, it is closed and still under development. However, in correspondence with support@habrahabr.ru I was told that they have nothing against parsing their pages. In fact, this is exactly how the Habr clients for Android work (at the moment).

    When it comes to projects "for myself", I choose my beloved Java. It did not let me down this time either: the JSoup library let me get the necessary data from an HTML page in a few lines. But first, let's discuss how hubs are organized.

    Pages with hubs are located at habrahabr.ru/hubs/pageN/ where N is a number starting from 1. Therefore, if we want a complete list of all the hubs, we need to download and analyze these pages until they run out. Each page contains a list of hubs. The list item format is simple and easy to parse. It looks like this:


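    Let's write a method that returns a list of all the hubs on the site:
    import java.util.ArrayList;
    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    static List<Hub> getAllHubs() {
        List<Hub> fullHubsList = new ArrayList<>();
        String urlHubsIncomplete = "http://habrahabr.ru/hubs/page";
        int pageNum = 1;
        do {
            String urlHubs = urlHubsIncomplete + pageNum;
            try {
                // Download and parse one page of the hub list
                Document doc = Jsoup.connect(urlHubs).get();
                Elements hubs = doc.select(".hub");
                // An empty page means we have gone past the last one
                if (hubs.isEmpty()) {
                    break;
                }
                for (Element hubElem : hubs) {
                    Hub hub = new Hub(hubElem);
                    fullHubsList.add(hub);
                }
                pageNum++;
            } catch (Exception e) {
                // Stop on any network or parsing error
                e.printStackTrace();
                break;
            }
        } while (true);
        return fullHubsList;
    }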

    We spin an infinite loop, forming a new URL on each iteration. Then, using Jsoup.connect(urlHubs).get(), we get the HTML document with the list of hubs and their parameters. The div with hub information has the class hub, so by calling doc.select(".hub") we get a list of these elements. If the list is empty, we have gone past the last page and have already analyzed all the hubs, so we exit the loop.

    Next, we go through all the hub elements and create an object of type Hub for each one, passing our org.jsoup.nodes.Element to the constructor. It contains HTML code in the same format as above. Now let's abstract away from everything else; that is what OOP is for. All we have in front of us is the piece of HTML presented above and the class we need to fit it into. Let's write a skeleton for our class:
    import org.jsoup.nodes.Element;
    public class Hub {
        String title;
        int posts;
        boolean profiled;
        int membersCount;
        float habraindex;
        String url;
        public Hub(Element hubElem) {
        }
    }

    Let's write the constructor. To get started, let's do the simplest thing: get the data from the title. To do this, we first extract the title div itself, which has the form

    We parse it with
    Element titleDiv = hubElem.select(".title").get(0);
    Element tagA = titleDiv.getElementsByTag("a").get(0);
    title = tagA.text();
    url = tagA.attr("href");
    profiled = (hubElem.select(".profiled_hub").size() != 0);

    Next, we want to parse the number of subscribers and posts, the very parameters by which we will sort. But we immediately run into the first problem: the tag contains the string "91741 subscribers", which we cannot simply convert to an integer because it contains letters! Here regular expressions come to our aid. We quickly write a small method that takes a string, cuts out everything except digits, and converts the result to int. \D matches any character that is NOT a digit, and + means "occurs 1 or more times". That is, in this case we replace the letters with nothing.
    private int getNumbers(String str) {
        // Remove every run of non-digit characters, then parse what is left
        String numbers = str.replaceAll("\\D+", "");
        return Integer.parseInt(numbers);
    }
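
    As a quick check of the method above, here is a minimal standalone sketch. The class name DigitsDemo and the second sample string are illustrative assumptions, not part of the original program. Note that a string with no digits at all leaves an empty string, on which Integer.parseInt throws NumberFormatException.

```java
public class DigitsDemo {
    // Strips every non-digit character and parses the remaining digits.
    // Throws NumberFormatException if the input contains no digits.
    static int getNumbers(String str) {
        String digits = str.replaceAll("\\D+", "");
        return Integer.parseInt(digits);
    }

    public static void main(String[] args) {
        System.out.println(getNumbers("91741 subscribers")); // 91741
        System.out.println(getNumbers("1 234 posts"));       // 1234 (hypothetical input)
    }
}
```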

    Now we can get our values with peace of mind:
    String membersCountFullStr = hubElem.select(".members_count").get(0).text();
    membersCount = getNumbers(membersCountFullStr);
    String statFullStr = hubElem.select(".stat").get(0).getAllElements().get(2).text();
    posts = getNumbers(statFullStr);

    In principle, we could stop here, but out of curiosity I decided to extract all the available information about the hub. Here a very interesting second problem arose, which is the highlight of the article. How do we parse the Habraindex?

    To begin with, you should replace the comma with a period and remove extra spaces. But that is not enough! The parser still throws an error if you copy the Habraindex from the page and paste it into the code, as in Double.valueOf("–1.11"). If you type the same number by hand, everything is OK. And in my IDEA the two look absolutely identical!

    It turns out that the Habr designers simply used a dash instead of a minus sign, a character with a different code, which the parser of course does not accept. Take note. The essence of the problem is as follows:
    System.out.println((int)'-');//45
    System.out.println((int)'–');//8211

    Once, in my article Tricky Java Tasks, I examined a trap where a lowercase l cannot be distinguished from the digit 1. Now I have run into a similar problem myself.

    Therefore, the code for retrieving the Habraindex will be a little more complicated:
    String rawHabraIndex = hubElem.select(".habraindex").get(0).text(); // e.g. "1 265,92"
    char minus = 45;  // '-'
    char dash = 8211; // '–'
    // Remove spaces, switch to a decimal point, and turn the dash into a real minus sign
    String niceHabraIndex = rawHabraIndex.replaceAll(" ", "").replace(",", ".").replace(dash, minus); // "1265.92"
    habraindex = Float.parseFloat(niceHabraIndex);
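
    The same normalization can be tried outside the parser. The sketch below is a standalone illustration under a few assumptions: the class and method names are hypothetical, and it also strips non-breaking spaces (U+00A0), which the original code does not handle and which may or may not appear in the real page.

```java
public class HabraindexDemo {
    // Turns a raw index such as "1 265,92" or "\u20131,11" (en dash) into a float.
    static float parseHabraindex(String raw) {
        String cleaned = raw
                .replaceAll("[\\s\\u00A0]", "") // drop ordinary and non-breaking spaces (assumption)
                .replace(',', '.')              // decimal comma -> decimal point
                .replace('\u2013', '-');        // en dash (8211) -> hyphen-minus (45)
        return Float.parseFloat(cleaned);
    }

    public static void main(String[] args) {
        System.out.println((int) '-');      // 45
        System.out.println((int) '\u2013'); // 8211
        System.out.println(parseHabraindex("1 265,92"));
        System.out.println(parseHabraindex("\u20131,11"));
    }
}
```

    Without the last replace call, Float.parseFloat rejects the string with a NumberFormatException, which is exactly the error described above.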

    Next, we write the post comparator as a nested static class of Hub:
    public static class ComparePosts implements Comparator<Hub> {
        @Override
        public int compare(Hub o1, Hub o2) {
            // Descending order: hubs with more posts come first
            return Integer.compare(o2.posts, o1.posts);
        }
    }

    And sort by it somewhere in main:
    List<Hub> hubs = getAllHubs();
    Collections.sort(hubs, new Hub.ComparePosts());

    That's it, the task is complete! Sorting by the number of subscribers is done the same way. Next, I wrote code that prints the two lists to the console in a form that could be pasted straight into the article, which is what I did in the first half.
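
    For comparison, both orderings can also be written without separate comparator classes. This is only a sketch, assuming Java 8 or later; the stripped-down Hub class and the sample values here are hypothetical stand-ins, not the article's real data.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class HubSortDemo {
    // Hypothetical stand-in for the article's Hub class: only the sorted fields.
    static class Hub {
        final String title;
        final int posts;
        final int membersCount;

        Hub(String title, int posts, int membersCount) {
            this.title = title;
            this.posts = posts;
            this.membersCount = membersCount;
        }
    }

    // Descending by number of posts and by number of subscribers.
    static final Comparator<Hub> BY_POSTS_DESC =
            Comparator.comparingInt((Hub h) -> h.posts).reversed();
    static final Comparator<Hub> BY_MEMBERS_DESC =
            Comparator.comparingInt((Hub h) -> h.membersCount).reversed();

    // Returns the hub titles in the given order without modifying the input list.
    static List<String> titlesSortedBy(List<Hub> hubs, Comparator<Hub> order) {
        List<Hub> copy = new ArrayList<>(hubs);
        copy.sort(order);
        List<String> titles = new ArrayList<>();
        for (Hub h : copy) {
            titles.add(h.title);
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Hub> hubs = new ArrayList<>();
        hubs.add(new Hub("A", 10, 500)); // made-up sample values
        hubs.add(new Hub("B", 0, 900));
        hubs.add(new Hub("C", 25, 100));
        System.out.println(titlesSortedBy(hubs, BY_POSTS_DESC));   // [C, A, B]
        System.out.println(titlesSortedBy(hubs, BY_MEMBERS_DESC)); // [B, A, C]
    }
}
```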

    It takes about 10 seconds to get all the hubs. The source code can be downloaded here. Build and run it like this, remembering to install Jsoup and replace the path with your own:
    javac -cp .;"C:\prog\lib\jsoup-1.7.3.jar" com/kciray/habrahubs/Main.java
    java -cp .;"C:\prog\lib\jsoup-1.7.3.jar" com.kciray.habrahubs.Main

    In addition, I reworked the same classes to collect statistics on companies. Everything there seemed similar, but to find out the number of posts on a company's blog I had to load a separate page for each company, and this took about 5 minutes. I made the download multithreaded to speed things up, and found that Habr does not allow loading more than 5-7 pages at the same time. I then serialized the resulting ArrayList and wrote it to a file. This 100-kilobyte file is included with the second set of sources, so you can work with it.

    If you are interested in the full rating in a more compact form, I posted it as a web page.
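
    The company downloader itself is not shown in this article, so the following is only a sketch of one way to cap concurrency and serialize the result. Everything in it is an assumption: the names, the limit of 5 (taken from the 5-7 figure above), the placeholder fetchPostCount that stands in for a real Jsoup request, and the use of an in-memory byte array instead of a file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LimitedDownloadDemo {
    static final int MAX_PARALLEL = 5; // assumed limit, never more than 5 requests at once

    // Hypothetical placeholder for loading one company page and reading its post count.
    static int fetchPostCount(String companyUrl) {
        return companyUrl.length();
    }

    // Runs the fetches on a fixed-size pool; results keep the order of the input list.
    static ArrayList<Integer> fetchAll(List<String> urls)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(MAX_PARALLEL);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<Integer> task = () -> fetchPostCount(url);
                futures.add(pool.submit(task));
            }
            ArrayList<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // waits for that task to finish
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    // Standard Java serialization of the list into a byte array.
    static byte[] serialize(ArrayList<Integer> list) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(list);
        }
        return bytes.toByteArray();
    }

    @SuppressWarnings("unchecked")
    static ArrayList<Integer> deserialize(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (ArrayList<Integer>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ArrayList<Integer> counts = fetchAll(Arrays.asList("a", "bb", "ccc"));
        System.out.println(counts);                         // [1, 2, 3]
        System.out.println(deserialize(serialize(counts))); // [1, 2, 3]
    }
}
```

    A fixed-size pool is the simplest way to respect such a limit: extra tasks wait in the pool's queue until a thread is free.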
