How I calculated the millionth Russian Wikipedia article

    Today, on May 11, 2013, at 01:41:39.8 UTC (05:41:39.8 Moscow time), the millionth article appeared in the Russian section of Wikipedia. By coincidence, the Russian section also marks its 11th anniversary today. The article, Life Extension Foundation, was created by user UG72. Disputes have already flared up over whether the article deserves to exist, but the fact that it was the one that crossed the line has been established unequivocally.

    The Wikipedia article counter shows the number of articles that contain at least one link (two other counting rules can be configured). As a result, not only the creation and deletion of articles but also renaming and even ordinary editing can change its value. Add to this that on the eve of a milestone, editors start mass-publishing their prepared drafts in the hope that one of them will turn out to be the jubilee article, and that the counter, being a not very important thing, is usually updated asynchronously. All of this makes it very difficult to work out which article is the one. But everyone wants to know!

    Still, there is a way out.


    Because any action can affect the counter, and everything happens very quickly, attempts to work out the article's number after the fact are doomed. You need to watch the counter in real time. When Wikipedia was not yet so famous, it was enough to open the list of new articles, where the counter was shown back then, at the right moment and manage to take a screenshot. But today, for example, the round value of the counter lasted less than two seconds.

    The Wikimedia Foundation has Toolserver, a set of servers to which the databases of the foundation's projects are replicated. Any technically competent contributor can request shell access to them. Once access is granted, Toolserver resources can be used for anything that benefits the foundation's projects.

    The current value of the article counter is stored in the database; naturally, information about new articles is there too. So to pin down the anniversary moment, it is enough to poll the database several times per second, watch for counter changes, and log them. The log looks something like this:

    06:05:25.02 999397 Kasiski,_Friedrich
    06:05:25.51 999398 Kasiski,_Friedrich
    06:09:02.67 999398 Krivolapov,_Grigory_Arkhipovich
    06:09:03.32 999399 Krivolapov,_Grigory_Arkhipovich
    06:10:16.17 999399 Light_industry_Russia
    06:10:18.39 999400 Light_industry_Russia
    

    Typically, each article appears in it twice: the first time when the "last created article" field changes, the second time when the counter value changes. Thus, for example, the article Kasiski, Friedrich was the 999398th.
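
    The change-logging step described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the function name log_changes, the sample list, and the extra 06:05:25.27 sample are hypothetical, and the database polling itself is replaced by a plain list of readings.

```python
from typing import Iterable, Iterator, Tuple

# One reading: (timestamp, counter value, title of the newest article).
Sample = Tuple[str, int, str]


def log_changes(samples: Iterable[Sample]) -> Iterator[str]:
    """Yield a log line only when the counter or the newest title changes."""
    previous = None
    for timestamp, counter, title in samples:
        state = (counter, title)
        if state != previous:
            yield f"{timestamp} {counter} {title}"
            previous = state


# Hypothetical readings, as if the source had been polled several times a second.
samples = [
    ("06:05:25.02", 999397, "Kasiski,_Friedrich"),
    ("06:05:25.27", 999397, "Kasiski,_Friedrich"),  # nothing changed: not logged
    ("06:05:25.51", 999398, "Kasiski,_Friedrich"),
    ("06:09:02.67", 999398, "Krivolapov,_Grigory_Arkhipovich"),
    ("06:09:03.32", 999399, "Krivolapov,_Grigory_Arkhipovich"),
]

log_lines = list(log_changes(samples))
for line in log_lines:
    print(line)
```

    Because the title and the counter are compared together, a new article normally produces two lines: one when the title changes and one when the counter catches up.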

    Tonight, close to the jubilee mark, there were problems with access to Toolserver. The tracking script kept running and registering new articles, but the counter value differed from the one on the site! I could not quickly find out why. The monitoring tools reported that replication was working correctly and without lag. The difference between the two counters drifted slowly around 100 articles. So the script had to be rewritten urgently to take its data directly from the site. The instance working with the database was left running just in case.

    MediaWiki has a great API that lets you pull out a lot of interesting data. You can formulate a single API request that returns both the counter value and the latest new pages:

    ru.wikipedia.org/w/api.php?format=jsonfm&meta=siteinfo&action=query&siprop=statistics&list=recentchanges&rctype=new

    The required data is in the .query.statistics.articles and .query.recentchanges[0].title fields. With this data you do the same thing: poll it constantly and log any changes. With this approach, the counter's asynchrony shows up less often.
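
    As an illustration, reading those two fields from the API reply might look like the sketch below. It is written under assumptions: the names extract_state and fetch_state and the User-Agent string are hypothetical, format=json is used instead of jsonfm (the latter is the HTML view meant for reading in a browser), and the sample reply is hand-made in the shape of a real one.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://ru.wikipedia.org/w/api.php"
PARAMS = {
    "action": "query",
    "format": "json",  # machine-readable counterpart of "jsonfm"
    "meta": "siteinfo",
    "siprop": "statistics",
    "list": "recentchanges",
    "rctype": "new",
}


def extract_state(payload: dict) -> tuple:
    """Return (article counter, title of the newest page) from an API reply."""
    query = payload["query"]
    return query["statistics"]["articles"], query["recentchanges"][0]["title"]


def fetch_state() -> tuple:
    """Perform one live request (needs network access; not called below)."""
    request = Request(
        f"{API_URL}?{urlencode(PARAMS)}",
        headers={"User-Agent": "counter-watch-sketch/0.1 (example)"},
    )
    with urlopen(request, timeout=10) as response:
        return extract_state(json.load(response))


# Offline check against a hand-made reply shaped like the real one.
sample_reply = {
    "query": {
        "statistics": {"articles": 1000000},
        "recentchanges": [{"title": "Life Extension Foundation"}],
    }
}
state = extract_state(sample_reply)
print(state)
```

    Calling fetch_state() in a loop with a short sleep and logging only the changes would give a log of the same kind as the database version.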

    Since an HTTP request takes longer than a database query, I also ran the same script from my personal server, just in case. With that I calmed down, lay low, and waited.

    The article was created. The three logs around the million mark look like this:

    Toolserver, data from a copy of the database
    https://toolserver.org/~kalan/ruwiki-1m.txt
    01:36:32.57 999878 Klavdievo
    01:36:32.89 999879 Klavdievo
    ...
    01:41:37.88 999908 Exactly
    01:41:38.30 999909 Exactly
    01:41:38.49 999909 Kruchinin,_Vladimir_Fyodorovich
    01:41:38.93 999909 Kalyamin,_Vyacheslav_Ivanovich
    01:41:39.09 999910 Kalyamin,_Vyacheslav_Ivanovich
    01:41:40.69 999911 Kalyamin,_Vyacheslav_Ivanovich
    01:41:40.75 999911 Life_Extension_Foundation
    01:41:40.91 999912 Life_Extension_Foundation
    01:41:41.95 999912 Fortygin,_Vitaliy_Sergeevich
    01:41:42.11 999913 Fortygin,_Vitaliy_Sergeevich
    01:41:43.07 999913 Emperor_-_power
    01:41:43.29 999914 Emperor_-_power
    01:41:43.35 999914 Chertova,_Nadezhda_Andreevna
    01:41:43.97 999915 Glock_21
    01:41:44.59 999916 Glock_21
    01:41:44.65 999916 Volodya_Shishkin
    01:41:44.69 999917 Volodya_Shishkin
    ...
    01:43:17.60 999935 Bobrik_(village)
    01:43:17.69 999936 Bobrik_(stanitsa)
    



    Toolserver, data from the API
    https://toolserver.org/~kalan/ruwiki-1m-2.txt
    01:36:32.67 999966 Klavdievo
    01:36:32.93 999967 Klavdievo
    ...
    01:41:38.01 999997 Exactly
    01:41:38.67 999997 Kruchinin, Vladimir Fedorovich
    01:41:39.12 999998 Kalyamin, Vyacheslav Ivanovich
    01:41:39.35 999999 Kalyamin, Vyacheslav Ivanovich
    01:41:39.80 1000000 Life Extension Foundation
    01:41:41.12 1000000 Fortygin, Vitaliy Sergeevich
    01:41:41.56 1000001 Fortygin, Vitaliy Sergeevich
    01:41:41.79 1000000 Fortygin, Vitaliy Sergeevich
    01:41:42.00 1000001 Fortygin, Vitaliy Sergeevich
    01:41:42.63 1000002 Emperor - power
    01:41:43.09 1000003 Chertova, Nadezhda Andreevna
    01:41:43.32 1000004 Glock 21
    01:41:44.22 1000004 Volodya Shishkin
    ...
    01:43:17.01 1000023 Bobrik (village)
    01:43:17.22 1000024 Bobrik (village)
    



    My server, data from the API
    http://v.kalan.cc/ruwiki-1m-2.txt
    01:36:32.72 999966 Klavdievo
    01:36:32.96 999967 Klavdievo
    ...
    01:41:37.95 999996 Exactly
    01:41:38.19 999997 Exactly
    01:41:38.68 999997 Kruchinin, Vladimir Fedorovich
    01:41:38.92 999997 Kalyamin, Vyacheslav Ivanovich
    01:41:39.17 999999 Kalyamin, Vyacheslav Ivanovich
    01:41:39.88 1000000 Life Extension Foundation
    01:41:41.25 1000000 Fortygin, Vitaliy Sergeevich
    01:41:41.73 1000001 Fortygin, Vitaliy Sergeevich
    01:41:42.68 1000002 Emperor - power
    01:41:42.92 1000002 Chertova, Nadezhda Andreevna
    01:41:43.14 1000003 Chertova, Nadezhda Andreevna
    01:41:43.38 1000004 Glock 21
    01:41:44.32 1000004 Volodya Shishkin
    ...
    01:43:17.10 1000023 Bobrik (village)
    01:43:17.34 1000024 Bobrik (village)
    



    All three logs make it clear that the Life Extension Foundation article crossed the line. From the articles Klavdievo (999967 on the site, 999879 on Toolserver) and Bobrik (stanitsa) (1000024 on the site, 999936 on Toolserver), we can conclude that the difference between the Toolserver counter and that of Wikipedia itself was 88 throughout the segment of interest. At number 1,000,000 − 88 = 999912 we find, once again, the Life Extension Foundation article.
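
    The same cross-check can be written down as a small sketch. The helper names parse_log and counter_at are hypothetical, and only a few lines from the logs above are included; underscores in database titles are replaced with spaces so that the two logs can be compared.

```python
def parse_log(text: str) -> list:
    """Parse 'HH:MM:SS.ss counter title' lines into (counter, title) pairs."""
    entries = []
    for line in text.strip().splitlines():
        _, counter, title = line.split(maxsplit=2)
        entries.append((int(counter), title.replace("_", " ")))
    return entries


def counter_at(entries: list, title: str) -> int:
    """Highest counter value logged while `title` was the newest article."""
    return max(counter for counter, t in entries if t == title)


db_log = parse_log("""
01:36:32.57 999878 Klavdievo
01:36:32.89 999879 Klavdievo
01:41:40.75 999911 Life_Extension_Foundation
01:41:40.91 999912 Life_Extension_Foundation
""")

api_log = parse_log("""
01:36:32.67 999966 Klavdievo
01:36:32.93 999967 Klavdievo
01:41:39.80 1000000 Life Extension Foundation
""")

# Offset between the site counter and the replicated database counter.
offset = counter_at(api_log, "Klavdievo") - counter_at(db_log, "Klavdievo")
print(offset)                                           # 88
print(1_000_000 - offset)                               # 999912
print(counter_at(db_log, "Life Extension Foundation"))  # 999912
```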

    Fortunately, the vandal articles missed the mark this time as well.
