rubin16 July 22, 2014 at 11:46

How Wikipedia works (part 2)

Hello, Habr!

My main job gave me a couple of days off, and besides important personal matters I decided to spend them continuing my series of posts about Wikipedia.

The first part of the series was well received, so I will try to make the second even more interesting for the local audience: today's post covers some technical aspects of the project.

As you know, Wikipedia is a volunteer project, and this principle holds to some extent even for the technical support of the site. Any participant can, with fairly little effort, pick a simple bug and submit a patch without even using Gerrit.

But a much more common form of participation is developing additional tools, settings, and bots for Wikipedia.

Bots

To perform routine or high-volume tasks, participants often run bots or ask other, more technically savvy participants to do so. MediaWiki provides access to an API, and for accounts with the bot flag this access is even wider (for example, a bot can request 5000 entries per API query instead of 500). In addition, the bot flag lets an account's edits be hidden from other users' lists of recent edits, which spares participants from thousands of minor edits that would otherwise clog those lists.
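To make the limit difference concrete, here is a minimal sketch of how a client might build a list query against the MediaWiki API. The endpoint URL, the choice of the `allpages` list, and the helper name `build_allpages_url` are assumptions for illustration; the 500 and 5000 values are the limits mentioned above.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; every wiki exposes its own api.php.
API_URL = "https://ru.wikipedia.org/w/api.php"


def build_allpages_url(has_bot_flag, start_from=""):
    """Build a page-listing URL that asks for the largest batch the account may get."""
    # Limits taken from the text: 500 entries normally, 5000 with the bot flag.
    limit = 5000 if has_bot_flag else 500
    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": limit,
        "apfrom": start_from,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


print(build_allpages_url(has_bot_flag=True))
print(build_allpages_url(has_bot_flag=False))
```

The sketch only builds the URL and sends nothing; a real bot would also need to log in and follow the continuation values the API returns.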

Pywikibot

The most popular of the existing frameworks is pywikibot, which already includes a large set of ready-made scripts: for example, deleting pages from a list, moving articles from one category to another, and much more. About 100 people have already taken part in developing the framework. It is also extremely simple to use: even an ordinary Windows user just needs to install Python, download the distribution, enter a username and password in the config, and run one of the ready-made scripts.

Previously, a popular task for pywikibot was maintaining links between articles in different language sections. For example, someone creates the article Rally on Russian Wikipedia, knows that English Wikipedia has the corresponding article Rallying, and adds a link to it. Then a bot comes along and sees that the English article links to 20+ other language sections, while the Russian article does not link to them. The bot therefore adds a link to the new Russian article in each of those sections, and updates the Russian article with the full list of interwiki links.

As we can see, this work is not very interesting for a person but is very voluminous; dozens of bots were involved in it, racking up millions of edits as a result. For example, my bot now has more than 990 thousand edits, 80 percent of which are exactly such interwiki edits. Not so long ago the Wikipedia engine was redesigned so that such edits in each section are no longer needed, but the number of routine tasks has not decreased.
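The propagation step described above can be sketched in a few lines. This is not pywikibot's actual code, only an illustration of the idea; the function name `sync_interwiki`, the data layout, and the German and French sample titles are assumptions.

```python
def sync_interwiki(links, new_page, known_page):
    """Return the interwiki links every page in the cluster should end up with.

    links maps a (language, title) pair to the set of pages it already links to.
    """
    # The cluster is the new page, the page it points to, and everything
    # that page already links to.
    cluster = {new_page, known_page} | links.get(known_page, set())
    # Each page should link to every other page in the cluster.
    return {page: cluster - {page} for page in cluster}


# Sample data: the English article already links to two other sections.
links = {
    ("en", "Rallying"): {("de", "Rallye"), ("fr", "Rallye automobile")},
    ("de", "Rallye"): {("en", "Rallying"), ("fr", "Rallye automobile")},
    ("fr", "Rallye automobile"): {("en", "Rallying"), ("de", "Rallye")},
}
updated = sync_interwiki(links, ("ru", "Ралли"), ("en", "Rallying"))

for page, targets in sorted(updated.items()):
    print(page, "->", sorted(targets))
```

With 20+ sections per article, every new article meant one edit in each of them, which is where the millions of bot edits came from.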

But let's get back to pywikibot. The framework has two branches:

Core is the new branch, where the code was rewritten; it has become more structured and efficient.
Compat is the old branch, but it has a wider range of scripts, works better with third-party MediaWiki projects, and is more familiar to many users.

Bugs and feature requests are collected in a bug tracker shared with MediaWiki, and development now goes through Git / Gerrit, which has made it easier to attract new developers, submit patches, and review them. Development previously went through SVN, but to unify resources with MediaWiki and widen the circle of developers it was decided to move to Git / Gerrit; there is even a topic on Habr about the advantages of Git over SVN.

I will not describe the full set of the framework's existing functions; those who wish can browse the repository and see for themselves. I will only say that it is actively growing, and that existing scripts require minimal setup to run in any language section.

AutoWikiBrowser

While the bot described above runs in the console, AWB (AutoWikiBrowser) is a more user-friendly tool. AWB has a full graphical interface and automatic updates, and it runs only on Windows (unofficially, also under Wine). Typical AWB functions are regular-expression text replacement and other edits over a specific list of articles. AWB can recursively walk through Wikipedia categories, compare lists to highlight unique elements or intersections, and even process Wikipedia dumps. There are restrictions for accounts without an administrator or bot flag: lists compiled by such participants are limited to 25,000 lines. With a bot flag, the restrictions are removed completely when a special plugin is loaded.

Important disclaimer: because AWB can be used to make many non-constructive edits quickly, including vandalism, its use is technically limited to users approved by administrators. If a username is not listed on the approval page, AWB will refuse to work.
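AWB itself is a C# GUI program, but its two most typical operations, regex replacement over a list of articles and list comparison, are easy to illustrate. The sketch below is only an analogy in Python; the function names, the sample article texts, and the sample regular expression are assumptions and not part of AWB.

```python
import re


def replace_in_articles(articles, pattern, replacement):
    """Apply one regex replacement to every article text; return only changed ones."""
    regex = re.compile(pattern)
    changed = {}
    for title, text in articles.items():
        new_text = regex.sub(replacement, text)
        if new_text != text:
            changed[title] = new_text
    return changed


def compare_lists(first, second):
    """Return (only in first, only in second, in both) as sorted lists."""
    a, b = set(first), set(second)
    return sorted(a - b), sorted(b - a), sorted(a & b)


# Made-up article texts; the regex removes whitespace before punctuation.
articles = {
    "Rallying": "The rally was held in 2013 .",
    "Chess": "Chess is a board game.",
}
changed = replace_in_articles(articles, r"\s+([.,;:])", r"\1")
only_first, only_second, both = compare_lists(["Rallying", "Chess"], ["Chess", "Go"])

print(changed)
print(only_first, only_second, both)
```

Returning only the changed articles mirrors the review-before-save workflow: untouched pages never need to be shown to the operator.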

Normally you must click the "Save" button manually for each edit; autosave is possible only if AWB is running under an account that has a bot flag. AWB is therefore hard to use for truly large-scale tasks, but it is very convenient for small ones, since it lets you automate certain actions and quickly get what you want without having to contact participants who run more advanced bots (for example, the one described above). Personally, I often use AWB just for compiling lists and then quickly run pywikibot with the necessary task. Pywikibot also has a special page generator that can do all this, but for me it is clearer and easier to do everything in a program with a GUI.

The AWB source code is open; the program is written in C# and is maintained by a small circle of developers. At startup the program checks for updates and installs them itself, and the distribution is also published on SourceForge. If a critical error occurs during operation, AWB compiles a bug report and helps pass it on to the developers.

Other

There are bots written in Perl, .NET, and Java, but they are often maintained by individual enthusiasts and are not widely used. Personally, I once ran bots in PHP as well, but the broad support for pywikibot, its active bug tracker, and the responsiveness of its many developers won me over completely, so I cannot tell you about other frameworks in detail :)

Toolserver

The section above was devoted to scripts and bots that are mostly run from a participant's own computer or server. It is also possible to run scripts on platforms hosted by Wikimedia organizations. Previously this was Toolserver, which was supported by the German chapter of Wikimedia; it was shut down on June 30, 2014 because the Labs system had been created. But first things first.

The Toolserver story began in 2005, when Sun Microsystems donated a V40z server (2 * Opteron 848, 2.2 GHz, 8 GB RAM, 6 * 146 GB disk, 12 * 400 GB external RAID) for use at the Wikimedia conference in Frankfurt. After the conference, one of the members of the German chapter took it home and made a coffee table out of it. Some time later it was decided to install it in Amsterdam at Kennisnet, where fifty Wikimedia Foundation servers were installed.

After that, various scripts and tools (edit counters, article analyzers, file upload tools, and so on) began to run on Toolserver, and its capacity grew: at the time of shutdown, 17 servers were in operation, more than 3 million requests to Toolserver were registered per day, and traffic reached 40 Mb/s. Each server had from 24 to 64 GB of RAM, most of them ran Solaris 10 (gradually being switched to Linux), and the total disk space was 8 TB.

What were the main advantages of Toolserver as a platform?

Replication from the Wikimedia Foundation servers (more on this below).
Openness: with reasonable goals and skills, it was easy to get an account.
The option of keeping code closed: whereas the new Labs requires code under an open license, and the files themselves are for the most part visible to all Labs participants (with the exception of passwords, logins, etc.), Toolserver was much more lenient in this regard.

There were also drawbacks:

The system ran "as is", because both the funding and the capacity of the administrators supporting it were limited: many system errors went unfixed for years.
If a developer became inactive, their code and work disappeared, and Wikipedia often lost tools it had long been used to; on Labs, thanks to the openness and accessibility of the code, another developer can take over support of any project.
At some point, severe restrictions were imposed on the consumption of system resources, which led to some useful but expensive tools being switched off.

But let's get back to the main advantage, replication: without it, Toolserver would have been no different from ordinary hosting where you can run your own processes. Thanks to replication, copies of the Wikipedia databases were always available on Toolserver, so tools could work with the database directly instead of making huge API requests or processing dumps that were sometimes out of date.

An example replication scheme is shown in the picture below:

In the diagram, Tampa is the Foundation's main database site, located in the USA. Clusters s1-s4 host the databases: for example, s1 is English Wikipedia, s2 holds some other large sections, and so on. Data from Tampa is replicated to the Toolserver database in Amsterdam, where Toolserver users and their tools access it. Naturally, there was always some replication lag, and because different clusters were used, the lag for English Wikipedia data could be 1 minute while for Russian Wikipedia it was 2-3 days. For example, on June 21 (shortly before the shutdown), the lag was up to 28 seconds.

This availability of up-to-date data was Toolserver's main advantage: it was possible to analyze almost in real time which files are unused, how many edits and actions any participant has made across all the Foundation's projects, and much other information that the Wikipedia engine does not provide directly.
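Because lag differed per cluster, a tool had to know which cluster its wiki lived on before trusting the data. The sketch below shows that check under stated assumptions: the lag values are invented, placing Russian Wikipedia on s2 is a guess (the text only says s1 is English Wikipedia), and the names `CLUSTER_LAG`, `WIKI_CLUSTER`, and `is_fresh_enough` are hypothetical.

```python
from datetime import timedelta

# Hypothetical snapshot of per-cluster replication lag (invented values).
CLUSTER_LAG = {
    "s1": timedelta(minutes=1),
    "s2": timedelta(days=2),
}

# s1 hosting English Wikipedia comes from the text;
# putting Russian Wikipedia on s2 is an assumption for the example.
WIKI_CLUSTER = {"enwiki": "s1", "ruwiki": "s2"}


def is_fresh_enough(wiki, max_lag):
    """Tell whether the replica of the given wiki lags by no more than max_lag."""
    return CLUSTER_LAG[WIKI_CLUSTER[wiki]] <= max_lag


print(is_fresh_enough("enwiki", timedelta(minutes=5)))
print(is_fresh_enough("ruwiki", timedelta(hours=1)))
```

A tool that needs near-real-time data could run such a check first and fall back to the API when the replica is too far behind.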
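To show why direct database access was so convenient, here is a sketch of an edit counter written as a single SQL query. It uses an in-memory SQLite database as a stand-in for a replica; the simplified `revision` table, its columns, the sample rows, and the function name `edit_count` are all assumptions for the demo and do not reproduce the real schema.

```python
import sqlite3

# In-memory stand-in for a replicated wiki database (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision ("
    "rev_id INTEGER PRIMARY KEY, rev_user_text TEXT, rev_page INTEGER)"
)
conn.executemany(
    "INSERT INTO revision (rev_user_text, rev_page) VALUES (?, ?)",
    [("Alice", 1), ("Bob", 1), ("Alice", 2), ("Alice", 3)],
)


def edit_count(connection, user):
    """Count the revisions made by one user with a single query."""
    row = connection.execute(
        "SELECT COUNT(*) FROM revision WHERE rev_user_text = ?", (user,)
    ).fetchone()
    return row[0]


print(edit_count(conn, "Alice"))
print(edit_count(conn, "Bob"))
```

Getting the same number through the API would mean paging through the user's contributions in batches, which is exactly the kind of huge request that replication made unnecessary.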

Conclusion

Supporting Toolserver was a heavy burden for the German chapter, and the system had certain limitations, so as of July 1 Toolserver was completely replaced by the new Labs project, which is supported entirely by the Wikimedia Foundation itself. This is a big new project and I will write about it in the next post, but as a teaser I can publish the June statistics for Labs :)

213 projects are running
3 356 users are registered in the system
1 714 312 MB of RAM are in use
19 045 GB of disk space are in use

See you soon!
