Game to improve the quality of Wikipedia

    A beta version of the online game WikiBest, which is part of research on the quality of Wikipedia data, was announced today. At present the game lets you compare data quality across 5 language versions of Wikipedia: Russian, Ukrainian, Belarusian, Polish and English. More languages are planned for the near future.


    Despite its popularity, Wikipedia is often criticized for the poor quality of its information. Researchers have proposed various approaches to automatically assessing the quality of articles in this free encyclopedia, but many problems remain unsolved. For example, how can the quality of individual facts on the same topic be automatically evaluated or compared across different language versions?

    On Wikipedia, each article can have several language versions (sometimes more than 200). On the one hand, this simplifies access to information for individual language communities. On the other hand, it makes it hard to determine which version holds the better information, since each version can be created and edited independently of the others. For example, readers and editors of the English article about Yekaterinburg do not have to know what is written about this city in the Russian version of Wikipedia, although one might expect the latter to be of better quality. Of course, this rule does not hold in every case ;)

    The WikiBest game was created so that, in the future, algorithms for automatically comparing data quality between language versions of an article can be built from the decisions of users (players) using machine learning and artificial intelligence. This can help select the more complete, relevant and reliable information with which other language versions of Wikipedia could be enriched.

    Game address

    The first short video lecture on how WikiBest works:

    Key Features

    Currently, the minimum requirement for a player is a basic knowledge of 4 languages (Russian, Ukrainian, Polish, English), enough to compare the contents of the cards of Wikipedia articles ("infoboxes" in English; put simply, tables with data). Knowledge of Belarusian is also recommended, since it makes it possible to compare quality in all 5 available language versions.

    Registration is required to take part in the game. Once you receive the activation code by email, you can begin to "fight" for quality on Wikipedia :)

    Cards in 5 (4) language versions on the same topic appear on the screen; the topic can be, for example, a city, a computer game, a university, a company or another object. The card windows can be moved around to make the data easier to compare. For each language version you can mark four options about the data it contains: the best quality, the best completeness, the best relevance, the best reliability.

    Ideally, each of the available options should be checked only once across the 5 (4) languages, i.e. we must determine which version is the best in each of the four “nominations”. However, there are exceptional cases in which two language versions are the best at once. The game then invites the player to add a comment explaining why he or she thinks so.
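
    The marking rule above can be sketched as a small check. This is only an illustration, not the game's actual code: the data layout, the function name check_vote and the decision to treat the comment as mandatory for ties are all assumptions.

```python
from collections import defaultdict

# The four options named in the text; the layout below is hypothetical.
NOMINATIONS = ("quality", "completeness", "relevance", "reliability")


def check_vote(marks, comment=""):
    """Return a list of problems found in one player's marks.

    marks maps a language code to the set of nominations ticked for it.
    """
    # Collect, for every nomination, the languages that were marked as best.
    winners = defaultdict(list)
    for lang, ticked in marks.items():
        for nomination in ticked:
            winners[nomination].append(lang)

    problems = []
    for nomination in NOMINATIONS:
        count = len(winners[nomination])
        if count == 0:
            problems.append(f"no winner chosen for {nomination}")
        elif count > 1 and not comment.strip():
            # Assumption: a tie is accepted only together with a comment.
            problems.append(f"tie in {nomination} needs a comment")
    return problems


# Made-up marks for four cards on the same topic.
marks = {
    "ru": {"quality", "completeness"},
    "uk": {"relevance"},
    "en": {"reliability", "completeness"},
    "pl": set(),
}
print(check_vote(marks))  # ['tie in completeness needs a comment']
print(check_vote(marks, comment="ru and en list the same parameters"))  # []
```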

    To go to the next five (four) cards, click "Next" and repeat the steps described above.

    The work done in the game earns "experience", which raises the player's level.

    Because the research is carried out mainly by specialists in machine learning and data analysis, gamification is not a strong point of this project ;) This still has to be learned. I would be glad to receive links to useful materials on the subject.

    Generally speaking, the project is non-profit. Any help is appreciated :)

    A bit of theory

    What is data quality? The question is not a simple one, and the scientific community has no single definition: it all depends on the context ;) To begin with, quality assessment is subjective and depends on the particular person, their knowledge and experience, as well as on the demand for the information at a given time. Simply put, data quality can be defined as usability.

    To evaluate data quality, its various characteristics must also be taken into account, for example completeness, relevance and reliability.

    In the WikiBest game, completeness means how broadly the object is described, i.e. you need to look at which characteristics are entered on the card and whether all the main parameters of the object are available to the reader. For example, if the object is a city, the most important parameters may include population, area, mayor, etc.
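
    One simple way to put a number on this idea is the share of key parameters that are filled in on a card. The sketch below is an assumption for illustration only: the list of key parameters comes from the city example above, the card values are made up, and the game itself relies on the player's judgement rather than on such a formula.

```python
# Hypothetical key parameters for a city, taken from the example in the text.
CITY_KEY_PARAMETERS = ("population", "area", "mayor")


def completeness(infobox, key_parameters=CITY_KEY_PARAMETERS):
    """Share of key parameters that have a non-empty value on the card."""
    filled = 0
    for name in key_parameters:
        value = infobox.get(name)
        if value is not None and str(value).strip():
            filled += 1
    return filled / len(key_parameters)


# Made-up cards for the same city in two language versions.
cards = {
    "ru": {"population": "1000000", "area": "400", "mayor": "N. N."},
    "en": {"population": "1000000", "area": ""},
}
for lang, card in cards.items():
    print(f"{lang}: {completeness(card):.2f}")
```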

    Relevance reflects how far the entered parameters of the object differ from the real state of affairs. For example, a card whose population value is given as of 2018 has higher relevance for that parameter than a card where the same value is given as of 2016.
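
    The 2018 versus 2016 comparison can be written down as a tiny helper that picks the language versions with the latest "as of" year. The function name and the input layout are hypothetical, and comparing years is only one possible reading of relevance.

```python
def most_relevant(as_of_years):
    """Return the language versions whose value has the latest 'as of' year.

    as_of_years maps a language code to a year, or to None when the card
    does not say when the value was measured.
    """
    known = {lang: year for lang, year in as_of_years.items() if year is not None}
    if not known:
        return []
    latest = max(known.values())
    return sorted(lang for lang, year in known.items() if year == latest)


# Population "as of" years for the same city (illustrative values).
population_as_of = {"ru": 2018, "en": 2016, "pl": None}
print(most_relevant(population_as_of))  # ['ru']
```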

    Reliability, in the context of the game, shows how much of the information is supported by reliable sources, so that the reader can verify the correctness of the entered value of a particular parameter.
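
    As a rough illustration, reliability could be approximated by the share of filled parameters that cite at least one source. The card structure with "value" and "sources" keys is an assumption made for this sketch, and it does not judge whether a cited source is itself reliable.

```python
def reliability(infobox):
    """Share of filled parameters that cite at least one source."""
    filled = [param for param in infobox.values() if param["value"]]
    if not filled:
        return 0.0
    sourced = sum(1 for param in filled if param["sources"])
    return sourced / len(filled)


# Made-up card: two filled parameters, only one of them with a source.
card = {
    "population": {"value": "1000000", "sources": ["census"]},
    "area": {"value": "400", "sources": []},
    "mayor": {"value": "", "sources": []},
}
print(reliability(card))  # 0.5
```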

    Why exactly 5 languages?

    As mentioned above, the game is part of scientific research in which I am directly involved. I have at least a basic knowledge of these languages, so I can carry out research on the data obtained.

    Belarusian is optional because of the size of the Belarusian section of Wikipedia, which currently has approximately 150 thousand articles. For comparison, the Ukrainian Wikipedia already contains more than 800 thousand, and the Russian one almost 1.5 million (source).

    The main goal of the ongoing research is to enrich the less developed language sections of Wikipedia. In this sense, the Belarusian section has great potential, since data from the other language sections under study can be transferred there. However, we already know that data quality depends on the topic and the language version, so first you need to determine the “candidate” for “copying” (in fact, the data still has to be translated, but this is not a problem when using semantics).
