API for the Russian public initiative. Step 1: data collection and analysis

    As an introduction

    You probably all remember the Russian Public Initiative ( www.roi.ru ) - the state-sponsored platform, run on behalf of the federal government, for collecting signatures under online petitions. The idea is that if a petition collects 100 thousand votes within 1 year, it must be officially considered by the authorities and even has a chance of becoming a draft law.

    So far, 6 petitions have passed this filter - https://www.roi.ru/complete/ - two of them genuinely collected 100 thousand votes, while the other 4 received far fewer votes but the authorities reacted to them anyway.

    And although a petition does not guarantee that any decision will be made at all, many people create them not only in the hope of a positive outcome, but also to put the problem on the "media agenda" - in other words, so that the media write about it and the state reacts publicly.

    So ROI is, for now, far from the least interesting of the state projects, and there is real interest in it. At the same time, ROI has a number of shortcomings and problems.

    ROI problems

    Authorization through ESIA (the government services portal)

    Much has already been written about this. Authorization through ESIA did push hundreds of thousands of people to register on the government services portal so they could vote, but it remains a barrier nonetheless: registration is not easy to arrange, and far from all citizens have it. Online registration tied, for example, to a mobile phone number could have been offered instead.
    This is a limitation that we cannot yet overcome.

    Open Data and API

    ROI is of interest to many people, and not only to the authors of particular petitions: the petitions as a whole are interesting material for anyone who wants to understand what worries citizens and which problems affect people the most.
    Open data is needed for many tasks:
    • a mobile app for tracking initiatives
    • visualization and analytics
    • predicting the success or failure of an initiative
    • services that promote initiatives and draw attention to them

    Collecting data

    Before starting to build a full-fledged API for ROI, I began by modeling the collection of data from the site and wrote a short document on Github - API for ROI

    There I sketched out the basic concepts present in the system and what can theoretically be extracted from it.
    And immediately identified the limitations:
    1. Votes for / against are available only to authorized users. Given that authorization goes through the government services portal, this imposes certain restrictions. Authorization can, of course, be worked around, but for now we are collecting head-on only the data that has no such restrictions.
    2. The data is split between the petition's own page and the list of petitions. The list contains the number of votes in favor, while on the petition page, as I already wrote, vote data is available only with authorization.

    To download the data, a small script was written that pulls the data from the list of petitions and from their individual pages and then merges them into one unified description. MongoDB was used as storage. You can download and inspect it here - github.com/ivbeg/apiroi/blob/master/scripts/data_extract.py
    The script is as simple as possible and will, of course, later be reworked substantially so that it updates petitions regularly and compiles them immediately into a single format.
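    The merge step the script performs can be sketched roughly as follows. This is a minimal illustration, not the real data_extract.py code: the field names (id, votes, title and so on) are hypothetical, and the MongoDB storage part is left out.

```python
def merge_petition(list_record, page_record):
    """Combine the fields scraped from the list of petitions with the
    fields scraped from the petition's own page into one unified record.
    Field names here are illustrative, not the real script's schema."""
    merged = dict(page_record)   # title, description, level, dates ...
    merged.update(list_record)   # petition id, votes-in-favor count ...
    return merged

# A record from the list view and one from the petition page:
list_rec = {"id": "77F12345", "votes": 1200}
page_rec = {"id": "77F12345", "title": "Example petition", "level": "federal"}
record = merge_petition(list_rec, page_rec)
```

    The real script then writes each merged record into a MongoDB collection; the key point is only that two partial views of one petition are reduced to a single document.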

    The data was collected quite quickly - it took literally a few hours. I will not go into the details of writing parsers - this is a very simple case with no surprises.
    The resulting data is now available on Github at github.com/ivbeg/apiroi/tree/master/scripts/data/raw and on the open data hub - hubofdata.ru/dataset/roi-dump

    So, the data is collected, what's next?

    Analyzing data

    I called this a post about the API because the ultimate goal is to build one. But while we are at it, we can figure out how to make the API as convenient as possible, which data should go into it, and which additional slices can be built on top of the collected data. An API, after all, does not have to be a mere data-return endpoint; it can do much more.

    First, let's think about what we can extract from our data that would be convenient for visualization. Suppose the consumers of the API are the media and those who want to visualize the data.
    Here are some thoughts on what might be interesting:
    1. Estimate the likelihood of an initiative gaining 100 thousand votes.
    2. Measure the intensity of voting for an initiative.
    3. Identify the most "voted-for" authors.
    4. Determine the most popular topics.

    To start computing all of this, the data_process.py script was written - it is in the same Github repository - and the indicators above were calculated with it.
    The data/refined folder contains the results of these preliminary calculations in JSON.
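    For indicators 3 and 4 the computation boils down to grouping the collected records. Here is a minimal sketch of the idea; the field names author, votes and topics are assumptions for illustration, not necessarily what data_process.py actually uses:

```python
from collections import Counter

def top_authors(petitions, n=3):
    """Sum votes per author and return the n authors whose
    petitions collected the most votes in total."""
    totals = Counter()
    for p in petitions:
        totals[p["author"]] += p["votes"]
    return totals.most_common(n)

def popular_topics(petitions, n=3):
    """Count how many petitions fall under each topic tag."""
    counts = Counter()
    for p in petitions:
        for topic in p["topics"]:
            counts[topic] += 1
    return counts.most_common(n)

# Toy data just to show the shape of the computation:
sample = [
    {"author": "A", "votes": 500, "topics": ["transport"]},
    {"author": "B", "votes": 300, "topics": ["transport", "ecology"]},
    {"author": "A", "votes": 200, "topics": ["transport"]},
]
```

    Both indicators are simple aggregations, which is exactly why they are good candidates for precomputed slices served by the API rather than something every consumer recomputes from the raw dump.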

    How do we assess the likelihood of an initiative passing? Ideally we would have detailed voting statistics for the entire lifetime of the initiative, broken down by day, but the situation is not ideal: such detail is available only to the authors of the initiatives.
    For now the prediction formula is very simple. The potential number of votes can be estimated as:
    votes + (votes / (probe_date_seconds - start_date_seconds)) * (end_date_seconds - probe_date_seconds)
    • votes - the number of votes as of the date the data was retrieved
    • probe_date_seconds - the date of the data sample, in seconds
    • start_date_seconds - the publication date of the petition, in seconds
    • end_date_seconds - the deadline for collecting votes on the petition, in seconds
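    Translated into code, the estimate looks like this. The variable names follow the formula above; the function itself is a sketch, not a quote from data_process.py:

```python
def projected_votes(votes, start_date_seconds, probe_date_seconds, end_date_seconds):
    """Linear extrapolation: assume votes keep arriving at the same
    average rate as observed between publication and the probe date."""
    elapsed = probe_date_seconds - start_date_seconds
    remaining = end_date_seconds - probe_date_seconds
    rate = votes / elapsed          # average votes per second so far
    return votes + rate * remaining

# A petition halfway through its collection window with 1000 votes
# is projected to finish with twice that:
estimate = projected_votes(1000, 0, 100, 200)  # -> 2000.0
```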

    In other words, everything is computed under the assumption that people will keep voting the way they have voted so far, i.e. that the distribution of votes over time is roughly uniform. This is most likely not the case - much depends on the media activity of the initiators - but it does give a first approximation.
    The first analysis produced the picture shown in the screenshot.

    It turns out that:
    • 6 petitions will gain up to 100 thousand votes
    • 5 petitions will gain up to 50 thousand votes
    • the remaining 2492 petitions will not reach these numbers
    • and 1641 petitions will most likely not collect even 1 thousand votes

    The same picture in a different form.
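    The breakdown above is simply a bucketing of the projected totals. A sketch of how such a slice can be produced (the exact threshold boundaries are my assumption):

```python
def bucket_projections(projections):
    """Split a list of projected vote totals into the slices used above:
    reaching 100k, reaching 50k, over 1k, and under 1k votes."""
    buckets = {"100k+": 0, "50k-100k": 0, "1k-50k": 0, "<1k": 0}
    for p in projections:
        if p >= 100_000:
            buckets["100k+"] += 1
        elif p >= 50_000:
            buckets["50k-100k"] += 1
        elif p >= 1_000:
            buckets["1k-50k"] += 1
        else:
            buckets["<1k"] += 1
    return buckets

counts = bucket_projections([150_000, 60_000, 5_000, 200])
```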

    From this I conclude that it would be useful to include many additional features in the API:
    • storing the complete voting history, so that the predicted chances of a petition's success or failure can be adjusted over time
    • the ability to calculate the projected success of any petition
    • a link shortener for each petition, because working with the current links is completely inconvenient - they are not memorable
    • an RSS feed option
    • and much more

    It is a pity that the creators of ROI themselves make no effort to open ROI up in terms of an API and data.
    But now that the first step has been taken - the first data dump exists, along with example extraction scripts - anyone can build such an API. In subsequent posts I will write about this in more detail.


    Is an API needed for ROI?

    • 42.8% (141 votes) - Yes, but an official one from the creators of ROI.ru
    • 31.3% (103 votes) - Yes, any API is simply necessary
    • 20.9% (69 votes) - No, ROI is a meaningless initiative
    • 1.8% (6 votes) - No, such a service simply does not need one
    • 3% (10 votes) - Open data is enough, no API needed
