# Using data sets from the open data portal of Russia data.gov.ru

Last time I analyzed the data sets themselves: their distribution by category and file format, how completely the fields in the data set passports are filled in, and so on. This time I will try to understand how often data sets attract interest and how often they are actually used. Which data sets are of interest to portal users?

To make an assessment, we first need to decide on the criteria. The description of each data set includes the number of times it has been viewed. You do not need to be a genius to understand that if someone looked through the information about a data set, they probably did not do so entirely by accident. So the number of views will be the criterion of interest in a data set. And if a data set is not merely interesting but may also be useful, it will be downloaded. Thus, the number of downloads will be the criterion of usefulness.

*You can also imagine that the portal is a store. The items in the store are data sets. The price of an item is the amount of effort required to download the data (find where the link is) and use it (for example, view it or use it as a data source for your own purposes). Accordingly, the number of views is the number of potential buyers, and the number of downloads is the number of purchases.*

Buyers come to the store, look at the products, and evaluate them. If a buyer cannot find a product or cannot tell whether it suits them, they will leave. If a product interests the buyer, they may buy (download) it, provided the price (the effort spent on downloading and using it) is acceptable. For example, suppose a certain data set interests me and I want to download it, but it turns out to be in a format that is difficult for me to use. If another site has the same data in a more convenient form, or newer, or with a better description, then the data set will not be downloaded.

First, the simplest statistical characteristics for the number of views:

- total - 2.03 million;
- minimum - 2;
- average - 161;
- median - 61;
- maximum - 28.1 thousand
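The figures above are ordinary summary statistics. As a minimal sketch of how they could be computed, assuming the view counts are available as a plain Python list (the sample values below are made up for illustration and are not the portal's data):

```python
import statistics

def summarize(counts):
    """Return the basic statistics used in the text for a list of counts."""
    return {
        "total": sum(counts),
        "minimum": min(counts),
        "average": statistics.mean(counts),
        "median": statistics.median(counts),
        "maximum": max(counts),
    }

# Hypothetical view counts: most sets get few views, one gets very many,
# which pulls the average far above the median.
sample_views = [2, 15, 40, 61, 70, 120, 300, 28100]
stats = summarize(sample_views)
print(stats)
```

Even in this toy sample a single heavily viewed data set is enough to put the average well above the median, which is the same pattern as in the real figures.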

The maximum is very large compared with the average and the median, and the average is well above the median. Both facts clearly hint at an uneven distribution of the number of views with a “long tail”.

To see this visually, I divide the range of view counts into 1000 evenly spaced groups and average within each group, which gives a fairly smooth curve. Then I plot, against the average number of views in a group, both the sum of all views and the number of data sets.
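One possible reading of this grouping step is equal-width binning over the range of view counts. The sketch below follows that reading. The function name `bin_views` and the exact binning rule are assumptions for illustration, not necessarily the procedure used for the chart.

```python
def bin_views(views, n_bins=1000):
    """Group view counts into n_bins equal-width bins.

    Returns a list of (average_views, dataset_count, total_views) tuples,
    one per non-empty bin, ordered by increasing number of views.
    """
    lo, hi = min(views), max(views)
    width = (hi - lo) / n_bins or 1  # avoid zero width when all values match
    bins = {}
    for v in views:
        # The maximum value would fall into bin n_bins, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        count, total = bins.get(idx, (0, 0))
        bins[idx] = (count + 1, total + v)
    return [(total / count, count, total)
            for idx, (count, total) in sorted(bins.items())]

# Hypothetical view counts, grouped into 4 bins to keep the output small.
groups = bin_views([2, 3, 5, 50, 52, 100], n_bins=4)
for average, count, total in groups:
    print(f"avg={average:.1f} datasets={count} views={total}")
```

Plotting `count` and `total` against `average` gives the two curves described in the text.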

What does the chart show?

A large number of data sets have close to zero views, yet the total number of views of these sets is large. Then, from roughly 100 to 1000 views, there is a decline. From 1000 to 5000 the distribution is fairly uniform. Above 5000 there is growth.

The boundaries were chosen by eye. This is what the diagram looks like.

Two thirds of the data sets were viewed fewer than 100 times.

A third of the data sets were viewed from 100 to 1000 times.

About one percent were viewed from 1,000 to 5,000 times.

And less than one tenth of a percent of the data sets were viewed more than 5,000 times.

But if you count by the total number of views, the picture is different.

The sets that were viewed fewer than 100 times account for only 16% of all views.

Nearly two thirds, that is, the bulk of the views, fall on data sets that were viewed from 100 to 1000 times.

About 14% fall on data sets that were viewed from 1,000 to 5,000 times.

And almost 7% fall on the sets that were viewed more than 5,000 times (and these make up less than one tenth of a percent of all data sets).

But this is not exactly what is needed to evaluate the use of data sets. The data sets were published at different times, so comparing absolute values, in this case the number of views, does not make much sense. For a fair comparison, I will use a relative value: the number of views per month.
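A minimal sketch of such a normalization, assuming each data set has a known publication date and that the counts were collected on a single snapshot date. The function name, the dates, and the average month length of 30.44 days are illustrative assumptions, not details taken from the portal.

```python
from datetime import date

DAYS_PER_MONTH = 30.44  # assumption: average month length in days

def views_per_month(views, published, snapshot):
    """Normalize a view count by the age of the data set in months."""
    # Treat very fresh data sets as at least one day old to avoid dividing by zero.
    age_days = max((snapshot - published).days, 1)
    return views / (age_days / DAYS_PER_MONTH)

# Hypothetical data set: 120 views over exactly one (non-leap) year.
rate = views_per_month(120, date(2018, 1, 1), date(2019, 1, 1))
print(round(rate, 2))
```

The same function works unchanged for downloads per month.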

Statistical characteristics for the number of views of data sets per month:

- minimum - 0.184;
- average - 8.49;
- median - 5.33;
- maximum - 1.76 thousand

In fact, the picture for views per month resembles that for total views: an uneven distribution with a long tail.

I will conventionally divide all data sets by their average viewing frequency as follows:

- less than once a month;
- from once a month to once a week;
- from once a week to once a day;
- from once a day to once an hour;
- more than once per hour.
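Expressed in views per month, these categories become simple numeric thresholds. The sketch below assumes an average month of 30.44 days and treats each lower bound as inclusive. Both choices, as well as the function names, are assumptions made for illustration.

```python
DAYS_PER_MONTH = 30.44  # assumption: average month length in days

# (label, inclusive lower bound in views per month)
CATEGORIES = [
    ("less than once a month", 0.0),
    ("from once a month to once a week", 1.0),
    ("from once a week to once a day", DAYS_PER_MONTH / 7),
    ("from once a day to once an hour", DAYS_PER_MONTH),
    ("more than once per hour", DAYS_PER_MONTH * 24),
]

def categorize(rate):
    """Return the label of the highest category whose lower bound is reached."""
    label = CATEGORIES[0][0]
    for name, lower in CATEGORIES:
        if rate >= lower:
            label = name
    return label

def shares(rates):
    """Return {label: (percent of data sets, percent of views)}."""
    totals = {name: [0, 0.0] for name, _ in CATEGORIES}
    for rate in rates:
        entry = totals[categorize(rate)]
        entry[0] += 1
        entry[1] += rate
    all_views = sum(rates)
    return {
        name: (100 * count / len(rates),
               100 * views / all_views if all_views else 0.0)
        for name, (count, views) in totals.items()
    }

# The minimum, median and maximum quoted above fall into these categories:
for rate in (0.184, 5.33, 1760):
    print(rate, "->", categorize(rate))
```

Calling `shares` on the full list of monthly rates would give both percentages discussed in the text: the share of data sets in each category and the share of views they account for.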

Data sets that are viewed less than once a month are apparently something nobody needs. They make up about 6% of all data sets and, logically, account for only 0.2% of the total number of views.

A third of the data sets are viewed from once a month to once a week. They account for about 6% of the total number of views. It seems that someone looks at them occasionally.

Slightly more than half of the data sets are viewed from once a week to once a day. They account for almost half of the total number of views. Not too often, but people do look.

The data sets that are viewed more than once a day make up only 2.5% of the total, yet they account for more than one third of all views. These are the ones that are of real interest.

But the greatest interest is attracted by the data sets that are viewed more than once per hour. They make up only 0.03% of the total, yet they account for almost 4% of all views.

Thus, only about 3% of all data sets can really be considered interesting. A third are of no interest. And a little more than half may occasionally interest someone.

*There are many products in the store. But more than a third of them are of little interest to buyers. More than half of the products do not interest buyers much, but the interest in them is stable. And only 3% of the products attract real interest.*

But this is only half the battle.

*Even if a buyer has entered the store and become interested in a product, will they buy it?*

If a data set was downloaded, it means that someone needed it (and possibly even found it very useful). Thus, as mentioned above, I will judge the usefulness of a data set by the number of downloads.

First, as usual, some statistics:

- total - 63.2 thousand;
- minimum - 0;
- average - 5.01;
- median - 1;
- maximum - 2.33 thousand

What does this mean? Uneven distribution? A long tail?

No. It seems to me that when the median is equal to one, we can expect an interesting result.

It seems that most of the data sets are not downloaded by anyone at all.

I conventionally divided the number of downloads as follows:

- 0 - never;
- 1 time;
- 2 times;
- less than 10;
- from 10 to 100;
- from 100 to 1000;
- more than 1000.
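These groups are easy to express as a small classification function. In the sketch below the handling of the boundary values (exactly 10, 100 and 1000 downloads) is an assumption, since the list does not say which group they belong to. The function names and sample counts are also illustrative.

```python
from collections import Counter

def download_bucket(n):
    """Map a download count to one of the groups listed above."""
    if n == 0:
        return "never"
    if n == 1:
        return "1 time"
    if n == 2:
        return "2 times"
    if n < 10:
        return "less than 10"
    if n < 100:
        return "from 10 to 100"
    if n <= 1000:  # assumption: exactly 1000 still counts as "from 100 to 1000"
        return "from 100 to 1000"
    return "more than 1000"

def bucket_shares(downloads):
    """Return {group: percent of data sets} for a list of download counts."""
    counts = Counter(download_bucket(n) for n in downloads)
    return {group: 100 * c / len(downloads) for group, c in counts.items()}

# Hypothetical download counts for eight data sets.
sample = [0, 0, 0, 0, 1, 2, 5, 2330]
print(bucket_shares(sample))
```

For the sample above, half of the data sets land in the "never" group.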

Let's look at the diagram.

And what do we see?

Half of the data sets have never been downloaded. Not even to check that the download works. Not even by accident. NEVER!

16% of the data sets were downloaded only once, perhaps by accident or to check that they exist. They account for about 3% of the total number of downloads.

7% of the data sets were downloaded twice, and they account for about 3% of the total number of downloads. Two downloads is also a doubtful result.

Nearly 17% of the data sets were downloaded more than twice but fewer than 10 times, and they account for 17% of the total number of downloads.

Put together, this means that 90% of the data sets are of no interest at all or of practically no interest.

About 9% of the data sets were downloaded from 10 to 100 times, and their share of all downloads is about 40%.

0.5% of the data sets were downloaded from 100 to 1000 times, but they account for a quarter of all downloads.

Only 0.02% of the data sets were downloaded more than 1000 times, and they account for about 8% of all downloads.

As a result, half of the data sets were never needed by anyone. About 10% of the data sets are in steady demand. And less than 1% of the data sets provide real value.

*Half of the products in the store are never bought at all. A third of the products are bought very rarely. 10% of the products are in steady demand. And less than 1% of the products are really sought after by buyers.*

But, as with the number of views, it is more correct to consider relative values rather than absolute ones.

By analogy, instead of the number of downloads I will use the number of downloads per month.

Statistics briefly:

- minimum - 0;
- average - 0.276;
- median - 0.02;
- maximum - 145.

Logically enough, the picture is the same as before.

It is clear that half of the data sets are never downloaded, so the graph does not look very nice.

The diagram is more informative.

The same half of the data sets is never downloaded (the small difference in shares is apparently due to rounding). This fact is already known.

Almost half of the data sets (45%) are downloaded less than once a month, and they account for 42% of the total number of downloads.

About 4% of the data sets are downloaded from once a month to once a week, but they account for almost a quarter of all downloads.

About 0.8% of the data sets are downloaded from once a week to once a day, but they account for almost 23% of the total number of downloads.

And, finally, only 0.05% of the data sets are downloaded from once a day to once per hour, but they account for almost 11% of all downloads.

If, for example, we assume that the portal is a store, the number of views is the number of visitors to the store, and the number of downloads is the number of purchases, then we can calculate the conversion:

**Conversion rate**

The conversion rate is the percentage of visitors to a store, website, or marketing event who made a choice or a purchase, relative to the total number of visitors.

Conversion in sales is the ratio of buyers (of a store or company) to the total number of visitors (customers who made contact).

Conversion in advertising is the ratio of the number of ad impressions to the number of inquiries to the advertiser.

Conversion in internet marketing is the ratio of site visitors who performed the desired action (clicked a link, voted, bought) to the total number of site visitors.

The conversion rate is usually expressed as a percentage. For online stores, the conversion rate (that is, the share of site visitors who make a purchase) averages 2-5%. For example, if the goal of the site is to sell books, and in one day you had 500 visitors and sold 35 books, then the conversion is 35 * 100 / 500 = 7%.

The conversion rate shows how well the marketing efforts to attract visitors and buyers, along with the efforts to fill the site with information and the store with goods, accomplish the main task: generating sales.

Successful conversion is interpreted differently by sellers, advertisers, and content providers. For a seller, a successful conversion is a purchase. For a content provider, it may mean a visitor registering on the site, on a forum, or for a marketing event, subscribing to a mailing list, downloading software, or taking some other action expected of visitors.

The concept of a conversion rate applies not only to electronic media and online conversion, but to any case where attracting customers is not the end goal and what matters more is the benefit gained from those customers, as the end result of a multi-stage (attract, interest, sell) marketing task.

K = N / N0 * 100%, where

- K is the conversion rate;
- N is the number of real buyers (customers who bought goods or used the service);
- N0 is the number of visitors to the store or site.

For the open data portal, the conversion rate comes out at about 3%. Whether that is a lot or a little, everyone can decide for themselves.
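The figure of about 3% can be checked from the totals quoted earlier: 63.2 thousand downloads against 2.03 million views. A minimal sketch of the calculation, with downloads standing in for N and views for N0 (the function name is illustrative):

```python
def conversion_rate(buyers, visitors):
    """K = N / N0 * 100%, with N = buyers and N0 = visitors."""
    if visitors == 0:
        raise ValueError("conversion is undefined without visitors")
    return buyers * 100 / visitors

# Totals from the statistics above: downloads as purchases, views as visitors.
portal_k = conversion_rate(63_200, 2_030_000)
print(f"portal conversion: {portal_k:.1f}%")

# The book store example from the definition: 35 sales for 500 visitors.
books_k = conversion_rate(35, 500)
print(f"book store conversion: {books_k:.0f}%")
```

Note that this treats every view as a separate visitor and every download as a separate purchase, which is the simplification made by the store analogy itself.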

### Findings

Only about 3% of the data sets are really interesting to anyone. At the same time, almost half are viewed from once a week to once a day.

Half of the data sets were never downloaded by anyone.

And less than 1% of the data sets are truly in demand.

### What's next?

Next, we will look at how to evaluate the data sets and check whether the links to them work. We will see how often the data sets are updated and how large their files are, and whether there is a relationship between the file format of a data set and the number of downloads.

**PS** As an illustration, I have published several analytic panels. Resources are limited, so there may be errors when loading them.

Write your feedback in the comments.
