
Building a Simple Pandas + Vincent Mapping
Good afternoon, dear readers.
A previous article introduced data visualization with Pandas and matplotlib. Today I would like to show another way to display analysis results, using Vincent, which also integrates easily with Pandas, although it takes a few more steps than matplotlib.
Introduction
Vincent is a module that translates Python data for the JavaScript visualization libraries D3.js and Vega, which in turn offer rich possibilities for interactive data visualization.
That is, we can perform the analysis in Python and render the charts in JavaScript. This is convenient, for example, for visualizing geographic data and plotting it on a map. In addition, vincent integrates with IPython Notebook and, like matplotlib, can display charts directly in it.
To demonstrate the capabilities of this module, I propose two tasks:
- Show the dynamics of per capita income for the Central and Volga Federal Districts
- Show on a map of the Russian Federation the distribution of per capita income across its federal subjects for 2010
The source data are statistics from the Rosstat website.
Data analysis
To get started, let's load the data and see whether additional processing is needed.
import pandas as pd
import vincent
stat = pd.read_html('Data/AVGPeopleProfit.htm', header=0, index_col=0)[0]
This time we load the data with the read_html() function (available in pandas since version 0.12). It returns a list of DataFrames, one per table found on the page, so we take the first element. In our case, three arguments are passed:
- the path to the HTML page
- header - the number of the row containing the column names
- index_col - the number of the column to use as the index
After loading, we got a table of the following form:
| | 1990.0 | 2000.0 | 2001.0 | 2002.0 | 2003.0 | 2004.0 | 2005.0 | 2006.0 | 2007.0 | 2008.0 | 2009.0 | 2010.0 | 2011.0 | nan |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Russian Federation | NaN | 2281 | 3062 | 3947 | 5167 | 6399 | 8088 | 10155 | 12540 | 14864 | 16895 | 18951 | 20755 | NaN |
Central Federal District | NaN | 3231 | 4300 | 5436 | 7189 | 8900 | 10902 | 13570 | 16631 | 18590 | 21931 | 24645 | 27091 | 1 |
Belgorod region | NaN | 1555 | 2121 | 2762 | 3357 | 4069 | 5276 | 7083 | 9399 | 12749 | 14147 | 16993 | 18800 | 24 |
Bryansk region | NaN | 1312 | 1818 | 2452 | 3136 | 3725 | 4788 | 6171 | 7626 | 10083 | 11484 | 13358 | 15348 | 52 |
Vladimir region | NaN | 1280 | 1666 | 2158 | 2837 | 3363 | 4107 | 5627 | 7015 | 9480 | 10827 | 12956 | 14312 | 64 |
As you can see, the table needs a little processing: it has one column without a name and one column of empty values. Let's select the columns we need (the 2nd through the 13th) and place them in a new DataFrame.
stat = stat[stat.columns[1:13]]
Now we have a dataset suitable for work. The column names are not pretty, but they will not get in the way of our tasks.
| | 2000.0 | 2001.0 | 2002.0 | 2003.0 | 2004.0 | 2005.0 | 2006.0 | 2007.0 | 2008.0 | 2009.0 | 2010.0 | 2011.0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Russian Federation | 2281 | 3062 | 3947 | 5167 | 6399 | 8088 | 10155 | 12540 | 14864 | 16895 | 18951 | 20755 |
Central Federal District | 3231 | 4300 | 5436 | 7189 | 8900 | 10902 | 13570 | 16631 | 18590 | 21931 | 24645 | 27091 |
Belgorod region | 1555 | 2121 | 2762 | 3357 | 4069 | 5276 | 7083 | 9399 | 12749 | 14147 | 16993 | 18800 |
Bryansk region | 1312 | 1818 | 2452 | 3136 | 3725 | 4788 | 6171 | 7626 | 10083 | 11484 | 13358 | 15348 |
Vladimir region | 1280 | 1666 | 2158 | 2837 | 3363 | 4107 | 5627 | 7015 | 9480 | 10827 | 12956 | 14312 |
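The positional column selection above can be tried on a toy frame. This is a minimal sketch: the frame `raw`, its numbers, and the label `Unnamed: 14` are made up to mimic the parsed page, not taken from the real file.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the parsed page: an empty first column,
# three year columns, and an unnamed trailing column (all illustrative).
raw = pd.DataFrame(
    [[np.nan, 2281, 3062, 3947, np.nan],
     [np.nan, 3231, 4300, 5436, 1.0]],
    index=["Russian Federation", "Central Federal District"],
    columns=[1990.0, 2000.0, 2001.0, 2002.0, "Unnamed: 14"],
)

# Positional slice of the column labels, as in the article:
# skip column 0 and stop before the trailing column.
by_position = raw[raw.columns[1:4]]

# The same selection written with iloc (all rows, column positions 1..3).
by_iloc = raw.iloc[:, 1:4]

print(by_position)
```

Both forms select by position rather than by label, so they keep working even when the labels are awkward floats.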
So, let's get down to the first task: visualizing the data for two districts. To get the data for the chart, we select the federal districts of interest (Central and Volga) and then transpose the resulting table. You can do it this way:
fo = [u'Приволжский федеральный округ',u'Центральный федеральный округ']
fostat = stat[stat.index.isin(fo)].transpose()
In the code above, we first filter the data set down to the districts we need using the isin() function, which checks whether each index value is contained in the given list (an analogue of the IN operator in SQL). Then we use the transpose() function to transpose the result and store it in a new DataFrame.
| | Central Federal District | Volga Federal District |
---|---|---|
2000 | 3231 | 1726 |
2001 | 4300 | 2319 |
2002 | 5436 | 3035 |
2003 | 7189 | 3917 |
2004 | 8900 | 4787 |
2005 | 10902 | 6229 |
2006 | 13570 | 8014 |
2007 | 16631 | 9959 |
2008 | 18590 | 12392 |
2009 | 21931 | 13962 |
2010 | 24645 | 15840 |
2011 | 27091 | 17282 |
As you can see, the index labels are now the years as plain numbers. This is not very convenient, so let's convert the index to dates:
fostat.set_index(pd.date_range('2000-01-01', '2011-01-01', freq='AS'), inplace=True)
The set_index() function sets a new index on a DataFrame. In our case, two parameters are passed to it:
- the new index values, here one date per year from 2000 to 2011 (a column name can also be passed)
- inplace=True - the index is replaced in the current DataFrame; with False, a new DataFrame is returned and the original is left unchanged
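The filter, transpose, and re-index steps can be followed end to end on a small frame. This is a sketch under assumptions: the frame below is a three-row, three-year excerpt with English labels, and it uses the `"YS"` frequency alias, which recent pandas versions prefer over `"AS"`.

```python
import pandas as pd

# Toy version of `stat`: regions in the index, years as float column labels.
stat = pd.DataFrame(
    {2009.0: [21931, 13962, 14147],
     2010.0: [24645, 15840, 16993],
     2011.0: [27091, 17282, 18800]},
    index=["Central Federal District", "Volga Federal District", "Belgorod region"],
)

# Keep only the two districts, then swap rows and columns.
districts = ["Central Federal District", "Volga Federal District"]
fostat = stat[stat.index.isin(districts)].transpose()

# One timestamp per year (1 January), as many as there are rows.
years = pd.date_range("2009-01-01", periods=len(fostat), freq="YS")

# Without inplace=True, set_index returns a new DataFrame.
fostat = fostat.set_index(years)

print(fostat)
```

Deriving `periods` from `len(fostat)` keeps the new index the same length as the table, which avoids a length-mismatch error if the set of years changes.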
Now our data is ready for plotting. If you are working in IPython Notebook and want to see the result right away, you need to call the initialize_notebook() function to enable the integration. It looks like this:
vincent.core.initialize_notebook()
Next we create an object corresponding to the chart type (the full list of objects can be found in the documentation). In our case it is a line chart. The code is as follows:
line = vincent.Line(fostat)  # create the chart object
line.axis_titles(x=u'Год', y=u'тыс. руб')  # set the axis titles ("Year", "thousand rubles")
line.legend(title=u'ЦФО vs ПФО')  # show the legend and set its title (Central vs Volga Federal District)
You can display the chart using the display() function:
line.display()
As a result, we will see the following:

Mapping
That completes the first task. Now let's move on to the second. To solve it, we need a TopoJSON file with a map of the Russian Federation, as well as a directory of regions. Details on what these are and how to obtain them can be read here. To get started, let's load the directory of regions using read_csv, described in a previous article:
spr = pd.read_csv('Data/russia-region-names.tsv', sep='\t', index_col=0, header=None, names=['name', 'code'], encoding='utf-8')
As you can see, several additional parameters appeared here:
- index_col - sets the number of the column to be used as the index
- header - in our case means that we do not use lines from the file to define headers
- names - gets a list whose elements will be column names
- encoding - sets the encoding in which the file is stored
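To see how these parameters interact, here is a minimal sketch that reads two made-up rows from an in-memory string instead of the real file. The `encoding` argument is omitted because the text is already decoded.

```python
import io
import pandas as pd

# Two illustrative rows in the same layout: region name, tab, region code.
tsv_text = "Belgorod region\tRU-BEL\nBryansk region\tRU-BRY\n"

spr = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",                 # tab-separated values
    header=None,              # the data has no header row
    names=["name", "code"],   # so the column names are supplied by hand
    index_col=0,              # the first column ("name") becomes the index
)

print(spr)
print(spr.loc["Belgorod region", "code"])
```

Because `name` becomes the index, only `code` remains as a regular column, which is what the later join on region names relies on.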
If we look closely at our stat dataset, we can see that some index entries contain footnote markers such as '1)' and '2)', which read_html() parsed as ordinary characters and appended to the end of the corresponding index labels. In addition, city names carry the prefix 'г. ' (short for "city"), which is absent from the directory. Because of these small differences, joining the statistics with the directory to pull in the region codes would leave some regions without a code.
This can be fixed as follows:
new_index = stat.index.to_series()
new_index = new_index.str.replace(r'(2\))|(1\))|(г\. )', '', regex=True)
The first line copies the index into a separate new Series. The second line replaces every match of the regular expression with an empty string.
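What the pattern removes is easiest to see on plain strings. The sketch below uses only the standard `re` module. The sample names are made-up inputs for illustration, and `clean_region_name` is a hypothetical helper, not part of pandas or vincent.

```python
import re

# Same three alternatives as in the article, with the dot escaped
# so that it matches a literal "." only.
FOOTNOTE_OR_PREFIX = re.compile(r"(2\))|(1\))|(г\. )")


def clean_region_name(name):
    """Strip footnote markers and the city prefix from a region name."""
    return FOOTNOTE_OR_PREFIX.sub("", name)


samples = [
    "г. Москва",
    "Архангельская область1)",
    "Тюменская область2)",
    "Белгородская область",
]
cleaned = [clean_region_name(s) for s in samples]
print(cleaned)
```

Names without a footnote or prefix pass through unchanged, so the cleanup is safe to apply to the whole index.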
Now we need to replace the index values with the values from the new set. As shown above, this can be done as follows:
stat.set_index(new_index, inplace=True)
Now we can combine our data set with a directory to get the region codes:
RegionProfit = stat.join(spr, how='inner')
Our data after all the manipulations look like this:
| | 2000.0 | 2001.0 | 2002.0 | 2003.0 | 2004.0 | 2005.0 | 2006.0 | 2007.0 | 2008.0 | 2009.0 | 2010.0 | 2011.0 | code |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Belgorod region | 1555 | 2121 | 2762 | 3357 | 4069 | 5276 | 7083 | 9399 | 12749 | 14147 | 16993 | 18800 | RU-BEL |
Bryansk region | 1312 | 1818 | 2452 | 3136 | 3725 | 4788 | 6171 | 7626 | 10083 | 11484 | 13358 | 15348 | RU-BRY |
Vladimir region | 1280 | 1666 | 2158 | 2837 | 3363 | 4107 | 5627 | 7015 | 9480 | 10827 | 12956 | 14312 | RU-VLA |
Voronezh region | 1486 | 2040 | 2597 | 3381 | 4104 | 5398 | 6862 | 8307 | 10587 | 11999 | 13883 | 15871 | RU-VOR |
Ivanovo region | 1038 | 1298 | 1778 | 2292 | 2855 | 3480 | 4457 | 5684 | 8343 | 9351 | 11124 | 13006 | RU-IVA |
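The effect of `how='inner'` in the join is worth a quick check: rows without a match in the directory disappear silently. This is a minimal sketch on made-up frames. Comparing against a left join is one way to list what was dropped.

```python
import pandas as pd

# Illustrative statistics: two regions plus an aggregate row with no code.
stat = pd.DataFrame(
    {2011.0: [18800, 15348, 20755]},
    index=["Belgorod region", "Bryansk region", "Russian Federation"],
)

# Illustrative directory: codes keyed by region name.
spr = pd.DataFrame(
    {"code": ["RU-BEL", "RU-BRY", "RU-VLA"]},
    index=["Belgorod region", "Bryansk region", "Vladimir region"],
)

# Inner join keeps only the names present in both indexes.
inner = stat.join(spr, how="inner")

# Left join keeps every row of `stat`; unmatched rows get NaN in "code".
left = stat.join(spr, how="left")
missing = left[left["code"].isna()].index.tolist()

print(inner)
print("rows without a code:", missing)
```

Here the aggregate row is dropped on purpose, but the same check would also reveal region names that still differ from the directory after the cleanup.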
Now let's move on to building the map itself and plotting the data on it. To get started, we need to create a dictionary describing our map:
geo_data = [{'name': 'rus',  # map name
             'url': 'RusMap/russia.json',  # path to the TopoJSON file with the map
             'feature': 'russia'}]  # name of the object inside the map file
Now let's create the map object and bind our data to it. You can do this with the Map() function:
vis = vincent.Map(data=RegionProfit, geo_data=geo_data,scale=700, projection='conicEqualArea', rotate = [-105,0], center = [-10, 65], data_bind=2011, data_key='code', map_key={'rus': 'properties.region'})
The function takes the following arguments:
- data - the data set
- geo_data - the object describing our map
- projection - the projection in which the map will be displayed
- rotate, center, scale - projection parameters
- data_bind - the column with the data to be displayed
- data_key - the field with the code by which the map and the data are linked
- map_key - a dictionary of the form {'name of the map object': 'name of the property used for binding'}
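The roles of data_key, map_key, and data_bind can be illustrated with plain dictionaries. This is a conceptual sketch only: the feature layout, the code `RU-XXX`, and the helpers `get_path` and `bind` are hypothetical and do not reproduce how vincent or Vega perform the lookup internally.

```python
# Hypothetical, simplified stand-ins for map features and data rows.
features = [
    {"properties": {"region": "RU-BEL"}},
    {"properties": {"region": "RU-BRY"}},
    {"properties": {"region": "RU-XXX"}},  # no matching data row
]
rows = [
    {"code": "RU-BEL", 2011: 18800},
    {"code": "RU-BRY", 2011: 15348},
]


def get_path(obj, dotted):
    """Follow a dotted path such as 'properties.region' through nested dicts."""
    for part in dotted.split("."):
        obj = obj[part]
    return obj


def bind(features, rows, data_key, map_key, data_bind):
    """Pair each feature's key with the bound value, or None if unmatched."""
    lookup = {row[data_key]: row[data_bind] for row in rows}
    result = []
    for feature in features:
        key = get_path(feature, map_key)
        result.append((key, lookup.get(key)))
    return result


bound = bind(features, rows, data_key="code",
             map_key="properties.region", data_bind=2011)
print(bound)
```

A feature whose key has no counterpart in the data ends up with no value, which on a real map typically shows as an uncolored region.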
But here a surprise awaits us: in the stock version of vincent, the rotate parameter can only be an integer. To display our map correctly, this parameter must be able to take a list. To fix this, open the file %PYTHON_PATH%\lib\site-packages\vincent\transforms.py and replace the piece of code that checks the parameter type:
@grammar(int)
def rotate(value):
    """The rotation of the projection"""
    if value < 0:
        raise ValueError('The rotation cannot be negative.')
with:
@grammar(list)
def rotate(value):
    if len(value) != 2:
        raise ValueError('len(center) must = 2')
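The check in the patched validator can be tried outside vincent. The sketch below is plain Python and does not use vincent's `@grammar` decorator. The function names `validate_rotate` and `is_valid_rotate` are hypothetical, and the extra check that each angle is a number is an addition, not part of the patch above.

```python
from numbers import Real


def validate_rotate(value):
    """Accept a two-element list or tuple of numbers; raise otherwise."""
    if not isinstance(value, (list, tuple)):
        raise TypeError("rotate must be a list or tuple")
    if len(value) != 2:
        raise ValueError("rotate must have exactly 2 elements")
    for angle in value:
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(angle, bool) or not isinstance(angle, Real):
            raise TypeError("rotate angles must be numbers")
    return list(value)


def is_valid_rotate(value):
    """Boolean wrapper around validate_rotate for quick checks."""
    try:
        validate_rotate(value)
    except (TypeError, ValueError):
        return False
    return True


print(is_valid_rotate([-105, 0]))  # the value used for the map above
print(is_valid_rotate(-105))       # a bare integer is no longer enough
```

Note that negative angles are accepted here, since the map above relies on a rotation of -105 degrees.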
Now the object is created correctly, and what remains is to tune it. To begin, let's make the borders between the map objects less prominent. This is done through marks, the basic building blocks of a visualization. You can read more about them in the Vega documentation. In our case, the code looks like this:
vis.marks[0].properties.enter.stroke_opacity = vincent.ValueRef(value=0.5)
Now let's color the regions differently depending on the group their value falls into. This is done with Scales objects, which translate data values (numbers, strings, dates, and so on) into display values (pixels, colors, sizes). The code is below:
vis.scales['color'].type = 'threshold'  # set the scale type
vis.scales['color'].domain = [10000, 15000, 20000, 25000, 30000]  # data values that separate the groups
vis.legend(title=u'Доходы руб.')  # add the map legend ("Income, rubles")
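How a threshold scale groups values can be sketched with the standard `bisect` module. This follows the usual definition of a threshold scale, where n thresholds produce n + 1 output buckets and a value equal to a threshold falls into the upper bucket. That vincent and Vega treat boundary values exactly this way is an assumption here, and the bucket labels are made up in place of real colors.

```python
from bisect import bisect_right

# The thresholds used for the map above.
domain = [10000, 15000, 20000, 25000, 30000]

# Six hypothetical bucket labels; a real scale would hold colors here.
buckets = ["<10k", "10-15k", "15-20k", "20-25k", "25-30k", ">=30k"]


def threshold_scale(value, domain, output_range):
    """Map a value to its bucket; n thresholds need n + 1 outputs."""
    if len(output_range) != len(domain) + 1:
        raise ValueError("output_range must have len(domain) + 1 items")
    return output_range[bisect_right(domain, value)]


for income in (9351, 15000, 18800, 27091):
    print(income, threshold_scale(income, domain, buckets))
```

With these thresholds, an income of 18800 lands in the third bucket and 27091 in the fifth.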
Now that the map is configured, we can look at the result. As noted above, you can use the display() function for this, but for some reason it did not work for me, so I first exported the map to a JSON file using the to_json() function:
vis.to_json('example_map.json', html_out=True, html_path='example_map.html')
Three parameters are passed to it:
- the name of the output JSON file
- html_out - indicates that an HTML wrapper file should also be created
- html_path - sets the path to the HTML file
To view the HTML file, you need a simple HTTP server, such as the one included with Python. Start it from the command line with the command below (on Python 3 the equivalent is python -m http.server 8000):
python -m SimpleHTTPServer 8000
As a result, our map will look like this:

Conclusion
Today I tried to show another way to visualize data with pandas. I would also like to note that the module is relatively young and under active development. Among its shortcomings, not all objects are displayed when output directly to IPython, and there is no way to export just an image rather than a JSON file, even though such tools exist for Vega.