kuznetsovin October 25, 2013 at 10:22

Building a Simple Map with Pandas + Vincent

Good afternoon, dear readers.
In a previous article I gave an introduction to data visualization with Pandas and matplotlib. Today I would like to show another way to display analysis results, using Vincent, which also integrates easily with Pandas, although it takes a little more work than matplotlib.

Introduction

Vincent is a module that translates Python data for the JavaScript visualization libraries D3.js and Vega, which in turn offer rich possibilities for interactive data visualization.
That is, we can perform the analysis in Python and render the charts for the results in JavaScript. This can be convenient, for example, for visualizing geographic data and plotting it on a map. In addition, vincent integrates with IPython Notebook and, like matplotlib, can display charts directly in it.
To demonstrate the capabilities of this module, I propose two tasks:

Show the dynamics of per capita income for the Central and Volga Federal Districts
Show on a map of the Russian Federation the distribution of per capita income across its constituent regions for 2010

As source data, we take statistics from the Rosstat website.

Data analysis

To get started, let's load the data and see whether additional processing is needed.

import pandas as pd
import vincent
stat = pd.read_html('Data/AVGPeopleProfit.htm', header=0, index_col=0)[0]

So, to load the data this time we use the read_html() function (available in pandas since version 0.12). In our case it receives three arguments:

The path to the HTML page
The number of the row containing the column names
The number of the column to be used as the index

After loading, we got a table of the following form:

	1990.0	2000.0	2001.0	2002.0	2003.0	2004.0	2005.0	2006.0	2007.0	2008.0	2009.0	2010.0	2011.0	nan
Russian Federation	NaN	2281	3062	3947	5167	6399	8088	10155	12540	14864	16895	18951	20755	NaN
Central Federal District	NaN	3231	4300	5436	7189	8900	10902	13570	16631	18590	21931	24645	27091	1
Belgorod region	NaN	1555	2121	2762	3357	4069	5276	7083	9399	12749	14147	16993	18800	24
Bryansk region	NaN	1312	1818	2452	3136	3725	4788	6171	7626	10083	11484	13358	15348	52
Vladimir region	NaN	1280	1666	2158	2837	3363	4107	5627	7015	9480	10827	12956	14312	64

As you can see, the table needs a little processing: there is one unnamed column and one column with empty values. So let's select the columns we need (the 2nd through the 13th) and place them in a new DataFrame.

stat = stat[stat.columns[1:13]]

Now we have a dataset suitable for work. The column names are not pretty, of course, but they will not prevent us from solving our tasks.

	2000.0	2001.0	2002.0	2003.0	2004.0	2005.0	2006.0	2007.0	2008.0	2009.0	2010.0	2011.0
Russian Federation	2281	3062	3947	5167	6399	8088	10155	12540	14864	16895	18951	20755
Central Federal District	3231	4300	5436	7189	8900	10902	13570	16631	18590	21931	24645	27091
Belgorod region	1555	2121	2762	3357	4069	5276	7083	9399	12749	14147	16993	18800
Bryansk region	1312	1818	2452	3136	3725	4788	6171	7626	10083	11484	13358	15348
Vladimir region	1280	1666	2158	2837	3363	4107	5627	7015	9480	10827	12956	14312
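
As a minimal sketch of the positional selection used above, here is the same slicing on a toy frame (the region name, the 'extra' column label and the numbers are made up for illustration). Slicing stat.columns by position picks the columns without having to type their awkward float labels:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the parsed table: an all-NaN first column,
# three year columns, and a trailing column we do not need.
toy = pd.DataFrame(
    [[np.nan, 100, 110, 120, 1]],
    index=['Region A'],
    columns=[1990.0, 2000.0, 2001.0, 2002.0, 'extra'],
)

# Keep the columns at positions 1..3 (the slice end is exclusive).
years_only = toy[toy.columns[1:4]]
print(list(years_only.columns))

# The same selection written purely by position.
same_by_position = toy.iloc[:, 1:4]
```

Both forms select by position; the iloc variant simply skips the intermediate lookup by label.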

So, let's get down to the first task: visualizing the data for the two districts. To get the data for the chart, we need to select the districts of interest (Central and Volga) and then transpose the resulting table. This can be done as follows:

fo = [u'Приволжский федеральный округ',u'Центральный федеральный округ']
fostat = stat[stat.index.isin(fo)].transpose()

In the code above, we first filter our dataset down to the districts we need using the isin() function, which checks whether each value is contained in a given list (analogous to the IN operator in SQL). Then we use the transpose() function to transpose the result and write it to a new DataFrame.
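
The filter-then-transpose step can be tried on a toy frame. This is only a sketch: the frame below holds three rows and two years copied from the tables in this article, with English labels standing in for the Russian ones in the real data.

```python
import pandas as pd

toy = pd.DataFrame(
    {2000: [3231, 1555, 1726], 2001: [4300, 2121, 2319]},
    index=['Central Federal District', 'Belgorod region', 'Volga Federal District'],
)
wanted = ['Volga Federal District', 'Central Federal District']

mask = toy.index.isin(wanted)      # boolean array, one flag per row
districts = toy[mask].transpose()  # years become rows, districts become columns
print(districts)
```

Note that the rows keep the order they had in the original frame; the order of the list passed to isin() does not matter. Applied to the real stat data, the same two steps yield the table that follows.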

	Central Federal District	Volga Federal District
2000	3231	1726
2001	4300	2319
2002	5436	3035
2003	7189	3917
2004	8900	4787
2005	10902	6229
2006	13570	8014
2007	16631	9959
2008	18590	12392
2009	21931	13962
2010	24645	15840
2011	27091	17282

As you can see, the index labels in the table are now the year numbers in numeric format. This is not very convenient, so let's convert the index to dates:

fostat.set_index(pd.date_range('2000-01-01', '2011-01-01', freq='AS'), inplace=True)  # 12 year-start dates, one per row (2000 to 2011)

The set_index() function sets a new index on a DataFrame. In our case it receives two arguments:

The list of new index values (a column name may be passed instead)
inplace=True, which means the index is replaced in the current DataFrame; with False, the original is left unchanged and a new DataFrame is returned
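
The difference between the two inplace modes can be seen on a toy frame. The sketch below also shows an alternative way to obtain the dates, deriving them from the existing year labels instead of spelling out a range; this is an assumption about what may be convenient here, not the method used in this article.

```python
import pandas as pd

toy = pd.DataFrame({'Central': [3231, 4300, 5436]},
                   index=[2000.0, 2001.0, 2002.0])

# Build the dates from the existing labels: 2000.0 -> '2000' -> 2000-01-01.
dates = pd.to_datetime(toy.index.astype(int).astype(str), format='%Y')

# Default (inplace=False): a new frame is returned, toy keeps its labels.
dated = toy.set_index(dates)
labels_after_copy = list(toy.index)

# inplace=True: toy itself is modified and nothing is returned.
toy.set_index(dates, inplace=True)
print(toy.index)
```

Deriving the dates from the labels guarantees that the new index has exactly as many entries as the frame has rows.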

Now our data is completely ready for plotting. If you are working in IPython Notebook and want to see the result right away, you need to call the initialize_notebook() function to enable the integration. It looks like this:

vincent.core.initialize_notebook()

Now we need to create an object corresponding to the chart type (the full list of objects can be found in the documentation). In our case it will be a line chart. The code is as follows:

line = vincent.Line(fostat)  # create the chart object
line.axis_titles(x=u'Год', y=u'тыс. руб')  # set the axis titles ('Year', 'thousand rubles')
line.legend(title=u'ЦФО vs ПФО')  # show the legend and set its title ('Central FD vs Volga FD')

You can display the chart using the display() function:

line.display()

As a result, we will see the following:

Mapping

Well, that takes care of the first task. Now let's move on to the second. To solve it, we need a TopoJSON file with a map of the Russian Federation, as well as a directory of regions. Details on what these are and how to obtain them can be found here. To get started, let's load the directory of regions using read_csv, described in a previous article:

spr = pd.read_csv('Data/russia-region-names.tsv', sep='\t', index_col=0, header=None, names=['name', 'code'], encoding='utf-8')

As you can see, several additional parameters appeared here:

index_col - sets the number of the column to be used as the index
header - in our case means that we do not use lines from the file to define headers
names - gets a list whose elements will be column names
encoding - sets the encoding in which the file is stored
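
How these parameters interact can be checked on an in-memory sample. The sketch below assumes the directory file has two tab-separated columns (region name, then code) and no header row; the two sample lines reuse codes shown later in this article.

```python
import io
import pandas as pd

# Two tab-separated lines with no header row, mimicking the directory file.
tsv = "Belgorod region\tRU-BEL\nBryansk region\tRU-BRY\n"

spr_toy = pd.read_csv(
    io.StringIO(tsv),
    sep='\t',
    header=None,             # the first line is data, not column names
    names=['name', 'code'],  # so the names are supplied explicitly
    index_col=0,             # and the first column ('name') becomes the index
)
print(spr_toy.loc['Bryansk region', 'code'])
```

The encoding argument is omitted here because the sample is already a Python string; it matters only when reading bytes from a file.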

If we look closely at our stat dataset, we can see that some of its index labels contain footnote markers such as '1)' and '2)', which read_html() converted to ordinary characters and appended to the corresponding labels. In addition, some labels carry the prefix 'г. ' (the Russian abbreviation for "city"), which is absent from the directory. Because of these small differences, when we join the stat dataset with the directory to pull in the region codes, some regions will end up without a code.
This can be fixed as follows:

new_index = stat.index.to_series()
new_index = new_index.str.replace(u'(2\))|(1\))|(г. )','')

The first line extracts the index into a separate new Series. The second line replaces every match of the regular expression with an empty string.
Now we need to replace the index with the values from the new Series. As shown above, this can be done as follows:

stat.set_index(new_index, inplace=True)
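
The cleanup can be tested in isolation. The labels below are hypothetical, and the sketch assumes a recent pandas in which str.replace takes an explicit regex=True argument; it also escapes the dot and strips leftover whitespace, which goes slightly beyond the original expression.

```python
import pandas as pd

# Hypothetical labels: a clean one, one with a footnote marker,
# and one with both the city prefix and a footnote marker.
labels = pd.Index(['Belgorod region', 'Tyumen region1)', 'г. Moscow2)'])

pattern = r'(2\))|(1\))|(г\. )'  # markers "2)" and "1)", and the "г. " prefix
cleaned = labels.to_series().str.replace(pattern, '', regex=True).str.strip()
print(list(cleaned))
```

Running the check on a few known problem labels before touching the real index makes it easier to spot a pattern that removes too much.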

Now we can combine our data set with a directory to get the region codes:

RegionProfit = stat.join(spr, how='inner')

Our data after all the manipulations look like this:

	2000.0	2001.0	2002.0	2003.0	2004.0	2005.0	2006.0	2007.0	2008.0	2009.0	2010.0	2011.0	code
Belgorod region	1555	2121	2762	3357	4069	5276	7083	9399	12749	14147	16993	18800	RU-BEL
Bryansk region	1312	1818	2452	3136	3725	4788	6171	7626	10083	11484	13358	15348	RU-BRY
Vladimir region	1280	1666	2158	2837	3363	4107	5627	7015	9480	10827	12956	14312	RU-VLA
Voronezh region	1486	2040	2597	3381	4104	5398	6862	8307	10587	11999	13883	15871	RU-VOR
Ivanovo region	1038	1298	1778	2292	2855	3480	4457	5684	8343	9351	11124	13006	RU-IVA
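
The effect of how='inner' in the join above, and a way to find labels that still fail to match, can be sketched on toy frames (the 2011 values and codes are taken from the tables in this article; the frames themselves are illustrative):

```python
import pandas as pd

stat_toy = pd.DataFrame(
    {2011: [18800, 15348, 20755]},
    index=['Belgorod region', 'Bryansk region', 'Russian Federation'],
)
spr_toy = pd.DataFrame(
    {'code': ['RU-BEL', 'RU-BRY']},
    index=['Belgorod region', 'Bryansk region'],
)

inner = stat_toy.join(spr_toy, how='inner')  # only labels present in both frames
left = stat_toy.join(spr_toy, how='left')    # every stat_toy row, NaN where no code

# Labels that have no code in the directory.
unmatched = list(left[left['code'].isna()].index)
print(unmatched)
```

An inner join silently drops unmatched rows, so a left join like this is a quick way to see which labels still need cleaning.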

So let's move on to actually building the map and plotting the data on it. To get started, we need to create a dictionary describing our map (wrapped in a list):

geo_data = [{'name': 'rus',  # map name
             'url': 'RusMap/russia.json',  # path to the TopoJSON file with the map
             'feature': 'russia'}]  # name of the object inside the map file

Now let's create our map object and bind our data to it. This is done with Map():

vis = vincent.Map(data=RegionProfit, geo_data=geo_data,scale=700, projection='conicEqualArea', rotate = [-105,0], center = [-10, 65], data_bind=2011, data_key='code', map_key={'rus': 'properties.region'})

It takes the following arguments:

data - the dataset
geo_data - the object describing our map
projection - the projection in which the map will be displayed
rotate, center, scale - projection parameters
data_bind - the column with the data to be displayed
data_key - the field with the code by which the map and the data are linked
map_key - a dictionary of the form {'name of the map object': 'name of the property used for binding'}
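
To make the roles of data_key and map_key more concrete, the binding can be thought of as a lookup from a property of each map feature to a data row. The snippet below is a plain-Python illustration of that idea under assumed structures, not vincent's or Vega's actual implementation; the feature dictionaries and the code 'RU-XXX' are hypothetical.

```python
# Hypothetical map features, each carrying a region code under properties.region.
features = [
    {'properties': {'region': 'RU-BEL'}},
    {'properties': {'region': 'RU-BRY'}},
    {'properties': {'region': 'RU-XXX'}},  # made-up code with no matching data row
]

# Data rows keyed by their 'code' field (the role played by data_key).
rows = [{'code': 'RU-BEL', 'value': 18800}, {'code': 'RU-BRY', 'value': 15348}]
by_code = {row['code']: row['value'] for row in rows}

# Look up each feature's value via properties.region (the role played by map_key).
bound = {
    feature['properties']['region']: by_code.get(feature['properties']['region'])
    for feature in features
}
print(bound)
```

A feature whose code has no matching row ends up without a value, which is why the region labels were cleaned before the join.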

But here a surprise awaits us: in the stock version of vincent, the rotate parameter can only be an integer. For our map to display correctly, this parameter must be able to take a list. To fix this, open the file %PYTHON_PATH%\lib\site-packages\vincent\transforms.py and replace the piece of code responsible for checking the type:

@grammar(int)
def rotate(value):
    """The rotation of the projection"""
    if value < 0:
        raise ValueError('The rotation cannot be negative.')

with:

@grammar(list)
def rotate(value):
    if len(value) != 2:
        raise ValueError('len(rotate) must = 2')
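
The check performed by the patched validator can be sketched as a standalone function. The name check_rotate is hypothetical and the function is not part of vincent; it only mirrors the length check above and, as an extra assumption, also verifies that both entries are numbers.

```python
from numbers import Real


def check_rotate(value):
    """Validate a rotation given as a two-element list, e.g. [-105, 0]."""
    if not isinstance(value, (list, tuple)):
        raise TypeError('rotate must be a list or tuple')
    if len(value) != 2:
        raise ValueError('rotate must contain exactly 2 values')
    # bool is a subclass of int, so it is excluded explicitly.
    if not all(isinstance(v, Real) and not isinstance(v, bool) for v in value):
        raise ValueError('rotate values must be numbers')
    return list(value)


print(check_rotate([-105, 0]))
```

Unlike the original integer check, this accepts negative values, which the map needs (rotate = [-105, 0]).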

Now our object will be created correctly. What remains is to tune it. To begin, let's make the boundaries between map regions less visible. For this we use marks, which are the main building blocks of a visualization; you can read more about them in the Vega documentation. In our case, the code looks like this:

vis.marks[0].properties.enter.stroke_opacity = vincent.ValueRef(value=0.5)

Now let's color the regions differently depending on the group their value falls into. This is done with scales, objects designed to translate data values (numbers, strings, dates, etc.) into display values (pixels, colors, sizes). The code is below:

vis.scales['color'].type = 'threshold'  # set the scale type
vis.scales['color'].domain = [10000, 15000, 20000, 25000, 30000]  # set the thresholds used to group the data values
vis.legend(title=u'Доходы руб.')  # add the map legend (title: 'Income, rubles')
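
How a threshold scale groups values can be sketched in plain Python. The five thresholds split the number line into six bins; in D3's threshold scale, on which this sketch is modeled, a value equal to a threshold falls into the upper bin. The function name bin_index is hypothetical, and the actual colors come from the scale's range, which is not shown here.

```python
from bisect import bisect_right

domain = [10000, 15000, 20000, 25000, 30000]  # thresholds from the scale above


def bin_index(value):
    """Return the 0-based bin for value; 5 thresholds give bins 0..5."""
    return bisect_right(domain, value)


for income in (9351, 15000, 27091):
    print(income, '-> bin', bin_index(income))
```

Each bin index then selects one color, so a scale with five thresholds needs six colors to cover every value.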

Well, the map is configured, and now we can see what we got. As mentioned above, you can use the display() function for this, but for some unknown reason it did not work for me, so I first exported the visualization to a JSON file using the to_json() function:

vis.to_json('example_map.json', html_out=True, html_path='example_map.html')

Three arguments are passed to it:

the name of the output JSON file
html_out - indicates that an HTML wrapper file should also be created
html_path - sets the path to the HTML file

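To view our HTML file, we need a simple HTTP server, such as the one included with Python. To start it, run the following on the command line (this is the Python 2 module; in Python 3 the equivalent is python -m http.server 8000):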

python -m SimpleHTTPServer 8000

As a result, our map will look like this:

Conclusion

Today I tried to show another way to visualize data with pandas. I would also like to note that the module is relatively young and under active development. Among its shortcomings, I would note that not all objects are displayed when output directly in IPython, and that there is no way to export just an image rather than a JSON file, even though such tools have been developed for Vega.
