Building a Simple Map with Pandas and Vincent

    Good afternoon, dear readers.
    In a previous article, I gave an introduction to data visualization with Pandas and matplotlib. Today I would like to show another way to display analysis results, using Vincent, which also integrates easily with Pandas, although it takes a few more steps than matplotlib does.

    Introduction


    Vincent is a module that translates data from Python into the JavaScript visualization libraries D3.js and Vega, which in turn offer rich possibilities for interactive data visualization.
    In other words, we can perform the analysis in Python and build the charts for the results in JavaScript. This is convenient, for example, for visualizing geographic data and plotting it on a map. In addition, vincent integrates with IPython Notebook and, like matplotlib, can display charts directly in it.
    To demonstrate the capabilities of this module, I propose solving 2 tasks:
    • Show the dynamics of per capita income for the Central and Volga Federal Districts
    • Show on a map of the Russian Federation the distribution of per capita income across its constituent regions for 2010

    As initial data, we take statistics from the Rosstat website.

    Data analysis


    To get started, let's load the data and see whether any additional processing is needed.

    import pandas as pd
    import vincent
    stat = pd.read_html('Data/AVGPeopleProfit.htm', header=0, index_col=0)[0]
    

    So, to load the data this time we use the read_html() function (available in pandas since version 0.12). It returns a list of DataFrames, one per table found on the page, which is why we take element [0]. In our case, 3 arguments are passed:
    1. The path to the HTML page
    2. The number of the row containing the column names
    3. The number of the column to be used as the index

    After loading, we got a table of the following form:
                                  1990.0  2000.0  2001.0  2002.0  2003.0  2004.0  2005.0  2006.0  2007.0  2008.0  2009.0  2010.0  2011.0     nan
    Russian Federation             NaN    2281    3062    3947    5167    6399    8088   10155   12540   14864   16895   18951   20755     NaN
    Central Federal District       NaN    3231    4300    5436    7189    8900   10902   13570   16631   18590   21931   24645   27091       1
    Belgorod region                NaN    1555    2121    2762    3357    4069    5276    7083    9399   12749   14147   16993   18800      24
    Bryansk region                 NaN    1312    1818    2452    3136    3725    4788    6171    7626   10083   11484   13358   15348      52
    Vladimir region                NaN    1280    1666    2158    2837    3363    4107    5627    7015    9480   10827   12956   14312      64

    As you can see, the table needs a little processing: the first column contains only empty values, and the last column has no name. So let's select the columns we need (the 2nd through the 13th) and keep only them in the DataFrame.

    stat = stat[stat.columns[1:13]]
    

    Now we have a dataset suitable for work. Of course, the names of the columns hurt the eye, but such names will not prevent us from solving the tasks at all.
                                  2000.0  2001.0  2002.0  2003.0  2004.0  2005.0  2006.0  2007.0  2008.0  2009.0  2010.0  2011.0
    Russian Federation            2281    3062    3947    5167    6399    8088   10155   12540   14864   16895   18951   20755
    Central Federal District      3231    4300    5436    7189    8900   10902   13570   16631   18590   21931   24645   27091
    Belgorod region               1555    2121    2762    3357    4069    5276    7083    9399   12749   14147   16993   18800
    Bryansk region                1312    1818    2452    3136    3725    4788    6171    7626   10083   11484   13358   15348
    Vladimir region               1280    1666    2158    2837    3363    4107    5627    7015    9480   10827   12956   14312
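
    To make the positional selection easier to follow, here is a minimal sketch on a miniature table. The frame `raw` and its column labels are made up for illustration; only the slicing idiom mirrors the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the parsed table: an empty first column,
# two year columns, and a trailing column that we do not need.
raw = pd.DataFrame(
    [[np.nan, 2281, 3062, np.nan],
     [np.nan, 3231, 4300, 1.0]],
    index=['Russian Federation', 'Central Federal District'],
    columns=['1990', '2000', '2001', 'unnamed'],
)

# Select columns by position: skip the first one and stop before the last.
trimmed = raw[raw.columns[1:3]]

# The same selection written with purely positional indexing.
trimmed_iloc = raw.iloc[:, 1:3]

print(trimmed)
```

    Both forms give the same result; slicing `columns` is handy when you also want to inspect the selected labels first.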

    So, let's get down to the first task: visualizing the data for 2 districts. To get the data on which the chart will be based, we need to select the districts of interest (Central and Volga) and then transpose the resulting table. This can be done as follows:

    fo = [u'Приволжский федеральный округ',u'Центральный федеральный округ']
    fostat = stat[stat.index.isin(fo)].transpose()
    

    In the code above, we first filter our data set down to the districts we need using the isin() function, which checks whether each index value is contained in the given list (an analogue of the IN operator in SQL). Then we use the transpose() function to transpose the resulting data set and write the result to a new DataFrame.
            Central Federal District  Volga Federal District
    2000                      3231                    1726
    2001                      4300                    2319
    2002                      5436                    3035
    2003                      7189                    3917
    2004                      8900                    4787
    2005                     10902                    6229
    2006                     13570                    8014
    2007                     16631                    9959
    2008                     18590                   12392
    2009                     21931                   13962
    2010                     24645                   15840
    2011                     27091                   17282
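
    The filter-then-transpose step can be sketched on a tiny frame. The name `stat_demo` and the English index labels are assumptions for illustration (the real index holds the Russian names); the numbers are taken from the tables above.

```python
import pandas as pd

# Miniature stand-in for `stat`, with translated labels.
stat_demo = pd.DataFrame(
    {2000: [2281, 3231, 1555, 1726], 2001: [3062, 4300, 2121, 2319]},
    index=['Russian Federation', 'Central Federal District',
           'Belgorod region', 'Volga Federal District'],
)

wanted = ['Volga Federal District', 'Central Federal District']

# isin() yields a boolean mask: True where the index label is in `wanted`.
mask = stat_demo.index.isin(wanted)

# Rows keep their original order; transpose() turns years into rows.
by_year = stat_demo[mask].transpose()
print(by_year)
```

    Note that the order of the list passed to isin() does not affect the order of the result, which is why the Central district still comes first.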

    As you can see, the index labels in the table are now the years in numeric format. This is not very convenient, so let's convert the index to dates:

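    # Explicit start and end dates give exactly 12 yearly timestamps, one per row.
    # 'AS' means "year start"; recent pandas releases prefer the alias 'YS'.
    fostat.set_index(pd.date_range('2000-01-01', '2011-01-01', freq='AS'), inplace=True)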
    

    The set_index() function sets a new index on a DataFrame. In our case, 2 parameters are passed to it:
    1. The list of new index values (it may also be a column name)
    2. inplace=True, which means the index is replaced in the current DataFrame; with False, the original is left unchanged and a new DataFrame is returned instead
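
    As an illustration of both points, here is a minimal sketch on a three-row frame. The frame `demo` is hypothetical, and instead of date_range it derives the dates from the existing year labels with pd.to_datetime, which is an alternative to the approach used above rather than the same call.

```python
import pandas as pd

# Hypothetical frame with numeric year labels, as produced by transpose().
demo = pd.DataFrame({'Central': [3231, 4300, 5436]},
                    index=[2000.0, 2001.0, 2002.0])

# Build one timestamp per label: January 1 of each year.
dates = pd.to_datetime(demo.index.astype(int).astype(str), format='%Y')

# Without inplace=True, set_index returns a new DataFrame
# and leaves `demo` untouched.
dated = demo.set_index(dates)

print(dated.index)
print(demo.index)
```

    Deriving the dates from the labels avoids having to keep the range boundaries in sync with the number of rows.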

    Now our data is completely ready for plotting. If you are working in IPython Notebook and want to see the result right away, you need to call the initialize_notebook() function to enable the integration. It looks like this:

    vincent.core.initialize_notebook()

    Now we need to create an object corresponding to the chart type (a full list of objects can be found in the documentation). In our case, it will be a line chart. The code is as follows:

    line = vincent.Line(fostat)  # create the chart object
    line.axis_titles(x=u'Год', y=u'тыс. руб')  # set the axis titles ("Year", "thousand rubles")
    line.legend(title=u'ЦФО vs ПФО')  # show the legend and set its title ("Central FD vs Volga FD")
    

    You can display the chart using the display() function:

    line.display()

    As a result, we will see the following:


    Mapping



    Well, we have dealt with the first task. Now let's move on to the second. To solve it, we need a TopoJSON file with a map of the Russian Federation, as well as a reference table of regions. Details on what these are and how to get them can be read here. To get started, let's load the reference table of regions using read_csv, described in a previous article:

    # The separator is passed by keyword, as newer pandas versions require.
    spr = pd.read_csv('Data/russia-region-names.tsv', sep='\t', index_col=0, header=None, names=['name', 'code'], encoding='utf-8')
    

    As you can see, several additional parameters are used here:
    • index_col - the number of the column to be used as the index
    • header - None means that no line of the file is used for the column headers
    • names - a list whose elements become the column names
    • encoding - the encoding in which the file is stored
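
    To see how these parameters interact, here is a minimal sketch that reads a two-line tab-separated snippet from memory instead of a file. The snippet content and the name `spr_demo` are assumptions; encoding is omitted because the text is already decoded.

```python
import io
import pandas as pd

# Two rows of a hypothetical reference table: region name, then region code.
tsv = io.StringIO("Belgorod region\tRU-BEL\nBryansk region\tRU-BRY\n")

spr_demo = pd.read_csv(
    tsv,
    sep='\t',                # tab-separated values
    index_col=0,             # first column becomes the index
    header=None,             # the file has no header line
    names=['name', 'code'],  # column names supplied by hand
)
print(spr_demo)
```

    Because index_col=0 consumes the 'name' column, the resulting frame has a single data column, 'code', indexed by region name.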


    If we look closely at our stat data set, we can see that some of its index labels contain footnote markers such as '1)' and '2)', which read_html() parsed as ordinary characters and appended to the end of the corresponding labels. In addition, city names carry the prefix 'г. ' (the abbreviation for "city"), which is absent from the reference table. Because of these small differences, when we join the stat data with the reference table to pull in the region codes, some regions would be left without a code.
    This can be fixed as follows:

    new_index = stat.index.to_series()
    # regex=True makes the pattern a regular expression on newer pandas versions
    new_index = new_index.str.replace(r'(2\))|(1\))|(г\. )', '', regex=True)
    

    In the first line, we extract the index into a separate new Series. In the second line, we replace every substring that matches the regular expression with an empty string.
    Now we need to replace the index values with the values from the new Series. As shown above, this can be done as follows:

    stat.set_index(new_index, inplace=True)
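
    The whole clean-up can be tried on a small frame. The labels and values below are made up to imitate the footnote markers and the city prefix, and the names `demo` and `cleaned_df` are assumptions; the calls are the same ones used above.

```python
import pandas as pd

# Hypothetical labels: one with a footnote marker, one with the 'г. ' prefix.
demo = pd.DataFrame({2011: [1, 2, 3]},
                    index=['Tyumen region1)', 'г. Moscow', 'Bryansk region'])

# Copy the index into a Series so that the .str accessor can be used.
labels = demo.index.to_series()

# Strip the footnote markers '1)' and '2)' and the city prefix.
cleaned = labels.str.replace(r'(2\))|(1\))|(г\. )', '', regex=True)

# set_index takes the values of the Series as the new index.
cleaned_df = demo.set_index(cleaned)
print(list(cleaned_df.index))
```

    Labels that match none of the alternatives, such as the third one here, pass through unchanged.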
    

    Now we can combine our data set with a directory to get the region codes:

    RegionProfit = stat.join(spr, how='inner')
    

    Our data after all the manipulations look like this:
                      2000.0  2001.0  2002.0  2003.0  2004.0  2005.0  2006.0  2007.0  2008.0  2009.0  2010.0  2011.0    code
    Belgorod region     1555    2121    2762    3357    4069    5276    7083    9399   12749   14147   16993   18800  RU-BEL
    Bryansk region      1312    1818    2452    3136    3725    4788    6171    7626   10083   11484   13358   15348  RU-BRY
    Vladimir region     1280    1666    2158    2837    3363    4107    5627    7015    9480   10827   12956   14312  RU-VLA
    Voronezh region     1486    2040    2597    3381    4104    5398    6862    8307   10587   11999   13883   15871  RU-VOR
    Ivanovo region      1038    1298    1778    2292    2855    3480    4457    5684    8343    9351   11124   13006  RU-IVA
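
    The effect of how='inner' is easiest to see on a miniature example. The frames `income` and `codes` are hypothetical stand-ins for stat and spr with translated labels; the numbers come from the tables above.

```python
import pandas as pd

income = pd.DataFrame(
    {2011: [18800, 15348, 20755]},
    index=['Belgorod region', 'Bryansk region', 'Russian Federation'],
)
codes = pd.DataFrame(
    {'code': ['RU-BEL', 'RU-BRY']},
    index=['Belgorod region', 'Bryansk region'],
)

# Inner join: only labels present in BOTH indexes survive.
joined = income.join(codes, how='inner')

# A left join keeps every row of `income`; rows without a match get NaN,
# which is a quick way to spot labels that still need cleaning.
unmatched = income.join(codes, how='left')['code'].isna()

print(joined)
print(unmatched[unmatched].index.tolist())
```

    Here the country-wide total has no region code, so the inner join drops it, which is exactly what we want before drawing the regions on the map.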


    Now let's move on to building the map itself and plotting the data on it. To get started, we need to create a dictionary describing our map:

    geo_data = [{'name': 'rus',  # name of the map
                 'url': 'RusMap/russia.json',  # path to the TopoJSON file with the map
                 'feature': 'russia'}]  # name of the object inside the map file
    

    Now let's create our map object and bind our data to it. This is done with Map():

    vis = vincent.Map(data=RegionProfit, geo_data=geo_data,scale=700, projection='conicEqualArea', rotate = [-105,0], center = [-10, 65], data_bind=2011, data_key='code', map_key={'rus': 'properties.region'})
    

    It takes the following parameters:
    • data - the data set
    • geo_data - the object describing our map
    • projection - the projection in which the map will be displayed
    • rotate, center, scale - projection parameters
    • data_bind - the column with the data to be displayed
    • data_key - the field with the code by which the map and the data are linked
    • map_key - a dictionary of the form {'name of the map object': 'name of the property used for binding'}


    But here a surprise awaits us: in the stock version of vincent, the rotate parameter can only be an integer. To display our map correctly, this parameter must be able to accept a list. To fix this, open the file %PYTHON_PATH%\lib\site-packages\vincent\transforms.py and replace the piece of code responsible for checking the type:

    @grammar(int)
    def rotate(value):
        """The rotation of the projection"""
        if value < 0:
            raise ValueError('The rotation cannot be negative.')
    

    with:

     
    @grammar(list) 
    def rotate(value): 
        if len(value) != 2: 
            raise ValueError('len(center) must = 2') 
    


    Now our object will be created correctly. It remains to fine-tune it. To begin with, let's make the borders between the map objects less visible. This is done through marks, which are the basic building blocks of a visualization. You can read more about them in the Vega documentation. In our case, the code looks like this:

    vis.marks[0].properties.enter.stroke_opacity = vincent.ValueRef(value=0.5)
    

    Now let's make our values appear in different colors depending on the group they fall into. This can be done with Scales objects, which translate data values (numbers, strings, dates, etc.) into display values (pixels, colors, sizes). The code is below:

    vis.scales['color'].type = 'threshold'  # set the scale type
    vis.scales['color'].domain = [10000, 15000, 20000, 25000, 30000]  # set the data values that delimit the groups
    vis.legend(title=u'Доходы руб.')  # add the map legend ("Income, rubles")
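
    To show what the grouping amounts to, here is a small sketch in plain Python. It assumes that the threshold scale behaves like D3's threshold scale, where n thresholds split the values into n + 1 buckets and a value equal to a threshold falls into the upper bucket; the helper `bucket` is hypothetical and is not part of vincent.

```python
from bisect import bisect_right

# The same thresholds as in the scale domain above.
domain = [10000, 15000, 20000, 25000, 30000]

def bucket(value, thresholds=domain):
    """Return the index of the color bucket for a value (0 .. len(thresholds))."""
    # bisect_right counts how many thresholds are <= value.
    return bisect_right(thresholds, value)

# 2011 incomes of Belgorod, Ivanovo and Bryansk regions from the table above.
buckets = [bucket(v) for v in (18800, 13006, 15348)]
print(buckets)
```

    With five thresholds there are six buckets, so the color range of the scale needs six colors to cover every group.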
    

    Well, the map is now configured, so we can see what we got. As mentioned above, you can use the display() function for this, but for some unknown reason it did not work for me, so I first exported the visualization to a JSON file using the to_json() function:

    vis.to_json('example_map.json', html_out=True, html_path='example_map.html')
    

    Three parameters are passed to it:
    1. The name of the output file
    2. html_out - indicates that an HTML wrapper file should also be created
    3. html_path - the path to the HTML file


    To view our HTML file, we need the simple HTTP server that ships with Python. To start it, run the following command from the directory containing the files (on Python 3 the equivalent is python -m http.server 8000):

    python -m SimpleHTTPServer 8000

    As a result, our map will look like this:



    Conclusion


    Today I tried to show another way to visualize data when working with pandas. I would also like to note that the module under consideration is relatively young and is being actively developed. Among its shortcomings, I would note that not all objects are displayed when you try to output them directly in IPython, and that there is no way to export just an image rather than a JSON file, even though such tools have been developed for Vega.
