okazymyrov January 20, 2012 at 02:33

Gephi as a data visualization tool

It so happened that I ended up working as a professor's assistant at the university. Being a cryptographer by vocation, I never thought I would have to deal with risk assessment and data visualization. The course is called "Information Networks" and covers analysis of random processes, modeling of small worlds, computer algorithms for evaluating network properties, experimental studies of large networks, and analysis of risks that are difficult to predict.

Because the course is aimed mainly at IT professionals, the lecturer chose to give just enough theory, with a minimum of mathematics and a lot of practice. NetLogo is suitable for most of the tasks above. It includes its own high-level programming language, which makes it easy to simulate various random processes. For visualizing data of various kinds, Gephi was chosen.

Based on my experience with the latter, this article covers preparing input data for Gephi and then visualizing it.

The actual task was stated as follows: visualize any real data using Gephi.

Data generation

After some thought, I decided to visualize the collaborations between authors on the IACR website. Those who work in information security probably know it; for everyone else, it is a site where papers on information security are openly available.

Gephi File Formats

Gephi understands a fairly large number of file formats.

As the figure shows, the best option is gexf. It turned out to be an open format, and the library can be easily downloaded from here. The site says that Libgexf currently runs only on Linux (tested on Ubuntu 8.10 and 9.04). Without hesitation, I downloaded the sources and compiled them under Ubuntu 11.10 with the Python bindings (since I had decided to implement everything else in Python). I generated a test file and tried to open it in Gephi, but Gephi opened neither my file, nor one downloaded from the Internet, nor even one that I had saved myself. I don't know what kind of bug this is, but after some digging and trying on two machines, I abandoned the idea and looked for an alternative. After experimenting with CSV and finding its functionality insufficient, I came across import from a database.

It can be found under "File -> Import Database -> Edge list ...". At first I thought that "Edge list" imported only edges, but it turned out that vertices are imported as well. SQLite, MySQL, PostgreSQL and Teradata can be used as the database. At the top of the import window there is a hint about the expected table layout. There should be two tables: nodes and edges. The nodes table should contain the following columns:

required:
1. id - vertex number
optional:
1. label - label of the vertex
2. x - x coordinate of the vertex on the plane
3. y - y coordinate of the vertex on the plane
4. size - vertex size

For the edges table, the columns are as follows:

required:
1. source - id of the starting vertex
2. target - id of the end vertex
optional:
1. label - label of the edge
2. weight - edge weight

Optional "start" and "end" columns are also available for dynamic graphs. After a few tests, I settled on SQLite3.
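As an illustration, the two tables could be created from Python roughly as follows. This is only a sketch: the column names come from the lists above, while the column types, the default weight, and the helper name create_graph_db are my own assumptions, not something taken from the Gephi hint or from the original script.

```python
import sqlite3


def create_graph_db(path=":memory:"):
    """Create the nodes and edges tables that Gephi's edge-list import expects.

    The column types and the default weight are assumptions; only the
    column names come from the description above.
    """
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS nodes (
            id    INTEGER PRIMARY KEY,  -- required
            label TEXT,                 -- optional
            x     REAL,                 -- optional
            y     REAL,                 -- optional
            size  REAL                  -- optional
        );
        CREATE TABLE IF NOT EXISTS edges (
            source INTEGER NOT NULL,    -- required: id of the starting vertex
            target INTEGER NOT NULL,    -- required: id of the end vertex
            label  TEXT,                -- optional
            weight REAL DEFAULT 1.0     -- optional
        );
        """
    )
    return conn


# Small usage example with two vertices and one edge.
conn = create_graph_db()
conn.executemany(
    "INSERT INTO nodes (id, label) VALUES (?, ?)",
    [(1, "Elena Prekrasnaya"), (2, "Tugarin Zmey")],
)
conn.execute(
    "INSERT INTO edges (source, target, weight) VALUES (?, ?, ?)", (1, 2, 1.0)
)
conn.commit()
```

Passing a file name instead of ":memory:" produces a database file that can then be selected in the import window.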

Python database script

The script should do the following:

1. Download the author data for the articles
2. Split the author information into individual names
3. Fill in the database as described in the previous section
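The last step, turning per-article author lists into vertices and edges, might look roughly like this. It is a sketch under my own assumptions: every author becomes a vertex, every pair of co-authors of one article becomes an edge, and the edge weight counts joint articles. The function name build_graph is hypothetical and is not taken from the downloadable script.

```python
from collections import Counter
from itertools import combinations


def build_graph(papers):
    """Turn a list of author-name lists (one per article) into nodes and edges.

    Assumption: an edge links two authors who share at least one article,
    and its weight is the number of articles they wrote together.
    """
    ids = {}             # author name -> vertex id
    weights = Counter()  # (smaller id, larger id) -> number of joint articles
    for authors in papers:
        for name in authors:
            ids.setdefault(name, len(ids) + 1)
        # Every unordered pair of distinct co-authors gets one more joint article.
        for a, b in combinations(sorted({ids[n] for n in authors}), 2):
            weights[(a, b)] += 1
    nodes = [(node_id, name) for name, node_id in ids.items()]
    edges = [(a, b, w) for (a, b), w in weights.items()]
    return nodes, edges


papers = [
    ["Elena Prekrasnaya", "Tugarin Zmey", "Alyosha Popovich"],
    ["Elena Prekrasnaya", "Tugarin Zmey"],
    ["Alyosha Popovich"],
]
nodes, edges = build_graph(papers)
```

The resulting tuples map directly onto the (id, label) columns of the nodes table and the (source, target, weight) columns of the edges table. Authors who only ever publish alone still become vertices, just without edges.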

Since the source code is fairly long, I did not embed it in the article. You can download it here.

I will give only a few comments. SQLite3 was used as the database, since Python supports it out of the box. Tables and columns are created as described in the "Gephi File Formats" section. The source link was "http://eprint.iacr.org/cgi-bin/cite.pl?entry=year/number", where year is the year of publication and number is the article number. Initially I assumed that I could iterate from 001 up to the last article, detecting the end by the absence of the word "author". However, it later turned out that not every article number exists (for example, there is no 001 in 2004). So I had to add a workaround: stop only after three article numbers in a row come back without an author. One more catch was waiting when I extracted the names. The author lines may look like:
"{Elena Prekrasnaya, Tugarin Zmey and Alyosha Popovich}"
"{Elena Prekrasnaya, Tugarin Zmey, and Alyosha Popovich}"
"{Elena Prekrasnaya, Tugarin Zmey,
 Alyosha Popovich}"
"{Elena Prekrasnaya and Tugarin Zmey}"

If anyone can tell me a quick way to extract the names from such lines, I will be grateful. The output should be ['Elena Prekrasnaya', 'Tugarin Zmey', 'Alyosha Popovich'] (no stray spaces or commas, i.e. just a first and last name separated by a single space).
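One possible answer is a single regular expression. This is a minimal sketch, not the code from the downloadable script: the function name split_authors is made up, and it assumes that commas and the word "and" only ever separate authors and never occur inside a name (so "Last, First" style entries would be split incorrectly).

```python
import re


def split_authors(raw):
    """Split a raw author line such as "{A B, C D and E F}" into names.

    Assumes commas and the word "and" are used only as separators.
    """
    # Drop the surrounding braces and collapse line breaks and repeated spaces.
    text = " ".join(raw.strip().strip("{}").split())
    # Separators: a comma (optionally followed by "and"), or a standalone "and".
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", text)
    return [part for part in parts if part]


names = split_authors("{Elena Prekrasnaya, Tugarin Zmey,\n Alyosha Popovich}")
```

Collapsing whitespace first means that an entry broken across two lines is handled the same way as a single-line one.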

Finally, to avoid hammering the site, I decided to create a local dump of the authors first and then use it as input. For this I used pickle, which lets you dump variables to a file and load them back when needed. The dump is a dictionary: the keys are the years, and the values are lists of author strings. This way I can build a database for just the years that interest me.
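The scanning rule (stop after three misses in a row) and the local dump could be sketched as follows. The names collect_year, save_dump, load_dump and fetch_authors are hypothetical. To keep the example runnable without network access, fetch_authors is passed in as a parameter and replaced here by a small in-memory stand-in; in a real run it would request the cite.pl page and return None when the page contains no author.

```python
import os
import pickle
import tempfile

MAX_MISSES = 3  # give up after this many consecutive numbers without an author


def collect_year(year, fetch_authors):
    """Collect the raw author strings for one year.

    fetch_authors(year, number) is a hypothetical callable that returns the
    author string of an article, or None when the entry has no author.
    """
    authors = []
    misses = 0
    number = 1
    while misses < MAX_MISSES:
        line = fetch_authors(year, number)
        if line is None:
            misses += 1
        else:
            misses = 0  # a hit resets the counter, so isolated gaps are skipped
            authors.append(line)
        number += 1
    return authors


def save_dump(dump, path):
    with open(path, "wb") as fh:
        pickle.dump(dump, fh)


def load_dump(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Stand-in for the site: article 001 is missing and there is a gap at 004.
fake_site = {
    (2004, 2): "{A and B}",
    (2004, 3): "{C}",
    (2004, 5): "{D, E and F}",
}
dump = {2004: collect_year(2004, lambda year, number: fake_site.get((year, number)))}

path = os.path.join(tempfile.mkdtemp(), "authors.pickle")
save_dump(dump, path)
restored = load_dump(path)
```

With this rule a gap of three or more consecutive missing numbers ends the scan early, so the threshold is a trade-off between extra requests and the risk of cutting a year short.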

Data visualization with Gephi

As an example, take a small graph covering the years 1996 to 2003. It consists of 759 vertices and 437 edges. This information is shown when the graph is imported:

or in the "Context" window.
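The same totals can be cross-checked against the database before importing. The sketch below assumes the nodes and edges tables described earlier; the function name graph_summary and the tiny in-memory test database are made up for illustration. It also reports the largest vertex degree, treating every edge as undirected.

```python
import sqlite3
from collections import Counter


def graph_summary(conn):
    """Return (vertex count, edge count, maximum degree) for a graph database."""
    node_count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
    edge_count = conn.execute("SELECT COUNT(*) FROM edges").fetchone()[0]
    degree = Counter()
    for source, target in conn.execute("SELECT source, target FROM edges"):
        # Each edge contributes one to the degree of both of its endpoints.
        degree[source] += 1
        degree[target] += 1
    return node_count, edge_count, max(degree.values(), default=0)


# Tiny in-memory database: five vertices, four edges, vertex 5 is isolated.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE edges (source INTEGER, target INTEGER);
    INSERT INTO nodes (id) VALUES (1), (2), (3), (4), (5);
    INSERT INTO edges VALUES (1, 2), (1, 3), (2, 3), (1, 4);
    """
)
summary = graph_summary(conn)
```

If the numbers differ from what Gephi reports on import, the database is the first place to look.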

Additionally, when importing, you can choose the graph type: directed, undirected or mixed.

Next, open the "Ranking" window for the initial styling of the graph. Based on degree, we color all the vertices in shades of blue and set the vertex sizes. I recommend setting the maximum vertex size equal to the maximum vertex degree, or twice that.

The result looks like this:

Now, to lay everything out nicely, open the "Layout" window. It lists the algorithms that can be used to arrange the graph.

The following are screenshots of some of them.

ARF

Radial Axis

Parallel Force Atlas

You can also pin vertices to one of the axes; this is set in the "Ranking" -> "Coordinates" window.

You can also turn on display of the "label" field on the graph. For instance:

Conclusions

Gephi is a young but very powerful tool for visualizing data of various kinds, suitable for both novice and experienced users. Using a database makes it quick and easy to prepare the input data for visualization.
