SciPy, graph algorithms

From the sandbox
Tutorial

SciPy (pronounced sai pay) is a math application package based on the Numpy Python extension. It greatly enhances the capabilities of Python by providing the user with commands and high-level classes for managing data and visualizing it. With SciPy, an interactive Python session turns into the same comprehensive data processing and prototyping environment for complex systems like MATLAB, IDL, Octave, R-Lab and SciLab.

Introduction

The SciPy package is recommended to be installed as part of Anaconda. It is also advisable to get familiar with the basics of Python and have Numpy documentation on hand. For the sake of brevity and further convenience, we will agree that the main packages (numpy, scipy and matplotlib) will be imported as follows:

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

This is the official import agreement used in the NumPy, SciPy source code and documentation. Compliance with these conventions in your code is optional, but strongly recommended. Each of the packages should be imported separately as follows:

from scipy import linalg, optimize

SciPy and NumPy have full documentation that covers almost all of their functions. They are available at https://docs.scipy.org/ in HTML and PDF formats. Some parts may be incomplete or outdated, because documentation is constantly under development. Since SciPy is a volunteer organization and depends on the community, your participation from providing feedback to improving the documentation and code is welcome and actively encouraged.

Algorithms on sparse graphs (scipy.csgraph)

Lewis Carroll, Merry Christmas 1877

Consider the use of the package scipy.csgraph on the example of the children's game " Stairs of words ", invented by Lewis Carroll at Christmas 1877. In this game you need to find the path between the words, carrying out the replacement one letter at a time. For example, you can trace the path from a monkey to a human by associating the words "ape" and "man" as follows:

Note that each step involves changing only one letter of a word. This is just one possible path from monkey to man, but is this the shortest path? Thus, it is necessary to find the shortest path (ladder) between two given words. This problem can be solved with the help of the scipy.sparse.csgraph package.

First we need a list of valid words. Many operating systems have such a built-in list. For example, in linux, a list of words can be found in one of the following places:

/usr/share/dict 
/var/lib/dict

Another source of words are dictionaries that are available on various sites on the Internet (search your favorite search engine). The system word list is a file with one word per line. To use the system word list, read it as follows:

word_list = open('/usr/share/dict/words').readlines()
word_list = map(str.strip, word_list)

Since we are looking for words of three letters, then we will select only the necessary words from the general list. We also exclude words that begin with an uppercase (proper nouns) or contain non-letter characters, such as apostrophes and hyphens. At the end, check the size of the resulting list of words.

word_list = [word for word in word_list if len(word) == 3]
word_list = [word for word in word_list if word[0].islower()]
word_list = [word for word in word_list if word.isalpha()]
word_list = list (map (str.lower, word_list))
len (word_list)

Now we have a list of 591 three-letter words (the exact number may vary depending on the particular dictionary used). Each of these words is the top of the graph. Create edges connecting vertices (pairs of words) that differ only in one letter.
We use a fairly efficient way for which we need some manipulations with the numpy array:

word_list = np.asarray(word_list)
word_list.sort()  # сортируем для быстрого поиска
word_list.dtype   # тип данных - символы Unicode в Python

dtype('<U3')

Now we have an array in which each record contains three Unicode characters. We would like to find all pairs that have exactly one character. Let's start with the transformation of each word into a three-dimensional vector:

word_bytes = np.ndarray((word_list.size, word_list.itemsize),
                        dtype='uint8',
                        buffer=word_list.data)
# длина каждого символа юникода - 4 байта. Нам нужен только первый байт# каждое слово состоит точно из трех символов
word_bytes = word_bytes[:, ::word_list.itemsize//3]
word_bytes.shape

(591, 3)

We recall that the Hamming distance is the number of different characters in two vectors of the same length. Using Hamming distance, we determine the length of the edges connecting the pairs of words. In the ladder of words are connected every two words with a distance equal to where - the number of letters in a word:

from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
hamming_dist = pdist (word_bytes, metric = 'hamming')
graph = csr_matrix (squareform (hamming_dist < 1.5 / 3))

When comparing distances, lax equality is used, because otherwise the result for floating point values may be unstable. Inequality gives the desired result in the event that the two characters in the word do not match. Now that the graph has been set, we will perform the shortest path search between any two words on the graph:

i1 = word_list.searchsorted ('ape')
i2 = word_list.searchsorted ('man')

Now we need to find the shortest path between these two vertices on the graph. To do this, use the Dijkstra algorithm, which allows you to find all possible distances for the specified vertex:

from scipy.sparse.csgraph import dijkstra
distances, predecessors = dijkstra(graph, indices=i1,
                                    return_predecessors=True)
print(distances[i2])

5.0

So, we see that the shortest path between the monkey and man contains only five steps. We can use the predecessors returned by the algorithm to restore this path:

path = []
i = i2
while i != i1:
    path.append(word_list[i])
    i = predecessors[i]
path.append(word_list[i1])
print(path[::-1])

['ape', 'apt', 'opt', 'oat', 'mat', 'man']

The length of the stairs is three steps less than in the initial example. Using other tools of the module, you can find the answer to other questions. For example, are there words with three letters that are not connected to a ladder of words? This is the task of determining the connectivity of the graph:

from scipy.sparse.csgraph import connected_components
N_components, component_list = connected_components(graph)
print(N_components)

In our example, for three-letter words there are 15 related components: that is, 15 different sets of words without paths in between. How many words are in each of these sets? We can learn this from the list of components:

[np.sum(component_list == i) for i in range(N_components)]

[577, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

We obtained one large graph, which includes 577 vertices, as well as 14 isolated graphs. Let's look at these words:

[list(word_list[np.where(component_list == i)]) for i in range(1, N_components)]

[['aha'],
 ['chi'],
 ['ebb'],
 ['gnu'],
 ['ism'],
 ['khz'],
 ['nth'],
 ['née'],
 ['ova'],
 ['qua'],
 ['ugh'],
 ['ups'],
 ['urn'],
 ['use']]

You can find out between which words there is a ladder of maximum length. This can be determined by calculating the adjacency matrix. Note that the distance between two unlinked points is considered to be infinite, so they should be excluded:

distances, predecessors = dijkstra(graph, return_predecessors=True)
max_distance = np.max(distances[~np.isinf(distances)])
print(max_distance)

13.0

So, in our example there are such pairs of words, between which the shortest staircase contains 13 steps! Let's determine what these pairs are:

i1, i2 = np.where(distances == max_distance)
list(zip(word_list[i1], word_list[i2]))

[('imp', 'ohm'),
 ('imp', 'ohs'),
 ('ohm', 'imp'),
 ('ohm', 'ump'),
 ('ohs', 'imp'),
 ('ohs', 'ump'),
 ('ump', 'ohm'),
 ('ump', 'ohs')]

We saw all possible combinations of pairs of words, the distance between which is equal to 13 (as much as possible separated from each other). If you look closely, on one side of the staircase are the words “imp” and “ump”, which differ only in one letter. On the other side of the staircase are the words "ohm" and "ohs", which also differ in just one letter. The list of steps between these words will be almost the same. It can be found just as stated above:

path = []
i = i2[0]
while i != i1[0]:
    path.append(word_list[i])
    i = predecessors[i1[0], i]
path.append(word_list[i1[0]])
print(path[::-1])

['imp', 'amp', 'asp', 'ass', 'ads', 'ids', 'ins', 'inn', 'ion', 'won', 'woo', 'who', 'oho', 'ohm']

Word ladders are just one of the possible applications of fast algorithms on sparse graphs in scipy. Graph theory is applied in many areas of mathematics, data analysis and machine learning. The tools for working with sparse graphs scipy are flexible enough to solve a wide range of tasks.
I would be glad if you write in the comments about which section of scipy it will be interesting to read in the next article.

Tags:

SciPy, graph algorithms

Introduction

Algorithms on sparse graphs (scipy.csgraph)

Also popular now: