infotanka August 11, 2014 at 11:18

Rose of intestinal bacteria

Scientific tasks that involve processing and visualizing complex data are among the most non-trivial and interesting. Scientific experiments accumulate huge amounts of data, with many measurements and parameters that are specific to a particular field of knowledge and often interrelated. A convenient and intuitive way of interpreting such data leads to a result quickly and demonstrates it clearly to interested parties, and from there it is a stone's throw to an important discovery. Think of the periodic table, Feynman diagrams, spectral series of substances, diagrams of genomic DNA, maps of the cosmic microwave background.

I will talk about a scientific task that we were lucky to work on in the Data Lab. We came up with and implemented a tool for comparing the phylogenetic distance of microbiota samples and named it the intestinal bacteria rose.

Task

The microbiota is the collection of bacteria that live in our intestines. The composition, properties and genes of these bacteria are unique to each person and change over time; they depend on various factors and directly affect human health. The guys from the bioinformatics laboratory of the Research Institute of FHM, led by Dmitry Alekseev, are studying these dependencies.

The microbiota consists of bacteria of various species, in varying numbers: about 100 species in total. The experimental data contain the sequenced genomes of all the bacterial populations of the study participants (383 people). The genomes of the individual bacteria of one species in a particular patient are assumed to be identical. To measure how close the populations of two different patients are, a metric is introduced: the phylogenetic distance, which takes into account the matches and differences of specific genes in these populations. This distance is calculated for all pairs of patients and for all species of bacteria. This is the information we had to visualize.
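To make the shape of the data concrete, here is a minimal sketch of computing a distance for every pair of samples of one species. The article does not give the formula for the phylogenetic distance, so `gene_set_distance` is only a stand-in (one minus the Jaccard similarity of two gene sets); the sample codes and gene names are invented for illustration.

```python
from itertools import combinations


def gene_set_distance(genes_a, genes_b):
    """Stand-in metric, NOT the real phylogenetic distance:
    1 - (shared genes / all genes seen in either sample)."""
    union = genes_a | genes_b
    if not union:
        return 0.0
    return 1.0 - len(genes_a & genes_b) / len(union)


def pairwise_distances(samples):
    """samples: {sample_id: set of gene ids} for ONE species.
    Returns {(id_a, id_b): distance} with each unordered pair stored once."""
    return {
        (a, b): gene_set_distance(samples[a], samples[b])
        for a, b in combinations(sorted(samples), 2)
    }


# Hypothetical toy data: three samples of a single species.
samples = {
    "RU-01": {"g1", "g2", "g3"},
    "CN-01": {"g1", "g2", "g4"},
    "US-01": {"g5"},
}
distances = pairwise_distances(samples)
for pair, d in distances.items():
    print(pair, round(d, 2))
```

Each unordered pair is stored once, since the distance from A to B is the same as from B to A. With 383 samples that gives at most 383 * 382 / 2 = 73,153 pairs per species.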

Existing methods and their disadvantages

Together with the data, the guys showed us the existing visualizations: a heatmap and an MDS projection.

The heatmap visualizes the pairwise distances between samples: blue marks close pairs, red marks distant ones. The blue diagonal band shows the zero distance of each sample to itself. The trouble with the heatmap is that an essentially linear quantity (proximity) is encoded as color. While reading, you constantly have to translate one into the other, that is, to decode the chart. In addition, the "proximity" of colors cannot always be interpreted unambiguously and depends on the individual color perception of the viewer. All the information seems to be in plain sight, yet it is hard to draw conclusions. The full square is also redundant: it is symmetric about the diagonal, because the proximity of A and B is the same as that of B and A.

The MDS projection reduces the dimension of the problem by projecting the N × N picture of pairwise distances between samples onto a plane. This method is the most popular one for genomic data, because it does not require exact coordinates of the objects: you only need to know the distances between them.

Its disadvantage is that reducing the dimension usually loses part of the information. The computational procedure does not guarantee a unique projection, and its results can differ greatly even for data sets that are essentially very similar, because the objects move along the axes according to the computational algorithm rather than the features of the data. Comparing scatterplots built for different matrices is a pointless exercise.
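One simple source of this non-uniqueness can be shown in a few lines. The sketch below does not run MDS at all; it only demonstrates that a rotated and mirrored copy of a 2D layout has exactly the same pairwise distances, so an algorithm that sees only distances has no reason to prefer one orientation over the other. The three points are made up for illustration.

```python
import math
from itertools import combinations


def pairwise(points):
    """All pairwise Euclidean distances, in a fixed (sorted-id) pair order."""
    return [
        math.dist(points[a], points[b])
        for a, b in combinations(sorted(points), 2)
    ]


def rotate(points, angle):
    """Rotate every point around the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return {k: (c * x - s * y, s * x + c * y) for k, (x, y) in points.items()}


def mirror(points):
    """Reflect every point across the vertical axis."""
    return {k: (-x, y) for k, (x, y) in points.items()}


# Hypothetical 2D layout of three samples.
layout = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (0.0, 4.0)}
other = mirror(rotate(layout, math.radians(70)))

# The coordinates differ, yet every pairwise distance is preserved.
same = all(
    math.isclose(p, q, abs_tol=1e-9)
    for p, q in zip(pairwise(layout), pairwise(other))
)
print(same)
```

Both layouts are equally valid answers for the same distance matrix, which is one reason two scatterplots cannot be compared point by point.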

Bacteria rose

We decided to show the proximity of the samples literally, as length. To begin with, we imagined one of the samples in the center and the others placed around it, with the phylogenetic distance laid off along the radii. Color encodes the geographical origin of the sample (USA, China, Europe or Russia). The result is somewhat reminiscent of a wind rose, hence the name: the rose of bacteria. Around the circumference, the samples are grouped by country and city. Hovering over any sample shows a tooltip with its code name, region and exact distance. Clicking a sample moves it to the center, and the rest are rearranged at the corresponding distances from it.
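The layout described above can be sketched as a conversion from (angle, distance) to screen coordinates. This is a simplified illustration, not the code of the actual prototype: the function name `rose_layout`, the `scale` parameter, the even angular spacing and the sample data are all assumptions.

```python
import math


def rose_layout(center, others, distances, scale=100.0):
    """Place `others` around `center`.

    others:    list of (sample_id, region) tuples
    distances: {frozenset({id_a, id_b}): distance}
    Samples are ordered by region so that each region occupies a contiguous
    arc; the radius of each sample is its distance to the center sample.
    """
    ordered = sorted(others, key=lambda s: (s[1], s[0]))
    if not ordered:
        return {}
    step = 2 * math.pi / len(ordered)
    layout = {}
    for i, (sample_id, region) in enumerate(ordered):
        angle = i * step
        radius = distances[frozenset((center, sample_id))] * scale
        layout[sample_id] = {
            "region": region,
            "angle": angle,
            "x": radius * math.cos(angle),
            "y": radius * math.sin(angle),
        }
    return layout


# Hypothetical data: distances from the center sample "RU-01".
distances = {
    frozenset(("RU-01", "CN-01")): 0.5,
    frozenset(("RU-01", "EU-01")): 0.3,
    frozenset(("RU-01", "US-01")): 0.8,
}
others = [("US-01", "USA"), ("CN-01", "China"), ("EU-01", "Europe")]
layout = rose_layout("RU-01", others, distances)
for sample_id, pos in layout.items():
    print(sample_id, pos["region"], round(pos["x"], 1), round(pos["y"], 1))
```

Clicking another sample then amounts to calling the same function with a different `center`: the angles stay put and only the radii change.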

Now imagine that we lay the roses of different samples on top of each other. Each sample's radius then shows the distances from that sample to all the others. The same picture can be built for the samples of a specific region. For a specific species of bacteria, such a visualization shows:

1) in which microbiomes the bacterium is present (filled radii) and in which it is absent (blank radii);
2) the distances from a particular sample to the rest: minimum, maximum and distribution (the pattern of lines on the radius);
3) how the distance from a particular sample to the others depends on their geographical origin (the colors of the lines on the radius);
4) the overall pattern of the distribution over pairs of samples for this bacterium (the look of the rose as a whole);
5) the geographical patterns over pairs of samples for this bacterium (the regional roses).
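Points 1) to 3) of the list above correspond to a small per-sample summary. The sketch below shows one way to collect it; the function name `radius_profile`, the dictionary layout and the sample data are assumptions made for illustration.

```python
def radius_profile(sample, distances, regions):
    """Summarize one radius of the stacked rose.

    distances: {frozenset({id_a, id_b}): distance} for one species
    regions:   {sample_id: region}
    Returns the sorted (distance, region) marks drawn on the sample's radius,
    plus the minimum and maximum. A sample with no distances is treated as
    lacking the bacterium and gets a blank radius.
    """
    marks = []
    for pair, d in distances.items():
        if sample in pair:
            (other,) = pair - {sample}  # the second member of the pair
            marks.append((d, regions[other]))
    marks.sort()
    if not marks:
        return {"present": False, "marks": [], "min": None, "max": None}
    return {
        "present": True,
        "marks": marks,
        "min": marks[0][0],
        "max": marks[-1][0],
    }


# Hypothetical data: "EU-01" does not carry this bacterium.
regions = {"RU-01": "Russia", "CN-01": "China", "US-01": "USA", "EU-01": "Europe"}
distances = {
    frozenset(("RU-01", "CN-01")): 0.5,
    frozenset(("RU-01", "US-01")): 0.8,
    frozenset(("CN-01", "US-01")): 0.6,
}
print(radius_profile("RU-01", distances, regions))
print(radius_profile("EU-01", distances, regions))
```

The `marks` list is what ends up as colored lines on the radius: position from the distance, color from the region.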

Rose garden

We build such a rose for each species of bacteria. To see the whole microbiome, we collected thumbnails of the roses on one screen. The result is a whole "rose garden": it shows how much data has been collected for each bacterium and what patterns appear in the distributions. You can click any rose and proceed to a more detailed analysis.

Concrete conclusions from Dima and Borya

We asked our fellow scientists to tell us how this representation helps them in their work. Here is what they use it for.

Clustering by country. A number of bacteria cluster well by country; in the picture they look like single-color petals, one per country. The stripes on each petal show in what order the other countries are close to this one. In the example of the bacterium Eubacterium eligens, it is clearly seen that the Chinese samples are closest to other Chinese samples, and the Russian ones to other Russian ones. At the same time, the Americans are mixed with the Europeans.

Bacteria travelers. One such bacterium turned up in a Chinese microbiome: if you hover over the Chinese sample nearest to it, you see that many "European" and "American" samples fall inside the circle, that is, they are closer. So the bacterium in this sample could have come to China from Europe or America. More interestingly, Dialister invisus usually lives in the oral cavity. Oh, these international kisses :-)

Two different bacteria instead of one. The bacterium Barnesiella intestinihominis was discovered recently (in 2008), and we still do not know much about its varieties. But, judging by the picture, two distinct varieties can be told apart. The samples split into two groups, and proximity within each group is much greater than between the groups, even across different countries.

Quality control and artifacts. Sample SRS014979 has an unusually large number of marks high up on its radius, which means that it is far from all the other samples. To be honest, it is unlikely that one American bacterium has twice as many mutations as the others; an error in the data or in the calculations is much more likely. This is a good signal to look at the situation in more detail.

A dense garden. What we really liked was the rose garden: a good overview of the entire top level, with the ability to dig deeper. In a dense garden, we see roses with large and small numbers of bacteria of each species, and we also see at once in which populations the bacterium is present and where it has a divided structure (clusters).
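The quality-control observation above was made by eye, but the same idea can be sketched as a simple check: flag a sample whose typical distance to the others is much larger than is usual for the species. The function name, the use of medians and the `factor=1.8` threshold are arbitrary assumptions for illustration, not the researchers' method; the data are invented.

```python
from statistics import median


def flag_distant_samples(distances, factor=1.8):
    """Return ids of samples whose median distance to the other samples
    exceeds `factor` times the median of all per-sample medians.

    distances: {frozenset({id_a, id_b}): distance} for one species
    """
    per_sample = {}
    for pair, d in distances.items():
        for sample_id in pair:
            per_sample.setdefault(sample_id, []).append(d)
    if not per_sample:
        return []
    typical = {sid: median(ds) for sid, ds in per_sample.items()}
    baseline = median(typical.values())
    return sorted(sid for sid, m in typical.items() if m > factor * baseline)


# Hypothetical data: "X" is about twice as far from everyone as the rest.
distances = {
    frozenset(("A", "B")): 0.20,
    frozenset(("A", "C")): 0.25,
    frozenset(("B", "C")): 0.22,
    frozenset(("A", "X")): 0.50,
    frozenset(("B", "X")): 0.52,
    frozenset(("C", "X")): 0.48,
}
print(flag_distant_samples(distances))
```

A flagged sample is not necessarily wrong; as in the article, it is only a signal to re-check the data and the calculations for it.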

Live prototype: rosegarden.datalaboratory.ru

The visualization was built with D3.js. The information designer was Tanya Misyutina, and the developer was Damir Melnikov. Thanks to Dima Alekseev and Borya Kovarsky for an interesting task and for their active participation in solving it.
