Algorithmic problems in bioinformatics. A lecture at Yandex
We have already mentioned the Data & Science series of events several times: data analysts and scientists tell each other about their problems and look for ways to collaborate. One of the meetings was devoted to bioinformatics, a great example of a field with plenty of unsolved problems for developers.
Under the cut you will find a transcript of the lecture by Ignat Kolesnichenko, a graduate of Moscow State University and the School of Data Analysis. Ignat is now a lead developer of Yandex's distributed computing technology.
Who here is a programmer? About half the audience. My talk is for you.
The goal of the talk is this: I will explain exactly how the data is obtained and how it is structured. The story will go a little deeper than Andrei's. To do that, I will have to remind you of a little biology — at least, I hope the programmers here don't know biology yet.
Then I'll try to describe which algorithmically hard problems are being solved in this field, to get you interested, so you understand that it's a great area and that you should go into bioinformatics and help people solve hard problems. Obviously, biologists cannot cope with the algorithmic problems on their own.
I graduated from the Faculty of Mechanics and Mathematics (Mekhmat). I am a mathematician, and I work at Yandex. I am a programmer working on YT: we build internal infrastructure at Yandex, we have clusters of thousands of machines. All the data that Yandex collects is stored and processed in these clusters — petabytes of data — so I know a little about that. I am also a co-founder of Ebinom, where I handled the same kind of technical problem. My goal was to understand how these algorithms and programs work and to automate them as much as possible, so that a person could upload data to a site, click a button, have a cluster spin up, have everything computed quickly, and see the results.
While solving that technical problem, I had to figure out what kind of data was involved and how to work with it. Today I will share part of what I learned.
A video of Gelfand was posted on Facebook together with the announcement of this event. Who watched it? One person. Gelfand looks at you disapprovingly.
We will be working with DNA, but I want to give a little context around it, so we will start with the cell. Bioinformatics, as a science about molecular biology, is of course interested in all the processes in the cell, but today we will talk more about the technical side, about the data, and we will mostly care about the nucleus and the components near it.
The nucleus contains the DNA. It also has a nucleolus, which, together with the histones that Andrei showed, is in a sense responsible for regulation and epigenetics. The second important part: around the nucleus folds the so-called endoplasmic reticulum, which carries special molecular complexes called ribosomes. It is the ribosomes that make proteins from the pieces of RNA that are first read off from the DNA.
The third part is chromatin, the substance in which all the DNA floats. It is packed quite densely, but there are still many different regulatory molecules in between. And there is the membrane: protection, nutrition, the way the cell communicates with the outside.
There is much more in the cell, with other functions. We will not talk about those; you can read about them in textbooks.
Now about the information, DNA and RNA. There are two main molecules: deoxyribonucleic acid and ribonucleic acid. They are molecules of the same type. Watson and Crick once discovered the structure of this molecule. It is an interesting story; you can read Watson's autobiography, he writes about all of it very vividly. It is a double helix — the famous picture, the famous buzzword.
It consists of four kinds of fairly large molecules, each a complex of dozens of atoms: guanine, cytosine, adenine and thymine. In RNA, uracil takes the place of thymine: when RNA is made from DNA, the letter "T" becomes the letter "U". What matters for us is that DNA contains these four letters; by and large, they are what we will be working with.
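As a toy illustration: at the level of text, that letter substitution is all there is to it. A minimal sketch (real transcription involves the complementary strand, promoters, splicing and much more):

```python
# Toy sketch: "transcription" as pure text substitution (coding-strand convention).
# Real transcription reads the template strand and involves far more machinery.
def transcribe(dna: str) -> str:
    return dna.upper().replace("T", "U")

print(transcribe("ATGCCT"))  # AUGCCU
```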
DNA also has higher-order structure. It is not just a helix lying around twisted any which way. It winds around special protein complexes, histones. The histones in turn are twisted into fibers, from which the familiar X shape is built. Here is the beautiful picture you saw in biology textbooks. They usually don't mention histones — 15 years ago they certainly didn't — but they show the X shape. That is how it is arranged: the DNA is coiled up there in an intricate way.
Histones are involved in regulation. When you want to read some RNA off the DNA, it may not work, because that region is so twisted up, so hidden inside this structure, that it cannot be read from there. Or something changes, and it starts being read.
And there are various proteins that can attach to all of this, unwind it, and pull it out so it can be read. This higher-order structure is very important for understanding which parts of the DNA — the parts from which RNA and then proteins can be made — are available for reading.
What is a protein? The vitality of a cell and the diversity of all living things are largely due to proteins. The point is that proteins are extremely variable. It is a huge class of very diverse molecules that take part in an enormous number of processes. They have very different structures, which is what allows all this complex chemistry to happen inside the cell.
What do you need to know about proteins? The building block is the amino acid. There are 20 amino acids in our bodies. In nature, they say, there are a few more, but in our bodies there are 20. Each one also has its own Latin letter. A protein is drawn schematically on the right, with its beginning written out and the letters indicated. I am not a biologist, I don't remember what V or M stands for, but a biologist can tell you right away what the letters mean.
Protein structure has several levels. You can look at a protein simply as a sequence, and when you do bioinformatics you often look at it exactly that way, although Arthur is interested in the more complex secondary and tertiary structures.
We will look at it as a sequence of letters. As for secondary and tertiary structure: there are alpha helices, clearly visible in the picture, there are beta sheets, where the chain lines up into fairly flat pieces, and there are special segments that tie these alpha helices and beta sheets together.
Next, what Andrei talked about: the central dogma of molecular biology. It says that there are translation and transcription, and there are also many different reverse and "perpendicular" processes that move information between RNA, DNA and proteins in various directions. But today we will look only at translation and transcription. On the right is a fairly complete picture describing this in eukaryotic cells, in particular in humans.
What is a gene? Biology textbooks say it is a unit of heredity. For us, looking at the DNA, a gene is a not-so-simple region. It is not that the region is simply transcribed into RNA and then protein is made from that RNA — things are a bit more complicated. First, within this region there are parts called introns, which get cut out, and there are exons, which will later encode your protein. If you look at the DNA, at the 3 billion letters, then — this is the popular phrase — only 2% of it is "important". Only 2% is responsible for coding your genes, and that refers to the exons.
If you also count the introns, you get not 2% but about 20%. The remaining 80% are intergenic regions, which apparently participate to some extent in regulation. But some of it really is not needed; it is simply there for historical reasons.
What happens is this: in front of the gene there is a promoter, and a special molecule — a protein that will read your DNA and make RNA from it — sits down on that promoter.
Then the RNA is transformed by another molecular machine, which cuts the unneeded introns out of it. You get the finished messenger RNA, from which the protein is made by translation. Along the way the 3' and 5' non-coding ends are trimmed. When protein is made from RNA, there is a particular triplet, a sequence you can start from. Everything before it is simply skipped: the cell skips that part and starts working from the start codon.
I used the word codon. So what are codons, and how does the encoding work? The logic is simple. There are four characters in DNA: A, C, T, G. There are 20 characters in a protein, 20 amino acids. Somehow we must make 20 out of 4. Clearly a one-to-one mapping won't work; they have to be encoded somehow. From the point of view of people who work on coding, the scheme is very simple, a naive encoding we would easily have come up with ourselves. But from the point of view of biology it was not easy to pull off.
Three letters of DNA are read at a time, and each triplet of letters encodes a specific amino acid. There are 64 possible triplets, and 20 amino acids have to be encoded by those 64 variants, so the encoding is redundant: different triplets encode the same amino acid.
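As a toy sketch of how such a triplet code is read off in software, here is a fragment with only a handful of codons filled in (ATG, TGG, TTT/TTC and the three stop codons are standard entries of the genetic code; the other 57 entries are omitted here):

```python
# Partial codon table, for illustration only; the full genetic code has 64 entries.
CODON_TABLE = {
    "ATG": "M",               # methionine, also the usual start codon
    "TGG": "W",               # tryptophan
    "TTT": "F", "TTC": "F",   # phenylalanine: two triplets encode the same amino acid
    "TAA": "*", "TAG": "*", "TGA": "*",  # stop codons
}

def translate(dna: str) -> str:
    """Translate a coding DNA sequence triplet by triplet, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "?")  # '?' for codons missing from the toy table
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGTTTTGGTAA"))  # MFW
```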
That is all you need to know today about translation and transcription.
Andrei talked a lot about metagenomics and the various omics. Over the past 50-100 years scientists have been studying very actively what happens in the cell, what processes go on there, what the chemistry is, not just the biology. And they have been quite successful. They may not have a complete picture and understanding, but they know a great deal.
There are large maps of how different substances convert into one another. Most of the elements on them are proteins, and the maps describe what generally happens in a cell.
Here is a small piece of one; the full picture is large, and scientists love to draw it. What we see is an approximate, simplified scheme of everything that goes on in a cell.
It is important to understand that all of these molecules are more or less always present in the cell — the question is the concentrations. All the time your DNA unwinds somewhere, something is read from it, and what gets read depends largely on the concentrations of various proteins. The entire life of the cell is regulation: the concentration of something rises, one process starts running ahead of another, something gets translated more, a new protein appears, it takes part in some process, signals that enough of it has been produced, and gives rise to yet another protein that starts suppressing its production. Such feedback loops are present in the cell in large numbers, and that is roughly what is shown here.
So where does the data — all these A, C, T, G — come from? Andrei said there are sequencers that will read them for you. Let's look in more detail at how these devices work and what exactly they read.
There is the classic Sanger method, invented in the late 70s, which made it possible to launch the Human Genome Project in 1989. By 2001 the project was complete: the sequence of a "reference" human, about 3 billion letters, had been assembled. All 23 chromosomes were analyzed, and we learned which letters — A, C, T, G — sit where in each chromosome.
Sanger's method is good and reliable, scientists trust it, but it is very slow. In 1977 it was sheer horror: you worked in gloves with radioactive solutions and substances, behind a lead shield, pipetting something into test tubes. Three hours of these exercises later — provided you hadn't accidentally poured something into the wrong place — you got an image from which you could read off about 200 nucleotides of your DNA.
Naturally, the process was quickly automated, and in the course of automation the technology changed somewhat. By the mid-2000s a host of new technologies had appeared that made sequencing fast and cheap, without spending three hours of an expensive scientist's time on every 400 characters.
The latest technology Andrei talked about is nanopore sequencing. It seems to me the scientific community still has some doubt about whether this technology will become mainstream. But it certainly opens up new frontiers.
All the earlier technologies are considered fairly accurate; they have few errors, on the order of a percent or fractions of a percent. This technology has substantially more errors, 10-20%, but it lets you read long fragments, and in a number of applications that is exactly what is needed.
Whether it can be refined into a technology that also has few errors is unknown. But if they manage it, it will be very cool.
The overall picture of how sequencing happens is this. A sample is taken from you — blood, saliva, or they go in with special forceps and take a piece of bone marrow, a very painful, scary procedure. Then some chemistry is done on the sample that removes everything unnecessary and, roughly speaking, leaves only the DNA that was present in it.
Then the DNA is fragmented into small pieces, which are then amplified so that there are more copies of them — many technologies need this. Then it is fed to the sequencer, which reads it all in some way.
The output is always the same kind of FASTQ file. The devices all work differently: they produce different read lengths and different types of errors, and when you work with this data it is important to understand which types of errors occur and how to handle such data correctly. If you apply the same method to everything, you will get dirty data you cannot rely on.
Briefly about the Sanger method. The idea is very beautiful. You have a DNA sequence, you have amplified it, and it floats in your test tube. Now you want to figure out what it consists of. Special proteins are added — the whole DNA may be floating in there, along with the molecule that knows how to unwind it.
Scientists have a very simple approach. How did they learn to do such things with DNA? They looked at how nature does the same thing in our cells, figured out which proteins do it, isolated them, and now they can replicate DNA themselves: just take the proteins involved in replication and throw them into a test tube. Everything happens on its own.
They throw in these proteins, and they also throw in our A, C, T, G nucleotides. Among them there are special labeled ones which, once incorporated into the copy being read off our DNA, do not allow anything further to be attached. Throwing all this together, shaking it up and letting it stand a while, we get fragments of the source DNA we are interested in, of every possible length: copying starts at random, and the hope is that for each length, at some point, a terminating nucleotide gets incorporated and everything stops there.
Moreover, these special labeled nucleotides are fluorescent: if you shine light on them in a particular way, they glow green, blue, red or yellow. Then all of this is run through a gel, where the fragments separate by mass, they are illuminated, and you get a picture like the one on the right. It can then be read from top to bottom: the shorter fragments are higher, the heavier ones lower.
As a result we get a FASTQ file; what it looks like is shown on the right. It is a text format, which should already puzzle programmers: why store this as text when it could be stored compactly? That is just how it turned out; in practice the files are simply compressed with something like gzip. The file holds four lines per record: first comes the ID of a single read; then the letters that were read; then a technical line; and the fourth line is the quality. All devices can report how confident they are in what they read, and later, when working with the data, this has to be taken into account.
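A rough sketch of reading such a file, assuming the common conventions of four lines per record and Phred+33 quality encoding (both should be checked for any particular instrument; the file name is made up):

```python
import gzip

def read_fastq(path):
    """Yield (read_id, sequence, qualities) from a gzipped FASTQ file.
    Assumes 4 lines per record and Phred+33 quality encoding."""
    with gzip.open(path, "rt") as f:
        while True:
            header = f.readline().rstrip()
            if not header:
                break
            seq = f.readline().rstrip()
            f.readline()                              # the '+' technical line
            qual_line = f.readline().rstrip()
            quals = [ord(c) - 33 for c in qual_line]  # Phred quality scores
            yield header[1:], seq, quals              # strip the leading '@'

# Usage (hypothetical file name):
# for rid, seq, quals in read_fastq("sample.fastq.gz"):
#     print(rid, seq[:20], quals[:5])
```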
The essence of most methods is that we read random pieces many, many times, with high coverage, and then want to reconstruct what the original DNA looked like. Clearly, when people started the Human Genome Project they had no other assembled human genome; bacterial genomes had already been assembled by then. And it was a terrifying task: the human genome has 3 billion characters, you get a pile of pieces of length 300-400, and from them you now have to work out what sequence those pieces could have come from. It looks scary.
Now let's talk about algorithms. What problems are solved using NGS, and how are they solved from an algorithmic point of view?
The first problem is genome assembly. You could teach a semester-long lecture course about it; there are plenty of materials. I will only outline the general ideas, so you understand which algorithms are used here and why it is hard and interesting.
What is the genome assembly problem? Reads are given in FASTQ format, millions or tens of millions of them. Say you are assembling a single chromosome. You need to produce the genome or the chromosome you are trying to assemble.
How is this problem formulated from a programmer's point of view? You have short strings over the characters A, C, T, G, each with some quality indicators, weights. And there are also constraints on the answer. When biologists assembled the human genome, they knew that chromosome 11 carries such-and-such genes: through long work they had studied them and established by other means that they are there. You want to take that into account. It would be silly to assemble something and then discover that a gene we are sure about is missing, simply because the algorithm decided so.
You need to produce some superstring over all the short strings, and we want it to be as short as possible — otherwise we could simply concatenate all our fragments and get something that suits no one. Since all the input strings came from our chromosome, recovering that chromosome is what we are after. If the coverage is high enough — roughly speaking, there is a read starting at every shift — then the shortest superstring we assemble will be our genome. That is not always literally true; in this formulation there are various ways to lose or miss something. But it is a good enough approximation, and it suits everyone.
A simplified example of the problem. The total length of these strings is 16, and you can assemble a superstring shorter than that; it might look like this. We took the last string, CCTA, and glued the next-to-last one onto it, overlapping the letters TA — that saved two characters. Then we glued the first string on as well. The second string, TGAC, was unlucky and had to be appended as is. The example is artificial, of course; in the real world the overlaps will not look like this.
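The greedy idea behind this kind of gluing — repeatedly merge the pair of fragments with the longest suffix-prefix overlap — can be sketched in a few lines. This is only a crude approximation to the shortest-superstring problem and nothing like a real assembler; the fragments below are made-up toy strings:

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(fragments):
    """Greedily merge the pair with the largest overlap until one string remains."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, 0, 1)  # (overlap length, index i, index j)
        for i in range(len(frags)):
            for j in range(len(frags)):
                if i != j:
                    k = overlap(frags[i], frags[j])
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]   # glue j onto i, dropping the overlap
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return frags[0]

print(greedy_superstring(["TGAC", "ACCC", "CCTA", "TACC"]))  # toy input
```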
Now some bad news. Those of you who work as programmers and know your algorithms can open the standard Garey and Johnson textbook on computational complexity and find that the shortest common superstring problem is NP-hard. For those who don't know: in practice this roughly means that no polynomial algorithm can be expected; all known algorithms have exponential running time. Which looks like a complete nightmare — you have tens of millions of strings, how are you going to do anything in exponential time? Two to the power of ten million cannot be waited out within the lifetime of the Universe.
But there is good news: our input is far from the worst case, it is a very realistic one, and for it there is hope of solving the problem well enough.
The general approach is this. In fact, we will not be able to assemble the entire chromosome in one go. There are various reasons for that; it seems easier not to explain them but simply to accept that this is so.
When we assemble, we will get fairly large pieces — hundreds of thousands of characters, maybe even millions. Then we will have to glue them into the whole chromosome, on the order of 100 million characters. These contiguous pieces that we can assemble are called contigs. Then, using various secondary data, we try to understand globally how the contigs relate to each other and assemble them into a scaffold — which in our case is the chromosome, if we are assembling a human chromosome.
There are two algorithmic approaches. This was the late 80s and early 90s, the golden age for people working on string algorithms. They had built their whole theory back in the late 60s and early 70s, but nobody needed it. I was once at a talk by the man who invented the suffix tree. As a graduate student he came up with the suffix tree, wrote a paper, and then went into business. He told how, in the late 80s, biologists unexpectedly started writing to him, asking how this or that in his paper worked. He was very surprised and very happy for them, and quite inspired by the whole story.
What is all of this for? One of the algorithms is called Overlap-Layout-Consensus. We want to build an overlap graph on our reads, then simplify that graph by throwing out everything unnecessary, and from what remains work out how our contigs are laid out.
The overlap graph is the following thing. For every pair of reads — which already sounds a little crazy given our data volumes — we want to know how much they overlap.
Here are two reads. They overlap by 5 characters, and the overlap is shown in red.
If we build the graph of these overlaps, we can keep working with it and try to read our genome off this graph; how exactly is a separate question. The obvious problem is that we would have to compute an overlap for every pair of reads. Even with a million reads that is a million squared pairs — a crazy number. Building the graph would take years, and besides, it is unclear where you would store it: if the graph weighs a trillion bytes, it will not fit into any RAM.
Now go back to 1990: it would not even fit on disk, nobody could store that much, and no cluster would hold it either. But for an arbitrary pair of reads there is most likely no overlap at all — or an overlap of one, two or three characters, which is immaterial, and we do not want to consider those. We only care about significant overlaps, tens and hundreds of characters long.
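For illustration only, here is what building such a filtered overlap graph looks like in the naive, all-pairs way — precisely the quadratic pass that, as explained next, real tools avoid with smarter data structures. The reads and threshold are toy values:

```python
def suffix_prefix_overlap(a, b, min_len):
    """Longest suffix of a matching a prefix of b, if at least min_len long; else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(reads, min_overlap):
    """Naive O(n^2) construction, kept only for illustration."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = suffix_prefix_overlap(a, b, min_overlap)
                if k:
                    edges.append((i, j, k))  # read i's suffix matches read j's prefix
    return edges

reads = ["GATTACA", "TACAGGG", "AGGGTTT"]
print(overlap_graph(reads, min_overlap=3))  # [(0, 1, 4), (1, 2, 4)]
```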
Here you can use a data structure like the suffix tree. You build a single suffix tree over all your strings, over all your reads, and then for each read you trace a path through this suffix tree. That drastically cuts down the number of strings against which you look for overlaps.
Once you have built such a graph, each vertex will have, roughly, from 100 to 1000 edges. That is a lot — tens of billions of edges in total — but it can already fit in memory, or at least on disk, and you can work with it.
Having built the graph, we should simplify it a bit. The proposal is to discard all transitive edges. A transitive edge most likely corresponds to a weak overlap: between two weakly overlapping reads there was an intermediate read that overlaps well with both the first and the second. We throw the transitive edge away, the graph becomes simpler, and it is clearer what to do.
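A naive sketch of that cleanup: drop an edge a→c whenever there are edges a→b and b→c. Real assemblers do this more carefully, taking overlap lengths into account; the edge list below is a made-up example:

```python
def remove_transitive_edges(edges):
    """Drop an edge a->c whenever some b exists with a->b and b->c (naive version)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
    kept = []
    for a, c in edges:
        transitive = any(c in adj.get(b, ()) for b in adj.get(a, ()) if b != c)
        if not transitive:
            kept.append((a, c))
    return kept

edges = [("r1", "r2"), ("r2", "r3"), ("r1", "r3")]  # r1->r3 is transitive
print(remove_transitive_edges(edges))  # [('r1', 'r2'), ('r2', 'r3')]
```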
Once such edges are thrown out, the graph is already much simpler. With a graph like the one below it is, by and large, already clear what to do. If it were not for the tangle in the middle, we could say: just read it from left to right and get the answer. After all, we know the overlaps and we know which string corresponds to each vertex of the graph, so we can walk along it, read it off, and produce the string.
Such tangles have to be resolved somehow; there are many different, complex algorithms that do this.
That was a sketch of the idea of one algorithm. There is another; its application to this problem was proposed by Pavel Pevzner in 1989. He runs a laboratory in St. Petersburg that wrote its own genome assembler, now very popular and effective, and from it came a whole galaxy of St. Petersburg bioinformaticians who created the Bioinformatics Institute and many other wonderful projects.
The idea was to apply a purely discrete object, the de Bruijn graph. We look at strings of equal length, and if one overlaps another everywhere except for the first character of the first string and the last character of the second, we say there is an edge between them. These are strings that match almost completely, apart from the first and last characters.
It turns out that if you build such a graph, it always has an Eulerian path. Take a string and fix, say, k = 4: the vertices are the substrings of length k = 4 that occur in the string, with edges between consecutive ones; you get some graph. The string gets compressed somewhat, since a long string may contain repeats. But by reading off a suitable Eulerian path in this graph, you can restore the original string — not always exactly, because long repeats can sometimes get rearranged — yet on the whole you recover the string with good accuracy.
The idea is to apply all of this to genome assembly.
Here is a more concrete example of how a de Bruijn graph is built from a string and a given k. We took the short string from the slide, set k = 2, took its two-character substrings as the vertices, and built the transition graph on them.
Next, the proposal is to take all your reads, build a de Bruijn graph on each one, as on the previous slide, and merge them all into one big graph. If the same string corresponds to a vertex, there should be only one such vertex, even if it came from different reads.
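A minimal sketch of that construction: vertices are (k-1)-mers, every k-mer of every read contributes one edge, and identical vertices coming from different reads are merged automatically because they are the same dictionary key. The reads and k below are toy values:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph from reads: vertices are (k-1)-mers, and every
    k-mer in a read contributes an edge prefix -> suffix (multiplicities kept)."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

g = de_bruijn(["ACGT", "CGTA", "GTAC"], k=3)
for prefix, suffixes in sorted(g.items()):
    print(prefix, "->", suffixes)   # high coverage shows up as repeated edges
```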
So you build such a graph. The rest is clear: since the coverage density is high, there will be many identical edges, and they need to be glued together. The sad news is that your reads contain errors, which produce spurious branches; these will somehow have to be found and removed from the de Bruijn graph.
The second piece of bad news: you will most likely not find an Eulerian path in this graph, because it will not be Eulerian. This is handled by relaxing the problem slightly — after all, we are looking for contigs, so we are allowed to end up with disjoint pieces.
With the problem relaxed, the formulation changes. Instead of one Eulerian path we look for the smallest number of paths or cycles that together cover all the edges of the graph — preferably not merely covering them, but without overlapping. In other words, we simply decompose our graph into edge-disjoint pieces.
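A crude sketch of that relaxation, assuming a successor-list graph like the one built above: repeatedly walk along unused edges until you get stuck, and spell one contig per walk. This greedily produces some decomposition into edge-disjoint walks — not the smallest one — and ignores errors and repeats entirely:

```python
def contigs_from_debruijn(graph):
    """Greedy, non-minimal sketch: walk unused edges until stuck, one contig per walk.
    `graph` maps a (k-1)-mer to a list of successor (k-1)-mers (with multiplicities)."""
    remaining = {node: list(succs) for node, succs in graph.items()}
    contigs = []
    for start in list(remaining):
        while remaining.get(start):
            node, contig = start, start
            while remaining.get(node):
                nxt = remaining[node].pop()
                contig += nxt[-1]          # each traversed edge adds one character
                node = nxt
            contigs.append(contig)
    return contigs

# Toy graph with a branch, so two contigs come out.
graph = {"AT": ["TG"], "TG": ["GC", "GA"], "GC": [], "GA": []}
print(contigs_from_debruijn(graph))  # e.g. ['ATGA', 'TGC']
```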
That is everything I wanted to say about genome assembly. It is a rather complicated and interesting problem, quite specialized; it uses string algorithms and graph algorithms, and much can still be improved there. If you are interested, I urge you to read more detailed lectures and materials on assembly — I can suggest which ones.
A much more popular, but simpler, task is alignment. The idea is very simple: if one person's genome has already been assembled and you have now sequenced a second person, you do not need to assemble the second person from scratch. Two people differ from each other, in terms of the nucleotides of their DNA, by about half a percent — that is the order of magnitude; it varies slightly depending on whether the people come from the same population or different ones. So you can simply find where each read from your new person occurs in the reference genome that has already been assembled. You need to search allowing for differences, since you expect the organism you are studying to differ from the reference. Many differences will be simple point substitutions; there are also larger insertions and deletions. You just need to carefully search for your reads in the reference genome.
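A toy sketch of that "search with differences" idea: index the reference by k-mers, use a read's first k-mer as a seed, then extend and count mismatches. Real aligners use far more sophisticated indexes and scoring; the reference string and parameters here are made up:

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def align_read(read, reference, index, k, max_mismatches=2):
    """Seed with the read's first k-mer, then extend and count mismatches.
    Returns (position, mismatches) of the best candidate, or None."""
    best = None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best

reference = "ACGTTGCATGTCGCATGATGCATGAGAGCT"   # made-up toy reference
index = build_kmer_index(reference, k=4)
print(align_read("GCATGTC", reference, index, k=4))   # (5, 0): exact hit
print(align_read("GCATCATG", reference, index, k=4))  # (12, 1): hit with one substitution
```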
Alignment is actually what most bioinformaticians work with. They analyze, say, a person, align the reads against the human reference genome, and then look at the differences. Those differences are what people in genomic bioinformatics go on to study.
Next comes the variant calling problem. Once everything is aligned, you need to find the differences. There are also heterozygous variants, which have to be identified separately.
What math and programming is there in this? A lot of statistical methods. It might seem you could just take the consensus or look at the two most frequent letters, but — it is not that this works badly, it is that you can do better. People build various Bayesian models and classifiers that call variants correctly.
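To show why the naive approach is tempting, here is a toy frequency-based caller for a single pileup column — exactly the "just look at the most frequent letters" idea, with an arbitrary threshold. Real callers replace this with Bayesian models over base and mapping qualities:

```python
from collections import Counter

def call_column(ref_base, observed_bases, min_alt_fraction=0.2):
    """Toy frequency-based caller for one reference position.
    Counts letters only; real callers model base qualities and genotype likelihoods."""
    counts = Counter(observed_bases)
    total = sum(counts.values())
    alts = {b: c / total for b, c in counts.items()
            if b != ref_base and c / total >= min_alt_fraction}
    if not alts:
        return "ref"                               # no credible alternative allele
    alt, frac = max(alts.items(), key=lambda kv: kv[1])
    return f"het {ref_base}/{alt}" if frac < 0.8 else f"hom {alt}"

print(call_column("A", "AAAAAAAAAA"))   # ref
print(call_column("A", "AAAAACCCCC"))   # het A/C
print(call_column("A", "CCCCCCCCCA"))   # hom C
```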
Then there is a hard problem that Fedya knows very well: variant annotation. Why do people do it? For example, to diagnose hereditary diseases. For that they need to understand whether a variant leads to bad consequences or not. And here, unfortunately, no universal solution exists; there are many different methods. People try to model whether the protein will still work or not. But it often happens that a variant seems to have broken the protein, yet there are no bad consequences for the person, because a neighboring protein performs exactly the same function. Things merely started working a little worse, but globally nothing broke.
There are also statistical methods: people collect thousands and tens of thousands of genomes from different populations and notice, say, that at a certain position nothing but the letter C ever occurs. That probably means something; the letter C is not there by accident. Mutations happen at random, and if several letters were possible at a position, all of them would be found across the population, as long as they did not interfere with anything. If only one letter is ever seen, it means the other letters break something — carriers get sick or worse — so those letters simply cannot be there.
The gist is this: first, there is prediction of three-dimensional structure, and people try to make inferences based on it; second, there is statistics, and you must be able to analyze huge volumes of data. Ten thousand human genomes is some number of terabytes. By Yandex's standards, and mine, that is not much; by the standards of people used to working on one or five servers it is, on the contrary, a lot. And analyzing such data is hard.
Next comes machine learning. It tries to combine all these methods for the specific problem you are solving, to build a model that solves it best.
I will skip gene expression.
About the difficulties of bioinformatics. The important part I want to draw attention to is data processing. Everything matters, from the moment a sample is taken and the chemists and biologists in the lab work with it, to how you work with the data, with the reads. If you do something wrong somewhere, you will end up with a result claiming, say, that the letter C was replaced by the letter A at some position when in fact there was no replacement. You will draw conclusions from it, and it will be very sad if you publish a paper and it then turns out the replacement was spurious: it was not there, and you decided it was only because of an error in the analysis. Of course such things are double-checked — Sanger sequencing is used for verification — but you can only double-check one specific variant if your paper hinges on it.
And if you write a paper based on broad statistics over a huge number of variants and you have made a mistake somewhere in the analysis, that will be sad enough. Here you need a very good understanding of how everything works. Without knowing how the aligner or the variant caller works, it is hard to work with such data well. If you just take a standard tool and align with standard settings, you will probably get something wrong. There is hope for improvement: as Andrei believes, everything will be standardized, and people will say: with this kind of data do this, with that kind do that. Let's hope so. But for now it is not like that; for now you need to understand the principles of how it all works.
There is another interesting and important task. You sequenced DNA and got a sequence of characters — now what? To understand anything, we need to know very well where the genes are in the DNA and where its various regions are. There is a separate line of work here: for example, RNA is sequenced in order to improve the annotation, the understanding of where which genes are, how they are spliced, how everything works. For now this is largely manual labor; there is no good automation, though there are many attempts to build neural networks or complex classifiers that would simply look at the sequence and say: I look at this DNA sequence and recognize that this is a regulatory region.
Here I see a gene starting. And here, probably, an intron begins, because the variability is high and the letters run a certain way. But these are dreams. There are some results in this area; some things people can determine from the nucleotide sequence alone, but not everything. The task of reliably detecting genes, it seems, has not yet been solved.
There is also a technical problem which, unfortunately, biologists do not always solve well: storage, efficient processing and unification of data. Unfortunately, the world is set up so that every laboratory has its own cluster and its own data bank, and it is very hard for them to keep a whole team of system administrators and developers to look after everything. First, the tasks are domain-specific, and a system administrator you hire will know nothing about your domain. Second, it is quite hard to find people who understand all this alignment machinery well. So there is hope that someday there will be common repositories where everything can be uploaded, where you can come, browse, and search for a specific genome variant. There are already many such databases, but they are fragmented; each laboratory has its own, in its own format. Let's hope there will be unification and the problems will be solved better. In other words, if you are a programmer, there is work for you to do.
In conclusion: if you want to learn bioinformatics, you can start by watching lectures and reading. Or you can enroll in a specialized program, for example at the Higher School of Economics, in the Faculty of Computer Science. In 2016 a master's program that is, by and large, about bioinformatics opened there. It is called "Data Analysis in Biology and Medicine", but if you look at the courses, it is essentially bioinformatics.
There is also the Bioinformatics School, which the same people run together with the FBB. There is the Bioinformatics Institute in St. Petersburg. In general, there are many options. Thank you all.
Who is the programmer here? About half the audience. My report is for you.
The purpose of the report is as follows: I will tell you exactly how the data is taken and how it is arranged. The story will be a little deeper than that of Andrei . To do this, you will have to remind a little about biology. At least I hope that programmers do not know biology.
And then I’ll try to tell you which algorithmically complex problems are being solved in this area in order to interest you so that you understand that it’s great and you need to go into bioinformatics and help people solve complex problems. Obviously, biologists themselves can not cope with algorithmic problems.
I graduated from mehmat. I am a mathematician, I work in Yandex. I am a programmer doing YT, we are doing internal infrastructure in Yandex, we have a cluster for thousands of machines. All the data that Yandex collects is stored in these clusters, everything is processed, petabytes of data, in this I understand a little. I’m also a co-founder of Ebinom. There I was engaged in the same technical task. My goal was to understand how these algorithms, programs work, and teach them how to do it most automatically, so that a person could upload data to a site, poke a button, and he would have a cluster deployed, everything would be quickly calculated, minimized, and the data would have seemed to man.
While I was solving this technical problem, I had to figure out what kind of data was involved and how to work with them. Today I will tell part of what I learned.
On Facebook in the announcement of this eventGelfand's video was uploaded . Who watched it? One man. Gelfand looks at you disapprovingly.
We will work with DNA, but I want a little context around the DNA, so we will start with the cell. It is clear that bioinformatics as a science of molecular biology is interested in all the processes in the cell, but today we will talk more about the technical side, about the data and we will, of course, be interested in only the core and nearby components.
The core contains DNA. It has a nucleolus. It, along with the histones that Andrei showed, is in a sense responsible for regulation and epigenetics. There is a second important part - the ribosome, folds around the nucleus, the so-called endoplasmic reticulum. It has special complexes of molecules called ribosomes. It will be from your pieces of DNA, which you first subtract from DNA, to make proteins.
The third part is chromatin: all the substance in which all the DNA floats is found. It is packed quite tightly, but there are still many different regulatory molecules between it. Well, there is a membrane - protection, nutrition, the way the cell communicates.
There is much more in the cell, other functions. We will not talk about them, you can read in the textbooks.
Now about the information, DNA and RNA. There are two main molecules: deoxyribonucleic acid and ribonucleic acid. These are molecules of the same type. Watson and McCrick once discovered the structure of this molecule. An interesting story, you can read the autobiography, he writes very beautifully about everything. It is a double helix, it is a popular picture, a popular buzzword.
It consists of four larger molecules, complexes of dozens of atoms. These are guanine, cytosine, adenine and thymine. Moreover, in RNA instead of thymine will be uracil. When you make RNA from DNA, instead of the letter "T" you will get the letter "U". It is important for us to know that these four letters are contained in our DNA. We will work with them, by and large.
DNA has a higher structure. This is not just a spiral, which somehow lies, anyhow how twisted. It swirls around special complexes - histones. Then the histones are twisted into some fibers, from which the letter X is already built. Here is a beautiful picture that you saw in biology textbooks. Usually they don’t talk about histones, they didn’t say 15 years ago, but they show the letter X. It is arranged in this way, the DNA is twisted there in a cunning way.
Histones are responsible for regulation. When you want to read some kind of RNA with DNA, you may not succeed because it is so twisted, so hidden in this DNA that it cannot be read from there. Or maybe something will change and it will begin to be read.
And there are different squirrels that know how to attach to all this, know how to unfold it, pull it out for you to read. The secondary structure is very important to understand which parts of the DNA from which you can then make RNA proteins are available for reading.
What is protein? The main vitality of a cell, the diversity of all living things, is largely due to proteins. The fact is that protein has very high variability. This is a huge class of very diverse molecules that participate in a huge pile of molecules. They have very different structures, which allows all this complex chemistry to occur inside the cell.
What you need to know about proteins? The compound block is an amino acid. Amino acids in our body, in our world, 20 pieces. They say they are a little more, but in our body there are 20. They also have certain Latin letters. The protein is drawn schematically on the right, its beginning is written out, it is indicated which letters are present. I’m not a biologist, I don’t remember what V or M is, and biologists can immediately say what the letters mean.
The protein structure has several levels, you can look at it simply as a sequence, and when you are bioinformatics, you often look just like that. Although Arthur is interested in more complex secondary and tertiary structures.
We will look at it as a sequence of letters. Secondary and tertiary structures. There are such spirals, alpha spirals, they are clearly visible in the picture. There are still beta sheets, she knows how to line up in fairly flat pieces, and there are special pieces that collect these alpha spirals and beta sheets.
Further, what Andrei spoke of is the central dogma of molecular biology. It consists in the fact that there is translation and transcription, and there are many more different reverse processes, perpendicular, which also involve RNA, DNA and proteins in different directions. But today we will look only at broadcast and transcription. On the right there is a fairly complete picture describing this thing in eukaryotic cells, in particular in humans.
Do you have a gene? In biology textbooks they write that this is a unit of heredity. For us, when we look at DNA, a gene is some not so simple site. Not that it translates into pH, but then protein is made from RNA. Everything happens a little bit wrong. Firstly, in this section there are regions, introns, which are cut out. And there are exons that will later be encoded into your protein. If you look at DNA, at 3 billion letters, then - this is a popular phrase - there are only 2% of important areas there. Only 2% are responsible for coding your genes. And we are talking about exons.
If you take more introns, you will get not 2%, but 20%. And the remaining 80% are intergenic regions, apparently participating, to some extent, in regulation. But some are really not needed, just like that historically.
The following happens: you have a certain promoter in front of the gene, a special molecule sits on this promoter, a protein that will read and make RNA from your DNA.
Further from RNA, the transformation is carried out by another molecule, which cuts out unnecessary introns from it. You get ready-made messenger RNA from which you make your protein broadcast. In this case, 3 'and 5' are cut off - non-encoded ends. When you make protein from RNA, you actually have a triplet, some sequence that you can start with. Everything before her is simply skipped. The cell sits, it skips this, and begins to do everything from the start codon.
I used the word codon. And what are codons? How does encoding work? The logic is simple. There are four characters in DNA: A, C, T, G. There are 20 characters in a protein, 20 amino acids. We must somehow make 20. out of four. It is clear that this will not work directly, we must somehow encode them. From the point of view of scientists who are engaged in coding, the device is very simple, stupid coding, we would easily come up with this. But from the point of view of biology, this was not easy to accomplish.
You have three letters of DNA read, and each triplet of letters encodes a specific amino acid. Triplets of the letters 64, and 20 amino acids can be encoded in 64 variants. Encoding will be redundant. There are different triplets that encode the same thing.
This is all you need to know for today about broadcasting and transcription.
Andrei talked a lot about metagenomics, about different ohmics. Scientists over the past 50-100 years have been very actively studying what is happening in the cell, what processes are there, what kind of chemistry is there, and not just biology. And they have succeeded quite well in this matter. They may not have a complete picture and understanding, but they know so many things.
There are large maps of the conversion of different substances into each other. In particular, most of the elements are proteins that describe what generally happens in a cell.
Here is a small piece, but there is a big picture, scientists like to draw it. Before us is an approximate simplified scheme of everything that is done in our cell.
It is important to understand that all molecules are more or less always present in our cell - the question is in concentrations. All the time, your DNA unfolds somewhere, something is read from it, and what is read depends largely on what concentration different proteins have. And the whole life of the cell is regulation. You got something bigger, something went to do more than the other process, he began to translate something more, a new protein appeared, he took part in some process, said that it’s enough to produce me, he gave birth to a new protein, which began to interfere with its production. Such feedback-looped processes are present in large numbers in the cell and are generally presented here.
Where do the data come from - these A, C, T, G? Andrey said that there are sequencers that will read them to you. Let's understand in more detail how these devices are arranged and what they read.
There is a classic Senger method, which was invented in the late 70s, and which allowed in 1989 to begin the project "Human Genome". By 2001, the project was completed, they were able to assemble a sequence of some “reference” person, consisting of 3 billion letters. We analyzed all 23 chromosomes and realized what letters - A, C, T, G - are in each chromosome.
Senger's method is good and reliable, scientists trust him, but he is very slow. In 1977, it was a terrible horror, you worked with some radioactive solutions and substances in gloves, being behind a lead little thing, poured something somewhere in a test tube. Three hours after these exercises, if you accidentally didn’t pour something in the wrong place, you would get a picture from which you could recognize 200 nucleotides of your DNA.
It is clear that the process was quickly automated, and in the course of automation the technologies changed a little. By the mid-2000s, a host of new technologies had appeared that allowed sequencing to be done quickly and cheaply without spending three hours of the work of an expensive scientist for every 400 characters.
The last technology Andrei talked about is mininanopores. It seems to me that in the scientific community there is still some doubt about whether this technology will become mainstream or not. But it certainly opens up new boundaries for us.
All previous technologies are considered fairly accurate, they have few errors, we are talking about percentages and percentage fractions. There are substantially more errors in this technology, 10-20%, but you can read big ideas. In large applications, this seems to be really necessary.
Whether it will be able to finish off to a technology in which there will also be few errors is unknown. But if they can, it will be very cool.
The big picture about sequencing, about how this happens. They take a sample from you, a blood test, an analysis of saliva, or climb up with special tweezers, get a piece of bone marrow: a very painful scary procedure. Further along the sample, some chemistry is done, which removes all unnecessary and conditionally leaves only your DNA, which was present in the sample.
Then you do DNA fragmentation, get small fragments that you further propagate so that there are more of them. This is needed in many technologies. Then feed the sequencer, he reads everything in some way.
The output is always the same fastq file. All devices work in different ways, they have different lengths, different types of errors, and if you work with this data, it is important to understand what types of errors are there, how to work with such data correctly. If you apply the same method to everything, you will receive dirty data that you cannot rely on.
Briefly about the Senger method. The idea is very beautiful. You have a DNA sequence, you bred it, it floats in your test tube. Then you want to understand what it consists of. Special squirrels are fed there. Maybe all the DNA will even float there, but the molecule that knows how to disassemble it will also float.
Scientists have a very simple way. How did they learn to do such things with DNA? They looked at how nature does the same thing in our cells, realized what proteins they are doing, got them, and now they can replicate DNA. Just take the proteins that are involved in replication, and throw them into a test tube. Everything happens on its own.
They throw these squirrels, and also throw our ACTG. Among them there are special ones marked, which, when it is stuck to our DNA when reading, does not allow further gluing. Throwing this whole thing, shaking it up and letting it stand for a while, we get pieces of the unread source DNA that we are interested in. We stick something randomly, and the hope is that something sticks for each length, at this moment the stopping TTR will stick and everything will stop us.
More special tagged ACTG, fluorescent. If you shine on them in a special way, they will be green, blue, red or yellow. Then all this is thrown into a gel, there they are distributed by mass, they are highlighted and get a picture, as on the right. Further down it can be counted from top to bottom: shorter higher, heavier lower.
As a result, we get a fastq file. How it looks is drawn on the right. This is such a text format that programmers should already be perplexed - why store it in text form when you can compress it? So it happened. In fact, they are stored in such a file, they simply press it with some KZIP. The file contains data in four lines: first comes the ID of your one read, read. Further letters that were read there. Then some technical line, and the fourth line is quality. All devices are able to give out some indicators of how confident they are in what they read. Further, when you work with data, this must be taken into account.
The essence of most algorithms is that we read random pieces many, many times with a large coverage. Next we want to understand how the original DNA looked like. It is clear that when people started the Human Genome project, they really didn’t have any other assembled genomes. Bacteria then already learned to collect. And it was a terrible task. You have the human genome, it has 3 billion characters. You got a bunch of pieces of length 300-400, and from them you now need to understand what was the sequence from which such pieces could be obtained. It looks scary.
Let's talk about algorithms. What tasks are solved using NGS and how are they solved from an algorithmic point of view?
The first task is the assembly of the genome. About it you can read a six-month lecture course, there are a lot of materials. I will outline only general ideas so that you understand what algorithms are used here, why it is difficult and interesting.
What is the task of genome assembly? Reeds are given in fastq format, there are millions or tens of millions of them. For example, you collect one chromosome. And you need to get the genome or chromosome that you want to collect.
Now, from a programmer's point of view, how is this task formulated? You have short strings of ACTG characters, they have some indicators of their quality, weight. And when you collect them, there are still restrictions on the answer. When biologists collected the human genome, they knew that on the 11th chromosome there are such and such genes: as a result of long work, we examined them and in other ways found out that they are there. And I want to take this into account. It would be foolish to collect something, and then find out - for what we are sure, for some reason there is no, because the algorithm decided so.
You need to get some kind of superline over all short lines. And we want it to be as small as possible. Otherwise, we can simply glue all of our fragments and get something that no one suits. Since all the input lines were obtained from our chromosome, it means that we need to find this chromosome. If we did this with a sufficiently large coverage - conditionally, there is some line with each shift - and we will collect the shortest one, this will be our genome. The foregoing is not always true; there are various reasons why one can lose or miss something in such a formulation. But this is a sufficient approximation that suits everyone.
A simplified example of a task. You can put together a super string that will be shorter than their length 16, and it might look like this. We took the last line of CCTA, pasted the penultimate one to it, the letters TA from them overlapped. Thus, we saved two characters. Then they glued the first line to the speakers. The second line of TGAC was unsuccessful, it had to be glued just like that. It is clear that the example is artificial. In fact, you will not have such glues in the real world.
There is some bad news for you. Those who work as a programmer and versed in algorithms can open Gary Johnson's standard tutorial on the complexity of algorithms and find out that the super-string problem is NP-hard. Who does not know, in practice, this roughly means that no polynomial algorithm for solving the problem can be expected: all known algorithms have an exponential operating time. And this turns out to be complete horror: you have tens of millions of lines, and how will you do something in an exponential time? 2 to the extent of 10 million - this can not be expected during the life of the Universe.
But there is good news. we have not the worst input for the task, one of the very real ones. And for him there is a hope to solve the problem well enough.
The general approach to the solution is as follows: in fact, we will not be able to immediately assemble the entire chromosome at once. There are various reasons for this. It seems easier not to explain them, but to accept that it will be so.
When we start collecting, we will collect large enough pieces of hundreds of thousands of characters, maybe even millions. Next, we will have to glue them into the entire chromosome of 100 million in size. These continuous pieces that we can collect are called contigs. Then, according to various secondary data, we globally try to understand how the contigs relate to each other, and collect them with scaffold. In our case, we are talking about the chromosome, if we collect the human chromosome.
There are two algorithmic methods. It was the end of the 80s and the beginning of the 90s, the golden age of people who were engaged in string algorithms. They came up with their whole theory and science in the late 60s and early 70s, but nobody needed it. I was even at the report of the man who invented the suffix tree. He was only a graduate student, in the graduate school he came up with a suffix tree, wrote an article, and then went into business. And he told how, in the late 80s, biologists unexpectedly began to write letters to him and ask how this and this works in your article. He was very surprised, very happy for them. Very inspired by this case.
Why is everything needed? One of the algorithms is called Overlap-Layout-Consensus. We want to build an overlay-graph on our pieces, then we want to simplify this graph by throwing out all that is unnecessary, and then we will generally understand how our contigs will be arranged.
An overlay graph is the next thing. We have two of our reads, we want for each - which already sounds a bit crazy given our data - the pairs of our reads to understand how much they overlap.
There are such readings. They overlap by 5, and red shows where they overlap.
If we build a graph of these overlaps, we can continue to work with it, try to read our genome from this graph. The question of how to do this is a separate one. Obviously, the task immediately arises for each pair of reads to build an overlap. We have reads even if a million, a million squared. A crazy number, building a graph will work for years, and in addition, it is not clear where you will store it. If the graph will weigh a trillion bytes, it will not fit into any RAM.
And now, come back to 1990, and you will realize that you don’t even fit on a disk, you didn’t know how to store so much. And even no cluster will fit. Therefore, for an arbitrary pair of reads, most likely, there is no overlay. One, two or three is immaterial. And we don’t want to be interested in such. We want to be interested only in significant overlay, in tens and hundreds of characters.
You can use a data structure such as a suffix tree. You can build a suffix tree on all your lines, on all your reads, such a unified one. And then for each read in this suffix tree to find some way. It will greatly reduce the number of lines with which you find overlay.
When you build such a graph, each vertex will have, roughly, from 100 to 1000 edges. That is a lot, tens of billions of edges in total, but it can already fit in memory, or at least on disk, and you can work with it.
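Here is a rough sketch of the same filtering idea, using a plain k-mer index instead of a suffix tree (my simplification, not the structure from the talk): only reads that share a sufficiently long exact substring are ever compared, so almost all of the reads-squared pairs are never touched.

```python
from collections import defaultdict

def overlap_graph(reads, min_overlap=20):
    """Build edges (i, j) -> overlap length for pairs whose overlap is significant."""
    index = defaultdict(set)                    # k-mer -> ids of reads containing it
    for i, r in enumerate(reads):
        for p in range(len(r) - min_overlap + 1):
            index[r[p:p + min_overlap]].add(i)
    candidates = set()                          # only pairs that share some k-mer
    for ids in index.values():
        for i in ids:
            for j in ids:
                if i != j:
                    candidates.add((i, j))
    edges = {}
    for i, j in candidates:
        k = max_overlap(reads[i], reads[j], min_len=min_overlap)  # from the sketch above
        if k >= min_overlap:
            edges[(i, j)] = k
    return edges
```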
Having built this graph, we should simplify it a bit. It is proposed to throw out all transitive edges. Most likely a transitive edge means a weak overlap: between two weakly overlapping reads there was an intermediate read that overlaps well with both the first and the second. We throw the transitive edge away, the graph becomes simpler, and it is clearer what to do.
If such edges are thrown out, the graph simplifies considerably. From a graph like the one below it is, in general, already clear what to do. If it weren't for the tangle in the middle, we could say that we simply read it from left to right and get our answer. After all, we know the overlaps and we know which string corresponds to each vertex of the graph, so we can walk along it, read it off, and produce the string.
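A hedged sketch of the transitive-edge removal just described (real assemblers are more careful, but the core idea is: if A overlaps B, B overlaps C, and A also overlaps C, the A-to-C edge can be dropped):

```python
from collections import defaultdict

def remove_transitive_edges(edges):
    """edges: dict (u, v) -> overlap length, as built above.
    Drop (u, v) whenever some intermediate w gives both (u, w) and (w, v)."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    return {(u, v): olap
            for (u, v), olap in edges.items()
            if not any(v in succ[w] for w in succ[u] if w != v)}
```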
Such things have to be somehow resolved. There are many different complex algorithms that do this.
That was a sketch of the idea of one algorithm. There is another; its application to this problem was proposed by Pavel Pevzner in 1989. He runs a laboratory in St. Petersburg which wrote its own genome assembler, now very popular and effective, and from it came a whole galaxy of bioinformaticians in St. Petersburg who founded the Bioinformatics Institute and many other wonderful projects.
The idea was this: apply a purely discrete object, the de Bruijn graph. We look at strings that are all of the same length, and if two of them match everywhere except the first character of the first string and the last character of the second, we say there is an edge between them. These are strings that coincide almost completely, except for the first and last character.
It turns out that if you build such a graph, it will always have an Eulerian path. If you took a string, fixed, say, k = 4, said that the vertices are the substrings of length k = 4 occurring in the string, and drew edges between consecutive ones, you would get some graph. Your string gets compressed in a sense, since the long string may contain repeats. But by reading off a suitable Eulerian path in this graph you can restore the original string — not always exactly: because of long repeats some pieces may get rearranged, something may happen to them. But on the whole, with good accuracy, you recover your string.
The idea is to apply this all to genome assembly.
Here is a more concrete example of how a de Bruijn graph is built from a string and a given k. We took the string AAABBVA, took k = 2; here the vertices are substrings of two characters, and we built the transition graph on them.
Next it is proposed to take all your reads, build a de Bruijn graph on each of them, as on the previous slide, and merge them all into one large graph. If the same string corresponds to a vertex, it should become a single vertex, even though it came from different reads.
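A minimal sketch of this construction (one common convention, which I assume here: vertices are (k-1)-mers, every k-mer from a read becomes an edge, and identical edges from different reads collapse into one edge with a multiplicity):

```python
from collections import Counter

def de_bruijn_graph(reads, k):
    """Return a Counter {((k-1)-mer, (k-1)-mer): multiplicity} over all k-mers in the reads."""
    edges = Counter()
    for r in reads:
        for p in range(len(r) - k + 1):
            kmer = r[p:p + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges

g = de_bruijn_graph(["ACGTACG", "GTACGTT"], k=3)
# The k-mer GTA occurs in both reads, so the edge ('GT', 'TA') has multiplicity 2.
```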
You build such a graph. The rest is clear: since the coverage density is high, there will be many identical edges, and they need to be glued together. The sad news is that you also have errors, which will give you extra branches. They will have to be found and removed from this de Bruijn graph somehow.
The second piece of bad news: most likely you will not find an Eulerian path in this graph, because it simply will not be Eulerian. But this is dealt with by relaxing the problem slightly. After all, we are looking for contigs, so we are allowed to end up with disjoint pieces.
If we relax the problem, it is formulated differently. We no longer look for a single Eulerian path; instead we look for the smallest number of paths or cycles that together cover all the edges of the graph and, preferably, do not intersect. In other words, we simply decompose our graph into edge-disjoint cycles.
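For the idealized case where an Eulerian path does exist, the traversal itself is simple; below is a sketch of Hierholzer's algorithm over the edge multiset built above, plus gluing the path back into a string (the error correction and the cycle-decomposition relaxation just mentioned are left out):

```python
from collections import Counter, defaultdict

def eulerian_path(edges):
    """edges: Counter {(u, v): multiplicity}. Returns one Eulerian path as a list of
    vertices, assuming the (connected) graph actually has such a path."""
    adj, out_deg, in_deg = defaultdict(list), Counter(), Counter()
    for (u, v), mult in edges.items():
        adj[u].extend([v] * mult)
        out_deg[u] += mult
        in_deg[v] += mult
    # Start at a vertex with one more outgoing than incoming edge, if there is one.
    start = next((u for u in adj if out_deg[u] - in_deg[u] == 1), next(iter(adj)))
    stack, path = [start], []
    while stack:                     # walk until stuck, then backtrack and record
        u = stack[-1]
        if adj[u]:
            stack.append(adj[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """Glue the (k-1)-mer vertices of a path back into the assembled string."""
    return path[0] + "".join(v[-1] for v in path[1:])
```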
That is everything I wanted to say about genome assembly. Assembling a genome is a rather complicated and interesting problem, and a rather specialized one; it uses string algorithms and graph algorithms, and much can still be improved there. If you are interested, I urge you to read more detailed lectures and materials about assembly; I can suggest which ones.
A much more popular, but simpler, task is alignment. The idea is very simple: if you have already assembled one person and have now sequenced a second person, you do not need to assemble the second person from scratch. Two people differ from each other, in terms of the nucleotides of their DNA, by about half a percent — the order of magnitude is roughly that; it can vary slightly depending on whether the people come from the same population or different ones. So you can simply take the reads of your new person and find where they occur in the reference genome that has already been assembled. You need to search allowing for differences, since you expect the organism you are studying to differ from the reference. Many differences will be simple point substitutions; there are also larger insertions and deletions. You just need to carefully search for your reads in the reference genome, allowing for these differences.
In fact, alignment is what most bioinformaticians work with. They analyze, say, a person, align the reads against the human reference genome, and then look at the differences. Those differences are what people in genomic bioinformatics go on to study.
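A toy illustration of the seed-and-extend idea behind alignment (real aligners use compressed indexes of the reference and handle insertions and deletions; here, as a simplification of my own, only substitutions are counted and the first characters of the read serve as an exact seed):

```python
def align_read(read, reference, seed_len=12, max_mismatches=3):
    """Return (position, mismatch count) for places where the read maps to the reference."""
    seed, hits = read[:seed_len], []
    pos = reference.find(seed)
    while pos != -1:
        window = reference[pos:pos + len(read)]
        if len(window) == len(read):
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append((pos, mismatches))
        pos = reference.find(seed, pos + 1)
    return hits
```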
Next comes the task of variant calling. Once everything is aligned, you need to find the differences. There are also heterozygous variants, and they need to be found separately.
What math and programming are there? A lot of statistical methods are used here. It seems you could just take the consensus, or look at the two most frequent letters, but that works... no, not badly, but you can do better. People build various Bayesian models and classifiers that call variants correctly.
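A toy version of such a Bayesian model for one position (my own simplified sketch, not any particular caller: reads are assumed independent, with a fixed per-base error rate, and only three diploid genotypes for a hypothetical reference letter C and alternative letter A are compared):

```python
from math import exp, log

def genotype_posteriors(pileup, error=0.01, prior_het=1e-3):
    """pileup: the bases observed at one position, e.g. 'CCCCACCC'."""
    alt = 'A'                                 # hypothetical alternative allele; reference is 'C'

    def loglik(p_alt):                        # p_alt = fraction of the alt allele in the genotype
        p_obs_alt = p_alt * (1 - error) + (1 - p_alt) * error
        return sum(log(p_obs_alt if b == alt else 1 - p_obs_alt) for b in pileup)

    hyps = {'C/C': (loglik(0.0), 1 - 2 * prior_het),
            'C/A': (loglik(0.5), prior_het),
            'A/A': (loglik(1.0), prior_het)}
    unnorm = {g: exp(ll) * prior for g, (ll, prior) in hyps.items()}
    total = sum(unnorm.values())
    return {g: p / total for g, p in unnorm.items()}

print(genotype_posteriors("CCCCACCC"))        # the lone A is most likely just an error
```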
Then there is a difficult task, which Fedya knows very well — variant annotation. Why do people do it? For example, for the diagnosis of hereditary diseases. For that they need to understand whether a variant will lead to bad consequences or not. And here, unfortunately, no universal solution exists. There are many different methods. People try to model and understand whether the protein will still work or not. But it often happens that a variant seems to have broken the protein, yet there are no bad consequences for the person, because there is a neighboring protein that performs exactly the same function. Things just started working a little worse, but globally nothing broke.
There are also statistical methods, where people collect thousands and tens of thousands of genomes from different populations and discover that in a given position nothing but the letter C ever occurs. That probably means something: the letter C is not there by accident. Mutations in the body happen randomly, so if several letters are possible in a position and none of them interferes with anything, then all of them will show up somewhere in the population. And if only one letter is ever seen, it means the other letters interfere with something, break something — people with them get sick or something else goes wrong. Another letter simply cannot be there.
The main point is this: first, there is prediction of three-dimensional structure, which people try to base conclusions on, and second, there is statistics, so we must be able to analyze huge amounts of data. Ten thousand human genomes are some number of terabytes. By Yandex's standards, and mine, that is not very much, but by the standards of people used to working on one or five servers it is, on the contrary, a lot. And analyzing such data is hard.
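At its core, the statistic described above is just allele counting across many genomes; here is a trivial sketch, with a made-up input format of one observed base per position per genome:

```python
from collections import Counter

def per_position_alleles(genomes):
    """genomes: equal-length strings of observed bases. Count letters per position."""
    return [Counter(column) for column in zip(*genomes)]

counts = per_position_alleles(["ACGT", "ACGT", "ACCT", "ACGT"])
# counts[2] == Counter({'G': 3, 'C': 1}): this position tolerates variation, whereas
# a position where only one letter ever occurs is a candidate for being constrained.
```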
Next comes machine learning. It tries to combine all these methods for the specific task you are solving, to build a model that solves the problem as well as possible.
I will skip the topic of expression.
A few words about what makes bioinformatics hard. The important part I want to draw attention to is data processing. Everything matters, from the moment a sample is taken and chemists and biologists work with it in the lab, to how you work with the data, with the reads. Because if you do something wrong somewhere, you will end up with data in which you believe the letter C has been replaced by the letter A when in fact there is no such replacement. You will draw conclusions from it, and it will be very sad if you publish a paper and it then turns out that the replacement was false and was never there: an error somewhere in the analysis made you decide it was. Of course such things are double-checked — Sanger sequencing is used for verification — but you can only double-check one specific variant, the one your paper is built on.
And if you write a paper that is based on massive statistics over a huge number of variants, and you have made a mistake somewhere in your analysis, it will be sad indeed. Here you need a very good understanding of how everything works. Without knowing how the aligner or the variant caller works, it is hard to work well with such data. If you just take a standard tool and align in the standard way, you will most likely get something wrong today. There is hope that this will change: as Andrei believes, everything will be standardized and people will say, with this kind of data do this, with that kind of data do that. Let us hope. But for now everything is different, and for now you need to understand the principles behind how it all works.
Then there is another interesting and important task. You have sequenced the DNA and obtained a sequence of characters — and then what? To understand anything, we need to know very well where the genes are in the DNA and where its various regions lie. There is a separate field for this; for example, RNA is sequenced in order to improve the annotation and the understanding of where which genes are, how they are spliced, how everything works. For now this is largely manual labor, there is no good automation, but there are many attempts to build neural networks or complex classifiers that would simply look at a DNA sequence and say: this is a regulatory region.
Here, a gene begins. And here an intron has probably started, because the variability is high and the letters run this way. But these are dreams. There are certain results in this area, some things people can determine from the nucleotide sequence alone, but not everything. The task of reliably detecting genes, it seems, has not been solved yet.
There is also a technical problem which, unfortunately, biologists do not always solve well: storage, efficient processing and unification of data. Unfortunately, the world is organized so that every laboratory has its own cluster and its own data bank, and obviously it is very hard for each of them to keep a galaxy of system administrators and developers who will keep an eye on everything. Firstly, the tasks are specialized, and the system administrator you hire will know nothing about your specifics. Secondly, it is also quite hard to find people who understand all this alignment business well. So there is hope that someday there will be common databases into which everything can be uploaded, where you can come, browse, and search for a specific genome variant. There are already many such databases, but they are diverse: each laboratory has its own, in its own format. Let us hope there will be unification and the tasks will be solved better. In other words, if you are a programmer, there is plenty to do.
In conclusion: if you want to learn bioinformatics, you can start by watching lectures and reading. Or you can go and study at a specialized institution, for example at the Faculty of Computer Science of the Higher School of Economics. In 2016 a master's program that is, by and large, about bioinformatics opened there. It is called "Data Analysis in Biology and Medicine", but if you look at the courses, it is essentially about bioinformatics.
There is also a school of bioinformatics, which the same people run together with the FBB, and there is the Bioinformatics Institute in St. Petersburg. In general, there are many different options. Thank you all.