RNAInSpace and tRNA folding - closing the season, opening a new one - Structural alignment
And now, after a little less than a year and a half, I have managed to assemble the tertiary structure of tRNA. As a reminder, I wrote about this earlier on Habr in "Development of RNAInSpace, the CRA algorithm, problems of the code on Linux and others". I must admit that for about a year I did not work on this, but during that time my second scientific article on the topic, “Application of game theory to the problem of folding ribonucleic acids”, was published (this is for those who want to discuss it professionally). Now I can finally say that I have obtained the tertiary structure of tRNA and compared it with a reference structure from the PDB database, which was determined experimentally (by crystallography).
Under the cut, there are drawings of the 3D structure of tRNA, explanations and plans for the future.
Tertiary structure of tRNA - results


I could have made a video of the folding, but I was too lazy, and it would not show much anyway. As an example you can look at this one , then this one , and after that folding turns the chain into the tRNA shown in the figure.
The figures show the tRNA from two viewing angles. Green is the model I obtained; red is the model from the PDB database. For the experts: RMSD = 6.71 (a measure of how closely the two models match; the lower the value, the more similar they are). As you can see, the overall shape is almost the same. In my model almost all of the standard hydrogen bonds have also formed, and the noncanonical hydrogen bonds are close to forming.
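For readers who have not met the measure, here is a minimal sketch of how RMSD is computed over matched atoms. It is only an illustration, not the code that produced the value above: it assumes the two models are already superimposed and list their atoms in the same order, and the function name `rmsd` and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    Assumption: both models are already superimposed and atom i of one
    model corresponds to atom i of the other.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("both models need the same non-zero number of atoms")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the two matched atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy data: two "models" of two atoms each (hypothetical coordinates).
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(model, reference))
```

In practice the two structures are first superimposed (rotated and translated for the best fit) before the deviation is measured; that step is left out of this sketch.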
It should be noted (as a reminder for those who have already read my articles) that I obtain the tertiary model from the primary structure alone (so-called de novo prediction), predicting the hydrogen-bond sites where possible and finding the critical stacking sites. If there is interest, I am ready to explain the details and discuss these results.
Season closure
Having brought this line of work to a logical conclusion, I would like this article to close the series I have written on Habr. In essence, I have achieved my goals. Here is a brief recap:
1. My first article on the subject appeared online in 2009. It poses the folding problem in the spirit of cybernetic ideas.
2. Next, I tried to develop an open project on Wikiversity .
The main thesis was as follows: "serious results can be obtained even if you know only a certain minimum and have no specialized education in biology, physics, or chemistry." I now have no doubt that I obtained serious results, and that the method I arrived at is superior to all other methods that currently exist.
So, gentlemen, do not be afraid to start. Along the way you will meet plenty of opposition and criticism from people who know little about the subject but are eager to show off their erudition. Once there are results, they will have to retreat.
3. I had to abandon many of the modern approaches in this field. Sometimes I even had the feeling that the methods were being used not to solve the problem but to demonstrate how this or that method works. If at first I had hopes for some methods, including artificial intelligence methods, it turned out that they were no good. Only the general ideology of game theory and the agent-based approach proved suitable. Beyond that, it comes down to particular heuristics for finding the objective function (of course, in more detail there are some small refinements in the algorithms I developed, but that level of immersion is beyond this article).
4. Two articles in peer-reviewed journals are, for me personally, enough on this topic. Thanks for your attention.
5. In effect, I have developed a method and an approach; the rest is up to engineering and to followers.
6. That brings me to the question “what for, and why?” More on this in the next section.
"The difference between living and non-living"
Even that first article gave an answer to the question of why one should study the three-dimensional structure of RNA (beyond its being interesting in itself and possibly useful to biologists).
We have a clear biological task: "To find out exactly which changes in the three-dimensional structure of a 50-100 nucleotide RNA chain, and to what extent, fundamentally affect whether that RNA chain is a ribozyme." In other words, which mutations of a ribozyme improve or worsen its ability to self-replicate, up to losing it altogether. Put in popular terms, this would be a detailed answer to the question of what distinguishes the living from the non-living.
Of course, looking back, this is somewhat naive. Nevertheless, it carries a certain meaning, which I will try to clarify.
I have noted more than once that the modern theory of sequence alignment is essentially flawed: it lets you fit the results rather than obtain a true picture. I also wrote, in “Genomes of sequenced organisms - errors in databases” , that annotations in biological databases contain many errors, and those who work there were forced to agree.
Now, looking back, I can say that in my first article, without yet knowing the essentials of bioinformatics, I “placed a bet” on so-called structural alignment. This is a way of finding genes in a genome, and then comparing genomic sequences, that takes into account NOT the mutations of individual nucleotides and their statistics, but the tertiary structure of functionally similar genes.
Indeed, my approach to obtaining a tertiary structure now lets me judge whether a given nucleotide sequence can fold into one structure or another. This makes it possible to understand which parts of the nucleotide sequence must be conserved and where mutations are possible.
All this information, which really does affect the ability of a tRNA, a ribozyme, or any other RNA structure to function, is not used in plain sequence analysis (alignment). So there will inevitably be errors that are not even noticeable to a researcher who pays no attention to the function of the tertiary structure. The statistical approach now universally used for this only obscures the issue further.
Now that we know the tertiary structure (approximately), we can build what I will call a functional profile, for example of tRNA. After that, and only after that, can we locate all the tRNAs in the DNA with sufficient accuracy.
But building this functional profile is not so simple. It turns out that very few regions are 100% conserved: almost everything can vary in absolute terms. To see this, consider an example with tRNA.
Let us compare two tRNAs:
gcgcggauagcucagucgguagagcaggggauugaaaauccccguguccuugguucgauuccgaguccgcgc
gcggauuuagcucaguugggagagcgccagacugaagucuggagguccuguguucgauccacagaauucgca
Try to align these two tRNAs and say how they differ. In reality the problem is much harder: the sequences are not isolated as in this example, but buried among millions of similar g, c, a, u characters, and we do not know where the tRNA we need is.
You can, of course, waste your time aligning these characters and guessing where gaps and insertions occurred during mutation.
But we can do something simpler: find the hydrogen bonds, starting with the classical ones at least. We get:
((((((((.. ((((........)))). (((((((...)))))) ..... ((((((.......))))))))))))))
((((((((.. ((((........)))). (((((((....)))))) ..... (((((.......)))))))))))).
Isn't that more fun? It turns out the difference is not so great. We need to allow tolerances of plus or minus 1-3 dots (unpaired nucleotides) and 1-3 pairs of brackets (nucleotides paired by a hydrogen bond). For greater accuracy, we can additionally match the non-canonical hydrogen bonds (which stabilize the structure at the 3D level).
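The tolerance idea can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the comparison used in RNAInSpace: each bracket profile is collapsed into runs of identical symbols, and two profiles count as similar when the runs appear in the same order and their lengths differ by no more than the tolerance. The function names and the short hairpin strings are hypothetical.

```python
from itertools import groupby


def run_lengths(profile):
    """Collapse a dot-bracket string into (symbol, length) runs, ignoring spaces."""
    cleaned = profile.replace(" ", "")
    return [(symbol, len(list(group))) for symbol, group in groupby(cleaned)]


def similar(profile_a, profile_b, tolerance=3):
    """True if both profiles have the same sequence of runs and every
    run length differs by at most `tolerance` symbols.

    Simplification: a run that disappears entirely in one profile
    (length 0) makes the profiles count as different.
    """
    runs_a = run_lengths(profile_a)
    runs_b = run_lengths(profile_b)
    if len(runs_a) != len(runs_b):
        return False
    return all(
        sym_a == sym_b and abs(len_a - len_b) <= tolerance
        for (sym_a, len_a), (sym_b, len_b) in zip(runs_a, runs_b)
    )


# Two hypothetical hairpins: the stem is one pair shorter, the loop one dot longer.
print(similar("((((....))))", "(((.....)))"))
```

Comparing run lengths rather than characters is what makes the shifted positions harmless: a stem that is one pair shorter still lines up with its counterpart.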
Of course, it is still hard to find these structures among millions of g, c, a, u characters. But there is a guide here. We split the task into parts and look not for all tRNAs but for those that carry phenylalanine. Since we know for certain that the sequence gaa sits in the center, we can search the genome for every sequence that has gaa in the middle and also matches the corresponding profile:
((((((((.. ((((........))). (( ((((((gaa))))))) ..... ..... (((((.......))))))))))))
((((((((. . (((((........)))). ((((((gaa.))))) ..... ..... (((((.......) )))))))))))).
All of this within the allowable tolerances of the structure.
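As a rough sketch of the first step of such a search, the code below scans a sequence for gaa and keeps only those positions where the flanking nucleotides can form a short stem around it. This is my simplified stand-in for the full profile, not the actual search procedure: the 5-pair stem, the 1-3 unpaired nucleotides on each side of gaa, the counting of g-u wobble pairs, and all function names are assumptions made for illustration.

```python
# Pairs treated as complementary; including g-u wobble is an assumption.
PAIRS = {("g", "c"), ("c", "g"), ("a", "u"), ("u", "a"), ("g", "u"), ("u", "g")}


def stem_pairs(five_prime, three_prime):
    """Count complementary pairs between the 5' arm and the reversed 3' arm."""
    return sum((a, b) in PAIRS for a, b in zip(five_prime, reversed(three_prime)))


def find_anticodon_candidates(genome, anticodon="gaa", stem_len=5,
                              gaps=(1, 2, 3), min_pairs=4):
    """Return (position, left_gap, right_gap) for every anticodon occurrence
    whose flanks can form a stem of at least `min_pairs` pairs.

    left_gap / right_gap are the numbers of unpaired nucleotides between
    the stem arms and the anticodon (the tolerance of 1-3 dots).
    """
    hits = []
    start = genome.find(anticodon)
    while start != -1:
        end = start + len(anticodon)
        for left in gaps:
            for right in gaps:
                arm_start = start - left - stem_len
                arm_end = end + right + stem_len
                if arm_start < 0 or arm_end > len(genome):
                    continue  # not enough sequence around this occurrence
                five = genome[arm_start:start - left]
                three = genome[end + right:arm_end]
                if stem_pairs(five, three) >= min_pairs:
                    hits.append((start, left, right))
        start = genome.find(anticodon, start + 1)
    return hits


# The second tRNA quoted above, used here as a tiny "genome".
trna = "gcggauuuagcucaguugggagagcgccagacugaagucuggagguccuguguucgauccacagaauucgca"
print(find_anticodon_candidates(trna))
```

A real search would then test each surviving candidate against the rest of the profile (the other stems and loops), which this sketch does not attempt.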
This is what I am going to do in the near future: reliably find all tRNAs in the sequenced bacterial genomes. If anyone would like to take part, you are welcome.