Analysis of the bacterial genome. Continuation

    The discussion under the previous article turned out to be too noisy. We have since opened our own site, where more advanced material will appear (if you want to know where, write to me). I promised a sequel about my experiment, so if you are interested in the problems of building evolutionary trees, read on.

    No. 1. Selection of all homologous sequences (paralogs)


    In the previous article, we compared evolutionary trees built on the 16S and 23S genes. My method differs in that it compares what has not mutated between organisms. In my early articles on Habr I suggested using tRNA, because these are the most conserved sequences, but that did not provide much information. So I asked myself: how can I find all the sequences that have not mutated between organisms? To do this in a reasonable amount of time, I used a small trick. The idea is that before any DNA sequence is inherited, it will (if it is useful) already be present in the genome in several copies. In other words, we are talking about paralogs.

    If a gene is duplicated within a single organism as a result of a chromosomal mutation, its copies are called paralogs.

    So if we find all the paralogs in one organism, then, wherever inheritance took place, they were passed on to other organisms. After that we only need to select those that have not had time to mutate.

    That is, we do the following:
    1. In each DNA (the genome of an organism), we look for everything that has duplicates 50 to 150 characters long.
    2. For each duplicate found, we look for all the organisms that contain it, i.e. we compile a database recording which genomes each of the paralogs occurs in.

    (So as not to get distracted by how exactly this is done, I will either describe it in a separate article or, more likely, if there is interest, write it up on our site in time.)
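
    The two steps above can be illustrated with a deliberately naive sketch. It is not my actual implementation (that is deferred, as said above); it assumes exact matching only, ignores the reverse strand, counts overlapping occurrences as duplicates, and would be far too slow on real genomes. The names `find_repeats` and `build_records` are made up for the illustration.

```python
def find_repeats(genome: str, k: int) -> set[str]:
    """Return every substring of length k that occurs at least twice in genome."""
    seen = set()
    repeats = set()
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if kmer in seen:
            repeats.add(kmer)   # second (or later) occurrence: a duplicate
        else:
            seen.add(kmer)
    return repeats


def build_records(genomes: dict[str, str],
                  min_len: int = 50,
                  max_len: int = 150) -> list[tuple[str, str]]:
    """Step 1: collect duplicated substrings from every genome.
    Step 2: for each duplicate, record every genome that contains it.

    Returns (sequence, organism) pairs, sorted by sequence.
    The default length bounds follow the 50 to 150 characters from the text.
    """
    candidates = set()
    for sequence in genomes.values():
        for k in range(min_len, max_len + 1):
            candidates |= find_repeats(sequence, k)

    records = []
    for candidate in sorted(candidates):
        for organism, sequence in genomes.items():
            if candidate in sequence:   # exact, unmutated copy only
                records.append((candidate, organism))
    return records
```

    For toy input such as `build_records({"A": "ACGTACGT", "B": "TTACGTGG"}, 4, 5)`, the only duplicate is `ACGT` (it occurs twice in A), and it is recorded for both A and B because B carries an unmutated copy.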

    No. 2. Actually building an evolutionary tree


    I have already described how to build an evolutionary tree with my method, so here we will focus on the results of cross-validation. Recall that cross-checking two trees, one built on the 23S rRNA gene and one built on the 16S rRNA gene (the latest result of The All-Species Living Tree project), gave the following error distribution (unlike in the previous article, it is expressed here as a percentage of the total number of species pairs considered):



    I was hoping that my approach would give better results, but alas, it came out about the same in quality, though different in substance. First, about quality. The cross-check was done as follows. About a million occurrences of paralogs in genomes were found, i.e. there are a million records of the form "the DNA sequence with such-and-such ID occurs in such-and-such organism". For the cross-check I split this set randomly into two samples, built a tree on each, and compared the two trees in the same way as before. The result was the following:



    So the trust in these trees is in fact about the same: both are roughly 50% correct.
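
    The random split used for this cross-check can be sketched as follows. This is a minimal illustration assuming the records are (sequence ID, organism) pairs held in memory; the function name `split_records` and the fixed seed are assumptions made for reproducibility of the example, not part of the original experiment.

```python
import random


def split_records(records, seed=0):
    """Randomly split the records into two halves of (almost) equal size.

    The input list is not modified; a seeded generator makes the split repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]
```

    A tree is then built on each half separately, and the two trees are compared with each other.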

    Of course, part of the problem is that the genomes do not contain so much information that half of the sample would be enough to reproduce the same similarities. So I thought about how to use the available information as economically as possible, and came up with the following cross-analysis: build a complete tree from all the available information and compare it with the half trees. That is, take the tree built on the entire million records and compare it first with the tree built on one half-million, and then with the tree built on the other. In the figure below (full resolution is available via the links), the trees are built on the full sample, and the nodes that are fairly stable are shown in red, i.e. those for which the cross-analysis gave no more than one error.

    As you can see, things are not so bad: some branches are completely red. But the closer to the root, the less information there is, and the position of the species in the tree does not pass the check.

    Interestingly, I then compared the tree I obtained with the tree of The All-Species Living Tree project (after reducing both to the same set of species). It turned out that they coincide by only 25%.
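
    To make "reducing to the same set of species" concrete, here is a small sketch. Note the swapped-in measure: it does not reproduce my error count over pairs of species, but uses a simpler and common stand-in, the fraction of shared clades between two rooted trees. The trees are assumed to be nested tuples with species names as string leaves, and all function names are hypothetical.

```python
def leaves(tree):
    """Set of species names in a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def prune(tree, keep):
    """Drop leaves not in keep and collapse nodes left with a single child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [p for p in (prune(child, keep) for child in tree) if p is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)


def clades(tree):
    """Leaf sets of all internal nodes, including the root."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members

    walk(tree)
    return found


def shared_clade_fraction(tree_a, tree_b):
    """Share of clades present in both trees after pruning to common species."""
    common = leaves(tree_a) & leaves(tree_b)
    if len(common) < 2:
        raise ValueError("the trees need at least two species in common")
    clades_a = clades(prune(tree_a, common))
    clades_b = clades(prune(tree_b, common))
    # The root clade is identical by construction, so it carries no information.
    clades_a.discard(common)
    clades_b.discard(common)
    union = clades_a | clades_b
    return len(clades_a & clades_b) / len(union) if union else 1.0
```

    A value of 1.0 means that the two pruned trees have exactly the same groupings, and 0.0 means that they share none besides the root. The number it gives is not directly comparable to the percentages quoted in this article.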

    This raised an important question of interpretation for me; perhaps someone can tell me what it could mean. It turns out that my method of building trees can be trusted, and apparently so can the classical method used in The All-Species Living Tree project: the two do not differ significantly in how well they agree with themselves under cross-checking. But why then do they not coincide with each other? They seem to show two versions of the same thing. But how can there be two half-truths at once that coincide by only 25%?


    The full-size images can be seen here and here.

    I also got the impression that the mismatches are not random and arise somewhere at the level of families of organisms. In the second version of the tree image, you can see that the species cluster into groups, with many matches inside each group, while the position of the groups themselves is unreliable.

    There are two possibilities. Either there is simply too little data so far, with too few intermediate species sequenced. Or, after all, above the level of families organisms really do not have a common ancestor, and evolution does not follow Darwin? At least for now, we have no reliable data showing that there was a common ancestor at all.
