Using UML for an experiment on the evolutionary systematics of prokaryotes, and indirectly on the psychology of scientists

    This article continues two earlier ones: “Interesting results on the evolutionary systematics of prokaryotes, or ‘multi-species origin’” and “Genomes of sequenced organisms - errors in the databases”.

    After they were published, I had the honor of receiving feedback both from interested readers and from professionals in the field. As you could see, the discussion was fairly lively. On the one hand, I would like to respond to the comments received.

    On the other hand, I want to set up a new experiment, and I would like to involve those who are interested in such things. If you do not have time, maybe you have free processor time :)?



    NCBI Position

    Thanks to Kalobok, who works at the NCBI, I managed to find out why "people from the NCBI do not perform such a simple cross-analysis" (see “Genomes of sequenced organisms - errors in the databases”). I must say the conversation with Kalobok was not very pleasant. At first, like many commenters, he tried to lecture me, hinting in every possible way that I did not understand anything, that I had done everything wrong, etc.

    Here are some typical quotes from the correspondence: "The essence of your complaints ... comes down to incorrect use of the data. ... there is nothing to fix there. You just need to use the data correctly.", "As for the anticodons, as I already said, I don’t remember the details, so I can’t comment. Judging by your level of knowledge, you do not have much reason to discuss this topic either. First, figure out whether genomes of different organisms can be compared at all.", "Let us first see the source of your information about 34-35-36. Because with a probability of 99%, any biologist will answer that this is not true, and there will be nothing further to discuss. Once again, try to understand that the NCBI is staffed not by undergraduates but by professionals. And I am not inclined to suspect them of turning out data of which more than half is defective. I would rather assume that the mistaken one is the layman who has wandered into someone else's field.", "Note that the NCBI is primarily aimed at biologists. They are quite happy with the available data and tools. And a lone non-biologist programmer with obscure ideas makes no difference.", "Here is the opinion of the head of the bacterial genomes group after reading the article: Yes, quite naive ... Such work has been carried out for the past 20 years. And this is some kind of loner."

    And so on in the same spirit. Probably few people would have kept their composure in a discussion conducted in that tone. But alas, such are the current manners of people who have received a diploma in biology (biophysics, biochemistry ...), have started to understand programming a bit, and now work in a respectable place.

    How to survive in this evil world :)

    What is a person to do who has no appropriate diploma but has knowledge in a narrow field that is not his own? Alas, he will always be treated in a certain way, ranging from “your post is teeming with the self-confidence of a gifted student” to an instructive, patronizing tone: “But actually it’s very interesting. Correct the language, add a literature review, describe the methods and results, remove speculation from the discussion and maybe something will work out.”, “> and there they’ll just twirl a finger at their temple - They won’t, it’s normal work. If you add references, it might pass for a diploma thesis; true, the discussion with the author shows that things would not go further.”

    But here the main thing is to understand several psychological points. A person with a diploma and a comfortable position lets himself go. Even without the relevant knowledge (not in the broad sense, but for this particular task), he allows himself to speak from a position of superiority over the interlocutor. The discussion, as a rule, is not about the substance. Instead he looks for the interlocutor's weakest argument or a deliberate speculation, then advises “son, go read and study this” (as a rule something not relevant to the issue), ignores the strong arguments, and then starts finding out whether you have a diploma, who your dad and mom are, etc.

    I have run into this more than once over the past 10 years or so. My advice is simple: ignore it. Do not fall for provocations, and do not go off to study what they tell you to; you do not need it. The first such case was at school, when I spoke about literary characters, for example from “The Master and Margarita” or about Natasha Rostova from “War and Peace”. I was asked how I could speak negatively about something I had not read. So I gave in and read “The Master ...”, and after that I could speak on the subject with full rigor. With Natasha it was easier (I only skimmed the novel): we had to write an essay, and in it I argued convincingly that this girl does not deserve so much attention. The grade was excellent, with a comment along the lines of “everything is very well supported with quotes, but it may be worth looking at it from the other side, as a manifestation of the Russian soul ...”. Not worth it, I said then, and went off into adulthood :)

    Over time, I came to begrudge the time spent on this kind of thing. Why the literature example, you ask? Because it is exactly the same situation, and everything was laid down back then: either others will lead you, or you will decide for yourself.

    And yet none of the above means that such discussions should be avoided. Never let yourself behave like your opponent (although often that is not easy). Look for grains of truth in his words. They really are only grains, but if the opponent is arguing with you at all, he is already interested, and from time to time he gives out something useful to you. Learn to filter.

    However, I apologize for this lyrical digression. Now to the point.

    How did it end with the NCBI?

    As expected, they acknowledged their mistakes, but did so while putting a good face on it :)

    "The data that you took from the ftp are the original sequences and annotations sent in by the researchers. They have not yet been verified at the NCBI, and they can contain quite a lot of errors [they are marked as] This record has not yet been subject to final NCBI review ... Even in the verified data such errors can occur, both because the checks are not 100% reliable and for historical reasons (for many old records the submitters' data was trusted and no additional checks were done, so they are still sitting there). In principle, such data is periodically reviewed and corrected. ... One of the biologists said plainly that if for some reason he used raw data from a genome, he would simply correct it by hand. But most likely he would use tRNAdb [this is another database, with less data, but corrected]. "

    "Here, by the way, another colleague has responded. He says that our standard data-checking program currently does not check the correctness of tRNA at all, because that is very expensive computationally. They plan to write a separate program for this, but for now there are higher-priority tasks. So wait. "

    So lyrics are lyrics, but a fact remains a fact. You can “attack” a non-biologist programmer for a long time, but that more than 50% of the NCBI data is not verified is a reliable and acknowledged fact. This should not be taken as criticism of the NCBI: they do a lot of good work and hold a lot of good information, which is valuable even with errors. This is just for the information of the biologists who told tales in the comments to the previous articles.

    They do seem to intend to correct this data, but it is not a priority for them: many of these errors go unnoticed, and those that are noticed get corrected. But they will only fix them themselves, because they do not trust lists of errors from others.

    We will not wait for the fixes. But what can be done without them?

    The main criticism of the article “Interesting results on the evolutionary systematics of prokaryotes, or ‘multi-species origin’” was the following claim: “You cannot use a single gene as a measure”. I completely agree with this, and the new experiment should fix it.

    A few numbers. The NCBI now holds about 2000 bacterial genomes. In preparation for the experiment, I selected all the sequences annotated there as tRNA. There turned out to be more than 40 thousand unique variants. But alas, there are many errors among them.

    But I thought that the stage of full error correction could be skipped. How? I filtered these tRNAs by length and by the presence of CCA at the end of the sequence. I should note that the CCA ending is required for any tRNA, and the length can be from 74 to 96 nucleotides.

    There are many wonders in the NCBI data, down to a “tRNA” of a single nucleotide or of more than 1300 :) (you can't say it without a smile). Therefore, I removed sequences shorter than 70 or longer than 100 nucleotides, as well as those that do not end in CCA.
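
    The filter described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated (length 70 to 100, ends in CCA); the function name and the assumption that sequences are plain strings of DNA letters are mine, not part of the original tooling.

```python
def is_plausible_trna(seq, min_len=70, max_len=100):
    """Return True if seq passes the rough tRNA filter from the text.

    Assumption: seq is a plain string of nucleotide letters (DNA alphabet),
    and the length bounds are inclusive.
    """
    seq = seq.strip().upper()
    return min_len <= len(seq) <= max_len and seq.endswith("CCA")


def filter_trnas(sequences):
    """Keep unique sequences that pass the filter, preserving first-seen order."""
    seen = set()
    kept = []
    for seq in sequences:
        normalized = seq.strip().upper()
        if normalized not in seen and is_plausible_trna(normalized):
            seen.add(normalized)
            kept.append(normalized)
    return kept
```

    Duplicates are dropped here because the counts in the text refer to unique variants.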

    About 20,000 sequences remain. These are the NCBI tRNAs most likely to be free of errors. The other half can be dealt with later.

    In fact, for the planned experiment it makes no difference whether a particular sequence is 70-100 nucleotides long or not. Why? Because I am going to check against the genomes of the 2000 bacteria whether such sequences are really there, so errors will be excluded. And whether a sequence is actually a tRNA or not is secondary. The main thing is that significant stretches of DNA coincide in different organisms. A coincidence of a sequence of length 70-100 between genomes is far from accidental. The probability of a chance coincidence drops rapidly once the length exceeds 10, and at 70-100 we are dealing with some important part of the genome that cannot simply coincide in different organisms by chance.
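
    To put a rough number on "far from accidental", here is a back-of-the-envelope estimate. It assumes a deliberately simplified model that is not in the original text: nucleotides are independent and equally likely (probability 1/4 each), only one strand is counted, and the genome size of 5 million nucleotides is just an illustrative figure. Under that model the expected number of chance matches of a fixed sequence of length n in a genome of size G is (G - n + 1) * (1/4)^n.

```python
def expected_chance_hits(length, genome_size):
    """Expected number of positions where a fixed sequence matches by chance.

    Simplified model (an assumption): independent, uniformly distributed
    nucleotides, single strand only.
    """
    positions = max(genome_size - length + 1, 0)
    return positions * 0.25 ** length


# Illustrative genome size of 5 million nucleotides (assumed, not from the data).
for n in (10, 20, 70, 100):
    print(n, expected_chance_hits(n, 5_000_000))
```

    Under these assumptions a sequence of length 10 is still expected to turn up a few times in a genome of that size, at length 20 the expectation is already of the order of 10^-6, and at 70-100 it is vanishingly small.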

    So here is what I am doing now. I take these 20,000 tRNAs and find which bacteria they are present in. If a sequence is present in only one organism, it is not interesting, and most likely it is an erroneous sequence. This eliminates a substantial share of the errors.

    If a sequence is present in more than one organism, it gives one association (link) between two organisms.
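
    A minimal sketch of this step, under stated assumptions: genomes are held in memory as a dict from organism name to a DNA string, matching is exact substring search on the given strand only, and all names are illustrative. The real search over 2000 genomes would need something more economical, but the logic is the same.

```python
from itertools import combinations


def find_associations(trnas, genomes):
    """Map each shared sequence to its organisms, and each organism pair to shared sequences.

    trnas   -- iterable of candidate sequences (strings)
    genomes -- dict: organism name -> genome sequence (string); an assumed layout
    """
    shared = {}
    for seq in trnas:
        organisms = sorted(name for name, dna in genomes.items() if seq in dna)
        # A sequence found in fewer than two organisms is dropped as uninteresting.
        if len(organisms) > 1:
            shared[seq] = organisms

    links = {}
    for seq, organisms in shared.items():
        # Every pair of organisms carrying the same sequence gets one association.
        for a, b in combinations(organisms, 2):
            links.setdefault((a, b), []).append(seq)
    return shared, links
```

    With toy data such as `genomes = {"A": "GGGACGTCCA", "B": "TTACGTCCATT", "C": "AAAA"}` and `trnas = ["ACGTCCA", "AAAA"]`, only `ACGTCCA` survives and produces a single link between A and B.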

    Next came the question of how to visualize this well. The idea is this: an organism is a class. The current phylogenetic taxonomy, in the form of a tree, is inheritance between classes.

    A tRNA is a property of a class, and the aggregation of these properties in different organisms is horizontal gene transfer (that same association).

    Having generated an appropriate code skeleton, one can build the diagram automatically with a UML tool and see all these relationships visually in the class diagram.
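
    As an illustration of the mapping, here is a small sketch that writes the skeleton as PlantUML text. PlantUML is my choice for the example only, since the text does not name a tool, and the input layout (a child-to-parent dict for the taxonomy, a dict of tRNA names per organism, and a dict of organism pairs with their shared tRNAs) is likewise an assumption.

```python
import re


def ident(name):
    """Turn an organism or taxon name into a safe class identifier."""
    return re.sub(r"\W+", "_", name).strip("_")


def to_plantuml(parents, trnas, links):
    """Render taxonomy as inheritance, tRNAs as class members, shared tRNAs as associations.

    parents -- dict: child taxon/organism -> parent taxon (assumed layout)
    trnas   -- dict: organism -> list of tRNA names
    links   -- dict: (organism_a, organism_b) -> list of shared tRNA names
    """
    lines = ["@startuml"]
    names = set(parents) | set(parents.values()) | set(trnas)
    for name in sorted(names):
        lines.append(f"class {ident(name)} {{")
        for trna in trnas.get(name, []):
            lines.append(f"  {trna}")
        lines.append("}")
    for child, parent in sorted(parents.items()):
        lines.append(f"{ident(parent)} <|-- {ident(child)}")
    for (a, b), shared in sorted(links.items()):
        lines.append(f"{ident(a)} -- {ident(b)} : {len(shared)} shared tRNA")
    lines.append("@enduml")
    return "\n".join(lines)
```

    The resulting text can be fed to any renderer that understands this notation; with thousands of organisms the diagram would probably have to be drawn one subtree at a time.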

    What is the problem?

    Now the problem is CPU time. I am building a database of the presence of 20,000 tRNAs in 2,000 bacterial genomes. Only about 100 tRNAs are processed per day, so at that rate the full set would take about 200 days on a single machine. Therefore, I would be grateful to anyone interested who can help with processor time, something like a distributed project :)

    If you are interested, write to me in a personal message. You will need about 50 GB of hard drive space and a little time for me to explain what's what. After that I can send packets of 100 tRNAs for processing, and you send me the results when they are done.
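
    Splitting the work into such packets is simple; a sketch is below. The packet size of 100 comes from the text, everything else (the function name, keeping the sequences in a list) is assumed for illustration.

```python
def make_packets(sequences, packet_size=100):
    """Split a list of sequences into consecutive packets of at most packet_size items."""
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    return [
        sequences[start:start + packet_size]
        for start in range(0, len(sequences), packet_size)
    ]
```

    For 20,000 sequences this gives 200 packets, so with a few dozen helpers each processing a packet a day the whole job shrinks from months to days.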
