MiTCR: toward a new type of diagnostics. Yandex Workshop

    This post is based on a talk by Dmitry Bolotin about MiTCR, a software package developed for analyzing the repertoires of immune receptors. In his talk he covered the main aspects of processing raw sequencing data, in particular sequence alignment and algorithms for correcting errors in the source data, and briefly described the program's architecture, performance, and near-term development plans. The source code of MiTCR is open. In the future this software may grow into a common platform where bioinformaticians can process their data and share it with other researchers. The result of such collaboration should be a new type of diagnostics: a single blood test would answer not only whether a person has a particular disease, but immediately determine what exactly they are sick with.

    Video recording of the report

    We will start from afar, so that it is clear what data we work with and where it comes from. The picture below shows immunity very schematically. Cells with the same specificity (i.e., recognizing the same kinds of infections) are shown in the same color. We call such cells clones. During an infectious attack, the number of cells that recognize that infection increases.

    The specificity of these cells comes from the T-cell receptor on their surface, whose assembly rules are recorded in the corresponding gene. For the narrative that follows, it is important to understand its structure.

    A gene is a piece of DNA; you can think of it simply as a sequence over a four-letter alphabet. The T-cell receptor gene is unique in that it differs from cell to cell. All other genes in our cells are identical; only the genes of the T-cell receptor and of antibodies differ, in T- and B-lymphocytes respectively.

    This gene consists of four main pieces. One piece is always the same in all T-cell receptors. Two other sections (red and green in the picture) are each chosen from a small set. When these two sections are joined, random letters (on the order of ten) are inserted between them, which produces an enormous variety of receptors. The region of the protein that then recognizes the antigen is called CDR3.

    CDR3 is interesting from two sides. On the one hand, it determines specificity: it is simply biology that it encodes the part of the receptor responsible for antigen recognition. On the other hand, it is convenient as an identifier, because all the diversity is concentrated in it. It includes a piece of the red segment and a piece of the green one, and from those pieces we can determine exactly which segments from the set were chosen, while all the random letters lie entirely inside it. Therefore, knowing CDR3, we can reconstruct the entire protein. We can also use it as an identifier: if two CDR3s are the same, the whole gene is the same.

    So, we have a blood sample, and we perform a sequence of rather complex molecular-biology reactions to isolate from it the DNA sequences containing CDR3. Then we load them onto a sequencer, and at the output we get something like this:

    This is real data as it comes to us from the sequencing center. If we discard the extras, it is essentially just DNA sequences 100 to 250 nucleotides long. And there are a lot of them.

    Our task is to reconstruct from this data what was there at the very beginning, i.e., find out how many clones were in the original sample and which ones.

    Everything is complicated by the fact that the data contains a variety of errors. They are introduced at every stage: during sample preparation, during sequencing, and so on. The error rate varies depending on the design of the experiment, on the sequencer that was used, in general on everything. The errors can be insertions, deletions, and substitutions.

    This is bad primarily because it distorts the data about the sample and introduces artificial diversity. Suppose the sample consisted of a single clone, but errors were introduced during preparation. As a result we see more clones than there actually were and get an incorrect picture of the patient's T-cell repertoire.

    Now briefly about the data we work with. At the input we get on the order of a hundred million sequences, sometimes up to a billion, each 100-250 nucleotides long. Altogether this weighs up to 100 GB. The concentration of a particular CDR3 can range from one in five reads (in very rare cases even higher) down to one in millions of reads.

    The scheme of the MiTCR software we developed looks roughly like this:

    MiTCR solves the problem described above: it builds a list of clones from the data and corrects errors. The first block extracts CDR3 from the data; the subsequent blocks work only with CDR3s, combining them into clones and correcting errors.

    I will not dwell much on the first block. It performs alignment, and that is a huge area of bioinformatics that would deserve a separate lecture. In short, this block marks up the input sequences, determining where the red and green sections are located in them. The set from which they are chosen is known: it is stored in databases and is exactly the same for all people. We determine which of the red segments occurred and where, and do the same for the green segments. With these alignments built, we know where CDR3 is located and can cut it out of the sequences and work with it.
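    The idea of the markup step can be illustrated with a toy sketch. The anchor sequences below are made up for illustration; MiTCR's real aligner matches reads against germline segment databases and tolerates mismatches, whereas this sketch only does exact substring search:

```java
import java.util.List;

// Toy sketch: locate CDR3 by exact-matching the known tail of a "red"
// (V) segment and the known head of a "green" (J) segment in a read.
// The segment sequences here are invented, not real germline data.
public class Cdr3Extractor {
    // Hypothetical reference anchors (assumptions, for illustration only).
    static final List<String> V_TAILS = List.of("TGTGCCAGC");
    static final List<String> J_HEADS = List.of("TTTGGACAG");

    // Returns the region from the matched V tail through the matched
    // J head, or null if either anchor is missing from the read.
    public static String extractCdr3(String read) {
        for (String v : V_TAILS) {
            int vPos = read.indexOf(v);
            if (vPos < 0) continue;
            for (String j : J_HEADS) {
                int jPos = read.indexOf(j, vPos + v.length());
                if (jPos < 0) continue;
                return read.substring(vPos, jPos + j.length());
            }
        }
        return null;
    }
}
```

    A real aligner would score partial and inexact matches instead of requiring exact seeds, but the output is the same: coordinates of the segments, from which the CDR3 boundaries follow.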

    The most obvious thing to do with this data is deduplication: collect identical CDR3s and count them. But since the data contains errors that need correcting, and correcting them later requires fuzzy search over the set of CDR3s (for example, finding CDR3s that differ by one letter), we use a prefix tree (trie) for storage, which is very convenient for such operations. Moreover, as it grows this tree is never rebuilt, so it is easy to implement all sorts of concurrent algorithms on top of it: there is no need to worry much about tricky synchronization. The tree is very easy to use, but it also has a drawback: it is heavy and takes up quite a lot of RAM. In real conditions, however, this turns out to be an acceptable price.
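    A minimal sketch of such a trie, assuming a plain four-letter alphabet and ignoring thread safety (the real implementation is concurrency-friendly and more memory-efficient):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a prefix tree (trie) that accumulates CDR3 counts.
// Each node has one child slot per nucleotide; the count lives in the
// node where a sequence ends. The same structure supports the fuzzy
// lookups the error-correction steps need (here: within one mismatch).
public class Cdr3Trie {
    static final String ALPHABET = "ACGT";
    final Node root = new Node();

    static class Node {
        Node[] children = new Node[4];
        long count = 0;
    }

    static int idx(char c) { return ALPHABET.indexOf(c); }

    public void add(String seq) {
        Node n = root;
        for (int i = 0; i < seq.length(); i++) {
            int k = idx(seq.charAt(i));
            if (n.children[k] == null) n.children[k] = new Node();
            n = n.children[k];
        }
        n.count++;
    }

    public long count(String seq) {
        Node n = root;
        for (int i = 0; i < seq.length() && n != null; i++)
            n = n.children[idx(seq.charAt(i))];
        return n == null ? 0 : n.count;
    }

    // All stored sequences at Hamming distance <= 1 from seq.
    public List<String> withinOneMismatch(String seq) {
        List<String> out = new ArrayList<>();
        search(root, seq, 0, new StringBuilder(), 1, out);
        return out;
    }

    private void search(Node n, String seq, int pos, StringBuilder prefix,
                        int budget, List<String> out) {
        if (n == null) return;
        if (pos == seq.length()) {
            if (n.count > 0) out.add(prefix.toString());
            return;
        }
        for (int k = 0; k < 4; k++) {
            int cost = (k == idx(seq.charAt(pos))) ? 0 : 1;
            if (cost > budget || n.children[k] == null) continue;
            prefix.append(ALPHABET.charAt(k));
            search(n.children[k], seq, pos + 1, prefix, budget - cost, out);
            prefix.deleteCharAt(prefix.length() - 1);
        }
    }
}
```

    Note that adding a sequence only appends nodes and increments counters, which is why a trie of this shape never needs rebuilding as it grows.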

    What do we do with errors? In fact, we assemble all these clones from only part of the sequences. The point is that the sequencer, besides the letters it has read, assigns each position a quality value that determines the probability of an error: the sequencer has read a letter but may be more or less sure of it. Here we apply a fairly simple trick: we set a threshold on the error probability, say one hundredth. Every letter with a higher error probability we call bad, and accordingly every CDR3 containing such a letter we also consider bad. All of this is stored in a container that not only accumulates abundance data but also collects other information: it looks for equivalent CDR3s, or CDR3s that differ only at low-quality positions. This corrects a large share of the errors. Unfortunately, it is impossible to get rid of all of them: some errors are introduced before sequencing, so there is no additional information about them.
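    The good/bad split can be sketched as follows, assuming standard Phred+33 quality encoding as in FASTQ files (the exact encoding and container logic in MiTCR may differ):

```java
// Sketch of splitting CDR3s into "good" and "bad" by per-position
// quality. Phred score Q corresponds to an error probability of
// 10^(-Q/10); with the talk's threshold of 0.01, any base with Q < 20
// is "bad", and a CDR3 containing such a base goes to the bad pool.
public class QualityFilter {
    static final double ERROR_THRESHOLD = 0.01; // one hundredth, as in the talk

    static double errorProbability(char qualityChar) {
        int q = qualityChar - 33; // Phred+33 decoding (assumption)
        return Math.pow(10.0, -q / 10.0);
    }

    // A CDR3 is "good" only if every position meets the threshold.
    public static boolean isGood(String qualityString) {
        for (int i = 0; i < qualityString.length(); i++)
            if (errorProbability(qualityString.charAt(i)) > ERROR_THRESHOLD)
                return false;
        return true;
    }
}
```

    For example, the quality character 'I' decodes to Q = 40 (error probability 0.0001, good), while '#' decodes to Q = 2 (error probability about 0.63, bad).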

    Here we do another fairly simple thing: we find pairs of CDR3s that differ by one letter and differ greatly in concentration. Most often this is a clear marker of an error, and we simply remove such sequences, choosing a concentration threshold on the order of 1 to 10: one error per parent sequence.
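    A naive sketch of this correction step (a quadratic pairwise scan instead of the trie-based fuzzy search; class and method names are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the second correction step: a clone whose CDR3 differs
// from a much more abundant clone by a single letter is treated as an
// error and folded into the larger clone. The 10x ratio follows the
// talk; MiTCR does the neighbor search via its trie, not a full scan.
public class OneMismatchCorrector {
    static boolean differByOneLetter(String a, String b) {
        if (a.length() != b.length()) return false;
        int diff = 0;
        for (int i = 0; i < a.length(); i++)
            if (a.charAt(i) != b.charAt(i) && ++diff > 1) return false;
        return diff == 1;
    }

    // counts: CDR3 -> read count. Returns a corrected copy in which
    // presumed errors are merged into their parent clones.
    public static Map<String, Long> correct(Map<String, Long> counts, double ratio) {
        Map<String, Long> result = new HashMap<>(counts);
        for (String small : counts.keySet()) {
            for (String big : counts.keySet()) {
                if (differByOneLetter(small, big)
                        && counts.get(big) >= ratio * counts.get(small)) {
                    // fold the presumed error into the dominant clone
                    result.merge(big, result.remove(small), Long::sum);
                    break;
                }
            }
        }
        return result;
    }
}
```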

    Altogether this lets us correct about 95% of the errors without removing natural diversity. Of course, with such algorithms one could mistake a natural clone for an error, but in practice clones are so diverse that this does not happen. For real work these are good numbers.

    It is worth describing the software implementation a little. The system is heavily parallel: sequence markup runs in parallel, and the best result is achieved when the number of threads matches the number of cores. The diagram above shows work in two streams: one builds clones from good CDR3s, the other from bad ones. All of this scales well: on a six-core processor, for example, execution speeds up roughly 3.5-4 times. As for throughput, on very ordinary hardware (AMD Phenom II X4 @ 3.2 GHz) it processes 50,000 sequences per second. Storing one clonotype takes about 5 KB of RAM. Thus we were able to process the most complex (diverse) dataset in our experience in 20 minutes on a machine with the above processor and 16 GB of RAM.
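    The parallel accumulation scheme might be sketched like this, with a ConcurrentHashMap standing in for the actual concurrent trie (class and method names are invented for illustration):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the parallel layout: extracted CDR3s are distributed over
// worker threads (ideally one per core), and counts accumulate in a
// shared concurrent structure. MiTCR accumulates into its
// append-only trie; a ConcurrentHashMap stands in for it here.
public class ParallelCounter {
    public static Map<String, Long> countClones(List<String> cdr3s, int threads) {
        ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String seq : cdr3s)
            pool.submit(() -> counts.merge(seq, 1L, Long::sum)); // atomic increment
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counts;
    }
}
```

    Because each update is a single atomic merge, no coarse locking is needed, which is the same property that makes an append-only trie convenient for concurrent accumulation.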

    Everything is written in Java, the sources are open, and there is a well-documented API, because bioinformaticians often find it convenient to embed pieces of it in their own code; we try to make the software as convenient to use as possible.

    Now we come to how diagnostics can be built from this data. Even today, all these sequencing experiments help in developing new drugs. But to use this data directly, we need to learn to extract patterns from it.

    It is indeed possible: many papers show that the same kinds of infections produce the same CDR3s, which suggests that diagnostics can be built on this. Of course, there are complex agents that produce less pronounced patterns. Detecting such patterns requires a lot of statistics: a large set of both control data (healthy donors) and patients with known diseases. Most likely, no single clinic will be able to accumulate enough samples to trace the necessary patterns for a wide range of diseases, so a single place for collaboration would advance the development of diagnostics.

    Such a database should store the data in a convenient structure that makes it possible to take the specific slices needed for computing patterns and analyzing samples. It must enforce access control, because much clinical data cannot be shared publicly. And it must also do its main job: detect patterns in the data, then search for them in new data, gradually refine them using that same new data, and improve its performance.

    Such a database is aimed at a rather distant future, and at first it may be unclear why it is needed. But this data is complex in itself. Imagine working with hundreds of datasets, each containing a million sequences. You need to look for patterns across them (and the patterns can be very complex), searching not necessarily for exactly identical CDR3s but, for example, for similar ones, and the search runs over many datasets at once. In general, finding patterns is very hard. We need some kind of high-level language that would let us describe all this; today, researchers working with this data are forced to write ad-hoc scripts every time. Tracing any non-trivial pattern is trickier than simply finding identical clones across datasets, and right now it simply does not work out: the necessary tools do not exist. So if such a database were built, people would find it interesting, because it would solve their concrete research problems. And all of it should be convenient and clear, since this data is often handled by biologists who simply need a graphical interface. We need a platform for collaboration, because truly interesting results require a lot of data flowing into one place from different sources.

    The main problem in organizing such a common database is delivering the data from researchers: we are talking about hundreds of gigabytes, or even terabytes. This problem is gradually being solved by cloud services. Many researchers upload their results to Yandex.Disk, Dropbox, and the like. There are also specialized cloud storages. For example, BaseSpace is tailored specifically for Illumina sequencers: when you buy a sequencer you automatically get a BaseSpace account, and the data is uploaded to it right during sequencing. BaseSpace also has a convenient API through which external applications can access it, so it can be used as part of the infrastructure.

    We already have a prototype of such a database. The idea is that raw data, which may need reprocessing at some point, can be written to tape and stored separately; this is the cheapest way to store large volumes of data, and such services are offered, for example, by Amazon Glacier. Processed data can be stored in a relational database, but this is entirely optional: in essence it is unstructured data, just lists of CDR3s, so it can be stored as files in block storage and loaded when the database structure is rebuilt.

    When computing patterns at a low level, bioinformaticians will most likely need queries such as:

    • Find all datasets in which a particular clone occurs.
    • Find all clones present in a given group of datasets.
    • Various fuzzy searches and groupings.

    To do this quickly and simply, without writing your own algorithms, you can use PostgreSQL. It is most likely not the only option, but it is what we used in the prototype. PostgreSQL can index integer arrays stored in a field, so you can make queries like "give me all records that contain the numbers 3, 5, and 8 in this field", and such queries are very fast. The picture below shows a simplified model of our prototype; in reality there should also be a bunch of metadata and other auxiliary information that accompanies CDR3. We achieved good performance by making CDR3 the key and storing as its value the list of datasets in which that CDR3 occurred. Queries on this key therefore map directly onto the queries listed above.
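    In plain Java terms, the prototype's data model boils down to an inverted index from CDR3 to dataset IDs. A toy version is below; the class and method names are invented for illustration, and in the actual prototype the same shape lives in PostgreSQL tables (likely indexed with GIN over integer arrays) rather than in memory:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the query layer's data model: an inverted index mapping
// each CDR3 to the IDs of the datasets it occurs in. Both query types
// from the list above become a key lookup or a subset test.
public class CloneIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void add(String cdr3, int datasetId) {
        index.computeIfAbsent(cdr3, k -> new TreeSet<>()).add(datasetId);
    }

    // "Find all datasets in which a particular clone occurs."
    public Set<Integer> datasetsOf(String cdr3) {
        return index.getOrDefault(cdr3, Collections.emptySet());
    }

    // "Find all clones present in every dataset of a group."
    public Set<String> clonesInAll(Set<Integer> group) {
        Set<String> out = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> e : index.entrySet())
            if (e.getValue().containsAll(group)) out.add(e.getKey());
        return out;
    }
}
```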

    We can say a bit more about the prototype's performance. Take a machine with a quad-core Intel Core i7-2600 @ 3.8 GHz, 16 GB of RAM, and a software RAID1 of two 128 GB SSDs. The input was 112 datasets containing 45 million clonotypes in total; all of it fit in 70 GB on the SSDs. On queries like "all clones shared by two datasets" we achieved about 50 operations per second with a latency of about 150 milliseconds.

    In conclusion, I want to repeat that accumulating data in one place is necessary for this kind of diagnostics to work. Today all tests aim to determine whether a person has one particular disease. The new type of diagnostics is meant, with a single test, to immediately answer what exactly a person is sick with. That requires collecting and processing a lot of data, and our prototype shows that this is quite feasible and not very resource-intensive.
