How DNA is sequenced
DNA sequencing has evolved in recent decades from a narrow field handled by a small number of scientists into one of the fastest-growing technologies. Throughput growth and falling costs have even outpaced Moore’s law, and, given strong market competition and huge demand, development will continue at a high pace. In addition, the rise of sequencing triggered a comparable boom in bioinformatics, radically changed biology, and is gradually changing medicine in fundamental ways as well.
Below, I explain in more detail how it is done.
What is DNA
To start with, a little theory that is needed to understand the process itself.
DNA is a polymer chain built from four types of monomers called nucleotides, and their sequence encodes information about the organism. In other words, DNA can be represented as text written in a four-letter alphabet. A DNA molecule consists of two chains (strands), and although their nucleotide sequences differ, the sequence of one strand can be unambiguously restored if the sequence of the other is known. This is why the strands are called complementary (from “complement”, something that completes). This property is used when a cell divides: the DNA strands are unwound, a second strand is synthesized on each of them as on a template, and each of the two daughter cells receives its own double-stranded DNA. The entire DNA sequence of an organism is called its genome. For example, the human genome is divided among 46 chromosomes (23 pairs).
Despite the large number of diverse methods, both experimental and obsolete, the mainstream commercial methods are quite similar. To avoid making reservations every time, I’ll say right away that this article is about those mainstream methods.
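The complementarity rule can be sketched in a few lines of Python. The article does not name the four letters or the pairing rule; the sketch assumes the standard ones (A pairs with T, C pairs with G) and the fact that the two strands run in opposite directions, so the complementary strand is read in reverse. The function name is only illustrative.

```python
# Standard base pairing (an assumption added here, not stated in the text above):
# A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(strand: str) -> str:
    """Restore the second strand from the first.

    The strands are antiparallel, so the result is reversed: it is the
    complementary strand written in its own reading direction.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(complement_strand("ATGCGT"))  # ACGCAT
```

Applying the function twice returns the original strand, which is exactly the "one strand determines the other" property described above.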
How it looks in general
Before describing the sequencing technology, here is an analogy for intuition: a stack of identical newspapers is blown up so that it flies apart into small scraps carrying fragments of text; each scrap is then read, and the text of the original newspaper is restored from these readings.
To sequence DNA, it is first isolated from the sample under study and then cut at random into small fragments, which are called reads. One strand of each read is kept, and a second strand is synthesized on it as on a template, while the type of each newly attached nucleotide is detected in some way. By recording the order in which nucleotides are attached, the sequence of each read is recovered. The genome is then reconstructed from the reads using computer programs.
An important point: the total length of the reads should be many times greater than the length of the DNA under study. Part of the DNA is lost when it is extracted from the sample and when it is cut, so there is no guarantee that every region will end up in at least one read. Therefore, to make it very likely that every region is read, DNA is taken with a large margin. In addition, errors may occur during sequencing, so to read the DNA more reliably each region should be read several times.
DNA is cut into reads, the reads are sequenced, and the original sequence is restored from them
This approach is not used by choice. It adds many difficulties, and researchers would be happy to read the whole genome sequence in one pass, but at the moment that is not possible.
There are two reasons for this. The first is the errors that occur when each nucleotide is read. They gradually accumulate, each subsequent nucleotide is read worse than the previous one, and at some point the read quality drops so low that continuing is pointless. For different sequencing methods, the read length that can be read well is on the order of tens or hundreds of nucleotides. The second is that DNA is a very long molecule, and reading it scrupulously one letter after another would take an unreasonably long time, whereas reading short fragments is easily parallelized: millions or even billions of reads can be read at the same time.
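The random cutting step can be imitated on a computer. This is only a toy sketch under simplifying assumptions that are mine, not the article's: every read has the same length, reads start at uniformly random positions, and nothing is lost or misread. The function name and parameters are hypothetical.

```python
import random


def shotgun_reads(genome: str, read_length: int, n_reads: int, seed: int = 0) -> list[str]:
    """Sample fixed-length substrings of the genome at random positions."""
    if read_length > len(genome):
        raise ValueError("read_length must not exceed the genome length")
    rng = random.Random(seed)  # seeded so that the sketch is reproducible
    last_start = len(genome) - read_length
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_length] for s in starts]


reads = shotgun_reads("ATGGCGTGCAATCGGATTACA", read_length=6, n_reads=8)
for read in reads:
    print(read)
```

The output is a pile of short, partly overlapping strings with no record of where they came from, which is the situation the assembly programs have to deal with.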
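How large the margin should be can be estimated roughly. The sketch below computes the average number of times each position is read (the coverage) and, assuming reads land uniformly at random so that the number of reads over a position is approximately Poisson-distributed, the expected fraction of the genome that no read touches. Both the model and the example numbers are illustrative assumptions, not figures from the article.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads covering a position: total read length / genome length."""
    return n_reads * read_length / genome_length


def expected_uncovered_fraction(coverage: float) -> float:
    """Poisson approximation: probability that a position is covered by zero reads."""
    return math.exp(-coverage)


# Hypothetical numbers: one million reads of 100 nucleotides, genome of 10 million.
c = mean_coverage(n_reads=1_000_000, read_length=100, genome_length=10_000_000)
print(c)                                # 10.0
print(expected_uncovered_fraction(c))   # a very small fraction, about e^-10
print(expected_uncovered_fraction(1.0)) # at 1x coverage roughly a third is missed
```

Under this model, reads whose total length merely equals the genome length would leave about a third of it unread, which is why the total is taken many times larger.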
Illumina
This scheme describes all the popular sequencing techniques. They differ only in how the attached nucleotides are detected during synthesis and in how the material is prepared.
To date, the most common method is the one used in Illumina sequencers. First, many different reads are attached to a glass plate. Then many copies of each read are made on the surface of the plate, so that each small area contains only identical copies. This is done so that, during sequencing, the signal comes not from a single molecule but from a group of identical molecules located close together, which makes the signal easier to read and increases reliability. These molecules are single-stranded DNA, and complementary strands are synthesized on them during sequencing.
The synthesis reaction proceeds as follows. One nucleotide is attached at the start of each molecule. This nucleotide is chemically blocked so that synthesis goes no further after it is added. It also carries a label that luminesces under a laser, and the color of the luminescence is different for each type of nucleotide. After the nucleotide is attached, the plate is illuminated with a laser and a camera captures the colors with which the plate luminesces. The block and the label are then removed, and the next nucleotide is attached in the same way. The computer translates the sequence of light signals from each area of the plate into a sequence of nucleotides, and the output is a file containing the sequences of the reads.
Sequencing according to the Illumina method
1 - genomic DNA; 2 - it is cut into reads; 3 - adapters are attached to the reads, which fix them to 4 - the plate; 5 - amplification of the reads on the plate; 6 - loading into the sequencer; 7 - sequencing
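The last step, turning colors into letters, can be sketched as follows. The color-to-nucleotide table, the cluster names and the data layout are all invented for illustration; a real instrument has its own dyes and image processing, which this toy version does not try to model.

```python
# Hypothetical color assignment, chosen only for this sketch.
COLOR_TO_BASE = {"green": "A", "red": "C", "blue": "G", "yellow": "T"}


def call_bases(signals_per_cycle: list[dict[str, str]]) -> dict[str, str]:
    """Translate per-cycle colors into one nucleotide string per plate area.

    signals_per_cycle[i] maps an area (cluster) id to the color seen there
    in cycle i. Unrecognized colors are recorded as "N" (unknown base).
    """
    reads: dict[str, list[str]] = {}
    for cycle in signals_per_cycle:
        for cluster, color in cycle.items():
            reads.setdefault(cluster, []).append(COLOR_TO_BASE.get(color, "N"))
    return {cluster: "".join(bases) for cluster, bases in reads.items()}


cycles = [
    {"c1": "green", "c2": "red"},
    {"c1": "blue", "c2": "red"},
    {"c1": "yellow", "c2": "dim"},
]
print(call_bases(cycles))  # {'c1': 'AGT', 'c2': 'CCN'}
```

Each cycle adds exactly one letter to every read, which is why all areas of the plate can be read in parallel.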
Assembly and annotation of the genome
If the genomes of closely related organisms have not been sequenced before, programs are used to try to assemble a single nucleotide sequence from the reads. Reads partially overlap, and these overlaps are used to build a single sequence. Many things significantly complicate the matter. For example, the sample may be contaminated, and the program will then try to build one sequence out of the DNA of different organisms. The sequencer may make a mistake when reading a read, or the program may incorrectly join two places in the genome because they are very similar. In fact, there are so many difficulties that they cannot all be listed here. Some of them are so hard to eliminate that even the human genome, the most important and most widely studied one, had still not been sequenced completely at the time of writing.
Reads and, below them, the genome sequence reconstructed from them.
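The idea of joining reads by their overlaps can be shown with a deliberately naive greedy sketch: repeatedly merge the two sequences with the longest suffix-to-prefix overlap. This is an illustration of the principle only. It assumes error-free reads from one strand and no repeats, and it is not how production assemblers work; the function names and the minimum overlap of 3 are arbitrary choices.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlaps of min_len remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


print(greedy_assemble(["TGCAATC", "ATGGCGT", "GCGTGCA"]))  # ['ATGGCGTGCAATC']
```

If two distant places in the genome were similar, the overlap test would happily join them, which is one of the assembly errors mentioned above.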
Once the genome sequence is assembled, one needs to understand what it means. Regions that look like genes are searched for in it. This is done as follows: genes begin and end with certain nucleotide “labels”, and if the DNA contains such sequences at a distance that a gene could fit between, the region is added to the list of potential genes. Each candidate is then compared with a database of already known genes of other organisms, and if the database contains a gene sufficiently similar to the candidate, the candidate is assigned the function of that gene.
If the genome of another organism of the same species has already been sequenced, it is used for assembly. Since the genomes of different individuals of one species differ only slightly, for each read the place on the already sequenced genome that it matches most closely is found, and the new genome is assembled on the basis of the existing one.
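The search for start and end “labels” can be sketched as a scan for open reading frames. The article does not say what the labels are; this sketch assumes the standard start codon ATG and stop codons TAA, TAG and TGA, looks at one strand only, and ignores everything that makes real gene prediction hard. The minimum length is a made-up threshold.

```python
# Standard stop codons (an assumption added for this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) slices of ATG...stop stretches on the given strand.

    min_codons is the minimum number of codons before the stop codon,
    counting the start codon itself.
    """
    orfs = []
    for frame in range(3):  # a gene may begin in any of the three reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # opening label found
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # closing label, long enough
                start = None
    return sorted(orfs)


dna = "CCATGAAACCCGGGTAACC"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])  # 2 17 ATGAAACCCGGGTAA
```

Each stretch found this way is only a candidate; the comparison with a database of known genes described above is a separate step that is not shown here.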
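Finding the place a read is “closest” to can be sketched as a brute-force search: slide the read along the reference and count mismatches at every position. This assumes substitutions only (no insertions or deletions) and checks every position, which is far too slow for real genomes; real mapping programs use indexes instead. The names are illustrative.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read: str, reference: str) -> tuple[int, int]:
    """Return (position, mismatches) of the best ungapped placement of the read.

    Returns position -1 if the read is longer than the reference.
    """
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[pos:pos + len(read)])
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


reference = "ATGGCGTGCAATC"
print(map_read("GCGAGCA", reference))  # (3, 1): one letter differs from the reference
```

The single mismatch in the example is exactly the kind of small difference between individuals that the new genome would record relative to the reference.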