Practical bioinformatics
I found a severe lack of information on bioinformatics in the Russian segment. I don’t know whether it is in demand or not, but I want to provide the reader with an introductory part, which can be called practical bioinformatics, which I really did not have enough to familiarize myself with the subject. In this chapter I want to describe the path that I have had to go to the present moment, when I no longer shy away from phrases: here's a FASTQ file for you and build me a bed graph for the genome browser. In order to continue to talk about interesting things, I want to go diagonally through definitions and primary data processing programs, without which it is difficult to speak the same language.
First, a few definitions. We assume that chromosomes are one-dimensional coordinate axes, from unity to about 10e8. The length of the axis depends on the length of the chromosome. Each axis point is an integer. Biologists and chemists conducted a large number of experiments and thanks to them they were able to describe parts of chromosomes with great accuracy (about 90%). These descriptions are called annotations. The abstract contains information on the length of chromosomes, on the coordinates of individual sections of chromosomes, the most known of which are the gene, intron , exon. There are a huge number of these sections, but their main property is that these are segments or a set of segments located on the coordinate axis. Some segments may include others or intersect in some way. Here is a set of sites where you can see annotations of the genomes of man, mouse and the like.
Biologists with chemists conduct their experiments and as a result of their operations on cells, they get a solution that contains relatively small pieces of DNA or RNA (I don’t really want to go into the details of the difference or the same, just a sequence of nucleotides) This solution is passed through a sequencing apparatus, the output of which is small lines. These lines are the ends of pieces of DNA or RNA in solution. The length of the strings obtained from the equipment is only 36-50 bases (the length of the string in nucleotides) is sometimes longer, but at the moment it seems to be no more than 200. These segments, obtained from sequencing equipment and determined by the nucleotide sequence, are called reads (from English reads - “ reading"). It is worth noting that the reads are characterized only by the sequence of nucleotides, and not by the location on the genome. Sometimes these sequences are supplemented by a string of probabilities, corresponding to the position of the nucleotide, its probability of being in this position. FASTA file is a file with no probabilities, FASTQ is a file with probabilities.
Further, depending on what was the result of the experiment — pieces of DNA or RNA, one of the two ChIP-seq or RNA-seq sequencing techniques is performed , respectively. More details about them are here http://en.wikipedia.org/wiki/DNA_sequencing .
After the expensive sequencing machines work out and produce the result in the FAST A / Q file, it is necessary to find the resulting sequences in the genome. For ChIP-seq very quick search, possible in the home, the program does bowtie, found millions of reads in just 5 minutes. Those. she searches for an entry of a string of 36-50 characters in length, consisting of at least four alphabetic alphabets in strings with a total length of up to 10е9. Why is the turnover used at least: in addition to the standard use of the A / TG / C alphabet, the symbol N is often added to replace any possible letter (for more details http://genome.ucsc.edu/FAQ/FAQdownloads.html#download5) The program has many parameters. So, for example, you can resolve an error in a line (read) or two (up to a maximum of 3 errors). She can search not one occurrence of read in the genome, but many. It can slyly sort data, for example, if a read with one error was found unambiguously in the genome, and with two or three errors was found in many parts of the genome, then only the first result will be displayed. This process of finding reads in the genome is called mapping from English mapping (also called alignment). The algorithm for such a quick search is very interesting, but you can devote a separate article to this or find a linkto an English article from the developer's site, which tells how the bzip2 algorithm prompted them to such a solution. There are a lot of mapping programs and sites where they make it online; by the blast, eland + genome keywords in google you can find additional information.
For RNA-seq, the procedure is a little more complicated, ChIP-seq mapping is performed for it, and then those reads that did not find themselves during ChIP-seq are processed. A good program that actively uses bowtie for its work is called tophat . As a result of splicing and isoformism generated by itparts of the read can be located in different parts of the genome. For example, the first 15 characters may be in one part of the genome, and the other 11 in another. This division of the read into parts at the end of one exon and the beginning of another is called splice-junctions. Tophat allows you to find them, and also identifies new possible isoforms of existing genes.
The result of these programs is sam / bama file containing information from the FAST A / Q file plus information about the coordinates on the corresponding helices of the chromosome. The process from the library to the sam / bam file is often called the pipeline procedure and in many laboratories stands on the stream, so you should ask what parameters and program versions are installed by default. In general, this is where the introductory part ends, then the turn of research comes. It can be noted that from now on we have data that are with some certainty in the same coordinates: coordinate axis, annotation, reads with the corresponding coordinates.
You can begin to analyze data from a simple clarification of uniformity and continuity, complexity, ending with complex statistical calculations that help divide the reads into certain groups: noise, zero level and enrichment. Data analysis is necessary so that in the future it is possible to reasonably discard certain data.
If you are interested in the introductory part, then at each stage I can dwell in more detail. Unfortunately, the simple introductory part already takes several pages of dry text without pictures, so I did not dare to describe the written programs and mechanisms in this chapter. I myself am most attracted to the last paragraph, where I briefly mentioned statistics. I would like to consecrate existing libraries and mechanisms for working with such data. You can attach here the methods included in datamining (different types of clustering), which are described on the Habr. How to apply the Poisson distribution for analyzing data without control, how to use a complex chain of Poisson, f-test, to find sections enriched with reads on the genome (Diarac delta function)? Are ready-made libraries useful for working with intervals boost.intervals, boost.icl?
And, of course, if this topic is interesting, can someone tell me how and where to dig, in what matters, and supplement it. Or maybe he will write his own. For the solution of biological issues without mathematics and programming at this stage is already definitely impossible. There are English-language resources that discuss similar issues at www.seqanswers.com . But I would like to go further from the description of finished products and discuss the legality of using mathematical and statistical methods in these programs and the possibility of using new methods.
At the current stage of our work, we tried to find the parameters by which you can filter the reads, both interesting for research and noise. The task is quite nontrivial in the absence of control. In the future, it was decided to add a control to the library for sequencing, which will allow us to measure the level of error, but the statistics still will not let go.
First, a few definitions. We assume that chromosomes are one-dimensional coordinate axes, from unity to about 10e8. The length of the axis depends on the length of the chromosome. Each axis point is an integer. Biologists and chemists conducted a large number of experiments and thanks to them they were able to describe parts of chromosomes with great accuracy (about 90%). These descriptions are called annotations. The abstract contains information on the length of chromosomes, on the coordinates of individual sections of chromosomes, the most known of which are the gene, intron , exon. There are a huge number of these sections, but their main property is that these are segments or a set of segments located on the coordinate axis. Some segments may include others or intersect in some way. Here is a set of sites where you can see annotations of the genomes of man, mouse and the like.
Biologists with chemists conduct their experiments and as a result of their operations on cells, they get a solution that contains relatively small pieces of DNA or RNA (I don’t really want to go into the details of the difference or the same, just a sequence of nucleotides) This solution is passed through a sequencing apparatus, the output of which is small lines. These lines are the ends of pieces of DNA or RNA in solution. The length of the strings obtained from the equipment is only 36-50 bases (the length of the string in nucleotides) is sometimes longer, but at the moment it seems to be no more than 200. These segments, obtained from sequencing equipment and determined by the nucleotide sequence, are called reads (from English reads - “ reading"). It is worth noting that the reads are characterized only by the sequence of nucleotides, and not by the location on the genome. Sometimes these sequences are supplemented by a string of probabilities, corresponding to the position of the nucleotide, its probability of being in this position. FASTA file is a file with no probabilities, FASTQ is a file with probabilities.
Further, depending on what was the result of the experiment — pieces of DNA or RNA, one of the two ChIP-seq or RNA-seq sequencing techniques is performed , respectively. More details about them are here http://en.wikipedia.org/wiki/DNA_sequencing .
After the expensive sequencing machines work out and produce the result in the FAST A / Q file, it is necessary to find the resulting sequences in the genome. For ChIP-seq very quick search, possible in the home, the program does bowtie, found millions of reads in just 5 minutes. Those. she searches for an entry of a string of 36-50 characters in length, consisting of at least four alphabetic alphabets in strings with a total length of up to 10е9. Why is the turnover used at least: in addition to the standard use of the A / TG / C alphabet, the symbol N is often added to replace any possible letter (for more details http://genome.ucsc.edu/FAQ/FAQdownloads.html#download5) The program has many parameters. So, for example, you can resolve an error in a line (read) or two (up to a maximum of 3 errors). She can search not one occurrence of read in the genome, but many. It can slyly sort data, for example, if a read with one error was found unambiguously in the genome, and with two or three errors was found in many parts of the genome, then only the first result will be displayed. This process of finding reads in the genome is called mapping from English mapping (also called alignment). The algorithm for such a quick search is very interesting, but you can devote a separate article to this or find a linkto an English article from the developer's site, which tells how the bzip2 algorithm prompted them to such a solution. There are a lot of mapping programs and sites where they make it online; by the blast, eland + genome keywords in google you can find additional information.
For RNA-seq, the procedure is a little more complicated, ChIP-seq mapping is performed for it, and then those reads that did not find themselves during ChIP-seq are processed. A good program that actively uses bowtie for its work is called tophat . As a result of splicing and isoformism generated by itparts of the read can be located in different parts of the genome. For example, the first 15 characters may be in one part of the genome, and the other 11 in another. This division of the read into parts at the end of one exon and the beginning of another is called splice-junctions. Tophat allows you to find them, and also identifies new possible isoforms of existing genes.
The result of these programs is sam / bama file containing information from the FAST A / Q file plus information about the coordinates on the corresponding helices of the chromosome. The process from the library to the sam / bam file is often called the pipeline procedure and in many laboratories stands on the stream, so you should ask what parameters and program versions are installed by default. In general, this is where the introductory part ends, then the turn of research comes. It can be noted that from now on we have data that are with some certainty in the same coordinates: coordinate axis, annotation, reads with the corresponding coordinates.
You can begin to analyze data from a simple clarification of uniformity and continuity, complexity, ending with complex statistical calculations that help divide the reads into certain groups: noise, zero level and enrichment. Data analysis is necessary so that in the future it is possible to reasonably discard certain data.
If you are interested in the introductory part, then at each stage I can dwell in more detail. Unfortunately, the simple introductory part already takes several pages of dry text without pictures, so I did not dare to describe the written programs and mechanisms in this chapter. I myself am most attracted to the last paragraph, where I briefly mentioned statistics. I would like to consecrate existing libraries and mechanisms for working with such data. You can attach here the methods included in datamining (different types of clustering), which are described on the Habr. How to apply the Poisson distribution for analyzing data without control, how to use a complex chain of Poisson, f-test, to find sections enriched with reads on the genome (Diarac delta function)? Are ready-made libraries useful for working with intervals boost.intervals, boost.icl?
And, of course, if this topic is interesting, can someone tell me how and where to dig, in what matters, and supplement it. Or maybe he will write his own. For the solution of biological issues without mathematics and programming at this stage is already definitely impossible. There are English-language resources that discuss similar issues at www.seqanswers.com . But I would like to go further from the description of finished products and discuss the legality of using mathematical and statistical methods in these programs and the possibility of using new methods.
At the current stage of our work, we tried to find the parameters by which you can filter the reads, both interesting for research and noise. The task is quite nontrivial in the absence of control. In the future, it was decided to add a control to the library for sequencing, which will allow us to measure the level of error, but the statistics still will not let go.