DNA methylation and bioinformatics

After reading the introductory article on bioinformatics on this portal, in particular the part on ChIP-Seq and RNA-Seq technologies, I really liked the idea of adding, as far as possible, to the Russian-language articles on bioinformatics, and especially on its "practical" side. So here is a brief overview of a pipeline for analyzing the methylome using Illumina 450K Human Methylation technology.

During the life of an organism, the nucleotide sequence of its DNA generally remains unchanged (for more information about genes, the genome and DNA, see, for example, this article). Nevertheless, there are processes that affect how the genome works without altering the sequence, and some of these changes can even be inherited. Such processes are called epigenetic changes.

One of the main epigenetic mechanisms is DNA methylation. Methylation is a modification of the DNA molecule in which a methyl group (-CH3) is attached to a cytosine (C) nucleotide that is followed by a guanine (G) nucleotide. The nucleotide sequence -CG- is called a CpG dinucleotide, or CpG site. A given site is not methylated in every cell at once, so one speaks of the percentage methylation of a specific CpG site.

DNA methylation is one of the important mechanisms for regulating gene expression. Diseases such as various types of cancer, type 1 and type 2 diabetes, schizophrenia and others have been shown to be associated with changes in the methylation profile. It is therefore important to be able to analyze the methylation profile of the genome.

Several methods for quantitatively measuring the methylation profile are currently in wide use. One of the most common is the Illumina series of microarrays. I will describe the Illumina 450K Infinium Array chip in more detail, along with the analysis of the data it produces.

The 450K chip measures the methylation level of approximately 486,000 CpG sites, distributed more or less evenly across the genome. Without going into the biological and chemical details of how the chip works, the technology can be summarized as follows. Each CpG site is measured using two fluorescent probes. The fluorescence signals of the two probes are proportional to the amounts of methylated and unmethylated copies of the CpG site in the test sample, respectively. One chip can process up to 12 biological samples at a time.
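As a rough illustration of how the two probe signals can be turned into a methylation estimate, the sketch below computes the two scales commonly used for this kind of array: the Beta value, M / (M + U + offset), and the M-value, log2((M + alpha) / (U + alpha)), where M and U are the methylated and unmethylated intensities. It is written in plain Python only to stay dependency-free. The function names and the example intensities are invented for illustration, and the offsets (100 and 1) are conventional defaults rather than anything fixed by the chip.

```python
import math


def beta_value(meth, unmeth, offset=100.0):
    """Share of the methylated signal, between 0 and 1.

    The offset stabilizes the ratio when both intensities are low
    (100 is a conventional choice, assumed here).
    """
    meth = max(meth, 0.0)      # negative intensities are clipped to zero
    unmeth = max(unmeth, 0.0)
    return meth / (meth + unmeth + offset)


def m_value(meth, unmeth, alpha=1.0):
    """Log2 ratio of methylated to unmethylated signal.

    alpha avoids taking the logarithm of zero (1 is assumed here).
    """
    meth = max(meth, 0.0)
    unmeth = max(unmeth, 0.0)
    return math.log2((meth + alpha) / (unmeth + alpha))


if __name__ == "__main__":
    # Hypothetical (methylated, unmethylated) intensities for three CpG sites.
    sites = [(3000.0, 1000.0), (200.0, 4000.0), (1500.0, 1500.0)]
    for meth, unmeth in sites:
        print(round(beta_value(meth, unmeth), 3), round(m_value(meth, unmeth), 3))
```

Beta values are easy to read as a methylation percentage, while M-values are unbounded and tend to behave better in statistical tests, which is why the choice of scale appears as a separate step in the pipeline.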

So, at the output we have a table of values in which the number of rows equals the number of CpG sites and the number of columns equals the number of biological samples analyzed. From this point on, bioinformatics proper begins.

A data analysis pipeline built with the R language and the Bioconductor project consists of roughly the following steps (the corresponding Bioconductor packages are indicated):

1. Choosing a measurement scale (Beta value or M-value). More details here.

2. Color channel balance adjustment. Some CpG sites are measured with probes of a single color, and others with probes of two colors. This is handled by normalizing the signals of the two color channels within each biological sample.

3. Background correction. Each sample slot on the chip has its own background level. To make the values comparable between samples, a background correction is therefore needed.

4. Between-sample normalization. The most commonly used methods are quantile normalization and SVN normalization (lumi package).

5. Testing for a batch effect using principal component analysis (PCA).

6. Peak correction.

7. Batch effect correction using ComBat and surrogate variable analysis (both available in the sva package).

8. Testing for statistical significance using linear models, permutation tests, or conventional hypothesis tests (limma and multtest packages).

9. Data analysis using various machine learning algorithms (I will not list them; there is a whole ocean of possibilities).

10. Correlation with gene expression and SNP data (methylation quantitative trait loci, mQTL). The MatrixEQTL package is recommended.
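To make the between-sample normalization step (item 4) a little more concrete, here is a minimal sketch of quantile normalization in plain Python. It is not the lumi implementation, and the function name and the toy matrix are invented for illustration. The sketch assumes a complete table (rows are CpG sites, columns are samples) with no missing values, and it breaks ties by row order, whereas production implementations usually average tied ranks.

```python
def quantile_normalize(matrix):
    """Force every column (sample) to share the same distribution.

    matrix: list of rows, one per CpG site; each row holds one value per sample.
    Returns a new matrix of the same shape; the input is left untouched.
    """
    if not matrix:
        return []
    n_rows = len(matrix)
    n_cols = len(matrix[0])

    # Pull out each sample as its own column.
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [
        sum(col[k] for col in sorted_cols) / n_cols for k in range(n_rows)
    ]

    # Replace each value by the reference value of its rank within the sample.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])  # stable: ties keep row order
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


if __name__ == "__main__":
    # Hypothetical values for 4 CpG sites measured in 3 samples.
    toy = [
        [5.0, 4.0, 3.0],
        [2.0, 1.0, 4.0],
        [3.0, 4.0, 6.0],
        [4.0, 2.0, 8.0],
    ]
    for row in quantile_normalize(toy):
        print([round(v, 3) for v in row])
```

After normalization every sample contains exactly the same set of values, only assigned to different CpG sites, so differences between samples that come purely from overall signal intensity are removed.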

I apologize if this reads as somewhat disjointed: that is the price of trying to fit everything into one short overview article. If there is interest, I will describe the process of building the pipeline in several more detailed articles, with sample R code.
