Bioinformatic pipeline using Docker

In this article I want to share my experience of developing a pipeline that uses Docker to analyze biomedical data. Some readers will be interested in the bioinformatic pipeline itself and others in the use of Docker, so the article is split into two parts.

Part 1. Bioinformatic pipeline, or what we did and why

The technology for reading the DNA sequences of living organisms keeps improving. Scientists are steadily learning which sections of DNA are responsible for what and how they affect the body. This work has great medical potential: DNA sequence analysis can be a powerful tool in the fight against aging, cancer, and genetic diseases. We built a pipeline that analyzes the sequence of selected sections of a person's DNA and tries to predict whether that person is at risk of cardiomyopathy, a genetically caused heart disease.

Why are only some sections of DNA read? Reading all of the DNA (3.2 billion nucleotides, or “letters”) costs much more. To find out whether a particular person has “mistakes” leading to a given genetically determined disease, it is enough to read only the sections that influence the development of that disease, which is much cheaper.

How do scientists know which DNA sections to read? By comparing the genomes of healthy and sick people. These data are quite reliable, because the whole genomes of more than a thousand people around the world are known today.

Once the reads for the necessary DNA sections are obtained, a prediction has to be made about whether their owner is likely to develop the disease. In other words, we need to determine whether the sequences in these sections look like those of a healthy person or contain the abnormalities found in sick people.

There are quite a lot of such studies, so there are best practices for how to do this. They describe in detail how to align the cleaned data to the reference genome (that is, the genome of some abstract healthy person), how to find “single-letter” differences between them, and how to analyze these differences: weed out the obviously insignificant ones and look up the rest in biomedical databases. For all these steps the bioinformatics community has developed many programs: bwa, gatk, annovar, etc. The pipeline is arranged so that the output of one program is the input to the next, and so on, until the desired result is obtained. There are many ways to implement a pipeline; in our work, inspired by the excellent course “Management of Computing”, we used snakemake.
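The chaining idea can be sketched in plain Python. This is not snakemake and not the actual workflow from our repository: only the tool names come from the text, while the file names and the RULES table are hypothetical. Each rule states which file a tool produces and which file it needs, and the planner walks backwards from the desired result, which is roughly how a workflow manager decides the order of steps.

```python
# Hypothetical rules: output file -> (tool that makes it, input file it needs).
# Tool names come from the article; all file names are made up for illustration.
RULES = {
    "sample.bam": ("bwa", "sample.fastq"),
    "sample.vcf": ("gatk", "sample.bam"),
    "sample.annotated.txt": ("annovar", "sample.vcf"),
}


def plan(target, rules):
    """Return the ordered (tool, input, output) steps needed to build target."""
    steps = []
    current = target
    # Walk backwards from the target until we reach a file that no rule
    # produces; that file is assumed to be raw input already on disk.
    while current in rules:
        tool, source = rules[current]
        steps.append((tool, source, current))
        current = source
    steps.reverse()  # execute from the raw data towards the final result
    return steps


if __name__ == "__main__":
    for tool, source, output in plan("sample.annotated.txt", RULES):
        print(f"{tool}: {source} -> {output}")
```

A real workflow manager additionally skips steps whose outputs are already up to date and handles rules with several inputs; the sketch leaves both out.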


Using the pipeline, we analyzed data for one family, some of whose members had been diagnosed with cardiomyopathy (outlined in red in the figure). We found variations (that is, deviations from the reference genome) that, according to medical databases, occur in people with this disease (marked in blue and green in the figure).


What conclusions can be drawn from all this? As expected, the pipeline cannot make a diagnosis by itself. It can only provide information on the basis of which a doctor decides whether the risk for a given person is high or not. This ambiguity arises because cardiomyopathy is a complex disease that depends on both genetic and external factors. The biochemical mechanisms of its onset are not known, so it is impossible to say exactly which sets of variations will lead to the disease. All we have is statistics on sick and healthy people, which allow the doctor to assess the likelihood of the disease and, if necessary, start treatment in time.

We also attempted to evaluate the quality of the pipeline. As mentioned above, the pipeline finds variations, that is, “single-letter” deviations of the DNA sequence of the person under study from the reference genome. It then analyzes them and looks them up in biomedical databases. The most debatable step, and the one that requires fine tuning, is finding these variations. Here you need to strike a balance between redundancy, when there are too many variations and most of them are garbage, and insufficiency, when the variations are filtered so strictly that necessary information is lost. Quality control therefore came down to checking how well the pipeline finds variations in data for which we know the “right answer”. We used Genome in a Bottle as this data: a particular human genome that has been read as accurately as possible and for which reliable data on variations exist. Quality control gave an 85% match, which is pretty good.
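The article does not state exactly how the match was computed, so the following is only one plausible way to score it. The sketch assumes each variation is represented as a (chromosome, position, reference letter, alternative letter) tuple, and the toy data below are invented, not real Genome in a Bottle calls. Precision drops when there is too much "garbage", and recall drops when filtering is too strict.

```python
def concordance(called, truth):
    """Compare two collections of variants given as (chrom, pos, ref, alt) tuples."""
    called, truth = set(called), set(truth)
    shared = called & truth
    # Precision: share of reported variants that are real (low = "redundancy").
    precision = len(shared) / len(called) if called else 0.0
    # Recall: share of known variants the pipeline found (low = "insufficiency").
    recall = len(shared) / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data for illustration only.
    truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
             ("chr2", 75, "G", "A"), ("chr3", 10, "T", "C")}
    called = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
              ("chr2", 75, "G", "A"), ("chr5", 999, "A", "T")}
    precision, recall = concordance(called, truth)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```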

Part 2. Using Docker

If the main idea of this article were expressed in one sentence, it would be: “Use Docker in your pipelines, it makes things much more convenient”. Indeed, what problems do people who use pipelines usually run into? If the pipeline lives on your working computer, you can inadvertently change the environment or the dependencies of the programs it uses, and automatic updates can happen too. All of this can make the pipeline behave slightly differently than before or start producing errors. Deploying a pipeline on a new computer can also be problematic: you need to install all the programs, again keep track of versions and dependencies, and take the operating system into account. With Docker none of these problems arise, and in order to run the pipeline on a new computer,

The idea of Docker is that each program used by the pipeline runs in an isolated container in which the developer sets up the necessary dependencies and environment. Everything the program needs is described in the corresponding Dockerfile; then, with the docker build command, the developer builds a container image that can be uploaded to dockerhub. When someone (the pipeline or another user) wants to use this program with these dependencies, they simply download the desired image from dockerhub and use the docker create command to create the necessary container on their computer.


Our pipeline using Docker is available on github. Each time it invokes a program, the pipeline starts the corresponding container, passes it the necessary parameters, and the calculation runs. In fact, the programmer's whole job was to write a Dockerfile for each container. A Dockerfile specifies the base image (FROM), the commands to execute in that image (RUN), and the files to add (ADD); you can also specify the working directory (WORKDIR), into which the folder with the data needed for the calculations is mounted when the container starts. An image is built from the Dockerfile:

$ docker build -t imagename .

The image is then uploaded to a repository.

You can read more about Dockerfiles on the official website. Let us describe some typical cases for our pipeline:

You need to run a standard program installed from the repositories, for example picard-tools. The Dockerfile looks like this:

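FROM ubuntu:14.04
# Install picard-tools from the Ubuntu repositories and create the data folder
RUN apt-get update && apt-get install -y picard-tools \
    && mkdir /root/source
# The host data folder is mounted here when the container starts
WORKDIR /root/source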

$ docker run -it --rm -v $(pwd):/root/source picard-tools picard-tools MarkDuplicates INPUT={input} OUTPUT={output[0]} METRICS_FILE={output[1]} 
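If a step like the one above is launched from a script rather than typed by hand, the same command can be assembled as an argument list. The sketch below is an illustration, not the way our snakemake pipeline does it: the helper name docker_run_command and the file names are hypothetical, the braces in the original command are placeholders that are replaced here with concrete names, and the -it flags are left out on the assumption that the step runs non-interactively.

```python
import os
import shlex


def docker_run_command(image, tool_args, workdir=None, mount_point="/root/source"):
    """Build the argv list for running one pipeline step in a container."""
    workdir = workdir or os.getcwd()  # plays the role of $(pwd) in the shell command
    return [
        "docker", "run", "--rm",           # --rm removes the container when it exits
        "-v", f"{workdir}:{mount_point}",  # mount the data folder into the container
        image,
        *tool_args,
    ]


if __name__ == "__main__":
    cmd = docker_run_command(
        "picard-tools",
        ["picard-tools", "MarkDuplicates",
         "INPUT=in.bam", "OUTPUT=out.bam", "METRICS_FILE=metrics.txt"],
    )
    # Print instead of executing, so the sketch runs without Docker installed.
    # To really run the step: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Passing the command as a list avoids shell quoting problems when file names contain spaces.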

You need to run your own shell script, which parses a file. For this you can use the standard ubuntu Docker image:

$ docker run -it --rm -v $(pwd):/root/source -w="/root/source" ubuntu:16.04 /bin/sh scripts/ {input} {output}

You need to run a program that is not in the standard repositories. Dockerfile:

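FROM ubuntu:16.04
# annovar consists of perl scripts; wget is installed alongside perl
RUN apt-get update && apt-get install -y perl wget \
    && mkdir /root/source
# Copy the local annovar folder into the image and put it on PATH
ADD annovar /root/annovar
ENV PATH="/root/annovar:${PATH}"
WORKDIR "/root/source"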

$ docker run -it --rm -v $(pwd):/root/source annovar {input} reference/humandb/ -buildver hg38 -out {} -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation gx,r,f,f,f -nastring . -vcfinput

You need to run a Java application. Here we use ENTRYPOINT, which allows you to run the container as an executable file.

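FROM ubuntu:16.04
# Install a Java runtime and create the data folder
RUN apt-get update && apt-get install -y default-jre \
    && mkdir /root/source
ADD GenomeAnalysisTK.jar /root/GenomeAnalysisTK.jar
WORKDIR "/root/source"
# Arguments given to docker run after the image name are appended to this command
ENTRYPOINT ["/usr/bin/java", "-jar", "/root/GenomeAnalysisTK.jar"]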

$ docker run -it --rm -v $(pwd):/root/source gatk -R {input[0]} -T HaplotypeCaller -I {input[1]} -o {output}
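To make the effect of ENTRYPOINT concrete, the small Python model below shows how the process inside the container gets its command line when the exec (JSON array) form is used: arguments written after the image name replace the image's default CMD, if any, and are appended to ENTRYPOINT. The function name effective_command and the file names are hypothetical; this only models the rule and does not call Docker.

```python
def effective_command(entrypoint, run_args, default_cmd=()):
    """Return the argv the container process starts with (exec-form ENTRYPOINT)."""
    # Arguments given after the image name replace the image's CMD;
    # if none are given, CMD (when present) supplies the default arguments.
    args = list(run_args) if run_args else list(default_cmd)
    return list(entrypoint) + args


if __name__ == "__main__":
    entrypoint = ["/usr/bin/java", "-jar", "/root/GenomeAnalysisTK.jar"]
    # Placeholder file names stand in for the {input} and {output} fields.
    run_args = ["-R", "ref.fa", "-T", "HaplotypeCaller", "-I", "in.bam", "-o", "out.vcf"]
    print(" ".join(effective_command(entrypoint, run_args)))
```

This is why the gatk command above starts directly with the tool's options: the java -jar part is already baked into the image.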
