Three simple steps to reduce Docker image size

Original author: Daniele Polencic

When building Docker containers, you should always strive to minimize image size. Images that share layers and weigh less are faster to transfer and deploy.


But how do you keep the size down when every RUN instruction creates a new layer, and you need intermediate artifacts before the image itself can be built?


You may have noticed that most Dockerfiles follow some rather odd conventions, for example:


FROM ubuntu
RUN apt-get update && apt-get install vim

Why the &&? Wouldn't it be easier to run two RUN instructions, like this?


FROM ubuntu
RUN apt-get update
RUN apt-get install vim

Starting with Docker 1.10, the COPY, ADD and RUN instructions each add a new layer to the image. In the previous example, two layers were created instead of one.
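
If you want to see the difference for yourself, here is a quick sketch; the file and tag names are placeholders, assuming you save the two variants above as separate Dockerfiles:

# docker history prints one row per layer, so the two-RUN variant shows one extra layer.
$ docker build -f Dockerfile.single -t vim-single .
$ docker build -f Dockerfile.split -t vim-split .
$ docker history vim-single
$ docker history vim-split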




Layers are like Git commits.


Docker layers store the differences between the previous and the current version of the image. And like Git commits, they are handy when shared with other repositories or images: when you pull an image from a registry, only the missing layers are downloaded, which makes sharing images between containers much quicker.


But each layer takes up space, and the more of them there are, the heavier the final image. Git repositories are similar in this respect: the repository grows with the number of commits, because it has to keep all the changes between them. It used to be considered good practice to combine several RUN instructions on a single line, as in the first example. These days there are better options.


1. Squash multiple layers into one with multi-stage Docker builds


When a Git repository grows, you can squash the entire change history into a single commit and forget about it. It turns out something similar can be done in Docker with multi-stage builds.


Let's create a Node.js container.


Let's start with index.js:


const express = require('express')
const app = express()
app.get('/', (req, res) => res.send('Hello World!'))
app.listen(3000, () => {
 console.log(`Example app listening on port 3000!`)
})

and package.json:


{
 "name": "hello-world",
 "version": "1.0.0",
 "main": "index.js",
 "dependencies": {
   "express": "^4.16.2"
 },
 "scripts": {
   "start": "node index.js"
 }
}

We will pack the application with the following Dockerfile:


FROM node:8
EXPOSE 3000
WORKDIR /app
COPY package.json index.js ./
RUN npm install
CMD ["npm", "start"]

Create an image:


$ docker build -t node-vanilla .

Check that everything works:


$ docker run -p 3000:3000 -ti --rm --init node-vanilla

Now open http://localhost:3000 and you should see “Hello World!”.
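
If you prefer the terminal, you can also check it from another shell while the container is running:

# Should print "Hello World!" if the app is up.
$ curl http://localhost:3000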


Our Dockerfile uses COPY and RUN instructions, so we should expect the image to gain at least two new layers compared to the base image:


$ docker history node-vanilla
IMAGE          CREATED BY                                      SIZE
075d229d3f48   /bin/sh -c #(nop)  CMD ["npm" "start"]          0B
bc8c3cc813ae   /bin/sh -c npm install                          2.91MB
bac31afb6f42   /bin/sh -c #(nop) COPY multi:3071ddd474429e1…   364B
500a9fbef90e   /bin/sh -c #(nop) WORKDIR /app                  0B
78b28027dfbf   /bin/sh -c #(nop)  EXPOSE 3000                  0B
b87c2ad8344d   /bin/sh -c #(nop)  CMD ["node"]                 0B
<missing>      /bin/sh -c set -ex   && for key in     6A010…   4.17MB
<missing>      /bin/sh -c #(nop)  ENV YARN_VERSION=1.3.2       0B
<missing>      /bin/sh -c ARCH= && dpkgArch="$(dpkg --print…   56.9MB
<missing>      /bin/sh -c #(nop)  ENV NODE_VERSION=8.9.4       0B
<missing>      /bin/sh -c set -ex   && for key in     94AE3…   129kB
<missing>      /bin/sh -c groupadd --gid 1000 node   && use…   335kB
<missing>      /bin/sh -c set -ex;  apt-get update;  apt-ge…   324MB
<missing>      /bin/sh -c apt-get update && apt-get install…   123MB
<missing>      /bin/sh -c set -ex;  if ! command -v gpg > /…   0B
<missing>      /bin/sh -c apt-get update && apt-get install…   44.6MB
<missing>      /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>      /bin/sh -c #(nop) ADD file:1dd78a123212328bd…   123MB

As you can see, the final image has grown by five new layers: one for each instruction in our Dockerfile. Now let's try a multi-stage Docker build. We'll use an almost identical Dockerfile, split into two stages:


FROM node:8 as build
WORKDIR /app
COPY package.json index.js ./
RUN npm install
FROM node:8
COPY --from=build /app /
EXPOSE 3000
CMD ["index.js"]

The first stage of the Dockerfile creates three layers. Those layers are then merged and copied into the second, final stage as a single COPY layer, and the EXPOSE and CMD instructions add two more layers on top. In total, the final image gains only three layers.




Let's try it. First, build the image:


$ docker build -t node-multi-stage .

Checking history:


$ docker history node-multi-stage
IMAGE          CREATED BY                                      SIZE
331b81a245b1   /bin/sh -c #(nop)  CMD ["index.js"]             0B
bdfc932314af   /bin/sh -c #(nop)  EXPOSE 3000                  0B
f8992f6c62a6   /bin/sh -c #(nop) COPY dir:e2b57dff89be62f77…   1.62MB
b87c2ad8344d   /bin/sh -c #(nop)  CMD ["node"]                 0B
<missing>      /bin/sh -c set -ex   && for key in     6A010…   4.17MB
<missing>      /bin/sh -c #(nop)  ENV YARN_VERSION=1.3.2       0B
<missing>      /bin/sh -c ARCH= && dpkgArch="$(dpkg --print…   56.9MB
<missing>      /bin/sh -c #(nop)  ENV NODE_VERSION=8.9.4       0B
<missing>      /bin/sh -c set -ex   && for key in     94AE3…   129kB
<missing>      /bin/sh -c groupadd --gid 1000 node   && use…   335kB
<missing>      /bin/sh -c set -ex;  apt-get update;  apt-ge…   324MB
<missing>      /bin/sh -c apt-get update && apt-get install…   123MB
<missing>      /bin/sh -c set -ex;  if ! command -v gpg > /…   0B
<missing>      /bin/sh -c apt-get update && apt-get install…   44.6MB
<missing>      /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>      /bin/sh -c #(nop) ADD file:1dd78a123212328bd…   123MB

Now check whether the image size has changed:


$ docker images | grep node-
node-multi-stage   331b81a245b1   678MB
node-vanilla       075d229d3f48   679MB

Yes, it is smaller, but not by much yet.
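
The size barely changed because the final stage is still based on node:8, but the number of layers did drop. A rough way to confirm it, using the image names from above, is to count the rows that docker history reports:

# Each row corresponds to one layer; the multi-stage image has fewer of them.
$ docker history -q node-vanilla | wc -l
$ docker history -q node-multi-stage | wc -l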


2. Strip everything unnecessary from the container with distroless


The current image ships Node.js as well as yarn, npm, bash and many other binaries. It is also based on a full Linux distribution, so by deploying it we get an entire operating system with plenty of binaries and utilities.


However, we do not need any of them to run the container. The only dependency we actually need is Node.js.


A Docker container should run a single process and contain only the minimum set of tools needed to start it. It does not need an entire operating system.


So we can strip everything out of the image except Node.js.


But how?


Google has already come to the same conclusion: GoogleCloudPlatform/distroless.


The repository's description reads:


Distroless images contain only the application and its runtime dependencies. There are no package managers, shells or any other programs you would expect to find in a standard Linux distribution.


Exactly what we need!


Update the Dockerfile to build the new image:


FROM node:8 as build
WORKDIR /app
COPY package.json index.js ./
RUN npm install
FROM gcr.io/distroless/nodejs
COPY --from=build /app /
EXPOSE 3000
CMD ["index.js"]

Build the image as usual:


$ docker build -t node-distroless .

The application should work exactly as before. To check, run the container:


$ docker run -p 3000:3000 -ti --rm --init node-distroless

Then open http://localhost:3000. Did the image get lighter without the extra binaries?


$ docker images | grep node-distroless
node-distroless   7b4db3b7f1e5   76.7MB

It certainly did! It now weighs only 76.7 MB, a whole 600 MB less!


This is all great, but there is one important caveat. When a container is running and you need to inspect it, you would normally attach to it with:


$ docker exec -ti <insert_docker_id> bash

Attaching to a running container and launching bash feels a lot like starting an SSH session.


But since distroless is a stripped-down version of the original operating system, there are no extra binaries and no shell at all!


How do you connect to a running container if there is no shell?


The most interesting part is that you can't.


At first glance that is not great: the only things you can execute in the container are binaries, and the only binary available is Node.js:


$ docker exec -ti <insert_docker_id> node

In fact, there is an upside: if an attacker somehow gains access to the container, they can do far less damage than they could with a shell. In other words, fewer binaries means less weight and better security, at the price of more painful debugging.
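
That said, a limited amount of inspection is still possible through the Node.js binary itself. A minimal sketch, assuming node resolves on the PATH exactly as in the command above (the container ID is a placeholder):

# Use Node.js as a makeshift inspection tool: list the container's root directory.
$ docker exec -ti <insert_docker_id> node -e "console.log(require('fs').readdirSync('/'))"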


It is worth noting that you probably should not be attaching to and debugging containers in production anyway; it is better to rely on properly configured logging and monitoring.
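
For example, the container's stdout and stderr are always available from the host, no shell required (the container ID is a placeholder):

# Stream the application logs of a running container.
$ docker logs -f <insert_docker_id>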


But what if you do need to debug, and still want the Docker image to be as small as possible?


3. Reduce the base image size with Alpine


You can replace the distroless base image with an Alpine-based one.


Alpine Linux is a security-oriented, lightweight distribution based on musl libc and BusyBox. But let's not take that on faith; let's check.


Update the Dockerfile to use node:8-alpine:


FROM node:8 as build
WORKDIR /app
COPY package.json index.js ./
RUN npm install
FROM node:8-alpine
COPY --from=build /app /
EXPOSE 3000
CMD ["npm", "start"]

Create an image:


$ docker build -t node-alpine .

Check the size:


$ docker images | grep node-alpine
node-alpine   aa1f85f8e724   69.7MB

We get 69.7 MB, even smaller than the distroless image.


Let's check whether we can attach to a running container (with the distroless image, we could not).


We start the container:


$ docker run -p 3000:3000 -ti --rm --init node-alpine
Example app listening on port 3000!

And connect:


$ docker exec -ti 9d8e97e307d7 bash
OCI runtime exec failed: exec failed: container_linux.go:296: starting container process caused "exec: \"bash\": executable file not found in $PATH": unknown

No luck. But maybe the container at least has sh:


$ docker exec -ti 9d8e97e307d7 sh
/ #

Excellent! We can attach to the container, and its image is smaller, too. But there are some caveats.


Alpine images are based on musl libc, an alternative implementation of the standard C library, while most Linux distributions such as Ubuntu, Debian and CentOS are based on glibc. Both libraries are supposed to provide the same interface to the kernel.


However, they have different goals: glibc is the most common and the fastest, while musl uses less space and is written with an emphasis on security. When an application is compiled, it is, as a rule, compiled against one particular C library. If you need to run it with a different library, you have to recompile it.
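
If you are unsure which C library a particular binary was linked against, ldd can tell you. The path below assumes the layout of the official node:8 image:

# In the Debian-based node:8 image, node links against glibc (libc.so.6).
$ docker run --rm node:8 ldd /usr/local/bin/node
# Native Node.js addons can usually be rebuilt against the current platform's libc:
$ npm rebuild --build-from-source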


In other words, building containers on Alpine images can lead to surprises, because the standard C library is different. You will notice the difference when working with precompiled binaries, such as C++ extensions for Node.js.


For example, the PhantomJS package does not work on Alpine.


So which base image to choose?


Whether to go with Alpine, distroless or a vanilla image is, of course, best decided case by case.


If you are running in production and security matters, distroless is probably the most appropriate choice.


Every binary added to a Docker image adds some risk to the stability of the whole application. That risk can be reduced by having only a single binary installed in the container.


For example, if an attacker finds a vulnerability in an application running on a distroless image, they will not be able to spawn a shell in the container, because there isn't one!


If for some reason image size is critically important to you, you should definitely look at Alpine-based images.


They are genuinely small, but at the cost of compatibility: Alpine uses a slightly different standard C library, musl, so problems will pop up from time to time. Examples can be found here: https://github.com/grpc/grpc/issues/8528 and https://github.com/grpc/grpc/issues/6126.


Vanilla images are ideal for testing and development.


Yes, they are big, but they are as close as you can get to a full machine with a complete operating system installed, and all the usual binaries are available inside.
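
For example, ad-hoc debugging in a vanilla image is straightforward, because a full shell and package manager are right there (the container ID is a placeholder):

# Attach with a proper shell...
$ docker exec -ti <insert_docker_id> bash
# ...and inside the container the usual tooling is available or an apt-get away:
#   apt-get update && apt-get install -y curl
#   curl -v http://localhost:3000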


Let's summarize the sizes of the resulting Docker images:


node:8                              681 MB
node:8 with a multi-stage build     678 MB
gcr.io/distroless/nodejs            76.7 MB
node:8-alpine                       69.7 MB
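
One way to reproduce this comparison locally, assuming the images built earlier are still present:

# Print name and size for the node-* and distroless images built in this article.
$ docker images --format "{{.Repository}}: {{.Size}}" | grep -E "node|distroless"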

