You don't need it

    Given the growing popularity of Rook, I want to talk about its pitfalls and the problems that await you along the way.

    About me: I have administered Ceph since the Hammer release; founder of the t.me/ceph_ru community on Telegram.

    To avoid unsubstantiated claims, I will refer to posts about Ceph problems that were well received on Habr (judging by their ratings). I ran into most of the problems described in those posts myself. Links to the sources are at the end of the post.

    A post about Rook mentions Ceph for a reason: Rook is essentially Ceph wrapped in Kubernetes, which means it inherits all of Ceph's problems. Let's start with those.

    Simplified cluster management

    One of the advertised advantages of Rook is the convenience of managing Ceph through Kubernetes.

    However, Ceph has more than 1000 tunable parameters, while through Rook we can edit only a small fraction of them.

    An example on Luminous:
    > ceph daemon mon.a config show | wc -l

    Rook is positioned as a convenient way to install and update Ceph. Installing Ceph without Rook is no problem: an Ansible playbook can be written in 30 minutes. Updating it is where the problems pile up.

    A quote from Krok's post:

    For example: incorrect behavior of crush tunables after upgrading from Hammer to Jewel

    > ceph osd crush show-tunables
    "straw_calc_version": 1,
    "allowed_bucket_algs": 22,
    "profile": "unknown",
    "optimal_tunables": 0,
    But there are problems even within minor versions.

    Example: the 12.2.6 update bringing the cluster into HEALTH_ERR state with nominally broken PGs

    Don't upgrade, wait and test first? But we use Rook partly for the convenience of updates.

    The complexity of disaster recovery in a Rook cluster

    Example: an OSD crashes, spewing errors into its logs. You suspect that the problem is in one of the config parameters and want to change the config for that specific daemon, but you cannot, because you have Kubernetes and a DaemonSet.

    There is no alternative: ceph tell osd.Num injectargs does not work, because the OSD is down.
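In a plain Ceph deployment you would edit that one host's config and restart just the one daemon. A hedged sketch of the contrast; the `ceph` invocations assume a working cluster:

```shell
# Hedged sketch: try to change a parameter on a live OSD; if the daemon is
# down, injectargs cannot reach it.
set_osd_option() {
    osd_id=$1
    option=$2
    if ceph tell "osd.${osd_id}" injectargs "${option}" 2>/dev/null; then
        echo "injected ${option} into osd.${osd_id}"
    else
        # On bare metal you would now edit this host's ceph.conf and restart
        # only ceph-osd@${osd_id}; with a shared pod template in Rook there
        # is no per-daemon config to edit, which is exactly the problem.
        echo "osd.${osd_id} is down: injectargs cannot reach it"
        return 1
    fi
}
# Example: set_osd_option 7 '--osd_recovery_max_active=1'
```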

    Debugging complexity

    For some settings and performance tests you have to connect directly to the OSD daemon's admin socket. In the case of Rook, you first need to find the right container, exec into it, discover that the tooling you need for debugging is missing, and get very upset.
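On bare metal the admin socket is directly at hand. A sketch of the contrast: the socket path pattern is Ceph's default, while the Rook namespace and pod name below are assumptions for a typical install:

```shell
# Build the default admin-socket path for an OSD.
asok_path() {
    echo "/var/run/ceph/ceph-osd.$1.asok"
}
# On bare metal (assumes the daemon runs locally):
#   ceph --admin-daemon "$(asok_path 0)" perf dump
#   ceph --admin-daemon "$(asok_path 0)" dump_historic_ops
# With Rook you first have to find and enter the right container:
#   kubectl -n rook-ceph exec -it <osd-pod> -- ceph daemon osd.0 perf dump
```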

    The difficulty of raising OSDs one at a time

    Example: an OSD is killed by the OOM killer, a rebalance begins, and then the next OSDs fall.

    Solution: raise the OSDs one at a time, wait until each is fully included in the cluster, then raise the next. (Details in the talk "Ceph. Disaster anatomy".)

    With a bare-metal installation this is easily done by hand; with Rook and one OSD per node there are no particular problems either, but sequential raising becomes an issue when there is more than one OSD per node.

    Of course these problems are solvable, but we adopt Rook for simplification and get complication instead.
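The bare-metal "one OSD at a time" procedure can be sketched as below, assuming systemd-managed OSDs; the `ceph` and `systemctl` calls are assumptions about the host environment:

```shell
# Wait until the cluster reports HEALTH_OK before touching the next daemon.
wait_for_health_ok() {
    while [ "$(ceph health)" != "HEALTH_OK" ]; do
        sleep 10
    done
}
# Start the given OSD ids strictly one after another.
raise_osds_one_by_one() {
    for id in "$@"; do
        systemctl start "ceph-osd@${id}"
        wait_for_health_ok   # let the OSD fully join before starting the next
    done
}
# Example: raise_osds_one_by_one 3 7 11
```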

    The difficulty of choosing limits for Ceph daemons

    For bare-metal Ceph installations it is fairly easy to calculate the resources needed per cluster: there are formulas and there are studies. With weak CPUs you will still have to run a series of performance tests and learn what NUMA is, but this is still simpler than in Rook.

    In the case of Rook, in addition to the memory limits, which can be calculated, the question of setting a CPU limit arises.

    And here you have to sweat over performance tests. If you set the limits too low, you get a slow cluster; if you leave them unlimited, you get heavy CPU usage during rebalance, which will badly affect your applications in Kubernetes.
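For illustration, per-daemon resource limits go into the Rook CephCluster spec; the field layout below follows typical Rook CRDs (verify against your Rook version), and the values are placeholders that only your own performance tests can validate:

```yaml
spec:
  resources:
    osd:
      requests:
        cpu: "1"
        memory: "4Gi"
      limits:
        cpu: "2"        # too low: a slow cluster; absent: rebalance eats the node
        memory: "6Gi"   # memory limits, at least, can be calculated
```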

    Networking issues v1

    For Ceph it is recommended to use a 2x10 Gbit network: one for client traffic, the other for Ceph's internal traffic (replication and rebalance). If you live with Ceph on bare metal, this separation is easy to configure; if you live with Rook, separating by network will cause problems, because far from every cluster config allows feeding two different networks into a pod.
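On bare metal the separation is a couple of lines in ceph.conf, using the standard public/cluster network options (the subnets below are examples):

```ini
[global]
# client and MON traffic
public network = 10.0.1.0/24
# OSD replication and rebalance traffic
cluster network = 10.0.2.0/24
```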

    Networking issues v2

    If you give up on separating the networks, then during a rebalance Ceph traffic will clog the entire channel, and your applications in Kubernetes will slow down or crash. You can reduce the Ceph rebalance speed, but then, because of the long rebalance, you get an increased risk of a second node dropping out of the cluster due to disks or OOM, and that already guarantees read-only on the cluster.
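Ceph does have real throttling knobs for this (for example osd_max_backfills and osd_recovery_max_active), but slowing recovery stretches the risk window. A back-of-the-envelope sketch with hypothetical numbers:

```shell
# How long a rebalance takes at a given recovery rate: the longer it runs,
# the longer the window in which a second failure can occur.
rebalance_hours() {
    data_gb=$1
    rate_mb_s=$2
    # hours = (GB * 1024 MB) / (MB/s * 3600 s/h)
    awk -v d="$data_gb" -v r="$rate_mb_s" 'BEGIN { printf "%.1f", d * 1024 / (r * 3600) }'
}
rebalance_hours 2000 500   # full speed: about an hour of clogged network
echo
rebalance_hours 2000 50    # throttled: half a day of elevated failure risk
```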

    A long rebalance means long application slowdowns

    A quote from the post "Ceph. Disaster anatomy":
    Performance of the test cluster:

    A 4 KB write operation takes 1 ms; performance is 1000 operations per second in a single thread.

    An operation of 4 MB (the object size) takes 22 ms; performance is 45 operations per second.

    Therefore, when one of the three domains fails, the cluster stays in a degraded state for some time, half of the hot objects spread out across different versions, and then half of the write operations begin with forced recovery.

    We estimate the forced recovery time approximately, as write operations to a degraded object.

    First we read 4 MB in 22 ms, then write 4 MB in another 22 ms, and only then write the 4 KB of new data itself in 1 ms. In total, 45 ms per write operation to a degraded object on an SSD, where normal performance was 1 ms: a 45x performance drop.
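The arithmetic from the quote, restated as a sketch:

```shell
# Latency of one write to a degraded object, per the quoted breakdown:
# read the 4 MB object, write it back, then write the 4 KB of new data.
degraded_write_ms() {
    read_obj_ms=$1
    write_obj_ms=$2
    write_4k_ms=$3
    echo $(( read_obj_ms + write_obj_ms + write_4k_ms ))
}
degraded_write_ms 22 22 1   # 45 ms vs the usual 1 ms: a 45x slowdown
```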

    The larger the percentage of degraded objects, the worse it gets.
    It turns out that the rebalance speed is critical for the proper operation of the cluster.

    Server-specific settings for Ceph

    Ceph may need specific tuning on the host.

    Example: sysctl settings and jumbo frames; some of these settings can negatively affect your payload.
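For illustration, a hypothetical host-tuning fragment of the kind often applied on Ceph nodes; the values are examples only, and with Rook they apply to the whole node, including your application pods:

```ini
# /etc/sysctl.d/90-ceph.conf -- illustrative values only
# Larger socket buffers for the storage network:
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
# Keep OSD memory out of swap:
vm.swappiness = 0
# Jumbo frames would additionally need, e.g.: ip link set dev eth1 mtu 9000
```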

    The real need for Rook remains in question

    If you are in a cloud, you have storage from your cloud provider, which is much more convenient.

    If you are on your own servers, then managing Ceph will be more convenient without Kubernetes.

    Do you rent servers from some low-cost hoster? Then you are in for a lot of fun with the network, its latency and its bandwidth, all of which obviously affects Ceph negatively.

    The bottom line: deploying Kubernetes and deploying storage are different tasks, with different inputs and different solution options. Mixing them means making a dangerous trade-off for the sake of one or the other. Combining these solutions will be difficult even at the design stage, and the operation period still lies ahead.


    Post #1: But you say Ceph... is it good?
    Post #2: Ceph. Disaster anatomy
