Cluster storage in Proxmox. Part three. Nuances
Hello!
The third part of the article is a kind of appendix to the two previous ones, in which I talked about working with a Proxmox cluster. In this part I will describe the problems we encountered while working with Proxmox and how we solved them.
Authorized iSCSI Connection
If you need to specify credentials when connecting to iSCSI, it is better to do it bypassing Proxmox. Why?
- Firstly, because it is impossible to create an authorized iSCSI connection through the Proxmox web interface.
- Secondly, even if you decide to create an unauthorized connection in Proxmox and then specify the authorization information manually, you will have to fight the system for the right to change the target configuration files: whenever the connection to the iSCSI host fails, Proxmox overwrites the target information and retries the connection.
It is easier to connect manually:
root@srv01-vmx:~# iscsiadm -m discovery -t st -p 10.11.12.13
root@srv01-vmx:~# iscsiadm -m node --targetname "iqn.2012-10.local.alawar.ala-nas-01:pve-cluster-01" --portal "10.11.12.13:3260" --op=update --name node.session.auth.authmethod --value=CHAP
root@srv01-vmx:~# iscsiadm -m node --targetname "iqn.2012-10.local.alawar.ala-nas-01:pve-cluster-01" --portal "10.11.12.13:3260" --op=update --name node.session.auth.username --value=Admin
root@srv01-vmx:~# iscsiadm -m node --targetname "iqn.2012-10.local.alawar.ala-nas-01:pve-cluster-01" --portal "10.11.12.13:3260" --op=update --name node.session.auth.password --value=Lu4Ii2Ai
root@srv01-vmx:~# iscsiadm -m node --targetname "iqn.2012-10.local.alawar.ala-nas-01:pve-cluster-01" --portal "10.11.12.13:3260" --login
These commands must be executed on all nodes of the cluster, for every portal that provides the target we need. Alternatively, you can execute them on one node and distribute the resulting connection configuration to the rest; the files live in the "/etc/iscsi/nodes" and "/etc/iscsi/send_targets" directories.
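You can check that the session is actually established with "iscsiadm -m session". And here is a minimal sketch of pushing a ready-made configuration to another node (the host name "srv02-vmx" is just an illustration):
root@srv01-vmx:~# iscsiadm -m session
root@srv01-vmx:~# scp -r /etc/iscsi/nodes /etc/iscsi/send_targets srv02-vmx:/etc/iscsi/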
Mounting a GFS2 FS on a new node
To mount a GFS2 file system on a new node, one more journal has to be added to it (to the file system). This is done as follows: on any cluster node where the needed FS is mounted, execute the command:
root@pve01:~# gfs2_jadd -j 1 /mnt/cluster/storage01
The " -j " parameter specifies the number of logs to add to FS .
This command may fail:
create: Disk quota exceeded
The cause of the error:
Inside a GFS2 volume there are actually two file systems, not one. The second file system serves internal purposes; if desired, it can be mounted by adding the "-o meta" option. Changes inside this FS can potentially destroy the data file system. When a journal is added to the FS, the meta file system gets mounted into a "/tmp/TEMP_RANDOM_DIR" directory, and a journal file is then created in it. For reasons that are not yet clear to us, the kernel sometimes decides that the object creation quota in the mounted meta FS has been exceeded, which is what produces this error. You can get out of this situation by remounting the GFS2 data file system (of course, all the virtual machines located on it have to be stopped first) and running the journal-add command again. You also need to unmount the meta FS left over from the last failed attempt to add a journal:
cat /proc/mounts | grep /tmp/ | grep -i gfs2 | awk '{print $2}' | xargs umount
Mounting a data source inside a container
Container virtualization is good because the host has almost unlimited possibilities for interacting with the virtual machine.
When a container starts, vzctl tries to execute the following set of scripts (if they exist):
- /etc/pve/openvz/vps.premount
- /etc/pve/openvz/CTID.premount
- /etc/pve/openvz/vps.mount
- /etc/pve/openvz/CTID.mount
- /etc/pve/openvz/CTID.start
When the container stops, the following scripts are executed:
- /etc/pve/openvz/CTID.stop
- /etc/pve/openvz/CTID.umount
- /etc/pve/openvz/vps.umount
- /etc/pve/openvz/CTID.postumount
- /etc/pve/openvz/vps.postumount
where " CTID " is the container number. The " vps. * " Scripts are executed during operations with any container. The scripts " * .start " and " * .stop " are executed in the context of the container, all the rest in the context of the host. Thus, we can script the start / stop process of the container by adding data mounting to it. Here are some examples:
Mounting a data directory inside a container
If a container works with a large amount of data, we try not to keep that data inside the container but to mount it from the host. This approach has two advantages:
- The container stays small and is backed up quickly by Proxmox. We can quickly restore or clone the container's functionality at any time.
- The container's data can be centrally backed up by a grown-up backup system with all the amenities it provides (multilevel backups, rotation, statistics, and so on).
The contents of the "CTID.mount" file:
#!/bin/bash
. /etc/vz/vz.conf # source the global OpenVZ variables; among others, ${VE_ROOT} is defined here, the container's root directory on the host
. ${VE_CONFFILE} # source the container's own variables
DIR_SRC=/storage/src_dir # host directory to be mounted inside the container
DIR_DST=/data # directory inside the container to which $DIR_SRC will be mounted
mkdir -p ${VE_ROOT}${DIR_DST} # create the destination directory inside the container
mount -n -t simfs ${DIR_SRC} ${VE_ROOT}${DIR_DST} -o ${DIR_SRC} # mount the host directory into the container
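Once the container is started, the mount can be verified from inside it (a quick check; "CTID" here stands for the actual container number):
root@srv01:~# vzctl exec CTID df -h /data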
Mounting a file system inside the container
There is a volume on the host that needs to be handed over to a container. The contents of the "CTID.mount" file:
#!/bin/bash
. /etc/vz/vz.conf
. ${VE_CONFFILE}
UUID_SRC=3d1d8ec1-afa6-455f-8a27-5465c454e212 # UUID of the volume to be mounted inside the container
DIR_DST=/data
mkdir -p ${VE_ROOT}${DIR_DST}
mount -n -U ${UUID_SRC} ${VE_ROOT}${DIR_DST}
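The UUID of the volume can be looked up with blkid (the device name below is just an illustration):
root@srv01:~# blkid /dev/sdb1
/dev/sdb1: UUID="3d1d8ec1-afa6-455f-8a27-5465c454e212" TYPE="ext3"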
Mounting a file system in a file inside the container
Why would you need this? Some tricky product (Splunk, for example) may flatly refuse to work with simfs, or GFS2 performance may not satisfy us under certain conditions: say, we keep some kind of cache on a heap of small files, and GFS2 is not very fast with large numbers of small files. In that case you can create a file system other than GFS2 (ext3, for example) in a file on the host and mount it into the container as a loop device.
First, create the file:
root@srv01:/storage# truncate -s 10G CTID_ext3.fs
Format the FS in the file:
root@srv01:/storage# mkfs.ext3 CTID_ext3.fs
mke2fs 1.42 (29-Nov-2011)
CTID_ext3.fs is not a block special device.
Proceed anyway? (y,n) y
...
The contents of the "CTID.mount" file:
#!/bin/bash
. /etc/vz/vz.conf
. ${VE_CONFFILE}
CFILE_SRC=/storage/CTID_ext3.fs # path to the file to be mounted inside the container
DIR_DST=/data
mkdir -p ${VE_ROOT}${DIR_DST}
mount -n ${CFILE_SRC} -t ext3 ${VE_ROOT}${DIR_DST} -o loop
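Since the image is an ordinary file, the FS in it can be checked from the host with e2fsck (with the container stopped, so that the FS is not mounted):
root@srv01:/storage# e2fsck -f CTID_ext3.fs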
Unmounting external data when the container stops
When a container stops, the system automatically tries to unmount all the file systems connected to it. But in particularly exotic configurations it fails to do so. Therefore, just in case, here is an example of a simple "CTID.umount" script:
#!/bin/bash
. /etc/vz/vz.conf
. ${VE_CONFFILE}
DIR=/data
if mountpoint -q "${VE_ROOT}${DIR}" ; then
    umount "${VE_ROOT}${DIR}"
fi
Work in a cluster with a non-clustered file system
If for some reason you do not want to use a cluster FS (its stability does not suit you, its performance does not suit you, etc.), but you still want to work with a single shared storage, this option is possible. For it we need:
- A separate logical volume in CLVM for each cluster node
- A primary storage for container operation
- An empty backup storage for urgently mounting another node's volume if that node fails or is shut down
The procedure:
For each node in the cluster we allocate its own logical volume in CLVM and format it.
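A sketch of this step for one node; the volume group name "cluster-vg", the volume size, and ext3 as the FS are assumptions for illustration:
root@pve01:~# lvcreate -L 500G -n pve01-storage cluster-vg
root@pve01:~# mkfs.ext3 /dev/cluster-vg/pve01-storage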
We create the main storage. On all nodes of the cluster, create a directory with the same name (for example, "/storage") and mount the node's logical volume into it. In the Proxmox admin panel, create a storage of the "Directory" type, name it, for example, "STORAGE", and mark it as not shared.
We create the backup storage. On all nodes of the cluster, create a directory with the same name (for example, "/storage2"). In the Proxmox admin panel, create a storage of the "Directory" type, name it (for example, "STORAGE2") and mark it as not shared. If one of the nodes fails or is shut down, we will mount its volume into the "/storage2" directory on whichever cluster node takes over the load of the deceased.
What we get as a result:
- Migration of containers between nodes, including live migration (as long as no data is mounted into the container from outside). A container is transferred from node to node by copying, so the migration time depends on the amount of data in the container: the more data, the longer the container travels between nodes. Do not forget about the increased disk load while this happens.
- (Semi-)fault tolerance. When a node dies, its data can be mounted on a neighboring node, and theoretically you can start working with it.
Why " under- " and why " theoretically ":
Virtual machines live in the "STORAGE" storage, which resides in the "/storage" directory. The disk from the dead node will be mounted into the "/storage2" directory, where Proxmox will see the containers but will not be able to start them. To bring up the virtual machines located in this storage, three things need to be done (a combined sketch follows the list):
- Tell the fire victims that their new home is not the "/storage" directory but "/storage2". To do this, in every "*.conf" file in the "/etc/pve/nodes/dead_node/openvz" directory, change the VE_PRIVATE variable from "/storage/private/CTID" to "/storage2/private/CTID".
- Tell the cluster that the virtual machines from that deceased node now live on this living one. To do this, simply move all the files from the "/etc/pve/nodes/dead_node/openvz" directory to "/etc/pve/nodes/live_node/openvz". There may well be a proper API call for this, but we did not bother looking for it :)
- Reset the quota for each rescued container (just in case):
vzquota drop CTID
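All three steps together, as a rough sketch ("dead_node" and "live_node" are placeholders for the real node names):
#!/bin/bash
for CONF in /etc/pve/nodes/dead_node/openvz/*.conf ; do
    CTID=$(basename ${CONF} .conf)
    # step one: point the container at its new home
    sed -i 's|/storage/private/|/storage2/private/|' ${CONF}
    # step two: re-register the container on the living node
    mv ${CONF} /etc/pve/nodes/live_node/openvz/
    # step three: reset the quota, just in case
    vzquota drop ${CTID}
done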
That's it. You can start the containers.
If the containers from the dead node take up little space, or we have incredibly fast disks, or we can afford to wait, then the first and third steps can be skipped by simply moving the containers we need from "/storage2/private" to "/storage/private".
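In that case the move for a single container looks like this ("CTID" is a placeholder), with only the second step from the list above still required:
root@pve01:~# mv /storage2/private/CTID /storage/private/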
If the cluster fell apart
A cluster is a moody creature, and there are times when it throws a fit, for example after massive network problems or a massive power outage. The fit looks like this: any access to the cluster storage blocks the current session, polling the status of the fence domain produces alarm messages of the form "wait state messages", and connection errors pour into dmesg.
If no attempts to revive the cluster succeed, the simplest thing to do is to disable automatic joining of the fence domain on all cluster nodes (the "/etc/default/redhat-cluster-pve" file) and then reboot all the nodes one by one. Be prepared for the fact that some nodes may fail to reboot on their own. When all the nodes have rebooted, manually join the fence domain, start CLVM, and so on. The previous articles describe how to do this.
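A sketch of that sequence on one node, assuming the FENCE_JOIN flag described in part one of this series:
root@pve01:~# sed -i 's/^FENCE_JOIN=.*/FENCE_JOIN="no"/' /etc/default/redhat-cluster-pve
root@pve01:~# reboot
# ...after all nodes are back up:
root@pve01:~# fence_tool join
root@pve01:~# /etc/init.d/clvm start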
That’s probably all.
In the next part I will talk about how we automate work in the cluster.
Thanks for your attention!
- Cluster storage in Proxmox. Part one. Fencing
- Cluster storage in Proxmox. Part two. Launch
- Cluster storage in Proxmox. Part three. Nuances