
GitLab is down, the database has been destroyed (and is being restored)

It happened roughly like this.
For some reason the hot-standby PostgreSQL replica (the only replica) began to lag behind. A GitLab employee spent some time trying to fix the situation with various settings and so on, then decided to wipe the replica and re-seed it from scratch. He tried to erase the data directory on the replica, but mixed up the servers and erased it on the master (he ran rm -rf on db1.cluster.gitlab.com instead of db2.cluster.gitlab.com).
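A trivial safeguard against exactly this class of mistake is to check the role of the local PostgreSQL instance before doing anything destructive. Below is a minimal, hypothetical Python sketch (not GitLab's tooling): it refuses to wipe the data directory unless PostgreSQL reports that the node is a standby via pg_is_in_recovery(). The data directory path and connection settings are assumptions for illustration only.

```python
# Hypothetical safeguard: refuse to wipe the data directory unless this
# host is actually a standby. Illustration only, not GitLab's tooling.
import shutil
import sys

import psycopg2  # assumes the psycopg2 package is installed

DATA_DIR = "/var/opt/gitlab/postgresql/data"  # assumed path, adjust to your setup


def is_standby() -> bool:
    """Return True if the local PostgreSQL instance is in recovery (i.e. a standby)."""
    conn = psycopg2.connect(dbname="postgres")  # assumes local peer auth
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery()")
            return cur.fetchone()[0]
    finally:
        conn.close()


if __name__ == "__main__":
    if not is_standby():
        sys.exit("Refusing to wipe %s: this node looks like the MASTER." % DATA_DIR)
    print("Node is a standby, removing %s ..." % DATA_DIR)
    shutil.rmtree(DATA_DIR)
```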
Interestingly, the system had 5 different kinds of backups/replicas, and none of them worked. The only thing left was an LVM snapshot that happened to have been taken 6 hours before the failure.
Here is an abridged excerpt from their document, listing the problems they found:
1) LVM snapshots are by default only taken once every 24 hours.
2) Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored.
3) Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
4) The synchronization process removes webhooks once it has synchronized data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
5) The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
6) Our backups to S3 apparently don't work either: the bucket is empty
7) We don't have solid alerting / paging for when backups fails, we are seeing this in the dev host too now.
Thus, GitLab concludes that out of the 5 backup/replication techniques, none worked reliably or as intended => so restoration is now in progress from the LVM snapshot that happened to be taken 6 hours earlier.
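Points 6 and 7 in the list above (an empty S3 bucket and no alerting when backups fail) are the kind of thing a trivial freshness check can catch. Here is a hedged sketch, assuming the backups land in a hypothetical S3 bucket named gitlab-backups and using boto3; the bucket name, key prefix, and 24-hour threshold are all assumptions for illustration.

```python
# Hypothetical backup freshness check: alert if no backup object newer than
# 24 hours exists in the bucket. Bucket name and prefix are assumptions.
from datetime import datetime, timedelta, timezone

import boto3  # assumes boto3 is installed and AWS credentials are configured

BUCKET = "gitlab-backups"   # hypothetical bucket name
PREFIX = "db/"              # hypothetical key prefix for DB dumps
MAX_AGE = timedelta(hours=24)


def latest_backup_age():
    """Return the age of the newest object under PREFIX, or None if nothing is there."""
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    objects = resp.get("Contents", [])
    if not objects:
        return None
    newest = max(obj["LastModified"] for obj in objects)
    return datetime.now(timezone.utc) - newest


if __name__ == "__main__":
    age = latest_backup_age()
    if age is None:
        print("ALERT: backup bucket is empty!")        # the "bucket is empty" case
    elif age > MAX_AGE:
        print(f"ALERT: newest backup is {age} old!")   # stale backups: page someone
    else:
        print(f"OK: newest backup is {age} old.")
```

Hooking a check like this into monitoring (so that a failure pages someone) is the piece the quoted document says was missing.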
→ Here is the full text of the document