Today, each git repository hosted on gitlab.com is stored on exactly one storage node. Specifically, it is stored on an ext4 filesystem backed by one Google Cloud persistent volume. We take GCE snapshots of that volume every 4 hours, which are used for recovery in the event of volume loss (a rare event on GCP), and also for restoring accidentally deleted files.
The data in the snapshot are not guaranteed to be consistent while processes are still writing to the volume being snapshotted. Also, Creating GCE disks from these snapshots is a slow process.
In the event of volume loss we could stand to lose up to 4 hours of writes, but in practice, this is unlikely due to the existence of our disaster recovery (DR) environment as a GitLab Geo.
This project aims to improve our mean time to recovery (MTTR) and recovery point objective (RPO) when restoring accidentally deleted git repository data. ZFS snapshots are fast to create, cheap to store (due to its copy-on-write semantics) and are guaranteed to be consistent. This goal fits into the FY20 Q2 infrastructure team OKR “Drive all user-services MTTR to less than 60 seconds”.
It is sometimes desirable to use a realistic, production-sized dataset for testing. Taking consistent ZFS snapshots will help us achieve this goal.
The Adaptive Replacement Cache (ARC) that ZFS makes use of offers potential performance improvements over our current configuration. In addition to the level 1 (memory) ARC, we may be able to attain further performance improvements by adding a level 2 ARC on local SSDs that are attached via NVMe.
Having recent, guaranteed consistent snapshots is also a step towards improving our MTTR and RPO for disaster recovery, although we won’t reap the benefits of this until we ship these snapshots somewhere other than the same persistent disk that be the subject of the disaster.
Improve I/O throughput and reduce disk space required through the use of ZFS native compression.
ZFS was previously investigated as a potential solution to backing GitLab services and we believe that it is a good solution to solve the problems described above. In particular, the ZFS features that we believe will help us achieve our this project’s goals are fast and cheap snapshot+restore, the ARC, and native compression. ZFS provides many features such as disk-level redundancy (RAIDZ) and error correction from block-level corruption but we do not anticipate needing to take advantage of these features (see the non-goals section).
High availability with regard to single-volume failure. We are satisfied that this is too rare an event for GCE persistent disks to attempt to improve.
Error correction from block-level corruption (bit flipping). We are satisfied with Google’s guarantees on persistent disk data durability.
High availability with regard to virtual machine failure. There is ongoing work on Gitaly HA features that will enable this.
Replace PD-SSD disk snapshots for longer-term recoverability.
Changes to disaster recovery MTTR or RPO, where a disaster is the irrecoverable loss of a GCP region, including volume data. We rely on our DR environment as a Geo backup in these cases.
Zero-downtime migration at the individual repository level. We will aim to make use of a GitLab API that moves a repository between shards to perform the migration, and will first investigate the downtime consequences and consistency guarantees of this API.
Infrastructure automation for obtaining realistic, production-sized test data sets. Further work should be able to build on this project to do this.
ZFS is a versatile but complex filesystem. We will have to undertake work to observe and tune its behavior so as to optimize it towards our workload.
Some mistakes in ZFS setup (such as recordsize, and other configuration options that cannot easily be changed) will require a reset of the pool with all of its data forcing another migration.
When git-gc runs, it repacks the whole repository into new files. From a storage point of view, this may cause ZFS to provision new blocks. If the old blocks are still referenced by an existing snapshot then we must have enough storage to account for this. We will need to investigate how much extra storage we will need for this. This is dependent on the size of the repositories and the frequency with which git-gc is run relative to snapshot retention time. See this post.