ZFS for Repository Storage Nodes

Scope

The initial problem statement, goals, and non-goals of this project were defined in the blueprint, which you should read first. They are documented here only where this design diverges from the blueprint.

Following this Project

An epic has been created, and you can follow the board.

Timeline

We see the project as having three phases: development, migration, and post-migration. The phases may overlap, but broadly speaking we aim to have the infrastructure code (mainly chef and terraform) merged to master in the necessary repositories before the migration begins, so that the migration itself has minimal moving parts.

Development Considerations

Packages and Kernels

Our machines currently run Ubuntu 16.04, using the LTS enablement stack kernel. Because 16.04 is more than 2 years old, its enablement kernel is frozen on the 4.15 series. This is the kernel that 18.04 shipped with, and it includes the ZFS kernel module at version 0.7.5.

The latest version of ZFS-on-Linux (ZoL), the Linux port of OpenZFS, is 0.8.0, which was released recently. There are several new features in this release. One notable feature that we would benefit from is TRIM support for SSDs.

The LTS enablement kernel minor version of Ubuntu 18.04 will not be frozen until April 2020. It is currently on the 4.18 series, and ships with the ZFS 0.7.9 kernel module. It is likely that Ubuntu 18.04 LTS enablement kernels will get ZFS 0.8.0 in the coming months, but we do not want to upgrade our storage nodes to Ubuntu 18.04 for 2 reasons: it would increase the scope and complexity of the project, and would leave us with a heterogeneous fleet, running different distribution versions.

We would like to build our own ZFS package from the ZoL source and install it from our Aptly server. A proof of concept project has been created to investigate the viability of this.

ZFS Configuration

Disk Configuration

Disk layout and sizing is being discussed in an issue.

Although the blueprint states that we are leaning towards a single-disk configuration because we are satisfied with Google's availability guarantees and error correction, we have since decided to use a RAIDZ1 configuration of 9 disks (plus local SSDs for L2ARC). The reasons for this change of mind are defence in depth, and also performance. Benchmarking results will be written up in an issue.
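As a rough illustration of the layout described above, the following sketch builds a `zpool create` command line for a single raidz1 vdev with separate cache (L2ARC) devices. The pool name `tank`, the device paths, and the helper name `zpool_create_argv` are all hypothetical, and the count of two cache devices is taken from the sizing section of this document. The final configuration has not been decided, so this is not a proposal.

```python
from typing import List


def zpool_create_argv(pool: str, data_disks: List[str], cache_disks: List[str]) -> List[str]:
    """Build a `zpool create` command for one raidz1 vdev plus optional L2ARC devices."""
    if len(data_disks) < 2:
        raise ValueError("raidz1 needs at least two data disks")
    argv = ["zpool", "create", pool, "raidz1", *data_disks]
    if cache_disks:
        # "cache" introduces L2ARC devices; they hold no unique data.
        argv += ["cache", *cache_disks]
    return argv


# Hypothetical device paths: nine persistent disks and two local NVMe SSDs.
data = [f"/dev/disk/by-id/google-data-{i}" for i in range(1, 10)]
cache = ["/dev/nvme0n1", "/dev/nvme0n2"]
print(" ".join(zpool_create_argv("tank", data, cache)))
```

Returning an argument list rather than a shell string keeps the command easy to test and avoids quoting problems if it is later passed to `subprocess.run`.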

We are also considering running HDDs instead of SSDs to save money, relying on the SSD-backed L2ARC to mitigate the performance drop (issue). Benchmarks will need to satisfy us that this is feasible, however.

We are not currently considering configuring an external ZFS intent log (external ZIL / SLOG device). The performance of a GCP disk (IOPS and throughput) is correlated with disk size. In order to not bottleneck writes at the SLOG disk we would need to provision a disk much larger than is needed to store the intent log. This is not cost-effective.

Kernel Module Configuration

Gitaly has a small memory footprint in production. That leaves us with the ability to dedicate a large proportion of our main memory to the ZFS adaptive replacement cache (ARC). The exact amount should be determined while working on an issue, but we anticipate at least 75%.
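To make the 75% figure concrete, here is a minimal sketch that converts a RAM fraction into a byte value for the `zfs_arc_max` module parameter. It assumes the 120GB of RAM mentioned in the sizing section is 120 GiB, and that the limit would be set through a modprobe options line. Both are assumptions, and the exact fraction is still to be determined.

```python
GIB = 1024 ** 3


def arc_max_bytes(total_ram_bytes: int, fraction: float) -> int:
    """Return the ARC size limit in bytes for a given fraction of total RAM."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return int(total_ram_bytes * fraction)


# Assumption: 120 GiB of RAM and a 75% ARC share.
limit = arc_max_bytes(120 * GIB, 0.75)
print(f"options zfs zfs_arc_max={limit}")
# prints: options zfs zfs_arc_max=96636764160
```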

Storage nodes will have some local SSDs attached by NVMe to use as a fast level 2 ARC (L2ARC). These must be added to each zpool as cache devices so that ZFS makes use of them.

Filesystem Configuration

The exact zpool and zfs configuration has not been decided up-front but will instead be investigated as part of an issue, and initial configuration will be merged into a chef cookbook.

Because ZFS is a copy-on-write filesystem, even deletions consume a small amount of metadata. Filling a zpool completely can therefore be a difficult situation to recover from. For this reason it is considered a best practice to create a “reservation” filesystem in each zpool. It can be temporarily shrunk in order to give headroom to recover from other filesystems becoming full. The exact size of this reservation will be determined when we propose an initial ZFS configuration in an MR.
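The reservation technique can be sketched as two commands: one that creates an unmounted dataset with the `reservation` property set, and one that changes that property later. The pool name, dataset name, sizes, and helper names below are hypothetical placeholders, since the real values will come from the MR mentioned above.

```python
from typing import List


def create_reservation_argv(pool: str, size: str, name: str = "reserved") -> List[str]:
    """Command to create an unmounted dataset that holds back `size` of pool space."""
    return ["zfs", "create", "-o", f"reservation={size}",
            "-o", "mountpoint=none", f"{pool}/{name}"]


def resize_reservation_argv(pool: str, size: str, name: str = "reserved") -> List[str]:
    """Command to shrink (or restore) the reservation, e.g. while recovering a full pool."""
    return ["zfs", "set", f"reservation={size}", f"{pool}/{name}"]


# Hypothetical values: hold back 50G normally, release most of it during recovery.
print(" ".join(create_reservation_argv("tank", "50G")))
print(" ".join(resize_reservation_argv("tank", "1G")))
```

Once the full filesystems have been cleaned up, the same `zfs set` call with the original size restores the headroom.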

ZFS supports native compression, using various algorithms. If testing confirms that git repository data benefits from compression (that is, if the git objects are not already effectively compressed), then we will make use of this feature.
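The concern is that git already deflates its objects with zlib, so a second compression pass at the filesystem layer may gain little. The sketch below illustrates that effect on synthetic data only. It uses Python's `zlib` as a stand-in for whichever ZFS algorithm would be chosen, and the sample data is invented; real testing would need to measure actual repositories on a ZFS filesystem.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Return original size divided by compressed size (higher means more benefit)."""
    if not data:
        raise ValueError("empty input")
    return len(data) / len(zlib.compress(data))


# Synthetic, text-like sample built from a small made-up vocabulary.
rng = random.Random(0)
vocabulary = [f"token{i}" for i in range(64)]
source_like = " ".join(rng.choice(vocabulary) for _ in range(20000)).encode()

# Stand-in for data that has already been deflated once, as git objects are.
object_like = zlib.compress(source_like)

print("uncompressed input ratio:", round(compression_ratio(source_like), 2))
print("already-compressed input ratio:", round(compression_ratio(object_like), 2))
```

A ratio close to 1 for the second case would indicate that filesystem compression adds little for that kind of data.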

ZoL supports native encryption as of the latest version, 0.8.0. Our GCE disks are already encrypted at rest, but we do not control the keys ourselves using Google’s Key Management Service (KMS). At the time of writing we have not decided whether or not it makes more sense to use ZFS encryption, start using KMS for volumes, or both. An issue will be created.

Capacity Planning / Resource Sizing

As stated in the ZFS Configuration section above, we intend to use 9 GCP PD-SSDs in a raidz1 configuration. Repack operations run periodically for all repositories hosted on gitlab.com, and because of inefficiencies in git repack (see https://christian.amsuess.com/idea-incubator/space-efficient-git), a ZFS snapshot taken before a repack will share no blocks with one taken after it. The zpool disk usage for that repo will therefore be doubled until the old snapshot passes out of the retention window. If repacking occurred simultaneously for every repo on a GitLab installation (which is unlikely), our filesystem usage would spike to double. In the worst case we would therefore need at least twice as much usable filesystem space as the data we intend to store on the node. However, since repacking of different repos should be spread out in time in realistic scenarios, we could use a lower multiple. Determining this multiple will be the output of a research issue.

We will need more zpool space than usable filesystem space to make room for the aforementioned “reservation” filesystem.

We will need more block device space than usable zpool space, because raidz doesn’t offer 100% block device space utilisation due to redundancy.
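The three layers of overhead described above (snapshot multiple, reservation, raidz parity) can be chained into one back-of-the-envelope calculation. In the sketch below only the 9-disk raidz1 layout and the worst-case multiple of 2 come from this document. The 10 TB of data and the 5% reservation are invented inputs, and the parity estimate ignores raidz padding and metadata overhead, so real requirements would be somewhat higher.

```python
from typing import Tuple


def required_capacity(data_tb: float, snapshot_multiple: float,
                      reservation_fraction: float, disks: int,
                      parity: int = 1) -> Tuple[float, float, float]:
    """Return (usable filesystem, zpool, raw block device) space in TB."""
    if disks <= parity:
        raise ValueError("need more disks than parity devices")
    if not 0 <= reservation_fraction < 1:
        raise ValueError("reservation_fraction must be in [0, 1)")
    usable_fs = data_tb * snapshot_multiple          # room for pre/post-repack snapshots
    zpool = usable_fs / (1 - reservation_fraction)   # add the reservation filesystem
    raw = zpool * disks / (disks - parity)           # add parity (padding ignored)
    return usable_fs, zpool, raw


# Hypothetical inputs: 10 TB of repository data, worst-case multiple, 5% reservation.
usable, pool, raw = required_capacity(10.0, 2.0, 0.05, disks=9)
print(f"usable={usable:.2f} TB, zpool={pool:.2f} TB, raw={raw:.2f} TB")
```

Lowering `snapshot_multiple` once the research issue produces a realistic figure reduces every downstream number proportionally.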

RAM/CPU changes compared to today are not being considered at this time. We intend to keep using GCP n1-standard-32 VMs, which have 32 vCPUs and 120GB of RAM. We don’t want to reduce the amount of memory available for caching, and we have determined in the past that 32 vCPUs are needed for acceptable network bandwidth.

We can make use of GCP local SSDs, attached via NVMe, as a level 2 ARC for ZFS. These are ephemeral and scoped to the life of the VM, but that doesn’t matter for a cache. These disks are available in a fixed size of 375GB. In order to have a sensible multiple of the level 1 ARC available at level 2, and in order to increase IO throughput of the cache, we are planning to use 2 of these disks per node.

Dependency Validation

GitLab APIs

Our migration plan (see below) will likely involve using a GitLab API to move repositories between file store shards. We will be calling this API for every repository that is stored on gitlab.com, so it is worth doing some validation work in advance to:

- Validate that this API is not racy: after it returns, the new repository location is guaranteed to contain all data that was in the old, even if writes were attempted during the migration.
- Quantify the downtime, if any, for repos being migrated. This should be done for various representative sizes of repo (50th, 95th, 99th percentile size). Real repositories should be used rather than dummies if possible.
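Picking real repositories at the 50th, 95th, and 99th percentile of size could be done along the following lines. This is a sketch using the nearest-rank method. The input mapping of repository name to size is hypothetical, and how sizes would actually be gathered is not covered here.

```python
from typing import Dict, Iterable, Tuple


def representative_repos(sizes_by_repo: Dict[str, int],
                         percentiles: Iterable[int] = (50, 95, 99)) -> Dict[int, Tuple[str, int]]:
    """Pick one real repository at each integer percentile of size (nearest-rank)."""
    if not sizes_by_repo:
        raise ValueError("no repositories given")
    ordered = sorted(sizes_by_repo.items(), key=lambda item: item[1])
    picks = {}
    for p in percentiles:
        if not 0 < p <= 100:
            raise ValueError("percentile must be in 1..100")
        # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
        rank = (p * len(ordered) + 99) // 100
        picks[p] = ordered[rank - 1]
    return picks


# Hypothetical data: 100 repositories with sizes 1..100 (arbitrary units).
sample = {f"repo-{i}": i for i in range(1, 101)}
print(representative_repos(sample))
```

Nearest-rank always returns an existing repository rather than an interpolated size, which fits the preference for real repositories over dummies.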

Git

If git makes use of direct IO on Linux, then we will need to use the latest version of ZoL, 0.8.0, in which direct IO support was added. However, a quick grep of the git source code indicates that git does not use direct IO.

Migration Considerations

Migration Plan

Node Migration

Migrations will be run on staging first, with the goal of using the same developed scripts against production. We intend to automate as much as is sensible in order to minimize human error.

The general migration strategy is to roll out N nodes per phase using terraform and chef in advance of a planned migration date, then to migrate repositories in batches on the date itself by orchestrating calls to the shard migration API.

Currently, disks are formatted and filesystems are mounted at boot time using the bootstrap script. In order to support ext4 and zfs (possibly with raidz) depending on node config, it might make sense to move this functionality into a chef cookbook, for improved testability and speed of iteration (issue).

Once the migration of an environment is complete we can decommission the old nodes using terraform.

The first phase will consist of one node, referred to as the canary deployment. We can further tune our ZFS configuration by observing the canary in production. In particular, we anticipate optimizing the ZFS recordsize for the observed workload.

The initial plan for the second phase is to migrate the rest of the nodes one by one. This will be informed by the output of a research issue, so that we know how long the migration is expected to take and can make appropriate handover plans across time zones if necessary.

After Node Migration

After the node migration is complete, we can configure a reasonable snapshot schedule. Our initial plan is to take snapshots every 15 minutes, with a 6 hour retention window. Restoration of older data will be done by creating disks from GCE snapshots, which is the method currently used. The output of this phase will be a runbook for recovery of accidentally deleted data, with as much of the runbook automated as possible. Making this self-service is not a goal: SREs will still fulfil these requests on an ad-hoc basis.
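The retention rule for this schedule can be sketched as a pure function that, given snapshot creation times, returns the snapshots that have aged out of the window. The snapshot names and the helper name are hypothetical, and listing or destroying snapshots on a real node is deliberately left out.

```python
from datetime import datetime, timedelta
from typing import Dict, List

INTERVAL = timedelta(minutes=15)
RETENTION = timedelta(hours=6)


def snapshots_to_destroy(created_at: Dict[str, datetime], now: datetime,
                         retention: timedelta = RETENTION) -> List[str]:
    """Return names of snapshots older than the retention window, oldest first."""
    cutoff = now - retention
    ordered = sorted(created_at.items(), key=lambda item: item[1])
    return [name for name, created in ordered if created < cutoff]


# Number of snapshot intervals that fit in the retention window.
print(RETENTION // INTERVAL)  # prints: 24

# Hypothetical snapshots taken every 15 minutes over the last 7.25 hours.
now = datetime(2019, 6, 1, 12, 0)
snaps = {f"tank/repos@auto-{i:02d}": now - i * INTERVAL for i in range(30)}
print(snapshots_to_destroy(snaps, now))
```

A snapshot exactly 6 hours old is kept in this sketch. Whether the boundary should be inclusive is a detail for the eventual implementation.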

Post-Migration Considerations

This section is intended to capture “day 2” concerns.

ZFS-specific Alerting

It is important to alert on the zpool and individual zfs filesystems becoming full. Due to ZFS’s aforementioned intolerance of being full, we should set these alerts to a lower threshold than we would for ext4. These should be low-severity alerts that do not page an on-call engineer, so the threshold should be low enough that the remaining space is unlikely to fill within the next few hours. We could also consider adding a paging alert at a higher threshold.
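One way to reason about "sufficiently low" is to work backwards from an observed growth rate and the amount of lead time wanted before the pool fills. The sketch below assumes linear growth, which is a simplification, and every number in it is invented for illustration. It is not tied to any particular monitoring system.

```python
from typing import Optional


def hours_until_full(used: float, capacity: float, growth_per_hour: float) -> Optional[float]:
    """Linear estimate of hours before usage reaches capacity; None if not growing."""
    if growth_per_hour <= 0:
        return None
    return max(capacity - used, 0.0) / growth_per_hour


def warning_threshold(capacity: float, growth_per_hour: float, lead_hours: float) -> float:
    """Fraction of capacity at which to warn so that `lead_hours` of headroom remain."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return max(0.0, 1.0 - (growth_per_hour * lead_hours) / capacity)


# Hypothetical pool: 20 TiB capacity, growing 0.1 TiB per hour at peak, 12 hours of lead time.
print(round(warning_threshold(20.0, 0.1, 12.0), 2))   # prints: 0.94
print(hours_until_full(18.8, 20.0, 0.1))
```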

In ZFS, inodes are allocated on demand, so inode exhaustion does not need to be alerted on (https://blogs.oracle.com/roch/need-inodes).

Further work

This section describes potential future work that is not in scope for this project, but if embarked on could build on it.

Disaster Recovery improvements

Periodically send snapshots to object storage using zfs send, so that we can recover from disasters with snapshots that are guaranteed consistent.

Unused repository cold storage

It might become desirable in the future to move dormant repositories to colder storage than mounted PDs, for example to object storage such as GCS. This could be done by sending serialized ZFS snapshots to cold storage using zfs send, but that would require the ZFS filesystem being snapshotted to contain only dormant repositories. It’s likely that this would require Gitaly features to accomplish, as well as a possible rethink of our storage architecture and another node migration.