The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc.
Section | Group | Maturity | Last updated |
---|---|---|---|
Enablement | Geo | Complete | 2022-11-05 |
Thanks for visiting this category strategy page for GitLab Geo Disaster Recovery. This page belongs to the Geo group.
Disaster Recovery helps our customers fulfill their business continuity plans by creating processes that allow the recovery of a GitLab instance following a natural or human-created disaster in the data center the GitLab instance is operating in.
Disaster Recovery provides an easily configurable warm standby (Geo site) in an additional region, which can quickly take over in the event of an issue with the primary.
Setting up a Disaster Recovery solution for GitLab requires significant investment and is cumbersome in more complex setups. Geo replicates around 95% of GitLab's data, which means that systems administrators need to be aware of what is automatically covered and what parts need to be backed up separately, for example via rsync. Geo provides documentation for planned and unplanned failover processes.
Please reach out to Sampath Ranasinghe, Product Manager for the Geo group (Email) if you'd like to provide feedback or ask any questions related to this product category.
This strategy is a work in progress, and everyone can contribute. Please comment and contribute in the linked issues and epics on this page. Sharing your feedback directly on GitLab.com is the best way to contribute to our strategy and vision.
We envision the GitLab Disaster Recovery solution to be semi-autonomous, detecting fault conditions and failing over automatically to a suitable secondary site. The failover will be seamless and transparent to end users.
The solution will support all current and future GitLab reference architectures.
GitLab customers will be able to scale and tailor the Disaster Recovery solution to best fit their business continuity plans. They will be able to configure the disaster recovery solution to meet a defined Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
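As a rough illustration of what meeting a defined RPO means in practice, the sketch below compares the age of the last successful replication against a target. The function name `meets_rpo`, the timestamps, and the 15-minute target are hypothetical and are not part of Geo; they only show the arithmetic behind the objective.

```python
from datetime import datetime, timedelta, timezone


def meets_rpo(last_synced_at: datetime, now: datetime, rpo: timedelta) -> bool:
    """Return True if the replication lag is within the RPO.

    The lag (now - last_synced_at) approximates the window of writes
    that exist only on the primary and would be lost on failover.
    """
    lag = now - last_synced_at
    return lag <= rpo


# Hypothetical values, for illustration only.
now = datetime(2022, 11, 5, 12, 0, tzinfo=timezone.utc)
last_synced_at = now - timedelta(minutes=4)
print(meets_rpo(last_synced_at, now, rpo=timedelta(minutes=15)))
```

An RTO check would follow the same pattern, comparing the measured duration of a failover against the target duration.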
We will replicate and verify 100% of GitLab data, giving customers the confidence to perform routine failovers without fear of data loss or interruption to end users. Further, the failover process will be simple and easy to trigger with a single instruction or the click of a button in the UI, even on multi-node architectures.
GitLab Dedicated and GitLab.com will actively use our Disaster Recovery solution to ensure that all best practices are followed and that we dogfood our own solutions.
Given the criticality of the replication process to recovering from a failover event, we will have end-to-end observability of the process for each site. Alarms and notifications will inform systems administrators about faults or degradation of performance. We aim to empower systems administrators to troubleshoot and remediate failures using guided instructions, helpful error logs and other tools. Systems administrators should be able to resolve replication issues quickly, minimizing the risk of data loss and guaranteeing a successful recovery.
For more information on how we use personas and roles at GitLab, please click here.
A complete overview of work required to reach lovable maturity is available in the Disaster Recovery lovable maturity epic.
As of June 2022, Geo replicates ~90% and verifies ~71% of all planned data; however, not all data is automatically verified. We've created a self-service framework that supports replication strategies for Git repositories and blobs (files). We are currently expanding support for verification of blob data types.
It is possible to promote a secondary site to a primary site, either during a planned failover or in a genuine disaster recovery situation. Geo supports promotion for a single-node installation and for an HA configuration. The current promotion process consists of a large number of manual preflight checks, followed by the actual promotion. Promotion is only possible from the command line; there is no UI flow, and high-availability configurations require modifications to the gitlab.rb file on almost all nodes. Given the critical nature of this process, Geo should make it simple to promote a secondary, especially for more complex high-availability configurations.
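The "preflight checks, then promotion" flow described above can be sketched as a small gate that refuses to promote unless every check passes. This is a minimal illustration, not Geo's implementation: the check names and the `promote` callable are hypothetical placeholders, and real checks would query the actual sites.

```python
from typing import Callable, Dict, List


def run_preflight(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    return [name for name, check in checks.items() if not check()]


def promote_if_ready(
    checks: Dict[str, Callable[[], bool]],
    promote: Callable[[], None],
) -> bool:
    """Call promote() only if all preflight checks pass."""
    failed = run_preflight(checks)
    if failed:
        print("Promotion blocked by: " + ", ".join(failed))
        return False
    promote()
    return True


# Hypothetical checks, hard-coded here for illustration only.
example_checks = {
    "replication_caught_up": lambda: True,
    "verification_complete": lambda: False,
}
promote_if_ready(example_checks, promote=lambda: print("Promoting secondary"))
```

Collecting all failures before deciding, rather than stopping at the first one, lets an administrator see everything that needs fixing in a single pass.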
Some of our existing data types, such as Projects and Wikis, do not yet use the self-service framework. We are migrating these data types over to reduce technical debt and so that all data types can benefit from new features that are added to the framework.
Improving the observability of replication and verification operations will allow systems administrators to monitor the health of the warm standby secondary site(s). It will help identify fault conditions in the replication and verification process and aid in their remediation. Surfacing the underlying errors in the UI will provide easy access to this information and speed up any recovery actions needed on the part of systems administrators.
GitLab Dedicated provides a GitLab SaaS offering to large enterprises and customers in industries with strict security and compliance requirements. We will make Disaster Recovery available to these customers.
After a failover, an administrator may want to re-add the demoted primary site as a secondary site in order to fail back to the original primary at some point. This is currently possible. However, the process is highly manual and not well documented. After we have simplified the promotion process, we want to simplify demoting a site of any size by reducing the steps required and making the process easily automatable.
GitLab.com is by far the largest GitLab instance and is used by GitLab to dogfood GitLab itself. GitLab.com does not use GitLab Geo for DR purposes. This has many disadvantages and the Geo Team is working with Infrastructure to enable Geo on GitLab.com.
We currently don't plan to replace PostgreSQL with a different database, for example CockroachDB.
This category is currently at the complete maturity level, and our next maturity target is lovable (see our definitions of maturity levels).
In order to move this category from complete to lovable we are working on all items in the lovable maturity epic.
We currently track
GitHub Enterprise Server 2.22 supports a passive replica server that can be used for disaster recovery purposes.