Disaster Recovery Working Group

Attributes

| Property | Value |
| --- | --- |
| Date Created | November 11, 2020 |
| End Date | TBD |
| Slack | #wg_disaster-recovery (only accessible from within the company) |
| Google Doc | Working Group Agenda (only accessible from within the company) |
| Issue Board | Working Group Issue Board |
| Epic | Link |

Charter

This working group will determine what is needed to introduce a disaster recovery mechanism for GitLab.com, and what effort is necessary to leverage GitLab Geo as a mechanism for building reliable and predictable disaster recovery at the largest scale.

Scope and Definitions

In the context of this working group:

  1. Recovery Point Objective (RPO): the targeted duration of time in which data might be lost due to a major incident.
  2. Recovery Time Objective (RTO): the targeted duration of time and service level within which a business process must be restored after a disaster to avoid unacceptable consequences of a break in business continuity.

This working group is working towards the proposed targets for both RPO and RTO.
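
As a rough illustration of the two definitions above, the sketch below measures the RPO and RTO achieved in a single failover test and compares them with targets. The function names, timestamps, and target values are all hypothetical and chosen only for illustration; they are not the working group's proposed targets or tooling.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated_at: datetime, incident_at: datetime) -> timedelta:
    """Window of potentially lost data: changes made after the last replicated
    write and before the incident never reached the secondary site."""
    return max(incident_at - last_replicated_at, timedelta(0))


def achieved_rto(incident_at: datetime, restored_at: datetime) -> timedelta:
    """Time from the incident until the service was restored."""
    return restored_at - incident_at


def meets_target(achieved: timedelta, target: timedelta) -> bool:
    """An objective is met when the measured value does not exceed the target."""
    return achieved <= target


if __name__ == "__main__":
    # Hypothetical timestamps and targets, for illustration only.
    incident = datetime(2021, 6, 1, 10, 0, 0)
    last_replicated = datetime(2021, 6, 1, 9, 58, 30)
    restored = datetime(2021, 6, 1, 11, 15, 0)

    rpo = achieved_rpo(last_replicated, incident)
    rto = achieved_rto(incident, restored)
    print("RPO met:", meets_target(rpo, timedelta(minutes=5)))
    print("RTO met:", meets_target(rto, timedelta(hours=1)))
```

Note that this sketch only captures the time component of RTO; the service level that must be restored would need a separate check.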

Sequence Order Of Deliverables

Planned:

  1. Set up a multi-node Geo site on staging for the next iterations of failover tests.
  2. Define a roadmap containing identified gaps and what is needed to provide the necessary failover functionality for GitLab.com production scale.
  3. Regularly plan and execute failover tests on the staging secondary Geo site.
  4. Demonstrate the ability to execute a successful full failover of staging.
  5. Produce a design of how GitLab Geo would be used in production, in the form of a blueprint and readiness review.
  6. Ensure that the cost is kept in check with the proposed design.
  7. Decide on go/no-go for production rollout based on the proposed design.
  8. Create and update a single handbook page, and deprecate resources in other locations.

Completed:

  1. 2020-11-30: Planned and executed a test of a staging failover leveraging GitLab Geo, with minimal disruption to the existing deployment and testing processes.
  2. 2021-01-13: Executed a follow-up test of a staging failover, automating the testing and tooling processes.
  3. Generated a proposal and received approval for building out a staging secondary site.
  4. Evaluated the cost impact and received approval for a secondary site for production starting September 2021.
  5. Defined the DR flow on GitLab.com and identified the need for a balanced solution that ensures a fully operational site after failover.

Roles and Responsibilities

| Working Group Role | Person | Title |
| --- | --- | --- |
| Executive Stakeholder | Steve Loyd | VP of Infrastructure |
| Facilitator/DRI | Brent Newton | Director of Infrastructure, Reliability |
| Functional Lead | Andrew Thomas | Principal Product Manager, Enablement |
| Functional Lead | Fabian Zimmer | Senior Product Manager, Geo |
| Member | Chun Du | Director of Engineering, Enablement |
| Member | Davis Townsend | Data Analyst, Infrastructure |
| Member | Nick Nguyen | Backend Engineering Manager, Geo |
| Member | Nick Westbury | Senior Software Engineer in Test, Geo |