GitLab, by its remote-only nature, is not easily affected by typical causes of business disruption, such as local failures of equipment, power supplies, telecommunications, social unrest, terrorist attacks, fire, or natural disasters. Additionally, to ensure BCP procedures are planned and documented appropriately, data from the Business Impact Analysis is utilized as part of business continuity planning. The BCP works in conjunction with the Disaster Recovery Plan (DRP).
In the case of an all-remote company like GitLab, it is sufficient to have simple contingency plans in the form of service-level agreements with the companies that host our data and services. The advantage of an all-remote workforce like GitLab's is that if clusters of people or systems are unavailable, the rest of the company will continue to operate normally.
The exception to this would be a single point of failure (for example, if one of the Engineering heads who should sign off on triggering the plan is unavailable due to a disaster). In this case we would need an alternate plan in place that covers how to get in contact with the person or people affected by the disaster and how to trigger this business continuity plan.
RTO and RPO are two of the most important parameters of a Business Continuity Plan. These objectives guide the GitLab Infrastructure team in choosing the optimal data backup plan. The RTO/RPO provides the basis for identifying and analyzing viable strategies for inclusion in the business continuity plan. Viable strategy options include any that would enable resumption of a business process within the RPO/RTO.
Recovery Point Objective (RPO) is the interval of time that might pass during a disruption before the quantity of data lost during that period exceeds the Business Continuity Plan’s maximum allowable threshold.
The Recovery Time Objective (RTO) is the duration of time within which a service level or business process must be restored after a disaster, in order to avoid risks associated with a break in continuity.
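The two definitions above can be illustrated with a minimal Python sketch. The RPO and RTO values and the function names here are hypothetical placeholders chosen for illustration; actual objectives would come from the Business Impact Analysis.

```python
from datetime import datetime, timedelta

# Hypothetical objectives, for illustration only.
RPO = timedelta(hours=4)  # maximum tolerable window of lost data
RTO = timedelta(hours=8)  # maximum tolerable time to restore service


def data_loss_window(last_backup: datetime, disruption: datetime) -> timedelta:
    """Time between the last good backup and the disruption (data at risk)."""
    return disruption - last_backup


def downtime(disruption: datetime, restored: datetime) -> timedelta:
    """Time between the disruption and the restoration of service."""
    return restored - disruption


def meets_rpo(last_backup: datetime, disruption: datetime,
              rpo: timedelta = RPO) -> bool:
    """True if the data lost since the last backup stays within the RPO."""
    return data_loss_window(last_backup, disruption) <= rpo


def meets_rto(disruption: datetime, restored: datetime,
              rto: timedelta = RTO) -> bool:
    """True if service was restored within the RTO."""
    return downtime(disruption, restored) <= rto
```

A recovery strategy would count as viable in this sketch only if both checks pass for the scenarios it is meant to cover.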
For a business continuity plan to be effective, it needs to be triggered as soon as it is warranted; triggering it too early or too late can reduce its efficacy. Key decision points to consider when a BCP has to be triggered or invoked are given below:
This section provides details about the production environment that must be available for GitLab.com to run effectively:
GitLab.com is hosted on Google Cloud Platform, customers.gitlab.com is in Azure, and license.gitlab.com is in AWS. Since Customers and Licenses are hosted on different providers, they are unlikely to be unavailable when/if GitLab.com is down; the converse can also be true.
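As a rough illustration of why this separation limits the impact of a provider outage, the sketch below maps each service to its provider as stated above and reports which services a single-provider outage would touch. The function names are hypothetical, and the sketch ignores any cross-service dependencies.

```python
# Hosting layout as described in the text above.
HOSTING = {
    "GitLab.com": "Google Cloud Platform",
    "customers.gitlab.com": "Azure",
    "license.gitlab.com": "AWS",
}


def services_affected(provider_down: str, hosting: dict = HOSTING) -> list:
    """Services hosted on the provider that is down."""
    return sorted(s for s, p in hosting.items() if p == provider_down)


def services_unaffected(provider_down: str, hosting: dict = HOSTING) -> list:
    """Services hosted elsewhere, expected to stay reachable."""
    return sorted(s for s, p in hosting.items() if p != provider_down)
```

For example, `services_affected("Google Cloud Platform")` returns only `GitLab.com`, while the Customers and Licenses services remain in the unaffected list.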
P1: Outage would have immediate impact on GitLab customer/user operations.
P2: Outage would have immediate impact on GitLab's ability to continue business. Example: malicious software attack, hacking, or other Internet attacks.
P3: Outage greater than 72 hours would have impact on GitLab's ability to continue to do business. Example: disruption of service from Salesforce.com, Zuora, NetSuite, or Google Workspace.
P4: Non-critical system. Example: disruption of service from TripActions or internal chat tool (Slack).
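One way to put these priority levels to use during an incident is to order the affected systems by priority. The following is a minimal sketch: the system-to-priority mapping contains only the systems named in the list above, while the `recovery_order` helper and the choice to treat unlisted systems as P1 are assumptions made for illustration.

```python
from enum import IntEnum


class Priority(IntEnum):
    P1 = 1  # immediate impact on customer/user operations
    P2 = 2  # immediate impact on ability to continue business
    P3 = 3  # impact if the outage lasts longer than 72 hours
    P4 = 4  # non-critical system


# Only the systems named in the priority list above.
SYSTEM_PRIORITY = {
    "Salesforce.com": Priority.P3,
    "Zuora": Priority.P3,
    "NetSuite": Priority.P3,
    "Google Workspace": Priority.P3,
    "TripActions": Priority.P4,
    "Slack": Priority.P4,
}


def recovery_order(affected: list) -> list:
    """Sort affected systems so the most critical come first.

    Assumption: a system missing from the mapping is treated as P1,
    so that it is reviewed first rather than silently deprioritized.
    """
    return sorted(
        affected,
        key=lambda name: (SYSTEM_PRIORITY.get(name, Priority.P1), name),
    )
```

Ties within a priority level are broken alphabetically here only to keep the output deterministic; a real runbook would likely apply its own ordering.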
When it comes to a disaster, communication is of the essence. A plan is essential because it puts all team-members on the same page and clearly outlines all communication. Documents should all have updated team-member contact information, and team-members should understand exactly what their role is in the days following the triggering of the BC plan. Tasks like setting up workstations, assessing damage, and redirecting phones will need to be assigned if you don’t have some sort of technical resource to help you sort through everything.
Each GitLab team should be trained and ready to deploy in the event of a disruptive situation requiring plan activation. The action steps, procedures, and guidelines will be documented in each team's runbooks page (currently under development) and should be available offline. The runbooks should include detailed steps on recovery capabilities and instructions on how to return the system to normal operations.
More details on this will be covered in the BC plan - roles & responsibilities section, which is in development.
Make sure that backups are performed daily, and include in the Business Continuity preparation plan an additional full local backup of all servers and data. Run these backups as far in advance as possible to ensure that the data is backed up to a location that will not be impacted by the disaster. Alternate storage provisioning.
A plan cannot be successful without restoring customer confidence. As a final step, ensure that there is a detailed vendor communication plan as part of the Business Continuity preparation plan. This plan will check all the systems and services to ensure that normal operations have resumed as intended once the damage in the area is repaired. Also include a section on checking with the main service providers about restoration and access.
After formalizing the business continuity plan, or BCP, the next important step is to test the plan. Testing verifies the effectiveness of the plan, trains plan participants on what to do in a real scenario, and identifies areas where the plan needs to be strengthened. A test of the plan, such as a plan review, has to be conducted at least annually.
GitLab's first test of the business continuity plan was performed in April 2020 and tests will be conducted at least annually or when significant business changes occur.
Testing can present a lot of challenges. It requires investing time and resources. With that in mind, it may make more sense to start with a tabletop test in a conference room, rather than involving the entire organization in a full-blown drill. An initial "dry run" of the plan can also be performed by conducting a structured walk-through test of the approved BC plan. The initial testing is done in sections and after normal business hours to minimize disruptions. Subsequent tests can occur during normal business hours. An actual test run can be performed eventually. Based on the gaps and weaknesses learnt from the testing, underlying problems should be corrected and the plan updated accordingly. The various types of tests that can be conducted include checklist tests, simulation tests, parallel tests, and full interruption tests. Not testing the plan will put both the business and customer confidence at risk.
There are several types of tests, such as a plan review, a tabletop test, or a simulation test, which were detailed in the previous section. Some testing scenarios that can be performed are given below:
Data Recovery Testing