The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc.
This page describes the GitLab Resolve workflow vision as part of our Monitor stage.
Resolve is the process of restoring IT services following an incident that disrupted availability. This workflow follows Triage, in which the problem at hand was investigated and the root cause determined. Once a fix for the root cause has been determined, the solution must be verified in a local environment before it is deployed to production. Following release, the services must be monitored to ensure that they return to levels that meet SLOs.
The root cause has been discovered and responders have determined a potential solution. The next step is to propose a set of changes for review with the intention of restoring impacted services. In this scenario, the responding team is typically under pressure and the proposed solution may not be a long-term one. The goal is to restore services for stakeholders as quickly as possible and follow up the incident with a review where a long-term solution can be designed, discussed, and scheduled for implementation.
The solution has been reviewed and approved. A responder implements the solution and tests it in their local environment before pushing to master and deploying to production. Depending on how progressive the team's practices are, this process may be streamlined using CI/CD workflows.
After release it is important to monitor production metrics to ensure the solution was comprehensive and worked as intended. Alerts will often auto-resolve during this phase.
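As an illustration of the post-release check described above, the following is a minimal Python sketch that compares availability over a monitoring window against an SLO target. It is not a GitLab feature or API: the `Sample` type, the `availability` and `meets_slo` functions, and the 99.9% default target are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Sample:
    """One scrape of request counters for a service (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def availability(samples: Iterable[Sample]) -> Optional[float]:
    """Return the fraction of successful requests, or None if there was no traffic."""
    samples = list(samples)
    total = sum(s.total_requests for s in samples)
    failed = sum(s.failed_requests for s in samples)
    if total == 0:
        return None  # no traffic in the window, so availability is undefined
    return 1 - failed / total


def meets_slo(samples: Iterable[Sample], slo_target: float = 0.999) -> bool:
    """True when measured availability reaches the (assumed) SLO target."""
    value = availability(samples)
    return value is not None and value >= slo_target


if __name__ == "__main__":
    # Hypothetical samples collected after the fix was deployed.
    window = [Sample(1000, 0), Sample(1000, 1)]
    print("SLO met:", meets_slo(window))
```

A window with no traffic is treated as not meeting the SLO, so that a service that is still unreachable is not mistaken for a healthy one. A real setup would read these counters from the team's monitoring system instead of hard-coded samples.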
Services have been restored and meet SLOs. A member of the team, often the Incident Commander, will communicate with stakeholders via different channels (Status Page, social media platforms, internal email, etc.) to inform them that services are back up and available.
After an incident, it is important to document what happened and how it was fixed. Taking the time to document this information may help the team triage and resolve a similar incident much faster in the future.
We have not yet enabled the entire workflow detailed above. However, we do have a couple of features you can take advantage of today to simplify your Resolve processes:
This workflow is currently at the Planned stage. Workflows in the Operations section are graded on the same maturity scale as categories.
We plan to provide a Resolve experience that allows our users to efficiently restore services, whether that means deploying a patch to application code or running a script to unclog ETL pipelines. Work supporting this workflow is captured in this epic.