All requests for work to the Reliability Team come through the Reliability Issue Tracker. The management of this queue is an ongoing maintenance task for Reliability Engineers and Managers. This page contains an overview of the criteria used in determining how work is triaged and prioritized.
Priority for incoming work is based on a matrix measuring the impact and urgency of an issue.
Impact is the measure of the effect of an incident, problem, or change on business processes as detailed in the issue.
The table below can be used as a general guide for determining impact:
|High||-The issue needs to be resolved to mitigate an active S1 or S2 incident
-The issue is a roadblock on GitLab.com and is blocking customers' business goals and day-to-day workflows
-The damage to the reputation of the business is likely to be high.
-Deploys are blocked as a result
-The potential financial impact is high
|Medium||-The issue is impacting a moderate subset of employees or a small subset of customers
-The damage to the reputation of the business is not likely to be high.
-The potential financial impact is low but greater than zero
|Low||-The issue is impacting a small subset of employees
-There is no impact on customers
-There is no risk to the reputation of the business
-There is no financial impact
Urgency is the speed at which an issue should be resolved based on business need or expectation.
The table below can be used as a general guide for determining urgency:
|High||-The impact increases rapidly over time.
-Damage to the reputation of the business will increase over time.
-Any roadblock that puts the guaranteed self-managed release date at risk
-A minor incident could be prevented from becoming a major incident by acting immediately.
-A member of senior leadership has requested urgency
|Medium||-The impact increases only slightly over time.
-The damage to the reputation of the business will not increase over time.
-The customer has requested urgency.
|Low||-The impact does not increase at all over time.
-The customer indicates that the issue is not urgent.
Once the impact and urgency of an issue have been determined, it is time to assign a priority.
The table below can be used as a general guide for assigning priority:
|-A resource is assigned as soon as possible.
Issue is handed off between regions until it is resolved or mitigated to the point that priority can be lowered.
|-Issue is prioritized above P3s and P4s
-Issue is worked on but is not handed off between regions
-If an engineer is changing roles before they are able to resolve the issue, it should be handed over/assigned to another resource.
Impact: Low Urgency: High
Impact: Medium Urgency: Medium
|-Issue is prioritized above P4s only
-If an engineer is changing roles before they are able to resolve the issue, the issue should be dropped back into the tracker with a summary of what has been done so far and what the next steps are.
|-Resources are not assigned until all higher priority work has been completed|
|-If an issue is determined to be a P5, it is questionable whether the work should be done at all. If it turns out that something was missed, it can be moved to a higher priority. Otherwise, it should be closed with an explanatory note.|
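The lookup from impact and urgency to a priority label can be sketched in a few lines of Python. This is an illustration only: the individual matrix cells are not fully reproduced on this page, so the sketch assumes a symmetric matrix in which the priority depends on the combined level of impact and urgency. The `Level` enum and the `assign_priority` function are hypothetical names, not part of any GitLab tooling. Under this assumption, "Impact: Low Urgency: High" and "Impact: Medium Urgency: Medium" land on the same priority.

```python
from enum import IntEnum


class Level(IntEnum):
    """Impact or urgency rating, as described in the guide tables above."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def assign_priority(impact: Level, urgency: Level) -> str:
    """Return a Reliability::P* label for an impact/urgency pair.

    ASSUMPTION: the matrix is symmetric and the priority is derived from
    the sum of the two levels. Check the team's actual matrix before
    relying on this mapping.
    """
    score = impact + urgency          # ranges from 2 (Low/Low) to 6 (High/High)
    return f"Reliability::P{7 - score}"  # 6 -> P1, 5 -> P2, ..., 2 -> P5


print(assign_priority(Level.MEDIUM, Level.LOW))  # Reliability::P4
```

Deriving the label from a sum keeps the sketch short. If the real matrix is not symmetric, an explicit dictionary keyed by `(impact, urgency)` would be the safer choice.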
Except for corrective actions and security issues, Reliability does not use
~"severity" labels. If these labels are added, they will be removed.
When analyzing incoming work, we need to determine which of the following classifications it belongs to. By default, all incoming work is put into the
work::general category until it has been classified.
Any issue that doesn't belong to a larger project and can be completed within 5 days should be considered a general issue. General issues are smaller and are worked on by individuals as opposed to project squads. General issues are labeled as work::general.
Incident-related work will be labeled as
work::incident. Investigation work is normally completed by the EOC during their shift or shortly after it ends.
Corrective actions follow the corrective action process and will also have the
Any issue that is associated with a larger effort that will take longer than 5 days is considered a project. Once an issue is determined to be a project rather than a general work item, it will need to be prioritized within the existing backlog of projects. Project issues are labeled as work::project.
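The general-versus-project rule can be expressed as a small helper. This is a minimal sketch, not team tooling: the function name `classify_work` and its parameters are hypothetical, and it assumes an estimate of exactly 5 days still counts as general work, since the page does not say which side of the boundary that case falls on.

```python
GENERAL_MAX_DAYS = 5  # threshold taken from the descriptions above


def classify_work(estimated_days: float, part_of_larger_project: bool = False) -> str:
    """Pick a work:: label from an effort estimate.

    ASSUMPTION: an issue estimated at exactly 5 days is treated as general.
    Incident and corrective-action work are labeled separately and are
    not handled here.
    """
    if part_of_larger_project or estimated_days > GENERAL_MAX_DAYS:
        return "work::project"
    return "work::general"


print(classify_work(2))   # work::general
print(classify_work(12))  # work::project
```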
Corrective actions and security issues labeled as
~"corrective action" or
~"security" are labeled differently in the Reliability issue tracker.
For these issues,
severity::* labels are set to meet specific SLOs.
|Corrective Action Label||SLO (days after issue has been created)|
Note: For issues labeled with
~"corrective action" we do not use the
~"priority::*" labels. If added, they will be removed. Please use the
Reliability::P* labels instead.
Issues should always fall into one of the following states, as defined by the following labels:
workflow-infra::Triage - Applied to all new Reliability issues automatically. It indicates that the issue has not yet been reviewed by the team.
workflow-infra::Proposal - Applied to any issues that require discussion, input or review. Some examples include:
workflow-infra::Ready - Applied after the issue has been triaged by an SRE or Engineering Manager within the Reliability Team. Ensure the following questions are answered and labeled before marking an issue as workflow-infra::Ready:
work::general if the issue can reasonably be completed by a single SRE in less than 5 days. Apply
work::project for anything that would take longer.
- What is the priority of the issue as defined by the prioritization matrix?
~Reliability::P5 based on the prioritization matrix.
workflow-infra::In Progress - This label should be applied only after the issue has a Reliability Team member as an assignee. This label is meant to represent only work that is actively in progress and not to indicate that an issue will be worked on in the future.
workflow-infra::Done - This label is applied only when the issue has been closed.
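The workflow rules above lend themselves to a simple consistency check. The sketch below is illustrative only: `Issue` is a hypothetical stand-in data class, not a GitLab API object, and `workflow_problems` is a made-up helper. It checks that an issue carries exactly one workflow-infra label, that In Progress issues have an assignee, and that Done is used only on closed issues.

```python
from dataclasses import dataclass, field

WORKFLOW_PREFIX = "workflow-infra::"
WORKFLOW_STATES = {"Triage", "Proposal", "Ready", "In Progress", "Done"}


@dataclass
class Issue:
    """Hypothetical stand-in for an issue; not a GitLab API object."""
    labels: set = field(default_factory=set)
    assignees: list = field(default_factory=list)
    closed: bool = False


def workflow_problems(issue: Issue) -> list:
    """Return a list of rule violations; an empty list means the issue is consistent."""
    states = sorted(
        label[len(WORKFLOW_PREFIX):]
        for label in issue.labels
        if label.startswith(WORKFLOW_PREFIX)
    )
    if len(states) != 1:
        return [f"expected exactly one workflow-infra label, found {len(states)}"]

    state = states[0]
    problems = []
    if state not in WORKFLOW_STATES:
        problems.append(f"unknown workflow state: {state}")
    if state == "In Progress" and not issue.assignees:
        problems.append("In Progress requires an assignee")
    if state == "Done" and not issue.closed:
        problems.append("Done is only for closed issues")
    return problems
```

For example, `workflow_problems(Issue(labels={"workflow-infra::In Progress"}))` reports the missing assignee, while the same issue with an assignee returns an empty list.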
During the triage process, issues will be routed to Standing Squads where appropriate. See the table below for the current list of active Standing Squad labels:
||Reliability Foundations Squad|
||Database Reliability Squad|
T-shirt labels are used to estimate the size of issues. These are always a rough estimate and often need to be adjusted once the full scope of an issue is defined.
|Label||Estimated Time Requirement||Example|
||4 hours or less||TBD|
||1 day or less||TBD|
||1 week or less||TBD|
||1 month or less||TBD|
||More than 1 month||TBD|
unblocks others - Apply this label to all issues that originate outside of the Infrastructure Department.
The issue board for Reliability is reviewed twice a week by the Reliability Leadership Team. If you have an urgent issue that you believe should be prioritized ahead of other work, please reach out to any Engineering Manager on the Reliability Team to discuss.
workflow-infra::Triage tickets lists most recently updated issues first.
work::general label. If the scope of work is larger, add the
workflow-infra::Ready label. This indicates that the issue has been through the prioritization process and is ready to be looked at by an engineer.
workflow-infra::Triage list view, select the next issue in the list and put it through the same process.
Capacity planning warning issues are generated by the Scalability team and indicate concerns that we should address before they cause an alert (or an outage!). These issues are tracked outside of the Reliability project in the capacity planning project. The list of issues having the ~team::Reliability label is reviewed once per quarter and the work is then assigned based on impact and urgency.