All requests for work to the Reliability Team come through the Reliability Issue Tracker. The management of this queue is an ongoing maintenance task for Reliability Engineers and Managers. This page contains an overview of the criteria used in determining how work is triaged and prioritized.
Priority for incoming work is based on a matrix measuring the impact and urgency of an issue.
Impact is the measure of the effect of an incident, problem, or change on business processes as detailed in the issue.
The table below can be used as a general guide for determining impact:
Impact | Description |
---|---|
High | -The issue needs to be resolved to mitigate an active S1 or S2 incident -The issue is a roadblock on GitLab.com and is blocking customers' business goals and day-to-day workflows -The damage to the reputation of the business is likely to be high. -Deploys are blocked as a result -The potential financial impact is high |
Medium | -The issue is impacting a moderate subset of employees or a small subset of customers -The damage to the reputation of the business is not likely to be high. - The potential financial impact is low but greater than 0 |
Low | -The issue is impacting a small subset of employees -There is no impact on customers -There is no risk to the reputation of the business -There is no financial impact |
Urgency is the speed at which an issue should be resolved based on business need or expectation.
The table below can be used as a general guide for determining urgency:
Urgency | Description |
---|---|
High | -The impact increases rapidly over time. -Damage to the reputation of the business will increase over time. -Any roadblock that puts the guaranteed self-managed release date at risk -A minor incident could be prevented from becoming a major incident by acting immediately. -A member of senior leadership has requested urgency |
Medium | -The impact increases only slightly over time. -The damage to the reputation of the business will not increase over time. -The customer has requested urgency. |
Low | -The impact does not increase at all over time. -The customer indicates that the issue is not urgent. |
Once the impact and urgency of an issue have been determined, it is time to assign a priority.
The table below can be used as a general guide for assigning priority:
Priority | Impact/Urgency | Action |
---|---|---|
~Reliability::P1 | Impact: High, Urgency: High | -An engineer is assigned as soon as possible. -Issue is handed off between regions until it is resolved or mitigated to the point that priority can be lowered. -Issues should be labeled as ~Reliability::P1 only when immediate action is required |
~Reliability::P2 | Impact: High, Urgency: Medium; or Impact: Medium, Urgency: High | -Issue is prioritized above P3s and P4s -Issue is worked on but is not handed off between regions -If an engineer is changing roles before they are able to resolve the issue, it should be handed over/assigned to another engineer. |
~Reliability::P3 | Impact: High, Urgency: Low; or Impact: Low, Urgency: High; or Impact: Medium, Urgency: Medium | -Issue is prioritized above P4s only -If an engineer is changing roles before they are able to resolve the issue, the issue should be dropped back into the tracker with a summary of what has been done so far and what the next steps are. |
~Reliability::P4 | Impact: Low, Urgency: Medium; or Impact: Medium, Urgency: Low | -Engineers are not assigned until all higher priority work has been completed |
~Reliability::P5 | Impact: Low, Urgency: Low | -If an issue is determined to be a P5, it is questionable whether the issue should be done at all. If it turns out something was missed, it can be moved to a higher priority. Otherwise it should be closed with an explanatory note. |
Note: Issue priority levels generally align with the severity levels defined within the Incident Management process.
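The impact/urgency matrix can also be read as a simple lookup. The Python sketch below is illustrative only: `Level`, `PRIORITY_MATRIX`, and `assign_priority` are hypothetical names and not part of any Reliability tooling. Only the label strings and the nine combinations come from the priority table.

```python
from enum import Enum


class Level(Enum):
    """Impact or urgency rating, as used in the tables on this page."""
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


# (impact, urgency) -> priority label, transcribed from the priority table.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): "Reliability::P1",
    (Level.HIGH, Level.MEDIUM): "Reliability::P2",
    (Level.MEDIUM, Level.HIGH): "Reliability::P2",
    (Level.HIGH, Level.LOW): "Reliability::P3",
    (Level.LOW, Level.HIGH): "Reliability::P3",
    (Level.MEDIUM, Level.MEDIUM): "Reliability::P3",
    (Level.LOW, Level.MEDIUM): "Reliability::P4",
    (Level.MEDIUM, Level.LOW): "Reliability::P4",
    (Level.LOW, Level.LOW): "Reliability::P5",
}


def assign_priority(impact: Level, urgency: Level) -> str:
    """Return the priority label for an impact/urgency pair."""
    return PRIORITY_MATRIX[(impact, urgency)]


print(assign_priority(Level.MEDIUM, Level.HIGH))  # Reliability::P2
```

Because the table is described as a general guide, a lookup like this would only be a starting point; the person triaging can still adjust the result.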
The Reliability team uses the following response time SLAs for all ~Reliability::P1 and ~Reliability::P2 issues that also have the ~unblocks others label. The ~unblocks others label indicates that the request originated from outside of the Infrastructure Department.
Priority | Initial Response Time | Follow Up Response Time | Coverage | How to engage |
---|---|---|---|---|
~Reliability::P1 | 30 minutes | 4 hours | 24x7 | This priority level is reserved for issues requiring immediate attention. To create a ~Reliability::P1 issue, first declare an incident to page the EOC |
~Reliability::P2 and unblocks others | 3 days | 7 days | 24x5 | Create an issue in the Reliability Issue Tracker and use labels ~Reliability::P2 and unblocks others |
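As a rough illustration of how the response targets could be looked up from an issue's labels, here is a minimal Python sketch. The names `SLA_TARGETS` and `response_sla` are hypothetical. The sketch follows the table rows, so it assumes a ~Reliability::P1 issue is covered regardless of other labels and a ~Reliability::P2 issue only when it also carries `unblocks others`. Coverage windows (24x7 versus 24x5) are not modeled.

```python
from datetime import timedelta
from typing import Optional, Set, Tuple

# (initial response, follow-up response) targets from the SLA table.
# Coverage windows (24x7 vs 24x5) are intentionally not modeled here.
SLA_TARGETS = {
    "Reliability::P1": (timedelta(minutes=30), timedelta(hours=4)),
    "Reliability::P2": (timedelta(days=3), timedelta(days=7)),
}


def response_sla(labels: Set[str]) -> Optional[Tuple[timedelta, timedelta]]:
    """Return (initial, follow_up) targets, or None if no SLA applies."""
    if "Reliability::P1" in labels:
        return SLA_TARGETS["Reliability::P1"]
    # Assumption: P2 is only covered together with "unblocks others".
    if "Reliability::P2" in labels and "unblocks others" in labels:
        return SLA_TARGETS["Reliability::P2"]
    return None


print(response_sla({"Reliability::P2", "unblocks others"}))
```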
Note: Except for corrective actions and security issues, Reliability does not use ~"priority" or ~"severity" labels. If these labels are added, they will be removed.
Corrective actions and security issues labeled as ~"corrective action" or ~"security" are labeled differently in the Reliability issue tracker. For these issues, severity::* labels are set to meet specific SLOs.
Corrective Action Label | SLO (days after issue has been created) |
---|---|
severity::1 | 1 week |
severity::2 | 30 days |
severity::3 | 60 days |
severity::4 | 90 days |
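The SLO for a corrective action or security issue can be turned into a due date by adding the allowed time to the creation date. The sketch below is a hypothetical helper (`SLO_DAYS` and `slo_due_date` are illustrative names), and it assumes the SLO is counted in calendar days, with "1 week" treated as 7 days.

```python
from datetime import date, timedelta

# SLO in days after issue creation, from the table above.
# Assumption: calendar days; "1 week" is treated as 7 days.
SLO_DAYS = {
    "severity::1": 7,
    "severity::2": 30,
    "severity::3": 60,
    "severity::4": 90,
}


def slo_due_date(created: date, severity_label: str) -> date:
    """Return the date by which the issue should be resolved."""
    return created + timedelta(days=SLO_DAYS[severity_label])


print(slo_due_date(date(2022, 2, 2), "severity::2"))  # 2022-03-04
```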
Note: For issues labeled with ~"security" and ~"corrective action", we do not use the ~"priority::*" labels. If added, they will be removed. Please use the Reliability::P* labels instead.
Issues and epics should always use one of the following workflow labels:
Label | Description |
---|---|
workflow-infra::Triage | Applied to all new issues automatically. It indicates that the issue has not yet been reviewed by the team. Note: This label should not be used on epics. |
workflow-infra::Proposal | Applied when an issue or epic requires discussion, input or review. This label should be used when there is not enough scope defined to begin work. |
workflow-infra::Ready | Applied after triage, for any work in an issue or epic that is ready to start but is not yet scheduled. It should be clear what needs to be done in the epic or issue. If there are no immediate tasks because it requires more discussion, use the workflow-infra::Proposal label. |
workflow-infra::In Progress | Applied only after the issue has a Reliability Team member as an assignee or a DRI assigned for an epic. It is meant to represent work that is actively in-progress and not to indicate that an issue will be worked on in the future. |
workflow-infra::Blocked | Applied when an issue or epic cannot proceed because it is waiting on other tasks to complete. All blocked work should have a clear description about why it is blocked with a related issue that can be tracked. |
workflow-infra::Stalled | Applied when an issue was in-progress, but work has stopped due to unrelated priority shifts. All stalled work should have a clear description about why it is stalled with a related issue that can be tracked. |
workflow-infra::Cancelled | Applied when closing incomplete work because it is decided that no more will be done on an issue or epic. |
workflow-infra::Done | Applied when an issue or epic has been closed and the work is complete. |
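The workflow label rules lend themselves to a small consistency check. The following Python sketch is hypothetical (`check_workflow_label` is not an existing tool) and encodes only two rules stated in the table: an issue or epic should carry one workflow label, and workflow-infra::Triage should not be used on epics. It assumes "one of" means exactly one.

```python
from typing import Iterable, List

# Workflow labels listed in the table above.
WORKFLOW_LABELS = {
    "workflow-infra::Triage",
    "workflow-infra::Proposal",
    "workflow-infra::Ready",
    "workflow-infra::In Progress",
    "workflow-infra::Blocked",
    "workflow-infra::Stalled",
    "workflow-infra::Cancelled",
    "workflow-infra::Done",
}


def check_workflow_label(labels: Iterable[str], is_epic: bool = False) -> List[str]:
    """Return a list of problems found; an empty list means the labels look fine."""
    problems = []
    workflow = [label for label in labels if label in WORKFLOW_LABELS]
    # Assumption: "one of the following" is read as exactly one label.
    if len(workflow) != 1:
        problems.append(
            f"expected exactly one workflow label, found {len(workflow)}"
        )
    if is_epic and "workflow-infra::Triage" in workflow:
        problems.append("workflow-infra::Triage should not be used on epics")
    return problems


print(check_workflow_label(["workflow-infra::Triage"], is_epic=True))
```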
service:: labels are used to route issues to the right team within Reliability. All issues should have a service:: label specified during the triage process. If multiple services are involved, choose the service that best fits based on the details in the issue.
Service | Team | Issue |
---|---|---|
service::Terraform | Foundations | Issue#17010 |
service::Prometheus | Observability | Issue#14574 |
service::Gitaly | Gitaly Stable Counterpart | Issue#16271 |
service::Sidekiq | General | Issue#15720 |
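Routing by service:: label amounts to a mapping from label to team. The sketch below is a hypothetical illustration (`SERVICE_TEAMS` and `route_issue` are made-up names) that covers only the four rows shown in the table above; any other service:: label returns `None`, which would mean the issue still needs manual routing.

```python
from typing import Iterable, Optional

# Service label -> owning team, limited to the rows shown in the table.
SERVICE_TEAMS = {
    "service::Terraform": "Foundations",
    "service::Prometheus": "Observability",
    "service::Gitaly": "Gitaly Stable Counterpart",
    "service::Sidekiq": "General",
}


def route_issue(labels: Iterable[str]) -> Optional[str]:
    """Return the team for the first known service:: label, or None."""
    for label in labels:
        if label in SERVICE_TEAMS:
            return SERVICE_TEAMS[label]
    return None


print(route_issue(["workflow-infra::Triage", "service::Gitaly"]))
# Gitaly Stable Counterpart
```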
During the triage process, issues will be routed to a Reliability Team where appropriate. See the table below for the current list of active team labels:
Label | Team |
---|---|
~team::Foundations | Foundations |
~team::Observability | Observability |
~team::Database Reliability | Database Reliability |
~team::Practices | Practices |
~team::General | General |
T-shirt labels are used to estimate the size of issues. These are always a rough estimate and often need to be adjusted once the full scope of an issue is defined.
Label | Estimated Time Requirement | Example |
---|---|---|
tShirt-size::XS | 4 hours or less | TBD |
tShirt-size::S | 1 day or less | TBD |
tShirt-size::M | 1 week or less | TBD |
tShirt-size::L | 1 month or less | TBD |
tShirt-size::XL | More than 1 month | TBD |
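To show how the size thresholds stack, here is a minimal Python sketch that picks a T-shirt label from a time estimate. `SIZE_LIMITS` and `tshirt_size` are hypothetical names. The sketch assumes estimates are expressed as elapsed time rather than working hours, and that "1 month" means 30 days; neither is specified on this page.

```python
from datetime import timedelta

# Upper bound for each size, smallest first, from the table above.
# Assumptions: elapsed (not working) time, and 1 month = 30 days.
SIZE_LIMITS = [
    (timedelta(hours=4), "tShirt-size::XS"),
    (timedelta(days=1), "tShirt-size::S"),
    (timedelta(weeks=1), "tShirt-size::M"),
    (timedelta(days=30), "tShirt-size::L"),
]


def tshirt_size(estimate: timedelta) -> str:
    """Return the smallest size label whose limit covers the estimate."""
    for limit, label in SIZE_LIMITS:
        if estimate <= limit:
            return label
    return "tShirt-size::XL"  # more than 1 month


print(tshirt_size(timedelta(days=3)))  # tShirt-size::M
```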
- unblocks others: Apply this label to all issues that originate outside of the Infrastructure Department.

The issue board for Reliability is reviewed twice a week by the Reliability Leadership Team. If you have an urgent issue that you believe should be prioritized ahead of other work, please reach out to any Engineering Manager on the Reliability Team to discuss.
Issue Triage is performed twice weekly by members of the Reliability Leadership Team. The process consists of four parts.
- workflow-infra::Triage tickets lists most recently updated issues first.
- workflow-infra::Ready label. This indicates that the issue has been through the prioritization process and is ready to be looked at by an engineer.
- workflow-infra::Triage list view, select the next issue in the list and put it through the same process.
- service:: label.

The overall backlog of general issues is reviewed to assess and adjust priority. The following boards are used to help with this process:
- ~workflow-infra::In Progress are truly active.

Corrective Actions for all of Infrastructure are reviewed, prioritized, and assigned to prevent or reduce the likelihood and/or impact of an incident recurrence. We use this board to track corrective actions work. Corrective Actions are also an important performance indicator for the Infrastructure Department.
The process is as follows:
- ~infradev label.

Capacity planning warning issues are generated by the Scalability team and indicate concerns that we should address before they cause an alert (or an outage!). These issues are tracked outside of the Reliability project in the capacity planning project. The list of issues having the ~team::Reliability label is reviewed once per quarter and the work is then assigned based on impact and urgency.
Note: Epics that require status tracking should be updated each Wednesday.
Example:
DRI: <!-- GitLab username for the DRI -->
## Status YYYY-MM-DD
<!-- Status summary: one or two sentences meant for management and team members outside of Reliability.
Example:
## Status 2022-02-02
This week we rolled out the latest update to this amazing component, the team had a few setbacks that will be overcome in the next few weeks.
-->
### Additional status sections (optional)
<!-- Additional status updates underneath headings. For example, what was shipped, what is in progress, blockers, etc.
This part won't be rolled up into the parent epic.
Example:
### Details
- Update `2.27.1` was rolled out to GPRD successfully.
- Issue #44 was closed after the team agreed to cancel the work.
- Issue #45 was completed successfully.
-->
### Overview
<!-- Overview of the project and its goals -->
### Reference (optional)
<!-- Links to boards, blueprints, readiness, or anything else relevant to the project -->