GitLab believes in Open Development, and we encourage the community to file issues and open merge requests for our projects on GitLab.com. Their contributions are valuable, and we should handle them as effectively as possible. A central part of this is triage - the process of categorization according to type and severity.
Any GitLab team-member can triage issues. Keeping the number of un-triaged issues low is essential for maintainability, and is our collective responsibility. Consider triaging a few issues around your other responsibilities, or scheduling some time for it on a regular basis.
The Engineering Productivity team owns the issue triage process, but there is currently no capacity to manually triage issues without a group label. We rely on a combination of self-triage and tanuki-stan to ensure group labels are added and issues are seen and triaged by the relevant group.
#abuse Slack channel.
/duplicate action to create the link to the original and close the issue.
~"type::bug": assign a severity label.
An issue is considered completely triaged when all of the following criteria are met:
~"type::bug" and ~"UX Debt".
Type labels are defined on the Engineering Metrics page. If you are unsure about the type, you can tag the product or engineering manager for the group and ask their opinion.
Assigning a group label allows gitlab-bot to automatically assign the right stage label.
The Features by Group listing can help find the right group.
The priority label is used to indicate the importance and guide the scheduling of the issue. Priority labels are expected to be set based on the circumstances of the market, product direction, IACV impact, number of impacted users and capacity of the team. DRIs for prioritization are based on work type:
Priority | Importance | Intention | DRI |
---|---|---|---|
~"priority::1" | Urgent | We will address this as soon as possible regardless of the limit on our team capacity. Our target resolution time is 30 days. | PM, EM, or QEM of that product group, based on work type |
~"priority::2" | High | We will address this soon and will provide capacity from our team for it in the next few releases. This will likely get resolved in 60-90 days. | PM, EM, or QEM of that product group, based on work type |
~"priority::3" | Medium | We want to address this but may have other higher priority items. This will likely get resolved in 90-120 days. | PM, EM, or QEM of that product group, based on work type |
~"priority::4" | Low | We don't have visibility when this will be addressed. No timeline designated. | PM, EM, or QEM of that product group, based on work type |
If you need help estimating severity, tag the group's corresponding Software Engineer in Test or Quality Engineering Manager in the respective issue.
Note: These severity definitions apply to issues only. Please see the Severity Levels section of the Incident Management page for details on incident severity.
Severity labels help us determine urgency and clearly communicate the impact of a ~"type::bug" on users. There can be multiple categories of a ~"type::bug". Severity is also applicable to non-type::bug ~SUS::Impacting issues.
The presence of the bug category labels ~"bug::availability", ~"bug::performance", ~"bug::vulnerability", and ~UX indicates that the severity definition for that category should be used. When a ~"type::bug" corresponds to multiple categories, apply the highest of the resulting severities. For example, if an issue is a ~"severity::2" for ~"bug::availability" and a ~"severity::1" for ~"bug::performance", then the severity assigned to the issue should be ~"severity::1".
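The "highest severity wins" rule can be sketched in a few lines of Python. This is only an illustration: the function name `overall_severity` and the dictionary input are assumptions made for the example, not part of any GitLab tooling.

```python
import re

# Matches the scoped severity labels used on this page, e.g. "severity::2".
SEVERITY_PATTERN = re.compile(r"^severity::([1-4])$")


def overall_severity(category_severities):
    """Return the severity label to apply when a bug spans several categories.

    `category_severities` maps a bug category label (hypothetical input shape)
    to the severity judged for that category. severity::1 is the highest
    severity, so the smallest number wins.
    """
    levels = []
    for category, label in category_severities.items():
        match = SEVERITY_PATTERN.match(label)
        if match is None:
            raise ValueError(f"{category}: not a severity label: {label!r}")
        levels.append(int(match.group(1)))
    if not levels:
        raise ValueError("at least one category severity is required")
    return f"severity::{min(levels)}"


example = {"bug::availability": "severity::2", "bug::performance": "severity::1"}
print(overall_severity(example))  # severity::1
```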
Once you've determined a severity for an issue, add a note that briefly explains why you selected it. This helps future team members understand your rationale and know how to proceed with the issue.
Type of ~"type::bug" | ~"severity::1": Blocker | ~"severity::2": Critical | ~"severity::3": Major | ~"severity::4": Low | Triage DRI |
---|---|---|---|---|---|
General bugs | Broken feature with no workaround or any data-loss. | Broken feature with an unacceptably complex workaround. | Broken feature with a workaround. | Functionality is inconvenient. | SET or QEM for that product group. |
~"bug::performance" Response time (API/Web/Git)¹ | Above 9000ms to timing out | Between 2000ms and 9000ms | Between 1000ms and 2000ms | Between 200ms and 1000ms | Enablement Quality Engineering team |
~"bug::performance" Browser Rendering (LCP)² | Above 9000ms to timing out | Between 4000ms and 9000ms | Between 3000ms and 4000ms | Between 2500ms and 3000ms | Enablement Quality Engineering team |
~"bug::performance" Browser Rendering (TBT)² | Above 9000ms to timing out | Between 2000ms and 9000ms | Between 1000ms and 2000ms | Between 300ms and 1000ms | Enablement Quality Engineering team |
~UX User experience problem³ | "I can't figure this out." Users are blocked and/or likely to make risky errors due to poor usability, and are likely to ask for support. | "I can figure out why this is happening, but it's really painful to solve." Users are significantly delayed by the available workaround. | "This still works, but I have to make small changes to my process." Users are self sufficient in completing the task with the workaround, but may be somewhat delayed. | "There is a small inconvenience or inconsistency." Usability isn't ideal or there is a small cosmetic issue. | Product Designers of that Product group |
~"bug::availability" of GitLab SaaS | See Availability section | See Availability section | See Availability section | See Availability section | |
~"bug::vulnerability" Security Vulnerability | See Security Prioritization | See Security Prioritization | See Security Prioritization | See Security Prioritization | AppSec team |
Global Search | See Search Prioritization | See Search Prioritization | See Search Prioritization | See Search Prioritization | |
~test Bugs blocking end-to-end test execution | See Blocked tests section | See Blocked tests section | See Blocked tests section | See Blocked tests section | Quality Engineering Sub-Department |
~"GitLab.com Resource Saturation" Capacity planning warnings | Mean forecast shows Hard SLO breach within 3 months. | | | | Scalability Engineering Manager (who will hand over to EM that owns the resource) |
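As a rough sketch, the ~"bug::performance" thresholds in the table above can be encoded as a lookup. The metric keys (`response_time`, `lcp`, `tbt`) and the function name are hypothetical. The table does not say how to treat a value that sits exactly on a shared boundary, so this sketch assumes the higher severity applies, in line with the guidance to pick the higher severity when an issue falls between two labels.

```python
# Lower bounds in milliseconds for severity 1 to 4, taken from the table above.
# The metric keys are hypothetical names chosen for this example.
THRESHOLDS_MS = {
    "response_time": (9000, 2000, 1000, 200),
    "lcp": (9000, 4000, 3000, 2500),
    "tbt": (9000, 2000, 1000, 300),
}


def performance_severity(metric, milliseconds):
    """Map a measured time to a severity label, or None if below all ranges.

    Assumption: a value exactly on a boundary shared by two ranges gets the
    higher severity. severity::1 requires a value strictly above 9000 ms;
    a timeout can be passed as float("inf").
    """
    s1, s2, s3, s4 = THRESHOLDS_MS[metric]
    if milliseconds > s1:
        return "severity::1"
    if milliseconds >= s2:
        return "severity::2"
    if milliseconds >= s3:
        return "severity::3"
    if milliseconds >= s4:
        return "severity::4"
    return None


print(performance_severity("response_time", 2500))  # severity::2
print(performance_severity("lcp", 2600))            # severity::4
```

Because the thresholds live in one table, re-evaluating an issue after a performance improvement is a matter of calling the function again with the new measurement.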
The severity label also helps us define the time a ~"type::bug" or ~"corrective action" of that severity should be completed. This indicates the expected timeline & urgency which is used to measure our SLO targets.
Severity | Incident root cause analysis ~"corrective action" SLO | ~"type::bug" resolution SLO | ~"GitLab.com Resource Saturation" resolution SLO | Security ~vulnerability SLO |
---|---|---|---|---|
~"severity::1" | 1 week | The current release + next available deployment to GitLab.com (within 30 days) | Within 2 months | See Vulnerability Remediation SLAs |
~"severity::2" | 30 days | The next release (60 days) | | See Vulnerability Remediation SLAs |
~"severity::3" | 60 days | Within the next 3 releases (approx one quarter or 90 days) | | See Vulnerability Remediation SLAs |
~"severity::4" | 90 days | Anything outside the next 3 releases (more than one quarter or 120 days). | | See Vulnerability Remediation SLAs |
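To make the SLO table concrete, the following sketch turns a severity into a target date. It rests on two assumptions that the table does not state: the clock starts on the date passed in as `opened_on`, and ~"severity::4" bugs are given the 120-day figure even though the table describes them as anything beyond the next three releases. The function and constant names are illustrative only.

```python
from datetime import date, timedelta

# Day counts read from the SLO table above.
BUG_RESOLUTION_DAYS = {1: 30, 2: 60, 3: 90, 4: 120}
CORRECTIVE_ACTION_DAYS = {1: 7, 2: 30, 3: 60, 4: 90}


def slo_due_date(opened_on, severity, corrective_action=False):
    """Return the SLO target date for a bug or a corrective action.

    Assumption: the SLO period is counted from `opened_on`.
    """
    table = CORRECTIVE_ACTION_DAYS if corrective_action else BUG_RESOLUTION_DAYS
    return opened_on + timedelta(days=table[severity])


print(slo_due_date(date(2024, 1, 1), 2))                          # 2024-03-01
print(slo_due_date(date(2024, 1, 1), 1, corrective_action=True))  # 2024-01-08
```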
If an issue seems to fall between two severity labels, assign it to the higher severity label.
~"severity::1"
~"severity::2"
~"severity::3"
~"severity::4"
As the triager of an issue you are responsible for adjusting your decision based on additional information that surfaces later. To do that, track subsequent activity on issues that you have closed and adjust your decision as needed.
Issues with the ~"bug::availability" label directly impact the availability of GitLab.com SaaS. They are considered another category of ~"type::bug".
For the purposes of Incident Management, incident issue severities are chosen based on the availability severity matrix below.
We categorize these issues based on the impact to GitLab.com's customer business goal and day to day workflow.
The prioritization scheme adheres to our product prioritization where security and availability work are prioritized over feature velocity.
The presence of these severity labels modifies the standard severity labels (~"severity::1", ~"severity::2", ~"severity::3", ~"severity::4") by primarily taking into account the impact to users. The severity of these issues may change depending on the re-analysis of the impact to GitLab.com users.
Severity | Availability impact | Time to mitigate (TTM)(1) | Time to resolve (TTR)(2) | Minimum priority |
---|---|---|---|---|
~"severity::1" | Problem on GitLab.com blocking the typical user's workflow. Impacts 20% or more of users without an available workaround AND/OR Any roadblock that puts the guaranteed self-managed release date at risk (use ~backstage label) AND/OR Any data loss directly impacting customers | Within 8 hrs | Within 48 hrs | ~"priority::1" |
~"severity::2" | Problem on GitLab.com blocking the typical user's workflow. Impacts 20% or more of users, but a reasonable workaround is available. Impacts between 5%-20% of users without an available workaround | Within 24 hrs | Within 7 days | ~"priority::1" |
~"severity::3" | Broad impact on GitLab.com and minor inconvenience to typical user's workflow. No workaround needed. Impacts up to 5% of users | Within 72 hrs | Within 30 days | ~"priority::2" |
~"severity::4" | Minimal impact on GitLab.com typical user's workflow to less than 5% of users. May also include incidents with no impact, but with importance to resolve to prevent future risk | Within 7 days | Within 60 days | ~"priority::3" |
(1) - Mitigation uses non-standard work processes, e.g. hot-patching, critical code and configuration changes. Owned by the Infrastructure department, leveraging available escalation processes (dev-escalation and similar).
(2) - Resolution uses standard work processes, e.g. code review. Scheduling is owned by the Product department, within the defined SLO targets.
The priority of an availability issue is tied to severity in the following manner:
Issue with the labels | Allowed priorities | Not-allowed priorities |
---|---|---|
~"bug::availability" ~"severity::1" | ~"priority::1" only | ~"priority::2", ~"priority::3", and ~"priority::4" |
~"bug::availability" ~"severity::2" | ~"priority::1" only | ~"priority::2", ~"priority::3", and ~"priority::4" |
~"bug::availability" ~"severity::3" | ~"priority::2" as baseline, ~"priority::1" allowed | ~"priority::3" and ~"priority::4" |
~"bug::availability" ~"severity::4" | ~"priority::3" as baseline, ~"priority::2" and ~"priority::1" allowed | ~"priority::4" |
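The severity-to-priority rules for availability issues could be checked mechanically along these lines. This is a minimal sketch, not how triage-ops implements it: the function names and the plain list-of-label-strings input are assumptions for the example.

```python
import re

# Least urgent priority allowed for each availability severity, per the table above.
BASELINE_PRIORITY = {1: 1, 2: 1, 3: 2, 4: 3}


def _level(labels, scope):
    """Return the numeric level of the first `scope::N` label, or None."""
    for label in labels:
        match = re.fullmatch(rf"{scope}::([1-4])", label)
        if match:
            return int(match.group(1))
    return None


def check_availability_priority(labels):
    """Return a problem description, or None if the labels are acceptable.

    Issues without the bug::availability label are outside this rule.
    """
    if "bug::availability" not in labels:
        return None
    severity = _level(labels, "severity")
    priority = _level(labels, "priority")
    if severity is None or priority is None:
        return "missing severity or priority label"
    baseline = BASELINE_PRIORITY[severity]
    if priority > baseline:
        return (
            f"priority::{priority} is not allowed with severity::{severity}; "
            f"use priority::{baseline} or more urgent"
        )
    return None


print(check_availability_priority(["bug::availability", "severity::2", "priority::3"]))
```

The example call reports a violation because ~"severity::2" availability issues only allow ~"priority::1".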
The merge request (MR) experience is the core of our product. Due to many teams contributing to the MR workflow components, it has become a disjointed experience.
The overlapping is largely seen in the following areas: Merge Request Widgets, Mergeability Checks, MWPS and Merge Trains.
As part of the analysis in the Transient Bug working group, we have discovered that the most affected product areas are:
create::code review
verify::continuous integration
create::source code (tied)
plan::project management (tied)
These product groups also have a high sensitivity to GMAU. These product groups will benefit from a heightened awareness of bugs overlapping with Merge Request functionality.
We need an elevated sense of action in this area. If a bug is related to the merge request experience, it should have the labels ~UX and ~"merge requests".
Priority is tied to severity in the following manner:
MR UX bug severity | Allowed priorities | Not-allowed priorities |
---|---|---|
~"severity::1" | ~"priority::1" only | ~"priority::2", ~"priority::3" and ~"priority::4" |
~"severity::2" | ~"priority::1" only | ~"priority::2", ~"priority::3" and ~"priority::4" |
~"severity::3" | ~"priority::1" or ~"priority::2" | ~"priority::3" and ~"priority::4" |
~"severity::4" | ~"priority::1" or ~"priority::2" or ~"priority::3" | ~"priority::4" |
End-to-end tests that don't run lead to blind spots that can cause unforeseen availability issues. We must ensure coverage is stable and active by quickly resolving issues that cause quarantined end-to-end tests.
To promote awareness of bugs blocking end-to-end test execution, newly opened ~test ~"type::bug" issues will be announced in several Slack channels:
Priority is tied to severity in the following manner:
Type of test blocked | Bug severity | Allowed priorities | Not-allowed priorities |
---|---|---|---|
Smoke end-to-end test | ~"severity::1" | ~"priority::1" only | ~"priority::2", ~"priority::3" and ~"priority::4" |
Non-smoke end-to-end test | ~"severity::2" | ~"priority::2" as baseline, ~"priority::1" allowed | ~"priority::3" and ~"priority::4" |
Improving performance: It may not be possible to reach the intended response time in one iteration. We encourage performance improvements to be broken down. Improve where we can and then re-evaluate the next appropriate level of severity & priority based on the new response time.
Some UX-related issues are identified as impacting our System Usability Scale (SUS) score, which is a focus in our three-year strategy. We identify SUS-impacting issues with at least one of the labels listed in the Total open SUS-impacting issues by severity UX KPI. If one of these labels is applied, the tracking label ~SUS::Impacting will automatically be added. These issues can have a severity label applied with or without an accompanying ~"type::bug" label. Issues with type::bug follow the severity and SLOs for type::bug issues. Issues without type::bug have no SLO.
type::bug Severity
SUS issue severity without type::bug label | Allowed priorities | Recommended delivery |
---|---|---|
~"severity::1" | ~"priority::1" only | within 60 days |
~"severity::2" | ~"priority::1" or ~"priority::2" | within 120 days |
~"severity::3" | ~"priority::1", ~"priority::2", or ~"priority::3" | No SLA set today |
~"severity::4" | ~"priority::1", ~"priority::2", ~"priority::3", or ~"priority::4" | No SLA set today |
Note: The above delivery timeframes only apply to new UX bugs filed after 2022-03-22. All UX bugs filed prior to this date need to be reevaluated for the correct delivery timeframe.
Additionally, we include UX bugs (identified with both the ~UX and ~"type::bug" labels) in our list of SUS-Impacting issues.
Note: SUS-impacting issues are intended to have an impact on the current product experience rather than on new feature additions. An issue will have the SUS::Impacting label automatically applied if any of the SUS-impacting labels are used. However, there are exceptions:
type::feature and feature::addition indicate we are not making a change or improvement to an existing experience.
Actionable Insights::Exploration needed label applied but the issue is not ready to be prioritized and added to the product.
In these cases, you should replace the SUS::Impacting label with the SUS::Non-impacting label; a severity label is not needed.
As noted above, issues labeled as ~"UX Debt" also have a severity (and additionally priority) label applied without an accompanying ~"type::bug" label. UX Debt results from the decision to release a user-facing feature that needs refinement, with the intention to improve it in subsequent iterations. Because it is an intentional decision, ~"UX Debt" should not have a severity higher than ~"severity::3", because MVCs should not intentionally have obvious bugs or significant usability problems. If you find yourself creating a UX debt issue that is higher than ~"severity::3", please talk to your stage group team about reincorporating that issue into the MVC.
A transient bug is unexpected, unintended behavior that does not always occur in response to the same action.
Transient bugs give users conflicting impressions about what is happening when they take action, may not consistently occur, and last for a short period of time. While these bugs may not block a user's workflow and are usually resolved by a total page refresh, they are detrimental to the user experience and can build a lack of trust in the product. Users can become unsure about whether the data they are seeing is stale, fresh, or has even updated after they took an action. Examples of transient behaviors include:
In order to define an issue as a "transient bug," use the ~"bug::transient" label.
An issue may have an infradev label attached to it, which means it subscribes to a dedicated process related to SaaS availability and reliability, as detailed in the Infradev Engineering Workflow. These issues follow the established severity SLOs for bugs.
GitLab, like most large applications, enforces limits within certain features. The absence of limits can impact security, performance, and availability. For this reason, issues related to limits are considered ~"type::bug" in the ~"bug::availability" sub-category.
In order to define an issue as related to limits, add the labels ~"availability::limit" and ~"bug::availability".
Severity should be assessed using the following table:
Severity | Availability impact |
---|---|
~"severity::1" | Absence of this limit enables a single user to negatively impact availability of GitLab |
~"severity::2" | Absence of this limit poses a risk of reduced availability of GitLab |
~"severity::3" | Absence of this limit has a negative impact on ability to manage cost, performance, or availability |
~"severity::4" | A limit could be applied, but its absence does not pose availability risk |
These issues follow the established severity SLOs for bugs.
Initial triage involves (at a minimum) labelling an issue appropriately, so un-triaged issues can be discovered by searching for issues without any labels.
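Besides the web UI, unlabelled issues can also be listed through the GitLab REST API, where passing `labels=None` to the project issues endpoint filters for issues that have no labels. The helper below only builds the request URL; the function name, the default page size, and the choice of `gitlab-org/gitlab` as the example project are assumptions for illustration. Sorting by creation date in ascending order puts the oldest issues first.

```python
from urllib.parse import quote, urlencode


def unlabelled_issues_url(project_path, base_url="https://gitlab.com", per_page=20):
    """Build an API URL listing open issues without labels, oldest first."""
    # Project paths must be URL-encoded when used in place of a numeric ID.
    project_id = quote(project_path, safe="")
    query = urlencode(
        {
            "labels": "None",          # special value: issues with no labels
            "state": "opened",
            "order_by": "created_at",
            "sort": "asc",             # oldest first
            "per_page": per_page,
        }
    )
    return f"{base_url}/api/v4/projects/{project_id}/issues?{query}"


print(unlabelled_issues_url("gitlab-org/gitlab"))
```

The resulting URL can be fetched with any HTTP client. Public projects can be read without a token, though authenticated requests typically get higher rate limits.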
Follow one of these links:
Pick an issue, with preference given to the oldest in the list, and evaluate it with a critical eye, bearing the issue triage practices below in mind. Some questions to ask yourself:
~"bug::vulnerability" label be appropriate?
~"bug::vulnerability" issues or issues that contain private information.
Apply each label that seems appropriate. Issues with a security impact should be treated specially - see the security disclosure process.
If the issue seems unclear - you aren't sure which labels to apply - ask the requester to clarify matters for you. Keep our user communication guidelines in mind at all times, and commit to keeping up the conversation until you have enough information to complete triage.
Consider whether the issue is still valid. Especially for older issues, a ~"type::bug" may have been fixed since it was reported, or a ~"type::feature" may have already been implemented.
Be sure to check cross-reference notes from other issues or merge requests; they are a great source of information!
For instance, by looking at a cross-referenced merge request, you could see a "Picked into 8-13-stable, will go into 8.13.6." note, which would mean that the issue has been fixed since version 8.13.6.
If the issue meets the requirements, it may be appropriate to make a scheduling request - use your judgement!
You're done! The issue has all appropriate labels, and may now be in the backlog, closed, awaiting scheduling, or awaiting feedback from the requestor. Pick another, if you've got the time.
We're enforcing some of the policies automatically in triage-ops, using the @gitlab-bot user.
For more information about the automated triage, please read the Triage Operations page.
That said, we can't automate everything. In this section we'll describe some of the practices we're doing manually.
From time to time you may encounter issues for which it is difficult to pick a group or stage that should be responsible. It is likely that these issues address what is called Shared Responsibility Functionality of the product.
The approach for these is to use a decentralized triage process. The triage is not centralized in a single report or list, and it does not fall to one individual or group to have the responsibility to review those issues. This helps with scaling our triage operations to address a large number of issues that may fall into this shared responsibility category on an ongoing basis rather than in a recurring scheduled event.
The goal is to empower leadership at the group level (i.e. Product Manager and/or Engineering Manager) to make decisions on who, when and how these issues should be addressed. Higher-level management individuals and groups act as a backup to address escalations and make decisions when competing priorities make it difficult to decide on a course of action.
If you are triaging one of these issues as a GitLab engineer or as a quality department manager, or if you are the author of the issue, please make your best effort to assign a group label to the issue as soon as possible after creation. You don't have to get it perfect, but make a conscious effort to identify the group that is best set up for success to work on the issue.
You can ask yourself these questions when picking a group:
To help with initially narrowing down the list of possible groups, you may review the Product Categories page or the Stage Groups Ownership Index page.
In any case, you should attempt to understand the nature of the issue by asking follow-up questions to the author if necessary, and then map the requirements to the group that best matches the skills or expertise required.
Secondary triage happens when the issue has already been assigned to a group and now someone within the group (typically the PM or EM) is assessing the issue for prioritization and/or estimation. If you are the one doing this triage you may take one of the following courses of action:
As you work through the triage, exercise your judgement to decide when it is time to escalate issues to a higher level (i.e. senior management, directors or above) if you and your EM/PM peers can’t agree on the value, severity, priority or group purview of the issue. For now, the method to escalate is flexible and you can choose the right communication channel and modality for the situation.
As the DRI, you should consider taking additional steps to ensure the continued support of the affected area. This may involve putting forward proposals for the creation of new platform groups that can take the ongoing responsibility and technical strategy for the components in question. This of course does not preclude the need to take immediate action on the issue assigned to your group.
If as a result of the triage process a group is identified as qualified and willing to take ownership on a permanent basis, product and engineering leaders should officially document the type of ownership model and the team in the shared services components section of the Development handbook. Multiple groups may permanently share ownership of the same component if deemed appropriate.
It is important to keep in mind that throughout this process, as a leader in your group, you are deemed the initial Directly Responsible Individual (DRI) until the issue is resolved or someone else agrees to take over. Simply removing your group label without further triage conversations with other groups is not an acceptable or helpful action to take in this process. This aligns with our value of Results: global optimization.
For issues that haven't been updated in the last 3 months the "Awaiting Feedback" label should be added to the issue. After 14 days, if no response has been made by anyone on the issue, the issue should be closed. This is a slightly modified version of the Rails Core policy on outdated issues.
If they respond at any point in the future, the issue can be considered for reopening. If we can't confirm an issue still exists in recent versions of GitLab, we're just adding noise to the issue tracker.
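The outdated-issue policy above amounts to a small decision rule, sketched below. The function name, its parameters, and the returned action strings are hypothetical, and "3 months" is approximated as 90 days for the purposes of the example.

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=90)  # "3 months", approximated as 90 days (assumption)
CLOSE_AFTER = timedelta(days=14)  # waiting period after the label is added


def stale_action(today, last_updated, labelled_on=None, responded=False):
    """Suggest the next triage step for a possibly outdated issue.

    `labelled_on` is the date the "Awaiting Feedback" label was added, if any.
    `responded` says whether anyone has replied since the label was added.
    """
    if labelled_on is None:
        if today - last_updated >= STALE_AFTER:
            return "add Awaiting Feedback label"
        return "no action"
    if responded:
        return "review response"
    if today - labelled_on >= CLOSE_AFTER:
        return "close"
    return "wait"


print(stale_action(date(2024, 6, 1), last_updated=date(2024, 1, 1)))
# add Awaiting Feedback label
```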
To find duplicates:
Use the issue with the better title, description, or more comments and positive reactions as the canonical version. If you can't decide, keep the earlier issue.
If the issue is really a support request for help, you can post this message:
Hey {{author}} thanks for reaching out, but it looks like this might be a request for support. The issue tracker is for new bug reports and feature proposals. For support requests we have several resources that you can use to find help and support from the Community, including:
* [Technical Support for Paid Tiers](https://about.gitlab.com/support/)
* [Community Forum](https://forum.gitlab.com/)
* [Reference Documents and Videos](https://about.gitlab.com/get-help/#references)
Please refer to our [Support page](https://about.gitlab.com/support/) for more information.
I'm closing this issue but if you believe this was closed in error, please feel free to reopen the issue.
/label ~"support request"
/close
If you find duplicates, you can post this message:
Hey {{author}}! Thanks for submitting this issue. It looks like a duplicate of {{issue}}. I'm marking your issue as a duplicate and closing it.
Please add your thoughts and feedback on {{issue}}. Don't forget to upvote feature proposals.
/duplicate {{issue}}
Don't make any forward-looking statements about milestone targets that the duplicate issue may be assigned to.
We simply can't satisfy everyone. We need to balance pleasing users as much as possible with keeping the project maintainable.
When an issue comes in, it should be triaged and labeled. Issues without labels are harder to find and often get lost.
Be careful with severity labels. Underestimating severity can make a problem worse by suggesting resolution can wait longer than it should. Review available Severity labels. If you are not certain of the right label for a bug, it is OK to overestimate the severity. But do get confirmation from a domain expert.
Sort by "Author: your username" and close any issues which you know have been fixed or have become irrelevant for other reasons. Label them if they're not labeled already.
Some issues may not fall into the type labels, but they contain useful feedback on how GitLab features are used.
These issues should be mentioned to the product manager and labeled as ~"Product Feedback"
in addition to the group, category and stage labels.
https://gitlab.com/gitlab-org/gitlab/-/issues/324770 is an example of a Product Feedback issue.
If it's a question, or something vague that can't be addressed by the development team for whatever reason, close it and direct them to the relevant support resources we have (e.g. https://about.gitlab.com/get-help/, our Discourse forum or emailing Support).
If you notice a common pattern amongst various issues (e.g. a new feature that doesn't have a dedicated label yet), suggest adding a new label in Slack or a new issue.
If possible, ask the reporter to reproduce the issue in a public project on GitLab.com. You can also try to do so yourself in the issue-reproduce group. You can ask any owner of that group for access.
The original issue about these policies is #17693. We'll be working to improve the situation from within GitLab itself as time goes on.
The following projects, resources, and blog posts were very helpful in crafting these policies:
¹ Our current response time targets for APIs, Web Controllers and Git calls are based on the TTFB P90 results of the GitLab Performance Tool (GPT) being run against a 10k-user reference environment under lab-like conditions. This run happens nightly and results are output to the wiki on the GPT project.
² Our current Browser Rendering targets for Largest Contentful Paint (LCP) and Total Blocking Time (TBT) are based on results of SiteSpeed being run against a 10k-user reference environment under lab-like conditions. This run happens nightly and results are output to the wiki on the GPT project.