These guidelines are intended to help you to investigate end-to-end test pipeline failures so that they can be properly addressed. This will involve analyzing each failure and creating an issue to report it. It might also involve fixing tests, putting them in quarantine, or reporting bugs in the application.
The Pipeline triage DRI is responsible for analyzing and debugging test pipeline failures. Please refer to the Quality Department pipeline triage rotation schedule to know who the current DRI is.
The test pipelines run on a scheduled basis, and their results are posted to Slack. The following are the QA test pipelines that are monitored every day.
Test pipelines also run against an Omnibus GitLab Docker image that reflects the current release on staging.gitlab.com. These are referred to in notifications as `dev.gitlab.org:5005/gitlab/omnibus-gitlab/gitlab-ee:xxx`.
For each pipeline there is a notification of success or failure (except for `master` pipelines, which only report failures).
If there's a failure, we use emoji to indicate the state of its investigation:
The general triage steps are:
After triaging failed tests, possible follow-up actions are:
Your priority is to make sure we have an issue for each failure, and to communicate the status of its investigation and resolution.
If there are multiple failures we recommend that you identify whether each one is new or old (and therefore already has an issue open for it). For each new failure, open an issue that includes only the required information. Once you have opened an issue for each new failure you can investigate each more thoroughly and act on them appropriately, as described in later sections.
The reason for reporting all new failures first is to allow faster discovery by engineers who might find the test failing in their own merge request test pipeline. If there is no open issue about that failure, the engineer will have to spend time trying to figure out if their changes caused it.
Known failures should be linked to the current pipeline triage report. However, issues can be opened by anyone and are not linked automatically, so be sure to confirm there is no existing issue before creating one.
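If it helps, you can also check for an existing issue from the command line using the GitLab Issues API. This is only a sketch: the project path, token, and search term below are placeholders, not the actual project used for failure issues.

```shell
# A sketch only: search open issues for the failing spec's name via the GitLab Issues API.
# <project-path> (URL-encoded) and <token> are placeholders for the project that tracks these failures.
curl --silent --header "PRIVATE-TOKEN: <token>" \
  "https://gitlab.com/api/v4/projects/<project-path>/issues?state=opened&search=create_project_spec"
```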
When reporting the failure, apply the appropriate `failure::*` label. By order of likelihood:
In the relevant Slack channel:
Use this step only if there is no existing issue capturing the failure. If there is already an issue, skip this step.
For environment or infrastructure failures, report them in the `#infrastructure-lounge` Slack channel or open an issue in the infrastructure project. Ask in `#quality` if you're unsure where to file the issue.

The aim of this step is to understand the failure. The results of the investigation will also tell you what to do about it. Update the failure issue with any findings from your review.
The following can help with your investigation:
| Log or artifact | Notes |
| --- | --- |
| Stack trace | Shown in the job's log; the starting point for investigating the test failure |
| Screenshots and HTML captures | Available for download in the job's artifact for up to 1 week after the job run |
| QA Logs | Included in the job's artifact; valuable for determining the steps taken by the tests before failing |
| Sentry logs (Staging, Preprod, Production) | If staging, preprod, or production tests fail due to a server error, there should be a record in Sentry. For example, you can search for all unresolved staging errors linked to the `gitlab-qa` user with the query `is:unresolved user:"username:gitlab-qa"`. However, note that some actions aren't linked to the `gitlab-qa` user, so they might only appear in the full unresolved list. |
| Kibana logs (Staging, Production) | Various application logs are sent to Kibana, including Rails, Postgres, Sidekiq, and Gitaly logs |
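If you prefer working from a terminal, a job's artifacts (screenshots, HTML captures, QA logs) can also be downloaded with the GitLab jobs API. This is a sketch only; the project ID, job ID, and token are placeholders.

```shell
# Sketch: download a failing job's artifacts archive, which contains the screenshots,
# HTML captures, and QA logs listed above. <project-id>, <job-id>, and <token> are placeholders.
curl --location --header "PRIVATE-TOKEN: <token>" \
  --output artifacts.zip \
  "https://gitlab.com/api/v4/projects/<project-id>/jobs/<job-id>/artifacts"
unzip artifacts.zip -d artifacts/
```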
Depending on your level of context for the test and its associated setup, you might feel comfortable investigating the root cause on your own, or you might get help from other SETs right away.
When investigating on your own, we suggest spending at most 20-30 minutes actively trying to find the root cause (this excludes time spent reporting the failure, reviewing the failure logs, or any test setup and pipeline execution time). After that point, or whenever you feel out of ideas, we recommend asking for help to unblock you.
You can run the test (or perform the test steps manually) against your local GitLab instance to see if the failure is reproducible. For example:
```shell
CHROME_HEADLESS=false bundle exec bin/qa Test::Instance::All http://localhost:3000 qa/specs/features/browser_ui/1_manage/project/create_project_spec.rb
```
Orchestrated tests are excluded by default. To run them, use `-- --tag orchestrated` before your file name. For example:
```shell
CHROME_HEADLESS=false bundle exec bin/qa Test::Instance::All http://localhost:3000 -- --tag orchestrated qa/specs/features/browser_ui/1_manage/project/create_project_spec.rb
```
You can also use the same Docker image (same sha256 hash) as the one used in the failing job to run GitLab in a container on your local machine.
In the logs of the failing job, search for `Downloaded newer image for gitlab/gitlab-ce:nightly` or `Downloaded newer image for gitlab/gitlab-ee:nightly` and use the sha256 hash just above that line.
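For example, if you save the raw job log locally, a quick grep surfaces the digest line; the log file name here is just an assumption.

```shell
# Sketch: print the sha256 digest line that appears just above the "Downloaded newer image" line.
# job.log is assumed to be the raw log saved from the failing job.
grep -B 1 "Downloaded newer image for gitlab/gitlab-ee:nightly" job.log
```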
To run GitLab in a container on your local machine, use a docker command similar to the one shown in the logs. For example:
```shell
docker run --publish 80:80 --name gitlab --net test --hostname localhost gitlab/gitlab-ce:nightly@sha256:<hash>
```
You can now run the test against this Docker instance. E.g.:
```shell
CHROME_HEADLESS=false bundle exec bin/qa Test::Instance::All http://localhost qa/specs/features/browser_ui/1_manage/project/create_project_spec.rb
```
To run CustomersDot E2E tests locally against the staging environment, clone the CustomersDot project, switch to the `qa` directory, and then run:
```shell
STAGING=1 CP_ADMIN_TOKEN=<TOKEN> GL_ADMIN_TOKEN=<TOKEN> bundle exec rspec spec/ui/purchase/purchase_plan_spec.rb
```
Note: The token values can be found in the GitLab-QA Vault. For details on running tests locally with more options, refer to the CustomersDot README.
You can also set `QA_DEBUG=true` to enable logging output, including page actions and Git commands.
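For instance, combining it with the local run shown earlier (purely illustrative, reusing the same example spec):

```shell
# Sketch: run the example spec locally with debug logging enabled.
QA_DEBUG=true CHROME_HEADLESS=false bundle exec bin/qa Test::Instance::All http://localhost:3000 \
  qa/specs/features/browser_ui/1_manage/project/create_project_spec.rb
```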
Sometimes tests may fail due to an outdated Docker image. To check if that's the case, follow the instructions below to see if specific merged code is available in a Docker image.

Checking test code (QA image)
If you suspect that a certain test is failing due to the `gitlab/gitlab-{ce|ee}-qa` image being outdated, follow these steps:

1. Run `docker run -it --entrypoint /bin/sh gitlab/gitlab-ce-qa:latest` to check the GitLab QA CE code, or `docker run -it --entrypoint /bin/sh gitlab/gitlab-ee-qa:latest` to check the GitLab QA EE code
2. Go to the `qa` directory (`cd /home/qa/qa`)
3. Use `cat` to see if the code you're looking for is available in a certain file (e.g., `cat page/project/issue/show.rb`)

Note: if you need to check another tag (e.g., `nightly`), change it in the command in step 1 above.
Checking application code
1. Run `docker run -it --entrypoint /bin/sh gitlab/gitlab-ce:latest` to check the GitLab CE code, or `docker run -it --entrypoint /bin/sh gitlab/gitlab-ee:latest` to check the GitLab EE code
2. Go to the `gitlab-rails` directory (`cd /opt/gitlab/embedded/service/gitlab-rails/`)
3. Use `cat` to see if the code you're looking for is available in a certain file (e.g., `cat public/assets/issues_analytics/components/issues_analytics-9c3887211ed5aa599c9eea63836486d04605f5dfdd76c49f9b05cc24b103f78a.vue`)

Note: if you want to check another tag (e.g., `nightly`), change it in the command in step 1 above.
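If you'd rather not open an interactive shell, a one-off command can check a file directly. This is only a sketch; the tag, file path, and search string are examples.

```shell
# Sketch: check a file inside the image without an interactive shell.
# The tag, file path, and search string are examples only.
docker run --rm --entrypoint /bin/sh gitlab/gitlab-ee-qa:nightly \
  -c "grep -n 'click_close_issue_button' /home/qa/qa/page/project/issue/show.rb"

# The same pattern works for the application image, e.g. files under
# /opt/gitlab/embedded/service/gitlab-rails/ in gitlab/gitlab-ee:nightly.
```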
The aim of this step is to categorize the failure as either a stale test, a bug in the test, a bug in the application code, or a flaky test.
We use the following labels to capture the cause of the failure.
~"failure::investigating"
: Default label to apply at the start of investigation.~"failure::stale-test"
: Stale test due to application change~"failure::broken-test"
: Bug in the test~"failure::flaky-test"
: Flaky test~"failure::test-environment"
: Failure due to test environment~bug
: Bug in the applicationNote: It might take a while for a fix to propagate to all environments. Be aware that a new failure could be related to a recently-merged fix that hasn't made it to the relevant environment yet. Similarly, if a known failure occurs but the test should pass because a fix has been merged, verify that the fix has been deployed to the relevant environment before attempting to troubleshoot further.
The failure was caused by a change in the application code and the test needs to be updated.

- Apply the ~"failure::stale-test" label.

The failure was caused by a bug in the test code itself, not in the application code.

- Apply the ~"failure::broken-test" label.

The failure was caused by a bug in the application code.

- Apply the ~"bug" label, cc-ing the corresponding Engineering Managers (EM), QEM, and SET.
- If the test is quarantined, include `type: :bug` in the `quarantine` tag.

Note: GitLab maintains a daily deployment cadence, so a breaking change in `master` reaches Canary and Production fast. Please communicate broadly to ensure that the corresponding Product Group is aware of the regression and that action is required. If the bug qualifies for dev escalation (for example, a `priority::1`/`severity::1` issue that blocks the deployment process), consider involving the on-call engineers in the `#dev-escalation` channel. To find out who's on call, follow the links in the channel subject line.
To find the appropriate team member to cc, please refer to the Organizational Chart. The Quality Engineering team list and DevOps stage group list might also be helpful.
The failure is due to flakiness in the test itself.

- Apply the ~"failure::flaky-test" label.

Flakiness can be caused by a myriad of problems. Examples of underlying problems that have caused flakiness for us include:
For more details, see the list with example issues in our Testing standards and style guidelines section on Flaky tests.
The failure is due to external factors outside the scope of the test. This could be due to environments, deployment hang-ups, or upstream dependencies.

- Apply the ~"failure::test-environment" label.

A job may fail due to infrastructure or orchestration issues that are not related to any specific test. In some cases these issues fail a job before any tests are executed. Some examples of non-test-related failures include:
If you've found that the test is the cause of the failure (either because the application code was changed or because there's a bug in the test itself), it needs to be fixed as soon as possible, whether by you or by another SET. In any case, follow these steps:
If the test was flaky:
Note The number of passes needed to be sure a test is stable is just a suggestion. You can use your judgement to pick a different threshold.
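One low-tech way to gather those passes is to loop the spec against a local instance. This is only a sketch reusing the example spec from earlier; the 10-run threshold is arbitrary.

```shell
# Sketch: run the example spec repeatedly against a local instance to gauge stability.
for i in $(seq 1 10); do
  echo "Run ${i}"
  CHROME_HEADLESS=false bundle exec bin/qa Test::Instance::All http://localhost:3000 \
    qa/specs/features/browser_ui/1_manage/project/create_project_spec.rb || echo "Run ${i} failed"
done
```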
If the test was in quarantine, remove it from quarantine.
Note We should be very strict about quarantining tests. Quarantining a test is very costly and poses a higher risk because it allows tests to fail without blocking the pipeline, which could mean we miss new failures.
The aim of quarantining a test is not to get back a green pipeline, but rather to reduce the noise (due to constantly failing tests, flaky tests, and so on) so that new failures are not missed. If you're unsure about quarantining a test, ask for help in the `#quality` Slack channel, and then consider adding to the list of examples below to help future pipeline triage DRIs.
Examples of when to quarantine a test:

- The failure is due to a bug in the test (~"failure::broken-test"), and a fix won't be ready for review within 24 hours
- The test is stale (~"failure::stale-test"), and a fix won't be ready for review within 24 hours

Examples of when not to quarantine a test:
~"failure::test-environment"
), and neither the application code nor test code are the cause of the failureNote The time limit for the fix is just a suggestion. You can use your judgement to pick a different threshold.
To quarantine a test:
- Add the `:quarantine` metadata to the test with a link to the issue (see quarantined test types).

  Note: If the example has a `before` hook, the `quarantine` metadata should be assigned to the outer context to avoid running the `before` hook.

- Apply the labels ~"Quality", ~"QA", ~"bug", ~"Pick into auto-deploy".
- Apply the relevant stage and group labels, e.g., ~"devops::create" ~"group::source code".

To be sure that the test is quarantined quickly, ask in the `#quality` Slack channel for someone to review and merge the merge request, rather than assigning it directly.
Here is an example quarantine merge request.
If a test is placed under quarantine, it is important to specify why. By specifying a quarantine type, we can quickly see the reason for the quarantine.
The report accepts custom quarantine types, but we follow the below guidelines for the most commonly recurring reasons.
| Quarantine Type | Requires `:issue`? | Description |
| --- | --- | --- |
| `:flaky` | Yes | This test fails intermittently |
| `:bug` | Yes | This test is failing due to an actual bug in the application |
| `:waiting_on` | Yes | This test is quarantined temporarily due to an issue or MR that is a prerequisite for this test to pass |
| `:investigating` | No | This test is a `:flaky` test but it might be blocking other MRs and so should be quarantined while it's under investigation |
| `:stale` | No | This test is outdated due to a feature change in the application and must be updated to fit the new changes |
Note: Be sure to attach an `issue` to the quarantine metadata if there is a related issue or merge request, for maximum visibility.
```ruby
it 'is flaky', quarantine: 'https://gitlab.com/gitlab-org/gitlab/issues/12345'

it 'is flaky', quarantine: { issue: 'https://gitlab.com/gitlab-org/gitlab/issues/12345', type: :flaky }

it 'is due to a bug', quarantine: {
  issue: 'https://gitlab.com/gitlab-org/gitlab/issues/12345',
  type: :bug
}

it 'is being worked on', quarantine: { type: :investigating }

context 'when these tests rely on another MR', quarantine: {
  type: :waiting_on,
  issue: 'https://gitlab.com/gitlab-org/gitlab/merge_requests/12345'
}
```
Failing to dequarantine tests periodically reduces the effectiveness of the test suite. Hence, tests should be dequarantined on or before the due date mentioned in the corresponding issue.
Before dequarantining a test:

- Set the `RELEASE` variable to the release that has your changes. See Running Gitlab-QA pipeline against a specific GitLab release for instructions on finding the release version created and tagged by the Omnibus pipeline.
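One way to kick off such a pipeline with the `RELEASE` variable set is the GitLab pipeline trigger API. This is a sketch only; the project ID, trigger token, ref, and image tag are placeholders.

```shell
# Sketch: trigger a GitLab-QA pipeline with RELEASE pointing at the release that contains your fix.
# <project-id>, <trigger-token>, <ref>, and the image tag are placeholders.
curl --request POST \
  --form "token=<trigger-token>" \
  --form "ref=<ref>" \
  --form "variables[RELEASE]=dev.gitlab.org:5005/gitlab/omnibus-gitlab/gitlab-ee:<tag>" \
  "https://gitlab.com/api/v4/projects/<project-id>/trigger/pipeline"
```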
To dequarantine a test:

- Open a merge request that removes the `:quarantine` tag.

As with quarantining a test, you can ask in the `#quality` Slack channel for someone to review and merge the merge request, rather than assigning it.
If the due date of a failing test issue is reached, you should re-evaluate if the failing test should really be covered at the end-to-end test level, or if it should be covered in a lower level of the testing levels pyramid.
If you decide to delete the test, open a merge request to delete it and close the test failure issue. In the MR description or comment, mention the stable counterpart SET for the test's stage for their awareness. Then open a new issue to cover the test scenario in a different test level.
If you decide the test is still valuable but don't want to leave it quarantined, you could replace `:quarantine` with `:skip`, which will skip the test entirely (i.e., it won't run, even in jobs for quarantined tests). That can be useful when you know the test will continue to fail for some time (e.g., at least the next milestone or two).
Two videos walking through the triage process were recorded and uploaded to the GitLab Unfiltered YouTube channel.