
Debugging Failing Tests and Test Pipelines

Overview

These guidelines are intended to help you investigate end-to-end test pipeline failures so that they can be properly addressed. This involves analyzing each failure and creating an issue to report it. It might also involve fixing tests, putting them in quarantine, or reporting bugs in the application.

Scheduled QA test pipelines

The test pipelines run on a scheduled basis, and their results are posted to Slack. The following are the QA test pipelines that are monitored every day.

| Environment | Test type | Schedule | Slack channel |
|-------------|-----------|----------|---------------|
| Production | Smoke | Every 2 hours. | #qa-production |
| Canary | Smoke | Every 2 hours, and after each deployment to Canary. The bi-hourly schedule is useful to catch failures introduced by a configuration change. | #qa-production |
| Staging | Smoke | Every 2 hours, and after each deployment to Staging. The bi-hourly schedule is useful to catch failures introduced by a configuration change. | #qa-staging |
| Staging^ | Full, Orchestrated | After each deployment to Staging. | #qa-staging |
| Nightly packages | Full | Daily at 4:00 am UTC. | #qa-nightly |
| GitLab master | Full | When the package-and-qa job executes from a scheduled pipeline every 2 hours. | #qa-master |
| GitLab FOSS master | Full | When the package-and-qa job executes from a scheduled pipeline every 2 hours. | #qa-master |

^ Test pipelines also run against an Omnibus GitLab Docker image that reflects the current release on staging.gitlab.com. These are referred to in notifications as dev.gitlab.org:5005/gitlab/omnibus-gitlab/gitlab-ee:xxx.

For each pipeline there is a notification of success or failure (except for master pipelines, which only report failures). If there's a failure, we use emoji to indicate the state of its investigation:

Triage overview (diagram)

Steps for debugging a QA test pipeline failure

1. Initial analysis

Start with a brief analysis of the failure. The aim of this step is to make a quick decision about how much time you can spend investigating each failure.

In the relevant Slack channel:

  1. Apply the :eyes: emoji to indicate that you're investigating the failure(s).
  2. If there's a system failure (e.g., Docker or runner failure), retry the job and apply the :retry: emoji. Read below for examples of system failures.
  3. Check the QA failures board to see if the failure has already been reported.
  4. If the failure is already reported, add a :fire_engine: emoji. It can be helpful to reply to the failure notification with a link to the issue(s), but this isn't always necessary, especially if the failures are the same as in the previous pipeline and there are links there.

Your priority is to create issues for all new failures, so if there are multiple failures we recommend that you first identify whether each one is old (i.e., there is already an open issue for it) or new. For each new failure, open an issue that includes only the required information. Once you have opened an issue for each new failure, you can investigate each one more thoroughly and act on it appropriately, as described in later sections.

The reason for reporting all new failures first is to allow faster discovery by engineers who may find the test failing in their own merge request test pipeline. If there is no open issue about that failure, the engineer will have to spend time trying to figure out if their changes caused it.

System failures

A job may fail due to infrastructure or orchestration issues that are not related to any specific test. In some cases these issues will fail a job before tests are ever executed. Examples of non-test-related failures are Docker failures and runner failures, as mentioned in the initial analysis steps above.

2. Create an issue

  1. Create an issue for the test or system failure (if retrying the job does not resolve the latter) in https://gitlab.com/gitlab-org/gitlab/issues using the QA failure template. For system failures, it may make sense to open an issue in a different project such as Omnibus GitLab, GitLab QA, or GitLab Runner. Ask in #quality if you're unsure where to file the issue.
  2. In the relevant Slack channel, add the :boom: emoji and reply to the failure notification with a link to the issue.

3. Investigate the failure further

The aim of this step is to understand the failure. The results of the investigation will also let you know what to do about the failure.

The following points can help with your investigation:

Checking Docker images

Sometimes tests may fail due to an outdated Docker image. To check if that's the case, follow the instructions below to see if specific merged code is available in a Docker image.

Checking test code (QA image)

If you suspect that a certain test is failing due to the gitlab/gitlab-{ce|ee}-qa image being outdated, follow these steps:

  1. Locally, run docker run -it --entrypoint /bin/sh gitlab/gitlab-ce-qa:latest to check for GitLab QA CE code, or docker run -it --entrypoint /bin/sh gitlab/gitlab-ee-qa:latest to check for GitLab QA EE code
  2. Then, navigate to the qa directory (cd /home/qa/qa)
  3. Finally, use cat to see if the code you're looking for is available in a certain file (e.g., cat page/project/issue/show.rb)

Note that if you need to check another tag (e.g., nightly), use that tag instead of latest in the command in step 1 above (see the consolidated example below).
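
Putting those steps together, a minimal sketch (assuming the EE QA image with the latest tag and the example file above; swap in the CE image, another tag such as nightly, or the file you actually need to check):

docker run -it --entrypoint /bin/sh gitlab/gitlab-ee-qa:latest
# inside the container:
cd /home/qa/qa
cat page/project/issue/show.rb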

Checking application code

  1. Locally, run docker run -it --entrypoint /bin/sh gitlab/gitlab-ce:latest to check for GitLab CE code, or docker run -it --entrypoint /bin/sh gitlab/gitlab-ee:latest to check for GitLab EE code
  2. Then, navigate to the gitlab-rails directory (cd /opt/gitlab/embedded/service/gitlab-rails/)
  3. Finally, use cat to see if the code you're looking for is available in a certain file (e.g., cat public/assets/issues_analytics/components/issues_analytics-9c3887211ed5aa599c9eea63836486d04605f5dfdd76c49f9b05cc24b103f78a.vue)

Note that if you want to check another tag (e.g., nightly), use that tag instead of latest in the command in step 1 above (see the consolidated example below).
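
As before, a minimal sketch (assuming the EE image with the latest tag; the asset path is just the example from step 3, so substitute the file you actually want to inspect):

docker run -it --entrypoint /bin/sh gitlab/gitlab-ee:latest
# inside the container:
cd /opt/gitlab/embedded/service/gitlab-rails/
cat public/assets/issues_analytics/components/issues_analytics-9c3887211ed5aa599c9eea63836486d04605f5dfdd76c49f9b05cc24b103f78a.vue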

4. Classify and triage the test failure

The aim of this step is to categorize the failure as either a stale test, a bug in the test, a bug in the application code, or a flaky test.

Test is stale due to an application change

The failure was caused by a change in the application code and the test needs to be updated.

See Quarantining Tests

Bug in test code

The failure was caused by a bug in the test code itself, not in the application code.

See Quarantining Tests

Bug in application code

The failure was caused by a bug in the application code.

To find the appropriate team member to cc, please refer to the Organizational Chart. The Quality Engineering team list and DevOps stage group list might also be helpful.

See Quarantining Tests

Flaky Test

The failure is due to flakiness in the test itself.

See Quarantining Tests

Following up on test failures

Fixing the test

If you've found that the test itself is the cause of the failure (either because the application code changed and the test is now stale, or because there's a bug in the test), it will need to be fixed. The fix might be done by another SET or by yourself, but either way it should be done as soon as possible. In any case, the steps to follow are:

If the test was flaky:

Note The number of passes needed to be sure a test is stable is just a suggestion. You can use your judgement to pick a different threshold.

If the test was in quarantine, remove it from quarantine.

Quarantining Tests

Note We should be very strict about quarantining tests. Quarantining a test is very costly and poses a higher risk because it allows tests to fail without blocking the pipeline, which could mean we miss new failures. The aim of quarantining the tests is not to get back a green pipeline, but rather to reduce the noise (due to constantly failing tests, flaky tests, etc.) so that new failures are not missed.

Following are the steps to quarantine a test:

Note If the example has a before hook, the quarantine metadata should be assigned to the outer context to avoid running the before hook.

To be sure that the test is quarantined quickly, ask in the #quality Slack channel for someone to review and merge the merge request, rather than assigning it directly.

Here is an example quarantine merge request.

Quarantined test types

If a test is placed under quarantine, it is important to specify why. By specifying a quarantine type, we can quickly see the reason for the quarantine.

The report accepts custom quarantine types, but we follow the guidelines below for the most commonly recurring reasons.

| Quarantine type | Requires :issue? | Description |
|-----------------|------------------|-------------|
| :flaky | Yes | This test fails intermittently. |
| :bug | Yes | This test is failing due to an actual bug in the application. |
| :waiting_on | Yes | This test is quarantined temporarily due to an issue or MR that is a prerequisite for this test to pass. |
| :new | No | This test was newly introduced to the E2E suite and should be promoted to a standard test by de-quarantining it once it has proven to pass several times. |
| :investigating | No | This test is a :flaky test, but it might be blocking other MRs, so it should be quarantined while it's under investigation. |
| :stale | No | This test is outdated due to a feature change in the application and must be updated to fit the new changes. |

Note Be sure to attach an issue to the quarantine metadata if there is a related issue or merge request, for maximum visibility.

Examples
it 'is flaky', quarantine: 'https://gitlab.com/gitlab-org/gitlab/issues/12345'
it 'is flaky', quarantine: { issue: 'https://gitlab.com/gitlab-org/gitlab/issues/12345', type: :flaky }
it 'is due to a bug', quarantine: {
                        issue: 'https://gitlab.com/gitlab-org/gitlab/issues/12345',
                        type: :bug
                      }
it 'is being worked on', quarantine: { type: :investigating }
it 'is a new test', quarantine: { type: :new }
context 'when these tests rely on another MR', quarantine: {
                                                 type: :waiting_on,
                                                 issue: 'https://gitlab.com/gitlab-org/gitlab/merge_requests/12345'
                                               }

Dequarantining Tests

Leaving tests in quarantine indefinitely reduces the effectiveness of the test suite. Hence, tests should be dequarantined on or before the due date mentioned in the corresponding issue.

Before dequarantining a test:

To dequarantine a test:

As with quarantining a test, you can ask in the #quality Slack channel for someone to review and merge the merge request, rather than assigning it.

Re-evaluating tests

If the due date of a failing test issue is reached, you should re-evaluate if the failing test should really be covered at the end-to-end test level, or if it should be covered in a lower level of the testing levels pyramid.

If you decide to delete the test, open a merge request to delete it and close the test failure issue. In the MR description or comment, mention the stable counterpart SET for the test's stage for their awareness. Then open a new issue to cover the test scenario in a different test level.

If you decide the test is still valuable but don't want to leave it quarantined, you could replace :quarantine with :skip, which will skip the test entirely (i.e., it won't run even in jobs for quarantined tests). That can be useful when you know the test will continue to fail for some time (e.g., at least the next milestone or two).

Training Videos

Two videos walking through the triage process were recorded and uploaded to the GitLab Unfiltered YouTube channel.