
Debugging Failing Tests

Overview

These guidelines are intended to help you to investigate failures in the end-to-end tests so that they can be properly addressed.

This will involve analysing each failure and creating an issue to report it.

It might also involve fixing tests, putting them in quarantine, or reporting bugs in the application.

The tests run on a scheduled basis, and their results are posted to Slack.

Scheduled QA test pipelines

The following are the QA test pipelines that are monitored every day.

| Environment | Test type | Schedule | Slack channel |
| --- | --- | --- | --- |
| Production | Smoke | Every 2 hours. | #qa-production |
| Canary | Smoke | Every 2 hours, and after each deployment to Canary. The bi-hourly schedule is useful to catch failures introduced by a configuration change. | #qa-production |
| Staging | Smoke | Every 2 hours, and after each deployment to Staging. The bi-hourly schedule is useful to catch failures introduced by a configuration change. | #qa-staging |
| Staging | Full, Orchestrated | After each deployment to Staging. | #qa-staging |
| Nightly packages | Full | Daily at 4:00 am UTC. | #qa-nightly |
| GitLab master | Full | When the package-and-qa job executes from a scheduled pipeline every 2 hours. | #qa-master |
| GitLab FOSS master | Full | When the package-and-qa job executes from a scheduled pipeline every 2 hours. | #qa-master |

For each pipeline there is a notification of success or failure (except for master pipelines, which only report failures). If there's a failure, we use emoji to indicate the state of investigation of the failure:

Time to triage

Steps for debugging QA pipeline test failures

1. Initial analysis

Start with a brief analysis of the failures. The aim of this step is to make a quick decision about how much time you can spend investigating each failure.

In the relevant Slack channel:

  1. Apply the :eyes: emoji to indicate that you're investigating the failure(s).
  2. If there's a system failure (e.g., Docker or runner failure), retry the job and apply the :retry: emoji.
  3. Check the QA failures board to see whether the failure has already been reported.
  4. If the failure is already reported, add a :fire_engine: emoji. It can be helpful if you reply to the failure notification with a link to the issue(s), but this isn't always necessary, especially if the failures are the same as in the previous pipeline and there are links there.

Your priority is to report all new failures, so if there are many failures we recommend that you identify whether each failure is old (i.e., there is an issue open for it), or new. For each new failure, open an issue that includes only the required information. Once you have opened an issue for each new failure you can investigate each more thoroughly and act on them appropriately, as described in later sections.

The reason for reporting all new failures first is that engineers may find the test failing in their own merge request, and if there is no open issue about that failure they will have to spend time trying to figure out if their changes caused it.

2. Create an issue

  1. Create an issue in https://gitlab.com/gitlab-org/gitlab/issues using the QA failure template.
  2. In the relevant Slack channel, add the :boom: emoji and reply to the failure notification with a link to the issue.

3. Investigate the failure further

The aim of this step is to understand the failure. The results of the investigation will also let you know what to do about the failure.

The following points can help with your investigation:

Checking Docker image

Sometimes tests fail because of an outdated Docker image. To check if that's the case, follow the instructions below to see whether recently merged code is present in the Docker image.

Checking test code

If you suspect that a certain test is failing because the gitlab/gitlab-{ce|ee}-qa image is outdated, follow these steps:

  1. Locally, run docker run -it --entrypoint /bin/sh gitlab/gitlab-ce-qa:latest to check the GitLab QA CE code, or docker run -it --entrypoint /bin/sh gitlab/gitlab-ee-qa:latest to check the GitLab QA EE code.
  2. Then, navigate to the qa directory (cd /home/qa/qa).
  3. Finally, use cat to see whether the code you're looking for is present in a given file (e.g., cat page/project/issue/show.rb).

Note: if you need to check another tag (e.g., nightly), change the tag in the relevant command in step 1 above.

Checking application code
  1. Locally, run docker run -it --entrypoint /bin/sh gitlab/gitlab-ce:latest to check the GitLab CE code, or docker run -it --entrypoint /bin/sh gitlab/gitlab-ee:latest to check the GitLab EE code.
  2. Then, navigate to the gitlab-rails directory (cd /opt/gitlab/embedded/service/gitlab-rails/).
  3. Finally, use cat to see whether the code you're looking for is present in a given file (e.g., cat public/assets/issues_analytics/components/issues_analytics-9c3887211ed5aa599c9eea63836486d04605f5dfdd76c49f9b05cc24b103f78a.vue).

Note: if you want to check another tag (e.g., nightly), change the tag in the relevant command in step 1 above.

4. Classify and triage the failure

The aim of this step is to categorise the failure as either a broken test, a bug in the application code, or a flaky test.

Test is broken

In this case, you've found that the failure was caused by some change in the application code and the test needs to be updated. You should:

Bug in code

In this case, you've found that the failure was caused by a bug in the application code:

To find the appropriate team member to cc, please refer to the Organizational Chart. The Quality Engineering team list and DevOps stage group list might also be helpful.

Example

it 'is quarantined', quarantine: { issue: 'https://gitlab.com/gitlab-org/gitlab/issues/<issue_id>', type: :bug }
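
For context, here is a hedged sketch of how that metadata typically sits on a complete example in a QA spec file; the module, describe block, test name, and body below are hypothetical placeholders, and only the quarantine metadata comes from the example above:

module QA
  RSpec.describe 'Plan' do
    it 'creates an issue', quarantine: {
      issue: 'https://gitlab.com/gitlab-org/gitlab/issues/<issue_id>',
      type: :bug
    } do
      # The original test body stays unchanged; only the metadata is added.
    end
  end
end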
Flaky Test

In this case, you've found that the failure is due to flakiness in the test itself:

Example

it 'is quarantined', quarantine: 'https://gitlab.com/gitlab-org/gitlab/issues/<issue_id>'
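
Because RSpec metadata set on a describe or context block is inherited by every example inside it, the same tag can quarantine an entire group of flaky tests at once. A minimal sketch, with hypothetical group and test names:

RSpec.describe 'Manage', quarantine: 'https://gitlab.com/gitlab-org/gitlab/issues/<issue_id>' do
  it 'creates a project' do
    # Quarantined via the metadata inherited from the describe block.
  end

  it 'forks a project' do
    # Also quarantined, for the same reason.
  end
end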

Following up on failures

Fixing the test

If you've found that the test itself is the cause of the failure (either because the application code was changed or because there's a bug in the test), it needs to be fixed. The fix might be done by you or by another SET, but in either case it should be made as soon as possible. The steps to follow are:

If the test was flaky:

Note: the number of passes needed to be sure a test is stable is just a suggestion. You can use your judgement to pick a different threshold.

If the test was in quarantine, remove it from quarantine.

Quarantining Tests

We should be very strict about quarantining tests. Quarantining a test is very costly and poses a higher risk because it allows tests to fail without blocking the pipeline, which could mean we miss new failures. The aim of quarantining the tests is not to get back a green pipeline, but rather to reduce the noise (due to constantly failing tests, flaky tests, etc.) so that new failures are not missed. Hence, a test should be quarantined only under the following circumstances:

Following are the steps to quarantine a test:

To be sure that the test is quarantined quickly, ask in the #quality Slack channel for someone to review and merge the merge request, rather than assigning it directly.

Here is an example quarantine merge request.

Dequarantining Tests

Failing to dequarantine tests periodically reduces the effectiveness of the test suite. Hence, tests should be dequarantined on or before the due date mentioned in the corresponding issue.

Before dequarantining a test:

To dequarantine a test:

As with quarantining a test, you can ask in the #quality Slack channel for someone to review and merge the merge request, rather than assigning it.
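
In practice, the dequarantine change itself is usually just the removal of the quarantine metadata, as in this sketch with a hypothetical test:

# Before: the example is tagged as quarantined.
it 'creates an issue', quarantine: 'https://gitlab.com/gitlab-org/gitlab/issues/<issue_id>'

# After: the metadata is removed and the test runs in the regular jobs again.
it 'creates an issue'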

Re-evaluating tests

If the due date of a failing test issue is reached, you should re-evaluate if the failing test should really be covered at the end-to-end test level, or if it should be covered in a lower level of the testing levels pyramid.

If you decide to delete the test, open a merge request to delete it and close the test failure issue. In the MR description or comment, mention the stable counterpart TAE for the test's stage for their awareness. Then open a new issue to cover the test scenario in a different test level.

If you decide the test is still valuable but don't want to leave it quarantined, you could replace :quarantine with :skip, which will skip the test entirely (i.e., it won't run even in jobs for quarantined tests). That can be useful when you know the test will continue to fail for some time (e.g., at least the next milestone or two).
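
For illustration, a sketch of that replacement using the quarantine example from above (skip is standard RSpec metadata, so the example is skipped in every job, including the jobs that run quarantined tests):

# Before: quarantined, so it still runs in the quarantine jobs.
it 'is quarantined', quarantine: 'https://gitlab.com/gitlab-org/gitlab/issues/<issue_id>'

# After: skipped entirely until the issue is resolved.
it 'is skipped', skip: 'https://gitlab.com/gitlab-org/gitlab/issues/<issue_id>'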

Training Videos

Two videos walking through the triage process were recorded and uploaded to the GitLab Unfiltered YouTube channel.