
Debugging Failing Tests

Overview

These guidelines are intended to help you investigate failures in the end-to-end tests so that they can be properly addressed.

This involves analyzing each failure and creating an issue to report it.

It might also involve putting tests in quarantine, fixing tests, or reporting bugs in the application.

The tests run on a scheduled basis, and the results can be seen in the relevant pipelines:

Scheduled QA Test Pipelines

The following are the QA test pipelines that are monitored every day.

| Environment | Tests type | Schedule | Slack channel |
|-------------|------------|----------|---------------|
| Production | Smoke | Every 2 hours. | #qa-production |
| Canary | Smoke | Every 2 hours, and after each deployment to Canary. The bi-hourly schedule is useful to catch failures introduced by a configuration change. | #qa-production |
| Staging | Smoke | Every 2 hours, and after each deployment to Staging. The bi-hourly schedule is useful to catch failures introduced by a configuration change. | #qa-staging |
| Staging | Full, Orchestrated | After each deployment to Staging. | #qa-staging |
| GitLab master | Full | When the schedule:package-and-qa job executes from a scheduled pipeline every 2 hours. | #development |
| GitLab FOSS master | Full | When the schedule:package-and-qa job executes from a scheduled pipeline every 2 hours. | #development |
| Nightly packages | Full | Daily at 4:00 am UTC. | #qa-nightly |

We also use the #qa-*environment* Slack channels to quickly see the current status of the tests, like we do with failures on master. For each pipeline there is a notification of success or failure. If there's a failure, we use emoji to indicate the state of the investigation (see the triage steps below for the emoji conventions).

The failure reports in #development are a little different from those in the #qa-*environment* channels. Those in #development include a link to the upstream failed jobs, but to see which tests failed we need to check the downstream gitlab-qa pipeline and all its jobs. To get there from the notification in #development, click the commit link and then scroll to the bottom of that page to see the comments with links to the downstream pipelines.

To tell the difference between GitLab and GitLab FOSS notifications in the #development channel, look for GitLab or GitLab FOSS in the bottom left corner near the date.

Note: Notifications in some channels, like #qa-production and #qa-staging, provide a link to an HTML report that can be filtered by 'passing', 'failed', and 'pending'. These reports are a quick way to look for failures.

Time to triage

Steps for debugging QA pipeline test failures

1. Initial analysis

Start with a brief analysis of the failures. The aim of this step is to make a quick decision about how much time to spend investigating each failure.

In the relevant Slack channel:

  1. Apply the :eyes: emoji to indicate that you're investigating the failure(s).
  2. If there's a system failure (e.g., Docker or runner failure), retry the job and apply the :retry: emoji.
  3. Check the QA failures board to see whether the failure has already been reported.
  4. If the failure is already reported, add a :fire_engine: emoji. Replying to the failure notification with a link to the relevant issue(s) can be helpful, but isn't always necessary, especially if the failures are the same as in the previous pipeline and links are already there.

Your priority is to report all new failures. If there are many failures, first identify whether each one is old (i.e., there is already an open issue for it) or new. For each new failure, open an issue that includes only the required information. Once you have opened an issue for each new failure, you can investigate each one more thoroughly and act on it appropriately, as described in later sections.

The reason for reporting all new failures first is that engineers may find the test failing in their own merge request, and if there is no open issue about that failure they will have to spend time figuring out whether their changes caused it.

2. Create an issue

The issue should have the following:

The issue description can also include a brief note on what you think is causing the failure.

3. Investigate the failure further

The aim of this step is to understand the failure. The results of the investigation will also let you know what to do about the failure.

The following points can help with your investigation:

Checking Docker image

Sometimes tests fail because of an outdated Docker image. To check if that's the case, follow the instructions below to see whether specific merged code is present in a Docker image.

Checking test code

If you suspect that a certain test is failing because the gitlab/gitlab-{ce|ee}-qa image is outdated, follow these steps:

  1. Locally, run docker run -it --entrypoint /bin/sh gitlab/gitlab-ce-qa:latest to check the GitLab QA CE code, or docker run -it --entrypoint /bin/sh gitlab/gitlab-ee-qa:latest to check the GitLab QA EE code.
  2. Then, navigate to the qa directory (cd /home/qa/qa).
  3. Finally, use cat to see whether the code you're looking for is present in a certain file (e.g., cat page/project/issue/show.rb).

Note: if you need to check another tag (e.g., nightly), change it in one of the commands in step 1 above.

Checking application code
  1. Locally, run docker run -it --entrypoint /bin/sh gitlab/gitlab-ce:latest to check the GitLab CE code, or docker run -it --entrypoint /bin/sh gitlab/gitlab-ee:latest to check the GitLab EE code.
  2. Then, navigate to the gitlab-rails directory (cd /opt/gitlab/embedded/service/gitlab-rails/).
  3. Finally, use cat to see whether the code you're looking for is present in a certain file (e.g., cat public/assets/issues_analytics/components/issues_analytics-9c3887211ed5aa599c9eea63836486d04605f5dfdd76c49f9b05cc24b103f78a.vue).

Note: if you want to check another tag (e.g., nightly), change it in one of the commands in step 1 above.

4. Classify and triage the failure

The aim of this step is to categorize the failure as either a broken test, a bug in the application code, or a flaky test.

Test is broken

In this case, you've found that the failure was caused by some change in the application code and the test needs to be updated. You should:

Bug in code

In this case, you've found that the failure was caused by a bug in the application code. You should:

To find the appropriate team member to cc, please refer to the Organizational Chart. The Quality Engineering team list and DevOps stage group list might also be helpful.

Flaky Test

In this case, you've found that the failure is due to flakiness in the test itself. You should:

Following up on failures

Fixing the test

If you've found that the test itself is the cause of the failure (either because the application code changed or because there's a bug in the test), it needs to be fixed as soon as possible, whether by you or by another TAE. In either case, the steps to follow are:

If the test was flaky:

Note: the number of passes needed to be sure a test is stable is just a suggestion. You can use your judgement to pick a different threshold.

If the test was in quarantine, remove it from quarantine as described below.

Quarantining Tests

We should be very strict about quarantining tests. Quarantining a test is very costly and poses a higher risk: it allows the test to fail without blocking the pipeline, which could mean we miss new failures. The aim of quarantining a test is not to get back a green pipeline, but rather to reduce the noise (from constantly failing tests, flaky tests, etc.) so that new failures are not missed. Hence, a test should be quarantined only under the following circumstances:

Following are the steps to quarantine a test:

To be sure that the test is quarantined quickly, ask in the #quality Slack channel for someone to review and merge the merge request, rather than assigning it directly.

Here is an example quarantine merge request.
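
For illustration, a quarantined test is tagged with the :quarantine RSpec metadata so that it is excluded from the regular jobs and only runs in the jobs for quarantined tests. The following is a minimal, self-contained sketch rather than an actual GitLab QA spec: the spec names, the placeholder assertion, and the explicit filter configuration (which the GitLab QA framework normally handles for you) are assumptions added for illustration.

```ruby
# frozen_string_literal: true
# Minimal illustrative sketch of quarantining a test with :quarantine
# RSpec metadata. Not an actual GitLab QA spec; names, the assertion,
# and the issue reference are placeholders.
require 'rspec'

RSpec.configure do |config|
  # The GitLab QA framework filters quarantined tests for you; this
  # explicit filter is only here to make the sketch self-contained.
  config.filter_run_excluding :quarantine unless ENV['RUN_QUARANTINED']
end

RSpec.describe 'Merge request creation' do
  # Quarantined because of <link to the failure issue>.
  it 'creates a basic merge request', :quarantine do
    expect(1 + 1).to eq(2) # placeholder assertion
  end
end
```

Running a sketch like this with rspec skips the quarantined example in regular runs, mirroring how quarantined tests are excluded from the main jobs and only executed in the dedicated quarantine jobs.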

Dequarantining Tests

Failing to dequarantine tests periodically reduces the effectiveness of the test suite. Hence, tests should be dequarantined on or before the due date mentioned in the corresponding issue.

To dequarantine a test:

As with quarantining a test, you can ask in the #quality Slack channel for someone to review and merge the merge request, rather than assigning it.

Re-evaluating tests

If the due date of a failing test issue is reached, you should re-evaluate whether the failing test should really be covered at the end-to-end test level, or whether it should instead be covered at a lower level of the testing levels pyramid.

If you decide to delete the test, open a merge request to delete it and close the test failure issue. In the MR description or comment, mention the stable counterpart TAE for the test's stage for their awareness. Then open a new issue to cover the test scenario in a different test level.

If you decide the test is still valuable but don't want to leave it quarantined, you could replace :quarantine with :skip, which will skip the test entirely (i.e., it won't run even in jobs for quarantined tests). That can be useful when you know the test will continue to fail for some time (e.g., at least the next milestone or two).
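
Under the same assumptions as the sketch above (illustrative names and a placeholder assertion), replacing :quarantine with :skip could look like the following; RSpec's built-in skip metadata reports the example as skipped and never executes it, in any job.

```ruby
# frozen_string_literal: true
# Continuation of the earlier sketch: the same example with the
# :quarantine tag replaced by :skip, so it no longer runs anywhere.
require 'rspec'

RSpec.describe 'Merge request creation' do
  # RSpec also accepts skip: 'reason' if you want to record why,
  # e.g. a link to the failure issue.
  it 'creates a basic merge request', :skip do
    expect(1 + 1).to eq(2) # placeholder assertion; never executed
  end
end
```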

Training Videos

Two videos walking through the triage process were recorded and uploaded to the GitLab Unfiltered YouTube channel.