Debugging Failing tests


These guidelines help you investigate failures in the end-to-end tests so that they can be properly addressed.

This will involve analysing each failure and creating an issue to report it.

It might also involve quarantining tests, fixing tests, or reporting bugs in the application.

The tests run on a schedule, and their results can be seen in the relevant pipelines:

Scheduled QA Test Pipelines

The following are the two runs that are monitored every day.

We also use the #qa-nightly and #qa-staging Slack channels to quickly see the current status of the tests, like we do with failures on master. For each pipeline there is a notification of success or failure. If there's a failure, we use emoji to indicate the state of investigation of the failure:

Time to triage

Steps for Debugging QA Pipeline Test Failures

1. Initial Analysis

Start with a brief analysis of the failures. The aim of this step is to make a quick decision about how much time you can spend investigating each failure.

In the relevant Slack channel:

Your priority is to report all new failures. If there are many failures, we recommend that you first identify whether each one is old (i.e., there is an issue open for it) or new. For each new failure, open an issue that includes only the required information. Once you have opened an issue for each new failure, you can investigate each one more thoroughly and act on it appropriately, as described in later sections.

The reason for reporting all new failures first is that engineers may find the test failing in their own merge request, and if there is no open issue about that failure they will have to spend time trying to figure out if their changes caused it.
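The "new first" ordering described above can be sketched in a few lines of Python. This is only an illustration: the test names are made up, and how you obtain the set of tests that already have an open issue (searching the issue tracker, for example) is an assumption that this page does not specify.

```python
from __future__ import annotations

from typing import Iterable


def split_failures(
    failed_tests: Iterable[str],
    tests_with_open_issue: set[str],
) -> tuple[list[str], list[str]]:
    """Partition failing tests into (new, known), preserving input order.

    A failure is "known" (old) when an issue is already open for it.
    """
    new: list[str] = []
    known: list[str] = []
    for name in failed_tests:
        if name in tests_with_open_issue:
            known.append(name)
        else:
            new.append(name)
    return new, known


# Hypothetical data for illustration only.
failed = ["login_spec", "create_project_spec", "merge_request_spec"]
open_issues = {"create_project_spec"}

new, known = split_failures(failed, open_issues)
print("Open an issue first for:", new)
print("Already tracked:", known)
```

Handling the `new` list before the `known` list mirrors the priority above: every new failure gets an issue before any failure gets a deep investigation.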

2. Create an issue

The issue should have the following:

The issue description can include a brief explanation of what you think caused the failure.

3. Investigate the failure further

The aim of this step is to understand the failure. The results of the investigation will also let you know what to do about the failure.

The following points can help with your investigation:

4. Classify and triage the failure

The aim of this step is to categorise the failure as either a broken test, a bug in the application code, or a flaky test.

Test is broken

In this case, you've found that the failure was caused by some change in the application code and the test needs to be updated. You should:

Bug in code

In this case, you've found that the failure was caused by a bug in the application code. You should:

To find the appropriate team member to cc, please refer to the Organizational Chart. The Quality Engineering team list and DevOps stage group list might also be helpful.

Flaky Test

In this case, you've found that the failure is due to flakiness in the test itself. You should:

Following up on failures

Fixing the test

If you've found that the test is the cause of the failure (either because the application code was changed or there's a bug in the test itself), it will need to be fixed. The fix might be made by another TAE or by you, but either way it should be made as soon as possible. The steps are as follows:

If the test was flaky:

Note: the number of passes needed to be sure a test is stable is just a suggestion. You can use your judgement to pick a different threshold.
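As a rough sketch of what a pass threshold means in practice, the check below treats a test as stable only when its most recent runs all passed. The function name, the result format, and the default of 5 are hypothetical and chosen only for illustration; pick the threshold using your own judgement, as the note says.

```python
from __future__ import annotations


def is_stable(results: list[bool], required_passes: int = 5) -> bool:
    """Return True if the last `required_passes` runs all passed.

    `results` is ordered oldest to newest; True means the run passed.
    The default threshold is an illustrative assumption, not a rule.
    """
    if required_passes <= 0:
        raise ValueError("required_passes must be positive")
    recent = results[-required_passes:]
    # Too few runs is not enough evidence to call the test stable.
    return len(recent) == required_passes and all(recent)


# Hypothetical run history: one early failure, then four passes.
history = [False, True, True, True, True]
print(is_stable(history, required_passes=4))
print(is_stable(history, required_passes=5))
```

Counting only the most recent consecutive passes means a single new failure resets the evidence, which is the conservative choice for a test that was previously flaky.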

If the test was in quarantine, remove it from quarantine as described below.

Quarantining Tests

We should be very strict about quarantining tests. Quarantining a test is very costly and poses a higher risk because it allows tests to fail without blocking the pipeline, which could mean we miss new failures. The aim of quarantining the tests is not to get back a green pipeline, but rather to reduce the noise (due to constantly failing tests, flaky tests, etc.) so that new failures are not missed. Hence, a test should be quarantined only under the following circumstances:

To quarantine a test:

To be sure that the test is quarantined quickly, ask in the #quality Slack channel for someone to review and merge the merge request, rather than assigning it directly.

Here is an example quarantine merge request.

Dequarantining Tests

Failing to dequarantine tests periodically reduces the effectiveness of the test suite. Hence, tests should be dequarantined on or before the due date mentioned in the corresponding issue.
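One way to keep track of those due dates is a small check like the following. It is a minimal sketch: the mapping of test name to due date is hypothetical data that you would have to collect from the quarantine issues yourself, and nothing here reflects an existing tool.

```python
from __future__ import annotations

from datetime import date


def overdue_quarantines(due_dates: dict[str, date], today: date) -> list[str]:
    """Return the quarantined tests whose due date has already passed.

    A test due today is not overdue yet, because dequarantining
    "on or before" the due date is still on time.
    """
    return sorted(name for name, due in due_dates.items() if due < today)


# Hypothetical quarantined tests and due dates, for illustration only.
quarantined = {
    "create_project_spec": date(2019, 3, 1),
    "merge_request_spec": date(2019, 3, 15),
}
print(overdue_quarantines(quarantined, today=date(2019, 3, 15)))
```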

To dequarantine a test:

As with quarantining a test, you can ask in the #quality Slack channel for someone to review and merge the merge request, rather than assigning it.

Training Videos

Two videos walking through the triage process were recorded and uploaded to the GitLab Unfiltered YouTube channel.