We also use the #qa-nightly and #qa-staging Slack channels to quickly see the current status of the tests, like we do with failures on master. For each pipeline there is a notification of success or failure. If there's a failure, we use emoji to indicate the state of investigation of the failure:
The :eyes: emoji, to show you're investigating a failing pipeline.
The :boom: emoji, when there's a new failure.
The :fire_engine: emoji, when a failure is already reported.
The :retry: emoji, when there's a system failure (e.g., Docker or runner failure).
Time to triage
The DRI should decide within the first 20 minutes of analysis whether the failure can be fixed or has to be quarantined.
In either case, keep the counterpart TAE informed about the issue.
If the DRI finds that the issue can be fixed, they should spend no more than 2 hours fixing the failure. Any test failure whose fix takes more than 2 hours should be quarantined.
Start with a brief analysis of the failures. The aim of this step is to make a quick decision about how much time you can spend investigating each failure.
In the relevant Slack channel:
Apply the :eyes: emoji to indicate that you're investigating the failure(s).
If a failure is already reported, add a :fire_engine: emoji. (It can be helpful if you reply to the failure notification with a link to the issue(s), but this isn't always necessary, especially if the failures are the same as in the previous pipeline and there are links there.)
If there's a system failure (e.g., Docker or runner failure), retry the job and apply the :retry: emoji.
Your priority is to report all new failures, so if there are many failures we recommend that you identify whether each failure is old (i.e., there is an issue open for it), or new. For each new failure, open an issue that includes only the required information. Once you have opened an issue for each new failure you can investigate each more thoroughly and act on them appropriately, as described in later sections.
The reason for reporting all new failures first is that engineers may find the test failing in their own merge request, and if there is no open issue about that failure they will have to spend time trying to figure out if their changes caused it.
Use the environment variable QA_DEBUG=true to enable logging output including page actions and Git commands.
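For example, a minimal sketch of enabling it for your shell session before running the tests (this only sets the variable; nothing here is specific to a particular spec):

```shell
# Enable verbose QA logging (page actions, Git commands) for every run in this shell
export QA_DEBUG=true
```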
You can also use the same docker image (same sha256 hash) as the one used in the failing job to run GitLab in a container locally. In the logs of the failing job, search for Downloaded newer image for gitlab/gitlab-ce:nightly or Downloaded newer image for gitlab/gitlab-ee:nightly and use the sha256 hash just above that line. To run GitLab in a container locally, use a docker command similar to the one shown in the logs. E.g.:
docker run --publish 80:80 --name gitlab --net test --hostname localhost gitlab/gitlab-ce:nightly@sha256:<hash>
You can now run the test against this docker instance. E.g.:
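(The exact invocation depends on your local setup. The sketch below assumes the bin/qa runner from the qa directory of the GitLab repository and uses a placeholder spec path.)

```shell
# Run a single spec against the container started above (spec path is a placeholder)
cd qa
QA_DEBUG=true bundle exec bin/qa Test::Instance::All http://localhost -- qa/specs/features/browser_ui/1_manage/login/log_in_spec.rb
```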
Additional information about running tests locally can be found in the QA readme.
Determine if the test is flaky: check the logs or run the test a few times. If it passes at least once but fails otherwise, it's flaky.
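To gauge flakiness locally, one rough approach (reusing the assumed bin/qa invocation and placeholder spec path from above) is to run the spec in a loop and watch for intermittent failures:

```shell
# Run the same spec 5 times; a mix of passes and failures suggests the test is flaky
for i in 1 2 3 4 5; do
  bundle exec bin/qa Test::Instance::All http://localhost -- qa/specs/features/browser_ui/1_manage/login/log_in_spec.rb || echo "Run $i failed"
done
```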
4. Classify and triage the failure
The aim of this step is to categorise the failure as either a broken test, a bug in the application code, or a flaky test.
Test is broken
In this case, you've found that the failure was caused by some change in the application code and the test needs to be updated. You should:
Include your findings in a note in the issue about the failure.
If possible, mention the merge request which caused the test to break, to keep the corresponding engineer informed.
Bug in code
In this case, you've found that the failure was caused by a bug in the application code. You should:
Include your findings in a note in the issue about the failure.
If there is an issue open already for the bug, mention the test failure in the issue with all the details, cc-ing the corresponding Test Automation Engineer (TAE) and Quality Engineering Managers.
If there is no issue open for the bug, create an issue mentioning the failure details, cc-ing the corresponding Engineering Managers, Quality Engineering Managers, and TAE.
Communicate the issue in the corresponding Slack channels as well.
Do not quarantine the test immediately unless the bug won't be fixed quickly (e.g., it might be a minor/superficial bug). Instead, leave a comment in the issue for the bug asking if the bug can be fixed in the current release. If it can't, quarantine the test.
If you've found that the test itself is the cause of the failure (either because the application code was changed or because there's a bug in the test), it needs to be fixed. The fix might be done by another TAE or by yourself, but it should happen as soon as possible. In either case, follow these steps:
Create a merge request (not an issue) with the fix for the test failure.
Apply the ~"Pick into auto-deploy" label.
If the test was flaky:
Confirm that the test is stable by checking that it passes at least 5 times.
Note: the number of passes needed to be sure a test is stable is just a suggestion. You can use your judgement to pick a different threshold.
We should be very strict about quarantining tests. Quarantining a test is very costly and poses a higher risk because it allows tests to fail without blocking the pipeline, which could mean we miss new failures. The aim of quarantining the tests is not to get back a green pipeline, but rather to reduce the noise (due to constantly failing tests, flaky tests, etc.) so that new failures are not missed. Hence, a test should be quarantined only under the following circumstances:
There is a bug in the application code or in the test that won't be fixed in the current release.
The test is flaky or failing for an unknown reason and requires further investigation.
Following are the steps to quarantine a test:
Open a merge request.
Assign the :quarantine metadata to the test and add a link to the issue.
If the example has a before hook, assign the :quarantine metadata to the outer context so that the before hook doesn't run while the test is quarantined (see the sketch after this list).
The merge request should have the following labels:
~"Pick into auto-deploy"
The merge request can have the following labels:
a DevOps stage label ( ~"devops::create", ~"devops::manage", etc.)
~"Quality:flaky-tests" if you know for sure the failure is due to flakiness
The merge request should have the current milestone
To be sure that the test is quarantined quickly, ask in the #quality Slack channel for someone to review and merge the merge request, rather than assigning it directly.
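For illustration, a rough sketch of the :quarantine metadata placement described above (the spec content and issue link are hypothetical, and it's the suite's own configuration that actually skips quarantined tests):

```ruby
# Hypothetical spec: the before hook lives in the outer context, so the
# :quarantine metadata goes on that context rather than on the individual example.
RSpec.describe 'Login', :quarantine do # quarantined due to <link to failure issue>
  before do
    # setup that should not run while the test is quarantined
  end

  it 'logs in successfully' do
    # ...
  end
end
```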