These guidelines are intended to help you to investigate end-to-end test pipeline failures so that they can be properly addressed. This will involve analyzing each failure and creating an issue to report it. It might also involve fixing tests, putting them in quarantine, or reporting bugs in the application.
The Pipeline triage DRI is responsible for analyzing and debugging test pipeline failures. Please refer to the Quality Department pipeline triage rotation schedule to know who the current DRI is.
Failing tests on master should be fixed before other development work: failing tests on master are treated as the highest priority relative to other development work, e.g., new features. Note that for pipeline triage DRIs, triage and reporting take priority over fixing tests.

The test pipelines run on a scheduled basis, and their results are posted to Slack. The following are the QA test pipelines that are monitored every day.
For each pipeline there is a notification of success or failure (except for master pipelines, which only report failures).
If there's a failure, we use emoji to indicate the state of its investigation:
The general triage steps are:
After triaging failed tests, possible follow up actions are:
Your priority is to make sure we have an issue for each failure, and to communicate the status of its investigation and resolution. When there are multiple failures to report, consider their impact when deciding which to report first. See the pipeline triage responsibilities for further guidance.
If there are multiple failures we recommend that you identify whether each one is new or old (and therefore already has an issue open for it). For each new failure, open an issue that includes only the required information. Once you have opened an issue for each new failure you can investigate each more thoroughly and act on them appropriately, as described in later sections.
The reason for reporting all new failures first is to allow faster discovery by engineers who might find the test failing in their own merge request test pipeline. If there is no open issue about that failure, the engineer will have to spend time trying to figure out if their changes caused it.
Known failures should be linked to the current pipeline triage report. However, issues can be opened by anyone and are not linked automatically, so be sure to confirm there is no existing issue before creating one.
Apply the relevant failure::* label. By order of likelihood:
In the relevant Slack channel:
Please use this step if there is no issue created to capture the failure. If there is already an issue, please skip this step.

Post in #infrastructure-lounge, or open an issue in the infrastructure project. Ask in #quality if you're unsure where to file the issue.

Staging-Canary is unique when it comes to its blocking smoke and reliable tests that are triggered by the deployer pipeline. Staging-Canary executes smoke/reliable tests for both the Staging-Canary AND Staging environments. This special configuration is designed to help catch issues that occur when incompatibilities arise between the shared and non-shared components of the environments.
Staging-Canary and Staging both share the same database backend, for example. Should a migration or change to either of the non-shared components during a deployment create an issue, running these tests together helps expose this situation. When the deployer pipeline triggers these test runs, they are reported serially in the #qa_staging Slack channel and they appear as different runs.
Note: when viewing a deployment failure from the #announcements Slack channel, you will have to click into the pipeline and look at the Downstream results to understand whether the deployment failure arose from a failure in Staging-Canary or in Staging.
Click on the diagram below to visit the announcement issue for more context and view an uncompressed image:
The aim of this step is to understand the failure. The results of the investigation will also let you know what to do about the failure. Update the failure issue with any findings from your review.
The following can help with your investigation:
Log or artifact | Notes |
---|---|
Stack trace | Shown in the job's log; the starting point for investigating the test failure |
Screenshots and HTML captures | Available for download in the job's artifact for up to 1 week after the job run |
QA Logs | Included in the job's artifact; valuable for determining the steps taken by the tests before failing |
Sentry logs (Staging, Staging Ref, Preprod, Production) | If staging, preprod or production tests fail due to a server error, there should be a record in Sentry. For example, you can search for all unresolved staging errors linked to the gitlab-qa user with the query is:unresolved user:"username:gitlab-qa". However, note that some actions aren't linked to the gitlab-qa user, so they might only appear in the full unresolved list. |
Kibana logs (Staging, Production) | Various application logs are sent to Kibana, including Rails, Postgres, Sidekiq, and Gitaly logs |
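If a failing request returns a correlation ID (visible in the QA logs or in the API response headers), you can often locate the matching server-side entries in Kibana. A minimal sketch of such a search, assuming GitLab's standard JSON log fields and a placeholder correlation ID:

json.correlation_id : "01F0EXAMPLECORRELATIONID"

This is only a suggested shortcut; the stack trace and QA logs in the job artifact remain the usual starting point.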
Depending on your level of context for the test and its associated setup, you might feel comfortable investigating the root cause on your own, or you might get help from other SETs right away.
When investigating on your own, we suggest spending at most 20-30 minutes actively trying to find the root cause (this excludes time spent reporting the failure, reviewing the failure logs, or any test setup and pipeline execution time). After that point, or whenever you feel out of ideas, we recommend asking for help to unblock you.
Note: Please avoid logging in via gitlab-qa and the other bot accounts on Canary/Production. They are monitored by SIRT, and an alert will be raised if someone uses them to log in. If you really need to log in with one of these accounts, please give a quick heads-up in #security-department that someone is logging into the bot and tag @sirt-members for awareness.
Below is the list of the common root causes in descending order of likelihood:
- Check https://gitlab.com/gitlab-org/security/gitlab/-/compare/start_commit_sha...end_commit_sha to see if there was a change that could have affected the test.
- There are host labels that can help filter by environment when searching through issues (ex: ~host::staging.gitlab.com).
- Post in #infrastructure-lounge and ask if something was changed recently on the environment in question.

Failure examples can be seen in Training Videos.
You can run the test (or perform the test steps manually) against your local GitLab instance to see if the failure is reproducible. For example:
CHROME_HEADLESS=false bundle exec bin/qa Test::Instance::All http://localhost:3000 qa/specs/features/browser_ui/1_manage/project/create_project_spec.rb
Orchestrated tests are excluded by default. To run them, use -- --tag orchestrated before your file name. For example:
CHROME_HEADLESS=false bundle exec bin/qa Test::Instance::All http://localhost:3000 -- --tag orchestrated qa/specs/features/browser_ui/1_manage/project/create_project_spec.rb
You can also use the same Docker image (same sha256 hash) as the one used in the failing job to run GitLab in a container on your local machine.

In the logs of the failing job, search for Downloaded newer image for gitlab/gitlab-ce:nightly or Downloaded newer image for gitlab/gitlab-ee:nightly and use the sha256 hash just above that line.

To run GitLab in a container locally, use a docker command similar to the one shown in the logs. For example:
docker run --publish 80:80 --env GITLAB_OMNIBUS_CONFIG='gitlab_rails["initial_root_password"] = "CHOSEN_PASSWORD"' --name gitlab --hostname localhost gitlab/gitlab-ce:nightly@sha256:<hash>
You can now run the test against this Docker instance. For example:
CHROME_HEADLESS=false bundle exec bin/qa Test::Instance::All http://localhost qa/specs/features/browser_ui/1_manage/project/create_project_spec.rb
To run CustomersDot E2E tests locally against the staging environment, you will need to clone the CustomersDot project, switch to the qa directory, and then run:
STAGING=1 CP_ADMIN_TOKEN=<TOKEN> GL_ADMIN_TOKEN=<TOKEN> bundle exec rspec spec/ui/purchase/purchase_plan_spec.rb
Note: The token values can be found in the GitLab-QA Vault. For details on running tests locally with more options, please refer to the CustomersDot README doc.
Use QA_DEBUG=true to enable logging output, including page actions and Git commands.
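For example, combined with the local run command shown earlier (the instance URL and spec path are the same illustrative values used above):

QA_DEBUG=true CHROME_HEADLESS=false bundle exec bin/qa Test::Instance::All http://localhost:3000 qa/specs/features/browser_ui/1_manage/project/create_project_spec.rb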
Sometimes tests may fail due to an outdated Docker image. To check if that's the case, follow the instructions below to see if specific merged code is available in a Docker image.
If you suspect that a certain test is failing due to the gitlab/gitlab-{ce|ee}-qa image being outdated, follow these steps:

1. Run docker run -it --entrypoint /bin/sh gitlab/gitlab-ce-qa:latest to check for GitLab QA CE code, or docker run -it --entrypoint /bin/sh gitlab/gitlab-ee-qa:latest to check for GitLab QA EE code
2. Go to the qa directory (cd /home/qa/qa)
3. Use cat to see if the code you're looking for is available in a certain file (e.g., cat page/project/issue/show.rb)

Note: if you need to check another tag (e.g., nightly), change it in one of the commands of step 1 above.
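As a shortcut, you can print a file from the image without starting an interactive shell. A minimal sketch, reusing the illustrative tag and file path from the steps above:

docker run --rm --entrypoint cat gitlab/gitlab-ee-qa:nightly /home/qa/qa/page/project/issue/show.rb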
Similarly, if you suspect that the gitlab/gitlab-{ce|ee} application image is outdated, follow these steps:

1. Run docker run -it --entrypoint /bin/sh gitlab/gitlab-ce:latest to check for GitLab CE code, or docker run -it --entrypoint /bin/sh gitlab/gitlab-ee:latest to check for GitLab EE code
2. Go to the gitlab-rails directory (cd /opt/gitlab/embedded/service/gitlab-rails/)
3. Use cat to see if the code you're looking for is available or not in a certain file (e.g., cat public/assets/issues_analytics/components/issues_analytics-9c3887211ed5aa599c9eea63836486d04605f5dfdd76c49f9b05cc24b103f78a.vue)

Note: if you want to check another tag (e.g., nightly), change it in one of the commands of step 1 above.
To work out which version of the code is in a given image or environment:

- To find the latest QA image, run docker pull dev.gitlab.org:5005/gitlab/omnibus-gitlab/gitlab-ee-qa and use the version specified after gitlab-ee-qa:.
- To check the version deployed to an environment, visit its /help page or call the /api/v4/version API (see the example after this list).
- To find the commit included in a nightly image, look for the latest pipeline that ran the bundle exec rake docker:push:nightly command in the Docker-branch job of the Package-and-image stage. Once you find the latest pipeline, search for gitlab-rails under build-component_shas in any job under the Gitlab_com:package stage. For example, in this Ubuntu-16.04-branch job, the commit SHA for gitlab-rails is 32e76bc4fb02a615c2bf5a00a8fceaee7812a6bd.
- To see the commits included in a QA image tagged like gitlab-ee-qa:13.10-4b373026c98, navigate to the https://gitlab.com/gitlab-org/gitlab/-/commits/<commit_SHA> page; in our example the commit SHA is 4b373026c98.
- To see the commits included in a package tagged like 13.10.0-rc20210223090520-ee, navigate to the https://gitlab.com/gitlab-org/gitlab/-/commits/v<tag> page; in our example the tag is 13.10.0-rc20210223090520-ee.
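For instance, a hypothetical call to the version API on staging (the access token is a placeholder; the endpoint requires authentication):

curl --header "PRIVATE-TOKEN: <your_access_token>" "https://staging.gitlab.com/api/v4/version"

The response includes the version string and the revision (commit SHA) currently deployed.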
The aim of this step is to categorize the failure as either a stale test, a bug in the test, a bug in the application code, or a flaky test.
We use the following labels to capture the cause of the failure.
~"failure::investigating"
: Default label to apply at the start of investigation.~"failure::stale-test"
: Stale test due to application change~"failure::broken-test"
: Bug in the test~"failure::flaky-test"
: Flaky test~"failure::test-environment"
: Failure due to test environment~"type::bug"
: Bug in the applicationBugs blocking end-to-end test execution (due to the resulting quarantined tests) should additionally have severity and priority labels. For guidelines about which to choose, please see the blocked tests section of the issue triage page.
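If you're updating the failure issue from a comment, any of these labels (as well as the severity and priority labels) can be applied with a quick action, for example:

/label ~"failure::investigating"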
Note: It might take a while for a fix to propagate to all environments. Be aware that a new failure could be related to a recently-merged fix that hasn't made it to the relevant environment yet. Similarly, if a known failure occurs but the test should pass because a fix has been merged, verify that the fix has been deployed to the relevant environment before attempting to troubleshoot further.
The failure was caused by a change in the application code and the test needs to be updated.
~"failure::stale-test"
label.The failure was caused by a bug in the test code itself, not in the application code.
~"failure::broken-test"
label.The failure was caused by a bug in the application code.
~"type::bug"
label, and cc-ing the corresponding Engineering Managers (EM), QEM, and SET.type: :bug
in the quarantine
tag.Note: GitLab maintains a daily deployment cadence so a breaking change in master
reaches Canary and Production fast. Please communicate broadly to ensure that the corresponding Product Group is aware of the regression and action is required. If the bug is qualified for dev escalation (example: priority::1/severity::1
issue that blocks the deployment process), consider involving On-call Engineers in the #dev-escalation
channel. To find out who’s on-call follow the links in the channel subject line.
To find the appropriate team member to cc, please refer to the Organizational Chart. The Quality Engineering team list and DevOps stage group list might also be helpful.
The failure is due to flakiness in the test itself.
~"failure::flaky-test"
label.Flakiness can be caused by a myriad of problems. Examples of underlying problems that have caused us flakiness include:
For more details, see the list with example issues in our Testing standards and style guidelines section on Flaky tests.
The failure is due to external factors outside the scope of the test. This could be due to environments, deployment hang-ups, or upstream dependencies.

Apply the ~"failure::test-environment" label.

A job may fail due to infrastructure or orchestration issues that are not related to any specific test. In some cases these issues will fail a job before tests are ever executed. Some examples of non-test related failures include:
If the failure is in a smoke or a reliable test, it will block deployments. Please inform the release managers of the root cause and whether a fix by Quality is in progress. On GitLab.com you can use @gitlab-org/release/managers. In Slack you can use @release-managers.
Please also raise awareness by looping in the appropriate team members from the product group, such as SET or EM. You may also want to post to Quality's Slack channel, #quality, depending on the impact of the failure.
If the failure could affect the performance of GitLab.com production, or make it unavailable to a specific group of users, you can declare an incident with /incident declare in the #production Slack channel. This will automatically prevent deployments (if the incident is at least an S2).
If you've found that the test is the cause of the failure (either because the application code was changed or because there's a bug in the test itself), it will need to be fixed. This might be done by another SET or by yourself, but it should happen as soon as possible. In any case, follow these steps:
If the test was flaky:
Note: The number of passes needed to be sure a test is stable is just a suggestion. You can use your judgement to pick a different threshold.
If the test was in quarantine, remove it from quarantine.
Note: We should be very strict about quarantining tests. Quarantining a test is very costly and poses a higher risk because it allows tests to fail without blocking the pipeline, which could mean we miss new failures.

The aim of quarantining a test is not to get back a green pipeline, but rather to reduce the noise (due to constantly failing tests, flaky tests, and so on) so that new failures are not missed. If you're unsure about quarantining a test, ask for help in the #quality Slack channel, and then consider adding to the list of examples below to help future pipeline triage DRIs.
Examples of when to quarantine a test:
~"failure::broken-test"
), and a fix won't be ready for review within 24 hours~"failure::stale-test"
), and a fix won't be ready for review within 24 hoursExamples of when not to quarantine a test:
~"failure::test-environment"
), and neither the application code nor test code are the cause of the failure:smoke
tag should be removed from the test to prevent it running with the smoke
suite, but still allowed to run elsewhere while the flakiness is under investigation or being worked on to unblock deployment.:smoke
tag as soon as possible. Tests at the :smoke
level should be given priority when addressing flakiness within our test suites.# TODO
note in the test as a reminder with a link to the previously created issue url. For example:
# TODO restore :smoke tag and close https://gitlab.com/gitlab-org/gitlab/-/issues/######
Note: The time limit for the fix is just a suggestion. You can use your judgement to pick a different threshold.
To quarantine a test:
- Add the :quarantine metadata to the test with a link to the issue (see quarantined test types)

Note: If the example has a before hook, the quarantine metadata should be assigned to the outer context to avoid running the before hook.

- Open a merge request with the ~"Quality", ~"QA", ~"type::bug" labels.
- If the change needs to reach production urgently, also apply ~"Pick into auto-deploy", ~"priority::1", and ~"severity::1". Please note that this is reserved for emergency cases only, such as blocked deployments, as it will delay all other deployments by around two hours.
- Apply the relevant stage and group labels, for example ~"devops::create" ~"group::source code".

To be sure that the test is quarantined quickly, ask in the #quality Slack channel for someone to review and merge the merge request, rather than assigning it directly.
Here is an example quarantine merge request.
If a test is placed under quarantine, it is important to specify why. By specifying a quarantine type we can quickly see the reason for the quarantine.
The report accepts the quarantine types below:
Quarantine Type | Description |
---|---|
:flaky | This test fails intermittently |
:bug | This test is failing due to an actual bug in the application |
:stale | This test is outdated due to a feature change in the application and must be updated to fit the new changes |
:broken | This test is failing because of a change to the test code or framework |
:waiting_on | This test is quarantined temporarily due to an issue or MR that is a prerequisite for this test to pass |
:investigating | This test is a :flaky test but it might be blocking other MRs and so should be quarantined while it's under investigation |
:test_environment | This test is failing due to problems with the test environment and will not be fixed within 24 hours |
Note: Be sure to attach an issue to the quarantine metadata. We use this issue for tracking the average age of the quarantined tests.
it 'is flaky', quarantine: { issue: 'https://gitlab.com/gitlab-org/gitlab/issues/12345', type: :flaky }
it 'is due to a bug', quarantine: {
issue: 'https://gitlab.com/gitlab-org/gitlab/issues/12345',
type: :bug
}
context 'when these tests rely on another MR', quarantine: {
type: :waiting_on,
issue: 'https://gitlab.com/gitlab-org/gitlab/merge_requests/12345'
}
You should apply the quarantine tag to the outermost describe/context block that has tags relevant to the test being quarantined.
# Good
RSpec.describe 'Plan', :smoke, quarantine: { issue: 'https://gitlab.com/gitlab-org/gitlab/issues/12345', type: :flaky } do
describe 'Feature' do
before(:context) do
# This before(:context) block will only be executed in smoke quarantine jobs
end
end
end
# Bad
RSpec.describe 'Plan', :smoke do
describe 'Feature', quarantine: { issue: 'https://gitlab.com/gitlab-org/gitlab/issues/12345', type: :flaky } do
before(:context) do
# This before(:context) block could be mistakenly executed in quarantine jobs that _don't_ have the smoke tag
end
end
end
Failing to dequarantine tests periodically reduces the effectiveness of the test suite. Hence, tests should be dequarantined on or before the due date mentioned in the corresponding issue.
Before dequarantining a test:
- Confirm that the test passes against the relevant environment by setting the RELEASE variable to the release that has your changes. See Running GitLab-QA pipeline against a specific GitLab release for instructions on finding your release version created and tagged by the Omnibus pipeline (a hypothetical trigger example is shown below).
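For illustration only (this is not the documented procedure; the project ID, trigger token, and release image below are placeholders), a pipeline with the RELEASE variable could be started through GitLab's generic pipeline trigger API:

curl --request POST \
  --form "token=<trigger_token>" \
  --form "ref=master" \
  --form "variables[RELEASE]=<release_image>" \
  "https://gitlab.com/api/v4/projects/<gitlab-qa_project_id>/trigger/pipeline"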
To dequarantine a test:

- Remove the :quarantine tag using the Quarantine End to End Test MR template.

As with quarantining a test, you can ask in the #quality Slack channel for someone to review and merge the merge request, rather than assigning it.
If the due date of a failing test issue is reached, you should re-evaluate if the failing test should really be covered at the end-to-end test level, or if it should be covered in a lower level of the testing levels pyramid.
If you decide to delete the test, open a merge request to delete it and close the test failure issue. In the MR description or comment, mention the stable counterpart SET for the test's stage for their awareness. Then open a new issue to cover the test scenario in a different test level.
If you decide the test is still valuable but don't want to leave it quarantined, you could replace :quarantine with :skip, which will skip the test entirely (i.e., it won't run even in jobs for quarantined tests). That can be useful when you know the test will continue to fail for some time (e.g., at least the next milestone or two).
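As an illustration, a minimal sketch of a skipped test using RSpec's skip metadata; the description, reason, and issue URL are placeholders:

it 'is skipped until the upstream issue is resolved', skip: 'See https://gitlab.com/gitlab-org/gitlab/issues/12345' do
  # This example will not run, even in jobs that run quarantined tests.
end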
These videos walking through the triage process were recorded and uploaded to the GitLab Unfiltered YouTube channel.
You can find some general tips for troubleshooting problems with GitLab end-to-end tests in the development documentation.