The goal of this page is to create, share and iterate on the Risk Map for the Pipeline Execution group.
Utilise the Risk Map as a tool to:
Risk Area | Risk Description | Impact | Impact level (1 LOW to 5 HIGH) | Probability (1 LOW to 5 HIGH) | Priority | Mitigation |
---|---|---|---|---|---|---|
Team/Capacity | We have 3 dedicated BE engineers on CI and have a large (and growing) backlog | Burn out, missed SLO/SLA, lowers team productivity | 5 | 3 | 15 | Make BE headcount more available |
Team/Capacity | As the team grows, evaluate whether a secondary stable counterpart in SET or UX is needed | Burn out for current stable counterparts (SET, UX) | 4 | 3 | 12 | Consider scaling other counterparts if the size of the engineering team grows |
Team/Escalations | Escalations such as Rapid Actions and Engineering Allocations are disrupting the ability to focus on team priorities | Burn out, low level of autonomy, lowers team productivity | 5 | 4 | 20 | Find ways to proactively mitigate urgent issues with gitlab.com, work on GraphQL to unblock FE, find a dedicated SRE for CI |
Product/Backlog | Bug and Technical Debt backlog has been accruing over the years | Missed SLO/SLA, prioritization is harder | 5 | 3 | 15 | Revisit ownership of domains to better share the gaps |
Infrastructure availability | Pipelines get stuck due to a stuck Sidekiq shard | Mass failure in E2E test suites and/or customer usage impacted | 4 | 3 | 12 | |
Feature/Performance | Service outages due to large artifact storage/removals | | 4 | 3 | 12 | |
Quality/Testability | Hard to replicate production traffic to account for performance testing | | 4 | 4 | 16 | |
Quality/Test coverage | This is a mature product with many features, and some feature sets still lack test coverage (historical test gaps) | Escaped regression bugs | 4 | 4 | 16 | |
Product/Cost | CI abuse (free tier) | | 5 | 5 | 25 | Escalations to prevent pipeline abuse underway |
Feature/Performance | DB timeout from CI minutes monthly reset | | 3 | 1 | 3 | This has been mitigated as of the 2021-04-01 run, but we are keeping an eye on it. Ongoing CI Minutes Rearchitecture efforts will also help with this |
Feature/Performance | Non-performant database queries | Added load on the gitlab.com database, slow page and feature load times | 3 | 3 | 9 | Recent rapid actions have helped, and there is a continual effort to address this and ensure we don't regress |
Team/Efficiency | Migrating more REST to GraphQL to help unblock FE | FE productivity and delivery | 5 | 3 | 15 | |
Feature/Dependencies | Depends on runner response and processing time - https://gitlab.com/gitlab-org/gitlab/-/issues/326113 - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3631 | If runners fail to process, jobs are not executed, pipeline is stuck | 5 | 3 | 15 | |
Feature/Security | Overdue security vulnerabilities not yet addressed | Security risks expose users of the CI features | 4 | 3 | 12 | Build a plan to mitigate these risks, especially overdue ones; proactively planned between EM and PM |
Infrastructure availability | CI/CD Data model scaling | CI/CD Data model scaling | 5 | 2 | 10 | Actively being worked on in CI/CD Data Model Blueprint MR |
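
In every row above, the Priority value equals Impact level multiplied by Probability (for example, 5 × 4 = 20 for Team/Escalations). The page does not state this formula explicitly, so treat it as an inference from the table rather than a documented rule. A minimal sketch of that scoring, assuming the product formula holds; the `Risk` class and its field names are hypothetical and used only for illustration:

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One row of the risk map (hypothetical structure, for illustration)."""
    area: str
    description: str
    impact_level: int  # 1 (low) to 5 (high)
    probability: int   # 1 (low) to 5 (high)

    @property
    def priority(self) -> int:
        # Assumption inferred from the table: priority = impact level x probability.
        return self.impact_level * self.probability


# Three rows copied from the table above.
risks = [
    Risk("Feature/Performance", "DB timeout from CI minutes monthly reset", 3, 1),
    Risk("Team/Escalations", "Escalations disrupt focus on team priorities", 5, 4),
    Risk("Product/Cost", "CI abuse (free tier)", 5, 5),
]

# Rank the risks so the highest-priority items come first.
ranked = sorted(risks, key=lambda r: r.priority, reverse=True)

for r in ranked:
    print(f"{r.priority:>2}  {r.area}: {r.description}")
```

Sorting by the computed score makes it easy to re-rank the map whenever an impact or probability estimate changes during an iteration.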