Engineering Productivity team
Mission
- Constantly improve efficiency for our entire engineering team, to ultimately increase value for our customer.
- Measure what matters: quality of life, efficiency, and toil reduction improvements with quantitative and qualitative measures.
- Build partnerships across organizational boundaries to deliver broad efficiency improvements.
Team
Members
Name | Role |
---|---|
Andrew Smith | Core Team member |
Ben Bodenmiller | Core Team member |
Dave Munchiello | Board Observer |
George Hoyem | Board Observer |
George Tsiolis | Core Team member |
Godfrey Sullivan | Lead Independent Director, Board of Directors |
Hannes Rosenögger | Core Team member |
Jacopo Beschi | Core Team member |
Karen Blasing | Board of Directors |
Kyle Doherty | Board Observer |
Marco Zille | Core Team member |
Mark Porter | Board of Directors |
Matt Mullenweg | Board Observer |
Matthew Jacobson | Board of Directors |
Patrick Rice | Core Team member |
Randy Gottfried | Advisor |
Robert Schilling | Core Team member |
Merline Saintil | Board of Directors |
Siddharth Asthana | Core Team member |
Sue Bostrom | Board of Directors |
Sunny Bedi | Board of Directors |
Takuya Noguchi | Core Team member |
Niklas van Schrick | Core Team member |
Vitaliy Klachkov | Core Team member |
Stable Counterpart
Person | Role |
---|---|
Greg Alfaro | GDK Project Stable Counterpart, Application Security |
Core Responsibilities
graph LR
    A[Engineering Productivity Team]

    A --> B[Planning & Reporting]
    B --> B1[Weekly team reports<br>Providing teams with an overview of their current, planned & unplanned work]
    click B1 "https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/32"
    B --> B2[Issues & MRs hygiene automation<br>Ensuring healthy issue/MR trackers]
    click B2 "https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/32"

    A --> C[Development Tools]
    C --> C1[GitLab Development Kit<br>Providing a reliable development environment]
    click C1 "https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/31"
    C --> C2[GitLab Remote Development<br>Providing a remote reliable development environment]
    click C2 "https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/31"

    A --> F[Review & CI]
    F --> F2[Merge Request Review Process<br>Ensuring a smooth, fast and reliable review process]
    click F2 "https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/34"
    F --> F3[Merge Request Pipelines<br>Providing fast and reliable pipelines]
    click F3 "https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/28"
    F --> F4[Review apps<br>Providing review apps to explore a merge request's changes]
    click F4 "https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/33"

    A --> D[Maintenance & Security]
    D --> D1[Automated dependency updates<br>Ensuring dependencies are up-to-date]
    click D1 "https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/40"
    D --> D2[Automated management of CI/CD secrets<br>Providing a secure CI/CD environment]
    click D2 "https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/46"
    D --> D3[Automated main branch failing pipelines management<br>Providing a stable `master` branch]
    click D3 "https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/30"
    D --> D4[Static analysis<br>Ensuring the codebase style and quality is consistent and reducing bikeshedding]
    click D4 "https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/38"
    D --> D5[Shared CI/CD components<br>Providing CI/CD components to ensure consistency in all GitLab projects]
    click D5 "https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/41"

    A --> G[JiHu Support]
    click G "https://gitlab.com/groups/gitlab-org/quality/engineering-productivity/-/epics/35"
- See it and find it: Build automated measurements and dashboards to gain insights into the productivity of the Engineering organization to identify opportunities for improvement.
  - Implement new measurements to provide visibility into improvement opportunities.
  - Collaborate with other Engineering teams to provide visualizations for measurement objectives.
  - Improve existing performance indicators.
- Do it for internal team: Increase contributor and developer productivity by making measurement-driven improvements to the development tools / workflow / processes, then monitor the results, and iterate.
  - Identify and implement quantifiable improvement opportunities with proposals and hypotheses for metric improvements.
  - Automated merge request quality checks and code quality checks.
  - GitLab project pipeline improvements to improve efficiency, quality or duration.
- Dogfood use: Dogfood GitLab product features to improve developer workflow and provide feedback to product teams.
  - Use new features from related product groups (Analytics, Monitor, Testing).
  - Improve usage of Review apps for GitLab development and testing.
- Engineering support:
  - #master-broken pipeline monitoring.
  - KPI corrective actions such as Review Apps stabilization.
  - Merge Request Coach for ~“Community contribution” merge requests.
- Engineering workflow: Develop automated processes for improving label classification hygiene in support of product and Engineering workflows.
  - Automated issues and merge requests triage.
  - Improvements to the labelling classification and automation used to support Engineering measurements.
  - See the gitlab-triage Ruby gem, and Triage operations projects for examples.
- Do it for wider community: Increase efficiency for wider GitLab Community contributions.
- Dogfood build: Enhance and add new features to the GitLab product to improve engineer productivity.
KPIs
Infrastructure Performance Indicators are our single source of truth
PIs
Shared
- Quality Handbook MR Rate
- Quality Department Promotion Rate
- Quality Department Discretionary Bonus Rate
OKRs
Objectives and Key Results (OKRs) help align our sub-department towards what really matters. These happen quarterly and are based on company OKRs. We follow the OKR process defined here.
Here is an overview of our current OKRs.
Communication
Description | Link |
---|---|
GitLab Team Handle | @gl-quality/eng-prod |
Slack Channel | #g_engineering_productivity |
Team Boards | Team Board & Priority Board |
Issue Tracker | gitlab-org/quality/engineering-productivity/team |
Office hours
Engineering Productivity has monthly office hours on the 3rd Wednesday of the month at 3:00 UTC (20:00 PST) on even months (e.g., February, April), open for anyone to add topics or questions to the agenda. Office hours can be found in the GitLab Team Meetings calendar.
Meetings
Engineering Productivity has a weekly team meeting in two parts (EMEA / AMER) so that all team members can collaborate at times that work for them.
- Part 1 is Tuesdays 11:00 UTC, 04:00 PST
- Part 2 is Tuesdays 22:00 UTC, 15:00 PST
Work prioritization
The Engineering Productivity team has diverse responsibilities and reactive work. Work is categorized as planned and reactive.
Guiding principles
- We focus on OKRs, corrective actions and preventative work.
- We adhere to the general release milestones like %x.y.
- We are ambitious with our targeted planned work per milestone. These targets are not reflective of a commitment. Reactive workload will ebb and flow, and we do not expect to accomplish everything planned for the current milestone.
- Priority labels are used to indicate relative priority for a milestone.
Weighting
We follow the department weighting guidelines to relatively weight issues over time to understand a milestone velocity and increase predictability.
When weighting, think about knowns and complexity related to recently completed work. The goal with weighting is to allow for some estimation ambiguity that allows for a consistent predictable flow of work each milestone.
Prioritization activities
When | Activity | DRI |
---|---|---|
Weekly | Assign ~priority::1, ~priority::2 issues to a milestone | Engineering Productivity Engineering Manager |
Weekly | Weight issues identified with ~"needs weight" | Engineering Productivity Backend Engineer |
Weekly | Prioritize all ~"Engineering Productivity" issues | Engineering Productivity Engineering Manager |
2 weeks prior to milestone start | Milestone planned work is identified and scheduled | Engineering Productivity Engineering Manager |
2 weeks prior to milestone start | Provide feedback on planned work | Engineering Productivity team |
1 week prior to milestone start | Transition any work that is not in progress for current milestone to upcoming milestone | Engineering Productivity Engineering Manager |
1 week prior to milestone start | Adjust planned work for upcoming milestone | Engineering Productivity Engineering Manager |
1 week prior to milestone start | Final adjustments to planned scope | Engineering Productivity team |
During milestone | Adjust priorities and scope based on newly identified issues and reactive workload | Engineering Productivity Engineering Manager |
Projects
The Engineering Productivity team recently reviewed (2023-05-19) all our projects and discussed relative priority. Aligning this with our business goals and priorities is very important. The list below is ordered based on aligned priorities and includes primary domain experts for communication as well as a documentation reference for self-service.
Project | Domain Knowledge | Documentation |
---|---|---|
GitLab CI Pipeline configuration optimization and stability | Jen-Shin, David, Nao | Pipelines for the GitLab project |
Triaging master-broken | Jenn, Nao | Broken Master |
GitLab Development Kit (GDK) continued development | Nao, Peter | GitLab Development Kit |
Triage operations for issues, merge requests, community contributions | Jenn, Alina | triage-ops |
Review Apps | David, Rémy | Using review apps in the development of GitLab |
Triage engine, used by GitLab triage operations | Jen-Shin, Rémy | GitLab Triage |
Danger & Dangerfiles (includes Reviewer roulette) for shared Danger rules and plugins | Rémy, Jen-Shin, Peter | gitlab-dangerfiles Ruby gem for shared Danger rules and plugins |
JiHu | Jen-Shin | JiHu Support |
Development department metrics for measurements of Quality and Productivity | Jenn, Rémy | Development Department Performance Indicators |
RSpec Profiling Statistics for profiling information on RSpec tests in CI | Peter | rspec_profiling_stats |
RuboCop & shared RuboCop cops | Peter | gitlab-styles Ruby gem for shared RuboCop cops |
Feature flag alert for reporting on GitLab feature flags | Rémy | GitLab feature flag alert |
Chatops (especially for feature flags toggling) | Rémy | Chatops scripts for managing GitLab.com from Slack |
CI/CD variables, Triage ops, and Internal workspaces infrastructure | David, Rémy | Engineering Productivity infrastructure |
Tokens management | Rémy | “Rotating credentials” runbook |
Gems management | Rémy | Rubygems committee project |
Shared CI/CD config & components | David, Rémy | gitlab-org/quality/pipeline-common and gitlab-org/components |
Dependency management (Gems, Ruby, Vue, etc.) | Jen-Shin, Peter | Renovate GitLab bot |
Metrics
The Engineering Productivity team creates metrics in the following sources to aid in operational reporting.
- Engineering Productivity Collection
- Broken Master Pipeline Root Cause Analysis
- Time to First Failure
- Flaky test issues
- Test Intelligence Accuracy
- Engineering Productivity Pipeline Durations
- Engineering Productivity Jobs Durations
- Engineering Productivity Package And QA Durations (to be replaced in Tableau)
- GDK - Jobs Durations (to be replaced in Tableau)
- Issue Types Detail
- GitLab-Org Native Insights
- Review Apps monitoring dashboard
- Triage Reactive monitoring dashboards
Communication guidelines
The Engineering Productivity team will make changes which can create notification spikes or new behavior for GitLab contributors. The team will follow these guidelines in the spirit of GitLab’s Internal Communication Guidelines.
Pipeline changes
Critical pipeline changes
Pipeline changes that have the potential to have an impact on the GitLab.com infrastructure should follow the Change Management process.
Pipeline changes that meet the following criteria must follow the Criticality 3 process:
- update to the cache-repo job
These kinds of changes have led to production issues in the past.
Non-critical pipeline changes
The team will communicate significant pipeline changes to #development in Slack and the Engineering Week in Review.
Pipeline changes that meet the following criteria will be communicated:
- addition, removal, renaming, parallelization of jobs
- changes to the conditions to run jobs
- changes to pipeline DAG structure
Other pipeline changes will be communicated based on the team’s discretion.
Automated triage policies
Be sure to give a heads-up to the #development, #eng-managers, #product, and #ux Slack channels and the Engineering Week in Review when an automation is expected to triage more than 50 notifications or change policies that a large stakeholder group uses (e.g. team-triage report).
Asynchronous Issue Updates
Communicating progress is important, but status doesn’t belong in one-on-ones, as it can be communicated more appropriately to a broader audience using other methods. The “standup” model used by many organizations practicing Scrum assumes a certain time of day for standups to happen. In a team distributed across time zones, there is no “9am” that the team shares. Additionally, losing context after completing work for the day, only to regain it later to share a status update, is context switching. The standup model assumes that its audience is just the team, but in GitLab’s model, that means folks need to be aware of where this is being communicated (Slack, issues, other). Since this information isn’t available to the intended audience, it needs to be duplicated, which at worst means there’s no single source of truth and at a minimum means copy-pasting information.
The proposal is to trial using an Asynchronous Issue Update model, similar to what the Package Group uses. This process would replace the existing daily standup update we post in Slack with Geekbot. The time period for the trial would be a milestone or two, depending on feedback cycles.
The async daily update communicates the progress and confidence using an issue comment and the milestone health status using the Health Status field in the issue. A daily update may be skipped if there was no progress. Merge requests that do not have a related issue should be updated directly. It’s preferable to update the issue rather than the related merge requests, as those do not provide a view of the overall progress. Where there are blockers or you need support, Slack is the preferred space to ask for that. Being blocked or needing support are more urgent than email notifications allow.
When communicating the health status, the options are:
- on track - when the issue is progressing as planned
- needs attention - when the issue requires attention or intervention to keep it on schedule
- at risk - when there is a risk the issue will not be completed according to schedule
The async update comment should include:
- what percentage complete the work is, in other words, how much work is done to put all the required MRs in review
- the confidence of the person that their estimate is correct
- notes on what was done and/or if review has started
- it could be good to specify the relevant dependencies in the update, if there are multiple people working on it
Example:
**Status**: 20% complete, 75% confident
Expecting to go into review tomorrow.
Include one entry for each associated MR
Example:
**Issue status**: 20% complete, 75% confident
Expecting to go into review tomorrow.
**MR statuses**:
- !11111+ - 80% complete, 99% confident - docs update - need to add one more section
- !21212+ - 10% complete, 70% confident - api update - database migrations created, working on creating the rest of the functionality next
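The update format above is regular enough to generate mechanically. The following is a minimal Python sketch, not an existing GitLab tool: the `MrStatus` class and the `format_update` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MrStatus:
    """Progress of one merge request (hypothetical helper type)."""
    reference: str  # e.g. "!11111+"
    complete: int   # percentage complete, 0-100
    confident: int  # confidence in that percentage, 0-100
    notes: str


def format_update(complete: int, confident: int, notes: str,
                  mrs: Optional[List[MrStatus]] = None) -> str:
    """Render an async update comment in the format shown in the examples."""
    # Use the "Issue status" label only when per-MR entries follow.
    label = "Issue status" if mrs else "Status"
    lines = [f"**{label}**: {complete}% complete, {confident}% confident", notes]
    if mrs:
        lines.append("**MR statuses**:")
        for mr in mrs:
            lines.append(
                f"- {mr.reference} - {mr.complete}% complete, "
                f"{mr.confident}% confident - {mr.notes}"
            )
    return "\n".join(lines)


print(format_update(20, 75, "Expecting to go into review tomorrow."))
```

Passing a list of `MrStatus` entries produces the second form, with one line per associated MR.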
How to measure confidence?
Ask yourself: how confident am I that my % of completeness is correct?
For things like bugs or issues with many unknowns, the confidence can help communicate the level of unknowns. For example, if you start a bug with a lot of unknowns on the first day of the milestone, you might have low confidence that you understand what your level of progress is. Your confidence in the work may go down for whatever reason, and it’s acceptable to downgrade your confidence. Consideration should be given to retrospecting on why that happened.
Weekly Epic updates
A weekly update should be added to epics you’re assigned to and/or are actively working on. The update should provide an overview of the progress across the feature. Consider adding an update if the epic is blocked, if there are unexpected competing priorities, or, even when the epic is not in progress, to state the confidence level to deliver by the expected delivery date. A weekly update may then be skipped until the situation changes. Anyone working on issues assigned to an epic can post weekly updates.
The epic updates communicate a high-level view of progress and status for quarterly goals using an epic comment. They do not need issue-level or MR-level granularity because that is part of each issue’s updates.
The weekly update comment should include:
- Status: ok, so-so, bad? Is there something blocked in the general effort?
- How much of the total work is done? How much is remaining? Do we have an ETA?
- What’s your confidence level on the completion percentage?
- What is next?
- Is there something that needs help/support? (tag specific individuals so they know ahead of time)
Examples
Some good examples of epic updates that cover the above aspects:
- https://gitlab.com/groups/gitlab-org/-/epics/8628#note_1090732793
- https://gitlab.com/groups/gitlab-org/-/epics/5152#note_1029337901
Test Intelligence
As the owner of pipeline configuration for the GitLab project, the Engineering Productivity team has adopted several test intelligence strategies aimed at improving pipeline efficiency, with the following benefits:
- Shortened feedback loop by prioritizing tests that are most likely to fail
- Faster pipelines to scale better when Merge Train is enabled
These strategies include:
- Predictive test jobs via test mapping
- Fail-fast job
- Re-run previously failed tests early
- Selective jobs via pipeline rules
- Selective jobs via labels
Predictive test jobs via test mapping
Tests that provide coverage to the code changes in each merge request are most likely to fail. As a result, merge request pipelines for the GitLab project run only the predictive set of tests by default. These include:
- RSpec predictive jobs, which run relevant RSpec tests that are mapped to the code changes
- Jest predictive jobs, which run relevant Jest tests that are mapped to the code changes
See https://docs.gitlab.com/ee/development/pipelines/index.html#predictive-test-jobs-before-a-merge-request-is-approved for more information.
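To make the idea of test mapping concrete, here is a conceptual sketch in Python. It is not the mapping used by the GitLab project and does not reproduce the API of any existing tool: the `MAPPINGS` rules and the `predictive_tests` function are assumptions chosen only to show how changed files can be translated into a set of test files.

```python
import re
from typing import Iterable, List, Set, Tuple

# Hypothetical mapping rules: (source file pattern, test file template).
# Illustrative only; not the rules used by the GitLab project.
MAPPINGS: List[Tuple[str, str]] = [
    (r"^app/(.+)\.rb$", r"spec/\g<1>_spec.rb"),
    (r"^lib/(.+)\.rb$", r"spec/lib/\g<1>_spec.rb"),
    (r"^(spec/.+_spec\.rb)$", r"\g<1>"),  # a changed spec maps to itself
]


def predictive_tests(changed_files: Iterable[str]) -> Set[str]:
    """Return the test files mapped to the given changed files."""
    tests: Set[str] = set()
    for path in changed_files:
        for pattern, template in MAPPINGS:
            match = re.match(pattern, path)
            if match:
                # Fill the captured path segment into the test file template.
                tests.add(match.expand(template))
    return tests


print(sorted(predictive_tests(["app/models/user.rb", "doc/index.md"])))
# prints ['spec/models/user_spec.rb']
```

Files with no matching rule, such as the documentation file in the usage line, contribute no tests to the predictive set.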
Fail-fast job
There is a fail-fast job in each merge request pipeline that runs all the RSpec tests that provide coverage for the code changes and are therefore most likely to fail. It uses the same test_file_finder gem for test mapping. The job provides faster feedback by running early, and it stops the rest of the pipeline right away if any of the fail-fast job tests fail. Take a look at this YouTube video for details on how GitLab implements the fail-fast job with test_file_finder. Note that the current design only works with low-impact merge requests, which are mapped to only a small set of tests. If a large number of tests are likely to fail for a merge request, putting them in a single job is not feasible and could result in a long-running bottleneck, which defeats its purpose.
See https://docs.gitlab.com/ee/development/pipelines/index.html#fail-fast-job-in-merge-request-pipelines for more information.
Premium GitLab customers who wish to incorporate the Fail-Fast job into their Ruby projects can set it up with our Verify/Failfast template.
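The fail-fast behavior described in this section can be modeled as a small decision function. This is a hedged Python sketch of the logic only, not the job's actual implementation: `MAX_FAIL_FAST_TESTS`, `run_fail_fast`, and the `run_test` callback are hypothetical, and the threshold value is an assumption because the text only says the mapped set must be small.

```python
from typing import Callable, List

# Hypothetical threshold; the text only states that the mapped set must be small.
MAX_FAIL_FAST_TESTS = 10


def run_fail_fast(mapped_tests: List[str],
                  run_test: Callable[[str], bool]) -> str:
    """Return 'skipped', 'failed' or 'passed' for the fail-fast stage.

    run_test is a caller-supplied function that returns True when a test passes.
    """
    if not mapped_tests or len(mapped_tests) > MAX_FAIL_FAST_TESTS:
        # No mapped tests, or too many for a single job: running them here
        # would turn the job into a long-running bottleneck.
        return "skipped"
    for test in mapped_tests:
        if not run_test(test):
            # Stop at the first failure so the rest of the pipeline can stop too.
            return "failed"
    return "passed"


print(run_fail_fast(["spec/a_spec.rb", "spec/b_spec.rb"],
                    lambda test: test != "spec/b_spec.rb"))
# prints failed
```

Returning at the first failing test is what gives the early signal; the size check reflects the stated limitation for merge requests mapped to many tests.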
Re-run previously failed tests early
Tests that previously failed in a merge request are likely to fail again, so they provide the most urgent feedback in the next run. To grant these tests the highest priority, the GitLab pipeline prioritizes previously failed tests by re-running them early in a dedicated job, so it will be one of the first jobs to fail if attention is needed.
See https://docs.gitlab.com/ee/development/pipelines/index.html#re-run-previously-failed-tests-in-merge-request-pipelines for more information.
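As a rough illustration of how the contents of such a dedicated job could be chosen, the Python sketch below selects the previously failed tests that still exist in the current test suite. The function name `early_rerun_tests` and its inputs are hypothetical; how the GitLab pipeline actually stores and retrieves previous failures is not shown here.

```python
from typing import Iterable, List


def early_rerun_tests(current_tests: Iterable[str],
                      previously_failed: Iterable[str]) -> List[str]:
    """Tests for the dedicated early job: previous failures that still exist."""
    existing = set(current_tests)
    # Keep the order in which failures were recorded and drop deleted tests.
    return [test for test in previously_failed if test in existing]


suite = ["spec/a_spec.rb", "spec/b_spec.rb", "spec/c_spec.rb"]
failed_last_run = ["spec/c_spec.rb", "spec/deleted_spec.rb"]
print(early_rerun_tests(suite, failed_last_run))
# prints ['spec/c_spec.rb']
```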
Selective jobs via pipeline rules
The GitLab pipeline consists of hundreds of jobs, but not all are necessary for each merge request. For example, a merge request with only changes to documentation files does not need to run any backend tests, so we can exclude all backend test jobs from the pipeline. See specify-when-jobs-run-with-rules for how to include/exclude CI jobs based on file changes. Most of the pipeline rules for the GitLab project can be found in https://gitlab.com/gitlab-org/gitlab/-/blob/master/.gitlab/ci/rules.gitlab-ci.yml.
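The selection principle can be sketched in a few lines of Python. This is a conceptual model, not GitLab CI configuration: the job names and path patterns in `JOB_RULES` are hypothetical, and `fnmatchcase` is used only as a convenient stand-in for path matching (its `*` also matches `/`).

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List

# Hypothetical job names and path patterns, for illustration only.
JOB_RULES: Dict[str, List[str]] = {
    "docs-lint": ["doc/*", "*.md"],
    "rspec-backend": ["app/models/*", "lib/*"],
    "jest-frontend": ["app/assets/javascripts/*", "spec/frontend/*"],
}


def select_jobs(changed_files: Iterable[str]) -> List[str]:
    """Return the jobs whose patterns match at least one changed file."""
    files = list(changed_files)
    return sorted(
        job
        for job, patterns in JOB_RULES.items()
        if any(fnmatchcase(path, pattern)
               for path in files
               for pattern in patterns)
    )


print(select_jobs(["doc/index.md"]))
# prints ['docs-lint']
```

A documentation-only change selects only the documentation job in this model, which mirrors the example of excluding backend test jobs.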
Selective jobs via labels
Developers can add labels to run jobs in addition to the ones selected by the pipeline rules. Those labels start with pipeline: and multiple can be applied. A few examples that people commonly use:
- ~"pipeline:run-all-rspec"
- ~"pipeline:run-all-jest"
- ~"pipeline:run-as-if-foss"
- ~"pipeline:run-as-if-jh"
- ~"pipeline:run-praefect-with-db"
- ~"pipeline:run-single-db"
See docs for when to use these pipeline labels.
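Because these labels share a common prefix, tooling can pick them out of a merge request's label list with a simple filter. The Python sketch below shows one way to do that; the `pipeline_labels` function is a hypothetical helper, not part of any GitLab library.

```python
from typing import Iterable, List

PIPELINE_PREFIX = "pipeline:"


def pipeline_labels(labels: Iterable[str]) -> List[str]:
    """Return the option part of every label that starts with 'pipeline:'."""
    return [
        label[len(PIPELINE_PREFIX):]
        for label in labels
        if label.startswith(PIPELINE_PREFIX)
    ]


mr_labels = ["backend", "pipeline:run-all-rspec", "pipeline:run-as-if-foss"]
print(pipeline_labels(mr_labels))
# prints ['run-all-rspec', 'run-as-if-foss']
```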
Experiments
This is a list of Engineering Productivity experiments where we identify an opportunity, form a hypothesis and experiment to test the hypothesis.
Experiment | Status | Hypothesis | Feedback Issue or Findings |
---|---|---|---|
Automatic issue creation for test failures | Complete | The goal is to track each failing test in master with an issue, so that we can later automatically quarantine tests. | Feedback issue. |
Always run predictive jobs for fork pipelines | Complete | The goal is to reduce the compute minutes consumed by fork pipelines. The “full” jobs only run for canonical pipelines (i.e. pipelines started by a member of the project) once the MR is approved. | |
Retry failed specs in a new process after the initial run | Complete | Given that a lot of flaky tests are unreliable because previous tests affect the global state, retrying only the failing specs in a new RSpec process should result in a better overall success rate. | Results show that this is useful. |
Experiment with automatically skipping identified flaky tests | Complete - Reverted | Skipping flaky tests should reduce the number of false broken master incidents and increase the master success rate. | We found out that it can actually break master in some cases, so we reverted the experiment with gitlab-org/gitlab!111217. |
Experiment with running previously failed tests early | Complete | We have not noticed a significant improvement in feedback time due to other factors impacting our Time to First Failure metric. | |
Store/retrieve tests metadata in/from pages instead of artifacts | Complete | We’re only interested in the latest state of these files, so using Pages makes sense here. This simplifies the logic to retrieve the reports and reduce the load on GitLab.com’s infrastructure. | This has been enabled since 2022-11-09. |
Reduce pipeline cost by reducing number of rspec tests before MR approval | Complete | Reduce the CI cost for GitLab pipelines by running the most applicable rspec tests for changes prior to approval | Improvements needed to identify and resolve selective test gaps as this impacted pipeline stability. |
Enabling developers to run failed specs locally | Complete | Enabling developers to run failed specs locally will lead to fewer pipelines per merge request and improved productivity from being able to fix regressions more quickly | Feedback issue. |
Use dynamic analysis to streamline test execution | Complete | Dynamic analysis can reduce the number of specs that are needed for MR pipelines without causing significant disruption to master stability | Miss rate of 10% would cause a large impact to master stability. Look to leverage dynamic mapping with local developer tooling. Added documentation from the experiment. |
Using timezone for Reviewer Roulette suggestions | Complete - Reverted | Using timezone in Reviewer Roulette suggestions will lead to a reduction in the mean time to merge | Reviewer Burden was inconsistently applied and specific reviewers were getting too many reviews compared to others. More details in the experiment issue and feedback issue |
Engineering productivity Project Management
Flaky tests Primer
Issue Triage
Triage Operations
Wider Community Merge Request Triage
Workflow Automation