Infrastructure Department Performance Indicators

Executive Summary

KPI Health Status
GitLab.com Availability SLO Okay
  • February 2024 Availability 99.82%
  • January 2024 Availability 100.00%
  • December 2023 Availability 99.99%
Mean Time To Production (MTTP) Okay
  • Work towards MTTP epic 280.
Corrective Action SLO Okay
  • The Corrective Action SLO is back below 0
Master Pipeline Stability Okay
  • Current month improved to 93%
  • Key issues have been internal Gitaly performance and a dependency upgrade issue, both of which have since been resolved
Merge request pipeline duration Okay
  • Reduced to 42 minutes for this month
  • Two previous months missed the target due to increased retries and a lack of parallelization
  • Implemented a job timeout so that we can capture artifacts and resolve issues
S1 Open Customer Bug Age (OCBA) Attention
  • Promoted to KPI in FY24Q2
  • Slight uptick in the last 3 months due to triaging all previously untriaged customer bugs
  • All S1 bugs are scheduled for the current milestone
S2 Open Customer Bug Age (OCBA) Attention
  • Promoted to KPI in FY24Q2
  • Above target; a significant reduction will require a focus on older customer-impacting S2 bugs
Quality Team Member Retention Confidential
  • Confidential metric, see notes in Key Review agenda
  • Will be merged into a combined department retention metric
Infrastructure Team Member Retention Confidential
  • Confidential metric, see notes in Key Review agenda
  • Will be merged into a combined department retention metric

Key Performance Indicators

GitLab.com Availability SLO

Percentage of time during which GitLab.com is fully operational and providing service to users within SLO parameters. The definition and historical availability are available on the GitLab.com Service Level Availability page.
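
To make the calculation concrete, here is a minimal sketch (not the production measurement pipeline) that derives an availability percentage from per-interval SLO results; the input shape and values are assumptions for illustration:

```python
# Minimal sketch only: availability as the share of measurement intervals that
# met the SLO, expressed as a percentage. The input shape is an assumption for
# illustration, not the real measurement pipeline.
def availability_percent(slo_met: list[bool]) -> float:
    """slo_met holds one boolean per measurement interval (e.g. per minute)."""
    if not slo_met:
        raise ValueError("no measurements")
    return 100.0 * sum(slo_met) / len(slo_met)


# Example: 9,972 of 10,000 intervals within SLO -> 99.72%
print(f"{availability_percent([True] * 9972 + [False] * 28):.2f}%")
```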

Target: equal to or greater than 99.80% Health:Okay

  • February 2024 Availability 99.82%
  • January 2024 Availability 100.00%
  • December 2023 Availability 99.99%

Chart (Tableau↗)

Mean Time To Production (MTTP)

Measures the elapsed time (in hours) from merging a change into the gitlab-org/gitlab project's master branch to deploying that change to gitlab.com. It serves as an indicator of how quickly we can deploy application changes into production. This metric is equivalent to the Lead Time for Changes metric in the Four Keys Project from the DevOps Research and Assessment. Additionally, the data for this metric also shows Deployment Frequency, another of the Four Keys metrics. The MTTP breakdown can be visualized on the Delivery Metrics page.
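
As an illustration of the elapsed-time calculation only, a minimal sketch assuming we already have merge and deploy timestamps for each change; the sample values are hypothetical and the real breakdown lives on the Delivery Metrics page:

```python
from datetime import datetime
from statistics import mean

# Hypothetical (merged_at, deployed_at) pairs for changes that reached gitlab.com.
changes = [
    (datetime(2024, 2, 1, 10, 0), datetime(2024, 2, 1, 18, 30)),
    (datetime(2024, 2, 2, 9, 15), datetime(2024, 2, 2, 20, 0)),
]

# MTTP: mean hours from merging to master to deploying to production.
mttp_hours = mean((deployed - merged).total_seconds() / 3600 for merged, deployed in changes)
print(f"MTTP: {mttp_hours:.1f} hours")  # target: less than 12 hours
```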

Target: less than 12 hours Health:Okay

  • Work towards MTTP epic 280.

Chart (Tableau↗)

Corrective Action SLO

The Corrective Actions (CAs) SLO focuses on the number of open severity::1/severity::2 Corrective Action Issues past their due date. Corrective Actions and their due dates are defined in our Incident Review process.
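
For illustration, a hedged sketch of the underlying count, assuming issue records carry a severity label, a due date, and a state; the field names and values are hypothetical, not the real Corrective Action data:

```python
from datetime import date

# Hypothetical issue records; the real data comes from severity::1/severity::2
# Corrective Action issues created through the Incident Review process.
issues = [
    {"severity": "severity::1", "due_date": date(2024, 1, 15), "state": "closed"},
    {"severity": "severity::2", "due_date": date(2024, 3, 1), "state": "opened"},
]

today = date(2024, 2, 20)
past_due = [
    i for i in issues
    if i["state"] == "opened"
    and i["severity"] in ("severity::1", "severity::2")
    and i["due_date"] < today
]
print("Open Corrective Actions past due:", len(past_due))  # the SLO tracks this count
```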

Target: below 0 Health:Okay

  • The Corrective Action SLO is back below 0

Chart (Tableau↗)

Master Pipeline Stability

Measures our monolith master pipeline success rate. A key indicator of engineering productivity and the stability of our releases. We will continue to leverage Merge Trains in this effort.

Target: Above 95% Health:Okay

  • Current month improved to 93%
  • Key issues have been internal Gitaly performance and a dependency upgrade issue, both of which have since been resolved

Chart (Tableau↗)

Merge request pipeline duration

Measures the average duration of successful monolith merge request pipelines. A key building block for improving our cycle time and efficiency. See: More pipeline improvements.

Target: Below 45 minutes Health:Okay

  • Reduced to 42 minutes for this month
  • Two previous months missed the target due to increased retries and a lack of parallelization
  • Implemented a job timeout so that we can capture artifacts and resolve issues

Chart (Tableau↗)

S1 Open Customer Bug Age (OCBA)

S1 Open Customer Bug Age (OCBA) measures the total number of days that all S1 customer-impacting bugs are open within a month divided by the number of S1 customer-impacting bugs within that month.
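
A minimal worked example of the calculation, assuming we already know how many days each S1 customer-impacting bug was open within the month; the values are hypothetical:

```python
# Hypothetical open-days within the month for each S1 customer-impacting bug.
open_days_per_bug = [12, 45, 3, 30]

# OCBA: total open bug-days in the month divided by the number of bugs in that month.
s1_ocba = sum(open_days_per_bug) / len(open_days_per_bug)
print(f"S1 OCBA: {s1_ocba:.1f} days")  # target: below 30 days
```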

Target: Below 30 days Health:Attention

  • Promoted to KPI in FY24Q2
  • Slight uptick in the last 3 months due to triaging all previously untriaged customer bugs
  • All S1 bugs are scheduled for the current milestone

Chart (Tableau↗)

S2 Open Customer Bug Age (OCBA)

S2 Open Customer Bug Age (OCBA) measures the total number of days that all S2 customer-impacting bugs are open within a month divided by the number of S2 customer-impacting bugs within that month.

Target: Below 250 days Health:Attention

  • Promoted to KPI in FY24Q2
  • Above target; a significant reduction will require a focus on older customer-impacting S2 bugs

Chart (Tableau↗)

Quality Team Member Retention

We need to be able to retain talented team members. Retention measures our ability to keep them at GitLab. Team Member Retention = (1 - (Number of Team Members leaving GitLab / Average of the 12-month Total Team Member Headcount)) x 100. GitLab measures team member retention over a rolling 12-month period.
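
A worked example of the formula above with hypothetical numbers (the real figures are confidential):

```python
# Hypothetical figures for a rolling 12-month window; the real numbers are confidential.
leavers = 4
average_headcount = 50  # average of the 12 month-end headcounts

retention = (1 - leavers / average_headcount) * 100
print(f"Team Member Retention: {retention:.1f}%")  # target: at or above 84%
```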

Target: at or above 84% (this KPI cannot be public) Health:Confidential

  • Confidential metric, see notes in Key Review agenda
  • Will be merged into a combined department retention metric

Infrastructure Team Member Retention

We need to be able to retain talented team members. Retention measures our ability to keep them at GitLab. Team Member Retention = (1 - (Number of Team Members leaving GitLab / Average of the 12-month Total Team Member Headcount)) x 100. GitLab measures team member retention over a rolling 12-month period.

Target: at or above 84% (this KPI cannot be public) Health:Confidential

  • Confidential metric, see notes in Key Review agenda
  • Will be merged into a combined department retention metric

Regular Performance Indicators

Review App deployment success rate

Measures the stability of our test tooling to enable end-to-end and exploratory testing feedback.

Target: Above 95% Health:Attention

  • Moved to regular PI in FY24Q2
  • Stabilized at 95% to 96% in the past 3 months

Chart (Tableau↗)

Time to First Failure p80

TtFF (pronounced “teuf”) measures the 80th percentile time from pipeline creation until the first actionable failed build is completed for the GitLab monorepo project. We want to run the tests that are likely to fail first and shorten the feedback cycle to R&D teams.
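
A minimal sketch of the percentile calculation over hypothetical per-pipeline times to first actionable failure; this is not the production query, just the arithmetic:

```python
from statistics import quantiles

# Hypothetical minutes from pipeline creation to the first actionable failed build.
ttff_minutes = [8, 9, 10, 11, 12, 13, 14, 15, 17, 19]

# The last quintile cut point is the 80th percentile.
p80 = quantiles(ttff_minutes, n=5)[-1]
print(f"TtFF p80: {p80:.1f} minutes")  # target: below 20 minutes
```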

Target: Below 20 minutes Health:Okay

  • Tracking this metric in addition to the average starting in FY23Q4
  • A plan to optimize selective tests is in place for backend and frontend tests

Chart (Tableau↗)

Time to First Failure

TtFF (pronounced “teuf”) measures the average time from pipeline creation until the first actionable failed build is completed for the GitLab monorepo project. We want to run the tests that are likely to fail first and shorten the feedback cycle to R&D teams.

Target: Below 15 minutes Health:Okay

  • Moved to regular PI in FY24Q2
  • Under target of 15 mins for the past 2 months

Chart (Tableau↗)

Average duration of end-to-end test suite

Measures the speed of our full QA/end-to-end test suite in the master branch. A Software Engineering in Test job-family performance-indicator.

Target: at 90 mins Health:Okay

  • Below target of 90 mins

Chart (Tableau↗)

Average age of quarantined end-to-end tests

Measures the stability and effectiveness of our QA/end-to-end tests running in the master branch. A Software Engineering in Test job-family performance-indicator.

Target: TBD Health:Unknown

  • The chart tracking this historical metric was broken. It has recently been fixed, but our visibility is limited.

Chart (Tableau↗)

S1 Open Bug Age (OBA)

S1 Open Bug Age (OBA) measures the total number of days that all S1 bugs are open within a month divided by the number of S1 bugs within that month.

Target: Below 60 days Health:Okay

  • Under target for the past 5 months
  • Moved to regular PI in FY24Q3

Chart (Tableau↗)

S2 Open Bug Age (OBA)

S2 Open Bug Age (OBA) measures the total number of days that all S2 bugs are open within a month divided by the number of S2 bugs within that month.

Target: Below 250 days Health:Okay

  • Under target for the past 11 months
  • Moved to regular PI in FY24Q3

Chart (Tableau↗)

Quality Handbook MR Rate

The handbook is essential to working remotely, to keeping up our transparency, and to recruiting successfully. Our processes are constantly evolving and we need a way to make sure the handbook is being updated at a regular cadence. This data is retrieved by querying the API with a Python script for merge requests that have files matching /source/handbook/engineering/quality/** over time.
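
As a hedged sketch of what such a query could look like against the GitLab REST API (the project path, token handling, and lack of pagination are simplifying assumptions, not the actual script):

```python
import requests

GITLAB_API = "https://gitlab.com/api/v4"
# Assumed handbook project path (URL-encoded); the real script may target a different project.
PROJECT_ID = "gitlab-com%2Fwww-gitlab-com"
HEADERS = {"PRIVATE-TOKEN": "<your-token>"}


def count_handbook_mrs(path_prefix: str, merged_after: str) -> int:
    """Count merged MRs touching files under path_prefix (first page only; pagination omitted)."""
    mrs = requests.get(
        f"{GITLAB_API}/projects/{PROJECT_ID}/merge_requests",
        params={"state": "merged", "updated_after": merged_after, "per_page": 100},
        headers=HEADERS,
    ).json()
    count = 0
    for mr in mrs:
        changes = requests.get(
            f"{GITLAB_API}/projects/{PROJECT_ID}/merge_requests/{mr['iid']}/changes",
            headers=HEADERS,
        ).json()
        if any(c["new_path"].startswith(path_prefix) for c in changes.get("changes", [])):
            count += 1
    return count


# Example: MRs per person per month = count / headcount for the period.
# count_handbook_mrs("source/handbook/engineering/quality/", "2024-02-01T00:00:00Z")
```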

Target: Above 1 MR per person per month Health:Problem

  • Declining in last 3 months
  • To be combined into one handbook structure and measurement

Chart (Tableau↗)

Quality Department Promotion Rate

The total number of promotions over a rolling 12 month period divided by the month end headcount. The target promotion rate is 12% of the population. This metric definition is taken from the People Success Team Member Promotion Rate PI.

Target: 12% Health:Okay

  • Under target for 4 months, which is expected after being above target for 8 months

Chart (Tableau↗)

Quality Department Discretionary Bonus Rate

The number of discretionary bonuses given divided by the total number of team members, in a given period as defined. This metric definition is taken from the People Success Discretionary Bonuses KPI.

Target: at or above 10% Health:Attention

  • We have not been close to target for 10 months
  • Combining into one measurement in progress

Chart (Tableau↗)

Infrastructure Handbook MR Rate

The handbook is essential to working remotely, to keeping up our transparency, and to recruiting successfully. Our processes are constantly evolving and we need a way to make sure the handbook is being updated at a regular cadence. This data is retrieved by querying the API with a Python script for merge requests that have files matching /source/handbook/engineering/ or /source/handbook/support/ over time.

Target: 0.25 Health:Attention

  • Adjusted the target to .55 to be consistent with the larger org, to reflect less activity from managers, and to account for the trend that our initial suggested target is higher than many months of observed activity.
  • Combining into one handbook structure and measurement in progress

Chart (Tableau↗)

Infrastructure Department Discretionary Bonus Rate

The number of discretionary bonuses given divided by the total number of team members, in a given period as defined. This metric definition is taken from the People Success Discretionary Bonuses KPI.

Target: at or above 10% Health:Okay

  • Combining into one department measurement in-progress

Chart (Tableau↗)

Mean Time Between Incidents (MTBI)

Measures the mean elapsed time (in hours) from the start of one production incident, to the start of the next production incident. It serves primarily as an indicator of the amount of disruption being experienced by users and by on-call engineers. This metric includes only Severity 1 & 2 incidents as these are most directly impactful to customers. This metric can be considered “MTBF of Incidents”.
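
A minimal sketch of the calculation from ordered incident start times; the timestamps are hypothetical:

```python
from datetime import datetime
from statistics import mean

# Hypothetical start times of consecutive severity 1 and 2 production incidents.
incident_starts = sorted([
    datetime(2024, 2, 1, 3, 0),
    datetime(2024, 2, 7, 14, 30),
    datetime(2024, 2, 15, 9, 0),
])

# MTBI: mean hours between the starts of consecutive incidents.
gaps_hours = [
    (later - earlier).total_seconds() / 3600
    for earlier, later in zip(incident_starts, incident_starts[1:])
]
print(f"MTBI: {mean(gaps_hours):.1f} hours")  # target: more than 120 hours
```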

Target: more than 120 hours Health:Okay

  • The target is 120 hours, with the intent that we should not have such incidents more than approximately weekly (hopefully less often). Further iterations will increase this target when we incorporate environment (production only).
  • Deployment failures (and the mean time between them) will be extracted into a separate metric to serve as a quality countermeasure for MTTP, unrelated to this metric which focuses on declared service incidents.

Chart (Tableau↗)

Mean Time To Resolution (MTTR)

For all customer-impacting services, measures the elapsed time (in hours) it takes us to resolve an incident when one occurs. This serves as an indicator of our ability to execute these recoveries. This includes Severity 1 & Severity 2 incidents from the production project.

Target: less than 24 hours Health:Attention

  • Data depends on SREs adding the incident::resolved label
  • As we continue to participate in dogfooding GitLab Incident Management, we intend to improve this metric

Chart (Tableau↗)

Mean Time To Mitigate (MTTM)

For all customer-impacting services, measures the elapsed time (in hours) it takes us to mitigate an incident when one occurs. This serves as an indicator of our ability to mitigate production incidents. This includes Severity 1 & Severity 2 incidents from the production project.

Target: less than 1 hour Health:Attention

  • This metric is equivalent to the Time to Restore metric in the Four Keys Project from the DevOps Research and Assessment
  • Data depends on SREs adding the incident::mitigate label
  • As we continue to participate in dogfooding GitLab Incident Management, we intend to improve this metric

Chart (Tableau↗)

GitLab.com Saturation Forecasting

It is critical that we continuously observe normal growth in resource saturation as well as detect anomalies. This helps to ensure that we have the appropriate platform capacity in place. This metric uses the results of the Tamland forecasting framework for non-horizontally scalable services, which end up as issues in the Capacity Planning project. This metric counts the number of open capacity issues in that project.

Target: at or below 5 open issues Health:Attention

  • Next improvements are to document the existing process for creating capacity planning issues, with a view to simplifying and automating it. Documenting and improving this process is a requirement for the SOC 2 Availability Criteria and is an OKR for Scalability in Q1.
  • Once we have Thanos data available in Snowflake, we will switch this PI to show the percentage

Chart (Tableau↗)

GitLab.com Hosting Cost / Revenue

We need to spend our investors’ money wisely. As part of this, we aim to follow industry-standard targets for hosting cost as a percentage of overall revenue. In this case, revenue is measured as MRR plus one-time monthly revenue from CI & Storage.

Target: TBD (this KPI cannot be public) Health:Confidential

  • Confidential metric - See Key Review agenda

Infrastructure Department Promotion Rate

The total number of promotions over a rolling 12 month period divided by the month end headcount. The target promotion rate is 12% of the population. This metric definition is taken from the People Success Team Member Promotion Rate PI.

Target: 12% Health:Okay

  • Above target
  • Combining into one department measurement in-progress

Chart (Tableau↗)

Legends

Health

Value | Level | Meaning
3 | Okay | The KPI is at an acceptable level compared to the threshold
2 | Attention | This is a blip, or we’re going to watch it, or we just need to enact a proven intervention
1 | Problem | We'll prioritize our efforts here
-1 | Confidential | Metric & metric health are confidential
0 | Unknown | Unknown

How pages like this work

Data

The heart of pages like this is a set of Performance Indicators data files, which are YAML files. Each - denotes a dictionary of values for a new (K)PI. The current elements (or data properties) are:

Property | Type | Description
name | Required | String value of the name of the (K)PI. For Product PIs, the product hierarchy should be separated from the name by " - " (Ex. {Stage Name}:{Group Name} - {PI Type} - {PI Name})
base_path | Required | Relative path to the performance indicator page that this (K)PI should live on
definition | Required | Refer to Parts of a KPI
parent | Optional | Should be used when a (K)PI is a subset of another PI. For example, we might care about Hiring vs Plan at the company level. The child would be the division and department levels, which would have the parent flag.
target | Required | The target or cap for the (K)PI. Please use Unknown until we reach maturity level 2 if this is not yet defined. For GMAU, the target should be quarterly.
org | Required | The organizational grouping (Ex: Engineering Function or Development Department). For Product Sections, ensure you have the word section (Ex: Dev Section)
section | Optional | The product section (Ex: dev) as defined in sections.yml
stage | Optional | The product stage (Ex: release) as defined in stages.yml
group | Optional | The product group (Ex: progressive_delivery) as defined in stages.yml
category | Optional | The product category (Ex: feature_flags) as defined in categories.yml
is_key | Required | Boolean value (true/false) that indicates if it is a (key) performance indicator
health | Required | Indicates the (K)PI health and reasons as nested attributes. This should be updated monthly before Key Reviews by the DRI.
health.level | Optional | Indicates a value between 0 and 3 (inclusive) to represent the health of the (K)PI. This should be updated monthly before Key Reviews by the DRI.
health.reasons | Optional | Indicates the reasons behind the health level. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one reason.
urls | Optional | List of URLs associated with the (K)PI. Should be an array (indented lines starting with dashes) even if you only have one URL.
funnel | Optional | Indicates there is a handbook link for a description of the funnel for this PI. Should be a URL.
sisense_data | Optional | Allows a Sisense dashboard to be embedded as part of the (K)PI using chart, dashboard, and embed as nested attributes.
sisense_data.chart | Optional | Indicates the numeric Sisense chart/widget ID. For example: 9090628
sisense_data.dashboard | Optional | Indicates the numeric Sisense dashboard ID. For example: 634200
sisense_data.shared_dashboard | Optional | Indicates the Sisense shared_dashboard ID. For example: 185b8e19-a99e-4718-9aba-96cc5d3ea88b
sisense_data.embed | Optional | Indicates the Sisense embed version. For example: v2
sisense_data_secondary | Optional | Allows a second Sisense dashboard to be embedded. Same as sisense_data.
sisense_data_secondary.chart | Optional | Same as sisense_data.chart
sisense_data_secondary.dashboard | Optional | Same as sisense_data.dashboard
sisense_data_secondary.shared_dashboard | Optional | Same as sisense_data.shared_dashboard
sisense_data_secondary.embed | Optional | Same as sisense_data.embed
public | Optional | Boolean flag that can be set to false where a (K)PI does not meet the public guidelines.
pi_type | Optional | Indicates the Product PI type (Ex: AMAU, GMAU, SMAU, Group PPI)
product_analytics_type | Optional | Indicates if the metric is available on SaaS, SM (self-managed), or Both.
is_primary | Optional | Boolean flag that indicates if this is the Primary PI for the Product Group.
implementation | Optional | Indicates the implementation status and reasons as nested attributes. This should be updated monthly before Key Reviews by the DRI.
implementation.status | Optional | Indicates the Implementation Status. This should be updated monthly before Key Reviews by the DRI.
implementation.reasons | Optional | Indicates the reasons behind the implementation status. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one reason.
lessons | Optional | Indicates lessons learned from a (K)PI as a nested attribute. This should be updated monthly before Key Reviews by the DRI.
lessons.learned | Optional | Nested under lessons; indicates lessons learned from a (K)PI. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one lesson learned.
monthly_focus | Optional | Indicates monthly focus goals for a (K)PI as a nested attribute. This should be updated monthly before Key Reviews by the DRI.
monthly_focus.goals | Optional | Indicates the monthly focus goals for a (K)PI. This should be updated monthly before Key Reviews by the DRI. Should be an array (indented lines starting with dashes) even if you only have one goal.
metric_name | Optional | Indicates the name of the metric in the Self-Managed implementation. The SaaS representation of the Self-Managed implementation should use the same name.
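
To make the shape of an entry concrete, here is a hypothetical (K)PI entry validated with a small Python snippet (requires PyYAML); the field values are illustrative only, apart from the example Sisense IDs reused from the table above:

```python
import yaml  # requires the PyYAML package

# Hypothetical (K)PI entry; field values are illustrative, except the example
# Sisense chart/dashboard IDs reused from the property table above.
entry_yaml = """
- name: Example Department MTTP
  base_path: /handbook/engineering/performance-indicators/
  definition: Elapsed time from merge to production deploy.
  target: less than 12 hours
  org: Infrastructure Department
  is_key: true
  public: true
  health:
    level: 3
    reasons:
      - Under target for the current month
  urls:
    - https://example.com/mttp-chart
  sisense_data:
    chart: 9090628
    dashboard: 634200
    embed: v2
"""

REQUIRED = ("name", "base_path", "definition", "target", "org", "is_key", "health")

for pi in yaml.safe_load(entry_yaml):
    missing = [key for key in REQUIRED if key not in pi]
    print(pi["name"], "-> missing required properties:", missing or "none")
```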