
Infrastructure Department Performance Indicators


Executive Summary

KPI | Health | Reason(s)

Infrastructure Hiring Actual vs Plan | Okay
  • Engineering is on plan, but we are lending some of our recruiters to sales this quarter, and we just put in place a new "one star minimum" rule that might decrease offer volume.
  • Health: Monitor health closely.

Infrastructure Average Location Factor | Attention
  • We are at our target of 0.58 exactly overall, but trending upward.
  • We need to get the target location factors in the charts.
  • It will probably be cleaner if we get this in Periscope.
  • We need to set the target location factors on our vacancies and make sure recruiting is targeting the right areas of the globe on a per-role basis.

GitLab.com Availability | Attention
  • We’re above the SLO threshold, but we also know the data needs to be better.
  • Maturity: Commit to and implement SLIs/SLOs for GitLab.com.
  • Take generalized metrics, then produce and track the resulting uptime metric.

Infrastructure Cost per GitLab.com Monthly Active Users | Okay
  • Met savings goals last quarter in the Working Group.
  • This is one of the first priorities for the Operations Analyst, Infrastructure role.
  • Manually make a rough calculation in the Working Group.

GitLab.com Performance | Attention
  • We are experiencing occasional slowness in both the frontend and Git operations.

Infrastructure cost vs plan | Okay
  • Met savings goals last quarter in the Working Group.
  • This is one of the first priorities for the Operations Analyst, Infrastructure role.
  • Manually make a rough calculation in the Working Group.

    Key Performance Indicators

    Infrastructure Hiring Actual vs Plan

    Are we able to hire high quality workers to build our product vision in a timely manner? Hiring information comes from BambooHR where employees are in the division `Engineering`.

    Target: 39 by Dec 31, 2019

    URL(s)

    Health: Okay

    Maturity: Level 2 of 3

    Infrastructure Average Location Factor

    We remain efficient financially if we are hiring globally, working asynchronously, and hiring great people in low-cost regions where we pay market rates. We track an average location factor by function and department so managers can make tradeoffs: they can hire in an expensive region when they really need specific talent unavailable elsewhere, and offset it with great people who happen to be in low-cost areas.

    Target: 0.58

    URL(s)

    Health: Attention

    Maturity: Level 2 of 3
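To make the comparison against the 0.58 target concrete, here is a minimal sketch. It assumes the department average is a simple unweighted mean of each team member's location factor (the page does not state the exact aggregation); the function names and sample values are hypothetical.

```python
from statistics import mean

TARGET_LOCATION_FACTOR = 0.58  # department target stated on this page


def average_location_factor(factors: list[float]) -> float:
    """Unweighted mean of per-team-member location factors (assumed aggregation)."""
    if not factors:
        raise ValueError("need at least one location factor")
    return mean(factors)


def location_factor_status(average: float, target: float = TARGET_LOCATION_FACTOR) -> str:
    """Level-only check: at or below target is on track, above target needs attention.

    Trends such as "trending upward" are not captured here.
    """
    return "Okay" if average <= target else "Attention"


if __name__ == "__main__":
    team = [0.45, 0.62, 0.70, 0.51]  # hypothetical location factors
    avg = average_location_factor(team)
    print(f"average location factor {avg:.2f}: {location_factor_status(avg)}")
```

Because this KPI is marked Attention while exactly on target, a real health rule would also need to weigh the trend, which this level-only check ignores.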

    GitLab.com Availability

    Percentage of time during which GitLab.com is fully operational and providing service to users within SLO parameters.

    Target: 99.95%

    URL(s)

    Health: Attention

    Maturity: Level 2 of 3
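As a worked illustration of what the 99.95% target allows: over a 30-day window (43,200 minutes), 0.05% unavailability is about 21.6 minutes. The sketch below assumes availability is measured as the fraction of minutes in a window that met SLO parameters; the minute granularity, function names, and sample numbers are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60
AVAILABILITY_TARGET = 0.9995  # 99.95%, from this page


def downtime_budget_minutes(target: float, days: int) -> float:
    """Minutes outside SLO parameters that `target` tolerates over `days` days."""
    return (1.0 - target) * days * MINUTES_PER_DAY


def availability(window_minutes: int, minutes_outside_slo: float) -> float:
    """Fraction of the window during which the service was within SLO parameters."""
    if window_minutes <= 0:
        raise ValueError("window must be positive")
    return (window_minutes - minutes_outside_slo) / window_minutes


if __name__ == "__main__":
    budget = downtime_budget_minutes(AVAILABILITY_TARGET, 30)
    print(f"30-day budget at 99.95%: {budget:.1f} minutes")
    measured = availability(30 * MINUTES_PER_DAY, 12.0)  # hypothetical: 12 minutes outside SLO
    print(f"measured {measured:.4%}, meets target: {measured >= AVAILABILITY_TARGET}")
```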

    Infrastructure Cost per GitLab.com Monthly Active Users

    This metric reflects the dollar cost necessary to support one user on GitLab.com. It is an important metric because it allows us to estimate Infrastructure costs as our user base grows. Infrastructure cost comes from NetSuite; it is all expenses with the department name of `Infrastructure`, excluding account 6999 (Allocation). This cost is divided by the number of monthly active users (MAU).

    URL(s)

    Health: Okay

    Maturity: Level 1 of 3
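The calculation described above can be sketched as follows. The `Expense` record shape, the account numbers other than 6999, and the amounts are hypothetical; in practice the expense lines come from NetSuite, and the MAU count is simply taken as an input here.

```python
from dataclasses import dataclass

ALLOCATION_ACCOUNT = 6999  # excluded per the KPI definition


@dataclass
class Expense:
    department: str
    account: int
    amount_usd: float


def infrastructure_cost(expenses: list[Expense]) -> float:
    """Sum expenses in the `Infrastructure` department, excluding account 6999 (Allocation)."""
    return sum(
        e.amount_usd
        for e in expenses
        if e.department == "Infrastructure" and e.account != ALLOCATION_ACCOUNT
    )


def cost_per_mau(expenses: list[Expense], monthly_active_users: int) -> float:
    """Infrastructure cost divided by GitLab.com monthly active users."""
    if monthly_active_users <= 0:
        raise ValueError("monthly_active_users must be positive")
    return infrastructure_cost(expenses) / monthly_active_users


if __name__ == "__main__":
    month = [
        Expense("Infrastructure", 6000, 120_000.0),  # hypothetical hosting spend
        Expense("Infrastructure", 6999, 30_000.0),   # Allocation: excluded
        Expense("Sales", 6000, 50_000.0),            # other department: excluded
    ]
    print(f"${cost_per_mau(month, 400_000):.2f} per MAU")
```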

    GitLab.com Performance

    This metric needs to reflect the performance of GitLab as experienced by users. It should capture both frontend and backend performance. Even though the Infrastructure department will be responsible for this metric, it will need other departments such as Development, Quality, PM, and UX to effect positive change.

    URL(s)

    Health: Attention

    Maturity: Level 1 of 3

    Infrastructure cost vs plan

    Tracks our actual infrastructure costs for GitLab.com against our planned infrastructure costs. We need this metric to manage our financial position.

    URL(s)

    Health: Okay

    Maturity: Level 1 of 3
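One simple way to express actual versus plan is a relative variance per period. The formula below is an assumption about presentation (the page does not define one), and the month keys and amounts are hypothetical.

```python
def variance_vs_plan(actual: float, planned: float) -> float:
    """Relative variance: positive means over plan, negative means under plan."""
    if planned == 0:
        raise ValueError("planned cost must be non-zero")
    return (actual - planned) / planned


def monthly_variance(actuals: dict[str, float], plan: dict[str, float]) -> dict[str, float]:
    """Variance for every month that has both an actual and a planned figure."""
    return {
        month: variance_vs_plan(actuals[month], planned)
        for month, planned in plan.items()
        if month in actuals
    }


if __name__ == "__main__":
    plan = {"2019-10": 100_000.0, "2019-11": 100_000.0, "2019-12": 105_000.0}
    actuals = {"2019-10": 97_500.0, "2019-11": 104_000.0}  # December not closed yet
    for month, v in monthly_variance(actuals, plan).items():
        print(f"{month}: {v:+.1%}")
```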

    Regular Performance Indicators

    Apdex and Error SLO per Service

    Each service at GitLab has two general SLO metrics. The "Apdex Score", simply put, is the percentage of requests to that service that complete within a satisfactory amount of time. The satisfactory-time thresholds are defined per service: for some services they are in microseconds, for others they could be seconds. The "Error Ratio" is the percentage of requests to a service that end in error.

    For each service in the system we define an acceptable threshold for both values. For apdex, we want the actual score to be above the threshold; for error ratio, we want it to be below the threshold. We don’t expect the apdex to be a perfect 100% or the error ratio to be a perfect 0%, but we would like both values to be within their predefined thresholds 100% of the time. Currently, the actual amount of time that they adhere to their SLO thresholds is far below this.

    URL(s)

    Health: Unknown

    Maturity: Level 1 of 3
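The two per-service checks can be sketched as below. This follows the page's simplified definition of apdex (share of requests completing within the satisfactory time); the standard Apdex formula also credits "tolerating" requests at half weight, which this sketch omits. The `ServiceSLO` type, the threshold values, and evaluating adherence over fixed measurement windows are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServiceSLO:
    satisfactory_seconds: float  # per-service latency threshold for apdex
    min_apdex: float             # apdex must stay at or above this
    max_error_ratio: float       # error ratio must stay at or below this


def apdex(latencies_s: list[float], satisfactory_s: float) -> float:
    """Share of requests that completed within the satisfactory time."""
    if not latencies_s:
        raise ValueError("no requests")
    return sum(1 for t in latencies_s if t <= satisfactory_s) / len(latencies_s)


def error_ratio(total_requests: int, failed_requests: int) -> float:
    """Share of requests that ended in error."""
    if total_requests <= 0:
        raise ValueError("no requests")
    return failed_requests / total_requests


def within_slo(slo: ServiceSLO, apdex_score: float, errors: float) -> bool:
    return apdex_score >= slo.min_apdex and errors <= slo.max_error_ratio


def slo_adherence(slo: ServiceSLO, windows: list[tuple[float, float]]) -> float:
    """Fraction of (apdex, error_ratio) windows meeting both thresholds; the goal is 1.0."""
    if not windows:
        raise ValueError("no windows")
    return sum(1 for a, e in windows if within_slo(slo, a, e)) / len(windows)


if __name__ == "__main__":
    web = ServiceSLO(satisfactory_seconds=1.0, min_apdex=0.95, max_error_ratio=0.001)
    score = apdex([0.2, 0.4, 0.9, 1.3, 0.5], web.satisfactory_seconds)
    ratio = error_ratio(total_requests=10_000, failed_requests=4)
    print(f"apdex={score:.2f} error_ratio={ratio:.4f} ok={within_slo(web, score, ratio)}")
```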

    Mean Time To Detection (MTTD)

    Measures the elapsed time between the onset of an anomalous condition and its actual detection, and serves as an indicator of our ability to monitor the environment and minimize incident resolution time.

    URL(s)

    Health: Unknown

    Maturity: Level 1 of 3

    Mean Time To Resolution (MTTR)

    Measures the elapsed time it takes us to recover when an incident occurs, and serves as an indicator of our ability to execute said recoveries.

    Target: 60m

    URL(s)

    Health: Problem

    Maturity: Level 2 of 3
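Given per-incident timestamps, MTTD and MTTR both reduce to averages of elapsed times. This sketch assumes each incident records an onset, a detection, and a resolution time, and that MTTR is measured from onset to resolution (the page does not say which start point is used); the `Incident` type and sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

MTTR_TARGET = timedelta(minutes=60)  # target from this page


@dataclass
class Incident:
    onset: datetime     # when the anomalous condition began
    detected: datetime  # when we noticed it
    resolved: datetime  # when service was recovered


def _mean_elapsed(pairs: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (end - start) across pairs, returned as a timedelta."""
    if not pairs:
        raise ValueError("no incidents")
    return timedelta(seconds=mean((end - start).total_seconds() for start, end in pairs))


def mttd(incidents: list[Incident]) -> timedelta:
    return _mean_elapsed([(i.onset, i.detected) for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    return _mean_elapsed([(i.onset, i.resolved) for i in incidents])


if __name__ == "__main__":
    t0 = datetime(2019, 11, 4, 9, 0)  # hypothetical incidents
    incidents = [
        Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=45)),
        Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=95)),
    ]
    print(f"MTTD={mttd(incidents)} MTTR={mttr(incidents)} "
          f"within target: {mttr(incidents) <= MTTR_TARGET}")
```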

    Mean Time Between Failures (MTBF)

    Measures the mean amount of time elapsed between incidents that affect GitLab.com’s availability.

    Target: 7d

    URL(s)

    Health: Problem

    Maturity: Level 2 of 3
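MTBF can be estimated from incident start times alone. This sketch assumes "time between incidents" means the gap from one incident's start to the next incident's start (some definitions instead measure from the end of one failure to the start of the next); the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

MTBF_TARGET = timedelta(days=7)  # target from this page


def mtbf(incident_starts: list[datetime]) -> timedelta:
    """Mean gap between consecutive incident start times."""
    if len(incident_starts) < 2:
        raise ValueError("need at least two incidents to measure a gap")
    ordered = sorted(incident_starts)
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(ordered, ordered[1:])]
    return timedelta(seconds=mean(gaps))


if __name__ == "__main__":
    starts = [datetime(2019, 11, 1), datetime(2019, 11, 4), datetime(2019, 11, 12)]
    value = mtbf(starts)
    # Higher is better: we want failures to be at least a week apart on average.
    print(f"MTBF={value}, meets target: {value >= MTBF_TARGET}")
```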

    Mean Time To Production (MTTP)

    Measures the elapsed time it takes us to deploy changes to production, and serves as an indicator of how quickly we can deploy changes into production.

    Target: 60m

    URL(s)

    Health: Unknown

    Maturity: Level 2 of 3
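A sketch of MTTP, assuming each change is timestamped when it is merged and when it reaches production (the page does not name the start event, so "merged" is an assumption). It also lists changes that exceeded the 60-minute target; the identifiers and data are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

MTTP_TARGET = timedelta(minutes=60)  # target from this page


def mttp(changes: dict[str, tuple[datetime, datetime]]) -> timedelta:
    """Mean of (deployed - merged) across changes keyed by a change identifier."""
    if not changes:
        raise ValueError("no changes")
    return timedelta(
        seconds=mean((deployed - merged).total_seconds() for merged, deployed in changes.values())
    )


def slow_changes(changes: dict[str, tuple[datetime, datetime]],
                 target: timedelta = MTTP_TARGET) -> list[str]:
    """Identifiers of changes whose time to production exceeded the target."""
    return sorted(cid for cid, (merged, deployed) in changes.items() if deployed - merged > target)


if __name__ == "__main__":
    t0 = datetime(2019, 11, 4, 9, 0)
    changes = {
        "change-a": (t0, t0 + timedelta(minutes=40)),
        "change-b": (t0, t0 + timedelta(hours=3)),
    }
    print(f"MTTP={mttp(changes)}, over target: {slow_changes(changes)}")
```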

    Deploys to Production per Month

    Tracks the total number of deployments to production over the course of the month, which allows us to measure our deployment speed to production.

    Target: 30

    URL(s)

    Health: Attention

    Maturity: Level 2 of 3

    Number of abandoned deployments per month

    Tracks the number of failed deployments to production over the course of the month due to regressions and/or outages, which allows us to measure our production deployment readiness.

    Target: 0

    Health: Problem

    Maturity: Level 2 of 3
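Both monthly counts can come from one list of deployment records. This sketch assumes each record carries a timestamp and a final status of either "completed" or "abandoned", and counts only completed deployments toward the deploys-per-month figure; the record shape and status values are hypothetical.

```python
from collections import Counter
from datetime import datetime

DEPLOYS_TARGET = 30    # deploys to production per month
ABANDONED_TARGET = 0   # abandoned deployments per month


def monthly_counts(deployments: list[tuple[datetime, str]]) -> dict[str, Counter]:
    """Map 'YYYY-MM' to a Counter of deployment statuses for that month."""
    counts: dict[str, Counter] = {}
    for when, status in deployments:
        month = f"{when.year:04d}-{when.month:02d}"
        counts.setdefault(month, Counter())[status] += 1
    return counts


if __name__ == "__main__":
    deployments = [
        (datetime(2019, 11, 1, 10), "completed"),
        (datetime(2019, 11, 2, 10), "abandoned"),
        (datetime(2019, 11, 3, 10), "completed"),
    ]
    for month, c in sorted(monthly_counts(deployments).items()):
        # Counter returns 0 for statuses that never occurred in the month.
        print(
            f"{month}: deploys={c['completed']} (target {DEPLOYS_TARGET}), "
            f"abandoned={c['abandoned']} (target {ABANDONED_TARGET})"
        )
```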

    Disaster Recovery (DR) Time-to-Recover

    Tracks time to recover full operational status in case of a catastrophic incident in our primary production environment.

    Target: 60m

    Health: Unknown

    Maturity: Level 2 of 3

    Other PI Pages

    Legends

    Maturity

    Level | Meaning
    Level 3 of 3 | Has a description, target, and Periscope data.
    Level 2 of 3 | Missing one of: description, target, or Periscope data.
    Level 1 of 3 | Missing two of: description, target, or Periscope data.
    Level 0 of 3 | Missing a description, a target, and Periscope data.
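The maturity level is effectively a count of which of the three elements an indicator has. A minimal sketch, with hypothetical parameter names:

```python
def maturity_level(has_description: bool, has_target: bool, has_periscope_data: bool) -> str:
    """Level = number of the three elements present (description, target, Periscope data)."""
    level = sum([has_description, has_target, has_periscope_data])
    return f"Level {level} of 3"


if __name__ == "__main__":
    # Mean Time To Resolution above has a description and a target, and is Level 2 of 3.
    print(maturity_level(has_description=True, has_target=True, has_periscope_data=False))
```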

    Health

    Level | Meaning
    Okay | The KPI is at an acceptable level compared to the threshold.
    Attention | This is a blip, or we’re going to watch it, or we just need to enact a proven intervention.
    Problem | We'll prioritize our efforts here.
    Unknown | Unknown

    How to work with pages like this

    Data

    The heart of pages like this is a data file called /data/performance_indicators.yml which is in YAML format. Almost everything you need to do will involve edits to this file. Here are some tips:

    Pages

    Pages like /handbook/engineering/performance-indicators/ are rendered by an ERB template.

    These ERB templates call the helper function performance_indicators() that is defined in /helpers/custom_helpers.rb. This helper function calls in several partial templates to do its work.

    This function takes a required argument named org, in string format, that limits the scope of the page to a portion of the data file. Valid values for this org argument are listed in the org property of each element in the array in /data/performance_indicators.yml.
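The real helper is Ruby, defined in /helpers/custom_helpers.rb; the Python sketch below only illustrates the scoping idea. It assumes /data/performance_indicators.yml has already been parsed into a list of dictionaries. Every key except `org`, the sample values, and the validation step are hypothetical.

```python
def indicators_for_org(indicators: list[dict], org: str) -> list[dict]:
    """Keep only the entries whose `org` property matches the requested org."""
    valid_orgs = {item["org"] for item in indicators}
    if org not in valid_orgs:
        # Hypothetical validation: reject an org that no entry declares.
        raise ValueError(f"unknown org {org!r}; valid values: {sorted(valid_orgs)}")
    return [item for item in indicators if item["org"] == org]


if __name__ == "__main__":
    data = [  # hypothetical parsed YAML entries
        {"name": "GitLab.com Availability", "org": "Infrastructure Department"},
        {"name": "Engineering Hiring Actual vs Plan", "org": "Engineering Function"},
    ]
    for entry in indicators_for_org(data, "Infrastructure Department"):
        print(entry["name"])
```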