The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc.
The Scalability-Observability team is responsible for the observability of GitLab at scale. This includes architecting, rolling out, and operating the global observability platform that powers our SaaS Platforms. In addition, the Scalability-Observability team runs related business-critical processes such as capacity planning and error budget reporting.
Observability of a distributed monolith is complex. Many teams own code within a single project in the GitLab codebase, so working out who is responsible for a performance issue requires mapping pieces of code to teams across the company. To address this, we have created feature categories within our error budgets so that we can attribute any performance or availability issue directly to the team that can take action on it.
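As a rough illustration of how that attribution works, the sketch below computes error-budget consumption per feature category from request-level SLI counts. The category names, request counts, and SLO target are hypothetical, and the real calculation is driven by our metrics catalogue rather than hard-coded data.

```python
# Minimal sketch of per-feature-category error budget attribution.
# Assumes request-level SLI counts are already labelled with a feature
# category, as GitLab's application metrics are; the data and SLO target
# below are illustrative only.

SLO_TARGET = 0.999  # hypothetical availability target for the budget window

# (feature_category, total_requests, failed_requests) -- example data only
sli_counts = [
    ("source_code_management", 1_000_000,   600),
    ("continuous_integration", 2_500_000, 3_200),
    ("global_search",            400_000,   150),
]

def budget_report(counts, slo=SLO_TARGET):
    """Return availability and error-budget consumption per feature category."""
    report = {}
    for category, total, failed in counts:
        allowed_failures = total * (1 - slo)  # the failure budget for this window
        report[category] = {
            "availability": round(1 - failed / total, 5),
            "budget_spent": round(failed / allowed_failures, 2),  # > 1.0 means exhausted
        }
    return report

for category, stats in budget_report(sli_counts).items():
    print(category, stats)
```

Because every result is keyed by feature category, a budget overspend points directly at the team that owns that category rather than at the monolith as a whole.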
Observability will only become more challenging as our fleet grows. Both Cells and our Dedicated instances need observability that is durable and capable of operating independently of a global service, without losing data. At the same time, we need to provide observability for the entire fleet and drive attention to actionable notifications when issues occur. There is a limit to how far humans can scale, so we have to ensure that our observability and processes can scale both at the individual instance level and globally across our fleet.
For FY25, we’re investing in the following capabilities, which are linked to our group's 3-5 year strategy:
We are planning to revamp our current observability stack so that it can be rolled out as "units" alongside instances in our GitLab fleet. This will give us durable and resilient observability that can work standalone as well as roll up into a global service.
As part of preparing for the future, we will create the first version of the observability blueprint. This will ensure that we can roll out our observability stack across our entire fleet, including GitLab.com, Cells, and Dedicated, in a repeatable and consistent way, with a consistent set of tools.
A 'Unit' is described as 'an individual thing or person regarded as single and complete but which can also form an individual component of a larger or more complex whole.' In the future, after GitLab.com has moved to a cellular architecture, it will be essential that observability services can operate independently at the cell-local level and become eventually consistent at the GitLab fleet level. When the observability blueprint is complete, we will work towards building the first observability unit. This will include everything needed to provision, manage, and roll out changes to deployed observability units.
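To make the idea concrete, the sketch below shows one way an observability unit could be described: cell-local components that keep working on their own, plus an optional endpoint for rolling data up into the global, eventually consistent view. Every field name here is hypothetical and not a committed design; the blueprint will define the real shape.

```python
# Purely illustrative sketch of an "observability unit" specification.
# All field names are hypothetical assumptions, not the actual blueprint.
from dataclasses import dataclass, field

@dataclass
class ObservabilityUnit:
    cell_id: str
    # cell-local components that run standalone, even if the global service is down
    components: list[str] = field(default_factory=lambda: ["metrics", "logs", "alerting"])
    retains_data_locally: bool = True
    # where aggregated data is forwarded when the global service is reachable
    global_rollup_endpoint: str | None = None

unit = ObservabilityUnit(
    cell_id="cell-01",
    global_rollup_endpoint="https://observability.example.internal",
)
print(unit)
```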
The Scalability group manages the SOX-controlled capacity planning process for GitLab. At the time of writing, this process is run within GitLab itself, using issues and other native features, but it sits on top of the product rather than inside it. Over time, we will integrate our capacity planning process into GitLab the product so that self-managed customers can benefit from the automated capacity planning and saturation warnings that GitLab's SaaS fleet benefits from today.
In order to deliver value to customers and to grow trust in Tamland as the foundation of any GitLab capacity planning process, we must reduce the false positive rate to a level that generates minimal noise. Whilst some noise has been acceptable for GitLab team members with a deep understanding of the GitLab platform, we know that customers of GitLab will get more value out of precise, actionable alerts and a reduction in overall noise.
In FY24, we rolled out capacity planning with saturation forecasting for the entire fleet of GitLab instances. This was a great achievement; however, there is still a significant number of services that could benefit from capacity planning and saturation forecasts. We'll work over FY25 and beyond to roll out capacity planning to all GitLab services that could benefit from it.
GitLab's teams have benefited greatly from the introduction of an automated capacity planning process with forecasts of saturation. It's allowed us to proactively manage saturation issues before they cause service disruptions or become apparent to users. Many customers that run GitLab self-managed also have to manage a capacity planning process for GitLab itself. We will work towards building capacity planning into the product so that capacity planning for GitLab mostly manages itself, saving customers from toil.
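For readers unfamiliar with saturation forecasting, the sketch below illustrates the core idea: extrapolate a resource's utilisation trend and estimate how long until it crosses a saturation threshold. Tamland uses proper statistical forecasting with confidence intervals; this linear-trend version and its sample data are simplifications for illustration only.

```python
# Illustrative sketch of the saturation-forecasting idea behind capacity planning.
# A linear trend is fitted to recent utilisation samples and extrapolated to
# estimate the number of days remaining before a saturation threshold is hit.
import numpy as np

def days_until_saturation(daily_utilisation, threshold=0.9):
    """Extrapolate a linear trend; return days until threshold, or None if shrinking."""
    days = np.arange(len(daily_utilisation))
    slope, intercept = np.polyfit(days, daily_utilisation, deg=1)
    if slope <= 0:
        return None  # utilisation is flat or decreasing: no forecasted saturation
    crossing_day = (threshold - intercept) / slope
    return max(crossing_day - days[-1], 0.0)

# Hypothetical 14 days of disk utilisation for one service's resource
samples = [0.62, 0.63, 0.63, 0.65, 0.66, 0.66, 0.68,
           0.69, 0.70, 0.71, 0.72, 0.73, 0.74, 0.75]
print(days_until_saturation(samples))  # days left before hitting 90% at this rate
```

A forecast like this is what turns raw utilisation metrics into an early warning that can be acted on weeks before users notice anything.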
With the introduction of Dedicated in FY24, we now have more SaaS platforms that can provide us with performance and availability metrics across a range of installation sizes. In FY25 and beyond, we will update our error budgets so that our teams get a more accurate sense of how their features and services are performing across the GitLab fleet and not just on GitLab.com, where system headroom can hide minor issues and bugs.
Adding GitLab Dedicated to our error budget calculations will give us much richer information about how stage teams' features perform at a number of different scales. This information will help PMs and engineers better prioritize reliability work and make data-driven decisions, and should mean that GitLab performs better for all users, regardless of installation type.
At the time of writing, we have extensive error budgets in place for GitLab.com. Using dashboards and observability, teams can check in on the performance of their features and services and make sure they are hitting their SLOs and availability targets. Over the next year, we'll be working on increasing the completeness of error budgets and making them more reflective of the user experience on GitLab. We'll do this by increasing the number of services included in our error budgets program, expanding error budgets across the fleet to increase insight, stabilizing our global metrics stack and, finally, revisiting our SLA calculation for GitLab.com.
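As a rough sketch of what folding the wider fleet into those calculations could look like, the example below combines availability from several platforms into a single request-weighted number. The instance names, weighting approach, and figures are illustrative assumptions, not the finalised calculation.

```python
# Minimal sketch of a fleet-level availability view spanning GitLab.com and
# Dedicated instances. Weighting by request volume is an illustrative choice;
# the instances and counts below are example data only.

# (instance, total_requests, failed_requests)
fleet = [
    ("gitlab.com",         500_000_000, 250_000),
    ("dedicated-tenant-a",   2_000_000,   4_000),
    ("dedicated-tenant-b",     800_000,     200),
]

def fleet_availability(instances):
    """Request-weighted availability across all instances in the fleet."""
    total = sum(t for _, t, _ in instances)
    failed = sum(f for _, _, f in instances)
    return 1 - failed / total

print(f"{fleet_availability(fleet):.5%}")
```

A view like this surfaces problems that only appear at certain installation sizes, which a GitLab.com-only error budget would miss.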
As part of the Scalability Group’s responsibility to scale GitLab.com, there is a significant amount of operational load on the team. We regularly swarm around issues and production incidents, helping teams to quickly identify, root-cause, and solve GitLab.com problems. This work will typically be prioritized ahead of any project work to ensure that GitLab.com customers are not disrupted.
We’re not working on an Internal Developer portal or single pane of glass for observability right now. However, we do expect to contribute observability capability or configuration to an internal portal if it becomes available.
We are not working on Infrastructure Cost Data as this was deprioritized.
The list above can change and should not be taken as a hard commitment. For the most up-to-date information about our work, please see our top level epic.