This team focuses on forecasting and projection systems that enable development engineering to understand system growth (planned and unplanned) for their areas of responsibility. Error Budgets and Stage Group Dashboards are examples of successful projects that have given development teams information about how their code runs on GitLab.com.
As Dedicated becomes more mature, we will expand our remit to include projection activities for this platform.
We use metrics to gather data to inform our decisions. We contribute to the observability of the system by maintaining metrics that concern saturation and improving observability tools that we can use to help us understand how the system responds to load.
The following people are members of the Scalability:Projections team:
We are responsible for Capacity Planning, Error Budgets and Infrastructure Cost Data.
We maintain and improve the Capacity Planning process that is described in the Infrastructure Handbook. This is a controlled activity covered by SOC 2. Please see this issue for further details.
The goal of this process is to predict and prevent saturation incidents on GitLab.com.
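To make the idea of predicting saturation concrete, here is a minimal sketch that fits a straight line to recent utilization samples and estimates how many days remain before a threshold is crossed. It is an illustration only: the forecasting model used in the actual process is not described on this page, and the function name `days_until_threshold`, the sample format, and the 0.90 threshold are all assumptions made for this example.

```python
def days_until_threshold(samples, threshold=1.0):
    """Estimate days until a resource reaches `threshold` utilization.

    `samples` is a list of (day_index, utilization) pairs, where utilization
    is a ratio between 0 and 1. A least-squares straight line is fitted to
    the samples (an assumed, deliberately simple trend model).

    Returns 0.0 if the fitted line is already at or above the threshold at
    the most recent sample, None if the trend is flat or decreasing, and
    otherwise the number of days after the most recent sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("samples must span more than one day")

    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    intercept = mean_y - slope * mean_x

    fitted_last = intercept + slope * max(xs)
    if fitted_last >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - fitted_last) / slope


# Hypothetical weekly disk utilization samples: (day, ratio used).
weekly = [(0, 0.60), (7, 0.64), (14, 0.68), (21, 0.72)]
remaining = days_until_threshold(weekly, threshold=0.90)
print(f"about {remaining:.1f} days until 90% utilization")
```

A straight line ignores seasonality and step changes, so a result like this is best read as a prompt to investigate rather than as a deadline.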
Issues are kept in the capacity planning issue tracker. Where an issue is needed to improve metrics to support this process, we raise an issue in the Scalability group tracker with the label of Saturation Metrics.
Week starting | Person |
---|---|
2023-03-20 | Hercules |
2023-03-27 | Bob |
2023-04-03 | Jacob |
2023-04-10 | Jacob |
2023-04-17 | Sylvester |
2023-04-24 | Sylvester |
2023-05-01 | Marco |
2023-05-08 | Marco |
2023-05-15 | Matt |
2023-05-22 | Matt |
2023-05-29 | Chance |
2023-06-05 | Chance |
2023-06-12 | Stephanie |
2023-06-19 | Stephanie |
2023-06-26 | Alejandro |
2023-07-03 | Alejandro |
2023-07-10 | Igor |
2023-07-17 | Igor |
2023-07-24 | Hercules |
2023-07-31 | Hercules |
The responsibility for reviewing Tamland reports rotates between all members of the Scalability Group.
The rotation lasts for a minimum of two weeks. There is flexibility in the schedule to allow for OOO and on-call responsibilities.
The length of the rotation cycle is intended to give each person exposure to the wide variety of capacity warnings that occur and to help them gain context on the components that we monitor.
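As a rough sketch of how a schedule like the table above could be generated, the snippet below gives each person a fixed number of consecutive weeks, starting on a Monday. The function name `build_rotation` and its parameters are hypothetical, and the sketch does not model swaps for OOO or on-call, which would still be adjusted by hand.

```python
from datetime import date, timedelta


def build_rotation(people, first_monday, weeks_each=2):
    """Return (week_start, person) rows for a simple rotation.

    Each person in `people` gets `weeks_each` consecutive weeks, in the
    order listed. Weeks are assumed to start on a Monday.
    """
    if first_monday.weekday() != 0:
        raise ValueError("rotation weeks are assumed to start on a Monday")

    rows = []
    week_start = first_monday
    for person in people:
        for _ in range(weeks_each):
            rows.append((week_start, person))
            week_start += timedelta(weeks=1)
    return rows


# Reproduces four rows of the schedule shown above.
rows = build_rotation(["Jacob", "Sylvester"], date(2023, 4, 3))
for week_start, person in rows:
    print(f"{week_start.isoformat()} | {person} |")
```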
The triage duties are:
capacity-planning:: workflow label). The saturation labels can help in choosing which issues to review first, if there are many with the same due date.
When your rotation is finished, you need to provide handover notes in the #infra_capacity-planning channel for the incoming person.
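The ordering hint in the triage duties (due date first, saturation label as a tie-breaker) can be sketched as a two-part sort key. The `Issue` class, the label names, and their ranks below are hypothetical placeholders chosen for this example, not the labels used in the tracker.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Issue:
    title: str
    due: date
    saturation_label: str


def triage_order(issues, rank):
    """Sort issues by due date, then by the rank of their saturation label.

    `rank` maps a label name to an integer, where lower means "review
    first". Labels missing from `rank` sort after all known labels.
    """
    unknown = len(rank)
    return sorted(
        issues,
        key=lambda issue: (issue.due, rank.get(issue.saturation_label, unknown)),
    )


# Hypothetical label names and ranks, for illustration only.
example_rank = {"label-most-urgent": 0, "label-less-urgent": 1}

issues = [
    Issue("redis memory", date(2023, 5, 1), "label-less-urgent"),
    Issue("disk space", date(2023, 5, 1), "label-most-urgent"),
    Issue("pgbouncer pool", date(2023, 4, 24), "label-less-urgent"),
]

for issue in triage_order(issues, example_rank):
    print(issue.due.isoformat(), issue.title)
```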
Some tips to help you to get started on duties:
component: <component_name> (e.g. component: disk_space) in the runbooks project; the underlying recording rule can be found in rules/autogenerated-saturation.yml (example for component: disk_space)
We maintain the Error Budgets process that is described in the Engineering Handbook.
Issues are kept in the Scalability group tracker with the label of Category::Error Budgets.
We maintain the metrics used to generate the Error Budgets and we ensure that the reports are published on time.
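For readers new to the concept, the arithmetic behind a time-based error budget can be sketched as follows. The 99.95% target and the 28-day window are assumptions used for illustration; the Engineering Handbook defines the actual targets and how budget spend is measured. The function names are likewise hypothetical.

```python
def error_budget_minutes(slo_target, window_days=28):
    """Minutes of allowed unavailability in the window for an SLO target.

    `slo_target` is a ratio such as 0.9995 (99.95%).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a ratio between 0 and 1")
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, minutes_spent, window_days=28):
    """Budget left after `minutes_spent`; negative means the budget is exceeded."""
    return error_budget_minutes(slo_target, window_days) - minutes_spent


budget = error_budget_minutes(0.9995)
print(f"budget: {budget:.2f} minutes per 28 days")
print(f"remaining after 5 minutes spent: {budget_remaining(0.9995, 5):.2f} minutes")
```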
We advocate for improving the SLOs for Stage Groups and we provide support to help them achieve this. Providing the Stage Groups with data about how their feature categories operate on GitLab.com enables them to make good choices about how to efficiently improve the reliability, availability and performance of their feature categories.
The Scalability group is an owner of several performance indicators that roll up to the Infrastructure department indicators:
These are combined to enable us to better prioritize team projects.
An overly simplified example of how these indicators might be used, in no particular order:
Between these different signals, we have a relatively (im)precise view into the past, present and future to help us prioritize scaling needs for GitLab.com.