GitLab.com's capacity planning is based on a forecasting model which is populated with the same saturation and utilization data that is used for short-term monitoring of GitLab.com.
The forecasting tool generates capacity warnings, which are converted to issues and raised in various status meetings.
We use and develop Tamland, which is our capacity forecasting tool. It relies on Facebook's Prophet library for forecasting time series data and generates forecasts on a daily basis. A report is published and any predicted saturation events result in an issue on the capacity planning issue tracker.
GitLab's Capacity Planning strategy is based on the following technologies:
The forecasting model uses the same saturation and utilization data model that we use to monitor GitLab.com over the short-term. This ensures that anything that we feel is worth monitoring as a potential saturation point will automatically be included in the forecasting model.
Because of this, all services used on GitLab.com are automatically included in the model.
The short-term saturation metric model used on GitLab.com models each resource as a percentage, from 0% to 100%, where 100% is completely saturated. Each resource has an alerting threshold (SLO). If this threshold is breached, alerts will fire and the engineer-on-call will be paged.
The thresholds are decided on a case-by-case basis and vary between resources. Some are near 100% while others are much lower, depending on the nature of the resource, its failure modes on saturation, and the required time-to-remediation. Resources are classed as either horizontally scalable or not. Horizontally scalable resources are generally lower priorities from a capacity planning point of view, whereas non-horizontally scalable resources (such as CPU on the primary PostgreSQL instance) require much longer-term remediation strategies and are therefore higher priorities in the capacity planning process.
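To make this model concrete, here is a minimal sketch of the idea in Python; the resource name, threshold value, and helper function are hypothetical and not taken from our actual metrics catalog.

```python
# Illustrative sketch only: the resource name, threshold and helper below are
# hypothetical, not the actual GitLab metrics catalog definitions.
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    slo_threshold: float         # alerting threshold, as a fraction of 1.0
    horizontally_scalable: bool  # non-scalable resources are higher capacity planning priorities


def saturation_status(resource: Resource, current_saturation: float) -> str:
    """Classify a resource given its current saturation (0.0 to 1.0)."""
    if current_saturation >= 1.0:
        return "saturated"
    if current_saturation >= resource.slo_threshold:
        return "slo-breached"  # alerts fire and the engineer-on-call is paged
    return "ok"


# Example: CPU on the primary PostgreSQL instance is not horizontally scalable,
# so it gets a comparatively low threshold and a high planning priority.
pg_cpu = Resource("patroni-cpu", slo_threshold=0.80, horizontally_scalable=False)
print(saturation_status(pg_cpu, 0.85))  # => "slo-breached"
```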
Tamland relies on Facebook's Prophet to generate its forecasting model. Prophet analyzes daily, weekly, monthly, and yearly patterns in the data to forecast its future trend.
Precisely predicting future saturation is difficult even for the most skilled engineer, so we do not expect the model to be totally accurate either. Instead, with hundreds of resources on GitLab.com that could potentially become saturated, Tamland's forecasts act as a bellwether for changes in trends, particularly upward ones, drawing the attention of the engineers who review the data to specific issues.
Tamland predicts a range of outcomes. For saturation purposes we focus on the median prediction (50th percentile) and the upper bound of the 80% confidence interval; the lower bound is less important for saturation purposes.
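For illustration, this is roughly what such a forecast looks like with Prophet; the data frame below is synthetic and the snippet is a sketch of the approach, not Tamland's actual implementation.

```python
# Sketch of a Tamland-style forecast using Prophet; the input data is synthetic.
import pandas as pd
from prophet import Prophet

# Prophet expects a data frame with a `ds` (timestamp) and `y` (value) column.
history = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=365, freq="D"),
    "y": [0.5 + 0.001 * i for i in range(365)],  # synthetic upward trend in saturation
})

model = Prophet(interval_width=0.80)  # 80% uncertainty interval
model.fit(history)

future = model.make_future_dataframe(periods=90)  # forecast 90 days ahead
forecast = model.predict(future)

# `yhat` is the median prediction and `yhat_upper` is the upper bound of the
# 80% confidence interval -- the two values the capacity planning process uses.
print(forecast[["ds", "yhat", "yhat_upper"]].tail())
```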
The forecast process, Tamland, runs as a GitLab CI job on ops.gitlab.net. The job runs on a schedule defined in the scheduled pipeline (set to execute daily). The process starts by reading up to one year of historical short-term saturation metric data from Thanos at an hourly resolution.
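As a rough sketch, reading that data from a Prometheus-compatible Thanos API could look like the following; the endpoint URL and metric selector are placeholders, not the real configuration.

```python
# Sketch of reading a year of hourly saturation data from a Prometheus-compatible
# Thanos endpoint. The URL and metric selector are placeholders.
import time

import requests

THANOS_URL = "https://thanos.example.com"  # placeholder endpoint
QUERY = 'gitlab_component_saturation:ratio{type="patroni"}'  # illustrative selector

end = time.time()
start = end - 365 * 24 * 3600  # up to one year of history

response = requests.get(
    f"{THANOS_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": 3600},  # hourly resolution
    timeout=60,
)
response.raise_for_status()
series = response.json()["data"]["result"]  # one time series per saturation component
```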
Capacity planning is a shared activity and dependent on input from many stakeholders:
- Tamland analyzes metrics data on a daily basis and creates capacity warning issues if it predicts that a resource will exceed its SLO within the forecast horizon.
- On a weekly basis, an engineer from the team reviews all open issues in the Capacity Planning tracker, following the process described on the Scalability:Projections team page. We add the ~"SaaS Weekly" label when we do the weekly triage.
- Issues that call for tuning of the forecasting model get the ~"capacity-planning::tune model" label and are not assigned to the Service Owner directly. Since these model tunings highly benefit from domain insight, the Scalability engineer involves Service Owners to get more information.

A Service Owner is the individual identified as the DRI for capacity planning for an individual service. This information is covered in the service catalog.
The Service Owner ideally has the closest insight into the service, its recent changes and events and is responsible for its availability and performance characteristics overall. Capacity Planning aims to help the Service Owner to be informed about trends in resource usage and predicted upcoming resource saturation events, which helps to act early and inform prioritization processes.
For capacity planning, the responsibilities of the Service Owner are:
- While many forecasts provide a clear and reliable outlook, not all forecasts will be accurate. For example, a sudden upward trend in the resource saturation metric may be caused by a factor that is known to be temporary, such as a long-running migration. The Service Owner is in the best position to know about these external factors and evaluates, based on all information at hand, whether the forecast is accurate and whether the issue requires investigation.
- The Service Owner notes down their findings on the issue and initiates the appropriate actions to remediate and prevent the saturation event. While the Service Owner is the DRI for the capacity warning, the Infradev Process and the SaaS Availability weekly standup assist with the prioritization of these capacity alerts.
- The Service Owner can also decide to change the Service Level Objective, the metric definition, or any other forecasting parameters that are used to generate capacity warnings. Please see the related documentation for further information. The Scalability:Projections team is available to assist, but the work should be owned by the DRI and their team.
- If the issue does not require investigation, it is important to follow up and improve the quality of the forecast or the process, so that the signal-to-noise ratio for capacity planning improves. This can include feeding external knowledge into the forecasting model or considering changes in automation to prevent getting capacity warnings too early. The Service Owner is expected to get in touch with Scalability:Projections to consider and work on potential improvements.
At any time, the Scalability:Projections team can be consulted and is ready to assist with questions around the forecasting or to help figure out the underlying reasons for a capacity warning.
We use the due date field to track when the next action is due: for example, the date we expect the issue to drop off the report, or the date we need to take another look at the forecast. We do this because we want to use the capacity planning issue board as the single source of truth. The due date is visible on this board, and it is easy to see which issues need attention.
The DRI for an issue is responsible for maintaining the due date and adding status information each time the due date is adjusted.
Capacity Planning issues are created without a state. After the initial assessment, one of the following labels should be applied.
- ~"capacity-planning::investigate" - this alert requires further active assessment before deciding on a course of action
- ~"capacity-planning::monitor" - we need to wait for time to pass to gather further data on this issue to make a decision on how to proceed
- ~"capacity-planning::tune model" - we determined the issue isn't relevant at this point in time and intend to tune the forecasting model while we continue to monitor the issue
- ~"capacity-planning::in-progress" - there is a mitigation in progress for this alert
- ~"capacity-planning::verification" - we have completed work on this issue and are verifying the result

Each issue has saturation labels, indicating which thresholds it exceeds and how. An issue can have multiple saturation labels; for instance, any issue with ~"saturation" will, by definition, also have the other three.
- ~"saturation" - this issue is predicted (by the median line) to reach 100% saturation in the next 90 days.
- ~"violation" - this issue is predicted (by the median line) to reach the saturation threshold (which varies by component) in the next 90 days.
- ~"saturation-80%-confidence" - this issue is predicted (by the upper end of the 80% confidence interval) to reach 100% saturation in the next 90 days.
- ~"violation-80%-confidence" - this issue is predicted (by the upper end of the 80% confidence interval) to reach the saturation threshold (which varies by component) in the next 90 days.
- ~"tamland:keep-open" - used to prevent Tamland from closing the issue automatically. This can be useful to validate the effect of changes we made for a longer period of time until we are confident about the effects.

All capacity planning issues also have the ~"GitLab.com Resource Saturation" label applied.
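A minimal sketch of how these four saturation labels relate to the forecast values; the function and thresholds are illustrative assumptions, not how Tamland actually assigns labels.

```python
# Illustrative only: how the forecast peaks over the next 90 days map to labels.
def saturation_labels(median_peak: float, upper_80_peak: float, slo_threshold: float) -> list[str]:
    """Return the labels implied by the forecast.

    `median_peak` and `upper_80_peak` are the highest forecast values of the median
    line and of the upper end of the 80% confidence interval, as fractions of 1.0.
    """
    labels = []
    if median_peak >= 1.0:
        labels.append("saturation")
    if median_peak >= slo_threshold:
        labels.append("violation")
    if upper_80_peak >= 1.0:
        labels.append("saturation-80%-confidence")
    if upper_80_peak >= slo_threshold:
        labels.append("violation-80%-confidence")
    return labels


# An issue whose median line reaches 100% carries all four labels, because the
# upper bound is at least the median and the SLO threshold is below 100%.
print(saturation_labels(median_peak=1.02, upper_80_peak=1.10, slo_threshold=0.9))
```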
The Scalability:Frameworks team uses capacity planning issues to drive prioritization. By taking saturation data as an input into the planning process, the Frameworks team can identify potential projects to balance proactive and reactive work streams.
The prioritization framework uses an Eisenhower Matrix, a 2x2 matrix based on urgency and importance:
| Quadrant | Urgency and importance | Capacity planning issues |
| --- | --- | --- |
| Quadrant 1: Do | Urgent, Important | Reactive: Non-horizontally scalable resources forecasted to saturate 100% in 90 days. |
| Quadrant 2: Decide | Less Urgent, Important | Proactive: Non-horizontally scalable resources forecasted to violate hard SLO in 90 days. |
| Quadrant 3: Delegate | Urgent, Less Important | Reactive: Horizontally scalable resources forecasted to saturate 100% in 90 days. |
| Quadrant 4: Deny | Less Urgent, Less Important | Proactive: Horizontally scalable resources forecasted to violate hard SLO in 90 days. |
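Expressed as a sketch, the quadrant follows mechanically from the two dimensions explained in the next paragraph; the function below is hypothetical, not the Frameworks team's actual tooling.

```python
# Hypothetical helper, not the Frameworks team's actual tooling.
def eisenhower_quadrant(horizontally_scalable: bool, reaches_full_saturation: bool) -> str:
    urgent = reaches_full_saturation       # 100% saturation vs. hard SLO violation
    important = not horizontally_scalable  # non_horizontal vs. horizontal
    if urgent and important:
        return "Quadrant 1: Do"
    if important:
        return "Quadrant 2: Decide"
    if urgent:
        return "Quadrant 3: Delegate"
    return "Quadrant 4: Deny"


# Example: CPU on the primary PostgreSQL instance forecast to hit 100% within 90 days.
print(eisenhower_quadrant(horizontally_scalable=False, reaches_full_saturation=True))
```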
Urgent is based on the forecast threshold (e.g. 100% saturation vs. hard SLO violation) and important is based on whether the resource is horizontally scalable (e.g. non_horizontal vs. horizontal). The following resources are available for prioritization:
In this section, we discuss a few capacity planning issues and describe how we applied the process above when addressing them.
gitlab-com/gl-infra/capacity-planning#364
We split out some operations from redis-cache to a new redis-repository-cache instance to reduce our CPU utilization on redis-cache. We had planned this weeks in advance due to a capacity planning warning, and we were able to roll it out the day after we had a production page about CPU saturation for redis-cache.
If we hadn't had the capacity planning step in there, we might have noticed this problem much later and had to scramble to implement mitigations in a high-pressure environment. Instead, we just accelerated our existing timeline slightly and resolved it with a clean solution.
- The change had already been rolled out to the pre and gstg environments in gitlab-com/gl-infra/scalability#2050 and gitlab-com/gl-infra/scalability#2052.
- We rolled it out to gprd as a result, from gitlab-com/gl-infra/production#8309. This was intended to happen the following week due to staff availability.

gitlab-com/gl-infra/capacity-planning#42
gitlab-com/gl-infra/capacity-planning#45
gitlab-com/gl-infra/capacity-planning#144
gitlab-com/gl-infra/capacity-planning#31 and gitlab-com/gl-infra/capacity-planning#108