GitLab.com's capacity planning is based on a forecasting model which is populated with the same saturation and utilization data that is used for short-term monitoring of GitLab.com.
The forecasting tool generates warnings which are converted to issues and these issues are raised in various status meetings.
At present, we use Facebook's Prophet library for forecasting. The model is used to generate a report, which is published weekly to https://gitlab-com.gitlab.io/gl-infra/tamland.
GitLab's Capacity Planning strategy is based on the following technologies:
The forecasting model uses the same saturation and utilization data model that we use to monitor GitLab.com over the short-term. This ensures that anything that we feel is worth monitoring as a potential saturation point will automatically be included in the forecasting model.
Because of this, all services used on GitLab.com are automatically included in the model.
The short-term saturation metric model used on GitLab.com models each resource as a percentage, from 0% to 100%, where 100% is completely saturated. Each resource has an alerting threshold. If this threshold is breached, alerts will fire and the engineer-on-call will be paged.
The thresholds are decided on a case-by-case basis and vary between resources. Some are near 100% while others are much lower, depending on the nature of the resource, the failure modes on saturation of the resource and the required time-to-remediation. Resources are classed as being either horizontally scalable or not. Horizontally scalable resources are generally considered lower priorities from a capacity planning point of view, whereas non-horizontally scalable resources (such as CPU on the primary Postgres instance, for example) require much longer-term strategies for remediation and are therefore considered higher priorities in the capacity planning process.
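As a rough illustration of the model described above, the following sketch represents a resource as a ratio between 0 and 1 with its own alerting threshold. The class, field names, resource names and threshold values are all hypothetical and are not taken from the actual GitLab.com configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SaturationPoint:
    """One monitored resource. All field names here are hypothetical."""
    name: str
    alert_threshold: float       # ratio in [0, 1], chosen per resource
    horizontally_scalable: bool  # False implies longer-term remediation


def breaches_threshold(point: SaturationPoint, current_ratio: float) -> bool:
    """Return True when the current saturation ratio exceeds the alert threshold."""
    if not 0.0 <= current_ratio <= 1.0:
        raise ValueError("saturation is modelled as a ratio between 0 and 1")
    return current_ratio > point.alert_threshold


# Illustrative values only; real thresholds are decided case by case.
pg_cpu = SaturationPoint("pg_primary_cpu", alert_threshold=0.80, horizontally_scalable=False)
web_workers = SaturationPoint("web_workers", alert_threshold=0.95, horizontally_scalable=True)
```

With these example values, a reading of 0.85 would breach the threshold for `pg_cpu` but not for `web_workers`, which shows why the same utilization can page for one resource and not another.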
Tamland relies on Facebook Prophet for generating a forecasting model. Prophet performs analysis of hourly, daily, weekly and monthly trends to forecast a future trend in the data.
Even the most skilled engineer would struggle to predict future saturation, so it's unlikely that a model could do it either. We do not expect it to be totally accurate. Instead, with hundreds of resources on GitLab.com that could potentially become saturated, Tamland's forecasts are a bellwether for changes in trends, particularly upward changes, drawing the attention of engineers who review the data to specific issues.
Tamland will attempt to predict a range of outcomes. For saturation, we focus on the median prediction (50th percentile) and on the upper end of the 80% confidence interval. The lower end of that interval is not as important for saturation purposes.
The forecast process, Tamland, runs as a GitLab CI job on ops.gitlab.net. This job will run on a schedule defined in the scheduled pipeline (set to weekly). The process starts by reading the historical short-term saturation metric data from Thanos, for a period of up to one year, using an hourly resolution.
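To make the one-year, hourly read concrete, here is a minimal sketch of the parameters for a range query against the Prometheus-compatible `/api/v1/query_range` endpoint that Thanos Query exposes. The base URL and the PromQL expression are placeholders, and this is not Tamland's actual query code.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Hypothetical endpoint; the real Thanos address is not given in this document.
THANOS_BASE_URL = "https://thanos.example.invalid"


def build_range_query_params(promql: str, end: datetime,
                             lookback: timedelta = timedelta(days=365),
                             step: timedelta = timedelta(hours=1)) -> dict:
    """Build query_range parameters covering `lookback` up to `end`."""
    start = end - lookback
    return {
        "query": promql,
        "start": start.timestamp(),        # Unix timestamps are accepted
        "end": end.timestamp(),
        "step": int(step.total_seconds()),  # resolution in seconds
    }


def build_range_query_url(promql: str, end: datetime) -> str:
    """Return the full URL for a one-year, hourly range query."""
    params = build_range_query_params(promql, end)
    return f"{THANOS_BASE_URL}/api/v1/query_range?{urlencode(params)}"


# Placeholder expression; substitute the real saturation metric.
example_params = build_range_query_params(
    "example_saturation_ratio", datetime(2024, 1, 1, tzinfo=timezone.utc)
)
```

A one-year window at a one-hour step yields roughly 8,760 samples per series, which is a manageable input size for a weekly forecasting job.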
We use the due date field to track when the next action is due. For example: the date we expect the issue to drop off the report, or the date we expect the DRI to have taken action. We do this because we want to use the capacity planning issue board as the single source of truth.
The due date is visible on this board and it is easy to see which issues need attention.
The DRI is responsible for maintaining the due date and adding status information each time the due date is adjusted.
Capacity Planning issues are created without a state. After the initial assessment, one of the following labels should be applied.
capacity-planning::investigate - this alert requires further active assessment before deciding on a course of action
capacity-planning::monitor - we need to wait for time to pass to gather further data on this issue to make a decision on how to proceed
capacity-planning::in-progress - there is a mitigation in progress for this alert
capacity-planning::verification - we have completed work on this issue and are verifying the result
Each issue has saturation labels, indicating which thresholds it exceeds and how. An issue can have multiple saturation labels; for instance, any issue with saturation will, by definition, also have the other three.
saturation - this issue is predicted (by the median line) to reach 100% saturation in the next 90 days.
violation - this issue is predicted (by the median line) to reach the saturation threshold (which varies by component) in the next 90 days.
saturation-80%-confidence - this issue is predicted (by the upper end of the 80% confidence interval) to reach 100% saturation in the next 90 days.
violation-80%-confidence - this issue is predicted (by the upper end of the 80% confidence interval) to reach the saturation threshold (which varies by component) in the next 90 days.
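The label definitions above can be sketched as a small function. This is an illustrative reading of the rules, not Tamland's implementation: the function name, the daily granularity of the inputs and the use of `>=` for "reach" are assumptions.

```python
from typing import List, Sequence


def saturation_labels(median: Sequence[float], upper_80: Sequence[float],
                      threshold: float, horizon_days: int = 90) -> List[str]:
    """Derive saturation labels from daily forecast ratios (0 to 1).

    `median` is the median forecast line, `upper_80` the upper end of the
    80% confidence interval, and `threshold` the per-component threshold.
    """
    median_window = median[:horizon_days]
    upper_window = upper_80[:horizon_days]

    labels = []
    if any(v >= 1.0 for v in median_window):
        labels.append("saturation")
    if any(v >= threshold for v in median_window):
        labels.append("violation")
    if any(v >= 1.0 for v in upper_window):
        labels.append("saturation-80%-confidence")
    if any(v >= threshold for v in upper_window):
        labels.append("violation-80%-confidence")
    return labels
```

Because the threshold is at most 100% and the upper bound is never below the median, a forecast that earns `saturation` also earns the other three labels, which matches the statement above.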
The Scalability:Projections team owns the Capacity Planning process and we aim to enable others to take responsibility for the capacity demands of their features and services.
The Scalability:Frameworks team uses capacity planning issues to drive prioritization. By taking saturation data as an input into the planning process, the Frameworks team can identify potential projects to balance proactive and reactive work streams.
The prioritization framework uses an Eisenhower Matrix, a 2x2 matrix based on urgency and importance:
| Quadrant | Urgency and importance | Criteria |
| --- | --- | --- |
| Quadrant 1: Do | Urgent, Important | Reactive: Non-horizontally scalable resources forecasted to saturate 100% in 90 days. |
| Quadrant 2: Decide | Less Urgent, Important | Proactive: Non-horizontally scalable resources forecasted to violate hard SLO in 90 days. |
| Quadrant 3: Delegate | Urgent, Less Important | Reactive: Horizontally scalable resources forecasted to saturate 100% in 90 days. |
| Quadrant 4: Deny | Less Urgent, Less Important | Proactive: Horizontally scalable resources forecasted to violate hard SLO in 90 days. |
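The matrix above reduces to two yes/no questions, which the following sketch encodes. The enum and function names are hypothetical. Urgency is taken from whether the forecast reaches 100% saturation (rather than only a hard SLO violation) within 90 days, and importance from whether the resource is non-horizontally scalable.

```python
from enum import Enum


class Quadrant(Enum):
    DO = 1        # Urgent, Important
    DECIDE = 2    # Less Urgent, Important
    DELEGATE = 3  # Urgent, Less Important
    DENY = 4      # Less Urgent, Less Important


def classify(horizontally_scalable: bool, saturates_100_in_90_days: bool) -> Quadrant:
    """Map a forecasted resource onto the Eisenhower Matrix quadrants."""
    urgent = saturates_100_in_90_days      # otherwise: hard SLO violation only
    important = not horizontally_scalable  # harder to remediate, so higher priority
    if important:
        return Quadrant.DO if urgent else Quadrant.DECIDE
    return Quadrant.DELEGATE if urgent else Quadrant.DENY
```

For example, CPU on the primary Postgres instance forecasted to reach 100% would fall into `Quadrant.DO`, while a horizontally scalable fleet forecasted only to violate its hard SLO would fall into `Quadrant.DENY`.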
Urgent is based on forecast threshold (e.g. 100% saturation vs. hard SLO violation) and important is based on scalable resources (e.g. horizontal). The following resources are available for prioritization:
The forecasts are reviewed in the weekly Engineering Allocation meeting and any required corrective actions are prioritized according to the timeframes for saturation predicted by the forecast, and the criticality of the resources.
Practically, this is done by:
Actions described above can also take place asynchronously at any time - we should not wait for the Engineering Allocation meeting to update issue status or find DRIs for issues.
The Scalability:Projections team will triage the capacity alerts by labeling them with the relevant severity/priority labels and assigning them to the appropriate owner. We rely on the Infradev Process to assist with prioritization of these capacity issues. We remain available for guidance and review support.
There are three scenarios that can occur:
The severity and priority assigned to these Infradev issues will be based on the alert information. If the alert continues to fire, the severity and priority will be raised appropriately.
When the DRI has been identified, the issue will be assigned to them and a comment will be added to describe how we identified them as the owner as well as what is expected from them to resolve the issue. The DRI should keep the issue updated by maintaining the due date and adding status information each time the due date is adjusted.
If the issue belongs to a specific team, the team label will also be applied.
In this section, we discuss a few capacity planning issues and describe how we applied the process above when addressing them.
We split out some operations from redis-cache to a new redis-repository-cache instance to reduce our CPU utilisation on redis-cache. We had planned this for weeks in advance due to a capacity planning warning, and we were able to roll this out the day after we had a production page about CPU saturation for redis-cache.
If we hadn't had the capacity planning step in there, we may have noticed this problem much later, and had to scramble to implement mitigations in a high-pressure environment. Instead, we just accelerated our existing timeline slightly, and resolved it with a clean solution.
gstg environments in gitlab-com/gl-infra/scalability#2050 and gitlab-com/gl-infra/scalability#2052.
gprd as a result, from gitlab-com/gl-infra/production#8309. This was intended to happen the following week due to staff availability.
gitlab-com/gl-infra/capacity-planning#31 and gitlab-com/gl-infra/capacity-planning#108