|Slack Channels||#g_scalability /
||#infrastructure-lounge (Infrastructure Group Channel), #incident-management (Incident Management), #alerts-general (SLO alerting), #mech_symp_alerts (Mechanical Sympathy Alerts)|
|Sisense Dashboard||Useful Charts|
The Scalability group is currently formed of two teams:
The Projections team puts systems in place to enable other teams to make decisions (inform). The Frameworks team puts systems in place to enable other teams to act on those decisions (action).
The Frameworks team creates standard ways to use and scale the various services and technologies used at GitLab, with a particular focus on enabling other development teams to support their own growth. This is akin to Platform Engineering.
For example, with Redis, we are creating more and more instances now and it’s at the point where teams should have the ability to know when they need a custom instance, and how to put one in place. Similarly, with Object Storage, this is a component that is widely used but the usage is not consistent. We can help the development teams become more efficient by providing a structure over this storage.
More information on Frameworks is available on the team page.
This team focuses on observability, forecasting & projection systems that enable development engineering to predict system growth for their areas of responsibility. Error Budgets and Stage Group Dashboards are examples of successful projects that have provided development teams information about how their code runs on our SaaS platforms.
More information on Projections is available on the team page.
The Scalability group is responsible for GitLab at scale, working on the highest priority scaling items related to our SaaS platforms. The group works in close coordination with Reliability Engineering teams and Platform Engineering teams. We support other Engineering teams by sharing data and techniques so they can become better at scalability as well.
As its name implies, the Scalability group enhances the availability, reliability, and performance of GitLab's SaaS platforms by observing the application's capabilities to operate at scale.
The Scalability group analyzes application performance on GitLab's SaaS platforms, recognizes bottlenecks in service availability, proposes (and develops) short term improvements and develops long term plans that help drive the decisions of other Engineering teams.
Short term goals for the group include:
The Scalability Group aligns with the Platforms Department direction for FY24. We are framing our direction using our Themes, having selected Horizontal Scalability and Advocacy/Facilitation as our main focus for the year.
In FY23, the Scalability Group has focused quite heavily on making sure that none of our Redis instances are at risk of saturation, with several projects in process to support this goal. In the year ahead, we need to achieve a lower-touch model, similar to what we achieved with Sidekiq. When we have defined a scaling strategy per instance and automated as many of the scaling options as possible, we will have a clearer path to defining what is required for long-term ownership of this service.
From a practical perspective, this means:
We have developed processes and systems for delivering information to stage groups and reliability teams about the services used to deliver GitLab.com. To the stage groups, we deliver availability information through Error Budgets and to the Reliability teams we deliver capacity planning information with the help of Tamland. Our ability to deliver this information and expand our offering lies with having good data emitted from mature services. In order of criticality, we will expand the information we offer for services and link this to the Service Maturity Model. By offering additional features to services that are more mature we hope to encourage service owners to continually improve the maturity of their services.
Initial thoughts for this are:
We already share certain elements of resource usage information with the stage groups to raise awareness about the resources required to support their feature categories. We will expand on the data we provide and include resource cost information. This can take the form of cost per service (for example, how is the storage cost of the database attributed to each feature category), or per feature category (for example, what is the sum of all resources used to deliver GitLab Pages). Access to this information can support product leadership in investment decisions.
A future iteration for another year might be to incorporate usage information into a projections tool (like Tamland) to estimate future costs.
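The two views described above (cost per service and cost per feature category) can be sketched as follows. This is a minimal illustration only: the service names, cost figures, usage shares, and function names are all hypothetical, and it assumes a service's cost can be split in proportion to each feature category's share of usage.

```python
from collections import defaultdict

# Hypothetical monthly cost per service, in arbitrary currency units.
SERVICE_COST = {"database": 1000.0, "object_storage": 400.0}

# Hypothetical share of each service's usage per feature category.
# The shares for one service are expected to sum to 1.0.
USAGE_SHARE = {
    "database": {"pages": 0.1, "code_review": 0.6, "continuous_integration": 0.3},
    "object_storage": {"pages": 0.5, "continuous_integration": 0.5},
}


def cost_per_service(service_cost, usage_share):
    """Split each service's cost across feature categories by usage share."""
    return {
        service: {
            category: service_cost[service] * share
            for category, share in shares.items()
        }
        for service, shares in usage_share.items()
    }


def cost_per_feature_category(attributed):
    """Sum the attributed cost of every service for each feature category."""
    totals = defaultdict(float)
    for shares in attributed.values():
        for category, cost in shares.items():
            totals[category] += cost
    return dict(totals)


if __name__ == "__main__":
    attributed = cost_per_service(SERVICE_COST, USAGE_SHARE)
    for category, cost in sorted(cost_per_feature_category(attributed).items()):
        print(f"{category}: {cost:.2f}")
```

The first function answers the per-service question (how a single service's cost is attributed), and the second answers the per-feature-category question (the sum across all services). How usage shares would actually be measured is outside the scope of this sketch.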
We are establishing a performance indicator for Error Budgets. This includes a score for the "completeness" of the Error Budget data. We will include the remaining operation types into Error Budgets and provide the necessary tools for the stage groups to inspect these operations.
Currently our focus is on GitLab.com, but we must expand our focus to include other SaaS platforms such as Dedicated. The Scalability group focused on GitLab.com exclusively while it was the sole SaaS platform operated by GitLab Inc. With the introduction of GitLab Dedicated, we must expand our expertise and reach to this single-tenant offering. The scale there is measured in the number of instances running at any given time and, to a lesser extent, the size of those instances.
Throughout FY24, we need to provide help to GitLab Dedicated teams to ensure that individual tenant SLAs are consistently met, and offer our expertise in creating a set of views that allow the team to quickly identify operational bottlenecks.
In FY23, the Scalability Group took ownership of the capacity planning process. We improved the process to make it simpler to operate, but there is still a large amount of manual intervention required by the Scalability Group. Our time available for these activities should rather be spent on supporting service owners with their capacity needs. This enables us to monitor the outliers and provide support where needed.
First steps may include:
The Infrastructure Department is concerned with the availability and performance of GitLab's SaaS platforms.
GitLab.com's service level availability is visible on the SLA Dashboard, and we use the General GitLab Dashboard in Grafana to observe the service level indicators (SLIs) of apdex, error ratios, requests per second, and saturation of the services.
These dashboards show lagging indicators for how the services have responded to the demand generated by the application.
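As a rough illustration of the four service level indicators named above, the sketch below computes each one from raw counts. It assumes the standard Apdex formula (satisfied requests count fully, tolerating requests count half); the dashboards themselves may define these indicators differently, and the function names and parameters are hypothetical.

```python
def apdex(satisfied, tolerating, total):
    """Standard Apdex score in [0, 1]; returns None when there is no traffic."""
    if total == 0:
        return None
    return (satisfied + tolerating / 2) / total


def error_ratio(errors, total):
    """Fraction of requests that failed; returns None when there is no traffic."""
    if total == 0:
        return None
    return errors / total


def requests_per_second(total, window_seconds):
    """Average request rate over an observation window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return total / window_seconds


def saturation(used, capacity):
    """Fraction of a finite resource (CPU, connections, memory) in use."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity


if __name__ == "__main__":
    # Hypothetical counts for a one-minute window.
    print("apdex:", apdex(satisfied=80, tolerating=10, total=100))
    print("error ratio:", error_ratio(errors=5, total=100))
    print("rps:", requests_per_second(total=600, window_seconds=60))
    print("saturation:", saturation(used=75, capacity=100))
```

All four are computed from what has already happened in the observation window, which is why they act as lagging indicators.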
Each team is responsible for separate indicators. For more information, please view the team pages linked above.
The broad nature of work undertaken by the Scalability group can make prioritization challenging as it’s tricky to compare some issues like-for-like. For example, how do we compare the benefit of an issue to address a performance concern against an issue that reduces developer toil? To help guide the direction of the group and to inform our prioritization process, we can categorize issues into the following themes, in order of priority:
The above list is not comprehensive, nor does it outline a formal process. We should remain pragmatic when prioritizing work, while using the themes as a guideline.
The following people are members of the Scalability:Frameworks team:
|Liam McAndrew||Engineering Manager, Scalability:Frameworks|
|Alejandro Rodríguez||Site Reliability Engineer, Scalability|
|Chance Feick||Senior Backend Engineer|
|Igor Wiedler||Staff Site Reliability Engineer, Scalability|
|Gregorius Marco||Backend Engineer, Scalability|
|Sylvester Chin||Backend Engineer, Scalability|
The following people are members of the Scalability:Projections team:
|Rachel Nienaber||Senior Engineering Manager, Scalability:Projections|
|Bob Van Landuyt||Staff Backend Engineer, Scalability|
|Hercules Lemke Merscher||Backend Engineer|
|Liam McAndrew||Engineering Manager, Scalability:Frameworks|
|Matt Smiley||Staff Site Reliability Engineer, Scalability|
|Stephanie Jackson||Senior Site Reliability Engineer, Scalability|
The Scalability Group consists of Engineering Managers, Backend Engineers, and Site Reliability Engineers.
The Engineering Roles section of the handbook lists the responsibilities of these roles:
We work with all engineering teams across all departments as a representative of GitLab.com as one of the largest GitLab installations, to ensure that GitLab continues to scale in a safe and sustainable way.
The Memory team is a natural counterpart to the Scalability group, but their missions complement each other rather than overlap:
|Scalability Teams||Memory Team|
|Focused on GitLab's SaaS platforms first, self-managed only when necessary.||Focused on resolving application bottlenecks for all types of GitLab installations.|
|Driven by set SLO objectives, regardless of the nature of the issue.||Focused on application performance and resource consumption, in all environments.|
|Primary concern is preventing disruptions of GitLab's SaaS platforms SLO objectives through changes in the application architecture.||Primary concern is managing the application performance for all types of GitLab installations.|
Scalability leadership can be reached via PagerDuty Scalability Escalation.
From https://gitlab.pagerduty.com/incidents, click on the "New Incident" button and complete the new incident form as shown below.
workflow labels to the issue. The team will triage the issue and apply these.
Alternatively, mention us in the issue where you'd like our input.
When issues are sent our way, we will do our best to help or find a suitable owner to move the issue forward. We may be a development team's first contact into the Infrastructure department and we endeavour to treat these requests with care so that we can help to find an effective resolution for the issue.
If you're working on a feature that has specific scaling requirements, you can create an issue with the review request template. Some examples are:
This template gives the Scalability group the information we need to help you, and the issue will be shown on our build board with a high priority.
When we observe a situation on GitLab.com that needs to be addressed alongside a stage group, we first raise an issue in the Scalability issue tracker that describes what we are seeing. We try to determine if the problem lies with the action the code is performing, or the way in which it is running on GitLab.com. For example, with queues and workers, we will see if the problem is in what the queue does, or how the worker should run.
If we find that the problem is in what the code is doing, then we engage with the EM/PM of that group to find the right path forward. If work is required from that group, we will create a new issue in the gitlab-org project and use the Availability and Performance Refinement process to highlight this issue.
We prefer to work asynchronously as far as possible but still use synchronous communication where it makes sense to do so.
To that end, the only regular calls are the demo calls.
Lastly, in order to keep people connected, team members schedule at least one coffee-chat with another team member each week. These are at times that will best suit them as it may be an unusual hour given the various timezones and working hours for each person.
As a small team covering a wide domain, we need to make sure that everything we do has sufficient impact. If we do something that only the rest of the Scalability group knows about, we haven't 'shipped' anything. Our 'users' in this context are the infrastructure itself, SREs, and Development Engineers.
Impact could take the form of changes like:
In order to make others aware of the work we have done, we should advertise changes in the following locations:
Documentation or tutorial videos should also be added to the README.md in our team repository.
We use Epics, Issues, and Issue Boards to organize our work, as they complement each other.
The single source of truth for all work is the Scaling GitLab SaaS Platforms epic. This is considered the top-level epic from which all other epics are derived.
Epics that are added as children to the top-level epic are used to describe projects that the team undertakes.
Having all projects at this level allows us to use a single list for prioritization and enables us to prioritize work for different services alongside each other. Projects are prioritized in line with the OKRs for the current quarter.
Project status is maintained in the description of the top-level epic so that it is visible at a glance. This is auto-generated using the epic issues summary project. You can watch a short demo of this process to see how to use status labels on the epics to make use of this automation.
Example organization is shown on the diagram below:
Note: If you are not seeing the diagram, make sure that you have accepted all cookies.
Each project has an owner who is responsible for delivering the project.
The owner needs to:
The epic for the project must have the following items:
exit criterion label in the epic and are linked in the description.
The Scalability group issue boards track the progress of ongoing work.
On the planning board, the goal is to get issues into a state where we have enough information to build the issue.
However, not all issues that are workflow-infra::Ready to be built should be scheduled for development right away. Some issues may be too big, or might not be as important as others. This means not all issues that are workflow-infra::Ready on the planning board will move to the build board immediately.
Please see the triage rotation section for when to move issues between the boards.
|Planning Board||Build Board|
|Issues where we are investigating the work to be done.||Issues that will be built next, or are actively in development.|
The Scalability teams routinely use the following set of labels:
The group::Scalability label is used to allow easier filtering of issues applicable to the team that have group-level labels applied.
The Scalability teams leverage scoped workflow labels to track different stages of work. They show the progression of work for each issue and allow us to remove blockers or change focus more easily.
The standard progression of workflow is from top to bottom in the table below:
|Problem is identified and effort is needed to determine the correct action or work required.|
|Proposal is created and put forward for review. SRE looks for clarification and writes up a rough high-level execution plan if required. SRE highlights what they will check, along with soak/review time, and developers can confirm. If there are no further questions or blockers, the issue can be moved into "Ready".|
|Proposal is complete and the issue is waiting to be picked up for work.|
|Issue is assigned and work has started. While in progress, the issue should be updated to include steps for verification that will be followed at a later stage.|
|Issue has an MR in review.|
|MR was merged and we are waiting to see the impact of the change to confirm that the initial problem is resolved.|
|Once the issue is updated with the latest graphs and measurements, this label is applied and the issue can be closed.|
There are three other workflow labels of importance:
|Work in the issue is being abandoned due to external factors or a decision not to resolve the issue. After applying this label, the issue will be closed.|
|Work is not abandoned but other work has higher priority. After applying this label, team Engineering Manager is mentioned in the issue to either change the priority or find more help.|
|Work is blocked due to external dependencies or other external factors. Where possible, a blocking issue should also be set. After applying this label, the issue will be regularly triaged by the team until the label can be removed.|
The Scalability group has only one priority label:
Only issues of the utmost importance are given this label.
When an issue is given this label, a message should be pasted in the team's Slack channel so that an owner can be found as quickly as possible.
These issues should be picked up as soon as possible after completing the ongoing task, unless directly communicated otherwise.
It is a scoped label as we previously had 4 levels of priority. We found that in practice we primarily used P4, and used P1 to indicate the issues of greatest importance.
Stage groups use type labels to label merge requests in projects in the gitlab-org group. The Scalability group is not a part of the stage groups, and labels of importance for the team are explained above. When submitting work in the gitlab-org group, we apply ~"team::Scalability" and ~"type::maintenance" to merge requests by default. The latter label describes work towards refining existing functionality, which covers the majority of the work the team contributes.
We have automated triage policies defined in the triage-ops project. These perform tasks such as automatically labelling issues, asking the author to add labels, and creating weekly triage issues.
We rotate the triage ownership each month, with the current triage owner responsible for picking the next one (a reminder is added to their last triage issue).
When issues arrive on our backlog, we should consider how they align with our vision, mission, and current OKRs.
We also determine which of the teams would be the more appropriate owner for that task.
We need to effectively triage these issues so that they can be handled appropriately. This means:
When handing over an issue to the new owner, provide as much information as you can from your assessment of the issue.
The Scalability team members often have specialized knowledge that is helpful in resolving incidents. Some team members are also SREs who are part of the on-call rota. We follow the guidelines below when contributing to incidents.
For an on-call SRE:
For an Incident Manager:
If you are not EOC or an Incident Manager when an incident occurs:
The reason for this position is that our project work prevents future large S1 incidents from occurring. If we try to participate in and resolve many incidents, our project work is delayed and the risk of future S1 incidents increases.
The Infradev process aims to highlight SaaS availability and reliability improvements with the Stage Groups.
Where issues marked as infradev are found to be scaling problems, the team::Scalability label should be added.
Our commitment to this process, in line with the team's vision, is to provide guidance and assistance to the stage groups who are responsible for resolving these issues. We proactively assist them to determine how to resolve a problem, and then we contribute to reviewing the changes that they make.
Service::Unknown refinement - go through issues marked Service::Unknown and add a defined service, where possible.
workflow-infra::In Progress, either through picking them up directly, or asking on our team channel if anyone else is able.
infradev labels so we can help the stage groups move those forward.
Every quarter, we perform a review of all issues on the backlog that are not part of any project. When reviewing issues:
The EM creates this issue each quarter. It is not the sole responsibility of the person on Triage Rotation and is shared among all team members.
We work from our main epic: Scaling GitLab's SaaS Platforms.
Most of our work happens on the current in-progress sub epic. This is always prominently visible from the main epic's description.
When choosing something new to work on you can either:
The Scalability team was created during the fourth organizational iteration in the Infrastructure department on 2019-08-22, although it only became a reality once the first team member joined the team on 2019-11-29.
Even though it might not look like it at first glance, the Scalability team has its origin connected to the Delivery team. Namely, the first two backend engineers with Infrastructure specialisation were a part of the Delivery team, a specialisation that previously did not fit into the organizational structure. They had a focus on reliability improvements for GitLab.com, often working on features that had many scaling considerations. A milestone that would prove to be a case for the Scalability team was Continuous Delivery on GitLab.com.
Throughout July, August and September 2019, GitLab.com experienced a higher than normal amount of customer-facing events. Mirroring delays, slowdowns, and vertical node scaling issues (to name a few) all contributed to a general need to improve stability. This placed higher expectations on the Infrastructure department, and with the organization at the time, these were harder to meet. To accelerate the timelines, "infradev" and "rapid action" processes were created as a connection point between the Infrastructure and Development departments to help Product prioritise higher impact issues. This approach was starting to yield results, but the process was there as a reaction to an (ongoing) event with the focus on resolving that specific need.
The background processing architectural proposal clearly illustrated the need to stay ahead of the growing needs of the platform and approach the growth strategically as well as tactically. With a clear case and approvals in hand, the team mission, vision, and goals were set and the team buildout could commence. While that was in motion, we had another confirmation through a performance retrospective that the need for the team is real.
As the team was taking shape, the background processing architectural changes were the first changes delivered by the team with a large impact on GitLab.com, with many more incremental changes following throughout 2020. Measuring that impact reliably and predicting future challenges remain among the team's focuses at the time of writing this history summary.
The team impact overview is logged in issues: