|Slack Channels||#g_scalability, #infrastructure-lounge (Infrastructure Group Channel), #incident-management (Incident Management), #alerts-general (SLO alerting), #mech_symp_alerts (Mechanical Sympathy Alerts)|
|Sisense Dashboard||Useful Charts|
The Scalability group is currently formed of two teams:
The primary goal of the Frameworks team is to create a standard way to use and scale the various services and technologies used by GitLab and GitLab.com, with a particular focus on enabling other development teams to support their own growth. This is akin to Platform Engineering.
For example, with Redis, we are creating more and more instances, and we are at the point where teams should be able to know when they need a custom instance and how to put one in place. Similarly, Object Storage is a component that is widely used, but its usage is not consistent. We can help the development teams become more efficient by providing a structure over this storage.
This team focuses on observability, forecasting & projection systems that enable development engineering to predict system growth for their areas of responsibility. Error Budgets and Stage Group Dashboards are examples of successful projects that have provided development teams information about how their code runs on GitLab.com. But this comes at the end of the development process. At the other end of this cycle sits a huge open opportunity around forecasting where we could incorporate the Product Scaling Model into our capacity forecasting tools.
The Scalability group is responsible for GitLab and GitLab.com at scale, working on the highest priority scaling items related to the application and GitLab.com. The group works in close coordination with Reliability Engineering teams and provides feedback to other Engineering teams so they can become better at scalability as well.
As its name implies, the Scalability group enhances the availability, reliability, and performance of GitLab by observing the application's capabilities to operate at GitLab.com scale.
The Scalability group analyzes application performance on GitLab.com, recognizes bottlenecks in service availability, proposes (and develops) short term improvements and develops long term plans that help drive the decisions of other Engineering teams.
Short term goals for the group include:
The Scalability Group aligns with the Platforms Department direction for FY23.
At the department level, the goal is to ensure that existing team members are happy while at the same time making sure that new team members are hired and get the support they need.
Initiatives towards this goal:
At the department level, the goal is to help stage groups manage the lifecycle of their features.
Our contribution to this goal is to further enhance and improve Error Budgets. For example:
We have successfully enabled stage groups to take ownership of their features on GitLab.com. In FY23 we can extend this ownership further and see if we can encourage more involvement in both the Delivery and Reliability aspects of their feature categories.
We also need to support self-servicing for system components with no clear owner. This may mean closer attention on items such as application level rate limiting and object storage.
At the department level, the goal is to improve both the metrics we use to guide ourselves, as well as how to use metrics to guide others.
At the department level, we want to ensure that we continue to focus on scaling.
We support this by continuing to work on the scaling initiatives that we have, and look for the new challenges we will face as we continue to grow.
GitLab.com's service level availability is visible on the SLA Dashboard, and we use the General GitLab Dashboard in Grafana to observe the service level indicators (SLIs) of apdex, error ratios, requests per second, and saturation of the services.
These dashboards show lagging indicators for how the services have responded to the demand generated by the application.
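As a rough illustration of one of these SLIs, an apdex score can be computed from request latencies with the standard Apdex formula; the threshold in the example below is made up for illustration, not the actual threshold used for any GitLab.com service.

```python
def apdex(latencies, satisfied_threshold):
    """Standard Apdex: (satisfied + tolerating / 2) / total.

    Requests at or under the threshold are 'satisfied'; requests over
    the threshold but at or under 4x the threshold are 'tolerating';
    anything slower is 'frustrated'.
    """
    if not latencies:
        return 1.0
    satisfied = sum(1 for l in latencies if l <= satisfied_threshold)
    tolerating = sum(
        1 for l in latencies
        if satisfied_threshold < l <= 4 * satisfied_threshold
    )
    return (satisfied + tolerating / 2) / len(latencies)

# Example with a hypothetical 1-second threshold:
# two satisfied, one tolerating, one frustrated request.
print(apdex([0.2, 0.5, 1.5, 5.0], 1.0))  # 0.625
```

A falling apdex on a service dashboard is one of the lagging signals mentioned above: it tells us demand is outpacing what the service can absorb.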
The Scalability group is an owner of several performance indicators that roll up to the Infrastructure department indicators:
These are combined to enable us to better prioritize team projects.
An overly simplified example of how these indicators might be used, in no particular order:
Between these different signals, we have a relatively (im)precise view into the past, present and future to help us prioritise scaling needs for GitLab.com.
The team regularly works on the following tasks, in the order of priority:
The following people are members of the Scalability:Frameworks team:
|Liam McAndrew||Engineering Manager, Scalability|
|Alejandro Rodríguez||Site Reliability Engineer, Scalability|
|Igor Wiedler||Staff Site Reliability Engineer, Scalability|
|Jacob Vosmaer||Staff Backend Engineer, Scalability|
|Quang-Minh Nguyen||Senior Backend Engineer, Scalability|
The following people are members of the Scalability:Projections team:
|Rachel Nienaber||Engineering Manager, Scalability|
|Bob Van Landuyt||Senior Backend Engineer, Scalability|
|Matt Smiley||Senior Site Reliability Engineer|
|Sean McGivern||Staff Backend Engineer, Scalability|
As a representative of GitLab.com, one of the largest GitLab installations, we work with engineering teams across all departments to ensure that GitLab continues to scale in a safe and sustainable way.
The Memory team is a natural counterpart to the Scalability group, but their missions complement each other rather than overlap:
|Scalability Teams||Memory Team|
|Focused on GitLab.com first, self-managed only when necessary.||Focused on resolving application bottlenecks for all types of GitLab installations.|
|Driven by set SLO objectives, regardless of the nature of the issue.||Focused on application performance and resource consumption, in all environments.|
|Primary concern is preventing disruptions of GitLab.com SLO objectives through changes in the application architecture.||Primary concern is managing the application performance for all types of GitLab installations.|
workflow labels to the issue. The team will triage the issue and apply these.
Alternatively, mention us in the issue where you'd like our input.
When issues are sent our way, we will do our best to help or to find a suitable owner to move the issue forward. We may be a development team's first contact with the Infrastructure department, and we endeavour to treat these requests with care so that we can help to find an effective resolution for the issue.
If you're working on a feature that has specific scaling requirements, you can create an issue with the review request template. Some examples are:
This template gives the Scalability group the information we need to help you, and the issue will be shown on our build board with a high priority.
A few weeks after a review has been closed, a follow-up comment is added to ask for feedback in a survey. This helps us understand if the process has been helpful and if there are improvements that can be made. We go through the feedback provided in the surveys each quarter.
This process is an example of doing something that doesn't scale; as we do more of these, we'll learn what topics can be covered more efficiently by training, documentation, and tooling.
When we observe a situation on GitLab.com that needs to be addressed alongside a stage group, we first raise an issue in the Scalability issue tracker that describes what we are seeing. We try to determine if the problem lies with the action the code is performing, or the way in which it is running on GitLab.com. For example, with queues and workers, we will see if the problem is in what the queue does, or how the worker should run.
If we find that the problem is in what the code is doing, then we engage with the EM/PM of that group to find the right path forward. If work is required from that group, we will create a new issue in the gitlab-org project and use the Availability and Performance Refinement process to highlight this issue.
We prefer to work asynchronously as far as possible but still use synchronous communication where it makes sense to do so.
To that end, the only regular calls are the demo calls.
Lastly, to keep people connected, team members schedule at least one coffee chat with another team member each week. These happen at times that best suit the participants, which may be at unusual hours given the various timezones and working hours of each person.
As a small team covering a wide domain, we need to make sure that everything we do has sufficient impact. If we do something that only the rest of the Scalability group knows about, we haven't 'shipped' anything. Our 'users' in this context are the infrastructure itself, SREs, and Development Engineers.
Impact could take the form of changes like:
In order to make others aware of the work we have done, we should advertise changes in the following locations:
Documentation or tutorial videos should also be added to the README.md in our team repository.
We use Epics, Issues, and Issue Boards to organize our work, as they complement each other.
The single source of truth for all work is the Scaling GitLab.com epic. This is the top-level epic from which all other epics are derived.
Epics that are added as children to the top-level epic are used to describe projects that the team undertakes.
Having all projects at this level allows us to use a single list for prioritization and enables us to prioritize work for different services alongside each other. Projects are prioritized in line with the OKRs for the current quarter.
Project status is maintained in the description of the top-level epic so that it is visible at a glance. This is auto-generated using the epic issues summary project. You can watch a short demo of this process to see how to use status labels on the epics to make use of this automation.
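To illustrate the idea (the real epic issues summary project is more involved, and the summary format and project names below are invented for this sketch), the automation boils down to reading the status label on each child epic and rendering a line per project into the top-level description:

```python
def epic_status_summary(epics):
    """Render (title, status_label) pairs for child epics into a
    plain-text summary block for the top-level epic description.
    The output format here is illustrative, not the real one."""
    lines = []
    for title, label in epics:
        # Scoped labels look like "workflow-infra::In Progress";
        # keep only the part after the "::" scope separator.
        status = label.split("::", 1)[-1]
        lines.append(f"- {title}: {status}")
    return "\n".join(lines)

# Hypothetical child epics with workflow labels applied:
print(epic_status_summary([
    ("Redis instance framework", "workflow-infra::In Progress"),
    ("Sidekiq queue rearchitecture", "workflow-infra::Ready"),
]))
```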
Example organization is shown on the diagram below:
Note: If you are not seeing the diagram, make sure that you have accepted all cookies.
Each project has an owner who is responsible for delivering the project.
The owner needs to:
The epic for the project must have the following items:
exit criterion label in the epic and are linked in the description.
The Scalability group issue boards track the progress of ongoing work.
On the planning board, the goal is to get issues into a state where we have enough information to build the issue.
However, not all issues that are workflow-infra::Ready to be built should be scheduled for development right away. Some issues may be too big, or might not be as important as others. This means not all issues that are workflow-infra::Ready on the planning board will move to the build board immediately.
Please see the triage rotation section for when to move issues between the boards.
|Planning Board||Build Board|
|Issues where we are investigating the work to be done.||Issues that will be built next, or are actively in development.|
The Scalability teams routinely use the following set of labels:
team::Scalability label is used to allow easier filtering of issues applicable to the team that have group-level labels applied.
The Scalability teams leverage scoped workflow labels to track different stages of work. They show the progression of work for each issue and allow us to remove blockers or change focus more easily.
The standard progression of workflow is from top to bottom in the table below:
|Problem is identified and effort is needed to determine the correct action or work required.|
|Proposal is created and put forward for review. SRE looks for clarification and writes up a rough high-level execution plan if required. SRE highlights what they will check, along with the soak/review time, so that developers can confirm. If there are no further questions or blockers, the issue can be moved into "Ready".|
|Proposal is complete and the issue is waiting to be picked up for work.|
|Issue is assigned and work has started. While in progress, the issue should be updated to include steps for verification that will be followed at a later stage.|
|Issue has an MR in review.|
|MR was merged and we are waiting to see the impact of the change to confirm that the initial problem is resolved.|
|Issue is updated with the latest graphs and measurements, this label is applied and issue can be closed.|
There are three other workflow labels of importance:
|Work on the issue is being abandoned due to external factors or a decision not to resolve the issue. After applying this label, the issue will be closed.|
|Work is not abandoned, but other work has higher priority. After applying this label, the team's Engineering Manager is mentioned in the issue to either change the priority or find more help.|
|Work is blocked due to external dependencies or other external factors. Where possible, a blocking issue should also be set. After applying this label, the issue will be regularly triaged by the team until the label can be removed.|
The Scalability group has only one priority label:
Only issues of the utmost importance are given this label.
When an issue is given this label, a message should be pasted in the team's Slack channel so that an owner can be found as quickly as possible.
These issues should be picked up as soon as possible after completing the ongoing task, unless directly communicated otherwise.
It is a scoped label because we previously had 4 levels of priority. We found that in practice we primarily used P4, and used P1 to indicate the issues of greatest importance.
Stage groups use type labels on merge requests in projects in the gitlab-org group. The Scalability group is not part of the stage groups, and the labels of importance for the team are explained above. When submitting work in the gitlab-org group, we apply ~"team::Scalability" and ~"type::maintenance" to merge requests by default. The latter label describes work that refines existing functionality, which covers the majority of the work the team contributes.
We have automated triage policies defined in the triage-ops project. These perform tasks such as automatically labelling issues, asking the author to add labels, and creating weekly triage issues.
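For illustration, policies in that project are written as gitlab-triage resource rules. The fragment below is a hypothetical rule (the condition and label are chosen for the example, not copied from the real triage-ops configuration):

```yaml
resource_rules:
  issues:
    rules:
      # Hypothetical rule: make sure open issues in the team's tracker
      # carry the team label so they appear in group-level filters.
      - name: Apply the team label
        conditions:
          state: opened
          forbidden_labels:
            - team::Scalability
        actions:
          labels:
            - team::Scalability
```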
We rotate the triage ownership each month, with the current triage owner responsible for picking the next one (a reminder is added to their last triage issue).
When issues arrive on our backlog, we should consider how they align with our vision, mission, and current OKRs.
We need to effectively triage these issues so that they can be handled appropriately. This means:
When handing over an issue to the new owner, provide as much information as you can from your assessment of the issue.
The Scalability team members often have specialized knowledge that is helpful in resolving incidents. Some team members are also SREs who are part of the on-call rota. We follow the guidelines below when contributing to incidents.
For an on-call SRE:
For an Incident Manager:
If you are not EOC or an Incident Manager when an incident occurs:
The reason for this position is that our project work prevents future large S1 incidents from occurring. If we try to participate in and resolve many incidents, our project work is delayed and the risk of future S1 incidents increases.
Where issues marked as infradev are found to be scaling problems, the team::Scalability label should be added.
Our commitment to this process, in line with the team's vision, is to provide guidance and assistance to the stage groups who are responsible for resolving these issues. We proactively assist them to determine how to resolve a problem, and then we contribute to reviewing the changes that they make.
Service::Unknown refinement - go through issues marked Service::Unknown and add a defined service, where possible.
workflow-infra::In Progress, either through picking them up directly, or asking on our team channel if anyone else is able.
infradev labels so we can help the stage groups move those forward.
Every quarter, we perform a review of all issues on the backlog that are not part of any project. When reviewing issues:
The EM creates this issue each quarter. It is not the sole responsibility of the person on Triage Rotation and is shared among all team members.
We work from our main epic: Scaling GitLab on GitLab.com.
Most of our work happens on the current in-progress sub epic. This is always prominently visible from the main epic's description.
When choosing something new to work on you can either:
The Scalability team was formed during the fourth organizational iteration in the Infrastructure department on 2019-08-22, although it only became a reality once the first team member joined the team on 2019-11-29.
Even though it might not look like it at first glance, the Scalability team has its origin connected to the Delivery team. Namely, the first two backend engineers with an Infrastructure specialisation were part of the Delivery team, a specialisation that previously did not fit into the organizational structure. They focused on reliability improvements for GitLab.com, often working on features that had many scaling considerations. A milestone that would prove to be a case for the Scalability team was Continuous Delivery on GitLab.com.
Throughout July, August, and September 2019, GitLab.com experienced a higher than normal number of customer-facing events. Mirroring delays, slowdowns, and vertical node scaling issues (to name a few) all contributed to a general need to improve stability. This placed higher expectations on the Infrastructure department, which were harder to meet with the organization as it was at the time. To accelerate the timelines, the "infradev" and "rapid action" processes were created as a connection point between the Infrastructure and Development departments, helping Product prioritise higher-impact issues. This approach was starting to yield results, but the process existed as a reaction to an (ongoing) event, with a focus on resolving that specific need.
The background processing architectural proposal clearly illustrated the need to stay ahead of the growing needs of the platform and to approach the growth strategically as well as tactically. With a clear case and approvals in hand, the team mission, vision, and goals were set and the team buildout could commence. While that was in motion, we had another confirmation, through a performance retrospective, that the need for the team was real.
As the team took shape, the background processing architectural changes were the first changes delivered by the team with a large impact on GitLab.com, with many more incremental changes following throughout 2020. Measuring that impact reliably, and predicting future challenges, remains one of the team's focuses at the time of writing this history summary.
The team impact overview is logged in issues: