
Scalability Team

Scalability Team logo: inspired by the album cover of Unknown Pleasures, the debut studio album by English rock band Joy Division, except the waveforms are Tanukis.

Workflow: Team workflow  
GitLab.com: @gitlab-org/scalability  
Issue Trackers: Scalability  
Slack Channels: #g_scalability / @scalability-team, #infrastructure-lounge (Infrastructure Group Channel), #incident-management (Incident Management), #alerts-general (SLO alerting), #mech_symp_alerts (Mechanical Sympathy Alerts)

Mission

The Scalability team is responsible for GitLab and GitLab.com at scale, working on the highest priority scalability items in the application in close coordination with Reliability Engineering teams and providing feedback to other Engineering teams so they can become better at scalability as well.

Vision

As its name implies, the Scalability team enhances the availability, reliability, and performance of GitLab by observing the application's capability to operate at GitLab.com scale. The Scalability team analyzes application performance on GitLab.com, recognizes bottlenecks in service availability, proposes short-term improvements, and develops long-term plans that help drive the decisions of other Engineering teams.

Short term goals include:

Work prioritization process

All work tracked by the team is compiled in the Scaling GitLab.com epic.

When we need to work in the GitLab.org group, we create a corresponding epic there and link it in the above epic's description (as epics are tied to groups, and we use more than one top-level group).

The diagram below describes how work is prioritized in the Scalability team and added to the above-mentioned epic:

workflow

The process contains six cyclical stages:

  1. Observe - What is causing SLA and SLO degradations on GitLab.com?
  2. Analysis - Why is availability being reduced, do we have all the information, and are our metrics sufficient?
  3. Proposed Improvements - An issue with a fix (partial and temporary, or full and permanent) is created, including the estimated SLA improvements for the affected services.
  4. Triage - Prioritise changes based on a pre-defined set of rules, which include ownership of the change.
  5. Development & Deployment - The owner defined in the previous stage develops the change and ensures that it has no unexpected effects.
  6. Assessment - The implemented change is assessed by retrospecting on the expected and observed state. The retrospective is documented in an issue that is marked as related to the original issue driving the change.

Team work processes

Labels

The Scalability team routinely uses the following set of labels:

  1. The team label, team::Scalability.
  2. Priority labels.
  3. Scoped workflow labels.
  4. Scoped Service labels.

The team::Scalability label is used in order to allow for easier filtering of issues applicable to the team that have group level labels applied.

The priority labels allow us to track the issues correctly and raise or lower the priority of work based on both external and internal factors. Priorities are set based on the priority definitions, with the addition that the target SLOs apply to GitLab.com service SLOs.

This means that an issue should have the highest priority if resolving it will immediately improve GitLab.com SLOs, or if it unblocks another issue that will immediately impact them.

Workflow labels

The Scalability team leverages scoped workflow labels to track different stages of work. They show the progression of work for each issue and allow us to remove blockers or change focus more easily.

The standard progression of workflow is described below:

  1. workflow::Triage to workflow::Proposal: Problem has been scoped and issue has a proposal ready for review.
  2. workflow::Proposal to workflow::Ready: Proposal has no blockers and work can start.
  3. workflow::Ready to workflow::In Progress: Issue is assigned and work has started.
  4. workflow::In Progress to workflow::Under Review: Issue has a MR in review.
  5. workflow::Under Review to workflow::Verify: MR was merged, issue is completing set of verification steps.
  6. workflow::Verify to workflow::Done: Issue is updated with the latest graphs and measurements, workflow::Done label is applied and issue can be closed.
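The progression above can be sketched as an ordered list of labels with a check for the standard next step. This is a minimal illustration only, assuming the scoped label spelling `workflow::<stage>`; the function names are hypothetical and are not part of any team tooling.

```python
from typing import Optional

# Standard progression, in the order given by the workflow diagram.
WORKFLOW_ORDER = [
    "workflow::Triage",
    "workflow::Proposal",
    "workflow::Ready",
    "workflow::In Progress",
    "workflow::Under Review",
    "workflow::Verify",
    "workflow::Done",
]


def next_workflow_label(current: str) -> Optional[str]:
    """Return the label that follows `current`, or None for the last stage."""
    index = WORKFLOW_ORDER.index(current)  # raises ValueError for unknown labels
    if index + 1 < len(WORKFLOW_ORDER):
        return WORKFLOW_ORDER[index + 1]
    return None


def is_standard_transition(old: str, new: str) -> bool:
    """True when `new` is the stage immediately after `old`."""
    return old in WORKFLOW_ORDER and next_workflow_label(old) == new
```

For example, `is_standard_transition("workflow::Ready", "workflow::In Progress")` holds, while a jump from `workflow::Ready` straight to `workflow::Done` does not.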

There are three other workflow labels of importance omitted from the diagram above:

  1. workflow::Cancelled
    • Work in the issue is being abandoned due to external factors or a decision not to resolve the issue. After this label is applied, the issue will be closed.
  2. workflow::Stalled
    • Work is not abandoned, but other work has higher priority. After this label is applied, the team's Engineering Manager is mentioned in the issue to either change the priority or find more help.
  3. workflow::Blocked
    • Work is blocked due to external dependencies or other external factors. After this label is applied, the issue will be regularly triaged by the team until the label can be removed.

Triage rotation

We have automated triage policies defined in the triage-ops project. These perform tasks such as automatically labelling issues, asking the author to add labels, and creating weekly triage issues.

We currently have two weekly triage issues:

  1. Board grooming - walk through the current project board and move issues forward towards workflow::Ready where possible.
  2. Service::Unknown grooming - lists issues with Service::Unknown with the goal of adding a defined service, where possible.

We rotate the triage ownership each month, with the current triage owner responsible for picking the next one (a reminder is added to their last triage issue).

Issues

An issue is being implemented if:

  1. Issue has a team member assigned to it.
  2. Assigned issue has a priority label set.
  3. Issue has "~workflow::In Progress" set.

An issue is resolved when:

  1. The problem defined in the issue has been addressed.
  2. Issue description is updated with a graph comparing before/after state (if applicable).
  3. Issue has "~workflow::Done" set.
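The two checklists above can be expressed as simple predicates. The sketch below is illustrative only: the `Issue` class is a hypothetical, simplified stand-in rather than the GitLab API model, and the `priority::` label prefix is an assumption, since the priority labels are not spelled out here.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Issue:
    """Hypothetical, simplified view of an issue (not the GitLab API model)."""
    assignees: List[str] = field(default_factory=list)
    labels: Set[str] = field(default_factory=set)
    problem_addressed: bool = False
    graph_applicable: bool = True
    has_before_after_graph: bool = False


def has_priority_label(issue: Issue) -> bool:
    # Assumption: priority labels are scoped labels starting with "priority::".
    return any(label.startswith("priority::") for label in issue.labels)


def is_being_implemented(issue: Issue) -> bool:
    """Assigned, prioritized, and labelled workflow::In Progress."""
    return (
        bool(issue.assignees)
        and has_priority_label(issue)
        and "workflow::In Progress" in issue.labels
    )


def is_resolved(issue: Issue) -> bool:
    """Problem addressed, graph added where applicable, and workflow::Done set."""
    graph_ok = issue.has_before_after_graph or not issue.graph_applicable
    return issue.problem_addressed and graph_ok and "workflow::Done" in issue.labels
```

An issue with an assignee and `workflow::In Progress` but no priority label would not count as being implemented under this reading.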

Issue boards

The Scalability team issue boards track the progress of ongoing work. The purposes of some of the more important issue boards are described below:

  1. Workflow board
    • Tracks the whole team's ongoing workload.
  2. Abandoned work board
    • Tracks the work that is not progressing.
  3. Individual services board, for example Sidekiq board
    • Tracks the workload for the individual service.
  4. Priority board
    • Tracks the workload based on issue priorities.

Choosing something to work on

We work from our main epic: Scaling GitLab on GitLab.com.

Most of our work happens on the current in-progress sub-epic. This is always prominently visible from the main epic's description. From there, work takes place on the board associated with the current in-progress epic.

Priority and workflow labels take precedence; we don't use issue ordering in boards or epics for priorities. Workflow labels to the right are higher priority than those to the left.
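One way to read this rule is as a sort key over candidate issues. The sketch below rests on several assumptions that are not stated on this page: priority labels look like `priority::1` (lower number meaning higher priority), priority is compared before workflow stage, and issues are plain `(title, labels)` tuples. The helper names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Board columns from left to right; stages further right rank higher.
WORKFLOW_ORDER = [
    "workflow::Triage",
    "workflow::Proposal",
    "workflow::Ready",
    "workflow::In Progress",
    "workflow::Under Review",
    "workflow::Verify",
]

UNPRIORITIZED = 99  # issues without a priority label sort last


def priority_rank(labels: Set[str]) -> int:
    """Smallest numeric suffix of any `priority::N` label (assumed format)."""
    ranks = [
        int(label.split("::", 1)[1])
        for label in labels
        if label.startswith("priority::") and label.split("::", 1)[1].isdigit()
    ]
    return min(ranks) if ranks else UNPRIORITIZED


def workflow_rank(labels: Set[str]) -> int:
    """Position of the right-most workflow label, or -1 when none matches."""
    positions = [WORKFLOW_ORDER.index(l) for l in labels if l in WORKFLOW_ORDER]
    return max(positions) if positions else -1


def order_candidates(
    issues: Iterable[Tuple[str, Set[str]]],
) -> List[Tuple[str, Set[str]]]:
    """Sort by priority first, then by workflow stage (right-most first)."""
    return sorted(
        issues,
        key=lambda issue: (priority_rank(issue[1]), -workflow_rank(issue[1])),
    )
```

With this ordering, a `priority::2` issue in `workflow::Under Review` would be picked up before a `priority::2` issue in `workflow::Ready`, and both would come after any `priority::1` issue.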

Team counterparts

The Scalability team will work with all engineering teams across all departments as a representative of GitLab.com, one of the largest GitLab installations, to ensure that GitLab continues to scale in a safe and sustainable way.

The Memory team is a natural counterpart to the Scalability team, but their missions complement each other rather than overlap:

Simply put:

Team Members

The following people are members of the Scalability Team:

Person: Role
New Vacancy - Marin Jankovski (Interim): Engineering Manager, Scalability
Sean McGivern: Staff Backend Engineer, Scalability
Oswaldo Ferreira: Backend Engineer, Scalability
Bob Van Landuyt: Senior Backend Engineer, Scalability
C.M.: Site Reliability Engineer, Scalability

How do I engage with the Scalability Team?

  1. Start with an issue in the Scalability team tracker: Create an issue.
  2. You are welcome to follow this up with a Slack message in #g_scalability.
  3. Please don't add any workflow labels to the issue. The team will triage the issue and apply these.
  4. We use our Workflow board to track the workflow of issues.

Celebrating our wins

We celebrate our wins! Whenever a change driven by the Scalability Team shows a clear positive impact on the scalability of GitLab.com, whether through key metrics, saturation reduction, reduced mean time to detection (MTTD), improved mean time between failures (MTBF), or similar, we post a message as a comment on this snippet in our tracker: https://gitlab.com/gitlab-com/gl-infra/scalability/snippets/1900609.