The Tenant Scale group (formerly Pods or Sharding group) is part of the Data Stores stage. We offer support for groups, projects, and user profiles within our product, but our main focus is a long-term horizontal scaling solution for GitLab.
This page covers processes and information specific to the Tenant Scale group. See also the direction page and the features we support per category.
To get in touch with us, it's best to create an issue in the relevant project (typically GitLab) and add the ~"group::tenant scale" label, along with any other appropriate labels.
For urgent items, feel free to use the Slack channel (internal): #g_tenant-scale.
There are multiple proposals and ideas to increase horizontal scalability via solutions such as database sharding and tenant isolation. The objective of this group is to explore, iterate on, validate, and lead implementation of proposals to provide a solution to accommodate GitLab.com's daily-active user growth.
As we brainstorm and iterate on horizontal scalability proposals, we will provide implementation details, prototypes, metrics, demos, and documentation to support our hypotheses and outcomes.
Currently, Cells is our proposal for a new SaaS architecture that is horizontally scalable, resilient, and provides a more consistent user experience.
The executive summary goals for the Tenant Scale group include:
The following people are permanent members of the Tenant Scale group:
Person | Role |
---|---|
Arturo Herrero | Engineering Manager, Tenant Scale |
Abdul Wadood | Senior Backend Engineer, Tenant Scale |
Alex Pooley | Staff Backend Engineer, Tenant Scale |
Manoj Memana Jayakumar | Senior Backend Engineer, Tenant Scale |
Omar Qunsul | Backend Engineer, Tenant Scale |
Peter Hegman | Senior Frontend Engineer, Tenant Scale |
Rutger Wessels | Senior Backend Engineer, Tenant Scale |
Steve Xuereb | Staff Site Reliability Engineer, Tenant Scale |
Thong Kuah | Principal Engineer, Tenant Scale |
The following members of other functional teams are our stable counterparts:
Person | Role |
---|---|
Christina Lohr | Senior Product Manager, Data Stores:Tenant Scale |
Amelia Bauerly | Senior Product Designer, Tenant Scale |
Mike Nichols | Staff Product Designer, Tenant Scale |
Lorena Ciutacu | Technical Writer - Analytics:Product Analytics, Analytics:Analytics Instrumentation, Data Stores:Tenant Scale and Plan:Optimize |
Dylan Griffith | Principal Engineer, Data Stores |
Kamil Trzciński | Senior Distinguished Engineer, Ops and Core Platform |
Quang-Minh Nguyen | Staff Backend Engineer, Gitaly and Tenant Scale |
Rohit Shambhuni | Senior Security Engineer, Application Security, Manage (Authentication and Authorization), SaaS Platforms (Scalability) and Data Stores (Tenant Scale). |
We are working on several large projects, each of which has a Directly Responsible Individual (DRI). The DRI helps define the scope of the work needed for the project, ensures clarity on objectives, and looks 3-6 months ahead to identify potential blockers or risks. A DRI's work is not limited to their project; they also work in other areas as needed.
Project | DRI |
---|---|
Cells: Essential workflows | Manoj |
Cells: Routing layer | Thong |
Organization | Alex |
Self-managed decomposition | Rutger |
We are a globally distributed group and communicate mostly asynchronously, but we also have synchronous meetings. It's unlikely that everyone can attend those meetings, so we record them and share written summaries (agenda). Currently we have the following recurring meetings scheduled:
The Product Manager (PM) compiles the list of issues following the product prioritization process, with input from the team, Engineering Manager (EM), and other stakeholders. The iteration cycle lasts from the 18th of one month until the 17th of the next month, and is identified by the GitLab version set to be released.
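The cycle boundaries described above can be sketched in a few lines of Python. This is only an illustration: the function name `iteration_window` is hypothetical, and the mapping from a cycle to its GitLab version is deliberately left out because this page does not define it.

```python
from datetime import date


def iteration_window(day: date) -> tuple[date, date]:
    """Return (start, end) of the iteration cycle that contains `day`.

    Cycles run from the 18th of one month through the 17th of the next.
    """
    if day.day >= 18:
        start = date(day.year, day.month, 18)
    elif day.month == 1:
        # Before the 18th of January: the cycle began in December of the prior year.
        start = date(day.year - 1, 12, 18)
    else:
        # Before the 18th: the cycle began on the 18th of the previous month.
        start = date(day.year, day.month - 1, 18)

    # The cycle ends on the 17th of the month following `start`.
    if start.month == 12:
        end = date(start.year + 1, 1, 17)
    else:
        end = date(start.year, start.month + 1, 17)
    return start, end
```

For example, `iteration_window(date(2024, 1, 5))` returns the cycle that started on 18 December 2023 and ends on 17 January 2024.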
Engineers are encouraged to work as closely as needed with their stable counterparts. Quality engineering is included in our workflow via the quad planning process.
Before starting a milestone, the group coordinates using planning issues. We follow this process:
The primary source for things to work on is the milestone prioritization board, which lists all issues scheduled for the current cycle in priority order (from most to least important): p1, p2, and p3. You should first pick up issues that have the highest priority, which are listed at the top of the first board column. When you assign yourself to an issue, you indicate that you are working on it.
If anything is blocking you from getting started with the top issue immediately, like unanswered questions or unclear requirements, you can skip it, as long as you put your findings and questions in the issue. This helps the next engineer who picks up the issue.
Issues are usually not assigned directly to people, except when one person clearly has the most knowledge or context to work on an issue. However, we encourage engineers to take ownership of specific projects or epics, because this lets them have significantly more impact in the company.
We follow the GitLab product development workflow guidelines. To get a high-level overview of the status of all issues in the current milestone, check the development workflow board.
The process primarily follows this:

- `workflow::ready for design` signals that an issue is ready for design to begin.
- `workflow::design` is used by the designer to signal that the issue is actively being worked on.
- `workflow::planning breakdown` signals that the issue is past the design phase and ready to be scheduled.
- `workflow::ready for development` signals that the issue is ready for engineering to work on.

We follow the GitLab engineering workflow guidelines. To get a high-level overview of the status of all issues in the current milestone, check the development workflow board.
As owners of the issues assigned to them, engineers are expected to keep the workflow labels on their issues up to date. When an engineer starts working on an issue, they mark it with the `workflow::in dev` label as the starting point and continue updating the issue throughout development. Before closing an issue, it's important to add the `workflow::complete` label, because this is one of the requirements for the completed items to appear in the Improvements and Bugs overview of each month's release post. The process primarily follows this diagram:
We track our work on the following issue boards:
We use a simple issue weighting system for capacity planning, ensuring a manageable amount of work for each milestone. We consider both the team's throughput and each engineer's upcoming availability from Time Off by Deel using a Google Apps Script.
The weights are intended to be used in aggregate, and what takes one person a certain amount of time may be different for another, depending on their level of knowledge of the issue. We should strive to be accurate, but understand that they are estimates. Change the weight if it is not accurate or if the issue becomes more difficult than originally expected. Leave a comment indicating why the weight was changed and tag the EM and PM so we can better understand the scope and continue to improve.
To weigh an issue, consider the following important factors:
When estimating development work, please assign an issue the appropriate weight:
Weight | Description | Examples |
---|---|---|
1: Trivial | The simplest possible change. We are confident there will be no side effects. Negligible complexity. | Documentation updates, simple regressions, and other bugs that have already been investigated and discussed and can be fixed with a few lines of code, or technical debt that we know exactly how to address, but just haven't found time for yet. |
2: Small | A simple change (minimal code changes), where we understand all of the requirements. Some small uncertainties exist but we are confident of a solution. | Simple features, like a new API endpoint to expose existing data, or regular bugs or performance issues where all investigation has already taken place. |
3: Medium | A change with a bigger code footprint (e.g. lots of different files, or tests affected). There are uncertainties that we will need to work through. | Regular features, potentially with a backend and frontend component, or most bugs or performance issues. |
5: Large | A more complex change that will impact multiple areas of the codebase. There may also be some refactoring involved. Requirements are poorly understood and you feel there are multiple important gaps. We will need to break this issue into smaller pieces before we can begin a merge request. | Large features with a backend and frontend component, or bugs or performance issues that have seen some initial investigation but have not yet been reproduced or understood. |
Anything with a weight of 5 or larger should be broken down if possible.
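As a rough illustration of how the weights could feed capacity planning, the sketch below schedules issues in priority order against a capacity figure and sets aside anything at weight 5 or above for breakdown. The `Issue` class, the `plan_milestone` function, and the idea of expressing capacity as a single number of weight points are all assumptions made for this example; they are not the group's actual Google Apps Script.

```python
from dataclasses import dataclass

ALLOWED_WEIGHTS = {1, 2, 3, 5}  # the weights listed in the table above
BREAKDOWN_THRESHOLD = 5         # weight at which an issue should be split


@dataclass
class Issue:  # hypothetical, minimal stand-in for a GitLab issue
    title: str
    weight: int


def plan_milestone(issues: list[Issue], capacity: int):
    """Split `issues` (already in priority order) into three lists.

    Returns (scheduled, needs_breakdown, deferred):
    - scheduled: issues that fit into the remaining capacity
    - needs_breakdown: issues at or above the breakdown threshold
    - deferred: issues that do not fit into what is left
    """
    scheduled, needs_breakdown, deferred = [], [], []
    remaining = capacity
    for issue in issues:
        if issue.weight not in ALLOWED_WEIGHTS:
            raise ValueError(f"unexpected weight {issue.weight} on {issue.title!r}")
        if issue.weight >= BREAKDOWN_THRESHOLD:
            needs_breakdown.append(issue)
        elif issue.weight <= remaining:
            scheduled.append(issue)
            remaining -= issue.weight
        else:
            deferred.append(issue)
    return scheduled, needs_breakdown, deferred
```

Because weights are meant to be used in aggregate, a plan like this is only a starting point for discussion, not a commitment for any individual issue.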
Every week the engineering team completes a backlog refinement process to review upcoming issues. The goal of this effort is for all issues to have a weight so we can more accurately plan each milestone and also improve our knowledge sharing.
Engineers can also estimate any issue outside of this backlog refinement process.
The team will identify issues that need to be refined using the `workflow::refinement` label. If there are issues that are good candidates for the backlog refinement process (without weight, unclear requirements, etc.), please use the label. We will refine a maximum of 5 issues per week.
The EM will use the refinement script to generate an issue with all the issues identified for refinement.
Over the week, each engineer on the team will look at the list of issues selected for backlog refinement. See the current backlog refinement issues.
For each issue, team members will review it and provide:
When refining issues, consider the following:
After engineers have had a chance to provide input, the EM or PM will:

- Remove the `workflow::refinement` label.
- Add the `workflow::ready for development` label.

For any issues that were not discussed and given a weight, we will work with the engineers to see if we need to get more information from PM or UX.
We hold scheduled "per milestone" retrospectives, and can have ad-hoc "per project" retrospectives.
We have milestone retrospective issues. These include the EM, PM, engineers, UX, and all stable counterparts. Participation is highly encouraged for every milestone. Retrospective issues are created on the 26th of each month for the currently running milestone. For more information, see group retrospectives.
If an issue, a feature, or other sort of project turns into a particularly useful learning experience, we may hold a synchronous or asynchronous retrospective to learn from it. If you think something you're working on deserves a retrospective:
Each quarter we have a series of Objectives and Key Results (OKRs) for our group. To find the current OKRs for this quarter, check the OKR project.
GitLab uses error budgets to measure the availability and performance of our features. Each engineering group has its own budget spend. The current 28-day spend for the Tenant Scale group can be found in this Grafana dashboard.
An error budget exception of 99.85% was approved to allow the group to focus on long-term scalability work.
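To give a feel for what the 99.85% figure means, the sketch below converts a target percentage into a time allowance over the 28-day window. This assumes the target can be read as a fraction of the window's total minutes; the function name `error_budget_minutes` is hypothetical, and the Grafana dashboard remains the authoritative source for actual spend.

```python
WINDOW_DAYS = 28  # the budget window mentioned above


def error_budget_minutes(target_percent: float, window_days: int = WINDOW_DAYS) -> float:
    """Return the minutes of budget implied by `target_percent` over the window."""
    window_minutes = window_days * 24 * 60  # 28 days = 40,320 minutes
    allowed_fraction = (100 - target_percent) / 100
    return window_minutes * allowed_fraction
```

Under this reading, a 99.85% target corresponds to roughly 60.5 minutes over 28 days.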
You can find our group metrics in the Data Stores:Tenant Scale Sisense dashboard and Tenant Scale Group Engineering Metrics page.
(Sisense↗) We also track our backlog of issues, including past due security and infradev issues, and total open System Usability Scale (SUS) impacting issues and bugs.
(Sisense↗) MR Type labels help us report what we're working on to industry analysts in a way that's consistent across the engineering department. The dashboard below shows the trend of MR Types over time and a list of merged MRs.
(Sisense↗) Flaky tests are problematic for many reasons.
(Sisense↗) Slow tests are impacting the GitLab pipeline duration.