This document summarizes the idea of sharding GitLab by top-level namespace. It is meant to provide an idea of how product features would have to change with namespace sharding and to highlight the difficulties and complexities we anticipate with this approach, as well as touching on implementation details and known unknowns.
Let's start off by defining common nomenclature.
In GitLab, projects are contained inside a namespace.. A namespace can be nested inside another namespace. This forms a tree-like hierarchy: Namespaces are the nodes of the tree, projects are the leaves. The root of this hierarchy is called "top-level namespace".
Let's decompose this example: https://gitlab.com/gitlab-org/database-team/team-tasks
gitlab-org: top-level namespace
database-team: a group inside
team-tasks: a project inside
After signup, each user has their own namespace associated under
gitlab.com/$user. Additionally, a user may interact with any number of (root) namespaces across the instance.
A data access pattern is a commonly found perspective of how data is being looked at. This typically informs how database query filters look like. Let's look at a few examples:
The gist of this idea is to slice all data horizontally by the respective top-level namespace.
/gitlab-com as two example top-level namespaces. Both contain many sub-groups, projects, a lot of issues, merge requests, CI data and so on. With sharding applied at the database level, we separate those two namespaces in the database into separate shards (for the sake of the example):
/gitlab-organd all its data
/gitlab-comand all its data
Glossing over data distribution details for now, we can think of those two shards as physically separate databases.
Now, as desired, each shard is much smaller in data size compared to the non-sharded database. This results in much better performance for database queries that are designed around this sharding idea. More on that in: "Features in support of sharding".
On the other hand and strictly speaking, this limits product features in a sense that accessing any data has to always be scoped to a (root) namespace. Otherwise, we would have to scan all shards - which is undesirable and breaks at scale. So this is about access patterns and how some product features access the data. More on that in: "Features in conflict with sharding".
Complicating things, there is data that is completely unrelated to a namespace. We can see this by analyzing the database schema statically. Without even thinking about access patterns at this point, we can already conclude that this type of data is not shardable by namespace - it doesn't have any reference to a namespace.
Before we dive into details, let's summarize shortly what is ideal for this sharding approach:
The following section details apply these points to GitLab.
We describe database sharding by top-level namespace and this builds on the premise of using the top-level namespace as the common sharding dimension throughout the application. Let's look into how well this works in GitLab.
GitLab's current database schema contains roughly 80 (out of 350) tables that do not have any notion of a namespace. This is a result from statically analyzing the database schema and basically checking for columns that (perhaps transitively) relate to a namespace. A summary for this analysis can be found in "A Note on the Technical Complexities with Sharding by Namespace" (GitLab-internal link).
In terms of data size, tables without a notion of namespace make up about 400 GB worth of data (about 6% of total size) as of today. Expecting a 100x growth, we would have 40TB worth of data that cannot be sharded. Depending on implementation details, this data would have to be replicated across to expose it locally to each shard or we would only maintain this data on a central coordinator instance (which means we'd have to combine local and remote data from shards in queries).
Assuming we're not also adding capabilities for multi-master writes (i.e. the ability to write the same data in different locations), this hinders our ability to scale the read-write capacity - because these writes can only happen on a single cluster.
A notion of the sharding key is required for sharding, but it's not sufficient to actually benefit from it. In order to leverage sharding, a feature needs to access sharded data by the sharding dimension. It is fine to add additional filters, but without the sharding key we are not only not making use of sharding but should expect a potentially severe degradation for query performance: Without the sharding key, we'd have to scan all shards (fan out) - which is very expensive. Depending on implementation details, it may be difficult to implement parallel cross shard scans.
Let's look at two examples here, more can be found in #50.
An issue tracker lives inside a single project. As such, issue data transitively references a top-level namespace.
Sharding by namespace is ideal for these features:
For these examples, we can clearly see the access path is "by namespace" or "by project" (which ultimately "by top-level namespace" covers).
On the other hand, there are features which would require a cross-shard scan:
These features have conflicting access patterns with the idea of sharding by namespace. This results in database queries that span all shards. Conceptually, this comes from the fact a user can interact with any number of namespaces across the instance. Therefore, in order to find assigned issues, we'd have to look into each and every namespace.
In a non-sharded database, this is easy: We maintain a single lookup table
issue_assignees (user_id int, issue_id int) to model the N:M relationship.
In a database sharded by namespace, we would have two shards following the previous example:
/gitlab-orgwith a table
issue_assignees (user_id int, issue_id int)
/gitlab-comdata with a table
issue_assignees (user_id int, issue_id int)
In order to find all assigned issues, we iterate all shards and query each of them. This is because we don't know which top-level namespaces the user interacted with at all and where to look for issue assignments (also see "Statistics on user interaction"). After querying all shards, we reconcile the results to typically display the TOP 20 issues (by some criteria) across all shards.
In the ideal scenario for namespace sharding, these features would be constrained by a namespace. This means, search cannot go across multiple namespaces. There are two theoretically possible ways to implement this:
In both cases, it would be possible to change the currently active namespace. However, none of the features would display or work with data across multiple namespaces.
The audit log contains information about events that happened throughout the instance. An event may pertain to a user, a project or a group.
Based on that, we can already conclude that there are events which are unrelated to a project: User-related events do not have a notion of a namespace at all. On GitLab.com, this makes up about 75% of the audit log data, which cannot be sharded by namespace.
We allow users to work with the audit log from different perspectives:
The global instance level view is the problematic one: Ideally, an instance admin wants to see what happened across all namespaces and users on the instance.
In summary, if we sharded by namespace we would face:
The audit log feature and its data is not suited well for namespace sharding in its current form unless we change the admin features.
For example, we could have an admin to always select a namespace of interest and we would only show data for a single namespace at a time.
However, in order to view audit logs for a user, we can only continue to build a large non-sharded table to provide this view. Depending on how isolated the audit log feature is from the rest of the application, we may want to think about feasibility of a different sharding dimension here (time or user).
The premise of sharding by namespace is the idea that "customers care only about what happens inside their namespace".
For example, it may even be undesirable for a SaaS customer on GitLab.com to accidentally @-mention users outside their organization. A typical SaaS customer is assumed to only care about search results inside their namespace, for example.
In GitLab, we track which projects a user interacted with. Based on that, we conducted an analysis of how many projects, groups and top-level namespaces users interact with on GitLab.com. Here, interaction is defined as performing any kind of action on a project - opening and just looking at a project is not enough in this case (i.e. we're looking at actions that lead to an event). Note this includes users without any activity.
A summary and the takeaways are:
95% of users interact with about 4 top-level namespaces or less (including their user namespace).
60% of users only interact with 1 top-level namespace
45% of users interact with a namespace other than their own
15% of users interact with more than one namespace other than their own
More details including the raw data used can be found in gitlab-org/database-team/user-interaction-data.
While this is good news for the namespace sharding approach, it doesn't solve the fundamental problem of conflicting access patterns for all features.
For "by user" access patterns, we may be able to use this information to first figure out which shards are relevant for the particular user at hand and then only query those. This would help to avoid scans across all shards and limit it closer to what is relevant. However, this is not a static analysis anymore but rather involves maintaining lookup information to track which shards are relevant for a given user.
GitLab is built on PostgreSQL which does not support native sharding (however, it's on the community roadmap). We explored using CitusDB instead, which is not an option for us.
We also explored using of PostgreSQL features such as partitioning in combination with foreign-data wrappers to implement a notion of a sharding database feature ourselves. While we identified a principal direction how this can be implemented, we are well aware of its complexities. This is not perceived as a "boring" solution and we do not know of positive reports from other companies going down this road. Also based on our own experience with using FDW for GitLab Geo, we deem this a very complicated approach.
Identified complexities reside around the following topics:
We expect to only need sharding capabilities for GitLab.com and don't expect this to be necessary for self-managed installations. This further complicates implementation because it's likely we would have to implement some actions specific to GitLab.com.
We explored what is needed for database sharding and we also got an idea of how well this would work with GitLab. In order to successfully implement namespace-based sharding, we identified two challenges:
We aim to introduce PostgreSQL table partitioning to the application. This can be thought of as a database tool we have at our disposal to considerably improve performance and scalability of the application.
It can be applied local to a single feature and optimize for a particular use case. For example, we are working on improving the audit log feature with a partitioning pattern. The benefit from this locality is the ability to optimize for different cases across the application - we don't have to find the single common sharding dimension for everything.
While we've explored partitioning to implement a notion of sharding (with FDW), partitioning is also simply a tool to improve database interaction. As such, it can be used additionally to sharding approaches: In a sharded database or in application sharding, partitioning is local to a shard and further improves database scalability.
In that sense, partitioning is a tool that is readily available for us to make use of now and one that does not introduce operational complexity. It directly improves performance and scalability of the database. On GitLab.com, we are at a scale currently where this seems to right tool to address performance issues. Additionally, it comes with the same fundamental thinking around slicing data horizontally and analyzing existing access patterns (and perhaps rethinking product features to align better).
Related links for partitioning:
We are going to continue to explore ways to improve database scalability, particularly with GitLab.com in mind:
Aside from this, we continue to discuss sharding approaches. This is mostly a product question to start with, because it is rather unlikely this works out without needing product changes.
Author: Andreas Brandl