
Product Direction - Database

Database

Last updated: 2021-10-14

Introduction and how you can help

This page contains the product direction for the Database group of the Enablement stage.

Please reach out to Yannis Roussos, Product Manager for the Database group, if you'd like to provide feedback or ask any questions related to this product direction.

This strategy is a work in progress, and everyone can contribute. Please comment and contribute in the linked issues and epics on this page. Sharing your feedback directly on GitLab.com is the best way to contribute to our strategy and vision.

Overview

GitLab stores data in various ways and forms. Redis is used both as a cache and as lightweight, semi-persistent storage. Object storage holds data that is not well suited to a database, and the file system is used directly by Gitaly and by the existing version of the Container Registry. Relational databases, however, are the primary persistent storage in GitLab and hold most of the user-generated data.

The Database group is the steward of all relational database technologies at GitLab. It is responsible for growing GitLab's database expertise, promoting and supporting its proper use, and making sure that there is continuity, with no knowledge gaps, as the GitLab team and product grow. It owns the main GitLab database from an application technology perspective, while individual feature groups own specific tables and the additional feature-specific databases, such as those for Gitaly Cluster, Geo, and the new Container Registry.

In practice, the Database group is focused on improving the scalability, performance, and resilience of GitLab's database layer as well as instituting best practices across the wider development organization.

The database is at the core of all user interactions with a GitLab instance, and it is the layer that most automated GitLab features, such as CI pipelines, depend on. Any bottleneck at the database layer, regression, or non-performant application code that interacts with the database can break a lovable GitLab feature or render it practically unusable. At the same time, database-related incidents can pose a significant threat to the availability of any GitLab instance.

As GitLab's architecture becomes more complex and the list of features grows, the database layer becomes more complex as well. The Database group is tasked with making sure that this process does not introduce any short- or long-term risks to GitLab's availability and that we can keep iterating on existing and new features while GitLab remains performant at all scales.

Approach

We apply the Group's combined application and database expertise to tackle the complexity of the database design as it grows, no matter the scale at which this design is applied. Scaling is not the end goal by itself; it is the constant that we always have to take into account. We have to make sure that the best database practices are applied to all problems, so that GitLab can be performant at all sizes, ranging from small self-managed instances to GitLab.com. Those best practices, together with research on state-of-the-art database approaches, in turn allow us to scale our largest instances to the next order of magnitude with each longer-term iteration.

We try to achieve those goals by:

Where we are headed

The Database Group's ongoing focus will always remain the scalability of the database, increasing the responsiveness and the availability of the GitLab platform, while also improving the efficiency and reliability of making database changes.

In the long term we plan to extend our automated database testing capabilities and explore how we can provide the tools that we are building to a wider audience. That audience can be internal, i.e. GitLab's engineering, or external, benefiting all the users of GitLab.

Today, most developers have no easy way to test new queries or database changes at scale. We want to figure out ways for developers to test all their database changes on production data, prior to production. Our approach will be to automate the process of testing queries against production data and integrate it in the DevOps lifecycle, at the stage where developers spend most of their time while developing and reviewing code. This will enable developers to complete database related updates and perform code reviews faster, with less guessing and more confidence backed by quantifiable data.

The long-term plan described above aligns with, and is explained in more detail in, the Enablement Section's theme of managing complexity for large software projects.

Finally, we are looking for ways to contribute back to the wider community beyond GitLab. We have extended the way Ruby on Rails projects interact with the database and we have introduced numerous new ideas and frameworks. We are evaluating how we could extract parts of our tools, for example by creating a separate library, open sourcing it and letting other developers use it. At the same time, that will enable more seamless contributions to GitLab's database layer.

What's Next & Why

Update all our tools to work with multiple databases

We are working closely with the Sharding group on decomposing GitLab's database (also known as vertical sharding). This approach relies on moving the tables associated with a feature into a separate logical database.

To help the Sharding group achieve that goal, the Database group will have to update the database tooling and our framework for executing background migrations so that both support multiple databases.
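As a rough illustration of what "tooling that works with multiple databases" implies, the minimal sketch below routes each table to the logical database that owns it and groups work per database. The table names, the database names `main` and `ci`, and both helper functions are hypothetical and chosen only for the example; they are not taken from GitLab's actual schema or tooling.

```python
# Hypothetical mapping from table name to owning logical database.
# Names are illustrative assumptions, not GitLab's real layout.
TABLE_TO_DATABASE = {
    "projects": "main",
    "users": "main",
    "ci_builds": "ci",
    "ci_pipelines": "ci",
}

# Tables without an explicit entry are assumed to stay in the main database.
DEFAULT_DATABASE = "main"


def database_for(table: str) -> str:
    """Return the logical database that owns the given table."""
    return TABLE_TO_DATABASE.get(table, DEFAULT_DATABASE)


def group_tables_by_database(tables):
    """Group tables by owning database so a tool can run once per database."""
    grouped = {}
    for table in tables:
        grouped.setdefault(database_for(table), []).append(table)
    return grouped


if __name__ == "__main__":
    # A tool touching several tables would open one connection per key.
    print(group_tables_by_database(["users", "ci_builds", "projects"]))
```

A tool written against a single connection implicitly assumes every table lives in one place; making the table-to-database lookup explicit is one way to remove that assumption.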

This is a top priority for both the Database and the Sharding groups, as the largest GitLab instance, GitLab.com, is approaching a point where scaling vertically (buying bigger servers) is no longer easily possible.

Batched Background Migrations - General Availability

In order to address the Primary Key overflow risk for tables with an integer PK, we had to update more than 8.5 billion records while not affecting the performance of GitLab.com's database and the availability of the platform as a whole.
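To make the overflow risk concrete, the sketch below estimates how much headroom an integer primary key has left. It assumes "integer" means a signed 32-bit column, whose maximum value is 2,147,483,647, and that the fix is a move to a signed 64-bit column. The function names and the sample numbers are hypothetical.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647: largest signed 32-bit value
INT64_MAX = 2**63 - 1  # largest signed 64-bit value


def remaining_ids(current_max_id: int, limit: int = INT32_MAX) -> int:
    """Number of IDs still available before the key overflows."""
    return max(limit - current_max_id, 0)


def days_until_overflow(current_max_id: int, ids_per_day: float,
                        limit: int = INT32_MAX) -> float:
    """Estimate days of headroom, assuming a constant insert rate."""
    if ids_per_day <= 0:
        raise ValueError("ids_per_day must be positive")
    return remaining_ids(current_max_id, limit) / ids_per_day


if __name__ == "__main__":
    # Hypothetical table: 2 billion rows allocated, 1 million new rows a day.
    print(remaining_ids(2_000_000_000))
    print(days_until_overflow(2_000_000_000, 1_000_000))
```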

To do so, we had to rethink our approach on how we perform massive data upgrades (data migrations), which led us to building a new framework for performing background migrations (asynchronous jobs running in the background).

The resulting framework, which we call batched background migrations, has multiple advantages compared to how we currently perform similar operations: it can dynamically adjust the work performed by monitoring the performance of the migrations in real time, it requires minimal monitoring by instance administrators, and it can automatically recover from performance-related errors.
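The idea of adjusting the work performed based on observed performance can be sketched as a simple feedback rule: measure how long the last batch took and scale the next batch size toward a target duration. This is a minimal sketch under assumed parameters, not the framework's actual algorithm; the function name, the target duration, the size bounds, and the growth cap are all hypothetical.

```python
def next_batch_size(current_size: int, elapsed_seconds: float,
                    target_seconds: float = 2.0,
                    min_size: int = 100, max_size: int = 50_000,
                    max_growth: float = 1.5) -> int:
    """Pick the next batch size from the last batch's duration.

    Slow batches shrink the size proportionally; fast batches grow it,
    but by at most `max_growth` per step to avoid sudden load spikes.
    """
    if elapsed_seconds <= 0:
        # No usable measurement: keep the current size.
        return current_size
    ratio = min(target_seconds / elapsed_seconds, max_growth)
    proposed = int(current_size * ratio)
    # Clamp to the configured bounds.
    return max(min_size, min(max_size, proposed))


if __name__ == "__main__":
    size = 1000
    # Hypothetical durations (in seconds) of consecutive batches.
    for elapsed in [1.0, 1.2, 4.0, 6.0, 2.0]:
        size = next_batch_size(size, elapsed)
        print(size)
```

Capping growth while allowing an unrestricted shrink is a deliberately cautious choice: the job backs off quickly when the database is under pressure and ramps up slowly when it is not.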

We plan to make it the standard and only way we perform migrations in GitLab, as we believe that it will help us reduce database-related incidents and, once it is mature, provide a seamless upgrade path for self-managed instances.

It will also allow us to update many of our existing tools (e.g. partitioning) and make them more reliable.

Reduce table sizes to < 100 GB per physical table

One of the top causes of performance degradation in relational databases is tables growing too large. There is no globally applicable hard rule on the size threshold that tables should not exceed; it depends on the schema of the table, the number and types of indexes defined over it, the mix of read and write traffic that it gets, the query workload used to access it, and more. Based on our analysis, we set the limit at 100 GB, and we have documented in detail our rationale and how we plan to approach this problem.

Addressing this for tables in GitLab.com is critical, but it will also allow us to provide a more scalable database design for GitLab instances with smaller databases.

Based on our analysis, we are planning to work with multiple other GitLab teams in an ongoing fashion towards achieving that goal.
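A check against the 100 GB target can be sketched as a filter over per-table sizes. The sketch assumes sizes are reported in bytes and interprets 100 GB as 100 GiB (1024^3 bytes each), which the text does not specify. The function name and the sample table sizes are hypothetical.

```python
GIB = 1024 ** 3
SIZE_LIMIT_BYTES = 100 * GIB  # assumption: "100 GB" read as 100 GiB


def oversized_tables(table_sizes, limit=SIZE_LIMIT_BYTES):
    """Return (table, size) pairs strictly above the limit, largest first."""
    over = [(name, size) for name, size in table_sizes.items() if size > limit]
    return sorted(over, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical physical table sizes in bytes.
    sizes = {
        "table_a": 250 * GIB,
        "table_b": 40 * GIB,
        "table_c": 120 * GIB,
    }
    for name, size in oversized_tables(sizes):
        print(f"{name}: {size / GIB:.0f} GiB")
```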

Automated database testing: Advanced database testing features

Our framework for automated database migration testing using production clones has been released to all of GitLab's engineering and is seeing wide adoption.

We have already increased the maturity of the setup, so our next target is to add more advanced database testing features and expanded support for data migrations.

Database Schema Validation

Ensuring that the deployed database schema matches the expectations of the codebase is important for addressing issues that self-managed instances may face while upgrading to newer versions. It will allow us to support preemptive checks before an instance is upgraded and to warn about potential issues before the process starts.
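A preemptive check of this kind can be sketched as a comparison between the schema the codebase expects and the schema found on the instance. The sketch below assumes both are available as nested dictionaries of table, column, and type. That representation, the function name, and the sample schemas are hypothetical and only illustrate the comparison step.

```python
def schema_differences(expected, actual):
    """Compare two schemas given as {table: {column: type}} dictionaries.

    Returns human-readable problems for tables or columns that the
    codebase expects but that are missing or typed differently.
    """
    problems = []
    for table, columns in expected.items():
        if table not in actual:
            problems.append(f"missing table: {table}")
            continue
        for column, col_type in columns.items():
            found = actual[table].get(column)
            if found is None:
                problems.append(f"missing column: {table}.{column}")
            elif found != col_type:
                problems.append(
                    f"type mismatch: {table}.{column} "
                    f"expected {col_type}, found {found}"
                )
    return problems


if __name__ == "__main__":
    expected = {"users": {"id": "bigint", "email": "text"},
                "projects": {"id": "bigint"}}
    deployed = {"users": {"id": "integer"}}
    for problem in schema_differences(expected, deployed):
        print(problem)
```

An empty result would mean the upgrade can proceed as far as this check is concerned; extra tables or columns on the instance are deliberately ignored in this sketch.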

Metrics (Internal)

  1. Enablement::Database - Performance Indicators Dashboard
  2. Average Query Apdex for GitLab.com
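For readers unfamiliar with Apdex, the sketch below computes the standard score: requests at or below a threshold T count as satisfied, requests between T and 4T count as tolerating with half weight, and the rest count as frustrated. How the dashboard above defines its threshold and aggregates queries is not stated here, so the threshold and the sample durations are hypothetical.

```python
def apdex(durations, threshold):
    """Standard Apdex: (satisfied + tolerating / 2) / total samples."""
    if not durations:
        raise ValueError("at least one duration is required")
    satisfied = sum(1 for d in durations if d <= threshold)
    tolerating = sum(1 for d in durations if threshold < d <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(durations)


if __name__ == "__main__":
    # Hypothetical query durations in seconds, with a 100 ms threshold.
    print(apdex([0.05, 0.1, 0.3, 1.0], threshold=0.1))
```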