The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc.
Last updated: 2021-10-14
This page contains the product direction for the Database group of the Enablement stage.
Please reach out to Yannis Roussos, Product Manager for the Database group (Email) if you'd like to provide feedback or ask any questions related to this product direction.
This strategy is a work in progress, and everyone can contribute. Please comment and contribute in the linked issues and epics on this page. Sharing your feedback directly on GitLab.com is the best way to contribute to our strategy and vision.
GitLab stores data in various ways and forms. Redis is used both as a cache and as lightweight, semi-persistent storage. Object storage holds data that is not well suited to a database, and the file system is used directly by Gitaly and by the existing version of the Container Registry. Relational databases, however, are the predominant persistent storage in GitLab and hold most of the user-generated data.
The Database group is the steward of all relational database technologies at GitLab. It is responsible for growing GitLab's database expertise, promoting and supporting its proper use and making sure that there is continuity and no knowledge gaps as the GitLab team and product grow. It owns the main GitLab database from an application technology perspective, while individual feature groups own specific tables and the additional feature specific databases, such as Gitaly Cluster, Geo and the database for the new Container Registry.
In practice, the Database group is focused on improving the scalability, performance, and resilience of GitLab's database layer as well as instituting best practices across the wider development organization.
The database is at the core of all user interactions with a GitLab instance, and it is the layer that most automated GitLab features, such as CI pipelines, depend on. Any bottleneck at the database layer, regression, or non-performant application code that interacts with the database can break a lovable GitLab feature or render it practically unusable. At the same time, database-related incidents can pose a significant threat to the availability of any GitLab instance.
As GitLab's architecture becomes more complex and the list of features grows, the database layer becomes more complex as well. The Database group is tasked with making sure that this process does not introduce any short- or long-term risks to GitLab's availability, and that we can keep iterating on existing and new features while GitLab remains performant at all scales.
We apply the group's combined application and database expertise to tackle the complexity of the database design as it grows, no matter the scale at which this design is applied. Scaling is not the end goal in itself; it is a constant that we always have to take into account. We have to make sure that the best database practices are applied to all problems, so that GitLab can be performant at all sizes, ranging from small self-managed instances to GitLab.com. Those best practices, together with research on state-of-the-art database approaches, also allow us to scale our largest instances to the next order of magnitude with each longer-term iteration that we take.
We try to achieve those goals by:
Building the core tools and frameworks that allow all of GitLab's engineering to efficiently interact with the database layer.
We strive to hide complex implementations behind simple interfaces that are easy to understand and use, without requiring deep understanding of database performance optimizations and approaches.
As an example, we consider our work successful when we can allow a GitLab engineer to create a partitioned table and then move hundreds of millions of records to it by issuing two simple commands, while also guaranteeing that the process will succeed without any additional involvement required.
Adding documentation and guidelines that explain the tools offered and describe the process to follow for the most common use cases in which application code needs to interact with the database.
Dogfooding as a group the tools provided and the processes proposed until we make sure that they are ready for general availability.
We had to partition the first two tables in GitLab before asking other teams to use the final, thoroughly tested tools. Similarly, we battle tested our initial implementation of our new background migration framework by updating more than 8.5 billion records before working on making it widely available.
Enabling every GitLab team member to test against clones of our production database, either manually or in automated ways.
Helping our database review process (reviewers, maintainers) grow.
The Database Group's ongoing focus will always remain the scalability of the database, increasing the responsiveness and the availability of the GitLab platform, while also improving the efficiency and reliability of making database changes.
In the long term we plan to extend our automated database testing capabilities and explore how we can provide the tools that we are building to a wider audience. That audience can be internal, i.e. GitLab's engineering, or external, benefiting all the users of GitLab.
Today, most developers have no easy way to test new queries or database changes at scale. We want to figure out ways for developers to test all their database changes on production data, prior to production. Our approach will be to automate the process of testing queries against production data and integrate it in the DevOps lifecycle, at the stage where developers spend most of their time while developing and reviewing code. This will enable developers to complete database related updates and perform code reviews faster, with less guessing and more confidence backed by quantifiable data.
The long-term plan described above aligns with, and is explained in more detail in, the Enablement section's theme of managing complexity for large software projects.
Finally, we are looking for ways to contribute back to the wider community beyond GitLab. We have extended the way Ruby on Rails projects interact with the database and we have introduced numerous new ideas and frameworks. We are evaluating how we could extract parts of our tools, for example by creating a separate library, open sourcing it and letting other developers use it. At the same time, that will enable more seamless contributions to GitLab's database layer.
We are working closely with the Sharding group on decomposing GitLab's database (also known as vertical sharding). This approach relies on moving the tables associated with a feature into a separate logical database.
This is a top priority for both the Database and the Sharding groups, as the largest GitLab instance, GitLab.com, is approaching a point where scaling vertically (buying bigger servers) is no longer easily possible.
To address the primary key overflow risk for tables with an integer primary key (a 32-bit signed integer cannot exceed 2,147,483,647), we had to update more than 8.5 billion records without affecting the performance of GitLab.com's database or the availability of the platform as a whole.
To do so, we had to rethink our approach to massive data upgrades (data migrations), which led us to build a new framework for performing background migrations (asynchronous jobs running in the background).
The resulting framework, which we call batched background migrations, has multiple advantages over how we currently perform similar operations: it dynamically adjusts the work performed by monitoring the performance of the migrations in real time, it requires minimal monitoring by instance administrators, and it can automatically recover from performance-related errors.
We plan to make it the standard and only way we perform migrations in GitLab. We believe that it will help us reduce database-related incidents and that, once it is mature, it will provide a seamless upgrade path for self-managed instances.
It will also allow us to update many of our existing tools (e.g. partitioning) and make them more reliable.
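The idea of adjusting the work performed based on observed performance can be sketched as follows. This is a minimal illustration in Python, not GitLab's implementation: the names `BatchSettings`, `next_batch_size` and `run_batched`, as well as every threshold value, are hypothetical and chosen only for the example. Each batch reports how long it took, and the next batch size is scaled toward a target duration within fixed bounds.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BatchSettings:
    # All values are illustrative assumptions, not GitLab defaults.
    target_seconds: float = 2.0  # how long one batch should ideally take
    min_size: int = 100          # never process fewer rows than this per batch
    max_size: int = 10_000       # never process more rows than this per batch
    max_growth: float = 1.2      # grow by at most 20% per step
    max_shrink: float = 0.5      # shrink to at most half the size per step


def next_batch_size(current: int, elapsed_seconds: float, settings: BatchSettings) -> int:
    """Scale the batch size toward the target duration, within safe bounds."""
    if elapsed_seconds <= 0:
        factor = settings.max_growth
    else:
        factor = settings.target_seconds / elapsed_seconds
    # Limit how aggressively the size may change in a single step.
    factor = max(settings.max_shrink, min(settings.max_growth, factor))
    proposed = round(current * factor)
    return max(settings.min_size, min(settings.max_size, proposed))


def run_batched(
    first_id: int,
    last_id: int,
    process_batch: Callable[[int, int], float],
    settings: BatchSettings,
    initial_size: int = 1_000,
) -> List[int]:
    """Walk an id range in batches and return the size of each batch.

    `process_batch(start_id, end_id)` is a hypothetical callback that updates
    the rows in the inclusive range and returns the elapsed time in seconds.
    """
    sizes: List[int] = []
    size = initial_size
    start = first_id
    while start <= last_id:
        end = min(start + size - 1, last_id)
        elapsed = process_batch(start, end)
        sizes.append(end - start + 1)
        size = next_batch_size(size, elapsed, settings)
        start = end + 1
    return sizes
```

In this sketch a slow batch halves the next batch at most, while a fast batch grows it by no more than 20%, so the load backs off quickly under pressure and recovers gradually.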
One of the top reasons for performance degradation in relational databases is tables growing too large. There is no globally applicable hard rule for the size that a table should not exceed; it depends on the schema of the table, the number and types of indexes defined on it, the mix of read and write traffic that it gets, the query workload used to access it, and more. As a result of our analysis, we set the limit at 100GB, and we explain in detail our rationale and how we plan to approach this problem.
Addressing this for tables in GitLab.com is critical, but it will also allow us to provide a more scalable database design for GitLab instances with smaller databases.
Our framework for automated database migration testing using production clones has been released to all of GitLab's engineering and is seeing wide adoption.
We have already increased the maturity of the setup, so our next target is to add more advanced database testing features and expanded support for data migrations.
Ensuring that the deployed database schema matches the expectations of the codebase is important for addressing issues that self-managed instances may face while upgrading to newer versions. It will allow us to support preemptive checks before an instance is upgraded and to warn about potential issues before the process starts.
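A preemptive check of this kind could, in principle, compare the schema that the codebase expects with the schema found on the instance and report every difference before the upgrade begins. The following Python sketch only illustrates that comparison. The `Schema` representation, the function name `schema_mismatches` and the message formats are assumptions made for this example and do not describe an existing GitLab tool.

```python
from typing import Dict, List

# Hypothetical representation: table name -> {column name -> column type}.
Schema = Dict[str, Dict[str, str]]


def schema_mismatches(expected: Schema, actual: Schema) -> List[str]:
    """Return human-readable differences between two schemas."""
    problems: List[str] = []
    for table in sorted(expected):
        if table not in actual:
            problems.append(f"missing table: {table}")
            continue
        # Columns the codebase expects must exist with the expected type.
        for column, col_type in sorted(expected[table].items()):
            actual_type = actual[table].get(column)
            if actual_type is None:
                problems.append(f"missing column: {table}.{column}")
            elif actual_type != col_type:
                problems.append(
                    f"type mismatch: {table}.{column} is {actual_type}, expected {col_type}"
                )
        # Columns present on the instance but unknown to the codebase.
        for column in sorted(set(actual[table]) - set(expected[table])):
            problems.append(f"unexpected column: {table}.{column}")
    for table in sorted(set(actual) - set(expected)):
        problems.append(f"unexpected table: {table}")
    return problems
```

An empty result would mean the instance matches expectations; any entries could be surfaced as warnings before the upgrade is started.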