The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc.
Last updated: 2022-08-22
This page contains the product direction for the Database group of the Enablement stage.
Please reach out to Gabe Weaver, Acting Product Manager for the Database group, if you'd like to provide feedback or ask any questions related to this product direction.
This strategy is a work in progress, and everyone can contribute. Please comment and contribute in the linked issues and epics on this page. Sharing your feedback directly on GitLab.com is the best way to contribute to our strategy and vision.
GitLab stores data in various ways and forms. There is Redis, which is used both as a cache and as a more lightweight semi-persistent storage. There is also object storage for data not well suited to a database, and even the file system in the case of Gitaly or the existing version of the Container Registry. But relational databases are the prominent persistent storage in GitLab, used for storing most of the user generated data.
The Database group is the steward of all relational database technologies at GitLab. It is responsible for growing GitLab's database expertise, promoting and supporting its proper use and making sure that there is continuity and no knowledge gaps as the GitLab team and product grow. It owns the main GitLab database from an application technology perspective, while individual feature groups own specific tables and the additional feature specific databases, such as Gitaly Cluster, Geo and the database for the new Container Registry.
In practice, the Database group is focused on improving the scalability, performance, and resilience of GitLab's database layer as well as instituting best practices across the wider development organization.
The database is at the core of all user interactions with a GitLab instance, and it is the layer that most automated GitLab features, such as CI pipelines, depend on. Any bottleneck at the database layer, regression, or non-performant application code that interacts with the database can break any lovable GitLab feature or render it practically unusable. At the same time, database related incidents can pose a significant threat to the availability of any GitLab instance.
As GitLab's architecture becomes more complex and the list of features grows, the database layer becomes more complex as well. The Database group is tasked with making sure that this process does not introduce any short or long term risks to GitLab's availability and that we can keep iterating on existing and new features, while GitLab remains performant at all scales.
We apply the Group's combined application and database related expertise to tackle the complexity of the database design as it grows, no matter the scale that this design is applied to. Scaling is not the end goal by itself; it is the constant we always have to take into account. We have to make sure that the best database practices are applied to all problems, so that GitLab can be performant at all sizes, ranging from small self-managed instances to GitLab.com. Those best practices and research on state of the art database approaches consequently allow us to also scale our largest instances to the next order of magnitude with each longer term iteration that we take.
We try to achieve those goals by:
Building the core tools and frameworks that allow all of GitLab's engineering to efficiently interact with the database layer.
We strive to hide complex implementations behind simple interfaces that are easy to understand and use, without requiring deep understanding of database performance optimizations and approaches.
As an example, we consider our work successful when we can allow a GitLab engineer to create a partitioned table and then move hundreds of millions of records to it by issuing two simple commands, while also guaranteeing that the process will succeed without any additional involvement required.
Adding documentation and guidelines that explain the tools offered and describe the process to follow for the most common cases in which application code needs to interact with the database.
Dogfooding as a group the tools provided and the processes proposed until we make sure that they are ready for general availability.
We had to partition the first two tables in GitLab before asking other teams to use the final, thoroughly tested tools. Similarly, we battle tested our initial implementation of our new background migration framework by updating more than 8.5 billion records before working on making it widely available.
Enabling every GitLab team member to test against clones of our production database, either manually or in automated ways.
Helping our database reviewing process (reviewers, maintainers) to grow.
The Database Group's ongoing focus will always remain the scalability of the database, increasing the responsiveness and the availability of the GitLab platform, while also improving the efficiency and reliability of making database changes.
We are planning to continue addressing this on two fronts:
Easy to use and understand application level libraries and frameworks that make all database operations as performant and as reliable as possible.
We are moving towards more self monitoring, auto-tuning approaches that can respond to changing conditions on production environments without any manual intervention required. That includes both the operations running in the background in a GitLab server and all the operations required to upgrade a GitLab instance.
We are also trying to close the knowledge gap and make most complex database operations as simple to implement for other GitLab Groups as calling a few helper functions.
Shift left our ability to pre-emptively find database related regressions and performance issues by testing all database operations against a production clone of GitLab.com's database.
In the long term we also plan to extend our automated database testing capabilities and explore how we can provide the tools that we are building to a wider audience. That audience can be internal, i.e. GitLab's engineering, or external, benefiting all the users of GitLab.
Today, most developers have no easy way to test new queries or database changes at scale. We want to figure out ways for developers to test all their database changes on production data, prior to production. Our approach will be to automate the process of testing queries against production data and integrate it in the DevOps lifecycle, at the stage where developers spend most of their time while developing and reviewing code. This will enable developers to complete database related updates and perform code reviews faster, with less guessing and more confidence backed by quantifiable data.
Our long term plan described above aligns with, and is explained in more detail in, the Enablement Section's theme of managing complexity for large software projects.
Finally, we are looking for ways to contribute back to the wider community beyond GitLab. We have extended the way Ruby on Rails projects interact with the database and we have introduced numerous new ideas and frameworks. We are evaluating how we could extract parts of our tools, for example by creating a separate library, open sourcing it and letting other developers use it. At the same time, that will enable more seamless contributions to GitLab's database layer.
Background migrations are the vehicle for executing all large data updates (data migrations) in GitLab. Any operation that has to update more than a few thousand records has to be performed through a background migration, with workers running asynchronously in the background and executing the update in batches so that the performance of a GitLab instance is not affected.
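The batching idea described above can be illustrated with a minimal sketch. This is not GitLab's framework (which is built on Ruby on Rails); the names `batch_ranges`, `run_in_batches`, and `update_batch` are hypothetical, and the sketch assumes records are addressed by a contiguous integer id range.

```python
from typing import Callable, Iterator, Tuple


def batch_ranges(min_id: int, max_id: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start_id, end_id) ranges that cover min_id..max_id."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1


def run_in_batches(
    min_id: int,
    max_id: int,
    batch_size: int,
    update_batch: Callable[[int, int], int],
) -> int:
    """Apply update_batch to each id range in turn; return the total rows updated.

    update_batch is a hypothetical callback that updates the rows whose id
    falls in the given inclusive range and returns how many rows it touched.
    """
    total = 0
    for start, end in batch_ranges(min_id, max_id, batch_size):
        total += update_batch(start, end)
    return total
```

Because each batch touches a bounded id range, a worker can process one range at a time and the size of every individual update stays small and predictable.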
Up until GitLab 15.0, such operations were performed by scheduling background jobs at regular intervals. This approach was static and did not take into account the load of the database server when each job was executed. To address this, we had to rethink how we perform massive data upgrades, which led us to build the batched background migrations framework.
We first introduced the batched background migrations framework while addressing the primary key overflow risk for tables with an integer PK, for which we had to update more than 8.5 billion records. The framework adapts in real time to the load of the database server, adjusts the work performed by monitoring the performance of the migrations, requires minimal monitoring by instance administrators, and can automatically recover from performance related errors.
In 15.0, we made batched background migrations available to all GitLab engineers and made them the default way of performing background migrations.
We plan to continue addressing any issues discovered and add support for missing or novel features that will make this framework even more reliable. In the long term, we plan to update many of our existing tools that perform data operations to use batched background migrations as well. We also plan to evaluate whether we can extend the framework or introduce a similar framework, which will cover other asynchronous operations that are not background migrations, like scheduled or recurring jobs.
We are implementing a generic throttling mechanism for large data changes that will monitor the health of the Database for various signals (leading indicators) and react to problems by throttling or even pausing the execution of the updates. Our plan is to do so by extending the batched background migrations auto-tuning layer to monitor for said signals and actively react by further adjusting the batch sizes of scheduled jobs.
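As a rough illustration of how auto-tuning and throttling could interact, the sketch below resizes the next batch toward a target duration and signals a pause when a health check fails. It is a simplified model, not the framework's actual algorithm; the `Tuning` values, the `healthy` flag, and the convention of returning 0 to mean "pause" are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tuning:
    # All values below are hypothetical defaults chosen for illustration.
    target_seconds: float = 2.0  # desired duration of one batch
    min_batch: int = 100         # never shrink below this size
    max_batch: int = 50_000      # never grow above this size
    max_growth: float = 1.2      # cap on growth per adjustment step


def next_batch_size(
    current: int,
    last_duration: float,
    healthy: bool,
    tuning: Tuning = Tuning(),
) -> int:
    """Return the next batch size, or 0 to signal that execution should pause.

    `healthy` stands in for whatever leading indicators a real system would
    monitor; here it is simply passed in by the caller.
    """
    if not healthy:
        return 0
    if last_duration <= 0:
        return current
    # Scale toward the target duration, but limit how quickly the size grows.
    factor = min(tuning.target_seconds / last_duration, tuning.max_growth)
    proposed = round(current * factor)
    return max(tuning.min_batch, min(tuning.max_batch, proposed))
```

Capping growth while allowing unrestricted shrinking is a deliberate asymmetry in this sketch: it backs off quickly when batches run slow and speeds up cautiously when they run fast.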
We are shifting left our ability to pre-emptively find database related regressions and performance issues by testing all database updates against a production clone of GitLab.com's database. With every feature we add, we move one step closer to GitLab being more performant and lower the risk that code may be deployed that could cause incidents and affect the performance and availability of GitLab.com or other self managed instances.
GitLab 15.0 marked an important milestone for the GitLab internal automated database testing feature: we are now testing all types of database migrations against a clone of the production database of GitLab.com:
All regular and post migrations - all schema updates and small scale data updates
All data migrations (through sampling) - both Sidekiq and Batched background migrations
That means that 100% of scheduled database updates are covered, making sure that we test our most tricky operations before they are even merged. We expect that the effect of those tests will be evident to both GitLab.com and self managed instances throughout GitLab 15 and beyond.
But that was only the beginning as we are expanding our scope; scheduled database updates do not include regular queries or updates that result from user interactions (users performing an action or causing a background job to run). Our next effort will be to find ways to perform automated query analysis for Merge Requests and test newly introduced queries against our production clones as well. This is a difficult problem to solve as we must figure out the parameters for the queries, which depend on the data stored, so we are going to start with the simplest iteration possible, identifying the queries introduced by each MR to support the database reviewers.
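One possible way to picture the "simplest iteration" above is to normalize queries into fingerprints and report the fingerprints that appear only after a change. The sketch below is an assumption about how this could be approached, not a description of GitLab's tooling; `fingerprint` and `new_queries` are hypothetical helpers, and the regex-based normalization is deliberately naive.

```python
import re
from typing import Iterable, List


def fingerprint(sql: str) -> str:
    """Reduce a SQL statement to a literal-free, whitespace-normalized form."""
    text = re.sub(r"'(?:[^']|'')*'", "?", sql)  # replace string literals
    text = re.sub(r"\b\d+\b", "?", text)         # replace standalone numbers
    return re.sub(r"\s+", " ", text).strip().lower()


def new_queries(before: Iterable[str], after: Iterable[str]) -> List[str]:
    """Return fingerprints present in `after` but absent from `before`."""
    known = {fingerprint(query) for query in before}
    return sorted({fingerprint(query) for query in after} - known)
```

For example, two queries that differ only in the id they look up collapse to the same fingerprint, so only structurally new queries would be surfaced to a reviewer.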
These are our next top priorities, which we are not actively working on right now.
One of the top reasons for performance degradation in relational databases is tables growing too large. There is no globally applicable hard rule on the size threshold that tables should not exceed; it depends on the schema of the table, the number and types of indexes defined over it, the mix of read and write traffic that it gets, the query workload used to access it, and more. As a result of our analysis, we set the limit at 100GB and we explain in detail our rationale and how we plan to approach this problem.
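A check against such a limit could look like the following sketch. It assumes table sizes have already been collected as a mapping from table name to bytes, and it interprets "100GB" as 100 GiB; both the interpretation and the helper name `oversized_tables` are assumptions for illustration.

```python
from typing import Dict, List, Tuple

GIB = 1024 ** 3
SIZE_LIMIT_BYTES = 100 * GIB  # assumption: "100GB" is read as 100 GiB


def oversized_tables(
    sizes_in_bytes: Dict[str, int],
    limit: int = SIZE_LIMIT_BYTES,
) -> List[Tuple[str, int]]:
    """Return (table, size) pairs above the limit, largest first."""
    over = [(name, size) for name, size in sizes_in_bytes.items() if size > limit]
    return sorted(over, key=lambda item: item[1], reverse=True)
```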
Addressing this for tables in GitLab.com is critical, but it will also allow us to provide a more scalable database design for GitLab instances with smaller databases.
Ensuring that the deployed database schema matches codebase expectations is important for addressing issues self managed instances may face while upgrading to newer versions. It will allow us to support preemptive checks before an instance is upgraded and warn about potential issues before the process is started.
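To make the idea of a preemptive check concrete, here is a minimal sketch that compares an expected schema with a deployed one. The representation (a mapping of table name to a mapping of column name to type) and the function name `schema_diff` are hypothetical; a real check would also need to cover indexes, constraints, and other schema objects.

```python
from typing import Dict, List

Schema = Dict[str, Dict[str, str]]  # table name -> {column name: column type}


def schema_diff(expected: Schema, actual: Schema) -> List[str]:
    """List the ways in which `actual` fails to satisfy `expected`."""
    problems: List[str] = []
    for table, columns in sorted(expected.items()):
        if table not in actual:
            problems.append(f"missing table: {table}")
            continue
        for column, col_type in sorted(columns.items()):
            found = actual[table].get(column)
            if found is None:
                problems.append(f"missing column: {table}.{column}")
            elif found != col_type:
                problems.append(
                    f"type mismatch: {table}.{column} expected {col_type}, found {found}"
                )
    return problems
```

An empty result would mean the deployed schema satisfies every expectation in this simplified model, so an upgrade could proceed; any entry could be reported as a warning before the process starts.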