In early 2020, it took 3.5 months of solid work to implement replication of a new data type in Geo. One year later, support can be added within a month -- including development and all required reviews. How did we do it? First, let me introduce you to Geo.
What is Geo?
GitLab Geo is the solution for widely distributed development teams and for providing a warm-standby as part of a disaster recovery strategy. Geo replicates your GitLab instance to one or more local, read-only instances.
What are data types?
GitLab Geo was released in June 2016 with GitLab 8.9 with the ability to replicate project repositories to a read-only secondary GitLab site. Developers located near secondary sites could fetch project repositories as quickly as if they were near the primary.
But what about wiki repositories? What about LFS objects or CI job artifacts? In GitLab, each of these things is represented by different Ruby classes, database tables, and storage configurations. In Geo, we call these data types.
Is it really that hard to copy data?
When we say a new data type is supported by Geo, this is what we mean:
- Backfill existing data to Geo secondary sites
- As fast as possible, replicate new or updated data to Geo secondary sites
- As fast as possible, replicate deletions to Geo secondary sites
- Retry replication if it fails, for example due to a transient network failure
- Eventually recover missing or inconsistent data, for example if Sidekiq jobs are lost, or if infrastructure fails
- Exclude data according to selective sync settings on each Geo secondary site
- Exclude remotely stored data unless "Allow this secondary node to replicate content on Object Storage" is enabled on a Geo secondary site
- Verify data integrity against the primary data, after replication
- Re-verify data integrity at regular intervals
- Report metrics to Prometheus
- Report metrics in the Admin UI
- View replication and verification status of any individual record in the Admin UI
- Make replication and verification job concurrency configurable in the Admin UI
- Retry replication if data mismatch is detected (coming soon to all data types using the framework)
- Allow manual re-replication and re-verification in the Admin UI (coming soon to all data types using the framework)
- And more
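To make one item from this list concrete, here is a minimal Ruby sketch of "retry replication if it fails". It assumes a per-record registry that tracks sync state and schedules the next attempt with a capped exponential backoff. The `Registry` class, its method names, and the delay values are hypothetical illustrations, not Geo's actual implementation.

```ruby
# Hypothetical sketch: tracks the sync state of one replicated record.
# Timestamps are plain integers (seconds) to keep the example simple.
class Registry
  BASE_DELAY_SECONDS = 60    # assumed first retry delay
  MAX_DELAY_SECONDS  = 3600  # assumed upper bound on the delay

  attr_reader :state, :retry_count, :retry_at

  def initialize
    @state = :pending
    @retry_count = 0
    @retry_at = nil
  end

  # A successful sync clears any retry bookkeeping.
  def synced!
    @state = :synced
    @retry_count = 0
    @retry_at = nil
  end

  # A failed sync schedules the next attempt further into the future.
  def failed!(now)
    @state = :failed
    @retry_count += 1
    @retry_at = now + backoff_seconds
  end

  def due_for_retry?(now)
    state == :failed && now >= retry_at
  end

  private

  # Delay doubles with every consecutive failure, up to the cap.
  def backoff_seconds
    [BASE_DELAY_SECONDS * 2**(retry_count - 1), MAX_DELAY_SECONDS].min
  end
end
```

A transient network failure then costs only a short wait, while a record that keeps failing is retried less and less often instead of hogging the job queue.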
How to iterate yourself into a problem
- Project Git repositories
- Project wiki Git repositories
- Issue/MR/Epic attachments
- LFS objects
- CI job artifacts
- Container/Docker registry
And we had added a slew of features around these data types. But suddenly it was clear we had a problem. We were falling behind in the race to replicate and verify all of GitLab's data.
- Other teams were adding a new data type every few months. It was painful to prioritize 3 months of development time just to add replication for one data type. And even if we caught up, the latest features would always be unsupported by Geo for 3 months.
- Automatic verification of Project and Wiki repositories was implemented, but adding it to any other single data type was going to take 3 months.
- Maintenance and other new features were increasing in effort due to the amount of code duplication.
- Our event architecture needed too much boilerplate and overhead to add new events.
How to iterate yourself out of a problem
Just because it's possible to iterate yourself into a problem doesn't mean iteration failed you. Yes, ideally we would have seen this coming earlier. But consider that fast and small iteration has likely saved many hours of upfront work on features that have been quickly validated, and have since been changed or removed. It's also possible to DRY up code too soon into bad abstractions, which can be painful to tear apart.
But we reached a point where everyone agreed that the most efficient way forward required consolidating existing code.
Do the design work
The proposal was to build a new Geo replication and verification framework, with the explicit goal of enabling teams across GitLab to add new data types in a way that supports Geo replication out of the box.
Most of the logic listed above under "Is it really that hard to copy data?" is exactly the same for all data types. An internal framework could be used to significantly reduce duplication, which could deliver huge benefits:
- Bugs in the framework only have to be fixed once, increasing reliability and maintainability.
- New features could be added to the framework for all data types at once, increasing velocity and consistency.
- Implementation details would be better hidden. Changes outside the framework become safer and easier.
The proposal went further than making it easy for ourselves to add Geo support to new data types. The goal was to make it easy for non-Geo engineers to do so. To achieve this goal, the framework must be easy to use, easy to understand, and well-documented. Besides the usual benefits of reducing duplication, this higher standard would help:
- Minimize the effort to implement Geo support of new features, whether it's done by a Geo engineer or not.
- Minimize lag time to add Geo support. If it's easy to do, and anyone can do it, then it's easy to prioritize.
- Increase awareness in other teams that new features may require Geo support.
- Influence the planning of new features. There are ways to make it more difficult to add Geo support. This is much easier to avoid during initial planning.
As a first step, Fabian proposed creating a proof of concept of a framework leveraging lessons learned and incorporating improvements we already wanted to make to the existing architecture. The issue stimulated lots of design discussion in the team, as well as multiple POCs riffing off one another.
The biggest change was the introduction of a Replicator class which could be subclassed for every data type. The subclasses would contain the vast majority of the specifics to each data type.
In order to further reduce duplication, we also introduced the concept of a Replicator strategy. Most data types in GitLab could be categorized as blobs (simple files) or Git repositories. Within these categories, there was relatively little logic that needed to be specific to each data type. So we could encapsulate the logic specific to these categories in strategies.
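The shape of this design can be sketched in a few lines of Ruby. This is only an illustration of the idea of one Replicator subclass per data type plus one strategy per category. Every class, module, and method name below is hypothetical and simplified, and should not be read as the framework's real API.

```ruby
# Illustrative sketch only: these names are hypothetical, not GitLab's actual code.

# Strategy for "blob" data types (simple files).
module BlobStrategy
  def replicate
    "download file #{resource_path}"
  end
end

# Strategy for Git repository data types.
module RepositoryStrategy
  def replicate
    "fetch repository #{resource_path}"
  end
end

# Base class holding what is common to every data type.
class Replicator
  attr_reader :record_id

  def initialize(record_id)
    @record_id = record_id
  end

  # Each subclass names the data type it replicates.
  def self.data_type
    raise NotImplementedError, "#{name} must define data_type"
  end

  def resource_path
    "#{self.class.data_type}/#{record_id}"
  end
end

# A subclass only picks its category and declares its specifics.
class PackageFileReplicator < Replicator
  include BlobStrategy

  def self.data_type
    "package_file"
  end
end

class SnippetRepositoryReplicator < Replicator
  include RepositoryStrategy

  def self.data_type
    "snippet_repository"
  end
end

puts PackageFileReplicator.new(7).replicate
puts SnippetRepositoryReplicator.new(3).replicate
```

In this sketch, supporting another blob type means writing one more small subclass. The download logic itself is written once in the strategy.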
Another significant decision was to make the event system more flexible and lightweight. We wanted to be able to quickly implement new kinds of events for a Replicator. We decided to do this without rewriting the entire event processing layer, by packaging and transmitting Replicator events within a single, generic event leveraging the existing heavyweight event system. We could then leave the old system behind, and after migrating all data types to the framework, we could easily replace it.
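As a rough Ruby illustration of that packaging idea, the sketch below wraps every Replicator event in one generic envelope and dispatches it by name on the receiving side. The `GenericEvent` struct, the `REPLICATORS` lookup table, and the `consume_*` naming convention are assumptions made for this example, not the actual Geo event code.

```ruby
require "json"

# Hypothetical generic envelope: one event shape for every data type.
GenericEvent = Struct.new(:replicable_name, :event_name, :payload_json)

class PackageFileReplicator
  # Primary side: wrap a Replicator event in the generic envelope.
  def self.publish(event_name, payload)
    GenericEvent.new("package_file", event_name.to_s, JSON.generate(payload))
  end

  # Secondary side: one small method per kind of event.
  def consume_created(payload)
    "replicate package file #{payload.fetch('id')}"
  end

  def consume_deleted(payload)
    "delete package file #{payload.fetch('id')}"
  end
end

# Assumed lookup table from envelope name to Replicator class.
REPLICATORS = { "package_file" => PackageFileReplicator }.freeze

# The transport layer only understands GenericEvent. It finds the
# Replicator and hands the payload to the matching consume_* method.
def process(event)
  replicator = REPLICATORS.fetch(event.replicable_name).new
  payload = JSON.parse(event.payload_json)
  replicator.public_send("consume_#{event.event_name}", payload)
end

puts process(PackageFileReplicator.publish(:created, id: 42))
```

Under this arrangement, a new kind of event is just another `consume_*` method on a Replicator, and the transport layer stays untouched.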
Once a vision is chosen, it can be difficult to see how to get there with small iterations. But there are often many ways to go about it.
At a high level, we could have achieved our goal by taking two data types that were already supported, DRYing up their code, and refactoring toward the desired architecture. This is a proven, safe, and effective method.
But to me it felt more palatable overall to deliver customer value along the way, by adding support for a brand-new data type while developing the reusable framework. We already had practice implementing many data types, so there was little risk that we would, for example, take too long or use suboptimal abstractions. So we decided to do this with Package registry.
Lay the foundation
Our POCs already answered the biggest open questions about the shape of the architecture. The next step was to get enough of a skeleton merged, as quickly as possible, so that we could unlock further parallel work. To ensure correctness, we aimed to get something working end-to-end. We decided to implement "replication of newly created Package files". Much was left out, for example:
- Replication of changes. (Most Blob types, including Package files, are immutable anyway)
- Replication of deletes
- Backfill of existing files
- Verification was left out entirely from the scope of the first epic, since we already knew replication alone provides most of the value to users.
Since the work still required many specific design decisions, we decided to pair program. Gabriel Mazetto and I used Zoom and Visual Studio Live Share, which worked well for us, though there are many options available. See a recording of our first call.
The spike was merged and we thought ourselves safe under the feature flag. Looking back on this particular merge request, we did make a couple of mistakes:
- An autoloading bug was discovered. The merge request was reverted, fixed, and remerged. Thanks to CI and end-to-end QA tests using actual builds, the impact was limited.
- The size of the spike was unnecessarily large and difficult to review for a single merge request. As it grew, we should have used it as a "reference" merge request from which we could break out smaller merge requests. Since then, GitLab policies have further emphasized smaller iterations.
Build on the foundation
With the skeleton of the framework in the main branch, we could implement multiple features without excessive conflicts or coordination. The feature flag was enabled on GitLab's staging environment, and each additional slice of functionality was tested as it was merged. And new issues for bugs and missing features were opened.
We built up the developer documentation as we went along. In particular, we documented specific instructions to implement a new data type, aimed at developers with no prior knowledge of Geo. These instructions have since been moved to issue templates. For example, there is a template for adding support for a new Git repository type. Writing these instructions caught a lot of would-be pain points for users of the framework.
Finally, we released Geo support for replicating GitLab Package Registries in GitLab 13.2!
Reaping the benefits
Following the release of Geo support for Package Registries, we added support for many new data types in quick succession. Automatic verification was added to the framework. This recently culminated in a non-Geo engineer implementing replication and verification for a new data type, within one month!
- In GitLab 13.5, Geo replicates external merge request diffs and Terraform state files. These were added by Geo engineers who had been less involved in building the framework. Many refinements to the framework, and especially to the documentation, came out of this.
- In GitLab 13.7, Geo supports replicating Versioned Snippets. This was also added by a Geo engineer, and it was the first Git repository type in the framework, so it required more work than adding new Blob types.
- In GitLab 13.10:
- GitLab 13.11:
- GitLab 13.12:
- An already supported data type, LFS objects, is migrated to the framework under a feature flag. Following this will be the migration of "Uploads" and "CI Job artifacts", and then deleting thousands of lines of code. This should improve both reliability and velocity. For example, verification will be added to these data types.
- In GitLab 12.9, we replicated ~56% of all data types (13 out of 23 in total) and verified ~22%.
- In GitLab 13.11, we replicate ~86% of all data types (25 out of 29 in total) and verify ~45%.
- In the last year, GitLab released six new features that needed Geo support. We replicate 100% of those new features and verify ~57%.
What did it cost?
For comparison, it took around 3.5 months to implement replication of Design repositories. It took around 6 months to implement the framework for replication of Package files. So the cost to produce the framework for replication was roughly 2.5 months of work.
We don't have a direct comparison for verification, but we estimated it would take about 3 months to implement for a single data type. It took about 4 months in total to implement it for Package files while simultaneously adding it to the framework, so the framework's share of that cost was about 1 month.
Given that new data types now take about 1 month to implement replication and verification, the work to produce the framework paid for itself with the implementation of a single data type. The rest of the benefits and time saved are icing on the cake.
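The arithmetic behind that claim can be spelled out in a short Ruby snippet. All figures are the rough month estimates quoted above. Adding the replication and verification estimates together to get a pre-framework cost per data type is my own simplifying assumption.

```ruby
# All values are approximate months of work, taken from the estimates above.
REPLICATION_WITHOUT_FRAMEWORK  = 3.5 # Design repositories
REPLICATION_WITH_FRAMEWORK     = 6.0 # Package files, framework built alongside
VERIFICATION_WITHOUT_FRAMEWORK = 3.0 # estimate for a single data type
VERIFICATION_WITH_FRAMEWORK    = 4.0 # Package files, framework built alongside

# Extra time spent because we built a framework instead of a one-off.
REPLICATION_FRAMEWORK_COST  = REPLICATION_WITH_FRAMEWORK - REPLICATION_WITHOUT_FRAMEWORK
VERIFICATION_FRAMEWORK_COST = VERIFICATION_WITH_FRAMEWORK - VERIFICATION_WITHOUT_FRAMEWORK
TOTAL_FRAMEWORK_COST        = REPLICATION_FRAMEWORK_COST + VERIFICATION_FRAMEWORK_COST

# Assumption: the old cost per data type is the sum of both estimates.
OLD_COST_PER_DATA_TYPE = REPLICATION_WITHOUT_FRAMEWORK + VERIFICATION_WITHOUT_FRAMEWORK
NEW_COST_PER_DATA_TYPE = 1.0
SAVINGS_PER_DATA_TYPE  = OLD_COST_PER_DATA_TYPE - NEW_COST_PER_DATA_TYPE

puts "Framework cost:        #{TOTAL_FRAMEWORK_COST} months"
puts "Savings per data type: #{SAVINGS_PER_DATA_TYPE} months"
puts "Paid off by first new data type? #{SAVINGS_PER_DATA_TYPE >= TOTAL_FRAMEWORK_COST}"
```

With these numbers the framework cost about 3.5 months in total, while each new data type saves about 5.5 months, so the first one already covers the investment.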
My only regret is that we didn't do it sooner. I intend to be more cognizant of this kind of opportunity in the future.
What to expect in the future
- Already supported data types will be migrated into the framework
- New features will be added more quickly. For example, verification will be rolled out for all Blob and Git repository data types
- Duplication will be further reduced, for example by leveraging Rails generators
Huge thanks to everyone who contributed to closing the gap on replicating everything in Geo!