How we are closing the gap on replicating *everything* in GitLab Geo

Apr 29, 2021 · 12 min read
Michael Kozono

In early 2020, it took 3.5 months of solid work to implement replication of a new data type in Geo. One year later, support can be added within a month – including development and all required reviews. How did we do it? First, let me introduce you to Geo.

What is Geo?

GitLab Geo is the solution for widely distributed development teams and for providing a warm standby as part of a disaster recovery strategy. Geo replicates your GitLab instance to one or more local, read-only instances.

What are data types?

GitLab Geo was released in June 2016 as part of GitLab 8.9, with the ability to replicate project repositories to a read-only secondary GitLab site. Developers located near secondary sites could fetch project repositories as quickly as if they were near the primary.

But what about wiki repositories? What about LFS objects or CI job artifacts? In GitLab, each of these things is represented by different Ruby classes, database tables, and storage configurations. In Geo, we call these data types.
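
To make that concrete, here is a small illustrative peek at two of those data types as they appear in the GitLab codebase. The exact details don't matter; the point is that every data type brings its own model, table, and storage settings:

```ruby
# Each data type is a distinct Rails model backed by its own table, and each
# carries its own storage/object-storage configuration.
LfsObject.table_name        # => "lfs_objects"
Ci::JobArtifact.table_name  # => "ci_job_artifacts"

# Replicating "everything" means handling every one of these, each with its
# own quirks, on the secondary site.
```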

Is it really that hard to copy data?

When we say a new data type is supported by Geo, this is what we mean:

How to iterate yourself into a problem

Iteration is a core value at GitLab. In the case of Geo, by GitLab 12.3 we had added replication support for the most important data types, for example:

And we had added a slew of features around these data types. But suddenly it was clear we had a problem. We were falling behind in the race to replicate and verify all of GitLab's data.

How to iterate yourself out of a problem

Just because it's possible to iterate yourself into a problem doesn't mean iteration failed you. Yes, ideally we would have seen this coming earlier. But consider that fast, small iterations have likely saved many hours of upfront work on features that were quickly validated and have since been changed or removed. It's also possible to DRY up code too soon into bad abstractions, which can be painful to tear apart.

But we reached a point where everyone agreed that the most efficient way forward required consolidating existing code.

Do the design work

Fabian, our esteemed product manager, proposed an epic:

to build a new geo replication and verification framework with the explicit goal of enabling teams across GitLab to add new data types in a way that supports geo replication out of the box

Most of the logic listed above in Is it really that hard to copy data? is exactly the same for all data types. An internal framework could be used to significantly reduce duplication, which could deliver huge benefits:

The proposal went further than making it easy for ourselves to add Geo support to new data types. The goal was to make it easy for non-Geo engineers to do so. To achieve this goal, the framework must be easy to use, easy to understand, and well-documented. Besides the usual benefits of reducing duplication, this higher standard would help:

As a first step, Fabian proposed creating a proof of concept of a framework leveraging lessons learned and incorporating improvements we already wanted to make to the existing architecture. The issue stimulated lots of design discussion in the team, as well as multiple POCs riffing off one another.

The biggest change was the introduction of a Replicator class that could be subclassed for each data type. The subclasses would contain the vast majority of the logic specific to each data type.
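
To make the shape of this concrete, here is a heavily simplified Ruby sketch. The names and method signatures are illustrative, not the actual framework API: the base class owns the generic machinery, and a subclass only declares what is unique to its data type.

```ruby
# Generic machinery (event publishing, registry bookkeeping, retries, etc.)
# lives once in the base class and is shared by every data type.
class Replicator
  # Each subclass declares which model (and therefore which table) it replicates.
  def self.model
    raise NotImplementedError
  end

  def initialize(model_record:)
    @model_record = model_record
  end

  # Called on the primary site when a record is created.
  def publish_created_event
    publish(:created, { model_record_id: @model_record.id })
  end

  private

  def publish(event_name, payload)
    # Hand the event to the shared Geo event log for secondaries to consume.
  end
end

# A data-type-specific subclass only fills in the blanks.
class PackageFileReplicator < Replicator
  def self.model
    Packages::PackageFile
  end
end
```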

In order to further reduce duplication, we also introduced the concept of a Replicator strategy. Most data types in GitLab could be categorized as blobs (simple files) or Git repositories. Within these categories, there was relatively little logic that needed to be specific to each data type. So we could encapsulate the logic specific to these categories in strategies.
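
Continuing the illustrative sketch above, a strategy can be modelled as a mixin, so every blob-shaped data type shares one download implementation and every repository-shaped data type shares another:

```ruby
# Shared behaviour for data types that are "a single file on disk or in
# object storage" (LFS objects, job artifacts, package files, ...).
module BlobReplicatorStrategy
  def consume_created_event(payload)
    download_blob(payload[:model_record_id])
  end

  private

  def download_blob(id)
    # Fetch the file from the primary site and store it locally.
  end
end

# Shared behaviour for data types that are Git repositories
# (project repositories, wikis, design repositories, ...).
module RepositoryReplicatorStrategy
  def consume_updated_event(payload)
    fetch_repository(payload[:model_record_id])
  end

  private

  def fetch_repository(id)
    # git fetch from the primary site.
  end
end

# With a strategy mixed in, a new blob data type needs almost no bespoke code.
class PackageFileReplicator < Replicator
  include BlobReplicatorStrategy

  def self.model
    Packages::PackageFile
  end
end
```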

Another significant decision was to make the event system more flexible and lightweight. We wanted to be able to implement new kinds of events for a Replicator quickly. Rather than rewriting the entire event processing layer, we packaged and transmitted Replicator events inside a single, generic event carried by the existing heavyweight event system. That let us leave the old system as it was and, once every data type had been migrated to the framework, replace it outright.
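
The sketch below shows the idea, again with illustrative names rather than the real API: one generic event type rides on the existing event log, and the secondary's event processing becomes pure dispatch to a Replicator.

```ruby
# One generic event type is carried by the existing (heavyweight) event log.
# It records which Replicator produced it, what happened, and a payload.
GeoReplicationEvent = Struct.new(
  :replicator_class_name, :event_name, :payload,
  keyword_init: true
)

# On the secondary site, processing becomes pure dispatch. Adding a new kind
# of event only requires a new consume_* method on a Replicator, not changes
# to the event processing layer. (The database record itself arrives via
# PostgreSQL replication; Geo only needs to copy the associated data.)
class GeoEventProcessor
  def process(event)
    # In production code you would validate the class name instead of
    # const_get-ing arbitrary input.
    replicator_class = Object.const_get(event.replicator_class_name)
    record = replicator_class.model.find(event.payload[:model_record_id])
    replicator = replicator_class.new(model_record: record)
    replicator.public_send("consume_#{event.event_name}_event", event.payload)
  end
end
```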

Once a vision is chosen, it can be difficult to see how to get there with small iterations. But there are often many ways to go about it.

Code

High-level approach

At a high-level, we could have achieved our goal by taking two data types that were already supported, DRYing up their code, and refactoring toward the desired architecture. This is a proven, safe, and effective method.

But to me it felt more palatable overall to deliver customer value along the way by adding support for a brand-new data type while developing the reusable framework. We already had practice implementing many data types, so there was little risk that we would, for example, take too long or use suboptimal abstractions. So we decided to do this with the Package Registry.

Lay the foundation

Our POCs already answered the biggest open questions about the shape of the architecture. The next step was to get enough of a skeleton merged, as quickly as possible, so that we could unlock further parallel work. To ensure correctness, we aimed to get something working end-to-end. We decided to implement "replication of newly created Package files". Much was left out, for example:

Since the work still required many specific design decisions, we decided to pair program. Gabriel Mazetto and I used Zoom and Visual Studio Live Share, which worked well for us, though there are many options available. See a recording of our first call.

The spike was merged and we thought ourselves safe under the feature flag. Looking back on this particular merge request, we did make a couple mistakes:

  1. An autoloading bug was discovered. The merge request was reverted, fixed, and remerged. Thanks to CI and end-to-end QA tests using actual builds, the impact was limited.
  2. The size of the spike was unnecessarily large and difficult to review for a single merge request. As it grew, we should have used it as a "reference" merge request from which we could break out smaller merge requests. Since then, GitLab policies have further emphasized smaller iterations.

Build on the foundation

With the skeleton of the framework in the main branch, we could implement multiple features in parallel without excessive conflicts or coordination. The feature flag was enabled on GitLab's staging environment, each additional slice of functionality was tested as it was merged, and new issues were opened for the bugs and missing features we found.
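
For context on the feature flag workflow: flipping a flag on an environment like staging is a one-line operation from a Rails console on that environment. The flag name below is made up for illustration, not the flag we actually used.

```ruby
# In a Rails console on the target environment (illustrative flag name):
Feature.enable(:geo_package_file_replication)

# If a merged slice misbehaves, rolling back is just as quick:
Feature.disable(:geo_package_file_replication)
```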

We built up the developer documentation as we went along. In particular, we documented specific instructions for implementing a new data type, aimed at developers with no prior knowledge of Geo. These instructions have since been moved to issue templates; for example, this is the template for adding support for a new Git repository type. This caught a lot of would-be pain points for users of the framework.

Finally, in GitLab 13.2, we released Geo support for replicating the GitLab Package Registry!

Reaping the benefits

Following the release of Geo support for Package Registries, we added support for many new data types in quick succession. Automatic verification was added to the framework. This recently culminated in a non-Geo engineer implementing replication and verification for a new data type, within one month!

In aggregate:

What did it cost?

For comparison, it took around 3.5 months to implement replication of Design repositories before the framework existed. It took around 6 months to implement the framework together with replication of Package files. So the framework itself cost roughly 2.5 months of extra work.

We don't really have a comparable baseline for verification, but it looked like it would have taken about 3 months to implement for a single data type, while implementing it for Package files and adding it to the framework at the same time took about 4 months in total, for a cost of about 1 month.

Given that new data types now take about 1 month to implement replication and verification, the work to produce the framework paid for itself with the implementation of a single data type. All the rest of the benefits and time saved are more icing on the cake.

My only regret is that we didn't do it sooner. I intend to be more cognizant of this kind of opportunity in the future.

What to expect in the future

Huge thanks to everyone who contributed to closing the gap on replicating everything in Geo!
