For some time now, GitLab has been working on enabling the Elasticsearch integration on GitLab.com to give as many GitLab.com users as possible access to the Advanced Global Search features. Last year, after enabling Advanced Search for all licensed customers on GitLab.com, we started thinking about how to simplify the rollout of Advanced Search features that require changing the data in Elasticsearch.
(If you're interested in the lessons we learned on our road to Enabling Elasticsearch for GitLab.com, you can read all about it.)
The data migration process problem
Sometimes we need to change the mappings of an index or backfill a field, and reindexing everything from scratch or using Zero downtime reindexing might seem like an obvious solution. However, this is not a scalable option for big GitLab instances. GitLab.com is the largest known installation of GitLab and as such has a lot of projects, code, issues, merge requests, and other things that need to be indexed. For example, at the moment our Elasticsearch cluster holds almost 1 billion documents. It would take many weeks or even months to reindex everything, and for all that time indexing would need to remain paused, so search results would quickly become outdated.
Original plan for multi-version support
Originally, we were planning to introduce multi-version support using an approach that is fully reliant on GitLab to manage both indices, reading from the old one and writing to both until the migration is finished. You can find more information in !18254 and &1769. As of this writing, most of the code for this approach still exists in GitLab in a half-implemented form.
There were two primary concerns with this approach:
- Reindexing would require the GitLab application to read every single document from storage and send it to Elasticsearch again. Doing so would put a big strain on different parts of the application, such as the database, Gitaly, and Sidekiq.
- Reindexing everything from GitLab to the cluster again may be very wasteful when you only need to change a small part of the index. For example, if we want to add epics to the index, it is very wasteful to reindex every document in the index when we could very quickly just index all the epics. There are many situations where a migration can be performed more efficiently using a targeted approach (e.g. adding a new field to a document type only requires reindexing the documents that actually have that field), as the sketch below illustrates.
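To make the "targeted" idea concrete, backfilling a single new field on one document type can often be expressed as one Elasticsearch update_by_query call instead of a full reindex. The snippet below is only an illustration of that idea using the official Ruby Elasticsearch client; the index name, field, and filter are made up, and this is not how GitLab's framework implements it:

require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9200')

# Backfill a hypothetical `archived` field only on documents of type `issue`,
# leaving every other document in the index untouched.
client.update_by_query(
  index: 'gitlab-production',
  body: {
    script: { source: 'ctx._source.archived = false' },
    query:  { term: { type: 'issue' } }
  },
  wait_for_completion: false # run as an async task and poll it later
)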
For these reasons we've decided to create a different data migration process.
Our revised data migration process
We took inspiration from Rails DB migrations. We wanted to apply their best practices without having to re-architect what the Rails team has already implemented.
For example, we decided to have a special directory with time-stamped migration files. The timestamps give us a strict execution order, so several migrations can be shipped at the same time. A special background processing worker checks this directory on a schedule. This is slightly different from Rails migrations, where the operator is required to run the migration manually. We decided to make it fully automated and run it in the background to avoid requiring self-managed customers to add extra steps to their upgrade process, which would likely have made things much more difficult for everyone involved given how many ways there are to run GitLab. This extra constraint also forces us to always treat migrations as possibly incomplete at any point in the code, which is essential for zero downtime.
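In the GitLab codebase these migration files live under ee/elastic/migrate/, with the timestamp prefix determining the execution order. The file names below are purely illustrative:

ee/elastic/migrate/20201116142400_first_example_migration.rb
ee/elastic/migrate/20210510143200_second_example_migration.rb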
At first, we wanted to store the migration state in the PostgreSQL database, but we decided against it since that wouldn't handle the situation where a user connects a new Elasticsearch cluster to GitLab. It's better to store the migration records in the Elasticsearch cluster itself so they're more likely to stay in sync with the data.
You can see your new migration index in your Elasticsearch cluster. It's called gitlab-production-migrations. GitLab stores a few fields there and uses the migration version number as the document id. This is an example document:
{
  "_id": "20210510143200",
  "_source": {
    "completed": true,
    "state": {},
    "started_at": "2021-05-12T07:19:08.884Z",
    "completed_at": "2021-05-12T07:19:08.884Z"
  }
}
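If you want to look at this index on your own cluster, you can fetch a migration document by its version number, for example with the official Ruby Elasticsearch client. The cluster URL and version number below are just placeholders:

require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9200')

# The migration version is the document id in the migrations index.
doc = client.get(index: 'gitlab-production-migrations', id: '20210510143200')
puts doc['_source']['completed']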
The state field stores data that's required to run batched migrations. For example, for a batched migration we store the slice number and the task ID of the current Elasticsearch reindex operation, and we update the state after every run.
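For instance, the stored state of an in-flight batched reindexing migration could look something like this (the field names here are illustrative, not a fixed schema):

"state": {
  "slice": 3,
  "max_slices": 5,
  "task_id": "oTUltX4IQMOUUVeiohTt8A:12345"
}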
This is how an example migration looks:
class MigrationName < Elastic::Migration
  def migrate
    # Migrate the data here
  end

  def completed?
    # Return true if completed, otherwise return false
  end
end
This looks a lot like Rails DB migrations, which was our goal from the beginning. The main difference is the additional method that checks whether a migration is completed. We added that method because we often need to execute asynchronous tasks and want to check later, in a different worker run, whether they have completed.
Migration framework logic
This is a simple flow chart demonstrating the high-level logic of the new migration framework.
As you can see above, a migration can be in multiple different states. For example, the framework halts a migration when it has too many failed attempts. In that case, a warning is shown in the admin UI with a button for restarting the migration.
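In simplified Ruby, each run of the scheduled worker follows roughly this logic. This is only a sketch of the flow described above, not GitLab's actual worker class, and the methods it calls on each migration object are assumed for readability:

# Sketch of the migration worker's per-run logic. Assumes `migrations`
# is the list of migration objects, ordered by their timestamp version.
class MigrationWorkerSketch
  def initialize(migrations)
    @migrations = migrations
  end

  def perform
    # Pick the first migration that hasn't been recorded as finished yet.
    migration = @migrations.find { |m| !m.finished? }
    return if migration.nil?

    # A migration with too many failed attempts is halted; it shows a
    # warning in the admin UI until an administrator restarts it.
    return if migration.halted?

    if migration.completed?
      # The asynchronous work (for example an Elasticsearch task) is done,
      # so record the migration as finished.
      migration.mark_as_finished!
    else
      # Run (or continue) the migration. For batched migrations this
      # processes one batch, and the worker is re-enqueued after a delay
      # to check `completed?` again on the next run.
      migration.migrate
    end
  end
end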
Configuration options
We've introduced many useful configuration options into the framework, such as:
- batched! - Allows the migration to run in batches. If set, the worker re-enqueues itself with a delay that is set using the throttle_delay option described below. We use this option to reduce the load and ensure that the migration won't time out.
- throttle_delay - Sets the wait time between batch runs. This time should be set high enough to allow each migration batch enough time to finish.
- pause_indexing! - Pauses indexing while the migration runs. This setting records the indexing setting before the migration runs and sets it back to that value when the migration is completed. GitLab only uses this option when absolutely necessary, since we try to minimize downtime as much as possible.
- space_requirements! - Verifies that enough free space is available in the cluster when the migration runs. This setting halts the migration if the required storage is not available. This option prevents situations where your cluster runs out of space while executing a migration.
You can see the up-to-date list of options in this development documentation section.
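Put together, a migration that uses several of these options might be declared like this. This is a sketch based on the documented DSL; the migration name, delay, and space estimate are made up:

class BackfillSomeField < Elastic::Migration
  batched!
  throttle_delay 5.minutes

  pause_indexing!
  space_requirements!

  def migrate
    # Process one batch per run; the worker re-enqueues itself
    # after `throttle_delay` until `completed?` returns true.
  end

  def completed?
    # Return true once every document has the new field.
  end

  def space_required_bytes
    # Rough estimate used by `space_requirements!` before the migration runs.
    10.gigabytes
  end
end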
Data migration process results
We implemented the Advanced Search migration framework in the 13.6 release and have been improving it since. You can see some details in the original issue #234046. The only requirement for this new feature is that your index was created with GitLab 13.0 or later. We have that requirement because we rely heavily on aliases, which were introduced in 13.0. As you might know, over the last few releases we've been working on separating different document types into their own indices. This migration framework has been a tremendous help for that initiative. We've already completed the migration of issues (in 13.8), comments (in 13.11), and merge requests (in 13.12) with a noticeable performance improvement.
Since we've accumulated so many different migrations over the last few releases and they require us to support multiple code paths for a long period of time, we've decided to remove older migrations that were added prior to the 13.12 release. You can see some details in this issue. We plan to continue the same strategy in the future, which is one of the reasons why you should always upgrade to the latest minor version before migrating to a major release.
If you're interested in contributing to features that require Advanced Search migrations, we have a dedicated documentation section that explains how to create one and lists all available options for it.