For some time now, GitLab has been working on enabling the Elasticsearch integration on GitLab.com to allow as many GitLab.com users as possible access to the Advanced Global Search features.
This article follows up with yet more lessons learned on our road to Enabling Elasticsearch for GitLab.com. You can read the first article from Mario and the second article from Nick to see where we've come from.
GitLab.com is the largest known installation of GitLab and as such has a lot of projects, code, issues, merge requests and other things that need to be indexed in Elasticsearch in order for us to support Advanced Global Search. Given the volume of data and given that we didn't have much experience running Elasticsearch at any scale in the company, it made sense for us to think of a way to gradually index our data and learn new lessons at each scaling milestone. In order to do this we built in the ability to enable indexing and searching of individual groups and projects so we could stagger out the release.
On June 24, 2019, we enabled the integration for the gitlab-org group on GitLab.com, and after doing this, we learnt a lot before we were able to expand to other groups.
Today, we have enabled Advanced Global Search for over 900 groups already, and we are still increasing this rollout. In order to get to this large number, we needed to automate the gradual increase in groups by allowing operators to roll out to percentages rather than just one group at a time. This has been used now to get us to 15% of bronze tier customers and will hopefully be useful for getting us all the way to 100% of all paying customers.
This post will detail a few key lessons learnt since enabling our first group. At the very least it will serve as a reminder of the things we wished we knew before we ran into them.
Defense in depth
One of the first and least expected problems we ran into after enabling this feature was a continuous stream of security vulnerabilities caused by the complexity of replicating our authorization model in Elasticsearch. We really thank our HackerOne community for how quickly they were able to notice the mistakes we made in our authorization logic and appreciate the impact they've made on ensuring people's data is secure.
Our Elasticsearch integration needs to cover all permutations of user permissions, project visibilities, group inheritance, confidential resources, and other features that GitLab supports for determining whether or not a user can view a specific document. All documents live in a single index, so the only way to ensure that searches return only the results a user is allowed to see is to include the permission checks in the search query itself. However, this breaks one of the first programming rules we learnt: D.R.Y. (Don't Repeat Yourself). Breaking that rule had the expected outcome: bugs caused by the permission logic in our Elasticsearch queries being wrong.
After fixing bug after bug that fell into this same pattern of incorrect permission logic, we realized we needed to apply a second layer of permission filtering to search results, using the same code we use to check permissions everywhere else in GitLab. Once we implemented this, the vulnerabilities mostly disappeared, apart from a few lower-severity attacks that are harder to carry out. This new logic, which we now call "redacting" search results, also gave us an additional threat and bug detection mechanism in GitLab: we set up alerts that fire whenever the redacting logic is triggered.
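The redaction layer described above can be sketched roughly as follows. This is a minimal illustration in Python, not GitLab's actual implementation: `SearchHit`, `redact_results`, and the `can_read` callback are hypothetical names, and the callback stands in for whatever permission check the application already uses.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable, List

logger = logging.getLogger("search.redaction")


@dataclass(frozen=True)
class SearchHit:
    """Hypothetical shape of one search result returned by Elasticsearch."""
    doc_type: str  # e.g. "issue" or "merge_request"
    doc_id: int


def redact_results(
    hits: Iterable[SearchHit],
    can_read: Callable[[SearchHit], bool],
) -> List[SearchHit]:
    """Keep only hits the current user may read, logging every hit dropped.

    `can_read` is assumed to wrap the application's own permission check,
    so the rules are not re-implemented a second time here.
    """
    visible: List[SearchHit] = []
    for hit in hits:
        if can_read(hit):
            visible.append(hit)
        else:
            # A dropped hit means the permission filter inside the search
            # query let something through, so it is worth alerting on.
            logger.warning("redacted search hit %s#%d", hit.doc_type, hit.doc_id)
    return visible
```

Because the query-level filter should already exclude unreadable documents, a warning from this layer is a signal of a bug in the query rather than a normal event, which is what makes it useful as an alert source.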
GitLab initially followed the common path for indexing our database models: queueing a background job for every created, updated, or deleted database record that also needed to be in Elasticsearch. This is very easy to implement in Rails using Sidekiq, but it comes with the downside that you end up sending many small updates to Elasticsearch. These very frequent writes to Elasticsearch cause performance issues at scale.
After reading Keeping Elasticsearch in Sync we decided to implement a buffered queue approach based on Redis sorted sets. This approach allowed us to easily batch updates using the Elasticsearch bulk API every minute, and it also had the advantage of automatically de-duplicating updates due to the sorted set data structure.
This has now been running for a few months without any major issues. The de-duplication turns out to be very powerful given our specific workloads: over a 24-hour period, 95% of updates are duplicates. These details will obviously vary based on the specifics of your application, but in GitLab these duplicates happen very close together as people edit issue descriptions, MR descriptions, and other objects in quick succession, so de-duplication will allow us to reach a considerably larger scale by reducing the write load on Elasticsearch.
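To illustrate why a sorted set de-duplicates for free, here is a minimal sketch in Python. It is not GitLab's code: `BufferedIndexQueue` is a hypothetical in-memory stand-in for a Redis sorted set (so the example runs without a Redis server), and the member format and document ids are assumptions. `build_bulk_body` renders the newline-delimited JSON payload that the Elasticsearch bulk API expects.

```python
import json
from typing import Dict, List, Tuple


class BufferedIndexQueue:
    """In-memory stand-in for a Redis sorted set of pending document refs."""

    def __init__(self) -> None:
        self._members: Dict[str, int] = {}  # member -> score
        self._next_score = 0

    def track(self, doc_type: str, doc_id: int) -> None:
        # Like ZADD: re-adding an existing member only updates its score,
        # so repeated edits to one record collapse into a single entry.
        self._next_score += 1
        self._members[f"{doc_type}:{doc_id}"] = self._next_score

    def drain(self, limit: int) -> List[Tuple[str, int]]:
        # Remove and return up to `limit` members, lowest score first.
        ordered = sorted(self._members.items(), key=lambda kv: kv[1])[:limit]
        refs: List[Tuple[str, int]] = []
        for member, _score in ordered:
            del self._members[member]
            doc_type, doc_id = member.split(":")
            refs.append((doc_type, int(doc_id)))
        return refs


def build_bulk_body(index: str, docs: List[Tuple[str, int, dict]]) -> str:
    """Render one NDJSON bulk payload: an action line, then a source line."""
    lines: List[str] = []
    for doc_type, doc_id, source in docs:
        action = {"index": {"_index": index, "_id": f"{doc_type}_{doc_id}"}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

A periodic job (every minute, in the approach described above) would call `drain`, load the current state of each referenced record, and send a single bulk request instead of one request per edit.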
Some index settings can cause certain queries to break
A very important lesson we learnt is that it's very easy to break your Elasticsearch queries if you change the index_options and don't test all the types of queries you are sending. In particular, the documentation does mention that "Positions can be used for proximity or phrase queries", but it does not clearly mention that if you try to run those queries (for example, in the form of a simple query) without the positions index option, then your queries will fail.
Of course the other important lesson included in our fix was to provide an adequate integration test suite of the different types of queries we allow our users to perform to avoid these regressions.
Just because Elasticsearch allows you to update index mappings doesn't mean it will work
Elasticsearch does make it clear that updating existing mappings in an index is usually not an option, but there seem to be some cases where you are able to change the mapping without triggering any error at the time of the change, and yet afterwards no documents can be written to the index. We learnt this when trying to update one of our own mappings. The lesson was to properly test that you can change the setting and still write to the index afterwards.
Remote reindex has some limitations that make it difficult to do quickly
A key part of running Elasticsearch in production is coming up with a strategy for schema changes and other data migrations. We have experience doing these kinds of changes with Postgres, and most web application developers will have similar experience with migrating relational databases. But Elasticsearch doesn't allow for many in-place migrations of schema or data, so it generally requires users to reindex all of their data in order to make such changes.
In the case of GitLab, we've been making considerable improvements to the storage used by our Elasticsearch index which is a key step in rolling out the feature on GitLab.com. But changes to the index options which reduce storage have meant we needed to reindex the entire dataset to see the benefits.
Initially, it seemed like the best way to do this would be to use the reindex from remote feature in Elasticsearch. This was appealing because the high volume of writes that happens during a reindex would not put any load on our primary cluster. That benefit was real, but it turned out that we couldn't get the reindexing process to go quickly enough, due to a limitation in this API: it won't allow you to configure the buffer size.
Aliases are really a good idea to do from the start
When reading about reindexing in Elasticsearch, you will likely learn that using an index alias is almost always at the heart of these solutions. Without this, it's very difficult to perform zero downtime reindexing, because Elasticsearch provides no way to rename an index.
As such, we learnt that we probably should have been using aliases from the beginning and plan to implement this functionality soon.
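As a rough sketch of what an alias buys you, the helper below builds the request body for the Elasticsearch `_aliases` endpoint that moves an alias from an old index to a new one in a single atomic call. The function name and the alias and index names are hypothetical; only the `actions`, `remove`, and `add` structure comes from the aliases API.

```python
from typing import Dict, List


def alias_swap_actions(alias: str, old_index: str, new_index: str) -> Dict[str, List[dict]]:
    """Build the body for POST /_aliases that repoints `alias` atomically.

    Both actions are applied together, so searches against the alias never
    see a moment where it points at no index or at both indices.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
```

For example, `alias_swap_actions("gitlab-search", "gitlab-v1", "gitlab-v2")` (illustrative names) would be sent once the new index has finished reindexing, and the application keeps querying `gitlab-search` throughout.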
There are several settings to speed up reindexing
After realizing that reindex from remote was not going to work well for us, we learnt about a number of ways to make in-cluster reindexing much faster. Most of these come from Elastic's excellent guide to tune for indexing speed.
The changes we used when reindexing to speed things up were:
index.number_of_replicas set to 0 on the destination index. The default is 1, which means 1 replica, which means all the data written to the index needs to be written to 2 nodes. Disabling replicas roughly halves the amount of writing needed.
index.refresh_interval set to -1 on the destination index. The default behaviour is to refresh the indices every second. Because we aren't reading from the new index during the reindexing, we can disable refreshing, as we don't mind that searches against it won't return results.
index.translog.durability set to async on the destination index. This allows the cluster to sync updates to disk asynchronously. It does come with some risk of data loss if a hard disk fails during reindexing, but it speeds up the process, and when we finish, we can switch it back to the default.
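The three changes above can be expressed as a pair of index settings bodies, one to apply before the reindex and one to restore afterwards. This is a sketch under assumptions: the function names are hypothetical, the setting names are taken to be `index.number_of_replicas`, `index.refresh_interval`, and `index.translog.durability`, and the restored values are the documented Elasticsearch defaults rather than anything GitLab-specific.

```python
from typing import Any, Dict


def reindex_speed_settings() -> Dict[str, Any]:
    """Settings to apply to the destination index before reindexing."""
    return {
        "index": {
            "number_of_replicas": 0,            # write each document once
            "refresh_interval": "-1",           # no periodic refresh
            "translog": {"durability": "async"},  # fsync in the background
        }
    }


def restore_settings(replicas: int = 1) -> Dict[str, Any]:
    """Settings to apply once reindexing has finished.

    The values are the Elasticsearch defaults; adjust them if the source
    index used something else.
    """
    return {
        "index": {
            "number_of_replicas": replicas,
            "refresh_interval": "1s",
            "translog": {"durability": "request"},
        }
    }
```

Each body would be sent with a `PUT` to the destination index's `_settings` endpoint, the first before starting the reindex and the second before switching traffic to the new index.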
You can make the replica recovery go faster by changing settings too
As mentioned above, we decided to disable replication during the reindexing, but after it finishes, we need to re-enable replication and allow the replicas to catch up before switching to the new index. This process may take a while, but you can speed it up by increasing the relevant recovery limits in the cluster settings.
Use routing to speed up queries but note the limit on number of routing ids in the request
GitLab routes all documents based on the project. This allowed us the opportunity to route searches to the correct nodes when searching within projects. This did speed up our searches by around 5x.
We did, however, learn while making this optimization to be careful not to send too many routing ids in one request, or you may exceed the 4096-byte limit on Elasticsearch HTTP lines.
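One way to guard against that limit is to build the routing parameter defensively and fall back to an unrouted search when it would grow too large. The sketch below is illustrative only: the `project_<id>` routing format, the function name, and the 2048-byte budget are assumptions chosen to leave headroom for the rest of the request line.

```python
from typing import Iterable, Optional

# Hypothetical budget: well under the 4096-byte request-line limit so the
# path and other query parameters still fit.
MAX_ROUTING_PARAM_BYTES = 2048


def routing_param(
    project_ids: Iterable[int],
    max_bytes: int = MAX_ROUTING_PARAM_BYTES,
) -> Optional[str]:
    """Return a comma-separated routing value, or None to search unrouted.

    Returning None trades speed for safety: the search hits every shard
    but cannot fail because the request line was too long.
    """
    value = ",".join(f"project_{pid}" for pid in project_ids)
    if not value or len(value.encode("utf-8")) > max_bytes:
        return None
    return value
```

A search within a handful of projects gets the routing speed-up, while a search spanning thousands of projects simply skips routing.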