Back in March, Mario shared some of the lessons we'd learned from our last attempt to enable Elasticsearch on GitLab.com, an integration that would unlock both Advanced Global Search and Advanced Syntax Search. Since then, we've been working hard to address problems with the integration and prepare for another attempt.
At the heart of our dilemma was a classic "chicken and egg" problem. We needed to gather more information about Elasticsearch to make improvements to the total index size, but without an active deployment, that information was very hard to gather. Customer feedback and small-scale testing in development environments both help, but dogfooding the integration is the best way to get the information we require.
To resolve this, we prioritized changes to enable Elasticsearch integration on GitLab.com. Since the index size was a hard problem, this meant some kind of selective indexing was necessary, so we've added per-project and per-group controls.
On Jun. 24, 2019, we enabled the integration for the gitlab-org group on GitLab.com. Now, any searches at the group or project level will make use of the Elasticsearch index, and the advanced features the integration unlocks will be available.
We figured, why not give it a try?
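To make the per-group and per-project controls concrete, here is a minimal Python sketch of how a selective-indexing check could work. It is not GitLab's implementation: the function names, the use of path strings, and the assumption that enabling a group also covers projects in its subgroups are all hypothetical choices made for illustration.

```python
def ancestor_groups(project_path: str) -> list[str]:
    """Return every group path that contains the project, outermost first.

    "gitlab-org/charts/gitlab" -> ["gitlab-org", "gitlab-org/charts"]
    """
    parts = project_path.split("/")[:-1]  # drop the project name itself
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def is_indexed(project_path: str,
               indexed_groups: set[str],
               indexed_projects: set[str]) -> bool:
    """Decide whether a project's content belongs in the search index.

    Hypothetical rule: a project is indexed if it was enabled directly,
    or if any group above it was enabled (an assumption for this sketch).
    """
    if project_path in indexed_projects:
        return True
    return any(group in indexed_groups for group in ancestor_groups(project_path))


# Example: only the gitlab-org group is enabled.
enabled_groups = {"gitlab-org"}
enabled_projects: set[str] = set()
print(is_indexed("gitlab-org/gitlab-ce", enabled_groups, enabled_projects))
print(is_indexed("some-user/side-project", enabled_groups, enabled_projects))
```

With a check like this, the rest of the search code can fall back to the regular database search for anything that is not covered by the index.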
The total index size for this group – which includes about 500 projects – is around 2.2 million documents and 15GB of data, which is really easy to manage from the point of view of Elasticsearch administration. The indexing operation itself didn't go as smoothly as we hoped, however!
Enabling selective Elasticsearch indexing on GitLab.com had another advantage. Before turning it on, our engineers needed confidence that the feature is performant, that it won't threaten the overall stability of GitLab.com, and that it is substantially bug-free, so we went through a Production Readiness Review first. The review uncovered a number of pre-existing bugs and new regressions, which have all been fixed in the 12.0 release. Some of the bugs included:
- Elasticsearch was sometimes used for searches, even when disabled
- Performance regression indexing database content
- Regression searching for some projects at group level
- Regression visiting page 2 of search results
- Wiki indexing still relied on a shared filesystem
- Searching snippets with Elasticsearch enabled still queries the database, not Elasticsearch
We still can't claim to be bug-free, of course, but the picture is a lot rosier than if we'd attempted to roll out this feature without first using it ourselves.
We'd tested the new indexing code on our staging environment, but this was last refreshed more than a year ago, and was significantly smaller than the group on GitLab.com, containing around 150 projects. As a result, some bugs and scalability issues were uncovered for the first time in production. We're addressing them with high priority in the 12.1 and 12.2 releases. The scaling issues include:
- Project imports unconditionally enqueue an ElasticCommitIndexerWorker
- Allow maximum bulk request size to be configured
- Intelligently retry bulk-insert failures when indexing
- Note bulk indexing often fails due to statement timeout
- Cannot index large snippets
- Removing documents from the index can fail with a conflict error
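Two of the items above, a configurable maximum bulk request size and retrying bulk-insert failures, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code GitLab uses: `batches_by_size`, `index_with_retry`, and the `send` callback (which stands in for a real bulk request and returns the documents that failed) are hypothetical names, and the default limits are arbitrary.

```python
import json
import time
from typing import Callable, Iterable, Iterator

Doc = dict
# Hypothetical transport: sends one batch, returns the documents that failed.
SendFn = Callable[[list[Doc]], list[Doc]]


def batches_by_size(docs: Iterable[Doc], max_bytes: int) -> Iterator[list[Doc]]:
    """Group documents so each batch's serialized size stays within max_bytes.

    A single document larger than max_bytes is still emitted, alone in its
    own batch, so the caller can decide how to handle it.
    """
    batch: list[Doc] = []
    size = 0
    for doc in docs:
        doc_size = len(json.dumps(doc).encode("utf-8"))
        if batch and size + doc_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        yield batch


def index_with_retry(docs: Iterable[Doc], send: SendFn,
                     max_bytes: int = 10 * 1024 * 1024,
                     max_attempts: int = 3,
                     base_delay: float = 0.0) -> list[Doc]:
    """Send documents in size-limited batches, retrying only the failures.

    Returns the documents that still failed after max_attempts tries.
    """
    gave_up: list[Doc] = []
    for batch in batches_by_size(docs, max_bytes):
        pending = batch
        for attempt in range(max_attempts):
            pending = send(pending)  # only failed documents come back
            if not pending:
                break
            if attempt < max_attempts - 1:
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        gave_up.extend(pending)
    return gave_up
```

Retrying only the documents that failed, rather than the whole batch, avoids resending work that already succeeded, and capping the batch by serialized size keeps one unusually large document from pushing a request over the limit.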
Once these issues are addressed, indexing at scale should be quick, easy, and reliable, which is invaluable for an engineer trying out changes to reduce total index size.
Another area for improvement is administering the indexing process itself. Although GitLab automatically creates, updates, and removes documents from the index when changes are made, backfilling existing data required manual intervention, running a set of complicated (and slow) rake tasks to get the pre-existing data into the Elasticsearch index. Unless these instructions were followed correctly, search results would be incomplete. There was also no way to configure a number of important parameters for the indexes created by GitLab.
When using the selective indexing feature, GitLab now automatically enqueues "backfill" tasks for groups and projects as they are added, and removes the relevant records from the index when they are supposed to be removed. We've also made it possible to configure the number of shards and replicas for the Elasticsearch index directly in the admin panel, so when GitLab creates the index for you, there's no need to manually change the parameters afterwards.
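For context on why setting these values up front is useful: in Elasticsearch, the number of primary shards is fixed when an index is created, while the number of replicas can be changed later. The Python sketch below builds the kind of settings body Elasticsearch accepts at index creation. The `number_of_shards` and `number_of_replicas` keys are real Elasticsearch index settings; the `index_settings` helper and the example values are assumptions made for illustration, not what the admin panel does internally.

```python
import json


def index_settings(shards: int, replicas: int) -> dict:
    """Build a create-index request body with the given shard/replica counts."""
    if shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,      # fixed once the index exists
                "number_of_replicas": replicas,  # can be adjusted later
            }
        }
    }


# Example values only; pick counts that suit your cluster.
print(json.dumps(index_settings(shards=5, replicas=1), indent=2))
```

A body like this would be sent when the index is first created, which is why configuring the values before GitLab creates the index saves a manual fix-up afterwards.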
Personal snippets are the one type of document that selective indexing doesn't cover yet. To ensure they show up in search results, you'll still need to run the gitlab:elastic:index_snippets rake task for now.
There are also improvements if you're not using selective indexing – the admin area now has a "Start indexing" button. Right now, this only makes sense if starting from an empty index, and doesn't index personal snippets either, but we're hopeful we can remove the rake tasks entirely in the future.
We're really happy to have Elasticsearch enabled for the gitlab-org group, but the eventual goal is to have it enabled on all of GitLab.com. We'll be rolling it out to more groups in the future.
To get there, we'll need to continue to improve the administration experience using Elasticsearch. For instance, it's still difficult to see the indexing status of a group or project at a glance, a function that would be really useful for our support team to answer queries like "Why isn't this search term returning the expected results?"
Managing the Elasticsearch schema is also a challenge. Currently, we take the easy route of reindexing everything if we need to change some aspect of the schema, which doesn't scale well as the index gets larger. Some work on this is ongoing, and the eventual goal is for GitLab to automatically manage changes to the Elasticsearch index in the same way it does for the database.
Reducing the index size is still a huge priority, and we hope to make progress on this now that we have an Elasticsearch deployment to iterate against.
We'd also like to improve the quality of search results. For example, we have reports of code search failing to find certain identifiers. We'd also like to use the Elasticsearch index in more contexts, such as for filtered search.
The Elasticsearch integration is progressing. Finally, responsibility for the Elasticsearch integration has been passed from the Plan stage to the Editor group of the Create stage. I hope you'll join Mario and me in wishing Kai, Darva, and the rest of the team the best of luck in tackling the remaining challenges for Elasticsearch. An up-to-date overview of their plans can always be found on the search strategy page.
Photo by Benjamin Elliott on Unsplash
“The challenge of enabling Elasticsearch on GitLab.com” – Nick Thomas