The year 2020 marked the first full year for the Memory Group. The group's origins date to early 2019, with the most recent team members joining in October 2019. This summary is intended to highlight the impact and the work of the Memory Group in 2020. The sections below give a brief overview of each project, describe its impact, and link to issues and epics, mostly within the GitLab.org project. The efforts are listed in approximate chronological order. This is not a comprehensive list; some isolated issues, as well as confidential initiatives, are not included.
As part of a 10x initiative to Iteratively re-architect our queue implementation to create 10x headroom, the Memory Group focused on improving Sidekiq observability and configurability. We wrapped up this work at the beginning of 2020.
We implemented tooling to improve Sidekiq observability. This improvement, in coordination with the Infrastructure team's work to simplify Sidekiq worker pools, raised the saturation point from approximately 600 jobs per second to 7,000 jobs per second. We also fixed longstanding issues with the Sidekiq memory killer. Additionally, an epic was created to identify future improvements: Improve reliability, observability, performance of background jobs. The Scalability team was able to build upon our research and prototyping to move sidekiq-cluster to Core, making it available to everyone.
An extensive account of our migration from Unicorn to Puma on our web servers can be read in our blog post How we migrated application servers from Unicorn to Puma.
Once we deployed Puma to our entire web fleet, we observed a drop in memory usage from 1.28TB to approximately 800GB (roughly a 37% reduction), while our request queuing, request duration, and CPU usage all remained roughly the same.
This work initially started as a rapid action due to failing imports and exports on large projects. During the rapid action we assisted in troubleshooting the large projects and identified some areas of improvement going forward. Subsequently, as part of the 10x Initiative, we identified some short-term project import/export improvements.
Not only were we able to fix broken imports for large projects, we also delivered the following improvements (numbers approximate):
Introducing NDJSON also allowed GitLab to process imports and exports with approximately constant memory use regardless of project size. Previously, memory usage grew with the size of the project being imported or exported. More details about the impact of NDJSON processing can be found in the Introduce .ndjson as a way to process import epic.
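The memory benefit of NDJSON comes from each line being an independent JSON object: records can be parsed one at a time instead of loading an entire project tree into memory. The sketch below is illustrative only (the helper name and sample data are hypothetical, not GitLab's import code), assuming each line of the export is one record:

```ruby
require "json"
require "stringio"

# Stream one JSON object per line; only a single record is held in
# memory at a time, so memory use stays roughly constant regardless
# of how many records the export contains.
def each_record(io)
  io.each_line do |line|
    line = line.strip
    next if line.empty?

    yield JSON.parse(line)
  end
end

# In-memory stream standing in for a project export file:
export = StringIO.new(<<~NDJSON)
  {"title": "Issue 1", "state": "opened"}
  {"title": "Issue 2", "state": "closed"}
NDJSON

titles = []
each_record(export) { |record| titles << record["title"] }
# titles => ["Issue 1", "Issue 2"]
```

By contrast, a single monolithic JSON document would force `JSON.parse` to build the whole object graph at once, which is why memory usage previously scaled with project size.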
In this MVC we implemented the ability to track accumulated CI minutes on shared runners. This helped to build some of the framework for a proposed feature in a future release.
The MVC we built allows for configuring limits and cost factors of shared and public runners. This enabled additional shared runner types such as Windows, as well as customers to purchase additional shared runner minutes if desired.
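One way to picture a cost factor is as a multiplier applied when converting raw runner duration into billed CI minutes. The sketch below is hypothetical (the names and factor values are illustrative, not GitLab's actual implementation or pricing):

```ruby
# A cost factor scales wall-clock runner time into billed minutes,
# which is how different runner types can be billed differently.
CostFactor = Struct.new(:factor) do
  def billed_minutes(duration_seconds)
    (duration_seconds / 60.0) * factor
  end
end

# Illustrative factors only: a baseline runner and a pricier type.
LINUX_SHARED   = CostFactor.new(1.0)
WINDOWS_SHARED = CostFactor.new(2.0)

LINUX_SHARED.billed_minutes(600)   # => 10.0
WINDOWS_SHARED.billed_minutes(600) # => 20.0
```

With this shape, adding a new shared runner type is a configuration change (a new factor) rather than new billing logic.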
We identified that the PipelineProcessService was executed multiple times, resulting in duplicate jobs and SQL queries, among other excessive resource usage. In this issue we identified and implemented steps to improve the PipelineProcessService.
The improvements to our PipelineProcessService not only reduced the duration of pipeline execution, thereby shortening the feedback loop, but also decreased CPU load. A comprehensive overview of the positive impact of these changes can be found in the rollout issue. Included in this overview:
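A minimal sketch of the general technique, assuming the fix amounts to not repeating work for a pipeline that has already been processed in the same cycle (the class and method names below are hypothetical, not GitLab's actual code):

```ruby
# Remember which pipelines have already been processed so that
# redundant invocations skip the expensive work (job scheduling,
# SQL queries) instead of repeating it.
class PipelineProcessor
  attr_reader :runs

  def initialize
    @processed = {}
    @runs = Hash.new(0)
  end

  def process(pipeline_id)
    # Hash#fetch only runs the block when the key is absent,
    # so the expensive path executes at most once per pipeline.
    @processed.fetch(pipeline_id) do
      @runs[pipeline_id] += 1
      @processed[pipeline_id] = expensive_processing(pipeline_id)
    end
  end

  private

  def expensive_processing(pipeline_id)
    "processed-#{pipeline_id}" # stands in for the real work
  end
end

processor = PipelineProcessor.new
3.times { processor.process(42) }
processor.runs[42] # => 1, despite three calls
```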
Testing identified API endpoints that were failing due to high memory usage or excessive CPU consumption. Given the high priority and severity of these issues, the Memory Group focused on initial improvements to the BlameController and BlobController to reduce the severity of these endpoints to acceptable levels and reassign them to feature teams.
This effort focused on the memory usage of our self-managed instances. Prior to this work we had only anecdotal information on how and where memory was allocated for our customers in the field. The issues and merge requests within this epic allowed us to collect and analyze memory usage of our self-managed installations.
This allowed the Memory Group to validate goals for its North Star Metric. We are now also able to measure multiple metrics, for self-managed instances reporting telemetry, on our Enablement::Memory Sisense Dashboard.
Prior to this improvement, all images displayed by the GitLab application were delivered at the size at which they were uploaded, leaving it to the browser rendering engine to scale them down as necessary. This meant serving megabytes of image data in a single page load, just so the frontend would throw most of it away. We determined that the first iteration would focus on avatars, since they made up the vast majority of use cases. The Memory Group implemented a solution to dynamically resize avatars.
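A common pattern for dynamic resizing is to clamp requested dimensions to a small allowlist, so the server only ever generates and caches a handful of variants rather than arbitrary sizes. The sketch below is hypothetical (the sizes and helper are illustrative, not GitLab's actual avatar code):

```ruby
# Only a fixed set of avatar widths is ever served; arbitrary
# requested widths snap to the nearest sufficient variant.
ALLOWED_AVATAR_SIZES = [16, 24, 32, 64, 96].freeze

def avatar_width(requested)
  # Smallest allowed size that still covers the request,
  # falling back to the largest variant for oversized requests.
  ALLOWED_AVATAR_SIZES.find { |size| size >= requested } ||
    ALLOWED_AVATAR_SIZES.last
end

avatar_width(20)  # => 24
avatar_width(500) # => 96
```

Clamping keeps cache cardinality bounded and prevents a client from forcing the server to resize to thousands of distinct dimensions.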
We shrank image transfers by 93%. The details can be read in our blog post Scaling down: How we shrank image transfers by 93%.
In %13.1 the Memory Group added instrumentation to log various SQL requests from Sidekiq. Based on the data we collected, we set about detecting potential N+1 cached SQL calls, documented why these queries are problematic from a memory perspective, reduced several cached calls, and provided guidelines to enable developers to continue these efforts going forward.
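The core of detecting N+1 cached calls is spotting the same SQL statement repeated within one request or job: even when later executions hit the query cache, each still allocates result objects, so the repetition costs memory. A minimal sketch, assuming a flat log of executed statements (the helper and sample queries are illustrative, not GitLab's instrumentation):

```ruby
# Given the SQL statements executed during one request or job,
# return only those that ran more than once, with their counts.
# Repeated identical statements are candidate N+1 cached calls.
def duplicate_queries(statements)
  statements.tally.select { |_sql, count| count > 1 }
end

log = [
  "SELECT * FROM users WHERE id = 1",
  "SELECT * FROM users WHERE id = 1",
  "SELECT * FROM projects WHERE id = 7",
  "SELECT * FROM users WHERE id = 1"
]

duplicate_queries(log)
# => {"SELECT * FROM users WHERE id = 1" => 3}
```

In practice such a check can run in test suites or instrumentation middleware so regressions surface before they reach production.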
We saw a 10% reduction in transactions per second (TPS) across the fleet, as measured by Postgres. More details and links to the supporting Thanos query (internally accessible) can be found in this issue.
Members of the Memory Group have been involved in working groups to provide guidance and consultation on performance and memory consumption. The working groups we have participated in are listed below: