Published on October 14, 2019
Explore the different engineering approaches we evaluated to store and update namespace statistics in a performant manner.
Managing storage space on large GitLab instances, such as GitLab.com, can be a challenge. At the moment, we only have a restriction on repository limits, but no restriction on most of the other items that can consume storage space: wiki, lfs objects, artifacts, and packages, to mention a few.
We want to provide a way to easily view the amount of storage consumed by a group, and to allow easy management on GitLab.com by setting storage limits for groups. But to do that, we need a way to track the statistics of a namespace, whether it is a Group or a User namespace.
The "refresh" part is the tricky one. Currently we don't have a pattern to update/refresh the namespace statistics every time a project belonging to the namespace is updated.
We refresh project statistics in the following way:

- `ProjectStatistics` are updated through a callback every time the project is saved.
- Namespace statistics are aggregated on the fly through the `Namespaces#with_statistics` scope.

Analyzing this query we noticed that:

- It takes up to `1.2` seconds for namespaces with over 15 000 projects.
- Running `EXPLAIN ANALYZE` on it results in query timeouts (15 seconds) when using our internal tooling.

Additionally, the callback to update the project statistics doesn't scale. It is currently one of the most frequently run and expensive database queries on GitLab.com. We can't add one more query to it, as that would increase the transaction's length.
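As a sketch of that callback pattern, here is a plain-Ruby stand-in for the ActiveRecord models (the `Project` class, `save` method, and statistics fields below are simplified illustrations, not GitLab's actual code):

```ruby
# Simplified stand-in for the ActiveRecord callback pattern: every time a
# project is saved, its statistics row is recomputed in the same flow.
class ProjectStatistics
  attr_accessor :repository_size, :storage_size

  def initialize
    @repository_size = 0
    @storage_size = 0
  end
end

class Project
  attr_reader :statistics, :repository_size

  def initialize(repository_size:)
    @repository_size = repository_size
    @statistics = ProjectStatistics.new
  end

  def save
    # In Rails this would be an after_save callback on the project.
    refresh_statistics
    true
  end

  private

  def refresh_statistics
    statistics.repository_size = repository_size
    statistics.storage_size = repository_size # simplified: one component only
  end
end
```

This is cheap per project, but as the article notes, chaining one more aggregation query onto the same transaction is what doesn't scale.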
Because of these reasons, we can't apply the same pattern to store and update the namespaces' statistics: the `namespaces` table is one of the largest tables on GitLab.com. Therefore, we have to find a performant alternative.
**Attempt A:** Update the ActiveRecord model with a refresh strategy based on project routes and a materialized view:
```sql
SELECT split_part("rs".path, '/', 1) AS root_path,
COALESCE(SUM(ps.storage_size), 0) AS storage_size,
COALESCE(SUM(ps.repository_size), 0) AS repository_size,
COALESCE(SUM(ps.wiki_size), 0) AS wiki_size,
COALESCE(SUM(ps.lfs_objects_size), 0) AS lfs_objects_size,
COALESCE(SUM(ps.build_artifacts_size), 0) AS build_artifacts_size,
COALESCE(SUM(ps.packages_size), 0) AS packages_size
FROM "projects"
INNER JOIN routes rs ON rs.source_id = projects.id AND rs.source_type = 'Project'
INNER JOIN project_statistics ps ON ps.project_id = projects.id
GROUP BY root_path
```
We could then execute the query with:
```sql
REFRESH MATERIALIZED VIEW root_namespace_storage_statistics;
```
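The view keys its rows on `split_part(rs.path, '/', 1)`, i.e. the first segment of a project's route path. A plain-Ruby equivalent of that derivation (a hypothetical helper, for illustration only):

```ruby
# Mirror of PostgreSQL's split_part(path, '/', 1): the root namespace is
# the first segment of a project's full route path.
def root_path(route_path)
  route_path.split("/").first
end

root_path("gitlab-org/gitlab") # => "gitlab-org"
```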
While this implied a single-query update, it had some downsides:

- `REFRESH MATERIALIZED VIEW` recomputes the statistics for every root namespace, not just the one whose project changed.
- A plain (non-`CONCURRENTLY`) refresh locks the view, blocking reads against it while the refresh runs.
**Attempt B:** Update the ActiveRecord model with a refresh strategy through a Common Table Expression:
```sql
WITH refresh AS (
  SELECT split_part("rs".path, '/', 1) AS root_path,
  COALESCE(SUM(ps.storage_size), 0) AS storage_size,
  COALESCE(SUM(ps.repository_size), 0) AS repository_size,
  COALESCE(SUM(ps.wiki_size), 0) AS wiki_size,
  COALESCE(SUM(ps.lfs_objects_size), 0) AS lfs_objects_size,
  COALESCE(SUM(ps.build_artifacts_size), 0) AS build_artifacts_size,
  COALESCE(SUM(ps.packages_size), 0) AS packages_size
  FROM "projects"
  INNER JOIN routes rs ON rs.source_id = projects.id AND rs.source_type = 'Project'
  INNER JOIN project_statistics ps ON ps.project_id = projects.id
  GROUP BY root_path)
UPDATE namespace_storage_statistics
SET storage_size = refresh.storage_size,
    repository_size = refresh.repository_size,
    wiki_size = refresh.wiki_size,
    lfs_objects_size = refresh.lfs_objects_size,
    build_artifacts_size = refresh.build_artifacts_size,
    packages_size = refresh.packages_size
FROM refresh
INNER JOIN routes rs ON rs.path = refresh.root_path AND rs.source_type = 'Namespace'
WHERE namespace_storage_statistics.namespace_id = rs.source_id
```
Unlike Attempt A, the CTE is limited to the namespace we care about instead of operating on all namespaces. The downside is that MySQL versions prior to 8.0 do not support Common Table Expressions.
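The shape of that update, aggregate per root path, then write the sums back only into existing statistics rows, can be simulated in plain Ruby, with hashes standing in for the `project_statistics` and `namespace_storage_statistics` tables (hypothetical data; only `storage_size` shown):

```ruby
# Each project row carries its route path and statistics. The "CTE" part
# aggregates per root path; the "UPDATE ... FROM refresh" part writes the
# sums back into the per-namespace statistics table, touching only rows
# that already exist there.
def refresh_namespace_statistics(projects, namespace_statistics)
  # WITH refresh AS (... GROUP BY root_path)
  refresh = projects
              .group_by { |p| p[:path].split("/").first }
              .transform_values { |rows| rows.sum { |p| p[:storage_size] } }

  # UPDATE namespace_storage_statistics ... FROM refresh
  refresh.each do |root_path, storage_size|
    next unless namespace_statistics.key?(root_path) # only existing rows update
    namespace_statistics[root_path] = storage_size
  end
  namespace_statistics
end

projects = [
  { path: "gitlab-org/gitlab",  storage_size: 10 },
  { path: "gitlab-org/omnibus", storage_size: 5 },
  { path: "alice/blog",         storage_size: 2 },
]
stats = { "gitlab-org" => 0 }
refresh_namespace_statistics(projects, stats)
# stats["gitlab-org"] is now 15; "alice" has no row, so it is skipped
```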
**Attempt C:** We could get rid of the model that stores the statistics in aggregated form and instead use a Redis Set. This would be the boring solution and the fastest one to implement, as GitLab already includes Redis as part of its architecture.
The downside of this approach is that Redis does not provide the same persistence/consistency guarantees as PostgreSQL, and the namespace statistics are information we can't afford to lose in the case of a Redis failure. Also, answering questions like "which namespaces have the largest repositories?" is easier in PostgreSQL than in Redis.
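The idea can be sketched in plain Ruby, with a `Set` standing in for the Redis set (in a real implementation these operations would map to Redis commands such as `SADD`; the class and method names below are illustrative):

```ruby
require "set"

# Sketch of the Redis-set idea: namespaces whose statistics are stale are
# collected in a set, so repeated changes to the same namespace queue
# only one refresh.
class PendingNamespaceRefreshes
  def initialize
    @pending = Set.new
  end

  # Called whenever a project's statistics change. Idempotent: adding the
  # same root namespace twice leaves a single entry.
  def mark_dirty(root_namespace_id)
    @pending.add(root_namespace_id)
  end

  # Drain the set and return each namespace needing a refresh exactly once.
  def drain
    refreshed = @pending.to_a
    @pending.clear
    refreshed
  end
end
```

The deduplication is attractive, but as noted above, the set's contents would not survive a Redis failure.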
**Attempt D:** Directly relate the root namespace to its child namespaces, so whenever a child namespace is created, it's also tagged with the root namespace ID:
| id | root_id | parent_id |
|----|---------|-----------|
| 1  | 1       | NULL      |
| 2  | 1       | 1         |
| 3  | 1       | 2         |
To aggregate the statistics inside a namespace we'd execute something like:
```sql
SELECT COUNT(...)
FROM projects
WHERE namespace_id IN (
  SELECT id
  FROM namespaces
  WHERE root_id = X
)
```
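With hypothetical in-memory rows mirroring the table above, the lookup reduces to a single filter on `root_id`, with no recursive traversal of `parent_id` needed:

```ruby
# Namespace rows as in the table above: the root namespace carries its own
# id as root_id, and every descendant is tagged with it on creation.
NAMESPACES = [
  { id: 1, root_id: 1, parent_id: nil },
  { id: 2, root_id: 1, parent_id: 1 },
  { id: 3, root_id: 1, parent_id: 2 },
].freeze

# Equivalent of the inner subquery: all namespace ids under a given root.
def namespace_ids_under_root(root_id)
  NAMESPACES.select { |ns| ns[:root_id] == root_id }.map { |ns| ns[:id] }
end
```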
Even though this approach would make aggregating much easier, it has some major downsides:

- Backfilling `root_id` for every existing row would mean a large migration against the `namespaces` table, which, as noted above, is one of the largest tables on GitLab.com.
- The column would have to be kept consistent for an entire subtree whenever a namespace is transferred to a different parent.
**Attempt E:** For this approach we continue using the incremental statistics updates we already have, but we refresh them through Sidekiq jobs and in different SQL transactions:

1. Create a second table (`namespace_aggregation_schedules`) with two columns: `id` and `namespace_id`.
2. Whenever the statistics of a project change, insert a row into `namespace_aggregation_schedules`. Keeping in mind the length of the transaction that updates `project_statistics`, the insertion should be done in a different transaction and through a Sidekiq job.
3. That job schedules an aggregation worker: one run for immediate execution and another one in `1.5` hours.
4. The aggregation worker takes an exclusive `1.5h` lease on Redis on a key based on the root namespace ID, so a given root namespace is refreshed at most once every `1.5` hours. It updates the root namespace storage statistics and deletes the related `namespace_aggregation_schedules` row after the update.
5. Another Sidekiq job periodically traverses any remaining rows on the `namespace_aggregation_schedules` table and schedules jobs for every pending row.
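The heart of the scheme, an exclusive lease per root namespace so that concurrent jobs don't refresh the same statistics twice, can be sketched in plain Ruby. A hash of expiry times stands in for the Redis lease, and `LEASE_TIMEOUT` mirrors the `1.5h` window (class and method names are illustrative):

```ruby
# In-memory stand-in for the Redis exclusive lease: a worker may refresh a
# root namespace only if no other worker currently holds the lease for it.
class RefreshLease
  LEASE_TIMEOUT = 1.5 * 60 * 60 # 1.5 hours, in seconds

  def initialize(clock: -> { Time.now.to_f })
    @clock = clock # injectable clock, so the expiry logic is testable
    @leases = {}
  end

  # Returns true and takes the lease if it's free or expired, false otherwise.
  def try_obtain(root_namespace_id)
    now = @clock.call
    expires_at = @leases[root_namespace_id]
    return false if expires_at && expires_at > now

    @leases[root_namespace_id] = now + LEASE_TIMEOUT
    true
  end
end
```

A worker that fails to obtain the lease simply does nothing; the delayed job scheduled `1.5` hours later catches any changes that happened in the meantime.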
This implementation has the following benefits:

- The statistics are updated asynchronously, so the refresh doesn't lengthen the transaction that updates `project_statistics`.
- A given root namespace is refreshed at most once per lease window, no matter how many of its projects change.

The downsides of this approach are:

- The storage statistics can lag up to `1.5` hours behind a change, which means there's a brief window in time where the statistics are inaccurate. This is not a major problem because we're not currently enforcing storage limits.
- The `namespace_aggregation_schedules` table will see a high rate of inserts and deletes, which may require that we tune autovacuum for this table.

We went with Attempt E because updating the storage statistics asynchronously was the least problematic and most performant way of aggregating the root namespaces.
Given this is a performance improvement, we have to be very careful introducing this change to GitLab.com, which is why we decided to release it behind a feature flag and roll it out gradually by:

1. Enabling it on staging.
2. Enabling it for the `gitlab-org` group and measuring the performance.
3. Enabling it globally.

Finally, if no problems arise, we can be confident this change performs properly on GitLab.com and we can remove the feature flag.
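The gradual rollout can be pictured with a minimal actor-based flag, a simplified stand-in for GitLab's `Feature.enabled?` checks (the class and method names below are illustrative, not GitLab's actual implementation):

```ruby
# Minimal actor-based feature flag: off by default, can be enabled for
# specific groups first, then globally once we're confident.
class FeatureFlag
  def initialize
    @globally_enabled = false
    @enabled_actors = []
  end

  # With an actor, enable just for that group; without one, enable globally.
  def enable(actor = nil)
    actor ? @enabled_actors << actor : @globally_enabled = true
  end

  def enabled?(actor = nil)
    @globally_enabled || (!actor.nil? && @enabled_actors.include?(actor))
  end
end

flag = FeatureFlag.new
flag.enable("gitlab-org")   # step 2: only gitlab-org gets the new path
flag.enabled?("gitlab-org") # => true
flag.enabled?("some-other") # => false
flag.enable                 # step 3: everyone gets it
```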
To assess the execution of this approach, we monitored the Sidekiq dashboards on Kibana to ensure jobs were being executed flawlessly and without using too much memory or CPU. Particularly, we observed the "Sidekiq queue size," "Rate of running jobs," and "Running jobs" dashboards.
The feature was enabled globally on staging and the execution of the jobs was satisfactory, but there was barely any traffic to measure the impact of our changes.
Our results were different on GitLab.com. We first enabled it for the `gitlab-org` group and quickly started to observe more traffic.
Once we enabled the feature flag globally, the rate of running jobs increased considerably.
We currently have nearly 400 000
statistics stored for root namespaces on GitLab.com, which are updated at a high pace.
Being able to efficiently fetch those statistics allows us to easily track the largest repositories and namespaces on an instance, and starts paving the way to enforce storage limits for groups on GitLab.com.
Cover photo by Bill Oxford on Unsplash.