This page is meant to track the discussion of different database design approaches for the Container Registry.
A Docker manifest describes how a Docker image is composed of multiple layers. A manifest can be identified by the layers it references, and as such it can be thought of as unique throughout the registry. Multiple repositories may reference the same manifest.
Below we discuss different approaches to modeling that. Model 1 aims to deduplicate manifests, so that there is only ever one record for the same manifest and repositories merely reference manifests. Model 2 does not deduplicate manifests this way. Here, a manifest always belongs to a repository (which is part of its key). The same manifest can be present in multiple repositories which in turn leads to "duplicate" manifest entries (aside from the fact they live in different repositories).
A similar concept holds true for layers of a manifest.
We have imported the container registry from dev.gitlab.org into a database using model 1. In this section, we gather statistics from that import to shed some light onto what to expect from this in-database deduplication.
The "dedup factor" indicates how many more entries we would expect if we didn't deduplicate them.
| Entity | Referenced from | Dedup factor |
Example for Manifest:
```sql
SELECT (SELECT count(*) FROM repository_manifests)
     / (SELECT count(*)::numeric FROM manifests);
```
In the case of the registry on dev.gitlab.org, we can see that almost all manifests are unique (dedup factor 1.003). On average, blobs are being referenced by 1.53 layers and 1.17 repositories.
In summary, if we were not deduplicating those records in the database (model 2 below), we'd have less than double the number of records - assuming dev.gitlab.org is representative enough.
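The blob-level factors quoted above can be derived the same way; a sketch, assuming the model-1 tables are named `blobs`, `layers`, and `repository_blobs` (table names are assumptions here):

```sql
-- Average number of layers referencing each blob (1.53 on dev.gitlab.org)
SELECT (SELECT count(*) FROM layers)::numeric
     / (SELECT count(*) FROM blobs) AS layer_dedup_factor;

-- Average number of repositories referencing each blob (1.17)
SELECT (SELECT count(*) FROM repository_blobs)::numeric
     / (SELECT count(*) FROM blobs) AS repository_dedup_factor;
```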
This approach has been validated to support all relevant container registry features. The database design discussion is well covered and includes query examples along with their plans.
The idea of this approach is to treat all entities as first-class citizens and use many-to-many reference tables to connect them. For example, for a given manifest digest, there will only ever be one entry in `manifests`. This entry might be referenced from multiple repositories through the many-to-many reference table `repository_manifests`. This naturally leads to deduplicating records, as they are not physically tied to the existence of the repository. See "Deduplication Ratios" above.
There is no natural partitioning key in this model. There are a couple of patterns by which the tables are accessed, but given all entities are first-class citizens, there is no common dimension that all tables (or a large enough subset of them) are accessed by.
This makes it problematic, if not impossible, to find a meaningful partitioning scheme, which has the potential to lead to performance problems down the road as tables become large.
The nature of this model is to treat all entities as first-class citizens. This helps to deduplicate records, but also allows a state where an entity no longer has any references but still exists in the database.
This means that we'll need a garbage collection algorithm to clean those entries up.
Since the model allows records to become "dangling", we'll have to implement a GC algorithm to find those and eventually delete them. Consider the example of dangling manifests: a manifest that is not referenced by any repository. We can find a batch of up to N=100 dangling manifests like so:
```sql
SELECT *
FROM manifests m
WHERE NOT EXISTS (
    SELECT 1
    FROM repository_manifests rm
    WHERE rm.manifest_id = m.id
)
LIMIT 100;
```
This is an anti-join across the full relation, which can be executed as a scan on two indexes. In the best case, we find N=100 records quickly. This is rather unlikely, though, since we generally strive to keep the registry clean of dangling records. The worst case is a clean registry (no dangling records at all): here we scan the entire relation, which is expensive and non-linear due to the join.
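For the anti-join to run as two index scans, the reference table needs an index on the join column; a sketch (the index name is illustrative, and the primary-key index on `manifests.id` covers the other side):

```sql
-- Supports the NOT EXISTS probe by manifest_id; together with the
-- primary-key index on manifests(id), the anti-join becomes two index scans.
CREATE INDEX index_repository_manifests_on_manifest_id
    ON repository_manifests (manifest_id);
```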
We have a good example of anti-join runtime characteristics. Here, runtime varies from a few milliseconds (best case: many dangling records, or dangling records at the beginning of the scan) to about 30 seconds (no dangling records).
The many-to-many reference tables are the ones that would become large, because they effectively connect N repositories with M manifests, for example (one record per connection). Ultimately, this can become a performance problem when they grow very large, particularly because the reference tables are used in joins across one or more other tables.
1-N repository structure with separate blob management
This approach treats the repository as the first-class model. A repository contains many manifests (1-N), and those contain many layers (1-N). Separately from the repository structure, we keep track of blobs residing in object storage. We automatically maintain a reference table to keep track of which blob represents a given layer (lookup by blob digest).
Minor differences to model 1 include using a single `digest` column to store both the algorithm and the actual digest, and normalizing the media type information into a lookup table.
Reference: SQL schema
What is not shown in the diagram is the possibility of having `blobs_layers` tracked automatically. This can be implemented in the database, but we may also do it from the application. The in-database implementation would rely on a trigger like so:
```sql
CREATE FUNCTION public.track_blob_usage() RETURNS trigger
    LANGUAGE plpgsql
AS $$
BEGIN
    IF (TG_OP = 'DELETE') THEN
        -- TODO: We can do more stuff here, this is just for illustrative purposes.
        -- Note: This doesn't have to be a trigger, it can also be application logic
        IF (SELECT COUNT(*)
            FROM blobs_layers
            WHERE id <> OLD.id
              AND digest = OLD.digest
              AND layer_id IS NOT NULL) = 0 THEN
            INSERT INTO blob_review_queue (digest)
            VALUES (OLD.digest)
            ON CONFLICT (digest) DO NOTHING;
        END IF;
    ELSIF (TG_OP = 'INSERT') THEN
        INSERT INTO blobs_layers (digest, repository_id, layer_id)
        VALUES (NEW.digest, NEW.repository_id, NEW.id)
        ON CONFLICT (digest, layer_id) DO NOTHING;

        INSERT INTO repository_blobs (repository_id, blob_digest)
        VALUES (NEW.repository_id, NEW.digest)
        ON CONFLICT (repository_id, blob_digest) DO NOTHING;
    END IF;

    RETURN NULL;
END
$$;

CREATE TRIGGER track_blob_usage_trigger
    AFTER INSERT OR UPDATE OR DELETE ON public.layers
    FOR EACH ROW EXECUTE PROCEDURE public.track_blob_usage();
```
In case of inserts to `layers`, we also keep track of the reverse lookup automatically in `repository_blobs`. When we delete a layer, we can check if there are any remaining usages left for this blob (notice the efficient lookup by `digest`) and, if not, insert the `digest` into a queuing table. A background process takes those records and eventually cleans up object storage and the corresponding database records.
Note this is just illustrative and we can get to the details later.
In contrast to model 1, this approach makes garbage collection straightforward. We never need to scan entire tables to find "dangling" records, because the database contents are always consistent.
For example, when we delete a layer, we can determine the affected blobs easily and schedule those for further checking and eventual deletion. This can even be implemented inside the database (triggers), if we wanted to. If we rely on cascading foreign keys, all of this can be triggered by a single `DELETE` - even deleting a full repository may be possible this way (though we might want to consider batch deletes), fully triggering a cleanup of all the layers and eventually scheduling the relevant blobs for deletion.
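A sketch of the cascading setup (the foreign-key column names and the referenced key are assumptions based on the trigger above):

```sql
-- Deleting a manifest cascades to its layers; the AFTER DELETE trigger on
-- layers then queues the affected blob digests for review.
ALTER TABLE layers
    ADD CONSTRAINT fk_layers_manifests
    FOREIGN KEY (repository_id, manifest_id)
    REFERENCES manifests (repository_id, id)
    ON DELETE CASCADE;
```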
Blob management needs (some) GC algorithm because we effectively deduplicate data in object storage. Other entities like manifests and layers, however, don't need GC in this model.
This is in contrast to model 1, where we allow a record to become "dangling" because we deduplicate all entities in the database, too.
There are two distinct areas of the database, each with its own partitioning key: the repository structure (keyed by repository) and blob management (keyed by blob digest).
The choice of partitioning key dictates how the respective tables should be accessed: queries must always include the respective partitioning key to be efficient.
For blob information, this leads to "Drawback 2" below.
Note that this partitioning scheme can ultimately also serve as the basis for an application sharding design. We would apply the same idea to the repository structure (split by repository) and rely on a single blob management area (or even divide that into more parts, at the expense of space usage efficiency).
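As a sketch, the two areas could use PostgreSQL declarative partitioning with their respective keys (hash partitioning, the column lists, and partition counts are assumptions):

```sql
-- Repository structure: partitioned by repository
CREATE TABLE manifests (
    repository_id bigint NOT NULL,
    digest        text   NOT NULL,
    payload       bytea  NOT NULL,
    PRIMARY KEY (repository_id, digest)
) PARTITION BY HASH (repository_id);

CREATE TABLE manifests_p0 PARTITION OF manifests
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ... remaining partitions ...

-- Blob management: partitioned by digest instead
CREATE TABLE blobs (
    digest        text   NOT NULL PRIMARY KEY,
    size          bigint NOT NULL,
    media_type_id int    NOT NULL
) PARTITION BY HASH (digest);
```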
With the repository being the top-level entity, manifests and their layers are not deduplicated (in contrast to model 1, where we don't store a given manifest twice). This is intentional, to support the ability to separate data by repository internally (see partitioning).
Since `manifests` also contains the actual payload of a manifest, this may have a significant effect on overall database size. It remains to be seen whether this is mitigated by partitioning; the impact depends on how well manifests would deduplicate, i.e. what the dedup factor is.
A blob has metadata attached, like its `size` or the detected `media_type_id`. This is stored in `blobs` but also in `layers`. The reason for this duplication is that we have queries that scan a range of manifests (e.g. all layers for a manifest). If we only had `manifests.digest`, we would have to look this information up in `blobs`. That table in turn can only be queried efficiently by `digest`, rendering the lookups single-record queries and effectively an N+1 pattern. This is mitigated by duplicating the information into `layers`.
Note: In most cases, we will have to resolve a repository's `<name>` to its corresponding `<id>` first. This is needed so we can use the id as a partitioning key in later queries.
In this case, we delete the corresponding record in `manifests`:

```sql
DELETE FROM manifests WHERE repository_id = :id AND digest = :digest;
```
The installed trigger inserts all referenced blobs (the configuration and the layers in the manifest) into the `blob_review_queue`. A later GC process inspects those and deletes blobs that no longer have any references.
Deleting a full repository doesn't seem to be implemented in the API spec. It would work like deleting all manifests in the repository.
In this case, we delete the tag:

```sql
DELETE FROM tags WHERE repository_id = :id AND name = :reference;
```
This can lead to untagged manifests, which in turn are eligible for deletion. In order to defer this check to background GC, we insert all blobs referenced by the manifest the tag points to into the `blob_review_queue`, but only if no other tag points to the same manifest in that repository. We might be able to skip this step early in case there is another tag for the same manifest. Note that, for simplicity, we could also do these checks in GC instead.
Once a blob upload is finished, we create the record in `blobs` and associate the blob with the repository through `repository_blobs`. In regular situations, the client may upload more blobs until it pushes the manifest. However, this process might fail, and we would be left with dangling blobs (a configuration or a layer that is related to the repository but not referenced by any manifest).
In this case, we would insert the blob digest into the `blob_review_queue` after the upload has finished (and before the corresponding records in `blobs` and `repository_blobs` have been created). By setting `blob_review_queue.review_after` to a time in the future, we can delay the checking of this blob (which effectively defines a timeout by which we expect the client to have pushed the manifest).
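The delayed check could look like this when finishing an upload (the one-hour timeout is an assumed value):

```sql
-- Schedule the freshly uploaded blob for review in the future; if the
-- client pushes a referencing manifest before then, the entry is dequeued.
INSERT INTO blob_review_queue (digest, review_after)
VALUES (:digest, NOW() + INTERVAL '1 hour')
ON CONFLICT (digest) DO NOTHING;
```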
Now the client pushes the manifest.
We create the corresponding records in `layers`. After completing the verification steps, we can de-queue (delete) the digest from the `blob_review_queue`, because we are now sure that there is a reference to this blob and we can skip GC. Note this is optional, as the GC process would also be able to figure this out on its own - deleting from the queue is cheap (and likely cheaper than running GC), so this is an optimization.
When uploading a manifest, if `reference` is a tag (it can be a digest as well), we have to consider the scenario where an existing tag is switched from manifest A to manifest B. For example, if we:

1. Build a docker image for a sample application, tag it with `myapp:latest` and push it to the registry. When pushing the image to the registry, its manifest, let's call it A, will be uploaded and tagged with `latest` inside the repository.
2. Change the sample application source code, rebuild the image with the same `myapp:latest` tag and push it to the registry. Because we changed the source code, this image will have a different manifest, let's call it B, which will be uploaded with tag name `latest`.

When the registry receives the manifest B, it finds out that another manifest, A, is already tagged with `latest` in the repository. In this situation, the registry will remove the `latest` tag from A and point it to B instead. Because of this, manifest A may now be eligible for deletion if no other tag points to it.

To account for this, we need to insert all blobs referenced by manifest A into the `blob_review_queue`.
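A sketch of that insert, assuming `layers` carries a `manifest_id` column and the id of the previously tagged manifest A has been resolved (the configuration blob would be queued analogously):

```sql
-- Queue every layer blob referenced by the previously tagged manifest A;
-- background GC later decides whether each blob is still referenced.
INSERT INTO blob_review_queue (digest)
SELECT l.digest
FROM layers l
WHERE l.repository_id = :repository_id
  AND l.manifest_id   = :old_manifest_id
ON CONFLICT (digest) DO NOTHING;
```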
We implement a background job system that consumes entries from `blob_review_queue` and performs GC checks and actions. This can be done with goroutines, and synchronization on `blob_review_queue` can be implemented with `SELECT ... FOR UPDATE SKIP LOCKED` mechanics.
Note: A particular choice here is that GC is performed on a single blob at a time. It could be expanded to support batch GC; however, the database model used here is optimized for by-digest lookups.
In order to retrieve the next blob digest for review from the queue (the queue "head"), we find the next non-locked and qualifying record:
```sql
BEGIN;

SELECT *
FROM blob_review_queue
WHERE review_after < NOW()
FOR UPDATE SKIP LOCKED
LIMIT 1;

-- now we perform checks and cleanup action
-- and once done, we remove the digest from the queue:
DELETE FROM blob_review_queue WHERE digest = :digest;

COMMIT;
```
There is a caveat: in order to delete the blob from storage, we would perform external IO within an open database transaction (an anti-pattern). To mitigate this, we can do the following:
```sql
BEGIN;

SELECT *
FROM blob_review_queue
WHERE review_after < NOW()
FOR UPDATE SKIP LOCKED
LIMIT 1;

UPDATE blob_review_queue
SET review_after = review_after + INTERVAL '5 minutes'
WHERE digest = :digest;

COMMIT;

-- now we perform checks and cleanup action (outside the scope of a database transaction)
-- and once done, we remove the digest from the queue:
DELETE FROM blob_review_queue WHERE digest = :digest;
```
The trade-off here is that we have short database transactions, at the expense of turning the queue into one with at-least-once semantics. That is, we may see multiple jobs trying to clean up the same digest - for example, when the external IO takes longer than 5 minutes (the offset for `review_after`) or a job fails completely. The assumption is that this is never a problem because the jobs are idempotent.
Given a single blob digest, we implement a series of checks to determine whether or not the blob should be deleted from object storage:
```sql
SELECT 1 FROM blobs_layers WHERE digest = :digest LIMIT 1;
```

```sql
SELECT 1 FROM repository_blobs WHERE blob_digest = :digest;
```

(TODO: The partitioning scheme is not suitable here; we might want to have `blobs_repositories` additionally - tbd)

```sql
SELECT 1 FROM blobs_configurations WHERE digest = :digest LIMIT 1;
```
The database model supports all of those checks, and we would typically benefit from the partitioning model plus an efficient index (see the examples above: single-record lookups).
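The individual checks could also be collapsed into a single statement; a sketch:

```sql
-- A blob is deletable only if no layer, repository, or configuration
-- still references it.
SELECT NOT (
       EXISTS (SELECT 1 FROM blobs_layers         WHERE digest      = :digest)
    OR EXISTS (SELECT 1 FROM repository_blobs     WHERE blob_digest = :digest)
    OR EXISTS (SELECT 1 FROM blobs_configurations WHERE digest      = :digest)
) AS deletable;
```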
Now that we've determined a blob should be deleted, we perform the following steps:
We discussed a need for synchronization with incoming requests. That is, an incoming request (such as putting a manifest referencing this blob, or checking for it) might have to be serialized with the cleanup action to get consistent results.