September 29, 2021
3 min read

Why we spent the last month eliminating PostgreSQL subtransactions

How a mysterious stall in database queries uncovered a performance limitation with PostgreSQL.


Since last June, we had noticed that the database on GitLab.com would
mysteriously stall for minutes at a time, causing users to see 500
errors during those periods. After a painstaking investigation over
several weeks, we finally uncovered the cause: initiating a
subtransaction via the SAVEPOINT SQL query while a long transaction is
in progress can wreak havoc on database replicas. That discovery
launched a race, which we recently completed, to eliminate all
SAVEPOINT queries from our code. Here's what happened, how we
discovered the problem, and what we did to fix it.

The symptoms begin

On June 24th, we noticed that our CI/CD runners service reported a high
error rate:

[Chart: CI runner error rate]

A quick investigation revealed that the database queries used to
retrieve CI/CD build data were timing out and that the backlog of
unprocessed builds was growing quickly:

[Chart: unprocessed builds queue]

Our monitoring also showed that some of the SQL queries were waiting for
PostgreSQL lightweight locks (LWLocks):

[Chart: aggregated LWLock waits]

In the following weeks we experienced a few more incidents like this. We
were surprised by how sudden these performance degradations were, and how
quickly things could return to normal:

[Chart: CI queries latency]

Introducing Nessie: Stalled database queries

To learn more, we extended our observability tooling to sample more
data from pg_stat_activity. In PostgreSQL, the pg_stat_activity
virtual table lists all database connections in the system along with
what each connection is waiting for, such as a SQL query from the
client. We observed a consistent pattern: the queries were waiting on
SubtransControlLock. Below is a graph of the endpoints or jobs that
were stalled:

[Chart: endpoints blocked on SubtransControlLock]

The purple line shows the sampled number of transactions locked by
SubtransControlLock for the POST /api/v4/jobs/request endpoint that
we use for internal communication between GitLab and GitLab Runners
processing CI/CD jobs.

Although this endpoint was impacted the most, the whole database
cluster appeared to be affected, as many other, unrelated queries timed
out.
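
For reference, sampling these waits by hand is straightforward. The
query below is a minimal sketch against pg_stat_activity (available in
this shape since PostgreSQL 9.6), not the exact tooling we run:

SELECT pid, state, wait_event_type, wait_event, left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock';

-- Stalled backends show up with wait_event = 'SubtransControlLock'
-- (PostgreSQL 12 and earlier) or 'SubtransSLRU' (PostgreSQL 13+).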

This same pattern would rear its head on random days. A week would pass
by without incident, and then it would show up for 15 minutes and
disappear for days. Were we chasing the Loch Ness Monster?

Let's call these stalled queries Nessie for fun and profit.

What is a SAVEPOINT?

To understand SubtransControlLock (PostgreSQL 13 renamed this to
SubtransSLRU), we first must understand how subtransactions work in
PostgreSQL. In PostgreSQL, a transaction can start via a BEGIN
statement, and a subtransaction can be started with a subsequent
SAVEPOINT query. PostgreSQL assigns each of these a transaction ID
(XID for short) when a transaction or a subtransaction needs one,
usually before a client modifies data.
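
As a minimal illustration (the table and savepoint names below are made
up, not from our codebase):

BEGIN;                                 -- top-level transaction
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
SAVEPOINT my_savepoint;                -- starts a subtransaction
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
ROLLBACK TO SAVEPOINT my_savepoint;    -- undoes only the work since the SAVEPOINT
COMMIT;                                -- the first UPDATE still commits

Here the top-level transaction is assigned an XID for the first UPDATE,
and the subtransaction gets its own XID for the second one, which is
what lets PostgreSQL roll back part of the work without abandoning the
whole transaction.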

Why would you use a SAVEPOINT?

For example, let's say you were running an online store and a customer
placed an order. Before the order is fulfilled, the system needs to
ensure a credit card account exists for that user. In Rails, a common
pattern is to start a transaction for the order and call
find_or_create_by. For example:

Order.transaction do
  begin
    CreditAccount.transaction(requires_new: true) do
      CreditAccount.find_or_create_by(customer_id: customer.id)
    end
  rescue ActiveRecord::RecordNotUnique
    retry
  end
  # Fulfill the order
  # ...
end

If two orders were placed around the same time, you wouldn't want the
creation of a duplicate account to fail one of the orders. Instead, you
would want the system to say, "Oh, an account was just created; let me
use that."

That's where subtransactions come in handy: the requires_new: true
tells Rails to start a new subtransaction if the application is already
in a transaction. The code above translates into several SQL calls that
look something like:
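
(The following is a rough sketch rather than a verbatim query log;
active_record_1 is the style of savepoint name Rails generates, and the
table and column names come from the example above.)

BEGIN;                                                 -- Order.transaction
SAVEPOINT active_record_1;                             -- requires_new: true
SELECT * FROM credit_accounts WHERE customer_id = 1 LIMIT 1;
INSERT INTO credit_accounts (customer_id) VALUES (1);
RELEASE SAVEPOINT active_record_1;                     -- subtransaction finished
-- ... fulfill the order ...
COMMIT;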
