Starting last June, we noticed that the database on GitLab.com would
mysteriously stall for minutes at a time, leaving users to see 500
errors. Through a painstaking investigation over several weeks, we
finally uncovered the cause: initiating a
subtransaction via the SAVEPOINT
SQL query while
a long transaction is in progress can wreak havoc on database
replicas. Thus launched a race, which we recently completed, to
eliminate all SAVEPOINT
queries from our code. Here's what happened,
how we discovered the problem, and what we did to fix it.
The symptoms begin
On June 24th, we noticed that our CI/CD runners service reported a high
error rate:
A quick investigation revealed that the database queries used to retrieve
CI/CD build data were timing out and that the backlog of unprocessed
builds was growing rapidly:
Our monitoring also showed that some of the SQL queries were waiting for
PostgreSQL lightweight locks (LWLocks):
In the following weeks, we experienced a few more incidents like this. We were
surprised to see how sudden these performance degradations were, and how
quickly things could go back to normal:
Introducing Nessie: Stalled database queries
In order to learn more, we extended our observability tooling to sample
more data from pg_stat_activity. In PostgreSQL, the pg_stat_activity
virtual table contains the list of all database connections in the system as
well as what they are waiting for, such as a SQL query from the
client. We observed a consistent pattern: the queries were waiting on
SubtransControlLock. The graph below shows the URLs or jobs that were
stalled:
The purple line shows the sampled number of transactions locked by
SubtransControlLock
for the POST /api/v4/jobs/request
endpoint that
we use for internal communication between GitLab and GitLab Runners
processing CI/CD jobs.
Although this endpoint was impacted the most, the whole database cluster
appeared to be affected, as many other, unrelated queries timed out.
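If you want to see this kind of contention for yourself, a sampling query
along the following lines can be run against pg_stat_activity. This is a
simplified sketch rather than the exact query our tooling uses; the columns
are standard pg_stat_activity fields:

-- Count active backends grouped by the lightweight lock they are waiting on
SELECT wait_event_type, wait_event, count(*) AS waiting
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event_type = 'LWLock'
GROUP BY wait_event_type, wait_event
ORDER BY waiting DESC;

A sudden spike of rows waiting on SubtransControlLock in the wait_event
column corresponds to the pattern described above.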
This same pattern would rear its head on random days. A week would pass
by without incident, and then it would show up for 15 minutes and
disappear for days. Were we chasing the Loch Ness Monster?
Let's call these stalled queries Nessie for fun and profit.
What is a SAVEPOINT?
To understand SubtransControlLock
(PostgreSQL 13 renamed this to SubtransSLRU), we first must understand how
subtransactions work in PostgreSQL. In PostgreSQL, a transaction can
start via a BEGIN
statement, and a subtransaction can be started with
a subsequent SAVEPOINT
query. PostgreSQL assigns each of these a
transaction ID (XID for short) when a transaction or a subtransaction
needs one, usually before a client modifies data.
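At the SQL level, the flow looks roughly like this (a minimal sketch with
made-up table names, just to show where the subtransaction begins and ends):

BEGIN;                                  -- top-level transaction starts here
INSERT INTO orders (customer_id) VALUES (42);
SAVEPOINT first_savepoint;              -- subtransaction starts here
INSERT INTO credit_accounts (customer_id) VALUES (42);
ROLLBACK TO SAVEPOINT first_savepoint;  -- undoes only the work after the savepoint
COMMIT;                                 -- the orders insert still commits

Rolling back to the savepoint discards the subtransaction's changes without
aborting the surrounding transaction, which is what makes the retry pattern
in the next section possible.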
Why would you use a SAVEPOINT?
For example, let's say you were running an online store and a customer
placed an order. Before the order is fulfilled, the system needs to
ensure a credit card account exists for that user. In Rails, a common
pattern is to start a transaction for the order and call
find_or_create_by. For example:
Order.transaction do
  begin
    CreditAccount.transaction(requires_new: true) do
      CreditAccount.find_or_create_by(customer_id: customer.id)
    end
  rescue ActiveRecord::RecordNotUnique
    retry
  end

  # Fulfill the order
  # ...
end
If two orders were placed around the same time, you wouldn't want the
creation of a duplicate account to fail one of the orders. Instead, you
would want the system to say, "Oh, an account was just created; let me
use that."
That's where subtransactions come in handy: the requires_new: true
tells Rails to start a new subtransaction if the application is already in
a transaction. The code above translates into several SQL calls that
look something like: