Gitlab hero border pattern left svg Gitlab hero border pattern right svg

Blueprint: Postgres query optimization bot to boost backend development process

On this page

Overview

This blueprint addresses the following problems:

SQL query micro-optimization is a process of optimization of an SQL (or the corresponding Ruby on Rails code), involving only one or a few SQL statements, usually executed one by one. The goal of such a process is achieving better performance and scalability of particular SQL queries without significant degrading of these characteristics for all other SQL queries that are executed in the database. (In contrast, SQL query macro-optimization addresses the top-down performance and scalability analysis of a particular database and its workload, usually done using some statistics collecting tools and/or logs analysis. This kind of optimization is beyond the scope of this article). SQL micro-optimization involves changes in the index set, query re-writing, DB schema re-design, and, less often, local changes in database configuration parameters.

Further, we discuss the drawbacks of the current (as of June 2019) SQL micro-optimization workflow used in GitLab development, such as:

We propose a new approach which is a successor of the existing one. In particular, we propose switching to a new Slack bot called "Joe bot", built on top of a special framework, "Database LaB". Working in pair, Joe bot and Database Lab are capable of performing a series of EXPLAIN (ANALYZE, BUFFERS) requests on a full-sized fresh production DB clone provisioned during several seconds, and allowing work with any types of queries including modifying ones. Working with an independent cloned database, Joe also allows to change database schema anyhow, create indexes, and reset the database to its initial state within a few seconds, being ready for a new micro-optimization session.

Both Joe bot and Database Lab are open source projects developed and maintained by Postgres.ai, and adapted for use in GitLab.com infrastructure by Nikolay Samokhvalov.

Analysis

When code changes involve a multi-terabyte database, it is not uncommon to have the following problem. A developer has an SQL query (or the corresponding ActiveRecord code), that is known to be problematic, or that is new, and she/he wants to check its performance and Postgres query optimizer behavior at production scale, but it is impossible because one or more of the following reasons (and sometimes most or even all of them):

  1. The query is time/resource-consuming and might negatively affect the production system's performance.
  2. The query is a modifying one, changing the database state, in extreme cases changing heavily (and resulting in a lot of I/O operations) or even ones that are not transaction-safe (such as TRUNCATE).
  3. Some DDL (example: CREATE INDEX) needs to be executed, and doing so on production just for development and testing purposes seems exceptionally unreasonable.
  4. Most developers do not have access to the production database servers or have only read-only access.

The Existing Solution

GitLab employees can use a special tool, available in Slack: /chatops run explain command. It works great and partially solves the last problem in the list above (providing developers indirect read-only access to production databases), but does not address any of the others. It works with one of the actual production replicas. Therefore, heavy, prolonged, and/or modifying operations are not allowed at all.

All this brings us to the position when even for a simple check, developers need to ask DBREs for help, which becomes a significant bottleneck in the development process.

The Staging Environment: Single, Differs from Production

The separate problem is that MRs that involve database-related changes are often not verified on production-scale data volumes at all due to the complexity of such checks implying enormous amount of DBRE's manual efforts.

Currently (June 2019), the staging database size is only ~0.5 TiB (postgres-checkup report for GitLab.com staging), while the production DB exceeds ~3 TiB (postgres-checkup report for GitLab.com production).

This means that table and index sizes on the staging DB nodes are significantly smaller than those on the production DB nodes, so SQL query behavior differs significantly (planner might choose very different plans, timings are smaller, and so on).

Additionally, as of June 2019, there is only one single staging environment, so it is not trivial to test the same modifying change multiple times, and it is extremely difficult if not impossible to parallelize various kinds of checks (for example, verifying various MRs in parallel).

Problem Statement

  1. As for June 2019, most MRs are never verified on a full-size DB copy. Database code reviews are done based only on the code check, without analyzing SQL statements on production-scale data sets.

  2. Most of the time, engineers develop changes using only tiny-size databases.

  3. Database-related review processes are often slowed down. Among the main reasons:
    • too few active database reviewers, and they are too busy (June 2019),
    • lack of data leading to multiple iterations of question/answer.
  4. Checking of modifying SQL statements and migrations is difficult for a DBRE, and impossible for a developer:
    • usually, there is no a full-size DB instance ready for that,
    • during development, the statement/migration under question needs to be analyzed multiple times, in iterations. Without special means (ability to quickly start from scratch), it becomes extremely difficult, often leading mistakes and development process slowdown.

The Proposed Solution

I propose using the new-generation of Slack bot, providing any engineer with the ability to do SQL query micro-optimization on a full-size production clone, running any kinds of queries without short timeout limits, with the ability to quickly restore the database state after any changes and without interfering with other developers.

My team (Postgres.ai) implemented these ideas in two open-source projects written in Go:

  1. Joe bot,
  2. Database Lab.

Joe is integrated with Slack and supports the following commands (preferably in a public channel):

Joe Demo

Currently (June 2019), the command reset is based on ZFS snapshots. Joe now supports AWS, but GCP support would be implemented similarly. Using GCP snapshots instead of ZFS may be an option too.

This approach can help to:

Estimated time needed for the PoC implementation: 2 weeks. Resources needed:

Comparison of SQL micro-optimization methods

Delays in MRs

Further reading

Continue reading: the Design document, describing th implementation and results of using the proposed solution in September 2019 – January 2020.


Author: Nikolay Samokhvalov {: .note}

GIT is a trademark of Software Freedom Conservancy and our use of 'GitLab' is under license