Design: CICD Pipeline for GitLab.com

Idea/Problem Statement

Release managers initiate deploys manually from the command line using takeoff. In the current process, we build a release candidate or official build and deploy it to three different stages in sequence:

For each stage, developers run manual tests and GitLab-QA. If there are no errors reported and manual tests pass, the release continues to the next stage.

Below is a timeline of the 10.3 release to production. It shows that we deploy at irregular intervals and that not all release candidates make it to production. In some cases, problems are not observed until the release receives a large amount of production traffic, which then requires patching production or rolling back the release.

Release timeline of 10.3

Staging and canary deployments omitted

Current shortcomings

There are a number of shortcomings to the current release process:

Design

This design proposes replacing the manual deployment of RCs to each deployment stage at predefined intervals. Instead, a CICD pipeline continuously deploys nightly builds to canary.

Once the deployment of the nightly builds to canary is complete, the canary fleet receives a small percentage of production traffic, and it is then promoted to GitLab.com.

Goals

The goals of this design are incremental and align with the CICD blueprint:

Tasks

Below is a draft set of issues that would be in the epic for implementing this design.

Issues that are defined or in progress

Issues that need scoping

Criticism: Design anti-goals, what this doesn't cover

GitLab.com should be moving in a direction that utilizes a container deployment strategy and dogfoods the cloud native product we are creating for customers. This design is meant to be compatible with the omnibus methods of installation and does not include a container migration strategy, although it should be considered as a next step in that direction.

Overall, this design proposal focuses on work that is an incremental change to our current infrastructure and process. Any work done in line with this proposal will be weighed against the goal of container-based deployments and such work will be prioritised.

Specifically this design does not require any of the following:

It does not preclude these items, but allows for a transition from using non-container deploys.

This design does, however, make some improvements that will help with the longer-term goal of creating pipeline(s) for continuous container deployments. These are:

Rollbacks and Patching

To deploy continuously to canary safely, there also needs to be a way to safely roll back and to deliver fast patch updates. This design proposes three different approaches:

Testing

Testing will be an integral part of the deploy pipeline, so testing at every deploy step is in scope for this design. Testing will combine GitLab-QA, acting as a gate for pipeline stages, with continuous traffic on the non-production stages and the production canary stage. This allows us to use our existing alerting infrastructure on the staging and canary stages so that regressions can be spotted early, before the changes reach production. Each CICD step will be able to check for outstanding alerts before continuing to the next stage.
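The gating rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline's actual implementation: `StageResult`, `may_promote`, and `run_pipeline` are hypothetical names, and how GitLab-QA results and outstanding alerts are actually collected is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StageResult:
    """Outcome of one deploy stage (hypothetical structure)."""
    stage: str
    qa_passed: bool  # did GitLab-QA pass on this stage?
    firing_alerts: List[str] = field(default_factory=list)  # outstanding alerts


def may_promote(result: StageResult) -> bool:
    """A stage passes its gate only if QA passed and no alerts are outstanding."""
    return result.qa_passed and not result.firing_alerts


def run_pipeline(results: List[StageResult]) -> List[str]:
    """Walk the stages in order and halt at the first one that fails its gate.

    Returns the names of the stages that passed before the pipeline halted.
    """
    passed = []
    for result in results:
        if not may_promote(result):
            break  # halt: later stages are never reached
        passed.append(result.stage)
    return passed
```

For example, if staging is clean but canary has an outstanding alert, `run_pipeline` returns only the staging stage and nothing is promoted past canary.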

Generating artificial load on the non-production stages

To ensure that we can detect performance regressions, it will be useful to generate artificial load. This design does not go into the details of how this is implemented. Some proposals so far have been:

Architecture

Current deployments

A deployment orchestration tool is necessary that can drain servers from HAProxy, run apt installs of the omnibus package, and restart/hup services after install. Currently this is done with takeoff.
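The three responsibilities above (drain, install, restart) can be sketched as an ordered plan for a single node. This is a hedged illustration and does not describe how takeoff works internally: `node_deploy_steps` is a hypothetical helper, the backend and server names are placeholders, and re-enabling the node at the end is an assumption. The `set server ... state` strings follow the HAProxy runtime API; nothing is executed here.

```python
from typing import List, Tuple


def node_deploy_steps(backend: str, server: str, package: str, version: str) -> List[Tuple[str, str]]:
    """Build the ordered (target, command) steps for deploying to one node.

    "haproxy" steps are runtime API commands for the load balancer;
    "node" steps are shell commands for the node being upgraded.
    """
    return [
        # Stop routing new traffic to the node before touching it.
        ("haproxy", f"set server {backend}/{server} state drain"),
        # Install the pinned omnibus package version.
        ("node", f"apt-get install -y {package}={version}"),
        # Restart services after the install (a hup may suffice for some services).
        ("node", "gitlab-ctl restart"),
        # Assumption: put the node back into rotation once it is upgraded.
        ("haproxy", f"set server {backend}/{server} state ready"),
    ]
```

Keeping the plan as data rather than executing it directly makes it easy to review, log, or dry-run a deployment before any node is touched.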

The current sequence of deployment to an environment is:

Post-deployment patches

In addition to the normal release process of omnibus builds, the production team employs post-deployment patches, a way to quickly patch production for high severity bugs or security fixes.

Post-deployment patches bypass validation and exist outside of the normal release process. The reason for this is to quickly deploy a change for a critical security fix, a high severity bug, or to mitigate a performance issue. The assumption is that once a post-deployment patch is deployed, changes deployed to canary will be halted until the patch(es) are incorporated into an omnibus-build.
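The halting rule in the last sentence can be expressed as a simple set check. This is a minimal sketch, assuming each patch has some identifier and that the set of patches contained in a candidate build is known; `canary_deploys_allowed` and both parameters are hypothetical names.

```python
from typing import Set


def canary_deploys_allowed(applied_patches: Set[str], patches_in_build: Set[str]) -> bool:
    """Return True only if every applied post-deployment patch is in the build.

    While any applied patch is missing from the candidate omnibus build,
    deploying that build to canary would silently revert the patch, so
    canary deploys stay halted.
    """
    return applied_patches.issubset(patches_in_build)
```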

CICD Design

The CICD approach for omnibus is divided into two pipelines: one that continuously deploys to canary using a nightly build, and another that deploys from canary to the rest of the fleet.

At any time during the deployment, if GitLab-QA fails or if there are any alerts, the pipeline is halted.

The second pipeline may be initiated automatically, or on-demand, when there is confidence in the nightly build on canary.

Production traffic to canary is controlled with GitLab ChatOps by setting the server weights. This allows us to increase the amount of production traffic on canary at any time, to gain more confidence in application changes before they reach the wider community.
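To make the effect of server weights concrete, the sketch below estimates the fraction of traffic that canary receives. It assumes the load balancer distributes requests in proportion to server weights, which holds for weight-aware balancing algorithms but is only an approximation under real traffic. `canary_traffic_share` is a hypothetical helper and the weights in the usage note are made up.

```python
from typing import Sequence


def canary_traffic_share(canary_weights: Sequence[int], main_weights: Sequence[int]) -> float:
    """Estimate the fraction of traffic served by canary.

    Assumes traffic is split in proportion to server weights across the
    canary servers and the main fleet servers.
    """
    canary_total = sum(canary_weights)
    total = canary_total + sum(main_weights)
    if total == 0:
        raise ValueError("at least one server must have a non-zero weight")
    return canary_total / total
```

For instance, two canary servers at weight 5 next to two main fleet servers at weight 95 give canary a share of 10 / 200, or 5% of traffic. Setting all canary weights to 0 removes canary from production traffic entirely.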

Pipeline Diagram

Azure Canary

Deployment stages to Canary


Deployment stages from Canary to Production