Release managers initiate deploys manually from the command line using takeoff.
The current process is to build a release candidate or official build and
deploy it to three stages in sequence:
Staging
Production Canary Stage
Production Main Stage
For each stage, developers run manual tests and GitLab-QA.
If there are no errors reported and manual tests pass, the release
continues to the next stage.
Below is a timeline of the 10.3 release to production.
From this, it is clear that we deploy at irregular intervals and that not all
release candidates make it to production. In some cases, problems are not
observed until we see a large amount of production traffic, which then requires
patching production or rolling back the release.
Release timeline of 10.3 (staging and canary deployments omitted):
RC1: Sept 3rd
RC2: Sept 5th
RC3: Sept 6th
RC4: Sept 7th
RC6: Sept 11th
RC8: Sept 17th
RC9: Sept 18th
RC10: Sept 19th
RC11: Sept 20th
10.3: Sept 22nd
Current shortcomings
There are a number of shortcomings to the current release process:
Release management is time intensive because deploying to staging, canary, and
production is initiated manually.
Large sets of changes see production traffic at once, sometimes making it
difficult to pinpoint which changes are causing issues.
The staging environment is useful for GitLab-QA and manual testing, but does
not receive any continuous traffic, which can make it difficult to spot
performance regressions before release candidates land on production.
Design
The goal of this design proposal is to replace the manual deployment of RCs to
each deployment stage at predefined intervals.
A CICD pipeline is constructed that continuously deploys nightly builds to canary.
Once a nightly build is deployed to canary, the canary fleet receives a small
percentage of production traffic, and the build is then promoted to GitLab.com.
Goals
The goals of this design are incremental and align with the CICD blueprint:
Use GitLab CICD for deployments from https://ops.gitlab.net that can be
driven with GitLab ChatOps, while also ensuring that there are no CICD or
tooling dependencies on GitLab.com.
Deploy nightly builds to the production canary stage in a CICD pipeline.
Create CICD stages that validate each deploy step; these steps include:
running GitLab-QA tests
checking for alerts on the stage
With a set of runners, run traffic on the staging and production canary stages.
Report pipeline metrics to Prometheus with a push gateway (see the sketch after this list).
Initiate database migrations on production for every deployment to the
production canary stage.
Promote nightly builds from canary to production, or push the official
build through the pipeline for the self-managed omnibus release on the 22nd.
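As an illustration of the metrics-reporting goal above, here is a minimal sketch
using the Python prometheus_client library. The Pushgateway address, job name,
and metric/label names are assumptions for illustration only, not decided values.

```python
# Minimal sketch: push deploy-pipeline metrics to a Prometheus Pushgateway.
# The gateway address, job name, and metric/label names are assumptions.
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = "pushgateway.example.gitlab.net:9091"  # placeholder address

def report_deploy_metrics(stage: str, duration_seconds: float, success: bool) -> None:
    registry = CollectorRegistry()

    last_duration = Gauge(
        "deploy_stage_duration_seconds",
        "Wall-clock duration of the last run of a deploy stage",
        ["stage"],
        registry=registry,
    )
    last_duration.labels(stage=stage).set(duration_seconds)

    last_success = Gauge(
        "deploy_stage_last_success_timestamp",
        "Unix timestamp of the last successful run of a deploy stage",
        ["stage"],
        registry=registry,
    )
    if success:
        last_success.labels(stage=stage).set_to_current_time()

    # Grouping under one job name lets dashboards and alerts track the pipeline.
    push_to_gateway(PUSHGATEWAY, job="omnibus-deploy", registry=registry)

if __name__ == "__main__":
    start = time.time()
    # ... a deploy step would run here ...
    report_deploy_metrics("canary", time.time() - start, success=True)
```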
Tasks
Below is a draft set of issues that would be in the epic for implementing this
design:
Create deployments for internal consumption: This is necessary to quickly
release undisclosed security updates to GitLab.com.
Continuous traffic against canary and staging
Fast patches/releases to production to address high-severity
issues like security vulnerabilities or site degradation.
Initiate the promotion of canary to production from GitLab, possibly with ChatOps
Add alert checking to the CICD pipeline
Add GitLab-QA to pipeline stages
Report metrics from the CICD pipeline to the prometheus pushgateway
GitLab ChatOps command to control weights on the canary stage; this controls how much
traffic is directed to it.
Criticism: Design anti-goals, what this doesn't cover
GitLab.com should be moving in a direction that utilizes a container deployment
strategy and dogfoods the cloud-native product we are creating for customers.
This design is meant to be compatible with the omnibus method of installation
and does not include a container migration strategy, although it should be considered as
a next step in that direction.
Overall, this design proposal focuses on work that is an
incremental change to our current infrastructure and process.
Any work done in line with this proposal will be weighed against
the goal of container-based deployments, and prioritised accordingly.
Specifically, this design does not require any of the following:
Removing the omnibus package as a deploy dependency
Migrating services to Kubernetes or using Kubernetes for deployment orchestration
Using pre-built images and auto-scaling
Blue/Green deployments beyond what is currently possible with canary
It does not preclude these items, but allows for a transition away from
non-container deploys.
This design does, however, make some improvements that will be helpful for the
longer-term goal of creating pipeline(s) for continuous container deployments;
these are:
Instrumenting CICD for checks against active alerts
Instrumenting CICD that incorporates GitLab-QA
Adding generated traffic to the non-production stages and production canary
stage.
Start integrating with ChatOps for deployments
Smaller changes that are automatically deployed up to canary, for internal use.
Automatic migrations daily, resulting in more frequent and smaller database
updates.
Rollbacks and Patching
In order to safely deploy continuously to canary, there also needs to be a way to
roll back safely and deliver fast patch updates. This design proposes three different approaches:
Rollbacks to environments that deploy in reverse order on a deployment
stage: Rollbacks that are repeatable, safe and have the same impact as
upgrades.
Fast updates to environments: In some circumstances expedience trumps
availability. Updates may need to be
applied quickly when there are critical security vulnerabilities or
serious performance degradation. The update should be applied quickly
and with minimal impact, but may result in some errors or dropped connections.
Fast rollbacks to production: In the case of a serious release regression,
an environment may also need to be rolled back quickly. The rollback should
be applied quickly and with minimal impact, but may result in some
errors or dropped connections.
Testing
Testing will be an integral part of the deploy pipeline. For this reason,
testing at every deploy step is included in the scope of this design. This
testing will combine GitLab-QA, acting as a gate for pipeline stages, with
continuous traffic on the non-production stages and the production
canary stage. This allows us to use our existing alerting infrastructure on the
staging and canary stages so that regressions can be spotted early, before the
changes reach production. Each CICD step will have the ability to check for outstanding alerts before
continuing to the next stage.
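A minimal sketch of such an alert check is below, assuming alerts can be queried
from an Alertmanager HTTP API. The Alertmanager URL, the environment label, and
the severity values used for filtering are assumptions, not decided conventions.

```python
# Minimal sketch: fail a pipeline step if the environment has active critical alerts.
# The Alertmanager URL and the label names/values used to filter are assumptions.
import sys

import requests

ALERTMANAGER_URL = "https://alerts.example.gitlab.net"  # placeholder

def active_critical_alerts(environment: str) -> list:
    """Return firing alerts for the given environment with severity=critical."""
    resp = requests.get(
        f"{ALERTMANAGER_URL}/api/v2/alerts",
        params={
            "active": "true",
            "filter": [f'env="{environment}"', 'severity="critical"'],
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    env = sys.argv[1] if len(sys.argv) > 1 else "staging"
    alerts = active_critical_alerts(env)
    if alerts:
        for alert in alerts:
            print("active alert:", alert.get("labels", {}).get("alertname"))
        sys.exit(1)  # a non-zero exit halts the pipeline stage
    print(f"no active critical alerts on {env}")
```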
Generating artificial load on the non-production stages
In order to ensure that we can detect performance regressions, it will be useful
to generate artificial load. This design does not go into the details of
how this is implemented; some proposals so far have been (a rough sketch of the
first option follows the list):
Using siege to scrape a predefined set of endpoints
Using a subset of GitLab-QA tests in a fleet of runners, running continuously
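Purely as an illustration of the first option, here is a rough sketch of a loop
that continuously requests a predefined set of endpoints. The target stage,
endpoint list, and request rate are placeholders; nothing here is a decided
implementation.

```python
# Rough sketch only: keep a small, steady amount of traffic on a stage by
# continuously requesting a predefined set of endpoints. Values are placeholders.
import random
import time

import requests

BASE_URL = "https://staging.gitlab.com"  # placeholder target stage
ENDPOINTS = [
    "/explore",
    "/help",
    "/users/sign_in",
]
REQUESTS_PER_SECOND = 2  # placeholder rate

def generate_load() -> None:
    session = requests.Session()
    while True:
        path = random.choice(ENDPOINTS)
        try:
            resp = session.get(BASE_URL + path, timeout=10)
            print(resp.status_code, path)
        except requests.RequestException as exc:
            # Errors are only logged here; alerting on the stage itself is what
            # catches regressions.
            print("request failed:", path, exc)
        time.sleep(1.0 / REQUESTS_PER_SECOND)

if __name__ == "__main__":
    generate_load()
```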
Deployment orchestration
A deployment orchestration tool is necessary that can
drain servers from HAProxy, run apt installs of the omnibus package, and
restart or HUP services after install. Currently this is done with
takeoff.
The current sequence of deployment to an environment is (a rough sketch in code
follows the list):
Stop chef
Update the version role in Chef for the environment we are deploying to
Deploy the omnibus to the deploy node
Run migrations on the deploy node
Deploy to Gitaly (apt-get install gitlab-ee and restart Gitaly)
Deploy to the rest of the fleet
In parallel by role (as is currently done): apt-get install gitlab-ee and restart the corresponding service
Start chef
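To make the sequence above concrete, here is a rough sketch of what an
orchestration script could look like, assuming plain ssh access to the nodes.
The hostnames, role names, service names, and the update-chef-version-role
command are placeholders; this is not takeoff's actual implementation.

```python
# Rough sketch of the deploy sequence above, not takeoff's actual implementation.
# Hostnames, role names, service names, and helper commands are placeholders.
import subprocess

DEPLOY_NODE = "deploy.example.gitlab.net"                # placeholder
GITALY_NODES = ["gitaly-01.example.gitlab.net"]          # placeholder
FLEET = {
    "sidekiq": ["sidekiq-01.example.gitlab.net"],        # placeholder role -> nodes
    "unicorn": ["web-01.example.gitlab.net"],
}
VERSION = "10.3.0-rc1.ee.0"                              # placeholder package version

def ssh(host: str, command: str) -> None:
    subprocess.run(["ssh", host, command], check=True)

def deploy(environment: str) -> None:
    all_nodes = [DEPLOY_NODE] + GITALY_NODES + [n for ns in FLEET.values() for n in ns]

    # 1. Stop chef so it does not converge mid-deploy.
    for node in all_nodes:
        ssh(node, "sudo systemctl stop chef-client")

    # 2. Update the version role in Chef for this environment (placeholder command).
    subprocess.run(["update-chef-version-role", environment, VERSION], check=True)

    # 3/4. Install on the deploy node and run migrations from it.
    ssh(DEPLOY_NODE, f"sudo apt-get install -y gitlab-ee={VERSION}")
    ssh(DEPLOY_NODE, "sudo gitlab-rake db:migrate")

    # 5. Deploy to Gitaly and restart it.
    for node in GITALY_NODES:
        ssh(node, f"sudo apt-get install -y gitlab-ee={VERSION} && sudo gitlab-ctl restart gitaly")

    # 6. Deploy to the rest of the fleet in parallel by role, restarting the
    #    matching service (load-balancer draining is omitted in this sketch).
    for role, nodes in FLEET.items():
        for node in nodes:
            ssh(node, f"sudo apt-get install -y gitlab-ee={VERSION} && sudo gitlab-ctl restart {role}")

    # 7. Start chef again.
    for node in all_nodes:
        ssh(node, "sudo systemctl start chef-client")
```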
Post-deployment patches
In addition to the normal release process of omnibus builds, the production team
employs post-deployment patches, a way to quickly patch production for
high-severity bugs or security fixes.
Post-deployment patches bypass validation and exist outside of the normal
release process. The reason for this is to quickly
deploy a change for a critical security fix, a high-severity bug, or a
performance issue that needs mitigation. The assumption is that once a
post-deployment patch is deployed, changes deployed to canary will be halted
until the patch(es) are incorporated into an omnibus build.
CICD Design
The CICD approach for omnibus is divided into two pipelines: one that
continuously deploys to canary using a nightly build, and another
that deploys from canary to the rest of the fleet.
At any time during the deployment, if GitLab-QA fails or if there are any alerts,
the pipeline is halted.
The second pipeline may be initiated automatically, or on demand, when there is
confidence in the nightly build on canary.
Production traffic to canary is controlled with GitLab ChatOps by setting the
server weights. This allows us, at any time, to increase the amount of production
traffic on canary to gain more confidence in application changes before they
reach the wider community.
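As an illustration, a ChatOps-triggered job could adjust canary weights through
the HAProxy runtime API ("set weight <backend>/<server> <weight>" on the admin
socket). The socket path, backend name, and server names below are assumptions.

```python
# Minimal sketch: set the weight of canary servers via the HAProxy runtime API.
# The socket path, backend name, and server names are assumptions.
import socket

HAPROXY_SOCKET = "/run/haproxy/admin.sock"           # placeholder path
CANARY_BACKEND = "canary_web"                        # placeholder backend name
CANARY_SERVERS = ["canary-web-01", "canary-web-02"]  # placeholder server names

def haproxy_command(command: str) -> str:
    """Send a single command to the HAProxy admin socket and return its reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(HAPROXY_SOCKET)
        sock.sendall((command + "\n").encode())
        return sock.recv(4096).decode()

def set_canary_weight(weight: int) -> None:
    """Weight 0 takes canary out of rotation; higher weights divert more traffic to it."""
    for server in CANARY_SERVERS:
        reply = haproxy_command(f"set weight {CANARY_BACKEND}/{server} {weight}")
        print(f"{server}: {reply.strip() or 'ok'}")

if __name__ == "__main__":
    set_canary_weight(0)  # e.g. drain canary before a deploy
```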
Pipeline Diagram
Deployment stages to Canary
Stage 1: Migrations on staging
Check for outstanding alerts on staging; do not start if any critical alerts
are active.
Run migrations from a deploy host
Check for outstanding alerts
If there are no alerts after an interval of time, continue to the next
stage.
Stage 2: Deploy to staging Gitaly
Deploy to the Gitaly fleet.
Run GitLab-QA against staging.GitLab.com
Check for outstanding alerts
If there are no alerts after an interval of time, continue to the next
stage.
Stage 3: Deploy to the staging fleet
Deploy to the remaining fleet; nodes are drained and removed from the load
balancer as they are deployed.
Run GitLab-QA against staging.GitLab.com.
Check for outstanding alerts
If there are no alerts after an interval of time, continue to the next
stage.
Stage 4: Run post-deployment migrations on staging
Run post-deployment migrations
Check for outstanding alerts
If there are no alerts after an interval of time, continue to the next
stage.
Stage 5: Migrations on production
Check for outstanding alerts on production; do not start if any critical alerts
are active.
Run migrations from a deploy host
Run GitLab-QA against GitLab.com.
Check for outstanding alerts
If there are no alerts after an interval of time, continue to the next
stage.
Stage 6: Deploy to the production canary fleet
Ensure that no production traffic is diverted to the canary fleet
by setting the canary weights to zero.
Deploy to the canaries. Each node is drained and removed from the
load balancer while we operate on it.
Run GitLab-QA against canary.GitLab.com.
Check for outstanding alerts
If there are no alerts after an interval of time, pass the pipeline (a sketch of
this per-stage gate follows).
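The stages above repeat the same gate: run GitLab-QA against the stage, then
watch for alerts over an interval before moving on. A rough sketch of that gate
is below; the gitlab-qa invocation is a placeholder, the Alertmanager query
repeats the assumptions from the Testing section sketch, and the interval and
environment labels are not decided values.

```python
# Rough sketch of the per-stage gate used above: run GitLab-QA against the stage,
# then poll for active alerts over an interval before allowing the next stage.
# The gitlab-qa invocation, Alertmanager URL, labels, and interval are assumptions.
import subprocess
import time

import requests

ALERTMANAGER_URL = "https://alerts.example.gitlab.net"  # placeholder
CHECK_INTERVAL_MINUTES = 15                             # placeholder soak interval
POLL_SECONDS = 60

def run_gitlab_qa(target_url: str) -> None:
    # Placeholder invocation of the gitlab-qa gem against the deployed stage.
    subprocess.run(["gitlab-qa", "Test::Instance::Any", "EE", target_url], check=True)

def active_critical_alerts(environment: str) -> list:
    # Same query as the alert-check sketch in the Testing section.
    resp = requests.get(
        f"{ALERTMANAGER_URL}/api/v2/alerts",
        params={"active": "true", "filter": [f'env="{environment}"', 'severity="critical"']},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def soak_for_alerts(environment: str) -> None:
    """Fail if any critical alert fires on the environment during the soak interval."""
    deadline = time.time() + CHECK_INTERVAL_MINUTES * 60
    while time.time() < deadline:
        if active_critical_alerts(environment):
            raise RuntimeError(f"critical alerts active on {environment}, halting pipeline")
        time.sleep(POLL_SECONDS)

def stage_gate(target_url: str, environment: str) -> None:
    run_gitlab_qa(target_url)
    soak_for_alerts(environment)

if __name__ == "__main__":
    # e.g. gate after deploying to the canary fleet
    stage_gate("https://canary.gitlab.com", "canary")
```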
Deployment stages from Canary to Production
Stage 7: Deploy to the production Gitaly fleet
Check for outstanding alerts on production; do not start if any critical alerts
are active.
Using backend server weights, divert some production traffic to canary.
Check for outstanding alerts on production; do not continue if any critical
alerts are active.
Deploy the version on canary to the production Gitaly server
Run GitLab-QA against GitLab.com.
Check for outstanding alerts
If there are no alerts after an interval of time, continue to the next
stage.
Stage 8: Deploy the remaining production fleet
Check for outstanding alerts
Deploy to the remaining production fleet; nodes are drained and removed from the
load balancer as they are deployed.
Run GitLab-QA against GitLab.com.
Check for outstanding alerts
If there are no alerts after an interval of time, continue to the next