Release managers initiate deploys manually from the command line using takeoff. In the current process, we build a release candidate or official build and deploy it to three different stages in sequence:
For each stage, developers run manual tests and GitLab-QA. If there are no errors reported and manual tests pass, the release continues to the next stage.
Below is a timeline of the 10.3 release to production. It shows that we deploy at irregular intervals and that not all release candidates make it to production. In some cases, problems are not observed until the release receives a large amount of production traffic, which then requires patching production or rolling back the release.
Staging and canary deployments omitted
There are a number of shortcomings to the current release process:
The goal of this design proposal is to replace the manual deployment of RCs to each deployment stage at predefined intervals with a CICD pipeline that continuously deploys nightly builds to canary.
Once the deployment of the nightly builds to canary is complete, the canary fleet receives a small percentage of production traffic, and it is then promoted to GitLab.com.
The goals of this design are incremental and align with the CICD blueprint:
Below is a draft set of issues that would be in the epic for implementing this design:
GitLab.com should be moving in a direction that utilizes a container deployment strategy and dogfoods the cloud native product we are creating for customers. This design is meant to be compatible with the omnibus methods of installation and does not include a container migration strategy, although it should be considered as a next step in that direction.
Overall, this design proposal focuses on work that is an incremental change to our current infrastructure and process. Any work done in line with this proposal will be weighed against the goal of container-based deployments, and such work will be prioritised.
Specifically this design does not require any of the following:
It does not preclude these items, but allows for a transition from using non-container deploys.
This design does, however, make some improvements that will help with the longer-term goal of creating pipeline(s) for continuous container deployments:
In order to safely deploy continuously to canary, there also needs to be a way to safely roll back and deliver fast patch updates. This design proposes three different approaches:
Testing will be an integral part of the deploy pipeline, so testing at every deploy step is within the scope of this design. The testing will combine GitLab-QA, acting as a gate for pipeline stages, with continuous traffic on non-production stages and the production canary stage. This allows us to use our existing alerting infrastructure on the staging and canary stages so that regressions can be spotted early, before the changes reach production. Each CICD step will have the ability to check for outstanding alerts before continuing to the next stage.
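The gate described above can be sketched as a small decision function. This is a minimal illustration only: `StageStatus`, `may_promote`, and the alert name in the usage note are hypothetical, and how the GitLab-QA result and the outstanding alerts are actually collected is outside the scope of this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StageStatus:
    """Hypothetical snapshot of a stage, taken before continuing to the next one."""
    name: str
    qa_passed: bool  # outcome of the GitLab-QA run against this stage
    firing_alerts: list[str] = field(default_factory=list)  # outstanding alert names


def may_promote(status: StageStatus) -> tuple[bool, str]:
    """Return (allowed, reason): a stage passes only with green QA and no alerts."""
    if not status.qa_passed:
        return False, f"{status.name}: GitLab-QA failed"
    if status.firing_alerts:
        alerts = ", ".join(status.firing_alerts)
        return False, f"{status.name}: outstanding alerts: {alerts}"
    return True, f"{status.name}: gate passed"
```

For example, `may_promote(StageStatus("canary", True, ["HighErrorRate"]))` refuses promotion because an alert is still outstanding, even though GitLab-QA passed.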
To detect performance regressions, it will be useful to generate artificial load. This design does not go into the details of how this is implemented; some proposals so far have been:
A deployment orchestration tool is needed that can drain servers from HAProxy, run apt installs of the omnibus package, and restart or HUP services after the install. Currently this is done with takeoff.
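As a rough illustration of what such a tool has to coordinate for a single node, the sketch below builds an ordered plan of commands without executing anything. It is not how takeoff is implemented. The function name, the backend and server names, and the package name and version in the usage note are assumptions; the HAProxy lines use the runtime API `set server ... state` command, and the node lines assume `apt-get` and `gitlab-ctl restart` are the install and restart mechanisms.

```python
from __future__ import annotations


def plan_node_deploy(backend: str, server: str, package: str, version: str) -> list[tuple[str, str]]:
    """Return ordered (where, command) steps for one node; nothing is executed here.

    "haproxy" steps are HAProxy runtime API commands; "node" steps are shell
    commands to run on the server being upgraded (both are assumptions).
    """
    target = f"{backend}/{server}"
    return [
        # Stop sending new traffic to the node. A real tool would also wait
        # for in-flight connections to finish before installing.
        ("haproxy", f"set server {target} state drain"),
        # Install the pinned omnibus package version.
        ("node", f"apt-get install -y {package}={version}"),
        # Restart services so the new code is loaded.
        ("node", "gitlab-ctl restart"),
        # Put the node back into rotation.
        ("haproxy", f"set server {target} state ready"),
    ]
```

Calling `plan_node_deploy("web", "web-01", "gitlab-ee", "10.3.0-ee.0")` yields four steps that begin with the drain and end with the node being marked ready again. Returning a plan rather than running commands keeps the ordering easy to review and test.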
The current sequence of deployment to an environment is:
In addition to the normal release process of omnibus builds, the production team employs post-deployment patches, a way to quickly patch production for high-severity bugs or security fixes.
Post-deployment patches bypass validation and exist outside of the normal release process. The reason for this is to quickly deploy a change for a critical security fix, a high-severity bug, or to mitigate a performance issue. The assumption is that once a post-deployment patch is deployed, changes deployed to canary will be halted until the patch(es) are incorporated into an omnibus build.
The CICD approach for omnibus is divided into two pipelines: one that continuously deploys to canary using a nightly build, and another that deploys from canary to the rest of the fleet.
At any time during the deployment, if GitLab-QA fails or if there are any alerts, the pipeline is halted.
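The halting behaviour can be sketched as a runner that executes steps in order and stops at the first failure. This is a minimal illustration under assumed names: `run_pipeline` and the step names in the usage note are hypothetical, and each step is modelled as a callable that returns `True` on success.

```python
from __future__ import annotations

from typing import Callable, Iterable


def run_pipeline(steps: Iterable[tuple[str, Callable[[], bool]]]) -> tuple[bool, list[str]]:
    """Run steps in order and halt at the first one that reports failure.

    Returns (succeeded, names of the steps that completed). Steps after a
    failed one are never invoked.
    """
    completed: list[str] = []
    for name, step in steps:
        if not step():
            return False, completed  # pipeline halted at this step
        completed.append(name)
    return True, completed
```

With steps such as `[("deploy", deploy), ("gitlab-qa", run_qa), ("alerts", no_alerts_firing)]`, a failing GitLab-QA run returns `(False, ["deploy"])` and the alert check is not executed.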
The second pipeline may be initiated automatically, or on-demand, when there is confidence in the nightly build on canary.
Production traffic to canary is controlled with GitLab ChatOps by setting the server weights. This allows us to increase the amount of production traffic on canary at any time, to gain more confidence in application changes before they reach the wider community.
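To make the effect of server weights concrete, the sketch below estimates the share of traffic canary would receive, assuming the load balancer distributes requests in proportion to weight, and formats the HAProxy runtime API command that changes one server's weight. The function names and server names are hypothetical, and this is not the ChatOps command itself.

```python
from __future__ import annotations


def canary_share(weights: dict[str, int], canary_servers: set[str]) -> float:
    """Fraction of traffic going to canary, assuming weight-proportional balancing."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all server weights are zero")
    canary = sum(w for name, w in weights.items() if name in canary_servers)
    return canary / total


def set_weight_command(backend: str, server: str, weight: int) -> str:
    """Format an HAProxy runtime API command; absolute weights range from 0 to 256."""
    if not 0 <= weight <= 256:
        raise ValueError("weight must be between 0 and 256")
    return f"set server {backend}/{server} weight {weight}"
```

For instance, two main-fleet servers at weight 100 and one canary server at weight 50 give canary 50 / 250 = 20% of the traffic; lowering the canary weight to 10 reduces that to 10 / 210, a little under 5%.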