
Scheduled daily deployments on GitLab.com environments

Resources

Blueprint: Engineering release workflow

Epic:

Overview

Design for merging CE and EE codebases explains how working in two codebases affects our speed for getting a feature or a fix ready for deployment.

This document describes how we plan to increase the speed of getting a feature or a fix ready for deployment and making it available for public consumption on GitLab.com.

Description of the current situation

Design: CI/CD Pipeline describes the situation as seen in October CY2018. Since then, we've addressed a number of items raised in that design:

At the moment of writing this document (February CY2019), version rollback is in development.

Items that still require work are:

Detaching deployments from self-managed tagged releases

Working with self-managed tagged releases

Currently, GitLab.com deployments are tied to the release process for self-managed GitLab installations. This process has two critical dates each month: the 7th and the 22nd.

The 7th of the month is the so-called feature freeze. At that point in time, a new, slower-moving branch is created from the master branch. This branch is named MAJOR-MINOR-stable (where MAJOR and MINOR are the version of GitLab to be released).
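As a small illustration of that naming convention, the sketch below derives the branch name from a version number (the helper function is hypothetical and not part of any release tooling):

```python
def stable_branch_name(major: int, minor: int) -> str:
    """Derive the stable branch name for a GitLab version, e.g. 11.9 -> 11-9-stable."""
    return f"{major}-{minor}-stable"

print(stable_branch_name(11, 9))  # 11-9-stable
```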

Current branching

In the image above you can see that the stable branch moves more slowly than the master branch, and that any necessary critical fixes are cherry-picked from master into the stable branch.

This setup was created in the early days of GitLab to ensure that brand new features do not introduce regressions for users on GitLab.com. This system worked fine for as long as the master branch was not moving faster than stable by orders of magnitude. Another item worth highlighting is that the slower branch in theory introduces a smaller number of items to deploy to GitLab.com between the 7th and the 22nd, seemingly allowing for a more "stable" environment.

Neither of the two highlighted items above is true any longer: the master branch receives a large number of commits daily, and the first deploy after the 7th introduces a significant amount of change into GitLab.com environments.

Design

As noted in the CI-CD blueprint, GitLab.com lives in the now. Any released feature gets consumed right away. Any bug, regression, or security vulnerability is exposed quicker on GitLab.com. The model of having a stable branch gives a false sense of security, because any improvement or fix needs a full release cycle to reach GitLab.com, and that is no longer viable. In the same blueprint, we state that we want to get to a continuous delivery model. The jump between the current situation described above and our goal is too large to execute in one step. That step requires changes in tooling, infrastructure, and even development culture, all while keeping the deployment and release process of GitLab running.

The changes we need to think about are:

With this in mind, we need to think about a Transition process on our way to a New process.

Transition from the current release git branching model

The intermediate step is described in the image below:

Intermediate branching

While the intermediate step does not look that much different from the current process, it does create a few notable changes:

To explain what one release month might look like, let's run through the image above, assuming that we selected a proper transition date and that this date is one week before the feature freeze date of the 7th for the 11.9 release.

Week 1:

At the time when we used to create stable branches, we would instead create an 11-9-w1 branch. This branch would still receive features and fixes scheduled for the 11.9 release. When that branch is created, the first release candidate would be created and deployed to GitLab.com environments, and the automated and manual QA testing would be executed. Only the smoke tests in our QA suite would be able to "stop the world"; the rest would be addressed based on severity and priority. In case of a highly impactful issue that requires addressing, developers would create an MR with a fix and apply the Pick into 11-9 label. During that first week, several new release candidates would be created and deployed through GitLab.com environments.
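A minimal sketch of the git side of that first week, assuming the weekly branch is cut from master and the release candidate is tagged from it (the tag name v11.9.0-rc1 and the use of plain git commands are illustrative assumptions, not a description of the actual release tooling):

```python
import subprocess

def git(*args: str) -> None:
    """Run a git command and fail loudly if it errors."""
    subprocess.run(["git", *args], check=True)

# Cut the first weekly branch from master and tag the first release candidate.
git("checkout", "-b", "11-9-w1", "origin/master")
git("tag", "-a", "v11.9.0-rc1", "-m", "11.9 RC1")
git("push", "origin", "11-9-w1", "v11.9.0-rc1")

# A fix whose MR carries the "Pick into 11-9" label would later be
# cherry-picked into the weekly branch, for example:
# git("cherry-pick", "-x", "<merge-commit-sha>")
```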

Week 2:

By the end of Week 1 we would have had several deployments through GitLab.com environments. Week 2 would start with a new branch, 11-9-w2, created from master, repeating the same process as seen during Week 1.

Week 3:

In Week 3, the 11-9-w3 branch would be created and, assuming there are no impactful fixes to be addressed, no additional work would be required. At the end of Week 3, a new 11-9-backport branch is created. This branch would become the feature freeze branch: it only receives fixes, and the final 11.9.0 release would be tagged from it. It is worth noting here that the self-managed tagged release would be created based on the latest commit from the 11-9-backport branch, which is the commit running on GitLab.com.

Week 4:

Week 4 is actually Week 1 of the new release cycle, and during this week we would have a new branch, 11-10-w1. We would also still have the 11-9-backport branch for the release cycle that just completed.
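Pulling the four weeks together, a minimal sketch of the branches one transition-period release cycle would produce (the helper is illustrative only; names follow the examples above):

```python
def cycle_branches(major: int, minor: int, weekly_branches: int = 3) -> list[str]:
    """Branches created during one transition-period release cycle: weekly
    branches cut from master, followed by the backport (feature freeze) branch
    from which the final release is tagged."""
    names = [f"{major}-{minor}-w{week}" for week in range(1, weekly_branches + 1)]
    names.append(f"{major}-{minor}-backport")
    return names

print(cycle_branches(11, 9))   # ['11-9-w1', '11-9-w2', '11-9-w3', '11-9-backport']
print(cycle_branches(11, 10))  # the next cycle starts over with 11-10-w1
```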

New git branching model

The git branching model for our deployments and releases is described in the image below:

New branching

The new process has several notable changes:

In this model, release artifacts are created from specific commits on the master branch, and every commit on master is considered stable. This assumption is the primary requirement for the complete removal of the feature freeze. Commits are deployed from the master branch each day of the month when possible, and a release for self-managed users is created on the 22nd based on the state of the master branch at that time.
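A minimal sketch of how the commit for the self-managed release could be selected under this model (using git rev-list with a date cutoff is an illustrative assumption, not a description of the actual release tooling):

```python
import subprocess

def master_commit_at(cutoff_date: str) -> str:
    """Return the newest commit on master as of the given date, e.g. the 22nd,
    which would be the basis for the self-managed tagged release."""
    result = subprocess.run(
        ["git", "rev-list", "-1", f"--before={cutoff_date}", "origin/master"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

# A release tagged on the 22nd would be based on this commit.
print(master_commit_at("2019-02-22"))
```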

In case of regressions and bugs found while the release artifact is progressing through GitLab.com environments, a bugfix branch is created from the specific commit that was deployed. The developers would then resolve the issue in the bugfix branch, and a new release artifact would be created from that bugfix branch. This artifact would then be deployed to at least one environment, and only when the fix is confirmed to be working would the release artifact propagate through the rest of the GitLab.com environments. Once the fix has propagated through the environments, the MR can be merged into the master branch.
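A minimal sketch of that bugfix flow, with the build, deploy, and verification steps stubbed out as placeholders (none of the helper names below refer to real tooling):

```python
import subprocess

def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)

# Placeholder helpers -- the real build/deploy/verify tooling is not specified here.
def build_artifact(ref: str) -> str: ...
def deploy(artifact: str, environment: str) -> None: ...
def fix_verified(environment: str) -> bool: ...

def handle_regression(deployed_commit: str, bugfix_branch: str, environments: list[str]) -> None:
    """Branch from the deployed commit, verify the fix on one environment,
    then promote the new artifact through the remaining environments."""
    git("checkout", "-b", bugfix_branch, deployed_commit)
    # ... developers commit the fix on bugfix_branch here ...
    artifact = build_artifact(bugfix_branch)
    deploy(artifact, environments[0])
    if fix_verified(environments[0]):
        for environment in environments[1:]:
            deploy(artifact, environment)
        # Only after the fix has propagated is the MR merged back into master.
```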

Automated scheduled deployments

In parallel with the change in our git branching model, we require a change in our deployment cycles.

The changes we need to think about are:

Design

Similar to the change in the git branching model described above, the number of changes to tooling and processes is too large to execute in one step. Here too, we need to think about a Transition process on our way to a New process.

Transition from the current deployment process

To complement the Transition from the current release git branching model, the image below describes an intermediate step in deployment process changes:

Intermediate deploy

Note: The colors describe different deployments. For example, the light green box describes the promotion of the same deployment artifact through different environments. Darker green shows that the deployment artifact is based on the light green artifact, with some changes included to address an issue.

The intermediate process has a few notable changes:

The image also tries to describe a "regular" workflow, where no issues are found during deployments, and a workflow where issues are found in any of the steps.

The "regular" workflow has a clear cadence: Day 1 of the week is reserved for deploying the master branch commits to staging environment found on Day 1 of the current week. Deployment to production environment is also done on Day 1 of the current week but using previous weeks commits, after deploying to a new non-production environment. Deployment to production canaries is done at the middle of the week, allowing for additional (automated/manual) QA through the rest of the week before promoting to production at the start of the following week.

When issues are found during one of the deployment steps, the process becomes a bit more complicated. We can describe the process for each case, depending on the environment in which the problem is discovered:

Week 1: An issue is discovered after a deployment to staging

After deployment to the gstg environment, an impactful regression is found. In this case, the developers would check out the commit that was deployed to the gstg environment and create a fix. When the fix is ready, a new deployment artifact is created and deployed to the same environment. Once the fix is verified on that environment, the rest of the deploy promotion process continues as in the "regular" workflow.

Week 2: An issue is discovered after a deployment to canary

Deployment to the gstg environment didn't uncover any issues, so the deployment promotion continued to the production canary. Once that deployment was completed, an impactful regression is found, and that stops the progression of further deploys. Developers would need to create a branch based off the specific deployed commit, and a new deployment artifact is created. A new deployment is done on the new non-production environment and the fix is verified. Once confirmed, the deploy continues to the production canary and, if all goes well again, to production.

Week 3/4: An issue is discovered after a deployment to production

The deployment process went well during Week 3, and no issues were discovered when deploying through each of the environments. Once the deployment artifact was deployed to the production environment at the beginning of Week 4, an impactful regression is found.

In this scenario, the production environment is rolled back to the previously deployed version. In parallel, developers create a branch based off the specific deployed commit that was being promoted, and once the fix is ready, a new deployment artifact is created. The deployment is then executed on the new environment, and then on the production canary, verifying the fix each time. Once we are sufficiently confident, the deployment is executed on the production environment.
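The three scenarios above share a common shape: roll back production if it is affected, branch from the deployed commit, verify the fix on a safe environment, and resume the promotion from there. A minimal sketch of that shape, with all tooling stubbed out as placeholders (environment short names, promotion order, and helper functions are assumptions):

```python
PROMOTION_ORDER = ["gstg", "new non-production", "cny", "gprd"]  # assumed order

# Placeholder helpers -- real deployment tooling is not described in this document.
def rollback(environment: str, to_artifact: str) -> None: ...
def build_fix_artifact(deployed_commit: str) -> str: ...
def deploy_and_verify(artifact: str, environment: str) -> bool: ...

def handle_impactful_regression(found_in: str, deployed_commit: str, previous_artifact: str) -> None:
    """Generic handling for an issue found in any environment during promotion."""
    if found_in == "gprd":
        # Week 3/4 scenario: production is rolled back to the previous version first.
        rollback("gprd", previous_artifact)

    fix_artifact = build_fix_artifact(deployed_commit)

    # Re-verify on the environment where the issue surfaced, or on the new
    # non-production environment for canary/production issues, then resume
    # the promotion from that point onwards.
    start = "new non-production" if found_in in ("cny", "gprd") else found_in
    for environment in PROMOTION_ORDER[PROMOTION_ORDER.index(start):]:
        if not deploy_and_verify(fix_artifact, environment):
            break
```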

New deployment process

The new deployment process would not be different from the transitional proposal described above. The only real change is the frequency of deployments, where the weekly cadence would be replaced with a daily cadence. This would mean that deployment to gstg and the production canary would be done on day 1, and the new and production environments would be deployed to on day 2. For this to happen, we need to:

Final thoughts

This design document covers a number of topics that are logically independent but very much depend on each other:

The most notable missing piece is how addressing high priority and severity security vulnerabilities affects these deadlines and processes. That process has been omitted from this document only due to the complexity that security patches bring to each of the topics. It will be addressed in a separate design document.