
Change Management


Changes

A change is any modification to the operational environment. Changes are classified into two types:

Deployments are a special change metatype depending on their scope and the effect they may have on the environment, as defined above. As we make progress towards CI/CD, we aim to turn all deployments into simple service changes.

Trust

GitLab.com is the premier GitLab instance on the planet, and a production instance in every sense of the word. Change Management's primary goal is to safeguard the integrity of the GitLab.com environment through increased predictability by providing a framework to drive all changes towards becoming service changes and to help us achieve an optimal change speed.

Change Management is underpinned by trust: we trust ourselves to act responsibly in the operational environment to maintain its integrity and, by extension, its availability and performance.

To that end, we are not instituting a blanket policy for changes. Rather, we are developing the foundation of what a service change is (risk evaluation, automatic auditing and communication, pre-flight checks, defensive coding, post-change validation) and will help teams with adoption.

Change Management helps us prioritize our resources towards changes that need to be made more resilient through defensive automation. Priorities are driven by two factors:

In these situations, we will focus on developing the necessary automation and safeguards to help teams and services move towards safe service changes in a timely fashion. Until then, all changes that fall under the two abovementioned categories are treated as maintenance changes.

Change Severities

Change severities encapsulate the risk associated with a change in the environment. Said risk entails the potential effects if the change fails and becomes an incident. Change Management uses our standardized severity definitions, which can be found in the issue workflow documentation.

Change Type

The change type can be one of the following labels:

Change Plans

All changes should have change plans. While this is optional for service changes, it is mandatory for maintenance changes. Change Plans provide detailed descriptions of proposed changes and include the following information, depending on the criticality of the service:

Criticality 1:

Examples of Criticality 1:

  1. Any changes to Postgres hosts that affect DB functionality - quantity of nodes, changes to backup or replication strategy
  2. Architectural changes to Infra as code (IaC)
  3. IaC changes to pets - Postgres, Redis, and other Single Points of Failure
  4. Changes of major vendor - CDN, mail, DNS
  5. Major version upgrades of tooling (HAProxy, Chef)
| Field | Description |
|-------|-------------|
| Change Objective | Describe the objective of the change |
| Change Type | Type described above |
| Services Impacted | List of impacted services |
| Change Team Members | Names of the people involved in the change |
| Change Severity | How critical the change is |
| Change Reviewer | A colleague who will review the change |
| Tested in staging | Whether the change was tested in the staging environment |
| Dry-run output | If the change is done through a script, the script must have a dry-run capability; run the change in dry-run mode and attach the output |
| Due Date | Date and time, in UTC, for the execution of the change; if possible, add the local timezone of the person executing the change |
| Time tracking | Estimate and record the times associated with the change, including a possible rollback |
| Downtime Component | Whether the change requires downtime and, if so, for how many minutes |
| Detailed steps for the change | Each step must include: pre-conditions for execution of the step, execution commands for the step, post-execution validation for the step, rollback of the step, review of changed graphs in Grafana, and review of alerts to disable and enable |

Criticality 2:

Examples of Criticality 2:

  1. Load Balancer Configuration - major changes to backends or front ends, fundamental to traffic flow
  2. IaC changes to cattle / quantity when there is a decrease
  3. Minor version upgrades of tools or components (HAProxy)
  4. Removing old hosts from IaC (like removals of legacy infrastructure)
| Field | Description |
|-------|-------------|
| Change Objective | Describe the objective of the change |
| Change Type | Type described above |
| Services Impacted | List of impacted services |
| Change Team Members | Names of the people involved in the change |
| Change Severity | How critical the change is |
| Change Reviewer | A colleague who will review the change |
| Tested in staging | Whether the change was tested in the staging environment |
| Dry-run output | If the change is done through a script, the script must have a dry-run capability; run the change in dry-run mode and attach the output |
| Due Date | Date and time, in UTC, for the execution of the change; if possible, add the local timezone of the person executing the change |
| Time tracking | Estimate and record the times associated with the change, including a possible rollback |
| Detailed steps for the change | Each step must include: pre-conditions for execution of the step, execution commands for the step, post-execution validation for the step, rollback of the step, review of changed graphs in Grafana, and review of alerts to disable and enable |

Criticality 3:

Examples of Criticality 3:

  1. IaC changes to cattle / quantity when there is an increase (not requiring reboot or destroy/recreate)
  2. Changes in configuration for current systems serving customers related to DNS or CDN
| Field | Description |
|-------|-------------|
| Change Objective | Describe the objective of the change |
| Change Type | Type described above |
| Services Impacted | List of impacted services |
| Change Team Members | Names of the people involved in the change |
| Change Severity | How critical the change is |
| Change Reviewer or tested in staging | A colleague will review the change, or the change was tested in the staging environment |
| Dry-run output | If the change is done through a script, the script must have a dry-run capability; run the change in dry-run mode and attach the output |
| Due Date | Date and time, in UTC, for the execution of the change; if possible, add the local timezone of the person executing the change |
| Time tracking | Estimate and record the times associated with the change, including a possible rollback |
| Detailed steps for the change | Each step must include: pre-conditions for execution of the step, execution commands for the step, post-execution validation for the step, rollback of the step, review of changed graphs in Grafana, and review of alerts to disable and enable |
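
As an illustration of the level of detail expected for each step in a change plan, a single hypothetical step might look like the sketch below. The host names, commands, and dashboards are placeholders, not a prescribed procedure:

```
# Step N: restart haproxy on lb-example-01 (illustrative placeholder)
# Pre-condition: a peer load balancer is healthy and able to absorb traffic
curl -sf http://lb-example-01.internal:8080/health

# Execution
sudo systemctl restart haproxy

# Post-execution validation
curl -sf http://lb-example-01.internal:8080/health
sudo systemctl status haproxy

# Rollback: restore the previous configuration and restart haproxy again
# Review the relevant Grafana graphs and re-enable any alerts silenced for this step
```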

With change plans, we develop a solid library of change procedures. Even more importantly, they provide detailed blueprints for the implementation of defensive automation. In addition, every change request that uses some sort of script must have a dry-run capability; the script should be run in dry-run mode and its output provided in the CR for review. Ideally, the planner and the executor should be different individuals.
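
As a minimal sketch of the dry-run requirement (the `--dry-run` flag, the `run` helper, and the commands below are hypothetical placeholders, not an agreed-upon convention), a change script might look like this:

```
#!/usr/bin/env bash
# Hypothetical change script illustrating a dry-run mode.
# With --dry-run, each step is printed instead of executed, so the output
# can be attached to the change request for review.
set -euo pipefail

DRY_RUN=false
if [[ "${1:-}" == "--dry-run" ]]; then
  DRY_RUN=true
fi

run() {
  if [[ "$DRY_RUN" == true ]]; then
    echo "[dry-run] $*"
  else
    echo "[exec] $*"
    "$@"
  fi
}

# Placeholder change steps
run systemctl stop example-service
run cp /etc/example-service/config.yml /etc/example-service/config.yml.bak
run systemctl start example-service
```

The `[dry-run]` lines produced by such a script are what would be pasted into the change request for review.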

Change Schedule

Please use UTC as the standard timezone for all changes.

The following table shows the schedule for changes, based on the criticality level of the component:

|               | 10 PM - 6 AM | 6 AM - 2 PM | 2 PM - 10 PM |
|---------------|--------------|-------------|--------------|
| Criticality 1 | ALLOWED      | NOT ALLOWED | NOT ALLOWED  |
| Criticality 2 | ALLOWED      | NOT ALLOWED | NOT ALLOWED  |
| Criticality 3 | ALLOWED      | ALLOWED     | ALLOWED      |
| Criticality 4 | ALLOWED      | ALLOWED     | ALLOWED      |

Please use the time slots on the Production calendar to schedule change requests of Criticality 1 and 2. For the other criticalities, please add the change directly to the calendar.
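
As a sketch only (the window is taken from the table above; the helper itself is hypothetical), checking whether a Criticality 1 or 2 change currently falls inside its allowed window could look like:

```
#!/usr/bin/env bash
# Hypothetical helper: Criticality 1 and 2 changes are only allowed 10 PM - 6 AM UTC.
hour=$(date -u +%H)   # current hour in UTC, 00-23
if (( 10#$hour >= 22 || 10#$hour < 6 )); then
  echo "Inside the allowed window for Criticality 1 and 2 changes"
else
  echo "Outside the allowed window; schedule the change between 10 PM and 6 AM UTC"
fi
```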

Change Request Workflow

To drive a change request to be executed on production:

Change Execution

If the change is executed by a script, it should be run from the bastion host of the target environment in a terminal multiplexer (e.g. screen or tmux) session. Using a bastion host has the benefit of preventing any unintended actions (e.g. caused by a script bug) from spreading to other environments. A terminal multiplexer guards against the possibility of losing connection to the bastion mid-change and the unpredictable consequences of it.

sudo is disabled on the bastion hosts, so you can copy your Chef PEM file to one of them, if your script requires it, without fearing it being snooped on.
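
For example, copying the PEM file could be as simple as the following (the file path and name are illustrative):

```
your-workstation $ scp ~/.chef/your-username.pem bastion-01-inf-gstg:
```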

A sequence of actions to run a script could look like this:

```
your-workstation $ ssh -A bastion-01-inf-gstg
bastion-01-gstg  $ tmux
bastion-01-gstg  $ git clone git@gitlab.com:my-migration/script.git
bastion-01-gstg  $ ./script/migrate
```
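
If the SSH connection drops mid-change, the script keeps running inside the tmux session; after reconnecting, reattach to it (assuming a single tmux session on the host):

```
your-workstation $ ssh -A bastion-01-inf-gstg
bastion-01-gstg  $ tmux attach
```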

Change Reviews

Maintenance changes require change reviews. The reviews are intended to bring to bear the collective experience of the team while providing a forum for pointing out potential risks for any given change. A minimum quorum of three reviewers is required to approve a ~S1 or ~S2 maintenance change.

Roles

| Role | Definition and Examples |
|------|-------------------------|
| EMOC | Event Manager. The Event Manager is the tactical leader of the change team. For service changes, the EMOC is the person executing the change. For maintenance changes, the EMOC is the person in the IMOC rotation. ~S1 and ~S2 changes require an EMOC. |
| CMOC | Communications Manager. The Communications Manager is the communications leader of the change team. The focus of the Change Team is executing the change as safely and quickly as possible. For ~S1 and ~S2 maintenance changes, a CMOC communicates with the appropriate stakeholders. Otherwise, the EMOC can handle communication. |
| CT | Change Team. The Change Team is primarily composed of the technical staff performing the change. |

Communication Channels

Information is a key asset during any change. Properly managing the flow of information to its intended destination is critical in keeping interested stakeholders apprised of developments in a timely fashion. The awareness that a change is happening is critical in helping stakeholders plan for said changes.

This flow is determined by:

For instance, a large end-user may choose to avoid doing a software release during a maintenance window to avoid any chance that issues may affect their release.

Furthermore, avoiding information overload is necessary to keep every stakeholder’s focus.

To that end, we will have: 

Production Change Lock (PCL)

While the changes we make are rigorously tested and carefully deployed, it is good practice to temporarily halt production changes during certain events, such as GitLab LiveStream, GitLab Summit, and days when LOA (leave of absence) due to holidays is high across engineering teams. We categorize these special periods of time into two buckets:

  1. GitLab Events
  2. High LOA

The risks of making a production deployment during these periods include immediate customer impact and/or reduced engineering team coverage in case an incident occurs and has to be resolved immediately. Therefore, we have introduced a mechanism called Production Change Lock (PCL). We see the future of PCL as an automated process which, provided a time range, locks production deployments and releases the lock once the time expires. However, as the first iteration towards this future state, we are starting with creating events on our Production Calendar so that teams are aware of the PCL periods.

| Start Date/Time | End Date/Time | Timezone Region | Description |
|-----------------|---------------|-----------------|-------------|
| 05/08/19 12:00AM | 05/15/19 12:00 | NA | GitLab Contribute 2019 |

Questions