Change Management has traditionally referred to the processes, procedures, tools and techniques applied in IT environments to carefully manage changes in an operational environment: change tickets and plans, approvals, change review meetings, scheduling, and other red tape.
In our context, Change Management refers to the guidelines we apply to manage changes in the operational environment with the aim of doing so (in order of highest to lowest priority) safely, effectively and efficiently. In some cases, this will require the use of elements from traditional change management; in most cases, we aim to build automation that removes those traditional aspects of change management to increase our speed in a safe manner.
Our overriding objective is to maximize the number of changes that avoid the traditional aspects of change management; getting there is an iterative process that will evolve over time. Success is measured by our ability to safely execute changes at the speed required by our business needs.
Changes are defined as modifications to the operational environment, including configuration changes, adding or removing components or services, and cloud infrastructure changes. Our Staging environment is crucial to our GitLab.com release process. Therefore, Staging should be considered within scope for Change Management, as part of GitLab's operational environment. Application deployments, while technically changes, are excluded from the change management process, as are most, but not all, feature flag toggles.
Changes that need to be performed during the resolution of an Incident fall under Incident Management.
|Infrastructure||Responsible for implementing and executing this procedure|
|Infrastructure Management (Code Owners)||Responsible for approving significant changes and exceptions to this procedure|
Plan issues are opened in the production project tracker via the change management issue template. Each issue should be opened using the issue template for the corresponding level of criticality: C1, C2, C3, or C4. It must provide a detailed description of the proposed change and include all the relevant information in the template. Every plan issue is initially labeled ~"change::unscheduled" until it can be reviewed and scheduled with a Due Date. After the plan is approved and scheduled, it should be labeled ~"change::scheduled" for visibility.
To open the change incident management issue from Slack, issue the following slash command:
Creating the Change issue from Slack will automatically fill in some fields in the description.
C2 changes that are labeled change::in-progress will block deploys, feature flag changes, and potentially other operations. Take particular care to have good time estimates for such operations, and ideally have points/controls where they can be safely stopped if they unexpectedly run unacceptably long.
In particular, for long-running Rails console tasks, it may be acceptable to initiate them as a C2 for approvals/awareness and then downgrade to a C3 while running. However, consider carefully the implications of long-running code over multiple deployments and the risks of mismatched code/data storage over time; such a label downgrade should ideally have at least two people (SREs/devs) assess the code being exercised for safety, and management approval is recommended for visibility.
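The blocking behaviour described above can be illustrated with a minimal sketch. It assumes that blocking is decided purely from the labels on the change issue; the function name `blocks_deploys` is hypothetical, and treating C1 the same way as C2 is an assumption rather than something the text states.

```python
from typing import Iterable

# Label names follow the text above. Including "C1" here is an assumption:
# the text only names C2 explicitly.
BLOCKING_CRITICALITIES = {"C1", "C2"}
IN_PROGRESS = "change::in-progress"


def blocks_deploys(labels: Iterable[str]) -> bool:
    """Return True when a change issue with these labels would block deploys."""
    label_set = set(labels)
    # Both conditions must hold: the change is running AND it is high criticality.
    return IN_PROGRESS in label_set and bool(label_set & BLOCKING_CRITICALITIES)


running_c2 = ["C2", "change::in-progress"]
downgraded = ["C3", "change::in-progress"]  # the C2 -> C3 downgrade while running
print(blocks_deploys(running_c2))
print(blocks_deploys(downgraded))
```

Under these assumptions the running C2 blocks deploys and the downgraded C3 does not, which is why the downgrade deserves a second review: it removes a safety control while the task is still executing.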
These are changes with high impact or high risk. If a change is going to cause downtime to the environment, it is always categorized as a C1.
Examples of Criticality 1:
#production Slack channel and obtain a written approval from the EOC in both the issue and in Slack using the
The EOC must be engaged for the entire execution of the change.
These are changes that are not expected to cause downtime, but which still carry some risk of impact if something unexpected happens. For example, reducing the size of a fleet of cattle is usually ok because we've identified over-provisioning, but we need to take care and monitor carefully before and after.
Examples of Criticality 2:
gitlab-rake should be considered as a Criticality 2 change.
#production Slack channel and obtain a written approval from the EOC in both the issue and in Slack.
These are changes with either no or very-low risk of negative impact, but where there is still some inherent complexity, or it is not fully automated and hands-off.
Examples of Criticality 3:
These are changes that are exceedingly low risk and commonly executed, or which are fully automated. Often these will be changes that are mainly being recorded for visibility rather than as a substantial control measure.
Examples of Criticality 4:
No approval required.
Change plans often involve manual tasks
gcloud command line utility instead of the GCP console.
UTC is the standard time zone used when discussing the scheduled time of any change.
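As an illustration, a locally planned time can be converted to UTC before it is written into a change issue. This is a minimal sketch using only the Python standard library; the helper name `to_utc` and the example UTC-5 offset are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_time: datetime) -> datetime:
    """Convert a time-zone-aware datetime to UTC."""
    if local_time.tzinfo is None:
        # A naive datetime carries no offset, so the conversion would be a guess.
        raise ValueError("naive datetime: attach a time zone before converting")
    return local_time.astimezone(timezone.utc)


# Hypothetical engineer working at a fixed UTC-5 offset.
local_tz = timezone(timedelta(hours=-5))
planned = datetime(2021, 11, 23, 20, 0, tzinfo=local_tz)
print(to_utc(planned).isoformat())  # 2021-11-24T01:00:00+00:00
```

Note that the UTC date can differ from the local date, as it does here, which matters when checking a change against a calendar of restricted dates.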
When scheduling your change, keep the impact of the change in mind and consider the following questions:
If the change is executed by a script, it should be run from the bastion host of the target environment in a terminal multiplexer (e.g. screen or tmux) session. Using a bastion host has the benefit of preventing any unintended actions (e.g. caused by a script bug) from spreading to other environments. A terminal multiplexer guards against the possibility of losing connection to the bastion mid-change and the unpredictable consequences of it.
sudo is disabled on the bastion hosts, so you can copy your Chef PEM file to one of them, if your script requires it, without fearing it being snooped on.
A sequence of actions to run a script could look like this:
    your-workstation $ ssh -A bastion-01-inf-gstg
    bastion-01-gstg $ tmux
    bastion-01-gstg $ git clone email@example.com:my-migration/script.git
    bastion-01-gstg $ ./script/migrate
Maintenance changes require change reviews. The reviews are intended to bring to bear the collective experience of the team while providing a forum for pointing out potential risks for any given change. Consider using multiple reviewers for ~C1 or ~C2 Change requests.
Information is a key asset during any change. Properly managing the flow of information to its intended destination is critical in keeping interested stakeholders apprised of developments in a timely fashion. The awareness that a change is happening is critical in helping stakeholders plan for said changes.
This flow is determined by:
For instance, a large end-user may choose to avoid doing a software release during a maintenance window to avoid any chance that issues may affect their release.
Furthermore, avoiding information overload is necessary to keep every stakeholder’s focus.
To improve communication, the following are recommended for high-criticality changes:
From time to time we will need to run a production change that requires downtime, affecting customers and our SLO. This section covers how to successfully manage communications in these types of situations.
As a reference, we should communicate 5-6 weeks before the change for a C1 that does not carry a significant architecture change. Longer preparation time is advised if the change involves a large migration or a significant architecture change.
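The 5-6 week guideline translates into a simple date calculation. The sketch below is only an illustration; the function name `announcement_window` and the example change date are hypothetical.

```python
from datetime import date, timedelta
from typing import Tuple


def announcement_window(change_day: date) -> Tuple[date, date]:
    """Return the (earliest, latest) dates for the first external announcement."""
    earliest = change_day - timedelta(weeks=6)
    latest = change_day - timedelta(weeks=5)
    return earliest, latest


# Hypothetical C1 scheduled for 12 March 2022.
earliest, latest = announcement_window(date(2022, 3, 12))
print(earliest.isoformat(), latest.isoformat())  # 2022-01-29 2022-02-05
```

For a large migration or significant architecture change, the text advises a longer lead time, so the window computed here should be read as a minimum.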
~Scheduled Maintenance label to the Change issue or create a new issue using the template external_communication if a confidential issue is necessary.
While the changes we make are rigorously tested and carefully deployed, it is good practice to temporarily halt production changes during certain events such as GitLab Summit, major global holidays, and other times when GitLab Team Member availability is substantially reduced.
The risks of making a production environment change during these periods include immediate customer impact and/or reduced engineering team availability in case an incident occurs. Therefore, we have introduced a mechanism called Production Change Lock (PCL). We see the future of PCL as an automated process which, given a time range, locks production deployments and releases the lock once the time expires. However, as a boring solution until then, we list the events here so that teams are aware of the PCL periods.
The following dates are currently scheduled PCLs. Times for the dates below begin at 09:00 UTC and end the next day at 09:00 UTC.
|24 - 28 November 2021||Soft||US Thanksgiving Holiday (Low team member availability)|
|24 - 26 December 2021||Soft||Christmas Holiday (Low team member availability)|
|31 December 2021 - 2 January 2022||New Years (Low team member availability)|
|Recurring: 22nd of every month||Soft||Release day|
|Recurring: Scheduled Family and Friends Days||Soft||Family and Friends Days|
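The 09:00 UTC rule for the dates in the table can be sketched as a window check. This is a minimal illustration, not the automated PCL mechanism. It assumes that "end the next day" means 09:00 UTC on the day after the last listed date, and the function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta, timezone
from typing import Tuple

PCL_BOUNDARY = time(9, 0)  # 09:00, per the text above


def pcl_window(first_day: date, last_day: date) -> Tuple[datetime, datetime]:
    """Return the UTC start and end of a PCL covering first_day..last_day."""
    start = datetime.combine(first_day, PCL_BOUNDARY, tzinfo=timezone.utc)
    # Assumption: the lock is released at 09:00 UTC the day after the last date.
    end = datetime.combine(last_day + timedelta(days=1), PCL_BOUNDARY, tzinfo=timezone.utc)
    return start, end


def in_pcl(moment: datetime, first_day: date, last_day: date) -> bool:
    """Return True if a time-zone-aware moment falls inside the PCL window."""
    start, end = pcl_window(first_day, last_day)
    return start <= moment < end


# US Thanksgiving row from the table: 24 - 28 November 2021.
check = datetime(2021, 11, 29, 8, 59, tzinfo=timezone.utc)
print(in_pcl(check, date(2021, 11, 24), date(2021, 11, 28)))  # True
```

The `moment` argument must be time-zone-aware; Python raises a `TypeError` when a naive datetime is compared with an aware one.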
There are 2 types of PCLs: soft and hard.
Soft PCLs aim to mitigate risk without halting all changes to production. Soft PCLs prohibit infrastructure changes with a criticality level of 2 or higher (that is, C1 and C2). In case of an emergency, the EOC should interact with the Incident Manager On Call for C1 and C2 changes.
During the soft PCL, code deployments to canary are allowed since we have tools to control canary impact. Production deployments are allowed for lower criticality items (C3/C4) in coordination with the EOC. These items include high priority code deployments (impactful bugs, security fixes).
Hard PCLs apply to code deploys and infrastructure changes at every criticality level (see change severities). In case of an emergency, the EOC should interact with the Incident Manager On Call prior to making any decision. Whether a change should be approved and executed is at the discretion of the EOC and the Incident Manager On Call. If the change is approved, the Incident Manager On Call should inform the VP of Infrastructure of this decision (who will inform the executive team as necessary).
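The difference between the two lock types for infrastructure changes can be summarized in a short sketch. It is an illustration of the rules above rather than an enforcement tool; the names `PCL` and `requires_exception` are hypothetical, and criticality is passed as the number in C1 to C4.

```python
from enum import Enum


class PCL(Enum):
    SOFT = "soft"
    HARD = "hard"


def requires_exception(pcl: PCL, criticality: int) -> bool:
    """Return True when an infrastructure change is locked under the given PCL."""
    if criticality not in (1, 2, 3, 4):
        raise ValueError("criticality must be 1, 2, 3 or 4")
    if pcl is PCL.HARD:
        # Hard PCLs cover every criticality level.
        return True
    # Soft PCLs lock "criticality 2 or higher", i.e. C1 and C2 (lower number = higher criticality).
    return criticality <= 2


print(requires_exception(PCL.SOFT, 2))  # True
print(requires_exception(PCL.SOFT, 3))  # False
print(requires_exception(PCL.HARD, 4))  # True
```

A `True` result means the change needs the emergency path through the EOC and the Incident Manager On Call. A `False` result under a soft PCL still implies coordination with the EOC, as described above.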
Feature flags reduce risk by allowing application changes to be easily tested in production. Since they can be gradually or selectively enabled, and quickly turned off, their use is encouraged whenever appropriate.
However, as the company and the number of developers working with feature flags continue to grow, it becomes important to manage the risk associated with these changes too. Developers follow the process defined in the developers documentation for feature flags.
On any given day, dozens of feature flag changes may occur. Many of these are trivial, allowing low risk changes – sometimes just changes to UI appearance – to be tested. However, some feature flag changes can have a major impact on the operation of GitLab.com, negatively affecting our service level agreements. This in turn can have a financial and reputational risk for the company. Without clear communication between the application developers toggling features and the engineer-on-call (EOC), it can be difficult for the EOC to assess which feature flag toggles are high risk and which are not.
Additionally, during an incident investigation, it is important to know which high-risk features have recently been enabled and to have documentation on how to assess their operation.
For this reason, feature flag toggles that meet any of the criteria below should be accompanied by a change management issue:
Does production above include canary?
Does this apply only to production environment?
Yes, only the production environment. This means you can still make changes and deployments to environments other than production.
What is the exact scope of the changes that are enforced under PCL? (infrastructure, software, handbook…etc)
Any production change to and/or supporting gitlab.com SaaS Product. For example, configuration changes, setup of new libraries, introducing new code, toggling feature flags.
What if I still want to make a change during the PCL period?
Product Group Development code changes will require Development VP approval. All other changes, including all underlying cloud and infrastructure changes, will require Infrastructure VP approval.
Does this apply to our monthly release which happens on the 22nd?
No. If the 22nd falls within a PCL period, additional coordination is necessary to ensure an uninterrupted monthly release.
We have a question that is not answered here?
Please raise an issue in the Infrastructure team's queue and we will be happy to get back to you as soon as we can.
Exceptions to this process must be tracked and approved by Infrastructure.