To accomplish these goals, we've defined four teams within Infrastructure to tackle:
Production: keeping GitLab.com available and scalable. This includes (among other things) ownership of staging, improving deployment processes, and building an infrastructure that enables development to go fast.
Database: keeping GitLab.com's database available, fast, and scalable.
Gitaly: making Git access available, scalable, and fast.
Collaboration Across the Company for Site Reliability
Individuals from the Infrastructure team frequently collaborate very closely with different product teams (e.g. Platform, Discussion, CI, Packaging, etc.), and Reliability Experts from product teams collaborate closely with the Infrastructure team. Together, they work on the topics listed above, using the principles and methods of Site Reliability Engineering a little bit more each day.
Product / Infrastructure Feedback Loop
Since GitLab both makes a developer platform for users to create software and runs it at scale on GitLab.com, there is a natural feedback loop between the features and scalability of the application and the operations of GitLab.com. The following scenarios govern how we balance the creation and improvement of our product with the need to keep GitLab.com suitable for mission-critical customer workloads.
When the Feature Set is Ahead of GitLab.com Infrastructure
The Infrastructure team will prioritize implementing the applicable product features e.g. deployments
When a new Feature Set Duplicates Custom Tooling Already Used to Run GitLab.com
The Infrastructure team will adopt the new feature set if it results in less than a ~30% drop in nice-to-have functionality, because paying the switching cost and losing some functionality ultimately improves the product for customers and eventually ourselves
When GitLab.com Infrastructure is Ahead of the GitLab Product Feature Set
Infrastructure team should document the requirements with Product Management in an issue or epic
Both the product work, and custom work should be sized to understand the varying effort involved
Infrastructure should make clear to product management the time-frame in which the feature is needed
If product management feels the product work can happen in a realistic time-frame, infrastructure will wait for the feature and participate in its creation
When PM documents that they cannot meet the time-frame or that the work is out of scope, our infrastructure team will create (with permission) custom tools or processes in the interest of keeping GitLab.com suitable for mission-critical customer workloads
If something is outside of our scope or time-frame, we should minimize the time we spend on it. We will favor managed SaaS services over self-hosted, self-managed solutions.
The infrastructure team will favor using open source and open core software both for SaaS and self-hosted so that when something comes in scope or time-frame we're familiar with an open source option we can ship as part of GitLab.
Our infrastructure team will implement functionality in such a way that it can be replaced later by product features without excessive effort or risk.
Our product management & infrastructure teams will continue to work together to capture functionality in the form of requirements for later prioritization
Embedded Production Engineers
Production engineers can be "embedded" with one or multiple different teams for anything from a few weeks to months.
If you are an "embedded" Production Engineer, then you are expected to:
Participate in team calls for the team that you are embedded with.
Keep up to date with the general Production Engineering team and duties:
Take care of issues, fires, and on-call. In fact, those will still be your priority unless something else has been explicitly agreed between the Production Engineering Manager and the team where you are embedded.
Continue to report to the Production Engineering Manager and participate in the Production team meetings.
Share the context of the team you are embedded with, with the rest of the Production Team.
Help the team that you are embedded with to make their feature set or service "production ready". Use and improve upon the Production Readiness Guide.
Since at GitLab most "feature sets or services" are mostly already in production, being embedded means that you work on making sure that the feature set or service meets the requirements for Production Readiness post factum. This will typically involve improving the runbooks and documentation, alerting, monitoring, coding for resiliency, etc. An embedment ends when the feature set or service has passed the criteria in the Production Readiness Guide and is already in production.
Developers focused on the reliability and production readiness of their feature set or service are named Reliability Expert and work closely with Production Engineers.
Production and Staging Access
Roles
Production access (specifically: ssh access to production) is limited to people in the roles of production, support, and release-manager. If you are in one of these roles and don’t have access, please open an issue in the infrastructure tracker, make the request and add the label “access”. All roles are listed in the GitLab chef cookbook.
Staging access is treated at the same level as production access because it currently contains production data. This will be changed in the future to allow broader access.
In order to fully remove all production access in the future, we need to accomplish the following:
Get a list of the reasons why people need access to production https://gitlab.com/gitlab-com/infrastructure/issues/2702.
Fix the logging infrastructure https://gitlab.com/gitlab-com/infrastructure/issues/2225#note_35367845.
Fix staging to resemble production, and allow logins https://gitlab.com/gitlab-com/infrastructure/issues/2674.
Look to automate or create tools for everything we discovered in step 1 (not fixed by steps 2-3).
Iterate over steps 1-4 until there is no need for access to production.
No other team member should have access to production. If you do not have access and you need information from the production environment, please request it via an issue in the infrastructure issue tracker. Please make sure to apply the appropriate labels. For more information, read about the labels that are most useful and the Production team's commitment to always try to help.
Release Managers do Have Access
Release managers require production ssh access to perform deploys. Therefore, release managers will have production access until production engineering can offer deployment automation that requires neither chef nor ssh access. This is an ongoing effort.
Developer Manager Access
In order to help facilitate quick incident resolution, engineering managers will have production access.
The CI/CD team has production access to the runners and runner managers. The up-to-date list of runner and runner manager roles is in the GitLab chef cookbooks.
Feature Flags: Access and Use
Feature Flags provide an excellent way to introduce new features that might affect the stability of GitLab.com into the code base and into production, as described in the "direction" issue.
Having the power to toggle a feature flag does not constitute Production Access as defined above (ssh access). In general, the more Developers have access to use feature flags the better since it should make it much easier for Developers to quickly assess the impact of the feature on performance, stability, etc. However, the act of toggling can have an impact on the health and operation of the production environment.
To balance the potential benefits with the possible risks, we've developed the following rules:
Any Developer can be granted access to toggle feature flags; simply request this via an issue in the infrastructure issue tracker.
The feature flag can only be toggled by calling the chatbot from the #production chat channel. This is done on purpose to make sure that the Production Engineers can be aware of the activity.
The feature flag should be set to expose no more than 1% of the traffic to the new feature - at least initially.
When you enable a feature flag, you need to be available (responsive on Slack) for a period of time that makes sense in the context of the feature; typically count on several hours post-toggle. That way if there are any unintended consequences, you (as the expert on the feature) can quickly help the team identify possible reasons and remedies for the unexpected behavior.
It's absolutely possible to deviate from these rules for any reason: please open an issue to discuss the needs of your particular case before executing it. This gives the Production team a chance to consider what unforeseen consequences this may have.
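The 1% starting point in the rules above can be illustrated with a small percentage-rollout sketch. This is not how the chatbot or GitLab's feature flag code is implemented; the function names, the flag name, and the choice to bucket by a hashed actor ID are assumptions made purely for illustration. Hashing keeps each actor's decision stable across requests, so the same small group sees the new feature while you watch its impact.

```python
import hashlib
from typing import Sequence

# Starting exposure taken from the rule above ("no more than 1%", at least initially).
INITIAL_ROLLOUT_PERCENT = 1.0


def bucket(flag_name: str, actor_id: str) -> float:
    """Map a (flag, actor) pair to a stable value in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{actor_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and fold it into
    # 10,000 slots, so percentages can be expressed with two decimal places.
    slot = int.from_bytes(digest[:8], "big") % 10_000
    return slot / 100.0


def is_enabled(flag_name: str, actor_id: str,
               percent: float = INITIAL_ROLLOUT_PERCENT) -> bool:
    """Return True if this actor falls inside the rollout percentage."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    return bucket(flag_name, actor_id) < percent


def exposed_share(flag_name: str, actor_ids: Sequence[str], percent: float) -> float:
    """Percentage of the given actors that would see the feature."""
    exposed = sum(is_enabled(flag_name, actor, percent) for actor in actor_ids)
    return 100.0 * exposed / len(actor_ids)
```

Calling `is_enabled("new_feature", "user-42")` repeatedly returns the same answer for the same actor, and `exposed_share` over a large sample of actor IDs should land close to the requested percentage. Both `new_feature` and `user-42` are hypothetical names.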
How to Get Access to Feature Flags
Access to feature flags is granted on an ad-hoc basis. Please open an issue with infrastructure and include the label “access request”. Please note that you need an existing account with Marvin. To get an account, you just need to talk to Marvin; it will create the account automatically for you.
Access to Systems with Pseudonymized Data
Some systems use pseudonymized versions of production data, for example by working with a subset of repos and a database that has been pseudonymized using the "howto psuedonymize" runbook. This allows developers to test features and functionality on staging and on test installations with relatively realistic database sizes. However, the process of pseudonymization of the data does not remove or replace all of the user data and should therefore still be treated with the same level of security and protection that we have in place for access to regular production data.
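To make the last point concrete, here is a minimal sketch of keyed-hash pseudonymization. It is not the procedure from the runbook mentioned above; the field names, the key handling, and the use of HMAC-SHA256 are assumptions for illustration only. The sketch replaces a fixed set of identifying fields and leaves everything else untouched, which is exactly why the result can still contain user data.

```python
import hashlib
import hmac

# Hypothetical list of columns treated as directly identifying.
SENSITIVE_FIELDS = ("username", "email")


def pseudonymize_value(value: str, secret_key: bytes) -> str:
    """Replace a value with a stable keyed hash (same input, same pseudonym)."""
    mac = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:16]


def pseudonymize_record(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record with only the sensitive fields replaced."""
    result = dict(record)
    for field in SENSITIVE_FIELDS:
        if result.get(field) is not None:
            result[field] = pseudonymize_value(str(result[field]), secret_key)
    return result
```

In this sketch a free-text field such as a profile description passes through unchanged, so any name or email a user typed there survives pseudonymization. Stable pseudonyms also keep relationships between rows intact, which is useful for testing but means the data set is not anonymous.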
Cloud Service Access
Unlike other cloud providers, AWS is used for a wide variety of services and access is available to various team members for various levels of access that match their need (for example, access to Billing vs. access to S3 buckets with backups). Various AWS accounts exist for different purposes. As an example, there is an AWS account that is used to manage S3 backups, and another that is available to the Support Team to test instances installed on AWS services.
For an overview of the various accounts and their basic purpose, find the Google Sheet titled "GitLab AWS accounts" (only accessible by GitLab team members).
To gain access to an AWS account, go to the "GitLab AWS accounts" sheet and leave a comment on the appropriate tab (corresponds to AWS account) specifying your request (e.g. "please give me X role in Y account") and mention one of the Administrators for the relevant account (listed on that same sheet) to get their attention.
Note that this is identical to the process we use to gain access to 1Password shared vaults, and also to how we gain access to email aliases in the GitLab "Email, Slack, and GitLab Groups and Aliases" Google Doc. It is not an ideal process, and we expect to replace it with a simpler process eventually.
For "deep" levels of access (e.g. access to production data of any kind), you may still need to open an issue on the infrastructure issue tracker to have your manager authorize the request.
Do not share access credentials for AWS accounts. AWS has an excellent audit log, but it is only of value if "users" on the system are not shared by multiple actual people. Bear in mind that for each AWS account that you have access to, you may have a different username, which can be confusing.
For the AWS account that is in use for GitLab.com related services, the configuration of security groups within AWS and a full breakdown of policies and services are detailed in the Google Doc entitled "AWS - GitLabdotCom - Security Groups and Policies" (only accessible by GitLab team members).
Some actions require access to the AWS account by the root user (for example changing account names, accepting another group into the account, changing billing settings, etc.). Each AWS account can only have a single root user. To reduce the bus factor risk, multiple people within GitLab have access to the root user credentials (this is the one exception to the earlier rule of not sharing AWS credentials). If you require AWS root access for any reason, contact one of the people listed on the "AWS root access" secure note in the shared vault on 1Password.
These runbooks aim to provide simple solutions for common problems. The solutions are linked to from our alerting system. The runbooks should be kept up to date with whatever we learn as we scale GitLab.com so that our customers can also adopt them.
Runbooks are divided into 2 main sections:
What to do when: points to specific runbooks to run in stressful situations (on-call)
How do I: points to general administration texts that explain how to perform different administration tasks.
When writing a new runbook, be mindful what the goal of it is:
If it is for on-call situations, make it crisp and brief. Try to keep the following structure: pre-check, resolution, post-check.
If it is for general management, it can be freely formatted.
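The pre-check, resolution, post-check structure suggested for on-call runbooks can be pictured as a simple control flow. Runbooks themselves are written documents, so the sketch below only illustrates the order of the three phases and what each one decides; the class, field, and return value names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Runbook:
    name: str
    pre_check: Callable[[], bool]    # True if the problem is actually present
    resolution: Callable[[], None]   # the steps that fix the problem
    post_check: Callable[[], bool]   # True if the problem is gone afterwards


def run(runbook: Runbook) -> str:
    """Walk through the three phases in order and report the outcome."""
    if not runbook.pre_check():
        # The alert does not match reality: do not apply the fix blindly.
        return "skipped"
    runbook.resolution()
    # Never assume the fix worked: verify, and escalate if it did not.
    return "resolved" if runbook.post_check() else "escalate"
```

The pre-check stops an on-call engineer from applying a fix to the wrong problem, and the post-check turns "I ran the steps" into "the problem is gone".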
Some basic rules:
Use maintained cookbooks from https://supermarket.chef.io.
Create a wrapper cookbook whenever a feature is missing.
Make sure our custom cookbooks are publicly available from https://gitlab.com/gitlab-cookbooks.
Make sure there is a copy in our DEV environment https://dev.gitlab.org/cookbooks and set up a push mirror to keep it in sync.
Berkshelf should only point to our cookbooks in DEV so we are able to fix our cookbooks whenever GitLab.com becomes unavailable.
Cookbooks should be developed collaboratively as a team. We use merge requests and code review to share knowledge and build the best product we can.
Cookbooks should be covered with ChefSpec and TestKitchen testing in order to ensure they do what they are supposed to and don't have conflicts.
There may be cases of cookbooks that could become a security concern, in which case it is ok to keep them in our GitLab private instance. This should be assessed on a case-by-case basis and documented properly.
Some documentation is specific to GitLab.com and is available in the Chef Repo. This covers things that are specific to our infrastructure providers or that would create a security threat for our installation.
Still, this documentation is in the Chef Repo, and we aim to start pulling things out of there into the runbooks, until this documentation is thin and GitLab.com only.
In this issue we will gather the following information:
The timeline of events: what happened first, what later, what reasoning triggered what action.
Sample graphs or logs captured from our monitoring explaining how they drove our reasoning.
The 5 whys that lead to the root cause that triggered the incident.
The things that worked well
The things that can be improved
Further actions with links to the issues that cover them
These issues should also be tagged with any other label that makes sense, for example, if the issue is related to storage, label it as such.
The responsibility of creating this post mortem is initially on the person that handled the incident, unless it gets assigned explicitly to someone else.
Public by default policy
These blameless post mortems have to be public by default with just a few exceptions:
The post mortem would affect a customer's or employee's privacy: revealing the real user name, email, private project names, any data that can identify the person, etc.
The post mortem would reveal billing information.
The post mortem would reveal GitLab's confidential information.
That's it, there are no other reasons.
If what's blocking us from revealing this information is shame because we made a mistake, that is not a good enough reason.
The post mortem is blameless because our mistakes are not a person's mistake but a company mistake. If we made a bad decision because our monitoring failed, we have to fix our monitoring, not blame someone for making a decision based on insufficient data.
On top of this, blameless post-mortems help in the following ways:
We can help people understand the complexity of running a service in production, and how things can go wrong.
We help ourselves to learn by reflecting on and analyzing why this issue happened.
We force ourselves to think about what we need to do to not make the same mistake again, or to improve our infrastructure in a way that we don't have to deal with the same incident.
We open our reasoning and information to the public so they can chime in and help us out.
We leave a great trace of information for onboarding new engineers. They can see how production fails.
We can use these post-mortems for recruiting and marketing.
Once this Post Mortem is created, we will tweet from the GitLabStatus account with a link to the issue and a brief explanation of what it is about.
Making Changes to GitLab.com
The production environment of GitLab.com should be treated with care since we strive to make GitLab.com so reliable that it can be mission-critical for our customers. Allowable downtime is a scarce resource.
Therefore, to be able to deploy useful and cool new things into production, we need to
use checklists to prepare and carry out the changes, and
schedule the (non-automated, non-self-serve) changes we make to our production environment (to be sure the necessary people are there, and to prevent having multiple changes happening at the same time).
Changes that Need a Checklist and Explicit Scheduling
When you introduce something new such as installing pgbouncer, enabling ElasticSearch, etc.
Anything that may result in downtime. Not sure? Then assume that it will result in downtime.
Anything that constitutes a change to GitLab.com and requires the assistance of a Production Engineer.
Changes that May Need a Checklist, but Not Explicit Scheduling
A self-service process, such as adding alerting (which only requires a merge request to the relevant repo), or fixing a chef-client.
"Quick" fixes that do not require a Production Engineer.
Changes in Staging
Testing things in staging typically only needs scheduling to avoid conflicting with others, but is otherwise straightforward since it is mostly self-service.
Consider using the same format of the checklist - it is good form to know how to roll back for example. Using the checklist will also set the baseline for when you want to introduce the change into production; thus saving time. But it is not required to follow the full checklist in order to make changes on staging.
Do schedule changes to staging, on the "GitLab Production" calendar (link at top of page). This is to prevent conflicts when different people or teams are trying to work on staging for different purposes.
Given that there are quite a few people who can use self-service in staging, there should be no need for resourcing there from the Production team, and you do not need an explicit OK from the Production Engineering Manager.
Deployments of release(s) (candidates) are a special case. We don't want to block deployments if we can avoid it. And since they are currently performed by release managers there is generally no need for someone from production engineering to be heavily involved. Still, follow the next steps to schedule a deploy:
Schedule the deploy both in staging and production on the "GitLab Production" calendar (link at top of page) as soon as possible. Do this to avoid conflicting with any other production operation that will happen (don't oversubscribe).
Use a 4-hour block as well. Deploys are sometimes delayed for odd reasons, and you don't want to run out of time.
Notify the production engineering team by inviting ops-contact. This will raise awareness.
If you need help from production engineering for whatever reason, be explicit in the invite, or get in touch in the production chat channel to ask.
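The scheduling steps above boil down to reserving a 4-hour block and making sure it does not collide with another operation. A minimal sketch of that check follows. It does not talk to the "GitLab Production" calendar; the entry type and function names are hypothetical, and treating entries as half-open intervals (one block may start exactly when another ends) is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Block length suggested in the steps above.
DEPLOY_BLOCK = timedelta(hours=4)


@dataclass(frozen=True)
class CalendarEntry:
    title: str
    start: datetime
    end: datetime


def deploy_entry(title: str, start: datetime) -> CalendarEntry:
    """Build a calendar entry that reserves the full deploy block."""
    return CalendarEntry(title, start, start + DEPLOY_BLOCK)


def overlaps(a: CalendarEntry, b: CalendarEntry) -> bool:
    """Two half-open intervals overlap if each starts before the other ends."""
    return a.start < b.end and b.start < a.end


def conflicts(new_entry: CalendarEntry, calendar: List[CalendarEntry]) -> List[CalendarEntry]:
    """Return every existing entry that the new entry would collide with."""
    return [entry for entry in calendar if overlaps(new_entry, entry)]
```

An empty result from `conflicts` means the block can be booked. A non-empty result lists the operations to coordinate with before scheduling.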
Make GitLab.com Settings the Default
One of the goals of production engineers is "Making GitLab easier to maintain for administrators all over the world". One of the ways we do this is by making GitLab.com settings the default for all our customers. It is very important that GitLab.com is running GitLab Enterprise Edition with all its default settings. We don't want users running GitLab at scale to run into any problems.
If it is not possible to use the default settings, the difference should be documented in GitLab.com settings before applying them to GitLab.com.
GitLab has access to "Azure Rapid Response". The level of support is described on their website, but TL;DR it means that the SLAs are:
Initial response for Sev A < 15 mins, Sev B < 2 Hours, Sev C < 4 hours.
The Azure team prefers if we ask questions via the support portal, but there is an Azure representative whom we can ping in our issue tracker. This is mostly helpful in cases of non-urgent questions, and questions of an advisory nature.
Light hand-holding for things such as architecture review is available. To reach out for this assistance, submit an "advisory case" ticket through the portal.
Azure subscription and service limits, quotas, and constraints: https://docs.microsoft.com/en-us/azure/azure-subscription-service-limits