Availability here is defined as the uptime of the site linked above (which is, in fact, the first issue in the gitlab-ce issue tracker), measured per calendar month and as recorded by Pingdom; for more on Pingdom, see the monitoring page.
Keeping GitLab instances easy to maintain for administrators all over the world.
To do this, we've defined four teams within Infrastructure:
Production: keeping GitLab.com available and scalable. This includes (among other things) ownership of staging, improving deployment processes, and building an infrastructure that enables development to go fast.
Security: keeping GitLab.com safe, from the perspective of the application, the infrastructure and the organization.
Database: keeping GitLab.com's database available, fast, and scalable.
Gitaly: making Git access available, scalable, and fast.
Collaboration across the company for Site Reliability
Individuals from the Infrastructure team frequently collaborate very closely with different product teams (e.g. Platform, Discussion, CI, or Packaging), and Reliability Experts from product teams collaborate closely with the Infrastructure team. Together, they work on the topics listed above, using the principles and methods of Site Reliability Engineering a little bit more each day.
Embedded Production Engineers
Additionally, specific production engineers can be "embedded" with one or more teams for anything from a few weeks to several months.
If you are an "embedded" Production Engineer, you are expected to:
Participate in team calls for the team that you are embedded with.
Keep up to date with the general Production Engineering team and duties:
Take care of issues, fires, and on-call. In fact, those will still be your priority unless the Lead explicitly says otherwise.
Continue to report to the Production Lead and participate in the Production team meetings.
Share the embedded team's context with the rest of the Production Team.
Help the team that you are embedded with to make their feature set or service "production ready".
Since at GitLab most "feature sets or services" are already in production, this means working to ensure, after the fact, that the feature set or service meets the requirements for Production Readiness [TODO: add link to production readiness review questionnaire]. This will typically involve improving the runbooks and documentation, alerting, monitoring, coding for resiliency, etc. By the time you are done, any other member of the Production Team should be able to tend to the feature set or service in production as well as you can, and the "embedding" ends. At that point you should be listed as an expert in the respective service.
Production and Staging Access
Production access (specifically: ssh access to production) is granted to production engineers, security engineers, and (production) on-call heroes.
Staging access is treated at the same level as production access because it contains production data.
Any other engineer, lead, or manager at any level will not have access to production; if information is needed from production, it must be obtained by a production engineer through an issue in the infrastructure issue tracker.
There is one temporary exception: release managers require production ssh access to perform deploys. They will retain production access until production engineering can offer deployment automation that requires neither Chef nor ssh access. This is an ongoing effort.
These runbooks aim to provide simple solutions for common problems. The solutions are linked to from our alerting system. The runbooks should be kept up to date with whatever we learn as we scale GitLab.com so that our customers can also adopt them.
Runbooks are divided into 2 main sections:
What to do when: points to specific runbooks to run in stressful situations (on-call)
How do I: points to general administration texts that explain how to perform different administration tasks.
When writing a new runbook, be mindful what the goal of it is:
If it is for on-call situations, make it crisp and brief. Try to keep the following structure: pre-check, resolution, post-check.
If it is for general management, it can be freely formatted.
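As a sketch, an on-call runbook following the pre-check/resolution/post-check structure could look like the shell snippet below. All paths, commands, and the alert scenario are illustrative assumptions, not content from our actual runbooks:

```shell
#!/bin/sh
# Hypothetical on-call runbook sketch for a "disk almost full" alert.
# Paths and commands are illustrative assumptions only.

# Pre-check: confirm the symptom before acting.
df -h /                          # is the filesystem actually near capacity?

# Resolution: free space (commented out; point it at the real offending path).
# rm -rf /var/log/old-archives/*

# Post-check: verify usage dropped so the alert can clear.
df -h /
```

Keeping each step a copy-pasteable command lets any on-call engineer run the check quickly under stress.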
Some basic rules:
Use maintained cookbooks from https://supermarket.chef.io.
Create a wrapper cookbook whenever a feature is missing.
Make sure our custom cookbooks are publicly available from https://gitlab.com/gitlab-cookbooks.
Make sure there is a copy in our DEV environment at https://dev.gitlab.org/cookbooks, and set up a push mirror to keep it in sync.
Berkshelf should only point to our cookbooks in DEV so we are able to fix our cookbooks whenever GitLab.com becomes unavailable.
Cookbooks should be developed as a team. We use merge requests and code review to share knowledge and build the best product we can.
Cookbooks should be covered by ChefSpec and Test Kitchen tests to ensure they do what they are supposed to and don't have conflicts.
There may be cases where a cookbook could become a security concern, in which case it is OK to keep it in our private GitLab instance. This should be assessed case by case and documented properly.
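The DEV-copy-and-push-mirror rule can be sketched with plain git. The commands below use local stand-in paths instead of the real gitlab.com/gitlab-cookbooks and dev.gitlab.org/cookbooks URLs so the example is self-contained:

```shell
# Sketch: keep a DEV copy of a cookbook in sync via a push mirror.
# Local bare repos stand in for the canonical repo on gitlab.com and
# the mirror on dev.gitlab.org; names here are illustrative.
git init -q my-cookbook
cd my-cookbook
git -c user.name=ops -c user.email=ops@example.com \
    commit -q --allow-empty -m "initial cookbook skeleton"
git init -q --bare ../dev-mirror.git    # stand-in for the DEV copy
git remote add dev ../dev-mirror.git
git push -q --mirror dev                # mirrors all refs to the DEV copy
cd ..
```

In practice GitLab's built-in repository mirroring can run the equivalent of that `git push --mirror` automatically on every change.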
Some documentation is specific to GitLab.com: things tied to our infrastructure providers, or that would create a security threat for our installation if published. This documentation lives in the Chef Repo. We aim to keep pulling content out of there into the runbooks, until the Chef Repo documentation is thin and GitLab.com-only.
In this issue we will gather the following information:
The timeline of events: what happened first, what later, what reasoning triggered what action.
Sample graphs or logs captured from our monitoring explaining how they drove our reasoning.
The 5 whys that led to the root cause that triggered the incident.
The things that worked well.
The things that can be improved.
Further actions, with links to the issues that cover them.
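For the "5 whys", a worked example of the chain of questions (purely hypothetical, not a real incident) might look like:

```
Incident: web requests timed out for 10 minutes.
1. Why? The database was saturated with slow queries.
2. Why? A new migration introduced an unindexed query on a large table.
3. Why? The query plan was never checked against production-sized data.
4. Why? Staging data is much smaller than production data.
5. Why? We have no process for testing migrations against realistic data volumes.
Root cause: missing process for validating migrations at production scale.
```

Note how the final "why" points at a process gap rather than a person, which is what makes the follow-up actions blameless and actionable.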
These issues should also be tagged with any other label that makes sense; for example, if the issue is related to storage, label it as such.
The responsibility for creating this post mortem initially lies with the person who handled the incident, unless it is explicitly assigned to someone else.
Public by default policy
These blameless post mortems have to be public by default with just a few exceptions:
The post mortem would affect a customer's or employee's privacy: revealing the real user name, email, private project names, or any other data that can identify the person.
The post mortem would reveal billing information.
The post mortem would reveal GitLab's confidential information.
That's it, there are no other reasons.
If what's blocking us from revealing this information is shame because we made a mistake, that is not a good enough reason.
The post mortem is blameless because our mistakes are not a person's mistakes but the company's: if we made a bad decision because our monitoring failed, we have to fix our monitoring, not blame someone for making a decision based on insufficient data.
On top of this, blameless post-mortems help in the following aspects:
We can help people understand the complexity of running a service in production, and how things can go wrong.
We help ourselves learn by reflecting on and analyzing why the issue happened.
We force ourselves to think about what we need to do to avoid making the same mistake again, or to improve our infrastructure so that we don't have to deal with the same incident.
We open our reasoning and information to the public so they can chime in and help us out.
We leave a great trail of information for onboarding new engineers: they can see how production fails.
We can use these post-mortems for recruiting and marketing.
Once this Post Mortem is created, we will tweet from the GitLabStatus account with a link to the issue and a brief explanation of what it is about.
Making Changes to GitLab.com
The production environment of GitLab.com should be treated with care since we strive to make GitLab.com so reliable that it can be mission-critical for our customers. Allowable downtime is a scarce resource.
Therefore, to be able to deploy useful and cool new things into production, we need to
use checklists to prepare and carry out the changes, and
schedule the (non-automated, non-self-serve) changes we make to our production environment (to be sure the necessary people are there, and to prevent having multiple changes happening at the same time).
Changes that need a checklist and a schedule
When you introduce something new, such as installing PgBouncer or enabling Elasticsearch.
Anything that may result in downtime. Not sure? Then assume that it will result in downtime.
Anything that constitutes a change to GitLab.com and requires the assistance of a Production Engineer.
Changes that may need a checklist, but not explicit scheduling
A self-service process, such as adding alerting (which only requires a merge request to the relevant repo), or fixing a chef-client.
"Quick" fixes that do not require a Production Engineer.
Changes in staging
Testing things in staging typically only needs scheduling to avoid conflicting with others, but is otherwise straightforward since it is mostly self-service.
Consider using the same checklist format; it is good form to know how to roll back, for example. Using the checklist will also set the baseline for when you want to introduce the change into production, saving time. But you are not required to follow the full checklist to make changes on staging.
Do schedule changes to staging, on the "GitLab Production" calendar (link at top of page). This is to prevent conflicts when different people or teams are trying to work on staging for different purposes.
Given that quite a few people can use self-service in staging, there should be no need for resourcing from the Production team, and you do not need an explicit OK from the Production Lead.
Deployments of releases (and release candidates) are a special case. We don't want to block deployments if we can avoid it, and since they are currently performed by release managers, there is generally no need for someone from production engineering to be heavily involved. Still, follow these steps to schedule a deploy:
Schedule the deploy both in staging and production on the "GitLab Production" calendar (link at top of page) as soon as possible. Do this to avoid conflicting with any other production operation that will happen (don't oversubscribe).
Schedule a 4-hour block; deploys sometimes start late for odd reasons, and you don't want to run out of time.
Notify the production engineering team by inviting ops-contact. This will raise awareness.
Announce on Twitter the time range in which the deploy will happen.
If you need help from production engineering for whatever reason, be explicit in the invite, or get in touch in the production chat channel to ask.
Production Change Checklist
Any team or individual can initiate a change to GitLab.com by following this checklist. Create an issue in the infrastructure issue tracker and select the change_checklist template.
Make GitLab.com settings the default
As stated in the production engineer job description, one of the goals is "Making GitLab easier to maintain for administrators all over the world". One of the ways we do this is by making GitLab.com's settings the default for all our customers. It is very important that GitLab.com runs GitLab Enterprise Edition with all its default settings. We don't want users running GitLab at scale to run into any problems.
If it is not possible to use the default settings, the difference should be documented in GitLab.com settings before applying it to GitLab.com.
GitLab has access to "Azure Rapid Response". The level of support is described on their website, but TL;DR it means that the SLAs are:
Initial response for Sev A < 15 minutes, Sev B < 2 hours, Sev C < 4 hours.
The Azure team prefers if we ask questions via the support portal, but there is an Azure representative whom we can ping in our issue tracker. This is mostly helpful in cases of non-urgent questions, and questions of an advisory nature.
Light hand-holding for things such as architecture review is available. To reach out for this assistance, submit an "advisory case" ticket through the portal.
Azure subscription and service limits, quotas, and constraints: https://docs.microsoft.com/en-us/azure/azure-subscription-service-limits