Infrastructure

Infrastructure goals and team(s)

High level goals

Per our current set of OKRs, the infrastructure team works on making "GitLab.com ready for mission-critical tasks". Specifically, this means the infrastructure team works on

  1. GitLab.com's availability and scalability.
    • Current goal: Availability at 99.9%.
    • Availability here is defined as the uptime of the site linked above (which is in fact the first issue in the gitlab-ce issue tracker; for more on Pingdom, see the monitoring page), measured per calendar month and as recorded by Pingdom. See the worked downtime-budget example after this list.
  2. GitLab.com's performance.
  3. Keeping GitLab instances easy to maintain for administrators all over the world.
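
To make the 99.9% goal concrete, the sketch below computes the downtime budget it implies per calendar month. The target value comes from the goal above; the rest is plain arithmetic.

```python
# Downtime budget implied by a monthly availability target.
# The 99.9% target is the goal stated above; everything else is arithmetic.

def downtime_budget_minutes(availability: float, days_in_month: int) -> float:
    """Return the allowed downtime, in minutes, for one calendar month."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability)

for days in (28, 30, 31):
    budget = downtime_budget_minutes(0.999, days)
    print(f"{days}-day month: ~{budget:.0f} minutes of allowed downtime")
# A 30-day month at 99.9% allows roughly 43 minutes of downtime.
```

This is why allowable downtime is treated as a scarce resource later on this page: a single 45-minute incident consumes more than a full month's budget.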

Teams in Infrastructure

To accomplish these goals, we've defined four teams within Infrastructure, each with its own area of focus:

Collaboration across the company for Site Reliability

Individuals from the Infrastructure team frequently collaborate very closely with different product teams (e.g. Platform, Discussion, CI, Packaging, etc) and Reliability Experts from product teams collaborate closely with the Infrastructure team. Together, they work on the topics listed above, using the principles and methods of Site Reliability Engineering a little bit more each day.

Embedded Production Engineers

Production engineers can be "embedded" with one or multiple different teams for anything from a few weeks to months.

If you are an "embedded" Production Engineer, then you are expected to:

Since at GitLab most "feature sets or services" are already in production, being embedded means that you work on making sure that the feature set or service meets the requirements for Production Readiness after the fact. This typically involves improving the runbooks and documentation, alerting, monitoring, coding for resiliency, etc. The embedded engagement ends when the feature set or service has passed the criteria in the Production Readiness Guide and is in production.

Reliability Experts

Developers focused on the reliability and production readiness of their feature set or service are called Reliability Experts and work closely with Production Engineers.

Production and Staging Access

Roles

Production access (specifically: SSH access to production) is limited to people in the roles of production, support, and release-manager. If you are in one of these roles and don't have access, please open an issue in the infrastructure tracker, make the request, and add the label "access". All roles are listed in the GitLab chef cookbook.

Staging access is treated at the same level as production access because it currently contains production data. This will be changed in the future to allow broader access.

In order to fully remove all production access in the future, we need to accomplish the following:

  1. Get a list of the reasons why people need access to production https://gitlab.com/gitlab-com/infrastructure/issues/2702
  2. Fix the logging infrastructure https://gitlab.com/gitlab-com/infrastructure/issues/2225#note_35367845
  3. Fix staging to resemble production and allow logins https://gitlab.com/gitlab-com/infrastructure/issues/2674
  4. Fix canary deployments https://gitlab.com/gitlab-com/infrastructure/issues/1504
  5. Look to automate or create tools for everything we discovered in step 1 (not fixed by steps 2-4)
  6. Iterate over steps 1 and 5 until there is no need for access to production.

No other team members should have access to production. If you need information from the production environment and are not in one of these roles, please request it via an issue in the infrastructure issue tracker. Please make sure to apply the appropriate labels. For more information, read about the labels that are most useful and the Production team's commitment to always try to help.

Release Managers do have access

Release managers require production SSH access to perform deploys; therefore, they will retain production access until production engineering can offer deployment automation that does not require Chef or SSH access. This is an ongoing effort.

CI/CD team

The CI/CD team has production access to the runners and runner managers. The up-to-date list of runner and runner manager roles is in the [GitLab chef cookbooks](https://gitlab.com/gitlab-cookbooks/gitlab_users#currently-defined-roles).


Feature Flags: Access and Use

Feature Flags provide an excellent way to introduce new features that might affect the stability of GitLab.com into the code base and into production, as described in the "direction" issue.

Having the power to toggle a feature flag does not constitute Production Access as defined above (SSH access). In general, the more Developers who can use feature flags, the better, since it makes it much easier for Developers to quickly assess the impact of a feature on performance, stability, etc. However, the act of toggling can have an impact on the health and operation of the production environment.

To balance the potential benefits with the possible risks, we've developed the following rules:

  1. Any Developer can be granted access to toggle feature flags; simply request this via an issue in the infrastructure issue tracker.
  2. The feature flag can only be toggled by calling the chatbot from the #production chat channel. This is done on purpose to make sure that the Production Engineers can be aware of the activity.
  3. The feature flag should be set to expose no more than 1% of traffic to the new feature, at least initially (see the illustrative sketch after this list).
  4. When you enable a feature flag, you need to be available (responsive on Slack) for a period of time that makes sense in the context of the feature; typically count on several hours post-toggle. That way if there are any unintended consequences, you (as the expert on the feature) can quickly help the team identify possible reasons and remedies for the unexpected behavior.
  5. It's absolutely possible to deviate from these rules for any reason: please open an issue to discuss the needs of your particular case before executing it. This gives the Production team a chance to consider what unforeseen consequences this may have.
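
GitLab's own feature flag implementation is not reproduced here, but as a purely illustrative sketch of how a percentage rollout like the 1% rule above can work, the snippet below hashes a stable identifier (a hypothetical user ID) into a bucket and enables the flag only for the lowest buckets, so a given user consistently sees, or does not see, the feature.

```python
import hashlib

def feature_enabled(flag_name: str, user_id: int, rollout_percent: float) -> bool:
    """Illustrative percentage rollout (not GitLab's implementation):
    deterministically bucket a user into one of 10,000 slots and enable the
    flag only for the slots covered by `rollout_percent`."""
    key = f"{flag_name}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000
    return bucket < rollout_percent * 100  # 1% -> buckets 0..99

# Hypothetical flag name; roughly 1% of these users get the feature.
enabled = sum(feature_enabled("new_diff_viewer", uid, 1.0) for uid in range(100_000))
print(f"{enabled} of 100000 users have the flag enabled (~1%)")
```

Hashing a stable identifier, rather than sampling randomly per request, keeps the experience consistent for each user while still limiting exposure to the configured percentage.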

How to get access to Feature Flags

Access to feature flags is granted on an ad-hoc basis. Please open an issue with infrastructure and include the label "access request". Please note that you need an existing account with Marvin; to get one, just talk to Marvin and it will create the account for you automatically.

Cloud Service Access

Amazon AWS

Unlike other cloud providers, AWS is used for a wide variety of services, and team members are granted levels of access that match their needs (for example, access to Billing vs. access to S3 buckets with backups). Various AWS accounts exist for different purposes: for example, one AWS account is used to manage S3 backups, and another is available to the Support Team to test instances installed on AWS services.

Azure

To be written.

Documentation

Runbooks

Runbooks are public. They are automatically mirrored to our private development environment so that, if GitLab.com is down, we can still access the runbooks there.

These runbooks aim to provide simple solutions for common problems. The solutions are linked to from our alerting system. The runbooks should be kept up to date with whatever we learn as we scale GitLab.com so that our customers can also adopt them.

Runbooks are divided into 2 main sections:

When writing a new runbook, be mindful of what its goal is:

Chef cookbooks

Some basic rules:

Generally, our chef cookbooks live in the open, and they are mirrored back to our internal cookbooks group for availability reasons.

There may be cases where a cookbook could become a security concern, in which case it is OK to keep it in our private GitLab instance. This should be assessed on a case-by-case basis and documented properly.

Internal documentation

Internal documentation is available in the Chef Repo. Some of it is specific to GitLab.com: things that are specific to our infrastructure providers or that would create a security threat for our installation.

Still, this documentation lives in the Chef Repo, and we aim to keep pulling things out of it into the runbooks until what remains is thin and GitLab.com-specific only.

Outages and Blameless Post Mortems

Every time there is a production incident we will create an issue in the infrastructure issue tracker with the outage label.

In this issue we will gather the following information:

These issues should also be tagged with any other label that makes sense, for example, if the issue is related to storage, label it as such.

The responsibility of creating this post mortem is initially on the person that handled the incident, unless it gets assigned explicitly to someone else.

Public by default policy

These blameless post mortems have to be public by default with just a few exceptions:

That's it, there are no other reasons.

If what's blocking us from revealing this information is shame because we made a mistake, that is not a good enough reason.

The post mortem is blameless because our mistakes are not a person's mistakes but the company's mistakes. If we made a bad decision because our monitoring failed, we have to fix our monitoring, not blame someone for making a decision based on insufficient data.

On top of this, blameless post-mortems help in the following ways:

Once the post mortem is created, we will tweet from the GitLabStatus account with a link to the issue and a brief explanation of what it is about.

Making Changes to GitLab.com

The production environment of GitLab.com should be treated with care since we strive to make GitLab.com so reliable that it can be mission-critical for our customers. Allowable downtime is a scarce resource.

Therefore, to be able to deploy useful and cool new things into production, we need to

Production Change Checklist

Any team or individual can initiate a change to GitLab.com by following this checklist: create an issue in the infrastructure issue tracker and select the change_checklist template.

Changes that need a checklist and a schedule

For example:

Changes that may need a checklist, but not explicit scheduling

For example:

Changes in staging

Testing things in staging typically only needs scheduling to avoid conflicting with others, but is otherwise straightforward since it is mostly self-service.

Deployments

Deployments of releases and release candidates are a special case. We don't want to block deployments if we can avoid it, and since they are currently performed by release managers, there is generally no need for someone from production engineering to be heavily involved. Still, follow the next steps to schedule a deploy:

Make GitLab.com settings the default

As stated in the production engineer job description, one of the goals is "Making GitLab easier to maintain for administrators all over the world". One of the ways we do this is by making GitLab.com settings the default for all our customers. It is very important that GitLab.com runs GitLab Enterprise Edition with all its default settings; we don't want users running GitLab at scale to run into any problems.

If it is not possible to use the default settings, the difference should be documented in GitLab.com settings before it is applied to GitLab.com.

Involving Azure