Chat channel: please use the #infrastructure chat channel for questions that don't seem appropriate for the issue tracker or the internal email address.
The infrastructure team is split between production engineers and performance specialists.
The two roles are closely related and overlap in places: for example, both care about the availability and performance of GitLab.com, from different perspectives.
Both roles also care about building infrastructure and monitoring that can be shipped to our customers.
Production engineers work on keeping the infrastructure behind our services fast and reliable. This infrastructure includes GitLab.com, dev.gitlab.org and GitHost.io.
Production engineers also have a strong focus on enabling development to ship features as fast and bug-free as possible. They provide the monitoring tools that prevent shipping regressions that would affect our customers, and they build automation tools that lower the barrier of access to production and allow us to scale.
Documentation: refer to the runbooks and internal documentation on this page.
Chat channels in Slack:
Alerts: monitoring tools post into this channel, and production engineers should watch it to act on alerts. Remember to let people know when you are dealing with an alert, or if you have triggered it.
Infrastructure: general conversation about infrastructure goes on in this channel. Remember to let people know when you are about to make a change to the infrastructure.
Releases: deployments and general release conversation go on here. Follow it to support deployments and help out when things go wrong.
Weekly automatic OS updates are performed on Monday at 10:10 UTC.
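To plan work around the update window, the schedule above can be turned into a concrete timestamp. The following is a minimal sketch, not part of our tooling: the function name `next_os_update` and the constants are hypothetical, and only the Monday 10:10 UTC schedule comes from this page.

```python
from datetime import datetime, timedelta, timezone

# Schedule taken from this page: Mondays at 10:10 UTC.
UPDATE_WEEKDAY = 0  # Monday, as returned by datetime.weekday()
UPDATE_HOUR = 10
UPDATE_MINUTE = 10


def next_os_update(now: datetime) -> datetime:
    """Return the first scheduled update strictly after `now`.

    `now` is expected to be timezone-aware; it is converted to UTC first.
    (Hypothetical helper, for illustration only.)
    """
    now = now.astimezone(timezone.utc)
    # Start from today's 10:10 UTC, then move forward to the next Monday.
    candidate = now.replace(
        hour=UPDATE_HOUR, minute=UPDATE_MINUTE, second=0, microsecond=0
    )
    candidate += timedelta(days=(UPDATE_WEEKDAY - now.weekday()) % 7)
    # If this week's window has already started or passed, use next week's.
    if candidate <= now:
        candidate += timedelta(weeks=1)
    return candidate


if __name__ == "__main__":
    print(next_os_update(datetime.now(timezone.utc)).isoformat())
```

Calling it with the current UTC time gives the start of the next update window, which can help when deciding whether a planned change would overlap with it.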
Performance specialists are developers who focus on improving GitLab.com performance. They work on issues from the GitLab-CE project.
For practical reasons we track the work that is in flight in the performance issue tracker by cross-linking, but we keep the discussion in the source issue.
This lets us run short one-week sprints and iterate faster.
Performance specialists can also focus on critical infrastructure tasks that will make GitLab.com faster, increase availability, or generally make it scale to handle more users with fewer resources.
These runbooks aim to provide simple solutions for common problems. They should be linked from our alerting system and kept up to date with new findings as we learn how to scale GitLab.com, so that they can also be adopted by our customers.
Runbooks are divided into 2 main sections:
What to do when: points to specific runbooks to run in stressful situations (on-call).
How do I: points to general administration texts that explain how to perform different administration tasks.
When writing a new runbook, be mindful of its goal:
If it is for on-call situations, make it crisp and brief. Try to keep the following structure: pre-check, resolution, post-check.
If it is for general management, it can be freely formatted.
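The on-call structure above lends itself to a simple automated check. The sketch below is illustrative only: it assumes a runbook is plain text in which each section starts on its own heading line named exactly "pre-check", "resolution" or "post-check" (optionally prefixed with `#`). That format and the function name `missing_sections` are assumptions, not an existing convention or tool. It targets Python 3.9 or later.

```python
# Section names come from the on-call structure described above.
REQUIRED_SECTIONS = ("pre-check", "resolution", "post-check")


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required sections that are absent or out of order.

    Assumes (hypothetically) that each section is introduced by a line
    consisting only of its name, optionally prefixed with '#' characters.
    """
    # Normalise every line: drop whitespace, leading '#', and case.
    lines = [
        line.strip().lstrip("#").strip().lower()
        for line in runbook_text.splitlines()
    ]
    missing = []
    position = 0  # only search after the previously found section
    for section in REQUIRED_SECTIONS:
        try:
            position = lines.index(section, position) + 1
        except ValueError:
            missing.append(section)
    return missing


if __name__ == "__main__":
    sample = "# Pre-check\ncheck disk usage\n# Resolution\nrestart the service\n"
    print(missing_sections(sample))
```

A check like this could run in code review for on-call runbooks; freely formatted "How do I" texts would simply be exempt from it.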
Some basic rules:
Use maintained cookbooks from https://supermarket.chef.io.
Create a wrapper cookbook whenever a feature is missing.
Make sure our custom cookbooks are publicly available from https://gitlab.com/gitlab-cookbooks.
Make sure there is a copy in our DEV environment https://dev.gitlab.org/cookbooks and set up a push mirror to keep it in sync.
Berkshelf should only point to our cookbooks in DEV so we are able to fix our cookbooks whenever GitLab.com becomes unavailable.
Cookbooks should be developed as a team. We use merge requests and code review to share knowledge and build the best product we can.
Cookbooks should be covered by tests in order to prevent them from becoming legacy.
There may be cases of cookbooks that could become a security concern, in which case it is OK to keep them in our private GitLab instance. This should be assessed on a case-by-case basis and documented properly.
Available in the Chef Repo. Some documentation is specific to GitLab.com: things that are specific to our infrastructure providers or that would create a security threat for our installation.
For now this documentation stays in the Chef Repo, but we aim to start moving things out of there into the runbooks until what remains is thin and GitLab.com only.
GitLab Cloud Images
A detailed process on creating and maintaining GitLab cloud images can be found here.
Production events logging
There are 2 kinds of production events that we track:
As stated in the production engineer job description, one of the goals is "Making GitLab easier to maintain to administrators all over the world". One of the ways we do this is by making GitLab.com settings the default for all our customers. It is very important that GitLab.com runs GitLab Enterprise Edition with all its default settings. We don't want users running GitLab at scale to run into any problems.
If it is not possible to use the default settings, the difference should be documented in GitLab.com settings before it is applied to GitLab.com.