Production Team

Production engineers keep the infrastructure that runs our services fast and reliable. This infrastructure includes staging, GitLab.com and dev.GitLab.org; see the list of nodes.

Production engineers also focus on building the right toolsets and automations so that development can ship features as quickly and with as few bugs as possible, leveraging the tools provided by GitLab.com itself: we must dogfood.

Another part of the job is building monitoring tools that allow quick troubleshooting as a first step, then turning these into alerts that notify based on symptoms, and finally fixing the problem or automating the remediation. We can only scale GitLab.com by being smart and using resources effectively, starting with our own time as the main scarce resource.

Tenets

  1. Security: reduce risk to its minimum, and make the minimum explicit.
  2. Transparency, clarity and directness: public and explicit by default. We work in the open and strive for signal over noise.
  3. Efficiency: smart resource usage. We should not fix scalability problems by throwing more resources at them, but by understanding where the waste is happening and then working to make it disappear. We should work hard to reduce toil to a minimum by automating all the boring work out of our way.

Prioritizing Issues

Given the variety of responsibilities and the number of "interfaces" between the Production team and all the other teams at GitLab, here is a guideline on how to prioritize the issues we work on. Based on the goals of the Infrastructure team, as well as our values and workflows as a company as a whole, the priority should be:

  1. keeping GitLab.com available - and secure
  2. unblocking others
  3. automating tasks to reduce toil and increase team availability (but be explicit about the costs and benefits)
  4. improving performance of GitLab.com while being conscious of cost
  5. reducing costs of running GitLab.com
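
As an illustration only, the ordering above can be sketched as a sort key. The category names and the `prioritize` helper below are hypothetical and are not part of any GitLab tooling; the sketch assumes someone has already assigned each issue to one of the five categories.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical names for the five priorities listed above (lower sorts first)."""
    AVAILABILITY_AND_SECURITY = 1
    UNBLOCKING_OTHERS = 2
    AUTOMATION = 3
    PERFORMANCE = 4
    COST_REDUCTION = 5


def prioritize(issues):
    """Return issue titles ordered by priority category.

    `issues` is a list of (title, Priority) pairs. sorted() is stable, so
    issues in the same category keep their original relative order.
    """
    return [title for title, _ in sorted(issues, key=lambda issue: issue[1])]


issues = [
    ("Trim unused storage", Priority.COST_REDUCTION),
    ("Fix failing backups", Priority.AVAILABILITY_AND_SECURITY),
    ("Grant deploy access", Priority.UNBLOCKING_OTHERS),
]
print(prioritize(issues))
```

The example issue titles are made up; in practice the category would more likely come from labels on the issue.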

Labeling Issues

We use issue labels to assist in organizing issues within the Infrastructure issue tracker. Prioritized labels are

Workout of the Week (WoW) Milestone

Issues in this tracker are organized into milestones to define the "workout of the week" (WoW) from one week to the next. The "week" runs from Wednesday to end of Tuesday. The other milestone in use is "Next WoW" to track items scheduled for the next week. Every week, the Production Lead renames the WoW to "WoW ending yyyy-mm-dd", and closes it; then renames "Next WoW" to "WoW". By doing this, the closed milestones provide a history of what the team has worked on, while the team only needs to be concerned with two open milestones. If issues are added to the "WoW" after the week has already started, add the ~unscheduled label (not needed if the issue is ~outage since those are by definition unscheduled).
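
The weekly rotation described above can be sketched in a few lines. This is a minimal in-memory model, not a call to the GitLab API: the `wow_end_date` and `rotate_milestones` helpers and the issue numbers are hypothetical, and opening a fresh, empty "Next WoW" after the rename is an assumption made so that two milestones stay open.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() numbers Monday as 0, so Tuesday is 1


def wow_end_date(day: date) -> date:
    """Return the Tuesday that ends the Wednesday-to-Tuesday week containing `day`."""
    return day + timedelta(days=(TUESDAY - day.weekday()) % 7)


def rotate_milestones(open_ms: dict, closed_ms: dict, week_end: date) -> None:
    """Apply the weekly rename-and-close steps to in-memory milestone maps."""
    # Rename the current WoW to its dated title and close it.
    closed_ms[f"WoW ending {week_end.isoformat()}"] = open_ms.pop("WoW")
    # Promote the queued milestone to be the current one.
    open_ms["WoW"] = open_ms.pop("Next WoW")
    # Assumption: a fresh, empty "Next WoW" is opened so two milestones stay open.
    open_ms["Next WoW"] = []


# Each milestone maps to the (made-up) issue numbers assigned to it.
open_ms = {"WoW": [101, 102], "Next WoW": [103]}
closed_ms = {}
rotate_milestones(open_ms, closed_ms, wow_end_date(date(2024, 1, 3)))
print(closed_ms)
print(open_ms)
```

Computing the end date with modular arithmetic on `weekday()` means the helper works for any day of the week, including the closing Tuesday itself.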

Issue or outage hand off

Ongoing outages, as well as issues that have the ~(perceived) data loss label and are therefore actively being worked on, need a hand off as team members cycle in and out of their timezones and availability. The on-call log can be used to assist with this (see the link to the on-call log at the top).

Production events logging

There are 2 kinds of production events that we track: