Production Team

The Production SRE team is responsible for all user-facing services. Production SREs ensure that these services are secure, reliable, and fast. This infrastructure includes staging, among other environments; see the list of environments.

Production SREs also have a strong focus on building the right toolsets and automations to enable development to ship features as fast and bug-free as possible, leveraging the tools provided by the product itself: we must dogfood.

Another part of the job is building monitoring tools that allow quick troubleshooting as a first step, then turning what we learn into symptom-based alerts, and finally fixing the problem or automating the remediation. We can only scale by being smart and using resources effectively, starting with our own time as the main scarce resource.

We want to make the service ready for mission-critical workloads. That readiness means:

  1. Speedy (speed index below 2 seconds)
  2. Available (uptime above 99.95%)
  3. Durable (automated backups and restores, monthly manual tests)
  4. Secure (prioritize requests of our security team)
  5. Deployable (quickly deploy and provide metrics for new versions in all environments)
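To make the availability target above concrete, the following minimal sketch converts an uptime percentage into a downtime budget. The 30-day measurement window and the function name are assumptions for illustration; the text above does not state the period over which uptime is measured.

```python
def downtime_budget_minutes(target_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed by an uptime target over a period.

    The 30-day default is an assumed measurement window, not a documented one.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - target_percent / 100)


budget = downtime_budget_minutes(99.95)
print(f"{budget:.1f} minutes per 30 days")  # prints "21.6 minutes per 30 days"
```

Under these assumptions, staying above 99.95% leaves roughly 21.6 minutes of downtime per 30 days.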


  1. Security: reduce risk to its minimum, and make the minimum explicit.
  2. Transparency, clarity and directness: public and explicit by default. We work in the open and strive for signal over noise.
  3. Efficiency: smart resource usage. We should not fix scalability problems by throwing more resources at them, but by understanding where the waste is happening and then working to make it disappear. We should work hard to reduce toil to a minimum by automating all the boring work out of our way.


Workout of the Week (WoW) Milestone

Issues in the tracker are organized into milestones to define the "workout of the week" (WoW) from one week to the next. The "week" runs from Wednesday to end of Tuesday. The other milestone in use is "Next WoW" to track items scheduled for the next week. Every week, the Production Engineering Manager renames the WoW to "WoW ending yyyy-mm-dd", and closes it; then renames "Next WoW" to "WoW". By doing this, the closed milestones provide a history of what the team has worked on, while the team only needs to be concerned with two open milestones. If issues are added to the "WoW" after the week has already started, add the ~unscheduled label (not needed if the issue is ~outage since those are by definition unscheduled).
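The weekly rename described above can be sketched as a small date calculation. This is only an illustration of the Wednesday-to-Tuesday week and the "WoW ending yyyy-mm-dd" title; the function names are hypothetical and nothing here calls the issue tracker.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() numbers Monday as 0


def wow_end(day: date) -> date:
    """Return the Tuesday that closes the Wednesday-to-Tuesday week containing `day`."""
    return day + timedelta(days=(TUESDAY - day.weekday()) % 7)


def closed_milestone_title(day: date) -> str:
    """Title given to the current "WoW" milestone when it is closed."""
    return f"WoW ending {wow_end(day).isoformat()}"


# 2018-03-14 was a Wednesday, so its week ends on Tuesday 2018-03-20.
print(closed_milestone_title(date(2018, 3, 14)))  # prints "WoW ending 2018-03-20"
```

A date that already falls on a Tuesday maps to itself, which matches the week running to the end of Tuesday.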

The Workout of the Week is currently on hold due to the ongoing migration project; in the meantime, we are tracking our effort using the migration milestones.

All direct or indirect changes to authentication and authorization mechanisms used by GitLab Inc., whether by customers or employees, require additional review and approval by a member of at least one of the following teams:

This process is enforced through mandatory MR approvals for the following repositories:

Additional repositories may also require this approval and can be evaluated on a case-by-case basis.

Labeling Issues

We use issue labels within the Infrastructure issue tracker to assist in prioritizing and organizing work. Prioritized labels are:

We also use the ~AP1, ~AP2, ~AP3 labels as described in availability & performance priority labels. Those are mainly used to communicate priority of issues to Product Managers, for scheduling purposes.

Issues labeled ~oncall are prioritized to be worked on by the current on-call team members.

Goals and Meta Goal

~goal issues are issues in a WoW that we have agreed, as a team, to do everything in our power to deliver. Goal issues should fit in one WoW, that is, they are deliverable in a single week. If they do not fit in one WoW, we are probably talking about a ~meta ~goal.

We use this kind of issue to indicate a general direction (generally speaking, something that will take 1 to 3 months of work). This means that a ~meta ~goal should be achievable in one quarter.

~meta issues that are not also ~goal are tasks larger than what fits in a quarter. They therefore need to be sliced into deliverable pieces, each of which can become a goal.
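The label combinations described above can be summarized as a small lookup. This is a sketch only: the function name and return strings are hypothetical, and labels are assumed to be passed as plain names without the leading `~`.

```python
from typing import Iterable


def planning_horizon(labels: Iterable[str]) -> str:
    """Map an issue's labels to the delivery horizon described in the text."""
    names = set(labels)
    is_goal = "goal" in names
    is_meta = "meta" in names
    if is_goal and is_meta:
        return "one quarter"    # ~meta ~goal: achievable in one quarter
    if is_goal:
        return "one WoW"        # ~goal: deliverable in a single week
    if is_meta:
        return "needs slicing"  # ~meta only: larger than a quarter
    return "no goal label"


print(planning_horizon(["goal", "meta"]))  # prints "one quarter"
```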

Other Labels

We use some other labels to indicate specific conditions and then measure the impact of these conditions within production or the production engineering team. This is especially important for understanding where the production engineering team invests its time, so that we can reduce toil and reduce the chance of a failure caused by accessing production more than necessary.

Labels that are particularly important for gathering data are:

Always Help Others

We should never stop helping and unblocking team members. To this end, data should always be gathered to assist in highlighting areas for automation and the creation of self-service processes. Creating an issue from the request with the proper labels is the first step. The default should be that the person requesting help makes the issue; but we can help with that step too if needed.

If the issue is urgent for whatever reason, we should label it following the instructions above and add it to the ongoing WoW.

Issue or Outage Hand-off

Ongoing outages, as well as issues that have the ~(perceived) data loss label and are (therefore) actively being worked on, need a hand-off as team members cycle in and out of their timezones and availability. The on-call log can be used to assist with this. (See link at top to on-call log.)

Production Events Logging

There are two kinds of production events that we track:


Summary of Backup Strategy

For details, see the runbooks, in particular for Azure snapshots and for database backups using WAL-E (encrypted).

R.A.D. - Restore Appreciation Days

On the second day of every month, we have an R.A.D. "party". Two production SREs use this day to test our backup processes by fully restoring not-yet-automated backups to test instances and verifying data integrity. The issues for every individual R.A.D. can be found in the infrastructure tracker.

The ongoing effort to automate everything related to backups is tracked in the infrastructure META issue.