Team

Workflow: How may we be of service?; GitLab.com Status: STATUS
Issue Trackers: Infrastructure (Milestones, OnCall); Production (Incidents, Changes, Deltas); Delivery
Slack Channels: #sre-lounge, #database, #alerts, #production, #g_delivery
Operations: Runbooks (please contribute!); On-call (Handover Document, Reports)

Teams

An operational environment is a complex, interconnected mesh of components working in unison to deliver a set of services. Rather than organizing the team into siloed functional groups, we align it with the environment's lifecycle, taking into account the two variables that drive change in the environment: time and space. Events and actions take place on a time scale (essentially between now and soon), and the closer people are to the environment, the greater the effect those events have on them.

Structure

Our long-term objective is to become a world-class SRE organization. To reach that goal, we are adopting a focal arrangement: each group is defined by its focus and purpose and is placed along the time and space variables. Each group contains the functional resources necessary to manage the environment, including systems and database specialties.

The first iteration in this model comprised two groups, Site Availability and Site Reliability.

The second iteration adds a third group, Delivery, which specializes in the biggest source of change in the environment, releases, and whose purpose is to make CI/CD at GitLab a reality.

Thus, the three groups are Site Availability, Site Reliability, and Delivery.

Rotation

Team members in Site Availability and Site Reliability rotate between the two groups on a 6-to-9-month schedule so that everyone levels up on the skills necessary to be successful in our long-term vision. The rotation gives each of us a sense of the priorities in both groups, which will eventually merge under a single SRE umbrella.

Long-term Structure

As our processes and automation mature, the quality of our work will stabilize and become more predictable, and we will become adept at maintaining high levels of uptime across the board. Site Availability will then merge into Site Reliability, at which point we will have several vertical Site Reliability teams that follow the sun. GitLab.com is a global service, so Infrastructure must be global as well.