GitLab.com is the Largest Production GitLab Installation on the Planet.
The Infrastructure team is the primary responsible party for the availability, reliability, performance, and scalability of GitLab.com. Other departments and teams contribute greatly to these attributes of our service as well. In these cases, it is the responsibility of the Infrastructure department to close the feedback loop with monitoring and metrics to drive accountability.
We are a blend of operations gearheads and software crafters who apply sound engineering principles, operational discipline, and mature automation to make GitLab.com ready for mission-critical customer workloads. We strive for excellence every day by living and breathing GitLab's values as our guiding operating principles in every decision we make and every action we take.
An operational environment is a complex and interconnected mesh of components working in unison to deliver a set of services. Rather than organize the team along siloed functional groups, we align the team with the environment's lifecycle, taking into account the two variables that drive change into the environment: time and space. Events and actions take place in the environment on a time scale (essentially between now and soon), and their effect on people resources is greater the closer those resources are to the environment.
Our long-term objective is to become a world-class SRE organization. To reach that goal, we are adopting a focal arrangement: the organizational formula is derived from the focus and purpose of the groups, which are arranged along the time and space variables. Each group contains the functional resources necessary to manage the environment, including systems and database specialties.
The first iteration of this model comprises two groups: Site Availability and Site Reliability.
Team members in Infrastructure rotate between the two groups on a 6 to 9 month schedule so that everyone levels up on the skills necessary to be successful in our long-term vision. The rotation gives each of us a sense of the priority axes across both groups, which will eventually merge under a single SRE umbrella.
As our processes and automation mature, the quality of our work will stabilize and become more predictable. We will become adept at maintaining high levels of uptime across the board. Site Availability will then merge into Site Reliability, at which point we will have several vertical Site Reliability teams that follow the sun. GitLab.com is a global service, and so Infrastructure must be global as well.
Site Availability is the gatekeeper and primary caretaker of the operational environment, focusing on its uptime and state as it exists in the present.
Over the next 12 to 18 months, we will focus relentlessly on the availability of GitLab.com so that it becomes ingrained in everything we do. Thus, the team's priorities are driven, almost exclusively, by availability considerations, effecting the cultural shift necessary to achieve our uptime goals. This group has the greatest latitude in making changes to the environment that ensure uptime in the here and now, and is the final authority on changes to GitLab.com.
Site Availability is the primary owner (but not the only consumer) of the following operational processes and procedures:
Key metrics related to this group include:
Site Reliability is the complementary primary caretaker of the operational environment, focusing on its uptime through reliability considerations. Whereas Site Availability is focused on the here and now, Site Reliability has a slightly longer time horizon: soon. Its guiding principles are efficiency, effectiveness, and frugality. In a sense, this is the team that will make both change management and delta management obsolete. In very colloquial terms, Site Reliability produces well-designed machine parts to replace the duct tape placed in the environment.