Infrastructure Blueprint

2018Q4

We successfully completed the GCP migration, the Summit is behind us, and 2018Q3 is almost over. It is now time to turn our attention to Q4. While we are still working out OKR specifics, our primary and overriding focus for Q4 will continue to be GitLab.com’s availability and, more specifically, its observable availability. There are two aspects to availability:

Making it observable entails measuring and evaluating availability against clearly specified objectives (SLXs) that take into account error budgets.
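As a rough illustration of evaluating availability against an objective that accounts for an error budget, the sketch below assumes a request-based objective. The function name `error_budget`, the 99.9% target, and the request counts are hypothetical and do not reflect GitLab.com's actual objectives or traffic.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compare measured availability with an objective (hypothetical helper).

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # The error budget is the number of failures the objective tolerates.
    allowed_failures = (1.0 - slo_target) * total_requests
    availability = 1.0 - failed_requests / total_requests

    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        # A negative value means the budget for the period is exhausted.
        "budget_remaining": allowed_failures - failed_requests,
        "objective_met": availability >= slo_target,
    }


if __name__ == "__main__":
    # Illustrative numbers only.
    print(error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400))
```

A remaining budget above zero suggests there is room for riskier changes; an exhausted budget argues for prioritizing availability work.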

When determining priorities, we must ask whether the task at hand contributes to availability along those two axes: if it does, we move forward; if it does not, it is likely safe to skip; if in doubt, we simply ask. Efficiency, speed, and cost optimizations are, at this stage, secondary goals. Maintaining high and stable levels of measurable availability is, and will always be, one of our most critical OKRs. As our automation matures, we will be able to increase our speed and optimize the environment.

OKRs

Infrastructure's objective is to make GitLab.com ready for mission-critical workloads. To achieve this, each team in Infrastructure has a charter to pursue the objective from a different angle. SAE's goal is to return GitLab.com to its nominal operating state when an incident takes place, whereas SRE's goal is to ensure GitLab.com remains in its nominal operating state. In other words, SAE minimizes GitLab.com's MTTR (mean time to recovery) and SRE maximizes GitLab.com's MTBF (mean time between failures).
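To see how the two measures combine, a common steady-state approximation is availability = MTBF / (MTBF + MTTR). The sketch below computes all three from a list of incident durations. It assumes MTBF is measured as mean uptime between failures; the function name and the sample incident data are hypothetical.

```python
def availability_from_incidents(window_hours: float, incident_hours: list[float]) -> dict:
    """Derive MTTR, MTBF, and availability for one observation window.

    incident_hours holds the downtime of each incident, in hours.
    Assumes MTBF means average uptime between failures.
    """
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    if any(d < 0 for d in incident_hours):
        raise ValueError("incident durations must be non-negative")

    count = len(incident_hours)
    if count == 0:
        # No failures observed: the whole window counts as uptime.
        return {"mttr": 0.0, "mtbf": window_hours, "availability": 1.0}

    downtime = sum(incident_hours)
    if downtime > window_hours:
        raise ValueError("total downtime exceeds the observation window")

    mttr = downtime / count                   # lowering this shortens each outage
    mtbf = (window_hours - downtime) / count  # raising this makes outages rarer
    return {"mttr": mttr, "mtbf": mtbf, "availability": mtbf / (mtbf + mttr)}


if __name__ == "__main__":
    # Illustrative only: three incidents in a 30-day (720-hour) window.
    print(availability_from_incidents(720.0, [1.0, 2.0, 3.0]))
```

Under this approximation, halving MTTR and doubling MTBF improve availability by roughly the same amount, which is why the two charters complement each other.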

Infrastructure's OKRs are crafted along these lines of thought.

Roadmap

Cultural Focus

Achieving our goals requires a high-performance team that adheres to GitLab's values. We are not simply running GitLab.com: we are also building a team we love to work with. As Reed Hastings eloquently put it, we want to create a team where "Oh, I’d want to come to work every day and solve these problems with these people".

There are three cultural aspects to focus on as we continue to develop and polish the team:

Workload Focus

Our workload should be managed in a fairly predictable fashion. A minimum of 60% of our work should be known, scheduled work. This work is defined as work that:

When the planned work entails design and discussion, no production-changing work should be performed on that issue. Instead, create follow-up issues that capture the outcome and define the work to be done.

Functional Focus

Database

Backup, Restore, and Verification
Replication

Storage

Observability

CI/CD