Reliability Engineering teams are the gatekeepers and primary caretakers of the operational environment hosting all of GitLab's user-facing services (most notably GitLab.com), focusing on their availability, performance and scalability through reliability considerations.
The Site Reliability teams are responsible for all of GitLab's user-facing services, most notably, GitLab.com. Site Reliability Engineers ensure that these services are available, reliable, scalable, performant and, with the help of GitLab's Security Department, secure. This infrastructure includes a multitude of environments, including staging, GitLab.com (production) and dev.GitLab.org, among others (see the list of environments).
SREs are primarily focused on GitLab.com's availability, and have a strong focus on building the right toolsets and automation so that development can ship features as fast and bug-free as possible, leveraging the tools provided by GitLab (we must dogfood).
Another part of the job is building monitoring tools that allow quick troubleshooting as a first step, then turning what we learn into alerts that notify based on symptoms, and finally fixing the problem or automating the remediation. We can only scale GitLab.com by being smart and using resources effectively, starting with our own time as the main scarce resource.
Reliability Engineering teams own the following operational processes:
The teams' overarching goal with respect to these processes is to make them obsolete through automation.
Key metrics related to this group include:
Each member of the Site Reliability Team is part of this vision:
The DBRE team has its own roadmap, dashboard, milestones and on-call rotation.
The Reliability Engineering team primarily organizes on the ~"team::Reliability" label in the GitLab Infrastructure Team group.
There are now 3 infrastructure teams reporting to
The three teams share the on-call rotations for GitLab.com. The two SREs in the weekly rotation (EMEA and Americas) share responsibility for triaging issues and managing tasks on the SRE On-call board. The board uses the group SRE:On-call label to identify issues across subgroups in gitlab-com and is not aligned with any single milestone.
Engineers not on-call should focus on their team's board(s) (e.g. AS TEAM) so they remain focused on the current milestone.
Incoming requests to the infrastructure team can start in the Current milestone, but can be triaged out to the correct teams.
Add issues at any time to the infrastructure issue tracker. Let one of the managers for the production team know of the request. It would be helpful for our prioritization to know the timeline for the issue if your team has commitments related to it. We do reserve part of our time for interrupt requests, but that does not always mean we can fit in everything that comes to us.
Each team's manager will triage incoming requests for the services their team owns. In some cases, we may decide to pull that work immediately; in other cases, we may defer the work to a later milestone if we have higher-priority work currently in progress. The 3 managers meet twice a week so that we can share efforts and rebalance work if needed. Work that is ready to pull will be added to the team milestone(s) and appear on their boards.
The infrastructure issue tracker is the backlog for the infrastructure team and tracks all work that SRE teams are doing that is not related to an ongoing change or incident.
We have a production issue tracker. Issues in this tracker are meant to track incidents and changes to production that need approval. We can host discussion of proposed changes in linked infrastructure issues. These issues should have the ~incident or ~change label and notes describing what happened or what is changing, with relevant infrastructure team issues linked for supporting information.
Standups: We do standups with a bot that asks each team member for updates at 11 AM in their time zone. Updates go into our Slack channel.
Retros: We are testing async retros with another bot, which runs on the second Wednesday of our milestone. Updates from that retro also go to our Slack channel. A summary is then made so that we can vote on important issues to talk about in more depth. These can then help us update our themes for milestones.
We use boards extensively to manage our work (see https://gitlab.com/groups/gitlab-com/gl-infra/-/boards).
board. The board is groomed daily by the Reliability Managers.
The managers' priorities are to:
workflow::Blocked list is empty (i.e., unblocking issues is critical)
keeps track of the state of Production, showing, at a glance, incidents, hotspots, changes and deltas related to production, and it also includes on-call reports.
There are four types of issues related to production, denoted by labels:
||Incidents are anomalous conditions where GitLab.com is operating below established SLOs.|
||Hotspots identify threats that are likely to become incidents if not addressed but that we are unable to address right away.|
||Changes are scheduled changes made during maintenance windows.|
||Deltas reflect deviations from standard configuration that will eventually merge into the standard.|
The Production Board is groomed by the IMOC/CMOC on a daily basis, and we strive to keep it both clean and lean.
Database (group label) will automatically add issues to the board.
collects Observability-related issues.
Board::Observability (group label) will automatically add issues to the board.
||Director of Infrastructure||Infrastructure|
||Reliability Engineering: CI/CD & Enablement||Reliability|
||Reliability Engineering: Dev & ops||Availability|
||Reliability Engineering: Secure and Defend||Observability|
||Distinguished Engineer, Infrastructure||Infrastructure|
||Operations Analyst, Infrastructure||Cost|
Other labels are relevant to issues in the board:
||Denotes OKR-related issue. It is used to communicate status and progress for quarterly KRs as assigned to each team.|
||Denotes KPI-related issue. It is used to track progress on definition, implementation and tracking of Infrastructure KPIs.|
||Denotes the state of an issue according to our workflow conventions:
Board lists are driven by the following labels:
||Issues assigned to OnGres|
||OnGres support issues||Support|
||OnGres project issues||Project|
||Issues ready to start||Issues must always be prioritized|
||Issues in progress||Issues must always have a Due Date|
||Issues blocked||Issues must describe what can unblock them|
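The three list rules above (ready issues are prioritized, in-progress issues carry a due date, blocked issues say what would unblock them) lend themselves to an automated check. The following is only a minimal sketch of that idea: the `Issue` class, its field names and the state constants are hypothetical stand-ins, not the actual label names or any GitLab API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical state names used only for this sketch; the real
# workflow label names are not assumed here.
READY, IN_PROGRESS, BLOCKED = "ready", "in_progress", "blocked"


@dataclass
class Issue:
    title: str
    state: str
    priority: Optional[str] = None      # e.g. "P1"; None means unprioritized
    due_date: Optional[date] = None
    unblock_note: str = ""              # free text: what would unblock the issue


def rule_violations(issue: Issue) -> List[str]:
    """Return the board-list rules that the given issue breaks."""
    problems = []
    if issue.state == READY and issue.priority is None:
        problems.append("ready issue has no priority")
    if issue.state == IN_PROGRESS and issue.due_date is None:
        problems.append("in-progress issue has no due date")
    if issue.state == BLOCKED and not issue.unblock_note.strip():
        problems.append("blocked issue does not say what would unblock it")
    return problems
```

A check like this could run during daily grooming and list only the issues that return a non-empty result, so that managers spend their time on the exceptions rather than on scanning every card.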
Issues are labeled by priority and criticality. Priority indicates what should be worked on first. Criticality describes the risk of not doing the work.
All issues must have priority and criticality assigned to them. This is a requirement for issues in the
||Highest priority items that require immediate action, with expected ETA in hours/days|
||Items that require prompt attention, with expected ETA within the week|
||Items with expected ETA within the current milestone|
||Items with expected ETA within the following milestone or beyond|
||Immediate threat to availability, performance or data durability|
||Expected threat to availability and/or performance within 30 days|
||Expected threat to availability and/or performance within 60 days|
||Expected threat to availability and/or performance beyond 60 days|
Note: data loss is always a
All issues with the Priority 1 (P1) label should be updated daily.
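The labeling requirements in this section (every issue carries a priority and a criticality, and P1 issues are updated daily) can be expressed as a small triage check. This is a sketch under stated assumptions: the function name and its parameters are hypothetical, the sets of valid priority and criticality labels are supplied by the caller rather than assumed, and "updated daily" is interpreted here as "updated within the last 24 hours".

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Set


def triage_problems(
    labels: Iterable[str],
    last_updated: datetime,
    now: datetime,
    priority_labels: Set[str],
    criticality_labels: Set[str],
) -> List[str]:
    """Return labeling problems for one issue.

    priority_labels and criticality_labels are the label names the
    caller considers valid; only "P1" is taken from the text above.
    """
    present = set(labels)
    problems = []
    if not present & priority_labels:
        problems.append("missing priority label")
    if not present & criticality_labels:
        problems.append("missing criticality label")
    # Assumption: "updated daily" means no more than 24 hours since the last update.
    if "P1" in present and now - last_updated > timedelta(days=1):
        problems.append("P1 issue not updated in the last day")
    return problems
```

For example, calling it with labels `["P1"]`, a `last_updated` timestamp two days before `now`, and caller-supplied label sets would report both the missing criticality label and the stale P1 update.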
Our milestones last 2 weeks.