The Practices team is a subgroup of the Reliability Team.
Our mission is to ensure the reliability, performance, and availability of GitLab.com by partnering with Stage Groups so that features and services are designed and implemented with reliability in mind. The team collaborates with Stage Groups to build, maintain, and improve services, and to ensure that each service's SLOs are met in line with GitLab.com's availability and performance goals.
The following people are members of the Reliability:Practices team:
Currently, the Practices team supports infrastructure reliability for the following services:
Stage Group | Reliability:Practices Team Members |
---|---|
Verify:Runner | Rehab Hassanein |
Verify:Runner SaaS | Rehab Hassanein |
Monitor:Product Analytics | |
Fulfillment:Fulfillment Platform | |
Systems:Gitaly::Cluster | Furhan Shabir |
Systems:Gitaly::Git | Furhan Shabir |
Workflow | Issue Labels Weekly Issue Triage |
Backlog | Current Milestone Service Backlog Priority Board |
Reaching us | #g_infra_practices #reliability-lounge @gitlab-org/reliability/practices |
Weekly Agenda | Weekly, alternating between APAC/EMEA and EMEA/AMER |
Achievements | FY24 - Q1 |
Requests outside the scope below should follow the Reliability team's General Workflow for prioritization and assignment.
The Practices team is not there to reduce the stage team's backlog but to collaborate on and support reliability efforts and infrastructure best practices.
The above scope of work might imply some overlap with other teams' functions, such as:
We use quarterly Objectives and Key Results to plan and measure our Key Performance Indicators (KPIs).
We measure the value we contribute by using performance indicator metrics.
In addition to the Infrastructure Department's KPIs for availability and performance of GitLab.com, the Practices team tracks the following:
The Practices team does not necessarily own the above KPIs and metrics but rather facilitates and supports them to ensure reliable services and components.
We strive for a 50% split between project work and operations work. Spending more than 50% of our time on operations work indicates that the service or team is not in a healthy state and needs to be addressed. More detail on this can be found in https://sre.google/workbook/part-II-practices/
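As a rough illustration of the 50% guideline, the sketch below computes the share of time spent on operations work and flags when it exceeds the threshold. The `WorkLog` type, its field names, and the idea of tracking hours per period are assumptions made for this example; they do not describe an existing team tool or process.

```python
from dataclasses import dataclass


@dataclass
class WorkLog:
    """Hours logged in one period (hypothetical structure for illustration)."""
    project_hours: float
    operations_hours: float


def operations_share(log: WorkLog) -> float:
    """Return the fraction of logged time spent on operations work."""
    total = log.project_hours + log.operations_hours
    if total == 0:
        # Nothing logged: treat as no operations load rather than divide by zero.
        return 0.0
    return log.operations_hours / total


def needs_attention(log: WorkLog, threshold: float = 0.5) -> bool:
    """True when operations work exceeds the threshold (50% by default)."""
    return operations_share(log) > threshold
```

For example, a period with 10 project hours and 30 operations hours gives an operations share of 0.75, which `needs_attention` would flag, while an even 20/20 split would not be flagged.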
Date:
Participants:
1. New Items
   1. ...
2. Standing Items
   1. Incidents and Corrective Actions
   2. Observability Adjustments
   3. Long Term Work
   4. Ideas for Improvements
      1. Automation
      2. Self-Service
Pairing between Practices team members and their assigned Stage Group team is encouraged. This provides multiple immediate benefits:
Both the Practices team member and their assigned Stage Group team members are encouraged to reach out to their counterparts to ask whether they have room to work on a specific issue together. Any improvements identified should be documented by creating issues or merge requests.
Members of the Practices team are encouraged to identify and attend one or more of their Stage Group's weekly meetings:
Issues for the Practices team can be found in the following projects:
Meeting | Purpose |
---|---|
Biweekly sync (rotate EMEA/APAC/AMER) | Share news and information and provide an opportunity for people on the team to escalate concerns. |
Retrospective (weekly) | Discuss what went well, what did not, and opportunities for improvement. |
Roadmap (monthly) | Discuss the vision and check in on the roadmap. |
The purpose of the daily standup is to give team members visibility into what everyone else is doing and to provide an avenue for asking for and offering help. We use Geekbot integrated with Slack.
The purpose of daily updates is to inspect progress and adapt upcoming planned work as necessary. In an all-remote culture, we keep the updates asynchronous and put them directly in the issues.
The async daily update communicates progress and confidence in an issue comment. A daily update may be skipped if there was no progress.
A weekly async update should be added to epics related to quarter goals and to epics actively being worked on, following the format in the Project Epic template. The update should provide an overview of progress across the work in progress. Consider adding an update if the epic is blocked, if there are competing priorities, and even when the epic is not in progress.