If you are a GitLab team member and are looking to alert Reliability Engineering about an availability issue with GitLab.com, please find quick instructions to report an incident here: Reporting an Incident.
If you are a GitLab team member looking for assistance from Reliability Engineering, please see the Getting Assistance section.
Reliability Engineering is responsible for all of GitLab's user-facing services, with GitLab.com as its primary responsibility. Site Reliability Engineers (SREs) ensure the availability of these services and build the tools and automation to monitor and enable this availability. These user-facing services span a multitude of environments, including staging, GitLab.com, and dev.GitLab.org, among others (see the list of environments).
Reliability Engineering ensures that GitLab's customers can rely on GitLab.com for their mission-critical workloads. We approach availability as an engineering challenge and empower our counterparts in Development to make the best possible infrastructure decisions. We own and iterate often on how we manage incidents and continually derive and share our learnings by conducting thorough reviews of those incidents.
The Reliability Team maintains service ownership as defined in the GitLab Service Ownership Policy.
If you'd like our assistance, please use one of the issue generation templates below and the work will be routed appropriately:
We can also be reached in Slack in the #production channel for questions related to GitLab.com and in the #infrastructure-lounge channel for all other questions.
Assistance from the Infrastructure Team is occasionally required to help solve or troubleshoot external customer issues.
All larger features, configuration changes, and new services must go through a Production Readiness review.
The General Team supports the Reliability Team's overall vision by supporting services for GitLab.com that do not fit the mission of the other Reliability Teams.
The Observability team maintains metrics and logs platforms for GitLab SaaS and is responsible for Prometheus, Thanos, Grafana, and Logging.
The Foundations team builds, runs and owns the core infrastructure for GitLab.com.
The Database Reliability Team is responsible for the PostgreSQL engine for GitLab.com services, as well as a range of related systems which help to ensure the availability and reliability of the database engine.
The Practices Team has SREs who work full time directly with Stage Groups on their Reliability and Infrastructure concerns for GitLab.com when their need is greater than the General Team can support.
We mainly work from OKRs with the ~sub-department::Reliability label. Our remaining work is ad hoc issues on the Reliability issue backlog and Corrective Actions for incidents.
OKRs for the Reliability team that require status tracking should be updated each Wednesday.
All Objectives and Key Results should have the following labels:
~OKR
~division::Engineering
~"Department::Infrastructure & Quality"
~Sub-Department::Reliability
~Reliability::<name> (e.g. ~Reliability::Practices)
The description of an Objective or Key Result should include the "why" for making this a focus area for the quarter and how the key result will be scored.
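The label requirements above can be checked mechanically before an Objective or Key Result is filed. The following is a minimal sketch, not an official tool: the function name `missing_okr_labels` is hypothetical, the labels are passed in as plain strings without the leading `~`, and the comparison is case-insensitive on the assumption that label matching should not depend on capitalization.

```python
# Required labels, copied from the list above (without the leading "~").
REQUIRED_LABELS = (
    "OKR",
    "division::Engineering",
    "Department::Infrastructure & Quality",
    "Sub-Department::Reliability",
)
# Prefix of the per-team label, e.g. "Reliability::Practices".
TEAM_LABEL_PREFIX = "Reliability::"


def missing_okr_labels(labels):
    """Return the required labels that are absent from `labels`.

    Hypothetical helper for illustration. Assumption: labels are
    compared case-insensitively.
    """
    present = {label.casefold() for label in labels}
    missing = [req for req in REQUIRED_LABELS if req.casefold() not in present]

    # A team label must start with the prefix and carry a non-empty name.
    prefix = TEAM_LABEL_PREFIX.casefold()
    has_team_label = any(
        label.casefold().startswith(prefix) and len(label) > len(prefix)
        for label in labels
    )
    if not has_team_label:
        missing.append(TEAM_LABEL_PREFIX + "<name>")
    return missing
```

Calling `missing_okr_labels(["OKR", "sub-department::Reliability"])` would report the division label, the department label, and the team label placeholder as missing.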
OKRs should be updated every Wednesday with the current % completed and a dated comment describing the current status.
If progress is on track and the % completed is getting updated as expected, it is not necessary to provide extensive status updates.
One sentence with a link to a larger update in a corresponding epic is sufficient in most cases.
It is also acceptable to do a check-in (i.e., update the % complete) without providing a note on a single occasion, but not over several check-ins.
When status changes to "needs attention" or "at risk", we need to provide context: why the status has changed, what action is being taken to address the issues, and what assistance is required. The goal is to enable team members not close to the work to understand the situation without requesting additional status updates.
The OKR Handbook has a starting point for when to add Needs Attention and At Risk to Health Status for OKRs.
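The weekly Wednesday cadence, together with the six-week midpoint check-in, can be turned into concrete calendar dates. The sketch below is illustrative only: the function names are hypothetical, and the quarter start and end dates are supplied by the caller rather than taken from any official calendar.

```python
from datetime import date, timedelta


def wednesday_checkins(quarter_start, quarter_end):
    """List every Wednesday from quarter_start to quarter_end, inclusive."""
    # date.weekday() numbers Monday as 0, so Wednesday is 2.
    days_until_wednesday = (2 - quarter_start.weekday()) % 7
    current = quarter_start + timedelta(days=days_until_wednesday)
    checkins = []
    while current <= quarter_end:
        checkins.append(current)
        current += timedelta(weeks=1)
    return checkins


def midpoint_checkin(quarter_start):
    """Return the date six weeks after the start of the quarter."""
    return quarter_start + timedelta(weeks=6)


if __name__ == "__main__":
    # Example dates only; substitute the real quarter boundaries.
    start, end = date(2024, 2, 1), date(2024, 4, 30)
    for checkin in wednesday_checkins(start, end):
        print("weekly update:", checkin.isoformat())
    print("midpoint check-in:", midpoint_checkin(start).isoformat())
```

The midpoint is computed as a plain six-week offset from the quarter start; whether it should be aligned to the nearest Wednesday is left open here.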
Six weeks from the start of the quarter is the midpoint check-in, which requires a more detailed update. In addition to the weekly status update information, all Objectives and Key Results should have:
At the end of the quarter, the last check-in requires a retrospective. The DRI for the Objective or Key Result should add the following:
In addition to status updates every Wednesday, all Objectives and Key Results assigned to the current quarter must have the following set: