Use the #production chat channel for questions that do not seem appropriate for the issue tracker or the internal email address.
+GoogleCalendar button in the lower right of the screen when viewing the Calendar with the link here.
The Production SRE team is responsible for all user-facing services. Production SREs ensure that these services are secure, reliable, and fast. This infrastructure includes staging, GitLab.com and dev.GitLab.org; see the list of environments.
Production SREs also have a strong focus on building the right toolsets and automations to enable development to ship features as fast and bug-free as possible, leveraging the tools provided by GitLab.com itself: we must dogfood.
Another part of the job is building monitoring tools that allow quick troubleshooting as a first step, then turning that knowledge into alerts that notify based on symptoms, and finally fixing the problem or automating the remediation. We can only scale GitLab.com by being smart and using resources effectively, starting with our own time as the main scarce resource.
We want to make GitLab.com ready for mission critical workloads. That readiness means:
Add issues at any time to the infrastructure issue tracker. Let one of the managers for the production team know of the request. It would be helpful for our prioritization to know the timeline for the issue if your team has commitments related to it. We do reserve part of our time for interrupt requests, but that does not always mean we can fit in everything that comes to us.
We have a production issue tracker. Issues in this tracker are meant to track incidents and changes to production that need approval. We can host discussion of proposed changes in linked infrastructure issues. These issues should have ~incident or ~change and notes describing what happened or what is changing with relevant infrastructure team issues linked for supporting information.
The columns on our board are:
Guiding philosophies: To get from Planning to Ready, an issue should be:
We'll organize our work into 2-week milestones. These milestones are meant to:
There will always be a "Current Milestone" and a "Next Milestone". This way, when creating or updating issues, we can use quick actions like /milestone %"Current Milestone" and /milestone %"Next Milestone" to quickly get issues added.
When a milestone is complete, we'll rename the finished milestone to the YYYY-MM-DD Milestone and update the Current and Next milestones.
We also want to value handing off issues to take advantage of the many timezones our team covers. An issue may be started by any team member in any timezone, and we can mark it with ~Ready_to_Handoff if it can go to someone in another timezone. An issue marked with the ~Ready_to_Handoff label should have clear notes about where it is being left off and what the next steps are. Handing off is not required; anyone can own an issue to completion too. But we want to be able to communicate and work across the many people on our team where it makes sense.
Milestones will run from a Monday to the Friday of the following week, a 12-day timebox counting both days. Each Milestone will be named with the month and end date of the timebox, for example: August Milestone 2 - ending 2018-08-24.
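The timebox arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names `milestone_window` and `milestone_name` are hypothetical, and the meaning of the number in the name (here taken to be the second milestone ending in that month) is an assumption drawn from the single example given.

```python
from datetime import date, timedelta


def milestone_window(start: date) -> tuple:
    """Return (start, end) for a milestone that begins on the given Monday."""
    if start.weekday() != 0:  # date.weekday() returns 0 for Monday
        raise ValueError("milestones start on a Monday")
    # The Friday of the following week is 11 calendar days after the Monday,
    # which gives a 12-day timebox when both endpoints are counted.
    return start, start + timedelta(days=11)


def milestone_name(end: date, ordinal: int) -> str:
    """Build a name like 'August Milestone 2 - ending 2018-08-24'.

    `ordinal` is an assumed counter of milestones ending in that month.
    Month names come from strftime and follow the process locale.
    """
    return f"{end.strftime('%B')} Milestone {ordinal} - ending {end.isoformat()}"


start, end = milestone_window(date(2018, 8, 13))
print(milestone_name(end, 2))
```

Starting from Monday 2018-08-13, the window ends on Friday 2018-08-24, which matches the example name above.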
Long term, additional teams will perform work on the production environment:
We cannot keep track of events in production across a growing number of functional queues.
Furthermore, said teams will start to have on-call rotations for both their function (e.g., security) and their services. For people on-call, having a centralized tracking point to keep track of said events is more effective than perusing various queues. Timely information (in terms of when an event is happening and how long it takes for an on-call person to understand what's happening) about the production environment is critical. The production queue centralizes production event information.
Functional queues track team workloads (security, etc.) and are the source of the work that has to get done. Some of this work clearly impacts production (build and deploy new storage nodes); some of it will not (develop a tool to do x, y, z) until it is deployed to production.
The production queue tracks events in production, namely:
Over time, we will implement hooks into our automation to automagically inject change audit data into the production queue.
This also leads to a single source of data. Today, for instance, incident reports for the week get transcribed to both the On-call Handoff and Infra Call documents (we also show exceptions in the latter). These meetings serve different purposes but have overlapping data. The input for this data should be queries against the production queue rather than manual entry in documents.
Additionally, we need to keep track of error budgets, which should also be derived from the production queue.
We will also be collapsing the database queue into the infrastructure queue. The database is a special piece of the infrastructure for sure, but so are the storage nodes, for example.
All direct or indirect changes to authentication and authorization mechanisms used by GitLab Inc. customers or employees require additional review and approval by a member of at least one of the following teams:
This process is enforced for the following repositories where the approval is mandatory using MR approvals:
Additional repositories may also require this approval and can be evaluated on a case-by-case basis.
We use issue labels within the Infrastructure issue tracker to assist in prioritizing and organizing work. Prioritized labels are:
~(perceived) data loss
We also use the ~AP3 labels as described in availability & performance priority labels. Those are mainly used to communicate priority of issues to Product Managers, for scheduling purposes.
Issues labeled ~oncall are prioritized to be worked on by the current on-call team members.
~goals are issues that are in a WoW and that we agreed as a team we will do everything in our power to deliver. Goal issues should fit in one WoW, that is, they are deliverable within a single week. If they do not fit in one WoW, we are probably talking about a ~meta issue.
We use ~meta issues to indicate a general direction (generally speaking, something that will take from 1 to 3 months of work). This means that a ~meta ~goal should be achievable in one quarter.
~meta issues that are not also ~goal are tasks larger than what fits in a quarter; they therefore need to be sliced into actually deliverable pieces that can each become a goal.
We use some other labels to indicate specific conditions and then measure the impact of these conditions within production or the production engineering team. This is especially important for understanding how much time is invested in specific parts of the production engineering team's work, so that we can reduce toil or reduce the chance of a failure caused by accessing production more often than necessary.
Labels that are particularly important for gathering data are:
~toil: Repetitive, boring work that should be automated away.
~unscheduled: An issue that became an interruption to the team and had to be handled in a WoW. It's unplanned work.
~unblocks others: An issue that is allowing some other part of the company to deliver something.
~access request: When someone is requesting to get access to some part of the infrastructure.
~requires production access: Every time someone with production access has to jump into a console to perform some manual operation, like running a script in a rails console, or connecting to Redis or the database directly.
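As a rough illustration of how these labels could feed the data gathering described here, the following Python sketch tallies how many issues carry each tracked label. The issue shape (a dict with a "labels" list) and the names `TRACKED_LABELS` and `tally_labels` are assumptions made for the example; they do not describe any existing tooling or API response.

```python
from collections import Counter

# Labels named in the list above, written without the leading "~".
TRACKED_LABELS = (
    "toil",
    "unscheduled",
    "unblocks others",
    "access request",
    "requires production access",
)


def tally_labels(issues):
    """Count how many issues carry each tracked label.

    `issues` is assumed to be an iterable of dicts with a "labels" list.
    """
    counts = Counter({label: 0 for label in TRACKED_LABELS})
    for issue in issues:
        # Use a set so a label repeated on one issue is counted only once.
        for label in set(issue.get("labels", [])):
            if label in counts:
                counts[label] += 1
    return dict(counts)


# Hypothetical sample data, for illustration only.
sample = [
    {"title": "Rotate a credential by hand", "labels": ["toil", "requires production access"]},
    {"title": "Grant console access", "labels": ["access request"]},
    {"title": "Restart a stuck job", "labels": ["toil", "unscheduled"]},
]
print(tally_labels(sample))
```

Counting issues per label, rather than label occurrences, keeps the numbers comparable across milestones even if a label is applied twice by mistake.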
We should never stop helping and unblocking team members. To this end, data should always be gathered to assist in highlighting areas for automation and the creation of self-service processes. Creating an issue from the request with the proper labels is the first step. The default should be that the person requesting help makes the issue; but we can help with that step too if needed.
If this issue is urgent for whatever reason, we should label it following the instructions above and add it to the ongoing WoW.
Ongoing outages, as well as issues that have the ~(perceived) data loss label and are (therefore) actively being worked on, need a handoff to happen as team members cycle in and out of their timezones and availability. The on-call log can be used to assist with this (see the link to the on-call log at the top).
To ensure 24x7 coverage of emergency issues, we currently have split on-call rotations between EMEA and AMER regions; team members in EMEA regions are on-call from 0400-1600 UTC, and team members in AMER regions are on-call from 1600-0400 UTC. We plan to extend this to include team members from the APAC region in the future, as well. This forms the basis of a follow-the-sun support model, and has the benefit for our team members of reducing (or eliminating) the stress of responding to emergent issues outside of their normal work hours, as well as increasing communication and collaboration within our global team.
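The split described above can be expressed as a small lookup. This is a sketch only: the function name `on_call_region` is hypothetical, and treating each window as start-inclusive and end-exclusive (so 16:00 UTC already belongs to AMER) is an assumption, since the text does not say which rotation owns the exact boundary.

```python
from datetime import datetime, timedelta, timezone


def on_call_region(moment: datetime) -> str:
    """Return which region's rotation covers the given moment.

    EMEA covers 04:00-16:00 UTC and AMER covers 16:00-04:00 UTC.
    Boundaries are treated as start-inclusive (an assumption).
    """
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    hour = moment.astimezone(timezone.utc).hour
    return "EMEA" if 4 <= hour < 16 else "AMER"


# 09:00 at UTC-7 is 16:00 UTC, which falls in the AMER window.
local = datetime(2018, 8, 13, 9, 0, tzinfo=timezone(timedelta(hours=-7)))
print(on_call_region(local))
```

Converting to UTC before comparing hours avoids mistakes when the caller passes a local time; adding an APAC window later would mean replacing the two-way split with a list of ranges.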
For further details about managing schedules, workflows, and documentation, see the on-call runbook.
There are two kinds of production events that we track:
On the second day of every month, we have a R.A.D. "party". Two production SREs use this day to test our backup processes by fully restoring not-yet-automated backups to test instances and verifying data integrity. The issues for every individual R.A.D. can be found in the infrastructure tracker.
The ongoing effort to automate all backups is tracked in the infrastructure META issue.