The SRE org today consists of 3 teams of 3-6 engineers and will grow in 2019 to 3 full teams with ~6 SREs and ~2 DBREs. As the organization matures in 2019, we will continue to think about how we organize our teams and how work flows to them.
How does work flow to the SRE teams?
There are two major ways we could have teams cover their responsibilities: aligning geographically or aligning on expertise.
In aligning geographically, we can ensure a good model for a follow-the-sun strategy. However, many of the strategies discussed assume teams that are co-located but geographically diverse - think 3 co-located offices in 3 different timezones.
Some unique challenges we have:
In aligning on expertise, teams will own and develop expertise in certain development/product areas. See the "Structuring a Multiple SRE Team Environment" section in Chapter 18. Further below, under "Running Cohesive Distributed SRE Teams", there are additional tips that would apply to the distribution of SRE team members within GitLab. This model will likely give a better experience to the other teams we need to interact with, because each stable SRE team can develop a relationship with its counterpart teams in GitLab. We could structure this along the lines of the product categories and sections. For areas of internal ownership, we could then also have more stable "shepherds" or owners for core parts of our infrastructure like Prometheus, Grafana, ELK, PagerDuty, Chef, and Terraform. SRE teams that do not own a given piece of infrastructure tooling can still make changes to it, but there would be a known set of people to whom MRs and design proposals should go.
Another consideration is how SRE and the rest of engineering envision alerting for their services on GitLab.com in the future. This is largely item 1 under "How does work flow to the SRE teams?" above. Currently, all alerts go to the central SRE on-call rotation, which consists of two 12-hour shifts starting at 4AM and 4PM UTC. As the infrastructure teams and GitLab continue to grow, we have the opportunity to iterate on how we handle on call for GitLab.com.
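To make the shift boundaries concrete, here is a minimal sketch of how the current two-shift rotation maps a UTC time to a shift. The function names and the `shifts_per_day` parameter are hypothetical and used only for illustration; they are not part of any existing GitLab or PagerDuty tooling.

```python
from datetime import datetime, timezone


def shift_start_hours(first_start_hour: int = 4, shifts_per_day: int = 2) -> list[int]:
    """Return the UTC hours at which each shift begins.

    Defaults model the current rotation: two 12-hour shifts, the first
    starting at 04:00 UTC. Both parameters are illustrative assumptions.
    """
    if shifts_per_day <= 0 or 24 % shifts_per_day != 0:
        raise ValueError("shifts_per_day must divide 24 evenly")
    shift_hours = 24 // shifts_per_day
    return [(first_start_hour + i * shift_hours) % 24 for i in range(shifts_per_day)]


def shift_index(moment: datetime, first_start_hour: int = 4, shifts_per_day: int = 2) -> int:
    """Return the 0-based index of the shift that covers `moment`.

    `moment` should be timezone-aware; it is converted to UTC before the
    lookup. Index 0 is the shift that begins at `first_start_hour`.
    """
    if shifts_per_day <= 0 or 24 % shifts_per_day != 0:
        raise ValueError("shifts_per_day must divide 24 evenly")
    shift_hours = 24 // shifts_per_day
    utc_moment = moment.astimezone(timezone.utc)
    # Hours elapsed since the first shift of the day started, wrapping at midnight.
    hours_since_first_start = (utc_moment.hour - first_start_hour) % 24
    return hours_since_first_start // shift_hours


if __name__ == "__main__":
    print(shift_start_hours())  # start hours of the current two-shift model
    alert_time = datetime(2019, 1, 15, 17, 30, tzinfo=timezone.utc)
    print(shift_index(alert_time))  # index of the shift that would receive the alert
```

Passing a different `shifts_per_day` models other splits of the day; keeping the 04:00 UTC start for those is an assumption of this sketch, not something the plan specifies.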
A desired future state is to move the rotations to a model of 3 shifts of 8 hours each to better follow the sun. This largely depends on having enough people in each rotation: six people is the rough minimum for a reasonable rotation that lets people focus on project work when they are not on call and helps prevent burnout. Many SRE models also have the engineering teams that own services participate in on call alongside the SRE teams, covering monitoring and alerting for their own services. We could take steps to iterate towards that model. However, there are some considerations on how we would approach this:
Further thoughts and considerations about on call can be found in Chapter 8 of the SRE Workbook.