Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other GitLab production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply sound engineering principles, operational discipline, and mature automation to our operating environments and the GitLab codebase.
SREs specialize in systems (operating systems, storage subsystems, networking), while implementing best practices for availability, reliability and scalability, with varied interests in algorithms and distributed systems.
GitLab.com is a unique site and brings with it unique challenges: it is the largest GitLab instance in existence (and in fact, one of the largest single-tenancy open-source SaaS sites on the Internet). The team's experience feeds back into other Engineering groups within the company, as well as to GitLab customers running self-managed installations.
The Site Reliability Engineer is a grade 6.
The Senior Site Reliability Engineer is a grade 7.
Senior Site Reliability Engineers are Site Reliability Engineers who meet the following criteria:
The Staff Site Reliability Engineer is a grade 8.
Staff Site Reliability Engineers are Senior Site Reliability Engineers who meet the following criteria:
SREs with a Delivery specialization focus primarily on improving software delivery for GitLab.com, as well as for self-managed users, by improving the release management tooling and processes. They have a wide understanding of the system and application architecture, and a strong observability background. They are expected to contribute to various GitLab projects with a software delivery focus and point of view.
Delivery SRE responsibilities are the same as those of their Backend Engineer team colleagues, as defined in the backend engineer role. While the backend engineers approach their responsibilities from a software developer's point of view, SREs approach the same problems from an operational perspective, and the two collaborate closely to find an optimal solution that safely and quickly delivers code to the various supported environments.
Delivery SREs are additionally responsible for shortening software delivery times by introducing new technologies and migrating away from established infrastructure, for example by moving from virtual machines to the Kubernetes platform.
SREs with a Scalability specialization focus primarily on the application side of GitLab running on GitLab.com, improving the architecture as GitLab.com continues to grow. They provide data to development teams so that those teams can prioritize reliability and performance improvements through application changes and better use of the infrastructure resources available on GitLab.com.
They have a strong development background (they are expected to contribute continuously to GitLab codebases) and a good grasp of observability and systems operations.
SREs in Scalability operate on a long-term horizon. We aim to prevent future large S1 incidents by making sure that we have a system that can scale to meet demand.
In each of these responsibilities, we focus on the long-term mindset required to harden our systems for growth.
SREs in Scalability are expected to be part of the on-call rotation.
Further details about Scalability SREs' involvement with incidents are available on the team handbook page.
SREs with an Environment Automation specialization focus primarily on provisioning the various GitLab environments and automating every operational aspect of the application lifecycle. They have a strong operational background, but their strength is in converting regular manual actions into repeatable automated tasks.
SREs with a Cloud Efficiency Engineering specialization focus primarily on improving our overall utilization of cloud provider resources. This includes improving tooling for and adoption of tagging/labeling, analyzing and leveraging discounting tools such as Reserved Instances (RIs) and Committed Use Discounts (CUDs), and collaborating across teams to increase the efficiency of GitLab itself.
Individual Contributors in SRE roles can also move to roles in the Engineering Management - Infrastructure job family.
Site Reliability Engineers have the following job-family performance indicators:
All interviews are conducted using Zoom video conferencing software. To learn more about someone conducting your interview, find their job title on our team page.
Please keep in mind that you can be declined at any stage of the process. You should consider each of the following bullets as though the words "If selected" precede them.
You may have additional 60-minute interviews with the Director of Infrastructure Engineering, the VP of Engineering, or both.
If approved, you will subsequently be made an offer.
Additional details about our process can be found on our hiring page.
We are an equal opportunity employer and value diversity, inclusion and belonging at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
GitLab Inc. is a company based on the GitLab open-source project. GitLab is a community project to which over 2,200 people worldwide have contributed. We are an active participant in this community, trying to serve its needs and lead by example. We have one vision: everyone can contribute to all digital content, and our mission is to change all creative work from read-only to read-write so that everyone can contribute.
We value results, transparency, sharing, freedom, efficiency, self-learning, frugality, collaboration, directness, kindness, diversity, inclusion and belonging, boring solutions, and quirkiness. If these values match your personality, work ethic, and personal goals, we encourage you to visit our primer to learn more. Open source is our culture, our way of life, our story, and what makes us truly unique.
Top 10 Reasons to Work for GitLab:
See our culture page for more!
Work remotely from anywhere in the world. Curious to see what that looks like? Check out our remote manifesto and guides.