Site Reliability Engineer
Site Reliability Engineers are responsible for keeping all user-facing services (most notably GitLab.com) and many other GitLab production systems running smoothly 24/7/365. SREs are a blend of operations gearheads and software crafters who apply sound engineering principles, operational discipline and mature automation. They specialize in systems, whether that means networking, the Linux kernel, or a specific interest in scaling, algorithms, or distributed systems.
GitLab.com is a unique site and it brings unique challenges: it’s the biggest GitLab instance in existence; in fact, it’s one of the largest single-tenancy open-source SaaS sites on the internet. The experience of our team feeds back into other engineering groups within the company, as well as to GitLab customers running self-managed installations.
As an SRE you will:
- Be on a PagerDuty rotation to respond to GitLab.com availability incidents and provide support for service engineers with customer incidents.
- Use your on-call shift to prevent incidents from ever happening.
- Manage our infrastructure with Chef, Terraform and Kubernetes.
- Make monitoring and alerting fire on symptoms rather than on outages.
- Document every action so your learnings turn into repeatable actions and then into automation.
- Use the GitLab product to run GitLab.com as a first resort, and improve the product as much as possible.
- Improve the deployment process to make it as boring as possible.
- Design, build and maintain core infrastructure pieces that allow GitLab to scale to support hundreds of thousands of concurrent users.
- Debug production issues across services and levels of the stack.
- Plan the growth of GitLab's infrastructure.
You may be a fit for this role if you:
- Think about systems - edge cases, failure modes, behaviors, specific implementations.
- Know your way around Linux and the Unix Shell.
- Know what configuration management systems like Chef (the one we use) are for.
- Have strong programming skills in Ruby and/or Go.
- Have an urge to collaborate and communicate asynchronously.
- Have an urge to document all the things so you don't need to learn the same thing twice.
- Have a proactive, go-for-it attitude. When you see something broken, you can't help but fix it.
- Have an urge for delivering quickly and iterating fast.
- Share our values, and work in accordance with those values.
- Have experience with Docker, Nginx, Go, and Kubernetes.
Projects you could work on:
- Coding infrastructure automation with Chef and Terraform
- Improving our Prometheus Monitoring or building new Metrics
- Helping release managers deploy and troubleshoot new versions of GitLab-EE.
- Migrating GitLab.com from its current home on Azure Cloud to Google Cloud Platform.
- Migrating GitLab.com to Kubernetes.
Areas of expertise for Leveling
- CDN and load balancing the application
- Kubernetes and containerizing our system
- Product knowledge
- Monitoring and Metrics in Prometheus and integrations with Slack/PagerDuty
- Logging infrastructure
- Team organization and planning
- Backend storage management and scaling
- Disaster Recovery and High Availability strategy
Junior Site Reliability Engineer
- Provides emergency response either by being on-call or by reacting to symptoms according to monitoring.
- Delivers production solutions that scale, identifying automation points, and proposing ideas on how to improve efficiency.
- Improves documentation all around, either in application documentation, or in runbooks, explaining the why, not stopping with the what.
- Improves the performance of the system by either making better use of resources, distributing load or reducing the latency.
- Shares the learnings publicly, either by creating issues that provide context for anyone to understand it or by writing blog posts.
- Updates GitLab default values so there is no need for configuration by customers.
- Improves monitoring and alerting, fighting alert spam.
- Has general knowledge of 2 of the areas of expertise.
Site Reliability Engineer
- Proposes ideas and solutions within the infrastructure team to reduce the workload by automation.
- Plans, designs and executes solutions within the infrastructure team to reach specific goals agreed within the team.
- Plans and executes configuration change operations at both the application and the infrastructure level.
- Actively looks for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation.
- Has deep knowledge of 1 area of expertise and general knowledge of how to work with 4 more areas of expertise.
Senior Site Reliability Engineer
Experienced production engineers who meet the following criteria:
- Lead Production SREs and Junior Production SREs by setting the example.
- Identify changes for the product architecture from the reliability, performance and availability perspective with a data-driven approach.
- Know a domain really well and radiate that knowledge.
- Proactively work on efficiency and capacity planning to set clear requirements and reduce system resource usage, making GitLab cheaper to run for all our customers.
- Perform and run blameless root cause analyses on incidents and outages, aggressively looking for answers that will prevent the incident from ever happening again.
- Show ownership of a major part of the infrastructure.
- Identify parts of the system that do not scale, provide immediate palliative measures and drive long-term resolution of these incidents.
- Identify the SLIs (Service Level Indicators) that will align the team to meet the availability and latency objectives.
- Have deep knowledge in 2 areas of expertise and general knowledge of all areas of expertise. Are capable of mentoring Junior SREs in all areas and other SREs in their area of deep knowledge.
Staff Site Reliability Engineer
Senior Production SREs who meet the following criteria:
- Technical Skills
- Identifies significant projects that result in substantial cost savings or revenue.
- Is able to create innovative solutions that push GitLab's technical abilities ahead of the curve.
- Proposes and drives architectural changes that affect the whole company to solve scaling and performance problems.
- Sets the necessary goals and SLOs (Service Level Objectives) that will guide the infrastructure team to build a better product.
- Writes in-depth documentation that shares knowledge and radiates GitLab technical strengths.
- Production, Scalability & Automation
- Strives for automation either by coding it or by leading and influencing developers to build systems that are easy to run in production.
- Measures the risk of introduced features to plan ahead and improve the infrastructure.
- Has deep knowledge of GitLab and 2 other areas of expertise, and enough knowledge of every area of expertise to mentor and guide other team members in those areas.
Candidates for this position can expect the hiring process to follow the order below. Please keep in mind that candidates can be declined from the position at any stage of the process. To learn more about someone who may be conducting the interview, find her/his job title on our team page.
Additional details about our process can be found on our hiring page.
Please note that if we are actively hiring for a position, you will see it listed on our jobs page, where all of our current openings are advertised. To apply, please click on the name of the role you are interested in, which will take you to our applicant tracking system (ATS), Greenhouse.
Avoid the confidence gap; you do not have to match all the listed requirements exactly to apply. Our hiring process is described in more detail in our hiring handbook.
GitLab Inc. is a company based on the GitLab open-source project. GitLab is a community project to which over 1,000 people worldwide have contributed. We are an active participant in this community, trying to serve its needs and lead by example. We have one vision: everyone can contribute to all digital content, and our mission is to change all creative work from read-only to read-write so that everyone can contribute.
We value results, transparency, sharing, freedom, efficiency, frugality, collaboration, directness, kindness, diversity, boring solutions, and quirkiness. If these values match your personality, work ethic, and personal goals, we encourage you to visit our primer to learn more. Open source is our culture, our way of life, our story, and what makes us truly unique.
Top 10 reasons to work for GitLab:
- Work with helpful, kind, motivated, and talented people.
- Work remotely so you have no commute and are free to travel and move.
- Have flexible work hours so you are there for other people and free to plan the day how you like.
- Everyone works remotely, but you don't feel remote. We don't have a head office, so you're not in a satellite office.
- Work on open source software so you can interact with a large community and can show your work.
- Work on a product you use every day: we drink our own wine.
- Work on a product used by lots of people who care about what you do.
- As a company we contribute more than we take: most of our work is released as the open source GitLab CE.
- Focused on results, not on long hours, so that you can have a life and don't burn out.
- Open internal processes: know what you're getting into and be assured we're thoughtful and effective.
See our culture page for more!
Work remotely from anywhere in the world. Curious to see what that looks like? Check out our remote manifesto.