
Site Reliability Engineer

Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other GitLab production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply sound engineering principles, operational discipline, and mature automation to our environments and the GitLab codebase. We specialize in systems, whether it be networking, the Linux kernel, or some more specific interest in scaling, algorithms, or distributed systems.

GitLab.com is a unique site and it brings unique challenges–it’s the biggest GitLab instance in existence. In fact, it’s one of the largest single-tenancy open-source SaaS sites on the internet. The experience of our team feeds back into other engineering groups within the company, as well as to GitLab customers running self-managed installations.

As an SRE you will:

You may be a fit for this role if you:

Projects you could work on:

Leveling of Site Reliability Engineering at GitLab

Areas of expertise/contribution for Leveling

Technical:

Execution:

Collaboration and Communication:

Influence and Maturity

Levels for Site Reliability Engineer

Junior Site Reliability Engineer

Technical:

  1. Updates GitLab default values so there is no need for configuration by customers.
  2. General knowledge of 2 of the areas of technical expertise

Execution:

  1. Provides emergency response either by being on-call or by reacting to symptoms according to monitoring and escalation when needed
  2. Delivers production solutions that scale, identifies automation points, and proposes ideas on how to improve efficiency.
  3. Improves monitoring and alerting, fighting alert spam.

Collaboration and Communication:

  1. Improves documentation all around, either in application documentation, or in runbooks, explaining the why, not stopping with the what.

Influence and Maturity

  1. Shares the learnings publicly, either by creating issues that provide enough context for anyone to understand them or by writing blog posts.

Site Reliability Engineer

Technical:

  1. General knowledge of 4 of the areas of technical expertise, with deep knowledge in 1 area

Execution:

  1. Provides emergency response either by being on-call or by reacting to symptoms according to monitoring and escalation when needed
  2. Proposes ideas and solutions within the infrastructure team to reduce the workload by automation.
  3. Plans, designs, and executes solutions within the infrastructure team to reach specific goals agreed within the team.
  4. Plans and executes configuration change operations both at the application and the infrastructure level.
  5. Actively looks for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation

Collaboration and Communication:

  1. Improves documentation all around, either in application documentation, or in runbooks, explaining the why, not stopping with the what.

Influence and Maturity

  1. Shares the learnings publicly, either by creating issues that provide enough context for anyone to understand them or by writing blog posts.
  2. Contributes to the hiring process by reviewing questionnaires or by being part of the interview team to qualify SRE candidates

Senior Site Reliability Engineer

Are experienced Site Reliability Engineers who meet the following criteria:

Technical:

  1. Deep knowledge in 2 areas of expertise and general knowledge of all areas of expertise. Capable of mentoring Junior SREs in all areas and other SREs in their areas of deep knowledge.
  2. Contributes small improvements to the GitLab codebase to resolve issues

Execution:

  1. Identifies significant projects that result in substantial cost savings or revenue
  2. Identifies changes for the product architecture from the reliability, performance and availability perspective with a data driven approach.
  3. Proactively works on efficiency and capacity planning to set clear requirements and reduce system resource usage, making GitLab cheaper to run for all our customers.
  4. Identifies parts of the system that do not scale, provides immediate palliative measures, and drives long-term resolution of these incidents.
  5. Identifies Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.

Collaboration and Communication:

  1. Knows a domain really well and radiates that knowledge
  2. Performs and runs blameless RCAs on incidents and outages, aggressively looking for answers that will prevent the incident from ever happening again.

Influence and Maturity:

  1. Leads Production SREs and Junior Production SREs by setting an example.
  2. Shows ownership of a major part of the infrastructure.
  3. Trusted to de-escalate conflicts inside the team

Staff Site Reliability Engineer

Are Senior Production SREs who meet the following criteria:

Technical:

  1. Able to create innovative solutions that push GitLab's technical abilities ahead of the curve
  2. Deep knowledge of GitLab and 4 areas of expertise. Enough knowledge of each area of expertise to mentor and guide other team members in those areas.
  3. Contributes to GitLab codebase to resolve issues and add new functionality

Execution:

  1. Strives for automation either by coding it or by leading and influencing developers to build systems that are easy to run in production.
  2. Measures the risk of introduced features to plan ahead and improve the infrastructure.
  3. Proposes and drives architectural changes that affect the whole company to solve scaling and performance problems
  4. Leads significant project work for OKR level goals for the team

Communication and Collaboration:

  1. Works with engineers across the whole company, influencing design to create features that will work well with SaaS and self-hosted platforms
  2. Runs RCAs and epic level planning meetings to get meaningful work scheduled into the plan

Influence and Maturity:

  1. Writes in-depth documentation that shares knowledge and radiates GitLab technical strengths
  2. Has a high level of self awareness
  3. Trusted to de-escalate conflicts inside and outside the team
  4. Routinely has an impact on the broader Engineering organization
  5. Helps to develop other team members into senior levels and leaders in the team

Distinguished Site Reliability Engineer

TBD

Engineering Fellow, Infrastructure

The Infrastructure Fellow embodies all the requirements of less senior roles on this page. In addition, the role is closely associated with the Engineering Fellow role in our Development Department.

  1. Drive the technical strategy of our GitLab.com Infrastructure
  2. Heavily influence the technical strategy of our open-source application maintained by our Development Department
  3. Make skill-gap recommendations for future hiring in Infrastructure and other departments
  4. Author technical vision artifacts with >1 year time horizon
  5. Assist teams throughout Engineering to interpret this vision into actionable backlogs
  6. Help Engineering avoid the architecture "ivory tower"
  7. Spend time with customers to learn their needs

Performance Indicators

Site Reliability Engineers have the following job-family performance indicators:

Hiring Process

All interviews are conducted using Zoom video conferencing software. To learn more about someone conducting your interview, find their job title on our team page.

Please keep in mind that you can be declined at any stage of the process. You should consider each of the following bullets as though the words "If selected" precede them.

You may have additional 60-minute interviews with the Director of Infrastructure Engineering, the VP of Engineering, or both.

If approved, you will subsequently be made an offer.

Additional details about our process can be found on our hiring page.

We are an equal opportunity employer and value diversity and inclusion at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Compensation

Components

Compensation at GitLab consists of:

  1. Base salary
  2. Options
  3. Bonus in some functions
  4. Discretionary bonus
  5. Retirement contributions in eligible regions
  6. Pay for equipment
  7. Remote work
  8. Unlimited time off
  9. Benefits

Apply

Please note that if we are actively hiring for a position, you will see it listed on our jobs page, where all of our current openings are advertised. To apply, please click on the name of the role you are interested in, which will take you to our applicant tracking system (ATS), Greenhouse.

Avoid the confidence gap; you do not have to match all the listed requirements exactly to apply. Our hiring process is described in more detail in our hiring handbook.

About GitLab

GitLab Inc. is a company based on the GitLab open-source project. GitLab is a community project to which over 1,000 people worldwide have contributed. We are an active participant in this community, trying to serve its needs and lead by example. We have one vision: everyone can contribute to all digital content, and our mission is to change all creative work from read-only to read-write so that everyone can contribute.

We value results, transparency, sharing, freedom, efficiency, frugality, collaboration, directness, kindness, diversity and inclusion, boring solutions, and quirkiness. If these values match your personality, work ethic, and personal goals, we encourage you to visit our primer to learn more. Open source is our culture, our way of life, our story, and what makes us truly unique.

Top 10 reasons to work for GitLab:

  1. Work with helpful, kind, motivated, and talented people.
  2. Work remote so you have no commute and are free to travel and move.
  3. Have flexible work hours so you are there for other people and free to plan the day how you like.
  4. Everyone works remote, but you don't feel remote. We don't have a head office, so you're not in a satellite office.
  5. Work on open source software so you can interact with a large community and can show your work.
  6. Work on a product you use every day: we drink our own wine.
  7. Work on a product used by lots of people that care about what you do.
  8. As a company we contribute more than we take; most of our work is released as the open source GitLab CE.
  9. Focused on results, not on long hours, so that you can have a life and don't burn out.
  10. Open internal processes: know what you're getting into and be assured we're thoughtful and effective.

See our culture page for more!

Work remotely from anywhere in the world. Curious to see what that looks like? Check out our remote manifesto.