- Site Reliability Engineer
Site Reliability Engineers (SREs) take on problems that require both development and operations expertise. For example, an SRE may solve distributed computing and/or concurrency problems that affect both our application and our infrastructure. An SRE works closely within a team of developers to make sure that the service or feature set being developed will reach its target metrics on availability and latency, and that the solutions are scalable and reliable once deployed to production on GitLab.com.
To join the team as an SRE, applicants must qualify both as a developer and as a production engineer.
- Work with developers to make their service or set of features ("service" for brevity) reliable.
- Contribute modular, well-tested, and maintainable code.
- Write production-ready code with little assistance.
- Write complex code that can scale with a significant number of users.
- Fix performance issues on GitLab.com using our existing tools, improve those tools where needed, and provide guidance to others.
- Develop monitoring and alerting to measure and act on improving the availability and scalability of the service on GitLab.com.
- Manage the infrastructure related to the service.
- Radiate knowledge to the infrastructure team about the service, and radiate knowledge of the service's infrastructure and reliability to the rest of the development team.
- Together with other SREs and Production Engineers, design, build, and maintain core infrastructure pieces that allow GitLab to scale to support hundreds of thousands of concurrent users.
- Identify parts of the system that do not scale, provide immediate palliative measures, and drive long-term resolution of these incidents.
- Participate in on-call rotation to respond to GitLab.com availability incidents, and use your on-call rotation to prevent pages from ever happening.
- Document every action so your learnings turn into repeatable actions and then into automation.
- Debug application and production issues across services and levels of the stack.
- Ship every solution into the GitLab-CE and EE package as a default.
- You can reason about software, algorithms, and performance from a high level.
- You have experience thinking about systems - edge cases, failure modes, behaviors, and specific implementations.
- You have worked with distributed systems and have a solid understanding of how modern web stacks are built, and why.
- You are passionate about open source.
- You have worked on a production-level Ruby application, preferably using Rails.
- You know how to write your own Ruby gem using TDD techniques
- You know your way around Linux and the Unix Shell.
- Strong written communication skills
- Experience with Docker, Nginx, Go, and Kubernetes is a plus
- Experience with online community development is a plus
- Self-motivated with strong organizational skills
- You share our values, and work in accordance with those values.
- A technical interview is part of the hiring process for this position.