
Category Direction - Incident Management

Introduction and how you can help

Thanks for visiting this category page on Incident Management in GitLab. This page belongs to the Health group of the Monitor stage, and is maintained by Kevin Chu who can be contacted directly via email. This vision is a work in progress and everyone can contribute. Sharing your feedback directly on issues and epics at GitLab.com is the best way to contribute to our vision. If you’re a GitLab user and have direct knowledge of your need for incident management, we’d especially love to hear from you.

Overview

Downtime is expensive, and that cost is growing as the reliability of a software service becomes as important as the features and functionality within the product. It doesn't matter what your product can do if your customers can't reliably access it. Downtime has been known to cost companies hundreds of thousands of dollars in a single hour. This number, though an estimate based on a wide range of companies, communicates that avoiding downtime is critical for organizations. It's important that companies invest time to cultivate process and culture around managing outages and resolving them quickly. The larger an organization becomes, the more distributed its systems and teams tend to be. This distribution leads to longer response times and more money lost for the business. Investing in the right tools and fostering a culture of autonomy, feedback, quality, and automation leads to more time spent innovating and building software and less time spent reacting to outages and racing to restore services. The tools your DevOps teams use to respond during incidents critically affect MTTR (Mean Time To Resolve, also known as Mean Time To Repair) as well as the happiness and morale of team members responsible for the IT services your business depends on.

A complete incident management solution consumes inputs from alerting sources, transforms those inputs into actionable incidents, routes them to the responsible party, and then empowers response teams to quickly understand and remediate the problem at hand. These pathways differ from company to company. Response teams will be most successful if their incident management tool allows them to define and optimize their incident management workflows such that critical incidents are expedited and urgently resolved, while lower severity incidents are handled accordingly. In short, an Incident Management tool should closely align to the processes by which DevOps teams manage other tickets and provide options for customization depending on the criticality of the incident.
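The pipeline described above (consume an alert, turn it into an incident, route it to a responder, expedite critical cases) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Alert` and `Incident` types, the severity names, the `on_call` mapping, and the `triage-queue` fallback are all hypothetical and do not represent GitLab's implementation or API.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical severity scale, ordered from least to most urgent.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


@dataclass
class Alert:
    source: str    # e.g. the monitoring tool that fired the alert
    title: str
    severity: str  # one of SEVERITY_ORDER
    service: str   # the affected service


@dataclass
class Incident:
    title: str
    severity: str
    service: str
    assignee: Optional[str] = None
    expedited: bool = False


def alert_to_incident(alert: Alert, min_severity: str = "high") -> Optional[Incident]:
    """Create an incident only when the alert meets the severity threshold."""
    if SEVERITY_ORDER.index(alert.severity) < SEVERITY_ORDER.index(min_severity):
        return None
    return Incident(
        title=f"[{alert.source}] {alert.title}",
        severity=alert.severity,
        service=alert.service,
    )


def route(incident: Incident, on_call: Dict[str, str],
          default: str = "triage-queue") -> Incident:
    """Assign the incident to the on-call responder for its service."""
    incident.assignee = on_call.get(incident.service, default)
    # Only critical incidents are expedited in this sketch.
    incident.expedited = incident.severity == "critical"
    return incident


def handle(alert: Alert, on_call: Dict[str, str]) -> Optional[Incident]:
    """Full pathway: alert in, routed incident (or nothing) out."""
    incident = alert_to_incident(alert)
    return route(incident, on_call) if incident else None
```

In this sketch the severity threshold and the routing table are the two points a team would customize, which mirrors the idea that workflows should vary with the criticality of the incident.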

Mission

Our mission is to help DevOps teams reduce MTTR by streamlining the triage and resolve workflows via tools that provide access to observability resources (metrics, logs, errors, runbooks, and traces), that foster easy collaboration across response teams, and that support continuous improvement via Post Incident Reviews and system recommendations.

Challenges

As we invest R&D in building out Incident Management at GitLab, we are faced with the following challenges:

Opportunities

We are uniquely positioned to take advantage of the following opportunities:

Target Audience and Experience

Our current Incident Management tools have been built for users who align with our Allison (Application Ops) and Devon (DevOps Engineer) personas. The experience targets DevOps teams at smaller companies where it is common for the engineers to be on-call and responding to alerts for the software that they also write code for. As we mature this category, we will evolve the experience to appeal to and serve the enterprise customer.

Strategy

Scope

Incident Management is a broad category. The following diagram explains all functionality that is currently within scope for our vision of the category.

image.png

Maturity Plan

We are currently working to mature the Incident Management category from viable to complete. Definitions of these maturity levels can be found on GitLab's Maturity page. The following epics group the functionality we have planned to mature Incident Management.

What is Next & Why?

Processing alerts during a fire-fight requires responders to coordinate across multiple tools to evaluate different data sources. This is time-consuming because every time a responder switches to a new tool, they are confronted with a new interface and different interactions, which is disorienting and slows down investigation, collaboration, and the sharing of findings with teammates. Actionable alerts and incidents accelerate the fire-fight by enabling efficient knowledge sharing, providing guidelines for resolution, and minimizing the number of tools you need to check before finding the problem. In support of this, we are pursuing the following functionality in the next 2-3 releases:

…and much more! Please follow along in this epic to contribute to our plan.

Dogfooding Plan

We are actively dogfooding Incident Management features with the Infrastructure team. Today, the Infrastructure team relies partially on PagerDuty to maintain GitLab.com and the other services they are responsible for. Ultimately the joint goal of the Infrastructure team and the Monitor:Health group is for the Infrastructure team to be able to rely on GitLab Incident Management solely. Our plan to achieve this goal is as follows:

  1. Prioritize new functionality based on the gap analysis
  2. Incrementally dogfood new features via simulation days (example) to gather immediate feedback for improvements
  3. Begin a full migration once we have completed filling in the gap analysis - view migration plan here

On-call Schedule Management will also be dogfooded by the Support and Security Ops teams once we've built the functionality they require to manage their global on-call teams. Dogfooding plans for these teams are TBD.

Marketing & Sales Enablement

Marketing and Sales Enablement material can be found here.

Pricing

Features in the Incident Management category have been placed in tiers based on GitLab's Buyer Based Tiering strategy. The following pricing plan represents existing and future features.

Functionality Free Premium Ultimate
ALERT INTEGRATIONS      
Generic HTTP Endpoint
Multiple HTTP endpoints  
Email integration  
Multiple email endpoints  
External Prometheus integration
Add custom mapping for alert formats to endpoints  
Special bi-directional out of the box integrations with popular monitoring tools    
INCIDENTS      
Manual Incident Creation
Incident creation based on limited criteria (e.g. integration or severity)  
Incident creation based on extensive criteria    
Incident payload transformations    
ON-CALL SCHEDULE MANAGEMENT      
Create multiple schedules  
Escalation policies   ✅ (single) ✅ (multiple)
Routing rules for alerts    
RUNBOOKS      
Link runbooks to alerts via simple URL input - link appears in alert
Automatically render linked runbooks in alerts/incidents  
Create new runbook when creating alert  

Competitive Landscape

Atlassian Opsgenie
Splunk On-Call (previously known as VictorOps, acquired by Splunk in 2018)
PagerDuty
ServiceNow
xMatters

Analyst Landscape

Analyst firms such as Gartner and 451 have recently published articles on the rising prevalence of automation in incident response workflows.

Gartner

Gartner's recent research study, titled Automate Incident Response to Enhance Incident management, focuses on the importance of leveling up manual incident response processes with automation: "Organizations targeting best-in-class incident management must address the manual processes and collaboration challenges between teams." They go on to outline some of their key findings, which highlight that "I&O organizations are looking to enhance incident response by focusing on automation, third-party integration, stakeholder management and improved detection response feedback loops."

Their recommendations include the following:

Gartner, Automate Incident Response to Enhance Incident management, By: Venkat Rayapudi & Steve White, Published 18 September 2020

Competitors (listed above in the competitive landscape) enable the automation of these processes to different extents. Automation functionality is typically offered in higher pricing tiers across the board. In order to take advantage of these automation features, companies must invest significant time in the configuration and fine-tuning of systems and processes.

In the near term, GitLab is positioned to enable Gartner's recommendations for a best-in-class incident management platform via the centralization of on-call schedule management, which enables the automatic routing of alerts to the right responders at the right time. When we begin working on maturing Incident Management to Lovable (plan), we will be adding rule sets that enable users to automate the creation of actionable incidents.

451 Research

451 Research published an article on the acquisition of Rundeck by PagerDuty in September 2020. Read more about this on Rundeck's website. This was a strategic move to meet the demands of the enterprise for more automation in incident response.

GitLab has plans to investigate using Rundeck for Runbooks via gitlab#36655. This will be an interesting opportunity to connect the PagerDuty lifecycle into GitLab Runbooks and Monitoring capabilities.

There is an existing landscape of comparable tools, and even "ServiceNow and xMatters have orchestration engines that can be deployed to build workflows across tools, but they aren't typically extensively used to execute remediations." VictorOps (owned by Splunk) and Opsgenie (owned by Atlassian) are other similar tools with visions similar to PagerDuty's.

Top Customer Success/Sales Issue(s)

Not yet, but accepting merge requests to this document.

Top Customer Issue(s)

Not yet, but accepting merge requests to this document.

Top Internal Customer Issue(s)

Not yet, but accepting merge requests to this document.

Top Vision Item(s)

Not yet, but accepting merge requests to this document.
