The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc.

Overview
- Demo
- Vision
- Challenges
- Opportunities
Target Audience and Experience
Roadmap

Overview

Downtime is expensive - it costs companies hundreds of thousands of dollars, or more, in a single hour.

While downtime avoidance is preferable, downtime is not avoidable. As such, it is imperative that organizations are geared towards being able to respond to production problems efficiently and effectively. Put another way, organizations need to be resilient.

The majority of investment and focus of the DevOps industry (including GitLab) to date has been on downtime avoidance. There are some entrenched competitors approaching incident management from the perspective of workflows (ServiceNow) or incident notification (PagerDuty, Opsgenie). Despite this, holistic incident management products are lacking, resulting in many organizations stitching together point solutions mixed with a healthy dose of DIY. We believe that many organizations are looking for ways to avoid reinventing the wheel on incident management.

Demo

Check out this short video from our engineer Paulina that demos incident tags.

To see past demos take a look at the Respond Group playlist or click the link(s) below.

August 2022 - What we've built so far; on-call schedules, escalation policies, alerts, and incidents.
September 2022 - GitLab Incidents as the single source of truth (SSOT).
October 2022 - Important incident metrics and how you will be able to track them in GitLab Incidents.
November 2022 - Slack App for Incident Management design and epic overview
December 2022 - A short demo video that highlights our designs for our Post-incident review MVC.

Vision

GitLab Incident Management helps teams build resiliency in their software and processes against downtime, outages, and other unexpected situations.

We plan to achieve this by:

Capturing the single source of truth (SSOT) with incident timelines.
Continuously improving software and process resiliency through incident reviews and incident dashboards.
Facilitating planning and coordination through the use of on-call schedule management.
Boosting signals and notifying on-call responders through the use of alert integration and escalations.
Communicating with customers via status pages.

We've also created a series of vision items for the category, which are visible in this issue. A video walk-through of our vision items is also available on YouTube.

Challenges

As we invest R&D in building out Incident Management at GitLab, we are faced with the following challenges:

The market is dominated by incident management companies that have been around for longer. Specific examples are included in the Competitive Landscape section.
We lack brand identification with Enterprise Ops buyers (also mentioned on the Ops Vision page)
Incident management tools typically work in conjunction with other monitoring tools. Having a robust ecosystem of integration with monitoring tools is an expensive investment.

Opportunities

We are uniquely positioned to take advantage of the following opportunities:

Colocation of code and incidents significantly reduces context switching and accelerates MTTR. We can correlate development events such as merge requests or deployments with incidents, shortening the time it takes to find the root cause and automating some of the work required to prepare a timeline of events necessary for Post Incident Reviews.
We are well-practiced in building boring solutions and iteration. This will enable us to quickly produce a simple version of Incident Management that is "just-good-enough" to displace DIY solutions.
We can uniquely serve the needs of Operations Managers who struggle to answer the question - "Are my teams spending all their time firefighting, or are they proactively managing the health of their applications?"

Target Audience and Experience

Our current Incident Management tools have been built for users who align with our Allison (Application Ops) and Ingrid (Infrastructure Operator) personas. The experience targets DevOps teams at smaller companies where it is common for the engineers to be on-call and responding to alerts for the software that they also write code for. As we mature this category, we will evolve the experience to appeal to and serve the enterprise customer. Here is a list of all jobs to be done (JTBD) for incident management.

Roadmap

Now - (next 1-4 milestones)

GitLab Incidents are the single source of truth (SSOT)

Prioritized Development Work

Incident metrics are a set of standard, quantifiable measurements used to track the incident response process. We are also working on allowing incident response teams to track timestamps in incidents!
The majority of do it yourself (DIY) users use Slack or similar apps as the main collaboration tool during an incident. We want to make it easy for incident response teams to use Slack and GitLab incident management, ensuring Slack-to-GitLab workflows are seamless to minimize repetitive, manual work for incident response teams. For our internal team, we plan to replace woodhouse with a productized Slack integration.
Every incident is an investment and users want to improve their incident response posture and learn from past incidents. We're introducing a post-incident review process so teams can learn from past incidents and better prepare for future, similar events.

How are we tracking success?

Ship Slack App for Incident Management; measure the number of integrations set up and work on a go-to-market (GTM) effort directly with Slack.
Ship Incident Review. We will track usage, particularly percentage of incident reviews conducted relative to the number of ~"severity::1" incidents created.
Ship Incident Tags MVC so users can start measuring important incident metrics.
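
The incident review success metric above is a simple ratio. As a minimal sketch, assuming each incident carries a severity label and a flag for whether a review was conducted (the `Incident` class and `has_review` field are hypothetical names for illustration, not GitLab's data model), it could be computed like this:

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: str     # label text, e.g. "severity::1"
    has_review: bool  # hypothetical flag: a post-incident review was conducted


def review_coverage(incidents, severity="severity::1"):
    """Percentage of incidents at `severity` that have a review.

    Returns None when there are no incidents at that severity,
    so an empty period is not reported as 0%.
    """
    matching = [i for i in incidents if i.severity == severity]
    if not matching:
        return None
    reviewed = sum(1 for i in matching if i.has_review)
    return 100.0 * reviewed / len(matching)


# Illustrative data only: four severity::1 incidents, three of them reviewed.
incidents = [
    Incident("severity::1", True),
    Incident("severity::1", False),
    Incident("severity::2", True),
    Incident("severity::1", True),
    Incident("severity::1", True),
]
coverage = review_coverage(incidents)
print(coverage)  # 75.0
```

Returning `None` for an empty set is a design choice in this sketch; a dashboard might prefer to hide the metric in that case.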

Next (next 5-8 milestones)

On-call incident responders never miss an alert

Prioritized Development Work

We are working on ensuring the on-call schedule experience is accurate and flexible enough to fit incident response teams' needs. Sometimes responders need to get their shifts covered; we'll be introducing scheduled overrides. Additionally, being on-call during normal working/business hours isn't the same as being on-call during a Friday evening or weekend; we're introducing the ability for on-call schedules to allow for specific days.
Responders need to be automatically notified when an alert is triggered and they want to define how they are contacted/paged. This will increase their responsiveness to an alert that needs to be triaged. We will be introducing paging options beyond email like a phone call or an SMS text message.
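
To illustrate how scheduled overrides and day-specific shifts might interact, here is a minimal sketch. It assumes an override always takes precedence over the regular schedule; the `Shift` and `Override` classes, the responder names, and the resolution order are all hypothetical and do not describe the shipped GitLab implementation.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Shift:
    responder: str
    weekdays: set  # days covered, Monday=0 .. Sunday=6


@dataclass
class Override:
    responder: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def on_call_at(when, shifts, overrides):
    """Return who is on call at `when`, or None if nobody is scheduled."""
    # Assumption: an active override wins over the regular schedule.
    for override in overrides:
        if override.start <= when < override.end:
            return override.responder
    # Otherwise fall back to the first shift covering that weekday.
    for shift in shifts:
        if when.weekday() in shift.weekdays:
            return shift.responder
    return None


shifts = [
    Shift("alex", {0, 1, 2, 3, 4}),  # weekdays
    Shift("sam", {5, 6}),            # weekends
]
# "jo" covers all of Saturday 2023-01-07 for "sam".
overrides = [Override("jo", datetime(2023, 1, 7), datetime(2023, 1, 8))]

print(on_call_at(datetime(2023, 1, 4, 10), shifts, overrides))  # alex
print(on_call_at(datetime(2023, 1, 7, 12), shifts, overrides))  # jo
print(on_call_at(datetime(2023, 1, 8, 12), shifts, overrides))  # sam
```

A real schedule would also need time zones, rotation lengths, and a rule for overlapping overrides, which this sketch leaves out.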

How are we tracking success?

Users subscribe to usage-based pricing for on-call schedules (more than 5 pages a month).
The number of on-call schedules monthly active users increases.
GitLab On-Call reaches viable maturity!

Future (next 9-12 milestones)

Alerts can be easily triaged

Prioritized Development Work

We will continue to build more incident management features, making GitLab alerts more robust so teams can quickly and effectively triage them.
We will work across stages, ensuring GitLab events that warrant an alert, like a failed pipeline, are surfaced in the alert list.

Exploratory work to inform future priorities

How are we tracking success?

Through cross-stage adoption we see an increase in our Group Monthly Active Users (GMAU).
GitLab Incident Management reaches complete maturity!

Long Term (2+ years, FY26 and beyond)

Users come to GitLab for their incident management solution

Strong bi-directional product tie-in to other GitLab stages and Monitor categories including Error Tracking, Tracing, Logging, Metrics, Continuous Verification, Service Catalog and Runbooks. What could this look like in practice?
- Users can see auto-generated incidents directly in their alerting tool.
- Users can quickly identify what service, code change, or customer is experiencing degraded performance.
- Users leverage automation to link past incidents to current incidents, creating merge requests with a proposed solution or pointing engineers on-call to a knowledge base.
- Users can execute suggested runbooks with the click of a button.
Users can quantify and show a decrease in mean time to resolve (MTTR). GitLab takes decreasing MTTR one step further; users are able to demonstrate that they catch more incidents before they happen through continuous verification, chaos engineering, and capacity forecasting.

Workflow

Incident Management is a broad category. The following diagram explains all functionality that is currently within scope for our vision of the category.

Maturity Plan

We are currently working to mature the Incident Management category from viable to complete. Definitions of these maturity levels can be found on GitLab's Maturity page. The following epics group the functionality we have planned to mature Incident Management.

What is Next & Why?

Introduce incident management workflow to GitLab Slack App to ensure Slack-to-GitLab workflows are seamless to minimize repetitive, manual work for incident response teams.
Incident Tags MVC to start capturing relevant incident timestamps to begin measuring important incident metrics like mean time to resolve (MTTR).
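
As a rough illustration of how captured timestamps turn into a metric, the sketch below computes MTTR as the average of resolution time minus start time. The `(start, end)` tuple representation is an assumption made for this example, not the format GitLab stores incident tags in.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average of (end - start) over incidents that have both timestamps.

    `incidents` is an iterable of (start, end) datetime pairs; an incident
    still in progress has end=None and is skipped. Returns None if no
    incident has both timestamps.
    """
    durations = [end - start for start, end in incidents if start and end]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Illustrative data only: a 30-minute incident, a 90-minute incident,
# and one that is still open.
incidents = [
    (datetime(2023, 1, 2, 9, 0), datetime(2023, 1, 2, 9, 30)),
    (datetime(2023, 1, 5, 14, 0), datetime(2023, 1, 5, 15, 30)),
    (datetime(2023, 1, 9, 8, 0), None),
]
mttr = mean_time_to_resolve(incidents)
print(mttr)  # 1:00:00
```

Skipping open incidents keeps the average from being skewed, but it also means a long-running unresolved incident is invisible in this number until it closes.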

Dogfooding Plan

We are actively dogfooding Incident Management features with the Infrastructure team. Today, the Infrastructure team relies partially on PagerDuty to maintain GitLab.com and the other services they are responsible for. Ultimately the joint goal of the Infrastructure team and the Respond group is for the Infrastructure team to rely solely on GitLab Incident Management. Our plan to achieve this goal is as follows:

Prioritize new functionality based on the gap analysis
Meet with the Infrastructure team on a monthly cadence to gather feedback and incorporate changes into upcoming milestones.
Incrementally dogfood new features via simulation days (example) to gather immediate feedback for improvements
Begin a full migration once we have completed filling in the gap analysis - view migration plan here

Incident Management features the Infrastructure team is currently Dogfooding

Feature List

General Feature	Specific Feature	Dogfooding?	Example	Feature needs 'x' to dogfood
Incidents	Incident issue type	✅
	Creating incidents manually	✅
	Creating incidents automatically	✅	Sample incident created via ops.gitlab.net
	Incident timelines	✅
	Linked resources	🔴	TBD, just released, Dogfood issue
	Creating incidents via the PagerDuty webhook	🔴
	Incident list	🔴		Labels need to be included on the incident list.
	Metrics tab	🔴		There isn't a working integration with our observability vendor. Metrics are added as screenshots (example) to the incident.
	Alert details tab	🔴		Not currently dogfooding GitLab alerts
	Service Level Agreement countdown timer	🔴		SLAs aren't based on a per incident basis
Alerts	GitLab Alerts	🔴		Alert improvements are noted here
	Alert list	🔴		Dependent on dogfooding alerts.
	Alert details tab	🔴		Dependent on dogfooding alerts.
	Metrics tab	🔴		Dependent on dogfooding alerts.
	HTTP endpoints	🔴		Mapping a complex payload to the custom mapping was cumbersome. Alerts showed a new alert when the payload changed.
	Prometheus integration	🔴		Right now PagerDuty is the single source of truth for alerts. There is no value beyond associating GitLab Alerts with GitLab Incidents.
	Grouping of identical alerts	🔴		Dependent on dogfooding alerts. Looking for the ability to manually add similar alerts to the same incident.

Marketing & Sales Enablement

Marketing and Sales Enablement material can be found here.

Pricing

Features in the Incident Management category have been placed in tiers based on GitLab's Buyer Based Tiering strategy. The following pricing plan represents existing and future features.

Incidents

Functionality	Free	Premium	Ultimate
Manual incident creation	✅	✅	✅
Incident creation based on limited criteria (e.g. integration or severity)		✅	✅
Incident timelines	✅	✅	✅
Incident tags	✅	✅	✅
Linked resources		✅	✅
Incident Reviews		✅	✅
Slack app for incident management			✅
Incident dashboards			✅

Alerts

Functionality	Free	Premium	Ultimate
One Generic HTTP Endpoint	✅	✅	✅
Internal GitLab Alerts	✅	✅	✅
Email integration		✅	✅
Multiple email endpoints		✅	✅
External Prometheus integration	✅	✅	✅
Custom mapping for alert formats to endpoints		✅	✅
Special bi-directional out of the box integrations with popular monitoring tools			✅

Competitive Landscape

Name of Competitor	Year Founded	Relative Links
Atlassian Opsgenie	2012	Website Link
AWS Incident Manager	2021	Website Link
Grafana OnCall (Previously Amixr)	2018, acquired by Grafana in 2021	Website Link Competitive Analysis
Jeli	2019	Website Link
PagerDuty	2009	Website Link
Rootly	2020	Website Link Competitive Analysis
ServiceNow	2003	Website Link
Splunk On-Call (Previously VictorOps)	2012, acquired by Splunk in 2018	Website Link
xMatters	2000	Website Link
FireHydrant	2018	Website Link Competitive Analysis
Blameless	2017	Website Link Competitive Analysis

Analyst Landscape

Analyst firms such as Gartner and 451 have recently published articles on the rising prevalence of automation in incident response workflows.

Gartner

Gartner's recent research study titled Automate Incident Response to Enhance Incident management focuses on the importance of leveling up manual incident response processes with automation: "Organizations targeting best-in-class incident management must address the manual processes and collaboration challenges between teams." They go on to outline some of their key findings, which highlight that "I&O organizations are looking to enhance incident response by focusing on automation, third-party integration, stakeholder management and improved detection response feedback loops."

Their recommendations include the following:

"Invest in a centralized on-call management system and automate incident response workflows with wide integrations that create a holistic incident response management solution."
"Integrate monitoring solutions and Service Desk systems with bidirectional synchronization to incident response systems, which keeps the incident status synchronized across systems."
"Leverage automation to extend incident response capabilities that can integrate with DevOps toolchain monitoring."
"Improve incident communication and collaboration by integrating incident workflow processes with ChatOps tools like Slack or Microsoft Teams."

Gartner, Automate Incident Response to Enhance Incident management, By: Venkat Rayapudi & Steve White, Published 18 September 2020

Competitors (listed above in competitive landscape) enable the automation of these processes to different extents. Automation functionality is typically offered with higher pricing tiers across the board. In order to take advantage of these automation features, companies must invest significant time in the configuration and fine-tuning of systems and processes.

In the near term, GitLab is positioned to enable Gartner’s recommendations for a best-in-class incident management platform via the centralization of on-call schedule management to enable the automatic routing of alerts to the right responders at the right time.

451 Research

451 Research published an article on the acquisition of Rundeck by PagerDuty in September 2020. Read more about this on Rundeck's website. This was a strategic move to meet the demands of the enterprise for more automation in incident response.

GitLab has plans to investigate using Rundeck for Runbooks via Runbooks - Rundeck Validation. This will be an interesting opportunity to connect the PagerDuty lifecycle into GitLab Runbooks and Monitoring capabilities.

There is an existing landscape of comparable tools. According to 451 Research, "ServiceNow and xMatters have orchestration engines that can be deployed to build workflows across tools, but they aren't typically extensively used to execute remediations." VictorOps (owned by Splunk) and Opsgenie (owned by Atlassian) are other similar tools with visions like PagerDuty's.

Category Direction - Incident Management
