The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc.
Downtime is expensive - it costs companies hundreds of thousands of dollars, or more, in a single hour.
While downtime avoidance is preferable, downtime is not avoidable. As such, it is imperative that organizations are geared towards being able to respond to production problems efficiently and effectively. Put another way, organizations need to be resilient.
The majority of the DevOps industry's investment and focus to date (including GitLab's) has been on downtime avoidance. Some entrenched competitors approach incident management from the perspective of workflows (ServiceNow) or incident notification (PagerDuty, Opsgenie). Despite this, holistic incident management products are lacking, which leaves many organizations stitching together point solutions mixed with a healthy dose of DIY. We believe that many organizations are looking for ways to avoid reinventing the wheel on incident management.
Check out this short video from our engineer Paulina that demos incident tags.
To see past demos take a look at the Respond Group playlist or click the link(s) below.
GitLab Incident Management helps teams build resiliency in their software and processes against downtime, outages, and other unexpected situations.
We plan to achieve this by:
We've also created a series of vision items for the category, which are visible in this issue. A video walk-through of our vision items is also available on YouTube.
As we invest R&D in building out Incident Management at GitLab, we are faced with the following challenges:
We are uniquely positioned to take advantage of the following opportunities:
Our current Incident Management tools have been built for users who align with our Allison (Application Ops) and Ingrid (Infrastructure Operator) personas. The experience targets DevOps teams at smaller companies where it is common for the engineers to be on-call and responding to alerts for the software that they also write code for. As we mature this category, we will evolve the experience to appeal to and serve the enterprise customer. Here is a list of all jobs to be done (JTBD) for incident management.
GitLab Incidents are the single source of truth (SSOT)
Prioritized Development Work
How are we tracking success?
On-call incident responders never miss an alert.
Prioritized Development Work
How are we tracking success?
Alerts can be easily triaged
Prioritized Development Work
Exploratory work to inform future priorities
How are we tracking success?
Users come to GitLab for their incident management solution
Incident Management is a broad category. The following diagram explains all functionality that is currently within scope for our vision of the category.
We are currently working to mature the Incident Management category from viable to complete. Definitions of these maturity levels can be found on [GitLab's Maturity page](https://about.gitlab.com/direction/#maturity). The following epics group the functionality we have planned to mature Incident Management.
We are actively dogfooding Incident Management features with the Infrastructure team. Today, the Infrastructure team relies partially on PagerDuty to maintain GitLab.com and the other services they are responsible for. Ultimately, the joint goal of the Infrastructure team and the Respond group is for the Infrastructure team to rely solely on GitLab Incident Management. Our plan to achieve this goal is as follows:
General Feature | Specific Feature | Dogfooding? | Example | Feature needs 'x' to dogfood |
Incidents | ✅ | |||
✅ | ||||
✅ | created via ops.gitlab.net | |||
✅ | ||||
🔴 | TBD, just released | |||
🔴 | ||||
🔴 | Labels need to be included on the incident list. | |||
🔴 | There isn't a working integration with our observability vendor. Metrics are added as screenshots to the incident. | |||
🔴 | Not currently dogfooding GitLab alerts | |||
🔴 | SLAs aren't set on a per-incident basis | |||
Alerts | 🔴 | Alert improvements are noted | ||
🔴 | Dependent on dogfooding alerts. | |||
🔴 | Dependent on dogfooding alerts. | |||
🔴 | Dependent on dogfooding alerts. | |||
🔴 | Mapping a complex payload to the custom mapping was cumbersome. Alerts showed a new alert when the payload changed. | |||
🔴 | Right now PagerDuty is the single source of truth for alerts. There is no value beyond associating GitLab Alerts with GitLab Incidents. | |||
🔴 | Dependent on dogfooding alerts. Looking for the ability to manually add similar alerts to the same incident. |
Marketing and Sales Enablement material can be found here.
Features in the Incident Management category have been placed in tiers based on GitLab's Buyer Based Tiering strategy. The following pricing plan represents existing and future features.
Functionality | Free | Premium | Ultimate |
---|---|---|---|
Manual incident creation | ✅ | ✅ | ✅ |
Incident creation based on limited criteria (e.g. integration or severity) | | ✅ | ✅ |
Incident timelines | ✅ | ✅ | ✅ |
Incident tags | ✅ | ✅ | ✅ |
Linked resources | | ✅ | ✅ |
Incident Reviews | | ✅ | ✅ |
Slack app for incident management | | | ✅ |
Incident dashboards | | | ✅ |
Functionality | Free | Premium | Ultimate |
---|---|---|---|
One Generic HTTP Endpoint | ✅ | ✅ | ✅ |
Internal GitLab Alerts | ✅ | ✅ | ✅ |
Email integration | | ✅ | ✅ |
Multiple email endpoints | | ✅ | ✅ |
External Prometheus integration | ✅ | ✅ | ✅ |
Custom mapping for alert formats to endpoints | | ✅ | ✅ |
Special bi-directional out of the box integrations with popular monitoring tools | | | ✅ |
Name of Competitor | Year Founded | Relative Links |
---|---|---|
Atlassian Opsgenie | 2012 | Website Link |
AWS Incident Manager | 2021 | Website Link |
Grafana OnCall (Previously Amixr) | 2018, acquired by Grafana in 2021 | Website Link Competitive Analysis |
Jeli | 2019 | Website Link |
PagerDuty | 2009 | Website Link |
Rootly | 2020 | Website Link Competitive Analysis |
ServiceNow | 2003 | Website Link |
Splunk On-Call (Previously VictorOps) | 2012, acquired by Splunk in 2018 | Website Link |
xMatters | 2000 | Website Link |
FireHydrant | 2018 | Website Link Competitive Analysis |
Blameless | 2017 | Website Link Competitive Analysis |
Analyst firms such as Gartner and 451 Research have recently published articles on the rising prevalence of automation in incident response workflows.
Gartner's recent research study titled Automate Incident Response to Enhance Incident Management focuses on the importance of leveling up manual incident response processes with automation: "Organizations targeting best-in-class incident management must address the manual processes and collaboration challenges between teams." They go on to outline some of their key findings, which highlight that "I&O organizations are looking to enhance incident response by focusing on automation, third-party integration, stakeholder management and improved detection response feedback loops."
Their recommendations include the following:
Gartner, Automate Incident Response to Enhance Incident Management, By: Venkat Rayapudi & Steve White, Published 18 September 2020
Competitors (listed above in the competitive landscape) enable the automation of these processes to different extents. Automation functionality is typically offered only in higher pricing tiers. To take advantage of these automation features, companies must invest significant time in configuring and fine-tuning systems and processes.
In the near term, GitLab is positioned to enable Gartner's recommendations for a best-in-class incident management platform via the centralization of on-call schedule management to enable the automatic routing of alerts to the right responders at the right time.
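To make the idea of schedule-based routing concrete, here is a minimal sketch in Python. It is not GitLab's implementation: the `Shift` type, the `responder_on_call` function, and the sample schedule are hypothetical names chosen purely for illustration. It only shows the core lookup, which is finding the responder whose shift covers the moment an alert fires.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class Shift:
    """One on-call shift (hypothetical type); start is inclusive, end is exclusive."""
    responder: str
    start: datetime
    end: datetime


def responder_on_call(shifts: Sequence[Shift], fired_at: datetime) -> Optional[str]:
    """Return the responder whose shift covers fired_at, or None if the schedule has a gap."""
    for shift in shifts:
        if shift.start <= fired_at < shift.end:
            return shift.responder
    return None


# Hypothetical schedule with two back-to-back shifts (all times in UTC).
SCHEDULE = [
    Shift("alice",
          datetime(2021, 6, 1, 0, 0, tzinfo=timezone.utc),
          datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc)),
    Shift("bob",
          datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc),
          datetime(2021, 6, 2, 0, 0, tzinfo=timezone.utc)),
]

alert_time = datetime(2021, 6, 1, 13, 30, tzinfo=timezone.utc)
print(responder_on_call(SCHEDULE, alert_time))  # prints "bob"
```

Treating the end of each shift as exclusive means an alert that fires exactly at a handover goes to the incoming responder only. A `None` result signals a coverage gap, which a real system would presumably escalate rather than drop.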
451 Research published an article on the acquisition of Rundeck by PagerDuty in September 2020. Read more about this on Rundeck's website. This was a strategic move to meet enterprise demand for more automation in incident response.
GitLab has plans to investigate using Rundeck for Runbooks via Runbooks - Rundeck Validation. This will be an interesting opportunity to connect the PagerDuty lifecycle into GitLab Runbooks and Monitoring capabilities.
There is an existing landscape of comparable tools: "ServiceNow and xMatters have orchestration engines that can be deployed to build workflows across tools, but they aren't typically extensively used to execute remediations." VictorOps (owned by Splunk) and Opsgenie (owned by Atlassian) are other similar tools with visions like PagerDuty's.