Downtime is expensive. The cost of downtime also compounds because customers lose trust every time your product is not available or is unreliable. It behooves organizations to not just build resilient applications that are available 24x7, but also enable an incident response capability that is efficient and reliable.
Because downtime is expensive, the team responsible for incident management is under high pressure. When the team lacks the right tools to respond to incidents, their jobs become unnecessarily stressful and burnout occurs. This snowballs into more outages, leading to poor business outcomes.
Incident Management is a critical tool in your DevOps toolbox. The right incident management tool helps incident responders to be notified, streamlines communication and coordination, assists in diagnosing problems, facilitates remediation, and helps your entire organization to learn and to improve.
Buyers are going to be Senior Managers or Directors of engineering, infrastructure, operations, or reliability teams. They typically lead 10-100 people, depending on how the company is set up.
Incident Management is frequently referred to as Incident Response from the analyst perspective. Sometimes it is categorized under ITSM (IT Service Management).
Ability to integrate with any alerting source and consume alerts for IT service disruptions or outages. Alerts can be received in HTTP or email formats. Often tools will provide proprietary
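As an illustration of the HTTP path mentioned above, the following is a minimal sketch of how a monitoring tool might package an alert for an incident management tool. The endpoint URL, the bearer-token scheme, and the payload fields (`title`, `severity`, `description`) are assumptions made for this example, not the API of any particular product.

```python
import json
import urllib.request


def build_alert_request(endpoint_url, token, title, severity="critical", description=""):
    """Build (but do not send) an HTTP POST request carrying one alert.

    All field names here are hypothetical; a real tool documents its own schema.
    """
    payload = {
        "title": title,
        "severity": severity,
        "description": description,
    }
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme: a per-integration secret sent as a bearer token.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would deliver the alert. Building the request separately from sending it keeps the payload easy to inspect and test.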
All incoming alerts are aggregated in a list to streamline triage. Each alert has a detail page that shows the payload and an audit trail of actions taken on the alert. The status of an alert can be changed to show state and progress, and alerts can be assigned to demonstrate ownership.
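The alert model described above can be sketched roughly as follows. This is an illustrative data structure only: the status names and the `Alert` class are hypothetical, and real tools define their own states and audit formats.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical status values, chosen for illustration.
STATUSES = ("triggered", "acknowledged", "resolved")


@dataclass
class Alert:
    title: str
    payload: dict
    status: str = "triggered"
    assignee: Optional[str] = None
    # Each entry is (timestamp, actor, description of the action).
    audit_trail: list = field(default_factory=list)

    def _record(self, actor, action):
        self.audit_trail.append((datetime.now(timezone.utc), actor, action))

    def set_status(self, actor, status):
        """Change the alert's status and note the change in the audit trail."""
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status}")
        self._record(actor, f"changed status from {self.status} to {status}")
        self.status = status

    def assign(self, actor, assignee):
        """Assign the alert to a responder and note it in the audit trail."""
        self._record(actor, f"assigned to {assignee}")
        self.assignee = assignee
```

Every state change goes through a method that also appends to `audit_trail`, so the history on the detail page cannot drift from the alert's current state.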
All incidents are aggregated in a list to streamline triage. They also show up in issue lists and on issue boards to fit the unique workflows of different teams. Incidents can be created manually or automatically from an alert. When an incident is created from an alert, it contains all of the alert's details. Incidents coordinate response and collaboration.
Set up schedules for responding on-call. Admins have flexible options for creating schedules and adding responders to the schedules. Schedules are used in escalation policies to identify who is on-call when an alert is triggered. Responders get to set up paging policies with preferred paging methods.
Review what happened during a fire-fight in a blameless setting. Walk through the incident timeline and annotate it with learnings, places to improve, and things to investigate. Create after-action items to continuously improve.
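To make the schedule lookup concrete, here is a minimal sketch of how a tool might determine who is on-call when an alert fires. It assumes a simple round-robin rotation with fixed-length shifts; the function name and parameters are hypothetical, and real schedules also handle overrides, time zones, and multiple rotations.

```python
from datetime import datetime, timedelta, timezone


def on_call_responder(responders, rotation_start, shift_length, at):
    """Return the responder whose shift covers the moment `at`.

    Assumes responders take turns in list order, each for `shift_length`,
    starting at `rotation_start`, and that the list repeats indefinitely.
    """
    if not responders:
        raise ValueError("rotation has no responders")
    if at < rotation_start:
        raise ValueError("rotation has not started yet")
    # Whole shifts completed since the rotation began.
    shifts_elapsed = (at - rotation_start) // shift_length
    return responders[shifts_elapsed % len(responders)]
```

For example, with three responders on weekly shifts, an alert arriving eight days after the rotation starts falls in the second shift, so the second responder in the list is paged.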
|Market Requirements||How GitLab Delivers Today||Demos|
|Integrate alerting sources||We offer the ability to create HTTP endpoints for customers to send alerts to. We offer a proprietary Prometheus integration that receives alerts from Alertmanager.||TBD|
|Alert Triage||Users can triage alerts in the alerts list, using column sorting to surface the most important ones. Clicking on an alert takes the user to the alert details, where the user can view the payload, change the status, and assign it to someone. All actions taken on the alert appear as system notes in the audit trail for the alert.||TBD|
|Incident Response||Users can create incidents manually or set up a project to have them created automatically for all alerts that are created in GitLab. Incidents contain alert details, an editable description, and comments so users can collaborate during a fire-fight. A user can promote the incident to an externally facing status page to communicate with external stakeholders.||TBD|
|On-call schedule management||The on-call schedule management MVC is planned for release in 13.10. The MVC will enable users to create a single schedule in a project that contains multiple rotations. All alerts received by that project will email the on-call responder in the schedule.||TBD|
|Post Incident Review||Users can create an issue and link it to a closed Incident to execute the Post Incident Review. After action items can be created as additional issues.||TBD|
|On-call schedule management integrated with your dev platform||Easily transition your development team to being on-call for the code that they write||TBD|
|Triage alerts and incidents in the same platform where you will deploy the fix||Shorten time to resolution. Easily associate incidents with patches and after-action items||TBD|
|PagerDuty||Stand-alone incident management tool. Entrenched in the enterprise. We are starting to see them lose their foothold as the market shifts towards workflow tools that include incident management, rather than companies relying on one specific tool.||Came into the market on the right wave. Has a large customer base that is entrenched in the tool, and it is hard to get a company to switch incident management platforms. Robust feature set. Cloud-based HA solution.||Considered very expensive. Stand-alone tool that does nothing else. Must be integrated with everything. Has a reputation for poor customer service/success.|
|Opsgenie||Part of the Atlassian product suite - acquired in 2018. Less market share than PagerDuty - we rarely hear of GitLab customers using it. Considered "more progressive". One example of an Incident Management company becoming an offering of a workflow tool.||Very flexible tool with an intuitive interface. Strong brand.||TBD|
|ServiceNow||Huge workflow tool that is entrenched in the enterprise. Has been slowly developing out-of-the-box incident management workflows for the last several years. Leans towards enabling traditional ITSM/ITIL workflows.||Very flexible tool that offers users options to customize everything. Can easily expand to on-call teams because it is likely being used in other parts of the business.||Requires a lot of setup. People are not going to purchase it just for Incident Management. Lacks specialized Incident Management experience and features.|
|Splunk On-call||Formed by the VictorOps acquisition in 2018. Rebranded to Splunk On-call at the end of 2020. One of three parts that make up the Splunk Observability suite. Has a small market share.||VictorOps and Splunk have very strong brands and a following.||Splunk has divested from Splunk On-call and reduced the team responsible for it. Users are not going to purchase Splunk just for Splunk On-call. Splunk itself is very expensive. Tool is inflexible and is difficult for large teams to use.|
|DataDog Incident Management||A relatively new addition to DataDog's observability platform, introduced in 2020.||It has the advantage of having monitoring and incident response all housed within one integrated application. DataDog is a market leader in the Monitoring space and has a large pre-existing customer base to expand into.||Relatively new. Does not have on-call schedule management meaning that someone would also need an incident management tool to use it. Is proprietary to DataDog.|
This table shows the recommended use cases to adopt, links to product documentation, the respective subscription tier for the use case, and product analytics metrics.
|Use Case||Description||Links to documentation||Applicable Subscription Tier||Metrics|
|SRE team||This is a team of 10-30 people at a progressive company. In addition to responding on-call to incidents, they are involved in the architecture and maintenance of the infrastructure of the cloud-native services their company provides. They are agile and effective. They are committed to continuous improvement and have post-incident reviews built into their regular practices. They evangelize DevOps throughout their organization. They try to automate as much of their workflows as possible. This type of team will submit the most feature requests and really push us (GitLab) to be more innovative.||Ultimate||TBD|
|Development team||This team can range from 10-500 people. This is a team of engineers who are responsible for developing the software. They have recently been asked to be on-call as their organization moves through the DevOps transformation. They will rarely solve an incident themselves; they are much more likely to be paged to join a fire-fight with operations team members. More progressive organizations will have figured out a way to page engineers based on the area of the code base they contribute to. In these teams it is common to find a lot of single points of failure (i.e., senior team members who hold a lot of domain knowledge).||Premium, Ultimate||TBD|
|Support team||This team is typically 5-20 people and usually belongs to a much larger Support department. Support teams are reactive to customer-reported outages and are the liaison between the company and the customer during a fire-fight. They handle stakeholder communication. We find these teams at traditional and progressive companies alike. Support teams are going to be satisfied with a "just-good-enough" solution.||Premium, Ultimate||TBD|
Question - What tools are you using for Incident Management today?
Check out the Incident Management category direction page for future vision and plans.
Incident Management documentation