
Category Direction - Alert Management

Introduction and how you can help

Thanks for visiting this category page on Alert Management in GitLab. This page belongs to the Health group of the Monitor stage and is maintained by Sarah Waldner, who can be contacted directly via email. This vision is a work in progress and everyone can contribute. Sharing your feedback directly on issues and epics at GitLab.com is the best way to contribute to our vision. If you're a GitLab user and have direct knowledge of your need for Alert Management, we'd especially love to hear from you.

Overview

The cost of IT service disruptions continues to rise as every company becomes a tech company. Services that were previously offered during "business hours only" are now 24/7 and are expected to adhere to six nines (99.9999%) of uptime. Moreover, operating these services is becoming increasingly complex in this age of digital change. New technologies emerge in the market on a daily basis, software development teams are moving to CI/CD frameworks, and legacy platforms are evolving into globally distributed networks of microservices. It is critical for modern operations teams to implement an accurate and flexible IT alerting system that enables them to detect problems and solve them proactively.

Teams responsible for maintaining available and reliable services require a stack of tools to monitor the different layers of technology that comprise software services. These tools capture events (changes in the state of an IT environment) and generate alerts for critical events that indicate a degradation in application or system behavior. The complexity of IT applications, systems, and architectures, together with the many tools required to monitor them, causes multiple alerting problems for operators. It is very challenging to determine the correct metrics to track and the right thresholds to alert on. Most teams end up defining alerts too broadly for fear of missing critical issues. This results in a constant barrage of alert notifications, and the problem is exacerbated when multiple tools alert concurrently. When this happens, teams are forced to react to problems rather than proactively mitigate them, because they can't keep up with the stream of alerts and are always switching between tools and interfaces. This causes 'alert fatigue' and leads to high stress and low morale.

What these teams need is a single central interface that aggregates alerts from one or many sources. The alert system should provide automatic deduplication and event correlation, which will enable operators to efficiently triage and prioritize problems for resolution.
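To make the idea of deduplication concrete, here is a minimal sketch of how alerts from several monitoring tools might be grouped by a shared fingerprint. It is an illustration only, not a description of how GitLab implements Alert Management. The `Alert` fields, the `fingerprint` rule (service plus normalized title), and the tool names in the usage example are all assumptions.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Alert:
    # All fields are hypothetical; real alert payloads differ per tool.
    source: str    # monitoring tool that raised the alert
    service: str   # affected service
    title: str     # human-readable summary
    severity: str  # e.g. "critical", "warning"


@dataclass
class AlertGroup:
    fingerprint: str
    first: Alert                # first alert seen for this fingerprint
    count: int = 1              # how many alerts were folded into the group
    sources: set = field(default_factory=set)


def fingerprint(alert: Alert) -> str:
    """Assumed rule: alerts with the same service and normalized title are duplicates."""
    key = f"{alert.service.lower()}|{alert.title.strip().lower()}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def aggregate(alerts):
    """Fold a stream of alerts into one group per fingerprint, in arrival order."""
    groups = {}
    for alert in alerts:
        fp = fingerprint(alert)
        group = groups.get(fp)
        if group is None:
            groups[fp] = AlertGroup(fingerprint=fp, first=alert, sources={alert.source})
        else:
            group.count += 1
            group.sources.add(alert.source)
    return list(groups.values())


# Usage: two tools report the same problem on "checkout"; one reports "search".
incoming = [
    Alert("prometheus", "checkout", "High error rate", "critical"),
    Alert("nagios", "checkout", "high error rate ", "critical"),
    Alert("prometheus", "search", "High latency", "warning"),
]
groups = aggregate(incoming)
for g in groups:
    print(g.first.service, g.count, sorted(g.sources))
```

With this rule, an operator sees one entry per underlying problem instead of one entry per notification. Event correlation would go a step further and relate groups that have different fingerprints, which this sketch does not attempt.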

Mission

Our mission is to close the gap between outage detection and service restoration for DevOps teams by consolidating alerts in the same application where they investigate metrics, logs, traces, and errors and resolve incidents.

Challenges

As we invest R&D in adding Alert Management to GitLab, we are faced with the following challenges:

Opportunities

We are uniquely positioned to take advantage of the following opportunities:

Target Audience and Experience

While Alert Management matures through the minimal and viable maturity levels, we are creating an intuitive and streamlined experience for Allison (Application Ops) and Devon (DevOps Engineer). Initially, this experience will be oriented towards DevOps teams at smaller companies, where it is common for engineers to be on call and to respond to alerts for the software they write.

Strategy

Maturity Plan

We are currently focused on moving Alert Management from the planned to the minimal maturity level, and that work is captured in this epic. Definitions of these maturity levels can be found on GitLab's Maturity page.

What is Next & Why?

Processing alerts during a firefight requires responders to coordinate across multiple tools to evaluate different data sources. This is time-consuming: every time a responder switches to a new tool, they are confronted with a new interface and different interactions, which is disorienting and slows down triage workflows.

The minimal version of Alert Management will be an interface in GitLab that aggregates alerts from any tool. Just as your application runs on a stack of technology, a stack of monitoring tools ensures that each layer of your tech is reliable and available. There are hundreds of such tools on the market, and we want to consume alerts from all of them. You can follow our progress and contribute to the MVC via this epic.

Once we've created an interface where you can view alerts from different tools side by side, we will enrich that experience by enabling you to interact with them and take action on them.

What is not planned right now

This is a new category and we are still refining our vision. We will add items to this section as we move through research and prioritization.

Competitive Landscape

Analyst Landscape

Not yet, but accepting merge requests to this document.

Top Customer Success/Sales Issue(s)

Not yet, but accepting merge requests to this document.

Top Customer Issue(s)

Not yet, but accepting merge requests to this document.

Top Internal Customer Issue(s)

Not yet, but accepting merge requests to this document.

Top Vision Item(s)

Not yet, but accepting merge requests to this document.

GIT is a trademark of Software Freedom Conservancy and our use of 'GitLab' is under license