It’s 2 a.m. Monday morning.
Your phone screen lights up and buzzes. Lo and behold, the alert is serious and there is likely a severe incident ongoing with your service.
You check Slack to see if anyone else is involved. Next, you log into your monitoring tool to review the alert and do a quick triage, hoping that the cause and solution are straightforward. You spend the next 30 minutes frantically bouncing between five or six different tools, digging for clues in metrics, events, traces, logs, and release tools, hoping you can correlate a recent deployment to the incident. When another team member finally joins the firefight, you spend precious time getting them up to speed. After that, your boss calls. By now an hour has passed, and you are no closer to the root cause.
Does this situation sound familiar?
There are so many jobs to be done during an incident: communicating across multiple channels, facilitating collaboration, documenting findings and the timeline, and assessing metrics, logs, traces, and errors to diagnose problems. This process can be manual, time-consuming, and stressful for incident responders.
Wouldn’t it be great if most of this were automated and centralized in one place?
Enter GitLab alert and incident management
Our vision is to free up more time for incident responders to actually respond to incidents by automating resource management, communications, the correlation of observability data and metadata, and runbook execution. Because GitLab is a single application for your entire DevOps lifecycle, triaging IT alerts and managing incidents in GitLab comes with a bonus: you work in the same tool you already use, so everything is colocated to help you remediate problems faster.
What can I do today?
We are in the midst of building an Operations Command Center where you can investigate, respond to, and remediate IT incidents all in one interface.
Today, GitLab includes the following highlights:
- Aggregate IT alerts in a single interface (GitLab) via our generic webhook receiver
- Triage multiple alerts in a list view
- Indicate ownership of critical alerts by changing the status
- Delegate responsibility by assigning alerts
- Promote alerts to incidents by creating GitLab issues
- Investigate the metrics directly in the alert
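To make the first item above more concrete, here is a minimal sketch of how a monitoring script might send an alert to the generic webhook receiver. It rests on several assumptions: the URL and authorization key are placeholders that you would copy from your own project's alert integration settings, the helper names are invented for illustration, and the payload fields (`title`, `description`, `severity`, `service`, `hosts`, `start_time`) reflect our understanding of the generic alert format, so please check the documentation for your GitLab version before relying on them.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical placeholders: copy the real values from your project's
# alert integration settings.
GITLAB_ALERT_URL = "https://gitlab.example.com/my-group/my-project/alerts/notify.json"
AUTH_KEY = "replace-with-your-authorization-key"

# Severity values assumed to be accepted by the generic alert endpoint.
ALLOWED_SEVERITIES = {"critical", "high", "medium", "low", "info", "unknown"}


def build_alert_payload(title, description=None, severity="critical",
                        service=None, hosts=None, start_time=None):
    """Build the JSON body for a generic alert; only `title` is mandatory here."""
    if not title:
        raise ValueError("title is required")
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    payload = {
        "title": title,
        "severity": severity,
        # Default to the current UTC time in ISO 8601 format.
        "start_time": start_time or datetime.now(timezone.utc).isoformat(),
    }
    # Include optional fields only when the caller supplied them.
    if description:
        payload["description"] = description
    if service:
        payload["service"] = service
    if hosts:
        payload["hosts"] = list(hosts)
    return payload


def build_alert_request(payload, url=GITLAB_ALERT_URL, auth_key=AUTH_KEY):
    """Wrap the payload in an authenticated POST request (nothing is sent yet)."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {auth_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    alert = build_alert_payload(
        "High CPU usage on checkout service",
        description="CPU above 95% for 10 minutes",
        severity="high",
        service="checkout",
        hosts=["web-1.example.com"],
    )
    request = build_alert_request(alert)
    # Sending requires a real endpoint and key, so this call will fail
    # against the placeholder values above.
    with urllib.request.urlopen(request, timeout=10) as response:
        print(response.status)
```

Keeping payload construction separate from sending makes it easy to validate alerts locally before pointing the script at a real project.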
What is coming soon?
Alert and incident management tools are the main focus of the Health group within the Monitor stage. In the next few milestones, we anticipate releasing:
- Embedded logs for GitLab Alerts
- Linked runbooks in alerts
- A custom integration builder to integrate any alerting source with GitLab
- An incident dashboard to manage active outages
We want to hear from you!
As usual, we at GitLab listen closely to our community, and we like to give you direct access to the ideas we are considering for our product. If you want to contribute to building Incident Management tools, please check out the linked epic to see what we have planned for the near term. We love your feedback, and we would love to receive your merge requests even more.