
GitLab Security Operations On-Call Guide

The Security Operations Team (SecOps) is collectively on-call 24/7/365, split into 12-hour shifts Monday to Friday and 48-hour coverage Saturday and Sunday.

Responsibilities

During those on-call shifts, the SecOps Engineer has three core responsibilities:

  1. Acknowledge and respond to pages
  2. Review and engage on issues marked with the oncall label
  3. Improve Security Operations on-call and incident handling processes

1. Engagement: Paging Duties

The on-call Engineer's primary concern is to engage promptly with pages sent to Security Operations. When receiving a page:

  1. Acknowledge the alarm in the #security-team Slack channel or through PagerDuty directly
  2. Review the contents of the page, the associated GitLab issue, and any other provided details
  3. Resolve the PagerDuty alarm (via Slack or PagerDuty)
  4. Investigate and mitigate/resolve the issue as a priority

Note: If the alarm is not acknowledged within 15 minutes, the on-call Security Manager will be alerted.
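
The 15-minute acknowledgement window can be illustrated with a small sketch. This is not how PagerDuty implements escalation; the function name `should_escalate` and its parameters are hypothetical, and treating the exact 15-minute mark as already overdue is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Acknowledgement window taken from the note above.
ACK_DEADLINE = timedelta(minutes=15)


def should_escalate(paged_at: datetime,
                    acked_at: Optional[datetime],
                    now: datetime) -> bool:
    """Return True if an unacknowledged page has outlived the ACK window.

    Hypothetical helper for illustration only; the real escalation is
    handled by the paging tool, not by this code.
    """
    if acked_at is not None:
        # Someone has ACK'd the page, so nothing is left to escalate.
        return False
    # Assumption: a page still unacknowledged at exactly 15 minutes is overdue.
    return now - paged_at >= ACK_DEADLINE


if __name__ == "__main__":
    paged = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(should_escalate(paged, None, paged + timedelta(minutes=20)))
```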

2. Review, Engage & The oncall Label

Occasionally, issues in Security Engineering and Security Operations will be marked with the oncall label. During the on-call shift, we should watch for new issues and engage on both new and existing open/active ones to help move them towards mitigation/resolution.

These issues are typically generated through automated alerting and may occasionally require human intervention, depending on the scenario. At a minimum, they should be reviewed twice per shift.
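
One way to make the twice-per-shift review quicker is to query the GitLab REST API for open issues carrying the oncall label. The sketch below only builds the request URL for the project issues endpoint; the project path `example-group/example-project` is a placeholder, not the team's real tracker, and an authenticated request (for example with a personal access token) would most likely be needed for non-public projects.

```python
from urllib.parse import quote, urlencode


def oncall_issues_url(project_path: str,
                      base_url: str = "https://gitlab.com") -> str:
    """Build a GitLab API URL listing open issues labelled `oncall`.

    `project_path` is a hypothetical "group/project" path; the API accepts
    it as the project ID once it is URL-encoded.
    """
    project_id = quote(project_path, safe="")  # "/" becomes "%2F"
    query = urlencode({"labels": "oncall", "state": "opened"})
    return f"{base_url}/api/v4/projects/{project_id}/issues?{query}"


if __name__ == "__main__":
    # Placeholder project path, used for illustration only.
    print(oncall_issues_url("example-group/example-project"))
```

The resulting URL can be fetched with any HTTP client; the response is a JSON list of issues that can be skimmed at the start and midpoint of a shift.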

3. Improve On-Call, Incident Handling, and GitLab's Security

As we continue to grow and mature in the operational security space, we will have many new experiences, succeed and fail at handling security events, and subsequently learn from them. These learnings should be documented through runbooks, processes, and handbook updates. During on-call shifts, it's the on-call Engineer's responsibility to take notes, look for improvement opportunities in how we handle scenarios, find steps that can be automated, ask questions about our tools, services, and infrastructure, and look for questionable security areas or risky decisions so we can improve GitLab's overall security posture.

Major Incident Response Workflow

Before diving straight into handling a major incident, it's best to set up crucial tools, communication channels, and rules of engagement to work cross-team, like:

Communication Channels

GitLab Issue Tracking

When acting on a page, regardless of whether the incident is new or ongoing, the issue created by the paging mechanism should be used to record security-relevant data like:

However, if an existing issue other than the page-created one is already tracking the incident, correspondence and engagements with internal/external individuals and teams should be recorded in that existing issue. If the page-created issue is the only one, record them there.

Incident Ownership

There's a simple rule for incident ownership: whoever ACKs the page owns it. Other Engineers and members of Security Operations may engage to maintain 24/7 coverage, but ownership remains with whoever ACK'd the page.

Ownership of an incident implies being accountable to:

Being accountable does not imply being the sole person to act on these tasks. A hand-off at the end of an on-call shift, or coordinated breaks during extended incidents, temporarily makes another person responsible for these tasks. To coordinate these hand-offs, it's essential to equip the next person with all necessary details…

Incident Hand-off

To best prepare the next Engineer and ensure continued progress, details should be recorded in the page-created issue as well as the Security Operations Slack channel. Details like:

Stakeholder Communication

In Progress

Incident Closure

Prior to closing any GitLab issue resulting from a page, be sure to record any points or comments on how we can improve the processes, tools, and knowledge base that could have assisted with the incident.

Then, once the incident has been resolved, the GitLab issue should be closed.

Additional notes may be placed in a closed issue. When an RCA is performed the closed issue will be referenced.

RCA

A thorough Root Cause Analysis guide has been published to describe the "what, why, and how" of performing an RCA.