The Security Operations Team (SecOps) is collectively on-call 24/7/365, split into 12-hour shifts Monday to Friday and 48-hour coverage Saturday and Sunday.
During those on-call shifts the SecOps Engineer has three core responsibilities:
The on-call Engineer's primary concern is to respond promptly to pages sent to Security Operations. When receiving a page:
Note: If the alarm is not acknowledged within 15 minutes, the on-call Security Manager will be alerted.
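The escalation rule in the note above can be sketched as a small check. This is only an illustration: the function name `should_escalate` and its parameters are hypothetical and do not come from any paging tool's API. Only the 15-minute window is taken from the text.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# The 15-minute acknowledgement window comes from the note above.
ACK_WINDOW = timedelta(minutes=15)


def should_escalate(paged_at: datetime,
                    acked_at: Optional[datetime],
                    now: datetime) -> bool:
    """Hypothetical helper: True when a page is still unacknowledged
    and more than 15 minutes have passed since it was sent."""
    if acked_at is not None:
        # Someone has ACK'd the page, so nothing further to escalate now.
        return False
    return now - paged_at > ACK_WINDOW


if __name__ == "__main__":
    sent = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(should_escalate(sent, None, sent + timedelta(minutes=16)))
```

In practice the paging service handles this timer itself. The sketch only makes the rule explicit for anyone reasoning about when the Security Manager gets involved.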
Occasionally, issues in Security Engineering and Security Operations will be marked with the oncall label. During the on-call shift, we should watch for and engage on new as well as existing open/active issues to help move them toward mitigation or resolution.
These issues are typically generated through automated alerting and may occasionally require human intervention, depending on the scenario. At a minimum, they should be reviewed twice per shift.
As we continue to grow and mature in the operational security space, we will have many new experiences, succeed and fail at handling security events, and subsequently learn from them. These lessons should be documented through runbooks, processes, and handbook updates. During on-call shifts, it is the on-call Engineer's responsibility to take notes, look for opportunities to improve how we handle scenarios, find steps that can be automated, ask questions about our tools, services, and infrastructure, and look for questionable security areas or risky decisions so we can improve GitLab's overall security posture.
Before diving straight into handling a major incident, it's best to set up crucial tools, communication channels, and rules of engagement for working cross-team, like:
#secops_#### where #### is the GitLab issue number in the Security Operations project.
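As a small illustration of the channel naming convention, the name can be derived from the issue number. The helper name `incident_channel_name` is hypothetical. The sketch also assumes that #### stands for the issue number as-is, without zero-padding, which the text does not specify.

```python
def incident_channel_name(issue_number: int) -> str:
    """Hypothetical helper: build the incident channel name from the
    Security Operations issue number.

    Assumption: the number is used as-is (no zero-padding).
    """
    if issue_number <= 0:
        raise ValueError("issue number must be a positive integer")
    return f"#secops_{issue_number}"


if __name__ == "__main__":
    print(incident_channel_name(1234))  # prints "#secops_1234"
```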
When acting on a page, regardless of whether the incident is new or ongoing, the issue created by the paging mechanism should be used to record security-relevant data like:
However, if an issue tracking the incident already exists apart from the page-created one, correspondence and engagements with internal and external individuals and teams should be recorded in that existing issue. If the page-created issue is the only one, they should be recorded there.
There's a simple rule for incident ownership: whoever ACKs the page owns it. Other Engineers and members of Security Operations may engage to maintain 24/7 coverage, but ownership remains with whoever ACK'd the page.
Ownership of an incident implies being accountable to:
Being accountable does not imply being the sole person to act on these tasks. A hand-off at the end of an on-call shift, or a coordinated break during an extended incident, temporarily makes another person responsible for these tasks. To coordinate these hand-offs, it's essential to equip the next person with all necessary details…
To best prepare the next Engineer and ensure continued progress, details should be recorded in the page-created issue as well as the Security Operations Slack channel, such as:
Prior to closing any GitLab issue resulting from a page, be sure to record any points or comments on improvements to our processes, tools, and knowledge base that might have helped with the incident.
Then, once the incident has been resolved, the GitLab issue should be closed.
Additional notes may be placed in a closed issue. When an RCA is performed the closed issue will be referenced.
A thorough Root Cause Analysis guide has been published to describe the "what, why, and how" on performing an RCA.