This is a Controlled Document
In line with GitLab's regulatory obligations, changes to controlled documents must be approved or merged by a code owner. All contributions are welcome and encouraged.
The Security Operations sub-department is collectively responsible for responding to reports of actual or potential security incidents on a 24/7/365 basis.
Prior to scheduling planned time off, Security Operations team members should consult with the team to ensure proper coverage will be available.
Security Operations Managers also share in On-Call responsibilities and need to ensure proper coverage for escalations and emergencies. The Security department maintains a series of On-Call escalations to ensure that every reported incident is responded to in a reasonable timeframe.
On-Call scheduling for SIRT is organized in PagerDuty within the Security Responder policy.
On-Call scheduling for Trust & Safety is organized in PagerDuty within the Trust & Safety Responder policy.
Standard handoff times are as scheduled. However, team members are empowered to agree on a temporary modified handoff schedule as long as those involved agree and the team is notified in the team’s Slack channel.
SIRT
Trust and Safety
The Weekend On-Call Security team member will be responsible for covering On-Call responsibilities from AMER Friday evening until APAC Monday morning according to the established On-Call Security Handoff times.
"Weekend On-Call Security Responsibilities"
When scheduled for the Weekend On-Call Security shift, team members should:
Security Operations provides weekday On-Call coverage using a follow-the-sun model. Weekday On-Call Security Engineers are the team members that cover the On-Call responsibilities during their region's sunny time.
The Weekday On-Call paging workflow is currently designed to follow this escalation path:
In addition to the Security Engineers being On-Call, the Security Managers across the GitLab Security Department act as backups in the event the Security Engineers are unable to acknowledge security pages. PagerDuty will automatically engage the Security Manager On-Call if the Security Engineers don’t acknowledge the first two page attempts, which are sent 15 minutes apart.
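The escalation timing described above can be sketched as follows. This is a minimal illustration, not the actual PagerDuty configuration: the function names and role labels are hypothetical, and it assumes the first page is sent at time zero, the second 15 minutes later, and the manager is engaged once both 15-minute windows have passed without acknowledgement.

```python
from datetime import datetime, timedelta

# Values taken from the text: two page attempts, 15 minutes apart.
PAGE_INTERVAL = timedelta(minutes=15)
ENGINEER_ATTEMPTS = 2

# Hypothetical labels for illustration only.
ENGINEER = "Security Engineer On-Call"
MANAGER = "Security Manager On-Call"


def escalation_time(first_page: datetime) -> datetime:
    """Return when the manager is engaged if no page is acknowledged.

    Assumption: each attempt gives the engineer a full 15-minute window.
    """
    return first_page + ENGINEER_ATTEMPTS * PAGE_INTERVAL


def current_target(first_page: datetime, now: datetime) -> str:
    """Return who an unacknowledged page is currently directed at."""
    if now < first_page:
        raise ValueError("now must not be earlier than the first page")
    return ENGINEER if now < escalation_time(first_page) else MANAGER


if __name__ == "__main__":
    paged_at = datetime(2024, 1, 1, 9, 0)
    print(escalation_time(paged_at))
    print(current_target(paged_at, paged_at + timedelta(minutes=20)))
```

Under these assumptions, a page first sent at 09:00 and never acknowledged reaches the Security Manager On-Call at 09:30.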
It's the responsibility of the Security Manager On-Call to:
During weekday On-Call shifts the Security Engineer On-Call has these core responsibilities:
The On-Call Engineer's primary concern is to provide timely and adequate responses to incoming pages. When receiving a page:
If the alarm is not acknowledged within two 15-minute opportunities, the Security Incident Manager On-Call will be alerted.
Engineers should acknowledge pages within the first 15 minutes, and perform initial triage of potential incidents within the first hour of the alarm.
Be sure to communicate with the reporter(s):
Occasionally, issues that do not trigger a page are still created in the GitLab-SIRT namespace and marked with the incident label. During the On-Call shift, we should watch for and engage on new and existing open issues to help drive them toward mitigation and resolution. Those lower-priority incidents are also directly mentioned in the appropriate Slack channel.
These issues are generated through the automated /security workflow and require human intervention for triage.
As we continue to grow and mature in the operational security space, we will have many new experiences, succeed and fail at handling security events, and subsequently learn from them. These learnings should be documented through runbooks, processes, and handbook updates. During On-Call shifts, it is the On-Call Security Engineer's responsibility to take notes, look for improvement opportunities in how we handle scenarios, find steps that can be automated, ask questions about our tools, services, and infrastructure, and identify questionable security areas or risky decisions so that we can improve GitLab's overall security posture.
As On-Call periods are typically interrupt-driven, it can be difficult to work on large projects. This makes them a good opportunity to reduce the queue of Backlog issues. During weekdays, it is the responsibility of the On-Call Engineer to review the backlog board for items that can be accomplished during the On-Call period. When working an issue from the backlog, the On-Call Engineer should assign themselves the issue and see it through to completion. This may include completing it in the week following their On-Call schedule.
There's a simple rule to incident ownership: whoever is assigned to the incident after the initial triage is the person responsible for incident resolution. Use the assignee field in the GitLab incident to identify the responsible person. In some cases, the incident may be high severity and high priority, and may have an assignee per region. In these cases, the work should continue around the globe until the incident is contained and eventually resolved.
Ownership of an incident means being the person responsible for:
Being the responsible person does not imply being the sole person to act on these tasks. Hand-off at the end of an On-Call shift, or coordinated breaks during extended incidents, can temporarily assign another person responsible for these tasks. To coordinate these hand-offs it's essential to equip the next person with all necessary details.
To best prepare the next Security team member and ensure continued progress, details should be recorded in the page-created issue as well as the team’s Slack channel. Details like:
Exceptions to this procedure will be tracked as per the Information Security Policy Exception Management Process.