The Respond group at GitLab is responsible for building tools that enable DevOps teams to respond to, triage, and remediate errors and IT alerts for the systems and applications they maintain. We aim to provide a streamlined Operations experience within GitLab that enables the individuals who write the code to also maintain it.
This team maps to the Respond Group category and focuses on:
This section details the happenings within the Respond group. At any given time this section will have the top exciting things and/or accomplishments of the team.
Person | Role |
---|---|
Daniele Rossetti | Senior Frontend Engineer, Monitor:Visualization |
Mat Appelman | Principal Engineer, Monitor |
Ottilia Westerlund | Security Engineer, Fulfillment (Fulfillment Platform, Billing and Subscription Management), Govern (Security Policies, Threat Insights), Monitor (Observability), Plan (Product Planning) |
The Respond Group’s mission is to decrease the frequency and severity of incidents. By helping our users respond to alerts and incidents with a streamlined workflow, and capturing useful artifacts for feedback and improvement, we can accomplish our mission. To get started, we need to establish an initial user base that we can learn from to improve and further grow the Monitor stage. We have prioritized the Incident Management category to obtain this usage.
With that in mind, our primary performance indicator, the Monitor:Respond GMAU, is the count of unique users that interact with alerts and incidents. This PI will inform us if we are on the right path to provide meaningful incident response tools.
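As a rough illustration of what "count of unique users" means here, the sketch below counts distinct users with at least one alert or incident event in a given month. It is a minimal sketch under stated assumptions: the record layout, the sample data, and the function name `monthly_active_users` are hypothetical, and only the event action names are taken from the instrumented events listed on this page. The actual PI is computed in the data warehouse, not by code like this.

```python
from datetime import date

# Hypothetical event records: (user_id, event_action, event_date).
# Only the event_action names mirror the instrumented events on this page;
# the users and dates are made up for illustration.
EVENTS = [
    (1, "view_alerts_list", date(2023, 5, 2)),
    (1, "view_alert_details", date(2023, 5, 2)),
    (2, "view_incidents_list", date(2023, 5, 10)),
    (2, "view_incident_details", date(2023, 5, 28)),
    (3, "update_alert_status", date(2023, 6, 1)),
]


def monthly_active_users(events, year, month):
    """Count distinct users with at least one event in the given month."""
    users = {
        user_id
        for user_id, _action, day in events
        if day.year == year and day.month == month
    }
    return len(users)


print(monthly_active_users(EVENTS, 2023, 5))  # users 1 and 2 -> 2
```

A user who triggers several events in the same month is counted once, which is why the metric tracks adoption rather than raw activity volume.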
We expect to track the journey of users through the following funnel:
Work with Nicole to clean up this section
Please view this Sisense chart for a list of events we have instrumented for Monitor:Respond Events.
Event Category | Event Action |
---|---|
Incident Management | view_alerts_list |
Incident Management | view_alert_details |
Incident Management | update_alert_status |
Incident Management | view_incidents_list |
Incident Management | view_incident_details |
Incident Management | toggle_incident_comments_into_timeline_view |
Incident Management | create_incident_button_clicks |
Service Ping is used for self-managed customers. The table below was generated from https://metrics.gitlab.com/.
Event Category | Event Action |
---|---|
Incident Management | counts.issues_created_gitlab_alerts |
Incident Management | counts.incident_issues |
Incident Management | counts.issues_created_from_alerts |
Incident Management | counts.alert_bot_incident_issues |
Incident Management | counts.issues_created_manually_from_alerts |
Incident Management | counts.issues_with_embedded_grafana_charts_approx |
Incident Management | counts.issues_with_associated_zoom_link |
Incident Management | counts.issues_using_zoom_quick_actions |
Incident Management | redis_hll_counters.quickactions.i_quickactions_promote_to_incident_monthly |
Incident Management | redis_hll_counters.quickactions.i_quickactions_publish_monthly |
Incident Management | redis_hll_counters.quickactions.i_quickactions_severity_monthly |
Incident Management | counts.projects_with_alerts_created |
Incident Management | counts.projects_with_enabled_alert_integrations |
Incident Management | usage_activity_by_stage.monitor.projects_with_incidents |
Incident Management | usage_activity_by_stage_monthly.monitor.projects_with_incidents |
Incident Management | usage_activity_by_stage.monitor.projects_with_alert_incidents |
Incident Management | usage_activity_by_stage_monthly.monitor.projects_with_alert_incidents |
Incident Management | redis_hll_counters.incident_management.incident_management_alert_todo_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_alert_todo_weekly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_comment_weekly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_comment_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_todo_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_todo_weekly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_relate_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_relate_weekly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_assigned_weekly |
Incident Management | redis_hll_counters.incident_management.incident_management_alert_assigned_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_alert_assigned_weekly |
Incident Management | redis_hll_counters.incident_management.incident_management_alert_status_changed_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_alert_status_changed_weekly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_change_confidential_weekly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_created_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_created_weekly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_zoom_meeting_weekly |
Incident Management | redis_hll_counters.incident_management_alerts.incident_management_alert_create_incident_monthly |
Incident Management | counts_monthly.aggregated_metrics.incident_management_alerts_total_unique_counts |
Incident Management | counts_monthly.aggregated_metrics.incident_management_incidents_total_unique_counts |
Incident Management | redis_hll_counters.incident_management_alerts.incident_management_alert_create_incident_weekly |
Incident Management | counts_weekly.aggregated_metrics.incident_management_alerts_total_unique_counts |
Incident Management | counts_weekly.aggregated_metrics.incident_management_incidents_total_unique_counts |
Incident Management | redis_hll_counters.incident_management.incident_management_total_unique_counts_weekly |
Incident Management | redis_hll_counters.incident_management.incident_management_total_unique_counts_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_unrelate_weekly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_reopened_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_reopened_weekly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_published_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_published_weekly |
Incident Management | redis_hll_counters.incident_management_oncall.i_incident_management_oncall_notification_sent_weekly |
Incident Management | redis_hll_counters.incident_management_oncall.i_incident_management_oncall_notification_sent_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_assigned_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_change_confidential_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_closed_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_closed_weekly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_zoom_meeting_monthly |
Incident Management | redis_hll_counters.incident_management.incident_management_incident_unrelate_monthly |
Incident Management | redis_hll_counters.quickactions.i_quickactions_publish_weekly |
Incident Management | redis_hll_counters.quickactions.i_quickactions_severity_weekly |
Incident Management | redis_hll_counters.quickactions.i_quickactions_promote_to_incident_weekly |
Incident Management | counts.projects_creating_incidents |
Incident Management | redis_hll_counters.incident_management.issuable_resource_links_total_unique_counts_monthly |
Incident Management | redis_hll_counters.incident_management.issuable_resource_links_total_unique_counts_weekly |
Incident Management | redis_hll_counters.incident_management.timeline_event_total_unique_counts_weekly |
Incident Management | redis_hll_counters.incident_management.timeline_event_total_unique_counts_monthly |
Incident Management | counts.status_page_incident_publishes |
Incident Management | counts.status_page_incident_unpublishes |
Incident Management | usage_activity_by_stage.monitor.projects_with_enabled_alert_integrations_histogram |
Incident Management | counts.status_page_issues |
Incident Management | counts_monthly.projects_with_alerts_created |
Incident Management | usage_activity_by_stage.monitor.projects_incident_sla_enabled |
Incident Management | counts.status_page_projects |
On-Call Schedule Management | redis_hll_counters.quickactions.i_quickactions_page_monthly |
On-Call Schedule Management | redis_hll_counters.quickactions.i_quickactions_page_weekly |
(Sisense↗) We also track our backlog of issues, including past due security and infradev issues, and total open System Usability Scale (SUS) impacting issues and bugs.
(Sisense↗) MR Type labels help us report what we're working on to industry analysts in a way that's consistent across the engineering department. The dashboard below shows the trend of MR Types over time and a list of merged MRs.
(Sisense↗) Flaky tests are problematic for many reasons.
To surface blockers, mention your Engineering Manager in the issue, and then contact them via Slack and/or in your 1:1s. Also make sure to raise any blockers in your daily async standup using Geekbot.
The engineering managers want to make unblocking their teams their highest priority. Please don't hesitate to raise blockers.
The Product Manager is responsible for scheduling issues in a given milestone. The engineering team will make sure that issues are scoped and well-defined enough to implement, and will determine whether they need UX involvement and/or technical investigation.
See also Measuring Say Do ratio for more on milestone commitments.
We use the following values for estimating the effort of issues to help determine our capacity during the planning process.
When new bugs are reported, the engineering managers ensure that they have proper Priority and Severity labels. Bugs are discussed in the weekly triage issue and are scheduled according to severity, priority, and the capacity of the teams. Ideally, we should work on a few bugs each release regardless of priority or severity.
As new technical debt issues are created, the engineering manager and product manager will triage, prioritize and schedule these issues. When new issues are created by Monitor team members, add any relevant context to the description about the priority or timing of the issue, as this will help streamline the triage work.
Priorities for scheduling technical debt will apply as follows:
As part of the Ops sub-department Async Updates, the EM is responsible for sharing a weekly team update.
Weekly update flow:
Links
Community contributions are encouraged and prioritized at GitLab. Please check out the Contribute page on our website for guidelines on contributing to GitLab overall.
Within the Monitor stage, Product Management will assist a community member with questions regarding priority and scope. If a community member has technical questions on implementation, Engineering Managers will connect them with MR coaches within the team to collaborate with.
Engineers use spikes to conduct research, prototyping, and investigation to gain knowledge necessary to reduce the risk of a technical approach, better understand a requirement, or increase the reliability of a story estimate (paraphrased from this overview). When we identify the need for a spike for a given issue, we will create a new issue, conduct the spike, and document the findings in the spike issue. We then link to the spike and summarize the key decisions in the original issue.
Engineers should typically ignore the suggestion from Dangerbot's Reviewer Roulette and assign their MRs to be reviewed by a frontend engineer or backend engineer from the Respond Group. If the MR requires domain-specific knowledge held by another team or by a person outside the Respond Group, the author should instead assign it to an appropriate domain expert. The MR author should use the Reviewer Roulette suggestion when assigning the MR to a maintainer.
Advantages of keeping most MR reviews inside the Respond Group include:
<!---
1. Navigate to Settings > Monitor
1. Expand the Alert section, and click the button to "Enable a new integration"
1. Select "HTTP endpoint" in the integration type dropdown. Add an integration name, turn the toggle to "active", and click to "Save the integration"
1. Once the integration is added, click on the settings icon button in the integration table
1. Click on the "Send test alert" tab
1. Enter the sample payload shown below, and click send.
1. Navigate to Monitor > Alerts, where you will see the new alert appear.
-->
{ "title": "Gitaly latency is too high", "description": "https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/gitaly/gitaly-latency.md", "service": "service not affected", "monitoring_tool": "GitLab scripts", "severity": "high", "host": "fe-2" }
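For repeatable testing, the sample payload above could also be posted to the HTTP endpoint integration from a script rather than through the UI. The following is a minimal sketch, assuming the integration accepts a JSON `POST` authenticated with a bearer token; `ENDPOINT_URL` and `AUTHORIZATION_KEY` are hypothetical placeholders that would need to be replaced with the values shown in the integration's settings, and `build_alert_request` is an illustrative helper, not part of any GitLab library. The request is only built here, not sent.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with the URL and authorization key
# displayed for your HTTP endpoint integration.
ENDPOINT_URL = "https://gitlab.example.com/REPLACE/WITH/INTEGRATION/URL"
AUTHORIZATION_KEY = "replace-with-authorization-key"

# The sample payload from this page.
SAMPLE_ALERT = {
    "title": "Gitaly latency is too high",
    "description": "https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/gitaly/gitaly-latency.md",
    "service": "service not affected",
    "monitoring_tool": "GitLab scripts",
    "severity": "high",
    "host": "fe-2",
}


def build_alert_request(url, key, alert):
    """Build (but do not send) a JSON POST request carrying the alert."""
    body = json.dumps(alert).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            # Assumption: the endpoint expects a bearer token.
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_alert_request(ENDPOINT_URL, AUTHORIZATION_KEY, SAMPLE_ALERT)

# Uncomment to send the alert once the placeholders hold real values:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Keeping the request construction separate from the network call makes it easy to inspect the payload before anything reaches a real project.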
Product designers generally try to work one milestone ahead of the engineers, to ensure scope is defined and agreed upon before engineering starts work. So, for example, if engineering is planning on getting started on an issue in 12.2, designers will assign themselves the appropriate issues during 12.1, making sure everything is ready to go before 12.2 starts.
To make sure this happens, early planning is necessary. In the example above, for instance, we'd need to know by the end of 12.0 what will be needed for 12.2 so that we can work on it during 12.1. This takes a lot of coordination between UX and the PMs. We can (and often do) try to pick up smaller things as they come up and in cases where priorities change. But, generally, we have a set of assigned tasks for each milestone in place by the time the milestone starts so anything we take on will be in addition to those existing tasks and dependent on additional capacity.
The current workflow:
Though Product Designers make an effort to keep an eye on all issues being worked on, PMs add the UX label to specific issues needing UX input for upcoming milestones.
The week before the milestone starts, the Product Designers divide up issues depending on interest, expertise and capacity.
Product Designers start work on assigned issues when the milestone starts. We make an effort to start conversations early and to have them often. We collaborate closely with PMs and engineers to make sure that the proposed designs are feasible.
In terms of what we deliver: we will provide what's needed to move forward, which may or may not include a high-fidelity design spec. Depending on requirements, we may provide a text summary of the expected scope, a Balsamiq sketch, a screengrab, or a higher-fidelity Measure spec.
When we feel like we've achieved a 70% level of confidence that we're aligned on the way forward, we change the label to ~'workflow::ready for development' as a sign that the issue is appropriately scoped and ready for engineering.
We usually stay assigned to issues after they are ~'workflow::ready for development' to continue to answer questions while the development process is taking place.
Finally, we review MRs following the guidelines as closely as possible to reduce the impact on velocity whilst maintaining quality.
How we measure Say Do ratio:
How this differs from past approaches:
~deliverable label: applied to issues committed to being completed in the current milestone.
~filler label: applied to issues which are not committed in the current milestone.
Why we choose this approach:
Downsides to this approach:
To develop and test Zoom features for the integration with GitLab, we have our own Zoom sandbox account.
To request access to this Zoom sandbox account, please open an issue providing your non-GitLab email address (which can already be associated with an existing non-GitLab Zoom account).
The following people are owners of this account and can grant access to other GitLab Team Members:
Add User
User Type - most likely Pro
Add - the users receive invitations via email
For more information on how to use Zoom, see their guides and API reference.
While we try to keep our process pretty light on meetings, we do hold a Monitor:Respond Team meeting weekly to triage and prioritize new issues, discuss our upcoming issues, and uncover any unknowns.
The Respond team uses labels for issue tracking and to organize issue boards. Many of the labels we use also drive reporting for Product Management and Engineering Leadership to track delivery metrics. It's important that labels be applied correctly to each issue so that information is easily discoverable.
~devops::monitor
~group::respond
~frontend
~backend
~Category:Runbooks
~Category:Incident Management
~Category:On-call Schedule Management
~Category:GitLab Self Monitoring
~Category:Error Tracking
~Category:Synthetic Monitoring
~Category:Product Analytics
~"type::feature": Feature Issues
~"type::bug": Bug Issues
~technical debt: Technical Debt
workflow::refinement: Issues that need further input from team members before they can move to workflow::ready for development.
workflow::blocked: Waiting on external factors or another issue to be completed before work can resume.
workflow::ready for development: The issue is refined and ready to be scheduled in a current or future milestone.
workflow::in dev: Issues that are actively being worked on by a developer.
workflow::in review: Issues that are undergoing code review by the development team.
workflow::verification: Everything has been merged, waiting for verification after a deploy.
Just like the rest of the company, we use Time Off by Deel to track when team members are traveling, attending conferences, and taking time off. The easiest way to see who has upcoming PTO is to run the /time-off-deel whosout command in the #g_respond_standup Slack channel. This will show you the upcoming PTO for everyone in that channel.
A list of interesting content related to the areas of the Respond group: