Respond Group

The Respond Group is part of the Monitor Stage of the DevOps lifecycle.

Respond

The Respond group at GitLab is responsible for building tools that enable DevOps teams to respond to, triage, and remediate errors and IT alerts for the systems and applications they maintain. We aim to provide a streamlined Operations experience within GitLab that enables the individuals who write the code to also maintain it.

This team maps to the Respond Group category and focuses on:

Exciting things and accomplishments

You can follow along with the team’s accomplishments by reading the latest weekly async updates.

Team members

Name	Role

Stable counterparts

Name	Role

Communication

Slack channel: #g_respond
Slack alias: @monitor-respond-group
Our Google groups are organized like this:
- Monitor Respond Group (whole team)
  - monitor-respond_be (backend team)
  - monitor-respond-fe (frontend team)
The Respond group team meetings are scheduled on our Monitor Stage team calendar
- The team holds a weekly sync meeting, alternating between 2 timezone groups (Thu @ 06:00 UTC and Thu @ 14:00 UTC) and meetings are recorded. We generally try to keep our process pretty light on meetings.

Dashboards

Working Agreements

🤝 Collaboration

Anything in an issue description is allowed to change. And YOU’RE allowed to do it. Since we have the description history, we can always go back to an old version with negligible effort. If making a substantial change to the description, provide some explanation to explain why it’s changing. Issue descriptions are documentation, not a single person’s opinion.
- Related: try to keep “I”s out of the issues you write, or differentiate your personal opinions/context by adding a disclaimer/bold/italics. Personal opinions could also be added to the comments section so that others can respond and discuss them. If a description is less personal, it’s easier for anyone else to feel like they can improve or refine it.
It’s ok to ask people to do things directly. It may feel uncomfortable, but you have to trust that they can manage their own time and priorities.
- “Someone should probably do X.” is a trap. Prefer “Would anyone be willing to do X?” or “Who can take on X?”. Identifying a specific owner for each new task or subtask (write tests, update docs, add a follow-up, etc) will prevent it from getting lost or forgotten.
Conversely, it’s ok to say no or offer hard limits.
- When you do say “no”, propose alternatives or a potential path forward for that person to get what they want/need.

📈 Results

Just declare a plan. If people don’t agree, they’ll tell you.
- If requirements are unclear, ask for help and explain exactly what you’re looking for. Prefer questions like “What should happen in scenario X?” to “I’m not sure how Y should work.” The responder should be able to tell whether they’ve unblocked you simply by making sure each of your questions has an answer.
- If the discussion has gotten off-track, you can’t tell what the action item should be, or you don’t have an opinion yourself, make the plan up! Then communicate it.
It’s also ok to express a direct opinion about what you think is best when presenting a set of options.
- It’s way easier to engage with your work if you believe in what you’re building. Advocate for it.
Sometimes it’s necessary to accept risk to make progress.

⏱️ Efficiency

It’s ok to say “I’m so confused, can you explain it differently?”
When answering questions or posing questions, always think “who is my audience, and what info do they need right now?”
- Make it easy for the reader to just pick an option or take action.
- If posing a question/problem to multiple groups, categorize information & label it. Let the reader choose what they want to read.
- A question posed to a designer should be different from a question asked of an engineer. Our designer needs to know the impact of the decision on the user, how big of a pain a given solution is, or whether an option has implications for the design down the line. Conversely, an engineer needs to know which code is being discussed, any implicit assumptions that have been made, which requirements are already known, or why certain options have been ruled out. But in either circumstance, you want to provide the responder with exactly what they need to make an informed choice by the time they reach the end of the question/comment.

🌐 Diversity, Inclusion & Belonging

Communication is hard. Our attention spans are short. If possible, supplement with pictures.
- If you are verbose, that’s ok. Include summaries, tldrs, tables, headers, and style your text to make it easier to consume your writing.

👣 Iteration

If you already took an action down one path, but now you need to go a different direction, that’s ok. That’s iteration. You did not waste time or do anything wrong. You just moved forward.

👁️ Transparency

If there’s 80% of a decision but still some unknowns, it can be ok to use “I’m just going to improvise” as the plan for the remaining pieces. Just state it explicitly in advance & communicate the outcome afterward.
- The best path forward is sometimes the path of least resistance. It often doesn’t matter what you do, as long as it’s well communicated.

Issue boards

Monitor - Workflow - Issue board organized by workflow labels
Monitor Bugs - Issue board organized by Priority labels so that we make sure we meet our bug fix SLA

Development Processes

Surfacing blockers

To surface blockers, mention your Engineering Manager or Product Manager in the issues. Also make sure to raise any blockers in your daily async standup using Geekbot.

The Engineering Manager and Product Manager want to make unblocking the team their highest priority. Please don’t hesitate to raise blockers.

Scheduling

Scheduling issues in milestones

The Product Manager is responsible for scheduling issues in a given milestone. The engineering team will make sure that issues are scoped and well-defined enough to implement, and will determine whether they need UX involvement and/or technical investigation.

See also Measuring Say Do ratio for more on milestone commitments.

Scheduling bugs

When new bugs are reported, the Engineering Manager ensures that they have proper Priority and Severity labels. Bugs are discussed in the weekly triage issue and are scheduled according to severity, priority, and the capacity of the teams. Ideally, we should work on a few bugs each release regardless of priority or severity.

Scheduling technical debt

As new technical debt issues are created, the Engineering Manager and Product Manager will triage, prioritize and schedule these issues. When new issues are created by Monitor team members, add any relevant context to the description about the priority or timing of the issue, as this will help streamline the triage work.

Technical debt is planned following the standard prioritization scheduling.

Weekly async updates

As part of the Ops sub-department Async Updates, the EM is responsible for sharing a weekly team update.

Weekly update flow:

Monday
1. (bot) Weekly update is created automatically as an issue in the gitlab-org/monitor/respond project.
Throughout the week
1. (human) Noteworthy highlights, blockers, key metrics, etc, are posted manually in a comment thread on the issue.
Friday
1. (bot) Issue stats are pulled and inserted in the issue.
2. (bot) Highlight thread is copied in the issue.
3. (human) Curate the issue description as needed.
4. (human) If the milestone just ended, report on Say Do ratio with some context and trend.
5. (human) Close the issue, share a link to it in the #g_respond Slack channel.

Links

Interacting with community contributors

Community contributions are encouraged and prioritized at GitLab. Please check out the Contribute page on our website for guidelines on contributing to GitLab overall.

Within the Monitor stage, Product Management will assist a community member with questions regarding priority and scope. If a community member has technical questions on implementation, Engineering Managers will connect them with MR coaches within the team to collaborate with.

Using spikes to inform design decisions

Engineers use spikes to conduct research, prototyping, and investigation to gain knowledge necessary to reduce the risk of a technical approach, better understand a requirement, or increase the reliability of a story estimate (paraphrased from this overview). When we identify the need for a spike for a given issue, we will create a new issue, conduct the spike, and document the findings in the spike issue. We then link to the spike and summarize the key decisions in the original issue.

Assigning MRs for code review

Engineers should typically ignore the suggestion from Dangerbot’s Reviewer Roulette and assign their MRs to be reviewed by a frontend engineer or backend engineer from the Respond Group. If the MR requires domain-specific knowledge held by another team or by a person outside the Respond Group, the author should assign the MR to an appropriate domain expert for review. The MR author should use the Reviewer Roulette suggestion when assigning the MR to a maintainer.

Advantages of keeping most MR reviews inside the Respond Group include:

Quicker reviews because the reviewers hopefully already have the context and don’t need additional research to figure out how the MR is supposed to work.
Knowledge sharing among the engineers in the Respond Group.
Design reviews currently follow a different process. For design reviews, follow the “Reviewer roulette” recommendation (will only be shown if the MR is non-draft and has a ~UX label applied), and ensure you provide context for how to set up the feature they will be testing. For example, for testing alerts:

<!---
1. Navigate to Settings > Monitor
1. Expand the Alert section, and click the button to "Enable a new integration"
1. Select "HTTP endpoint" in the integration type dropdown. Add an integration name, turn the toggle to "active", and click to "Save the integration"
1. Once the integration is added, click on the settings icon button in the integration table
1. Click on the "Send test alert" tab
1. Enter the sample payload shown below, and click send.
1. Navigate to Monitor > Alerts, where you will see the new alert appear.
-->
{  "title": "Gitaly latency is too high",  "description": "https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/gitaly/gitaly-latency.md",  "service": "service not affected",  "monitoring_tool": "GitLab scripts",  "severity": "high", "host": "fe-2" }
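The same sample payload can also be sent from a script instead of the "Send test alert" tab. The following is a minimal Python sketch, not the team's tooling. It assumes the integration's URL and authorization key are copied from its settings and that the endpoint accepts the key as a Bearer token; the URL and key shown are placeholders, and `build_alert_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Sample payload from the testing steps above.
SAMPLE_ALERT = {
    "title": "Gitaly latency is too high",
    "description": "https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/gitaly/gitaly-latency.md",
    "service": "service not affected",
    "monitoring_tool": "GitLab scripts",
    "severity": "high",
    "host": "fe-2",
}


def build_alert_request(endpoint_url: str, authorization_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the alert as JSON."""
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: the integration authenticates with a Bearer token.
            "Authorization": f"Bearer {authorization_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: replace with the URL and key shown in the integration settings.
request = build_alert_request(
    "https://gitlab.example.com/REPLACE/WITH/INTEGRATION/URL",
    "PLACEHOLDER_KEY",
    SAMPLE_ALERT,
)
# urllib.request.urlopen(request) would send it; omitted so the sketch has no side effects.
```

Building the request separately from sending it makes the payload easy to inspect before anything reaches a real project.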

Preparing UX designs for engineering

Product designers generally try to work one milestone ahead of the engineers, to ensure scope is defined and agreed upon before engineering starts work. So, for example, if engineering is planning on getting started on an issue in 12.2, designers will assign themselves the appropriate issues during 12.1, making sure everything is ready to go before 12.2 starts.

To make sure this happens, early planning is necessary. In the example above, for instance, we’d need to know by the end of 12.0 what will be needed for 12.2 so that we can work on it during 12.1. This takes a lot of coordination between UX and the PMs. We can (and often do) try to pick up smaller things as they come up and in cases where priorities change. But, generally, we have a set of assigned tasks for each milestone in place by the time the milestone starts so anything we take on will be in addition to those existing tasks and dependent on additional capacity.

The current workflow:

Though Product Designers make an effort to keep an eye on all issues being worked on, PMs add the UX label to specific issues needing UX input for upcoming milestones.
The week before the milestone starts, the Product Designers divide up issues depending on interest, expertise and capacity.
Product Designers start work on assigned issues when the milestone starts. We make an effort to start conversations early and to have them often. We collaborate closely with PMs and engineers to make sure that the proposed designs are feasible.
In terms of what we deliver: we will provide what’s needed to move forward, which may or may not include a high-fidelity design spec. Depending on requirements, a text summary of the expected scope, a balsamiq sketch, a screengrab or a higher fidelity measure spec may be provided.
When we feel like we’ve achieved a 70% level of confidence that we’re aligned on the way forward, we change the label to ~‘workflow::ready for development’ as a sign that the issue is appropriately scoped and ready for engineering.
We usually stay assigned to issues after they are ~‘workflow::ready for development’ to continue to answer questions while the development process is taking place.
Finally, we review MRs following the guidelines as closely as possible to reduce the impact on velocity whilst maintaining quality.

Measuring Say Do ratio

How we measure Say Do ratio:

We set a list of goals in the milestone planning issue. Usually 3-5 of them.
1. Stretch goals may exist but don’t contribute to Say/Do.
Ideally 1 goal = 1 epic. There can be exceptions, some goals don’t map 1:1 with an epic. That’s ok.
Say = number of goals planned (i.e. committed) at the start of the milestone.
Do = number of goals achieved at the end of the milestone (i.e. number of epics closed, usually).
The Engineering Manager reports the ratio in the weekly async update with some context and the recent trend.
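As an illustration of the arithmetic, here is a minimal Python sketch of the calculation described above. The `Goal` class, the `say_do_ratio` function, and the goal titles are hypothetical and not part of any team tooling; the only rules taken from this page are that stretch goals are excluded and that Say and Do are counted per goal.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    title: str
    achieved: bool
    stretch: bool = False  # stretch goals do not contribute to Say/Do


def say_do_ratio(goals: list[Goal]) -> tuple[int, int, float]:
    """Return (say, do, ratio) for one milestone."""
    committed = [g for g in goals if not g.stretch]
    say = len(committed)                            # goals planned at milestone start
    do = sum(1 for g in committed if g.achieved)    # goals achieved at milestone end
    ratio = do / say if say else 0.0
    return say, do, ratio


# Hypothetical milestone: three committed goals and one stretch goal.
milestone_goals = [
    Goal("Hypothetical goal A", achieved=True),
    Goal("Hypothetical goal B", achieved=True),
    Goal("Hypothetical goal C", achieved=False),
    Goal("Hypothetical stretch goal", achieved=True, stretch=True),
]

say, do, ratio = say_do_ratio(milestone_goals)
print(f"Say={say} Do={do} ratio={ratio:.0%}")  # prints "Say=3 Do=2 ratio=67%"
```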

How this differs from past approaches:

We do not apply the ~deliverable label to issues committed to being completed in the current milestone.
We do not apply the ~filler label to issues which are not committed in the current milestone.

Why we choose this approach:

Clear goals. Puts the focus on committing to milestone goals, and whether we’re achieving them. Milestone plan is the go-to source of truth for our priorities.
1. Bonus: we get better at writing goals (Say), because if we don’t they’ll be hard to assess (Do).
Accuracy. By zooming out to the epic-level, we gain accuracy. We don’t need to label individual issues, they’re either part of a deliverable goal / epic, or not.
Usability. By being less granular, we don’t need to spend time labeling individual issues. Engineering Managers save time, engineers retain ownership and case-by-case decisions within an established goal/epic.
Easier to reason about. The Say Do ratio is “rounder” (20%, 25%, 33%, 40%, etc), since the number of goals is 3-5. It’s less noisy, and less prone to over-analysis than an issue-based ratio. E.g. what would it mean to go from 77.8% to 73.6% delivered issues? Is it worrisome? Is it ok?
See this thread for the original context that led to this approach.

Downsides to this approach:

Accuracy may still be a problem, depending on how well we break down our milestone goals.
1. Mitigation: accept that perfect accuracy is not a goal, and rely on the feedback cycle to improve our goal setting skills.
Query-based dashboards will not pick up our Say Do ratio.
1. Mitigation: reporting of Say Do ratio in the Ops sub-department is moving to the monthly PI review instead of relying on a dashboard.

Repos we own or use

Prometheus Ruby Mmap Client - The Ruby Prometheus instrumentation library we built and used to instrument GitLab
GitLab - Where much of the user facing code lives
Omnibus and Charts, where a lot of the packaging related work goes on. (We ship GitLab fully instrumented along with a Prometheus instance)

Service accounts we own or use

Zoom sandbox account

In order to develop and test Zoom features for the integration with GitLab we now have our own Zoom sandbox account.

Requesting access

To request access to this Zoom sandbox account, please open an issue providing your non-GitLab email address (which can already be associated with an existing non-GitLab Zoom account).

The following people are owners of this account and can grant access to other GitLab Team Members:

Granting access

Log in to Zoom with your non-GitLab email
Go to User Management > Users
Click on Add User
Specify email addresses
Choose User Type - most likely Pro
Click Add - the users receive invitations via email
Add the linked name to the list in “Requesting access”

Documentation

For more information on how to use Zoom, see their guides and API reference.

Labels

The Respond team uses labels for issue tracking and to organize issue boards. Many of the labels we use also drive reporting for Product Management and Engineering Leadership to track delivery metrics. It’s important that labels be applied correctly to each issue so that information is easily discoverable.

Issue Labels

Stage: required. Identifies which stage of GitLab an issue is assigned to.
- ~devops::monitor
Group: required. Identifies which team this issue belongs to. This triggers new issues to appear in the weekly triage report for the team’s Product and Engineering managers.
- ~group::respond
Team: required. Identifies which team (or both) will develop a solution.
- ~frontend
- ~backend
Category: optional. Identifies the correct Monitor category the issue falls under.
- ~Category:Runbooks
- ~Category:Incident Management
- ~Category:On-call Schedule Management
- ~Category:GitLab Self Monitoring
- ~Category:Error Tracking
- ~Category:Synthetic Monitoring
- ~Category:Product Analytics
Milestone: optional. While technically not a label, if the issue is being worked on immediately, add the current milestone. If you know when the issue needs to be scheduled (such as follow-up work), add the future milestone that it should be scheduled in. Otherwise, leave it empty.
Issue Type: required.
- ~"type::feature": Feature Issues
- ~"type::bug": Bug Issues
- ~technical debt: Technical Debt
Workflow: required.
- workflow::refinement: Issues that need further input from team members in order for it to be workflow::ready for development.
- workflow::blocked: Waiting on external factors or another issue to be completed before work can resume.
- workflow::ready for development: The issue is refined and ready to be scheduled in a current or future milestone.
- workflow::in dev: Issues that are actively being worked on by a developer.
- workflow::in review: Issues that are undergoing code review by the development team.
- workflow::verification: Everything has been merged, waiting for verification after a deploy.
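The required labels above can be checked mechanically. The following is a minimal Python sketch, assuming an issue's labels are available as a set of strings without the leading `~`; the `REQUIRED_LABELS` table and the `missing_required_labels` function are hypothetical names, and the optional Category and Milestone fields are deliberately not checked.

```python
# Required label groups from the list above; an issue needs at least one label
# from each group.
REQUIRED_LABELS = {
    "Stage": {"devops::monitor"},
    "Group": {"group::respond"},
    "Team": {"frontend", "backend"},
    "Issue Type": {"type::feature", "type::bug", "technical debt"},
    "Workflow": {
        "workflow::refinement",
        "workflow::blocked",
        "workflow::ready for development",
        "workflow::in dev",
        "workflow::in review",
        "workflow::verification",
    },
}


def missing_required_labels(labels: set[str]) -> list[str]:
    """Return the names of required label groups that the issue does not satisfy."""
    return [
        group
        for group, allowed in REQUIRED_LABELS.items()
        if not (labels & allowed)  # no overlap means the group is missing
    ]


# Hypothetical issue that has no workflow label yet.
issue_labels = {"devops::monitor", "group::respond", "backend", "type::bug"}
print(missing_required_labels(issue_labels))  # prints "['Workflow']"
```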

Respond PTO

Just like the rest of the company, we use Time Off by Deel to track when team members are traveling, attending conferences, and taking time off. The easiest way to see who has upcoming PTO is to run the /time-off-deel whosout command in the #g_respond_standup Slack channel. This will show you the upcoming PTO for everyone in that channel.

Reading list

A list of interesting content related to the areas of the Respond group:

On-Call
- Google’s SRE Workbook, Chapter 8 - On-Call
Incident Response
- Google’s SRE Workbook, Chapter 9 - Incident Response
Postmortem Culture: Learning from Failure
- Google’s SRE Workbook, Chapter 10 - Postmortem Culture: Learning from Failure

Group Respond - GitLab End-to-End (E2E) Testing for group Respond

Overview: The goal of this page is to summarize how the Respond group can use our existing GitLab QA framework to run and/or implement E2E tests. Why do we have them? E2E testing is a strategy used to check whether our application works as expected across the entire software stack and architecture. This includes the integration of all micro-services, features and components that are supposed to work together to satisfy any meaningful and complete user workflow.

Respond Group - JTBD

The jobs-to-be-done that the Respond group is solving for.

Last modified March 9, 2024: Remove erroneous stable counterparts from respond team page (63dd4534)
