Using GitLab, you automatically get broad and deep insight into the health of your deployment.
We provide a robust monitoring solution that gives GitLab users insight into the performance and availability of their deployments and alerts them to problems as soon as they arise. We provide data that is easy to digest and to relate to other features in GitLab. With every piece of the DevOps lifecycle integrated into GitLab, we have a unique opportunity to closely tie our monitoring features to all of the other pieces of the DevOps flow.
We work collaboratively and transparently and we will contribute as much of our work as possible back to the open-source community.
The monitoring team is responsible for:
This stage consists of the following groups:
These groups map to the Monitor Stage product category.
Team members who are successful in this stage typically demonstrate a stakeholder mentality. There are many ways to demonstrate this, but examples include:
This stage is only successful when each team member collaborates to make one another successful.
Since GitLab releases on a monthly basis, our supporting activities also follow a monthly rhythm. Because our releases take place on the 22nd of each month, each monthly cadence does not map to a calendar month. The activities are listed in order for ease of reference.
workflow::verification will be moved to the next milestone
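Because the cadence is anchored to the 22nd rather than to calendar months, it can help to compute which release a given date belongs to. The following is a minimal sketch, not an official tool: `RELEASE_DAY` and `next_release_date` are hypothetical names introduced here for illustration, and the only fact taken from the text above is that releases happen on the 22nd of each month.

```ruby
require 'date'

# Assumption from the text above: releases take place on the 22nd of each month.
RELEASE_DAY = 22

# Hypothetical helper: returns the first release date on or after the given date.
def next_release_date(date)
  this_month = Date.new(date.year, date.month, RELEASE_DAY)
  # On or before the 22nd, the release is this month; otherwise it is next month.
  date <= this_month ? this_month : this_month.next_month
end

puts next_release_date(Date.new(2020, 5, 10))  # 2020-05-22
puts next_release_date(Date.new(2020, 5, 28))  # 2020-06-22
puts next_release_date(Date.new(2020, 12, 23)) # 2021-01-22
```

A date late in May therefore belongs to the June release, which is why monthly activities do not line up with calendar months.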
Meetings are not required, but attending the important ones or reviewing their recordings will generally help team members be successful. They are listed in order of importance and are all stored in the Monitor Stage Calendar (viewable to all GitLab team members).
Groups in this stage also participate in async daily standups. The purpose is to give every team member insight into what others are working on so that we can identify ways to collaborate and unblock one another, as well as foster relationships within the team. We use the Geekbot Slack plugin to automate our async standup, following the guidelines outlined in the Geekbot commands guide.
Our questions change depending on the day of the week. Participation is optional but encouraged.
| Question | Why we ask it |
|---|---|
| Do you need help from anyone to unblock you this week? | One of our main goals with our standups is to help ensure that we are unblocking one another as a top priority. We ask this first because we think it's the question that other team members can take action on. |
| What do you plan on working on this week? | We want to understand how our daily actions drive us toward our weekly goals. This question provides a broader context for our daily work but also helps us hold ourselves accountable to maintaining proper scopes for our tasks, issues, merge requests, etc. This answer may stay the same for a week, which would mean things are progressing on schedule. Alternatively, seeing this answer change throughout the week is also okay. Maybe we got sidetracked helping someone get unblocked. Maybe new blockers came up. The intention is not to have to justify our actions but to keep a running record of how our work is progressing or evolving. |
| Any personal tidbits you'd like to share? | This question is intentionally open-ended. You might want to share how you feel, a personal anecdote, a funny joke, or simply let the team know that you will have limited availability that afternoon. All of these answers are welcome. |

| Question | Why we ask it |
|---|---|
| Are you facing any blockers requiring action from others? | Same reason as Monday's first question. |
| Are you on track with your plan for the week? | We want to understand how each team member is doing on achieving our weekly goal(s). It is meant to highlight progress while also identifying whether anything is getting in the way. This could also be used to update the plan for the week as things change. |
| What will be your primary focus for today? | This question is aimed at the most impactful task for the day. We aren't trying to account for the entire day's worth of work. Highlighting only a primary task keeps our answers concise and provides insight into each team member's most important priority. This doesn't necessarily mean sharing the task that will take the most time. We focus on results over input. Typically this will mean highlighting the task that is most impactful in closing the gap between today and our end-of-week goal(s). |
| Any personal tidbits you'd like to share? | Same reason as Monday's last question. |

| Question | Why we ask it |
|---|---|
| What went well this week? What did you enjoy? | The end of the week is a good time to reflect on our goals, and this question is meant to be a short retrospective of the week. It focuses on things that went well during the week. |
| What didn’t go so well? What caused you to slow down? | Like the previous question, this question is a way to review our week. It surfaces things that did not go so well or that got in the way of meeting our weekly goal(s). |
| What have you learned? | This could be something work-related or personal. We hope that by sharing what we have learned, others can learn from us. |
| Any plans for the weekend you'd like to share? | Like the "personal tidbit" question we ask on other days of the week, this one is very open-ended. You can share as much or as little as you want, and all answers are welcome. |
To emphasize code ownership (which is aligned with organization-wide efforts) and help identify domain expertise for the parts of the codebase developed by the Monitor stage, we place every new feature into a dedicated top-level namespace. The names of these namespaces are derived from product categories. This approach helps us communicate better, not only among engineers but also with other peers, because the terms used are well known across the organization.
lib/ directory, in this case, all code which cannot be run alone outside of the GitLab repository should first be placed under the Gitlab top-level namespace, and after that in the product category subnamespace.
Group scope, which is rarely the case. To address that, one should place the controller according to the feature scope (Group) and then in the category subnamespace.
lib/api directory. That ensures this code has the API top-level namespace, followed by the product category subnamespace.
IncidentManagement::ProcessAlertWorker was properly placed under the IncidentManagement namespace, which relates to the Incident Management product category.
Metrics::UsersStarredDashboard was properly placed under the Metrics namespace, which relates to the Metrics product category.
Metrics::Dashboard::Annotation was properly placed under the Metrics namespace, which relates to the Metrics product category.
API::Metrics::UsersStarredDashboard was properly placed in the API top-level namespace and then in the Metrics subnamespace, which relates to the Metrics product category.
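The correctly placed examples above can be sketched as plain Ruby module nesting. This is only an illustration of the naming scheme: the class bodies are empty stand-ins (the real GitLab classes have superclasses and implementations that are omitted here), and `namespace_segments` is a hypothetical helper introduced for this sketch, not part of the GitLab codebase.

```ruby
# Feature code lives in a top-level namespace named after the product category.
module IncidentManagement
  class ProcessAlertWorker; end # empty stand-in for the real worker
end

module Metrics
  class UsersStarredDashboard; end # empty stand-in for the real class
end

# API code gets the API top-level namespace first, then the category subnamespace.
module API
  module Metrics
    class UsersStarredDashboard; end # empty stand-in for the real API class
  end
end

# Hypothetical helper: lists the namespaces that enclose a class, outermost first.
def namespace_segments(klass)
  klass.name.split('::')[0...-1]
end

p namespace_segments(IncidentManagement::ProcessAlertWorker) # ["IncidentManagement"]
p namespace_segments(API::Metrics::UsersStarredDashboard)    # ["API", "Metrics"]
```

Reading the segments from left to right gives the top-level namespace first and the product category after it, which is the ordering the guidance asks for.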
Gitlab::Prometheus::MetricGroup should be named Gitlab::Metrics::Prometheus::MetricGroup and placed at
Projects::PerformanceMonitoring::DashboardsController should be named Projects::Metrics::DashboardsController and placed at
MetricsDashboardAnnotation and respectively it should be placed at
Prometheus::ProxyVariableSubstitutionService according to this guidance should be named Metrics::Prometheus::ProxyVariableSubstitutionService and placed at
Prometheus::CreateDefaultAlertsWorker should be named IncidentManagement::Prometheus::CreateDefaultAlertsWorker and placed at
Spikes are time-boxed investigations typically performed in agile software development. Groups in the Monitor stage typically create Spike issues before a feature is developed, when there is uncertainty about how to proceed from a technical perspective.
deliverable to ensure clear ownership from engineers
workflow::ready for development
workflow::verification and close the issue
Engineer(s) assigned to the Spike issue will be responsible for the following tasks:
With the support of GitLab's SRE team, we implemented the SRE shadow program as a means of improving the team's understanding of our ideal user personas so that we can build a better product.
In this program, engineers are expected to devote one entire week to shadowing SREs. There is no expectation for the engineer to complete their assigned issues during this time. Engineers are added to PagerDuty and follow the existing SRE shadow format of interning, scaled down to a shorter duration of one week. Although SREs are typically on call for multiple days at a time, shadows are only expected to shadow during their regular business hours. This can be set as a preference in PagerDuty.
Team members interested in the program should notify their respective managers. We are currently limited to a maximum of two shadows per release so that we do not overload the SRE team. If you are shadowing during the same release as another engineer, coordinate to create a combined access request for the duration of the release.
The participant's manager and the Engineering Manager, Reliability Engineering should collaborate with you to define a schedule in the Slack channel #monitor-sre-shadow. You can either check PagerDuty schedules or coordinate with the SRE manager to figure out who you'll be shadowing. Your schedule should include the week and working hours of your on-call shift, along with which SRE team member(s) will be on-call at the same time.
The participant should create an access request for PagerDuty and assign the access request to the SRE manager (this is a departure from established processes). PagerDuty licenses are limited, so previous participants in the program may have to relinquish their licenses to you. Once PagerDuty access is set up and you have been assigned to an SRE, override the shadow PagerDuty schedule and assign yourself to the corresponding shift. You can also add contact details, such as a mobile number, so that PagerDuty can send alerts to you.
Typically, shadowing an SRE involves activities such as paying attention to SRE Slack channels (#production, #incident-management), reading through incident issues posted there, and joining 'The Situation Room' posted at the top of incident management for any active issues where the on-call SRE joins that room. About one week before starting your rotation, you should get familiar with the Runbooks README and coordinate with the SRE(s) who will be on-call to determine which areas it makes sense for you to shadow (incidents, other on-call tasks, SRE daily tasks, etc.). Scheduling a coffee chat with the SRE manager and/or SRE team members is recommended.
Please ask to be removed from PagerDuty after your shadow rotation to free up your license.
Alumni of the program are encouraged to add themselves to this list and document/link to the observations/outcomes they were able to share with the wider team.
| Team member | Observations/outcomes |
|---|---|
| Laura Montemayor | Shadowing a Site Reliability Engineer |
| Tristan Read | My week shadowing a GitLab Site Reliability Engineer |
| Sarah Yasonik | Created 4 issues for the team to consider adding to the product |
| Miguel Rincon | Video: Miguel talks about his SRE shadow experience |
| Olena Horal-Koretska | 7 things I’ve learnt while shadowing an SRE |
To make it more efficient to verify changes and demonstrate our product features to customers and other stakeholders, the engineers in this stage maintain a few demo environments.
| Purpose | Environment |
|---|---|
| Customer simulation environment | tanuki-inc |
| Verifying features in Staging | monitor-sandbox (Staging) |
| Verifying features in Production | monitor-sandbox (Production) |
To be able to test logging features in both the Elastic Stack-enabled and Kubernetes-only cases, the following clusters and environments exist in production and staging:
|Elastic Stack ON
|Elastic Stack OFF