The Monitor:Observability group at GitLab is responsible for building tools that enable DevOps teams to respond to, triage and remediate errors and IT alerts for the systems and applications they maintain.
Building on the recent Opstrace acquisition, we aim to provide a streamlined Operations experience within GitLab that enables the individuals who write the code to also maintain it.
Our primary goals right now are to deliver a set of 3 Integration Milestones:
We have Geekbot standups on Wednesdays and retrospectives on Fridays. We use these async standups to communicate what we have accomplished, any current blockers, and what we plan to work on next.
Team Leadership Meeting: This weekly meeting is focused on aligning the EM, PM, Principal Engineer, and Developer Advocate on cross-functional objectives. Goals are set and weekly status is communicated.
Bi-monthly social hour: This meeting is non-work related and helps the team socialize and get to know each other better.
Team member coffee chats: Each team member should schedule a coffee chat with every other team member roughly every 4-6 weeks. Feel free to discuss work or non-work topics. If timezones are an issue, find another way to connect, such as an async Slack thread to check in. The goal is to get to know your other team members on a 1:1 basis.
We use several Slack channels to organize ourselves:
Currently, during our initial phase, we are using a 2-month milestone cadence. All work is organized into Epics and sub-epics, and assigned to the relevant Milestone.
Normally at the beginning of the Milestone the EM will discuss an overview of the work and what relevant areas you will focus on. Sometimes issues will already be assigned to you before the Milestone begins.
If you are ever looking for additional issues to work on:
workflow:in dev
label to the issue

Person | Role |
---|---|
Sebastien Pahl | Principal Product Manager, Monitor:Observability |
Andy Volpe | Staff Product Designer, Secure:Composition Analysis, Configure, Monitor, Secure, Protect |
Mat Appelman | Principal Engineer, Monitor:Observability |
Vitor Meireles De Sousa | Senior Security Engineer, Application Security, Package, Configure, Monitor |
Justin Mandell | Product Design Manager, Configure, Monitor, Secure & Protect |
Kevin Chu | Group Manager of Product Management, Configure, Monitor, Release |
A Pulse survey was conducted for the Opstrace team and the feedback was focused on a few key areas:
The Observability team is involved in the introduction of several new technologies and technical components to GitLab's tech stack.
The GitLab Monitor Stage Product Direction Handbook Page has information about the product strategy for integrating GitLab and Opstrace.
Observability and analytics features have big-data and insert-heavy requirements that are not a good fit for Postgres or Redis. ClickHouse was selected as a good fit for these features' requirements. ClickHouse is an open-source, column-oriented database management system. It is attractive for these use cases because it can efficiently filter, aggregate, and sum across large numbers of rows. ClickHouse is not intended to replace Postgres or Redis in GitLab's stack.
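As a rough illustration of why a column-oriented layout suits filter-and-sum workloads, the following minimal Python sketch pivots a few records into per-column lists and then computes a filtered sum while reading only the two columns involved. This is not ClickHouse code and does not reflect how ClickHouse is implemented; the field names (`project_id`, `status`, `duration_ms`) and the helper functions are hypothetical and chosen only for the example.

```python
# Hypothetical event records in a row-oriented layout:
# each record keeps all of its fields together.
rows = [
    {"project_id": 1, "status": "error", "duration_ms": 120},
    {"project_id": 2, "status": "ok", "duration_ms": 45},
    {"project_id": 1, "status": "error", "duration_ms": 300},
    {"project_id": 1, "status": "ok", "duration_ms": 80},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_where(columns, value_col, filter_col, expected):
    """Sum value_col over positions where filter_col equals expected.

    Only the two named columns are read; every other column is untouched,
    which is the property that makes columnar layouts attractive here.
    """
    pairs = zip(columns[value_col], columns[filter_col])
    return sum(value for value, flag in pairs if flag == expected)


columns = to_columns(rows)
total_error_ms = sum_where(columns, "duration_ms", "status", "error")
print(total_error_ms)
```

In a row-oriented store the same query would have to touch every field of every record, including `project_id`, even though it plays no part in the result.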
ClickHouse is the backend datastore for these features (currently under development):
ClickHouse is also being considered as a backend for:
GitLab Observability UI is a data-visualization library used to visualize metrics. We intend to extend it to visualize logging and tracing data as these features are added to the platform. GitLab Observability UI is based on a version of Grafana prior to the license change from Apache to AGPL. This approach was chosen for the following reasons:
Cortex is a highly scalable time-series database for Prometheus. Cortex has been part of the Opstrace stack since its founding, and the team has invested energy in everything from scale testing to developing its own operator.
Moving forward, we are proposing to remove Cortex from the stack for several reasons:
We would love community input on this proposal: https://gitlab.com/gitlab-org/opstrace/opstrace/-/issues/1656
Timeline: April 6 - June 22, 2022
Many of the goals, and the reasoning behind them, are discussed in the borrow request proposal.
"ClickHouse integrated as part of standard Opstrace + GitLab.com deployments, with Error Tracking backed by ClickHouse enabled by default."
We will communicate ongoing progress via:
Exception Ratio: 1 Principal Engineer, Multiple Staff Engineers : Team
Justification: The Observability team is focused on delivering several high-priority objectives in a short period of time (deadline June 22, 2022). As part of the ClickHouse Acceleration effort, the team will include 3 Staff Engineers in addition to an existing Principal Backend Engineer. This higher ratio will enable us to achieve all the objectives of the Borrow Request in addition to our pre-existing Milestone 3 goals.
We also track our backlog of issues, including past due security and infradev issues, and total open SUS-impacting issues and bugs.
MR Type labels help us report what we're working on to industry analysts in a way that's consistent across the engineering department. The dashboard below shows the trend of MR Types over time and a list of merged MRs.