The Monitor:Observability group at GitLab is responsible for building tools that enable DevOps teams to respond to, triage and remediate errors and IT alerts for the systems and applications they maintain.
Stages align with the eight loop stages of Plan, Create, Verify, Secure, Package, Release, Configure, and Monitor. These stages are based on Gartner's original research on how organizations can sequentially build a successful DevOps process by mapping business priorities to DevOps activities. A stage doesn't necessarily align with products in the market.
The Gartner definition of the Monitor stage comprises:
This list of business priorities maps together as a group of activities you typically think about together in DevOps. At GitLab, we have added some adjacent business priorities we want to solve for our users in Incident Management, ITSM (IT Service Management), and service desk.
Monitor is generally understood to mean application performance monitoring or infrastructure monitoring. This is because these functions are offered by some of the most widely used software development tools, such as Prometheus, Datadog, and New Relic. However, Monitor sometimes also encompasses incident management. Strictly speaking, incident management is not about monitoring application performance, but what you do when something goes wrong. These are related, but different, areas.
Given the close relationship between these two areas, people can quickly and intuitively find incident management under Monitor within the GitLab UI. Furthermore, many monitoring tools are also branching into incident management as a logical expansion, so the lines are becoming blurry.
Observability is an old term originating from control theory. The term became popular because traditionally, monitoring is about defining metrics in advance, then hoping you've defined good metrics that can indicate when your system is encountering or about to encounter problems.
This approach is insufficient in modern computing, as today's applications are often built on ephemeral and elastic infrastructure. With microservices, the interesting information often lies in the connection between the services, rather than within the services themselves.
We no longer operate in a space where we can confidently predict what might go wrong. Observability is all about operating on the assumption that although we know some things, we need a tool to help us understand exactly what's happening in those murkier areas, a tool that immediately alerts us to anomalies, errors, and downtime.
Observability extends monitoring by providing visibility into the entire software system, not just the individual services or functions. To achieve this perspective, observability relies on three types of telemetry data:
We want to build an observability product, a tool that's a fit for the modern developer building modern applications. Monitoring is good, but ultimately insufficient.
We have Geekbot standups on Wednesdays and retrospectives on Fridays. We use these async standups to communicate what we have accomplished, any current blockers, and what we plan to work on next.
Every Friday, the EM provides an async update of the team's progress, following the Ops sub-department async updates process.
These updates are published as issues in the general project.
Updates and highlights from all teams in Ops are collected automatically here, grouped by week / month / quarter.
Weekly Meetings: These are focused on organizing ongoing work or specific efforts such as rollouts or bigger initiatives.
Bi-weekly Cross-functional meeting: This meeting is focused on aligning the EM, PM, Principal Engineer, Developer Advocate, and UX on cross-functional objectives. Goals are set and weekly status is communicated.
Bi-monthly social hour: This meeting is non-work related and helps the team socialize and get to know each other better.
Team member coffee chats: Each team member should schedule a coffee chat with every other team member roughly every 4-6 weeks. Feel free to discuss work or non-work topics. If timezones are an issue, find another way to connect, such as an async Slack thread to check in. The goal is to get to know your other team members on a 1:1 basis.
Dev Syncs: These are developer-organized sync meetings where ICs can meet and discuss technical issues or organize technical work amongst themselves without requiring the presence of an EM.
We use several Slack channels to organize ourselves:
Currently, during our initial phase, we are using a 2-month milestone cadence. All work is organized into Epics and sub-epics and assigned to the relevant Milestone.
Normally, at the beginning of the Milestone, the EM will give an overview of the work and the areas you will focus on. Sometimes issues will already be assigned to you before the Milestone begins.
If you are ever looking for additional issues to work on:
workflow:in dev label to the issue
For some projects, we require a cleanroom development process. Our process is described here.
Person | Role |
---|---|
Mat Appelman | Principal Engineer, Monitor |
Ottilia Westerlund | Security Engineer, Fulfillment (Fulfillment Platform, Billing and Subscription Management), Govern (Security Policies, Threat Insights), Monitor (Observability), Plan (Product Planning) |
To showcase features developed by our team, we have created a number of demos:
The Observability team is involved in the introduction of several new technologies and technical components to GitLab's tech stack.
The GitLab Monitor Stage Product Direction Handbook Page has information about the product strategy for integrating GitLab and Opstrace.
We also encourage you to read our architecture documentation.
Observability and analytics features have big-data and insert-heavy requirements that are not a good fit for Postgres or Redis. ClickHouse was selected as a good fit for these features' requirements. ClickHouse is an open-source, column-oriented database management system. It is attractive for these use cases because it can efficiently filter, aggregate, and sum across large numbers of rows. ClickHouse is not intended to replace Postgres or Redis in GitLab's stack.
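To make the column-oriented point concrete, here is a minimal sketch in plain Python of why a filter-and-sum query can be cheaper on a columnar layout. It does not use ClickHouse or any of its APIs, and the `ErrorEvent` fields and sample values are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


# Hypothetical error-tracking event; the field names are illustrative only
# and are not taken from GitLab's or ClickHouse's actual schema.
@dataclass
class ErrorEvent:
    project_id: int
    fingerprint: str
    occurrences: int


events = [
    ErrorEvent(1, "a1", 3),
    ErrorEvent(2, "b2", 5),
    ErrorEvent(1, "c3", 7),
    ErrorEvent(1, "a1", 2),
]


def sum_rows(rows, project_id):
    """Row-oriented layout: every whole record is visited, even though
    only two of its fields are needed for the query."""
    return sum(r.occurrences for r in rows if r.project_id == project_id)


# Column-oriented layout: each field is stored as its own array.
columns = {
    "project_id": [e.project_id for e in events],
    "fingerprint": [e.fingerprint for e in events],
    "occurrences": [e.occurrences for e in events],
}


def sum_columns(cols, project_id):
    """Columnar layout: the query reads only the two columns it needs
    and never touches the fingerprint column."""
    return sum(
        n
        for pid, n in zip(cols["project_id"], cols["occurrences"])
        if pid == project_id
    )


row_total = sum_rows(events, 1)
col_total = sum_columns(columns, 1)
```

Both functions return the same total. The difference is in how much data each one has to read, which is what matters once a table holds a very large number of rows with many columns.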
ClickHouse is the backend datastore for these features:
ClickHouse is planned as the data backend for:
#clickhouse-gitlab-external431
We organized and participated in a Borrow Request to help migrate Error Tracking to ClickHouse.
The timeline for the Borrow Request was April 6 - July 22, 2022.
Many of the goals and much of the reasoning are discussed in the borrow request proposal.
"Clickhouse integrated as part of a standard Opstrace + GitLab .COM deployments with Error Tracking backed by Clickhouse enabled by default."
Epic: https://gitlab.com/groups/gitlab-org/opstrace/-/epics/73
Week of 2023-09-18
Week of 2023-09-25
Week of 2023-10-02
Week of 2023-10-09
Week of 2023-10-16
Week of 2023-10-23
(Sisense↗) We also track our backlog of issues, including past due security and infradev issues, and total open System Usability Scale (SUS) impacting issues and bugs.
(Sisense↗) MR Type labels help us report what we're working on to industry analysts in a way that's consistent across the engineering department. The dashboard below shows the trend of MR Types over time and a list of merged MRs.
(Sisense↗) Flaky tests are problematic for many reasons.