This is the product direction for Monitor. If you'd like to discuss this direction directly with the product managers for Monitor, feel free to reach out to Dov Hershkovitch (GitLab, Email), Sarah Waldner (GitLab, Email Zoom call) or Kevin Chu (GitLab, Email Zoom call).
The monitoring and management market, otherwise branded as observability, is well-established and crowded. It is also fast-changing in terms of technologies used and users' expectations. For instance, the trend to move infrastructure to public cloud introduced a new category of technologies to monitor that traditional vendors did not address. This SaaS delivery model disrupted the on-prem delivery model for many existing vendors. More recently, a transition from virtualization to container-based technologies caused another wave of adjustment to what it means to monitor or observe. This further challenged existing vendors. These and other market trends allow new entrants (like Sentry and current market leader Datadog) to quickly capture mindshare and eclipse existing vendors.
In two years' time, GitLab aims to make observability a commodity by being ubiquitous, complete, cost-effective, and simple to set up and operate for any cloud-native team, enabling them to continuously improve.
GitLab, at this particular time, is uniquely qualified to deliver on this bold and ambitious vision because:
A trade-off in our approach is that we are explicitly not striving to be a fully turn-key experience that can be used to monitor all applications, particularly legacy applications. Removing an existing monitoring solution wholesale is painful, so a land-and-expand strategy is prudent here. As a customer recently explained, "Every greenfield application that we can deploy with your monitoring tools saves us money on New Relic licenses."
As this stage matures, we will begin to shift our attention and compete more directly with incumbent players as a holistic Monitoring solution for modern applications.
Building on our two-year vision statement, our three-year goal is to have built an integrated package of observability and operations tools that can displace today's front-runner in modern observability, Datadog, and compete in all Monitor categories. We'll do that by focusing on the four core workflows of Instrument, Triage, Resolve, and Improve.
The following links describe our strategy for each individual workflow:
The Monitor stage comes after you've configured your production infrastructure and deployed your application to it. As part of the verification and release process you've done some performance validation - but you need to ensure your service(s) maintain the expected service-level objectives (SLOs) for your users.
GitLab's Monitor stage product offering makes instrumentation of your service easy, giving you the right tools to prevent, respond to, and restore SLO degradation. Current DevOps teams either lack exposure to operational tools or utilize ones that put them in a reactive position when complex systems fail inexplicably. Our mission is to empower your DevOps teams by finding operational issues before they hit production and enabling them to respond like pros by leveraging default SLOs and responses they proactively instrumented. GitLab Monitoring allows you to successfully complete the DevOps loop, not just for the features in your product, but for its performance and user experience as well.
Using GitLab observability solutions, users will have an easy way to gain a holistic understanding of the state of production services across multiple groups and projects. When you are deploying a suite of services, it's critical that you can drill into each individual service's SLO attainment as well as troubleshoot issues that span multiple services.
We track epics for all the major deliverables associated with the north stars, and category maturity levels. You can view them on our Monitor Roadmap.
The terms monitoring and observability are at times used interchangeably and can cause some confusion.
Note - Yes, we're also guilty of this and actively improving it. If you see room for improvement, please feel free to make a contribution!
Observability is the ability to infer internal states of a system based on the system’s external outputs. Monitoring, on the other hand, is the activity of observing the state of a system over time. To achieve observability, your system’s various telemetry types should all be available to enable proactive introspection and greater operational visibility. The overarching goal for GitLab's Monitor category is to help improve the observability of your applications and systems.
If you are interested in more information on this topic, Charity Majors, CTO of Honeycomb, has given many great talks and written many articles about it. Giving credit where it is due, Charity has played a major role in pointing out the shortcomings of monitoring and helped push observability to the mainstream. Here are some useful articles from her along with Ben Sigelman of LightStep on this topic:
Note - Charity has critiqued our direction in the past. Points taken, improvements coming!
We are currently in the process of bringing most of the Monitor categories to minimal maturity. After this effort, we will have two main focus areas for the next 3 to 6 months.
First, we plan to provide a streamlined triage experience that allows our users to quickly identify and effectively troubleshoot an application problem, as described in the following flow:
Detailed information can be found in the triage to minimal epic.
Second, we plan to dogfood our current capabilities. Monitoring and observability solutions, by nature of what they are, have a high bar to meet before adoption. By continuing to improve the triage workflow, we will at the same time enable our GitLab teammates to use GitLab Monitor more fully. We will pause incremental investment in additional Monitor capabilities until we have, at minimum, met GitLab's internal need for Monitoring.
We're pursuing a few key objectives within the Monitor Stage.
Your team's service(s), first and foremost, need to be observable before you are able to evaluate production performance characteristics. We believe that observability should be easy. GitLab will ship with smart conventions that set up your applications with generic observability. We will also make it simple to instrument your service, so that custom metrics, the ones you'd like to build your own SLOs around, can be added with a few lines of code.
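To illustrate what "a few lines of code" for a custom metric might look like, here is a minimal sketch using only the Python standard library. The `Counter` class and the `checkout_requests_total` metric are hypothetical names invented for this example; this is not the official Prometheus client library, only a hand-rolled stand-in that emits the Prometheus text exposition format.

```python
from collections import defaultdict


class Counter:
    """A tiny, hypothetical counter metric (not the official Prometheus client)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(float)  # label tuple -> running total

    def inc(self, amount=1.0, **labels):
        # Sort the labels so the same label set always maps to the same key.
        self._values[tuple(sorted(labels.items()))] += amount

    def render(self):
        """Render the counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            series = f"{self.name}{{{label_str}}}" if label_str else self.name
            lines.append(f"{series} {value}")
        return "\n".join(lines) + "\n"


# Instrumenting a (hypothetical) checkout service takes a few lines:
checkouts = Counter("checkout_requests_total", "Checkout requests handled.")
checkouts.inc(status="ok")
checkouts.inc(status="ok")
checkouts.inc(status="error")
print(checkouts.render())
```

In a real service, the rendered text would typically be served from an HTTP endpoint that a Prometheus server scrapes; a maintained client library would be the usual choice rather than a hand-written class like this one.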
Alerting and notification services are a table-stakes expectation of APM and Metrics solutions. GitLab will build a great experience for setting thresholds and metrics, including setting smart defaults for known metrics. We'll lean heavily on our early integration with Prometheus scheduling, notification, and alerting services. Beyond alerting, integration with chatops and incident management is also going to be important.
Visually working with time-series data is an important expectation of an observability solution. Our dashboarding solutions will include ad-hoc data visualization that lets users quickly build time-series visualizations from metrics, chart them against related metrics, and break them down by the field of their choice. A dashboarding system should also provide a curated UI experience for the established vendors that are clearly in the lead.
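As a rough sketch of threshold alerting with smart defaults, the following Python function fires only when the most recent samples all exceed a limit, loosely analogous to the `for` clause of a Prometheus alerting rule. The metric names, default values, and the `should_alert` helper are assumptions made up for illustration, not GitLab's or Prometheus's actual defaults.

```python
# Hypothetical default thresholds for well-known metrics (illustrative values).
DEFAULT_THRESHOLDS = {
    "error_rate": 0.05,           # fraction of failed requests
    "latency_p95_seconds": 1.0,   # 95th percentile latency
}


def should_alert(metric, samples, threshold=None, for_samples=3):
    """Return True if the last `for_samples` samples all exceed the threshold.

    When no explicit threshold is given, fall back to the default for the
    metric. Requiring several consecutive breaches avoids alerting on a
    single noisy data point.
    """
    limit = DEFAULT_THRESHOLDS[metric] if threshold is None else threshold
    recent = samples[-for_samples:]
    return len(recent) == for_samples and all(v > limit for v in recent)


# Sustained breach of the default error-rate threshold: alert fires.
print(should_alert("error_rate", [0.01, 0.06, 0.07, 0.08]))
# A single spike followed by recovery: no alert.
print(should_alert("error_rate", [0.06, 0.01, 0.07]))
```

A user who knows their service better than the defaults can pass an explicit `threshold`, which is the general idea behind "smart defaults" that remain overridable.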
The most effective way to bootstrap usage of a new feature / solution is to expose existing users to it in the context of what they are already doing. All 3 solution areas (Logs, Metrics and APM) should incorporate integrations of each solution and a guide on how to get started. In addition to cross-linking between observability apps, a number of broader GitLab initiatives
We want to help teams resolve outages faster, accelerating both the troubleshooting and resolution of incidents. GitLab's single platform can correlate the incoming observability data with known CI/CD events and source code information, to automatically suggest potential root causes.
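One simple way to picture this correlation is to look for deployments that finished shortly before an alert fired. The sketch below assumes a hypothetical list of deployment records with `sha` and `finished_at` fields and a 30-minute lookback window; the data shapes and the `suspect_deployments` helper are illustrative and do not reflect GitLab's actual implementation.

```python
from datetime import datetime, timedelta


def suspect_deployments(alert_time, deployments, window=timedelta(minutes=30)):
    """Return deployments that finished within `window` before the alert.

    The most recent deployment comes first, since it is usually the first
    candidate worth inspecting.
    """
    candidates = [
        d for d in deployments
        if alert_time - window <= d["finished_at"] <= alert_time
    ]
    return sorted(candidates, key=lambda d: d["finished_at"], reverse=True)


# Hypothetical deployment history and alert timestamp.
deployments = [
    {"sha": "a1b2c3", "finished_at": datetime(2020, 3, 1, 9, 0)},
    {"sha": "d4e5f6", "finished_at": datetime(2020, 3, 1, 11, 40)},
    {"sha": "0789ab", "finished_at": datetime(2020, 3, 1, 11, 55)},
]
alert_time = datetime(2020, 3, 1, 12, 0)

for deploy in suspect_deployments(alert_time, deployments):
    print(deploy["sha"], deploy["finished_at"].isoformat())
```

Because the deployment record carries the commit SHA, each candidate can be traced back to the merge request and source change that introduced it.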
Continuously learning and driving those insights back into your development cycle is a critical part of the DevOps loop. The tools in the Monitor stage make it possible to gain insights about production SLOs, incidents and observability sources across the multi-project systems that make up a complete application.
Container-based deployments have rapidly expanded the amount of observability information available. It is no longer possible to collate and visualize this information without automation and without distilling it into valuable insights, which GitLab can do for you.
We'll also provide views across a suite of applications so that managers of a large number of DevOps or Operations teams can get a quick view of the health of their application suite and teams.
Our north stars are the guide posts for where we are headed. Our principles inform how we will get there. First and foremost we abide by GitLab's universal Product Principles. There are a few unique principles to the Monitor stage itself.
As part of our general principle of Flow One, the Monitor stage will seek to complete the full observability feedback loop for limited use cases first, before moving on to support others. As a starting point, this will mean support for modern, cloud-native developers first.
In modern DevOps organizations, developers are expected to also operate the services they develop. In many cases this expectation isn't met. Whether a developer is the one operating an application or not, we will build tools that work for those doing the operator job. This means setting aside preferences, such as developers' preference to avoid deep production troubleshooting, and instead building tools that allow those who operate to be best-in-class operators, regardless of their title.
Our users can't expect a complete set of Monitoring tools if we don't use them ourselves for instrumenting and operating GitLab. That's why we will dogfood everything.
We will start with GitLab Self-Monitoring and our own Infrastructure teams. We want self-managed administrator users to utilize the same tools to observe and respond to health alerts about their GitLab instance as they would to monitor their own services. We'll also complete our own DevOps loop by having our Infrastructure teams for GitLab.com utilize our incident management feature.
Monitor SMAU is determined by tracking how users configure, interact, and view the features contained within the stage. The following features are considered:
| Configure | Interact | View |
|---|---|---|
| Install Prometheus | Add/Update/Delete Metric Chart | View Metrics Dashboard |
| Enable external Prometheus instance integration | Download CSV data from a Metric chart | View Kubernetes pod logs |
| Enable Jaeger for Tracing | Generate a link to a Metric chart | View Environments |
| Enable Sentry integration for Error Tracking | Add/remove an alert | View Tracing |
| Enable auto-creation of issues on alerts | Change the environment when looking at pod logs | View operations settings |
| Enable Generic Alert endpoint | Select issue template for auto-creation | View Prometheus Integration page |
| Enable email notifications for auto-creation of issues | Use /zoom and /remove_zoom quick actions | View error list |
| | Click on metrics dashboard links in issues | |
| | Click View in Sentry button in errors list | |
See the corresponding Periscope dashboard (internal).
There are a few workflows that are critical to our users in this stage.
Each of these workflows has a designated level of maturity; you can read more about our category maturity model to help you decide which categories you want to start using and when.
This workflow is planned, but not yet available.
Starting with the highest-level alert, using preconfigured dashboards to review relevant metrics, enabling ad-hoc visualization and immediate drill-down from time-sliced metrics into logs and traces in the same screen. This workflow is planned, but not yet available.
This workflow is planned, but not yet available.
There are a few product categories that are critical for success here; each one is intended to represent what you might find as an entire product out in the market. We want our single application to solve the important problems solved by other tools in this space - if you see an opportunity where we can deliver a specific solution that would be enough for you to switch over to GitLab, please reach out to the PM for this stage and let us know.
Each of these categories has a designated level of maturity; you can read more about our category maturity model to help you decide which categories you want to start using and when.
GitLab collects and displays performance metrics for deployed apps, leveraging Prometheus. Developers can determine the impact of a merge and keep an eye on their production systems, without leaving GitLab. This category is at the "viable" level of maturity.
Out-of-the-box Kubernetes cluster monitoring lets you know the health of your deployment environments with traceability back to every issue and code change as part of a single application for end-to-end DevOps. This category is at the "viable" level of maturity.
Track incidents within GitLab, providing a consolidated location to understand the who, what, when, and where of the incident. Define service level objectives and error budgets, to achieve the desired balance of velocity and stability. This category is at the "viable" level of maturity.
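To make the relationship between a service level objective and its error budget concrete, here is a small worked example in Python. It assumes a time-based availability SLO measured over a rolling window; the function names and the 99.9% / 30-day figures are illustrative choices, not values taken from GitLab.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO.

    `slo` is a fraction, e.g. 0.999 for "99.9% available".
    """
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes


def remaining_budget_minutes(slo, downtime_minutes, window_days=30):
    """Error budget left after subtracting downtime already incurred."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


budget = error_budget_minutes(0.999)            # 0.1% of 43,200 minutes
remaining = remaining_budget_minutes(0.999, 30)  # after a 30-minute incident
print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min")
# prints "budget: 43.2 min, remaining: 13.2 min"
```

A team with budget remaining can keep shipping at full velocity; a team that has exhausted its budget has a clear signal to prioritize stability work.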
GitLab makes it easy to view the logs distributed across multiple pods and services using log aggregation with Elastic Stack. Once Elastic Stack is enabled, you can view your aggregated Kubernetes logs across multiple services and infrastructure, go back in time, conduct infinite scroll, and search through your application logs from within the GitLab UI itself. This category is at the "viable" level of maturity.
Tracing provides insight into the performance and health of a deployed application, tracking each function or microservice which handles a given request. This makes it easy to understand the end-to-end flow of a request, regardless of whether you are using a monolithic or distributed system. This category is at the "minimal" level of maturity.
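The idea of following a request end to end can be sketched with a handful of spans, each pointing at its parent. The span fields (`id`, `parent_id`, `name`, `start_ms`, `end_ms`) and the service names below are hypothetical and simplified for illustration; real tracing systems such as Jaeger carry considerably more metadata.

```python
def render_trace(spans):
    """Return an indented, time-ordered view of a trace as a list of lines."""
    # Group spans by their parent so the tree can be walked top-down.
    children = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s["start_ms"]):
            duration = span["end_ms"] - span["start_ms"]
            lines.append(f"{'  ' * depth}{span['name']} ({duration} ms)")
            walk(span["id"], depth + 1)

    walk(None, 0)  # the root span has no parent
    return lines


# A hypothetical request that touches three services.
spans = [
    {"id": 1, "parent_id": None, "name": "GET /checkout", "start_ms": 0, "end_ms": 120},
    {"id": 3, "parent_id": 1, "name": "payment-service", "start_ms": 30, "end_ms": 110},
    {"id": 2, "parent_id": 1, "name": "auth-service", "start_ms": 5, "end_ms": 25},
    {"id": 4, "parent_id": 3, "name": "db query", "start_ms": 40, "end_ms": 90},
]

print("\n".join(render_trace(spans)))
```

Reading the output top-down shows where the request spent its time, which is the kind of view a tracing UI presents graphically.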
Self-managed GitLab instances come out of the box with great observability tools, reducing the time and effort required to maintain a GitLab instance.
Error tracking allows developers to easily discover and view the errors that their application may be generating. By surfacing error information where the code is being developed, efficiency and awareness can be increased. This category is at the "viable" level of maturity.
Digital experience management includes both real user monitoring (passive) and synthetics monitoring (active) to allow developers to detect problems in end-to-end workflows and understand real-world performance as experienced by users. This category is planned, but not yet available.
Priority: medium • Direction
Easily communicate the status of your services to users and customers. This category is planned, but not yet available.
We follow the same prioritization guidelines as the product team at large.
As noted above, in the short term the Monitor stage will be prioritizing (video discussion) the following:
You can see our entire public backlog for Monitor at this link; filtering by labels or milestones will allow you to explore. If you find something you're interested in, you're encouraged to jump into the conversation and participate. At GitLab, everyone can contribute!
Issues with the "direction" label have been flagged as being particularly interesting, and are listed in the section below.
sidekiq-cluster script to Core
There are a number of other issues that we've identified as being interesting that we are potentially thinking about, but do not currently have planned by setting a milestone for delivery. Some are good ideas we want to do, but don't yet know when; some we may never get around to, some may be replaced by another idea, and some are just waiting for that right spark of inspiration to turn them into something special.
Remember that at GitLab, everyone can contribute! This is one of our fundamental values and something we truly believe in, so if you have feedback on any of these items you're more than welcome to jump into the discussion. Our vision and product are truly something we build together!