This is the product direction for Monitor. If you'd like to discuss this direction directly with the product managers for Monitor, feel free to reach out to Dov Hershkovitch (GitLab, Email), Sarah Waldner (GitLab, Email, Zoom call) or Kevin Chu (GitLab, Email, Zoom call).
The IT monitoring and management market is well-established and crowded, but also fast-changing in terms of technologies used in IT and expectations of the IT user base. For instance, the trend to move infrastructure to public cloud introduced a new category of technologies to monitor that traditional vendors did not address. This SaaS delivery model disrupted the on-prem delivery model for many existing vendors. More recently, a transition from virtualization to container-based technologies caused another wave of adjustment to what it means to monitor infrastructure. This further challenged existing vendors. These and other market trends allow new entrants (like Sentry and Datadog) to quickly capture mindshare and eclipse existing vendors (like New Relic and Splunk).
Put bluntly, we are building an integrated package of observability and operations tools which will, in three years, displace Datadog, today's front-runner in modern observability, for APM use cases in modern applications. We'll do that by focusing on the four core workflows of Instrument, Triage, Resolve and Improve.
The following links describe our strategy for each individual workflow:
Our long-term strategy is ambitious.
In the first year, we will focus our efforts on our current target application types (cloud native applications, web-apps, static sites). As a result, during that time we will not strive to be a fully turn-key experience that can be used to monitor legacy applications. Removing a monitoring solution wholesale is painful, so a land-and-expand strategy is prudent here. As a customer recently explained, "Every greenfield application that we can deploy with your monitoring tools saves us money on New Relic licenses."
As curated solutions mature, we can increasingly target new application types. In subsequent years we will compete with incumbent players as a holistic Monitoring solution for modern applications.
The Monitor stage comes after you've configured your production infrastructure and deployed your application to it. As part of the verification and release process you've done some performance validation - but you need to ensure your service(s) maintain the expected service-level objectives (SLOs) for your users.
GitLab's Monitor stage product offering makes instrumentation of your service easy, giving you the right tools to prevent, respond to, and restore SLO degradation. Current DevOps teams either lack exposure to operational tools or utilize ones that put them in a reactive position when complex systems fail inexplicably. Our mission is to empower your DevOps teams by finding operational issues before they hit production and enabling them to respond like pros by leveraging default SLOs and responses they proactively instrumented. GitLab Monitoring allows you to successfully complete the DevOps loop, not just for the features in your product, but for its performance and user experience as well.
Using GitLab observability solutions, users will have an easy way to gain a holistic understanding of the state of production services across multiple groups and projects. When you are deploying a suite of services, it's critical that you can drill into each individual service's SLO attainment as well as troubleshoot issues which span multiple services.
We track epics for all the major deliverables associated with the north stars, and category maturity levels. You can view them on our Monitor Roadmap.
We plan to provide a streamlined triage experience that allows our users to quickly identify and effectively troubleshoot an application problem, as described in the following flow:
Detailed information can be found in the triage to minimal epic.
We're pursuing a few key principles within the Monitor Stage.
Your team's service(s), first and foremost, need to be observable before you are able to evaluate production performance characteristics. We believe that observability should be easy. GitLab will ship with smart conventions that set up your applications with generic observability. We will also make it simple to instrument your service, so that custom metrics, the ones you'd like to build your own SLOs around, can be added with a few lines of code.
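To make "a few lines of code" concrete, here is a minimal sketch of a custom counter metric rendered in the Prometheus text exposition format. It is not GitLab's instrumentation API: the `Counter` class, the metric name `checkout_orders_total`, and the `status` label are hypothetical names chosen for illustration, and the sketch assumes label values are simple strings that need no escaping.

```python
class Counter:
    """A monotonically increasing metric, keyed by label values (illustrative only)."""

    def __init__(self, name, help_text, label_names=()):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = {}  # maps a tuple of label values to the running total

    def inc(self, *label_values, amount=1.0):
        """Increase the counter for one combination of label values."""
        if len(label_values) != len(self.label_names):
            raise ValueError("label values must match label names")
        if amount < 0:
            raise ValueError("counters can only increase")
        self._values[label_values] = self._values.get(label_values, 0.0) + amount

    def render(self):
        """Render the counter in the Prometheus text exposition format.

        Assumption: label values contain no quotes, backslashes, or newlines,
        so no escaping is performed here.
        """
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for label_values, value in sorted(self._values.items()):
            labels = ",".join(
                f'{key}="{val}"' for key, val in zip(self.label_names, label_values)
            )
            suffix = f"{{{labels}}}" if labels else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical business metric that an SLO could later be built around.
    orders = Counter("checkout_orders_total", "Orders placed.", ("status",))
    orders.inc("ok")
    orders.inc("failed")
    print(orders.render(), end="")
```

In practice a maintained Prometheus client library would replace the hand-rolled class; the point of the sketch is only how small the instrumentation surface for a custom metric can be.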
Alerting and notification services are a table-stakes expectation of APM and Metrics solutions. GitLab will build a great experience for setting thresholds and metrics, including setting smart defaults for known metrics. We'll lean heavily on our early integration with Prometheus scheduling, notification, and alerting services. Beyond alerting, integration with chatops and incident management is also going to be important.
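The idea of smart defaults with per-project overrides can be sketched as follows. This is an illustration under stated assumptions, not GitLab's alerting implementation: the metric names, the default values, and the helper names `resolve_threshold` and `should_alert` are all hypothetical. Requiring several consecutive breaching samples is similar in spirit to the `for` duration in Prometheus alerting rules, which avoids firing on a single spike.

```python
# Hypothetical defaults for well-known metrics; the values are illustrative only.
DEFAULT_THRESHOLDS = {
    "http_error_rate": 0.05,      # fraction of requests that fail
    "latency_p95_seconds": 1.0,   # 95th percentile latency in seconds
}


def resolve_threshold(metric, overrides=None):
    """Return the user's override for a metric, else the built-in default."""
    overrides = overrides or {}
    if metric in overrides:
        return overrides[metric]
    if metric in DEFAULT_THRESHOLDS:
        return DEFAULT_THRESHOLDS[metric]
    raise KeyError(f"no default threshold for {metric!r}; configure one explicitly")


def should_alert(samples, threshold, min_consecutive=3):
    """Fire only when the most recent `min_consecutive` samples all exceed the threshold."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    recent = samples[-min_consecutive:]
    return len(recent) == min_consecutive and all(s > threshold for s in recent)


if __name__ == "__main__":
    threshold = resolve_threshold("http_error_rate")
    print(should_alert([0.01, 0.06, 0.07, 0.08], threshold))
```

A known metric needs no configuration at all, while a team with stricter requirements passes an `overrides` mapping for just the metrics it cares about.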
Working visually with time-series data is an important expectation of an observability solution. Our dashboarding solutions will include ad-hoc data visualization that allows you to quickly build time-series visualizations from metrics, chart them against related metrics, and break them down by the field of your choice. A dashboarding system should also provide a curated UI experience for the established vendors that are clearly in the lead.
The most effective way to bootstrap usage of a new feature / solution is to expose existing users to it in the context of what they are already doing. All 3 solution areas (Logs, Metrics and APM) should incorporate integrations of each solution and a guide on how to get started. In addition to cross-linking between observability apps, a number of broader GitLab initiatives
We want to help teams resolve outages faster, accelerating both the troubleshooting and resolution of incidents. GitLab's single platform can correlate the incoming observability data with known CI/CD events and source code information, to automatically suggest potential root causes.
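One simple form of this correlation is to look for deployments to the affected environment that finished shortly before an alert started. The sketch below assumes a hypothetical `Deployment` record and a one-hour lookback window; it is a heuristic for illustration, not a description of how GitLab suggests root causes.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Deployment(NamedTuple):
    """Hypothetical record of a finished CI/CD deployment."""
    commit_sha: str
    environment: str
    finished_at: datetime


def suspect_deployments(alert_started_at, environment, deployments,
                        lookback=timedelta(hours=1)):
    """Return deployments to `environment` that finished within `lookback`
    before the alert started, most recent first."""
    window_start = alert_started_at - lookback
    candidates = [
        d for d in deployments
        if d.environment == environment
        and window_start <= d.finished_at <= alert_started_at
    ]
    return sorted(candidates, key=lambda d: d.finished_at, reverse=True)


if __name__ == "__main__":
    alert = datetime(2020, 1, 1, 12, 0)
    history = [
        Deployment("a1b2c3", "production", datetime(2020, 1, 1, 9, 0)),
        Deployment("d4e5f6", "production", datetime(2020, 1, 1, 11, 50)),
    ]
    for deployment in suspect_deployments(alert, "production", history):
        print(deployment.commit_sha, deployment.finished_at.isoformat())
```

Because each candidate carries a commit SHA, the result can link straight to the code change that went out before the incident.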
Continuously learning and driving those insights back into your development cycle is a critical part of the DevOps loop. The tools in the Monitor stage make it possible to gain insights about production SLOs, incidents and observability sources across the multi-project systems that make up a complete application.
Container-based deployments have rapidly expanded the amount of observability information available. It is no longer possible to collate and visualize this information without automation that distills it into valuable insights, which GitLab can do for you.
We'll also provide views across a suite of applications so that managers of a large number of DevOps or Operations teams can get a quick view of the health of their application suite and teams.
Our north stars are the guide posts for where we are headed. Our principles inform how we will get there. First and foremost we abide by GitLab's universal Product Principles. There are a few unique principles to the Monitor stage itself.
As part of our general principle of Flow One, the Monitor stage will seek to complete the full observability feedback loop for limited use cases first, before moving on to support others. As a starting point, this will mean support for modern, cloud-native developers first.
In modern DevOps organizations developers are expected to also operate the services they develop. In many cases this expectation isn't met. Whether a developer is the one operating an application or not, we will build tools that work for those doing the operator job. This means setting aside preferences, such as developers' preference to avoid deep production troubleshooting, and instead building tools that allow those who operate to be best-in-class operators, regardless of their title.
Our users can't expect a complete set of Monitoring tools if we don't utilize them ourselves for instrumenting and operating GitLab. That's why we will dogfood everything.
We will start with GitLab Self-Monitoring and our own Infrastructure teams. We want self-managed administrator users to utilize the same tools to observe and respond to health alerts about their GitLab instance as they would to monitor their own services. We'll also complete our own DevOps loop by having our Infrastructure teams for GitLab.com utilize our incident management feature.
Monitor SMAU is determined by tracking how users configure, interact, and view the features contained within the stage. The following features are considered:
| Configure | Interact | View |
|---|---|---|
| Install Prometheus | Add/Update/Delete Metric Chart | View Metrics Dashboard |
| Enable external Prometheus instance integration | Download CSV data from a Metric chart | View Kubernetes pod logs |
| Enable Jaeger for Tracing | Generate a link to a Metric chart | View Environments |
| Enable Sentry integration for Error Tracking | Add/remove an alert | View Tracing |
| Enable auto-creation of issues on alerts | Change the environment when looking at pod logs | View operations settings |
| Enable Generic Alert endpoint | Select issue template for auto-creation | View Prometheus Integration page |
| Enable email notifications for auto-creation of issues | Use /zoom and /remove_zoom quick actions | View error list |
| | Click on metrics dashboard links in issues | |
| | Click View in Sentry button in errors list | |
See the corresponding Periscope dashboard (internal).
There are a few workflows that are critical to our users in this stage.
Each of these workflows has a designated level of maturity; you can read more about our category maturity model to help you decide which categories you want to start using and when.
This workflow is planned, but not yet available.
Starting with the highest-level alert, using preconfigured dashboards to review relevant metrics, enabling ad-hoc visualization and immediate drill-down from time-sliced metrics into logs and traces in the same screen. This workflow is planned, but not yet available.
This workflow is planned, but not yet available.
There are a few product categories that are critical for success here; each one is intended to represent what you might find as an entire product out in the market. We want our single application to solve the important problems solved by other tools in this space - if you see an opportunity where we can deliver a specific solution that would be enough for you to switch over to GitLab, please reach out to the PM for this stage and let us know.
Each of these categories has a designated level of maturity; you can read more about our category maturity model to help you decide which categories you want to start using and when.
GitLab collects and displays performance metrics for deployed apps, leveraging Prometheus. Developers can determine the impact of a merge and keep an eye on their production systems, without leaving GitLab. This category is at the "viable" level of maturity.
Out-of-the-box Kubernetes cluster monitoring lets you know the health of your deployment environments with traceability back to every issue and code change as part of a single application for end-to-end DevOps. This category is at the "viable" level of maturity.
Track incidents within GitLab, providing a consolidated location to understand the who, what, when, and where of the incident. Define service level objectives and error budgets, to achieve the desired balance of velocity and stability. This category is at the "viable" level of maturity.
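As a worked illustration of how an SLO and its error budget relate, the sketch below uses a request-based budget: a 99.9% success target over 1,000,000 requests allows roughly 1,000 failed requests before the objective is missed. The function name `error_budget` and the shape of its return value are assumptions made for this example, not a GitLab feature.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compute a request-based error budget.

    slo_target is the desired success ratio, for example 0.999 for 99.9%.
    Returns how many failures the SLO allows, how many remain, and the
    fraction of the budget already consumed (above 1.0 means the SLO is missed).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    allowed_failures = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,
    }


if __name__ == "__main__":
    report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"budget consumed: {report['budget_consumed']:.0%}")
```

A team that has consumed only a small share of its budget can afford to ship faster, while a team close to exhausting it would favor stability work, which is the balance of velocity and stability described above.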
GitLab makes it easy to view the logs of running pods in connected Kubernetes clusters. By displaying the logs directly in GitLab, developers can avoid having to manage console tools or jump to a different interface. This category is at the "viable" level of maturity.
Tracing provides insight into the performance and health of a deployed application, tracking each function or microservice which handles a given request. This makes it easy to understand the end-to-end flow of a request, regardless of whether you are using a monolithic or distributed system. This category is at the "minimal" level of maturity.
Self-managed GitLab instances come out of the box with great observability tools, reducing the time and effort required to maintain a GitLab instance.
Error tracking allows developers to easily discover and view the errors that their application may be generating. By surfacing error information where the code is being developed, efficiency and awareness can be increased. This category is at the "minimal" level of maturity.
Simulate user activity within your application, to detect problems in end-to-end workflows and understand real-world performance. This category is planned, but not yet available.
Easily communicate the status of your services to users and customers. This category is planned, but not yet available.
We follow the same prioritization guidelines as the product team at large.
As noted above, in the short term the Monitor stage will be prioritizing (video discussion) the following:
You can see our entire public backlog for Monitor at this link; filtering by labels or milestones will allow you to explore. If you find something you're interested in, you're encouraged to jump into the conversation and participate. At GitLab, everyone can contribute!
Issues with the "direction" label have been flagged as being particularly interesting, and are listed in the section below.
sidekiq-cluster script to Core
There are a number of other issues that we've identified as interesting and are considering, but that we have not yet scheduled with a milestone for delivery. Some are good ideas we want to do, but don't yet know when; some we may never get around to; some may be replaced by another idea; and some are just waiting for that right spark of inspiration to turn them into something special.
Remember that at GitLab, everyone can contribute! This is one of our fundamental values and something we truly believe in, so if you have feedback on any of these items you're more than welcome to jump into the discussion. Our vision and product are truly something we build together!