Our aim is to integrate monitoring and observability into our DevOps Platform, providing a convenient and cost-effective solution that lets our customers monitor the state of their applications, understand how the changes they make affect their applications' performance characteristics, and resolve issues as they arise. APM is a core part of the DevOps workflow because it shows the direct impact of deployed changes.
This effort will be SaaS first and we will iterate by leveraging open source agents for auto-instrumentation.
Due to the size and complexity of our vision, we may split this SEG into three subgroups that will focus on each part of the architecture:
Use the open-source DataDog agent to collect metrics from production applications. The DataDog agent is written in Go and will need to be integrated with our preferred storage solution to send metrics and events in a periodic batch payload.
Store the metrics, logs, and events in a queryable event-series database. We need to balance memory and CPU usage against the timeliness of data and the efficiency of querying.
Integrate an open-source visualization tool that takes care of the analytics. The aim is to give users the ability to query, visualize, set up alerts on, and understand the data from their applications.
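To illustrate the first part of the architecture, here is a minimal sketch of how an agent might buffer metrics and ship them in a periodic batch payload. The `Metric` fields and the `send` callback are illustrative assumptions, not the Datadog agent's actual wire format or API:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Metric is a single data point collected from an application.
// These field names are illustrative, not the Datadog wire format.
type Metric struct {
	Name      string  `json:"name"`
	Value     float64 `json:"value"`
	Timestamp int64   `json:"timestamp"`
}

// Batcher buffers metrics and flushes them as one JSON payload,
// either when the buffer fills or when a periodic tick calls Flush.
type Batcher struct {
	mu      sync.Mutex
	buf     []Metric
	maxSize int
	send    func(payload []byte) // stands in for the HTTP call to the storage backend
}

func NewBatcher(maxSize int, send func([]byte)) *Batcher {
	return &Batcher{maxSize: maxSize, send: send}
}

// Add buffers a metric and flushes when the batch is full.
func (b *Batcher) Add(m Metric) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.buf = append(b.buf, m)
	if len(b.buf) >= b.maxSize {
		b.flushLocked()
	}
}

// Flush sends whatever is buffered, regardless of size.
func (b *Batcher) Flush() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.flushLocked()
}

func (b *Batcher) flushLocked() {
	if len(b.buf) == 0 {
		return
	}
	payload, _ := json.Marshal(b.buf)
	b.send(payload)
	b.buf = b.buf[:0]
}

func main() {
	batches := 0
	b := NewBatcher(2, func(payload []byte) {
		batches++
		fmt.Printf("batch %d: %s\n", batches, payload)
	})
	b.Add(Metric{Name: "cpu.usage", Value: 0.42, Timestamp: 1700000000})
	b.Add(Metric{Name: "mem.usage", Value: 512, Timestamp: 1700000000}) // fills the batch, triggers a flush
	b.Add(Metric{Name: "net.speed", Value: 100, Timestamp: 1700000010})
	b.Flush() // in the real agent, a periodic timer would drive this
}
```

Batching amortizes network overhead across many data points, which is why the agent sends periodic payloads rather than one request per metric.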
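The storage tradeoff between resource usage and query efficiency is often handled by downsampling. The sketch below, with an assumed `Point` shape and window size rather than any real storage engine's API, averages raw samples into fixed time windows: coarser windows cut memory and speed up queries at the cost of detail and timeliness:

```go
package main

import "fmt"

// Point is one raw sample in an event-series store.
type Point struct {
	Ts    int64 // unix seconds
	Value float64
}

// Downsample averages raw points into fixed windows of windowSec
// seconds, returning one aggregated point per window.
func Downsample(points []Point, windowSec int64) []Point {
	type agg struct {
		sum float64
		n   int64
	}
	buckets := map[int64]*agg{}
	for _, p := range points {
		w := p.Ts / windowSec
		if buckets[w] == nil {
			buckets[w] = &agg{}
		}
		buckets[w].sum += p.Value
		buckets[w].n++
	}
	var out []Point
	for w, a := range buckets {
		out = append(out, Point{Ts: w * windowSec, Value: a.sum / float64(a.n)})
	}
	return out
}

func main() {
	raw := []Point{{0, 1}, {10, 3}, {70, 5}} // three samples over ~70s
	coarse := Downsample(raw, 60)            // 60-second resolution
	fmt.Println(len(raw), "->", len(coarse), "points")
	// → 3 -> 2 points
}
```

A real store would keep raw data for a short retention window and downsampled rollups for longer-term queries, which is one concrete way to balance the three constraints above.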
GitLab users can currently monitor their services and applications by leveraging GitLab to install Prometheus into a GitLab-managed cluster. Similarly, users can install the ELK stack for log aggregation and management. The advantage of using GitLab with these popular tools is that users can collaborate on monitoring in the same application they use to build and deploy their services and applications.
Since then, we've learned the following things that make this particular strategy challenging:
We are intentionally shifting our strategy to account for what we learned:
We anticipate that our initial entry into this market will be a set of open-source tools we can leverage within GitLab.com to offer a rudimentary APM solution to users, before we iterate to improve usability and functionality. Our technical approach will start with collecting metrics, then add logging, and then tracing.
The APM group's mission is to help our customers decrease the frequency and severity of their production issues. As such, we've defined the team's Performance Indicator (PI) to be the total number of metric and log views. The rationale for using this as our North Star Metric (NSM) is the following:
Because the PI must be a single metric, we combine metric views and log views into one number, which serves as the team's PI.
We expect to track the journey of users through the following funnel:
Metrics are numeric values tracked over time, such as memory usage, CPU usage and network speed.
Monitoring is the ability to understand, and alert on, an application's usage and performance.
Observability (abbreviated as “o11y”) allows you to answer questions about the state of your application by observing data coming from your application.
A trace is the relationship between events coming from your system, visualized using timing data to show how those events connect to one another.
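The trace definition above can be made concrete with a small sketch. The `Span` shape and the indented "waterfall" rendering are illustrative assumptions, not any particular tracing system's format: each timed event carries a parent ID, and following those links reconstructs the relationships between events:

```go
package main

import (
	"fmt"
	"strings"
)

// Span is one timed event in a trace; ParentID links a child
// event to the operation that caused it. Field names here are
// illustrative, not a real tracing wire format.
type Span struct {
	ID       string
	ParentID string
	Name     string
	StartMs  int64
	EndMs    int64
}

// RenderTrace walks the parent/child links and indents children
// under their parents, annotating each span with its duration.
func RenderTrace(spans []Span, parent string, depth int) string {
	var b strings.Builder
	for _, s := range spans {
		if s.ParentID == parent {
			fmt.Fprintf(&b, "%s%s (%dms)\n", strings.Repeat("  ", depth), s.Name, s.EndMs-s.StartMs)
			b.WriteString(RenderTrace(spans, s.ID, depth+1))
		}
	}
	return b.String()
}

func main() {
	spans := []Span{
		{ID: "1", ParentID: "", Name: "GET /checkout", StartMs: 0, EndMs: 120},
		{ID: "2", ParentID: "1", Name: "auth check", StartMs: 5, EndMs: 15},
		{ID: "3", ParentID: "1", Name: "db query", StartMs: 20, EndMs: 90},
	}
	fmt.Print(RenderTrace(spans, "", 0))
	// → GET /checkout (120ms)
	//     auth check (10ms)
	//     db query (70ms)
}
```

The timing data makes it immediately clear which child operation (here, the database query) dominates the parent request's latency.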