At GitLab, we collect product usage data for the purpose of helping us build a better product. Data helps GitLab understand which parts of the product need improvement and which features we should build next. Product usage data also helps our team better understand the reasons why people use GitLab. With this knowledge we are able to make better product decisions.
For an introduction to Product Analytics at GitLab, have a look at the GitLab Product Data Training (internal) deck.
There are several stages and teams involved to go from collecting data to making it useful for our internal teams and customers.
|Collection||The data collection tools used across all GitLab applications including GitLab SaaS, GitLab self-managed, CustomerDot, VersionDot, and about.gitlab.com. Our current tooling includes Snowplow, Service Ping, and Google Analytics.||Product Intelligence||Infrastructure|
|Extraction||The data extraction tools used to extract data from Product, Infrastructure, and Enterprise Apps data sources. Our current tooling includes Stitch, Fivetran, and Custom.||Data|
|Loading||The data loading tools used to extract data from Product, Infrastructure, and Enterprise Apps data sources and to load it into our data warehouse. Our current tooling includes Stitch, Fivetran, and Custom.||Data|
|Orchestration||The orchestration of extraction and loading tooling to move data from sources into the Enterprise Data Warehouse. Our current tooling includes Airflow.||Data|
|Storage||The Enterprise Data Warehouse (EDW) which is the single source of truth for GitLab's corporate data, performance analytics, and enterprise-wide data such as Key Performance Indicators. Our current EDW is built on Snowflake.||Data|
|Transformation||The transformation and modelling of data in the Enterprise Data Warehouse in preparation for data analysis. Our current tooling is dbt and Python scripts.||Data||Product Intelligence|
|Analysis||The analysis of data in the Enterprise Data Warehouse using a querying and visualization tool. Our current tooling is Sisense.||Data, Product Data Analysis||Product Intelligence|
The systems overview is a simplified diagram showing the interactions between GitLab Inc and self-managed instances.
|GitLab SaaS||GitLab Self-managed|
|For Product Intelligence purposes, GitLab Inc has three major components:
1. Data Infrastructure: This contains everything managed by our data team including Sisense Dashboards for visualization, Snowflake for Data Warehousing, incoming data sources such as PostgreSQL Pipeline and S3 Bucket, and lastly our data collectors GitLab.com's Snowplow Collector and GitLab's Versions Application.
2. GitLab.com: This is the production GitLab application which is made up of a Client and Server. On the Client or browser side, a Snowplow JS Tracker (Frontend) is used to track client-side events. On the Server or application side, a Snowplow Ruby Tracker (Backend) is used to track server-side events. The server also contains Service Ping which leverages a PostgreSQL database and a Redis in-memory data store to report on usage data. Lastly, the server also contains System Logs which are generated from running the GitLab application.
3. Monitoring infrastructure: This is the infrastructure used to ensure GitLab.com is operating smoothly. System Logs are sent from GitLab.com to our monitoring infrastructure and collected by a FluentD collector. From FluentD, logs are sent either to long-term Google Cloud Services cold storage via Stackdriver or to our Elastic Cluster via Cloud Pub/Sub, where they can be explored in real time using Kibana.
|For Product Intelligence purposes, self-managed instances have two major components:
1. Data infrastructure: Having a data infrastructure setup is optional on self-managed instances. If you'd like to collect Snowplow tracking events for your self-managed instance, you can set up your own self-managed Snowplow collector and configure your Snowplow events to point to your own collector.
2. GitLab: A self-managed GitLab instance contains all of the same components as GitLab.com mentioned above.
As shown by the orange lines, on GitLab.com Snowplow JS, Snowplow Ruby, Service Ping, and PostgreSQL database imports all flow into GitLab Inc's data infrastructure. However, on self-managed, only Service Ping flows into GitLab Inc's data infrastructure.
As shown by the green lines, on GitLab.com system logs flow into GitLab Inc's monitoring infrastructure. On self-managed, there are no logs sent to GitLab Inc's monitoring infrastructure.
Note: Snowplow JS and Snowplow Ruby are available on self-managed. However, the Snowplow Collector endpoint is set to a self-managed Snowplow Collector, which GitLab Inc does not have access to.
Our SaaS data collection catalog spans both the client (frontend) and server (backend), and uses various tools. We pick up events and data produced when using the application. By utilizing collected identifiers, we can string these backend and frontend events together to illustrate a GitLab journey at the (1) user (pseudonymized), (2) namespace, and (3) project level.
The table below explains the types of data we collect from GitLab.com, with examples of what each can be used for.
|Technology||Data Type||Description||Aggregation Method|
- Snowplow JS Tracker: client side (FE) events
- Snowplow Ruby Tracker: server-side (BE) events
- Schema of events here
|Event Based Data||Examples:
- Collects an event on Git pushes
- Collects an event on a button click
- Collects an event on a successful Pipeline
- Collects a request to a Rails controller
|Event based or grouped by an attribute (e.g. session)|
- PostgreSQL database
- Redis in-memory data store
- System Logs
- Total issues created
- Instance settings such as the instance's Git version
- Integration settings such as an external storage provider
|Count based on either total time or a given timeframe|
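As a rough illustration of the two aggregation methods in the table above, the following sketch groups event-based data by a session attribute and counts records either over all time or within a given timeframe. The event fields, sample values, and the 28-day window are assumptions made for this example, not the actual Snowplow or Service Ping schema.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical event records; field names are illustrative only.
events = [
    {"session_id": "s1", "action": "button_click", "at": datetime(2022, 5, 1)},
    {"session_id": "s1", "action": "git_push", "at": datetime(2022, 5, 2)},
    {"session_id": "s2", "action": "button_click", "at": datetime(2022, 3, 1)},
]


def events_per_session(events):
    """Event-based aggregation: group events by an attribute (here, session)."""
    return Counter(e["session_id"] for e in events)


def count_in_window(records, now, days=None):
    """Count-based aggregation: all time when days is None, else the last `days` days."""
    if days is None:
        return len(records)
    cutoff = now - timedelta(days=days)
    return sum(1 for r in records if r["at"] >= cutoff)


now = datetime(2022, 5, 11)
per_session = events_per_session(events)
all_time = count_in_window(events, now)
last_28_days = count_in_window(events, now, days=28)
```

Here `per_session` answers "how many events happened in each session", while the two `count_in_window` calls return the same kind of count over different timeframes.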
In order to create the SaaS usage journeys documented below, we collect various identifiers in our data collection catalog. Where the identifier can be used to personally identify a user by someone without permissions to view that information, we will pseudonymize the data via hashing at the collection layer. Follow this epic to track our progress on our pseudonymization project.
|Metric||Example Data Pre-Pseudonymization||Pseudonymized?||Example Data Post-Pseudonymization|
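Pseudonymization via hashing at the collection layer could look roughly like the sketch below. The source does not describe the actual hashing scheme, so the use of HMAC-SHA256, the `SECRET_KEY` value, the `pseudonymize` helper, and the event fields are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would come from secure configuration.
SECRET_KEY = b"example-secret-key"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable keyed hash of an identifier (HMAC-SHA256, hex-encoded)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical event: only the identifier that could identify a person is hashed.
event = {"user_id": "12345", "namespace_id": "678", "action": "git_push"}
pseudonymized_event = {**event, "user_id": pseudonymize(event["user_id"])}
```

Because the same input always produces the same hash, events from one user can still be linked to each other after pseudonymization, without exposing the original identifier.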
A user signs up for a free GitLab.com account and creates a group and/or project (Pseudonymized user_id created and associated with group or project ID). They set up their repo, view CI/CD, and decide they want to invite a colleague to set up this functionality. The newly invited user signs up for GitLab (New pseudonymized user_id created and associated with the existing group or project ID) and sets up CI/CD for their team (backend event).
Why this user journey is valuable to GitLab and our users. In this example, by being able to connect pseudonymized user actions with backend actions, we are able to understand how often teams utilize this adoption path and at what rate they're successful. This will let us know what work to prioritize to maximize improvements within the user experience and ensure we're able to understand how impactful these future iterations are.
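The journey above can be sketched as a join of frontend and backend events on shared identifiers. The event names, field names, and identifier values below are hypothetical and only stand in for pseudonymized user IDs and project IDs. This is a minimal sketch, not the pipeline GitLab runs.

```python
from collections import defaultdict

# Hypothetical events; "user_id" values stand in for pseudonymized identifiers.
frontend_events = [
    {"user_id": "a1f3", "project_id": 42, "event": "view_ci_cd_page", "ts": 2},
    {"user_id": "a1f3", "project_id": 42, "event": "invite_colleague", "ts": 3},
]
backend_events = [
    {"user_id": "a1f3", "project_id": 42, "event": "project_created", "ts": 1},
    {"user_id": "9bc7", "project_id": 42, "event": "ci_pipeline_configured", "ts": 4},
]


def journeys_by(key, *event_streams):
    """Merge event streams and group them by `key`, ordered by timestamp."""
    grouped = defaultdict(list)
    for stream in event_streams:
        for e in stream:
            grouped[e[key]].append(e)
    return {
        k: [e["event"] for e in sorted(v, key=lambda e: e["ts"])]
        for k, v in grouped.items()
    }


project_journeys = journeys_by("project_id", frontend_events, backend_events)
user_journeys = journeys_by("user_id", frontend_events, backend_events)
```

Grouping by `project_id` shows the full adoption path across both users, while grouping by `user_id` shows what each pseudonymized user did individually.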
Check the Getting started with Product Intelligence guide.
The Metrics Dictionary is a single source of truth for the metrics and events we collect for product usage data. The Metrics Dictionary lists all the metrics we collect in Service Ping, why we're tracking them, and where they are tracked.
The Metrics Dictionary should be updated any time a new metric is implemented, updated, or removed. Currently, the metrics dictionary is built automatically once a day, so when a change to a metric is made in the .yml, you will see the change in the dictionary within 24 hours.
The Metrics Dictionary was introduced in GitLab version 13.9. Metrics status changes are tracked in the metric YAML definition files. Changes prior to GitLab version 13.9 are not reflected in metrics YAML definitions.
Take a look at our current thinking about future iterations and please let us know how we can improve this tool for you and your team!
Get started with our Quick Start Instrumentation Guide, which is a single page with links to documentation for the entire instrumentation process flow.
This means that we have deprecated the usage of usage_data.rb, and we do not change this file when adding a new metric. When you add or change a Service Ping metric, you must migrate the metric to instrumentation classes.
The Product Intelligence team is working to migrate metrics that are added dynamically (see epic). If a metric from this list is added, removed, or changed before the Product Intelligence team has migrated it, the group owning the metric should perform the migration to instrumentation classes as part of that change.
Adding a new metric to Service Ping requires adding an instrumentation class and a metric definition. See the Metrics Instrumentation guide for more.
As we add more support to instrumentation classes, you can follow the progress in this epic.
Note that not all Service Ping metrics have been migrated to instrumentation classes yet. This is ongoing work, and we kindly ask for cross-team collaboration to complete it.
|1. xMAU level||Determine the level at which the metric should be measured:
1. User level - UMAU
2. Stage level - SMAU, Paid SMAU, Other PI
3. Group level - GMAU, Paid GMAU, Other PI
|2. Collection framework||There are two main tools that we use for tracking user data: Service Ping and Snowplow.||We strongly recommend using Service Ping for xMAU as your metrics will be available on both SaaS and self-managed.|
|3. Instrumentation||Work with your engineering team to instrument tracking for your xMAU.
- Utilize our Quick start instrumentation guide to find the documentation needed for the instrumentation process.
- Service Ping Guide
- Snowplow Guide
|4. Data Availability||Plan instrumentation with sufficient lead time for data availability.
1. Merge your metrics into the next self-managed release as early as possible since users will have to upgrade their instance version to start reporting your instrumented metrics.
2. Wait for your metrics to be released onto production GitLab.com. These releases currently happen on a daily basis.
3. Service Pings are generated on GitLab.com on a weekly basis. An issue is created each milestone associated with this epic, to track the weekly SaaS Service Ping generation. You can find successful payloads and failures in these issues. Verify your new metrics are present in the GitLab.com Service Ping payload.
4. Wait for the Versions database to be imported into the data warehouse.
5. Check the dbt model version_usage_data_unpacked to ensure the database column for your metric is present.
6. Check Sisense to ensure data is available in the data warehouse.
1. Self-managed releases happen on the 22nd of each month (+30 days)
2. Wait at least a week for customers to upgrade to the new release and for a Service Ping to be generated (+7 days)
3. Service Pings are collected in the Versions application. The Versions application's database is automatically imported into the Snowflake Data Warehouse every day (+1 day).
4. In total, plan for cycle times of up to 38 days. Cycle times are slow with monthly releases and weekly pings, so implement your metrics early.
|5. Dashboard||Create a Sisense dashboard. Instructions for creating dashboards are here.|
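The timeline in step 4 of the table above adds up as follows. The constants come directly from the listed steps (+30, +7, and +1 days). The helper names and the example merge date are assumptions for illustration.

```python
from datetime import date, timedelta

RELEASE_CYCLE_DAYS = 30     # monthly self-managed release
UPGRADE_AND_PING_DAYS = 7   # customers upgrade and a weekly Service Ping is generated
WAREHOUSE_IMPORT_DAYS = 1   # daily import of the Versions database into Snowflake


def worst_case_cycle_days() -> int:
    """Sum the waiting periods between merging a metric and seeing its data."""
    return RELEASE_CYCLE_DAYS + UPGRADE_AND_PING_DAYS + WAREHOUSE_IMPORT_DAYS


def latest_data_available(merge_date: date) -> date:
    """Latest date by which data should be in the warehouse, in the worst case."""
    return merge_date + timedelta(days=worst_case_cycle_days())


total_days = worst_case_cycle_days()
available_by = latest_data_available(date(2022, 5, 11))
```

With these assumptions, a metric merged on 2022-05-11 could take until 2022-06-18 to show up in the data warehouse, which is why early instrumentation matters.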
Note: We now enable you to deduplicate aggregated metrics implemented via Redis HLL, in order to get distinct counts (for example, a distinct user count across multiple actions in a stage). Please read our docs on Aggregated Metrics for more information.
|Aggregated||Metric contains rolled-up values due to an aggregate function (COUNT, SUM, etc)||Total Page Views (TPV) - the sum of all events when a page was viewed.|
|Deduplicated||Metric counts each unit of measurement once.||Unique Monthly Active Users (UMAU) - each user_id is counted once|
|Deduplicated Aggregated||Metric contains a rolled-up value where each unit is counted once.||UMAU is a deduplicated aggregated metric but TPV is not.|
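The difference between these metric types can be shown with a small example. This sketch uses exact Python sets in place of Redis HyperLogLog (which produces approximate counts), and the events and field names are hypothetical.

```python
# Hypothetical page-view events for one month.
page_views = [
    {"user_id": "u1", "action": "view_issue"},
    {"user_id": "u1", "action": "view_merge_request"},
    {"user_id": "u2", "action": "view_issue"},
    {"user_id": "u1", "action": "view_issue"},
]

# Aggregated: every event counts, so repeat views by one user add up (like TPV).
total_page_views = len(page_views)

# Deduplicated aggregated: each user_id is counted once (like UMAU).
unique_users = len({e["user_id"] for e in page_views})

# Distinct users per action, then across actions.
users_by_action = {}
for e in page_views:
    users_by_action.setdefault(e["action"], set()).add(e["user_id"])

# Summing per-action counts double-counts users active in several actions.
naive_sum = sum(len(users) for users in users_by_action.values())

# Taking the union of the sets counts each user once across all actions.
deduplicated = len(set().union(*users_by_action.values()))
```

The gap between `naive_sum` and `deduplicated` is what deduplicating aggregated metrics across multiple actions is meant to remove.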
version.gitlab.com is provided once per quarter. See the epic for more details.
See the issue list for work related to VersionApp.
|Getting started with Product Intelligence||The guide covering implementation and usage of Product Intelligence tools|
|Service Ping Guide||An implementation guide for Service Ping|
|Snowplow Guide||An implementation guide for Snowplow|
|Metrics Dictionary||A SSoT for all collected metrics and events|
|Product Intelligence Direction||The roadmap for Product Intelligence at GitLab|
|Product Intelligence Development Process||The development process for the Product Intelligence groups|
|FAQ||Product Intelligence FAQ|
2022-05-11: last page update