At GitLab, we collect product usage data to help us build a better product. Data helps GitLab understand which parts of the product need improvement and which features we should build next. Product usage data also helps our team better understand why people use GitLab. With this knowledge we can make better product decisions.
There are several stages and teams involved in going from collecting data to making it useful for our internal teams and customers.
| Stage | Description | Team(s) |
|-------|-------------|---------|
| Collection | The data collection tools used across all GitLab applications, including GitLab SaaS, GitLab self-managed, CustomerDot, LicenseDot, VersionDot, and [about.gitlab.com](http://about.gitlab.com/). Our current tooling includes Snowplow, Usage Ping, and Google Analytics. | Product Intelligence, Infrastructure |
| Extraction | The data extraction tools used to extract data from Product, Infrastructure, and Enterprise Apps data sources. Our current tooling includes Stitch, Fivetran, and custom integrations. | Data |
| Loading | The data loading tools used to load data from Product, Infrastructure, and Enterprise Apps data sources into our data warehouse. Our current tooling includes Stitch, Fivetran, and custom integrations. | Data |
| Orchestration | The orchestration of extraction and loading tooling to move data from sources into the Enterprise Data Warehouse. Our current tooling is Airflow. | Data |
| Storage | The Enterprise Data Warehouse (EDW), the single source of truth for GitLab's corporate data, performance analytics, and enterprise-wide data such as Key Performance Indicators. Our current EDW is built on Snowflake. | Data |
| Transformation | The transformation and modelling of data in the Enterprise Data Warehouse in preparation for data analysis. Our current tooling is dbt and Python scripts. | Data, Product Intelligence |
| Analysis | The analysis of data in the Enterprise Data Warehouse using a querying and visualization tool. Our current tooling is Sisense. | Data, Product Intelligence |
The systems overview is a simplified diagram showing the interactions between GitLab Inc and self-managed instances.
For Product Intelligence purposes, GitLab Inc has three major components:
For Product Intelligence purposes, self-managed instances have two major components:
As shown by the orange lines, on GitLab.com Snowplow JS, Snowplow Ruby, Usage Ping, and PostgreSQL database imports all flow into GitLab Inc's data infrastructure. However, on self-managed, only Usage Ping flows into GitLab Inc's data infrastructure.
As shown by the green lines, on GitLab.com system logs flow into GitLab Inc's monitoring infrastructure. On self-managed, there are no logs sent to GitLab Inc's monitoring infrastructure.
Note (1): Snowplow JS and Snowplow Ruby are available on self-managed; however, the Snowplow Collector endpoint is set to a self-managed Snowplow Collector that GitLab Inc does not have access to.
UI events are any interface-driven actions from the browser including click data.
CRUD or API events
These are backend events covering the creation, reading, updating, and deletion of records, and other events that might be triggered from layers other than those available in the interface.
Database records
These are raw database records which can be explored using business intelligence tools like Sisense. The full list of available tables can be found in structure.sql.
Instance settings
These are settings of your instance, such as the instance's Git version and whether certain features are enabled.
Integration settings
These are integrations your GitLab instance interacts with, such as an external storage provider or an external container registry. These services must be able to send data back into a GitLab instance for data to be tracked.
✅ Available, 🔄 In Progress, 📅 Planned, ✖️ Not Planned
We use three methods to gather product usage data:
Snowplow is an enterprise-grade marketing and product analytics platform that helps track how users engage with our website and application.
Snowplow consists of two components:
For more details, read the Snowplow guide.
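To make the two components concrete, here is a minimal sketch of what a Snowplow tracker does under the hood: it builds a structured-event payload and addresses it to a collector endpoint. The parameter names (`e`, `se_ca`, `se_ac`, `se_la`) follow the Snowplow tracker protocol for structured events; the collector host and event values below are placeholders, not GitLab's actual configuration.

```ruby
require "uri"

# Build the GET-style request URL a Snowplow tracker would send to a collector
# for a structured event. "e=se" marks a structured event; se_ca/se_ac/se_la
# are the event category, action, and optional label.
def snowplow_event_url(collector:, category:, action:, label: nil)
  params = { "e" => "se", "se_ca" => category, "se_ac" => action }
  params["se_la"] = label if label
  "https://#{collector}/i?#{URI.encode_www_form(params)}"
end

url = snowplow_event_url(collector: "collector.example.com",
                         category: "projects:show",
                         action: "click_button")
```

A real tracker batches and POSTs events rather than building URLs by hand, but the payload shape is the same idea.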
Usage Ping is a method for GitLab Inc to collect usage data on a GitLab instance. Usage Ping is primarily composed of row counts for different tables in the instance's database. By comparing these counts month over month (or week over week), we can get a rough sense of how an instance is using the different features within the product. This high-level data is used to help our product, support, and sales teams.
For more details, read the Usage Ping guide.
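Because Usage Ping is mostly row counts keyed by feature, its payload can be illustrated with a small sketch. The metric names, instance UUID, and counts below are fabricated for illustration; the real payload is assembled by the GitLab application from its own database.

```ruby
require "json"

# Stand-ins for database tables: only the row counts matter for the ping.
fake_tables = {
  projects:       Array.new(42),
  issues:         Array.new(317),
  merge_requests: Array.new(128)
}

# A drastically simplified Usage Ping payload: an instance identifier,
# the running version, and per-table row counts.
usage_ping = {
  uuid:    "00000000-0000-0000-0000-000000000000", # placeholder instance id
  version: "13.6.0",
  counts:  fake_tables.transform_values(&:count)
}

payload = JSON.generate(usage_ping)
```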
Database imports are full imports of data into GitLab's data warehouse. For GitLab.com, the PostgreSQL database is loaded into Snowflake data warehouse every 6 hours. For more details, see the data team handbook.
Our reporting level (aggregate or individual) varies by segment. For example, for self-managed users, we can report at an aggregate user level using Usage Ping but not at an individual user level.
Our reporting time periods also vary by segment. For example, for self-managed users, we can report all-time counts and 28-day counts in Usage Ping.
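The two time frames above differ only in the window applied to the same event stream. A small sketch, using fabricated event dates:

```ruby
require "date"

# Fabricated event timestamps for one metric on one instance.
events = [Date.new(2020, 1, 5), Date.new(2020, 9, 1), Date.new(2020, 9, 20)]
today  = Date.new(2020, 9, 25)

# All-time count: every event ever recorded.
all_time_count = events.count

# 28-day count: only events inside the trailing 28-day window.
count_28d = events.count { |d| d > today - 28 }
```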
The Metrics Dictionary is a single source of truth for the metrics and events we collect for product usage data. The Metrics Dictionary lists all the metrics we collect in Usage Ping, why we're tracking them, and where they are tracked.
The Metrics Dictionary should be updated any time a new metric is implemented, updated, or removed.
We're currently focusing our Metrics Dictionary on Usage Ping. In the future, we will also include Snowplow. We currently have an initiative across the entire product organization to complete the Metrics Dictionary for Usage Ping.
We've recently had a large push across the product organization to become more data driven. Part of this push includes getting product metrics in place for each product section, stage, and group. In our FY21-Q3 OKRs, we set up a couple of OKRs to help us accomplish this:
To accomplish these OKRs, we set up a seven-step process to implement product metrics. This process was originally presented in the Weekly Product Meeting on August 11, 2020 (slide deck and video presentation) and has been refined over time.
| Implementation Status | Description | Responsibility | Exit Criteria |
|-----------------------|-------------|----------------|---------------|
| Definition | The definition step outlines the process for deciding which product metrics to track. | PM Responsibility, Product Intelligence Support | Metric is defined in the Event Dictionary and in the Performance Indicator file with the future |
| Instrumentation | The instrumentation step outlines how each product team implements data collection. | PM Responsibility, Product Intelligence Support | Instrumentation is completed and feature flags are turned off so that data can be collected |
| Data Availability | The data availability step outlines the timing from a product release to receiving product usage data in the data warehouse. | PM Responsibility, Product Intelligence Support | PM confirms that the |
| Dashboard | The dashboarding step outlines how Sisense dashboards are built. | PM Responsibility, Product Intelligence Support | There is a chart in Sisense. |
| Handbook | The Product PI handbook page describes how product performance indicators are added for each product section, stage, and group. | PM Responsibility, Product Intelligence Support | Chart is embedded into the handbook. Target has been assigned based on the data. |
| Target | The target definition step outlines how targets are defined for each performance indicator. | PM Responsibility, Product Intelligence Support | The target value is in both the chart and the Performance Indicator (PI) file. |
| Complete | All of the prior steps have been completed. | PM Responsibility, Product Intelligence Support | :tada: |
Determine what metrics are important for your specific section, stage, or group.
Note: We now enable you to deduplicate aggregated metrics implemented via Redis HLL in order to get distinct counts (e.g. a distinct count of users across multiple actions in a stage). Please read our docs on Aggregated Metrics for more information. We are working towards the ability to deduplicate across multiple Database HLL metrics via #288848, and then deduplication across multiple Redis HLL and Database HLL metrics via #421
| Metric Type | Definition | Example |
|-------------|------------|---------|
| Aggregated | Metric contains rolled-up values due to an aggregate function (COUNT, SUM, etc.). | Total Page Views (TPV): the sum of all events when a page was viewed. |
| Deduplicated | Metric counts each unit of measurement once. | Unique Monthly Active Users (UMAU): each user_id is counted once. |
| Deduplicated Aggregated | Metric contains a rolled-up value where each unit is counted once. | UMAU is a deduplicated aggregated metric, but TPV is not. |
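The distinction between the metric types above can be shown with a worked example over a fabricated page-view log: TPV counts every event, while UMAU counts each user once.

```ruby
# Fabricated page-view events: user 1 viewed two pages, user 2 viewed one.
page_views = [
  { user_id: 1, page: "/dashboard" },
  { user_id: 1, page: "/issues" },
  { user_id: 2, page: "/dashboard" }
]

# Aggregated: sum of all page-view events.
tpv = page_views.count

# Deduplicated aggregated: each user_id counted once.
umau = page_views.map { |e| e[:user_id] }.uniq.count
```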
Work with your engineering team to instrument tracking for your XMAU. Focus on using Usage Ping, as your metrics will then be available on both SaaS and self-managed.
The Usage Ping Guide outlines the steps required for instrumentation. It includes:
Plan instrumentation with sufficient lead time for data availability. Ensure your metrics make it into the self-managed release as early as possible.
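One instrumentation convention worth knowing when planning: Usage Ping counters are wrapped so that a metric that fails to compute reports -1 rather than breaking the whole payload. A minimal sketch of that pattern, with a made-up counter and data (the real helpers live in the GitLab codebase and are covered in the Usage Ping Guide):

```ruby
# Fallback value reported when a Usage Ping metric fails to compute.
FALLBACK = -1

# Wrap a count so an exception yields the fallback instead of an error.
def safe_count
  yield
rescue StandardError
  FALLBACK
end

projects = Array.new(7) # stand-in for a database relation

ok    = safe_count { projects.count }        # successful count
broke = safe_count { raise "query timeout" } # failed count reports -1
```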
Dashboard the metric. This is done by creating a Sisense dashboard. Avoid cumulative views and instead focus on month-over-month growth. Instructions for creating dashboards are here.
We need PMs to self-serve their own dashboards as data team capacity is limited. The data team will be focused on enabling self-service, advising PMs, and working on the more challenging XMAU dashboards.
To learn how to create your own dashboard, see Data For Product Managers: Creating Charts
For GMAU and SMAU data issues:
Add the XMAU label to the data issues.
For non-GMAU and non-SMAU data issues:
We need all PMs to ensure their PIs are showing on the performance indicator pages, based on What we're aiming for.
To do so, we need a clear way to communicate to PMs exactly which PIs are remaining. We will be adding placeholder PIs for each section into the performance indicator file so that all required PIs show in the handbook. Once a PI is implemented, the actual PI will replace the placeholder PI. For more information about how PIs and XMAUs are related to one another, see PI Structure.
As a product organization, we need to get into the habit of understanding our baselines and setting targets for each stage & group. For the PI Target step, you will work with your Section or Group Leader to define targets for each of your XMAUs.
Set a growth target and embed it in the tracking dashboard. Growth targets should be ambitious but achievable.
All of the prior steps have been completed and a PI is successfully implemented.
| Resource | Description |
|----------|-------------|
| Product Intelligence Guide | A guide to Product Intelligence |
| Usage Ping Guide | An implementation guide for Usage Ping |
| Snowplow Guide | An implementation guide for Snowplow |
| Event Dictionary | A single source of truth (SSoT) for all collected metrics and events |
| Implementing Product Performance Indicators | The workflow for putting product performance indicators in place |
| Product Intelligence Direction | The roadmap for Product Intelligence at GitLab |
| Product Intelligence Development Process | The development process for the Product Intelligence groups |