Analytics Instrumentation Guide

Disclaimer: This guide is mostly out of date. We are in the process of updating and migrating the content. {: .alert .alert-info}

Analytics Instrumentation Overview

At GitLab, we collect product usage data for the purpose of helping us build a better product. Data helps GitLab understand which parts of the product need improvement and which features we should build next. Product usage data also helps our team better understand the reasons why people use GitLab. With this knowledge we are able to make better product decisions.

To get an introduction to Product Analytics at GitLab, have a look at the GitLab Product Data Training (internal) deck.

Several stages and teams are involved in going from collecting data to making it useful for our internal teams and customers.

| Stage | Description | DRI | Support Teams |
|-------|-------------|-----|---------------|
| Privacy Settings | The implementation of our Privacy Policy, including data classification, data access, and user settings to control what data is shared with GitLab. | Analytics Instrumentation | Legal, Data |
| Collection | The data collection tools used across all GitLab applications, including GitLab SaaS, GitLab self-managed, CustomerDot, VersionDot, and about.gitlab.com. Our current tooling includes Snowplow, Service Ping, and Google Analytics. | Analytics Instrumentation | Infrastructure |
| Extraction | The data extraction tools used to extract data from Product, Infrastructure, and Enterprise Apps data sources. Our current tooling includes Stitch, Fivetran, and custom tools. | Data | |
| Loading | The data loading tools used to extract data from Product, Infrastructure, and Enterprise Apps data sources and load it into our data warehouse. Our current tooling includes Stitch, Fivetran, and custom tools. | Data | |
| Orchestration | The orchestration of extraction and loading tooling to move data from sources into the Enterprise Data Warehouse. Our current tooling includes Airflow. | Data | |
| Storage | The Enterprise Data Warehouse (EDW), the single source of truth for GitLab's corporate data, performance analytics, and enterprise-wide data such as Key Performance Indicators. Our current EDW is built on Snowflake. | Data | |
| Transformation | The transformation and modelling of data in the Enterprise Data Warehouse in preparation for data analysis. Our current tooling is dbt and Python scripts. | Data | Analytics Instrumentation |
| Analysis | The analysis of data in the Enterprise Data Warehouse using a querying and visualization tool. Our current tooling is Sisense. | Data, Product Data Analysis | Analytics Instrumentation |

Editable source file

Our SaaS data collection catalog spans both the client (frontend) and server (backend) and uses various tools. We pick up events and data produced while the application is being used. By using the collected identifiers, we can string these backend and frontend events together to illustrate a GitLab journey at the (1) user (pseudonymized), (2) namespace, and (3) project level.

The table below explains the types of data we collect from GitLab.com and gives examples of what it can be used for.

| Technology | Data Type | Description | Aggregation Method |
|------------|-----------|-------------|--------------------|
| Snowplow Tracking<br>- Snowplow JS Tracker: client-side (FE) events<br>- Snowplow Ruby Tracker: server-side (BE) events<br>- Schema of events here | Event-based data | Examples:<br>- Collects an event on Git pushes<br>- Collects an event on a button click<br>- Collects an event on a successful pipeline<br>- Collects an event on a request to a Rails controller | Event-based, or grouped by an attribute (e.g. session) |
| Service Ping<br>- PostgreSQL database<br>- Redis in-memory data store<br>- System logs | Transactional data | Examples:<br>- Total issues created<br>- Instance settings such as the instance's Git version<br>- Integration settings such as an external storage provider | Count based on either total time or a given timeframe |
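
The two aggregation methods in the table behave quite differently. The sketch below contrasts them on a handful of made-up events: grouping event-based data by an attribute such as session (Snowplow-style) versus counting everything that happened within a timeframe (Service Ping-style). This is a minimal Python sketch; all event names, fields, and values are illustrative assumptions, not our actual schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events, already collected.
events = [
    {"session": "s1", "name": "button_click", "at": datetime(2022, 5, 1, 10, 0)},
    {"session": "s1", "name": "git_push",     "at": datetime(2022, 5, 1, 10, 5)},
    {"session": "s2", "name": "button_click", "at": datetime(2022, 5, 3, 9, 0)},
]

# Snowplow-style: event-based data, grouped by an attribute (e.g. session).
per_session = Counter(event["session"] for event in events)
print(per_session)  # Counter({'s1': 2, 's2': 1})

# Service Ping-style: a single count over a given timeframe.
window_start = datetime(2022, 5, 1)
window_end = datetime(2022, 5, 2)
in_window = sum(1 for event in events if window_start <= event["at"] < window_end)
print(in_window)  # 2
```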

Data used as identifiers

In order to create the SaaS usage journeys documented below, we collect various identifiers in our data collection catalog. Where an identifier could be used to personally identify a user by someone without permission to view that information, we pseudonymize the data via hashing at the collection layer. You can find the collected identifiers in our Metrics Dictionary.
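
As an illustration of the pseudonymization step, the sketch below hashes a raw identifier before it leaves the collection layer. It is a minimal Python sketch assuming a salted SHA-256 scheme; the exact hashing algorithm and salt handling used in production are not specified on this page.

```python
import hashlib

# Hypothetical instance-level salt; in practice a salt would be stored
# securely and never shipped alongside the events themselves.
SALT = "instance-specific-salt"

def pseudonymize(identifier: str, salt: str = SALT) -> str:
    """Return a stable, non-reversible stand-in for a raw identifier."""
    return hashlib.sha256(f"{salt}:{identifier}".encode("utf-8")).hexdigest()

# The same user_id always maps to the same pseudonym, so events can still
# be joined downstream without exposing who the user actually is.
print(pseudonymize("user_id:42"))
```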

Example User Journey

A user signs up for a free GitLab.com account and creates a group and/or project (a pseudonymized user_id is created and associated with the group or project ID). They set up their repository, then view CI/CD and decide they want to invite a colleague to set up this functionality. The newly invited user signs up for GitLab (a new pseudonymized user_id is created and associated with the existing group or project ID) and sets up CI/CD for their team (backend event).

Why this user journey is valuable to GitLab and our users: in this example, by being able to connect pseudonymized user actions with backend actions, we are able to understand how often teams follow this adoption path and at what rate they succeed. This lets us know what work to prioritize to maximize improvements to the user experience, and ensures we can understand how impactful those future iterations are.
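
To make the journey above concrete, here is a minimal sketch of how pseudonymized frontend and backend events could be stitched into a single project-level timeline once they share identifiers. The event fields and values are assumptions for illustration, not the actual event schema.

```python
from operator import itemgetter

# Hypothetical events as they might arrive from the frontend (Snowplow JS)
# and backend (Snowplow Ruby) trackers, already pseudonymized.
events = [
    {"ts": "2022-05-01T10:00:00Z", "source": "frontend", "pseudo_user_id": "a1f...", "project_id": 7, "action": "view_ci_cd_page"},
    {"ts": "2022-05-01T10:05:00Z", "source": "backend",  "pseudo_user_id": "a1f...", "project_id": 7, "action": "member_invited"},
    {"ts": "2022-05-02T09:00:00Z", "source": "backend",  "pseudo_user_id": "b9c...", "project_id": 7, "action": "pipeline_created"},
]

# Group events by project to reconstruct a project-level journey,
# then order each journey by timestamp.
journeys = {}
for event in events:
    journeys.setdefault(event["project_id"], []).append(event)

for project_id, steps in journeys.items():
    for step in sorted(steps, key=itemgetter("ts")):
        print(project_id, step["ts"], step["source"], step["action"])
```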

Which Tooling To Use

Check the Getting started with Analytics Instrumentation guide.

Metrics Dictionary

The Metrics Dictionary is a single source of truth for the metrics and events we collect for product usage data. The Metrics Dictionary lists all the metrics we collect in Service Ping, why we’re tracking them, and where they are tracked.

The Metrics Dictionary should be updated any time a new metric is implemented, updated, or removed. Currently, the metrics dictionary is built automatically once a day, so when a change to a metric is made in the .yml, you will see the change in the dictionary within 24 hours.

The Metrics Dictionary can be viewed for Service Ping and Snowplow data; however, Snowplow metrics have not yet been fully converted to this method.

The Metrics Dictionary was introduced in GitLab version 13.9. Metrics status changes are tracked in the metric YAML definition files. Changes prior to GitLab version 13.9 are not reflected in metrics YAML definitions.
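
For orientation, each metric is described by a small YAML definition in the GitLab codebase, and the Metrics Dictionary is generated from those files. The sketch below mirrors the kind of fields such a definition carries, expressed as a Python dict for illustration; the exact field names and allowed values are governed by the metric definition schema in the GitLab docs, so treat this as an assumption-level outline rather than the authoritative schema.

```python
# An illustrative outline of a Service Ping metric definition, mirroring the
# YAML fields (field names and values are indicative only).
example_metric_definition = {
    "key_path": "counts.issues",          # where the value appears in the payload
    "description": "Total issues created",
    "product_group": "project_management",
    "value_type": "number",
    "status": "active",
    "milestone": "13.9",                  # version in which the metric was introduced
    "time_frame": "all",
    "data_source": "database",
    "tier": ["free", "premium", "ultimate"],
}
```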

Metrics Dictionary features


  1. Filter data. In the search bar, enter a value you want to filter the results set by.
  2. Customize viewable columns. Select the options button to expand the "table fields" control. From here, you can select the columns you want to display in your view. Note that this does not filter the data by your selection; it only controls whether a column is displayed, regardless of its values.
  3. Sisense query for GitLab.com. Copy this query for use in Sisense. A common use case for this feature is to identify if data is available for SaaS Service Ping. Watch this quick video to learn more.
  4. Performance indicator type. Metrics which are utilized in business critical xMAU calculations are indicated with a Performance indicator type value.
  5. Export. You can now download the entire metrics dictionary as a .csv file.
  6. Metric Version. Starting with milestone 13.9, we've begun to record the version associated with each metric. Unfortunately we couldn't populate the historical values for existing metrics, so all prior metrics are labeled as <13.9.
  7. Metric Product Section/Stage/Group. You can display and/or filter by Section, Stage and Group as needed.
  8. Service Usage Data Category. View and/or filter by Service Usage Data category (Optional, Operational, Subscription).

Instrumenting Metrics and Events

Get started with our Quick Start Instrumentation Guide, which is a single page with links to documentation for the entire instrumentation process flow.

Implementing xMAU metrics

| Description | Instructions | Notes |
|-------------|--------------|-------|
| 1. xMAU level | Determine the level at which the metric should be measured:<br>1. User level - UMAU<br>2. Stage level - SMAU, Paid SMAU, Other PI<br>3. Group level - GMAU, Paid GMAU, Other PI | |
| 2. Collection framework | There are two main tools that we use for tracking user data: Service Ping and Snowplow. We strongly recommend using Service Ping for xMAU, as your metrics will be available on both SaaS and self-managed. | |
| 3. Instrumentation | Work with your engineering team to instrument tracking for your xMAU.<br>- Utilize our Quick Start Instrumentation Guide to find the documentation needed for the instrumentation process. | Additional reference:<br>- Service Ping Guide<br>- Snowplow Guide |
| 4. Data Availability | Plan instrumentation with sufficient lead time for data availability.<br>1. Merge your metrics into the next self-managed release as early as possible, since users will have to upgrade their instance version to start reporting your instrumented metrics.<br>2. Wait for your metrics to be released onto production GitLab.com. These releases currently happen on a daily basis.<br>3. Service Pings are generated on GitLab.com on a weekly basis. An issue is created each milestone, associated with this epic, to track the weekly SaaS Service Ping generation. You can find successful payloads and failures in these issues. Verify your new metrics are present in the GitLab.com Service Ping payload (see the payload-check sketch after this table).<br>4. Wait for the Versions database to be imported into the data warehouse.<br>5. Check the dbt model version_usage_data_unpacked to ensure the database column for your metric is present.<br>6. Check Sisense to ensure data is available in the data warehouse. | Timeline:<br>1. Self-managed releases happen every month (+30 days).<br>2. Wait at least a week for customers to upgrade to the new release and for a Service Ping to be generated (+7 days).<br>3. Service Pings are collected in the Versions application. The Versions application's database is automatically imported into the Snowflake Data Warehouse every day (+1 day).<br>4. In total, plan for cycle times of up to 38 days. Cycle times are slow with monthly releases and weekly pings, so implement your metrics early. |
| 5. Dashboard | Create a Sisense dashboard. Instructions for creating dashboards are here. | |
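
For step 3 of the data-availability checklist, a quick way to verify that a metric made it into a Service Ping payload is to walk the payload JSON by the metric's key_path. This is a minimal Python sketch under the assumption that you have saved a payload locally as service_ping_payload.json and that your metric's key_path looks like counts.issues; take the real key_path from the Metrics Dictionary.

```python
import json
from typing import Any

def fetch_metric(payload: dict, key_path: str) -> Any:
    """Follow a dotted key_path (e.g. 'counts.issues') into a Service Ping payload."""
    node: Any = payload
    for key in key_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

# 'service_ping_payload.json' is a hypothetical local copy of a payload
# taken from a weekly SaaS Service Ping generation issue.
with open("service_ping_payload.json") as f:
    payload = json.load(f)

value = fetch_metric(payload, "counts.issues")
print("metric present:", value is not None, "value:", value)
```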

Deduplication of Aggregated Metrics

Note: We now enable you to deduplicate aggregated metrics implemented via Redis HLL, in order to get distinct counts (e.g. a distinct count of users across multiple actions in a stage). Please read our docs on Aggregated Metrics for more information.

| Term | Definition | Example |
|------|------------|---------|
| Aggregated Metric | Contains rolled-up values due to an aggregate function (COUNT, SUM, etc.). | Total Page Views (TPV) - the sum of all events when a page was viewed. |
| Deduplicated Metric | Counts each unit of measurement once. | Unique Monthly Active Users (UMAU) - each user_id is counted once. |
| Deduplicated Aggregated Metric | Contains a rolled-up value where each unit is counted once. | UMAU is a deduplicated aggregated metric, but TPV is not. |
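
Under the hood, Redis HLL deduplication comes down to merging HyperLogLog registers and counting their union. The sketch below shows that mechanism directly with redis-py; it is an assumption-level illustration of the technique, not GitLab's actual aggregated-metrics implementation, which wraps this in its own counters and YAML definitions.

```python
import redis

r = redis.Redis()  # assumes a local Redis instance, for illustration only

# Record which (pseudonymized) users performed each action this month.
r.pfadd("hll:issues_created:2022-05", "user_a", "user_b")
r.pfadd("hll:merge_requests_created:2022-05", "user_b", "user_c")

# Per-action counts: each user is counted once per action (HLL is approximate).
print(r.pfcount("hll:issues_created:2022-05"))          # ~2
print(r.pfcount("hll:merge_requests_created:2022-05"))  # ~2

# Deduplicated aggregated metric: distinct users across both actions.
# pfcount over multiple keys returns the cardinality of their union (~3),
# so user_b is not double-counted.
print(r.pfcount("hll:issues_created:2022-05",
                "hll:merge_requests_created:2022-05"))
```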

Finding Reporting Dependencies on Metrics

Although PMs commonly use Service Ping metrics to measure feature health and product usage, that is not the only use for Service Ping metrics. For instance, the Product Data Insights team relies on Service Ping metrics for xMAU reporting. Additionally, the Customer Success Operations team relies on Service Ping metrics to generate health scores for customers.

For a variety of reasons, we recommend not changing the calculation of metrics that are used for reporting once they are implemented. If you do need to change a metric that is being used, please coordinate with the Customer Success team before doing so, so that they can update their models and health scores accordingly. To identify metrics that are relied upon for reporting, follow these directions.

xMAU Reporting

  1. Go to the Metrics Dictionary
  2. Click “Customize table” and select “Performance indicator type”
  3. Search for a metric and view the performance indicator type values.
    • If the field is blank, there are no xMAU reporting dependencies on this metric
    • If the field is not blank, there are xMAU reporting dependencies on this metric. Please reach out to the Product Data Insights team to understand how changing metric calculations would impact downstream dependencies.

Customer Health Scoring

  1. Go to this CSV, which is the SSOT for metrics that are used for health scoring.
  2. If the metric of interest is in this CSV, then it is being used for customer health scoring. Please reach out to Customer Success Operations to understand how changing metric calculations would impact downstream dependencies.

| Resource | Description |
|----------|-------------|
| Getting started with Analytics Instrumentation | The guide covering implementation and usage of Analytics Instrumentation tools |
| Service Ping Guide | An implementation guide for Service Ping |
| Snowplow Guide | An implementation guide for Snowplow |
| Metrics Dictionary | A SSoT for all collected metrics and events |
| Privacy Policy | Our privacy policy outlining what data we collect and how we handle it |
| Analytics Instrumentation Direction | The roadmap for Analytics Instrumentation at GitLab |
| Analytics Instrumentation Development Process | The development process for the Analytics Instrumentation groups |

Last page update: 2022-05-11


Our Commitment to Individual User Privacy in relation to Service Usage Data
While there are examples of data collection being used with malicious intent, data collection and analysis have also allowed companies to improve their products and services, benefiting their end users and consumers. It is in this vein that GitLab collects usage data about its products. We collect individual usage data in a pseudonymized manner at the namespace level and then use this information to power our product decisions and improve GitLab for you.