This page is intended to help Product Managers at GitLab understand what data is available to them and how they can use it to understand how their product is used. This page primarily covers two topics: how to consume data, and what data is available.
The user-facing end of GitLab's data stack is our BI tool, Sisense (formerly known as Periscope), which is connected to our Snowflake data warehouse. The Sisense page of the data team handbook has general information about Sisense aimed at a wider GitLab audience.
Some useful links that we recommend for you to bookmark:
`Projects > gitlab_snowflake > models` will give a list of all of the models (think of these as tables) that exist in the data warehouse. Models are organized in directories according to their data source.
Models in the `analytics_staging` schema have undergone little to no data transformation, meaning they're basically clones of the raw data source. Example: the `analytics_staging.gitlab_dotcom_merge_requests` table is a copy of the `merge_requests` table from the GitLab.com Postgres database (except it also includes rows that were hard-deleted from the latter).
Models in the `analytics` schema have typically undergone more data transformation. They will typically end in `_xf` to represent the fact that they are transformed. This is done as a convenience to the end-user, who won't have to do the common joins and transformations themselves. For example: the `analytics.gitlab_dotcom_merge_requests_xf` table includes columns about a merge request's labels, milestone, and author, which come from joining to other tables.
Some data sources have both `analytics_staging` and `analytics` models, whereas others might only have an `analytics_staging` model. When no further transformation is needed, a staging model such as `analytics_staging.gitlab_dotcom_groups` will function the same as a transformed `analytics` model.
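As a sketch of that convenience (the `milestone_title` column here is an assumption for illustration, not a verified column name), the transformed model lets you aggregate on joined attributes directly:

```sql
-- Against the staging model, milestone details would require a manual join
-- back to the milestones table. The transformed model exposes them directly.
-- (milestone_title is an assumed column name.)
SELECT milestone_title, COUNT(*) AS merge_requests
FROM analytics.gitlab_dotcom_merge_requests_xf
GROUP BY 1
ORDER BY 2 DESC;
```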
The first question we on the data team ask product managers is usually "are you interested in knowing this for self-managed or GitLab.com?" Our approach to answering your question differs greatly between the two. Although our self-managed offering has many more active customers, our GitLab.com offering has much more data available to analyze.
Query example filtering out GitLab.com:

```sql
SELECT *
FROM analytics.version_usage_data_unpacked
WHERE uuid <> 'ea8bf810-1d6f-4a6a-b4fd-93e8cbd8b57f' -- GitLab.com UUID
```
Snippets are a great way to allow Periscope users to build charts without writing any SQL. Anyone with editor access can write their own snippets. The data team has created several snippets that have the official badge. To find a list of available snippets, click on the scissors icon in the left menu.
We created a snippet called `feature_usage_this_agg_period_and_growth(feature)` (SQL source) that lets you quickly get feature usage from the Usage Ping data source without writing any SQL. Here is a list of the fields it returns:
- `feature_usage_sum`: greater than 0 for this period.
- `feature_usage_sum_change`: greater than 0 for this period.
- the `feature_usage_sum` value over the previous 12 months.
This snippet can be used with the date-range and aggregation filters. It has been used in several dashboards, one of them being the Secure Metrics dashboard.
We are planning to build a second snippet to report on the adoption rate per product tier. This snippet will make it easy to calculate, among the instances that send us usage pings in a given period, the number of instances that use a specific feature.
Because Usage Ping data is reported at the `instance` level, it is not super useful for GitLab.com, where we often want to see information at the `namespace` level. For example, knowing that 40K people used your stage on GitLab.com is somewhat useful, but you'll probably want more context (Are they free or paid? What plan are they on? Do I have any power users, or is usage equally distributed?)
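A namespace-level breakdown might look like the following sketch (the `plan_title` column, and its presence on the `_xf` model, are assumptions for illustration):

```sql
-- Merge requests created, broken down by the namespace's plan.
-- plan_title is an assumed column name on the transformed model.
SELECT plan_title, COUNT(*) AS merge_requests_created
FROM analytics.gitlab_dotcom_merge_requests_xf
GROUP BY 1
ORDER BY 2 DESC;
```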
Most Usage Ping counters are simple `SELECT COUNT(*) FROM x` statements, making them trivial to replicate.
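For example, if a self-managed counter were a plain count of merge requests, the GitLab.com equivalent could be run directly against the warehouse copy of the Postgres table (a sketch, assuming the counter really is a bare row count):

```sql
-- Self-managed instances report this number via Usage Ping;
-- for GitLab.com we can compute the same count ourselves.
SELECT COUNT(*) AS merge_requests_count
FROM analytics_staging.gitlab_dotcom_merge_requests;
```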
We do not record the `user_id` on any of the Snowplow events, making all events functionally anonymous. This severely limits the utility of these events.
As mentioned, the anonymization of Snowplow events is a severe limitation for Snowplow data instrumentation. Nonetheless, you can still add Snowplow tracking to measure feature adoption and usage.
We recommend Product Managers and their teams use Snowplow custom structured events, which are Snowplow's canonical events.
When adding new click events, we should add them in a way that's internally consistent. If we don't, it'll be very painful to perform analysis across features since each feature will be capturing events differently.
The current method provides 5 attributes that are sent on each click event. Please try to follow these guidelines when specifying events to capture:
| Attribute | Type | Required | Description |
| --------- | ---- | -------- | ----------- |
| `category` | text | true | The page or backend area of the application. Unless infeasible, please use the Rails page attribute by default in the frontend, and namespace + classname on the backend. |
| `action` | text | true | The action the user is taking, or the aspect that's being instrumented. The first word should always describe the action or aspect: clicks should be `click`. |
| `label` | text | false | The specific element, or object, that's being acted on. This is either the label of the element (e.g. a tab labeled 'Create from template' may be `create_from_template`). |
| `property` | text | false | Any additional property of the element, or object being acted on. |
| `value` | decimal | false | Describes a numeric value or something directly related to the event, such as the value of an input. |
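When events follow these guidelines, cross-feature analysis reduces to simple filters. A sketch (the `event_category` and `event_label` column names, and the `'projects:new'` category, are assumptions for illustration):

```sql
-- Clicks per element on a single page, relying on consistent
-- category/action/label conventions. Column names are assumed.
SELECT event_label, COUNT(*) AS clicks
FROM analytics.snowplow_structured_events_all
WHERE event_category = 'projects:new'
  AND event_action LIKE 'click%'
GROUP BY 1
ORDER BY 2 DESC;
```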
Testing your events can be tricky. Snowplow doesn't have a proper testing interface. However, several tools can help you debug, test, and validate your events implementation:
Once you see your events, they are ready to be queried and visualized in Periscope. We collect several million events (pageviews, structured events) per month, so the whole dataset is quite slow to query. To make it easier to explore this data source, we have created several smaller, more compact tables:
- `analytics.snowplow_structured_events_all`: contains ALL structured events (but no pageviews).
- `analytics.snowplow_page_views_30`: contains the last 30 days of pageviews.
- `analytics.snowplow_unstructured_events_30`: contains the last 30 days of unstructured events (especially click events and `submit_form`).
PRO TIP: Optimizing queries
To make your query faster, use a date filter in your `WHERE` clause:

```sql
SELECT event_action, COUNT(*)
FROM analytics.snowplow_structured_events_all
WHERE derived_tstamp > '2020-01-01'
GROUP BY 1
ORDER BY 2 DESC
```