We have a variety of methods for identifying and recruiting participants for research studies, including Respondent.io and recruiting via social media. However, there are situations where these methods aren't suitable and you need to be able to find users based on a very specific criteria backed by objective usage data rather than self-reported or inferred data. The best way to do this is to take advantage of the usage and demographic data we have in our data warehouse.
A non-exhaustive list of examples of when finding users via the data warehouse would be appropriate:
Our data warehouse contains data for a variety of different areas. For our purposes, we're interested in the data detailing user information and usage of GitLab.com (SaaS).
Most of the tables housing data related to SaaS use the prefix gitlab_dotcom_*
, though not all. A good starting point would be to explore the different tables with this prefix and look at what data they contain.
The our overall users table for SaaS is gitlab_dotcom_users_xf
. It contains the most complete picture of individual users, including information such as account creation date, days active, role, and highest paid plan (if any). This should be your starting point and your source of truth. the user_id
variable found in this table is our primary identifier for GitLab.com users, and is used in other tables to identify the user with which a record is associated. For example, the gitlab_dotcom_merge_requests_xf
table houses records of MRs created on GitLab.com, and contains a column called author_id
to identify the author of a given MR. This author_id
is a GitLab.com user ID, and you'd be able to connect the author of that merge request to the record in the users table (or any other table) using that user_id
variable.
You can use this table to find GitLab.com user IDs for 100 users that created their account in the last seven days:
SELECT user_id FROM analytics.gitlab_dotcom_users_xf
WHERE account_age >= 7
LIMIT 100
We have a tables containing records for all of our "top level" objects in GitLab, among others:
If you can create it on GitLab.com, there is probably a table for it. These tables follow the same structure as other SaaS tables. For example:
gitlab_dotcom_issues_xf
gitlab_dotcom_epics_xf
gitlab_dotcom_merge_requests_xf
The records in these tables are not full fidelity. They may not include all data shown when the object is rendered in the interface. When you load an issue in the GitLab.com UI, you see comments and all sorts of metadata. Much of this data will not be found in the issues table in the warehouse, and is either not accessible or stored in separate tables to keep tables to a reasonable size. Additionally, we do not expose data in the warehouse that would otherwise be private. For example, while you can see titles and descriptions for issues that are public on GitLab.com (just as you would if you viewed them on the Web), you will not be able to see that information for non-public issues.
These tables are good for when you want to understand how many of something a user or a group has created or interacted with, or if you want to start with a certain type of entity (say, an issue that was promoted to an epic) and then see what users have interacted with such an entity. These tables do not provide event level data, so you will not necessarily be able to tell when or exactly how a user created or interacted with something, just that they did (or did not).
For example: You're researching issue weights and you want to find the user IDs for 100 users who have created an issue with a weight assigned in the last 30 days. You also need to know how many of those issues they've created over that period. You would use this query:
SELECT
author_id,
COUNT(1) as num_issues
FROM analytics.gitlab_dotcom_issues_xf
WHERE
weight IS NOT NULL
AND issue_created_at >= DATEADD(day, -30, current_timestamp)
GROUP BY 1
LIMIT 100
We don't log user IDs on specific event records, so you can't find out exactly when a certain user completed a specific action. However, we do log how many times per month and days per month a user interacts with different high level features and stages in the gitlab_dotcom_monthly_stage_active_users
table. This is good for criteria where you're looking for "light" or "heavy" users of certain features or stages, or when you want to make sure you get users with at least some minimum recent usage of a stage.
For example: You need to find user IDs for 50 users who interacted with CI pipelines at least 17 times OR on 9 different days in a certain month (or more accurately, the last 28 days of a certain month):
SELECT user_id
FROM analytics.gitlab_dotcom_monthly_stage_active_users
WHERE
event_name='ci_pipelines'
AND (event_count >= 17 OR event_day_count >= 9)
AND smau_month='2020-04-01'
LIMIT 50