At GitLab, we have a unique approach to experimentation that is built in-house by our incredible development team. We use this approach to uphold our commitment to our users and customers to protect their privacy. It also brings challenges that more commonly used third-party experimentation tools do not, so experimentation at GitLab must be approached with a high level of intentionality and forethought. The purpose of this handbook page is to set guidelines for experimentation, help avoid common errors, and define best practices.
In order to increase our velocity while maintaining our ability to learn from experiments, the GitLab Growth stage (including the Product Analysis group) is adopting a new framework for designing and analyzing experiments. This framework is adapted from the work of data scientist Danielle Nelson.
This experiment framework is derived from the previous version of our Experiment Framework, which used the gitlab_dotcom_experiment_subjects table to host the unique identifiers. The snowplow_gitlab_events_experiment_contexts_all table replaces it as the landing place for all unique identifiers. This framework was created to improve efficiency and to maintain our commitment to user privacy by relying on pseudonymized data rather than user-identifiable data.
The experiment framework currently used at GitLab is the gitlab-experiment-gem, or GLEX for short. Here at GitLab we run experiments as A/B/n tests and review the data each experiment generates. From that data, we determine the best-performing code path and either promote it as the new default code path or revert to the original code path. You can read our Experiment Guide documentation if you're curious about how we use this gem internally at GitLab. This experiment framework relies heavily on front-end events or events that are created by our data collector, Snowplow.
When we discuss the behavior of this gem, we'll use terms like experiment, context, control, candidate, and variant. It's worth defining these terms so they're better understood. These are the universal terms used across the company.
The following terms are specific to Snowplow event structures and are important for this documentation.
This is the JSON column that contains any useful identifiers used to tie experiments to a specific namespace or project. These identifiers are used to measure experiment impact on KPIs and are required for the analysis of every experiment. Prior versions of experiment analysis required parsing this column manually in SQL, but the current experiment_contexts_all table already parses this data for ease of use.
Context_keys are used in experiment analysis to determine the unique users/projects/namespaces that engage in the various features and/or stages as defined by the experiment analysis. Context_keys can be “sticky” to various identifiers (either user, namespace, project or some combination of user per project or user per namespace) depending on what is needed for the experiment.
Variants can be defined as either control or candidate if running a traditional A/B test. If a multivariate experiment is being run, each variant must have its own unique identifier, and a control is still needed to set a baseline for the experiment.
For a GLEX experiment to be analyzed successfully, its events need to contain certain data. The table below outlines the columns that will be analyzed and the use of each.
|Database Column Name||Definition|
|event_id||A unique identifier per event that is fired - can be multiple per context key / user as this data is recorded each time an event is fired.|
|experiment_name||The unique title of the experiment being launched. Normally written in
|experiment_variant||An identifier that differentiates the unique experiences that are being launched to the population.
If experiment is A/B:
If experiment is multivariate:
- Variant 1
- Variant 2, Variant 3 etc until all variants are identified
|context_key||An identifier that is used to differentiate the unique entities in an experiment. These can be ‘sticky’ or defined at different levels. More documentation on context_keys is outlined above.
1. If you are looking for events per user, assign the context_key per user
2. If you are looking for events per namespace, assign the context_key per namespace
|event_action||An identifier that describes the action type for events
|event_label||An identifier used alongside the
- If you have a
|event_category||An identifier used for events to specify where in the product the event is being fired.
|event_property||An identifier for events that is used for further identification of an event if the event_category and event_label do not provide enough identification for the event.
|gsc_namespace_id||A unique identifier for namespaces as currently used in
|gsc_project_id||A unique identifier for projects as currently used in
|gsc_pseudonymized_user_id||A pseudonymized identifier that is unique to each user.
Please note that in accordance with our commitment to user privacy, this data cannot be joined with other tables to identify a specific user. This column is only present in the Snowplow tables.
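As an illustration of how these columns might be used together, the following minimal Python sketch models one event row and counts distinct context_key values per variant for a given event_action. The ExperimentEvent class, the unique_context_keys_by_variant helper, and the sample values are assumptions made for this example; they are not part of GLEX or the Snowplow tables.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExperimentEvent:
    """One event row, using the column names from the table above (hypothetical model)."""
    event_id: str
    experiment_name: str
    experiment_variant: str
    context_key: str
    event_action: str
    event_category: Optional[str] = None
    event_label: Optional[str] = None
    gsc_namespace_id: Optional[str] = None
    gsc_project_id: Optional[str] = None
    gsc_pseudonymized_user_id: Optional[str] = None


def unique_context_keys_by_variant(events, experiment_name, event_action):
    """Count distinct context_key values per variant for one experiment and action."""
    keys = defaultdict(set)
    for event in events:
        if event.experiment_name == experiment_name and event.event_action == event_action:
            keys[event.experiment_variant].add(event.context_key)
    return {variant: len(variant_keys) for variant, variant_keys in keys.items()}


# Made-up sample rows: key_a fires the same event twice but is counted once.
sample_events = [
    ExperimentEvent("e1", "example_experiment", "control", "key_a", "assignment"),
    ExperimentEvent("e2", "example_experiment", "control", "key_a", "assignment"),
    ExperimentEvent("e3", "example_experiment", "control", "key_b", "assignment"),
    ExperimentEvent("e4", "example_experiment", "candidate", "key_c", "assignment"),
    ExperimentEvent("e5", "example_experiment", "candidate", "key_c", "submit_form"),
]
counts = unique_context_keys_by_variant(sample_events, "example_experiment", "assignment")
print(counts)
```

Counting distinct context_key values rather than event_id values matters because, as noted in the table, the same context key can fire an event many times.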
Three main roles need to be involved in the creation, launch, and analysis of an experiment. The table below outlines what each role contributes to the successful implementation of a GitLab experiment.
| ||Product Manager||Engineer||Product Analyst|
| ||Determine experiment type; primary, secondary, and guardrail metric identification||Event definition review||Experiment type, metrics, and event definition review|
|Implementation|| ||Implementing events and event tracking into the product through manually coding events into the product|| |
|QA||PM/Dev QA the experiment variants||Rollout on staging||Data checks to ensure data collection in staging|
|Experiment in Production||PM confirms experiment variants||Data QA||Data QA, Dashboard Creation|
|Post-Experiment||Ending of experiment and post-experiment data collection||Resolution and cleanup||Experiment Analysis|
A hypothesis is a proposed explanation or solution for an observation. In experimentation, a hypothesis is a prediction that you make before running an experiment, and it helps answer the question "What are we hoping to learn from this experiment?"
A complete hypothesis has three parts. Its goal is to define the change you are trying to implement and the effect that change will have. You can use this simple, three-part formula to write out a hypothesis:
"If __, then __, because __."
The If portion of the formula pertains to a variable that can be modified, added, or taken away to produce a desired result or outcome. For example, "If we add an additional CTA" or "If we remove this page."
The then portion of the formula pertains to the result you expect from changing the variable above. For example, "If we add an additional CTA, then we will see more views to the subsequent page."
The last part of the formula, the because portion, refers to the rationale behind your expected result. This is the research, anecdotes, or observations that explain why you expect the change. For example, "If we add an additional CTA, then we will see more views to the subsequent page because we will be directing more users directly to this page."
Start with defining the metrics that you will use to determine success in the experiment. The different types of metrics that are defined are outlined in more detail below, but here are some additional components that should be considered when going through the thought process of outlining an experiment:
- The reasoning behind selecting the target metrics over conversion metrics
  - Ex: low traffic/volume would bring our velocity to a screeching halt if we used conversion metrics; we would have to run the experiment for 9 months to reach significance
  - Ex: the experiment is intended to drive traffic to a page, not necessarily influence conversion
- The assumptions we are making about the collision (or lack thereof) of concurrent experiments (if there are two or more experiments targeting the same population of users)
  - Ex: we assume that a concurrent experiment will not have a material impact on this experiment's target metric
- The risks we are assuming by proceeding with the given metrics, experiment sequencing, and experiment design
  - Ex: higher clicks might not lead to higher conversion; it's possible that we have a negative impact on down-funnel metrics
- Why we have the appetite for those risks
  - Ex: we want to be able to keep the business moving and continue iterating on experiments
- Whether we will do a longer-term follow-up to try to look at conversion metrics (a longer-term follow-up measurement may not even be possible due to experiments colliding)
  - Ex: we will follow up in 2 months to see what conversions look like for the control and candidate
  - Ex: we will not be able to do a follow-up to see conversion because of the following experiments colliding
Here are some questions to help guide the selection of metrics for an experiment. We recommend defining one target KPI, which we would use to declare a winner in this experiment, plus any secondary KPIs that you’d like to be included in the analysis:
NOTE: Product Analysis is available to help with the definition of events and metrics for experiments for better alignment before proceeding with the analysis
There are three different kinds of metrics that can be defined for an experiment:
1. Primary Metrics
These are the main KPIs or metrics that you expect to be impacted by the experiment. Usually these metrics are used to define the success or failure of the experiment using statistical significance. Be sure to consult the Product Analysis team to estimate how long it will take these metrics to reach statistical significance.
2. Secondary Metrics
These are any metrics that you expect could be impacted by the experiment but are not going to be the main metrics that we will use to declare success or failure for an experiment.
3. Leading Indicators
Leading indicators give a directional read on the performance of an experiment, based on the volume of front-end events compared between variants.
Identify what kind of experiment you’d like to conduct from the experiments we’ve described below. We’ve also included a questionnaire that will help you decide what desired result you’d like from the experiment.
Our experimentation design and analysis framework leverages two different "types" of experiments: True Randomized Control Trials (True RCTs) and Pseudo-Randomized Control Trials (Pseudo-RCTs). The two types differ in terms of statistical rigor (including p-value interpretation), which in turn impacts required sample size and experiment duration.
True RCTs are optimized for statistical certainty and pseudo-RCTs are optimized for experiment velocity.
True RCTs are the most statistically rigorous experiments which, if designed and run properly, result in causal inference. In other words, we can actually say that the experiment caused a change in a metric. True RCTs are classic "A/B/n Tests". Unfortunately, these types of experiments and the certainty of their learnings come at a price: they tend to require a large sample size and a long experiment duration. If the effect size (minimum change in the metric that would be relevant to detect) is small (ex: you want to detect a 1% change), the metric is less prevalent (ex: low conversion rate), or the variance in the metric is large (i.e., a “noisy” metric), you need a much larger sample size. In addition, extra care must be taken to ensure that experiments are not colliding.
True RCTs will be developed and evaluated with the industry standard statistical significance level of p <= 0.05 and a power of 0.8.
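To give a feel for how these settings drive sample size, here is a rough Python sketch of the standard normal-approximation formula for comparing two proportions. The function name, the choice of a two-sided test, and the example rates are assumptions for illustration only; they do not describe how Product Analysis actually sizes experiments.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(control_rate, candidate_rate, alpha=0.05, power=0.8):
    """Approximate subjects needed per variant to detect the given rate difference.

    Uses the two-sided normal approximation:
    n = (z_(1 - alpha/2) + z_power)^2 * (p1(1 - p1) + p2(1 - p2)) / (p2 - p1)^2
    """
    if control_rate == candidate_rate:
        raise ValueError("control_rate and candidate_rate must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    variance = control_rate * (1 - control_rate) + candidate_rate * (1 - candidate_rate)
    effect = candidate_rate - control_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical rates: a 10% baseline, detecting a move to 11% versus a move to 12%.
small_effect = sample_size_per_variant(0.10, 0.11)
large_effect = sample_size_per_variant(0.10, 0.12)
print(small_effect, large_effect)
```

Halving the detectable difference roughly quadruples the required sample, which is why small effect sizes and low-prevalence metrics lead to long experiment durations.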
Pseudo-RCTs are less statistically rigorous than true RCTs and lead to directional learnings. In other words, we cannot say that the experiment caused a change in a metric, but we can say it directionally impacted a metric. Pseudo-RCTs carry the spirit of true RCTs without requiring a larger sample size and duration. As such, they carry a higher risk that the results were due to random chance instead of the experiment.
Since pseudo-RCTs are less strict, we evaluate them based on a looser p-value interpretation and we use different language to understand and communicate the results. The language used for these measurements needs to be very intentional so as to not overstate our confidence. This means that we should not communicate a percent change (ex: 10% increase) because our level of certainty and statistical significance could be misinterpreted. In addition, including the p-value (or noting the confidence level) helps to avoid misinterpretation.
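The reporting guidance above can be sketched in Python as follows. This example computes a two-sided p-value with a pooled two-proportion z-test and then phrases the result directionally, with the p-value included and no percent change. The z-test, the function names, and the threshold value are assumptions for illustration; this page does not specify which test or which looser threshold is used.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(control_successes, control_total, candidate_successes, candidate_total):
    """Two-sided p-value from a pooled two-proportion z-test (normal approximation)."""
    control_rate = control_successes / control_total
    candidate_rate = candidate_successes / candidate_total
    pooled = (control_successes + candidate_successes) / (control_total + candidate_total)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / control_total + 1 / candidate_total))
    if standard_error == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (candidate_rate - control_rate) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def directional_summary(p_value, control_rate, candidate_rate, threshold):
    """Describe the result directionally, always stating the p-value and never a percent change."""
    if p_value > threshold or candidate_rate == control_rate:
        return f"No directional signal detected (p = {p_value:.3f})"
    direction = "higher" if candidate_rate > control_rate else "lower"
    return f"Candidate trended {direction} than control (p = {p_value:.3f})"


# Made-up counts; the 0.2 threshold is a hypothetical placeholder, not a GitLab standard.
p_value = two_proportion_p_value(100, 1000, 130, 1000)
summary = directional_summary(p_value, 100 / 1000, 130 / 1000, threshold=0.2)
print(summary)
```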
It is not always straightforward or easy to select which type of experiment to run. Here are a few questions to help guide that decision:
We will continue to build out a guide on how to select which type of experiment to run.
Create an issue for the Product Analysis team using the “Experiment Analysis Request” template.
Once the experiment has data in staging (before being launched into production) be sure to let the Product Analysis team know so they can check if the data is coming through. You can also use the Experiment Data Validation dashboard to check your data.
|assignment||1||BE||auto||Any time the experiment is evaluated. Use unique keys to get experiences, or review as a total count. Group by unique keys to see changes over time or subsequent evaluations. Experiment is sticky to the user.|
|focus_form||2||FE||link||Standard event with the focus_form context.|
|change_form||3||FE||link||Standard event with the change_form context.|
|submit_form||4||FE||link||Standard event with the submit_form context.|
|create_group||5||BE||link||When a group is created in subsequent onboarding steps|
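One way to make the staging data check concrete is to confirm that every planned event has been observed for every variant. The Python sketch below assumes that the first column of the table above is the event name and that observed data can be reduced to (variant, event name) pairs; the missing_events helper and the sample data are hypothetical.

```python
# Event names taken from the first column of the table above (assumed to be the event name).
EXPECTED_EVENTS = ["assignment", "focus_form", "change_form", "submit_form", "create_group"]


def missing_events(observed_pairs, variants, expected_events=EXPECTED_EVENTS):
    """Return, for each variant, the expected events that have not been observed yet."""
    seen = set(observed_pairs)
    return {
        variant: [event for event in expected_events if (variant, event) not in seen]
        for variant in variants
    }


# Made-up staging observations: the candidate variant has only fired two events so far.
staging_pairs = [
    ("control", "assignment"),
    ("control", "focus_form"),
    ("control", "change_form"),
    ("control", "submit_form"),
    ("control", "create_group"),
    ("candidate", "assignment"),
    ("candidate", "focus_form"),
]
gaps = missing_events(staging_pairs, ["control", "candidate"])
print(gaps)
```

An empty list for every variant suggests that tracking is wired up end to end; any remaining names are events to raise with the engineer before launching to production.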
Link any tracking and/or related issues to the main experiment issue assigned to Product Analysis.
Let all stakeholders know when the experiment is available on staging or production and, if it is rolling out in phases or to a percentage of the user population, what percentage it’s currently set at.
Once production data has begun to flow in, be sure to swap your data source to reference production data and NOT staging data. Keep an eye on your metrics as they bake to their full sample size, and call out any discrepancies or unexpected behavior to your Product Manager.
There are a few standard filters that need to be applied across all experiment analysis to ensure representative results. Be sure that these filters are applied to your analysis if you are creating your own dashboard outside of the Standard Experiment Dashboard:
The growth_data_namespaces snippet is a good way to bring in data that is already filtered.
Be sure to check the sample size of the variants against the total population of GitLab users to identify the right sample size needed for results to reach the agreed upon level of significance/confidence.
Share any relevant insights with the Product Manager and discuss any post-experiment analysis that needs to be done.
In this section, we review some of the most common errors made by the parties involved in the creation, deployment, and analysis of an experiment. Watch for these errors, as some of them can affect the overall results of the experiment.
You can find additional experimentation resources throughout the handbook and GitLab docs. Here are a few pages to check out:
Here are some useful terms used in the context of experimentation. In addition to the definitions below, Khan Academy provides excellent videos explaining these terms and concepts.