Data Science Handbook

GitLab Data Science Team Handbook

PURPOSE: This page is focused on the operations of GitLab’s internal Data Science Team. For information about GitLab’s Product Data Science Capabilities, please visit GitLab ModelOps {: .alert .alert-success}

Last Updated At: 2023-12-22

The Internal Data Science Team at GitLab

The mission of the Data Science Team is to enable better decisions, faster, through predictive analytics.

Handbook First

At GitLab we are Handbook First and promote this concept by ensuring the data science team page remains updated with the most accurate information regarding data science objectives, processes, and projects. We also strive to keep the handbook updated with useful resources and our data science toolset.

Learning About Data Science

Check out this brief overview of what data science is at GitLab:

(Corresponding slides)


Want to Learn More? Become a Data Science Champion, visit Slack #bt-data-science, watch a Data Team video. We want to hear from you! {: .alert .alert-success}

Common Data Science Terms

  • Accuracy - the proportion of predictions a model gets correct out of all predictions made
  • Algorithm - a sequence of computer-implementable instructions used to solve a specific problem
  • Classification - the process of predicting a category for each observation. For example, determining if a picture is of a cat or a dog
  • Clustering - the process of finding natural groupings of observations in a dataset. Often used for segmentation of users or customers
  • Data Science (DS) - an interdisciplinary field that uses computer science, statistical techniques, and domain expertise to extract insights from data
  • Exploratory Data Analysis (EDA) - analysis of the data that summarizes its main characteristics (includes statistics and data visualization)
  • Feature - a single column in a dataset that can be used for analysis, such as country or age. Also referred to as a variable or attribute
  • Feature Engineering - the process of selecting, combining, and transforming data into features that can be used by machine learning algorithms
  • Imputation - the process of replacing missing or incorrect data with statistical "best guesses" of the actual values
  • Machine Learning (ML) - the use and development of algorithms that determine patterns in data without being explicitly programmed
  • Model - a complex set of mathematical formulas that generates predictions
  • Propensity Modeling - building models to predict specific events by analyzing past behaviors of a target audience
  • Regression - a statistical method for predicting an outcome. For example, predicting a person's income, or how likely a customer is to churn
  • Scoring - the process of generating predictions for a new dataset
  • Training - the process of applying an algorithm to data to create a model
  • Test Dataset - observations deliberately excluded from model training so they can be used to verify how well the model predicts on unseen data
  • Weight - a numerical value assigned to a feature that determines its strength in the model
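Several of these terms fit together in a minimal sketch (pure-Python and illustrative only; the data and the trivial threshold "model" are hypothetical, and real work would use a library such as scikit-learn):

```python
# Toy example tying together training, test dataset, scoring, and accuracy.
# Each observation is (feature_value, label); data is hypothetical.
observations = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1), (7, 1), (8, 0)]

# Test dataset: deliberately hold some observations out of training.
train, test = observations[:6], observations[6:]

# Training: pick the threshold that best separates labels in the training data.
def fit_threshold(data):
    candidates = sorted({x for x, _ in data})
    return max(candidates, key=lambda t: sum((x >= t) == bool(y) for x, y in data))

threshold = fit_threshold(train)

# Scoring: generate predictions for the held-out (new) data.
predictions = [x >= threshold for x, _ in test]

# Accuracy: proportion of predictions that match the actual labels.
accuracy = sum(p == bool(y) for p, (_, y) in zip(predictions, test)) / len(test)
```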

Data Science Responsibilities

Of the Data Team’s Responsibilities, the Data Science Team is directly responsible for:

  • Delivering descriptive, predictive, and prescriptive solutions that promote and improve GitLab’s KPIs
  • Being a Center of Excellence for predictive analytics and supporting other teams in their data science endeavors
  • Developing tooling, processes, and best practices for data science and machine learning

Additionally, the Data Science Team supports the following responsibilities:

  • With Data Leadership:
    • Scoping and executing a data science strategy that directly impacts business KPIs
    • Broadcasting regular updates about deliverables, ongoing initiatives, and roadmap
  • With the Data Platform Team:
    • Defining and championing data quality best practices and programs for GitLab data systems
    • Deploying data science models, ensuring data quality and integrity, shaping datasets to be compatible with machine learning, and bringing new datasets online
    • Creating data science pipelines that work natively with the GitLab platform and the Data Team tech stack
  • With the Data Analytics Team:
    • Incorporating data science into analytics initiatives
    • Designing dashboards to enhance the value and impact of the data science models

How We Work

As a Center of Excellence, the data science team is focused on working collaboratively with other teams in the organization. This means our stakeholders and executive sponsors are usually in other parts of the business (e.g. Sales, Marketing). Working closely with these other teams, we craft a project plan that aligns to their business needs, objectives, and priorities. This usually involves working closely with functional analysts within those teams to understand the data, the insights from prior analyses, and implementation hurdles.

The Data Science flywheel is focused on improving business efficiency and KPIs by creating accurate and reliable predictions. This is done in collaboration with Functional Analytics Center of Excellence to ensure the most relevant data sources are utilized, business objectives are met, and results can be quantifiably measured. As business needs change, and as the user-base grows, this flywheel approach will allow the data science team to quickly adapt, iterate, and improve machine learning models.

```mermaid
graph BT;
   id1(Faster, More Accurate Predictions)-->id2(Increased Business Understanding) & id5(Continuous Feedback)
   id2-->id3(More Revenue & Users)
   id5-->id1
   id3-->id4(More Data)
   id4-->id1
```

How to request a Data Science project

To request a new Data Science project, please fill out the Opportunity Canvas. In the description, choose the [New Request] Create Opportunity Canvas template. The Problem Statement and Stakeholders sections should be completed. You can tag a data science team member with whom you discussed the project, or share an issue in the #bt-data-science Slack channel. During the quarterly planning process, requests will be reviewed and prioritized accordingly by the Data Leadership Forum.

Work Streams

| Work Stream | Internally Known As | Maturity | Objective | Last Update | Next Update |
|---|---|---|---|---|---|
| Revenue Expansion | Propensity to Expand (PtE) | Optimized | Determine which paid accounts are likely to increase in ARR via seat expansion or up-tier to Ultimate | FY23-Q4 | FY24-Q2 |
| Loss Prevention | Propensity to Contract (PtC) | Optimized | Determine which paid accounts are likely to decrease in ARR via seat contraction or down-tier to Premium | FY24-Q3 | FY24-Q4 |
| Conversion | Propensity to Purchase (PtP) | Viable | Identify which non-paid users (free and trial accounts) are likely to become paid accounts | FY24-Q1 | FY24-Q2 |
| Product Research | Namespace Segmentation | Optimized | Define groups for paid and free SaaS namespaces based on their product usage | FY23-Q3 | TBD |
| Lead Funnel Generation | Prospect/Lead Scoring | Planned | Identify leads and prospects most likely to convert to closed-won opportunities | - | FY24-Q2 |
| MLOps | GitLab MLOps Product Development | In Progress (Ongoing) | Dogfood new MLOps product features and enhancements | FY24-Q1 | FY24-Q2 |
| Backlog | Adoption Index | Planned | Define a way to measure adoption and customer journey | - | TBD |
| Backlog | Product Usage Event | Planned | - | - | TBD |
| Backlog | Golden Journey | Planned | Identify optimal paths to increasing platform usage and adoption | - | TBD |
| Backlog | Stage Adoption MRI | Planned | - | - | TBD |
| Backlog | Community Sentiment Analysis | Unplanned | - | - | TBD |
| Backlog | Feature $ARR Uplift Prediction | Unplanned | Attribute incremental ARR lift based on feature adoption | - | TBD |

Maturity

Maturity of data science projects is similar to the GitLab product maturity model:

  • Unplanned: Not implemented, and not yet on our roadmap.
  • Planned: Not implemented, but on our roadmap; executive sponsor attached to project.
  • In Progress: Plan established, developing model.
  • Viable: Available, but not yet fully productionized; scores and insights manually generated; low adoption outside of immediate stakeholders.
  • Complete: Fully implemented into Data Team cloud production infrastructure; increasing adoption of corresponding dashboards and scores within the intended organization.
  • Optimized: Fine-tuned, fully automated, and self-service; continuous model monitoring and scoring; high adoption within intended organization.

Revenue Expansion

  • Organizational Sponsor: Sales
  • Use Cases: Issue
  • Plans for next iteration: Predicted ARR amount
  • Slack Channel (internal only): #data-propensity-projects
  • Repositories (internal only): Propensity to Expand
  • Read-outs (internal only):
  • Dashboards (internal only):
  • Data sources: Product usage: SaaS & Self-Managed - paid tiers; Product stage usage: SaaS & Self-Managed - paid tiers; Salesforce (account, opportunities, events, tasks); Zuora (billing); Bizible (marketing); Firmographics; ZenDesk (help tickets); prior expansion type (product change, seat licenses), amount, and time lapse; account health scores
    • Future sources: Buyer personas attached to opportunities

Loss Prevention

Conversion

Product Research

  • Organizational Sponsor: Growth & Product Insights
  • Use Cases: Issue
  • Plans for next iteration: Self-managed segmentation
  • Slack Channel (internal only): #namespace-segmentation
  • Repositories (internal only): Namespace Segmentation
  • Read-outs (internal only):
  • Data sources: Product usage: SaaS & Self Managed - free and paid tiers; Product stage usage: SaaS & Self Managed - free and paid tiers; Salesforce (account); Zuora (billing); Bizible (marketing)
    • Future sources: # of consecutive days of product/stage usage

Project Structure

The Data Science Team follows the Cross-Industry Standard Process for Data Mining (CRISP-DM), which consists of six iterative phases:

  1. Business Understanding

    • Includes requirements gathering, stakeholder interviews, project definition, product user stories, and potential use cases in order to establish the success criteria of the project.
  2. Data Understanding

    • Requires determining the breadth and scope of existing relevant data sources. Data scientists work closely with data engineers and data analysts to determine where gaps may exist and to identify any data discrepancies or risks.
  3. Data Preparation

    • Requires conducting data quality checks and exploratory data analysis (EDA) to develop a greater understanding of data and how different datapoints relate to solving the business need.
  4. Modeling

    • Machine learning techniques are used to find a solution that addresses the business need. This often takes the form of predicting why/when/how future instances of a business outcome will occur.
  5. Evaluation

    • Performance is generally measured by how accurate, powerful, and explainable the model is. Findings are presented to the stakeholders for feedback.
  6. Deployment

    • Once the model has been approved it then gets deployed into the data science production pipeline. This process automatically updates, generates predictions, and monitors the model on a regular cadence.
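As a concrete illustration of the Data Preparation phase, missing values are often imputed before modeling. A minimal sketch in stdlib Python (the column and its values are hypothetical):

```python
from statistics import median

# Hypothetical feature column with missing values represented as None.
seats_used = [10, None, 25, 40, None, 15]

# Imputation: replace missing values with a statistical "best guess" --
# here, the median of the observed values.
observed = [v for v in seats_used if v is not None]
fill = median(observed)  # median of [10, 25, 40, 15] -> 20.0
imputed = [fill if v is None else v for v in seats_used]
```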

The GitLab approach

The Data Science Team approach to model development is centered around GitLab's value of iteration and the CRISP-DM standard. Our process expands on some of the six phases outlined in CRISP-DM in order to best address the needs of our specific business objectives and data infrastructure.

Data Science Platform

Our current platform consists of:

  • the Enterprise Data Warehouse for storing raw and normalized source data as well as final model output for consumption by downstream consumers
  • JupyterLab for model training, tuning, and selection
  • GitLab for collaboration, project versioning, source code management, experiment tracking, and CI/CD
  • GitLab CI for automation and orchestration
  • Monte Carlo for drift detection
  • Tableau Server for model monitoring and on-going performance evaluation
  • Feast as an open-source Feature Store for machine learning models

Feast: Feature Store Implementation

We are using Feast as an open-source Feature Store for our machine learning models. Configuration can be found in the Feast project repository; the feature store is updated via GitLab CI/CD, and the web UI is published on a VM in GCP.

You can find more details on this implementation on the Feast - Feature Store Implementation Internal handbook section.

CI/CD Pipelines for Data Science

We are in the process of fully moving over the training and scoring of our machine learning models to the native GitLab CI/CD capabilities. Please see Getting Started With CI/CD for Data Science Pipelines for the most up-to-date information and instructions.
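As an illustration only (the stages, job names, scripts, and image are hypothetical; see the linked guide for the actual setup), a train-and-score data science pipeline in GitLab CI might be sketched as:

```yaml
# Hypothetical .gitlab-ci.yml sketch -- not the team's actual pipeline.
stages:
  - train
  - score

train_model:
  stage: train
  image: python:3.11               # assumption: Python image with project deps
  script:
    - pip install -r requirements.txt
    - python train.py              # hypothetical training script
  artifacts:
    paths:
      - model.pkl                  # hand the trained model to the score stage

score_model:
  stage: score
  image: python:3.11
  script:
    - pip install -r requirements.txt
    - python score.py --model model.pkl   # hypothetical scoring script
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"   # batch score on a pipeline schedule
```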

Current State Data Flows

```mermaid
graph
    A[Enterprise Data Warehouse: Raw and Normalized Data Sources]
    B[JupyterLab & GitLab CI/CD: Model Training, Tuning, and Selection]
    C(GitLab CI/CD & Pipeline Schedules: Batch scoring with Papermill)
    F[Enterprise Data Warehouse: Model Output for Consumption]
    D[Salesforce/Marketo: CRM Use Cases]
    E[Tableau/Monte Carlo: Model Monitoring and Reporting]
    G[GitLab: Source Code Management]
    H[Experiment tracking]
    A --> |ODBC| B
    B --> H
    H --> B
    B --> G
    G --> B
    G --> C
    C --> |JSON| F
    F --> |CSV| D
    F --> |ODBC| E
```

Data Science Tools at GitLab

  • Pre-configured JupyterLab Image: The data science team uses JupyterLab pre-configured with common python modules (pandas, numpy, etc.), native Snowflake connectivity, and git support. Working from a common framework allows us to create models and derive insights faster. This setup is freely available for anyone to use. Check out our Jupyter Guide for additional information.
  • GitLab Data Science Tools for Python: Functions to help automate common data prep (dummy coding, outlier detection, variable reduction, etc.) and modeling tasks (i.e. evaluating model performance). Install directly via pypi (pip install gitlabds), or use as part of the above JupyterLab image.
  • Modeling Templates: The data science team has created modeling templates to allow you to easily start building predictive models without writing python code from scratch. To enable these templates, follow the instructions on the Jupyter Guide.
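To illustrate the kind of data-prep task gitlabds helps automate, here is a minimal stdlib-Python sketch of IQR-based outlier detection (illustrative values only; this is not the gitlabds API):

```python
from statistics import quantiles

# Hypothetical feature values; 95 is an obvious outlier.
values = [12, 14, 15, 15, 16, 18, 19, 95]

# Interquartile range (IQR) rule: flag values beyond 1.5 * IQR of the quartiles.
q1, _, q3 = quantiles(values, n=4)   # first and third quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < low or v > high]
```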

Useful Data Science & Machine Learning Resources

  • Python Data Science Handbook by Jake VanderPlas: Great for beginners looking to learn python and dip their toes into data science.
  • Python Machine Learning by Sebastian Raschka & Vahid Mirjalili: More advanced topics with the assumption of a basic level of python.
  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, & Jerome Friedman: Great deep dive into all the statistics and logic behind many of the commonly used predictive techniques. Can be pretty stats/math heavy at times.

Data Science Project Development Approach
GitLab Data Science Team Approach to Model Development
Last modified April 15, 2024: Update infra information (7d638bfd)