Data Science Handbook

GitLab Data Science Team Handbook

PURPOSE: This page is focused on the operations of GitLab’s internal Data Science Team. For information about GitLab’s Product Data Science Capabilities, please visit GitLab ModelOps {: .alert .alert-success}

Last Updated At: 2023-12-22

The Internal Data Science Team at GitLab

The mission of the Data Science Team is to enable better decisions, faster, through predictive analytics.

Handbook First

At GitLab we are Handbook First and promote this concept by ensuring the data science team page remains updated with the most accurate information regarding data science objectives, processes, and projects. We also strive to keep the handbook updated with useful resources and our data science toolset.

Learning About Data Science

Check out this brief overview of what data science is at GitLab:

(Corresponding slides)


Want to Learn More? Become a Data Science Champion, visit Slack #bt-data-science, watch a Data Team video. We want to hear from you! {: .alert .alert-success}

Common Data Science Terms

  • Accuracy - the proportion of predictions a model gets correct out of all predictions made
  • Algorithm - a sequence of computer-implementable instructions used to solve a specific problem
  • Classification - the process of predicting a category for each observation. For example, determining if a picture is of a cat or a dog
  • Clustering - the process of finding natural groupings of observations in a dataset. Often used for segmentation of users or customers
  • Data Science (DS) - an interdisciplinary field that uses computer science, statistical techniques, and domain expertise to extract insights from data
  • Exploratory Data Analysis (EDA) - analysis of the data that summarizes its main characteristics (includes statistics and data visualization)
  • Feature - a single column in a dataset that can be used for analysis, such as country or age. Also referred to as a variable or attribute
  • Feature Engineering - the process of selecting, combining, and transforming data into features that can be used by machine learning algorithms
  • Imputation - the process of replacing missing or incorrect data with statistical "best guesses" of the actual values
  • Machine Learning (ML) - the use and development of algorithms that determine patterns in data without being explicitly programmed
  • Model - a complex set of mathematical formulas that generates predictions
  • Propensity Modeling - building models to predict specific events by analyzing past behaviors of a target audience
  • Regression - a statistical method for predicting an outcome. For example, predicting a person's income, or how likely a customer is to churn
  • Scoring - the process of generating predictions for a new dataset
  • Training - the process of applying an algorithm to data to create a model
  • Test Dataset - observations deliberately excluded from model training so they can be used to verify how well the model predicts on unseen data
  • Weight - a numerical value assigned to a feature that determines its strength in the model
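Several of these terms fit together in a minimal sketch (pure-Python and illustrative only; the data and the trivial threshold "model" are hypothetical, and real work would use a library such as scikit-learn):

```python
# Toy example tying together training, test dataset, scoring, and accuracy.
# Each observation is (feature_value, label); data is hypothetical.
observations = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1), (7, 1), (8, 0)]

# Test dataset: deliberately hold some observations out of training.
train, test = observations[:6], observations[6:]

# Training: pick the threshold that best separates labels in the training data.
def fit_threshold(data):
    candidates = sorted({x for x, _ in data})
    return max(candidates, key=lambda t: sum((x >= t) == bool(y) for x, y in data))

threshold = fit_threshold(train)

# Scoring: generate predictions for the held-out (new) data.
predictions = [x >= threshold for x, _ in test]

# Accuracy: proportion of predictions that match the actual labels.
accuracy = sum(p == bool(y) for p, (_, y) in zip(predictions, test)) / len(test)
```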

Data Science Responsibilities

Of the Data Team’s Responsibilities, the Data Science Team is directly responsible for:

  • Delivering descriptive, predictive, and prescriptive solutions that promote and improve GitLab’s KPIs
  • Being a Center of Excellence for predictive analytics and supporting other teams in their data science endeavors
  • Developing tooling, processes, and best practices for data science and machine learning

Additionally, the Data Science Team supports the following responsibilities:

  • With Data Leadership:
    • Scoping and executing a data science strategy that directly impacts business KPIs
    • Broadcasting regular updates about deliverables, ongoing initiatives, and roadmap
  • With the Data Platform Team:
    • Defining and championing data quality best practices and programs for GitLab data systems
    • Deploying data science models, ensuring data quality and integrity, shaping datasets to be compatible with machine learning, and bringing new datasets online
    • Creating data science pipelines that work natively with the GitLab platform and the Data Team tech stack
  • With the Data Analytics Team:
    • Incorporating data science into analytics initiatives
    • Designing dashboards to enhance the value and impact of the data science models

How We Work

As a Center of Excellence, the data science team is focused on working collaboratively with other teams in the organization. This means our stakeholders and executive sponsors are usually in other parts of the business (e.g. Sales, Marketing). Working closely with these other teams, we craft a project plan that aligns to their business needs, objectives, and priorities. This usually involves working closely with functional analysts within those teams to understand the data, the insights from prior analyses, and implementation hurdles.

The Data Science flywheel is focused on improving business efficiency and KPIs by creating accurate and reliable predictions. This is done in collaboration with Functional Analytics Center of Excellence to ensure the most relevant data sources are utilized, business objectives are met, and results can be quantifiably measured. As business needs change, and as the user-base grows, this flywheel approach will allow the data science team to quickly adapt, iterate, and improve machine learning models.

```mermaid
graph BT;
   id1(Faster, More Accurate Predictions)-->id2(Increased Business Understanding) & id5(Continuous Feedback)
   id2-->id3(More Revenue & Users)
   id5-->id1
   id3-->id4(More Data)
   id4-->id1
```

How to request a Data Science project

To request a new Data Science project, please fill out the Opportunity Canvas. In the description, choose the [New Request] Create Opportunity Canvas template. The Problem Statement and Stakeholders sections should be completed. You can tag a data science team member with whom you discussed the project, or share an issue in the #bt-data-science Slack channel. During the quarterly planning process, requests will be reviewed and prioritized accordingly by the Data Leadership Forum.

Work Streams

| Work Stream | Internally Known As | Maturity | Objective | Last Update | Next Update |
|---|---|---|---|---|---|
| Revenue Expansion | Propensity to Expand (PtE) | Optimized | Determine which paid accounts are likely to increase in ARR via seat expansion or up-tier to Ultimate | FY23-Q4 | FY24-Q2 |
| Loss Prevention | Propensity to Contract (PtC) | Optimized | Determine which paid accounts are likely to decrease in ARR via seat contraction or down-tier to Premium | FY24-Q3 | FY24-Q4 |
| Conversion | Propensity to Purchase (PtP) | Viable | Identify which non-paid users (free and trial accounts) are likely to become paid accounts | FY24-Q1 | FY24-Q2 |
| Product Research | Namespace Segmentation | Optimized | Define groups for paid and free SaaS namespaces based on their product usage | FY23-Q3 | TBD |
| Lead Funnel Generation | Prospect/Lead Scoring | Planned | Identify leads and prospects most likely to convert to closed-won opportunities | - | FY24-Q2 |
| MLOps | GitLab MLOps Product Development | In Progress (Ongoing) | Dogfood new MLOps product features and enhancements | FY24-Q1 | FY24-Q2 |
| Backlog | Adoption Index | Planned | Define a way to measure adoption and customer journey | - | TBD |
| Backlog | Product Usage Event | Planned | - | - | TBD |
| Backlog | Golden Journey | Planned | Identify optimal paths to increasing platform usage and adoption | - | TBD |
| Backlog | Stage Adoption MRI | Planned | - | - | TBD |
| Backlog | Community Sentiment Analysis | Unplanned | - | - | TBD |
| Backlog | Feature $ARR Uplift Prediction | Unplanned | Attribute incremental ARR lift based on feature adoption | - | TBD |

Maturity

Maturity of data science projects is similar to the GitLab product maturity model:

  • Unplanned: Not implemented, and not yet on our roadmap.
  • Planned: Not implemented, but on our roadmap; executive sponsor attached to project.
  • In Progress: Plan established, developing model.
  • Viable: Available, but not yet fully productionized; scores and insights manually generated; low adoption outside of immediate stakeholders.
  • Complete: Fully implemented into Data Team cloud production infrastructure; increasing adoption of corresponding dashboards and scores within the intended organization.
  • Optimized: Fine-tuned, fully automated, and self-service; continuous model monitoring and scoring; high adoption within intended organization.

Revenue Expansion

  • Organizational Sponsor: Sales
  • Use Cases: Issue
  • Plans for next iteration: Predicted ARR amount
  • Slack Channel (internal only): #data-propensity-projects
  • Repositories (internal only): Propensity to Expand
  • Read-outs (internal only):
  • Dashboards (internal only):
  • Data sources: Product usage: SaaS & Self-Managed - paid tiers; Product stage usage: SaaS & Self-Managed - paid tiers; Salesforce (account, opportunities, events, tasks); Zuora (billing); Bizible (marketing); Firmographics; ZenDesk (help tickets); prior expansion type (product change, seat licenses), amount, and time lapse; account health scores
    • Future sources: Buyer personas attached to opportunities

Loss Prevention

Conversion

Product Research

  • Organizational Sponsor: Growth & Product Insights
  • Use Cases: Issue
  • Plans for next iteration: Self-managed segmentation
  • Slack Channel (internal only): #namespace-segmentation
  • Repositories (internal only): Namespace Segmentation
  • Read-outs (internal only):
  • Data sources: Product usage: SaaS & Self Managed - free and paid tiers; Product stage usage: SaaS & Self Managed - free and paid tiers; Salesforce (account); Zuora (billing); Bizible (marketing)
    • Future sources: # of consecutive days of product/stage usage

Project Structure

The Data Science Team follows the Cross-Industry Standard Process for Data Mining (CRISP-DM), which consists of six iterative phases:

  1. Business Understanding

    • Includes requirements gathering, stakeholder interviews, project definition, product user stories, and potential use cases in order to establish the success criteria of the project.
  2. Data Understanding

    • Requires determining the breadth and scope of existing relevant data sources. Data scientists work closely with data engineers and data analysts to determine where gaps may exist and to identify any data discrepancies or risks.
  3. Data Preparation

    • Requires conducting data quality checks and exploratory data analysis (EDA) to develop a greater understanding of data and how different datapoints relate to solving the business need.
  4. Modeling

    • Machine learning techniques are used to find a solution that addresses the business need. This often takes the form of predicting why/when/how future instances of a business outcome will occur.
  5. Evaluation

    • Performance is generally measured by how accurate, powerful, and explainable the model is. Findings are presented to the stakeholders for feedback.
  6. Deployment

    • Once the model has been approved it then gets deployed into the data science production pipeline. This process automatically updates, generates predictions, and monitors the model on a regular cadence.
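As a concrete illustration of the Data Preparation phase, missing values are often imputed before modeling. A minimal sketch in stdlib Python (the column and its values are hypothetical):

```python
from statistics import median

# Hypothetical feature column with missing values represented as None.
seats_used = [10, None, 25, 40, None, 15]

# Imputation: replace missing values with a statistical "best guess" --
# here, the median of the observed values.
observed = [v for v in seats_used if v is not None]
fill = median(observed)  # median of [10, 25, 40, 15] -> 20.0
imputed = [fill if v is None else v for v in seats_used]
```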

The GitLab approach

The Data Science Team approach to model development is centered around GitLab's value of iteration and the CRISP-DM standard. Our process expands on some of the six phases outlined in CRISP-DM in order to best address the needs of our specific business objectives and data infrastructure.

Data Science Platform

Our current platform consists of:

  • the Enterprise Data Warehouse for storing raw and normalized source data as well as final model output for consumption by downstream consumers
  • JupyterLab for model training, tuning, and selection
  • GitLab for collaboration, project versioning, source code management, experiment tracking, and CI/CD
  • GitLab CI for automation and orchestration
  • Monte Carlo for drift detection
  • Tableau Server for model monitoring and on-going performance evaluation
  • Feast as an open-source Feature Store for machine learning models

Feast: Feature Store Implementation

We are using Feast as an open-source Feature Store for our machine learning models. Configuration can be found in the Feast project repository; the feature store is updated via GitLab CI/CD, and the web UI is published on a VM in GCP.

You can find more details on this implementation on the Feast - Feature Store Implementation Internal handbook section.

CI/CD Pipelines for Data Science

We are in the process of fully moving over the training and scoring of our machine learning models to the native GitLab CI/CD capabilities. Please see Getting Started With CI/CD for Data Science Pipelines for the most up-to-date information and instructions.
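As an illustration only (the stages, job names, scripts, and image are hypothetical; see the linked guide for the actual setup), a train-and-score data science pipeline in GitLab CI might be sketched as:

```yaml
# Hypothetical .gitlab-ci.yml sketch -- not the team's actual pipeline.
stages:
  - train
  - score

train_model:
  stage: train
  image: python:3.11               # assumption: Python image with project deps
  script:
    - pip install -r requirements.txt
    - python train.py              # hypothetical training script
  artifacts:
    paths:
      - model.pkl                  # hand the trained model to the score stage

score_model:
  stage: score
  image: python:3.11
  script:
    - pip install -r requirements.txt
    - python score.py --model model.pkl   # hypothetical scoring script
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"   # batch score on a pipeline schedule
```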

Current State Data Flows

```mermaid
graph
    A[Enterprise Data Warehouse: Raw and Normalized Data Sources]
    B[JupyterLab & GitLab CI/CD: Model Training, Tuning, and Selection]
    C(GitLab CI/CD & Pipeline Schedules: Batch scoring with Papermill)
    F[Enterprise Data Warehouse: Model Output for Consumption]
    D[Salesforce/Marketo: CRM Use Cases]
    E[Tableau/Monte Carlo: Model Monitoring and Reporting]
    G[GitLab: Source Code Management]
    H[Experiment tracking]
    A --> |ODBC| B
    B --> H
    H --> B
    B --> G
    G --> B
    G --> C
    C --> |JSON| F
    F --> |CSV| D
    F --> |ODBC| E
```

Data Science Tools at GitLab

  • Pre-configured JupyterLab Image: The data science team uses JupyterLab pre-configured with common python modules (pandas, numpy, etc.), native Snowflake connectivity, and git support. Working from a common framework allows us to create models and derive insights faster. This setup is freely available for anyone to use. Check out our Jupyter Guide for additional information.
  • GitLab Data Science Tools for Python: Functions to help automate common data prep (dummy coding, outlier detection, variable reduction, etc.) and modeling tasks (i.e. evaluating model performance). Install directly via pypi (pip install gitlabds), or use as part of the above JupyterLab image.
  • Modeling Templates: The data science team has created modeling templates to allow you to easily start building predictive models without writing python code from scratch. To enable these templates, follow the instructions on the Jupyter Guide.
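To illustrate the kind of data-prep task gitlabds helps automate, here is a minimal stdlib-Python sketch of IQR-based outlier detection (illustrative values only; this is not the gitlabds API):

```python
from statistics import quantiles

# Hypothetical feature values; 95 is an obvious outlier.
values = [12, 14, 15, 15, 16, 18, 19, 95]

# Interquartile range (IQR) rule: flag values beyond 1.5 * IQR of the quartiles.
q1, _, q3 = quantiles(values, n=4)   # first and third quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < low or v > high]
```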

Useful Data Science & Machine Learning Resources

  • Python Data Science Handbook by Jake VanderPlas: Great for beginners looking to learn python and dip their toes into data science.
  • Python Machine Learning by Sebastian Raschka & Vahid Mirjalili: More advanced topics with the assumption of a basic level of python.
  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, & Jerome Friedman: Great deep dive into all the statistics and logic behind many of the commonly used predictive techniques. Can be pretty stats/math heavy at times.

Data Science Project Development Approach
GitLab Data Science Team Approach to Model Development
Last modified April 15, 2024: Update infra information (7d638bfd)