
Data Team

  • Primary Project
  • dbt docs
  • SQL Style Guide
  • Python Style Guide
  • Roadmap
  • Epics
  • OKRs


We Data

Data Team Principles

The Data Team at GitLab is working to establish a world-class analytics function by utilizing the tools of DevOps in combination with the core values of GitLab. We believe that data teams have much to learn from DevOps. We will work to model good software development best practices and integrate them into our data management and analytics.

A typical data team has members who fall along a spectrum of skills and focus. For now, the analytics function at GitLab has Data Engineers and Data Analysts; eventually the team will include Data Scientists. Analysts are split between the Core Data Function and specializations aligned with different functions across the company.

Data Engineers on our team are essentially software engineers who have a particular focus on data movement and orchestration. The transition to DevOps is typically easier for them because much of their work is done using the command line and scripting languages such as Bash and Python. One particular challenge is data pipelines. Most pipelines are not well tested, data movement is not typically idempotent, and auditability of history is challenging.
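
To make the idempotency point concrete, here is a minimal sketch of an idempotent load step expressed as a SQL MERGE. The raw.orders_staging and analytics.orders tables are hypothetical and not part of our pipelines; the point is that re-running the statement with the same staged data leaves the warehouse in the same state, whereas a plain INSERT would duplicate rows on every run.

```sql
-- Hypothetical illustration: upsert staged rows into a target table keyed on order_id.
-- Running this twice with the same staging data yields the same target state,
-- which is what "idempotent data movement" means in practice.
MERGE INTO analytics.orders AS target
USING raw.orders_staging AS source
    ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET
    status     = source.status,
    updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, updated_at)
    VALUES (source.order_id, source.status, source.updated_at);
```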

Data Analysts are further from DevOps practices than Data Engineers. Most analysts use SQL for their analytics and queries, with Python or R a close second. In the past, data queries and transformations may have been done by custom tooling or software written by other companies. These tools and approaches share similar traits in that they're likely not version controlled, there are probably few tests around them, and they are difficult to maintain at scale.

Data Scientists are probably furthest from integrating DevOps practices into their work. Much of their work is done in tools like Jupyter Notebooks or R Studio. Those who do machine learning create models that are not typically version controlled. Data management and accessibility are also concerns.

We will work closely with the analytics community to find solutions to these challenges. Some of the solutions may be cultural in nature, and we aim to be a model for other organizations of how a world-class Data and Analytics team can utilize the best of DevOps for all Data Operations.

Some of our beliefs are:

Data Analysis Process

Analysis usually begins with a question. A stakeholder will ask a question of the data team by creating an issue in the Data Team project using the appropriate template. The analyst assigned to the project may schedule a discussion with the stakeholder(s) to further understand the needs of the analysis. This meeting allows the analyst to understand the overall goals of the analysis, not just the singular question being asked, and should be recorded. Analysts looking for a place to start the discussion can begin by asking:

An analyst will then update the issue to reflect their understanding of the project at hand. This may mean turning an existing issue into a meta issue or an epic. Stakeholders are encouraged to engage on the appropriate issues. The issue then becomes the SSOT for the status of the project, indicating the milestone to which it's been assigned and the analyst working on it, among other things. Barring any confidentiality concerns, the issue is also where the final project will be delivered. On delivery, the data team manager will be cc'ed; s/he will provide feedback and/or request changes. When satisfied, s/he will close the issue. If the stakeholder would like to request a change after the issue has been closed, s/he should create a new issue and link to the closed issue.

The Data Team can be found in the #analytics channel on Slack.

Getting Things Done

The data team currently works in two-week intervals, called milestones. Milestones start on Tuesdays and end on Mondays. This discourages last-minute merging on Fridays and allows the team to have milestone planning meetings at the top of the milestone.

Milestones may be three weeks long if they cover a major holiday or if the majority of the team is on vacation. As work is assigned to a person and a milestone, it gets a weight assigned to it.

Issue Pointing

| Weight | Description |
|--------|-------------|
| Null | Meta, Discussion, or Documentation issues that don't result in an MR |
| 0 | Should not be used. |
| 1 | The simplest possible change. We are confident there will be no side effects. |
| 2 | A simple change (minimal code changes), where we understand all of the requirements. |
| 3 | A simple change, but the code footprint is bigger (e.g. lots of different files, or tests affected). The requirements are clear. |
| 5 | A more complex change that will impact multiple areas of the codebase; there may also be some refactoring involved. Requirements are understood, but you feel there are likely to be some gaps along the way. |
| 8 | A complex change that will involve much of the codebase or will require lots of input from others to determine the requirements. |
| 13 | A significant change that may have dependencies (other teams or third parties), and we likely still don't understand all of the requirements. It's unlikely we would commit to this in a milestone, and the preference would be to further clarify requirements and/or break it into smaller issues. |

Issue Labeling

Think of each of these groups of labels as ways of bucketing the work done. All issues should get the following classes of labels assigned to them:

Optional labels that are useful to communicate state or other priority

| Priority | Description | Probability of shipping in milestone |
|----------|-------------|--------------------------------------|
| P1 | Urgent: top priority for achieving in the given milestone. These issues are the most important goals for a milestone and should be worked on first; some may be time-critical or unblock dependencies. | ~100% |
| P2 | High: important issues that have significant positive impact on the business or technical debt. Important, but not time-critical or blocking others. | ~75% |
| P3 | Normal: incremental improvements to existing features. These are important iterations, but deemed non-critical. | ~50% |
| P4 | Low: stretch issues that are acceptable to postpone into a future milestone. | ~25% |

Merge Request Workflow

Ideally, your workflow should be as follows:

  1. Create an issue or open an existing issue.
  2. Add appropriate labels to the issue (see above)
  3. Open an MR from the issue using the "Create merge request" button. This automatically creates a unique branch based on the issue name. This marks the issue for closure once the MR is merged.
  4. Push your work to the branch
  5. Run any relevant jobs to the work being proposed
    • e.g. if you're working on dbt changes, run the dbt MR job and the dbt test job.
  6. Document in the MR description what the purpose of the MR is, any additional changes that need to happen for the MR to be valid, and if it's a complicated MR, how you verified that the change works. See this MR for an example of good documentation. The goal is to make it easier for reviewers to understand what the MR is doing so it's as easy as possible to review.
  7. Assign the MR to a peer to have it reviewed. If assigning to someone who can merge, either leave a comment asking for a review without merge, or you can simply leave the WIP: label.
    • Note that assigning someone an MR means action is required from them.
    • Adding someone as an approver is a way to tag them for an FYI. This is similar to doing cc @user in a comment.
  8. Once it's ready for further review and merging, remove the WIP: label, mark the branch for deletion, mark squash commits, and assign to the project's maintainer. Ensure that the attached issue is appropriately labeled and pointed.

Other tips:

Our Data Stack

We use GitLab to operate and manage the analytics function. Everything starts with an issue. Changes are implemented via merge requests, including changes to our pipelines, extraction, loading, transformations, and parts of our analytics.

| Stage | Tool |
|-------|------|
| Extraction | Stitch and Custom |
| Loading | Stitch and Custom |
| Orchestration | Airflow and GitLab CI |
| Storage | Cloud SQL (PostgreSQL) and Snowflake |
| Transformations | dbt and Python scripts |
| Analysis | TBD |

Extract and Load

We currently use Stitch for most of our data sources.

| Data Source | Pipeline | Management Responsibility | Frequency |
|-------------|----------|---------------------------|-----------|
| Clearbit | | | |
| CloudSQL Postgres | Stitch | Data Team | |
| DiscoverOrg | | | |
| Gitter | | | |
| GitLab dot Com | | | |
| SheetLoad | SheetLoad | Data Team | |
| Marketo | Stitch | Data Team | 12 hour intervals - Backfilled from January 1, 2013 |
| Netsuite | Stitch | Data Team | 30 minute intervals - Backfilled from January 1, 2013 |
| Pings | Stitch/Custom | Data Team | |
| SFDC | Stitch | Data Team | 1 hour intervals - Backfilled from January 1, 2013 |
| Snowplow | | Data Team | |
| Zendesk | Stitch | Data Team | 1 hour intervals - Backfilled from January 1, 2013 |
| Zuora | Stitch | Data Team | 30 minute intervals - Backfilled from January 1, 2013 |

Planned:

Adding new Data Sources

Process for adding a new data source:

Using SheetLoad

SheetLoad is the process by which a Google Sheet, local CSV, or file from GCS can be ingested into the data warehouse. This is not an ideal solution to get data into the warehouse, but may be the appropriate solution at times.

As it is iterated on often, the best place for up-to-date information on SheetLoad is the SheetLoad readme.

Orchestration

We are in the process of moving from GitLab CI to Airflow.

Data Warehouse

We currently use Snowflake as our data warehouse.

Warehouse Access

To gain access to the data warehouse:

Managing Roles for Snowflake

Here are the proper steps for provisioning a new user and user role (a consolidated SQL sketch follows the list):

  • Login and switch to securityadmin role
  • Create user (EBURKE)
    • Create a password using https://passwordsgenerator.net/
    • Click next and fill in additional info. Make Login Name and Display name match user name (all caps).
    • Do not set any defaults.
    • Send to person using https://onetimesecret.com/
  • Create role for user (EBURKE for example) with sysadmin as the parent role (this grants the role to sysadmin)
  • Grant user role to new user
  • Create user_scratch schema in ANALYTICS as sysadmin
    • CREATE SCHEMA eburke_scratch;
  • Grant ownership of scratch schema to user role
    • GRANT OWNERSHIP ON schema eburke_scratch TO ROLE eburke;
  • Document in Snowflake config.yml permissions file
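
Taken together, the steps above correspond roughly to the following SQL. This is a sketch using EBURKE as the example user from the list; generate the actual password with passwordsgenerator.net and share it via onetimesecret.com as described.

```sql
-- As SECURITYADMIN: create the user and a matching role whose parent is SYSADMIN.
USE ROLE SECURITYADMIN;

CREATE USER EBURKE
    PASSWORD = '<generated password>'   -- shared with the person via onetimesecret.com
    LOGIN_NAME = 'EBURKE'
    DISPLAY_NAME = 'EBURKE';            -- no defaults (role, warehouse, namespace) are set

CREATE ROLE EBURKE;
GRANT ROLE EBURKE TO ROLE SYSADMIN;     -- grants the new role to sysadmin (its parent)
GRANT ROLE EBURKE TO USER EBURKE;       -- grant the user role to the new user

-- As SYSADMIN: create the scratch schema in ANALYTICS and hand ownership to the new role.
USE ROLE SYSADMIN;
USE DATABASE ANALYTICS;
CREATE SCHEMA eburke_scratch;
GRANT OWNERSHIP ON SCHEMA eburke_scratch TO ROLE eburke;
```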

Transformation

Please see the data analyst onboarding issue template for details on getting started with dbt.
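
For orientation before diving into the onboarding template: a dbt model is just a SELECT statement saved as a .sql file in the project, and dbt's ref() function wires models together into a dependency graph. The file, model, and column names below are hypothetical, not actual models in our project.

```sql
-- models/staging/stg_orders.sql (hypothetical file and model names)
-- dbt compiles ref() to the fully qualified name of the upstream model and
-- uses these references to determine the order in which models are built.
{{ config(materialized = 'view') }}

SELECT
    order_id,
    customer_id,
    amount,
    created_at
FROM {{ ref('raw_orders') }}
WHERE amount IS NOT NULL
```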

At times, we rely on dbt packages for some data transformations. Package management is built into dbt. A full list of available packages is on the dbt Hub site. We use the repository syntax instead of the hub syntax in our packages.yml file.

Tips and Tricks about Working with dbt

Visualization

We are currently evaluating multiple visualization tools. To request access, please submit an access request.

Team Roles

Data Analyst

Position Description

Data Engineer

Position Description

Manager

Position Description