
dbt Guide


Primary Project | dbt docs

What and why

πŸŒŽπŸ’‘

dbt, short for data build tool, is an open source project for managing data transformations in a data warehouse. Once data is loaded into a warehouse, dbt enables teams to manage all data transformations required for driving analytics. It also comes with built-in testing and documentation so we can have a high level of confidence in the tables we're generating and analyzing.

The following links will give you an excellent overview of what dbt is:

But why do we use dbt? There are several reasons.

First is that it is an open source tool with a vibrant community. Choosing an open source tool enables us to collaborate with the larger data community and solve problems faster than we could with a proprietary solution.

Second, it was built with version control in mind. For GitLab, this is essential since we use the product for building and running the company.

Third, it speaks the language of analysts - SQL. This increases the number of people that can contribute since SQL is becoming such a critical part of many people's jobs.

Finally, it enables teams to move faster by integrating testing and documentation from the start.

For even more information about the basics of dbt, see our data analyst onboarding issue template.

At times, we rely on dbt packages for some data transformation. Package management is built into dbt. A full list of available packages is on the dbt Hub site.
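For example, adding a package is a matter of listing it in the project's packages.yml and running dbt deps; the package and version shown here are illustrative:

# packages.yml -- check dbt Hub for the release that matches your dbt version
packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.6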

Running dbt

If you're interested in using dbt, the dbt documentation has a great tutorial on getting set up to work on data from a fictional business called Jaffle Shop.

If you wish to use dbt and contribute to the data team project, you'll need to gain access to our Snowflake instance, which can be done via an access request.

Configuration

Note that many of these steps are done in the onboarding script we recommend new analysts run.

Docker Workflow

πŸ› πŸ

To facilitate an easier workflow for analysts, and to abstract away some of the complexity around handling dbt and its dependencies locally, the main analytics project supports using dbt from within a Docker container. We build the container from the data-image project. The Makefile contains commands to facilitate this; if at any time you have questions about the various make commands, run make help to get a list of them and what each one does.

Before your initial run (and whenever the containers get updated) make sure to run the following commands:

  1. make update-containers
  2. make cleanup

These commands will ensure you get the newest versions of the containers and generally clean up your local Docker environment.
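In a terminal, the flow looks like this:

# Refresh the data-image containers and clean up the local Docker environment
make update-containers
make cleanup

# List every available make target and what it does
make help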

Using dbt

Command line cheat sheet

This is a simplified version of the primary command reference.

dbt specific:
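A few of the most commonly used dbt commands, as a sketch (my_model is a placeholder; see the primary command reference for the complete list):

dbt run                      # compile and run all models
dbt run --models my_model    # run a single model
dbt run --models my_model+   # run a model and everything downstream of it
dbt test                     # run schema and data tests
dbt compile                  # compile models to raw SQL without running them
dbt seed                     # load csv files into the warehouse
dbt snapshot                 # build snapshot tables
dbt docs generate            # build the project documentation site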

Works only if you've run the onboarding script:

Configuration for contributing to dbt project

πŸ› πŸ

If you're interested in contributing to dbt, here's our recommended way of setting up your local environment to make it easy.

When you're ready to submit your code for a PR, ensure you've signed their CLA.

Production Runs

Style and Usage Guide

Model Structure

As we transition to a more Kimball-style warehouse, we are improving how we organize models in the warehouse and in our project structure. The following sections each correspond to a top-level directory under the models directory, which is a dbt default. This structure is inspired by how Fishtown Analytics structures their projects.

Legacy Structure

Prior to our focus on Kimball dimensional modeling, we took inspiration from the BEAM* approach to modeling introduced in "Agile Data Warehouse Design" by Corr and Stagnitto. Many of the existing models still follow that pattern. The information in this section is from previous iterations of the handbook.

  • The goal of a (final) _xf dbt model should be a BEAM* table, meaning it follows the business event analysis & model structure and answers the who, what, where, when, how many, why, and how question combinations that measure the business.
  • base models - the only dbt models that reference source tables. Base models have minimal transformational logic (usually limited to filtering out rows with data integrity issues or rows actively flagged as not for analysis, and renaming columns for easier analysis). They live in the analytics_staging schema and are referenced in ref statements by end-user models.
  • end-user models - dbt models used for analysis. The final version of a model will likely be indicated with an _xf suffix when its goal is to be a BEAM* table. It should follow the business event analysis & model structure and answer the who, what, where, when, how many, why, and how question combinations that measure the business. End-user models are found in the analytics schema.

Sources

All raw data will still be in the RAW database in Snowflake. These raw tables are referred to as source tables or raw tables. They are typically stored in a schema that indicates the original data source, e.g. netsuite.

Sources are defined in dbt using a sources.yml file.
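A minimal entry looks something like this (the table names are illustrative):

version: 2

sources:
  - name: netsuite
    database: raw
    schema: netsuite
    tables:
      - name: transactions
      - name: transaction_lines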

Source Models

We are enforcing a very thin source layer on top of all raw data. These are "base" models that pull directly from the raw data and should do only the following:

Even in cases where the underlying raw data is perfectly cast and named, there should still be a source model that enforces the formatting. This is for the convenience of end users, so they have only one place to look, and it keeps permissioning clean in situations where this perfectly formatted data is sensitive.
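As a sketch, a source model usually amounts to little more than selecting from the raw table, casting, and renaming (the columns here are hypothetical):

WITH source AS (

    SELECT *
    FROM {{ source('netsuite', 'transactions') }}

)

SELECT
    id::NUMBER              AS transaction_id,
    transaction_date::DATE  AS transaction_date,
    LOWER(status)           AS transaction_status
FROM source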

The following should not be done in a source model:

For all intents and purposes, the source models should be considered the "raw" data for the vast majority of users.

Key points to remember:

Staging

This directory is where the majority of source-specific transformations will be stored. These are the models that do the prep work required to make facts and dimensions. Prior to our implementation of Kimball modeling, almost all of our models would have fallen into this category.

The key feature of this directory is that the transformations should stick to the data-source structure. This means Zuora, Salesforce, etc. will each have their own directory here that maps to what exists in sources. This may initially seem redundant, but we're optimizing for analysts and data consumers, not for data/analytics engineers. Consolidating all sources into a single directory and schema enables simpler configuration, permissioning, and logical explanations. Having all source-specific transformations in staging encourages better organization and collaboration.

Marts

This directory is where facts, dimensions, and OLAP cubes will be stored. When a model is in this directory, it communicates to business stakeholders that the data is cleanly modeled and ready for querying. All of these models should have documentation and testing associated with them.
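Documentation and tests for these models live in a yml file alongside them; a minimal sketch (the model and column names are hypothetical):

version: 2

models:
  - name: dim_accounts
    description: One row per account, cleanly modeled and ready for querying.
    columns:
      - name: account_id
        description: Primary key.
        tests:
          - unique
          - not_null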

General

Schema References (aka What goes where)

| Purpose | Production | Dev | Config |
| --- | --- | --- | --- |
| For querying & analysis | analytics | emilie_scratch_analytics | None |
| For modeling marts | analytics_staging | emilie_scratch_staging | {{ config({ "schema": "staging"}) }} |
| For modeling, but SENSITIVE | analytics_sensitive | emilie_scratch_analytics_sensitive | {{ config({ "schema": "sensitive"}) }} |
| Intermediate tables that are dropped | analytics_temporary | emilie_scratch_temporary | {{ config({ "schema": "temporary"}) }} |
| Source models used to build other models | analytics_source | emilie_scratch_temporary | {{ config({ "schema": "source"}) }} |

Model Configuration

There are multiple ways to provide configuration definitions for models. The dbt docs for configuring models provide a concise explanation of the ways to configure models.
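For example, an individual model can be configured in-file with a config block; the schema value mirrors the table above and the materialization shown is illustrative:

{{ config({
    "schema": "staging",
    "materialized": "table"
}) }}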

Our guidelines for configuring models:

Seeds

Seeds are a way to load data from csv files into our data warehouse (dbt documentation). Because these csv files are located in our dbt repository, they are version controlled and code reviewable. This method is appropriate for loading static data which changes infrequently. A csv file that’s up to ~1k lines long and less than a few kilobytes is probably a good candidate for use with the dbt seed command.
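Once dbt seed has loaded a csv into the warehouse, downstream models reference it like any other model (the seed name here is hypothetical):

-- country_codes is a hypothetical seed csv checked into the dbt repository
SELECT
    country_code,
    country_name
FROM {{ ref('country_codes') }}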

Organizing columns

When writing a base model, columns should have some logical ordering to them. We encourage these 4 basic groupings:

Primary data is the key information describing the table. The primary key should be in this group along with other relevant unique attributes such as name.

Foreign keys should be all the columns which point to another table.

Logical data is for additional data dimensions that describe the object in reference. For a Salesforce opportunity this would be the opportunity owner or contract value. Further logical groupings are encouraged if they make sense. For example, having a group of all the variations of contract value would make sense.

Within any group, the columns should be alphabetized on the alias name.
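Putting the groupings together, the select list of a base model might look like this (hypothetical, Salesforce-flavored columns):

SELECT
    -- primary data
    id          AS opportunity_id,
    name        AS opportunity_name,

    -- foreign keys
    account_id,
    owner_id,

    -- logical data
    close_date,
    amount      AS contract_value,
    stage_name
FROM {{ source('salesforce', 'opportunity') }}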

An exception to the grouping recommendation is when we control the extraction via a defined manifest file. A perfect example of this is our gitlab.com manifest which defines which columns we extract from our application database. The base models for these tables can be ordered identically to the manifest as it's easier to compare diffs and ensure accuracy between files.

Testing

Snapshots

Snapshots are a way to take point-in-time copies of source tables. dbt has excellent documentation on how snapshots work. Take note of how we talk about and define snapshots.

Create snapshot tables with dbt snapshot

Snapshots are stored in the snapshots folder of our dbt project. We have organized the different snapshots by data source for easy discovery.

The following is an example of how we implement a snapshot:

{% snapshot sfdc_opportunity_snapshots %}

    {{
        config(
          target_database='raw',
          target_schema='snapshots',
          unique_key='id',
          strategy='timestamp',
          updated_at='systemmodstamp',
        )
    }}
    
    SELECT * 
    FROM {{ source('salesforce', 'opportunity') }}
    
{% endsnapshot %}

Key items to note:

Snapshots are tested manually by a maintainer of the Data Team project before merging.

Make snapshot tables available in analytics data warehouse

As stated above, the RAW database and the snapshots schema are hard-coded in the config dictionary of our dbt snapshots. This means that once these snapshot tables are created in RAW, we need to make them available in the ANALYTICS data warehouse so they can be queried downstream (with Sisense or in further _xf dbt models).
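As a sketch, such a base model can be as thin as a select from the snapshot table, assuming the RAW snapshot tables are declared as a dbt source named snapshots (an assumption; all names here are illustrative):

-- /models/snapshots/sfdc_opportunity_snapshots_base.sql (illustrative name)
SELECT *
FROM {{ source('snapshots', 'sfdc_opportunity_snapshots') }}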

Base models for snapshots are available in the folder /models/snapshots of our dbt project. Key items to note: