Data Team

We Data

Data Analysis Process

Analysis usually begins with a question. A stakeholder will ask a question of the data team by creating an issue in the Data Team project using the appropriate template. The analyst assigned to the project may schedule a discussion with the stakeholder(s) to further understand the needs of the analysis. This meeting allows the analyst to understand the overall goals of the analysis, not just the singular question being asked, and should be recorded. Analysts looking for a place to start the discussion can ask:

An analyst will then update the issue to reflect their understanding of the project at hand. This may mean turning an existing issue into a meta issue or an epic. Stakeholders are encouraged to engage on the appropriate issues. The issue then becomes the SSOT for the status of the project, indicating the milestone to which it's been assigned and the analyst working on it, among other things. Barring any confidentiality concerns, the issue is also where the final project will be delivered. On delivery, the data team manager will be cc'ed, and s/he will provide feedback and/or request changes. When satisfied, s/he will close the issue. If the stakeholder would like to request a change after the issue has been closed, s/he should create a new issue and link to the closed issue.

The Data Team can be found in the #analytics channel on Slack.

Getting Things Done

The data team currently works in two-week intervals, called milestones. Milestones start on Tuesdays and end on Mondays. This discourages last-minute merging on Fridays and allows the team to have milestone planning meetings at the top of the milestone.

Milestones may be three weeks long if they cover a major holiday or if the majority of the team is on vacation. As work is assigned to a person and a milestone, it gets a weight assigned to it.

Issue Pointing

Merge Request Workflow

Ideally, your workflow should be as follows:

  1. Create an issue
  2. Open an MR from the issue using the "Create merge request" button. This automatically creates a unique branch based on the issue name. This marks the issue for closure once the MR is merged.
  3. Push your work to the branch
  4. Run any jobs relevant to the work being proposed
    • e.g. if you're working on dbt changes, run the dbt MR job and the dbt test job.
  5. Document the purpose of the MR in its description, along with any additional changes needed for the MR to be valid and, for a complicated MR, how you verified that the change works. See this MR for an example of good documentation. The goal is to make the MR as easy as possible for reviewers to understand and review.
  6. Assign the MR to a peer to have it reviewed. If assigning to someone who can merge, either leave a comment asking for a review without merging, or simply leave the WIP: label in place.
  7. Once it's ready for further review and merging, remove the WIP: label, mark the branch for deletion, mark squash commits, and assign to the project's maintainer.
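The local side of the workflow above (steps 3 and the commit itself) can be sketched from the command line. This is a minimal, self-contained sketch: the repo is created locally for illustration, and the branch name is hypothetical; in the real workflow you would clone the project and the issue branch would already exist from the "Create merge request" button.

```shell
# Minimal sketch of committing work to an issue branch (names are hypothetical).
set -e
repo=$(mktemp -d)                       # stand-in for a fresh clone
cd "$repo"
git init -q
git config user.email "analyst@example.com"
git config user.name "Data Analyst"
git checkout -q -b 123-add-dbt-model    # hypothetical issue-based branch name
echo "select 1 as test" > new_model.sql
git add new_model.sql
git commit -q -m "Add new dbt model"
git log --oneline -1                    # shows the commit on the issue branch
```

In the real workflow the final step would be `git push` to the project's remote, after which the relevant CI jobs can be run from the MR.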

Other tips:

Extract and Load

We currently use Stitch for most of our data sources. In some cases, we use Meltano.

| Data Source | Pipeline | Management Responsibility |
| --- | --- | --- |
| CloudSQL Postgres | Stitch | Data Team |
| Gitter | | |
| SheetLoad | SheetLoad | Data Team |
| Marketo | Stitch | Data Team |
| Netsuite | Stitch | Data Team |
| Pings | Meltano | The Meltano team will investigate the source of the problem. If the issue is in the Meltano tap, the Meltano team will fix it. If not, the data team is responsible. |
| SFDC | Stitch | Data Team |
| Snowplow | | Data Team |
| Zendesk | Stitch | Data Team |
| Zuora | Stitch | Data Team |

Adding new Data Sources

Process for adding a new data source:

Using SheetLoad

SheetLoad is the process by which a Google Sheet can be ingested into the data warehouse. This is not an ideal way to get data into the warehouse, but it may be the appropriate solution at times.

How to use SheetLoad

  1. Add file to SheetLoad Google Drive Folder with appropriate naming convention, described below
  2. Share the sheet with the SheetLoader runner => Doc with email (GitLab internal)
  3. Add the full file name to the extract-ci.yml file
  4. Create dbt base models
  5. Add to data quality test that helps ensure these files are updated monthly.

Naming Conventions

SheetLoad expects a name with three parts. The file name (the Google Sheet's name) should be schema.value, where value describes the file's purpose. The table's name comes from the tab name within the file. The extract file will therefore be named schema.value.tablename, but the table will appear in the database as schema.tablename.
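The naming convention above can be illustrated concretely. The sheet and tab names in this sketch are hypothetical:

```shell
# SheetLoad naming resolution, illustrated with hypothetical names.
file="sheetload.q1_forecast"      # Google Sheet name: schema.value
tab="forecast"                    # tab name becomes the table name
schema="${file%%.*}"              # text before the first dot is the schema
extract_name="${file}.${tab}"     # extract file: schema.value.tablename
table_name="${schema}.${tab}"     # database table: schema.tablename
echo "$extract_name"              # sheetload.q1_forecast.forecast
echo "$table_name"                # sheetload.forecast
```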

Orchestration

We are in the process of moving from GitLab CI to Airflow.

Data Warehouse

We currently use Snowflake as our data warehouse.

Warehouse Access

To gain access to the data warehouse:

Managing Roles for Snowflake

Here are the proper steps for provisioning a new user and user role:

  • Login and switch to SECURITYADMIN role
  • Create user (EBURKE)
  • Create a password using https://passwordsgenerator.net/
  • Click next and fill in additional info.
    • Make Login Name and Display name match user name (all caps).
    • Do not set any defaults.
    • Send to person using https://onetimesecret.com/
  • Create role for user (EBURKE for example) with SYSADMIN as the parent role (this grants the role to sysadmin)
  • Grant user role to new user
  • Create user_scratch schema in ANALYTICS as SYSADMIN
    • CREATE SCHEMA eburke_scratch;
  • Grant ownership of scratch schema to user role
    • GRANT OWNERSHIP ON schema eburke_scratch TO ROLE eburke;
  • Document in Snowflake config.yml permissions file
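The SQL side of the checklist above can be sketched as a small helper that prints the statements for a given username. This is a hedged sketch: the statement set is illustrative and may not cover every grant a particular configuration requires, and the username is hypothetical.

```shell
# Print the Snowflake statements from the provisioning checklist for one user.
# Run the output while in the SECURITYADMIN / SYSADMIN roles as described above.
user="EBURKE"                                                  # hypothetical user
scratch="$(printf '%s' "$user" | tr '[:upper:]' '[:lower:]')_scratch"
cat <<SQL
CREATE USER ${user};
CREATE ROLE ${user};
GRANT ROLE ${user} TO ROLE SYSADMIN;
GRANT ROLE ${user} TO USER ${user};
CREATE SCHEMA analytics.${scratch};
GRANT OWNERSHIP ON SCHEMA analytics.${scratch} TO ROLE ${user};
SQL
```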

Transformation

Tips and Tricks about Working with dbt

Visualization

Looker is GitLab's data visualization tool. Many modern data visualization tools require analysts to write SQL queries; Looker's unique advantage lies in its modeling layer, which allows non-technical end users to build their own data analyses in a drag-and-drop user interface. While this means the initial configuration (e.g., setting up a new data source) takes longer than just querying a table would, once that configuration is done you have a new data set, a Looker explore, available for all users to take advantage of. The data team aims for Looker to be the SSOT for all of GitLab.

Getting Looker Access

To get initial Looker Access, please create a new access issue following Security's procedures in this project. There are multiple levels of user access: View-Only, Explorer, and Developer. All GitLab Employees should be a part of their respective team group as well as the GitLab Employee group. If you do not have access, please see the handbook on how to request it.

User Roles

We have four role types: View-Only, Explorer, Developer, and Admin.

Getting Started with Looker - A special note for users coming from Redash

Users coming from Redash or another query-based data visualization tool, especially those with a strong familiarity with SQL, may find themselves uniquely frustrated at how long it can take to answer a "simple" question when doing so requires new data. Any new data source needs to be brought into the data warehouse, modeled in dbt following the team's dbt coding conventions, modeled in a LookML view, and added to a new or existing explore before an analysis can be built on top of it. While this initial configuration might seem like a bit of a slog, it moves all of the analysis configuration up front, making the explore you build usable not just for you, but for future users with related questions too.

Core Looker Definitions & Structure

LookML - A proprietary Looker language for describing dimensions, aggregates, calculations, and data relationships in a SQL database. The Looker app uses a model written in LookML to construct SQL queries against a particular database.

LookML Project - A LookML project is a collection of LookML files that describe a set of related models, explores, views, and LookML dashboards.

By convention, LookML code is segregated into three types of files: model files, view files, and dashboard files. In new LookML, these files have the following extensions: .model.lkml, .view.lkml, and .dashboard.lookml, respectively.

Model - A model is a customized portal into the database, designed to provide intuitive data exploration for specific business users. Multiple models can exist for the same database connection in a single LookML project. Each model can expose different data to different users. For example, sales agents need different data than company executives, and so you would probably develop two models to offer views of the database appropriate for each user. A model, in turn, is essentially metadata about the source database.

View - A view declaration defines a list of fields (dimensions or measures) and their linkage to an underlying table or derived table. In LookML a view typically references an underlying database table, but it can also represent a derived table.

A view may join to other views. The relationship between views is typically defined as part of an explore declaration in a model file.

Explore - An explore is a view that users can query. You can think of the explore as a starting point for a query, or in SQL terms, as the FROM in a SQL statement. An explore declaration specifies the join relationships to other views.

Derived Table - A derived table is a table composed of values from other tables, accessed as though it were a physical table with its own set of columns. A derived table is exposed as its own view using the derived_table parameter, and defines dimensions and measures in the same manner as conventional views. Users can think of a derived table as similar to a database view (not to be confused with a Looker view).

Dimension Group - The dimension_group parameter is used to create a set of time-based dimensions all at once. For example, you could easily create a date, week, and month dimension based on a single timestamp column.

Tips, Tricks, and Best Practices about Working in Looker

Team Roles

Data Analyst

Position Description

Data Engineer

Position Description

Manager

Position Description