
Data Team

Quick links: Primary Project · dbt docs · Periscope · Epics · OKRs · GitLab Unfiltered YouTube Playlist

Data Team Handbook


Contact Us

Slack

The Data Team uses these channels on Slack:

Meetings

The Data Team's Google Calendar is the SSOT for meetings. It also includes relevant events in the data space. Anyone can add events to it. Many of the events on this calendar, including Monthly Key Reviews, do not require attendance and are FYI events. When creating an event for the entire Data Team, it might be helpful to consult the Time Blackout sheet.

The Data Team has the following recurring meetings:

Meeting Tuesday

The team honors Meeting Tuesday. We aim to consolidate all of our meetings into Tuesday, since most team members identify more strongly with the Maker's Schedule than with the Manager's Schedule.

We ❤️ Data

Charter

The Data Team is part of the Finance organization within GitLab, but we serve the entire company. We do this by maintaining a data warehouse where information from all business systems is stored and managed for analysis.

Our charter and goals are as follows:

Data Team Principles

The Data Team at GitLab is working to establish a world-class analytics function by utilizing the tools of DevOps in combination with the core values of GitLab. We believe that data teams have much to learn from DevOps. We will work to model good software development best practices and integrate them into our data management and analytics.

A typical data team has members who fall along a spectrum of skills and focus. For now, the data function at GitLab has Data Engineers and Data Analysts; eventually, the team will include Data Scientists. Analysts are divided between the Central Data Function and those specializing in different functions across the company.

Data Engineers are essentially software engineers who have a particular focus on data movement and orchestration. The transition to DevOps is typically easier for them because much of their work is done using the command line and scripting languages such as bash and Python. One particular challenge is data pipelines: most pipelines are not well tested, data movement is not typically idempotent, and auditability of history is challenging.

Data Analysts are further from DevOps practices than Data Engineers. Most analysts use SQL for their analytics and queries, supplemented with Python or R. In the past, data queries and transformations may have been done with custom tooling or software written by other companies. These tools and approaches share similar traits: they're likely not version controlled, there are probably few tests around them, and they are difficult to maintain at scale.

Data Scientists are probably furthest from integrating DevOps practices into their work. Much of their work is done in tools like Jupyter Notebooks or RStudio. Those who do machine learning create models that are not typically version controlled. Data management and accessibility are also concerns.

We will work closely with the data and analytics communities to find solutions to these challenges. Some of the solutions may be cultural in nature, and we aim to be a model for other organizations of how a world-class Data and Analytics team can utilize the best of DevOps for all Data Operations.

Some of our beliefs are:

Team Organization

The Data Team operates in a hub and spoke model, where some analysts or engineers are part of the central data team (hub) while others are embedded (spoke) or distributed (spoke) throughout the organization.

Central - those in this role report to and have their priorities set by the Data team. They currently support those in the Distributed role, cover ad-hoc requests, and support all functional groups (business units).

Embedded - those in this role report to the data team but their priorities are set by their functional groups (business units).

Distributed - those in this role report to and have their priorities set by their functional groups (business units). However, they work closely with those in the Central role to align on data initiatives and for assistance on the technology stack.

All roles mentioned above have their MRs and dashboards reviewed by members of the Data team. Both Embedded and Distributed data analysts and data engineers tend to be subject matter experts (SMEs) for a particular business unit.

Data Support Per Organization

Central
Data Engineers
Role Team Member Type Prioritization Owners
Manager, Data @jjstark Central @wzabaglio
Staff Data Engineer, Architecture @tayloramurphy Central @jjstark
Senior Data Engineer @tlapiana Central @jjstark
Data Engineer TBD Central (Main Focus: Sales & Marketing) TBD
Data Engineer TBD Central (Main Focus: Product & Engineering) TBD

Board: TBD

Data Analysts
Role Team Member Type Prioritization Owners Board
Manager, Data @kathleentam Central @wzabaglio Board
Data Analyst @derekatwood Central @kathleentam Board
Data Analyst TBD Central (Main Focus: Periscope, Corporate, Alliances) @kathleentam TBD
Data Analyst Click Here to Apply! Central (Main Focus: Engineering) @kathleentam TBD
Data Analyst TBD Central (Main Focus: Sales, Marketing, Finance) @kathleentam TBD
Data Analyst TBD Central (Main Focus: Product, Growth) @kathleentam TBD
Business Operations
Role Team Member Type Prioritization Owners Board
Data Analyst, Operations TBD Embedded @wzabaglio TBD
Finance
Role Team Member Type Prioritization Owners
Data Analyst, Finance @iweeks Embedded @wwright

Board: TBD

Growth
Role Team Member Type Prioritization Owners
Sr. Data Analyst, Growth @mpeychet Embedded Primary (DRI): @sfwgitlab; Secondary: @timhey, @jstava, @s_awezec, @mkarampalas
Data Analyst, Growth @eli_kastelein Embedded Primary (DRI): @sfwgitlab; Secondary: @timhey, @jstava, @s_awezec, @mkarampalas

Board: Growth Board

People

This role will support all People and Recruiting data analytics requests.

Role Team Member Type Prioritization Owners
Data Analyst, People Click Here to Apply! Embedded TBD

Board: TBD

Engineering, Infrastructure
Role Team Member Type Prioritization Owners
Staff Data Analyst @davis_townsend Distributed @glopezfernandez

Board: TBD

Marketing
Role Team Member Type Prioritization Owners
Marketing Operations Manager @rkohnke Distributed @rkohnke
Data Analyst, Marketing TBD Embedded @rkohnke

Board: TBD

Sales
Role Team Member Type Prioritization Owners
Senior Sales Analytics Analyst @JMahdi Distributed @mbenza
Senior Sales Analytics Analyst @mvilain Distributed @mbenza
Sales Analytics Analyst TBD Distributed @mbenza

Board: TBD

Data Analysis Process

Analysis usually begins with a question. A stakeholder will ask a question of the data team by creating an issue in the Data Team project using the appropriate template. The analyst assigned to the project may schedule a discussion with the stakeholder(s) to further understand the needs of the analysis, though the preference is always for async communication. This meeting will allow for analysts to understand the overall goals of the analysis, not just the singular question being asked, and should be recorded. All findings should be documented in the issue. Analysts looking for some place to start the discussion can start by asking:

An analyst will then update the issue to reflect their understanding of the project at hand. This may mean turning an existing issue into a meta issue or an epic. Stakeholders are encouraged to engage on the appropriate issues. The issue then becomes the SSOT for the status of the project, indicating the milestone to which it has been assigned and the analyst working on it, among other things. The issue should always contain information on the project's status, including any blockers that can help explain its prioritization. Barring any confidentiality concerns, the issue is also where the final project will be delivered, after peer/technical review. When satisfied, the analyst will close the issue. If the stakeholder would like to request a change after the issue has been closed, they should create a new issue and link to the closed issue.

The Data Team can be found in the #data channel on slack.

Working with the Data Team

  1. Once a KPI or other Performance Indicator is defined and assigned a prioritization, the metric will need to be added to Periscope by the data team.
  2. Before syncing with the data team to add a KPI to Periscope, the metric must be:
    • Clearly defined in the relevant section in the handbook and added to the GitLab KPIs with all of its parts.
    • Reviewed with the Financial Business Partner for the group.
    • Approved and reviewed by the executive of the group.
  3. Once the KPI is ready to be added into Periscope, create an issue on the GitLab Data Team Issue Tracker using the KPI Template or PI Request Template.
    • The Data team will verify the data sources and help to find a way to automate (if necessary).
    • Once the import is complete, the data team will present the information to the owner of the KPI for approval; the owner will then document it in the relevant section of the handbook.

Can I get an update on my dashboard?

The data team's priorities come from our OKRs. We do our best to service as many of the requests from the organization as possible. You know that work has started on a request when it has been assigned to a milestone. Please communicate in the issue about any pressing priorities or timelines that may affect the data team's prioritization decisions. Please do not DM a member of the data team asking for an update on your request. Please keep the communication in the issue.

How we Work

Documentation

The data team, like the rest of GitLab, works hard to document as much as possible. We believe this framework for types of documentation from Divio is quite valuable. For the most part, what's captured in the handbook are tutorials, how-to guides, and explanations, while reference documentation lives within the primary analytics project. We have aspirations to tag our documentation with the appropriate function as well as clearly articulate the assumed audiences for each piece of documentation.

OKR Planning

Data Team OKRs are derived from the higher level BizOps/Finance OKRs as well as the needs of the team. At the beginning of a FQ, the team will outline all actions that are required to succeed with our KRs and in helping other teams measure the success of their KRs. The best way to do that is via a team brain dump session in which everyone lays out all the steps they anticipate for each of the relevant actions. This is a great time for the team to raise any blockers or concerns they foresee. These should be recorded for future reference.

These OKRs drive ~60% of the work that the central data team does in a given quarter. The remaining time is divided between urgent issues that come up and ad hoc/exploratory analyses. Specialty data analysts (who have the title "Data Analyst, Specialty") should have a similar breakdown of planned work to responsive work, but their priorities are set by their specialty manager.

Milestone Planning

The data team currently works in two-week intervals, called milestones. Milestones start on Tuesdays and end on Mondays. This discourages last-minute merging on Fridays and allows the team to have milestone planning meetings at the top of the milestone.

Milestones may be three weeks long if they cover a major holiday or if the majority of the team is on vacation or at Contribute. As work is assigned to a person and a milestone, it gets a weight assigned to it.

Milestone planning should take into consideration:

The milestone planning is owned by the Manager, Data.

The timeline for milestone planning is as follows:

The short-term goal of this process is to improve our ability to plan and estimate work through better understanding of our velocity. In order to successfully evaluate how we're performing against the plan, any issues not raised at the T+7 mark should not be moved until the next milestone begins.

The work of the data team generally falls into the following categories:

During the milestone planning process, we point issues. Then we pull into the milestone the issues expected to be completed in the timeframe. Points are a good measure of consistency: the average number of points completed should be roughly stable milestone over milestone. Issues are then prioritized according to these categories.

Issues are not assigned to individual members of the team, except where necessary, until someone is ready to work on them. Work is not assigned and then managed into a milestone. Every person works on the top-priority issue for their job type. As that issue is completed, they can pick up the next-highest-priority issue. People will likely be working on no more than two issues at a time.

Given the power of the Ivy Lee method, this allows the team to collectively work on priorities as opposed to creating a backlog for any given person. As a tradeoff, this also means that every time a central analyst is introduced to a new data source their velocity may temporarily decrease as they come up to speed; the overall benefit to the organization that any analyst can pick up any issue will compensate for this, though. Learn how the product managers groom issues.

Data Engineers will work on Infrastructure issues. Central Data Analysts, and sometimes Data Engineers, work on general Analytics issues. Specialty Data Analysts work on analyses for their functional areas, e.g. Growth, Finance, etc.

There is a demo of what this proposal would look like in a board.

This approach has many benefits, including:

  1. It helps ensure the highest priority projects are being completed
  2. It can help leadership identify issues that are blocked
  3. It provides leadership view into the work of the data team, including specialty analysts whose priorities are set from outside the data function
  4. It encourages consistent throughput from team members
  5. It makes clear to stakeholders where their ask is in priority
  6. It helps alleviate the pressure of planning the next milestone, as issues are already ranked

Issue Types

There are three general types of issues:

Not all issues will fall into one of these buckets but 85% should.

Discovery issues

Some issues may need a discovery period to understand requirements, gather feedback, or explore the work that needs to be done. Discovery issues are usually 2 points.

Introducing a new data source

Introducing a new data source requires a heavy lift of understanding that new data source, mapping field names to logic, documenting those, and understanding what issues are being delivered. Usually introducing a new data source is coupled with replicating an existing dashboard from the other data source. This helps verify that numbers are accurate and the original data source and the data team's analysis are using the same definitions.

Work

This umbrella term helps capture:

It is the responsibility of the assignee to be clear on what the scope of their issue is. A well-defined issue has a clearly outlined problem statement. Complex or new issues may also include an outline (not an all-encompassing list) of the steps that need to be taken. If an issue is not well-scoped when it is assigned, it is the responsibility of the assignee to figure out how to scope it properly and to approach the appropriate team members for guidance early in the milestone.

Issue Pointing

Issue pointing captures the complexity of an issue, not the time it takes to complete an issue. That is why pointing is independent of who the issue assignee is.

Weight Description
Null Meta and Discussions that don't result in an MR
0 Should not be used.
1 The simplest possible change including documentation changes. We are confident there will be no side effects.
2 A simple change (minimal code changes), where we understand all of the requirements.
3 A simple change, but the code footprint is bigger (e.g. lots of different files, or tests affected). The requirements are clear.
5 A more complex change that will impact multiple areas of the codebase; there may also be some refactoring involved. Requirements are understood, but you feel there are likely to be some gaps along the way.
8 A complex change, that will involve much of the codebase or will require lots of input from others to determine the requirements.
13 A significant change that may have dependencies (other teams or third-parties) and we likely still don't understand all of the requirements. It's unlikely we would commit to this in a milestone, and the preference would be to further clarify requirements and/or break into smaller Issues.

Issue Labeling

Think of each of these groups of labels as ways of bucketing the work done. All issues should get the following classes of labels assigned to them:

Optional labels that are useful to communicate state or other priority

Daily Standup

Members of the data team use Geekbot for our daily standups. These are posted in #data-daily. When Geekbot asks, "What are you planning on working on today? Any blockers?" try answering with specific details, so that teammates can proactively unblock you. Instead of "working on Salesforce stuff", consider "Adding Opportunity Owners for the sfdc_opportunity_xf model." There is no pressure to respond to Geekbot as soon as it messages you. Give responses to Geekbot that truly communicate to your team what you're working on that day, so that your team can help you understand if some priority has shifted or there is additional context you may need.

Merge Request Workflow

Ideally, your workflow should be as follows:

  1. Confirm you have access to the analytics project. If not, request Developer access so you can create branches, merge requests, and issues.
  2. Create an issue or open an existing issue.
  3. Add appropriate labels to the issue (see above)
  4. Open an MR from the issue using the "Create merge request" button. This automatically creates a unique branch based on the issue name. This marks the issue for closure once the MR is merged.
  5. Push your work to the branch
  6. Run any relevant jobs to the work being proposed
  7. Document in the MR description what the purpose of the MR is, any additional changes that need to happen for the MR to be valid, and if it's a complicated MR, how you verified that the change works. See this MR for an example of good documentation. The goal is to make it easier for reviewers to understand what the MR is doing so it's as easy as possible to review.
  8. Assign the MR to a peer to have it reviewed. If assigning to someone who can merge, either leave a comment asking for a review without merge, or you can simply leave the WIP: label.
    • Note that assigning someone an MR means action is required from them.
    • Adding someone as an approver is a way to tag them for an FYI. This is similar to doing cc @user in a comment.
  9. Once it's ready for further review and merging, remove the WIP: label, mark the branch for deletion, mark squash commits, and assign to the project's maintainer. Ensure that the attached issue is appropriately labeled and pointed.

Other tips:

Local Docker Workflow

To facilitate an easier workflow for analysts and to abstract away some of the complexity around handling dbt and its dependencies locally, the main analytics repo now supports using dbt from within a Docker container. There are commands within the Makefile to facilitate this, and if at any time you have questions about the various make commands and what they do, just use make help to get a handy list of the commands and what each of them does.

Before your initial run (and whenever the containers get updated) make sure to run the following commands:

  1. make update-containers
  2. make cleanup

These commands will ensure you get the newest versions of the containers and generally clean up your local Docker environment.

Using dbt:

YouTube

We encourage everyone to record videos and post to GitLab Unfiltered. The handbook page on YouTube does an excellent job of telling why we should be doing this. If you're uploading a video for the data team, be sure to do the following extra steps:

Our Data Stack

We use GitLab to operate and manage the analytics function. Everything starts with an issue. Changes are implemented via merge requests, including changes to our pipelines, extraction, loading, transformations, and parts of our analytics.

Stage Tool
Extraction Stitch, Fivetran, and Custom
Loading Stitch, Fivetran, and Custom
Orchestration Airflow
Storage Snowflake
Transformations dbt and Python scripts
Analysis Periscope Data

Extract and Load

We currently use Stitch and Fivetran for most of our data sources. These are off-the-shelf ELT tools that remove the responsibility of building, maintaining, or orchestrating the movement of data from some data sources into our Snowflake data warehouse. We run a full-refresh of all of our Stitch/Fivetran data sources at the same time that we rotate our security credentials (approx every 90 days).

Data Source Pipeline Management Responsibility Frequency
BambooHR Airflow Data Team 12 hour intervals for all time
Clearbit      
CloudSQL Postgres Stitch Data Team  
Customer DB Postgres_Pipeline Data Team  
DiscoverOrg      
Gitter     not updated
GitLab.com Postgres_Pipeline Data Team  
Greenhouse Airflow (custom script) Data Team Once per day
License DB Postgres_Pipeline Data Team  
Marketo Stitch Data Team 12 hour intervals - Backfilled from January 1, 2013
Netsuite Fivetran Data Team 6 hour intervals - Backfilled from January 1, 2013
Salesforce (SFDC) Stitch Data Team 1 hour intervals - Backfilled from January 1, 2013
SheetLoad SheetLoad Data Team 24 hours
Snowplow Snowpipe Data Team Continuously loaded
Version DB (Pings) Postgres_Pipeline Data Team  
Zendesk Stitch Data Team 1 hour intervals - Backfilled from January 1, 2013
Zuora Stitch Data Team 30 minute intervals - Backfilled from January 1, 2013

SLOs (Service Level Objectives) by Data Source

This is the lag between real-time and the analysis displayed in the data visualization tool.

Data Source SLO
BambooHR 1 day
Clearbit None
Airflow DB 9 hours
CI Stats DB None - Owned by GitLab.com Infrastructure Team, intermittently unavailable
Customer DB None - Owned by GitLab.com Infrastructure Team, intermittently unavailable
DiscoverOrg None
GitLab.com None - Owned by GitLab.com Infrastructure Team, intermittently unavailable
GitLab Profiler DB None - Owned by GitLab.com Infrastructure Team, intermittently unavailable
Greenhouse 2 days
License DB None - Owned by GitLab.com Infrastructure Team, intermittently unavailable
Marketo None
Netsuite 1 day
Salesforce (SFDC) 1 day
SheetLoad 2 days
Snowplow 1 day
Version DB (Pings) None - Owned by GitLab.com Infrastructure Team, intermittently unavailable
Zendesk 1 day
Zuora 1 day

Adding new Data Sources and Fields

Process for adding a new data source:

To add new fields to the BambooHR extract:

Data Team Access to Data Sources

In order to integrate new data sources into the data warehouse, specific members of the Data team will need admin-level access to data sources, both in the UI and through the API. We need admin-level access through the API in order to pull all the data needed to build the appropriate analyses, and access through the UI to compare the results of prepared analyses against the UI.

Access to sensitive data sources can be limited to as few as one data engineer and one data analyst, who have access to build the required reporting. In some cases, it may be only two data engineers. We will likely request an additional account for the automated extraction process.

Sensitive data is locked down through the security paradigms listed below; Periscope will never have access to sensitive data, as Periscope does not have access to any data by default. Periscope's access is always explicitly granted.
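
To make that concrete, here is a minimal sketch of what an explicit grant looks like in Snowflake; the periscope role name and the schema names are assumptions for illustration, not our actual configuration:

-- Nothing is visible to the reporting role until it is explicitly granted
GRANT USAGE ON DATABASE analytics TO ROLE periscope;
GRANT USAGE ON SCHEMA analytics.analytics TO ROLE periscope;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.analytics TO ROLE periscope;

-- The sensitive schema is simply never granted, so Periscope cannot query it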

Using SheetLoad

SheetLoad is the process by which Google Sheets and CSVs from GCS or S3 can be ingested into the data warehouse.

Technical documentation on the usage of SheetLoad can be found in the readme in the data team project (https://gitlab.com/gitlab-data/analytics/tree/master/extract/sheetload).

If you want to import a Google Sheet or CSV into the warehouse, please make an issue in the data team project using the "CSV or GSheets Data Upload" issue template. This template has detailed instructions depending on the type of data you want to import and what you may want to do with it.

Things to keep in mind about SheetLoad

We strongly encourage you to consider the source of the data when you want to move it into a spreadsheet. SheetLoad should primarily be used for data whose canonical source is a spreadsheet, e.g. sales quotas. If there is a source for this data that is not a spreadsheet, you should at least make an issue to get the data pulled automatically. However, if the spreadsheet is the SSOT for this data, then we can get it into the warehouse and modelled appropriately via dbt.

We do understand, though, that there are instances where a one-off analysis is needed based on some data in a spreadsheet and that you might need to join this to some other data already in the warehouse. We offer a "Boneyard" schema where you can upload the spreadsheet and it will be available for querying within Periscope. We call it Boneyard to highlight that this data is relevant only for an ad hoc/one off use case and will become stale within a relatively short period of time.
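
For example, an ad hoc join against a Boneyard upload might look like the following; both table names here are hypothetical:

-- Hypothetical one-off analysis joining an uploaded sheet to warehouse data
SELECT
    accounts.account_id,
    accounts.account_name,
    uploaded.some_manual_field
FROM analytics.dim_accounts AS accounts          -- existing warehouse model (hypothetical name)
JOIN boneyard.my_one_off_upload AS uploaded      -- table created by SheetLoad from the spreadsheet
  ON accounts.account_id = uploaded.account_id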

SheetLoad is designed to make the table in the database a mirror image of what is in the sheet from which it is loading. Whenever SheetLoad detects a change in the source sheet it will forcefully drop the database table and recreate it in the image of the updated spreadsheet. This means that if columns are added, changed, etc. it will all be reflected in the database.

Except where absolutely not possible, it is best that the SheetLoad sheet imports from the original Google Sheet directly using the importrange function. This allows you to leave the upstream sheet alone while enabling you to format the SheetLoad version as plain text. Any additional data type conversions or data cleanup can happen in the base dbt models. (This does not apply to the Boneyard.)

Snowplow Infrastructure

In June of 2019, we switched sending Snowplow events from a third party to sending them to infrastructure managed by GitLab. From the perspective of the data team, not much changed from the third party implementation. Events are sent through the collector and enricher and dumped to S3. See Snowplow's architecture overview for more detail.

Enriched events are stored in TSV format in the bucket s3://gitlab-com-snowplow-events/output/. Bad events are stored as JSON in s3://gitlab-com-snowplow-events/enriched-bad/. For both buckets, there are paths that follow a date format of /YYYY/MM/DD/HH/<data>.

For details on how the ingestion infrastructure was set up, please see the Snowplow section of the handbook.

Data Source Overviews

Orchestration

We use Airflow on Kubernetes for our orchestration. Our specific setup/implementation can be found here.

Data Warehouse

We currently use Snowflake as our data warehouse.

Warehouse Access

To gain access to the data warehouse:

Snowflake Permissions Paradigm

Goal: Mitigate risk of people having access to sensitive data.

We currently use Meltano's Permission Bot in dry mode to help manage our users, roles, and permissions for Snowflake. Documentation on the permission bot is in the Meltano docs. Our configuration file for our Snowflake instance is stored in this roles.yml file.

There are four things that we need to manage:

There are currently four primary analyst roles in Snowflake:

Analysts are assigned to relevant roles and are explicitly granted access to the schemas they need.

Two notes on the permission bot:

Common errors that are encountered:

Data Storage

We currently use two databases: raw and analytics. The former is for EL'ed data; the latter is for data that is ready for analysis (or getting there).

All database clones are wiped away every week.
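
For context, a Snowflake clone is a zero-copy copy of a database. A sketch of how one might be created and later dropped; the clone name here is illustrative, not our actual naming convention:

-- Create a zero-copy clone of the analytics database
CREATE DATABASE analytics_clone CLONE analytics;

-- Clones that are no longer needed are dropped (ours are wiped weekly)
DROP DATABASE analytics_clone;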

Raw
Analytics

Managing Roles for Snowflake

Here are the proper steps for provisioning a new user and user role:

  • Login and switch to securityadmin role
  • Create user
    • User name: JSMITH - This is the GitLab default of first letter of first name and full last name.
    • Create a password using https://passwordsgenerator.net/
    • Click next and fill in additional info.
      • Make Login Name their email. This should match the user name just with @gitlab.com appended.
      • Display name should match the user name (all caps).
      • First and Last name can be normal.
    • Do not set any defaults
    • Send to person using https://onetimesecret.com/
  • Create role for user (JSMITH) with sysadmin as the parent role (this grants the role to sysadmin)
  • Grant user role to new user
  • Grant any additional roles to user
  • Add future grants on the analytics and analytics_staging schemas to the new role with grant select on future tables in schema <schema> to role <username>, using the sysadmin role (see the SQL sketch after this list)
  • Document in Snowflake config.yml permissions file
  • User should also be able to login via Okta.
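
The steps above map roughly to the following SQL. This is a sketch: JSMITH is the example user name from the steps, the personal details and password are placeholders, and the database-qualified schema names are assumptions.

USE ROLE securityadmin;

-- Create the user (login name is the email; display name matches the user name)
CREATE USER jsmith
  PASSWORD = '<generated password>'
  LOGIN_NAME = 'jsmith@gitlab.com'
  DISPLAY_NAME = 'JSMITH'
  FIRST_NAME = 'Jane'
  LAST_NAME = 'Smith';

-- Create the user's role with sysadmin as the parent role
CREATE ROLE jsmith;
GRANT ROLE jsmith TO ROLE sysadmin;
GRANT ROLE jsmith TO USER jsmith;

-- Future grants on the analytics and analytics_staging schemas, using the sysadmin role
USE ROLE sysadmin;
GRANT SELECT ON FUTURE TABLES IN SCHEMA analytics.analytics TO ROLE jsmith;
GRANT SELECT ON FUTURE TABLES IN SCHEMA analytics.analytics_staging TO ROLE jsmith;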

Snowflake Compute Resources

Compute resources in Snowflake are known as "warehouses". To better track and monitor our credit consumption, we have created several warehouses depending on who is accessing the warehouse. The names of the warehouses are appended with their size (e.g. analyst_s for small).
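
For example, an analyst session would pick the appropriately sized warehouse before running queries; analyst_s is the size-suffixed example above, and other warehouse names would follow the same pattern:

-- Switch this session to the small analyst warehouse
USE WAREHOUSE analyst_s;

-- List warehouses and their sizes to see what is available
SHOW WAREHOUSES;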

Timezones

All timestamp data in the warehouse should be stored in UTC. The default timezone for a Snowflake session is PT, but we have overridden this so that UTC is the default. This means that when current_timestamp() is queried, the result is returned in UTC.

Stitch explicitly converts timestamps to UTC. Fivetran does this as well (confirmed via support chat).
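
A quick way to sanity-check this in a Snowflake session:

-- Show the session timezone parameter (expected to be UTC given our override)
SHOW PARAMETERS LIKE 'TIMEZONE' IN SESSION;

-- Returns the current timestamp in the session timezone, i.e. UTC
SELECT current_timestamp();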

Transformation

Please see the data analyst onboarding issue template for details on getting started with dbt.

At times, we rely on dbt packages for some data transformation. Package management is built into dbt. A full list of available packages is on the dbt Hub site.

Tips and Tricks about Working with dbt

Schema References (aka What goes where)

Purpose Production Dev Config
For querying & analysis analytics emilie_scratch_analytics None
For querying & analysis staging analytics_staging emilie_scratch_staging {{ config({ "schema": "staging"}) }}
For querying & analysis but SENSITIVE analytics_sensitive emilie_scratch_analytics_sensitive {{ config({ "schema": "sensitive"}) }}
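
As an illustration of the Config column above, a model that should land in the staging schema declares the config at the top of its file; the model and ref names here are hypothetical:

{{ config({ "schema": "staging" }) }}

SELECT *
FROM {{ ref('some_base_model') }}

With this config, the model builds into analytics_staging in production and into the corresponding scratch staging schema (e.g. emilie_scratch_staging) in development, per the table above.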

Snapshots

Create snapshot tables with dbt snapshot

Snapshots are a way to take point-in-time copies of source tables. dbt has excellent documentation on how the snapshots work. Snapshots are stored in the snapshots folder of our dbt project. We have organised the different snapshots by data source for easy discovery.

The following is an example of how we implement a snapshot:

{% snapshot sfdc_opportunity_snapshots %}

    {{
        config(
          target_database='RAW',
          target_schema='snapshots',
          unique_key='id',
          strategy='timestamp',
          updated_at='systemmodstamp',
        )
    }}
    
    SELECT * 
    FROM {{ source('salesforce', 'opportunity') }}
    
{% endsnapshot %}

Key items to note:

Snapshots are tested manually by a maintainer of the Data Team project before merging.

Make snapshot tables available in the analytics data warehouse

As stated above, the RAW database and the snapshots schema are hard-coded in the config dictionary of dbt snapshots. That means that once these snapshot tables are created in RAW, we need to make them available in the ANALYTICS data warehouse in order to query them downstream (with Periscope or for further _xf dbt models).
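
A minimal sketch of such a base model, assuming a source definition pointing at the snapshots schema in RAW and using the validity columns dbt adds to snapshot tables; the source and column names are assumptions:

WITH source AS (

    SELECT *
    FROM {{ source('snapshots', 'sfdc_opportunity_snapshots') }}

)

SELECT
    id,
    dbt_valid_from,   -- added by dbt: when this version of the row became current
    dbt_valid_to      -- added by dbt: null for the currently valid row
FROM source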

Base models for snapshots are available in the folder /models/snapshots of our dbt project. Key items to note:

Configuration for dbt

Visualization

We use Periscope as our Data Visualization and Business Intelligence tool. To request access, please submit an access request.

Meta Analyses for the Data Team

Security

Passwords

Per GitLab's password policy, we rotate service accounts that authenticate only via passwords every 90 days. A record of systems changed and where those passwords were updated is kept in this Google Sheet.

Data Learning and Resources

Data Newsletters

Data Blogs

Data Visualization Resources

Data Slack Communities

Technical Learning Resources

Team Roles

Triager

The data team has a triager who is responsible for addressing dbt run and dbt test failures; labeling, prioritizing, and asking initial questions on created issues; and guiding and responding to questions from GitLabbers outside of the Data Team.

Having a dedicated triager on the team helps address the bystander effect. This is not an on-call position; the role simply names a clear owner. Through clear ownership, we create room for everyone else on the team to spend most of the day on deep work. The triager is encouraged to plan their day for the kind of work that can be accomplished successfully with this additional demand on their time.

The triager is not expected to know the answer to all the questions. They should pull in other team members who have more knowledge in the subject matter area to pass on the conversation after understanding the issue. Document any issues you stumble upon and learn about to help disseminate knowledge amongst all team members.

Slack channels to monitor for triaging

Triage schedule

The data team has implemented a triage schedule that takes advantage of folks' native timezones. The Data Team's Google Calendar is the source of truth for the triage schedule. A team member who is off, on vacation, or working on a high-priority project is responsible for finding coverage and communicating to the team who is taking over their coverage; this should be updated on the Data Team's Google Calendar.

UTC day Team member
Monday @derekatwood / @kathleentam
Tuesday @jjstark
Wednesday @mpeychet / @kathleentam
Thursday @iweeks / @kathleentam
Friday @eli_kastelein / @kathleentam

Reviewer

All GitLab data team members can, and are encouraged to, perform code review on merge requests of colleagues and community contributors. If you want to review merge requests, you can wait until someone assigns you one, but you are also more than welcome to browse the list of open merge requests and leave any feedback or questions you may have.

Note that while all team members can review all merge requests, the ability to accept merge requests is restricted to maintainers.

Maintainer

A maintainer in any of the data team projects is not synonymous with any job title. Here, the data team takes from the precedent set forward by the engineering division on the responsibilities of a maintainer. Every data team project has at least one maintainer, but most have multiple, and some projects (like Analytics) have separate maintainers for dbt and orchestration.

How to become a data team maintainer

We have guidelines for maintainership, but no concrete rules. Maintainers should have an advanced understanding of the GitLab Data projects' codebases. Prior to applying for maintainership, a person should have a good feel for the codebase, expertise in one or more domains, and a deep understanding of our coding standards.

Apart from that, someone can be considered as a maintainer when both:

  1. The MRs they've reviewed consistently make it through maintainer review without significant additional changes being required.
  2. The MRs they've written consistently make it through reviewer and maintainer review without significant required changes.

Once those are done, they should:

  1. Create an MR to add the maintainership to their team page entry.
  2. Explain in the MR body why they are ready to take on that responsibility.
  3. Use specific examples of recent "maintainer-level" reviews that they have performed.
  4. These MRs should reflect not only small changes to the codebase, but also architectural ones and ones that create a fully functioning addition.
  5. Assign the MR to their manager and mention the existing maintainers of the relevant project (Infrastructure, Analytics, etc) and area (dbt, Airflow, etc.).
  6. If the existing maintainers of the relevant group (e.g., dbt) do not have significant objections, and if at least half of them agree that the reviewer is indeed ready, we've got ourselves a new maintainer!

Job Descriptions

Data Analyst

Job Family

Data Engineer

Job Family

Manager

Job Family

Director of Data and Analytics

Job Family