We use GitLab to operate and manage the analytics function. Everything starts with an issue. Changes are implemented via merge requests, including changes to our pipelines, extraction, loading, transformations, and parts of our analytics.
|Extraction||Stitch, Fivetran, and Custom|
|Loading||Stitch, Fivetran, and Custom|
|Transformations||dbt and Python scripts|
|Analysis||Sisense For Cloud Data Teams|
We currently use Stitch and Fivetran for most of our data sources. These are off-the-shelf ELT tools that remove the responsibility of building, maintaining, or orchestrating the movement of data from some data sources into our Snowflake data warehouse. We run a full-refresh of all of our Stitch/Fivetran data sources at the same time that we rotate our security credentials (approx every 90 days). Prior to running a full refresh we will drop all of the tables.
For source ownership please see the Tech Stack Applications sheet (internal only).
RAWdatabase where data is stored.
PREPdatabase where source models are materialized.
xindicates undefined or not run
|Data Source||Pipeline||Raw Schema||Prep Schema||Audience||RF / SLO|
||Data Team||9h / 9h|
||People||12h / 24h|
|Clearbit||x||x||x||x / x|
||Product||2 weeks / x|
||Product, Engineering||6h / x|
|GitLab Ops DB||pgp||
||Engineering||6h / x|
|GitLab Profiler DB||x||x||x||x||x / x|
|Google Analytics 360||Fivetran||
||Marketing||6h / 32h|
|Google Cloud Billing||x||
||Engineering||x / x|
||x||Engineering||24h / 48h|
||People||24h / 48h|
|Handbook YAML Files||Airflow||
||Multiple||8h / 24h|
|Handbook MR Data||Airflow||
||Multiple||24h / 24h|
|Handbook Git Log Data||Airflow||
||Multiple||1w / 1m|
|LicenseDot ERD||Manual Process||
||Product||2 weeks / x|
||x||Marketing||x / x|
||Finance||6h / 24h|
||x||x / x|
||Marketing||12h / 48h|
||Sales||1h / 24h|
||Multiple||24h / 48h|
||Product||15m / 24h|
|Version DB||Manual Process||
||Product||2 weeks / x|
||Support||1h / 24h|
||Finance||30m / 24h|
Process for adding a new data source:
To add new fields to the BambooHR extract:
In order to integrate new data sources into the data warehouse, specific members of the Data team will need admin-level access to data sources, both in the UI and through the API. We need this admin-level access through the API in order to pull all the data needed to build the appropriately analyses and through the UI to compare the results of prepared analyses to the UI.
Sensitive data sources can be limited to no less than 1 data engineer and 1 data analyst having access to build the require reporting. In some cases, it may only be 2 data engineers. We will likely request an additional account for the automated extraction process.
Sensitive data is locked down through the security paradigms listed below; Sisense will never have access to sensitive data, as Sisense does not have access to any data by default. Sisense's access is always explicitly granted.
To ensure that the data team has a complete picture of where sensitive data is in the data warehouse, as well as make sure Sisense does not have access to sensitive data, a periodic scan of the data warehouse is made using dbt along with the internally-developed library of tools created as
datasiren. This scan is currently executed weekly. The fine-grained results are stored in Snowflake in the
PREP.DATASIREN schema and are not available in Periscope because of sensitivity reasons. High-level results have been made available in Periscope, including the simple dashboard found here.
SheetLoad is the process by which a Google Sheets and CSVs from GCS or S3 can be ingested into the data warehouse.
Technical documentation on usage of SheetLoad can be found in the readme in the data team project.
If you want to import a Google Sheet or CSV into the warehouse, please make an issue in the data team project using the "CSV or GSheets Data Upload" issue template. This template has detailed instructions depending on the type of data you want to import and what you may want to do with it.
SheetLoad should primarily be used for data whose canonical source is a spreadsheet - i.e. Sales quotas. If there is a source of this data that is not a spreadsheet, you should at least make an issue for a new data source to get the data pulled automatically. However, if the spreadsheet is the SSOT for this data, then SheetLoad is the appropriate mechanism for getting it into the warehouse.
SheetLoad is designed to make the table in the database an exact copy of the sheet from which it is loading. Whenever SheetLoad detects a change in the source sheet, it will drop the database table and recreate it in the image of the updated spreadsheet. This means that if columns are added, changed, etc. it will all be reflected in the database. Changes are detected within 24 hours.
Except for where absolutely not possible, it is best that the SheetLoad sheet import from the original Google Sheet directly using the
importrange function. This allows you to leave the upstream sheet alone while enabling you to format the SheetLoad version to be plain text. Any additional data type conversions or data cleanup can happen in the base dbt models. (This does not apply to the Boneyard.)
Some additional considerations for preparing the data for loading:
GitLab - User Name
nullin the database.
Before regular SheetLoad data can be accessible by Sisense, it has to be modeled via dbt. A minimum of 2 models will be made for the sheet: a source model and a staging model. These will be made by a member of the data team once you've made an issue in the data team project.
boneyard schema is where data can be uploaded from a spreadsheet and it will be directly available for querying within Sisense. However, this is for instances where a one-off analysis is needed and/or you need to join this to data already in the warehouse. It is called Boneyard to highlight that this data is relevant only for an ad hoc/one off use case and will become stale within a relatively short period of time. We will periodically remove stale data from the
If you are adding Certificates to SheetLoad, refer to the instructions in the People Group page
The Qualtrics data push process, also known in code as
Qualtrics SheetLoad, enables emails to be uploaded to Qualtrics from the data warehouse without having to be downloaded onto a team member's machine first. This process shares its name with SheetLoad because it looks through Google Sheets for files with names starting with
qualtrics_mailing_list. For each of the files it finds with an
id column as the first column, it uploads that file to Snowflake. The resulting table is then joined with the GitLab user table to retrieve email addresses. The result is then uploaded to Qualtrics as a new mailing list.
During the process, the Google Sheet is updated to reflect the process' status. The first column's name is set to
processing when the process begins, and then is set to
processed when the mailing list and contacts have been uploaded to Qualtrics. Changing the column name informs the requester of the process' status, assists in debugging, and ensures that a mailing list is only created once for each spreadsheet.
The end user experience is described on the UX Qualtrics page.
Refer to the Snowplow Infrastructure page for more information on our setup.
We currently use Snowflake as our data warehouse. The Enterprise Data Warehouse (EDW) is the single source of truth for GitLab's corporate data, performance analytics, and enterprise-wide data such as Key Performance Indicators. The EDW supports GitLab’s data-driven initiatives by providing all teams a common platform and framework for reporting, dashboarding, and analytics. With the exception of point-to-point application integrations all current and future data projects will be driven from the EDW. As a recipient of data from a variety of GitLab source systems, the EDW will also help inform and drive Data Quality best-practices, measures, and remediation to help ensure all decisions are made using the best data possible.
To gain access to the data warehouse:
We follow this general strategy for role management:
sysadmin) are assigned directly to users
Every user will have their own user role that should match their user name.
Object level permissions (database, schemas, tables) in Snowflake can only be granted to roles.
Roles can be granted to users or to other roles.
We strive to have all privileges flow through the user role so that a user only has to use one role to interact with the database.
Exceptions are privileged roles such as
These roles grant higher access and should be intentionally selected when using.
Functional roles represent a group of privileges and role grants that typically map to a job family.
The major exception is the analyst roles.
There are several variants of the analyst role which map to different areas of the organization.
analyst_people, and more.
Analysts are assigned to relevant roles and are explicitly granted access to the schemas they need.
Functional roles can be created at any time. It makes the most sense when there are multiple people who have very similar job families and permissions.
Object roles are for managing access to a set of data.
Typically these represent all of the data for a given source.
zuora object role is an example.
This role grants access to the raw Zuora data coming from Stitch, and also to the source models in the
When a user needs access to Zuora data, granting the
zuora role to that user's user role is the easiest solution.
If for some reason access to the object role doesn't make sense, individual privileges can be granted at the granularity of a table.
This is an example role hierarchy for an Data Analyst, Core:
This is an example role hierarchy for an Data Engineer and Account Administrator:
This is an example role hierarchy for a Security Operations Engineer:
Managing Roles for Snowflake
Here are the proper steps for provisioning a new user and user role:
user_provision.sqlscript and replace the email, firstname, and lastname values in the initial block
Compute resources in Snowflake are known as "warehouses".
To better track and monitor our credit consumption, we have created several warehouses depending on who is accessing the warehouse.
The names of the warehouse are appended with their size (
analyst_xs for extra small)
|warehouse||purpose||max query (minutes)|
||This is for permission bot and other admin tasks||10|
||For testing Airflow locally||30|
||These are for Data Analysts to use when querying the database or modeling data||30|
||These are for Data Engineers and the Manager to use when querying the database or modeling data||30|
||This is exclusively for Fivetran to use||30|
||This is for extraction jobs that pull from GitLab internal Postgres databases||10|
||This is for our Extract and Load jobs||60|
||These are scoped to GitLab CI for dbt jobs within a merge request||60|
||This is for the BI tool for querying. Note that Sisense enforces a 4 minute timeout.||30|
||This is exclusively for Stitch to use||30|
||This is for the Meltano team to test their Snowflake loader||5|
||These are for production dbt jobs||60|
If you're running into query time limits consider using a larger warehouse.
We use three primary databases:
raw database is where data is first loaded into Snowflake; the other databases are for data that is ready for analysis (or getting there).
There is a
snowflake database, which contains information about the entire GitLab instance.
This includes all tables, views, queries, users, etc.
There is a
covid19 database, which is a shared database managed through the Snowflake Data Exchange.
There is a
testing_db database, which is used for testing Permifrost.
All databases not defined in our
roles.yml Permifrost file are removed on a weekly basis.
This database is not available to query in Sisense.
This database is not available to query in Sisense.
Target State Schema:
For data warehouse use cases that require us to store data for our users without updating it automatically with dbt we use the
STATIC database. This also allows for analysts and other users to create their own data resources (tables, views, temporary tables). There is a sensitive schema for sensitive data within the static database. If your use case for static requires the use or storage of sensitive data please create an issue for the data engineers.
All timestamp data in the warehouse should be stored in UTC. The default timezone for a Snowflake sessions is PT, but we have overridden this so that UTC is the default. This means that when
current_timestamp() is queried, the result is returned in UTC.
Stitch explicitly converts timestamps to UTC. Fivetran does this as well (confirmed via support chat).
We use the term snapshots in multiple places throughout the data team handbook and the term can be confusing depending on the context. Snapshots as defined by the dictionary is "a record of the contents of a storage location or data file at a given time". We strive to use this definition whenever we use the word.
The most common usage is in reference to dbt snapshots. When dbt snapshots is run, it takes the state of the data based on a query specified by the user and updates a table that contains the full history of the state of the data. It has
valid_from fields indicating the time period for which that particular snapshot is valid. See the dbt snapshots section in our dbt guide for more technical information.
The tables generated and maintained by dbt snapshots are the raw historical snapshot tables. We will build downstream models on top of these raw historical snapshots for further querying. The snapshots folder is where we store the dbt models. One common model we may build is one that generate a single entry (i.e. a single snapshot) for a given day; this is useful when there are multiple snapshots taken in a 24 hour period. We also will build models to return the most current snapshot from the raw historical table.
Our Greenhouse data can be thought of as a snapshot. We get a daily database dump provided by Greenhouse that we load into Snowflake. If we start taking dbt snapshots of these tables then we would be creating historical snapshots of the Greenhouse data.
The extracts we do for some yaml files and for BambooHR can also be thought of as snapshots. This extraction works by taking the full file/table and storing it in its own, timestamped row in the warehouse. This means we have historical snapshots for these files/tables but these are not the same kind of snapshot as dbt. We'd have to do additional transformations to get the same
valid_to. For BambooHR and yaml extracts these correspond to the last time the extraction job was run. For Greenhouse raw, this represents the data as it is in the warehouse. Were we to start taking snapshots of the Greenhouse data the speaker would have to clarify if they mean the raw table or the latest record in the historical snapshots table.
For an extra layer of robustness, we backup data from the warehouse into GCS (Google Cloud Storage). We execute the jobs using dbt's
run-operation capabilities. Currently, we backup all of our snapshots daily and retain them for a period of 1 month. We implemented the basic instructions outlined in this Calogica blog post.
Data Customers expect Data Teams to provide data they can trust to make their important decisions. And Data Teams need to be confident in the quality of data they deliver. But this is a hard problem to solve: the Enterprise Data Platform is complex and involves multiple stages of data processing and transformation, with tens to hundreds of developers and end-users actively changing and querying data 24 hours a day. The Trusted Data Framework (TDF) supports these quality and trust needs by defining a standard framework for data testing and monitoring across data processing stages, accessible by technical teams and business teams. Implemented as a stand-alone module separate from existing data processing technology, the TDF fulfills the need for an independent data monitoring solution.
The primary elements of the TDF include:
The TDF embraces business users as the most important participant in establishing trusted data and uses a simple and accessible testing model. With SQL and YAML as a test agent, a broad group of people can contribute test cases. The test format is straightforward with simple PASS/FAIL results and just four test case types. Adoption grows quickly as TDF demonstrates value:
Over time, it is not uncommon to develop hundreds of tests cases which are run on a daily basis, continually validating data quality.
SQL is the universal language in databases and nearly everyone who works with data has some level of SQL competency. However, not everyone may be familiar with SQL and we don't want that to limit who can contribute. We use dbt to support the TDF which enables the defining of tests via SQL and YAML.
With all tests being run via dbt, storing tests results is simple. We store the results of every test run in the data warehouse. Storing test results enables a variety of valuable features, including:
These test results are parsed and are available for querying in Sisense.
The Data Warehouse environment can change quickly and the TDF supports predictability, stability, and quality with test coverage of the areas in the Data Warehouse that are most likely to change:
The implementation details of these tests are documented in our dbt guide.
More to come.
More to come.
Per GitLab's password policy, we rotate service accounts that authenticate only via passwords every 90 days. A record of systems changed and where those passwords were updated is kept in this Google Sheet.
We also rotate Snowflake user passwords the first Sunday of every 3rd month of the year (January, April, July, October) via the Snowflake Password Reset DAG.
The data team is responsible for provisioning users within the tools managed by the Data Team. This includes tools like Fivetran, Stitch, and Snowflake.
For Snowflake, we have a robust process documented in the Snowflake Permissions Paradigm section of this page.
For other tools, add users via the UI and in the appropriate Google Group if one exists.