The Data Team at GitLab is working to establish a world-class analytics function by utilizing the tools of DevOps in combination with the core values of GitLab. We believe that data teams have much to learn from DevOps. We will work to model good software development best practices and integrate them into our data management and analytics.
A typical data team has members who fall along a spectrum of skills and focus. For now, the analytics function at GitLab has Data Engineers and Data Analysts; eventually the team will include Data Scientists. Analysts are divided between the Core Data Function and specialized roles embedded in different functions across the company.
Data Engineers on our team are essentially software engineers who have a particular focus on data movement and orchestration. The transition to DevOps is typically easier for them because much of their work is done using the command line and scripting languages such as Bash and Python. One particular challenge is data pipelines: most pipelines are not well tested, data movement is not typically idempotent, and auditability of history is challenging.
Data Analysts are further from DevOps practices than Data Engineers. Most analysts use SQL for their analytics and queries, with Python or R a close second. In the past, data queries and transformations may have been done by custom tooling or software written by other companies. These tools and approaches share similar traits in that they're likely not version controlled, there are probably few tests around them, and they are difficult to maintain at scale.
Data Scientists are probably furthest from integrating DevOps practices into their work. Much of their work is done in tools like Jupyter Notebooks or RStudio. Those who do machine learning create models that are not typically version controlled. Data management and accessibility are also concerns.
We will work closely with the analytics community to find solutions to these challenges. Some of the solutions may be cultural in nature, and we aim to be a model for other organizations of how a world-class Data and Analytics team can utilize the best of DevOps for all Data Operations.
Some of our beliefs are:
Analysis usually begins with a question. A stakeholder will ask a question of the data team by creating an issue in the Data Team project using the appropriate template. The analyst assigned to the project may schedule a discussion with the stakeholder(s) to further understand the needs of the analysis. This meeting will allow for analysts to understand the overall goals of the analysis, not just the singular question being asked, and should be recorded. Analysts looking for some place to start the discussion can start by asking:
An analyst will then update the issue to reflect their understanding of the project at hand. This may mean turning an existing issue into a meta issue or an epic. Stakeholders are encouraged to engage on the appropriate issues. The issue then becomes the SSOT for the status of the project, indicating the milestone to which it's been assigned and the analyst working on it, among other things. Barring any confidentiality concerns, the issue is also where the final project will be delivered. On delivery, the data team manager will be cc'ed, and s/he will provide feedback and/or request changes. When satisfied, s/he will close the issue. If the stakeholder would like to request a change after the issue has been closed, s/he should create a new issue and link to the closed issue.
The Data Team can be found in the #analytics channel on Slack.
The data team currently works in two-week intervals, called milestones. Milestones start on Tuesdays and end on Mondays. This discourages last-minute merging on Fridays and allows the team to have milestone planning meetings at the top of the milestone.
Milestones may be three weeks long if they cover a major holiday or if the majority of the team is on vacation. As work is assigned to a person and a milestone, it gets a weight assigned to it.
| Weight | Description |
| --- | --- |
| Null | Meta, Discussion, or Documentation issues that don't result in an MR |
| 0 | Should not be used. |
| 1 | The simplest possible change. We are confident there will be no side effects. |
| 2 | A simple change (minimal code changes), where we understand all of the requirements. |
| 3 | A simple change, but the code footprint is bigger (e.g. lots of different files, or tests affected). The requirements are clear. |
| 5 | A more complex change that will impact multiple areas of the codebase; there may also be some refactoring involved. Requirements are understood but you feel there are likely to be some gaps along the way. |
| 8 | A complex change that will involve much of the codebase or will require lots of input from others to determine the requirements. |
| 13 | A significant change that may have dependencies (other teams or third parties) and we likely still don't understand all of the requirements. It's unlikely we would commit to this in a milestone, and the preference would be to further clarify requirements and/or break it into smaller issues. |
Think of each of these groups of labels as ways of bucketing the work done. All issues should get the following classes of labels assigned to them:
Optional labels are useful for communicating state or other priority:
| Priority | Description | Probability of shipping in milestone |
| --- | --- | --- |
| P1 | Urgent: top priority for achieving in the given milestone. These issues are the most important goals for a milestone and should be worked on first; some may be time-critical or unblock dependencies. | ~100% |
| P2 | High: important issues that deliver significant positive impact to the business or reduce technical debt. Important, but not time-critical or blocking others. | ~75% |
| P3 | Normal: incremental improvements to existing features. These are important iterations, but deemed non-critical. | ~50% |
| P4 | Low: stretch issues that are acceptable to postpone into a future milestone. | ~25% |
Ideally, your workflow should be as follows:
cc `@user` in a comment.
`WIP:` label, mark the branch for deletion, mark squash commits, and assign to the project's maintainer. Ensure that the attached issue is appropriately labeled and pointed.
We use GitLab to operate and manage the analytics function. Everything starts with an issue. Changes are implemented via merge requests, including changes to our pipelines, extraction, loading, transformations, and parts of our analytics.
| Stage | Tool(s) |
| --- | --- |
| Extraction | Stitch and Custom |
| Loading | Stitch and Custom |
| Orchestration | Airflow and GitLab CI |
| Storage | Cloud SQL (PostgreSQL) and Snowflake |
| Transformations | dbt and Python scripts |
We currently use Stitch for most of our data sources.
| Data Source | Pipeline | Management Responsibility | Frequency |
| --- | --- | --- | --- |
| CloudSQL Postgres | Stitch | Data Team | |
| GitLab dot Com | | | |
| Marketo | Stitch | Data Team | 12-hour intervals; backfilled from January 1, 2013 |
| Netsuite | Stitch | Data Team | 30-minute intervals; backfilled from January 1, 2013 |
| SFDC | Stitch | Data Team | 1-hour intervals; backfilled from January 1, 2013 |
| Zendesk | Stitch | Data Team | 1-hour intervals; backfilled from January 1, 2013 |
| Zuora | Stitch | Data Team | 30-minute intervals; backfilled from January 1, 2013 |
Process for adding a new data source:
SheetLoad is the process by which a Google Sheet, local CSV, or file from GCS can be ingested into the data warehouse. This is not an ideal solution for getting data into the warehouse, but it may be the appropriate solution at times.
As it is being iterated on often, the best place for up-to-date info on SheetLoad is the SheetLoad readme.
We are in the process of moving from GitLab CI to Airflow.
We currently use Snowflake as our data warehouse.
To gain access to the data warehouse:
Managing Roles for Snowflake
Here are the proper steps for provisioning a new user and user role:

- Create a role for the user (`EBURKE`, for example) with `sysadmin` as the parent role (this grants the role to sysadmin).
- Create a scratch schema for the user and grant their role ownership of it:

```sql
CREATE SCHEMA eburke_scratch;
GRANT OWNERSHIP ON SCHEMA eburke_scratch TO ROLE eburke;
```
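The role-creation step can be sketched in SQL as well. Note that the `CREATE USER` statement below is an assumption for illustration (the steps above only cover the role and schema); the name `eburke` mirrors the example, and the password and user properties would be set per your own policy.

```sql
-- Sketch: provisioning a hypothetical user "eburke".

-- Create the user's role and make sysadmin its parent role.
CREATE ROLE eburke;
GRANT ROLE eburke TO ROLE sysadmin;

-- Assumed step: create the user and grant them their role.
CREATE USER eburke
  PASSWORD = '<temporary password>'
  DEFAULT_ROLE = eburke
  MUST_CHANGE_PASSWORD = TRUE;
GRANT ROLE eburke TO USER eburke;
```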
Please see the data analyst onboarding issue template for details on getting started with dbt.
At times, we rely on dbt packages for some data transformations. Package management is built into dbt. A full list of available packages is on the dbt Hub site. We use the repository syntax instead of the hub syntax in our
A dbt model with an `_xf` suffix should be a `BEAM*` table, which means it follows the business event analysis & model structure and answers the who, what, where, when, how many, why, and how question combinations that measure the business.
- `source table` (can also be called a `raw table`): a table coming directly from the data source as configured by the manifest. It is stored directly in a schema that indicates its original data source, e.g.
- `base models`: the only dbt models that reference the source table. Base models have minimal transformational logic (usually limited to filtering out rows with data integrity issues or rows actively flagged as not for analysis, and renaming columns for easier analysis). They can be found in the `analytics_staging` schema and are used in
- `end-user models`: dbt models used for analysis. The final version of a model will likely be indicated with an `_xf` suffix when its goal is to be a `BEAM*` table. It should follow the business event analysis & model structure and answer the who, what, where, when, how many, why, and how question combinations that measure the business. End-user models are found in the
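To illustrate how minimal a base model's logic should be, here is a sketch for a hypothetical `customers` source table. The schema, table, and column names are assumptions for illustration, not actual GitLab models.

```sql
-- Hypothetical base model: only renaming and integrity filtering.
with source as (

    select * from raw_sfdc.customers

),

renamed as (

    select
        id         as customer_id,
        full_name  as customer_name,
        created_at as customer_created_at
    from source
    -- filter out rows with data integrity issues
    where id is not null

)

select * from renamed
```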
We are currently evaluating multiple visualization tools. To request access, please submit an access request.