Analysis usually begins with a question. A stakeholder will ask a question of the data team by creating an issue in the Data Team project using the appropriate template. The analyst assigned to the project may schedule a discussion with the stakeholder(s) to further understand the needs of the analysis. This meeting allows analysts to understand the overall goals of the analysis, not just the singular question being asked, and should be recorded. Analysts looking for a place to start the discussion can ask:
An analyst will then update the issue to reflect their understanding of the project at hand. This may mean turning an existing issue into a meta issue or an epic. Stakeholders are encouraged to engage on the appropriate issues. The issue then becomes the SSOT for the status of the project, indicating the milestone to which it has been assigned and the analyst working on it, among other things. Barring any confidentiality concerns, the issue is also where the final project will be delivered. On delivery, the data team manager will be cc'ed, and s/he will provide feedback and/or request changes. When satisfied, s/he will close the issue. If the stakeholder would like to request a change after the issue has been closed, s/he should create a new issue and link to the closed issue.
The Data Team can be found in the #analytics channel on Slack.
The data team currently works in two-week intervals, called milestones. Milestones start on Tuesdays and end on Mondays. This discourages last-minute merging on Fridays and allows the team to have milestone planning meetings at the top of the milestone.
Milestones may be three weeks long if they cover a major holiday or if the majority of the team is on vacation. As work is assigned to a person and a milestone, it gets a weight assigned to it.
An issue with `NULL` points belongs to a meta issue, where the sub-issues get points.
Ideally, your workflow should be as follows:
When the work is ready for review, remove the `WIP:` label, mark the branch for deletion, mark squash commits, and assign the merge request to the project's maintainer.
We currently use Stitch for most of our data sources. In some cases, we use Meltano.
| Data Source | Pipeline | Management Responsibility |
| ----------- | -------- | ------------------------- |
| CloudSQL Postgres | Stitch | Data Team |
| Pings | Meltano | The Meltano team will investigate the source of the problem. If the issue is in the Meltano tap, the Meltano team will fix it. If not, the data team is responsible. |
Process for adding a new data source:
SheetLoad is the process by which a Google Sheet can be ingested into the data warehouse. It is not an ideal way to get data into the warehouse, but may be the appropriate solution at times.
How to use SheetLoad
SheetLoad expects there to be three parts to a name. The file name (Google Sheet name) should be `schema.value`, where the `value` is useful for understanding the file's purpose. The table's name will come from the tab name in the file. The extract file, thus, will include `schema.value.tablename`, but the table will appear in the database under the schema as `tablename`.
We are in the process of moving from GitLab CI to Airflow.
We currently use Snowflake as our data warehouse.
To gain access to the data warehouse:
Managing Roles for Snowflake
Here are the proper steps for provisioning a new user and user role:

- Create a role for the user (`EBURKE`, for example) with `SYSADMIN` as the parent role (this grants the role to sysadmin)
- Create a scratch schema for the user and grant the new role ownership of it:
  - `CREATE SCHEMA eburke_scratch;`
  - `GRANT OWNERSHIP ON SCHEMA eburke_scratch TO ROLE eburke;`
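Assuming a user named EBURKE, the full provisioning might be sketched in Snowflake SQL as follows; the user name and password placeholder are illustrative, not a prescribed procedure:

```sql
-- Illustrative sketch only; the user name and password placeholder are examples.
CREATE USER eburke PASSWORD = '<temporary-password>' MUST_CHANGE_PASSWORD = TRUE;
CREATE ROLE eburke;
GRANT ROLE eburke TO ROLE sysadmin;   -- SYSADMIN as the parent role
GRANT ROLE eburke TO USER eburke;

-- Scratch schema owned by the new role
CREATE SCHEMA eburke_scratch;
GRANT OWNERSHIP ON SCHEMA eburke_scratch TO ROLE eburke;
```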
An `_xf` dbt model should be a `BEAM*` table, which means it follows the business event analysis & model structure and answers the who, what, where, when, how many, why, and how question combinations that measure the business.
- `source table` (can also be called a `raw table`) - a table coming directly from the data source as configured by the manifest. It is stored directly in a schema that indicates its original data source.
- `base models` - the only dbt models that reference the source table; base models have minimal transformational logic (usually limited to filtering out rows with data integrity issues or rows actively flagged as not for analysis, and renaming columns for easier analysis); they can be found in the `analytics` schema and are used in end-user models.
- `end-user models` - dbt models used for analysis. The final version of a model will likely be indicated with an `_xf` suffix when its goal is to be a `BEAM*` table. It should follow the business event analysis & model structure and answer the who, what, where, when, how many, why, and how question combinations that measure the business.
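As a sketch of a base model under these conventions (the source, table, and column names below are hypothetical, not actual GitLab sources):

```sql
-- Hypothetical base model, e.g. models/base/customers.sql in a dbt project.
-- Source, table, and column names are illustrative only.
with source as (

    select * from {{ source('app_db', 'customers') }}

)

select
    id         as customer_id,   -- rename columns for easier analysis
    full_name  as customer_name,
    created_at
from source
where id is not null             -- filter out rows with data integrity issues
```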
Looker is GitLab's data visualization tool. Many modern data visualization tools require analysts to write SQL queries; Looker's unique advantage lies in its modeling layer, which allows non-technical end users to build their own data analysis in a drag-and-drop user interface. While this means that the initial configuration (e.g., setting up a new data source) takes longer than just querying a table would, once the initial configuration is done, you have a new data set (a Looker explore) available for all users to take advantage of. The data team aims for Looker to be the SSOT for all of GitLab.
To get initial Looker access, please create a new access issue following Security's procedures in this project. There are multiple levels of user access: View-Only, Explorer, and Developer. All GitLab Employees should be a part of their respective team group as well as the `GitLab Employee` group. If you do not have access, please see the handbook on how to request it.
We have four role types: View-Only, Explorer, Developer, and Admin.
Users coming from Redash or another query-based data visualization tool, especially those with a strong familiarity with SQL, may find themselves uniquely frustrated at how long it can take to answer a "simple" question when doing so aims to take advantage of new data. Any new data sources need to be brought into the data warehouse, modeled in dbt following the team's dbt coding conventions, modeled in a LookML view, and added to a new or existing explore before an analysis can be built on top of it. While this initial configuration might seem like a bit of a slog, it moves all of the analysis configuration to an up-front responsibility, making the explore you build usable not just for you, but for future users with related questions too.
LookML - Looker's proprietary language for describing dimensions, aggregates, calculations, and data relationships in a SQL database. The Looker app uses a model written in LookML to construct SQL queries against a particular database.
LookML Project - A LookML project is a collection of LookML files that describe a set of related models, explores, views, and LookML dashboards.
By convention, LookML code is segregated into three types of files: model files, view files, and dashboard files. In new LookML, these files have the following extensions: .model.lkml, .view.lkml, and .dashboard.lookml, respectively.
Model - A model is a customized portal into the database, designed to provide intuitive data exploration for specific business users. Multiple models can exist for the same database connection in a single LookML project. Each model can expose different data to different users. For example, sales agents need different data than company executives, and so you would probably develop two models to offer views of the database appropriate for each user. A model, in turn, is essentially metadata about the source database.
View - A view declaration defines a list of fields (dimensions or measures) and their linkage to an underlying table or derived table. In LookML a view typically references an underlying database table, but it can also represent a derived table.
A view may join to other views. The relationship between views is typically defined as part of an explore declaration in a model file.
Explore - An explore is a view that users can query. You can think of the explore as a starting point for a query, or in SQL terms, as the FROM in a SQL statement. An explore declaration specifies the join relationships to other views.
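A minimal explore declaration might look like the following; the `orders` and `customers` views are hypothetical:

```lookml
explore: orders {
  join: customers {
    sql_on: ${orders.customer_id} = ${customers.id} ;;
    relationship: many_to_one
  }
}
```

Here `orders` acts as the FROM of the generated query, and `customers` is joined in on the specified keys.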
Derived Table - A derived table is a table composed of values from other tables, which is accessed as though it were a physical table with its own set of columns. A derived table is exposed as its own view using the derived_table parameter and defines dimensions and measures in the same manner as conventional views. Users can think of a derived table as similar to a database view (not to be confused with a Looker view).
Dimension Group - The dimension_group parameter is used to create a set of time-based dimensions all at once. For example, you could easily create a date, week, and month dimension based on a single timestamp column.
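For example, a single timestamp column could be expanded like this (the field and column names are hypothetical):

```lookml
dimension_group: created {
  type: time
  timeframes: [date, week, month]
  sql: ${TABLE}.created_at ;;
}
```

This one declaration yields `created_date`, `created_week`, and `created_month` dimensions from the single `created_at` column.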
When adding a new `_xf` view, be sure to limit explores already referencing that view with sets, so as not to accidentally clutter existing explores with new/irrelevant data.
`Technical Account Manager` should be `Technical Account Manager (TAM)`. This is to make it easier for users to search for fields within Looker and avoids ambiguity.
`#4b4ba3`, `#7c7ccc`, `#e05842`, `#fca121`, `#2e87e0`, `#37b96d`, `#707070`, `#bababa`, `#494C5`, based on the GitLab style guide.
`$#,##0` is an appropriate format for most dollar values.