The Enterprise Dimensional Model (EDM) is GitLab's centralized data model, designed to enable and support the highest levels of accuracy and quality for reporting and analytics. The data model follows the Kimball technique, including a Bus Matrix and Entity Relationship Diagram. Dimensional Modeling is the third step of our overarching Data Development Approach (after Requirements definition and UI Wireframing) and this overall approach enables us to repeatedly produce high-quality data solutions. The EDM is housed in our Snowflake Enterprise Data Warehouse and is generated using dbt.
Many Sisense dashboards across the company are powered by the EDM.
These diagrams show the relationships between data objects in the Enterprise Dimensional Model across the major business process flywheels.
Dimensional modeling is part of the Business Dimensional Lifecycle methodology developed by Ralph Kimball, which includes a set of methods, techniques, and concepts for use in data warehouse design. Kimball describes it as "a logical design technique that seeks to present the data in a standard, intuitive framework that allows for high-performance access."

Dimensional Modeling is business process oriented and can be built in 4 steps:

1. Select the business process.
2. Declare the grain.
3. Identify the dimensions.
4. Identify the facts.
Dimensional modeling always uses the concepts of facts (measures) and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts.

In the simplest version, the fact table is a central table linked to dimension tables with foreign keys, creating a star schema. A star schema whose dimension tables link to further dimension tables is called a snowflake schema, and schemas with multiple fact tables are called galaxies.
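As an illustration of the star schema shape, the hypothetical query below joins a central fact table to two dimension tables on their foreign keys and aggregates a measure; none of these table or column names are actual EDM models.

```sql
-- Hypothetical star schema query: a central fact joined to two dimensions.
SELECT
  dim_date.fiscal_quarter_name,
  dim_customer.customer_segment,
  SUM(fct_order.order_amount) AS total_order_amount  -- additive measure from the fact
FROM fct_order
LEFT JOIN dim_date
  ON fct_order.order_date_id = dim_date.date_id
LEFT JOIN dim_customer
  ON fct_order.dim_customer_sk = dim_customer.dim_customer_sk
GROUP BY 1, 2
```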
Within the Enterprise Data Warehouse:

- Raw data is cleaned and prepared in the `PREP` database using source specific schema names.
- Build `PREP` tables in the `COMMON_PREP` schema in the `PROD` database as required by the use case. Building `PREP` tables in the `COMMON_PREP` schema is optional and depends on the use case.
- Dimension tables include an `MD5('-1')` key value to represent missing members.
- Fact tables use the `MD5('-1')` key value for missing keys.

When the canonical source of data is NOT a CSV file type upload, our preference is to extract the data directly from the source system application. However, in some cases, a short-term workaround such as a seed file, Sheetload, or Driveload file extraction is acceptable. Be cautious in these situations; short-term workarounds frequently turn into technical debt, and it may be wise to draft a removal plan before implementing a short-term fix. While these approaches should be used sparingly in the Enterprise Dimensional Model, the use of seed, Sheetload, or Driveload files is fine for the workspace schemas.
When the Canonical source of a data set is inherently a CSV file type, we have several options to extract the data into Snowflake. All the options have different pros and cons.
dbt seed: This is a solid option. dbt seed files are version controlled in GitLab and easy to update via a merge request in the GitLab Web IDE. A con is that dbt recommends seeds only for low row counts, around 1,000 rows.
GCP Driveload: This is a solid option as well. It is stable and has no unexpected changes to data. A con is that you have to load a new file to Driveload to update the data.
Sheetload: Sheetload makes it easy for team members to enter data and self-serve in GSheets. However, there are several issues associated with Sheetload: it is unstable and has broken many times, causing downstream data to get stale; it is hard to keep SOX compliant; it is an uncontrolled entry point into the data warehouse. In certain cases, we've had to disable the Sheetload model extraction to ensure trust for Tier 1 data assets and prevent unexpected changes to data.
Of the 3 options, dbt seed is the highest preference when the use case requires less than 1000 records of data. GCP Driveload is preferred when a row limitation prevents the use of dbt seed. Sheetload is the last option for any of our Tier 1 Data Assets due to low stability and high difficulty of audit.
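As a sketch of the dbt seed option, the CSV lives in the dbt project, is loaded with the `dbt seed` command, and can then be referenced from downstream models like any other model; the seed name and columns below are hypothetical.

```sql
-- Hypothetical model selecting from a version-controlled seed
-- (a CSV in the dbt project loaded via `dbt seed`).
SELECT
  country_name,
  sales_region
FROM {{ ref('country_sales_region_mapping') }}
```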
It is critical to be intentional when organizing a self-service data environment, starting with naming conventions. The goal is to make navigating the data warehouse easy for beginner, intermediate, and advanced users. We make this possible by following these best practices:
- `prep_<subject>` = Used to clean raw data and prepare it for dimensional analysis.
- `fct_<verb>` = fact table. Facts represent events or real-world processes that occur. Facts can often be identified because they represent the action or 'verb' (e.g. session, transaction).
- `dim_<noun>` = dimension table. Dimensions provide descriptive context to the fact records. Dimensions can often be identified because they are 'nouns' such as a person, place, or thing (e.g. customer, employee). The dimension attributes act as 'adjectives' (e.g. customer type, employee division).
- `mart_<subject>` = Join dimension and fact tables together with minimal filters and aggregations. Because they have minimal filters and aggregations, mart tables can be used for a wide variety of reporting purposes.
- `rpt_<subject>` = Can be built on top of dim, fact, and mart tables. Very specific filters are applied that make report tables applicable to a narrow subset of reporting purposes.
- `pump_<subject>` = Can be built on top of dim, fact, mart, and report tables. Used for models that will be piped into a third party tool.
- `map_<subjects>` = Used to maintain one-to-one relationships between data that come from different sources.
- `bdg_<subjects>` = Used to maintain many-to-many relationships between data that come from different sources. See the Kimball Group's documentation for tips on how to build bridge tables.
- Use the name of the business concept rather than the source system: for example, use `crm` in the dimensional model and not `sfdc` or `salesforce`.
The hashed surrogate key is generated via the `dbt_utils.surrogate_key` macro and is not derived from any source application data, unlike a natural key. The main uses of the surrogate key are to act as the primary key and to be used to join dims and facts together in the dimensional model. This allows the data warehouse to generate and control the key and protects the dimensional model from changes in the source system. The surrogate key cannot be used to filter tables since it is a hashed field.

- In `{{ dbt_utils.surrogate_key(['order_type_stamped']) }}`, the natural key `order_type_stamped` is passed to the macro to generate the surrogate key `dim_order_type_id`.
- Going forward, an `_sk` suffix will keep it simple to name both composite and non-composite surrogate keys; `dim_order_type_id` will be `dim_order_type_sk`. This is a new requirement starting in June 2022; therefore, not all surrogate keys in models will have the `_sk` suffix.
- Historically, surrogate keys have carried a `dim_` prefix, for example `dim_order_type_id`. This prefix indicates that this is what the dimension should be joined on in the dimensional model. The new requirement would have this key named `dim_order_type_sk`.
- The surrogate key should be listed under a commented `--Surrogate Key` section in the SQL script so that it is easily identifiable. Also, the surrogate key description can be added to the columns in the `schema.yml` file.
- `order_type_stamped` is a single field natural key that is included in the dimension table.
- Historically, we have used the `dim_` prefix to indicate a natural key, for both single key and composite natural keys. We will continue to use the `dim_` prefix to name single field natural keys for consistency in the model. For natural keys that are composed of multiple columns, we have not added the `dim_` prefix and we will continue to follow that pattern of not adding the `dim_` prefix. We determined that it would be expensive to refactor all existing models and Sisense dashboards to not have the `dim_` prefix on existing natural keys and that it would add little value to do so. Typically, we would prefer to not have the `dim_` prefix at all on the natural key because the dimension should not be joined on the natural key to query or materialize star schemas. The joining of dims and facts and the materialization of star schemas in the dimensional model should only be done via the hashed surrogate keys.
- The natural key should be listed under a commented `--Natural Key` section so that it is easily identifiable. Also, the natural key description can be added to the columns in the `schema.yml` file.
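Putting the surrogate key and natural key conventions together, a dimension model could be laid out as in the sketch below; it mirrors the order type example and assumes the surrogate key is generated directly in the dimension.

```sql
-- Sketch of a dimension following the key conventions above.
SELECT
  --Surrogate Key
  {{ dbt_utils.surrogate_key(['order_type_stamped']) }} AS dim_order_type_sk,

  --Natural Key
  order_type_stamped,

  --Attributes
  order_type_name,
  order_type_grouped
FROM {{ ref('prep_order_type') }}
```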
- Previously, `MD5('-1')` was used to generate the `dim_order_type_id` value for the missing member entry. As of June 2022, the `dbt_utils.surrogate_key` macro will be used to generate the value for the missing member key. It returns the same value as `MD5('-1')`, and it is easier to keep the same function that is used to generate the main surrogate key in the dimension.
- `'Missing order_type_name' AS order_type_name` and `'Missing order_type_grouped' AS order_type_grouped` are added as the values that would be returned when a join is made to the dimension and there is a missing member.
- A macro is being developed to generate the missing member value automagically in the dimension. The current process is to add the missing member entry manually to the dimension, pursuant to the prep_order_type example. When there are many fields in the dimension, missing member values should be added for each field; prep_amendment is an example of this. The macro under development will aim to automate the generation of these missing member entries.
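A minimal sketch of the manual missing member entry, following the order type example (the exact implementation in the project may differ):

```sql
-- Missing member row appended to the dimension rows.
SELECT
  {{ dbt_utils.surrogate_key(['order_type_stamped']) }} AS dim_order_type_sk,
  order_type_stamped,
  order_type_name,
  order_type_grouped
FROM {{ ref('prep_order_type') }}

UNION ALL

SELECT
  {{ dbt_utils.surrogate_key(["'-1'"]) }}  AS dim_order_type_sk,  -- same value as MD5('-1')
  '-1'                                     AS order_type_stamped,
  'Missing order_type_name'                AS order_type_name,
  'Missing order_type_grouped'             AS order_type_grouped
```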
- The `dbt_utils.surrogate_key` macro or a concatenation function can be used to create the primary key of a fact table.
- The primary key should be listed under a commented `--Primary Key` section in the SQL script so that it is easily identifiable.
- The primary key is named with the table name minus the `fct_` prefix, plus the `_pk` suffix on the end. It is not necessary to include `id` in the field name. An example is `fct_event`, where `event_id` is the primary key in a commented `--Primary Key` section; there the key is the table name `fct_event` minus `fct_` plus `id` on the end. However, going forward from June 2022, this primary key would be named `event_pk`.
- The foreign keys should be listed under a commented `--Foreign Keys` section; `fct_event` is an example. The foreign keys are the hashed surrogate keys from the dimensions. The star schema should be joined on and materialized using hashed surrogate keys.
- The `--Foreign Keys` section can also be further organized into `Conformed Dimensions` and `Local Dimensions` sub-sections. This helps readability and makes it easy to identify the conformed and local dimensions.
- Missing or unknown dimension keys are handled with the `get_keyed_nulls` macro.
- The dimensions are built in the `common` schema first, and then the code that builds the fact tables references these common dimensions (LEFT JOIN) to ensure the common key is available. There are some situations where logically we are 100% confident every key that is in the fact table will be available in the dimension; this depends on the source of the dimension and any logic applied.
- Comparing the count of `MD5('-1')` (missing) dimension_id values against total row counts helps validate these joins.
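Putting the fact table conventions together, a fact model could be organized as in the sketch below. The dimension, column, and natural key names are hypothetical, and the `COALESCE` to `MD5('-1')` stands in for what the `get_keyed_nulls` macro does in the actual models.

```sql
-- Sketch of a fact model layout with commented key sections.
SELECT
  --Primary Key
  {{ dbt_utils.surrogate_key(['event.event_id']) }}       AS event_pk,

  --Foreign Keys
  ---Conformed Dimensions
  COALESCE(dim_crm_account.dim_crm_account_sk, MD5('-1')) AS dim_crm_account_sk,

  ---Local Dimensions
  COALESCE(dim_event_type.dim_event_type_sk, MD5('-1'))   AS dim_event_type_sk,

  --Facts
  event.event_value
FROM {{ ref('prep_event') }} AS event
LEFT JOIN {{ ref('dim_crm_account') }} AS dim_crm_account
  ON event.crm_account_natural_key = dim_crm_account.crm_account_natural_key
LEFT JOIN {{ ref('dim_event_type') }} AS dim_event_type
  ON event.event_type_natural_key = dim_event_type.event_type_natural_key
```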
The `Common` schemas contain the Enterprise Dimensional Model. Each schema has a specific purpose in the architecture, as described below.
The Common Prep Schema has 6 primary use cases at this time. The use cases are as follows:
In order to keep the `COMMON_PREP` schema streamlined and without unnecessary redundancies, we need to have guidelines and preferences for developing models in the Common Prep Schema. As Developers work through the use cases mentioned above, they should take into consideration the guidelines below when developing models in the `COMMON_PREP` schema. The `COMMON_PREP` schema is optional, and it is not a requirement that every lineage have a model built in the `COMMON_PREP` schema. However, the `COMMON_PREP` schema is useful for resolving the above mentioned use cases while keeping the data model DRY (Don't Repeat Yourself) and maintaining a SSOT for business entity use cases.
Prefer to have one Prep Model per dimensional entity in the `COMMON` schema. For example, prefer to only have `prep_charge` and NOT a `prep_charge` and `prep_recurring_charge`. The second `prep_recurring_charge` is a filtered down and aggregated version of `prep_charge`. In this case, `prep_recurring_charge` should instead be built as either a `FACT`, `MART`, or `REPORT` downstream in the lineage of `prep_charge`. Doing this will streamline the lineages, keep code DRY, and prevent extra layers and redundancy in the `COMMON_PREP` schema. Multiple Prep models in this case result in the Developer having to update code in both Prep models.
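For example, rather than a second `prep_recurring_charge` Prep model, the recurring-charge logic could be a downstream fact built from `prep_charge`, roughly along these lines (column names are illustrative):

```sql
-- Sketch of a downstream fct_recurring_charge built from prep_charge
-- instead of a second Prep model.
SELECT
  dim_crm_account_sk,
  DATE_TRUNC('month', charge_created_date) AS charge_month,
  SUM(mrr)                                 AS mrr
FROM {{ ref('prep_charge') }}
WHERE charge_type = 'Recurring'
GROUP BY 1, 2
```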
Prefer to keep the Prep model at the lowest grain of the dimensional entity it is representing in the `COMMON_PREP` schema. This allows it to be the SSOT Prep model for the lineage, where additional dimensional and reporting models can be built downstream from it in a `DIM`, `FACT`, `MART`, `MAPPING`, `BDG`, or `REPORT` table. Preaggregating data in the `COMMON_PREP` schema renders the Prep model not useful for building models in the `COMMON` schema that would be at a lower grain.
Prefer to not filter records out of the Prep Model in the `COMMON_PREP` schema. This allows it to be the SSOT Prep model for the lineage, where additional dimensional and reporting models can be built downstream from it in a `DIM`, `FACT`, `MART`, `MAPPING`, `BDG`, or `REPORT` table. Prefer to start filtering out data in the `COMMON` schema and subsequent downstream models. Filtering out data in the `COMMON_PREP` schema renders the Prep model not useful for building models in the `COMMON` schema that would require the data that was filtered out too early in the `COMMON_PREP` schema.
Prefer to not make a model in the `COMMON_PREP` schema that is only for the sake of following the same pattern of having Prep models. For example, if a dimension table is built in the `COMMON` schema and is only built by doing a `SELECT *` from the table in the `COMMON_PREP` schema, then it may be the case that the Prep model is not needed and we can build that model directly in the `COMMON` schema.
Mapping/look-up (`map_`) tables to support dimension tables should be created in the `common_mapping` schema.
The Common schema is where all of the facts and dimensions that compose the Enterprise Dimensional Model are stored. The Common schema contains a variety of different types of dimensions and facts that create multiple star schemas. The models in this schema are robust and provide the basis for analyzing GitLab's business processes. The `Common Mart` schema contains materializations of the star schemas stored in the `Common` schema that provide the Analyst and BI Developer with easy-to-use and easy-to-query data marts. What follows is a summary of the different types of facts and dimensions that are available in the Common schema.
Conformed Dimensions serve as the basis for a series of interlocking stars.
Local dimensions serve as the basis for analyzing one star.
Facts are the things being measured in a process. The terms measurement or metric are often used instead of fact to describe what is happening in the model. The Atomic Facts are modeled at the lowest level of the process they are measuring. They do not filter out any data and include all the rows that have been generated from the process the Atomic Fact table is describing. The documentation for the models would include details about the fact table being an atomic one or a derived one.
Derived Facts are built on top of the Atomic Facts. The Derived Fact directly references the Atomic Fact thereby creating an auditable, traceable lineage. The documentation for the models would include details about the fact table being an atomic one or a derived one. There are several use cases for building a Derived Fact table:
- Drill Across Facts that connect two or more facts together through Conformed Dimensions, stored as a Derived Fact table. In this process, separate SELECT statements are issued to each fact and include the Conformed Dimensions the facts have in common. The results are combined using a FULL OUTER JOIN on the Conformed Dimensions included in the SELECT statement results, as sketched below.
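A drill-across Derived Fact could look like the sketch below: each fact is summarized to the grain of a shared Conformed Dimension and the results are combined with a FULL OUTER JOIN (the fact names, dimension key, and measures are hypothetical).

```sql
-- Drill-across sketch: two facts combined through a conformed dimension key.
WITH orders AS (
  SELECT dim_crm_account_sk, SUM(order_amount) AS total_order_amount
  FROM fct_order
  GROUP BY 1
), tickets AS (
  SELECT dim_crm_account_sk, COUNT(*) AS ticket_count
  FROM fct_support_ticket
  GROUP BY 1
)
SELECT
  COALESCE(orders.dim_crm_account_sk, tickets.dim_crm_account_sk) AS dim_crm_account_sk,
  orders.total_order_amount,
  tickets.ticket_count
FROM orders
FULL OUTER JOIN tickets
  ON orders.dim_crm_account_sk = tickets.dim_crm_account_sk
```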
Bridge (`bdg_`) tables should reside in the `common` schema. These tables act as intermediate tables to resolve many-to-many relationships between two tables.
Marts are a combination of dimensions and facts that are joined together and used by business entities for insights and analytics. They are often grouped by business units such as marketing, finance, product, and sales. When a model is in this directory, it communicates to business stakeholders that the data is cleanly modelled and is ready for querying.
Below are some guidelines to follow when building marts:

- Name mart models with the `mart_` prefix.
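In practice a mart is a wide join of a fact to its dimensions on their hashed surrogate keys, with minimal filtering, as in this sketch (model and column names are hypothetical):

```sql
-- Sketch of a mart: fact joined to dimensions on hashed surrogate keys.
SELECT
  dim_crm_account.crm_account_name,
  dim_order_type.order_type_name,
  fct_order.order_amount,
  fct_order.order_created_date
FROM {{ ref('fct_order') }} AS fct_order
LEFT JOIN {{ ref('dim_crm_account') }} AS dim_crm_account
  ON fct_order.dim_crm_account_sk = dim_crm_account.dim_crm_account_sk
LEFT JOIN {{ ref('dim_order_type') }} AS dim_order_type
  ON fct_order.dim_order_type_sk = dim_order_type.dim_order_type_sk
```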
In broad, generalized terms, there are two perspectives used when analysing data: the current view and the historical view. When using the current view, an analyst uses the most up-to-date values for dimensions to slice facts, regardless of what these values may have been in the past. However, sometimes it is necessary to look at the world the way it was in a prior period, when the values in a dimension were different. For example, you may want to look at sales numbers using a prior product catalog, or calculate the number of customers in a given country over time to provide insights on customers who change addresses periodically.

We use three types of dimensions to cover both the current and historical view of the data. For a current view, a Type 1 dimension will have the up-to-date values for each attribute. For historical analysis, a Type 2 slowly changing dimension (SCD) can track infrequent changes to a dimension's attributes and capture the period when these values were valid. This differs from an event/activity table because the attribute values are not expected to change, but are allowed to change. In an activity table, we track the changes made to objects which have lifecycles (ex. opportunities). A third type of dimension can be used to provide an alternative view. This could be used when a product in the catalog could fall into two categories, and the analyst would like to see the data from both perspectives separately. We do not currently employ this Type 3 dimension in the Enterprise Dimensional Model.
For more information about the three types of dimensions:

- Type 1: The dimension attributes reflect the most current values and are overwritten when they change in the source data.
- Type 2: Two date columns (`valid_from` and `valid_to`) are associated with each record in the dimension to indicate when the value was actively used in the source data. Using these dates, we can look back to how the universe looked at a previous state in time. This is what we call a slowly changing dimension.
- Type 3: An additional column (for example, `Alternate Category`) can allow users to slice the data by an alternative version of a dimension. This type of dimension is not currently used in the Enterprise Dimensional Model.

Slowly changing dimensions are useful when coupled with snapshot tables. A snapshot table shows the history of an object, providing analysts with a timeline of the modifications made to the object from its creation to the present day. As an example, an analyst might want to track an opportunity from the day it was created, through each of its states, and see what state it is in today. In another handbook page, we have described how to create snapshots in dbt. The end result of this process is a model which has a format similar to the example below:
id | attribute | valid_from_date | valid_to_date |
---|---|---|---|
1 | 'open' | 2022-01-01 | 2022-01-02 |
1 | 'validating' | 2022-01-02 | 2022-01-02 |
1 | 'approved' | 2022-01-02 | 2022-01-03 |
1 | 'closed' | 2022-01-03 | 2022-01-05 |
1 | 'approved' | 2022-01-05 | NULL |
2 | 'open' | 2022-01-01 | 2022-01-05 |
2 | 'approved' | 2022-01-05 | NULL |
Tip: It can be helpful to add a column with logic to indicate the most recent record in a slowly changing dimension.
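One way to add such a flag, assuming the convention above where the current record has a NULL `valid_to_date` (the table name is hypothetical):

```sql
-- Flag the most recent record for each id in a slowly changing dimension.
SELECT
  id,
  attribute,
  valid_from_date,
  valid_to_date,
  (valid_to_date IS NULL) AS is_current_record
FROM slowly_changing_dimension
```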
For performance reasons, it is helpful to keep the slowly changing dimension at the grain in the example above for as long as possible. That being said, it is often easier for users to understand a slowly changing dimension when it is expanded to a daily snapshot view. The example above is transformed into the table below:
id | attribute | date |
---|---|---|
1 | 'open' | 2022-01-01 |
1 | 'validating' | 2022-01-02 |
1 | 'approved' | 2022-01-02 |
1 | 'approved' | 2022-01-03 |
1 | 'closed' | 2022-01-03 |
1 | 'closed' | 2022-01-04 |
1 | 'closed' | 2022-01-05 |
1 | 'approved' | 2022-01-05 |
1 | 'approved' | repeat until chosen date |
2 | 'open' | 2022-01-01 |
2 | 'open' | 2022-01-02 |
2 | 'open' | 2022-01-03 |
2 | 'open' | 2022-01-04 |
2 | 'open' | 2022-01-05 |
2 | 'approved' | 2022-01-05 |
2 | 'approved' | repeat until chosen date |
In the Enterprise Dimensional Model, we introduce the daily grain in the `COMMON` schema so the snapshot models are available in our reporting tool. These daily snapshot models should end with the suffix `_daily_snapshot`. If a slowly changing dimension requires additional business logic beyond what comes out of the source system and is stored in a `_source` model, the best practice is to create a staging model in the `COMMON_PREP` schema with the transformations and then build the slowly changing dimension and daily snapshot model from the `COMMON_PREP` model.
The dbt solution for building snapshot tables will set the `valid_to` field as NULL for the current version of a record, as shown in the first example above. This is how the data will be presented in the `_source` models. When this is transformed into a daily snapshot in the `COMMON` schema, there is flexibility for the analyst to decide how to set the end date (today's date, a future date, or into the infinite future) depending on the business use case.
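A common way to expand the slowly changing dimension to the daily grain is to join it to a date spine, using `COALESCE` to close out the open records at the chosen end date; this sketch assumes a `dim_date` spine with a `date_actual` column and uses today's date for open records.

```sql
-- Sketch: expand an SCD to a daily snapshot by joining to a date spine.
SELECT
  scd.id,
  scd.attribute,
  dim_date.date_actual AS snapshot_date
FROM slowly_changing_dimension AS scd
INNER JOIN dim_date
  ON dim_date.date_actual BETWEEN scd.valid_from_date
                              AND COALESCE(scd.valid_to_date, CURRENT_DATE)
```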
Big Data is a concept that we use to understand the limits of data offerings. Generically, Big Data is anything that exceeds or strains our current technical capacity for processing and delivery. Dealing with Big Data may be less efficient and more costly as new or creative solutions need to be developed and deployed to expand the capabilities of the data offerings.
Big Data in the Enterprise Data Warehouse is categorized by three general topics: the volume of data stored and transformed, the speed at which data must be processed, and the format and structure of the data.
The Enterprise Data Warehouse is limited in each of these conceptual areas and any data that approaches these limits would be considered Big Data.
While there is no hard limit to the amount of data that can be stored in the Enterprise Data Warehouse, there are limits to how much data can be transformed in a performant and cost effective way. The current limit is being able to create a single table within a 3 hour timeframe using an XL Snowflake warehouse. As an example, all of the collected Snowplow data, over 6 TB and 30 billion records, cannot be processed at once and would be considered Big Data.

As the Enterprise Data Warehouse is designed to process and transform the data to present a curated set of tables, there are limits to how quickly those transformations can be processed in a cost effective way. The Enterprise Data Warehouse is designed to process the data every 24 hours, and data that needs to be processed more quickly than that to accommodate business needs would be considered Big Data. As an example, evaluating website behavior in near real time would be considered Big Data.

The current design of the Enterprise Data Warehouse is built on the Snowflake cloud database. This limits the formats and structures of data that can be processed to those that fit into structured tables. While some processing of semi-structured data, such as JSON, is possible directly in the Enterprise Data Warehouse, this is limited and the data must first be ingested as a column in a table, decreasing efficiency. Generally, any data input or output from the Enterprise Data Warehouse that is not in a structured table would be considered Big Data. As an example, the service ping payload from the Version database is JSON that requires extensive manipulation before it can be analyzed and would be considered Big Data.
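For context on the semi-structured case, a JSON payload can only be queried after it has been landed as a `VARIANT` column in a table; a minimal Snowflake sketch with hypothetical table and field names:

```sql
-- Extracting fields from a JSON payload stored in a VARIANT column.
SELECT
  raw_payload:uuid::VARCHAR         AS instance_uuid,
  raw_payload:counts.issues::NUMBER AS issue_count
FROM service_ping_payloads
```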