The Enterprise Dimensional Model (EDM) is GitLab's centralized data model, designed to enable and support the highest levels of accuracy and quality for reporting and analytics. The data model follows the Kimball technique, including a Bus Matrix and Entity Relationship Diagram. Dimensional Modeling is the third step of our overarching Data Development Approach (after Requirements definition and UI Wireframing) and this overall approach enables us to repeatedly produce high-quality data solutions. The EDM is housed in our Snowflake Enterprise Data Warehouse and is generated using dbt.
Example SiSense dashboards powered by the EDM include:
Dimensional modeling is part of the Business Dimensional Lifecycle methodology developed by Ralph Kimball which includes a set of methods, techniques and concepts for use in data warehouse design.
Kimball describes it as "a logical design technique that seeks to present the data in a standard, intuitive framework that allows for high-performance access".
Dimensional Modeling is business process oriented and can be built in four steps:

1. Select the business process.
2. Declare the grain.
3. Identify the dimensions.
4. Identify the facts.
Dimensional modeling always uses the concepts of facts (measures) and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts.

In the simplest version, the fact table is a central table linked to dimensional tables with foreign keys, creating a star schema. A star schema whose dimensional tables link to further dimensional tables is called a snowflake schema; schemas with multiple fact tables are called galaxies.
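For illustration, a star schema query joins the central fact table to its dimensions on foreign keys and aggregates the measures. All table and column names below are hypothetical:

```sql
-- Hypothetical star schema query: one central fact joined to two dimensions.
SELECT
  dim_date.fiscal_quarter,
  dim_customer.customer_segment,
  SUM(fct_order.order_amount) AS total_order_amount  -- fact measure, aggregated
FROM fct_order
LEFT JOIN dim_date
  ON fct_order.dim_date_sk = dim_date.dim_date_sk          -- foreign key to dimension
LEFT JOIN dim_customer
  ON fct_order.dim_customer_sk = dim_customer.dim_customer_sk
GROUP BY 1, 2
```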
Within the warehouse:

- Prep tables are built in the `PREP` database using source specific schema names.
- Prep tables are built in the `COMMON_PREP` schema in the `PROD` database as required by the use case. Building prep tables in the `COMMON_PREP` schema is optional and depends on the use case.
- Dimension tables include an `MD5('-1')` key value to represent a missing member, and fact tables carry the same `MD5('-1')` key value for missing keys.

It is critical to be intentional when organizing a self-service data environment, starting with naming conventions. The goal is to make navigating the data warehouse easy for beginner, intermediate, and advanced users. We make this possible by following these best practices:
- `fct_<verb>` = fact table. Facts represent events or real-world processes that occur. Facts can often be identified because they represent the action or 'verb' (e.g. session, transaction).
- `dim_<noun>` = dimension table. Dimensions provide descriptive context to the fact records. Dimensions can often be identified because they are 'nouns' such as a person, place, or thing (e.g. customer, employee). The dimension attributes act as 'adjectives' (e.g. customer type, employee division).

The surrogate key is generated with the `dbt_utils.surrogate_key` macro and, unlike a natural key, is not derived from any source application data. The main uses of the surrogate key are to act as the primary key and to join dims and facts together in the dimensional model. This allows the data warehouse to generate and control the key and protects the dimensional model from changes in the source system. The surrogate key cannot be used to filter tables since it is a hashed field. In `{{ dbt_utils.surrogate_key(['order_type_stamped']) }}`, the natural key `order_type_stamped` is passed to the macro to generate the surrogate key `dim_order_type_id`. Using the `_sk` suffix keeps it simple to name both composite and non-composite surrogate keys, so `dim_order_type_id` becomes `dim_order_type_sk`. This is a new requirement starting in June 2022; therefore, not all surrogate keys in models will have the `_sk` suffix.
suffix.dim_
prefix on them. An example is dim_order_type_id
. This prefix indicates that this is what the dimension should be joined on in the dimensional model. The new requirement would have this key named dim_order_type_sk
.--Surrogate Key
so that it is easily identifiable. Also, the surrogate key description can be added to the columns in the schema.yml
file.order_type_stamped
is a single field natural key that is included in the dimension table.dim_
prefix. This is because the dimension should not be joined on the natural key to query or materialize star schemas. The joining of dims and facts and materializations of star schemas in the dimensional model should only be done via the hashed surrogate keys.--Natural Key
so that it is easily identifiable. Also, the natural key description can be added to the columns in the schema.yml
file.MD5('-1')
to generate the dim_order_type_id
value for missing member. As of June 2022, the dbt_utils.surrogate_key
will be used to generate the value for missing member key. It returns the same value as the MD5('-1')
and it will be easier to keep the same function that is used to generate the main surrogate key in the dimension. 'Missing order_type_name' AS order_type_name
and 'Missing order_type_grouped' AS order_type_grouped
are added as the values that would be returned when a join is made to the dimension and there is a missing member.missing member
value automagically in the dimension. The current process is to add the missing member
entry manually to the dimension pursuant to the prep_order_type example. When there are many fields in the dimension, missing member
values should be added for each field. prep_amendment is an example of this. The macro under development will aim to automate the generate of these missing member
entries.dbt_utils.surrogate_key
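Putting these conventions together, a dimension built from the `prep_order_type` example might be sketched as below. The column list is abbreviated, and the model reference is illustrative rather than actual production code:

```sql
-- Sketch of a dimension following the key conventions above.
SELECT
  -- Surrogate Key
  {{ dbt_utils.surrogate_key(['order_type_stamped']) }} AS dim_order_type_sk,

  -- Natural Key
  order_type_stamped,

  -- Attributes
  order_type_name,
  order_type_grouped
FROM {{ ref('prep_order_type') }}

UNION ALL

-- Missing member entry, currently added manually to the dimension.
SELECT
  {{ dbt_utils.surrogate_key(["'-1'"]) }} AS dim_order_type_sk,  -- same value as MD5('-1')
  '-1'                          AS order_type_stamped,
  'Missing order_type_name'     AS order_type_name,
  'Missing order_type_grouped'  AS order_type_grouped
```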
Where needed, the `dbt_utils.surrogate_key` macro or a concatenation function can be used to create a fact table's primary key. The primary key should be placed in a commented `--Primary Key` section in the SQL script so that it is easily identifiable. Primary keys are named after the fact table, dropping the `fct_` prefix and adding a `_pk` suffix; it is not necessary to include `id` in the field name. `fct_event` is an example: `event_id` is the primary key, listed in a commented `--Primary Key` section, where the key is the table name minus `fct_` plus `id` on the end. Going forward from June 2022, this primary key would instead be named `event_pk`.

Foreign keys are listed in a commented `--Foreign Keys` section; `fct_event` is an example. The foreign keys are the hashed surrogate keys from the dimensions, and the star schema should be joined on and materialized using these hashed surrogate keys. The `--Foreign Keys` section can also be further organized into `Conformed Dimensions` and `Local Dimensions` sub-sections. This helps readability and makes it easy to identify the conformed and local dimensions.
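A sketch of how these commented key sections might be laid out in a fact table like `fct_event`; the specific dimension keys and the source reference are illustrative, not the actual model:

```sql
-- Illustrative key layout for a fact table following the conventions above.
SELECT
  -- Primary Key
  event_pk,

  -- Foreign Keys
  ---- Conformed Dimensions
  dim_user_sk,
  dim_namespace_sk,

  ---- Local Dimensions
  dim_event_type_sk,

  -- Attributes and measures
  event_created_at
FROM {{ ref('prep_event') }}
```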
Missing keys are handled with the `get_keyed_nulls` macro. Dimensions are built in the `common` schema first, and then the code that builds the fact tables references these common dimensions (LEFT JOIN) to ensure the common key is available. There are some situations where logically we are 100% confident every key that is in the fact table will be available in the dimension; this depends on the source of the dimension and any logic applied. When validating these joins, useful checks include the count of `MD5('-1')` (missing) dimension_id values and total row counts.
The `Common` schemas contain the Enterprise Dimensional Model. Each schema has a specific purpose in the architecture, as described below.
The Common Prep Schema has 6 primary use cases at this time. The use cases are as follows:
In order to keep the `COMMON_PREP` schema streamlined and without unnecessary redundancies, we need guidelines and preferences for developing models in the Common Prep Schema. As Developers work through the use cases mentioned above, they should take the guidelines below into consideration when developing models in the `COMMON_PREP` schema. The `COMMON_PREP` schema is optional, and it is not a requirement that every lineage have a model built in the `COMMON_PREP` schema. However, the `COMMON_PREP` schema is useful for resolving the above mentioned use cases while keeping the data model DRY (Don't repeat yourself) and maintaining a SSOT (Single Source of Truth) for business entity use cases.
Prefer to have one Prep Model per dimensional entity in the `COMMON` schema. For example, prefer to have only `prep_charge`, and NOT both `prep_charge` and `prep_recurring_charge`, where the second model is a filtered down and aggregated version of `prep_charge`. In this case, `prep_recurring_charge` should instead be built as either a `FACT`, `MART`, or `REPORT` downstream in the lineage of `prep_charge`. Doing this will streamline the lineages, keep code DRY, and prevent extra layers and redundancy in the `COMMON_PREP` schema. Multiple Prep models in this case result in the Developer having to update code in both Prep models.
Prefer to keep the Prep model at the lowest grain of the dimensional entity it represents in the `COMMON_PREP` schema. This allows it to be the SSOT Prep model for the lineage, from which additional dimensional and reporting models can be built downstream in a `DIM`, `FACT`, `MART`, `MAPPING`, `BDG`, or `REPORT` table. Pre-aggregating data in the `COMMON_PREP` schema renders the Prep model unusable for building models in the `COMMON` schema that would be at a lower grain.
Prefer not to filter records out of the Prep Model in the `COMMON_PREP` schema. This allows it to be the SSOT Prep model for the lineage, from which additional dimensional and reporting models can be built downstream in a `DIM`, `FACT`, `MART`, `MAPPING`, `BDG`, or `REPORT` table. Prefer to start filtering out data in the `COMMON` schema and subsequent downstream models. Filtering out data too early, in the `COMMON_PREP` schema, renders the Prep model unusable for building models in the `COMMON` schema that require the filtered-out data.
Prefer not to make a model in the `COMMON_PREP` schema solely for the sake of following the pattern of having Prep models. For example, if a dimension table in the `COMMON` schema would be built by doing only a `SELECT *` from the table in the `COMMON_PREP` schema, then the Prep model may not be needed and we can build that model directly in the `COMMON` schema.
Mapping/look-up (`map_`) tables to support dimension tables should be created in the `common_mapping` schema.
The Common schema is where all of the facts and dimensions that compose the Enterprise Dimensional Model are stored. The Common schema contains a variety of different types of dimensions and facts that create multiple star schemas. The models in this schema are robust and provide the basis for analyzing GitLab's business processes. The Common Mart schema contains materializations of the star schemas stored in the Common schema that provide the Analyst and BI Developer with easy-to-use, query-ready data marts. What follows is a summary of the different types of facts and dimensions that are available in the Common Schema.
Conformed Dimensions serve as the basis for a series of interlocking stars.
Local dimensions serve as the basis for analyzing one star.
Facts are the things being measured in a process. The terms `measurement` or `metric` are often used instead of fact to describe what is happening in the model. Atomic Facts are modeled at the lowest level of the process they are measuring. They do not filter out any data and include all the rows that have been generated from the process the Atomic Fact table is describing. The documentation for each model notes whether the fact table is an atomic or a derived one.

Derived Facts are built on top of Atomic Facts. A Derived Fact directly references the Atomic Fact, thereby creating an auditable, traceable lineage. There are several use cases for building a Derived Fact table:
- Drill Across Facts that connect two or more facts together through Conformed Dimensions, stored as a Derived Fact table. In this process, separate select statements are issued to each fact and include the Conformed Dimensions the facts have in common. The results are combined using a Full Outer Join on the Conformed Dimensions included in the select statement results.
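As a sketch, with hypothetical fact tables and a hypothetical conformed `dim_crm_account_sk` key, a drill-across query looks like this:

```sql
-- Hypothetical drill-across: one aggregate query per fact, combined with a
-- FULL OUTER JOIN on the conformed dimension key.
WITH orders AS (
    SELECT dim_crm_account_sk, SUM(order_amount) AS total_order_amount
    FROM fct_order
    GROUP BY 1
), tickets AS (
    SELECT dim_crm_account_sk, COUNT(*) AS total_tickets
    FROM fct_support_ticket
    GROUP BY 1
)
SELECT
  COALESCE(orders.dim_crm_account_sk, tickets.dim_crm_account_sk) AS dim_crm_account_sk,
  orders.total_order_amount,
  tickets.total_tickets
FROM orders
FULL OUTER JOIN tickets
  ON orders.dim_crm_account_sk = tickets.dim_crm_account_sk
```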
Bridge (`bdg_`) tables should reside in the `common` schema. These tables act as intermediate tables to resolve many-to-many relationships between two tables.
Marts are a combination of dimensions and facts that are joined together and used by business entities for insights and analytics. They are often grouped by business units such as marketing, finance, product, and sales. When a model is in this directory, it communicates to business stakeholders that the data is cleanly modeled and is ready for querying.

Below are some guidelines to follow when building marts:

- Name marts with the prefix `mart_`.

In broad, generalized terms, there are two perspectives used when analyzing data: the current view and the historical view. When using the current view, an analyst uses the most up-to-date values for dimensions to slice facts, regardless of what these values may have been in the past. However, sometimes it is necessary to look at the world the way it was in a prior period, when the values in a dimension were different. For example, you may want to look at sales numbers using a prior product catalog or calculate the number of customers in a given country over time, to provide insights on customers who change addresses periodically.
We use three types of dimensions to cover both the current and historical view of the data. For a current view, a Type 1 dimension will have the up-to-date values for each attribute. For historical analysis, a Type 2 slowly changing dimension (SCD) can track infrequent changes to a dimension's attributes and capture the period when these values were valid. This differs from an event/activity table because the attribute values are not expected to change, but are allowed to change. In an activity table, we track the changes made to objects which have lifecycles (e.g. opportunities). A third type of dimension can be used to provide an alternative view. This could be used when a product in the catalog could fall into two categories, and the analyst would like to see the data from both perspectives separately. We do not currently employ this Type 3 dimension in the Enterprise Dimensional Model.
For more information about the three types of dimensions:

- Type 1: Has the up-to-date value for each attribute, overwriting the prior value when the source changes; this supports the current view of the data.
- Type 2: Two dates (`valid_from` and `valid_to`) are associated with each record in the dimension to indicate when the value was actively used in the source data. Using these dates, we can look back to how the universe looked at a previous state in time. This is what we call a slowly changing dimension.
- Type 3: Has an additional column, such as `Alternate Category`, which can allow users to slice the data by an alternative version of a dimension. This type of dimension is not currently used in the Enterprise Dimensional Model.

Slowly changing dimensions are useful when coupled with snapshot tables. A snapshot table shows the history of an object, providing analysts with a timeline of the modifications made to the object from its creation to the present day. As an example, an analyst might want to track an opportunity from the day it was created, through each of its states, and see what state it is in today. In another handbook page, we have described how to create snapshots in dbt. The end result of this process is a model which has a format similar to the below example:
| id | attribute | valid_from_date | valid_to_date |
|----|-----------|-----------------|---------------|
| 1 | 'open' | 2022-01-01 | 2022-01-02 |
| 1 | 'validating' | 2022-01-02 | 2022-01-02 |
| 1 | 'approved' | 2022-01-02 | 2022-01-03 |
| 1 | 'closed' | 2022-01-03 | 2022-01-05 |
| 1 | 'approved' | 2022-01-05 | NULL |
| 2 | 'open' | 2022-01-01 | 2022-01-05 |
| 2 | 'approved' | 2022-01-05 | NULL |
Tip: It can be helpful to add a column with logic to indicate the most recent record in a slowly changing dimension.
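For example, assuming the dbt snapshot convention of a NULL `valid_to_date` on the current record (the table name here is hypothetical):

```sql
-- Flag the most recent record in the slowly changing dimension.
SELECT
  id,
  attribute,
  valid_from_date,
  valid_to_date,
  valid_to_date IS NULL AS is_current_record
FROM scd_example
```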
For performance reasons, it is helpful to keep the slowly changing dimension at the grain in the example above for as long as possible. That being said, it is often easier for users to understand a slowly changing dimension when it is expanded to a daily snapshot view. The example above is transformed into the table below:
| id | attribute | date |
|----|-----------|------|
| 1 | 'open' | 2022-01-01 |
| 1 | 'validating' | 2022-01-02 |
| 1 | 'approved' | 2022-01-02 |
| 1 | 'approved' | 2022-01-03 |
| 1 | 'closed' | 2022-01-03 |
| 1 | 'closed' | 2022-01-04 |
| 1 | 'closed' | 2022-01-05 |
| 1 | 'approved' | 2022-01-05 |
| 1 | 'approved' | repeat until chosen date |
| 2 | 'open' | 2022-01-01 |
| 2 | 'open' | 2022-01-02 |
| 2 | 'open' | 2022-01-03 |
| 2 | 'open' | 2022-01-04 |
| 2 | 'open' | 2022-01-05 |
| 2 | 'approved' | 2022-01-05 |
| 2 | 'approved' | repeat until chosen date |
In the Enterprise Dimensional Model, we introduce the daily grain in the `COMMON` schema so the snapshot models are available in our reporting tool. These daily snapshot models should end with the suffix `_daily_snapshot`. If a slowly changing dimension requires additional business logic beyond what comes out of the source system and is stored in a `_source` model, the best practice is to create a staging model in the `COMMON_PREP` schema with the transformations and then build the slowly changing dimension and daily snapshot model from the `COMMON_PREP` model.

The dbt solution for building snapshot tables sets the `valid_to` field to NULL for the current version of a record, as shown in the first example above. This is how the data is presented in the `_source` models. When this is transformed into a daily snapshot in the `COMMON` schema, there is flexibility for the analyst to decide how to set the end date (today's date, a future date, or the infinite future) depending on the business use case.
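A minimal sketch of that daily expansion, assuming a date spine dimension with a `date_day` column and choosing today's date as the end date for current records (the SCD table name is hypothetical):

```sql
-- Expand the slowly changing dimension to a daily snapshot by joining to a
-- date spine. The join is inclusive on both boundary dates, matching the
-- expanded example table above, and a NULL valid_to_date is treated as
-- valid through today (one of the end-date choices described above).
SELECT
  scd.id,
  scd.attribute,
  dates.date_day AS snapshot_date
FROM scd_example AS scd
INNER JOIN dim_date AS dates
  ON dates.date_day BETWEEN scd.valid_from_date
                        AND COALESCE(scd.valid_to_date, CURRENT_DATE)
```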