The Enterprise Dimensional Model (EDM) is GitLab's centralized data model, designed to enable and support the highest levels of accuracy and quality for reporting and analytics. The data model follows the Kimball technique, including a Bus Matrix and Entity Relationship Diagram. Dimensional Modeling is the third step of our overarching Data Development Approach (after Requirements definition and UI Wireframing) and this overall approach enables us to repeatedly produce high-quality data solutions. The EDM is housed in our Snowflake Enterprise Data Warehouse and is generated using dbt.
As of 2021-April, the EDM solves for Go-To-Market funnel analytics and is actively being expanded to solve for Product Usage analytics. Example Sisense dashboards powered by the EDM include:
Dimensional modeling is part of the Business Dimensional Lifecycle methodology developed by Ralph Kimball which includes a set of methods, techniques and concepts for use in data warehouse design.
a logical design technique that seeks to present the data in a standard, intuitive framework that allows for high-performance access
Dimensional Modeling is business process oriented and can be built in 4 steps:

1. Identify the business process.
2. Declare the grain.
3. Identify the dimensions.
4. Identify the facts.
Dimensional modeling always uses the concepts of facts (measures), and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts.
In the simplest version, the fact table is a central table linked to dimension tables with foreign keys, creating a star schema. A star schema whose dimension tables link to further dimension tables is called a snowflake schema, and a schema with multiple fact tables is called a galaxy.
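The star schema above can be sketched with a toy example. The table and column names here (`fct_transaction`, `dim_customer`, and their keys) are hypothetical, chosen only to illustrate how a central fact table joins to a dimension table via a foreign key:

```python
import sqlite3

# Minimal star-schema sketch: one central fact table joined to one
# dimension table by a foreign key. Names are illustrative, not EDM models.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    dim_customer_id INTEGER PRIMARY KEY,
    customer_name   TEXT,
    customer_type   TEXT
);
CREATE TABLE fct_transaction (
    transaction_id  INTEGER PRIMARY KEY,
    dim_customer_id INTEGER REFERENCES dim_customer (dim_customer_id),
    amount          REAL
);
INSERT INTO dim_customer VALUES (1, 'Acme', 'enterprise'), (2, 'Zed', 'smb');
INSERT INTO fct_transaction VALUES (10, 1, 100.0), (11, 1, 250.0), (12, 2, 40.0);
""")

# The dimension supplies context (customer_type) for aggregating the
# fact's measures (amount) -- the essence of facts vs. dimensions.
rows = conn.execute("""
    SELECT d.customer_type, SUM(f.amount)
    FROM fct_transaction AS f
    LEFT JOIN dim_customer AS d USING (dim_customer_id)
    GROUP BY d.customer_type
    ORDER BY d.customer_type
""").fetchall()
print(rows)  # [('enterprise', 350.0), ('smb', 40.0)]
```

Note the LEFT JOIN from fact to dimension: every fact row is kept even if a dimension key is missing, which mirrors how the EDM joins facts to common dimensions.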
`PREP` database using source specific schema names.
`PREP` tables in the
`COMMON_PREP` schema in the
It is critical to be intentional when organizing a self-service data environment, starting with naming conventions. The goal is to make navigating the data warehouse easy for beginner, intermediate, and advanced users. We make this possible by following these best practices:
`fct_<verb>` = fact table. Facts represent events or real-world processes that occur. Facts can often be identified because they represent the action or 'verb' (e.g. session, transaction).
`dim_<noun>` = dimension table. Dimensions provide descriptive context to the fact records. Dimensions can often be identified because they are 'nouns', such as a person, place, or thing (e.g. customer, employee). The dimension attributes act as 'adjectives' (e.g. customer type, employee division).
`common` schema first, and then the code that builds the fact tables references these common dimensions (LEFT JOIN) to ensure the common key is available. There are some situations where logically we are 100% confident every key that is in the fact table will be available in the dimension. This depends on the source of the dimension and any logic applied.
`common` schema. These tables act as intermediate tables to resolve many-to-many relationships between two tables.
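As a minimal illustration of such an intermediate table, consider a hypothetical many-to-many relationship between users and projects (the `map_user_project` name and its columns are invented for this sketch, not actual EDM models):

```python
# Hypothetical mapping table resolving a many-to-many relationship:
# a user can belong to many projects, and a project can have many users,
# so each (user, project) pair becomes one row in the intermediate table.
map_user_project = [
    {"user_id": 1, "project_id": 10},
    {"user_id": 1, "project_id": 11},
    {"user_id": 2, "project_id": 10},
]

# Resolving one side of the relationship: all projects for user 1.
projects = sorted(r["project_id"] for r in map_user_project if r["user_id"] == 1)
print(projects)  # [10, 11]
```

Without the mapping table, either side of the relationship would need a repeating or delimited column, which is harder to join and aggregate against.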
The Dimensional Model is meant to be simple to use and designed for the user. Dimensional models are likely to be denormalized, as opposed to source models, making them easier to read and interpret, as well as allowing efficient querying by reducing the number of joins.
The Common Prep Schema has 6 primary use cases at this time. As we iterate on the EDM 2.0, we will evaluate each of these cases for efficacy and performance and determine if there is a more streamlined and efficient approach to perform the transformation being done in Common Marts. The use cases are as follows:
Marts are a combination of dimensions and facts that are joined together and used by business entities for insights and analytics. They are often grouped by business units such as marketing, finance, product, and sales. When a model is in this directory, it communicates to business stakeholders that the data is cleanly modelled and is ready for querying.
Below are some guidelines to follow when building marts:
In broad, generalized terms, there are two perspectives used when analysing data: the current view and the historical view. When using the current view, an analyst uses the most up-to-date values for dimensions to slice facts, regardless of what these values may have been in the past. However, sometimes it is necessary to look at the world the way it was in a prior period, when the values in a dimension were different. For example, you may want to look at sales numbers using a prior product catalog or calculate the number of customers in a given country over time, to provide insights on customers who change addresses periodically.
We use three types of dimensions to cover both the current and historical view of the data. For a current view, a Type 1 dimension will have the up-to-date values for each attribute. For historical analysis, a Type 2 slowly changing dimension (SCD) can track infrequent changes to a dimension's attributes and capture the period when these values were valid. This differs from an event/activity table because the attribute values are not expected to change, but are allowed to change. In an activity table, we track the changes made to objects which have lifecycles (ex. opportunities). A third type of dimension can be used to provide an alternative view. This could be used when a product in the catalog could fall into two categories, and the analyst would like to see the data from both perspectives separately. We do not currently employ this Type 3 dimension in the Enterprise Dimensional Model.
For more information about the three types of dimensions:
- Type 1: The dimension stores only the most recent value for each attribute, supporting the current view.
- Type 2: Date fields (`valid_from` and `valid_to`) are associated with each record in the dimension to indicate when the value was actively used in the source data. Using these dates, we can look back to how the universe looked at a previous state in time. This is what we call a slowly changing dimension.
- Type 3: An additional column (e.g. `Alternate Category`) can allow users to slice the data by an alternative version of a dimension. This type of dimension is not currently used in the Enterprise Dimensional Model.
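The Type 2 mechanics can be sketched in a few lines. The `dim_product` rows and the `attribute_as_of` helper below are hypothetical, but they show how the validity dates let us recover an attribute's value at any past point in time:

```python
from datetime import date

# Hypothetical Type 2 SCD rows: each carries the period its value was valid.
# valid_to is None for the current version, mirroring dbt snapshot output.
dim_product = [
    {"product_id": 1, "category": "starter",
     "valid_from": date(2021, 1, 1), "valid_to": date(2021, 3, 1)},
    {"product_id": 1, "category": "premium",
     "valid_from": date(2021, 3, 1), "valid_to": None},
]

def attribute_as_of(rows, as_of):
    """Return the row whose validity window contains the as_of date."""
    for row in rows:
        end = row["valid_to"] or date.max  # open-ended current record
        if row["valid_from"] <= as_of < end:
            return row

print(attribute_as_of(dim_product, date(2021, 2, 15))["category"])  # starter
print(attribute_as_of(dim_product, date(2021, 4, 1))["category"])   # premium
```

A Type 1 dimension, by contrast, would keep only the `premium` row: slicing older facts by it would silently apply today's value to the past.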
Slowly changing dimensions are useful when coupled with snapshot tables. A snapshot table shows the history of an object, providing analysts with a timeline of the modifications made to the object from its creation to the present day. As an example, an analyst might want to track an opportunity from the day it was created, through each of its states, and see what state it is in today. In another handbook page, we have described how to create snapshots in dbt. The end result of this process is a model which has a format similar to the below example:
Tip: It can be helpful to add a column with logic to indicate the most recent record in a slowly changing dimension.
For performance reasons, it is helpful to keep the slowly changing dimension at the grain in the example above for as long as possible. That being said, it is often easier for users to understand a slowly changing dimension when it is expanded to a daily snapshot view. The example above is transformed into the table below:
| id | attribute | date |
|----|-----------|------|
| 1 | 'approved' | (repeat until chosen date) |
| 2 | 'approved' | (repeat until chosen date) |
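The expansion from validity ranges to a daily grain can be sketched as follows. This is illustrative only (in the EDM the transformation is done in SQL/dbt), and the record values and the `expand_to_daily` helper are hypothetical; here `valid_to` is treated as inclusive and open records run to a chosen end date:

```python
from datetime import date, timedelta

def expand_to_daily(rows, chosen_date):
    """Expand SCD records with validity ranges into one row per day."""
    daily = []
    for row in rows:
        day = row["valid_from"]
        # Open (current) records repeat until the chosen end date.
        end = row["valid_to"] or chosen_date
        while day <= end:
            daily.append({"id": row["id"], "attribute": row["attribute"], "date": day})
            day += timedelta(days=1)
    return daily

scd = [
    {"id": 1, "attribute": "open",
     "valid_from": date(2021, 1, 1), "valid_to": date(2021, 1, 3)},
    {"id": 1, "attribute": "approved",
     "valid_from": date(2021, 1, 4), "valid_to": None},
]
snapshot = expand_to_daily(scd, chosen_date=date(2021, 1, 6))
print(len(snapshot))  # 6 daily rows: 3 'open' + 3 'approved'
```

This also shows why the compact range form is kept as long as possible: three days of history become six rows at the daily grain, and real dimensions grow by one row per object per day.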
In the Enterprise Dimensional Model, we introduce the daily grain in the `COMMON` schema so the snapshot models are available in our reporting tool. These daily snapshot models should end with the suffix `_daily_snapshot`. If a slowly changing dimension requires additional business logic beyond what comes out of the source system and is stored in a `_source` model, the best practice is to create a staging model in the `COMMON_PREP` schema with the transformations and then build the slowly changing dimension and daily snapshot model from the staging model.
The dbt solution for building snapshot tables will set the `valid_to` field as NULL for the current version of a record, as shown in the first example above. This is how the data will be presented in the `_source` models. When this is transformed into a daily snapshot in the `COMMON` schema, there is flexibility for the analyst to decide how to set the end date (today's date, a future date, into the infinite future) depending on the business use case.
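The end-date choice amounts to coalescing the NULL `valid_to` to a chosen sentinel, analogous to a SQL `COALESCE`. The `effective_end` helper and its `mode` values below are hypothetical, sketching the options named above:

```python
from datetime import date

def effective_end(valid_to, mode="today"):
    """Choose an end date for the current record (valid_to is None)."""
    if valid_to is not None:
        return valid_to            # closed record: keep its real end date
    if mode == "today":
        return date.today()        # history runs up to the present day
    if mode == "infinite":
        return date(9999, 12, 31)  # common 'far future' sentinel
    raise ValueError(f"unknown mode: {mode}")

print(effective_end(date(2021, 3, 1)))            # 2021-03-01
print(effective_end(None, mode="infinite"))       # 9999-12-31
```

Which sentinel is right depends on the business question: "today" avoids projecting current values into the future, while a far-future date keeps joins against any date spine simple.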