This page describes in detail the process of adding new data sources to the data warehouse. The Data Team welcomes any initiative to add new data sources to the data warehouse. We believe data can lead to value, but adding new data to the data warehouse is not free of cost.
Both the development (resources from the Data Team, from other teams involved to support, and from you as the requestor) and keeping the data pipeline up and running (storage, compute resources, incident management, monitoring, etc.) cost time and/or money.
Before requesting a new data source, please take the following into account:
- Newly extracted data lands in the rawdata layer, which is not accessible by default for Sisense. Data needs to be loaded downstream into the Enterprise Dimensional Model (EDM) via dbt. This follow-up work needs to take place and comes on top of the process described on this page.
- The data you need may already be available in the rawdata layer. Raise an AR (Access Request) to get access to the raw data.
For adding a new data source, the workflow looks as follows (note: this is the existing workflow we use for all our work, made explicit here in the context of adding a new data source).
Process for adding a new data source:
| Stage (Label) | Summary for new data source |
| --- | --- |
| Triage | All initial information will be provided by the requestor and assessed by the Data Team |
| Validation | A solution design and complete work breakdown for the upcoming development will be created |
| Scheduling | Once a quarter the Data Team OKRs are defined. Based on workload and priorities, new data sources could be incorporated into the next quarter's OKRs |
| Scheduled | Adding the new data source is on the list of next quarter's OKRs. An epic will be created with all the work breakdown activities attached |
| Development | Development is in progress |
| Review | Development is under review |
| Blocked | Development is blocked |
Every new data source request starts with an issue. This applies both to data sources that are already extracted but need to be extended and to completely new data sources. Please use the New Data Source template. Assign the new issue to the Manager, Data Platform so that the Data Team can triage the issue in a timely manner. All initial information will be provided by the requestor and assessed by the Data Team. A cross-check will be performed to verify the data doesn't already exist in the data warehouse.
All details regarding the new data source will be fleshed out, with the goal of creating a full work breakdown.
Based on the requirements, the data points that need to be extracted (i.e. which tables) will be determined. The total data scope must be specified.
In the Data Platform we adopt the principle of data minimisation when extracting data, which means we only extract data that is needed to fulfill the scope/need. We do this for the following reasons:
This is all in the spirit of GitLab's value of efficiency.
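To make the principle concrete, the sketch below shows a purely hypothetical extraction manifest that pulls only the tables and columns that are in scope instead of selecting everything; all table and column names are invented for illustration.

```python
# Purely hypothetical example: extract only the columns that are in scope,
# instead of running `SELECT *` over every table in the source system.
IN_SCOPE_TABLES = {
    # table name -> columns actually needed downstream (all names are invented)
    "invoices": ["id", "account_id", "amount", "created_at"],
    "accounts": ["id", "name", "country"],
}


def build_extraction_query(table: str) -> str:
    """Build a minimal SELECT statement for a single in-scope table."""
    columns = ", ".join(IN_SCOPE_TABLES[table])
    return f"SELECT {columns} FROM {table}"


for table in IN_SCOPE_TABLES:
    print(build_extraction_query(table))
```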
Data minimisation is applied on the following levels:
Data minimisation is not applied on the following levels:
The Data Team has different instruments available to extract data out of source systems (randomly ordered):
The decision on which instrument to use is always based on a combination of:
It is the Data Team that determines which instrument is used. The following decision diagram is used to determine the right extraction solution:
Custom development is a solution designed and developed by the GitLab Data Team. Examples of this are the current PGP and the Zuora Rev Pro extraction.
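For illustration only (this is not the actual PGP or Zuora Rev Pro code), a custom extraction in its most reduced form reads the in-scope data from the source system and lands it in the raw layer in Snowflake. All hosts, credentials, database, schema, and table names below are placeholders.

```python
# Minimal sketch of a custom extraction: read one table from a Postgres
# source system and land it in a schema in the Snowflake raw layer.
# All hosts, credentials, database, schema and table names are placeholders.
import pandas as pd
import psycopg2
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Read only the in-scope columns from the source system (data minimisation).
source = psycopg2.connect(
    host="source-db.example.com", dbname="app", user="readonly", password="***"
)
df = pd.read_sql("SELECT id, account_id, amount, created_at FROM invoices", source)

# Land the result in a dedicated schema in the raw layer.
target = snowflake.connector.connect(
    account="example_account", user="loader", password="***",
    warehouse="LOADING", database="RAW", schema="EXAMPLE_SOURCE",
)
write_pandas(target, df, table_name="INVOICES", auto_create_table=True)
```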
Although it could be helpful to already provide the Data Team access to the source system, it's not mandatory to raise an Access Request at this point.
The end goal of the validation is to have a solution design and a complete work breakdown, with DRIs and size estimations attached. All the work that needs to be performed is written down. We aim to have it at a level where every work item can be converted into an issue no larger than 5/8 issue points.
It needs to be determined whether there is MNPI (material non-public information) data in the data source and whether this data will be extracted into the Data Warehouse. If the data source contains MNPI and this data is extracted, change the issue label to new data source MNPI.
Based on the business case, the effort to implement, the workload in the Data Team, and Data Team priorities, the implementation will be scheduled. For scheduling we follow the GitLab Data Team Planning Drumbeat. This means that for every quarter, the Data Team determines when a new data source request will be picked up. When a new data source request remains in scheduling, it does not mean that it isn't on the radar of the Data Team. It means that it hasn't been scheduled yet, because:
When a request for implementing a new data source is in scope for the next quarter's OKRs, the request is scheduled. An epic will be created with the issue attached to it. Based on the work breakdown, all issues are created with the corresponding issue weight, the label workflow::4 - scheduled, the right milestone (if known), and attached to the epic.
When the execution of the work starts, the issue is in development and follows the regular development life cycle.
During development, the Data Engineer aligns with all stakeholders about access to the data. Data access can be provided to the raw schema, depending on the data and the use case.
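As a hypothetical sketch of what providing access to a raw schema can look like in Snowflake (role, database, and schema names are placeholders, not the grants actually used by the Data Team):

```python
# Hypothetical sketch: grant a functional role read access to one schema in
# the raw layer. Role, database and schema names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account", user="admin", password="***", role="SECURITYADMIN"
)
cur = conn.cursor()
for statement in [
    "GRANT USAGE ON DATABASE RAW TO ROLE ANALYST_EXAMPLE",
    "GRANT USAGE ON SCHEMA RAW.EXAMPLE_SOURCE TO ROLE ANALYST_EXAMPLE",
    "GRANT SELECT ON ALL TABLES IN SCHEMA RAW.EXAMPLE_SOURCE TO ROLE ANALYST_EXAMPLE",
]:
    cur.execute(statement)
```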
When it is a completely new data pipeline that extracts data from an upstream system that is not yet extracted, create a new label in the GitLab Data group, e.g. https://gitlab.com/groups/gitlab-data/-/labels/23453371/edit
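As a hedged example, a group label can also be created through the GitLab group labels API; the label name, colour, and token below are placeholders.

```python
# Hypothetical sketch: create a group label for a brand new data source via
# the GitLab group labels API. Label name, colour and token are placeholders.
import requests

response = requests.post(
    "https://gitlab.com/api/v4/groups/gitlab-data/labels",
    headers={"PRIVATE-TOKEN": "<your-api-token>"},
    data={"name": "Example Data Source", "color": "#428BCA"},
)
response.raise_for_status()
```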
When the execution of the work is finished, the issue is in review and follows the regular development life cycle.
When execution cannot continue because external intervention is needed, the issue is blocked. A clear problem statement must be given and the right people need to be assigned at the shortest possible notice.
Red data is not allowed to be stored in our Data Platform (Snowflake). Therefore we will not bring in/connect new data sources that are listed in the tech stack with
data_classification: Red unless there is a mission critical business reason. There is an exception process available which will enable us to evaluate the needs on a case-by-case basis and this process will require approval from BT/Data VP-level, Security and Privacy. Evaluating the business reason and obtaining approvals are part of the triage process and are governed via the new data source template.
Note: The exception process must be fulfilled to either connect a system with Red data and/or to extract Red data (fields) from that system. However, the business case to extract Red data (fields) under the exception process will necessitate a higher standard of review than a business case that only requires connecting a Red data system without extraction of Red data (fields). Where extraction of Red data (fields) is approved under the exception process, masking will be applied in the Data Platform (Snowflake) as described in the following section.
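As a rough sketch of what such masking can look like in Snowflake (the policy, table, column, and role names are placeholders and not the Data Team's actual implementation):

```python
# Hypothetical sketch: mask a column holding Red data with a Snowflake dynamic
# data masking policy. Policy, table, column and role names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account", user="admin", password="***", role="SECURITYADMIN"
)
cur = conn.cursor()
cur.execute("""
    CREATE MASKING POLICY IF NOT EXISTS RAW.EXAMPLE_SOURCE.MASK_RED_DATA
      AS (val STRING) RETURNS STRING ->
        CASE WHEN CURRENT_ROLE() IN ('APPROVED_ROLE_EXAMPLE') THEN val
             ELSE '***MASKED***' END
""")
cur.execute("""
    ALTER TABLE RAW.EXAMPLE_SOURCE.INVOICES
      MODIFY COLUMN customer_email SET MASKING POLICY RAW.EXAMPLE_SOURCE.MASK_RED_DATA
""")
```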
When extracting new data towards Snowflake and the data source is not listed or listed
When a new data source is extracted into the raw layer, in many cases a new, separate schema will be created for that data source. To make sure the new data source is observed by Monte Carlo, follow the steps outlined in the Monte Carlo permissions handbook section.
The new data source will be documented in the handbook, following this template.