The Data Science Team approach to model development is centered around GitLab's value of iteration and the CRISP-DM standard. Our process expands on some of the six phases outlined in CRISP-DM in order to best address the needs of our specific business objectives and data infrastructure:
We recommend check-ins with stakeholders and the project owner at the following phases, at minimum, to ensure the project is on track to achieve its objectives:
For definitions of some of the terms used, please refer to Common Data Science Terms. For definitions around sizing, please see T-Shirt Sizing Approach.
Create a new issue using the Data Science Process Template
Sizing: Small (with on-going refinements through Implementation Plan Phase)
Purpose: This is perhaps the most critical phase in any data science project. Often a stakeholder will have a general idea of the problem they want to solve, and you will need to help them define and refine the scope of the project before beginning to develop a modeling and implementation strategy.
Tasks:
Considerations:
Completion Criteria:
Sizing: Small
Purpose: Understanding previous analytic work is essential for developing efficient and effective data science projects. By knowing the work that has already been done, we can identify useful data sources, outcome (target) definitions, potential predictors (features), nuances in the data, and important insights.
Tasks:
Considerations:
Completion Criteria:
Iterative with 3b & 3c
Sizing: Large
Purpose: Review available relevant data and conduct analysis around the outcome/target and potential predictors (features). This will allow you to narrow in on the necessary data sources to be used in the Train MVP Model phase. It is important to understand how your outcome/target relates to your potential predictor data and to set up the prediction timeframe appropriately.
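As one way to think about setting up the prediction timeframe, the sketch below separates the period that features are drawn from and the period in which the outcome is observed, so that no outcome information leaks into the features. The names `PredictionWindows`, `build_windows`, `lookback_days`, and `horizon_days` are hypothetical, and the 90-day and 30-day values are placeholders, not team standards.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class PredictionWindows:
    """Date ranges for one training snapshot (all bounds inclusive)."""
    feature_start: date
    feature_end: date
    outcome_start: date
    outcome_end: date


def build_windows(snapshot: date, lookback_days: int, horizon_days: int) -> PredictionWindows:
    """Features come from on or before the snapshot; the outcome is observed strictly after it."""
    if lookback_days < 1 or horizon_days < 1:
        raise ValueError("lookback_days and horizon_days must be positive")
    return PredictionWindows(
        feature_start=snapshot - timedelta(days=lookback_days - 1),
        feature_end=snapshot,
        outcome_start=snapshot + timedelta(days=1),
        outcome_end=snapshot + timedelta(days=horizon_days),
    )


# Hypothetical snapshot date and window lengths, for illustration only.
windows = build_windows(date(2023, 1, 31), lookback_days=90, horizon_days=30)
print(windows)
```

Keeping the two windows in a single object makes it easier to apply the same boundaries consistently when writing the feature and outcome queries.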
Tasks:
Considerations:
Completion Criteria:
Iterative with 3a & 3c
Sizing: Large
Purpose: Based on the EDA, prior analysis, and knowledge about the problem statement, create a list of features in SQL using the tables outlined in the implementation plan that will be used to predict the outcome.
Tasks:
Considerations:
Name fields with the suffixes _cnt, _pct, _amt, and _flag for fields containing counts, percents, currency amounts, and boolean flags, respectively.
Completion Criteria:
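The suffix convention from the considerations above (_cnt, _pct, _amt, _flag) could be checked with a small helper like the one below. This is a minimal sketch: `SUFFIX_KINDS`, `classify_field`, and the example field names are hypothetical, and a field without one of these suffixes is not necessarily wrong, only worth a second look.

```python
from typing import Optional

# Suffixes named in this section, mapped to the kind of value they signal.
SUFFIX_KINDS = {
    "_cnt": "count",
    "_pct": "percent",
    "_amt": "currency amount",
    "_flag": "boolean flag",
}


def classify_field(name: str) -> Optional[str]:
    """Return the kind implied by the field's suffix, or None if no suffix matches."""
    lowered = name.lower()
    for suffix, kind in SUFFIX_KINDS.items():
        if lowered.endswith(suffix):
            return kind
    return None


# Hypothetical feature names, for illustration only.
fields = ["login_cnt", "seat_utilization_pct", "arr_amt", "is_paid_flag", "account_age"]
needs_review = [f for f in fields if classify_field(f) is None]
print(needs_review)
```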
Iterative with 3a & 3b
Sizing: Medium
Purpose: Putting together a modeling plan will allow you to communicate to the stakeholders how you intend to answer their problem statement using data science. A plan should clearly lay out any necessary definitions, data sources, methodologies, and outputs. Additionally, constructing a plan will allow for faster, smoother development in future iterations.
Tasks: In the project issue, document the following:
Considerations:
Completion Criteria:
Sizing: Large
Purpose: Using the dataset created in the previous stage, prepare, model, and glean insights that directly address the problem statement.
Tasks:
_flag fields for those features instead.
Considerations:
Completion Criteria:
Sizing: Medium
Purpose: Synthesize the findings from your model and report back to the stakeholders.
Tasks:
Considerations:
Completion Criteria:
Sizing: Medium
Purpose: In order to score a model independently of a training run – and on data outside of the training time period – we need to operationalize a scoring process and add it to the data science production pipeline.
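To illustrate what scoring independently of a training run means, the sketch below persists fitted parameters at training time and reloads them in a separate scoring step that only needs the saved artifact and new data. It is not a description of the team's production pipeline: the JSON artifact format, the logistic scoring function, the feature names, and the weights are all assumptions made up for the example.

```python
import json
import math
import tempfile
from pathlib import Path


def save_model(path, weights, intercept):
    """Write the fitted parameters to a JSON artifact."""
    Path(path).write_text(json.dumps({"weights": weights, "intercept": intercept}))


def load_model(path):
    """Read the parameters back; no training code or training data is needed."""
    return json.loads(Path(path).read_text())


def score(model, record):
    """Logistic score between 0 and 1; features missing from the record count as 0."""
    z = model["intercept"] + sum(
        weight * record.get(name, 0.0) for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Training run: persist the fitted parameters (values here are made up).
artifact = Path(tempfile.mkdtemp()) / "model.json"
save_model(artifact, {"login_cnt": 0.02, "is_paid_flag": 1.5}, intercept=-2.0)

# Scoring run: only the artifact and records from outside the training period are needed.
model = load_model(artifact)
new_records = [
    {"login_cnt": 10, "is_paid_flag": 1},
    {"login_cnt": 0, "is_paid_flag": 0},
]
scores = [score(model, record) for record in new_records]
print(scores)
```

The point of the split is that the scoring step depends on the saved artifact alone, so it can be scheduled and rerun without repeating a training run.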
Tasks:
@gitlab-data/engineers when ready to operationalize
Considerations:
Completion Criteria:
Sizing: Medium
Purpose: The purpose of setting up dashboards for your model is twofold: 1) monitor model performance and lift "in the wild"; and 2) provide an easy point of access for end users to consume and understand model outputs. As we migrate to new data visualization and data observability tools, we hope to streamline, automate, and simplify creating model dashboards.
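As an illustration of the kind of lift metric a monitoring dashboard might track, the sketch below compares the outcome rate among the highest-scored records with the overall outcome rate. The function name `top_fraction_lift`, the 20% cutoff, and the sample data are assumptions for the example; the metrics actually monitored should come from the project's own plan.

```python
def top_fraction_lift(scores, outcomes, fraction=0.1):
    """Outcome rate among the highest-scored fraction divided by the overall rate."""
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("scores and outcomes must be non-empty and of equal length")
    overall_rate = sum(outcomes) / len(outcomes)
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no positive outcomes")
    # Rank records from highest to lowest score.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(outcome for _, outcome in ranked[:top_n]) / top_n
    return top_rate / overall_rate


# Made-up scores and observed outcomes (1 = the event happened).
sample_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
sample_outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
lift = top_fraction_lift(sample_scores, sample_outcomes, fraction=0.2)
print(round(lift, 2))
```

A lift above 1 means the top-scored group experiences the outcome more often than the population as a whole; a value drifting toward 1 over time is one signal that the model may need attention.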
Tasks:
Considerations:
Completion Criteria:
Sizing: X-Small
Purpose: The purpose of the retrospective is to help the data science team learn and improve as much as possible from every project.
Tasks:
Considerations:
Completion Criteria: