Code review is an important part of any development process. Over the years, it has proven to have a large impact on the quality of source code. In addition, code review facilitates the transfer of knowledge about the codebase, approaches, quality expectations, etc., to both the reviewers and the author.
In a small project, finding an appropriate code reviewer is not a big issue and can be done manually. However, as the project grows, finding an appropriate code reviewer becomes increasingly difficult. In this setting, manual assignment becomes time-consuming and requires keeping many aspects in mind when looking for the right person. Random assignment is error-prone, because the pool of developers and possible reviewers usually grows with the size of the project, reducing the likelihood of an appropriate recommendation. The other option, always assigning key reviewers, overburdens the selected people and turns them into a bottleneck, inadvertently siloing knowledge. Overall, none of these strategies appears to be optimal. The UnReview project was initiated to find appropriate code reviewers in the most effective way.
UnReview focuses on achieving the following goals:
Today, UnReview is an early-stage technology. However, significant testing and validation have been done on production data. After the acquisition is complete, we will continue to work on the approach, integrating UnReview into GitLab iteratively.
To make recommendations, UnReview considers the reviewer’s experience in the part of the source code changed by a merge request. For a given project, UnReview automatically collects the commit history and merge requests in order to identify who is responsible for reviewing which part of the source code. Using that information, UnReview then trains a model that is able to make appropriate recommendations.
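The idea of matching reviewers to the code they have reviewed before can be illustrated with a minimal sketch. This is not UnReview's model: it swaps the trained ML model for a plain counting heuristic, and the `ReviewRecord` schema, the directory-based weighting, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRecord:
    """One historical review: who reviewed which files (hypothetical schema)."""
    reviewer: str
    paths: tuple


def directory_prefixes(path):
    """Yield every directory prefix of a file path, most specific first."""
    parts = path.split("/")[:-1]
    for end in range(len(parts), 0, -1):
        yield "/".join(parts[:end])


def build_experience(history):
    """Count how many past reviews each reviewer has under each directory."""
    experience = defaultdict(lambda: defaultdict(int))
    for record in history:
        touched = set()
        for path in record.paths:
            touched.update(directory_prefixes(path))
        # Each review counts once per directory, however many files it touched.
        for directory in touched:
            experience[directory][record.reviewer] += 1
    return experience


def score_reviewers(experience, changed_paths, author):
    """Score reviewers for a change; nearer directories weigh more (assumed)."""
    scores = defaultdict(float)
    for path in changed_paths:
        for depth, directory in enumerate(directory_prefixes(path)):
            weight = 1.0 / (depth + 1)
            for reviewer, count in experience.get(directory, {}).items():
                if reviewer != author:  # authors do not review their own change
                    scores[reviewer] += weight * count
    return dict(scores)


def recommend(experience, changed_paths, author, top_n=3):
    """Return the top-scoring reviewers, ties broken alphabetically."""
    scores = score_reviewers(experience, changed_paths, author)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [reviewer for reviewer, _ in ranked[:top_n]]


history = [
    ReviewRecord("alice", ("app/models/user.rb",)),
    ReviewRecord("alice", ("app/models/project.rb",)),
    ReviewRecord("bob", ("app/views/index.html",)),
    ReviewRecord("carol", ("lib/api/users.rb",)),
]
experience = build_experience(history)
suggested = recommend(experience, ["app/models/group.rb"], author="dave")
```

In this toy history, `alice` has reviewed two changes under `app/models`, so she ranks ahead of `bob`, whose only experience is elsewhere under `app`. A trained model would replace the hand-picked weights with learned ones.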
UnReview is able to resolve the cold start problem, i.e., the case when the proposed source code is unknown to the recommendation engine. When making recommendations, UnReview additionally tries to balance the review load across the team. Future versions will also consider the context of the merge request, i.e., exactly which source code has been changed and how it affects other parts of the project.
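A rough sketch of how load balancing and a cold-start fallback could sit on top of experience scores is shown below. The source does not describe how UnReview actually handles either case, so the discount formula, the `load_penalty` parameter, and the least-loaded fallback are all hypothetical.

```python
def balance(scores, open_reviews, candidates, top_n=3, load_penalty=0.5):
    """Rank reviewers by experience score discounted by current review load.

    scores:       reviewer -> experience score (empty for unseen code)
    open_reviews: reviewer -> number of reviews currently assigned
    candidates:   all eligible reviewers, used only as the cold-start fallback
    """
    if not scores:
        # Cold start: nobody has experience with this code, so fall back to
        # the least-loaded eligible reviewers (an assumed policy).
        ranked = sorted(candidates, key=lambda r: (open_reviews.get(r, 0), r))
        return ranked[:top_n]

    def adjusted(reviewer):
        # Each open review shrinks the effective score (assumed formula).
        load = open_reviews.get(reviewer, 0)
        return scores[reviewer] / (1.0 + load_penalty * load)

    ranked = sorted(scores, key=lambda r: (-adjusted(r), r))
    return ranked[:top_n]


# alice has more experience but is busy, so bob is suggested first.
busy_team = balance({"alice": 3.0, "bob": 2.0}, {"alice": 4}, ["alice", "bob"])

# No experience scores at all: pick the two least-loaded candidates.
cold_start = balance({}, {"alice": 4, "bob": 1}, ["alice", "bob", "carol"], top_n=2)
```

The discount keeps experienced reviewers in the ranking while letting less-loaded colleagues overtake them, which is one simple way to avoid turning key reviewers into a bottleneck.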
Working on UnReview, we have faced the following challenges:
UnReview consists of multiple components used for a variety of purposes, from data extraction and processing to training machine learning models:
The following chart provides more details on how the UnReview components relate to each other:
When integrating UnReview into GitLab, some components can be replaced as we progress through the defined milestones.
The backend work for integration will be primarily handled by the Applied ML team with help from the infrastructure team. The frontend work will be handled by both the Applied ML team and the Create::code review team (PM Kai Armstrong and EM André Luís).
Milestone 1 focuses on creating an UnReview proof-of-concept that works like Reviewer Roulette based on the GitLab product code. This milestone retains the existing UnReview functionality but requires a number of changes to the architectural components of the approach.
The following tasks have to be completed:
More information on Milestone 1 can be found by following its epic.
At this milestone, a set of UnReview components has to be replaced according to the objectives specified above:
The following chart provides more details on how the UnReview components, including the replaced ones, relate to each other:
Milestone 2 focuses on providing the UnReview functionality to GitLab.com customers and for dogfooding at GitLab. At this milestone, UnReview should be able to connect to a provided project, automatically extract/process data, and periodically retrain the ML models to improve code reviewer recommendations. GitLab.com customers who license and enable this feature will start seeing and experiencing value from the functionality, including the ability to intelligently assign code reviewers to merge requests based on ML models.
To support GitLab.com customers:
To support dogfooding at GitLab:
More information on Milestone 2 can be found by following its epic.
At this milestone, UnReview integrates into GitLab without using the public API, because the API is not expected to be fast enough and could put too high a load on the GitLab infrastructure. Instead, the requested data is passed directly to the UnReview infrastructure via Apache Kafka. In this way, Kafka acts as a bridge between UnReview and GitLab, separating the two infrastructures while giving UnReview access to data that is not available through the public API. To address the separate issue of periodically retraining ML models, UnReview could potentially use a scheduler, e.g., Apache Airflow.
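The consuming side of such a bridge might look roughly like the sketch below. To stay self-contained it uses an in-memory `queue.Queue` as a stand-in for a Kafka topic and a simple counter in place of a scheduler such as Airflow; the event fields (`project_id`, `reviewer`), the `retrain_after` threshold, and the class name are hypothetical and not taken from UnReview or GitLab.

```python
import json
import queue


class ReviewEventStore:
    """Collects merge request events and tracks when a retrain is due."""

    def __init__(self, retrain_after=3):
        self.retrain_after = retrain_after  # hypothetical threshold
        self.events = []
        self.pending = 0  # events received since the last retrain

    def handle(self, raw_message):
        """Parse one JSON message; skip events missing required fields."""
        event = json.loads(raw_message)
        if "project_id" not in event or "reviewer" not in event:
            return False
        self.events.append(event)
        self.pending += 1
        return True

    def retrain_due(self):
        return self.pending >= self.retrain_after

    def mark_retrained(self):
        self.pending = 0


def drain(topic, store):
    """Consume every message currently waiting in the stand-in topic."""
    handled = 0
    while True:
        try:
            raw = topic.get_nowait()
        except queue.Empty:
            return handled
        if store.handle(raw):
            handled += 1


# The producer side (GitLab in the text) is simulated by putting messages.
topic = queue.Queue()
for reviewer in ("alice", "bob", "carol"):
    topic.put(json.dumps({"project_id": 1, "reviewer": reviewer}))
topic.put(json.dumps({"project_id": 1}))  # incomplete event, will be skipped

store = ReviewEventStore(retrain_after=3)
handled = drain(topic, store)
```

Because the consumer only reads from the topic, the producing system never waits on UnReview, which is the decoupling the paragraph above describes. In a real deployment the retrain check would more likely be driven by a scheduled job than by an event counter.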
The following chart provides more details on how UnReview communicates with the GitLab infrastructure via Apache Kafka:
Milestone 3 focuses on further improving the UnReview functionality for GitLab.com customers, as well as planning to provide the functionality to self-hosted customers. At this milestone, GitLab.com customers will continue seeing and experiencing value from the functionality, including the ability to intelligently assign code reviewers to merge requests based on ML models. Primary feedback from GitLab.com customers and dogfooding at GitLab should be considered and integrated into the product.
To support GitLab.com customers:
To support self-hosted customers:
More information on Milestone 3 can be found by following its epic.
Delivering the full UnReview functionality to self-hosted customers may be challenging. Processing data and training ML models require significant hardware resources and administrative effort from the customer’s own team. We continue to explore various approaches to addressing this issue. Additionally, we are looking into privacy-preserving machine learning through data obfuscation.
Overall, one way to support self-hosted customers might be to:
The following diagram provides more details on how UnReview may interact with the self-hosted GitLab infrastructure: