Developing GitLab Duo: How we validate and test AI models at scale

Generative AI marks a monumental shift in the software development industry, making it easier to develop, secure, and operate software. Our new blog series, written by our product and engineering teams, gives you an inside look at how we create, test, and deploy the AI features you need integrated throughout the enterprise. Get to know new capabilities within GitLab Duo and how they will help DevSecOps teams deliver better results for customers.

GitLab values the trust our customers place in us. Part of maintaining that trust is transparency in how we build, evaluate, and ensure the high-quality functionality of our GitLab Duo AI features. GitLab Duo features are powered by a diverse set of models, which allows us to support a broad set of use cases and gives our customers flexibility. By design, GitLab is not tied to a single model provider. We currently use foundation models from Google and Anthropic, and we continuously assess which models are the right match for GitLab Duo’s use cases. In this article, we give you an inside look at our AI model validation process.


Understanding LLMs

Large language models (LLMs) are generative AI models that power many AI features across the platform. Trained on vast datasets, LLMs predict the next word in a sequence based on preceding context. Given an input prompt, they generate human-like text by sampling from the probability distribution of words conditioned on the prompt.
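To make the sampling step concrete, here is a minimal sketch in Python. The four-word vocabulary, the scores, and the function names are all hypothetical; a real LLM scores tens of thousands of tokens with a neural network, but the final step of turning scores into probabilities and drawing from them looks broadly like this.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token from the probability distribution conditioned on the prompt."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical scores a model might assign to the next word after some prompt.
vocab = ["fix", "add", "update", "remove"]
logits = [2.0, 1.5, 1.0, 0.1]

rng = random.Random(0)  # seeded only so the sketch is repeatable
samples = [sample_next_token(vocab, logits, rng=rng) for _ in range(5)]
print(samples)
```

Because each token is drawn at random, the same prompt can yield different outputs from one run to the next.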

LLMs enable intelligent code suggestions, conversational chatbots, code explanations, vulnerability analysis, and more. Their ability to produce diverse outputs for a given prompt makes standardized quality evaluation challenging. LLMs can be optimized for different characteristics, which is why there are so many AI models actively being developed.

Testing at scale

Unlike traditional software systems, where inputs and outputs can be more easily defined and tested, LLMs produce outputs that are often nuanced, diverse, and context-dependent. Testing these models requires comprehensive strategies that account for subjective and variable interpretations of quality, as well as the stochastic nature of their outputs. We therefore cannot judge the quality of an LLM’s output from individual or anecdotal examples; instead, we need to examine the overall pattern of an LLM's behavior. To get a sense of those patterns, we need to test at scale, which means evaluating the performance, reliability, and robustness of a system or application across a large and diverse array of datasets and use cases. Our Centralized Evaluation Framework (CEF) uses thousands of prompts tied to dozens of use cases to identify significant patterns and assess the overall behavior of our foundational LLMs and the GitLab Duo features in which they are integrated.

Testing at scale helps us:

Ensure quality: Testing at scale enables us to assess the quality and reliability of these models across a wide range of scenarios and inputs. By validating the outputs of these models at scale, we can start to identify patterns and mitigate potential issues such as systematic biases, anomalies, and inaccuracies.
Optimize performance: Scaling up testing efforts allows GitLab to evaluate the performance and efficiency of LLMs under real-world conditions. This includes assessing factors such as output quality, latency, and cost to optimize the deployment and operation of these models in GitLab Duo features.
Mitigate risk: Testing LLMs at scale helps mitigate the risks associated with deploying LLMs in critical applications. By conducting thorough testing across diverse datasets and use cases, we can identify and address potential failure modes, security vulnerabilities, and ethical concerns before they impact our customers.

Testing LLMs at scale is imperative for ensuring their reliability and robustness for deployment within the GitLab platform. By investing in comprehensive testing strategies that encompass diverse datasets, use cases, and scenarios, GitLab is working to unlock the full potential of AI-powered workflows while mitigating potential risks.

How we test at scale

These are the steps we take to test LLMs at scale.

Step 1: Create a prompt library as a proxy for production

While other companies view and use customer data to train their AI features, GitLab currently does not. As a result, we needed to develop a comprehensive prompt library that is a proxy for both the scale and activity of production.

This prompt library is composed of questions and answers. The questions represent the kinds of queries or inputs that we would expect to see in production, while the answers represent a ground truth of what our ideal answer would be. You can also think of the ground truth as a target answer. Both the question and the answer may be human generated, but are not necessarily so. These question/answer pairs give us a basis for comparison and a frame of reference that let us tease out differences between models and features. When multiple models are asked the same question and generate different responses, we can use our ground truth answer to determine which model's answer is most closely aligned with our target and score the models accordingly.
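As an illustration only, a question/answer pair and the comparison against its ground truth could be sketched as follows. The `PromptPair` fields, the model names, and the `closeness` function are hypothetical, and the character-level similarity from Python's standard `difflib` module is just a stand-in for the real metrics.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass(frozen=True)
class PromptPair:
    """One hypothetical prompt library entry."""
    use_case: str
    question: str
    ground_truth: str

def closeness(response: str, ground_truth: str) -> float:
    """Toy stand-in metric: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, response.lower(), ground_truth.lower()).ratio()

def rank_models(pair, responses):
    """Return (model, score) tuples, with the closest answer first."""
    scored = [(model, closeness(text, pair.ground_truth))
              for model, text in responses.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

pair = PromptPair(
    use_case="code_explanation",
    question="What does `git rebase main` do?",
    ground_truth="It replays the commits of the current branch on top of main.",
)
# Made-up responses from two made-up models answering the same question.
responses = {
    "model_a": "It replays the current branch's commits on top of main.",
    "model_b": "It deletes the main branch.",
}
ranking = rank_models(pair, responses)
print(ranking[0][0])  # the model whose answer is closest to the ground truth
```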

Again, a key element of a comprehensive prompt library is ensuring that it is representative of the inputs that we expect to see in production. We want to know how well foundational models fit our specific use cases, and how well our features are performing. There are numerous benchmark prompt datasets, but those datasets may not reflect the use cases that we see for features at GitLab. Our prompt library is designed to be specific to GitLab features and use cases.

Step 2: Baseline model performance

Once we have crafted a prompt library that accurately reflects production activity, we feed those questions into various models to test how well they serve our customers’ needs. We compare each response to our ground truth and assign it a ranking based on a series of metrics, including Cosine Similarity Score, Cross Similarity Score, LLM Judge, and Consensus Filtering with an LLM Judge. This first iteration gives us a baseline for how well each model is performing and guides our selection of a foundational model for our features. For brevity, we won’t go into the details here, but we encourage you to learn more about the metrics here. It is important to note that this isn’t a solved problem; the wider AI industry is actively researching and developing new techniques. GitLab’s model validation team keeps a pulse on the industry and is continuously iterating on how we measure and score the LLMs GitLab Duo uses.
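To give a feel for the first of those metrics, here is a minimal sketch of cosine similarity between a response and a ground truth answer. It assumes simple word-count vectors so that the example stays self-contained; this article does not describe how the vectors are actually built, and an evaluation pipeline would more plausibly compare embedding vectors.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word tokens; a simplifying assumption, not an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors, in [0, 1]."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[token] * vb[token] for token in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

ground_truth = "Use a parameterized query to prevent SQL injection."
response = "Prevent SQL injection by using a parameterized query."
score = cosine_similarity(response, ground_truth)
print(round(score, 2))
```

A score near 1 means the two texts use largely the same words, and a score of 0 means they share none, which is one reason a single metric is not enough on its own.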

Step 3: Feature development

Now that we have a baseline for our selected model's performance, we can start developing our features with confidence. While prompt engineering gets a lot of buzz, focusing entirely on changing the behavior of a model via prompting (or any other technique) without validation means that you are operating in the dark and very possibly overfitting your prompting. You may solve one problem but cause a dozen more, and you would never know. Creating a baseline for a model's performance allows us to track how we are changing behavior over time for all our necessary use cases. At GitLab, we re-validate the performance of our features on a daily basis during active development to help ensure that all changes improve the overall functionality.

Step 4: Iterate, iterate, iterate

Here is how our experimental iterations work. Each cycle, we examine the scores from our tests at scale to identify patterns:

What are the commonalities across our weakest areas?
Is our feature performing poorly based on a specific metric or on a certain use case?
Do we see consistent errors popping up in response to a certain kind of question?

Only when we test at scale do these kinds of patterns begin to emerge and allow us to focus our experiments. Based on these patterns, we propose a variety of experiments or approaches to try to improve performance in a specific area and on a specific metric.
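One way to surface such patterns, sketched here with made-up scores and hypothetical field names, is to average the results by use case and metric and look at the lowest-scoring groups first.

```python
from collections import defaultdict
from statistics import mean

def weakest_areas(results, n=2):
    """Average scores per (use case, metric) and return the n lowest groups."""
    grouped = defaultdict(list)
    for row in results:
        grouped[(row["use_case"], row["metric"])].append(row["score"])
    averages = {key: mean(scores) for key, scores in grouped.items()}
    return sorted(averages.items(), key=lambda item: item[1])[:n]

# Made-up evaluation rows; a real run would contain thousands of them.
results = [
    {"use_case": "code_explanation", "metric": "cosine_similarity", "score": 0.82},
    {"use_case": "code_explanation", "metric": "cosine_similarity", "score": 0.78},
    {"use_case": "vulnerability_analysis", "metric": "cosine_similarity", "score": 0.41},
    {"use_case": "vulnerability_analysis", "metric": "cosine_similarity", "score": 0.47},
    {"use_case": "chat", "metric": "llm_judge", "score": 0.65},
]
weakest = weakest_areas(results, n=2)
for (use_case, metric), average in weakest:
    print(use_case, metric, round(average, 2))
```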

However, testing at scale is both expensive and time-consuming. To enable faster and less expensive iteration, we craft smaller-scale datasets to act as a mini-proxy. A focused subset is weighted toward the question/answer pairs that we know we want to improve upon, while a broader subset also samples all the other use cases and scores to ensure that our changes aren't adversely affecting the feature broadly. We make a change and run it against the focused subset of data, then ask: How does the new response compare to the baseline? How does it compare to the ground truth?
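A rough sketch of how such subsets could be drawn from a prompt library follows. The function name, the parameters, and the choice of a fixed sample size per use case are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def build_subsets(library, target_use_case, focused_size, per_use_case, seed=0):
    """Return (focused, broad) samples drawn from a prompt library."""
    rng = random.Random(seed)  # seeded so the same subsets can be rebuilt later
    by_use_case = defaultdict(list)
    for pair in library:
        by_use_case[pair["use_case"]].append(pair)

    # Focused subset: only pairs from the area we want to improve.
    targets = by_use_case[target_use_case]
    focused = rng.sample(targets, min(focused_size, len(targets)))

    # Broad subset: a few pairs from every use case, to catch regressions elsewhere.
    broad = []
    for pairs in by_use_case.values():
        broad.extend(rng.sample(pairs, min(per_use_case, len(pairs))))
    return focused, broad

# A made-up library with three use cases and ten pairs each.
library = [
    {"use_case": use_case, "question": f"{use_case} question {i}",
     "ground_truth": f"{use_case} answer {i}"}
    for use_case in ("code_explanation", "vulnerability_analysis", "chat")
    for i in range(10)
]
focused, broad = build_subsets(
    library, "vulnerability_analysis", focused_size=5, per_use_case=2
)
print(len(focused), len(broad))
```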

Once we have found a prompt that addresses the specific use case we are working on with the focused subset, we validate that prompt against a broader subset of data to help ensure that it won’t adversely affect other areas of the feature. Only when we believe that the new prompt improves our performance in our target area through validation metrics AND doesn’t degrade performance elsewhere, do we push that change to production.
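The decision rule described above can be expressed as a small check. This is a sketch with hypothetical names and made-up scores; the `tolerance` parameter is our own addition to allow for run-to-run noise and is not something the process above specifies.

```python
def should_promote(baseline, candidate, target, tolerance=0.0):
    """Promote only if the target area improves and no other area degrades.

    baseline and candidate map each use case to an average score and are
    assumed to contain the same keys.
    """
    improves_target = candidate[target] > baseline[target]
    no_regressions = all(
        candidate[area] >= baseline[area] - tolerance
        for area in baseline
        if area != target
    )
    return improves_target and no_regressions

baseline = {"vulnerability_analysis": 0.44, "code_explanation": 0.80, "chat": 0.65}
better = {"vulnerability_analysis": 0.58, "code_explanation": 0.80, "chat": 0.66}
regressed = {"vulnerability_analysis": 0.70, "code_explanation": 0.61, "chat": 0.65}

print(should_promote(baseline, better, "vulnerability_analysis"))
print(should_promote(baseline, regressed, "vulnerability_analysis"))
```

In the second case the target area improves substantially, but the drop in another area blocks the change.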

The entire Centralized Evaluation Framework is then run against the new prompt and we validate that it has increased the performance of the entire feature against the baseline from the day before. In this way, GitLab is constantly iterating to help ensure that you are getting the latest and greatest performance of AI-powered features across the GitLab ecosystem. This allows us to ensure that we keep working faster, together.

Making GitLab Duo even better

Hopefully this gives you insight into how we’re responsibly developing GitLab Duo features. This process has been developed as we’ve brought GitLab Duo Code Suggestions and GitLab Duo Chat to general availability. We’ve also integrated this validation process into our development process as we iterate on GitLab Duo features. It’s a lot of trial and error, and many times fixing one thing breaks three others. But we have data-driven insights into those impacts, which helps us ensure that GitLab Duo is always getting better.

