The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc.
The Centralized Evaluation Framework is based on three main elements: a prompt library, an established ground truth, and validation metrics. Validation metrics assess the accuracy and usefulness of GenAI outputs against ground truth.
The Centralized Evaluation Framework incorporates various validation metrics, including but not limited to similarity scores, cross similarity scores, and LLM evaluator scores such as LLM consensus filtering and LLM judges. The combined output of these use-case-specific metrics serves as a proxy for production performance and mimics human judgment in accepting or rejecting AI-generated content.
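As an illustration of that combination step, the sketch below shows how per-metric scores might be rolled up into a single accept/reject decision. The field names, weights, and threshold are assumptions for illustration only, not the framework's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    """Scores produced by the use-case-specific metrics for one prompt (illustrative)."""
    similarity: float        # embedding similarity vs. ground truth, normalized to 0-1
    cross_similarity: float  # aggregated cross similarity score, normalized to 0-1
    llm_judge: float         # normalized LLM Judge score, 0-1

def accept(result: EvaluationResult, threshold: float = 0.75) -> bool:
    """Combine metric outputs into one accept/reject decision.

    The weights and threshold are hypothetical; in practice each use case
    would tune its own combination to approximate human judgment.
    """
    combined = (0.4 * result.similarity
                + 0.3 * result.cross_similarity
                + 0.3 * result.llm_judge)
    return combined >= threshold
```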
Similarity scores compare AI-generated text against ground truth text. This ground truth may be static or may be the dynamic output of an LLM with known good answers in a specific domain.
Both the output to be tested and the ground truth are converted into numerical representations using embeddings. Embeddings are vector representations of words or sentences in a high-dimensional space, where semantically similar texts are positioned closer together. To calculate the embedding of each block, we use Vertex AI's text-embedding-gecko model. The similarity score is then calculated using the dot product of the two embeddings.
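A minimal sketch of this calculation, assuming the Vertex AI Python SDK is available; exact client calls can vary between SDK versions:

```python
import numpy as np
from vertexai.language_models import TextEmbeddingModel

# Load the embedding model (model name as described above; a sketch, not production code).
model = TextEmbeddingModel.from_pretrained("textembedding-gecko")

def similarity_score(candidate: str, ground_truth: str) -> float:
    """Embed both blocks of text and return the dot product of their embedding vectors."""
    emb_candidate, emb_truth = model.get_embeddings([candidate, ground_truth])
    a = np.array(emb_candidate.values)
    b = np.array(emb_truth.values)
    return float(np.dot(a, b))
```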
While similarity scores are good general quality indicators, they suffer from the partial matching problem. Because the score treats each block of text as a whole, it is low whenever the two blocks differ significantly in length or only partially overlap. As a result, similarity scores can be misleading when a partial match is still a high-quality output.
The Code Suggestions use case illustrates this: a generated suggestion can match only part of the ground truth and still be useful.
The cross similarity score overcomes the partial matching problem, allowing us to better evaluate text produced by GenAI across multiple use cases.
The cross similarity score is based on a cross similarity matrix, which compares element pairs across two outputs and quantifies how similar each pair is. The matrix has one row for each element of the first output and one column for each element of the second, with each cell holding the similarity score of the corresponding pair. The cell scores are then aggregated into a single score representing the overall similarity between the two outputs.
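A rough sketch of the idea in Python, assuming a hypothetical `embed` helper that returns unit-length embedding vectors. Splitting outputs into lines and averaging row and column maxima are illustrative choices, not the framework's exact aggregation:

```python
import numpy as np

def cross_similarity_score(output_a: str, output_b: str, embed) -> float:
    """Build a cross similarity matrix over the elements (here: lines) of two
    outputs and aggregate it into a single score."""
    elems_a = [line for line in output_a.splitlines() if line.strip()]
    elems_b = [line for line in output_b.splitlines() if line.strip()]

    emb_a = np.array([embed(e) for e in elems_a])
    emb_b = np.array([embed(e) for e in elems_b])

    # matrix[i, j] = similarity between element i of output A and element j of output B
    matrix = emb_a @ emb_b.T

    # Credit the best match for every element in both directions, so partial
    # matches are rewarded rather than penalized.
    return float((matrix.max(axis=1).mean() + matrix.max(axis=0).mean()) / 2)
```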
The LLM Judge metric assesses specific criteria, such as the relevance of a response to a question. In this approach, one LLM (the LLM Judge) rates the responses of the LLM being evaluated against a series of prompts. The LLM Judge then scores these responses for specific criteria such as correctness, comprehensiveness, and readability.
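A simplified sketch of a single-judge evaluation, assuming a hypothetical `call_llm` helper that sends a prompt to the judge model and returns its completion; the rubric wording and scale are illustrative:

```python
import json

JUDGE_PROMPT = """You are evaluating an AI-generated answer.

Question:
{question}

Answer to evaluate:
{answer}

Rate the answer from 1 (poor) to 5 (excellent) on each criterion:
correctness, comprehensiveness, readability.
Respond with a JSON object, e.g. {{"correctness": 4, "comprehensiveness": 3, "readability": 5}}.
"""

def judge(question: str, answer: str, call_llm) -> dict:
    """Ask the judge model to score one response against the rubric."""
    completion = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(completion)
```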
To enhance credibility, we employ multiple LLMs as judges. The LLM Judges are selected based on their strong language comprehension capabilities.
Consensus filtering with an LLM Judge compares outputs from multiple LLMs responding to the same set of prompts. Unlike the single LLM Judge approach, this method aggregates all LLM outputs into a single prompt. The LLM Judge then compares the different responses and scores each one with full context of the range of possible answers.
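A sketch of consensus filtering along the same lines, again assuming the hypothetical `call_llm` helper from above; the prompt wording and labels are illustrative:

```python
import json

CONSENSUS_PROMPT = """You are comparing answers from several AI models to the same question.

Question:
{question}

{candidates}

Score each candidate from 1 (poor) to 5 (excellent), taking the other
candidates into account. Respond with a JSON object mapping candidate
labels to scores, e.g. {{"candidate_1": 4, "candidate_2": 2}}.
"""

def consensus_scores(question: str, answers: dict[str, str], call_llm) -> dict:
    """Aggregate all model outputs into one prompt and let the judge score each with full context."""
    candidates = "\n\n".join(
        f"Candidate {label}:\n{text}" for label, text in answers.items()
    )
    completion = call_llm(CONSENSUS_PROMPT.format(question=question, candidates=candidates))
    return json.loads(completion)
```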
Last Reviewed: 2025-06-11
Last Updated: 2025-06-11