The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc.
|  |  |
| --- | --- |
| Stage | AI-powered |
| Group | AI Model Validation |
| Content Last Reviewed | 2024-06-11 |
AI Validation is a cornerstone of successful GenAI implementation. It provides mechanisms to empirically measure GenAI outputs at scale, enabling data-driven decisions.
AI Validation empowers methodical iteration on AI features, creating greater efficiency in developing AI-enabled workflows across the platform. It ensures the reliability and quality of AI outputs, mitigating risks associated with GenAI.
The AI Validation Team's Centralized Evaluation Framework supports the entire end-to-end process of AI feature creation—from selecting appropriate models to evaluating feature outputs. AI Validation complements other evaluation types such as SET Quality testing and diagnostic testing, focusing specifically on GenAI interactions.
The Centralized Evaluation Framework relies on three main elements: a prompt library, validation metrics, and comparative foundational models. The prompt library contains diverse benchmark datasets tailored to various use cases, including code completion, code generation, and natural language questions. Validation metrics assess the accuracy and usefulness of GenAI outputs against industry benchmarks. The framework incorporates various validation methods, including but not limited to LLM consensus filtering, LLM judges, and cosine similarities. Foundational models provide baselines for measuring GitLab AI feature performance against industry standards. Additional information on our validation process is available here.
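To make one of these validation methods concrete, the sketch below scores a generated answer against a reference answer from a benchmark dataset using cosine similarity over text embeddings. It is illustrative only: the embedding model (sentence-transformers `all-MiniLM-L6-v2`) and helper names are assumptions, not the framework's actual implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence-embedding model works for this sketch; this choice is an assumption.
model = SentenceTransformer("all-MiniLM-L6-v2")


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


reference = "Use an interactive rebase (`git rebase -i`) to squash the commits before merging."
candidate = "Run `git rebase -i`, mark the extra commits as 'squash', then merge the branch."

ref_vec, cand_vec = model.encode([reference, candidate])
print(f"semantic similarity: {cosine_similarity(ref_vec, cand_vec):.3f}")
```

In practice, similarity scores like this are combined with the other validation methods mentioned above (LLM judges, consensus filtering) rather than used in isolation.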
The Evaluation Framework enables large-scale tests designed to be composable via CLI (Command Line Interface) and API. This allows feature teams to assess the impact of code changes on specific AI-generated content. For example, teams focusing on improving output for specific use cases can test only those cases rather than the entire test corpus. Test results are accessible locally, on BigQuery (Google's data warehouse), or via feature-specific dashboards. The AI Validation team began by supporting Code Suggestions and now actively supports Duo Chat.
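The sketch below illustrates the composable idea under stated assumptions: a feature team selects only the benchmark cases for one use case and evaluates those. The corpus fields (`use_case`, `prompt`) and the `evaluate` stub are hypothetical stand-ins, not the framework's real CLI or API surface.

```python
# Hypothetical, self-contained sketch of running a targeted slice of a benchmark corpus.
corpus = [
    {"id": 1, "use_case": "code_generation", "prompt": "Write a function that parses a CSV row."},
    {"id": 2, "use_case": "code_completion", "prompt": "def load_config(path):"},
    {"id": 3, "use_case": "natural_language", "prompt": "How do I revert a commit?"},
]


def evaluate(case: dict) -> dict:
    """Stand-in for invoking the model under test and scoring its output."""
    return {"id": case["id"], "score": None}


# A team improving code generation runs only that slice of the corpus.
selected = [case for case in corpus if case["use_case"] == "code_generation"]
results = [evaluate(case) for case in selected]
print(results)
# Results could then be inspected locally, loaded into BigQuery,
# or surfaced on a feature-specific dashboard.
```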
The AI Validation team continues iterating on and building the Centralized Evaluation Framework. Our goal is to expand support to future AI-powered features, enhance developer productivity, and instill confidence in the reliability and value of AI-powered processes across the software development lifecycle.
The AI Validation category focuses on assessing the performance, parameter tuning, prompt engineering techniques, and algorithm quality of various AI models. The Centralized Evaluation Framework incorporates numerous open source models as well as industry models from Google, Anthropic, and others.
Current work supports the Duo Chat team in assessing chat responses based on correctness, readability, and comprehensiveness. The AI Validation team uses industry models as benchmarks and has curated both open source and custom datasets for specific use cases identified by the Duo Chat team. We employ validation methods appropriate to chat use cases, including LLM consensus filtering, LLM judges, and cosine similarity scores. The AI Validation team works closely with the Duo Chat team to enable efficient, data-driven iteration on Chat tool engineering and production of high-quality responses.
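As a rough illustration of how LLM judges and consensus filtering can fit together (not the team's actual implementation), the sketch below has several stubbed judge models rate a chat response on correctness, readability, and comprehensiveness, and keeps the response only if the judges collectively clear a threshold.

```python
from statistics import mean

RUBRIC = ("correctness", "readability", "comprehensiveness")


def judge_response(judge_name: str, question: str, response: str) -> dict:
    """Stand-in for prompting one judge LLM to rate a chat response 1-5
    on each rubric dimension; here it returns placeholder scores."""
    return {dimension: 4 for dimension in RUBRIC}


def consensus(scores: list[dict], threshold: float = 3.5) -> bool:
    """Accept a response only if the judges' mean score on every dimension
    clears the threshold (one simple form of consensus filtering)."""
    return all(mean(s[dimension] for s in scores) >= threshold for dimension in RUBRIC)


question = "How do I create a merge request from the command line?"
response = "Push your branch, then run `glab mr create` to open the merge request."

scores = [judge_response(judge, question, response) for judge in ("judge_a", "judge_b", "judge_c")]
print("accepted" if consensus(scores) else "filtered out")
```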
FY25 R&D Investment Themes: Enable AI/ML Efficiencies Across DevSecOps
AI Validation enables data-driven decisions in creating and implementing GenAI features across the GitLab platform. We have supported Code Suggestions and Duo Chat, and will continue enabling efficient iteration on GenAI features. As a long-term initiative, we aim to expand our Centralized Evaluation Framework to evaluate various models based on Quality, Cost, and Latency.
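A minimal sketch of what such a Quality, Cost, and Latency comparison could look like; the model names and numbers are hypothetical and purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelScorecard:
    name: str
    quality: float             # e.g. mean LLM-judge score, normalized to 0-1
    cost_per_1k_tokens: float  # USD
    latency_p50_ms: int


# Hypothetical candidates with made-up numbers.
candidates = [
    ModelScorecard("model-a", quality=0.82, cost_per_1k_tokens=0.010, latency_p50_ms=420),
    ModelScorecard("model-b", quality=0.78, cost_per_1k_tokens=0.002, latency_p50_ms=310),
]

for m in sorted(candidates, key=lambda c: c.quality, reverse=True):
    print(f"{m.name}: quality={m.quality:.2f}, "
          f"cost=${m.cost_per_1k_tokens}/1k tokens, p50 latency={m.latency_p50_ms} ms")
```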
Our short-term goals support scaling the Centralized Evaluation Framework and include:
Primary Decision Factors, inspired by this paper: