Measuring AI effectiveness beyond developer productivity metrics

AI-powered productivity tools promise to boost productivity by automating repetitive coding and tedious tasks, as well as generating code. How organizations measure the AI impact of these productivity tools has yet to be truly figured out. GitLab is working on a solution: AI Impact is a dashboard grounded in value stream analytics that will help organizations understand the effect of GitLab Duo, our AI-powered suite of features, on their productivity. AI Impact is the culmination of what we’ve learned at GitLab about measuring the impact of AI, and we wanted to share those lessons with you.

A report for The Pragmatic Engineer shows that measuring productivity in general isn’t straightforward, with top engineering teams around the globe all using different metrics. If everyone has a different productivity metric to optimize, how do we even begin to measure the impact of AI productivity tools? Welcome to why measuring AI assistant productivity impact is difficult and commonly misses the mark.

Follow the progress of our AI Impact dashboard and share your feedback.

Flawed productivity metrics

Simplistic productivity metrics like lines of code contributed per day or acceptance rates of AI suggestions fail to capture downstream costs. For instance, GitClear, according to an Infoworld article, “analyzed 153 million lines of changed code between January 2020 and December 2023 and now expects that code churn ('the percentage of lines that are reverted or updated less than two weeks after being authored') will double in 2024." Thus, simply measuring lines of code risks technical debt pileup and skill atrophy in developers.

Indirect impacts are hard to quantify

The goal of AI developer tools is to remove toil, allowing developers to focus on higher value tasks like system architecture and design. But how much time is really saved this way versus spent reviewing, testing, and maintaining AI-generated code? These second-order productivity impacts are very difficult to accurately attribute directly to AI, which may give you a false sense of value. One solution to this is to choose who gets to use AI productivity tools carefully.

Focus should be on business outcomes

Ultimately, what matters is real-world business outcomes, not developer activity metrics. Tracking lead time, cycle time, production defects, and user satisfaction better indicate where bottlenecks exist. If AI tools generate usable code faster, and quality teams can’t keep up with changes, the end software product may decrease in quality and lead to customer satisfaction problems. Shipping more sounds great until it causes problems that take even more time, money, and effort to resolve. Measuring business outcomes is also difficult and these measurements frequently are lagging indicators of problems. Measuring quality defects, security issues, and application performance are all ways to identify business impact sooner.

The need to balance speed and quality

While AI code generation has the potential to accelerate development velocity, it should not come at the cost of overall quality and maintainability. Teams must strike the right balance between velocity and writing maintainable, well-tested code that solves actual business problems. Quality should not be sacrificed purely to maximize productivity metrics. This is when measuring lines of code AI generates or number of AI suggestions developers accept can optimize for the problematic outcomes. More code doesn't necessarily mean higher quality or productivity. More code means more to review, test, and maintain – potentially slowing delivery down.

Let’s look at an example: AI-generated code output is scoped to the area a developer is currently working on. Current AI tools lack the ability to assess the broader architecture of the application (amplified in a microservices architecture). This means that even if the quality of the generated code is good, it may lead to repetition and code bloat because it will be inserted into the area targeted rather than making wider systematic changes. This is problematic in languages that are architected with object-oriented languages that use DRY (don't-repeat-yourself) principles. This is an active area of research and we’re excited to adopt new approaches and technologies to increase the context awareness of our AI features.

Acceptance rate can be particularly misleading, and unfortunately is becoming the primary way AI productivity tools measure success. Developers may accept an AI-generated suggestion but then need to heavily edit or rewrite it. Thus, the initial acceptance gives no indication of whether the suggestion was actually useful. Acceptance rate is fundamentally a proxy for AI assistant quality, yet it is misconstrued as a productivity measure. This is especially misleading when all vendors are measuring acceptance rate differently and marketing based on this number. GitLab intentionally does not use this kind of data in our marketing. What we’ve seen in practice is that developers use AI-generated code similar to how an actor uses a cue – they look at the generated code and say, "oh, right, that's the nudge I needed, I'll take it from here."

Implementation and team dynamics play a key role

How productivity gains materialize depends on how AI tools are implemented and developer dynamics. If some developers distrust the technology or reviews become lax expecting AI to catch errors, quality may suffer. Additionally, introducing AI tools often necessitates changes to processes like code reviews, testing, and documentation. Productivity could temporarily decline as teams adjust to new workflows before seeing gains. Organizations must ensure that when implementing AI tools, that they allow teams time to figure out how it works and how it fits into their workflows, knowing that this trial-and-error period may lead to reduced productivity metrics before seeing productivity gains.

To get this balance right, it’s important to define the tasks that are highly accurate and consistent and train the team to use AI for those use cases (at least, at first). We know that AI code generation is useful for producing scaffolding, test generation, and syntax corrections, as well as generating documentation. Have teams start there and they will see better results and learn to use the tool more effectively. Remember you can’t measure AI’s impact in a week. You have to give teams time to find their rhythm with their AI assistants.

Challenges exist, but AI is the future

Now that we’ve talked about the challenges of measuring AI impact and potential risks, we do want to say at GitLab we do believe AI has a huge role to play in the evolution of DevSecOps platforms. That’s why we’re building GitLab Duo. But we are not rushing into productivity measurement by showing acceptance rates, or lines of code generated. We believe these are a step backwards to previous ways of thinking about productivity. Instead we’re looking at the data we have within our unified DevSecOps platform to present a more complete picture of AI Impact.

What to measure instead

Measuring the productivity impacts of AI developer tools requires nuance and a focus on end-to-end outcomes rather than isolated productivity metrics. For these reasons, simple quantitative metrics tend to miss the nuances of measuring productivity with AI developer tools. The key is to combine quantitative data from across the software development lifecycle (SDLC) with qualitative feedback from developers on how AI actually impacts their day-to-day experience and shapes long-term development practices. Only then can we get an accurate picture of the productivity gains these tools can offer. We view AI as an augmentor to DevSecOps adoption, rather than a replacement for doing things the right way. Organizations focusing on building the right muscles in their SDLC practice are the ones best positioned to actually take advantage of any potential gains in developer coding productivity.

So what metric should we use instead? At GitLab we already have value stream analytics, which examine the end-to-end flow of work from idea to production to determine where bottlenecks exist. Value stream analytics isn’t a single measurement, it’s the ongoing tracking of metrics like lead time, cycle time, deployment frequency, and production defects. This keeps the focus on business outcomes rather than developer activity. By taking a holistic view across code quality, collaboration, downstream costs, and developer experience, teams can steer these technologies to augment (rather than replace) human abilities over the long run.

Introducing GitLab's AI Impact approach

GitLab has the whole picture being a unified DevSecOps platform that spans the entire SDLC. We built Value Stream Management to empower teams with metrics and insights to ship better software faster. Blending GitLab Value Stream Analytics and DORA metrics, and GitLab Duo usage data, we can provide organizations with the complete picture of how AI is impacting their SDLC. We’re calling this dashboard AI Impact, and it’s coming in an upcoming release to measure GitLab Duo’s impact on productivity. Follow our progress and share your feedback.

Disclaimer: This blog contains information related to upcoming products, features, and functionality. It is important to note that the information in this blog post is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. As with all projects, the items mentioned in this blog and linked pages are subject to change or delay. The development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab.

Measuring AI effectiveness beyond developer productivity metrics

Flawed productivity metrics

Indirect impacts are hard to quantify

Focus should be on business outcomes

The need to balance speed and quality

Implementation and team dynamics play a key role

Challenges exist, but AI is the future

What to measure instead

Introducing GitLab's AI Impact approach

More to explore

Use GitLab AI features out-of-the-box in a GitLab Workspace

Developing GitLab Duo: Use AI to remediate security vulnerabilities

Developing GitLab Duo: A roundup of recent Chat enhancements

We want to hear from you

Ready to get started?

Measuring AI effectiveness beyond developer productivity metrics

Flawed productivity metrics

Indirect impacts are hard to quantify

Focus should be on business outcomes

The need to balance speed and quality

Implementation and team dynamics play a key role

Challenges exist, but AI is the future

What to measure instead

Introducing GitLab's AI Impact approach

Sign up for GitLab’s newsletter

More to explore

Use GitLab AI features out-of-the-box in a GitLab Workspace

Developing GitLab Duo: Use AI to remediate security vulnerabilities

Developing GitLab Duo: A roundup of recent Chat enhancements

We want to hear from you

Ready to get started?