Everyone has an opinion about whether AI makes developers more productive. The researchers have data.
Over the past 18 months, a cluster of rigorous studies has landed on questions the marketing slides don't ask: not "does AI help developers write more code?" but "does it help them build better software?" Not "do developers feel more productive?" but "what's actually happening to the systems they're operating?"
The answers are counterintuitive. Some of them are uncomfortable. All five are worth understanding before your next AI tooling decision.
Study 1: Experienced developers on real codebases got 19% slower
The one that breaks the "AI helps experts most" assumption.
METR ran a controlled study with 16 experienced open-source contributors — people who'd worked on their codebases for years — across 246 tasks. Real work. Familiar code. Senior engineers.
AI tools increased completion time by 19%.
The developers themselves predicted AI would make them 24% faster. That's a 43-point gap between expectation and outcome, in the wrong direction. And this wasn't a sample of developers new to AI. These were contributors who used these tools regularly, on projects they knew deeply.
The mechanism isn't hard to explain. Code assistants excel at boilerplate and pattern completion — the parts of engineering that were already fast. The tasks that take the longest are the ones requiring architectural judgment, system-level reasoning, and deep familiarity with how a specific codebase behaves under specific conditions. That's precisely what AI tools currently can't replicate. So you add a layer of tool management and verification overhead on top of the hard work, and the hard work doesn't get faster.
If you're a VPE making the case that code assistants will reduce engineering hours on production incidents, this study is the one to read first.
Study 2: Developers using AI scored nearly two letter grades lower on debugging
The one about the skill your L3 queue actually depends on.
Anthropic ran a randomized controlled trial with 52 software developers. One group used AI assistants. One didn't. Both groups then took a comprehension test on the concepts they'd just used.
The AI group scored 17% lower — nearly two letter grades. The largest gap was specifically on debugging questions: the ability to understand when code is wrong and why it fails.
This is not a minor finding. Debugging comprehension is the bottleneck in every hard ticket. It's what separates an engineer who can close a production escalation in 20 minutes from one who needs three hours and two colleagues. When that skill atrophies across a team, every complex issue gets harder and slower to resolve.
The study also found a meaningful caveat: developers who used AI to build understanding — asking "how does this work?" rather than "write this for me" — retained comprehension close to the control group. The damage is specific to passive acceptance of AI-generated output, not to AI use itself. How your team uses these tools matters as much as whether they use them.
The practical question for any engineering leader: do you know which pattern your team has adopted?
Study 3: Senior engineers got 19% less productive — even though junior devs got faster
The one about what AI does to your most valuable people.
Tilburg University analyzed 2,755 open-source projects and 1,699 developers before and after the introduction of GitHub Copilot. The headline finding is that AI tools boosted productivity for junior contributors. The buried finding is more important for most organizations.
Senior engineers — the core contributors responsible for maintaining quality — saw a 19% drop in their own output. They were spending 6.5% more of their time reviewing, correcting, and cleaning up code that others (and AI) had generated. Their own contributions to new development declined.
This is the redistribution problem. AI doesn't eliminate senior engineering time. It redirects it from creative work toward maintenance and oversight. The more junior contributors lean on AI-generated code, the more senior engineers absorb the review burden. And in most software organizations, senior engineers are already the L3 escalation bottleneck, the production incident responders, and the architectural decision-makers. Pulling them further toward review work has compounding downstream costs that don't show up in any productivity dashboard.
The lesson isn't that AI tools are bad for junior developers. It's that organizations need to account for where the overhead lands — and it almost always lands on the people who can least afford to be interrupted.
Study 4: Refactoring collapsed. Copy/paste exploded. The cleanup bill is compounding.
The one about what's actually happening inside your codebase.
GitClear analyzed 211 million lines of code authored between 2020 and 2024 across major open-source repositories including Chromium, React, Kubernetes, and VS Code. They tracked seven distinct code operations — not just "lines added" but moved, refactored, copy/pasted, churned, and duplicated.
The findings are structural:
Refactoring — moving and consolidating code into reusable modules, the signature of a maintainable codebase — dropped from 25% of code activity in 2021 to under 10% in 2024. A collapse of more than 60% in three years.
Copy/pasted code exceeded moved (refactored) code for the first time in the dataset's history. The frequency of commits containing duplicated code blocks increased 8-fold in 2024 alone.
Code churn — lines revised within two weeks of being written — increased 26% year over year.
The picture this paints: AI tools are generating more code, faster, with less architectural consolidation. The same business logic now lives in five files instead of one. When the logic needs to change, you're updating five places and hoping you found them all. Research on code clones consistently finds that 57% of co-changed code clones are involved in bugs — and that up to 33% of bug fixes in cloned code can contain propagated bugs that need to be fixed again elsewhere.
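To make the duplication cost concrete, here is a hypothetical before-and-after sketch (illustrative TypeScript, not drawn from any of the repositories GitClear studied): the same pricing rule pasted into three call sites versus consolidated into one helper. When the rule changes, the first version needs three coordinated edits, and missing one is exactly the propagated-bug pattern the clone research describes.

```typescript
// Hypothetical illustration of duplicated vs. consolidated business logic.

// The pattern GitClear sees more of: the same rule pasted into each call site.
function checkoutTotal(price: number, isMember: boolean): number {
  return isMember ? price * 0.9 : price; // 10% member discount, copy #1
}

function invoiceTotal(price: number, isMember: boolean): number {
  return isMember ? price * 0.9 : price; // same rule, copy #2
}

function quoteTotal(price: number, isMember: boolean): number {
  return isMember ? price * 0.9 : price; // same rule, copy #3
}

// The refactored alternative: one definition, changed in exactly one place.
function applyMemberDiscount(price: number, isMember: boolean): number {
  return isMember ? price * 0.9 : price;
}
```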
This is the invisible cost. It doesn't show up in sprint velocity metrics. It accumulates in the background, making every subsequent change slightly harder and every production issue slightly more complex to trace. The engineering world model at PlayerZero is specifically designed to reason across this kind of fragmented, duplicated codebase — because humans increasingly can't.
Study 5: Every 25% increase in AI adoption correlates with a 7.2% drop in delivery stability
The one that connects it all to production outcomes.
Google's DORA (DevOps Research and Assessment) team surveyed 39,000 respondents for its 2024 annual benchmarks — large enough to detect system-level patterns that individual team surveys miss.
They found that AI adoption is associated with increased throughput — more code shipped, faster. They also found it's associated with increased instability. For every 25% increase in AI adoption, their model projects a 7.2% decrease in delivery stability.
Google researchers described this as "surprising" — developers report perceiving AI as a positive performance contributor, and yet defect rates are rising alongside adoption. But the GitClear data suggests a reconciliation: more code authored, less refactored, more duplicated, more churned. "More code" isn't the same as "better software." When organizations measure developer productivity by commit count or lines added, AI can juice those numbers while the system underneath quietly accumulates maintenance risk.
The DORA finding also corroborates something most production engineering teams feel but struggle to quantify: the gap between how a team feels about its velocity and how production is actually behaving. Those two things can diverge — and often do — when the tooling optimizes for generation rather than operation.
The through-line
Five studies, five different methodologies, five different research teams. One consistent finding underneath them all:
AI code generation tools optimize for the moment of generation. Production software has to survive for months or years after that moment. The tools that help you write faster aren't the same ones that help you understand what you shipped, catch what breaks, or maintain what you built. And the organizations investing heavily in the first category without the second are accumulating a debt that's only now starting to show up in the data.
This isn't an argument against AI coding tools. METR's study participants still used them and still found value for specific tasks. Anthropic's study found that developers who used AI to build understanding — rather than to bypass it — retained their skills. The tools are real. The productivity gains on the right tasks are real.
The argument is about measurement. If your KPIs are lines shipped and PR velocity, AI looks like an unqualified win. If your KPIs are support escalation rate, MTTR, and delivery stability, the picture is more complicated — and these five studies are the reason.
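If you want to watch for that divergence in your own organization, a rough starting point is to put a stability metric next to every throughput metric. Below is a minimal sketch, using hypothetical Deployment and Incident shapes (the field names are assumptions, not any particular tool's API), of computing change failure rate and mean time to restore alongside raw deploy throughput.

```typescript
// Minimal sketch: a throughput metric next to two DORA-style stability metrics.
// The data shapes are hypothetical; adapt them to whatever your deploy and
// incident tooling actually emits.

interface Deployment {
  id: string;
  deployedAt: Date;
  causedIncident: boolean; // later linked to a production incident
}

interface Incident {
  openedAt: Date;
  resolvedAt: Date;
}

// Throughput: deploys per week, the number AI tooling tends to inflate.
function deploysPerWeek(deploys: Deployment[], weeks: number): number {
  return weeks > 0 ? deploys.length / weeks : 0;
}

// Change failure rate: share of deploys that led to a production incident.
function changeFailureRate(deploys: Deployment[]): number {
  if (deploys.length === 0) return 0;
  const failed = deploys.filter((d) => d.causedIncident).length;
  return failed / deploys.length;
}

// Mean time to restore, in hours.
function meanTimeToRestoreHours(incidents: Incident[]): number {
  if (incidents.length === 0) return 0;
  const totalMs = incidents.reduce(
    (sum, i) => sum + (i.resolvedAt.getTime() - i.openedAt.getTime()),
    0,
  );
  return totalMs / incidents.length / 3_600_000;
}
```

If throughput climbs while change failure rate and time to restore climb with it, you are looking at the DORA pattern in your own data.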
What this means for production engineering
The studies point toward a distinction that most organizations haven't formally made yet: the difference between development-time AI (tools that help you write code) and production-time AI (tools that help you understand and operate what you've built).
Development-time AI is the mature category. Code generation tools like Copilot, Cursor, and Claude Code are established, widely adopted, and genuinely useful for the tasks they were designed for.
Production-time AI is where the gap is. When an incident fires at 2am, no code assistant helps you reconstruct which of five duplicated functions is the one that's broken, or why this customer's specific configuration triggers a code path nobody thought to test, or how the deployment from Tuesday connects to the ticket that arrived Thursday. That requires a persistent model of your system — something that's reasoned over your codebase, your deployment history, your prior incidents — not a tool that searches on-the-fly.
That's the category production engineering addresses. And based on what these five studies show about where the problems actually accumulate, it's where the next wave of engineering investment is going.
If your backlog looks the same (or worse) as it did before your last AI tooling rollout, one of these five studies probably explains why.