the adoption trap
Your CFO walks into your one-on-one with a spreadsheet, and the number on it is your annual spend on Copilot seats plus the Claude Code API bill plus whatever Cursor enterprise contract someone signed in a hurry last March, and the question is simple and brutal: what did we get for this? And if your answer is some variation of "94% of the team is using it daily" then you have already lost the argument, because adoption is not value, it never was, and the gap between those two things is exactly where most engineering orgs are bleeding money right now without realizing it.
The 2026 numbers are not subtle. Roughly 84% of developers report using AI tools in their daily work, which is a staggering adoption curve for any technology, faster than cloud, faster than containers, faster than basically anything we have seen. And yet when you measure productivity at the organisation level, not the individual prompt level, the gains land somewhere between 10 and 30%, and the high end of that range is mostly seen in greenfield work and tightly scoped tasks rather than the messy maintenance and integration work that eats the majority of a real engineering team's calendar.
That disconnect is the whole story. A developer feels 50% faster because the autocomplete fills in the boilerplate and the chat window spits out a working regex in three seconds, and that feeling is genuine, it is not a lie, but it is also not the same thing as your team shipping features 50% faster, because the bottleneck in most organisations was never typing speed. It was code review, it was context switching, it was waiting on the staging environment, it was the three Slack threads needed to figure out why the payment webhook fires twice in production. AI does very little for any of those, and in some cases it makes them worse.
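A back-of-the-envelope way to see why the feeling and the org-level number diverge is Amdahl's law: only the slice of delivery time spent actually writing code gets the speedup. This is a rough sketch, not a measurement. The 1.5x comes from the "50% faster" feeling above; the 30% coding share is an illustrative assumption, so plug in your own.

```python
def overall_speedup(coding_share: float, coding_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only the coding slice gets faster.

    coding_share   -- fraction of ticket-to-production time spent writing code (0..1)
    coding_speedup -- how much faster that slice becomes (1.5 means 50% faster)
    """
    if not 0.0 <= coding_share <= 1.0:
        raise ValueError("coding_share must be between 0 and 1")
    if coding_speedup <= 0:
        raise ValueError("coding_speedup must be positive")
    return 1.0 / ((1.0 - coding_share) + coding_share / coding_speedup)


# Assumption for illustration: writing code is 30% of total delivery time.
gain = overall_speedup(coding_share=0.30, coding_speedup=1.5)
print(f"{(gain - 1) * 100:.0f}% faster end to end")  # prints: 11% faster end to end
```

Under that assumption a developer who types 50% faster makes the whole pipeline about 11% faster, and the other 70% of the calendar is untouched.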
what uber and amazon learned the expensive way
Uber reportedly burned through its entire 2026 AI tooling budget in four months, and the part that should make every engineering leader uncomfortable is that there was no corresponding jump in measurable output. Same delivery cadence, same defect rates, dramatically higher cloud and API costs. The money went somewhere, the tokens were definitely consumed, but the thing the budget was supposed to buy, faster shipping, never showed up on any dashboard anyone could point at.
Amazon's experience is the one I keep thinking about because it is darkly funny in the way that only real engineering incentive failures are. They built an internal productivity tool, Kirorank, meant to track and reward AI-assisted output, and engineers did exactly what engineers always do when you put a number in front of them and tie it to evaluation: they optimised the number. People started running up token costs and generating activity that scored well on the metric while producing nothing of value, and Amazon eventually killed the tool because the signal it produced was worse than no signal at all. Goodhart's law arrived precisely on schedule. The moment a measure becomes a target, it stops being a good measure, and "lines of AI-generated code" or "prompts per developer per day" are about the most gameable targets you could possibly pick.
The lesson is not that AI tooling is a scam. We use these tools at steezr every single day and they earn their keep. The lesson is that if you measure the wrong thing, you will spend real money buying the appearance of progress, and you will keep spending it for months before anyone notices, because the people generating the activity have every incentive to keep telling you it's working.
where the roi actually disappears
The leak happens in the gap between code that gets generated and code that ships and stays shipped. AI is extraordinary at producing plausible code fast. It is much worse at producing correct code that fits your existing architecture, respects your error handling conventions, and doesn't quietly introduce an N+1 query that nobody catches until the PostgreSQL connection pool saturates at 2am during a traffic spike.
We ran into this on a document processing pipeline for a client earlier this year. An engineer used Claude Code to scaffold a chunk of the ingestion layer in an afternoon, which felt like a massive win, except the generated code handled the happy path beautifully and silently swallowed malformed PDFs by catching a broad exception and logging nothing. It took two days of review and a production incident to find it, which means the net time saved on that feature was negative, and the only reason we caught it at all is that we treat AI output as a junior engineer's first draft rather than as finished work.
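For anyone who hasn't been bitten by this yet, here is roughly what the failure mode looks like, next to the shape of the fix. This is a reconstruction for illustration, not the client's code: parse_pdf is a stand-in that only checks the file header, and MalformedPDFError is a hypothetical error type.

```python
import logging

logger = logging.getLogger("ingest")


class MalformedPDFError(ValueError):
    """Hypothetical error type for input that is not a parseable PDF."""


def parse_pdf(data: bytes) -> str:
    # Stand-in for a real parser: it only checks the PDF magic bytes.
    if not data.startswith(b"%PDF-"):
        raise MalformedPDFError("missing %PDF- header")
    return data.decode("latin-1")


def ingest_silently(docs: dict[str, bytes]) -> list[str]:
    """The anti-pattern: every failure disappears without a trace."""
    parsed = []
    for name, data in docs.items():
        try:
            parsed.append(parse_pdf(data))
        except Exception:  # too broad, and nothing is logged
            pass
    return parsed


def ingest_loudly(docs: dict[str, bytes]) -> tuple[list[str], list[str]]:
    """Catch only the expected error, log it, and report which inputs failed."""
    parsed, failed = [], []
    for name, data in docs.items():
        try:
            parsed.append(parse_pdf(data))
        except MalformedPDFError as exc:
            logger.warning("skipping %s: %s", name, exc)
            failed.append(name)
    return parsed, failed
```

Both versions return the same parsed documents on the happy path. The difference only shows up when something is broken, which is exactly why a quick skim of the diff doesn't catch it.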
That is the pattern everywhere. The speed shows up at generation time and the cost shows up later, in review, in debugging, in the maintenance tail, and because those costs are diffuse and delayed they don't get attributed back to the AI tool that caused them. So the spreadsheet shows tokens consumed and seats active, the developers report feeling faster, and meanwhile your senior engineers are spending more of their week reviewing larger PRs full of confident-looking code that needs careful reading because it might be subtly wrong in ways that human-written code usually isn't. The volume of code went up. The volume of value did not necessarily follow.
measure value, not activity
If you want to justify the spend honestly, throw out every metric that counts AI usage and replace it with metrics that count outcomes, the same outcomes you cared about before any of this existed. Cycle time from first commit to merged-and-deployed. Change failure rate. Time to restore service after an incident. The DORA metrics have been sitting right there the whole time and they don't care whether a human or a model wrote the code, they only care whether your system gets better software into production reliably and quickly.
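None of these need a vendor dashboard. As a minimal sketch, assuming you can export one record per deployment from your CI and incident tooling, the three numbers fall out of a few lines. The Deployment fields here are hypothetical names, so map them to whatever your systems actually record.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    first_commit_at: datetime              # first commit on the change
    deployed_at: datetime                  # when it reached production
    failed: bool = False                   # did it cause an incident or rollback?
    restored_at: Optional[datetime] = None # when service was restored, if it failed


def hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def median_cycle_hours(deploys: list[Deployment]) -> float:
    """Median time from first commit to running in production."""
    return median(hours(d.first_commit_at, d.deployed_at) for d in deploys)


def change_failure_rate(deploys: list[Deployment]) -> float:
    """Share of deployments that caused a failure in production."""
    return sum(d.failed for d in deploys) / len(deploys)


def mean_hours_to_restore(deploys: list[Deployment]) -> Optional[float]:
    """Mean time from a failed deployment to restored service, or None if no failures."""
    outages = [hours(d.deployed_at, d.restored_at)
               for d in deploys if d.failed and d.restored_at is not None]
    return mean(outages) if outages else None
```

Nothing in the record says who or what wrote the code, and that is the point.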
Run the comparison properly. Take two teams or two quarters, one with heavy AI tooling and one without, and look at whether cycle time actually dropped and whether change failure rate held steady or got worse. If your AI-heavy team is shipping faster but breaking production more often, you have not gained productivity, you have moved cost from the keyboard to the on-call rotation, and that trade is almost never worth it.
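The decision rule in that paragraph is simple enough to write down. A sketch, with thresholds that are purely assumptions (a 5% cycle-time improvement to count as "faster", a one-point rise in failure rate to count as "worse"), so tune them to the noise in your own data:

```python
from dataclasses import dataclass


@dataclass
class WindowMetrics:
    label: str
    median_cycle_hours: float
    change_failure_rate: float  # 0.0 to 1.0


def compare(before: WindowMetrics, after: WindowMetrics,
            min_cycle_gain: float = 0.05,
            max_cfr_increase: float = 0.01) -> str:
    """Classify a before/after pair. Both thresholds are illustrative defaults."""
    cycle_change = ((after.median_cycle_hours - before.median_cycle_hours)
                    / before.median_cycle_hours)
    cfr_change = after.change_failure_rate - before.change_failure_rate

    faster = cycle_change <= -min_cycle_gain
    riskier = cfr_change > max_cfr_increase

    if faster and not riskier:
        return "real gain"
    if faster and riskier:
        return "cost moved to on-call"
    return "no measurable gain"


# Made-up numbers: cycle time down 25%, failure rate up from 10% to 18%.
print(compare(WindowMetrics("before", 40.0, 0.10),
              WindowMetrics("after", 30.0, 0.18)))  # prints: cost moved to on-call
```

It won't control for team differences or a quarter with an unusual release, so treat the output as a prompt for a conversation rather than a verdict.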
Watch your review load as a leading indicator. If PR size is climbing and review time per PR is climbing with it, that is the AI tax showing up, and it usually means people are generating more code than they're carefully reading. The healthy pattern is smaller, more frequent PRs where the AI helped someone get to a clean diff faster, not sprawling diffs full of generated code that the author themselves doesn't fully understand. One client we audited had average PR size up 40% year over year and review time up nearly as much, and once we pointed it out the cause was obvious to everyone in the room; they just hadn't been looking at it through that lens.
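If you can pull lines changed and review hours per PR out of your Git host, the check is a few lines. This is a sketch under assumptions: the PullRequest fields are hypothetical names, the 20% threshold is arbitrary, and the sample numbers are made up rather than taken from any audit.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PullRequest:
    lines_changed: int
    review_hours: float  # time from review requested to approval


def pct_change(old: float, new: float) -> float:
    return (new - old) * 100 / old


def review_load_trend(previous: list[PullRequest],
                      current: list[PullRequest]) -> tuple[float, float]:
    """Percent change in median PR size and median review time between two periods."""
    size = pct_change(median(p.lines_changed for p in previous),
                      median(p.lines_changed for p in current))
    review = pct_change(median(p.review_hours for p in previous),
                        median(p.review_hours for p in current))
    return size, review


def ai_tax_suspected(size_pct: float, review_pct: float,
                     threshold_pct: float = 20.0) -> bool:
    """Flag the period when PR size and review time are both climbing."""
    return size_pct >= threshold_pct and review_pct >= threshold_pct


# Made-up sample data for two periods.
last_year = [PullRequest(100, 1.0), PullRequest(200, 2.0), PullRequest(300, 3.0)]
this_year = [PullRequest(150, 1.5), PullRequest(260, 2.5), PullRequest(400, 4.0)]
size_pct, review_pct = review_load_trend(last_year, this_year)
print(size_pct, review_pct, ai_tax_suspected(size_pct, review_pct))
```

Medians are used instead of means so that one enormous generated PR doesn't dominate the trend.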
the workflow changes that actually move the needle
The teams getting real gains aren't the ones with the biggest token budget, they're the ones who restructured their workflow so that AI handles the work it's genuinely good at and humans stay firmly in control of the work that matters. Use it aggressively for the stuff that is tedious and low-stakes: writing tests against existing code, drafting migrations, generating boilerplate for a new endpoint that follows an established pattern, translating a function from one language to another, explaining an unfamiliar codebase. That is where the speedup is real and the downside is small.
Keep it on a tight leash for architecture, for anything touching auth or payments or data integrity, for the decisions that are expensive to reverse. We tell our engineers to treat generated code as a draft that has to earn its place, which means reading every line, running it against real edge cases, and never merging something just because it looks right and the tests happen to pass, because AI is very good at writing code that passes the tests it also wrote.
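Here is the kind of thing "passes the tests it also wrote" means in practice. It's a contrived example, not from a real codebase: a function that splits a charge between payers, a test that happens to pick a friendly input, and the edge case a reviewer should reach for.

```python
def split_naive(total_cents: int, parts: int) -> list[int]:
    """Plausible first draft: looks right, and is right when the total divides evenly."""
    return [total_cents // parts] * parts


def split_exact(total_cents: int, parts: int) -> list[int]:
    """Hands the leftover cents to the first payers so the shares always add up."""
    base, remainder = divmod(total_cents, parts)
    return [base + 1 if i < remainder else base for i in range(parts)]


# The test that tends to ship with the draft: a friendly input, so it passes.
assert split_naive(100, 4) == [25, 25, 25, 25]

# The edge case a human reviewer should try: a cent goes missing.
print(sum(split_naive(100, 3)))  # prints: 99
print(sum(split_exact(100, 3)))  # prints: 100
```

A property worth asserting for anything touching money is that the shares always sum to the total, for any input, rather than checking a handful of hand-picked values.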
Invest in the boring infrastructure that makes review faster, because review is now the bottleneck. Good CI, fast test suites, strict typing, linters that catch the dumb stuff so humans can focus on the subtle stuff. If you're on Go, the compiler and the type system already do enormous work here, which is one reason AI-assisted Go tends to fail more loudly and earlier than AI-assisted Python, where a confidently wrong function can sit in production for weeks before anyone notices.
And be willing to pull seats from people who aren't getting value. Not everyone benefits equally: the gains concentrate among engineers who already know the codebase well enough to spot when the model is wrong, and paying for a tool that makes your weakest engineers produce more code you then have to clean up is a cost dressed up as an investment.
what to tell the cfo
Go back into that meeting with a different number. Not seats, not tokens, not adoption percentage, but a before-and-after on cycle time and change failure rate for a specific set of teams over a specific window, with an honest note about where the tooling helped and where it didn't. If the numbers are good, you have a real case and you can defend the spend with a straight face. If the numbers are flat or worse, you have just saved the company a lot of money and you should say so, because killing a tool that isn't working is a win, not a failure.
The companies that come out of this period ahead are the ones treating AI tooling as one input into a system they actually understand, rather than a magic spend that automatically produces speed. The 10 to 30% gains are real and worth capturing, but you only capture them by being ruthless about measuring outcomes and ignoring the dashboard of activity that the vendors and, frankly, your own developers will happily wave in front of you. Adoption was never the hard part. Adoption hit 84% almost on its own. Turning that into shipped, reliable software is the actual work, and it looks exactly like the engineering discipline you already knew you needed before any of this showed up.
