In May, Uber's CEO told investors the company was slowing hiring to fund its AI investments. Employees with AI tools could increase their throughput by 20%, 30%, 50%, even 100%, he said, and the investment was going to be well worth it.
Three weeks later, Uber's COO stood at a conference and admitted that it's getting "harder to justify" those same AI costs. Despite 95% of Uber's engineers using AI tools every month and 70% of committed code being AI-generated, he said he couldn't draw a clear line between that usage and measurable improvements in consumer-facing products. "That link is not there yet, right?"
One company. Two executives. Three weeks apart. Completely different stories.
That's not a communications problem. That's a measurement problem — and it's the defining challenge of enterprise AI right now.
The tokenmaxxing trap
The COO named the dynamic: tokenmaxxing. Employees maximize AI token consumption, either because the tools are embedded in daily workflows or because usage has become a proxy for productivity. The result is a budget that burns fast and a finance team asking what it got for the money. Uber's CTO disclosed in April that the company had burned through its entire 2026 Claude Code and Cursor budget in four months.
Uber isn't alone. Microsoft cancelled Claude Code licenses across one of its major divisions. Duolingo's CEO walked back AI usage as a performance-review input after employees pushed back on being measured by AI consumption rather than outcomes. The pattern is consistent enough to be structural: enterprise AI is entering a second phase, where the question is no longer whether to deploy AI but whether the deployment can be justified on financial terms.
Jaya Gupta at Foundation Capital framed it precisely: the bill can't tell you whether the spend replaced labor, generated revenue, reduced risk, or was just engineers tokenmaxxing on the leaderboard. At experimental scale, the variance feels like overhead. Past seven figures, it becomes a P&L line the CFO has to explain to the CEO.
The pressure is real. But most companies are solving the wrong problem.
A token isn't a unit of compute. It's a unit of work.
Here's the reframe that matters: when we talk about token spend, we're not actually talking about compute costs. We're talking about the cost of cognitive work being executed by AI.
Every token spent on an agent investigating a production incident, generating a code change, triaging a support ticket, or synthesizing context across systems — that's work. Work that used to be done by a person. Work that had a cost, a time, and an outcome.
We've always measured the output of human labor. A developer ships a feature and we ask: did it work? Did it introduce bugs? Did it come back as a support ticket three weeks later? We track defect escape rates, MTTR, release velocity, escalation volume. We know how to measure human productivity because we built decades of tooling to do it.
We haven't built that layer for AI labor yet. And that's exactly why Uber's COO can't draw the line.
The companies getting this right aren't the ones with the highest token volume. They're the ones who designed workflows where token consumption maps to a measurable output: a ticket resolved, an incident closed, a deployment validated, a customer issue deflected before it escalates. They measure tokens per outcome, not tokens in aggregate.
That's the only unit of measurement that makes AI spend legible.
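To make "tokens per outcome" concrete, here is a minimal sketch of the calculation. The `AgentRun` record, its field names, the workflow labels, and the numbers are all hypothetical; the only idea taken from the argument above is that the denominator is resolved outcomes, not teams or seats.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AgentRun:
    workflow: str   # hypothetical label, e.g. "ticket_triage"
    tokens: int     # total tokens the run consumed
    resolved: bool  # did the run produce the outcome we care about?


def tokens_per_outcome(runs):
    """Return {workflow: total tokens / resolved outcomes}.

    Tokens from failed runs stay in the numerator, so a workflow with
    many failures looks expensive. A workflow with no resolved outcomes
    maps to None instead of dividing by zero.
    """
    tokens = defaultdict(int)
    outcomes = defaultdict(int)
    for run in runs:
        tokens[run.workflow] += run.tokens
        outcomes[run.workflow] += int(run.resolved)
    return {
        wf: (tokens[wf] / outcomes[wf]) if outcomes[wf] else None
        for wf in tokens
    }


# Illustrative data only.
runs = [
    AgentRun("ticket_triage", 12_000, True),
    AgentRun("ticket_triage", 18_000, False),
    AgentRun("ticket_triage", 10_000, True),
    AgentRun("code_review", 40_000, False),
]
result = tokens_per_outcome(runs)
print(result)
```

In this toy data, the aggregate view says 80,000 tokens were spent. The per-outcome view says triage costs 20,000 tokens per resolved ticket and code review has produced nothing yet, which is the more useful thing to know.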
Why "tokens per team" is the wrong denominator
The instinct when facing budget pressure is to allocate tokens to teams, set budgets, and monitor consumption. That's a reasonable place to start. But it measures inputs, not outputs. It's the AI equivalent of measuring engineering productivity by hours worked.
Aaron Levie noted that token budgeting is going to require the same excruciatingly detailed management as headcount or marketing budgets — but the companies that solve this well won't do it by managing tokens as an expense. They'll do it by connecting tokens to the work that matters.
A common trend emerging in larger enterprises is token budgeting as a major topic. As agents can do more and more long running tasks, and thus take vastly more compute, allocation of tokens across teams becomes a very real thing in the enterprise.
— Aaron Levie (@levie) May 9, 2026
Companies spend a meaningful…
Gupta describes what's actually happening at the companies working through this: the conversation is moving to cost per completed outcome. Cost per resolved ticket. Cost per closed incident. Cost per validated deployment. These are units executives already understand, because they're the units labor has always been priced in. A BPO contract is priced per ticket, per claim, per review. AI spend needs to become comparable.
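One way to put AI spend on the same footing as a per-ticket labor contract is a small conversion like the sketch below. The token price, token volume, ticket count, and contract rate are made-up figures for illustration, not numbers from any company mentioned here.

```python
def cost_per_outcome(total_tokens, price_per_million_tokens, outcomes):
    """Dollar cost per completed outcome for one workflow.

    total_tokens should include tokens from failed attempts, since
    those are paid for too.
    """
    if outcomes <= 0:
        raise ValueError("need at least one completed outcome")
    spend = total_tokens / 1_000_000 * price_per_million_tokens
    return spend / outcomes


# Hypothetical month of support triage.
ai_cost = cost_per_outcome(
    total_tokens=50_000_000,
    price_per_million_tokens=10.0,  # assumed blended price in dollars
    outcomes=400,                   # resolved tickets
)
contract_price_per_ticket = 4.00    # assumed per-ticket labor rate

print(f"AI: ${ai_cost:.2f} per ticket vs contract: ${contract_price_per_ticket:.2f}")
```

The comparison only means something if "resolved" is defined the same way on both sides, for example a ticket that stays closed.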
The unit you actually want is: what did this token spend produce? And to answer that, you need visibility into the work itself — not just the tokens consumed, but the decision traces underneath them. Which parts of your system are fragile? Which tickets come back? Which agent actions resolved something versus just completed? Which code changes look clean but introduce regressions downstream?
Gupta calls this token-to-outcome attribution — a conversion layer that connects inference spend to the work performed and the business outcome produced. The companies that build that layer first make the allocation calls: which workflows get more compute, which get capped, which stay human. And once you make those calls, you control where AI spend goes inside the enterprise.
That's what an engineering world model is actually for. Not raw observability. Not another dashboard. A system that understands how production software actually behaves — the relationships between code, failures, customer impact, and resolutions — so that when an agent does work, you can trace that work to an outcome.
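Once cost per outcome exists, the allocation calls described above can be written down as an explicit rule. The sketch below is one hypothetical policy, not a recommendation: the function name, the 50% success floor, and the three labels are assumptions chosen to mirror "more compute, capped, stay human."

```python
def allocation_call(ai_cost_per_outcome, human_cost_per_outcome,
                    success_rate, min_success_rate=0.5):
    """Classify a workflow under a deliberately simple, assumed policy."""
    # Too unreliable to trust with the work, whatever it costs.
    if success_rate < min_success_rate:
        return "stay_human"
    # Reliable and cheaper than the human baseline: give it more budget.
    if ai_cost_per_outcome < human_cost_per_outcome:
        return "more_compute"
    # Reliable but not yet cheaper: hold spend flat until the unit cost improves.
    return "cap"


# Hypothetical workflows: (AI cost, human cost, success rate).
workflows = {
    "ticket_triage": (1.25, 4.00, 0.90),
    "release_notes": (6.00, 4.00, 0.80),
    "schema_migration": (1.00, 4.00, 0.30),
}
decisions = {name: allocation_call(*args) for name, args in workflows.items()}
print(decisions)
```

A real policy would weigh risk and revenue as well as cost, but even a rule this crude cannot be evaluated from the token bill alone.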
What this actually looks like in practice
This isn't theoretical. The teams already doing it are building ROI cases grounded in the same metrics executives have always used to evaluate labor.
Take support engineering, which is where the token-to-outcome problem is most tractable. The workflow is defined, the unit of output is clear (a ticket resolved, an escalation avoided, an engineer's time freed), and the cost comparison against human labor is direct.
At Zuora, L3 triage time dropped from three days to 15 minutes, and monthly escalations fell from 28 to 3. That's not a productivity metric — it's a labor displacement metric. The engineers who were spending days on triage are spending those hours on something else. The token spend on that investigation agent has a clear denominator: escalations deflected, engineering time reclaimed.
At Cayuse, engineers who used to spend 60% of their time on support work shifted that capacity to the feature roadmap. Ninety percent of defects are caught before they reach customers. The ROI case isn't "we used a lot of AI" — it's "our defect escape rate changed by this much, and here's what that translates to in engineering hours and customer churn."
At Key Data, debugging cycles that used to run for weeks now resolve in minutes. Release velocity doubled. Again: not a token metric. An outcome metric that happens to be enabled by AI.
What these teams share is that they started with the outcome unit and worked backward. Not "how many tokens did we spend?" but "what's a resolved ticket worth, what's an escalation avoided worth, what's a week of engineering time worth?" — and then built the measurement layer to connect token activity to those numbers.
That's what Gupta means by measurement becoming memory. To connect a token to an outcome, you have to capture what happened in between: what the agent saw, what it retrieved, where it retried, when a human overrode it, why one path worked while another failed. The measurement layer has to record decision traces — something enterprises have never really had, because those traces lived in Slack threads and people's heads. AI changes that. Every agent action becomes part of a durable record of how the organization actually decides. Gupta calls this a context graph — and it's exactly what makes the difference between AI activity and AI leverage.
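A decision trace can start as a very plain record. The sketch below shows one possible shape; the `DecisionTrace` class and every field name in it are assumptions made for illustration, not a schema from Gupta or from any particular product.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class DecisionTrace:
    run_id: str
    workflow: str
    inputs_seen: List[str] = field(default_factory=list)        # what the agent saw
    context_retrieved: List[str] = field(default_factory=list)  # what it pulled in
    retries: int = 0                                            # where it retried
    human_override: Optional[str] = None                        # who stepped in, and why
    tokens: int = 0
    outcome: Optional[str] = None                               # e.g. "incident_closed"


# Hypothetical incident investigation.
trace = DecisionTrace(run_id="run-001", workflow="incident_investigation")
trace.inputs_seen.append("alert: checkout latency spike")
trace.context_retrieved.append("recent deploys touching checkout")
trace.retries += 1
trace.tokens += 15_000
trace.outcome = "incident_closed"

# Serialize so the trace outlives the session that produced it.
record = json.dumps(asdict(trace), sort_keys=True)
print(record)
```

Because tokens and outcome sit in the same record, the per-outcome math falls out of the trace instead of needing a separate reconciliation against the bill.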
The hidden costs tokens don't show you
There's another dimension to this that the Uber story gestures at but doesn't fully surface: the downstream cost of AI-generated work that isn't validated against how your system actually behaves in production.
Gupta describes the retry tail problem precisely: if an agent completes a workflow correctly on the first pass with probability p, and each attempt costs T tokens, expected tokens per resolved workflow scale as T/p. A drop in completion rate from 90% to 70% raises effective cost per resolution by about 29% (0.9/0.7 ≈ 1.29), not 20%, because every failed attempt is paid for again on retry. In messy enterprise environments, where inputs are irregular, exceptions matter, and context is incomplete, failure doesn't just reduce accuracy. It changes the economics.
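The arithmetic behind the retry tail is easy to check. The sketch below assumes the simplest version of the model: every attempt costs the same T tokens, attempts succeed independently with probability p, and the agent retries until it succeeds, so the expected number of attempts is 1/p. The function name and the 10,000-token figure are illustrative.

```python
def expected_tokens_per_resolution(tokens_per_attempt, p):
    """Expected tokens to get one resolved workflow, T / p.

    Assumes independent attempts with equal cost, retried until success.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return tokens_per_attempt / p


T = 10_000  # assumed tokens per attempt
at_90 = expected_tokens_per_resolution(T, 0.9)
at_70 = expected_tokens_per_resolution(T, 0.7)
increase = at_70 / at_90 - 1

print(f"{increase:.1%}")  # 28.6%
```

The increase does not depend on T. It is just 0.9/0.7 - 1, which is why a 20-point drop in completion rate costs more than 20%.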
This is why context isn't optional. Agents that work from incomplete or static system knowledge produce output fast, but that output creates downstream work. Code that looks complete but was written without understanding system-level dependencies ships anyway, then comes back as a production incident. Ticket resolutions that close a symptom without addressing root cause generate repeat escalations. Agentic workflows that complete tasks in isolation create integration failures when those tasks interact with everything else.
The retry tail is invisible on the token bill. So is the engineering time spent cleaning up AI-generated work that didn't hold. That's the hidden cost that makes some AI deployments look like they're working while the underlying economics are quietly deteriorating.
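A rough way to surface that hidden cost is to add cleanup time to the numerator and remove work that did not hold from the denominator. Everything in the sketch below is an assumption for illustration: the function name, the hourly rate, and the counts of reopened outcomes.

```python
def loaded_cost_per_durable_outcome(token_spend, cleanup_hours, hourly_rate,
                                    outcomes, reopened):
    """Cost per outcome that actually held.

    Adds engineering cleanup time to token spend and counts only
    outcomes that were not reopened.
    """
    durable = outcomes - reopened
    if durable <= 0:
        raise ValueError("no durable outcomes to divide by")
    return (token_spend + cleanup_hours * hourly_rate) / durable


# What the bill shows: $500 of tokens over 400 closed tickets.
bill_view = 500.0 / 400

# The same month with 20 hours of cleanup and 80 tickets that came back.
loaded = loaded_cost_per_durable_outcome(
    token_spend=500.0,
    cleanup_hours=20,
    hourly_rate=100.0,  # assumed loaded engineering rate in dollars
    outcomes=400,
    reopened=80,
)

print(f"bill view: ${bill_view:.2f}, loaded view: ${loaded:.2f}")
```

With these made-up numbers the per-ticket cost moves from $1.25 to about $7.81, and none of the difference appears on the token bill.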
The organizations seeing the strongest returns aren't the ones spending the most tokens. They're the ones whose agents have enough context to know the difference between work that compounds and work that unwinds — because they built the layer connecting tokens to what's actually happening in their production system. We wrote about what that layer looks like in more depth here.
What the Uber contradiction tells us about where this goes next
The CEO made a reasonable bet: AI tools would make employees more productive, and productivity gains would justify the investment. The COO's admission doesn't mean the bet is wrong. It means the measurement infrastructure to evaluate it doesn't exist yet.
Gupta puts the stakes clearly: the company that owns token-to-outcome attribution makes the allocation calls. Which workflows deserve more compute, which get capped, which get cheaper models, which stay human. And once you make those calls, you control where AI spend goes inside the enterprise. That's not a technology advantage. It's an organizational one. And it compounds.
The first phase of enterprise AI proved that models could do work. The next phase decides how much of that work is worth it.
Tokens aren't the price of experimentation. They're the operating fuel of a new way of working. And like any fuel, what matters isn't how much you're burning. It's how far you're going.
PlayerZero helps engineering teams connect AI agent activity to production outcomes — so token spend maps to real work, and that work maps to real results. See how it works.