Everyone's excited about AI factories. I am too. But I think most companies are building them wrong — and the gap between what they're getting and what their boards expect is going to be very expensive to close.
This post is part of an ongoing series on production engineering and the future of how software teams operate.
Here's what I mean.
The conveyor belt vs. the factory floor
When leaders talk about AI factories, they're usually describing the same thing: take your existing software development process and accelerate it with AI. Generate code faster. Triage tickets faster. Run tests faster. Move the conveyor belts faster.
That produces real gains. 30%, 50%, sometimes 75% improvements in throughput. Engineers ship more. Support resolves more. QA catches more.
But boards aren't expecting 50%. They're expecting 5x.
That gap isn't irrational. The boards are right. The CTOs are right. They're just talking about two completely different things.
Faster conveyor belts are an optimization. A rebuilt factory floor is a different business.
The industrial revolution didn't happen because someone made horses run faster. It happened because someone replaced the horse with a system that didn't need one. The assembly line wasn't an upgrade to the carriage factory. It made the carriage factory irrelevant. If you think about Tesla and SpaceX — those aren't companies that make great products. They're companies that built factory floors that happen to produce great products. The factory is the product.
The factory is the product
— Elon Musk (@elonmusk) January 11, 2021
That's the reframe most software organizations haven't made yet. And it starts with a question that sounds simple: what would it mean to redesign how software gets built and operated from scratch?
What actually needs to change
Most software organizations run production through three separate functions that barely talk to each other.
Support sees what customers are experiencing. They're the first to know when something breaks, the first to hear the pattern of complaints, the first to absorb the blast radius of a bad release. But their visibility stops at the ticket. They can't trace a complaint to a code change. They can't see what's upstream. They're triaging blind. And the result is predictable: support escalations that take days, not minutes, because nobody has the full picture.
SRE sees the infrastructure. They know when latency spikes, when memory leaks, when a service goes down. But they're missing the customer context. They don't know which users are affected, which behaviors triggered the failure, which code path is the actual root cause. They're watching a dashboard that tells them something is wrong without telling them why.
QA sees what developers think should be tested. That's not the same as what actually breaks in production. Tests check the scenarios that engineers anticipated. Real users find the ones they didn't. This is the core of predictive software quality — moving from reactive testing to a model that understands how software actually fails.
Three teams, three partial views, no shared model of what's actually happening.
This is the assembly line version of software operations. Each function is optimized in isolation. The factory floor — a unified system that connects all three — doesn't exist yet.
What changes when you build it isn't just efficiency. It's the nature of the work. Support teams that can trace a ticket directly to a code change don't need to escalate. SREs who can see which customer sessions triggered an incident don't need to reconstruct it from logs. QA teams with a live model of how real users exercise the system can write tests that predict production failures instead of merely satisfying a checklist.
The factory floor is the shared context that makes all three functions meaningfully better at the job they're already trying to do.
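To make "shared context" concrete, here's a minimal sketch in Python of what tracing a support ticket back to a code change could look like. Every name here (Ticket, Commit, matching on a code path pulled from an error trace) is a hypothetical illustration of the idea, not a real product API.

```python
from dataclasses import dataclass

@dataclass
class Commit:
    sha: str
    touched_paths: set[str]   # code paths this change modified

@dataclass
class Ticket:
    id: str
    error_path: str           # code path extracted from the error trace

def suspect_commits(ticket: Ticket, recent_commits: list[Commit]) -> list[str]:
    """Return SHAs of recent commits that touched the failing code path.

    This is the link the three-silo model is missing: the same code-path
    identifier appears in both the customer-facing signal (the ticket's
    error trace) and the engineering-facing record (the commit).
    """
    return [c.sha for c in recent_commits if ticket.error_path in c.touched_paths]

commits = [
    Commit("a1b2c3", {"billing/invoice.py"}),
    Commit("d4e5f6", {"auth/session.py", "billing/invoice.py"}),
    Commit("0f9e8d", {"search/index.py"}),
]
ticket = Ticket("SUP-1042", "billing/invoice.py")
print(suspect_commits(ticket, commits))  # ['a1b2c3', 'd4e5f6']
```

The point of the sketch isn't the matching logic, which is trivially simple here. It's that support, SRE, and QA all key off the same identifier, so a ticket becomes a pointer into the codebase rather than a dead end.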
The knowledge that's been missing
Here's the structural problem with the three-silo model: knowledge doesn't compound.
A support escalation that finally gets resolved carries valuable signal — which code path was fragile, which customer behavior triggered it, which configuration was the culprit. In most organizations, that signal gets written up in a ticket, closed, and never touched again. It doesn't inform the next QA cycle. It doesn't make the next SRE incident easier to triage. It doesn't get fed back to the developer who wrote the code.
The SRE who resolves an incident at 2am knows something by the end of it that they didn't know at the beginning. That knowledge lives in their head. It gets shared in a postmortem, maybe. Then it fades.
This is why experience curves in software operations are so flat. You can have a team of senior engineers who've been operating a system for five years and still spend the same amount of time on incidents as they did in year one. Because the knowledge doesn't accumulate. It dissipates.
As much as we all want to drink the vibe coding kool-aid, the reality is that 90%+ of all code in a company is maintaining and migrating existing stuff that is complicated and messy. You can definitely vibe code the remaining 10% but unless there is a working example of how to reinvent the 90%, AI will not live up to its potential. https://t.co/EGghawwsp8
— Chamath Palihapitiya (@chamath) April 15, 2026
The factory floor changes that. When every support ticket, every incident, every deployment, every code change flows through a shared model of how the system actually behaves — that model gets smarter with every cycle. The third incident involving the same code path is resolved in minutes because the system already knows what the first two taught it. The QA team catches an edge case in pre-production because the model understands which user behaviors have historically triggered failures in that part of the code. This is what automated issue resolution actually looks like when it works — not rules-based automation, but a system that learns.
This is what compounding looks like in software operations. And it doesn't happen in silos.
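As a sketch of what compounding could mean mechanically: a store keyed by code path, where each resolved incident deposits its resolution, so the next incident on the same path starts from accumulated answers rather than from zero. Everything below is illustrative, assuming a hypothetical IncidentKnowledge store, not any real system.

```python
from collections import defaultdict

class IncidentKnowledge:
    """Toy model of compounding operational knowledge.

    Each resolved incident records what fixed it, keyed by the code path
    involved. A later incident on the same path can be matched against
    prior resolutions instead of being triaged from scratch.
    """
    def __init__(self):
        self._resolutions = defaultdict(list)  # code path -> past fixes

    def record(self, code_path: str, resolution: str) -> None:
        self._resolutions[code_path].append(resolution)

    def prior_fixes(self, code_path: str) -> list[str]:
        return list(self._resolutions[code_path])

kb = IncidentKnowledge()
kb.record("payments/retry.py", "raise retry backoff ceiling")
kb.record("payments/retry.py", "cap concurrent retries per account")

# Third incident on the same path: two candidate fixes already known,
# instead of another 2am reconstruction from logs.
print(kb.prior_fixes("payments/retry.py"))
```

A postmortem document is write-only; a store like this is read at triage time, which is what makes the experience curve bend.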
What rebuilding the factory floor actually requires
You can't build a factory floor by buying three more point solutions. That's more conveyor belts. No matter how fast they are, you'll never get exponential efficiency improvements.
The factory floor requires a single, unified model of how your software actually works in production — one that connects code to customer signals, commits to incidents, test failures to real user behavior. We call this a production world model. It's the accumulated institutional knowledge of how your system behaves, made explicit and queryable, instead of locked in people's heads and scattered across tools.
Building it requires three things working together.
First, a codebase integration that understands not just what your code says but how it behaves — which paths are exercised by which user behaviors, which changes create risk, which components are structurally fragile. This isn't static analysis. It's a live understanding that updates with every commit. Code simulations are one way to test this understanding before changes hit production.
Second, a connection to production reality. Support tickets, error traces, code telemetry, session replays — the signals that tell you what customers are actually experiencing and how those experiences map back to specific code. This is the link that most organizations are missing. It's the connection between the customer who filed a ticket and the engineer who needs to fix it.
Third, a feedback loop that turns every incident into learning. Not just documentation. Structured knowledge that the system can use to catch the next version of the same problem before it ships. This is the core of agentic debugging — systems that don't just surface issues but learn from how they were resolved.
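The three requirements above can be sketched as one loop. Assume three hypothetical pieces: a code model mapping user behaviors to code paths, a handler for incoming production signals, and a feedback step that writes resolutions back. This is the shape of the system, not an implementation.

```python
# Hypothetical sketch of a "production world model" loop.
# None of these names correspond to a real API.

code_model = {            # 1. codebase integration: behavior -> code path
    "checkout_submit": "billing/invoice.py",
}
learned_fixes = {}        # 3. feedback loop: code path -> known resolution

def handle_signal(behavior: str, error: str) -> str:
    """2. production reality: map a customer signal to code, then to action."""
    path = code_model.get(behavior)
    if path is None:
        return "unmapped: escalate to engineering"
    if path in learned_fixes:
        return f"known issue in {path}: apply '{learned_fixes[path]}'"
    return f"new issue in {path}: debug, then record the fix"

def record_fix(behavior: str, fix: str) -> None:
    """Close the loop so the next occurrence resolves without escalation."""
    learned_fixes[code_model[behavior]] = fix

print(handle_signal("checkout_submit", "500 on submit"))   # new issue: debug
record_fix("checkout_submit", "validate currency before totaling")
print(handle_signal("checkout_submit", "500 on submit"))   # known issue: apply fix
```

Remove any one of the three pieces and the loop breaks: without the code model, signals stay unmapped; without production signals, nothing enters the loop; without the feedback step, every incident is the first incident.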
When those three things work together, support teams can resolve issues without escalating to engineering. SREs can trace an incident to its root cause in minutes instead of hours. QA can validate against real production behavior instead of imagined test cases. And the whole system gets better with every release.
That's not a faster conveyor belt. That's a rebuilt factory floor.
The window is shorter than you think
The teams getting 5x aren't the ones who bought the most AI tools. They're the ones who asked the harder question first: what would we have to change about how we operate software, not just how we build it?
The AI factory conversation is happening at the board level right now. CTOs are being asked to justify their AI investments in terms of order-of-magnitude outcomes, not incremental improvement. Most of them don't have a good answer yet because they're still optimizing the assembly line.
The ones who figure out how to rebuild the factory floor — how to turn support, SRE, and QA from three siloed functions into one unified production engineering practice — are going to have a structural advantage that compounds over time. Every release makes their system smarter. Every incident makes their model richer. Every customer signal feeds back into the next QA cycle.
We've seen what this looks like in practice. Cyrano Video reduced engineering hours on support by 80% and enabled their CS team to resolve 40% of issues without escalating to engineering — not by hiring more people, but by giving everyone the context they were missing. Key Data went from debugging cycles measured in weeks to minutes. That's not a faster conveyor belt. That's a different factory.
And if you're thinking about where AI fits into all of this — the coding tools, the agents, the second-generation platforms — the framing that matters most is what AI-assisted coding can't yet do: operate what it builds. That's the factory floor problem. That's the one worth solving.
That's the factory worth building.