Context Graphs: Building Production World Models for the Age of AI Agents

AI generates code remarkably well. But here's what it struggles with: understanding production reality. Production exists in pieces. Code describes what should happen; observability tools see signals; ticketing systems see problems; CI/CD sees changes. Every surface sees a slice of production. None maintains a coherent model of how the system actually works. 

The same fragmentation exists across people and roles: SRE, support, QA, dev, PM. There is no central understanding of how production software works. Even within teams, knowledge siloed in individual heads creates problems for the whole organization.

Production understanding is implicit and fragmented—it lives in code, dashboards, tickets, tribal knowledge, and the heads of a few senior engineers. So when it’s time to solve a customer problem in production, the response mirrors the reality: it’s disjointed, slow, and siloed. 

For AI to truly help, it needs to understand the “why”: the backdrop of key decisions. Not just where we are today, but how we got here.

The Two Clocks Problem

This gap between the state clock (what is true now) and the event clock (how it became true) is the two clocks problem. A few examples make it concrete: your CRM stores the final deal value, not the negotiation. Your ticket system stores "resolved," not the reasoning. Your codebase stores the current state, not the two architectural debates that produced it.

We've built a trillion-dollar infrastructure for what's true now. Almost nothing for why it became true.

This made sense when humans were the reasoning layer. The organizational brain was distributed across human heads, reconstructed on demand through conversation. Now we want AI systems to decide, and we've given them nothing to reason from. We're asking models to exercise judgment without access to precedent. It's like training a lawyer on verdicts without case law.

The config file says timeout=30s. It used to say timeout=5s. Someone raised it sixfold. Why? The git blame shows who. The reasoning is gone.

The pattern is everywhere. The CRM says "closed lost," not which objection killed the deal or what might have saved it.

Enterprise software got very good at storing state, but it's still bad at storing decisions. Most systems can tell you what's true right now and what happened, but they don't preserve why a choice was made in the moment—what inputs were considered, which constraints were binding, and what tradeoff actually drove the outcome. That's why "connect an LLM to your systems" often disappoints: the model can see data, but it can't see the organization's decision logic. If you want AI to act reliably, you need a way to represent not just state, but the reasoning that turns state into action.

The valuable layer isn't the documents; it's the decisions those documents informed along with how those documents were created. How did the partner actually structure that earn-out? Why did the analyst reject that risk? What made the clinician deviate from protocol? 

Decision traces are where institutional intelligence actually lives. But decision traces are even harder to anonymize than documents. You can pseudonymize entities. You can't easily anonymize patterns of judgment. "We always take a harder line when the counterparty's counsel is from X firm" reveals something even with X masked.

What Context Graphs Actually Are

The next trillion-dollar platforms won't be built by adding AI to existing systems of record, but by capturing the reasoning that connects data to action in a context graph. A context graph captures something that systems of record explicitly do not: the history, the “why”, the “how did we get here?”

When context graphs accumulate enough structure, they can become a world model. They encode organizational physics—decision dynamics, state propagation, entity interactions. You can run simulations or tests within these models. You can ask "what if?" and get useful answers, not wild hallucinations, because you've built something real.

A context graph isn't a graph of nouns; it's a graph of decisions with evidence, constraints, and outcomes.

"Context graph" becomes real when you can turn messy operations into something replayable: not just events, but decisions with the evidence that was available, the constraints that were binding, the tradeoff that won, and what followed. Without that, you get either a beautiful model that doesn't drive action, or a firehose of activity that can't be learned from.

What does this look like in practice? A renewal agent proposes a 20% discount. Policy caps renewals at 10% unless a service-impact exception is approved. The agent pulls three SEV-1 incidents from PagerDuty, an open "cancel unless fixed" escalation in Zendesk, and the prior renewal thread where a VP approved a similar exception last quarter. It routes the exception to Finance. Finance approves. The CRM ends up with one fact: "20% discount."

Once you have decision records, the "why" becomes first-class data. Over time, these records naturally form a context graph: the entities the business already cares about (accounts, renewals, tickets, incidents, policies, approvers, agent runs) connected by decision events (the moments that matter) and "why" links. Companies can now audit and debug autonomy and turn exceptions into precedent instead of re-learning the same edge case in Slack every quarter.
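A minimal sketch of what one of those decision records might look like for the renewal scenario above. The field names and identifiers are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DecisionRecord:
    """One decision event, plus the context that justified it."""
    decision_id: str
    action: str                 # what was committed
    evidence: list[str]         # the inputs that were actually consulted
    constraints: list[str]      # policies or limits that were binding
    tradeoff: str               # the reasoning that won
    approvals: list[str]        # who signed off, in what order
    outcome: str | None = None  # filled in later, once consequences are known
    decided_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# The renewal exception from the example above, as a replayable record
# (all identifiers made up for illustration).
renewal_exception = DecisionRecord(
    decision_id="dec-2024-0042",
    action="renewal discount 20% (exceeds the 10% policy cap)",
    evidence=[
        "pagerduty:SEV1-311", "pagerduty:SEV1-317", "pagerduty:SEV1-329",
        "zendesk:ESC-5521 (cancel unless fixed)",
        "crm:renewal-thread-Q4 (prior VP-approved exception)",
    ],
    constraints=["policy:renewal-discount-cap-10pct", "exception:service-impact"],
    tradeoff="retention risk outweighs ten points of margin on this renewal",
    approvals=["finance"],
)
```

The CRM still ends up with a single fact, "20% discount"; the record is the "why" link that makes the decision auditable and reusable as precedent.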

The feedback loop makes this compound. Captured decision traces become searchable precedents. And every automated decision adds another trace to the graph.

None of this requires full autonomy on day one. It starts with human-in-the-loop: the agent proposes, gathers context, routes approvals, and records the trace. Over time, as similar cases repeat, more of the path can be automated because the system has a structured library of prior decisions and exceptions. Even when a human still makes the call, the graph keeps growing, because the workflow layer captures the inputs, approval, and rationale as durable precedent instead of letting it die in Slack.

If context graphs are so obviously essential, why haven’t we seen more of them?

Why Context Graphs Are Rare: The Five Coordinate Systems Problem

Context graphs don't really exist out in the wild today because they require joins across coordinate systems that don't share keys.

Traditional databases solved joins decades ago. You have a customer_id, an order_id, a foreign key relationship. The join is discrete; the keys are stable; the operation is well-defined.

Organizational reasoning requires a different kind of join. You need to connect what happened (events) to when it happened (timeline) to what it means (semantics) to who owned it (attribution) to what it caused (outcome). These are five different coordinate systems. None of them share a primary key. And the keys themselves are fluid. "Jaya Gupta" in an email, "J. Gupta" in a contract, "@JayaGup10" in Slack. Same entity, no shared identifier. The join condition isn't equality. It's probabilistic resolution across representations in latent space.

Every existing data system optimizes for joins within a single coordinate space. Context graphs require joins across all of them simultaneously.

Five coordinate systems, five types of joins: 

  1. Timeline joins: Connecting state across time. The config is 30s now. It was 5s last Tuesday. Joining these requires temporal indexing where "before" and "after" are first-class operations, not filters. 

  2. Event joins: Connecting occurrences into sequences. The deploy happened, then the alert fired, then the rollback. Order matters. Causally relevant windows matter. The join condition is proximity in event-space, not key equality. 

  3. Semantic joins: Connecting meaning across representations. "Churn risk" in a support ticket relates to "retention concern" in a sales note. The join is vector similarity, not string matching. Fuzzy by nature. 

  4. Attribution joins: Connecting actions to actors to ownership. Who approved this? Who owns that decision? The join traverses org structure, permission hierarchies, approval chains. The topology itself is the join condition. 

  5. Outcome joins: Connecting decisions to consequences. This pricing change led to that revenue impact. The join is causal, not correlational. It requires counterfactual reasoning: what would have happened otherwise?

Each join type has different geometry (sketched below): timeline is linear, events are sequential, semantics live in vector space, attribution is graph-structured, and outcomes are causal DAGs.
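To make the contrast with key-equality joins concrete, here is a hedged sketch of the five join conditions as standalone predicates. Every function name, threshold, and data shape is an illustrative assumption rather than a reference implementation:

```python
import math
from datetime import datetime, timedelta

def timeline_join(t_a: datetime, t_b: datetime, window: timedelta) -> bool:
    """Timeline join: 'before'/'after' within a causally relevant window."""
    return abs(t_a - t_b) <= window

def event_join(sequence: list[str], a: str, b: str, max_gap: int = 3) -> bool:
    """Event join: proximity in an ordered sequence, not key equality."""
    if a not in sequence or b not in sequence:
        return False
    return 0 < sequence.index(b) - sequence.index(a) <= max_gap  # a happened shortly before b

def semantic_join(vec_a: list[float], vec_b: list[float], threshold: float = 0.8) -> bool:
    """Semantic join: similarity in embedding space. Fuzzy by nature."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(y * y for y in vec_b))
    return norm > 0 and dot / norm >= threshold

def attribution_join(reports_to: dict[str, str], actor: str, owner: str) -> bool:
    """Attribution join: walk the ownership/approval chain; the topology is the condition."""
    node = actor
    for _ in range(len(reports_to)):  # bounded walk up the reporting chain
        if node not in reports_to:
            return False
        node = reports_to[node]
        if node == owner:
            return True
    return False

def outcome_join(causal_edges: set[tuple[str, str]], decision: str, effect: str) -> bool:
    """Outcome join: an edge in a causal DAG, not a correlation."""
    return (decision, effect) in causal_edges
```

Each predicate has a different geometry, and none of them reduces to `a.key == b.key`, the only join shape existing systems are built around.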

There's no shared coordinate system or universal key. The context graph problem becomes solvable when you realize you don't build context graphs intentionally; they emerge as a by-product of how agents and humans interact.

How Context Graphs Become Tractable: Agent Trajectories as Training Data

The reason context graphs are now feasible is that we can learn a shared coordinate system where these joins become expressible.

Agent trajectories, generated as agents begin to own meaningful work, are an emerging training signal. When an agent solves a problem, it performs all five join types implicitly. It resolves entities across representations. It sequences events. It connects meaning. It traverses ownership. It traces outcomes. The trajectory is a sample of successful multi-coordinate joins.

Accumulate enough trajectories over time, and you learn embeddings that encode join-compatibility across coordinate systems. Entities that co-occur in trajectories are entities that join well in practice. The embedding space becomes a learned join index. Structural representations need to cooperate with semantic ones. Semantic embeddings encode meaning-similarity. Structural embeddings, learned from trajectories, encode operational-coupling. Together they give you a space where "find related decisions" can mean: related in time, related in meaning, related in ownership, related in outcome.

Any weighted combination of those, with arbitrary join predicates across coordinate systems. Context graphs haven't been built because joining across five different geometries with fluid keys required learning a shared representation from operational data. Agent trajectories provide that data, the math now exists, and the agent ergonomics are just entering the enterprise.
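A hedged sketch of that "find related decisions" query, assuming you already have a semantic embedding and a trajectory-learned structural embedding per entity. The weighting scheme and names are illustrative:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def related_decisions(
    query_id: str,
    semantic: dict[str, np.ndarray],    # meaning-similarity embeddings
    structural: dict[str, np.ndarray],  # operational-coupling embeddings learned from trajectories
    weights: tuple[float, float] = (0.5, 0.5),
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Score every other decision by a weighted blend of the two embedding spaces."""
    w_sem, w_struct = weights
    q_sem, q_struct = semantic[query_id], structural[query_id]
    scores = {
        other: w_sem * cosine(q_sem, semantic[other]) + w_struct * cosine(q_struct, structural[other])
        for other in semantic
        if other != query_id and other in structural
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Shifting the weights is what lets a single learned index approximate different joins; hard temporal or attribution filters can still be layered on top as predicates.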

Graph representation learning offers a useful lens here: structure can be learned from walks over a graph. Local walks (likely to backtrack) learn homophily—nodes are similar because they're connected. Global walks (pushing outward) learn structural equivalence—nodes are similar because they play analogous roles, even if never directly connected.

Consider two senior engineers at a company. One works on payments, one on notifications. No shared tickets, no overlapping code, no common Slack channels. Homophily wouldn't see them as similar. But structurally they're equivalent—same role in different subgraphs, similar decision patterns, similar escalation paths. Structural equivalence reveals this.
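As a generic sketch of how walk co-occurrence becomes an embedding space, here is a toy example: biased walks over a made-up entity graph, fed to a skip-gram model (gensim's Word2Vec). The graph, the backtrack_p bias, and all entity names are assumptions for illustration:

```python
import random
from gensim.models import Word2Vec

# Toy entity graph: which people, services, and artifacts get touched together.
graph = {
    "payments-svc":    ["eng-alice", "deploy-pipeline"],
    "notify-svc":      ["eng-bob", "deploy-pipeline"],
    "eng-alice":       ["payments-svc", "oncall-rotation"],
    "eng-bob":         ["notify-svc", "oncall-rotation"],
    "deploy-pipeline": ["payments-svc", "notify-svc"],
    "oncall-rotation": ["eng-alice", "eng-bob"],
}

def biased_walk(start: str, length: int = 10, backtrack_p: float = 0.5) -> list[str]:
    """Random walk; a higher backtrack_p keeps the walk local, a lower one pushes it outward."""
    walk = [start]
    for _ in range(length - 1):
        prev = walk[-2] if len(walk) > 1 else None
        neighbors = graph[walk[-1]]
        if prev in neighbors and random.random() < backtrack_p:
            walk.append(prev)                      # stay local
        else:
            walk.append(random.choice(neighbors))  # explore outward
    return walk

walks = [biased_walk(node) for node in graph for _ in range(50)]
model = Word2Vec(sentences=walks, vector_size=32, window=3, min_count=1, sg=1, epochs=20)

# Similarity learned purely from walk co-occurrence: in this toy graph the two
# engineers end up close because they are coupled through shared neighbors.
print(model.wv.similarity("eng-alice", "eng-bob"))
```

Pure co-occurrence like this captures homophily; recovering the "same role, different subgraph" similarity described above takes the structural-equivalence flavor of walks or role-aware embeddings.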

Agents are informed (not random) walkers.

When an agent investigates an issue or completes a task, it traverses organizational state space. It touches systems, reads data, calls APIs. The trajectory is a walk through the graph of organizational entities.

Unlike random walks, agent trajectories are problem-directed. The agent adapts based on what it finds. Investigating a production incident, it might start broad—what changed recently across all systems? That's global exploration, structural equivalence territory. As evidence accumulates, it narrows to specific services, specific deployment history, specific request paths. That's local exploration, homophily territory.

Random walks discover structure through brute-force coverage. Informed walks discover structure through problem-directed coverage. The agent goes where the problem takes it, and problems reveal what actually matters.
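A sketch of what recording one such informed walk could look like, with the TrajectoryStep shape and every entity name invented for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TrajectoryStep:
    entity: str   # the system, service, or artifact the agent touched
    action: str   # what it did there (list, diff, read, attach, ...)
    at: datetime

def step(entity: str, action: str) -> TrajectoryStep:
    return TrajectoryStep(entity, action, datetime.now(timezone.utc))

# One incident investigation, broad to narrow, recorded as a walk through
# organizational entities (all names illustrative).
trajectory = [
    step("deploy-pipeline", "list changes across all services in the last 24h"),  # global
    step("feature-flags", "diff flags toggled since the alert"),
    step("checkout-svc", "read error-rate dashboard"),                            # narrowing
    step("checkout-svc@v142", "diff against v141"),
    step("ticket:INC-2193", "attach root-cause note and link evidence"),          # local
]

# Consecutive touches become edges; accumulated across thousands of trajectories,
# these edges are the training data for the structural embeddings sketched above.
edges = [(a.entity, b.entity) for a, b in zip(trajectory, trajectory[1:])]
```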

Engineered correctly, agent trajectories become the event clock.

Each trajectory samples organizational structure, biased toward parts that matter for real work. Accumulate thousands and you get a learned representation of how the organization functions, discovered through use.

The ontology emerges from walks. Entities appearing repeatedly are entities that matter. Relationships traversed are relationships that are real. Structural equivalences reveal themselves when different agents solving different problems follow analogous paths.

There's economic elegance here. The agents aren't building the context graph—they're solving problems worth paying for. The context graph is the exhaust. Better context makes agents more capable; capable agents get deployed more; deployment generates trajectories; trajectories build context. But it only works if agents do work that justifies the compute.

Over time, as context graphs build up enough knowledge, they can become something more: a full navigable production world model.

Context Graphs Can Grow Into a Production World Model

A production world model is a learned, compressed representation of how an environment works. It encodes dynamics: what happens when you take an action in a given state. It captures structure: what entities exist and how they relate. And it enables prediction: given a current state and a proposed action, what happens next?

World models demonstrate something important: agents can learn compressed representations of environments and train entirely inside "dreams"—simulated trajectories through latent space. The world model becomes a simulator. You can run hypotheticals and get useful answers without executing in the real environment.
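A minimal interface sketch of what "the world model becomes a simulator" means here; the State and Action shapes are placeholders, not a claim about any particular implementation:

```python
from typing import Protocol

State = dict[str, object]    # e.g. configs, flags, dependency health (placeholder)
Action = dict[str, object]   # e.g. a proposed deploy or config change (placeholder)

class WorldModel(Protocol):
    def predict(self, state: State, action: Action) -> State:
        """Compressed dynamics: given a state and an action, what state follows?"""
        ...

def rollout(model: WorldModel, state: State, plan: list[Action]) -> list[State]:
    """Run a hypothetical trajectory entirely inside the model (a "dream"),
    without ever touching the real environment."""
    states = [state]
    for action in plan:
        states.append(model.predict(states[-1], action))
    return states
```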

This has an obvious analogy in robotics. A world model capturing physics (how objects fall, how forces propagate) lets you simulate robot actions before executing them, train policies in imagination, explore dangerous scenarios safely, and transfer to physical hardware. The better your physics model, the more useful your simulations.

The same logic applies to organizations, but the physics is different.

Organizational physics isn't mass and momentum. It's decision dynamics. How do exceptions get approved? How do escalations propagate? What happens when you change this configuration while that feature flag is enabled? What's the blast radius of deploying to this service given current dependency state?

State tells you what's true. The event clock tells you how the system behaves—and behavior is what you need to simulate.

A context graph with enough accumulated structure becomes a world model for organizational physics. It encodes how decisions unfold, how state changes propagate, how entities interact. Once you have that, you can simulate.

At PlayerZero, we build code simulations—projecting hypothetical changes onto our model of production systems and predicting outcomes. Given a proposed change, current configurations and feature flags, patterns of how users exercise the system: will this break something? What's the failure mode? Which customers are affected?

These simulations aren't magic. They are inferences over accumulated structure. We've watched enough trajectories through production problems to learn patterns—which code paths are fragile, which configurations interact dangerously, which deployment sequences cause incidents. The world model encodes this. Simulation is querying the model with hypotheticals.
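What a "what if" query against such a model could look like in practice. This is a hypothetical interface, not PlayerZero's actual API; the names, fields, and predicted-state keys are all assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class ChangeSimulation:
    breaks_something: bool
    failure_mode: str | None
    affected_customers: list[str]
    evidence: list[str] = field(default_factory=list)  # prior trajectories the prediction leans on

def simulate_change(world_model, proposed_diff: str, configs: dict, flags: dict,
                    usage_patterns: list[str]) -> ChangeSimulation:
    """Project a hypothetical change onto the model of production and read off the
    predicted outcome. Illustrative only: the interesting part is the learned model,
    not this wrapper."""
    state = {"configs": configs, "flags": flags, "usage": usage_patterns}
    predicted = world_model.predict(state, {"diff": proposed_diff})
    return ChangeSimulation(
        breaks_something=bool(predicted.get("regressions")),
        failure_mode=predicted.get("failure_mode"),
        affected_customers=predicted.get("affected_customers", []),
        evidence=predicted.get("supporting_trajectories", []),
    )
```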

Simulation is the test of understanding. If your context graph can't answer "what if," it's just a search index.

Implications for the Continual Learning Debate

Many folks argue AI isn't transforming the economy because models can't learn on the job—we're stuck building custom training loops for every capability, which doesn't scale to the long tail of organizational knowledge. The diagnosis is right.

But what if the standard framing is a distraction? Continual learning asks: how do we update weights from ongoing experience? That's hard—catastrophic forgetting, distributional shift, expensive retraining.

World models suggest an alternative: keep the model fixed, improve the world model it reasons over. The model doesn't need to learn if the world model keeps expanding.

This is what agents can do over accumulated context graphs. Each trajectory is evidence about organizational dynamics. At decision time, perform inference over this evidence: given everything captured about how this system behaves, given current observations, what's the posterior over what's happening? What actions succeed?

More trajectories, better inference. Not because the model updated, but because the world model expanded.

And because the world model supports simulation, you get something more powerful: counterfactual reasoning. Not just "what happened in similar situations?" but "what would happen if I took this action?" The agent imagines futures, evaluates them, chooses accordingly.

This is what experienced employees have that new hires don't. Not different cognitive architecture, a better world model. They've seen enough situations to simulate outcomes. "If we push this Friday, on-call will have a bad weekend." That's not retrieval. It's inference over an internal model of system behavior.

The path to economically transformative AI might not require solving continual learning. It might require building world models that let static models behave as if they're learning, through expanding evidence bases and inference-time compute to reason and simulate over them.

The model is the engine. The context graph is the world model that makes the engine useful.

One underlying dependency of world models is an ontology, so it’s worth exploring both prescribed and learned ontologies.

Prescribed vs Learned Ontologies: Two Approaches to Organizational Structure

Many people make the mistake of thinking that a context graph is a graph database or structured memory. That’s not true. Context graphs require a fundamentally different approach to schema and representation.

This matters as teams reach for familiar tools (Neo4j, vector stores, knowledge graphs) and wonder why their agents aren't getting smarter. The primitives are wrong.

"Ontology" is an overloaded term. There are prescribed ontologies (rule engines, workflows, governance layers). Palantir built a $50B company on this: a defined layer mapping enterprise data to objects and relationships. You define the schema. You enforce it. It works when you know the structure upfront.

The next $50B company will be built on learned ontologies: structure that emerges from how work actually happens, not how you designed it to happen. This is important because there's so much implicit knowledge in decision making that we don't even register in the moment, and agents replicate our judgment.

Enterprise AI has to navigate both. There are lots of priors for prescribed ontologies. There is almost no infrastructure for learning, representing, and updating the implicit ones. The implicit relationships (which entities get touched together, what co-occurs in decision chains) are the gap. And that gap is why memory alone will not solve the problem.

Memory assumes you know what to store and how to retrieve it. But the most valuable context is structure you didn't know existed until agents discovered it through use.

Another misconception: "decision traces are just trajectory logs." That's like saying embeddings are just keyword indexes. Technically adjacent, conceptually wrong.

Remember when embeddings looked like alien technology? A probabilistic way to represent similarity that made the "solved" problem of fuzzy search look prehistoric. People asked, "why do I need this when I have Elasticsearch?"

We're at a similar inflection point for structural learning. Trajectory logs store what happened. Decision traces (done right) learn why it happened. Which entities mattered. What patterns recur. How reasoning flows through organizational state space.

The difference: logs are append-only records. Decision traces are training data for production world models. The schema isn't something you define upfront. It emerges from the walks.

All of this may sound academic or hypothetical, but context graphs already exist in the wild today, and they will only become more common.

Where Context Graphs Actually Materialize

As argued earlier, the context graph becomes real only when messy operations become replayable: decisions with their evidence, constraints, tradeoffs, and outcomes, not just a stream of events. Whether that's achievable in a given domain comes down to three things.

First, the decision surface has to be legible. Some domains have clean "commits": triage calls, dispatch reassignments, deviation approvals, escalation decisions that end in a clear "we're doing X." Those are learnable because there's a boundary between deliberation and commitment. Other environments sprawl across half-decisions and reversible moves. If you can't identify what actually counted as the decision, you end up modeling noise instead of judgment. This is where many generic "process mining + LLM" efforts stall: they capture activity, but not the decision boundary.

Second, capture friction matters because it determines how hard it is to get decision traces. That effort varies dramatically by industry. In some environments, decisions already live inside software, so traces fall out naturally. In others, the real decisioning happens verbally: in escalations, handoffs, dispatch calls, re-planning huddles, negotiations. That's why voice is an unlock in many physical-world industries: it lets you capture elements of verbal decisioning as it happens, without forcing people to translate their judgment into forms and fields after the fact.

Third, capture alone isn't enough. Captured context can be wrong, stale, or quietly superseded. Context graphs inherit the organization's flaws: optimistic analysis that becomes lore, decisions announced in writing that were reversed in a meeting, assumptions that stopped being true but never got revisited.

Ontology stability matters too, but its implications diverge, and this is where the market splits.

In asset-heavy domains, the explicit structure of the world is relatively stable. That's why ontology-first platforms work at all. But these same domains have historically been forced to pay an expensive up-front modeling tax because the real decision layer wasn't captured continuously in real time. The opportunity is to keep the substrate, but add another learning loop: treat the prescribed model as scaffolding, and let traces continuously teach the system how decisions are actually made. Over time, deployments become less dependent on bespoke discovery cycles and more defensible through accumulated precedent.

In tech, the inverse problem shows up. Ontologies are unstable because the business itself is constantly being refactored. Products ship and deprecate features. Teams reorganize. Go-to-market motions change. New pricing models appear, old ones disappear.

Even within the same company, different functions operate on fundamentally different objects and timelines, especially in B2B sales, where deals, accounts, territories, approvals, and discount logic vary by segment, region, and quarter. The nouns don't just evolve; they fragment.

This fragmentation is where misalignment shows up. Different parts of the company carry different versions of "what we believe": strategy narratives that drift, metric definitions that mutate, policies that are rewritten by exception, sales motions that contradict product intent. In a human-only organization, this gets papered over with meetings and escalation. In an agentic organization, it becomes immediately operational, because agents act on whatever context they can retrieve. Contradictory context doesn't produce better decisions, it produces wasted work, re-litigation, and actions that undo other actions. Maintaining coherence as the organization changes becomes the hard part.

Over time, the most valuable thing an organization produces isn't data. It's the collection of decisions.

The accumulated patterns of how decisions actually get made: what evidence mattered, which constraints were binding, which exceptions were normal, which tradeoffs were acceptable become the organization's IP. That's the operating heart of the business, and today it mostly lives in people and eventually disappears.

Application companies have an opening because they sit on the decision surface. If you can capture judgment as a byproduct of execution and keep it current, you can build the context graph: compounding decision memory that becomes the moat.

Instrument decisions, then compile the missing layer. There's an inversion that becomes more viable over time. Instead of declaring the world first, you capture decisions at the moment they're committed, and learn from how judgment is applied in practice.

When a decision happens, you capture the resources consulted, the constraints applied, the tradeoff that was made, the action that was taken, and how it was later evaluated. Over time, these traces compile into a memory of how decisions actually get made.
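One way that compilation step could look, sketched under the assumption that traces share roughly the shape of the decision record shown earlier; the field names and the "good" evaluation label are illustrative:

```python
from collections import Counter, defaultdict

def compile_precedent(traces: list[dict]) -> dict[str, dict]:
    """Turn raw decision traces into a per-decision-type summary of how judgment is
    actually applied: which constraints bind, which exceptions recur, and how often
    the committed action was later judged to have worked."""
    by_type: dict[str, list[dict]] = defaultdict(list)
    for trace in traces:
        by_type[trace["decision_type"]].append(trace)

    summary = {}
    for decision_type, group in by_type.items():
        summary[decision_type] = {
            "count": len(group),
            "binding_constraints": Counter(c for t in group for c in t.get("constraints", [])),
            "recurring_exceptions": Counter(t["exception"] for t in group if t.get("exception")),
            "success_rate": sum(1 for t in group if t.get("evaluation") == "good") / len(group),
        }
    return summary
```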

This doesn't replace the formal ontology, and it doesn't happen all at once. The prescribed model still matters for shared semantics, state, and hard constraints. The learnable part is the layer ontology-first platforms don't reliably get for free: soft constraints, exception patterns, and tacit heuristics that determine outcomes.

In healthcare, the system knows the prior authorization was submitted. It doesn't know the pattern that determines whether the patient gets care in three days or three weeks: which documentation format a payer responds to, when appeals flip, when peer-to-peer needs to be initiated proactively, and which "standard steps" are dead ends. That logic isn't in the schema. It lives in the organization's accumulated precedent.

This also changes the product economics. Instead of paying the full modeling tax up front, you can start with a thin substrate and let the highest-value layer emerge from real operations. The value compounds because every edge case handled becomes training data, and every correction becomes signal.

Most systems can tell you what happened; almost none can reconstruct why it happened at the moment it mattered. A context graph isn't a graph of nouns; it's a graph of decisions with evidence, constraints, and outcomes.

Who will build the context graphs of the future? You might first assume they’ll come from the organizations that own the data, the systems of record, but they’re more likely to come from another source.

Why Incumbents Can't Build Context Graphs

Some are optimistic that existing players will evolve into this architecture. Warehouses become "truth registries," while CRMs become "state machines with APIs." This is a narrative of evolution, not replacement.

That might work for making existing data more accessible. It doesn't work for capturing decision traces.

Operational incumbents are siloed and prioritize current state.

Salesforce is pushing Agentforce, ServiceNow has Now Assist, and Workday is building agents for HR. Their pitch is "we have the data, now we add the intelligence."

But these agents inherit their parent's architectural limitations. Salesforce is built on current-state storage: it knows what the opportunity looks like now, not what it looked like when the decision was made. When a discount gets approved, the context that justified it isn't preserved. You can't replay the state of the world at decision time, which means you can't audit the decision, learn from it, or use it as precedent.

They also inherit their parents' blind spots. A support escalation doesn't live in Zendesk alone. It depends on the customer tier from the CRM, SLA terms from billing, recent outages from PagerDuty, and the Slack thread flagging churn risk. No incumbent sees this because no incumbent sits in the cross-system path.

Organizations that exist at the intersection of systems are the tell. RevOps exists because someone has to reconcile sales, finance, marketing, and customer success. DevOps exists because someone has to bridge development, IT, and support. Security Ops sits between IT, engineering, and compliance.

These "glue" functions are a tell. They emerge precisely because no single system of record owns the cross-functional workflow. The org chart creates a role to carry the context that software doesn't capture.

An agent that automates that role doesn't just run steps faster. It can persist the decisions, exceptions, and precedents the role was created to produce. That's the path to a new system of record: not by ripping out an incumbent, but by capturing a category of truth that only becomes visible once agents sit in the workflow.

Boiling all of this down, what does it mean practically for enterprises today?

The question isn't whether systems of record survive—they will. The question is whether the next trillion-dollar platforms are built by adding AI to existing data, or by capturing the decision traces that make data actionable.

What This Means: The Three Hard Problems

Context graphs require solving three problems:

The two clocks problem. We've built trillion-dollar infrastructure for state and almost nothing for reasoning. The event clock has to be reconstructed.

Schema as output. You can't predefine organizational ontology. Agent trajectories discover structure through problem-directed traversal. The embeddings are structural, not just semantic—capturing neighborhoods and reasoning patterns, not only meaning.

World models, not retrieval systems. Context graphs that accumulate enough structure become simulators. They encode organizational physics—decision dynamics, state propagation, entity interactions. Simulation is the test. If you can ask "what if?" and get useful answers, you've built something real.

The companies that do this will have something qualitatively different. Not agents that complete tasks—organizational intelligence that compounds and evolves. That simulates futures, not just retrieves pasts. That reasons from learned world models rather than starting from scratch.

That's the unlock. Not better models. Better infrastructure for making deployed intelligence accumulate.