The agent that worked in testing and broke in production
Building an AI agent that works in a demo is straightforward. The problem shows up when real context arrives: noisy data, long sessions, ambiguous states. Most teams find that out after deploying.
The agent passed every test. Every demo worked. The responses were coherent, the flow was correct, the tools were called in the right order.
Then it went to production.
Within three days, support started getting complaints about responses that made no sense. The agent was making decisions based on context that no longer existed. Sometimes it would repeat an action the user had cancelled two turns earlier. Sometimes it would just stop in the middle of a flow without signaling anything.
I'd seen this pattern before. And every time it shows up, the problem isn't the model. It's what you passed to it.
The illusion of clean context
When you build an agent in a controlled environment, the context is always clean. You know exactly what's in the history. The tools return predictable data. Turns are short. The user follows an expected flow.
In production, none of that is guaranteed.
The user starts with one intent, changes their mind halfway through, asks a tangential question, and comes back to the original intent three messages later. The tool that fetches external data returns a payload with null fields that didn't exist in the test environment. The history has accumulated twelve turns, and critical information from the early ones is now diluted or truncated out of the model's context window.
The agent doesn't break. It keeps responding. It's just that the responses start drifting away from what the user actually wants.
I call this Context Drift. It's not an explicit error. It's a silent degradation.
What most teams get wrong
Most teams treat context like a log. You keep appending messages, the model reads everything, and you assume it will understand the current state of the world from the history.
That works for short conversations. It fails in agents that need to maintain state across multiple turns, multiple tools, and intents that change along the way.
The technical problem is clear: LLMs don't have memory. They have context. And context is finite, noisy, and unevenly attended to. Information from turn two of a twenty-turn conversation tends to carry less weight than the same information at turn nineteen.
But the architectural problem is more subtle. When you use the history as a substitute for state, you're transferring the responsibility of tracking what's happening over to the model. And the model will get it wrong.
The real problem behind the problem
What I've found, after debugging this kind of failure more times than I'd like, is that agents in production need two separate things: fact memory and state memory.
Fact memory is what the user said, what the tools returned, what was agreed upon.
State memory is where the agent is in the flow: which intent is active, which steps have been completed, what's pending, what was cancelled.
When you mix the two into the same message history, the model has to infer state from facts. Sometimes it gets it right. Sometimes it decides the user still wants to cancel the order because that's what they said four turns ago, even though they changed their mind two turns later.
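To make the distinction concrete, here is a minimal Swift sketch. The names `Fact`, `FlowEvent`, `FlowState`, and `apply` are hypothetical, invented for illustration. The only point is that facts get appended while state gets overwritten, so a later change of mind replaces the earlier cancellation instead of sitting next to it.

```swift
// Hypothetical types for illustration; not from any library.

/// Fact memory: an append-only record of what was said or returned.
struct Fact {
    let turn: Int
    let text: String
}

/// Events the orchestration code derives from a turn (an assumed, minimal set).
enum FlowEvent {
    case cancelled(action: String)
    case reinstated(action: String)
}

/// State memory: where the flow stands right now.
struct FlowState {
    var pendingActions: Set<String> = []
    var cancelledActions: Set<String> = []

    /// Applies one event; the latest event for an action wins.
    mutating func apply(_ event: FlowEvent) {
        switch event {
        case .cancelled(let action):
            pendingActions.remove(action)
            cancelledActions.insert(action)
        case .reinstated(let action):
            cancelledActions.remove(action)
            pendingActions.insert(action)
        }
    }
}

var facts: [Fact] = []
var state = FlowState(pendingActions: ["place_order"])

// Turn 4: the user cancels the order.
facts.append(Fact(turn: 4, text: "User: cancel my order"))
state.apply(.cancelled(action: "place_order"))

// Turn 6: the user changes their mind.
facts.append(Fact(turn: 6, text: "User: actually, keep the order"))
state.apply(.reinstated(action: "place_order"))

// Both statements stay in fact memory; state holds only the current truth.
print(facts.count)
print(state.cancelledActions.isEmpty)
print(state.pendingActions.contains("place_order"))
```

With this split, the model never has to work out which of the two user statements is the current one. The code that applied the events already did.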
What actually works better in practice
The change that had the biggest impact on consistency across the agents I've worked on was separating explicit state from the narrative history.
Instead of letting the model infer where the flow stands, you pass the current state as part of the system prompt, updated every turn. Not as generic text. As structure.
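    // Simplified example of how state context is injected
    struct AgentContext {
        var activeIntent: String        // what the user is trying to do right now
        var completedSteps: [String]    // steps already finished in the current flow
        var pendingActions: [String]    // actions still waiting to happen
        var cancelledActions: [String]  // actions the user explicitly called off
        var relevantFacts: [String]     // facts worth carrying into the next turn
    }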
Before each call to the model, you serialize that context, inject it into the system prompt, and truncate the message history to keep only what's recent and relevant.
The model stops having to infer state. You pass the state. It operates on top of that.
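Here is one way the serialize, inject, and truncate step could look. This is a sketch under assumptions, not a prescribed implementation: `Message`, `buildSystemPrompt`, and `truncateHistory` are hypothetical names, the order data is made up, and the truncation keeps messages by recency only. Filtering by relevance would need logic specific to the agent.

```swift
import Foundation

// Same fields as the struct above, made Codable so it can be serialized.
struct AgentContext: Codable {
    var activeIntent: String
    var completedSteps: [String]
    var pendingActions: [String]
    var cancelledActions: [String]
    var relevantFacts: [String]
}

// Hypothetical message type; real chat APIs differ in shape.
struct Message {
    let role: String
    let content: String
}

/// Serializes the state as JSON and appends it to the base instructions.
func buildSystemPrompt(base: String, context: AgentContext) throws -> String {
    let encoder = JSONEncoder()
    // Sorted keys keep the prompt stable from turn to turn, which makes logs easier to diff.
    encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
    let data = try encoder.encode(context)
    let json = String(decoding: data, as: UTF8.self)
    return base + "\n\nCurrent state (authoritative; prefer it over the history):\n" + json
}

/// Keeps only the last `maxMessages` messages (recency only, an assumption of this sketch).
func truncateHistory(_ history: [Message], maxMessages: Int) -> [Message] {
    Array(history.suffix(max(0, maxMessages)))
}

// Usage with made-up data.
let context = AgentContext(
    activeIntent: "change_shipping_address",
    completedSteps: ["identify_order"],
    pendingActions: ["confirm_new_address"],
    cancelledActions: ["cancel_order"],
    relevantFacts: ["Order 1234 has not shipped yet"]
)
let history = (1...12).map { Message(role: "user", content: "message \($0)") }
let prompt = try buildSystemPrompt(base: "You are a support agent.", context: context)
let recent = truncateHistory(history, maxMessages: 4)
print(prompt)
print(recent.map(\.content))
```

JSON is just one choice of structure here. Any format works as long as it is unambiguous and rebuilt from the tracked state on every turn.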
The part nobody wants to hear
This means you need an orchestration layer that tracks and updates state outside the model. That decides what goes into the context and what doesn't. That has business logic around what constitutes a completed step, a cancelled action, an active intent.
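A minimal sketch of what that business logic might look like, reusing the `AgentContext` fields from the earlier example. `Orchestrator`, `ToolResult`, and both rules are assumptions made for illustration. Your agent's definition of "completed" or "cancelled" will almost certainly differ.

```swift
// Hypothetical orchestration sketch; names and rules are assumptions.

struct AgentContext {
    var activeIntent: String
    var completedSteps: [String]
    var pendingActions: [String]
    var cancelledActions: [String]
    var relevantFacts: [String]
}

/// Outcome of a tool call as seen by the orchestration layer.
struct ToolResult {
    let step: String
    let succeeded: Bool
}

struct Orchestrator {
    private(set) var context: AgentContext

    init(context: AgentContext) {
        self.context = context
    }

    /// Assumed rule: a step counts as completed only if its tool call succeeded.
    mutating func record(_ result: ToolResult) {
        guard result.succeeded else { return }
        context.pendingActions.removeAll { $0 == result.step }
        if !context.completedSteps.contains(result.step) {
            context.completedSteps.append(result.step)
        }
    }

    /// Assumed rule: switching intent cancels whatever was still pending for the old one.
    mutating func switchIntent(to intent: String, pending: [String]) {
        guard intent != context.activeIntent else { return }
        context.cancelledActions.append(contentsOf: context.pendingActions)
        context.activeIntent = intent
        context.pendingActions = pending
    }
}

var orchestrator = Orchestrator(context: AgentContext(
    activeIntent: "cancel_order",
    completedSteps: [],
    pendingActions: ["lookup_order", "confirm_cancellation"],
    cancelledActions: [],
    relevantFacts: []
))

orchestrator.record(ToolResult(step: "lookup_order", succeeded: true))
// A failed call leaves the step pending instead of marking it done.
orchestrator.record(ToolResult(step: "confirm_cancellation", succeeded: false))
// The user changes their mind; the unfinished cancellation is moved to cancelled.
orchestrator.switchIntent(to: "change_address", pending: ["collect_new_address"])

print(orchestrator.context)
```

None of these decisions are made by the model. They are ordinary code, which is exactly why they can be unit-tested.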
It's more work. It's a real piece of software, with its own responsibilities, its own bugs, and its own tests.
Most teams don't want to do that. They want to believe the model will figure it out. And the model does... until it doesn't anymore.
Context isn't what you collect. It's what you decide to pass.
That distinction seems obvious written out like this. In production, it costs days of debugging to learn.
Honest trade-offs
This approach has a real cost: you're increasing the complexity of the orchestration layer. Now there's state logic living outside the model, and that logic needs to be tested, maintained, and evolved alongside the agent's flows.
If the agent's behavior changes, you need to update both the prompt and the state logic. That can cause drift between the two if there's no discipline around it.
There's also the risk of over-engineering. For simple agents with short, well-defined flows, this separation may be unnecessary. The message history is enough.
The problem is that you rarely know, before going to production, whether the flow will actually stay simple.
What I would do differently today
I'd start with state separation from the beginning, even if the agent seems simple.
The cost of adding that layer early is low. The cost of adding it later, when unexpected behavior is already happening in production and the real conversation history has become debug evidence, is much higher.
I'd also invest earlier in observability specific to context: logging what's being passed to the model on every turn, not just the response that came back. Most agent systems I've seen log output, not input. You only find out what was in the context after something already broke.
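As a rough sketch of input-side logging: the wrapper below records exactly what is sent before the call is made, so the record survives even when the call fails. `ModelCallLog`, `LoggingModelClient`, and the `callModel` closure are hypothetical stand-ins for whatever client and log sink your system uses. Logs are kept in memory here only to keep the example self-contained.

```swift
// Hypothetical log record; field names are assumptions for this sketch.
struct ModelCallLog: Codable {
    let turn: Int
    let systemPrompt: String
    let messages: [String]
    var response: String?
}

/// Wraps a model call so the exact input is recorded before the call happens.
struct LoggingModelClient {
    /// Stand-in for the real model client.
    let callModel: (_ systemPrompt: String, _ messages: [String]) throws -> String
    private(set) var logs: [ModelCallLog] = []

    init(callModel: @escaping (String, [String]) throws -> String) {
        self.callModel = callModel
    }

    mutating func send(turn: Int, systemPrompt: String, messages: [String]) throws -> String {
        // Log the input first, so it is kept even if the call throws.
        logs.append(ModelCallLog(turn: turn, systemPrompt: systemPrompt,
                                 messages: messages, response: nil))
        let response = try callModel(systemPrompt, messages)
        logs[logs.count - 1].response = response
        return response
    }
}

struct ModelUnavailable: Error {}

// Fake model for the example: fails when there are no messages.
var client = LoggingModelClient { _, messages in
    guard !messages.isEmpty else { throw ModelUnavailable() }
    return "ok"
}

_ = try client.send(turn: 1, systemPrompt: "activeIntent=greeting", messages: ["hello"])
let failed = try? client.send(turn: 2, systemPrompt: "activeIntent=cancel_order", messages: [])

print(client.logs.count)
print(failed as Any)
```

When turn 2 fails, the log still shows which state was injected for it, which is usually the first thing you need when debugging.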
Final reflection
There's a common belief that the model is the critical component of an agent. You pick the right model, tune the prompt, and the behavior emerges.
In practice, the model is the most predictable component. Given the same context and the same sampling settings, it behaves largely consistently.
What changes in production is the context. And context changes because the real world is noisy, users don't follow expected flows, and external systems return data that didn't exist in the test environment.
Agents don't fail because of the model. They fail because the context that reached the model wasn't what you thought you were passing.
That shifts where you put your attention during development. And it completely changes where you look when something breaks.