
The model isn't the problem

The model is rarely the bottleneck in production AI agents. What we learned building the harnesses behind Wincora's autonomous visa operations.

Ali Keyanjam

Co-founder

March 24, 2026


A flat editorial illustration of a small reasoning core surrounded by workflow paths, state cards, tool surfaces, and recovery loops.

Most teams that set out to build an "AI agent" arrive at the same uncomfortable realization a few months in: the model is rarely the bottleneck. It is usually capable enough. Almost everything that fails, fails in the system around it.

Long-running sessions become unstable. Context windows quietly degrade. Goals drift. Agents fixate on stale information. Tool calls detach from intent. Prompts inflate. Performance drops while token costs climb. The whole system starts to feel less like intelligence and more like a probabilistic memory leak.

That is the actual frontier of production AI right now. Not models. Harnesses.

The orchestration layer is the product. The model is just the reasoning substrate. The harder you push real workloads through it, the more obviously true that becomes.

At Wincora, we ran straight into this problem while building Nyx, our autonomous fulfillment engine, and Aero, our traveler-facing conversational system. The brutal nature of our vertical made the problem land harder than it does in typical "toy agent" workloads.

Visa processing is not a chatbot problem

Visa processing is a long-running operational cognition problem. A single case can involve:

dozens of documents
constantly evolving workflow state
embassy-specific rules
browser automation
conversational intake
customer interaction over days or weeks
partial completion states
external system dependencies
retries and failures
human intervention
changing goals
compliance constraints
country-specific branching logic

This is not a single-turn classification task that a model nails or fails. It is a process that unfolds over time, across sessions, across systems, with state mutating underneath you while the agent is working.

The naive agent architecture collapses on contact with that.

Why the obvious architecture fails

The implementation most teams reach for first looks something like:

Keep appending conversation history.
Keep appending tool outputs.
Keep injecting workflow state.
Let the model figure it out.

This works for demos. Then reality arrives.

After enough turns, the system carries enormous amounts of stale operational state. The model spends tokens rereading irrelevant information. Important details get diluted in historical noise. Old assumptions persist after the system has changed. Tool outputs balloon. Browser sessions emit excessive telemetry. The model starts anchoring to outdated context simply because that context is still in the prompt.
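
As a rough illustration of that failure mode, here is a minimal sketch of an append-only harness. The `NaiveAgent` class, the `document_check` tool name, and the log format are hypothetical and exist only for this example; they are not Wincora's code. The point is that a superseded tool result stays in the prompt next to the current one.

```python
from dataclasses import dataclass, field


@dataclass
class NaiveAgent:
    """Hypothetical append-only harness: everything ever seen stays in the prompt."""

    transcript: list[str] = field(default_factory=list)

    def record(self, entry: str) -> None:
        self.transcript.append(entry)

    def build_prompt(self) -> str:
        # Nothing is pruned or marked as superseded, so the prompt only grows.
        return "\n".join(self.transcript)


agent = NaiveAgent()
agent.record("tool:document_check passport=MISSING")  # true on day 1
agent.record("user: I just uploaded my passport scan.")
agent.record("tool:document_check passport=VALID")  # true on day 2

prompt = agent.build_prompt()
# Both the stale and the current result are present in the prompt.
print("passport=MISSING" in prompt, "passport=VALID" in prompt)  # True True
```

Nothing in the prompt tells the model that the first result is obsolete. It has to infer that from ordering, and that inference gets harder as the transcript grows.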

Illustration: a four-panel sequence in which a neat case folder gradually fills with notes, browser scraps, and loose fragments until it cracks under the load. Caption: Degradation rarely arrives as a crash. It arrives as accumulation.

The dangerous part is that degradation is gradual. You do not see sudden failure. You see hesitation. Then redundancy. Then unnecessary tool usage. Then hallucinated assumptions. Then operational inefficiency. Then, eventually, wrong decisions.

In a chatbot, a hallucination is embarrassing. In an embassy workflow, a hallucination is operational damage.

So we threw out the assumption that accumulated history should serve as the agent's memory, and rebuilt from first principles.

Workflow state is the source of truth, not conversation

The single realization that changed everything for us was this:

The conversation history is not the source of truth. The workflow state is.

That sounds subtle until you sit with the implications. Most agent systems treat the prompt as accumulated memory. We moved toward treating the prompt as a dynamically rendered operational projection of current reality.

Instead of carrying historical state forward forever, Wincora continuously regenerates context from authoritative system state. The model does not "remember" the workflow. The platform reconstructs the workflow for the model every turn.

Illustration: a person faces drifting conversation fragments on one side while a clean case board organizes passport, validation, timeline, and location state on the other. Caption: Conversation history is fluid and drifting. Workflow state is structured and authoritative. Pick the right one to anchor the model on.

In Aero, the actual case state lives in structured operational memory, not inside accumulated chat history. At inference time, the harness dynamically injects only what matters now:

current workflow phase
missing required information
document status
country-specific requirements
validation results
pending actions
user profile state
operational constraints
system-generated summaries
only the recent, relevant tool outputs

Crucially, this context is synthesized fresh every turn, not replayed.

That allows the harness to aggressively prune historical conversation while preserving operational continuity. A conversation that has lasted for days can still operate with the contextual efficiency of a relatively small prompt.
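
The idea of rendering context from state can be sketched roughly as follows. This is an illustration under assumptions, not Aero's implementation: the `CaseState` fields, the `render_context` function, and the `keep_turns` parameter are hypothetical names chosen for the example, and a real case record would carry many more of the items listed above.

```python
from dataclasses import dataclass, field


@dataclass
class CaseState:
    """Hypothetical structured case record; field names are illustrative."""

    phase: str
    missing_fields: list[str] = field(default_factory=list)
    document_status: dict[str, str] = field(default_factory=dict)
    pending_actions: list[str] = field(default_factory=list)


def render_context(state: CaseState, recent_turns: list[str], keep_turns: int = 2) -> str:
    """Build the context from current state plus a short tail of the conversation."""
    lines = [f"Workflow phase: {state.phase}"]
    if state.missing_fields:
        lines.append("Missing information: " + ", ".join(state.missing_fields))
    for doc, status in sorted(state.document_status.items()):
        lines.append(f"Document {doc}: {status}")
    if state.pending_actions:
        lines.append("Pending actions: " + "; ".join(state.pending_actions))
    # Only the last few turns are replayed; older turns are dropped entirely.
    tail = recent_turns[-keep_turns:] if keep_turns > 0 else []
    lines.extend(f"Recent: {turn}" for turn in tail)
    return "\n".join(lines)


state = CaseState(
    phase="document_collection",
    missing_fields=["travel_dates"],
    document_status={"passport": "valid", "bank_statement": "pending"},
    pending_actions=["request bank statement"],
)
history = [f"turn {i}" for i in range(500)]  # stand-in for a multi-day conversation

context = render_context(state, history)
print(context)
print(len(context.splitlines()))  # 7
```

In this sketch the rendered context is seven lines long whether the conversation has run for five turns or five hundred, because its size depends on the state, not on the history.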

What that buys us in production

Once you stop relying on the conversation as memory, several second-order effects start to compound.

Token economics get tractable. A multi-day case is not a multi-day prompt. The state that matters at any moment is a fresh slice, not the entire history.
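
A back-of-the-envelope calculation shows why. The numbers below are hypothetical (200 turns, 300 new tokens per turn, a 2,000-token rendered context), and the model is deliberately simple: an append-only prompt rereads every earlier turn, so total input tokens grow quadratically with the number of turns, while a fixed-size projection grows linearly.

```python
def cumulative_tokens_append_only(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when turn t rereads all t turns accumulated so far."""
    return tokens_per_turn * turns * (turns + 1) // 2


def cumulative_tokens_projected(turns: int, context_tokens: int) -> int:
    """Total input tokens when every turn sees a fixed-size rendered context."""
    return context_tokens * turns


turns = 200  # hypothetical length of a multi-day case
print(cumulative_tokens_append_only(turns, tokens_per_turn=300))  # 6030000
print(cumulative_tokens_projected(turns, context_tokens=2000))  # 400000
```

Under these assumed numbers the append-only design reads roughly fifteen times more input over the life of the case, and the gap widens with every additional turn.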

The system gets harder to break with bad inputs. A hostile or malformed prior turn does not poison the next inference, because the next inference is grounded in authoritative system state, not in what was said before.

Recovery becomes a first-class operation. If something goes wrong mid-case, we do not have to "reset the conversation." We project the current state into a context the model can act on, which the system already does on every turn.
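
A minimal sketch of why recovery adds no new code path, assuming the case record is persisted outside the agent process. The in-memory `store` dict stands in for a database, and `load_state`, `project`, and the record's keys are hypothetical names used only for this example.

```python
import json


def load_state(store: dict[str, str], case_id: str) -> dict:
    """Read the authoritative case record (the dict stands in for a database)."""
    return json.loads(store[case_id])


def project(state: dict) -> str:
    """Render the current state into model-facing context."""
    pending = ", ".join(state["pending"]) or "none"
    return f"Phase: {state['phase']}\nPending: {pending}"


# Hypothetical persisted record; keys and values are illustrative.
store = {"case-1": json.dumps({"phase": "form_submission", "pending": ["upload receipt"]})}

# An ordinary turn: load the state, project it, call the model.
normal_turn = project(load_state(store, "case-1"))

# After a simulated crash only the store survives. Recovery is the same two calls.
recovered = project(load_state(store, "case-1"))

print(normal_turn == recovered)  # True
```

Because the context is derived from the store on every turn, a restarted worker produces the same context as one that never failed, with no transcript to replay.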

Perhaps the most important consequence: the agent can be wrong without the system getting worse over time. Errors are bounded by the turn, not absorbed into the running history.

The harness is the product

Once you internalize that the prompt is a projection, not a record, the model itself becomes one of the smaller engineering decisions. You still pick a good one. But what determines whether the system holds up under real workloads, real failure modes, and real time is everything around it.

In the next post, we walk through the concrete patterns we ended up with: treating the browser as state instead of pixels, why goals must be made explicit every turn, semantic tools versus infrastructure primitives, stratifying memory by lifetime, and why the whole thing eventually starts to look more like distributed systems engineering than prompt engineering.

The model reasons. The harness operationalizes reasoning. In production, the second half is what matters.

Tags #ai-agents #architecture #harness #production
