What It Actually Takes to Deploy Autonomous Agents in Production

Jonathan Viet Pham · 2024-06-10 · 8 min read

Your agent demos look great. The Jupyter notebook runs clean, the LLM calls return sensible outputs, and the tool integrations work exactly as you planned. Then you push to production and within 48 hours you're staring at a Slack alert: agent_runner crashed after 3,200 iterations, last tool call was 47 minutes ago, context buffer overflow. The demo was fine. The production environment was not.

This is the pattern we've seen repeatedly in early-stage agent deployments: the gap between a working prototype and a reliable production system is not a feature gap — it's an operational gap. Most frameworks get you to demo quality quickly. Very few give you the infrastructure to sustain that quality when real users hit your system at 3am with edge-case inputs.

Here's what actually needs to happen before you call an autonomous agent "production-ready."

The Four Failure Modes Nobody Mentions in the README

Context window overflow. A Claude Sonnet 4.6 context window is 200K tokens — generous, but not infinite. A multi-turn customer support agent that doesn't prune its history will eventually hit the ceiling. When it does, it doesn't gracefully degrade: it throws a hard error mid-conversation. We've seen this happen on ticket threads that accumulated 80+ tool call results over two hours of back-and-forth. The fix requires explicit working memory management, not just hoping the model handles it.
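As a rough illustration of explicit working memory management, here is a minimal sketch that keeps system messages and drops the oldest turns once an estimated token budget is exceeded. The Message type, estimate_tokens, and trim_history are hypothetical names, and the four-characters-per-token estimate is a crude assumption; a real deployment would use the provider's tokenizer and would probably summarize old turns rather than drop them.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Swap in a real tokenizer
    # before relying on this for budgeting.
    return max(1, len(text) // 4)


def trim_history(messages: list[Message], budget: int) -> list[Message]:
    """Keep system messages plus the newest messages that fit in `budget` tokens."""
    system = [m for m in messages if m.role == "system"]
    rest = [m for m in messages if m.role != "system"]
    used = sum(estimate_tokens(m.content) for m in system)

    kept: list[Message] = []
    # Walk newest-to-oldest so the most recent turns survive.
    for message in reversed(rest):
        cost = estimate_tokens(message.content)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost

    kept.reverse()  # restore chronological order
    return system + kept
```

Dropping whole messages can orphan a tool result from the call that produced it, so a production version would likely trim in call/result pairs.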

Tool call format drift. LLMs are not perfectly consistent in how they format tool calls, especially when the context is long or the instruction prompt is complex. The JSON schema your tool expects might not match what the model outputs under pressure. We've observed format drift rates of around 3-8% on complex multi-tool prompts when context exceeds 50K tokens. At 10,000 agent runs per month, even the low end of that range works out to roughly 300 malformed tool calls per month, and every one of them is a silent failure if you're not validating tool inputs before executing them.
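To make "validate before executing" concrete, here is a small library-free sketch. The schema format, the SEARCH_TICKETS_SCHEMA constant, and validate_tool_args are all hypothetical and cover only a few rule types; a real system would more likely use a full JSON Schema validator.

```python
import json
from typing import Optional

# Hypothetical, simplified rule format (not JSON Schema).
SEARCH_TICKETS_SCHEMA = {
    "required": ["query"],
    "properties": {
        "query": {"type": str, "max_length": 200},
        "status": {"type": str, "enum": ["open", "closed", "pending"]},
    },
}


def validate_tool_args(raw: str, schema: dict) -> tuple[Optional[dict], list[str]]:
    """Parse model-produced JSON; return (args, []) or (None, errors)."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return None, ["arguments must be a JSON object"]

    errors: list[str] = []
    for name in schema["required"]:
        if name not in args:
            errors.append(f"missing required field: {name}")

    for name, value in args.items():
        rule = schema["properties"].get(name)
        if rule is None:
            errors.append(f"unexpected field: {name}")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "max_length" in rule and len(value) > rule["max_length"]:
            errors.append(f"{name}: longer than {rule['max_length']} characters")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: {value!r} is not an allowed value")

    return (None, errors) if errors else (args, [])
```

Returning the error list instead of raising lets the caller feed the errors back to the model as a correction prompt, or count them as a drift metric.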

Retry storms. An agent that retries on failure without backoff will hammer your downstream APIs. A Jira-connected agent that hits a rate limit and retries immediately, repeatedly, will get your API key suspended. Worse, an agent orchestrating other agents will amplify the storm — one failed root node causes all child nodes to retry simultaneously.
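One common way to avoid this is exponential backoff with jitter. The sketch below is illustrative only: RateLimitError and call_with_backoff are hypothetical names standing in for whatever your downstream client raises, and the sleep and rng parameters exist mainly so the behavior can be tested without waiting.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised by a downstream client when rate limited."""


def call_with_backoff(fn, max_retries=3, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on RateLimitError with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure
            # Delay ceiling doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random fraction of the ceiling so sibling
            # agents that failed together do not retry in lockstep.
            sleep(rng() * ceiling)
```

The jitter is what matters for the orchestration case: without it, every child agent that failed at the same moment retries at the same moment too.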

Semantic drift in long sessions. This one is subtle. An agent running a 45-minute research synthesis task will gradually "forget" its original framing as the context fills with intermediate results. The final output may be technically valid but drift significantly from the original intent. This is not a hallucination in the classic sense — it's a goal coherence problem that doesn't show up in unit tests.

What Production Hardening Actually Looks Like

Before shipping any agentic system to production, you need three things working simultaneously: a retry/backoff strategy, an explicit context budget, and tool input validation. None of these are optional.

Here's a minimal production-grade agent setup using Diaflow's SDK that addresses all three:

from diaflow import Agent, tool, RetryConfig, ContextBudget
from diaflow.validators import JSONSchemaValidator

# Define tool with explicit schema validation
@tool(
    name="search_tickets",
    description="Search support tickets by keyword and status",
    validator=JSONSchemaValidator({
        "type": "object",
        "properties": {
            "query": {"type": "string", "maxLength": 200},
            "status": {"enum": ["open", "closed", "pending"]}
        },
        "required": ["query"]
    })
)
def search_tickets(query: str, status: str = "open") -> list[dict]:
    # Tool implementation
    ...

# Configure production-grade agent.
# TRIAGE_SYSTEM_PROMPT, user_input, and ticket_id are assumed to be defined elsewhere.
agent = Agent(
    name="support-triage-v2",
    model="claude-sonnet-4-6",           # model pinned explicitly
    system_prompt=TRIAGE_SYSTEM_PROMPT,
    tools=[search_tickets],
    retry_config=RetryConfig(
        max_retries=3,
        backoff_factor=2.0,              # exponential: 1s, 2s, 4s
        retry_on=["tool_error", "format_error"],
        fail_hard_on=["context_overflow", "auth_error"]
    ),
    context_budget=ContextBudget(
        max_tokens=120_000,              # 60% of 200K — leaves headroom
        on_overflow="summarize_and_trim" # not "crash"
    )
)

result = agent.run(user_input, session_id=ticket_id)

# Note: code examples are illustrative —
# actual SDK usage requires a Diaflow account.

The validator on the tool definition catches format drift before the tool executes. The ContextBudget with summarize_and_trim prevents overflow crashes by compressing older context segments. The RetryConfig separates retryable errors from hard failures: there's no point retrying an auth error three times, because it will fail the same way every time.

The Observability Problem Is Different From Your Microservices Problem

Every engineer who has run microservices knows how to instrument a service: latency histograms, error rates, distributed traces. Agent observability requires all of that plus something your APM tool can't give you: semantic trace visibility.

You need to see not just "which tool was called and how long it took" but "why did the agent decide to call that tool, what did the model return, and was that reasoning correct?" A p99 latency spike means something different when it's caused by the model generating an unnecessary tool call chain versus when it's caused by an upstream API timeout.

This distinction matters operationally. We've seen teams spend four hours debugging what they thought was a Jira API slowdown, only to discover the agent was calling search_tickets six times per request because of an ambiguous tool description that caused the model to over-query. The latency was in the model, not the API. Standard distributed traces would never surface this.
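A cheap first step toward that visibility is to record every tool call with its request ID and the model's stated rationale, then flag requests that call the same tool suspiciously often. The TraceEvent shape and find_over_querying below are a hypothetical sketch, not the format of any particular tracing product.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TraceEvent:
    request_id: str
    kind: str                 # "model_turn" or "tool_call"
    tool: str = ""            # tool name when kind == "tool_call"
    rationale: str = ""       # model's stated reason for the call, if captured
    duration_ms: float = 0.0


def find_over_querying(events, max_calls=2):
    """Return {(request_id, tool): count} for tools called more than max_calls times."""
    counts = Counter(
        (event.request_id, event.tool)
        for event in events
        if event.kind == "tool_call"
    )
    return {key: n for key, n in counts.items() if n > max_calls}
```

Run against the incident above, a check like this would report search_tickets at six calls for a single request, which points at the tool description rather than the Jira API.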

When to Pin Your Model and When Not To

Model versioning in agentic systems is not optional. We're not saying you must pin every deployment to a static model hash — we're saying you must have a policy for what happens when a model is updated mid-deployment.

Claude Sonnet 4.6 and GPT-5 both offer stable model aliases that don't auto-update. For production agents, always use a pinned alias or version. We've seen behavior changes between minor model versions that pass unit tests but shift tool-calling patterns enough to cause 15-20% increases in step count on complex tasks. At scale, that's a meaningful cost and latency impact.

For agents running on Gemini 2.5 Pro, be especially careful with function-calling schema strictness: the model's tolerance for schema ambiguity is lower than that of Claude-family models, which means more format errors if your tool definitions have optional fields without defaults.
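If that matches what you see, a simple lint over your tool definitions can catch the risky fields before deployment. This sketch assumes tool parameters are described with JSON Schema-style "properties" and "required" keys; optional_fields_without_defaults is a hypothetical helper name.

```python
def optional_fields_without_defaults(schema: dict) -> list[str]:
    """List properties that are neither required nor given a default."""
    required = set(schema.get("required", []))
    return sorted(
        name
        for name, spec in schema.get("properties", {}).items()
        if name not in required and "default" not in spec
    )


schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "maxLength": 200},
        "status": {"enum": ["open", "closed", "pending"]},
    },
    "required": ["query"],
}
print(optional_fields_without_defaults(schema))  # ['status']
```

Note that in JSON Schema, "default" is an annotation and validators do not apply it for you, so the tool implementation still needs to supply the fallback value itself.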

A Realistic Pre-Production Checklist

Before any agent goes live, we walk through this list internally. It's not comprehensive, but missing any of these has caused an incident for us or for teams we've spoken with:

Tool schemas validated with input fuzzing (malformed JSON, missing required fields, out-of-range values)
Context budget set and tested to overflow boundary — agent behavior at 80%, 95%, 100% context fill
Retry config explicitly set: what retries, what escalates, what fails hard
Maximum step count set (a hard ceiling, not just a soft limit)
Session timeout defined — what happens when a run exceeds N minutes
Output schema validation — agent output conforms to expected structure before downstream systems consume it
Human-in-the-loop trigger defined — what conditions cause the agent to escalate vs. continue autonomously
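Two of these items, the hard step ceiling and the session timeout, fit in a few lines. The sketch below assumes an agent loop shaped as a step function that returns (done, result); run_with_limits and both exception names are hypothetical.

```python
import time


class StepLimitExceeded(Exception):
    """Raised when the run hits the hard step ceiling without finishing."""


class SessionTimeout(Exception):
    """Raised when the run exceeds its wall-clock budget."""


def run_with_limits(step_fn, max_steps=25, timeout_s=300.0, clock=time.monotonic):
    """Drive step_fn(step_index) -> (done, result) under a step cap and a timeout."""
    start = clock()
    for step in range(max_steps):
        # The timeout is checked between steps, so a single hung step is not
        # interrupted; put per-call timeouts on the model and tool clients too.
        if clock() - start > timeout_s:
            raise SessionTimeout(f"run exceeded {timeout_s}s after {step} steps")
        done, result = step_fn(step)
        if done:
            return result
    raise StepLimitExceeded(f"no result after {max_steps} steps")
```

Raising distinct exceptions for the two limits gives the caller a natural place to hang the human-in-the-loop trigger: either one can route the session to a person instead of retrying.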

The Honest Version: Autonomous Doesn't Mean Unattended

The word "autonomous" creates unrealistic expectations. A well-built autonomous agent can handle a large fraction of its intended workload without human intervention — but it will also surface cases it genuinely can't handle, and it needs good tooling to do that gracefully rather than silently fail or spin in a loop.

The teams that run agents most effectively treat them like junior engineers, not magic boxes. They write clear tool descriptions (the equivalent of writing clear function documentation), they define explicit escalation paths (the equivalent of telling a new hire when to stop and ask for help), and they instrument everything (the equivalent of writing good logging).

Production deployment isn't a moment in time — it's an ongoing operational discipline. The gap between demo and production is real, but it's not mysterious. It's the same gap that exists between a script that works on your laptop and a service that works at 3am under unexpected load. The tools to close that gap exist. You just have to use them.

The code examples in this post are illustrative of Diaflow SDK patterns. Actual implementation requires a Diaflow account and may differ from preview API shapes. See our documentation for current SDK reference.
