By mid-2025, every major LLM provider had released a model that could reliably produce structured tool calls, navigate multi-step reasoning chains, and handle 100K+ token contexts. Claude Sonnet 4.6, GPT-5, and Gemini 2.5 Pro all meet the capability bar that serious agentic workflows require. The underlying model capability that agent systems need is, by most reasonable measures, here. And yet most organizations that have tried to build production agent systems have struggled — not with model capability, but with everything else.
This post is about what "everything else" actually is. Not a product pitch — a diagnosis. After spending three years building Diaflow and talking with engineering teams across fintech, logistics, and devops, we've converged on three gaps that consistently explain why agent projects fail in production even when the model capability is sufficient.
The Demo Problem: Why Working Prototypes Don't Scale
The first gap is the demo-to-production transition. It's real, it's wide, and it's not primarily a model quality problem.
A notebook-based agent demo is built to show one thing: that the model can complete the target task. It's evaluated on a small, curated set of inputs. The tool schemas are written to match the prompts the developer tests with. The context fits in one pass. Error handling doesn't exist because it's never triggered.
Production means 1,000 different users and 1,000 variations of input phrasing. It means API rate limits at 2am when no engineer is watching. It means context windows that fill over 90-minute sessions. It means the 0.5% of inputs that are malformed, adversarial, or outside the distribution the tool schemas were designed for. And it means needing to answer, after something goes wrong, exactly what happened and why.
None of these are model problems. They're infrastructure problems. The missing layer between a working agent prototype and a reliable production deployment is operational infrastructure: retry and backoff, context budget management, tool input validation, trace collection, anomaly detection, human escalation paths. This infrastructure doesn't come with any LLM API. It has to be built or sourced separately — and most teams underestimate the scope of it until they've already shipped something that's quietly failing.
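To make one item on that list concrete, here is a minimal sketch of retry with exponential backoff and full jitter. It is an illustration under stated assumptions, not Diaflow's implementation: `TransientError` and `call_with_backoff` are hypothetical names, and the delay parameters are arbitrary defaults.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example HTTP 429 or 503)."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff and full jitter.

    The sleep function is injectable so the behaviour can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: pick a random delay between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
```

Anything that is not a `TransientError` propagates immediately, which is the point: a retry wrapper should only absorb failures that a second attempt can plausibly fix.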
The Integration Problem: Agents Need to Touch Real Systems
The second gap is the integration surface. An agent that reasons well but can call only one or two tools can act on only the small part of your environment that those tools reach. Expanding that surface (connecting to Jira, Slack, your internal database, your customer support platform, your data warehouse) requires tool implementations that hold up under adversarial conditions, not just demo conditions.
Tool integration development is unglamorous and time-consuming. For each external system you want to connect, you need to handle: authentication (and credential rotation), rate limits and throttling, schema changes in the upstream API, pagination, error code taxonomy, and testing under realistic failure conditions. In our experience, this is roughly 2-4 weeks of engineering work per integration when done properly. Teams building custom agent stacks often end up spending the majority of their engineering capacity on integrations rather than on the agent logic they actually care about.
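As an illustration of what "error code taxonomy" means in practice, the sketch below maps HTTP status codes onto the handful of outcomes an agent runtime can act on. The category names and the specific mapping are assumptions made for this example; real upstream APIs deviate from it, which is a large part of why each integration takes time.

```python
from enum import Enum


class ErrorClass(Enum):
    OK = "ok"
    RETRY = "retry"    # transient: back off and try again
    REAUTH = "reauth"  # credentials expired or rotated: refresh, then retry once
    FATAL = "fatal"    # the request itself is wrong: retrying will not help


def classify_status(status: int) -> ErrorClass:
    """Map an HTTP status code to the action the caller should take."""
    if 200 <= status < 300:
        return ErrorClass.OK
    if status == 401:
        return ErrorClass.REAUTH
    if status in (408, 429) or 500 <= status < 600:
        return ErrorClass.RETRY
    # Everything else (including unexpected redirects) is treated as fatal here.
    return ErrorClass.FATAL
```

The value of a shared taxonomy is that the retry, re-authentication, and escalation logic can be written once against `ErrorClass` instead of once per upstream API.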
The MCP (Model Context Protocol) standard from Anthropic takes a step toward standardizing the tool definition layer, which is genuinely useful. We're not saying MCP is insufficient — we're saying it solves a specific problem (tool description portability) while leaving the implementation reliability problem open. A well-defined tool schema that wraps a fragile API implementation is still fragile.
The integration gap is one reason the infrastructure layer for agents can't just be "a better LangChain" — it requires battle-tested tool implementations that are maintained against upstream API changes, not just an orchestration framework.
The Observability Problem: You Can't Debug What You Can't See
The third gap is observability, and it's the one that most engineering teams appreciate only after their first production incident.
Agent systems have a fundamentally different failure signature than deterministic services. A microservice either returns 200 or it doesn't. An agent can return 200 — with a complete, well-formatted output — that is semantically wrong. It answered the wrong question, used stale data, skipped a step in its reasoning chain, or took an action that was locally coherent but globally incorrect. Standard service monitoring (latency, error rate, throughput) doesn't catch semantic failures. You need trace visibility into the agent's reasoning, not just its execution.
The observability tools that exist for traditional distributed systems — Datadog, Grafana, OpenTelemetry — are necessary but not sufficient. You need something that understands the agent execution model: which LLM call produced which tool call, what the model was reasoning about at each step, where in the execution the run started to diverge from expected behavior. This is agent-specific observability, and it doesn't exist as a commodity layer yet.
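A minimal sketch of the data model this implies: every step records which earlier step caused it, so a suspicious tool call can be walked back to the LLM call that requested it. `Step`, `RunTrace`, and `lineage` are hypothetical names for illustration, not an existing library's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    step_id: str
    kind: str                 # "llm_call" or "tool_call" in this sketch
    parent_id: Optional[str]  # the step that caused this one, None for the root
    summary: str              # short description of what happened at this step


@dataclass
class RunTrace:
    steps: Dict[str, Step] = field(default_factory=dict)

    def record(self, step: Step) -> None:
        self.steps[step.step_id] = step

    def lineage(self, step_id: str) -> List[Step]:
        """Return the causal chain from the root of the run down to step_id."""
        chain, seen = [], set()
        current = self.steps.get(step_id)
        # The seen set guards against a malformed trace with a parent cycle.
        while current is not None and current.step_id not in seen:
            seen.add(current.step_id)
            chain.append(current)
            current = self.steps.get(current.parent_id)
        return chain[::-1]
```

With records like these, "where did the run start to diverge" becomes a query over the chain rather than a search through interleaved logs.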
A concrete example of why this matters: a data processing agent at a growing analytics team had a silent failure that ran undetected for 72 hours. The agent was producing outputs that passed all schema validation checks. The outputs were wrong because a vector store index had gone stale and the agent was retrieving incorrect reference data. The fix took 2 hours once identified. Identifying it took 3 days, because there was no instrumentation on the retrieval quality of the agent's memory reads. With agent-specific observability on the retrieval path — cosine similarity scores, retrieved document timestamps, chunk counts — this would have surfaced in the first hour.
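The retrieval-path signals named above (similarity scores, document timestamps, chunk counts) can be checked with very little code. The following is a sketch under assumptions: `RetrievedChunk` and `retrieval_warnings` are hypothetical names, and the thresholds are placeholders that would need tuning against a real index.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RetrievedChunk:
    doc_id: str
    similarity: float     # cosine similarity between query and chunk
    indexed_at: datetime  # when this chunk was last written to the index


def retrieval_warnings(chunks, now, min_chunks=3, min_similarity=0.75,
                       max_age=timedelta(hours=24)):
    """Return human-readable warnings for one retrieval call (empty list if healthy)."""
    warnings = []
    if len(chunks) < min_chunks:
        warnings.append(f"only {len(chunks)} chunks retrieved (expected at least {min_chunks})")
    if chunks:
        best = max(c.similarity for c in chunks)
        if best < min_similarity:
            warnings.append(f"best similarity {best:.2f} is below {min_similarity:.2f}")
        age = now - max(c.indexed_at for c in chunks)
        if age > max_age:
            warnings.append(f"newest retrieved chunk is stale: indexed {age} ago")
    return warnings
```

Emitting these warnings as part of each step's trace is what turns a stale index from a three-day investigation into an alert.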
What the Infrastructure Layer Needs to Look Like
We're not claiming any of these problems are unsolvable; we're claiming they're systematically underaddressed by current tooling. Our thesis at Diaflow is that the agent infrastructure layer should provide, as first-class features, the operational components that every production agent needs: orchestration primitives (graph topology, conditional branching, loop control), memory management (tiered storage, retrieval quality controls, context budgeting), a tool integration library (battle-tested implementations with error taxonomy handled), and observability (per-step semantic traces, anomaly detection, cost attribution).
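To give a sense of what one of these components does, here is a deliberately simple sketch of context budgeting: keep the system message, then keep as many of the most recent messages as fit. `trim_to_budget` is a hypothetical helper, the whitespace word count stands in for a real tokenizer, and the sketch assumes the first message in the list is the system message.

```python
def trim_to_budget(messages, budget, count_tokens=lambda text: len(text.split())):
    """Keep the first (system) message plus the newest messages that fit in budget.

    messages is a list of dicts with a "content" key, oldest first.
    count_tokens is a stand-in; a real deployment would use the model's tokenizer.
    """
    if not messages:
        return []
    system, rest = messages[0], messages[1:]
    remaining = budget - count_tokens(system["content"])
    kept = []
    # Walk backwards from the newest message and stop at the first that does not fit.
    for msg in reversed(rest):
        cost = count_tokens(msg["content"])
        if cost > remaining:
            break
        kept.append(msg)
        remaining -= cost
    return [system] + kept[::-1]
```

Dropping the oldest turns is the crudest possible policy; summarising them or moving them to a retrieval tier are the obvious refinements, and they are the reason this belongs in shared infrastructure rather than in each agent.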
This is infrastructure-layer work, not application-layer work. Just as you don't implement your own TCP stack when building a web service, you shouldn't need to re-implement retry logic, context management, and credential rotation for every agent you build. The goal is to make these capabilities available at the framework level so that teams can focus their engineering effort on the agent logic that's specific to their use case.
Where We Think This Goes
The agent infrastructure market is early and fragmented. Teams working in this space today are operating on a combination of LLM APIs, open-source orchestration libraries, custom-built integrations, and ad-hoc monitoring. The fragmentation is not because there isn't demand for consolidation — it's because the problem space is genuinely complex and the production failure modes haven't been widely documented yet.
Over the next few years, we expect the agent infrastructure layer to consolidate around a small number of platforms that provide: a high-quality orchestration primitive set with first-class loop control and state management, a maintained integration library with thorough error handling, semantic observability as a native capability (not an afterthought), and a deployment model that fits into existing engineering workflows.
The three gaps we've described — demo-to-production, integration surface, observability — are the same gaps that made microservices hard in 2015 before service meshes, container orchestration, and distributed tracing emerged as standard infrastructure. That analogy isn't perfect, but the infrastructure maturation pattern is likely to be similar. The tooling is catching up to the capability. Diaflow is our contribution to closing that gap.
Jonathan Viet Pham is Founder & CEO of Diaflow. Questions or counterarguments welcome at [email protected].