
Tool Calling in Production: What Goes Wrong and How to Prevent It

Li Ying Chen · 2025-01-20 · 10 min read

The tool call schema looked fine. The description was clear, the parameters were typed, the examples were accurate. Then in production, roughly one in twelve calls to update_ticket_status came back with a 422 Unprocessable Entity — not from the upstream API, but from Diaflow's schema validator. The model was passing "status": "In Progress" when the schema only accepted "in_progress" (snake_case, no spaces). The description said "status of the ticket" and gave examples, but didn't specify the exact enum values. Under high context load, the model was interpolating a reasonable-looking string rather than selecting from the defined enum.

Tool calling is where the abstract capability of LLMs meets the concrete requirements of real APIs, and the gap between those two things is where most production failures live. Here's a systematic treatment of what goes wrong and how to prevent it.

Schema Design: The Errors That Live in Your Tool Definitions

The single most impactful thing you can do to improve tool call reliability is write explicit schemas. That sounds obvious, but most tool definitions in the wild are written to help the model understand the tool, not to constrain exactly what the model emits under adversarial inputs.

Use enums for categorical fields, always. If a parameter can only take a fixed set of values, define them as an enum in the JSON schema. Don't write "description": "status, one of open, closed, or pending". Write "enum": ["open", "closed", "pending"] instead. Given an explicit enum, the model is far more likely to select one of the listed values than to generate a string that merely looks right, and any value outside the list fails validation instead of reaching your API.

Use additionalProperties: false on object schemas. Without this, a model that generates an extra field (say, "priority": "high" on a status-update tool that doesn't accept priority) will produce a call that your schema validator accepts but your downstream API rejects. Strict schemas surface this error at validation time, not execution time.

Make optional fields explicit. Use the schema's required array to list every parameter that must be present, and say so in the description when a field is optional. A model that can't tell whether a field is required will include or omit it depending on context, causing inconsistent behavior.

from diaflow import tool
from diaflow.schemas import ToolSchema

# Explicit, strict schema definition
@tool(
    name="update_ticket_status",
    description="Update the status of a support ticket. Use this after confirming the resolution with the user.",
    schema=ToolSchema({
        "type": "object",
        "properties": {
            "ticket_id": {
                "type": "string",
                "pattern": "^TKT-[0-9]{6}$",        # explicit format constraint
                "description": "Ticket ID in format TKT-NNNNNN"
            },
            "status": {
                "type": "string",
                "enum": ["open", "in_progress", "resolved", "closed"],  # exact enum
                "description": "New status. Must be one of the listed values exactly."
            },
            "resolution_note": {
                "type": "string",
                "maxLength": 500,
                "description": "Optional resolution summary (max 500 chars)"
            }
        },
        "required": ["ticket_id", "status"],       # optional field not in required
        "additionalProperties": False
    })
)
def update_ticket_status(ticket_id: str, status: str, resolution_note: str = "") -> dict:
    # Implementation
    ...

# Note: code examples are illustrative —
# actual SDK usage requires a Diaflow account.
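
To make "surfaced at validation time" concrete, here is a minimal sketch of what a strict validator does with the schema above. It uses only the standard library and does not reflect Diaflow's actual validator; `validate_args` is a hypothetical helper that covers just the keywords used in this post (required, additionalProperties, enum, pattern, maxLength).

```python
import re

# The same schema as above, written as a plain dict so no SDK is needed.
SCHEMA = {
    "type": "object",
    "properties": {
        "ticket_id": {"type": "string", "pattern": "^TKT-[0-9]{6}$"},
        "status": {"type": "string", "enum": ["open", "in_progress", "resolved", "closed"]},
        "resolution_note": {"type": "string", "maxLength": 500},
    },
    "required": ["ticket_id", "status"],
    "additionalProperties": False,
}


def validate_args(args: dict, schema: dict) -> list:
    """Return a list of violations; an empty list means the call is valid.

    Hypothetical helper for illustration only, not a full JSON Schema validator.
    """
    errors = []
    props = schema.get("properties", {})

    # Required fields must be present.
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")

    # With additionalProperties false, unknown fields are rejected outright.
    if schema.get("additionalProperties") is False:
        for name in args:
            if name not in props:
                errors.append(f"unexpected field: {name}")

    # Per-field checks for the fields the schema knows about.
    for name, value in args.items():
        rules = props.get(name)
        if rules is None:
            continue
        if rules.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name}: expected a string")
            continue
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: {value!r} is not one of {rules['enum']}")
        if isinstance(value, str):
            if "pattern" in rules and not re.search(rules["pattern"], value):
                errors.append(f"{name}: does not match {rules['pattern']}")
            if "maxLength" in rules and len(value) > rules["maxLength"]:
                errors.append(f"{name}: longer than {rules['maxLength']} characters")
    return errors


# The failing call from the introduction, plus an extra field the tool does not accept.
bad_call = {"ticket_id": "TKT-123456", "status": "In Progress", "priority": "high"}
problems = validate_args(bad_call, SCHEMA)
```

Both problems (the free-form status string and the stray priority field) are reported before anything reaches the downstream API, which is the point of keeping the schema strict.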

Tool Descriptions: What the Model Actually Reads

The description field in a tool definition is the model's primary signal for when to use the tool and how to use it. It's not documentation for human readers — it's an instruction to the model. Writing it poorly costs you reliability.

A tool description should answer three questions unambiguously: (1) What does this tool do, in one sentence? (2) When should you call it versus alternatives? (3) What are the side effects or constraints the model should know about?

For example, contrast these two descriptions for a database query tool:

Weak: "Queries the database and returns results."

Strong: "Query the customer database by customer ID or email. Returns up to 50 records. Use this for lookup operations only — do NOT use for bulk retrieval (use export_customers instead). Calling this with a partial email match may return multiple results; check the result count before proceeding."

The strong version is longer, but it answers the "when not to use this" question, which is what prevents the model from making unnecessary calls or making the wrong call when multiple tools are available.
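
One way to keep descriptions from drifting back toward the weak form is to build them from the three answers explicitly. The sketch below is an assumption about how a team might structure this, not a Diaflow feature; `ToolDescription` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolDescription:
    """Hypothetical container for the three questions a description should answer."""

    what: str         # (1) what the tool does, in one sentence
    when: str         # (2) when to call it versus alternatives
    constraints: str  # (3) side effects or limits the model should know about

    def render(self) -> str:
        parts = [self.what, self.when, self.constraints]
        # Refuse to produce a description that skips any of the three answers.
        if any(not part.strip() for part in parts):
            raise ValueError("a tool description needs all three parts")
        return " ".join(part.strip() for part in parts)


query_customers = ToolDescription(
    what="Query the customer database by customer ID or email. Returns up to 50 records.",
    when="Use this for lookup operations only; do NOT use for bulk retrieval (use export_customers instead).",
    constraints="A partial email match may return multiple results; check the result count before proceeding.",
)
description_text = query_customers.render()
```

The rendered string is what goes in the tool's description field. The structure only exists to make a missing "when not to use this" answer fail loudly at definition time.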

The Timeout and Rate Limit Failure Cascade

External API tools will time out. They will rate limit. They will return 500 errors. The question is not whether these failures happen — it's whether your agent handles them correctly when they do.

The failure cascade pattern looks like this: tool A times out → agent retries immediately → tool A rate limits the retry → agent errors → Diaflow's retry config kicks in with exponential backoff → by the time the retry fires, context has accumulated an error state → model's next action is influenced by the error context in a way that wasn't intended.

Three defenses that actually work:

1. Tool-level timeout configuration. Set explicit timeouts per tool, shorter than your agent's total run timeout. A database query tool should time out in 10s; a web scraping tool might allow 30s. Don't let one slow tool block the entire run indefinitely.
2. Error state isolation. Tool errors should not pollute the agent's main context chain. Store tool error results in a separate error slot that the model is explicitly instructed to check, rather than appending raw error messages to the conversation history.
3. Fallback tool definitions. For critical tools, define a fallback behavior: if the primary Jira API times out, use a simpler read-only query as a fallback. The fallback returns less data but allows the agent to continue rather than stall.
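
The three defenses can be combined in a small wrapper. This is a sketch using plain asyncio, not Diaflow's retry or timeout configuration; `call_tool`, `ToolResult`, and the two demo coroutines are hypothetical names, and the timeout is shortened so the example runs quickly.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable, Optional


@dataclass
class ToolResult:
    value: Any = None
    # Separate error slot: kept out of the conversation history on purpose.
    errors: list = field(default_factory=list)
    used_fallback: bool = False


async def call_tool(
    primary: Callable[[], Awaitable[Any]],
    timeout_s: float,
    fallback: Optional[Callable[[], Awaitable[Any]]] = None,
) -> ToolResult:
    """Run the primary tool with its own timeout; on failure, try the fallback once."""
    result = ToolResult()
    try:
        result.value = await asyncio.wait_for(primary(), timeout=timeout_s)
        return result
    except asyncio.TimeoutError:
        result.errors.append(f"primary timed out after {timeout_s}s")
    except Exception as exc:
        result.errors.append(f"primary failed: {exc!r}")

    if fallback is not None:
        try:
            result.value = await asyncio.wait_for(fallback(), timeout=timeout_s)
            result.used_fallback = True
        except asyncio.TimeoutError:
            result.errors.append(f"fallback timed out after {timeout_s}s")
        except Exception as exc:
            result.errors.append(f"fallback failed: {exc!r}")
    return result


async def slow_ticket_update() -> dict:
    # Stands in for a primary API call that hangs.
    await asyncio.sleep(1.0)
    return {"source": "primary"}


async def read_only_lookup() -> dict:
    # Stands in for the simpler read-only fallback query.
    return {"source": "fallback"}


result = asyncio.run(call_tool(slow_ticket_update, timeout_s=0.05, fallback=read_only_lookup))
```

Here the agent receives the fallback's value and can continue, while the timeout is recorded in `result.errors` for the model to inspect only when instructed to, instead of being appended raw to the context.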

Testing Tool Calls Before Production: The Fuzzing Problem

Unit testing tool implementations is straightforward. Testing that the model calls tools correctly under realistic conditions is harder — and it's where most teams under-invest.

A practical eval harness for tool-calling reliability should include at minimum: (1) golden path tests (expected tool call sequence for representative inputs), (2) adversarial input tests (malformed queries, ambiguous requests, requests that shouldn't trigger any tool), and (3) schema stress tests (inputs designed to produce schema errors — long strings, wrong types, missing required fields).

We're not saying you need a full automated eval suite before shipping a single tool — we're saying that deploying tools with only golden-path testing will produce production incidents. The adversarial cases are the ones that bite you at 2am.

A realistic pass rate target for a production-quality tool definition on a model like Claude Sonnet 4.6: 98%+ on golden path, 85%+ on adversarial inputs (some adversarial cases should legitimately return no-tool-call rather than a forced call). If you're below 90% on golden path, the schema or description needs work before you ship.
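
A harness that reports pass rates per category against these targets can be very small. The sketch below is illustrative: `stub_choose_tool` stands in for a real model call, the four cases are invented for the example, and schema stress tests are left out for brevity.

```python
from collections import defaultdict
from typing import Callable, Optional

# Targets taken from the paragraph above.
TARGETS = {"golden": 0.98, "adversarial": 0.85}

# Each case: (category, user input, expected tool name, or None for "no tool call").
CASES = [
    ("golden", "Mark TKT-000123 as resolved", "update_ticket_status"),
    ("golden", "Please close ticket TKT-000456", "update_ticket_status"),
    ("adversarial", "What's the weather like today?", None),
    ("adversarial", "ticket", None),  # too ambiguous to justify a call
]


def stub_choose_tool(user_input: str) -> Optional[str]:
    """Placeholder for the model: naively calls the tool whenever 'ticket' or 'TKT-' appears."""
    text = user_input.lower()
    if "ticket" in text or "tkt-" in text:
        return "update_ticket_status"
    return None


def pass_rates(cases, choose_tool: Callable[[str], Optional[str]]) -> dict:
    """Fraction of cases per category where the chosen tool matches the expectation."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for category, user_input, expected in cases:
        totals[category] += 1
        if choose_tool(user_input) == expected:
            passed[category] += 1
    return {category: passed[category] / totals[category] for category in totals}


def failing_categories(rates: dict, targets: dict) -> list:
    """Categories whose pass rate is below target (missing categories count as failing)."""
    return sorted(c for c, target in targets.items() if rates.get(c, 0.0) < target)


rates = pass_rates(CASES, stub_choose_tool)
below_target = failing_categories(rates, TARGETS)
```

With the naive stub, the golden path passes but the ambiguous "ticket" input triggers a forced call, so only the adversarial category falls below target. A real suite needs far more cases per category for the percentages to mean anything.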

MCP (Model Context Protocol) and What It Changes

Anthropic's Model Context Protocol standardizes how tools are defined and invoked across model providers. For teams building multi-provider agent systems — say, routing some agent types to Claude Sonnet 4.6 and others to GPT-5 based on cost or capability requirements — MCP-compliant tool definitions mean your tool schemas don't need to be adapted per provider.

In practice, the benefit is most visible in tool schema validation: an MCP tool definition carries its parameters as a JSON Schema, which maps cleanly onto each provider's native function-calling format, so you write tool definitions once and they work consistently across providers. The limitation to know: not every provider supports every JSON Schema feature with equal fidelity. Test explicitly when deploying MCP-defined tools to a new model provider.
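
As a rough illustration of "write once, map per provider", the sketch below converts one MCP-style tool definition (name, description, inputSchema) into the tool shapes used by Anthropic's Messages API and OpenAI's Chat Completions API. The two converter functions are hypothetical helpers, and the provider field names reflect the formats as commonly documented; check current provider references before relying on them.

```python
import copy

# An MCP-style tool definition: parameters live in a JSON Schema under "inputSchema".
MCP_TOOL = {
    "name": "update_ticket_status",
    "description": "Update the status of a support ticket.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "ticket_id": {"type": "string", "pattern": "^TKT-[0-9]{6}$"},
            "status": {"type": "string", "enum": ["open", "in_progress", "resolved", "closed"]},
        },
        "required": ["ticket_id", "status"],
        "additionalProperties": False,
    },
}


def to_anthropic_tool(mcp_tool: dict) -> dict:
    """Hypothetical helper: Anthropic-style tools use the key 'input_schema'."""
    return {
        "name": mcp_tool["name"],
        "description": mcp_tool["description"],
        "input_schema": copy.deepcopy(mcp_tool["inputSchema"]),
    }


def to_openai_tool(mcp_tool: dict) -> dict:
    """Hypothetical helper: OpenAI-style tools nest the schema under function.parameters."""
    return {
        "type": "function",
        "function": {
            "name": mcp_tool["name"],
            "description": mcp_tool["description"],
            "parameters": copy.deepcopy(mcp_tool["inputSchema"]),
        },
    }


anthropic_tool = to_anthropic_tool(MCP_TOOL)
openai_tool = to_openai_tool(MCP_TOOL)
```

The schema itself passes through unchanged, which is why provider differences tend to show up in how strictly each one honors keywords such as pattern or additionalProperties, not in the mapping.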

Tool calling is the highest-return place to invest reliability effort in agent systems. A well-defined tool with explicit schemas, clear descriptions, proper error handling, and tested behavior under adversarial inputs will eliminate the largest category of production failures. Most of the work is in the definition, not the implementation.

The code examples in this post are illustrative of Diaflow SDK patterns. Actual implementation requires a Diaflow account and may differ from preview API shapes. See our documentation for current SDK reference.
