
Ops Teams: Your AI Agents Need Monitoring Just Like Your Services Do

Minh Duc Tran · 2024-11-25 · 7 min read

A platform team at a growing logistics company deployed a Diaflow-based agent to automate their weekly data reconciliation pipeline. It ran cleanly for six weeks. On week seven, it started producing incorrect reconciliation reports — not crashing, not throwing errors, just silently generating wrong numbers. Nobody noticed for four days. The incident cost roughly 18 hours of engineer time to trace and correct. The root cause: a vector store query that started returning stale embeddings after a schema migration that updated document metadata but didn't re-index affected chunks.

If this had been a microservice, the team would have caught it in hours. Error rates, latency spikes, dead letter queue depth — the monitoring stack would have fired. But agents don't fail the way services fail. Their failure mode is often silent quality degradation, not hard errors. And if you're treating agent monitoring the same way you treat service monitoring, you're going to miss the failures that matter most.

The Observability Gap Between Services and Agents

Traditional service observability is built around three pillars: metrics (what is happening), logs (what happened), and traces (how it happened). This stack works well for services with deterministic behavior — a function that receives input X should always produce output Y. You can define golden signals, set alert thresholds, and trust that if everything is green, the service is working.

Agents are non-deterministic by design. The same input can produce different outputs depending on sampling randomness, context history, and tool results. This means the traditional "expected output matches actual output" check doesn't apply. Alerting on error rate alone is not monitoring; you need to instrument the reasoning, not just the execution.

What that means practically: agent observability requires a fourth pillar — semantic trace visibility. You need to see not just that a tool was called, but why the agent decided to call it. Not just that the LLM responded, but whether the response was coherent with the task goal. Not just that the run completed, but whether it completed correctly.

The Metrics That Actually Matter for Agent Workloads

Start with these five signal categories before adding anything else:

Run success rate. The percentage of agent runs that complete with a valid, parseable output — not just runs that complete without throwing an exception. An agent that finishes but returns {"error": null, "result": null} is not a successful run. Define "valid output" explicitly in your monitoring config.

Step count distribution. How many tool calls and LLM calls does each run require? If the p50 is 8 steps and the p99 is 47, the tail deserves a look: those runs are probably a stuck agent or a task that's being over-decomposed. Track this as a histogram and alert when p95 exceeds your expected ceiling.

Token cost per run. LLM token costs are your agent's compute costs. A 3x spike in average tokens per run that isn't explained by a workload change is a symptom of something wrong — a context budget misconfiguration, a prompt that's ballooning, or a retrieval system returning too much content. You need this metric to diagnose cost anomalies before they become billing surprises.

Tool error rate by tool name. Aggregated tool error rate hides which tools are failing. A 5% overall tool error rate looks manageable until you discover it's actually a 40% error rate on your Jira connector and 0% everywhere else. Track error rate per named tool.

Human escalation rate. For agents with human-in-the-loop fallback, what fraction of runs are escalating to human review? A sudden increase here is a leading indicator of model or data drift — the agent is becoming less confident before the quality numbers drop visibly.
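As a rough illustration, three of the signals above can be computed directly from a batch of run traces. This is a minimal sketch, not Diaflow's implementation: the function names are hypothetical, and the field names (status, steps, tool_name, output_validation) are assumed to follow the trace structure shown in the next section.

```python
from collections import defaultdict
from statistics import quantiles


def is_valid_run(trace: dict) -> bool:
    """A run is successful only if it finished AND its output validated."""
    validation = trace.get("output_validation") or {}
    return (
        trace.get("status") == "success"
        and validation.get("schema_valid") is True
        and validation.get("semantic_check") == "passed"
    )


def run_success_rate(traces: list) -> float:
    """Fraction of runs with valid output (assumes a non-empty batch)."""
    return sum(is_valid_run(t) for t in traces) / len(traces)


def step_count_p95(traces: list) -> float:
    """95th percentile of steps per run (assumes a non-empty batch)."""
    counts = [len(t["steps"]) for t in traces]
    if len(counts) < 2:
        return float(counts[0])
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    return quantiles(counts, n=20, method="inclusive")[-1]


def tool_error_rates(traces: list) -> dict:
    """Error rate per named tool, so one failing connector is not averaged away."""
    calls = defaultdict(int)
    errors = defaultdict(int)
    for trace in traces:
        for step in trace["steps"]:
            if step.get("type") != "tool_call":
                continue
            name = step["tool_name"]
            calls[name] += 1
            if step.get("status") != "success":
                errors[name] += 1
    return {name: errors[name] / calls[name] for name in calls}
```

Keeping the validity check in one function means the definition of "valid output" lives in a single place that the dashboard and the alert rule can share.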

Trace Structure for Agent Runs

Each agent run should produce a structured trace with the following fields, at minimum:

{
  "run_id": "run_a7b3c2d1",
  "agent_name": "data-reconciliation-v3",
  "model": "claude-sonnet-4-6",
  "started_at": "2024-11-14T09:23:01Z",
  "completed_at": "2024-11-14T09:23:47Z",
  "duration_ms": 46200,
  "status": "success",
  "total_tokens": 18420,
  "total_cost_usd": 0.0184,
  "steps": [
    {
      "step": 1,
      "type": "tool_call",
      "tool_name": "query_database",
      "duration_ms": 380,
      "tokens_in": 1240,
      "tokens_out": 0,
      "status": "success"
    },
    {
      "step": 2,
      "type": "llm_response",
      "model": "claude-sonnet-4-6",
      "duration_ms": 2100,
      "tokens_in": 4800,
      "tokens_out": 620,
      "reasoning_summary": "Analyzed query results, identified 3 discrepancies"
    }
  ],
  "output_validation": {
    "schema_valid": true,
    "semantic_check": "passed"
  }
}

The reasoning_summary field is what makes agent traces different from service traces. This is a short, auto-generated summary of what the model's response was trying to do at that step — generated by a lightweight parsing pass over the model output. It's not the full model response (too expensive to store at scale), but it's enough to give you a human-readable audit trail.
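One way to emit a trace of this shape is a small recorder that accumulates steps as the run progresses. The sketch below is illustrative only: RunTrace and summarize_reasoning are hypothetical names, and keeping the first sentence of the model output is a crude stand-in for the parsing pass, which this post does not specify.

```python
import re
import uuid
from dataclasses import dataclass, field


def summarize_reasoning(model_output: str, max_chars: int = 120) -> str:
    """Hypothetical stand-in for the parsing pass: first sentence, truncated."""
    text = " ".join(model_output.split())  # collapse newlines and repeated spaces
    first_sentence = re.split(r"(?<=[.!?])\s", text, maxsplit=1)[0]
    return first_sentence[:max_chars]


@dataclass
class RunTrace:
    agent_name: str
    model: str
    run_id: str = field(default_factory=lambda: f"run_{uuid.uuid4().hex[:8]}")
    steps: list = field(default_factory=list)

    def record_tool_call(self, tool_name: str, duration_ms: int, status: str) -> None:
        self.steps.append({
            "step": len(self.steps) + 1,
            "type": "tool_call",
            "tool_name": tool_name,
            "duration_ms": duration_ms,
            "status": status,
        })

    def record_llm_response(self, output_text: str, duration_ms: int,
                            tokens_in: int, tokens_out: int) -> None:
        self.steps.append({
            "step": len(self.steps) + 1,
            "type": "llm_response",
            "model": self.model,
            "duration_ms": duration_ms,
            "tokens_in": tokens_in,
            "tokens_out": tokens_out,
            # Store the short summary, not the full response.
            "reasoning_summary": summarize_reasoning(output_text),
        })

    def to_dict(self, status: str) -> dict:
        total = sum(s.get("tokens_in", 0) + s.get("tokens_out", 0) for s in self.steps)
        return {
            "run_id": self.run_id,
            "agent_name": self.agent_name,
            "model": self.model,
            "status": status,
            "total_tokens": total,
            "steps": self.steps,
        }
```

Timestamps, cost, and output validation are left out to keep the sketch short; they would be added in to_dict in the same way.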

SLOs for Agent Workloads

If you're running agents in production, you need explicit SLOs. This is not aspirational; it's how you know whether your monitoring is working. A metric with no threshold attached never fires an alert.

A reasonable starting set for an ops team taking over an agent workload:

Run success rate SLO: ≥ 95% of runs return valid output over any 24-hour window
P95 latency SLO: ≤ 120s for runs on standard task types (calibrate this against your actual workload)
Token cost SLO: Average cost per run ≤ 2× the baseline established during staging (alert at 1.5×, hard limit at 2×)
Tool error SLO: No individual tool exceeds 10% error rate over any 1-hour window

We're not saying these thresholds are right for every workload — we're saying you need some SLOs before you ship to production, because "no alerts defined" is not the same as "no problems."
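The starting thresholds above can be encoded as a single check that runs over each monitoring window. This is a sketch under stated assumptions: WindowStats and evaluate_slos are hypothetical names, the stats are assumed to be pre-aggregated by your metrics pipeline, and the numbers are the starting values from this post rather than recommendations for your workload.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    success_rate: float      # fraction of valid runs over the last 24 hours
    p95_latency_s: float     # p95 run duration in seconds
    avg_cost_usd: float      # mean cost per run in the window
    tool_error_rates: dict   # tool name -> error rate over the last hour


def evaluate_slos(stats: WindowStats, baseline_cost_usd: float) -> list:
    """Return the names of violated SLOs; an empty list means all green."""
    violations = []
    if stats.success_rate < 0.95:
        violations.append("run_success_rate")
    if stats.p95_latency_s > 120:
        violations.append("p95_latency")

    # Cost is judged relative to the staging baseline, not in absolute terms.
    cost_ratio = stats.avg_cost_usd / baseline_cost_usd
    if cost_ratio > 2.0:
        violations.append("token_cost_hard_limit")
    elif cost_ratio >= 1.5:
        violations.append("token_cost_warning")

    # Checked per tool so one failing connector is not hidden by the average.
    for tool, rate in sorted(stats.tool_error_rates.items()):
        if rate > 0.10:
            violations.append(f"tool_error:{tool}")
    return violations
```

For example, a window with a healthy success rate but a 40% error rate on one connector would return only that tool's violation, which is the signal an aggregate error rate would have hidden.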

What APM Tools Won't Give You

Datadog, New Relic, and OpenTelemetry can instrument your agent's hosting infrastructure and give you service-level metrics. Out of the box, they do not give you semantic trace visibility. They can't tell you whether the agent's reasoning at step 4 was coherent with its task goal, or whether a retrieved document chunk was relevant to the current query, or whether the model's tool selection at step 7 was appropriate.

This isn't a knock on APM tooling — it's the right tool for service observability. For agent-specific observability, you need something built to understand the agent's internal state, not just its execution wrapper. The distinction matters when you're debugging a silent quality failure like the one in the opening example — standard APM would show a healthy run completing in 46 seconds. Agent-specific observability shows you that at step 3, the semantic memory retrieval returned documents from a stale index with cosine similarity 0.58 instead of the expected 0.79.
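A retrieval-quality check of the kind described here can be quite small. The sketch below assumes you can log the similarity scores of the retrieved chunks at each retrieval step and that you recorded a baseline during staging. The function names and the 0.1 tolerance are illustrative choices, not part of any particular SDK.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieval_drifted(similarities: list, baseline: float, tolerance: float = 0.1) -> bool:
    """Flag a retrieval step whose mean similarity falls well below the baseline.

    An empty result set is treated as drift, since the agent got no context at all.
    """
    if not similarities:
        return True
    mean = sum(similarities) / len(similarities)
    return mean < baseline - tolerance
```

With a baseline of 0.79, scores clustered around 0.58 would be flagged, while the run itself would still look healthy to infrastructure-level monitoring.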

If you're an ops team inheriting an AI agent workload, the honest framing is: treat it like a service, but budget time for the additional instrumentation layer that services don't need. The operational discipline is the same — SLOs, alerts, runbooks, on-call rotation. The tooling layer is different. Get both right before you call it production-grade.

The JSON structure in this post reflects Diaflow's trace schema. Actual trace format and fields may vary with SDK version. See the Observability docs for current reference.
