
Agent Memory Beyond Context Windows: Retrieval That Doesn't Break

Jonathan Viet Pham · 2024-09-30 · 9 min read

A financial research agent at a growing asset management firm was running beautifully in staging. 200K-token context window, Gemini 2.5 Pro, full document ingestion in a single pass. Then they started running it on real workflows: 8-hour sessions with 40+ documents, follow-up questions referencing analysis from two hours prior. Context window full after session hour three. Everything the agent "knew" from the first two hours was gone. The user asking "what was that risk factor you mentioned from the Kempfield Capital filing?" got a hallucinated answer because the filing itself had been evicted from context.

Context windows are not memory. This distinction matters more than any other concept in production agent design, and it's consistently underestimated by teams building their first multi-turn agents.

Context Window vs. Memory: The Core Distinction

A context window is the set of tokens the model can attend to right now. It's fast, lossless within its limits, and requires no retrieval. But it's bounded, it's expensive to fill, and when you hit the limit, something has to go: in most agent setups older content is truncated or evicted, silently unless you're checking.

Memory is persistent, queryable storage that exists outside the model's context. It requires retrieval, which introduces latency and retrieval error. But it's unbounded and survives session boundaries. The challenge in agent design is deciding what to put in context and what to put in memory — and building retrieval that actually works reliably.

The naive approach is to dump everything into context and hope the model handles it. This works for simple tasks with limited information. It fails for multi-turn agents handling long sessions, multi-document tasks, or workflows that span multiple days.

The Three Memory Tiers (and What Actually Belongs in Each)

Working memory (in-context). The current conversation turn, the active task description, recent tool call results, and any high-priority facts the agent needs for its immediate next action. This should be aggressively pruned. A common mistake is letting tool call results accumulate in context indefinitely — a search tool that returns 3,000-token results, called 10 times, consumes 30K tokens of context that could be holding more relevant information.
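One way to keep tool results from piling up is to keep only the newest few verbatim and shrink the rest to a short stub. The sketch below is a minimal illustration, not a Diaflow API: the `Message` type, the `prune_tool_results` helper, and the word-count token estimate are all assumptions made for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def prune_tool_results(messages, keep_recent=2, stub_words=20):
    """Keep the newest `keep_recent` tool results whole; shrink older ones to a stub."""
    tool_positions = [i for i, m in enumerate(messages) if m.role == "tool"]
    keep = set(tool_positions[-keep_recent:]) if keep_recent > 0 else set()
    pruned = []
    for i, m in enumerate(messages):
        if m.role == "tool" and i not in keep:
            words = m.content.split()
            if len(words) > stub_words:
                # Hypothetical policy: keep the first few words as a reminder.
                m = replace(m, content=" ".join(words[:stub_words]) + " [truncated]")
        pruned.append(m)
    return pruned
```

In practice the stub would more likely be a model-written summary than a raw prefix, but the shape is the same: older results shrink, and the most recent ones stay intact.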

Episodic memory (session store). What happened earlier in this session that might be relevant again. Previous tool call summaries, user preferences stated earlier, partial task results. This is typically stored in a lightweight key-value store keyed by session ID, with explicit retrieval when relevant. The key design decision: don't retrieve all episodic memory at every turn — retrieve it when the agent decides it's relevant. That decision can itself be a lightweight LLM call or a semantic similarity check.
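A rough sketch of that retrieve-only-when-relevant pattern follows. It uses an in-memory dict in place of a real key-value store and word overlap in place of a semantic similarity check. `EpisodicStore`, `recall`, and the `min_overlap` cutoff are hypothetical names and values chosen for illustration.

```python
import re
from collections import defaultdict


def _words(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class EpisodicStore:
    """In-memory stand-in for a session-keyed key-value store."""

    def __init__(self, max_episodes: int = 20):
        self.max_episodes = max_episodes
        self._episodes = defaultdict(list)

    def add(self, session_id: str, summary: str) -> None:
        episodes = self._episodes[session_id]
        episodes.append(summary)
        overflow = len(episodes) - self.max_episodes
        if overflow > 0:
            del episodes[:overflow]  # drop the oldest entries past the cap

    def recall(self, session_id: str, query: str, min_overlap: float = 0.2, limit: int = 3):
        """Return only the episodes whose word overlap with the query clears the gate."""
        q = _words(query)
        scored = []
        for summary in self._episodes.get(session_id, []):
            s = _words(summary)
            union = q | s
            score = len(q & s) / len(union) if union else 0.0  # Jaccard overlap
            if score >= min_overlap:
                scored.append((score, summary))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [summary for _, summary in scored[:limit]]
```

The point of the gate is that an unrelated turn gets an empty list back, so nothing from the session store is added to context unless it clears the bar.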

Semantic memory (vector store). Long-term knowledge that transcends any single session: product documentation, domain knowledge, historical cases, customer data. This is where RAG (Retrieval-Augmented Generation) lives. Vector stores like Qdrant, Weaviate, or pgvector sit here, and retrieval quality is the dominant factor in how well this memory tier works.

Why Vector Retrieval Compounds Errors Across Multi-Turn Agents

Single-turn RAG is well-understood: embed a query, retrieve top-k chunks, insert into context, generate response. Multi-turn agent RAG is more complex and has failure modes that single-turn systems don't expose.

The problem: each agent step may make a retrieval call, and retrieval errors compound. If step 1 has an 80% chance of retrieving relevant context and step 2 has the same 80% chance, the probability that both retrievals are relevant isn't 80%. Assuming the steps are independent, it's 64% (0.8 × 0.8). At 5 retrieval steps, you're at about 33% (0.8^5 ≈ 0.328). This matters because agents don't just read retrieved chunks: they reason about them and carry that reasoning forward.
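The arithmetic is easy to check. The snippet below assumes every step has the same per-step relevance and that steps are independent, which real pipelines only approximate. The function names are made up for this example.

```python
def joint_relevance(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent retrievals is relevant."""
    return per_step ** steps


def steps_until_below(per_step: float, floor: float) -> int:
    """Smallest number of steps at which joint relevance drops below `floor`."""
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0.0 < floor <= 1.0:
        raise ValueError("floor must be in (0, 1]")
    steps, p = 0, 1.0
    while p >= floor:
        steps += 1
        p *= per_step
    return steps


for n in (1, 2, 5):
    print(f"{n} step(s): {joint_relevance(0.8, n):.0%}")
# 1 step(s): 80%
# 2 step(s): 64%
# 5 step(s): 33%
```

Under the same assumptions, `steps_until_below(0.8, 0.5)` returns 4: by the fourth retrieval, it is more likely than not that at least one step pulled in something irrelevant.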

Concretely: an agent researching a regulatory change might retrieve "APAC data residency requirements (2023)" at step 1, then at step 3 retrieve "data localization policies (2024)." If the 2023 chunk contradicts the 2024 chunk (regulations changed), the agent may not resolve the contradiction correctly — it may confabulate a synthesis that's wrong. This looks like a hallucination but is actually a retrieval quality problem.
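One partial mitigation is to surface the conflict before the model sees it. The sketch below assumes each chunk carries a `topic` tag and an effective `year` assigned at ingestion. Both fields are hypothetical metadata, and many corpora won't have them. It separates the newest chunk per topic from older ones so the older ones can be dropped or explicitly labeled as superseded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    topic: str  # hypothetical metadata tag assigned at ingestion
    year: int   # hypothetical effective year of the source


def split_current_and_superseded(chunks):
    """Return (current, superseded): chunks from the newest year per topic, and the rest."""
    newest_year = {}
    for c in chunks:
        newest_year[c.topic] = max(c.year, newest_year.get(c.topic, c.year))
    current = [c for c in chunks if c.year == newest_year[c.topic]]
    superseded = [c for c in chunks if c.year < newest_year[c.topic]]
    return current, superseded
```

This only catches contradictions that metadata can express. Two same-year chunks that disagree will both come back as current, and the agent still has to reason about them.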

Practical Vector Store Choices and Their Trade-Offs

The choice of vector store is not academic. Here's the honest trade-off picture for the three stores we see most commonly in production agent systems:

pgvector (PostgreSQL extension). The right choice if you already have PostgreSQL infrastructure and your vector search requirements are moderate (sub-million vectors, single-digit millisecond query latency acceptable). Zero additional infrastructure, ACID transactions, SQL joins work naturally alongside vector search. The limitation: approximate nearest neighbor (ANN) performance degrades faster than dedicated vector databases as dataset size grows past a few million vectors.

Qdrant. Purpose-built vector database with strong filtering support (filter by metadata while doing vector search). Rust-based, genuinely fast at query time — p99 latencies under 10ms at 10M+ vectors are realistic with proper hardware. The limitation: it's another service to operate and the query API is less expressive than SQL for complex metadata filtering.

Weaviate. GraphQL-native, strong schema support, good multimodal support if your agents process non-text data. More opinionated than Qdrant about data modeling. Good for teams that want a higher-level abstraction over vector search; worse for teams that want fine-grained control.

For most early-stage agent teams: start with pgvector. The operational overhead reduction is worth more than the performance advantages of a dedicated vector store at your current scale. Switch when you have evidence of a bottleneck, not before.

Building Retrieval That Doesn't Break in Production

from diaflow import Agent, SemanticMemory, EpisodicMemory
from diaflow.memory import MemoryConfig

# Configure tiered memory
agent = Agent(
    name="research-analyst",
    model="claude-sonnet-4-6",
    memory=MemoryConfig(
        semantic=SemanticMemory(
            backend="qdrant",
            collection="research_docs",
            top_k=5,                     # retrieve 5 chunks, not 20
            score_threshold=0.72,        # drop low-relevance chunks
            reranker="cross-encoder",    # rerank before inserting to context
        ),
        episodic=EpisodicMemory(
            backend="redis",
            ttl_hours=8,                 # session-scoped, not forever
            max_episodes=20,             # hard limit on history depth
        ),
        context_budget_for_memory=40_000  # max tokens allocated to retrieved content
    )
)

# Note: code examples are illustrative —
# actual SDK usage requires a Diaflow account.

A few things worth noting in this configuration: the score_threshold at 0.72 means chunks below that cosine similarity are not inserted into context even if they're in the top-k. This is critical — naive top-k retrieval will insert irrelevant chunks when query-chunk similarity is uniformly low, which is worse than inserting nothing. The reranker setting adds a cross-encoder reranking step that significantly improves retrieval precision at the cost of 30-80ms additional latency — well worth it for research-quality tasks.
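To make the threshold behavior concrete, here is a minimal sketch of top-k selection with a score floor. It is not how the SDK implements it: `select_chunks` is a hypothetical helper, and the scores are assumed to be cosine similarities already computed by the vector store.

```python
def select_chunks(scored_chunks, top_k=5, score_threshold=0.72):
    """Take the top_k highest-scoring chunks, then drop any below the threshold.

    `scored_chunks` is a list of (text, similarity) pairs.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [(text, score) for text, score in ranked[:top_k] if score >= score_threshold]


# When every candidate is weakly related, nothing is inserted into context.
weak = [("chunk a", 0.41), ("chunk b", 0.39), ("chunk c", 0.40)]
print(select_chunks(weak))  # []
```

Plain top-k on the same `weak` list would have returned all three chunks, which is exactly the case the threshold exists to prevent.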

The Failure Mode Nobody Catches Until Production: Context Flooding

Context flooding happens when an agent retrieves high-similarity content at multiple steps, and the retrieved chunks accumulate until they crowd out the actual task instructions. We've seen agents given a 200-word task instruction end up with 150K tokens of retrieved context alongside a few hundred tokens of task context, and then produce outputs that accurately synthesize the retrieved content but completely ignore the task framing.

We're not saying retrieval is harmful. We're saying that retrieval without a context budget enforced at every step is the single most common cause of production memory failures. The context_budget_for_memory parameter above enforces a hard limit on how much context can be allocated to retrieved content. This is a production hygiene requirement, not an optimization.
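A budget check of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions, not the SDK's implementation: `fit_to_budget` is a hypothetical helper, chunks are assumed to arrive sorted by relevance, and token counts are approximated by word counts unless a real tokenizer is passed in.

```python
def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def fit_to_budget(chunks, budget_tokens, count_tokens=estimate_tokens):
    """Admit chunks in rank order until the next one would exceed the budget.

    Stops at the first chunk that does not fit, so a lower-ranked chunk
    never takes the place of a higher-ranked one.
    Returns (kept_chunks, tokens_used).
    """
    kept, used = [], 0
    for text in chunks:
        cost = count_tokens(text)
        if used + cost > budget_tokens:
            break
        kept.append(text)
        used += cost
    return kept, used
```

Running this at every retrieval step, against the tokens still available rather than the original total, is what keeps the retrieved share of context from growing step after step.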

Agent memory design is an area where framework abstractions often hide the details that matter most. Understanding what's actually being stored, when it's being retrieved, and how much context it's consuming isn't optional for production systems. It's the difference between an agent that works in demos and one that works for real users over real time horizons.

The code examples in this post are illustrative of Diaflow SDK patterns. Actual implementation requires a Diaflow account and may differ from preview API shapes. See our documentation for current SDK reference.
