Context Management & Reliability
Context Management & Reliability covers how an architect keeps Claude-based agents accurate and trustworthy as interactions grow long, span many tools, or fan out across multiple agents. The domain blends Anthropic's context-engineering guidance (context rot, the "lost in the middle" effect, compaction, structured note-taking, sub-agent isolation) with reliability patterns for escalation, structured error propagation, human review, confidence calibration, and provenance preservation in multi-source synthesis. The recurring theme: deliberately curate what enters the context window and how findings/errors/sources are structured, rather than dumping raw history, suppressing failures, or trusting aggregate metrics.
- 01Watch for the anti-pattern phrasing in scenario stems. The exam often describes a tempting shortcut (progressive summarization, sentiment-based escalation, silently returning empty results, trusting a 97% aggregate) and asks what is wrong or what to do instead. The correct answer is almost always: preserve exact facts/sources/errors in structured form, or stratify by segment.
- 02Distinguish a real escalation trigger from a fake one. Triggers are: explicit human request, policy gap/ambiguity, and inability to make meaningful progress. NOT triggers: negative sentiment alone, the model's self-reported confidence, or 'this case feels complex.' Honor an explicit human request immediately without first investigating.
- 03Know the difference between an access failure (timeout -> retry decision) and a valid empty result (query succeeded, zero matches). Collapsing both into a generic 'search unavailable' status is the classic error-propagation trap.
- 04Remember the two compaction risks: condensing exact values (amounts, dates, order IDs, customer-stated expectations) into vague prose, and losing claim-to-source mappings. Keep transactional facts and citations in a separate structured layer outside summarized history.
- 05For multi-agent designs, upstream/subagents should return structured, condensed output (key facts, citations, dates, relevance/coverage annotations) - not verbose content or reasoning chains - because downstream agents have limited context budgets. A subagent runs in its own context window and returns only a small summary (~1-2K tokens).
- 06Place key-findings summaries at the START of long aggregated inputs and use explicit section headers to counter the lost-in-the-middle effect. Models reliably read the beginning and end, not the middle.
- 07For human review: route LOW-confidence and contradictory/ambiguous-source items to humans, calibrate field-level confidence thresholds on a labeled validation set, and use stratified random sampling of HIGH-confidence outputs to catch novel error patterns. Validate accuracy per document-type and per-field before reducing review.
Manage conversation context to preserve critical information across long interactions
Official objective — Knowledge & Skills
- Progressive summarization risks: condensing numerical values, percentages, dates, and customer-stated expectations into vague summaries
- The "lost in the middle" effect: models reliably process information at the beginning and end of long inputs but may omit findings from middle sections
- How tool results accumulate in context and consume tokens disproportionately to their relevance (e.g., 40+ fields per order lookup when only 5 are relevant)
- The importance of passing complete conversation history in subsequent API requests to maintain conversational coherence
- Extracting transactional facts (amounts, dates, order numbers, statuses) into a persistent "case facts" block included in each prompt, outside summarized history
- Extracting and persisting structured issue data (order IDs, amounts, statuses) into a separate context layer for multi-issue sessions
- Trimming verbose tool outputs to only relevant fields before they accumulate in context (e.g., keeping only return-relevant fields from order lookups)
- Placing key findings summaries at the beginning of aggregated inputs and organizing detailed results with explicit section headers to mitigate position effects
- Requiring subagents to include metadata (dates, source locations, methodological context) in structured outputs to support accurate downstream synthesis
- Modifying upstream agents to return structured data (key facts, citations, relevance scores) instead of verbose content and reasoning chains when downstream agents have limited context budgets
Long interactions degrade in predictable ways, and an architect's job is to curate what stays in the context window. Anthropic calls the core failure mode context rot: as token count grows, the model's ability to accurately recall any single fact drops, because transformer attention is spread across n² pairwise relationships. A related effect is lost in the middle - models reliably attend to the beginning and end of a long input but may omit findings buried in the middle.
The most common self-inflicted damage is progressive summarization that compresses exact values into vague prose. Numbers, percentages, dates, order numbers, statuses, and customer-stated expectations must survive verbatim. The fix is a separate, non-summarized layer: extract transactional facts into a persistent "case facts" block (and, for multi-issue sessions, structured per-issue records with order IDs, amounts, statuses) that is re-injected into every prompt, untouched by summarization of the chat history.
Tool results are the other context hog. A single order lookup may return 40+ fields when only ~5 are relevant; these accumulate and consume tokens disproportionate to their value. Trim verbose tool outputs to only the relevant fields before they enter context. Anthropic notes that clearing stale tool results is one of the safest, lightest forms of compaction.
To fight position effects, put a key-findings summary at the top of any aggregated input and organize details under explicit section headers so nothing important hides in the middle.
In multi-agent systems, downstream agents have limited context budgets, so upstream/subagents should return structured data - key facts, citations, relevance scores, and required metadata (dates, source locations, methodological context) - rather than verbose content and reasoning chains. This supports accurate downstream synthesis.
Finally, the API itself is stateless: you must pass the complete conversation history (or a faithful compacted equivalent plus the preserved facts block) in each request to maintain coherence. The Agent SDK and Claude Code manage this for you, but the principle stands: coherence comes from what you re-supply, not from the model remembering.
{
"case_facts": {
"order_id": "A-10293",
"order_date": "2026-03-14",
"amount_usd": 248.50,
"status": "delivered",
"customer_expectation": "full refund by 2026-06-10"
}
}Summary so far: Customer is unhappy about a recent order and wants
some money back soon. We looked it up and it seems mostly fine.KEY FINDINGS (read first):
- Refund owed: $248.50, deadline 2026-06-10
## Order lookup (trimmed)
order_id, status, amount, return_window_days
## Policy excerpts
...| Trap | Why it fails | Correct pattern |
|---|---|---|
| Relying on progressive summarization to shrink long conversations. | Summaries condense numerical values, dates, and customer-stated expectations into vague language, destroying the exact facts later steps depend on. | Extract transactional facts (amounts, dates, order numbers, statuses) into a persistent case-facts block included in every prompt, outside the summarized history. |
| Letting full tool results accumulate in context. | Verbose outputs (40+ fields per lookup) consume tokens disproportionate to relevance and accelerate context rot, crowding out signal. | Trim tool outputs to only the relevant fields (e.g., return-relevant fields) before they enter context. |
| Dumping detailed results in one long block and assuming the model reads all of it. | The lost-in-the-middle effect means middle-section findings are unreliably recalled. | Place a key-findings summary at the beginning and organize details under explicit section headers. |
- Context rot: recall degrades as the window fills; lost-in-the-middle means middle content is most likely dropped.
- Never let summarization swallow exact amounts, dates, order IDs, statuses, or customer-stated expectations - keep them in a separate persistent facts block re-injected each turn.
- Trim tool outputs to relevant fields before they accumulate (e.g., keep only return-relevant fields from a 40+ field order lookup).
- Place key findings at the start and use explicit section headers to counter position effects.
- Subagents/upstream agents should return structured facts, citations, dates, and relevance scores - not verbose reasoning - for context-limited downstream agents.
- The API is stateless: pass full conversation history (or compacted-plus-facts) each request for coherence.
Design effective escalation and ambiguity resolution patterns
Official objective — Knowledge & Skills
- Appropriate escalation triggers: customer requests for a human, policy exceptions/gaps (not just complex cases), and inability to make meaningful progress
- The distinction between escalating immediately when a customer explicitly demands it versus offering to resolve when the issue is straightforward
- Why sentiment-based escalation and self-reported confidence scores are unreliable proxies for actual case complexity
- How multiple customer matches require clarification (requesting additional identifiers) rather than heuristic selection
- Adding explicit escalation criteria with few-shot examples to the system prompt demonstrating when to escalate versus resolve autonomously
- Honoring explicit customer requests for human agents immediately without first attempting investigation
- Acknowledging frustration while offering resolution when the issue is within the agent's capability, escalating only if the customer reiterates their preference
- Escalating when policy is ambiguous or silent on the customer's specific request (e.g., competitor price matching when policy only addresses own-site adjustments)
- Instructing the agent to ask for additional identifiers when tool results return multiple matches, rather than selecting based on heuristics
Good escalation design routes the right cases to humans without over- or under-escalating. The three legitimate escalation triggers are: (1) the customer explicitly asks for a human; (2) policy is ambiguous, silent, or the request needs a policy exception/gap decision - not merely that the case is technically complex; and (3) the agent cannot make meaningful progress. Note that 'complex' alone is not a trigger; capability and policy coverage are what matter.
Two unreliable proxies are explicitly tested. Sentiment-based escalation (escalating because the customer sounds angry) and self-reported confidence scores are poor stand-ins for actual case complexity - a frustrated customer may have a trivially resolvable issue, and a model's stated confidence does not track correctness. Don't gate escalation on either.
There is a key distinction between honoring an explicit demand and offering to help. If a customer explicitly demands a human, transfer immediately - do not first attempt investigation. If the customer is merely frustrated but the issue is within the agent's capability, acknowledge the frustration and offer to resolve it, escalating only if they reiterate the human request.
For policy ambiguity, escalate when the policy is silent on the specific ask. Classic example: policy covers price adjustments on your own site, but the customer wants a competitor price match the policy never addresses - that gap should escalate rather than be improvised.
For identity ambiguity, when a lookup returns multiple matching customers, do not pick one heuristically; ask for an additional identifier (email, order number, ZIP) to disambiguate.
The implementation lever is the system prompt: add explicit escalation criteria with few-shot examples demonstrating escalate-vs-resolve decisions. In the Agent SDK, the AskUserQuestion tool supports asking clarifying questions; Anthropic's broader guidance is to design agents to return to the human for judgement at blockers rather than guessing.
Escalate to a human when ANY of these hold:
1) The customer explicitly asks for a human -> transfer immediately.
2) Policy is silent/ambiguous on the request -> escalate.
3) You cannot make meaningful progress.
Otherwise resolve autonomously.
Examples:
- "I want to speak to a person." -> Escalate now (no investigation).
- "This is ridiculous, fix my refund." -> Acknowledge frustration,
offer to resolve; escalate only if they reiterate wanting a human.
- "Match Competitor X's price." Policy only covers own-site
adjustments -> Escalate (policy gap).if sentiment == "angry" or model_confidence < 0.7:
escalate() # unreliable proxies for actual complexityLookup returned 3 customers named "J. Smith."
-> "I found multiple accounts. Can you share the email or order
number on the account?" (do NOT pick one by heuristic)| Trap | Why it fails | Correct pattern |
|---|---|---|
| Escalating based on customer sentiment or the model's self-reported confidence score. | Neither tracks actual case complexity - an angry customer may have a trivial issue, and stated confidence does not predict correctness. | Escalate on concrete triggers (explicit human request, policy gap, no meaningful progress) defined with few-shot examples in the system prompt. |
| Investigating or attempting resolution first when a customer explicitly demands a human. | It ignores an explicit, unambiguous request and erodes trust. | Transfer immediately on explicit human requests; only offer to resolve when frustration is present but no human was demanded. |
| Heuristically selecting one record when a lookup returns multiple customer matches. | Picking the wrong identity causes wrong actions on the wrong account. | Ask for an additional identifier to disambiguate before acting. |
- Legitimate triggers: explicit human request, policy ambiguity/gap/exception, and inability to make meaningful progress.
- 'Complex' alone is not a trigger; sentiment and self-reported confidence are unreliable proxies for complexity.
- Honor an explicit human request immediately - do not investigate first.
- If only frustrated and the issue is in scope, acknowledge + offer to resolve; escalate only if they reiterate.
- Escalate when policy is silent on the specific request (e.g., competitor price match vs own-site adjustment).
- On multiple customer matches, request an additional identifier rather than choosing heuristically.
- Encode escalation criteria with few-shot examples in the system prompt.
Implement error propagation strategies across multi-agent systems
Official objective — Knowledge & Skills
- Structured error context (failure type, attempted query, partial results, alternative approaches) as enabling intelligent coordinator recovery decisions
- The distinction between access failures (timeouts needing retry decisions) and valid empty results (successful queries with no matches)
- Why generic error statuses ("search unavailable") hide valuable context from the coordinator
- Why silently suppressing errors (returning empty results as success) or terminating entire workflows on single failures are both anti-patterns
- Returning structured error context including failure type, what was attempted, partial results, and potential alternatives to enable coordinator recovery
- Distinguishing access failures from valid empty results in error reporting so the coordinator can make appropriate decisions
- Having subagents implement local recovery for transient failures and only propagate errors they cannot resolve, including what was attempted and partial results
- Structuring synthesis output with coverage annotations indicating which findings are well-supported versus which topic areas have gaps due to unavailable sources
In multi-agent systems built on the orchestrator-worker (lead-coordinator) pattern, how subagents report failure determines whether the coordinator can recover intelligently. The core skill is structured error context: instead of a generic status, a failing subagent returns the failure type, what was attempted (the query/tool/parameters), any partial results already gathered, and potential alternative approaches. With that, the coordinator can retry, reroute, narrow scope, or proceed with partial coverage.
Two anti-patterns bookend the space. Silently suppressing errors - returning empty results as if the query succeeded - hides failure and corrupts downstream synthesis. Terminating the entire workflow on a single subagent failure is equally wrong: one unavailable source should not kill an otherwise-recoverable run. Anthropic's multi-agent system instead uses durable execution that can resume from where the error occurred, plus retry logic and checkpoints, and lets agents adapt when a tool is failing.
A critical distinction: an access failure (a timeout or unavailable tool) needs a retry/reroute decision, whereas a valid empty result (the query ran successfully and simply found no matches) is a legitimate finding. Collapsing both into a vague "search unavailable" destroys the information the coordinator needs to decide correctly.
Responsibility is layered: subagents implement local recovery for transient failures (retry a timeout, try a narrower query) and only propagate errors they cannot resolve, attaching what was attempted and any partial results. Generic statuses like "search unavailable" should be avoided precisely because they strip this context.
Finally, the synthesis output should carry coverage annotations - which findings are well-supported versus which topic areas have gaps due to unavailable sources - so the final answer is honest about its blind spots rather than presenting partial coverage as complete.
{
"status": "access_failure",
"failure_type": "timeout",
"attempted": {"tool": "web_search", "query": "Q1 2026 revenue ACME"},
"partial_results": [{"source": "acme.com/ir", "excerpt": "..."}],
"alternatives": ["retry with 30s timeout", "try SEC EDGAR"]
}{ "status": "empty_result", "query": "refunds over $10k in 2026", "matches": 0 }{ "status": "success", "results": [] } // search actually timed out| Trap | Why it fails | Correct pattern |
|---|---|---|
| Returning a generic error status such as "search unavailable." | It hides the failure type, what was attempted, and partial results, so the coordinator cannot make an intelligent recovery decision. | Return structured error context (failure type, attempted query, partial results, alternatives). |
| Silently returning empty results as if the operation succeeded. | It conflates a failed access with a genuine no-match finding, corrupting downstream synthesis. | Distinguish access failures (need retry decisions) from valid empty results and report each explicitly. |
| Terminating the entire workflow when one subagent fails. | A single unavailable source shouldn't kill a recoverable run; the coordinator could retry, reroute, or proceed with annotated partial coverage. | Have subagents recover locally for transient failures and propagate only unresolved errors; let the coordinator decide and annotate coverage gaps. |
- Return structured error context: failure type, attempted query, partial results, and alternative approaches.
- Distinguish access failures (timeout -> retry/reroute) from valid empty results (success, zero matches).
- Generic statuses like 'search unavailable' hide context the coordinator needs - avoid them.
- Both silently suppressing errors and killing the whole workflow on one failure are anti-patterns.
- Subagents recover locally from transient failures and only propagate what they cannot resolve, with partial results.
- Annotate synthesis output with coverage: well-supported findings vs gaps from unavailable sources.
Manage context effectively in large codebase exploration
Official objective — Knowledge & Skills
- Context degradation in extended sessions: models start giving inconsistent answers and referencing "typical patterns" rather than specific classes discovered earlier
- The role of scratchpad files for persisting key findings across context boundaries
- Subagent delegation for isolating verbose exploration output while the main agent coordinates high-level understanding
- Structured state persistence for crash recovery: each agent exports state to a known location, and the coordinator loads a manifest on resume
- Spawning subagents to investigate specific questions (e.g., "find all test files," "trace refund flow dependencies") while the main agent preserves high-level coordination
- Having agents maintain scratchpad files recording key findings, referencing them for subsequent questions to counteract context degradation
- Summarizing key findings from one exploration phase before spawning sub-agents for the next phase, injecting summaries into initial context
- Designing crash recovery using structured agent state exports (manifests) that the coordinator loads on resume and injects into agent prompts
- Using /compact to reduce context usage during extended exploration sessions when context fills with verbose discovery output
Exploring a large codebase is a marathon for context. In extended sessions you see context degradation: the model starts giving inconsistent answers and referencing "typical patterns" instead of the specific classes and files it discovered earlier - a symptom of context rot. Three techniques counteract it.
First, subagent delegation. A subagent runs in its own fresh context window, does verbose work (reading dozens of files), and returns only a small summary (the Claude Code visualization shows a subagent reading ~6,100 tokens of files and returning ~420 tokens). Spawn subagents for scoped questions - "find all test files," "trace the refund-flow dependencies" - so the main agent keeps high-level coordination clean and the verbose discovery never touches the main window. The Agent SDK exposes this via AgentDefinition (include Agent in allowedTools); Claude Code via .claude/agents.
Second, scratchpad / note files. Have agents write key findings to a file (a scratchpad, or CLAUDE.md / auto memory MEMORY.md) and reference it for later questions. This is durable external memory that survives context boundaries. Before moving from one exploration phase to the next, summarize key findings and inject the summary into the next phase's initial context.
Third, /compact reduces context usage mid-session by replacing the conversation with a structured summary; you can steer it (e.g., /compact focus on the refund flow). Note what survives: project-root CLAUDE.md is re-read and re-injected after compaction, but conversation-only facts are not - which is exactly why a scratchpad/CLAUDE.md matters.
For crash recovery, design structured state persistence: each agent exports its state to a known location, and the coordinator loads a manifest on resume and injects it into agent prompts, so the run continues from where it stopped instead of restarting. This mirrors Anthropic's use of checkpoints, durable execution, and a feature manifest in long-running agents.
options = ClaudeAgentOptions(
allowed_tools=["Read","Glob","Grep","Agent"],
agents={"explorer": AgentDefinition(
description="Trace dependencies for a given flow.",
prompt="Find all files in the refund flow; return a 1-page summary.",
tools=["Read","Glob","Grep"])}){
"phase": "map-tests",
"agents": {
"explorer": {"state": "./state/explorer.json", "status": "done"},
"refund-tracer": {"state": "./state/refund.json", "status": "in_progress"}
}
}# 80+ file reads in the main window, no notes, no subagents
# Result: model forgets PaymentService specifics, cites "typical patterns"| Trap | Why it fails | Correct pattern |
|---|---|---|
| Reading the whole codebase in the main conversation window. | Verbose discovery fills the context, triggering degradation - inconsistent answers and references to generic patterns instead of the specific classes found earlier. | Delegate verbose, scoped exploration to subagents that return condensed summaries, keeping the main window for coordination. |
| Holding all discovered findings only in the conversation. | Findings are lost across context boundaries and after /compact (conversation-only facts are not re-injected). | Persist key findings to scratchpad/CLAUDE.md/MEMORY.md files and reference them; summarize between phases and inject into the next phase. |
| Restarting the whole exploration after a crash. | Wastes context and work and may lose discovered state. | Export structured per-agent state and have the coordinator load a manifest on resume, injecting prior state into agent prompts. |
- Extended sessions degrade: inconsistent answers and generic 'typical patterns' instead of specific discovered classes.
- Spawn subagents for scoped investigations; they isolate verbose output in a separate window and return only a summary.
- Maintain scratchpad/CLAUDE.md/MEMORY.md notes and reference them later to counteract degradation.
- Summarize findings between phases and inject the summary into the next phase's initial context.
- Use /compact to shrink context mid-session; project-root CLAUDE.md is re-injected after compaction, conversation-only facts are not.
- Design crash recovery with per-agent state exports plus a manifest the coordinator loads on resume and injects into prompts.
Design human review workflows and confidence calibration
Official objective — Knowledge & Skills
- The risk that aggregate accuracy metrics (e.g., 97% overall) may mask poor performance on specific document types or fields
- Stratified random sampling for measuring error rates in high-confidence extractions and detecting novel error patterns
- Field-level confidence scores calibrated using labeled validation sets for routing review attention
- The importance of validating accuracy by document type and field segment before automating high-confidence extractions
- Implementing stratified random sampling of high-confidence extractions for ongoing error rate measurement and novel pattern detection
- Analyzing accuracy by document type and field to verify consistent performance across all segments before reducing human review
- Having models output field-level confidence scores, then calibrating review thresholds using labeled validation sets
- Routing extractions with low model confidence or ambiguous/contradictory source documents to human review, prioritizing limited reviewer capacity
Before you automate high-confidence extractions and pull back human review, you must prove the system is reliable where it matters - and aggregate metrics can lie. A headline like 97% overall accuracy can mask poor performance on a specific document type or a specific field. So the first skill is segmented validation: analyze accuracy by document type and by field and confirm consistent performance across every segment before reducing review. A model that is 99% on invoices but 70% on handwritten forms must not be treated as uniformly 97%.
Second, field-level confidence scores. Have the model output a confidence per extracted field, then calibrate review thresholds on a labeled validation set so a reported confidence maps to an actual error rate. Calibration is what makes "high confidence" a usable routing signal rather than a vibe.
Third, routing. Send to human review the extractions that most need it: low model confidence, and cases where the source documents are ambiguous or contradictory. This prioritizes scarce reviewer capacity on the riskiest items instead of spreading attention uniformly.
Fourth, even after automating, keep a safety net via stratified random sampling of high-confidence extractions. Ongoing sampling measures the real error rate in the auto-approved tier and detects novel error patterns (a new vendor template, a format change) that confidence scores haven't learned yet. Stratify the sample across document types and fields so rare-but-critical segments are actually represented - plain random sampling can miss them.
The through-line mirrors Anthropic's guidance on matching oversight to risk: automate the well-validated, calibrated, low-risk paths; route ambiguous/low-confidence/high-stakes items to humans; and continuously monitor with stratified sampling rather than trusting a single aggregate number.
Overall accuracy: 97%
invoices (typed): 99.2%
receipts (typed): 98.1%
handwritten forms: 71.4% <- do NOT auto-approve this segment{
"invoice_total": {"value": 248.50, "confidence": 0.97},
"vendor_name": {"value": "Acme", "confidence": 0.62} // -> human review
}if overall_accuracy >= 0.95:
disable_human_review_for_all_document_types() # masks weak segments| Trap | Why it fails | Correct pattern |
|---|---|---|
| Reducing/automating human review based on a single aggregate accuracy number. | A high overall figure can mask poor performance on specific document types or fields, so failures concentrate where you stopped looking. | Validate accuracy by document type and field, confirm consistency across segments, and only then reduce review for the proven segments. |
| Treating raw model confidence as a calibrated probability and only sampling high-confidence items randomly. | Uncalibrated scores don't map to real error rates, and unstratified sampling under-represents rare document types where novel errors hide. | Calibrate field-level thresholds on a labeled validation set and use stratified random sampling across segments for ongoing monitoring. |
- Aggregate accuracy (e.g., 97%) can hide poor performance on specific document types or fields.
- Validate accuracy per document type and per field before reducing human review.
- Output field-level confidence and calibrate review thresholds on a labeled validation set.
- Route low-confidence and ambiguous/contradictory-source extractions to human review to prioritize reviewer capacity.
- Use stratified random sampling of high-confidence outputs to measure ongoing error rates and catch novel error patterns.
Preserve information provenance and handle uncertainty in multi-source synthesis
Official objective — Knowledge & Skills
- How source attribution is lost during summarization steps when findings are compressed without preserving claim-source mappings
- The importance of structured claim-source mappings that the synthesis agent must preserve and merge when combining findings
- How to handle conflicting statistics from credible sources: annotating conflicts with source attribution rather than arbitrarily selecting one value
- Temporal data: requiring publication/collection dates in structured outputs to prevent temporal differences from being misinterpreted as contradictions
- Requiring subagents to output structured claim-source mappings (source URLs, document names, relevant excerpts) that downstream agents preserve through synthesis
- Structuring reports with explicit sections distinguishing well-established findings from contested ones, preserving original source characterizations and methodological context
- Completing document analysis with conflicting values included and explicitly annotated, letting the coordinator decide how to reconcile before passing to synthesis
- Requiring subagents to include publication or data collection dates in structured outputs to enable correct temporal interpretation
- Rendering different content types appropriately in synthesis outputs—financial data as tables, news as prose, technical findings as structured lists—rather than converting everything to a uniform format
When findings flow through multiple agents and summarization steps, the danger is that provenance is silently lost: summarizing compresses claims while dropping the claim-to-source mapping, so the final report asserts facts with no traceable origin. The fix is to require structured claim-source mappings at every hop - each claim carries its source URL or document name and a relevant excerpt - and to make the synthesis agent preserve and merge those mappings rather than flatten them. Anthropic's multi-agent system reflects this with a dedicated CitationAgent that attributes every claim to its source.
Handling uncertainty and conflict is the second pillar. When credible sources report conflicting statistics, do not arbitrarily pick one value. The analysis agent should complete its work with both values included and explicitly annotated (with source attribution), and let the coordinator decide how to reconcile before synthesis. The final report should structure findings into explicit sections distinguishing well-established from contested claims, preserving each source's original characterization and methodological context instead of homogenizing them.
Temporal data is a frequent false-conflict source. Require subagents to include publication or data-collection dates in structured outputs, so a 2023 figure and a 2026 figure are interpreted as a time series, not a contradiction. Without dates, temporal differences masquerade as disagreements.
Finally, render content types appropriately in synthesis: financial data as tables, news as prose, technical findings as structured lists. Forcing everything into one uniform format loses meaning and can obscure the structure that makes provenance and conflicts legible. The unifying idea: synthesis should carry forward sources, dates, conflicts, and methodological context as first-class structured data - never paraphrase them away.
{
"claim": "EV sales grew 18% in 2025",
"source": "https://iea.org/reports/ev-2026",
"excerpt": "...sales rose 18% year over year in 2025...",
"published": "2026-04"
}{
"metric": "market_size_usd_bn",
"values": [
{"value": 42, "source": "Firm A", "collected": "2024"},
{"value": 51, "source": "Firm B", "collected": "2026"}
],
"note": "Possible time-series difference; coordinator to reconcile."
}"The market is about $51B." // no source, no date, conflict hidden| Trap | Why it fails | Correct pattern |
|---|---|---|
| Letting summarization compress findings without preserving which source supports which claim. | Provenance is lost, so the final report makes unattributable claims and can't be verified or audited. | Require structured claim-source mappings (URL/document + excerpt) that the synthesis agent preserves and merges. |
| Arbitrarily selecting one value when credible sources conflict. | It discards legitimate uncertainty and may pick the wrong figure, hiding the conflict from decision-makers. | Include conflicting values with source attribution and dates, and let the coordinator reconcile; report well-established vs contested findings separately. |
| Omitting publication/collection dates from structured outputs. | Figures from different time periods get misinterpreted as contradictions. | Require dates in every structured output so temporal differences are read as a time series. |
- Summarization steps drop claim-source mappings unless you require them as structured outputs that synthesis preserves and merges.
- Each claim should carry source URL/document name and a relevant excerpt.
- On conflicting statistics from credible sources, annotate the conflict with attribution - don't arbitrarily pick one value; let the coordinator reconcile.
- Require publication/collection dates so temporal differences aren't misread as contradictions.
- Structure reports to separate well-established from contested findings, preserving original characterizations and methodological context.
- Render content types appropriately (tables for financials, prose for news, lists for technical) rather than one uniform format.
Drill Context Management & Reliability
18 scenario-based questions, timed, with full explanations.
Start the Context drill →