CCA-F STUDY

Prompt Engineering & Structured Output

This domain (20% of the CCA-F exam) tests how an architect shapes Claude's behavior through prompting and how to make its output reliable enough to build systems on. It spans two halves: writing prompts that maximize precision and consistency (explicit criteria over vague instructions, few-shot examples for ambiguous cases), and engineering trustworthy output pipelines (tool use / JSON schemas for guaranteed structure, validation-and-retry loops, batch vs synchronous processing, and multi-instance / multi-pass review). The 2026 platform adds native Structured Outputs (output_config.format) and strict tool use (strict: true), which use constrained decoding to eliminate JSON syntax errors entirely. Throughout, the exam rewards recognizing that schema validity is not semantic correctness, and that independent review beats self-review.

What the exam expects
  • 01. Distinguish syntax from semantics: strict tool use and Structured Outputs guarantee schema-valid JSON via constrained decoding (plain tool use makes it highly reliable), but none of them guarantees the values are correct (line items can still fail to sum to the stated total). Semantic errors require validation logic, not a better schema.
  • 02. Memorize the four tool_choice modes and their exact behavior: auto (default; model may answer in text), any (must call some tool), {type:'tool',name:'...'} (must call that specific tool), none (no tool calls). With any or tool the API prefills the assistant turn, so Claude emits NO natural-language text before the tool_use block.
  • 03. Prefer explicit categorical criteria over confidence-based filtering. 'Be conservative' or 'only report high-confidence findings' does not improve precision; 'flag only when the comment's claimed behavior contradicts the actual code' does. High false-positive categories erode trust in the accurate ones.
  • 04. Few-shot (2-5 examples, diverse, wrapped in <example>/<examples> tags) is the single most reliable lever for consistent formatting and correct edge-case handling. Examples that show the reasoning for choosing one action over plausible alternatives let the model generalize to novel cases.
  • 05. Retries fix format and structural errors but NOT missing information. If a required field is simply absent from the source document, retrying with error feedback wastes calls. Make such fields optional/nullable so the model returns null instead of fabricating.
  • 06. Match the API to the latency requirement: synchronous Messages API for blocking pre-merge checks; Message Batches API (50% cheaper, up to 24h window, no latency SLA) for overnight/weekly non-blocking work. Always correlate batch results by custom_id; results can return out of order.
  • 07. Independent review beats self-review. A fresh Claude instance without the generator's reasoning context catches subtle issues that self-review and extended thinking miss. Split large reviews into per-file local passes plus separate cross-file integration passes to avoid attention dilution.
Task 4.1

Design prompts with explicit criteria to improve precision and reduce false positives

Official objective — Knowledge & Skills
Knowledge of
  • The importance of explicit criteria over vague instructions (e.g., "flag comments only when claimed behavior contradicts actual code behavior" vs "check that comments are accurate")
  • How general instructions like "be conservative" or "only report high-confidence findings" fail to improve precision compared to specific categorical criteria
  • The impact of false positive rates on developer trust: high false positive categories undermine confidence in accurate categories
Skills in
  • Writing specific review criteria that define which issues to report (bugs, security) versus skip (minor style, local patterns) rather than relying on confidence-based filtering
  • Temporarily disabling high false-positive categories to restore developer trust while improving prompts for those categories
  • Defining explicit severity criteria with concrete code examples for each severity level to achieve consistent classification

Precision in a Claude-powered reviewer or classifier comes from explicit, operational criteria, not from softening language. The exam's canonical contrast is 'flag comments only when the claimed behavior contradicts the actual code behavior' (precise, testable) versus 'check that comments are accurate' (vague, leaves the model to invent a bar). The latest models (Opus 4.8) interpret prompts literally and do not silently generalize, which makes specificity even more important: state exactly which categories to report (bugs that cause incorrect behavior, security issues) and which to skip (minor style, naming, local conventions).

A common but ineffective fix is confidence-based filtering. Instructions like 'be conservative' or 'only report high-confidence findings' do not reliably raise precision because the model has no shared definition of the threshold. The official guidance is to 'be concrete about where the bar is rather than using qualitative terms like "important"' — for example, 'report any bug that could cause incorrect behavior, a test failure, or a misleading result; omit pure style or naming preferences.' Categorical, example-anchored criteria outperform a numeric confidence dial.

False positives are not evenly damaging. A single high-false-positive category undermines developer trust in every other category, including the accurate ones — once reviewers learn to ignore one noisy bucket, they start ignoring all output. The practical architectural move is to temporarily disable a high-false-positive category entirely (restoring trust in what remains) while you iterate on its prompt offline, rather than shipping it noisy.

For consistent severity classification, define each severity level with concrete code examples of what qualifies (e.g., a worked 'critical' example, a worked 'minor' example). Anchoring severities to examples produces far more consistent labeling than adjectives like 'serious' or 'important' alone. Pair this with measurement: track dismissal rates per category so you can see which criteria are actually working.
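
The measurement step can be sketched in a few lines. This is a minimal illustration, assuming each finding is a dict with hypothetical `category` and `dismissed` keys; the 0.5 cutoff is an arbitrary placeholder, not a recommended value.

```python
from collections import defaultdict


def dismissal_rates(findings):
    """Map each category to (dismissed findings / total findings)."""
    totals = defaultdict(int)
    dismissed = defaultdict(int)
    for finding in findings:
        totals[finding["category"]] += 1
        if finding["dismissed"]:
            dismissed[finding["category"]] += 1
    return {cat: dismissed[cat] / totals[cat] for cat in totals}


def categories_to_disable(findings, max_rate=0.5):
    """Categories whose dismissal rate exceeds max_rate (an assumed cutoff)."""
    rates = dismissal_rates(findings)
    return sorted(cat for cat, rate in rates.items() if rate > max_rate)


def enabled_findings(findings, disabled):
    """Drop findings from temporarily disabled categories before display."""
    return [f for f in findings if f["category"] not in disabled]


# Hypothetical review history: one noisy category, one accurate one.
history = (
    [{"category": "comment-accuracy", "dismissed": True}] * 3
    + [{"category": "comment-accuracy", "dismissed": False}]
    + [{"category": "security", "dismissed": False}] * 2
)
noisy = categories_to_disable(history)
visible = enabled_findings(history, set(noisy))
```

Disabling happens at display time here, so the noisy category can keep running offline while its prompt is refined.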

Anti-pattern (text): Vague criteria that fail to improve precision
Review this code. Be conservative and only report high-confidence issues. Check that the comments are accurate.
Correct pattern (text): Explicit categorical criteria with severity anchored to examples
Report a finding ONLY when one of these is true:
- A comment's claimed behavior contradicts what the code actually does.
- A bug could cause incorrect output, a test failure, or data loss.
- A security issue (injection, auth bypass, secret exposure) is present.
DO NOT report: style, naming, formatting, or local conventions.
Severity:
- critical: data loss / security (e.g., unsanitized SQL string built from user input).
- minor: harmless redundancy (e.g., a redundant null check after a guaranteed-non-null return).
Anti-patterns & traps
Trap: Using 'be conservative' or 'only report high-confidence findings' to cut false positives.
Why it fails: The model has no shared definition of the confidence threshold, so precision does not reliably improve and behavior is inconsistent across runs.
Correct pattern: Specify exactly which issue categories to report and which to skip, with concrete examples per severity level.

Trap: Shipping a category that produces many false positives because it occasionally finds real issues.
Why it fails: High false-positive categories train developers to ignore the tool, undermining trust in the accurate categories too.
Correct pattern: Temporarily disable the noisy category to preserve trust, then refine its prompt and criteria offline before re-enabling.
Must-know
  • Explicit, testable criteria ('flag only when claimed behavior contradicts actual code') beat vague ones ('check accuracy').
  • Confidence-based hedging ('be conservative', 'high-confidence only') does NOT reliably improve precision; specific categorical rules do.
  • Define what to REPORT (bugs, security) vs what to SKIP (minor style, local patterns) rather than relying on a confidence dial.
  • One high-false-positive category erodes trust in all categories; temporarily disable it while improving its prompt.
  • Anchor each severity level to concrete code examples for consistent classification.
Task 4.2

Apply few-shot prompting to improve output consistency and quality

Official objective — Knowledge & Skills
Knowledge of
  • Few-shot examples as the most effective technique for achieving consistently formatted, actionable output when detailed instructions alone produce inconsistent results
  • The role of few-shot examples in demonstrating ambiguous-case handling (e.g., tool selection for ambiguous requests, branch-level test coverage gaps)
  • How few-shot examples enable the model to generalize judgment to novel patterns rather than matching only pre-specified cases
  • The effectiveness of few-shot examples for reducing hallucination in extraction tasks (e.g., handling informal measurements, varied document structures)
Skills in
  • Creating 2-4 targeted few-shot examples for ambiguous scenarios that show reasoning for why one action was chosen over plausible alternatives
  • Including few-shot examples that demonstrate specific desired output format (location, issue, severity, suggested fix) to achieve consistency
  • Providing few-shot examples distinguishing acceptable code patterns from genuine issues to reduce false positives while enabling generalization
  • Using few-shot examples to demonstrate correct handling of varied document structures (inline citations vs bibliographies, methodology sections vs embedded details)
  • Adding few-shot examples showing correct extraction from documents with varied formats to address empty/null extraction of required fields

Few-shot (multishot) prompting is the most reliable technique for getting consistently formatted, actionable output when detailed instructions alone still produce variance. Anthropic's guidance: 'Examples are one of the most reliable ways to steer Claude's output format, tone, and structure,' and they 'can dramatically improve accuracy and consistency.' The recommended range is roughly 2-5 examples (the docs suggest 3-5 for best results), wrapped in <example> tags (multiple in an <examples> block) so the model distinguishes demonstrations from instructions, and made diverse enough to cover edge cases so the model doesn't latch onto an unintended pattern.

The highest-leverage use of few-shot is ambiguous-case handling. For a tool-selection decision, a branch-level test-coverage gap, or an extraction from an oddly structured document, a worked example that shows the reasoning — why one action was chosen over a plausible alternative — teaches judgment rather than a lookup table. This is why examples enable generalization to novel patterns instead of merely matching pre-specified ones. You can place a <thinking> block inside an example to demonstrate the reasoning pattern you want; the model will generalize that style.

For output consistency, show the exact target shape in each example (e.g., location, issue, severity, suggested fix). Demonstrating the format is more effective than describing it. For false-positive reduction, include contrastive examples: an acceptable code pattern labeled 'do not flag' next to a genuine issue labeled 'flag' — this draws the boundary while still letting the model generalize.

Few-shot is also the primary tool against extraction hallucination. Examples that show correct handling of informal measurements, inline citations versus bibliographies, methodology sections versus embedded details, and varied document layouts teach the model how to behave on structures it hasn't seen — and crucially, examples that show returning null/empty for a genuinely-absent field prevent the model from fabricating values to satisfy a field. Keep examples relevant (mirror the real use case) and ask Claude to evaluate your set for relevance and diversity if unsure.
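
One way to keep the example set maintainable is to store examples as data and render the tagged block from them. The sketch below assumes a hypothetical `render_examples` helper and hypothetical `input` / `reasoning` / `output` keys; it only builds a string and makes no API call.

```python
import json


def render_examples(examples):
    """Render a list of example dicts as an <examples> block for a prompt."""
    lines = ["<examples>"]
    for ex in examples:
        lines.append("  <example>")
        lines.append(f"    Input: {ex['input']}")
        lines.append(f"    Reasoning: {ex['reasoning']}")
        # json.dumps turns Python None/False into JSON null/false.
        lines.append(f"    Output: {json.dumps(ex['output'])}")
        lines.append("  </example>")
    lines.append("</examples>")
    return "\n".join(lines)


# Contrastive, diverse examples: one "do not flag" case, one absent-field case.
examples = [
    {
        "input": "null check on the result of a function annotated @NonNull",
        "reasoning": "redundant but harmless; not a correctness or security issue",
        "output": {"flag": False},
    },
    {
        "input": "invoice with no PO number printed anywhere",
        "reasoning": "po_number is absent, so return null rather than inferring it",
        "output": {"invoice_total": "1240.00", "po_number": None},
    },
]
block = render_examples(examples)
```

Keeping the examples in a list makes it easy to add an edge case when a new failure pattern shows up in production.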

Correct pattern (text): Diverse, contrastive few-shot examples that teach a boundary and a fixed output shape
<examples>
  <example>
    Input: a null check on the return value of a function annotated @NonNull
    Reasoning: redundant but harmless; not a correctness or security issue.
    Output: {"flag": false}
  </example>
  <example>
    Input: query = "SELECT * FROM users WHERE id=" + req.id
    Reasoning: user input concatenated into SQL -> injection; correctness/security.
    Output: {"flag": true, "location": "db.py:42", "issue": "SQL injection", "severity": "critical", "suggested_fix": "use a parameterized query"}
  </example>
</examples>
Correct pattern (text): Few-shot example demonstrating null on a genuinely absent field (anti-hallucination)
<example>
  Document: invoice with no PO number printed anywhere.
  Output: {"invoice_total": "1240.00", "po_number": null}
  Note: po_number is null because it is absent; do not infer or fabricate it.
</example>
Anti-patterns & traps
Trap: Relying on a long prose instruction list to enforce a format instead of showing examples.
Why it fails: Detailed instructions alone often still yield inconsistent structure; the model has no concrete template to imitate.
Correct pattern: Provide 2-5 tagged examples that demonstrate the exact desired output shape and edge-case handling.

Trap: Using only near-identical positive examples.
Why it fails: The model picks up unintended surface patterns and fails on edge cases or novel document structures.
Correct pattern: Make examples diverse and contrastive (including 'do not flag' / null-field cases) so the model generalizes the underlying judgment.
Must-know
  • Few-shot is the most reliable lever for consistent format/quality when instructions alone are inconsistent; use ~2-5 examples (docs suggest 3-5).
  • Wrap examples in <example>/<examples> tags and make them DIVERSE to cover edge cases without teaching unintended patterns.
  • For ambiguous cases, show the reasoning for choosing one action over a plausible alternative so the model generalizes judgment.
  • Demonstrate the exact output shape (location, issue, severity, fix) rather than describing it.
  • Contrastive examples (acceptable pattern vs genuine issue) reduce false positives while preserving generalization.
  • Examples that return null/empty for absent fields prevent fabrication on varied document structures.
Task 4.3

Enforce structured output using tool use and JSON schemas

Official objective — Knowledge & Skills
Knowledge of
  • Tool use (tool_use) with JSON schemas as the most reliable approach for guaranteed schema-compliant structured output, eliminating JSON syntax errors
  • The distinction between tool_choice: "auto" (model may return text instead of calling a tool), "any" (model must call a tool but can choose which), and forced tool selection (model must call a specific named tool)
  • That strict JSON schemas via tool use eliminate syntax errors but do not prevent semantic errors (e.g., line items that don't sum to total, values in wrong fields)
  • Schema design considerations: required vs optional fields, enum fields with "other" + detail string patterns for extensible categories
Skills in
  • Defining extraction tools with JSON schemas as input parameters and extracting structured data from the tool_use response
  • Setting tool_choice: "any" to guarantee structured output when multiple extraction schemas exist and the document type is unknown
  • Forcing a specific tool with tool_choice: {"type": "tool", "name": "extract_metadata"} to ensure a particular extraction runs before enrichment steps
  • Designing schema fields as optional (nullable) when source documents may not contain the information, preventing the model from fabricating values to satisfy required fields
  • Adding enum values like "unclear" for ambiguous cases and "other" + detail fields for extensible categorization
  • Including format normalization rules in prompts alongside strict output schemas to handle inconsistent source formatting

The most reliable way to get schema-compliant structured output from Claude is to constrain generation rather than to ask politely for JSON. Two mechanisms do this in 2026: (1) tool use, where you define a tool with an input_schema (JSON Schema) and read structured data from the returned tool_use block; and (2) native Structured Outputs / strict tool use, which add constrained (grammar-based) decoding so the model cannot emit tokens that violate the schema. Adding strict: true to a tool definition guarantees the tool input conforms to its input_schema (e.g., passengers becomes 2, never "two"). The JSON Outputs mode (output_config.format with type: json_schema) guarantees the message text is schema-valid JSON; the earlier beta exposed the same feature as output_format behind the structured-outputs-2025-11-13 beta header. Both eliminate JSON syntax errors, but neither prevents semantic errors: line items can still fail to sum to the stated total, or a value can land in the wrong field. Schema validity is not correctness.

tool_choice controls invocation. auto (the default when tools are present) lets the model return text instead of calling a tool. any forces it to call one of the provided tools but lets it choose which; this is ideal when you have several extraction schemas and don't yet know the document type. {"type": "tool", "name": "extract_metadata"} forces a specific named tool; use this to guarantee a particular extraction runs before any enrichment step. none disables tools. Important: with any or tool the API prefills the assistant turn, so the model emits no natural-language preamble before the tool_use block. (With extended thinking enabled, only auto and none are supported; any and forced tool selection are rejected.)
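
The choice between any and a forced tool can be captured in a small helper, together with a reader for the resulting tool_use block. This is a sketch under assumptions: `tool_choice_for` and `first_tool_call` are hypothetical names, and the response is stood in for by `SimpleNamespace` objects that mimic the `type`, `name`, and `input` attributes of content blocks.

```python
from types import SimpleNamespace


def tool_choice_for(known_tool=None):
    """Force a named tool when the schema is known, otherwise require any tool."""
    if known_tool is None:
        return {"type": "any"}
    return {"type": "tool", "name": known_tool}


def first_tool_call(content_blocks):
    """Return (tool name, structured input) from the first tool_use block."""
    for block in content_blocks:
        if block.type == "tool_use":
            return block.name, block.input
    raise ValueError("response contains no tool_use block")


# Stand-in for resp.content when the document type was unknown ("any").
fake_content = [
    SimpleNamespace(type="tool_use", name="extract_receipt",
                    input={"invoice_total": "18.50"}),
]
name, data = first_tool_call(fake_content)
```

With any, the returned tool name doubles as the model's classification of the document type, so downstream code can dispatch on it.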

Schema design choices materially affect quality. Mark fields optional / nullable ("type": ["string", "null"]) when the source may legitimately lack them — required fields pressure the model to fabricate a value to satisfy the schema. Use enums for closed categories, and add an "unclear" enum value for ambiguous cases plus an "other" value paired with a free-text detail field for extensible categorization. Because constrained decoding only enforces structure, put format-normalization rules (date formats, units, currency) in the prompt alongside the schema so inconsistent source formatting is cleaned up before it fills your fields.
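
Some of these design rules cannot be expressed in the schema itself and are worth checking in code after extraction. The sketch below assumes two illustrative rules: the prompt asked for totals as plain decimals with two places, and an "other" document type must come with a detail string. `check_record` is a hypothetical helper.

```python
import re

# Assumed normalization rule from the prompt: digits, a dot, two decimals.
NORMALIZED_AMOUNT = re.compile(r"\d+\.\d{2}")


def check_record(record):
    """Return rule violations that a structurally valid record can still have."""
    errors = []
    if NORMALIZED_AMOUNT.fullmatch(record["invoice_total"]) is None:
        errors.append(
            f"invoice_total {record['invoice_total']!r} is not normalized (expected e.g. '1240.00')"
        )
    if record["document_type"] == "other" and not record.get("document_type_detail"):
        errors.append("document_type is 'other' but document_type_detail is empty")
    return errors


clean = {"invoice_total": "1240.00", "po_number": None,
         "document_type": "invoice", "document_type_detail": None}
messy = {"invoice_total": "$1,240", "po_number": None,
         "document_type": "other", "document_type_detail": None}
clean_errors = check_record(clean)
messy_errors = check_record(messy)
```

A null po_number passes untouched: absence is a legitimate answer, not a violation.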

Correct pattern (json): Extraction tool with optional/nullable fields and extensible enums
{
  "name": "extract_invoice",
  "strict": true,
  "input_schema": {
    "type": "object",
    "properties": {
      "invoice_total": {"type": "string"},
      "po_number": {"type": ["string", "null"]},
      "document_type": {"type": "string", "enum": ["invoice", "receipt", "unclear", "other"]},
      "document_type_detail": {"type": ["string", "null"]}
    },
    "required": ["invoice_total", "document_type"],
    "additionalProperties": false
  }
}
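Correct pattern (python): Force a specific extraction tool to run before enrichment
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# extract_metadata_tool (a tool definition dict) and document (str) are defined elsewhere.
resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    tools=[extract_metadata_tool],
    tool_choice={"type": "tool", "name": "extract_metadata"},  # force this tool
    messages=[{"role": "user", "content": document}],
)
# The forced call returns a tool_use block whose .input holds the structured data.
data = next(b.input for b in resp.content if b.type == "tool_use")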
Anti-pattern (text): Relying on a prose 'return only JSON' instruction
Read the document and respond with only a JSON object containing total and date. Do not include any other text.
Anti-patterns & traps
Trap: Asking the model in prose to 'respond with only valid JSON' instead of using tool use / Structured Outputs.
Why it fails: Free-text generation can produce trailing prose, markdown fences, or malformed JSON; there is no structural guarantee.
Correct pattern: Define a tool with input_schema (optionally strict: true) or use output_config.format, and read the structured result from the tool_use block.

Trap: Marking every extraction field as required.
Why it fails: When the source genuinely lacks a value, a required field forces the model to fabricate one to satisfy the schema.
Correct pattern: Make fields optional or nullable when the source may not contain them, so the model returns null instead of inventing data.

Trap: Assuming strict/constrained output also makes the data correct.
Why it fails: Constrained decoding only enforces structure; semantic errors (line items not summing to total, value in wrong field) still occur.
Correct pattern: Add downstream semantic validation (e.g., recompute totals) on top of the schema guarantee.
Must-know
  • Tool use with a JSON input_schema (and strict: true / Structured Outputs) is the most reliable path to schema-valid output; it eliminates JSON syntax errors.
  • tool_choice: auto = may return text; any = must call SOME tool; {type:'tool',name} = must call THAT tool; none = no tools. any/tool prefill the turn, so no text precedes the tool_use block.
  • Use any when the document type is unknown and multiple schemas exist; force a named tool to guarantee a specific extraction runs first.
  • Constrained decoding guarantees SYNTAX, not SEMANTICS (totals can still be wrong, values can be misplaced).
  • Make fields optional/nullable when the source may lack them, to stop the model fabricating values for required fields.
  • Use enums with 'unclear' for ambiguity and 'other' + a detail string for extensible categories; put normalization rules in the prompt.
Task 4.4

Implement validation, retry, and feedback loops for extraction quality

Official objective — Knowledge & Skills
Knowledge of
  • Retry-with-error-feedback: appending specific validation errors to the prompt on retry to guide the model toward correction
  • The limits of retry: retries are ineffective when the required information is simply absent from the source document (vs format or structural errors)
  • Feedback loop design: tracking which code constructs trigger findings (detected_pattern field) to enable systematic analysis of dismissal patterns
  • The difference between semantic validation errors (values don't sum, wrong field placement) and schema syntax errors (eliminated by tool use)
Skills in
  • Implementing follow-up requests that include the original document, the failed extraction, and specific validation errors for model self-correction
  • Identifying when retries will be ineffective (e.g., information exists only in an external document not provided) versus when they will succeed (format mismatches, structural output errors)
  • Adding detected_pattern fields to structured findings to enable analysis of false positive patterns when developers dismiss findings
  • Designing self-correction validation flows: extracting "calculated_total" alongside "stated_total" to flag discrepancies, adding "conflict_detected" booleans for inconsistent source data

Even with schema-guaranteed output, extraction quality requires validation, retry, and feedback loops because constrained decoding fixes syntax, not meaning. The core retry pattern is retry-with-error-feedback: when a result fails a semantic validation check, send a follow-up request that includes the original document, the failed extraction, and the specific validation errors, so the model can self-correct toward a valid answer. Generic 'try again' is far weaker than naming the exact discrepancy.

Retries have a hard limit you must recognize: they help with format mismatches and structural output errors, but they cannot recover information that is simply absent from the source. If a required value lives only in an external document you didn't provide, no number of retries will surface it — the correct response is to mark that field nullable and return null, or to fetch the missing document, not to loop. Knowing when a retry will succeed (format/structure) versus when it will waste calls (missing data) is an explicitly tested skill.
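
The retry decision can be made explicit in a small loop. This is a sketch, assuming the validator tags each problem with a hypothetical kind such as "format_mismatch" or "missing_in_source"; `extract` stands in for the real API call and is replaced here by a fake so the logic can be followed without network access.

```python
def extract_with_retry(extract, validate, document, max_attempts=3):
    """Retry only while there are problems that a retry could plausibly fix."""
    previous, feedback = None, []
    for attempt in range(1, max_attempts + 1):
        result = extract(document, previous, feedback)
        problems = validate(result)  # list of (kind, message) pairs
        # Information absent from the source cannot be recovered by retrying.
        retryable = [msg for kind, msg in problems if kind != "missing_in_source"]
        if not retryable:
            return result, attempt
        previous, feedback = result, retryable
    return result, max_attempts


calls = []


def fake_extract(document, previous, feedback):
    """Stand-in extractor: fixes the date format once it receives feedback."""
    calls.append(list(feedback))
    date = "2025-03-04" if feedback else "03/04/2025"
    return {"date": date, "po_number": None}


def fake_validate(result):
    problems = []
    if len(result["date"]) != 10 or result["date"][4] != "-":
        problems.append(("format_mismatch", "date must be YYYY-MM-DD"))
    if result["po_number"] is None:
        problems.append(("missing_in_source", "po_number not present in document"))
    return problems


final, attempts = extract_with_retry(fake_extract, fake_validate, "invoice text")
```

The loop stops after the second attempt even though po_number is still null, because no further call can produce a value the document does not contain.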

Design self-correcting validation directly into the schema. To catch arithmetic inconsistency, have the model emit both a stated_total (copied from the document) and a calculated_total (summed from line items); your code flags any discrepancy and triggers a targeted retry. For inconsistent source data, add a conflict_detected boolean so the model can surface that two parts of the document disagree rather than silently picking one. This shifts validation from a post-hoc guess to a structured, checkable signal.
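
The model's own calculated_total is still model output, so code can recompute the sum independently. A minimal sketch, assuming a hypothetical `line_items` array whose entries carry an `amount` string, and assuming all amounts are already normalized decimal strings:

```python
from decimal import Decimal


def check_totals(record):
    """Compare stated and model-calculated totals against a recomputed sum."""
    stated = Decimal(record["stated_total"])
    calculated = Decimal(record["calculated_total"])
    recomputed = sum(
        (Decimal(item["amount"]) for item in record["line_items"]), Decimal("0")
    )
    errors = []
    if calculated != recomputed:
        errors.append(
            f"calculated_total {calculated} != recomputed sum of line items {recomputed}"
        )
    if stated != recomputed:
        errors.append(
            f"stated_total {stated} != recomputed sum of line items {recomputed}"
        )
    return errors


items = [{"amount": "1000.00"}, {"amount": "240.00"}]
consistent = {"stated_total": "1240.00", "calculated_total": "1240.0", "line_items": items}
inconsistent = {"stated_total": "1250.00", "calculated_total": "1240.00", "line_items": items}
ok_errors = check_totals(consistent)
bad_errors = check_totals(inconsistent)
```

Decimal comparison treats "1240.0" and "1240.00" as equal, which avoids false alarms from formatting differences alone.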

Feedback loops also drive long-term precision. Add a detected_pattern field to each finding that records which code construct triggered it (e.g., 'null check after non-null return'). When developers dismiss findings, you can aggregate by detected_pattern to see exactly which constructs generate false positives, then fix the prompt or disable that pattern. The throughline: separate the two error classes — schema syntax errors are eliminated by tool use; semantic errors (values don't sum, wrong field placement, source conflicts) need explicit validation, targeted retries, and instrumentation.
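
Aggregating dismissals by detected_pattern needs very little code. The sketch below assumes findings are stored as dicts with hypothetical `detected_pattern` and `dismissed` keys.

```python
from collections import Counter


def dismissed_by_pattern(findings):
    """Count dismissed findings per detected_pattern, skipping null patterns."""
    return Counter(
        f["detected_pattern"]
        for f in findings
        if f["dismissed"] and f["detected_pattern"] is not None
    )


log = [
    {"detected_pattern": "null check after non-null return", "dismissed": True},
    {"detected_pattern": "null check after non-null return", "dismissed": True},
    {"detected_pattern": "string-built SQL query", "dismissed": False},
    {"detected_pattern": None, "dismissed": True},
]
top_pattern, top_count = dismissed_by_pattern(log).most_common(1)[0]
```

The most frequently dismissed pattern is the first candidate for a prompt fix or a temporary disable.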

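Correct pattern (python): Retry with specific validation errors appended
import json
from decimal import Decimal

def validate(extraction):
    """Return a list of human-readable semantic errors (empty if valid)."""
    errors = []
    # Compare numerically (assumes normalized decimal strings) so that
    # "1240.0" and "1240.00" are not reported as a mismatch.
    if Decimal(extraction["stated_total"]) != Decimal(extraction["calculated_total"]):
        errors.append(
            f"stated_total {extraction['stated_total']} != sum of line items "
            f"{extraction['calculated_total']}"
        )
    return errors

# client, extract_tool, original_document, and result come from the first extraction call.
errors = validate(result)
if errors:
    retry = client.messages.create(
        model="claude-opus-4-8", max_tokens=1024, tools=[extract_tool],
        tool_choice={"type": "tool", "name": "extract"},
        messages=[
            # Original document, then the failed extraction, then the specific errors.
            {"role": "user", "content": original_document},
            {"role": "assistant", "content": json.dumps(result)},
            {"role": "user", "content": "Fix these validation errors: " + "; ".join(errors)},
        ],
    )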
Correct pattern (json): Schema fields that enable self-validation and conflict surfacing
{
  "stated_total": {"type": "string"},
  "calculated_total": {"type": "string"},
  "conflict_detected": {"type": "boolean"},
  "detected_pattern": {"type": ["string", "null"]}
}
Anti-patterns & traps
Trap: Retrying extraction repeatedly when a required value isn't present in the supplied document.
Why it fails: The information does not exist in the model's input, so every retry burns cost and still cannot produce the value.
Correct pattern: Distinguish missing-data from format errors; make the field nullable (return null) or fetch the external source rather than looping.

Trap: Retrying with a generic 'that was wrong, try again' message.
Why it fails: Without the specific error, the model has no signal about what to change and may repeat the same mistake.
Correct pattern: Append the exact validation errors (and the failed extraction) to the retry so the model can target the correction.
Must-know
  • Retry-with-error-feedback: resend the original document + failed extraction + specific validation errors so the model self-corrects.
  • Retries fix format/structural errors but CANNOT recover information absent from the source; for missing data, return null or fetch the source instead of looping.
  • Build self-correction into the schema: emit stated_total AND calculated_total to flag arithmetic discrepancies; add conflict_detected for inconsistent source data.
  • Add a detected_pattern field to findings so dismissals can be aggregated to identify false-positive patterns.
  • Schema syntax errors are eliminated by tool use; semantic errors require explicit validation logic.
Task 4.5

Design efficient batch processing strategies

Official objective — Knowledge & Skills
Knowledge of
  • The Message Batches API: 50% cost savings, up to 24-hour processing window, no guaranteed latency SLA
  • Batch processing is appropriate for non-blocking, latency-tolerant workloads (overnight reports, weekly audits, nightly test generation) and inappropriate for blocking workflows (pre-merge checks)
  • The batch API does not support multi-turn tool calling within a single request (cannot execute tools mid-request and return results)
  • custom_id fields for correlating batch request/response pairs
Skills in
  • Matching API approach to workflow latency requirements: synchronous API for blocking pre-merge checks, batch API for overnight/weekly analysis
  • Calculating batch submission frequency based on SLA constraints (e.g., 4-hour windows to guarantee 30-hour SLA with 24-hour batch processing)
  • Handling batch failures: resubmitting only failed documents (identified by custom_id) with appropriate modifications (e.g., chunking documents that exceeded context limits)
  • Using prompt refinement on a sample set before batch-processing large volumes to maximize first-pass success rates and reduce iterative resubmission costs

The Message Batches API processes large volumes of Messages requests asynchronously at 50% of standard API prices. Most batches finish in under an hour, but the only guarantee is that results are available when all requests complete or after 24 hours, whichever comes first — there is no latency SLA, and requests not completed within 24 hours expire (and are not billed). This makes batch processing the right choice for non-blocking, latency-tolerant workloads (overnight reports, weekly audits, nightly test generation) and the wrong choice for blocking workflows like pre-merge checks, where you must use the synchronous Messages API.

Each batch request carries a custom_id (1-64 chars, ^[a-zA-Z0-9_-]+$) that you use to correlate responses with requests — essential because results may return in any order, not the submission order. A batch is capped at 100,000 requests or 256 MB. While batches CAN include tool use (and the request params are standard Messages params), the practical constraint for the exam is that batch requests are not interactive: you cannot pause mid-request to execute a client tool and feed the result back within the same request, so multi-turn client-tool loops belong in the synchronous API.
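
Correlation by custom_id can be sketched with plain dictionaries. The SDK returns result objects rather than dicts, so the shapes below are simplified stand-ins, and `is_valid_custom_id` and `correlate` are hypothetical helpers.

```python
import re

# Mirrors the custom_id constraint stated above: 1-64 chars of [a-zA-Z0-9_-].
CUSTOM_ID_PATTERN = re.compile(r"[a-zA-Z0-9_-]{1,64}")


def is_valid_custom_id(custom_id):
    """Check a custom_id locally before submitting the batch."""
    return CUSTOM_ID_PATTERN.fullmatch(custom_id) is not None


def correlate(documents_by_id, results):
    """Pair each result with its source document by custom_id, ignoring order."""
    paired = {}
    for result in results:
        cid = result["custom_id"]
        if cid not in documents_by_id:
            raise KeyError(f"unexpected custom_id in results: {cid}")
        paired[cid] = (documents_by_id[cid], result["result"])
    return paired


documents = {"doc-001": "first invoice", "doc-002": "second invoice"}
# Results arrive in a different order than the requests were submitted.
results = [
    {"custom_id": "doc-002", "result": {"type": "succeeded"}},
    {"custom_id": "doc-001", "result": {"type": "errored"}},
]
paired = correlate(documents, results)
```

Matching by position would have attached the errored result to the second invoice; keying on custom_id avoids that silently wrong pairing.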

Matching frequency to an SLA is a tested calculation. In the worst case a document arrives just after a submission, waits one full submission interval, and then takes the full 24 hours to process, so submission interval + 24h must not exceed the SLA. For a 30-hour end-to-end SLA the interval can be at most 6 hours; submitting every 4 hours leaves a 2-hour buffer for result retrieval and downstream handling.
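
The arithmetic is simple enough to encode as a guard in a scheduler. A minimal sketch, with `max_submission_interval` as a hypothetical helper and the buffer as an assumed allowance for retrieval and post-processing:

```python
def max_submission_interval(sla_hours, worst_case_processing_hours=24, buffer_hours=0):
    """Longest gap between batch submissions that still honors the SLA."""
    interval = sla_hours - worst_case_processing_hours - buffer_hours
    if interval <= 0:
        raise ValueError("SLA is too tight for batch processing; use the synchronous API")
    return interval


tight = max_submission_interval(30)                     # no buffer
with_buffer = max_submission_interval(30, buffer_hours=2)
```

Any SLA of 24 hours or less raises immediately, which matches the rule that such workloads do not belong in a batch.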

Failure handling is per-request. Results carry a type of succeeded, errored, canceled, or expired. Resubmit only the failed items, keyed by custom_id, applying appropriate fixes — for instance, chunking a document that exceeded the context window before resubmitting, or simply retrying transient server errors (invalid_request_error requires fixing the request body first). Finally, refine and test your prompt on a small sample synchronously before committing a large batch; maximizing first-pass success rate avoids expensive iterative resubmission across thousands of requests.
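
Rebuilding a failed request after chunking might look like the sketch below. It splits by character count purely for illustration (real limits are in tokens), and the `-partN` suffix scheme and the `rebuild_requests` helper are assumptions; derived ids must still fit the 64-character custom_id limit.

```python
def chunk_text(text, max_chars):
    """Split text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def rebuild_requests(custom_id, document, params, max_chars):
    """Turn one oversized request into several smaller ones with derived ids."""
    requests = []
    for n, chunk in enumerate(chunk_text(document, max_chars), start=1):
        requests.append({
            "custom_id": f"{custom_id}-part{n}",
            # Reuse the original params, replacing only the message content.
            "params": {**params, "messages": [{"role": "user", "content": chunk}]},
        })
    return requests


base_params = {"model": "claude-opus-4-8", "max_tokens": 1024}
retry_requests = rebuild_requests("doc-001", "abcdefghij", base_params, max_chars=4)
```

Because the derived ids keep the original custom_id as a prefix, chunk results can be regrouped under the source document after the retry batch completes.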

Correct pattern (python): Creating a batch with custom_id for correlation
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# doc_001 and doc_002 are document strings loaded elsewhere.
batch = client.messages.batches.create(requests=[
    {"custom_id": "doc-001", "params": {
        "model": "claude-opus-4-8", "max_tokens": 1024,
        "messages": [{"role": "user", "content": doc_001}]}},
    {"custom_id": "doc-002", "params": {
        "model": "claude-opus-4-8", "max_tokens": 1024,
        "messages": [{"role": "user", "content": doc_002}]}},
])
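Correct pattern (python): Resubmitting only failed/expired items by custom_id
# Results are available once the batch's processing_status is "ended".
retry_requests = []
for r in client.messages.batches.results(batch.id):
    if r.result.type in ("errored", "expired"):
        # fix as needed (e.g., chunk oversized doc) then resubmit just this custom_id;
        # rebuild_request is your own helper that returns a fresh request dict
        retry_requests.append(rebuild_request(r.custom_id))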
Anti-pattern (text): Choosing batch for a blocking workflow
Run the pre-merge security review via the Batch API to save 50%.
Anti-patterns & traps
Trap | Why it fails | Correct pattern
Using the Batch API for a blocking pre-merge check to save cost. | Batch has no latency SLA and can take up to 24 hours, so it cannot gate a merge that must complete in seconds/minutes. | Use the synchronous Messages API for blocking checks; reserve batch for overnight/weekly latency-tolerant jobs.
Assuming batch results come back in submission order. | Results can be returned in any order, so positional matching silently misassigns outputs to inputs. | Always correlate each result to its request via custom_id.
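The ordering trap can be illustrated with a minimal sketch that pairs inputs and outputs by key rather than by position. The dict shapes and the `correlate` helper are hypothetical, chosen only to show the idea.

```python
def correlate(inputs, results):
    """Pair each input with its result via custom_id, never by position.

    `inputs` maps custom_id -> original document text.
    `results` is a list of {"custom_id": str, "output": str} in ANY order
    (a hypothetical simplified shape).
    """
    by_id = {r["custom_id"]: r["output"] for r in results}
    missing = [cid for cid in inputs if cid not in by_id]
    if missing:
        raise KeyError(f"no result for: {missing}")
    return {cid: (inputs[cid], by_id[cid]) for cid in inputs}


inputs = {"doc-001": "first document", "doc-002": "second document"}
# results arrive in a different order than they were submitted
results = [
    {"custom_id": "doc-002", "output": "summary of second"},
    {"custom_id": "doc-001", "output": "summary of first"},
]
paired = correlate(inputs, results)
print(paired["doc-001"])  # ('first document', 'summary of first')
```

Zipping `inputs` with `results` positionally would have attached the second document's summary to the first document without raising any error.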
Must-know
  • Message Batches API: 50% cheaper, asynchronous, up to a 24-hour window, NO latency SLA; unfinished requests expire at 24h (unbilled).
  • Use batch for non-blocking/latency-tolerant work (overnight, weekly); use the synchronous API for blocking pre-merge checks.
  • custom_id correlates requests with responses; results can return out of order, so never rely on ordering.
  • Batch requests are not interactive: no mid-request client-tool execution-and-return within a single request.
  • Size SLA windows so submission interval + 24h worst-case processing stays within the promised SLA (e.g., ~4-6h cadence for a 30h SLA).
  • Resubmit only failed items by custom_id with fixes (e.g., chunk oversized docs); test prompts on a sample first to raise first-pass success.
Task 4.6

Design multi-instance and multi-pass review architectures

Official objective — Knowledge & Skills
Knowledge of
  • Self-review limitations: a model retains reasoning context from generation, making it less likely to question its own decisions in the same session
  • Independent review instances (without prior reasoning context) are more effective at catching subtle issues than self-review instructions or extended thinking
  • Multi-pass review: splitting large reviews into per-file local analysis passes plus cross-file integration passes to avoid attention dilution and contradictory findings
Skills in
  • Using a second independent Claude instance to review generated code without the generator's reasoning context
  • Splitting large multi-file reviews into focused per-file passes for local issues plus separate integration passes for cross-file data flow analysis
  • Running verification passes where the model self-reports confidence alongside each finding to enable calibrated review routing

Self-review is structurally weak. When the same Claude instance that generated an answer is asked to critique it in the same session, it retains the reasoning context that led to its decisions and is therefore less likely to question them — it tends to rationalize rather than re-examine. Neither a 'now review your own work' instruction nor extended thinking fully overcomes this. The architecturally stronger pattern is an independent review instance: a second Claude call that sees the artifact (the generated code or extraction) but not the generator's reasoning trace. Without the prior context, the reviewer approaches the work fresh and catches subtle issues the author missed. This mirrors Anthropic's broader guidance that for tasks with multiple considerations, separate LLM calls focused on each consideration outperform one call doing everything (the evaluator-optimizer and sectioning patterns).

Large multi-file reviews degrade for a different reason: attention dilution. Asking one pass to review many files at once spreads the model thin, producing shallow and sometimes contradictory findings. The fix is multi-pass review. Run focused per-file passes that look only for local issues within each file (where the model can attend deeply), then run separate integration passes that analyze cross-file concerns — data flow between modules, interface contracts, and consistency across files. Splitting local from cross-cutting analysis keeps each pass tractable and avoids the contradictions that arise when one pass juggles both scopes.
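The pass structure can be sketched as a small orchestrator. Here `ask` is a hypothetical callable that sends one prompt to a fresh model call and returns its text; in practice it would wrap a real client. Passing full file contents to the integration pass is a simplification; a real pipeline might pass only extracted interfaces.

```python
from typing import Callable, Dict


def multi_pass_review(files: Dict[str, str], ask: Callable[[str], str]) -> Dict[str, object]:
    """Run one focused local pass per file, then one cross-file integration pass."""
    local = {}
    for path, source in files.items():
        # each local pass sees exactly one file, so attention is not diluted
        local[path] = ask(
            f"Review ONLY the file {path} for local correctness and security "
            f"issues.\n\n{source}"
        )

    combined = "\n\n".join(f"# {path}\n{source}" for path, source in files.items())
    # the integration pass is scoped to cross-file concerns only
    integration = ask(
        "Check cross-file data flow, interface contract mismatches, and "
        "consistency across these files. Do not re-report issues local to a "
        "single file.\n\n" + combined
    )
    return {"local": local, "integration": integration}


# Stand-in for a model call so the sketch runs without network access.
calls = []

def fake_ask(prompt: str) -> str:
    calls.append(prompt)
    return f"findings for call {len(calls)}"

report = multi_pass_review({"a.py": "x = 1", "b.py": "y = x"}, fake_ask)
print(len(calls))  # 3: two local passes plus one integration pass
```

Because each pass is a separate call, the local passes can run in parallel and the integration pass never has to reconcile two scopes in one prompt.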

A complementary technique is calibrated verification: run a pass in which the model self-reports a confidence level alongside each finding. You then route reviews by confidence — auto-accept high-confidence findings, send low-confidence ones to human review or a second instance — turning confidence into a routing signal rather than a filter applied inside a single prompt. Combined, these patterns (independent reviewer, per-file plus integration passes, confidence-routed verification) form the multi-instance / multi-pass architecture the exam expects for high-quality review at scale.
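Confidence-based routing can be sketched as follows. The `Finding` type, the numeric 0.0-1.0 confidence scale, and the 0.9 threshold are all assumptions made for illustration; the source only says that the model self-reports a confidence level per finding.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Finding:
    description: str
    confidence: float  # self-reported by the reviewing model, assumed 0.0-1.0


def route(findings: List[Finding], accept_at: float = 0.9) -> Dict[str, List[Finding]]:
    """Route findings by confidence instead of silently dropping any of them."""
    routed: Dict[str, List[Finding]] = {"auto_accept": [], "escalate": []}
    for f in findings:
        key = "auto_accept" if f.confidence >= accept_at else "escalate"
        routed[key].append(f)
    return routed


findings = [
    Finding("SQL built by string concatenation in get_user()", 0.97),
    Finding("Possible race between cache write and read", 0.55),
]
routed = route(findings)
print([f.description for f in routed["escalate"]])
```

Every finding ends up in one of the two queues. Low-confidence items go to a human or a second instance rather than being filtered out, which is what distinguishes routing from in-prompt confidence filtering.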

Correct pattern (python): Independent reviewer instance (no generator reasoning passed)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# spec is the task description; generate_code is your own wrapper around instance A
code = generate_code(spec)            # instance A produces the artifact
# instance B sees ONLY the artifact + spec, not A's reasoning/thinking
review = client.messages.create(
    model="claude-opus-4-8", max_tokens=2048,
    messages=[{"role": "user",
               "content": f"Review this code against the spec for bugs and security "
                          f"issues. Report each finding with a confidence level.\n\n"
                          f"Spec:\n{spec}\n\nCode:\n{code}"}],
)
Correct pattern (text): Per-file local passes plus a separate cross-file integration pass
Pass 1 (per file): For file X only, report local correctness/security issues.
Pass 2 (per file): For file Y only, report local correctness/security issues.
Pass 3 (integration): Given the interfaces of X and Y, check cross-file data flow,
  contract mismatches, and consistency. Do not re-report local issues.
Anti-patterns & traps
Trap | Why it fails | Correct pattern
Asking the same instance that wrote the code to 'now review your own work,' or relying on extended thinking, to catch its own mistakes. | It retains the reasoning context from generation and tends to justify its decisions rather than question them. | Use a separate independent Claude instance that sees the artifact but not the generator's reasoning trace.
Reviewing many files in a single large pass. | Attention is diluted across files, producing shallow and sometimes contradictory findings. | Run focused per-file passes for local issues plus a separate integration pass for cross-file data flow.
Must-know
  • Self-review is weak: the generating instance keeps its reasoning context and rationalizes its own decisions; instructions to self-review and extended thinking don't fully fix this.
  • A second INDEPENDENT instance (no generator context) catches subtle issues self-review misses.
  • Split large reviews into focused per-file passes (local issues) plus separate integration passes (cross-file data flow) to avoid attention dilution and contradictory findings.
  • Have the model self-report confidence per finding to enable calibrated routing (auto-accept high, escalate low).
  • This mirrors Anthropic's guidance that separate focused LLM calls beat one call handling every consideration.