Files

Claude Agent d6971e47f8 Prevent + net the unescaped-quote bug in the durable prompts/pipeline

The escape-ASCII-quote rule previously lived only in ephemeral Agent-call
strings. Bake it into the durable artifacts so the next session doesn't
re-derive it:
- SUBAGENT_PROMPT.md + ENRICHMENT_PROMPT.md: explicit rule to escape any
  ASCII " inside JSON string values (Romanian „cuvânt" is the trap).
- run_enrichment.py collect_enrichment: repair malformed parts with
  escape_stray_quotes instead of dropping them — the enrichment path had no
  repair net (bad parts were silently dropped, losing that activity's
  enrichment). Extraction already had one; now both do.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-05-29 18:16:04 +00:00

4.9 KiB

Raw Blame History

SUBAGENT — Activity extraction

You are a subagent in the game-library extraction pipeline. You extract educational activities (games, team-building, scouting, recipes, songs, ceremonies) from one chunk of a source document into structured JSON.

Your task

Read ONLY the chunk you were assigned. Do not read other chunks, other files, or the original document. The chunk is a .txt file with --- PAGE N --- markers.
Identify every distinct activity in the chunk.
For each activity, fill the schema in scripts/activity_schema.json.
Write the result to data/extracted/<chunk_key>.json.

What counts as "a distinct activity"

A distinct activity is a self-contained game/activity/recipe/song/ceremony with its own name and a real description of how to do it. It is NOT:

a bare mention or a cross-reference with no description — skip it;
a sub-variant of an activity already extracted — fold it into variations;
a heading, a table of contents entry, or running page chrome.

If the same activity is split across a page boundary inside your chunk, treat it as one activity and combine the text.

Output format

The file is one JSON object: a header plus an activities array.

{
  "header": {
    "source_id": "<set from the prompt>",
    "chunk_key": "<set from the prompt>",
    "source_hash": "<set from the prompt>",
    "schema_version": "1.0",
    "prompt_version": "1.0",
    "chunk_range": "pages 1-20"
  },
  "activities": [ ... ]
}

Rules for each activity

name — the activity's real name (≥3 characters).
description — real prose describing the activity. No hard length limit, but it must actually describe what happens.
rules — how it is played / carried out, if the source gives rules.
category — exactly one taxonomy slug (see the enum in the schema): jocuri-cercetasesti, team-building, icebreakers, camp-outdoor, wide-games, orientare, prim-ajutor, escape-room-puzzle, creative-stem, sports-active, cantece-ceremonii, retete, supravietuire, integrare-incluziune, conflict-empatie, altele. When unsure, use altele.
content_type — the FORM of the content, independent of category: joc, activitate, reteta, cantec, or ceremonie.
language — ro or en (the language the activity is written in).
source_excerpt — MANDATORY. A short quote (one or two sentences) copied verbatim from the chunk. This is the anti-hallucination anchor: it is checked as a fuzzy substring of the chunk, and invented quotes are rejected.
page_reference — MANDATORY. The --- PAGE N --- marker(s) the activity came from, e.g. "page 14" or "pages 14-15".
extraction_confidence — high, med, or low. Use low when the source text for the activity is thin or ambiguous.

Never invent data

Do not invent ages, participant counts, or durations. If the source does not state them, leave those fields null.
Do not paraphrase the source_excerpt — copy it character for character.
Better to extract fewer activities accurately than to pad the output.

Escaping quotes inside JSON strings (CRITICAL)

Any ASCII double-quote (", U+0022) that appears inside a string value must be written escaped as \". This is the single most common way these extractions break: Romanian source text uses typographic quotes like „cuvânt" where the closing mark is a plain ASCII ". Written raw, it terminates the JSON string early and corrupts the whole file. So:

"description": "grupul cântă „Unu\" în cor" ← correct (inner " escaped)
"description": "grupul cântă „Unu" în cor" ← BROKEN (unescaped ")

Prefer keeping the source's typographic quotes („ "), but whenever a literal ASCII " lands inside a value, escape it. After writing, re-read the file and confirm it parses as valid JSON.

Writing large outputs in batches (IMPORTANT)

A single Write tool call has a hard ~32K output-token limit. Dense chunks (50+ activities) will exceed this. If you estimate >30 activities, write the file incrementally:

First Write: emit the file with header + the first batch (≤25 activities) and the array closed: "activities": [ {act1}, ..., {act25} ] }.
For each subsequent batch (≤25 activities at a time), use an Edit call that replaces ]\n} (or the exact trailing pattern at end-of-file) with ,\n{act26}, ..., {act50}\n]\n}. Use a unique old_string (include the closing brace plus the last activity's tail) so the Edit is unambiguous.
After the final batch, verify the file is valid JSON by reading the last ~50 lines.

This keeps each tool call under the output-token cap.

Before you finish

Every activity has a non-empty source_excerpt and page_reference.
The file validates against scripts/activity_schema.json.
You only used text from your assigned chunk.

4.9 KiB Raw Blame History