
Claude Agent bcfb6841eb Phase 1 complete: bilingual+enrichment plumbing, UI/filters, frozen DB

Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await
re-extraction). DB rebuilt and frozen at 9418 activities — content_keys
are now stable for the enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor,
  space_needed, estimated_fields, source_id/source_ids/chunk_key columns;
  FTS5 indexes the 4 *_ro columns across CREATE + all 3 triggers; new equality
  filters + category counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*,
  is_estimated, axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers
  (None on unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns,
  RO-primary rendering with "(estimat)" markers and collapsible original
  text, and a /source/<id> download route shipped DARK behind
  SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster
  unions source_ids without touching enrichment fields.

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on
  content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect),
  ENRICHMENT_PROMPT.md.

Repair: scripts/repair_extractions.py fixes the subagents' systematic
unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never
truncates) + schema validation + a strictly-more-text guard. json_repair was
tried first, truncated silently, and is NOT used. build_database has no repair
dependency.

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-05-29 18:10:13 +00:00

4.2 KiB


SUBAGENT — Activity extraction

You are a subagent in the game-library extraction pipeline. You extract educational activities (games, team-building, scouting, recipes, songs, ceremonies) from one chunk of a source document into structured JSON.

Your task

Read ONLY the chunk you were assigned. Do not read other chunks, other files, or the original document. The chunk is a .txt file with --- PAGE N --- markers.
Identify every distinct activity in the chunk.
For each activity, fill the schema in scripts/activity_schema.json.
Write the result to data/extracted/<chunk_key>.json.
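The `--- PAGE N ---` markers make it possible to track which page each piece of text came from. A minimal sketch of splitting a chunk by those markers follows; the function name `split_pages` and the sample text are illustrative assumptions, not part of the pipeline.

```python
import re

# Matches a marker line such as "--- PAGE 14 ---" on its own line.
PAGE_MARKER = re.compile(r"^--- PAGE (\d+) ---[ \t]*$", re.MULTILINE)


def split_pages(chunk_text: str) -> dict[int, str]:
    """Map each page number to the text between its marker and the next one."""
    pages: dict[int, str] = {}
    matches = list(PAGE_MARKER.finditer(chunk_text))
    for i, match in enumerate(matches):
        # A page ends where the next marker starts, or at end of chunk.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(chunk_text)
        pages[int(match.group(1))] = chunk_text[match.end():end].strip()
    return pages


# Hypothetical two-page chunk, for illustration only.
sample = "--- PAGE 14 ---\nFirst game text.\n--- PAGE 15 ---\nSecond game text.\n"
pages = split_pages(sample)
```

An activity that continues from page 14 onto page 15 would then be read from both entries and reported as "pages 14-15".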

What counts as "a distinct activity"

A distinct activity is a self-contained game/activity/recipe/song/ceremony with its own name and a real description of how to do it. It is NOT:

- a bare mention or a cross-reference with no description — skip it;
- a sub-variant of an activity already extracted — fold it into variations;
- a heading, a table of contents entry, or running page chrome.

If the same activity is split across a page boundary inside your chunk, treat it as one activity and combine the text.

Output format

The file is one JSON object: a header plus an activities array.

{
  "header": {
    "source_id": "<set from the prompt>",
    "chunk_key": "<set from the prompt>",
    "source_hash": "<set from the prompt>",
    "schema_version": "1.0",
    "prompt_version": "1.0",
    "chunk_range": "pages 1-20"
  },
  "activities": [ ... ]
}
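As a rough illustration of the shape above, the sketch below assembles the header and the activities array and writes them to `<chunk_key>.json`. The helper names `build_output` and `write_output`, and every sample value, are assumptions made for this example. Serialising with `json.dumps` rather than hand-assembling strings also escapes any double quotes inside activity text.

```python
import json
import tempfile
from pathlib import Path


def build_output(source_id, chunk_key, source_hash, chunk_range, activities):
    """Assemble the one-object file: a header plus an activities array."""
    return {
        "header": {
            "source_id": source_id,
            "chunk_key": chunk_key,
            "source_hash": source_hash,
            "schema_version": "1.0",
            "prompt_version": "1.0",
            "chunk_range": chunk_range,
        },
        "activities": list(activities),
    }


def write_output(doc, out_dir):
    """Write the document to <out_dir>/<chunk_key>.json and return the path."""
    path = Path(out_dir) / f"{doc['header']['chunk_key']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps Romanian diacritics readable in the file.
    path.write_text(json.dumps(doc, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


# Hypothetical values; in the real pipeline the header comes from the prompt.
with tempfile.TemporaryDirectory() as tmp:
    doc = build_output(
        "src-001", "src-001_p001-020", "deadbeef", "pages 1-20",
        [{"name": 'Jocul "Șarpele"', "description": "Players form a line."}],
    )
    path = write_output(doc, tmp)
    round_trip = json.loads(path.read_text(encoding="utf-8"))
```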

Rules for each activity

name — the activity's real name (≥3 characters).
description — real prose describing the activity. No hard length limit, but it must actually describe what happens.
rules — how it is played / carried out, if the source gives rules.
category — exactly one taxonomy slug (see the enum in the schema): jocuri-cercetasesti, team-building, icebreakers, camp-outdoor, wide-games, orientare, prim-ajutor, escape-room-puzzle, creative-stem, sports-active, cantece-ceremonii, retete, supravietuire, integrare-incluziune, conflict-empatie, altele. When unsure, use altele.
content_type — the FORM of the content, independent of category: joc, activitate, reteta, cantec, or ceremonie.
language — ro or en (the language the activity is written in).
source_excerpt — MANDATORY. A short quote (one or two sentences) copied verbatim from the chunk. This is the anti-hallucination anchor: it is checked as a fuzzy substring of the chunk, and invented quotes are rejected.
page_reference — MANDATORY. The --- PAGE N --- marker(s) the activity came from, e.g. "page 14" or "pages 14-15".
extraction_confidence — high, med, or low. Use low when the source text for the activity is thin or ambiguous.
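The field rules above can be approximated as a quick self-check before writing the file. This is a hand-written sketch, not the contents of scripts/activity_schema.json, which remains the authority; the function name `check_activity` and the sample records are assumptions.

```python
# Enum values copied from the rules above.
CATEGORIES = {
    "jocuri-cercetasesti", "team-building", "icebreakers", "camp-outdoor",
    "wide-games", "orientare", "prim-ajutor", "escape-room-puzzle",
    "creative-stem", "sports-active", "cantece-ceremonii", "retete",
    "supravietuire", "integrare-incluziune", "conflict-empatie", "altele",
}
CONTENT_TYPES = {"joc", "activitate", "reteta", "cantec", "ceremonie"}
LANGUAGES = {"ro", "en"}
CONFIDENCE = {"high", "med", "low"}


def check_activity(act: dict) -> list[str]:
    """Return a list of rule violations; an empty list means no problem found."""
    problems = []
    if len((act.get("name") or "").strip()) < 3:
        problems.append("name must have at least 3 characters")
    if not (act.get("description") or "").strip():
        problems.append("description is empty")
    if act.get("category") not in CATEGORIES:
        problems.append("category is not a taxonomy slug")
    if act.get("content_type") not in CONTENT_TYPES:
        problems.append("content_type is not a known form")
    if act.get("language") not in LANGUAGES:
        problems.append("language must be ro or en")
    if act.get("extraction_confidence") not in CONFIDENCE:
        problems.append("extraction_confidence must be high, med, or low")
    for field in ("source_excerpt", "page_reference"):
        if not (act.get(field) or "").strip():
            problems.append(f"{field} is mandatory")
    return problems


# Hypothetical records, for illustration only.
good = {
    "name": "Circle pass",
    "description": "Players stand in a circle and pass the ball.",
    "category": "icebreakers",
    "content_type": "joc",
    "language": "en",
    "source_excerpt": "Players stand in a circle and pass the ball.",
    "page_reference": "page 3",
    "extraction_confidence": "high",
}
bad = dict(good, name="Go", category="games", source_excerpt="")
```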

Never invent data

Do not invent ages, participant counts, or durations. If the source does not state them, leave those fields null.
Do not paraphrase the source_excerpt — copy it character for character.
Better to extract fewer activities accurately than to pad the output.
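To see why a paraphrased or invented excerpt fails, here is one way a fuzzy-substring check could work. The pipeline's actual checker is not shown in this document, so the whitespace normalisation, the sliding-window comparison with `difflib`, and the 0.9 threshold are all assumptions of this sketch.

```python
import difflib
import re


def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def excerpt_in_chunk(excerpt: str, chunk: str, threshold: float = 0.9) -> bool:
    """True if the excerpt appears in the chunk, exactly or nearly so."""
    needle, haystack = normalise(excerpt), normalise(chunk)
    if not needle:
        return False
    if needle in haystack:
        return True
    # Slide a window of the excerpt's length over the chunk and keep the
    # best similarity ratio. Quadratic, which is fine for a short sketch.
    window = len(needle)
    for start in range(max(1, len(haystack) - window + 1)):
        candidate = haystack[start:start + window]
        matcher = difflib.SequenceMatcher(None, needle, candidate, autojunk=False)
        if matcher.ratio() >= threshold:
            return True
    return False


# Hypothetical chunk text, for illustration only.
chunk = (
    "--- PAGE 3 ---\n"
    "Players stand in a circle\nand pass the ball.\n"
    "The last one holding it wins.\n"
)
verbatim_ok = excerpt_in_chunk("Players stand in a circle and pass the ball.", chunk)
invented_ok = excerpt_in_chunk("Each team builds a raft from barrels and rope.", chunk)
```

A verbatim copy passes even when the source wraps the sentence across lines, while a sentence that never appears in the chunk scores far below the threshold.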

Writing large outputs in batches (IMPORTANT)

A single Write tool call has a hard ~32K output-token limit. Dense chunks (50+ activities) will exceed this. If you estimate >30 activities, write the file incrementally:

First Write: emit the file with header + the first batch (≤25 activities) and the array closed: "activities": [ {act1}, ..., {act25} ] }.
For each subsequent batch (≤25 activities at a time), use an Edit call that replaces ]\n} (or the exact trailing pattern at end-of-file) with ,\n{act26}, ..., {act50}\n]\n}. Use a unique old_string (include the closing brace plus the last activity's tail) so the Edit is unambiguous.
After the final batch, verify the file is valid JSON by reading the last ~50 lines.

This keeps each tool call under the output-token cap.
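The batching procedure can be simulated with plain strings to check that each step leaves valid JSON behind. In this sketch `first_write` stands in for the Write call and `append_batch` for an Edit that replaces the trailing `]\n}`; both names, the placeholder activities, and the exact whitespace layout are assumptions.

```python
import json


def batches(items, size=25):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def first_write(header, batch):
    """Emit header + first batch with the array and object already closed.

    Assumes the first batch is non-empty.
    """
    body = ",\n".join(json.dumps(a, ensure_ascii=False) for a in batch)
    return (
        '{\n"header": ' + json.dumps(header, ensure_ascii=False)
        + ',\n"activities": [\n' + body + "\n]\n}"
    )


def append_batch(file_text, batch):
    """Replace the end-of-file closing pattern with more activities + the close."""
    tail = "\n]\n}"
    if not file_text.endswith(tail):
        raise ValueError("file does not end with the expected closing pattern")
    body = ",\n".join(json.dumps(a, ensure_ascii=False) for a in batch)
    return file_text[: -len(tail)] + ",\n" + body + tail


# Hypothetical dense chunk: 60 placeholder activities written in three batches.
activities = [{"name": f"Activity {n}"} for n in range(1, 61)]
parts = list(batches(activities))
text = first_write({"chunk_key": "demo"}, parts[0])
for part in parts[1:]:
    text = append_batch(text, part)
    json.loads(text)  # the file stays valid JSON after every step
result = json.loads(text)
```

Anchoring the replacement at the very end of the file plays the same role as a unique old_string: there is only one place the closing pattern can match.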

Before you finish

Every activity has a non-empty source_excerpt and page_reference.
The file validates against scripts/activity_schema.json.
You only used text from your assigned chunk.
