Phase 1 complete: bilingual+enrichment plumbing, UI/filters, frozen DB

Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await re-extraction). DB rebuilt and frozen at 9418 activities; content_keys are now stable for the enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor, space_needed, estimated_fields, source_id/source_ids/chunk_key columns; FTS5 indexes the 4 *_ro columns across CREATE + all 3 triggers; new equality filters + category counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*, is_estimated, axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers (None on unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns, RO-primary rendering with "(estimat)" markers and collapsible original text, and a /source/<id> download route shipped DARK behind SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster unions source_ids without touching enrichment fields.

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect), ENRICHMENT_PROMPT.md.

Repair: scripts/repair_extractions.py fixes the subagents' systematic unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never truncates) + schema validation + a strictly-more-text guard. json_repair was tried first, truncated silently, and is NOT used. build_database has no repair dependency.

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 18:10:13 +00:00
parent 46d9592a55
commit bcfb6841eb
18 changed files with 1579 additions and 167 deletions
--- a/scripts/ENRICHMENT_PROMPT.md
+++ b/scripts/ENRICHMENT_PROMPT.md
@@ -0,0 +1,98 @@
+# SUBAGENT — Activity enrichment
+
+You are a subagent in the game-library enrichment pipeline. You take ONE already
+extracted activity and produce a single enrichment pass: a faithful Romanian
+rendering plus a few inferred filter fields. You do **one** activity per prompt.
+
+This is **not** re-extraction. The activity text already exists and is trusted.
+Your job is to translate it and add filter metadata — never to re-discover or
+re-interpret the activity.
+
+## Your task
+
+The prompt gives you two blocks:
+
+1. **Current activity values** — the existing fields (name, description, rules,
+   variations, language, and any participants/duration/age already set).
+2. **Source chunk text** — the original passage the activity came from. This is
+   your ground truth for any expansion. It may be unavailable; if so, translate
+   only what is in the current values and do not invent anything.
+
+Produce one JSON object and write it to the path named in the prompt
+(`data/enrichment_parts/<content_key>.json`). It MUST contain the exact
+`content_key` string from the prompt.
+
+## Rules
+
+### Translation (always)
+- Translate `name`, `description`, `rules`, `variations` into natural, fluent
+  Romanian → `name_ro`, `description_ro`, `rules_ro`, `variations_ro`.
+- If a field is already Romanian, still copy a clean Romanian version into the
+  `*_ro` twin (lightly polished). If a source field is empty/null, omit its
+  `*_ro` twin entirely (do not emit empty strings).
+- Translate faithfully. Keep proper names, do not add moralizing, do not change
+  the rules of the game.
+
+### Description expansion (constrained)
+- You MAY make `description_ro` richer than a literal translation — but ONLY
+  using detail that is actually present in the **source chunk text**. Fold in
+  setup, steps, or materials that the source states but the short description
+  omitted.
+- You may NOT invent steps, counts, durations, or variations that are not in the
+  source. If the source is thin, the translation stays thin. Hallucinated
+  expansion is the one unacceptable failure here.
+
+### Inferred filter fields (mark when inferred)
+Fill these when you can, using the source text first, then reasonable inference:
+
+- `indoor_outdoor`: one of `indoor`, `outdoor`, `either`.
+- `space_needed`: one of `mic`, `mediu`, `mare` (small / medium / large area).
+- `participants_min`, `participants_max`: integers (people).
+- `duration_min`, `duration_max`: integers (minutes).
+- `age_group_min`, `age_group_max`: integers (years).
+
+For any of these fields whose value you **inferred** (the source did not state
+it explicitly), add the field name to the `estimated_fields` array. If the
+source explicitly states a value, set the field but do NOT list it in
+`estimated_fields`. Omit a field entirely if you have no basis at all — do not
+guess wildly just to fill it.
+
+Do not contradict a value already present in the current activity values unless
+the source text clearly supports a correction.
+
+## Enum vocabulary (fixed — use these exact slugs)
+
+- `indoor_outdoor`: `indoor` | `outdoor` | `either`
+- `space_needed`: `mic` | `mediu` | `mare`
+
+## Output format
+
+Write exactly one JSON object to `data/enrichment_parts/<content_key>.json`:
+
+```json
+{
+  "content_key": "<the exact key from the prompt>",
+  "name_ro": "…",
+  "description_ro": "…",
+  "rules_ro": "…",
+  "variations_ro": "…",
+  "indoor_outdoor": "outdoor",
+  "space_needed": "mediu",
+  "participants_min": 6,
+  "participants_max": 20,
+  "duration_min": 15,
+  "duration_max": 30,
+  "age_group_min": 8,
+  "age_group_max": 14,
+  "estimated_fields": ["space_needed", "duration_min", "duration_max"]
+}
+```
+
+Include only the fields you actually fill. Always include `content_key` and
+`estimated_fields` (use `[]` if nothing was inferred). Output valid JSON only —
+no commentary, no markdown fences in the file itself.
+
+## Report
+
+After writing the file, report in under 30 words: the activity name and which
+fields you estimated.
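
The output contract in ENRICHMENT_PROMPT.md can be checked mechanically before a part file is merged. The following is a minimal sketch of such a check, not the project's actual validation code: `validate_enrichment` and the constant names are hypothetical, and two of its rules (min must not exceed max, and every name in `estimated_fields` must refer to a field that is actually set) are assumptions read into the prompt rather than rules it states outright.

```python
# Hypothetical validator for one enrichment part; names are illustrative only.
INDOOR_OUTDOOR = {"indoor", "outdoor", "either"}
SPACE_NEEDED = {"mic", "mediu", "mare"}
RO_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
INT_FIELDS = (
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
)
ESTIMABLE = {"indoor_outdoor", "space_needed", *INT_FIELDS}


def validate_enrichment(record: dict, expected_key: str) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []

    # The prompt requires the exact content_key it was given.
    if record.get("content_key") != expected_key:
        problems.append("content_key mismatch")

    # estimated_fields is always required, even when empty.
    estimated = record.get("estimated_fields")
    if not isinstance(estimated, list):
        problems.append("estimated_fields must be a list")
        estimated = []

    # *_ro twins are omitted, never emitted as empty strings.
    for field in RO_FIELDS:
        if field in record:
            value = record[field]
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{field} must be omitted rather than empty")

    # Enum fields must use the fixed slugs.
    for field, allowed in (("indoor_outdoor", INDOOR_OUTDOOR),
                           ("space_needed", SPACE_NEEDED)):
        if field in record:
            value = record[field]
            if not isinstance(value, str) or value not in allowed:
                problems.append(f"{field} has unknown slug {value!r}")

    # Numeric fields must be integers (bool is excluded explicitly).
    for field in INT_FIELDS:
        if field in record:
            value = record[field]
            if isinstance(value, bool) or not isinstance(value, int):
                problems.append(f"{field} must be an integer")

    # Assumption: a min above its max is treated as an error.
    for prefix in ("participants", "duration", "age_group"):
        low, high = record.get(f"{prefix}_min"), record.get(f"{prefix}_max")
        if isinstance(low, int) and isinstance(high, int) and low > high:
            problems.append(f"{prefix}_min exceeds {prefix}_max")

    # Assumption: only filter fields that are actually set may be marked estimated.
    for name in estimated:
        if not isinstance(name, str) or name not in ESTIMABLE:
            problems.append(f"estimated_fields names unknown field {name!r}")
        elif name not in record:
            problems.append(f"estimated_fields names unset field {name!r}")

    return problems
```

A caller could load each part file with `json.load` and skip or flag any record for which the returned list is non-empty, leaving the decision about what to do with failures to the surrounding pipeline.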