Phase 1 complete: bilingual+enrichment plumbing, UI/filters, frozen DB

Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await
re-extraction). DB rebuilt and frozen at 9418 activities; content_keys are now
stable for the enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor,
  space_needed, estimated_fields, source_id/source_ids/chunk_key columns; FTS5
  indexes the 4 *_ro columns across CREATE + all 3 triggers; new equality
  filters + category counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*,
  is_estimated, axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers (None on
  unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns,
  RO-primary rendering with "(estimat)" markers and collapsible original text,
  and a /source/<id> download route shipped DARK behind
  SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster
  unions source_ids without touching enrichment fields.

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on
  content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect),
  ENRICHMENT_PROMPT.md.

Repair: scripts/repair_extractions.py fixes the subagents' systematic
unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never
truncates) + schema validation + a strictly-more-text guard. json_repair was
tried first, truncated silently, and is NOT used. build_database has no repair
dependency.

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 18:10:13 +00:00
parent 46d9592a55
commit bcfb6841eb
18 changed files with 1579 additions and 167 deletions
--- a/scripts/ENRICHMENT_PROMPT.md
+++ b/scripts/ENRICHMENT_PROMPT.md
@@ -0,0 +1,98 @@
+# SUBAGENT — Activity enrichment
+
+You are a subagent in the game-library enrichment pipeline. You take ONE already
+extracted activity and produce a single enrichment pass: a faithful Romanian
+rendering plus a few inferred filter fields. You do **one** activity per prompt.
+
+This is **not** re-extraction. The activity text already exists and is trusted.
+Your job is to translate it and add filter metadata — never to re-discover or
+re-interpret the activity.
+
+## Your task
+
+The prompt gives you two blocks:
+
+1. **Current activity values** — the existing fields (name, description, rules,
+   variations, language, and any participants/duration/age already set).
+2. **Source chunk text** — the original passage the activity came from. This is
+   your ground truth for any expansion. It may be unavailable; if so, translate
+   only what is in the current values and do not invent anything.
+
+Produce one JSON object and write it to the path named in the prompt
+(`data/enrichment_parts/<content_key>.json`). It MUST contain the exact
+`content_key` string from the prompt.
+
+## Rules
+
+### Translation (always)
+- Translate `name`, `description`, `rules`, `variations` into natural, fluent
+  Romanian → `name_ro`, `description_ro`, `rules_ro`, `variations_ro`.
+- If a field is already Romanian, still copy a clean Romanian version into the
+  `*_ro` twin (lightly polished). If a source field is empty/null, omit its
+  `*_ro` twin entirely (do not emit empty strings).
+- Translate faithfully. Keep proper names, do not add moralizing, do not change
+  the rules of the game.
+
+### Description expansion (constrained)
+- You MAY make `description_ro` richer than a literal translation — but ONLY
+  using detail that is actually present in the **source chunk text**. Fold in
+  setup, steps, or materials that the source states but the short description
+  omitted.
+- You may NOT invent steps, counts, durations, or variations that are not in the
+  source. If the source is thin, the translation stays thin. Hallucinated
+  expansion is the one unacceptable failure here.
+
+### Inferred filter fields (mark when inferred)
+Fill these when you can, using the source text first, then reasonable inference:
+
+- `indoor_outdoor`: one of `indoor`, `outdoor`, `either`.
+- `space_needed`: one of `mic`, `mediu`, `mare` (small / medium / large area).
+- `participants_min`, `participants_max`: integers (people).
+- `duration_min`, `duration_max`: integers (minutes).
+- `age_group_min`, `age_group_max`: integers (years).
+
+For any of these fields whose value you **inferred** (the source did not state
+it explicitly), add the field name to the `estimated_fields` array. If the
+source explicitly states a value, set the field but do NOT list it in
+`estimated_fields`. Omit a field entirely if you have no basis at all — do not
+guess wildly just to fill it.
+
+Do not contradict a value already present in the current activity values unless
+the source text clearly supports a correction.
+
+## Enum vocabulary (fixed — use these exact slugs)
+
+- `indoor_outdoor`: `indoor` | `outdoor` | `either`
+- `space_needed`: `mic` | `mediu` | `mare`
+
+## Output format
+
+Write exactly one JSON object to `data/enrichment_parts/<content_key>.json`:
+
+```json
+{
+  "content_key": "<the exact key from the prompt>",
+  "name_ro": "…",
+  "description_ro": "…",
+  "rules_ro": "…",
+  "variations_ro": "…",
+  "indoor_outdoor": "outdoor",
+  "space_needed": "mediu",
+  "participants_min": 6,
+  "participants_max": 20,
+  "duration_min": 15,
+  "duration_max": 30,
+  "age_group_min": 8,
+  "age_group_max": 14,
+  "estimated_fields": ["space_needed", "duration_min", "duration_max"]
+}
+```
+
+Include only the fields you actually fill. Always include `content_key` and
+`estimated_fields` (use `[]` if nothing was inferred). Output valid JSON only —
+no commentary, no markdown fences in the file itself.
+
+## Report
+
+After writing the file, report in under 30 words: the activity name and which
+fields you estimated.
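
The output contract in ENRICHMENT_PROMPT.md above can be checked mechanically before a part file is merged. The sketch below is one possible check, not part of this commit: `check_enrichment_part` and its constants are hypothetical names, the field lists and enum slugs are copied from the prompt, and the rule that every name in `estimated_fields` must also be set is an assumption read into the prompt rather than stated by it.

```python
import json

TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
INT_FIELDS = (
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
)
ENUMS = {
    "indoor_outdoor": {"indoor", "outdoor", "either"},
    "space_needed": {"mic", "mediu", "mare"},
}
INFERABLE = set(INT_FIELDS) | set(ENUMS)


def check_enrichment_part(part: dict, expected_key: str) -> list[str]:
    """Hypothetical helper: list contract violations in one part file."""
    problems: list[str] = []
    if part.get("content_key") != expected_key:
        problems.append("content_key does not match the prompt")
    estimated = part.get("estimated_fields")
    if not isinstance(estimated, list):
        problems.append("estimated_fields must be a list ([] if nothing inferred)")
        estimated = []
    for fld in TEXT_FIELDS:
        # Empty twins must be omitted, not emitted as "" or null.
        if fld in part and not (isinstance(part[fld], str) and part[fld].strip()):
            problems.append(f"{fld}: omit the field instead of leaving it empty")
    for fld in INT_FIELDS:
        # bool is a subclass of int in Python, so reject it explicitly.
        if fld in part and (isinstance(part[fld], bool) or not isinstance(part[fld], int)):
            problems.append(f"{fld}: expected an integer")
    for fld, allowed in ENUMS.items():
        if fld in part and not (isinstance(part[fld], str) and part[fld] in allowed):
            problems.append(f"{fld}: not one of {sorted(allowed)}")
    for fld in estimated:
        if not isinstance(fld, str) or fld not in INFERABLE:
            problems.append(f"estimated_fields: {fld!r} is not an inferable field")
        elif fld not in part:
            # Assumption: a field marked as estimated should also carry a value.
            problems.append(f"estimated_fields: {fld} is listed but not set")
    return problems


# Made-up sample values, shaped like the example object in the prompt.
sample = json.loads("""
{
  "content_key": "abc123",
  "name_ro": "Exemplu",
  "indoor_outdoor": "outdoor",
  "space_needed": "mediu",
  "duration_min": 15,
  "estimated_fields": ["space_needed", "duration_min"]
}
""")
print(check_enrichment_part(sample, "abc123"))  # → []
```

A check like this would fit naturally in the `--collect` step, where malformed parts are already tracked, but where to wire it is left open here.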
--- a/scripts/SUBAGENT_PROMPT.md
+++ b/scripts/SUBAGENT_PROMPT.md
@@ -74,6 +74,23 @@ The file is one JSON object: a `header` plus an `activities` array.
 - Do **not** paraphrase the `source_excerpt` — copy it character for character.
 - Better to extract fewer activities accurately than to pad the output.

+## Writing large outputs in batches (IMPORTANT)
+
+A single Write tool call has a hard ~32K output-token limit. Dense chunks
+(50+ activities) will exceed this. If you estimate >30 activities, write the
+file **incrementally**:
+
+1. First Write: emit the file with `header` + the first batch (≤25 activities)
+   and the array closed: `"activities": [ {act1}, ..., {act25} ] }`.
+2. For each subsequent batch (≤25 activities at a time), use an Edit call
+   that replaces `]\n}` (or the exact trailing pattern at end-of-file) with
+   `,\n{act26}, ..., {act50}\n]\n}`. Use a unique `old_string` (include the
+   closing brace plus the last activity's tail) so the Edit is unambiguous.
+3. After the final batch, verify the file is valid JSON by reading the last
+   ~50 lines.
+
+This keeps each tool call under the output-token cap.
+
 ## Before you finish

 - Every activity has a non-empty `source_excerpt` and `page_reference`.
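
The batching procedure added to SUBAGENT_PROMPT.md above amounts to a string splice on the file's closing `]` and `}`. The sketch below shows that splice in plain Python, under the assumption that the file ends with exactly that pattern; `append_batch` is a hypothetical helper for illustration, since the subagent itself performs this step with Write and Edit tool calls rather than with code.

```python
import json

TAIL = "]\n}"  # the trailing pattern the prompt tells the subagent to replace


def append_batch(file_text: str, batch: list[dict]) -> str:
    """Hypothetical helper: splice `batch` in before the closing tail."""
    body = file_text.rstrip()
    if not body.endswith(TAIL):
        raise ValueError("file does not end with the expected closing pattern")
    if not batch:
        return file_text
    head = body[: -len(TAIL)].rstrip()
    items = ",\n".join(json.dumps(a, ensure_ascii=False) for a in batch)
    # An empty array ends in "[", which must not be followed by a comma.
    sep = "\n" if head.endswith("[") else ",\n"
    merged = f"{head}{sep}{items}\n{TAIL}"
    json.loads(merged)  # step 3 of the prompt: fail loudly if the splice broke the JSON
    return merged


first = json.dumps(
    {"header": {"chunk": "demo"}, "activities": [{"name": "act1"}]}, indent=2
)
second = append_batch(first, [{"name": "act2"}, {"name": "act3"}])
names = [a["name"] for a in json.loads(second)["activities"]]
print(names)  # → ['act1', 'act2', 'act3']
```

Parsing the whole result is a stricter check than reading the last ~50 lines, which is what the prompt asks of the subagent.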
--- a/scripts/build_database.py
+++ b/scripts/build_database.py
@@ -86,7 +86,12 @@ def _split_csv(value: Optional[str]) -> list[str]:
    return [p.strip() for p in str(value).split(",") if p.strip()]


-def dict_to_activity(adict: dict, source_file: str) -> Activity:
+def dict_to_activity(
+    adict: dict,
+    source_file: str,
+    source_id: Optional[str] = None,
+    chunk_key: Optional[str] = None,
+) -> Activity:
    """Build an Activity from one extraction-JSON activity object."""
    tags = adict.get("tags") or []
    if isinstance(tags, str):
@@ -99,6 +104,9 @@ def dict_to_activity(adict: dict, source_file: str) -> Activity:
        source_files = [source_file, *source_files]

    return Activity(
+        source_id=source_id,
+        source_ids=[source_id] if source_id else [],
+        chunk_key=chunk_key,
        name=(adict.get("name") or "").strip(),
        description=(adict.get("description") or "").strip(),
        rules=adict.get("rules"),
@@ -206,6 +214,19 @@ def merge_cluster(cluster: list[Activity]) -> Activity:
            if s and s not in sources:
                sources.append(s)
    merged.source_files = sources
+    # source provenance: keep rep's chunk_key/source_id as primary, union the
+    # source_ids for the download route. Enrichment fields (name_ro,
+    # description_ro, indoor_outdoor, ...) are intentionally NOT carried here:
+    # enrichment is applied AFTER dedup (plan Part B2), keyed on the merged
+    # row's content_key, so merging must not pre-populate them.
+    merged.source_id = rep.source_id
+    merged.chunk_key = rep.chunk_key
+    source_ids: list[str] = []
+    for a in cluster:
+        for sid in [a.source_id, *(a.source_ids or [])]:
+            if sid and sid not in source_ids:
+                source_ids.append(sid)
+    merged.source_ids = source_ids
    # popularity_score++ per merged duplicate (plan §4)
    merged.popularity_score = max(a.popularity_score for a in cluster) + (len(cluster) - 1)
    return merged
@@ -313,6 +334,108 @@ def apply_review_decisions(
    return kept, stats


+# --------------------------------------------------------------------------
+# step 5b — enrichment overlay (plan Part B)
+# --------------------------------------------------------------------------
+# Translation / inferred-filter fields written by run_enrichment.py. Applied
+# AFTER dedup + review decisions, keyed on the same stable content_key, so the
+# overlay survives rebuilds as long as extraction text is frozen.
+_ENRICHMENT_TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
+_ENRICHMENT_INT_FIELDS = (
+    "participants_min", "participants_max",
+    "duration_min", "duration_max",
+    "age_group_min", "age_group_max",
+)
+
+
+def load_enrichment(path: Path) -> dict:
+    """Load data/enrichment.json (flat map content_key -> field dict)."""
+    if path and path.is_file():
+        try:
+            data = json.loads(path.read_text(encoding="utf-8"))
+            if isinstance(data, dict):
+                return data
+        except (json.JSONDecodeError, OSError):
+            pass
+    return {}
+
+
+def apply_enrichment(activities: list[Activity], enrichment: dict) -> dict:
+    """
+    Overlay enrichment fields onto the post-dedup activity list (plan B2).
+
+    Keyed by content_key. Only fields PRESENT in an entry are written; absent
+    fields leave the underlying DB value untouched. indoor_outdoor /
+    space_needed are normalized to slugs (None on unrecognised). Inferred
+    fields are recorded in `estimated_fields`. Translated / expanded text is
+    NOT re-validated against the source here — expansion fidelity is the
+    enrichment prompt's responsibility (plan B2 comment).
+
+    Returns {entries, matched, orphaned, fields_stated, fields_estimated}.
+    """
+    from app.config_taxonomy import normalize_indoor_outdoor, normalize_space_needed
+
+    matched_keys: set[str] = set()
+    fields_stated: dict[str, int] = defaultdict(int)
+    fields_estimated: dict[str, int] = defaultdict(int)
+
+    for act in activities:
+        key = content_key(
+            act.normalized_name or normalize_name(act.name),
+            act.language,
+            act.description or "",
+        )
+        entry = enrichment.get(key)
+        if not isinstance(entry, dict):
+            continue
+        matched_keys.add(key)
+
+        estimated = set(entry.get("estimated_fields") or [])
+
+        # bilingual text twins
+        for fld in _ENRICHMENT_TEXT_FIELDS:
+            val = entry.get(fld)
+            if isinstance(val, str) and val.strip():
+                setattr(act, fld, val.strip())
+
+        # inferred / clarified structured numeric fields
+        for fld in _ENRICHMENT_INT_FIELDS:
+            if entry.get(fld) is not None:
+                try:
+                    setattr(act, fld, int(entry[fld]))
+                except (TypeError, ValueError):
+                    pass
+
+        # enum filters — normalized to slug, dropped if unrecognised
+        if entry.get("indoor_outdoor") is not None:
+            slug = normalize_indoor_outdoor(entry["indoor_outdoor"])
+            if slug:
+                act.indoor_outdoor = slug
+        if entry.get("space_needed") is not None:
+            slug = normalize_space_needed(entry["space_needed"])
+            if slug:
+                act.space_needed = slug
+
+        act.estimated_fields = sorted(estimated)
+
+        # QA tally: stated vs estimated population, per field
+        for fld in (*_ENRICHMENT_INT_FIELDS, "indoor_outdoor", "space_needed"):
+            if entry.get(fld) is None:
+                continue
+            if fld in estimated:
+                fields_estimated[fld] += 1
+            else:
+                fields_stated[fld] += 1
+
+    return {
+        "entries": len(enrichment),
+        "matched": len(matched_keys),
+        "orphaned": len(enrichment) - len(matched_keys),
+        "fields_stated": dict(fields_stated),
+        "fields_estimated": dict(fields_estimated),
+    }
+
+
 # --------------------------------------------------------------------------
 # golden-set recall (plan §7)
 # --------------------------------------------------------------------------
@@ -390,9 +513,8 @@ def collect_activities(

        header = data.get("header", {})
        chunk_text = find_chunk_text(json_path, header, chunks_dir)
-        source_id = header.get("source_id") or chunk_key_for(json_path, header).rsplit(
-            ".part", 1
-        )[0]
+        chunk_key = chunk_key_for(json_path, header)
+        source_id = header.get("source_id") or chunk_key.rsplit(".part", 1)[0]
        fallback_source = (
            source_path_for(source_id, sources_dir) or source_id or json_path.stem
        )
@@ -409,7 +531,7 @@ def collect_activities(
                continue
            src = adict.get("source_file") or fallback_source
            raw_categories.append((adict.get("category", ""), normalize_category(adict.get("category", ""))))
-            activities.append(dict_to_activity(adict, src))
+            activities.append(dict_to_activity(adict, src, source_id, chunk_key))

        if hallucinated:
            _log_hallucinations(json_path, rejected_dir, hallucinated)
@@ -496,6 +618,7 @@ def rebuild(
    sources_dir: Path,
    db_path: Path,
    decisions_path: Optional[Path] = None,
+    enrichment_path: Optional[Path] = None,
    schema_path: Path = DEFAULT_SCHEMA_PATH,
    golden_dir: Optional[Path] = None,
    do_swap: bool = True,
@@ -517,6 +640,11 @@ def rebuild(
    decisions = load_review_decisions(Path(decisions_path)) if decisions_path else {}
    final, decision_stats = apply_review_decisions(deduped, decisions)

+    # Enrichment overlay — applied immediately after review decisions, on the
+    # post-dedup list, keyed on the same stable content_key (plan B2).
+    enrichment = load_enrichment(Path(enrichment_path)) if enrichment_path else {}
+    enrichment_stats = apply_enrichment(final, enrichment)
+
    try:
        write_database(db_tmp_path, final)
        backup = atomic_swap(db_tmp_path, db_path) if do_swap else None
@@ -529,6 +657,7 @@ def rebuild(
        **collected,
        "dedup": dedup_stats,
        "decisions": decision_stats,
+        "enrichment": enrichment_stats,
        "final_count": len(final),
        "backup": str(backup) if backup else None,
        "swapped": do_swap,
@@ -579,6 +708,16 @@ def print_report(report: dict) -> None:
          f"(auto-merged {d['auto_merged']}, borderline {d['borderline']})")
    print(f"review decisions     : dropped {report['decisions']['dropped']}, "
          f"resolved {report['decisions']['resolved']}")
+    enr = report.get("enrichment")
+    if enr and enr.get("entries"):
+        print(f"enrichment           : {enr['entries']} entries "
+              f"(matched {enr['matched']}, orphaned {enr['orphaned']})")
+        stated, estimated = enr.get("fields_stated", {}), enr.get("fields_estimated", {})
+        all_fields = sorted(set(stated) | set(estimated))
+        if all_fields:
+            print("  field population   :  (stated / estimated)")
+            for fld in all_fields:
+                print(f"    {fld:<18}: {stated.get(fld, 0)} / {estimated.get(fld, 0)}")
    print(f"final inserted       : {report['final_count']}")
    print(f"% with rules         : {qa['pct_with_rules']}")
    print(f"needs_review rows    : {qa['needs_review']}")
@@ -615,6 +754,7 @@ def main(argv: Optional[list[str]] = None) -> int:
    parser.add_argument("--sources", default="data/sources")
    parser.add_argument("--db", default="data/activities.db")
    parser.add_argument("--decisions", default="data/review_decisions.json")
+    parser.add_argument("--enrichment", default="data/enrichment.json")
    parser.add_argument("--golden", default="data/golden")
    parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
    args = parser.parse_args(argv)
@@ -628,6 +768,7 @@ def main(argv: Optional[list[str]] = None) -> int:
        sources_dir=Path(args.sources),
        db_path=Path(args.db),
        decisions_path=Path(args.decisions),
+        enrichment_path=Path(args.enrichment),
        schema_path=Path(args.schema),
        golden_dir=Path(args.golden),
    )
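
The overlay semantics of `apply_enrichment` in the build_database.py changes above (only fields present in an entry are written, and each written field is tallied as stated or estimated) can be seen on a toy example. This is a reduced sketch, not the function from the diff: `Row` is a hypothetical stand-in for `Activity`, rows carry a precomputed `key` instead of calling `content_key`, and only two fields are handled, with no slug normalization or int coercion.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Row:
    """Hypothetical stand-in for Activity, with a precomputed key."""
    key: str
    duration_min: Optional[int] = None
    space_needed: Optional[str] = None
    estimated_fields: list = field(default_factory=list)


def overlay(rows: list, enrichment: dict) -> dict:
    stated, estimated = defaultdict(int), defaultdict(int)
    matched = set()
    for row in rows:
        entry = enrichment.get(row.key)
        if not isinstance(entry, dict):
            continue
        matched.add(row.key)
        inferred = set(entry.get("estimated_fields") or [])
        for fld in ("duration_min", "space_needed"):
            if entry.get(fld) is None:
                continue  # absent in the entry: keep the existing value
            setattr(row, fld, entry[fld])
            (estimated if fld in inferred else stated)[fld] += 1
        row.estimated_fields = sorted(inferred)
    return {
        "matched": len(matched),
        "orphaned": len(enrichment) - len(matched),
        "fields_stated": dict(stated),
        "fields_estimated": dict(estimated),
    }


rows = [Row("k1", duration_min=10), Row("k2")]
enrichment = {
    "k1": {"space_needed": "mare", "estimated_fields": ["space_needed"]},
    "gone": {"duration_min": 5, "estimated_fields": []},  # no matching row
}
stats = overlay(rows, enrichment)
print(rows[0].duration_min, rows[0].space_needed, stats["orphaned"])  # → 10 mare 1
```

The `orphaned` count is the useful signal after a rebuild: it rises when extraction text has changed and existing enrichment entries no longer match any content_key.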
--- a/scripts/repair_extractions.py
+++ b/scripts/repair_extractions.py
@@ -0,0 +1,244 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+repair_extractions.py — one-shot repair of malformed extraction JSON.
+
+Subagents systematically emit unescaped ASCII double-quotes inside string
+values (Romanian text like  „Unu"  uses a closing " that terminates the JSON
+string early). Re-extraction reproduces the bug, so we repair instead.
+
+IMPORTANT — why NOT json_repair: json_repair "recovers" an unescaped quote by
+ending the string at the stray quote and reinterpreting the trailing text as a
+new key, which (a) TRUNCATES the value and (b) injects garbage keys. The
+truncation is silent (the field is still non-empty) and slips past a naive
+presence check. So we use a faithful char-scanner that ESCAPES stray quotes
+(\\") instead of splitting on them, then validate the result against the real
+activity schema (additionalProperties:false also catches any residual split).
+
+This is an OFFLINE maintenance tool. build_database.py must NOT depend on it —
+the "DB regenerable from data/extracted/" invariant requires plain valid JSON on
+disk. We write clean JSON back to data/extracted/ and the build reads vanilla
+json.
+
+Source selection (faithful recovery needs the ORIGINAL malformed text):
+  * a chunk is a candidate when a MALFORMED original exists — either the
+    top-level data/extracted/<key>.json is itself invalid, or a malformed
+    original sits in data/extracted/_rejected/<key>.json.
+  * the malformed original is preferred as the repair source.
+  * chunks whose only artifact is already-valid JSON (e.g. a prior json_repair
+    output that lost the original) are NOT silently "repaired" — if such a chunk
+    has no valid top-level file it is reported as needing RE-EXTRACTION.
+
+Usage:
+    python scripts/repair_extractions.py            # report only (dry run)
+    python scripts/repair_extractions.py --apply     # write repaired JSON
+"""
+
+from __future__ import annotations
+
+import argparse
+import glob
+import json
+from pathlib import Path
+from typing import Optional
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+EXTRACTED = REPO_ROOT / "data" / "extracted"
+REJECTED = EXTRACTED / "_rejected"
+
+if str(SCRIPT_DIR) not in __import__("sys").path:
+    __import__("sys").path.insert(0, str(SCRIPT_DIR))
+from import_common import DEFAULT_SCHEMA_PATH, load_schema, validate_extraction  # noqa: E402
+
+
+def escape_stray_quotes(s: str) -> str:
+    """Escape ASCII double-quotes that occur INSIDE a JSON string value.
+
+    A `"` inside a string is treated as a real string-close only when the next
+    non-whitespace char is structural (`,` `}` `]` `:`) or EOF; otherwise it is
+    content and is escaped to `\\"`. This preserves the full value instead of
+    truncating it (the json_repair failure mode).
+    """
+    out: list[str] = []
+    in_str = False
+    esc = False
+    n = len(s)
+    i = 0
+    while i < n:
+        c = s[i]
+        if esc:
+            out.append(c)
+            esc = False
+            i += 1
+            continue
+        if c == "\\":
+            out.append(c)
+            esc = True
+            i += 1
+            continue
+        if c == '"':
+            if not in_str:
+                in_str = True
+                out.append(c)
+            else:
+                j = i + 1
+                while j < n and s[j] in " \t\r\n":
+                    j += 1
+                nxt = s[j] if j < n else ""
+                if nxt in ",}]:" or nxt == "":
+                    in_str = False
+                    out.append(c)
+                else:
+                    out.append('\\"')  # content quote → escape, keep value whole
+            i += 1
+            continue
+        out.append(c)
+        i += 1
+    return "".join(out)
+
+
+def _is_valid_json(path: Path) -> bool:
+    try:
+        json.loads(path.read_text(encoding="utf-8"))
+        return True
+    except (json.JSONDecodeError, OSError):
+        return False
+
+
+def _malformed_source(key: str) -> Optional[Path]:
+    """Return the malformed-original file for a chunk, preferring top-level."""
+    live = EXTRACTED / f"{key}.json"
+    if live.exists() and not _is_valid_json(live):
+        return live
+    rej = REJECTED / f"{key}.json"
+    if rej.exists() and not _is_valid_json(rej):
+        return rej
+    return None
+
+
+def _candidate_keys() -> tuple[dict[str, Path], list[str]]:
+    """
+    (repair_candidates, needs_reextraction).
+
+    repair_candidates: key -> malformed source file (faithfully repairable).
+    needs_reextraction: chunks with no malformed original AND no valid
+    top-level file (their original was lost) — must be re-extracted.
+    """
+    keys = set()
+    for fn in glob.glob(str(EXTRACTED / "*.json")):
+        keys.add(Path(fn).stem)
+    for fn in glob.glob(str(REJECTED / "*.json")):
+        keys.add(Path(fn).stem)
+
+    candidates: dict[str, Path] = {}
+    needs_reextraction: list[str] = []
+    for key in sorted(keys):
+        # A malformed original anywhere is faithfully repairable, and is the
+        # source of truth even if a (json_repair-produced, possibly truncated)
+        # valid top-level file exists — escaping the original never truncates,
+        # so re-repairing from it is always >= the json_repair output.
+        src = _malformed_source(key)
+        if src is not None:
+            candidates[key] = src
+            continue
+        live = EXTRACTED / f"{key}.json"
+        if live.exists() and _is_valid_json(live):
+            continue  # genuinely-valid extraction, nothing to do
+        # no valid top-level and no malformed original to repair from
+        needs_reextraction.append(key)
+    return candidates, needs_reextraction
+
+
+def repair(apply: bool) -> int:
+    schema = load_schema(DEFAULT_SCHEMA_PATH)
+    candidates, needs_reextraction = _candidate_keys()
+
+    print("=" * 64)
+    print(f"REPAIR EXTRACTIONS  ({'APPLY' if apply else 'dry run'})")
+    print("=" * 64)
+    print(f"repair candidates: {len(candidates)}")
+
+    def _textlen(data: dict) -> int:
+        total = 0
+        for a in data.get("activities", []):
+            if isinstance(a, dict):
+                for v in a.values():
+                    if isinstance(v, str):
+                        total += len(v)
+        return total
+
+    ok = 0
+    kept_toplevel = 0
+    still_bad: list[str] = []
+    schema_fail: list[tuple[str, str]] = []
+
+    for key, src in candidates.items():
+        live = EXTRACTED / f"{key}.json"
+        live_valid = live.exists() and _is_valid_json(live)
+
+        raw = src.read_text(encoding="utf-8")
+        fixed = escape_stray_quotes(raw)
+        try:
+            data = json.loads(fixed)
+        except json.JSONDecodeError as exc:
+            if live_valid:
+                kept_toplevel += 1  # genuine top-level is fine; stale _rejected
+            else:
+                still_bad.append(f"{key}: still invalid after escape ({exc})")
+            continue
+        errors = validate_extraction(data, schema)
+        if errors:
+            if live_valid:
+                kept_toplevel += 1
+            else:
+                schema_fail.append((key, errors[0]))
+                print(f"  {key[:50]:<50} SCHEMA-FAIL: {errors[0][:40]}")
+            continue
+
+        # Faithfulness guard: only replace a valid top-level when the escaped
+        # repair carries STRICTLY more text (i.e. the top-level was a truncated
+        # json_repair output). Genuine extractions are kept untouched.
+        if live_valid:
+            try:
+                live_data = json.loads(live.read_text(encoding="utf-8"))
+            except json.JSONDecodeError:
+                live_data = {}
+            if _textlen(data) <= _textlen(live_data):
+                kept_toplevel += 1
+                continue
+
+        n = len(data.get("activities", []))
+        print(f"  {key[:50]:<50} {n:>3} acts  REPAIR")
+        if apply:
+            live.write_text(
+                json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
+            )
+        ok += 1
+
+    print("-" * 64)
+    print(f"repaired: {ok} | kept genuine top-level: {kept_toplevel} | "
+          f"schema-fail: {len(schema_fail)} | still-bad: {len(still_bad)} | "
+          f"needs re-extraction: {len(needs_reextraction)}")
+    for key, err in schema_fail:
+        print(f"  ⚠ schema {key}: {err[:60]}")
+    for msg in still_bad:
+        print(f"  ✘ {msg}")
+    for key in needs_reextraction:
+        print(f"  ↻ re-extract: {key}")
+    if not apply:
+        print("\nDry run — re-run with --apply to write repaired JSON.")
+    print("=" * 64)
+    return 0
+
+
+def main(argv: Optional[list[str]] = None) -> int:
+    parser = argparse.ArgumentParser(description="Repair malformed extraction JSON.")
+    parser.add_argument("--apply", action="store_true",
+                        help="write repaired JSON (default: dry run)")
+    args = parser.parse_args(argv)
+    return repair(args.apply)
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
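
The lookahead heuristic in `escape_stray_quotes` above is easiest to judge on a small input. The block below is a condensed restatement of the scanner from the diff (same rule, compressed control flow) applied to a made-up malformed object; the sample string is illustrative and not taken from the extraction data.

```python
import json


def escape_stray_quotes(s: str) -> str:
    """Condensed copy of the scanner in the diff, for demonstration only."""
    out = []
    in_str = esc = False
    i, n = 0, len(s)
    while i < n:
        c = s[i]
        if esc:                      # char after a backslash: copy verbatim
            out.append(c)
            esc = False
        elif c == "\\":
            out.append(c)
            esc = True
        elif c == '"' and not in_str:
            in_str = True            # opening quote
            out.append(c)
        elif c == '"':
            j = i + 1
            while j < n and s[j] in " \t\r\n":
                j += 1
            nxt = s[j] if j < n else ""
            if nxt == "" or nxt in ",}]:":
                in_str = False       # structural char follows: real close
                out.append(c)
            else:
                out.append('\\"')    # content quote: escape, keep value whole
        else:
            out.append(c)
        i += 1
    return "".join(out)


broken = '{"name": "Jocul „Unu" pe echipe", "rules": null}'
try:
    json.loads(broken)
    parsed_raw = True
except json.JSONDecodeError:
    parsed_raw = False

fixed = json.loads(escape_stray_quotes(broken))
print(parsed_raw, fixed["name"])  # → False Jocul „Unu" pe echipe
```

One limitation follows directly from the rule: a stray quote that happens to be followed by `,`, `}`, `]` or `:` (for example `„Unu", apoi`) is read as a real close, so the scanner cannot fix it. In the script those cases surface as still-bad or schema-fail entries rather than being written silently.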
--- a/scripts/run_enrichment.py
+++ b/scripts/run_enrichment.py
@@ -0,0 +1,270 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+run_enrichment.py — enrichment orchestrator (plan Part B3).
+
+Mirror of run_extraction.py, on the *other* side of the rebuild. It reads the
+already-rebuilt data/activities.db, and for every activity emits one subagent
+prompt asking for a single bilingual + inferred-filter enrichment pass. Like
+extraction, this script does NOT call the LLM — the interactive Claude Code
+orchestrator launches waves of subagents on the emitted prompts.
+
+Keying is the crux (plan §"Cheia de keying"): each row's overlay is keyed on
+import_common.content_key(normalized_name, language, _normalize_text(description))
+— the SAME function build_database uses to apply the overlay. The key is stable
+only while the extraction text is frozen, so enrichment runs AFTER the freezing
+rebuild.
+
+Modes:
+  (default)    emit one prompt per activity that has no enrichment part yet
+               (resumable: data/enrichment_parts/<key>.json present => skip)
+  --collect    merge data/enrichment_parts/*.json -> data/enrichment.json
+
+Pilot scoping (plan B5): --source <source_id substring> and/or --limit N narrow
+the emitted prompts to a single source / category for the sign-off pilot.
+
+Usage:
+    python scripts/run_enrichment.py --source teambuilding_corbu        # pilot
+    python scripts/run_enrichment.py                                    # all rows
+    python scripts/run_enrichment.py --collect                          # merge parts
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sqlite3
+import sys
+from pathlib import Path
+from typing import Optional
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
+    if _p not in sys.path:
+        sys.path.insert(0, _p)
+
+from import_common import (  # noqa: E402
+    content_key,
+    find_chunk_text,
+    normalize_name,
+)
+
+ENRICHMENT_PROMPT = SCRIPT_DIR / "ENRICHMENT_PROMPT.md"
+
+# Columns pulled from the DB into the prompt as the "current value" context.
+_DB_COLUMNS = (
+    "id", "name", "description", "rules", "variations",
+    "category", "content_type", "language", "normalized_name",
+    "page_reference", "source_id", "chunk_key",
+    "participants_min", "participants_max",
+    "duration_min", "duration_max",
+    "age_group_min", "age_group_max",
+)
+
+# How much source-chunk text to inline. Chunks are page-sized; cap so a dense
+# chunk does not blow the prompt up, but keep enough to ground the expansion.
+_CHUNK_TEXT_CAP = 12000
+
+
+def _fetch_rows(db_path: Path, source_substr: Optional[str]) -> list[dict]:
+    conn = sqlite3.connect(db_path)
+    conn.row_factory = sqlite3.Row
+    try:
+        cols = ", ".join(_DB_COLUMNS)
+        sql = f"SELECT {cols} FROM activities"
+        params: list = []
+        if source_substr:
+            sql += " WHERE (source_id LIKE ? OR chunk_key LIKE ?)"
+            params = [f"%{source_substr}%", f"%{source_substr}%"]
+        sql += " ORDER BY source_id, id"
+        return [dict(r) for r in conn.execute(sql, params).fetchall()]
+    finally:
+        conn.close()
+
+
+def _row_content_key(row: dict) -> str:
+    return content_key(
+        row.get("normalized_name") or normalize_name(row.get("name") or ""),
+        row.get("language"),
+        row.get("description") or "",
+    )
+
+
+def _chunk_text_for_row(row: dict, chunks_dir: Path) -> Optional[str]:
+    """Locate the source-chunk text via the row's chunk_key / source_id."""
+    header = {"chunk_key": row.get("chunk_key"), "source_id": row.get("source_id")}
+    if not header["chunk_key"]:
+        return None
+    # find_chunk_text resolves from the header when chunk_key is present;
+    # the json_path arg is only a fallback, so a synthetic path is fine.
+    text = find_chunk_text(Path(f"{row['chunk_key']}.json"), header, chunks_dir)
+    if text and len(text) > _CHUNK_TEXT_CAP:
+        text = text[:_CHUNK_TEXT_CAP] + "\n…[chunk truncated]…"
+    return text
+
+
+def _current_fields_block(row: dict) -> str:
+    """The activity's current DB values, as a compact JSON block for context."""
+    fields = {
+        "name": row.get("name"),
+        "description": row.get("description"),
+        "rules": row.get("rules"),
+        "variations": row.get("variations"),
+        "category": row.get("category"),
+        "content_type": row.get("content_type"),
+        "language": row.get("language"),
+        "participants_min": row.get("participants_min"),
+        "participants_max": row.get("participants_max"),
+        "duration_min": row.get("duration_min"),
+        "duration_max": row.get("duration_max"),
+        "age_group_min": row.get("age_group_min"),
+        "age_group_max": row.get("age_group_max"),
+    }
+    return json.dumps(fields, ensure_ascii=False, indent=2)
+
+
+def emit_enrichment_prompt(
+    row: dict, key: str, chunks_dir: Path, prompts_dir: Path
+) -> Path:
+    """Write the subagent enrichment prompt for one activity."""
+    chunk_text = _chunk_text_for_row(row, chunks_dir)
+    source_block = (
+        chunk_text if chunk_text is not None
+        else "[source chunk text unavailable — translate only what is given "
+             "above; do NOT invent steps, and mark any inferred filter field "
+             "as estimated]"
+    )
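+    # NOTE: this path is hardcoded and does not follow a non-default --parts
+    # directory, while run_emit's resume check does use parts_dir.
+    part_path = f"data/enrichment_parts/{key}.json"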
+    text = "\n".join([
+        f"# ENRICHMENT — activity `{row.get('name')}` (id {row.get('id')})",
+        "",
+        f"Follow the rules in `{ENRICHMENT_PROMPT.relative_to(REPO_ROOT)}` EXACTLY.",
+        "Single pass. Translate faithfully to Romanian; expand description_ro "
+        "ONLY from the source chunk text below; mark inferred filter fields in "
+        "`estimated_fields`.",
+        "",
+        f"Write the result JSON to: `{part_path}`",
+        f'It MUST include `"content_key": "{key}"`.',
+        f'Page reference: {row.get("page_reference") or "?"}',
+        "",
+        "## Current activity values (the text to translate / enrich)",
+        "```json",
+        _current_fields_block(row),
+        "```",
+        "",
+        "## Source chunk text (ground description_ro expansion in THIS only)",
+        "```",
+        source_block,
+        "```",
+        "",
+    ])
+    prompts_dir.mkdir(parents=True, exist_ok=True)
+    out = prompts_dir / f"{key}.prompt.md"
+    out.write_text(text, encoding="utf-8")
+    return out
+
+
+def collect_enrichment(parts_dir: Path, out_path: Path) -> dict:
+    """Merge data/enrichment_parts/*.json into one flat content_key map."""
+    merged: dict = {}
+    bad: list[str] = []
+    if parts_dir.is_dir():
+        for part in sorted(parts_dir.glob("*.json")):
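+            try:
+                data = json.loads(part.read_text(encoding="utf-8"))
+            except (ValueError, OSError):
+                # ValueError covers JSONDecodeError and UnicodeDecodeError.
+                bad.append(part.name)
+                continue
+            if not isinstance(data, dict):
+                # A part must be a JSON object; a list or scalar has no fields.
+                bad.append(part.name)
+                continue
+            key = data.get("content_key") or part.stem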
+            entry = {k: v for k, v in data.items() if k != "content_key"}
+            merged[key] = entry
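+    # Create the output directory first so a custom --out path does not fail.
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+    out_path.write_text(
+        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8"
+    )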
+    return {"entries": len(merged), "bad_parts": bad, "out": str(out_path)}
+
+
+def run_emit(
+    *,
+    db_path: Path,
+    chunks_dir: Path,
+    parts_dir: Path,
+    prompts_dir: Path,
+    source_substr: Optional[str],
+    limit: Optional[int],
+) -> dict:
+    """Emit prompts for rows that have no part file yet; return a summary.
+
+    Resumable: a row is skipped when parts_dir/<content_key>.json already exists.
+    """
+    rows = _fetch_rows(db_path, source_substr)
+    emitted, skipped = 0, 0
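+    for row in rows:
+        # Check the cap up front so that --limit 0 emits nothing instead of
+        # being treated as "no limit" by a truthiness test.
+        if limit is not None and emitted >= limit:
+            break
+        key = _row_content_key(row)
+        if (parts_dir / f"{key}.json").is_file():
+            skipped += 1
+            continue
+        emit_enrichment_prompt(row, key, chunks_dir, prompts_dir)
+        emitted += 1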
+    return {
+        "rows": len(rows),
+        "emitted": emitted,
+        "skipped_done": skipped,
+        "prompts_dir": str(prompts_dir),
+    }
+
+
+def main(argv: Optional[list[str]] = None) -> int:
+    parser = argparse.ArgumentParser(description="Enrichment orchestrator.")
+    parser.add_argument("--db", default="data/activities.db")
+    parser.add_argument("--chunks", default="data/chunks")
+    parser.add_argument("--parts", default="data/enrichment_parts")
+    parser.add_argument("--prompts", default="data/enrichment_prompts")
+    parser.add_argument("--out", default="data/enrichment.json")
+    parser.add_argument("--source", default=None,
+                        help="only rows whose source_id/chunk_key contains this (pilot)")
+    parser.add_argument("--limit", type=int, default=None,
+                        help="cap emitted prompts (pilot)")
+    parser.add_argument("--collect", action="store_true",
+                        help="merge enrichment parts into the overlay JSON")
+    args = parser.parse_args(argv)
+
+    print("=" * 60)
+    print("ENRICHMENT ORCHESTRATOR")
+    print("=" * 60)
+
+    if args.collect:
+        result = collect_enrichment(Path(args.parts), Path(args.out))
+        print(f"collected  : {result['entries']} entries -> {result['out']}")
+        if result["bad_parts"]:
+            print(f"bad parts  : {len(result['bad_parts'])} (skipped)")
+            for name in result["bad_parts"]:
+                print(f"  - {name}")
+        print("Run build_database.py --rebuild to apply the overlay.")
+        print("=" * 60)
+        return 0
+
+    summary = run_emit(
+        db_path=Path(args.db),
+        chunks_dir=Path(args.chunks),
+        parts_dir=Path(args.parts),
+        prompts_dir=Path(args.prompts),
+        source_substr=args.source,
+        limit=args.limit,
+    )
+    print(f"rows in DB        : {summary['rows']}"
+          + (f"  (filtered by '{args.source}')" if args.source else ""))
+    print(f"already enriched  : {summary['skipped_done']}")
+    print(f"prompts emitted   : {summary['emitted']}")
+    if summary["emitted"]:
+        print(f"prompts dir       : {summary['prompts_dir']}/")
+        print("Launch waves of ~8-16 Sonnet subagents on those prompts, each "
+              "writing data/enrichment_parts/<key>.json, then run "
+              "run_enrichment.py --collect and build_database.py --rebuild.")
+    else:
+        print("Nothing to emit — run --collect then build_database.py --rebuild.")
+    print("=" * 60)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())