Faza 1 complete: bilingual+enrichment plumbing, UI/filters, frozen DB

Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await re-extraction). DB rebuilt and frozen at 9418 activities; content_keys are now stable for the enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor, space_needed, estimated_fields, source_id/source_ids/chunk_key columns; FTS5 indexes the 4 *_ro columns across CREATE + all 3 triggers; new equality filters + category counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*, is_estimated, axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers (None on unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns, RO-primary rendering with "(estimat)" markers and collapsible original text, and a /source/<id> download route shipped DARK behind SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster unions source_ids without touching enrichment fields.

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect), ENRICHMENT_PROMPT.md.

Repair: scripts/repair_extractions.py fixes the subagents' systematic unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never truncates) + schema validation + a strictly-more-text guard. json_repair was tried first, truncated silently, and is NOT used. build_database has no repair dependency.

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 18:10:13 +00:00
parent 46d9592a55
commit bcfb6841eb
18 changed files with 1579 additions and 167 deletions
--- a/HANDOFF.md
+++ b/HANDOFF.md
@@ -1,135 +1,143 @@
-# HANDOFF — Faza 1 extraction in progress
+# HANDOFF — Faza 1 extraction DONE, code landed, DB frozen; next = re-extract 7 + enrichment pilot
-**Last updated:** 2026-05-20, commit `3d9f266` (pilot complete) + uncommitted Faza 1 work.
+**Snapshot:** 2026-05-29. Executing plan `enumerated-petting-badger.md` (bilingual
 index + enrichment + new filters + source download).
-## State of play
+## Where we are
-Faza 0 (pilot) is **complete and committed**. Faza 1 (full corpus) is **in progress at 10.9%**.
+| Step (plan Part C) | Status |
 |--------------------|--------|
 | 1. Finish extraction | **DONE** — 575/588 chunks extracted & valid; 13 missing (see below) |
 | 2. Land code Part A1–A4 (model/schema/merge) | **DONE & committed** |
 | 2b. Code Part A5–A8 (UI/search/download) | **DONE & committed** |
 | 2c. Code Part B2–B4 (enrichment pipeline) | **DONE & committed** |
 | 3. Freeze rebuild (freezes content_keys) | **DONE** — `data/activities.db` = **9418 activities** |
 | Part D tests | **DONE** — `tests/test_enrichment.py`, 99 pass total |
 | 4. Enrichment pilot → **STOP for user sign-off** | **NOT STARTED** (this is the gate) |
 | 5. Final rebuild `--enrichment` | not started |
-| Phase | Status | DB rows | Tests |
+Everything is committed except whatever this session leaves dirty. `data/extracted/*.json`
-|-------|--------|---------|-------|
+is gitignored (575 files on disk, durable across /clear).
 | Faza 0 pilot (5 files) | committed (`3d9f266`) | 1751 in `data/activities.db` | 71 passed |
 | Faza 1 extraction | **64/588 chunks done**, 1949 activities in `data/extracted/*.json` (not yet imported to DB) | — | — |
-## What "Faza 1" is
+## The 13 missing chunks (out of 588)
-Process the full 96-source corpus (was 116 raw files; some are duplicates/empty zips/skipped junk) through the LLM-subagent extraction pipeline. Same code path as the pilot. Two large mirror dirs dominate the chunk count:
+**6 content-filter-blocked** (Anthropic safety; accept as missing — marginal loss):
-
+- `87850302_dragon_sleepdeprived.part73 / .part85 / .part94` (camp song lyrics)
- `87850302_dragon_sleepdeprived` — 116 chunks (full dragon.sleepdeprived.ca mirror)
+- `c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96`
 - `c3162825_resource_pack__learning_by_playing_catalunya_...` — 97 chunks (the catalunya mirror)
 Combined they are 213/588 = 36% of all chunks.
 ## Critical recommendation for the next session
 **Use Sonnet 4.6 for subagent extractions, not Opus.** Opus was used through chunks 1-64 and burned through a 5-hour rate-limit window faster than this scale needs. Sonnet has 200K context which is plenty for the ~25k-token prompt + ~20k-token output of a single chunk extraction. The task is structured JSON extraction with a fixed schema — no complex reasoning needed.
 The Agent tool's `model` parameter takes `"sonnet"`. Pass it on every Agent call below.
 ## Resuming — exact mechanical steps
 1. **Verify state.**
   ```bash
   cd /workspace/game-library
   ls data/extracted/*.json | wc -l       # should be 64 (or higher if more done)
   ls data/chunks/_prompts/ | wc -l       # 588 — the full prompt set
   git log --oneline -3                   # 3d9f266 must be HEAD or earlier
   ```
 2. **Find what's still pending.** Compare prompts to extracted files:
   ```bash
   ls data/chunks/_prompts/ | sed 's/\.prompt\.md$//' | sort > /tmp/all.txt
   ls data/extracted/*.json 2>/dev/null | sed 's|.*/||;s|\.json$||' | sort > /tmp/done.txt
   comm -23 /tmp/all.txt /tmp/done.txt > /tmp/pending.txt
   wc -l /tmp/pending.txt                 # how many remain
   head /tmp/pending.txt                  # what's next
   ```
 3. **Launch waves of 16 subagents in parallel.** One Agent call per chunk. Single message with 16 Agent tool calls. Use this exact template (substitute `<CHUNK_KEY>`):
   ```
   Agent(
     description: "Extract <CHUNK_KEY>",
     subagent_type: "general-purpose",
     model: "sonnet",                  ← critical
     prompt: "Working directory: /workspace/game-library. Extraction subagent — read data/chunks/_prompts/<CHUNK_KEY>.prompt.md and follow it EXACTLY. Apply rules from scripts/SUBAGENT_PROMPT.md and schema from scripts/activity_schema.json. Write the JSON. Set language per chunk content ('ro' or 'en'). Report under 40 words."
   )
   ```
   The per-chunk prompt file is fully self-contained — it points to the right chunk, sets source_id/source_hash/chunk_key, and references the rules + schema. The subagent just follows it.
 4. **After every wave**, briefly check progress and continue:
   ```bash
   ls data/extracted/*.json | wc -l
   ```
   Repeat step 3 with the next 16 pending chunks. If an agent reports `"You've hit your limit · resets ..."` AND `tool_uses: 5` with `total_tokens: 0`, check whether the JSON was written anyway — agents often persist the file before the limit hit. Only re-launch if the JSON is missing.
 5. **When all 588 chunks are done**, finalize:
   ```bash
   python3 scripts/validate_extractions.py    # any chunks marked rejected go to data/extracted/_reextract/
   # re-extract any rejected chunks (same template, prompt from _reextract/)
   python3 scripts/build_database.py --rebuild
   # if many borderline needs_review rows:
   python3 -c "
   import sys; sys.path.insert(0,'scripts')
   from import_common import content_key, normalize_name
   import sqlite3, json
   conn = sqlite3.connect('data/activities.db')
   conn.row_factory = sqlite3.Row
   rows = list(conn.execute('SELECT name, normalized_name, language, description FROM activities WHERE needs_review=1'))
   d = {content_key(r['normalized_name'] or normalize_name(r['name']), r['language'], r['description'] or ''): 'merge' for r in rows}
   json.dump(d, open('data/review_decisions.json','w'), indent=2)
   print(f'{len(d)} merge decisions')
   "
   python3 scripts/build_database.py --rebuild   # apply decisions
   python3 -m pytest tests/ -q                   # 71 should pass
   git add data/activities.db data/review_decisions.json
   git commit -m "Faza 1: full corpus extraction"
   ```
 ## Code reference — what each script does
 - `scripts/normalize_sources.py --corpus data/carti-camp-jocuri --out data/sources` → produces 96 `data/sources/<id>.txt` files with `--- PAGE N ---` markers. **Done. Don't re-run.**
 - `scripts/chunk_sources.py --sources data/sources --chunks data/chunks` → splits each into ~20pg chunks with 4pg overlap, writes `data/chunks/<id>/<id>.partNN.txt` and updates `data/chunks/manifest.json`. **Done. Don't re-run unless sources change.**
 - `scripts/run_extraction.py` → regenerates the per-chunk prompts in `data/chunks/_prompts/`. **Done. Don't re-run unless schema/prompt changes.**
 - `scripts/SUBAGENT_PROMPT.md` — extraction rules (what subagents follow).
 - `scripts/activity_schema.json` — JSON schema each extraction must validate against.
 - `scripts/validate_extractions.py` — per-file schema check + fuzzy `source_excerpt` substring check; writes re-extraction prompts to `data/extracted/_reextract/` for rejected chunks; marks chunks `rejected` in manifest.
 - `scripts/build_database.py --rebuild` — validates every `data/extracted/*.json` against schema, drops per-activity hallucinations, dedup, applies `data/review_decisions.json`, atomic swap into `data/activities.db`.
 - `scripts/review_queue.py list|resolve <id> <merge|keep-separate|drop>` — CLI for borderline-dedup decisions; persisted in `data/review_decisions.json`.
 ## Pilot lessons that apply
 - **~1.07% hallucinated drops** at pilot scale (well below the 2% threshold). Caused by source_excerpts straddling `--- PAGE N ---` markers. Re-extraction with verbatim within-page quotes fixed all 13 affected chunks. Expect similar rate at Faza 1 scale (~10-30 chunks may need re-extraction).
 - **Borderline dedup queue** (369 rows in pilot) — same-name activities re-extracted from chunk overlap with slightly-different prose. Bulk-merge is the right call: same normalized_name + same language + 60-85% desc similarity → merge takes the longest fields. Use the snippet in step 5 above.
 - **OCR not needed.** The candidate scanned PDF (`07.Cartea_Mare`) extracted 151 pages of real text via pdfplumber alone. Skip OCR for v1.
 ## Files not yet committed (uncommitted in this session)
 - `data/sources/` — all 96 normalized `.txt` files (in `.gitignore`, don't try to commit them)
 - `data/chunks/` — all 588 chunks + manifest (in `.gitignore`)
 - `data/extracted/` — 64 JSON files so far (in `.gitignore`)
 - `data/activities.db` — **still the pilot's 1751-row DB.** Will be rebuilt after Faza 1 finishes.
 The schema, all scripts, all tests, and the pilot DB are already committed at `3d9f266`. No code changes are needed for Faza 1 — just data.
 ## Status snapshot (as of handoff)
 **7 need RE-EXTRACTION** (their malformed-original JSON was destroyed — see "json_repair
 incident" below; re-extract once the subagent session limit resets, ~5pm UTC):
 ```
-chunks done       : 64 / 588   (10.9%)
+3f9c8232_teambuilding_corbu_29092023.part01
-activities so far : 1949
+5f959f85_scoli_fara_bullying.part02
-remaining chunks  : 524
+83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04
-largest pending sources:
+d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01
-  87850302_dragon_sleepdeprived               116 chunks (full dragon mirror)
+d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04
-  c3162825_resource_pack__learning_by_playing  97 chunks (catalunya mirror)
+d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05
-  4da6431e_cub_scout_leader_how_to_book         18
+e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03
  4a765782_1000_fantastic_scout_games           18 (re-extract; was in pilot)
  bee67427_the_big_book_of_conflict_resolution  15
  e3bd0953_1001_idei_pentru_o_educatie_timp     14
  d5e51389_09_culegere_de_jocuri_si_povestiri   13
  ce4b48f1_impact_culegere_de_jocuri_si_povest  13
  193fdd94_ghid_de_integrare_a_persoanelor_vul  12 (in progress)
  779f4fa0_ghidul_animatorului_855_de_jocuri    11
 ```
 Re-extract these (Sonnet subagents, one Agent call each, the per-chunk prompt is at
 `data/chunks/_prompts/<key>.prompt.md`), then **re-run the freeze rebuild** so they join
 the corpus before enrichment. Re-freezing is safe now — enrichment has NOT run, so no
 overlay keys depend on the current freeze yet.
-In a fresh session: `cat HANDOFF.md`, then go straight to step 3 above.
+## The json_repair incident (important — root cause + what was fixed)
 Subagents **systematically emit unescaped ASCII `"` inside string values** (Romanian
 text like `„Unu"` uses a closing `"` that terminates the JSON string early). ~34 files
 were affected.
 First repair attempt used the `json_repair` lib. **It truncates**: on a stray quote it
 ends the string and reinterprets the trailing text as a new key, silently dropping the
 rest of the value and injecting garbage keys. Schema `additionalProperties:false` caught
 the garbage-key cases (8 files dropped at rebuild), but the truncation that didn't create
 an extra key slipped through. Applying json_repair output to disk also **overwrote the
 malformed originals** for those 8 → originals lost → those (now 7, one recovered) need
 re-extraction.
 **Fix:** `scripts/repair_extractions.py` was rewritten to use a faithful char-scanner
 (`escape_stray_quotes`) that **escapes** stray quotes (`\"`) instead of splitting on them,
 validates against the real schema, and only replaces a valid top-level file when the
 repaired version carries **strictly more text** (a length guard that catches truncated
 json_repair output while leaving genuine extractions untouched). Re-running it cleanly
 repaired the affected files; the final freeze had **0 schema-rejected, 0 invalid**.
 `json_repair` is no longer used anywhere. Do NOT reintroduce it.
 `build_database.py` does NOT depend on the repair script (the "DB regenerable from
 data/extracted/" invariant holds — plain `json.loads` only).
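
To make the escape-instead-of-split idea concrete, here is a minimal sketch of a quote-escaping char-scanner. It is not the code in `scripts/repair_extractions.py`: the function name `escape_stray_quotes_sketch` and the closing-quote heuristic (a quote ends a string only when the next non-whitespace character is `,`, `:`, `}`, `]` or end of input) are assumptions made for illustration.

```python
import json

# Characters that may legally follow a closing quote (after optional whitespace).
_STRUCTURAL = set(",:}]")


def escape_stray_quotes_sketch(text: str) -> str:
    """Escape ASCII quotes that appear inside JSON string values.

    Heuristic (an assumption, not the project's actual rule): while inside a
    string, a quote closes the string only if the next non-whitespace
    character is structural or the input ends. Any other quote is treated as
    literal text and emitted as an escaped quote, so no text is dropped.
    """
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
        elif ch == "\\" and i + 1 < n:
            # Already-escaped pair: copy both characters untouched.
            out.append(text[i:i + 2])
            i += 1
        elif ch == '"':
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if j == n or text[j] in _STRUCTURAL:
                in_string = False
                out.append(ch)
            else:
                out.append('\\"')
        else:
            out.append(ch)
        i += 1
    return "".join(out)


broken = '{"name": "Jocul „Unu" și „Doi"", "language": "ro"}'
fixed = escape_stray_quotes_sketch(broken)
print(json.loads(fixed)["name"])
```

The output is never shorter than the input, which is the property the strictly-more-text guard relies on. The heuristic still misreads a stray quote that happens to be followed by a comma or colon, so schema validation after repair remains necessary.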
 ## What the code does now (all committed)
 **Part A — plumbing (corpus-independent):**
 - `app/models/database.py`: new columns `name_ro/description_ro/rules_ro/variations_ro,
  indoor_outdoor, space_needed, estimated_fields(JSON), source_id, source_ids(JSON),
  chunk_key`; FTS5 indexes the 4 `*_ro` columns (CREATE + all 3 triggers — kept in sync);
  indexes on `indoor_outdoor`/`space_needed`; `search_activities` gained `indoor_outdoor`
  and `space_needed` equality kwargs; `_update_category_counts` feeds both new axes into
  the categories table so dropdowns populate.
 - `app/models/activity.py`: new fields + `to_dict`/`from_dict`; helpers `get_display_name`
  / `get_display_description` / `get_display_rules` / `get_display_variations`
  (RO-primary, EN fallback), `has_translation`, `is_estimated(field)`,
  `get_indoor_outdoor_display`, `get_space_needed_display`.
 - `app/config_taxonomy.py`: `INDOOR_OUTDOOR`, `SPACE_NEEDED` enums + RO labels +
  `normalize_indoor_outdoor` / `normalize_space_needed` (None on unrecognised, no
  fallback — never fabricate a value) + display-name helpers.
 - `scripts/build_database.py`: `dict_to_activity` sets `source_id`+`chunk_key`;
  `merge_cluster` unions `source_ids` and carries rep's `source_id`/`chunk_key` but
  **never** touches enrichment fields (those are applied post-dedup).
 **Part A — UI/search:**
 - `app/services/search.py`: `_map_filters_to_db_fields` maps `indoor_outdoor`/
  `space_needed` to DB equality filters.
 - `app/web/routes.py`: new `/source/<id>` download route — **shipped DARK behind
  `SOURCE_DOWNLOAD_ENABLED` (default false; copyright exposure, user opts in)**; resolves
  `source_file` under `CORPUS_DIR` via `send_from_directory` (traversal-safe, 404s for
  web-mirror sources). `DISPLAY_NAMES` extended with both new axes.
 - `app/config.py`: `SOURCE_DOWNLOAD_ENABLED`, `CORPUS_DIR`.
 - Templates: `index.html`/`results.html` have the 2 new dropdowns; cards use display
  helpers + `(estimat)` markers; `activity.html` is RO-primary with a collapsible
  "Text original" section, indoor/space cards, estimat markers, and the download link
  (only when the flag is on). `main.css` has `.estimated` / `.original-text` styles.
 **Part B — enrichment pipeline (built, not yet run):**
 - `scripts/build_database.py`: `load_enrichment` + `apply_enrichment(activities, enrichment)`
  applied **right after** `apply_review_decisions`, on the post-dedup list, keyed on
  `import_common.content_key(normalized_name, language, _normalize_text(description))`
  (reused verbatim). CLI `--enrichment` (default `data/enrichment.json`). QA report prints
  `enrichment {entries, matched, orphaned}` + per-field **stated vs estimated** counts.
  Translated/expanded text is NOT re-validated against source (by design).
 - `scripts/run_enrichment.py`: reads the rebuilt DB, computes each row's content_key,
  skips rows already in `data/enrichment_parts/<key>.json` (resumable), emits one prompt
  per activity to `data/enrichment_prompts/` (current EN fields + source chunk text via
  `find_chunk_text`). Pilot scoping: `--source <substr>` and/or `--limit N`. `--collect`
  merges parts → `data/enrichment.json`.
 - `scripts/ENRICHMENT_PROMPT.md`: single-pass rules — translate faithfully, expand
  `description_ro` ONLY from chunk text, mark inferred filter fields in `estimated_fields`,
  fixed enum vocab, output `data/enrichment_parts/<content_key>.json` including `content_key`.
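
The overlay step described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the body of `apply_enrichment`: activities are plain dicts here, `demo_key` is a hypothetical stand-in for `import_common.content_key`, and the field list is copied from the new columns named in Part A.

```python
from typing import Callable, Dict, List, Tuple

# Fields an overlay entry may set (taken from the new-column list above).
ENRICHMENT_FIELDS = (
    "name_ro", "description_ro", "rules_ro", "variations_ro",
    "indoor_outdoor", "space_needed", "estimated_fields",
)


def apply_enrichment_sketch(
    activities: List[dict],
    enrichment: Dict[str, dict],
    key_fn: Callable[[dict], str],
) -> Tuple[int, int]:
    """Overlay enrichment entries onto activities; return (matched, orphaned)."""
    seen = set()
    for act in activities:
        key = key_fn(act)
        entry = enrichment.get(key)
        if entry is None:
            continue
        seen.add(key)
        for field in ENRICHMENT_FIELDS:
            # Absent or null fields leave the underlying value untouched.
            if entry.get(field) is not None:
                act[field] = entry[field]
    orphaned = len(set(enrichment) - seen)
    return len(seen), orphaned


def demo_key(act: dict) -> str:
    """Hypothetical key; the real pipeline reuses import_common.content_key."""
    return f"{act['normalized_name']}|{act['language']}"


acts = [
    {"normalized_name": "tag", "language": "en", "description": "Run."},
    {"normalized_name": "leapsa", "language": "ro", "description": "Fugi."},
]
overlay = {
    "tag|en": {"name_ro": "Leapșa", "indoor_outdoor": "outdoor",
               "estimated_fields": ["indoor_outdoor"]},
    "gone|en": {"name_ro": "Orfan"},
}
matched, orphaned = apply_enrichment_sketch(acts, overlay, demo_key)
print(matched, orphaned)
```

An orphaned entry is one whose key no longer matches any post-dedup activity, which is why the content_keys have to be frozen before enrichment runs.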
 ## Exact next steps
 1. **Re-extract the 7 chunks** above (after session-limit reset). Verify each writes valid
   JSON (`python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"`).
   If any come back malformed, `python3 scripts/repair_extractions.py --apply` (faithful now).
 2. **Re-freeze:** `python3 scripts/build_database.py --rebuild` — confirm 0 schema-rejected,
   note the new total (~9418 + the 7 chunks' activities).
 3. **Enrichment PILOT** (plan B5 — the STOP gate guarding 6–8k LLM calls):
   - Pick one source, e.g. `python3 scripts/run_enrichment.py --source teambuilding_corbu`
     (or `--limit 30`). This writes prompts to `data/enrichment_prompts/`.
   - Launch a small wave of Sonnet subagents on those prompts (each writes
     `data/enrichment_parts/<key>.json`).
   - `python3 scripts/run_enrichment.py --collect` → `data/enrichment.json`.
   - `python3 scripts/build_database.py --rebuild` (picks up `--enrichment` by default).
   - **STOP. Hand the user translation-quality + estimation-plausibility + description-
     fidelity samples and get sign-off BEFORE scaling to the full corpus.** Do not
     auto-proceed past this gate.
 4. After sign-off: scale enrichment in waves of ~8–16 Sonnet subagents, `--collect`,
   final `--rebuild --enrichment`.
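
The one-liner in step 1 stops at the first malformed file. A small sketch that reports every failing file instead may be handier after a re-extraction wave; `find_malformed` is a hypothetical helper, not an existing script.

```python
import glob
import json
import os


def find_malformed(pattern: str) -> list:
    """Return (path, error message) pairs for files that fail json.loads."""
    bad = []
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path, encoding="utf-8") as fh:
                json.loads(fh.read())
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            bad.append((path, str(exc)))
    return bad


if __name__ == "__main__":
    # Run from the repository root so the relative path resolves.
    for path, err in find_malformed(os.path.join("data", "extracted", "*.json")):
        print(f"{path}: {err}")
```

This only checks JSON syntax; schema validation still happens in `scripts/validate_extractions.py` and at rebuild.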
 ## Verify / run
 - Tests: `python3 -m pytest tests/ -q` → 99 pass.
 - App: `SOURCE_DOWNLOAD_ENABLED` is false by default (download link hidden). Set it true
  only if the user accepts the copyright exposure of serving original files.
 - `data/activities.db.bak` is the pre-this-freeze backup.
--- a/app/config.py
+++ b/app/config.py
@@ -22,6 +22,18 @@ class Config:
    # Search settings
    SEARCH_RESULTS_LIMIT = int(os.environ.get('SEARCH_RESULTS_LIMIT', '100'))
    FTS_ENABLED = True
    # Source-file download (plan A6). Shipped DARK by default: serving the
    # original PDFs/books carries a copyright exposure the user must opt into.
    # The /source/<id> route 404s entirely while this is false; the UI hides
    # the download link. Enable with SOURCE_DOWNLOAD_ENABLED=true.
    SOURCE_DOWNLOAD_ENABLED = (
        os.environ.get('SOURCE_DOWNLOAD_ENABLED', 'false').lower() == 'true'
    )
    # Root of the original corpus. source_file values are relative to this.
    CORPUS_DIR = os.environ.get('CORPUS_DIR') or str(
        Path(__file__).parent.parent / 'data' / 'carti-camp-jocuri'
    )
    @staticmethod
    def ensure_directories():
--- a/app/config_taxonomy.py
+++ b/app/config_taxonomy.py
@@ -8,7 +8,7 @@ the UI displays the Romanian name. `category` (thematic domain) and
 import unicodedata
 import re
-from typing import Dict, List
+from typing import Dict, List, Optional
 # --- Categories (thematic domain) --------------------------------------------
 # slug -> Romanian display name. ~16 fixed slugs; `altele` is the mandatory
@@ -215,6 +215,89 @@ def normalize_content_type(value: str) -> str:
    return aliases.get(slug, DEFAULT_CONTENT_TYPE)
 # --- Indoor / outdoor (enrichment axis) --------------------------------------
 # Where the activity is run. Inferred during enrichment when the source is
 # silent — such inferences are flagged in `estimated_fields`. slug -> RO label.
 INDOOR_OUTDOOR: Dict[str, str] = {
    "indoor": "Interior",
    "outdoor": "Exterior",
    "either": "Interior sau exterior",
 }
 # --- Space needed (enrichment axis) ------------------------------------------
 # Rough footprint the activity requires. slug -> RO label.
 SPACE_NEEDED: Dict[str, str] = {
    "mic": "Spațiu mic",
    "mediu": "Spațiu mediu",
    "mare": "Spațiu mare",
 }
 # Aliases for robustness against LLM output variation. Keys are _slugify'd.
 _INDOOR_OUTDOOR_ALIASES: Dict[str, str] = {
    "interior": "indoor",
    "inside": "indoor",
    "in": "indoor",
    "exterior": "outdoor",
    "outside": "outdoor",
    "out": "outdoor",
    "aer-liber": "outdoor",
    "both": "either",
    "any": "either",
    "ambele": "either",
    "interior-exterior": "either",
    "indoor-outdoor": "either",
 }
 _SPACE_NEEDED_ALIASES: Dict[str, str] = {
    "small": "mic",
    "redus": "mic",
    "putin": "mic",
    "medium": "mediu",
    "moderat": "mediu",
    "large": "mare",
    "big": "mare",
    "mult": "mare",
    "spatiu-mic": "mic",
    "spatiu-mediu": "mediu",
    "spatiu-mare": "mare",
 }
 def normalize_indoor_outdoor(value: str) -> Optional[str]:
    """Map an arbitrary string to an indoor_outdoor slug, or None.
    Unlike categories, this has NO mandatory fallback: an unrecognised or
    empty value yields None (field simply absent), so we never fabricate a
    location the enrichment did not assert.
    """
    if not value:
        return None
    slug = _slugify(str(value))
    if slug in INDOOR_OUTDOOR:
        return slug
    return _INDOOR_OUTDOOR_ALIASES.get(slug)
 def normalize_space_needed(value: str) -> Optional[str]:
    """Map an arbitrary string to a space_needed slug, or None (no fallback)."""
    if not value:
        return None
    slug = _slugify(str(value))
    if slug in SPACE_NEEDED:
        return slug
    return _SPACE_NEEDED_ALIASES.get(slug)
 def indoor_outdoor_display_name(slug: str) -> str:
    """RO display name for an indoor_outdoor slug."""
    return INDOOR_OUTDOOR.get(slug, slug)
 def space_needed_display_name(slug: str) -> str:
    """RO display name for a space_needed slug."""
    return SPACE_NEEDED.get(slug, slug)
 def is_valid_category(slug: str) -> bool:
    """True if `slug` is a valid category slug."""
    return slug in CATEGORIES
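
A self-contained sketch of the no-fallback normalizer behaviour added in this file. `_slugify` is not shown in this diff, so `slugify_sketch` below is a hypothetical stand-in (strip diacritics, lowercase, hyphenate), and the alias table is a small subset of the real one.

```python
import re
import unicodedata
from typing import Dict, Optional


def slugify_sketch(value: str) -> str:
    """Hypothetical stand-in for _slugify: strip diacritics, lowercase, hyphenate."""
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


SPACE_NEEDED: Dict[str, str] = {
    "mic": "Spațiu mic",
    "mediu": "Spațiu mediu",
    "mare": "Spațiu mare",
}
ALIASES: Dict[str, str] = {"small": "mic", "spatiu-mic": "mic", "large": "mare"}


def normalize_space_needed_sketch(value: Optional[str]) -> Optional[str]:
    """Return a known slug, or None when the value is empty or unrecognised."""
    if not value:
        return None
    slug = slugify_sketch(str(value))
    return slug if slug in SPACE_NEEDED else ALIASES.get(slug)


for raw in ["Mare", "Spațiu mic", "small", "huge", ""]:
    print(repr(raw), "->", normalize_space_needed_sketch(raw))
```

Returning None for "huge" rather than a default slug is the point: the filter column stays empty instead of carrying a value the enrichment never asserted.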
--- a/app/models/activity.py
+++ b/app/models/activity.py
@@ -76,6 +76,25 @@ class Activity:
    extraction_confidence: Optional[str] = None  # 'high' / 'med' / 'low'
    needs_review: int = 0
    # Enrichment overlay (applied at build time from data/enrichment.json; see
    # plan Part B). Bilingual: the EN/source text stays in name/description/...
    # and the Romanian rendering lands in the *_ro twins. Absent fields leave
    # the underlying DB value untouched.
    name_ro: Optional[str] = None
    description_ro: Optional[str] = None
    rules_ro: Optional[str] = None
    variations_ro: Optional[str] = None
    indoor_outdoor: Optional[str] = None     # slug: indoor / outdoor / either
    space_needed: Optional[str] = None       # slug: mic / mediu / mare
    # Names of fields whose value was INFERRED by enrichment (source was
    # silent) rather than stated in the source — surfaced as "(estimat)" in UI.
    estimated_fields: List[str] = field(default_factory=list)
    # Source provenance for the download route + enrichment keying.
    source_id: Optional[str] = None          # e.g. "876d1a2d_marcaje_turistice"
    source_ids: List[str] = field(default_factory=list)  # all source_ids merged
    chunk_key: Optional[str] = None          # e.g. "<source_id>.part01"
    # Database fields
    id: Optional[int] = None
    created_at: Optional[str] = None
@@ -117,6 +136,16 @@ class Activity:
            'normalized_name': self.normalized_name or normalize_name(self.name),
            'extraction_confidence': self.extraction_confidence,
            'needs_review': self.needs_review,
            'name_ro': self.name_ro,
            'description_ro': self.description_ro,
            'rules_ro': self.rules_ro,
            'variations_ro': self.variations_ro,
            'indoor_outdoor': self.indoor_outdoor,
            'space_needed': self.space_needed,
            'estimated_fields': json.dumps(self.estimated_fields) if self.estimated_fields else None,
            'source_id': self.source_id,
            'source_ids': json.dumps(self.source_ids) if self.source_ids else None,
            'chunk_key': self.chunk_key,
        }
    @classmethod
@@ -140,6 +169,19 @@ class Activity:
        elif source_files is None:
            source_files = []
        # estimated_fields / source_ids: JSON string (DB) or list (in-memory)
        def _json_list(value):
            if isinstance(value, str):
                try:
                    parsed = json.loads(value)
                    return parsed if isinstance(parsed, list) else []
                except (json.JSONDecodeError, TypeError):
                    return []
            return list(value) if value else []
        estimated_fields = _json_list(data.get('estimated_fields'))
        source_ids = _json_list(data.get('source_ids'))
        return cls(
            id=data.get('id'),
            name=data.get('name', ''),
@@ -170,6 +212,16 @@ class Activity:
            normalized_name=data.get('normalized_name'),
            extraction_confidence=data.get('extraction_confidence'),
            needs_review=data.get('needs_review', 0) or 0,
            name_ro=data.get('name_ro'),
            description_ro=data.get('description_ro'),
            rules_ro=data.get('rules_ro'),
            variations_ro=data.get('variations_ro'),
            indoor_outdoor=data.get('indoor_outdoor'),
            space_needed=data.get('space_needed'),
            estimated_fields=estimated_fields,
            source_id=data.get('source_id'),
            source_ids=source_ids,
            chunk_key=data.get('chunk_key'),
            created_at=data.get('created_at'),
            updated_at=data.get('updated_at')
        )
@@ -210,4 +262,44 @@ class Activity:
            return self.materials_category
        elif self.materials_list:
            return self.materials_list[:100] + "..." if len(self.materials_list) > 100 else self.materials_list
-        return "nu specificate"
+        return "nu specificate"
    # --- Enrichment / bilingual display helpers ------------------------------
    def get_display_name(self) -> str:
        """Romanian name when enriched, else the original."""
        return self.name_ro or self.name
    def get_display_description(self) -> str:
        """Romanian description when enriched, else the original."""
        return self.description_ro or self.description
    def get_display_rules(self) -> Optional[str]:
        """Romanian rules when enriched, else the original."""
        return self.rules_ro or self.rules
    def get_display_variations(self) -> Optional[str]:
        """Romanian variations when enriched, else the original."""
        return self.variations_ro or self.variations
    def has_translation(self) -> bool:
        """True if any Romanian enrichment text is present."""
        return bool(self.name_ro or self.description_ro
                    or self.rules_ro or self.variations_ro)
    def is_estimated(self, field_name: str) -> bool:
        """True if `field_name` was inferred by enrichment (source was silent)."""
        return field_name in (self.estimated_fields or [])
    def get_indoor_outdoor_display(self) -> Optional[str]:
        """RO label for indoor_outdoor, or None when unset."""
        if not self.indoor_outdoor:
            return None
        from app.config_taxonomy import indoor_outdoor_display_name
        return indoor_outdoor_display_name(self.indoor_outdoor)
    def get_space_needed_display(self) -> Optional[str]:
        """RO label for space_needed, or None when unset."""
        if not self.space_needed:
            return None
        from app.config_taxonomy import space_needed_display_name
        return space_needed_display_name(self.space_needed)
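
The list-valued fields (`estimated_fields`, `source_ids`) are stored as JSON text and decoded defensively on the way back. A standalone sketch of that round trip, with `json_list` and `encode_list` as illustrative names mirroring the `_json_list` helper and the `to_dict` expressions in this diff:

```python
import json
from typing import Any, List, Optional


def json_list(value: Any) -> List[str]:
    """Decode a JSON-encoded list from the DB, tolerating None and bad input."""
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except (json.JSONDecodeError, TypeError):
            return []
        return parsed if isinstance(parsed, list) else []
    return list(value) if value else []


def encode_list(values: List[str]) -> Optional[str]:
    """Mirror of to_dict: an empty list is stored as NULL, not '[]'."""
    return json.dumps(values) if values else None


stored = encode_list(["indoor_outdoor", "space_needed"])
print(stored, "->", json_list(stored))
print(encode_list([]), "->", json_list(encode_list([])))
```

Because an empty list becomes NULL, `is_estimated` has to tolerate a missing value, which the `or []` guard in the model covers.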
--- a/app/models/database.py
+++ b/app/models/database.py
@@ -72,6 +72,18 @@ class DatabaseManager:
                    extraction_confidence TEXT,
                    needs_review INTEGER DEFAULT 0,
                    -- Enrichment overlay (bilingual + inferred filters; Part B)
                    name_ro TEXT,
                    description_ro TEXT,
                    rules_ro TEXT,
                    variations_ro TEXT,
                    indoor_outdoor TEXT,
                    space_needed TEXT,
                    estimated_fields TEXT,
                    source_id TEXT,
                    source_ids TEXT,
                    chunk_key TEXT,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                )
@@ -82,6 +94,7 @@ class DatabaseManager:
                CREATE VIRTUAL TABLE IF NOT EXISTS activities_fts USING fts5(
                    name, description, rules, variations, keywords,
                    materials_list, skills_developed,
                    name_ro, description_ro, rules_ro, variations_ro,
                    content='activities',
                    content_rowid='id'
                )
@@ -106,6 +119,8 @@ class DatabaseManager:
                "CREATE INDEX IF NOT EXISTS idx_activities_participants ON activities(participants_min, participants_max)",
                "CREATE INDEX IF NOT EXISTS idx_activities_duration ON activities(duration_min, duration_max)",
                "CREATE INDEX IF NOT EXISTS idx_activities_normalized_name ON activities(normalized_name)",
                "CREATE INDEX IF NOT EXISTS idx_activities_indoor_outdoor ON activities(indoor_outdoor)",
                "CREATE INDEX IF NOT EXISTS idx_activities_space_needed ON activities(space_needed)",
                "CREATE INDEX IF NOT EXISTS idx_categories_type ON categories(type)"
            ]
@@ -117,9 +132,11 @@ class DatabaseManager:
                CREATE TRIGGER IF NOT EXISTS activities_fts_insert AFTER INSERT ON activities
                BEGIN
                    INSERT INTO activities_fts(rowid, name, description, rules, variations,
-                                               keywords, materials_list, skills_developed)
+                                               keywords, materials_list, skills_developed,
                                               name_ro, description_ro, rules_ro, variations_ro)
                    VALUES (new.id, new.name, new.description, new.rules, new.variations,
-                            new.keywords, new.materials_list, new.skills_developed);
+                            new.keywords, new.materials_list, new.skills_developed,
                            new.name_ro, new.description_ro, new.rules_ro, new.variations_ro);
                END
            """)
@@ -127,9 +144,11 @@ class DatabaseManager:
                CREATE TRIGGER IF NOT EXISTS activities_fts_delete AFTER DELETE ON activities
                BEGIN
                    INSERT INTO activities_fts(activities_fts, rowid, name, description, rules,
-                                               variations, keywords, materials_list, skills_developed)
+                                               variations, keywords, materials_list, skills_developed,
                                               name_ro, description_ro, rules_ro, variations_ro)
                    VALUES ('delete', old.id, old.name, old.description, old.rules,
-                            old.variations, old.keywords, old.materials_list, old.skills_developed);
+                            old.variations, old.keywords, old.materials_list, old.skills_developed,
                            old.name_ro, old.description_ro, old.rules_ro, old.variations_ro);
                END
            """)
@@ -137,13 +156,17 @@ class DatabaseManager:
                CREATE TRIGGER IF NOT EXISTS activities_fts_update AFTER UPDATE ON activities
                BEGIN
                    INSERT INTO activities_fts(activities_fts, rowid, name, description, rules,
-                                               variations, keywords, materials_list, skills_developed)
+                                               variations, keywords, materials_list, skills_developed,
                                               name_ro, description_ro, rules_ro, variations_ro)
                    VALUES ('delete', old.id, old.name, old.description, old.rules,
-                            old.variations, old.keywords, old.materials_list, old.skills_developed);
+                            old.variations, old.keywords, old.materials_list, old.skills_developed,
                            old.name_ro, old.description_ro, old.rules_ro, old.variations_ro);
                    INSERT INTO activities_fts(rowid, name, description, rules, variations,
-                                               keywords, materials_list, skills_developed)
+                                               keywords, materials_list, skills_developed,
                                               name_ro, description_ro, rules_ro, variations_ro)
                    VALUES (new.id, new.name, new.description, new.rules, new.variations,
-                            new.keywords, new.materials_list, new.skills_developed);
+                            new.keywords, new.materials_list, new.skills_developed,
                            new.name_ro, new.description_ro, new.rules_ro, new.variations_ro);
                END
            """)
@@ -210,6 +233,10 @@ class DatabaseManager:
            ('duration', activity.get_duration_display()),
            ('materials', activity.get_materials_display()),
            ('difficulty', activity.difficulty_level),
            # Enrichment axes — slugs stored as value; UI maps to RO via
            # DISPLAY_NAMES. Without these the new dropdowns would be empty.
            ('indoor_outdoor', activity.indoor_outdoor),
            ('space_needed', activity.space_needed),
        ]
        for cat_type, cat_value in categories_to_update:
@@ -236,6 +263,8 @@ class DatabaseManager:
                         duration_max: Optional[int] = None,
                         materials_category: Optional[str] = None,
                         difficulty_level: Optional[str] = None,
                         indoor_outdoor: Optional[str] = None,
                         space_needed: Optional[str] = None,
                         limit: int = 100) -> List[Dict[str, Any]]:
        """Enhanced search with FTS5 and filters"""
@@ -293,7 +322,15 @@ class DatabaseManager:
            if difficulty_level:
                base_query += " AND difficulty_level = ?"
                params.append(difficulty_level)
-            
+
            if indoor_outdoor:
                base_query += " AND indoor_outdoor = ?"
                params.append(indoor_outdoor)
            if space_needed:
                base_query += " AND space_needed = ?"
                params.append(space_needed)
            # Add ordering and limit
            query = f"{base_query} {order_clause} LIMIT ?"
            params.append(limit)
--- a/app/services/search.py
+++ b/app/services/search.py
@@ -200,7 +200,14 @@ class SearchService:
            elif filter_key == 'difficulty':
                db_filters['difficulty_level'] = filter_value
-            
+
            elif filter_key == 'indoor_outdoor':
                # Equality filter on the slug column (mirror difficulty).
                db_filters['indoor_outdoor'] = filter_value
            elif filter_key == 'space_needed':
                db_filters['space_needed'] = filter_value
            # Handle any other custom filters
            else:
                # Generic filter handling - try to match against keywords or tags
--- a/app/static/css/main.css
+++ b/app/static/css/main.css
@@ -705,4 +705,30 @@ body {
        box-shadow: none;
        border: 1px solid #ddd;
    }
 }
 /* Enrichment markers (plan Part A7) */
 .estimated {
    color: #8a6d3b;
    font-style: italic;
    font-size: 0.85em;
    font-weight: normal;
 }
 .original-text > summary {
    cursor: pointer;
    color: #555;
    user-select: none;
 }
 .original-text .original-content {
    margin-top: 0.75rem;
    padding-left: 1rem;
    border-left: 3px solid #e0e0e0;
    color: #555;
 }
 .download-hint {
    color: #888;
    font-size: 0.85em;
 }
--- a/app/templates/activity.html
+++ b/app/templates/activity.html
@@ -8,13 +8,13 @@
    <nav class="breadcrumb">
        <a href="{{ url_for('main.index') }}">Căutare</a>
        <span class="breadcrumb-separator">»</span>
-        <span class="breadcrumb-current">{{ activity.name }}</span>
+        <span class="breadcrumb-current">{{ activity.get_display_name() }}</span>
    </nav>
    <!-- Activity header -->
    <header class="activity-detail-header">
        <div class="activity-title-section">
-            <h1 class="activity-detail-title">{{ activity.name }}</h1>
+            <h1 class="activity-detail-title">{{ activity.get_display_name() }}</h1>
            <span class="activity-category-badge">{{ display_names.get(activity.category, activity.category) }}</span>
            {% if activity.content_type %}
            <span class="activity-content-type-badge">{{ display_names.get(activity.content_type, activity.content_type) }}</span>
@@ -31,27 +31,46 @@
    <!-- Activity content -->
    <div class="activity-detail-content">
-        <!-- Main description -->
+        <!-- Main description (Romanian-primary, falls back to original) -->
        <section class="activity-section">
            <h2 class="section-title">Descriere</h2>
-            <div class="activity-description">{{ activity.description }}</div>
+            <div class="activity-description">{{ activity.get_display_description() }}</div>
        </section>
        <!-- Rules and variations -->
-        {% if activity.rules %}
+        {% if activity.get_display_rules() %}
        <section class="activity-section">
            <h2 class="section-title">Reguli</h2>
-            <div class="activity-rules">{{ activity.rules }}</div>
+            <div class="activity-rules">{{ activity.get_display_rules() }}</div>
        </section>
        {% endif %}
-        {% if activity.variations %}
+        {% if activity.get_display_variations() %}
        <section class="activity-section">
            <h2 class="section-title">Variații</h2>
-            <div class="activity-variations">{{ activity.variations }}</div>
+            <div class="activity-variations">{{ activity.get_display_variations() }}</div>
        </section>
        {% endif %}
        <!-- Original (pre-translation) text, collapsed by default -->
        {% if activity.has_translation() %}
        <details class="activity-section original-text">
            <summary class="section-title">Text original ({{ display_names.get(activity.language, activity.language or 'sursă') }})</summary>
            <div class="original-content">
                <h3 class="metadata-title">{{ activity.name }}</h3>
                <div class="activity-description">{{ activity.description }}</div>
                {% if activity.rules %}
                <h4 class="metadata-title">Reguli</h4>
                <div class="activity-rules">{{ activity.rules }}</div>
                {% endif %}
                {% if activity.variations %}
                <h4 class="metadata-title">Variații</h4>
                <div class="activity-variations">{{ activity.variations }}</div>
                {% endif %}
            </div>
        </details>
        {% endif %}
        <!-- Metadata grid -->
        <section class="activity-section">
            <h2 class="section-title">Detalii activitate</h2>
@@ -59,21 +78,35 @@
                {% if activity.get_age_range_display() != "toate vârstele" %}
                <div class="metadata-card">
                    <h3 class="metadata-title">Grupa de vârstă</h3>
-                    <p class="metadata-value">{{ activity.get_age_range_display() }}</p>
+                    <p class="metadata-value">{{ activity.get_age_range_display() }}{% if activity.is_estimated('age_group_min') or activity.is_estimated('age_group_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
                </div>
                {% endif %}
                {% if activity.get_participants_display() != "orice număr" %}
                <div class="metadata-card">
                    <h3 class="metadata-title">Participanți</h3>
-                    <p class="metadata-value">{{ activity.get_participants_display() }}</p>
+                    <p class="metadata-value">{{ activity.get_participants_display() }}{% if activity.is_estimated('participants_min') or activity.is_estimated('participants_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
                </div>
                {% endif %}
                {% if activity.get_duration_display() != "durată variabilă" %}
                <div class="metadata-card">
                    <h3 class="metadata-title">Durata</h3>
-                    <p class="metadata-value">{{ activity.get_duration_display() }}</p>
+                    <p class="metadata-value">{{ activity.get_duration_display() }}{% if activity.is_estimated('duration_min') or activity.is_estimated('duration_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
                </div>
                {% endif %}
                {% if activity.get_indoor_outdoor_display() %}
                <div class="metadata-card">
                    <h3 class="metadata-title">Interior / exterior</h3>
                    <p class="metadata-value">{{ activity.get_indoor_outdoor_display() }}{% if activity.is_estimated('indoor_outdoor') %} <em class="estimated">(estimat)</em>{% endif %}</p>
                </div>
                {% endif %}
                {% if activity.get_space_needed_display() %}
                <div class="metadata-card">
                    <h3 class="metadata-title">Spațiu necesar</h3>
                    <p class="metadata-value">{{ activity.get_space_needed_display() }}{% if activity.is_estimated('space_needed') %} <em class="estimated">(estimat)</em>{% endif %}</p>
                </div>
                {% endif %}
@@ -125,9 +158,15 @@
            <h2 class="section-title">Informații sursă</h2>
            <div class="source-info">
                {% if activity.source_file %}
                {% if config.SOURCE_DOWNLOAD_ENABLED %}
                <p><strong>Fișier sursă:</strong>
                   <a href="{{ url_for('main.source_download', activity_id=activity.id) }}">{{ activity.source_file }}</a>
                   <span class="download-hint">(descarcă)</span></p>
                {% else %}
                <p><strong>Fișier sursă:</strong> {{ activity.source_file }}</p>
                {% endif %}
-                
+                {% endif %}
                {% if activity.page_reference %}
                <p><strong>Referință:</strong> {{ activity.page_reference }}</p>
                {% endif %}
--- a/app/templates/index.html
+++ b/app/templates/index.html
@@ -125,6 +125,30 @@
                    </select>
                </div>
                {% endif %}
                {% if filters.indoor_outdoor %}
                <div class="filter-group">
                    <label for="indoor_outdoor" class="filter-label">Interior / exterior</label>
                    <select name="indoor_outdoor" id="indoor_outdoor" class="filter-select">
                        <option value="">Oriunde</option>
                        {% for io in filters.indoor_outdoor %}
                        <option value="{{ io }}">{{ display_names.get(io, io) }}</option>
                        {% endfor %}
                    </select>
                </div>
                {% endif %}
                {% if filters.space_needed %}
                <div class="filter-group">
                    <label for="space_needed" class="filter-label">Spațiu necesar</label>
                    <select name="space_needed" id="space_needed" class="filter-select">
                        <option value="">Orice spațiu</option>
                        {% for sp in filters.space_needed %}
                        <option value="{{ sp }}">{{ display_names.get(sp, sp) }}</option>
                        {% endfor %}
                    </select>
                </div>
                {% endif %}
            {% endif %}
        </div>
--- a/app/templates/results.html
+++ b/app/templates/results.html
@@ -85,6 +85,28 @@
            </select>
            {% endif %}
            {% if filters.indoor_outdoor %}
            <select name="indoor_outdoor" class="filter-select compact">
                <option value="">Oriunde</option>
                {% for io in filters.indoor_outdoor %}
                <option value="{{ io }}" {% if applied_filters.indoor_outdoor == io %}selected{% endif %}>
                    {{ display_names.get(io, io) }}
                </option>
                {% endfor %}
            </select>
            {% endif %}
            {% if filters.space_needed %}
            <select name="space_needed" class="filter-select compact">
                <option value="">Orice spațiu</option>
                {% for sp in filters.space_needed %}
                <option value="{{ sp }}" {% if applied_filters.space_needed == sp %}selected{% endif %}>
                    {{ display_names.get(sp, sp) }}
                </option>
                {% endfor %}
            </select>
            {% endif %}
            <button type="button" class="btn btn-secondary btn-sm" onclick="clearFilters()">
                Resetează
            </button>
@@ -128,7 +150,7 @@
            <header class="activity-header">
                <h3 class="activity-title">
                    <a href="{{ url_for('main.activity_detail', activity_id=activity.id) }}">
-                        {{ activity.name }}
+                        {{ activity.get_display_name() }}
                    </a>
                </h3>
                <span class="activity-category">{{ display_names.get(activity.category, activity.category) }}</span>
@@ -138,24 +160,36 @@
            </header>
            <div class="activity-content">
-                <p class="activity-description">{{ activity.description }}</p>
+                <p class="activity-description">{{ activity.get_display_description() }}</p>
-                
+
                <div class="activity-metadata">
                    {% if activity.get_age_range_display() != "toate vârstele" %}
                    <span class="metadata-item">
-                        <strong>Vârsta:</strong> {{ activity.get_age_range_display() }}
+                        <strong>Vârsta:</strong> {{ activity.get_age_range_display() }}{% if activity.is_estimated('age_group_min') or activity.is_estimated('age_group_max') %} <em class="estimated">(estimat)</em>{% endif %}
                    </span>
                    {% endif %}
                    {% if activity.get_participants_display() != "orice număr" %}
                    <span class="metadata-item">
-                        <strong>Participanți:</strong> {{ activity.get_participants_display() }}
+                        <strong>Participanți:</strong> {{ activity.get_participants_display() }}{% if activity.is_estimated('participants_min') or activity.is_estimated('participants_max') %} <em class="estimated">(estimat)</em>{% endif %}
                    </span>
                    {% endif %}
                    {% if activity.get_duration_display() != "durată variabilă" %}
                    <span class="metadata-item">
-                        <strong>Durata:</strong> {{ activity.get_duration_display() }}
+                        <strong>Durata:</strong> {{ activity.get_duration_display() }}{% if activity.is_estimated('duration_min') or activity.is_estimated('duration_max') %} <em class="estimated">(estimat)</em>{% endif %}
                    </span>
                    {% endif %}
                    {% if activity.get_indoor_outdoor_display() %}
                    <span class="metadata-item">
                        <strong>Loc:</strong> {{ activity.get_indoor_outdoor_display() }}{% if activity.is_estimated('indoor_outdoor') %} <em class="estimated">(estimat)</em>{% endif %}
                    </span>
                    {% endif %}
                    {% if activity.get_space_needed_display() %}
                    <span class="metadata-item">
                        <strong>Spațiu:</strong> {{ activity.get_space_needed_display() }}{% if activity.is_estimated('space_needed') %} <em class="estimated">(estimat)</em>{% endif %}
                    </span>
                    {% endif %}
@@ -168,7 +202,11 @@
                {% if activity.source_file %}
                <div class="activity-source">
                    {% if config.SOURCE_DOWNLOAD_ENABLED %}
                    <small>Sursă: <a href="{{ url_for('main.source_download', activity_id=activity.id) }}">{{ activity.source_file }}</a></small>
                    {% else %}
                    <small>Sursă: {{ activity.source_file }}</small>
                    {% endif %}
                </div>
                {% endif %}
            </div>
--- a/app/web/routes.py
+++ b/app/web/routes.py
@@ -3,20 +3,27 @@ Flask routes for INDEX-SISTEM-JOCURI v2.0
 Clean, minimalist web interface with dynamic filters
 """
-from flask import Blueprint, request, render_template, jsonify, current_app
+from flask import (
    Blueprint, request, render_template, jsonify, current_app,
    send_from_directory,
 )
 from app.models.database import DatabaseManager
 from app.models.activity import Activity
 from app.services.search import SearchService
-from app.config_taxonomy import CATEGORIES, CONTENT_TYPES
+from app.config_taxonomy import (
-import os
+    CATEGORIES, CONTENT_TYPES, INDOOR_OUTDOOR, SPACE_NEEDED,
-from pathlib import Path
+)
 bp = Blueprint('main', __name__)
-# Slug -> Romanian display name. Category and content_type slugs never collide,
+# Slug -> Romanian display name. Category, content_type, indoor_outdoor and
-# so a single flat map is enough for the UI filter labels.
+# space_needed slugs never collide, so a single flat map is enough for the UI
 # filter labels.
 LANGUAGE_NAMES = {'ro': 'Română', 'en': 'Engleză'}
-DISPLAY_NAMES = {**CATEGORIES, **CONTENT_TYPES, **LANGUAGE_NAMES}
+DISPLAY_NAMES = {
    **CATEGORIES, **CONTENT_TYPES, **INDOOR_OUTDOOR, **SPACE_NEEDED,
    **LANGUAGE_NAMES,
 }
 # Initialize database manager (will be configured in application factory)
 def get_db_manager():
@@ -138,6 +145,44 @@ def activity_detail(activity_id):
        print(f"Error loading activity {activity_id}: {e}")
        return render_template('404.html'), 404
@bp.route('/source/<int:activity_id>')
 def source_download(activity_id):
    """Download the original source file for an activity (plan A6).
    Shipped DARK: returns 404 unless SOURCE_DOWNLOAD_ENABLED is set (copyright
    exposure — the user opts in). Resolves the activity's `source_file` under
    CORPUS_DIR. send_from_directory does the safe-join and blocks traversal;
    web-mirror / extension-less sources that are not real files 404 gracefully.
    """
    if not current_app.config.get('SOURCE_DOWNLOAD_ENABLED', False):
        return render_template('404.html'), 404
    try:
        db = get_db_manager()
        activity_data = db.get_activity_by_id(activity_id)
        if not activity_data:
            return render_template('404.html'), 404
        source_file = (activity_data.get('source_file') or '').strip()
        if not source_file:
            return render_template('404.html'), 404
        corpus_dir = current_app.config.get('CORPUS_DIR')
        if not corpus_dir:
            return render_template('404.html'), 404
        try:
            # send_from_directory rejects path traversal and missing files with
            # a 404 (NotFound) — no manual safe_join needed.
            return send_from_directory(
                corpus_dir, source_file, as_attachment=True
            )
        except Exception:
            # Missing file / web-mirror source with no on-disk original.
            return render_template('404.html'), 404
    except Exception as e:
        print(f"Source download error for {activity_id}: {e}")
        return render_template('404.html'), 404
@bp.route('/health')
 def health_check():
    """Health check endpoint for Docker"""
--- a/data/activities.db
+++ b/data/activities.db
--- a/scripts/ENRICHMENT_PROMPT.md
+++ b/scripts/ENRICHMENT_PROMPT.md
@@ -0,0 +1,98 @@
 # SUBAGENT — Activity enrichment
 You are a subagent in the game-library enrichment pipeline. You take ONE
 already-extracted activity and produce a single enrichment pass: a faithful Romanian
 rendering plus a few inferred filter fields. You do **one** activity per prompt.
 This is **not** re-extraction. The activity text already exists and is trusted.
 Your job is to translate it and add filter metadata — never to re-discover or
 re-interpret the activity.
 ## Your task
 The prompt gives you two blocks:
 1. **Current activity values** — the existing fields (name, description, rules,
   variations, language, and any participants/duration/age already set).
 2. **Source chunk text** — the original passage the activity came from. This is
   your ground truth for any expansion. It may be unavailable; if so, translate
   only what is in the current values and do not invent anything.
 Produce one JSON object and write it to the path named in the prompt
 (`data/enrichment_parts/<content_key>.json`). It MUST contain the exact
 `content_key` string from the prompt.
 ## Rules
 ### Translation (always)
 - Translate `name`, `description`, `rules`, `variations` into natural, fluent
  Romanian → `name_ro`, `description_ro`, `rules_ro`, `variations_ro`.
 - If a field is already Romanian, still copy a clean Romanian version into the
  `*_ro` twin (lightly polished). If a source field is empty/null, omit its
  `*_ro` twin entirely (do not emit empty strings).
 - Translate faithfully. Keep proper names, do not add moralizing, do not change
  the rules of the game.
 ### Description expansion (constrained)
 - You MAY make `description_ro` richer than a literal translation — but ONLY
  using detail that is actually present in the **source chunk text**. Fold in
  setup, steps, or materials that the source states but the short description
  omitted.
 - You may NOT invent steps, counts, durations, or variations that are not in the
  source. If the source is thin, the translation stays thin. Hallucinated
  expansion is the one unacceptable failure here.
 ### Inferred filter fields (mark when inferred)
 Fill these when you can, using the source text first, then reasonable inference:
 - `indoor_outdoor`: one of `indoor`, `outdoor`, `either`.
 - `space_needed`: one of `mic`, `mediu`, `mare` (small / medium / large area).
 - `participants_min`, `participants_max`: integers (people).
 - `duration_min`, `duration_max`: integers (minutes).
 - `age_group_min`, `age_group_max`: integers (years).
 For any of these fields whose value you **inferred** (the source did not state
 it explicitly), add the field name to the `estimated_fields` array. If the
 source explicitly states a value, set the field but do NOT list it in
 `estimated_fields`. Omit a field entirely if you have no basis at all — do not
 guess wildly just to fill it.
 Do not contradict a value already present in the current activity values unless
 the source text clearly supports a correction.
 ## Enum vocabulary (fixed — use these exact slugs)
 - `indoor_outdoor`: `indoor` | `outdoor` | `either`
 - `space_needed`: `mic` | `mediu` | `mare`
 ## Output format
 Write exactly one JSON object to `data/enrichment_parts/<content_key>.json`:
 ```json
 {
  "content_key": "<the exact key from the prompt>",
  "name_ro": "…",
  "description_ro": "…",
  "rules_ro": "…",
  "variations_ro": "…",
  "indoor_outdoor": "outdoor",
  "space_needed": "mediu",
  "participants_min": 6,
  "participants_max": 20,
  "duration_min": 15,
  "duration_max": 30,
  "age_group_min": 8,
  "age_group_max": 14,
  "estimated_fields": ["space_needed", "duration_min", "duration_max"]
 }
 ```
 Include only the fields you actually fill. Always include `content_key` and
 `estimated_fields` (use `[]` if nothing was inferred). Output valid JSON only —
 no commentary, no markdown fences in the file itself.
 ## Report
 After writing the file, report in under 30 words: the activity name and which
 fields you estimated.
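The output rules above could be checked mechanically before a part is collected. The following is a minimal sketch, not part of this commit: `check_enrichment_part` is a hypothetical helper, and treating an `estimated_fields` entry whose field is omitted as an error is an interpretation of the prompt rather than something it states outright.

```python
import json

# Field groups and enum slugs as listed in the prompt above.
TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
INT_FIELDS = (
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
)
ENUMS = {
    "indoor_outdoor": {"indoor", "outdoor", "either"},
    "space_needed": {"mic", "mediu", "mare"},
}


def check_enrichment_part(raw: str, expected_key: str) -> list[str]:
    """Hypothetical checker: return a list of problems (empty means OK)."""
    try:
        part = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(part, dict):
        return ["top level must be a JSON object"]

    problems = []
    if part.get("content_key") != expected_key:
        problems.append("content_key does not match the prompt")
    for fld in TEXT_FIELDS:
        # Empty twins must be omitted, never emitted as "".
        if fld in part and not (isinstance(part[fld], str) and part[fld].strip()):
            problems.append(f"{fld} is empty; omit it instead")
    for fld in INT_FIELDS:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if fld in part and (isinstance(part[fld], bool) or not isinstance(part[fld], int)):
            problems.append(f"{fld} must be an integer")
    for fld, allowed in ENUMS.items():
        if fld in part and (not isinstance(part[fld], str) or part[fld] not in allowed):
            problems.append(f"{fld} must be one of {sorted(allowed)}")

    estimated = part.get("estimated_fields")
    if not isinstance(estimated, list):
        problems.append("estimated_fields must be a list (use [] if nothing was inferred)")
    else:
        inferable = set(INT_FIELDS) | set(ENUMS)
        for name in estimated:
            if not isinstance(name, str) or name not in inferable:
                problems.append(f"estimated_fields lists unknown field {name!r}")
            elif name not in part:
                # Assumption: an estimated field should also be present.
                problems.append(f"estimated_fields lists {name!r} but the field is omitted")
    return problems
```

A caller might run this on each file under `data/enrichment_parts/` and skip any part that returns a non-empty list.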
--- a/scripts/SUBAGENT_PROMPT.md
+++ b/scripts/SUBAGENT_PROMPT.md
@@ -74,6 +74,23 @@ The file is one JSON object: a `header` plus an `activities` array.
 - Do **not** paraphrase the `source_excerpt` — copy it character for character.
 - Better to extract fewer activities accurately than to pad the output.
 ## Writing large outputs in batches (IMPORTANT)
 A single Write tool call has a hard ~32K output-token limit. Dense chunks
 (50+ activities) will exceed this. If you estimate >30 activities, write the
 file **incrementally**:
 1. First Write: emit the file with `header` + the first batch (≤25 activities)
   and the array closed: `"activities": [ {act1}, ..., {act25} ] }`.
 2. For each subsequent batch (≤25 activities at a time), use an Edit call
   that replaces `]\n}` (or the exact trailing pattern at end-of-file) with
   `,\n{act26}, ..., {act50}\n]\n}`. Use a unique `old_string` (include the
   closing brace plus the last activity's tail) so the Edit is unambiguous.
 3. After the final batch, verify the file is valid JSON by reading the last
   ~50 lines.
 This keeps each tool call under the output-token cap.
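The append step in (2) can be illustrated as a plain string operation. This is only a sketch of the idea, assuming the `activities` array is the last member of the object (as the `]\n}` pattern above implies); `append_batch` is a hypothetical name, not a function in this repository.

```python
import json


def append_batch(file_text: str, batch: list[dict]) -> str:
    """Hypothetical helper: splice a batch into the closed activities array."""
    if not batch:
        return file_text
    stripped = file_text.rstrip()
    if not stripped.endswith("}"):
        raise ValueError("file does not end with the closing brace")
    body = stripped[:-1].rstrip()
    if not body.endswith("]"):
        raise ValueError("activities array is not the last member")
    head = body[:-1].rstrip()

    new_items = ",\n".join(json.dumps(a, ensure_ascii=False) for a in batch)
    # No leading comma when the array is still empty.
    sep = "" if head.endswith("[") else ","
    updated = f"{head}{sep}\n{new_items}\n]\n}}"

    # Mirrors step 3: fail loudly if the result is not valid JSON.
    json.loads(updated)
    return updated
```

Parsing the result after every batch catches a broken splice immediately instead of at build time.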
 ## Before you finish
 - Every activity has a non-empty `source_excerpt` and `page_reference`.
--- a/scripts/build_database.py
+++ b/scripts/build_database.py
@@ -86,7 +86,12 @@ def _split_csv(value: Optional[str]) -> list[str]:
    return [p.strip() for p in str(value).split(",") if p.strip()]
-def dict_to_activity(adict: dict, source_file: str) -> Activity:
+def dict_to_activity(
    adict: dict,
    source_file: str,
    source_id: Optional[str] = None,
    chunk_key: Optional[str] = None,
 ) -> Activity:
    """Build an Activity from one extraction-JSON activity object."""
    tags = adict.get("tags") or []
    if isinstance(tags, str):
@@ -99,6 +104,9 @@ def dict_to_activity(adict: dict, source_file: str) -> Activity:
        source_files = [source_file, *source_files]
    return Activity(
        source_id=source_id,
        source_ids=[source_id] if source_id else [],
        chunk_key=chunk_key,
        name=(adict.get("name") or "").strip(),
        description=(adict.get("description") or "").strip(),
        rules=adict.get("rules"),
@@ -206,6 +214,19 @@ def merge_cluster(cluster: list[Activity]) -> Activity:
            if s and s not in sources:
                sources.append(s)
    merged.source_files = sources
    # source provenance: keep rep's chunk_key/source_id as primary, union the
    # source_ids for the download route. Enrichment fields (name_ro,
    # description_ro, indoor_outdoor, ...) are intentionally NOT carried here:
    # enrichment is applied AFTER dedup (plan Part B2), keyed on the merged
    # row's content_key, so merging must not pre-populate them.
    merged.source_id = rep.source_id
    merged.chunk_key = rep.chunk_key
    source_ids: list[str] = []
    for a in cluster:
        for sid in [a.source_id, *(a.source_ids or [])]:
            if sid and sid not in source_ids:
                source_ids.append(sid)
    merged.source_ids = source_ids
    # popularity_score++ per merged duplicate (plan §4)
    merged.popularity_score = max(a.popularity_score for a in cluster) + (len(cluster) - 1)
    return merged
@@ -313,6 +334,108 @@ def apply_review_decisions(
    return kept, stats
 # --------------------------------------------------------------------------
 # step 5b — enrichment overlay (plan Part B)
 # --------------------------------------------------------------------------
 # Translation / inferred-filter fields written by run_enrichment.py. Applied
 # AFTER dedup + review decisions, keyed on the same stable content_key, so the
 # overlay survives rebuilds as long as extraction text is frozen.
 _ENRICHMENT_TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
 _ENRICHMENT_INT_FIELDS = (
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
 )
 def load_enrichment(path: Path) -> dict:
    """Load data/enrichment.json (flat map content_key -> field dict)."""
    if path and path.is_file():
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            if isinstance(data, dict):
                return data
        except (json.JSONDecodeError, OSError):
            pass
    return {}
 def apply_enrichment(activities: list[Activity], enrichment: dict) -> dict:
    """
    Overlay enrichment fields onto the post-dedup activity list (plan B2).
    Keyed by content_key. Only fields PRESENT in an entry are written; absent
    fields leave the underlying DB value untouched. indoor_outdoor /
    space_needed are normalized to slugs (None on unrecognised). Inferred
    fields are recorded in `estimated_fields`. Translated / expanded text is
    NOT re-validated against the source here — expansion fidelity is the
    enrichment prompt's responsibility (plan B2 comment).
    Returns {entries, matched, orphaned, fields_stated, fields_estimated}.
    """
    from app.config_taxonomy import normalize_indoor_outdoor, normalize_space_needed
    matched_keys: set[str] = set()
    fields_stated: dict[str, int] = defaultdict(int)
    fields_estimated: dict[str, int] = defaultdict(int)
    for act in activities:
        key = content_key(
            act.normalized_name or normalize_name(act.name),
            act.language,
            act.description or "",
        )
        entry = enrichment.get(key)
        if not isinstance(entry, dict):
            continue
        matched_keys.add(key)
        estimated = set(entry.get("estimated_fields") or [])
        # bilingual text twins
        for fld in _ENRICHMENT_TEXT_FIELDS:
            val = entry.get(fld)
            if isinstance(val, str) and val.strip():
                setattr(act, fld, val.strip())
        # inferred / clarified structured numeric fields
        for fld in _ENRICHMENT_INT_FIELDS:
            if entry.get(fld) is not None:
                try:
                    setattr(act, fld, int(entry[fld]))
                except (TypeError, ValueError):
                    pass
        # enum filters — normalized to slug, dropped if unrecognised
        if entry.get("indoor_outdoor") is not None:
            slug = normalize_indoor_outdoor(entry["indoor_outdoor"])
            if slug:
                act.indoor_outdoor = slug
        if entry.get("space_needed") is not None:
            slug = normalize_space_needed(entry["space_needed"])
            if slug:
                act.space_needed = slug
        act.estimated_fields = sorted(estimated)
        # QA tally: stated vs estimated population, per field
        for fld in (*_ENRICHMENT_INT_FIELDS, "indoor_outdoor", "space_needed"):
            if entry.get(fld) is None:
                continue
            if fld in estimated:
                fields_estimated[fld] += 1
            else:
                fields_stated[fld] += 1
    return {
        "entries": len(enrichment),
        "matched": len(matched_keys),
        "orphaned": len(enrichment) - len(matched_keys),
        "fields_stated": dict(fields_stated),
        "fields_estimated": dict(fields_estimated),
    }
 # --------------------------------------------------------------------------
 # golden-set recall (plan §7)
 # --------------------------------------------------------------------------
@@ -390,9 +513,8 @@ def collect_activities(
        header = data.get("header", {})
        chunk_text = find_chunk_text(json_path, header, chunks_dir)
-        source_id = header.get("source_id") or chunk_key_for(json_path, header).rsplit(
-            ".part", 1
-        )[0]
+        chunk_key = chunk_key_for(json_path, header)
+        source_id = header.get("source_id") or chunk_key.rsplit(".part", 1)[0]
        fallback_source = (
            source_path_for(source_id, sources_dir) or source_id or json_path.stem
        )
@@ -409,7 +531,7 @@ def collect_activities(
                continue
            src = adict.get("source_file") or fallback_source
            raw_categories.append((adict.get("category", ""), normalize_category(adict.get("category", ""))))
-            activities.append(dict_to_activity(adict, src))
+            activities.append(dict_to_activity(adict, src, source_id, chunk_key))
        if hallucinated:
            _log_hallucinations(json_path, rejected_dir, hallucinated)
@@ -496,6 +618,7 @@ def rebuild(
    sources_dir: Path,
    db_path: Path,
    decisions_path: Optional[Path] = None,
    enrichment_path: Optional[Path] = None,
    schema_path: Path = DEFAULT_SCHEMA_PATH,
    golden_dir: Optional[Path] = None,
    do_swap: bool = True,
@@ -517,6 +640,11 @@ def rebuild(
    decisions = load_review_decisions(Path(decisions_path)) if decisions_path else {}
    final, decision_stats = apply_review_decisions(deduped, decisions)
    # Enrichment overlay — applied immediately after review decisions, on the
    # post-dedup list, keyed on the same stable content_key (plan B2).
    enrichment = load_enrichment(Path(enrichment_path)) if enrichment_path else {}
    enrichment_stats = apply_enrichment(final, enrichment)
    try:
        write_database(db_tmp_path, final)
        backup = atomic_swap(db_tmp_path, db_path) if do_swap else None
@@ -529,6 +657,7 @@ def rebuild(
        **collected,
        "dedup": dedup_stats,
        "decisions": decision_stats,
        "enrichment": enrichment_stats,
        "final_count": len(final),
        "backup": str(backup) if backup else None,
        "swapped": do_swap,
@@ -579,6 +708,16 @@ def print_report(report: dict) -> None:
          f"(auto-merged {d['auto_merged']}, borderline {d['borderline']})")
    print(f"review decisions     : dropped {report['decisions']['dropped']}, "
          f"resolved {report['decisions']['resolved']}")
    enr = report.get("enrichment")
    if enr and enr.get("entries"):
        print(f"enrichment           : {enr['entries']} entries "
              f"(matched {enr['matched']}, orphaned {enr['orphaned']})")
        stated, estimated = enr.get("fields_stated", {}), enr.get("fields_estimated", {})
        all_fields = sorted(set(stated) | set(estimated))
        if all_fields:
            print("  field population   :  (stated / estimated)")
            for fld in all_fields:
                print(f"    {fld:<18}: {stated.get(fld, 0)} / {estimated.get(fld, 0)}")
    print(f"final inserted       : {report['final_count']}")
    print(f"% with rules         : {qa['pct_with_rules']}")
    print(f"needs_review rows    : {qa['needs_review']}")
@@ -615,6 +754,7 @@ def main(argv: Optional[list[str]] = None) -> int:
    parser.add_argument("--sources", default="data/sources")
    parser.add_argument("--db", default="data/activities.db")
    parser.add_argument("--decisions", default="data/review_decisions.json")
    parser.add_argument("--enrichment", default="data/enrichment.json")
    parser.add_argument("--golden", default="data/golden")
    parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
    args = parser.parse_args(argv)
@@ -628,6 +768,7 @@ def main(argv: Optional[list[str]] = None) -> int:
        sources_dir=Path(args.sources),
        db_path=Path(args.db),
        decisions_path=Path(args.decisions),
        enrichment_path=Path(args.enrichment),
        schema_path=Path(args.schema),
        golden_dir=Path(args.golden),
    )
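The overlay rule documented for apply_enrichment in the build_database.py diff above (only fields present in an entry are written, unrecognised enum values are dropped, and each populated field is tallied as stated or estimated) can be sketched in isolation. This is a minimal, self-contained illustration and not the project's code: `Act`, `overlay`, and the `_SLUGS` alias table are hypothetical stand-ins for Activity, apply_enrichment, and the config_taxonomy normalizers.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical alias table; the real one lives in app.config_taxonomy.
_SLUGS = {"indoor": "indoor", "interior": "indoor", "outdoor": "outdoor"}


@dataclass
class Act:
    """Hypothetical stand-in for Activity, with only the fields used here."""
    name: str
    name_ro: Optional[str] = None
    participants_min: Optional[int] = None
    indoor_outdoor: Optional[str] = None
    estimated_fields: list = field(default_factory=list)


def overlay(act: Act, entry: dict) -> dict:
    """Write only the fields present in `entry`; tally stated vs estimated."""
    tally = {"stated": defaultdict(int), "estimated": defaultdict(int)}
    estimated = set(entry.get("estimated_fields") or [])
    if isinstance(entry.get("name_ro"), str) and entry["name_ro"].strip():
        act.name_ro = entry["name_ro"].strip()
    if entry.get("participants_min") is not None:
        try:
            act.participants_min = int(entry["participants_min"])
        except (TypeError, ValueError):
            pass  # unparseable number: keep the existing value
    if entry.get("indoor_outdoor") is not None:
        slug = _SLUGS.get(str(entry["indoor_outdoor"]).strip().lower())
        if slug:  # unrecognised values are dropped, never guessed
            act.indoor_outdoor = slug
    act.estimated_fields = sorted(estimated)
    for fld in ("participants_min", "indoor_outdoor"):
        if entry.get(fld) is not None:
            tally["estimated" if fld in estimated else "stated"][fld] += 1
    return {k: dict(v) for k, v in tally.items()}


act = Act(name="Treasure Hunt", participants_min=5)
stats = overlay(act, {
    "name_ro": " Vânătoarea de comori ",
    "indoor_outdoor": "Interior",
    "estimated_fields": ["indoor_outdoor"],
})
```

Here `participants_min` keeps its existing value of 5 because the entry does not mention it, while `indoor_outdoor` is written as a slug and counted under the estimated tally.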
--- a/scripts/repair_extractions.py
+++ b/scripts/repair_extractions.py
@@ -0,0 +1,244 @@
 #!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 """
 repair_extractions.py — one-shot repair of malformed extraction JSON.
 Subagents systematically emit unescaped ASCII double-quotes inside string
 values (Romanian text like  „Unu"  uses a closing " that terminates the JSON
 string early). Re-extraction reproduces the bug, so we repair instead.
 IMPORTANT — why NOT json_repair: json_repair "recovers" an unescaped quote by
 ending the string at the stray quote and reinterpreting the trailing text as a
 new key, which (a) TRUNCATES the value and (b) injects garbage keys. The
 truncation is silent (the field is still non-empty) and slips past a naive
 presence check. So we use a faithful char-scanner that ESCAPES stray quotes
 (\\") instead of splitting on them, then validate the result against the real
 activity schema (additionalProperties:false also catches any residual split).
 This is an OFFLINE maintenance tool. build_database.py must NOT depend on it —
 the "DB regenerable from data/extracted/" invariant requires plain valid JSON on
 disk. We write clean JSON back to data/extracted/ and the build reads vanilla
 json.
 Source selection (faithful recovery needs the ORIGINAL malformed text):
  * a chunk is a candidate when a MALFORMED original exists — either the
    top-level data/extracted/<key>.json is itself invalid, or a malformed
    original sits in data/extracted/_rejected/<key>.json.
  * the malformed original is preferred as the repair source.
  * chunks with no malformed original are never "repaired": a valid top-level
    file is left untouched, and a chunk without one (e.g. its original was
    lost to a prior json_repair pass) is reported as needing RE-EXTRACTION.
 Usage:
    python scripts/repair_extractions.py            # report only (dry run)
    python scripts/repair_extractions.py --apply     # write repaired JSON
 """
 from __future__ import annotations
 import argparse
 import glob
 import json
 import sys
 from pathlib import Path
 from typing import Optional
 SCRIPT_DIR = Path(__file__).resolve().parent
 REPO_ROOT = SCRIPT_DIR.parent
 EXTRACTED = REPO_ROOT / "data" / "extracted"
 REJECTED = EXTRACTED / "_rejected"
 if str(SCRIPT_DIR) not in sys.path:
    sys.path.insert(0, str(SCRIPT_DIR))
 from import_common import DEFAULT_SCHEMA_PATH, load_schema, validate_extraction  # noqa: E402
 def escape_stray_quotes(s: str) -> str:
    """Escape ASCII double-quotes that occur INSIDE a JSON string value.
    A `"` inside a string is treated as a real string-close only when the next
    non-whitespace char is structural (`,` `}` `]` `:`) or EOF; otherwise it is
    content and is escaped to `\\"`. This preserves the full value instead of
    truncating it (the json_repair failure mode).
    """
    out: list[str] = []
    in_str = False
    esc = False
    n = len(s)
    i = 0
    while i < n:
        c = s[i]
        if esc:
            out.append(c)
            esc = False
            i += 1
            continue
        if c == "\\":
            out.append(c)
            esc = True
            i += 1
            continue
        if c == '"':
            if not in_str:
                in_str = True
                out.append(c)
            else:
                j = i + 1
                while j < n and s[j] in " \t\r\n":
                    j += 1
                nxt = s[j] if j < n else ""
                if nxt in ",}]:" or nxt == "":
                    in_str = False
                    out.append(c)
                else:
                    out.append('\\"')  # content quote → escape, keep value whole
            i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)
 def _is_valid_json(path: Path) -> bool:
    try:
        json.loads(path.read_text(encoding="utf-8"))
        return True
    except (json.JSONDecodeError, OSError):
        return False
 def _malformed_source(key: str) -> Optional[Path]:
    """Return the malformed-original file for a chunk, preferring top-level."""
    live = EXTRACTED / f"{key}.json"
    if live.exists() and not _is_valid_json(live):
        return live
    rej = REJECTED / f"{key}.json"
    if rej.exists() and not _is_valid_json(rej):
        return rej
    return None
 def _candidate_keys() -> tuple[dict[str, Path], list[str]]:
    """
    (repair_candidates, needs_reextraction).
    repair_candidates: key -> malformed source file (faithfully repairable).
    needs_reextraction: chunks with no malformed original AND no valid
    top-level file (their original was lost) — must be re-extracted.
    """
    keys = set()
    for fn in glob.glob(str(EXTRACTED / "*.json")):
        keys.add(Path(fn).stem)
    for fn in glob.glob(str(REJECTED / "*.json")):
        keys.add(Path(fn).stem)
    candidates: dict[str, Path] = {}
    needs_reextraction: list[str] = []
    for key in sorted(keys):
        # A malformed original anywhere is faithfully repairable, and is the
        # source of truth even if a (json_repair-produced, possibly truncated)
        # valid top-level file exists — escaping the original never truncates,
        # so re-repairing from it is always >= the json_repair output.
        src = _malformed_source(key)
        if src is not None:
            candidates[key] = src
            continue
        live = EXTRACTED / f"{key}.json"
        if live.exists() and _is_valid_json(live):
            continue  # genuinely-valid extraction, nothing to do
        # no valid top-level and no malformed original to repair from
        needs_reextraction.append(key)
    return candidates, needs_reextraction
 def repair(apply: bool) -> int:
    schema = load_schema(DEFAULT_SCHEMA_PATH)
    candidates, needs_reextraction = _candidate_keys()
    print("=" * 64)
    print(f"REPAIR EXTRACTIONS  ({'APPLY' if apply else 'dry run'})")
    print("=" * 64)
    print(f"repair candidates: {len(candidates)}")
    def _textlen(data: dict) -> int:
        total = 0
        for a in data.get("activities", []):
            if isinstance(a, dict):
                for v in a.values():
                    if isinstance(v, str):
                        total += len(v)
        return total
    ok = 0
    kept_toplevel = 0
    still_bad: list[str] = []
    schema_fail: list[tuple[str, str]] = []
    for key, src in candidates.items():
        live = EXTRACTED / f"{key}.json"
        live_valid = live.exists() and _is_valid_json(live)
        raw = src.read_text(encoding="utf-8")
        fixed = escape_stray_quotes(raw)
        try:
            data = json.loads(fixed)
        except json.JSONDecodeError as exc:
            if live_valid:
                kept_toplevel += 1  # genuine top-level is fine; stale _rejected
            else:
                still_bad.append(f"{key}: still invalid after escape ({exc})")
            continue
        errors = validate_extraction(data, schema)
        if errors:
            if live_valid:
                kept_toplevel += 1
            else:
                schema_fail.append((key, errors[0]))
                print(f"  {key[:50]:<50} SCHEMA-FAIL: {errors[0][:40]}")
            continue
        # Faithfulness guard: only replace a valid top-level when the escaped
        # repair carries STRICTLY more text (i.e. the top-level was a truncated
        # json_repair output). Genuine extractions are kept untouched.
        if live_valid:
            try:
                live_data = json.loads(live.read_text(encoding="utf-8"))
            except json.JSONDecodeError:
                live_data = {}
            if _textlen(data) <= _textlen(live_data):
                kept_toplevel += 1
                continue
        n = len(data.get("activities", []))
        print(f"  {key[:50]:<50} {n:>3} acts  REPAIR")
        if apply:
            live.write_text(
                json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
            )
        ok += 1
    print("-" * 64)
    print(f"repaired: {ok} | kept genuine top-level: {kept_toplevel} | "
          f"schema-fail: {len(schema_fail)} | still-bad: {len(still_bad)} | "
          f"needs re-extraction: {len(needs_reextraction)}")
    for key, err in schema_fail:
        print(f"  ⚠ schema {key}: {err[:60]}")
    for msg in still_bad:
        print(f"  ✘ {msg}")
    for key in needs_reextraction:
        print(f"  ↻ re-extract: {key}")
    if not apply:
        print("\nDry run — re-run with --apply to write repaired JSON.")
    print("=" * 64)
    return 0
 def main(argv: Optional[list[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Repair malformed extraction JSON.")
    parser.add_argument("--apply", action="store_true",
                        help="write repaired JSON (default: dry run)")
    args = parser.parse_args(argv)
    return repair(args.apply)
 if __name__ == "__main__":
    raise SystemExit(main())
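The lookahead rule described in the escape_stray_quotes docstring can be exercised on a small input. The sketch below restates the same rule in compact form (`escape_inner_quotes` is a hypothetical name and the sample strings are invented) to show one case it recovers and one it cannot: a content quote that happens to be followed by a comma looks structural, so the output still fails to parse and would end up in the still-bad report.

```python
import json


def escape_inner_quotes(s: str) -> str:
    """Compact restatement of the rule: a quote inside a string closes it only
    when the next non-whitespace char is structural (, } ] :) or end of input."""
    out, in_str, i = [], False, 0
    while i < len(s):
        c = s[i]
        if c == "\\" and i + 1 < len(s):
            out.append(s[i:i + 2])  # keep existing escape pairs untouched
            i += 2
            continue
        if c == '"' and in_str:
            # Slicing per quote is quadratic in the worst case; fine for a sketch.
            nxt = s[i + 1:].lstrip(" \t\r\n")[:1]
            if nxt == "" or nxt in ",}]:":
                in_str = False
                out.append(c)
            else:
                out.append('\\"')  # content quote: escape instead of closing
        else:
            if c == '"':
                in_str = True
            out.append(c)
        i += 1
    return "".join(out)


fixed = escape_inner_quotes('{"name": "Jocul „Unu" pentru toți"}')
recovered = json.loads(fixed)["name"]  # full value, inner quote preserved

# Known limit of the heuristic: a content quote followed by a comma looks
# like a real string close, so the result is still not valid JSON.
ambiguous = escape_inner_quotes('{"k": "spune "da", apoi pleacă"}')
try:
    json.loads(ambiguous)
    still_bad = False
except json.JSONDecodeError:
    still_bad = True
```

The second case is why the repair step re-parses and schema-validates every escaped file instead of trusting the scanner's output.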
--- a/scripts/run_enrichment.py
+++ b/scripts/run_enrichment.py
@@ -0,0 +1,270 @@
 #!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 """
 run_enrichment.py — enrichment orchestrator (plan Part B3).
 Mirror of run_extraction.py, on the *other* side of the rebuild. It reads the
 already-rebuilt data/activities.db, and for every activity emits one subagent
 prompt asking for a single bilingual + inferred-filter enrichment pass. Like
 extraction, this script does NOT call the LLM — the interactive Claude Code
 orchestrator launches waves of subagents on the emitted prompts.
 Keying is the crux (plan §"Cheia de keying"): each row's overlay is keyed on
 import_common.content_key(normalized_name, language, _normalize_text(description))
 — the SAME function build_database uses to apply the overlay. The key is stable
 only while the extraction text is frozen, so enrichment runs AFTER the freezing
 rebuild.
 Modes:
  (default)    emit one prompt per activity that has no enrichment part yet
               (resumable: data/enrichment_parts/<key>.json present => skip)
  --collect    merge data/enrichment_parts/*.json -> data/enrichment.json
 Pilot scoping (plan B5): --source <substring of source_id or chunk_key> and/or
 --limit N (cap on emitted prompts) narrow the run for the sign-off pilot.
 Usage:
    python scripts/run_enrichment.py --source teambuilding_corbu        # pilot
    python scripts/run_enrichment.py                                    # all rows
    python scripts/run_enrichment.py --collect                          # merge parts
 """
 from __future__ import annotations
 import argparse
 import json
 import sqlite3
 import sys
 from pathlib import Path
 from typing import Optional
 SCRIPT_DIR = Path(__file__).resolve().parent
 REPO_ROOT = SCRIPT_DIR.parent
 for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
    if _p not in sys.path:
        sys.path.insert(0, _p)
 from import_common import (  # noqa: E402
    content_key,
    find_chunk_text,
    normalize_name,
 )
 ENRICHMENT_PROMPT = SCRIPT_DIR / "ENRICHMENT_PROMPT.md"
 # Columns pulled from the DB into the prompt as the "current value" context.
 _DB_COLUMNS = (
    "id", "name", "description", "rules", "variations",
    "category", "content_type", "language", "normalized_name",
    "page_reference", "source_id", "chunk_key",
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
 )
 # How much source-chunk text to inline. Chunks are page-sized; cap so a dense
 # chunk does not blow the prompt up, but keep enough to ground the expansion.
 _CHUNK_TEXT_CAP = 12000
 def _fetch_rows(db_path: Path, source_substr: Optional[str]) -> list[dict]:
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        cols = ", ".join(_DB_COLUMNS)
        sql = f"SELECT {cols} FROM activities"
        params: list = []
        if source_substr:
            sql += " WHERE (source_id LIKE ? OR chunk_key LIKE ?)"
            params = [f"%{source_substr}%", f"%{source_substr}%"]
        sql += " ORDER BY source_id, id"
        return [dict(r) for r in conn.execute(sql, params).fetchall()]
    finally:
        conn.close()
 def _row_content_key(row: dict) -> str:
    return content_key(
        row.get("normalized_name") or normalize_name(row.get("name") or ""),
        row.get("language"),
        row.get("description") or "",
    )
 def _chunk_text_for_row(row: dict, chunks_dir: Path) -> Optional[str]:
    """Locate the source-chunk text via the row's chunk_key / source_id."""
    header = {"chunk_key": row.get("chunk_key"), "source_id": row.get("source_id")}
    if not header["chunk_key"]:
        return None
    # find_chunk_text resolves from the header when chunk_key is present;
    # the json_path arg is only a fallback, so a synthetic path is fine.
    text = find_chunk_text(Path(f"{row['chunk_key']}.json"), header, chunks_dir)
    if text and len(text) > _CHUNK_TEXT_CAP:
        text = text[:_CHUNK_TEXT_CAP] + "\n…[chunk truncated]…"
    return text
 def _current_fields_block(row: dict) -> str:
    """The activity's current DB values, as a compact JSON block for context."""
    fields = {
        "name": row.get("name"),
        "description": row.get("description"),
        "rules": row.get("rules"),
        "variations": row.get("variations"),
        "category": row.get("category"),
        "content_type": row.get("content_type"),
        "language": row.get("language"),
        "participants_min": row.get("participants_min"),
        "participants_max": row.get("participants_max"),
        "duration_min": row.get("duration_min"),
        "duration_max": row.get("duration_max"),
        "age_group_min": row.get("age_group_min"),
        "age_group_max": row.get("age_group_max"),
    }
    return json.dumps(fields, ensure_ascii=False, indent=2)
 def emit_enrichment_prompt(
    row: dict, key: str, chunks_dir: Path, prompts_dir: Path
 ) -> Path:
    """Write the subagent enrichment prompt for one activity."""
    chunk_text = _chunk_text_for_row(row, chunks_dir)
    source_block = (
        chunk_text if chunk_text is not None
        else "[source chunk text unavailable — translate only what is given "
             "above; do NOT invent steps, and mark any inferred filter field "
             "as estimated]"
    )
    part_path = f"data/enrichment_parts/{key}.json"
    text = "\n".join([
        f"# ENRICHMENT — activity `{row.get('name')}` (id {row.get('id')})",
        "",
        f"Follow the rules in `{ENRICHMENT_PROMPT.relative_to(REPO_ROOT)}` EXACTLY.",
        "Single pass. Translate faithfully to Romanian; expand description_ro "
        "ONLY from the source chunk text below; mark inferred filter fields in "
        "`estimated_fields`.",
        "",
        f"Write the result JSON to: `{part_path}`",
        f'It MUST include `"content_key": "{key}"`.',
        f'Page reference: {row.get("page_reference") or "?"}',
        "",
        "## Current activity values (the text to translate / enrich)",
        "```json",
        _current_fields_block(row),
        "```",
        "",
        "## Source chunk text (ground description_ro expansion in THIS only)",
        "```",
        source_block,
        "```",
        "",
    ])
    prompts_dir.mkdir(parents=True, exist_ok=True)
    out = prompts_dir / f"{key}.prompt.md"
    out.write_text(text, encoding="utf-8")
    return out
 def collect_enrichment(parts_dir: Path, out_path: Path) -> dict:
    """Merge data/enrichment_parts/*.json into one flat content_key map."""
    merged: dict = {}
    bad: list[str] = []
    if parts_dir.is_dir():
        for part in sorted(parts_dir.glob("*.json")):
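            try:
                data = json.loads(part.read_text(encoding="utf-8"))
            except (json.JSONDecodeError, OSError):
                bad.append(part.name)
                continue
            if not isinstance(data, dict):
                # valid JSON but not an object: .get() below would raise
                bad.append(part.name)
                continue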
            key = data.get("content_key") or part.stem
            entry = {k: v for k, v in data.items() if k != "content_key"}
            merged[key] = entry
    out_path.write_text(
        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return {"entries": len(merged), "bad_parts": bad, "out": str(out_path)}
 def run_emit(
    *,
    db_path: Path,
    chunks_dir: Path,
    parts_dir: Path,
    prompts_dir: Path,
    source_substr: Optional[str],
    limit: Optional[int],
 ) -> dict:
    rows = _fetch_rows(db_path, source_substr)
    emitted, skipped = 0, 0
    for row in rows:
        key = _row_content_key(row)
        if (parts_dir / f"{key}.json").is_file():
            skipped += 1
            continue
        emit_enrichment_prompt(row, key, chunks_dir, prompts_dir)
        emitted += 1
        if limit and emitted >= limit:
            break
    return {
        "rows": len(rows),
        "emitted": emitted,
        "skipped_done": skipped,
        "prompts_dir": str(prompts_dir),
    }
 def main(argv: Optional[list[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Enrichment orchestrator.")
    parser.add_argument("--db", default="data/activities.db")
    parser.add_argument("--chunks", default="data/chunks")
    parser.add_argument("--parts", default="data/enrichment_parts")
    parser.add_argument("--prompts", default="data/enrichment_prompts")
    parser.add_argument("--out", default="data/enrichment.json")
    parser.add_argument("--source", default=None,
                        help="only rows whose source_id/chunk_key contains this (pilot)")
    parser.add_argument("--limit", type=int, default=None,
                        help="cap emitted prompts (pilot)")
    parser.add_argument("--collect", action="store_true",
                        help="merge enrichment parts into the overlay JSON")
    args = parser.parse_args(argv)
    print("=" * 60)
    print("ENRICHMENT ORCHESTRATOR")
    print("=" * 60)
    if args.collect:
        result = collect_enrichment(Path(args.parts), Path(args.out))
        print(f"collected  : {result['entries']} entries -> {result['out']}")
        if result["bad_parts"]:
            print(f"bad parts  : {len(result['bad_parts'])} (skipped)")
            for name in result["bad_parts"]:
                print(f"  - {name}")
        print("Run build_database.py --rebuild to apply the overlay.")
        print("=" * 60)
        return 0
    summary = run_emit(
        db_path=Path(args.db),
        chunks_dir=Path(args.chunks),
        parts_dir=Path(args.parts),
        prompts_dir=Path(args.prompts),
        source_substr=args.source,
        limit=args.limit,
    )
    print(f"rows in DB        : {summary['rows']}"
          + (f"  (filtered by '{args.source}')" if args.source else ""))
    print(f"already enriched  : {summary['skipped_done']}")
    print(f"prompts emitted   : {summary['emitted']}")
    if summary["emitted"]:
        print(f"prompts dir       : {summary['prompts_dir']}/")
        print("Launch waves of ~8-16 Sonnet subagents on those prompts, each "
              "writing data/enrichment_parts/<key>.json, then run "
              "run_enrichment.py --collect and build_database.py --rebuild.")
    else:
        print("Nothing to emit — run --collect then build_database.py --rebuild.")
    print("=" * 60)
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
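The --collect step reduces a directory of per-activity part files to one map keyed by content_key. A minimal sketch of that merge follows, using a temporary directory and made-up keys; `merge_parts`, `k1`, `k2`, and `k3` are illustrative names, not part of the repository. In this sketch, parts that fail to parse or that are not JSON objects are listed as bad and left out of the merged map.

```python
import json
import tempfile
from pathlib import Path


def merge_parts(parts_dir: Path) -> tuple[dict, list[str]]:
    """Merge <key>.json part files into {content_key: entry}; list bad parts."""
    merged: dict = {}
    bad: list[str] = []
    for part in sorted(parts_dir.glob("*.json")):
        try:
            data = json.loads(part.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError):
            bad.append(part.name)
            continue
        if not isinstance(data, dict):
            bad.append(part.name)
            continue
        # The declared content_key wins; the file stem is only a fallback.
        key = data.get("content_key") or part.stem
        merged[key] = {k: v for k, v in data.items() if k != "content_key"}
    return merged, bad


with tempfile.TemporaryDirectory() as tmp:
    parts = Path(tmp)
    (parts / "k1.json").write_text(
        json.dumps({"content_key": "k1", "name_ro": "Joc"}), encoding="utf-8")
    (parts / "k2.json").write_text(
        json.dumps({"space_needed": "mic"}), encoding="utf-8")  # no content_key
    (parts / "k3.json").write_text('{"name_ro": "trunchiat', encoding="utf-8")
    merged, bad = merge_parts(parts)
```

Because the overlay is later applied by exact content_key match, a part whose key comes only from its file name is usable only if that name was itself derived from the same key function.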
--- a/tests/test_enrichment.py
+++ b/tests/test_enrichment.py
@@ -0,0 +1,231 @@
 """
 Tests for the enrichment overlay (plan Part B) and the new filter axes /
 bilingual display helpers (plan Part A).
 Covers:
  * config_taxonomy.normalize_indoor_outdoor / normalize_space_needed
  * build_database.apply_enrichment keying, field application, estimated tally
  * DatabaseManager indoor_outdoor / space_needed equality filters
  * FTS5 indexing of the *_ro columns
  * Activity bilingual display helpers
 """
 import os
 import sys
 import pytest
 PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
 if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)
 SCRIPTS = os.path.join(PROJECT_ROOT, "scripts")
 if SCRIPTS not in sys.path:
    sys.path.insert(0, SCRIPTS)
 from app.models.activity import Activity  # noqa: E402
 from app.models.database import DatabaseManager  # noqa: E402
 from app.config_taxonomy import (  # noqa: E402
    normalize_indoor_outdoor,
    normalize_space_needed,
 )
 from import_common import content_key, normalize_name  # noqa: E402
 from build_database import apply_enrichment  # noqa: E402
 # --------------------------------------------------------------------------
 # taxonomy normalizers
 # --------------------------------------------------------------------------
@pytest.mark.parametrize("raw,expected", [
    ("indoor", "indoor"),
    ("Outdoor", "outdoor"),
    ("either", "either"),
    ("interior", "indoor"),
    ("aer liber", "outdoor"),
    ("both", "either"),
    ("", None),
    ("nonsense", None),
    (None, None),
 ])
 def test_normalize_indoor_outdoor(raw, expected):
    assert normalize_indoor_outdoor(raw) == expected
@pytest.mark.parametrize("raw,expected", [
    ("mic", "mic"),
    ("MEDIU", "mediu"),
    ("mare", "mare"),
    ("small", "mic"),
    ("large", "mare"),
    ("", None),
    ("huge", None),
    (None, None),
 ])
 def test_normalize_space_needed(raw, expected):
    assert normalize_space_needed(raw) == expected
 # --------------------------------------------------------------------------
 # apply_enrichment
 # --------------------------------------------------------------------------
 def _activity(name="Joc de test", description="O descriere de test.", language="ro"):
    return Activity(
        name=name, description=description, category="team-building",
        content_type="joc", source_file="t.txt", language=language,
    )
 def _key_for(act: Activity) -> str:
    return content_key(
        act.normalized_name or normalize_name(act.name),
        act.language,
        act.description or "",
    )
 def test_apply_enrichment_matches_and_applies_fields():
    act = _activity()
    key = _key_for(act)
    enrichment = {
        key: {
            "name_ro": "Joc de test (RO)",
            "description_ro": "Descriere îmbogățită în română.",
            "indoor_outdoor": "outdoor",
            "space_needed": "mediu",
            "participants_min": 4,
            "participants_max": 12,
            "estimated_fields": ["space_needed", "participants_min", "participants_max"],
        }
    }
    stats = apply_enrichment([act], enrichment)
    assert act.name_ro == "Joc de test (RO)"
    assert act.description_ro == "Descriere îmbogățită în română."
    assert act.indoor_outdoor == "outdoor"
    assert act.space_needed == "mediu"
    assert act.participants_min == 4 and act.participants_max == 12
    assert set(act.estimated_fields) == {"space_needed", "participants_min", "participants_max"}
    assert stats["entries"] == 1
    assert stats["matched"] == 1
    assert stats["orphaned"] == 0
    # indoor_outdoor stated, space_needed estimated
    assert stats["fields_stated"].get("indoor_outdoor") == 1
    assert stats["fields_estimated"].get("space_needed") == 1
 def test_apply_enrichment_orphan_entry_counted():
    act = _activity()
    enrichment = {"deadbeef" * 5: {"name_ro": "nu se potrivește"}}
    stats = apply_enrichment([act], enrichment)
    assert stats["matched"] == 0
    assert stats["orphaned"] == 1
    assert act.name_ro is None  # untouched
 def test_apply_enrichment_absent_fields_leave_value_untouched():
    act = _activity()
    act.participants_min = 5
    key = _key_for(act)
    # entry only translates name; participants must be preserved
    apply_enrichment([act], {key: {"name_ro": "Tradus"}})
    assert act.participants_min == 5
    assert act.name_ro == "Tradus"
 def test_apply_enrichment_drops_unrecognised_enum():
    act = _activity()
    key = _key_for(act)
    apply_enrichment([act], {key: {"indoor_outdoor": "spaceship", "space_needed": "mic"}})
    assert act.indoor_outdoor is None       # unrecognised → dropped
    assert act.space_needed == "mic"
 # --------------------------------------------------------------------------
 # DB equality filters + FTS on *_ro
 # --------------------------------------------------------------------------
@pytest.fixture
 def db(tmp_path):
    return DatabaseManager(str(tmp_path / "enrich.db"))
 def _insert(db, **overrides):
    base = dict(
        name="Activitate", description="desc", category="camp-outdoor",
        content_type="joc", source_file="t.txt", language="ro",
    )
    base.update(overrides)
    return db.insert_activity(Activity(**base))
 def test_indoor_outdoor_equality_filter(db):
    _insert(db, name="In casa", indoor_outdoor="indoor")
    _insert(db, name="Afara", indoor_outdoor="outdoor")
    res = db.search_activities(indoor_outdoor="outdoor")
    assert len(res) == 1
    assert res[0]["name"] == "Afara"
 def test_space_needed_equality_filter(db):
    _insert(db, name="Mic", space_needed="mic")
    _insert(db, name="Mare", space_needed="mare")
    res = db.search_activities(space_needed="mare")
    assert len(res) == 1
    assert res[0]["name"] == "Mare"
 def test_fts_indexes_name_ro(db):
    _insert(db, name="Treasure Hunt", name_ro="Vânătoarea de comori")
    # term only present in the Romanian twin
    res = db.search_activities(search_text="comori")
    assert len(res) == 1
    assert res[0]["name"] == "Treasure Hunt"
 def test_fts_indexes_description_ro(db):
    _insert(db, name="Game", description="english desc",
            description_ro="o activitate de cooperare")
    res = db.search_activities(search_text="cooperare")
    assert len(res) == 1
 def test_ro_columns_round_trip(db):
    aid = _insert(
        db, name="X", name_ro="X-ro", description_ro="d-ro",
        rules_ro="r-ro", variations_ro="v-ro",
        indoor_outdoor="either", space_needed="mediu",
        estimated_fields=["duration_min"], source_id="src1",
        source_ids=["src1", "src2"], chunk_key="src1.part01",
    )
    row = db.get_activity_by_id(aid)
    loaded = Activity.from_dict(row)
    assert loaded.name_ro == "X-ro"
    assert loaded.indoor_outdoor == "either"
    assert loaded.space_needed == "mediu"
    assert loaded.estimated_fields == ["duration_min"]
    assert loaded.source_ids == ["src1", "src2"]
    assert loaded.chunk_key == "src1.part01"
 # --------------------------------------------------------------------------
 # display helpers
 # --------------------------------------------------------------------------
 def test_display_helpers_prefer_ro_with_fallback():
    act = _activity(name="Original", description="Original desc")
    assert act.get_display_name() == "Original"          # no translation yet
    assert act.get_display_description() == "Original desc"
    act.name_ro = "Tradus"
    act.description_ro = "Descriere tradusă"
    assert act.get_display_name() == "Tradus"
    assert act.get_display_description() == "Descriere tradusă"
    assert act.has_translation() is True
 def test_is_estimated_and_axis_displays():
    act = _activity()
    act.indoor_outdoor = "outdoor"
    act.space_needed = "mare"
    act.estimated_fields = ["space_needed"]
    assert act.get_indoor_outdoor_display() == "Exterior"
    assert act.get_space_needed_display() == "Spațiu mare"
    assert act.is_estimated("space_needed") is True
    assert act.is_estimated("indoor_outdoor") is False