Faza 1 complete: bilingual+enrichment plumbing, UI/filters, frozen DB

Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await
re-extraction). DB rebuilt and frozen at 9418 activities; content_keys are now
stable for the enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor,
  space_needed, estimated_fields, source_id/source_ids/chunk_key columns;
  FTS5 indexes the 4 *_ro columns across CREATE + all 3 triggers; new equality
  filters + category counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*,
  is_estimated, axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers (None on
  unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns,
  RO-primary rendering with "(estimat)" markers and collapsible original text,
  and a /source/<id> download route shipped DARK behind
  SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster
  unions source_ids without touching enrichment fields.

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on
  content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect),
  ENRICHMENT_PROMPT.md.

Repair: scripts/repair_extractions.py fixes the subagents' systematic
unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never
truncates) + schema validation + a strictly-more-text guard. json_repair was
tried first, truncated silently, and is NOT used. build_database has no repair
dependency.

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 18:10:13 +00:00
parent 46d9592a55
commit bcfb6841eb
18 changed files with 1579 additions and 167 deletions
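
The repair approach named in the commit message (escape stray quotes rather than split on them, then accept a repair only if it carries strictly more text) can be illustrated with a minimal sketch. This is not the code in `scripts/repair_extractions.py`: the function names, the lookahead heuristic, and the way text length is measured are all assumptions made for illustration.

```python
import json


def escape_stray_quotes_sketch(text: str) -> str:
    """Escape ASCII quotes that appear inside JSON string values.

    Heuristic (an assumption of this sketch): inside a string, a `"` closes
    the string only if the next non-whitespace character is one of , } ] :
    or the input ends. Any other `"` is treated as stray and escaped, so no
    character of the original value is ever dropped.
    """
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
            i += 1
        elif ch == "\\" and i + 1 < n:
            # Existing escape sequence: copy both characters untouched.
            out.append(text[i:i + 2])
            i += 2
        elif ch == '"':
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if j >= n or text[j] in ",}]:":
                in_string = False      # genuine closing quote
                out.append(ch)
            else:
                out.append('\\"')      # stray quote: escape, keep scanning
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def total_text_len(value) -> int:
    """Sum the lengths of all string values in a parsed JSON document."""
    if isinstance(value, str):
        return len(value)
    if isinstance(value, dict):
        return sum(total_text_len(v) for v in value.values())
    if isinstance(value, list):
        return sum(total_text_len(v) for v in value)
    return 0


def carries_strictly_more_text(candidate, current) -> bool:
    """Hypothetical guard: replace `current` only if `candidate` has more text."""
    return total_text_len(candidate) > total_text_len(current)
```

With this heuristic, `{"name": "Jocul „Unu" pentru copii", "language": "ro"}` fails `json.loads` as written but parses after `escape_stray_quotes_sketch`, with the full value preserved. A stray quote that happens to be followed by a comma or colon would still be misread as a closing quote, which is one reason a schema validation step after the scan remains useful.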
--- a/HANDOFF.md
+++ b/HANDOFF.md
@@ -1,135 +1,143 @@
-# HANDOFF — Faza 1 extraction in progress
+# HANDOFF — Faza 1 extraction DONE, code landed, DB frozen; next = re-extract 7 + enrichment pilot

-**Last updated:** 2026-05-20, commit `3d9f266` (pilot complete) + uncommitted Faza 1 work.
+**Snapshot:** 2026-05-29. Executing plan `enumerated-petting-badger.md` (bilingual
+index + enrichment + new filters + source download).

-## State of play
+## Where we are

-Faza 0 (pilot) is **complete and committed**. Faza 1 (full corpus) is **in progress at 10.9%**.
+| Step (plan Part C) | Status |
+|--------------------|--------|
+| 1. Finish extraction | **DONE** — 575/588 chunks extracted & valid; 13 missing (see below) |
+| 2. Land code Part A1–A4 (model/schema/merge) | **DONE & committed** |
+| 2b. Code Part A5–A8 (UI/search/download) | **DONE & committed** |
+| 2c. Code Part B2–B4 (enrichment pipeline) | **DONE & committed** |
+| 3. Freeze rebuild (freezes content_keys) | **DONE** — `data/activities.db` = **9418 activities** |
+| Part D tests | **DONE** — `tests/test_enrichment.py`, 99 pass total |
+| 4. Enrichment pilot → **STOP for user sign-off** | **NOT STARTED** (this is the gate) |
+| 5. Final rebuild `--enrichment` | not started |

-| Phase | Status | DB rows | Tests |
-|-------|--------|---------|-------|
-| Faza 0 pilot (5 files) | committed (`3d9f266`) | 1751 in `data/activities.db` | 71 passed |
-| Faza 1 extraction | **64/588 chunks done**, 1949 activities in `data/extracted/*.json` (not yet imported to DB) | — | — |
+Everything is committed except whatever this session leaves dirty. `data/extracted/*.json`
+is gitignored (575 files on disk, durable across /clear).

-## What "Faza 1" is
+## The 13 missing chunks (out of 588)

-Process the full 96-source corpus (was 116 raw files; some are duplicates/empty zips/skipped junk) through the LLM-subagent extraction pipeline. Same code path as the pilot. Two large mirror dirs dominate the chunk count:
+**6 content-filter-blocked** (Anthropic safety; accept as missing — marginal loss):
+- `87850302_dragon_sleepdeprived.part73 / .part85 / .part94` (camp song lyrics)
+- `c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96`

- `87850302_dragon_sleepdeprived` — 116 chunks (full dragon.sleepdeprived.ca mirror)
- `c3162825_resource_pack__learning_by_playing_catalunya_...` — 97 chunks (the catalunya mirror)
-
-Combined they are 213/588 = 36% of all remaining chunks.
-
-## Critical recommendation for the next session
-
-**Use Sonnet 4.6 for subagent extractions, not Opus.** Opus was used through chunks 1-64 and burned through a 5-hour rate-limit window faster than this scale needs. Sonnet has 200K context which is plenty for the ~25k-token prompt + ~20k-token output of a single chunk extraction. The task is structured JSON extraction with a fixed schema — no complex reasoning needed.
-
-The Agent tool's `model` parameter takes `"sonnet"`. Pass it on every Agent call below.
-
-## Resuming — exact mechanical steps
-
-1. **Verify state.**
-   ```bash
-   cd /workspace/game-library
-   ls data/extracted/*.json | wc -l       # should be 64 (or higher if more done)
-   ls data/chunks/_prompts/ | wc -l       # 588 — the full prompt set
-   git log --oneline -3                   # 3d9f266 must be HEAD or earlier
+**7 need RE-EXTRACTION** (their malformed-original JSON was destroyed — see "json_repair
+incident" below; re-extract once the subagent session limit resets, ~5pm UTC):
 ```
-
-2. **Find what's still pending.** Compare prompts to extracted files:
-   ```bash
-   ls data/chunks/_prompts/ | sed 's/\.prompt\.md$//' | sort > /tmp/all.txt
-   ls data/extracted/*.json 2>/dev/null | sed 's|.*/||;s|\.json$||' | sort > /tmp/done.txt
-   comm -23 /tmp/all.txt /tmp/done.txt > /tmp/pending.txt
-   wc -l /tmp/pending.txt                 # how many remain
-   head /tmp/pending.txt                  # what's next
+3f9c8232_teambuilding_corbu_29092023.part01
+5f959f85_scoli_fara_bullying.part02
+83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04
+d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01
+d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04
+d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05
+e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03
 ```
+Re-extract these (Sonnet subagents, one Agent call each, the per-chunk prompt is at
+`data/chunks/_prompts/<key>.prompt.md`), then **re-run the freeze rebuild** so they join
+the corpus before enrichment. Re-freezing is safe now — enrichment has NOT run, so no
+overlay keys depend on the current freeze yet.

-3. **Launch waves of 16 subagents in parallel.** One Agent call per chunk. Single message with 16 Agent tool calls. Use this exact template (substitute `<CHUNK_KEY>`):
+## The json_repair incident (important — root cause + what was fixed)

-   ```
-   Agent(
-     description: "Extract <CHUNK_KEY>",
-     subagent_type: "general-purpose",
-     model: "sonnet",                  ← critical
-     prompt: "Working directory: /workspace/game-library. Extraction subagent — read data/chunks/_prompts/<CHUNK_KEY>.prompt.md and follow it EXACTLY. Apply rules from scripts/SUBAGENT_PROMPT.md and schema from scripts/activity_schema.json. Write the JSON. Set language per chunk content ('ro' or 'en'). Report under 40 words."
-   )
-   ```
+Subagents **systematically emit unescaped ASCII `"` inside string values** (Romanian
+text like `„Unu"` uses a closing `"` that terminates the JSON string early). ~34 files
+were affected.

-   The per-chunk prompt file is fully self-contained — it points to the right chunk, sets source_id/source_hash/chunk_key, and references the rules + schema. The subagent just follows it.
+First repair attempt used the `json_repair` lib. **It truncates**: on a stray quote it
+ends the string and reinterprets the trailing text as a new key, silently dropping the
+rest of the value and injecting garbage keys. Schema `additionalProperties:false` caught
+the garbage-key cases (8 files dropped at rebuild), but the truncation that didn't create
+an extra key slipped through. Applying json_repair output to disk also **overwrote the
+malformed originals** for those 8 → originals lost → those (now 7, one recovered) need
+re-extraction.

-4. **After every wave**, briefly check progress and continue:
-   ```bash
-   ls data/extracted/*.json | wc -l
-   ```
-   Repeat step 3 with the next 16 pending chunks. If an agent reports `"You've hit your limit · resets ..."` AND `tool_uses: 5` with `total_tokens: 0`, check whether the JSON was written anyway — agents often persist the file before the limit hit. Only re-launch if the JSON is missing.
+**Fix:** `scripts/repair_extractions.py` was rewritten to use a faithful char-scanner
+(`escape_stray_quotes`) that **escapes** stray quotes (`\"`) instead of splitting on them,
+validates against the real schema, and only replaces a valid top-level file when the
+repaired version carries **strictly more text** (a length guard that catches truncated
+json_repair output while leaving genuine extractions untouched). Re-running it cleanly
+repaired the affected files; the final freeze had **0 schema-rejected, 0 invalid**.
+`json_repair` is no longer used anywhere. Do NOT reintroduce it.

-5. **When all 588 chunks are done**, finalize:
-   ```bash
-   python3 scripts/validate_extractions.py    # any chunks marked rejected go to data/extracted/_reextract/
-   # re-extract any rejected chunks (same template, prompt from _reextract/)
-   python3 scripts/build_database.py --rebuild
-   # if many borderline needs_review rows:
-   python3 -c "
-   import sys; sys.path.insert(0,'scripts')
-   from import_common import content_key, normalize_name
-   import sqlite3, json
-   conn = sqlite3.connect('data/activities.db')
-   conn.row_factory = sqlite3.Row
-   rows = list(conn.execute('SELECT name, normalized_name, language, description FROM activities WHERE needs_review=1'))
-   d = {content_key(r['normalized_name'] or normalize_name(r['name']), r['language'], r['description'] or ''): 'merge' for r in rows}
-   json.dump(d, open('data/review_decisions.json','w'), indent=2)
-   print(f'{len(d)} merge decisions')
-   "
-   python3 scripts/build_database.py --rebuild   # apply decisions
-   python3 -m pytest tests/ -q                   # 71 should pass
-   git add data/activities.db data/review_decisions.json
-   git commit -m "Faza 1: full corpus extraction"
-   ```
+`build_database.py` does NOT depend on the repair script (the "DB regenerable from
+data/extracted/" invariant holds — plain `json.loads` only).

-## Code reference — what each script does
+## What the code does now (all committed)

- `scripts/normalize_sources.py --corpus data/carti-camp-jocuri --out data/sources` → produces 96 `data/sources/<id>.txt` files with `--- PAGE N ---` markers. **Done. Don't re-run.**
- `scripts/chunk_sources.py --sources data/sources --chunks data/chunks` → splits each into ~20pg chunks with 4pg overlap, writes `data/chunks/<id>/<id>.partNN.txt` and updates `data/chunks/manifest.json`. **Done. Don't re-run unless sources change.**
- `scripts/run_extraction.py` → regenerates the per-chunk prompts in `data/chunks/_prompts/`. **Done. Don't re-run unless schema/prompt changes.**
- `scripts/SUBAGENT_PROMPT.md` — extraction rules (what subagents follow).
- `scripts/activity_schema.json` — JSON schema each extraction must validate against.
- `scripts/validate_extractions.py` — per-file schema check + fuzzy `source_excerpt` substring check; writes re-extraction prompts to `data/extracted/_reextract/` for rejected chunks; marks chunks `rejected` in manifest.
- `scripts/build_database.py --rebuild` — validates every `data/extracted/*.json` against schema, drops per-activity hallucinations, dedup, applies `data/review_decisions.json`, atomic swap into `data/activities.db`.
- `scripts/review_queue.py list|resolve <id> <merge|keep-separate|drop>` — CLI for borderline-dedup decisions; persisted in `data/review_decisions.json`.
+**Part A — plumbing (corpus-independent):**
+- `app/models/database.py`: new columns `name_ro/description_ro/rules_ro/variations_ro,
+  indoor_outdoor, space_needed, estimated_fields(JSON), source_id, source_ids(JSON),
+  chunk_key`; FTS5 indexes the 4 `*_ro` columns (CREATE + all 3 triggers — kept in sync);
+  indexes on `indoor_outdoor`/`space_needed`; `search_activities` gained `indoor_outdoor`
+  and `space_needed` equality kwargs; `_update_category_counts` feeds both new axes into
+  the categories table so dropdowns populate.
+- `app/models/activity.py`: new fields + `to_dict`/`from_dict`; helpers `get_display_name`
+  / `get_display_description` / `get_display_rules` / `get_display_variations`
+  (RO-primary, EN fallback), `has_translation`, `is_estimated(field)`,
+  `get_indoor_outdoor_display`, `get_space_needed_display`.
+- `app/config_taxonomy.py`: `INDOOR_OUTDOOR`, `SPACE_NEEDED` enums + RO labels +
+  `normalize_indoor_outdoor` / `normalize_space_needed` (None on unrecognised, no
+  fallback — never fabricate a value) + display-name helpers.
+- `scripts/build_database.py`: `dict_to_activity` sets `source_id`+`chunk_key`;
+  `merge_cluster` unions `source_ids` and carries rep's `source_id`/`chunk_key` but
+  **never** touches enrichment fields (those are applied post-dedup).

-## Pilot lessons that apply
+**Part A — UI/search:**
+- `app/services/search.py`: `_map_filters_to_db_fields` maps `indoor_outdoor`/
+  `space_needed` to DB equality filters.
+- `app/web/routes.py`: new `/source/<id>` download route — **shipped DARK behind
+  `SOURCE_DOWNLOAD_ENABLED` (default false; copyright exposure, user opts in)**; resolves
+  `source_file` under `CORPUS_DIR` via `send_from_directory` (traversal-safe, 404s for
+  web-mirror sources). `DISPLAY_NAMES` extended with both new axes.
+- `app/config.py`: `SOURCE_DOWNLOAD_ENABLED`, `CORPUS_DIR`.
+- Templates: `index.html`/`results.html` have the 2 new dropdowns; cards use display
+  helpers + `(estimat)` markers; `activity.html` is RO-primary with a collapsible
+  "Text original" section, indoor/space cards, estimat markers, and the download link
+  (only when the flag is on). `main.css` has `.estimated` / `.original-text` styles.

- **~1.07% hallucinated drops** at pilot scale (well below the 2% threshold). Caused by source_excerpts straddling `--- PAGE N ---` markers. Re-extraction with verbatim within-page quotes fixed all 13 affected chunks. Expect similar rate at Faza 1 scale (~10-30 chunks may need re-extraction).
- **Borderline dedup queue** (369 rows in pilot) — same-name activities re-extracted from chunk overlap with slightly-different prose. Bulk-merge is the right call: same normalized_name + same language + 60-85% desc similarity → merge takes the longest fields. Use the snippet in step 5 above.
- **OCR not needed.** The candidate scanned PDF (`07.Cartea_Mare`) extracted 151 pages of real text via pdfplumber alone. Skip OCR for v1.
+**Part B — enrichment pipeline (built, not yet run):**
+- `scripts/build_database.py`: `load_enrichment` + `apply_enrichment(activities, enrichment)`
+  applied **right after** `apply_review_decisions`, on the post-dedup list, keyed on
+  `import_common.content_key(normalized_name, language, _normalize_text(description))`
+  (reused verbatim). CLI `--enrichment` (default `data/enrichment.json`). QA report prints
+  `enrichment {entries, matched, orphaned}` + per-field **stated vs estimated** counts.
+  Translated/expanded text is NOT re-validated against source (by design).
+- `scripts/run_enrichment.py`: reads the rebuilt DB, computes each row's content_key,
+  skips rows already in `data/enrichment_parts/<key>.json` (resumable), emits one prompt
+  per activity to `data/enrichment_prompts/` (current EN fields + source chunk text via
+  `find_chunk_text`). Pilot scoping: `--source <substr>` and/or `--limit N`. `--collect`
+  merges parts → `data/enrichment.json`.
+- `scripts/ENRICHMENT_PROMPT.md`: single-pass rules — translate faithfully, expand
+  `description_ro` ONLY from chunk text, mark inferred filter fields in `estimated_fields`,
+  fixed enum vocab, output `data/enrichment_parts/<content_key>.json` including `content_key`.

-## Files not yet committed (uncommitted in this session)
+## Exact next steps

- `data/sources/` — all 96 normalized `.txt` files (in `.gitignore`, don't try to commit them)
- `data/chunks/` — all 588 chunks + manifest (in `.gitignore`)
- `data/extracted/` — 64 JSON files so far (in `.gitignore`)
- `data/activities.db` — **still the pilot's 1751-row DB.** Will be rebuilt after Faza 1 finishes.
+1. **Re-extract the 7 chunks** above (after session-limit reset). Verify each writes valid
+   JSON (`python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"`).
+   If any come back malformed, `python3 scripts/repair_extractions.py --apply` (faithful now).
+2. **Re-freeze:** `python3 scripts/build_database.py --rebuild` — confirm 0 schema-rejected,
+   note the new total (~9418 + the 7 chunks' activities).
+3. **Enrichment PILOT** (plan B5 — the STOP gate guarding 6–8k LLM calls):
+   - Pick one source, e.g. `python3 scripts/run_enrichment.py --source teambuilding_corbu`
+     (or `--limit 30`). This writes prompts to `data/enrichment_prompts/`.
+   - Launch a small wave of Sonnet subagents on those prompts (each writes
+     `data/enrichment_parts/<key>.json`).
+   - `python3 scripts/run_enrichment.py --collect` → `data/enrichment.json`.
+   - `python3 scripts/build_database.py --rebuild` (picks up `--enrichment` by default).
+   - **STOP. Hand the user translation-quality + estimation-plausibility + description-
+     fidelity samples and get sign-off BEFORE scaling to the full corpus.** Do not
+     auto-proceed past this gate.
+4. After sign-off: scale enrichment in waves of ~8–16 Sonnet subagents, `--collect`,
+   final `--rebuild --enrichment`.

-The schema, all scripts, all tests, and the pilot DB are already committed at `3d9f266`. No code changes are needed for Faza 1 — just data.
+## Verify / run

-## Status snapshot (as of handoff)
-
-```
-chunks done       : 64 / 588   (10.9%)
-activities so far : 1949
-remaining chunks  : 524
-largest pending sources:
-  87850302_dragon_sleepdeprived               116 chunks (full dragon mirror)
-  c3162825_resource_pack__learning_by_playing  97 chunks (catalunya mirror)
-  4da6431e_cub_scout_leader_how_to_book         18
-  4a765782_1000_fantastic_scout_games           18 (re-extract; was in pilot)
-  bee67427_the_big_book_of_conflict_resolution  15
-  e3bd0953_1001_idei_pentru_o_educatie_timp     14
-  d5e51389_09_culegere_de_jocuri_si_povestiri   13
-  ce4b48f1_impact_culegere_de_jocuri_si_povest  13
-  193fdd94_ghid_de_integrare_a_persoanelor_vul  12 (in progress)
-  779f4fa0_ghidul_animatorului_855_de_jocuri    11
-```
-
-In a fresh session: `cat HANDOFF.md`, then go straight to step 3 above.
+- Tests: `python3 -m pytest tests/ -q` → 99 pass.
+- App: `SOURCE_DOWNLOAD_ENABLED` is false by default (download link hidden). Set it true
+  only if the user accepts the copyright exposure of serving original files.
+- `data/activities.db.bak` is the pre-this-freeze backup.
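
The overlay described in the HANDOFF above (enrichment entries keyed on `content_key`, applied to the post-dedup list, with an `entries / matched / orphaned` report) might look roughly like the following. It is a sketch under stated assumptions: `content_key_sketch` is a hypothetical stand-in for `import_common.content_key`, whose real implementation is not shown in this diff, the text normalisation step is omitted, and activities are plain dicts rather than `Activity` objects.

```python
import hashlib
from typing import Dict, List

# Fields the overlay is allowed to set (taken from the column list in this commit).
ENRICHMENT_FIELDS = (
    "name_ro", "description_ro", "rules_ro", "variations_ro",
    "indoor_outdoor", "space_needed", "estimated_fields",
)


def content_key_sketch(normalized_name: str, language: str, description: str) -> str:
    """Hypothetical stand-in for import_common.content_key (real hashing unknown)."""
    raw = "\x1f".join((normalized_name, language, description))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def apply_overlay_sketch(activities: List[dict], enrichment: Dict[str, dict]) -> dict:
    """Copy enrichment fields onto matching activities and report coverage."""
    matched = set()
    for act in activities:
        key = content_key_sketch(
            act["normalized_name"], act["language"], act.get("description") or ""
        )
        entry = enrichment.get(key)
        if entry is None:
            continue
        matched.add(key)
        for field_name in ENRICHMENT_FIELDS:
            # Absent fields leave the underlying value untouched.
            if entry.get(field_name) is not None:
                act[field_name] = entry[field_name]
    return {
        "entries": len(enrichment),
        "matched": len(matched),
        "orphaned": len(enrichment) - len(matched),
    }
```

Because the key is derived from the name, language, and description, any rebuild that changes one of those inputs turns the matching entry into an orphan. That is why a re-freeze is described as safe only while no enrichment has been generated against the current keys.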
--- a/app/config.py
+++ b/app/config.py
@@ -23,6 +23,18 @@ class Config:
    SEARCH_RESULTS_LIMIT = int(os.environ.get('SEARCH_RESULTS_LIMIT', '100'))
    FTS_ENABLED = True

+    # Source-file download (plan A6). Shipped DARK by default: serving the
+    # original PDFs/books carries a copyright exposure the user must opt into.
+    # The /source/<id> route 404s entirely while this is false; the UI hides
+    # the download link. Enable with SOURCE_DOWNLOAD_ENABLED=true.
+    SOURCE_DOWNLOAD_ENABLED = (
+        os.environ.get('SOURCE_DOWNLOAD_ENABLED', 'false').lower() == 'true'
+    )
+    # Root of the original corpus. source_file values are relative to this.
+    CORPUS_DIR = os.environ.get('CORPUS_DIR') or str(
+        Path(__file__).parent.parent / 'data' / 'carti-camp-jocuri'
+    )
+    
    @staticmethod
    def ensure_directories():
        """Ensure required directories exist"""
--- a/app/config_taxonomy.py
+++ b/app/config_taxonomy.py
@@ -8,7 +8,7 @@ the UI displays the Romanian name. `category` (thematic domain) and

 import unicodedata
 import re
-from typing import Dict, List
+from typing import Dict, List, Optional

 # --- Categories (thematic domain) --------------------------------------------
 # slug -> Romanian display name. ~16 fixed slugs; `altele` is the mandatory
@@ -215,6 +215,89 @@ def normalize_content_type(value: str) -> str:
    return aliases.get(slug, DEFAULT_CONTENT_TYPE)


+# --- Indoor / outdoor (enrichment axis) --------------------------------------
+# Where the activity is run. Inferred during enrichment when the source is
+# silent — such inferences are flagged in `estimated_fields`. slug -> RO label.
+INDOOR_OUTDOOR: Dict[str, str] = {
+    "indoor": "Interior",
+    "outdoor": "Exterior",
+    "either": "Interior sau exterior",
+}
+
+# --- Space needed (enrichment axis) ------------------------------------------
+# Rough footprint the activity requires. slug -> RO label.
+SPACE_NEEDED: Dict[str, str] = {
+    "mic": "Spațiu mic",
+    "mediu": "Spațiu mediu",
+    "mare": "Spațiu mare",
+}
+
+# Aliases for robustness against LLM output variation. Keys are _slugify'd.
+_INDOOR_OUTDOOR_ALIASES: Dict[str, str] = {
+    "interior": "indoor",
+    "inside": "indoor",
+    "in": "indoor",
+    "exterior": "outdoor",
+    "outside": "outdoor",
+    "out": "outdoor",
+    "aer-liber": "outdoor",
+    "both": "either",
+    "any": "either",
+    "ambele": "either",
+    "interior-exterior": "either",
+    "indoor-outdoor": "either",
+}
+
+_SPACE_NEEDED_ALIASES: Dict[str, str] = {
+    "small": "mic",
+    "redus": "mic",
+    "putin": "mic",
+    "medium": "mediu",
+    "moderat": "mediu",
+    "large": "mare",
+    "big": "mare",
+    "mult": "mare",
+    "spatiu-mic": "mic",
+    "spatiu-mediu": "mediu",
+    "spatiu-mare": "mare",
+}
+
+
+def normalize_indoor_outdoor(value: str) -> Optional[str]:
+    """Map an arbitrary string to an indoor_outdoor slug, or None.
+
+    Unlike categories, this has NO mandatory fallback: an unrecognised or
+    empty value yields None (field simply absent), so we never fabricate a
+    location the enrichment did not assert.
+    """
+    if not value:
+        return None
+    slug = _slugify(str(value))
+    if slug in INDOOR_OUTDOOR:
+        return slug
+    return _INDOOR_OUTDOOR_ALIASES.get(slug)
+
+
+def normalize_space_needed(value: str) -> Optional[str]:
+    """Map an arbitrary string to a space_needed slug, or None (no fallback)."""
+    if not value:
+        return None
+    slug = _slugify(str(value))
+    if slug in SPACE_NEEDED:
+        return slug
+    return _SPACE_NEEDED_ALIASES.get(slug)
+
+
+def indoor_outdoor_display_name(slug: str) -> str:
+    """RO display name for an indoor_outdoor slug."""
+    return INDOOR_OUTDOOR.get(slug, slug)
+
+
+def space_needed_display_name(slug: str) -> str:
+    """RO display name for a space_needed slug."""
+    return SPACE_NEEDED.get(slug, slug)
+
+
 def is_valid_category(slug: str) -> bool:
    """True if `slug` is a valid category slug."""
    return slug in CATEGORIES
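
The "None on unrecognised" behaviour of the two normalizers above can be exercised in isolation with a small sketch. The module's `_slugify` is not shown in this diff, so `slugify_sketch` below is an assumed stand-in (strip diacritics, lowercase, hyphenate), and the enum and alias tables are trimmed copies kept only to make the example self-contained.

```python
import re
import unicodedata
from typing import Optional

INDOOR_OUTDOOR = {"indoor", "outdoor", "either"}
INDOOR_OUTDOOR_ALIASES = {"interior": "indoor", "aer-liber": "outdoor", "ambele": "either"}
SPACE_NEEDED = {"mic", "mediu", "mare"}
SPACE_NEEDED_ALIASES = {"small": "mic", "spatiu-mic": "mic", "large": "mare"}


def slugify_sketch(value: str) -> str:
    """Assumed stand-in for _slugify: ASCII-fold, lowercase, join with hyphens."""
    folded = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-")


def normalize_indoor_outdoor_sketch(value: Optional[str]) -> Optional[str]:
    """Return a known slug, a mapped alias, or None. Never a default value."""
    if not value:
        return None
    slug = slugify_sketch(str(value))
    if slug in INDOOR_OUTDOOR:
        return slug
    return INDOOR_OUTDOOR_ALIASES.get(slug)


def normalize_space_needed_sketch(value: Optional[str]) -> Optional[str]:
    """Same contract for the space axis: slug, alias, or None."""
    if not value:
        return None
    slug = slugify_sketch(str(value))
    if slug in SPACE_NEEDED:
        return slug
    return SPACE_NEEDED_ALIASES.get(slug)
```

Under these assumptions, `"Aer liber"` maps to `outdoor` and `"Spațiu mic"` to `mic`, while an unexpected value such as `"poolside"` yields `None`, so the column stays empty instead of receiving a guessed location.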
--- a/app/models/activity.py
+++ b/app/models/activity.py
@@ -76,6 +76,25 @@ class Activity:
    extraction_confidence: Optional[str] = None  # 'high' / 'med' / 'low'
    needs_review: int = 0

+    # Enrichment overlay (applied at build time from data/enrichment.json; see
+    # plan Part B). Bilingual: the EN/source text stays in name/description/...
+    # and the Romanian rendering lands in the *_ro twins. Absent fields leave
+    # the underlying DB value untouched.
+    name_ro: Optional[str] = None
+    description_ro: Optional[str] = None
+    rules_ro: Optional[str] = None
+    variations_ro: Optional[str] = None
+    indoor_outdoor: Optional[str] = None     # slug: indoor / outdoor / either
+    space_needed: Optional[str] = None       # slug: mic / mediu / mare
+    # Names of fields whose value was INFERRED by enrichment (source was
+    # silent) rather than stated in the source — surfaced as "(estimat)" in UI.
+    estimated_fields: List[str] = field(default_factory=list)
+
+    # Source provenance for the download route + enrichment keying.
+    source_id: Optional[str] = None          # e.g. "876d1a2d_marcaje_turistice"
+    source_ids: List[str] = field(default_factory=list)  # all source_ids merged
+    chunk_key: Optional[str] = None          # e.g. "<source_id>.part01"
+
    # Database fields
    id: Optional[int] = None
    created_at: Optional[str] = None
@@ -117,6 +136,16 @@ class Activity:
            'normalized_name': self.normalized_name or normalize_name(self.name),
            'extraction_confidence': self.extraction_confidence,
            'needs_review': self.needs_review,
+            'name_ro': self.name_ro,
+            'description_ro': self.description_ro,
+            'rules_ro': self.rules_ro,
+            'variations_ro': self.variations_ro,
+            'indoor_outdoor': self.indoor_outdoor,
+            'space_needed': self.space_needed,
+            'estimated_fields': json.dumps(self.estimated_fields) if self.estimated_fields else None,
+            'source_id': self.source_id,
+            'source_ids': json.dumps(self.source_ids) if self.source_ids else None,
+            'chunk_key': self.chunk_key,
        }
    
    @classmethod
@@ -140,6 +169,19 @@ class Activity:
        elif source_files is None:
            source_files = []

+        # estimated_fields / source_ids: JSON string (DB) or list (in-memory)
+        def _json_list(value):
+            if isinstance(value, str):
+                try:
+                    parsed = json.loads(value)
+                    return parsed if isinstance(parsed, list) else []
+                except (json.JSONDecodeError, TypeError):
+                    return []
+            return list(value) if value else []
+
+        estimated_fields = _json_list(data.get('estimated_fields'))
+        source_ids = _json_list(data.get('source_ids'))
+
        return cls(
            id=data.get('id'),
            name=data.get('name', ''),
@@ -170,6 +212,16 @@ class Activity:
            normalized_name=data.get('normalized_name'),
            extraction_confidence=data.get('extraction_confidence'),
            needs_review=data.get('needs_review', 0) or 0,
+            name_ro=data.get('name_ro'),
+            description_ro=data.get('description_ro'),
+            rules_ro=data.get('rules_ro'),
+            variations_ro=data.get('variations_ro'),
+            indoor_outdoor=data.get('indoor_outdoor'),
+            space_needed=data.get('space_needed'),
+            estimated_fields=estimated_fields,
+            source_id=data.get('source_id'),
+            source_ids=source_ids,
+            chunk_key=data.get('chunk_key'),
            created_at=data.get('created_at'),
            updated_at=data.get('updated_at')
        )
@@ -211,3 +263,43 @@ class Activity:
        elif self.materials_list:
            return self.materials_list[:100] + "..." if len(self.materials_list) > 100 else self.materials_list
        return "nu specificate"
+
+    # --- Enrichment / bilingual display helpers ------------------------------
+    def get_display_name(self) -> str:
+        """Romanian name when enriched, else the original."""
+        return self.name_ro or self.name
+
+    def get_display_description(self) -> str:
+        """Romanian description when enriched, else the original."""
+        return self.description_ro or self.description
+
+    def get_display_rules(self) -> Optional[str]:
+        """Romanian rules when enriched, else the original."""
+        return self.rules_ro or self.rules
+
+    def get_display_variations(self) -> Optional[str]:
+        """Romanian variations when enriched, else the original."""
+        return self.variations_ro or self.variations
+
+    def has_translation(self) -> bool:
+        """True if any Romanian enrichment text is present."""
+        return bool(self.name_ro or self.description_ro
+                    or self.rules_ro or self.variations_ro)
+
+    def is_estimated(self, field_name: str) -> bool:
+        """True if `field_name` was inferred by enrichment (source was silent)."""
+        return field_name in (self.estimated_fields or [])
+
+    def get_indoor_outdoor_display(self) -> Optional[str]:
+        """RO label for indoor_outdoor, or None when unset."""
+        if not self.indoor_outdoor:
+            return None
+        from app.config_taxonomy import indoor_outdoor_display_name
+        return indoor_outdoor_display_name(self.indoor_outdoor)
+
+    def get_space_needed_display(self) -> Optional[str]:
+        """RO label for space_needed, or None when unset."""
+        if not self.space_needed:
+            return None
+        from app.config_taxonomy import space_needed_display_name
+        return space_needed_display_name(self.space_needed)
--- a/app/models/database.py
+++ b/app/models/database.py
@@ -72,6 +72,18 @@ class DatabaseManager:
                    extraction_confidence TEXT,
                    needs_review INTEGER DEFAULT 0,

+                    -- Enrichment overlay (bilingual + inferred filters; Part B)
+                    name_ro TEXT,
+                    description_ro TEXT,
+                    rules_ro TEXT,
+                    variations_ro TEXT,
+                    indoor_outdoor TEXT,
+                    space_needed TEXT,
+                    estimated_fields TEXT,
+                    source_id TEXT,
+                    source_ids TEXT,
+                    chunk_key TEXT,
+
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                )
@@ -82,6 +94,7 @@ class DatabaseManager:
                CREATE VIRTUAL TABLE IF NOT EXISTS activities_fts USING fts5(
                    name, description, rules, variations, keywords,
                    materials_list, skills_developed,
+                    name_ro, description_ro, rules_ro, variations_ro,
                    content='activities',
                    content_rowid='id'
                )
@@ -106,6 +119,8 @@ class DatabaseManager:
                "CREATE INDEX IF NOT EXISTS idx_activities_participants ON activities(participants_min, participants_max)",
                "CREATE INDEX IF NOT EXISTS idx_activities_duration ON activities(duration_min, duration_max)",
                "CREATE INDEX IF NOT EXISTS idx_activities_normalized_name ON activities(normalized_name)",
+                "CREATE INDEX IF NOT EXISTS idx_activities_indoor_outdoor ON activities(indoor_outdoor)",
+                "CREATE INDEX IF NOT EXISTS idx_activities_space_needed ON activities(space_needed)",
                "CREATE INDEX IF NOT EXISTS idx_categories_type ON categories(type)"
            ]
            
@@ -117,9 +132,11 @@ class DatabaseManager:
                CREATE TRIGGER IF NOT EXISTS activities_fts_insert AFTER INSERT ON activities
                BEGIN
                    INSERT INTO activities_fts(rowid, name, description, rules, variations,
-                                               keywords, materials_list, skills_developed)
+                                               keywords, materials_list, skills_developed,
+                                               name_ro, description_ro, rules_ro, variations_ro)
                    VALUES (new.id, new.name, new.description, new.rules, new.variations,
-                            new.keywords, new.materials_list, new.skills_developed);
+                            new.keywords, new.materials_list, new.skills_developed,
+                            new.name_ro, new.description_ro, new.rules_ro, new.variations_ro);
                END
            """)

@@ -127,9 +144,11 @@ class DatabaseManager:
                CREATE TRIGGER IF NOT EXISTS activities_fts_delete AFTER DELETE ON activities
                BEGIN
                    INSERT INTO activities_fts(activities_fts, rowid, name, description, rules,
-                                               variations, keywords, materials_list, skills_developed)
+                                               variations, keywords, materials_list, skills_developed,
+                                               name_ro, description_ro, rules_ro, variations_ro)
                    VALUES ('delete', old.id, old.name, old.description, old.rules,
-                            old.variations, old.keywords, old.materials_list, old.skills_developed);
+                            old.variations, old.keywords, old.materials_list, old.skills_developed,
+                            old.name_ro, old.description_ro, old.rules_ro, old.variations_ro);
                END
            """)

@@ -137,13 +156,17 @@ class DatabaseManager:
                CREATE TRIGGER IF NOT EXISTS activities_fts_update AFTER UPDATE ON activities
                BEGIN
                    INSERT INTO activities_fts(activities_fts, rowid, name, description, rules,
-                                               variations, keywords, materials_list, skills_developed)
+                                               variations, keywords, materials_list, skills_developed,
+                                               name_ro, description_ro, rules_ro, variations_ro)
                    VALUES ('delete', old.id, old.name, old.description, old.rules,
-                            old.variations, old.keywords, old.materials_list, old.skills_developed);
+                            old.variations, old.keywords, old.materials_list, old.skills_developed,
+                            old.name_ro, old.description_ro, old.rules_ro, old.variations_ro);
                    INSERT INTO activities_fts(rowid, name, description, rules, variations,
-                                               keywords, materials_list, skills_developed)
+                                               keywords, materials_list, skills_developed,
+                                               name_ro, description_ro, rules_ro, variations_ro)
                    VALUES (new.id, new.name, new.description, new.rules, new.variations,
-                            new.keywords, new.materials_list, new.skills_developed);
+                            new.keywords, new.materials_list, new.skills_developed,
+                            new.name_ro, new.description_ro, new.rules_ro, new.variations_ro);
                END
            """)
            
@@ -210,6 +233,10 @@ class DatabaseManager:
            ('duration', activity.get_duration_display()),
            ('materials', activity.get_materials_display()),
            ('difficulty', activity.difficulty_level),
+            # Enrichment axes — slugs stored as value; UI maps to RO via
+            # DISPLAY_NAMES. Without these the new dropdowns would be empty.
+            ('indoor_outdoor', activity.indoor_outdoor),
+            ('space_needed', activity.space_needed),
        ]
        
        for cat_type, cat_value in categories_to_update:
@@ -236,6 +263,8 @@ class DatabaseManager:
                         duration_max: Optional[int] = None,
                         materials_category: Optional[str] = None,
                         difficulty_level: Optional[str] = None,
+                         indoor_outdoor: Optional[str] = None,
+                         space_needed: Optional[str] = None,
                         limit: int = 100) -> List[Dict[str, Any]]:
        """Enhanced search with FTS5 and filters"""
        
@@ -294,6 +323,14 @@ class DatabaseManager:
                base_query += " AND difficulty_level = ?"
                params.append(difficulty_level)

+            if indoor_outdoor:
+                base_query += " AND indoor_outdoor = ?"
+                params.append(indoor_outdoor)
+
+            if space_needed:
+                base_query += " AND space_needed = ?"
+                params.append(space_needed)
+
            # Add ordering and limit
            query = f"{base_query} {order_clause} LIMIT ?"
            params.append(limit)
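Reviewer note on the FTS5 hunks above: with an external-content table (`content='activities'`), FTS5 cannot look up the old values itself, so the delete and update triggers must pass every indexed column of the old row. That is why the four `*_ro` columns appear in the `CREATE` statement and in all three triggers. Below is a minimal sketch of the same pattern on a hypothetical two-column table (`items`, `items_fts`, and the trigger names are illustrative, not the project's schema). It assumes the local SQLite build ships with FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, name_ro TEXT);

-- External-content index: the text lives in `items`, FTS5 stores only the index.
CREATE VIRTUAL TABLE items_fts USING fts5(
    name, name_ro, content='items', content_rowid='id'
);

CREATE TRIGGER items_ai AFTER INSERT ON items BEGIN
    INSERT INTO items_fts(rowid, name, name_ro)
    VALUES (new.id, new.name, new.name_ro);
END;

-- The 'delete' command needs the OLD values of every indexed column.
CREATE TRIGGER items_ad AFTER DELETE ON items BEGIN
    INSERT INTO items_fts(items_fts, rowid, name, name_ro)
    VALUES ('delete', old.id, old.name, old.name_ro);
END;

-- Update = remove the old entry, then index the new one.
CREATE TRIGGER items_au AFTER UPDATE ON items BEGIN
    INSERT INTO items_fts(items_fts, rowid, name, name_ro)
    VALUES ('delete', old.id, old.name, old.name_ro);
    INSERT INTO items_fts(rowid, name, name_ro)
    VALUES (new.id, new.name, new.name_ro);
END;
""")


def search(term):
    """Return matching rowids from the full-text index."""
    rows = conn.execute(
        "SELECT rowid FROM items_fts WHERE items_fts MATCH ? ORDER BY rowid",
        (term,),
    )
    return [row[0] for row in rows]


conn.execute("INSERT INTO items(name, name_ro) VALUES ('Capture the flag', NULL)")
before = search("steagului")          # no Romanian text indexed yet

conn.execute("UPDATE items SET name_ro = 'Capturarea steagului' WHERE id = 1")
after_update = search("steagului")    # update trigger re-indexed the row

conn.execute("DELETE FROM items WHERE id = 1")
after_delete = search("steagului")    # delete trigger removed the entry
```

If a column were added to the `CREATE VIRTUAL TABLE` statement but left out of a trigger, the index would drift out of sync with the content table, so the four places are best reviewed together.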
--- a/app/services/search.py
+++ b/app/services/search.py
@@ -201,6 +201,13 @@ class SearchService:
            elif filter_key == 'difficulty':
                db_filters['difficulty_level'] = filter_value

+            elif filter_key == 'indoor_outdoor':
+                # Equality filter on the slug column (mirror difficulty).
+                db_filters['indoor_outdoor'] = filter_value
+
+            elif filter_key == 'space_needed':
+                db_filters['space_needed'] = filter_value
+
            # Handle any other custom filters
            else:
                # Generic filter handling - try to match against keywords or tags
--- a/app/static/css/main.css
+++ b/app/static/css/main.css
@@ -706,3 +706,29 @@ body {
        border: 1px solid #ddd;
    }
 }
+
+/* Enrichment markers (plan Part A7) */
+.estimated {
+    color: #8a6d3b;
+    font-style: italic;
+    font-size: 0.85em;
+    font-weight: normal;
+}
+
+.original-text > summary {
+    cursor: pointer;
+    color: #555;
+    user-select: none;
+}
+
+.original-text .original-content {
+    margin-top: 0.75rem;
+    padding-left: 1rem;
+    border-left: 3px solid #e0e0e0;
+    color: #555;
+}
+
+.download-hint {
+    color: #888;
+    font-size: 0.85em;
+}
--- a/app/templates/activity.html
+++ b/app/templates/activity.html
@@ -8,13 +8,13 @@
    <nav class="breadcrumb">
        <a href="{{ url_for('main.index') }}">Căutare</a>
        <span class="breadcrumb-separator">»</span>
-        <span class="breadcrumb-current">{{ activity.name }}</span>
+        <span class="breadcrumb-current">{{ activity.get_display_name() }}</span>
    </nav>

    <!-- Activity header -->
    <header class="activity-detail-header">
        <div class="activity-title-section">
-            <h1 class="activity-detail-title">{{ activity.name }}</h1>
+            <h1 class="activity-detail-title">{{ activity.get_display_name() }}</h1>
            <span class="activity-category-badge">{{ display_names.get(activity.category, activity.category) }}</span>
            {% if activity.content_type %}
            <span class="activity-content-type-badge">{{ display_names.get(activity.content_type, activity.content_type) }}</span>
@@ -31,27 +31,46 @@

    <!-- Activity content -->
    <div class="activity-detail-content">
-        <!-- Main description -->
+        <!-- Main description (Romanian-primary, falls back to original) -->
        <section class="activity-section">
            <h2 class="section-title">Descriere</h2>
-            <div class="activity-description">{{ activity.description }}</div>
+            <div class="activity-description">{{ activity.get_display_description() }}</div>
        </section>

        <!-- Rules and variations -->
-        {% if activity.rules %}
+        {% if activity.get_display_rules() %}
        <section class="activity-section">
            <h2 class="section-title">Reguli</h2>
-            <div class="activity-rules">{{ activity.rules }}</div>
+            <div class="activity-rules">{{ activity.get_display_rules() }}</div>
        </section>
        {% endif %}

-        {% if activity.variations %}
+        {% if activity.get_display_variations() %}
        <section class="activity-section">
            <h2 class="section-title">Variații</h2>
-            <div class="activity-variations">{{ activity.variations }}</div>
+            <div class="activity-variations">{{ activity.get_display_variations() }}</div>
        </section>
        {% endif %}

+        <!-- Original (pre-translation) text, collapsed by default -->
+        {% if activity.has_translation() %}
+        <details class="activity-section original-text">
+            <summary class="section-title">Text original ({{ display_names.get(activity.language, activity.language or 'sursă') }})</summary>
+            <div class="original-content">
+                <h3 class="metadata-title">{{ activity.name }}</h3>
+                <div class="activity-description">{{ activity.description }}</div>
+                {% if activity.rules %}
+                <h4 class="metadata-title">Reguli</h4>
+                <div class="activity-rules">{{ activity.rules }}</div>
+                {% endif %}
+                {% if activity.variations %}
+                <h4 class="metadata-title">Variații</h4>
+                <div class="activity-variations">{{ activity.variations }}</div>
+                {% endif %}
+            </div>
+        </details>
+        {% endif %}
+
        <!-- Metadata grid -->
        <section class="activity-section">
            <h2 class="section-title">Detalii activitate</h2>
@@ -59,21 +78,35 @@
                {% if activity.get_age_range_display() != "toate vârstele" %}
                <div class="metadata-card">
                    <h3 class="metadata-title">Grupa de vârstă</h3>
-                    <p class="metadata-value">{{ activity.get_age_range_display() }}</p>
+                    <p class="metadata-value">{{ activity.get_age_range_display() }}{% if activity.is_estimated('age_group_min') or activity.is_estimated('age_group_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
                </div>
                {% endif %}

                {% if activity.get_participants_display() != "orice număr" %}
                <div class="metadata-card">
                    <h3 class="metadata-title">Participanți</h3>
-                    <p class="metadata-value">{{ activity.get_participants_display() }}</p>
+                    <p class="metadata-value">{{ activity.get_participants_display() }}{% if activity.is_estimated('participants_min') or activity.is_estimated('participants_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
                </div>
                {% endif %}

                {% if activity.get_duration_display() != "durată variabilă" %}
                <div class="metadata-card">
                    <h3 class="metadata-title">Durata</h3>
-                    <p class="metadata-value">{{ activity.get_duration_display() }}</p>
+                    <p class="metadata-value">{{ activity.get_duration_display() }}{% if activity.is_estimated('duration_min') or activity.is_estimated('duration_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
+                </div>
+                {% endif %}
+
+                {% if activity.get_indoor_outdoor_display() %}
+                <div class="metadata-card">
+                    <h3 class="metadata-title">Interior / exterior</h3>
+                    <p class="metadata-value">{{ activity.get_indoor_outdoor_display() }}{% if activity.is_estimated('indoor_outdoor') %} <em class="estimated">(estimat)</em>{% endif %}</p>
+                </div>
+                {% endif %}
+
+                {% if activity.get_space_needed_display() %}
+                <div class="metadata-card">
+                    <h3 class="metadata-title">Spațiu necesar</h3>
+                    <p class="metadata-value">{{ activity.get_space_needed_display() }}{% if activity.is_estimated('space_needed') %} <em class="estimated">(estimat)</em>{% endif %}</p>
                </div>
                {% endif %}

@@ -125,8 +158,14 @@
            <h2 class="section-title">Informații sursă</h2>
            <div class="source-info">
                {% if activity.source_file %}
+                {% if config.SOURCE_DOWNLOAD_ENABLED %}
+                <p><strong>Fișier sursă:</strong>
+                   <a href="{{ url_for('main.source_download', activity_id=activity.id) }}">{{ activity.source_file }}</a>
+                   <span class="download-hint">(descarcă)</span></p>
+                {% else %}
                <p><strong>Fișier sursă:</strong> {{ activity.source_file }}</p>
                {% endif %}
+                {% endif %}

                {% if activity.page_reference %}
                <p><strong>Referință:</strong> {{ activity.page_reference }}</p>
--- a/app/templates/index.html
+++ b/app/templates/index.html
@@ -125,6 +125,30 @@
                    </select>
                </div>
                {% endif %}
+
+                {% if filters.indoor_outdoor %}
+                <div class="filter-group">
+                    <label for="indoor_outdoor" class="filter-label">Interior / exterior</label>
+                    <select name="indoor_outdoor" id="indoor_outdoor" class="filter-select">
+                        <option value="">Oriunde</option>
+                        {% for io in filters.indoor_outdoor %}
+                        <option value="{{ io }}">{{ display_names.get(io, io) }}</option>
+                        {% endfor %}
+                    </select>
+                </div>
+                {% endif %}
+
+                {% if filters.space_needed %}
+                <div class="filter-group">
+                    <label for="space_needed" class="filter-label">Spațiu necesar</label>
+                    <select name="space_needed" id="space_needed" class="filter-select">
+                        <option value="">Orice spațiu</option>
+                        {% for sp in filters.space_needed %}
+                        <option value="{{ sp }}">{{ display_names.get(sp, sp) }}</option>
+                        {% endfor %}
+                    </select>
+                </div>
+                {% endif %}
            {% endif %}
        </div>

--- a/app/templates/results.html
+++ b/app/templates/results.html
@@ -85,6 +85,28 @@
            </select>
            {% endif %}

+            {% if filters.indoor_outdoor %}
+            <select name="indoor_outdoor" class="filter-select compact">
+                <option value="">Oriunde</option>
+                {% for io in filters.indoor_outdoor %}
+                <option value="{{ io }}" {% if applied_filters.indoor_outdoor == io %}selected{% endif %}>
+                    {{ display_names.get(io, io) }}
+                </option>
+                {% endfor %}
+            </select>
+            {% endif %}
+
+            {% if filters.space_needed %}
+            <select name="space_needed" class="filter-select compact">
+                <option value="">Orice spațiu</option>
+                {% for sp in filters.space_needed %}
+                <option value="{{ sp }}" {% if applied_filters.space_needed == sp %}selected{% endif %}>
+                    {{ display_names.get(sp, sp) }}
+                </option>
+                {% endfor %}
+            </select>
+            {% endif %}
+
            <button type="button" class="btn btn-secondary btn-sm" onclick="clearFilters()">
                Resetează
            </button>
@@ -128,7 +150,7 @@
            <header class="activity-header">
                <h3 class="activity-title">
                    <a href="{{ url_for('main.activity_detail', activity_id=activity.id) }}">
-                        {{ activity.name }}
+                        {{ activity.get_display_name() }}
                    </a>
                </h3>
                <span class="activity-category">{{ display_names.get(activity.category, activity.category) }}</span>
@@ -138,24 +160,36 @@
            </header>

            <div class="activity-content">
-                <p class="activity-description">{{ activity.description }}</p>
+                <p class="activity-description">{{ activity.get_display_description() }}</p>

                <div class="activity-metadata">
                    {% if activity.get_age_range_display() != "toate vârstele" %}
                    <span class="metadata-item">
-                        <strong>Vârsta:</strong> {{ activity.get_age_range_display() }}
+                        <strong>Vârsta:</strong> {{ activity.get_age_range_display() }}{% if activity.is_estimated('age_group_min') or activity.is_estimated('age_group_max') %} <em class="estimated">(estimat)</em>{% endif %}
                    </span>
                    {% endif %}

                    {% if activity.get_participants_display() != "orice număr" %}
                    <span class="metadata-item">
-                        <strong>Participanți:</strong> {{ activity.get_participants_display() }}
+                        <strong>Participanți:</strong> {{ activity.get_participants_display() }}{% if activity.is_estimated('participants_min') or activity.is_estimated('participants_max') %} <em class="estimated">(estimat)</em>{% endif %}
                    </span>
                    {% endif %}

                    {% if activity.get_duration_display() != "durată variabilă" %}
                    <span class="metadata-item">
-                        <strong>Durata:</strong> {{ activity.get_duration_display() }}
+                        <strong>Durata:</strong> {{ activity.get_duration_display() }}{% if activity.is_estimated('duration_min') or activity.is_estimated('duration_max') %} <em class="estimated">(estimat)</em>{% endif %}
+                    </span>
+                    {% endif %}
+
+                    {% if activity.get_indoor_outdoor_display() %}
+                    <span class="metadata-item">
+                        <strong>Loc:</strong> {{ activity.get_indoor_outdoor_display() }}{% if activity.is_estimated('indoor_outdoor') %} <em class="estimated">(estimat)</em>{% endif %}
+                    </span>
+                    {% endif %}
+
+                    {% if activity.get_space_needed_display() %}
+                    <span class="metadata-item">
+                        <strong>Spațiu:</strong> {{ activity.get_space_needed_display() }}{% if activity.is_estimated('space_needed') %} <em class="estimated">(estimat)</em>{% endif %}
                    </span>
                    {% endif %}

@@ -168,7 +202,11 @@

                {% if activity.source_file %}
                <div class="activity-source">
+                    {% if config.SOURCE_DOWNLOAD_ENABLED %}
+                    <small>Sursă: <a href="{{ url_for('main.source_download', activity_id=activity.id) }}">{{ activity.source_file }}</a></small>
+                    {% else %}
                    <small>Sursă: {{ activity.source_file }}</small>
+                    {% endif %}
                </div>
                {% endif %}
            </div>
--- a/app/web/routes.py
+++ b/app/web/routes.py
@@ -3,20 +3,27 @@ Flask routes for INDEX-SISTEM-JOCURI v2.0
 Clean, minimalist web interface with dynamic filters
 """

-from flask import Blueprint, request, render_template, jsonify, current_app
+from flask import (
+    Blueprint, request, render_template, jsonify, current_app,
+    send_from_directory,
+)
 from app.models.database import DatabaseManager
 from app.models.activity import Activity
 from app.services.search import SearchService
-from app.config_taxonomy import CATEGORIES, CONTENT_TYPES
-import os
-from pathlib import Path
+from app.config_taxonomy import (
+    CATEGORIES, CONTENT_TYPES, INDOOR_OUTDOOR, SPACE_NEEDED,
+)

 bp = Blueprint('main', __name__)

-# Slug -> Romanian display name. Category and content_type slugs never collide,
-# so a single flat map is enough for the UI filter labels.
+# Slug -> Romanian display name. Category, content_type, indoor_outdoor and
+# space_needed slugs never collide, so a single flat map is enough for the UI
+# filter labels.
 LANGUAGE_NAMES = {'ro': 'Română', 'en': 'Engleză'}
-DISPLAY_NAMES = {**CATEGORIES, **CONTENT_TYPES, **LANGUAGE_NAMES}
+DISPLAY_NAMES = {
+    **CATEGORIES, **CONTENT_TYPES, **INDOOR_OUTDOOR, **SPACE_NEEDED,
+    **LANGUAGE_NAMES,
+}

 # Initialize database manager (will be configured in application factory)
 def get_db_manager():
@@ -138,6 +145,44 @@ def activity_detail(activity_id):
        print(f"Error loading activity {activity_id}: {e}")
        return render_template('404.html'), 404

+@bp.route('/source/<int:activity_id>')
+def source_download(activity_id):
+    """Download the original source file for an activity (plan A6).
+
+    Shipped DARK: returns 404 unless SOURCE_DOWNLOAD_ENABLED is set (copyright
+    exposure — the user opts in). Resolves the activity's `source_file` under
+    CORPUS_DIR. send_from_directory does the safe-join and blocks traversal;
+    web-mirror / extension-less sources that are not real files 404 gracefully.
+    """
+    if not current_app.config.get('SOURCE_DOWNLOAD_ENABLED', False):
+        return render_template('404.html'), 404
+    try:
+        db = get_db_manager()
+        activity_data = db.get_activity_by_id(activity_id)
+        if not activity_data:
+            return render_template('404.html'), 404
+
+        source_file = (activity_data.get('source_file') or '').strip()
+        if not source_file:
+            return render_template('404.html'), 404
+
+        corpus_dir = current_app.config.get('CORPUS_DIR')
+        if not corpus_dir:
+            return render_template('404.html'), 404
+        try:
+            # send_from_directory rejects path traversal and missing files with
+            # a 404 (NotFound) — no manual safe_join needed.
+            return send_from_directory(
+                corpus_dir, source_file, as_attachment=True
+            )
+        except Exception:
+            # Missing file / web-mirror source with no on-disk original.
+            return render_template('404.html'), 404
+    except Exception as e:
+        print(f"Source download error for {activity_id}: {e}")
+        return render_template('404.html'), 404
+
+
 @bp.route('/health')
 def health_check():
    """Health check endpoint for Docker"""
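Reviewer note on `source_download` above: the route combines a feature-flag gate with a containment check that Flask's `send_from_directory` performs for it. The sketch below is a stdlib-only analogue of that decision logic, written so it can be exercised without a running app. It is not Flask's implementation, and `resolve_source` is a hypothetical helper that does not exist in this commit. Returning `None` stands in for the route's 404 response.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_source(corpus_dir: str, source_file: str, enabled: bool) -> Optional[Path]:
    """Return the file that would be served, or None where the route answers 404."""
    # Dark launch: behave as if the route does not exist unless opted in.
    if not enabled or not corpus_dir:
        return None
    name = (source_file or "").strip()
    if not name:
        return None
    root = Path(corpus_dir).resolve()
    candidate = (root / name).resolve()
    # Reject anything that escapes the corpus directory, e.g. "../secret.txt"
    # or an absolute path.
    if root not in candidate.parents:
        return None
    # Web-mirror sources with no on-disk original fall through to None.
    return candidate if candidate.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus"
    corpus.mkdir()
    (corpus / "games.pdf").write_text("stub")
    (Path(tmp) / "secret.txt").write_text("stub")

    served = resolve_source(str(corpus), "games.pdf", enabled=True)
    served_name = served.name if served else None
    dark = resolve_source(str(corpus), "games.pdf", enabled=False)
    traversal = resolve_source(str(corpus), "../secret.txt", enabled=True)
    missing = resolve_source(str(corpus), "https-example-mirror", enabled=True)
```

A test for the real route would more likely use Flask's test client and assert on status codes; the helper above only illustrates which inputs should end in a 404.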
--- a/data/activities.db
+++ b/data/activities.db
--- a/scripts/ENRICHMENT_PROMPT.md
+++ b/scripts/ENRICHMENT_PROMPT.md
@@ -0,0 +1,98 @@
+# SUBAGENT — Activity enrichment
+
+You are a subagent in the game-library enrichment pipeline. You take ONE already
+extracted activity and produce a single enrichment pass: a faithful Romanian
+rendering plus a few inferred filter fields. You do **one** activity per prompt.
+
+This is **not** re-extraction. The activity text already exists and is trusted.
+Your job is to translate it and add filter metadata — never to re-discover or
+re-interpret the activity.
+
+## Your task
+
+The prompt gives you two blocks:
+
+1. **Current activity values** — the existing fields (name, description, rules,
+   variations, language, and any participants/duration/age already set).
+2. **Source chunk text** — the original passage the activity came from. This is
+   your ground truth for any expansion. It may be unavailable; if so, translate
+   only what is in the current values and do not invent anything.
+
+Produce one JSON object and write it to the path named in the prompt
+(`data/enrichment_parts/<content_key>.json`). It MUST contain the exact
+`content_key` string from the prompt.
+
+## Rules
+
+### Translation (always)
+- Translate `name`, `description`, `rules`, `variations` into natural, fluent
+  Romanian → `name_ro`, `description_ro`, `rules_ro`, `variations_ro`.
+- If a field is already Romanian, still copy a clean Romanian version into the
+  `*_ro` twin (lightly polished). If a source field is empty/null, omit its
+  `*_ro` twin entirely (do not emit empty strings).
+- Translate faithfully. Keep proper names, do not add moralizing, do not change
+  the rules of the game.
+
+### Description expansion (constrained)
+- You MAY make `description_ro` richer than a literal translation — but ONLY
+  using detail that is actually present in the **source chunk text**. Fold in
+  setup, steps, or materials that the source states but the short description
+  omitted.
+- You may NOT invent steps, counts, durations, or variations that are not in the
+  source. If the source is thin, the translation stays thin. Hallucinated
+  expansion is the one unacceptable failure here.
+
+### Inferred filter fields (mark when inferred)
+Fill these when you can, using the source text first, then reasonable inference:
+
+- `indoor_outdoor`: one of `indoor`, `outdoor`, `either`.
+- `space_needed`: one of `mic`, `mediu`, `mare` (small / medium / large area).
+- `participants_min`, `participants_max`: integers (people).
+- `duration_min`, `duration_max`: integers (minutes).
+- `age_group_min`, `age_group_max`: integers (years).
+
+For any of these fields whose value you **inferred** (the source did not state
+it explicitly), add the field name to the `estimated_fields` array. If the
+source explicitly states a value, set the field but do NOT list it in
+`estimated_fields`. Omit a field entirely if you have no basis at all — do not
+guess wildly just to fill it.
+
+Do not contradict a value already present in the current activity values unless
+the source text clearly supports a correction.
+
+## Enum vocabulary (fixed — use these exact slugs)
+
+- `indoor_outdoor`: `indoor` | `outdoor` | `either`
+- `space_needed`: `mic` | `mediu` | `mare`
+
+## Output format
+
+Write exactly one JSON object to `data/enrichment_parts/<content_key>.json`:
+
+```json
+{
+  "content_key": "<the exact key from the prompt>",
+  "name_ro": "…",
+  "description_ro": "…",
+  "rules_ro": "…",
+  "variations_ro": "…",
+  "indoor_outdoor": "outdoor",
+  "space_needed": "mediu",
+  "participants_min": 6,
+  "participants_max": 20,
+  "duration_min": 15,
+  "duration_max": 30,
+  "age_group_min": 8,
+  "age_group_max": 14,
+  "estimated_fields": ["space_needed", "duration_min", "duration_max"]
+}
+```
+
+Include only the fields you actually fill. Always include `content_key` and
+`estimated_fields` (use `[]` if nothing was inferred). Output valid JSON only —
+no commentary, no markdown fences in the file itself.
+
+## Report
+
+After writing the file, report in under 30 words: the activity name and which
+fields you estimated.
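Reviewer note on ENRICHMENT_PROMPT.md above: the output contract (required keys, fixed enum slugs, integer fields, `estimated_fields` naming only fields that are present) is mechanical enough to check before `--collect` merges the parts. A minimal sketch of such a check follows. `validate_enrichment` is a hypothetical helper that is not part of this commit, and the slug sets are copied from the prompt text rather than imported from `config_taxonomy.py`.

```python
INDOOR_OUTDOOR = {"indoor", "outdoor", "either"}
SPACE_NEEDED = {"mic", "mediu", "mare"}
TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
INT_FIELDS = (
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
)
ESTIMABLE = set(INT_FIELDS) | {"indoor_outdoor", "space_needed"}


def validate_enrichment(entry: dict, expected_key: str) -> list:
    """Return a list of contract violations (empty list means the entry is OK)."""
    problems = []
    if entry.get("content_key") != expected_key:
        problems.append("content_key mismatch")

    estimated = entry.get("estimated_fields")
    if not isinstance(estimated, list):
        problems.append("estimated_fields must be a list")
        estimated = []

    # The prompt says to omit empty *_ro twins rather than emit "".
    for fld in TEXT_FIELDS:
        if fld in entry and not (isinstance(entry[fld], str) and entry[fld].strip()):
            problems.append(f"{fld} is empty")

    # bool is a subclass of int in Python, so exclude it explicitly.
    for fld in INT_FIELDS:
        if fld in entry and (isinstance(entry[fld], bool) or not isinstance(entry[fld], int)):
            problems.append(f"{fld} must be an integer")

    if "indoor_outdoor" in entry and entry["indoor_outdoor"] not in INDOOR_OUTDOOR:
        problems.append("indoor_outdoor is not a known slug")
    if "space_needed" in entry and entry["space_needed"] not in SPACE_NEEDED:
        problems.append("space_needed is not a known slug")

    # Only fields that can be inferred, and that were actually filled,
    # may be listed as estimated.
    for fld in estimated:
        if fld not in ESTIMABLE or fld not in entry:
            problems.append(f"estimated_fields lists unfilled or unknown field {fld}")
    return problems


good = {
    "content_key": "abc",
    "name_ro": "Capturarea steagului",
    "indoor_outdoor": "outdoor",
    "space_needed": "mediu",
    "duration_min": 15,
    "estimated_fields": ["space_needed", "duration_min"],
}
bad = {
    "content_key": "abc",
    "space_needed": "large",
    "estimated_fields": ["duration_max"],
}
good_problems = validate_enrichment(good, "abc")
bad_problems = validate_enrichment(bad, "abc")
```

Note that `apply_enrichment` in build_database.py silently drops unrecognised slugs and non-integer values; a stricter pre-check like this one would surface them instead.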
--- a/scripts/SUBAGENT_PROMPT.md
+++ b/scripts/SUBAGENT_PROMPT.md
@@ -74,6 +74,23 @@ The file is one JSON object: a `header` plus an `activities` array.
 - Do **not** paraphrase the `source_excerpt` — copy it character for character.
 - Better to extract fewer activities accurately than to pad the output.

+## Writing large outputs in batches (IMPORTANT)
+
+A single Write tool call has a hard ~32K output-token limit. Dense chunks
+(50+ activities) will exceed this. If you estimate >30 activities, write the
+file **incrementally**:
+
+1. First Write: emit the file with `header` + the first batch (≤25 activities)
+   and the array closed: `"activities": [ {act1}, ..., {act25} ] }`.
+2. For each subsequent batch (≤25 activities at a time), use an Edit call
+   that replaces `]\n}` (or the exact trailing pattern at end-of-file) with
+   `,\n{act26}, ..., {act50}\n]\n}`. Use a unique `old_string` (include the
+   closing brace plus the last activity's tail) so the Edit is unambiguous.
+3. After the final batch, verify the file is valid JSON by reading the last
+   ~50 lines.
+
+This keeps each tool call under the output-token cap.
+
 ## Before you finish

 - Every activity has a non-empty `source_excerpt` and `page_reference`.
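Reviewer note on the batching instructions above: each Edit call rewrites the closing `]` and `}` so that the file is valid JSON after every batch. The sketch below imitates that step on a string so the invariant can be checked. `append_batch` is a hypothetical helper, not part of the commit or of the Edit tool, and it assumes `activities` is the last key in the object, as in the layout the prompt describes.

```python
import json


def append_batch(text: str, batch: list) -> str:
    """Append activities to a JSON document whose text ends with `]` then `}`."""
    if not batch:
        return text
    stripped = text.rstrip()
    if not stripped.endswith("}"):
        raise ValueError("document does not end with '}'")
    body = stripped[:-1].rstrip()
    if not body.endswith("]"):
        raise ValueError("activities array is not the last member")
    head = body[:-1].rstrip()
    addition = ",\n".join(json.dumps(item, ensure_ascii=False) for item in batch)
    # No leading comma when the array is still empty.
    sep = "" if head.endswith("[") else ","
    return f"{head}{sep}\n{addition}\n]\n}}"


# First "Write": header plus the first batch, with the array closed.
doc = json.dumps({"header": {"source_id": "demo"}, "activities": [{"name": "a1"}]})

# Subsequent "Edit" calls: each one keeps the file parseable.
doc = append_batch(doc, [{"name": "a2"}, {"name": "a3"}])
json.loads(doc)
doc = append_batch(doc, [{"name": "a4"}])

names = [a["name"] for a in json.loads(doc)["activities"]]
```

Parsing the file after every batch, as done here with `json.loads`, is a stronger check than reading the last ~50 lines, at the cost of loading the whole file.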
--- a/scripts/build_database.py
+++ b/scripts/build_database.py
@@ -86,7 +86,12 @@ def _split_csv(value: Optional[str]) -> list[str]:
    return [p.strip() for p in str(value).split(",") if p.strip()]


-def dict_to_activity(adict: dict, source_file: str) -> Activity:
+def dict_to_activity(
+    adict: dict,
+    source_file: str,
+    source_id: Optional[str] = None,
+    chunk_key: Optional[str] = None,
+) -> Activity:
    """Build an Activity from one extraction-JSON activity object."""
    tags = adict.get("tags") or []
    if isinstance(tags, str):
@@ -99,6 +104,9 @@ def dict_to_activity(adict: dict, source_file: str) -> Activity:
        source_files = [source_file, *source_files]

    return Activity(
+        source_id=source_id,
+        source_ids=[source_id] if source_id else [],
+        chunk_key=chunk_key,
        name=(adict.get("name") or "").strip(),
        description=(adict.get("description") or "").strip(),
        rules=adict.get("rules"),
@@ -206,6 +214,19 @@ def merge_cluster(cluster: list[Activity]) -> Activity:
            if s and s not in sources:
                sources.append(s)
    merged.source_files = sources
+    # source provenance: keep rep's chunk_key/source_id as primary, union the
+    # source_ids for the download route. Enrichment fields (name_ro,
+    # description_ro, indoor_outdoor, ...) are intentionally NOT carried here:
+    # enrichment is applied AFTER dedup (plan Part B2), keyed on the merged
+    # row's content_key, so merging must not pre-populate them.
+    merged.source_id = rep.source_id
+    merged.chunk_key = rep.chunk_key
+    source_ids: list[str] = []
+    for a in cluster:
+        for sid in [a.source_id, *(a.source_ids or [])]:
+            if sid and sid not in source_ids:
+                source_ids.append(sid)
+    merged.source_ids = source_ids
    # popularity_score++ per merged duplicate (plan §4)
    merged.popularity_score = max(a.popularity_score for a in cluster) + (len(cluster) - 1)
    return merged
@@ -313,6 +334,108 @@ def apply_review_decisions(
    return kept, stats


+# --------------------------------------------------------------------------
+# step 5b — enrichment overlay (plan Part B)
+# --------------------------------------------------------------------------
+# Translation / inferred-filter fields written by run_enrichment.py. Applied
+# AFTER dedup + review decisions, keyed on the same stable content_key, so the
+# overlay survives rebuilds as long as extraction text is frozen.
+_ENRICHMENT_TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
+_ENRICHMENT_INT_FIELDS = (
+    "participants_min", "participants_max",
+    "duration_min", "duration_max",
+    "age_group_min", "age_group_max",
+)
+
+
+def load_enrichment(path: Path) -> dict:
+    """Load data/enrichment.json (flat map content_key -> field dict)."""
+    if path and path.is_file():
+        try:
+            data = json.loads(path.read_text(encoding="utf-8"))
+            if isinstance(data, dict):
+                return data
+        except (json.JSONDecodeError, OSError):
+            pass
+    return {}
+
+
+def apply_enrichment(activities: list[Activity], enrichment: dict) -> dict:
+    """
+    Overlay enrichment fields onto the post-dedup activity list (plan B2).
+
+    Keyed by content_key. Only fields PRESENT in an entry are written; absent
+    fields leave the underlying DB value untouched. indoor_outdoor /
+    space_needed are normalized to slugs (None on unrecognised). Inferred
+    fields are recorded in `estimated_fields`. Translated / expanded text is
+    NOT re-validated against the source here — expansion fidelity is the
+    enrichment prompt's responsibility (plan B2 comment).
+
+    Returns {entries, matched, orphaned, fields_stated, fields_estimated}.
+    """
+    from app.config_taxonomy import normalize_indoor_outdoor, normalize_space_needed
+
+    matched_keys: set[str] = set()
+    fields_stated: dict[str, int] = defaultdict(int)
+    fields_estimated: dict[str, int] = defaultdict(int)
+
+    for act in activities:
+        key = content_key(
+            act.normalized_name or normalize_name(act.name),
+            act.language,
+            act.description or "",
+        )
+        entry = enrichment.get(key)
+        if not isinstance(entry, dict):
+            continue
+        matched_keys.add(key)
+
+        estimated = set(entry.get("estimated_fields") or [])
+
+        # bilingual text twins
+        for fld in _ENRICHMENT_TEXT_FIELDS:
+            val = entry.get(fld)
+            if isinstance(val, str) and val.strip():
+                setattr(act, fld, val.strip())
+
+        # inferred / clarified structured numeric fields
+        for fld in _ENRICHMENT_INT_FIELDS:
+            if entry.get(fld) is not None:
+                try:
+                    setattr(act, fld, int(entry[fld]))
+                except (TypeError, ValueError):
+                    pass
+
+        # enum filters — normalized to slug, dropped if unrecognised
+        if entry.get("indoor_outdoor") is not None:
+            slug = normalize_indoor_outdoor(entry["indoor_outdoor"])
+            if slug:
+                act.indoor_outdoor = slug
+        if entry.get("space_needed") is not None:
+            slug = normalize_space_needed(entry["space_needed"])
+            if slug:
+                act.space_needed = slug
+
+        act.estimated_fields = sorted(estimated)
+
+        # QA tally: stated vs estimated population, per field
+        for fld in (*_ENRICHMENT_INT_FIELDS, "indoor_outdoor", "space_needed"):
+            if entry.get(fld) is None:
+                continue
+            if fld in estimated:
+                fields_estimated[fld] += 1
+            else:
+                fields_stated[fld] += 1
+
+    return {
+        "entries": len(enrichment),
+        "matched": len(matched_keys),
+        "orphaned": len(enrichment) - len(matched_keys),
+        "fields_stated": dict(fields_stated),
+        "fields_estimated": dict(fields_estimated),
+    }
+
+
 # --------------------------------------------------------------------------
 # golden-set recall (plan §7)
 # --------------------------------------------------------------------------
@@ -390,9 +513,8 @@ def collect_activities(

        header = data.get("header", {})
        chunk_text = find_chunk_text(json_path, header, chunks_dir)
-        source_id = header.get("source_id") or chunk_key_for(json_path, header).rsplit(
-            ".part", 1
-        )[0]
+        chunk_key = chunk_key_for(json_path, header)
+        source_id = header.get("source_id") or chunk_key.rsplit(".part", 1)[0]
        fallback_source = (
            source_path_for(source_id, sources_dir) or source_id or json_path.stem
        )
@@ -409,7 +531,7 @@ def collect_activities(
                continue
            src = adict.get("source_file") or fallback_source
            raw_categories.append((adict.get("category", ""), normalize_category(adict.get("category", ""))))
-            activities.append(dict_to_activity(adict, src))
+            activities.append(dict_to_activity(adict, src, source_id, chunk_key))

        if hallucinated:
            _log_hallucinations(json_path, rejected_dir, hallucinated)
@@ -496,6 +618,7 @@ def rebuild(
    sources_dir: Path,
    db_path: Path,
    decisions_path: Optional[Path] = None,
+    enrichment_path: Optional[Path] = None,
    schema_path: Path = DEFAULT_SCHEMA_PATH,
    golden_dir: Optional[Path] = None,
    do_swap: bool = True,
@@ -517,6 +640,11 @@ def rebuild(
    decisions = load_review_decisions(Path(decisions_path)) if decisions_path else {}
    final, decision_stats = apply_review_decisions(deduped, decisions)

+    # Enrichment overlay — applied immediately after review decisions, on the
+    # post-dedup list, keyed on the same stable content_key (plan B2).
+    enrichment = load_enrichment(Path(enrichment_path)) if enrichment_path else {}
+    enrichment_stats = apply_enrichment(final, enrichment)
+
    try:
        write_database(db_tmp_path, final)
        backup = atomic_swap(db_tmp_path, db_path) if do_swap else None
@@ -529,6 +657,7 @@ def rebuild(
        **collected,
        "dedup": dedup_stats,
        "decisions": decision_stats,
+        "enrichment": enrichment_stats,
        "final_count": len(final),
        "backup": str(backup) if backup else None,
        "swapped": do_swap,
@@ -579,6 +708,16 @@ def print_report(report: dict) -> None:
          f"(auto-merged {d['auto_merged']}, borderline {d['borderline']})")
    print(f"review decisions     : dropped {report['decisions']['dropped']}, "
          f"resolved {report['decisions']['resolved']}")
+    enr = report.get("enrichment")
+    if enr and enr.get("entries"):
+        print(f"enrichment           : {enr['entries']} entries "
+              f"(matched {enr['matched']}, orphaned {enr['orphaned']})")
+        stated, estimated = enr.get("fields_stated", {}), enr.get("fields_estimated", {})
+        all_fields = sorted(set(stated) | set(estimated))
+        if all_fields:
+            print("  field population   :  (stated / estimated)")
+            for fld in all_fields:
+                print(f"    {fld:<18}: {stated.get(fld, 0)} / {estimated.get(fld, 0)}")
    print(f"final inserted       : {report['final_count']}")
    print(f"% with rules         : {qa['pct_with_rules']}")
    print(f"needs_review rows    : {qa['needs_review']}")
@@ -615,6 +754,7 @@ def main(argv: Optional[list[str]] = None) -> int:
    parser.add_argument("--sources", default="data/sources")
    parser.add_argument("--db", default="data/activities.db")
    parser.add_argument("--decisions", default="data/review_decisions.json")
+    parser.add_argument("--enrichment", default="data/enrichment.json")
    parser.add_argument("--golden", default="data/golden")
    parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
    args = parser.parse_args(argv)
@@ -628,6 +768,7 @@ def main(argv: Optional[list[str]] = None) -> int:
        sources_dir=Path(args.sources),
        db_path=Path(args.db),
        decisions_path=Path(args.decisions),
+        enrichment_path=Path(args.enrichment),
        schema_path=Path(args.schema),
        golden_dir=Path(args.golden),
    )
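
The overlay rule described in the apply_enrichment docstring above (write only the fields an entry carries, count entries that match no activity as orphaned) can be sketched in isolation. This is a minimal illustration, not the committed code: ToyActivity and toy_content_key are hypothetical stand-ins for Activity and import_common.content_key, whose real definitions are not shown in this diff.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyActivity:
    """Hypothetical stand-in for app.models.activity.Activity."""
    name: str
    language: str
    description: str
    name_ro: Optional[str] = None
    participants_min: Optional[int] = None
    estimated_fields: list[str] = field(default_factory=list)


def toy_content_key(name: str, language: str, description: str) -> str:
    """Hypothetical key; the real import_common.content_key may differ."""
    raw = "|".join((name.strip().lower(), language, description.strip()))
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def overlay(acts: list[ToyActivity], enrichment: dict) -> dict:
    matched: set[str] = set()
    for act in acts:
        key = toy_content_key(act.name, act.language, act.description)
        entry = enrichment.get(key)
        if not isinstance(entry, dict):
            continue  # no overlay entry for this activity
        matched.add(key)
        # Write only what the entry actually carries.
        name_ro = entry.get("name_ro")
        if isinstance(name_ro, str) and name_ro.strip():
            act.name_ro = name_ro.strip()
        if entry.get("participants_min") is not None:
            act.participants_min = int(entry["participants_min"])
        # estimated_fields is replaced wholesale on a match.
        act.estimated_fields = sorted(set(entry.get("estimated_fields") or []))
    return {"matched": len(matched), "orphaned": len(enrichment) - len(matched)}


act = ToyActivity("Joc de test", "ro", "O descriere.", participants_min=5)
key = toy_content_key(act.name, act.language, act.description)
stats = overlay([act], {key: {"name_ro": "Tradus"}, "stale-key": {"name_ro": "x"}})
print(stats, act.name_ro, act.participants_min)
```

The entry translates only the name, so participants_min keeps its original value, and the second entry is reported as orphaned because its key matches no activity.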
--- a/scripts/repair_extractions.py
+++ b/scripts/repair_extractions.py
@@ -0,0 +1,244 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+repair_extractions.py — one-shot repair of malformed extraction JSON.
+
+Subagents systematically emit unescaped ASCII double-quotes inside string
+values (Romanian text like  „Unu"  uses a closing " that terminates the JSON
+string early). Re-extraction reproduces the bug, so we repair instead.
+
+IMPORTANT — why NOT json_repair: json_repair "recovers" an unescaped quote by
+ending the string at the stray quote and reinterpreting the trailing text as a
+new key, which (a) TRUNCATES the value and (b) injects garbage keys. The
+truncation is silent (the field is still non-empty) and slips past a naive
+presence check. So we use a faithful char-scanner that ESCAPES stray quotes
+(\\") instead of splitting on them, then validate the result against the real
+activity schema (additionalProperties:false also catches any residual split).
+
+This is an OFFLINE maintenance tool. build_database.py must NOT depend on it —
+the "DB regenerable from data/extracted/" invariant requires plain valid JSON on
+disk. We write clean JSON back to data/extracted/ and the build reads vanilla
+json.
+
+Source selection (faithful recovery needs the ORIGINAL malformed text):
+  * a chunk is a candidate when a MALFORMED original exists — either the
+    top-level data/extracted/<key>.json is itself invalid, or a malformed
+    original sits in data/extracted/_rejected/<key>.json.
+  * the malformed original is preferred as the repair source.
+  * chunks whose only artifact is already-valid JSON (e.g. a prior json_repair
+    output that lost the original) are NOT silently "repaired" — if such a chunk
+    has no valid top-level file it is reported as needing RE-EXTRACTION.
+
+Usage:
+    python scripts/repair_extractions.py            # report only (dry run)
+    python scripts/repair_extractions.py --apply     # write repaired JSON
+"""
+
+from __future__ import annotations
+
+import argparse
+import glob
+import json
+import sys
+from pathlib import Path
+from typing import Optional
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+EXTRACTED = REPO_ROOT / "data" / "extracted"
+REJECTED = EXTRACTED / "_rejected"
+if str(SCRIPT_DIR) not in sys.path:  # make sibling import_common importable
+    sys.path.insert(0, str(SCRIPT_DIR))
+from import_common import DEFAULT_SCHEMA_PATH, load_schema, validate_extraction  # noqa: E402
+
+
+def escape_stray_quotes(s: str) -> str:
+    """Escape ASCII double-quotes that occur INSIDE a JSON string value.
+
+    A `"` inside a string is treated as a real string-close only when the next
+    non-whitespace char is structural (`,` `}` `]` `:`) or EOF; otherwise it is
+    content and is escaped to `\\"`. This preserves the full value instead of
+    truncating it (the json_repair failure mode).
+    """
+    out: list[str] = []
+    in_str = False
+    esc = False
+    n = len(s)
+    i = 0
+    while i < n:
+        c = s[i]
+        if esc:
+            out.append(c)
+            esc = False
+            i += 1
+            continue
+        if c == "\\":
+            out.append(c)
+            esc = True
+            i += 1
+            continue
+        if c == '"':
+            if not in_str:
+                in_str = True
+                out.append(c)
+            else:
+                j = i + 1
+                while j < n and s[j] in " \t\r\n":
+                    j += 1
+                nxt = s[j] if j < n else ""
+                if nxt in ",}]:" or nxt == "":
+                    in_str = False
+                    out.append(c)
+                else:
+                    out.append('\\"')  # content quote → escape, keep value whole
+            i += 1
+            continue
+        out.append(c)
+        i += 1
+    return "".join(out)
+
+
+def _is_valid_json(path: Path) -> bool:
+    try:
+        json.loads(path.read_text(encoding="utf-8"))
+        return True
+    except (json.JSONDecodeError, OSError):
+        return False
+
+
+def _malformed_source(key: str) -> Optional[Path]:
+    """Return the malformed-original file for a chunk, preferring top-level."""
+    live = EXTRACTED / f"{key}.json"
+    if live.exists() and not _is_valid_json(live):
+        return live
+    rej = REJECTED / f"{key}.json"
+    if rej.exists() and not _is_valid_json(rej):
+        return rej
+    return None
+
+
+def _candidate_keys() -> tuple[dict[str, Path], list[str]]:
+    """
+    (repair_candidates, needs_reextraction).
+
+    repair_candidates: key -> malformed source file (faithfully repairable).
+    needs_reextraction: chunks with no malformed original AND no valid
+    top-level file (their original was lost) — must be re-extracted.
+    """
+    keys = set()
+    for fn in glob.glob(str(EXTRACTED / "*.json")):
+        keys.add(Path(fn).stem)
+    for fn in glob.glob(str(REJECTED / "*.json")):
+        keys.add(Path(fn).stem)
+
+    candidates: dict[str, Path] = {}
+    needs_reextraction: list[str] = []
+    for key in sorted(keys):
+        # A malformed original anywhere is faithfully repairable, and is the
+        # source of truth even if a (json_repair-produced, possibly truncated)
+        # valid top-level file exists — escaping the original never truncates,
+        # so re-repairing from it is always >= the json_repair output.
+        src = _malformed_source(key)
+        if src is not None:
+            candidates[key] = src
+            continue
+        live = EXTRACTED / f"{key}.json"
+        if live.exists() and _is_valid_json(live):
+            continue  # genuinely-valid extraction, nothing to do
+        # no valid top-level and no malformed original to repair from
+        needs_reextraction.append(key)
+    return candidates, needs_reextraction
+
+
+def repair(apply: bool) -> int:
+    schema = load_schema(DEFAULT_SCHEMA_PATH)
+    candidates, needs_reextraction = _candidate_keys()
+
+    print("=" * 64)
+    print(f"REPAIR EXTRACTIONS  ({'APPLY' if apply else 'dry run'})")
+    print("=" * 64)
+    print(f"repair candidates: {len(candidates)}")
+
+    def _textlen(data: dict) -> int:
+        total = 0
+        for a in data.get("activities", []):
+            if isinstance(a, dict):
+                for v in a.values():
+                    if isinstance(v, str):
+                        total += len(v)
+        return total
+
+    ok = 0
+    kept_toplevel = 0
+    still_bad: list[str] = []
+    schema_fail: list[tuple[str, str]] = []
+
+    for key, src in candidates.items():
+        live = EXTRACTED / f"{key}.json"
+        live_valid = live.exists() and _is_valid_json(live)
+
+        raw = src.read_text(encoding="utf-8")
+        fixed = escape_stray_quotes(raw)
+        try:
+            data = json.loads(fixed)
+        except json.JSONDecodeError as exc:
+            if live_valid:
+                kept_toplevel += 1  # genuine top-level is fine; stale _rejected
+            else:
+                still_bad.append(f"{key}: still invalid after escape ({exc})")
+            continue
+        errors = validate_extraction(data, schema)
+        if errors:
+            if live_valid:
+                kept_toplevel += 1
+            else:
+                schema_fail.append((key, errors[0]))
+                print(f"  {key[:50]:<50} SCHEMA-FAIL: {errors[0][:40]}")
+            continue
+
+        # Faithfulness guard: only replace a valid top-level when the escaped
+        # repair carries STRICTLY more text (i.e. the top-level was a truncated
+        # json_repair output). Genuine extractions are kept untouched.
+        if live_valid:
+            try:
+                live_data = json.loads(live.read_text(encoding="utf-8"))
+            except json.JSONDecodeError:
+                live_data = {}
+            if _textlen(data) <= _textlen(live_data):
+                kept_toplevel += 1
+                continue
+
+        n = len(data.get("activities", []))
+        print(f"  {key[:50]:<50} {n:>3} acts  REPAIR")
+        if apply:
+            live.write_text(
+                json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
+            )
+        ok += 1
+
+    print("-" * 64)
+    print(f"repaired: {ok} | kept genuine top-level: {kept_toplevel} | "
+          f"schema-fail: {len(schema_fail)} | still-bad: {len(still_bad)} | "
+          f"needs re-extraction: {len(needs_reextraction)}")
+    for key, err in schema_fail:
+        print(f"  ⚠ schema {key}: {err[:60]}")
+    for msg in still_bad:
+        print(f"  ✘ {msg}")
+    for key in needs_reextraction:
+        print(f"  ↻ re-extract: {key}")
+    if not apply:
+        print("\nDry run — re-run with --apply to write repaired JSON.")
+    print("=" * 64)
+    return 0
+
+
+def main(argv: Optional[list[str]] = None) -> int:
+    parser = argparse.ArgumentParser(description="Repair malformed extraction JSON.")
+    parser.add_argument("--apply", action="store_true",
+                        help="write repaired JSON (default: dry run)")
+    args = parser.parse_args(argv)
+    return repair(args.apply)
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
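
The lookahead heuristic in escape_stray_quotes can be tried on a two-field sample. The function below is a condensed restatement written for illustration (same rule, different loop shape), and the sample strings are made up. It also shows one limit of the heuristic as I read it: a content quote that happens to be followed by a comma is still taken as a string close, which is presumably why the script re-parses and schema-validates the result.

```python
import json


def escape_stray_quotes(s: str) -> str:
    """Condensed restatement of the scanner above, for illustration only."""
    out: list[str] = []
    in_str = esc = False
    for i, c in enumerate(s):
        if esc:  # character after a backslash is copied verbatim
            out.append(c)
            esc = False
        elif c == "\\":
            out.append(c)
            esc = True
        elif c == '"' and not in_str:
            in_str = True
            out.append(c)
        elif c == '"':
            # Look past whitespace: a structural char (or EOF) means real close.
            rest = s[i + 1:].lstrip(" \t\r\n")
            if rest == "" or rest[0] in ",}]:":
                in_str = False
                out.append(c)
            else:
                out.append('\\"')  # content quote: escape, keep value whole
        else:
            out.append(c)
    return "".join(out)


broken = '{"name": "Jocul „Unu" și doi", "language": "ro"}'
fixed = json.loads(escape_stray_quotes(broken))
print(fixed["name"])

# Limitation: a content quote followed by a comma looks like a real close.
ambiguous = '{"name": "zic „Unu", apoi plec"}'
try:
    json.loads(escape_stray_quotes(ambiguous))
    ambiguous_ok = True
except json.JSONDecodeError:
    ambiguous_ok = False
print(ambiguous_ok)
```

In the first sample the full value survives, closing quote included. The second sample stays invalid after escaping, which corresponds to the "still-bad" bucket in the report rather than to a silent truncation.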
--- a/scripts/run_enrichment.py
+++ b/scripts/run_enrichment.py
@@ -0,0 +1,270 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+run_enrichment.py — enrichment orchestrator (plan Part B3).
+
+Mirror of run_extraction.py, on the *other* side of the rebuild. It reads the
+already-rebuilt data/activities.db, and for every activity emits one subagent
+prompt asking for a single bilingual + inferred-filter enrichment pass. Like
+extraction, this script does NOT call the LLM — the interactive Claude Code
+orchestrator launches waves of subagents on the emitted prompts.
+
+Keying is the crux (plan §"Cheia de keying"): each row's overlay is keyed on
+import_common.content_key(normalized_name, language, _normalize_text(description))
+— the SAME function build_database uses to apply the overlay. The key is stable
+only while the extraction text is frozen, so enrichment runs AFTER the freezing
+rebuild.
+
+Modes:
+  (default)    emit one prompt per activity that has no enrichment part yet
+               (resumable: data/enrichment_parts/<key>.json present => skip)
+  --collect    merge data/enrichment_parts/*.json -> data/enrichment.json
+
+Pilot scoping (plan B5): --source <substring of source_id or chunk_key> and/or
+--limit N narrow the emitted prompts to a small slice for the sign-off pilot.
+
+Usage:
+    python scripts/run_enrichment.py --source teambuilding_corbu        # pilot
+    python scripts/run_enrichment.py                                    # all rows
+    python scripts/run_enrichment.py --collect                          # merge parts
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sqlite3
+import sys
+from pathlib import Path
+from typing import Optional
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
+    if _p not in sys.path:
+        sys.path.insert(0, _p)
+
+from import_common import (  # noqa: E402
+    content_key,
+    find_chunk_text,
+    normalize_name,
+)
+
+ENRICHMENT_PROMPT = SCRIPT_DIR / "ENRICHMENT_PROMPT.md"
+
+# Columns pulled from the DB into the prompt as the "current value" context.
+_DB_COLUMNS = (
+    "id", "name", "description", "rules", "variations",
+    "category", "content_type", "language", "normalized_name",
+    "page_reference", "source_id", "chunk_key",
+    "participants_min", "participants_max",
+    "duration_min", "duration_max",
+    "age_group_min", "age_group_max",
+)
+
+# How much source-chunk text to inline. Chunks are page-sized; cap so a dense
+# chunk does not blow the prompt up, but keep enough to ground the expansion.
+_CHUNK_TEXT_CAP = 12000
+
+
+def _fetch_rows(db_path: Path, source_substr: Optional[str]) -> list[dict]:
+    conn = sqlite3.connect(db_path)
+    conn.row_factory = sqlite3.Row
+    try:
+        cols = ", ".join(_DB_COLUMNS)
+        sql = f"SELECT {cols} FROM activities"
+        params: list = []
+        if source_substr:
+            sql += " WHERE (source_id LIKE ? OR chunk_key LIKE ?)"
+            params = [f"%{source_substr}%", f"%{source_substr}%"]
+        sql += " ORDER BY source_id, id"
+        return [dict(r) for r in conn.execute(sql, params).fetchall()]
+    finally:
+        conn.close()
+
+
+def _row_content_key(row: dict) -> str:
+    return content_key(
+        row.get("normalized_name") or normalize_name(row.get("name") or ""),
+        row.get("language"),
+        row.get("description") or "",
+    )
+
+
+def _chunk_text_for_row(row: dict, chunks_dir: Path) -> Optional[str]:
+    """Locate the source-chunk text via the row's chunk_key / source_id."""
+    header = {"chunk_key": row.get("chunk_key"), "source_id": row.get("source_id")}
+    if not header["chunk_key"]:
+        return None
+    # find_chunk_text resolves from the header when chunk_key is present;
+    # the json_path arg is only a fallback, so a synthetic path is fine.
+    text = find_chunk_text(Path(f"{row['chunk_key']}.json"), header, chunks_dir)
+    if text and len(text) > _CHUNK_TEXT_CAP:
+        text = text[:_CHUNK_TEXT_CAP] + "\n…[chunk truncated]…"
+    return text
+
+
+def _current_fields_block(row: dict) -> str:
+    """The activity's current DB values, as a compact JSON block for context."""
+    fields = {
+        "name": row.get("name"),
+        "description": row.get("description"),
+        "rules": row.get("rules"),
+        "variations": row.get("variations"),
+        "category": row.get("category"),
+        "content_type": row.get("content_type"),
+        "language": row.get("language"),
+        "participants_min": row.get("participants_min"),
+        "participants_max": row.get("participants_max"),
+        "duration_min": row.get("duration_min"),
+        "duration_max": row.get("duration_max"),
+        "age_group_min": row.get("age_group_min"),
+        "age_group_max": row.get("age_group_max"),
+    }
+    return json.dumps(fields, ensure_ascii=False, indent=2)
+
+
+def emit_enrichment_prompt(
+    row: dict, key: str, chunks_dir: Path, prompts_dir: Path
+) -> Path:
+    """Write the subagent enrichment prompt for one activity."""
+    chunk_text = _chunk_text_for_row(row, chunks_dir)
+    source_block = (
+        chunk_text if chunk_text is not None
+        else "[source chunk text unavailable — translate only what is given "
+             "above; do NOT invent steps, and mark any inferred filter field "
+             "as estimated]"
+    )
+    part_path = f"data/enrichment_parts/{key}.json"
+    text = "\n".join([
+        f"# ENRICHMENT — activity `{row.get('name')}` (id {row.get('id')})",
+        "",
+        f"Follow the rules in `{ENRICHMENT_PROMPT.relative_to(REPO_ROOT)}` EXACTLY.",
+        "Single pass. Translate faithfully to Romanian; expand description_ro "
+        "ONLY from the source chunk text below; mark inferred filter fields in "
+        "`estimated_fields`.",
+        "",
+        f"Write the result JSON to: `{part_path}`",
+        f'It MUST include `"content_key": "{key}"`.',
+        f'Page reference: {row.get("page_reference") or "?"}',
+        "",
+        "## Current activity values (the text to translate / enrich)",
+        "```json",
+        _current_fields_block(row),
+        "```",
+        "",
+        "## Source chunk text (ground description_ro expansion in THIS only)",
+        "```",
+        source_block,
+        "```",
+        "",
+    ])
+    prompts_dir.mkdir(parents=True, exist_ok=True)
+    out = prompts_dir / f"{key}.prompt.md"
+    out.write_text(text, encoding="utf-8")
+    return out
+
+
+def collect_enrichment(parts_dir: Path, out_path: Path) -> dict:
+    """Merge data/enrichment_parts/*.json into one flat content_key map."""
+    merged: dict = {}
+    bad: list[str] = []
+    if parts_dir.is_dir():
+        for part in sorted(parts_dir.glob("*.json")):
+            try:
+                data = json.loads(part.read_text(encoding="utf-8"))
+                key = data.get("content_key") or part.stem
+                entry = {k: v for k, v in data.items() if k != "content_key"}
+            except (ValueError, OSError, AttributeError):
+                bad.append(part.name)  # unreadable, invalid, or not a JSON object
+                continue
+            merged[key] = entry
+    out_path.write_text(
+        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8"
+    )
+    return {"entries": len(merged), "bad_parts": bad, "out": str(out_path)}
+
+
+def run_emit(
+    *,
+    db_path: Path,
+    chunks_dir: Path,
+    parts_dir: Path,
+    prompts_dir: Path,
+    source_substr: Optional[str],
+    limit: Optional[int],
+) -> dict:
+    rows = _fetch_rows(db_path, source_substr)
+    emitted, skipped = 0, 0
+    for row in rows:
+        key = _row_content_key(row)
+        if (parts_dir / f"{key}.json").is_file():
+            skipped += 1
+            continue
+        emit_enrichment_prompt(row, key, chunks_dir, prompts_dir)
+        emitted += 1
+        if limit and emitted >= limit:
+            break
+    return {
+        "rows": len(rows),
+        "emitted": emitted,
+        "skipped_done": skipped,
+        "prompts_dir": str(prompts_dir),
+    }
+
+
+def main(argv: Optional[list[str]] = None) -> int:
+    parser = argparse.ArgumentParser(description="Enrichment orchestrator.")
+    parser.add_argument("--db", default="data/activities.db")
+    parser.add_argument("--chunks", default="data/chunks")
+    parser.add_argument("--parts", default="data/enrichment_parts")
+    parser.add_argument("--prompts", default="data/enrichment_prompts")
+    parser.add_argument("--out", default="data/enrichment.json")
+    parser.add_argument("--source", default=None,
+                        help="only rows whose source_id/chunk_key contains this (pilot)")
+    parser.add_argument("--limit", type=int, default=None,
+                        help="cap emitted prompts (pilot)")
+    parser.add_argument("--collect", action="store_true",
+                        help="merge enrichment parts into the overlay JSON")
+    args = parser.parse_args(argv)
+
+    print("=" * 60)
+    print("ENRICHMENT ORCHESTRATOR")
+    print("=" * 60)
+
+    if args.collect:
+        result = collect_enrichment(Path(args.parts), Path(args.out))
+        print(f"collected  : {result['entries']} entries -> {result['out']}")
+        if result["bad_parts"]:
+            print(f"bad parts  : {len(result['bad_parts'])} (skipped)")
+            for name in result["bad_parts"]:
+                print(f"  - {name}")
+        print("Run build_database.py --rebuild to apply the overlay.")
+        print("=" * 60)
+        return 0
+
+    summary = run_emit(
+        db_path=Path(args.db),
+        chunks_dir=Path(args.chunks),
+        parts_dir=Path(args.parts),
+        prompts_dir=Path(args.prompts),
+        source_substr=args.source,
+        limit=args.limit,
+    )
+    print(f"rows in DB        : {summary['rows']}"
+          + (f"  (filtered by '{args.source}')" if args.source else ""))
+    print(f"already enriched  : {summary['skipped_done']}")
+    print(f"prompts emitted   : {summary['emitted']}")
+    if summary["emitted"]:
+        print(f"prompts dir       : {summary['prompts_dir']}/")
+        print("Launch waves of ~8-16 Sonnet subagents on those prompts, each "
+              "writing data/enrichment_parts/<key>.json, then run "
+              "run_enrichment.py --collect and build_database.py --rebuild.")
+    else:
+        print("Nothing to emit — run --collect then build_database.py --rebuild.")
+    print("=" * 60)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
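
The shape that --collect produces can be illustrated with two throwaway part files. This is a sketch under assumptions: merge_parts is a hypothetical, simplified version of collect_enrichment with no bad-part handling, and the keys and field values are invented for the example.

```python
import json
import tempfile
from pathlib import Path


def merge_parts(parts_dir: Path) -> dict:
    """Simplified merge: one part file per activity -> flat content_key map."""
    merged: dict = {}
    for part in sorted(parts_dir.glob("*.json")):
        data = json.loads(part.read_text(encoding="utf-8"))
        # Prefer the embedded content_key; fall back to the file name.
        key = data.get("content_key") or part.stem
        merged[key] = {k: v for k, v in data.items() if k != "content_key"}
    return merged


with tempfile.TemporaryDirectory() as tmp:
    parts = Path(tmp)
    first = {
        "content_key": "abc123",
        "name_ro": "Vânătoarea de comori",
        "space_needed": "mare",
        "estimated_fields": ["space_needed"],
    }
    (parts / "abc123.json").write_text(
        json.dumps(first, ensure_ascii=False), encoding="utf-8"
    )
    # A part that forgot its content_key is keyed by its file stem instead.
    (parts / "def456.json").write_text(
        json.dumps({"name_ro": "Joc"}), encoding="utf-8"
    )
    overlay_map = merge_parts(parts)

print(json.dumps(overlay_map, ensure_ascii=False, indent=2))
```

The resulting map has the same layout that load_enrichment expects: content_key at the top level, with the remaining fields as the entry body.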
--- a/tests/test_enrichment.py
+++ b/tests/test_enrichment.py
@@ -0,0 +1,231 @@
+"""
+Tests for the enrichment overlay (plan Part B) and the new filter axes /
+bilingual display helpers (plan Part A).
+
+Covers:
+  * config_taxonomy.normalize_indoor_outdoor / normalize_space_needed
+  * build_database.apply_enrichment keying, field application, estimated tally
+  * DatabaseManager indoor_outdoor / space_needed equality filters
+  * FTS5 indexing of the *_ro columns
+  * Activity bilingual display helpers
+"""
+
+import os
+import sys
+
+import pytest
+
+PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
+if PROJECT_ROOT not in sys.path:
+    sys.path.insert(0, PROJECT_ROOT)
+SCRIPTS = os.path.join(PROJECT_ROOT, "scripts")
+if SCRIPTS not in sys.path:
+    sys.path.insert(0, SCRIPTS)
+
+from app.models.activity import Activity  # noqa: E402
+from app.models.database import DatabaseManager  # noqa: E402
+from app.config_taxonomy import (  # noqa: E402
+    normalize_indoor_outdoor,
+    normalize_space_needed,
+)
+from import_common import content_key, normalize_name  # noqa: E402
+from build_database import apply_enrichment  # noqa: E402
+
+
+# --------------------------------------------------------------------------
+# taxonomy normalizers
+# --------------------------------------------------------------------------
+@pytest.mark.parametrize("raw,expected", [
+    ("indoor", "indoor"),
+    ("Outdoor", "outdoor"),
+    ("either", "either"),
+    ("interior", "indoor"),
+    ("aer liber", "outdoor"),
+    ("both", "either"),
+    ("", None),
+    ("nonsense", None),
+    (None, None),
+])
+def test_normalize_indoor_outdoor(raw, expected):
+    assert normalize_indoor_outdoor(raw) == expected
+
+
+@pytest.mark.parametrize("raw,expected", [
+    ("mic", "mic"),
+    ("MEDIU", "mediu"),
+    ("mare", "mare"),
+    ("small", "mic"),
+    ("large", "mare"),
+    ("", None),
+    ("huge", None),
+    (None, None),
+])
+def test_normalize_space_needed(raw, expected):
+    assert normalize_space_needed(raw) == expected
+
+
+# --------------------------------------------------------------------------
+# apply_enrichment
+# --------------------------------------------------------------------------
+def _activity(name="Joc de test", description="O descriere de test.", language="ro"):
+    return Activity(
+        name=name, description=description, category="team-building",
+        content_type="joc", source_file="t.txt", language=language,
+    )
+
+
+def _key_for(act: Activity) -> str:
+    return content_key(
+        act.normalized_name or normalize_name(act.name),
+        act.language,
+        act.description or "",
+    )
+
+
+def test_apply_enrichment_matches_and_applies_fields():
+    act = _activity()
+    key = _key_for(act)
+    enrichment = {
+        key: {
+            "name_ro": "Joc de test (RO)",
+            "description_ro": "Descriere îmbogățită în română.",
+            "indoor_outdoor": "outdoor",
+            "space_needed": "mediu",
+            "participants_min": 4,
+            "participants_max": 12,
+            "estimated_fields": ["space_needed", "participants_min", "participants_max"],
+        }
+    }
+    stats = apply_enrichment([act], enrichment)
+
+    assert act.name_ro == "Joc de test (RO)"
+    assert act.description_ro == "Descriere îmbogățită în română."
+    assert act.indoor_outdoor == "outdoor"
+    assert act.space_needed == "mediu"
+    assert act.participants_min == 4 and act.participants_max == 12
+    assert set(act.estimated_fields) == {"space_needed", "participants_min", "participants_max"}
+
+    assert stats["entries"] == 1
+    assert stats["matched"] == 1
+    assert stats["orphaned"] == 0
+    # indoor_outdoor stated, space_needed estimated
+    assert stats["fields_stated"].get("indoor_outdoor") == 1
+    assert stats["fields_estimated"].get("space_needed") == 1
+
+
+def test_apply_enrichment_orphan_entry_counted():
+    act = _activity()
+    enrichment = {"deadbeef" * 5: {"name_ro": "nu se potrivește"}}
+    stats = apply_enrichment([act], enrichment)
+    assert stats["matched"] == 0
+    assert stats["orphaned"] == 1
+    assert act.name_ro is None  # untouched
+
+
+def test_apply_enrichment_absent_fields_leave_value_untouched():
+    act = _activity()
+    act.participants_min = 5
+    key = _key_for(act)
+    # entry only translates name; participants must be preserved
+    apply_enrichment([act], {key: {"name_ro": "Tradus"}})
+    assert act.participants_min == 5
+    assert act.name_ro == "Tradus"
+
+
+def test_apply_enrichment_drops_unrecognised_enum():
+    act = _activity()
+    key = _key_for(act)
+    apply_enrichment([act], {key: {"indoor_outdoor": "spaceship", "space_needed": "mic"}})
+    assert act.indoor_outdoor is None       # unrecognised → dropped
+    assert act.space_needed == "mic"
+
+
+# --------------------------------------------------------------------------
+# DB equality filters + FTS on *_ro
+# --------------------------------------------------------------------------
+@pytest.fixture
+def db(tmp_path):
+    return DatabaseManager(str(tmp_path / "enrich.db"))
+
+
+def _insert(db, **overrides):
+    base = dict(
+        name="Activitate", description="desc", category="camp-outdoor",
+        content_type="joc", source_file="t.txt", language="ro",
+    )
+    base.update(overrides)
+    return db.insert_activity(Activity(**base))
+
+
+def test_indoor_outdoor_equality_filter(db):
+    _insert(db, name="In casa", indoor_outdoor="indoor")
+    _insert(db, name="Afara", indoor_outdoor="outdoor")
+    res = db.search_activities(indoor_outdoor="outdoor")
+    assert len(res) == 1
+    assert res[0]["name"] == "Afara"
+
+
+def test_space_needed_equality_filter(db):
+    _insert(db, name="Mic", space_needed="mic")
+    _insert(db, name="Mare", space_needed="mare")
+    res = db.search_activities(space_needed="mare")
+    assert len(res) == 1
+    assert res[0]["name"] == "Mare"
+
+
+def test_fts_indexes_name_ro(db):
+    _insert(db, name="Treasure Hunt", name_ro="Vânătoarea de comori")
+    # term only present in the Romanian twin
+    res = db.search_activities(search_text="comori")
+    assert len(res) == 1
+    assert res[0]["name"] == "Treasure Hunt"
+
+
+def test_fts_indexes_description_ro(db):
+    _insert(db, name="Game", description="english desc",
+            description_ro="o activitate de cooperare")
+    res = db.search_activities(search_text="cooperare")
+    assert len(res) == 1
+
+
+def test_ro_columns_round_trip(db):
+    aid = _insert(
+        db, name="X", name_ro="X-ro", description_ro="d-ro",
+        rules_ro="r-ro", variations_ro="v-ro",
+        indoor_outdoor="either", space_needed="mediu",
+        estimated_fields=["duration_min"], source_id="src1",
+        source_ids=["src1", "src2"], chunk_key="src1.part01",
+    )
+    row = db.get_activity_by_id(aid)
+    loaded = Activity.from_dict(row)
+    assert loaded.name_ro == "X-ro"
+    assert loaded.indoor_outdoor == "either"
+    assert loaded.space_needed == "mediu"
+    assert loaded.estimated_fields == ["duration_min"]
+    assert loaded.source_ids == ["src1", "src2"]
+    assert loaded.chunk_key == "src1.part01"
+
+
+# --------------------------------------------------------------------------
+# display helpers
+# --------------------------------------------------------------------------
+def test_display_helpers_prefer_ro_with_fallback():
+    act = _activity(name="Original", description="Original desc")
+    assert act.get_display_name() == "Original"          # no translation yet
+    assert act.get_display_description() == "Original desc"
+    act.name_ro = "Tradus"
+    act.description_ro = "Descriere tradusă"
+    assert act.get_display_name() == "Tradus"
+    assert act.get_display_description() == "Descriere tradusă"
+    assert act.has_translation() is True
+
+
+def test_is_estimated_and_axis_displays():
+    act = _activity()
+    act.indoor_outdoor = "outdoor"
+    act.space_needed = "mare"
+    act.estimated_fields = ["space_needed"]
+    assert act.get_indoor_outdoor_display() == "Exterior"
+    assert act.get_space_needed_display() == "Spațiu mare"
+    assert act.is_estimated("space_needed") is True
+    assert act.is_estimated("indoor_outdoor") is False