Headless cron enrichment system + progress checkpoint at 32%

OS cron fires enrich_wave.sh twice nightly (post 23:00 UTC reset); each wave caps at ~700 keys (~75% window) via enrichment_wave.py --prepare. Fully headless: one claude -p per batch via xargs, flock-guarded, idempotent. DB updated to 9541 activities; .gitignore covers enrichment intermediates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Prevent + net the unescaped-quote bug in the durable prompts/pipeline
2026-06-01 21:26:35 +00:00 · 2026-05-29 18:16:04 +00:00 · 2026-05-29 18:10:13 +00:00 · 2026-05-20 19:32:44 +00:00 · 2026-05-20 07:59:36 +00:00 · 2026-05-20 07:43:42 +00:00
50 changed files with 6706 additions and 1905 deletions
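The bounded-wave idea in the commit message (cap each wave at ~700 keys, split into batches) can be sketched as follows. This is a minimal illustration, not the code in `enrichment_wave.py`: the function name `plan_wave` is hypothetical, and the batch size of 12 is taken from the batch files described in `HANDOFF.md`.

```python
def plan_wave(prompt_keys, done_keys, cap=700, batch_size=12):
    """Return the batches (lists of keys) for one bounded wave.

    Hypothetical helper: takes the first `cap` sorted missing keys and
    splits them into batches. An empty result means nothing is missing,
    which corresponds to the "WAVE: COMPLETE" case.
    """
    missing = sorted(set(prompt_keys) - set(done_keys))
    wave = missing[:cap]
    return [wave[i:i + batch_size] for i in range(0, len(wave), batch_size)]


if __name__ == "__main__":
    keys = [f"k{i:04d}" for i in range(1000)]
    batches = plan_wave(keys, done_keys=keys[:100])
    print(len(batches), "batches,", sum(len(b) for b in batches), "keys")
```

Because the selection is a pure function of what is missing on disk, re-running it after a partial wave naturally picks up the keys that were dropped.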
--- a/.gitignore
+++ b/.gitignore
@@ -165,9 +165,14 @@ cython_debug/
 *.db.backup
 *.db.bak
 *.db.tmp
+*.db.prefreeze*
 *.sqlite.backup
 *.sqlite3.backup

+# Agent runtime locks
+.claude/scheduled_tasks.lock
+.claude/*.lock
+
 # Temporary files
 *.tmp
 *.backup
@@ -179,6 +184,13 @@ data/sources/
 data/chunks/
 data/extracted/

+# Enrichment pipeline intermediates (LLM output; final result lands in data/activities.db)
+data/enrichment_prompts/
+data/enrichment_parts/
+data/enrichment_batches/
+data/enrichment_wf/
+data/enrichment.json
+
 # Keep main production database, the hand-written index, and committed golden set
 !data/activities.db
 !data/INDEX_MASTER_JOCURI_ACTIVITATI.md
--- a/ENRICHMENT_PILOT.md
+++ b/ENRICHMENT_PILOT.md
@@ -0,0 +1,71 @@
+# Enrichment PILOT — sign-off required before full-corpus scaling
+
+**Date:** 2026-05-29. Pilot covers **34 activities** (the STOP gate from `HANDOFF.md`
+step 3, guarding ~6–8k LLM calls across the full corpus).
+
+## Pipeline integrity (all green)
+
+| Hop | Expected | Actual |
+|-----|----------|--------|
+| prompts emitted | 34 | 34 |
+| part files on disk (valid JSON, key matches filename) | 34 | 34 |
+| `enrichment.json` entries after `--collect` | 34 | 34 |
+| rebuild overlay: `matched` / `orphaned` | 34 / 0 | **34 / 0** |
+
+No leak at any hop. `orphaned 0` confirms the content_key the rebuild computes
+matches what `run_enrichment` emitted (no dedup rep-selection drift).
+
+## Pilot composition
+
+Deliberately mixed to exercise BOTH operations (corpus is 7076 EN / 2465 RO, so
+en→ro translation is the dominant + highest-risk path):
+
+- **26** rows from `teambuilding_corbu` — all Romanian → **ro→ro polish**
+- **8** rows from `d3959920_outdoor_games` — all English → **en→ro translation**
+
+Result: ~7 genuine en→ro translations + ~27 ro→ro polish.
+
+## Field population (stated vs estimated)
+
+```
+age_group_max     : 0 stated / 30 estimated
+age_group_min     : 0 / 34
+duration_max      : 3 / 29
+duration_min      : 4 / 28
+indoor_outdoor    : 12 / 22
+participants_max  : 0 / 24
+participants_min  : 4 / 30
+space_needed      : 2 / 32
+```
+
+Almost everything is estimated — sources rarely state ages/durations explicitly.
+The pipeline marks every inferred field in `estimated_fields`, and the UI shows an
+`(estimat)` marker, so estimates are transparent to end users.
+
+## What to evaluate (the three sign-off axes)
+
+1. **Translation fidelity (en→ro)** — e.g. *Labels → Etichete*, *Ships in a Fog →
+   Nave în ceață*, *Spot the Colours → Găsește culorile*. Game rules preserved,
+   no moralizing added, proper terms kept.
+2. **Description fidelity / expansion** — ro→ro rows fold in setup/material detail
+   that IS in the source chunk (e.g. *Găsește-ți fratele și sora* adds "carton A6"
+   + "la semnal, toți încep simultan"; *Ce-mi place?* folds in the character-traits
+   discussion). No invented steps observed.
+3. **Estimation plausibility** — mostly reasonable. **Weak spots to judge:** a few
+   age ranges are very wide/defaulted (e.g. *Găsește-ți fratele și sora* → age
+   10–99). If wide age defaults are unacceptable, tighten the ENRICHMENT_PROMPT
+   guidance before scaling.
+
+## Inspect the data yourself
+
+```bash
+sqlite3 data/activities.db "select name, name_ro, language, indoor_outdoor, space_needed, estimated_fields from activities where name_ro is not null;"
+# raw overlay: data/enrichment.json (34 entries)
+# per-activity parts: data/enrichment_parts/*.json
+```
+
+## After sign-off (do NOT auto-proceed)
+
+Scale in waves of ~8–16 Sonnet subagents over the rest of the corpus
+(`run_enrichment.py` is additive + resumable — skips already-enriched keys),
+`--collect`, then final `build_database.py --rebuild --enrichment`.
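The "stated vs estimated" table in `ENRICHMENT_PILOT.md` can be reproduced with a tally along these lines. It is a sketch under assumptions: each part is taken to be a dict whose `estimated_fields` entry is a list of field names, and the helper name `tally_field_population` is hypothetical rather than part of the pipeline.

```python
from collections import Counter

FILTER_FIELDS = (
    "age_group_min", "age_group_max", "duration_min", "duration_max",
    "participants_min", "participants_max", "indoor_outdoor", "space_needed",
)


def tally_field_population(parts):
    """Count, per filter field, how many values were stated vs estimated.

    A populated field counts as estimated when its name appears in the
    part's `estimated_fields` list (assumed shape), otherwise as stated.
    Fields that are absent or None are not counted at all.
    """
    stated, estimated = Counter(), Counter()
    for part in parts:
        inferred = set(part.get("estimated_fields") or [])
        for name in FILTER_FIELDS:
            if part.get(name) is None:
                continue
            (estimated if name in inferred else stated)[name] += 1
    return {name: (stated[name], estimated[name]) for name in FILTER_FIELDS}


if __name__ == "__main__":
    sample = [
        {"indoor_outdoor": "indoor", "duration_min": 10,
         "estimated_fields": ["duration_min"]},
        {"indoor_outdoor": "outdoor", "estimated_fields": ["indoor_outdoor"]},
    ]
    for field, (n_stated, n_estimated) in tally_field_population(sample).items():
        print(f"{field:18}: {n_stated} / {n_estimated}")
```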
--- a/HANDOFF.md
+++ b/HANDOFF.md
@@ -0,0 +1,276 @@
+# HANDOFF — Enrichment PILOT done; STOP at user sign-off gate before full-corpus scaling
+
+**Snapshot:** 2026-05-29 (updated). Executing plan `enumerated-petting-badger.md`
+(bilingual index + enrichment + new filters + source download).
+
+**>>> CURRENT STATE (2026-05-29): user SIGNED OFF on the pilot → full-corpus
+enrichment SCALING IN PROGRESS via 8 PARALLEL background Workflows on Sonnet.
+Parts on disk (`data/enrichment_parts/<key>.json`) = the durable checkpoint. <<<**
+
+Two earlier single-workflow runs were stopped: the first ran on Opus by mistake
+(workflow `agent()` inherits the main-loop model unless `model:'sonnet'` is passed —
+fixed). Measured rate: ~4.7 keys/min single-shard, ~17 keys/min at 3 shards (linear,
+no real rate-limit errors — the "429" hits in transcripts are line numbers in chunk
+text, not API errors). Concurrency is capped at 2 PER workflow (`nproc`=4 →
+`min(16,cores-2)`), so parallelism = run many workflows.
+
+**8 shard scripts: `data/enrichment_wf/shard_0.js` … `shard_7.js`**, each owns a
+disjoint batch range of `data/enrichment_batches/batch_NNNN.txt` (780 batches × ~12
+keys = 9357 keys; ranges: s0 [0,98) s1 [98,196) s2 [196,294) s3 [294,392) s4 [392,490)
+s5 [490,588) s6 [588,686) s7 [686,780)). Each agent is IDEMPOTENT (skips keys whose
+part already exists + parses), so re-launching any shard is safe. Run IDs:
+s0 `wf_3c314d06-01c` · s1 `wf_ecc7d151-a11` · s2 `wf_4156be35-748` ·
+s3 `wf_fa16abee-17a` · s4 `wf_a0f595b8-8fe` · s5 `wf_b3505593-09a` ·
+s6 `wf_ad0d731e-12e` · s7 `wf_a919a99b-1d2`.
+
+### ▶ RESUME HERE (2026-06-01 — THROTTLED CRON SYSTEM now drives enrichment)
+
+**Enrichment progress: 3074 / 9541 done (32.2%), ~6467 missing.** Nothing running.
+Parts on disk (`data/enrichment_parts/*.json`) are the durable, idempotent checkpoint.
+
+**A paced wave system now runs FULLY HEADLESS via the LXC's OS cron — NO Claude
+session required.** Fixes the "always runs to exhaustion" bug: each wave caps at
+~75% of a 5h window and the next window is reached by time (cron).
+
+ARCHITECTURE: OS cron → `scripts/enrich_wave.sh` → one `claude -p` per batch,
+PAR-way parallel (OS-level — NOT the Workflow tool, which can't be used headless:
+`claude -p` is one-shot and would exit before background workflows finish). Each
+headless `claude -p` reads a batch file and writes `data/enrichment_parts/<key>.json`.
+
+- **`scripts/enrichment_wave.py`** (prepares a bounded wave, no LLM):
+  - `--status` — read-only progress (done / missing / pct / corrupt count).
+  - `--prepare --keys 700 --no-shards` — drop corrupt parts; take FIRST 700
+    sorted-missing keys; write batch files for ONLY those; print
+    `WAVE: PREPARED …` or `WAVE: COMPLETE`. `--no-shards` = batch files only
+    (the headless path). (Without `--no-shards` it also regenerates Workflow shard
+    JS from `data/enrichment_wf/shard.js.tmpl` — only needed for the old Workflow path.)
+- **`scripts/enrich_wave.sh [KEYS] [PAR]`** (the headless orchestrator, run by cron):
+  flock-guarded (waves never overlap); `--prepare`; if `WAVE: COMPLETE` → `--collect`
+  + `--rebuild` and stop; else `xargs -P PAR` one `claude -p` per batch
+  (`--allowedTools Bash(python3:*),Read,Write,Bash(cat:*),Bash(ls:*)`, `</dev/null`).
+  Logs to `/workspace/.claude-logs/enrich_<ts>.log`. Detects + logs "WINDOW EXHAUSTED".
+- **OS crontab (user `claude`, `crontab -l` to view):** two night fires
+  `20 23 * * *` and `50 0 * * *` UTC (= 02:20 & 03:50 EEST). Timed AFTER the live-
+  confirmed **23:00 UTC usage-window reset** so both land in the fresh post-reset
+  window (user asleep → safe to use it fully; two 700-caps top out at the window's
+  ~950 capacity). Self-healing: a fire into an exhausted window is a harmless no-op
+  (`claude -p` prints "session limit", writes nothing) and those keys retry next fire.
+
+**Auth caveat:** headless `claude -p` uses the OAuth token in
+`~/.claude/.credentials.json` (verified working). If it ever expires and can't refresh
+non-interactively, cron fires fail with auth errors → user must `claude` login once.
+
+**Manual fallback (one wave, any time, no session needed):**
+```bash
+/workspace/game-library/scripts/enrich_wave.sh 700 6      # runs a full wave now
+# or step-by-step:
+python3 scripts/enrichment_wave.py --status               # progress
+python3 scripts/run_enrichment.py --collect && python3 scripts/build_database.py --rebuild  # at WAVE: COMPLETE
+#   gate: rebuild must print enrichment {N} (matched N, orphaned 0)
+```
+
+**Control:** `crontab -e` to retime/disable; `crontab -r` removes all. Tune `--keys`
+(KEYS arg) up to drain faster, down if logs show "WINDOW EXHAUSTED" early. One full
+window ≈ 950 keys ≈ 100%.
+
+**Hard facts learned:**
+- Workflow concurrency is capped at **2 per workflow** (`nproc`=4 → `min(16,cores-2)`); parallelism = run many workflow processes. 3 shards measured ~17 keys/min (linear, no real rate-limit).
+- Workflow `agent()` inherits the **main-loop model unless `model:'sonnet'` is passed** — the FIRST run silently used Opus; always pass model.
+- The full corpus does **NOT fit in one 5h usage window** — it needs SEVERAL windows. Parallelism only cuts wall-clock inside a window, not total token budget.
+- Main-session token drain was **polling** (sleep/grep loops), NOT launching workflows. Launch + wait-for-notification only.
+- StructuredOutput failures appear when a window exhausts mid-run — harmless; idempotent skip + the regenerate-from-missing reconcile recover every dropped key.
+
+(prev note) Earlier STOPPED at 593/9541 — hit 92% of the 5h Anthropic
+usage window (resets 23:00 UTC). KEY LESSON: the full corpus does NOT fit in one 5h
+window; 6.2% + the session's other work already used ~92%. Enrichment must be spread
+across MANY 5h windows (parallelism only cuts wall-clock inside a window, not total
+token budget). Resume model: after each window reset, regenerate batches from
+currently-missing, relaunch a bounded number of shards, stop before the window
+exhausts. Idempotent shards + parts-on-disk make this safe to repeat. Consider a
+cron/scheduled job that runs a bounded wave each reset.
+
+**To regenerate batches from currently-missing + relaunch a shard** (reconcile):
+```bash
+python3 - <<'PY'
+import glob, os
+BATCH=12
+missing=sorted(os.path.basename(p)[:-len('.prompt.md')] for p in glob.glob('data/enrichment_prompts/*.prompt.md')
+               if not os.path.exists('data/enrichment_parts/'+os.path.basename(p)[:-len('.prompt.md')]+'.json'))
+for old in glob.glob('data/enrichment_batches/batch_*.txt'): os.remove(old)
+for n,i in enumerate(range(0,len(missing),BATCH)):
+    open(f'data/enrichment_batches/batch_{n:04d}.txt','w').write('\n'.join(missing[i:i+BATCH])+'\n')
+print('missing',len(missing),'batches',-(-len(missing)//BATCH))
+PY
+# then edit START/END in the shard files to cover the new batch count and re-invoke Workflow({scriptPath: 'data/enrichment_wf/shard_K.js'})
+```
+
+### Resume / completion procedure (do this when the workflow finishes — or to continue a new session)
+
+The pipeline is RESUMABLE: parts on disk are truth; re-running regenerates work only for missing keys.
+
+1. **Rebuild the batch list from what's still missing** (prompt exists, part absent), then re-run the workflow for the gap:
+   ```bash
+   # regenerate batch files for missing keys (script below already lives in shell history; logic:
+   #   for each data/enrichment_prompts/<key>.prompt.md with no data/enrichment_parts/<key>.json,
+   #   split into data/enrichment_batches/batch_NNNN.txt of 12)
+   ```
+   The workflow script is at
+   `.../workflows/scripts/enrich-corpus-wf_440c0a2f-17f.js` (nBatches hardcoded → update it to the new batch count, or it defaults to 793). Re-invoke with `{scriptPath: ...}`.
+2. **Reconcile loop** (expect 2–3 passes — some parts WILL drop: flaky agents, a stray quote that slips re-validation): repeat step 1 until `missing == 0`.
+3. **Collect + final rebuild ONCE at the end** (don't rebuild after every wave — 9541 rows is wasted work):
+   ```bash
+   python3 scripts/run_enrichment.py --collect      # robust: repairs stray-quote parts, skips+reports truly-broken
+   python3 scripts/build_database.py --rebuild       # picks up --enrichment by default
+   ```
+   **Gate:** rebuild must print `enrichment {entries} (matched {entries}, orphaned 0)`. Done-criterion is the reconcile counts converging: `emitted == parts-on-disk == entries == matched`.
+
+### ⚠ FREEZE IS NOW LOCKED
+Enrichment content_keys depend on the current freeze. The earlier "re-freezing is safe"
+note is **INVERTED** now — do NOT re-extract or re-freeze `data/extracted/` until the
+final `--rebuild`, or content_keys drift and the overlay orphans.
+
+## Where we are
+
+| Step (plan Part C) | Status |
+|--------------------|--------|
+| 1. Finish extraction | **DONE** — 582 chunks extracted & valid (7 re-extracted this session); 6 content-filter-blocked, accepted as missing |
+| 2. Land code Part A1–A4 (model/schema/merge) | **DONE & committed** |
+| 2b. Code Part A5–A8 (UI/search/download) | **DONE & committed** |
+| 2c. Code Part B2–B4 (enrichment pipeline) | **DONE & committed** |
+| 3. Freeze rebuild (freezes content_keys) | **DONE** — `data/activities.db` = **9541 activities** (re-frozen with the 7 chunks) |
+| Part D tests | **DONE** — `tests/test_enrichment.py`, 99 pass total |
+| 4. Enrichment pilot → **STOP for user sign-off** | **DONE — 34 activities enriched (26 ro-polish + 8 en→ro), pipeline 34/34 matched, 0 orphaned. AWAITING SIGN-OFF.** |
+| 5. Final rebuild `--enrichment` | not started (post sign-off) |
+
+## The 7 re-extracted chunks (this session)
+
+Re-extracted via Sonnet subagents, all valid JSON, re-frozen into the corpus.
+One (`d297a434…part01`) had an activity named "Eu" (<3 chars, schema-rejected);
+renamed faithfully to "Eu sunt..." (matches the source affirmation). The 6
+content-filter-blocked chunks remain accepted as missing.
+
+Everything is committed except whatever this session leaves dirty. `data/extracted/*.json`
+is gitignored (575 files on disk, durable across /clear).
+
+## The 13 missing chunks (out of 588)
+
+**6 content-filter-blocked** (Anthropic safety; accept as missing — marginal loss):
+- `87850302_dragon_sleepdeprived.part73 / .part85 / .part94` (camp song lyrics)
+- `c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96`
+
+**7 need RE-EXTRACTION** (their malformed-original JSON was destroyed — see "json_repair
+incident" below; re-extract once the subagent session limit resets, ~5pm UTC):
+```
+3f9c8232_teambuilding_corbu_29092023.part01
+5f959f85_scoli_fara_bullying.part02
+83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04
+d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01
+d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04
+d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05
+e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03
+```
+Re-extract these (Sonnet subagents, one Agent call each, the per-chunk prompt is at
+`data/chunks/_prompts/<key>.prompt.md`), then **re-run the freeze rebuild** so they join
+the corpus before enrichment. Re-freezing is safe now — enrichment has NOT run, so no
+overlay keys depend on the current freeze yet.
+
+## The json_repair incident (important — root cause + what was fixed)
+
+Subagents **systematically emit unescaped ASCII `"` inside string values** (Romanian
+text like `„Unu"` uses a closing `"` that terminates the JSON string early). ~34 files
+were affected.
+
+First repair attempt used the `json_repair` lib. **It truncates**: on a stray quote it
+ends the string and reinterprets the trailing text as a new key, silently dropping the
+rest of the value and injecting garbage keys. Schema `additionalProperties:false` caught
+the garbage-key cases (8 files dropped at rebuild), but the truncation that didn't create
+an extra key slipped through. Applying json_repair output to disk also **overwrote the
+malformed originals** for those 8 → originals lost → those (now 7, one recovered) need
+re-extraction.
+
+**Fix:** `scripts/repair_extractions.py` was rewritten to use a faithful char-scanner
+(`escape_stray_quotes`) that **escapes** stray quotes (`\"`) instead of splitting on them,
+validates against the real schema, and only replaces a valid top-level file when the
+repaired version carries **strictly more text** (a length guard that catches truncated
+json_repair output while leaving genuine extractions untouched). Re-running it cleanly
+repaired the affected files; the final freeze had **0 schema-rejected, 0 invalid**.
+`json_repair` is no longer used anywhere. Do NOT reintroduce it.
+
+`build_database.py` does NOT depend on the repair script (the "DB regenerable from
+data/extracted/" invariant holds — plain `json.loads` only).
+
+## What the code does now (all committed)
+
+**Part A — plumbing (corpus-independent):**
+- `app/models/database.py`: new columns `name_ro/description_ro/rules_ro/variations_ro,
+  indoor_outdoor, space_needed, estimated_fields(JSON), source_id, source_ids(JSON),
+  chunk_key`; FTS5 indexes the 4 `*_ro` columns (CREATE + all 3 triggers — kept in sync);
+  indexes on `indoor_outdoor`/`space_needed`; `search_activities` gained `indoor_outdoor`
+  and `space_needed` equality kwargs; `_update_category_counts` feeds both new axes into
+  the categories table so dropdowns populate.
+- `app/models/activity.py`: new fields + `to_dict`/`from_dict`; helpers `get_display_name`
+  / `get_display_description` / `get_display_rules` / `get_display_variations`
+  (RO-primary, EN fallback), `has_translation`, `is_estimated(field)`,
+  `get_indoor_outdoor_display`, `get_space_needed_display`.
+- `app/config_taxonomy.py`: `INDOOR_OUTDOOR`, `SPACE_NEEDED` enums + RO labels +
+  `normalize_indoor_outdoor` / `normalize_space_needed` (None on unrecognised, no
+  fallback — never fabricate a value) + display-name helpers.
+- `scripts/build_database.py`: `dict_to_activity` sets `source_id`+`chunk_key`;
+  `merge_cluster` unions `source_ids` and carries rep's `source_id`/`chunk_key` but
+  **never** touches enrichment fields (those are applied post-dedup).
+
+**Part A — UI/search:**
+- `app/services/search.py`: `_map_filters_to_db_fields` maps `indoor_outdoor`/
+  `space_needed` to DB equality filters.
+- `app/web/routes.py`: new `/source/<id>` download route — **shipped DARK behind
+  `SOURCE_DOWNLOAD_ENABLED` (default false; copyright exposure, user opts in)**; resolves
+  `source_file` under `CORPUS_DIR` via `send_from_directory` (traversal-safe, 404s for
+  web-mirror sources). `DISPLAY_NAMES` extended with both new axes.
+- `app/config.py`: `SOURCE_DOWNLOAD_ENABLED`, `CORPUS_DIR`.
+- Templates: `index.html`/`results.html` have the 2 new dropdowns; cards use display
+  helpers + `(estimat)` markers; `activity.html` is RO-primary with a collapsible
+  "Text original" section, indoor/space cards, estimat markers, and the download link
+  (only when the flag is on). `main.css` has `.estimated` / `.original-text` styles.
+
+**Part B — enrichment pipeline (built, not yet run):**
+- `scripts/build_database.py`: `load_enrichment` + `apply_enrichment(activities, enrichment)`
+  applied **right after** `apply_review_decisions`, on the post-dedup list, keyed on
+  `import_common.content_key(normalized_name, language, _normalize_text(description))`
+  (reused verbatim). CLI `--enrichment` (default `data/enrichment.json`). QA report prints
+  `enrichment {entries, matched, orphaned}` + per-field **stated vs estimated** counts.
+  Translated/expanded text is NOT re-validated against source (by design).
+- `scripts/run_enrichment.py`: reads the rebuilt DB, computes each row's content_key,
+  skips rows already in `data/enrichment_parts/<key>.json` (resumable), emits one prompt
+  per activity to `data/enrichment_prompts/` (current EN fields + source chunk text via
+  `find_chunk_text`). Pilot scoping: `--source <substr>` and/or `--limit N`. `--collect`
+  merges parts → `data/enrichment.json`.
+- `scripts/ENRICHMENT_PROMPT.md`: single-pass rules — translate faithfully, expand
+  `description_ro` ONLY from chunk text, mark inferred filter fields in `estimated_fields`,
+  fixed enum vocab, output `data/enrichment_parts/<content_key>.json` including `content_key`.
+
+## Exact next steps
+
+1. **Re-extract the 7 chunks** above (after session-limit reset). Verify each writes valid
+   JSON (`python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"`).
+   If any come back malformed, `python3 scripts/repair_extractions.py --apply` (faithful now).
+2. **Re-freeze:** `python3 scripts/build_database.py --rebuild` — confirm 0 schema-rejected,
+   note the new total (~9418 + the 7 chunks' activities).
+3. **Enrichment PILOT** (plan B5 — the STOP gate guarding 6–8k LLM calls):
+   - Pick one source, e.g. `python3 scripts/run_enrichment.py --source teambuilding_corbu`
+     (or `--limit 30`). This writes prompts to `data/enrichment_prompts/`.
+   - Launch a small wave of Sonnet subagents on those prompts (each writes
+     `data/enrichment_parts/<key>.json`).
+   - `python3 scripts/run_enrichment.py --collect` → `data/enrichment.json`.
+   - `python3 scripts/build_database.py --rebuild` (picks up `--enrichment` by default).
+   - **STOP. Hand the user translation-quality + estimation-plausibility + description-
+     fidelity samples and get sign-off BEFORE scaling to the full corpus.** Do not
+     auto-proceed past this gate.
+4. After sign-off: scale enrichment in waves of ~8–16 Sonnet subagents, `--collect`,
+   final `--rebuild --enrichment`.
+
+## Verify / run
+
+- Tests: `python3 -m pytest tests/ -q` → 99 pass.
+- App: `SOURCE_DOWNLOAD_ENABLED` is false by default (download link hidden). Set it true
+  only if the user accepts the copyright exposure of serving original files.
+- `data/activities.db.bak` is the pre-this-freeze backup.
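The json_repair incident in `HANDOFF.md` contrasts truncating repair with a char-scanner that escapes stray quotes. A minimal sketch of that idea follows. It is not the `escape_stray_quotes` in `scripts/repair_extractions.py`, which is not shown here: the function name and the closing-quote heuristic (a `"` inside a string closes it only when the next non-whitespace character is `,`, `}`, `]`, `:` or end of input) are assumptions for illustration.

```python
import json


def escape_stray_quotes_sketch(text: str) -> str:
    """Escape ASCII quotes that appear inside JSON string values.

    Heuristic (an assumption, see lead-in): inside a string, a quote is
    the real terminator only if followed by optional whitespace and one
    of , } ] : or end of input. Any other quote is rewritten as \\".
    """
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
            i += 1
        elif ch == "\\" and i + 1 < n:
            out.append(text[i:i + 2])  # keep an existing escape pair intact
            i += 2
        elif ch == '"':
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if j == n or text[j] in ",}]:":
                in_string = False
                out.append(ch)
            else:
                out.append('\\"')  # stray quote: escape instead of splitting
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    broken = '{"name": "Jocul „Unu" și doi", "n": 1}'
    print(json.loads(escape_stray_quotes_sketch(broken)))
```

The heuristic has a known blind spot: a stray quote that happens to be followed by a comma or colon is still read as a terminator. That is one reason to pair any such scanner with schema validation and a text-length guard, as the handoff notes describe.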
--- a/app/config.py
+++ b/app/config.py
@@ -22,6 +22,18 @@ class Config:
    # Search settings
    SEARCH_RESULTS_LIMIT = int(os.environ.get('SEARCH_RESULTS_LIMIT', '100'))
    FTS_ENABLED = True
+
+    # Source-file download (plan A6). Shipped DARK by default: serving the
+    # original PDFs/books carries a copyright exposure the user must opt into.
+    # The /source/<id> route 404s entirely while this is false; the UI hides
+    # the download link. Enable with SOURCE_DOWNLOAD_ENABLED=true.
+    SOURCE_DOWNLOAD_ENABLED = (
+        os.environ.get('SOURCE_DOWNLOAD_ENABLED', 'false').lower() == 'true'
+    )
+    # Root of the original corpus. source_file values are relative to this.
+    CORPUS_DIR = os.environ.get('CORPUS_DIR') or str(
+        Path(__file__).parent.parent / 'data' / 'carti-camp-jocuri'
+    )
    
    @staticmethod
    def ensure_directories():
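The two settings added to `app/config.py` parse their environment variables differently, which the sketch below makes explicit. The expressions mirror the diff; the wrapper functions and their `environ` parameter are hypothetical, added only so the behaviour can be exercised without touching the real environment.

```python
import os


def source_download_enabled(environ=None) -> bool:
    """Same expression as Config.SOURCE_DOWNLOAD_ENABLED in the diff.

    Only the literal string 'true' (any letter case) enables the flag.
    """
    env = os.environ if environ is None else environ
    return env.get('SOURCE_DOWNLOAD_ENABLED', 'false').lower() == 'true'


def corpus_dir(default: str, environ=None) -> str:
    """Same `or` fallback as Config.CORPUS_DIR: unset or empty uses the default."""
    env = os.environ if environ is None else environ
    return env.get('CORPUS_DIR') or default


if __name__ == "__main__":
    print(source_download_enabled({'SOURCE_DOWNLOAD_ENABLED': 'True'}))
    print(source_download_enabled({'SOURCE_DOWNLOAD_ENABLED': '1'}))
    print(corpus_dir('data/carti-camp-jocuri', {'CORPUS_DIR': ''}))
```

Values such as `1`, `yes`, or `true` with surrounding whitespace leave the download route dark, which is the conservative failure mode for a flag guarding copyright exposure.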
--- a/app/config_taxonomy.py
+++ b/app/config_taxonomy.py
@@ -0,0 +1,313 @@
+"""
+Controlled category taxonomy for game-library.
+
+Single source of truth for activity categories. The DB stores the *slug*;
+the UI displays the Romanian name. `category` (thematic domain) and
+`content_type` (form of the content) are INDEPENDENT axes — see plan §2.
+"""
+
+import unicodedata
+import re
+from typing import Dict, List, Optional
+
+# --- Categories (thematic domain) --------------------------------------------
+# slug -> Romanian display name. ~16 fixed slugs; `altele` is the mandatory
+# fallback and MUST always be present.
+CATEGORIES: Dict[str, str] = {
+    "jocuri-cercetasesti": "Jocuri cercetășești",
+    "team-building": "Team-building",
+    "icebreakers": "Icebreakers / spargerea gheții",
+    "camp-outdoor": "Tabără și activități în aer liber",
+    "wide-games": "Wide games / jocuri de teren",
+    "orientare": "Orientare",
+    "prim-ajutor": "Prim ajutor",
+    "escape-room-puzzle": "Escape room și puzzle",
+    "creative-stem": "Creativitate și STEM",
+    "sports-active": "Sport și activități fizice",
+    "cantece-ceremonii": "Cântece și ceremonii",
+    "retete": "Rețete",
+    "supravietuire": "Supraviețuire",
+    "integrare-incluziune": "Integrare și incluziune",
+    "conflict-empatie": "Conflict și empatie",
+    "altele": "Altele",
+}
+
+# Mandatory fallback slug.
+FALLBACK_CATEGORY = "altele"
+
+# Ordered list of valid slugs.
+CATEGORY_SLUGS: List[str] = list(CATEGORIES.keys())
+
+# --- Content type (form of the content) --------------------------------------
+# Independent axis from `category`. The UI default search excludes the
+# non-game content types (see plan §6).
+CONTENT_TYPES: Dict[str, str] = {
+    "joc": "Joc",
+    "activitate": "Activitate",
+    "reteta": "Rețetă",
+    "cantec": "Cântec",
+    "ceremonie": "Ceremonie",
+}
+
+CONTENT_TYPE_SLUGS: List[str] = list(CONTENT_TYPES.keys())
+
+# Content types considered "non-game" — excluded from the default UI search.
+NON_GAME_CONTENT_TYPES: List[str] = ["reteta", "cantec", "ceremonie"]
+
+DEFAULT_CONTENT_TYPE = "activitate"
+
+# --- Aliases -----------------------------------------------------------------
+# Map of normalized arbitrary strings -> canonical slug. Keys are already
+# diacritic-stripped, lowercased and hyphenated (see _slugify). This catches
+# legacy / messy values from the old DB and common English/Romanian variants.
+_CATEGORY_ALIASES: Dict[str, str] = {
+    # legacy junk
+    "general-activity": "altele",
+    "general": "altele",
+    "educational": "creative-stem",
+    "d": "altele",
+    "a": "altele",
+    "b": "altele",
+    "c": "altele",
+    # scouting
+    "cercetasie": "jocuri-cercetasesti",
+    "cercetasesti": "jocuri-cercetasesti",
+    "scout": "jocuri-cercetasesti",
+    "scouting": "jocuri-cercetasesti",
+    "scout-games": "jocuri-cercetasesti",
+    "jocuri-cercetasesti": "jocuri-cercetasesti",
+    # team building
+    "teambuilding": "team-building",
+    "team": "team-building",
+    "cooperare": "team-building",
+    # icebreakers
+    "icebreaker": "icebreakers",
+    "spargerea-ghetii": "icebreakers",
+    "cunoastere": "icebreakers",
+    "energizers": "icebreakers",
+    "energizer": "icebreakers",
+    # camp / outdoor
+    "camp": "camp-outdoor",
+    "tabara": "camp-outdoor",
+    "outdoor": "camp-outdoor",
+    "aer-liber": "camp-outdoor",
+    # wide games
+    "wide-game": "wide-games",
+    "jocuri-de-teren": "wide-games",
+    "joc-de-teren": "wide-games",
+    "big-games": "wide-games",
+    # orientare
+    "orienteering": "orientare",
+    "navigatie": "orientare",
+    # prim ajutor
+    "first-aid": "prim-ajutor",
+    "primul-ajutor": "prim-ajutor",
+    # escape room / puzzle
+    "escape-room": "escape-room-puzzle",
+    "escaperoom": "escape-room-puzzle",
+    "puzzle": "escape-room-puzzle",
+    "puzzles": "escape-room-puzzle",
+    "ghicitori": "escape-room-puzzle",
+    # creative / stem
+    "creative": "creative-stem",
+    "creativitate": "creative-stem",
+    "stem": "creative-stem",
+    "arts-and-crafts": "creative-stem",
+    "craft": "creative-stem",
+    "crafts": "creative-stem",
+    "stiinta": "creative-stem",
+    # sports
+    "sport": "sports-active",
+    "sports": "sports-active",
+    "sportive": "sports-active",
+    "active": "sports-active",
+    "miscare": "sports-active",
+    "physical": "sports-active",
+    # songs / ceremonies
+    "cantece": "cantece-ceremonii",
+    "cantec": "cantece-ceremonii",
+    "songs": "cantece-ceremonii",
+    "ceremonii": "cantece-ceremonii",
+    "ceremonie": "cantece-ceremonii",
+    "ceremony": "cantece-ceremonii",
+    # recipes
+    "reteta": "retete",
+    "recipe": "retete",
+    "recipes": "retete",
+    "cooking": "retete",
+    "gatit": "retete",
+    # survival
+    "survival": "supravietuire",
+    "supravietuire": "supravietuire",
+    # inclusion
+    "integrare": "integrare-incluziune",
+    "incluziune": "integrare-incluziune",
+    "inclusion": "integrare-incluziune",
+    # conflict / empathy
+    "conflict": "conflict-empatie",
+    "empatie": "conflict-empatie",
+    "empathy": "conflict-empatie",
+    "rezolvarea-conflictelor": "conflict-empatie",
+    # fallback
+    "altele": "altele",
+    "other": "altele",
+    "others": "altele",
+    "misc": "altele",
+}
+
+
+def _slugify(value: str) -> str:
+    """Lowercase, strip diacritics, collapse non-alphanumerics to hyphens."""
+    if not value:
+        return ""
+    # Decompose accents (ă -> a, ș -> s, ț -> t, etc.)
+    decomposed = unicodedata.normalize("NFKD", value)
+    ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
+    ascii_str = ascii_str.lower().strip()
+    ascii_str = re.sub(r"[^a-z0-9]+", "-", ascii_str)
+    return ascii_str.strip("-")
+
+
+def normalize_category(value: str) -> str:
+    """Map an arbitrary string to a valid category slug.
+
+    Returns one of CATEGORY_SLUGS, falling back to `altele` for anything
+    unrecognised or empty.
+    """
+    if not value:
+        return FALLBACK_CATEGORY
+    slug = _slugify(str(value))
+    if not slug:
+        return FALLBACK_CATEGORY
+    # Exact slug match.
+    if slug in CATEGORIES:
+        return slug
+    # Alias match.
+    if slug in _CATEGORY_ALIASES:
+        return _CATEGORY_ALIASES[slug]
+    return FALLBACK_CATEGORY
+
+
+def normalize_content_type(value: str) -> str:
+    """Map an arbitrary string to a valid content_type slug.
+
+    Returns one of CONTENT_TYPE_SLUGS, falling back to `activitate`.
+    """
+    if not value:
+        return DEFAULT_CONTENT_TYPE
+    slug = _slugify(str(value))
+    if slug in CONTENT_TYPES:
+        return slug
+    # Light alias handling for plural / English forms.
+    aliases = {
+        "jocuri": "joc",
+        "game": "joc",
+        "games": "joc",
+        "activitati": "activitate",
+        "activity": "activitate",
+        "retete": "reteta",
+        "recipe": "reteta",
+        "cantece": "cantec",
+        "song": "cantec",
+        "ceremonii": "ceremonie",
+        "ceremony": "ceremonie",
+    }
+    return aliases.get(slug, DEFAULT_CONTENT_TYPE)
+
+
+# --- Indoor / outdoor (enrichment axis) --------------------------------------
+# Where the activity is run. Inferred during enrichment when the source is
+# silent — such inferences are flagged in `estimated_fields`. slug -> RO label.
+INDOOR_OUTDOOR: Dict[str, str] = {
+    "indoor": "Interior",
+    "outdoor": "Exterior",
+    "either": "Interior sau exterior",
+}
+
+# --- Space needed (enrichment axis) ------------------------------------------
+# Rough footprint the activity requires. slug -> RO label.
+SPACE_NEEDED: Dict[str, str] = {
+    "mic": "Spațiu mic",
+    "mediu": "Spațiu mediu",
+    "mare": "Spațiu mare",
+}
+
+# Aliases for robustness against LLM output variation. Keys are _slugify'd.
+_INDOOR_OUTDOOR_ALIASES: Dict[str, str] = {
+    "interior": "indoor",
+    "inside": "indoor",
+    "in": "indoor",
+    "exterior": "outdoor",
+    "outside": "outdoor",
+    "out": "outdoor",
+    "aer-liber": "outdoor",
+    "both": "either",
+    "any": "either",
+    "ambele": "either",
+    "interior-exterior": "either",
+    "indoor-outdoor": "either",
+}
+
+_SPACE_NEEDED_ALIASES: Dict[str, str] = {
+    "small": "mic",
+    "redus": "mic",
+    "putin": "mic",
+    "medium": "mediu",
+    "moderat": "mediu",
+    "large": "mare",
+    "big": "mare",
+    "mult": "mare",
+    "spatiu-mic": "mic",
+    "spatiu-mediu": "mediu",
+    "spatiu-mare": "mare",
+}
+
+
+def normalize_indoor_outdoor(value: str) -> Optional[str]:
+    """Map an arbitrary string to an indoor_outdoor slug, or None.
+
+    Unlike categories, this has NO mandatory fallback: an unrecognised or
+    empty value yields None (field simply absent), so we never fabricate a
+    location the enrichment did not assert.
+    """
+    if not value:
+        return None
+    slug = _slugify(str(value))
+    if slug in INDOOR_OUTDOOR:
+        return slug
+    return _INDOOR_OUTDOOR_ALIASES.get(slug)
+
+
+def normalize_space_needed(value: str) -> Optional[str]:
+    """Map an arbitrary string to a space_needed slug, or None (no fallback)."""
+    if not value:
+        return None
+    slug = _slugify(str(value))
+    if slug in SPACE_NEEDED:
+        return slug
+    return _SPACE_NEEDED_ALIASES.get(slug)
+
+
+def indoor_outdoor_display_name(slug: str) -> str:
+    """RO display name for an indoor_outdoor slug."""
+    return INDOOR_OUTDOOR.get(slug, slug)
+
+
+def space_needed_display_name(slug: str) -> str:
+    """RO display name for a space_needed slug."""
+    return SPACE_NEEDED.get(slug, slug)
+
+
+def is_valid_category(slug: str) -> bool:
+    """True if `slug` is a valid category slug."""
+    return slug in CATEGORIES
+
+
+def category_display_name(slug: str) -> str:
+    """Romanian display name for a slug (fallback to the slug itself)."""
+    return CATEGORIES.get(slug, slug)
+
+
+def content_type_display_name(slug: str) -> str:
+    """Romanian display name for a content_type slug."""
+    return CONTENT_TYPES.get(slug, slug)
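
Review note on the hunk above: the indoor/outdoor and space normalizers return `None` for unrecognised input instead of falling back to a default, unlike `normalize_content_type`. The sketch below illustrates that behaviour in isolation. It is an assumption-laden illustration, not the project's code: `_slugify` here is a hypothetical stand-in (the real helper is not shown in this part of the diff), and the alias table is trimmed to three entries.

```python
import re
import unicodedata
from typing import Dict, Optional


def _slugify(value: str) -> str:
    """HYPOTHETICAL stand-in for the project's _slugify (assumed behaviour:
    strip diacritics, lowercase, join alphanumeric runs with '-')."""
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", "-", stripped.lower()).strip("-")


INDOOR_OUTDOOR: Dict[str, str] = {
    "indoor": "Interior",
    "outdoor": "Exterior",
    "either": "Interior sau exterior",
}

# Trimmed copy of the alias table from the diff; keys are already slugified.
_INDOOR_OUTDOOR_ALIASES: Dict[str, str] = {
    "interior": "indoor",
    "aer-liber": "outdoor",
    "interior-exterior": "either",
}


def normalize_indoor_outdoor(value: Optional[str]) -> Optional[str]:
    """Return a canonical slug, or None when the input is empty or unknown."""
    if not value:
        return None
    slug = _slugify(str(value))
    if slug in INDOOR_OUTDOOR:
        return slug
    # dict.get returns None for unknown slugs, so nothing is fabricated.
    return _INDOOR_OUTDOOR_ALIASES.get(slug)


if __name__ == "__main__":
    for raw in ("Indoor", "Aer liber", "Interior / Exterior", "pe lună", ""):
        print(repr(raw), "->", normalize_indoor_outdoor(raw))
```

Returning `None` keeps the field absent, so the UI never shows a location that the enrichment did not assert.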
--- a/app/models/activity.py
+++ b/app/models/activity.py
@@ -5,6 +5,22 @@ Activity data model for INDEX-SISTEM-JOCURI v2.0
 from dataclasses import dataclass, field
 from typing import List, Optional, Dict, Any
 import json
+import re
+import unicodedata
+
+
+def normalize_name(name: str) -> str:
+    """Diacritic-free, lowercased, whitespace-collapsed form of a name.
+
+    Used as the exact-match key for dedup grouping (see plan §4).
+    """
+    if not name:
+        return ""
+    decomposed = unicodedata.normalize("NFKD", name)
+    ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
+    ascii_str = ascii_str.lower().strip()
+    ascii_str = re.sub(r"\s+", " ", ascii_str)
+    return ascii_str

@dataclass
 class Activity:
@@ -19,10 +35,19 @@ class Activity:
    # Categories
    category: str = ""
    subcategory: Optional[str] = None
-    
+    # content_type is an axis INDEPENDENT of category:
+    # one of joc/activitate/reteta/cantec/ceremonie (see config_taxonomy).
+    content_type: Optional[str] = None
+
    # Source information
    source_file: str = ""
    page_reference: Optional[str] = None
+    # source_files: every source the activity was seen in. A list in memory;
+    # to_dict() JSON-encodes it for the DB. `source_file` (singular) stays the primary
+    # source; build_database (Lane C) accumulates the full list here on dedup-merge.
+    source_files: List[str] = field(default_factory=list)
+    # Short verbatim quote from the source — anti-hallucination anchor.
+    source_excerpt: Optional[str] = None
    
    # Age and participants
    age_group_min: Optional[int] = None
@@ -44,11 +69,41 @@ class Activity:
    keywords: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    popularity_score: int = 0
-    
+
+    # Extraction / language metadata
+    language: Optional[str] = None          # 'ro' / 'en'
+    normalized_name: Optional[str] = None   # dedup key; auto-derived from name
+    extraction_confidence: Optional[str] = None  # 'high' / 'med' / 'low'
+    needs_review: int = 0
+
+    # Enrichment overlay (applied at build time from data/enrichment.json; see
+    # plan Part B). Bilingual: the EN/source text stays in name/description/...
+    # and the Romanian rendering lands in the *_ro twins. Absent fields leave
+    # the underlying DB value untouched.
+    name_ro: Optional[str] = None
+    description_ro: Optional[str] = None
+    rules_ro: Optional[str] = None
+    variations_ro: Optional[str] = None
+    indoor_outdoor: Optional[str] = None     # slug: indoor / outdoor / either
+    space_needed: Optional[str] = None       # slug: mic / mediu / mare
+    # Names of fields whose value was INFERRED by enrichment (source was
+    # silent) rather than stated in the source — surfaced as "(estimat)" in UI.
+    estimated_fields: List[str] = field(default_factory=list)
+
+    # Source provenance for the download route + enrichment keying.
+    source_id: Optional[str] = None          # e.g. "876d1a2d_marcaje_turistice"
+    source_ids: List[str] = field(default_factory=list)  # all source_ids merged
+    chunk_key: Optional[str] = None          # e.g. "<source_id>.part01"
+
    # Database fields
    id: Optional[int] = None
    created_at: Optional[str] = None
    updated_at: Optional[str] = None
+
+    def __post_init__(self):
+        """Derive normalized_name from name when not explicitly provided."""
+        if not self.normalized_name:
+            self.normalized_name = normalize_name(self.name)
    
    def to_dict(self) -> Dict[str, Any]:
        """Convert activity to dictionary for database storage"""
@@ -59,8 +114,11 @@ class Activity:
            'variations': self.variations,
            'category': self.category,
            'subcategory': self.subcategory,
+            'content_type': self.content_type,
            'source_file': self.source_file,
+            'source_files': json.dumps(self.source_files) if self.source_files else None,
            'page_reference': self.page_reference,
+            'source_excerpt': self.source_excerpt,
            'age_group_min': self.age_group_min,
            'age_group_max': self.age_group_max,
            'participants_min': self.participants_min,
@@ -73,7 +131,21 @@ class Activity:
            'difficulty_level': self.difficulty_level,
            'keywords': self.keywords,
            'tags': json.dumps(self.tags) if self.tags else None,
-            'popularity_score': self.popularity_score
+            'popularity_score': self.popularity_score,
+            'language': self.language,
+            'normalized_name': self.normalized_name or normalize_name(self.name),
+            'extraction_confidence': self.extraction_confidence,
+            'needs_review': self.needs_review,
+            'name_ro': self.name_ro,
+            'description_ro': self.description_ro,
+            'rules_ro': self.rules_ro,
+            'variations_ro': self.variations_ro,
+            'indoor_outdoor': self.indoor_outdoor,
+            'space_needed': self.space_needed,
+            'estimated_fields': json.dumps(self.estimated_fields) if self.estimated_fields else None,
+            'source_id': self.source_id,
+            'source_ids': json.dumps(self.source_ids) if self.source_ids else None,
+            'chunk_key': self.chunk_key,
        }
    
    @classmethod
@@ -86,7 +158,30 @@ class Activity:
                tags = json.loads(data['tags'])
            except (json.JSONDecodeError, TypeError):
                tags = []
-        
+
+        # source_files may arrive as a JSON string (DB) or a list (extraction)
+        source_files = data.get('source_files')
+        if isinstance(source_files, str):
+            try:
+                source_files = json.loads(source_files)
+            except (json.JSONDecodeError, TypeError):
+                source_files = []
+        elif source_files is None:
+            source_files = []
+
+        # estimated_fields / source_ids: JSON string (DB) or list (in-memory)
+        def _json_list(value):
+            if isinstance(value, str):
+                try:
+                    parsed = json.loads(value)
+                    return parsed if isinstance(parsed, list) else []
+                except (json.JSONDecodeError, TypeError):
+                    return []
+            return list(value) if value else []
+
+        estimated_fields = _json_list(data.get('estimated_fields'))
+        source_ids = _json_list(data.get('source_ids'))
+
        return cls(
            id=data.get('id'),
            name=data.get('name', ''),
@@ -95,8 +190,11 @@ class Activity:
            variations=data.get('variations'),
            category=data.get('category', ''),
            subcategory=data.get('subcategory'),
+            content_type=data.get('content_type'),
            source_file=data.get('source_file', ''),
+            source_files=source_files,
            page_reference=data.get('page_reference'),
+            source_excerpt=data.get('source_excerpt'),
            age_group_min=data.get('age_group_min'),
            age_group_max=data.get('age_group_max'),
            participants_min=data.get('participants_min'),
@@ -110,6 +208,20 @@ class Activity:
            keywords=data.get('keywords'),
            tags=tags,
            popularity_score=data.get('popularity_score', 0),
+            language=data.get('language'),
+            normalized_name=data.get('normalized_name'),
+            extraction_confidence=data.get('extraction_confidence'),
+            needs_review=data.get('needs_review', 0) or 0,
+            name_ro=data.get('name_ro'),
+            description_ro=data.get('description_ro'),
+            rules_ro=data.get('rules_ro'),
+            variations_ro=data.get('variations_ro'),
+            indoor_outdoor=data.get('indoor_outdoor'),
+            space_needed=data.get('space_needed'),
+            estimated_fields=estimated_fields,
+            source_id=data.get('source_id'),
+            source_ids=source_ids,
+            chunk_key=data.get('chunk_key'),
            created_at=data.get('created_at'),
            updated_at=data.get('updated_at')
        )
@@ -150,4 +262,44 @@ class Activity:
            return self.materials_category
        elif self.materials_list:
            return self.materials_list[:100] + "..." if len(self.materials_list) > 100 else self.materials_list
-        return "nu specificate"
+        return "nu specificate"
+
+    # --- Enrichment / bilingual display helpers ------------------------------
+    def get_display_name(self) -> str:
+        """Romanian name when enriched, else the original."""
+        return self.name_ro or self.name
+
+    def get_display_description(self) -> str:
+        """Romanian description when enriched, else the original."""
+        return self.description_ro or self.description
+
+    def get_display_rules(self) -> Optional[str]:
+        """Romanian rules when enriched, else the original."""
+        return self.rules_ro or self.rules
+
+    def get_display_variations(self) -> Optional[str]:
+        """Romanian variations when enriched, else the original."""
+        return self.variations_ro or self.variations
+
+    def has_translation(self) -> bool:
+        """True if any Romanian enrichment text is present."""
+        return bool(self.name_ro or self.description_ro
+                    or self.rules_ro or self.variations_ro)
+
+    def is_estimated(self, field_name: str) -> bool:
+        """True if `field_name` was inferred by enrichment (source was silent)."""
+        return field_name in (self.estimated_fields or [])
+
+    def get_indoor_outdoor_display(self) -> Optional[str]:
+        """RO label for indoor_outdoor, or None when unset."""
+        if not self.indoor_outdoor:
+            return None
+        from app.config_taxonomy import indoor_outdoor_display_name
+        return indoor_outdoor_display_name(self.indoor_outdoor)
+
+    def get_space_needed_display(self) -> Optional[str]:
+        """RO label for space_needed, or None when unset."""
+        if not self.space_needed:
+            return None
+        from app.config_taxonomy import space_needed_display_name
+        return space_needed_display_name(self.space_needed)
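
Review note on `app/models/activity.py`: two details in this file are easy to check in isolation, the dedup key produced by `normalize_name` and the round-trip of list fields through JSON. The sketch below copies both from the diff into standalone functions. The names `json_list` and `dump_list` are hypothetical: in the diff the first is a nested helper inside `from_dict` and the second is inline in `to_dict`.

```python
import json
import re
import unicodedata
from typing import Any, List, Optional


def normalize_name(name: str) -> str:
    """Same steps as the diff: strip diacritics, lowercase, collapse spaces."""
    if not name:
        return ""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", stripped.lower().strip())


def json_list(value: Any) -> List[Any]:
    """Standalone copy of the nested _json_list helper from from_dict."""
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except (json.JSONDecodeError, TypeError):
            return []
        # Valid JSON that is not a list (e.g. an object) is also rejected.
        return parsed if isinstance(parsed, list) else []
    return list(value) if value else []


def dump_list(items: List[str]) -> Optional[str]:
    """Mirror of to_dict: an empty list is stored as NULL, not as '[]'."""
    return json.dumps(items) if items else None


if __name__ == "__main__":
    print(normalize_name("  Țară   Românească "))
    print(json_list(dump_list(["a.pdf", "b.pdf"])))
    print(json_list(None), json_list("not json"))
```

Because empty lists are written as NULL and read back as `[]`, a round trip through the database preserves the in-memory value for `source_files`, `source_ids`, and `estimated_fields`.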
--- a/app/models/database.py
+++ b/app/models/database.py
@@ -30,6 +30,8 @@ class DatabaseManager:
        """Initialize database with v2.0 schema"""
        with self._get_connection() as conn:
            # Main activities table
+            # NOTE: schema is rebuilt from scratch (plan §6) — no in-place
+            # migration. The old DB is deleted and recreated by build_database.
            conn.execute("""
                CREATE TABLE IF NOT EXISTS activities (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
@@ -39,9 +41,12 @@ class DatabaseManager:
                    variations TEXT,
                    category TEXT NOT NULL,
                    subcategory TEXT,
+                    content_type TEXT,
                    source_file TEXT NOT NULL,
+                    source_files TEXT,
                    page_reference TEXT,
-                    
+                    source_excerpt TEXT,
+
                    -- Structured parameters
                    age_group_min INTEGER,
                    age_group_max INTEGER,
@@ -49,26 +54,47 @@ class DatabaseManager:
                    participants_max INTEGER,
                    duration_min INTEGER,
                    duration_max INTEGER,
-                    
+
                    -- Categories for filtering
                    materials_category TEXT,
                    materials_list TEXT,
                    skills_developed TEXT,
                    difficulty_level TEXT,
-                    
+
                    -- Metadata
                    keywords TEXT,
                    tags TEXT,
                    popularity_score INTEGER DEFAULT 0,
+
+                    -- Extraction / language metadata
+                    language TEXT,
+                    normalized_name TEXT,
+                    extraction_confidence TEXT,
+                    needs_review INTEGER DEFAULT 0,
+
+                    -- Enrichment overlay (bilingual + inferred filters; Part B)
+                    name_ro TEXT,
+                    description_ro TEXT,
+                    rules_ro TEXT,
+                    variations_ro TEXT,
+                    indoor_outdoor TEXT,
+                    space_needed TEXT,
+                    estimated_fields TEXT,
+                    source_id TEXT,
+                    source_ids TEXT,
+                    chunk_key TEXT,
+
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                )
            """)
-            
+
            # FTS5 virtual table for search
            conn.execute("""
                CREATE VIRTUAL TABLE IF NOT EXISTS activities_fts USING fts5(
                    name, description, rules, variations, keywords,
+                    materials_list, skills_developed,
+                    name_ro, description_ro, rules_ro, variations_ro,
                    content='activities',
                    content_rowid='id'
                )
@@ -92,6 +118,9 @@ class DatabaseManager:
                "CREATE INDEX IF NOT EXISTS idx_activities_age ON activities(age_group_min, age_group_max)",
                "CREATE INDEX IF NOT EXISTS idx_activities_participants ON activities(participants_min, participants_max)",
                "CREATE INDEX IF NOT EXISTS idx_activities_duration ON activities(duration_min, duration_max)",
+                "CREATE INDEX IF NOT EXISTS idx_activities_normalized_name ON activities(normalized_name)",
+                "CREATE INDEX IF NOT EXISTS idx_activities_indoor_outdoor ON activities(indoor_outdoor)",
+                "CREATE INDEX IF NOT EXISTS idx_activities_space_needed ON activities(space_needed)",
                "CREATE INDEX IF NOT EXISTS idx_categories_type ON categories(type)"
            ]
            
@@ -102,24 +131,42 @@ class DatabaseManager:
            conn.execute("""
                CREATE TRIGGER IF NOT EXISTS activities_fts_insert AFTER INSERT ON activities
                BEGIN
-                    INSERT INTO activities_fts(rowid, name, description, rules, variations, keywords)
-                    VALUES (new.id, new.name, new.description, new.rules, new.variations, new.keywords);
+                    INSERT INTO activities_fts(rowid, name, description, rules, variations,
+                                               keywords, materials_list, skills_developed,
+                                               name_ro, description_ro, rules_ro, variations_ro)
+                    VALUES (new.id, new.name, new.description, new.rules, new.variations,
+                            new.keywords, new.materials_list, new.skills_developed,
+                            new.name_ro, new.description_ro, new.rules_ro, new.variations_ro);
                END
            """)
-            
+
            conn.execute("""
                CREATE TRIGGER IF NOT EXISTS activities_fts_delete AFTER DELETE ON activities
                BEGIN
-                    DELETE FROM activities_fts WHERE rowid = old.id;
+                    INSERT INTO activities_fts(activities_fts, rowid, name, description, rules,
+                                               variations, keywords, materials_list, skills_developed,
+                                               name_ro, description_ro, rules_ro, variations_ro)
+                    VALUES ('delete', old.id, old.name, old.description, old.rules,
+                            old.variations, old.keywords, old.materials_list, old.skills_developed,
+                            old.name_ro, old.description_ro, old.rules_ro, old.variations_ro);
                END
            """)
-            
+
            conn.execute("""
                CREATE TRIGGER IF NOT EXISTS activities_fts_update AFTER UPDATE ON activities
                BEGIN
-                    DELETE FROM activities_fts WHERE rowid = old.id;
-                    INSERT INTO activities_fts(rowid, name, description, rules, variations, keywords)
-                    VALUES (new.id, new.name, new.description, new.rules, new.variations, new.keywords);
+                    INSERT INTO activities_fts(activities_fts, rowid, name, description, rules,
+                                               variations, keywords, materials_list, skills_developed,
+                                               name_ro, description_ro, rules_ro, variations_ro)
+                    VALUES ('delete', old.id, old.name, old.description, old.rules,
+                            old.variations, old.keywords, old.materials_list, old.skills_developed,
+                            old.name_ro, old.description_ro, old.rules_ro, old.variations_ro);
+                    INSERT INTO activities_fts(rowid, name, description, rules, variations,
+                                               keywords, materials_list, skills_developed,
+                                               name_ro, description_ro, rules_ro, variations_ro)
+                    VALUES (new.id, new.name, new.description, new.rules, new.variations,
+                            new.keywords, new.materials_list, new.skills_developed,
+                            new.name_ro, new.description_ro, new.rules_ro, new.variations_ro);
                END
            """)
            
@@ -179,11 +226,17 @@ class DatabaseManager:
        """Update category usage counts"""
        categories_to_update = [
            ('category', activity.category),
+            ('content_type', activity.content_type),
+            ('language', activity.language),
            ('age_group', activity.get_age_range_display()),
            ('participants', activity.get_participants_display()),
            ('duration', activity.get_duration_display()),
            ('materials', activity.get_materials_display()),
            ('difficulty', activity.difficulty_level),
+            # Enrichment axes — slugs stored as value; UI maps to RO via
+            # DISPLAY_NAMES. Without these the new dropdowns would be empty.
+            ('indoor_outdoor', activity.indoor_outdoor),
+            ('space_needed', activity.space_needed),
        ]
        
        for cat_type, cat_value in categories_to_update:
@@ -210,6 +263,8 @@ class DatabaseManager:
                         duration_max: Optional[int] = None,
                         materials_category: Optional[str] = None,
                         difficulty_level: Optional[str] = None,
+                         indoor_outdoor: Optional[str] = None,
+                         space_needed: Optional[str] = None,
                         limit: int = 100) -> List[Dict[str, Any]]:
        """Enhanced search with FTS5 and filters"""
        
@@ -267,7 +322,15 @@ class DatabaseManager:
            if difficulty_level:
                base_query += " AND difficulty_level = ?"
                params.append(difficulty_level)
-            
+
+            if indoor_outdoor:
+                base_query += " AND indoor_outdoor = ?"
+                params.append(indoor_outdoor)
+
+            if space_needed:
+                base_query += " AND space_needed = ?"
+                params.append(space_needed)
+
            # Add ordering and limit
            query = f"{base_query} {order_clause} LIMIT ?"
            params.append(limit)
@@ -332,8 +395,11 @@ class DatabaseManager:
    def clear_database(self):
        """Clear all data from database"""
        with self._get_connection() as conn:
+            # Deleting from activities fires the delete trigger, which removes
+            # the matching FTS rows. The explicit 'delete-all' command then
+            # guarantees the external-content FTS index is fully cleared.
            conn.execute("DELETE FROM activities")
-            conn.execute("DELETE FROM activities_fts")
+            conn.execute("INSERT INTO activities_fts(activities_fts) VALUES('delete-all')")
            conn.execute("DELETE FROM categories")
            conn.commit()
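
Review note on the trigger changes in `app/models/database.py`: `activities_fts` is an external-content FTS5 table (`content='activities'`), so index entries are removed with the special `'delete'` command and the old column values rather than a plain `DELETE`. The sketch below reproduces the pattern on a two-column schema. It is a reduced illustration under assumptions: the column set is trimmed, and it assumes the local Python `sqlite3` module was built with FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activities (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    name_ro TEXT
);
CREATE VIRTUAL TABLE activities_fts USING fts5(
    name, name_ro, content='activities', content_rowid='id'
);
CREATE TRIGGER activities_fts_insert AFTER INSERT ON activities BEGIN
    INSERT INTO activities_fts(rowid, name, name_ro)
    VALUES (new.id, new.name, new.name_ro);
END;
CREATE TRIGGER activities_fts_delete AFTER DELETE ON activities BEGIN
    INSERT INTO activities_fts(activities_fts, rowid, name, name_ro)
    VALUES ('delete', old.id, old.name, old.name_ro);
END;
CREATE TRIGGER activities_fts_update AFTER UPDATE ON activities BEGIN
    INSERT INTO activities_fts(activities_fts, rowid, name, name_ro)
    VALUES ('delete', old.id, old.name, old.name_ro);
    INSERT INTO activities_fts(rowid, name, name_ro)
    VALUES (new.id, new.name, new.name_ro);
END;
""")


def match(term: str):
    """Return the rowids whose indexed text matches `term`."""
    rows = conn.execute(
        "SELECT rowid FROM activities_fts WHERE activities_fts MATCH ? ORDER BY rowid",
        (term,),
    ).fetchall()
    return [r[0] for r in rows]


conn.execute("INSERT INTO activities (name) VALUES ('Capture the flag')")
after_insert = match("flag")

# Enrichment adds the Romanian twin; the update trigger re-indexes the row.
conn.execute("UPDATE activities SET name_ro = 'Capturarea steagului' WHERE id = 1")
after_update = match("steagului")

# Renaming must drop the old tokens from the index.
conn.execute("UPDATE activities SET name = 'Hide and seek' WHERE id = 1")
stale = match("flag")

# Same sequence as clear_database(): row deletes, then 'delete-all'.
conn.execute("DELETE FROM activities")
conn.execute("INSERT INTO activities_fts(activities_fts) VALUES('delete-all')")
after_clear = match("seek")
```

The `'delete'` command needs the exact values that were indexed, which is why both the delete and update triggers pass every `old.*` column.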
    
--- a/app/services/__init__.py
+++ b/app/services/__init__.py
@@ -2,8 +2,6 @@
 Services for INDEX-SISTEM-JOCURI v2.0
 """

-from .parser import IndexMasterParser
-from .indexer import ActivityIndexer
 from .search import SearchService

-__all__ = ['IndexMasterParser', 'ActivityIndexer', 'SearchService']
+__all__ = ['SearchService']
--- a/app/services/indexer.py
+++ b/app/services/indexer.py
@@ -1,248 +0,0 @@
-"""
-Activity indexer service for INDEX-SISTEM-JOCURI v2.0
-Coordinates parsing and database indexing
-"""
-
-from typing import List, Dict, Any
-from pathlib import Path
-from app.models.database import DatabaseManager
-from app.models.activity import Activity
-from app.services.parser import IndexMasterParser
-import time
-
-class ActivityIndexer:
-    """Service for indexing activities from INDEX_MASTER into database"""
-    
-    def __init__(self, db_manager: DatabaseManager, index_master_path: str):
-        """Initialize indexer with database manager and INDEX_MASTER path"""
-        self.db = db_manager
-        self.parser = IndexMasterParser(index_master_path)
-        self.indexing_stats = {}
-    
-    def index_all_activities(self, clear_existing: bool = False) -> Dict[str, Any]:
-        """Index all activities from INDEX_MASTER into database"""
-        
-        print("🚀 Starting activity indexing process...")
-        start_time = time.time()
-        
-        # Clear existing data if requested
-        if clear_existing:
-            print("🗑️  Clearing existing database...")
-            self.db.clear_database()
-        
-        # Parse activities from INDEX_MASTER
-        print("📖 Parsing INDEX_MASTER file...")
-        activities = self.parser.parse_all_categories()
-        
-        if not activities:
-            print("❌ No activities were parsed!")
-            return {'success': False, 'error': 'No activities parsed'}
-        
-        # Filter valid activities
-        valid_activities = []
-        for activity in activities:
-            if self.parser.validate_activity_completeness(activity):
-                valid_activities.append(activity)
-            else:
-                print(f"⚠️  Skipping incomplete activity: {activity.name[:50]}...")
-        
-        print(f"✅ Validated {len(valid_activities)} activities out of {len(activities)} parsed")
-        
-        if len(valid_activities) < 100:
-            print(f"⚠️  Warning: Only {len(valid_activities)} valid activities found. Expected 500+")
-        
-        # Bulk insert into database
-        print("💾 Inserting activities into database...")
-        try:
-            inserted_count = self.db.bulk_insert_activities(valid_activities)
-            
-            # Rebuild FTS index for optimal search performance
-            print("🔍 Rebuilding search index...")
-            self.db.rebuild_fts_index()
-            
-            end_time = time.time()
-            indexing_time = end_time - start_time
-            
-            # Generate final statistics (with error handling)
-            try:
-                stats = self._generate_indexing_stats(valid_activities, indexing_time)
-                stats['inserted_count'] = inserted_count
-                stats['success'] = True
-            except Exception as e:
-                print(f"⚠️  Error generating statistics: {e}")
-                stats = {
-                    'success': True,
-                    'inserted_count': inserted_count,
-                    'indexing_time_seconds': indexing_time,
-                    'error': f'Stats generation failed: {str(e)}'
-                }
-            
-            print(f"✅ Indexing complete! {inserted_count} activities indexed in {indexing_time:.2f}s")
-            
-            # Verify database state (with error handling)
-            try:
-                db_stats = self.db.get_statistics()
-                print(f"📊 Database now contains {db_stats['total_activities']} activities")
-            except Exception as e:
-                print(f"⚠️  Error getting database statistics: {e}")
-                print(f"📊 Database insertion completed, statistics unavailable")
-            
-            return stats
-            
-        except Exception as e:
-            print(f"❌ Error during database insertion: {e}")
-            return {'success': False, 'error': str(e)}
-    
-    def index_specific_category(self, category_code: str) -> Dict[str, Any]:
-        """Index activities from a specific category only"""
-        
-        print(f"🎯 Indexing specific category: {category_code}")
-        
-        # Load content and parse specific category
-        if not self.parser.load_content():
-            return {'success': False, 'error': 'Could not load INDEX_MASTER'}
-        
-        category_name = self.parser.category_mapping.get(category_code)
-        if not category_name:
-            return {'success': False, 'error': f'Unknown category code: {category_code}'}
-        
-        activities = self.parser.parse_category_section(category_code, category_name)
-        
-        if not activities:
-            return {'success': False, 'error': f'No activities found in category {category_code}'}
-        
-        # Filter valid activities
-        valid_activities = [a for a in activities if self.parser.validate_activity_completeness(a)]
-        
-        try:
-            inserted_count = self.db.bulk_insert_activities(valid_activities)
-            return {
-                'success': True,
-                'category': category_name,
-                'inserted_count': inserted_count,
-                'total_parsed': len(activities),
-                'valid_activities': len(valid_activities)
-            }
-        except Exception as e:
-            return {'success': False, 'error': str(e)}
-    
-    def _generate_indexing_stats(self, activities: List[Activity], indexing_time: float) -> Dict[str, Any]:
-        """Generate comprehensive indexing statistics"""
-        
-        # Get parser statistics
-        parser_stats = self.parser.get_parsing_statistics()
-        
-        # Calculate additional metrics
-        categories = {}
-        age_ranges = {}
-        durations = {}
-        materials = {}
-        
-        for activity in activities:
-            # Category breakdown
-            if activity.category in categories:
-                categories[activity.category] += 1
-            else:
-                categories[activity.category] = 1
-            
-            # Age range analysis (with safety check)
-            try:
-                age_key = activity.get_age_range_display() or "nespecificat"
-                age_ranges[age_key] = age_ranges.get(age_key, 0) + 1
-            except Exception as e:
-                print(f"Warning: Error getting age range for activity {activity.name}: {e}")
-                age_ranges["nespecificat"] = age_ranges.get("nespecificat", 0) + 1
-            
-            # Duration analysis (with safety check)
-            try:
-                duration_key = activity.get_duration_display() or "nespecificat"
-                durations[duration_key] = durations.get(duration_key, 0) + 1
-            except Exception as e:
-                print(f"Warning: Error getting duration for activity {activity.name}: {e}")
-                durations["nespecificat"] = durations.get("nespecificat", 0) + 1
-            
-            # Materials analysis (with safety check)
-            try:
-                materials_key = activity.get_materials_display() or "nespecificat"
-                materials[materials_key] = materials.get(materials_key, 0) + 1
-            except Exception as e:
-                print(f"Warning: Error getting materials for activity {activity.name}: {e}")
-                materials["nespecificat"] = materials.get("nespecificat", 0) + 1
-        
-        return {
-            'indexing_time_seconds': indexing_time,
-            'parsing_stats': parser_stats,
-            'distribution': {
-                'categories': categories,
-                'age_ranges': age_ranges,
-                'durations': durations,
-                'materials': materials
-            },
-            'quality_metrics': {
-                'completion_rate': parser_stats.get('completion_rate', 0),
-                'average_description_length': parser_stats.get('average_description_length', 0),
-                'activities_with_metadata': sum(1 for a in activities if a.age_group_min or a.participants_min or a.duration_min)
-            }
-        }
-    
-    def verify_indexing_quality(self) -> Dict[str, Any]:
-        """Verify the quality of indexed data"""
-        
-        try:
-            # Get database statistics
-            db_stats = self.db.get_statistics()
-            
-            # Check for minimum activity count
-            total_activities = db_stats['total_activities']
-            meets_minimum = total_activities >= 500
-            
-            # Check category distribution
-            categories = db_stats.get('categories', {})
-            category_coverage = len(categories)
-            
-            # Sample some activities to check quality
-            sample_activities = self.db.search_activities(limit=10)
-            
-            quality_issues = []
-            for activity in sample_activities:
-                if not activity.get('description') or len(activity['description']) < 10:
-                    quality_issues.append(f"Activity {activity.get('name', 'Unknown')} has insufficient description")
-                
-                if not activity.get('category'):
-                    quality_issues.append(f"Activity {activity.get('name', 'Unknown')} missing category")
-            
-            return {
-                'total_activities': total_activities,
-                'meets_minimum_requirement': meets_minimum,
-                'minimum_target': 500,
-                'category_coverage': category_coverage,
-                'expected_categories': len(self.parser.category_mapping),
-                'quality_issues': quality_issues,
-                'quality_score': max(0, 100 - len(quality_issues) * 10),
-                'database_stats': db_stats
-            }
-            
-        except Exception as e:
-            return {'error': str(e), 'quality_score': 0}
-    
-    def get_indexing_progress(self) -> Dict[str, Any]:
-        """Get current indexing progress and status"""
-        try:
-            db_stats = self.db.get_statistics()
-            
-            # Calculate progress towards 500+ activities goal
-            total_activities = db_stats['total_activities']
-            target_activities = 500
-            progress_percentage = min(100, (total_activities / target_activities) * 100)
-            
-            return {
-                'current_activities': total_activities,
-                'target_activities': target_activities,
-                'progress_percentage': progress_percentage,
-                'status': 'completed' if total_activities >= target_activities else 'in_progress',
-                'categories_indexed': list(db_stats.get('categories', {}).keys()),
-                'database_size_mb': db_stats.get('database_size_bytes', 0) / (1024 * 1024)
-            }
-            
-        except Exception as e:
-            return {'error': str(e), 'status': 'error'}
--- a/app/services/parser.py
+++ b/app/services/parser.py
@@ -1,340 +0,0 @@
-"""
-Advanced parser for INDEX_MASTER_JOCURI_ACTIVITATI.md
-Extracts 500+ individual activities with full details
-"""
-
-import re
-from pathlib import Path
-from typing import List, Dict, Optional, Tuple
-from app.models.activity import Activity
-
-class IndexMasterParser:
-    """Advanced parser for extracting real activities from INDEX_MASTER"""
-    
-    def __init__(self, index_file_path: str):
-        """Initialize parser with INDEX_MASTER file path"""
-        self.index_file_path = Path(index_file_path)
-        self.content = ""
-        self.activities = []
-        
-        # Category mapping for main sections (exact match from file)
-        self.category_mapping = {
-            '[A]': 'JOCURI CERCETĂȘEȘTI ȘI SCOUT',
-            '[B]': 'TEAM BUILDING ȘI COMUNICARE',
-            '[C]': 'CAMPING ȘI ACTIVITĂȚI EXTERIOR', 
-            '[D]': 'ESCAPE ROOM ȘI PUZZLE-URI',
-            '[E]': 'ORIENTARE ȘI BUSOLE',
-            '[F]': 'PRIMUL AJUTOR ȘI SIGURANȚA',
-            '[G]': 'ACTIVITĂȚI EDUCAȚIONALE',
-            '[H]': 'RESURSE SPECIALE'
-        }
-    
-    def load_content(self) -> bool:
-        """Load and validate INDEX_MASTER content"""
-        try:
-            if not self.index_file_path.exists():
-                print(f"❌ INDEX_MASTER file not found: {self.index_file_path}")
-                return False
-            
-            with open(self.index_file_path, 'r', encoding='utf-8') as f:
-                self.content = f.read()
-            
-            if len(self.content) < 1000:  # Sanity check
-                print(f"⚠️  INDEX_MASTER file seems too small: {len(self.content)} chars")
-                return False
-            
-            print(f"✅ Loaded INDEX_MASTER: {len(self.content)} characters")
-            return True
-            
-        except Exception as e:
-            print(f"❌ Error loading INDEX_MASTER: {e}")
-            return False
-    
-    def parse_all_categories(self) -> List[Activity]:
-        """Parse all categories and extract individual activities"""
-        if not self.load_content():
-            return []
-        
-        print("🔍 Starting comprehensive parsing of INDEX_MASTER...")
-        
-        # Parse each main category
-        for category_code, category_name in self.category_mapping.items():
-            print(f"\n📂 Processing category {category_code}: {category_name}")
-            category_activities = self.parse_category_section(category_code, category_name)
-            self.activities.extend(category_activities)
-            print(f"   ✅ Extracted {len(category_activities)} activities")
-        
-        print(f"\n🎯 Total activities extracted: {len(self.activities)}")
-        return self.activities
-    
-    def parse_category_section(self, category_code: str, category_name: str) -> List[Activity]:
-        """Parse a specific category section"""
-        activities = []
-        
-        # Find the category section - exact pattern match
-        # Look for the actual section, not the table of contents
-        pattern = rf"^## {re.escape(category_code)} {re.escape(category_name)}\s*$"
-        matches = list(re.finditer(pattern, self.content, re.MULTILINE | re.IGNORECASE))
-        
-        if not matches:
-            print(f"   ⚠️  Category section not found: {category_code}")
-            return activities
-        
-        # Take the last match (should be the actual section, not TOC)
-        match = matches[-1]
-        print(f"   📍 Found section at position {match.start()}")
-        
-        # Extract content until next main category or end
-        start_pos = match.end()
-        
-        # Find next main category (look for complete header)
-        next_category_pattern = r"^## \[[A-H]\] [A-ZĂÂÎȘȚ]"
-        next_match = re.search(next_category_pattern, self.content[start_pos:], re.MULTILINE)
-        
-        if next_match:
-            end_pos = start_pos + next_match.start()
-            section_content = self.content[start_pos:end_pos]
-        else:
-            section_content = self.content[start_pos:]
-        
-        # Parse subsections within the category
-        activities.extend(self._parse_subsections(section_content, category_name))
-        
-        return activities
-    
-    def _parse_subsections(self, section_content: str, category_name: str) -> List[Activity]:
-        """Parse subsections within a category"""
-        activities = []
-        
-        # Find all subsections (### markers)
-        subsection_pattern = r"^### (.+?)$"
-        subsections = re.finditer(subsection_pattern, section_content, re.MULTILINE)
-        
-        subsection_list = list(subsections)
-        
-        for i, subsection in enumerate(subsection_list):
-            subsection_title = subsection.group(1).strip()
-            subsection_start = subsection.end()
-            
-            # Find end of subsection
-            if i + 1 < len(subsection_list):
-                subsection_end = subsection_list[i + 1].start()
-            else:
-                subsection_end = len(section_content)
-            
-            subsection_text = section_content[subsection_start:subsection_end]
-            
-            # Parse individual games in this subsection
-            subsection_activities = self._parse_games_in_subsection(
-                subsection_text, category_name, subsection_title
-            )
-            activities.extend(subsection_activities)
-        
-        return activities
-    
-    def _parse_games_in_subsection(self, subsection_text: str, category_name: str, subsection_title: str) -> List[Activity]:
-        """Parse individual games within a subsection"""
-        activities = []
-        
-        # Look for "Exemple de jocuri:" sections
-        examples_pattern = r"\*\*Exemple de jocuri:\*\*\s*\n(.*?)(?=\n\*\*|$)"
-        examples_matches = re.finditer(examples_pattern, subsection_text, re.DOTALL)
-        
-        for examples_match in examples_matches:
-            examples_text = examples_match.group(1)
-            
-            # Extract individual games (numbered list)
-            game_pattern = r"^(\d+)\.\s*\*\*(.+?)\*\*\s*-\s*(.+?)$"
-            games = re.finditer(game_pattern, examples_text, re.MULTILINE)
-            
-            for game_match in games:
-                game_number = game_match.group(1)
-                game_name = game_match.group(2).strip()
-                game_description = game_match.group(3).strip()
-                
-                # Extract metadata from subsection
-                metadata = self._extract_subsection_metadata(subsection_text)
-                
-                # Create activity
-                activity = Activity(
-                    name=game_name,
-                    description=game_description,
-                    category=category_name,
-                    subcategory=subsection_title,
-                    source_file=f"INDEX_MASTER_JOCURI_ACTIVITATI.md",
-                    page_reference=f"{category_name} > {subsection_title} > #{game_number}",
-                    **metadata
-                )
-                
-                activities.append(activity)
-        
-        # Also extract from direct activity descriptions without "Exemple de jocuri"
-        activities.extend(self._parse_direct_activities(subsection_text, category_name, subsection_title))
-        
-        return activities
-    
-    def _extract_subsection_metadata(self, subsection_text: str) -> Dict:
-        """Extract metadata from subsection text"""
-        metadata = {}
-        
-        # Extract participants info
-        participants_pattern = r"\*\*Participanți:\*\*\s*(.+?)(?:\n|\*\*)"
-        participants_match = re.search(participants_pattern, subsection_text)
-        if participants_match:
-            participants_text = participants_match.group(1).strip()
-            participants = self._parse_participants(participants_text)
-            metadata.update(participants)
-        
-        # Extract duration
-        duration_pattern = r"\*\*Durata:\*\*\s*(.+?)(?:\n|\*\*)"
-        duration_match = re.search(duration_pattern, subsection_text)
-        if duration_match:
-            duration_text = duration_match.group(1).strip()
-            duration = self._parse_duration(duration_text)
-            metadata.update(duration)
-        
-        # Extract materials
-        materials_pattern = r"\*\*Materiale:\*\*\s*(.+?)(?:\n|\*\*)"
-        materials_match = re.search(materials_pattern, subsection_text)
-        if materials_match:
-            materials_text = materials_match.group(1).strip()
-            metadata['materials_list'] = materials_text
-            metadata['materials_category'] = self._categorize_materials(materials_text)
-        
-        # Extract keywords
-        keywords_pattern = r"\*\*Cuvinte cheie:\*\*\s*(.+?)(?:\n|\*\*)"
-        keywords_match = re.search(keywords_pattern, subsection_text)
-        if keywords_match:
-            metadata['keywords'] = keywords_match.group(1).strip()
-        
-        return metadata
-    
-    def _parse_participants(self, participants_text: str) -> Dict:
-        """Parse participants information"""
-        result = {}
-        
-        # Look for number ranges like "8-30 copii" or "5-15 persoane"
-        range_pattern = r"(\d+)-(\d+)"
-        range_match = re.search(range_pattern, participants_text)
-        
-        if range_match:
-            result['participants_min'] = int(range_match.group(1))
-            result['participants_max'] = int(range_match.group(2))
-        else:
-            # Look for single numbers
-            number_pattern = r"(\d+)\+"
-            number_match = re.search(number_pattern, participants_text)
-            if number_match:
-                result['participants_min'] = int(number_match.group(1))
-        
-        # Extract age information
-        age_pattern = r"(\d+)-(\d+)\s*ani"
-        age_match = re.search(age_pattern, participants_text)
-        if age_match:
-            result['age_group_min'] = int(age_match.group(1))
-            result['age_group_max'] = int(age_match.group(2))
-        
-        return result
-    
-    def _parse_duration(self, duration_text: str) -> Dict:
-        """Parse duration information"""
-        result = {}
-        
-        # Look for time ranges like "5-20 minute" or "15-30min"
-        range_pattern = r"(\d+)-(\d+)\s*(?:minute|min)"
-        range_match = re.search(range_pattern, duration_text)
-        
-        if range_match:
-            result['duration_min'] = int(range_match.group(1))
-            result['duration_max'] = int(range_match.group(2))
-        else:
-            # Look for single duration
-            single_pattern = r"(\d+)\+?\s*(?:minute|min)"
-            single_match = re.search(single_pattern, duration_text)
-            if single_match:
-                result['duration_min'] = int(single_match.group(1))
-        
-        return result
-    
-    def _categorize_materials(self, materials_text: str) -> str:
-        """Categorize materials into simple categories"""
-        materials_lower = materials_text.lower()
-        
-        if any(word in materials_lower for word in ['fără', 'nu necesare', 'nimic', 'minime']):
-            return 'Fără materiale'
-        elif any(word in materials_lower for word in ['hârtie', 'creion', 'marker', 'simple']):
-            return 'Materiale simple'
-        elif any(word in materials_lower for word in ['computer', 'proiector', 'echipament', 'complexe']):
-            return 'Materiale complexe'
-        else:
-            return 'Materiale variate'
-    
-    def _parse_direct_activities(self, subsection_text: str, category_name: str, subsection_title: str) -> List[Activity]:
-        """Parse activities that are described directly without 'Exemple de jocuri' section"""
-        activities = []
-        
-        # Look for activity descriptions in sections that don't have "Exemple de jocuri"
-        if "**Exemple de jocuri:**" not in subsection_text:
-            # Try to extract from file descriptions
-            file_pattern = r"\*\*Fișier:\*\*\s*`([^`]+)`.*?\*\*(.+?)\*\*"
-            file_matches = re.finditer(file_pattern, subsection_text, re.DOTALL)
-            
-            for file_match in file_matches:
-                file_name = file_match.group(1)
-                description_part = file_match.group(2)
-                
-                # Create a general activity for this file
-                activity = Activity(
-                    name=f"Activități din {file_name}",
-                    description=f"Colecție de activități din fișierul {file_name}. {description_part[:200]}...",
-                    category=category_name,
-                    subcategory=subsection_title,
-                    source_file=file_name,
-                    page_reference=f"{category_name} > {subsection_title}",
-                    **self._extract_subsection_metadata(subsection_text)
-                )
-                
-                activities.append(activity)
-        
-        return activities
-    
-    def validate_activity_completeness(self, activity: Activity) -> bool:
-        """Validate that an activity has all necessary fields"""
-        required_fields = ['name', 'description', 'category', 'source_file']
-        
-        for field in required_fields:
-            if not getattr(activity, field) or not getattr(activity, field).strip():
-                return False
-        
-        # Check minimum description length
-        if len(activity.description) < 10:
-            return False
-        
-        return True
-    
-    def get_parsing_statistics(self) -> Dict:
-        """Get statistics about the parsing process"""
-        if not self.activities:
-            return {'total_activities': 0}
-        
-        category_counts = {}
-        valid_activities = 0
-        
-        for activity in self.activities:
-            # Count by category
-            if activity.category in category_counts:
-                category_counts[activity.category] += 1
-            else:
-                category_counts[activity.category] = 1
-            
-            # Count valid activities
-            if self.validate_activity_completeness(activity):
-                valid_activities += 1
-        
-        return {
-            'total_activities': len(self.activities),
-            'valid_activities': valid_activities,
-            'completion_rate': (valid_activities / len(self.activities)) * 100 if self.activities else 0,
-            'category_breakdown': category_counts,
-            'average_description_length': sum(len(a.description) for a in self.activities) / len(self.activities) if self.activities else 0
-        }
--- a/app/services/search.py
+++ b/app/services/search.py
@@ -5,8 +5,19 @@ Enhanced search with FTS5 and intelligent filtering

 from typing import List, Dict, Any, Optional
 from app.models.database import DatabaseManager
+from app.config_taxonomy import NON_GAME_CONTENT_TYPES
 import re

+# Category slugs that are themselves "non-game" — selecting one of these as a
+# category filter also lifts the default non-game content_type exclusion.
+NON_GAME_CATEGORIES = {"retete", "cantece-ceremonii"}
+
+# When a Python-side post-filter is active the DB LIMIT is applied *before*
+# filtering, so we over-fetch to still satisfy the caller's `limit`.
+_OVERSCAN_FACTOR = 5
+_OVERSCAN_CAP = 2000
+
+
 class SearchService:
    """Enhanced search service with intelligent query processing"""
    
@@ -24,22 +35,72 @@ class SearchService:
        
        if filters is None:
            filters = {}
-        
+
        # Process and normalize search text
        processed_search = self._process_search_text(search_text)
-        
+
        # Map web filters to database fields
        db_filters = self._map_filters_to_db_fields(filters)
-        
+
+        # content_type and language are filtered in Python: the DB layer does
+        # not expose them as query parameters. The DEFAULT search excludes the
+        # non-game content types (rețete / cântece / ceremonii) — they surface
+        # only when the user explicitly filters that content_type, or picks a
+        # non-game category. See plan §6.
+        content_type, exclude_non_game = self._resolve_content_type_filter(filters)
+        language = (filters.get('language') or '').strip().lower() or None
+        post_filtering = bool(content_type or exclude_non_game or language)
+
+        # Over-fetch when post-filtering (capped, but never below `limit`) so the final list can still reach `limit`.
+        fetch_limit = max(limit, min(limit * _OVERSCAN_FACTOR, _OVERSCAN_CAP)) if post_filtering else limit
+
        # Perform database search
        results = self.db.search_activities(
            search_text=processed_search,
            **db_filters,
-            limit=limit
+            limit=fetch_limit
        )
-        
-        # Post-process results for relevance and ranking
-        return self._post_process_results(results, processed_search, filters)
+
+        # Apply content_type / language post-filters
+        results = self._apply_content_type_filter(results, content_type, exclude_non_game)
+        if language:
+            results = [r for r in results
+                       if (r.get('language') or '').strip().lower() == language]
+
+        # Post-process results for relevance and ranking, then honour `limit`
+        results = self._post_process_results(results, processed_search, filters)
+        return results[:limit]
+
+    def _resolve_content_type_filter(self, filters: Dict[str, str]):
+        """Determine the content_type post-filter.
+
+        Returns (explicit_content_type | None, exclude_non_game: bool):
+        - an explicit `content_type` filter → that value, no exclusion;
+        - a `category` filter on a non-game category → no exclusion;
+        - otherwise → default search, exclude non-game content types.
+        """
+        content_type = (filters.get('content_type') or '').strip()
+        if content_type:
+            return content_type, False
+        category = (filters.get('category') or '').strip()
+        if category in NON_GAME_CATEGORIES:
+            return None, False
+        return None, True
+
+    def _apply_content_type_filter(self,
+                                   results: List[Dict[str, Any]],
+                                   content_type: Optional[str],
+                                   exclude_non_game: bool) -> List[Dict[str, Any]]:
+        """Filter results by content_type (explicit include vs default exclude)."""
+        if content_type:
+            return [r for r in results
+                    if (r.get('content_type') or '') == content_type]
+        if exclude_non_game:
+            # Rows with NULL/unknown content_type are kept — only the known
+            # non-game types are dropped from the default search.
+            return [r for r in results
+                    if (r.get('content_type') or '') not in NON_GAME_CONTENT_TYPES]
+        return results
    
    def _process_search_text(self, search_text: Optional[str]) -> Optional[str]:
        """Process and enhance search text for better FTS5 results"""
@@ -83,10 +144,16 @@ class SearchService:
            if not filter_value or not filter_value.strip():
                continue
            
+            # content_type / language are NOT database query params — they are
+            # applied as Python post-filters in search_activities(). Skip them
+            # here so they never reach DatabaseManager.search_activities().
+            if filter_key in ('content_type', 'language'):
+                continue
+
            # Map filter types to database fields
            if filter_key == 'category':
                db_filters['category'] = filter_value
-            
+
            elif filter_key == 'age_group':
                # Parse age range (e.g., "5-8 ani", "12+ ani")
                age_match = re.search(r'(\d+)(?:-(\d+))?\s*ani?', filter_value)
@@ -133,7 +200,14 @@ class SearchService:
            
            elif filter_key == 'difficulty':
                db_filters['difficulty_level'] = filter_value
-            
+
+            elif filter_key == 'indoor_outdoor':
+                # Equality filter on the slug column (mirror difficulty).
+                db_filters['indoor_outdoor'] = filter_value
+
+            elif filter_key == 'space_needed':
+                db_filters['space_needed'] = filter_value
+
            # Handle any other custom filters
            else:
                # Generic filter handling - try to match against keywords or tags
@@ -177,21 +251,22 @@ class SearchService:
            boost_score = 0
            
            # Check name matches (highest priority)
-            name_lower = result.get('name', '').lower()
+            # NB: use `or ''` — nullable columns come back as None, not ''.
+            name_lower = (result.get('name') or '').lower()
            for term in search_terms:
                if term in name_lower:
                    boost_score += 10
                    if name_lower.startswith(term):
                        boost_score += 5  # Extra boost for name starts with term
-            
+
            # Check description matches
-            desc_lower = result.get('description', '').lower()
+            desc_lower = (result.get('description') or '').lower()
            for term in search_terms:
                if term in desc_lower:
                    boost_score += 3
-            
+
            # Check keywords matches
-            keywords_lower = result.get('keywords', '').lower()
+            keywords_lower = (result.get('keywords') or '').lower()
            for term in search_terms:
                if term in keywords_lower:
                    boost_score += 5
@@ -280,11 +355,14 @@ class SearchService:
            return []
        
        try:
-            # Search for activities that match the partial query
+            # Search for activities that match the partial query.
+            # Over-fetch then drop non-game content types so autocomplete
+            # mirrors the default search (no rețete / cântece / ceremonii).
            results = self.db.search_activities(
                search_text=f'"{partial_query}"',
-                limit=limit * 2
+                limit=limit * 6
            )
+            results = self._apply_content_type_filter(results, None, True)
            
            suggestions = []
            seen = set()
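
Reviewer sketch (not part of the patch): the over-fetch and post-filter flow introduced in app/services/search.py above can be exercised in isolation roughly as follows. The content-type slugs in `NON_GAME_CONTENT_TYPES`, the helper names `fetch_limit_for` and `filter_rows`, and the sample rows are hypothetical stand-ins; the real set lives in `app.config_taxonomy` and is not shown in this diff. Clamping the fetch size so it never drops below `limit` is an assumption of this sketch.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical values for illustration only; the real set is imported
# from app.config_taxonomy and is not visible in this diff.
NON_GAME_CONTENT_TYPES = {"reteta", "cantec", "ceremonie"}
NON_GAME_CATEGORIES = {"retete", "cantece-ceremonii"}
OVERSCAN_FACTOR = 5
OVERSCAN_CAP = 2000


def resolve_content_type_filter(filters: Dict[str, str]) -> Tuple[Optional[str], bool]:
    """Return (explicit content_type or None, exclude_non_game)."""
    content_type = (filters.get("content_type") or "").strip()
    if content_type:
        return content_type, False  # explicit filter: include only that type
    if (filters.get("category") or "").strip() in NON_GAME_CATEGORIES:
        return None, False  # non-game category lifts the default exclusion
    return None, True  # default search hides non-game content


def fetch_limit_for(limit: int, post_filtering: bool) -> int:
    """Over-fetch when rows will be dropped in Python, never below `limit`."""
    if not post_filtering:
        return limit
    return max(limit, min(limit * OVERSCAN_FACTOR, OVERSCAN_CAP))


def filter_rows(rows: List[Dict[str, Any]],
                content_type: Optional[str],
                exclude_non_game: bool) -> List[Dict[str, Any]]:
    """Apply the explicit-include or default-exclude content_type rule."""
    if content_type:
        return [r for r in rows if (r.get("content_type") or "") == content_type]
    if exclude_non_game:
        # NULL/unknown content_type rows are kept; only known non-game types drop.
        return [r for r in rows
                if (r.get("content_type") or "") not in NON_GAME_CONTENT_TYPES]
    return rows


rows = [
    {"name": "A", "content_type": "joc"},
    {"name": "B", "content_type": "reteta"},
    {"name": "C", "content_type": None},
]
content_type, exclude = resolve_content_type_filter({})
kept = filter_rows(rows, content_type, exclude)
print([r["name"] for r in kept])
```

The trade-off is the usual one for filtering after a SQL LIMIT: a larger fetch costs more per query, and a heavily non-game result set can still return fewer than `limit` rows once the cap is reached.
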
--- a/app/static/css/main.css
+++ b/app/static/css/main.css
@@ -705,4 +705,30 @@ body {
        box-shadow: none;
        border: 1px solid #ddd;
    }
+}
+
+/* Enrichment markers (plan Part A7) */
+.estimated {
+    color: #8a6d3b;
+    font-style: italic;
+    font-size: 0.85em;
+    font-weight: normal;
+}
+
+.original-text > summary {
+    cursor: pointer;
+    color: #555;
+    user-select: none;
+}
+
+.original-text .original-content {
+    margin-top: 0.75rem;
+    padding-left: 1rem;
+    border-left: 3px solid #e0e0e0;
+    color: #555;
+}
+
+.download-hint {
+    color: #888;
+    font-size: 0.85em;
 }
--- a/app/templates/activity.html
+++ b/app/templates/activity.html
@@ -8,14 +8,20 @@
    <nav class="breadcrumb">
        <a href="{{ url_for('main.index') }}">Căutare</a>
        <span class="breadcrumb-separator">»</span>
-        <span class="breadcrumb-current">{{ activity.name }}</span>
+        <span class="breadcrumb-current">{{ activity.get_display_name() }}</span>
    </nav>

    <!-- Activity header -->
    <header class="activity-detail-header">
        <div class="activity-title-section">
-            <h1 class="activity-detail-title">{{ activity.name }}</h1>
-            <span class="activity-category-badge">{{ activity.category }}</span>
+            <h1 class="activity-detail-title">{{ activity.get_display_name() }}</h1>
+            <span class="activity-category-badge">{{ display_names.get(activity.category, activity.category) }}</span>
+            {% if activity.content_type %}
+            <span class="activity-content-type-badge">{{ display_names.get(activity.content_type, activity.content_type) }}</span>
+            {% endif %}
+            {% if activity.needs_review %}
+            <span class="activity-badge needs-review" title="Această activitate necesită verificare">⚠ De verificat</span>
+            {% endif %}
        </div>
        
        {% if activity.subcategory %}
@@ -25,27 +31,46 @@

    <!-- Activity content -->
    <div class="activity-detail-content">
-        <!-- Main description -->
+        <!-- Main description (Romanian-primary, falls back to original) -->
        <section class="activity-section">
            <h2 class="section-title">Descriere</h2>
-            <div class="activity-description">{{ activity.description }}</div>
+            <div class="activity-description">{{ activity.get_display_description() }}</div>
        </section>

        <!-- Rules and variations -->
-        {% if activity.rules %}
+        {% if activity.get_display_rules() %}
        <section class="activity-section">
            <h2 class="section-title">Reguli</h2>
-            <div class="activity-rules">{{ activity.rules }}</div>
+            <div class="activity-rules">{{ activity.get_display_rules() }}</div>
        </section>
        {% endif %}

-        {% if activity.variations %}
+        {% if activity.get_display_variations() %}
        <section class="activity-section">
            <h2 class="section-title">Variații</h2>
-            <div class="activity-variations">{{ activity.variations }}</div>
+            <div class="activity-variations">{{ activity.get_display_variations() }}</div>
        </section>
        {% endif %}

+        <!-- Original (pre-translation) text, collapsed by default -->
+        {% if activity.has_translation() %}
+        <details class="activity-section original-text">
+            <summary class="section-title">Text original ({{ display_names.get(activity.language, activity.language or 'sursă') }})</summary>
+            <div class="original-content">
+                <h3 class="metadata-title">{{ activity.name }}</h3>
+                <div class="activity-description">{{ activity.description }}</div>
+                {% if activity.rules %}
+                <h4 class="metadata-title">Reguli</h4>
+                <div class="activity-rules">{{ activity.rules }}</div>
+                {% endif %}
+                {% if activity.variations %}
+                <h4 class="metadata-title">Variații</h4>
+                <div class="activity-variations">{{ activity.variations }}</div>
+                {% endif %}
+            </div>
+        </details>
+        {% endif %}
+
        <!-- Metadata grid -->
        <section class="activity-section">
            <h2 class="section-title">Detalii activitate</h2>
@@ -53,21 +78,35 @@
                {% if activity.get_age_range_display() != "toate vârstele" %}
                <div class="metadata-card">
                    <h3 class="metadata-title">Grupa de vârstă</h3>
-                    <p class="metadata-value">{{ activity.get_age_range_display() }}</p>
+                    <p class="metadata-value">{{ activity.get_age_range_display() }}{% if activity.is_estimated('age_group_min') or activity.is_estimated('age_group_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
                </div>
                {% endif %}

                {% if activity.get_participants_display() != "orice număr" %}
                <div class="metadata-card">
                    <h3 class="metadata-title">Participanți</h3>
-                    <p class="metadata-value">{{ activity.get_participants_display() }}</p>
+                    <p class="metadata-value">{{ activity.get_participants_display() }}{% if activity.is_estimated('participants_min') or activity.is_estimated('participants_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
                </div>
                {% endif %}

                {% if activity.get_duration_display() != "durată variabilă" %}
                <div class="metadata-card">
                    <h3 class="metadata-title">Durata</h3>
-                    <p class="metadata-value">{{ activity.get_duration_display() }}</p>
+                    <p class="metadata-value">{{ activity.get_duration_display() }}{% if activity.is_estimated('duration_min') or activity.is_estimated('duration_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
+                </div>
+                {% endif %}
+
+                {% if activity.get_indoor_outdoor_display() %}
+                <div class="metadata-card">
+                    <h3 class="metadata-title">Interior / exterior</h3>
+                    <p class="metadata-value">{{ activity.get_indoor_outdoor_display() }}{% if activity.is_estimated('indoor_outdoor') %} <em class="estimated">(estimat)</em>{% endif %}</p>
+                </div>
+                {% endif %}
+
+                {% if activity.get_space_needed_display() %}
+                <div class="metadata-card">
+                    <h3 class="metadata-title">Spațiu necesar</h3>
+                    <p class="metadata-value">{{ activity.get_space_needed_display() }}{% if activity.is_estimated('space_needed') %} <em class="estimated">(estimat)</em>{% endif %}</p>
                </div>
                {% endif %}

@@ -119,9 +158,15 @@
            <h2 class="section-title">Informații sursă</h2>
            <div class="source-info">
                {% if activity.source_file %}
+                {% if config.SOURCE_DOWNLOAD_ENABLED %}
+                <p><strong>Fișier sursă:</strong>
+                   <a href="{{ url_for('main.source_download', activity_id=activity.id) }}">{{ activity.source_file }}</a>
+                   <span class="download-hint">(descarcă)</span></p>
+                {% else %}
                <p><strong>Fișier sursă:</strong> {{ activity.source_file }}</p>
                {% endif %}
-                
+                {% endif %}
+
                {% if activity.page_reference %}
                <p><strong>Referință:</strong> {{ activity.page_reference }}</p>
                {% endif %}
--- a/app/templates/index.html
+++ b/app/templates/index.html
@@ -36,7 +36,31 @@
                    <select name="category" id="category" class="filter-select">
                        <option value="">Toate categoriile</option>
                        {% for category in filters.category %}
-                        <option value="{{ category }}">{{ category }}</option>
+                        <option value="{{ category }}">{{ display_names.get(category, category) }}</option>
+                        {% endfor %}
+                    </select>
+                </div>
+                {% endif %}
+
+                {% if filters.content_type %}
+                <div class="filter-group">
+                    <label for="content_type" class="filter-label">Tip conținut</label>
+                    <select name="content_type" id="content_type" class="filter-select">
+                        <option value="">Doar jocuri și activități</option>
+                        {% for content_type in filters.content_type %}
+                        <option value="{{ content_type }}">{{ display_names.get(content_type, content_type) }}</option>
+                        {% endfor %}
+                    </select>
+                </div>
+                {% endif %}
+
+                {% if filters.language %}
+                <div class="filter-group">
+                    <label for="language" class="filter-label">Limbă</label>
+                    <select name="language" id="language" class="filter-select">
+                        <option value="">Toate limbile</option>
+                        {% for language in filters.language %}
+                        <option value="{{ language }}">{{ display_names.get(language, language) }}</option>
                        {% endfor %}
                    </select>
                </div>
@@ -101,6 +125,30 @@
                    </select>
                </div>
                {% endif %}
+
+                {% if filters.indoor_outdoor %}
+                <div class="filter-group">
+                    <label for="indoor_outdoor" class="filter-label">Interior / exterior</label>
+                    <select name="indoor_outdoor" id="indoor_outdoor" class="filter-select">
+                        <option value="">Oriunde</option>
+                        {% for io in filters.indoor_outdoor %}
+                        <option value="{{ io }}">{{ display_names.get(io, io) }}</option>
+                        {% endfor %}
+                    </select>
+                </div>
+                {% endif %}
+
+                {% if filters.space_needed %}
+                <div class="filter-group">
+                    <label for="space_needed" class="filter-label">Spațiu necesar</label>
+                    <select name="space_needed" id="space_needed" class="filter-select">
+                        <option value="">Orice spațiu</option>
+                        {% for sp in filters.space_needed %}
+                        <option value="{{ sp }}">{{ display_names.get(sp, sp) }}</option>
+                        {% endfor %}
+                    </select>
+                </div>
+                {% endif %}
            {% endif %}
        </div>

--- a/app/templates/results.html
+++ b/app/templates/results.html
@@ -24,7 +24,29 @@
                <option value="">Toate categoriile</option>
                {% for category in filters.category %}
                <option value="{{ category }}" {% if applied_filters.category == category %}selected{% endif %}>
-                    {{ category }}
+                    {{ display_names.get(category, category) }}
+                </option>
+                {% endfor %}
+            </select>
+            {% endif %}
+
+            {% if filters.content_type %}
+            <select name="content_type" class="filter-select compact">
+                <option value="">Doar jocuri și activități</option>
+                {% for content_type in filters.content_type %}
+                <option value="{{ content_type }}" {% if applied_filters.content_type == content_type %}selected{% endif %}>
+                    {{ display_names.get(content_type, content_type) }}
+                </option>
+                {% endfor %}
+            </select>
+            {% endif %}
+
+            {% if filters.language %}
+            <select name="language" class="filter-select compact">
+                <option value="">Toate limbile</option>
+                {% for language in filters.language %}
+                <option value="{{ language }}" {% if applied_filters.language == language %}selected{% endif %}>
+                    {{ display_names.get(language, language) }}
                </option>
                {% endfor %}
            </select>
@@ -63,6 +85,28 @@
            </select>
            {% endif %}

+            {% if filters.indoor_outdoor %}
+            <select name="indoor_outdoor" class="filter-select compact">
+                <option value="">Oriunde</option>
+                {% for io in filters.indoor_outdoor %}
+                <option value="{{ io }}" {% if applied_filters.indoor_outdoor == io %}selected{% endif %}>
+                    {{ display_names.get(io, io) }}
+                </option>
+                {% endfor %}
+            </select>
+            {% endif %}
+
+            {% if filters.space_needed %}
+            <select name="space_needed" class="filter-select compact">
+                <option value="">Orice spațiu</option>
+                {% for sp in filters.space_needed %}
+                <option value="{{ sp }}" {% if applied_filters.space_needed == sp %}selected{% endif %}>
+                    {{ display_names.get(sp, sp) }}
+                </option>
+                {% endfor %}
+            </select>
+            {% endif %}
+
            <button type="button" class="btn btn-secondary btn-sm" onclick="clearFilters()">
                Resetează
            </button>
@@ -106,31 +150,46 @@
            <header class="activity-header">
                <h3 class="activity-title">
                    <a href="{{ url_for('main.activity_detail', activity_id=activity.id) }}">
-                        {{ activity.name }}
+                        {{ activity.get_display_name() }}
                    </a>
                </h3>
-                <span class="activity-category">{{ activity.category }}</span>
+                <span class="activity-category">{{ display_names.get(activity.category, activity.category) }}</span>
+                {% if activity.needs_review %}
+                <span class="activity-badge needs-review" title="Această activitate necesită verificare">⚠ De verificat</span>
+                {% endif %}
            </header>

            <div class="activity-content">
-                <p class="activity-description">{{ activity.description }}</p>
-                
+                <p class="activity-description">{{ activity.get_display_description() }}</p>
+
                <div class="activity-metadata">
                    {% if activity.get_age_range_display() != "toate vârstele" %}
                    <span class="metadata-item">
-                        <strong>Vârsta:</strong> {{ activity.get_age_range_display() }}
+                        <strong>Vârsta:</strong> {{ activity.get_age_range_display() }}{% if activity.is_estimated('age_group_min') or activity.is_estimated('age_group_max') %} <em class="estimated">(estimat)</em>{% endif %}
                    </span>
                    {% endif %}

                    {% if activity.get_participants_display() != "orice număr" %}
                    <span class="metadata-item">
-                        <strong>Participanți:</strong> {{ activity.get_participants_display() }}
+                        <strong>Participanți:</strong> {{ activity.get_participants_display() }}{% if activity.is_estimated('participants_min') or activity.is_estimated('participants_max') %} <em class="estimated">(estimat)</em>{% endif %}
                    </span>
                    {% endif %}

                    {% if activity.get_duration_display() != "durată variabilă" %}
                    <span class="metadata-item">
-                        <strong>Durata:</strong> {{ activity.get_duration_display() }}
+                        <strong>Durata:</strong> {{ activity.get_duration_display() }}{% if activity.is_estimated('duration_min') or activity.is_estimated('duration_max') %} <em class="estimated">(estimat)</em>{% endif %}
+                    </span>
+                    {% endif %}
+
+                    {% if activity.get_indoor_outdoor_display() %}
+                    <span class="metadata-item">
+                        <strong>Loc:</strong> {{ activity.get_indoor_outdoor_display() }}{% if activity.is_estimated('indoor_outdoor') %} <em class="estimated">(estimat)</em>{% endif %}
+                    </span>
+                    {% endif %}
+
+                    {% if activity.get_space_needed_display() %}
+                    <span class="metadata-item">
+                        <strong>Spațiu:</strong> {{ activity.get_space_needed_display() }}{% if activity.is_estimated('space_needed') %} <em class="estimated">(estimat)</em>{% endif %}
                    </span>
                    {% endif %}

@@ -143,7 +202,11 @@

                {% if activity.source_file %}
                <div class="activity-source">
+                    {% if config.SOURCE_DOWNLOAD_ENABLED %}
+                    <small>Sursă: <a href="{{ url_for('main.source_download', activity_id=activity.id) }}">{{ activity.source_file }}</a></small>
+                    {% else %}
                    <small>Sursă: {{ activity.source_file }}</small>
+                    {% endif %}
                </div>
                {% endif %}
            </div>
--- a/app/web/routes.py
+++ b/app/web/routes.py
@@ -3,15 +3,28 @@ Flask routes for INDEX-SISTEM-JOCURI v2.0
 Clean, minimalist web interface with dynamic filters
 """

-from flask import Blueprint, request, render_template, jsonify, current_app
+from flask import (
+    Blueprint, request, render_template, jsonify, current_app,
+    send_from_directory,
+)
 from app.models.database import DatabaseManager
 from app.models.activity import Activity
 from app.services.search import SearchService
-import os
-from pathlib import Path
+from app.config_taxonomy import (
+    CATEGORIES, CONTENT_TYPES, INDOOR_OUTDOOR, SPACE_NEEDED,
+)

 bp = Blueprint('main', __name__)

+# Slug -> Romanian display name. Category, content_type, indoor_outdoor and
+# space_needed slugs never collide, so a single flat map (with the language
+# codes merged in) is enough for the UI filter labels.
+LANGUAGE_NAMES = {'ro': 'Română', 'en': 'Engleză'}
+DISPLAY_NAMES = {
+    **CATEGORIES, **CONTENT_TYPES, **INDOOR_OUTDOOR, **SPACE_NEEDED,
+    **LANGUAGE_NAMES,
+}
+
 # Initialize database manager (will be configured in application factory)
 def get_db_manager():
    """Get database manager instance"""
@@ -36,15 +49,17 @@ def index():
        # Get database statistics for the interface
        stats = db.get_statistics()
        
-        return render_template('index.html', 
+        return render_template('index.html',
                             filters=filter_options,
+                             display_names=DISPLAY_NAMES,
                             stats=stats)
-    
+
    except Exception as e:
        print(f"Error loading main page: {e}")
        # Fallback with empty filters
-        return render_template('index.html', 
+        return render_template('index.html',
                             filters={},
+                             display_names=DISPLAY_NAMES,
                             stats={'total_activities': 0})

@bp.route('/search', methods=['GET', 'POST'])
@@ -82,8 +97,9 @@ def search():
                             search_query=search_query,
                             applied_filters=filters,
                             filters=filter_options,
+                             display_names=DISPLAY_NAMES,
                             results_count=len(activities))
-    
+
    except Exception as e:
        print(f"Search error: {e}")
        return render_template('results.html',
@@ -91,6 +107,7 @@ def search():
                             search_query='',
                             applied_filters={},
                             filters={},
+                             display_names=DISPLAY_NAMES,
                             results_count=0,
                             error=str(e))

@@ -121,12 +138,51 @@ def activity_detail(activity_id):
        
        return render_template('activity.html',
                             activity=activity,
+                             display_names=DISPLAY_NAMES,
                             similar_activities=similar_activities)
    
    except Exception as e:
        print(f"Error loading activity {activity_id}: {e}")
        return render_template('404.html'), 404

+@bp.route('/source/<int:activity_id>')
+def source_download(activity_id):
+    """Download the original source file for an activity (plan A6).
+
+    Shipped DARK: returns 404 unless SOURCE_DOWNLOAD_ENABLED is set (copyright
+    exposure — the user opts in). Resolves the activity's `source_file` under
+    CORPUS_DIR. send_from_directory does the safe-join and blocks traversal;
+    web-mirror / extension-less sources that are not real files 404 gracefully.
+    """
+    if not current_app.config.get('SOURCE_DOWNLOAD_ENABLED', False):
+        return render_template('404.html'), 404
+    try:
+        db = get_db_manager()
+        activity_data = db.get_activity_by_id(activity_id)
+        if not activity_data:
+            return render_template('404.html'), 404
+
+        source_file = (activity_data.get('source_file') or '').strip()
+        if not source_file:
+            return render_template('404.html'), 404
+
+        corpus_dir = current_app.config.get('CORPUS_DIR')
+        if not corpus_dir:
+            return render_template('404.html'), 404
+        try:
+            # send_from_directory rejects path traversal and missing files with
+            # a 404 (NotFound) — no manual safe_join needed.
+            return send_from_directory(
+                corpus_dir, source_file, as_attachment=True
+            )
+        except Exception:
+            # Missing file / web-mirror source with no on-disk original.
+            return render_template('404.html'), 404
+    except Exception as e:
+        print(f"Source download error for {activity_id}: {e}")
+        return render_template('404.html'), 404
+
+
@bp.route('/health')
 def health_check():
    """Health check endpoint for Docker"""
--- a/data/activities.db
+++ b/data/activities.db
--- a/data/review_decisions.json
+++ b/data/review_decisions.json
@@ -0,0 +1,379 @@
+{
+  "96560c4dee400911e01ec5bf6f2460f2b6008bdb": "merge",
+  "7b93e6fbec6dab9c43862f4104d2f0e83784ccca": "merge",
+  "2765d81e6572977b411f9521871a7a3e557cab6b": "merge",
+  "bdd723ba8575ec94e6ed9e036ea62512ef643b40": "merge",
+  "78f0c433f6da30bb922ef1f8637a747410ac3528": "merge",
+  "0d3a5f2519d69e04558431d5990de8b1fb59a9bc": "merge",
+  "5abca033e2bd9f49a7b65cad2e357c7ea7e2236d": "merge",
+  "379999f9a603dfe4cb2316687a2266e17001ea8a": "merge",
+  "90a16119eadd2daa1ecb8b897c20875e54c5934a": "merge",
+  "d75b4131ece700309957aa17dae01f1feb9b3c35": "merge",
+  "0d0ccd256e8a02e45128b9dcc196b49ce6b5ec0d": "merge",
+  "db8afcd1ae6e0525dce1403caccf417de1423c68": "merge",
+  "ec157c89b9c3ff0691fc92b9ca0aa586f93c4880": "merge",
+  "67ce4ec87bcc33aab599a1c93c930e93c7762ae4": "merge",
+  "952cbbbd28453b93bae97de9edc473bb9525bac9": "merge",
+  "7a1acb62bb01bcd568fd67699da2523b293ce8a6": "merge",
+  "043c6dcc5a4d80f43ad9308e454e4ec287321a26": "merge",
+  "d10374d745aeb6abe1c0cc827baf98b311933196": "merge",
+  "bdce7856cee7f98e187b35b3ad806782e738f007": "merge",
+  "58103e94a6020d75e2d79fc1877b08b15e2a3d67": "merge",
+  "dec54f2d3015f73eea7558fb52d2107fc43b7937": "merge",
+  "d4b1e9dca9f3bdd4302f311ee96e6b39e96d5b1f": "merge",
+  "93b16e984bc1b98de7ec9403df61522e53ee26e6": "merge",
+  "6202ef279e70c0850369575f8dd9a6a338006a82": "merge",
+  "02b83749ab358c1c591f134d4d252a72843fb443": "merge",
+  "903af17ce988c40ab64c855a8b1f8e989d36883d": "merge",
+  "e0d6a73c57fcd151bbc4d7b7727aa8d6e1a3301c": "merge",
+  "63583f4d78dfe919ed195910a917d8cfe52ebe09": "merge",
+  "37306f964eaa9a123e40aaa7b642520ce3be0349": "merge",
+  "e355dd178be6d4589c227a047f2083e55806c4ae": "merge",
+  "3d99eecf4ad2e034ad9cac724cfb9f5d57bfd45f": "merge",
+  "e7c893a786fba566ea0cd5c8ee2a27e280560fd1": "merge",
+  "cf7a7b6db84a7070e9495f6ff2970d8860621885": "merge",
+  "a7157217540041a575cd3909a8bd7fe9345932ef": "merge",
+  "15f50b3eb8baeac6cf58df3938ddcda2ae0c3224": "merge",
+  "3c977df90998d069d7bba6a11d2bc7a891f5b45d": "merge",
+  "7b4361f2e3d9f017d78513bebfb0f2aa7243f987": "merge",
+  "d386aa70c890c3538b5f6b201e135f7ea9613ae9": "merge",
+  "1da5be5acb36f222a67999de864cbcbcebe8d876": "merge",
+  "7bf9319e4c0322747fd2e35971d7f7932e835eef": "merge",
+  "29d7060cf880d7db900d6fb64cf2cb3874794b88": "merge",
+  "9832cc0a6625f24354d6dfee3527086d042dd52b": "merge",
+  "12ea5a54fb5f123441d8fae1451f6cdd02f572bb": "merge",
+  "0638d8c1c81fe5c52ee29823be9f0e162f9aed12": "merge",
+  "a2d1962c3c3f1f471f33a093291119d67890d26e": "merge",
+  "ca7437967bbe393da5f5f463e359711929d6ba00": "merge",
+  "6637ee7fd0ae519f87e4374ac6d8d3df12a0c0e2": "merge",
+  "b7a8104fa06e2e2491f5699eefa5dbc8ffb26209": "merge",
+  "21bf2790f28f654d0ccd38d99e8bfdd9876f2d10": "merge",
+  "1e6a13a18d750a8b20457f71bfe64950a4c50f22": "merge",
+  "299a31f4187c38a9ff84773d4ed1a3eccfe6904c": "merge",
+  "a013365c84b79fac919ea70a2c2820881bf05e82": "merge",
+  "dfdc05810d2e84b93a2edd64cf8138aeb9a6ccf3": "merge",
+  "cf88b32fe2c2c09907d1813a1fefe9b17754d9fb": "merge",
+  "210ba3d41c7424428365e7ed33cf86c9819f1467": "merge",
+  "a458a6f96eccddd5b36392f38ef5c7dd16a238cd": "merge",
+  "e4198708e6d69d69ce8c7cc14d5c1e4757944808": "merge",
+  "6e3c6b282675e787c8726c0d4cc536357f6b1438": "merge",
+  "6370503a78ecaacf6e8738fe3419d2e4a4b280ec": "merge",
+  "5e6bbf4ea08efd5187954c5bb40d67a0f4ec223c": "merge",
+  "d49a48f1dbf890d9c802c0c7e0e7e38a8b28645f": "merge",
+  "668ca80cde843b5bcfc1e3c3599b114893481c6b": "merge",
+  "2d85fa7ad793b0fec8066b1b45ae51da7f09fae1": "merge",
+  "1416073acad77bef155224b1f8005f778554aada": "merge",
+  "ddb18e2eeb058b1dbbcc1afd444256a82fccc346": "merge",
+  "f1f5fb5b2f73f4035098d5af23da8ca2f20f54ca": "merge",
+  "d3f4d626827e6798675b55a4785c2829ef93ef69": "merge",
+  "8e83a64a6aa030d6d02d3d736c3fffe8249033dd": "merge",
+  "d93a71716835cd1f8856abe093580f1e6aeab023": "merge",
+  "98cc404cdd51987d3a6d15e88269c2978286e80a": "merge",
+  "4d31104a3cf321dcd831e354a6d2556cffeafbd0": "merge",
+  "b0e6d245954401f9a766addafb2dde59b5bf5fe9": "merge",
+  "722f70854b1aff75d059410ebfdcf99fa15fe828": "merge",
+  "3c58885138b0e9da95fcb0ed7906339cd104e168": "merge",
+  "daa05e337e1fc8f02c4679a1da21dd74deafda0c": "merge",
+  "f5ab25d06273404abfef1a55a1288f518d668638": "merge",
+  "7484f446f8c8e59e6f248a465ea74b60f87102a0": "merge",
+  "d072ac1c088c2fccf40536706d6433ecac0e865a": "merge",
+  "9c8c02b5aa8e2ca3fcd6a91b46883beeadfa77a1": "merge",
+  "81fed4a423c79dee6a8c7a3b82b4873e0003597c": "merge",
+  "de6d50103958f692181a2c8a7d8aff51a9e3c6a3": "merge",
+  "d1efa3c2cf23ee5cf845716a94af6e0c9ee2e3e0": "merge",
+  "d762147d9a4bd258a874ac9d899e712fa08e3865": "merge",
+  "1b28f43ef78eb3031c172f8c1a7c62834b902945": "merge",
+  "28dc9dde42db1dbf1fe826219b0cc2bb5b203304": "merge",
+  "3dc8037f2ed01e025c39dae2900ad40b5ccbd546": "merge",
+  "80d95a9a86341093da7a2266225f1e8b016bfdc1": "merge",
+  "7ec81e486f1da7c383c20837f06d162241ccf720": "merge",
+  "f653b01b650d420eeb1a581e3d2ce3823b7000f2": "merge",
+  "45b0184b1c666135ef903680935ccca6a638ee8a": "merge",
+  "9562ba2777d2ed8b3c590beac7697d2ea731a2f9": "merge",
+  "37d803fce5601c7f4e924368b07a74cd03df7317": "merge",
+  "9a92c8f1ef4120ad82993b095dad3b3dbd1c2e5f": "merge",
+  "4265795c137dd7229c4a593beab00d8401225afd": "merge",
+  "d308fbe7842da27ee2e3204f700afd9364dde327": "merge",
+  "3aaa23c16b88060510826b497c6433f010db7bb0": "merge",
+  "d827b842f11c0752ffe8cddb6c372b8deab01924": "merge",
+  "0232a13fd5589c4e1cbfe23c13a8b7b863b665eb": "merge",
+  "1b7d6c88a61344bb2033d5de64d913acf2d527f9": "merge",
+  "d9d8104bd70c4c03b6841ccbc07d078b9bc22c81": "merge",
+  "ad2251d37f8911b33e2b5227d2e7acb0533fb10a": "merge",
+  "7b1ec96916a6b5cac294e05ffc46c42a821076dc": "merge",
+  "7c29838be6491258e5e3d169d931412adc8d2f01": "merge",
+  "c050c15f2d3e0a2cf8d1145d3a9050d29d2f540f": "merge",
+  "3d6731ad69adec25662c431c5d42c7d4a1934c89": "merge",
+  "d07cb8c2e9d0c4a192b3966698e6671d21ea982b": "merge",
+  "218e30c5d34937563d5d328e51d983a33611a1f5": "merge",
+  "4547df7094cd3f8791adc07996d3c3264d1b8ff4": "merge",
+  "7de994f358426f4c3e48b54196e4c0c6c71d60c4": "merge",
+  "60473b807e6a9c940dacc726d3270efe8b6a0c5b": "merge",
+  "7db9d39d8d2baf52c6e586b95472699f42c9c98f": "merge",
+  "2409751dda5ae660a28b230694e0bc2bd2207133": "merge",
+  "b525c8c36de926f6b891d613df9835b06bc6be59": "merge",
+  "533ba4716e2ed1154e3ce6323aa899ea148abdf5": "merge",
+  "b815d67ecd6ac3a64a6f792ffddc63a462754100": "merge",
+  "6d0c3759ee029bb6d3fcd693dca9753f5186ecf3": "merge",
+  "50ae80ee932281e9f4424c71f4d2607ce842737b": "merge",
+  "7507208a9ca4f83a3634904810bdcd6a1633058f": "merge",
+  "8614f97398b7b50fb50ae8405f8695f2960489d9": "merge",
+  "9d8779922c44bc4a805447283b8b037569131b18": "merge",
+  "965d619a9809c4f3831e308cf6214ddcebb9f835": "merge",
+  "7f99acd1fef4a139c96911773d188f5be94ddbb5": "merge",
+  "c5478f32f290c116b8a801736d2832a4ae1fafa5": "merge",
+  "d589902ef6b25634da0fd89c00e3bbb44e8d868b": "merge",
+  "9b3887e6d292ba2d3642689538231fb1a4d1dc08": "merge",
+  "a2571504a7e62563aadb34799f40882afcd56ec8": "merge",
+  "29f4b1f18f23495691ca5d6bc33150f36612a16f": "merge",
+  "be8529c39e659ba00ba30ea4ebbb05510ceaa684": "merge",
+  "3113a3f8ab0795d3dfcbf4b0dc64f5887bfcfeb6": "merge",
+  "53e4ecab7d2fc23efa1f75a93b7da1fbe46fb38a": "merge",
+  "aec818b52ddc3de410cb56692e95493a0661ac40": "merge",
+  "19cd0d05e21206203498eda94b4ba8b777d360f2": "merge",
+  "b596ae2fa7fd7dd059095b989f8477fc0023ef2a": "merge",
+  "68e69967ef4f7437f21373b52f77a3701f61054d": "merge",
+  "d72598ee9abdfd3d5c07b1e6f3e9fa11424b569f": "merge",
+  "d420186d189d783fde3d29908ade3221eac109df": "merge",
+  "7a3c7465e4221511994a2b2ebf6aa5c0b3353b0c": "merge",
+  "c19d94c259c4bb55cda1198b96c8e723ca48904b": "merge",
+  "f7222a7fa8b9930b0aef8e377e820423bd891b7a": "merge",
+  "aeda2c5a5aa5c0ae8b33f72f81a76e0617b45474": "merge",
+  "3c75d378d76756f56f2c5827d0adeaa6c0c88e04": "merge",
+  "9bd53fc3c7b669d84cfe1f9778791ca32c1df621": "merge",
+  "e065420daad8abd3e6908381775c9ab5aeb804c0": "merge",
+  "107bd809c7ca5d24b1ea2d0e59843b6386be43ad": "merge",
+  "32457099f22bfff1d9ab4f24c87768c893e295c6": "merge",
+  "965038421cb0d65fc937cbf4f99b487732807fe0": "merge",
+  "7b0e5e6a8e49a9f3594cdf17559cfff52c1a329a": "merge",
+  "fa88682c3eab4a5fed4bb64f5a9b5733018d12f8": "merge",
+  "4cddf1c9b6424b9d66cf6adf2e631dda8a92b88f": "merge",
+  "0733103d3ec92e1bc7a05f7a6be53b9ea19ea1c9": "merge",
+  "30d28ba19a6ece36d32b54ca30c0a71fa4b04dcf": "merge",
+  "91ca8e066bfd41514ed775e2cbe485d4d74ee85c": "merge",
+  "8b952bbf8e45e128e8a5b224a3c3da7d166327b1": "merge",
+  "5ae4f3199d76f8805e38174425f70fd7bb99296c": "merge",
+  "3ac614a52102ecd7448d4ebd47641264b3feba99": "merge",
+  "e25061e58c6d66b88e89635d670d5360ebf2bc6d": "merge",
+  "277434fe9d2179b91d2cfbd660a16f95ba590031": "merge",
+  "8ddcb8e0885cd61847206cad017e1d65b002e798": "merge",
+  "bcda860319366bab0690195d2d9665f72492e19e": "merge",
+  "d9ed88849aa59a5ba31798854a25843815cc2d3a": "merge",
+  "eae46f64b96fff8db16751cf271698a54336d7d3": "merge",
+  "cfd5e3814f129e59c011acce3ef97d3a95b21465": "merge",
+  "c7929c37f1367d52bd529e671e2c951dbb80f618": "merge",
+  "b5137e14a8914e9007fc78b48b9c8ac5060d3885": "merge",
+  "38373f03cc19db9363c9be6dd1df60f8c0344a3a": "merge",
+  "8e2784d1d0abf50c83d280dc3c50d51bcc8bb947": "merge",
+  "9322e84a111ea8d6aacae5998d742b1ef7689870": "merge",
+  "2ab3a81c5d889f7ed58d72a4c42e54728955a3b1": "merge",
+  "3c87b2d1c3df5d3cfe920e992805d55f840b023f": "merge",
+  "3f14169b7cce95214f874f5dc7acf214a1576e09": "merge",
+  "bfbd7c57a96b5a66970d28f23ec65e71c3d0af63": "merge",
+  "56841bf251e2631d31accd7b44ad1223ef55116e": "merge",
+  "e4a907fb74246d5af8201df32ab1db0ffcf10582": "merge",
+  "af373139d280bc4b36b31a3e917a442d441f76c4": "merge",
+  "6d69ce5b4cb5dae80f7fc6fdd74092624d7e629b": "merge",
+  "5d93d92c1ed5ec1b9eea859a8d10dd5542f6bf0a": "merge",
+  "0d3655929d5a20efade4fdebdbe1e51a857aa4a8": "merge",
+  "c6f5a99b9970977b87527e68a79856957266ec85": "merge",
+  "ac8798c7b1bda482f86770a1152b33e1eb09936f": "merge",
+  "da588a2ebac1e9fce8a00635be058aa08a534006": "merge",
+  "b1b5f3e3195a835e110c12826b0f68aa6c5dfda8": "merge",
+  "2cbfda9b32acb9e3353046193185ae459b6baade": "merge",
+  "4530e7f505b85a4bc418f8e204fd2dce3e4bcd51": "merge",
+  "d0f306635d8df199e336616980ff8367f23a86c6": "merge",
+  "043651de409eaa56f0f5f02ccb36acb4f9379c7c": "merge",
+  "78febd51e41d71f721454190ffe4defd3be64cf2": "merge",
+  "878234eaaa6a82da2a07cbddf7f7a9312b7aee40": "merge",
+  "d2e283086b32df1b74da9ff6bac014867f3e02b9": "merge",
+  "8494ded48880d67c8857860f485a428c85ddb17c": "merge",
+  "d99eab2896ee0c15d96ce9008387f1caae4992d5": "merge",
+  "996d708cc4d0a7532f24889ed8fa33565f1252db": "merge",
+  "bea98344fcc400338f45fa7aea05c1c716ff2909": "merge",
+  "7ed550a014b20ddb6d406ae470fc564501fcdd2c": "merge",
+  "783f22275c9c238507aa5903ec8b0d99e3a4a348": "merge",
+  "d32c5eea6b7e682bc998ae968b835436784c8501": "merge",
+  "e1aae6922f87193a5d0c8efdeabfd55e80ba13c1": "merge",
+  "9aa988ee82d5a2ce587b2cb4ad304c1449985322": "merge",
+  "e11c9441f3fb67f4c4fed049c015fb6eb8b73ca7": "merge",
+  "fdc0642b90b7afb0520294d90a73060ed5211cde": "merge",
+  "29c466523090f5f957dd5feb7a10a8ee839d6a68": "merge",
+  "987e658e3f40c874da5e237a635ea3eba63520af": "merge",
+  "ef677e0935c55d3da20872e27328eddbae34b5b9": "merge",
+  "b34cbc1ac68e6e0cc61f273d030520cb0f6639fb": "merge",
+  "7769d0a01f93ed6437128827ba88987807aa7154": "merge",
+  "5abd16d445171cdc30752149066755886ac627b7": "merge",
+  "b3da0c637ac87ff1032213ce007d365af512ce33": "merge",
+  "ea4065e3f76043bad07c6e6aff17c342abf59668": "merge",
+  "c3e40ebe130c4603359654c228ee8f3518ea6b46": "merge",
+  "0a98a18116f3384808d9472d370e9c65984f10f7": "merge",
+  "057f5aab9399ceab6ac1d8b59da831e952e53dbd": "merge",
+  "2d41a71e7b7db81fbc2757b814d9189c67cda4cf": "merge",
+  "c27d096ed3d5d94e1f9c1574eff52635c78ae282": "merge",
+  "b4eed2c25fd8260393df142bef5d4a964f3bd10f": "merge",
+  "f6bd9d75156541527e66fab5e474c1422ad9dbf9": "merge",
+  "a449eb060de8beba56dd0dbafde9fee4f65288c9": "merge",
+  "cde603b1d5c0ea111641e21abe2e75a49d157de4": "merge",
+  "c3b0799baf3248e581be8a9cecf35351a7c4a363": "merge",
+  "2c08a477dd2717c07da9f2bbaf1940403d0b8f86": "merge",
+  "a55f668e2623aa50abded937ff4400382e80d90a": "merge",
+  "abb24637bfd92e7212e9dafc9a7e5ad14d7cc8a4": "merge",
+  "b862dd55a70d645da81f6c0cdd921c8728674633": "merge",
+  "6197261b63f675835c3290be0f10ced10400e326": "merge",
+  "74c82141e805ebaab913cea281416883849cc771": "merge",
+  "1ce689262c53a212ad8f94a778ea4bf80bc62910": "merge",
+  "05d9af704c387479856a2390b81a27fed4413285": "merge",
+  "492c25ec0c4667bbe01177f689f3c51b8f31190d": "merge",
+  "ec79b6e520a87ca24ae5265aff1b39a87dcadf82": "merge",
+  "eb5d32991120f20c2ff1c16e92b31c78aa3dcd22": "merge",
+  "e2dc20814e229ddef4569ca4d405c0b553313d1a": "merge",
+  "9e37a3e0e6336d306c263bd4739e9231cb8221dd": "merge",
+  "2c541c5bf5b6a93301fc8ee089228a09b8581959": "merge",
+  "f5b717b5da236e42a840093e4562fe2896f44a20": "merge",
+  "04a3b2dcba0a00ee833e2e9c4a5f032f2cf80742": "merge",
+  "f8452066c036fa7c09b4acd03de4f0c661a6bec6": "merge",
+  "b595ffe402b14a1e927c2f0fe9554f581354999d": "merge",
+  "786ab65d5e3c51246f1c457b77eff154860bf959": "merge",
+  "8b6d467df7b0161820cf698a33c665217c597d04": "merge",
+  "264d317f0a0d92e82c2943ca9365367382902c27": "merge",
+  "6d95efd3025e3175aa75c1aa7ed7ec7d1ac3d1a1": "merge",
+  "165c5c31e2fb155e00a2664fcca334c25356af69": "merge",
+  "998d2db54bde3578f8e7dbaa5e4612f20836e5d0": "merge",
+  "c764f15e260fc5f2ac445e8216e27130fa9af9e7": "merge",
+  "0120182e18a57cd9d936bff9bf74675bb9233cdb": "merge",
+  "09c39c6b32699cf61aa71fa6689dd676b5e803a9": "merge",
+  "1549346e5b321eaf1d767f673e642cac11d22700": "merge",
+  "cd7a074efdf2dd6e9272e05cb88a8f8da08a8a8b": "merge",
+  "f418edf54719e46b2707f0c6ec9a47da38ab099a": "merge",
+  "9e20edd1cc156c387a6fd4b47f0495a5d1d5d969": "merge",
+  "dac94f8716a6f192fcfb660453b3074e55b60c3f": "merge",
+  "c62767464497b98bbab2fbf724ce069b749c0d95": "merge",
+  "37bc745042745d9984d9323f83de2bb6f2af0d49": "merge",
+  "3df2f823828b92a35075fb207844cf788b8bfc09": "merge",
+  "153cbe257b938edc3bda1aeea3729c53abc81d57": "merge",
+  "0da73ba76f0e83a24cb7ec86ad426fc79f0be00e": "merge",
+  "517597920bde6cfea71154e6a13fa4a9bcb18db2": "merge",
+  "283bd504ae23412dfcf671c7d6febdf42232383c": "merge",
+  "4d889b00c2811ade234b312f72d725ffab3971e7": "merge",
+  "c5e723a923eb6c0d07f34e05c30465a862b3fd5c": "merge",
+  "9e9afe3963dcbfcb1191110145772373c49739dd": "merge",
+  "4a238771093ed7936603426b9a19355fa48a086a": "merge",
+  "641291a504ecc051c99fc385151f68f6f2c7711a": "merge",
+  "d6e47126fe3754fde729dcb7777cc25165e57dba": "merge",
+  "812cdf28743b2f01cceedca18cf7e237f6f38c1b": "merge",
+  "de2ce5f515b0d9647320bd71d8d013c68fb48bc4": "merge",
+  "14ffc9df5acc2537d3cca6322d097363d7106d52": "merge",
+  "5941736673d484453b49edc863a9dfa669f4cdb2": "merge",
+  "d68039d71f1bc980a8cca9176df5fd672575db46": "merge",
+  "5e8667cd2a0862a7b528779267ea074d88501494": "merge",
+  "7b6c4999ce6d158f72aba749543f5c8a1cbfb08b": "merge",
+  "ace457ccf21922424aa38a0a29d5aa7dbd4e7fc1": "merge",
+  "5707d00d0d3b646d83541c9205405f5e41ad07b0": "merge",
+  "261e67f706670c9f81e7b2c6d0ec1fc934c4109a": "merge",
+  "eec31103367b39bc1b4db3d0bce21650ad8f4382": "merge",
+  "39880f589acdfeb25866adf3246a76df08798f19": "merge",
+  "65ced2d78530d25626bcf18c4005a0aae5ed62c2": "merge",
+  "48cf968689bc72fa3ae82ce65a7b0943489f0154": "merge",
+  "4f2e11bbb3805459f2eacf219e5ec03f1d28cb35": "merge",
+  "0ea6dbabece8f3e5ea404f61c30c377f60a472e5": "merge",
+  "7ede6afb1141d462955d620a82608841220798b7": "merge",
+  "9ff9d8bf1215f2d98d3707653c09bcacf4e296de": "merge",
+  "b2e944340a3cccf5c1c399aa814e2fabe1ac864d": "merge",
+  "ce4d859cbca0368652db3168769a4e6ccac8f91a": "merge",
+  "6f657a344ac4233f86f0cf9ff11aa0aed4be94e0": "merge",
+  "b7101934d892ce071b407567acf6804e41c9b89c": "merge",
+  "2b5d28295e73e25c51359577e55470b59a4c283a": "merge",
+  "f0c9de980532b717b361439c4ae7d58e4b6a49ea": "merge",
+  "13bd864c337a32bc7207d44c4f1af0ce58f15072": "merge",
+  "710c6b384f10cc08c9a2116614e7736d035e65ae": "merge",
+  "7109848fc1fc60cba60a51e3cb0a514d3c90eebc": "merge",
+  "801e5080975fda8bab6c060efb2e7f3ac85d8182": "merge",
+  "a4910a93356eea624e5a08fe05634b94c0dd0285": "merge",
+  "7ff8ff8d36da93bbeade44b70f6b944ea32135d0": "merge",
+  "02402c6f4cde79fc0d4db36d033d4ea504d97c90": "merge",
+  "09c196193a16b47338b3e5446882ffa8ce3f8b16": "merge",
+  "2d427628ac15e8d04daa564e4c0f1a3aa7f93126": "merge",
+  "d65181509712e219fbe71407f0cd24845f306cd1": "merge",
+  "a82a00d0170e01857e493ed08f20d9b2133a8791": "merge",
+  "a1c17e0481ed6b2a3bc68f8bbf135173f636c2f6": "merge",
+  "1cefb01dff5305eba60b9f19e1acd5087bd8b60a": "merge",
+  "2f766fa9b5c81f85fa225f5e9573785fd0ddf5eb": "merge",
+  "12e822ee75c1f3a0d59cbb8e76370a7b7dd0e4dc": "merge",
+  "6e285cebbe113587482274cb3d51eb4ad8139643": "merge",
+  "f79b4edf703c01e6a5b7a75c43b691a2de7c6ca4": "merge",
+  "cc3901d8e10db1561222e2b39c62391298a14d82": "merge",
+  "21ee92f1adb0d480c328db4ded13728e890c0ea0": "merge",
+  "7b95c6dbd4990cbf4400e7ae89e2d530e31baa4e": "merge",
+  "978078ce3a4aaa7c425cc31e3dfece248a4f3a51": "merge",
+  "569a61c811956afb56325ede193d1d24a2d28013": "merge",
+  "77c9f803050c33ef23e6e428fbc98d5bde14ab2b": "merge",
+  "02dc4ec08801ab168b2838f3322a3a477ee179be": "merge",
+  "f804ddaa38c0d16757bd82f4aa58292da39a427c": "merge",
+  "62cc44865952495586686578bfb905b0026a762f": "merge",
+  "ae10bcefb1c772dd8898f56a6590a4342a880539": "merge",
+  "59427406b24625474bcefc245fd1f27b30bc2c81": "merge",
+  "3c7b681ea2f999c7581bf38d4c8e2a5486f8e00f": "merge",
+  "2c17be0a20d9399ab3b7c491080aee118301d4e1": "merge",
+  "072d8163161c207aafa8fff050ad93b7cafa815d": "merge",
+  "881b8d6623cf104d25bef36a17f89af27d0a6e8e": "merge",
+  "41875d183c0ef7d3a71477ae0d0f4bea0763812c": "merge",
+  "ec1cfbe8410b6569a010a708f81ab9768e882cf1": "merge",
+  "bd6faef7a688ea78b5017c19442e302f0337a3f1": "merge",
+  "225823b9e7e36d1d9e2e8ff0dbce20f3de6dc040": "merge",
+  "35d2a6b775147a9b60212d43810338231a239865": "merge",
+  "a6484f51b8be1c8b1898464e2a60b7b4496afd23": "merge",
+  "058172d888c1d39d7af7dc95fb6808bcde2a9a33": "merge",
+  "8d0a231bab26e73c7ed7e908e049fb48f4ee333f": "merge",
+  "96230c861c8a18100f40daa6a1d6e2ea91778707": "merge",
+  "a3f4effb653fdc612188a260ee04e6fd58213d41": "merge",
+  "3adb67d39baa74d72ca6fd926ab000b0e5b50007": "merge",
+  "e21b4aaf288b2223783f39db5383c39fdd32773f": "merge",
+  "885d004c25647a9561922fc2d5f2c2a7fe4f91f4": "merge",
+  "53c8b120a142ebcafbec64ca3c9040afa39f4d02": "merge",
+  "3e5be3f2f9b83d0506d261b45913d2cd5bd1e36a": "merge",
+  "ce65391f84fecb5d44d7de447d87bb1ef64a60bf": "merge",
+  "a39abb5b691049421ebb8720c0d5305a95bb244e": "merge",
+  "a6065db7c298ecb714b86f930bec5988d4466603": "merge",
+  "607aae23587abc641177812447a45099973c41a9": "merge",
+  "c94ec26898c0d2bc0b76a42f690bb3722c01fc43": "merge",
+  "d31c729d0d9dbef75416af90502bf75194e6e28a": "merge",
+  "28442303003251487bbf9376acda1e6f35c1ea59": "merge",
+  "1b263555acc9753b4de67191c636c9d403a69c98": "merge",
+  "88aaa450db0dd1121d47ff9f605e727efbdcaab7": "merge",
+  "d9e19f0a3a5dea8a04d5b2ad241f2ef9c1f06f59": "merge",
+  "16e379debd6d37512791ee9333079c669df15575": "merge",
+  "f51a06926de6189cc24abd4e67a325aaf1ad464a": "merge",
+  "9b8e2100c116ded02740c4cb777a5f157faccdb0": "merge",
+  "f34a747406f51b6531f04c0d14242226503051de": "merge",
+  "db2dedf1eed6cba52eb672be03855792e85da808": "merge",
+  "b6259a36610f0fd926ee6778b779fe538b9a510d": "merge",
+  "717e057c3fccf73ff4c574c7fe925b16058d8809": "merge",
+  "9b6e897d7f92a031ced4241f81f8b1a16f1dcebe": "merge",
+  "7c9434809e1fb66093267cb386ede23690217c5d": "merge",
+  "8d37e5a1fe1ead57a6fd6358f3dc734a661d69c3": "merge",
+  "b8336b4760340e35f4e3160b8e9c6bbe1da9cae1": "merge",
+  "1c3f0ea4fc7e4c995d8a4926081335a8c5f68dce": "merge",
+  "837abf0a5cce27a14c20cd5a21af867f8174be43": "merge",
+  "a1c63a7e1777c43cbf07834d8ad5afc05321445b": "merge",
+  "22c80d3819f0537048ef7d5d5ec9097713fc8dac": "merge",
+  "3eebe3845a49ee6374433d49c75d203021533a88": "merge",
+  "deba84d9ce4b353cd6c332fa1ce130541f9c3ceb": "merge",
+  "2a6f99c6ea2519768459228412712bd5e53ade15": "merge",
+  "469559bb434980822263f6b4986c50b6acba5cf0": "merge",
+  "97caa33b8724da030a27366e497db1848a38b202": "merge",
+  "68fdeb90581c9b5f0ebad6bb83210afd427deee7": "merge",
+  "cbcf7732216210ec544f18a676118c8ad7183695": "merge",
+  "5854cac43108e0ec536ae892a31d53c961386b6c": "merge",
+  "acb7e01a2f4e6516bcdd9d9f41f7993d81ae944c": "merge",
+  "97bac2c4905f6a4d8b21feb0bd7639222ff14b03": "merge",
+  "c29cea5412dc33461b3a57339e324d85cc9d8d2c": "merge",
+  "e8301271a18a1914cea7bf4ed31f38db7aba2dc7": "merge",
+  "a1192f7fa21c7525312de0620d64411918ca6bc7": "merge",
+  "8d9b61b5c93dc9506e9beeb545d28a6f0605b0ce": "merge",
+  "d2d22d4f31fbfa79cd7bc21d804a26cf2733615e": "merge",
+  "a4c98e31776765ce53ab27a4992e7e3797f0aa53": "merge",
+  "f8cafe342cb100e754258b7b8ea1d6f830b1939b": "merge",
+  "5872e950aa89f5211554bb41a4fea2983547fabc": "merge",
+  "26fa80ae816ebdf833bb53b111214dd7d6fcef14": "merge"
+}
--- a/scripts/ENRICHMENT_PROMPT.md
+++ b/scripts/ENRICHMENT_PROMPT.md
@@ -0,0 +1,107 @@
+# SUBAGENT — Activity enrichment
+
+You are a subagent in the game-library enrichment pipeline. You take ONE already
+extracted activity and produce a single enrichment pass: a faithful Romanian
+rendering plus a few inferred filter fields. You do **one** activity per prompt.
+
+This is **not** re-extraction. The activity text already exists and is trusted.
+Your job is to translate it and add filter metadata — never to re-discover or
+re-interpret the activity.
+
+## Your task
+
+The prompt gives you two blocks:
+
+1. **Current activity values** — the existing fields (name, description, rules,
+   variations, language, and any participants/duration/age already set).
+2. **Source chunk text** — the original passage the activity came from. This is
+   your ground truth for any expansion. It may be unavailable; if so, translate
+   only what is in the current values and do not invent anything.
+
+Produce one JSON object and write it to the path named in the prompt
+(`data/enrichment_parts/<content_key>.json`). It MUST contain the exact
+`content_key` string from the prompt.
+
+## Rules
+
+### Translation (always)
+- Translate `name`, `description`, `rules`, `variations` into natural, fluent
+  Romanian → `name_ro`, `description_ro`, `rules_ro`, `variations_ro`.
+- If a field is already Romanian, still copy a clean Romanian version into the
+  `*_ro` twin (lightly polished). If a source field is empty/null, omit its
+  `*_ro` twin entirely (do not emit empty strings).
+- Translate faithfully. Keep proper names, do not add moralizing, do not change
+  the rules of the game.
+
+### Description expansion (constrained)
+- You MAY make `description_ro` richer than a literal translation — but ONLY
+  using detail that is actually present in the **source chunk text**. Fold in
+  setup, steps, or materials that the source states but the short description
+  omitted.
+- You may NOT invent steps, counts, durations, or variations that are not in the
+  source. If the source is thin, the translation stays thin. Hallucinated
+  expansion is the one unacceptable failure here.
+
+### Inferred filter fields (mark when inferred)
+Fill these when you can, using the source text first, then reasonable inference:
+
+- `indoor_outdoor`: one of `indoor`, `outdoor`, `either`.
+- `space_needed`: one of `mic`, `mediu`, `mare` (small / medium / large area).
+- `participants_min`, `participants_max`: integers (people).
+- `duration_min`, `duration_max`: integers (minutes).
+- `age_group_min`, `age_group_max`: integers (years).
+
+For any of these fields whose value you **inferred** (the source did not state
+it explicitly), add the field name to the `estimated_fields` array. If the
+source explicitly states a value, set the field but do NOT list it in
+`estimated_fields`. Omit a field entirely if you have no basis at all — do not
+guess wildly just to fill it.
+
+Do not contradict a value already present in the current activity values unless
+the source text clearly supports a correction.
+
+## Enum vocabulary (fixed — use these exact slugs)
+
+- `indoor_outdoor`: `indoor` | `outdoor` | `either`
+- `space_needed`: `mic` | `mediu` | `mare`
+
+## Output format
+
+Write exactly one JSON object to `data/enrichment_parts/<content_key>.json`:
+
+```json
+{
+  "content_key": "<the exact key from the prompt>",
+  "name_ro": "…",
+  "description_ro": "…",
+  "rules_ro": "…",
+  "variations_ro": "…",
+  "indoor_outdoor": "outdoor",
+  "space_needed": "mediu",
+  "participants_min": 6,
+  "participants_max": 20,
+  "duration_min": 15,
+  "duration_max": 30,
+  "age_group_min": 8,
+  "age_group_max": 14,
+  "estimated_fields": ["space_needed", "duration_min", "duration_max"]
+}
+```
+
+Include only the fields you actually fill. Always include `content_key` and
+`estimated_fields` (use `[]` if nothing was inferred). Output valid JSON only —
+no commentary, no markdown fences in the file itself.
+
+### CRITICAL — escape quotes inside string values
+
+Any ASCII double-quote (`"`, U+0022) inside a string value MUST be escaped as
+`\"`. Your Romanian text is full of `„cuvânt"` — written raw, the closing ASCII
+`"` terminates the JSON string early and the whole file fails to parse (and your
+enrichment for this activity is silently lost). Either use true typographic
+marks (`„` U+201E, `”` U+201D) or escape every literal ASCII `"`. After writing,
+re-read the file and confirm it parses as valid JSON.
+
+## Report
+
+After writing the file, report in under 30 words: the activity name and which
+fields you estimated.
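Review note (not part of the patch): the output contract in ENRICHMENT_PROMPT.md can be checked mechanically before `--collect`. The sketch below is only an illustration under stated assumptions: `check_part` and the constant names are hypothetical, and the filename-equals-`content_key` check is taken from the pilot's "key matches filename" integrity row. It uses the standard library only and does not reproduce whatever the real collector does.

```python
import json
from pathlib import Path

# Enum slugs and integer fields as listed in ENRICHMENT_PROMPT.md.
ENUM_FIELDS = {
    "indoor_outdoor": {"indoor", "outdoor", "either"},
    "space_needed": {"mic", "mediu", "mare"},
}
INT_FIELDS = (
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
)


def check_part(path: Path) -> list[str]:
    """Hypothetical helper: return the problems found in one part file."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        # The unescaped-quote bug surfaces here.
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems: list[str] = []
    if data.get("content_key") != path.stem:
        problems.append("content_key does not match the filename")

    estimated = data.get("estimated_fields")
    if not isinstance(estimated, list):
        problems.append("estimated_fields missing or not a list")
        estimated = []

    for field, allowed in ENUM_FIELDS.items():
        if field in data and data[field] not in allowed:
            problems.append(f"{field}: unknown slug {data[field]!r}")

    for field in INT_FIELDS:
        value = data.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if field in data and (isinstance(value, bool) or not isinstance(value, int)):
            problems.append(f"{field}: not an integer")

    for field in estimated:
        if not isinstance(field, str) or field not in data:
            problems.append(f"estimated_fields lists {field!r}, which is absent")
    return problems
```

Calling `check_part(Path("data/enrichment_parts/<content_key>.json"))` on each part and treating a non-empty list as a failure would make a lost enrichment visible instead of silent.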
--- a/scripts/SUBAGENT_PROMPT.md
+++ b/scripts/SUBAGENT_PROMPT.md
@@ -0,0 +1,113 @@
+# SUBAGENT — Activity extraction
+
+You are a subagent in the game-library extraction pipeline. You extract
+educational activities (games, team-building, scouting, recipes, songs,
+ceremonies) from one chunk of a source document into structured JSON.
+
+## Your task
+
+1. **Read ONLY the chunk you were assigned.** Do not read other chunks, other
+   files, or the original document. The chunk is a `.txt` file with
+   `--- PAGE N ---` markers.
+2. Identify **every distinct activity** in the chunk.
+3. For each activity, fill the schema in `scripts/activity_schema.json`.
+4. Write the result to `data/extracted/<chunk_key>.json`.
+
+## What counts as "a distinct activity"
+
+A distinct activity is a self-contained game/activity/recipe/song/ceremony with
+its own name and a real description of how to do it. It is NOT:
+
+- a bare mention or a cross-reference with no description — **skip it**;
+- a sub-variant of an activity already extracted — fold it into `variations`;
+- a heading, a table of contents entry, or running page chrome.
+
+If the same activity is split across a page boundary inside your chunk, treat it
+as **one** activity and combine the text.
+
+## Output format
+
+The file is one JSON object: a `header` plus an `activities` array.
+
+```json
+{
+  "header": {
+    "source_id": "<set from the prompt>",
+    "chunk_key": "<set from the prompt>",
+    "source_hash": "<set from the prompt>",
+    "schema_version": "1.0",
+    "prompt_version": "1.0",
+    "chunk_range": "pages 1-20"
+  },
+  "activities": [ ... ]
+}
+```
+
+## Rules for each activity
+
+- **`name`** — the activity's real name (≥3 characters).
+- **`description`** — real prose describing the activity. No hard length limit,
+  but it must actually describe what happens.
+- **`rules`** — how it is played / carried out, if the source gives rules.
+- **`category`** — exactly one taxonomy slug (see the `enum` in the schema):
+  `jocuri-cercetasesti`, `team-building`, `icebreakers`, `camp-outdoor`,
+  `wide-games`, `orientare`, `prim-ajutor`, `escape-room-puzzle`,
+  `creative-stem`, `sports-active`, `cantece-ceremonii`, `retete`,
+  `supravietuire`, `integrare-incluziune`, `conflict-empatie`, `altele`.
+  When unsure, use `altele`.
+- **`content_type`** — the FORM of the content, independent of category:
+  `joc`, `activitate`, `reteta`, `cantec`, or `ceremonie`.
+- **`language`** — `ro` or `en` (the language the activity is written in).
+- **`source_excerpt`** — **MANDATORY.** A short quote (one or two sentences)
+  copied **verbatim** from the chunk. This is the anti-hallucination anchor: it
+  is checked as a fuzzy substring of the chunk, and invented quotes are
+  rejected.
+- **`page_reference`** — **MANDATORY.** The `--- PAGE N ---` marker(s) the
+  activity came from, e.g. `"page 14"` or `"pages 14-15"`.
+- **`extraction_confidence`** — `high`, `med`, or `low`. Use `low` when the
+  source text for the activity is thin or ambiguous.
+
+## Never invent data
+
+- Do **not** invent ages, participant counts, or durations. If the source does
+  not state them, leave those fields `null`.
+- Do **not** paraphrase the `source_excerpt` — copy it character for character.
+- Better to extract fewer activities accurately than to pad the output.
+
+## Escaping quotes inside JSON strings (CRITICAL)
+
+Any ASCII double-quote (`"`, U+0022) that appears **inside a string value** must
+be written escaped as `\"`. This is the single most common way these extractions
+break: Romanian source text uses typographic quotes like `„cuvânt"` where the
+closing mark is a plain ASCII `"`. Written raw, it terminates the JSON string
+early and corrupts the whole file. So:
+
+- `"description": "grupul cântă „Unu\" în cor"`  ← correct (inner `"` escaped)
+- `"description": "grupul cântă „Unu" în cor"`   ← BROKEN (unescaped `"`)
+
+Prefer keeping the source's typographic quotes (`„` U+201E and `”` U+201D), but
+whenever a literal ASCII `"` lands inside a value, escape it. After writing,
+re-read the file and confirm it parses as valid JSON.
+
+## Writing large outputs in batches (IMPORTANT)
+
+A single Write tool call has a hard ~32K output-token limit. Dense chunks
+(50+ activities) will exceed this. If you estimate >30 activities, write the
+file **incrementally**:
+
+1. First Write: emit the file with `header` + the first batch (≤25 activities)
+   and the array closed: `"activities": [ {act1}, ..., {act25} ] }`.
+2. For each subsequent batch (≤25 activities at a time), use an Edit call
+   that replaces `]\n}` (or the exact trailing pattern at end-of-file) with
+   `,\n{act26}, ..., {act50}\n]\n}`. Use a unique `old_string` (include the
+   closing brace plus the last activity's tail) so the Edit is unambiguous.
+3. After the final batch, read the last ~50 lines to confirm the array and
+   object are closed, then confirm the whole file parses as valid JSON.
+
+This keeps each tool call under the output-token cap.
+
+## Before you finish
+
+- Every activity has a non-empty `source_excerpt` and `page_reference`.
+- The file validates against `scripts/activity_schema.json`.
+- You only used text from your assigned chunk.
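Review note (not part of the patch): the quote-escaping rule in SUBAGENT_PROMPT.md can be reproduced in a few lines. This is a minimal sketch using the prompt's own example sentence; the variable names are illustrative. It contrasts JSON assembled by string concatenation with JSON produced by a serializer.

```python
import json

# The closing mark after "Unu" is a plain ASCII double-quote (U+0022).
description = 'grupul cântă „Unu" în cor'

# Hand-assembled JSON: the raw inner quote terminates the string early.
broken = '{"description": "' + description + '"}'
try:
    json.loads(broken)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False

# Serializer-built JSON: json.dumps escapes U+0022 as \" and, with
# ensure_ascii=False, keeps „ and the diacritics as literal characters.
safe = json.dumps({"description": description}, ensure_ascii=False)
round_trip = json.loads(safe)["description"]
```

Here `parsed_ok` ends up `False` for the hand-assembled string, while `round_trip` equals the original text. A subagent that writes files by hand has to apply the same escaping itself, which is why the prompt asks for a re-read and parse check after writing.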
--- a/scripts/activity_schema.json
+++ b/scripts/activity_schema.json
@@ -0,0 +1,110 @@
+{
+  "$schema": "http://json-schema.org/draft-07/schema#",
+  "title": "Game-library extraction output",
+  "description": "One subagent output file: a header carrying provenance/version metadata plus the list of activities extracted from a single chunk.",
+  "type": "object",
+  "required": ["header", "activities"],
+  "additionalProperties": false,
+  "properties": {
+    "header": {
+      "type": "object",
+      "required": ["source_hash", "schema_version", "prompt_version", "chunk_range"],
+      "additionalProperties": true,
+      "properties": {
+        "source_hash": {"type": "string", "minLength": 8},
+        "schema_version": {"type": "string"},
+        "prompt_version": {"type": "string"},
+        "chunk_range": {"type": "string"},
+        "source_id": {"type": ["string", "null"]},
+        "chunk_key": {"type": ["string", "null"]}
+      }
+    },
+    "activities": {
+      "type": "array",
+      "items": {"$ref": "#/definitions/activity"}
+    }
+  },
+  "definitions": {
+    "activity": {
+      "type": "object",
+      "required": [
+        "name",
+        "description",
+        "category",
+        "content_type",
+        "language",
+        "extraction_confidence",
+        "source_excerpt",
+        "page_reference"
+      ],
+      "additionalProperties": false,
+      "properties": {
+        "name": {"type": "string", "minLength": 3},
+        "description": {"type": "string", "minLength": 1},
+        "rules": {"type": ["string", "null"]},
+        "variations": {"type": ["string", "null"]},
+        "category": {
+          "type": "string",
+          "enum": [
+            "jocuri-cercetasesti",
+            "team-building",
+            "icebreakers",
+            "camp-outdoor",
+            "wide-games",
+            "orientare",
+            "prim-ajutor",
+            "escape-room-puzzle",
+            "creative-stem",
+            "sports-active",
+            "cantece-ceremonii",
+            "retete",
+            "supravietuire",
+            "integrare-incluziune",
+            "conflict-empatie",
+            "altele"
+          ]
+        },
+        "subcategory": {"type": ["string", "null"]},
+        "content_type": {
+          "type": "string",
+          "enum": ["joc", "activitate", "reteta", "cantec", "ceremonie"]
+        },
+        "language": {"type": "string", "enum": ["ro", "en"]},
+        "extraction_confidence": {
+          "type": "string",
+          "enum": ["high", "med", "low"]
+        },
+        "source_excerpt": {"type": "string", "minLength": 1},
+        "page_reference": {"type": "string", "minLength": 1},
+        "source_file": {"type": ["string", "null"]},
+        "age_group_min": {"type": ["integer", "null"], "minimum": 0},
+        "age_group_max": {"type": ["integer", "null"], "minimum": 0},
+        "participants_min": {"type": ["integer", "null"], "minimum": 0},
+        "participants_max": {"type": ["integer", "null"], "minimum": 0},
+        "duration_min": {"type": ["integer", "null"], "minimum": 0},
+        "duration_max": {"type": ["integer", "null"], "minimum": 0},
+        "materials_category": {"type": ["string", "null"]},
+        "materials_list": {
+          "type": ["array", "null"],
+          "items": {"type": "string"}
+        },
+        "skills_developed": {
+          "type": ["array", "null"],
+          "items": {"type": "string"}
+        },
+        "difficulty_level": {
+          "type": ["string", "null"],
+          "enum": ["usor", "mediu", "dificil", null]
+        },
+        "keywords": {
+          "type": ["array", "null"],
+          "items": {"type": "string"}
+        },
+        "tags": {
+          "type": ["array", "null"],
+          "items": {"type": "string"}
+        }
+      }
+    }
+  }
+}
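Review note (not part of the patch): a quick way to read the schema is to look at the smallest document that satisfies its `required` lists. The sketch below assumes nothing beyond the schema text; `missing_required` is a hypothetical helper and every value in `example_doc` is a placeholder, not real extraction output. It checks required keys only. A full JSON Schema validator would also enforce the enums, `minLength`, and `additionalProperties`.

```python
# Required keys copied from scripts/activity_schema.json.
HEADER_REQUIRED = ("source_hash", "schema_version", "prompt_version", "chunk_range")
ACTIVITY_REQUIRED = (
    "name", "description", "category", "content_type", "language",
    "extraction_confidence", "source_excerpt", "page_reference",
)


def missing_required(doc: dict) -> list[str]:
    """Hypothetical helper: list the required keys that are absent from doc."""
    header = doc.get("header", {})
    missing = [f"header.{key}" for key in HEADER_REQUIRED if key not in header]
    for index, activity in enumerate(doc.get("activities", [])):
        missing += [
            f"activities[{index}].{key}"
            for key in ACTIVITY_REQUIRED
            if key not in activity
        ]
    return missing


# Placeholder document: the values are made up for illustration.
example_doc = {
    "header": {
        "source_hash": "0123abcd",
        "schema_version": "1.0",
        "prompt_version": "1.0",
        "chunk_range": "pages 1-20",
    },
    "activities": [
        {
            "name": "Exemplu",
            "description": "Descriere de exemplu.",
            "category": "altele",
            "content_type": "joc",
            "language": "ro",
            "extraction_confidence": "low",
            "source_excerpt": "Descriere de exemplu.",
            "page_reference": "page 1",
        }
    ],
}
```

`missing_required(example_doc)` returns an empty list. Deleting `page_reference` from the activity makes it return `["activities[0].page_reference"]`, matching the prompt's rule that both `source_excerpt` and `page_reference` are mandatory.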
--- a/scripts/build_database.py
+++ b/scripts/build_database.py
@@ -0,0 +1,780 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+build_database.py — build data/activities.db from the subagent extraction JSON.
+
+Replaces the old import_claude_activities.py. Pipeline (plan §4):
+
+  1. `--rebuild` builds into data/activities.db.tmp; on success the live DB is
+     backed up to data/activities.db.bak and the tmp file is swapped in with an
+     atomic os.replace. A mid-build crash leaves the live DB untouched.
+  2. Every data/extracted/*.json is validated against scripts/activity_schema.json;
+     invalid files are moved to data/extracted/_rejected/ with an error log.
+  2b. Each source_excerpt must appear as a fuzzy substring (rapidfuzz
+     partial_ratio >= 90) of its source chunk — non-matches are hallucinations
+     and the activity is dropped (logged to _rejected/).
+  3. `category` is normalized to a valid taxonomy slug (fallback `altele`).
+  4. Dedup (D5): group by exact normalized_name, never across languages; within a
+     group rapidfuzz on descriptions — >=85 auto-merge, 60-85 borderline (keep
+     both, needs_review), <60 separate variants.
+  5. data/review_decisions.json is applied before insert.
+  6. Bulk insert into the tmp DB, populate the categories table, rebuild FTS.
+  7. A QA report is printed.
+
+Usage:
+    python scripts/build_database.py --rebuild
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import shutil
+import sys
+from collections import defaultdict
+from pathlib import Path
+from typing import Any, Optional
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
+    if _p not in sys.path:
+        sys.path.insert(0, _p)
+
+from app.config_taxonomy import (  # noqa: E402
+    category_display_name,
+    normalize_category,
+    normalize_content_type,
+)
+from app.models.activity import Activity  # noqa: E402
+from app.models.database import DatabaseManager  # noqa: E402
+from import_common import (  # noqa: E402
+    DEFAULT_SCHEMA_PATH,
+    content_key,
+    excerpt_matches,
+    find_chunk_text,
+    iter_extraction_files,
+    load_schema,
+    normalize_name,
+    source_path_for,
+)
+
+# dedup thresholds (rapidfuzz token_sort_ratio, 0..100 scale)
+AUTO_MERGE_THRESHOLD = 85.0
+BORDERLINE_THRESHOLD = 60.0
+
+
+# --------------------------------------------------------------------------
+# extraction dict -> Activity
+# --------------------------------------------------------------------------
+def _csv(value: Any) -> Optional[str]:
+    """Schema arrays -> comma string for the (TEXT) DB columns."""
+    if value is None:
+        return None
+    if isinstance(value, str):
+        return value.strip() or None
+    if isinstance(value, (list, tuple)):
+        parts = [str(v).strip() for v in value if str(v).strip()]
+        return ", ".join(parts) or None
+    return str(value)
+
+
+def _split_csv(value: Optional[str]) -> list[str]:
+    if not value:
+        return []
+    return [p.strip() for p in str(value).split(",") if p.strip()]
+
+
+def dict_to_activity(
+    adict: dict,
+    source_file: str,
+    source_id: Optional[str] = None,
+    chunk_key: Optional[str] = None,
+) -> Activity:
+    """Build an Activity from one extraction-JSON activity object."""
+    tags = adict.get("tags") or []
+    if isinstance(tags, str):
+        tags = _split_csv(tags)
+
+    source_files = adict.get("source_files") or []
+    if isinstance(source_files, str):
+        source_files = _split_csv(source_files)
+    if source_file and source_file not in source_files:
+        source_files = [source_file, *source_files]
+
+    return Activity(
+        source_id=source_id,
+        source_ids=[source_id] if source_id else [],
+        chunk_key=chunk_key,
+        name=(adict.get("name") or "").strip(),
+        description=(adict.get("description") or "").strip(),
+        rules=adict.get("rules"),
+        variations=adict.get("variations"),
+        category=normalize_category(adict.get("category", "")),
+        subcategory=adict.get("subcategory"),
+        content_type=normalize_content_type(adict.get("content_type", "")),
+        source_file=source_file,
+        source_files=list(source_files),
+        page_reference=adict.get("page_reference"),
+        source_excerpt=adict.get("source_excerpt"),
+        age_group_min=adict.get("age_group_min"),
+        age_group_max=adict.get("age_group_max"),
+        participants_min=adict.get("participants_min"),
+        participants_max=adict.get("participants_max"),
+        duration_min=adict.get("duration_min"),
+        duration_max=adict.get("duration_max"),
+        materials_category=adict.get("materials_category"),
+        materials_list=_csv(adict.get("materials_list")),
+        skills_developed=_csv(adict.get("skills_developed")),
+        difficulty_level=adict.get("difficulty_level"),
+        keywords=_csv(adict.get("keywords")),
+        tags=list(tags),
+        language=adict.get("language"),
+        extraction_confidence=adict.get("extraction_confidence"),
+    )
+
+
+# --------------------------------------------------------------------------
+# step 3 — category normalization is done in dict_to_activity; a non-taxonomy
+# value silently falls back to `altele`. This logs the substitutions.
+# --------------------------------------------------------------------------
+def log_category_fallbacks(raw_pairs: list[tuple[str, str]]) -> list[str]:
+    """raw_pairs = (original, slug); return human-readable fallback messages."""
+    msgs = []
+    for original, slug in raw_pairs:
+        if slug == "altele" and normalize_name(original or "") not in ("", "altele"):
+            msgs.append(f"category '{original}' -> altele (not in taxonomy)")
+    return msgs
+
+
+# --------------------------------------------------------------------------
+# step 4 — dedup
+# --------------------------------------------------------------------------
+def _longest(*values: Optional[str]) -> Optional[str]:
+    best: Optional[str] = None
+    for v in values:
+        if v and (best is None or len(v) > len(best)):
+            best = v
+    return best
+
+
+def _union_csv(values: list[Optional[str]]) -> Optional[str]:
+    seen: list[str] = []
+    for value in values:
+        for item in _split_csv(value):
+            if item not in seen:
+                seen.append(item)
+    return ", ".join(seen) or None
+
+
+def merge_cluster(cluster: list[Activity]) -> Activity:
+    """Collapse a cluster of duplicate activities into one merged Activity."""
+    if len(cluster) == 1:
+        return cluster[0]
+
+    # representative = the one with the longest description
+    rep = max(cluster, key=lambda a: len(a.description or ""))
+    merged = Activity(
+        name=rep.name,
+        description=_longest(*(a.description for a in cluster)) or rep.description,
+        rules=_longest(*(a.rules for a in cluster)),
+        variations=_longest(*(a.variations for a in cluster)),
+        category=rep.category,
+        subcategory=rep.subcategory,
+        content_type=rep.content_type,
+        source_file=rep.source_file,
+        page_reference=rep.page_reference,
+        source_excerpt=rep.source_excerpt,
+        age_group_min=rep.age_group_min,
+        age_group_max=rep.age_group_max,
+        participants_min=rep.participants_min,
+        participants_max=rep.participants_max,
+        duration_min=rep.duration_min,
+        duration_max=rep.duration_max,
+        materials_category=rep.materials_category,
+        materials_list=_union_csv([a.materials_list for a in cluster]),
+        skills_developed=_union_csv([a.skills_developed for a in cluster]),
+        difficulty_level=rep.difficulty_level,
+        keywords=_union_csv([a.keywords for a in cluster]),
+        language=rep.language,
+        extraction_confidence=rep.extraction_confidence,
+    )
+    # union of tags
+    tags: list[str] = []
+    for a in cluster:
+        for t in a.tags or []:
+            if t not in tags:
+                tags.append(t)
+    merged.tags = tags
+    # accumulate every source the activity was seen in
+    sources: list[str] = []
+    for a in cluster:
+        for s in [a.source_file, *(a.source_files or [])]:
+            if s and s not in sources:
+                sources.append(s)
+    merged.source_files = sources
+    # source provenance: keep rep's chunk_key/source_id as primary, union the
+    # source_ids for the download route. Enrichment fields (name_ro,
+    # description_ro, indoor_outdoor, ...) are intentionally NOT carried here:
+    # enrichment is applied AFTER dedup (plan Part B2), keyed on the merged
+    # row's content_key, so merging must not pre-populate them.
+    merged.source_id = rep.source_id
+    merged.chunk_key = rep.chunk_key
+    source_ids: list[str] = []
+    for a in cluster:
+        for sid in [a.source_id, *(a.source_ids or [])]:
+            if sid and sid not in source_ids:
+                source_ids.append(sid)
+    merged.source_ids = source_ids
+    # popularity_score++ per merged duplicate (plan §4)
+    merged.popularity_score = max(a.popularity_score for a in cluster) + (len(cluster) - 1)
+    return merged
+
+
+def dedup_activities(activities: list[Activity]) -> tuple[list[Activity], dict]:
+    """
+    Dedup per plan D5.
+
+    Groups by (normalized_name, language) — different languages are NEVER
+    merged. Within a group, descriptions are clustered with rapidfuzz:
+      >= 85  -> same cluster (auto-merge)
+      60-85  -> borderline: kept as separate clusters, both flagged needs_review
+      < 60   -> separate variants
+    """
+    from rapidfuzz import fuzz
+
+    groups: dict[tuple, list[Activity]] = defaultdict(list)
+    for act in activities:
+        key = (act.normalized_name or normalize_name(act.name), act.language)
+        groups[key].append(act)
+
+    result: list[Activity] = []
+    stats = {"input": len(activities), "auto_merged": 0, "borderline": 0, "output": 0}
+
+    for members in groups.values():
+        clusters: list[list[Activity]] = []
+        borderline_idx: set[int] = set()
+
+        for act in members:
+            best_idx, best_score = -1, -1.0
+            borderline_here: list[int] = []
+            for idx, cluster in enumerate(clusters):
+                score = fuzz.token_sort_ratio(
+                    act.description or "", cluster[0].description or ""
+                )
+                if score >= AUTO_MERGE_THRESHOLD:
+                    if score > best_score:
+                        best_idx, best_score = idx, score
+                elif score >= BORDERLINE_THRESHOLD:
+                    borderline_here.append(idx)
+            if best_idx >= 0:
+                clusters[best_idx].append(act)
+            else:
+                clusters.append([act])
+                new_idx = len(clusters) - 1
+                for bidx in borderline_here:
+                    borderline_idx.add(bidx)
+                    borderline_idx.add(new_idx)
+
+        for idx, cluster in enumerate(clusters):
+            merged = merge_cluster(cluster)
+            if len(cluster) > 1:
+                stats["auto_merged"] += len(cluster) - 1
+            if idx in borderline_idx:
+                merged.needs_review = 1
+                stats["borderline"] += 1
+            result.append(merged)
+
+    stats["output"] = len(result)
+    return result, stats
+
+
+# --------------------------------------------------------------------------
+# step 5 — review decisions
+# --------------------------------------------------------------------------
+def load_review_decisions(path: Path) -> dict:
+    if path and path.is_file():
+        try:
+            data = json.loads(path.read_text(encoding="utf-8"))
+            if isinstance(data, dict):
+                return data
+        except (json.JSONDecodeError, OSError):
+            pass
+    return {}
+
+
+def apply_review_decisions(
+    activities: list[Activity], decisions: dict
+) -> tuple[list[Activity], dict]:
+    """
+    Apply data/review_decisions.json (plan §5c).
+
+    Keyed by the stable content_key. A decision of `drop` removes the row;
+    `keep-separate` / `merge` clear needs_review (the user has resolved it).
+    Rows with no decision keep needs_review and resurface in the queue.
+    """
+    kept: list[Activity] = []
+    stats = {"dropped": 0, "resolved": 0}
+    for act in activities:
+        key = content_key(
+            act.normalized_name or normalize_name(act.name),
+            act.language,
+            act.description or "",
+        )
+        entry = decisions.get(key)
+        decision = entry.get("decision") if isinstance(entry, dict) else entry
+        if decision == "drop":
+            stats["dropped"] += 1
+            continue
+        if decision in ("keep-separate", "merge"):
+            act.needs_review = 0
+            stats["resolved"] += 1
+        kept.append(act)
+    return kept, stats
+
+
+# --------------------------------------------------------------------------
+# step 5b — enrichment overlay (plan Part B)
+# --------------------------------------------------------------------------
+# Translation / inferred-filter fields written by run_enrichment.py. Applied
+# AFTER dedup + review decisions, keyed on the same stable content_key, so the
+# overlay survives rebuilds as long as extraction text is frozen.
+_ENRICHMENT_TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
+_ENRICHMENT_INT_FIELDS = (
+    "participants_min", "participants_max",
+    "duration_min", "duration_max",
+    "age_group_min", "age_group_max",
+)
+
+
+def load_enrichment(path: Path) -> dict:
+    """Load data/enrichment.json (flat map content_key -> field dict)."""
+    if path and path.is_file():
+        try:
+            data = json.loads(path.read_text(encoding="utf-8"))
+            if isinstance(data, dict):
+                return data
+        except (json.JSONDecodeError, OSError):
+            pass
+    return {}
+
+
+def apply_enrichment(activities: list[Activity], enrichment: dict) -> dict:
+    """
+    Overlay enrichment fields onto the post-dedup activity list (plan B2).
+
+    Keyed by content_key. Only fields PRESENT in an entry are written; absent
+    fields leave the underlying DB value untouched. indoor_outdoor /
+    space_needed are normalized to slugs (None on unrecognised). Inferred
+    fields are recorded in `estimated_fields`. Translated / expanded text is
+    NOT re-validated against the source here — expansion fidelity is the
+    enrichment prompt's responsibility (plan B2 comment).
+
+    Returns {entries, matched, orphaned, fields_stated, fields_estimated}.
+    """
+    from app.config_taxonomy import normalize_indoor_outdoor, normalize_space_needed
+
+    matched_keys: set[str] = set()
+    fields_stated: dict[str, int] = defaultdict(int)
+    fields_estimated: dict[str, int] = defaultdict(int)
+
+    for act in activities:
+        key = content_key(
+            act.normalized_name or normalize_name(act.name),
+            act.language,
+            act.description or "",
+        )
+        entry = enrichment.get(key)
+        if not isinstance(entry, dict):
+            continue
+        matched_keys.add(key)
+
+        estimated = set(entry.get("estimated_fields") or [])
+
+        # bilingual text twins
+        for fld in _ENRICHMENT_TEXT_FIELDS:
+            val = entry.get(fld)
+            if isinstance(val, str) and val.strip():
+                setattr(act, fld, val.strip())
+
+        # inferred / clarified structured numeric fields
+        for fld in _ENRICHMENT_INT_FIELDS:
+            if entry.get(fld) is not None:
+                try:
+                    setattr(act, fld, int(entry[fld]))
+                except (TypeError, ValueError):
+                    pass
+
+        # enum filters — normalized to slug, dropped if unrecognised
+        if entry.get("indoor_outdoor") is not None:
+            slug = normalize_indoor_outdoor(entry["indoor_outdoor"])
+            if slug:
+                act.indoor_outdoor = slug
+        if entry.get("space_needed") is not None:
+            slug = normalize_space_needed(entry["space_needed"])
+            if slug:
+                act.space_needed = slug
+
+        act.estimated_fields = sorted(estimated)
+
+        # QA tally: stated vs estimated population, per field
+        for fld in (*_ENRICHMENT_INT_FIELDS, "indoor_outdoor", "space_needed"):
+            if entry.get(fld) is None:
+                continue
+            if fld in estimated:
+                fields_estimated[fld] += 1
+            else:
+                fields_stated[fld] += 1
+
+    return {
+        "entries": len(enrichment),
+        "matched": len(matched_keys),
+        "orphaned": len(enrichment) - len(matched_keys),
+        "fields_stated": dict(fields_stated),
+        "fields_estimated": dict(fields_estimated),
+    }
+
+
+# --------------------------------------------------------------------------
+# golden-set recall (plan §7)
+# --------------------------------------------------------------------------
+def _golden_names(data: Any) -> list[str]:
+    items = data.get("activities", data) if isinstance(data, dict) else data
+    names: list[str] = []
+    for item in items or []:
+        if isinstance(item, str):
+            names.append(item)
+        elif isinstance(item, dict) and item.get("name"):
+            names.append(item["name"])
+    return names
+
+
+def golden_recall(golden_dir: Path, activities: list[Activity]) -> Optional[dict]:
+    if not golden_dir or not golden_dir.is_dir():
+        return None
+    found = {normalize_name(a.name) for a in activities}
+    expected, hits = 0, 0
+    for gf in sorted(golden_dir.glob("*.json")):
+        try:
+            data = json.loads(gf.read_text(encoding="utf-8"))
+        except (json.JSONDecodeError, OSError):
+            continue
+        for name in _golden_names(data):
+            expected += 1
+            if normalize_name(name) in found:
+                hits += 1
+    if expected == 0:
+        return None
+    return {"expected": expected, "found": hits, "recall": round(hits / expected, 3)}
+
+
+# --------------------------------------------------------------------------
+# load + validate + excerpt-check the extraction files
+# --------------------------------------------------------------------------
+def collect_activities(
+    extracted_dir: Path,
+    chunks_dir: Path,
+    sources_dir: Path,
+    schema: dict,
+) -> dict:
+    """Validate, excerpt-check and convert every extraction file."""
+    rejected_dir = extracted_dir / "_rejected"
+    activities: list[Activity] = []
+    report = {
+        "files_total": 0,
+        "files_valid": 0,
+        "files_rejected_schema": 0,
+        "activities_raw": 0,
+        "activities_hallucinated": 0,
+        "category_fallbacks": [],
+    }
+    raw_categories: list[tuple[str, str]] = []
+
+    from import_common import chunk_key_for, validate_extraction  # local import
+
+    for json_path in iter_extraction_files(extracted_dir):
+        report["files_total"] += 1
+        try:
+            data = json.loads(json_path.read_text(encoding="utf-8"))
+        except json.JSONDecodeError as exc:
+            _reject_file(json_path, rejected_dir, [f"invalid JSON: {exc}"])
+            report["files_rejected_schema"] += 1
+            continue
+
+        # Schema-validate the parsed file; a failing file is moved to _rejected/
+        # and none of its activities are imported.
+        errors = validate_extraction(data, schema)
+        if errors:
+            _reject_file(json_path, rejected_dir, errors)
+            report["files_rejected_schema"] += 1
+            continue
+        report["files_valid"] += 1
+
+        header = data.get("header", {})
+        chunk_text = find_chunk_text(json_path, header, chunks_dir)
+        chunk_key = chunk_key_for(json_path, header)
+        source_id = header.get("source_id") or chunk_key.rsplit(".part", 1)[0]
+        fallback_source = (
+            source_path_for(source_id, sources_dir) or source_id or json_path.stem
+        )
+
+        hallucinated: list[dict] = []
+        for adict in data.get("activities", []):
+            report["activities_raw"] += 1
+            excerpt = adict.get("source_excerpt") or ""
+            # If the chunk text is unavailable the excerpt cannot be verified, so
+            # the activity is kept; it still counts toward activities_raw.
+            if chunk_text is not None and not excerpt_matches(excerpt, chunk_text):
+                hallucinated.append(adict)
+                report["activities_hallucinated"] += 1
+                continue
+            src = adict.get("source_file") or fallback_source
+            raw_categories.append((adict.get("category", ""), normalize_category(adict.get("category", ""))))
+            activities.append(dict_to_activity(adict, src, source_id, chunk_key))
+
+        if hallucinated:
+            _log_hallucinations(json_path, rejected_dir, hallucinated)
+
+    report["category_fallbacks"] = log_category_fallbacks(raw_categories)
+    report["activities"] = activities
+    return report
+
+
+def _reject_file(json_path: Path, rejected_dir: Path, errors: list[str]) -> None:
+    rejected_dir.mkdir(parents=True, exist_ok=True)
+    dest = rejected_dir / json_path.name
+    shutil.move(str(json_path), str(dest))
+    log = rejected_dir / f"{json_path.stem}.errors.txt"
+    log.write_text(
+        f"REJECTED (schema validation): {json_path.name}\n\n"
+        + "\n".join(f"  - {e}" for e in errors)
+        + "\n",
+        encoding="utf-8",
+    )
+
+
+def _log_hallucinations(
+    json_path: Path, rejected_dir: Path, hallucinated: list[dict]
+) -> None:
+    rejected_dir.mkdir(parents=True, exist_ok=True)
+    log = rejected_dir / f"{json_path.stem}.hallucinations.txt"
+    lines = [f"DROPPED activities (source_excerpt not found in chunk): {json_path.name}", ""]
+    for a in hallucinated:
+        lines.append(f"  - {a.get('name')!r}")
+        lines.append(f"    excerpt: {a.get('source_excerpt')!r}")
+    log.write_text("\n".join(lines) + "\n", encoding="utf-8")
+
+
+# --------------------------------------------------------------------------
+# DB write + atomic swap
+# --------------------------------------------------------------------------
+def _enrich_category_display_names(db_path: Path) -> None:
+    """Give the categories table proper Romanian display names for slugs."""
+    import sqlite3
+
+    conn = sqlite3.connect(db_path)
+    try:
+        rows = conn.execute(
+            "SELECT value FROM categories WHERE type = 'category'"
+        ).fetchall()
+        for (slug,) in rows:
+            conn.execute(
+                "UPDATE categories SET display_name = ? WHERE type='category' AND value = ?",
+                (category_display_name(slug), slug),
+            )
+        conn.commit()
+    finally:
+        conn.close()
+
+
+def write_database(db_tmp_path: Path, activities: list[Activity]) -> None:
+    """Create a fresh tmp DB, bulk insert, populate categories, rebuild FTS."""
+    if db_tmp_path.exists():
+        db_tmp_path.unlink()
+    db = DatabaseManager(str(db_tmp_path))
+    db.bulk_insert_activities(activities)
+    _enrich_category_display_names(db_tmp_path)
+    db.rebuild_fts_index()
+
+
+def atomic_swap(db_tmp_path: Path, db_path: Path) -> Optional[Path]:
+    """Back up the live DB then atomically swap the tmp file in."""
+    backup: Optional[Path] = None
+    if db_path.exists():
+        backup = db_path.with_suffix(db_path.suffix + ".bak")
+        shutil.copy2(db_path, backup)
+    os.replace(db_tmp_path, db_path)
+    return backup
+
+
+# --------------------------------------------------------------------------
+# orchestration
+# --------------------------------------------------------------------------
+def rebuild(
+    *,
+    extracted_dir: Path,
+    chunks_dir: Path,
+    sources_dir: Path,
+    db_path: Path,
+    decisions_path: Optional[Path] = None,
+    enrichment_path: Optional[Path] = None,
+    schema_path: Path = DEFAULT_SCHEMA_PATH,
+    golden_dir: Optional[Path] = None,
+    do_swap: bool = True,
+) -> dict:
+    """
+    Full rebuild. Everything is built into <db_path>.tmp; the live DB is only
+    touched by the final atomic swap, so a crash anywhere above leaves it intact.
+    """
+    extracted_dir = Path(extracted_dir)
+    db_path = Path(db_path)
+    db_tmp_path = db_path.with_suffix(db_path.suffix + ".tmp")
+
+    schema = load_schema(schema_path)
+    collected = collect_activities(extracted_dir, Path(chunks_dir), Path(sources_dir), schema)
+    activities: list[Activity] = collected.pop("activities")
+
+    deduped, dedup_stats = dedup_activities(activities)
+
+    decisions = load_review_decisions(Path(decisions_path)) if decisions_path else {}
+    final, decision_stats = apply_review_decisions(deduped, decisions)
+
+    # Enrichment overlay — applied immediately after review decisions, on the
+    # post-dedup list, keyed on the same stable content_key (plan B2).
+    enrichment = load_enrichment(Path(enrichment_path)) if enrichment_path else {}
+    enrichment_stats = apply_enrichment(final, enrichment)
+
+    try:
+        write_database(db_tmp_path, final)
+        backup = atomic_swap(db_tmp_path, db_path) if do_swap else None
+    except Exception:
+        if db_tmp_path.exists():
+            db_tmp_path.unlink()
+        raise
+
+    report = {
+        **collected,
+        "dedup": dedup_stats,
+        "decisions": decision_stats,
+        "enrichment": enrichment_stats,
+        "final_count": len(final),
+        "backup": str(backup) if backup else None,
+        "swapped": do_swap,
+        "qa": _qa_report(final, collected, golden_dir),
+    }
+    return report
+
+
+def _qa_report(
+    activities: list[Activity], collected: dict, golden_dir: Optional[Path]
+) -> dict:
+    per_category: dict[str, int] = defaultdict(int)
+    per_content_type: dict[str, int] = defaultdict(int)
+    confidence: dict[str, int] = defaultdict(int)
+    with_rules = 0
+    for a in activities:
+        per_category[a.category] += 1
+        per_content_type[a.content_type or "?"] += 1
+        confidence[a.extraction_confidence or "?"] += 1
+        if a.rules and a.rules.strip():
+            with_rules += 1
+    raw = collected.get("activities_raw", 0)
+    hallucinated = collected.get("activities_hallucinated", 0)
+    return {
+        "total": len(activities),
+        "per_category": dict(per_category),
+        "per_content_type": dict(per_content_type),
+        "extraction_confidence": dict(confidence),
+        "pct_with_rules": round(100 * with_rules / len(activities), 1) if activities else 0.0,
+        "needs_review": sum(1 for a in activities if a.needs_review),
+        "hallucination_rate": round(100 * hallucinated / raw, 2) if raw else 0.0,
+        "golden_recall": golden_recall(Path(golden_dir), activities) if golden_dir else None,
+    }
+
+
+def print_report(report: dict) -> None:
+    qa = report["qa"]
+    print("=" * 60)
+    print("BUILD DATABASE — QA REPORT")
+    print("=" * 60)
+    print(f"extraction files     : {report['files_total']} "
+          f"(valid {report['files_valid']}, schema-rejected {report['files_rejected_schema']})")
+    print(f"activities raw       : {report['activities_raw']}")
+    print(f"  hallucinated drop  : {report['activities_hallucinated']} "
+          f"({qa['hallucination_rate']}%)")
+    d = report["dedup"]
+    print(f"dedup                : {d['input']} -> {d['output']} "
+          f"(auto-merged {d['auto_merged']}, borderline {d['borderline']})")
+    print(f"review decisions     : dropped {report['decisions']['dropped']}, "
+          f"resolved {report['decisions']['resolved']}")
+    enr = report.get("enrichment")
+    if enr and enr.get("entries"):
+        print(f"enrichment           : {enr['entries']} entries "
+              f"(matched {enr['matched']}, orphaned {enr['orphaned']})")
+        stated, estimated = enr.get("fields_stated", {}), enr.get("fields_estimated", {})
+        all_fields = sorted(set(stated) | set(estimated))
+        if all_fields:
+            print("  field population   :  (stated / estimated)")
+            for fld in all_fields:
+                print(f"    {fld:<18}: {stated.get(fld, 0)} / {estimated.get(fld, 0)}")
+    print(f"final inserted       : {report['final_count']}")
+    print(f"% with rules         : {qa['pct_with_rules']}")
+    print(f"needs_review rows    : {qa['needs_review']}")
+    print("per category         :")
+    for slug, n in sorted(qa["per_category"].items(), key=lambda kv: -kv[1]):
+        print(f"  {slug:<24}: {n}")
+    print("per content_type     :")
+    for ct, n in sorted(qa["per_content_type"].items(), key=lambda kv: -kv[1]):
+        print(f"  {ct:<24}: {n}")
+    print("extraction_confidence:")
+    for c, n in sorted(qa["extraction_confidence"].items()):
+        print(f"  {c:<24}: {n}")
+    if qa["golden_recall"]:
+        g = qa["golden_recall"]
+        print(f"golden recall        : {g['found']}/{g['expected']} = {g['recall']}")
+    if report["category_fallbacks"]:
+        print("category fallbacks   :")
+        for msg in report["category_fallbacks"]:
+            print(f"  {msg}")
+    if report["backup"]:
+        print(f"live DB backed up to : {report['backup']}")
+    print("=" * 60)
+
+
+# --------------------------------------------------------------------------
+# CLI
+# --------------------------------------------------------------------------
+def main(argv: Optional[list[str]] = None) -> int:
+    parser = argparse.ArgumentParser(description="Build activities.db from extraction JSON.")
+    parser.add_argument("--rebuild", action="store_true",
+                        help="rebuild the database from scratch (only mode supported)")
+    parser.add_argument("--extracted", default="data/extracted")
+    parser.add_argument("--chunks", default="data/chunks")
+    parser.add_argument("--sources", default="data/sources")
+    parser.add_argument("--db", default="data/activities.db")
+    parser.add_argument("--decisions", default="data/review_decisions.json")
+    parser.add_argument("--enrichment", default="data/enrichment.json")
+    parser.add_argument("--golden", default="data/golden")
+    parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
+    args = parser.parse_args(argv)
+
+    if not args.rebuild:
+        parser.error("only --rebuild is supported (full rebuild, no incremental merge)")
+
+    report = rebuild(
+        extracted_dir=Path(args.extracted),
+        chunks_dir=Path(args.chunks),
+        sources_dir=Path(args.sources),
+        db_path=Path(args.db),
+        decisions_path=Path(args.decisions),
+        enrichment_path=Path(args.enrichment),
+        schema_path=Path(args.schema),
+        golden_dir=Path(args.golden),
+    )
+    print_report(report)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
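
The overlay rule in `apply_enrichment` (write only the fields an entry actually carries, then tally each as stated or estimated) can be illustrated with a minimal sketch. This is not the code from the diff: `overlay`, `INT_FIELDS` and the plain-dict activity are hypothetical stand-ins, and unlike the diff this sketch only tallies values that were successfully written.

```python
from collections import defaultdict

# Hypothetical subset of the numeric fields, for illustration only.
INT_FIELDS = ("participants_min", "participants_max", "duration_min")


def overlay(activity: dict, entry: dict, stated: dict, estimated: dict) -> None:
    """Write only fields present in `entry`; tally stated vs estimated."""
    flagged = set(entry.get("estimated_fields") or [])
    for fld in INT_FIELDS:
        if entry.get(fld) is None:
            continue  # absent field: the existing value is kept
        try:
            activity[fld] = int(entry[fld])
        except (TypeError, ValueError):
            continue  # unparseable value: the existing value is kept
        bucket = estimated if fld in flagged else stated
        bucket[fld] += 1
    activity["estimated_fields"] = sorted(flagged)


stated, estimated = defaultdict(int), defaultdict(int)
activity = {"participants_min": 2, "participants_max": 10, "duration_min": 5}
entry = {
    "participants_min": "4",          # stated in the source text
    "duration_min": 15,               # inferred by the enrichment step
    "estimated_fields": ["duration_min"],
}
overlay(activity, entry, stated, estimated)
print(activity, dict(stated), dict(estimated))
```

Here `participants_max` keeps its value of 10 because the entry does not mention it, which is the property that lets the overlay be re-applied on every rebuild.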
--- a/scripts/chunk_sources.py
+++ b/scripts/chunk_sources.py
@@ -0,0 +1,251 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+chunk_sources.py — split normalized data/sources/*.txt into ~20-page chunks
+for subagent extraction, and maintain data/chunks/manifest.json.
+
+Paginated text  → ~20-page chunks, ~4-page overlap (plan D8).
+Unpaginated text → ~10000-word windows, ~2000-word overlap.
+
+The manifest is a cache derived from the filesystem + per-chunk state. Re-running
+this script is idempotent: existing chunk states (pending/assigned/done/rejected)
+survive as long as the source content hash is unchanged.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+if str(SCRIPT_DIR) not in sys.path:
+    sys.path.insert(0, str(SCRIPT_DIR))
+
+from extract_common import content_hash, split_pages  # noqa: E402
+
+SCHEMA_VERSION = "1.0"
+PAGES_PER_CHUNK = 20
+PAGE_OVERLAP = 4
+WORD_WINDOW = 10_000
+WORD_OVERLAP = 2_000
+
+VALID_STATES = {"pending", "assigned", "done", "rejected"}
+
+
+# --------------------------------------------------------------------------
+# header parsing
+# --------------------------------------------------------------------------
+def parse_source(text: str) -> tuple[dict, str]:
+    """Split a normalized source file into (header_dict, body)."""
+    lines = text.splitlines()
+    header: dict = {}
+    body_start = 0
+    in_header = True
+    for i, line in enumerate(lines):
+        if line.startswith("--- PAGE "):
+            body_start = i
+            break
+        if not in_header:
+            continue
+        if set(line.strip()) == {"="}:  # a rule line made only of "=" characters
+            body_start = i + 1
+            in_header = False  # header ends at the rule line
+            continue
+        if ":" in line:
+            key, _, val = line.partition(":")
+            header[key.strip()] = val.strip()
+    body = "\n".join(lines[body_start:])
+    return header, body
+
+
+# --------------------------------------------------------------------------
+# chunking — pure functions
+# --------------------------------------------------------------------------
+def chunk_pages(
+    pages: list[tuple[int, str]],
+    pages_per_chunk: int = PAGES_PER_CHUNK,
+    overlap: int = PAGE_OVERLAP,
+) -> list[dict]:
+    """
+    Split an ordered list of (page_no, text) into overlapping chunks.
+
+    stride = pages_per_chunk - overlap (assuming overlap < pages_per_chunk), so an
+    activity spanning up to overlap + 1 pages appears whole in at least one chunk.
+    """
+    if not pages:
+        return []
+    stride = max(1, pages_per_chunk - overlap)
+    chunks: list[dict] = []
+    i = 0
+    n = len(pages)
+    while i < n:
+        window = pages[i : i + pages_per_chunk]
+        first, last = window[0][0], window[-1][0]
+        text = "".join(
+            f"\n--- PAGE {num} ---\n{txt}\n" for num, txt in window
+        )
+        chunks.append(
+            {"page_start": first, "page_end": last,
+             "chunk_range": f"pages {first}-{last}", "text": text}
+        )
+        if i + pages_per_chunk >= n:
+            break
+        i += stride
+    return chunks
+
+
+def chunk_words(
+    text: str, window: int = WORD_WINDOW, overlap: int = WORD_OVERLAP
+) -> list[dict]:
+    """Split unpaginated text into overlapping word windows."""
+    words = text.split()
+    if not words:
+        return []
+    stride = max(1, window - overlap)
+    chunks: list[dict] = []
+    i = 0
+    n = len(words)
+    while i < n:
+        seg = words[i : i + window]
+        chunks.append(
+            {"word_start": i, "word_end": i + len(seg),
+             "chunk_range": f"words {i}-{i + len(seg)}", "text": " ".join(seg)}
+        )
+        if i + window >= n:
+            break
+        i += stride
+    return chunks
+
+
+def make_chunks(source_text: str) -> list[dict]:
+    """Chunk one normalized source file. Picks page- or word-windowing."""
+    _, body = parse_source(source_text)
+    pages = split_pages(body)
+    if pages:
+        return chunk_pages(pages)
+    return chunk_words(body)
+
+
+# --------------------------------------------------------------------------
+# manifest
+# --------------------------------------------------------------------------
+def _empty_manifest() -> dict:
+    return {"schema_version": SCHEMA_VERSION, "chunks": {}}
+
+
+def load_manifest(manifest_path: Path) -> dict:
+    if manifest_path.exists():
+        try:
+            data = json.loads(manifest_path.read_text(encoding="utf-8"))
+            data.setdefault("schema_version", SCHEMA_VERSION)
+            data.setdefault("chunks", {})
+            return data
+        except (json.JSONDecodeError, OSError):
+            pass
+    return _empty_manifest()
+
+
+def save_manifest(manifest: dict, manifest_path: Path) -> None:
+    manifest_path.parent.mkdir(parents=True, exist_ok=True)
+    manifest_path.write_text(
+        json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8"
+    )
+
+
+def chunk_source_file(
+    source_path: Path, chunks_dir: Path, manifest: dict
+) -> list[str]:
+    """
+    Chunk one data/sources/<id>.txt → data/chunks/<id>/<id>.partNN.txt and
+    register every chunk in `manifest`. Preserves prior state when the source
+    content hash is unchanged. Returns the list of chunk keys written.
+    """
+    source_id = source_path.stem
+    text = source_path.read_text(encoding="utf-8", errors="replace")
+    src_hash = content_hash(text)
+    chunks = make_chunks(text)
+
+    out_dir = chunks_dir / source_id
+    out_dir.mkdir(parents=True, exist_ok=True)
+
+    written: list[str] = []
+    for idx, chunk in enumerate(chunks, 1):
+        key = f"{source_id}.part{idx:02d}"
+        chunk_file = out_dir / f"{key}.txt"
+        chunk_file.write_text(chunk["text"], encoding="utf-8")
+
+        prior = manifest["chunks"].get(key)
+        # preserve state only if the source content is unchanged
+        if prior and prior.get("source_hash") == src_hash and \
+                prior.get("state") in VALID_STATES:
+            state = prior["state"]
+        else:
+            state = "pending"
+
+        manifest["chunks"][key] = {
+            "source_id": source_id,
+            "source_hash": src_hash,
+            "part": idx,
+            "chunk_range": chunk["chunk_range"],
+            "chunk_file": str(chunk_file.relative_to(chunks_dir.parent)),
+            "expected_json": f"{key}.json",
+            "state": state,
+        }
+        written.append(key)
+    return written
+
+
+def prune_stale(manifest: dict, live_keys: set[str]) -> list[str]:
+    """Drop manifest entries for chunks the current run did not produce."""
+    stale = [k for k in manifest["chunks"] if k not in live_keys]
+    for k in stale:
+        del manifest["chunks"][k]
+    return stale
+
+
+# --------------------------------------------------------------------------
+# CLI
+# --------------------------------------------------------------------------
+def run(sources_dir: Path, chunks_dir: Path) -> dict:
+    """Chunk every *.txt in sources_dir. Returns a summary dict."""
+    manifest_path = chunks_dir / "manifest.json"
+    manifest = load_manifest(manifest_path)
+
+    live_keys: set[str] = set()
+    source_files = sorted(sources_dir.glob("*.txt"))
+    for src in source_files:
+        live_keys.update(chunk_source_file(src, chunks_dir, manifest))
+
+    stale = prune_stale(manifest, live_keys)
+    save_manifest(manifest, manifest_path)
+
+    states: dict[str, int] = {}
+    for meta in manifest["chunks"].values():
+        states[meta["state"]] = states.get(meta["state"], 0) + 1
+    return {
+        "sources": len(source_files),
+        "chunks": len(live_keys),
+        "pruned": len(stale),
+        "states": states,
+    }
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(description="Chunk normalized sources.")
+    parser.add_argument("--sources", default="data/sources", help="sources dir")
+    parser.add_argument("--chunks", default="data/chunks", help="chunks output dir")
+    args = parser.parse_args(argv)
+
+    summary = run(Path(args.sources), Path(args.chunks))
+    print(f"sources processed : {summary['sources']}")
+    print(f"chunks written    : {summary['chunks']}")
+    print(f"stale pruned      : {summary['pruned']}")
+    for state, count in sorted(summary["states"].items()):
+        print(f"  {state:<10}: {count}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
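
The overlap guarantee of `chunk_pages` can be checked with a small standalone sketch. It assumes the same loop shape as the function above but tracks index bounds only; `window_bounds` and `covered` are hypothetical helper names, not part of the script.

```python
def window_bounds(
    n_pages: int, pages_per_chunk: int = 20, overlap: int = 4
) -> list[tuple[int, int]]:
    """Return (start, end_exclusive) page-index pairs for each chunk."""
    stride = max(1, pages_per_chunk - overlap)
    bounds: list[tuple[int, int]] = []
    i = 0
    while i < n_pages:
        bounds.append((i, min(i + pages_per_chunk, n_pages)))
        if i + pages_per_chunk >= n_pages:
            break  # this window already reaches the last page
        i += stride
    return bounds


def covered(start: int, length: int, bounds: list[tuple[int, int]]) -> bool:
    """True if pages [start, start + length) sit inside a single chunk."""
    return any(s <= start and start + length <= e for s, e in bounds)


bounds = window_bounds(50)
print(bounds)  # [(0, 20), (16, 36), (32, 50)]

# Every run of overlap + 1 = 5 consecutive pages is whole in some chunk...
print(all(covered(s, 5, bounds) for s in range(50 - 5 + 1)))  # True
# ...but a 6-page run starting at index 15 is split across two chunks.
print(covered(15, 6, bounds))  # False
```

With the default 20-page chunks and 4-page overlap, an activity longer than five pages can therefore be cut at a chunk boundary, which is worth keeping in mind when tuning `PAGE_OVERLAP`.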
--- a/scripts/claude_extraction_template.md
+++ b/scripts/claude_extraction_template.md
@@ -1,54 +0,0 @@
-# TEMPLATE PENTRU EXTRACȚIE ACTIVITĂȚI CU CLAUDE
-
-## Instrucțiuni pentru Claude Code:
-
-Pentru fiecare PDF/DOC, folosește următorul format de extracție:
-
-### 1. Citește fișierul:
-```
-Claude, te rog citește fișierul: [CALE_FISIER]
-```
-
-### 2. Extrage activitățile folosind acest template JSON:
-```json
-{
-  "source_file": "[NUME_FISIER]",
-  "activities": [
-    {
-      "name": "Numele activității",
-      "description": "Descrierea completă a activității",
-      "rules": "Regulile jocului/activității",
-      "variations": "Variante sau adaptări",
-      "category": "[A-H] bazat pe tip",
-      "age_group_min": 6,
-      "age_group_max": 14,
-      "participants_min": 4,
-      "participants_max": 20,
-      "duration_min": 10,
-      "duration_max": 30,
-      "materials_list": "Lista materialelor necesare",
-      "skills_developed": "Competențe dezvoltate",
-      "difficulty_level": "Ușor/Mediu/Dificil",
-      "keywords": "cuvinte cheie separate prin virgulă",
-      "tags": "taguri relevante"
-    }
-  ]
-}
-```
-
-### 3. Salvează în fișier:
-După extracție, salvează JSON-ul în: `/scripts/extracted_activities/[NUME_FISIER].json`
-
-### 4. Priorități de procesare:
-
-**TOP PRIORITY (procesează primele):**
-1. 1000 Fantastic Scout Games.pdf
-2. Cartea Mare a jocurilor.pdf
-3. 160-de-activitati-dinamice-jocuri-pentru-team-building.pdf
-4. 101 Ways to Create an Unforgettable Camp Experience.pdf
-5. 151 Awesome Summer Camp Nature Activities.pdf
-
-**Categorii de focus:**
- [A] Jocuri Cercetășești
- [C] Camping & Activități Exterior
- [G] Activități Educaționale
--- a/scripts/create_databases.py
+++ b/scripts/create_databases.py
@@ -1,164 +0,0 @@
-#!/usr/bin/env python3
-# -*- coding: utf-8 -*-
-"""
-DATABASE SETUP SCRIPT - INDEX-SISTEM-JOCURI
-
-Script pentru recrearea bazelor de date din .gitignore
-Folosește clasele DatabaseManager pentru consistență
-
-Usage:
-    python scripts/create_databases.py
-    python scripts/create_databases.py --clear-existing
-"""
-
-import sys
-import argparse
-from pathlib import Path
-
-# Add src to path so we can import our modules
-sys.path.append(str(Path(__file__).parent.parent / 'src'))
-
-from database import DatabaseManager
-from game_library_manager import GameLibraryManager
-
-def create_main_database(db_path: str = "data/activities.db", clear: bool = False):
-    """Create the main activities database"""
-    db_file = Path(db_path)
-    
-    if clear and db_file.exists():
-        print(f"🗑️  Removing existing database: {db_path}")
-        db_file.unlink()
-    
-    print(f"📊 Creating main database: {db_path}")
-    db = DatabaseManager(db_path)
-    
-    # Test the database
-    try:
-        stats = db.get_statistics()
-        print(f"✅ Database created successfully: {stats['total_activities']} activities")
-        return True
-    except Exception as e:
-        print(f"❌ Error creating database: {e}")
-        return False
-
-def create_game_library_database(db_path: str = "data/game_library.db", clear: bool = False):
-    """Create the legacy game library database"""
-    db_file = Path(db_path)
-    
-    if clear and db_file.exists():
-        print(f"🗑️  Removing existing database: {db_path}")
-        db_file.unlink()
-    
-    print(f"📊 Creating game library database: {db_path}")
-    manager = GameLibraryManager(db_path)
-    
-    print(f"✅ Game library database created successfully")
-    return True
-
-def create_test_database(db_path: str = "data/test_activities.db", clear: bool = False):
-    """Create the test database"""
-    db_file = Path(db_path)
-    
-    if clear and db_file.exists():
-        print(f"🗑️  Removing existing database: {db_path}")
-        db_file.unlink()
-    
-    print(f"📊 Creating test database: {db_path}")
-    db = DatabaseManager(db_path)
-    
-    # Add some test data
-    test_activity = {
-        'title': 'Test Activity - Setup Script',
-        'description': 'This is a test activity created by the setup script',
-        'file_path': 'test/sample.txt',
-        'file_type': 'TXT',
-        'category': 'test',
-        'age_group': '8-12 ani',
-        'participants': '5-10 persoane',
-        'duration': '15-30min',
-        'materials': 'Fără materiale',
-        'tags': '["test", "setup"]',
-        'source_text': 'Sample test content for verification'
-    }
-    
-    try:
-        db.insert_activity(test_activity)
-        stats = db.get_statistics()
-        print(f"✅ Test database created with sample data: {stats['total_activities']} activities")
-        return True
-    except Exception as e:
-        print(f"❌ Error creating test database: {e}")
-        return False
-
-def ensure_data_directory():
-    """Ensure the data directory exists"""
-    data_dir = Path("data")
-    if not data_dir.exists():
-        print(f"📁 Creating data directory: {data_dir}")
-        data_dir.mkdir(parents=True)
-    else:
-        print(f"📁 Data directory exists: {data_dir}")
-
-def main():
-    """Main setup function"""
-    parser = argparse.ArgumentParser(description='Create databases for INDEX-SISTEM-JOCURI')
-    parser.add_argument('--clear-existing', '-c', action='store_true',
-                       help='Remove existing databases before creating new ones')
-    parser.add_argument('--main-only', action='store_true',
-                       help='Create only the main activities database')
-    parser.add_argument('--test-only', action='store_true',
-                       help='Create only the test database')
-    
-    args = parser.parse_args()
-    
-    print("🚀 DATABASE SETUP - INDEX-SISTEM-JOCURI")
-    print("=" * 50)
-    
-    # Ensure data directory exists
-    ensure_data_directory()
-    
-    success_count = 0
-    total_count = 0
-    
-    if args.test_only:
-        total_count = 1
-        if create_test_database(clear=args.clear_existing):
-            success_count += 1
-    elif args.main_only:
-        total_count = 1
-        if create_main_database(clear=args.clear_existing):
-            success_count += 1
-    else:
-        # Create all databases
-        databases = [
-            ("Main activities", lambda: create_main_database(clear=args.clear_existing)),
-            ("Game library", lambda: create_game_library_database(clear=args.clear_existing)),
-            ("Test activities", lambda: create_test_database(clear=args.clear_existing))
-        ]
-        
-        total_count = len(databases)
-        
-        for name, create_func in databases:
-            print(f"\n📂 Creating {name} database...")
-            try:
-                if create_func():
-                    success_count += 1
-            except Exception as e:
-                print(f"❌ Failed to create {name} database: {e}")
-    
-    print("\n" + "=" * 50)
-    print(f"🎯 SUMMARY: {success_count}/{total_count} databases created successfully")
-    
-    if success_count == total_count:
-        print("✅ All databases ready!")
-        print("\nNext steps:")
-        print("1. Run indexer: cd src && python indexer.py --clear-db")
-        print("2. Start web app: cd src && python app.py")
-    else:
-        print("⚠️  Some databases failed to create. Check errors above.")
-        return 1
-    
-    return 0
-
-if __name__ == '__main__':
-    sys.exit(main())
--- a/scripts/enrich_wave.sh
+++ b/scripts/enrich_wave.sh
@@ -0,0 +1,102 @@
+#!/bin/bash
+# ============================================================================
+# enrich_wave.sh — ONE throttled enrichment wave, fully headless (no Claude
+# session). Designed to be run by the LXC's OS cron at night.
+#
+#   - Prepares a bounded wave (first N missing keys) via enrichment_wave.py.
+#   - Runs ONE `claude -p` per batch file, PAR batches concurrently (OS-level
+#     parallelism — no Workflow tool, no 2-per-workflow cap, no session needed).
+#   - When the backlog is empty, runs --collect + --rebuild and stops.
+#
+# Throttle = KEYS (default 700, about 75% of the ~950 keys that fit in one 5h usage window).
+# A single flock guarantees waves never overlap.
+#
+# Usage:  scripts/enrich_wave.sh [KEYS] [PAR]
+#         KEYS = max keys this wave   (default 700)
+#         PAR  = concurrent claude -p (default 6)
+# ============================================================================
+set -uo pipefail
+
+REPO="/workspace/game-library"
+LOG_DIR="/workspace/.claude-logs"
+LOCK="/tmp/enrich_wave.lock"
+KEYS="${1:-700}"
+PAR="${2:-6}"
+MAX_TURNS=100
+
+# --- environment (cron has a minimal env) ---------------------------------- #
+export HOME="${HOME:-/home/claude}"
+[ -f "$HOME/.nvm/nvm.sh" ] && . "$HOME/.nvm/nvm.sh" >/dev/null 2>&1
+export PATH="$HOME/.nvm/versions/node/v20.19.6/bin:/usr/local/bin:/usr/bin:/bin:$PATH"
+
+mkdir -p "$LOG_DIR"
+TS="$(date +%Y%m%d_%H%M%S)"
+LOG="$LOG_DIR/enrich_${TS}.log"
+
+log() { echo "[$(date '+%H:%M:%S')] $*" | tee -a "$LOG"; }
+
+# --- single-instance lock: skip if a wave is still running ----------------- #
+exec 9>"$LOCK"
+if ! flock -n 9; then
+  log "another wave holds the lock; exiting."
+  exit 0
+fi
+
+cd "$REPO" || { log "cannot cd $REPO"; exit 1; }
+command -v claude >/dev/null 2>&1 || { log "claude CLI not on PATH"; exit 1; }
+
+log "=== enrichment wave start (keys=$KEYS par=$PAR) ==="
+
+# --- 1) prepare bounded wave (batch files only) ---------------------------- #
+PREP="$(python3 scripts/enrichment_wave.py --prepare --keys "$KEYS" --no-shards 2>&1)" || { echo "$PREP" | tee -a "$LOG"; log "prepare failed; aborting wave"; exit 1; }
+echo "$PREP" | tee -a "$LOG"
+
+if echo "$PREP" | grep -q "WAVE: COMPLETE"; then
+  log "backlog empty -> collect + rebuild"
+  python3 scripts/run_enrichment.py --collect   >>"$LOG" 2>&1
+  python3 scripts/build_database.py --rebuild   >>"$LOG" 2>&1
+  grep -E "enrichment .*matched" "$LOG" | tail -1 | tee -a "$LOG"
+  log "=== ENRICHMENT COMPLETE ==="
+  exit 0
+fi
+
+# --- 2) per-batch headless enrichment, PAR-way parallel -------------------- #
+read -r -d '' BATCH_PROMPT <<'EOP'
+You are an enrichment subagent in the game-library pipeline. Working dir: /workspace/game-library.
+
+Read `scripts/ENRICHMENT_PROMPT.md` FIRST — it defines the rules and output format EXACTLY (translate faithfully to Romanian; expand description_ro ONLY from the source chunk text; mark inferred filter fields in estimated_fields; fixed enum vocab).
+
+Your batch file is __BATCHFILE__ — it lists content_keys, one per line. For EACH key:
+1. IDEMPOTENT SKIP: if `data/enrichment_parts/<key>.json` already exists AND parses as valid JSON, SKIP it (do not rewrite).
+2. Otherwise read its prompt `data/enrichment_prompts/<key>.prompt.md`, produce the enrichment JSON per ENRICHMENT_PROMPT.md, and write it to `data/enrichment_parts/<key>.json` (MUST include the exact "content_key": "<key>").
+3. Validate it parses: python3 -c "import json;json.load(open('data/enrichment_parts/<key>.json'))".
+
+CRITICAL — JSON quote escaping: any literal ASCII double-quote inside a string value MUST be escaped as \". Romanian text uses „cuvant" where the closing mark is a plain ASCII " — written raw it breaks the JSON. Either keep the typographic „ " marks or escape every ASCII ". Re-read and re-validate each file; fix any that fail.
+
+Work through EVERY key in the batch file. If a key's prompt is missing, skip it and continue. When done, reply with one line: the count written and skipped.
+EOP
+
+export REPO LOG MAX_TURNS BATCH_PROMPT
+run_one() {
+  local bf="$1"
+  local name; name="$(basename "$bf")"
+  local prompt="${BATCH_PROMPT/__BATCHFILE__/$bf}"
+  cd "$REPO" || return 1
+  timeout 1200 claude -p "$prompt" \
+    --allowedTools "Bash(python3:*),Read,Write,Bash(cat:*),Bash(ls:*)" \
+    --max-turns "$MAX_TURNS" </dev/null >>"$LOG.$name.out" 2>&1
+  echo "[$(date '+%H:%M:%S')] done $name (exit $?)" >>"$LOG"
+}
+export -f run_one
+
+BATCHES=(data/enrichment_batches/batch_*.txt)
+log "launching ${#BATCHES[@]} batches, $PAR concurrent..."
+printf '%s\n' "${BATCHES[@]}" | xargs -P "$PAR" -I{} bash -c 'run_one "$@"' _ {}
+
+# --- 3) summary ------------------------------------------------------------ #
+if grep -qi "session limit\|usage limit" "$LOG".batch_*.out 2>/dev/null; then
+  log "WINDOW EXHAUSTED (usage limit hit mid-wave) — unfinished keys retry next fire."
+fi
+STATUS="$(python3 scripts/enrichment_wave.py --status 2>&1 | grep -E 'good|missing|done')"
+echo "$STATUS" | tee -a "$LOG"
+log "=== wave done ==="
--- a/scripts/enrichment_wave.py
+++ b/scripts/enrichment_wave.py
@@ -0,0 +1,294 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+enrichment_wave.py — throttled, window-paced wave preparation for the corpus
+enrichment pipeline.
+
+The enrichment backlog (~9541 keys) does NOT fit in one 5-hour Anthropic usage
+window. Launching all remaining batches at once always runs the window to
+EXHAUSTION (the "subagent completed without calling StructuredOutput" signature),
+consuming 100% and blocking other work. There is no readable real-time window
+meter, so pacing must be BLIND: cap each wave to a fixed KEY COUNT (default 700,
+i.e. ~75% of the empirical ~950-key window capacity), and let an external
+scheduler (cron, every 6h) space waves across windows.
+
+This script encapsulates the reconcile + bounded-wave preparation that used to
+live as ad-hoc inline Python. It does NOT call the LLM and does NOT launch
+workflows — it only prepares files on disk and prints what to launch.
+
+Modes:
+  --status                          read-only: print done / missing / pct
+  --prepare --keys N --shards K     drop corrupt parts; take the FIRST N missing
+                                    keys (sorted, deterministic); write batch
+                                    files for ONLY those; regenerate K shard JS
+                                    files covering exactly those batches (skipped
+                                    with --no-shards); print greppable WAVE:/SHARD: lines.
+
+Idempotency: a key is "done" iff data/enrichment_parts/<key>.json exists AND
+parses. Re-running --prepare with the same args is deterministic (same sorted
+first-N keys), so a re-fire never reshuffles work. Parts on disk are the durable
+checkpoint.
+
+Output contract (parsed by the cron wave-runner):
+  WAVE: COMPLETE                                  -> backlog empty; run collect+rebuild
+  WAVE: PREPARED keys=.. batches=.. shards=.. remaining_after=..
+  SHARD: data/enrichment_wf/shard_0.js            -> one line per workflow to launch
+  ...
+
+Usage:
+    python3 scripts/enrichment_wave.py --status
+    python3 scripts/enrichment_wave.py --prepare --keys 700 --shards 8
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+
+PROMPT_SUFFIX = ".prompt.md"
+PART_SUFFIX = ".json"
+BATCH_SIZE_DEFAULT = 12
+KEYS_DEFAULT = 700
+SHARDS_DEFAULT = 8
+
+# Resolved relative to REPO_ROOT so the script works from any cwd.
+DEF_PROMPTS = "data/enrichment_prompts"
+DEF_PARTS = "data/enrichment_parts"
+DEF_BATCHES = "data/enrichment_batches"
+DEF_WF = "data/enrichment_wf"
+TEMPLATE_NAME = "shard.js.tmpl"
+
+
+# --------------------------------------------------------------------------- #
+# Helpers
+# --------------------------------------------------------------------------- #
+def _abs(p: str) -> Path:
+    q = Path(p)
+    return q if q.is_absolute() else (REPO_ROOT / q)
+
+
+def part_ok(path: Path) -> bool:
+    """A part counts as done iff it parses as a JSON object."""
+    try:
+        return isinstance(json.loads(path.read_text(encoding="utf-8")), dict)
+    except Exception:
+        return False
+
+
+def corrupt_parts(parts_dir: Path) -> list[Path]:
+    return [p for p in parts_dir.glob("*" + PART_SUFFIX) if not part_ok(p)]
+
+
+def compute_missing(prompts_dir: Path, parts_dir: Path) -> list[str]:
+    """Keys whose prompt exists but whose part is absent. Sorted = deterministic."""
+    missing = []
+    for pr in prompts_dir.glob("*" + PROMPT_SUFFIX):
+        key = pr.name[: -len(PROMPT_SUFFIX)]
+        if not (parts_dir / (key + PART_SUFFIX)).exists():
+            missing.append(key)
+    return sorted(missing)
+
+
+def count_done(prompts_dir: Path, parts_dir: Path) -> tuple[int, int]:
+    """(good_parts_with_prompt, total_prompts)."""
+    total = 0
+    good = 0
+    for pr in prompts_dir.glob("*" + PROMPT_SUFFIX):
+        total += 1
+        key = pr.name[: -len(PROMPT_SUFFIX)]
+        part = parts_dir / (key + PART_SUFFIX)
+        if part.exists() and part_ok(part):
+            good += 1
+    return good, total
+
+
+def write_batches(keys: list[str], batches_dir: Path, size: int) -> int:
+    """Replace all batch_*.txt with fresh files of <= size keys. Returns NB."""
+    batches_dir.mkdir(parents=True, exist_ok=True)
+    for old in batches_dir.glob("batch_*.txt"):
+        old.unlink()
+    nb = 0
+    for i in range(0, len(keys), size):
+        chunk = keys[i : i + size]
+        (batches_dir / f"batch_{nb:04d}.txt").write_text(
+            "\n".join(chunk) + "\n", encoding="utf-8"
+        )
+        nb += 1
+    return nb
+
+
+def shard_ranges(nb: int, k: int) -> list[tuple[int, int]]:
+    """Split [0,nb) into k contiguous, disjoint, total-covering ranges.
+
+    Even distribution: the first (nb % k) shards carry one extra batch. When
+    nb < k the trailing ranges are empty [x,x) and are dropped by the caller.
+    """
+    if nb <= 0 or k <= 0:
+        return []
+    base, extra = divmod(nb, k)
+    ranges = []
+    start = 0
+    for i in range(k):
+        length = base + (1 if i < extra else 0)
+        ranges.append((start, start + length))
+        start += length
+    return ranges
+
+
+def render_shard(template: str, shard: int, start: int, end: int, nshards: int) -> str:
+    return (
+        template.replace("__SHARD__", str(shard))
+        .replace("__START__", str(start))
+        .replace("__END__", str(end))
+        .replace("__NSHARDS__", str(nshards))
+    )
+
+
+def write_shards(ranges: list[tuple[int, int]], template: str, wf_dir: Path) -> list[Path]:
+    """Delete stale shard_*.js, then write one per NON-EMPTY range. Returns paths."""
+    wf_dir.mkdir(parents=True, exist_ok=True)
+    for old in wf_dir.glob("shard_*.js"):
+        old.unlink()
+    non_empty = [(i, s, e) for i, (s, e) in enumerate(ranges) if e > s]
+    nshards = len(non_empty)
+    paths = []
+    # Re-index shards 0..nshards-1 so labels/meta stay contiguous even if some
+    # trailing ranges were empty (tiny final wave with fewer batches than K).
+    for new_idx, (_, s, e) in enumerate(non_empty):
+        path = wf_dir / f"shard_{new_idx}.js"
+        path.write_text(
+            render_shard(template, new_idx, s, e, nshards), encoding="utf-8"
+        )
+        paths.append(path)
+    return paths
+
+
+def rel(path: Path) -> str:
+    try:
+        return str(path.relative_to(REPO_ROOT))
+    except ValueError:
+        return str(path)
+
+
+# --------------------------------------------------------------------------- #
+# Modes
+# --------------------------------------------------------------------------- #
+def cmd_status(prompts_dir: Path, parts_dir: Path) -> int:
+    good, total = count_done(prompts_dir, parts_dir)
+    parts_on_disk = len(list(parts_dir.glob("*" + PART_SUFFIX)))
+    bad = len(corrupt_parts(parts_dir))
+    missing = total - good
+    pct = (100.0 * good / total) if total else 0.0
+    print("=== enrichment status ===")
+    print(f"prompts (universe) : {total}")
+    print(f"parts on disk      : {parts_on_disk}")
+    print(f"good (done)        : {good}")
+    print(f"corrupt parts      : {bad}  (reported only; --prepare drops them)")
+    print(f"missing            : {missing}")
+    print(f"done               : {pct:.1f}%")
+    if total:
+        print(f"WAVE: {'COMPLETE' if missing == 0 else 'PENDING'} missing={missing}")
+    return 0
+
+
+def cmd_prepare(
+    prompts_dir: Path,
+    parts_dir: Path,
+    batches_dir: Path,
+    wf_dir: Path,
+    keys: int,
+    shards: int,
+    batch_size: int,
+    make_shards: bool = True,
+) -> int:
+    template = ""
+    if make_shards:
+        template_path = wf_dir / TEMPLATE_NAME
+        if not template_path.is_file():
+            print(f"ERROR: missing shard template {rel(template_path)}", file=sys.stderr)
+            return 2
+        template = template_path.read_text(encoding="utf-8")
+
+    # 1) drop corrupt parts (only mutation to parts/)
+    dropped = 0
+    for p in corrupt_parts(parts_dir):
+        p.unlink()
+        dropped += 1
+
+    # 2) compute missing (deterministic)
+    missing = compute_missing(prompts_dir, parts_dir)
+
+    # 3) empty -> COMPLETE sentinel, no files written
+    if not missing:
+        print(f"dropped_corrupt={dropped}")
+        print("WAVE: COMPLETE")
+        return 0
+
+    # 4) clamp to first N
+    take = missing[:keys]
+
+    # 5) batches for ONLY those keys
+    nb = write_batches(take, batches_dir, batch_size)
+
+    # 6) shard scripts covering exactly those batches (skipped on the bash path)
+    paths = []
+    if make_shards:
+        ranges = shard_ranges(nb, shards)
+        paths = write_shards(ranges, template, wf_dir)
+
+    remaining_after = len(missing) - len(take)
+    print(f"dropped_corrupt={dropped}")
+    print(
+        f"WAVE: PREPARED keys={len(take)} batches={nb} "
+        f"shards={len(paths)} remaining_after={remaining_after}"
+    )
+    for p in paths:
+        print(f"SHARD: {rel(p)}")
+    return 0
+
+
+def main(argv=None) -> int:
+    ap = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("--status", action="store_true", help="read-only progress report")
+    ap.add_argument("--prepare", action="store_true", help="prepare one bounded wave")
+    ap.add_argument("--keys", type=int, default=KEYS_DEFAULT, help=f"max keys this wave (default {KEYS_DEFAULT})")
+    ap.add_argument("--shards", type=int, default=SHARDS_DEFAULT, help=f"workflow shards (default {SHARDS_DEFAULT})")
+    ap.add_argument("--batch-size", type=int, default=BATCH_SIZE_DEFAULT, help=f"keys per batch (default {BATCH_SIZE_DEFAULT})")
+    ap.add_argument("--no-shards", action="store_true", help="prepare batch files only; skip shard JS generation (bash/headless path)")
+    ap.add_argument("--prompts", default=DEF_PROMPTS)
+    ap.add_argument("--parts", default=DEF_PARTS)
+    ap.add_argument("--batches", default=DEF_BATCHES)
+    ap.add_argument("--wf-dir", default=DEF_WF)
+    args = ap.parse_args(argv)
+
+    prompts_dir = _abs(args.prompts)
+    parts_dir = _abs(args.parts)
+    batches_dir = _abs(args.batches)
+    wf_dir = _abs(args.wf_dir)
+
+    if not prompts_dir.is_dir():
+        print(f"ERROR: prompts dir not found: {rel(prompts_dir)}", file=sys.stderr)
+        return 2
+    parts_dir.mkdir(parents=True, exist_ok=True)
+
+    if args.keys < 1 or args.shards < 1 or args.batch_size < 1:
+        print("ERROR: --keys/--shards/--batch-size must be >= 1", file=sys.stderr)
+        return 2
+
+    if args.prepare:
+        return cmd_prepare(
+            prompts_dir, parts_dir, batches_dir, wf_dir,
+            args.keys, args.shards, args.batch_size,
+            make_shards=not args.no_shards,
+        )
+    # default to status
+    return cmd_status(prompts_dir, parts_dir)
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
--- a/scripts/extract_common.py
+++ b/scripts/extract_common.py
@@ -0,0 +1,361 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+extract_common.py — single home for per-format text extraction.
+
+Every extractor returns a plain text *body* with synthetic page markers
+(`--- PAGE N ---`). The file-level header (`SOURCE:` / `CONVERTED:`) is added
+by normalize_sources.py, not here.
+
+Critical fix vs. the old pdf_to_text_converter.py: there is NO `max_pages` cap.
+Large books are extracted in full.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import importlib
+import os
+import re
+import shutil
+import subprocess
+import tempfile
+import zipfile
+from pathlib import Path
+from typing import Callable
+
+PAGE_MARKER_RE = re.compile(r"^--- PAGE (\d+) ---\s*$", re.MULTILINE)
+
+# paragraphs per synthetic page for paginated-by-flow formats (docx)
+DOCX_PARAS_PER_PAGE = 40
+
+# formats we deliberately ignore (epub duplicates existing PDFs — plan §1)
+IGNORED_EXTENSIONS = {".epub"}
+
+# obvious junk filenames skipped during a walk
+JUNK_NAMES = {"desktop.ini", "linkuri-jocuri.txt"}
+JUNK_SUFFIXES = {".bak", ".tmp", ".ini"}
+
+
+# --------------------------------------------------------------------------
+# page assembly helpers
+# --------------------------------------------------------------------------
+def join_pages(pages: list[str], start: int = 1) -> str:
+    """Join a list of page texts into a body string with `--- PAGE N ---`."""
+    out: list[str] = []
+    for i, text in enumerate(pages, start):
+        out.append(f"\n--- PAGE {i} ---\n{(text or '').strip()}\n")
+    return "".join(out)
+
+
+def split_pages(body: str) -> list[tuple[int, str]]:
+    """Inverse of join_pages: parse a body into [(page_number, text), ...]."""
+    matches = list(PAGE_MARKER_RE.finditer(body))
+    if not matches:
+        return []
+    pages: list[tuple[int, str]] = []
+    for idx, m in enumerate(matches):
+        num = int(m.group(1))
+        seg_start = m.end()
+        seg_end = matches[idx + 1].start() if idx + 1 < len(matches) else len(body)
+        pages.append((num, body[seg_start:seg_end].strip()))
+    return pages
+
+
+def count_page_markers(body: str) -> int:
+    return len(PAGE_MARKER_RE.findall(body))
+
+
+# --------------------------------------------------------------------------
+# format detection
+# --------------------------------------------------------------------------
+FORMAT_BY_EXT = {
+    ".pdf": "pdf",
+    ".docx": "docx",
+    ".doc": "doc",
+    ".pptx": "pptx",
+    ".ppt": "pptx",
+    ".htm": "html",
+    ".html": "html",
+    ".zip": "zip",
+    ".epub": "epub",
+    ".txt": "txt",
+}
+
+
+def detect_format(path: str | os.PathLike) -> str:
+    """Return a format key for a path based on its extension."""
+    ext = Path(path).suffix.lower()
+    return FORMAT_BY_EXT.get(ext, "unknown")
+
+
+def is_junk(path: str | os.PathLike) -> bool:
+    p = Path(path)
+    name = p.name.lower()
+    if name in JUNK_NAMES:
+        return True
+    if name.startswith("readme") and p.suffix.lower() == ".md":
+        return True
+    if p.suffix.lower() in JUNK_SUFFIXES:
+        return True
+    return False
+
+
+# --------------------------------------------------------------------------
+# content hashing + near-duplicate elimination
+# --------------------------------------------------------------------------
+def _normalize_for_hash(text: str) -> str:
+    return re.sub(r"\s+", " ", (text or "")).strip().lower()
+
+
+def content_hash(text: str) -> str:
+    """Stable SHA1 of whitespace-normalized text — used for exact-dup detection."""
+    return hashlib.sha1(_normalize_for_hash(text).encode("utf-8")).hexdigest()
+
+
+def near_duplicate_ratio(a: str, b: str) -> float:
+    """Similarity score in [0, 100] between two texts (rapidfuzz token_sort_ratio)."""
+    from rapidfuzz import fuzz
+
+    return fuzz.token_sort_ratio(_normalize_for_hash(a), _normalize_for_hash(b))
+
+
+def dedupe_texts(
+    items: list[tuple[str, str]], threshold: float = 95.0
+) -> list[tuple[str, str]]:
+    """
+    Drop exact and near-duplicate texts from a list of (key, text) pairs.
+
+    Used for HTML mirror pages (print copies, repeated index/footer pages).
+    Keeps the first occurrence; O(n) on exact hash, O(n*k) fuzzy only against
+    already-kept items.
+    """
+    kept: list[tuple[str, str]] = []
+    seen_hashes: set[str] = set()
+    for key, text in items:
+        h = content_hash(text)
+        if h in seen_hashes:
+            continue
+        if any(near_duplicate_ratio(text, kt) >= threshold for _, kt in kept):
+            continue
+        seen_hashes.add(h)
+        kept.append((key, text))
+    return kept
+
+
+# --------------------------------------------------------------------------
+# preflight dependency check
+# --------------------------------------------------------------------------
+REQUIRED_PYTHON_MODULES = {
+    "pdfplumber": "pdfplumber",
+    "PyPDF2": "pypdf2",
+    "docx": "python-docx",
+    "pptx": "python-pptx",
+    "bs4": "beautifulsoup4",
+    "lxml": "lxml",
+    "jsonschema": "jsonschema",
+    "rapidfuzz": "rapidfuzz",
+    "chardet": "chardet",
+}
+
+
+def preflight(check_ocr: bool = False) -> dict:
+    """
+    Check system + Python dependencies before a long normalization run.
+
+    Returns {'ok': bool, 'missing_python': [...], 'missing_system': [...],
+             'warnings': [...]}.  libreoffice is a *warning* (only .doc needs it),
+             tesseract only checked when check_ocr=True.
+    """
+    missing_python: list[str] = []
+    for module, pip_name in REQUIRED_PYTHON_MODULES.items():
+        try:
+            importlib.import_module(module)
+        except ImportError:
+            missing_python.append(pip_name)
+
+    warnings: list[str] = []
+    missing_system: list[str] = []
+
+    if not (shutil.which("libreoffice") or shutil.which("soffice")):
+        warnings.append("libreoffice not found — legacy .doc files cannot be converted")
+
+    if check_ocr and not shutil.which("tesseract"):
+        missing_system.append("tesseract (OCR requested but not installed)")
+
+    return {
+        "ok": not missing_python and not missing_system,
+        "missing_python": missing_python,
+        "missing_system": missing_system,
+        "warnings": warnings,
+    }
+
+
+# --------------------------------------------------------------------------
+# per-format extractors
+# --------------------------------------------------------------------------
+def extract_pdf(path: str | os.PathLike) -> str:
+    """PDF → body. pdfplumber primary, PyPDF2 fallback. No page cap."""
+    path = str(path)
+    try:
+        return _extract_pdf_pdfplumber(path)
+    except Exception:
+        return _extract_pdf_pypdf2(path)
+
+
+def _extract_pdf_pdfplumber(path: str) -> str:
+    import pdfplumber
+
+    pages: list[str] = []
+    with pdfplumber.open(path) as pdf:
+        for page in pdf.pages:  # ALL pages — no max_pages
+            try:
+                pages.append(page.extract_text() or "")
+            except Exception:
+                pages.append("")
+    return join_pages(pages)
+
+
+def _extract_pdf_pypdf2(path: str) -> str:
+    import PyPDF2
+
+    pages: list[str] = []
+    with open(path, "rb") as fh:
+        reader = PyPDF2.PdfReader(fh)
+        for page in reader.pages:  # ALL pages — no max_pages
+            try:
+                pages.append(page.extract_text() or "")
+            except Exception:
+                pages.append("")
+    return join_pages(pages)
+
+
+def extract_docx(path: str | os.PathLike) -> str:
+    """docx → body. Synthetic page marker every DOCX_PARAS_PER_PAGE paragraphs."""
+    import docx
+
+    document = docx.Document(str(path))
+    paragraphs = [p.text for p in document.paragraphs]
+    pages: list[str] = []
+    for i in range(0, max(len(paragraphs), 1), DOCX_PARAS_PER_PAGE):
+        chunk = paragraphs[i : i + DOCX_PARAS_PER_PAGE]
+        pages.append("\n".join(chunk))
+    return join_pages(pages)
+
+
+def extract_doc(path: str | os.PathLike) -> str:
+    """
+    Legacy .doc → body via `libreoffice --headless --convert-to docx`.
+
+    Raises RuntimeError if libreoffice is unavailable — the caller marks the
+    resulting source `needs_review` regardless (conversion is imperfect).
+    """
+    soffice = shutil.which("libreoffice") or shutil.which("soffice")
+    if not soffice:
+        raise RuntimeError("libreoffice/soffice not available — cannot convert .doc")
+
+    src = Path(path).resolve()
+    with tempfile.TemporaryDirectory() as tmp:
+        subprocess.run(
+            [soffice, "--headless", "--convert-to", "docx", "--outdir", tmp, str(src)],
+            check=True,
+            capture_output=True,
+            timeout=300,
+        )
+        converted = Path(tmp) / (src.stem + ".docx")
+        if not converted.exists():
+            raise RuntimeError(f"libreoffice produced no output for {src.name}")
+        return extract_docx(converted)
+
+
+def extract_pptx(path: str | os.PathLike) -> str:
+    """pptx → body. One page per slide: title + body text + speaker notes."""
+    from pptx import Presentation
+
+    presentation = Presentation(str(path))
+    pages: list[str] = []
+    for slide in presentation.slides:
+        parts: list[str] = []
+        for shape in slide.shapes:
+            if shape.has_text_frame and shape.text_frame.text.strip():
+                parts.append(shape.text_frame.text.strip())
+        if slide.has_notes_slide:
+            notes = getattr(slide.notes_slide.notes_text_frame, "text", "").strip()  # notes_text_frame is None without a notes placeholder
+            if notes:
+                parts.append(f"[NOTES] {notes}")
+        pages.append("\n".join(parts))
+    return join_pages(pages)
+
+
+def extract_html(path: str | os.PathLike) -> str:
+    """HTML mirror page → body. Strips nav/script/style/footer/header/aside."""
+    import chardet
+    from bs4 import BeautifulSoup
+
+    raw = Path(path).read_bytes()
+    enc = chardet.detect(raw).get("encoding") or "utf-8"
+    soup = BeautifulSoup(raw.decode(enc, errors="replace"), "lxml")
+
+    for tag in soup(["script", "style", "nav", "footer", "header", "aside", "noscript"]):
+        tag.decompose()
+    # also drop common page chrome identified by ARIA role
+    for tag in soup.find_all(attrs={"role": ["navigation", "banner", "contentinfo"]}):
+        tag.decompose()
+
+    text = soup.get_text(separator="\n")
+    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
+    return join_pages(["\n".join(lines)])
+
+
+def extract_zip(path: str | os.PathLike) -> str:
+    """
+    zip → body. Unzips into a temp dir and recurses on every extractable inner
+    file. Inner files are page-renumbered into one continuous body.
+    """
+    path = str(path)
+    pages: list[str] = []
+    with tempfile.TemporaryDirectory() as tmp:
+        try:
+            with zipfile.ZipFile(path) as zf:
+                zf.extractall(tmp)
+        except zipfile.BadZipFile:
+            return ""
+        for inner in sorted(Path(tmp).rglob("*")):
+            if not inner.is_file() or is_junk(inner):
+                continue
+            fmt = detect_format(inner)
+            if fmt in ("unknown", "epub", "zip"):
+                # nested zips handled by recursion below
+                if fmt == "zip":
+                    body = extract_zip(inner)
+                    pages.extend(t for _, t in split_pages(body))
+                continue
+            try:
+                body = extract_file(inner)
+            except Exception:
+                continue
+            pages.extend(t for _, t in split_pages(body))
+    return join_pages(pages)
+
+
+EXTRACTORS: dict[str, Callable[[str | os.PathLike], str]] = {
+    "pdf": extract_pdf,
+    "docx": extract_docx,
+    "doc": extract_doc,
+    "pptx": extract_pptx,
+    "html": extract_html,
+    "zip": extract_zip,
+}
+
+
+def extract_file(path: str | os.PathLike) -> str:
+    """Dispatch a single file to the right extractor. Returns a page-marked body."""
+    fmt = detect_format(path)
+    if fmt == "txt":
+        body = Path(path).read_text(encoding="utf-8", errors="replace")
+        # already paginated? pass through; else wrap as one page
+        return body if count_page_markers(body) else join_pages([body])
+    extractor = EXTRACTORS.get(fmt)
+    if extractor is None:
+        raise ValueError(f"No extractor for format '{fmt}': {path}")
+    return extractor(path)
--- a/scripts/html_extractor.py
+++ b/scripts/html_extractor.py
@@ -1,424 +0,0 @@
-#!/usr/bin/env python3
-"""
-HTML Activity Extractor - processes 1876 HTML files
-Automatically extracts activities using pattern recognition
-"""
-
-import os
-import re
-import json
-from pathlib import Path
-from bs4 import BeautifulSoup
-import chardet
-from typing import List, Dict, Optional
-import sqlite3
-from datetime import datetime
-
-class HTMLActivityExtractor:
-    def __init__(self, db_path='data/activities.db'):
-        self.db_path = db_path
-        # Patterns for detecting activities in Romanian
-        self.activity_patterns = {
-            'title_patterns': [
-                r'(?i)(joc|activitate|exerci[t]iu|team[\s-]?building|energizer|ice[\s-]?breaker)[\s:]+([^\.]{5,100})',
-                r'(?i)<h[1-6][^>]*>([^<]*(?:joc|activitate|exerci[t]iu)[^<]*)</h[1-6]>',
-                r'(?i)<strong>([^<]*(?:joc|activitate|exerci[t]iu)[^<]*)</strong>',
-                r'(?i)^[\d]+\.?\s*([A-Z][^\.]{10,100}(?:joc|activitate|exerci[t]iu)[^\.]{0,50})$',
-            ],
-            'description_markers': [
-                'descriere', 'reguli', 'cum se joac[a]', 'instructiuni', 
-                'obiectiv', 'desfasurare', 'explicatie', 'mod de joc'
-            ],
-            'materials_markers': [
-                'materiale', 'necesare', 'echipament', 'ce avem nevoie',
-                'se folosesc', 'trebuie sa avem', 'dotari'
-            ],
-            'age_patterns': [
-                r'(?i)v[âa]rst[a][\s:]+(\d+)[\s-]+(\d+)',
-                r'(?i)(\d+)[\s-]+(\d+)\s*ani',
-                r'(?i)pentru\s+(\d+)[\s-]+(\d+)\s*ani',
-                r'(?i)categoria?\s*(?:de\s*)?v[âa]rst[a][\s:]+(\d+)[\s-]+(\d+)',
-            ],
-            'participants_patterns': [
-                r'(?i)(\d+)[\s-]+(\d+)\s*(?:participan[t]i|juc[a]tori|persoane|copii)',
-                r'(?i)num[a]r\s*(?:de\s*)?(?:participan[t]i|juc[a]tori)[\s:]+(\d+)[\s-]+(\d+)',
-                r'(?i)grup\s*de\s*(\d+)[\s-]+(\d+)',
-            ],
-            'duration_patterns': [
-                r'(?i)durat[a][\s:]+(\d+)[\s-]+(\d+)\s*(?:minute|min)',
-                r'(?i)timp[\s:]+(\d+)[\s-]+(\d+)\s*(?:minute|min)',
-                r'(?i)(\d+)[\s-]+(\d+)\s*minute',
-            ]
-        }
-        
-        # Predefined categories based on the existing system
-        self.categories = {
-            '[A]': ['joc', 'joaca', 'distractie', 'amuzament'],
-            '[B]': ['aventura', 'explorare', 'descoperire'],
-            '[C]': ['camping', 'tabara', 'excursie', 'drumetie'],
-            '[D]': ['foc', 'flacara', 'lumina'],
-            '[E]': ['noduri', 'frânghii', 'sfori', 'legare'],
-            '[F]': ['bushcraft', 'supravietuire', 'survival'],
-            '[G]': ['educatie', 'educativ', 'invatare', 'scoala'],
-            '[H]': ['orientare', 'busola', 'harta', 'navigare']
-        }
-    
-    def detect_encoding(self, file_path):
-        """Detect the file's encoding"""
-        with open(file_path, 'rb') as f:
-            result = chardet.detect(f.read())
-        return result['encoding'] or 'utf-8'
-    
-    def extract_from_html(self, html_path: str) -> List[Dict]:
-        """Extract activities from a single HTML file"""
-        activities = []
-        
-        try:
-            # Detect the encoding and read the file
-            encoding = self.detect_encoding(html_path)
-            with open(html_path, 'r', encoding=encoding, errors='ignore') as f:
-                content = f.read()
-            
-            soup = BeautifulSoup(content, 'lxml')
-            
-            # Method 1: look for activity lists
-            activities.extend(self._extract_from_lists(soup, html_path))
-            
-            # Method 2: look for activities in headings
-            activities.extend(self._extract_from_headings(soup, html_path))
-            
-            # Method 3: look for patterns in the text
-            activities.extend(self._extract_from_patterns(soup, html_path))
-            
-            # Method 4: look in tables
-            activities.extend(self._extract_from_tables(soup, html_path))
-            
-        except Exception as e:
-            print(f"Error processing {html_path}: {e}")
-        
-        return activities
-    
-    def _extract_from_lists(self, soup, source_file):
-        """Extract activities from HTML lists (ul, ol)"""
-        activities = []
-        
-        for list_elem in soup.find_all(['ul', 'ol']):
-            # Check whether the list appears to contain activities
-            list_text = list_elem.get_text().lower()
-            if any(marker in list_text for marker in ['joc', 'activitate', 'exercitiu']):
-                for li in list_elem.find_all('li'):
-                    text = li.get_text(strip=True)
-                    if len(text) > 20:  # Minimum 20 characters for a valid activity
-                        activity = self._create_activity_from_text(text, source_file)
-                        if activity:
-                            activities.append(activity)
-        
-        return activities
-    
-    def _extract_from_headings(self, soup, source_file):
-        """Extract activities based on headings"""
-        activities = []
-        
-        for heading in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
-            heading_text = heading.get_text(strip=True)
-            
-            # Check whether the heading contains keywords
-            if any(keyword in heading_text.lower() for keyword in ['joc', 'activitate', 'exercitiu']):
-                # Look for the description in the following elements
-                description = ""
-                next_elem = heading.find_next_sibling()
-                
-                while next_elem and next_elem.name not in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
-                    if next_elem.name in ['p', 'div', 'ul']:
-                        description += next_elem.get_text(strip=True) + " "
-                        if len(description) > 500:  # Description length limit
-                            break
-                    next_elem = next_elem.find_next_sibling()
-                
-                if description:
-                    activity = {
-                        'name': heading_text[:200],
-                        'description': description[:1000],
-                        'source_file': str(source_file),
-                        'category': self._detect_category(heading_text + " " + description)
-                    }
-                    activities.append(activity)
-        
-        return activities
-    
-    def _extract_from_patterns(self, soup, source_file):
-        """Extract activities using pattern matching"""
-        activities = []
-        text = soup.get_text()
-        
-        # Look for activity patterns
-        for pattern in self.activity_patterns['title_patterns']:
-            matches = re.finditer(pattern, text, re.MULTILINE)
-            for match in matches:
-                title = match.group(match.lastindex or 0)  # lastindex is None (never 0) when no group matched
-                if len(title) > 10:
-                    # Extract the context around the match
-                    start = max(0, match.start() - 200)
-                    end = min(len(text), match.end() + 500)
-                    context = text[start:end]
-                    
-                    activity = self._create_activity_from_text(context, source_file, title)
-                    if activity:
-                        activities.append(activity)
-        
-        return activities
-    
-    def _extract_from_tables(self, soup, source_file):
-        """Extract activities from tables"""
-        activities = []
-        
-        for table in soup.find_all('table'):
-            rows = table.find_all('tr')
-            if len(rows) > 1:  # At least a header and one data row
-                # Detect the relevant columns
-                headers = [th.get_text(strip=True).lower() for th in rows[0].find_all(['th', 'td'])]
-                
-                for row in rows[1:]:
-                    cells = row.find_all(['td'])
-                    if cells:
-                        activity_data = {}
-                        for i, cell in enumerate(cells):
-                            if i < len(headers):
-                                activity_data[headers[i]] = cell.get_text(strip=True)
-                        
-                        # Create an activity from the table data
-                        if any(key in activity_data for key in ['joc', 'activitate', 'nume', 'titlu']):
-                            activity = self._create_activity_from_table_data(activity_data, source_file)
-                            if activity:
-                                activities.append(activity)
-        
-        return activities
-    
-    def _create_activity_from_text(self, text, source_file, title=None):
-        """Create an activity dictionary from text"""
-        if not text or len(text) < 30:
-            return None
-        
-        activity = {
-            'name': title or text[:100].split('.')[0].strip(),
-            'description': text[:1000],
-            'source_file': str(source_file),
-            'category': self._detect_category(text),
-            'keywords': self._extract_keywords(text),
-            'created_at': datetime.now().isoformat()
-        }
-        
-        # Extract additional metadata
-        activity.update(self._extract_metadata(text))
-        
-        return activity
-    
-    def _create_activity_from_table_data(self, data, source_file):
-        """Create an activity from table data"""
-        activity = {
-            'source_file': str(source_file),
-            'created_at': datetime.now().isoformat()
-        }
-        
-        # Map table fields to DB fields
-        field_mapping = {
-            'nume': 'name', 'titlu': 'name', 'joc': 'name', 'activitate': 'name',
-            'descriere': 'description', 'detalii': 'description', 'explicatie': 'description',
-            'materiale': 'materials_list', 'echipament': 'materials_list',
-            'varsta': 'age_group_min', 'categoria': 'category',
-            'participanti': 'participants_min', 'numar': 'participants_min',
-            'durata': 'duration_min', 'timp': 'duration_min'
-        }
-        
-        for table_field, db_field in field_mapping.items():
-            if table_field in data:
-                activity[db_field] = data[table_field]
-        
-        # Minimal validation
-        if 'name' in activity and len(activity.get('name', '')) > 5:
-            return activity
-        
-        return None
-    
-    def _extract_metadata(self, text):
-        """Extract metadata from text using patterns"""
-        metadata = {}
-        
-        # Extract the age range
-        for pattern in self.activity_patterns['age_patterns']:
-            match = re.search(pattern, text)
-            if match:
-                metadata['age_group_min'] = int(match.group(1))
-                metadata['age_group_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
-                break
-        
-        # Extract the number of participants
-        for pattern in self.activity_patterns['participants_patterns']:
-            match = re.search(pattern, text)
-            if match:
-                metadata['participants_min'] = int(match.group(1))
-                metadata['participants_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
-                break
-        
-        # Extract the duration
-        for pattern in self.activity_patterns['duration_patterns']:
-            match = re.search(pattern, text)
-            if match:
-                metadata['duration_min'] = int(match.group(1))
-                metadata['duration_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
-                break
-        
-        # Extract the materials
-        materials = []
-        text_lower = text.lower()
-        for marker in self.activity_patterns['materials_markers']:
-            idx = text_lower.find(marker)
-            if idx != -1:
-                # Take the next 200 characters after the marker
-                materials_text = text[idx:idx+200]
-                # Extract items from the list
-                items = re.findall(r'[-"]\s*([^\n-"]+)', materials_text)
-                if items:
-                    materials.extend(items)
-        
-        if materials:
-            metadata['materials_list'] = ', '.join(materials[:10])  # At most 10 materials
-        
-        return metadata
-    
-    def _detect_category(self, text):
-        """Detect the activity's category based on keywords"""
-        text_lower = text.lower()
-        
-        for category, keywords in self.categories.items():
-            if any(keyword in text_lower for keyword in keywords):
-                return category
-        
-        return '[A]'  # Default: the games category
-    
-    def _extract_keywords(self, text):
-        """Extract keywords from text"""
-        keywords = []
-        text_lower = text.lower()
-        
-        # List of relevant keywords
-        keyword_list = [
-            'cooperare', 'competitie', 'echipa', 'creativitate', 'miscare',
-            'strategie', 'comunicare', 'incredere', 'coordonare', 'atentie',
-            'reflexe', 'logica', 'imaginatie', 'muzica', 'dans', 'sport',
-            'natura', 'mediu', 'stiinta', 'matematica', 'limba', 'cultura'
-        ]
-        
-        for keyword in keyword_list:
-            if keyword in text_lower:
-                keywords.append(keyword)
-        
-        return ', '.join(keywords[:5])  # At most 5 keywords
-    
-    def save_to_database(self, activities):
-        """Save the activities to the database"""
-        conn = sqlite3.connect(self.db_path)
-        cursor = conn.cursor()
-        
-        saved_count = 0
-        duplicate_count = 0
-        
-        for activity in activities:
-            try:
-                # Check for duplicates
-                cursor.execute(
-                    "SELECT id FROM activities WHERE name = ? AND source_file = ?",
-                    (activity.get('name'), activity.get('source_file'))
-                )
-                
-                if cursor.fetchone():
-                    duplicate_count += 1
-                    continue
-                
-                # Prepare the values for the insert
-                columns = []
-                values = []
-                placeholders = []
-                
-                for key, value in activity.items():
-                    if key != 'created_at':  # Skip created_at, it has default
-                        columns.append(key)
-                        values.append(value)
-                        placeholders.append('?')
-                
-                # Insert into the DB
-                query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
-                cursor.execute(query, values)
-                saved_count += 1
-                
-            except Exception as e:
-                print(f"Error saving activity: {e}")
-                continue
-        
-        conn.commit()
-        conn.close()
-        
-        return saved_count, duplicate_count
-    
-    def process_all_html_files(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
-        """Process all HTML files in the specified directory"""
-        base_path = Path(base_path)
-        html_files = list(base_path.rglob("*.html"))
-        html_files.extend(list(base_path.rglob("*.htm")))
-        
-        print(f"Found {len(html_files)} HTML files to process")
-        
-        all_activities = []
-        processed = 0
-        errors = 0
-        
-        for i, html_file in enumerate(html_files):
-            try:
-                activities = self.extract_from_html(str(html_file))
-                all_activities.extend(activities)
-                processed += 1
-                
-                # Progress update
-                if (i + 1) % 100 == 0:
-                    print(f"Progress: {i+1}/{len(html_files)} files processed, {len(all_activities)} activities found")
-                    # Save batch to DB
-                    if all_activities:
-                        saved, dupes = self.save_to_database(all_activities)
-                        print(f"Batch saved: {saved} new activities, {dupes} duplicates skipped")
-                        all_activities = []  # Clear buffer
-                
-            except Exception as e:
-                print(f"Error processing {html_file}: {e}")
-                errors += 1
-        
-        # Save remaining activities
-        if all_activities:
-            saved, dupes = self.save_to_database(all_activities)
-            print(f"Final batch saved: {saved} new activities, {dupes} duplicates skipped")
-        
-        print(f"\nProcessing complete!")
-        print(f"Files processed: {processed}")
-        print(f"Errors: {errors}")
-        
-        return processed, errors
-
-# Main entry point for testing
-if __name__ == "__main__":
-    extractor = HTMLActivityExtractor()
-    
-    # Test on a sample file first
-    print("Testing on sample file first...")
-    # Find a few HTML files for the test
-    test_files = list(Path('/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri').rglob("*.html"))[:3]
-    
-    for test_file in test_files:
-        print(f"\nTesting: {test_file}")
-        activities = extractor.extract_from_html(str(test_file))
-        print(f"Found {len(activities)} activities")
-        if activities:
-            print(f"Sample activity: {activities[0]['name'][:50]}...")
-    
-    # Ask whether to continue with full processing
-    response = input("\nContinue with full processing? (y/n): ")
-    if response.lower() == 'y':
-        extractor.process_all_html_files()
--- a/scripts/import_claude_activities.py
+++ b/scripts/import_claude_activities.py
@@ -1,78 +0,0 @@
-#!/usr/bin/env python3
-"""
-Import activities extracted by Claude from JSON files
-"""
-
-import json
-import sqlite3
-from pathlib import Path
-from datetime import datetime
-
-class ClaudeActivityImporter:
-    def __init__(self, db_path='data/activities.db'):
-        self.db_path = db_path
-        self.json_dir = Path('scripts/extracted_activities')
-        self.json_dir.mkdir(exist_ok=True)
-    
-    def import_json_file(self, json_path):
-        """Import activities from a single JSON file"""
-        with open(json_path, 'r', encoding='utf-8') as f:
-            data = json.load(f)
-        
-        source_file = data.get('source_file', str(json_path))
-        activities = data.get('activities', [])
-        
-        conn = sqlite3.connect(self.db_path)
-        cursor = conn.cursor()
-        
-        imported = 0
-        for activity in activities:
-            try:
-                # Add source file and timestamp
-                activity['source_file'] = source_file
-                activity['created_at'] = datetime.now().isoformat()
-                
-                # Prepare insert
-                columns = list(activity.keys())
-                values = list(activity.values())
-                placeholders = ['?' for _ in values]
-                
-                # Check for duplicate
-                cursor.execute(
-                    "SELECT id FROM activities WHERE name = ? AND source_file = ?",
-                    (activity.get('name'), source_file)
-                )
-                
-                if not cursor.fetchone():
-                    query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
-                    cursor.execute(query, values)
-                    imported += 1
-                    
-            except Exception as e:
-                print(f"Error importing activity: {e}")
-        
-        conn.commit()
-        conn.close()
-        
-        print(f"Imported {imported} activities from {json_path.name}")
-        return imported
-    
-    def import_all_json_files(self):
-        """Import all JSON files from the extracted_activities directory"""
-        json_files = list(self.json_dir.glob("*.json"))
-        
-        if not json_files:
-            print("No JSON files found in extracted_activities directory")
-            return 0
-        
-        total_imported = 0
-        for json_file in json_files:
-            imported = self.import_json_file(json_file)
-            total_imported += imported
-        
-        print(f"\nTotal imported: {total_imported} activities from {len(json_files)} files")
-        return total_imported
-
-if __name__ == "__main__":
-    importer = ClaudeActivityImporter()
-    importer.import_all_json_files()
--- a/scripts/import_common.py
+++ b/scripts/import_common.py
@@ -0,0 +1,179 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+import_common.py — shared helpers for the import / validation side of the
+extraction pipeline (Lane C).
+
+Used by build_database.py and validate_extractions.py:
+  * JSON-schema validation of subagent extraction files,
+  * the anti-hallucination source_excerpt substring check (E5),
+  * locating the source chunk that an extraction file came from,
+  * the stable content key used by the needs_review queue.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import json
+import re
+import unicodedata
+from pathlib import Path
+from typing import Any, Optional
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+
+DEFAULT_SCHEMA_PATH = SCRIPT_DIR / "activity_schema.json"
+
+# rapidfuzz.partial_ratio is on a 0..100 scale; an excerpt counts as a real
+# quote from the source when it scores at least this against the chunk text.
+EXCERPT_MATCH_THRESHOLD = 90.0
+
+
+# --------------------------------------------------------------------------
+# schema validation
+# --------------------------------------------------------------------------
+def load_schema(schema_path: str | Path = DEFAULT_SCHEMA_PATH) -> dict:
+    """Load the activity JSON schema produced by Lane A."""
+    return json.loads(Path(schema_path).read_text(encoding="utf-8"))
+
+
+def validate_extraction(data: Any, schema: dict) -> list[str]:
+    """
+    Validate one parsed extraction file against `schema`.
+
+    Returns a list of human-readable error strings; empty list == valid.
+    """
+    import jsonschema
+
+    validator = jsonschema.Draft7Validator(schema)
+    errors: list[str] = []
+    for err in sorted(validator.iter_errors(data), key=lambda e: list(e.path)):
+        location = "/".join(str(p) for p in err.path) or "<root>"
+        errors.append(f"{location}: {err.message}")
+    return errors
+
+
+# --------------------------------------------------------------------------
+# excerpt verification (E5 — anti-hallucination)
+# --------------------------------------------------------------------------
+def _normalize_text(text: str) -> str:
+    return re.sub(r"\s+", " ", (text or "")).strip().lower()
+
+
+def excerpt_score(excerpt: str, chunk_text: str) -> float:
+    """Best fuzzy-substring score (0..100) of `excerpt` inside `chunk_text`."""
+    from rapidfuzz import fuzz
+
+    if not excerpt or not chunk_text:
+        return 0.0
+    return float(fuzz.partial_ratio(_normalize_text(excerpt), _normalize_text(chunk_text)))
+
+
+def excerpt_matches(
+    excerpt: str, chunk_text: str, threshold: float = EXCERPT_MATCH_THRESHOLD
+) -> bool:
+    """True when `excerpt` appears (fuzzily) as a substring of `chunk_text`."""
+    return excerpt_score(excerpt, chunk_text) >= threshold
+
+
+# --------------------------------------------------------------------------
+# locating the source chunk an extraction file came from
+# --------------------------------------------------------------------------
+def chunk_key_for(json_path: Path, header: Optional[dict]) -> str:
+    """
+    Resolve the chunk key for an extraction file.
+
+    Prefers the explicit `chunk_key` in the header, otherwise falls back to the
+    JSON file stem (extraction files are named `<chunk_key>.json`).
+    """
+    if header and header.get("chunk_key"):
+        return str(header["chunk_key"])
+    return json_path.stem
+
+
+def source_id_for(chunk_key: str, header: Optional[dict]) -> str:
+    """Resolve the source id; `<source_id>.partNN` → `<source_id>`."""
+    if header and header.get("source_id"):
+        return str(header["source_id"])
+    # chunk keys look like "<source_id>.partNN"
+    return chunk_key.rsplit(".part", 1)[0]
+
+
+def find_chunk_text(
+    json_path: Path, header: Optional[dict], chunks_dir: Path
+) -> Optional[str]:
+    """
+    Return the text of the source chunk for an extraction file, or None.
+
+    Looks for data/chunks/<source_id>/<chunk_key>.txt, then falls back to a
+    recursive glob on the chunk key.
+    """
+    chunk_key = chunk_key_for(json_path, header)
+    source_id = source_id_for(chunk_key, header)
+
+    candidate = chunks_dir / source_id / f"{chunk_key}.txt"
+    if candidate.is_file():
+        return candidate.read_text(encoding="utf-8", errors="replace")
+
+    matches = list(chunks_dir.rglob(f"{chunk_key}.txt"))
+    if matches:
+        return matches[0].read_text(encoding="utf-8", errors="replace")
+    return None
+
+
+def source_path_for(source_id: str, sources_dir: Path) -> Optional[str]:
+    """
+    Read the original `SOURCE:` path from a normalized source header.
+
+    data/sources/<source_id>.txt starts with a `SOURCE: <relative path>` line.
+    """
+    src_file = sources_dir / f"{source_id}.txt"
+    if not src_file.is_file():
+        return None
+    try:
+        with src_file.open(encoding="utf-8", errors="replace") as fh:
+            for line in fh:
+                if line.startswith("SOURCE:"):
+                    return line.split(":", 1)[1].strip()
+                if line.startswith("=") or line.startswith("--- PAGE "):
+                    break
+    except OSError:
+        return None
+    return None
+
+
+# --------------------------------------------------------------------------
+# stable content key for the needs_review queue (plan §5c)
+# --------------------------------------------------------------------------
+def normalize_name(name: str) -> str:
+    """Diacritic-free, lowercased, whitespace-collapsed name (dedup key)."""
+    if not name:
+        return ""
+    decomposed = unicodedata.normalize("NFKD", name)
+    ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
+    return re.sub(r"\s+", " ", ascii_str.lower().strip())
+
+
+def content_key(normalized_name: str, language: Optional[str], description: str) -> str:
+    """
+    Stable hash identifying a row for the review queue.
+
+    Only borderline-kept-separate rows and legacy `.doc` rows ever carry
+    needs_review, and neither is auto-merged — so their (normalized_name,
+    language, description) triple is stable across rebuilds.
+    """
+    payload = f"{normalized_name}\x1f{language or ''}\x1f{_normalize_text(description)}"
+    return hashlib.sha1(payload.encode("utf-8")).hexdigest()
+
+
+# --------------------------------------------------------------------------
+# iteration
+# --------------------------------------------------------------------------
+def iter_extraction_files(extracted_dir: Path):
+    """Yield every *.json directly under `extracted_dir` (skips _rejected/)."""
+    if not extracted_dir.is_dir():
+        return
+    for path in sorted(extracted_dir.glob("*.json")):
+        if path.is_file():
+            yield path
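Review note on scripts/import_common.py: the review-queue key above can be exercised on its own. The following is a minimal sketch, assuming only the standard library. `_normalize_text`, `normalize_name` and `content_key` are copied from the hunk above; the sample names and descriptions are hypothetical and exist only to show which differences the key absorbs (diacritics, case, whitespace) and which it keeps (language).

```python
import hashlib
import re
import unicodedata
from typing import Optional


def _normalize_text(text: str) -> str:
    # Collapse whitespace runs, trim, lowercase (diacritics are kept here).
    return re.sub(r"\s+", " ", (text or "")).strip().lower()


def normalize_name(name: str) -> str:
    # NFKD splits accented letters into base letter + combining mark,
    # then the combining marks are dropped.
    if not name:
        return ""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", ascii_str.lower().strip())


def content_key(normalized_name: str, language: Optional[str], description: str) -> str:
    # Fields are joined with the ASCII unit separator so they cannot run together.
    payload = f"{normalized_name}\x1f{language or ''}\x1f{_normalize_text(description)}"
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Hypothetical rows: same activity typed two ways, then the same text without a language.
key_a = content_key(normalize_name("Vânătoarea  de Comori"), "ro", "Echipele cauta\nindicii.")
key_b = content_key(normalize_name("vanatoarea de comori"), "ro", "echipele cauta indicii.")
key_c = content_key(normalize_name("vanatoarea de comori"), None, "echipele cauta indicii.")

print(key_a == key_b)  # same key despite diacritics, case and whitespace differences
print(key_a == key_c)  # different key: the language field is part of the payload
```

Only the name is stripped of diacritics; the description is merely lowercased and whitespace-collapsed, so a description that differs in diacritics would yield a different key.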
--- a/scripts/normalize_sources.py
+++ b/scripts/normalize_sources.py
@@ -0,0 +1,255 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+normalize_sources.py — walk data/carti-camp-jocuri/ and write data/sources/<id>.txt.
+
+Output files keep the existing header format:
+
+    SOURCE: <original relative path>
+    CONVERTED: <iso date>
+    FORMAT: <pdf|docx|doc|pptx|html-mirror|zip>
+    NEEDS_REVIEW: <reason>            (optional — legacy .doc conversions)
+    ==================================================
+
+    --- PAGE 1 ---
+    ...
+
+Each source gets a stable id = <8-hex hash of relative path>_<sanitized stem>,
+so two files with the same name in different folders never collide.
+
+The pipeline is script-only: this normalizes formats, it does NOT run extraction.
+Run `--check-deps` before a long job.
+"""
+
+from __future__ import annotations
+
+import argparse
+import datetime as _dt
+import hashlib
+import re
+import sys
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+if str(SCRIPT_DIR) not in sys.path:
+    sys.path.insert(0, str(SCRIPT_DIR))
+
+from extract_common import (  # noqa: E402
+    count_page_markers,
+    dedupe_texts,
+    detect_format,
+    extract_file,
+    extract_html,
+    is_junk,
+    join_pages,
+    preflight,
+    split_pages,
+)
+
+HEADER_RULE = "=" * 50
+
+
+# --------------------------------------------------------------------------
+# stable source id
+# --------------------------------------------------------------------------
+def sanitize_stem(stem: str) -> str:
+    s = re.sub(r"[^\w]+", "_", stem, flags=re.UNICODE).strip("_").lower()
+    return s[:60] or "source"
+
+
+def stable_id(relative_path: str | Path) -> str:
+    """Collision-proof id derived from the path relative to the corpus root."""
+    rel = str(relative_path).replace("\\", "/")
+    digest = hashlib.sha1(rel.encode("utf-8")).hexdigest()[:8]
+    stem = sanitize_stem(Path(rel).stem)
+    return f"{digest}_{stem}"
+
+
+# --------------------------------------------------------------------------
+# header
+# --------------------------------------------------------------------------
+def build_header(
+    source_rel: str, fmt: str, needs_review: str | None = None
+) -> str:
+    today = _dt.date.today().isoformat()
+    lines = [
+        f"SOURCE: {source_rel}",
+        f"CONVERTED: {today}",
+        f"FORMAT: {fmt}",
+    ]
+    if needs_review:
+        lines.append(f"NEEDS_REVIEW: {needs_review}")
+    lines.append(HEADER_RULE)
+    return "\n".join(lines) + "\n\n"
+
+
+# --------------------------------------------------------------------------
+# mirror-site directories
+# --------------------------------------------------------------------------
+MIRROR_PAGE_EXTS = {".html", ".htm"}
+
+
+def is_mirror_dir(path: Path) -> bool:
+    """A directory counts as a site mirror if it contains HTML pages."""
+    if not path.is_dir():
+        return False
+    if path.name.endswith("_files"):
+        return False
+    return any(
+        p.suffix.lower() in MIRROR_PAGE_EXTS
+        for p in path.rglob("*")
+        if p.is_file()
+    )
+
+
+def normalize_mirror(mirror_dir: Path) -> str:
+    """Extract every HTML page in a mirror dir, dedupe near-duplicates, join."""
+    pages: list[tuple[str, str]] = []
+    for html in sorted(mirror_dir.rglob("*")):
+        if not html.is_file() or html.suffix.lower() not in MIRROR_PAGE_EXTS:
+            continue
+        if "_files" in html.parts:
+            continue
+        try:
+            body = extract_html(html)
+        except Exception:
+            continue
+        text = "\n".join(t for _, t in split_pages(body))
+        if text.strip():
+            pages.append((str(html.relative_to(mirror_dir)), text))
+    pages = dedupe_texts(pages)
+    return join_pages([t for _, t in pages])
+
+
+# --------------------------------------------------------------------------
+# one source
+# --------------------------------------------------------------------------
+def normalize_one(
+    path: Path, corpus_root: Path, out_dir: Path
+) -> dict | None:
+    """
+    Normalize a single file or mirror directory → data/sources/<id>.txt.
+
+    Returns a result dict, or None if the entry was skipped (junk / ignored).
+    """
+    rel = path.relative_to(corpus_root)
+    sid = stable_id(rel)
+
+    if path.is_dir():
+        if not is_mirror_dir(path):
+            return None
+        fmt, needs_review = "html-mirror", None
+        body = normalize_mirror(path)
+    else:
+        if is_junk(path):
+            return None
+        fmt = detect_format(path)
+        if fmt in ("unknown", "epub", "txt"):
+            return None  # epub duplicates PDFs; txt is not a source format here
+        needs_review = "legacy .doc conversion is imperfect" if fmt == "doc" else None
+        try:
+            body = extract_file(path)
+        except Exception as exc:  # noqa: BLE001
+            return {"id": sid, "source": str(rel), "status": "error", "error": str(exc)}
+
+    if not body.strip():
+        return {"id": sid, "source": str(rel), "status": "empty"}
+
+    out_path = out_dir / f"{sid}.txt"
+    out_path.write_text(build_header(str(rel), fmt, needs_review) + body,
+                        encoding="utf-8")
+    return {
+        "id": sid,
+        "source": str(rel),
+        "status": "ok",
+        "format": fmt,
+        "pages": count_page_markers(body),
+        "needs_review": bool(needs_review),
+    }
+
+
+# --------------------------------------------------------------------------
+# walk
+# --------------------------------------------------------------------------
+def iter_corpus_entries(corpus_root: Path):
+    """Yield top-level files and mirror directories under the corpus root."""
+    for entry in sorted(corpus_root.iterdir()):
+        if entry.name.startswith("."):
+            continue
+        if entry.is_dir():
+            if is_mirror_dir(entry):
+                yield entry
+        else:
+            yield entry
+
+
+def run(corpus_root: Path, out_dir: Path) -> dict:
+    out_dir.mkdir(parents=True, exist_ok=True)
+    results: list[dict] = []
+    for entry in iter_corpus_entries(corpus_root):
+        res = normalize_one(entry, corpus_root, out_dir)
+        if res is not None:
+            results.append(res)
+    summary = {
+        "total": len(results),
+        "ok": sum(1 for r in results if r["status"] == "ok"),
+        "errors": sum(1 for r in results if r["status"] == "error"),
+        "empty": sum(1 for r in results if r["status"] == "empty"),
+        "needs_review": sum(1 for r in results if r.get("needs_review")),
+        "results": results,
+    }
+    return summary
+
+
+# --------------------------------------------------------------------------
+# CLI
+# --------------------------------------------------------------------------
+def print_preflight(report: dict) -> int:
+    print("Dependency preflight")
+    print("--------------------")
+    if report["missing_python"]:
+        print("  MISSING Python packages: " + ", ".join(report["missing_python"]))
+    else:
+        print("  Python packages: OK")
+    if report["missing_system"]:
+        print("  MISSING system tools  : " + ", ".join(report["missing_system"]))
+    for w in report["warnings"]:
+        print(f"  WARNING: {w}")
+    print("  => " + ("READY" if report["ok"] else "NOT READY — install the above"))
+    return 0 if report["ok"] else 1
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(description="Normalize mixed sources to .txt")
+    parser.add_argument("--corpus", default="data/carti-camp-jocuri",
+                        help="corpus root to walk")
+    parser.add_argument("--out", default="data/sources", help="output directory")
+    parser.add_argument("--check-deps", action="store_true",
+                        help="run dependency preflight and exit")
+    parser.add_argument("--ocr", action="store_true",
+                        help="include OCR (tesseract) in the preflight check")
+    args = parser.parse_args(argv)
+
+    if args.check_deps:
+        return print_preflight(preflight(check_ocr=args.ocr))
+
+    report = preflight(check_ocr=args.ocr)
+    if report["missing_python"]:
+        print_preflight(report)
+        return 1
+    for w in report["warnings"]:
+        print(f"WARNING: {w}")
+
+    summary = run(Path(args.corpus), Path(args.out))
+    print(f"normalized : {summary['ok']}/{summary['total']}")
+    print(f"errors     : {summary['errors']}")
+    print(f"empty      : {summary['empty']}")
+    print(f"needs_review: {summary['needs_review']}")
+    for r in summary["results"]:
+        if r["status"] != "ok":
+            print(f"  [{r['status']}] {r['source']}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
--- a/scripts/pdf_extractor.py
+++ b/scripts/pdf_extractor.py
--- a/scripts/pdf_to_text_converter.py
+++ b/scripts/pdf_to_text_converter.py
@@ -1,143 +0,0 @@
-#!/usr/bin/env python3
-"""
-PDF Mass Conversion to Text for Activity Extraction
-Handles all PDF sizes efficiently with multiple fallback methods
-"""
-
-import os
-import json
-from pathlib import Path
-import PyPDF2
-import pdfplumber
-from typing import List, Dict
-import logging
-
-class PDFConverter:
-    def __init__(self, max_pages=50):
-        self.max_pages = max_pages
-        self.conversion_stats = {}
-    
-    def convert_pdf_to_text(self, pdf_path: str) -> str:
-        """Convert PDF to text using multiple methods with fallbacks"""
-        try:
-            # Method 1: pdfplumber (best for tables and layout)
-            return self._convert_with_pdfplumber(pdf_path)
-        except Exception as e:
-            print(f"pdfplumber failed for {pdf_path}: {e}")
-            
-            try:
-                # Method 2: PyPDF2 (fallback)
-                return self._convert_with_pypdf2(pdf_path)
-            except Exception as e2:
-                print(f"PyPDF2 also failed for {pdf_path}: {e2}")
-                return ""
-    
-    def _convert_with_pdfplumber(self, pdf_path: str) -> str:
-        """Primary conversion method using pdfplumber"""
-        text_content = ""
-        
-        with pdfplumber.open(pdf_path) as pdf:
-            total_pages = len(pdf.pages)
-            pages_to_process = min(total_pages, self.max_pages)
-            
-            print(f"  Converting {pdf_path}: {pages_to_process}/{total_pages} pages")
-            
-            for i, page in enumerate(pdf.pages[:pages_to_process]):
-                try:
-                    page_text = page.extract_text()
-                    if page_text:
-                        text_content += f"\n--- PAGE {i+1} ---\n"
-                        text_content += page_text
-                        text_content += "\n"
-                except Exception as e:
-                    print(f"    Error on page {i+1}: {e}")
-                    continue
-        
-        self.conversion_stats[pdf_path] = {
-            'method': 'pdfplumber',
-            'pages_processed': pages_to_process,
-            'total_pages': total_pages,
-            'success': True,
-            'text_length': len(text_content)
-        }
-        
-        return text_content
-    
-    def _convert_with_pypdf2(self, pdf_path: str) -> str:
-        """Fallback conversion method using PyPDF2"""
-        text_content = ""
-        
-        with open(pdf_path, 'rb') as file:
-            reader = PyPDF2.PdfReader(file)
-            total_pages = len(reader.pages)
-            pages_to_process = min(total_pages, self.max_pages)
-            
-            print(f"  Converting {pdf_path} (fallback): {pages_to_process}/{total_pages} pages")
-            
-            for i in range(pages_to_process):
-                try:
-                    page = reader.pages[i]
-                    page_text = page.extract_text()
-                    if page_text:
-                        text_content += f"\n--- PAGE {i+1} ---\n"
-                        text_content += page_text
-                        text_content += "\n"
-                except Exception as e:
-                    print(f"    Error on page {i+1}: {e}")
-                    continue
-        
-        self.conversion_stats[pdf_path] = {
-            'method': 'PyPDF2',
-            'pages_processed': pages_to_process,
-            'total_pages': total_pages,
-            'success': True,
-            'text_length': len(text_content)
-        }
-        
-        return text_content
-    
-    def convert_all_pdfs(self, pdf_directory: str, output_directory: str):
-        """Convert all PDFs in directory to text files"""
-        pdf_files = list(Path(pdf_directory).glob("**/*.pdf"))
-        
-        print(f"🔄 Converting {len(pdf_files)} PDF files to text...")
-        
-        os.makedirs(output_directory, exist_ok=True)
-        
-        for i, pdf_path in enumerate(pdf_files):
-            print(f"\n[{i+1}/{len(pdf_files)}] Processing {pdf_path.name}...")
-            
-            # Convert to text
-            text_content = self.convert_pdf_to_text(str(pdf_path))
-            
-            if text_content.strip():
-                # Save as text file
-                output_file = Path(output_directory) / f"{pdf_path.stem}.txt"
-                with open(output_file, 'w', encoding='utf-8') as f:
-                    f.write(f"SOURCE: {pdf_path}\n")
-                    f.write(f"CONVERTED: 2025-01-11\n")
-                    f.write("="*50 + "\n\n")
-                    f.write(text_content)
-                
-                print(f"  ✅ Saved: {output_file}")
-            else:
-                print(f"  ❌ No text extracted from {pdf_path.name}")
-        
-        # Save conversion statistics
-        stats_file = Path(output_directory) / "conversion_stats.json"
-        with open(stats_file, 'w', encoding='utf-8') as f:
-            json.dump(self.conversion_stats, f, indent=2, ensure_ascii=False)
-        
-        print(f"\n🎉 PDF conversion complete! Check {output_directory}")
-        return len([f for f in self.conversion_stats.values() if f['success']])
-
-# Usage
-if __name__ == "__main__":
-    converter = PDFConverter(max_pages=50)
-    
-    # Convert all PDFs
-    pdf_dir = "/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri"
-    output_dir = "/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri/INDEX-SISTEM-JOCURI/converted_pdfs"
-    
-    converted_count = converter.convert_all_pdfs(pdf_dir, output_dir)
-    print(f"Final result: {converted_count} PDFs successfully converted")
--- a/scripts/repair_extractions.py
+++ b/scripts/repair_extractions.py
@@ -0,0 +1,244 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+repair_extractions.py — one-shot repair of malformed extraction JSON.
+
+Subagents systematically emit unescaped ASCII double-quotes inside string
+values (Romanian text like  „Unu"  uses a closing " that terminates the JSON
+string early). Re-extraction reproduces the bug, so we repair instead.
+
+IMPORTANT — why NOT json_repair: json_repair "recovers" an unescaped quote by
+ending the string at the stray quote and reinterpreting the trailing text as a
+new key, which (a) TRUNCATES the value and (b) injects garbage keys. The
+truncation is silent (the field is still non-empty) and slips past a naive
+presence check. So we use a faithful char-scanner that ESCAPES stray quotes
+(\\") instead of splitting on them, then validate the result against the real
+activity schema (additionalProperties:false also catches any residual split).
+
+This is an OFFLINE maintenance tool. build_database.py must NOT depend on it —
+the "DB regenerable from data/extracted/" invariant requires plain valid JSON on
+disk. We write clean JSON back to data/extracted/ and the build reads vanilla
+json.
+
+Source selection (faithful recovery needs the ORIGINAL malformed text):
+  * a chunk is a candidate when a MALFORMED original exists — either the
+    top-level data/extracted/<key>.json is itself invalid, or a malformed
+    original sits in data/extracted/_rejected/<key>.json.
+  * the malformed original is preferred as the repair source.
+  * chunks whose only artifact is already-valid JSON (e.g. a prior json_repair
+    output that lost the original) are NOT silently "repaired" — if such a chunk
+    has no valid top-level file it is reported as needing RE-EXTRACTION.
+
+Usage:
+    python scripts/repair_extractions.py            # report only (dry run)
+    python scripts/repair_extractions.py --apply     # write repaired JSON
+"""
+
+from __future__ import annotations
+
+import argparse
+import glob
+import json
+from pathlib import Path
+from typing import Optional
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+EXTRACTED = REPO_ROOT / "data" / "extracted"
+REJECTED = EXTRACTED / "_rejected"
+
+if str(SCRIPT_DIR) not in __import__("sys").path:
+    __import__("sys").path.insert(0, str(SCRIPT_DIR))
+from import_common import DEFAULT_SCHEMA_PATH, load_schema, validate_extraction  # noqa: E402
+
+
+def escape_stray_quotes(s: str) -> str:
+    """Escape ASCII double-quotes that occur INSIDE a JSON string value.
+
+    A `"` inside a string is treated as a real string-close only when the next
+    non-whitespace char is structural (`,` `}` `]` `:`) or EOF; otherwise it is
+    content and is escaped to `\\"`. This preserves the full value instead of
+    truncating it (the json_repair failure mode).
+    """
+    out: list[str] = []
+    in_str = False
+    esc = False
+    n = len(s)
+    i = 0
+    while i < n:
+        c = s[i]
+        if esc:
+            out.append(c)
+            esc = False
+            i += 1
+            continue
+        if c == "\\":
+            out.append(c)
+            esc = True
+            i += 1
+            continue
+        if c == '"':
+            if not in_str:
+                in_str = True
+                out.append(c)
+            else:
+                j = i + 1
+                while j < n and s[j] in " \t\r\n":
+                    j += 1
+                nxt = s[j] if j < n else ""
+                if nxt in ",}]:" or nxt == "":
+                    in_str = False
+                    out.append(c)
+                else:
+                    out.append('\\"')  # content quote → escape, keep value whole
+            i += 1
+            continue
+        out.append(c)
+        i += 1
+    return "".join(out)
+
+
+def _is_valid_json(path: Path) -> bool:
+    try:
+        json.loads(path.read_text(encoding="utf-8"))
+        return True
+    except (json.JSONDecodeError, OSError):
+        return False
+
+
+def _malformed_source(key: str) -> Optional[Path]:
+    """Return the malformed-original file for a chunk, preferring top-level."""
+    live = EXTRACTED / f"{key}.json"
+    if live.exists() and not _is_valid_json(live):
+        return live
+    rej = REJECTED / f"{key}.json"
+    if rej.exists() and not _is_valid_json(rej):
+        return rej
+    return None
+
+
+def _candidate_keys() -> tuple[dict[str, Path], list[str]]:
+    """
+    Return a pair (repair_candidates, needs_reextraction).
+
+    repair_candidates: key -> malformed source file (faithfully repairable).
+    needs_reextraction: chunks with no malformed original AND no valid
+    top-level file (their original was lost) — must be re-extracted.
+    """
+    keys = set()
+    for fn in glob.glob(str(EXTRACTED / "*.json")):
+        keys.add(Path(fn).stem)
+    for fn in glob.glob(str(REJECTED / "*.json")):
+        keys.add(Path(fn).stem)
+
+    candidates: dict[str, Path] = {}
+    needs_reextraction: list[str] = []
+    for key in sorted(keys):
+        # A malformed original anywhere is faithfully repairable, and is the
+        # source of truth even if a (json_repair-produced, possibly truncated)
+        # valid top-level file exists — escaping the original never truncates,
+        # so re-repairing from it is always >= the json_repair output.
+        src = _malformed_source(key)
+        if src is not None:
+            candidates[key] = src
+            continue
+        live = EXTRACTED / f"{key}.json"
+        if live.exists() and _is_valid_json(live):
+            continue  # genuinely-valid extraction, nothing to do
+        # no valid top-level and no malformed original to repair from
+        needs_reextraction.append(key)
+    return candidates, needs_reextraction
+
+
+def repair(apply: bool) -> int:
+    schema = load_schema(DEFAULT_SCHEMA_PATH)
+    candidates, needs_reextraction = _candidate_keys()
+
+    print("=" * 64)
+    print(f"REPAIR EXTRACTIONS  ({'APPLY' if apply else 'dry run'})")
+    print("=" * 64)
+    print(f"repair candidates: {len(candidates)}")
+
+    def _textlen(data: dict) -> int:
+        total = 0
+        for a in data.get("activities", []):
+            if isinstance(a, dict):
+                for v in a.values():
+                    if isinstance(v, str):
+                        total += len(v)
+        return total
+
+    ok = 0
+    kept_toplevel = 0
+    still_bad: list[str] = []
+    schema_fail: list[tuple[str, str]] = []
+
+    for key, src in candidates.items():
+        live = EXTRACTED / f"{key}.json"
+        live_valid = live.exists() and _is_valid_json(live)
+
+        raw = src.read_text(encoding="utf-8")
+        fixed = escape_stray_quotes(raw)
+        try:
+            data = json.loads(fixed)
+        except json.JSONDecodeError as exc:
+            if live_valid:
+                kept_toplevel += 1  # genuine top-level is fine; stale _rejected
+            else:
+                still_bad.append(f"{key}: still invalid after escape ({exc})")
+            continue
+        errors = validate_extraction(data, schema)
+        if errors:
+            if live_valid:
+                kept_toplevel += 1
+            else:
+                schema_fail.append((key, errors[0]))
+                print(f"  {key[:50]:<50} SCHEMA-FAIL: {errors[0][:40]}")
+            continue
+
+        # Faithfulness guard: only replace a valid top-level when the escaped
+        # repair carries STRICTLY more text (i.e. the top-level was a truncated
+        # json_repair output). Genuine extractions are kept untouched.
+        if live_valid:
+            try:
+                live_data = json.loads(live.read_text(encoding="utf-8"))
+            except json.JSONDecodeError:
+                live_data = {}
+            if _textlen(data) <= _textlen(live_data):
+                kept_toplevel += 1
+                continue
+
+        n = len(data.get("activities", []))
+        print(f"  {key[:50]:<50} {n:>3} acts  REPAIR")
+        if apply:
+            live.write_text(
+                json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
+            )
+        ok += 1
+
+    print("-" * 64)
+    print(f"repaired: {ok} | kept genuine top-level: {kept_toplevel} | "
+          f"schema-fail: {len(schema_fail)} | still-bad: {len(still_bad)} | "
+          f"needs re-extraction: {len(needs_reextraction)}")
+    for key, err in schema_fail:
+        print(f"  ⚠ schema {key}: {err[:60]}")
+    for msg in still_bad:
+        print(f"  ✘ {msg}")
+    for key in needs_reextraction:
+        print(f"  ↻ re-extract: {key}")
+    if not apply:
+        print("\nDry run — re-run with --apply to write repaired JSON.")
+    print("=" * 64)
+    return 0
+
+
+def main(argv: Optional[list[str]] = None) -> int:
+    parser = argparse.ArgumentParser(description="Repair malformed extraction JSON.")
+    parser.add_argument("--apply", action="store_true",
+                        help="write repaired JSON (default: dry run)")
+    args = parser.parse_args(argv)
+    return repair(args.apply)
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
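Review sketch: the quote heuristic described in the `escape_stray_quotes` docstring can be exercised on its own. The block below is a condensed restatement under stated assumptions, not the code from this diff: `escape_stray_quotes_sketch` is a hypothetical name and the sample payload is invented for illustration.

```python
import json


def escape_stray_quotes_sketch(s: str) -> str:
    """Condensed restatement of the heuristic (hypothetical helper)."""
    out = []
    in_str = False
    i, n = 0, len(s)
    while i < n:
        c = s[i]
        if c == "\\" and i + 1 < n:
            # Copy an existing escape sequence through untouched.
            out.append(s[i:i + 2])
            i += 2
            continue
        if c == '"':
            if not in_str:
                in_str = True
                out.append(c)
            else:
                # Peek at the next non-whitespace character.
                j = i + 1
                while j < n and s[j] in " \t\r\n":
                    j += 1
                nxt = s[j] if j < n else ""
                if nxt == "" or nxt in ",}]:":
                    # Structural follower: this quote really closes the string.
                    in_str = False
                    out.append(c)
                else:
                    # Content quote: escape it so the value stays whole.
                    out.append('\\"')
            i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)


# Invented sample: the ASCII quote after „Unu ends the JSON string too early.
broken = '{"name": "Jocul „Unu" pentru copii", "language": "ro"}'
fixed = escape_stray_quotes_sketch(broken)
data = json.loads(fixed)
print(data["name"])
```

One limitation worth keeping in mind: a content quote that happens to be followed by `,`, `}`, `]` or `:` is still read as a string close, which is presumably why the script validates the repaired result against the schema afterwards.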
--- a/scripts/review_queue.py
+++ b/scripts/review_queue.py
@@ -0,0 +1,145 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+review_queue.py — CLI for the needs_review lifecycle (plan §5c).
+
+Rows land in the queue when dedup leaves a borderline pair separate, or when a
+legacy `.doc` source was converted imperfectly. Each row has a stable content
+key; a decision written here is stored in data/review_decisions.json (git
+tracked) and re-applied by build_database.py on every rebuild, so the queue
+never resurfaces a resolved row.
+
+Commands:
+    python scripts/review_queue.py list
+    python scripts/review_queue.py resolve <id> <merge|keep-separate|drop>
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sqlite3
+import sys
+from pathlib import Path
+from typing import Optional
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
+    if _p not in sys.path:
+        sys.path.insert(0, _p)
+
+from import_common import content_key, normalize_name  # noqa: E402
+
+VALID_DECISIONS = ("merge", "keep-separate", "drop")
+
+
+# --------------------------------------------------------------------------
+# review_decisions.json
+# --------------------------------------------------------------------------
+def load_decisions(path: Path) -> dict:
+    if path.is_file():
+        try:
+            data = json.loads(path.read_text(encoding="utf-8"))
+            if isinstance(data, dict):
+                return data
+        except (json.JSONDecodeError, OSError):
+            pass
+    return {}
+
+
+def save_decisions(decisions: dict, path: Path) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(
+        json.dumps(decisions, indent=2, ensure_ascii=False, sort_keys=True),
+        encoding="utf-8",
+    )
+
+
+# --------------------------------------------------------------------------
+# queue
+# --------------------------------------------------------------------------
+def list_queue(db_path: Path) -> list[dict]:
+    """Return every needs_review row in the current DB, with its content key."""
+    if not db_path.is_file():
+        return []
+    conn = sqlite3.connect(db_path)
+    conn.row_factory = sqlite3.Row
+    try:
+        rows = conn.execute(
+            "SELECT name, normalized_name, language, description "
+            "FROM activities WHERE needs_review = 1 ORDER BY normalized_name"
+        ).fetchall()
+    except sqlite3.OperationalError:
+        return []
+    finally:
+        conn.close()
+
+    out = []
+    for row in rows:
+        norm = row["normalized_name"] or normalize_name(row["name"])
+        key = content_key(norm, row["language"], row["description"] or "")
+        out.append({
+            "id": key,
+            "name": row["name"],
+            "language": row["language"],
+            "description": row["description"] or "",
+        })
+    return out
+
+
+def resolve(decisions_path: Path, content_id: str, decision: str) -> dict:
+    """Record a decision for a content key in review_decisions.json."""
+    if decision not in VALID_DECISIONS:
+        raise ValueError(
+            f"invalid decision {decision!r}; expected one of {VALID_DECISIONS}"
+        )
+    decisions = load_decisions(decisions_path)
+    decisions[content_id] = {"decision": decision}
+    save_decisions(decisions, decisions_path)
+    return decisions
+
+
+# --------------------------------------------------------------------------
+# CLI
+# --------------------------------------------------------------------------
+def main(argv: Optional[list[str]] = None) -> int:
+    parser = argparse.ArgumentParser(description="needs_review queue CLI")
+    parser.add_argument("--db", default="data/activities.db")
+    parser.add_argument("--decisions", default="data/review_decisions.json")
+    sub = parser.add_subparsers(dest="command", required=True)
+
+    sub.add_parser("list", help="list rows currently flagged needs_review")
+
+    p_resolve = sub.add_parser("resolve", help="record a decision for a row")
+    p_resolve.add_argument("id", help="content id from `list`")
+    p_resolve.add_argument("decision", choices=VALID_DECISIONS)
+
+    args = parser.parse_args(argv)
+
+    if args.command == "list":
+        rows = list_queue(Path(args.db))
+        if not rows:
+            print("review queue is empty.")
+            return 0
+        print(f"{len(rows)} row(s) need review:\n")
+        for r in rows:
+            desc = r["description"][:80].replace("\n", " ")
+            print(f"  id   : {r['id']}")
+            print(f"  name : {r['name']}  [{r['language']}]")
+            print(f"  desc : {desc}")
+            print(f"  -> review_queue.py resolve {r['id']} <merge|keep-separate|drop>")
+            print()
+        return 0
+
+    if args.command == "resolve":
+        resolve(Path(args.decisions), args.id, args.decision)
+        print(f"recorded: {args.id} -> {args.decision}")
+        print(f"written to {args.decisions} (applied on next build_database --rebuild)")
+        return 0
+
+    return 1
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
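Review sketch: the shape that `resolve` writes to the decisions file can be shown with a small round trip. This is a minimal illustration under assumptions: `record_decision_sketch` is a hypothetical stand-in for `resolve`, and the content ids `key-a` and `key-b` are invented placeholders rather than real content keys.

```python
import json
import tempfile
from pathlib import Path

VALID_DECISIONS = ("merge", "keep-separate", "drop")


def record_decision_sketch(path: Path, content_id: str, decision: str) -> dict:
    """Store one decision per content id (hypothetical stand-in for resolve)."""
    if decision not in VALID_DECISIONS:
        raise ValueError(f"invalid decision {decision!r}")
    # Start from the existing file so earlier decisions are preserved.
    decisions = json.loads(path.read_text(encoding="utf-8")) if path.is_file() else {}
    decisions[content_id] = {"decision": decision}
    path.write_text(
        json.dumps(decisions, indent=2, ensure_ascii=False, sort_keys=True),
        encoding="utf-8",
    )
    return decisions


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "review_decisions.json"
    record_decision_sketch(path, "key-a", "merge")
    record_decision_sketch(path, "key-b", "drop")
    on_disk = json.loads(path.read_text(encoding="utf-8"))
```

Because the file is a flat map keyed by content id, resolving the same id twice overwrites the earlier decision instead of appending a second entry.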
--- a/scripts/run_enrichment.py
+++ b/scripts/run_enrichment.py
@@ -0,0 +1,289 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+run_enrichment.py — enrichment orchestrator (plan Part B3).
+
+Mirror of run_extraction.py, on the *other* side of the rebuild. It reads the
+already-rebuilt data/activities.db, and for every activity emits one subagent
+prompt asking for a single bilingual + inferred-filter enrichment pass. Like
+extraction, this script does NOT call the LLM — the interactive Claude Code
+orchestrator launches waves of subagents on the emitted prompts.
+
+Keying is the crux (plan §"Cheia de keying"): each row's overlay is keyed on
+import_common.content_key(normalized_name, language, _normalize_text(description))
+— the SAME function build_database uses to apply the overlay. The key is stable
+only while the extraction text is frozen, so enrichment runs AFTER the freezing
+rebuild.
+
+Modes:
+  (default)    emit one prompt per activity that has no enrichment part yet
+               (resumable: data/enrichment_parts/<key>.json present => skip)
+  --collect    merge data/enrichment_parts/*.json -> data/enrichment.json
+
+Pilot scoping (plan B5): --source <source_id substring> and/or --limit N narrow
+the emitted prompts to a single source / category for the sign-off pilot.
+
+Usage:
+    python scripts/run_enrichment.py --source teambuilding_corbu        # pilot
+    python scripts/run_enrichment.py                                    # all rows
+    python scripts/run_enrichment.py --collect                          # merge parts
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sqlite3
+import sys
+from pathlib import Path
+from typing import Optional
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
+    if _p not in sys.path:
+        sys.path.insert(0, _p)
+
+from import_common import (  # noqa: E402
+    content_key,
+    find_chunk_text,
+    normalize_name,
+)
+from repair_extractions import escape_stray_quotes  # noqa: E402
+
+ENRICHMENT_PROMPT = SCRIPT_DIR / "ENRICHMENT_PROMPT.md"
+
+# Columns pulled from the DB into the prompt as the "current value" context.
+_DB_COLUMNS = (
+    "id", "name", "description", "rules", "variations",
+    "category", "content_type", "language", "normalized_name",
+    "page_reference", "source_id", "chunk_key",
+    "participants_min", "participants_max",
+    "duration_min", "duration_max",
+    "age_group_min", "age_group_max",
+)
+
+# How much source-chunk text to inline. Chunks are page-sized; cap so a dense
+# chunk does not blow the prompt up, but keep enough to ground the expansion.
+_CHUNK_TEXT_CAP = 12000
+
+
+def _fetch_rows(db_path: Path, source_substr: Optional[str]) -> list[dict]:
+    conn = sqlite3.connect(db_path)
+    conn.row_factory = sqlite3.Row
+    try:
+        cols = ", ".join(_DB_COLUMNS)
+        sql = f"SELECT {cols} FROM activities"
+        params: list = []
+        if source_substr:
+            sql += " WHERE (source_id LIKE ? OR chunk_key LIKE ?)"
+            params = [f"%{source_substr}%", f"%{source_substr}%"]
+        sql += " ORDER BY source_id, id"
+        return [dict(r) for r in conn.execute(sql, params).fetchall()]
+    finally:
+        conn.close()
+
+
+def _row_content_key(row: dict) -> str:
+    return content_key(
+        row.get("normalized_name") or normalize_name(row.get("name") or ""),
+        row.get("language"),
+        row.get("description") or "",
+    )
+
+
+def _chunk_text_for_row(row: dict, chunks_dir: Path) -> Optional[str]:
+    """Locate the source-chunk text via the row's chunk_key / source_id."""
+    header = {"chunk_key": row.get("chunk_key"), "source_id": row.get("source_id")}
+    if not header["chunk_key"]:
+        return None
+    # find_chunk_text resolves from the header when chunk_key is present;
+    # the json_path arg is only a fallback, so a synthetic path is fine.
+    text = find_chunk_text(Path(f"{row['chunk_key']}.json"), header, chunks_dir)
+    if text and len(text) > _CHUNK_TEXT_CAP:
+        text = text[:_CHUNK_TEXT_CAP] + "\n…[chunk truncated]…"
+    return text
+
+
+def _current_fields_block(row: dict) -> str:
+    """The activity's current DB values, as a compact JSON block for context."""
+    fields = {
+        "name": row.get("name"),
+        "description": row.get("description"),
+        "rules": row.get("rules"),
+        "variations": row.get("variations"),
+        "category": row.get("category"),
+        "content_type": row.get("content_type"),
+        "language": row.get("language"),
+        "participants_min": row.get("participants_min"),
+        "participants_max": row.get("participants_max"),
+        "duration_min": row.get("duration_min"),
+        "duration_max": row.get("duration_max"),
+        "age_group_min": row.get("age_group_min"),
+        "age_group_max": row.get("age_group_max"),
+    }
+    return json.dumps(fields, ensure_ascii=False, indent=2)
+
+
+def emit_enrichment_prompt(
+    row: dict, key: str, chunks_dir: Path, prompts_dir: Path
+) -> Path:
+    """Write the subagent enrichment prompt for one activity."""
+    chunk_text = _chunk_text_for_row(row, chunks_dir)
+    source_block = (
+        chunk_text if chunk_text is not None
+        else "[source chunk text unavailable — translate only what is given "
+             "above; do NOT invent steps, and mark any inferred filter field "
+             "as estimated]"
+    )
+    part_path = f"data/enrichment_parts/{key}.json"
+    text = "\n".join([
+        f"# ENRICHMENT — activity `{row.get('name')}` (id {row.get('id')})",
+        "",
+        f"Follow the rules in `{ENRICHMENT_PROMPT.relative_to(REPO_ROOT)}` EXACTLY.",
+        "Single pass. Translate faithfully to Romanian; expand description_ro "
+        "ONLY from the source chunk text below; mark inferred filter fields in "
+        "`estimated_fields`.",
+        "",
+        f"Write the result JSON to: `{part_path}`",
+        f'It MUST include `"content_key": "{key}"`.',
+        f'Page reference: {row.get("page_reference") or "?"}',
+        "",
+        "## Current activity values (the text to translate / enrich)",
+        "```json",
+        _current_fields_block(row),
+        "```",
+        "",
+        "## Source chunk text (ground description_ro expansion in THIS only)",
+        "```",
+        source_block,
+        "```",
+        "",
+    ])
+    prompts_dir.mkdir(parents=True, exist_ok=True)
+    out = prompts_dir / f"{key}.prompt.md"
+    out.write_text(text, encoding="utf-8")
+    return out
+
+
+def collect_enrichment(parts_dir: Path, out_path: Path) -> dict:
+    """Merge data/enrichment_parts/*.json into one flat content_key map."""
+    merged: dict = {}
+    bad: list[str] = []
+    repaired: list[str] = []
+    if parts_dir.is_dir():
+        for part in sorted(parts_dir.glob("*.json")):
+            try:
+                raw = part.read_text(encoding="utf-8")
+                data = json.loads(raw)
+            except json.JSONDecodeError:
+                # Enrichment subagents hit the same unescaped-ASCII-quote bug as
+                # extraction (description_ro is full of Romanian „…"). Repair by
+                # escaping rather than dropping the activity's enrichment.
+                try:
+                    data = json.loads(escape_stray_quotes(raw))
+                    repaired.append(part.name)
+                except json.JSONDecodeError:
+                    bad.append(part.name)
+                    continue
+            except OSError:
+                bad.append(part.name)
+                continue
+            if not isinstance(data, dict):
+                bad.append(part.name)
+                continue
+            key = data.get("content_key") or part.stem
+            entry = {k: v for k, v in data.items() if k != "content_key"}
+            merged[key] = entry
+    out_path.write_text(
+        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8"
+    )
+    return {"entries": len(merged), "repaired": repaired,
+            "bad_parts": bad, "out": str(out_path)}
+
+
+def run_emit(
+    *,
+    db_path: Path,
+    chunks_dir: Path,
+    parts_dir: Path,
+    prompts_dir: Path,
+    source_substr: Optional[str],
+    limit: Optional[int],
+) -> dict:
+    rows = _fetch_rows(db_path, source_substr)
+    emitted, skipped = 0, 0
+    for row in rows:
+        key = _row_content_key(row)
+        if (parts_dir / f"{key}.json").is_file():
+            skipped += 1
+            continue
+        emit_enrichment_prompt(row, key, chunks_dir, prompts_dir)
+        emitted += 1
+        if limit and emitted >= limit:
+            break
+    return {
+        "rows": len(rows),
+        "emitted": emitted,
+        "skipped_done": skipped,
+        "prompts_dir": str(prompts_dir),
+    }
+
+
+def main(argv: Optional[list[str]] = None) -> int:
+    parser = argparse.ArgumentParser(description="Enrichment orchestrator.")
+    parser.add_argument("--db", default="data/activities.db")
+    parser.add_argument("--chunks", default="data/chunks")
+    parser.add_argument("--parts", default="data/enrichment_parts")
+    parser.add_argument("--prompts", default="data/enrichment_prompts")
+    parser.add_argument("--out", default="data/enrichment.json")
+    parser.add_argument("--source", default=None,
+                        help="only rows whose source_id/chunk_key contains this (pilot)")
+    parser.add_argument("--limit", type=int, default=None,
+                        help="cap emitted prompts (pilot)")
+    parser.add_argument("--collect", action="store_true",
+                        help="merge enrichment parts into the overlay JSON")
+    args = parser.parse_args(argv)
+
+    print("=" * 60)
+    print("ENRICHMENT ORCHESTRATOR")
+    print("=" * 60)
+
+    if args.collect:
+        result = collect_enrichment(Path(args.parts), Path(args.out))
+        print(f"collected  : {result['entries']} entries -> {result['out']}")
+        if result["repaired"]:
+            print(f"repaired   : {len(result['repaired'])} parts (unescaped-quote fix)")
+        if result["bad_parts"]:
+            print(f"bad parts  : {len(result['bad_parts'])} (skipped)")
+            for name in result["bad_parts"]:
+                print(f"  - {name}")
+        print("Run build_database.py --rebuild to apply the overlay.")
+        print("=" * 60)
+        return 0
+
+    summary = run_emit(
+        db_path=Path(args.db),
+        chunks_dir=Path(args.chunks),
+        parts_dir=Path(args.parts),
+        prompts_dir=Path(args.prompts),
+        source_substr=args.source,
+        limit=args.limit,
+    )
+    print(f"rows in DB        : {summary['rows']}"
+          + (f"  (filtered by '{args.source}')" if args.source else ""))
+    print(f"already enriched  : {summary['skipped_done']}")
+    print(f"prompts emitted   : {summary['emitted']}")
+    if summary["emitted"]:
+        print(f"prompts dir       : {summary['prompts_dir']}/")
+        print("Launch waves of ~8-16 Sonnet subagents on those prompts, each "
+              "writing data/enrichment_parts/<key>.json, then run "
+              "run_enrichment.py --collect and build_database.py --rebuild.")
+    else:
+        print("Nothing to emit — run --collect then build_database.py --rebuild.")
+    print("=" * 60)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
--- a/scripts/run_extraction.py
+++ b/scripts/run_extraction.py
@@ -1,50 +1,140 @@
 #!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 """
-Main extraction orchestrator
-Ruleaza intregul proces de extractie
+run_extraction.py — extraction orchestrator (plan §3).
+
+The pipeline is script-only up to the LLM step: this script normalizes the
+corpus, chunks the normalized sources, and emits one subagent prompt per
+`pending` chunk. It does NOT run the extraction itself — that step is the
+interactive Claude Code orchestrator launching waves of subagents.
+
+Steps:
+  1. normalize  data/carti-camp-jocuri/ -> data/sources/*.txt
+  2. chunk      data/sources/*.txt      -> data/chunks/<id>/*.txt + manifest.json
+  3. emit       one prompt per `pending` chunk -> data/chunks/_prompts/*.md
+  4. report     how many chunks remain `pending`
+
+Usage:
+    python scripts/run_extraction.py
+    python scripts/run_extraction.py --skip-normalize   # re-chunk only
 """

+from __future__ import annotations
+
+import argparse
 import sys
-import time
 from pathlib import Path
+from typing import Optional

-from unified_processor import UnifiedProcessor
-from import_claude_activities import ClaudeActivityImporter
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
+    if _p not in sys.path:
+        sys.path.insert(0, _p)
+
+import chunk_sources  # noqa: E402
+import normalize_sources  # noqa: E402
+
+SUBAGENT_PROMPT = SCRIPT_DIR / "SUBAGENT_PROMPT.md"
+
+
+def emit_chunk_prompt(chunk_key: str, meta: dict, prompts_dir: Path) -> Path:
+    """Write the subagent prompt for one pending chunk."""
+    chunk_file = meta.get("chunk_file", f"data/chunks/<id>/{chunk_key}.txt")
+    expected_json = meta.get("expected_json", f"{chunk_key}.json")
+    text = "\n".join([
+        f"# EXTRACTION — chunk `{chunk_key}`",
+        "",
+        f"Read ONLY this chunk: `{chunk_file}`",
+        f"Chunk range: {meta.get('chunk_range', '?')}",
+        "",
+        f"Follow the rules in `{SUBAGENT_PROMPT.relative_to(REPO_ROOT)}`.",
+        "Identify every distinct activity, fill the schema "
+        "(`scripts/activity_schema.json`), and write the result to:",
+        "",
+        f"    data/extracted/{expected_json}",
+        "",
+        "Header fields to set: "
+        f'source_id="{meta.get("source_id", "")}", chunk_key="{chunk_key}", '
+        f'source_hash="{meta.get("source_hash", "")}".',
+        "",
+    ])
+    prompts_dir.mkdir(parents=True, exist_ok=True)
+    out = prompts_dir / f"{chunk_key}.prompt.md"
+    out.write_text(text, encoding="utf-8")
+    return out
+
+
+def run(
+    *,
+    corpus_root: Path,
+    sources_dir: Path,
+    chunks_dir: Path,
+    skip_normalize: bool = False,
+) -> dict:
+    summary: dict = {}
+
+    if not skip_normalize:
+        norm = normalize_sources.run(corpus_root, sources_dir)
+        summary["normalized"] = {"ok": norm["ok"], "total": norm["total"],
+                                 "errors": norm["errors"]}
+
+    chunk_summary = chunk_sources.run(sources_dir, chunks_dir)
+    summary["chunks"] = chunk_summary
+
+    manifest_path = chunks_dir / "manifest.json"
+    manifest = chunk_sources.load_manifest(manifest_path)
+    prompts_dir = chunks_dir / "_prompts"
+
+    pending = {k: m for k, m in manifest["chunks"].items()
+               if m.get("state") == "pending"}
+    for key, meta in sorted(pending.items()):
+        emit_chunk_prompt(key, meta, prompts_dir)
+
+    states: dict[str, int] = {}
+    for m in manifest["chunks"].values():
+        states[m.get("state", "?")] = states.get(m.get("state", "?"), 0) + 1
+    summary["states"] = states
+    summary["pending"] = len(pending)
+    summary["prompts_dir"] = str(prompts_dir)
+    return summary
+
+
+def main(argv: Optional[list[str]] = None) -> int:
+    parser = argparse.ArgumentParser(description="Extraction orchestrator.")
+    parser.add_argument("--corpus", default="data/carti-camp-jocuri")
+    parser.add_argument("--sources", default="data/sources")
+    parser.add_argument("--chunks", default="data/chunks")
+    parser.add_argument("--skip-normalize", action="store_true",
+                        help="skip normalization, re-chunk existing sources only")
+    args = parser.parse_args(argv)
+
+    summary = run(
+        corpus_root=Path(args.corpus),
+        sources_dir=Path(args.sources),
+        chunks_dir=Path(args.chunks),
+        skip_normalize=args.skip_normalize,
+    )
+
+    print("=" * 60)
+    print("EXTRACTION ORCHESTRATOR")
+    print("=" * 60)
+    if "normalized" in summary:
+        n = summary["normalized"]
+        print(f"normalized : {n['ok']}/{n['total']} (errors {n['errors']})")
+    print(f"chunks     : {summary['chunks']['chunks']}")
+    for state, count in sorted(summary["states"].items()):
+        print(f"  {state:<10}: {count}")
+    print(f"\npending chunks remaining : {summary['pending']}")
+    if summary["pending"]:
+        print(f"subagent prompts written : {summary['prompts_dir']}/")
+        print("Launch waves of ~5-10 subagents on those prompts, then run "
+              "validate_extractions.py and build_database.py --rebuild.")
+    else:
+        print("All chunks extracted — run build_database.py --rebuild.")
+    print("=" * 60)
+    return 0

-def main():
-    print("="*60)
-    print("ACTIVITY EXTRACTION SYSTEM")
-    print("Strategy S8: Hybrid Claude + Scripts")
-    print("="*60)
-    
-    # Step 1: Run automated extraction
-    print("\nSTEP 1: Automated Extraction")
-    print("-"*40)
-    processor = UnifiedProcessor()
-    processor.process_automated_formats()
-    
-    # Step 2: Wait for Claude processing
-    print("\n" + "="*60)
-    print("STEP 2: Manual Claude Processing Required")
-    print("-"*40)
-    print("Please process PDF/DOC files with Claude using the template.")
-    print("Files are listed in: pdf_doc_for_claude.txt")
-    print("Save extracted activities as JSON in: scripts/extracted_activities/")
-    print("="*60)
-    
-    response = input("\nHave you completed Claude processing? (y/n): ")
-    
-    if response.lower() == 'y':
-        # Step 3: Import Claude-extracted activities
-        print("\nSTEP 3: Importing Claude-extracted activities")
-        print("-"*40)
-        importer = ClaudeActivityImporter()
-        importer.import_all_json_files()
-    
-    print("\n" + "="*60)
-    print("EXTRACTION COMPLETE!")
-    print("="*60)

 if __name__ == "__main__":
-    main()
+    raise SystemExit(main())
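Reviewer note on the manifest bookkeeping in `run()` above: a minimal sketch of the pending filter and the per-state tally, assuming the manifest has the shape the code reads (`{"chunks": {<chunk_key>: {"state": ...}}}`). The chunk keys and the `extracted` state name below are hypothetical; only `pending` and `rejected` appear in this diff. `collections.Counter` is one possible way to write the tally, not what the script does.

```python
from collections import Counter

# Hypothetical manifest, shaped like the dict run() loads from manifest.json.
manifest = {
    "chunks": {
        "book_a__p001-020": {"state": "pending"},
        "book_a__p019-040": {"state": "extracted"},  # assumed state name
        "book_b__p001-020": {"state": "rejected"},
        "book_b__p019-040": {},                      # entry with no state yet
    }
}


def pending_chunks(manifest: dict) -> dict:
    """Chunks that still need a subagent prompt."""
    return {k: m for k, m in manifest["chunks"].items()
            if m.get("state") == "pending"}


def state_counts(manifest: dict) -> Counter:
    """Tally chunks per state; entries without a state are counted under '?'."""
    return Counter(m.get("state", "?") for m in manifest["chunks"].values())


pending = pending_chunks(manifest)
counts = state_counts(manifest)
print(sorted(pending))
print(dict(counts))
```

Entries with no `state` key land under `?`, matching the fallback in the script, so a malformed manifest entry shows up in the report instead of being silently dropped.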
--- a/scripts/text_extractor.py
+++ b/scripts/text_extractor.py
@@ -1,197 +0,0 @@
-#!/usr/bin/env python3
-# -*- coding: utf-8 -*-
-"""
-Text/Markdown Activity Extractor
-Proceseaza fisiere TXT si MD pentru extractie activitati
-"""
-
-import re
-from pathlib import Path
-from typing import List, Dict
-import sqlite3
-from datetime import datetime
-
-class TextActivityExtractor:
-    def __init__(self, db_path='data/activities.db'):
-        self.db_path = db_path
-        self.activity_patterns = {
-            'section_headers': [
-                r'^#{1,6}\s*(.+)$',  # Markdown headers
-                r'^([A-Z][^\.]{10,100})$',  # Titluri simple
-                r'^\d+\.\s*(.+)$',  # Numbered lists
-                r'^[•\-\*]\s*(.+)$',  # Bullet points
-            ],
-            'activity_markers': [
-                'joc:', 'activitate:', 'exercitiu:', 'team building:',
-                'nume:', 'titlu:', 'denumire:'
-            ]
-        }
-    
-    def extract_from_text(self, file_path: str) -> List[Dict]:
-        """Extrage activitati din fisier text/markdown"""
-        activities = []
-        
-        try:
-            with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
-                content = f.read()
-            
-            # Metoda 1: Cauta sectiuni markdown
-            if file_path.endswith('.md'):
-                activities.extend(self._extract_from_markdown(content, file_path))
-            
-            # Metoda 2: Cauta pattern-uri generale
-            activities.extend(self._extract_from_patterns(content, file_path))
-            
-            # Metoda 3: Cauta blocuri de text structurate
-            activities.extend(self._extract_from_blocks(content, file_path))
-            
-        except Exception as e:
-            print(f"Error processing {file_path}: {e}")
-        
-        return activities
-    
-    def _extract_from_markdown(self, content, source_file):
-        """Extrage activitati din format markdown"""
-        activities = []
-        lines = content.split('\n')
-        
-        current_activity = None
-        current_content = []
-        
-        for line in lines:
-            # Verifica daca e header de activitate
-            if re.match(r'^#{1,3}\s*(.+)', line):
-                # Salveaza activitatea anterioara daca exista
-                if current_activity and current_content:
-                    current_activity['description'] = '\n'.join(current_content[:20])  # Max 20 linii
-                    activities.append(current_activity)
-                
-                # Verifica daca noul header e o activitate
-                header_text = re.sub(r'^#{1,3}\s*', '', line)
-                if any(marker in header_text.lower() for marker in ['joc', 'activitate', 'exercitiu']):
-                    current_activity = {
-                        'name': header_text[:200],
-                        'source_file': str(source_file),
-                        'category': '[A]'
-                    }
-                    current_content = []
-                else:
-                    current_activity = None
-            
-            elif current_activity:
-                # Adauga continut la activitatea curenta
-                if line.strip():
-                    current_content.append(line)
-        
-        # Salveaza ultima activitate
-        if current_activity and current_content:
-            current_activity['description'] = '\n'.join(current_content[:20])
-            activities.append(current_activity)
-        
-        return activities
-    
-    def _extract_from_patterns(self, content, source_file):
-        """Extrage folosind pattern matching"""
-        activities = []
-        
-        # Cauta markeri specifici de activitati
-        for marker in self.activity_patterns['activity_markers']:
-            pattern = re.compile(f'{re.escape(marker)}\\s*(.+?)(?=\\n\\n|{re.escape(marker)}|$)', 
-                               re.IGNORECASE | re.DOTALL)
-            matches = pattern.finditer(content)
-            
-            for match in matches:
-                activity_text = match.group(1)
-                if len(activity_text) > 20:
-                    activity = {
-                        'name': activity_text.split('\n')[0][:200],
-                        'description': activity_text[:1000],
-                        'source_file': str(source_file),
-                        'category': '[A]'
-                    }
-                    activities.append(activity)
-        
-        return activities
-    
-    def _extract_from_blocks(self, content, source_file):
-        """Extrage din blocuri de text separate"""
-        activities = []
-        
-        # Imparte in blocuri separate de linii goale
-        blocks = re.split(r'\n\s*\n', content)
-        
-        for block in blocks:
-            if len(block) > 50:  # Minim 50 caractere
-                lines = block.strip().split('\n')
-                first_line = lines[0].strip()
-                
-                # Verifica daca blocul pare o activitate
-                if any(keyword in first_line.lower() for keyword in ['joc', 'activitate', 'exercitiu']):
-                    activity = {
-                        'name': first_line[:200],
-                        'description': block[:1000],
-                        'source_file': str(source_file),
-                        'category': '[A]'
-                    }
-                    activities.append(activity)
-        
-        return activities
-    
-    def save_to_database(self, activities):
-        """Salveaza in baza de date"""
-        conn = sqlite3.connect(self.db_path)
-        cursor = conn.cursor()
-        
-        saved_count = 0
-        
-        for activity in activities:
-            try:
-                # Check for duplicates
-                cursor.execute(
-                    "SELECT id FROM activities WHERE name = ? AND source_file = ?",
-                    (activity.get('name'), activity.get('source_file'))
-                )
-                
-                if not cursor.fetchone():
-                    columns = list(activity.keys())
-                    values = list(activity.values())
-                    placeholders = ['?' for _ in values]
-                    
-                    query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
-                    cursor.execute(query, values)
-                    saved_count += 1
-                    
-            except Exception as e:
-                print(f"Error saving: {e}")
-        
-        conn.commit()
-        conn.close()
-        
-        return saved_count
-    
-    def process_all_text_files(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
-        """Proceseaza toate fisierele text si markdown"""
-        base_path = Path(base_path)
-        
-        text_files = list(base_path.rglob("*.txt"))
-        md_files = list(base_path.rglob("*.md"))
-        all_files = text_files + md_files
-        
-        print(f"Found {len(all_files)} text/markdown files")
-        
-        all_activities = []
-        
-        for file_path in all_files:
-            activities = self.extract_from_text(str(file_path))
-            all_activities.extend(activities)
-            print(f"Processed {file_path.name}: {len(activities)} activities")
-        
-        # Save to database
-        saved = self.save_to_database(all_activities)
-        print(f"\nTotal saved: {saved} activities from {len(all_files)} files")
-        
-        return len(all_files), saved
-
-if __name__ == "__main__":
-    extractor = TextActivityExtractor()
-    extractor.process_all_text_files()
--- a/scripts/unified_processor.py
+++ b/scripts/unified_processor.py
@@ -1,151 +0,0 @@
-#!/usr/bin/env python3
-"""
-Unified Activity Processor
-Orchestreaz toate extractoarele pentru procesare complet
-"""
-
-import time
-from pathlib import Path
-from html_extractor import HTMLActivityExtractor
-from text_extractor import TextActivityExtractor
-import sqlite3
-
-class UnifiedProcessor:
-    def __init__(self, db_path='data/activities.db'):
-        self.db_path = db_path
-        self.html_extractor = HTMLActivityExtractor(db_path)
-        self.text_extractor = TextActivityExtractor(db_path)
-        self.stats = {
-            'html_processed': 0,
-            'text_processed': 0,
-            'pdf_to_process': 0,
-            'doc_to_process': 0,
-            'total_activities': 0,
-            'start_time': None,
-            'end_time': None
-        }
-    
-    def get_current_activity_count(self):
-        """Obine numrul curent de activiti din DB"""
-        conn = sqlite3.connect(self.db_path)
-        cursor = conn.cursor()
-        cursor.execute("SELECT COUNT(*) FROM activities")
-        count = cursor.fetchone()[0]
-        conn.close()
-        return count
-    
-    def count_files_to_process(self, base_path):
-        """Numr fiierele care trebuie procesate"""
-        base_path = Path(base_path)
-        
-        counts = {
-            'html': len(list(base_path.rglob("*.html"))) + len(list(base_path.rglob("*.htm"))),
-            'txt': len(list(base_path.rglob("*.txt"))),
-            'md': len(list(base_path.rglob("*.md"))),
-            'pdf': len(list(base_path.rglob("*.pdf"))),
-            'doc': len(list(base_path.rglob("*.doc"))),
-            'docx': len(list(base_path.rglob("*.docx")))
-        }
-        
-        return counts
-    
-    def process_automated_formats(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
-        """Proceseaz toate formatele care pot fi automatizate"""
-        print("="*60)
-        print("UNIFIED ACTIVITY PROCESSOR - AUTOMATED PHASE")
-        print("="*60)
-        
-        self.stats['start_time'] = time.time()
-        initial_count = self.get_current_activity_count()
-        
-        # Afieaz statistici iniiale
-        file_counts = self.count_files_to_process(base_path)
-        print(f"\nFiles to process:")
-        for format, count in file_counts.items():
-            print(f"  {format.upper()}: {count} files")
-        print(f"\nCurrent activities in database: {initial_count}")
-        print("-"*60)
-        
-        # FAZA 1: Procesare HTML (prioritate maxim - volum mare)
-        print("\n[1/2] Processing HTML files...")
-        print("-"*40)
-        html_processed, html_errors = self.html_extractor.process_all_html_files(base_path)
-        self.stats['html_processed'] = html_processed
-        
-        # FAZA 2: Procesare Text/MD
-        print("\n[2/2] Processing Text/Markdown files...")
-        print("-"*40)
-        text_processed, text_saved = self.text_extractor.process_all_text_files(base_path)
-        self.stats['text_processed'] = text_processed
-        
-        # Statistici finale
-        self.stats['end_time'] = time.time()
-        final_count = self.get_current_activity_count()
-        self.stats['total_activities'] = final_count - initial_count
-        
-        # Identific fiierele care necesit procesare manual
-        self.stats['pdf_to_process'] = file_counts['pdf']
-        self.stats['doc_to_process'] = file_counts['doc'] + file_counts['docx']
-        
-        self.print_summary()
-        self.save_pdf_doc_list(base_path)
-    
-    def print_summary(self):
-        """Afieaz rezumatul procesrii"""
-        print("\n" + "="*60)
-        print("PROCESSING SUMMARY")
-        print("="*60)
-        
-        duration = self.stats['end_time'] - self.stats['start_time']
-        
-        print(f"\nAutomated Processing Results:")
-        print(f"  HTML files processed: {self.stats['html_processed']}")
-        print(f"  Text/MD files processed: {self.stats['text_processed']}")
-        print(f"  New activities added: {self.stats['total_activities']}")
-        print(f"  Processing time: {duration:.1f} seconds")
-        
-        print(f"\nFiles requiring Claude processing:")
-        print(f"  PDF files: {self.stats['pdf_to_process']}")
-        print(f"  DOC/DOCX files: {self.stats['doc_to_process']}")
-        
-        print("\n" + "="*60)
-        print("NEXT STEPS:")
-        print("1. Review the file 'pdf_doc_for_claude.txt' for manual processing")
-        print("2. Use Claude to extract activities from PDF/DOC files")
-        print("3. Focus on largest PDF files first (highest activity density)")
-        print("="*60)
-    
-    def save_pdf_doc_list(self, base_path):
-        """Salveaz lista de PDF/DOC pentru procesare cu Claude"""
-        base_path = Path(base_path)
-        
-        pdf_files = sorted(base_path.rglob("*.pdf"), key=lambda p: p.stat().st_size, reverse=True)
-        doc_files = list(base_path.rglob("*.doc"))
-        docx_files = list(base_path.rglob("*.docx"))
-        
-        with open('pdf_doc_for_claude.txt', 'w', encoding='utf-8') as f:
-            f.write("PDF/DOC FILES FOR CLAUDE PROCESSING\n")
-            f.write("="*60 + "\n")
-            f.write("Files sorted by size (largest first = likely more activities)\n\n")
-            
-            f.write("TOP PRIORITY PDF FILES (process these first):\n")
-            f.write("-"*40 + "\n")
-            for i, pdf in enumerate(pdf_files[:20], 1):
-                size_mb = pdf.stat().st_size / (1024*1024)
-                f.write(f"{i}. {pdf.name} ({size_mb:.1f} MB)\n")
-                f.write(f"   Path: {pdf}\n\n")
-            
-            if len(pdf_files) > 20:
-                f.write(f"\n... and {len(pdf_files)-20} more PDF files\n\n")
-            
-            f.write("\nDOC/DOCX FILES:\n")
-            f.write("-"*40 + "\n")
-            for doc in doc_files + docx_files:
-                size_kb = doc.stat().st_size / 1024
-                f.write(f"- {doc.name} ({size_kb:.1f} KB)\n")
-        
-        print(f"\nPDF/DOC list saved to: pdf_doc_for_claude.txt")
-
-if __name__ == "__main__":
-    processor = UnifiedProcessor()
-    processor.process_automated_formats()
--- a/scripts/validate_extractions.py
+++ b/scripts/validate_extractions.py
@@ -0,0 +1,208 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+validate_extractions.py — validate every data/extracted/*.json (plan §5b).
+
+For each extraction file it runs two checks:
+  1. JSON-schema validation against scripts/activity_schema.json,
+  2. the source_excerpt anti-hallucination check (each excerpt must be a fuzzy
+     substring of the chunk it came from).
+
+For every failing chunk it:
+  * writes the exact re-extraction prompt to data/extracted/_reextract/<chunk>.prompt.md,
+  * marks the chunk `rejected` in data/chunks/manifest.json.
+
+The orchestrator then re-launches subagents only on the `rejected` chunks; the
+loop repeats until nothing is rejected.
+
+Usage:
+    python scripts/validate_extractions.py
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+from typing import Optional
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
+    if _p not in sys.path:
+        sys.path.insert(0, _p)
+
+from import_common import (  # noqa: E402
+    DEFAULT_SCHEMA_PATH,
+    chunk_key_for,
+    excerpt_matches,
+    excerpt_score,
+    find_chunk_text,
+    iter_extraction_files,
+    load_schema,
+    validate_extraction,
+)
+
+SUBAGENT_PROMPT = SCRIPT_DIR / "SUBAGENT_PROMPT.md"
+
+
+# --------------------------------------------------------------------------
+# re-extraction prompt
+# --------------------------------------------------------------------------
+def build_reextraction_prompt(
+    chunk_key: str, chunk_file: Optional[str], errors: list[str]
+) -> str:
+    """The exact prompt to hand a subagent to re-extract a rejected chunk."""
+    chunk_ref = chunk_file or f"data/chunks/<source_id>/{chunk_key}.txt"
+    lines = [
+        f"# RE-EXTRACTION — chunk `{chunk_key}`",
+        "",
+        "The previous extraction for this chunk was **REJECTED**. Reasons:",
+        "",
+    ]
+    lines += [f"- {e}" for e in errors]
+    lines += [
+        "",
+        "## What to do",
+        "",
+        f"1. Read ONLY this chunk: `{chunk_ref}`",
+        f"2. Follow the extraction rules in `{SUBAGENT_PROMPT.relative_to(REPO_ROOT)}`.",
+        "3. Fix every problem listed above. In particular:",
+        "   - every `source_excerpt` must be copied **verbatim** from the chunk",
+        "     (it is checked as a fuzzy substring — invented quotes are rejected);",
+        "   - `source_excerpt` and `page_reference` are mandatory on every activity;",
+        "   - the output must validate against `scripts/activity_schema.json`.",
+        f"4. Overwrite the extraction file `data/extracted/{chunk_key}.json`.",
+        "",
+    ]
+    return "\n".join(lines)
+
+
+# --------------------------------------------------------------------------
+# manifest
+# --------------------------------------------------------------------------
+def load_manifest(manifest_path: Path) -> dict:
+    if manifest_path.is_file():
+        try:
+            data = json.loads(manifest_path.read_text(encoding="utf-8"))
+            data.setdefault("chunks", {})
+            return data
+        except (json.JSONDecodeError, OSError):
+            pass
+    return {"chunks": {}}
+
+
+def save_manifest(manifest: dict, manifest_path: Path) -> None:
+    manifest_path.parent.mkdir(parents=True, exist_ok=True)
+    manifest_path.write_text(
+        json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8"
+    )
+
+
+def mark_rejected(manifest: dict, chunk_key: str) -> None:
+    """Flip a chunk to `rejected` in the manifest (creating the entry if new)."""
+    entry = manifest["chunks"].get(chunk_key, {})
+    entry["state"] = "rejected"
+    manifest["chunks"][chunk_key] = entry
+
+
+# --------------------------------------------------------------------------
+# validation
+# --------------------------------------------------------------------------
+def validate_file(json_path: Path, schema: dict, chunks_dir: Path) -> list[str]:
+    """Return the list of errors for one extraction file (empty == valid)."""
+    try:
+        data = json.loads(json_path.read_text(encoding="utf-8"))
+    except json.JSONDecodeError as exc:
+        return [f"invalid JSON: {exc}"]
+
+    errors = validate_extraction(data, schema)
+    if errors:
+        return errors
+
+    header = data.get("header", {})
+    chunk_text = find_chunk_text(json_path, header, chunks_dir)
+    if chunk_text is None:
+        return [f"source chunk not found for {chunk_key_for(json_path, header)}"]
+
+    for adict in data.get("activities", []):
+        excerpt = adict.get("source_excerpt") or ""
+        if not excerpt_matches(excerpt, chunk_text):
+            score = excerpt_score(excerpt, chunk_text)
+            errors.append(
+                f"activity {adict.get('name')!r}: source_excerpt not found in "
+                f"chunk (best match {score:.0f}/100) — possible hallucination"
+            )
+    return errors
+
+
+def run(
+    extracted_dir: Path,
+    chunks_dir: Path,
+    manifest_path: Path,
+    schema_path: Path = DEFAULT_SCHEMA_PATH,
+) -> dict:
+    schema = load_schema(schema_path)
+    manifest = load_manifest(manifest_path)
+    reextract_dir = extracted_dir / "_reextract"
+
+    report = {"total": 0, "valid": 0, "rejected": 0, "rejected_chunks": []}
+    for json_path in iter_extraction_files(extracted_dir):
+        report["total"] += 1
+        errors = validate_file(json_path, schema, chunks_dir)
+        if not errors:
+            report["valid"] += 1
+            continue
+
+        report["rejected"] += 1
+        try:
+            data = json.loads(json_path.read_text(encoding="utf-8"))
+            header = data.get("header", {}) if isinstance(data, dict) else {}
+        except json.JSONDecodeError:
+            header = {}
+        chunk_key = chunk_key_for(json_path, header)
+        chunk_file = None
+        meta = manifest["chunks"].get(chunk_key)
+        if meta:
+            chunk_file = meta.get("chunk_file")
+
+        reextract_dir.mkdir(parents=True, exist_ok=True)
+        prompt = build_reextraction_prompt(chunk_key, chunk_file, errors)
+        (reextract_dir / f"{chunk_key}.prompt.md").write_text(prompt, encoding="utf-8")
+
+        mark_rejected(manifest, chunk_key)
+        report["rejected_chunks"].append({"chunk": chunk_key, "errors": errors})
+
+    save_manifest(manifest, manifest_path)
+    return report
+
+
+# --------------------------------------------------------------------------
+# CLI
+# --------------------------------------------------------------------------
+def main(argv: Optional[list[str]] = None) -> int:
+    parser = argparse.ArgumentParser(description="Validate extraction JSON files.")
+    parser.add_argument("--extracted", default="data/extracted")
+    parser.add_argument("--chunks", default="data/chunks")
+    parser.add_argument("--manifest", default="data/chunks/manifest.json")
+    parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
+    args = parser.parse_args(argv)
+
+    report = run(
+        Path(args.extracted), Path(args.chunks), Path(args.manifest), Path(args.schema)
+    )
+    print(f"extraction files : {report['total']}")
+    print(f"  valid          : {report['valid']}")
+    print(f"  rejected       : {report['rejected']}")
+    for item in report["rejected_chunks"]:
+        print(f"  [rejected] {item['chunk']}")
+        for err in item["errors"]:
+            print(f"      - {err}")
+    if report["rejected"]:
+        print(f"\nRe-extraction prompts written to {args.extracted}/_reextract/")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
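Reviewer note on the anti-hallucination check: `excerpt_matches` and `excerpt_score` are imported from `import_common`, which is not part of this hunk. The sketch below only illustrates what "fuzzy substring" can mean, using the standard library's `difflib`. The function names, the whitespace normalization, the sliding-window strategy, and the threshold of 85 are all assumptions, not the project's actual implementation.

```python
import re
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wraps do not cause misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def sketch_excerpt_score(excerpt: str, chunk_text: str) -> float:
    """Best similarity (0-100) of the excerpt against same-length chunk windows."""
    needle, haystack = _normalize(excerpt), _normalize(chunk_text)
    if not needle or not haystack:
        return 0.0
    if needle in haystack:
        return 100.0  # verbatim copy, modulo case and whitespace
    width = len(needle)
    step = max(1, width // 4)  # overlapping windows; coarse but cheap
    best = 0.0
    for start in range(0, max(1, len(haystack) - width + 1), step):
        window = haystack[start:start + width]
        ratio = SequenceMatcher(None, needle, window, autojunk=False).ratio()
        best = max(best, ratio * 100)
    return best


def sketch_excerpt_matches(excerpt: str, chunk_text: str,
                           threshold: float = 85.0) -> bool:
    """True when the excerpt is close enough to some part of the chunk."""
    return sketch_excerpt_score(excerpt, chunk_text) >= threshold


chunk = (
    "Vanatoarea de comori\n"
    "Acesta este un joc real de orientare pentru cercetasi.\n"
    "Jucatorii cauta indicii   ascunse in tabara.\n"
)
verbatim = "Jucatorii cauta indicii ascunse in tabara."
invented = "Participants build a raft from barrels and rope."
print(sketch_excerpt_matches(verbatim, chunk))
print(sketch_excerpt_matches(invented, chunk))
```

The exact-substring fast path keeps the common case (a verbatim copy) cheap; the windowed comparison only runs for excerpts that differ slightly, for example through OCR noise or dropped diacritics.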
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -0,0 +1,114 @@
+# -*- coding: utf-8 -*-
+"""
+Shared pytest fixtures for the extraction-pipeline tests.
+
+scripts/ is not a package, so it is added to sys.path here. Synthetic fixtures
+(PDF, docx, zip, HTML) are generated at runtime — no binary blobs in the repo.
+"""
+
+import sys
+import zipfile
+from pathlib import Path
+
+import pytest
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+SCRIPTS_DIR = REPO_ROOT / "scripts"
+if str(SCRIPTS_DIR) not in sys.path:
+    sys.path.insert(0, str(SCRIPTS_DIR))
+
+
+# --------------------------------------------------------------------------
+# synthetic PDF — deliberately large to pin the "no max_pages" regression
+# --------------------------------------------------------------------------
+@pytest.fixture
+def big_pdf(tmp_path):
+    """A 60-page PDF; each page carries a unique 'PDFMARK-<n>' token."""
+    from reportlab.pdfgen import canvas
+    from reportlab.lib.pagesizes import letter
+
+    path = tmp_path / "big.pdf"
+    c = canvas.Canvas(str(path), pagesize=letter)
+    for n in range(1, 61):
+        c.drawString(72, 720, f"PDFMARK-{n} synthetic activity page number {n}")
+        c.drawString(72, 700, "Acest joc educativ se joaca in echipa.")
+        c.showPage()
+    c.save()
+    return path
+
+
+# --------------------------------------------------------------------------
+# synthetic docx — 100 paragraphs => 3 synthetic pages at 40 paras/page
+# --------------------------------------------------------------------------
+@pytest.fixture
+def sample_docx(tmp_path):
+    import docx
+
+    path = tmp_path / "sample.docx"
+    document = docx.Document()
+    for i in range(100):
+        document.add_paragraph(f"Paragraf {i}: continut joc team-building.")
+    document.save(str(path))
+    return path
+
+
+# --------------------------------------------------------------------------
+# synthetic HTML mirror page — with nav/script/footer chrome to strip
+# --------------------------------------------------------------------------
+HTML_WITH_NAV = """<!doctype html>
+<html><head><title>Joc</title>
+<style>.x{color:red}</style>
+<script>var tracking = 1;</script>
+</head><body>
+<nav><a href="/">Home</a><a href="/games">Games</a></nav>
+<header>Site Banner Junk</header>
+<main>
+<h1>Vanatoarea de comori</h1>
+<p>Acesta este un joc real de orientare pentru cercetasi.</p>
+<p>Jucatorii cauta indicii ascunse in tabara.</p>
+</main>
+<footer>Copyright 2024 - toate drepturile rezervate</footer>
+</body></html>
+"""
+
+
+@pytest.fixture
+def html_with_nav(tmp_path):
+    path = tmp_path / "page.html"
+    path.write_text(HTML_WITH_NAV, encoding="utf-8")
+    return path
+
+
+# --------------------------------------------------------------------------
+# synthetic zip — contains a docx and a stray junk file
+# --------------------------------------------------------------------------
+@pytest.fixture
+def sample_zip(tmp_path, sample_docx):
+    path = tmp_path / "archive.zip"
+    with zipfile.ZipFile(path, "w") as zf:
+        zf.write(sample_docx, arcname="inner/sample.docx")
+        zf.writestr("desktop.ini", "junk")
+    return path
+
+
+# --------------------------------------------------------------------------
+# synthetic normalized source — paginated, with an activity straddling a
+# page boundary so the chunker overlap can be verified.
+# --------------------------------------------------------------------------
+@pytest.fixture
+def paginated_source(tmp_path):
+    """A 50-page normalized source. An activity spans the page 20/21 boundary."""
+    lines = ["SOURCE: synthetic/test.pdf", "CONVERTED: 2026-05-19",
+             "FORMAT: pdf", "=" * 50, ""]
+    for n in range(1, 51):
+        lines.append(f"--- PAGE {n} ---")
+        if n == 20:
+            lines.append("ACTIVITY-START jocul podului care traverseaza pagina")
+        elif n == 21:
+            lines.append("continuare a jocului podului ACTIVITY-END")
+        else:
+            lines.append(f"continut obisnuit pe pagina {n}")
+        lines.append("")
+    path = tmp_path / "src_paginated.txt"
+    path.write_text("\n".join(lines), encoding="utf-8")
+    return path
--- a/tests/fixtures/.gitkeep
+++ b/tests/fixtures/.gitkeep
@@ -0,0 +1,3 @@
+# Test fixtures (synthetic PDF/docx/zip/HTML) are generated at runtime by
+# tests/conftest.py — no binary blobs are committed. This file only preserves
+# the directory in git.
--- a/tests/test_build_database.py
+++ b/tests/test_build_database.py
@@ -0,0 +1,334 @@
+# -*- coding: utf-8 -*-
+"""
+Tests for scripts/build_database.py — the import / dedup / swap side.
+
+Covers: category -> slug + `altele` fallback; dedup across all three threshold
+bands; EN != RO never merged; field combination on merge; atomic swap with a
+simulated mid-build crash; the source_excerpt substring check.
+"""
+
+import json
+import os
+import sys
+from pathlib import Path
+
+import pytest
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+SCRIPTS_DIR = REPO_ROOT / "scripts"
+for _p in (str(REPO_ROOT), str(SCRIPTS_DIR)):
+    if _p not in sys.path:
+        sys.path.insert(0, _p)
+
+import build_database as bd  # noqa: E402
+from app.models.activity import Activity  # noqa: E402
+from app.models.database import DatabaseManager  # noqa: E402
+
+
+# --------------------------------------------------------------------------
+# helpers
+# --------------------------------------------------------------------------
+def _activity(**over):
+    base = dict(
+        name="Jocul testului",
+        description="O activitate de echipa in aer liber.",
+        category="team-building",
+        content_type="joc",
+        language="ro",
+        extraction_confidence="high",
+    )
+    base.update(over)
+    return Activity(**base)
+
+
+def _ext_activity(**over):
+    """A schema-valid extraction-JSON activity object."""
+    base = dict(
+        name="Jocul testului",
+        description="O activitate de echipa in aer liber.",
+        category="team-building",
+        content_type="joc",
+        language="ro",
+        extraction_confidence="high",
+        source_excerpt="ANCHOR-EXCERPT despre jocul testului",
+        page_reference="page 1",
+    )
+    base.update(over)
+    return base
+
+
+def _write_extraction(extracted_dir, chunk_key, activities, source_id="src01"):
+    extracted_dir.mkdir(parents=True, exist_ok=True)
+    payload = {
+        "header": {
+            "source_hash": "hash1234deadbeef",
+            "schema_version": "1.0",
+            "prompt_version": "1.0",
+            "chunk_range": "pages 1-20",
+            "source_id": source_id,
+            "chunk_key": chunk_key,
+        },
+        "activities": activities,
+    }
+    (extracted_dir / f"{chunk_key}.json").write_text(
+        json.dumps(payload, ensure_ascii=False), encoding="utf-8"
+    )
+
+
+def _write_chunk(chunks_dir, source_id, chunk_key, text):
+    d = chunks_dir / source_id
+    d.mkdir(parents=True, exist_ok=True)
+    (d / f"{chunk_key}.txt").write_text(text, encoding="utf-8")
+
+
+# --------------------------------------------------------------------------
+# step 3 — category normalization
+# --------------------------------------------------------------------------
+def test_category_alias_mapped_to_slug():
+    act = bd.dict_to_activity(_ext_activity(category="teambuilding"), "s.txt")
+    assert act.category == "team-building"
+
+
+def test_unknown_category_falls_back_to_altele():
+    act = bd.dict_to_activity(_ext_activity(category="zzz-not-a-category"), "s.txt")
+    assert act.category == "altele"
+
+
+def test_content_type_normalized():
+    act = bd.dict_to_activity(_ext_activity(content_type="games"), "s.txt")
+    assert act.content_type == "joc"
+
+
+# --------------------------------------------------------------------------
+# step 4 — dedup, three bands
+# --------------------------------------------------------------------------
+def test_dedup_auto_merge_identical_descriptions():
+    """>= 85 similar -> a single merged row."""
+    a = _activity(description="copiii formeaza echipe si traverseaza terenul")
+    b = _activity(description="copiii formeaza echipe si traverseaza terenul")
+    out, stats = bd.dedup_activities([a, b])
+    assert len(out) == 1
+    assert stats["auto_merged"] == 1
+    assert out[0].needs_review == 0
+
+
+def test_dedup_borderline_keeps_both_and_flags_needs_review():
+    """60-85 similar -> both kept, both flagged needs_review."""
+    from rapidfuzz import fuzz
+
+    d1 = "alpha beta gamma delta epsilon"
+    d2 = "alpha beta gamma delta epsilon zeta eta theta iota"
+    score = fuzz.token_sort_ratio(d1, d2)
+    assert 60.0 <= score < 85.0, f"precondition: score={score} not borderline"
+
+    a = _activity(description=d1)
+    b = _activity(description=d2)
+    out, stats = bd.dedup_activities([a, b])
+    assert len(out) == 2
+    assert stats["borderline"] == 2
+    assert all(act.needs_review == 1 for act in out)
+
+
+def test_dedup_low_similarity_kept_as_separate_variants():
+    """< 60 similar -> separate variants, no needs_review."""
+    from rapidfuzz import fuzz
+
+    d1 = "alpha beta gamma delta epsilon"
+    d2 = "quebec romeo sierra tango uniform victor whiskey"
+    assert fuzz.token_sort_ratio(d1, d2) < 60.0
+
+    a = _activity(description=d1)
+    b = _activity(description=d2)
+    out, stats = bd.dedup_activities([a, b])
+    assert len(out) == 2
+    assert stats["auto_merged"] == 0
+    assert all(act.needs_review == 0 for act in out)
+
+
+def test_dedup_never_merges_across_languages():
+    """Same name + same description but EN vs RO -> two distinct rows."""
+    desc = "children form teams and cross the field"
+    ro = _activity(name="Cursa", description=desc, language="ro")
+    en = _activity(name="Cursa", description=desc, language="en")
+    out, stats = bd.dedup_activities([ro, en])
+    assert len(out) == 2
+    assert stats["auto_merged"] == 0
+    langs = {a.language for a in out}
+    assert langs == {"ro", "en"}
+
+
+def test_merge_combines_fields():
+    """On merge: longest description/rules, union materials, accumulated sources."""
+    desc = "copiii formeaza echipe si traverseaza terenul cu obstacole"
+    a = _activity(
+        description=desc,
+        rules="regula scurta",
+        materials_list="franghie, esarfa",
+        source_file="a.txt",
+        keywords="echipa",
+    )
+    b = _activity(
+        description=desc,
+        rules="o regula mult mai lunga si mai detaliata pentru joc",
+        materials_list="busola, esarfa",
+        source_file="b.txt",
+        keywords="cooperare",
+    )
+    out, _ = bd.dedup_activities([a, b])
+    assert len(out) == 1
+    merged = out[0]
+    assert merged.rules == "o regula mult mai lunga si mai detaliata pentru joc"
+    mats = {m.strip() for m in merged.materials_list.split(",")}
+    assert mats == {"franghie", "esarfa", "busola"}
+    assert set(merged.source_files) == {"a.txt", "b.txt"}
+    assert merged.popularity_score == 1
+    assert {k.strip() for k in merged.keywords.split(",")} == {"echipa", "cooperare"}
+
+
+# --------------------------------------------------------------------------
+# step 5 — review decisions
+# --------------------------------------------------------------------------
+def test_review_decision_drop_removes_row():
+    from import_common import content_key, normalize_name
+
+    a = _activity(description="o descriere de test")
+    key = content_key(normalize_name(a.name), a.language, a.description)
+    kept, stats = bd.apply_review_decisions([a], {key: {"decision": "drop"}})
+    assert kept == []
+    assert stats["dropped"] == 1
+
+
+def test_review_decision_keep_separate_clears_needs_review():
+    from import_common import content_key, normalize_name
+
+    a = _activity(description="o descriere de test")
+    a.needs_review = 1
+    key = content_key(normalize_name(a.name), a.language, a.description)
+    kept, stats = bd.apply_review_decisions([a], {key: {"decision": "keep-separate"}})
+    assert len(kept) == 1 and kept[0].needs_review == 0
+    assert stats["resolved"] == 1
+
+
+# --------------------------------------------------------------------------
+# step 2b — source_excerpt hallucination check
+# --------------------------------------------------------------------------
+def test_hallucinated_excerpt_activity_dropped(tmp_path):
+    extracted = tmp_path / "extracted"
+    chunks = tmp_path / "chunks"
+    sources = tmp_path / "sources"
+
+    good = _ext_activity(
+        name="Joc real", source_excerpt="textul real apare in bucata sursa"
+    )
+    bad = _ext_activity(
+        name="Joc inventat",
+        source_excerpt="acest citat nu exista nicaieri in sursa originala xyzzy",
+    )
+    _write_extraction(extracted, "src01.part01", [good, bad])
+    _write_chunk(
+        chunks, "src01", "src01.part01",
+        "--- PAGE 1 ---\ntextul real apare in bucata sursa pentru jocul real.\n",
+    )
+
+    from import_common import load_schema
+
+    schema = load_schema()
+    res = bd.collect_activities(extracted, chunks, sources, schema)
+    names = {a.name for a in res["activities"]}
+    assert names == {"Joc real"}
+    assert res["activities_hallucinated"] == 1
+    assert (extracted / "_rejected").exists()
+
+
+def test_schema_invalid_file_moved_to_rejected(tmp_path):
+    extracted = tmp_path / "extracted"
+    chunks = tmp_path / "chunks"
+    sources = tmp_path / "sources"
+    extracted.mkdir(parents=True)
+
+    # missing required header keys + bad activity
+    (extracted / "bad.json").write_text(
+        json.dumps({"header": {}, "activities": [{"name": "x"}]}),
+        encoding="utf-8",
+    )
+    from import_common import load_schema
+
+    res = bd.collect_activities(extracted, chunks, sources, load_schema())
+    assert res["files_rejected_schema"] == 1
+    assert not (extracted / "bad.json").exists()
+    assert (extracted / "_rejected" / "bad.json").exists()
+    assert (extracted / "_rejected" / "bad.errors.txt").exists()
+
+
+# --------------------------------------------------------------------------
+# end-to-end rebuild + atomic swap
+# --------------------------------------------------------------------------
+def _setup_corpus(tmp_path):
+    extracted = tmp_path / "extracted"
+    chunks = tmp_path / "chunks"
+    sources = tmp_path / "sources"
+    excerpt = "jocul testului este o activitate de echipa"
+    _write_extraction(
+        extracted, "src01.part01",
+        [_ext_activity(source_excerpt=excerpt)],
+    )
+    _write_chunk(chunks, "src01", "src01.part01",
+                 f"--- PAGE 1 ---\n{excerpt} in aer liber.\n")
+    return extracted, chunks, sources
+
+
+def test_rebuild_creates_database(tmp_path):
+    extracted, chunks, sources = _setup_corpus(tmp_path)
+    db_path = tmp_path / "activities.db"
+
+    report = bd.rebuild(
+        extracted_dir=extracted, chunks_dir=chunks, sources_dir=sources,
+        db_path=db_path,
+    )
+    assert db_path.exists()
+    assert report["final_count"] == 1
+
+    db = DatabaseManager(str(db_path))
+    rows = db.search_activities()
+    assert len(rows) == 1
+    assert rows[0]["category"] == "team-building"
+
+
+def test_atomic_swap_keeps_live_db_intact_on_crash(tmp_path, monkeypatch):
+    """A mid-build crash must leave the live DB byte-identical."""
+    extracted, chunks, sources = _setup_corpus(tmp_path)
+    db_path = tmp_path / "activities.db"
+
+    # a pre-existing live DB with sentinel content
+    live = DatabaseManager(str(db_path))
+    live.insert_activity(_activity(name="Sentinel viu"))
+    before = db_path.read_bytes()
+
+    def boom(self, *a, **k):
+        raise RuntimeError("simulated mid-build crash")
+
+    monkeypatch.setattr(DatabaseManager, "bulk_insert_activities", boom)
+
+    with pytest.raises(RuntimeError, match="simulated mid-build crash"):
+        bd.rebuild(
+            extracted_dir=extracted, chunks_dir=chunks, sources_dir=sources,
+            db_path=db_path,
+        )
+
+    # live DB untouched, tmp cleaned up
+    assert db_path.read_bytes() == before
+    assert not (tmp_path / "activities.db.tmp").exists()
+
+
+def test_rebuild_backs_up_live_db(tmp_path):
+    extracted, chunks, sources = _setup_corpus(tmp_path)
+    db_path = tmp_path / "activities.db"
+    DatabaseManager(str(db_path)).insert_activity(_activity(name="Vechi"))
+
+    report = bd.rebuild(
+        extracted_dir=extracted, chunks_dir=chunks, sources_dir=sources,
+        db_path=db_path,
+    )
+    assert report["backup"] is not None
+    assert Path(report["backup"]).exists()
+    assert os.path.basename(report["backup"]) == "activities.db.bak"
--- a/tests/test_chunk_sources.py
+++ b/tests/test_chunk_sources.py
@@ -0,0 +1,183 @@
+# -*- coding: utf-8 -*-
+"""Tests for scripts/chunk_sources.py."""
+
+import json
+
+import chunk_sources as cs
+import normalize_sources as ns
+
+
+def _pages(n):
+    return [(i, f"text-{i}") for i in range(1, n + 1)]
+
+
+# --------------------------------------------------------------------------
+# header parsing
+# --------------------------------------------------------------------------
+def test_parse_source_splits_header_and_body(paginated_source):
+    text = paginated_source.read_text(encoding="utf-8")
+    header, body = cs.parse_source(text)
+    assert header["FORMAT"] == "pdf"
+    assert body.lstrip().startswith("--- PAGE 1 ---")
+
+
+# --------------------------------------------------------------------------
+# page chunking
+# --------------------------------------------------------------------------
+def test_chunk_pages_basic_split():
+    chunks = cs.chunk_pages(_pages(50), pages_per_chunk=20, overlap=4)
+    # stride 16: starts at pages 1, 17, 33, ...
+    assert chunks[0]["page_start"] == 1 and chunks[0]["page_end"] == 20
+    assert chunks[1]["page_start"] == 17
+    assert chunks[-1]["page_end"] == 50
+
+
+def test_chunk_pages_have_overlap():
+    chunks = cs.chunk_pages(_pages(50), pages_per_chunk=20, overlap=4)
+    overlap = chunks[0]["page_end"] - chunks[1]["page_start"] + 1
+    assert overlap == 4
+
+
+def test_chunk_pages_short_document_single_chunk():
+    chunks = cs.chunk_pages(_pages(8), pages_per_chunk=20, overlap=4)
+    assert len(chunks) == 1
+    assert chunks[0]["page_start"] == 1 and chunks[0]["page_end"] == 8
+
+
+def test_chunk_pages_empty():
+    assert cs.chunk_pages([]) == []
+
+
+def test_activity_at_page_boundary_intact_in_one_chunk(paginated_source):
+    """An activity straddling the page 20/21 boundary must appear whole in >=1 chunk."""
+    text = paginated_source.read_text(encoding="utf-8")
+    chunks = cs.make_chunks(text)
+    full = [
+        c for c in chunks
+        if "ACTIVITY-START" in c["text"] and "ACTIVITY-END" in c["text"]
+    ]
+    assert full, "activity spanning a page boundary was split across all chunks"
+
+
+# --------------------------------------------------------------------------
+# word-window chunking for unpaginated text
+# --------------------------------------------------------------------------
+def test_chunk_words_window_and_overlap():
+    text = " ".join(f"w{i}" for i in range(25_000))
+    chunks = cs.chunk_words(text, window=10_000, overlap=2_000)
+    assert len(chunks) == 3  # stride 8000 over 25000 words
+    first = chunks[0]["text"].split()
+    second = chunks[1]["text"].split()
+    assert first[8_000:10_000] == second[0:2_000]  # 2000-word overlap
+
+
+def test_make_chunks_unpaginated_uses_word_windows():
+    body = "cuvant " * 15_000
+    text = "SOURCE: x\nFORMAT: txt\n" + "=" * 50 + "\n\n" + body
+    chunks = cs.make_chunks(text)
+    assert len(chunks) >= 2
+    assert chunks[0]["chunk_range"].startswith("words")
+
+
+# --------------------------------------------------------------------------
+# stable source ids — anti-collision
+# --------------------------------------------------------------------------
+def test_stable_id_same_stem_different_path_no_collision():
+    a = ns.stable_id("camp/games/scout.pdf")
+    b = ns.stable_id("school/lessons/scout.pdf")
+    assert a != b
+    assert a.endswith("_scout") and b.endswith("_scout")
+
+
+def test_stable_id_deterministic():
+    assert ns.stable_id("a/b/c.pdf") == ns.stable_id("a/b/c.pdf")
+
+
+# --------------------------------------------------------------------------
+# manifest registry + idempotency
+# --------------------------------------------------------------------------
+def test_run_writes_chunks_and_manifest(paginated_source, tmp_path):
+    sources_dir = tmp_path / "sources"
+    sources_dir.mkdir()
+    (sources_dir / paginated_source.name).write_text(
+        paginated_source.read_text(encoding="utf-8"), encoding="utf-8"
+    )
+    chunks_dir = tmp_path / "chunks"
+
+    summary = cs.run(sources_dir, chunks_dir)
+    assert summary["sources"] == 1
+    assert summary["chunks"] >= 2
+
+    manifest = json.loads((chunks_dir / "manifest.json").read_text(encoding="utf-8"))
+    assert manifest["chunks"]
+    for key, meta in manifest["chunks"].items():
+        assert meta["state"] == "pending"
+        assert meta["expected_json"] == f"{key}.json"
+        assert (chunks_dir.parent / meta["chunk_file"]).exists()
+
+
+def test_manifest_idempotent_preserves_state(paginated_source, tmp_path):
+    sources_dir = tmp_path / "sources"
+    sources_dir.mkdir()
+    (sources_dir / paginated_source.name).write_text(
+        paginated_source.read_text(encoding="utf-8"), encoding="utf-8"
+    )
+    chunks_dir = tmp_path / "chunks"
+    manifest_path = chunks_dir / "manifest.json"
+
+    cs.run(sources_dir, chunks_dir)
+
+    # orchestrator marks one chunk done
+    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
+    first_key = next(iter(manifest["chunks"]))
+    n_before = len(manifest["chunks"])
+    manifest["chunks"][first_key]["state"] = "done"
+    manifest_path.write_text(json.dumps(manifest), encoding="utf-8")
+
+    # re-run: 'done' must survive, no chunk added or lost
+    cs.run(sources_dir, chunks_dir)
+    manifest2 = json.loads(manifest_path.read_text(encoding="utf-8"))
+    assert len(manifest2["chunks"]) == n_before
+    assert manifest2["chunks"][first_key]["state"] == "done"
+    assert all(
+        m["state"] in ("pending", "done") for m in manifest2["chunks"].values()
+    )
+
+
+def test_manifest_resets_state_when_source_changes(paginated_source, tmp_path):
+    sources_dir = tmp_path / "sources"
+    sources_dir.mkdir()
+    src = sources_dir / paginated_source.name
+    src.write_text(paginated_source.read_text(encoding="utf-8"), encoding="utf-8")
+    chunks_dir = tmp_path / "chunks"
+    manifest_path = chunks_dir / "manifest.json"
+
+    cs.run(sources_dir, chunks_dir)
+    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
+    first_key = next(iter(manifest["chunks"]))
+    manifest["chunks"][first_key]["state"] = "done"
+    manifest_path.write_text(json.dumps(manifest), encoding="utf-8")
+
+    # mutate the source content -> hash changes -> state resets
+    src.write_text(src.read_text(encoding="utf-8") + "\n--- PAGE 51 ---\nextra\n",
+                   encoding="utf-8")
+    cs.run(sources_dir, chunks_dir)
+    manifest2 = json.loads(manifest_path.read_text(encoding="utf-8"))
+    assert manifest2["chunks"][first_key]["state"] == "pending"
+
+
+def test_prune_stale_removes_orphan_entries(paginated_source, tmp_path):
+    sources_dir = tmp_path / "sources"
+    sources_dir.mkdir()
+    src = sources_dir / paginated_source.name
+    src.write_text(paginated_source.read_text(encoding="utf-8"), encoding="utf-8")
+    chunks_dir = tmp_path / "chunks"
+
+    cs.run(sources_dir, chunks_dir)
+    # delete the source -> its chunks become stale
+    src.unlink()
+    summary = cs.run(sources_dir, chunks_dir)
+    assert summary["chunks"] == 0
+    assert summary["pruned"] >= 1
+    manifest = json.loads((chunks_dir / "manifest.json").read_text(encoding="utf-8"))
+    assert manifest["chunks"] == {}
--- a/tests/test_enrichment.py
+++ b/tests/test_enrichment.py
@@ -0,0 +1,231 @@
+"""
+Tests for the enrichment overlay (plan Part B) and the new filter axes /
+bilingual display helpers (plan Part A).
+
+Covers:
+  * config_taxonomy.normalize_indoor_outdoor / normalize_space_needed
+  * build_database.apply_enrichment keying, field application, estimated tally
+  * DatabaseManager indoor_outdoor / space_needed equality filters
+  * FTS5 indexing of the *_ro columns
+  * Activity bilingual display helpers
+"""
+
+import os
+import sys
+
+import pytest
+
+PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
+if PROJECT_ROOT not in sys.path:
+    sys.path.insert(0, PROJECT_ROOT)
+SCRIPTS = os.path.join(PROJECT_ROOT, "scripts")
+if SCRIPTS not in sys.path:
+    sys.path.insert(0, SCRIPTS)
+
+from app.models.activity import Activity  # noqa: E402
+from app.models.database import DatabaseManager  # noqa: E402
+from app.config_taxonomy import (  # noqa: E402
+    normalize_indoor_outdoor,
+    normalize_space_needed,
+)
+from import_common import content_key, normalize_name  # noqa: E402
+from build_database import apply_enrichment  # noqa: E402
+
+
+# --------------------------------------------------------------------------
+# taxonomy normalizers
+# --------------------------------------------------------------------------
+@pytest.mark.parametrize("raw,expected", [
+    ("indoor", "indoor"),
+    ("Outdoor", "outdoor"),
+    ("either", "either"),
+    ("interior", "indoor"),
+    ("aer liber", "outdoor"),
+    ("both", "either"),
+    ("", None),
+    ("nonsense", None),
+    (None, None),
+])
+def test_normalize_indoor_outdoor(raw, expected):
+    assert normalize_indoor_outdoor(raw) == expected
+
+
+@pytest.mark.parametrize("raw,expected", [
+    ("mic", "mic"),
+    ("MEDIU", "mediu"),
+    ("mare", "mare"),
+    ("small", "mic"),
+    ("large", "mare"),
+    ("", None),
+    ("huge", None),
+    (None, None),
+])
+def test_normalize_space_needed(raw, expected):
+    assert normalize_space_needed(raw) == expected
+
+
+# --------------------------------------------------------------------------
+# apply_enrichment
+# --------------------------------------------------------------------------
+def _activity(name="Joc de test", description="O descriere de test.", language="ro"):
+    return Activity(
+        name=name, description=description, category="team-building",
+        content_type="joc", source_file="t.txt", language=language,
+    )
+
+
+def _key_for(act: Activity) -> str:
+    return content_key(
+        act.normalized_name or normalize_name(act.name),
+        act.language,
+        act.description or "",
+    )
+
+
+def test_apply_enrichment_matches_and_applies_fields():
+    act = _activity()
+    key = _key_for(act)
+    enrichment = {
+        key: {
+            "name_ro": "Joc de test (RO)",
+            "description_ro": "Descriere îmbogățită în română.",
+            "indoor_outdoor": "outdoor",
+            "space_needed": "mediu",
+            "participants_min": 4,
+            "participants_max": 12,
+            "estimated_fields": ["space_needed", "participants_min", "participants_max"],
+        }
+    }
+    stats = apply_enrichment([act], enrichment)
+
+    assert act.name_ro == "Joc de test (RO)"
+    assert act.description_ro == "Descriere îmbogățită în română."
+    assert act.indoor_outdoor == "outdoor"
+    assert act.space_needed == "mediu"
+    assert act.participants_min == 4 and act.participants_max == 12
+    assert set(act.estimated_fields) == {"space_needed", "participants_min", "participants_max"}
+
+    assert stats["entries"] == 1
+    assert stats["matched"] == 1
+    assert stats["orphaned"] == 0
+    # indoor_outdoor stated, space_needed estimated
+    assert stats["fields_stated"].get("indoor_outdoor") == 1
+    assert stats["fields_estimated"].get("space_needed") == 1
+
+
+def test_apply_enrichment_orphan_entry_counted():
+    act = _activity()
+    enrichment = {"deadbeef" * 5: {"name_ro": "nu se potrivește"}}
+    stats = apply_enrichment([act], enrichment)
+    assert stats["matched"] == 0
+    assert stats["orphaned"] == 1
+    assert act.name_ro is None  # untouched
+
+
+def test_apply_enrichment_absent_fields_leave_value_untouched():
+    act = _activity()
+    act.participants_min = 5
+    key = _key_for(act)
+    # entry only translates name; participants must be preserved
+    apply_enrichment([act], {key: {"name_ro": "Tradus"}})
+    assert act.participants_min == 5
+    assert act.name_ro == "Tradus"
+
+
+def test_apply_enrichment_drops_unrecognised_enum():
+    act = _activity()
+    key = _key_for(act)
+    apply_enrichment([act], {key: {"indoor_outdoor": "spaceship", "space_needed": "mic"}})
+    assert act.indoor_outdoor is None       # unrecognised → dropped
+    assert act.space_needed == "mic"
+
+
+# --------------------------------------------------------------------------
+# DB equality filters + FTS on *_ro
+# --------------------------------------------------------------------------
+@pytest.fixture
+def db(tmp_path):
+    return DatabaseManager(str(tmp_path / "enrich.db"))
+
+
+def _insert(db, **overrides):
+    base = dict(
+        name="Activitate", description="desc", category="camp-outdoor",
+        content_type="joc", source_file="t.txt", language="ro",
+    )
+    base.update(overrides)
+    return db.insert_activity(Activity(**base))
+
+
+def test_indoor_outdoor_equality_filter(db):
+    _insert(db, name="In casa", indoor_outdoor="indoor")
+    _insert(db, name="Afara", indoor_outdoor="outdoor")
+    res = db.search_activities(indoor_outdoor="outdoor")
+    assert len(res) == 1
+    assert res[0]["name"] == "Afara"
+
+
+def test_space_needed_equality_filter(db):
+    _insert(db, name="Mic", space_needed="mic")
+    _insert(db, name="Mare", space_needed="mare")
+    res = db.search_activities(space_needed="mare")
+    assert len(res) == 1
+    assert res[0]["name"] == "Mare"
+
+
+def test_fts_indexes_name_ro(db):
+    _insert(db, name="Treasure Hunt", name_ro="Vânătoarea de comori")
+    # term only present in the Romanian twin
+    res = db.search_activities(search_text="comori")
+    assert len(res) == 1
+    assert res[0]["name"] == "Treasure Hunt"
+
+
+def test_fts_indexes_description_ro(db):
+    _insert(db, name="Game", description="english desc",
+            description_ro="o activitate de cooperare")
+    res = db.search_activities(search_text="cooperare")
+    assert len(res) == 1
+
+
+def test_ro_columns_round_trip(db):
+    aid = _insert(
+        db, name="X", name_ro="X-ro", description_ro="d-ro",
+        rules_ro="r-ro", variations_ro="v-ro",
+        indoor_outdoor="either", space_needed="mediu",
+        estimated_fields=["duration_min"], source_id="src1",
+        source_ids=["src1", "src2"], chunk_key="src1.part01",
+    )
+    row = db.get_activity_by_id(aid)
+    loaded = Activity.from_dict(row)
+    assert loaded.name_ro == "X-ro"
+    assert loaded.indoor_outdoor == "either"
+    assert loaded.space_needed == "mediu"
+    assert loaded.estimated_fields == ["duration_min"]
+    assert loaded.source_ids == ["src1", "src2"]
+    assert loaded.chunk_key == "src1.part01"
+
+
+# --------------------------------------------------------------------------
+# display helpers
+# --------------------------------------------------------------------------
+def test_display_helpers_prefer_ro_with_fallback():
+    act = _activity(name="Original", description="Original desc")
+    assert act.get_display_name() == "Original"          # no translation yet
+    assert act.get_display_description() == "Original desc"
+    act.name_ro = "Tradus"
+    act.description_ro = "Descriere tradusă"
+    assert act.get_display_name() == "Tradus"
+    assert act.get_display_description() == "Descriere tradusă"
+    assert act.has_translation() is True
+
+
+def test_is_estimated_and_axis_displays():
+    act = _activity()
+    act.indoor_outdoor = "outdoor"
+    act.space_needed = "mare"
+    act.estimated_fields = ["space_needed"]
+    assert act.get_indoor_outdoor_display() == "Exterior"
+    assert act.get_space_needed_display() == "Spațiu mare"
+    assert act.is_estimated("space_needed") is True
+    assert act.is_estimated("indoor_outdoor") is False
--- a/tests/test_extract_common.py
+++ b/tests/test_extract_common.py
@@ -0,0 +1,177 @@
+# -*- coding: utf-8 -*-
+"""Tests for scripts/extract_common.py."""
+
+import shutil
+import zipfile
+
+import pytest
+
+import extract_common as ec
+
+
+# --------------------------------------------------------------------------
+# format detection
+# --------------------------------------------------------------------------
+def test_detect_format():
+    assert ec.detect_format("a/b/file.PDF") == "pdf"
+    assert ec.detect_format("x.docx") == "docx"
+    assert ec.detect_format("x.doc") == "doc"
+    assert ec.detect_format("x.pptx") == "pptx"
+    assert ec.detect_format("x.html") == "html"
+    assert ec.detect_format("x.zip") == "zip"
+    assert ec.detect_format("x.epub") == "epub"
+    assert ec.detect_format("x.xyz") == "unknown"
+
+
+def test_is_junk():
+    assert ec.is_junk("some/desktop.ini")
+    assert ec.is_junk("notes.bak")
+    assert ec.is_junk("README.md")
+    assert not ec.is_junk("1000 Scout Games.pdf")
+
+
+# --------------------------------------------------------------------------
+# PDF — the critical "no max_pages" regression
+# --------------------------------------------------------------------------
+def test_pdf_extracts_all_60_pages(big_pdf):
+    body = ec.extract_pdf(big_pdf)
+    # the old converter capped at 50 pages — page 60 must be present now
+    assert "--- PAGE 60 ---" in body
+    assert "PDFMARK-60" in body
+    assert ec.count_page_markers(body) == 60
+
+
+def test_pdf_does_not_truncate_mid_document(big_pdf):
+    body = ec.extract_pdf(big_pdf)
+    pages = ec.split_pages(body)
+    assert pages[-1][0] == 60  # last marker is the real last page
+
+
+# --------------------------------------------------------------------------
+# page join / split round-trip
+# --------------------------------------------------------------------------
+def test_join_split_round_trip():
+    body = ec.join_pages(["alpha", "beta", "gamma"])
+    pages = ec.split_pages(body)
+    assert [n for n, _ in pages] == [1, 2, 3]
+    assert [t for _, t in pages] == ["alpha", "beta", "gamma"]
+
+
+def test_split_pages_no_markers_returns_empty():
+    assert ec.split_pages("plain text with no markers") == []
+
+
+# --------------------------------------------------------------------------
+# docx — synthetic page markers
+# --------------------------------------------------------------------------
+def test_docx_synthetic_page_markers(sample_docx):
+    body = ec.extract_docx(sample_docx)
+    # 100 paragraphs / 40 per page => 3 pages
+    assert ec.count_page_markers(body) == 3
+    assert "Paragraf 99" in body
+
+
+# --------------------------------------------------------------------------
+# HTML mirror — nav/script/footer stripped
+# --------------------------------------------------------------------------
+def test_html_strips_chrome(html_with_nav):
+    body = ec.extract_html(html_with_nav)
+    assert "Vanatoarea de comori" in body
+    assert "joc real de orientare" in body
+    # chrome must be gone
+    assert "tracking" not in body
+    assert "Site Banner Junk" not in body
+    assert "toate drepturile rezervate" not in body
+    assert "Games" not in body
+
+
+# --------------------------------------------------------------------------
+# content hash + near-duplicate elimination
+# --------------------------------------------------------------------------
+def test_content_hash_ignores_whitespace():
+    assert ec.content_hash("hello  world") == ec.content_hash("hello world\n")
+    assert ec.content_hash("hello world") != ec.content_hash("goodbye world")
+
+
+def test_dedupe_exact_duplicates():
+    items = [("a", "joc identic"), ("b", "joc identic"), ("c", "alt joc")]
+    kept = ec.dedupe_texts(items)
+    assert [k for k, _ in kept] == ["a", "c"]
+
+
+def test_dedupe_near_duplicates():
+    base = "Vanatoarea de comori este un joc de orientare pentru cercetasi in tabara."
+    near = base + " Pagina printata."  # same sentence plus a short suffix; must clear the 85.0 threshold
+    items = [("orig", base), ("print", near), ("other", "Cu totul alt continut diferit aici.")]
+    kept = ec.dedupe_texts(items, threshold=85.0)
+    keys = [k for k, _ in kept]
+    assert "orig" in keys
+    assert "print" not in keys
+    assert "other" in keys
+
+
+# --------------------------------------------------------------------------
+# zip recursion
+# --------------------------------------------------------------------------
+def test_zip_recurses_into_inner_files(sample_zip):
+    body = ec.extract_zip(sample_zip)
+    assert "Paragraf 0" in body
+    assert ec.count_page_markers(body) > 0
+
+
+def test_zip_bad_archive_returns_empty(tmp_path):
+    bad = tmp_path / "broken.zip"
+    bad.write_text("not a zip", encoding="utf-8")
+    assert ec.extract_zip(bad) == ""
+
+
+def test_nested_zip(tmp_path, sample_zip):
+    outer = tmp_path / "outer.zip"
+    with zipfile.ZipFile(outer, "w") as zf:
+        zf.write(sample_zip, arcname="nested/archive.zip")
+    body = ec.extract_zip(outer)
+    assert "Paragraf 0" in body
+
+
+# --------------------------------------------------------------------------
+# preflight
+# --------------------------------------------------------------------------
+def test_preflight_python_packages_present():
+    report = ec.preflight()
+    # all required packages are installed in the test environment
+    assert report["missing_python"] == []
+
+
+def test_preflight_reports_libreoffice_state():
+    report = ec.preflight()
+    has_lo = bool(shutil.which("libreoffice") or shutil.which("soffice"))
+    if has_lo:
+        assert all("libreoffice" not in w for w in report["warnings"])
+    else:
+        assert any("libreoffice" in w for w in report["warnings"])
+
+
+def test_preflight_ocr_flag():
+    report = ec.preflight(check_ocr=True)
+    if not shutil.which("tesseract"):
+        assert any("tesseract" in m for m in report["missing_system"])
+
+
+# --------------------------------------------------------------------------
+# legacy .doc — skipped unless libreoffice is installed
+# --------------------------------------------------------------------------
+@pytest.mark.skipif(
+    not (shutil.which("libreoffice") or shutil.which("soffice")),
+    reason="libreoffice not installed",
+)
+def test_doc_conversion(tmp_path, sample_docx):
+    doc_path = tmp_path / "legacy.doc"
+    shutil.copy(sample_docx, doc_path)  # no real .doc fixture: a renamed .docx smoke-tests the conversion path
+    body = ec.extract_doc(doc_path)
+    assert ec.count_page_markers(body) >= 1
+
+
+def test_doc_without_libreoffice_raises(tmp_path, monkeypatch):
+    monkeypatch.setattr(ec.shutil, "which", lambda _: None)
+    with pytest.raises(RuntimeError):
+        ec.extract_doc(tmp_path / "whatever.doc")
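Reviewer sketch (not part of the patch): the page tests above depend on `--- PAGE N ---` marker lines. The block below is a minimal sketch of how a join/split pair could satisfy the round-trip, assuming 1-based numbering and one marker per line. The function names are hypothetical stand-ins; the real `extract_common.join_pages` and `split_pages` are not shown in this diff and may behave differently (for example around whitespace).

```python
import re

# Marker format copied from the tests ("--- PAGE 60 ---"); the rest is assumed.
PAGE_MARKER = re.compile(r"^--- PAGE (\d+) ---$", re.MULTILINE)


def sketch_join_pages(pages):
    """Prefix each page text with a 1-based marker line and join with newlines."""
    return "\n".join(
        f"--- PAGE {number} ---\n{text}" for number, text in enumerate(pages, start=1)
    )


def sketch_split_pages(body):
    """Return [(page_number, text), ...]; an empty list when no marker is found."""
    matches = list(PAGE_MARKER.finditer(body))
    pages = []
    for index, match in enumerate(matches):
        # A page runs from the end of its marker to the start of the next marker.
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(body)
        pages.append((int(match.group(1)), body[match.end():end].strip()))
    return pages


def sketch_count_page_markers(body):
    """Count marker lines in a body."""
    return len(PAGE_MARKER.findall(body))
```

Note that this sketch strips leading and trailing whitespace from each page, so the round-trip is exact only for page texts without surrounding whitespace.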
--- a/tests/test_fts.py
+++ b/tests/test_fts.py
@@ -0,0 +1,139 @@
+"""
+Integration tests for the FTS5 search index.
+
+Confirms that materials_list and skills_developed are indexed by FTS5 and kept
+in sync by the insert / update / delete triggers (plan §6, §7).
+"""
+
+import os
+import sys
+import json
+
+import pytest
+
+# Make the project root importable when pytest is run from anywhere.
+PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
+if PROJECT_ROOT not in sys.path:
+    sys.path.insert(0, PROJECT_ROOT)
+
+from app.models.activity import Activity  # noqa: E402
+from app.models.database import DatabaseManager  # noqa: E402
+
+
+@pytest.fixture
+def db(tmp_path):
+    """A fresh DatabaseManager backed by a temporary SQLite file."""
+    return DatabaseManager(str(tmp_path / "test_activities.db"))
+
+
+def _make_activity(**overrides):
+    base = dict(
+        name="Vânătoarea de comori",
+        description="O activitate de echipă în aer liber.",
+        category="camp-outdoor",
+        content_type="joc",
+        source_file="test.txt",
+        language="ro",
+    )
+    base.update(overrides)
+    return Activity(**base)
+
+
+def test_search_by_materials_list(db):
+    """A term that only appears in materials_list returns the activity."""
+    activity = _make_activity(materials_list="frânghie, eșarfă, busolă")
+    db.insert_activity(activity)
+
+    results = db.search_activities(search_text="busolă")
+    assert len(results) == 1
+    assert results[0]["name"] == "Vânătoarea de comori"
+
+
+def test_search_by_skills_developed(db):
+    """A term that only appears in skills_developed returns the activity."""
+    activity = _make_activity(skills_developed="comunicare, leadership, rabdare")
+    db.insert_activity(activity)
+
+    results = db.search_activities(search_text="leadership")
+    assert len(results) == 1
+    assert results[0]["name"] == "Vânătoarea de comori"
+
+
+def test_term_absent_from_indexed_columns_no_hit(db):
+    """A term present in no indexed column yields no hit (control)."""
+    db.insert_activity(_make_activity(materials_list="frânghie"))
+    assert db.search_activities(search_text="zzzunlikelyterm") == []
+
+
+def test_delete_trigger_removes_from_fts(db):
+    """Deleting an activity removes it from the FTS index (delete trigger)."""
+    activity = _make_activity(materials_list="catalige")
+    activity_id = db.insert_activity(activity)
+    assert len(db.search_activities(search_text="catalige")) == 1
+
+    with db._get_connection() as conn:
+        conn.execute("DELETE FROM activities WHERE id = ?", (activity_id,))
+        conn.commit()
+
+    assert db.search_activities(search_text="catalige") == []
+
+
+def test_update_trigger_resyncs_fts(db):
+    """Updating materials_list re-syncs the FTS index (update trigger)."""
+    activity = _make_activity(materials_list="creioane")
+    activity_id = db.insert_activity(activity)
+    assert len(db.search_activities(search_text="creioane")) == 1
+
+    with db._get_connection() as conn:
+        conn.execute(
+            "UPDATE activities SET materials_list = ? WHERE id = ?",
+            ("acuarele", activity_id),
+        )
+        conn.commit()
+
+    # Old term gone, new term found.
+    assert db.search_activities(search_text="creioane") == []
+    assert len(db.search_activities(search_text="acuarele")) == 1
+
+
+def test_rebuild_fts_index(db):
+    """rebuild_fts_index keeps materials_list / skills_developed searchable."""
+    db.insert_activity(_make_activity(skills_developed="orientare"))
+    db.rebuild_fts_index()
+    assert len(db.search_activities(search_text="orientare")) == 1
+
+
+def test_new_schema_columns_round_trip(db):
+    """New activity columns persist and load back via from_dict."""
+    activity = _make_activity(
+        source_files=["a.txt", "b.txt"],
+        source_excerpt="Citat scurt din sursă.",
+        extraction_confidence="high",
+        needs_review=1,
+        normalized_name="vanatoarea de comori",
+    )
+    activity_id = db.insert_activity(activity)
+
+    row = db.get_activity_by_id(activity_id)
+    assert row["content_type"] == "joc"
+    assert row["language"] == "ro"
+    assert row["extraction_confidence"] == "high"
+    assert row["needs_review"] == 1
+    assert row["normalized_name"] == "vanatoarea de comori"
+    assert json.loads(row["source_files"]) == ["a.txt", "b.txt"]
+    assert row["source_excerpt"] == "Citat scurt din sursă."
+
+    loaded = Activity.from_dict(row)
+    assert loaded.source_files == ["a.txt", "b.txt"]
+    assert loaded.content_type == "joc"
+
+
+def test_normalized_name_auto_derived(db):
+    """normalized_name is auto-derived from name when not provided."""
+    activity = Activity(
+        name="Ștafetă cu  Obstacole",
+        description="desc",
+        category="sports-active",
+        source_file="t.txt",
+    )
+    assert activity.normalized_name == "stafeta cu obstacole"
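Reviewer sketch (not part of the patch): the last test expects `normalized_name` to drop Romanian diacritics, collapse repeated whitespace, and lowercase the name. One way to get that result, assuming Unicode decomposition is an acceptable way to strip diacritics, is shown below. The helper name `sketch_normalize_name` is hypothetical; how `Activity` actually derives the field is not visible in this diff.

```python
import re
import unicodedata


def sketch_normalize_name(name: str) -> str:
    """Lowercase, strip combining marks, and collapse runs of whitespace."""
    # NFKD splits letters such as "Ș" or "ă" into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping only the base characters.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse any whitespace run to a single space, then trim and lowercase.
    return re.sub(r"\s+", " ", without_marks).strip().lower()
```

Applied to the fixture names in this file, `"Ștafetă cu  Obstacole"` becomes `"stafeta cu obstacole"` and `"Vânătoarea de comori"` becomes `"vanatoarea de comori"`.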
--- a/tests/test_search.py
+++ b/tests/test_search.py
@@ -0,0 +1,140 @@
+# -*- coding: utf-8 -*-
+"""
+CRITICAL REGRESSION TEST (plan §6, §7).
+
+`search.py` changed the result sets of /search and /api/search: the default
+search now EXCLUDES the non-game content types (rețetă / cântec / ceremonie),
+which surface only when the user explicitly filters that content_type or picks
+a non-game category. This test guards that behaviour.
+"""
+
+import pytest
+
+from app.models.activity import Activity
+from app.models.database import DatabaseManager
+from app.services.search import SearchService
+from app.config_taxonomy import NON_GAME_CONTENT_TYPES
+
+
+# --------------------------------------------------------------------------
+# fixtures
+# --------------------------------------------------------------------------
+def _activity(name, content_type, category="altele", language="ro"):
+    return Activity(
+        name=name,
+        description=f"Descriere pentru {name}, un conținut de tip {content_type}.",
+        category=category,
+        content_type=content_type,
+        language=language,
+        source_file="test/fixture.txt",
+    )
+
+
+@pytest.fixture
+def search_service(tmp_path):
+    """A SearchService over a temp DB seeded with one row per content_type."""
+    db = DatabaseManager(str(tmp_path / "activities.db"))
+    db.clear_database()
+    db.bulk_insert_activities([
+        _activity("Vanatoarea de comori", "joc", category="wide-games"),
+        _activity("Cercul de cunoastere", "activitate", category="icebreakers"),
+        _activity("Reteta de paine la ceaun", "reteta", category="retete"),
+        _activity("Cantecul de tabara", "cantec", category="cantece-ceremonii"),
+        _activity("Ceremonia de inchidere", "ceremonie", category="cantece-ceremonii"),
+        _activity("Game in English", "joc", category="wide-games", language="en"),
+    ])
+    return SearchService(db)
+
+
+def _content_types(results):
+    return {r.get("content_type") for r in results}
+
+
+# --------------------------------------------------------------------------
+# the regression: default search excludes non-game content types
+# --------------------------------------------------------------------------
+def test_default_search_excludes_non_game_content(search_service):
+    """No filters → rețete / cântece / ceremonii must NOT appear."""
+    results = search_service.search_activities()
+    types = _content_types(results)
+
+    assert types, "default search returned nothing"
+    for non_game in NON_GAME_CONTENT_TYPES:
+        assert non_game not in types, (
+            f"default search leaked non-game content_type '{non_game}'"
+        )
+    # game content is still present
+    assert "joc" in types
+    assert "activitate" in types
+
+
+def test_default_search_with_text_excludes_non_game(search_service):
+    """A text query still excludes non-game content by default."""
+    results = search_service.search_activities(search_text="conținut")
+    assert not (set(NON_GAME_CONTENT_TYPES) & _content_types(results))
+
+
+# --------------------------------------------------------------------------
+# explicit content_type filter INCLUDES the non-game rows
+# --------------------------------------------------------------------------
+def test_explicit_content_type_filter_includes_non_game(search_service):
+    """Filtering content_type=reteta returns exactly the rețete."""
+    results = search_service.search_activities(filters={"content_type": "reteta"})
+    types = _content_types(results)
+
+    assert types == {"reteta"}, f"expected only rețete, got {types}"
+    assert len(results) == 1
+
+
+def test_explicit_content_type_filter_for_cantec(search_service):
+    results = search_service.search_activities(filters={"content_type": "cantec"})
+    assert _content_types(results) == {"cantec"}
+
+
+# --------------------------------------------------------------------------
+# a non-game CATEGORY filter also lifts the exclusion
+# --------------------------------------------------------------------------
+def test_non_game_category_filter_includes_non_game(search_service):
+    """Picking category=cantece-ceremonii surfaces cântece + ceremonii."""
+    results = search_service.search_activities(
+        filters={"category": "cantece-ceremonii"})
+    types = _content_types(results)
+
+    assert "cantec" in types
+    assert "ceremonie" in types
+
+
+def test_game_category_filter_still_excludes_non_game(search_service):
+    """A normal (game) category filter keeps the non-game exclusion."""
+    results = search_service.search_activities(filters={"category": "wide-games"})
+    types = _content_types(results)
+    for non_game in NON_GAME_CONTENT_TYPES:
+        assert non_game not in types
+
+
+# --------------------------------------------------------------------------
+# language filter
+# --------------------------------------------------------------------------
+def test_language_filter_ro(search_service):
+    results = search_service.search_activities(filters={"language": "ro"})
+    assert results
+    assert all(r.get("language") == "ro" for r in results)
+
+
+def test_language_filter_en(search_service):
+    results = search_service.search_activities(filters={"language": "en"})
+    assert results
+    assert all(r.get("language") == "en" for r in results)
+    assert {r.get("name") for r in results} == {"Game in English"}
+
+
+# --------------------------------------------------------------------------
+# get_filter_options surfaces the new axes
+# --------------------------------------------------------------------------
+def test_filter_options_include_content_type_and_language(search_service):
+    """The dynamic-filter mechanism now exposes content_type + language."""
+    options = search_service.db.get_filter_options()
+    assert "content_type" in options
+    assert "language" in options
+    assert "joc" in options["content_type"]
+    assert set(options["language"]) == {"ro", "en"}
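Reviewer sketch (not part of the patch): the rule these tests guard can be stated as "hide non-game content types unless the caller filters on `content_type` or picks a non-game category". The block below illustrates that rule over plain dict rows. Both constants and the function name are assumptions inferred from the fixtures above; the real `NON_GAME_CONTENT_TYPES` in `app/config_taxonomy.py` and the logic in `search.py` are not shown here.

```python
# Assumed values, inferred from the test fixtures rather than read from the app.
ASSUMED_NON_GAME_TYPES = ("reteta", "cantec", "ceremonie")
ASSUMED_NON_GAME_CATEGORIES = ("retete", "cantece-ceremonii")


def sketch_filter_rows(rows, filters=None):
    """Apply equality filters, then the default non-game exclusion."""
    filters = filters or {}
    # The exclusion is lifted by an explicit content_type or a non-game category.
    lifted = (
        "content_type" in filters
        or filters.get("category") in ASSUMED_NON_GAME_CATEGORIES
    )
    kept = []
    for row in rows:
        if any(row.get(key) != value for key, value in filters.items()):
            continue  # fails an explicit equality filter
        if not lifted and row.get("content_type") in ASSUMED_NON_GAME_TYPES:
            continue  # hidden by the default exclusion
        kept.append(row)
    return kept
```

Keeping the decision in a single `lifted` flag makes the two escape hatches (explicit type, non-game category) easy to test independently, which mirrors how the tests above are grouped.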
--- a/tests/test_validate_extractions.py
+++ b/tests/test_validate_extractions.py
@@ -0,0 +1,156 @@
+# -*- coding: utf-8 -*-
+"""
+Tests for scripts/validate_extractions.py.
+
+Covers: schema rejection, the source_excerpt hallucination check, the content
+of the generated re-extraction prompt, and the manifest `rejected` marking.
+"""
+
+import json
+import sys
+from pathlib import Path
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+SCRIPTS_DIR = REPO_ROOT / "scripts"
+for _p in (str(REPO_ROOT), str(SCRIPTS_DIR)):
+    if _p not in sys.path:
+        sys.path.insert(0, _p)
+
+import validate_extractions as ve  # noqa: E402
+
+
+# --------------------------------------------------------------------------
+# helpers
+# --------------------------------------------------------------------------
+def _ext_activity(**over):
+    base = dict(
+        name="Jocul testului",
+        description="O activitate de echipa in aer liber.",
+        category="team-building",
+        content_type="joc",
+        language="ro",
+        extraction_confidence="high",
+        source_excerpt="ancora din bucata sursa",
+        page_reference="page 1",
+    )
+    base.update(over)
+    return base
+
+
+def _write_extraction(extracted_dir, chunk_key, activities, header_extra=None):
+    extracted_dir.mkdir(parents=True, exist_ok=True)
+    header = {
+        "source_hash": "hash1234deadbeef",
+        "schema_version": "1.0",
+        "prompt_version": "1.0",
+        "chunk_range": "pages 1-20",
+        "source_id": "src01",
+        "chunk_key": chunk_key,
+    }
+    if header_extra:
+        header.update(header_extra)
+    payload = {"header": header, "activities": activities}
+    (extracted_dir / f"{chunk_key}.json").write_text(
+        json.dumps(payload, ensure_ascii=False), encoding="utf-8"
+    )
+
+
+def _write_chunk(chunks_dir, source_id, chunk_key, text):
+    d = chunks_dir / source_id
+    d.mkdir(parents=True, exist_ok=True)
+    (d / f"{chunk_key}.txt").write_text(text, encoding="utf-8")
+
+
+# --------------------------------------------------------------------------
+# tests
+# --------------------------------------------------------------------------
+def test_valid_file_passes(tmp_path):
+    extracted = tmp_path / "extracted"
+    chunks = tmp_path / "chunks"
+    excerpt = "ancora din bucata sursa apare aici"
+    _write_extraction(extracted, "src01.part01", [_ext_activity(source_excerpt=excerpt)])
+    _write_chunk(chunks, "src01", "src01.part01", f"--- PAGE 1 ---\n{excerpt}\n")
+
+    report = ve.run(extracted, chunks, tmp_path / "manifest.json")
+    assert report["valid"] == 1
+    assert report["rejected"] == 0
+
+
+def test_schema_invalid_file_rejected(tmp_path):
+    extracted = tmp_path / "extracted"
+    chunks = tmp_path / "chunks"
+    extracted.mkdir(parents=True)
+    (extracted / "src01.part01.json").write_text(
+        json.dumps({"header": {}, "activities": [{"name": "x"}]}), encoding="utf-8"
+    )
+
+    report = ve.run(extracted, chunks, tmp_path / "manifest.json")
+    assert report["rejected"] == 1
+    prompt = extracted / "_reextract" / "src01.part01.prompt.md"
+    assert prompt.exists()
+
+
+def test_hallucinated_excerpt_rejected(tmp_path):
+    extracted = tmp_path / "extracted"
+    chunks = tmp_path / "chunks"
+    _write_extraction(
+        extracted, "src01.part01",
+        [_ext_activity(source_excerpt="citat complet inventat care nu exista qqqq")],
+    )
+    _write_chunk(chunks, "src01", "src01.part01",
+                 "--- PAGE 1 ---\ntext complet diferit despre altceva.\n")
+
+    report = ve.run(extracted, chunks, tmp_path / "manifest.json")
+    assert report["rejected"] == 1
+    errors = report["rejected_chunks"][0]["errors"]
+    assert any("hallucination" in e for e in errors)
+
+
+def test_reextraction_prompt_content(tmp_path):
+    extracted = tmp_path / "extracted"
+    chunks = tmp_path / "chunks"
+    _write_extraction(
+        extracted, "src01.part01",
+        [_ext_activity(source_excerpt="citat inventat care nu exista zzzz")],
+    )
+    _write_chunk(chunks, "src01", "src01.part01",
+                 "--- PAGE 1 ---\ntext despre cu totul altceva aici.\n")
+
+    ve.run(extracted, chunks, tmp_path / "manifest.json")
+    prompt = (extracted / "_reextract" / "src01.part01.prompt.md").read_text(
+        encoding="utf-8"
+    )
+    assert "src01.part01" in prompt
+    assert "REJECTED" in prompt
+    assert "verbatim" in prompt
+    assert "data/extracted/src01.part01.json" in prompt
+
+
+def test_manifest_marks_chunk_rejected(tmp_path):
+    extracted = tmp_path / "extracted"
+    chunks = tmp_path / "chunks"
+    manifest_path = tmp_path / "manifest.json"
+    manifest_path.write_text(
+        json.dumps({"chunks": {"src01.part01": {"state": "done",
+                                                "chunk_file": "chunks/src01/src01.part01.txt"}}}),
+        encoding="utf-8",
+    )
+    _write_extraction(
+        extracted, "src01.part01",
+        [_ext_activity(source_excerpt="citat fabricat absent vvvv")],
+    )
+    _write_chunk(chunks, "src01", "src01.part01",
+                 "--- PAGE 1 ---\nun continut neinrudit.\n")
+
+    ve.run(extracted, chunks, manifest_path)
+    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
+    assert manifest["chunks"]["src01.part01"]["state"] == "rejected"
+
+
+def test_build_reextraction_prompt_lists_errors():
+    prompt = ve.build_reextraction_prompt(
+        "abc.part03", "data/chunks/abc/abc.part03.txt",
+        ["header: 'source_hash' is a required property"],
+    )
+    assert "abc.part03" in prompt
+    assert "source_hash" in prompt
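Reviewer sketch (not part of the patch): the hallucination tests above reject an activity whose `source_excerpt` cannot be found in its chunk text. The commit log mentions rapidfuzz `partial_ratio` for this check; the block below swaps in the standard-library `difflib` so it runs without extra dependencies, and therefore its scores will not match rapidfuzz exactly. The function names and the 85.0 cutoff are assumptions, not values taken from `validate_extractions.py`.

```python
from difflib import SequenceMatcher


def _squash(text):
    """Collapse whitespace and lowercase so layout differences do not matter."""
    return " ".join(text.split()).lower()


def sketch_excerpt_score(excerpt, chunk_text):
    """Best 0-100 similarity of the excerpt against any same-length chunk window."""
    needle, haystack = _squash(excerpt), _squash(chunk_text)
    if not needle:
        return 0.0
    if needle in haystack:
        return 100.0  # verbatim quote
    width = len(needle)
    best = 0.0
    # Slide a window of the excerpt's length over the chunk and keep the best ratio.
    for start in range(max(1, len(haystack) - width + 1)):
        window = haystack[start:start + width]
        ratio = SequenceMatcher(None, needle, window, autojunk=False).ratio()
        best = max(best, ratio * 100.0)
    return best


def sketch_is_hallucinated(excerpt, chunk_text, threshold=85.0):
    """Flag excerpts scoring below the (assumed) threshold."""
    return sketch_excerpt_score(excerpt, chunk_text) < threshold
```

The sliding window is quadratic in the worst case, which is acceptable for short excerpts against chunk-sized text but is one reason a dedicated library would be preferred in a real pipeline.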
Author	SHA1	Message	Date
Claude Agent	f7a37f91ec	Headless cron enrichment system + progress checkpoint at 32% OS cron fires enrich_wave.sh twice nightly (post 23:00 UTC reset); each wave caps at ~700 keys (~75% window) via enrichment_wave.py --prepare. Fully headless: one claude -p per batch via xargs, flock-guarded, idempotent. DB updated to 9541 activities; .gitignore covers enrichment intermediates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-01 21:26:35 +00:00
Claude Agent	d6971e47f8	Prevent + net the unescaped-quote bug in the durable prompts/pipeline The escape-ASCII-quote rule previously lived only in ephemeral Agent-call strings. Bake it into the durable artifacts so the next session doesn't re-derive it: - SUBAGENT_PROMPT.md + ENRICHMENT_PROMPT.md: explicit rule to escape any ASCII " inside JSON string values (Romanian „cuvânt" is the trap). - run_enrichment.py collect_enrichment: repair malformed parts with escape_stray_quotes instead of dropping them — the enrichment path had no repair net (bad parts were silently dropped, losing that activity's enrichment). Extraction already had one; now both do. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 18:16:04 +00:00
Claude Agent	bcfb6841eb	Faza 1 complete: bilingual+enrichment plumbing, UI/filters, frozen DB Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await re-extraction). DB rebuilt and frozen at 9418 activities — content_keys are now stable for the enrichment overlay. Part A (plumbing + UI): - database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor, space_needed, estimated_fields, source_id/source_ids/chunk_key columns; FTS5 indexes the 4 _ro columns across CREATE + all 3 triggers; new equality filters + category counts for both axes. - activity.py: new fields + bilingual display helpers (get_display_, is_estimated, axis displays). - config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers (None on unrecognised, no fabrication). - search.py / routes.py / config.py / templates / css: new dropdowns, RO-primary rendering with "(estimat)" markers and collapsible original text, and a /source/<id> download route shipped DARK behind SOURCE_DOWNLOAD_ENABLED (copyright opt-in). - build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster unions source_ids without touching enrichment fields. Part B (enrichment pipeline, built not yet run): - build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on content_key) + --enrichment CLI + stated-vs-estimated QA. - run_enrichment.py (resumable, --source/--limit pilot scoping, --collect), ENRICHMENT_PROMPT.md. Repair: scripts/repair_extractions.py fixes the subagents' systematic unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never truncates) + schema validation + a strictly-more-text guard. json_repair was tried first, truncated silently, and is NOT used. build_database has no repair dependency. Tests: tests/test_enrichment.py added; 99 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-29 18:10:13 +00:00
Claude Agent	46d9592a55	HANDOFF for Faza 1 resumption (10.9% done, switch to Sonnet) 64/588 chunks extracted so far (~1949 activities) but in a fresh session we should switch the subagent model from Opus to Sonnet — the task is structured JSON extraction with a fixed schema, no complex reasoning needed, and Sonnet's 200K context easily fits the ~25k-token prompt and ~20k-token output per chunk. Document captures the exact resume procedure: pending-chunk discovery, the Agent call template with model:"sonnet", and the finalization steps (validate -> build_database -> needs_review bulk merge). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 19:32:44 +00:00
Claude Agent	09999ccd40	Faza 0 follow-ups: re-extract 13 chunks, resolve 377 needs_review - Re-extracted the 13 chunks with paraphrased source_excerpts (root cause: original excerpts straddled --- PAGE N --- markers which the rapidfuzz partial_ratio scored 75-90/100). Re-extraction used verbatim within-page quotes; all now score 100/100. - Hallucinated drops: 19 -> 0. - Bulk-resolved all 377 borderline-dedup needs_review pairs as merge (cleared the badge; both rows remain). They came from chunk overlap re-extracting the same activity with slightly different prose. - Final DB: 1751 activities (was 1732). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:59:36 +00:00
Claude Agent	3d9f266696	Faza 0 pilot: rebuild activities.db from 5-file extraction 61 chunks × LLM subagent extraction yielded 1780 raw activities; build_database dedup + hallucination check yielded 1732 in DB. Pilot metrics vs plan acceptance thresholds: - hallucinated drops : 19/1780 = 1.07% (threshold ≤ 2%) - schema-rejected files : 0/61 (threshold ≥ 0.9 valid) - chunks needing re-extract: 13/61 (paraphrased excerpts 75-90/100) - % with rules : 99.9% - extraction_confidence high: 1712/1732 = 98.8% OCR decision: NOT NEEDED. The Cartea_Mare scanned-PDF candidate extracted 151 pages / 38k words of real text via pdfplumber alone. Pilot files: - 1000 Fantastic Scout Games (EN, 278pg, 18 chunks → 946 activities) - dragon.sleepdeprived.ca/games mirror (EN, 498pg, 31 chunks → 531) - Cartea Mare a Jocurilor (RO, 151pg, 10 chunks → 284) - Activităţi şi jocuri ... .doc (RO, 7pg, 1 chunk → 19, needs_review) - Amazing Race templates zip (graphics only, 0 activities — expected) The old activities.db was backed up to .bak before atomic swap. tests/ still green (71 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 07:43:42 +00:00
Claude Agent	66ae831c36	Rebuild extraction pipeline infrastructure (Faza 0 prep) Implements the approved plan to replace the broken regex/index-master extraction with an LLM-subagent pipeline. Four parallel lanes: Lane A — scripts/extract_common.py (PDF/docx/doc/pptx/html/zip, no max_pages truncation), normalize_sources.py, chunk_sources.py (~20pg chunks + overlap, manifest registry), activity_schema.json. Lane B — app/config_taxonomy.py (16 fixed category slugs), schema rebuilt from scratch in app/models/ with content_type, language, source_files, source_excerpt, normalized_name, extraction_confidence, needs_review; FTS5 + 3 triggers extended with materials_list and skills_developed. Lane C — build_database.py (--rebuild, atomic swap, schema + fuzzy source_excerpt validation, dedup with needs_review band), validate_extractions.py, review_queue.py, new run_extraction.py orchestrator, SUBAGENT_PROMPT.md. Lane D — search.py content_type/language filters (default search excludes non-game content), E7 schema-compat audit; fixed a NULL keywords AttributeError in _boost_search_relevance. Removes 8 orphaned/dead scripts and app/services/parser.py + indexer.py. Adds tests/ (70 passing, 1 skipped — libreoffice absent). Note: Lane D made one additive edit to app/models/database.py (_update_category_counts) to surface content_type/language in get_filter_options, outside its nominal lane boundary but after Lane B completed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 17:43:38 +00:00
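Reviewer sketch (not part of the commit log): commits d6971e47f8 and bcfb6841eb describe repairing LLM output in which an ASCII `"` appears unescaped inside a JSON string value (the Romanian „cuvânt" pattern) by escaping it rather than truncating. The block below is a minimal illustration of that idea, assuming a simple lookahead heuristic: a quote inside a string closes it only when the next non-space character is structural. It is not the repository's `escape_stray_quotes`, whose implementation is not shown here, and the function name is hypothetical.

```python
import json

STRUCTURAL = set(",}]:")  # characters that may legally follow a closing quote


def sketch_escape_stray_quotes(raw: str) -> str:
    """Escape ASCII quotes that appear inside JSON string values."""
    out = []
    in_string = False
    i = 0
    while i < len(raw):
        ch = raw[i]
        if in_string and ch == "\\":
            # Copy an existing escape sequence (backslash plus next char) unchanged.
            out.append(raw[i:i + 2])
            i += 2
            continue
        if ch == '"':
            if not in_string:
                in_string = True
                out.append(ch)
            else:
                rest = raw[i + 1:].lstrip()
                if not rest or rest[0] in STRUCTURAL:
                    in_string = False  # genuine closing quote
                    out.append(ch)
                else:
                    out.append('\\"')  # stray quote: escape it, never drop text
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)


broken = '{"name": "Jocul „cuvânt" magic", "n": 1}'
fixed = sketch_escape_stray_quotes(broken)
parsed = json.loads(fixed)
```

The heuristic has a known blind spot: a stray quote that happens to be followed by a comma, colon, or closing bracket is read as a closing quote, so a schema validation pass after the repair (as the commit message describes) remains necessary.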