# HANDOFF: Enrichment PILOT done; STOP at user sign-off gate before full-corpus scaling

**Snapshot:** 2026-05-29 (updated). Executing plan `enumerated-petting-badger.md` (bilingual index + enrichment + new filters + source download).

**>>> CURRENT STATE (2026-05-29): user SIGNED OFF on the pilot → full-corpus enrichment SCALING IN PROGRESS via 8 PARALLEL background Workflows on Sonnet. Parts on disk (`data/enrichment_parts/.json`) = the durable checkpoint. <<<**

Two earlier single-workflow runs were stopped: the first ran on Opus by mistake (workflow `agent()` inherits the main-loop model unless `model:'sonnet'` is passed; fixed). Measured rate: ~4.7 keys/min single-shard, ~17 keys/min at 3 shards (linear, no real rate-limit errors; the "429" hits in transcripts are line numbers in chunk text, not API errors). Concurrency is capped at 2 PER workflow (`nproc`=4 → `min(16,cores-2)`), so parallelism = run many workflows.

**8 shard scripts: `data/enrichment_wf/shard_0.js` … `shard_7.js`**, each owns a disjoint batch range of `data/enrichment_batches/batch_NNNN.txt` (780 batches × ~12 keys = 9357 keys; ranges: s0 [0,98) s1 [98,196) s2 [196,294) s3 [294,392) s4 [392,490) s5 [490,588) s6 [588,686) s7 [686,780)). Each agent is IDEMPOTENT (skips keys whose part already exists + parses), so re-launching any shard is safe.

Run IDs: s0 `wf_3c314d06-01c` · s1 `wf_ecc7d151-a11` · s2 `wf_4156be35-748` · s3 `wf_fa16abee-17a` · s4 `wf_a0f595b8-8fe` · s5 `wf_b3505593-09a` · s6 `wf_ad0d731e-12e` · s7 `wf_a919a99b-1d2`.

### ▶ RESUME HERE (2026-06-01: THROTTLED CRON SYSTEM now drives enrichment)

**Enrichment progress: 3074 / 9541 done (32.2%), ~6467 missing.** Nothing running. Parts on disk (`data/enrichment_parts/*.json`) are the durable, idempotent checkpoint.
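
The "part exists + parses" rule behind the checkpoint can be illustrated with a short sketch. This is not the project's code: `part_is_done` and `progress` are hypothetical helpers, and the only assumption carried over from the notes is that one JSON file per key lives in a parts directory.

```python
import json
import tempfile
from pathlib import Path


def part_is_done(parts_dir: Path, key: str) -> bool:
    """A key counts as done only if its part file exists AND parses as JSON."""
    part = parts_dir / f"{key}.json"
    if not part.is_file():
        return False
    try:
        json.loads(part.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False  # corrupt or truncated part: treat as missing
    return True


def progress(parts_dir: Path, all_keys: list) -> dict:
    """Summarise done / missing / pct over a list of keys (hypothetical shape)."""
    done = sum(part_is_done(parts_dir, k) for k in all_keys)
    total = len(all_keys)
    pct = round(100 * done / total, 1) if total else 0.0
    return {"done": done, "missing": total - done, "pct": pct}


# Demo on a throwaway directory: one valid part, one truncated part, one absent.
with tempfile.TemporaryDirectory() as tmp:
    parts = Path(tmp)
    (parts / "a.json").write_text('{"content_key": "a"}', encoding="utf-8")
    (parts / "b.json").write_text('{"content_key": ', encoding="utf-8")
    demo = progress(parts, ["a", "b", "c"])
```

Counting a corrupt part as missing is what makes a relaunch safe: the key is simply picked up again on the next pass.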
**A paced wave system now runs FULLY HEADLESS via the LXC's OS cron; NO Claude session required.** Fixes the "always runs to exhaustion" bug: each wave caps at ~75% of a 5h window and the next window is reached by time (cron).

ARCHITECTURE: OS cron → `scripts/enrich_wave.sh` → one `claude -p` per batch, PAR-way parallel (OS-level, NOT the Workflow tool, which can't be used headless: `claude -p` is one-shot and would exit before background workflows finish). Each headless `claude -p` reads a batch file and writes `data/enrichment_parts/.json`.

- **`scripts/enrichment_wave.py`** (prepares a bounded wave, no LLM):
  - `--status`: read-only progress (done / missing / pct / corrupt count).
  - `--prepare --keys 700 --no-shards`: drop corrupt parts; take FIRST 700 sorted-missing keys; write batch files for ONLY those; print `WAVE: PREPARED …` or `WAVE: COMPLETE`. `--no-shards` = batch files only (the headless path). (Without `--no-shards` it also regenerates Workflow shard JS from `data/enrichment_wf/shard.js.tmpl`, which is only needed for the old Workflow path.)
- **`scripts/enrich_wave.sh [KEYS] [PAR]`** (the headless orchestrator, run by cron): flock-guarded (waves never overlap); `--prepare`; if `WAVE: COMPLETE` → `--collect` + `--rebuild` and stop; else `xargs -P PAR` one `claude -p` per batch (`--allowedTools Bash(python3:*),Read,Write,Bash(cat:*),Bash(ls:*)`, `.log`). Detects + logs "WINDOW EXHAUSTED".
- **OS crontab (user `claude`, `crontab -l` to view):** two night fires `20 23 * * *` and `50 0 * * *` UTC (= 02:20 & 03:50 EEST). Timed AFTER the live-confirmed **23:00 UTC usage-window reset** so both land in the fresh post-reset window (user asleep → safe to use it fully; two 700-caps top out at the window's ~950 capacity). Self-healing: a fire into an exhausted window is a harmless no-op (`claude -p` prints "session limit", writes nothing) and those keys retry next fire.
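
The `--prepare` behaviour described above (first N sorted-missing keys, batch files for only those) can be sketched as a pure function. This is a minimal illustration, not the code in `scripts/enrichment_wave.py`: `plan_wave`, its parameters, and its return shape are hypothetical, the returned label mirrors only the prefix of the real `WAVE: …` markers, and the batch size of 12 is taken from the batch files described elsewhere in this handoff.

```python
def plan_wave(all_keys, done_keys, wave_cap=700, batch_size=12):
    """Return (status, batches) for one bounded wave.

    Hypothetical sketch: no disk access, no corrupt-part handling.
    """
    missing = sorted(set(all_keys) - set(done_keys))
    if not missing:
        return "WAVE: COMPLETE", []
    # Bound the wave so a single fire cannot drain the whole usage window.
    wave = missing[:wave_cap]
    batches = [wave[i:i + batch_size] for i in range(0, len(wave), batch_size)]
    return "WAVE: PREPARED", batches


# Demo: 1000 keys, the first 100 already enriched.
keys = [f"k{i:04d}" for i in range(1000)]
status, batches = plan_wave(keys, keys[:100])
```

Because the selection is always "first N of the sorted missing set", a wave that dies halfway leaves the untouched keys at the front of the next wave, which fits the self-healing retry described for the cron fires.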
**Auth caveat:** headless `claude -p` uses the OAuth token in `~/.claude/.credentials.json` (verified working). If it ever expires and can't refresh non-interactively, cron fires fail with auth errors → user must `claude` login once.

**Manual fallback (one wave, any time, no session needed):**

```bash
/workspace/game-library/scripts/enrich_wave.sh 700 6   # runs a full wave now
# or step-by-step:
python3 scripts/enrichment_wave.py --status            # progress
python3 scripts/run_enrichment.py --collect && python3 scripts/build_database.py --rebuild   # at WAVE: COMPLETE
# gate: rebuild must print  enrichment {N} (matched N, orphaned 0)
```

**Control:** `crontab -e` to retime/disable; `crontab -r` removes all. Tune `--keys` (KEYS arg) up to drain faster, down if logs show "WINDOW EXHAUSTED" early. One full window ≈ 950 keys ≈ 100%.

**Hard facts learned:**

- Workflow concurrency is capped at **2 per workflow** (`nproc`=4 → `min(16,cores-2)`); parallelism = run many workflow processes. 3 shards measured ~17 keys/min (linear, no real rate-limit).
- Workflow `agent()` inherits the **main-loop model unless `model:'sonnet'` is passed**; the FIRST run silently used Opus. Always pass model.
- The full corpus does **NOT fit in one 5h usage window**; it needs SEVERAL windows. Parallelism only cuts wall-clock inside a window, not total token budget.
- Main-session token drain was **polling** (sleep/grep loops), NOT launching workflows. Launch + wait-for-notification only.
- StructuredOutput failures appear when a window exhausts mid-run. They are harmless: idempotent skip + the regenerate-from-missing reconcile recover every dropped key.

(prev note) Earlier STOPPED at 593/9541: hit 92% of the 5h Anthropic usage window (resets 23:00 UTC). KEY LESSON: the full corpus does NOT fit in one 5h window; 6.2% + the session's other work already used ~92%. Enrichment must be spread across MANY 5h windows (parallelism only cuts wall-clock inside a window, not total token budget).
Resume model: after each window reset, regenerate batches from currently-missing, relaunch a bounded number of shards, stop before the window exhausts. Idempotent shards + parts-on-disk make this safe to repeat. Consider a cron/scheduled job that runs a bounded wave each reset.

**To regenerate batches from currently-missing + relaunch a shard** (reconcile):

```bash
python3 - <<'PY'
import glob, os

BATCH = 12
SUFFIX = '.prompt.md'

# Keys that have a prompt but no part on disk yet.
keys = (os.path.basename(p)[:-len(SUFFIX)]
        for p in glob.glob('data/enrichment_prompts/*' + SUFFIX))
missing = sorted(k for k in keys
                 if not os.path.exists('data/enrichment_parts/' + k + '.json'))

# Drop stale batch files, then write fresh ones of BATCH keys each.
for old in glob.glob('data/enrichment_batches/batch_*.txt'):
    os.remove(old)
for n, i in enumerate(range(0, len(missing), BATCH)):
    with open(f'data/enrichment_batches/batch_{n:04d}.txt', 'w') as fh:
        fh.write('\n'.join(missing[i:i + BATCH]) + '\n')

# Ceiling division, so an empty `missing` prints 0 instead of raising NameError.
print('missing', len(missing), 'batches', (len(missing) + BATCH - 1) // BATCH)
PY
# then edit START/END in the shard files to cover the new batch count and re-invoke Workflow({scriptPath: 'data/enrichment_wf/shard_K.js'})
```

### Resume / completion procedure (do this when the workflow finishes, or to continue a new session)

The pipeline is RESUMABLE: parts on disk are truth; re-running regenerates work only for missing keys.

1. **Rebuild the batch list from what's still missing** (prompt exists, part absent), then re-run the workflow for the gap:
   ```bash
   # regenerate batch files for missing keys (script below already lives in shell history; logic:
   #   for each data/enrichment_prompts/.prompt.md with no data/enrichment_parts/.json,
   #   split into data/enrichment_batches/batch_NNNN.txt of 12)
   ```
   The workflow script is at `.../workflows/scripts/enrich-corpus-wf_440c0a2f-17f.js` (nBatches hardcoded → update it to the new batch count, or it defaults to 793). Re-invoke with `{scriptPath: ...}`.
2. **Reconcile loop** (expect 2–3 passes; some parts WILL drop: flaky agents, a stray quote that slips re-validation): repeat step 1 until `missing == 0`.
3. **Collect + final rebuild ONCE at the end** (don't rebuild after every wave; 9541 rows is wasted work):
   ```bash
   python3 scripts/run_enrichment.py --collect       # robust: repairs stray-quote parts, skips+reports truly-broken
   python3 scripts/build_database.py --rebuild       # picks up --enrichment by default
   ```
   **Gate:** rebuild must print `enrichment {entries} (matched {entries}, orphaned 0)`. Done-criterion is the reconcile counts converging: `emitted == parts-on-disk == entries == matched`.

### ⚠ FREEZE IS NOW LOCKED

Enrichment content_keys depend on the current freeze. The earlier "re-freezing is safe" note is **INVERTED** now: do NOT re-extract or re-freeze `data/extracted/` until the final `--rebuild`, or content_keys drift and the overlay orphans.

## Where we are

| Step (plan Part C) | Status |
|--------------------|--------|
| 1. Finish extraction | **DONE**: 582 chunks extracted & valid (7 re-extracted this session); 6 content-filter-blocked, accepted as missing |
| 2. Land code Part A1–A4 (model/schema/merge) | **DONE & committed** |
| 2b. Code Part A5–A8 (UI/search/download) | **DONE & committed** |
| 2c. Code Part B2–B4 (enrichment pipeline) | **DONE & committed** |
| 3. Freeze rebuild (freezes content_keys) | **DONE**: `data/activities.db` = **9541 activities** (re-frozen with the 7 chunks) |
| Part D tests | **DONE**: `tests/test_enrichment.py`, 99 pass total |
| 4. Enrichment pilot → **STOP for user sign-off** | **DONE: 34 activities enriched (26 ro-polish + 8 en→ro), pipeline 34/34 matched, 0 orphaned. AWAITING SIGN-OFF.** |
| 5. Final rebuild `--enrichment` | not started (post sign-off) |

## The 7 re-extracted chunks (this session)

Re-extracted via Sonnet subagents, all valid JSON, re-frozen into the corpus. One (`d297a434…part01`) had an activity named "Eu" (<3 chars, schema-rejected); renamed faithfully to "Eu sunt..." (matches the source affirmation). The 6 content-filter-blocked chunks remain accepted as missing.
Everything is committed except whatever this session leaves dirty. `data/extracted/*.json` is gitignored (575 files on disk, durable across /clear).

## The 13 missing chunks (out of 588)

**6 content-filter-blocked** (Anthropic safety; accept as missing, marginal loss):

- `87850302_dragon_sleepdeprived.part73 / .part85 / .part94` (camp song lyrics)
- `c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96`

**7 need RE-EXTRACTION** (their malformed-original JSON was destroyed; see "json_repair incident" below; re-extract once the subagent session limit resets, ~5pm UTC):

```
3f9c8232_teambuilding_corbu_29092023.part01
5f959f85_scoli_fara_bullying.part02
83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04
d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05
e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03
```

Re-extract these (Sonnet subagents, one Agent call each; the per-chunk prompt is at `data/chunks/_prompts/.prompt.md`), then **re-run the freeze rebuild** so they join the corpus before enrichment. Re-freezing is safe now: enrichment has NOT run, so no overlay keys depend on the current freeze yet.

## The json_repair incident (important: root cause + what was fixed)

Subagents **systematically emit unescaped ASCII `"` inside string values** (Romanian text like `„Unu"` uses a closing `"` that terminates the JSON string early). ~34 files were affected.

First repair attempt used the `json_repair` lib. **It truncates**: on a stray quote it ends the string and reinterprets the trailing text as a new key, silently dropping the rest of the value and injecting garbage keys.
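
The stray-quote failure is easy to reproduce, and one non-truncating way to handle it is to escape the quote in place. The sketch below is an illustration only and not the project's repair code: `escape_stray_quotes_sketch` is a hypothetical name, and its heuristic (a quote closes a string only when the next non-whitespace character is `,`, `}`, `]`, `:` or end of input) is an assumption that will misjudge text such as `"a", b` inside a value.

```python
import json


def escape_stray_quotes_sketch(text: str) -> str:
    """Escape double quotes that sit inside a JSON string value (heuristic)."""
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
        elif ch == "\\" and i + 1 < n:
            # Keep existing escape sequences intact.
            out.append(ch)
            out.append(text[i + 1])
            i += 1
        elif ch == '"':
            # Look ahead to decide whether this quote really closes the string.
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if j == n or text[j] in ",}]:":
                in_string = False
                out.append(ch)
            else:
                out.append('\\"')  # stray quote: escape, never split
        else:
            out.append(ch)
        i += 1
    return "".join(out)


broken = '{"name": "„Unu" doi", "players": 4}'
try:
    json.loads(broken)
    parsed_raw = True
except json.JSONDecodeError:
    parsed_raw = False

repaired = json.loads(escape_stray_quotes_sketch(broken))
```

Escaping keeps every character of the original value, so the repaired text can only be as long as or longer than the input, never shorter.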
Schema `additionalProperties:false` caught the garbage-key cases (8 files dropped at rebuild), but the truncation that didn't create an extra key slipped through. Applying json_repair output to disk also **overwrote the malformed originals** for those 8 → originals lost → those (now 7, one recovered) need re-extraction.

**Fix:** `scripts/repair_extractions.py` was rewritten to use a faithful char-scanner (`escape_stray_quotes`) that **escapes** stray quotes (`\"`) instead of splitting on them, validates against the real schema, and only replaces a valid top-level file when the repaired version carries **strictly more text** (a length guard that catches truncated json_repair output while leaving genuine extractions untouched). Re-running it cleanly repaired the affected files; the final freeze had **0 schema-rejected, 0 invalid**. `json_repair` is no longer used anywhere. Do NOT reintroduce it. `build_database.py` does NOT depend on the repair script (the "DB regenerable from data/extracted/" invariant holds: plain `json.loads` only).

## What the code does now (all committed)

**Part A, plumbing (corpus-independent):**

- `app/models/database.py`: new columns `name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor, space_needed, estimated_fields(JSON), source_id, source_ids(JSON), chunk_key`; FTS5 indexes the 4 `*_ro` columns (CREATE + all 3 triggers, kept in sync); indexes on `indoor_outdoor`/`space_needed`; `search_activities` gained `indoor_outdoor` and `space_needed` equality kwargs; `_update_category_counts` feeds both new axes into the categories table so dropdowns populate.
- `app/models/activity.py`: new fields + `to_dict`/`from_dict`; helpers `get_display_name` / `get_display_description` / `get_display_rules` / `get_display_variations` (RO-primary, EN fallback), `has_translation`, `is_estimated(field)`, `get_indoor_outdoor_display`, `get_space_needed_display`.
- `app/config_taxonomy.py`: `INDOOR_OUTDOOR`, `SPACE_NEEDED` enums + RO labels + `normalize_indoor_outdoor` / `normalize_space_needed` (None on unrecognised, no fallback; never fabricate a value) + display-name helpers.
- `scripts/build_database.py`: `dict_to_activity` sets `source_id`+`chunk_key`; `merge_cluster` unions `source_ids` and carries rep's `source_id`/`chunk_key` but **never** touches enrichment fields (those are applied post-dedup).

**Part A, UI/search:**

- `app/services/search.py`: `_map_filters_to_db_fields` maps `indoor_outdoor`/`space_needed` to DB equality filters.
- `app/web/routes.py`: new `/source/` download route, **shipped DARK behind `SOURCE_DOWNLOAD_ENABLED` (default false; copyright exposure, user opts in)**; resolves `source_file` under `CORPUS_DIR` via `send_from_directory` (traversal-safe, 404s for web-mirror sources). `DISPLAY_NAMES` extended with both new axes.
- `app/config.py`: `SOURCE_DOWNLOAD_ENABLED`, `CORPUS_DIR`.
- Templates: `index.html`/`results.html` have the 2 new dropdowns; cards use display helpers + `(estimat)` markers; `activity.html` is RO-primary with a collapsible "Text original" section, indoor/space cards, estimat markers, and the download link (only when the flag is on). `main.css` has `.estimated` / `.original-text` styles.

**Part B, enrichment pipeline (built, not yet run):**

- `scripts/build_database.py`: `load_enrichment` + `apply_enrichment(activities, enrichment)` applied **right after** `apply_review_decisions`, on the post-dedup list, keyed on `import_common.content_key(normalized_name, language, _normalize_text(description))` (reused verbatim). CLI `--enrichment` (default `data/enrichment.json`). QA report prints `enrichment {entries, matched, orphaned}` + per-field **stated vs estimated** counts. Translated/expanded text is NOT re-validated against source (by design).
- `scripts/run_enrichment.py`: reads the rebuilt DB, computes each row's content_key, skips rows already in `data/enrichment_parts/.json` (resumable), emits one prompt per activity to `data/enrichment_prompts/` (current EN fields + source chunk text via `find_chunk_text`). Pilot scoping: `--source ` and/or `--limit N`. `--collect` merges parts → `data/enrichment.json`.
- `scripts/ENRICHMENT_PROMPT.md`: single-pass rules: translate faithfully, expand `description_ro` ONLY from chunk text, mark inferred filter fields in `estimated_fields`, fixed enum vocab, output `data/enrichment_parts/.json` including `content_key`.

## Exact next steps

1. **Re-extract the 7 chunks** above (after session-limit reset). Verify each writes valid JSON (`python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"`). If any come back malformed, `python3 scripts/repair_extractions.py --apply` (faithful now).
2. **Re-freeze:** `python3 scripts/build_database.py --rebuild`, then confirm 0 schema-rejected and note the new total (~9418 + the 7 chunks' activities).
3. **Enrichment PILOT** (plan B5: the STOP gate guarding 6–8k LLM calls):
   - Pick one source, e.g. `python3 scripts/run_enrichment.py --source teambuilding_corbu` (or `--limit 30`). This writes prompts to `data/enrichment_prompts/`.
   - Launch a small wave of Sonnet subagents on those prompts (each writes `data/enrichment_parts/.json`).
   - `python3 scripts/run_enrichment.py --collect` → `data/enrichment.json`.
   - `python3 scripts/build_database.py --rebuild` (picks up `--enrichment` by default).
   - **STOP. Hand the user translation-quality + estimation-plausibility + description-fidelity samples and get sign-off BEFORE scaling to the full corpus.** Do not auto-proceed past this gate.
4. After sign-off: scale enrichment in waves of ~8–16 Sonnet subagents, `--collect`, final `--rebuild --enrichment`.

## Verify / run

- Tests: `python3 -m pytest tests/ -q` → 99 pass.
- App: `SOURCE_DOWNLOAD_ENABLED` is false by default (download link hidden). Set it true only if the user accepts the copyright exposure of serving original files.
- `data/activities.db.bak` is the backup taken before this freeze.
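
The rebuild gate quoted in the fallback commands (`enrichment {N} (matched N, orphaned 0)`) can also be checked mechanically rather than by eye. A minimal sketch, assuming the QA report prints that line with plain integers in exactly that layout; the regex and the `gate_passes` helper are hypothetical and would need adjusting to the real output of `build_database.py`.

```python
import re

# Assumed layout of the gate line; not confirmed against the real report.
GATE_RE = re.compile(
    r"enrichment\s+(\d+)\s+\(matched\s+(\d+),\s+orphaned\s+(\d+)\)"
)


def gate_passes(rebuild_output: str) -> bool:
    """True only if every enrichment entry matched and none were orphaned."""
    m = GATE_RE.search(rebuild_output)
    if m is None:
        return False  # no gate line at all counts as a failure
    entries, matched, orphaned = (int(g) for g in m.groups())
    return entries == matched and orphaned == 0


# Demo with the pilot's reported counts (34 entries, 34 matched, 0 orphaned).
pilot_ok = gate_passes("QA report\nenrichment 34 (matched 34, orphaned 0)\n")
drifted = gate_passes("enrichment 34 (matched 30, orphaned 4)")
```

Treating a missing gate line as a failure keeps a silently truncated rebuild log from being read as success.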