Headless cron enrichment system + progress checkpoint at 32%

OS cron fires enrich_wave.sh twice nightly (after the 23:00 UTC usage-window reset); each wave is capped at ~700 keys (~75% of a window) via enrichment_wave.py --prepare. Fully headless: one claude -p per batch via xargs, flock-guarded, idempotent. DB updated to 9541 activities; .gitignore covers enrichment intermediates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 21:26:35 +00:00
parent d6971e47f8
commit f7a37f91ec
6 changed files with 619 additions and 7 deletions
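
The bounded-wave idea in the commit message (take only the first ~700 missing keys, split them into batches) can be sketched roughly as follows. This is a minimal illustration and not the actual `enrichment_wave.py`: the function name `prepare_wave`, its parameters, and the in-memory key lists are assumptions made for the example. The batch size of 12 comes from the handoff notes.

```python
from typing import Iterable, List


def prepare_wave(prompt_keys: Iterable[str], done_keys: Iterable[str],
                 max_keys: int = 700, batch_size: int = 12) -> List[List[str]]:
    """Hypothetical helper: batches for the next wave.

    Takes the first `max_keys` missing keys in sorted order and splits them
    into batches of `batch_size`. An empty result means nothing is missing.
    """
    done = set(done_keys)
    # A key is "missing" when a prompt exists but no finished part does.
    missing = sorted(k for k in set(prompt_keys) if k not in done)
    wave = missing[:max_keys]  # cap the wave so it cannot drain the window
    return [wave[i:i + batch_size] for i in range(0, len(wave), batch_size)]


if __name__ == "__main__":
    keys = [f"key{i:04d}" for i in range(2000)]
    batches = prepare_wave(keys, done_keys=keys[:500])
    print(len(batches), sum(len(b) for b in batches))  # prints "59 700"
```

Because the selection is derived only from what is still missing, calling it again after some parts have landed returns only the remaining work, which is what makes repeated cron fires safe.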
--- a/HANDOFF.md
+++ b/HANDOFF.md
@@ -1,20 +1,153 @@
-# HANDOFF — Faza 1 extraction DONE, code landed, DB frozen; next = re-extract 7 + enrichment pilot
+# HANDOFF — Enrichment PILOT done; STOP at user sign-off gate before full-corpus scaling

-**Snapshot:** 2026-05-29. Executing plan `enumerated-petting-badger.md` (bilingual
-index + enrichment + new filters + source download).
+**Snapshot:** 2026-05-29 (updated). Executing plan `enumerated-petting-badger.md`
+(bilingual index + enrichment + new filters + source download).
+
+**>>> CURRENT STATE (2026-05-29): user SIGNED OFF on the pilot → full-corpus
+enrichment SCALING IN PROGRESS via 8 PARALLEL background Workflows on Sonnet.
+Parts on disk (`data/enrichment_parts/<key>.json`) = the durable checkpoint. <<<**
+
+Two earlier single-workflow runs were stopped: the first ran on Opus by mistake
+(workflow `agent()` inherits the main-loop model unless `model:'sonnet'` is passed —
+fixed). Measured rate: ~4.7 keys/min single-shard, ~17 keys/min at 3 shards (linear,
+no real rate-limit errors — the "429" hits in transcripts are line numbers in chunk
+text, not API errors). Concurrency is capped at 2 PER workflow (`nproc`=4 →
+`min(16,cores-2)`), so parallelism = run many workflows.
+
+**8 shard scripts: `data/enrichment_wf/shard_0.js` … `shard_7.js`**, each owns a
+disjoint batch range of `data/enrichment_batches/batch_NNNN.txt` (780 batches × ~12
+keys = 9357 keys; ranges: s0 [0,98) s1 [98,196) s2 [196,294) s3 [294,392) s4 [392,490)
+s5 [490,588) s6 [588,686) s7 [686,780)). Each agent is IDEMPOTENT (skips keys whose
+part already exists + parses), so re-launching any shard is safe. Run IDs:
+s0 `wf_3c314d06-01c` · s1 `wf_ecc7d151-a11` · s2 `wf_4156be35-748` ·
+s3 `wf_fa16abee-17a` · s4 `wf_a0f595b8-8fe` · s5 `wf_b3505593-09a` ·
+s6 `wf_ad0d731e-12e` · s7 `wf_a919a99b-1d2`.
+
+### ▶ RESUME HERE (2026-06-01 — THROTTLED CRON SYSTEM now drives enrichment)
+
+**Enrichment progress: 3074 / 9541 done (32.2%), ~6467 missing.** Nothing running.
+Parts on disk (`data/enrichment_parts/*.json`) are the durable, idempotent checkpoint.
+
+**A paced wave system now runs FULLY HEADLESS via the LXC's OS cron — NO Claude
+session required.** Fixes the "always runs to exhaustion" bug: each wave caps at
+~75% of a 5h window and the next window is reached by time (cron).
+
+ARCHITECTURE: OS cron → `scripts/enrich_wave.sh` → one `claude -p` per batch,
+PAR-way parallel (OS-level — NOT the Workflow tool, which can't be used headless:
+`claude -p` is one-shot and would exit before background workflows finish). Each
+headless `claude -p` reads a batch file and writes `data/enrichment_parts/<key>.json`.
+
+- **`scripts/enrichment_wave.py`** (prepares a bounded wave, no LLM):
+  - `--status` — read-only progress (done / missing / pct / corrupt count).
+  - `--prepare --keys 700 --no-shards` — drop corrupt parts; take FIRST 700
+    sorted-missing keys; write batch files for ONLY those; print
+    `WAVE: PREPARED …` or `WAVE: COMPLETE`. `--no-shards` = batch files only
+    (the headless path). (Without `--no-shards` it also regenerates Workflow shard
+    JS from `data/enrichment_wf/shard.js.tmpl` — only needed for the old Workflow path.)
+- **`scripts/enrich_wave.sh [KEYS] [PAR]`** (the headless orchestrator, run by cron):
+  flock-guarded (waves never overlap); `--prepare`; if `WAVE: COMPLETE` → `--collect`
+  + `--rebuild` and stop; else `xargs -P PAR` one `claude -p` per batch
+  (`--allowedTools Bash(python3:*),Read,Write,Bash(cat:*),Bash(ls:*)`, `</dev/null`).
+  Logs to `/workspace/.claude-logs/enrich_<ts>.log`. Detects + logs "WINDOW EXHAUSTED".
+- **OS crontab (user `claude`, `crontab -l` to view):** two night fires
+  `20 23 * * *` and `50 0 * * *` UTC (= 02:20 & 03:50 EEST). Timed AFTER the
+  live-confirmed **23:00 UTC usage-window reset** so both land in the fresh post-reset
+  window (user asleep → safe to use it fully; two 700-caps top out at the window's
+  ~950 capacity). Self-healing: a fire into an exhausted window is a harmless no-op
+  (`claude -p` prints "session limit", writes nothing) and those keys retry next fire.
+
+**Auth caveat:** headless `claude -p` uses the OAuth token in
+`~/.claude/.credentials.json` (verified working). If it ever expires and can't refresh
+non-interactively, cron fires fail with auth errors → user must `claude` login once.
+
+**Manual fallback (one wave, any time, no session needed):**
+```bash
+/workspace/game-library/scripts/enrich_wave.sh 700 6      # runs a full wave now
+# or step-by-step:
+python3 scripts/enrichment_wave.py --status               # progress
+python3 scripts/run_enrichment.py --collect && python3 scripts/build_database.py --rebuild  # at WAVE: COMPLETE
+#   gate: rebuild must print enrichment {N} (matched N, orphaned 0)
+```
+
+**Control:** `crontab -e` to retime/disable; `crontab -r` removes all. Tune `--keys`
+(KEYS arg) up to drain faster, down if logs show "WINDOW EXHAUSTED" early. One full
+window ≈ 950 keys ≈ 100%.
+
+**Hard facts learned:**
+- Workflow concurrency is capped at **2 per workflow** (`nproc`=4 → `min(16,cores-2)`); parallelism = run many workflow processes. 3 shards measured ~17 keys/min (linear, no real rate-limit).
+- Workflow `agent()` inherits the **main-loop model unless `model:'sonnet'` is passed** — the FIRST run silently used Opus; always pass model.
+- The full corpus does **NOT fit in one 5h usage window** — it needs SEVERAL windows. Parallelism only cuts wall-clock inside a window, not total token budget.
+- Main-session token drain was **polling** (sleep/grep loops), NOT launching workflows. Launch + wait-for-notification only.
+- StructuredOutput failures appear when a window exhausts mid-run — harmless; idempotent skip + the regenerate-from-missing reconcile recover every dropped key.
+
+(prev note) Earlier STOPPED at 593/9541 — hit 92% of the 5h Anthropic
+usage window (resets 23:00 UTC). KEY LESSON: the full corpus does NOT fit in one 5h
+window; 6.2% + the session's other work already used ~92%. Enrichment must be spread
+across MANY 5h windows (parallelism only cuts wall-clock inside a window, not total
+token budget). Resume model: after each window reset, regenerate batches from
+currently-missing, relaunch a bounded number of shards, stop before the window
+exhausts. Idempotent shards + parts-on-disk make this safe to repeat. Consider a
+cron/scheduled job that runs a bounded wave each reset.
+
+**To regenerate batches from currently-missing + relaunch a shard** (reconcile):
+```bash
+python3 - <<'PY'
+import glob, os
+BATCH=12
+missing=sorted(os.path.basename(p)[:-len('.prompt.md')] for p in glob.glob('data/enrichment_prompts/*.prompt.md')
+               if not os.path.exists('data/enrichment_parts/'+os.path.basename(p)[:-len('.prompt.md')]+'.json'))
+for old in glob.glob('data/enrichment_batches/batch_*.txt'): os.remove(old)
+for n,i in enumerate(range(0,len(missing),BATCH)):
+    open(f'data/enrichment_batches/batch_{n:04d}.txt','w').write('\n'.join(missing[i:i+BATCH])+'\n')
+print('missing',len(missing),'batches',(len(missing)+BATCH-1)//BATCH)  # ceil division; also safe when nothing is missing
+PY
+# then edit START/END in the shard files to cover the new batch count and re-invoke Workflow({scriptPath: 'data/enrichment_wf/shard_K.js'})
+```
+
+### Resume / completion procedure (do this when the workflow finishes — or to continue a new session)
+
+The pipeline is RESUMABLE: parts on disk are truth; re-running regenerates work only for missing keys.
+
+1. **Rebuild the batch list from what's still missing** (prompt exists, part absent), then re-run the workflow for the gap:
+   ```bash
+   # regenerate batch files for missing keys (script below already lives in shell history; logic:
+   #   for each data/enrichment_prompts/<key>.prompt.md with no data/enrichment_parts/<key>.json,
+   #   split into data/enrichment_batches/batch_NNNN.txt of 12)
+   ```
+   The workflow script is at
+   `.../workflows/scripts/enrich-corpus-wf_440c0a2f-17f.js` (nBatches hardcoded → update it to the new batch count, or it defaults to 793). Re-invoke with `{scriptPath: ...}`.
+2. **Reconcile loop** (expect 2–3 passes — some parts WILL drop: flaky agents, a stray quote that slips re-validation): repeat step 1 until `missing == 0`.
+3. **Collect + final rebuild ONCE at the end** (don't rebuild after every wave — 9541 rows is wasted work):
+   ```bash
+   python3 scripts/run_enrichment.py --collect      # robust: repairs stray-quote parts, skips+reports truly-broken
+   python3 scripts/build_database.py --rebuild       # picks up --enrichment by default
+   ```
+   **Gate:** rebuild must print `enrichment {entries} (matched {entries}, orphaned 0)`. Done-criterion is the reconcile counts converging: `emitted == parts-on-disk == entries == matched`.
+
+### ⚠ FREEZE IS NOW LOCKED
+Enrichment content_keys depend on the current freeze. The earlier "re-freezing is safe"
+note is **INVERTED** now — do NOT re-extract or re-freeze `data/extracted/` until the
+final `--rebuild`, or content_keys drift and the overlay orphans.

 ## Where we are

 | Step (plan Part C) | Status |
 |--------------------|--------|
-| 1. Finish extraction | **DONE** — 575/588 chunks extracted & valid; 13 missing (see below) |
+| 1. Finish extraction | **DONE** — 582 chunks extracted & valid (7 re-extracted this session); 6 content-filter-blocked, accepted as missing |
 | 2. Land code Part A1–A4 (model/schema/merge) | **DONE & committed** |
 | 2b. Code Part A5–A8 (UI/search/download) | **DONE & committed** |
 | 2c. Code Part B2–B4 (enrichment pipeline) | **DONE & committed** |
-| 3. Freeze rebuild (freezes content_keys) | **DONE** — `data/activities.db` = **9418 activities** |
+| 3. Freeze rebuild (freezes content_keys) | **DONE** — `data/activities.db` = **9541 activities** (re-frozen with the 7 chunks) |
 | Part D tests | **DONE** — `tests/test_enrichment.py`, 99 pass total |
-| 4. Enrichment pilot → **STOP for user sign-off** | **NOT STARTED** (this is the gate) |
-| 5. Final rebuild `--enrichment` | not started |
+| 4. Enrichment pilot → **STOP for user sign-off** | **DONE — 34 activities enriched (26 ro-polish + 8 en→ro), pipeline 34/34 matched, 0 orphaned. AWAITING SIGN-OFF.** |
+| 5. Final rebuild `--enrichment` | not started (post sign-off) |
+
+## The 7 re-extracted chunks (this session)
+
+Re-extracted via Sonnet subagents, all valid JSON, re-frozen into the corpus.
+One (`d297a434…part01`) had an activity named "Eu" (<3 chars, schema-rejected);
+renamed faithfully to "Eu sunt..." (matches the source affirmation). The 6
+content-filter-blocked chunks remain accepted as missing.

 Everything is committed except whatever this session leaves dirty. `data/extracted/*.json`
 is gitignored (575 files on disk, durable across /clear).
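
The completion gate described in the HANDOFF diff above (rebuild prints `enrichment {entries} (matched {entries}, orphaned 0)`, and the counts converge as `emitted == parts-on-disk == entries == matched`) could be checked mechanically along these lines. This is a sketch under assumptions: the helper name `gate_passes` is hypothetical, and the regular expression assumes the rebuild line has exactly the shape quoted in the handoff, which should be confirmed against real `build_database.py --rebuild` output.

```python
import re

# Assumed shape of the rebuild summary line, taken from the handoff text.
GATE_RE = re.compile(r"enrichment (\d+) \(matched (\d+), orphaned (\d+)\)")


def gate_passes(rebuild_output: str, emitted: int, parts_on_disk: int) -> bool:
    """Hypothetical check: True only when all counts converge and nothing is orphaned."""
    m = GATE_RE.search(rebuild_output)
    if m is None:
        return False  # no summary line found, treat as a failed gate
    entries, matched, orphaned = (int(g) for g in m.groups())
    return orphaned == 0 and emitted == parts_on_disk == entries == matched


if __name__ == "__main__":
    # Pilot figures from the status table: 34 enriched, 34 matched, 0 orphaned.
    print(gate_passes("enrichment 34 (matched 34, orphaned 0)", 34, 34))  # True
```

Returning `False` when the summary line is absent keeps a crashed or truncated rebuild from being mistaken for a clean one.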