game-library/HANDOFF.md
Claude Agent f7a37f91ec Headless cron enrichment system + progress checkpoint at 32%
OS cron fires enrich_wave.sh twice nightly (post 23:00 UTC reset); each wave
caps at ~700 keys (~75% window) via enrichment_wave.py --prepare. Fully
headless: one claude -p per batch via xargs, flock-guarded, idempotent.
DB updated to 9541 activities; .gitignore covers enrichment intermediates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 21:26:35 +00:00


# HANDOFF — Enrichment PILOT done; STOP at user sign-off gate before full-corpus scaling
**Snapshot:** 2026-05-29 (updated). Executing plan `enumerated-petting-badger.md`
(bilingual index + enrichment + new filters + source download).
**>>> CURRENT STATE (2026-05-29): user SIGNED OFF on the pilot → full-corpus
enrichment SCALING IN PROGRESS via 8 PARALLEL background Workflows on Sonnet.
Parts on disk (`data/enrichment_parts/<key>.json`) = the durable checkpoint. <<<**
Two earlier single-workflow runs were stopped: the first ran on Opus by mistake
(workflow `agent()` inherits the main-loop model unless `model:'sonnet'` is passed —
fixed). Measured rate: ~4.7 keys/min single-shard, ~17 keys/min at 3 shards (linear,
no real rate-limit errors — the "429" hits in transcripts are line numbers in chunk
text, not API errors). Concurrency is capped at 2 PER workflow (`nproc`=4 →
`min(16,cores-2)`), so parallelism = run many workflows.
**8 shard scripts: `data/enrichment_wf/shard_0.js` … `shard_7.js`**, each owns a
disjoint batch range of `data/enrichment_batches/batch_NNNN.txt` (780 batches × ~12
keys = 9357 keys; ranges: s0 [0,98) s1 [98,196) s2 [196,294) s3 [294,392) s4 [392,490)
s5 [490,588) s6 [588,686) s7 [686,780)). Each agent is IDEMPOTENT (skips keys whose
part already exists + parses), so re-launching any shard is safe. Run IDs:
s0 `wf_3c314d06-01c` · s1 `wf_ecc7d151-a11` · s2 `wf_4156be35-748` ·
s3 `wf_fa16abee-17a` · s4 `wf_a0f595b8-8fe` · s5 `wf_b3505593-09a` ·
s6 `wf_ad0d731e-12e` · s7 `wf_a919a99b-1d2`.
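The shard ranges listed above follow from a plain ceiling split of the batch index space. A minimal sketch, assuming a hypothetical helper `shard_ranges` (not part of the repo) that reproduces the same half-open ranges:

```python
import math

def shard_ranges(n_batches: int, n_shards: int) -> list[tuple[int, int]]:
    """Split [0, n_batches) into contiguous, disjoint half-open ranges."""
    size = math.ceil(n_batches / n_shards)  # 780 / 8 -> 98 batches per shard
    return [(start, min(start + size, n_batches))
            for start in range(0, n_batches, size)]

ranges = shard_ranges(780, 8)
for k, (start, end) in enumerate(ranges):
    print(f"s{k} [{start},{end})")
```

Because the ranges are disjoint and each agent skips keys whose part already exists, two shards never write the same part file.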
### ▶ RESUME HERE (2026-06-01 — THROTTLED CRON SYSTEM now drives enrichment)
**Enrichment progress: 3074 / 9541 done (32.2%), ~6467 missing.** Nothing running.
Parts on disk (`data/enrichment_parts/*.json`) are the durable, idempotent checkpoint.
**A paced wave system now runs FULLY HEADLESS via the LXC's OS cron — NO Claude
session required.** Fixes the "always runs to exhaustion" bug: each wave caps at
~75% of a 5h window and the next window is reached by time (cron).
ARCHITECTURE: OS cron → `scripts/enrich_wave.sh` → one `claude -p` per batch,
PAR-way parallel (OS-level — NOT the Workflow tool, which can't be used headless:
`claude -p` is one-shot and would exit before background workflows finish). Each
headless `claude -p` reads a batch file and writes `data/enrichment_parts/<key>.json`.
- **`scripts/enrichment_wave.py`** (prepares a bounded wave, no LLM):
- `--status` — read-only progress (done / missing / pct / corrupt count).
- `--prepare --keys 700 --no-shards` — drop corrupt parts; take FIRST 700
sorted-missing keys; write batch files for ONLY those; print
`WAVE: PREPARED …` or `WAVE: COMPLETE`. `--no-shards` = batch files only
(the headless path). (Without `--no-shards` it also regenerates Workflow shard
JS from `data/enrichment_wf/shard.js.tmpl` — only needed for the old Workflow path.)
- **`scripts/enrich_wave.sh [KEYS] [PAR]`** (the headless orchestrator, run by cron):
flock-guarded (waves never overlap); `--prepare`; if `WAVE: COMPLETE` → `--collect`
+ `--rebuild` and stop; else `xargs -P PAR` one `claude -p` per batch
(`--allowedTools Bash(python3:*),Read,Write,Bash(cat:*),Bash(ls:*)`, `</dev/null`).
Logs to `/workspace/.claude-logs/enrich_<ts>.log`. Detects + logs "WINDOW EXHAUSTED".
- **OS crontab (user `claude`, `crontab -l` to view):** two night fires
`20 23 * * *` and `50 0 * * *` UTC (= 02:20 & 03:50 EEST). Timed AFTER the live-
confirmed **23:00 UTC usage-window reset** so both land in the fresh post-reset
window (user asleep → safe to use it fully; two 700-caps top out at the window's
~950 capacity). Self-healing: a fire into an exhausted window is a harmless no-op
(`claude -p` prints "session limit", writes nothing) and those keys retry next fire.
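The bounded `--prepare` step described above can be sketched as follows. This is an illustration only, not the code in `scripts/enrichment_wave.py`: the names `part_is_valid` and `prepare_wave` are hypothetical, and the sketch treats a corrupt part as missing rather than deleting it.

```python
import json
from pathlib import Path

def part_is_valid(path: Path) -> bool:
    """A part counts as done only if it exists and parses as JSON."""
    try:
        json.loads(path.read_text(encoding="utf-8"))
        return True
    except (OSError, ValueError):  # missing file, bad encoding, or bad JSON
        return False

def prepare_wave(prompt_keys, parts_dir: Path, max_keys: int = 700,
                 batch_size: int = 12) -> list[list[str]]:
    """Take the first max_keys sorted-missing keys and split them into batches."""
    missing = sorted(k for k in prompt_keys
                     if not part_is_valid(parts_dir / f"{k}.json"))
    wave = missing[:max_keys]
    return [wave[i:i + batch_size] for i in range(0, len(wave), batch_size)]
```

An empty return value corresponds to the `WAVE: COMPLETE` case: nothing is missing, so the orchestrator can collect and rebuild.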
**Auth caveat:** headless `claude -p` uses the OAuth token in
`~/.claude/.credentials.json` (verified working). If it ever expires and can't refresh
non-interactively, cron fires fail with auth errors → user must `claude` login once.
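For reference, the non-overlap guarantee of the flock guard in `enrich_wave.sh` can be illustrated in Python with the same advisory-lock primitive. A minimal sketch for Linux/Unix, assuming a hypothetical `try_lock` helper and lock path; the real script does this at the shell level:

```python
import fcntl

def try_lock(lock_path: str):
    """Return an open handle holding an exclusive lock, or None if another
    wave already holds it. The lock is released when the handle is closed."""
    fh = open(lock_path, "a")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh
```

A wave that gets `None` back should exit without doing any work, which is what makes an overlapping cron fire harmless.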
**Manual fallback (one wave, any time, no session needed):**
```bash
/workspace/game-library/scripts/enrich_wave.sh 700 6 # runs a full wave now
# or step-by-step:
python3 scripts/enrichment_wave.py --status # progress
python3 scripts/run_enrichment.py --collect && python3 scripts/build_database.py --rebuild # at WAVE: COMPLETE
# gate: rebuild must print enrichment {N} (matched N, orphaned 0)
```
**Control:** `crontab -e` to retime/disable; `crontab -r` removes all. Tune `--keys`
(KEYS arg) up to drain faster, down if logs show "WINDOW EXHAUSTED" early. One full
window ≈ 950 keys ≈ 100%.
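A back-of-envelope estimate of how many nights the remaining corpus needs, using only the figures quoted above. The assumption here is that both nightly fires land in the same window, so the ~950-key window capacity is the binding limit rather than the two 700-key caps:

```python
import math

WINDOW_CAPACITY = 950   # approximate keys per 5h window (from the notes above)
KEYS_PER_WAVE = 700     # cap passed to enrich_wave.sh
WAVES_PER_NIGHT = 2     # the two cron fires

def nights_to_drain(missing: int) -> int:
    """Nights needed if each night is limited by one window's capacity."""
    per_night = min(KEYS_PER_WAVE * WAVES_PER_NIGHT, WINDOW_CAPACITY)
    return math.ceil(missing / per_night)

print(nights_to_drain(6467))
```

With 6467 keys missing this gives about a week of nightly fires, assuming no window is shared with other work.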
**Hard facts learned:**
- Workflow concurrency is capped at **2 per workflow** (`nproc`=4 → `min(16,cores-2)`); parallelism = run many workflow processes. 3 shards measured ~17 keys/min (linear, no real rate-limit).
- Workflow `agent()` inherits the **main-loop model unless `model:'sonnet'` is passed** — the FIRST run silently used Opus; always pass model.
- The full corpus does **NOT fit in one 5h usage window** — it needs SEVERAL windows. Parallelism only cuts wall-clock inside a window, not total token budget.
- Main-session token drain was **polling** (sleep/grep loops), NOT launching workflows. Launch + wait-for-notification only.
- StructuredOutput failures appear when a window exhausts mid-run — harmless; idempotent skip + the regenerate-from-missing reconcile recover every dropped key.
(prev note) Earlier STOPPED at 593/9541 — hit 92% of the 5h Anthropic
usage window (resets 23:00 UTC). KEY LESSON: the full corpus does NOT fit in one 5h
window; 6.2% + the session's other work already used ~92%. Enrichment must be spread
across MANY 5h windows (parallelism only cuts wall-clock inside a window, not total
token budget). Resume model: after each window reset, regenerate batches from
currently-missing, relaunch a bounded number of shards, stop before the window
exhausts. Idempotent shards + parts-on-disk make this safe to repeat. Consider a
cron/scheduled job that runs a bounded wave each reset.
**To regenerate batches from currently-missing + relaunch a shard** (reconcile):
```bash
python3 - <<'PY'
import glob, os
BATCH = 12
SUFFIX = '.prompt.md'
# Keys whose prompt exists but whose part file is absent.
keys = (os.path.basename(p)[:-len(SUFFIX)] for p in glob.glob('data/enrichment_prompts/*' + SUFFIX))
missing = sorted(k for k in keys if not os.path.exists(f'data/enrichment_parts/{k}.json'))
# Replace the old batch files with batches covering only the missing keys.
for old in glob.glob('data/enrichment_batches/batch_*.txt'):
    os.remove(old)
batches = [missing[i:i + BATCH] for i in range(0, len(missing), BATCH)]
for n, batch in enumerate(batches):
    with open(f'data/enrichment_batches/batch_{n:04d}.txt', 'w') as fh:
        fh.write('\n'.join(batch) + '\n')
print('missing', len(missing), 'batches', len(batches))
PY
# then edit START/END in the shard files to cover the new batch count and re-invoke Workflow({scriptPath: 'data/enrichment_wf/shard_K.js'})
```
### Resume / completion procedure (do this when the workflow finishes — or to continue a new session)
The pipeline is RESUMABLE: parts on disk are truth; re-running regenerates work only for missing keys.
1. **Rebuild the batch list from what's still missing** (prompt exists, part absent), then re-run the workflow for the gap:
```bash
# regenerate batch files for missing keys (script below already lives in shell history; logic:
# for each data/enrichment_prompts/<key>.prompt.md with no data/enrichment_parts/<key>.json,
# split into data/enrichment_batches/batch_NNNN.txt of 12)
```
The workflow script is at
`.../workflows/scripts/enrich-corpus-wf_440c0a2f-17f.js` (nBatches hardcoded → update it to the new batch count, or it defaults to 793). Re-invoke with `{scriptPath: ...}`.
2. **Reconcile loop** (expect 2-3 passes; some parts WILL drop: flaky agents, a stray quote that slips re-validation): repeat step 1 until `missing == 0`.
3. **Collect + final rebuild ONCE at the end** (don't rebuild after every wave — 9541 rows is wasted work):
```bash
python3 scripts/run_enrichment.py --collect # robust: repairs stray-quote parts, skips+reports truly-broken
python3 scripts/build_database.py --rebuild # picks up --enrichment by default
```
**Gate:** rebuild must print `enrichment {entries} (matched {entries}, orphaned 0)`. Done-criterion is the reconcile counts converging: `emitted == parts-on-disk == entries == matched`.
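The gate can be checked mechanically instead of by eye. A minimal sketch, assuming the rebuild prints a line shaped like `enrichment 9541 (matched 9541, orphaned 0)`; the exact output format of `build_database.py` is an assumption here, and `gate_passes` is a hypothetical helper:

```python
import re

# Assumed line shape: "enrichment <entries> (matched <matched>, orphaned <orphaned>)"
GATE_RE = re.compile(r"enrichment\s+(\d+)\s+\(matched\s+(\d+),\s*orphaned\s+(\d+)\)")

def gate_passes(rebuild_output: str) -> bool:
    """True only if every enrichment entry matched and none were orphaned."""
    m = GATE_RE.search(rebuild_output)
    if m is None:
        return False  # no gate line found: treat as a failure, never as a pass
    entries, matched, orphaned = map(int, m.groups())
    return entries == matched and orphaned == 0
```

Treating a missing gate line as a failure keeps a silently crashed rebuild from being mistaken for success.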
### ⚠ FREEZE IS NOW LOCKED
Enrichment content_keys depend on the current freeze. The earlier "re-freezing is safe"
note is **INVERTED** now — do NOT re-extract or re-freeze `data/extracted/` until the
final `--rebuild`, or content_keys drift and the overlay orphans.
## Where we are
| Step (plan Part C) | Status |
|--------------------|--------|
| 1. Finish extraction | **DONE** — 582 chunks extracted & valid (7 re-extracted this session); 6 content-filter-blocked, accepted as missing |
| 2. Land code Part A1-A4 (model/schema/merge) | **DONE & committed** |
| 2b. Code Part A5-A8 (UI/search/download) | **DONE & committed** |
| 2c. Code Part B2-B4 (enrichment pipeline) | **DONE & committed** |
| 3. Freeze rebuild (freezes content_keys) | **DONE** — `data/activities.db` = **9541 activities** (re-frozen with the 7 chunks) |
| Part D tests | **DONE** — `tests/test_enrichment.py`, 99 pass total |
| 4. Enrichment pilot → **STOP for user sign-off** | **DONE — 34 activities enriched (26 ro-polish + 8 en→ro), pipeline 34/34 matched, 0 orphaned. AWAITING SIGN-OFF.** |
| 5. Final rebuild `--enrichment` | not started (post sign-off) |
## The 7 re-extracted chunks (this session)
Re-extracted via Sonnet subagents, all valid JSON, re-frozen into the corpus.
One (`d297a434…part01`) had an activity named "Eu" (<3 chars, schema-rejected);
renamed faithfully to "Eu sunt..." (matches the source affirmation). The 6
content-filter-blocked chunks remain accepted as missing.
Everything is committed except whatever this session leaves dirty. `data/extracted/*.json`
is gitignored (575 files on disk, durable across /clear).
## The 13 missing chunks (out of 588)
**6 content-filter-blocked** (Anthropic safety; accept as missing — marginal loss):
- `87850302_dragon_sleepdeprived.part73 / .part85 / .part94` (camp song lyrics)
- `c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96`
**7 need RE-EXTRACTION** (their malformed-original JSON was destroyed — see "json_repair
incident" below; re-extract once the subagent session limit resets, ~5pm UTC):
```
3f9c8232_teambuilding_corbu_29092023.part01
5f959f85_scoli_fara_bullying.part02
83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04
d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05
e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03
```
Re-extract these (Sonnet subagents, one Agent call each, the per-chunk prompt is at
`data/chunks/_prompts/<key>.prompt.md`), then **re-run the freeze rebuild** so they join
the corpus before enrichment. Re-freezing is safe now — enrichment has NOT run, so no
overlay keys depend on the current freeze yet.
## The json_repair incident (important — root cause + what was fixed)
Subagents **systematically emit unescaped ASCII `"` inside string values** (Romanian
text like `„Unu"` uses a closing `"` that terminates the JSON string early). ~34 files
were affected.
First repair attempt used the `json_repair` lib. **It truncates**: on a stray quote it
ends the string and reinterprets the trailing text as a new key, silently dropping the
rest of the value and injecting garbage keys. Schema `additionalProperties:false` caught
the garbage-key cases (8 files dropped at rebuild), but the truncation that didn't create
an extra key slipped through. Applying json_repair output to disk also **overwrote the
malformed originals** for those 8 → originals lost → those (now 7, one recovered) need
re-extraction.
**Fix:** `scripts/repair_extractions.py` was rewritten to use a faithful char-scanner
(`escape_stray_quotes`) that **escapes** stray quotes (`\"`) instead of splitting on them,
validates against the real schema, and only replaces a valid top-level file when the
repaired version carries **strictly more text** (a length guard that catches truncated
json_repair output while leaving genuine extractions untouched). Re-running it cleanly
repaired the affected files; the final freeze had **0 schema-rejected, 0 invalid**.
`json_repair` is no longer used anywhere. Do NOT reintroduce it.
`build_database.py` does NOT depend on the repair script (the "DB regenerable from
data/extracted/" invariant holds — plain `json.loads` only).
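To make the escape-instead-of-split idea concrete, here is a minimal char-scanner sketch. It is not the `escape_stray_quotes` implementation in `scripts/repair_extractions.py`; the closing-quote heuristic (a `"` closes a string only when the next non-whitespace character is `,`, `}`, `]`, `:` or end of input) is an assumption, and it will misjudge a stray quote that happens to be followed by a comma.

```python
import json

def escape_stray_quotes_sketch(text: str) -> str:
    """Escape double quotes inside JSON string values that do not look
    like a real closing quote. Heuristic, for illustration only."""
    out, in_string, i, n = [], False, 0, len(text)
    while i < n:
        ch = text[i]
        if not in_string:
            in_string = (ch == '"')          # an opening quote starts a string
            out.append(ch)
        elif ch == "\\" and i + 1 < n:
            out.append(ch + text[i + 1])     # keep existing escapes untouched
            i += 1
        elif ch == '"':
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if j == n or text[j] in ",}]:":
                in_string = False            # structural closing quote
                out.append(ch)
            else:
                out.append('\\"')            # stray quote: escape, do not split
        else:
            out.append(ch)
        i += 1
    return "".join(out)

broken = '{"name": "Jocul „Unu" si „Doi" pentru copii", "n": 1}'
print(json.loads(escape_stray_quotes_sketch(broken))["name"])
```

Unlike the truncating repair, the full value survives: every character of the original string is still present after parsing, which is the property the length guard relies on.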
## What the code does now (all committed)
**Part A — plumbing (corpus-independent):**
- `app/models/database.py`: new columns `name_ro/description_ro/rules_ro/variations_ro,
indoor_outdoor, space_needed, estimated_fields(JSON), source_id, source_ids(JSON),
chunk_key`; FTS5 indexes the 4 `*_ro` columns (CREATE + all 3 triggers — kept in sync);
indexes on `indoor_outdoor`/`space_needed`; `search_activities` gained `indoor_outdoor`
and `space_needed` equality kwargs; `_update_category_counts` feeds both new axes into
the categories table so dropdowns populate.
- `app/models/activity.py`: new fields + `to_dict`/`from_dict`; helpers `get_display_name`
/ `get_display_description` / `get_display_rules` / `get_display_variations`
(RO-primary, EN fallback), `has_translation`, `is_estimated(field)`,
`get_indoor_outdoor_display`, `get_space_needed_display`.
- `app/config_taxonomy.py`: `INDOOR_OUTDOOR`, `SPACE_NEEDED` enums + RO labels +
`normalize_indoor_outdoor` / `normalize_space_needed` (None on unrecognised, no
fallback — never fabricate a value) + display-name helpers.
- `scripts/build_database.py`: `dict_to_activity` sets `source_id`+`chunk_key`;
`merge_cluster` unions `source_ids` and carries rep's `source_id`/`chunk_key` but
**never** touches enrichment fields (those are applied post-dedup).
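The "None on unrecognised, never fabricate a value" rule for the normalizers can be sketched as below. The enum values and aliases are hypothetical placeholders, since the real vocabulary lives in `app/config_taxonomy.py`, and the function name carries a `_sketch` suffix to mark it as an illustration:

```python
from typing import Optional

# Hypothetical vocabulary; the real enum is defined in app/config_taxonomy.py.
INDOOR_OUTDOOR = ("indoor", "outdoor", "both")
_ALIASES = {"inside": "indoor", "interior": "indoor",
            "outside": "outdoor", "exterior": "outdoor"}

def normalize_indoor_outdoor_sketch(value) -> Optional[str]:
    """Map free text onto the fixed vocabulary, or return None.
    There is deliberately no default: an unknown value stays unknown."""
    if not isinstance(value, str):
        return None
    v = value.strip().lower()
    v = _ALIASES.get(v, v)
    return v if v in INDOOR_OUTDOOR else None
```

Returning `None` keeps the filter dropdowns honest: an activity with an unrecognised value is simply absent from that axis instead of being filed under a guessed category.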
**Part A — UI/search:**
- `app/services/search.py`: `_map_filters_to_db_fields` maps `indoor_outdoor`/
`space_needed` to DB equality filters.
- `app/web/routes.py`: new `/source/<id>` download route — **shipped DARK behind
`SOURCE_DOWNLOAD_ENABLED` (default false; copyright exposure, user opts in)**; resolves
`source_file` under `CORPUS_DIR` via `send_from_directory` (traversal-safe, 404s for
web-mirror sources). `DISPLAY_NAMES` extended with both new axes.
- `app/config.py`: `SOURCE_DOWNLOAD_ENABLED`, `CORPUS_DIR`.
- Templates: `index.html`/`results.html` have the 2 new dropdowns; cards use display
helpers + `(estimat)` markers; `activity.html` is RO-primary with a collapsible
"Text original" section, indoor/space cards, estimat markers, and the download link
(only when the flag is on). `main.css` has `.estimated` / `.original-text` styles.
**Part B — enrichment pipeline (built, not yet run):**
- `scripts/build_database.py`: `load_enrichment` + `apply_enrichment(activities, enrichment)`
applied **right after** `apply_review_decisions`, on the post-dedup list, keyed on
`import_common.content_key(normalized_name, language, _normalize_text(description))`
(reused verbatim). CLI `--enrichment` (default `data/enrichment.json`). QA report prints
`enrichment {entries, matched, orphaned}` + per-field **stated vs estimated** counts.
Translated/expanded text is NOT re-validated against source (by design).
- `scripts/run_enrichment.py`: reads the rebuilt DB, computes each row's content_key,
skips rows already in `data/enrichment_parts/<key>.json` (resumable), emits one prompt
per activity to `data/enrichment_prompts/` (current EN fields + source chunk text via
`find_chunk_text`). Pilot scoping: `--source <substr>` and/or `--limit N`. `--collect`
merges parts → `data/enrichment.json`.
- `scripts/ENRICHMENT_PROMPT.md`: single-pass rules — translate faithfully, expand
`description_ro` ONLY from chunk text, mark inferred filter fields in `estimated_fields`,
fixed enum vocab, output `data/enrichment_parts/<content_key>.json` including `content_key`.
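The overlay step and its `matched`/`orphaned` counts can be sketched as follows. Everything here is an assumption for illustration: `content_key_stub` is a stand-in hash (the real key comes from `import_common.content_key`), `enrichment.json` is assumed to be a dict keyed by content key, and each activity is assumed to carry its key already.

```python
import hashlib

def content_key_stub(normalized_name: str, language: str, description: str) -> str:
    """Stand-in for import_common.content_key: any change to the inputs
    (for example a re-extracted description) yields a different key."""
    raw = "\x1f".join((normalized_name, language, description))
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

def apply_enrichment_sketch(activities: list[dict], enrichment: dict) -> dict:
    """Overlay enrichment fields onto activities by content key and
    report how many entries matched or were orphaned."""
    by_key = {a["content_key"]: a for a in activities}
    matched = orphaned = 0
    for key, fields in enrichment.items():
        target = by_key.get(key)
        if target is None:
            orphaned += 1        # key no longer exists in the frozen corpus
            continue
        target.update(fields)    # overlay only; dedup has already happened
        matched += 1
    return {"entries": len(enrichment), "matched": matched, "orphaned": orphaned}
```

This also shows why the freeze must stay locked while enrichment runs: if a description changes, its key changes, and the corresponding enrichment entry is counted as orphaned.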
## Exact next steps
1. **Re-extract the 7 chunks** above (after session-limit reset). Verify each writes valid
JSON (`python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"`).
If any come back malformed, `python3 scripts/repair_extractions.py --apply` (faithful now).
2. **Re-freeze:** `python3 scripts/build_database.py --rebuild` — confirm 0 schema-rejected,
note the new total (~9418 + the 7 chunks' activities).
3. **Enrichment PILOT** (plan B5: the STOP gate guarding 6-8k LLM calls):
- Pick one source, e.g. `python3 scripts/run_enrichment.py --source teambuilding_corbu`
(or `--limit 30`). This writes prompts to `data/enrichment_prompts/`.
- Launch a small wave of Sonnet subagents on those prompts (each writes
`data/enrichment_parts/<key>.json`).
- `python3 scripts/run_enrichment.py --collect` → `data/enrichment.json`.
- `python3 scripts/build_database.py --rebuild` (picks up `--enrichment` by default).
- **STOP. Hand the user translation-quality + estimation-plausibility + description-
fidelity samples and get sign-off BEFORE scaling to the full corpus.** Do not
auto-proceed past this gate.
4. After sign-off: scale enrichment in waves of ~8-16 Sonnet subagents, `--collect`,
final `--rebuild --enrichment`.
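The one-liner in step 1 stops at the first bad file without naming it. A slightly longer sketch that lists every file failing plain `json.loads`, assuming a hypothetical helper `invalid_json_files`:

```python
import json
from pathlib import Path

def invalid_json_files(directory) -> list[str]:
    """Return the names of *.json files in directory that fail to parse."""
    bad = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except ValueError:  # covers JSONDecodeError and bad encodings
            bad.append(path.name)
    return bad
```

Running it as `invalid_json_files("data/extracted")` should return an empty list before the re-freeze; any names it returns are the candidates for `repair_extractions.py --apply`.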
## Verify / run
- Tests: `python3 -m pytest tests/ -q` → 99 pass.
- App: `SOURCE_DOWNLOAD_ENABLED` is false by default (download link hidden). Set it true
only if the user accepts the copyright exposure of serving original files.
- `data/activities.db.bak` is the pre-this-freeze backup.