OS cron fires enrich_wave.sh twice nightly (post 23:00 UTC reset); each wave caps at ~700 keys (~75% window) via enrichment_wave.py --prepare. Fully headless: one claude -p per batch via xargs, flock-guarded, idempotent. DB updated to 9541 activities; .gitignore covers enrichment intermediates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
HANDOFF — Enrichment PILOT done; STOP at user sign-off gate before full-corpus scaling
Snapshot: 2026-05-29 (updated). Executing plan enumerated-petting-badger.md
(bilingual index + enrichment + new filters + source download).
>>> CURRENT STATE (2026-05-29): user SIGNED OFF on the pilot → full-corpus
enrichment SCALING IN PROGRESS via 8 PARALLEL background Workflows on Sonnet.
Parts on disk (data/enrichment_parts/<key>.json) = the durable checkpoint. <<<
Two earlier single-workflow runs were stopped: the first ran on Opus by mistake
(workflow agent() inherits the main-loop model unless model:'sonnet' is passed —
fixed). Measured rate: ~4.7 keys/min single-shard, ~17 keys/min at 3 shards (linear,
no real rate-limit errors — the "429" hits in transcripts are line numbers in chunk
text, not API errors). Concurrency is capped at 2 PER workflow (nproc=4 →
min(16,cores-2)), so parallelism = run many workflows.
8 shard scripts: data/enrichment_wf/shard_0.js … shard_7.js, each owns a
disjoint batch range of data/enrichment_batches/batch_NNNN.txt (780 batches × ~12
keys = 9357 keys; ranges: s0 [0,98) s1 [98,196) s2 [196,294) s3 [294,392) s4 [392,490)
s5 [490,588) s6 [588,686) s7 [686,780)). Each agent is IDEMPOTENT (skips keys whose
part already exists + parses), so re-launching any shard is safe. Run IDs:
s0 wf_3c314d06-01c · s1 wf_ecc7d151-a11 · s2 wf_4156be35-748 ·
s3 wf_fa16abee-17a · s4 wf_a0f595b8-8fe · s5 wf_b3505593-09a ·
s6 wf_ad0d731e-12e · s7 wf_a919a99b-1d2.
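The shard ranges listed above are consistent with a ceiling-division split of the 780 batches over 8 shards. A minimal sketch of that arithmetic; the function name `shard_ranges` is hypothetical and is not something the shard scripts are known to contain:

```python
def shard_ranges(n_batches: int, n_shards: int) -> list[tuple[int, int]]:
    """Split [0, n_batches) into n_shards contiguous half-open ranges."""
    size = -(-n_batches // n_shards)  # ceiling division without floats
    return [
        (min(k * size, n_batches), min((k + 1) * size, n_batches))
        for k in range(n_shards)
    ]


if __name__ == "__main__":
    # Print the ranges in the same "sK [start,end)" form used in this note.
    for k, (start, end) in enumerate(shard_ranges(780, 8)):
        print(f"s{k} [{start},{end})")
```

Because the ranges are half-open and contiguous, every batch index belongs to exactly one shard, which is what makes the shards disjoint.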
▶ RESUME HERE (2026-06-01 — THROTTLED CRON SYSTEM now drives enrichment)
Enrichment progress: 3074 / 9541 done (32.2%), ~6467 missing. Nothing running.
Parts on disk (data/enrichment_parts/*.json) are the durable, idempotent checkpoint.
A paced wave system now runs FULLY HEADLESS via the LXC's OS cron — NO Claude session required. Fixes the "always runs to exhaustion" bug: each wave caps at ~75% of a 5h window and the next window is reached by time (cron).
ARCHITECTURE: OS cron → scripts/enrich_wave.sh → one claude -p per batch,
PAR-way parallel (OS-level — NOT the Workflow tool, which can't be used headless:
claude -p is one-shot and would exit before background workflows finish). Each
headless claude -p reads a batch file and writes data/enrichment_parts/<key>.json.
- scripts/enrichment_wave.py (prepares a bounded wave, no LLM):
  - --status: read-only progress (done / missing / pct / corrupt count).
  - --prepare --keys 700 --no-shards: drop corrupt parts; take the FIRST 700 sorted-missing keys; write batch files for ONLY those; print WAVE: PREPARED … or WAVE: COMPLETE.
  - --no-shards = batch files only (the headless path). Without --no-shards it also regenerates the Workflow shard JS from data/enrichment_wf/shard.js.tmpl, which is only needed for the old Workflow path.
- scripts/enrich_wave.sh [KEYS] [PAR] (the headless orchestrator, run by cron): flock-guarded (waves never overlap); runs --prepare; if WAVE: COMPLETE → --collect, then --rebuild, and stop; else xargs -P PAR runs one claude -p per batch (--allowedTools Bash(python3:*),Read,Write,Bash(cat:*),Bash(ls:*), stdin from </dev/null). Logs to /workspace/.claude-logs/enrich_<ts>.log. Detects + logs "WINDOW EXHAUSTED".
- OS crontab (user claude, crontab -l to view): two night fires, 20 23 * * * and 50 0 * * * UTC (= 02:20 & 03:50 EEST). Timed AFTER the live-confirmed 23:00 UTC usage-window reset so both land in the fresh post-reset window (user asleep → safe to use it fully; two 700-caps top out at the window's ~950 capacity). Self-healing: a fire into an exhausted window is a harmless no-op (claude -p prints "session limit", writes nothing) and those keys retry next fire.
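The bounded-wave preparation described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of scripts/enrichment_wave.py: the function name `prepare_wave`, its return value, and treating "corrupt" as "fails json.loads" are assumptions; the directory roles, the 700-key cap and the batch size of 12 come from this note.

```python
import json
from pathlib import Path

SUFFIX = ".prompt.md"


def prepare_wave(prompts: Path, parts: Path, batches: Path,
                 keys: int = 700, batch_size: int = 12) -> int:
    """Write batch files for the first `keys` missing keys; return how many were queued."""
    # Drop parts that do not parse, so their keys count as missing again.
    for part in parts.glob("*.json"):
        try:
            json.loads(part.read_text(encoding="utf-8"))
        except ValueError:  # covers JSONDecodeError and bad encodings
            part.unlink()

    # A key is missing when its prompt exists but its part does not.
    all_keys = (p.name[: -len(SUFFIX)] for p in prompts.glob(f"*{SUFFIX}"))
    missing = sorted(k for k in all_keys if not (parts / f"{k}.json").exists())
    wave = missing[:keys]  # the cap is what keeps a wave inside one window

    # Remove stale batch files, then write the new bounded set.
    for old in batches.glob("batch_*.txt"):
        old.unlink()
    for n, i in enumerate(range(0, len(wave), batch_size)):
        chunk = wave[i:i + batch_size]
        (batches / f"batch_{n:04d}.txt").write_text(
            "\n".join(chunk) + "\n", encoding="utf-8")
    return len(wave)
```

Sorting before slicing makes the wave deterministic: a fire that writes nothing (exhausted window) leaves the same keys at the front of the queue for the next fire.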
Auth caveat: headless claude -p uses the OAuth token in
~/.claude/.credentials.json (verified working). If it ever expires and can't refresh
non-interactively, cron fires fail with auth errors → user must claude login once.
Manual fallback (one wave, any time, no session needed):
/workspace/game-library/scripts/enrich_wave.sh 700 6 # runs a full wave now
# or step-by-step:
python3 scripts/enrichment_wave.py --status # progress
python3 scripts/run_enrichment.py --collect && python3 scripts/build_database.py --rebuild # at WAVE: COMPLETE
# gate: rebuild must print enrichment {N} (matched N, orphaned 0)
Control: crontab -e to retime/disable; crontab -r removes all. Tune --keys
(KEYS arg) up to drain faster, down if logs show "WINDOW EXHAUSTED" early. One full
window ≈ 950 keys ≈ 100%.
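A quick back-of-envelope check of the tuning numbers above. The constants are the figures quoted in this note (950 keys per window, 700-key cap, 6467 missing as of the 2026-06-01 snapshot); the helper names are hypothetical, and the estimate assumes nothing else consumes the window.

```python
import math

WINDOW_CAPACITY = 950  # keys one 5h window can absorb (figure from this note)
WAVE_CAP = 700         # default KEYS argument
MISSING = 6467         # missing keys at the 2026-06-01 snapshot


def wave_share(cap: int = WAVE_CAP, capacity: int = WINDOW_CAPACITY) -> float:
    """Fraction of one window that a single capped wave consumes."""
    return cap / capacity


def windows_needed(missing: int = MISSING, capacity: int = WINDOW_CAPACITY) -> int:
    """Lower bound on the number of full windows needed to drain the backlog."""
    return math.ceil(missing / capacity)


print(f"one wave uses about {wave_share():.0%} of a window")
print(f"at least {windows_needed()} full windows to drain {MISSING} keys")
```

With one post-reset window used per night, that lower bound translates to roughly a week of nightly fires; raising KEYS does not help once a wave already reaches the window's capacity.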
Hard facts learned:
- Workflow concurrency is capped at 2 per workflow (nproc=4 → min(16,cores-2)); parallelism = run many workflow processes. 3 shards measured ~17 keys/min (linear, no real rate-limit).
- Workflow agent() inherits the main-loop model unless model:'sonnet' is passed. The FIRST run silently used Opus; always pass model.
- The full corpus does NOT fit in one 5h usage window; it needs SEVERAL windows. Parallelism only cuts wall-clock inside a window, not total token budget.
- Main-session token drain was polling (sleep/grep loops), NOT launching workflows. Launch + wait-for-notification only.
- StructuredOutput failures appear when a window exhausts mid-run — harmless; idempotent skip + the regenerate-from-missing reconcile recover every dropped key.
(prev note) Earlier STOPPED at 593/9541 — hit 92% of the 5h Anthropic usage window (resets 23:00 UTC). KEY LESSON: the full corpus does NOT fit in one 5h window; 6.2% + the session's other work already used ~92%. Enrichment must be spread across MANY 5h windows (parallelism only cuts wall-clock inside a window, not total token budget). Resume model: after each window reset, regenerate batches from currently-missing, relaunch a bounded number of shards, stop before the window exhausts. Idempotent shards + parts-on-disk make this safe to repeat. Consider a cron/scheduled job that runs a bounded wave each reset.
To regenerate batches from currently-missing + relaunch a shard (reconcile):
python3 - <<'PY'
import glob, os
BATCH = 12
SUFFIX = '.prompt.md'
# keys that have a prompt but no part on disk yet
keys = (os.path.basename(p)[:-len(SUFFIX)] for p in glob.glob('data/enrichment_prompts/*' + SUFFIX))
missing = sorted(k for k in keys if not os.path.exists('data/enrichment_parts/' + k + '.json'))
# drop stale batch files, then write fresh ones of BATCH keys each
for old in glob.glob('data/enrichment_batches/batch_*.txt'):
    os.remove(old)
starts = range(0, len(missing), BATCH)
for n, i in enumerate(starts):
    with open(f'data/enrichment_batches/batch_{n:04d}.txt', 'w') as fh:
        fh.write('\n'.join(missing[i:i+BATCH]) + '\n')
print('missing', len(missing), 'batches', len(starts))
PY
# then edit START/END in the shard files to cover the new batch count and re-invoke Workflow({scriptPath: 'data/enrichment_wf/shard_K.js'})
Resume / completion procedure (do this when the workflow finishes — or to continue a new session)
The pipeline is RESUMABLE: parts on disk are truth; re-running regenerates work only for missing keys.
1. Rebuild the batch list from what's still missing (prompt exists, part absent), then re-run the workflow for the gap:
   # regenerate batch files for missing keys (the script already lives in shell history; logic:
   # for each data/enrichment_prompts/<key>.prompt.md with no data/enrichment_parts/<key>.json,
   # split into data/enrichment_batches/batch_NNNN.txt of 12)
   The workflow script is at .../workflows/scripts/enrich-corpus-wf_440c0a2f-17f.js (nBatches is hardcoded → update it to the new batch count, or it defaults to 793). Re-invoke with {scriptPath: ...}.
2. Reconcile loop (expect 2–3 passes; some parts WILL drop: flaky agents, a stray quote that slips re-validation): repeat step 1 until missing == 0.
3. Collect + final rebuild ONCE at the end (don't rebuild after every wave; 9541 rows is wasted work):
   python3 scripts/run_enrichment.py --collect    # robust: repairs stray-quote parts, skips+reports truly-broken
   python3 scripts/build_database.py --rebuild    # picks up --enrichment by default
   Gate: rebuild must print enrichment {entries} (matched {entries}, orphaned 0). Done-criterion is the reconcile counts converging: emitted == parts-on-disk == entries == matched.
⚠ FREEZE IS NOW LOCKED
Enrichment content_keys depend on the current freeze. The earlier "re-freezing is safe"
note is INVERTED now — do NOT re-extract or re-freeze data/extracted/ until the
final --rebuild, or content_keys drift and the overlay orphans.
Where we are
| Step (plan Part C) | Status |
|---|---|
| 1. Finish extraction | DONE — 582 chunks extracted & valid (7 re-extracted this session); 6 content-filter-blocked, accepted as missing |
| 2. Land code Part A1–A4 (model/schema/merge) | DONE & committed |
| 2b. Code Part A5–A8 (UI/search/download) | DONE & committed |
| 2c. Code Part B2–B4 (enrichment pipeline) | DONE & committed |
| 3. Freeze rebuild (freezes content_keys) | DONE — data/activities.db = 9541 activities (re-frozen with the 7 chunks) |
| Part D tests | DONE — tests/test_enrichment.py, 99 pass total |
| 4. Enrichment pilot → STOP for user sign-off | DONE — 34 activities enriched (26 ro-polish + 8 en→ro), pipeline 34/34 matched, 0 orphaned. AWAITING SIGN-OFF. |
| 5. Final rebuild --enrichment | not started (post sign-off) |
The 7 re-extracted chunks (this session)
Re-extracted via Sonnet subagents, all valid JSON, re-frozen into the corpus.
One (d297a434…part01) had an activity named "Eu" (<3 chars, schema-rejected);
renamed faithfully to "Eu sunt..." (matches the source affirmation). The 6
content-filter-blocked chunks remain accepted as missing.
Everything is committed except whatever this session leaves dirty. data/extracted/*.json
is gitignored (575 files on disk, durable across /clear).
The 13 missing chunks (out of 588)
6 content-filter-blocked (Anthropic safety; accept as missing — marginal loss):
- 87850302_dragon_sleepdeprived.part73 / .part85 / .part94 (camp song lyrics)
- c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96
7 need RE-EXTRACTION (their malformed-original JSON was destroyed — see "json_repair incident" below; re-extract once the subagent session limit resets, ~5pm UTC):
3f9c8232_teambuilding_corbu_29092023.part01
5f959f85_scoli_fara_bullying.part02
83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04
d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05
e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03
Re-extract these (Sonnet subagents, one Agent call each, the per-chunk prompt is at
data/chunks/_prompts/<key>.prompt.md), then re-run the freeze rebuild so they join
the corpus before enrichment. Re-freezing is safe now — enrichment has NOT run, so no
overlay keys depend on the current freeze yet.
The json_repair incident (important — root cause + what was fixed)
Subagents systematically emit unescaped ASCII " inside string values (Romanian
text like „Unu" uses a closing " that terminates the JSON string early). ~34 files
were affected.
First repair attempt used the json_repair lib. It truncates: on a stray quote it
ends the string and reinterprets the trailing text as a new key, silently dropping the
rest of the value and injecting garbage keys. Schema additionalProperties:false caught
the garbage-key cases (8 files dropped at rebuild), but the truncation that didn't create
an extra key slipped through. Applying json_repair output to disk also overwrote the
malformed originals for those 8 → originals lost → those (now 7, one recovered) need
re-extraction.
Fix: scripts/repair_extractions.py was rewritten to use a faithful char-scanner
(escape_stray_quotes) that escapes stray quotes (\") instead of splitting on them,
validates against the real schema, and only replaces a valid top-level file when the
repaired version carries strictly more text (a length guard that catches truncated
json_repair output while leaving genuine extractions untouched). Re-running it cleanly
repaired the affected files; the final freeze had 0 schema-rejected, 0 invalid.
json_repair is no longer used anywhere. Do NOT reintroduce it.
build_database.py does NOT depend on the repair script (the "DB regenerable from
data/extracted/" invariant holds — plain json.loads only).
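To make the "escape instead of split" idea concrete, here is a minimal char-scanner sketch. It is NOT the code in scripts/repair_extractions.py; the closing-quote heuristic (a quote ends a string only when the next non-space character is one of `,` `}` `]` `:` or the end of input) is an assumption chosen for illustration, and it will misjudge a stray quote that happens to be followed by a comma. The schema validation and the length guard described above are left out.

```python
import json


def escape_stray_quotes(text: str) -> str:
    """Escape double quotes that appear inside JSON string values."""
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
        elif ch == "\\":
            # Copy an existing escape sequence through untouched.
            out.append(text[i:i + 2])
            i += 1
        elif ch == '"':
            # Look past whitespace to decide whether this quote closes the string.
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if j == n or text[j] in ",}]:":
                in_string = False
                out.append(ch)
            else:
                out.append('\\"')  # stray quote: keep the text, escape the quote
        else:
            out.append(ch)
        i += 1
    return "".join(out)


broken = '{"name": "Jocul „Unu" și doi", "n": 1}'
repaired = json.loads(escape_stray_quotes(broken))
print(repaired["name"])
```

Unlike a truncating repair, nothing after the stray quote is dropped or reinterpreted as a key, so the repaired value is never shorter than the original text.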
What the code does now (all committed)
Part A — plumbing (corpus-independent):
- app/models/database.py: new columns name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor, space_needed, estimated_fields (JSON), source_id, source_ids (JSON), chunk_key; FTS5 indexes the 4 *_ro columns (CREATE + all 3 triggers, kept in sync); indexes on indoor_outdoor/space_needed; search_activities gained indoor_outdoor and space_needed equality kwargs; _update_category_counts feeds both new axes into the categories table so dropdowns populate.
- app/models/activity.py: new fields + to_dict/from_dict; helpers get_display_name/get_display_description/get_display_rules/get_display_variations (RO-primary, EN fallback), has_translation, is_estimated(field), get_indoor_outdoor_display, get_space_needed_display.
- app/config_taxonomy.py: INDOOR_OUTDOOR, SPACE_NEEDED enums + RO labels + normalize_indoor_outdoor/normalize_space_needed (None on unrecognised, no fallback: never fabricate a value) + display-name helpers.
- scripts/build_database.py: dict_to_activity sets source_id + chunk_key; merge_cluster unions source_ids and carries rep's source_id/chunk_key but never touches enrichment fields (those are applied post-dedup).
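Two of the behaviours listed above, "None on unrecognised" and "RO-primary, EN fallback", sketched in isolation. The vocabulary tuple and the standalone function signatures are hypothetical stand-ins; the real enum lives in app/config_taxonomy.py and the real helper is a method on the activity model.

```python
from typing import Optional

# Hypothetical vocabulary, for illustration only.
INDOOR_OUTDOOR = ("indoor", "outdoor", "both")


def normalize_indoor_outdoor(value: object) -> Optional[str]:
    """Return a known enum value, or None; never guess a fallback."""
    if not isinstance(value, str):
        return None
    cleaned = value.strip().lower()
    return cleaned if cleaned in INDOOR_OUTDOOR else None


def get_display_name(name: str, name_ro: Optional[str]) -> str:
    """Prefer the Romanian name; fall back to the original when it is absent or blank."""
    if name_ro and name_ro.strip():
        return name_ro.strip()
    return name
```

Returning None rather than a default keeps unknown values out of the filter dropdowns instead of silently filing them under a wrong category.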
Part A — UI/search:
- app/services/search.py: _map_filters_to_db_fields maps indoor_outdoor/space_needed to DB equality filters.
- app/web/routes.py: new /source/<id> download route, shipped DARK behind SOURCE_DOWNLOAD_ENABLED (default false; copyright exposure, user opts in); resolves source_file under CORPUS_DIR via send_from_directory (traversal-safe, 404s for web-mirror sources). DISPLAY_NAMES extended with both new axes.
- app/config.py: SOURCE_DOWNLOAD_ENABLED, CORPUS_DIR.
- Templates: index.html/results.html have the 2 new dropdowns; cards use display helpers + (estimat) markers; activity.html is RO-primary with a collapsible "Text original" section, indoor/space cards, estimat markers, and the download link (only when the flag is on). main.css has .estimated/.original-text styles.
Part B — enrichment pipeline (built, not yet run):
- scripts/build_database.py: load_enrichment + apply_enrichment(activities, enrichment) applied right after apply_review_decisions, on the post-dedup list, keyed on import_common.content_key(normalized_name, language, _normalize_text(description)) (reused verbatim). CLI --enrichment (default data/enrichment.json). QA report prints enrichment {entries, matched, orphaned} + per-field stated vs estimated counts. Translated/expanded text is NOT re-validated against source (by design).
- scripts/run_enrichment.py: reads the rebuilt DB, computes each row's content_key, skips rows already in data/enrichment_parts/<key>.json (resumable), emits one prompt per activity to data/enrichment_prompts/ (current EN fields + source chunk text via find_chunk_text). Pilot scoping: --source <substr> and/or --limit N. --collect merges parts → data/enrichment.json.
- scripts/ENRICHMENT_PROMPT.md: single-pass rules: translate faithfully, expand description_ro ONLY from chunk text, mark inferred filter fields in estimated_fields, fixed enum vocab, output data/enrichment_parts/<content_key>.json including content_key.
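How a key-based overlay produces the entries / matched / orphaned counts can be sketched as below. This is a stand-in, not the code in scripts/build_database.py: `demo_content_key` is a hypothetical hash used only so the example runs on its own (the real key comes from import_common.content_key), and `overlay_enrichment` is an illustrative name.

```python
import hashlib


def demo_content_key(name: str, language: str, description: str) -> str:
    # Stand-in key: any change to the inputs yields a different key.
    raw = "\x1f".join((name, language, description))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def overlay_enrichment(activities: list[dict], enrichment: dict[str, dict]) -> dict[str, int]:
    """Merge enrichment fields into matching activities and report the counts."""
    matched = set()
    for act in activities:
        key = demo_content_key(act["name"], act["language"], act["description"])
        extra = enrichment.get(key)
        if extra is not None:
            act.update(extra)  # overlay the enrichment fields onto the activity
            matched.add(key)
    return {
        "entries": len(enrichment),
        "matched": len(matched),
        "orphaned": len(enrichment) - len(matched),
    }
```

An entry is orphaned when no activity hashes to its key any more, which is why re-extracting or re-freezing after enrichment has started is dangerous: a changed description changes the key and strands the enrichment entry.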
Exact next steps
- Re-extract the 7 chunks above (after session-limit reset). Verify each writes valid JSON (python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"). If any come back malformed, run python3 scripts/repair_extractions.py --apply (faithful now).
- Re-freeze: python3 scripts/build_database.py --rebuild. Confirm 0 schema-rejected and note the new total (~9418 + the 7 chunks' activities).
- Enrichment PILOT (plan B5, the STOP gate guarding 6–8k LLM calls):
  - Pick one source, e.g. python3 scripts/run_enrichment.py --source teambuilding_corbu (or --limit 30). This writes prompts to data/enrichment_prompts/.
  - Launch a small wave of Sonnet subagents on those prompts (each writes data/enrichment_parts/<key>.json).
  - python3 scripts/run_enrichment.py --collect → data/enrichment.json.
  - python3 scripts/build_database.py --rebuild (picks up --enrichment by default).
  - STOP. Hand the user translation-quality + estimation-plausibility + description-fidelity samples and get sign-off BEFORE scaling to the full corpus. Do not auto-proceed past this gate.
- After sign-off: scale enrichment in waves of ~8–16 Sonnet subagents, --collect, final --rebuild --enrichment.
Verify / run
- Tests: python3 -m pytest tests/ -q → 99 pass.
- App: SOURCE_DOWNLOAD_ENABLED is false by default (download link hidden). Set it true only if the user accepts the copyright exposure of serving original files.
- data/activities.db.bak is the pre-this-freeze backup.