Headless cron enrichment system + progress checkpoint at 32%
OS cron fires enrich_wave.sh twice nightly (post 23:00 UTC reset); each wave caps at ~700 keys (~75% window) via enrichment_wave.py --prepare. Fully headless: one claude -p per batch via xargs, flock-guarded, idempotent. DB updated to 9541 activities; .gitignore covers enrichment intermediates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
12
.gitignore
vendored
@@ -165,9 +165,14 @@ cython_debug/
 *.db.backup
 *.db.bak
 *.db.tmp
+*.db.prefreeze*
 *.sqlite.backup
 *.sqlite3.backup
 
+# Agent runtime locks
+.claude/scheduled_tasks.lock
+.claude/*.lock
+
 # Temporary files
 *.tmp
 *.backup
@@ -179,6 +184,13 @@ data/sources/
 data/chunks/
 data/extracted/
 
+# Enrichment pipeline intermediates (LLM output; final result lands in data/activities.db)
+data/enrichment_prompts/
+data/enrichment_parts/
+data/enrichment_batches/
+data/enrichment_wf/
+data/enrichment.json
+
 # Keep main production database, the hand-written index, and committed golden set
 !data/activities.db
 !data/INDEX_MASTER_JOCURI_ACTIVITATI.md
71
ENRICHMENT_PILOT.md
Normal file
@@ -0,0 +1,71 @@
|
|||||||
|
# Enrichment PILOT — sign-off required before full-corpus scaling
|
||||||
|
|
||||||
|
**Date:** 2026-05-29. Pilot covers **34 activities** (the STOP gate from `HANDOFF.md`
|
||||||
|
step 3, guarding ~6–8k LLM calls across the full corpus).
|
||||||
|
|
||||||
|
## Pipeline integrity (all green)
|
||||||
|
|
||||||
|
| Hop | Expected | Actual |
|
||||||
|
|-----|----------|--------|
|
||||||
|
| prompts emitted | 34 | 34 |
|
||||||
|
| part files on disk (valid JSON, key matches filename) | 34 | 34 |
|
||||||
|
| `enrichment.json` entries after `--collect` | 34 | 34 |
|
||||||
|
| rebuild overlay: `matched` / `orphaned` | 34 / 0 | **34 / 0** |
|
||||||
|
|
||||||
|
No leak at any hop. `orphaned 0` confirms the content_key the rebuild computes
|
||||||
|
matches what `run_enrichment` emitted (no dedup rep-selection drift).
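
The hop-by-hop check in the table can be sketched as a small script. This is a minimal sketch, not the project's actual tooling: `hop_counts` and `demo` are hypothetical helpers, and the sketch assumes `enrichment.json` is a JSON list or object with one entry per key. The only grounded detail is that each part file carries a `content_key` equal to its filename.

```python
import json
import tempfile
from pathlib import Path


def hop_counts(prompts_dir: Path, parts_dir: Path, overlay_path: Path) -> dict:
    """Count artifacts at each hop (hypothetical helper, not part of the repo)."""
    prompts = list(prompts_dir.glob("*.prompt.md"))
    valid_parts = 0
    for part in parts_dir.glob("*.json"):
        try:
            data = json.loads(part.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or invalid JSON: not counted
        # A part only counts if its content_key matches its filename.
        if isinstance(data, dict) and data.get("content_key") == part.stem:
            valid_parts += 1
    # Assumption: the overlay is a JSON list or object with one entry per key.
    entries = 0
    if overlay_path.is_file():
        entries = len(json.loads(overlay_path.read_text(encoding="utf-8")))
    return {"prompts": len(prompts), "parts": valid_parts, "entries": entries}


def demo() -> dict:
    """Build a throwaway tree with two keys and count each hop."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "prompts").mkdir()
        (root / "parts").mkdir()
        for key in ("k1", "k2"):
            (root / "prompts" / f"{key}.prompt.md").write_text("prompt", encoding="utf-8")
            (root / "parts" / f"{key}.json").write_text(
                json.dumps({"content_key": key}), encoding="utf-8"
            )
        (root / "enrichment.json").write_text(
            json.dumps({"k1": {}, "k2": {}}), encoding="utf-8"
        )
        return hop_counts(root / "prompts", root / "parts", root / "enrichment.json")


print(demo())
```

A leak at any hop shows up as one of the three counts falling below the others.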
|
||||||
|
|
||||||
|
## Pilot composition
|
||||||
|
|
||||||
|
Deliberately mixed to exercise BOTH operations (corpus is 7076 EN / 2465 RO, so
|
||||||
|
en→ro translation is the dominant + highest-risk path):
|
||||||
|
|
||||||
|
- **26** rows from `teambuilding_corbu` — all Romanian → **ro→ro polish**
|
||||||
|
- **8** rows from `d3959920_outdoor_games` — all English → **en→ro translation**
|
||||||
|
|
||||||
|
Result: ~7 genuine en→ro translations + ~27 ro→ro polish.
|
||||||
|
|
||||||
|
## Field population (stated vs estimated)
|
||||||
|
|
||||||
|
```
|
||||||
|
age_group_max : 0 stated / 30 estimated
|
||||||
|
age_group_min : 0 / 34
|
||||||
|
duration_max : 3 / 29
|
||||||
|
duration_min : 4 / 28
|
||||||
|
indoor_outdoor : 12 / 22
|
||||||
|
participants_max : 0 / 24
|
||||||
|
participants_min : 4 / 30
|
||||||
|
space_needed : 2 / 32
|
||||||
|
```
|
||||||
|
|
||||||
|
Almost everything is estimated — sources rarely state ages/durations explicitly.
|
||||||
|
The pipeline marks every inferred field in `estimated_fields`, and the UI shows an
|
||||||
|
`(estimat)` marker, so estimates are transparent to end users.
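
A tally like the one above could be produced roughly as follows. This is a sketch under assumptions: it treats `estimated_fields` as a list of field names on each row, counts a field as "stated" when it is non-null and absent from that list, and skips null values. The sample rows and the `tally` helper are illustrative, not taken from the database.

```python
from collections import Counter

FIELDS = (
    "age_group_min", "age_group_max", "duration_min", "duration_max",
    "indoor_outdoor", "participants_min", "participants_max", "space_needed",
)


def tally(rows):
    """Return {field: (stated, estimated)}; null values are skipped."""
    stated, estimated = Counter(), Counter()
    for row in rows:
        inferred = set(row.get("estimated_fields") or [])
        for field in FIELDS:
            if row.get(field) is None:
                continue
            if field in inferred:
                estimated[field] += 1
            else:
                stated[field] += 1
    return {field: (stated[field], estimated[field]) for field in FIELDS}


# Illustrative rows only; the enum values here are placeholders.
sample_rows = [
    {"age_group_min": 10, "age_group_max": 99, "indoor_outdoor": "outdoor",
     "estimated_fields": ["age_group_min", "age_group_max"]},
    {"age_group_min": 6, "indoor_outdoor": "indoor",
     "estimated_fields": ["age_group_min", "indoor_outdoor"]},
]

for field, (n_stated, n_estimated) in tally(sample_rows).items():
    print(f"{field:17s}: {n_stated} / {n_estimated}")
```

Skipping nulls would explain why some rows in the table above sum to fewer than 34.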
|
||||||
|
|
||||||
|
## What to evaluate (the three sign-off axes)
|
||||||
|
|
||||||
|
1. **Translation fidelity (en→ro)** — e.g. *Labels → Etichete*, *Ships in a Fog →
|
||||||
|
Nave în ceață*, *Spot the Colours → Găsește culorile*. Game rules preserved,
|
||||||
|
no moralizing added, proper terms kept.
|
||||||
|
2. **Description fidelity / expansion** — ro→ro rows fold in setup/material detail
|
||||||
|
that IS in the source chunk (e.g. *Găsește-ți fratele și sora* adds "carton A6"
|
||||||
|
+ "la semnal, toți încep simultan"; *Ce-mi place?* folds in the character-traits
|
||||||
|
discussion). No invented steps observed.
|
||||||
|
3. **Estimation plausibility** — mostly reasonable. **Weak spots to judge:** a few
|
||||||
|
age ranges are very wide/defaulted (e.g. *Găsește-ți fratele și sora* → age
|
||||||
|
10–99). If wide age defaults are unacceptable, tighten the ENRICHMENT_PROMPT
|
||||||
|
guidance before scaling.
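
To judge the weak spot in axis 3, rows with very wide age ranges can be listed mechanically. The sketch below assumes rows are dictionaries with `name`, `age_group_min` and `age_group_max`; the `max_span` threshold of 40 years is an arbitrary illustration, not a value from the prompt guidance.

```python
def wide_age_rows(rows, max_span=40):
    """Return names of rows whose age range spans more than max_span years."""
    flagged = []
    for row in rows:
        low, high = row.get("age_group_min"), row.get("age_group_max")
        if low is not None and high is not None and high - low > max_span:
            flagged.append(row["name"])
    return flagged


# The first row mirrors the 10-99 example above; the second is made up.
pilot_rows = [
    {"name": "Găsește-ți fratele și sora", "age_group_min": 10, "age_group_max": 99},
    {"name": "Etichete", "age_group_min": 12, "age_group_max": 18},
]

print(wide_age_rows(pilot_rows))
```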
|
||||||
|
|
||||||
|
## Inspect the data yourself
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sqlite3 data/activities.db "select name, name_ro, language, indoor_outdoor, space_needed, estimated_fields from activities where name_ro is not null;"
|
||||||
|
# raw overlay: data/enrichment.json (34 entries)
|
||||||
|
# per-activity parts: data/enrichment_parts/*.json
|
||||||
|
```
|
||||||
|
|
||||||
|
## After sign-off (do NOT auto-proceed)
|
||||||
|
|
||||||
|
Scale in waves of ~8–16 Sonnet subagents over the rest of the corpus
|
||||||
|
(`run_enrichment.py` is additive + resumable — skips already-enriched keys),
|
||||||
|
`--collect`, then final `build_database.py --rebuild --enrichment`.
|
||||||
147
HANDOFF.md
@@ -1,20 +1,153 @@
|
|||||||
# HANDOFF — Faza 1 extraction DONE, code landed, DB frozen; next = re-extract 7 + enrichment pilot
|
# HANDOFF — Enrichment PILOT done; STOP at user sign-off gate before full-corpus scaling
|
||||||
|
|
||||||
**Snapshot:** 2026-05-29. Executing plan `enumerated-petting-badger.md` (bilingual
|
**Snapshot:** 2026-05-29 (updated). Executing plan `enumerated-petting-badger.md`
|
||||||
index + enrichment + new filters + source download).
|
(bilingual index + enrichment + new filters + source download).
|
||||||
|
|
||||||
|
**>>> CURRENT STATE (2026-05-29): user SIGNED OFF on the pilot → full-corpus
|
||||||
|
enrichment SCALING IN PROGRESS via 8 PARALLEL background Workflows on Sonnet.
|
||||||
|
Parts on disk (`data/enrichment_parts/<key>.json`) = the durable checkpoint. <<<**
|
||||||
|
|
||||||
|
Two earlier single-workflow runs were stopped: the first ran on Opus by mistake
|
||||||
|
(workflow `agent()` inherits the main-loop model unless `model:'sonnet'` is passed —
|
||||||
|
fixed). Measured rate: ~4.7 keys/min single-shard, ~17 keys/min at 3 shards (linear,
|
||||||
|
no real rate-limit errors — the "429" hits in transcripts are line numbers in chunk
|
||||||
|
text, not API errors). Concurrency is capped at 2 PER workflow (`nproc`=4 →
|
||||||
|
`min(16,cores-2)`), so parallelism = run many workflows.
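
The cap arithmetic can be checked with a few lines. This sketch only restates the formula quoted in the note; the helper names are hypothetical, and the target of 16 concurrent agents is an assumption chosen to show why 8 workflows are used.

```python
import math


def per_workflow_cap(cores: int) -> int:
    """Concurrency cap per workflow, using the formula quoted in the note."""
    return min(16, cores - 2)


def workflows_needed(target_agents: int, cores: int) -> int:
    """How many workflows must run side by side to reach target_agents."""
    cap = per_workflow_cap(cores)
    if cap < 1:
        raise ValueError("formula yields no usable slots for this core count")
    return math.ceil(target_agents / cap)


print(per_workflow_cap(4))      # 2
print(workflows_needed(16, 4))  # 8
```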
|
||||||
|
|
||||||
|
**8 shard scripts: `data/enrichment_wf/shard_0.js` … `shard_7.js`**, each owns a
|
||||||
|
disjoint batch range of `data/enrichment_batches/batch_NNNN.txt` (780 batches × ~12
|
||||||
|
keys = 9357 keys; ranges: s0 [0,98) s1 [98,196) s2 [196,294) s3 [294,392) s4 [392,490)
|
||||||
|
s5 [490,588) s6 [588,686) s7 [686,780)). Each agent is IDEMPOTENT (skips keys whose
|
||||||
|
part already exists + parses), so re-launching any shard is safe. Run IDs:
|
||||||
|
s0 `wf_3c314d06-01c` · s1 `wf_ecc7d151-a11` · s2 `wf_4156be35-748` ·
|
||||||
|
s3 `wf_fa16abee-17a` · s4 `wf_a0f595b8-8fe` · s5 `wf_b3505593-09a` ·
|
||||||
|
s6 `wf_ad0d731e-12e` · s7 `wf_a919a99b-1d2`.
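
The listed ranges are consistent with a simple ceiling-step split of 780 batches over 8 shards. Whether the shard files were actually generated this way is an assumption; the `contiguous_ranges` helper below is only a sketch that reproduces the same half-open ranges.

```python
import math


def contiguous_ranges(n_batches: int, n_shards: int):
    """Split [0, n_batches) into half-open ranges of ceil(n_batches / n_shards)."""
    step = math.ceil(n_batches / n_shards)
    return [(start, min(start + step, n_batches)) for start in range(0, n_batches, step)]


for shard, (start, end) in enumerate(contiguous_ranges(780, 8)):
    print(f"s{shard} [{start},{end})")
```

The ranges are disjoint and cover every batch, so no two shards ever write the same part file.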
|
||||||
|
|
||||||
|
### ▶ RESUME HERE (2026-06-01 — THROTTLED CRON SYSTEM now drives enrichment)
|
||||||
|
|
||||||
|
**Enrichment progress: 3074 / 9541 done (32.2%), ~6467 missing.** Nothing running.
|
||||||
|
Parts on disk (`data/enrichment_parts/*.json`) are the durable, idempotent checkpoint.
|
||||||
|
|
||||||
|
**A paced wave system now runs FULLY HEADLESS via the LXC's OS cron — NO Claude
|
||||||
|
session required.** Fixes the "always runs to exhaustion" bug: each wave caps at
|
||||||
|
~75% of a 5h window and the next window is reached by time (cron).
|
||||||
|
|
||||||
|
ARCHITECTURE: OS cron → `scripts/enrich_wave.sh` → one `claude -p` per batch,
|
||||||
|
PAR-way parallel (OS-level — NOT the Workflow tool, which can't be used headless:
|
||||||
|
`claude -p` is one-shot and would exit before background workflows finish). Each
|
||||||
|
headless `claude -p` reads a batch file and writes `data/enrichment_parts/<key>.json`.
|
||||||
|
|
||||||
|
- **`scripts/enrichment_wave.py`** (prepares a bounded wave, no LLM):
|
||||||
|
- `--status` — read-only progress (done / missing / pct / corrupt count).
|
||||||
|
- `--prepare --keys 700 --no-shards` — drop corrupt parts; take FIRST 700
|
||||||
|
sorted-missing keys; write batch files for ONLY those; print
|
||||||
|
`WAVE: PREPARED …` or `WAVE: COMPLETE`. `--no-shards` = batch files only
|
||||||
|
(the headless path). (Without `--no-shards` it also regenerates Workflow shard
|
||||||
|
JS from `data/enrichment_wf/shard.js.tmpl` — only needed for the old Workflow path.)
|
||||||
|
- **`scripts/enrich_wave.sh [KEYS] [PAR]`** (the headless orchestrator, run by cron):
|
||||||
|
flock-guarded (waves never overlap); `--prepare`; if `WAVE: COMPLETE` → `--collect`
|
||||||
|
+ `--rebuild` and stop; else `xargs -P PAR` one `claude -p` per batch
|
||||||
|
(`--allowedTools Bash(python3:*),Read,Write,Bash(cat:*),Bash(ls:*)`, `</dev/null`).
|
||||||
|
Logs to `/workspace/.claude-logs/enrich_<ts>.log`. Detects + logs "WINDOW EXHAUSTED".
|
||||||
|
- **OS crontab (user `claude`, `crontab -l` to view):** two night fires
|
||||||
|
`20 23 * * *` and `50 0 * * *` UTC (= 02:20 & 03:50 EEST). Timed AFTER the live-
|
||||||
|
confirmed **23:00 UTC usage-window reset** so both land in the fresh post-reset
|
||||||
|
window (user asleep → safe to use it fully; two 700-caps top out at the window's
|
||||||
|
~950 capacity). Self-healing: a fire into an exhausted window is a harmless no-op
|
||||||
|
(`claude -p` prints "session limit", writes nothing) and those keys retry next fire.
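
The bounded-wave selection described for `--prepare` can be sketched in memory. This is a simplified illustration, not the script itself: `plan_wave` is a hypothetical helper that takes key lists instead of scanning directories, and the key names are made up.

```python
def plan_wave(all_keys, done_keys, keys=700, batch_size=12):
    """Take the first `keys` sorted missing keys and group them into batches."""
    missing = sorted(set(all_keys) - set(done_keys))
    take = missing[:keys]
    return [take[i:i + batch_size] for i in range(0, len(take), batch_size)]


# Made-up universe: 1000 keys, the first 100 already enriched.
universe = [f"key_{i:04d}" for i in range(1000)]
done = universe[:100]

batches = plan_wave(universe, done)
print(len(batches), len(batches[0]), len(batches[-1]))  # 59 12 4
```

Because the selection is sorted and clamped, a repeated fire with no new parts on disk picks exactly the same keys again.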
|
||||||
|
|
||||||
|
**Auth caveat:** headless `claude -p` uses the OAuth token in
|
||||||
|
`~/.claude/.credentials.json` (verified working). If it ever expires and can't refresh
|
||||||
|
non-interactively, cron fires fail with auth errors → user must `claude` login once.
|
||||||
|
|
||||||
|
**Manual fallback (one wave, any time, no session needed):**
|
||||||
|
```bash
|
||||||
|
/workspace/game-library/scripts/enrich_wave.sh 700 6 # runs a full wave now
|
||||||
|
# or step-by-step:
|
||||||
|
python3 scripts/enrichment_wave.py --status # progress
|
||||||
|
python3 scripts/run_enrichment.py --collect && python3 scripts/build_database.py --rebuild # at WAVE: COMPLETE
|
||||||
|
# gate: rebuild must print enrichment {N} (matched N, orphaned 0)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Control:** `crontab -e` to retime/disable; `crontab -r` removes all. Tune `--keys`
|
||||||
|
(KEYS arg) up to drain faster, down if logs show "WINDOW EXHAUSTED" early. One full
|
||||||
|
window ≈ 950 keys ≈ 100%.
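
A rough budget estimate follows from the numbers above. It is a back-of-the-envelope sketch that assumes both nightly fires land in the same window and that every night drains a full window of about 950 keys; real throughput will vary.

```python
import math

REMAINING_KEYS = 6467    # missing keys at this checkpoint
WINDOW_CAPACITY = 950    # approximate keys per 5h window
WAVE_CAP = 700           # KEYS argument per fire
FIRES_PER_NIGHT = 2

cap_share = WAVE_CAP / WINDOW_CAPACITY
per_night = min(FIRES_PER_NIGHT * WAVE_CAP, WINDOW_CAPACITY)
nights = math.ceil(REMAINING_KEYS / per_night)

print(f"one wave uses about {cap_share:.0%} of a window")     # 74%
print(f"about {nights} nights to drain {REMAINING_KEYS} keys")  # 7
```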
|
||||||
|
|
||||||
|
**Hard facts learned:**
|
||||||
|
- Workflow concurrency is capped at **2 per workflow** (`nproc`=4 → `min(16,cores-2)`); parallelism = run many workflow processes. 3 shards measured ~17 keys/min (linear, no real rate-limit).
|
||||||
|
- Workflow `agent()` inherits the **main-loop model unless `model:'sonnet'` is passed** — the FIRST run silently used Opus; always pass model.
|
||||||
|
- The full corpus does **NOT fit in one 5h usage window** — it needs SEVERAL windows. Parallelism only cuts wall-clock inside a window, not total token budget.
|
||||||
|
- Main-session token drain was **polling** (sleep/grep loops), NOT launching workflows. Launch + wait-for-notification only.
|
||||||
|
- StructuredOutput failures appear when a window exhausts mid-run — harmless; idempotent skip + the regenerate-from-missing reconcile recover every dropped key.
|
||||||
|
|
||||||
|
(prev note) Earlier STOPPED at 593/9541 — hit 92% of the 5h Anthropic
|
||||||
|
usage window (resets 23:00 UTC). KEY LESSON: the full corpus does NOT fit in one 5h
|
||||||
|
window; 6.2% + the session's other work already used ~92%. Enrichment must be spread
|
||||||
|
across MANY 5h windows (parallelism only cuts wall-clock inside a window, not total
|
||||||
|
token budget). Resume model: after each window reset, regenerate batches from
|
||||||
|
currently-missing, relaunch a bounded number of shards, stop before the window
|
||||||
|
exhausts. Idempotent shards + parts-on-disk make this safe to repeat. Consider a
|
||||||
|
cron/scheduled job that runs a bounded wave each reset.
|
||||||
|
|
||||||
|
**To regenerate batches from currently-missing + relaunch a shard** (reconcile):
|
||||||
|
```bash
|
||||||
|
python3 - <<'PY'
|
||||||
|
import glob, os
|
||||||
|
BATCH=12
|
||||||
|
missing=sorted(os.path.basename(p)[:-len('.prompt.md')] for p in glob.glob('data/enrichment_prompts/*.prompt.md')
|
||||||
|
if not os.path.exists('data/enrichment_parts/'+os.path.basename(p)[:-len('.prompt.md')]+'.json'))
|
||||||
|
for old in glob.glob('data/enrichment_batches/batch_*.txt'): os.remove(old)
|
||||||
|
for n,i in enumerate(range(0,len(missing),BATCH)):
|
||||||
|
open(f'data/enrichment_batches/batch_{n:04d}.txt','w').write('\n'.join(missing[i:i+BATCH])+'\n')
|
||||||
|
print('missing',len(missing),'batches',-(-len(missing)//BATCH))  # ceil division; also safe when nothing is missing
|
||||||
|
PY
|
||||||
|
# then edit START/END in the shard files to cover the new batch count and re-invoke Workflow({scriptPath: 'data/enrichment_wf/shard_K.js'})
|
||||||
|
```
|
||||||
|
|
||||||
|
### Resume / completion procedure (do this when the workflow finishes — or to continue a new session)
|
||||||
|
|
||||||
|
The pipeline is RESUMABLE: parts on disk are truth; re-running regenerates work only for missing keys.
|
||||||
|
|
||||||
|
1. **Rebuild the batch list from what's still missing** (prompt exists, part absent), then re-run the workflow for the gap:
|
||||||
|
```bash
|
||||||
|
# regenerate batch files for missing keys (script below already lives in shell history; logic:
|
||||||
|
# for each data/enrichment_prompts/<key>.prompt.md with no data/enrichment_parts/<key>.json,
|
||||||
|
# split into data/enrichment_batches/batch_NNNN.txt of 12)
|
||||||
|
```
|
||||||
|
The workflow script is at
|
||||||
|
`.../workflows/scripts/enrich-corpus-wf_440c0a2f-17f.js` (nBatches hardcoded → update it to the new batch count, or it defaults to 793). Re-invoke with `{scriptPath: ...}`.
|
||||||
|
2. **Reconcile loop** (expect 2–3 passes — some parts WILL drop: flaky agents, a stray quote that slips re-validation): repeat step 1 until `missing == 0`.
|
||||||
|
3. **Collect + final rebuild ONCE at the end** (don't rebuild after every wave — 9541 rows is wasted work):
|
||||||
|
```bash
|
||||||
|
python3 scripts/run_enrichment.py --collect # robust: repairs stray-quote parts, skips+reports truly-broken
|
||||||
|
python3 scripts/build_database.py --rebuild # picks up --enrichment by default
|
||||||
|
```
|
||||||
|
**Gate:** rebuild must print `enrichment {entries} (matched {entries}, orphaned 0)`. Done-criterion is the reconcile counts converging: `emitted == parts-on-disk == entries == matched`.
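
The gate can be checked mechanically against the rebuild log. The sketch below assumes the log line follows the template quoted above exactly; `gate_passes` and the regular expression are illustrative, not part of the repository.

```python
import re

# Assumed line shape: "enrichment 34 (matched 34, orphaned 0)"
GATE_RE = re.compile(r"enrichment\s+(\d+)\s+\(matched\s+(\d+),\s*orphaned\s+(\d+)\)")


def gate_passes(log_text: str) -> bool:
    """True when the last gate line reports entries == matched and orphaned == 0."""
    hits = GATE_RE.findall(log_text)
    if not hits:
        return False
    entries, matched, orphaned = map(int, hits[-1])
    return entries == matched and orphaned == 0


print(gate_passes("rebuild ok\nenrichment 34 (matched 34, orphaned 0)\n"))
print(gate_passes("enrichment 34 (matched 33, orphaned 1)"))
```

Only the last match is used, so an earlier failed rebuild in the same log does not mask a later clean one.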
|
||||||
|
|
||||||
|
### ⚠ FREEZE IS NOW LOCKED
|
||||||
|
Enrichment content_keys depend on the current freeze. The earlier "re-freezing is safe"
|
||||||
|
note is **INVERTED** now — do NOT re-extract or re-freeze `data/extracted/` until the
|
||||||
|
final `--rebuild`, or content_keys drift and the overlay orphans.
|
||||||
|
|
||||||
## Where we are
|
## Where we are
|
||||||
|
|
||||||
| Step (plan Part C) | Status |
|
| Step (plan Part C) | Status |
|
||||||
|--------------------|--------|
|
|--------------------|--------|
|
||||||
| 1. Finish extraction | **DONE** — 575/588 chunks extracted & valid; 13 missing (see below) |
|
| 1. Finish extraction | **DONE** — 582 chunks extracted & valid (7 re-extracted this session); 6 content-filter-blocked, accepted as missing |
|
||||||
| 2. Land code Part A1–A4 (model/schema/merge) | **DONE & committed** |
|
| 2. Land code Part A1–A4 (model/schema/merge) | **DONE & committed** |
|
||||||
| 2b. Code Part A5–A8 (UI/search/download) | **DONE & committed** |
|
| 2b. Code Part A5–A8 (UI/search/download) | **DONE & committed** |
|
||||||
| 2c. Code Part B2–B4 (enrichment pipeline) | **DONE & committed** |
|
| 2c. Code Part B2–B4 (enrichment pipeline) | **DONE & committed** |
|
||||||
| 3. Freeze rebuild (freezes content_keys) | **DONE** — `data/activities.db` = **9418 activities** |
|
| 3. Freeze rebuild (freezes content_keys) | **DONE** — `data/activities.db` = **9541 activities** (re-frozen with the 7 chunks) |
|
||||||
| Part D tests | **DONE** — `tests/test_enrichment.py`, 99 pass total |
|
| Part D tests | **DONE** — `tests/test_enrichment.py`, 99 pass total |
|
||||||
| 4. Enrichment pilot → **STOP for user sign-off** | **NOT STARTED** (this is the gate) |
|
| 4. Enrichment pilot → **STOP for user sign-off** | **DONE — 34 activities enriched (26 ro-polish + 8 en→ro), pipeline 34/34 matched, 0 orphaned. AWAITING SIGN-OFF.** |
|
||||||
| 5. Final rebuild `--enrichment` | not started |
|
| 5. Final rebuild `--enrichment` | not started (post sign-off) |
|
||||||
|
|
||||||
|
## The 7 re-extracted chunks (this session)
|
||||||
|
|
||||||
|
Re-extracted via Sonnet subagents, all valid JSON, re-frozen into the corpus.
|
||||||
|
One (`d297a434…part01`) had an activity named "Eu" (<3 chars, schema-rejected);
|
||||||
|
renamed faithfully to "Eu sunt..." (matches the source affirmation). The 6
|
||||||
|
content-filter-blocked chunks remain accepted as missing.
|
||||||
|
|
||||||
Everything is committed except whatever this session leaves dirty. `data/extracted/*.json`
|
Everything is committed except whatever this session leaves dirty. `data/extracted/*.json`
|
||||||
is gitignored (575 files on disk, durable across /clear).
|
is gitignored (575 files on disk, durable across /clear).
|
||||||
|
|||||||
Binary file not shown.
102
scripts/enrich_wave.sh
Executable file
@@ -0,0 +1,102 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
# ============================================================================
|
||||||
|
# enrich_wave.sh — ONE throttled enrichment wave, fully headless (no Claude
|
||||||
|
# session). Designed to be run by the LXC's OS cron at night.
|
||||||
|
#
|
||||||
|
# - Prepares a bounded wave (first N missing keys) via enrichment_wave.py.
|
||||||
|
# - Runs ONE `claude -p` per batch file, PAR batches concurrently (OS-level
|
||||||
|
# parallelism — no Workflow tool, no 2-per-workflow cap, no session needed).
|
||||||
|
# - When the backlog is empty, runs --collect + --rebuild and stops.
|
||||||
|
#
|
||||||
|
# Throttle = --keys (default 700 ≈ 75% of a 5h usage window ≈ 950 keys).
|
||||||
|
# A single flock guarantees waves never overlap.
|
||||||
|
#
|
||||||
|
# Usage: scripts/enrich_wave.sh [KEYS] [PAR]
|
||||||
|
# KEYS = max keys this wave (default 700)
|
||||||
|
# PAR = concurrent claude -p (default 6)
|
||||||
|
# ============================================================================
|
||||||
|
set -uo pipefail
|
||||||
|
|
||||||
|
REPO="/workspace/game-library"
|
||||||
|
LOG_DIR="/workspace/.claude-logs"
|
||||||
|
LOCK="/tmp/enrich_wave.lock"
|
||||||
|
KEYS="${1:-700}"
|
||||||
|
PAR="${2:-6}"
|
||||||
|
MAX_TURNS=100
|
||||||
|
|
||||||
|
# --- environment (cron has a minimal env) ---------------------------------- #
|
||||||
|
export HOME="${HOME:-/home/claude}"
|
||||||
|
[ -f "$HOME/.nvm/nvm.sh" ] && . "$HOME/.nvm/nvm.sh" >/dev/null 2>&1
|
||||||
|
export PATH="$HOME/.nvm/versions/node/v20.19.6/bin:/usr/local/bin:/usr/bin:/bin:$PATH"
|
||||||
|
|
||||||
|
mkdir -p "$LOG_DIR"
|
||||||
|
TS="$(date +%Y%m%d_%H%M%S)"
|
||||||
|
LOG="$LOG_DIR/enrich_${TS}.log"
|
||||||
|
|
||||||
|
log() { echo "[$(date '+%H:%M:%S')] $*" | tee -a "$LOG"; }
|
||||||
|
|
||||||
|
# --- single-instance lock: skip if a wave is still running ----------------- #
|
||||||
|
exec 9>"$LOCK"
|
||||||
|
if ! flock -n 9; then
|
||||||
|
log "another wave holds the lock; exiting."
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
cd "$REPO" || { log "cannot cd $REPO"; exit 1; }
|
||||||
|
command -v claude >/dev/null 2>&1 || { log "claude CLI not on PATH"; exit 1; }
|
||||||
|
|
||||||
|
log "=== enrichment wave start (keys=$KEYS par=$PAR) ==="
|
||||||
|
|
||||||
|
# --- 1) prepare bounded wave (batch files only) ---------------------------- #
|
||||||
|
PREP="$(python3 scripts/enrichment_wave.py --prepare --keys "$KEYS" --no-shards 2>&1)"
|
||||||
|
echo "$PREP" | tee -a "$LOG"
|
||||||
|
|
||||||
|
if echo "$PREP" | grep -q "WAVE: COMPLETE"; then
|
||||||
|
log "backlog empty -> collect + rebuild"
|
||||||
|
python3 scripts/run_enrichment.py --collect >>"$LOG" 2>&1
|
||||||
|
python3 scripts/build_database.py --rebuild >>"$LOG" 2>&1
|
||||||
|
grep -E "enrichment .*matched" "$LOG" | tail -1 | tee -a "$LOG"
|
||||||
|
log "=== ENRICHMENT COMPLETE ==="
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
# --- 2) per-batch headless enrichment, PAR-way parallel -------------------- #
|
||||||
|
read -r -d '' BATCH_PROMPT <<'EOP'
|
||||||
|
You are an enrichment subagent in the game-library pipeline. Working dir: /workspace/game-library.
|
||||||
|
|
||||||
|
Read `scripts/ENRICHMENT_PROMPT.md` FIRST — it defines the rules and output format EXACTLY (translate faithfully to Romanian; expand description_ro ONLY from the source chunk text; mark inferred filter fields in estimated_fields; fixed enum vocab).
|
||||||
|
|
||||||
|
Your batch file is __BATCHFILE__ — it lists content_keys, one per line. For EACH key:
|
||||||
|
1. IDEMPOTENT SKIP: if `data/enrichment_parts/<key>.json` already exists AND parses as valid JSON, SKIP it (do not rewrite).
|
||||||
|
2. Otherwise read its prompt `data/enrichment_prompts/<key>.prompt.md`, produce the enrichment JSON per ENRICHMENT_PROMPT.md, and write it to `data/enrichment_parts/<key>.json` (MUST include the exact "content_key": "<key>").
|
||||||
|
3. Validate it parses: python3 -c "import json;json.load(open('data/enrichment_parts/<key>.json'))".
|
||||||
|
|
||||||
|
CRITICAL — JSON quote escaping: any literal ASCII double-quote inside a string value MUST be escaped as \". Romanian text uses „cuvant" where the closing mark is a plain ASCII " — written raw it breaks the JSON. Either keep the typographic „ " marks or escape every ASCII ". Re-read and re-validate each file; fix any that fail.
|
||||||
|
|
||||||
|
Work through EVERY key in the batch file. If a key's prompt is missing, skip it and continue. When done, reply with one line: the count written and skipped.
|
||||||
|
EOP
|
||||||
|
|
||||||
|
export REPO LOG MAX_TURNS BATCH_PROMPT
|
||||||
|
run_one() {
|
||||||
|
local bf="$1"
|
||||||
|
local name; name="$(basename "$bf")"
|
||||||
|
local prompt="${BATCH_PROMPT/__BATCHFILE__/$bf}"
|
||||||
|
cd "$REPO" || return 1
|
||||||
|
timeout 1200 claude -p "$prompt" \
|
||||||
|
--allowedTools "Bash(python3:*),Read,Write,Bash(cat:*),Bash(ls:*)" \
|
||||||
|
--max-turns "$MAX_TURNS" </dev/null >>"$LOG.$name.out" 2>&1
|
||||||
|
local rc=$?  # capture now: the $(date) substitution below would overwrite $?
echo "[$(date '+%H:%M:%S')] done $name (exit $rc)" >>"$LOG"
|
||||||
|
}
|
||||||
|
export -f run_one
|
||||||
|
|
||||||
|
BATCHES=(data/enrichment_batches/batch_*.txt)
|
||||||
|
log "launching ${#BATCHES[@]} batches, $PAR concurrent..."
|
||||||
|
printf '%s\n' "${BATCHES[@]}" | xargs -P "$PAR" -I{} bash -c 'run_one "$@"' _ {}
|
||||||
|
|
||||||
|
# --- 3) summary ------------------------------------------------------------ #
|
||||||
|
if grep -qi "session limit\|usage limit" "$LOG".batch_*.out 2>/dev/null; then
|
||||||
|
log "WINDOW EXHAUSTED (usage limit hit mid-wave) — unfinished keys retry next fire."
|
||||||
|
fi
|
||||||
|
STATUS="$(python3 scripts/enrichment_wave.py --status 2>&1 | grep -E 'good|missing|done')"
|
||||||
|
echo "$STATUS" | tee -a "$LOG"
|
||||||
|
log "=== wave done ==="
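
The quote-escaping rule in the batch prompt can be illustrated in isolation. This is a minimal sketch with a made-up Romanian phrase: it shows why a raw ASCII closing quote breaks a hand-written part file and how serializing through `json.dumps` avoids the problem.

```python
import json

VALUE = 'Jocul „Etichete" pentru grup'  # opening „ is typographic, closing " is ASCII

# Hand-written JSON: the ASCII " after Etichete ends the string early.
BROKEN = '{"name_ro": "Jocul „Etichete" pentru grup"}'

# Serialized JSON: json.dumps escapes the ASCII quote as \" automatically.
SAFE = json.dumps({"name_ro": VALUE}, ensure_ascii=False)


def parses(text: str) -> bool:
    """Return True when text is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses(BROKEN))
print(parses(SAFE))
print(SAFE)
```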
|
||||||
294
scripts/enrichment_wave.py
Normal file
@@ -0,0 +1,294 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
# -*- coding: utf-8 -*-
|
||||||
|
"""
|
||||||
|
enrichment_wave.py — throttled, window-paced wave preparation for the corpus
|
||||||
|
enrichment pipeline.
|
||||||
|
|
||||||
|
The enrichment backlog (~9541 keys) does NOT fit in one 5-hour Anthropic usage
|
||||||
|
window. Launching all remaining batches at once always runs the window to
|
||||||
|
EXHAUSTION (the "subagent completed without calling StructuredOutput" signature),
|
||||||
|
consuming 100% and blocking other work. There is no readable real-time window
|
||||||
|
meter, so pacing must be BLIND: cap each wave to a fixed KEY COUNT (sized to
|
||||||
|
~75% of empirical window capacity, ~950 keys), and let an external scheduler
|
||||||
|
(cron, every 6h) space waves across windows.
|
||||||
|
|
||||||
|
This script encapsulates the reconcile + bounded-wave preparation that used to
|
||||||
|
live as ad-hoc inline Python. It does NOT call the LLM and does NOT launch
|
||||||
|
workflows — it only prepares files on disk and prints what to launch.
|
||||||
|
|
||||||
|
Modes:
|
||||||
|
--status read-only: print done / missing / pct
|
||||||
|
--prepare --keys N --shards K drop corrupt parts; take the FIRST N missing
|
||||||
|
keys (sorted, deterministic); write batch
|
||||||
|
files for ONLY those; regenerate K shard JS
|
||||||
|
files covering exactly those batches; print
|
||||||
|
machine-greppable WAVE:/SHARD: lines.
|
||||||
|
|
||||||
|
Idempotency: a key is "done" iff data/enrichment_parts/<key>.json exists AND
|
||||||
|
parses. Re-running --prepare with the same args is deterministic (same sorted
|
||||||
|
first-N keys), so a re-fire never reshuffles work. Parts on disk are the durable
|
||||||
|
checkpoint.
|
||||||
|
|
||||||
|
Output contract (parsed by the cron wave-runner):
|
||||||
|
WAVE: COMPLETE -> backlog empty; run collect+rebuild
|
||||||
|
WAVE: PREPARED keys=.. batches=.. shards=.. remaining_after=..
|
||||||
|
SHARD: data/enrichment_wf/shard_0.js -> one line per workflow to launch
|
||||||
|
...
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python3 scripts/enrichment_wave.py --status
|
||||||
|
python3 scripts/enrichment_wave.py --prepare --keys 700 --shards 8
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||||
|
REPO_ROOT = SCRIPT_DIR.parent
|
||||||
|
|
||||||
|
PROMPT_SUFFIX = ".prompt.md"
|
||||||
|
PART_SUFFIX = ".json"
|
||||||
|
BATCH_SIZE_DEFAULT = 12
|
||||||
|
KEYS_DEFAULT = 700
|
||||||
|
SHARDS_DEFAULT = 8
|
||||||
|
|
||||||
|
# Resolved relative to REPO_ROOT so the script works from any cwd.
|
||||||
|
DEF_PROMPTS = "data/enrichment_prompts"
|
||||||
|
DEF_PARTS = "data/enrichment_parts"
|
||||||
|
DEF_BATCHES = "data/enrichment_batches"
|
||||||
|
DEF_WF = "data/enrichment_wf"
|
||||||
|
TEMPLATE_NAME = "shard.js.tmpl"
|
||||||
|
|
||||||
|
|
||||||
|
# --------------------------------------------------------------------------- #
|
||||||
|
# Helpers
|
||||||
|
# --------------------------------------------------------------------------- #
|
||||||
|
def _abs(p: str) -> Path:
|
||||||
|
q = Path(p)
|
||||||
|
return q if q.is_absolute() else (REPO_ROOT / q)
|
||||||
|
|
||||||
|
|
||||||
|
def part_ok(path: Path) -> bool:
|
||||||
|
"""A part counts as done iff it parses as a JSON object."""
|
||||||
|
try:
|
||||||
|
return isinstance(json.load(open(path, encoding="utf-8")), dict)
|
||||||
|
except Exception:
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def corrupt_parts(parts_dir: Path) -> list[Path]:
|
||||||
|
return [p for p in parts_dir.glob("*" + PART_SUFFIX) if not part_ok(p)]
|
||||||
|
|
||||||
|
|
||||||
|
def compute_missing(prompts_dir: Path, parts_dir: Path) -> list[str]:
|
||||||
|
"""Keys whose prompt exists but whose part is absent. Sorted = deterministic."""
|
||||||
|
missing = []
|
||||||
|
for pr in prompts_dir.glob("*" + PROMPT_SUFFIX):
|
||||||
|
key = pr.name[: -len(PROMPT_SUFFIX)]
|
||||||
|
if not (parts_dir / (key + PART_SUFFIX)).exists():
|
||||||
|
missing.append(key)
|
||||||
|
return sorted(missing)
|
||||||
|
|
||||||
|
|
||||||
|
def count_done(prompts_dir: Path, parts_dir: Path) -> tuple[int, int]:
|
||||||
|
"""(good_parts_with_prompt, total_prompts)."""
|
||||||
|
total = 0
|
||||||
|
good = 0
|
||||||
|
for pr in prompts_dir.glob("*" + PROMPT_SUFFIX):
|
||||||
|
total += 1
|
||||||
|
key = pr.name[: -len(PROMPT_SUFFIX)]
|
||||||
|
part = parts_dir / (key + PART_SUFFIX)
|
||||||
|
if part.exists() and part_ok(part):
|
||||||
|
good += 1
|
||||||
|
return good, total
|
||||||
|
|
||||||
|
|
||||||
|
def write_batches(keys: list[str], batches_dir: Path, size: int) -> int:
|
||||||
|
"""Replace all batch_*.txt with fresh files of <= size keys. Returns NB."""
|
||||||
|
batches_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
for old in batches_dir.glob("batch_*.txt"):
|
||||||
|
old.unlink()
|
||||||
|
nb = 0
|
||||||
|
for i in range(0, len(keys), size):
|
||||||
|
chunk = keys[i : i + size]
|
||||||
|
(batches_dir / f"batch_{nb:04d}.txt").write_text(
|
||||||
|
"\n".join(chunk) + "\n", encoding="utf-8"
|
||||||
|
)
|
||||||
|
nb += 1
|
||||||
|
return nb
|
||||||
|
|
||||||
|
|
||||||
|
def shard_ranges(nb: int, k: int) -> list[tuple[int, int]]:
|
||||||
|
"""Split [0,nb) into k contiguous, disjoint, total-covering ranges.
|
||||||
|
|
||||||
|
Even distribution: the first (nb % k) shards carry one extra batch. When
|
||||||
|
nb < k the trailing ranges are empty [x,x) and are dropped by the caller.
|
||||||
|
"""
|
||||||
|
if nb <= 0 or k <= 0:
|
||||||
|
return []
|
||||||
|
base, extra = divmod(nb, k)
|
||||||
|
ranges = []
|
||||||
|
start = 0
|
||||||
|
for i in range(k):
|
||||||
|
length = base + (1 if i < extra else 0)
|
||||||
|
ranges.append((start, start + length))
|
||||||
|
start += length
|
||||||
|
return ranges
|
||||||
|
|
||||||
|
|
||||||
|
def render_shard(template: str, shard: int, start: int, end: int, nshards: int) -> str:
|
||||||
|
return (
|
||||||
|
template.replace("__SHARD__", str(shard))
|
||||||
|
.replace("__START__", str(start))
|
||||||
|
.replace("__END__", str(end))
|
||||||
|
.replace("__NSHARDS__", str(nshards))
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def write_shards(ranges: list[tuple[int, int]], template: str, wf_dir: Path) -> list[Path]:
|
||||||
|
"""Delete stale shard_*.js, then write one per NON-EMPTY range. Returns paths."""
|
||||||
|
wf_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
for old in wf_dir.glob("shard_*.js"):
|
||||||
|
old.unlink()
|
||||||
|
non_empty = [(i, s, e) for i, (s, e) in enumerate(ranges) if e > s]
|
||||||
|
nshards = len(non_empty)
|
||||||
|
paths = []
|
||||||
|
# Re-index shards 0..nshards-1 so labels/meta stay contiguous even if some
|
||||||
|
# trailing ranges were empty (tiny final wave with fewer batches than K).
|
||||||
|
for new_idx, (_, s, e) in enumerate(non_empty):
|
||||||
|
path = wf_dir / f"shard_{new_idx}.js"
|
||||||
|
path.write_text(
|
||||||
|
render_shard(template, new_idx, s, e, nshards), encoding="utf-8"
|
||||||
|
)
|
||||||
|
paths.append(path)
|
||||||
|
return paths
|
||||||
|
|
||||||
|
|
||||||
|
def rel(path: Path) -> str:
|
||||||
|
try:
|
||||||
|
return str(path.relative_to(REPO_ROOT))
|
||||||
|
except ValueError:
|
||||||
|
return str(path)
|
||||||
|
|
||||||
|
|
||||||
|
# --------------------------------------------------------------------------- #
# Modes
# --------------------------------------------------------------------------- #
def cmd_status(prompts_dir: Path, parts_dir: Path) -> int:
    good, total = count_done(prompts_dir, parts_dir)
    parts_on_disk = len(list(parts_dir.glob("*" + PART_SUFFIX)))
    bad = len(corrupt_parts(parts_dir))
    missing = total - good
    pct = (100.0 * good / total) if total else 0.0
    print("=== enrichment status ===")
    print(f"prompts (universe) : {total}")
    print(f"parts on disk : {parts_on_disk}")
    print(f"good (done) : {good}")
    print(f"corrupt parts : {bad} (reported only; --prepare drops them)")
    print(f"missing : {missing}")
    print(f"done : {pct:.1f}%")
    if total:
        print(f"WAVE: {'COMPLETE' if missing == 0 else 'PENDING'} missing={missing}")
    return 0


def cmd_prepare(
    prompts_dir: Path,
    parts_dir: Path,
    batches_dir: Path,
    wf_dir: Path,
    keys: int,
    shards: int,
    batch_size: int,
    make_shards: bool = True,
) -> int:
    template = ""
    if make_shards:
        template_path = wf_dir / TEMPLATE_NAME
        if not template_path.is_file():
            print(f"ERROR: missing shard template {rel(template_path)}", file=sys.stderr)
            return 2
        template = template_path.read_text(encoding="utf-8")

    # 1) drop corrupt parts (only mutation to parts/)
    dropped = 0
    for p in corrupt_parts(parts_dir):
        p.unlink()
        dropped += 1

    # 2) compute missing (deterministic)
    missing = compute_missing(prompts_dir, parts_dir)

    # 3) empty -> COMPLETE sentinel, no files written
    if not missing:
        print(f"dropped_corrupt={dropped}")
        print("WAVE: COMPLETE")
        return 0

    # 4) clamp to first N
    take = missing[:keys]

    # 5) batches for ONLY those keys
    nb = write_batches(take, batches_dir, batch_size)

    # 6) shard scripts covering exactly those batches (skipped on the bash path)
    paths = []
    if make_shards:
        ranges = shard_ranges(nb, shards)
        paths = write_shards(ranges, template, wf_dir)

    remaining_after = len(missing) - len(take)
    print(f"dropped_corrupt={dropped}")
    print(
        f"WAVE: PREPARED keys={len(take)} batches={nb} "
        f"shards={len(paths)} remaining_after={remaining_after}"
    )
    for p in paths:
        print(f"SHARD: {rel(p)}")
    return 0


def main(argv=None) -> int:
    ap = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("--status", action="store_true", help="read-only progress report")
    ap.add_argument("--prepare", action="store_true", help="prepare one bounded wave")
    ap.add_argument("--keys", type=int, default=KEYS_DEFAULT, help=f"max keys this wave (default {KEYS_DEFAULT})")
    ap.add_argument("--shards", type=int, default=SHARDS_DEFAULT, help=f"workflow shards (default {SHARDS_DEFAULT})")
    ap.add_argument("--batch-size", type=int, default=BATCH_SIZE_DEFAULT, help=f"keys per batch (default {BATCH_SIZE_DEFAULT})")
    ap.add_argument("--no-shards", action="store_true", help="prepare batch files only; skip shard JS generation (bash/headless path)")
    ap.add_argument("--prompts", default=DEF_PROMPTS)
    ap.add_argument("--parts", default=DEF_PARTS)
    ap.add_argument("--batches", default=DEF_BATCHES)
    ap.add_argument("--wf-dir", default=DEF_WF)
    args = ap.parse_args(argv)

    prompts_dir = _abs(args.prompts)
    parts_dir = _abs(args.parts)
    batches_dir = _abs(args.batches)
    wf_dir = _abs(args.wf_dir)

    if not prompts_dir.is_dir():
        print(f"ERROR: prompts dir not found: {rel(prompts_dir)}", file=sys.stderr)
        return 2
    parts_dir.mkdir(parents=True, exist_ok=True)

    if args.keys < 1 or args.shards < 1 or args.batch_size < 1:
        print("ERROR: --keys/--shards/--batch-size must be >= 1", file=sys.stderr)
        return 2

    if args.prepare:
        return cmd_prepare(
            prompts_dir, parts_dir, batches_dir, wf_dir,
            args.keys, args.shards, args.batch_size,
            make_shards=not args.no_shards,
        )
    # default to status
    return cmd_status(prompts_dir, parts_dir)


if __name__ == "__main__":
    raise SystemExit(main())