Headless cron enrichment system + progress checkpoint at 32%
OS cron fires enrich_wave.sh twice nightly (post 23:00 UTC reset); each wave caps at ~700 keys (~75% window) via enrichment_wave.py --prepare. Fully headless: one claude -p per batch via xargs, flock-guarded, idempotent. DB updated to 9541 activities; .gitignore covers enrichment intermediates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
71
ENRICHMENT_PILOT.md
Normal file
@@ -0,0 +1,71 @@
# Enrichment PILOT: sign-off required before full-corpus scaling

**Date:** 2026-05-29. Pilot covers **34 activities** (the STOP gate from `HANDOFF.md`
step 3, guarding ~6–8k LLM calls across the full corpus).

## Pipeline integrity (all green)

| Hop | Expected | Actual |
|-----|----------|--------|
| prompts emitted | 34 | 34 |
| part files on disk (valid JSON, key matches filename) | 34 | 34 |
| `enrichment.json` entries after `--collect` | 34 | 34 |
| rebuild overlay: `matched` / `orphaned` | 34 / 0 | **34 / 0** |

No leak at any hop. `orphaned 0` confirms the `content_key` the rebuild computes
matches what `run_enrichment` emitted (no dedup rep-selection drift).

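The "part files on disk" hop can be re-checked independently. The sketch below is only an illustration of that check, not the pipeline's own code: it assumes each part file is named `<key>.json` and stores its key under a field called `content_key`. Both the field name and the helper `check_parts` are hypothetical.

```python
import json
from pathlib import Path


def check_parts(parts_dir, key_field="content_key"):
    """Count part files that parse as JSON and whose key matches the filename.

    Assumption: files are named <key>.json and carry the key under `key_field`.
    Returns (valid_count, problems), where problems is a list of (filename, reason).
    """
    valid, problems = 0, []
    for path in sorted(Path(parts_dir).glob("*.json")):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append((path.name, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict) or record.get(key_field) != path.stem:
            problems.append((path.name, "key does not match filename"))
            continue
        valid += 1
    return valid, problems
```

Running `check_parts("data/enrichment_parts")` on the pilot output would be expected to report 34 valid files and an empty problem list if the table above holds.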
## Pilot composition

Deliberately mixed to exercise BOTH operations (corpus is 7076 EN / 2465 RO, so
en→ro translation is the dominant + highest-risk path):

- **26** rows from `teambuilding_corbu` — all Romanian → **ro→ro polish**
- **8** rows from `d3959920_outdoor_games` — all English → **en→ro translation**

Result: ~7 genuine en→ro translations + ~27 ro→ro polish.

## Field population (stated vs estimated)

```
age_group_max    : 0 stated / 30 estimated
age_group_min    : 0 / 34
duration_max     : 3 / 29
duration_min     : 4 / 28
indoor_outdoor   : 12 / 22
participants_max : 0 / 24
participants_min : 4 / 30
space_needed     : 2 / 32
```

Almost everything is estimated: sources rarely state ages/durations explicitly.
The pipeline marks every inferred field in `estimated_fields`, and the UI shows an
`(estimat)` marker, so estimates are transparent to end users.

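A stated/estimated tally like the one above can be derived from `estimated_fields`. The following is a minimal sketch under stated assumptions: each record is a dict, `estimated_fields` is a list of field names, and a field that is absent or `None` counts as unpopulated. The helper `tally_fields` is hypothetical and may not match how the pipeline stores or counts these values.

```python
from collections import Counter

# Field names taken from the tally above.
FIELDS = [
    "age_group_max", "age_group_min", "duration_max", "duration_min",
    "indoor_outdoor", "participants_max", "participants_min", "space_needed",
]


def tally_fields(records, fields=FIELDS):
    """Return {field: (stated_count, estimated_count)} over populated fields."""
    stated, estimated = Counter(), Counter()
    for rec in records:
        # Assumption: estimated_fields is a list of names (or missing/None).
        inferred = set(rec.get("estimated_fields") or [])
        for field in fields:
            if rec.get(field) is None:
                continue  # unpopulated: counted in neither column
            if field in inferred:
                estimated[field] += 1
            else:
                stated[field] += 1
    return {field: (stated[field], estimated[field]) for field in fields}
```

Counting unpopulated fields in neither column would explain why some rows in the tally sum to fewer than 34, though that reading is itself an assumption.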
## What to evaluate (the three sign-off axes)

1. **Translation fidelity (en→ro)**: e.g. *Labels → Etichete*, *Ships in a Fog →
   Nave în ceață*, *Spot the Colours → Găsește culorile*. Game rules preserved,
   no moralizing added, proper terms kept.
2. **Description fidelity / expansion**: ro→ro rows fold in setup/material detail
   that IS in the source chunk (e.g. *Găsește-ți fratele și sora* adds "carton A6"
   + "la semnal, toți încep simultan"; *Ce-mi place?* folds in the character-traits
   discussion). No invented steps observed.
3. **Estimation plausibility**: mostly reasonable. **Weak spots to judge:** a few
   age ranges are very wide/defaulted (e.g. *Găsește-ți fratele și sora* → age
   10–99). If wide age defaults are unacceptable, tighten the ENRICHMENT_PROMPT
   guidance before scaling.

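To judge the third axis, it may help to list every activity whose age range is suspiciously wide. The sketch below assumes the `activities` table has `age_group_min` and `age_group_max` columns (the field names appear in this report, but their presence as columns is an assumption), and the 40-year threshold is an arbitrary illustration. `wide_age_ranges` is a hypothetical helper.

```python
import sqlite3


def wide_age_ranges(conn, max_span=40):
    """Return (name, age_min, age_max) rows whose age span exceeds max_span.

    Assumption: `activities` has age_group_min / age_group_max columns.
    """
    query = (
        "SELECT name, age_group_min, age_group_max FROM activities "
        "WHERE age_group_min IS NOT NULL AND age_group_max IS NOT NULL "
        "AND age_group_max - age_group_min > ? "
        "ORDER BY name"
    )
    return conn.execute(query, (max_span,)).fetchall()
```

For example, `wide_age_ranges(sqlite3.connect("data/activities.db"))` would surface a row such as age 10–99 if the columns exist as assumed.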
## Inspect the data yourself

```bash
sqlite3 data/activities.db "select name, name_ro, language, indoor_outdoor, space_needed, estimated_fields from activities where name_ro is not null;"
# raw overlay: data/enrichment.json (34 entries)
# per-activity parts: data/enrichment_parts/*.json
```

## After sign-off (do NOT auto-proceed)

Scale in waves of ~8–16 Sonnet subagents over the rest of the corpus
(`run_enrichment.py` is additive and resumable: it skips already-enriched keys),
then run `--collect`, then the final `build_database.py --rebuild --enrichment`.
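The "additive + resumable" behaviour can be pictured as follows. This is a sketch of the general idea only, not the actual logic of `run_enrichment.py`: `plan_wave`, `wave_cap`, and `batch_size` are hypothetical names introduced for illustration.

```python
def plan_wave(all_keys, enriched_keys, wave_cap, batch_size):
    """Split the not-yet-enriched keys into batches for one wave.

    Keys already present in enriched_keys are skipped, so re-running after a
    partial wave only picks up the remaining work.
    """
    done = set(enriched_keys)
    # Preserve corpus order and cap the wave size.
    pending = [key for key in all_keys if key not in done][:wave_cap]
    return [pending[i:i + batch_size] for i in range(0, len(pending), batch_size)]
```

Because the plan is computed from whatever is already enriched, calling it again after a wave completes yields only the remaining keys, which is what makes an interrupted run safe to repeat.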