game-library/HANDOFF.md
Claude Agent bcfb6841eb Faza 1 complete: bilingual+enrichment plumbing, UI/filters, frozen DB
Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await
re-extraction). DB rebuilt and frozen at 9418 activities — content_keys
are now stable for the enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor,
  space_needed, estimated_fields, source_id/source_ids/chunk_key columns;
  FTS5 indexes the 4 *_ro columns across CREATE + all 3 triggers; new equality
  filters + category counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*,
  is_estimated, axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers
  (None on unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns,
  RO-primary rendering with "(estimat)" markers and collapsible original
  text, and a /source/<id> download route shipped DARK behind
  SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster
  unions source_ids without touching enrichment fields.

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on
  content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect),
  ENRICHMENT_PROMPT.md.

Repair: scripts/repair_extractions.py fixes the subagents' systematic
unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never
truncates) + schema validation + a strictly-more-text guard. json_repair was
tried first, truncated silently, and is NOT used. build_database has no repair
dependency.

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 18:10:13 +00:00

# HANDOFF — Faza 1 extraction DONE, code landed, DB frozen; next = re-extract 7 + enrichment pilot
**Snapshot:** 2026-05-29. Executing plan `enumerated-petting-badger.md` (bilingual
index + enrichment + new filters + source download).
## Where we are
| Step (plan Part C) | Status |
|--------------------|--------|
| 1. Finish extraction | **DONE**: 575/588 chunks extracted & valid; 13 missing (see below) |
| 2. Land code Part A1-A4 (model/schema/merge) | **DONE & committed** |
| 2b. Code Part A5-A8 (UI/search/download) | **DONE & committed** |
| 2c. Code Part B2-B4 (enrichment pipeline) | **DONE & committed** |
| 3. Freeze rebuild (freezes content_keys) | **DONE**: `data/activities.db` = **9418 activities** |
| Part D tests | **DONE**: `tests/test_enrichment.py`, 99 pass total |
| 4. Enrichment pilot → **STOP for user sign-off** | **NOT STARTED** (this is the gate) |
| 5. Final rebuild `--enrichment` | not started |
Everything is committed except whatever this session leaves dirty. `data/extracted/*.json`
is gitignored (575 files on disk, durable across /clear).
## The 13 missing chunks (out of 588)
**6 content-filter-blocked** (Anthropic safety; accept as missing — marginal loss):
- `87850302_dragon_sleepdeprived.part73 / .part85 / .part94` (camp song lyrics)
- `c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96`
**7 need RE-EXTRACTION** (their malformed-original JSON was destroyed — see "json_repair
incident" below; re-extract once the subagent session limit resets, ~5pm UTC):
```
3f9c8232_teambuilding_corbu_29092023.part01
5f959f85_scoli_fara_bullying.part02
83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04
d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05
e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03
```
Re-extract these with Sonnet subagents, one Agent call per chunk (the per-chunk prompt is at
`data/chunks/_prompts/<key>.prompt.md`), then **re-run the freeze rebuild** so they join
the corpus before enrichment. Re-freezing is safe now: enrichment has NOT run, so no
overlay keys depend on the current freeze yet.
## The json_repair incident (important — root cause + what was fixed)
Subagents **systematically emit unescaped ASCII `"` inside string values** (Romanian
text like `„Unu"` uses a closing `"` that terminates the JSON string early). ~34 files
were affected.
First repair attempt used the `json_repair` lib. **It truncates**: on a stray quote it
ends the string and reinterprets the trailing text as a new key, silently dropping the
rest of the value and injecting garbage keys. Schema `additionalProperties:false` caught
the garbage-key cases (8 files dropped at rebuild), but the truncation that didn't create
an extra key slipped through. Applying json_repair output to disk also **overwrote the
malformed originals** for those 8 files, so the originals are lost. One was recovered; the
remaining 7 need re-extraction.
**Fix:** `scripts/repair_extractions.py` was rewritten to use a faithful char-scanner
(`escape_stray_quotes`) that **escapes** stray quotes (`\"`) instead of splitting on them,
validates against the real schema, and only replaces a valid top-level file when the
repaired version carries **strictly more text** (a length guard that catches truncated
json_repair output while leaving genuine extractions untouched). Re-running it cleanly
repaired the affected files; the final freeze had **0 schema-rejected, 0 invalid**.
`json_repair` is no longer used anywhere. Do NOT reintroduce it.
`build_database.py` does NOT depend on the repair script (the "DB regenerable from
data/extracted/" invariant holds — plain `json.loads` only).
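
The quote-escaping repair and the length guard described in this section can be sketched as follows. This is a minimal illustration, not the code in `scripts/repair_extractions.py`: the closing-quote heuristic (a quote closes a string only when followed by `,`, `}`, `]`, `:` or end of input), the `text_length` measure, and the `should_replace` helper are all assumptions made for the example.

```python
import json

# Characters that may legally follow a closing quote (after optional whitespace).
_CLOSERS = {",", "}", "]", ":"}


def escape_stray_quotes(raw: str) -> str:
    """Escape double quotes that appear inside a JSON string value.

    Assumed heuristic: while inside a string, a quote closes it only if the
    next non-whitespace character is one of , } ] : or the input ends.
    Any other quote is treated as stray and escaped, never split on.
    """
    out = []
    in_string = False
    i, n = 0, len(raw)
    while i < n:
        ch = raw[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
        elif ch == "\\" and i + 1 < n:
            # Copy an existing escape sequence untouched.
            out.append(ch)
            out.append(raw[i + 1])
            i += 1
        elif ch == '"':
            j = i + 1
            while j < n and raw[j] in " \t\r\n":
                j += 1
            if j == n or raw[j] in _CLOSERS:
                in_string = False
                out.append(ch)
            else:
                out.append('\\"')  # stray quote: keep the text, escape the quote
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def text_length(obj) -> int:
    """Total characters across all string values of a parsed JSON document."""
    if isinstance(obj, str):
        return len(obj)
    if isinstance(obj, dict):
        return sum(text_length(v) for v in obj.values())
    if isinstance(obj, list):
        return sum(text_length(v) for v in obj)
    return 0


def should_replace(current_raw: str, repaired_raw: str) -> bool:
    """Allow overwriting only when the repair is valid and loses no text."""
    repaired = json.loads(repaired_raw)  # raises if the repair is still invalid
    try:
        current = json.loads(current_raw)
    except json.JSONDecodeError:
        return True  # the file on disk does not parse at all
    # The file on disk parses: replace only if the repair has strictly more text.
    return text_length(repaired) > text_length(current)
```

A stray quote that happens to be followed by a comma (for example `"he said "hi", then left"`) would still be misread as a closer by this heuristic, which is why schema validation and the length guard remain necessary as separate checks.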
## What the code does now (all committed)
**Part A — plumbing (corpus-independent):**
- `app/models/database.py`: new columns `name_ro/description_ro/rules_ro/variations_ro,
indoor_outdoor, space_needed, estimated_fields(JSON), source_id, source_ids(JSON),
chunk_key`; FTS5 indexes the 4 `*_ro` columns (CREATE + all 3 triggers — kept in sync);
indexes on `indoor_outdoor`/`space_needed`; `search_activities` gained `indoor_outdoor`
and `space_needed` equality kwargs; `_update_category_counts` feeds both new axes into
the categories table so dropdowns populate.
- `app/models/activity.py`: new fields + `to_dict`/`from_dict`; helpers `get_display_name`
/ `get_display_description` / `get_display_rules` / `get_display_variations`
(RO-primary, EN fallback), `has_translation`, `is_estimated(field)`,
`get_indoor_outdoor_display`, `get_space_needed_display`.
- `app/config_taxonomy.py`: `INDOOR_OUTDOOR`, `SPACE_NEEDED` enums + RO labels +
`normalize_indoor_outdoor` / `normalize_space_needed` (None on unrecognised, no
fallback — never fabricate a value) + display-name helpers.
- `scripts/build_database.py`: `dict_to_activity` sets `source_id`+`chunk_key`;
`merge_cluster` unions `source_ids` and carries rep's `source_id`/`chunk_key` but
**never** touches enrichment fields (those are applied post-dedup).
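
The "None on unrecognised, never fabricate" rule for the normalizers can be illustrated with a short sketch. The enum members and the alias table below are hypothetical, since this document does not list the actual vocabulary in `app/config_taxonomy.py`; only the function name and the return-`None` behaviour come from the notes above.

```python
from typing import Optional

# Hypothetical vocabulary; the real INDOOR_OUTDOOR members are not listed here.
INDOOR_OUTDOOR = ("indoor", "outdoor", "both")

# Hypothetical aliases mapped onto the fixed vocabulary.
_ALIASES = {"inside": "indoor", "outside": "outdoor", "either": "both"}


def normalize_indoor_outdoor(value) -> Optional[str]:
    """Map free-form input onto the enum, or return None if it does not fit.

    There is deliberately no default: an unrecognised value stays unknown
    rather than being coerced into a plausible-looking category.
    """
    if not isinstance(value, str):
        return None
    key = value.strip().lower()
    key = _ALIASES.get(key, key)
    return key if key in INDOOR_OUTDOOR else None
```

Returning `None` keeps the equality filters honest: an activity with an unknown axis simply does not match any dropdown value instead of showing up under a guessed one.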
**Part A — UI/search:**
- `app/services/search.py`: `_map_filters_to_db_fields` maps `indoor_outdoor`/
`space_needed` to DB equality filters.
- `app/web/routes.py`: new `/source/<id>` download route — **shipped DARK behind
`SOURCE_DOWNLOAD_ENABLED` (default false; copyright exposure, user opts in)**; resolves
`source_file` under `CORPUS_DIR` via `send_from_directory` (traversal-safe, 404s for
web-mirror sources). `DISPLAY_NAMES` extended with both new axes.
- `app/config.py`: `SOURCE_DOWNLOAD_ENABLED`, `CORPUS_DIR`.
- Templates: `index.html`/`results.html` have the 2 new dropdowns; cards use display
helpers + `(estimat)` markers; `activity.html` is RO-primary with a collapsible
"Text original" section, indoor/space cards, estimat markers, and the download link
(only when the flag is on). `main.css` has `.estimated` / `.original-text` styles.
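
The checks behind the dark-shipped download route can be sketched with the standard library alone. The real route relies on Flask's `send_from_directory`, which does its own containment check; the helper below, `resolve_source_path`, is a hypothetical name that only illustrates the three conditions (flag on, not a web-mirror source, path stays under `CORPUS_DIR`). Treating URL-shaped `source_file` values as web mirrors is an assumption.

```python
from pathlib import Path
from typing import Optional


def resolve_source_path(
    corpus_dir: str, source_file: Optional[str], enabled: bool
) -> Optional[Path]:
    """Return the file to serve, or None when the route should answer 404."""
    if not enabled or not source_file:
        return None  # feature flag off, or the activity has no source file
    if source_file.startswith(("http://", "https://")):
        return None  # assumed marker of a web-mirror source: nothing local to serve
    root = Path(corpus_dir).resolve()
    candidate = (root / source_file).resolve()
    # Reject anything that escapes the corpus directory (../ or absolute paths).
    if not candidate.is_relative_to(root):  # requires Python 3.9+
        return None
    return candidate if candidate.is_file() else None
```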
**Part B — enrichment pipeline (built, not yet run):**
- `scripts/build_database.py`: `load_enrichment` + `apply_enrichment(activities, enrichment)`
applied **right after** `apply_review_decisions`, on the post-dedup list, keyed on
`import_common.content_key(normalized_name, language, _normalize_text(description))`
(reused verbatim). CLI `--enrichment` (default `data/enrichment.json`). QA report prints
`enrichment {entries, matched, orphaned}` + per-field **stated vs estimated** counts.
Translated/expanded text is NOT re-validated against source (by design).
- `scripts/run_enrichment.py`: reads the rebuilt DB, computes each row's content_key,
skips rows already in `data/enrichment_parts/<key>.json` (resumable), emits one prompt
per activity to `data/enrichment_prompts/` (current EN fields + source chunk text via
`find_chunk_text`). Pilot scoping: `--source <substr>` and/or `--limit N`. `--collect`
merges parts → `data/enrichment.json`.
- `scripts/ENRICHMENT_PROMPT.md`: single-pass rules — translate faithfully, expand
`description_ro` ONLY from chunk text, mark inferred filter fields in `estimated_fields`,
fixed enum vocab, output `data/enrichment_parts/<content_key>.json` including `content_key`.
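
The overlay step and its `{entries, matched, orphaned}` report can be sketched as below. This is an illustration under assumptions: activities are plain dicts with pre-normalized fields, `content_key` here is a hypothetical stand-in for `import_common.content_key` (its real hashing scheme is not described in this document), and `ENRICHMENT_FIELDS` is assembled from the column names listed above.

```python
import hashlib


def content_key(normalized_name: str, language: str, normalized_description: str) -> str:
    """Hypothetical stand-in for import_common.content_key."""
    raw = "\x1f".join((normalized_name, language, normalized_description))
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


ENRICHMENT_FIELDS = (
    "name_ro", "description_ro", "rules_ro", "variations_ro",
    "indoor_outdoor", "space_needed", "estimated_fields",
)


def apply_enrichment(activities: list, enrichment: dict) -> dict:
    """Overlay enrichment entries onto post-dedup activities, in place.

    `enrichment` maps content_key -> dict of enrichment fields.
    Returns the counts used in the QA report.
    """
    matched = set()
    for act in activities:
        key = content_key(
            act["normalized_name"], act["language"], act["normalized_description"]
        )
        entry = enrichment.get(key)
        if entry is None:
            continue  # no enrichment for this activity yet
        matched.add(key)
        for field in ENRICHMENT_FIELDS:
            if entry.get(field) is not None:  # never overwrite with a missing value
                act[field] = entry[field]
    return {
        "entries": len(enrichment),
        "matched": len(matched),
        "orphaned": len(enrichment) - len(matched),
    }
```

A non-zero `orphaned` count after a re-freeze would indicate that some content_keys changed between the freeze used for enrichment and the current rebuild, which is why the keys need to be stable before enrichment runs.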
## Exact next steps
1. **Re-extract the 7 chunks** above (after session-limit reset). Verify each writes valid
JSON (`python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"`).
If any come back malformed, `python3 scripts/repair_extractions.py --apply` (faithful now).
2. **Re-freeze:** `python3 scripts/build_database.py --rebuild` — confirm 0 schema-rejected,
note the new total (~9418 + the 7 chunks' activities).
3. **Enrichment PILOT** (plan B5: the STOP gate guarding 6-8k LLM calls):
- Pick one source, e.g. `python3 scripts/run_enrichment.py --source teambuilding_corbu`
(or `--limit 30`). This writes prompts to `data/enrichment_prompts/`.
- Launch a small wave of Sonnet subagents on those prompts (each writes
`data/enrichment_parts/<key>.json`).
- `python3 scripts/run_enrichment.py --collect` → `data/enrichment.json`.
- `python3 scripts/build_database.py --rebuild` (picks up `--enrichment` by default).
- **STOP. Hand the user translation-quality + estimation-plausibility + description-
fidelity samples and get sign-off BEFORE scaling to the full corpus.** Do not
auto-proceed past this gate.
4. After sign-off: scale enrichment in waves of ~8-16 Sonnet subagents, run `--collect`,
then the final `--rebuild --enrichment`.
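
For the JSON validity check in step 1, a slightly longer variant that names the failing files may be easier to act on than the one-liner, which stops at the first error without saying which file caused it. The sketch below is an assumption-level helper (`find_invalid_json` is a hypothetical name, not an existing script) that uses plain `json.loads`, in line with the build's no-repair invariant.

```python
import json
from pathlib import Path


def find_invalid_json(directory: str) -> list:
    """Return (filename, error message) for every *.json file that fails to parse."""
    bad = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            bad.append((path.name, str(exc)))
    return bad


if __name__ == "__main__":
    failures = find_invalid_json("data/extracted")
    for name, err in failures:
        print(f"{name}: {err}")
    print(f"{len(failures)} invalid file(s)")
```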
## Verify / run
- Tests: `python3 -m pytest tests/ -q` → 99 pass.
- App: `SOURCE_DOWNLOAD_ENABLED` is false by default (download link hidden). Set it true
only if the user accepts the copyright exposure of serving original files.
- `data/activities.db.bak` is the backup taken before this freeze.