Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await re-extraction).
DB rebuilt and frozen at 9418 activities; content_keys are now stable for the
enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor,
  space_needed, estimated_fields, source_id/source_ids/chunk_key columns; FTS5 indexes
  the 4 *_ro columns across CREATE + all 3 triggers; new equality filters + category
  counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*, is_estimated,
  axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers (None on
  unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns, RO-primary
  rendering with "(estimat)" markers and collapsible original text, and a /source/<id>
  download route shipped DARK behind SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster unions
  source_ids without touching enrichment fields.

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on
  content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect),
  ENRICHMENT_PROMPT.md.

Repair: scripts/repair_extractions.py fixes the subagents' systematic
unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never truncates) +
schema validation + a strictly-more-text guard. json_repair was tried first, truncated
silently, and is NOT used. build_database has no repair dependency.

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

# HANDOFF: Phase 1 extraction DONE, code landed, DB frozen; next = re-extract 7 + enrichment pilot

**Snapshot:** 2026-05-29. Executing plan `enumerated-petting-badger.md` (bilingual
index + enrichment + new filters + source download).

## Where we are

| Step (plan Part C) | Status |
|--------------------|--------|
| 1. Finish extraction | **DONE**: 575/588 chunks extracted & valid; 13 missing (see below) |
| 2. Land code Part A1–A4 (model/schema/merge) | **DONE & committed** |
| 2b. Code Part A5–A8 (UI/search/download) | **DONE & committed** |
| 2c. Code Part B2–B4 (enrichment pipeline) | **DONE & committed** |
| 3. Freeze rebuild (freezes content_keys) | **DONE**: `data/activities.db` = **9418 activities** |
| Part D tests | **DONE**: `tests/test_enrichment.py`, 99 pass total |
| 4. Enrichment pilot → **STOP for user sign-off** | **NOT STARTED** (this is the gate) |
| 5. Final rebuild `--enrichment` | not started |

Everything is committed except whatever this session leaves dirty. `data/extracted/*.json`
is gitignored (575 files on disk, durable across /clear).

## The 13 missing chunks (out of 588)

**6 content-filter-blocked** (Anthropic safety; accept as missing, the loss is marginal):
- `87850302_dragon_sleepdeprived.part73 / .part85 / .part94` (camp song lyrics)
- `c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96`

**7 need RE-EXTRACTION** (their malformed-original JSON was destroyed; see "The json_repair
incident" below. Re-extract once the subagent session limit resets, ~5pm UTC):
```
3f9c8232_teambuilding_corbu_29092023.part01
5f959f85_scoli_fara_bullying.part02
83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04
d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05
e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03
```
Re-extract these (Sonnet subagents, one Agent call each; the per-chunk prompt is at
`data/chunks/_prompts/<key>.prompt.md`), then **re-run the freeze rebuild** so they join
the corpus before enrichment. Re-freezing is safe now: enrichment has NOT run, so no
overlay keys depend on the current freeze yet.

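The missing list can be re-derived instead of trusted from this snapshot by diffing prompt files against extracted files. The following is a minimal sketch, assuming each extraction is written as `data/extracted/<key>.json`; only the prompt path comes from this document, while the output naming and the helper name `missing_chunk_keys` are assumptions.

```python
from pathlib import Path

PROMPT_SUFFIX = ".prompt.md"


def missing_chunk_keys(prompts_dir, extracted_dir):
    """Return sorted chunk keys that have a prompt file but no extracted JSON."""
    # Keys contain dots (".part01"), so strip the known suffix by length
    # instead of relying on Path.stem, which removes only the final suffix.
    prompts = {
        p.name[: -len(PROMPT_SUFFIX)]
        for p in Path(prompts_dir).glob("*" + PROMPT_SUFFIX)
    }
    # For "<key>.json", Path.stem yields exactly "<key>".
    extracted = {p.stem for p in Path(extracted_dir).glob("*.json")}
    return sorted(prompts - extracted)
```

Calling `missing_chunk_keys("data/chunks/_prompts", "data/extracted")` should list both the content-filter-blocked chunks and the ones awaiting re-extraction, since neither group has an output file.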
## The json_repair incident (important: root cause + what was fixed)

Subagents **systematically emit unescaped ASCII `"` inside string values** (Romanian
text like `„Unu"` uses a closing `"` that terminates the JSON string early). ~34 files
were affected.

First repair attempt used the `json_repair` lib. **It truncates**: on a stray quote it
ends the string and reinterprets the trailing text as a new key, silently dropping the
rest of the value and injecting garbage keys. Schema `additionalProperties:false` caught
the garbage-key cases (8 files dropped at rebuild), but truncations that did not create
an extra key slipped through. Applying json_repair output to disk also **overwrote the
malformed originals** for those 8, so the originals are lost; those chunks (now 7, one
recovered) need re-extraction.

**Fix:** `scripts/repair_extractions.py` was rewritten to use a faithful char-scanner
(`escape_stray_quotes`) that **escapes** stray quotes (`\"`) instead of splitting on them,
validates against the real schema, and only replaces a valid top-level file when the
repaired version carries **strictly more text** (a length guard that catches truncated
json_repair output while leaving genuine extractions untouched). Re-running it cleanly
repaired the affected files; the final freeze had **0 schema-rejected, 0 invalid**.
`json_repair` is no longer used anywhere. Do NOT reintroduce it.

`build_database.py` does NOT depend on the repair script (the "DB regenerable from
data/extracted/" invariant holds; it uses plain `json.loads` only).

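To make the escape-versus-truncate distinction concrete, here is a minimal sketch of a quote-escaping scanner plus a strictly-more-text guard. It is not the code in `scripts/repair_extractions.py`: the lookahead heuristic, the `_sketch` function names and the way "text" is measured are all assumptions chosen for illustration.

```python
import json

# Characters that may legally follow a closing quote, after optional whitespace.
_AFTER_CLOSE = set(",:}]")


def escape_stray_quotes_sketch(text):
    """Escape quotes inside JSON strings that do not look like terminators.

    Assumed heuristic: inside a string, a quote closes the string only if the
    next non-whitespace character is one of , : } ] or the end of input.
    """
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
        elif ch == "\\" and i + 1 < n:
            out.append(text[i:i + 2])  # copy an existing escape pair untouched
            i += 1
        elif ch == '"':
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if j == n or text[j] in _AFTER_CLOSE:
                in_string = False
                out.append(ch)
            else:
                out.append('\\"')  # stray quote: escape it, keep all the text
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def text_length(value):
    """Total characters across all string values of a parsed JSON document."""
    if isinstance(value, str):
        return len(value)
    if isinstance(value, dict):
        return sum(text_length(v) for v in value.values())
    if isinstance(value, list):
        return sum(text_length(v) for v in value)
    return 0


def should_replace(current, repaired):
    """Overwrite an already-valid file only if the repair carries strictly more text."""
    return text_length(repaired) > text_length(current)


broken = '{"name": "„Unu" doi", "n": 1}'
faithful = json.loads(escape_stray_quotes_sketch(broken))
truncated = {"name": "„Unu", "n": 1}  # what a truncating repair would leave behind
```

With these values, `should_replace(truncated, faithful)` is true and the reverse is false, which is the property the length guard relies on. The heuristic still misreads a stray quote that happens to be followed by a comma or colon, so schema validation remains necessary after repair.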
## What the code does now (all committed)

**Part A (plumbing, corpus-independent):**
- `app/models/database.py`: new columns `name_ro/description_ro/rules_ro/variations_ro,
  indoor_outdoor, space_needed, estimated_fields(JSON), source_id, source_ids(JSON),
  chunk_key`; FTS5 indexes the 4 `*_ro` columns (CREATE + all 3 triggers, kept in sync);
  indexes on `indoor_outdoor`/`space_needed`; `search_activities` gained `indoor_outdoor`
  and `space_needed` equality kwargs; `_update_category_counts` feeds both new axes into
  the categories table so dropdowns populate.
- `app/models/activity.py`: new fields + `to_dict`/`from_dict`; helpers `get_display_name`
  / `get_display_description` / `get_display_rules` / `get_display_variations`
  (RO-primary, EN fallback), `has_translation`, `is_estimated(field)`,
  `get_indoor_outdoor_display`, `get_space_needed_display`.
- `app/config_taxonomy.py`: `INDOOR_OUTDOOR`, `SPACE_NEEDED` enums + RO labels +
  `normalize_indoor_outdoor` / `normalize_space_needed` (None on unrecognised, no
  fallback, never fabricate a value) + display-name helpers.
- `scripts/build_database.py`: `dict_to_activity` sets `source_id`+`chunk_key`;
  `merge_cluster` unions `source_ids` and carries rep's `source_id`/`chunk_key` but
  **never** touches enrichment fields (those are applied post-dedup).

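The "None on unrecognised" normalizer and the RO-primary display helpers described above can be sketched as follows. This is an illustration only: the enum values shown are hypothetical (the real vocabulary lives in `app/config_taxonomy.py`), and the `Activity` class here is a cut-down stand-in for the model in `app/models/activity.py`.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical vocabulary, used only to make the sketch runnable.
INDOOR_OUTDOOR = {"indoor", "outdoor", "both"}


def normalize_indoor_outdoor(raw: Optional[str]) -> Optional[str]:
    """Return a canonical enum value, or None when the input is not recognised."""
    if not raw:
        return None
    value = raw.strip().lower()
    # No fallback value: an unknown label stays unknown rather than being guessed.
    return value if value in INDOOR_OUTDOOR else None


@dataclass
class Activity:
    name: str
    name_ro: Optional[str] = None
    indoor_outdoor: Optional[str] = None
    estimated_fields: List[str] = field(default_factory=list)

    def get_display_name(self) -> str:
        # RO-primary; fall back to the original text when no translation exists.
        return self.name_ro or self.name

    def is_estimated(self, field_name: str) -> bool:
        return field_name in self.estimated_fields
```

Returning `None` rather than a default keeps the filter dropdowns honest: an activity with an unrecognised label simply does not match any equality filter.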
**Part A (UI/search):**
- `app/services/search.py`: `_map_filters_to_db_fields` maps `indoor_outdoor`/
  `space_needed` to DB equality filters.
- `app/web/routes.py`: new `/source/<id>` download route, **shipped DARK behind
  `SOURCE_DOWNLOAD_ENABLED` (default false; copyright exposure, user opts in)**; resolves
  `source_file` under `CORPUS_DIR` via `send_from_directory` (traversal-safe, 404s for
  web-mirror sources). `DISPLAY_NAMES` extended with both new axes.
- `app/config.py`: `SOURCE_DOWNLOAD_ENABLED`, `CORPUS_DIR`.
- Templates: `index.html`/`results.html` have the 2 new dropdowns; cards use display
  helpers + `(estimat)` markers; `activity.html` is RO-primary with a collapsible
  "Text original" section, indoor/space cards, estimat markers, and the download link
  (only when the flag is on). `main.css` has `.estimated` / `.original-text` styles.

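The gating logic of the download route can be illustrated without Flask. The sketch below swaps `send_from_directory` for an explicit `pathlib` containment check so that it stays standard-library only; the function name `resolve_source_path` and its return convention (None means "answer 404") are assumptions, not the code in `app/web/routes.py`.

```python
from pathlib import Path
from typing import Optional


def resolve_source_path(source_file: Optional[str], corpus_dir: str,
                        enabled: bool) -> Optional[Path]:
    """Return the file to serve, or None when the caller should answer 404."""
    if not enabled or not source_file:
        return None  # feature flag off, or the activity has no local source
    root = Path(corpus_dir).resolve()
    candidate = (root / source_file).resolve()
    try:
        # Raises ValueError when the resolved path escapes the corpus directory.
        candidate.relative_to(root)
    except ValueError:
        return None
    # Web-mirror sources have no file on disk, so they fall through to None.
    return candidate if candidate.is_file() else None
```

Checking the flag first means a disabled deployment never touches the filesystem for this route, which matches the "shipped dark" intent.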
**Part B (enrichment pipeline; built, not yet run):**
- `scripts/build_database.py`: `load_enrichment` + `apply_enrichment(activities, enrichment)`
  applied **right after** `apply_review_decisions`, on the post-dedup list, keyed on
  `import_common.content_key(normalized_name, language, _normalize_text(description))`
  (reused verbatim). CLI `--enrichment` (default `data/enrichment.json`). QA report prints
  `enrichment {entries, matched, orphaned}` + per-field **stated vs estimated** counts.
  Translated/expanded text is NOT re-validated against source (by design).
- `scripts/run_enrichment.py`: reads the rebuilt DB, computes each row's content_key,
  skips rows already in `data/enrichment_parts/<key>.json` (resumable), emits one prompt
  per activity to `data/enrichment_prompts/` (current EN fields + source chunk text via
  `find_chunk_text`). Pilot scoping: `--source <substr>` and/or `--limit N`. `--collect`
  merges parts → `data/enrichment.json`.
- `scripts/ENRICHMENT_PROMPT.md`: single-pass rules: translate faithfully, expand
  `description_ro` ONLY from chunk text, mark inferred filter fields in `estimated_fields`,
  use the fixed enum vocab, and write output to `data/enrichment_parts/<content_key>.json`
  including `content_key`.

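The overlay step is easiest to reason about as a dictionary lookup per activity plus a tally of what matched. The sketch below assumes activities are plain dicts and takes the key function as a parameter; `content_key_stub` is a stand-in hash, not `import_common.content_key`, and whether the real overlay overwrites existing values is not stated in this document.

```python
import hashlib

ENRICHMENT_FIELDS = (
    "name_ro", "description_ro", "rules_ro", "variations_ro",
    "indoor_outdoor", "space_needed", "estimated_fields",
)


def content_key_stub(normalized_name, language, normalized_description):
    """Stand-in key for illustration; the real key has its own definition."""
    raw = "\x1f".join((normalized_name, language, normalized_description))
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def apply_enrichment_sketch(activities, enrichment, key_of):
    """Overlay enrichment entries onto post-dedup activities, in place.

    activities: list of dicts; enrichment: {content_key: {field: value}};
    key_of: callable returning the content_key of one activity.
    Returns the {entries, matched, orphaned} counts for the QA report.
    """
    matched = set()
    for activity in activities:
        key = key_of(activity)
        entry = enrichment.get(key)
        if entry is None:
            continue
        matched.add(key)
        for name in ENRICHMENT_FIELDS:
            if name in entry:
                activity[name] = entry[name]
    return {
        "entries": len(enrichment),
        "matched": len(matched),
        "orphaned": len(set(enrichment) - matched),
    }
```

A non-zero `orphaned` count after a re-freeze is the signal that content_keys moved underneath the overlay, which is why the freeze has to happen before enrichment runs.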
## Exact next steps

1. **Re-extract the 7 chunks** above (after session-limit reset). Verify each writes valid
   JSON (`python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"`).
   If any come back malformed, `python3 scripts/repair_extractions.py --apply` (faithful now).
2. **Re-freeze:** `python3 scripts/build_database.py --rebuild`; confirm 0 schema-rejected,
   note the new total (~9418 + the 7 chunks' activities).
3. **Enrichment PILOT** (plan B5, the STOP gate guarding 6–8k LLM calls):
   - Pick one source, e.g. `python3 scripts/run_enrichment.py --source teambuilding_corbu`
     (or `--limit 30`). This writes prompts to `data/enrichment_prompts/`.
   - Launch a small wave of Sonnet subagents on those prompts (each writes
     `data/enrichment_parts/<key>.json`).
   - `python3 scripts/run_enrichment.py --collect` → `data/enrichment.json`.
   - `python3 scripts/build_database.py --rebuild` (picks up `--enrichment` by default).
   - **STOP. Hand the user translation-quality + estimation-plausibility +
     description-fidelity samples and get sign-off BEFORE scaling to the full corpus.**
     Do not auto-proceed past this gate.
4. After sign-off: scale enrichment in waves of ~8–16 Sonnet subagents, `--collect`,
   final `--rebuild --enrichment`.

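The one-liner in step 1 raises on the first bad file and the resulting `JSONDecodeError` does not say which file it came from. A slightly longer check that reports every failing path may be more convenient after re-extraction; this is a sketch, and the helper name `find_invalid_json` is an assumption.

```python
import glob
import json


def find_invalid_json(pattern="data/extracted/*.json"):
    """Return (path, error message) for every matching file json.loads rejects."""
    failures = []
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path, encoding="utf-8") as handle:
                json.loads(handle.read())
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            failures.append((path, str(exc)))
    return failures
```

An empty return value means every file parses; any entries are the candidates for `scripts/repair_extractions.py --apply`.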
## Verify / run

- Tests: `python3 -m pytest tests/ -q` → 99 pass.
- App: `SOURCE_DOWNLOAD_ENABLED` is false by default (download link hidden). Set it true
  only if the user accepts the copyright exposure of serving original files.
- `data/activities.db.bak` is the backup taken before this freeze.