Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await re-extraction). DB rebuilt and frozen at 9418 activities; content_keys are now stable for the enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor, space_needed, estimated_fields, source_id/source_ids/chunk_key columns; FTS5 indexes the 4 *_ro columns across CREATE + all 3 triggers; new equality filters + category counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*, is_estimated, axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers (None on unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns, RO-primary rendering with "(estimat)" markers and collapsible original text, and a /source/<id> download route shipped DARK behind SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster unions source_ids without touching enrichment fields.

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect), ENRICHMENT_PROMPT.md.

Repair: scripts/repair_extractions.py fixes the subagents' systematic unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never truncates) + schema validation + a strictly-more-text guard. json_repair was tried first, truncated silently, and is NOT used. build_database has no repair dependency.

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
HANDOFF — Faza 1 extraction DONE, code landed, DB frozen; next = re-extract 7 + enrichment pilot
Snapshot: 2026-05-29. Executing plan enumerated-petting-badger.md (bilingual
index + enrichment + new filters + source download).
Where we are
| Step (plan Part C) | Status |
|---|---|
| 1. Finish extraction | DONE — 575/588 chunks extracted & valid; 13 missing (see below) |
| 2. Land code Part A1–A4 (model/schema/merge) | DONE & committed |
| 2b. Code Part A5–A8 (UI/search/download) | DONE & committed |
| 2c. Code Part B2–B4 (enrichment pipeline) | DONE & committed |
| 3. Freeze rebuild (freezes content_keys) | DONE — data/activities.db = 9418 activities |
| Part D tests | DONE — tests/test_enrichment.py, 99 pass total |
| 4. Enrichment pilot → STOP for user sign-off | NOT STARTED (this is the gate) |
| 5. Final rebuild --enrichment | NOT STARTED |
Everything is committed except whatever this session leaves dirty. data/extracted/*.json
is gitignored (575 files on disk, durable across /clear).
The 13 missing chunks (out of 588)
6 content-filter-blocked (Anthropic safety; accept as missing — marginal loss):
87850302_dragon_sleepdeprived.part73 / .part85 / .part94 (camp song lyrics)
c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96
7 need RE-EXTRACTION (their malformed-original JSON was destroyed — see "json_repair incident" below; re-extract once the subagent session limit resets, ~5pm UTC):
3f9c8232_teambuilding_corbu_29092023.part01
5f959f85_scoli_fara_bullying.part02
83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04
d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05
e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03
Re-extract these (Sonnet subagents, one Agent call each, the per-chunk prompt is at
data/chunks/_prompts/<key>.prompt.md), then re-run the freeze rebuild so they join
the corpus before enrichment. Re-freezing is safe now — enrichment has NOT run, so no
overlay keys depend on the current freeze yet.
The json_repair incident (important — root cause + what was fixed)
Subagents systematically emit unescaped ASCII " inside string values (Romanian
text like „Unu" uses a closing " that terminates the JSON string early). ~34 files
were affected.
First repair attempt used the json_repair lib. It truncates: on a stray quote it
ends the string and reinterprets the trailing text as a new key, silently dropping the
rest of the value and injecting garbage keys. Schema additionalProperties:false caught
the garbage-key cases (8 files dropped at rebuild), but the truncation that didn't create
an extra key slipped through. Applying json_repair output to disk also overwrote the
malformed originals for those 8 → originals lost → those (now 7, one recovered) need
re-extraction.
Fix: scripts/repair_extractions.py was rewritten to use a faithful char-scanner
(escape_stray_quotes) that escapes stray quotes (\") instead of splitting on them,
validates against the real schema, and only replaces a valid top-level file when the
repaired version carries strictly more text (a length guard that catches truncated
json_repair output while leaving genuine extractions untouched). Re-running it cleanly
repaired the affected files; the final freeze had 0 schema-rejected, 0 invalid.
json_repair is no longer used anywhere. Do NOT reintroduce it.
build_database.py does NOT depend on the repair script (the "DB regenerable from
data/extracted/" invariant holds — plain json.loads only).
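As a rough illustration of the repair idea described above, here is a minimal sketch of a quote-escaping scanner plus a "strictly more text" guard. It is not the code in scripts/repair_extractions.py: the closing-quote heuristic (a quote closes the string only when the next non-whitespace character is `,`, `}`, `]`, `:` or end of input) and the helper names `total_text` and `should_replace` are assumptions made for this example.

```python
import json


def escape_stray_quotes(text: str) -> str:
    """Escape ASCII quotes that appear inside JSON string values.

    Assumed heuristic: inside a string, a quote is the real terminator only
    if the next non-whitespace character is structural (, } ] :) or the
    input ends. Any other quote is escaped, so no text is ever dropped.
    """
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
            i += 1
        elif ch == "\\" and i + 1 < n:
            # Copy an existing escape sequence through unchanged.
            out.append(ch)
            out.append(text[i + 1])
            i += 2
        elif ch == '"':
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if j >= n or text[j] in ",}]:":
                in_string = False
                out.append(ch)
            else:
                out.append('\\"')  # stray quote: escape, never split
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def total_text(value) -> int:
    """Total length of all string values in a parsed JSON document."""
    if isinstance(value, str):
        return len(value)
    if isinstance(value, dict):
        return sum(total_text(v) for v in value.values())
    if isinstance(value, list):
        return sum(total_text(v) for v in value)
    return 0


def should_replace(current, repaired) -> bool:
    """Length guard: overwrite only if the repair carries strictly more text."""
    return total_text(repaired) > total_text(current)


if __name__ == "__main__":
    broken = '{"name": "Jocul „Unu" și doi", "n": 1}'
    print(json.loads(escape_stray_quotes(broken)))
```

The heuristic is deliberately conservative but not complete: a stray quote that happens to be followed by a comma (for example `„Unu", apoi`) would still be read as a terminator, which is why the sketch pairs the scanner with schema validation and the length guard rather than trusting it alone.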
What the code does now (all committed)
Part A — plumbing (corpus-independent):
- app/models/database.py: new columns name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor, space_needed, estimated_fields (JSON), source_id, source_ids (JSON), chunk_key; FTS5 indexes the 4 *_ro columns (CREATE + all 3 triggers, kept in sync); indexes on indoor_outdoor/space_needed; search_activities gained indoor_outdoor and space_needed equality kwargs; _update_category_counts feeds both new axes into the categories table so dropdowns populate.
- app/models/activity.py: new fields + to_dict/from_dict; helpers get_display_name/get_display_description/get_display_rules/get_display_variations (RO-primary, EN fallback), has_translation, is_estimated(field), get_indoor_outdoor_display, get_space_needed_display.
- app/config_taxonomy.py: INDOOR_OUTDOOR, SPACE_NEEDED enums + RO labels + normalize_indoor_outdoor/normalize_space_needed (None on unrecognised, no fallback; never fabricate a value) + display-name helpers.
- scripts/build_database.py: dict_to_activity sets source_id + chunk_key; merge_cluster unions source_ids and carries rep's source_id/chunk_key but never touches enrichment fields (those are applied post-dedup).
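The "None on unrecognised, never fabricate" rule for the normalizers can be sketched as follows. The enum members and the alias table are hypothetical placeholders chosen for the example; the real vocabulary lives in app/config_taxonomy.py and may differ.

```python
from typing import Optional

# Hypothetical enum values and aliases, for illustration only.
INDOOR_OUTDOOR = ("indoor", "outdoor", "both")
_INDOOR_OUTDOOR_ALIASES = {
    "inside": "indoor",
    "interior": "indoor",
    "outside": "outdoor",
    "exterior": "outdoor",
    "either": "both",
}


def normalize_indoor_outdoor(raw) -> Optional[str]:
    """Map a raw value onto the fixed vocabulary, or return None.

    There is intentionally no default: an unrecognised value stays
    unknown instead of being coerced to a plausible-looking member.
    """
    if not isinstance(raw, str):
        return None
    key = raw.strip().lower()
    if key in INDOOR_OUTDOOR:
        return key
    return _INDOOR_OUTDOOR_ALIASES.get(key)
```

Returning None rather than a fallback keeps the equality filters honest: an activity with an unknown axis simply does not match either dropdown value.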
Part A — UI/search:
- app/services/search.py: _map_filters_to_db_fields maps indoor_outdoor/space_needed to DB equality filters.
- app/web/routes.py: new /source/<id> download route, shipped DARK behind SOURCE_DOWNLOAD_ENABLED (default false; copyright exposure, user opts in); resolves source_file under CORPUS_DIR via send_from_directory (traversal-safe, 404s for web-mirror sources). DISPLAY_NAMES extended with both new axes.
- app/config.py: SOURCE_DOWNLOAD_ENABLED, CORPUS_DIR.
- Templates: index.html/results.html have the 2 new dropdowns; cards use display helpers + (estimat) markers; activity.html is RO-primary with a collapsible "Text original" section, indoor/space cards, estimat markers, and the download link (only when the flag is on). main.css has .estimated/.original-text styles.
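To make the two guards on the download route concrete (feature flag first, then containment under the corpus directory), here is a standard-library sketch. It does not use Flask and is not the route's code: the actual route relies on send_from_directory for the traversal check, and `resolve_source_file` is a hypothetical helper that only illustrates the same decision order.

```python
from pathlib import Path
from typing import Optional


def resolve_source_file(corpus_dir: str, source_file: str, enabled: bool) -> Optional[Path]:
    """Return the file to serve, or None (which the caller would turn into a 404).

    Hypothetical helper: flag off, path escaping corpus_dir, or file
    missing on disk all produce None.
    """
    if not enabled:
        return None  # shipped dark: behave as if the route did not exist
    root = Path(corpus_dir).resolve()
    candidate = (root / source_file).resolve()
    try:
        candidate.relative_to(root)  # raises ValueError if outside root
    except ValueError:
        return None
    return candidate if candidate.is_file() else None
```

Sources that have no local file (the web-mirror case mentioned above) fall out naturally through the final is_file check.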
Part B — enrichment pipeline (built, not yet run):
- scripts/build_database.py: load_enrichment + apply_enrichment(activities, enrichment) applied right after apply_review_decisions, on the post-dedup list, keyed on import_common.content_key(normalized_name, language, _normalize_text(description)) (reused verbatim). CLI --enrichment (default data/enrichment.json). QA report prints enrichment {entries, matched, orphaned} + per-field stated vs estimated counts. Translated/expanded text is NOT re-validated against source (by design).
- scripts/run_enrichment.py: reads the rebuilt DB, computes each row's content_key, skips rows already in data/enrichment_parts/<key>.json (resumable), emits one prompt per activity to data/enrichment_prompts/ (current EN fields + source chunk text via find_chunk_text). Pilot scoping: --source <substr> and/or --limit N. --collect merges parts → data/enrichment.json.
- scripts/ENRICHMENT_PROMPT.md: single-pass rules: translate faithfully, expand description_ro ONLY from chunk text, mark inferred filter fields in estimated_fields, fixed enum vocab, output data/enrichment_parts/<content_key>.json including content_key.
Exact next steps
- Re-extract the 7 chunks above (after session-limit reset). Verify each writes valid JSON (python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"). If any come back malformed, python3 scripts/repair_extractions.py --apply (faithful now).
- Re-freeze: python3 scripts/build_database.py --rebuild; confirm 0 schema-rejected, note the new total (~9418 + the 7 chunks' activities).
- Enrichment PILOT (plan B5, the STOP gate guarding 6–8k LLM calls):
  - Pick one source, e.g. python3 scripts/run_enrichment.py --source teambuilding_corbu (or --limit 30). This writes prompts to data/enrichment_prompts/.
  - Launch a small wave of Sonnet subagents on those prompts (each writes data/enrichment_parts/<key>.json).
  - python3 scripts/run_enrichment.py --collect → data/enrichment.json.
  - python3 scripts/build_database.py --rebuild (picks up --enrichment by default).
  - STOP. Hand the user translation-quality + estimation-plausibility + description-fidelity samples and get sign-off BEFORE scaling to the full corpus. Do not auto-proceed past this gate.
- After sign-off: scale enrichment in waves of ~8–16 Sonnet subagents, --collect, final --rebuild --enrichment.
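The verification one-liner in the first step stops at the first bad file without naming it. A slightly longer sketch that lists every file failing plain JSON parsing may be handier after a re-extraction wave; `find_invalid_json` is a hypothetical helper name, and it checks syntax only, not the extraction schema.

```python
import glob
import json


def find_invalid_json(pattern: str = "data/extracted/*.json") -> list:
    """Return (path, error message) for every file that does not parse as JSON."""
    bad = []
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path, encoding="utf-8") as fh:
                json.load(fh)
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            bad.append((path, str(exc)))
    return bad


if __name__ == "__main__":
    for path, error in find_invalid_json():
        print(f"{path}: {error}")
```

An empty result means every extracted file loads with plain json parsing, which is the same condition build_database.py relies on.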
Verify / run
- Tests: python3 -m pytest tests/ -q → 99 pass.
- App: SOURCE_DOWNLOAD_ENABLED is false by default (download link hidden). Set it true only if the user accepts the copyright exposure of serving original files.
- data/activities.db.bak is the backup taken before this freeze.