Claude Agent bcfb6841eb Faza 1 complete: bilingual+enrichment plumbing, UI/filters, frozen DB

Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await
re-extraction). DB rebuilt and frozen at 9418 activities — content_keys
are now stable for the enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor,
  space_needed, estimated_fields, source_id/source_ids/chunk_key columns;
  FTS5 indexes the 4 *_ro columns across CREATE + all 3 triggers; new equality
  filters + category counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*,
  is_estimated, axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers
  (None on unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns,
  RO-primary rendering with "(estimat)" markers and collapsible original
  text, and a /source/<id> download route shipped DARK behind
  SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster
  unions source_ids without touching enrichment fields.

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on
  content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect),
  ENRICHMENT_PROMPT.md.

Repair: scripts/repair_extractions.py fixes the subagents' systematic
unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never
truncates) + schema validation + a strictly-more-text guard. json_repair was
tried first, truncated silently, and is NOT used. build_database has no repair
dependency.

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-05-29 18:10:13 +00:00

HANDOFF — Faza 1 extraction DONE, code landed, DB frozen; next = re-extract 7 + enrichment pilot

Snapshot: 2026-05-29. Executing plan enumerated-petting-badger.md (bilingual index + enrichment + new filters + source download).

Where we are

Step (plan Part C)	Status
1. Finish extraction	DONE — 575/588 chunks extracted & valid; 13 missing (see below)
2. Land code Part A1–A4 (model/schema/merge)	DONE & committed
2b. Code Part A5–A8 (UI/search/download)	DONE & committed
2c. Code Part B2–B4 (enrichment pipeline)	DONE & committed
3. Freeze rebuild (freezes content_keys)	DONE — `data/activities.db` = 9418 activities
Part D tests	DONE — `tests/test_enrichment.py`, 99 pass total
4. Enrichment pilot → STOP for user sign-off	NOT STARTED (this is the gate)
5. Final rebuild `--enrichment`	not started

Everything is committed except whatever this session leaves dirty. data/extracted/*.json is gitignored (575 files on disk, durable across /clear).

The 13 missing chunks (out of 588)

6 content-filter-blocked (Anthropic safety; accept as missing — marginal loss):

87850302_dragon_sleepdeprived.part73 / .part85 / .part94 (camp song lyrics)
c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96

7 need RE-EXTRACTION (their malformed-original JSON was destroyed — see "json_repair incident" below; re-extract once the subagent session limit resets, ~5pm UTC):

3f9c8232_teambuilding_corbu_29092023.part01
5f959f85_scoli_fara_bullying.part02
83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04
d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05
e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03

Re-extract these (Sonnet subagents, one Agent call each, the per-chunk prompt is at data/chunks/_prompts/<key>.prompt.md), then re-run the freeze rebuild so they join the corpus before enrichment. Re-freezing is safe now — enrichment has NOT run, so no overlay keys depend on the current freeze yet.

The json_repair incident (important — root cause + what was fixed)

Subagents systematically emit unescaped ASCII " inside string values (Romanian text like „Unu" uses a closing " that terminates the JSON string early). ~34 files were affected.

First repair attempt used the json_repair lib. It truncates: on a stray quote it ends the string and reinterprets the trailing text as a new key, silently dropping the rest of the value and injecting garbage keys. Schema additionalProperties:false caught the garbage-key cases (8 files dropped at rebuild), but truncations that did not create an extra key slipped through. Writing the json_repair output to disk also overwrote the malformed originals for those 8 files, so the originals were lost; those files (now 7, one recovered) need re-extraction.

Fix: scripts/repair_extractions.py was rewritten to use a faithful char-scanner (escape_stray_quotes) that escapes stray quotes (\") instead of splitting on them, validates against the real schema, and only replaces a valid top-level file when the repaired version carries strictly more text (a length guard that catches truncated json_repair output while leaving genuine extractions untouched). Re-running it cleanly repaired the affected files; the final freeze had 0 schema-rejected, 0 invalid. json_repair is no longer used anywhere. Do NOT reintroduce it.
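The escape-instead-of-split idea and the strictly-more-text guard can be sketched as follows. This is a minimal illustration, not the code in scripts/repair_extractions.py: the look-ahead rule (a quote closes a string only when the next non-whitespace character is `,`, `}`, `]`, `:` or end of input) and the helpers `total_text_len` / `carries_more_text` are assumptions made for the example.

```python
import json


def escape_stray_quotes(raw: str) -> str:
    """Escape ASCII quotes that appear inside JSON string values.

    Assumed heuristic: inside a string, a quote is a genuine terminator only
    if the next non-whitespace character is structural (, } ] :) or the input
    ends. Any other quote is treated as stray and escaped, never split on.
    """
    out = []
    in_string = False
    i, n = 0, len(raw)
    while i < n:
        ch = raw[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
            i += 1
        elif ch == "\\" and i + 1 < n:
            # Already-escaped pair: copy both characters untouched.
            out.append(raw[i:i + 2])
            i += 2
        elif ch == '"':
            j = i + 1
            while j < n and raw[j] in " \t\r\n":
                j += 1
            if j == n or raw[j] in ",}]:":
                in_string = False
                out.append(ch)
            else:
                out.append('\\"')  # stray quote: keep the text, escape it
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def total_text_len(node) -> int:
    """Sum the lengths of all string values in a parsed JSON document."""
    if isinstance(node, str):
        return len(node)
    if isinstance(node, dict):
        return sum(total_text_len(v) for v in node.values())
    if isinstance(node, list):
        return sum(total_text_len(v) for v in node)
    return 0


def carries_more_text(repaired, existing) -> bool:
    """Length guard: only a strictly longer repair may replace a valid file."""
    return total_text_len(repaired) > total_text_len(existing)


raw = '{"name": "Jocul „Unu" si doi", "n": 1}'
repaired = json.loads(escape_stray_quotes(raw))
truncated = {"name": "Jocul „Unu", "n": 1}  # shape of a silently truncated value
print(repaired["name"], carries_more_text(repaired, truncated))
```

A heuristic like this still misreads a stray quote that happens to be followed by a comma or colon, which is one reason the real script also validates against the schema before writing anything.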

build_database.py does NOT depend on the repair script (the "DB regenerable from data/extracted/" invariant holds — plain json.loads only).

What the code does now (all committed)

Part A — plumbing (corpus-independent):

app/models/database.py: new columns name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor, space_needed, estimated_fields(JSON), source_id, source_ids(JSON), chunk_key; FTS5 indexes the 4 *_ro columns (CREATE + all 3 triggers — kept in sync); indexes on indoor_outdoor/space_needed; search_activities gained indoor_outdoor and space_needed equality kwargs; _update_category_counts feeds both new axes into the categories table so dropdowns populate.
app/models/activity.py: new fields + to_dict/from_dict; helpers get_display_name / get_display_description / get_display_rules / get_display_variations (RO-primary, EN fallback), has_translation, is_estimated(field), get_indoor_outdoor_display, get_space_needed_display.
app/config_taxonomy.py: INDOOR_OUTDOOR, SPACE_NEEDED enums + RO labels + normalize_indoor_outdoor / normalize_space_needed (None on unrecognised, no fallback — never fabricate a value) + display-name helpers.
scripts/build_database.py: dict_to_activity sets source_id+chunk_key; merge_cluster unions source_ids and carries rep's source_id/chunk_key but never touches enrichment fields (those are applied post-dedup).
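Two of the Part A behaviours above, "None on unrecognised" normalization and RO-primary display with EN fallback, can be sketched as below. The vocabulary, the alias table and the free function `display_text` are hypothetical stand-ins; the real enum members and helpers live in app/config_taxonomy.py and app/models/activity.py and may differ.

```python
from typing import Optional

# Hypothetical vocabulary and aliases, for illustration only.
INDOOR_OUTDOOR = {"indoor", "outdoor", "both"}
_ALIASES = {"inside": "indoor", "outside": "outdoor"}


def normalize_indoor_outdoor(value: Optional[str]) -> Optional[str]:
    """Return a canonical enum value, or None when the input is not recognised.

    There is deliberately no default: an unknown value stays unknown rather
    than being mapped to a fabricated category.
    """
    if not isinstance(value, str):
        return None
    cleaned = value.strip().lower()
    cleaned = _ALIASES.get(cleaned, cleaned)
    return cleaned if cleaned in INDOOR_OUTDOOR else None


def display_text(original: str, translated_ro: Optional[str]) -> str:
    """RO-primary display: use the Romanian text when present, else the original."""
    if translated_ro and translated_ro.strip():
        return translated_ro.strip()
    return original


print(normalize_indoor_outdoor(" Outside "))  # canonical value via alias
print(normalize_indoor_outdoor("garden"))     # None: not in the vocabulary
print(display_text("Tag", None))              # falls back to the original
```

Returning None instead of a fallback keeps the filter dropdowns honest: an activity with an unrecognised value simply does not appear under any bucket.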

Part A — UI/search:

app/services/search.py: _map_filters_to_db_fields maps indoor_outdoor/space_needed to DB equality filters.
app/web/routes.py: new /source/<id> download route — shipped DARK behind SOURCE_DOWNLOAD_ENABLED (default false; copyright exposure, user opts in); resolves source_file under CORPUS_DIR via send_from_directory (traversal-safe, 404s for web-mirror sources). DISPLAY_NAMES extended with both new axes.
app/config.py: SOURCE_DOWNLOAD_ENABLED, CORPUS_DIR.
Templates: index.html/results.html have the 2 new dropdowns; cards use display helpers + (estimat) markers; activity.html is RO-primary with a collapsible "Text original" section, indoor/space cards, estimat markers, and the download link (only when the flag is on). main.css has .estimated / .original-text styles.
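The dark-shipped download route described above might look roughly like this in Flask. It is a sketch under assumptions: the `SOURCE_FILES` lookup, the `demo`/`mirror` ids and the choice of 404 while the flag is off are hypothetical, and the real route in app/web/routes.py resolves source_file from the database.

```python
import tempfile

from flask import Flask, abort, send_from_directory

app = Flask(__name__)
app.config["SOURCE_DOWNLOAD_ENABLED"] = False   # dark by default
app.config["CORPUS_DIR"] = tempfile.mkdtemp()   # placeholder corpus location

# Hypothetical lookup: source id -> file name relative to CORPUS_DIR.
# None stands for a web-mirror source that has no local file.
SOURCE_FILES = {"demo": "demo_source.txt", "mirror": None}


@app.route("/source/<source_id>")
def download_source(source_id):
    # Feature flag off: behave as if the route did not exist (assumed choice).
    if not app.config["SOURCE_DOWNLOAD_ENABLED"]:
        abort(404)
    source_file = SOURCE_FILES.get(source_id)
    if not source_file:
        abort(404)  # unknown id or web-mirror source
    # send_from_directory rejects paths that escape CORPUS_DIR and
    # answers 404 when the file is missing.
    return send_from_directory(
        app.config["CORPUS_DIR"], source_file, as_attachment=True
    )
```

Keeping the flag check first means the route leaks nothing about which source ids exist while the feature is disabled.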

Part B — enrichment pipeline (built, not yet run):

scripts/build_database.py: load_enrichment + apply_enrichment(activities, enrichment) applied right after apply_review_decisions, on the post-dedup list, keyed on import_common.content_key(normalized_name, language, _normalize_text(description)) (reused verbatim). CLI --enrichment (default data/enrichment.json). QA report prints enrichment {entries, matched, orphaned} + per-field stated vs estimated counts. Translated/expanded text is NOT re-validated against source (by design).
scripts/run_enrichment.py: reads the rebuilt DB, computes each row's content_key, skips rows already in data/enrichment_parts/<key>.json (resumable), emits one prompt per activity to data/enrichment_prompts/ (current EN fields + source chunk text via find_chunk_text). Pilot scoping: --source <substr> and/or --limit N. --collect merges parts → data/enrichment.json.
scripts/ENRICHMENT_PROMPT.md: single-pass rules — translate faithfully, expand description_ro ONLY from chunk text, mark inferred filter fields in estimated_fields, fixed enum vocab, output data/enrichment_parts/<content_key>.json including content_key.
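The overlay step can be pictured with the sketch below: enrichment entries are matched to post-dedup activities by content_key, and the report counts entries, matched, orphaned, and stated vs estimated per field. The `content_key` and `_normalize_text` bodies here are stand-ins (the real ones live in import_common and build_database.py), and activities are plain dicts purely for illustration.

```python
import hashlib
from collections import Counter


def content_key(normalized_name, language, normalized_description):
    """Stand-in for import_common.content_key; the real scheme may differ."""
    payload = "\x1f".join((normalized_name, language, normalized_description))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def _normalize_text(text):
    """Stand-in normalizer: collapse whitespace and lowercase."""
    return " ".join(text.split()).lower()


def apply_enrichment(activities, enrichment):
    """Overlay enrichment entries onto activities in place and return a QA report."""
    matched = set()
    provenance = Counter()
    for act in activities:
        key = content_key(
            act["normalized_name"],
            act["language"],
            _normalize_text(act["description"]),
        )
        entry = enrichment.get(key)
        if entry is None:
            continue
        matched.add(key)
        estimated = set(entry.get("estimated_fields", []))
        for field, value in entry.items():
            if field in ("content_key", "estimated_fields"):
                continue
            act[field] = value
            kind = "estimated" if field in estimated else "stated"
            provenance[(field, kind)] += 1
        act["estimated_fields"] = sorted(estimated)
    return {
        "entries": len(enrichment),
        "matched": len(matched),
        "orphaned": len(enrichment) - len(matched),
        "provenance": dict(provenance),
    }
```

An orphaned entry is one whose key no longer matches any activity, which is why the freeze has to be stable before enrichment is produced at scale.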

Exact next steps

1. Re-extract the 7 chunks above (after session-limit reset). Verify each writes valid JSON (python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"). If any come back malformed, python3 scripts/repair_extractions.py --apply (faithful now).
2. Re-freeze: python3 scripts/build_database.py --rebuild, then confirm 0 schema-rejected and note the new total (~9418 + the 7 chunks' activities).
3. Enrichment PILOT (plan B5, the STOP gate guarding 6–8k LLM calls):
   - Pick one source, e.g. python3 scripts/run_enrichment.py --source teambuilding_corbu (or --limit 30). This writes prompts to data/enrichment_prompts/.
   - Launch a small wave of Sonnet subagents on those prompts (each writes data/enrichment_parts/<key>.json).
   - python3 scripts/run_enrichment.py --collect → data/enrichment.json.
   - python3 scripts/build_database.py --rebuild (picks up --enrichment by default).
   - STOP. Hand the user translation-quality + estimation-plausibility + description-fidelity samples and get sign-off BEFORE scaling to the full corpus. Do not auto-proceed past this gate.
4. After sign-off: scale enrichment in waves of ~8–16 Sonnet subagents, --collect, final --rebuild --enrichment.
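For the JSON check in the first step, a slightly longer variant of the one-liner reports every failing file instead of stopping at the first exception. This is an illustrative sketch; `find_invalid_json` is a hypothetical helper, not part of the repository.

```python
import glob
import json
import os


def find_invalid_json(directory):
    """Return (path, error message) for every *.json file that fails to parse."""
    failures = []
    for path in sorted(glob.glob(os.path.join(directory, "*.json"))):
        try:
            with open(path, encoding="utf-8") as handle:
                json.load(handle)
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            failures.append((path, str(exc)))
    return failures


if __name__ == "__main__":
    for path, error in find_invalid_json("data/extracted"):
        print(f"{path}: {error}")
```

An empty result means every file parses with plain json, which is the same bar build_database.py applies.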

Verify / run

Tests: python3 -m pytest tests/ -q → 99 pass.
App: SOURCE_DOWNLOAD_ENABLED is false by default (download link hidden). Set it true only if the user accepts the copyright exposure of serving original files.
data/activities.db.bak is the pre-this-freeze backup.
