Faza 1 complete: bilingual+enrichment plumbing, UI/filters, frozen DB

Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await re-extraction). DB rebuilt and frozen at 9418 activities; content_keys are now stable for the enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor, space_needed, estimated_fields, source_id/source_ids/chunk_key columns; FTS5 indexes the 4 *_ro columns across CREATE + all 3 triggers; new equality filters + category counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*, is_estimated, axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers (None on unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns, RO-primary rendering with "(estimat)" markers and collapsible original text, and a /source/<id> download route shipped DARK behind SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster unions source_ids without touching enrichment fields.

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect), ENRICHMENT_PROMPT.md.

Repair: scripts/repair_extractions.py fixes the subagents' systematic unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never truncates) + schema validation + a strictly-more-text guard. json_repair was tried first, truncated silently, and is NOT used. build_database has no repair dependency.

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 18:10:13 +00:00
parent 46d9592a55
commit bcfb6841eb
18 changed files with 1579 additions and 167 deletions
--- a/HANDOFF.md
+++ b/HANDOFF.md
@@ -1,135 +1,143 @@
-# HANDOFF — Faza 1 extraction in progress
+# HANDOFF — Faza 1 extraction DONE, code landed, DB frozen; next = re-extract 7 + enrichment pilot

-**Last updated:** 2026-05-20, commit `3d9f266` (pilot complete) + uncommitted Faza 1 work.
+**Snapshot:** 2026-05-29. Executing plan `enumerated-petting-badger.md` (bilingual
+index + enrichment + new filters + source download).

-## State of play
+## Where we are

-Faza 0 (pilot) is **complete and committed**. Faza 1 (full corpus) is **in progress at 10.9%**.
+| Step (plan Part C) | Status |
+|--------------------|--------|
+| 1. Finish extraction | **DONE** — 575/588 chunks extracted & valid; 13 missing (see below) |
+| 2. Land code Part A1–A4 (model/schema/merge) | **DONE & committed** |
+| 2b. Code Part A5–A8 (UI/search/download) | **DONE & committed** |
+| 2c. Code Part B2–B4 (enrichment pipeline) | **DONE & committed** |
+| 3. Freeze rebuild (freezes content_keys) | **DONE** — `data/activities.db` = **9418 activities** |
+| Part D tests | **DONE** — `tests/test_enrichment.py`, 99 pass total |
+| 4. Enrichment pilot → **STOP for user sign-off** | **NOT STARTED** (this is the gate) |
+| 5. Final rebuild `--enrichment` | not started |

-| Phase | Status | DB rows | Tests |
-|-------|--------|---------|-------|
-| Faza 0 pilot (5 files) | committed (`3d9f266`) | 1751 in `data/activities.db` | 71 passed |
-| Faza 1 extraction | **64/588 chunks done**, 1949 activities in `data/extracted/*.json` (not yet imported to DB) | — | — |
+Everything is committed except whatever this session leaves dirty. `data/extracted/*.json`
+is gitignored (575 files on disk, durable across /clear).

-## What "Faza 1" is
+## The 13 missing chunks (out of 588)

-Process the full 96-source corpus (was 116 raw files; some are duplicates/empty zips/skipped junk) through the LLM-subagent extraction pipeline. Same code path as the pilot. Two large mirror dirs dominate the chunk count:
-
- `87850302_dragon_sleepdeprived` — 116 chunks (full dragon.sleepdeprived.ca mirror)
- `c3162825_resource_pack__learning_by_playing_catalunya_...` — 97 chunks (the catalunya mirror)
-
-Combined they are 213/588 = 36% of all remaining chunks.
-
-## Critical recommendation for the next session
-
-**Use Sonnet 4.6 for subagent extractions, not Opus.** Opus was used through chunks 1-64 and burned through a 5-hour rate-limit window faster than this scale needs. Sonnet has 200K context which is plenty for the ~25k-token prompt + ~20k-token output of a single chunk extraction. The task is structured JSON extraction with a fixed schema — no complex reasoning needed.
-
-The Agent tool's `model` parameter takes `"sonnet"`. Pass it on every Agent call below.
-
-## Resuming — exact mechanical steps
-
-1. **Verify state.**
-   ```bash
-   cd /workspace/game-library
-   ls data/extracted/*.json | wc -l       # should be 64 (or higher if more done)
-   ls data/chunks/_prompts/ | wc -l       # 588 — the full prompt set
-   git log --oneline -3                   # 3d9f266 must be HEAD or earlier
-   ```
-
-2. **Find what's still pending.** Compare prompts to extracted files:
-   ```bash
-   ls data/chunks/_prompts/ | sed 's/\.prompt\.md$//' | sort > /tmp/all.txt
-   ls data/extracted/*.json 2>/dev/null | sed 's|.*/||;s|\.json$||' | sort > /tmp/done.txt
-   comm -23 /tmp/all.txt /tmp/done.txt > /tmp/pending.txt
-   wc -l /tmp/pending.txt                 # how many remain
-   head /tmp/pending.txt                  # what's next
-   ```
-
-3. **Launch waves of 16 subagents in parallel.** One Agent call per chunk. Single message with 16 Agent tool calls. Use this exact template (substitute `<CHUNK_KEY>`):
-
-   ```
-   Agent(
-     description: "Extract <CHUNK_KEY>",
-     subagent_type: "general-purpose",
-     model: "sonnet",                  ← critical
-     prompt: "Working directory: /workspace/game-library. Extraction subagent — read data/chunks/_prompts/<CHUNK_KEY>.prompt.md and follow it EXACTLY. Apply rules from scripts/SUBAGENT_PROMPT.md and schema from scripts/activity_schema.json. Write the JSON. Set language per chunk content ('ro' or 'en'). Report under 40 words."
-   )
-   ```
-
-   The per-chunk prompt file is fully self-contained — it points to the right chunk, sets source_id/source_hash/chunk_key, and references the rules + schema. The subagent just follows it.
-
-4. **After every wave**, briefly check progress and continue:
-   ```bash
-   ls data/extracted/*.json | wc -l
-   ```
-   Repeat step 3 with the next 16 pending chunks. If an agent reports `"You've hit your limit · resets ..."` AND `tool_uses: 5` with `total_tokens: 0`, check whether the JSON was written anyway — agents often persist the file before the limit hit. Only re-launch if the JSON is missing.
-
-5. **When all 588 chunks are done**, finalize:
-   ```bash
-   python3 scripts/validate_extractions.py    # any chunks marked rejected go to data/extracted/_reextract/
-   # re-extract any rejected chunks (same template, prompt from _reextract/)
-   python3 scripts/build_database.py --rebuild
-   # if many borderline needs_review rows:
-   python3 -c "
-   import sys; sys.path.insert(0,'scripts')
-   from import_common import content_key, normalize_name
-   import sqlite3, json
-   conn = sqlite3.connect('data/activities.db')
-   conn.row_factory = sqlite3.Row
-   rows = list(conn.execute('SELECT name, normalized_name, language, description FROM activities WHERE needs_review=1'))
-   d = {content_key(r['normalized_name'] or normalize_name(r['name']), r['language'], r['description'] or ''): 'merge' for r in rows}
-   json.dump(d, open('data/review_decisions.json','w'), indent=2)
-   print(f'{len(d)} merge decisions')
-   "
-   python3 scripts/build_database.py --rebuild   # apply decisions
-   python3 -m pytest tests/ -q                   # 71 should pass
-   git add data/activities.db data/review_decisions.json
-   git commit -m "Faza 1: full corpus extraction"
-   ```
-
-## Code reference — what each script does
-
- `scripts/normalize_sources.py --corpus data/carti-camp-jocuri --out data/sources` → produces 96 `data/sources/<id>.txt` files with `--- PAGE N ---` markers. **Done. Don't re-run.**
- `scripts/chunk_sources.py --sources data/sources --chunks data/chunks` → splits each into ~20pg chunks with 4pg overlap, writes `data/chunks/<id>/<id>.partNN.txt` and updates `data/chunks/manifest.json`. **Done. Don't re-run unless sources change.**
- `scripts/run_extraction.py` → regenerates the per-chunk prompts in `data/chunks/_prompts/`. **Done. Don't re-run unless schema/prompt changes.**
- `scripts/SUBAGENT_PROMPT.md` — extraction rules (what subagents follow).
- `scripts/activity_schema.json` — JSON schema each extraction must validate against.
- `scripts/validate_extractions.py` — per-file schema check + fuzzy `source_excerpt` substring check; writes re-extraction prompts to `data/extracted/_reextract/` for rejected chunks; marks chunks `rejected` in manifest.
- `scripts/build_database.py --rebuild` — validates every `data/extracted/*.json` against schema, drops per-activity hallucinations, dedup, applies `data/review_decisions.json`, atomic swap into `data/activities.db`.
- `scripts/review_queue.py list|resolve <id> <merge|keep-separate|drop>` — CLI for borderline-dedup decisions; persisted in `data/review_decisions.json`.
-
-## Pilot lessons that apply
-
- **~1.07% hallucinated drops** at pilot scale (well below the 2% threshold). Caused by source_excerpts straddling `--- PAGE N ---` markers. Re-extraction with verbatim within-page quotes fixed all 13 affected chunks. Expect similar rate at Faza 1 scale (~10-30 chunks may need re-extraction).
- **Borderline dedup queue** (369 rows in pilot) — same-name activities re-extracted from chunk overlap with slightly-different prose. Bulk-merge is the right call: same normalized_name + same language + 60-85% desc similarity → merge takes the longest fields. Use the snippet in step 5 above.
- **OCR not needed.** The candidate scanned PDF (`07.Cartea_Mare`) extracted 151 pages of real text via pdfplumber alone. Skip OCR for v1.
-
-## Files not yet committed (uncommitted in this session)
-
- `data/sources/` — all 96 normalized `.txt` files (in `.gitignore`, don't try to commit them)
- `data/chunks/` — all 588 chunks + manifest (in `.gitignore`)
- `data/extracted/` — 64 JSON files so far (in `.gitignore`)
- `data/activities.db` — **still the pilot's 1751-row DB.** Will be rebuilt after Faza 1 finishes.
-
-The schema, all scripts, all tests, and the pilot DB are already committed at `3d9f266`. No code changes are needed for Faza 1 — just data.
-
-## Status snapshot (as of handoff)
+**6 content-filter-blocked** (Anthropic safety; accept as missing — marginal loss):
+- `87850302_dragon_sleepdeprived.part73 / .part85 / .part94` (camp song lyrics)
+- `c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96`

+**7 need RE-EXTRACTION** (their malformed-original JSON was destroyed — see "json_repair
+incident" below; re-extract once the subagent session limit resets, ~5pm UTC):
 ```
-chunks done       : 64 / 588   (10.9%)
-activities so far : 1949
-remaining chunks  : 524
-largest pending sources:
-  87850302_dragon_sleepdeprived               116 chunks (full dragon mirror)
-  c3162825_resource_pack__learning_by_playing  97 chunks (catalunya mirror)
-  4da6431e_cub_scout_leader_how_to_book         18
-  4a765782_1000_fantastic_scout_games           18 (re-extract; was in pilot)
-  bee67427_the_big_book_of_conflict_resolution  15
-  e3bd0953_1001_idei_pentru_o_educatie_timp     14
-  d5e51389_09_culegere_de_jocuri_si_povestiri   13
-  ce4b48f1_impact_culegere_de_jocuri_si_povest  13
-  193fdd94_ghid_de_integrare_a_persoanelor_vul  12 (in progress)
-  779f4fa0_ghidul_animatorului_855_de_jocuri    11
+3f9c8232_teambuilding_corbu_29092023.part01
+5f959f85_scoli_fara_bullying.part02
+83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04
+d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01
+d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04
+d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05
+e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03
 ```
+Re-extract these (Sonnet subagents, one Agent call each, the per-chunk prompt is at
+`data/chunks/_prompts/<key>.prompt.md`), then **re-run the freeze rebuild** so they join
+the corpus before enrichment. Re-freezing is safe now — enrichment has NOT run, so no
+overlay keys depend on the current freeze yet.

-In a fresh session: `cat HANDOFF.md`, then go straight to step 3 above.
+## The json_repair incident (important — root cause + what was fixed)
+
+Subagents **systematically emit unescaped ASCII `"` inside string values** (Romanian
+text like `„Unu"` uses a closing `"` that terminates the JSON string early). ~34 files
+were affected.
+
+First repair attempt used the `json_repair` lib. **It truncates**: on a stray quote it
+ends the string and reinterprets the trailing text as a new key, silently dropping the
+rest of the value and injecting garbage keys. Schema `additionalProperties:false` caught
+the garbage-key cases (8 files dropped at rebuild), but the truncation that didn't create
+an extra key slipped through. Applying json_repair output to disk also **overwrote the
+malformed originals** for those 8 → originals lost → those (now 7, one recovered) need
+re-extraction.
+
+**Fix:** `scripts/repair_extractions.py` was rewritten to use a faithful char-scanner
+(`escape_stray_quotes`) that **escapes** stray quotes (`\"`) instead of splitting on them,
+validates against the real schema, and only replaces a valid top-level file when the
+repaired version carries **strictly more text** (a length guard that catches truncated
+json_repair output while leaving genuine extractions untouched). Re-running it cleanly
+repaired the affected files; the final freeze had **0 schema-rejected, 0 invalid**.
+`json_repair` is no longer used anywhere. Do NOT reintroduce it.
+
+`build_database.py` does NOT depend on the repair script (the "DB regenerable from
+data/extracted/" invariant holds — plain `json.loads` only).
+
+## What the code does now (all committed)
+
+**Part A — plumbing (corpus-independent):**
+- `app/models/database.py`: new columns `name_ro/description_ro/rules_ro/variations_ro,
+  indoor_outdoor, space_needed, estimated_fields(JSON), source_id, source_ids(JSON),
+  chunk_key`; FTS5 indexes the 4 `*_ro` columns (CREATE + all 3 triggers — kept in sync);
+  indexes on `indoor_outdoor`/`space_needed`; `search_activities` gained `indoor_outdoor`
+  and `space_needed` equality kwargs; `_update_category_counts` feeds both new axes into
+  the categories table so dropdowns populate.
+- `app/models/activity.py`: new fields + `to_dict`/`from_dict`; helpers `get_display_name`
+  / `get_display_description` / `get_display_rules` / `get_display_variations`
+  (RO-primary, EN fallback), `has_translation`, `is_estimated(field)`,
+  `get_indoor_outdoor_display`, `get_space_needed_display`.
+- `app/config_taxonomy.py`: `INDOOR_OUTDOOR`, `SPACE_NEEDED` enums + RO labels +
+  `normalize_indoor_outdoor` / `normalize_space_needed` (None on unrecognised, no
+  fallback — never fabricate a value) + display-name helpers.
+- `scripts/build_database.py`: `dict_to_activity` sets `source_id`+`chunk_key`;
+  `merge_cluster` unions `source_ids` and carries rep's `source_id`/`chunk_key` but
+  **never** touches enrichment fields (those are applied post-dedup).
+
+**Part A — UI/search:**
+- `app/services/search.py`: `_map_filters_to_db_fields` maps `indoor_outdoor`/
+  `space_needed` to DB equality filters.
+- `app/web/routes.py`: new `/source/<id>` download route — **shipped DARK behind
+  `SOURCE_DOWNLOAD_ENABLED` (default false; copyright exposure, user opts in)**; resolves
+  `source_file` under `CORPUS_DIR` via `send_from_directory` (traversal-safe, 404s for
+  web-mirror sources). `DISPLAY_NAMES` extended with both new axes.
+- `app/config.py`: `SOURCE_DOWNLOAD_ENABLED`, `CORPUS_DIR`.
+- Templates: `index.html`/`results.html` have the 2 new dropdowns; cards use display
+  helpers + `(estimat)` markers; `activity.html` is RO-primary with a collapsible
+  "Text original" section, indoor/space cards, estimat markers, and the download link
+  (only when the flag is on). `main.css` has `.estimated` / `.original-text` styles.
+
+**Part B — enrichment pipeline (built, not yet run):**
+- `scripts/build_database.py`: `load_enrichment` + `apply_enrichment(activities, enrichment)`
+  applied **right after** `apply_review_decisions`, on the post-dedup list, keyed on
+  `import_common.content_key(normalized_name, language, _normalize_text(description))`
+  (reused verbatim). CLI `--enrichment` (default `data/enrichment.json`). QA report prints
+  `enrichment {entries, matched, orphaned}` + per-field **stated vs estimated** counts.
+  Translated/expanded text is NOT re-validated against source (by design).
+- `scripts/run_enrichment.py`: reads the rebuilt DB, computes each row's content_key,
+  skips rows already in `data/enrichment_parts/<key>.json` (resumable), emits one prompt
+  per activity to `data/enrichment_prompts/` (current EN fields + source chunk text via
+  `find_chunk_text`). Pilot scoping: `--source <substr>` and/or `--limit N`. `--collect`
+  merges parts → `data/enrichment.json`.
+- `scripts/ENRICHMENT_PROMPT.md`: single-pass rules — translate faithfully, expand
+  `description_ro` ONLY from chunk text, mark inferred filter fields in `estimated_fields`,
+  fixed enum vocab, output `data/enrichment_parts/<content_key>.json` including `content_key`.
+
+## Exact next steps
+
+1. **Re-extract the 7 chunks** above (after session-limit reset). Verify each writes valid
+   JSON (`python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"`).
+   If any come back malformed, `python3 scripts/repair_extractions.py --apply` (faithful now).
+2. **Re-freeze:** `python3 scripts/build_database.py --rebuild` — confirm 0 schema-rejected,
+   note the new total (~9418 + the 7 chunks' activities).
+3. **Enrichment PILOT** (plan B5 — the STOP gate guarding 6–8k LLM calls):
+   - Pick one source, e.g. `python3 scripts/run_enrichment.py --source teambuilding_corbu`
+     (or `--limit 30`). This writes prompts to `data/enrichment_prompts/`.
+   - Launch a small wave of Sonnet subagents on those prompts (each writes
+     `data/enrichment_parts/<key>.json`).
+   - `python3 scripts/run_enrichment.py --collect` → `data/enrichment.json`.
+   - `python3 scripts/build_database.py --rebuild` (picks up `--enrichment` by default).
+   - **STOP. Hand the user translation-quality + estimation-plausibility + description-
+     fidelity samples and get sign-off BEFORE scaling to the full corpus.** Do not
+     auto-proceed past this gate.
+4. After sign-off: scale enrichment in waves of ~8–16 Sonnet subagents, `--collect`,
+   final `--rebuild --enrichment`.
+
+## Verify / run
+
+- Tests: `python3 -m pytest tests/ -q` → 99 pass.
+- App: `SOURCE_DOWNLOAD_ENABLED` is false by default (download link hidden). Set it true
+  only if the user accepts the copyright exposure of serving original files.
+- `data/activities.db.bak` is the pre-this-freeze backup.
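
The "json_repair incident" section of the HANDOFF.md diff above describes a char-scanner that escapes stray quotes instead of splitting on them. The diff does not show the real `escape_stray_quotes`, so the following is only a minimal sketch of that idea under one stated assumption: a `"` inside a string is treated as the true closing quote only when the next non-whitespace character is `,`, `:`, `}`, `]`, or end of input. That lookahead rule is a guess for illustration, not the committed implementation, and it would misjudge a stray quote that happens to be followed directly by a comma.

```python
import json


def escape_stray_quotes(text: str) -> str:
    """Escape unescaped ASCII quotes that appear inside JSON string values.

    Heuristic (an assumption for this sketch): a quote closes the string only
    if the next non-whitespace character is one of , : } ] or end of input.
    Any other quote inside a string is emitted as \\" so no text is dropped.
    """
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
            i += 1
        elif ch == "\\" and i + 1 < n:
            # Existing escape sequence: copy both characters untouched.
            out.append(text[i:i + 2])
            i += 2
        elif ch == '"':
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if j == n or text[j] in ",:}]":
                in_string = False      # structural closing quote
                out.append(ch)
            else:
                out.append('\\"')      # stray quote: escape, never truncate
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Malformed input of the kind described: Romanian „…" closes with ASCII ".
broken = '{"name": "Jocul „Unu" si doi", "language": "ro"}'
repaired = json.loads(escape_stray_quotes(broken))
```

Because the scanner only ever adds a backslash, the repaired value can never be shorter than the raw text, which is the property the strictly-more-text guard relies on. Already valid JSON passes through unchanged, since every quote in it is either escaped or followed by a structural character.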