diff --git a/HANDOFF.md b/HANDOFF.md
new file mode 100644
index 0000000..61e0cda
--- /dev/null
+++ b/HANDOFF.md
@@ -0,0 +1,135 @@
+# HANDOFF: Faza 1 extraction in progress
+
+**Last updated:** 2026-05-20, commit `3d9f266` (pilot complete) plus uncommitted Faza 1 work.
+
+## State of play
+
+Faza 0 (pilot) is **complete and committed**. Faza 1 (full corpus) is **in progress at 10.9%**.
+
+| Phase | Status | DB rows | Tests |
+|-------|--------|---------|-------|
+| Faza 0 pilot (5 files) | committed (`3d9f266`) | 1751 in `data/activities.db` | 71 passed |
+| Faza 1 extraction | **64/588 chunks done**, 1949 activities in `data/extracted/*.json` (not yet imported to DB) | n/a | n/a |
+
+## What "Faza 1" is
+
+Process the full 96-source corpus (down from 116 raw files; the rest are duplicates, empty zips, or skipped junk) through the LLM-subagent extraction pipeline. Same code path as the pilot. Two large mirror dirs dominate the chunk count:
+
+- `87850302_dragon_sleepdeprived`: 116 chunks (full dragon.sleepdeprived.ca mirror)
+- `c3162825_resource_pack__learning_by_playing_catalunya_...`: 97 chunks (the catalunya mirror)
+
+Combined they account for 213 of the 588 chunks (36%).
+
+## Critical recommendation for the next session
+
+**Use Sonnet 4.6 for subagent extractions, not Opus.** Opus was used for chunks 1-64 and exhausted a 5-hour rate-limit window too quickly for a job of this size. Sonnet's 200K context is plenty for the ~25k-token prompt + ~20k-token output of a single chunk extraction. The task is structured JSON extraction with a fixed schema; no complex reasoning is needed.
+
+The Agent tool's `model` parameter takes `"sonnet"`. Pass it on every Agent call below.
+
+## Resuming: exact mechanical steps
+
+1. **Verify state.**
+   ```bash
+   cd /workspace/game-library
+   ls data/extracted/*.json | wc -l     # should be 64 (or higher if more done)
+   ls data/chunks/_prompts/ | wc -l     # 588, the full prompt set
+   git log --oneline -3                 # 3d9f266 must be HEAD or earlier
+   ```
+
+2. **Find what's still pending.** Compare prompts to extracted files:
+   ```bash
+   ls data/chunks/_prompts/ | sed 's/\.prompt\.md$//' | sort > /tmp/all.txt
+   ls data/extracted/*.json 2>/dev/null | sed 's|.*/||;s|\.json$||' | sort > /tmp/done.txt
+   comm -23 /tmp/all.txt /tmp/done.txt > /tmp/pending.txt
+   wc -l /tmp/pending.txt               # how many remain
+   head /tmp/pending.txt                # what's next
+   ```
+
+3. **Launch waves of 16 subagents in parallel.** One Agent call per chunk, all 16 Agent tool calls in a single message. Use this exact template (substitute ``):
+
+   ```
+   Agent(
+     description: "Extract ",
+     subagent_type: "general-purpose",
+     model: "sonnet",                    ← critical
+     prompt: "Working directory: /workspace/game-library. Extraction subagent — read data/chunks/_prompts/.prompt.md and follow it EXACTLY. Apply rules from scripts/SUBAGENT_PROMPT.md and schema from scripts/activity_schema.json. Write the JSON. Set language per chunk content ('ro' or 'en'). Report under 40 words."
+   )
+   ```
+
+   The per-chunk prompt file is fully self-contained: it points to the right chunk, sets source_id/source_hash/chunk_key, and references the rules + schema. The subagent just follows it.
+
+4. **After every wave**, briefly check progress and continue:
+   ```bash
+   ls data/extracted/*.json | wc -l
+   ```
+   Repeat step 3 with the next 16 pending chunks. If an agent reports `"You've hit your limit · resets ..."` AND `tool_uses: 5` with `total_tokens: 0`, check whether the JSON was written anyway; agents often persist the file before the limit hits. Only re-launch if the JSON is missing.
+
+5. **When all 588 chunks are done**, finalize:
+   ```bash
+   python3 scripts/validate_extractions.py    # any chunks marked rejected go to data/extracted/_reextract/
+   # re-extract any rejected chunks (same template, prompt from _reextract/)
+   python3 scripts/build_database.py --rebuild
+   # if many borderline needs_review rows:
+   python3 -c "
+   import sys; sys.path.insert(0,'scripts')
+   from import_common import content_key, normalize_name
+   import sqlite3, json
+   conn = sqlite3.connect('data/activities.db')
+   conn.row_factory = sqlite3.Row
+   rows = list(conn.execute('SELECT name, normalized_name, language, description FROM activities WHERE needs_review=1'))
+   d = {content_key(r['normalized_name'] or normalize_name(r['name']), r['language'], r['description'] or ''): 'merge' for r in rows}
+   json.dump(d, open('data/review_decisions.json','w'), indent=2)
+   print(f'{len(d)} merge decisions')
+   "
+   python3 scripts/build_database.py --rebuild    # apply decisions
+   python3 -m pytest tests/ -q                    # 71 should pass
+   git add data/activities.db data/review_decisions.json
+   git commit -m "Faza 1: full corpus extraction"
+   ```
+
+## Code reference: what each script does
+
+- `scripts/normalize_sources.py --corpus data/carti-camp-jocuri --out data/sources` → produces 96 `data/sources/.txt` files with `--- PAGE N ---` markers. **Done. Don't re-run.**
+- `scripts/chunk_sources.py --sources data/sources --chunks data/chunks` → splits each source into ~20-page chunks with a 4-page overlap, writes `data/chunks//.partNN.txt` and updates `data/chunks/manifest.json`. **Done. Don't re-run unless sources change.**
+- `scripts/run_extraction.py` → regenerates the per-chunk prompts in `data/chunks/_prompts/`. **Done. Don't re-run unless schema/prompt changes.**
+- `scripts/SUBAGENT_PROMPT.md`: extraction rules (what subagents follow).
+- `scripts/activity_schema.json`: JSON schema each extraction must validate against.
+- `scripts/validate_extractions.py`: per-file schema check plus fuzzy `source_excerpt` substring check; writes re-extraction prompts to `data/extracted/_reextract/` for rejected chunks; marks chunks `rejected` in the manifest.
+- `scripts/build_database.py --rebuild`: validates every `data/extracted/*.json` against the schema, drops per-activity hallucinations, deduplicates, applies `data/review_decisions.json`, and atomically swaps the result into `data/activities.db`.
+- `scripts/review_queue.py list|resolve `: CLI for borderline-dedup decisions; persisted in `data/review_decisions.json`.
+
+## Pilot lessons that apply
+
+- **~1.07% hallucinated drops** at pilot scale (well below the 2% threshold). Caused by `source_excerpt` values straddling `--- PAGE N ---` markers. Re-extraction with verbatim within-page quotes fixed all 13 affected chunks. Expect a similar rate at Faza 1 scale (~10-30 chunks may need re-extraction).
+- **Borderline dedup queue** (369 rows in pilot): same-name activities re-extracted from chunk overlap with slightly different prose. Bulk-merge is the right call: same `normalized_name`, same language, and 60-85% description similarity → merge, keeping the longest value of each field. Use the snippet in step 5 above.
+- **OCR not needed.** The candidate scanned PDF (`07.Cartea_Mare`) yielded 151 pages of real text via pdfplumber alone. Skip OCR for v1.
+
+## Files not yet committed in this session
+
+- `data/sources/`: all 96 normalized `.txt` files (in `.gitignore`, don't try to commit them)
+- `data/chunks/`: all 588 chunks + manifest (in `.gitignore`)
+- `data/extracted/`: 64 JSON files so far (in `.gitignore`)
+- `data/activities.db`: **still the pilot's 1751-row DB.** Will be rebuilt after Faza 1 finishes.
+
+The schema, all scripts, all tests, and the pilot DB are already committed at `3d9f266`. No code changes are needed for Faza 1, just data.
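
For reference, the bulk-merge rule from the pilot lessons (same normalized name, same language, 60-85% description similarity, longest fields win) can be sketched roughly as below. This is an illustration only, not the logic in `scripts/build_database.py`: the helper names are hypothetical, and `difflib.SequenceMatcher` is an assumed stand-in for whatever similarity measure the pipeline actually uses.

```python
from difflib import SequenceMatcher


def description_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]. SequenceMatcher is an ASSUMED metric,
    not necessarily what the real pipeline computes."""
    return SequenceMatcher(None, a, b).ratio()


def is_borderline_duplicate(a: dict, b: dict,
                            low: float = 0.60, high: float = 0.85) -> bool:
    """True when two activities share normalized_name and language and their
    descriptions fall inside the 60-85% borderline band."""
    if a["normalized_name"] != b["normalized_name"]:
        return False
    if a["language"] != b["language"]:
        return False
    sim = description_similarity(a.get("description") or "",
                                 b.get("description") or "")
    return low <= sim <= high


def merge_longest(a: dict, b: dict) -> dict:
    """Hypothetical merge: for each field keep the value with the longer
    string form; ties go to the first record."""
    merged = {}
    for key in a.keys() | b.keys():
        va, vb = a.get(key), b.get(key)
        merged[key] = va if len(str(va or "")) >= len(str(vb or "")) else vb
    return merged
```

Pairs above the upper bound would presumably be treated as exact duplicates and pairs below the lower bound as distinct activities; only the band in between reaches the review queue.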
+
+## Status snapshot (as of handoff)
+
+```
+chunks done        : 64 / 588 (10.9%)
+activities so far  : 1949
+remaining chunks   : 524
+largest pending sources:
+  87850302_dragon_sleepdeprived                 116 chunks (full dragon mirror)
+  c3162825_resource_pack__learning_by_playing    97 chunks (catalunya mirror)
+  4da6431e_cub_scout_leader_how_to_book          18
+  4a765782_1000_fantastic_scout_games            18 (re-extract; was in pilot)
+  bee67427_the_big_book_of_conflict_resolution   15
+  e3bd0953_1001_idei_pentru_o_educatie_timp      14
+  d5e51389_09_culegere_de_jocuri_si_povestiri    13
+  ce4b48f1_impact_culegere_de_jocuri_si_povest   13
+  193fdd94_ghid_de_integrare_a_persoanelor_vul   12 (in progress)
+  779f4fa0_ghidul_animatorului_855_de_jocuri     11
+```
+
+In a fresh session: `cat HANDOFF.md`, then go straight to step 3 above.
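
As a Python alternative to the shell commands in step 2, the pending list and the waves of 16 from step 3 could be computed as sketched below. The function names are hypothetical; the sketch assumes prompt files are named `<chunk_key>.prompt.md` and extractions `<chunk_key>.json`, which is what the `sed` patterns in step 2 imply.

```python
from pathlib import Path
from typing import Iterator

WAVE_SIZE = 16  # wave size from step 3
PROMPT_SUFFIX = ".prompt.md"  # assumed naming, inferred from the sed pattern in step 2


def pending_chunk_keys(prompts_dir: Path, extracted_dir: Path) -> list[str]:
    """Chunk keys that have a prompt file but no extracted JSON yet."""
    all_keys = {p.name[: -len(PROMPT_SUFFIX)]
                for p in prompts_dir.glob("*" + PROMPT_SUFFIX)}
    done_keys = {p.stem for p in extracted_dir.glob("*.json")}
    return sorted(all_keys - done_keys)


def waves(keys: list[str], size: int = WAVE_SIZE) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` chunk keys."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


if __name__ == "__main__":
    pending = pending_chunk_keys(Path("data/chunks/_prompts"),
                                 Path("data/extracted"))
    print(f"{len(pending)} chunks pending")
    for number, wave in enumerate(waves(pending), start=1):
        print(f"wave {number}: {len(wave)} chunks, first = {wave[0]}")
```

With the 524 chunks remaining at handoff this gives 33 waves: 32 full waves of 16 and a final wave of 12.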