
Claude Agent 46d9592a55 HANDOFF for Faza 1 resumption (10.9% done, switch to Sonnet)

64/588 chunks extracted so far (~1949 activities) but in a fresh
session we should switch the subagent model from Opus to Sonnet —
the task is structured JSON extraction with a fixed schema, no
complex reasoning needed, and Sonnet's 200K context easily fits the
~25k-token prompt and ~20k-token output per chunk. Document captures
the exact resume procedure: pending-chunk discovery, the Agent call
template with model:"sonnet", and the finalization steps
(validate -> build_database -> needs_review bulk merge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 19:32:44 +00:00


HANDOFF — Faza 1 extraction in progress

Last updated: 2026-05-20, commit 3d9f266 (pilot complete) + uncommitted Faza 1 work.

State of play

Faza 0 (pilot) is complete and committed. Faza 1 (full corpus) is in progress at 10.9%.

| Phase | Status | DB rows | Tests |
|---|---|---|---|
| Faza 0 pilot (5 files) | committed (`3d9f266`) | 1751 in `data/activities.db` | 71 passed |
| Faza 1 extraction | 64/588 chunks done, 1949 activities in `data/extracted/*.json` (not yet imported to DB) | n/a | n/a |

What "Faza 1" is

Process the full 96-source corpus (down from 116 raw files after dropping duplicates, empty zips, and skipped junk) through the LLM-subagent extraction pipeline. This is the same code path as the pilot. Two large mirror directories dominate the chunk count:

87850302_dragon_sleepdeprived — 116 chunks (full dragon.sleepdeprived.ca mirror)
c3162825_resource_pack__learning_by_playing_catalunya_... — 97 chunks (the catalunya mirror)

Combined they account for 213/588 = 36% of all chunks.

Critical recommendation for the next session

Use Sonnet 4.6 for subagent extractions, not Opus. Opus was used for chunks 1-64 and exhausted a 5-hour rate-limit window too quickly for a job of this size. Sonnet's 200K context is plenty for the ~25k-token prompt plus ~20k-token output of a single chunk extraction. The task is structured JSON extraction with a fixed schema, so no complex reasoning is needed.

The Agent tool's model parameter takes "sonnet". Pass it on every Agent call below.

Resuming — exact mechanical steps

1. Verify state.

cd /workspace/game-library
ls data/extracted/*.json | wc -l       # should be 64 (or higher if more done)
ls data/chunks/_prompts/ | wc -l       # 588 — the full prompt set
git log --oneline -3                   # 3d9f266 should be HEAD or appear earlier in the history

2. Find what's still pending. Compare prompts to extracted files:

ls data/chunks/_prompts/ | sed 's/\.prompt\.md$//' | sort > /tmp/all.txt
ls data/extracted/*.json 2>/dev/null | sed 's|.*/||;s|\.json$||' | sort > /tmp/done.txt
comm -23 /tmp/all.txt /tmp/done.txt > /tmp/pending.txt
wc -l /tmp/pending.txt                 # how many remain
head /tmp/pending.txt                  # what's next
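The same pending-chunk discovery can be sketched in Python as a set difference. This is a minimal sketch, not part of the repository: the function name `pending_chunks` is hypothetical, and it assumes prompt files are named `<CHUNK_KEY>.prompt.md` and extractions `<CHUNK_KEY>.json`, as the shell commands above imply.

```python
from pathlib import Path

PROMPT_SUFFIX = ".prompt.md"  # suffix stripped by the sed call in the shell version


def pending_chunks(prompts_dir: Path, extracted_dir: Path) -> list[str]:
    """Return sorted chunk keys that have a prompt file but no extracted JSON."""
    all_keys = {
        p.name[: -len(PROMPT_SUFFIX)]
        for p in prompts_dir.iterdir()
        if p.name.endswith(PROMPT_SUFFIX)
    }
    # Path.stem drops only the final ".json", so "<id>.partNN.json" -> "<id>.partNN"
    done_keys = {p.stem for p in extracted_dir.glob("*.json")}
    return sorted(all_keys - done_keys)


# Example call from the repository root (paths taken from the handoff):
# pending = pending_chunks(Path("data/chunks/_prompts"), Path("data/extracted"))
# print(len(pending), "chunks pending")
```

Unlike the `comm` pipeline, this version silently ignores files in the prompt directory that do not end in `.prompt.md`.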

3. Launch waves of 16 subagents in parallel: one Agent call per chunk, with all 16 Agent tool calls in a single message. Use this exact template (substitute <CHUNK_KEY>):

Agent(
  description: "Extract <CHUNK_KEY>",
  subagent_type: "general-purpose",
  model: "sonnet",                  ← critical
  prompt: "Working directory: /workspace/game-library. Extraction subagent — read data/chunks/_prompts/<CHUNK_KEY>.prompt.md and follow it EXACTLY. Apply rules from scripts/SUBAGENT_PROMPT.md and schema from scripts/activity_schema.json. Write the JSON. Set language per chunk content ('ro' or 'en'). Report under 40 words."
)

The per-chunk prompt file is fully self-contained — it points to the right chunk, sets source_id/source_hash/chunk_key, and references the rules + schema. The subagent just follows it.
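Splitting the pending list into waves is plain fixed-size batching. The sketch below is illustrative only: `waves` and `wave_count` are hypothetical helper names, and the list of keys is assumed to come from `/tmp/pending.txt` as produced in the previous step.

```python
import math
from typing import Iterator

WAVE_SIZE = 16  # wave size given in the handoff


def waves(pending: list[str], size: int = WAVE_SIZE) -> Iterator[list[str]]:
    """Yield consecutive batches of at most `size` chunk keys, preserving order."""
    for start in range(0, len(pending), size):
        yield pending[start : start + size]


def wave_count(n_pending: int, size: int = WAVE_SIZE) -> int:
    """Number of waves needed to cover `n_pending` chunks."""
    return math.ceil(n_pending / size)


# Example: load the keys written by the shell step, then take the first wave.
# pending = open("/tmp/pending.txt").read().split()
# first_wave = next(waves(pending))
```

With the 524 chunks remaining at handoff, this gives 33 waves, the last one holding 12 chunks.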

4. After every wave, briefly check progress and continue:
```
ls data/extracted/*.json | wc -l
```
Then repeat step 3 with the next 16 pending chunks. If an agent reports "You've hit your limit · resets ..." together with tool_uses: 5 and total_tokens: 0, check whether the JSON was written anyway; agents often persist the file before hitting the limit. Re-launch only if the JSON is missing.
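A rough way to decide which chunks from a wave need re-launching is to look for their output files. In this sketch the function name `needs_relaunch` is hypothetical, and treating an unparseable (for example truncated) JSON file as missing is an added assumption; the rule above only mentions a missing file.

```python
import json
from pathlib import Path


def needs_relaunch(wave: list[str], extracted_dir: Path) -> list[str]:
    """Return the chunk keys in `wave` with no usable JSON in `extracted_dir`."""
    missing = []
    for key in wave:
        path = extracted_dir / f"{key}.json"
        try:
            # Assumption: a file that does not parse is treated like a missing one.
            json.loads(path.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            missing.append(key)
    return missing


# Example call after a wave finishes:
# retry = needs_relaunch(first_wave, Path("data/extracted"))
```

This only checks that the file parses; schema and excerpt validation still happen later in `scripts/validate_extractions.py`.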

5. When all 588 chunks are done, finalize:

python3 scripts/validate_extractions.py    # any chunks marked rejected go to data/extracted/_reextract/
# re-extract any rejected chunks (same template, prompt from _reextract/)
python3 scripts/build_database.py --rebuild
# if many borderline needs_review rows:
python3 -c "
import sys; sys.path.insert(0,'scripts')
from import_common import content_key, normalize_name
import sqlite3, json
conn = sqlite3.connect('data/activities.db')
conn.row_factory = sqlite3.Row
rows = list(conn.execute('SELECT name, normalized_name, language, description FROM activities WHERE needs_review=1'))
d = {content_key(r['normalized_name'] or normalize_name(r['name']), r['language'], r['description'] or ''): 'merge' for r in rows}
json.dump(d, open('data/review_decisions.json','w'), indent=2)
print(f'{len(d)} merge decisions')
"
python3 scripts/build_database.py --rebuild   # apply decisions
python3 -m pytest tests/ -q                   # 71 should pass
git add data/activities.db data/review_decisions.json
git commit -m "Faza 1: full corpus extraction"

Code reference — what each script does

scripts/normalize_sources.py --corpus data/carti-camp-jocuri --out data/sources → produces 96 data/sources/<id>.txt files with --- PAGE N --- markers. Done. Don't re-run.
scripts/chunk_sources.py --sources data/sources --chunks data/chunks → splits each source into ~20-page chunks with a 4-page overlap, writes data/chunks/<id>/<id>.partNN.txt and updates data/chunks/manifest.json. Done. Don't re-run unless sources change.
scripts/run_extraction.py → regenerates the per-chunk prompts in data/chunks/_prompts/. Done. Don't re-run unless schema/prompt changes.
scripts/SUBAGENT_PROMPT.md — extraction rules (what subagents follow).
scripts/activity_schema.json — JSON schema each extraction must validate against.
scripts/validate_extractions.py — per-file schema check + fuzzy source_excerpt substring check; writes re-extraction prompts to data/extracted/_reextract/ for rejected chunks; marks chunks rejected in manifest.
scripts/build_database.py --rebuild — validates every data/extracted/*.json against the schema, drops per-activity hallucinations, deduplicates, applies data/review_decisions.json, then atomically swaps the result into data/activities.db.
scripts/review_queue.py list|resolve <id> <merge|keep-separate|drop> — CLI for borderline-dedup decisions; persisted in data/review_decisions.json.
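To make the source_excerpt check concrete, here is a simplified sketch of testing whether an excerpt sits inside a single page. It is not the code in `scripts/validate_extractions.py`: the real check is described as fuzzy, while this one is an exact match after whitespace normalization, and it assumes each `--- PAGE N ---` marker occupies its own line. All function names are hypothetical.

```python
import re

# Assumption: each marker is alone on its line, e.g. "--- PAGE 12 ---".
PAGE_MARKER = re.compile(r"^--- PAGE \d+ ---$", re.MULTILINE)


def split_pages(source_text: str) -> list[str]:
    """Split normalized source text into per-page strings at the page markers."""
    return PAGE_MARKER.split(source_text)


def normalize_ws(text: str) -> str:
    """Collapse all runs of whitespace to single spaces."""
    return " ".join(text.split())


def excerpt_within_one_page(excerpt: str, source_text: str) -> bool:
    """True if the excerpt occurs verbatim (modulo whitespace) inside one page."""
    needle = normalize_ws(excerpt)
    if not needle:
        return False
    return any(needle in normalize_ws(page) for page in split_pages(source_text))
```

An excerpt that spans a page marker matches no single page, so it fails this check even though every word of it is in the source.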

Pilot lessons that apply

~1.07% hallucinated drops at pilot scale (well below the 2% threshold). The cause was source_excerpts straddling --- PAGE N --- markers. Re-extraction with verbatim within-page quotes fixed all 13 affected chunks. Expect a similar rate at Faza 1 scale (~10-30 chunks may need re-extraction).
Borderline dedup queue (369 rows in pilot): same-name activities re-extracted from chunk overlap with slightly different prose. Bulk-merge is the right call: rows with the same normalized_name, the same language, and 60-85% description similarity are merged, keeping the longest value of each field. Use the snippet in step 5 above.
OCR not needed. The candidate scanned PDF (07.Cartea_Mare) extracted 151 pages of real text via pdfplumber alone. Skip OCR for v1.
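The borderline-merge rule from the dedup lesson above can be illustrated as follows. This is a sketch under stated assumptions: the similarity metric actually used by `scripts/build_database.py` is not given here, so `difflib.SequenceMatcher` stands in for it, and `is_borderline` and `merge_longest` are hypothetical names.

```python
from difflib import SequenceMatcher


def desc_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; SequenceMatcher is a stand-in for the real metric."""
    return SequenceMatcher(None, a, b).ratio()


def is_borderline(a: dict, b: dict, low: float = 0.60, high: float = 0.85) -> bool:
    """Same normalized_name and language, description similarity within [low, high]."""
    return (
        a["normalized_name"] == b["normalized_name"]
        and a["language"] == b["language"]
        and low <= desc_similarity(a["description"], b["description"]) <= high
    )


def merge_longest(a: dict, b: dict) -> dict:
    """Merge two rows field by field, keeping the longer value of each field."""
    merged = {}
    for field in a.keys() | b.keys():
        va, vb = a.get(field) or "", b.get(field) or ""
        # Ties go to the first row.
        merged[field] = va if len(str(va)) >= len(str(vb)) else vb
    return merged
```

Pairs above the upper bound are not borderline under this rule, and pairs below the lower bound are left as separate activities.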

Files not yet committed in this session

data/sources/ — all 96 normalized .txt files (in .gitignore, don't try to commit them)
data/chunks/ — all 588 chunks + manifest (in .gitignore)
data/extracted/ — 64 JSON files so far (in .gitignore)
data/activities.db — still the pilot's 1751-row DB. Will be rebuilt after Faza 1 finishes.

The schema, all scripts, all tests, and the pilot DB are already committed at 3d9f266. No code changes are needed for Faza 1 — just data.

Status snapshot (as of handoff)

chunks done       : 64 / 588   (10.9%)
activities so far : 1949
remaining chunks  : 524
largest pending sources:
  87850302_dragon_sleepdeprived               116 chunks (full dragon mirror)
  c3162825_resource_pack__learning_by_playing  97 chunks (catalunya mirror)
  4da6431e_cub_scout_leader_how_to_book         18
  4a765782_1000_fantastic_scout_games           18 (re-extract; was in pilot)
  bee67427_the_big_book_of_conflict_resolution  15
  e3bd0953_1001_idei_pentru_o_educatie_timp     14
  d5e51389_09_culegere_de_jocuri_si_povestiri   13
  ce4b48f1_impact_culegere_de_jocuri_si_povest  13
  193fdd94_ghid_de_integrare_a_persoanelor_vul  12 (in progress)
  779f4fa0_ghidul_animatorului_855_de_jocuri    11

In a fresh session: cat HANDOFF.md, then go straight to step 3 above.
