Phase 0 pilot: rebuild activities.db from 5-file extraction

LLM subagent extraction over 61 chunks yielded 1780 raw activities;
after build_database dedup and the hallucination check, 1732 remain
in the DB (48 removed in total).

Pilot metrics vs plan acceptance thresholds:
- hallucinated drops         : 19/1780 = 1.07%  (threshold ≤ 2%)
- schema-rejected files      : 0/61, i.e. 100% valid  (threshold ≥ 90% valid)
- chunks needing re-extract  : 13/61 = 21.3%  (paraphrased excerpts scoring 75-90/100)
- activities with rules      : 99.9%
- extraction_confidence high : 1712/1732 = 98.8%
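
The two thresholded metrics above can be recomputed with a short
sketch. The counts are copied from this message; the constant and
function names are hypothetical and are not taken from the plan or
from build_database.

```python
# Counts copied from the pilot metrics; all names here are illustrative.
RAW_ACTIVITIES = 1780
HALLUCINATED_DROPS = 19
CHUNKS = 61
SCHEMA_REJECTED = 0

MAX_HALLUCINATION_RATE = 0.02  # plan threshold: <= 2%
MIN_SCHEMA_VALID_RATE = 0.90   # plan threshold: >= 0.9 valid


def pilot_gate() -> dict:
    """Recompute the two thresholded metrics and report pass/fail."""
    hallucination_rate = HALLUCINATED_DROPS / RAW_ACTIVITIES
    schema_valid_rate = (CHUNKS - SCHEMA_REJECTED) / CHUNKS
    passed = (
        hallucination_rate <= MAX_HALLUCINATION_RATE
        and schema_valid_rate >= MIN_SCHEMA_VALID_RATE
    )
    return {
        "hallucination_rate": hallucination_rate,
        "schema_valid_rate": schema_valid_rate,
        "passed": passed,
    }


if __name__ == "__main__":
    result = pilot_gate()
    print(f"hallucinated drops : {result['hallucination_rate']:.2%}")
    print(f"schema-valid chunks: {result['schema_valid_rate']:.0%}")
    print("gate passed" if result["passed"] else "gate FAILED")
```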

OCR decision: NOT NEEDED. The Cartea_Mare scanned-PDF candidate
extracted 151 pages / 38k words of real text via pdfplumber alone.
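
A minimal sketch of how such an OCR decision could be made, assuming
a words-per-page cutoff. The cutoff of 50 and both function names are
assumptions for illustration; the pilot's actual criterion is not
recorded here. Only pdfplumber.open, pdf.pages and page.extract_text
are real pdfplumber calls.

```python
def has_usable_text_layer(page_texts, min_words_per_page=50):
    """Return True if the average word count per page meets the cutoff.

    min_words_per_page is an assumed value, not a pilot setting.
    """
    if not page_texts:
        return False
    total_words = sum(len((text or "").split()) for text in page_texts)
    return total_words / len(page_texts) >= min_words_per_page


def extract_page_texts(pdf_path):
    """Extract the embedded text of every page with pdfplumber (no OCR)."""
    # Imported lazily so the helper above works without the dependency.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can yield nothing for image-only pages.
        return [page.extract_text() or "" for page in pdf.pages]
```

At 38k words over 151 pages (roughly 250 words per page), the
Cartea_Mare file would clear a cutoff of this kind by a wide margin.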

Pilot files:
- 1000 Fantastic Scout Games (EN, 278pg, 18 chunks → 946 activities)
- dragon.sleepdeprived.ca/games mirror (EN, 498pg, 31 chunks → 531)
- Cartea Mare a Jocurilor (RO, 151pg, 10 chunks → 284)
- Activităţi şi jocuri ... .doc (RO, 7pg, 1 chunk → 19, needs_review)
- Amazing Race templates zip (graphics only, 0 activities — expected)

The old activities.db was backed up to .bak before atomic swap.
tests/ still green (71 passed).
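
The backup and atomic swap described above could look like the sketch
below. The function name and the temporary file name in the usage
note are hypothetical; the real swap code is not part of this
message. os.replace is atomic only when both paths are on the same
filesystem.

```python
import os
import shutil
from pathlib import Path


def swap_in_database(new_db: Path, live_db: Path) -> Path:
    """Back up live_db to <name>.bak, then atomically replace it with new_db."""
    backup = live_db.with_name(live_db.name + ".bak")
    if live_db.exists():
        # copy2 keeps timestamps, so the backup records when the old DB was built.
        shutil.copy2(live_db, backup)
    # Readers see either the old file or the new one, never a partial write.
    os.replace(new_db, live_db)
    return backup
```

A call such as swap_in_database(Path("activities.db.new"),
Path("activities.db")) would leave the previous database in
activities.db.bak.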

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude Agent
2026-05-20 07:43:42 +00:00
parent 66ae831c36
commit 3d9f266696

Binary file not shown.