Compare commits


7 Commits

Author SHA1 Message Date
Claude Agent
f7a37f91ec Headless cron enrichment system + progress checkpoint at 32%
OS cron fires enrich_wave.sh twice nightly (post 23:00 UTC reset); each wave
caps at ~700 keys (~75% window) via enrichment_wave.py --prepare. Fully
headless: one claude -p per batch via xargs, flock-guarded, idempotent.
DB updated to 9541 activities; .gitignore covers enrichment intermediates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 21:26:35 +00:00
Claude Agent
d6971e47f8 Prevent + net the unescaped-quote bug in the durable prompts/pipeline
The escape-ASCII-quote rule previously lived only in ephemeral Agent-call
strings. Bake it into the durable artifacts so the next session doesn't
re-derive it:
- SUBAGENT_PROMPT.md + ENRICHMENT_PROMPT.md: explicit rule to escape any
  ASCII " inside JSON string values (Romanian „cuvânt" is the trap).
- run_enrichment.py collect_enrichment: repair malformed parts with
  escape_stray_quotes instead of dropping them — the enrichment path had no
  repair net (bad parts were silently dropped, losing that activity's
  enrichment). Extraction already had one; now both do.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 18:16:04 +00:00
Claude Agent
bcfb6841eb Faza 1 complete: bilingual+enrichment plumbing, UI/filters, frozen DB
Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await
re-extraction). DB rebuilt and frozen at 9418 activities — content_keys
are now stable for the enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor,
  space_needed, estimated_fields, source_id/source_ids/chunk_key columns;
  FTS5 indexes the 4 *_ro columns across CREATE + all 3 triggers; new equality
  filters + category counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*,
  is_estimated, axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers
  (None on unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns,
  RO-primary rendering with "(estimat)" markers and collapsible original
  text, and a /source/<id> download route shipped DARK behind
  SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster
  unions source_ids without touching enrichment fields.

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on
  content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect),
  ENRICHMENT_PROMPT.md.

Repair: scripts/repair_extractions.py fixes the subagents' systematic
unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never
truncates) + schema validation + a strictly-more-text guard. json_repair was
tried first, truncated silently, and is NOT used. build_database has no repair
dependency.

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 18:10:13 +00:00
Claude Agent
46d9592a55 HANDOFF for Faza 1 resumption (10.9% done, switch to Sonnet)
64/588 chunks extracted so far (~1949 activities) but in a fresh
session we should switch the subagent model from Opus to Sonnet —
the task is structured JSON extraction with a fixed schema, no
complex reasoning needed, and Sonnet's 200K context easily fits the
~25k-token prompt and ~20k-token output per chunk. Document captures
the exact resume procedure: pending-chunk discovery, the Agent call
template with model:"sonnet", and the finalization steps
(validate -> build_database -> needs_review bulk merge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 19:32:44 +00:00
Claude Agent
09999ccd40 Faza 0 follow-ups: re-extract 13 chunks, resolve 377 needs_review
- Re-extracted the 13 chunks with paraphrased source_excerpts
  (root cause: original excerpts straddled --- PAGE N --- markers
  which the rapidfuzz partial_ratio scored 75-90/100). Re-extraction
  used verbatim within-page quotes; all now score 100/100.
- Hallucinated drops: 19 -> 0.
- Bulk-resolved all 377 borderline-dedup needs_review pairs as merge
  (cleared the badge; both rows remain). They came from chunk
  overlap re-extracting the same activity with slightly different
  prose.
- Final DB: 1751 activities (was 1732).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:59:36 +00:00
Claude Agent
3d9f266696 Faza 0 pilot: rebuild activities.db from 5-file extraction
61 chunks × LLM subagent extraction yielded 1780 raw activities;
build_database dedup + hallucination check yielded 1732 in DB.

Pilot metrics vs plan acceptance thresholds:
- hallucinated drops      : 19/1780 = 1.07%  (threshold ≤ 2%)
- schema-rejected files   : 0/61              (threshold ≥ 0.9 valid)
- chunks needing re-extract: 13/61 (paraphrased excerpts 75-90/100)
- % with rules            : 99.9%
- extraction_confidence high: 1712/1732 = 98.8%

OCR decision: NOT NEEDED. The Cartea_Mare scanned-PDF candidate
extracted 151 pages / 38k words of real text via pdfplumber alone.

Pilot files:
- 1000 Fantastic Scout Games (EN, 278pg, 18 chunks → 946 activities)
- dragon.sleepdeprived.ca/games mirror (EN, 498pg, 31 chunks → 531)
- Cartea Mare a Jocurilor (RO, 151pg, 10 chunks → 284)
- Activităţi şi jocuri ... .doc (RO, 7pg, 1 chunk → 19, needs_review)
- Amazing Race templates zip (graphics only, 0 activities — expected)

The old activities.db was backed up to .bak before atomic swap.
tests/ still green (71 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:43:42 +00:00
Claude Agent
66ae831c36 Rebuild extraction pipeline infrastructure (Faza 0 prep)
Implements the approved plan to replace the broken regex/index-master
extraction with an LLM-subagent pipeline. Four parallel lanes:

Lane A — scripts/extract_common.py (PDF/docx/doc/pptx/html/zip, no
  max_pages truncation), normalize_sources.py, chunk_sources.py
  (~20pg chunks + overlap, manifest registry), activity_schema.json.
Lane B — app/config_taxonomy.py (16 fixed category slugs), schema
  rebuilt from scratch in app/models/ with content_type, language,
  source_files, source_excerpt, normalized_name, extraction_confidence,
  needs_review; FTS5 + 3 triggers extended with materials_list and
  skills_developed.
Lane C — build_database.py (--rebuild, atomic swap, schema + fuzzy
  source_excerpt validation, dedup with needs_review band),
  validate_extractions.py, review_queue.py, new run_extraction.py
  orchestrator, SUBAGENT_PROMPT.md.
Lane D — search.py content_type/language filters (default search
  excludes non-game content), E7 schema-compat audit; fixed a NULL
  keywords AttributeError in _boost_search_relevance.

Removes 8 orphaned/dead scripts and app/services/parser.py +
indexer.py. Adds tests/ (70 passing, 1 skipped — libreoffice absent).

Note: Lane D made one additive edit to app/models/database.py
(_update_category_counts) to surface content_type/language in
get_filter_options, outside its nominal lane boundary but after
Lane B completed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:43:38 +00:00
50 changed files with 6706 additions and 1905 deletions

12
.gitignore vendored

@@ -165,9 +165,14 @@ cython_debug/
*.db.backup
*.db.bak
*.db.tmp
*.db.prefreeze*
*.sqlite.backup
*.sqlite3.backup
# Agent runtime locks
.claude/scheduled_tasks.lock
.claude/*.lock
# Temporary files
*.tmp
*.backup
@@ -179,6 +184,13 @@ data/sources/
data/chunks/
data/extracted/
# Enrichment pipeline intermediates (LLM output; final result lands in data/activities.db)
data/enrichment_prompts/
data/enrichment_parts/
data/enrichment_batches/
data/enrichment_wf/
data/enrichment.json
# Keep main production database, the hand-written index, and committed golden set
!data/activities.db
!data/INDEX_MASTER_JOCURI_ACTIVITATI.md

71
ENRICHMENT_PILOT.md Normal file

@@ -0,0 +1,71 @@
# Enrichment PILOT — sign-off required before full-corpus scaling
**Date:** 2026-05-29. Pilot covers **34 activities** (the STOP gate from `HANDOFF.md`
step 3, guarding ~68k LLM calls across the full corpus).
## Pipeline integrity (all green)
| Hop | Expected | Actual |
|-----|----------|--------|
| prompts emitted | 34 | 34 |
| part files on disk (valid JSON, key matches filename) | 34 | 34 |
| `enrichment.json` entries after `--collect` | 34 | 34 |
| rebuild overlay: `matched` / `orphaned` | 34 / 0 | **34 / 0** |
No leak at any hop. `orphaned 0` confirms the content_key the rebuild computes
matches what `run_enrichment` emitted (no dedup rep-selection drift).
## Pilot composition
Deliberately mixed to exercise BOTH operations (corpus is 7076 EN / 2465 RO, so
en→ro translation is the dominant + highest-risk path):
- **26** rows from `teambuilding_corbu` — all Romanian → **ro→ro polish**
- **8** rows from `d3959920_outdoor_games` — all English → **en→ro translation**
Result: ~7 genuine en→ro translations + ~27 ro→ro polish.
## Field population (stated vs estimated)
```
age_group_max : 0 stated / 30 estimated
age_group_min : 0 / 34
duration_max : 3 / 29
duration_min : 4 / 28
indoor_outdoor : 12 / 22
participants_max : 0 / 24
participants_min : 4 / 30
space_needed : 2 / 32
```
Almost everything is estimated — sources rarely state ages/durations explicitly.
The pipeline marks every inferred field in `estimated_fields`, and the UI shows an
`(estimat)` marker, so estimates are transparent to end users.
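The stated/estimated split above can be reproduced from the `estimated_fields` markers. A minimal sketch, assuming each row is a dict whose `estimated_fields` is a list of field names and that a null value means the field is unpopulated (the helper name `field_population` is hypothetical and is not the repo's QA code):

```python
from collections import Counter

def field_population(rows, fields):
    """Return {field: (stated, estimated)} counts over populated values only."""
    stated, estimated = Counter(), Counter()
    for row in rows:
        flagged = set(row.get("estimated_fields") or [])
        for field in fields:
            if row.get(field) is None:
                continue  # unpopulated: counts as neither stated nor estimated
            (estimated if field in flagged else stated)[field] += 1
    return {field: (stated[field], estimated[field]) for field in fields}

# Illustrative rows only, not pilot data.
rows = [
    {"duration_min": 10, "space_needed": "small", "estimated_fields": ["space_needed"]},
    {"duration_min": 15, "space_needed": None, "estimated_fields": ["duration_min"]},
]
print(field_population(rows, ["duration_min", "space_needed"]))
```

Under this model a field can sum to fewer than 34 rows, which matches `age_group_max` (0 + 30) in the table above.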
## What to evaluate (the three sign-off axes)
1. **Translation fidelity (en→ro)** — e.g. *Labels → Etichete*, *Ships in a Fog →
Nave în ceață*, *Spot the Colours → Găsește culorile*. Game rules preserved,
no moralizing added, proper terms kept.
2. **Description fidelity / expansion** — ro→ro rows fold in setup/material detail
that IS in the source chunk (e.g. *Găsește-ți fratele și sora* adds "carton A6"
+ "la semnal, toți încep simultan"; *Ce-mi place?* folds in the character-traits
discussion). No invented steps observed.
3. **Estimation plausibility** — mostly reasonable. **Weak spots to judge:** a few
age ranges are very wide/defaulted (e.g. *Găsește-ți fratele și sora* → age
10–99). If wide age defaults are unacceptable, tighten the ENRICHMENT_PROMPT
guidance before scaling.
## Inspect the data yourself
```bash
sqlite3 data/activities.db "select name, name_ro, language, indoor_outdoor, space_needed, estimated_fields from activities where name_ro is not null;"
# raw overlay: data/enrichment.json (34 entries)
# per-activity parts: data/enrichment_parts/*.json
```
## After sign-off (do NOT auto-proceed)
Scale in waves of ~8–16 Sonnet subagents over the rest of the corpus
(`run_enrichment.py` is additive + resumable — skips already-enriched keys),
`--collect`, then final `build_database.py --rebuild --enrichment`.

276
HANDOFF.md Normal file

@@ -0,0 +1,276 @@
# HANDOFF — Enrichment PILOT done; STOP at user sign-off gate before full-corpus scaling
**Snapshot:** 2026-05-29 (updated). Executing plan `enumerated-petting-badger.md`
(bilingual index + enrichment + new filters + source download).
**>>> CURRENT STATE (2026-05-29): user SIGNED OFF on the pilot → full-corpus
enrichment SCALING IN PROGRESS via 8 PARALLEL background Workflows on Sonnet.
Parts on disk (`data/enrichment_parts/<key>.json`) = the durable checkpoint. <<<**
Two earlier single-workflow runs were stopped: the first ran on Opus by mistake
(workflow `agent()` inherits the main-loop model unless `model:'sonnet'` is passed —
fixed). Measured rate: ~4.7 keys/min single-shard, ~17 keys/min at 3 shards (linear,
no real rate-limit errors — the "429" hits in transcripts are line numbers in chunk
text, not API errors). Concurrency is capped at 2 PER workflow (`nproc`=4 →
`min(16,cores-2)`), so parallelism = run many workflows.
**8 shard scripts: `data/enrichment_wf/shard_0.js` … `shard_7.js`**, each owns a
disjoint batch range of `data/enrichment_batches/batch_NNNN.txt` (780 batches × ~12
keys = 9357 keys; ranges: s0 [0,98) s1 [98,196) s2 [196,294) s3 [294,392) s4 [392,490)
s5 [490,588) s6 [588,686) s7 [686,780)). Each agent is IDEMPOTENT (skips keys whose
part already exists + parses), so re-launching any shard is safe. Run IDs:
s0 `wf_3c314d06-01c` · s1 `wf_ecc7d151-a11` · s2 `wf_4156be35-748` ·
s3 `wf_fa16abee-17a` · s4 `wf_a0f595b8-8fe` · s5 `wf_b3505593-09a` ·
s6 `wf_ad0d731e-12e` · s7 `wf_a919a99b-1d2`.
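The shard ranges listed above are consistent with a ceiling-division split of 780 batches across 8 shards. A minimal sketch, assuming that is how they were derived (the function name `shard_ranges` is hypothetical and does not come from the repo):

```python
def shard_ranges(n_batches: int, n_shards: int) -> list[tuple[int, int]]:
    """Split [0, n_batches) into contiguous half-open ranges, one per shard."""
    size = -(-n_batches // n_shards)  # ceiling division
    return [
        (min(k * size, n_batches), min((k + 1) * size, n_batches))
        for k in range(n_shards)
    ]

ranges = shard_ranges(780, 8)
for k, (start, end) in enumerate(ranges):
    print(f"s{k} [{start},{end})")
```

Because the ranges are disjoint and each agent skips keys whose part already exists, relaunching one shard cannot collide with another.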
### ▶ RESUME HERE (2026-06-01 — THROTTLED CRON SYSTEM now drives enrichment)
**Enrichment progress: 3074 / 9541 done (32.2%), ~6467 missing.** Nothing running.
Parts on disk (`data/enrichment_parts/*.json`) are the durable, idempotent checkpoint.
**A paced wave system now runs FULLY HEADLESS via the LXC's OS cron — NO Claude
session required.** Fixes the "always runs to exhaustion" bug: each wave caps at
~75% of a 5h window and the next window is reached by time (cron).
ARCHITECTURE: OS cron → `scripts/enrich_wave.sh` → one `claude -p` per batch,
PAR-way parallel (OS-level — NOT the Workflow tool, which can't be used headless:
`claude -p` is one-shot and would exit before background workflows finish). Each
headless `claude -p` reads a batch file and writes `data/enrichment_parts/<key>.json`.
- **`scripts/enrichment_wave.py`** (prepares a bounded wave, no LLM):
- `--status` — read-only progress (done / missing / pct / corrupt count).
- `--prepare --keys 700 --no-shards` — drop corrupt parts; take FIRST 700
sorted-missing keys; write batch files for ONLY those; print
`WAVE: PREPARED …` or `WAVE: COMPLETE`. `--no-shards` = batch files only
(the headless path). (Without `--no-shards` it also regenerates Workflow shard
JS from `data/enrichment_wf/shard.js.tmpl` — only needed for the old Workflow path.)
- **`scripts/enrich_wave.sh [KEYS] [PAR]`** (the headless orchestrator, run by cron):
flock-guarded (waves never overlap); `--prepare`; if `WAVE: COMPLETE` → `--collect`
+ `--rebuild` and stop; else `xargs -P PAR` one `claude -p` per batch
(`--allowedTools Bash(python3:*),Read,Write,Bash(cat:*),Bash(ls:*)`, `</dev/null`).
Logs to `/workspace/.claude-logs/enrich_<ts>.log`. Detects + logs "WINDOW EXHAUSTED".
- **OS crontab (user `claude`, `crontab -l` to view):** two night fires
`20 23 * * *` and `50 0 * * *` UTC (= 02:20 & 03:50 EEST). Timed AFTER the live-
confirmed **23:00 UTC usage-window reset** so both land in the fresh post-reset
window (user asleep → safe to use it fully; two 700-caps top out at the window's
~950 capacity). Self-healing: a fire into an exhausted window is a harmless no-op
(`claude -p` prints "session limit", writes nothing) and those keys retry next fire.
**Auth caveat:** headless `claude -p` uses the OAuth token in
`~/.claude/.credentials.json` (verified working). If it ever expires and can't refresh
non-interactively, cron fires fail with auth errors → user must `claude` login once.
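The `--prepare` behaviour described above (drop corrupt parts, take the first N sorted-missing keys, write batch files for only those) could look roughly like this. This is a sketch under stated assumptions, not the contents of `scripts/enrichment_wave.py`: the function names, the "corrupt = does not parse as JSON" rule, and the text after `WAVE: PREPARED` are all hypothetical.

```python
import json
from pathlib import Path

SUFFIX = ".prompt.md"

def part_is_valid(path: Path) -> bool:
    """A part counts as valid here if it parses as JSON (assumed rule)."""
    try:
        json.loads(path.read_text(encoding="utf-8"))
        return True
    except (OSError, ValueError):
        return False

def prepare_wave(prompts: Path, parts: Path, batches: Path,
                 keys: int = 700, batch_size: int = 12) -> str:
    # Drop corrupt parts so their keys count as missing again.
    for part in parts.glob("*.json"):
        if not part_is_valid(part):
            part.unlink()
    # A key is missing when its prompt exists but its part does not.
    missing = sorted(
        p.name[: -len(SUFFIX)]
        for p in prompts.glob("*" + SUFFIX)
        if not (parts / (p.name[: -len(SUFFIX)] + ".json")).exists()
    )
    for old in batches.glob("batch_*.txt"):
        old.unlink()
    if not missing:
        return "WAVE: COMPLETE"
    wave = missing[:keys]  # bounded: only the first `keys` missing keys
    for n, i in enumerate(range(0, len(wave), batch_size)):
        batch = wave[i:i + batch_size]
        (batches / f"batch_{n:04d}.txt").write_text(
            "\n".join(batch) + "\n", encoding="utf-8"
        )
    return f"WAVE: PREPARED {len(wave)} of {len(missing)} missing keys"
```

Capping the wave at selection time, rather than stopping workers mid-run, is what keeps each fire below the usage window.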
**Manual fallback (one wave, any time, no session needed):**
```bash
/workspace/game-library/scripts/enrich_wave.sh 700 6 # runs a full wave now
# or step-by-step:
python3 scripts/enrichment_wave.py --status # progress
python3 scripts/run_enrichment.py --collect && python3 scripts/build_database.py --rebuild # at WAVE: COMPLETE
# gate: rebuild must print enrichment {N} (matched N, orphaned 0)
```
**Control:** `crontab -e` to retime/disable; `crontab -r` removes all. Tune `--keys`
(KEYS arg) up to drain faster, down if logs show "WINDOW EXHAUSTED" early. One full
window ≈ 950 keys ≈ 100%.
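As a rough planning aid, the figures quoted in this section imply how many usage windows the remaining work needs. The arithmetic below assumes the ~950 keys/window capacity and the 6467 missing keys stay as stated, and that each night yields one usable window; the variable names are illustrative only.

```python
import math

missing_keys = 6467       # from the progress line above
keys_per_window = 950     # approximate capacity of one 5h window
cores = 4                 # `nproc` on the LXC

windows_needed = math.ceil(missing_keys / keys_per_window)
workflow_concurrency = min(16, cores - 2)  # the per-workflow cap quoted in this document

print(windows_needed)        # 7
print(workflow_concurrency)  # 2
```

Since both nightly fires land in the same post-reset window, this works out to about a week of nights if nothing else draws on the window.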
**Hard facts learned:**
- Workflow concurrency is capped at **2 per workflow** (`nproc`=4 → `min(16,cores-2)`); parallelism = run many workflow processes. 3 shards measured ~17 keys/min (linear, no real rate-limit).
- Workflow `agent()` inherits the **main-loop model unless `model:'sonnet'` is passed** — the FIRST run silently used Opus; always pass model.
- The full corpus does **NOT fit in one 5h usage window** — it needs SEVERAL windows. Parallelism only cuts wall-clock inside a window, not total token budget.
- Main-session token drain was **polling** (sleep/grep loops), NOT launching workflows. Launch + wait-for-notification only.
- StructuredOutput failures appear when a window exhausts mid-run — harmless; idempotent skip + the regenerate-from-missing reconcile recover every dropped key.
(prev note) Earlier STOPPED at 593/9541 — hit 92% of the 5h Anthropic
usage window (resets 23:00 UTC). KEY LESSON: the full corpus does NOT fit in one 5h
window; 6.2% + the session's other work already used ~92%. Enrichment must be spread
across MANY 5h windows (parallelism only cuts wall-clock inside a window, not total
token budget). Resume model: after each window reset, regenerate batches from
currently-missing, relaunch a bounded number of shards, stop before the window
exhausts. Idempotent shards + parts-on-disk make this safe to repeat. Consider a
cron/scheduled job that runs a bounded wave each reset.
**To regenerate batches from currently-missing + relaunch a shard** (reconcile):
```bash
python3 - <<'PY'
import glob, os

BATCH = 12
SUFFIX = '.prompt.md'

# Keys that have a prompt on disk but no part yet.
missing = sorted(
    os.path.basename(p)[:-len(SUFFIX)]
    for p in glob.glob('data/enrichment_prompts/*' + SUFFIX)
    if not os.path.exists('data/enrichment_parts/' + os.path.basename(p)[:-len(SUFFIX)] + '.json')
)
for old in glob.glob('data/enrichment_batches/batch_*.txt'):
    os.remove(old)
batches = [missing[i:i + BATCH] for i in range(0, len(missing), BATCH)]
for n, keys in enumerate(batches):
    with open(f'data/enrichment_batches/batch_{n:04d}.txt', 'w') as fh:
        fh.write('\n'.join(keys) + '\n')
print('missing', len(missing), 'batches', len(batches))
PY
# then edit START/END in the shard files to cover the new batch count and re-invoke Workflow({scriptPath: 'data/enrichment_wf/shard_K.js'})
```
### Resume / completion procedure (do this when the workflow finishes — or to continue a new session)
The pipeline is RESUMABLE: parts on disk are truth; re-running regenerates work only for missing keys.
1. **Rebuild the batch list from what's still missing** (prompt exists, part absent), then re-run the workflow for the gap:
```bash
# regenerate batch files for missing keys (script below already lives in shell history; logic:
# for each data/enrichment_prompts/<key>.prompt.md with no data/enrichment_parts/<key>.json,
# split into data/enrichment_batches/batch_NNNN.txt of 12)
```
The workflow script is at
`.../workflows/scripts/enrich-corpus-wf_440c0a2f-17f.js` (nBatches hardcoded → update it to the new batch count, or it defaults to 793). Re-invoke with `{scriptPath: ...}`.
2. **Reconcile loop** (expect 2–3 passes — some parts WILL drop: flaky agents, a stray quote that slips re-validation): repeat step 1 until `missing == 0`.
3. **Collect + final rebuild ONCE at the end** (don't rebuild after every wave — 9541 rows is wasted work):
```bash
python3 scripts/run_enrichment.py --collect # robust: repairs stray-quote parts, skips+reports truly-broken
python3 scripts/build_database.py --rebuild # picks up --enrichment by default
```
**Gate:** rebuild must print `enrichment {entries} (matched {entries}, orphaned 0)`. Done-criterion is the reconcile counts converging: `emitted == parts-on-disk == entries == matched`.
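The gate can be checked mechanically instead of by eye. A minimal sketch, assuming the rebuild prints a line of the shape `enrichment N (matched N, orphaned 0)` as quoted above; the exact output format of `build_database.py` is not confirmed here, and `gate_passes` is a hypothetical helper:

```python
import re

# Assumed shape of the QA line, taken from the gate description above.
GATE_RE = re.compile(r"enrichment\s+(\d+)\s+\(matched\s+(\d+),\s+orphaned\s+(\d+)\)")

def gate_passes(rebuild_output: str) -> bool:
    """True only when every enrichment entry matched and none orphaned."""
    match = GATE_RE.search(rebuild_output)
    if match is None:
        return False  # no QA line at all is treated as a failed gate
    entries, matched, orphaned = map(int, match.groups())
    return entries == matched and orphaned == 0

print(gate_passes("enrichment 34 (matched 34, orphaned 0)"))  # True
print(gate_passes("enrichment 34 (matched 33, orphaned 1)"))  # False
```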
### ⚠ FREEZE IS NOW LOCKED
Enrichment content_keys depend on the current freeze. The earlier "re-freezing is safe"
note is **INVERTED** now — do NOT re-extract or re-freeze `data/extracted/` until the
final `--rebuild`, or content_keys drift and the overlay orphans.
## Where we are
| Step (plan Part C) | Status |
|--------------------|--------|
| 1. Finish extraction | **DONE** — 582 chunks extracted & valid (7 re-extracted this session); 6 content-filter-blocked, accepted as missing |
| 2. Land code Part A1–A4 (model/schema/merge) | **DONE & committed** |
| 2b. Code Part A5–A8 (UI/search/download) | **DONE & committed** |
| 2c. Code Part B2–B4 (enrichment pipeline) | **DONE & committed** |
| 3. Freeze rebuild (freezes content_keys) | **DONE** — `data/activities.db` = **9541 activities** (re-frozen with the 7 chunks) |
| Part D tests | **DONE** — `tests/test_enrichment.py`, 99 pass total |
| 4. Enrichment pilot → **STOP for user sign-off** | **DONE — 34 activities enriched (26 ro-polish + 8 en→ro), pipeline 34/34 matched, 0 orphaned. AWAITING SIGN-OFF.** |
| 5. Final rebuild `--enrichment` | not started (post sign-off) |
## The 7 re-extracted chunks (this session)
Re-extracted via Sonnet subagents, all valid JSON, re-frozen into the corpus.
One (`d297a434…part01`) had an activity named "Eu" (<3 chars, schema-rejected);
renamed faithfully to "Eu sunt..." (matches the source affirmation). The 6
content-filter-blocked chunks remain accepted as missing.
Everything is committed except whatever this session leaves dirty. `data/extracted/*.json`
is gitignored (575 files on disk, durable across /clear).
## The 13 missing chunks (out of 588)
**6 content-filter-blocked** (Anthropic safety; accept as missing — marginal loss):
- `87850302_dragon_sleepdeprived.part73 / .part85 / .part94` (camp song lyrics)
- `c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96`
**7 need RE-EXTRACTION** (their malformed-original JSON was destroyed — see "json_repair
incident" below; re-extract once the subagent session limit resets, ~5pm UTC):
```
3f9c8232_teambuilding_corbu_29092023.part01
5f959f85_scoli_fara_bullying.part02
83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04
d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05
e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03
```
Re-extract these (Sonnet subagents, one Agent call each, the per-chunk prompt is at
`data/chunks/_prompts/<key>.prompt.md`), then **re-run the freeze rebuild** so they join
the corpus before enrichment. Re-freezing is safe now — enrichment has NOT run, so no
overlay keys depend on the current freeze yet.
## The json_repair incident (important — root cause + what was fixed)
Subagents **systematically emit unescaped ASCII `"` inside string values** (Romanian
text like `„Unu"` uses a closing `"` that terminates the JSON string early). ~34 files
were affected.
First repair attempt used the `json_repair` lib. **It truncates**: on a stray quote it
ends the string and reinterprets the trailing text as a new key, silently dropping the
rest of the value and injecting garbage keys. Schema `additionalProperties:false` caught
the garbage-key cases (8 files dropped at rebuild), but the truncation that didn't create
an extra key slipped through. Applying json_repair output to disk also **overwrote the
malformed originals** for those 8 → originals lost → those (now 7, one recovered) need
re-extraction.
**Fix:** `scripts/repair_extractions.py` was rewritten to use a faithful char-scanner
(`escape_stray_quotes`) that **escapes** stray quotes (`\"`) instead of splitting on them,
validates against the real schema, and only replaces a valid top-level file when the
repaired version carries **strictly more text** (a length guard that catches truncated
json_repair output while leaving genuine extractions untouched). Re-running it cleanly
repaired the affected files; the final freeze had **0 schema-rejected, 0 invalid**.
`json_repair` is no longer used anywhere. Do NOT reintroduce it.
`build_database.py` does NOT depend on the repair script (the "DB regenerable from
data/extracted/" invariant holds — plain `json.loads` only).
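To illustrate the escape-instead-of-split idea, here is a minimal char-scanner sketch. It is not the repo's `escape_stray_quotes`: the closing-quote heuristic (a `"` ends a string only when the next non-whitespace character is `,`, `}`, `]`, `:` or end of input) is an assumption, and it would misjudge a stray quote that happens to be followed by a comma.

```python
import json

def escape_stray_quotes(text: str) -> str:
    """Escape ASCII quotes inside JSON string values that do not close the string."""
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
        elif ch == "\\" and i + 1 < n:
            # Keep existing escape sequences intact.
            out.append(ch + text[i + 1])
            i += 1
        elif ch == '"':
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if j == n or text[j] in ",}]:":
                in_string = False      # a genuine closing quote
                out.append(ch)
            else:
                out.append('\\"')      # stray quote: escape, never truncate
        else:
            out.append(ch)
        i += 1
    return "".join(out)

broken = '{"name": "Jocul „Unu" și doi", "players": 4}'
fixed = escape_stray_quotes(broken)
print(json.loads(fixed)["name"])
```

A scanner of this kind can only add backslashes, so the repaired text is never shorter than the input, which is the property the length guard relies on.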
## What the code does now (all committed)
**Part A — plumbing (corpus-independent):**
- `app/models/database.py`: new columns `name_ro/description_ro/rules_ro/variations_ro,
indoor_outdoor, space_needed, estimated_fields(JSON), source_id, source_ids(JSON),
chunk_key`; FTS5 indexes the 4 `*_ro` columns (CREATE + all 3 triggers — kept in sync);
indexes on `indoor_outdoor`/`space_needed`; `search_activities` gained `indoor_outdoor`
and `space_needed` equality kwargs; `_update_category_counts` feeds both new axes into
the categories table so dropdowns populate.
- `app/models/activity.py`: new fields + `to_dict`/`from_dict`; helpers `get_display_name`
/ `get_display_description` / `get_display_rules` / `get_display_variations`
(RO-primary, EN fallback), `has_translation`, `is_estimated(field)`,
`get_indoor_outdoor_display`, `get_space_needed_display`.
- `app/config_taxonomy.py`: `INDOOR_OUTDOOR`, `SPACE_NEEDED` enums + RO labels +
`normalize_indoor_outdoor` / `normalize_space_needed` (None on unrecognised, no
fallback — never fabricate a value) + display-name helpers.
- `scripts/build_database.py`: `dict_to_activity` sets `source_id`+`chunk_key`;
`merge_cluster` unions `source_ids` and carries rep's `source_id`/`chunk_key` but
**never** touches enrichment fields (those are applied post-dedup).
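The "None on unrecognised, never fabricate" rule for the normalizers can be sketched as follows. The allowed values shown are placeholders (the real `INDOOR_OUTDOOR` enum members are not listed in this document), and `normalize_axis` is a hypothetical stand-in for `normalize_indoor_outdoor` / `normalize_space_needed`:

```python
import unicodedata
from typing import Iterable, Optional

# Placeholder vocabulary: an assumption, not the repo's actual enum.
INDOOR_OUTDOOR_VALUES = {"indoor", "outdoor", "both"}

def normalize_axis(value: object, allowed: Iterable[str]) -> Optional[str]:
    """Return the canonical slug, or None when the value is not recognised."""
    if not isinstance(value, str):
        return None
    # Strip diacritics, surrounding whitespace and case before matching.
    folded = unicodedata.normalize("NFKD", value)
    folded = "".join(c for c in folded if not unicodedata.combining(c))
    slug = folded.strip().lower()
    return slug if slug in set(allowed) else None  # no fallback value

print(normalize_axis(" Outdoor ", INDOOR_OUTDOOR_VALUES))
print(normalize_axis("afară", INDOOR_OUTDOOR_VALUES))
```

Returning `None` rather than a default keeps the equality filters honest: an activity with an unknown value simply does not appear under any dropdown option.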
**Part A — UI/search:**
- `app/services/search.py`: `_map_filters_to_db_fields` maps `indoor_outdoor`/
`space_needed` to DB equality filters.
- `app/web/routes.py`: new `/source/<id>` download route — **shipped DARK behind
`SOURCE_DOWNLOAD_ENABLED` (default false; copyright exposure, user opts in)**; resolves
`source_file` under `CORPUS_DIR` via `send_from_directory` (traversal-safe, 404s for
web-mirror sources). `DISPLAY_NAMES` extended with both new axes.
- `app/config.py`: `SOURCE_DOWNLOAD_ENABLED`, `CORPUS_DIR`.
- Templates: `index.html`/`results.html` have the 2 new dropdowns; cards use display
helpers + `(estimat)` markers; `activity.html` is RO-primary with a collapsible
"Text original" section, indoor/space cards, estimat markers, and the download link
(only when the flag is on). `main.css` has `.estimated` / `.original-text` styles.
**Part B — enrichment pipeline (built, not yet run):**
- `scripts/build_database.py`: `load_enrichment` + `apply_enrichment(activities, enrichment)`
applied **right after** `apply_review_decisions`, on the post-dedup list, keyed on
`import_common.content_key(normalized_name, language, _normalize_text(description))`
(reused verbatim). CLI `--enrichment` (default `data/enrichment.json`). QA report prints
`enrichment {entries, matched, orphaned}` + per-field **stated vs estimated** counts.
Translated/expanded text is NOT re-validated against source (by design).
- `scripts/run_enrichment.py`: reads the rebuilt DB, computes each row's content_key,
skips rows already in `data/enrichment_parts/<key>.json` (resumable), emits one prompt
per activity to `data/enrichment_prompts/` (current EN fields + source chunk text via
`find_chunk_text`). Pilot scoping: `--source <substr>` and/or `--limit N`. `--collect`
merges parts → `data/enrichment.json`.
- `scripts/ENRICHMENT_PROMPT.md`: single-pass rules — translate faithfully, expand
`description_ro` ONLY from chunk text, mark inferred filter fields in `estimated_fields`,
fixed enum vocab, output `data/enrichment_parts/<content_key>.json` including `content_key`.
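The overlay step and its `matched` / `orphaned` accounting can be pictured like this. It is a simplified sketch: it assumes each post-dedup activity already carries its computed `content_key` and that `enrichment.json` is a dict keyed by that key; `overlay_enrichment` is a hypothetical name, not the repo's `apply_enrichment`.

```python
def overlay_enrichment(activities: list[dict], enrichment: dict[str, dict]) -> dict:
    """Apply enrichment fields onto activities by content_key and report counts."""
    by_key = {activity["content_key"]: activity for activity in activities}
    matched = orphaned = 0
    for key, fields in enrichment.items():
        target = by_key.get(key)
        if target is None:
            orphaned += 1  # key drifted: no activity computes this content_key
            continue
        target.update({k: v for k, v in fields.items() if k != "content_key"})
        matched += 1
    return {"entries": len(enrichment), "matched": matched, "orphaned": orphaned}

activities = [{"content_key": "k1", "name": "Labels"}]
enrichment = {
    "k1": {"content_key": "k1", "name_ro": "Etichete"},
    "k2": {"content_key": "k2", "name_ro": "Nave în ceață"},
}
print(overlay_enrichment(activities, enrichment))
```

Any non-zero `orphaned` count means the keys the rebuild computes no longer match what was emitted, which is why the freeze must not change while enrichment is in flight.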
## Exact next steps
1. **Re-extract the 7 chunks** above (after session-limit reset). Verify each writes valid
JSON (`python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"`).
If any come back malformed, `python3 scripts/repair_extractions.py --apply` (faithful now).
2. **Re-freeze:** `python3 scripts/build_database.py --rebuild` — confirm 0 schema-rejected,
note the new total (~9418 + the 7 chunks' activities).
3. **Enrichment PILOT** (plan B5 — the STOP gate guarding 68k LLM calls):
- Pick one source, e.g. `python3 scripts/run_enrichment.py --source teambuilding_corbu`
(or `--limit 30`). This writes prompts to `data/enrichment_prompts/`.
- Launch a small wave of Sonnet subagents on those prompts (each writes
`data/enrichment_parts/<key>.json`).
- `python3 scripts/run_enrichment.py --collect` → `data/enrichment.json`.
- `python3 scripts/build_database.py --rebuild` (picks up `--enrichment` by default).
- **STOP. Hand the user translation-quality + estimation-plausibility + description-
fidelity samples and get sign-off BEFORE scaling to the full corpus.** Do not
auto-proceed past this gate.
4. After sign-off: scale enrichment in waves of ~8–16 Sonnet subagents, `--collect`,
final `--rebuild --enrichment`.
## Verify / run
- Tests: `python3 -m pytest tests/ -q` → 99 pass.
- App: `SOURCE_DOWNLOAD_ENABLED` is false by default (download link hidden). Set it true
only if the user accepts the copyright exposure of serving original files.
- `data/activities.db.bak` is the pre-this-freeze backup.


@@ -22,6 +22,18 @@ class Config:
    # Search settings
    SEARCH_RESULTS_LIMIT = int(os.environ.get('SEARCH_RESULTS_LIMIT', '100'))
    FTS_ENABLED = True

    # Source-file download (plan A6). Shipped DARK by default: serving the
    # original PDFs/books carries a copyright exposure the user must opt into.
    # The /source/<id> route 404s entirely while this is false; the UI hides
    # the download link. Enable with SOURCE_DOWNLOAD_ENABLED=true.
    SOURCE_DOWNLOAD_ENABLED = (
        os.environ.get('SOURCE_DOWNLOAD_ENABLED', 'false').lower() == 'true'
    )
    # Root of the original corpus. source_file values are relative to this.
    CORPUS_DIR = os.environ.get('CORPUS_DIR') or str(
        Path(__file__).parent.parent / 'data' / 'carti-camp-jocuri'
    )

    @staticmethod
    def ensure_directories():

313
app/config_taxonomy.py Normal file

@@ -0,0 +1,313 @@
"""
Controlled category taxonomy for game-library.
Single source of truth for activity categories. The DB stores the *slug*;
the UI displays the Romanian name. `category` (thematic domain) and
`content_type` (form of the content) are INDEPENDENT axes — see plan §2.
"""
import unicodedata
import re
from typing import Dict, List, Optional
# --- Categories (thematic domain) --------------------------------------------
# slug -> Romanian display name. ~16 fixed slugs; `altele` is the mandatory
# fallback and MUST always be present.
CATEGORIES: Dict[str, str] = {
"jocuri-cercetasesti": "Jocuri cercetășești",
"team-building": "Team-building",
"icebreakers": "Icebreakers / spargerea gheții",
"camp-outdoor": "Tabără și activități în aer liber",
"wide-games": "Wide games / jocuri de teren",
"orientare": "Orientare",
"prim-ajutor": "Prim ajutor",
"escape-room-puzzle": "Escape room și puzzle",
"creative-stem": "Creativitate și STEM",
"sports-active": "Sport și activități fizice",
"cantece-ceremonii": "Cântece și ceremonii",
"retete": "Rețete",
"supravietuire": "Supraviețuire",
"integrare-incluziune": "Integrare și incluziune",
"conflict-empatie": "Conflict și empatie",
"altele": "Altele",
}
# Mandatory fallback slug.
FALLBACK_CATEGORY = "altele"
# Ordered list of valid slugs.
CATEGORY_SLUGS: List[str] = list(CATEGORIES.keys())
# --- Content type (form of the content) --------------------------------------
# Independent axis from `category`. The UI default search excludes the
# non-game content types (see plan §6).
CONTENT_TYPES: Dict[str, str] = {
"joc": "Joc",
"activitate": "Activitate",
"reteta": "Rețetă",
"cantec": "Cântec",
"ceremonie": "Ceremonie",
}
CONTENT_TYPE_SLUGS: List[str] = list(CONTENT_TYPES.keys())
# Content types considered "non-game" — excluded from the default UI search.
NON_GAME_CONTENT_TYPES: List[str] = ["reteta", "cantec", "ceremonie"]
DEFAULT_CONTENT_TYPE = "activitate"
# --- Aliases -----------------------------------------------------------------
# Map of normalized arbitrary strings -> canonical slug. Keys are already
# diacritic-stripped, lowercased and hyphenated (see _slugify). This catches
# legacy / messy values from the old DB and common English/Romanian variants.
_CATEGORY_ALIASES: Dict[str, str] = {
# legacy junk
"general-activity": "altele",
"general": "altele",
"educational": "creative-stem",
"d": "altele",
"a": "altele",
"b": "altele",
"c": "altele",
# scouting
"cercetasie": "jocuri-cercetasesti",
"cercetasesti": "jocuri-cercetasesti",
"scout": "jocuri-cercetasesti",
"scouting": "jocuri-cercetasesti",
"scout-games": "jocuri-cercetasesti",
"jocuri-cercetasesti": "jocuri-cercetasesti",
# team building
"teambuilding": "team-building",
"team": "team-building",
"cooperare": "team-building",
# icebreakers
"icebreaker": "icebreakers",
"spargerea-ghetii": "icebreakers",
"cunoastere": "icebreakers",
"energizers": "icebreakers",
"energizer": "icebreakers",
# camp / outdoor
"camp": "camp-outdoor",
"tabara": "camp-outdoor",
"outdoor": "camp-outdoor",
"aer-liber": "camp-outdoor",
# wide games
"wide-game": "wide-games",
"jocuri-de-teren": "wide-games",
"joc-de-teren": "wide-games",
"big-games": "wide-games",
# orientare
"orienteering": "orientare",
"navigatie": "orientare",
# prim ajutor
"first-aid": "prim-ajutor",
"primul-ajutor": "prim-ajutor",
# escape room / puzzle
"escape-room": "escape-room-puzzle",
"escaperoom": "escape-room-puzzle",
"puzzle": "escape-room-puzzle",
"puzzles": "escape-room-puzzle",
"ghicitori": "escape-room-puzzle",
# creative / stem
"creative": "creative-stem",
"creativitate": "creative-stem",
"stem": "creative-stem",
"arts-and-crafts": "creative-stem",
"craft": "creative-stem",
"crafts": "creative-stem",
"stiinta": "creative-stem",
# sports
"sport": "sports-active",
"sports": "sports-active",
"sportive": "sports-active",
"active": "sports-active",
"miscare": "sports-active",
"physical": "sports-active",
# songs / ceremonies
"cantece": "cantece-ceremonii",
"cantec": "cantece-ceremonii",
"songs": "cantece-ceremonii",
"ceremonii": "cantece-ceremonii",
"ceremonie": "cantece-ceremonii",
"ceremony": "cantece-ceremonii",
# recipes
"reteta": "retete",
"recipe": "retete",
"recipes": "retete",
"cooking": "retete",
"gatit": "retete",
# survival
"survival": "supravietuire",
"supravietuire": "supravietuire",
# inclusion
"integrare": "integrare-incluziune",
"incluziune": "integrare-incluziune",
"inclusion": "integrare-incluziune",
# conflict / empathy
"conflict": "conflict-empatie",
"empatie": "conflict-empatie",
"empathy": "conflict-empatie",
"rezolvarea-conflictelor": "conflict-empatie",
# fallback
"altele": "altele",
"other": "altele",
"others": "altele",
"misc": "altele",
}
def _slugify(value: str) -> str:
"""Lowercase, strip diacritics, collapse non-alphanumerics to hyphens."""
if not value:
return ""
# Decompose accents (ă -> a, ș -> s, ț -> t, etc.)
decomposed = unicodedata.normalize("NFKD", value)
ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
ascii_str = ascii_str.lower().strip()
ascii_str = re.sub(r"[^a-z0-9]+", "-", ascii_str)
return ascii_str.strip("-")
def normalize_category(value: str) -> str:
"""Map an arbitrary string to a valid category slug.
Returns one of CATEGORY_SLUGS, falling back to `altele` for anything
unrecognised or empty.
"""
if not value:
return FALLBACK_CATEGORY
slug = _slugify(str(value))
if not slug:
return FALLBACK_CATEGORY
# Exact slug match.
if slug in CATEGORIES:
return slug
# Alias match.
if slug in _CATEGORY_ALIASES:
return _CATEGORY_ALIASES[slug]
return FALLBACK_CATEGORY
def normalize_content_type(value: str) -> str:
"""Map an arbitrary string to a valid content_type slug.
Returns one of CONTENT_TYPE_SLUGS, falling back to `activitate`.
"""
if not value:
return DEFAULT_CONTENT_TYPE
slug = _slugify(str(value))
if slug in CONTENT_TYPES:
return slug
# Light alias handling for plural / English forms.
aliases = {
"jocuri": "joc",
"game": "joc",
"games": "joc",
"activitati": "activitate",
"activity": "activitate",
"retete": "reteta",
"recipe": "reteta",
"cantece": "cantec",
"song": "cantec",
"ceremonii": "ceremonie",
"ceremony": "ceremonie",
}
return aliases.get(slug, DEFAULT_CONTENT_TYPE)
# --- Indoor / outdoor (enrichment axis) --------------------------------------
# Where the activity is run. Inferred during enrichment when the source is
# silent — such inferences are flagged in `estimated_fields`. slug -> RO label.
INDOOR_OUTDOOR: Dict[str, str] = {
"indoor": "Interior",
"outdoor": "Exterior",
"either": "Interior sau exterior",
}
# --- Space needed (enrichment axis) ------------------------------------------
# Rough footprint the activity requires. slug -> RO label.
SPACE_NEEDED: Dict[str, str] = {
"mic": "Spațiu mic",
"mediu": "Spațiu mediu",
"mare": "Spațiu mare",
}
# Aliases for robustness against LLM output variation. Keys are _slugify'd.
_INDOOR_OUTDOOR_ALIASES: Dict[str, str] = {
"interior": "indoor",
"inside": "indoor",
"in": "indoor",
"exterior": "outdoor",
"outside": "outdoor",
"out": "outdoor",
"aer-liber": "outdoor",
"both": "either",
"any": "either",
"ambele": "either",
"interior-exterior": "either",
"indoor-outdoor": "either",
}
_SPACE_NEEDED_ALIASES: Dict[str, str] = {
"small": "mic",
"redus": "mic",
"putin": "mic",
"medium": "mediu",
"moderat": "mediu",
"large": "mare",
"big": "mare",
"mult": "mare",
"spatiu-mic": "mic",
"spatiu-mediu": "mediu",
"spatiu-mare": "mare",
}
def normalize_indoor_outdoor(value: str) -> Optional[str]:
"""Map an arbitrary string to an indoor_outdoor slug, or None.
Unlike categories, this has NO mandatory fallback: an unrecognised or
empty value yields None (field simply absent), so we never fabricate a
location the enrichment did not assert.
"""
if not value:
return None
slug = _slugify(str(value))
if slug in INDOOR_OUTDOOR:
return slug
return _INDOOR_OUTDOOR_ALIASES.get(slug)
def normalize_space_needed(value: str) -> Optional[str]:
"""Map an arbitrary string to a space_needed slug, or None (no fallback)."""
if not value:
return None
slug = _slugify(str(value))
if slug in SPACE_NEEDED:
return slug
return _SPACE_NEEDED_ALIASES.get(slug)
def indoor_outdoor_display_name(slug: str) -> str:
"""RO display name for an indoor_outdoor slug."""
return INDOOR_OUTDOOR.get(slug, slug)
def space_needed_display_name(slug: str) -> str:
"""RO display name for a space_needed slug."""
return SPACE_NEEDED.get(slug, slug)
def is_valid_category(slug: str) -> bool:
"""True if `slug` is a valid category slug."""
return slug in CATEGORIES
def category_display_name(slug: str) -> str:
"""Romanian display name for a slug (fallback to the slug itself)."""
return CATEGORIES.get(slug, slug)
def content_type_display_name(slug: str) -> str:
"""Romanian display name for a content_type slug."""
return CONTENT_TYPES.get(slug, slug)
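
The two normalizer families above behave differently on unknown input: the category axis always falls back to `altele`, while the enrichment axes return `None`. A minimal sketch of that contrast follows, assuming tiny illustrative subsets of the tables (the names `CATEGORY_ALIASES`, `INDOOR_OUTDOOR_ALIASES` and the un-prefixed `slugify` are stand-ins for this example, not the module's own objects).

```python
import re
import unicodedata
from typing import Dict, Optional, Set

# Illustrative subsets only; the full tables live in config_taxonomy.
CATEGORIES: Set[str] = {"icebreakers", "prim-ajutor", "altele"}
CATEGORY_ALIASES: Dict[str, str] = {
    "spargerea-ghetii": "icebreakers",
    "first-aid": "prim-ajutor",
}
INDOOR_OUTDOOR: Set[str] = {"indoor", "outdoor", "either"}
INDOOR_OUTDOOR_ALIASES: Dict[str, str] = {"aer-liber": "outdoor", "ambele": "either"}


def slugify(value: str) -> str:
    """Lowercase, strip diacritics, collapse non-alphanumerics to hyphens."""
    decomposed = unicodedata.normalize("NFKD", value or "")
    ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", "-", ascii_str.lower().strip()).strip("-")


def normalize_category(value: str) -> str:
    """Category axis: always returns a valid slug, falling back to 'altele'."""
    slug = slugify(value)
    if slug in CATEGORIES:
        return slug
    return CATEGORY_ALIASES.get(slug, "altele")


def normalize_indoor_outdoor(value: str) -> Optional[str]:
    """Enrichment axis: returns None rather than inventing a location."""
    slug = slugify(value)
    if slug in INDOOR_OUTDOOR:
        return slug
    return INDOOR_OUTDOOR_ALIASES.get(slug)
```

Returning `None` on the enrichment axes keeps an unrecognised LLM answer from turning into a filterable value that the source never asserted.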


@@ -5,6 +5,22 @@ Activity data model for INDEX-SISTEM-JOCURI v2.0
from dataclasses import dataclass, field
from typing import List, Optional, Dict, Any
import json
import re
import unicodedata
def normalize_name(name: str) -> str:
"""Diacritic-free, lowercased, whitespace-collapsed form of a name.
Used as the exact-match key for dedup grouping (see plan §4).
"""
if not name:
return ""
decomposed = unicodedata.normalize("NFKD", name)
ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
ascii_str = ascii_str.lower().strip()
ascii_str = re.sub(r"\s+", " ", ascii_str)
return ascii_str
@dataclass
class Activity:
@@ -19,10 +35,19 @@ class Activity:
# Categories
category: str = ""
subcategory: Optional[str] = None
# content_type is an axis INDEPENDENT of category:
# one of joc/activitate/reteta/cantec/ceremonie (see config_taxonomy).
content_type: Optional[str] = None
# Source information
source_file: str = ""
page_reference: Optional[str] = None
# source_files: every source the activity was seen in (a list in memory,
# JSON-encoded when stored in the DB). `source_file` (singular) stays as the
# primary/original source; build_database (Lane C) accumulates the full list
# here on dedup-merge.
source_files: List[str] = field(default_factory=list)
# Short verbatim quote from the source — anti-hallucination anchor.
source_excerpt: Optional[str] = None
# Age and participants
age_group_min: Optional[int] = None
@@ -44,11 +69,41 @@ class Activity:
keywords: Optional[str] = None
tags: List[str] = field(default_factory=list)
popularity_score: int = 0
# Extraction / language metadata
language: Optional[str] = None # 'ro' / 'en'
normalized_name: Optional[str] = None # dedup key; auto-derived from name
extraction_confidence: Optional[str] = None # 'high' / 'med' / 'low'
needs_review: int = 0
# Enrichment overlay (applied at build time from data/enrichment.json; see
# plan Part B). Bilingual: the EN/source text stays in name/description/...
# and the Romanian rendering lands in the *_ro twins. Absent fields leave
# the underlying DB value untouched.
name_ro: Optional[str] = None
description_ro: Optional[str] = None
rules_ro: Optional[str] = None
variations_ro: Optional[str] = None
indoor_outdoor: Optional[str] = None # slug: indoor / outdoor / either
space_needed: Optional[str] = None # slug: mic / mediu / mare
# Names of fields whose value was INFERRED by enrichment (source was
# silent) rather than stated in the source — surfaced as "(estimat)" in UI.
estimated_fields: List[str] = field(default_factory=list)
# Source provenance for the download route + enrichment keying.
source_id: Optional[str] = None # e.g. "876d1a2d_marcaje_turistice"
source_ids: List[str] = field(default_factory=list) # all source_ids merged
chunk_key: Optional[str] = None # e.g. "<source_id>.part01"
# Database fields
id: Optional[int] = None
created_at: Optional[str] = None
updated_at: Optional[str] = None
def __post_init__(self):
"""Derive normalized_name from name when not explicitly provided."""
if not self.normalized_name:
self.normalized_name = normalize_name(self.name)
def to_dict(self) -> Dict[str, Any]:
"""Convert activity to dictionary for database storage"""
@@ -59,8 +114,11 @@ class Activity:
'variations': self.variations,
'category': self.category,
'subcategory': self.subcategory,
'content_type': self.content_type,
'source_file': self.source_file,
'source_files': json.dumps(self.source_files) if self.source_files else None,
'page_reference': self.page_reference,
'source_excerpt': self.source_excerpt,
'age_group_min': self.age_group_min,
'age_group_max': self.age_group_max,
'participants_min': self.participants_min,
@@ -73,7 +131,21 @@ class Activity:
'difficulty_level': self.difficulty_level,
'keywords': self.keywords,
'tags': json.dumps(self.tags) if self.tags else None,
'popularity_score': self.popularity_score
'popularity_score': self.popularity_score,
'language': self.language,
'normalized_name': self.normalized_name or normalize_name(self.name),
'extraction_confidence': self.extraction_confidence,
'needs_review': self.needs_review,
'name_ro': self.name_ro,
'description_ro': self.description_ro,
'rules_ro': self.rules_ro,
'variations_ro': self.variations_ro,
'indoor_outdoor': self.indoor_outdoor,
'space_needed': self.space_needed,
'estimated_fields': json.dumps(self.estimated_fields) if self.estimated_fields else None,
'source_id': self.source_id,
'source_ids': json.dumps(self.source_ids) if self.source_ids else None,
'chunk_key': self.chunk_key,
}
@classmethod
@@ -86,7 +158,30 @@ class Activity:
tags = json.loads(data['tags'])
except (json.JSONDecodeError, TypeError):
tags = []
# source_files may arrive as a JSON string (DB) or a list (extraction)
source_files = data.get('source_files')
if isinstance(source_files, str):
try:
source_files = json.loads(source_files)
except (json.JSONDecodeError, TypeError):
source_files = []
elif source_files is None:
source_files = []
# estimated_fields / source_ids: JSON string (DB) or list (in-memory)
def _json_list(value):
if isinstance(value, str):
try:
parsed = json.loads(value)
return parsed if isinstance(parsed, list) else []
except (json.JSONDecodeError, TypeError):
return []
return list(value) if value else []
estimated_fields = _json_list(data.get('estimated_fields'))
source_ids = _json_list(data.get('source_ids'))
return cls(
id=data.get('id'),
name=data.get('name', ''),
@@ -95,8 +190,11 @@ class Activity:
variations=data.get('variations'),
category=data.get('category', ''),
subcategory=data.get('subcategory'),
content_type=data.get('content_type'),
source_file=data.get('source_file', ''),
source_files=source_files,
page_reference=data.get('page_reference'),
source_excerpt=data.get('source_excerpt'),
age_group_min=data.get('age_group_min'),
age_group_max=data.get('age_group_max'),
participants_min=data.get('participants_min'),
@@ -110,6 +208,20 @@ class Activity:
keywords=data.get('keywords'),
tags=tags,
popularity_score=data.get('popularity_score', 0),
language=data.get('language'),
normalized_name=data.get('normalized_name'),
extraction_confidence=data.get('extraction_confidence'),
needs_review=data.get('needs_review', 0) or 0,
name_ro=data.get('name_ro'),
description_ro=data.get('description_ro'),
rules_ro=data.get('rules_ro'),
variations_ro=data.get('variations_ro'),
indoor_outdoor=data.get('indoor_outdoor'),
space_needed=data.get('space_needed'),
estimated_fields=estimated_fields,
source_id=data.get('source_id'),
source_ids=source_ids,
chunk_key=data.get('chunk_key'),
created_at=data.get('created_at'),
updated_at=data.get('updated_at')
)
@@ -150,4 +262,44 @@ class Activity:
return self.materials_category
elif self.materials_list:
return self.materials_list[:100] + "..." if len(self.materials_list) > 100 else self.materials_list
return "nu specificate"
return "nu specificate"
# --- Enrichment / bilingual display helpers ------------------------------
def get_display_name(self) -> str:
"""Romanian name when enriched, else the original."""
return self.name_ro or self.name
def get_display_description(self) -> str:
"""Romanian description when enriched, else the original."""
return self.description_ro or self.description
def get_display_rules(self) -> Optional[str]:
"""Romanian rules when enriched, else the original."""
return self.rules_ro or self.rules
def get_display_variations(self) -> Optional[str]:
"""Romanian variations when enriched, else the original."""
return self.variations_ro or self.variations
def has_translation(self) -> bool:
"""True if any Romanian enrichment text is present."""
return bool(self.name_ro or self.description_ro
or self.rules_ro or self.variations_ro)
def is_estimated(self, field_name: str) -> bool:
"""True if `field_name` was inferred by enrichment (source was silent)."""
return field_name in (self.estimated_fields or [])
def get_indoor_outdoor_display(self) -> Optional[str]:
"""RO label for indoor_outdoor, or None when unset."""
if not self.indoor_outdoor:
return None
from app.config_taxonomy import indoor_outdoor_display_name
return indoor_outdoor_display_name(self.indoor_outdoor)
def get_space_needed_display(self) -> Optional[str]:
"""RO label for space_needed, or None when unset."""
if not self.space_needed:
return None
from app.config_taxonomy import space_needed_display_name
return space_needed_display_name(self.space_needed)
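
The list-valued fields (`source_files`, `estimated_fields`, `source_ids`) are stored as JSON text and read back tolerantly. The round trip can be sketched as below; `json_list` mirrors the nested `_json_list` helper in `from_dict`, and `dump_list` is a hypothetical name for the inline `json.dumps(...) if ... else None` expression used in `to_dict`.

```python
import json
from typing import Any, List, Optional


def json_list(value: Any) -> List[str]:
    """Accept a JSON string (DB row) or a list (in-memory); anything else -> []."""
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except (json.JSONDecodeError, TypeError):
            return []
        # A valid JSON value that is not a list (e.g. an object) is rejected too.
        return parsed if isinstance(parsed, list) else []
    return list(value) if value else []


def dump_list(items: List[str]) -> Optional[str]:
    """Empty lists are stored as NULL rather than the string '[]'."""
    return json.dumps(items) if items else None


stored = dump_list(["duration_min", "space_needed"])
round_tripped = json_list(stored)
```

Because an empty list is written as NULL, the reader has to treat `None` as an empty list, which is why the helper never raises on missing or malformed input.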


@@ -30,6 +30,8 @@ class DatabaseManager:
"""Initialize database with v2.0 schema"""
with self._get_connection() as conn:
# Main activities table
# NOTE: schema is rebuilt from scratch (plan §6) — no in-place
# migration. The old DB is deleted and recreated by build_database.
conn.execute("""
CREATE TABLE IF NOT EXISTS activities (
id INTEGER PRIMARY KEY AUTOINCREMENT,
@@ -39,9 +41,12 @@ class DatabaseManager:
variations TEXT,
category TEXT NOT NULL,
subcategory TEXT,
content_type TEXT,
source_file TEXT NOT NULL,
source_files TEXT,
page_reference TEXT,
source_excerpt TEXT,
-- Structured parameters
age_group_min INTEGER,
age_group_max INTEGER,
@@ -49,26 +54,47 @@ class DatabaseManager:
participants_max INTEGER,
duration_min INTEGER,
duration_max INTEGER,
-- Categories for filtering
materials_category TEXT,
materials_list TEXT,
skills_developed TEXT,
difficulty_level TEXT,
-- Metadata
keywords TEXT,
tags TEXT,
popularity_score INTEGER DEFAULT 0,
-- Extraction / language metadata
language TEXT,
normalized_name TEXT,
extraction_confidence TEXT,
needs_review INTEGER DEFAULT 0,
-- Enrichment overlay (bilingual + inferred filters; Part B)
name_ro TEXT,
description_ro TEXT,
rules_ro TEXT,
variations_ro TEXT,
indoor_outdoor TEXT,
space_needed TEXT,
estimated_fields TEXT,
source_id TEXT,
source_ids TEXT,
chunk_key TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
# FTS5 virtual table for search
conn.execute("""
CREATE VIRTUAL TABLE IF NOT EXISTS activities_fts USING fts5(
name, description, rules, variations, keywords,
materials_list, skills_developed,
name_ro, description_ro, rules_ro, variations_ro,
content='activities',
content_rowid='id'
)
@@ -92,6 +118,9 @@ class DatabaseManager:
"CREATE INDEX IF NOT EXISTS idx_activities_age ON activities(age_group_min, age_group_max)",
"CREATE INDEX IF NOT EXISTS idx_activities_participants ON activities(participants_min, participants_max)",
"CREATE INDEX IF NOT EXISTS idx_activities_duration ON activities(duration_min, duration_max)",
"CREATE INDEX IF NOT EXISTS idx_activities_normalized_name ON activities(normalized_name)",
"CREATE INDEX IF NOT EXISTS idx_activities_indoor_outdoor ON activities(indoor_outdoor)",
"CREATE INDEX IF NOT EXISTS idx_activities_space_needed ON activities(space_needed)",
"CREATE INDEX IF NOT EXISTS idx_categories_type ON categories(type)"
]
@@ -102,24 +131,42 @@ class DatabaseManager:
conn.execute("""
CREATE TRIGGER IF NOT EXISTS activities_fts_insert AFTER INSERT ON activities
BEGIN
INSERT INTO activities_fts(rowid, name, description, rules, variations, keywords)
VALUES (new.id, new.name, new.description, new.rules, new.variations, new.keywords);
INSERT INTO activities_fts(rowid, name, description, rules, variations,
keywords, materials_list, skills_developed,
name_ro, description_ro, rules_ro, variations_ro)
VALUES (new.id, new.name, new.description, new.rules, new.variations,
new.keywords, new.materials_list, new.skills_developed,
new.name_ro, new.description_ro, new.rules_ro, new.variations_ro);
END
""")
conn.execute("""
CREATE TRIGGER IF NOT EXISTS activities_fts_delete AFTER DELETE ON activities
BEGIN
DELETE FROM activities_fts WHERE rowid = old.id;
INSERT INTO activities_fts(activities_fts, rowid, name, description, rules,
variations, keywords, materials_list, skills_developed,
name_ro, description_ro, rules_ro, variations_ro)
VALUES ('delete', old.id, old.name, old.description, old.rules,
old.variations, old.keywords, old.materials_list, old.skills_developed,
old.name_ro, old.description_ro, old.rules_ro, old.variations_ro);
END
""")
conn.execute("""
CREATE TRIGGER IF NOT EXISTS activities_fts_update AFTER UPDATE ON activities
BEGIN
DELETE FROM activities_fts WHERE rowid = old.id;
INSERT INTO activities_fts(rowid, name, description, rules, variations, keywords)
VALUES (new.id, new.name, new.description, new.rules, new.variations, new.keywords);
INSERT INTO activities_fts(activities_fts, rowid, name, description, rules,
variations, keywords, materials_list, skills_developed,
name_ro, description_ro, rules_ro, variations_ro)
VALUES ('delete', old.id, old.name, old.description, old.rules,
old.variations, old.keywords, old.materials_list, old.skills_developed,
old.name_ro, old.description_ro, old.rules_ro, old.variations_ro);
INSERT INTO activities_fts(rowid, name, description, rules, variations,
keywords, materials_list, skills_developed,
name_ro, description_ro, rules_ro, variations_ro)
VALUES (new.id, new.name, new.description, new.rules, new.variations,
new.keywords, new.materials_list, new.skills_developed,
new.name_ro, new.description_ro, new.rules_ro, new.variations_ro);
END
""")
@@ -179,11 +226,17 @@ class DatabaseManager:
"""Update category usage counts"""
categories_to_update = [
('category', activity.category),
('content_type', activity.content_type),
('language', activity.language),
('age_group', activity.get_age_range_display()),
('participants', activity.get_participants_display()),
('duration', activity.get_duration_display()),
('materials', activity.get_materials_display()),
('difficulty', activity.difficulty_level),
# Enrichment axes — slugs stored as value; UI maps to RO via
# DISPLAY_NAMES. Without these the new dropdowns would be empty.
('indoor_outdoor', activity.indoor_outdoor),
('space_needed', activity.space_needed),
]
for cat_type, cat_value in categories_to_update:
@@ -210,6 +263,8 @@ class DatabaseManager:
duration_max: Optional[int] = None,
materials_category: Optional[str] = None,
difficulty_level: Optional[str] = None,
indoor_outdoor: Optional[str] = None,
space_needed: Optional[str] = None,
limit: int = 100) -> List[Dict[str, Any]]:
"""Enhanced search with FTS5 and filters"""
@@ -267,7 +322,15 @@ class DatabaseManager:
if difficulty_level:
base_query += " AND difficulty_level = ?"
params.append(difficulty_level)
if indoor_outdoor:
base_query += " AND indoor_outdoor = ?"
params.append(indoor_outdoor)
if space_needed:
base_query += " AND space_needed = ?"
params.append(space_needed)
# Add ordering and limit
query = f"{base_query} {order_clause} LIMIT ?"
params.append(limit)
@@ -332,8 +395,11 @@ class DatabaseManager:
def clear_database(self):
"""Clear all data from database"""
with self._get_connection() as conn:
# Deleting from activities fires the delete trigger, which removes
# the matching FTS rows. The explicit 'delete-all' command then
# guarantees the external-content FTS index is fully cleared.
conn.execute("DELETE FROM activities")
conn.execute("DELETE FROM activities_fts")
conn.execute("INSERT INTO activities_fts(activities_fts) VALUES('delete-all')")
conn.execute("DELETE FROM categories")
conn.commit()
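
The trigger changes above follow the external-content FTS5 pattern: rows are removed from the index with the special `'delete'` command, passing the old column values, instead of a plain `DELETE` on the FTS table. A minimal sketch on a two-column stand-in schema follows. It assumes the local SQLite build includes FTS5, and the trigger names and sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activities (id INTEGER PRIMARY KEY, name TEXT, name_ro TEXT);
CREATE VIRTUAL TABLE activities_fts USING fts5(
    name, name_ro, content='activities', content_rowid='id'
);
CREATE TRIGGER activities_ai AFTER INSERT ON activities BEGIN
    INSERT INTO activities_fts(rowid, name, name_ro)
    VALUES (new.id, new.name, new.name_ro);
END;
CREATE TRIGGER activities_ad AFTER DELETE ON activities BEGIN
    -- 'delete' command: must receive the values that were indexed.
    INSERT INTO activities_fts(activities_fts, rowid, name, name_ro)
    VALUES ('delete', old.id, old.name, old.name_ro);
END;
CREATE TRIGGER activities_au AFTER UPDATE ON activities BEGIN
    INSERT INTO activities_fts(activities_fts, rowid, name, name_ro)
    VALUES ('delete', old.id, old.name, old.name_ro);
    INSERT INTO activities_fts(rowid, name, name_ro)
    VALUES (new.id, new.name, new.name_ro);
END;
""")


def search(term: str) -> list:
    """Return matching rowids from the FTS index, in rowid order."""
    rows = conn.execute(
        "SELECT rowid FROM activities_fts WHERE activities_fts MATCH ? ORDER BY rowid",
        (term,),
    ).fetchall()
    return [row[0] for row in rows]


conn.execute("INSERT INTO activities (name, name_ro) VALUES (?, ?)",
             ("Capture the Flag", None))
hits_en = search("flag")

# Enrichment adds the Romanian twin; the update trigger re-indexes the row.
conn.execute("UPDATE activities SET name_ro = ? WHERE id = 1",
             ("Capturarea steagului",))
hits_ro = search("steagului")

conn.execute("DELETE FROM activities WHERE id = 1")
hits_after_delete = search("flag")
```

Indexing the `*_ro` columns in all three triggers is what lets a Romanian query find an activity whose source text is English once the enrichment overlay has been applied.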


@@ -2,8 +2,6 @@
Services for INDEX-SISTEM-JOCURI v2.0
"""
from .parser import IndexMasterParser
from .indexer import ActivityIndexer
from .search import SearchService
__all__ = ['IndexMasterParser', 'ActivityIndexer', 'SearchService']
__all__ = ['SearchService']


@@ -1,248 +0,0 @@
"""
Activity indexer service for INDEX-SISTEM-JOCURI v2.0
Coordinates parsing and database indexing
"""
from typing import List, Dict, Any
from pathlib import Path
from app.models.database import DatabaseManager
from app.models.activity import Activity
from app.services.parser import IndexMasterParser
import time
class ActivityIndexer:
"""Service for indexing activities from INDEX_MASTER into database"""
def __init__(self, db_manager: DatabaseManager, index_master_path: str):
"""Initialize indexer with database manager and INDEX_MASTER path"""
self.db = db_manager
self.parser = IndexMasterParser(index_master_path)
self.indexing_stats = {}
def index_all_activities(self, clear_existing: bool = False) -> Dict[str, Any]:
"""Index all activities from INDEX_MASTER into database"""
print("🚀 Starting activity indexing process...")
start_time = time.time()
# Clear existing data if requested
if clear_existing:
print("🗑️ Clearing existing database...")
self.db.clear_database()
# Parse activities from INDEX_MASTER
print("📖 Parsing INDEX_MASTER file...")
activities = self.parser.parse_all_categories()
if not activities:
print("❌ No activities were parsed!")
return {'success': False, 'error': 'No activities parsed'}
# Filter valid activities
valid_activities = []
for activity in activities:
if self.parser.validate_activity_completeness(activity):
valid_activities.append(activity)
else:
print(f"⚠️ Skipping incomplete activity: {activity.name[:50]}...")
print(f"✅ Validated {len(valid_activities)} activities out of {len(activities)} parsed")
if len(valid_activities) < 100:
print(f"⚠️ Warning: Only {len(valid_activities)} valid activities found. Expected 500+")
# Bulk insert into database
print("💾 Inserting activities into database...")
try:
inserted_count = self.db.bulk_insert_activities(valid_activities)
# Rebuild FTS index for optimal search performance
print("🔍 Rebuilding search index...")
self.db.rebuild_fts_index()
end_time = time.time()
indexing_time = end_time - start_time
# Generate final statistics (with error handling)
try:
stats = self._generate_indexing_stats(valid_activities, indexing_time)
stats['inserted_count'] = inserted_count
stats['success'] = True
except Exception as e:
print(f"⚠️ Error generating statistics: {e}")
stats = {
'success': True,
'inserted_count': inserted_count,
'indexing_time_seconds': indexing_time,
'error': f'Stats generation failed: {str(e)}'
}
print(f"✅ Indexing complete! {inserted_count} activities indexed in {indexing_time:.2f}s")
# Verify database state (with error handling)
try:
db_stats = self.db.get_statistics()
print(f"📊 Database now contains {db_stats['total_activities']} activities")
except Exception as e:
print(f"⚠️ Error getting database statistics: {e}")
print(f"📊 Database insertion completed, statistics unavailable")
return stats
except Exception as e:
print(f"❌ Error during database insertion: {e}")
return {'success': False, 'error': str(e)}
def index_specific_category(self, category_code: str) -> Dict[str, Any]:
"""Index activities from a specific category only"""
print(f"🎯 Indexing specific category: {category_code}")
# Load content and parse specific category
if not self.parser.load_content():
return {'success': False, 'error': 'Could not load INDEX_MASTER'}
category_name = self.parser.category_mapping.get(category_code)
if not category_name:
return {'success': False, 'error': f'Unknown category code: {category_code}'}
activities = self.parser.parse_category_section(category_code, category_name)
if not activities:
return {'success': False, 'error': f'No activities found in category {category_code}'}
# Filter valid activities
valid_activities = [a for a in activities if self.parser.validate_activity_completeness(a)]
try:
inserted_count = self.db.bulk_insert_activities(valid_activities)
return {
'success': True,
'category': category_name,
'inserted_count': inserted_count,
'total_parsed': len(activities),
'valid_activities': len(valid_activities)
}
except Exception as e:
return {'success': False, 'error': str(e)}
def _generate_indexing_stats(self, activities: List[Activity], indexing_time: float) -> Dict[str, Any]:
"""Generate comprehensive indexing statistics"""
# Get parser statistics
parser_stats = self.parser.get_parsing_statistics()
# Calculate additional metrics
categories = {}
age_ranges = {}
durations = {}
materials = {}
for activity in activities:
# Category breakdown
if activity.category in categories:
categories[activity.category] += 1
else:
categories[activity.category] = 1
# Age range analysis (with safety check)
try:
age_key = activity.get_age_range_display() or "nespecificat"
age_ranges[age_key] = age_ranges.get(age_key, 0) + 1
except Exception as e:
print(f"Warning: Error getting age range for activity {activity.name}: {e}")
age_ranges["nespecificat"] = age_ranges.get("nespecificat", 0) + 1
# Duration analysis (with safety check)
try:
duration_key = activity.get_duration_display() or "nespecificat"
durations[duration_key] = durations.get(duration_key, 0) + 1
except Exception as e:
print(f"Warning: Error getting duration for activity {activity.name}: {e}")
durations["nespecificat"] = durations.get("nespecificat", 0) + 1
# Materials analysis (with safety check)
try:
materials_key = activity.get_materials_display() or "nespecificat"
materials[materials_key] = materials.get(materials_key, 0) + 1
except Exception as e:
print(f"Warning: Error getting materials for activity {activity.name}: {e}")
materials["nespecificat"] = materials.get("nespecificat", 0) + 1
return {
'indexing_time_seconds': indexing_time,
'parsing_stats': parser_stats,
'distribution': {
'categories': categories,
'age_ranges': age_ranges,
'durations': durations,
'materials': materials
},
'quality_metrics': {
'completion_rate': parser_stats.get('completion_rate', 0),
'average_description_length': parser_stats.get('average_description_length', 0),
'activities_with_metadata': sum(1 for a in activities if a.age_group_min or a.participants_min or a.duration_min)
}
}
def verify_indexing_quality(self) -> Dict[str, Any]:
"""Verify the quality of indexed data"""
try:
# Get database statistics
db_stats = self.db.get_statistics()
# Check for minimum activity count
total_activities = db_stats['total_activities']
meets_minimum = total_activities >= 500
# Check category distribution
categories = db_stats.get('categories', {})
category_coverage = len(categories)
# Sample some activities to check quality
sample_activities = self.db.search_activities(limit=10)
quality_issues = []
for activity in sample_activities:
if not activity.get('description') or len(activity['description']) < 10:
quality_issues.append(f"Activity {activity.get('name', 'Unknown')} has insufficient description")
if not activity.get('category'):
quality_issues.append(f"Activity {activity.get('name', 'Unknown')} missing category")
return {
'total_activities': total_activities,
'meets_minimum_requirement': meets_minimum,
'minimum_target': 500,
'category_coverage': category_coverage,
'expected_categories': len(self.parser.category_mapping),
'quality_issues': quality_issues,
'quality_score': max(0, 100 - len(quality_issues) * 10),
'database_stats': db_stats
}
except Exception as e:
return {'error': str(e), 'quality_score': 0}
def get_indexing_progress(self) -> Dict[str, Any]:
"""Get current indexing progress and status"""
try:
db_stats = self.db.get_statistics()
# Calculate progress towards 500+ activities goal
total_activities = db_stats['total_activities']
target_activities = 500
progress_percentage = min(100, (total_activities / target_activities) * 100)
return {
'current_activities': total_activities,
'target_activities': target_activities,
'progress_percentage': progress_percentage,
'status': 'completed' if total_activities >= target_activities else 'in_progress',
'categories_indexed': list(db_stats.get('categories', {}).keys()),
'database_size_mb': db_stats.get('database_size_bytes', 0) / (1024 * 1024)
}
except Exception as e:
return {'error': str(e), 'status': 'error'}


@@ -1,340 +0,0 @@
"""
Advanced parser for INDEX_MASTER_JOCURI_ACTIVITATI.md
Extracts 500+ individual activities with full details
"""
import re
from pathlib import Path
from typing import List, Dict, Optional, Tuple
from app.models.activity import Activity
class IndexMasterParser:
"""Advanced parser for extracting real activities from INDEX_MASTER"""
def __init__(self, index_file_path: str):
"""Initialize parser with INDEX_MASTER file path"""
self.index_file_path = Path(index_file_path)
self.content = ""
self.activities = []
# Category mapping for main sections (exact match from file)
self.category_mapping = {
'[A]': 'JOCURI CERCETĂȘEȘTI ȘI SCOUT',
'[B]': 'TEAM BUILDING ȘI COMUNICARE',
'[C]': 'CAMPING ȘI ACTIVITĂȚI EXTERIOR',
'[D]': 'ESCAPE ROOM ȘI PUZZLE-URI',
'[E]': 'ORIENTARE ȘI BUSOLE',
'[F]': 'PRIMUL AJUTOR ȘI SIGURANȚA',
'[G]': 'ACTIVITĂȚI EDUCAȚIONALE',
'[H]': 'RESURSE SPECIALE'
}
def load_content(self) -> bool:
"""Load and validate INDEX_MASTER content"""
try:
if not self.index_file_path.exists():
print(f"❌ INDEX_MASTER file not found: {self.index_file_path}")
return False
with open(self.index_file_path, 'r', encoding='utf-8') as f:
self.content = f.read()
if len(self.content) < 1000: # Sanity check
print(f"⚠️ INDEX_MASTER file seems too small: {len(self.content)} chars")
return False
print(f"✅ Loaded INDEX_MASTER: {len(self.content)} characters")
return True
except Exception as e:
print(f"❌ Error loading INDEX_MASTER: {e}")
return False
def parse_all_categories(self) -> List[Activity]:
"""Parse all categories and extract individual activities"""
if not self.load_content():
return []
print("🔍 Starting comprehensive parsing of INDEX_MASTER...")
# Parse each main category
for category_code, category_name in self.category_mapping.items():
print(f"\n📂 Processing category {category_code}: {category_name}")
category_activities = self.parse_category_section(category_code, category_name)
self.activities.extend(category_activities)
print(f" ✅ Extracted {len(category_activities)} activities")
print(f"\n🎯 Total activities extracted: {len(self.activities)}")
return self.activities
def parse_category_section(self, category_code: str, category_name: str) -> List[Activity]:
"""Parse a specific category section"""
activities = []
# Find the category section - exact pattern match
# Look for the actual section, not the table of contents
pattern = rf"^## {re.escape(category_code)} {re.escape(category_name)}\s*$"
matches = list(re.finditer(pattern, self.content, re.MULTILINE | re.IGNORECASE))
if not matches:
print(f" ⚠️ Category section not found: {category_code}")
return activities
# Take the last match (should be the actual section, not TOC)
match = matches[-1]
print(f" 📍 Found section at position {match.start()}")
# Extract content until next main category or end
start_pos = match.end()
# Find next main category (look for complete header)
next_category_pattern = r"^## \[[A-H]\] [A-ZĂÂÎȘȚ]"
next_match = re.search(next_category_pattern, self.content[start_pos:], re.MULTILINE)
if next_match:
end_pos = start_pos + next_match.start()
section_content = self.content[start_pos:end_pos]
else:
section_content = self.content[start_pos:]
# Parse subsections within the category
activities.extend(self._parse_subsections(section_content, category_name))
return activities
def _parse_subsections(self, section_content: str, category_name: str) -> List[Activity]:
"""Parse subsections within a category"""
activities = []
# Find all subsections (### markers)
subsection_pattern = r"^### (.+?)$"
subsections = re.finditer(subsection_pattern, section_content, re.MULTILINE)
subsection_list = list(subsections)
for i, subsection in enumerate(subsection_list):
subsection_title = subsection.group(1).strip()
subsection_start = subsection.end()
# Find end of subsection
if i + 1 < len(subsection_list):
subsection_end = subsection_list[i + 1].start()
else:
subsection_end = len(section_content)
subsection_text = section_content[subsection_start:subsection_end]
# Parse individual games in this subsection
subsection_activities = self._parse_games_in_subsection(
subsection_text, category_name, subsection_title
)
activities.extend(subsection_activities)
return activities
def _parse_games_in_subsection(self, subsection_text: str, category_name: str, subsection_title: str) -> List[Activity]:
"""Parse individual games within a subsection"""
activities = []
# Look for "Exemple de jocuri:" sections
examples_pattern = r"\*\*Exemple de jocuri:\*\*\s*\n(.*?)(?=\n\*\*|$)"
examples_matches = re.finditer(examples_pattern, subsection_text, re.DOTALL)
for examples_match in examples_matches:
examples_text = examples_match.group(1)
# Extract individual games (numbered list)
game_pattern = r"^(\d+)\.\s*\*\*(.+?)\*\*\s*-\s*(.+?)$"
games = re.finditer(game_pattern, examples_text, re.MULTILINE)
for game_match in games:
game_number = game_match.group(1)
game_name = game_match.group(2).strip()
game_description = game_match.group(3).strip()
# Extract metadata from subsection
metadata = self._extract_subsection_metadata(subsection_text)
# Create activity
activity = Activity(
name=game_name,
description=game_description,
category=category_name,
subcategory=subsection_title,
source_file="INDEX_MASTER_JOCURI_ACTIVITATI.md",
page_reference=f"{category_name} > {subsection_title} > #{game_number}",
**metadata
)
activities.append(activity)
# Also extract from direct activity descriptions without "Exemple de jocuri"
activities.extend(self._parse_direct_activities(subsection_text, category_name, subsection_title))
return activities
def _extract_subsection_metadata(self, subsection_text: str) -> Dict:
"""Extract metadata from subsection text"""
metadata = {}
# Extract participants info
participants_pattern = r"\*\*Participanți:\*\*\s*(.+?)(?:\n|\*\*)"
participants_match = re.search(participants_pattern, subsection_text)
if participants_match:
participants_text = participants_match.group(1).strip()
participants = self._parse_participants(participants_text)
metadata.update(participants)
# Extract duration
duration_pattern = r"\*\*Durata:\*\*\s*(.+?)(?:\n|\*\*)"
duration_match = re.search(duration_pattern, subsection_text)
if duration_match:
duration_text = duration_match.group(1).strip()
duration = self._parse_duration(duration_text)
metadata.update(duration)
# Extract materials
materials_pattern = r"\*\*Materiale:\*\*\s*(.+?)(?:\n|\*\*)"
materials_match = re.search(materials_pattern, subsection_text)
if materials_match:
materials_text = materials_match.group(1).strip()
metadata['materials_list'] = materials_text
metadata['materials_category'] = self._categorize_materials(materials_text)
# Extract keywords
keywords_pattern = r"\*\*Cuvinte cheie:\*\*\s*(.+?)(?:\n|\*\*)"
keywords_match = re.search(keywords_pattern, subsection_text)
if keywords_match:
metadata['keywords'] = keywords_match.group(1).strip()
return metadata
def _parse_participants(self, participants_text: str) -> Dict:
"""Parse participants information"""
result = {}
# Look for number ranges like "8-30 copii" or "5-15 persoane"
range_pattern = r"(\d+)-(\d+)"
range_match = re.search(range_pattern, participants_text)
if range_match:
result['participants_min'] = int(range_match.group(1))
result['participants_max'] = int(range_match.group(2))
else:
# Look for an open-ended minimum such as "10+" (a bare number without "+" is ignored)
number_pattern = r"(\d+)\+"
number_match = re.search(number_pattern, participants_text)
if number_match:
result['participants_min'] = int(number_match.group(1))
# Extract age information
age_pattern = r"(\d+)-(\d+)\s*ani"
age_match = re.search(age_pattern, participants_text)
if age_match:
result['age_group_min'] = int(age_match.group(1))
result['age_group_max'] = int(age_match.group(2))
return result
def _parse_duration(self, duration_text: str) -> Dict:
"""Parse duration information"""
result = {}
# Look for time ranges like "5-20 minute" or "15-30min"
range_pattern = r"(\d+)-(\d+)\s*(?:minute|min)"
range_match = re.search(range_pattern, duration_text)
if range_match:
result['duration_min'] = int(range_match.group(1))
result['duration_max'] = int(range_match.group(2))
else:
# Look for single duration
single_pattern = r"(\d+)\+?\s*(?:minute|min)"
single_match = re.search(single_pattern, duration_text)
if single_match:
result['duration_min'] = int(single_match.group(1))
return result
def _categorize_materials(self, materials_text: str) -> str:
"""Categorize materials into simple categories"""
materials_lower = materials_text.lower()
if any(word in materials_lower for word in ['fără', 'nu necesare', 'nimic', 'minime']):
return 'Fără materiale'
elif any(word in materials_lower for word in ['hârtie', 'creion', 'marker', 'simple']):
return 'Materiale simple'
elif any(word in materials_lower for word in ['computer', 'proiector', 'echipament', 'complexe']):
return 'Materiale complexe'
else:
return 'Materiale variate'
def _parse_direct_activities(self, subsection_text: str, category_name: str, subsection_title: str) -> List[Activity]:
"""Parse activities that are described directly without 'Exemple de jocuri' section"""
activities = []
# Look for activity descriptions in sections that don't have "Exemple de jocuri"
if "**Exemple de jocuri:**" not in subsection_text:
# Try to extract from file descriptions
file_pattern = r"\*\*Fișier:\*\*\s*`([^`]+)`.*?\*\*(.+?)\*\*"
file_matches = re.finditer(file_pattern, subsection_text, re.DOTALL)
for file_match in file_matches:
file_name = file_match.group(1)
description_part = file_match.group(2)
# Create a general activity for this file
activity = Activity(
name=f"Activități din {file_name}",
description=f"Colecție de activități din fișierul {file_name}. {description_part[:200]}...",
category=category_name,
subcategory=subsection_title,
source_file=file_name,
page_reference=f"{category_name} > {subsection_title}",
**self._extract_subsection_metadata(subsection_text)
)
activities.append(activity)
return activities
def validate_activity_completeness(self, activity: Activity) -> bool:
"""Validate that an activity has all necessary fields"""
required_fields = ['name', 'description', 'category', 'source_file']
for field in required_fields:
if not getattr(activity, field) or not getattr(activity, field).strip():
return False
# Check minimum description length
if len(activity.description) < 10:
return False
return True
def get_parsing_statistics(self) -> Dict:
"""Get statistics about the parsing process"""
if not self.activities:
return {'total_activities': 0}
category_counts = {}
valid_activities = 0
for activity in self.activities:
# Count by category
if activity.category in category_counts:
category_counts[activity.category] += 1
else:
category_counts[activity.category] = 1
# Count valid activities
if self.validate_activity_completeness(activity):
valid_activities += 1
return {
'total_activities': len(self.activities),
'valid_activities': valid_activities,
'completion_rate': (valid_activities / len(self.activities)) * 100 if self.activities else 0,
'category_breakdown': category_counts,
'average_description_length': sum(len(a.description) for a in self.activities) / len(self.activities) if self.activities else 0
}
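
The participant and duration parsing above can be exercised on its own. The following is a minimal sketch that copies the regexes from `_parse_participants` and `_parse_duration` into free functions; the function names and the sample strings are illustrative assumptions, not part of the repository.

```python
import re
from typing import Dict


def parse_participants(text: str) -> Dict[str, int]:
    """Standalone copy of the participant/age regexes shown above."""
    result: Dict[str, int] = {}
    # First "N-M" range in the text is taken as the participant range.
    range_match = re.search(r"(\d+)-(\d+)", text)
    if range_match:
        result["participants_min"] = int(range_match.group(1))
        result["participants_max"] = int(range_match.group(2))
    else:
        # Open-ended minimum such as "10+".
        number_match = re.search(r"(\d+)\+", text)
        if number_match:
            result["participants_min"] = int(number_match.group(1))
    # Age range must be followed by "ani".
    age_match = re.search(r"(\d+)-(\d+)\s*ani", text)
    if age_match:
        result["age_group_min"] = int(age_match.group(1))
        result["age_group_max"] = int(age_match.group(2))
    return result


def parse_duration(text: str) -> Dict[str, int]:
    """Standalone copy of the duration regexes shown above."""
    result: Dict[str, int] = {}
    range_match = re.search(r"(\d+)-(\d+)\s*(?:minute|min)", text)
    if range_match:
        result["duration_min"] = int(range_match.group(1))
        result["duration_max"] = int(range_match.group(2))
    else:
        single_match = re.search(r"(\d+)\+?\s*(?:minute|min)", text)
        if single_match:
            result["duration_min"] = int(single_match.group(1))
    return result
```

One behaviour worth noting: because the participant range is the first `N-M` found anywhere in the text, an input that contains only an age range (for example `"8-12 ani"`) is recorded as both the participant range and the age range.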


@@ -5,8 +5,19 @@ Enhanced search with FTS5 and intelligent filtering
from typing import List, Dict, Any, Optional
from app.models.database import DatabaseManager
from app.config_taxonomy import NON_GAME_CONTENT_TYPES
import re
# Category slugs that are themselves "non-game" — selecting one of these as a
# category filter also lifts the default non-game content_type exclusion.
NON_GAME_CATEGORIES = {"retete", "cantece-ceremonii"}
# When a Python-side post-filter is active the DB LIMIT is applied *before*
# filtering, so we over-fetch to still satisfy the caller's `limit`.
_OVERSCAN_FACTOR = 5
_OVERSCAN_CAP = 2000
class SearchService:
"""Enhanced search service with intelligent query processing"""
@@ -24,22 +35,72 @@ class SearchService:
if filters is None:
filters = {}
# Process and normalize search text
processed_search = self._process_search_text(search_text)
# Map web filters to database fields
db_filters = self._map_filters_to_db_fields(filters)
# content_type and language are filtered in Python: the DB layer does
# not expose them as query parameters. The DEFAULT search excludes the
# non-game content types (rețete / cântece / ceremonii) — they surface
# only when the user explicitly filters that content_type, or picks a
# non-game category. See plan §6.
content_type, exclude_non_game = self._resolve_content_type_filter(filters)
language = (filters.get('language') or '').strip().lower() or None
post_filtering = bool(content_type or exclude_non_game or language)
# Over-fetch when post-filtering so the final list can still reach `limit`.
fetch_limit = min(limit * _OVERSCAN_FACTOR, _OVERSCAN_CAP) if post_filtering else limit
# Perform database search
results = self.db.search_activities(
search_text=processed_search,
**db_filters,
limit=limit
limit=fetch_limit
)
# Post-process results for relevance and ranking
return self._post_process_results(results, processed_search, filters)
# Apply content_type / language post-filters
results = self._apply_content_type_filter(results, content_type, exclude_non_game)
if language:
results = [r for r in results
if (r.get('language') or '').strip().lower() == language]
# Post-process results for relevance and ranking, then honour `limit`
results = self._post_process_results(results, processed_search, filters)
return results[:limit]
def _resolve_content_type_filter(self, filters: Dict[str, str]):
"""Determine the content_type post-filter.
Returns (explicit_content_type | None, exclude_non_game: bool):
- an explicit `content_type` filter → that value, no exclusion;
- a `category` filter on a non-game category → no exclusion;
- otherwise → default search, exclude non-game content types.
"""
content_type = (filters.get('content_type') or '').strip()
if content_type:
return content_type, False
category = (filters.get('category') or '').strip()
if category in NON_GAME_CATEGORIES:
return None, False
return None, True
def _apply_content_type_filter(self,
results: List[Dict[str, Any]],
content_type: Optional[str],
exclude_non_game: bool) -> List[Dict[str, Any]]:
"""Filter results by content_type (explicit include vs default exclude)."""
if content_type:
return [r for r in results
if (r.get('content_type') or '') == content_type]
if exclude_non_game:
# Rows with NULL/unknown content_type are kept — only the known
# non-game types are dropped from the default search.
return [r for r in results
if (r.get('content_type') or '') not in NON_GAME_CONTENT_TYPES]
return results
def _process_search_text(self, search_text: Optional[str]) -> Optional[str]:
"""Process and enhance search text for better FTS5 results"""
@@ -83,10 +144,16 @@ class SearchService:
if not filter_value or not filter_value.strip():
continue
# content_type / language are NOT database query params — they are
# applied as Python post-filters in search_activities(). Skip them
# here so they never reach DatabaseManager.search_activities().
if filter_key in ('content_type', 'language'):
continue
# Map filter types to database fields
if filter_key == 'category':
db_filters['category'] = filter_value
elif filter_key == 'age_group':
# Parse age range (e.g., "5-8 ani", "12+ ani")
age_match = re.search(r'(\d+)(?:-(\d+))?\s*ani?', filter_value)
@@ -133,7 +200,14 @@ class SearchService:
elif filter_key == 'difficulty':
db_filters['difficulty_level'] = filter_value
elif filter_key == 'indoor_outdoor':
# Equality filter on the slug column (mirror difficulty).
db_filters['indoor_outdoor'] = filter_value
elif filter_key == 'space_needed':
db_filters['space_needed'] = filter_value
# Handle any other custom filters
else:
# Generic filter handling - try to match against keywords or tags
@@ -177,21 +251,22 @@ class SearchService:
boost_score = 0
# Check name matches (highest priority)
name_lower = result.get('name', '').lower()
# NB: use `or ''` — nullable columns come back as None, not ''.
name_lower = (result.get('name') or '').lower()
for term in search_terms:
if term in name_lower:
boost_score += 10
if name_lower.startswith(term):
boost_score += 5 # Extra boost for name starts with term
# Check description matches
desc_lower = result.get('description', '').lower()
desc_lower = (result.get('description') or '').lower()
for term in search_terms:
if term in desc_lower:
boost_score += 3
# Check keywords matches
keywords_lower = result.get('keywords', '').lower()
keywords_lower = (result.get('keywords') or '').lower()
for term in search_terms:
if term in keywords_lower:
boost_score += 5
@@ -280,11 +355,14 @@ class SearchService:
return []
try:
# Search for activities that match the partial query
# Search for activities that match the partial query.
# Over-fetch then drop non-game content types so autocomplete
# mirrors the default search (no rețete / cântece / ceremonii).
results = self.db.search_activities(
search_text=f'"{partial_query}"',
limit=limit * 2
limit=limit * 6
)
results = self._apply_content_type_filter(results, None, True)
suggestions = []
seen = set()
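
The over-fetch in this hunk exists because the DB `LIMIT` is applied before the Python-side post-filter. A minimal sketch of that idea follows, assuming a hypothetical `fetch(n)` callable standing in for `DatabaseManager.search_activities()` and made-up `content_type` values; only the two constants are taken from the diff.

```python
from typing import Any, Callable, Dict, List

_OVERSCAN_FACTOR = 5    # same value as in the diff
_OVERSCAN_CAP = 2000    # same value as in the diff

Row = Dict[str, Any]


def search_with_post_filter(
    fetch: Callable[[int], List[Row]],   # hypothetical stand-in for the DB call
    limit: int,
    keep: Callable[[Row], bool],         # Python-side post-filter
    post_filtering: bool = True,
) -> List[Row]:
    """Over-fetch when a post-filter is active, filter, then honour `limit`."""
    fetch_limit = (
        min(limit * _OVERSCAN_FACTOR, _OVERSCAN_CAP) if post_filtering else limit
    )
    rows = fetch(fetch_limit)
    if post_filtering:
        rows = [r for r in rows if keep(r)]
    return rows[:limit]


# Illustrative data: every second row is a non-game content type.
ROWS = [
    {"id": i, "content_type": "joc" if i % 2 == 0 else "reteta"}
    for i in range(100)
]


def fake_fetch(n: int) -> List[Row]:
    return ROWS[:n]


def is_game(row: Row) -> bool:
    return (row.get("content_type") or "") != "reteta"
```

With `limit=10`, fetching only 10 rows and then filtering would leave 5 results; fetching 50 leaves enough rows to fill the page. The cap bounds the cost for large limits, at the price that a very selective filter can still return fewer than `limit` rows.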


@@ -705,4 +705,30 @@ body {
box-shadow: none;
border: 1px solid #ddd;
}
}
/* Enrichment markers (plan Part A7) */
.estimated {
color: #8a6d3b;
font-style: italic;
font-size: 0.85em;
font-weight: normal;
}
.original-text > summary {
cursor: pointer;
color: #555;
user-select: none;
}
.original-text .original-content {
margin-top: 0.75rem;
padding-left: 1rem;
border-left: 3px solid #e0e0e0;
color: #555;
}
.download-hint {
color: #888;
font-size: 0.85em;
}


@@ -8,14 +8,20 @@
<nav class="breadcrumb">
<a href="{{ url_for('main.index') }}">Căutare</a>
<span class="breadcrumb-separator">»</span>
<span class="breadcrumb-current">{{ activity.name }}</span>
<span class="breadcrumb-current">{{ activity.get_display_name() }}</span>
</nav>
<!-- Activity header -->
<header class="activity-detail-header">
<div class="activity-title-section">
<h1 class="activity-detail-title">{{ activity.name }}</h1>
<span class="activity-category-badge">{{ activity.category }}</span>
<h1 class="activity-detail-title">{{ activity.get_display_name() }}</h1>
<span class="activity-category-badge">{{ display_names.get(activity.category, activity.category) }}</span>
{% if activity.content_type %}
<span class="activity-content-type-badge">{{ display_names.get(activity.content_type, activity.content_type) }}</span>
{% endif %}
{% if activity.needs_review %}
<span class="activity-badge needs-review" title="Această activitate necesită verificare">⚠ De verificat</span>
{% endif %}
</div>
{% if activity.subcategory %}
@@ -25,27 +31,46 @@
<!-- Activity content -->
<div class="activity-detail-content">
<!-- Main description -->
<!-- Main description (Romanian-primary, falls back to original) -->
<section class="activity-section">
<h2 class="section-title">Descriere</h2>
<div class="activity-description">{{ activity.description }}</div>
<div class="activity-description">{{ activity.get_display_description() }}</div>
</section>
<!-- Rules and variations -->
{% if activity.rules %}
{% if activity.get_display_rules() %}
<section class="activity-section">
<h2 class="section-title">Reguli</h2>
<div class="activity-rules">{{ activity.rules }}</div>
<div class="activity-rules">{{ activity.get_display_rules() }}</div>
</section>
{% endif %}
{% if activity.variations %}
{% if activity.get_display_variations() %}
<section class="activity-section">
<h2 class="section-title">Variații</h2>
<div class="activity-variations">{{ activity.variations }}</div>
<div class="activity-variations">{{ activity.get_display_variations() }}</div>
</section>
{% endif %}
<!-- Original (pre-translation) text, collapsed by default -->
{% if activity.has_translation() %}
<details class="activity-section original-text">
<summary class="section-title">Text original ({{ display_names.get(activity.language, activity.language or 'sursă') }})</summary>
<div class="original-content">
<h3 class="metadata-title">{{ activity.name }}</h3>
<div class="activity-description">{{ activity.description }}</div>
{% if activity.rules %}
<h4 class="metadata-title">Reguli</h4>
<div class="activity-rules">{{ activity.rules }}</div>
{% endif %}
{% if activity.variations %}
<h4 class="metadata-title">Variații</h4>
<div class="activity-variations">{{ activity.variations }}</div>
{% endif %}
</div>
</details>
{% endif %}
<!-- Metadata grid -->
<section class="activity-section">
<h2 class="section-title">Detalii activitate</h2>
@@ -53,21 +78,35 @@
{% if activity.get_age_range_display() != "toate vârstele" %}
<div class="metadata-card">
<h3 class="metadata-title">Grupa de vârstă</h3>
<p class="metadata-value">{{ activity.get_age_range_display() }}</p>
<p class="metadata-value">{{ activity.get_age_range_display() }}{% if activity.is_estimated('age_group_min') or activity.is_estimated('age_group_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
</div>
{% endif %}
{% if activity.get_participants_display() != "orice număr" %}
<div class="metadata-card">
<h3 class="metadata-title">Participanți</h3>
<p class="metadata-value">{{ activity.get_participants_display() }}</p>
<p class="metadata-value">{{ activity.get_participants_display() }}{% if activity.is_estimated('participants_min') or activity.is_estimated('participants_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
</div>
{% endif %}
{% if activity.get_duration_display() != "durată variabilă" %}
<div class="metadata-card">
<h3 class="metadata-title">Durata</h3>
<p class="metadata-value">{{ activity.get_duration_display() }}</p>
<p class="metadata-value">{{ activity.get_duration_display() }}{% if activity.is_estimated('duration_min') or activity.is_estimated('duration_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
</div>
{% endif %}
{% if activity.get_indoor_outdoor_display() %}
<div class="metadata-card">
<h3 class="metadata-title">Interior / exterior</h3>
<p class="metadata-value">{{ activity.get_indoor_outdoor_display() }}{% if activity.is_estimated('indoor_outdoor') %} <em class="estimated">(estimat)</em>{% endif %}</p>
</div>
{% endif %}
{% if activity.get_space_needed_display() %}
<div class="metadata-card">
<h3 class="metadata-title">Spațiu necesar</h3>
<p class="metadata-value">{{ activity.get_space_needed_display() }}{% if activity.is_estimated('space_needed') %} <em class="estimated">(estimat)</em>{% endif %}</p>
</div>
{% endif %}
@@ -119,9 +158,15 @@
<h2 class="section-title">Informații sursă</h2>
<div class="source-info">
{% if activity.source_file %}
{% if config.SOURCE_DOWNLOAD_ENABLED %}
<p><strong>Fișier sursă:</strong>
<a href="{{ url_for('main.source_download', activity_id=activity.id) }}">{{ activity.source_file }}</a>
<span class="download-hint">(descarcă)</span></p>
{% else %}
<p><strong>Fișier sursă:</strong> {{ activity.source_file }}</p>
{% endif %}
{% endif %}
{% if activity.page_reference %}
<p><strong>Referință:</strong> {{ activity.page_reference }}</p>
{% endif %}


@@ -36,7 +36,31 @@
<select name="category" id="category" class="filter-select">
<option value="">Toate categoriile</option>
{% for category in filters.category %}
<option value="{{ category }}">{{ category }}</option>
<option value="{{ category }}">{{ display_names.get(category, category) }}</option>
{% endfor %}
</select>
</div>
{% endif %}
{% if filters.content_type %}
<div class="filter-group">
<label for="content_type" class="filter-label">Tip conținut</label>
<select name="content_type" id="content_type" class="filter-select">
<option value="">Doar jocuri și activități</option>
{% for content_type in filters.content_type %}
<option value="{{ content_type }}">{{ display_names.get(content_type, content_type) }}</option>
{% endfor %}
</select>
</div>
{% endif %}
{% if filters.language %}
<div class="filter-group">
<label for="language" class="filter-label">Limbă</label>
<select name="language" id="language" class="filter-select">
<option value="">Toate limbile</option>
{% for language in filters.language %}
<option value="{{ language }}">{{ display_names.get(language, language) }}</option>
{% endfor %}
</select>
</div>
@@ -101,6 +125,30 @@
</select>
</div>
{% endif %}
{% if filters.indoor_outdoor %}
<div class="filter-group">
<label for="indoor_outdoor" class="filter-label">Interior / exterior</label>
<select name="indoor_outdoor" id="indoor_outdoor" class="filter-select">
<option value="">Oriunde</option>
{% for io in filters.indoor_outdoor %}
<option value="{{ io }}">{{ display_names.get(io, io) }}</option>
{% endfor %}
</select>
</div>
{% endif %}
{% if filters.space_needed %}
<div class="filter-group">
<label for="space_needed" class="filter-label">Spațiu necesar</label>
<select name="space_needed" id="space_needed" class="filter-select">
<option value="">Orice spațiu</option>
{% for sp in filters.space_needed %}
<option value="{{ sp }}">{{ display_names.get(sp, sp) }}</option>
{% endfor %}
</select>
</div>
{% endif %}
{% endif %}
</div>


@@ -24,7 +24,29 @@
<option value="">Toate categoriile</option>
{% for category in filters.category %}
<option value="{{ category }}" {% if applied_filters.category == category %}selected{% endif %}>
{{ category }}
{{ display_names.get(category, category) }}
</option>
{% endfor %}
</select>
{% endif %}
{% if filters.content_type %}
<select name="content_type" class="filter-select compact">
<option value="">Doar jocuri și activități</option>
{% for content_type in filters.content_type %}
<option value="{{ content_type }}" {% if applied_filters.content_type == content_type %}selected{% endif %}>
{{ display_names.get(content_type, content_type) }}
</option>
{% endfor %}
</select>
{% endif %}
{% if filters.language %}
<select name="language" class="filter-select compact">
<option value="">Toate limbile</option>
{% for language in filters.language %}
<option value="{{ language }}" {% if applied_filters.language == language %}selected{% endif %}>
{{ display_names.get(language, language) }}
</option>
{% endfor %}
</select>
@@ -63,6 +85,28 @@
</select>
{% endif %}
{% if filters.indoor_outdoor %}
<select name="indoor_outdoor" class="filter-select compact">
<option value="">Oriunde</option>
{% for io in filters.indoor_outdoor %}
<option value="{{ io }}" {% if applied_filters.indoor_outdoor == io %}selected{% endif %}>
{{ display_names.get(io, io) }}
</option>
{% endfor %}
</select>
{% endif %}
{% if filters.space_needed %}
<select name="space_needed" class="filter-select compact">
<option value="">Orice spațiu</option>
{% for sp in filters.space_needed %}
<option value="{{ sp }}" {% if applied_filters.space_needed == sp %}selected{% endif %}>
{{ display_names.get(sp, sp) }}
</option>
{% endfor %}
</select>
{% endif %}
<button type="button" class="btn btn-secondary btn-sm" onclick="clearFilters()">
Resetează
</button>
@@ -106,31 +150,46 @@
<header class="activity-header">
<h3 class="activity-title">
<a href="{{ url_for('main.activity_detail', activity_id=activity.id) }}">
{{ activity.name }}
{{ activity.get_display_name() }}
</a>
</h3>
<span class="activity-category">{{ activity.category }}</span>
<span class="activity-category">{{ display_names.get(activity.category, activity.category) }}</span>
{% if activity.needs_review %}
<span class="activity-badge needs-review" title="Această activitate necesită verificare">⚠ De verificat</span>
{% endif %}
</header>
<div class="activity-content">
<p class="activity-description">{{ activity.description }}</p>
<p class="activity-description">{{ activity.get_display_description() }}</p>
<div class="activity-metadata">
{% if activity.get_age_range_display() != "toate vârstele" %}
<span class="metadata-item">
<strong>Vârsta:</strong> {{ activity.get_age_range_display() }}
<strong>Vârsta:</strong> {{ activity.get_age_range_display() }}{% if activity.is_estimated('age_group_min') or activity.is_estimated('age_group_max') %} <em class="estimated">(estimat)</em>{% endif %}
</span>
{% endif %}
{% if activity.get_participants_display() != "orice număr" %}
<span class="metadata-item">
<strong>Participanți:</strong> {{ activity.get_participants_display() }}
<strong>Participanți:</strong> {{ activity.get_participants_display() }}{% if activity.is_estimated('participants_min') or activity.is_estimated('participants_max') %} <em class="estimated">(estimat)</em>{% endif %}
</span>
{% endif %}
{% if activity.get_duration_display() != "durată variabilă" %}
<span class="metadata-item">
<strong>Durata:</strong> {{ activity.get_duration_display() }}
<strong>Durata:</strong> {{ activity.get_duration_display() }}{% if activity.is_estimated('duration_min') or activity.is_estimated('duration_max') %} <em class="estimated">(estimat)</em>{% endif %}
</span>
{% endif %}
{% if activity.get_indoor_outdoor_display() %}
<span class="metadata-item">
<strong>Loc:</strong> {{ activity.get_indoor_outdoor_display() }}{% if activity.is_estimated('indoor_outdoor') %} <em class="estimated">(estimat)</em>{% endif %}
</span>
{% endif %}
{% if activity.get_space_needed_display() %}
<span class="metadata-item">
<strong>Spațiu:</strong> {{ activity.get_space_needed_display() }}{% if activity.is_estimated('space_needed') %} <em class="estimated">(estimat)</em>{% endif %}
</span>
{% endif %}
@@ -143,7 +202,11 @@
{% if activity.source_file %}
<div class="activity-source">
{% if config.SOURCE_DOWNLOAD_ENABLED %}
<small>Sursă: <a href="{{ url_for('main.source_download', activity_id=activity.id) }}">{{ activity.source_file }}</a></small>
{% else %}
<small>Sursă: {{ activity.source_file }}</small>
{% endif %}
</div>
{% endif %}
</div>


@@ -3,15 +3,28 @@ Flask routes for INDEX-SISTEM-JOCURI v2.0
Clean, minimalist web interface with dynamic filters
"""
from flask import Blueprint, request, render_template, jsonify, current_app
from flask import (
Blueprint, request, render_template, jsonify, current_app,
send_from_directory,
)
from app.models.database import DatabaseManager
from app.models.activity import Activity
from app.services.search import SearchService
import os
from pathlib import Path
from app.config_taxonomy import (
CATEGORIES, CONTENT_TYPES, INDOOR_OUTDOOR, SPACE_NEEDED,
)
bp = Blueprint('main', __name__)
# Slug -> Romanian display name. Category, content_type, indoor_outdoor and
# space_needed slugs never collide, so a single flat map is enough for the UI
# filter labels.
LANGUAGE_NAMES = {'ro': 'Română', 'en': 'Engleză'}
DISPLAY_NAMES = {
**CATEGORIES, **CONTENT_TYPES, **INDOOR_OUTDOOR, **SPACE_NEEDED,
**LANGUAGE_NAMES,
}
# Initialize database manager (will be configured in application factory)
def get_db_manager():
"""Get database manager instance"""
@@ -36,15 +49,17 @@ def index():
# Get database statistics for the interface
stats = db.get_statistics()
return render_template('index.html',
return render_template('index.html',
filters=filter_options,
display_names=DISPLAY_NAMES,
stats=stats)
except Exception as e:
print(f"Error loading main page: {e}")
# Fallback with empty filters
return render_template('index.html',
return render_template('index.html',
filters={},
display_names=DISPLAY_NAMES,
stats={'total_activities': 0})
@bp.route('/search', methods=['GET', 'POST'])
@@ -82,8 +97,9 @@ def search():
search_query=search_query,
applied_filters=filters,
filters=filter_options,
display_names=DISPLAY_NAMES,
results_count=len(activities))
except Exception as e:
print(f"Search error: {e}")
return render_template('results.html',
@@ -91,6 +107,7 @@ def search():
search_query='',
applied_filters={},
filters={},
display_names=DISPLAY_NAMES,
results_count=0,
error=str(e))
@@ -121,12 +138,51 @@ def activity_detail(activity_id):
return render_template('activity.html',
activity=activity,
display_names=DISPLAY_NAMES,
similar_activities=similar_activities)
except Exception as e:
print(f"Error loading activity {activity_id}: {e}")
return render_template('404.html'), 404
@bp.route('/source/<int:activity_id>')
def source_download(activity_id):
"""Download the original source file for an activity (plan A6).
Shipped DARK: returns 404 unless SOURCE_DOWNLOAD_ENABLED is set (copyright
exposure — the user opts in). Resolves the activity's `source_file` under
CORPUS_DIR. send_from_directory does the safe-join and blocks traversal;
web-mirror / extension-less sources that are not real files 404 gracefully.
"""
if not current_app.config.get('SOURCE_DOWNLOAD_ENABLED', False):
return render_template('404.html'), 404
try:
db = get_db_manager()
activity_data = db.get_activity_by_id(activity_id)
if not activity_data:
return render_template('404.html'), 404
source_file = (activity_data.get('source_file') or '').strip()
if not source_file:
return render_template('404.html'), 404
corpus_dir = current_app.config.get('CORPUS_DIR')
if not corpus_dir:
return render_template('404.html'), 404
try:
# send_from_directory rejects path traversal and missing files with
# a 404 (NotFound) — no manual safe_join needed.
return send_from_directory(
corpus_dir, source_file, as_attachment=True
)
except Exception:
# Missing file / web-mirror source with no on-disk original.
return render_template('404.html'), 404
except Exception as e:
print(f"Source download error for {activity_id}: {e}")
return render_template('404.html'), 404
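
The `source_download` route relies on `send_from_directory` to keep the resolved path inside `CORPUS_DIR`. For readers who want to see the kind of check involved, here is a standard-library sketch of a containment test. It is not Flask's implementation; `resolve_under` and `demo` are hypothetical names used only for illustration.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def resolve_under(base_dir: str, relative_name: str) -> Optional[Path]:
    """Return the file path only if it exists and stays inside base_dir."""
    base = Path(base_dir).resolve()
    candidate = (base / relative_name).resolve()
    try:
        # Raises ValueError when candidate is not located under base,
        # which covers "../" traversal and absolute paths.
        candidate.relative_to(base)
    except ValueError:
        return None
    return candidate if candidate.is_file() else None


def demo() -> Tuple[bool, bool, bool]:
    """Exercise the check against a temporary directory."""
    with tempfile.TemporaryDirectory() as corpus:
        (Path(corpus) / "a.txt").write_text("x", encoding="utf-8")
        found = resolve_under(corpus, "a.txt") is not None
        traversal = resolve_under(corpus, "../a.txt") is not None
        missing = resolve_under(corpus, "missing.txt") is not None
        return found, traversal, missing
```

In the route itself every failure case maps to the same 404 response, so a caller cannot distinguish a blocked path from a missing file.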
@bp.route('/health')
def health_check():
"""Health check endpoint for Docker"""

Binary file not shown.

379
data/review_decisions.json Normal file
View File

@@ -0,0 +1,379 @@
{
"96560c4dee400911e01ec5bf6f2460f2b6008bdb": "merge",
"7b93e6fbec6dab9c43862f4104d2f0e83784ccca": "merge",
"2765d81e6572977b411f9521871a7a3e557cab6b": "merge",
"bdd723ba8575ec94e6ed9e036ea62512ef643b40": "merge",
"78f0c433f6da30bb922ef1f8637a747410ac3528": "merge",
"0d3a5f2519d69e04558431d5990de8b1fb59a9bc": "merge",
"5abca033e2bd9f49a7b65cad2e357c7ea7e2236d": "merge",
"379999f9a603dfe4cb2316687a2266e17001ea8a": "merge",
"90a16119eadd2daa1ecb8b897c20875e54c5934a": "merge",
"d75b4131ece700309957aa17dae01f1feb9b3c35": "merge",
"0d0ccd256e8a02e45128b9dcc196b49ce6b5ec0d": "merge",
"db8afcd1ae6e0525dce1403caccf417de1423c68": "merge",
"ec157c89b9c3ff0691fc92b9ca0aa586f93c4880": "merge",
"67ce4ec87bcc33aab599a1c93c930e93c7762ae4": "merge",
"952cbbbd28453b93bae97de9edc473bb9525bac9": "merge",
"7a1acb62bb01bcd568fd67699da2523b293ce8a6": "merge",
"043c6dcc5a4d80f43ad9308e454e4ec287321a26": "merge",
"d10374d745aeb6abe1c0cc827baf98b311933196": "merge",
"bdce7856cee7f98e187b35b3ad806782e738f007": "merge",
"58103e94a6020d75e2d79fc1877b08b15e2a3d67": "merge",
"dec54f2d3015f73eea7558fb52d2107fc43b7937": "merge",
"d4b1e9dca9f3bdd4302f311ee96e6b39e96d5b1f": "merge",
"93b16e984bc1b98de7ec9403df61522e53ee26e6": "merge",
"6202ef279e70c0850369575f8dd9a6a338006a82": "merge",
"02b83749ab358c1c591f134d4d252a72843fb443": "merge",
"903af17ce988c40ab64c855a8b1f8e989d36883d": "merge",
"e0d6a73c57fcd151bbc4d7b7727aa8d6e1a3301c": "merge",
"63583f4d78dfe919ed195910a917d8cfe52ebe09": "merge",
"37306f964eaa9a123e40aaa7b642520ce3be0349": "merge",
"e355dd178be6d4589c227a047f2083e55806c4ae": "merge",
"3d99eecf4ad2e034ad9cac724cfb9f5d57bfd45f": "merge",
"e7c893a786fba566ea0cd5c8ee2a27e280560fd1": "merge",
"cf7a7b6db84a7070e9495f6ff2970d8860621885": "merge",
"a7157217540041a575cd3909a8bd7fe9345932ef": "merge",
"15f50b3eb8baeac6cf58df3938ddcda2ae0c3224": "merge",
"3c977df90998d069d7bba6a11d2bc7a891f5b45d": "merge",
"7b4361f2e3d9f017d78513bebfb0f2aa7243f987": "merge",
"d386aa70c890c3538b5f6b201e135f7ea9613ae9": "merge",
"1da5be5acb36f222a67999de864cbcbcebe8d876": "merge",
"7bf9319e4c0322747fd2e35971d7f7932e835eef": "merge",
"29d7060cf880d7db900d6fb64cf2cb3874794b88": "merge",
"9832cc0a6625f24354d6dfee3527086d042dd52b": "merge",
"12ea5a54fb5f123441d8fae1451f6cdd02f572bb": "merge",
"0638d8c1c81fe5c52ee29823be9f0e162f9aed12": "merge",
"a2d1962c3c3f1f471f33a093291119d67890d26e": "merge",
"ca7437967bbe393da5f5f463e359711929d6ba00": "merge",
"6637ee7fd0ae519f87e4374ac6d8d3df12a0c0e2": "merge",
"b7a8104fa06e2e2491f5699eefa5dbc8ffb26209": "merge",
"21bf2790f28f654d0ccd38d99e8bfdd9876f2d10": "merge",
"1e6a13a18d750a8b20457f71bfe64950a4c50f22": "merge",
"299a31f4187c38a9ff84773d4ed1a3eccfe6904c": "merge",
"a013365c84b79fac919ea70a2c2820881bf05e82": "merge",
"dfdc05810d2e84b93a2edd64cf8138aeb9a6ccf3": "merge",
"cf88b32fe2c2c09907d1813a1fefe9b17754d9fb": "merge",
"210ba3d41c7424428365e7ed33cf86c9819f1467": "merge",
"a458a6f96eccddd5b36392f38ef5c7dd16a238cd": "merge",
"e4198708e6d69d69ce8c7cc14d5c1e4757944808": "merge",
"6e3c6b282675e787c8726c0d4cc536357f6b1438": "merge",
"6370503a78ecaacf6e8738fe3419d2e4a4b280ec": "merge",
"5e6bbf4ea08efd5187954c5bb40d67a0f4ec223c": "merge",
"d49a48f1dbf890d9c802c0c7e0e7e38a8b28645f": "merge",
"668ca80cde843b5bcfc1e3c3599b114893481c6b": "merge",
"2d85fa7ad793b0fec8066b1b45ae51da7f09fae1": "merge",
"1416073acad77bef155224b1f8005f778554aada": "merge",
"ddb18e2eeb058b1dbbcc1afd444256a82fccc346": "merge",
"f1f5fb5b2f73f4035098d5af23da8ca2f20f54ca": "merge",
"d3f4d626827e6798675b55a4785c2829ef93ef69": "merge",
"8e83a64a6aa030d6d02d3d736c3fffe8249033dd": "merge",
"d93a71716835cd1f8856abe093580f1e6aeab023": "merge",
"98cc404cdd51987d3a6d15e88269c2978286e80a": "merge",
"4d31104a3cf321dcd831e354a6d2556cffeafbd0": "merge",
"b0e6d245954401f9a766addafb2dde59b5bf5fe9": "merge",
"722f70854b1aff75d059410ebfdcf99fa15fe828": "merge",
"3c58885138b0e9da95fcb0ed7906339cd104e168": "merge",
"daa05e337e1fc8f02c4679a1da21dd74deafda0c": "merge",
"f5ab25d06273404abfef1a55a1288f518d668638": "merge",
"7484f446f8c8e59e6f248a465ea74b60f87102a0": "merge",
"d072ac1c088c2fccf40536706d6433ecac0e865a": "merge",
"9c8c02b5aa8e2ca3fcd6a91b46883beeadfa77a1": "merge",
"81fed4a423c79dee6a8c7a3b82b4873e0003597c": "merge",
"de6d50103958f692181a2c8a7d8aff51a9e3c6a3": "merge",
"d1efa3c2cf23ee5cf845716a94af6e0c9ee2e3e0": "merge",
"d762147d9a4bd258a874ac9d899e712fa08e3865": "merge",
"1b28f43ef78eb3031c172f8c1a7c62834b902945": "merge",
"28dc9dde42db1dbf1fe826219b0cc2bb5b203304": "merge",
"3dc8037f2ed01e025c39dae2900ad40b5ccbd546": "merge",
"80d95a9a86341093da7a2266225f1e8b016bfdc1": "merge",
"7ec81e486f1da7c383c20837f06d162241ccf720": "merge",
"f653b01b650d420eeb1a581e3d2ce3823b7000f2": "merge",
"45b0184b1c666135ef903680935ccca6a638ee8a": "merge",
"9562ba2777d2ed8b3c590beac7697d2ea731a2f9": "merge",
"37d803fce5601c7f4e924368b07a74cd03df7317": "merge",
"9a92c8f1ef4120ad82993b095dad3b3dbd1c2e5f": "merge",
"4265795c137dd7229c4a593beab00d8401225afd": "merge",
"d308fbe7842da27ee2e3204f700afd9364dde327": "merge",
"3aaa23c16b88060510826b497c6433f010db7bb0": "merge",
"d827b842f11c0752ffe8cddb6c372b8deab01924": "merge",
"0232a13fd5589c4e1cbfe23c13a8b7b863b665eb": "merge",
"1b7d6c88a61344bb2033d5de64d913acf2d527f9": "merge",
"d9d8104bd70c4c03b6841ccbc07d078b9bc22c81": "merge",
"ad2251d37f8911b33e2b5227d2e7acb0533fb10a": "merge",
"7b1ec96916a6b5cac294e05ffc46c42a821076dc": "merge",
"7c29838be6491258e5e3d169d931412adc8d2f01": "merge",
"c050c15f2d3e0a2cf8d1145d3a9050d29d2f540f": "merge",
"3d6731ad69adec25662c431c5d42c7d4a1934c89": "merge",
"d07cb8c2e9d0c4a192b3966698e6671d21ea982b": "merge",
"218e30c5d34937563d5d328e51d983a33611a1f5": "merge",
"4547df7094cd3f8791adc07996d3c3264d1b8ff4": "merge",
"7de994f358426f4c3e48b54196e4c0c6c71d60c4": "merge",
"60473b807e6a9c940dacc726d3270efe8b6a0c5b": "merge",
"7db9d39d8d2baf52c6e586b95472699f42c9c98f": "merge",
"2409751dda5ae660a28b230694e0bc2bd2207133": "merge",
"b525c8c36de926f6b891d613df9835b06bc6be59": "merge",
"533ba4716e2ed1154e3ce6323aa899ea148abdf5": "merge",
"b815d67ecd6ac3a64a6f792ffddc63a462754100": "merge",
"6d0c3759ee029bb6d3fcd693dca9753f5186ecf3": "merge",
"50ae80ee932281e9f4424c71f4d2607ce842737b": "merge",
"7507208a9ca4f83a3634904810bdcd6a1633058f": "merge",
"8614f97398b7b50fb50ae8405f8695f2960489d9": "merge",
"9d8779922c44bc4a805447283b8b037569131b18": "merge",
"965d619a9809c4f3831e308cf6214ddcebb9f835": "merge",
"7f99acd1fef4a139c96911773d188f5be94ddbb5": "merge",
"c5478f32f290c116b8a801736d2832a4ae1fafa5": "merge",
"d589902ef6b25634da0fd89c00e3bbb44e8d868b": "merge",
"9b3887e6d292ba2d3642689538231fb1a4d1dc08": "merge",
"a2571504a7e62563aadb34799f40882afcd56ec8": "merge",
"29f4b1f18f23495691ca5d6bc33150f36612a16f": "merge",
"be8529c39e659ba00ba30ea4ebbb05510ceaa684": "merge",
"3113a3f8ab0795d3dfcbf4b0dc64f5887bfcfeb6": "merge",
"53e4ecab7d2fc23efa1f75a93b7da1fbe46fb38a": "merge",
"aec818b52ddc3de410cb56692e95493a0661ac40": "merge",
"19cd0d05e21206203498eda94b4ba8b777d360f2": "merge",
"b596ae2fa7fd7dd059095b989f8477fc0023ef2a": "merge",
"68e69967ef4f7437f21373b52f77a3701f61054d": "merge",
"d72598ee9abdfd3d5c07b1e6f3e9fa11424b569f": "merge",
"d420186d189d783fde3d29908ade3221eac109df": "merge",
"7a3c7465e4221511994a2b2ebf6aa5c0b3353b0c": "merge",
"c19d94c259c4bb55cda1198b96c8e723ca48904b": "merge",
"f7222a7fa8b9930b0aef8e377e820423bd891b7a": "merge",
"aeda2c5a5aa5c0ae8b33f72f81a76e0617b45474": "merge",
"3c75d378d76756f56f2c5827d0adeaa6c0c88e04": "merge",
"9bd53fc3c7b669d84cfe1f9778791ca32c1df621": "merge",
"e065420daad8abd3e6908381775c9ab5aeb804c0": "merge",
"107bd809c7ca5d24b1ea2d0e59843b6386be43ad": "merge",
"32457099f22bfff1d9ab4f24c87768c893e295c6": "merge",
"965038421cb0d65fc937cbf4f99b487732807fe0": "merge",
"7b0e5e6a8e49a9f3594cdf17559cfff52c1a329a": "merge",
"fa88682c3eab4a5fed4bb64f5a9b5733018d12f8": "merge",
"4cddf1c9b6424b9d66cf6adf2e631dda8a92b88f": "merge",
"0733103d3ec92e1bc7a05f7a6be53b9ea19ea1c9": "merge",
"30d28ba19a6ece36d32b54ca30c0a71fa4b04dcf": "merge",
"91ca8e066bfd41514ed775e2cbe485d4d74ee85c": "merge",
"8b952bbf8e45e128e8a5b224a3c3da7d166327b1": "merge",
"5ae4f3199d76f8805e38174425f70fd7bb99296c": "merge",
"3ac614a52102ecd7448d4ebd47641264b3feba99": "merge",
"e25061e58c6d66b88e89635d670d5360ebf2bc6d": "merge",
"277434fe9d2179b91d2cfbd660a16f95ba590031": "merge",
"8ddcb8e0885cd61847206cad017e1d65b002e798": "merge",
"bcda860319366bab0690195d2d9665f72492e19e": "merge",
"d9ed88849aa59a5ba31798854a25843815cc2d3a": "merge",
"eae46f64b96fff8db16751cf271698a54336d7d3": "merge",
"cfd5e3814f129e59c011acce3ef97d3a95b21465": "merge",
"c7929c37f1367d52bd529e671e2c951dbb80f618": "merge",
"b5137e14a8914e9007fc78b48b9c8ac5060d3885": "merge",
"38373f03cc19db9363c9be6dd1df60f8c0344a3a": "merge",
"8e2784d1d0abf50c83d280dc3c50d51bcc8bb947": "merge",
"9322e84a111ea8d6aacae5998d742b1ef7689870": "merge",
"2ab3a81c5d889f7ed58d72a4c42e54728955a3b1": "merge",
"3c87b2d1c3df5d3cfe920e992805d55f840b023f": "merge",
"3f14169b7cce95214f874f5dc7acf214a1576e09": "merge",
"bfbd7c57a96b5a66970d28f23ec65e71c3d0af63": "merge",
"56841bf251e2631d31accd7b44ad1223ef55116e": "merge",
"e4a907fb74246d5af8201df32ab1db0ffcf10582": "merge",
"af373139d280bc4b36b31a3e917a442d441f76c4": "merge",
"6d69ce5b4cb5dae80f7fc6fdd74092624d7e629b": "merge",
"5d93d92c1ed5ec1b9eea859a8d10dd5542f6bf0a": "merge",
"0d3655929d5a20efade4fdebdbe1e51a857aa4a8": "merge",
"c6f5a99b9970977b87527e68a79856957266ec85": "merge",
"ac8798c7b1bda482f86770a1152b33e1eb09936f": "merge",
"da588a2ebac1e9fce8a00635be058aa08a534006": "merge",
"b1b5f3e3195a835e110c12826b0f68aa6c5dfda8": "merge",
"2cbfda9b32acb9e3353046193185ae459b6baade": "merge",
"4530e7f505b85a4bc418f8e204fd2dce3e4bcd51": "merge",
"d0f306635d8df199e336616980ff8367f23a86c6": "merge",
"043651de409eaa56f0f5f02ccb36acb4f9379c7c": "merge",
"78febd51e41d71f721454190ffe4defd3be64cf2": "merge",
"878234eaaa6a82da2a07cbddf7f7a9312b7aee40": "merge",
"d2e283086b32df1b74da9ff6bac014867f3e02b9": "merge",
"8494ded48880d67c8857860f485a428c85ddb17c": "merge",
"d99eab2896ee0c15d96ce9008387f1caae4992d5": "merge",
"996d708cc4d0a7532f24889ed8fa33565f1252db": "merge",
"bea98344fcc400338f45fa7aea05c1c716ff2909": "merge",
"7ed550a014b20ddb6d406ae470fc564501fcdd2c": "merge",
"783f22275c9c238507aa5903ec8b0d99e3a4a348": "merge",
"d32c5eea6b7e682bc998ae968b835436784c8501": "merge",
"e1aae6922f87193a5d0c8efdeabfd55e80ba13c1": "merge",
"9aa988ee82d5a2ce587b2cb4ad304c1449985322": "merge",
"e11c9441f3fb67f4c4fed049c015fb6eb8b73ca7": "merge",
"fdc0642b90b7afb0520294d90a73060ed5211cde": "merge",
"29c466523090f5f957dd5feb7a10a8ee839d6a68": "merge",
"987e658e3f40c874da5e237a635ea3eba63520af": "merge",
"ef677e0935c55d3da20872e27328eddbae34b5b9": "merge",
"b34cbc1ac68e6e0cc61f273d030520cb0f6639fb": "merge",
"7769d0a01f93ed6437128827ba88987807aa7154": "merge",
"5abd16d445171cdc30752149066755886ac627b7": "merge",
"b3da0c637ac87ff1032213ce007d365af512ce33": "merge",
"ea4065e3f76043bad07c6e6aff17c342abf59668": "merge",
"c3e40ebe130c4603359654c228ee8f3518ea6b46": "merge",
"0a98a18116f3384808d9472d370e9c65984f10f7": "merge",
"057f5aab9399ceab6ac1d8b59da831e952e53dbd": "merge",
"2d41a71e7b7db81fbc2757b814d9189c67cda4cf": "merge",
"c27d096ed3d5d94e1f9c1574eff52635c78ae282": "merge",
"b4eed2c25fd8260393df142bef5d4a964f3bd10f": "merge",
"f6bd9d75156541527e66fab5e474c1422ad9dbf9": "merge",
"a449eb060de8beba56dd0dbafde9fee4f65288c9": "merge",
"cde603b1d5c0ea111641e21abe2e75a49d157de4": "merge",
"c3b0799baf3248e581be8a9cecf35351a7c4a363": "merge",
"2c08a477dd2717c07da9f2bbaf1940403d0b8f86": "merge",
"a55f668e2623aa50abded937ff4400382e80d90a": "merge",
"abb24637bfd92e7212e9dafc9a7e5ad14d7cc8a4": "merge",
"b862dd55a70d645da81f6c0cdd921c8728674633": "merge",
"6197261b63f675835c3290be0f10ced10400e326": "merge",
"74c82141e805ebaab913cea281416883849cc771": "merge",
"1ce689262c53a212ad8f94a778ea4bf80bc62910": "merge",
"05d9af704c387479856a2390b81a27fed4413285": "merge",
"492c25ec0c4667bbe01177f689f3c51b8f31190d": "merge",
"ec79b6e520a87ca24ae5265aff1b39a87dcadf82": "merge",
"eb5d32991120f20c2ff1c16e92b31c78aa3dcd22": "merge",
"e2dc20814e229ddef4569ca4d405c0b553313d1a": "merge",
"9e37a3e0e6336d306c263bd4739e9231cb8221dd": "merge",
"2c541c5bf5b6a93301fc8ee089228a09b8581959": "merge",
"f5b717b5da236e42a840093e4562fe2896f44a20": "merge",
"04a3b2dcba0a00ee833e2e9c4a5f032f2cf80742": "merge",
"f8452066c036fa7c09b4acd03de4f0c661a6bec6": "merge",
"b595ffe402b14a1e927c2f0fe9554f581354999d": "merge",
"786ab65d5e3c51246f1c457b77eff154860bf959": "merge",
"8b6d467df7b0161820cf698a33c665217c597d04": "merge",
"264d317f0a0d92e82c2943ca9365367382902c27": "merge",
"6d95efd3025e3175aa75c1aa7ed7ec7d1ac3d1a1": "merge",
"165c5c31e2fb155e00a2664fcca334c25356af69": "merge",
"998d2db54bde3578f8e7dbaa5e4612f20836e5d0": "merge",
"c764f15e260fc5f2ac445e8216e27130fa9af9e7": "merge",
"0120182e18a57cd9d936bff9bf74675bb9233cdb": "merge",
"09c39c6b32699cf61aa71fa6689dd676b5e803a9": "merge",
"1549346e5b321eaf1d767f673e642cac11d22700": "merge",
"cd7a074efdf2dd6e9272e05cb88a8f8da08a8a8b": "merge",
"f418edf54719e46b2707f0c6ec9a47da38ab099a": "merge",
"9e20edd1cc156c387a6fd4b47f0495a5d1d5d969": "merge",
"dac94f8716a6f192fcfb660453b3074e55b60c3f": "merge",
"c62767464497b98bbab2fbf724ce069b749c0d95": "merge",
"37bc745042745d9984d9323f83de2bb6f2af0d49": "merge",
"3df2f823828b92a35075fb207844cf788b8bfc09": "merge",
"153cbe257b938edc3bda1aeea3729c53abc81d57": "merge",
"0da73ba76f0e83a24cb7ec86ad426fc79f0be00e": "merge",
"517597920bde6cfea71154e6a13fa4a9bcb18db2": "merge",
"283bd504ae23412dfcf671c7d6febdf42232383c": "merge",
"4d889b00c2811ade234b312f72d725ffab3971e7": "merge",
"c5e723a923eb6c0d07f34e05c30465a862b3fd5c": "merge",
"9e9afe3963dcbfcb1191110145772373c49739dd": "merge",
"4a238771093ed7936603426b9a19355fa48a086a": "merge",
"641291a504ecc051c99fc385151f68f6f2c7711a": "merge",
"d6e47126fe3754fde729dcb7777cc25165e57dba": "merge",
"812cdf28743b2f01cceedca18cf7e237f6f38c1b": "merge",
"de2ce5f515b0d9647320bd71d8d013c68fb48bc4": "merge",
"14ffc9df5acc2537d3cca6322d097363d7106d52": "merge",
"5941736673d484453b49edc863a9dfa669f4cdb2": "merge",
"d68039d71f1bc980a8cca9176df5fd672575db46": "merge",
"5e8667cd2a0862a7b528779267ea074d88501494": "merge",
"7b6c4999ce6d158f72aba749543f5c8a1cbfb08b": "merge",
"ace457ccf21922424aa38a0a29d5aa7dbd4e7fc1": "merge",
"5707d00d0d3b646d83541c9205405f5e41ad07b0": "merge",
"261e67f706670c9f81e7b2c6d0ec1fc934c4109a": "merge",
"eec31103367b39bc1b4db3d0bce21650ad8f4382": "merge",
"39880f589acdfeb25866adf3246a76df08798f19": "merge",
"65ced2d78530d25626bcf18c4005a0aae5ed62c2": "merge",
"48cf968689bc72fa3ae82ce65a7b0943489f0154": "merge",
"4f2e11bbb3805459f2eacf219e5ec03f1d28cb35": "merge",
"0ea6dbabece8f3e5ea404f61c30c377f60a472e5": "merge",
"7ede6afb1141d462955d620a82608841220798b7": "merge",
"9ff9d8bf1215f2d98d3707653c09bcacf4e296de": "merge",
"b2e944340a3cccf5c1c399aa814e2fabe1ac864d": "merge",
"ce4d859cbca0368652db3168769a4e6ccac8f91a": "merge",
"6f657a344ac4233f86f0cf9ff11aa0aed4be94e0": "merge",
"b7101934d892ce071b407567acf6804e41c9b89c": "merge",
"2b5d28295e73e25c51359577e55470b59a4c283a": "merge",
"f0c9de980532b717b361439c4ae7d58e4b6a49ea": "merge",
"13bd864c337a32bc7207d44c4f1af0ce58f15072": "merge",
"710c6b384f10cc08c9a2116614e7736d035e65ae": "merge",
"7109848fc1fc60cba60a51e3cb0a514d3c90eebc": "merge",
"801e5080975fda8bab6c060efb2e7f3ac85d8182": "merge",
"a4910a93356eea624e5a08fe05634b94c0dd0285": "merge",
"7ff8ff8d36da93bbeade44b70f6b944ea32135d0": "merge",
"02402c6f4cde79fc0d4db36d033d4ea504d97c90": "merge",
"09c196193a16b47338b3e5446882ffa8ce3f8b16": "merge",
"2d427628ac15e8d04daa564e4c0f1a3aa7f93126": "merge",
"d65181509712e219fbe71407f0cd24845f306cd1": "merge",
"a82a00d0170e01857e493ed08f20d9b2133a8791": "merge",
"a1c17e0481ed6b2a3bc68f8bbf135173f636c2f6": "merge",
"1cefb01dff5305eba60b9f19e1acd5087bd8b60a": "merge",
"2f766fa9b5c81f85fa225f5e9573785fd0ddf5eb": "merge",
"12e822ee75c1f3a0d59cbb8e76370a7b7dd0e4dc": "merge",
"6e285cebbe113587482274cb3d51eb4ad8139643": "merge",
"f79b4edf703c01e6a5b7a75c43b691a2de7c6ca4": "merge",
"cc3901d8e10db1561222e2b39c62391298a14d82": "merge",
"21ee92f1adb0d480c328db4ded13728e890c0ea0": "merge",
"7b95c6dbd4990cbf4400e7ae89e2d530e31baa4e": "merge",
"978078ce3a4aaa7c425cc31e3dfece248a4f3a51": "merge",
"569a61c811956afb56325ede193d1d24a2d28013": "merge",
"77c9f803050c33ef23e6e428fbc98d5bde14ab2b": "merge",
"02dc4ec08801ab168b2838f3322a3a477ee179be": "merge",
"f804ddaa38c0d16757bd82f4aa58292da39a427c": "merge",
"62cc44865952495586686578bfb905b0026a762f": "merge",
"ae10bcefb1c772dd8898f56a6590a4342a880539": "merge",
"59427406b24625474bcefc245fd1f27b30bc2c81": "merge",
"3c7b681ea2f999c7581bf38d4c8e2a5486f8e00f": "merge",
"2c17be0a20d9399ab3b7c491080aee118301d4e1": "merge",
"072d8163161c207aafa8fff050ad93b7cafa815d": "merge",
"881b8d6623cf104d25bef36a17f89af27d0a6e8e": "merge",
"41875d183c0ef7d3a71477ae0d0f4bea0763812c": "merge",
"ec1cfbe8410b6569a010a708f81ab9768e882cf1": "merge",
"bd6faef7a688ea78b5017c19442e302f0337a3f1": "merge",
"225823b9e7e36d1d9e2e8ff0dbce20f3de6dc040": "merge",
"35d2a6b775147a9b60212d43810338231a239865": "merge",
"a6484f51b8be1c8b1898464e2a60b7b4496afd23": "merge",
"058172d888c1d39d7af7dc95fb6808bcde2a9a33": "merge",
"8d0a231bab26e73c7ed7e908e049fb48f4ee333f": "merge",
"96230c861c8a18100f40daa6a1d6e2ea91778707": "merge",
"a3f4effb653fdc612188a260ee04e6fd58213d41": "merge",
"3adb67d39baa74d72ca6fd926ab000b0e5b50007": "merge",
"e21b4aaf288b2223783f39db5383c39fdd32773f": "merge",
"885d004c25647a9561922fc2d5f2c2a7fe4f91f4": "merge",
"53c8b120a142ebcafbec64ca3c9040afa39f4d02": "merge",
"3e5be3f2f9b83d0506d261b45913d2cd5bd1e36a": "merge",
"ce65391f84fecb5d44d7de447d87bb1ef64a60bf": "merge",
"a39abb5b691049421ebb8720c0d5305a95bb244e": "merge",
"a6065db7c298ecb714b86f930bec5988d4466603": "merge",
"607aae23587abc641177812447a45099973c41a9": "merge",
"c94ec26898c0d2bc0b76a42f690bb3722c01fc43": "merge",
"d31c729d0d9dbef75416af90502bf75194e6e28a": "merge",
"28442303003251487bbf9376acda1e6f35c1ea59": "merge",
"1b263555acc9753b4de67191c636c9d403a69c98": "merge",
"88aaa450db0dd1121d47ff9f605e727efbdcaab7": "merge",
"d9e19f0a3a5dea8a04d5b2ad241f2ef9c1f06f59": "merge",
"16e379debd6d37512791ee9333079c669df15575": "merge",
"f51a06926de6189cc24abd4e67a325aaf1ad464a": "merge",
"9b8e2100c116ded02740c4cb777a5f157faccdb0": "merge",
"f34a747406f51b6531f04c0d14242226503051de": "merge",
"db2dedf1eed6cba52eb672be03855792e85da808": "merge",
"b6259a36610f0fd926ee6778b779fe538b9a510d": "merge",
"717e057c3fccf73ff4c574c7fe925b16058d8809": "merge",
"9b6e897d7f92a031ced4241f81f8b1a16f1dcebe": "merge",
"7c9434809e1fb66093267cb386ede23690217c5d": "merge",
"8d37e5a1fe1ead57a6fd6358f3dc734a661d69c3": "merge",
"b8336b4760340e35f4e3160b8e9c6bbe1da9cae1": "merge",
"1c3f0ea4fc7e4c995d8a4926081335a8c5f68dce": "merge",
"837abf0a5cce27a14c20cd5a21af867f8174be43": "merge",
"a1c63a7e1777c43cbf07834d8ad5afc05321445b": "merge",
"22c80d3819f0537048ef7d5d5ec9097713fc8dac": "merge",
"3eebe3845a49ee6374433d49c75d203021533a88": "merge",
"deba84d9ce4b353cd6c332fa1ce130541f9c3ceb": "merge",
"2a6f99c6ea2519768459228412712bd5e53ade15": "merge",
"469559bb434980822263f6b4986c50b6acba5cf0": "merge",
"97caa33b8724da030a27366e497db1848a38b202": "merge",
"68fdeb90581c9b5f0ebad6bb83210afd427deee7": "merge",
"cbcf7732216210ec544f18a676118c8ad7183695": "merge",
"5854cac43108e0ec536ae892a31d53c961386b6c": "merge",
"acb7e01a2f4e6516bcdd9d9f41f7993d81ae944c": "merge",
"97bac2c4905f6a4d8b21feb0bd7639222ff14b03": "merge",
"c29cea5412dc33461b3a57339e324d85cc9d8d2c": "merge",
"e8301271a18a1914cea7bf4ed31f38db7aba2dc7": "merge",
"a1192f7fa21c7525312de0620d64411918ca6bc7": "merge",
"8d9b61b5c93dc9506e9beeb545d28a6f0605b0ce": "merge",
"d2d22d4f31fbfa79cd7bc21d804a26cf2733615e": "merge",
"a4c98e31776765ce53ab27a4992e7e3797f0aa53": "merge",
"f8cafe342cb100e754258b7b8ea1d6f830b1939b": "merge",
"5872e950aa89f5211554bb41a4fea2983547fabc": "merge",
"26fa80ae816ebdf833bb53b111214dd7d6fcef14": "merge"
}


@@ -0,0 +1,107 @@
# SUBAGENT — Activity enrichment
You are a subagent in the game-library enrichment pipeline. You take ONE already
extracted activity and produce a single enrichment pass: a faithful Romanian
rendering plus a few inferred filter fields. You do **one** activity per prompt.
This is **not** re-extraction. The activity text already exists and is trusted.
Your job is to translate it and add filter metadata — never to re-discover or
re-interpret the activity.
## Your task
The prompt gives you two blocks:
1. **Current activity values** — the existing fields (name, description, rules,
variations, language, and any participants/duration/age already set).
2. **Source chunk text** — the original passage the activity came from. This is
your ground truth for any expansion. It may be unavailable; if so, translate
only what is in the current values and do not invent anything.
Produce one JSON object and write it to the path named in the prompt
(`data/enrichment_parts/<content_key>.json`). It MUST contain the exact
`content_key` string from the prompt.
## Rules
### Translation (always)
- Translate `name`, `description`, `rules`, `variations` into natural, fluent
Romanian → `name_ro`, `description_ro`, `rules_ro`, `variations_ro`.
- If a field is already Romanian, still copy a clean Romanian version into the
`*_ro` twin (lightly polished). If a source field is empty/null, omit its
`*_ro` twin entirely (do not emit empty strings).
- Translate faithfully. Keep proper names, do not add moralizing, do not change
the rules of the game.
### Description expansion (constrained)
- You MAY make `description_ro` richer than a literal translation — but ONLY
using detail that is actually present in the **source chunk text**. Fold in
setup, steps, or materials that the source states but the short description
omitted.
- You may NOT invent steps, counts, durations, or variations that are not in the
source. If the source is thin, the translation stays thin. Hallucinated
expansion is the one unacceptable failure here.
### Inferred filter fields (mark when inferred)
Fill these when you can, using the source text first, then reasonable inference:
- `indoor_outdoor`: one of `indoor`, `outdoor`, `either`.
- `space_needed`: one of `mic`, `mediu`, `mare` (small / medium / large area).
- `participants_min`, `participants_max`: integers (people).
- `duration_min`, `duration_max`: integers (minutes).
- `age_group_min`, `age_group_max`: integers (years).
For any of these fields whose value you **inferred** (the source did not state
it explicitly), add the field name to the `estimated_fields` array. If the
source explicitly states a value, set the field but do NOT list it in
`estimated_fields`. Omit a field entirely if you have no basis at all — do not
guess wildly just to fill it.
Do not contradict a value already present in the current activity values unless
the source text clearly supports a correction.
## Enum vocabulary (fixed — use these exact slugs)
- `indoor_outdoor`: `indoor` | `outdoor` | `either`
- `space_needed`: `mic` | `mediu` | `mare`
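
As an illustration only, the field rules above could be checked mechanically along these lines. This is a minimal sketch, not the pipeline's own validator: `check_enrichment_part` is a hypothetical helper, and the requirement that every name in `estimated_fields` refers to a field that is actually filled is an assumption drawn from the wording above.

```python
# Enum slugs and inferable field names copied from the rules above.
INDOOR_OUTDOOR = {"indoor", "outdoor", "either"}
SPACE_NEEDED = {"mic", "mediu", "mare"}
INFERABLE_FIELDS = {
    "indoor_outdoor", "space_needed",
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
}


def check_enrichment_part(part: dict, expected_key: str) -> list:
    """Return a list of problems; an empty list means the part looks well formed.

    Hypothetical helper for illustration, not part of the pipeline.
    """
    problems = []
    if part.get("content_key") != expected_key:
        problems.append("content_key does not match the prompt")
    estimated = part.get("estimated_fields")
    if not isinstance(estimated, list):
        problems.append("estimated_fields must be a list (use [] if nothing was inferred)")
        estimated = []
    if "indoor_outdoor" in part and part["indoor_outdoor"] not in INDOOR_OUTDOOR:
        problems.append("indoor_outdoor is not a known slug")
    if "space_needed" in part and part["space_needed"] not in SPACE_NEEDED:
        problems.append("space_needed is not a known slug")
    for name in estimated:
        # Assumption: a field listed as estimated must be inferable and actually set.
        if name not in INFERABLE_FIELDS or name not in part:
            problems.append(f"estimated_fields lists '{name}' but it is not a filled filter field")
    for key, value in part.items():
        # Empty *_ro strings should be omitted rather than emitted.
        if key.endswith("_ro") and value == "":
            problems.append(f"{key} is empty; omit it instead")
    return problems
```

A part that returns an empty list here still has to be a faithful translation; the sketch only covers the structural rules.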
## Output format
Write exactly one JSON object to `data/enrichment_parts/<content_key>.json`:
```json
{
"content_key": "<the exact key from the prompt>",
"name_ro": "…",
"description_ro": "…",
"rules_ro": "…",
"variations_ro": "…",
"indoor_outdoor": "outdoor",
"space_needed": "mediu",
"participants_min": 6,
"participants_max": 20,
"duration_min": 15,
"duration_max": 30,
"age_group_min": 8,
"age_group_max": 14,
"estimated_fields": ["space_needed", "duration_min", "duration_max"]
}
```
Include only the fields you actually fill. Always include `content_key` and
`estimated_fields` (use `[]` if nothing was inferred). Output valid JSON only —
no commentary, no markdown fences in the file itself.
### CRITICAL — escape quotes inside string values
Any ASCII double-quote (`"`, U+0022) inside a string value MUST be escaped as
`\"`. Your Romanian text is full of `„cuvânt"` — written raw, the closing ASCII
`"` terminates the JSON string early and the whole file fails to parse (and your
enrichment for this activity is silently lost). Either keep the typographic
marks (`„ ”`) or escape every literal ASCII `"`. After writing, re-read the file
and confirm it parses as valid JSON.
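
For anyone producing or checking such a file programmatically, the sketch below shows one way the escaping can be obtained for free: serializing with Python's standard `json` module escapes every ASCII `"` inside a value. The dictionary contents are hypothetical example values, not real pipeline data.

```python
import json

# Hypothetical enrichment values; the description contains a Romanian opening
# quote followed by a plain ASCII closing quote (U+0022).
part = {
    "content_key": "example-key",
    "description_ro": 'Grupul cântă „Unu" în cor.',
    "estimated_fields": [],
}

# json.dumps escapes the ASCII quote; ensure_ascii=False keeps „ and the
# diacritics readable instead of turning them into \uXXXX sequences.
text = json.dumps(part, ensure_ascii=False, indent=2)

# Re-read check: the serialized text must parse back to the same object.
assert json.loads(text) == part
```

The re-read step at the end mirrors the "confirm it parses" instruction above.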
## Report
After writing the file, report in under 30 words: the activity name and which
fields you estimated.

scripts/SUBAGENT_PROMPT.md Normal file

@@ -0,0 +1,113 @@
# SUBAGENT — Activity extraction
You are a subagent in the game-library extraction pipeline. You extract
educational activities (games, team-building, scouting, recipes, songs,
ceremonies) from one chunk of a source document into structured JSON.
## Your task
1. **Read ONLY the chunk you were assigned.** Do not read other chunks, other
files, or the original document. The chunk is a `.txt` file with
`--- PAGE N ---` markers.
2. Identify **every distinct activity** in the chunk.
3. For each activity, fill the schema in `scripts/activity_schema.json`.
4. Write the result to `data/extracted/<chunk_key>.json`.
## What counts as "a distinct activity"
A distinct activity is a self-contained game/activity/recipe/song/ceremony with
its own name and a real description of how to do it. It is NOT:
- a bare mention or a cross-reference with no description — **skip it**;
- a sub-variant of an activity already extracted — fold it into `variations`;
- a heading, a table of contents entry, or running page chrome.
If the same activity is split across a page boundary inside your chunk, treat it
as **one** activity and combine the text.
## Output format
The file is one JSON object: a `header` plus an `activities` array.
```json
{
"header": {
"source_id": "<set from the prompt>",
"chunk_key": "<set from the prompt>",
"source_hash": "<set from the prompt>",
"schema_version": "1.0",
"prompt_version": "1.0",
"chunk_range": "pages 1-20"
},
"activities": [ ... ]
}
```
## Rules for each activity
- **`name`** — the activity's real name (≥3 characters).
- **`description`** — real prose describing the activity. No hard length limit,
but it must actually describe what happens.
- **`rules`** — how it is played / carried out, if the source gives rules.
- **`category`** — exactly one taxonomy slug (see the `enum` in the schema):
`jocuri-cercetasesti`, `team-building`, `icebreakers`, `camp-outdoor`,
`wide-games`, `orientare`, `prim-ajutor`, `escape-room-puzzle`,
`creative-stem`, `sports-active`, `cantece-ceremonii`, `retete`,
`supravietuire`, `integrare-incluziune`, `conflict-empatie`, `altele`.
When unsure, use `altele`.
- **`content_type`** — the FORM of the content, independent of category:
`joc`, `activitate`, `reteta`, `cantec`, or `ceremonie`.
- **`language`** — `ro` or `en` (the language the activity is written in).
- **`source_excerpt`** — **MANDATORY.** A short quote (one or two sentences)
copied **verbatim** from the chunk. This is the anti-hallucination anchor: it
is checked as a fuzzy substring of the chunk, and invented quotes are
rejected.
- **`page_reference`** — **MANDATORY.** The `--- PAGE N ---` marker(s) the
activity came from, e.g. `"page 14"` or `"pages 14-15"`.
- **`extraction_confidence`** — `high`, `med`, or `low`. Use `low` when the
source text for the activity is thin or ambiguous.
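
To make the `source_excerpt` rule concrete, here is a rough sketch of what a "fuzzy substring" check can look like. It is a stand-in built on the standard library's `difflib`, not the pipeline's actual matcher; the function name, the 0.9 threshold and the sliding-window approach are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not matter."""
    return " ".join(text.lower().split())


def excerpt_looks_verbatim(excerpt: str, chunk: str, threshold: float = 0.9) -> bool:
    """Approximate fuzzy-substring test (hypothetical stand-in).

    Slides a window the size of the excerpt across the chunk and accepts
    as soon as one window is similar enough to the excerpt.
    """
    needle, haystack = _normalize(excerpt), _normalize(chunk)
    if not needle:
        return False
    if needle in haystack:
        return True  # exact match after normalization
    width = len(needle)
    step = max(1, width // 20)  # small step keeps window misalignment low
    for start in range(0, max(1, len(haystack) - width + 1), step):
        window = haystack[start:start + width]
        if SequenceMatcher(None, needle, window, autojunk=False).ratio() >= threshold:
            return True
    return False
```

A quote copied character for character passes the exact-match branch immediately, which is why copying beats paraphrasing: a reworded excerpt has to survive the similarity threshold instead.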
## Never invent data
- Do **not** invent ages, participant counts, or durations. If the source does
not state them, leave those fields `null`.
- Do **not** paraphrase the `source_excerpt` — copy it character for character.
- Better to extract fewer activities accurately than to pad the output.
## Escaping quotes inside JSON strings (CRITICAL)
Any ASCII double-quote (`"`, U+0022) that appears **inside a string value** must
be written escaped as `\"`. This is the single most common way these extractions
break: Romanian source text uses typographic quotes like `„cuvânt"` where the
closing mark is a plain ASCII `"`. Written raw, it terminates the JSON string
early and corrupts the whole file. So:
- `"description": "grupul cântă „Unu\" în cor"` ← correct (inner `"` escaped)
- `"description": "grupul cântă „Unu" în cor"` ← BROKEN (unescaped `"`)
Prefer keeping the source's typographic quotes (`„ ”`), but whenever a literal
ASCII `"` lands inside a value, escape it. After writing, re-read the file and
confirm it parses as valid JSON.
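
The two example lines above can be checked with a few lines of Python. This is only a sketch of the "re-read and confirm it parses" step; `parses` is a hypothetical helper, and each example is wrapped in braces so that it forms a complete JSON object.

```python
import json

# The two example lines above, wrapped in braces so each is a complete object.
correct = '{"description": "grupul cântă „Unu\\" în cor"}'
broken = '{"description": "grupul cântă „Unu" în cor"}'


def parses(text: str) -> bool:
    """Return True when text is valid JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# The escaped variant round-trips; the raw variant ends the string early
# and the parser rejects what follows.
assert parses(correct)
assert not parses(broken)
```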
## Writing large outputs in batches (IMPORTANT)
A single Write tool call is capped at roughly 32K output tokens, and dense
chunks (50+ activities) will exceed that. If you estimate more than 30
activities, write the file **incrementally**:
1. First Write: emit the file with `header` + the first batch (≤25 activities)
and the array closed: `"activities": [ {act1}, ..., {act25} ] }`.
2. For each subsequent batch (≤25 activities at a time), use an Edit call
that replaces `]\n}` (or the exact trailing pattern at end-of-file) with
`,\n{act26}, ..., {act50}\n]\n}`. Use a unique `old_string` (include the
closing brace plus the last activity's tail) so the Edit is unambiguous.
3. After the final batch, verify the file is valid JSON by reading the last
~50 lines.
This keeps each tool call under the output-token cap.
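
The batching steps above amount to a plain string operation on the end of the file. The sketch below simulates them in Python under a few assumptions: `first_write` and `append_batch` are hypothetical names, the file always ends with the exact pattern `\n]\n}`, and the first batch is not empty.

```python
import json


def first_write(header: dict, batch: list) -> str:
    """Initial file content: header plus first batch, with the array closed."""
    body = ",\n".join(json.dumps(a, ensure_ascii=False) for a in batch)
    return (
        "{\n"
        f'"header": {json.dumps(header, ensure_ascii=False)},\n'
        f'"activities": [\n{body}\n]\n}}'
    )


def append_batch(content: str, batch: list) -> str:
    """Replace the trailing ']\\n}' with the new batch followed by the same closer."""
    closer = "\n]\n}"
    if not content.endswith(closer):
        raise ValueError("file does not end with the expected closing pattern")
    body = ",\n".join(json.dumps(a, ensure_ascii=False) for a in batch)
    # Drop the closer, add a comma plus the new activities, then close again.
    return content[: -len(closer)] + ",\n" + body + closer
```

Because the array is closed after every step, the file is valid JSON at each intermediate point, so an interrupted run leaves a parseable (if incomplete) result.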
## Before you finish
- Every activity has a non-empty `source_excerpt` and `page_reference`.
- The file validates against `scripts/activity_schema.json`.
- You only used text from your assigned chunk.


@@ -0,0 +1,110 @@
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "Game-library extraction output",
"description": "One subagent output file: a header carrying provenance/version metadata plus the list of activities extracted from a single chunk.",
"type": "object",
"required": ["header", "activities"],
"additionalProperties": false,
"properties": {
"header": {
"type": "object",
"required": ["source_hash", "schema_version", "prompt_version", "chunk_range"],
"additionalProperties": true,
"properties": {
"source_hash": {"type": "string", "minLength": 8},
"schema_version": {"type": "string"},
"prompt_version": {"type": "string"},
"chunk_range": {"type": "string"},
"source_id": {"type": ["string", "null"]},
"chunk_key": {"type": ["string", "null"]}
}
},
"activities": {
"type": "array",
"items": {"$ref": "#/definitions/activity"}
}
},
"definitions": {
"activity": {
"type": "object",
"required": [
"name",
"description",
"category",
"content_type",
"language",
"extraction_confidence",
"source_excerpt",
"page_reference"
],
"additionalProperties": false,
"properties": {
"name": {"type": "string", "minLength": 3},
"description": {"type": "string", "minLength": 1},
"rules": {"type": ["string", "null"]},
"variations": {"type": ["string", "null"]},
"category": {
"type": "string",
"enum": [
"jocuri-cercetasesti",
"team-building",
"icebreakers",
"camp-outdoor",
"wide-games",
"orientare",
"prim-ajutor",
"escape-room-puzzle",
"creative-stem",
"sports-active",
"cantece-ceremonii",
"retete",
"supravietuire",
"integrare-incluziune",
"conflict-empatie",
"altele"
]
},
"subcategory": {"type": ["string", "null"]},
"content_type": {
"type": "string",
"enum": ["joc", "activitate", "reteta", "cantec", "ceremonie"]
},
"language": {"type": "string", "enum": ["ro", "en"]},
"extraction_confidence": {
"type": "string",
"enum": ["high", "med", "low"]
},
"source_excerpt": {"type": "string", "minLength": 1},
"page_reference": {"type": "string", "minLength": 1},
"source_file": {"type": ["string", "null"]},
"age_group_min": {"type": ["integer", "null"], "minimum": 0},
"age_group_max": {"type": ["integer", "null"], "minimum": 0},
"participants_min": {"type": ["integer", "null"], "minimum": 0},
"participants_max": {"type": ["integer", "null"], "minimum": 0},
"duration_min": {"type": ["integer", "null"], "minimum": 0},
"duration_max": {"type": ["integer", "null"], "minimum": 0},
"materials_category": {"type": ["string", "null"]},
"materials_list": {
"type": ["array", "null"],
"items": {"type": "string"}
},
"skills_developed": {
"type": ["array", "null"],
"items": {"type": "string"}
},
"difficulty_level": {
"type": ["string", "null"],
"enum": ["usor", "mediu", "dificil", null]
},
"keywords": {
"type": ["array", "null"],
"items": {"type": "string"}
},
"tags": {
"type": ["array", "null"],
"items": {"type": "string"}
}
}
}
}
}

scripts/build_database.py Normal file

@@ -0,0 +1,780 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
build_database.py — build data/activities.db from the subagent extraction JSON.
Replaces the old import_claude_activities.py. Pipeline (plan §4):
1. `--rebuild` builds into data/activities.db.tmp; on success the live DB is
backed up to data/activities.db.bak and the tmp file is swapped in with an
atomic os.replace. A mid-build crash leaves the live DB untouched.
2. Every data/extracted/*.json is validated against scripts/activity_schema.json;
invalid files are moved to data/extracted/_rejected/ with an error log.
2b. Each source_excerpt must appear as a fuzzy substring (rapidfuzz
partial_ratio >= 90) of its source chunk — non-matches are hallucinations
and the activity is dropped (logged to _rejected/).
3. `category` is normalized to a valid taxonomy slug (fallback `altele`).
4. Dedup (D5): group by exact normalized_name, never across languages; within a
group rapidfuzz on descriptions — >=85 auto-merge, 60-85 borderline (keep
both, needs_review), <60 separate variants.
5. data/review_decisions.json is applied before insert.
6. Bulk insert into the tmp DB, populate the categories table, rebuild FTS.
7. A QA report is printed.
Usage:
python scripts/build_database.py --rebuild
"""
from __future__ import annotations
import argparse
import json
import os
import shutil
import sys
from collections import defaultdict
from pathlib import Path
from typing import Any, Optional
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
if _p not in sys.path:
sys.path.insert(0, _p)
from app.config_taxonomy import ( # noqa: E402
category_display_name,
normalize_category,
normalize_content_type,
)
from app.models.activity import Activity # noqa: E402
from app.models.database import DatabaseManager # noqa: E402
from import_common import ( # noqa: E402
DEFAULT_SCHEMA_PATH,
content_key,
excerpt_matches,
find_chunk_text,
iter_extraction_files,
load_schema,
normalize_name,
source_path_for,
)
# dedup thresholds (rapidfuzz token_sort_ratio, 0..100 scale)
AUTO_MERGE_THRESHOLD = 85.0
BORDERLINE_THRESHOLD = 60.0
# --------------------------------------------------------------------------
# extraction dict -> Activity
# --------------------------------------------------------------------------
def _csv(value: Any) -> Optional[str]:
"""Schema arrays -> comma string for the (TEXT) DB columns."""
if value is None:
return None
if isinstance(value, str):
return value.strip() or None
if isinstance(value, (list, tuple)):
parts = [str(v).strip() for v in value if str(v).strip()]
return ", ".join(parts) or None
return str(value)
def _split_csv(value: Optional[str]) -> list[str]:
if not value:
return []
return [p.strip() for p in str(value).split(",") if p.strip()]
def dict_to_activity(
adict: dict,
source_file: str,
source_id: Optional[str] = None,
chunk_key: Optional[str] = None,
) -> Activity:
"""Build an Activity from one extraction-JSON activity object."""
tags = adict.get("tags") or []
if isinstance(tags, str):
tags = _split_csv(tags)
source_files = adict.get("source_files") or []
if isinstance(source_files, str):
source_files = _split_csv(source_files)
if source_file and source_file not in source_files:
source_files = [source_file, *source_files]
return Activity(
source_id=source_id,
source_ids=[source_id] if source_id else [],
chunk_key=chunk_key,
name=(adict.get("name") or "").strip(),
description=(adict.get("description") or "").strip(),
rules=adict.get("rules"),
variations=adict.get("variations"),
category=normalize_category(adict.get("category", "")),
subcategory=adict.get("subcategory"),
content_type=normalize_content_type(adict.get("content_type", "")),
source_file=source_file,
source_files=list(source_files),
page_reference=adict.get("page_reference"),
source_excerpt=adict.get("source_excerpt"),
age_group_min=adict.get("age_group_min"),
age_group_max=adict.get("age_group_max"),
participants_min=adict.get("participants_min"),
participants_max=adict.get("participants_max"),
duration_min=adict.get("duration_min"),
duration_max=adict.get("duration_max"),
materials_category=adict.get("materials_category"),
materials_list=_csv(adict.get("materials_list")),
skills_developed=_csv(adict.get("skills_developed")),
difficulty_level=adict.get("difficulty_level"),
keywords=_csv(adict.get("keywords")),
tags=list(tags),
language=adict.get("language"),
extraction_confidence=adict.get("extraction_confidence"),
)
# --------------------------------------------------------------------------
# step 3 — category normalization is done in dict_to_activity; a non-taxonomy
# value silently falls back to `altele`. This logs the substitutions.
# --------------------------------------------------------------------------
def log_category_fallbacks(raw_pairs: list[tuple[str, str]]) -> list[str]:
"""raw_pairs = (original, slug); return human-readable fallback messages."""
msgs = []
for original, slug in raw_pairs:
if slug == "altele" and normalize_name(original or "") not in ("", "altele"):
msgs.append(f"category '{original}' -> altele (not in taxonomy)")
return msgs
# --------------------------------------------------------------------------
# step 4 — dedup
# --------------------------------------------------------------------------
def _longest(*values: Optional[str]) -> Optional[str]:
best: Optional[str] = None
for v in values:
if v and (best is None or len(v) > len(best)):
best = v
return best
def _union_csv(values: list[Optional[str]]) -> Optional[str]:
seen: list[str] = []
for value in values:
for item in _split_csv(value):
if item not in seen:
seen.append(item)
return ", ".join(seen) or None
def merge_cluster(cluster: list[Activity]) -> Activity:
"""Collapse a cluster of duplicate activities into one merged Activity."""
if len(cluster) == 1:
return cluster[0]
# representative = the one with the longest description
rep = max(cluster, key=lambda a: len(a.description or ""))
merged = Activity(
name=rep.name,
description=_longest(*(a.description for a in cluster)) or rep.description,
rules=_longest(*(a.rules for a in cluster)),
variations=_longest(*(a.variations for a in cluster)),
category=rep.category,
subcategory=rep.subcategory,
content_type=rep.content_type,
source_file=rep.source_file,
page_reference=rep.page_reference,
source_excerpt=rep.source_excerpt,
age_group_min=rep.age_group_min,
age_group_max=rep.age_group_max,
participants_min=rep.participants_min,
participants_max=rep.participants_max,
duration_min=rep.duration_min,
duration_max=rep.duration_max,
materials_category=rep.materials_category,
materials_list=_union_csv([a.materials_list for a in cluster]),
skills_developed=_union_csv([a.skills_developed for a in cluster]),
difficulty_level=rep.difficulty_level,
keywords=_union_csv([a.keywords for a in cluster]),
language=rep.language,
extraction_confidence=rep.extraction_confidence,
)
# union of tags
tags: list[str] = []
for a in cluster:
for t in a.tags or []:
if t not in tags:
tags.append(t)
merged.tags = tags
# accumulate every source the activity was seen in
sources: list[str] = []
for a in cluster:
for s in [a.source_file, *(a.source_files or [])]:
if s and s not in sources:
sources.append(s)
merged.source_files = sources
# source provenance: keep rep's chunk_key/source_id as primary, union the
# source_ids for the download route. Enrichment fields (name_ro,
# description_ro, indoor_outdoor, ...) are intentionally NOT carried here:
# enrichment is applied AFTER dedup (plan Part B2), keyed on the merged
# row's content_key, so merging must not pre-populate them.
merged.source_id = rep.source_id
merged.chunk_key = rep.chunk_key
source_ids: list[str] = []
for a in cluster:
for sid in [a.source_id, *(a.source_ids or [])]:
if sid and sid not in source_ids:
source_ids.append(sid)
merged.source_ids = source_ids
# popularity_score++ per merged duplicate (plan §4)
merged.popularity_score = max(a.popularity_score for a in cluster) + (len(cluster) - 1)
return merged
def dedup_activities(activities: list[Activity]) -> tuple[list[Activity], dict]:
"""
Dedup per plan D5.
Groups by (normalized_name, language) — different languages are NEVER
merged. Within a group, descriptions are clustered with rapidfuzz:
>= 85 -> same cluster (auto-merge)
60-85 -> borderline: kept as separate clusters, both flagged needs_review
< 60 -> separate variants
"""
from rapidfuzz import fuzz
groups: dict[tuple, list[Activity]] = defaultdict(list)
for act in activities:
key = (act.normalized_name or normalize_name(act.name), act.language)
groups[key].append(act)
result: list[Activity] = []
stats = {"input": len(activities), "auto_merged": 0, "borderline": 0, "output": 0}
for members in groups.values():
clusters: list[list[Activity]] = []
borderline_idx: set[int] = set()
for act in members:
best_idx, best_score = -1, -1.0
borderline_here: list[int] = []
for idx, cluster in enumerate(clusters):
score = fuzz.token_sort_ratio(
act.description or "", cluster[0].description or ""
)
if score >= AUTO_MERGE_THRESHOLD:
if score > best_score:
best_idx, best_score = idx, score
elif score >= BORDERLINE_THRESHOLD:
borderline_here.append(idx)
if best_idx >= 0:
clusters[best_idx].append(act)
else:
clusters.append([act])
new_idx = len(clusters) - 1
for bidx in borderline_here:
borderline_idx.add(bidx)
borderline_idx.add(new_idx)
for idx, cluster in enumerate(clusters):
merged = merge_cluster(cluster)
if len(cluster) > 1:
stats["auto_merged"] += len(cluster) - 1
if idx in borderline_idx:
merged.needs_review = 1
stats["borderline"] += 1
result.append(merged)
stats["output"] = len(result)
return result, stats
# --------------------------------------------------------------------------
# step 5 — review decisions
# --------------------------------------------------------------------------
def load_review_decisions(path: Path) -> dict:
if path and path.is_file():
try:
data = json.loads(path.read_text(encoding="utf-8"))
if isinstance(data, dict):
return data
except (json.JSONDecodeError, OSError):
pass
return {}
def apply_review_decisions(
activities: list[Activity], decisions: dict
) -> tuple[list[Activity], dict]:
"""
Apply data/review_decisions.json (plan §5c).
Keyed by the stable content_key. A decision of `drop` removes the row;
`keep-separate` / `merge` clear needs_review (the user has resolved it).
Rows with no decision keep needs_review and resurface in the queue.
"""
kept: list[Activity] = []
stats = {"dropped": 0, "resolved": 0}
for act in activities:
key = content_key(
act.normalized_name or normalize_name(act.name),
act.language,
act.description or "",
)
entry = decisions.get(key)
decision = entry.get("decision") if isinstance(entry, dict) else entry
if decision == "drop":
stats["dropped"] += 1
continue
if decision in ("keep-separate", "merge"):
act.needs_review = 0
stats["resolved"] += 1
kept.append(act)
return kept, stats
# --------------------------------------------------------------------------
# step 5b — enrichment overlay (plan Part B)
# --------------------------------------------------------------------------
# Translation / inferred-filter fields written by run_enrichment.py. Applied
# AFTER dedup + review decisions, keyed on the same stable content_key, so the
# overlay survives rebuilds as long as extraction text is frozen.
_ENRICHMENT_TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
_ENRICHMENT_INT_FIELDS = (
"participants_min", "participants_max",
"duration_min", "duration_max",
"age_group_min", "age_group_max",
)
def load_enrichment(path: Path) -> dict:
"""Load data/enrichment.json (flat map content_key -> field dict)."""
if path and path.is_file():
try:
data = json.loads(path.read_text(encoding="utf-8"))
if isinstance(data, dict):
return data
except (json.JSONDecodeError, OSError):
pass
return {}
def apply_enrichment(activities: list[Activity], enrichment: dict) -> dict:
"""
Overlay enrichment fields onto the post-dedup activity list (plan B2).
Keyed by content_key. Only fields PRESENT in an entry are written; absent
fields leave the underlying DB value untouched. indoor_outdoor /
space_needed are normalized to slugs (None on unrecognised). Inferred
fields are recorded in `estimated_fields`. Translated / expanded text is
NOT re-validated against the source here — expansion fidelity is the
enrichment prompt's responsibility (plan B2 comment).
Returns {entries, matched, orphaned, fields_stated, fields_estimated}.
"""
from app.config_taxonomy import normalize_indoor_outdoor, normalize_space_needed
matched_keys: set[str] = set()
fields_stated: dict[str, int] = defaultdict(int)
fields_estimated: dict[str, int] = defaultdict(int)
for act in activities:
key = content_key(
act.normalized_name or normalize_name(act.name),
act.language,
act.description or "",
)
entry = enrichment.get(key)
if not isinstance(entry, dict):
continue
matched_keys.add(key)
estimated = set(entry.get("estimated_fields") or [])
# bilingual text twins
for fld in _ENRICHMENT_TEXT_FIELDS:
val = entry.get(fld)
if isinstance(val, str) and val.strip():
setattr(act, fld, val.strip())
# inferred / clarified structured numeric fields
for fld in _ENRICHMENT_INT_FIELDS:
if entry.get(fld) is not None:
try:
setattr(act, fld, int(entry[fld]))
except (TypeError, ValueError):
pass
# enum filters — normalized to slug, dropped if unrecognised
if entry.get("indoor_outdoor") is not None:
slug = normalize_indoor_outdoor(entry["indoor_outdoor"])
if slug:
act.indoor_outdoor = slug
if entry.get("space_needed") is not None:
slug = normalize_space_needed(entry["space_needed"])
if slug:
act.space_needed = slug
act.estimated_fields = sorted(estimated)
# QA tally: stated vs estimated population, per field
for fld in (*_ENRICHMENT_INT_FIELDS, "indoor_outdoor", "space_needed"):
if entry.get(fld) is None:
continue
if fld in estimated:
fields_estimated[fld] += 1
else:
fields_stated[fld] += 1
return {
"entries": len(enrichment),
"matched": len(matched_keys),
"orphaned": len(enrichment) - len(matched_keys),
"fields_stated": dict(fields_stated),
"fields_estimated": dict(fields_estimated),
}
# --------------------------------------------------------------------------
# golden-set recall (plan §7)
# --------------------------------------------------------------------------
def _golden_names(data: Any) -> list[str]:
items = data.get("activities", data) if isinstance(data, dict) else data
names: list[str] = []
for item in items or []:
if isinstance(item, str):
names.append(item)
elif isinstance(item, dict) and item.get("name"):
names.append(item["name"])
return names
def golden_recall(golden_dir: Path, activities: list[Activity]) -> Optional[dict]:
if not golden_dir or not golden_dir.is_dir():
return None
found = {normalize_name(a.name) for a in activities}
expected, hits = 0, 0
for gf in sorted(golden_dir.glob("*.json")):
try:
data = json.loads(gf.read_text(encoding="utf-8"))
except (json.JSONDecodeError, OSError):
continue
for name in _golden_names(data):
expected += 1
if normalize_name(name) in found:
hits += 1
if expected == 0:
return None
return {"expected": expected, "found": hits, "recall": round(hits / expected, 3)}
# --------------------------------------------------------------------------
# load + validate + excerpt-check the extraction files
# --------------------------------------------------------------------------
def collect_activities(
extracted_dir: Path,
chunks_dir: Path,
sources_dir: Path,
schema: dict,
) -> dict:
"""Validate, excerpt-check and convert every extraction file."""
rejected_dir = extracted_dir / "_rejected"
activities: list[Activity] = []
report = {
"files_total": 0,
"files_valid": 0,
"files_rejected_schema": 0,
"activities_raw": 0,
"activities_hallucinated": 0,
"category_fallbacks": [],
}
raw_categories: list[tuple[str, str]] = []
from import_common import chunk_key_for # local import to avoid clutter
for json_path in iter_extraction_files(extracted_dir):
report["files_total"] += 1
try:
data = json.loads(json_path.read_text(encoding="utf-8"))
except json.JSONDecodeError as exc:
_reject_file(json_path, rejected_dir, [f"invalid JSON: {exc}"])
report["files_rejected_schema"] += 1
continue
from import_common import validate_extraction
errors = validate_extraction(data, schema)
if errors:
_reject_file(json_path, rejected_dir, errors)
report["files_rejected_schema"] += 1
continue
report["files_valid"] += 1
header = data.get("header", {})
chunk_text = find_chunk_text(json_path, header, chunks_dir)
chunk_key = chunk_key_for(json_path, header)
source_id = header.get("source_id") or chunk_key.rsplit(".part", 1)[0]
fallback_source = (
source_path_for(source_id, sources_dir) or source_id or json_path.stem
)
hallucinated: list[dict] = []
for adict in data.get("activities", []):
report["activities_raw"] += 1
excerpt = adict.get("source_excerpt") or ""
# If the chunk text is unavailable the excerpt cannot be verified, so the
# activity is kept; it is still counted under activities_raw in the QA report.
if chunk_text is not None and not excerpt_matches(excerpt, chunk_text):
hallucinated.append(adict)
report["activities_hallucinated"] += 1
continue
src = adict.get("source_file") or fallback_source
raw_categories.append((adict.get("category", ""), normalize_category(adict.get("category", ""))))
activities.append(dict_to_activity(adict, src, source_id, chunk_key))
if hallucinated:
_log_hallucinations(json_path, rejected_dir, hallucinated)
report["category_fallbacks"] = log_category_fallbacks(raw_categories)
report["activities"] = activities
return report
def _reject_file(json_path: Path, rejected_dir: Path, errors: list[str]) -> None:
rejected_dir.mkdir(parents=True, exist_ok=True)
dest = rejected_dir / json_path.name
shutil.move(str(json_path), str(dest))
log = rejected_dir / f"{json_path.stem}.errors.txt"
log.write_text(
f"REJECTED (schema validation): {json_path.name}\n\n"
+ "\n".join(f" - {e}" for e in errors)
+ "\n",
encoding="utf-8",
)
def _log_hallucinations(
json_path: Path, rejected_dir: Path, hallucinated: list[dict]
) -> None:
rejected_dir.mkdir(parents=True, exist_ok=True)
log = rejected_dir / f"{json_path.stem}.hallucinations.txt"
lines = [f"DROPPED activities (source_excerpt not found in chunk): {json_path.name}", ""]
for a in hallucinated:
lines.append(f" - {a.get('name')!r}")
lines.append(f" excerpt: {a.get('source_excerpt')!r}")
log.write_text("\n".join(lines) + "\n", encoding="utf-8")
# --------------------------------------------------------------------------
# DB write + atomic swap
# --------------------------------------------------------------------------
def _enrich_category_display_names(db_path: Path) -> None:
"""Give the categories table proper Romanian display names for slugs."""
import sqlite3
conn = sqlite3.connect(db_path)
try:
rows = conn.execute(
"SELECT value FROM categories WHERE type = 'category'"
).fetchall()
for (slug,) in rows:
conn.execute(
"UPDATE categories SET display_name = ? WHERE type='category' AND value = ?",
(category_display_name(slug), slug),
)
conn.commit()
finally:
conn.close()
def write_database(db_tmp_path: Path, activities: list[Activity]) -> None:
"""Create a fresh tmp DB, bulk insert, populate categories, rebuild FTS."""
if db_tmp_path.exists():
db_tmp_path.unlink()
db = DatabaseManager(str(db_tmp_path))
db.bulk_insert_activities(activities)
_enrich_category_display_names(db_tmp_path)
db.rebuild_fts_index()
def atomic_swap(db_tmp_path: Path, db_path: Path) -> Optional[Path]:
"""Back up the live DB then atomically swap the tmp file in."""
backup: Optional[Path] = None
if db_path.exists():
backup = db_path.with_suffix(db_path.suffix + ".bak")
shutil.copy2(db_path, backup)
os.replace(db_tmp_path, db_path)
return backup
# --------------------------------------------------------------------------
# orchestration
# --------------------------------------------------------------------------
def rebuild(
*,
extracted_dir: Path,
chunks_dir: Path,
sources_dir: Path,
db_path: Path,
decisions_path: Optional[Path] = None,
enrichment_path: Optional[Path] = None,
schema_path: Path = DEFAULT_SCHEMA_PATH,
golden_dir: Optional[Path] = None,
do_swap: bool = True,
) -> dict:
"""
Full rebuild. Everything is built into <db_path>.tmp; the live DB is only
touched by the final atomic swap, so a crash anywhere above leaves it intact.
"""
extracted_dir = Path(extracted_dir)
db_path = Path(db_path)
db_tmp_path = db_path.with_suffix(db_path.suffix + ".tmp")
schema = load_schema(schema_path)
collected = collect_activities(extracted_dir, Path(chunks_dir), Path(sources_dir), schema)
activities: list[Activity] = collected.pop("activities")
deduped, dedup_stats = dedup_activities(activities)
decisions = load_review_decisions(Path(decisions_path)) if decisions_path else {}
final, decision_stats = apply_review_decisions(deduped, decisions)
# Enrichment overlay — applied immediately after review decisions, on the
# post-dedup list, keyed on the same stable content_key (plan B2).
enrichment = load_enrichment(Path(enrichment_path)) if enrichment_path else {}
enrichment_stats = apply_enrichment(final, enrichment)
try:
write_database(db_tmp_path, final)
backup = atomic_swap(db_tmp_path, db_path) if do_swap else None
except Exception:
if db_tmp_path.exists():
db_tmp_path.unlink()
raise
report = {
**collected,
"dedup": dedup_stats,
"decisions": decision_stats,
"enrichment": enrichment_stats,
"final_count": len(final),
"backup": str(backup) if backup else None,
"swapped": do_swap,
"qa": _qa_report(final, collected, golden_dir),
}
return report
def _qa_report(
activities: list[Activity], collected: dict, golden_dir: Optional[Path]
) -> dict:
per_category: dict[str, int] = defaultdict(int)
per_content_type: dict[str, int] = defaultdict(int)
confidence: dict[str, int] = defaultdict(int)
with_rules = 0
for a in activities:
per_category[a.category] += 1
per_content_type[a.content_type or "?"] += 1
confidence[a.extraction_confidence or "?"] += 1
if a.rules and a.rules.strip():
with_rules += 1
raw = collected.get("activities_raw", 0)
hallucinated = collected.get("activities_hallucinated", 0)
return {
"total": len(activities),
"per_category": dict(per_category),
"per_content_type": dict(per_content_type),
"extraction_confidence": dict(confidence),
"pct_with_rules": round(100 * with_rules / len(activities), 1) if activities else 0.0,
"needs_review": sum(1 for a in activities if a.needs_review),
"hallucination_rate": round(100 * hallucinated / raw, 2) if raw else 0.0,
"golden_recall": golden_recall(Path(golden_dir), activities) if golden_dir else None,
}
def print_report(report: dict) -> None:
qa = report["qa"]
print("=" * 60)
print("BUILD DATABASE — QA REPORT")
print("=" * 60)
print(f"extraction files : {report['files_total']} "
f"(valid {report['files_valid']}, schema-rejected {report['files_rejected_schema']})")
print(f"activities raw : {report['activities_raw']}")
print(f" hallucinated drop : {report['activities_hallucinated']} "
f"({qa['hallucination_rate']}%)")
d = report["dedup"]
print(f"dedup : {d['input']} -> {d['output']} "
f"(auto-merged {d['auto_merged']}, borderline {d['borderline']})")
print(f"review decisions : dropped {report['decisions']['dropped']}, "
f"resolved {report['decisions']['resolved']}")
enr = report.get("enrichment")
if enr and enr.get("entries"):
print(f"enrichment : {enr['entries']} entries "
f"(matched {enr['matched']}, orphaned {enr['orphaned']})")
stated, estimated = enr.get("fields_stated", {}), enr.get("fields_estimated", {})
all_fields = sorted(set(stated) | set(estimated))
if all_fields:
print(" field population : (stated / estimated)")
for fld in all_fields:
print(f" {fld:<18}: {stated.get(fld, 0)} / {estimated.get(fld, 0)}")
print(f"final inserted : {report['final_count']}")
print(f"% with rules : {qa['pct_with_rules']}")
print(f"needs_review rows : {qa['needs_review']}")
print("per category :")
for slug, n in sorted(qa["per_category"].items(), key=lambda kv: -kv[1]):
print(f" {slug:<24}: {n}")
print("per content_type :")
for ct, n in sorted(qa["per_content_type"].items(), key=lambda kv: -kv[1]):
print(f" {ct:<24}: {n}")
print("extraction_confidence:")
for c, n in sorted(qa["extraction_confidence"].items()):
print(f" {c:<24}: {n}")
if qa["golden_recall"]:
g = qa["golden_recall"]
print(f"golden recall : {g['found']}/{g['expected']} = {g['recall']}")
if report["category_fallbacks"]:
print("category fallbacks :")
for msg in report["category_fallbacks"]:
print(f" {msg}")
if report["backup"]:
print(f"live DB backed up to : {report['backup']}")
print("=" * 60)
# --------------------------------------------------------------------------
# CLI
# --------------------------------------------------------------------------
def main(argv: Optional[list[str]] = None) -> int:
parser = argparse.ArgumentParser(description="Build activities.db from extraction JSON.")
parser.add_argument("--rebuild", action="store_true",
help="rebuild the database from scratch (only mode supported)")
parser.add_argument("--extracted", default="data/extracted")
parser.add_argument("--chunks", default="data/chunks")
parser.add_argument("--sources", default="data/sources")
parser.add_argument("--db", default="data/activities.db")
parser.add_argument("--decisions", default="data/review_decisions.json")
parser.add_argument("--enrichment", default="data/enrichment.json")
parser.add_argument("--golden", default="data/golden")
parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
args = parser.parse_args(argv)
if not args.rebuild:
parser.error("only --rebuild is supported (full rebuild, no incremental merge)")
report = rebuild(
extracted_dir=Path(args.extracted),
chunks_dir=Path(args.chunks),
sources_dir=Path(args.sources),
db_path=Path(args.db),
decisions_path=Path(args.decisions),
enrichment_path=Path(args.enrichment),
schema_path=Path(args.schema),
golden_dir=Path(args.golden),
)
print_report(report)
return 0
if __name__ == "__main__":
raise SystemExit(main())
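Reviewer note on step 4 (dedup): the banding logic in `dedup_activities` can be hard to follow at a glance, so here is a minimal, standalone sketch of the same greedy clustering. It is an illustration under stated assumptions, not the script's code: `cluster_by_score` and `jaccard_score` are hypothetical names, and the word-overlap scorer is a toy stand-in for rapidfuzz's `token_sort_ratio`, chosen only so the example runs with the standard library. The 85 / 60 thresholds mirror `AUTO_MERGE_THRESHOLD` and `BORDERLINE_THRESHOLD` above.

```python
from typing import Callable

AUTO_MERGE = 85.0   # mirrors AUTO_MERGE_THRESHOLD
BORDERLINE = 60.0   # mirrors BORDERLINE_THRESHOLD


def jaccard_score(a: str, b: str) -> float:
    """Toy 0..100 similarity (hypothetical stand-in for token_sort_ratio)."""
    wa, wb = set(a.split()), set(b.split())
    if not wa and not wb:
        return 100.0
    return 100.0 * len(wa & wb) / len(wa | wb)


def cluster_by_score(
    items: list[str], score: Callable[[str, str], float]
) -> tuple[list[list[str]], set[int]]:
    """Greedy clustering: compare each item with every cluster's first member."""
    clusters: list[list[str]] = []
    borderline: set[int] = set()
    for item in items:
        best_idx, best_score = -1, -1.0
        near: list[int] = []
        for idx, members in enumerate(clusters):
            s = score(item, members[0])
            if s >= AUTO_MERGE:
                if s > best_score:
                    best_idx, best_score = idx, s
            elif s >= BORDERLINE:
                near.append(idx)
        if best_idx >= 0:
            # auto-merge into the best-scoring cluster
            clusters[best_idx].append(item)
        else:
            # new cluster; flag it and every borderline neighbour for review
            clusters.append([item])
            new_idx = len(clusters) - 1
            for n in near:
                borderline.update((n, new_idx))
    return clusters, borderline
```

Note that, as in the script, borderline matches are only recorded when the item opens a new cluster; an item that auto-merges somewhere does not flag its other near neighbours.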

251
scripts/chunk_sources.py Normal file

@@ -0,0 +1,251 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
chunk_sources.py — split normalized data/sources/*.txt into ~20-page chunks
for subagent extraction, and maintain data/chunks/manifest.json.
Paginated text → ~20-page chunks, ~4-page overlap (plan D8).
Unpaginated text → ~10000-word windows, ~2000-word overlap.
The manifest is a cache derived from the filesystem + per-chunk state. Re-running
this script is idempotent: existing chunk states (pending/assigned/done/rejected)
survive as long as the source content hash is unchanged.
"""
from __future__ import annotations
import argparse
import json
import sys
from pathlib import Path
SCRIPT_DIR = Path(__file__).resolve().parent
if str(SCRIPT_DIR) not in sys.path:
sys.path.insert(0, str(SCRIPT_DIR))
from extract_common import content_hash, split_pages # noqa: E402
SCHEMA_VERSION = "1.0"
PAGES_PER_CHUNK = 20
PAGE_OVERLAP = 4
WORD_WINDOW = 10_000
WORD_OVERLAP = 2_000
VALID_STATES = {"pending", "assigned", "done", "rejected"}
# --------------------------------------------------------------------------
# header parsing
# --------------------------------------------------------------------------
def parse_source(text: str) -> tuple[dict, str]:
"""Split a normalized source file into (header_dict, body)."""
lines = text.splitlines()
header: dict = {}
body_start = 0
in_header = True
for i, line in enumerate(lines):
if line.startswith("--- PAGE "):
body_start = i
break
if not in_header:
continue
if set(line.strip()) == {"="} and line.strip():
body_start = i + 1
in_header = False # header ends at the rule line
continue
if ":" in line:
key, _, val = line.partition(":")
header[key.strip()] = val.strip()
body = "\n".join(lines[body_start:])
return header, body
# --------------------------------------------------------------------------
# chunking — pure functions
# --------------------------------------------------------------------------
def chunk_pages(
pages: list[tuple[int, str]],
pages_per_chunk: int = PAGES_PER_CHUNK,
overlap: int = PAGE_OVERLAP,
) -> list[dict]:
"""
Split an ordered list of (page_no, text) into overlapping chunks.
stride = pages_per_chunk - overlap. Consecutive windows share `overlap` pages,
so any activity spanning at most overlap + 1 consecutive pages (5 with the
defaults) appears whole in at least one chunk.
"""
if not pages:
return []
stride = max(1, pages_per_chunk - overlap)
chunks: list[dict] = []
i = 0
n = len(pages)
while i < n:
window = pages[i : i + pages_per_chunk]
first, last = window[0][0], window[-1][0]
text = "".join(
f"\n--- PAGE {num} ---\n{txt}\n" for num, txt in window
)
chunks.append(
{"page_start": first, "page_end": last,
"chunk_range": f"pages {first}-{last}", "text": text}
)
if i + pages_per_chunk >= n:
break
i += stride
return chunks
def chunk_words(
text: str, window: int = WORD_WINDOW, overlap: int = WORD_OVERLAP
) -> list[dict]:
"""Split unpaginated text into overlapping word windows."""
words = text.split()
if not words:
return []
stride = max(1, window - overlap)
chunks: list[dict] = []
i = 0
n = len(words)
while i < n:
seg = words[i : i + window]
chunks.append(
{"word_start": i, "word_end": i + len(seg),
"chunk_range": f"words {i}-{i + len(seg)}", "text": " ".join(seg)}
)
if i + window >= n:
break
i += stride
return chunks
def make_chunks(source_text: str) -> list[dict]:
"""Chunk one normalized source file. Picks page- or word-windowing."""
_, body = parse_source(source_text)
pages = split_pages(body)
if pages:
return chunk_pages(pages)
return chunk_words(body)
# --------------------------------------------------------------------------
# manifest
# --------------------------------------------------------------------------
def _empty_manifest() -> dict:
return {"schema_version": SCHEMA_VERSION, "chunks": {}}
def load_manifest(manifest_path: Path) -> dict:
if manifest_path.exists():
try:
data = json.loads(manifest_path.read_text(encoding="utf-8"))
data.setdefault("schema_version", SCHEMA_VERSION)
data.setdefault("chunks", {})
return data
except (json.JSONDecodeError, OSError):
pass
return _empty_manifest()
def save_manifest(manifest: dict, manifest_path: Path) -> None:
manifest_path.parent.mkdir(parents=True, exist_ok=True)
manifest_path.write_text(
json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8"
)
def chunk_source_file(
source_path: Path, chunks_dir: Path, manifest: dict
) -> list[str]:
"""
Chunk one data/sources/<id>.txt → data/chunks/<id>/<id>.partNN.txt and
register every chunk in `manifest`. Preserves prior state when the source
content hash is unchanged. Returns the list of chunk keys written.
"""
source_id = source_path.stem
text = source_path.read_text(encoding="utf-8", errors="replace")
src_hash = content_hash(text)
chunks = make_chunks(text)
out_dir = chunks_dir / source_id
out_dir.mkdir(parents=True, exist_ok=True)
written: list[str] = []
for idx, chunk in enumerate(chunks, 1):
key = f"{source_id}.part{idx:02d}"
chunk_file = out_dir / f"{key}.txt"
chunk_file.write_text(chunk["text"], encoding="utf-8")
prior = manifest["chunks"].get(key)
# preserve state only if the source content is unchanged
if prior and prior.get("source_hash") == src_hash and \
prior.get("state") in VALID_STATES:
state = prior["state"]
else:
state = "pending"
manifest["chunks"][key] = {
"source_id": source_id,
"source_hash": src_hash,
"part": idx,
"chunk_range": chunk["chunk_range"],
"chunk_file": str(chunk_file.relative_to(chunks_dir.parent)),
"expected_json": f"{key}.json",
"state": state,
}
written.append(key)
return written
def prune_stale(manifest: dict, live_keys: set[str]) -> list[str]:
"""Drop manifest entries whose chunk no longer exists on disk."""
stale = [k for k in manifest["chunks"] if k not in live_keys]
for k in stale:
del manifest["chunks"][k]
return stale
# --------------------------------------------------------------------------
# CLI
# --------------------------------------------------------------------------
def run(sources_dir: Path, chunks_dir: Path) -> dict:
"""Chunk every *.txt in sources_dir. Returns a summary dict."""
manifest_path = chunks_dir / "manifest.json"
manifest = load_manifest(manifest_path)
live_keys: set[str] = set()
source_files = sorted(sources_dir.glob("*.txt"))
for src in source_files:
live_keys.update(chunk_source_file(src, chunks_dir, manifest))
stale = prune_stale(manifest, live_keys)
save_manifest(manifest, manifest_path)
states: dict[str, int] = {}
for meta in manifest["chunks"].values():
states[meta["state"]] = states.get(meta["state"], 0) + 1
return {
"sources": len(source_files),
"chunks": len(live_keys),
"pruned": len(stale),
"states": states,
}
def main(argv: list[str] | None = None) -> int:
parser = argparse.ArgumentParser(description="Chunk normalized sources.")
parser.add_argument("--sources", default="data/sources", help="sources dir")
parser.add_argument("--chunks", default="data/chunks", help="chunks output dir")
args = parser.parse_args(argv)
summary = run(Path(args.sources), Path(args.chunks))
print(f"sources processed : {summary['sources']}")
print(f"chunks written : {summary['chunks']}")
print(f"stale pruned : {summary['pruned']}")
for state, count in sorted(summary["states"].items()):
print(f" {state:<10}: {count}")
return 0
if __name__ == "__main__":
raise SystemExit(main())
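
The pruning step and the per-state tally in `run` can be tried on a toy manifest. This is a minimal sketch, assuming the manifest shape used above (`{"chunks": {key: {"state": ...}}}`). The chunk keys and the `done` state are invented for illustration, since `VALID_STATES` is not visible in this part of the diff, and `state_counts` is a hypothetical helper name.

```python
from collections import Counter


def prune_stale(manifest: dict, live_keys: set[str]) -> list[str]:
    """Drop manifest entries whose chunk no longer exists on disk."""
    stale = [k for k in manifest["chunks"] if k not in live_keys]
    for k in stale:
        del manifest["chunks"][k]
    return stale


def state_counts(manifest: dict) -> dict[str, int]:
    """Tally chunks per state, like the loop at the end of run()."""
    return dict(Counter(meta["state"] for meta in manifest["chunks"].values()))


# Hypothetical manifest: three chunks, one of which has no file on disk any more.
manifest = {
    "chunks": {
        "book_a__p000": {"state": "done"},
        "book_a__p001": {"state": "pending"},
        "book_b__p000": {"state": "pending"},
    }
}
live_keys = {"book_a__p000", "book_a__p001"}

pruned = prune_stale(manifest, live_keys)
counts = state_counts(manifest)
print(pruned)   # ['book_b__p000']
print(counts)   # {'done': 1, 'pending': 1}
```

Because the prior state is carried over for surviving keys, re-chunking an unchanged source should leave the tally untouched while removed sources disappear from it.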


@@ -1,54 +0,0 @@
# TEMPLATE FOR EXTRACTING ACTIVITIES WITH CLAUDE
## Instructions for Claude Code:
For each PDF/DOC, use the following extraction format:
### 1. Read the file:
```
Claude, please read the file: [FILE_PATH]
```
### 2. Extract the activities using this JSON template:
```json
{
  "source_file": "[FILE_NAME]",
  "activities": [
    {
      "name": "Name of the activity",
      "description": "Full description of the activity",
      "rules": "Rules of the game/activity",
      "variations": "Variations or adaptations",
      "category": "[A-H] based on type",
      "age_group_min": 6,
      "age_group_max": 14,
      "participants_min": 4,
      "participants_max": 20,
      "duration_min": 10,
      "duration_max": 30,
      "materials_list": "List of required materials",
      "skills_developed": "Skills developed",
      "difficulty_level": "Ușor/Mediu/Dificil",
      "keywords": "comma-separated keywords",
      "tags": "relevant tags"
    }
  ]
}
```
### 3. Save to a file:
After extraction, save the JSON to: `/scripts/extracted_activities/[FILE_NAME].json`
### 4. Processing priorities:
**TOP PRIORITY (process these first):**
1. 1000 Fantastic Scout Games.pdf
2. Cartea Mare a jocurilor.pdf
3. 160-de-activitati-dinamice-jocuri-pentru-team-building.pdf
4. 101 Ways to Create an Unforgettable Camp Experience.pdf
5. 151 Awesome Summer Camp Nature Activities.pdf
**Focus categories:**
- [A] Scout Games
- [C] Camping & Outdoor Activities
- [G] Educational Activities
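
A file produced from this template can be sanity-checked before it is saved. The sketch below is only an illustration, not part of the original tooling: it assumes every key shown in the JSON template is required and that each `*_min` value should not exceed its `*_max` partner. `check_activity` and the sample records are hypothetical.

```python
import json

# Field names copied from the JSON template above.
REQUIRED_FIELDS = {
    "name", "description", "rules", "variations", "category",
    "age_group_min", "age_group_max", "participants_min", "participants_max",
    "duration_min", "duration_max", "materials_list", "skills_developed",
    "difficulty_level", "keywords", "tags",
}
RANGE_PAIRS = [
    ("age_group_min", "age_group_max"),
    ("participants_min", "participants_max"),
    ("duration_min", "duration_max"),
]


def check_activity(activity: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the record looks fine."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - activity.keys())]
    for low, high in RANGE_PAIRS:
        if low in activity and high in activity and activity[low] > activity[high]:
            problems.append(f"{low} > {high}")
    return problems


# Hypothetical record that follows the template.
good = json.loads("""
{"name": "Sample game", "description": "d", "rules": "r", "variations": "v",
 "category": "A", "age_group_min": 6, "age_group_max": 14,
 "participants_min": 4, "participants_max": 20,
 "duration_min": 10, "duration_max": 30,
 "materials_list": "m", "skills_developed": "s",
 "difficulty_level": "Mediu", "keywords": "k", "tags": "t"}
""")

# Same record with swapped age bounds and no tags.
bad = dict(good, age_group_min=14, age_group_max=6)
del bad["tags"]

print(check_activity(good))  # []
print(check_activity(bad))   # ['missing field: tags', 'age_group_min > age_group_max']
```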


@@ -1,164 +0,0 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
DATABASE SETUP SCRIPT - INDEX-SISTEM-JOCURI
Script for recreating the databases that are excluded via .gitignore
Uses the DatabaseManager classes for consistency
Usage:
python scripts/create_databases.py
python scripts/create_databases.py --clear-existing
"""
import sys
import argparse
from pathlib import Path
# Add src to path so we can import our modules
sys.path.append(str(Path(__file__).parent.parent / 'src'))
from database import DatabaseManager
from game_library_manager import GameLibraryManager
def create_main_database(db_path: str = "data/activities.db", clear: bool = False):
"""Create the main activities database"""
db_file = Path(db_path)
if clear and db_file.exists():
print(f"🗑️ Removing existing database: {db_path}")
db_file.unlink()
print(f"📊 Creating main database: {db_path}")
db = DatabaseManager(db_path)
# Test the database
try:
stats = db.get_statistics()
print(f"✅ Database created successfully: {stats['total_activities']} activities")
return True
except Exception as e:
print(f"❌ Error creating database: {e}")
return False
def create_game_library_database(db_path: str = "data/game_library.db", clear: bool = False):
"""Create the legacy game library database"""
db_file = Path(db_path)
if clear and db_file.exists():
print(f"🗑️ Removing existing database: {db_path}")
db_file.unlink()
print(f"📊 Creating game library database: {db_path}")
manager = GameLibraryManager(db_path)
print("✅ Game library database created successfully")
return True
def create_test_database(db_path: str = "data/test_activities.db", clear: bool = False):
"""Create the test database"""
db_file = Path(db_path)
if clear and db_file.exists():
print(f"🗑️ Removing existing database: {db_path}")
db_file.unlink()
print(f"📊 Creating test database: {db_path}")
db = DatabaseManager(db_path)
# Add some test data
test_activity = {
'title': 'Test Activity - Setup Script',
'description': 'This is a test activity created by the setup script',
'file_path': 'test/sample.txt',
'file_type': 'TXT',
'category': 'test',
'age_group': '8-12 ani',
'participants': '5-10 persoane',
'duration': '15-30min',
'materials': 'Fără materiale',
'tags': '["test", "setup"]',
'source_text': 'Sample test content for verification'
}
try:
db.insert_activity(test_activity)
stats = db.get_statistics()
print(f"✅ Test database created with sample data: {stats['total_activities']} activities")
return True
except Exception as e:
print(f"❌ Error creating test database: {e}")
return False
def ensure_data_directory():
"""Ensure the data directory exists"""
data_dir = Path("data")
if not data_dir.exists():
print(f"📁 Creating data directory: {data_dir}")
data_dir.mkdir(parents=True)
else:
print(f"📁 Data directory exists: {data_dir}")
def main():
"""Main setup function"""
parser = argparse.ArgumentParser(description='Create databases for INDEX-SISTEM-JOCURI')
parser.add_argument('--clear-existing', '-c', action='store_true',
help='Remove existing databases before creating new ones')
parser.add_argument('--main-only', action='store_true',
help='Create only the main activities database')
parser.add_argument('--test-only', action='store_true',
help='Create only the test database')
args = parser.parse_args()
print("🚀 DATABASE SETUP - INDEX-SISTEM-JOCURI")
print("=" * 50)
# Ensure data directory exists
ensure_data_directory()
success_count = 0
total_count = 0
if args.test_only:
total_count = 1
if create_test_database(clear=args.clear_existing):
success_count += 1
elif args.main_only:
total_count = 1
if create_main_database(clear=args.clear_existing):
success_count += 1
else:
# Create all databases
databases = [
("Main activities", lambda: create_main_database(clear=args.clear_existing)),
("Game library", lambda: create_game_library_database(clear=args.clear_existing)),
("Test activities", lambda: create_test_database(clear=args.clear_existing))
]
total_count = len(databases)
for name, create_func in databases:
print(f"\n📂 Creating {name} database...")
try:
if create_func():
success_count += 1
except Exception as e:
print(f"❌ Failed to create {name} database: {e}")
print("\n" + "=" * 50)
print(f"🎯 SUMMARY: {success_count}/{total_count} databases created successfully")
if success_count == total_count:
print("✅ All databases ready!")
print("\nNext steps:")
print("1. Run indexer: cd src && python indexer.py --clear-db")
print("2. Start web app: cd src && python app.py")
else:
print("⚠️ Some databases failed to create. Check errors above.")
return 1
return 0
if __name__ == '__main__':
sys.exit(main())

102
scripts/enrich_wave.sh Executable file

@@ -0,0 +1,102 @@
#!/bin/bash
# ============================================================================
# enrich_wave.sh — ONE throttled enrichment wave, fully headless (no Claude
# session). Designed to be run by the LXC's OS cron at night.
#
# - Prepares a bounded wave (first N missing keys) via enrichment_wave.py.
# - Runs ONE `claude -p` per batch file, PAR batches concurrently (OS-level
# parallelism — no Workflow tool, no 2-per-workflow cap, no session needed).
# - When the backlog is empty, runs --collect + --rebuild and stops.
#
# Throttle = --keys (default 700, i.e. ~75% of the ~950 keys one 5h usage window holds).
# A single flock guarantees waves never overlap.
#
# Usage: scripts/enrich_wave.sh [KEYS] [PAR]
# KEYS = max keys this wave (default 700)
# PAR = concurrent claude -p (default 6)
# ============================================================================
set -uo pipefail
REPO="/workspace/game-library"
LOG_DIR="/workspace/.claude-logs"
LOCK="/tmp/enrich_wave.lock"
KEYS="${1:-700}"
PAR="${2:-6}"
MAX_TURNS=100
# --- environment (cron has a minimal env) ---------------------------------- #
export HOME="${HOME:-/home/claude}"
[ -f "$HOME/.nvm/nvm.sh" ] && . "$HOME/.nvm/nvm.sh" >/dev/null 2>&1
export PATH="$HOME/.nvm/versions/node/v20.19.6/bin:/usr/local/bin:/usr/bin:/bin:$PATH"
mkdir -p "$LOG_DIR"
TS="$(date +%Y%m%d_%H%M%S)"
LOG="$LOG_DIR/enrich_${TS}.log"
log() { echo "[$(date '+%H:%M:%S')] $*" | tee -a "$LOG"; }
# --- single-instance lock: skip if a wave is still running ----------------- #
exec 9>"$LOCK"
if ! flock -n 9; then
log "another wave holds the lock; exiting."
exit 0
fi
cd "$REPO" || { log "cannot cd $REPO"; exit 1; }
command -v claude >/dev/null 2>&1 || { log "claude CLI not on PATH"; exit 1; }
log "=== enrichment wave start (keys=$KEYS par=$PAR) ==="
# --- 1) prepare bounded wave (batch files only) ---------------------------- #
PREP="$(python3 scripts/enrichment_wave.py --prepare --keys "$KEYS" --no-shards 2>&1)"
echo "$PREP" | tee -a "$LOG"
if echo "$PREP" | grep -q "WAVE: COMPLETE"; then
log "backlog empty -> collect + rebuild"
python3 scripts/run_enrichment.py --collect >>"$LOG" 2>&1
python3 scripts/build_database.py --rebuild >>"$LOG" 2>&1
grep -E "enrichment .*matched" "$LOG" | tail -1 | tee -a "$LOG"
log "=== ENRICHMENT COMPLETE ==="
exit 0
fi
# --- 2) per-batch headless enrichment, PAR-way parallel -------------------- #
read -r -d '' BATCH_PROMPT <<'EOP'
You are an enrichment subagent in the game-library pipeline. Working dir: /workspace/game-library.
Read `scripts/ENRICHMENT_PROMPT.md` FIRST — it defines the rules and output format EXACTLY (translate faithfully to Romanian; expand description_ro ONLY from the source chunk text; mark inferred filter fields in estimated_fields; fixed enum vocab).
Your batch file is __BATCHFILE__ — it lists content_keys, one per line. For EACH key:
1. IDEMPOTENT SKIP: if `data/enrichment_parts/<key>.json` already exists AND parses as valid JSON, SKIP it (do not rewrite).
2. Otherwise read its prompt `data/enrichment_prompts/<key>.prompt.md`, produce the enrichment JSON per ENRICHMENT_PROMPT.md, and write it to `data/enrichment_parts/<key>.json` (MUST include the exact "content_key": "<key>").
3. Validate it parses: python3 -c "import json;json.load(open('data/enrichment_parts/<key>.json'))".
CRITICAL — JSON quote escaping: any literal ASCII double-quote inside a string value MUST be escaped as \". Romanian text uses „cuvant" where the closing mark is a plain ASCII " — written raw it breaks the JSON. Either keep the typographic „ " marks or escape every ASCII ". Re-read and re-validate each file; fix any that fail.
Work through EVERY key in the batch file. If a key's prompt is missing, skip it and continue. When done, reply with one line: the count written and skipped.
EOP
export REPO LOG MAX_TURNS BATCH_PROMPT
run_one() {
local bf="$1"
local name; name="$(basename "$bf")"
local prompt="${BATCH_PROMPT/__BATCHFILE__/$bf}"
cd "$REPO" || return 1
timeout 1200 claude -p "$prompt" \
--allowedTools "Bash(python3:*),Read,Write,Bash(cat:*),Bash(ls:*)" \
--max-turns "$MAX_TURNS" </dev/null >>"$LOG.$name.out" 2>&1
echo "[$(date '+%H:%M:%S')] done $name (exit $?)" >>"$LOG"
}
export -f run_one
BATCHES=(data/enrichment_batches/batch_*.txt)
log "launching ${#BATCHES[@]} batches, $PAR concurrent..."
printf '%s\n' "${BATCHES[@]}" | xargs -P "$PAR" -I{} bash -c 'run_one "$@"' _ {}
# --- 3) summary ------------------------------------------------------------ #
if grep -qi "session limit\|usage limit" "$LOG".batch_*.out 2>/dev/null; then
log "WINDOW EXHAUSTED (usage limit hit mid-wave) — unfinished keys retry next fire."
fi
STATUS="$(python3 scripts/enrichment_wave.py --status 2>&1 | grep -E 'good|missing|done')"
echo "$STATUS" | tee -a "$LOG"
log "=== wave done ==="
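
The single-instance guard in this script (`exec 9>"$LOCK"` followed by `flock -n 9`) has a direct Python counterpart in `fcntl.flock`. The sketch below shows the same non-blocking "skip if a wave is already running" behaviour. It is an illustration under assumptions, not part of the pipeline: `try_lock` is a hypothetical helper, the lock path is a throwaway temp file, and `fcntl` is available on Unix only.

```python
import fcntl
import os
import tempfile


def try_lock(path: str):
    """Return an open handle holding an exclusive lock, or None if it is taken."""
    fh = open(path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting,
        # which mirrors `flock -n` in the shell script.
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh


lock_path = os.path.join(tempfile.mkdtemp(), "enrich_wave.lock")

first = try_lock(lock_path)    # first "wave" takes the lock
second = try_lock(lock_path)   # overlapping "wave" is expected to be turned away
print(first is not None, second is None)

first.close()                  # closing the handle releases the lock
```

As in the shell version, the lock lives on the open file handle, so it is released automatically when the process exits, even after a crash.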

294
scripts/enrichment_wave.py Normal file

@@ -0,0 +1,294 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
enrichment_wave.py — throttled, window-paced wave preparation for the corpus
enrichment pipeline.
The enrichment backlog (~9541 keys) does NOT fit in one 5-hour Anthropic usage
window. Launching all remaining batches at once always runs the window to
EXHAUSTION (the "subagent completed without calling StructuredOutput" signature),
consuming 100% and blocking other work. There is no readable real-time window
meter, so pacing must be BLIND: cap each wave to a fixed KEY COUNT (sized to
~75% of empirical window capacity, ~950 keys), and let an external scheduler
(cron, every 6h) space waves across windows.
This script encapsulates the reconcile + bounded-wave preparation that used to
live as ad-hoc inline Python. It does NOT call the LLM and does NOT launch
workflows — it only prepares files on disk and prints what to launch.
Modes:
--status read-only: print done / missing / pct
--prepare --keys N --shards K drop corrupt parts; take the FIRST N missing
keys (sorted, deterministic); write batch
files for ONLY those; regenerate K shard JS
files covering exactly those batches; print
machine-greppable WAVE:/SHARD: lines.
--prepare --keys N --no-shards same as above, but writes batch files only
and skips shard JS generation; this is the
form used by the headless enrich_wave.sh path.
Idempotency: a key is "done" iff data/enrichment_parts/<key>.json exists AND
parses. Re-running --prepare with the same args is deterministic (same sorted
first-N keys), so a re-fire never reshuffles work. Parts on disk are the durable
checkpoint.
Output contract (parsed by the cron wave-runner):
WAVE: COMPLETE -> backlog empty; run collect+rebuild
WAVE: PREPARED keys=.. batches=.. shards=.. remaining_after=..
SHARD: data/enrichment_wf/shard_0.js -> one line per workflow to launch
...
Usage:
python3 scripts/enrichment_wave.py --status
python3 scripts/enrichment_wave.py --prepare --keys 700 --shards 8
"""
from __future__ import annotations
import argparse
import json
import sys
from pathlib import Path
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
PROMPT_SUFFIX = ".prompt.md"
PART_SUFFIX = ".json"
BATCH_SIZE_DEFAULT = 12
KEYS_DEFAULT = 700
SHARDS_DEFAULT = 8
# Resolved relative to REPO_ROOT so the script works from any cwd.
DEF_PROMPTS = "data/enrichment_prompts"
DEF_PARTS = "data/enrichment_parts"
DEF_BATCHES = "data/enrichment_batches"
DEF_WF = "data/enrichment_wf"
TEMPLATE_NAME = "shard.js.tmpl"
# --------------------------------------------------------------------------- #
# Helpers
# --------------------------------------------------------------------------- #
def _abs(p: str) -> Path:
q = Path(p)
return q if q.is_absolute() else (REPO_ROOT / q)
def part_ok(path: Path) -> bool:
    """A part counts as done iff it parses as a JSON object."""
    try:
        # Close the handle promptly; this runs once per part file.
        with open(path, encoding="utf-8") as fh:
            return isinstance(json.load(fh), dict)
    except Exception:
        return False
def corrupt_parts(parts_dir: Path) -> list[Path]:
return [p for p in parts_dir.glob("*" + PART_SUFFIX) if not part_ok(p)]
def compute_missing(prompts_dir: Path, parts_dir: Path) -> list[str]:
"""Keys whose prompt exists but whose part file is absent (existence check only; --prepare drops corrupt parts first). Sorted = deterministic."""
missing = []
for pr in prompts_dir.glob("*" + PROMPT_SUFFIX):
key = pr.name[: -len(PROMPT_SUFFIX)]
if not (parts_dir / (key + PART_SUFFIX)).exists():
missing.append(key)
return sorted(missing)
def count_done(prompts_dir: Path, parts_dir: Path) -> tuple[int, int]:
"""(good_parts_with_prompt, total_prompts)."""
total = 0
good = 0
for pr in prompts_dir.glob("*" + PROMPT_SUFFIX):
total += 1
key = pr.name[: -len(PROMPT_SUFFIX)]
part = parts_dir / (key + PART_SUFFIX)
if part.exists() and part_ok(part):
good += 1
return good, total
def write_batches(keys: list[str], batches_dir: Path, size: int) -> int:
"""Replace all batch_*.txt with fresh files of <= size keys. Returns NB."""
batches_dir.mkdir(parents=True, exist_ok=True)
for old in batches_dir.glob("batch_*.txt"):
old.unlink()
nb = 0
for i in range(0, len(keys), size):
chunk = keys[i : i + size]
(batches_dir / f"batch_{nb:04d}.txt").write_text(
"\n".join(chunk) + "\n", encoding="utf-8"
)
nb += 1
return nb
def shard_ranges(nb: int, k: int) -> list[tuple[int, int]]:
"""Split [0,nb) into k contiguous, disjoint, total-covering ranges.
Even distribution: the first (nb % k) shards carry one extra batch. When
nb < k the trailing ranges are empty [x,x) and are dropped by the caller.
"""
if nb <= 0 or k <= 0:
return []
base, extra = divmod(nb, k)
ranges = []
start = 0
for i in range(k):
length = base + (1 if i < extra else 0)
ranges.append((start, start + length))
start += length
return ranges
def render_shard(template: str, shard: int, start: int, end: int, nshards: int) -> str:
return (
template.replace("__SHARD__", str(shard))
.replace("__START__", str(start))
.replace("__END__", str(end))
.replace("__NSHARDS__", str(nshards))
)
def write_shards(ranges: list[tuple[int, int]], template: str, wf_dir: Path) -> list[Path]:
"""Delete stale shard_*.js, then write one per NON-EMPTY range. Returns paths."""
wf_dir.mkdir(parents=True, exist_ok=True)
for old in wf_dir.glob("shard_*.js"):
old.unlink()
non_empty = [(i, s, e) for i, (s, e) in enumerate(ranges) if e > s]
nshards = len(non_empty)
paths = []
# Re-index shards 0..nshards-1 so labels/meta stay contiguous even if some
# trailing ranges were empty (tiny final wave with fewer batches than K).
for new_idx, (_, s, e) in enumerate(non_empty):
path = wf_dir / f"shard_{new_idx}.js"
path.write_text(
render_shard(template, new_idx, s, e, nshards), encoding="utf-8"
)
paths.append(path)
return paths
def rel(path: Path) -> str:
try:
return str(path.relative_to(REPO_ROOT))
except ValueError:
return str(path)
# --------------------------------------------------------------------------- #
# Modes
# --------------------------------------------------------------------------- #
def cmd_status(prompts_dir: Path, parts_dir: Path) -> int:
good, total = count_done(prompts_dir, parts_dir)
parts_on_disk = len(list(parts_dir.glob("*" + PART_SUFFIX)))
bad = len(corrupt_parts(parts_dir))
missing = total - good
pct = (100.0 * good / total) if total else 0.0
print("=== enrichment status ===")
print(f"prompts (universe) : {total}")
print(f"parts on disk : {parts_on_disk}")
print(f"good (done) : {good}")
print(f"corrupt parts : {bad} (reported only; --prepare drops them)")
print(f"missing : {missing}")
print(f"done : {pct:.1f}%")
if total:
print(f"WAVE: {'COMPLETE' if missing == 0 else 'PENDING'} missing={missing}")
return 0
def cmd_prepare(
prompts_dir: Path,
parts_dir: Path,
batches_dir: Path,
wf_dir: Path,
keys: int,
shards: int,
batch_size: int,
make_shards: bool = True,
) -> int:
template = ""
if make_shards:
template_path = wf_dir / TEMPLATE_NAME
if not template_path.is_file():
print(f"ERROR: missing shard template {rel(template_path)}", file=sys.stderr)
return 2
template = template_path.read_text(encoding="utf-8")
# 1) drop corrupt parts (only mutation to parts/)
dropped = 0
for p in corrupt_parts(parts_dir):
p.unlink()
dropped += 1
# 2) compute missing (deterministic)
missing = compute_missing(prompts_dir, parts_dir)
# 3) empty -> COMPLETE sentinel, no files written
if not missing:
print(f"dropped_corrupt={dropped}")
print("WAVE: COMPLETE")
return 0
# 4) clamp to first N
take = missing[:keys]
# 5) batches for ONLY those keys
nb = write_batches(take, batches_dir, batch_size)
# 6) shard scripts covering exactly those batches (skipped on the bash path)
paths = []
if make_shards:
ranges = shard_ranges(nb, shards)
paths = write_shards(ranges, template, wf_dir)
remaining_after = len(missing) - len(take)
print(f"dropped_corrupt={dropped}")
print(
f"WAVE: PREPARED keys={len(take)} batches={nb} "
f"shards={len(paths)} remaining_after={remaining_after}"
)
for p in paths:
print(f"SHARD: {rel(p)}")
return 0
def main(argv=None) -> int:
ap = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
ap.add_argument("--status", action="store_true", help="read-only progress report")
ap.add_argument("--prepare", action="store_true", help="prepare one bounded wave")
ap.add_argument("--keys", type=int, default=KEYS_DEFAULT, help=f"max keys this wave (default {KEYS_DEFAULT})")
ap.add_argument("--shards", type=int, default=SHARDS_DEFAULT, help=f"workflow shards (default {SHARDS_DEFAULT})")
ap.add_argument("--batch-size", type=int, default=BATCH_SIZE_DEFAULT, help=f"keys per batch (default {BATCH_SIZE_DEFAULT})")
ap.add_argument("--no-shards", action="store_true", help="prepare batch files only; skip shard JS generation (bash/headless path)")
ap.add_argument("--prompts", default=DEF_PROMPTS)
ap.add_argument("--parts", default=DEF_PARTS)
ap.add_argument("--batches", default=DEF_BATCHES)
ap.add_argument("--wf-dir", default=DEF_WF)
args = ap.parse_args(argv)
prompts_dir = _abs(args.prompts)
parts_dir = _abs(args.parts)
batches_dir = _abs(args.batches)
wf_dir = _abs(args.wf_dir)
if not prompts_dir.is_dir():
print(f"ERROR: prompts dir not found: {rel(prompts_dir)}", file=sys.stderr)
return 2
parts_dir.mkdir(parents=True, exist_ok=True)
if args.keys < 1 or args.shards < 1 or args.batch_size < 1:
print("ERROR: --keys/--shards/--batch-size must be >= 1", file=sys.stderr)
return 2
if args.prepare:
return cmd_prepare(
prompts_dir, parts_dir, batches_dir, wf_dir,
args.keys, args.shards, args.batch_size,
make_shards=not args.no_shards,
)
# default to status
return cmd_status(prompts_dir, parts_dir)
if __name__ == "__main__":
raise SystemExit(main())
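
The wave arithmetic is easy to check by hand. With the defaults (700 keys, 12 keys per batch, 8 shards), a full wave gives 59 batches, and `shard_ranges` hands the first three shards one extra batch each. In the sketch below `shard_ranges` is copied from the file above, while `batch_count` is a hypothetical helper that mirrors the loop in `write_batches`.

```python
def batch_count(n_keys: int, size: int) -> int:
    """Number of batch files write_batches would create (ceiling division)."""
    return -(-n_keys // size)


def shard_ranges(nb: int, k: int) -> list[tuple[int, int]]:
    """Split [0, nb) into k contiguous, disjoint ranges; first (nb % k) get +1."""
    if nb <= 0 or k <= 0:
        return []
    base, extra = divmod(nb, k)
    ranges = []
    start = 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


nb = batch_count(700, 12)
print(nb)                    # 59  (58 full batches of 12, plus one batch of 4)
print(shard_ranges(nb, 8))
# [(0, 8), (8, 16), (16, 24), (24, 31), (31, 38), (38, 45), (45, 52), (52, 59)]

# Tiny final wave: fewer batches than shards leaves empty trailing ranges,
# which write_shards filters out with its `e > s` check.
print(shard_ranges(2, 4))    # [(0, 1), (1, 2), (2, 2), (2, 2)]
```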

361
scripts/extract_common.py Normal file

@@ -0,0 +1,361 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
extract_common.py — single home for per-format text extraction.
Every extractor returns a plain text *body* with synthetic page markers
(`--- PAGE N ---`). The file-level header (`SOURCE:` / `CONVERTED:`) is added
by normalize_sources.py, not here.
Critical fix vs. the old pdf_to_text_converter.py: there is NO `max_pages` cap.
Large books are extracted in full.
"""
from __future__ import annotations
import hashlib
import importlib
import os
import re
import shutil
import subprocess
import tempfile
import zipfile
from pathlib import Path
from typing import Callable
PAGE_MARKER_RE = re.compile(r"^--- PAGE (\d+) ---\s*$", re.MULTILINE)
# paragraphs per synthetic page for paginated-by-flow formats (docx)
DOCX_PARAS_PER_PAGE = 40
# formats we deliberately ignore (epub duplicates existing PDFs — plan §1)
IGNORED_EXTENSIONS = {".epub"}
# obvious junk filenames skipped during a walk
JUNK_NAMES = {"desktop.ini", "linkuri-jocuri.txt"}
JUNK_SUFFIXES = {".bak", ".tmp", ".ini"}
# --------------------------------------------------------------------------
# page assembly helpers
# --------------------------------------------------------------------------
def join_pages(pages: list[str], start: int = 1) -> str:
"""Join a list of page texts into a body string with `--- PAGE N ---`."""
out: list[str] = []
for i, text in enumerate(pages, start):
out.append(f"\n--- PAGE {i} ---\n{(text or '').strip()}\n")
return "".join(out)
def split_pages(body: str) -> list[tuple[int, str]]:
"""Inverse of join_pages: parse a body into [(page_number, text), ...]."""
matches = list(PAGE_MARKER_RE.finditer(body))
if not matches:
return []
pages: list[tuple[int, str]] = []
for idx, m in enumerate(matches):
num = int(m.group(1))
seg_start = m.end()
seg_end = matches[idx + 1].start() if idx + 1 < len(matches) else len(body)
pages.append((num, body[seg_start:seg_end].strip()))
return pages
def count_page_markers(body: str) -> int:
return len(PAGE_MARKER_RE.findall(body))
# --------------------------------------------------------------------------
# format detection
# --------------------------------------------------------------------------
FORMAT_BY_EXT = {
".pdf": "pdf",
".docx": "docx",
".doc": "doc",
".pptx": "pptx",
".ppt": "pptx",
".htm": "html",
".html": "html",
".zip": "zip",
".epub": "epub",
".txt": "txt",
}
def detect_format(path: str | os.PathLike) -> str:
"""Return a format key for a path based on its extension."""
ext = Path(path).suffix.lower()
return FORMAT_BY_EXT.get(ext, "unknown")
def is_junk(path: str | os.PathLike) -> bool:
p = Path(path)
name = p.name.lower()
if name in JUNK_NAMES:
return True
if name.startswith("readme") and p.suffix.lower() == ".md":
return True
if p.suffix.lower() in JUNK_SUFFIXES:
return True
return False
# --------------------------------------------------------------------------
# content hashing + near-duplicate elimination
# --------------------------------------------------------------------------
def _normalize_for_hash(text: str) -> str:
return re.sub(r"\s+", " ", (text or "")).strip().lower()
def content_hash(text: str) -> str:
"""Stable SHA1 of whitespace-normalized, lowercased text — used for exact-dup detection."""
return hashlib.sha1(_normalize_for_hash(text).encode("utf-8")).hexdigest()
def near_duplicate_ratio(a: str, b: str) -> float:
"""Similarity score in [0, 100] between two texts (rapidfuzz `token_sort_ratio`)."""
from rapidfuzz import fuzz
return fuzz.token_sort_ratio(_normalize_for_hash(a), _normalize_for_hash(b))
def dedupe_texts(
items: list[tuple[str, str]], threshold: float = 95.0
) -> list[tuple[str, str]]:
"""
Drop exact and near-duplicate texts from a list of (key, text) pairs.
Used for HTML mirror pages (print copies, repeated index/footer pages).
Keeps the first occurrence; O(n) on exact hash, O(n*k) fuzzy only against
already-kept items.
"""
kept: list[tuple[str, str]] = []
seen_hashes: set[str] = set()
for key, text in items:
h = content_hash(text)
if h in seen_hashes:
continue
if any(near_duplicate_ratio(text, kt) >= threshold for _, kt in kept):
continue
seen_hashes.add(h)
kept.append((key, text))
return kept
# --------------------------------------------------------------------------
# preflight dependency check
# --------------------------------------------------------------------------
REQUIRED_PYTHON_MODULES = {
"pdfplumber": "pdfplumber",
"PyPDF2": "pypdf2",
"docx": "python-docx",
"pptx": "python-pptx",
"bs4": "beautifulsoup4",
"lxml": "lxml",
"jsonschema": "jsonschema",
"rapidfuzz": "rapidfuzz",
"chardet": "chardet",
}
def preflight(check_ocr: bool = False) -> dict:
"""
Check system + Python dependencies before a long normalization run.
Returns {'ok': bool, 'missing_python': [...], 'missing_system': [...],
'warnings': [...]}. libreoffice is a *warning* (only .doc needs it),
tesseract only checked when check_ocr=True.
"""
missing_python: list[str] = []
for module, pip_name in REQUIRED_PYTHON_MODULES.items():
try:
importlib.import_module(module)
except ImportError:
missing_python.append(pip_name)
warnings: list[str] = []
missing_system: list[str] = []
if not (shutil.which("libreoffice") or shutil.which("soffice")):
warnings.append("libreoffice not found — legacy .doc files cannot be converted")
if check_ocr and not shutil.which("tesseract"):
missing_system.append("tesseract (OCR requested but not installed)")
return {
"ok": not missing_python and not missing_system,
"missing_python": missing_python,
"missing_system": missing_system,
"warnings": warnings,
}
# --------------------------------------------------------------------------
# per-format extractors
# --------------------------------------------------------------------------
def extract_pdf(path: str | os.PathLike) -> str:
"""PDF → body. pdfplumber primary, PyPDF2 fallback. No page cap."""
path = str(path)
try:
return _extract_pdf_pdfplumber(path)
except Exception:
return _extract_pdf_pypdf2(path)
def _extract_pdf_pdfplumber(path: str) -> str:
import pdfplumber
pages: list[str] = []
with pdfplumber.open(path) as pdf:
for page in pdf.pages: # ALL pages — no max_pages
try:
pages.append(page.extract_text() or "")
except Exception:
pages.append("")
return join_pages(pages)
def _extract_pdf_pypdf2(path: str) -> str:
import PyPDF2
pages: list[str] = []
with open(path, "rb") as fh:
reader = PyPDF2.PdfReader(fh)
for page in reader.pages: # ALL pages — no max_pages
try:
pages.append(page.extract_text() or "")
except Exception:
pages.append("")
return join_pages(pages)
def extract_docx(path: str | os.PathLike) -> str:
"""docx → body. Synthetic page marker every DOCX_PARAS_PER_PAGE paragraphs."""
import docx
document = docx.Document(str(path))
paragraphs = [p.text for p in document.paragraphs]
pages: list[str] = []
for i in range(0, max(len(paragraphs), 1), DOCX_PARAS_PER_PAGE):
chunk = paragraphs[i : i + DOCX_PARAS_PER_PAGE]
pages.append("\n".join(chunk))
return join_pages(pages)
def extract_doc(path: str | os.PathLike) -> str:
"""
Legacy .doc → body via `libreoffice --headless --convert-to docx`.
Raises RuntimeError if libreoffice is unavailable — the caller marks the
resulting source `needs_review` regardless (conversion is imperfect).
"""
soffice = shutil.which("libreoffice") or shutil.which("soffice")
if not soffice:
raise RuntimeError("libreoffice/soffice not available — cannot convert .doc")
src = Path(path).resolve()
with tempfile.TemporaryDirectory() as tmp:
subprocess.run(
[soffice, "--headless", "--convert-to", "docx", "--outdir", tmp, str(src)],
check=True,
capture_output=True,
timeout=300,
)
converted = Path(tmp) / (src.stem + ".docx")
if not converted.exists():
raise RuntimeError(f"libreoffice produced no output for {src.name}")
return extract_docx(converted)
def extract_pptx(path: str | os.PathLike) -> str:
"""pptx → body. One page per slide: title + body text + speaker notes."""
from pptx import Presentation
presentation = Presentation(str(path))
pages: list[str] = []
for slide in presentation.slides:
parts: list[str] = []
for shape in slide.shapes:
if shape.has_text_frame and shape.text_frame.text.strip():
parts.append(shape.text_frame.text.strip())
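if slide.has_notes_slide:
# notes_text_frame is None when the notes slide has no notes placeholder
notes_frame = slide.notes_slide.notes_text_frame
notes = notes_frame.text.strip() if notes_frame is not None else ""
if notes:
parts.append(f"[NOTES] {notes}")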
pages.append("\n".join(parts))
return join_pages(pages)
def extract_html(path: str | os.PathLike) -> str:
"""HTML mirror page → body. Strips nav/script/style/footer/header/aside."""
import chardet
from bs4 import BeautifulSoup
raw = Path(path).read_bytes()
enc = chardet.detect(raw).get("encoding") or "utf-8"
soup = BeautifulSoup(raw.decode(enc, errors="replace"), "lxml")
for tag in soup(["script", "style", "nav", "footer", "header", "aside", "noscript"]):
tag.decompose()
# also drop common page chrome identified by ARIA role
for tag in soup.find_all(attrs={"role": ["navigation", "banner", "contentinfo"]}):
tag.decompose()
text = soup.get_text(separator="\n")
lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
return join_pages(["\n".join(lines)])
def extract_zip(path: str | os.PathLike) -> str:
"""
zip → body. Unzips into a temp dir and recurses on every extractable inner
file. Inner files are page-renumbered into one continuous body.
"""
path = str(path)
pages: list[str] = []
with tempfile.TemporaryDirectory() as tmp:
try:
with zipfile.ZipFile(path) as zf:
zf.extractall(tmp)
except zipfile.BadZipFile:
return ""
for inner in sorted(Path(tmp).rglob("*")):
if not inner.is_file() or is_junk(inner):
continue
fmt = detect_format(inner)
if fmt in ("unknown", "epub", "zip"):
# nested zips are extracted recursively here; unknown and epub files are skipped
if fmt == "zip":
body = extract_zip(inner)
pages.extend(t for _, t in split_pages(body))
continue
try:
body = extract_file(inner)
except Exception:
continue
pages.extend(t for _, t in split_pages(body))
return join_pages(pages)
EXTRACTORS: dict[str, Callable[[str | os.PathLike], str]] = {
"pdf": extract_pdf,
"docx": extract_docx,
"doc": extract_doc,
"pptx": extract_pptx,
"html": extract_html,
"zip": extract_zip,
}
def extract_file(path: str | os.PathLike) -> str:
"""Dispatch a single file to the right extractor. Returns a page-marked body."""
fmt = detect_format(path)
if fmt == "txt":
body = Path(path).read_text(encoding="utf-8", errors="replace")
# already paginated? pass through; else wrap as one page
return body if count_page_markers(body) else join_pages([body])
extractor = EXTRACTORS.get(fmt)
if extractor is None:
raise ValueError(f"No extractor for format '{fmt}': {path}")
return extractor(path)
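
The page-marker convention is what lets `extract_zip` renumber inner files into one continuous body: `split_pages` undoes `join_pages`, and re-joining assigns fresh numbers. A small round trip, with both helpers copied from the file above and made-up sample page texts:

```python
import re

PAGE_MARKER_RE = re.compile(r"^--- PAGE (\d+) ---\s*$", re.MULTILINE)


def join_pages(pages: list[str], start: int = 1) -> str:
    """Join page texts into one body with `--- PAGE N ---` markers."""
    out: list[str] = []
    for i, text in enumerate(pages, start):
        out.append(f"\n--- PAGE {i} ---\n{(text or '').strip()}\n")
    return "".join(out)


def split_pages(body: str) -> list[tuple[int, str]]:
    """Inverse of join_pages: parse a body into [(page_number, text), ...]."""
    matches = list(PAGE_MARKER_RE.finditer(body))
    pages: list[tuple[int, str]] = []
    for idx, m in enumerate(matches):
        seg_end = matches[idx + 1].start() if idx + 1 < len(matches) else len(body)
        pages.append((int(m.group(1)), body[m.end():seg_end].strip()))
    return pages


# Sample texts are invented; the empty string stands for a page with no text.
body = join_pages(["alpha", "", "gamma"])
pages = split_pages(body)
print(pages)  # [(1, 'alpha'), (2, ''), (3, 'gamma')]

# Renumbering, as extract_zip does: keep the texts, discard the old numbers.
merged = join_pages([text for _, text in pages] + ["delta"])
print([num for num, _ in split_pages(merged)])  # [1, 2, 3, 4]
```

Empty pages survive the round trip, so page numbers in the merged body still line up with the page count of the source.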


@@ -1,424 +0,0 @@
#!/usr/bin/env python3
"""
HTML Activity Extractor - processes 1876 HTML files
Automatically extracts activities using pattern recognition
"""
import os
import re
import json
from pathlib import Path
from bs4 import BeautifulSoup
import chardet
from typing import List, Dict, Optional
import sqlite3
from datetime import datetime
class HTMLActivityExtractor:
def __init__(self, db_path='data/activities.db'):
self.db_path = db_path
# Patterns for detecting activities in Romanian
self.activity_patterns = {
'title_patterns': [
r'(?i)(joc|activitate|exerci[t]iu|team[\s-]?building|energizer|ice[\s-]?breaker)[\s:]+([^\.]{5,100})',
r'(?i)<h[1-6][^>]*>([^<]*(?:joc|activitate|exerci[t]iu)[^<]*)</h[1-6]>',
r'(?i)<strong>([^<]*(?:joc|activitate|exerci[t]iu)[^<]*)</strong>',
r'(?i)^[\d]+\.?\s*([A-Z][^\.]{10,100}(?:joc|activitate|exerci[t]iu)[^\.]{0,50})$',
],
'description_markers': [
'descriere', 'reguli', 'cum se joac[a]', 'instructiuni',
'obiectiv', 'desfasurare', 'explicatie', 'mod de joc'
],
'materials_markers': [
'materiale', 'necesare', 'echipament', 'ce avem nevoie',
'se folosesc', 'trebuie sa avem', 'dotari'
],
'age_patterns': [
r'(?i)v[âa]rst[ăa][\s:]+(\d+)[\s-]+(\d+)',
r'(?i)(\d+)[\s-]+(\d+)\s*ani',
r'(?i)pentru\s+(\d+)[\s-]+(\d+)\s*ani',
r'(?i)categoria?\s*(?:de\s*)?v[âa]rst[ăa][\s:]+(\d+)[\s-]+(\d+)',
],
'participants_patterns': [
r'(?i)(\d+)[\s-]+(\d+)\s*(?:participan[t]i|juc[a]tori|persoane|copii)',
r'(?i)num[a]r\s*(?:de\s*)?(?:participan[t]i|juc[a]tori)[\s:]+(\d+)[\s-]+(\d+)',
r'(?i)grup\s*de\s*(\d+)[\s-]+(\d+)',
],
'duration_patterns': [
r'(?i)durat[a][\s:]+(\d+)[\s-]+(\d+)\s*(?:minute|min)',
r'(?i)timp[\s:]+(\d+)[\s-]+(\d+)\s*(?:minute|min)',
r'(?i)(\d+)[\s-]+(\d+)\s*minute',
]
}
# Predefined categories based on the existing system
self.categories = {
'[A]': ['joc', 'joaca', 'distractie', 'amuzament'],
'[B]': ['aventura', 'explorare', 'descoperire'],
'[C]': ['camping', 'tabara', 'excursie', 'drumetie'],
'[D]': ['foc', 'flacara', 'lumina'],
'[E]': ['noduri', 'frânghii', 'sfori', 'legare'],
'[F]': ['bushcraft', 'supravietuire', 'survival'],
'[G]': ['educatie', 'educativ', 'invatare', 'scoala'],
'[H]': ['orientare', 'busola', 'harta', 'navigare']
}
def detect_encoding(self, file_path):
"""Detect the file's encoding"""
with open(file_path, 'rb') as f:
result = chardet.detect(f.read())
return result['encoding'] or 'utf-8'
def extract_from_html(self, html_path: str) -> List[Dict]:
"""Extract activities from a single HTML file"""
activities = []
try:
# Detect the encoding and read the file
encoding = self.detect_encoding(html_path)
with open(html_path, 'r', encoding=encoding, errors='ignore') as f:
content = f.read()
soup = BeautifulSoup(content, 'lxml')
# Method 1: look for lists of activities
activities.extend(self._extract_from_lists(soup, html_path))
# Method 2: look for activities in headings
activities.extend(self._extract_from_headings(soup, html_path))
# Method 3: look for patterns in the text
activities.extend(self._extract_from_patterns(soup, html_path))
# Method 4: look in tables
activities.extend(self._extract_from_tables(soup, html_path))
except Exception as e:
print(f"Error processing {html_path}: {e}")
return activities
def _extract_from_lists(self, soup, source_file):
"""Extract activities from HTML lists (ul, ol)"""
activities = []
for list_elem in soup.find_all(['ul', 'ol']):
# Check whether the list appears to contain activities
list_text = list_elem.get_text().lower()
if any(marker in list_text for marker in ['joc', 'activitate', 'exercitiu']):
for li in list_elem.find_all('li'):
text = li.get_text(strip=True)
if len(text) > 20: # At least 20 characters for a valid activity
activity = self._create_activity_from_text(text, source_file)
if activity:
activities.append(activity)
return activities
def _extract_from_headings(self, soup, source_file):
"""Extract activities based on headings"""
activities = []
for heading in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
heading_text = heading.get_text(strip=True)
# Check whether the heading contains keywords
if any(keyword in heading_text.lower() for keyword in ['joc', 'activitate', 'exercitiu']):
# Look for the description in the following elements
description = ""
next_elem = heading.find_next_sibling()
while next_elem and next_elem.name not in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
if next_elem.name in ['p', 'div', 'ul']:
description += next_elem.get_text(strip=True) + " "
if len(description) > 500: # Cap the description length
break
next_elem = next_elem.find_next_sibling()
if description:
activity = {
'name': heading_text[:200],
'description': description[:1000],
'source_file': str(source_file),
'category': self._detect_category(heading_text + " " + description)
}
activities.append(activity)
return activities
def _extract_from_patterns(self, soup, source_file):
"""Extract activities using pattern matching"""
activities = []
text = soup.get_text()
# Look for activity patterns
for pattern in self.activity_patterns['title_patterns']:
matches = re.finditer(pattern, text, re.MULTILINE)
for match in matches:
# lastindex is None (never 0) when a pattern has no capturing groups
title = match.group(match.lastindex or 0)
if len(title) > 10:
# Extract the context around the match
start = max(0, match.start() - 200)
end = min(len(text), match.end() + 500)
context = text[start:end]
activity = self._create_activity_from_text(context, source_file, title)
if activity:
activities.append(activity)
return activities
def _extract_from_tables(self, soup, source_file):
"""Extract activities from tables"""
activities = []
for table in soup.find_all('table'):
rows = table.find_all('tr')
if len(rows) > 1: # At least a header and one data row
# Detect the relevant columns
headers = [th.get_text(strip=True).lower() for th in rows[0].find_all(['th', 'td'])]
for row in rows[1:]:
cells = row.find_all(['td'])
if cells:
activity_data = {}
for i, cell in enumerate(cells):
if i < len(headers):
activity_data[headers[i]] = cell.get_text(strip=True)
# Create an activity from the table data
if any(key in activity_data for key in ['joc', 'activitate', 'nume', 'titlu']):
activity = self._create_activity_from_table_data(activity_data, source_file)
if activity:
activities.append(activity)
return activities
def _create_activity_from_text(self, text, source_file, title=None):
"""Build an activity dict from text"""
if not text or len(text) < 30:
return None
activity = {
'name': title or text[:100].split('.')[0].strip(),
'description': text[:1000],
'source_file': str(source_file),
'category': self._detect_category(text),
'keywords': self._extract_keywords(text),
'created_at': datetime.now().isoformat()
}
# Extract additional metadata
activity.update(self._extract_metadata(text))
return activity
def _create_activity_from_table_data(self, data, source_file):
"""Build an activity from table data"""
activity = {
'source_file': str(source_file),
'created_at': datetime.now().isoformat()
}
# Map table fields to DB fields
field_mapping = {
'nume': 'name', 'titlu': 'name', 'joc': 'name', 'activitate': 'name',
'descriere': 'description', 'detalii': 'description', 'explicatie': 'description',
'materiale': 'materials_list', 'echipament': 'materials_list',
'varsta': 'age_group_min', 'categoria': 'category',
'participanti': 'participants_min', 'numar': 'participants_min',
'durata': 'duration_min', 'timp': 'duration_min'
}
for table_field, db_field in field_mapping.items():
if table_field in data:
activity[db_field] = data[table_field]
# Minimal validation
if 'name' in activity and len(activity.get('name', '')) > 5:
return activity
return None
def _extract_metadata(self, text):
"""Extract metadata from text using patterns"""
metadata = {}
# Extract the age range
for pattern in self.activity_patterns['age_patterns']:
match = re.search(pattern, text)
if match:
metadata['age_group_min'] = int(match.group(1))
metadata['age_group_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
break
# Extract the number of participants
for pattern in self.activity_patterns['participants_patterns']:
match = re.search(pattern, text)
if match:
metadata['participants_min'] = int(match.group(1))
metadata['participants_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
break
# Extract the duration
for pattern in self.activity_patterns['duration_patterns']:
match = re.search(pattern, text)
if match:
metadata['duration_min'] = int(match.group(1))
metadata['duration_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
break
# Extract materials
materials = []
text_lower = text.lower()
for marker in self.activity_patterns['materials_markers']:
idx = text_lower.find(marker)
if idx != -1:
# Extract the next 200 characters after the marker
materials_text = text[idx:idx+200]
# Extract the list items
items = re.findall(r'[-"]\s*([^\n-"]+)', materials_text)
if items:
materials.extend(items)
if materials:
metadata['materials_list'] = ', '.join(materials[:10]) # At most 10 materials
return metadata
def _detect_category(self, text):
"""Detect the activity's category from keywords"""
text_lower = text.lower()
for category, keywords in self.categories.items():
if any(keyword in text_lower for keyword in keywords):
return category
return '[A]' # Default to the games category
def _extract_keywords(self, text):
"""Extract keywords from text"""
keywords = []
text_lower = text.lower()
# List of relevant keywords
keyword_list = [
'cooperare', 'competitie', 'echipa', 'creativitate', 'miscare',
'strategie', 'comunicare', 'incredere', 'coordonare', 'atentie',
'reflexe', 'logica', 'imaginatie', 'muzica', 'dans', 'sport',
'natura', 'mediu', 'stiinta', 'matematica', 'limba', 'cultura'
]
for keyword in keyword_list:
if keyword in text_lower:
keywords.append(keyword)
return ', '.join(keywords[:5]) # At most 5 keywords
def save_to_database(self, activities):
"""Save the activities to the database"""
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
saved_count = 0
duplicate_count = 0
for activity in activities:
try:
# Check for duplicates
cursor.execute(
"SELECT id FROM activities WHERE name = ? AND source_file = ?",
(activity.get('name'), activity.get('source_file'))
)
if cursor.fetchone():
duplicate_count += 1
continue
# Prepare the values for the insert
columns = []
values = []
placeholders = []
for key, value in activity.items():
if key != 'created_at': # Skip created_at, it has default
columns.append(key)
values.append(value)
placeholders.append('?')
# Insert into the DB
query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
cursor.execute(query, values)
saved_count += 1
except Exception as e:
print(f"Error saving activity: {e}")
continue
conn.commit()
conn.close()
return saved_count, duplicate_count
def process_all_html_files(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
"""Process all HTML files in the given directory"""
base_path = Path(base_path)
html_files = list(base_path.rglob("*.html"))
html_files.extend(list(base_path.rglob("*.htm")))
print(f"Found {len(html_files)} HTML files to process")
all_activities = []
processed = 0
errors = 0
for i, html_file in enumerate(html_files):
try:
activities = self.extract_from_html(str(html_file))
all_activities.extend(activities)
processed += 1
# Progress update
if (i + 1) % 100 == 0:
print(f"Progress: {i+1}/{len(html_files)} files processed, {len(all_activities)} activities found")
# Save batch to DB
if all_activities:
saved, dupes = self.save_to_database(all_activities)
print(f"Batch saved: {saved} new activities, {dupes} duplicates skipped")
all_activities = [] # Clear buffer
except Exception as e:
print(f"Error processing {html_file}: {e}")
errors += 1
# Save remaining activities
if all_activities:
saved, dupes = self.save_to_database(all_activities)
print(f"Final batch saved: {saved} new activities, {dupes} duplicates skipped")
print(f"\nProcessing complete!")
print(f"Files processed: {processed}")
print(f"Errors: {errors}")
return processed, errors
# Main entry point for testing
if __name__ == "__main__":
extractor = HTMLActivityExtractor()
# Test on a sample file first
print("Testing on sample file first...")
# Find an HTML file to test with
test_files = list(Path('/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri').rglob("*.html"))[:3]
for test_file in test_files:
print(f"\nTesting: {test_file}")
activities = extractor.extract_from_html(str(test_file))
print(f"Found {len(activities)} activities")
if activities:
print(f"Sample activity: {activities[0]['name'][:50]}...")
# Ask whether to continue with full processing
response = input("\nContinue with full processing? (y/n): ")
if response.lower() == 'y':
extractor.process_all_html_files()


@@ -1,78 +0,0 @@
#!/usr/bin/env python3
"""
Import activities extracted by Claude from JSON files
"""
import json
import sqlite3
from pathlib import Path
from datetime import datetime
class ClaudeActivityImporter:
def __init__(self, db_path='data/activities.db'):
self.db_path = db_path
self.json_dir = Path('scripts/extracted_activities')
self.json_dir.mkdir(exist_ok=True)
def import_json_file(self, json_path):
"""Import activities from a single JSON file"""
with open(json_path, 'r', encoding='utf-8') as f:
data = json.load(f)
source_file = data.get('source_file', str(json_path))
activities = data.get('activities', [])
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
imported = 0
for activity in activities:
try:
# Add source file and timestamp
activity['source_file'] = source_file
activity['created_at'] = datetime.now().isoformat()
# Prepare insert
columns = list(activity.keys())
values = list(activity.values())
placeholders = ['?' for _ in values]
# Check for duplicate
cursor.execute(
"SELECT id FROM activities WHERE name = ? AND source_file = ?",
(activity.get('name'), source_file)
)
if not cursor.fetchone():
query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
cursor.execute(query, values)
imported += 1
except Exception as e:
print(f"Error importing activity: {e}")
conn.commit()
conn.close()
print(f"Imported {imported} activities from {json_path.name}")
return imported
def import_all_json_files(self):
"""Import all JSON files from the extracted_activities directory"""
json_files = list(self.json_dir.glob("*.json"))
if not json_files:
print("No JSON files found in extracted_activities directory")
return 0
total_imported = 0
for json_file in json_files:
imported = self.import_json_file(json_file)
total_imported += imported
print(f"\nTotal imported: {total_imported} activities from {len(json_files)} files")
return total_imported
if __name__ == "__main__":
importer = ClaudeActivityImporter()
importer.import_all_json_files()

179
scripts/import_common.py Normal file

@@ -0,0 +1,179 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
import_common.py — shared helpers for the import / validation side of the
extraction pipeline (Lane C).
Used by build_database.py and validate_extractions.py:
* JSON-schema validation of subagent extraction files,
* the anti-hallucination source_excerpt substring check (E5),
* locating the source chunk that an extraction file came from,
* the stable content key used by the needs_review queue.
"""
from __future__ import annotations
import hashlib
import json
import re
import unicodedata
from pathlib import Path
from typing import Any, Optional
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
DEFAULT_SCHEMA_PATH = SCRIPT_DIR / "activity_schema.json"
# rapidfuzz.partial_ratio is on a 0..100 scale; an excerpt counts as a real
# quote from the source when it scores at least this against the chunk text.
EXCERPT_MATCH_THRESHOLD = 90.0
# --------------------------------------------------------------------------
# schema validation
# --------------------------------------------------------------------------
def load_schema(schema_path: str | Path = DEFAULT_SCHEMA_PATH) -> dict:
"""Load the activity JSON schema produced by Lane A."""
return json.loads(Path(schema_path).read_text(encoding="utf-8"))
def validate_extraction(data: Any, schema: dict) -> list[str]:
"""
Validate one parsed extraction file against `schema`.
Returns a list of human-readable error strings; empty list == valid.
"""
import jsonschema
validator = jsonschema.Draft7Validator(schema)
errors: list[str] = []
for err in sorted(validator.iter_errors(data), key=lambda e: list(e.path)):
location = "/".join(str(p) for p in err.path) or "<root>"
errors.append(f"{location}: {err.message}")
return errors
# --------------------------------------------------------------------------
# excerpt verification (E5 — anti-hallucination)
# --------------------------------------------------------------------------
def _normalize_text(text: str) -> str:
return re.sub(r"\s+", " ", (text or "")).strip().lower()
def excerpt_score(excerpt: str, chunk_text: str) -> float:
"""Best fuzzy-substring score (0..100) of `excerpt` inside `chunk_text`."""
from rapidfuzz import fuzz
if not excerpt or not chunk_text:
return 0.0
return float(fuzz.partial_ratio(_normalize_text(excerpt), _normalize_text(chunk_text)))
def excerpt_matches(
excerpt: str, chunk_text: str, threshold: float = EXCERPT_MATCH_THRESHOLD
) -> bool:
"""True when `excerpt` appears (fuzzily) as a substring of `chunk_text`."""
return excerpt_score(excerpt, chunk_text) >= threshold
# --------------------------------------------------------------------------
# locating the source chunk an extraction file came from
# --------------------------------------------------------------------------
def chunk_key_for(json_path: Path, header: Optional[dict]) -> str:
"""
Resolve the chunk key for an extraction file.
Prefers the explicit `chunk_key` in the header, otherwise falls back to the
JSON file stem (extraction files are named `<chunk_key>.json`).
"""
if header and header.get("chunk_key"):
return str(header["chunk_key"])
return json_path.stem
def source_id_for(chunk_key: str, header: Optional[dict]) -> str:
"""Resolve the source id; `<source_id>.partNN` → `<source_id>`."""
if header and header.get("source_id"):
return str(header["source_id"])
# chunk keys look like "<source_id>.partNN"
return chunk_key.rsplit(".part", 1)[0]
def find_chunk_text(
json_path: Path, header: Optional[dict], chunks_dir: Path
) -> Optional[str]:
"""
Return the text of the source chunk for an extraction file, or None.
Looks for data/chunks/<source_id>/<chunk_key>.txt, then falls back to a
recursive glob on the chunk key.
"""
chunk_key = chunk_key_for(json_path, header)
source_id = source_id_for(chunk_key, header)
candidate = chunks_dir / source_id / f"{chunk_key}.txt"
if candidate.is_file():
return candidate.read_text(encoding="utf-8", errors="replace")
matches = list(chunks_dir.rglob(f"{chunk_key}.txt"))
if matches:
return matches[0].read_text(encoding="utf-8", errors="replace")
return None
def source_path_for(source_id: str, sources_dir: Path) -> Optional[str]:
"""
Read the original `SOURCE:` path from a normalized source header.
data/sources/<source_id>.txt starts with a `SOURCE: <relative path>` line.
"""
src_file = sources_dir / f"{source_id}.txt"
if not src_file.is_file():
return None
try:
with src_file.open(encoding="utf-8", errors="replace") as fh:
for line in fh:
if line.startswith("SOURCE:"):
return line.split(":", 1)[1].strip()
if line.startswith("=") or line.startswith("--- PAGE "):
break
except OSError:
return None
return None
# --------------------------------------------------------------------------
# stable content key for the needs_review queue (plan §5c)
# --------------------------------------------------------------------------
def normalize_name(name: str) -> str:
"""Diacritic-free, lowercased, whitespace-collapsed name (dedup key)."""
if not name:
return ""
decomposed = unicodedata.normalize("NFKD", name)
ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
return re.sub(r"\s+", " ", ascii_str.lower().strip())
def content_key(normalized_name: str, language: Optional[str], description: str) -> str:
"""
Stable hash identifying a row for the review queue.
Only borderline-kept-separate rows and legacy `.doc` rows ever carry
needs_review, and neither is auto-merged — so their (normalized_name,
language, description) triple is stable across rebuilds.
"""
payload = f"{normalized_name}\x1f{language or ''}\x1f{_normalize_text(description)}"
return hashlib.sha1(payload.encode("utf-8")).hexdigest()
# --------------------------------------------------------------------------
# iteration
# --------------------------------------------------------------------------
def iter_extraction_files(extracted_dir: Path):
"""Yield every *.json directly under `extracted_dir` (skips _rejected/)."""
if not extracted_dir.is_dir():
return
for path in sorted(extracted_dir.glob("*.json")):
if path.is_file():
yield path
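To illustrate why the review-queue key stays stable across rebuilds, the sketch below copies `normalize_name`, `_normalize_text`, and `content_key` from the file above and feeds them two spellings of the same row. The sample names and descriptions are made up for illustration.

```python
import hashlib
import re
import unicodedata
from typing import Optional

def normalize_name(name: str) -> str:
    """Diacritic-free, lowercased, whitespace-collapsed name."""
    if not name:
        return ""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", ascii_str.lower().strip())

def _normalize_text(text: str) -> str:
    """Collapse whitespace and lowercase (diacritics are kept)."""
    return re.sub(r"\s+", " ", (text or "")).strip().lower()

def content_key(normalized_name: str, language: Optional[str], description: str) -> str:
    """SHA-1 over the unit-separator-joined (name, language, description) triple."""
    payload = f"{normalized_name}\x1f{language or ''}\x1f{_normalize_text(description)}"
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

# Two spellings of the same (made-up) row: case, spacing, and diacritics differ.
key_a = content_key(normalize_name("Țară  de JOC"), "ro", "Copiii   aleargă.\n")
key_b = content_key(normalize_name("tara de joc"), "ro", "copiii aleargă.")
# Same row under a different language tag.
key_c = content_key(normalize_name("tara de joc"), "en", "copiii aleargă.")
```

Here `key_a` and `key_b` come out identical because both the name and the description are normalized before hashing, while `key_c` differs because the language is part of the payload. Note that only the name is stripped of diacritics; a description that differs in diacritics would hash differently.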


@@ -0,0 +1,255 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
normalize_sources.py — walk data/carti-camp-jocuri/ and write data/sources/<id>.txt.
Output files keep the existing header format:
SOURCE: <original relative path>
CONVERTED: <iso date>
FORMAT: <pdf|docx|doc|pptx|html-mirror|zip>
NEEDS_REVIEW: <reason> (optional — legacy .doc conversions)
==================================================
--- PAGE 1 ---
...
Each source gets a stable id = <8-hex hash of relative path>_<sanitized stem>,
so two files with the same name in different folders never collide.
The pipeline is script-only: this normalizes formats, it does NOT run extraction.
Run `--check-deps` before a long job.
"""
from __future__ import annotations
import argparse
import datetime as _dt
import hashlib
import re
import sys
from pathlib import Path
SCRIPT_DIR = Path(__file__).resolve().parent
if str(SCRIPT_DIR) not in sys.path:
sys.path.insert(0, str(SCRIPT_DIR))
from extract_common import ( # noqa: E402
count_page_markers,
dedupe_texts,
detect_format,
extract_file,
extract_html,
is_junk,
join_pages,
preflight,
split_pages,
)
HEADER_RULE = "=" * 50
# --------------------------------------------------------------------------
# stable source id
# --------------------------------------------------------------------------
def sanitize_stem(stem: str) -> str:
s = re.sub(r"[^\w]+", "_", stem, flags=re.UNICODE).strip("_").lower()
return s[:60] or "source"
def stable_id(relative_path: str | Path) -> str:
"""Collision-proof id derived from the path relative to the corpus root."""
rel = str(relative_path).replace("\\", "/")
digest = hashlib.sha1(rel.encode("utf-8")).hexdigest()[:8]
stem = sanitize_stem(Path(rel).stem)
return f"{digest}_{stem}"
# --------------------------------------------------------------------------
# header
# --------------------------------------------------------------------------
def build_header(
source_rel: str, fmt: str, needs_review: str | None = None
) -> str:
today = _dt.date.today().isoformat()
lines = [
f"SOURCE: {source_rel}",
f"CONVERTED: {today}",
f"FORMAT: {fmt}",
]
if needs_review:
lines.append(f"NEEDS_REVIEW: {needs_review}")
lines.append(HEADER_RULE)
return "\n".join(lines) + "\n\n"
# --------------------------------------------------------------------------
# mirror-site directories
# --------------------------------------------------------------------------
MIRROR_PAGE_EXTS = {".html", ".htm"}
def is_mirror_dir(path: Path) -> bool:
"""A directory counts as a site mirror if it contains HTML pages."""
if not path.is_dir():
return False
if path.name.endswith("_files"):
return False
return any(
p.suffix.lower() in MIRROR_PAGE_EXTS
for p in path.rglob("*")
if p.is_file()
)
def normalize_mirror(mirror_dir: Path) -> str:
"""Extract every HTML page in a mirror dir, dedupe near-duplicates, join."""
pages: list[tuple[str, str]] = []
for html in sorted(mirror_dir.rglob("*")):
if not html.is_file() or html.suffix.lower() not in MIRROR_PAGE_EXTS:
continue
if "_files" in html.parts:
continue
try:
body = extract_html(html)
except Exception:
continue
text = "\n".join(t for _, t in split_pages(body))
if text.strip():
pages.append((str(html.relative_to(mirror_dir)), text))
pages = dedupe_texts(pages)
return join_pages([t for _, t in pages])
# --------------------------------------------------------------------------
# one source
# --------------------------------------------------------------------------
def normalize_one(
path: Path, corpus_root: Path, out_dir: Path
) -> dict | None:
"""
Normalize a single file or mirror directory → data/sources/<id>.txt.
Returns a result dict, or None if the entry was skipped (junk / ignored).
"""
rel = path.relative_to(corpus_root)
sid = stable_id(rel)
if path.is_dir():
if not is_mirror_dir(path):
return None
fmt, needs_review = "html-mirror", None
body = normalize_mirror(path)
else:
if is_junk(path):
return None
fmt = detect_format(path)
if fmt in ("unknown", "epub", "txt"):
return None # epub duplicates PDFs; txt is not a source format here
needs_review = "legacy .doc conversion is imperfect" if fmt == "doc" else None
try:
body = extract_file(path)
except Exception as exc: # noqa: BLE001
return {"id": sid, "source": str(rel), "status": "error", "error": str(exc)}
if not body.strip():
return {"id": sid, "source": str(rel), "status": "empty"}
out_path = out_dir / f"{sid}.txt"
out_path.write_text(build_header(str(rel), fmt, needs_review) + body,
encoding="utf-8")
return {
"id": sid,
"source": str(rel),
"status": "ok",
"format": fmt,
"pages": count_page_markers(body),
"needs_review": bool(needs_review),
}
# --------------------------------------------------------------------------
# walk
# --------------------------------------------------------------------------
def iter_corpus_entries(corpus_root: Path):
"""Yield top-level files and mirror directories under the corpus root."""
for entry in sorted(corpus_root.iterdir()):
if entry.name.startswith("."):
continue
if entry.is_dir():
if is_mirror_dir(entry):
yield entry
else:
yield entry
def run(corpus_root: Path, out_dir: Path) -> dict:
out_dir.mkdir(parents=True, exist_ok=True)
results: list[dict] = []
for entry in iter_corpus_entries(corpus_root):
res = normalize_one(entry, corpus_root, out_dir)
if res is not None:
results.append(res)
summary = {
"total": len(results),
"ok": sum(1 for r in results if r["status"] == "ok"),
"errors": sum(1 for r in results if r["status"] == "error"),
"empty": sum(1 for r in results if r["status"] == "empty"),
"needs_review": sum(1 for r in results if r.get("needs_review")),
"results": results,
}
return summary
# --------------------------------------------------------------------------
# CLI
# --------------------------------------------------------------------------
def print_preflight(report: dict) -> int:
print("Dependency preflight")
print("--------------------")
if report["missing_python"]:
print(" MISSING Python packages: " + ", ".join(report["missing_python"]))
else:
print(" Python packages: OK")
if report["missing_system"]:
print(" MISSING system tools : " + ", ".join(report["missing_system"]))
for w in report["warnings"]:
print(f" WARNING: {w}")
print(" => " + ("READY" if report["ok"] else "NOT READY — install the above"))
return 0 if report["ok"] else 1
def main(argv: list[str] | None = None) -> int:
parser = argparse.ArgumentParser(description="Normalize mixed sources to .txt")
parser.add_argument("--corpus", default="data/carti-camp-jocuri",
help="corpus root to walk")
parser.add_argument("--out", default="data/sources", help="output directory")
parser.add_argument("--check-deps", action="store_true",
help="run dependency preflight and exit")
parser.add_argument("--ocr", action="store_true",
help="include OCR (tesseract) in the preflight check")
args = parser.parse_args(argv)
if args.check_deps:
return print_preflight(preflight(check_ocr=args.ocr))
report = preflight(check_ocr=args.ocr)
if report["missing_python"]:
print_preflight(report)
return 1
for w in report["warnings"]:
print(f"WARNING: {w}")
summary = run(Path(args.corpus), Path(args.out))
print(f"normalized : {summary['ok']}/{summary['total']}")
print(f"errors : {summary['errors']}")
print(f"empty : {summary['empty']}")
print(f"needs_review: {summary['needs_review']}")
for r in summary["results"]:
if r["status"] != "ok":
print(f" [{r['status']}] {r['source']}")
return 0
if __name__ == "__main__":
raise SystemExit(main())
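The collision-proof id scheme from the module docstring can be checked in isolation. The sketch below copies `sanitize_stem` and `stable_id` from the file above; the two example paths are hypothetical.

```python
import hashlib
import re
from pathlib import Path

def sanitize_stem(stem: str) -> str:
    """Replace runs of non-word characters with '_' and cap the length."""
    s = re.sub(r"[^\w]+", "_", stem, flags=re.UNICODE).strip("_").lower()
    return s[:60] or "source"

def stable_id(relative_path) -> str:
    """8-hex SHA-1 prefix of the relative path, plus the sanitized stem."""
    rel = str(relative_path).replace("\\", "/")
    digest = hashlib.sha1(rel.encode("utf-8")).hexdigest()[:8]
    stem = sanitize_stem(Path(rel).stem)
    return f"{digest}_{stem}"

# Hypothetical corpus paths: same file name, different folders.
id_a = stable_id("scouting/Jocuri de tabara.pdf")
id_b = stable_id("archive/Jocuri de tabara.pdf")
# Windows-style separators are normalized before hashing.
id_a_win = stable_id("scouting\\Jocuri de tabara.pdf")
```

Both ids end in `_jocuri_de_tabara`, but their hash prefixes differ because the digest covers the whole relative path, not just the file name. Since the backslash form is normalized to forward slashes before hashing, `id_a_win` equals `id_a`.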


@@ -1,143 +0,0 @@
#!/usr/bin/env python3
"""
PDF Mass Conversion to Text for Activity Extraction
Handles all PDF sizes efficiently with multiple fallback methods
"""
import os
import json
from pathlib import Path
import PyPDF2
import pdfplumber
from typing import List, Dict
import logging
class PDFConverter:
def __init__(self, max_pages=50):
self.max_pages = max_pages
self.conversion_stats = {}
def convert_pdf_to_text(self, pdf_path: str) -> str:
"""Convert PDF to text using multiple methods with fallbacks"""
try:
# Method 1: pdfplumber (best for tables and layout)
return self._convert_with_pdfplumber(pdf_path)
except Exception as e:
print(f"pdfplumber failed for {pdf_path}: {e}")
try:
# Method 2: PyPDF2 (fallback)
return self._convert_with_pypdf2(pdf_path)
except Exception as e2:
print(f"PyPDF2 also failed for {pdf_path}: {e2}")
return ""
def _convert_with_pdfplumber(self, pdf_path: str) -> str:
"""Primary conversion method using pdfplumber"""
text_content = ""
with pdfplumber.open(pdf_path) as pdf:
total_pages = len(pdf.pages)
pages_to_process = min(total_pages, self.max_pages)
print(f" Converting {pdf_path}: {pages_to_process}/{total_pages} pages")
for i, page in enumerate(pdf.pages[:pages_to_process]):
try:
page_text = page.extract_text()
if page_text:
text_content += f"\n--- PAGE {i+1} ---\n"
text_content += page_text
text_content += "\n"
except Exception as e:
print(f" Error on page {i+1}: {e}")
continue
self.conversion_stats[pdf_path] = {
'method': 'pdfplumber',
'pages_processed': pages_to_process,
'total_pages': total_pages,
'success': True,
'text_length': len(text_content)
}
return text_content
def _convert_with_pypdf2(self, pdf_path: str) -> str:
"""Fallback conversion method using PyPDF2"""
text_content = ""
with open(pdf_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
total_pages = len(reader.pages)
pages_to_process = min(total_pages, self.max_pages)
print(f" Converting {pdf_path} (fallback): {pages_to_process}/{total_pages} pages")
for i in range(pages_to_process):
try:
page = reader.pages[i]
page_text = page.extract_text()
if page_text:
text_content += f"\n--- PAGE {i+1} ---\n"
text_content += page_text
text_content += "\n"
except Exception as e:
print(f" Error on page {i+1}: {e}")
continue
self.conversion_stats[pdf_path] = {
'method': 'PyPDF2',
'pages_processed': pages_to_process,
'total_pages': total_pages,
'success': True,
'text_length': len(text_content)
}
return text_content
def convert_all_pdfs(self, pdf_directory: str, output_directory: str):
"""Convert all PDFs in directory to text files"""
pdf_files = list(Path(pdf_directory).glob("**/*.pdf"))
print(f"🔄 Converting {len(pdf_files)} PDF files to text...")
os.makedirs(output_directory, exist_ok=True)
for i, pdf_path in enumerate(pdf_files):
print(f"\n[{i+1}/{len(pdf_files)}] Processing {pdf_path.name}...")
# Convert to text
text_content = self.convert_pdf_to_text(str(pdf_path))
if text_content.strip():
# Save as text file
output_file = Path(output_directory) / f"{pdf_path.stem}.txt"
with open(output_file, 'w', encoding='utf-8') as f:
f.write(f"SOURCE: {pdf_path}\n")
f.write(f"CONVERTED: 2025-01-11\n")
f.write("="*50 + "\n\n")
f.write(text_content)
print(f" ✅ Saved: {output_file}")
else:
print(f" ❌ No text extracted from {pdf_path.name}")
# Save conversion statistics
stats_file = Path(output_directory) / "conversion_stats.json"
with open(stats_file, 'w', encoding='utf-8') as f:
json.dump(self.conversion_stats, f, indent=2, ensure_ascii=False)
print(f"\n🎉 PDF conversion complete! Check {output_directory}")
return len([f for f in self.conversion_stats.values() if f['success']])
# Usage
if __name__ == "__main__":
converter = PDFConverter(max_pages=50)
# Convert all PDFs
pdf_dir = "/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri"
output_dir = "/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri/INDEX-SISTEM-JOCURI/converted_pdfs"
converted_count = converter.convert_all_pdfs(pdf_dir, output_dir)
print(f"Final result: {converted_count} PDFs successfully converted")


@@ -0,0 +1,244 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
repair_extractions.py — one-shot repair of malformed extraction JSON.
Subagents systematically emit unescaped ASCII double-quotes inside string
values (Romanian text like „Unu" uses a closing " that terminates the JSON
string early). Re-extraction reproduces the bug, so we repair instead.
IMPORTANT — why NOT json_repair: json_repair "recovers" an unescaped quote by
ending the string at the stray quote and reinterpreting the trailing text as a
new key, which (a) TRUNCATES the value and (b) injects garbage keys. The
truncation is silent (the field is still non-empty) and slips past a naive
presence check. So we use a faithful char-scanner that ESCAPES stray quotes
(\\") instead of splitting on them, then validate the result against the real
activity schema (additionalProperties:false also catches any residual split).
This is an OFFLINE maintenance tool. build_database.py must NOT depend on it —
the "DB regenerable from data/extracted/" invariant requires plain valid JSON on
disk. We write clean JSON back to data/extracted/ and the build reads vanilla
json.
Source selection (faithful recovery needs the ORIGINAL malformed text):
* a chunk is a candidate when a MALFORMED original exists — either the
top-level data/extracted/<key>.json is itself invalid, or a malformed
original sits in data/extracted/_rejected/<key>.json.
* the malformed original is preferred as the repair source.
* chunks whose only artifact is already-valid JSON (e.g. a prior json_repair
output that lost the original) are NOT silently "repaired" — if such a chunk
has no valid top-level file it is reported as needing RE-EXTRACTION.
Usage:
python scripts/repair_extractions.py # report only (dry run)
python scripts/repair_extractions.py --apply # write repaired JSON
"""
from __future__ import annotations
import argparse
import glob
import json
from pathlib import Path
from typing import Optional
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
EXTRACTED = REPO_ROOT / "data" / "extracted"
REJECTED = EXTRACTED / "_rejected"
if str(SCRIPT_DIR) not in __import__("sys").path:
__import__("sys").path.insert(0, str(SCRIPT_DIR))
from import_common import DEFAULT_SCHEMA_PATH, load_schema, validate_extraction # noqa: E402
def escape_stray_quotes(s: str) -> str:
    """Escape ASCII double-quotes that occur INSIDE a JSON string value.

    A `"` inside a string is treated as a real string-close only when the next
    non-whitespace char is structural (`,` `}` `]` `:`) or EOF; otherwise it is
    content and is escaped to `\\"`. This preserves the full value instead of
    truncating it (the json_repair failure mode).
    """
    out: list[str] = []
    in_str = False
    esc = False
    n = len(s)
    i = 0
    while i < n:
        c = s[i]
        if esc:
            out.append(c)
            esc = False
            i += 1
            continue
        if c == "\\":
            out.append(c)
            esc = True
            i += 1
            continue
        if c == '"':
            if not in_str:
                in_str = True
                out.append(c)
            else:
                j = i + 1
                while j < n and s[j] in " \t\r\n":
                    j += 1
                nxt = s[j] if j < n else ""
                if nxt in ",}]:" or nxt == "":
                    in_str = False
                    out.append(c)
                else:
                    out.append('\\"')  # content quote → escape, keep value whole
            i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)
def _is_valid_json(path: Path) -> bool:
try:
json.loads(path.read_text(encoding="utf-8"))
return True
except (json.JSONDecodeError, OSError):
return False
def _malformed_source(key: str) -> Optional[Path]:
"""Return the malformed-original file for a chunk, preferring top-level."""
live = EXTRACTED / f"{key}.json"
if live.exists() and not _is_valid_json(live):
return live
rej = REJECTED / f"{key}.json"
if rej.exists() and not _is_valid_json(rej):
return rej
return None
def _candidate_keys() -> tuple[dict[str, Path], list[str]]:
"""
(repair_candidates, needs_reextraction).
repair_candidates: key -> malformed source file (faithfully repairable).
needs_reextraction: chunks with no malformed original AND no valid
top-level file (their original was lost) — must be re-extracted.
"""
keys = set()
for fn in glob.glob(str(EXTRACTED / "*.json")):
keys.add(Path(fn).stem)
for fn in glob.glob(str(REJECTED / "*.json")):
keys.add(Path(fn).stem)
candidates: dict[str, Path] = {}
needs_reextraction: list[str] = []
for key in sorted(keys):
# A malformed original anywhere is faithfully repairable, and is the
# source of truth even if a (json_repair-produced, possibly truncated)
# valid top-level file exists — escaping the original never truncates,
# so re-repairing from it is always >= the json_repair output.
src = _malformed_source(key)
if src is not None:
candidates[key] = src
continue
live = EXTRACTED / f"{key}.json"
if live.exists() and _is_valid_json(live):
continue # genuinely-valid extraction, nothing to do
# no valid top-level and no malformed original to repair from
needs_reextraction.append(key)
return candidates, needs_reextraction
def repair(apply: bool) -> int:
schema = load_schema(DEFAULT_SCHEMA_PATH)
candidates, needs_reextraction = _candidate_keys()
print("=" * 64)
print(f"REPAIR EXTRACTIONS ({'APPLY' if apply else 'dry run'})")
print("=" * 64)
print(f"repair candidates: {len(candidates)}")
def _textlen(data: dict) -> int:
total = 0
for a in data.get("activities", []):
if isinstance(a, dict):
for v in a.values():
if isinstance(v, str):
total += len(v)
return total
ok = 0
kept_toplevel = 0
still_bad: list[str] = []
schema_fail: list[tuple[str, str]] = []
for key, src in candidates.items():
live = EXTRACTED / f"{key}.json"
live_valid = live.exists() and _is_valid_json(live)
raw = src.read_text(encoding="utf-8")
fixed = escape_stray_quotes(raw)
try:
data = json.loads(fixed)
except json.JSONDecodeError as exc:
if live_valid:
kept_toplevel += 1 # genuine top-level is fine; stale _rejected
else:
still_bad.append(f"{key}: still invalid after escape ({exc})")
continue
errors = validate_extraction(data, schema)
if errors:
if live_valid:
kept_toplevel += 1
else:
schema_fail.append((key, errors[0]))
print(f" {key[:50]:<50} SCHEMA-FAIL: {errors[0][:40]}")
continue
# Faithfulness guard: only replace a valid top-level when the escaped
# repair carries STRICTLY more text (i.e. the top-level was a truncated
# json_repair output). Genuine extractions are kept untouched.
if live_valid:
try:
live_data = json.loads(live.read_text(encoding="utf-8"))
except json.JSONDecodeError:
live_data = {}
if _textlen(data) <= _textlen(live_data):
kept_toplevel += 1
continue
n = len(data.get("activities", []))
print(f" {key[:50]:<50} {n:>3} acts REPAIR")
if apply:
live.write_text(
json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
)
ok += 1
print("-" * 64)
print(f"repaired: {ok} | kept genuine top-level: {kept_toplevel} | "
f"schema-fail: {len(schema_fail)} | still-bad: {len(still_bad)} | "
f"needs re-extraction: {len(needs_reextraction)}")
for key, err in schema_fail:
print(f" ⚠ schema {key}: {err[:60]}")
for msg in still_bad:
print(f"{msg}")
for key in needs_reextraction:
print(f" ↻ re-extract: {key}")
if not apply:
print("\nDry run — re-run with --apply to write repaired JSON.")
print("=" * 64)
return 0
def main(argv: Optional[list[str]] = None) -> int:
parser = argparse.ArgumentParser(description="Repair malformed extraction JSON.")
parser.add_argument("--apply", action="store_true",
help="write repaired JSON (default: dry run)")
args = parser.parse_args(argv)
return repair(args.apply)
if __name__ == "__main__":
raise SystemExit(main())
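
The quote-escaping heuristic described in the repair_extractions.py docstring can be tried in isolation. The block below is a minimal sketch, not the script itself: it restates the scanner in compact form and runs it on a made-up sample string (the `broken` payload is hypothetical, not taken from `data/extracted/`). One limit of the heuristic is worth keeping in mind: a content quote that happens to be followed directly by `,` or `:` would be read as a string close, which is presumably why the script also validates the result against the schema.

```python
import json


def escape_stray_quotes(s: str) -> str:
    """Compact restatement of the scanner, for illustration only."""
    out = []
    in_str = False
    i, n = 0, len(s)
    while i < n:
        c = s[i]
        if c == "\\" and i + 1 < n:
            # Already-escaped pair: copy both characters untouched.
            out.append(s[i:i + 2])
            i += 2
            continue
        if c == '"':
            if not in_str:
                in_str = True
                out.append(c)
            else:
                # Look past whitespace to decide: real close or content quote?
                j = i + 1
                while j < n and s[j] in " \t\r\n":
                    j += 1
                nxt = s[j] if j < n else ""
                if nxt == "" or nxt in ",}]:":
                    in_str = False
                    out.append(c)
                else:
                    out.append('\\"')  # content quote: escape it, keep the value whole
            i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)


# Hypothetical sample: the Romanian closing quote is a plain ASCII ".
broken = '{"name": "Jocul „Unu" si doi", "language": "ro"}'

try:
    json.loads(broken)
    parsed_without_repair = True
except json.JSONDecodeError:
    parsed_without_repair = False

fixed = json.loads(escape_stray_quotes(broken))
print(parsed_without_repair, fixed["name"])
```

The full value survives the repair, which is the property the docstring contrasts with the json_repair behaviour.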

145
scripts/review_queue.py Normal file

@@ -0,0 +1,145 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
review_queue.py — CLI for the needs_review lifecycle (plan §5c).
Rows land in the queue when dedup leaves a borderline pair separate, or when a
legacy `.doc` source was converted imperfectly. Each row has a stable content
key; a decision written here is stored in data/review_decisions.json (git
tracked) and re-applied by build_database.py on every rebuild, so the queue
never resurfaces a resolved row.
Commands:
python scripts/review_queue.py list
python scripts/review_queue.py resolve <id> <merge|keep-separate|drop>
"""
from __future__ import annotations
import argparse
import json
import sqlite3
import sys
from pathlib import Path
from typing import Optional
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
if _p not in sys.path:
sys.path.insert(0, _p)
from import_common import content_key, normalize_name # noqa: E402
VALID_DECISIONS = ("merge", "keep-separate", "drop")
# --------------------------------------------------------------------------
# review_decisions.json
# --------------------------------------------------------------------------
def load_decisions(path: Path) -> dict:
if path.is_file():
try:
data = json.loads(path.read_text(encoding="utf-8"))
if isinstance(data, dict):
return data
except (json.JSONDecodeError, OSError):
pass
return {}
def save_decisions(decisions: dict, path: Path) -> None:
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(
json.dumps(decisions, indent=2, ensure_ascii=False, sort_keys=True),
encoding="utf-8",
)
# --------------------------------------------------------------------------
# queue
# --------------------------------------------------------------------------
def list_queue(db_path: Path) -> list[dict]:
"""Return every needs_review row in the current DB, with its content key."""
if not db_path.is_file():
return []
conn = sqlite3.connect(db_path)
conn.row_factory = sqlite3.Row
try:
rows = conn.execute(
"SELECT name, normalized_name, language, description "
"FROM activities WHERE needs_review = 1 ORDER BY normalized_name"
).fetchall()
except sqlite3.OperationalError:
return []
finally:
conn.close()
out = []
for row in rows:
norm = row["normalized_name"] or normalize_name(row["name"])
key = content_key(norm, row["language"], row["description"] or "")
out.append({
"id": key,
"name": row["name"],
"language": row["language"],
"description": row["description"] or "",
})
return out
def resolve(decisions_path: Path, content_id: str, decision: str) -> dict:
"""Record a decision for a content key in review_decisions.json."""
if decision not in VALID_DECISIONS:
raise ValueError(
f"invalid decision {decision!r}; expected one of {VALID_DECISIONS}"
)
decisions = load_decisions(decisions_path)
decisions[content_id] = {"decision": decision}
save_decisions(decisions, decisions_path)
return decisions
# --------------------------------------------------------------------------
# CLI
# --------------------------------------------------------------------------
def main(argv: Optional[list[str]] = None) -> int:
parser = argparse.ArgumentParser(description="needs_review queue CLI")
parser.add_argument("--db", default="data/activities.db")
parser.add_argument("--decisions", default="data/review_decisions.json")
sub = parser.add_subparsers(dest="command", required=True)
sub.add_parser("list", help="list rows currently flagged needs_review")
p_resolve = sub.add_parser("resolve", help="record a decision for a row")
p_resolve.add_argument("id", help="content id from `list`")
p_resolve.add_argument("decision", choices=VALID_DECISIONS)
args = parser.parse_args(argv)
if args.command == "list":
rows = list_queue(Path(args.db))
if not rows:
print("review queue is empty.")
return 0
print(f"{len(rows)} row(s) need review:\n")
for r in rows:
desc = r["description"][:80].replace("\n", " ")
print(f" id : {r['id']}")
print(f" name : {r['name']} [{r['language']}]")
print(f" desc : {desc}")
print(f" -> review_queue.py resolve {r['id']} <merge|keep-separate|drop>")
print()
return 0
if args.command == "resolve":
resolve(Path(args.decisions), args.id, args.decision)
print(f"recorded: {args.id} -> {args.decision}")
print(f"written to {args.decisions} (applied on next build_database --rebuild)")
return 0
return 1
if __name__ == "__main__":
raise SystemExit(main())
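
The review_queue.py docstring says a recorded decision keeps a resolved row from resurfacing after a rebuild. The build-side code is not part of this diff, so the following is only a sketch of that lifecycle under stated assumptions: `record_decision` mirrors the shape `resolve()` writes, while `still_pending` is a hypothetical filter standing in for whatever build_database.py actually does with `review_decisions.json`. The queue rows and keys are invented.

```python
import json
import tempfile
from pathlib import Path

VALID_DECISIONS = ("merge", "keep-separate", "drop")


def record_decision(path: Path, content_id: str, decision: str) -> dict:
    """Write {content_id: {"decision": ...}}, the shape resolve() stores."""
    if decision not in VALID_DECISIONS:
        raise ValueError(f"invalid decision {decision!r}")
    decisions = json.loads(path.read_text(encoding="utf-8")) if path.is_file() else {}
    decisions[content_id] = {"decision": decision}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(decisions, indent=2, ensure_ascii=False, sort_keys=True),
        encoding="utf-8",
    )
    return decisions


def still_pending(queue: list[dict], decisions: dict) -> list[dict]:
    """Hypothetical build-time filter: rows with a recorded decision drop out."""
    return [row for row in queue if row["id"] not in decisions]


with tempfile.TemporaryDirectory() as tmp:
    decisions_path = Path(tmp) / "data" / "review_decisions.json"
    # Invented queue rows; real ids come from content_key().
    queue = [{"id": "key-a", "name": "Joc A"}, {"id": "key-b", "name": "Joc B"}]
    decisions = record_decision(decisions_path, "key-a", "keep-separate")
    pending = still_pending(queue, decisions)
    reloaded = json.loads(decisions_path.read_text(encoding="utf-8"))

print([row["id"] for row in pending])
```

Because the decision is keyed on the content key and not on a database row id, it stays valid across rebuilds as long as the key inputs do not change.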

289
scripts/run_enrichment.py Normal file

@@ -0,0 +1,289 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
run_enrichment.py — enrichment orchestrator (plan Part B3).
Mirror of run_extraction.py, on the *other* side of the rebuild. It reads the
already-rebuilt data/activities.db, and for every activity emits one subagent
prompt asking for a single bilingual + inferred-filter enrichment pass. Like
extraction, this script does NOT call the LLM — the interactive Claude Code
orchestrator launches waves of subagents on the emitted prompts.
Keying is the crux (plan §"Cheia de keying"): each row's overlay is keyed on
import_common.content_key(normalized_name, language, _normalize_text(description))
— the SAME function build_database uses to apply the overlay. The key is stable
only while the extraction text is frozen, so enrichment runs AFTER the freezing
rebuild.
Modes:
(default) emit one prompt per activity that has no enrichment part yet
(resumable: data/enrichment_parts/<key>.json present => skip)
--collect merge data/enrichment_parts/*.json -> data/enrichment.json
Pilot scoping (plan B5): --source <source_id substring> and/or --limit N narrow
the emitted prompts to a single source / category for the sign-off pilot.
Usage:
python scripts/run_enrichment.py --source teambuilding_corbu # pilot
python scripts/run_enrichment.py # all rows
python scripts/run_enrichment.py --collect # merge parts
"""
from __future__ import annotations
import argparse
import json
import sqlite3
import sys
from pathlib import Path
from typing import Optional
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
if _p not in sys.path:
sys.path.insert(0, _p)
from import_common import ( # noqa: E402
content_key,
find_chunk_text,
normalize_name,
)
from repair_extractions import escape_stray_quotes # noqa: E402
ENRICHMENT_PROMPT = SCRIPT_DIR / "ENRICHMENT_PROMPT.md"
# Columns pulled from the DB into the prompt as the "current value" context.
_DB_COLUMNS = (
"id", "name", "description", "rules", "variations",
"category", "content_type", "language", "normalized_name",
"page_reference", "source_id", "chunk_key",
"participants_min", "participants_max",
"duration_min", "duration_max",
"age_group_min", "age_group_max",
)
# How much source-chunk text to inline. Chunks are page-sized; cap so a dense
# chunk does not blow the prompt up, but keep enough to ground the expansion.
_CHUNK_TEXT_CAP = 12000
def _fetch_rows(db_path: Path, source_substr: Optional[str]) -> list[dict]:
conn = sqlite3.connect(db_path)
conn.row_factory = sqlite3.Row
try:
cols = ", ".join(_DB_COLUMNS)
sql = f"SELECT {cols} FROM activities"
params: list = []
if source_substr:
sql += " WHERE (source_id LIKE ? OR chunk_key LIKE ?)"
params = [f"%{source_substr}%", f"%{source_substr}%"]
sql += " ORDER BY source_id, id"
return [dict(r) for r in conn.execute(sql, params).fetchall()]
finally:
conn.close()
def _row_content_key(row: dict) -> str:
return content_key(
row.get("normalized_name") or normalize_name(row.get("name") or ""),
row.get("language"),
row.get("description") or "",
)
def _chunk_text_for_row(row: dict, chunks_dir: Path) -> Optional[str]:
"""Locate the source-chunk text via the row's chunk_key / source_id."""
header = {"chunk_key": row.get("chunk_key"), "source_id": row.get("source_id")}
if not header["chunk_key"]:
return None
# find_chunk_text resolves from the header when chunk_key is present;
# the json_path arg is only a fallback, so a synthetic path is fine.
text = find_chunk_text(Path(f"{row['chunk_key']}.json"), header, chunks_dir)
if text and len(text) > _CHUNK_TEXT_CAP:
text = text[:_CHUNK_TEXT_CAP] + "\n…[chunk truncated]…"
return text
def _current_fields_block(row: dict) -> str:
"""The activity's current DB values, as a compact JSON block for context."""
fields = {
"name": row.get("name"),
"description": row.get("description"),
"rules": row.get("rules"),
"variations": row.get("variations"),
"category": row.get("category"),
"content_type": row.get("content_type"),
"language": row.get("language"),
"participants_min": row.get("participants_min"),
"participants_max": row.get("participants_max"),
"duration_min": row.get("duration_min"),
"duration_max": row.get("duration_max"),
"age_group_min": row.get("age_group_min"),
"age_group_max": row.get("age_group_max"),
}
return json.dumps(fields, ensure_ascii=False, indent=2)
def emit_enrichment_prompt(
row: dict, key: str, chunks_dir: Path, prompts_dir: Path
) -> Path:
"""Write the subagent enrichment prompt for one activity."""
chunk_text = _chunk_text_for_row(row, chunks_dir)
source_block = (
chunk_text if chunk_text is not None
else "[source chunk text unavailable — translate only what is given "
"above; do NOT invent steps, and mark any inferred filter field "
"as estimated]"
)
part_path = f"data/enrichment_parts/{key}.json"
text = "\n".join([
f"# ENRICHMENT — activity `{row.get('name')}` (id {row.get('id')})",
"",
f"Follow the rules in `{ENRICHMENT_PROMPT.relative_to(REPO_ROOT)}` EXACTLY.",
"Single pass. Translate faithfully to Romanian; expand description_ro "
"ONLY from the source chunk text below; mark inferred filter fields in "
"`estimated_fields`.",
"",
f"Write the result JSON to: `{part_path}`",
f'It MUST include `"content_key": "{key}"`.',
f'Page reference: {row.get("page_reference") or "?"}',
"",
"## Current activity values (the text to translate / enrich)",
"```json",
_current_fields_block(row),
"```",
"",
"## Source chunk text (ground description_ro expansion in THIS only)",
"```",
source_block,
"```",
"",
])
prompts_dir.mkdir(parents=True, exist_ok=True)
out = prompts_dir / f"{key}.prompt.md"
out.write_text(text, encoding="utf-8")
return out
def collect_enrichment(parts_dir: Path, out_path: Path) -> dict:
    """Merge data/enrichment_parts/*.json into one flat content_key map."""
    merged: dict = {}
    bad: list[str] = []
    repaired: list[str] = []
    if parts_dir.is_dir():
        for part in sorted(parts_dir.glob("*.json")):
            # Read inside its own try so an unreadable part is reported as bad
            # instead of aborting the whole collect.
            try:
                raw = part.read_text(encoding="utf-8")
            except OSError:
                bad.append(part.name)
                continue
            try:
                data = json.loads(raw)
            except json.JSONDecodeError:
                # Enrichment subagents hit the same unescaped-ASCII-quote bug as
                # extraction (description_ro is full of Romanian „…"). Repair by
                # escaping rather than dropping the activity's enrichment.
                try:
                    data = json.loads(escape_stray_quotes(raw))
                    repaired.append(part.name)
                except json.JSONDecodeError:
                    bad.append(part.name)
                    continue
            if not isinstance(data, dict):
                bad.append(part.name)
                continue
            key = data.get("content_key") or part.stem
            entry = {k: v for k, v in data.items() if k != "content_key"}
            merged[key] = entry
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return {"entries": len(merged), "repaired": repaired,
            "bad_parts": bad, "out": str(out_path)}
def run_emit(
*,
db_path: Path,
chunks_dir: Path,
parts_dir: Path,
prompts_dir: Path,
source_substr: Optional[str],
limit: Optional[int],
) -> dict:
rows = _fetch_rows(db_path, source_substr)
emitted, skipped = 0, 0
for row in rows:
key = _row_content_key(row)
if (parts_dir / f"{key}.json").is_file():
skipped += 1
continue
emit_enrichment_prompt(row, key, chunks_dir, prompts_dir)
emitted += 1
if limit and emitted >= limit:
break
return {
"rows": len(rows),
"emitted": emitted,
"skipped_done": skipped,
"prompts_dir": str(prompts_dir),
}
def main(argv: Optional[list[str]] = None) -> int:
parser = argparse.ArgumentParser(description="Enrichment orchestrator.")
parser.add_argument("--db", default="data/activities.db")
parser.add_argument("--chunks", default="data/chunks")
parser.add_argument("--parts", default="data/enrichment_parts")
parser.add_argument("--prompts", default="data/enrichment_prompts")
parser.add_argument("--out", default="data/enrichment.json")
parser.add_argument("--source", default=None,
help="only rows whose source_id/chunk_key contains this (pilot)")
parser.add_argument("--limit", type=int, default=None,
help="cap emitted prompts (pilot)")
parser.add_argument("--collect", action="store_true",
help="merge enrichment parts into the overlay JSON")
args = parser.parse_args(argv)
print("=" * 60)
print("ENRICHMENT ORCHESTRATOR")
print("=" * 60)
if args.collect:
result = collect_enrichment(Path(args.parts), Path(args.out))
print(f"collected : {result['entries']} entries -> {result['out']}")
if result["repaired"]:
print(f"repaired : {len(result['repaired'])} parts (unescaped-quote fix)")
if result["bad_parts"]:
print(f"bad parts : {len(result['bad_parts'])} (skipped)")
for name in result["bad_parts"]:
print(f" - {name}")
print("Run build_database.py --rebuild to apply the overlay.")
print("=" * 60)
return 0
summary = run_emit(
db_path=Path(args.db),
chunks_dir=Path(args.chunks),
parts_dir=Path(args.parts),
prompts_dir=Path(args.prompts),
source_substr=args.source,
limit=args.limit,
)
print(f"rows in DB : {summary['rows']}"
+ (f" (filtered by '{args.source}')" if args.source else ""))
print(f"already enriched : {summary['skipped_done']}")
print(f"prompts emitted : {summary['emitted']}")
if summary["emitted"]:
print(f"prompts dir : {summary['prompts_dir']}/")
print("Launch waves of ~8-16 Sonnet subagents on those prompts, each "
"writing data/enrichment_parts/<key>.json, then run "
"run_enrichment.py --collect and build_database.py --rebuild.")
else:
print("Nothing to emit — run --collect then build_database.py --rebuild.")
print("=" * 60)
return 0
if __name__ == "__main__":
raise SystemExit(main())
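
The run_enrichment.py docstring stresses that the overlay key is stable only while the extraction text is frozen. `import_common.content_key` is not shown in this diff, so the sketch below uses a hypothetical stand-in (`sketch_content_key`, a SHA-1 over the normalized name, language and whitespace-collapsed description) purely to illustrate why an edited description orphans its enrichment part. The real function may hash or normalize differently.

```python
import hashlib


def sketch_content_key(normalized_name: str, language: str, description: str) -> str:
    """Hypothetical stand-in for import_common.content_key (real one not shown)."""
    # Assumed normalization: collapse whitespace and lowercase the description.
    normalized_desc = " ".join(description.split()).lower()
    payload = "\x1f".join([normalized_name, language or "", normalized_desc])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:16]


# Invented activity text, for illustration only.
key_before = sketch_content_key("podul uman", "ro", "Echipa formeaza un pod.")
key_same = sketch_content_key("podul uman", "ro", "Echipa  formeaza un pod.")
key_after = sketch_content_key("podul uman", "ro", "Echipa formeaza un pod din scanduri.")

# Cosmetic whitespace changes keep the key; a re-extracted description does not,
# so data/enrichment_parts/<key>.json written earlier would no longer match.
print(key_before == key_same, key_before == key_after)
```

Under this assumption, the resumable skip in `run_emit` (part file present, so skip) and the overlay in the rebuild both depend on the same key, which is why enrichment has to run after the freezing rebuild and not before it.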


@@ -1,50 +1,140 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Main extraction orchestrator
Runs the entire extraction process
run_extraction.py — extraction orchestrator (plan §3).
The pipeline is script-only up to the LLM step: this script normalizes the
corpus, chunks the normalized sources, and emits one subagent prompt per
`pending` chunk. It does NOT run the extraction itself — that step is the
interactive Claude Code orchestrator launching waves of subagents.
Steps:
1. normalize data/carti-camp-jocuri/ -> data/sources/*.txt
2. chunk data/sources/*.txt -> data/chunks/<id>/*.txt + manifest.json
3. emit one prompt per `pending` chunk -> data/chunks/_prompts/*.md
4. report how many chunks remain `pending`
Usage:
python scripts/run_extraction.py
python scripts/run_extraction.py --skip-normalize # re-chunk only
"""
from __future__ import annotations
import argparse
import sys
import time
from pathlib import Path
from typing import Optional
from unified_processor import UnifiedProcessor
from import_claude_activities import ClaudeActivityImporter
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
if _p not in sys.path:
sys.path.insert(0, _p)
import chunk_sources # noqa: E402
import normalize_sources # noqa: E402
SUBAGENT_PROMPT = SCRIPT_DIR / "SUBAGENT_PROMPT.md"
def emit_chunk_prompt(chunk_key: str, meta: dict, prompts_dir: Path) -> Path:
"""Write the subagent prompt for one pending chunk."""
chunk_file = meta.get("chunk_file", f"data/chunks/<id>/{chunk_key}.txt")
expected_json = meta.get("expected_json", f"{chunk_key}.json")
text = "\n".join([
f"# EXTRACTION — chunk `{chunk_key}`",
"",
f"Read ONLY this chunk: `{chunk_file}`",
f"Chunk range: {meta.get('chunk_range', '?')}",
"",
f"Follow the rules in `{SUBAGENT_PROMPT.relative_to(REPO_ROOT)}`.",
"Identify every distinct activity, fill the schema "
"(`scripts/activity_schema.json`), and write the result to:",
"",
f" data/extracted/{expected_json}",
"",
"Header fields to set: "
f'source_id="{meta.get("source_id", "")}", chunk_key="{chunk_key}", '
f'source_hash="{meta.get("source_hash", "")}".',
"",
])
prompts_dir.mkdir(parents=True, exist_ok=True)
out = prompts_dir / f"{chunk_key}.prompt.md"
out.write_text(text, encoding="utf-8")
return out
def run(
*,
corpus_root: Path,
sources_dir: Path,
chunks_dir: Path,
skip_normalize: bool = False,
) -> dict:
summary: dict = {}
if not skip_normalize:
norm = normalize_sources.run(corpus_root, sources_dir)
summary["normalized"] = {"ok": norm["ok"], "total": norm["total"],
"errors": norm["errors"]}
chunk_summary = chunk_sources.run(sources_dir, chunks_dir)
summary["chunks"] = chunk_summary
manifest_path = chunks_dir / "manifest.json"
manifest = chunk_sources.load_manifest(manifest_path)
prompts_dir = chunks_dir / "_prompts"
pending = {k: m for k, m in manifest["chunks"].items()
if m.get("state") == "pending"}
for key, meta in sorted(pending.items()):
emit_chunk_prompt(key, meta, prompts_dir)
states: dict[str, int] = {}
for m in manifest["chunks"].values():
states[m.get("state", "?")] = states.get(m.get("state", "?"), 0) + 1
summary["states"] = states
summary["pending"] = len(pending)
summary["prompts_dir"] = str(prompts_dir)
return summary
def main(argv: Optional[list[str]] = None) -> int:
parser = argparse.ArgumentParser(description="Extraction orchestrator.")
parser.add_argument("--corpus", default="data/carti-camp-jocuri")
parser.add_argument("--sources", default="data/sources")
parser.add_argument("--chunks", default="data/chunks")
parser.add_argument("--skip-normalize", action="store_true",
help="skip normalization, re-chunk existing sources only")
args = parser.parse_args(argv)
summary = run(
corpus_root=Path(args.corpus),
sources_dir=Path(args.sources),
chunks_dir=Path(args.chunks),
skip_normalize=args.skip_normalize,
)
print("=" * 60)
print("EXTRACTION ORCHESTRATOR")
print("=" * 60)
if "normalized" in summary:
n = summary["normalized"]
print(f"normalized : {n['ok']}/{n['total']} (errors {n['errors']})")
print(f"chunks : {summary['chunks']['chunks']}")
for state, count in sorted(summary["states"].items()):
print(f" {state:<10}: {count}")
print(f"\npending chunks remaining : {summary['pending']}")
if summary["pending"]:
print(f"subagent prompts written : {summary['prompts_dir']}/")
print("Launch waves of ~5-10 subagents on those prompts, then run "
"validate_extractions.py and build_database.py --rebuild.")
else:
print("All chunks extracted — run build_database.py --rebuild.")
print("=" * 60)
return 0
def main():
print("="*60)
print("ACTIVITY EXTRACTION SYSTEM")
print("Strategy S8: Hybrid Claude + Scripts")
print("="*60)
# Step 1: Run automated extraction
print("\nSTEP 1: Automated Extraction")
print("-"*40)
processor = UnifiedProcessor()
processor.process_automated_formats()
# Step 2: Wait for Claude processing
print("\n" + "="*60)
print("STEP 2: Manual Claude Processing Required")
print("-"*40)
print("Please process PDF/DOC files with Claude using the template.")
print("Files are listed in: pdf_doc_for_claude.txt")
print("Save extracted activities as JSON in: scripts/extracted_activities/")
print("="*60)
response = input("\nHave you completed Claude processing? (y/n): ")
if response.lower() == 'y':
# Step 3: Import Claude-extracted activities
print("\nSTEP 3: Importing Claude-extracted activities")
print("-"*40)
importer = ClaudeActivityImporter()
importer.import_all_json_files()
print("\n" + "="*60)
print("EXTRACTION COMPLETE!")
print("="*60)
if __name__ == "__main__":
main()
raise SystemExit(main())
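
The new `run()` selects `pending` chunks from the manifest and tallies every state for the final report. A small sketch of that bookkeeping follows, assuming the manifest shape the code reads (`{"chunks": {key: {"state": ...}}}`). The chunk keys and the `extracted` state name are invented for the example; only `pending` appears in the diff.

```python
from collections import Counter

# Invented manifest; real ones come from chunk_sources.load_manifest().
manifest = {
    "chunks": {
        "src1_001": {"state": "pending"},
        "src1_002": {"state": "extracted"},  # assumed state name
        "src2_001": {"state": "pending"},
        "src2_002": {},  # missing state is reported as "?"
    }
}

# Chunks that still need a subagent prompt, in the sorted order run() emits them.
pending = sorted(
    key for key, meta in manifest["chunks"].items()
    if meta.get("state") == "pending"
)

# Per-state tally, equivalent to the manual dict counting in run().
states = Counter(meta.get("state", "?") for meta in manifest["chunks"].values())

for state, count in sorted(states.items()):
    print(f"  {state:<10}: {count}")
print(f"pending chunks remaining : {len(pending)}")
```

`Counter` is used here only for brevity; the hand-rolled dict in the script produces the same counts.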


@@ -1,197 +0,0 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Text/Markdown Activity Extractor
Processes TXT and MD files for activity extraction
"""
import re
from pathlib import Path
from typing import List, Dict
import sqlite3
from datetime import datetime
class TextActivityExtractor:
def __init__(self, db_path='data/activities.db'):
self.db_path = db_path
self.activity_patterns = {
'section_headers': [
r'^#{1,6}\s*(.+)$', # Markdown headers
r'^([A-Z][^\.]{10,100})$', # Plain titles
r'^\d+\.\s*(.+)$', # Numbered lists
r'^[•\-\*]\s*(.+)$', # Bullet points
],
'activity_markers': [
'joc:', 'activitate:', 'exercitiu:', 'team building:',
'nume:', 'titlu:', 'denumire:'
]
}
def extract_from_text(self, file_path: str) -> List[Dict]:
"""Extract activities from a text/markdown file"""
activities = []
try:
with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
content = f.read()
# Method 1: look for markdown sections
if file_path.endswith('.md'):
activities.extend(self._extract_from_markdown(content, file_path))
# Method 2: look for general patterns
activities.extend(self._extract_from_patterns(content, file_path))
# Method 3: look for structured text blocks
activities.extend(self._extract_from_blocks(content, file_path))
except Exception as e:
print(f"Error processing {file_path}: {e}")
return activities
def _extract_from_markdown(self, content, source_file):
"""Extract activities from markdown format"""
activities = []
lines = content.split('\n')
current_activity = None
current_content = []
for line in lines:
# Check whether this line is an activity header
if re.match(r'^#{1,3}\s*(.+)', line):
# Save the previous activity, if there is one
if current_activity and current_content:
current_activity['description'] = '\n'.join(current_content[:20]) # Max 20 lines
activities.append(current_activity)
# Check whether the new header is an activity
header_text = re.sub(r'^#{1,3}\s*', '', line)
if any(marker in header_text.lower() for marker in ['joc', 'activitate', 'exercitiu']):
current_activity = {
'name': header_text[:200],
'source_file': str(source_file),
'category': '[A]'
}
current_content = []
else:
current_activity = None
elif current_activity:
# Append content to the current activity
if line.strip():
current_content.append(line)
# Save the last activity
if current_activity and current_content:
current_activity['description'] = '\n'.join(current_content[:20])
activities.append(current_activity)
return activities
def _extract_from_patterns(self, content, source_file):
"""Extract using pattern matching"""
activities = []
# Look for activity-specific markers
for marker in self.activity_patterns['activity_markers']:
pattern = re.compile(f'{re.escape(marker)}\\s*(.+?)(?=\\n\\n|{re.escape(marker)}|$)',
re.IGNORECASE | re.DOTALL)
matches = pattern.finditer(content)
for match in matches:
activity_text = match.group(1)
if len(activity_text) > 20:
activity = {
'name': activity_text.split('\n')[0][:200],
'description': activity_text[:1000],
'source_file': str(source_file),
'category': '[A]'
}
activities.append(activity)
return activities
def _extract_from_blocks(self, content, source_file):
"""Extract from separate text blocks"""
activities = []
# Split into blocks separated by blank lines
blocks = re.split(r'\n\s*\n', content)
for block in blocks:
if len(block) > 50: # Minimum 50 characters
lines = block.strip().split('\n')
first_line = lines[0].strip()
# Check whether the block looks like an activity
if any(keyword in first_line.lower() for keyword in ['joc', 'activitate', 'exercitiu']):
activity = {
'name': first_line[:200],
'description': block[:1000],
'source_file': str(source_file),
'category': '[A]'
}
activities.append(activity)
return activities
def save_to_database(self, activities):
"""Save to the database"""
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
saved_count = 0
for activity in activities:
try:
# Check for duplicates
cursor.execute(
"SELECT id FROM activities WHERE name = ? AND source_file = ?",
(activity.get('name'), activity.get('source_file'))
)
if not cursor.fetchone():
columns = list(activity.keys())
values = list(activity.values())
placeholders = ['?' for _ in values]
query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
cursor.execute(query, values)
saved_count += 1
except Exception as e:
print(f"Error saving: {e}")
conn.commit()
conn.close()
return saved_count
def process_all_text_files(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
"""Process all text and markdown files"""
base_path = Path(base_path)
text_files = list(base_path.rglob("*.txt"))
md_files = list(base_path.rglob("*.md"))
all_files = text_files + md_files
print(f"Found {len(all_files)} text/markdown files")
all_activities = []
for file_path in all_files:
activities = self.extract_from_text(str(file_path))
all_activities.extend(activities)
print(f"Processed {file_path.name}: {len(activities)} activities")
# Save to database
saved = self.save_to_database(all_activities)
print(f"\nTotal saved: {saved} activities from {len(all_files)} files")
return len(all_files), saved
if __name__ == "__main__":
extractor = TextActivityExtractor()
extractor.process_all_text_files()


@@ -1,151 +0,0 @@
#!/usr/bin/env python3
"""
Unified Activity Processor
Orchestrates all extractors for complete processing
"""
import time
from pathlib import Path
from html_extractor import HTMLActivityExtractor
from text_extractor import TextActivityExtractor
import sqlite3
class UnifiedProcessor:
def __init__(self, db_path='data/activities.db'):
self.db_path = db_path
self.html_extractor = HTMLActivityExtractor(db_path)
self.text_extractor = TextActivityExtractor(db_path)
self.stats = {
'html_processed': 0,
'text_processed': 0,
'pdf_to_process': 0,
'doc_to_process': 0,
'total_activities': 0,
'start_time': None,
'end_time': None
}
def get_current_activity_count(self):
"""Get the current number of activities in the DB"""
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM activities")
count = cursor.fetchone()[0]
conn.close()
return count
def count_files_to_process(self, base_path):
"""Count the files that need to be processed"""
base_path = Path(base_path)
counts = {
'html': len(list(base_path.rglob("*.html"))) + len(list(base_path.rglob("*.htm"))),
'txt': len(list(base_path.rglob("*.txt"))),
'md': len(list(base_path.rglob("*.md"))),
'pdf': len(list(base_path.rglob("*.pdf"))),
'doc': len(list(base_path.rglob("*.doc"))),
'docx': len(list(base_path.rglob("*.docx")))
}
return counts
def process_automated_formats(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
"""Process all formats that can be handled automatically"""
print("="*60)
print("UNIFIED ACTIVITY PROCESSOR - AUTOMATED PHASE")
print("="*60)
self.stats['start_time'] = time.time()
initial_count = self.get_current_activity_count()
# Show initial statistics
file_counts = self.count_files_to_process(base_path)
print("\nFiles to process:")
# `fmt` avoids shadowing the built-in `format`
for fmt, count in file_counts.items():
print(f" {fmt.upper()}: {count} files")
print(f"\nCurrent activities in database: {initial_count}")
print("-"*60)
# PHASE 1: HTML processing (top priority - large volume)
print("\n[1/2] Processing HTML files...")
print("-"*40)
html_processed, html_errors = self.html_extractor.process_all_html_files(base_path)
self.stats['html_processed'] = html_processed
# PHASE 2: Text/MD processing
print("\n[2/2] Processing Text/Markdown files...")
print("-"*40)
text_processed, text_saved = self.text_extractor.process_all_text_files(base_path)
self.stats['text_processed'] = text_processed
# Final statistics
self.stats['end_time'] = time.time()
final_count = self.get_current_activity_count()
self.stats['total_activities'] = final_count - initial_count
# Identify the files that require manual processing
self.stats['pdf_to_process'] = file_counts['pdf']
self.stats['doc_to_process'] = file_counts['doc'] + file_counts['docx']
self.print_summary()
self.save_pdf_doc_list(base_path)
def print_summary(self):
"""Print the processing summary"""
print("\n" + "="*60)
print("PROCESSING SUMMARY")
print("="*60)
duration = self.stats['end_time'] - self.stats['start_time']
print("\nAutomated Processing Results:")
print(f" HTML files processed: {self.stats['html_processed']}")
print(f" Text/MD files processed: {self.stats['text_processed']}")
print(f" New activities added: {self.stats['total_activities']}")
print(f" Processing time: {duration:.1f} seconds")
print("\nFiles requiring Claude processing:")
print(f" PDF files: {self.stats['pdf_to_process']}")
print(f" DOC/DOCX files: {self.stats['doc_to_process']}")
print("\n" + "="*60)
print("NEXT STEPS:")
print("1. Review the file 'pdf_doc_for_claude.txt' for manual processing")
print("2. Use Claude to extract activities from PDF/DOC files")
print("3. Focus on largest PDF files first (highest activity density)")
print("="*60)
def save_pdf_doc_list(self, base_path):
"""Save the list of PDF/DOC files for processing with Claude"""
base_path = Path(base_path)
pdf_files = sorted(base_path.rglob("*.pdf"), key=lambda p: p.stat().st_size, reverse=True)
doc_files = list(base_path.rglob("*.doc"))
docx_files = list(base_path.rglob("*.docx"))
with open('pdf_doc_for_claude.txt', 'w', encoding='utf-8') as f:
f.write("PDF/DOC FILES FOR CLAUDE PROCESSING\n")
f.write("="*60 + "\n")
f.write("Files sorted by size (largest first = likely more activities)\n\n")
f.write("TOP PRIORITY PDF FILES (process these first):\n")
f.write("-"*40 + "\n")
for i, pdf in enumerate(pdf_files[:20], 1):
size_mb = pdf.stat().st_size / (1024*1024)
f.write(f"{i}. {pdf.name} ({size_mb:.1f} MB)\n")
f.write(f" Path: {pdf}\n\n")
if len(pdf_files) > 20:
f.write(f"\n... and {len(pdf_files)-20} more PDF files\n\n")
f.write("\nDOC/DOCX FILES:\n")
f.write("-"*40 + "\n")
for doc in doc_files + docx_files:
size_kb = doc.stat().st_size / 1024
f.write(f"- {doc.name} ({size_kb:.1f} KB)\n")
print("\nPDF/DOC list saved to: pdf_doc_for_claude.txt")
if __name__ == "__main__":
processor = UnifiedProcessor()
processor.process_automated_formats()
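
`count_files_to_process` above walks the directory tree once per pattern (seven `rglob` calls). A hedged sketch of a single-pass alternative is shown below; the helper name `count_by_suffix` and the `SUFFIX_TO_KEY` table are hypothetical, and unlike the original glob patterns it matches suffixes case-insensitively, which is an assumption about the desired behaviour.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Map each file suffix to the key used in the counts dict.
SUFFIX_TO_KEY = {
    ".html": "html", ".htm": "html",
    ".txt": "txt", ".md": "md",
    ".pdf": "pdf", ".doc": "doc", ".docx": "docx",
}


def count_by_suffix(base_path):
    """Count files per format in one traversal of base_path."""
    # Pre-seed every key with 0 so formats with no files still appear.
    counts = Counter({key: 0 for key in set(SUFFIX_TO_KEY.values())})
    for path in Path(base_path).rglob("*"):
        key = SUFFIX_TO_KEY.get(path.suffix.lower())
        if key is not None and path.is_file():
            counts[key] += 1
    return dict(counts)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "sub").mkdir()
        for name in ("a.html", "b.htm", "sub/c.PDF", "sub/d.txt", "e.jpg"):
            (root / name).write_text("x", encoding="utf-8")
        return count_by_suffix(root)
```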


@@ -0,0 +1,208 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
validate_extractions.py — validate every data/extracted/*.json (plan §5b).
For each extraction file it runs two checks:
1. JSON-schema validation against scripts/activity_schema.json,
2. the source_excerpt anti-hallucination check (each excerpt must be a fuzzy
substring of the chunk it came from).
For every failing chunk it:
* writes the exact re-extraction prompt to data/extracted/_reextract/<chunk>.prompt.md,
* marks the chunk `rejected` in data/chunks/manifest.json.
The orchestrator then re-launches subagents only on the `rejected` chunks; the
loop repeats until nothing is rejected.
Usage:
python scripts/validate_extractions.py
"""
from __future__ import annotations
import argparse
import json
import sys
from pathlib import Path
from typing import Optional
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
if _p not in sys.path:
sys.path.insert(0, _p)
from import_common import ( # noqa: E402
DEFAULT_SCHEMA_PATH,
chunk_key_for,
excerpt_matches,
excerpt_score,
find_chunk_text,
iter_extraction_files,
load_schema,
validate_extraction,
)
SUBAGENT_PROMPT = SCRIPT_DIR / "SUBAGENT_PROMPT.md"
# --------------------------------------------------------------------------
# re-extraction prompt
# --------------------------------------------------------------------------
def build_reextraction_prompt(
chunk_key: str, chunk_file: Optional[str], errors: list[str]
) -> str:
"""The exact prompt to hand a subagent to re-extract a rejected chunk."""
chunk_ref = chunk_file or f"data/chunks/<source_id>/{chunk_key}.txt"
lines = [
f"# RE-EXTRACTION — chunk `{chunk_key}`",
"",
"The previous extraction for this chunk was **REJECTED**. Reasons:",
"",
]
lines += [f"- {e}" for e in errors]
lines += [
"",
"## What to do",
"",
f"1. Read ONLY this chunk: `{chunk_ref}`",
f"2. Follow the extraction rules in `{SUBAGENT_PROMPT.relative_to(REPO_ROOT)}`.",
"3. Fix every problem listed above. In particular:",
" - every `source_excerpt` must be copied **verbatim** from the chunk",
" (it is checked as a fuzzy substring — invented quotes are rejected);",
" - `source_excerpt` and `page_reference` are mandatory on every activity;",
" - the output must validate against `scripts/activity_schema.json`.",
f"4. Overwrite the extraction file `data/extracted/{chunk_key}.json`.",
"",
]
return "\n".join(lines)
# --------------------------------------------------------------------------
# manifest
# --------------------------------------------------------------------------
def load_manifest(manifest_path: Path) -> dict:
if manifest_path.is_file():
try:
data = json.loads(manifest_path.read_text(encoding="utf-8"))
data.setdefault("chunks", {})
return data
except (json.JSONDecodeError, OSError):
pass
return {"chunks": {}}
def save_manifest(manifest: dict, manifest_path: Path) -> None:
manifest_path.parent.mkdir(parents=True, exist_ok=True)
manifest_path.write_text(
json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8"
)
def mark_rejected(manifest: dict, chunk_key: str) -> None:
"""Flip a chunk to `rejected` in the manifest (creating the entry if new)."""
entry = manifest["chunks"].get(chunk_key, {})
entry["state"] = "rejected"
manifest["chunks"][chunk_key] = entry
# --------------------------------------------------------------------------
# validation
# --------------------------------------------------------------------------
def validate_file(json_path: Path, schema: dict, chunks_dir: Path) -> list[str]:
"""Return the list of errors for one extraction file (empty == valid)."""
try:
data = json.loads(json_path.read_text(encoding="utf-8"))
except json.JSONDecodeError as exc:
return [f"invalid JSON: {exc}"]
errors = validate_extraction(data, schema)
if errors:
return errors
header = data.get("header", {})
chunk_text = find_chunk_text(json_path, header, chunks_dir)
if chunk_text is None:
return [f"source chunk not found for {chunk_key_for(json_path, header)}"]
for adict in data.get("activities", []):
excerpt = adict.get("source_excerpt") or ""
if not excerpt_matches(excerpt, chunk_text):
score = excerpt_score(excerpt, chunk_text)
errors.append(
f"activity {adict.get('name')!r}: source_excerpt not found in "
f"chunk (best match {score:.0f}/100) — possible hallucination"
)
return errors
def run(
extracted_dir: Path,
chunks_dir: Path,
manifest_path: Path,
schema_path: Path = DEFAULT_SCHEMA_PATH,
) -> dict:
schema = load_schema(schema_path)
manifest = load_manifest(manifest_path)
reextract_dir = extracted_dir / "_reextract"
report = {"total": 0, "valid": 0, "rejected": 0, "rejected_chunks": []}
for json_path in iter_extraction_files(extracted_dir):
report["total"] += 1
errors = validate_file(json_path, schema, chunks_dir)
if not errors:
report["valid"] += 1
continue
report["rejected"] += 1
try:
data = json.loads(json_path.read_text(encoding="utf-8"))
header = data.get("header", {})
except json.JSONDecodeError:
header = {}
chunk_key = chunk_key_for(json_path, header)
chunk_file = None
meta = manifest["chunks"].get(chunk_key)
if meta:
chunk_file = meta.get("chunk_file")
reextract_dir.mkdir(parents=True, exist_ok=True)
prompt = build_reextraction_prompt(chunk_key, chunk_file, errors)
(reextract_dir / f"{chunk_key}.prompt.md").write_text(prompt, encoding="utf-8")
mark_rejected(manifest, chunk_key)
report["rejected_chunks"].append({"chunk": chunk_key, "errors": errors})
save_manifest(manifest, manifest_path)
return report
# --------------------------------------------------------------------------
# CLI
# --------------------------------------------------------------------------
def main(argv: Optional[list[str]] = None) -> int:
parser = argparse.ArgumentParser(description="Validate extraction JSON files.")
parser.add_argument("--extracted", default="data/extracted")
parser.add_argument("--chunks", default="data/chunks")
parser.add_argument("--manifest", default="data/chunks/manifest.json")
parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
args = parser.parse_args(argv)
report = run(
Path(args.extracted), Path(args.chunks), Path(args.manifest), Path(args.schema)
)
print(f"extraction files : {report['total']}")
print(f" valid : {report['valid']}")
print(f" rejected : {report['rejected']}")
for item in report["rejected_chunks"]:
print(f" [rejected] {item['chunk']}")
for err in item["errors"]:
print(f" - {err}")
if report["rejected"]:
print(f"\nRe-extraction prompts written to {args.extracted}/_reextract/")
return 0
if __name__ == "__main__":
raise SystemExit(main())
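
`excerpt_matches` and `excerpt_score` are imported from `import_common`, which is not shown in this diff. As a rough illustration of what a fuzzy-substring check can look like, the sketch below uses only the standard library's `difflib`. The names `fuzzy_substring_score` and `excerpt_matches_sketch`, the whitespace/case normalization, and the 85-point threshold are all assumptions, not the repository's implementation (which may well rely on `rapidfuzz`).

```python
from difflib import SequenceMatcher


def _norm(text):
    """Lower-case and collapse all whitespace runs to single spaces."""
    return " ".join(text.lower().split())


def fuzzy_substring_score(excerpt, chunk_text):
    """Score 0-100: best similarity between the excerpt and any same-width window."""
    needle, haystack = _norm(excerpt), _norm(chunk_text)
    if not needle or not haystack:
        return 0.0
    if needle in haystack:
        return 100.0  # verbatim after normalization
    width = len(needle)
    step = max(1, width // 4)  # coarse stride keeps the scan cheap
    last_start = max(0, len(haystack) - width)
    best = 0.0
    for start in range(0, last_start + 1, step):
        window = haystack[start:start + width]
        ratio = SequenceMatcher(None, needle, window, autojunk=False).ratio()
        best = max(best, ratio)
    return 100.0 * best


def excerpt_matches_sketch(excerpt, chunk_text, threshold=85.0):
    # The threshold is a hypothetical value chosen for illustration.
    return fuzzy_substring_score(excerpt, chunk_text) >= threshold
```

A windowed scan is used rather than a single whole-document comparison so that a long chunk cannot inflate the score by matching scattered characters.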

114
tests/conftest.py Normal file

@@ -0,0 +1,114 @@
# -*- coding: utf-8 -*-
"""
Shared pytest fixtures for the extraction-pipeline tests.
scripts/ is not a package, so it is added to sys.path here. Synthetic fixtures
(PDF, docx, zip, HTML) are generated at runtime — no binary blobs in the repo.
"""
import sys
import zipfile
from pathlib import Path
import pytest
REPO_ROOT = Path(__file__).resolve().parent.parent
SCRIPTS_DIR = REPO_ROOT / "scripts"
if str(SCRIPTS_DIR) not in sys.path:
sys.path.insert(0, str(SCRIPTS_DIR))
# --------------------------------------------------------------------------
# synthetic PDF — deliberately large to pin the "no max_pages" regression
# --------------------------------------------------------------------------
@pytest.fixture
def big_pdf(tmp_path):
"""A 60-page PDF; each page carries a unique 'PDFMARK-<n>' token."""
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
path = tmp_path / "big.pdf"
c = canvas.Canvas(str(path), pagesize=letter)
for n in range(1, 61):
c.drawString(72, 720, f"PDFMARK-{n} synthetic activity page number {n}")
c.drawString(72, 700, "Acest joc educativ se joaca in echipa.")
c.showPage()
c.save()
return path
# --------------------------------------------------------------------------
# synthetic docx — 100 paragraphs => 3 synthetic pages at 40 paras/page
# --------------------------------------------------------------------------
@pytest.fixture
def sample_docx(tmp_path):
import docx
path = tmp_path / "sample.docx"
document = docx.Document()
for i in range(100):
document.add_paragraph(f"Paragraf {i}: continut joc team-building.")
document.save(str(path))
return path
# --------------------------------------------------------------------------
# synthetic HTML mirror page — with nav/script/footer chrome to strip
# --------------------------------------------------------------------------
HTML_WITH_NAV = """<!doctype html>
<html><head><title>Joc</title>
<style>.x{color:red}</style>
<script>var tracking = 1;</script>
</head><body>
<nav><a href="/">Home</a><a href="/games">Games</a></nav>
<header>Site Banner Junk</header>
<main>
<h1>Vanatoarea de comori</h1>
<p>Acesta este un joc real de orientare pentru cercetasi.</p>
<p>Jucatorii cauta indicii ascunse in tabara.</p>
</main>
<footer>Copyright 2024 - toate drepturile rezervate</footer>
</body></html>
"""
@pytest.fixture
def html_with_nav(tmp_path):
path = tmp_path / "page.html"
path.write_text(HTML_WITH_NAV, encoding="utf-8")
return path
# --------------------------------------------------------------------------
# synthetic zip — contains a docx and a stray junk file
# --------------------------------------------------------------------------
@pytest.fixture
def sample_zip(tmp_path, sample_docx):
path = tmp_path / "archive.zip"
with zipfile.ZipFile(path, "w") as zf:
zf.write(sample_docx, arcname="inner/sample.docx")
zf.writestr("desktop.ini", "junk")
return path
# --------------------------------------------------------------------------
# synthetic normalized source — paginated, with an activity straddling a
# page boundary so the chunker overlap can be verified.
# --------------------------------------------------------------------------
@pytest.fixture
def paginated_source(tmp_path):
"""A 50-page normalized source. An activity spans the page 20/21 boundary."""
lines = ["SOURCE: synthetic/test.pdf", "CONVERTED: 2026-05-19",
"FORMAT: pdf", "=" * 50, ""]
for n in range(1, 51):
lines.append(f"--- PAGE {n} ---")
if n == 20:
lines.append("ACTIVITY-START jocul podului care traverseaza pagina")
elif n == 21:
lines.append("continuare a jocului podului ACTIVITY-END")
else:
lines.append(f"continut obisnuit pe pagina {n}")
lines.append("")
path = tmp_path / "src_paginated.txt"
path.write_text("\n".join(lines), encoding="utf-8")
return path

3
tests/fixtures/.gitkeep vendored Normal file

@@ -0,0 +1,3 @@
# Test fixtures (synthetic PDF/docx/zip/HTML) are generated at runtime by
# tests/conftest.py — no binary blobs are committed. This file only preserves
# the directory in git.


@@ -0,0 +1,334 @@
# -*- coding: utf-8 -*-
"""
Tests for scripts/build_database.py — the import / dedup / swap side.
Covers: category -> slug + `altele` fallback; dedup across all three threshold
bands; EN != RO never merged; field combination on merge; atomic swap with a
simulated mid-build crash; the source_excerpt substring check.
"""
import json
import os
import sys
from pathlib import Path
import pytest
REPO_ROOT = Path(__file__).resolve().parent.parent
SCRIPTS_DIR = REPO_ROOT / "scripts"
for _p in (str(REPO_ROOT), str(SCRIPTS_DIR)):
if _p not in sys.path:
sys.path.insert(0, _p)
import build_database as bd # noqa: E402
from app.models.activity import Activity # noqa: E402
from app.models.database import DatabaseManager # noqa: E402
# --------------------------------------------------------------------------
# helpers
# --------------------------------------------------------------------------
def _activity(**over):
base = dict(
name="Jocul testului",
description="O activitate de echipa in aer liber.",
category="team-building",
content_type="joc",
language="ro",
extraction_confidence="high",
)
base.update(over)
return Activity(**base)
def _ext_activity(**over):
"""A schema-valid extraction-JSON activity object."""
base = dict(
name="Jocul testului",
description="O activitate de echipa in aer liber.",
category="team-building",
content_type="joc",
language="ro",
extraction_confidence="high",
source_excerpt="ANCHOR-EXCERPT despre jocul testului",
page_reference="page 1",
)
base.update(over)
return base
def _write_extraction(extracted_dir, chunk_key, activities, source_id="src01"):
extracted_dir.mkdir(parents=True, exist_ok=True)
payload = {
"header": {
"source_hash": "hash1234deadbeef",
"schema_version": "1.0",
"prompt_version": "1.0",
"chunk_range": "pages 1-20",
"source_id": source_id,
"chunk_key": chunk_key,
},
"activities": activities,
}
(extracted_dir / f"{chunk_key}.json").write_text(
json.dumps(payload, ensure_ascii=False), encoding="utf-8"
)
def _write_chunk(chunks_dir, source_id, chunk_key, text):
d = chunks_dir / source_id
d.mkdir(parents=True, exist_ok=True)
(d / f"{chunk_key}.txt").write_text(text, encoding="utf-8")
# --------------------------------------------------------------------------
# step 3 — category normalization
# --------------------------------------------------------------------------
def test_category_alias_mapped_to_slug():
act = bd.dict_to_activity(_ext_activity(category="teambuilding"), "s.txt")
assert act.category == "team-building"
def test_unknown_category_falls_back_to_altele():
act = bd.dict_to_activity(_ext_activity(category="zzz-not-a-category"), "s.txt")
assert act.category == "altele"
def test_content_type_normalized():
act = bd.dict_to_activity(_ext_activity(content_type="games"), "s.txt")
assert act.content_type == "joc"
# --------------------------------------------------------------------------
# step 4 — dedup, three bands
# --------------------------------------------------------------------------
def test_dedup_auto_merge_identical_descriptions():
""">= 85 similar -> a single merged row."""
a = _activity(description="copiii formeaza echipe si traverseaza terenul")
b = _activity(description="copiii formeaza echipe si traverseaza terenul")
out, stats = bd.dedup_activities([a, b])
assert len(out) == 1
assert stats["auto_merged"] == 1
assert out[0].needs_review == 0
def test_dedup_borderline_keeps_both_and_flags_needs_review():
"""60-85 similar -> both kept, both flagged needs_review."""
from rapidfuzz import fuzz
d1 = "alpha beta gamma delta epsilon"
d2 = "alpha beta gamma delta epsilon zeta eta theta iota"
score = fuzz.token_sort_ratio(d1, d2)
assert 60.0 <= score < 85.0, f"precondition: score={score} not borderline"
a = _activity(description=d1)
b = _activity(description=d2)
out, stats = bd.dedup_activities([a, b])
assert len(out) == 2
assert stats["borderline"] == 2
assert all(act.needs_review == 1 for act in out)
def test_dedup_low_similarity_kept_as_separate_variants():
"""< 60 similar -> separate variants, no needs_review."""
from rapidfuzz import fuzz
d1 = "alpha beta gamma delta epsilon"
d2 = "quebec romeo sierra tango uniform victor whiskey"
assert fuzz.token_sort_ratio(d1, d2) < 60.0
a = _activity(description=d1)
b = _activity(description=d2)
out, stats = bd.dedup_activities([a, b])
assert len(out) == 2
assert stats["auto_merged"] == 0
assert all(act.needs_review == 0 for act in out)
def test_dedup_never_merges_across_languages():
"""Same name + same description but EN vs RO -> two distinct rows."""
desc = "children form teams and cross the field"
ro = _activity(name="Cursa", description=desc, language="ro")
en = _activity(name="Cursa", description=desc, language="en")
out, stats = bd.dedup_activities([ro, en])
assert len(out) == 2
assert stats["auto_merged"] == 0
langs = {a.language for a in out}
assert langs == {"ro", "en"}
def test_merge_combines_fields():
"""On merge: longest description/rules, union materials, accumulated sources."""
desc = "copiii formeaza echipe si traverseaza terenul cu obstacole"
a = _activity(
description=desc,
rules="regula scurta",
materials_list="franghie, esarfa",
source_file="a.txt",
keywords="echipa",
)
b = _activity(
description=desc,
rules="o regula mult mai lunga si mai detaliata pentru joc",
materials_list="busola, esarfa",
source_file="b.txt",
keywords="cooperare",
)
out, _ = bd.dedup_activities([a, b])
assert len(out) == 1
merged = out[0]
assert merged.rules == "o regula mult mai lunga si mai detaliata pentru joc"
mats = set(m.strip() for m in merged.materials_list.split(","))
assert mats == {"franghie", "esarfa", "busola"}
assert set(merged.source_files) == {"a.txt", "b.txt"}
assert merged.popularity_score == 1
assert set(k.strip() for k in merged.keywords.split(",")) == {"echipa", "cooperare"}
# --------------------------------------------------------------------------
# step 5 — review decisions
# --------------------------------------------------------------------------
def test_review_decision_drop_removes_row():
from import_common import content_key, normalize_name
a = _activity(description="o descriere de test")
key = content_key(normalize_name(a.name), a.language, a.description)
kept, stats = bd.apply_review_decisions([a], {key: {"decision": "drop"}})
assert kept == []
assert stats["dropped"] == 1
def test_review_decision_keep_separate_clears_needs_review():
from import_common import content_key, normalize_name
a = _activity(description="o descriere de test")
a.needs_review = 1
key = content_key(normalize_name(a.name), a.language, a.description)
kept, stats = bd.apply_review_decisions([a], {key: {"decision": "keep-separate"}})
assert len(kept) == 1 and kept[0].needs_review == 0
assert stats["resolved"] == 1
# --------------------------------------------------------------------------
# step 2b — source_excerpt hallucination check
# --------------------------------------------------------------------------
def test_hallucinated_excerpt_activity_dropped(tmp_path):
extracted = tmp_path / "extracted"
chunks = tmp_path / "chunks"
sources = tmp_path / "sources"
good = _ext_activity(
name="Joc real", source_excerpt="textul real apare in bucata sursa"
)
bad = _ext_activity(
name="Joc inventat",
source_excerpt="acest citat nu exista nicaieri in sursa originala xyzzy",
)
_write_extraction(extracted, "src01.part01", [good, bad])
_write_chunk(
chunks, "src01", "src01.part01",
"--- PAGE 1 ---\ntextul real apare in bucata sursa pentru jocul real.\n",
)
from import_common import load_schema
schema = load_schema()
res = bd.collect_activities(extracted, chunks, sources, schema)
names = {a.name for a in res["activities"]}
assert names == {"Joc real"}
assert res["activities_hallucinated"] == 1
assert (extracted / "_rejected").exists()
def test_schema_invalid_file_moved_to_rejected(tmp_path):
extracted = tmp_path / "extracted"
chunks = tmp_path / "chunks"
sources = tmp_path / "sources"
extracted.mkdir(parents=True)
# missing required header keys + bad activity
(extracted / "bad.json").write_text(
json.dumps({"header": {}, "activities": [{"name": "x"}]}),
encoding="utf-8",
)
from import_common import load_schema
res = bd.collect_activities(extracted, chunks, sources, load_schema())
assert res["files_rejected_schema"] == 1
assert not (extracted / "bad.json").exists()
assert (extracted / "_rejected" / "bad.json").exists()
assert (extracted / "_rejected" / "bad.errors.txt").exists()
# --------------------------------------------------------------------------
# end-to-end rebuild + atomic swap
# --------------------------------------------------------------------------
def _setup_corpus(tmp_path):
extracted = tmp_path / "extracted"
chunks = tmp_path / "chunks"
sources = tmp_path / "sources"
excerpt = "jocul testului este o activitate de echipa"
_write_extraction(
extracted, "src01.part01",
[_ext_activity(source_excerpt=excerpt)],
)
_write_chunk(chunks, "src01", "src01.part01",
f"--- PAGE 1 ---\n{excerpt} in aer liber.\n")
return extracted, chunks, sources
def test_rebuild_creates_database(tmp_path):
extracted, chunks, sources = _setup_corpus(tmp_path)
db_path = tmp_path / "activities.db"
report = bd.rebuild(
extracted_dir=extracted, chunks_dir=chunks, sources_dir=sources,
db_path=db_path,
)
assert db_path.exists()
assert report["final_count"] == 1
db = DatabaseManager(str(db_path))
rows = db.search_activities()
assert len(rows) == 1
assert rows[0]["category"] == "team-building"
def test_atomic_swap_keeps_live_db_intact_on_crash(tmp_path, monkeypatch):
"""A mid-build crash must leave the live DB byte-identical."""
extracted, chunks, sources = _setup_corpus(tmp_path)
db_path = tmp_path / "activities.db"
# a pre-existing live DB with sentinel content
live = DatabaseManager(str(db_path))
live.insert_activity(_activity(name="Sentinel viu"))
before = db_path.read_bytes()
def boom(self, *a, **k):
raise RuntimeError("simulated mid-build crash")
monkeypatch.setattr(DatabaseManager, "bulk_insert_activities", boom)
with pytest.raises(RuntimeError, match="simulated mid-build crash"):
bd.rebuild(
extracted_dir=extracted, chunks_dir=chunks, sources_dir=sources,
db_path=db_path,
)
# live DB untouched, tmp cleaned up
assert db_path.read_bytes() == before
assert not (tmp_path / "activities.db.tmp").exists()
def test_rebuild_backs_up_live_db(tmp_path):
extracted, chunks, sources = _setup_corpus(tmp_path)
db_path = tmp_path / "activities.db"
DatabaseManager(str(db_path)).insert_activity(_activity(name="Vechi"))
report = bd.rebuild(
extracted_dir=extracted, chunks_dir=chunks, sources_dir=sources,
db_path=db_path,
)
assert report["backup"] is not None
assert Path(report["backup"]).exists()
assert os.path.basename(report["backup"]) == "activities.db.bak"
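
The swap tests above pin three behaviours: a mid-build crash leaves the live database byte-identical, the `.tmp` file is cleaned up, and a successful rebuild leaves an `activities.db.bak` backup. A minimal sketch of one way to obtain those guarantees is given below, assuming a hypothetical `build` callback that writes the new file. It illustrates the pattern only and is not the code in `build_database.py`.

```python
import os
import shutil
import tempfile
from pathlib import Path


def atomic_rebuild(db_path, build):
    """Build into <db>.tmp, back up the live file, then swap with os.replace."""
    db_path = Path(db_path)
    tmp_path = db_path.with_name(db_path.name + ".tmp")
    backup_path = db_path.with_name(db_path.name + ".bak")
    try:
        build(tmp_path)  # any exception here leaves db_path untouched
    except BaseException:
        tmp_path.unlink(missing_ok=True)  # clean up the partial build
        raise
    backup = None
    if db_path.exists():
        shutil.copy2(db_path, backup_path)
        backup = backup_path
    os.replace(tmp_path, db_path)  # atomic when both paths share a filesystem
    return backup


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        db = Path(tmp) / "activities.db"
        db.write_bytes(b"old")

        def crash(path):
            path.write_bytes(b"partial")
            raise RuntimeError("simulated mid-build crash")

        try:
            atomic_rebuild(db, crash)
        except RuntimeError:
            pass
        after_crash = (db.read_bytes(), (Path(tmp) / "activities.db.tmp").exists())

        backup = atomic_rebuild(db, lambda path: path.write_bytes(b"new"))
        return after_crash, db.read_bytes(), backup.name, backup.read_bytes()
```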

183
tests/test_chunk_sources.py Normal file

@@ -0,0 +1,183 @@
# -*- coding: utf-8 -*-
"""Tests for scripts/chunk_sources.py."""
import json
import chunk_sources as cs
import normalize_sources as ns
def _pages(n):
return [(i, f"text-{i}") for i in range(1, n + 1)]
# --------------------------------------------------------------------------
# header parsing
# --------------------------------------------------------------------------
def test_parse_source_splits_header_and_body(paginated_source):
text = paginated_source.read_text(encoding="utf-8")
header, body = cs.parse_source(text)
assert header["FORMAT"] == "pdf"
assert body.lstrip().startswith("--- PAGE 1 ---")
# --------------------------------------------------------------------------
# page chunking
# --------------------------------------------------------------------------
def test_chunk_pages_basic_split():
chunks = cs.chunk_pages(_pages(50), pages_per_chunk=20, overlap=4)
# stride 16: starts at pages 1, 17, 33, ...
assert chunks[0]["page_start"] == 1 and chunks[0]["page_end"] == 20
assert chunks[1]["page_start"] == 17
assert chunks[-1]["page_end"] == 50
def test_chunk_pages_have_overlap():
chunks = cs.chunk_pages(_pages(50), pages_per_chunk=20, overlap=4)
overlap = chunks[0]["page_end"] - chunks[1]["page_start"] + 1
assert overlap == 4
def test_chunk_pages_short_document_single_chunk():
chunks = cs.chunk_pages(_pages(8), pages_per_chunk=20, overlap=4)
assert len(chunks) == 1
assert chunks[0]["page_start"] == 1 and chunks[0]["page_end"] == 8
def test_chunk_pages_empty():
assert cs.chunk_pages([]) == []
def test_activity_at_page_boundary_intact_in_one_chunk(paginated_source):
"""An activity straddling the page 20/21 boundary must appear whole in >=1 chunk."""
text = paginated_source.read_text(encoding="utf-8")
chunks = cs.make_chunks(text)
full = [
c for c in chunks
if "ACTIVITY-START" in c["text"] and "ACTIVITY-END" in c["text"]
]
assert full, "activity spanning a page boundary was split across all chunks"
# --------------------------------------------------------------------------
# word-window chunking for unpaginated text
# --------------------------------------------------------------------------
def test_chunk_words_window_and_overlap():
text = " ".join(f"w{i}" for i in range(25_000))
chunks = cs.chunk_words(text, window=10_000, overlap=2_000)
assert len(chunks) == 3 # stride 8000 over 25000 words
first = chunks[0]["text"].split()
second = chunks[1]["text"].split()
assert first[8_000:10_000] == second[0:2_000] # 2000-word overlap
def test_make_chunks_unpaginated_uses_word_windows():
body = "cuvant " * 15_000
text = "SOURCE: x\nFORMAT: txt\n" + "=" * 50 + "\n\n" + body
chunks = cs.make_chunks(text)
assert len(chunks) >= 2
assert chunks[0]["chunk_range"].startswith("words")
# --------------------------------------------------------------------------
# stable source ids — anti-collision
# --------------------------------------------------------------------------
def test_stable_id_same_stem_different_path_no_collision():
a = ns.stable_id("camp/games/scout.pdf")
b = ns.stable_id("school/lessons/scout.pdf")
assert a != b
assert a.endswith("_scout") and b.endswith("_scout")
def test_stable_id_deterministic():
assert ns.stable_id("a/b/c.pdf") == ns.stable_id("a/b/c.pdf")
# --------------------------------------------------------------------------
# manifest registry + idempotency
# --------------------------------------------------------------------------
def test_run_writes_chunks_and_manifest(paginated_source, tmp_path):
sources_dir = tmp_path / "sources"
sources_dir.mkdir()
(sources_dir / paginated_source.name).write_text(
paginated_source.read_text(encoding="utf-8"), encoding="utf-8"
)
chunks_dir = tmp_path / "chunks"
summary = cs.run(sources_dir, chunks_dir)
assert summary["sources"] == 1
assert summary["chunks"] >= 2
manifest = json.loads((chunks_dir / "manifest.json").read_text())
assert manifest["chunks"]
for key, meta in manifest["chunks"].items():
assert meta["state"] == "pending"
assert meta["expected_json"] == f"{key}.json"
assert (chunks_dir.parent / meta["chunk_file"]).exists()
def test_manifest_idempotent_preserves_state(paginated_source, tmp_path):
sources_dir = tmp_path / "sources"
sources_dir.mkdir()
(sources_dir / paginated_source.name).write_text(
paginated_source.read_text(encoding="utf-8"), encoding="utf-8"
)
chunks_dir = tmp_path / "chunks"
manifest_path = chunks_dir / "manifest.json"
cs.run(sources_dir, chunks_dir)
# orchestrator marks one chunk done
manifest = json.loads(manifest_path.read_text())
first_key = next(iter(manifest["chunks"]))
n_before = len(manifest["chunks"])
manifest["chunks"][first_key]["state"] = "done"
manifest_path.write_text(json.dumps(manifest), encoding="utf-8")
# re-run: 'done' must survive, no chunk added or lost
cs.run(sources_dir, chunks_dir)
manifest2 = json.loads(manifest_path.read_text())
assert len(manifest2["chunks"]) == n_before
assert manifest2["chunks"][first_key]["state"] == "done"
assert all(
m["state"] in ("pending", "done") for m in manifest2["chunks"].values()
)

def test_manifest_resets_state_when_source_changes(paginated_source, tmp_path):
    sources_dir = tmp_path / "sources"
    sources_dir.mkdir()
    src = sources_dir / paginated_source.name
    src.write_text(paginated_source.read_text(encoding="utf-8"), encoding="utf-8")
    chunks_dir = tmp_path / "chunks"
    manifest_path = chunks_dir / "manifest.json"
    cs.run(sources_dir, chunks_dir)

    manifest = json.loads(manifest_path.read_text())
    first_key = next(iter(manifest["chunks"]))
    manifest["chunks"][first_key]["state"] = "done"
    manifest_path.write_text(json.dumps(manifest), encoding="utf-8")

    # mutate the source content -> hash changes -> state resets
    src.write_text(src.read_text(encoding="utf-8") + "\n--- PAGE 51 ---\nextra\n",
                   encoding="utf-8")
    cs.run(sources_dir, chunks_dir)
    manifest2 = json.loads(manifest_path.read_text())
    assert manifest2["chunks"][first_key]["state"] == "pending"


def test_prune_stale_removes_orphan_entries(paginated_source, tmp_path):
    sources_dir = tmp_path / "sources"
    sources_dir.mkdir()
    src = sources_dir / paginated_source.name
    src.write_text(paginated_source.read_text(encoding="utf-8"), encoding="utf-8")
    chunks_dir = tmp_path / "chunks"
    cs.run(sources_dir, chunks_dir)

    # delete the source -> its chunks become stale
    src.unlink()
    summary = cs.run(sources_dir, chunks_dir)
    assert summary["chunks"] == 0
    assert summary["pruned"] >= 1
    manifest = json.loads((chunks_dir / "manifest.json").read_text())
    assert manifest["chunks"] == {}
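
The three manifest tests above pin down one reconciliation rule: an unchanged chunk keeps its recorded state, a chunk whose source hash changed goes back to "pending", and entries with no surviving chunk are pruned. A minimal sketch of that rule follows. The function name `reconcile_manifest`, its arguments, and the per-entry `"hash"` field are assumptions made for illustration; the real chunking script may store and compare hashes differently.

```python
def reconcile_manifest(old_chunks, current_hashes):
    """Hypothetical helper: merge the previous manifest with a fresh run.

    old_chunks:     {chunk_key: {"state": ..., "hash": ...}} from the last run
    current_hashes: {chunk_key: hash of the source the chunk was cut from}
    Returns (new_chunks, pruned_count).
    """
    new_chunks = {}
    for key, digest in current_hashes.items():
        previous = old_chunks.get(key)
        if previous is not None and previous.get("hash") == digest:
            # Unchanged source: keep whatever state the orchestrator recorded.
            state = previous.get("state", "pending")
        else:
            # New chunk, or its source changed: it must be processed again.
            state = "pending"
        new_chunks[key] = {"state": state, "hash": digest}

    # Entries whose chunk no longer exists are stale and are dropped.
    pruned = sum(1 for key in old_chunks if key not in current_hashes)
    return new_chunks, pruned
```

Keeping the state only when the hash matches is what makes a re-run idempotent: running twice over untouched sources changes nothing, while any edit to a source invalidates the work recorded for it.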

tests/test_enrichment.py Normal file

@@ -0,0 +1,231 @@
"""
Tests for the enrichment overlay (plan Part B) and the new filter axes /
bilingual display helpers (plan Part A).

Covers:
  * config_taxonomy.normalize_indoor_outdoor / normalize_space_needed
  * build_database.apply_enrichment keying, field application, estimated tally
  * DatabaseManager indoor_outdoor / space_needed equality filters
  * FTS5 indexing of the *_ro columns
  * Activity bilingual display helpers
"""
import os
import sys

import pytest

PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)
SCRIPTS = os.path.join(PROJECT_ROOT, "scripts")
if SCRIPTS not in sys.path:
    sys.path.insert(0, SCRIPTS)

from app.models.activity import Activity  # noqa: E402
from app.models.database import DatabaseManager  # noqa: E402
from app.config_taxonomy import (  # noqa: E402
    normalize_indoor_outdoor,
    normalize_space_needed,
)
from import_common import content_key, normalize_name  # noqa: E402
from build_database import apply_enrichment  # noqa: E402


# --------------------------------------------------------------------------
# taxonomy normalizers
# --------------------------------------------------------------------------
@pytest.mark.parametrize("raw,expected", [
    ("indoor", "indoor"),
    ("Outdoor", "outdoor"),
    ("either", "either"),
    ("interior", "indoor"),
    ("aer liber", "outdoor"),
    ("both", "either"),
    ("", None),
    ("nonsense", None),
    (None, None),
])
def test_normalize_indoor_outdoor(raw, expected):
    assert normalize_indoor_outdoor(raw) == expected


@pytest.mark.parametrize("raw,expected", [
    ("mic", "mic"),
    ("MEDIU", "mediu"),
    ("mare", "mare"),
    ("small", "mic"),
    ("large", "mare"),
    ("", None),
    ("huge", None),
    (None, None),
])
def test_normalize_space_needed(raw, expected):
    assert normalize_space_needed(raw) == expected


# --------------------------------------------------------------------------
# apply_enrichment
# --------------------------------------------------------------------------
def _activity(name="Joc de test", description="O descriere de test.", language="ro"):
    return Activity(
        name=name, description=description, category="team-building",
        content_type="joc", source_file="t.txt", language=language,
    )


def _key_for(act: Activity) -> str:
    return content_key(
        act.normalized_name or normalize_name(act.name),
        act.language,
        act.description or "",
    )


def test_apply_enrichment_matches_and_applies_fields():
    act = _activity()
    key = _key_for(act)
    enrichment = {
        key: {
            "name_ro": "Joc de test (RO)",
            "description_ro": "Descriere îmbogățită în română.",
            "indoor_outdoor": "outdoor",
            "space_needed": "mediu",
            "participants_min": 4,
            "participants_max": 12,
            "estimated_fields": ["space_needed", "participants_min", "participants_max"],
        }
    }
    stats = apply_enrichment([act], enrichment)

    assert act.name_ro == "Joc de test (RO)"
    assert act.description_ro == "Descriere îmbogățită în română."
    assert act.indoor_outdoor == "outdoor"
    assert act.space_needed == "mediu"
    assert act.participants_min == 4 and act.participants_max == 12
    assert set(act.estimated_fields) == {"space_needed", "participants_min", "participants_max"}

    assert stats["entries"] == 1
    assert stats["matched"] == 1
    assert stats["orphaned"] == 0
    # indoor_outdoor stated, space_needed estimated
    assert stats["fields_stated"].get("indoor_outdoor") == 1
    assert stats["fields_estimated"].get("space_needed") == 1


def test_apply_enrichment_orphan_entry_counted():
    act = _activity()
    enrichment = {"deadbeef" * 5: {"name_ro": "nu se potrivește"}}
    stats = apply_enrichment([act], enrichment)
    assert stats["matched"] == 0
    assert stats["orphaned"] == 1
    assert act.name_ro is None  # untouched


def test_apply_enrichment_absent_fields_leave_value_untouched():
    act = _activity()
    act.participants_min = 5
    key = _key_for(act)
    # entry only translates name; participants must be preserved
    apply_enrichment([act], {key: {"name_ro": "Tradus"}})
    assert act.participants_min == 5
    assert act.name_ro == "Tradus"


def test_apply_enrichment_drops_unrecognised_enum():
    act = _activity()
    key = _key_for(act)
    apply_enrichment([act], {key: {"indoor_outdoor": "spaceship", "space_needed": "mic"}})
    assert act.indoor_outdoor is None  # unrecognised → dropped
    assert act.space_needed == "mic"


# --------------------------------------------------------------------------
# DB equality filters + FTS on *_ro
# --------------------------------------------------------------------------
@pytest.fixture
def db(tmp_path):
    return DatabaseManager(str(tmp_path / "enrich.db"))


def _insert(db, **overrides):
    base = dict(
        name="Activitate", description="desc", category="camp-outdoor",
        content_type="joc", source_file="t.txt", language="ro",
    )
    base.update(overrides)
    return db.insert_activity(Activity(**base))


def test_indoor_outdoor_equality_filter(db):
    _insert(db, name="In casa", indoor_outdoor="indoor")
    _insert(db, name="Afara", indoor_outdoor="outdoor")
    res = db.search_activities(indoor_outdoor="outdoor")
    assert len(res) == 1
    assert res[0]["name"] == "Afara"


def test_space_needed_equality_filter(db):
    _insert(db, name="Mic", space_needed="mic")
    _insert(db, name="Mare", space_needed="mare")
    res = db.search_activities(space_needed="mare")
    assert len(res) == 1
    assert res[0]["name"] == "Mare"


def test_fts_indexes_name_ro(db):
    _insert(db, name="Treasure Hunt", name_ro="Vânătoarea de comori")
    # term only present in the Romanian twin
    res = db.search_activities(search_text="comori")
    assert len(res) == 1
    assert res[0]["name"] == "Treasure Hunt"


def test_fts_indexes_description_ro(db):
    _insert(db, name="Game", description="english desc",
            description_ro="o activitate de cooperare")
    res = db.search_activities(search_text="cooperare")
    assert len(res) == 1


def test_ro_columns_round_trip(db):
    aid = _insert(
        db, name="X", name_ro="X-ro", description_ro="d-ro",
        rules_ro="r-ro", variations_ro="v-ro",
        indoor_outdoor="either", space_needed="mediu",
        estimated_fields=["duration_min"], source_id="src1",
        source_ids=["src1", "src2"], chunk_key="src1.part01",
    )
    row = db.get_activity_by_id(aid)
    loaded = Activity.from_dict(row)
    assert loaded.name_ro == "X-ro"
    assert loaded.indoor_outdoor == "either"
    assert loaded.space_needed == "mediu"
    assert loaded.estimated_fields == ["duration_min"]
    assert loaded.source_ids == ["src1", "src2"]
    assert loaded.chunk_key == "src1.part01"


# --------------------------------------------------------------------------
# display helpers
# --------------------------------------------------------------------------
def test_display_helpers_prefer_ro_with_fallback():
    act = _activity(name="Original", description="Original desc")
    assert act.get_display_name() == "Original"  # no translation yet
    assert act.get_display_description() == "Original desc"
    act.name_ro = "Tradus"
    act.description_ro = "Descriere tradusă"
    assert act.get_display_name() == "Tradus"
    assert act.get_display_description() == "Descriere tradusă"
    assert act.has_translation() is True


def test_is_estimated_and_axis_displays():
    act = _activity()
    act.indoor_outdoor = "outdoor"
    act.space_needed = "mare"
    act.estimated_fields = ["space_needed"]
    assert act.get_indoor_outdoor_display() == "Exterior"
    assert act.get_space_needed_display() == "Spațiu mare"
    assert act.is_estimated("space_needed") is True
    assert act.is_estimated("indoor_outdoor") is False
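
The `apply_enrichment` tests above describe an overlay with three properties: fields absent from an entry leave the activity untouched, enum fields with an unrecognised value are dropped, and each applied field is tallied as either stated or estimated. The sketch below illustrates that behaviour on plain dicts. The name `overlay`, the `ALLOWED` table, and the return shape are assumptions for illustration only; the real `apply_enrichment` works on `Activity` objects keyed by `content_key` and also normalizes synonyms.

```python
from collections import Counter

# Assumed enum values, taken from the parametrized normalizer tests.
ALLOWED = {
    "indoor_outdoor": {"indoor", "outdoor", "either"},
    "space_needed": {"mic", "mediu", "mare"},
}


def overlay(record, entry):
    """Hypothetical overlay step: copy enrichment fields onto one record.

    Returns (stated, estimated) Counters keyed by field name.
    """
    estimated_names = set(entry.get("estimated_fields", []))
    stated, estimated = Counter(), Counter()
    for field, value in entry.items():
        if field == "estimated_fields":
            continue
        if field in ALLOWED and value not in ALLOWED[field]:
            continue  # unrecognised enum value: skip it, keep the old value
        record[field] = value
        tally = estimated if field in estimated_names else stated
        tally[field] += 1
    # Only fields that were actually applied are recorded as estimated.
    record["estimated_fields"] = sorted(estimated)
    return stated, estimated
```

Iterating over the entry rather than over a fixed field list is what gives the "absent means untouched" behaviour for free.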


@@ -0,0 +1,177 @@
# -*- coding: utf-8 -*-
"""Tests for scripts/extract_common.py."""
import shutil
import zipfile

import pytest

import extract_common as ec


# --------------------------------------------------------------------------
# format detection
# --------------------------------------------------------------------------
def test_detect_format():
    assert ec.detect_format("a/b/file.PDF") == "pdf"
    assert ec.detect_format("x.docx") == "docx"
    assert ec.detect_format("x.doc") == "doc"
    assert ec.detect_format("x.pptx") == "pptx"
    assert ec.detect_format("x.html") == "html"
    assert ec.detect_format("x.zip") == "zip"
    assert ec.detect_format("x.epub") == "epub"
    assert ec.detect_format("x.xyz") == "unknown"


def test_is_junk():
    assert ec.is_junk("some/desktop.ini")
    assert ec.is_junk("notes.bak")
    assert ec.is_junk("README.md")
    assert not ec.is_junk("1000 Scout Games.pdf")


# --------------------------------------------------------------------------
# PDF: the critical "no max_pages" regression
# --------------------------------------------------------------------------
def test_pdf_extracts_all_60_pages(big_pdf):
    body = ec.extract_pdf(big_pdf)
    # the old converter capped at 50 pages; page 60 must be present now
    assert "--- PAGE 60 ---" in body
    assert "PDFMARK-60" in body
    assert ec.count_page_markers(body) == 60


def test_pdf_does_not_truncate_mid_document(big_pdf):
    body = ec.extract_pdf(big_pdf)
    pages = ec.split_pages(body)
    assert pages[-1][0] == 60  # last marker is the real last page


# --------------------------------------------------------------------------
# page join / split round-trip
# --------------------------------------------------------------------------
def test_join_split_round_trip():
    body = ec.join_pages(["alpha", "beta", "gamma"])
    pages = ec.split_pages(body)
    assert [n for n, _ in pages] == [1, 2, 3]
    assert [t for _, t in pages] == ["alpha", "beta", "gamma"]


def test_split_pages_no_markers_returns_empty():
    assert ec.split_pages("plain text with no markers") == []


# --------------------------------------------------------------------------
# docx: synthetic page markers
# --------------------------------------------------------------------------
def test_docx_synthetic_page_markers(sample_docx):
    body = ec.extract_docx(sample_docx)
    # 100 paragraphs / 40 per page => 3 pages
    assert ec.count_page_markers(body) == 3
    assert "Paragraf 99" in body


# --------------------------------------------------------------------------
# HTML mirror: nav/script/footer stripped
# --------------------------------------------------------------------------
def test_html_strips_chrome(html_with_nav):
    body = ec.extract_html(html_with_nav)
    assert "Vanatoarea de comori" in body
    assert "joc real de orientare" in body
    # chrome must be gone
    assert "tracking" not in body
    assert "Site Banner Junk" not in body
    assert "toate drepturile rezervate" not in body
    assert "Games" not in body


# --------------------------------------------------------------------------
# content hash + near-duplicate elimination
# --------------------------------------------------------------------------
def test_content_hash_ignores_whitespace():
    assert ec.content_hash("hello world") == ec.content_hash("hello world\n")
    assert ec.content_hash("hello world") != ec.content_hash("goodbye world")


def test_dedupe_exact_duplicates():
    items = [("a", "joc identic"), ("b", "joc identic"), ("c", "alt joc")]
    kept = ec.dedupe_texts(items)
    assert [k for k, _ in kept] == ["a", "c"]


def test_dedupe_near_duplicates():
    base = "Vanatoarea de comori este un joc de orientare pentru cercetasi in tabara."
    # base plus a short suffix: expected to score above the 85.0 threshold
    near = base + " Pagina printata."
    items = [("orig", base), ("print", near), ("other", "Cu totul alt continut diferit aici.")]
    kept = ec.dedupe_texts(items, threshold=85.0)
    keys = [k for k, _ in kept]
    assert "orig" in keys
    assert "print" not in keys
    assert "other" in keys


# --------------------------------------------------------------------------
# zip recursion
# --------------------------------------------------------------------------
def test_zip_recurses_into_inner_files(sample_zip):
    body = ec.extract_zip(sample_zip)
    assert "Paragraf 0" in body
    assert ec.count_page_markers(body) > 0


def test_zip_bad_archive_returns_empty(tmp_path):
    bad = tmp_path / "broken.zip"
    bad.write_text("not a zip", encoding="utf-8")
    assert ec.extract_zip(bad) == ""


def test_nested_zip(tmp_path, sample_zip):
    outer = tmp_path / "outer.zip"
    with zipfile.ZipFile(outer, "w") as zf:
        zf.write(sample_zip, arcname="nested/archive.zip")
    body = ec.extract_zip(outer)
    assert "Paragraf 0" in body


# --------------------------------------------------------------------------
# preflight
# --------------------------------------------------------------------------
def test_preflight_python_packages_present():
    report = ec.preflight()
    # all required packages are installed in the test environment
    assert report["missing_python"] == []


def test_preflight_reports_libreoffice_state():
    report = ec.preflight()
    has_lo = bool(shutil.which("libreoffice") or shutil.which("soffice"))
    if has_lo:
        assert all("libreoffice" not in w for w in report["warnings"])
    else:
        assert any("libreoffice" in w for w in report["warnings"])


def test_preflight_ocr_flag():
    report = ec.preflight(check_ocr=True)
    if not shutil.which("tesseract"):
        assert any("tesseract" in m for m in report["missing_system"])


# --------------------------------------------------------------------------
# legacy .doc: skipped unless libreoffice is installed
# --------------------------------------------------------------------------
@pytest.mark.skipif(
    not (shutil.which("libreoffice") or shutil.which("soffice")),
    reason="libreoffice not installed",
)
def test_doc_conversion(tmp_path, sample_docx):
    doc_path = tmp_path / "legacy.doc"
    # The fixture is really a .docx under a .doc name, so this only
    # smoke-tests the convert-then-extract path, not genuine legacy parsing.
    shutil.copy(sample_docx, doc_path)
    body = ec.extract_doc(doc_path)
    assert ec.count_page_markers(body) >= 1


def test_doc_without_libreoffice_raises(tmp_path, monkeypatch):
    monkeypatch.setattr(ec.shutil, "which", lambda _: None)
    with pytest.raises(RuntimeError):
        ec.extract_doc(tmp_path / "whatever.doc")
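
The join/split round-trip and the page-count assertions above all rest on a single `--- PAGE n ---` marker convention. A minimal sketch of how such a pair of helpers could work is shown here. The function names mirror the ones under test, but the bodies are an assumption written for illustration, not the code in `scripts/extract_common.py`.

```python
import re

# One marker per line, e.g. "--- PAGE 12 ---".
PAGE_MARKER = re.compile(r"^--- PAGE (\d+) ---$", re.MULTILINE)


def join_pages(pages):
    """Prefix each page text with a 1-based marker line."""
    return "\n".join(
        f"--- PAGE {number} ---\n{text}" for number, text in enumerate(pages, 1)
    )


def split_pages(body):
    """Return [(page_number, text), ...]; empty when the body has no markers."""
    markers = list(PAGE_MARKER.finditer(body))
    pages = []
    for index, match in enumerate(markers):
        # A page runs from the end of its marker to the start of the next one.
        end = markers[index + 1].start() if index + 1 < len(markers) else len(body)
        pages.append((int(match.group(1)), body[match.end():end].strip("\n")))
    return pages


def count_page_markers(body):
    """Number of marker lines in the body."""
    return len(PAGE_MARKER.findall(body))
```

Because the page number is read from the marker itself rather than from its position, a truncated document shows up as a wrong last page number, which is what the 60-page regression test checks.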

tests/test_fts.py Normal file

@@ -0,0 +1,139 @@
"""
Integration tests for the FTS5 search index.

Confirms that materials_list and skills_developed are indexed by FTS5 and kept
in sync by the insert / update / delete triggers (plan §6, §7).
"""
import os
import sys
import json

import pytest

# Make the project root importable when pytest is run from anywhere.
PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

from app.models.activity import Activity  # noqa: E402
from app.models.database import DatabaseManager  # noqa: E402


@pytest.fixture
def db(tmp_path):
    """A fresh DatabaseManager backed by a temporary SQLite file."""
    return DatabaseManager(str(tmp_path / "test_activities.db"))


def _make_activity(**overrides):
    base = dict(
        name="Vânătoarea de comori",
        description="O activitate de echipă în aer liber.",
        category="camp-outdoor",
        content_type="joc",
        source_file="test.txt",
        language="ro",
    )
    base.update(overrides)
    return Activity(**base)


def test_search_by_materials_list(db):
    """A term that only appears in materials_list returns the activity."""
    activity = _make_activity(materials_list="frânghie, eșarfă, busolă")
    db.insert_activity(activity)
    results = db.search_activities(search_text="busolă")
    assert len(results) == 1
    assert results[0]["name"] == "Vânătoarea de comori"


def test_search_by_skills_developed(db):
    """A term that only appears in skills_developed returns the activity."""
    activity = _make_activity(skills_developed="comunicare, leadership, rabdare")
    db.insert_activity(activity)
    results = db.search_activities(search_text="leadership")
    assert len(results) == 1
    assert results[0]["name"] == "Vânătoarea de comori"


def test_term_absent_from_indexed_columns_no_hit(db):
    """A term present in no indexed column yields no hit (control)."""
    db.insert_activity(_make_activity(materials_list="frânghie"))
    assert db.search_activities(search_text="zzzunlikelyterm") == []


def test_delete_trigger_removes_from_fts(db):
    """Deleting an activity removes it from the FTS index (delete trigger)."""
    activity = _make_activity(materials_list="catalige")
    activity_id = db.insert_activity(activity)
    assert len(db.search_activities(search_text="catalige")) == 1

    with db._get_connection() as conn:
        conn.execute("DELETE FROM activities WHERE id = ?", (activity_id,))
        conn.commit()

    assert db.search_activities(search_text="catalige") == []


def test_update_trigger_resyncs_fts(db):
    """Updating materials_list re-syncs the FTS index (update trigger)."""
    activity = _make_activity(materials_list="creioane")
    activity_id = db.insert_activity(activity)
    assert len(db.search_activities(search_text="creioane")) == 1

    with db._get_connection() as conn:
        conn.execute(
            "UPDATE activities SET materials_list = ? WHERE id = ?",
            ("acuarele", activity_id),
        )
        conn.commit()

    # Old term gone, new term found.
    assert db.search_activities(search_text="creioane") == []
    assert len(db.search_activities(search_text="acuarele")) == 1


def test_rebuild_fts_index(db):
    """rebuild_fts_index keeps materials_list / skills_developed searchable."""
    db.insert_activity(_make_activity(skills_developed="orientare"))
    db.rebuild_fts_index()
    assert len(db.search_activities(search_text="orientare")) == 1


def test_new_schema_columns_round_trip(db):
    """New activity columns persist and load back via from_dict."""
    activity = _make_activity(
        source_files=["a.txt", "b.txt"],
        source_excerpt="Citat scurt din sursă.",
        extraction_confidence="high",
        needs_review=1,
        normalized_name="vanatoarea de comori",
    )
    activity_id = db.insert_activity(activity)
    row = db.get_activity_by_id(activity_id)

    assert row["content_type"] == "joc"
    assert row["language"] == "ro"
    assert row["extraction_confidence"] == "high"
    assert row["needs_review"] == 1
    assert row["normalized_name"] == "vanatoarea de comori"
    assert json.loads(row["source_files"]) == ["a.txt", "b.txt"]
    assert row["source_excerpt"] == "Citat scurt din sursă."

    loaded = Activity.from_dict(row)
    assert loaded.source_files == ["a.txt", "b.txt"]
    assert loaded.content_type == "joc"


def test_normalized_name_auto_derived(db):
    """normalized_name is auto-derived from name when not provided."""
    activity = Activity(
        name="Ștafetă cu Obstacole",
        description="desc",
        category="sports-active",
        source_file="t.txt",
    )
    assert activity.normalized_name == "stafeta cu obstacole"
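
The last test expects "Ștafetă cu Obstacole" to normalize to "stafeta cu obstacole", which implies lowercasing plus diacritic stripping. One common way to get that result with the standard library is Unicode decomposition followed by removal of combining marks. The sketch below assumes that approach; the project's own `normalize_name` may apply further rules (punctuation, stop words) that are not visible in these tests.

```python
import unicodedata


def normalize_name(name):
    """Hypothetical normalizer: strip diacritics, lowercase, collapse spaces."""
    # NFKD splits e.g. "Ș" into "S" plus a combining comma below.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # split()/join collapses runs of whitespace and trims both ends.
    return " ".join(without_marks.lower().split())
```

A key built this way lets "Vânătoarea de comori" and "Vanatoarea de comori" collide on purpose, which is useful when the same activity appears in sources with and without diacritics.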

tests/test_search.py Normal file

@@ -0,0 +1,140 @@
# -*- coding: utf-8 -*-
"""
CRITICAL REGRESSION TEST (plan §6, §7).

`search.py` changed the result sets of /search and /api/search: the default
search now EXCLUDES the non-game content types (rețetă / cântec / ceremonie),
which surface only when the user explicitly filters that content_type or picks
a non-game category. This test guards that behaviour.
"""
import pytest

from app.models.activity import Activity
from app.models.database import DatabaseManager
from app.services.search import SearchService
from app.config_taxonomy import NON_GAME_CONTENT_TYPES


# --------------------------------------------------------------------------
# fixtures
# --------------------------------------------------------------------------
def _activity(name, content_type, category="altele", language="ro"):
    return Activity(
        name=name,
        description=f"Descriere pentru {name}, un conținut de tip {content_type}.",
        category=category,
        content_type=content_type,
        language=language,
        source_file="test/fixture.txt",
    )


@pytest.fixture
def search_service(tmp_path):
    """A SearchService over a temp DB seeded with one row per content_type."""
    db = DatabaseManager(str(tmp_path / "activities.db"))
    db.clear_database()
    db.bulk_insert_activities([
        _activity("Vanatoarea de comori", "joc", category="wide-games"),
        _activity("Cercul de cunoastere", "activitate", category="icebreakers"),
        _activity("Reteta de paine la ceaun", "reteta", category="retete"),
        _activity("Cantecul de tabara", "cantec", category="cantece-ceremonii"),
        _activity("Ceremonia de inchidere", "ceremonie", category="cantece-ceremonii"),
        _activity("Game in English", "joc", category="wide-games", language="en"),
    ])
    return SearchService(db)


def _content_types(results):
    return {r.get("content_type") for r in results}


# --------------------------------------------------------------------------
# the regression: default search excludes non-game content types
# --------------------------------------------------------------------------
def test_default_search_excludes_non_game_content(search_service):
    """No filters → rețete / cântece / ceremonii must NOT appear."""
    results = search_service.search_activities()
    types = _content_types(results)
    assert types, "default search returned nothing"
    for non_game in NON_GAME_CONTENT_TYPES:
        assert non_game not in types, (
            f"default search leaked non-game content_type '{non_game}'"
        )
    # game content is still present
    assert "joc" in types
    assert "activitate" in types


def test_default_search_with_text_excludes_non_game(search_service):
    """A text query still excludes non-game content by default."""
    results = search_service.search_activities(search_text="conținut")
    # Check every non-game type, not only the first; this also works when
    # NON_GAME_CONTENT_TYPES is a set rather than an indexable sequence.
    assert not (set(NON_GAME_CONTENT_TYPES) & _content_types(results))


# --------------------------------------------------------------------------
# explicit content_type filter INCLUDES the non-game rows
# --------------------------------------------------------------------------
def test_explicit_content_type_filter_includes_non_game(search_service):
    """Filtering content_type=reteta returns exactly the rețete."""
    results = search_service.search_activities(filters={"content_type": "reteta"})
    types = _content_types(results)
    assert types == {"reteta"}, f"expected only rețete, got {types}"
    assert len(results) == 1


def test_explicit_content_type_filter_for_cantec(search_service):
    results = search_service.search_activities(filters={"content_type": "cantec"})
    assert _content_types(results) == {"cantec"}


# --------------------------------------------------------------------------
# a non-game CATEGORY filter also lifts the exclusion
# --------------------------------------------------------------------------
def test_non_game_category_filter_includes_non_game(search_service):
    """Picking category=cantece-ceremonii surfaces cântece + ceremonii."""
    results = search_service.search_activities(
        filters={"category": "cantece-ceremonii"})
    types = _content_types(results)
    assert "cantec" in types
    assert "ceremonie" in types


def test_game_category_filter_still_excludes_non_game(search_service):
    """A normal (game) category filter keeps the non-game exclusion."""
    results = search_service.search_activities(filters={"category": "wide-games"})
    types = _content_types(results)
    for non_game in NON_GAME_CONTENT_TYPES:
        assert non_game not in types


# --------------------------------------------------------------------------
# language filter
# --------------------------------------------------------------------------
def test_language_filter_ro(search_service):
    results = search_service.search_activities(filters={"language": "ro"})
    assert results
    assert all(r.get("language") == "ro" for r in results)


def test_language_filter_en(search_service):
    results = search_service.search_activities(filters={"language": "en"})
    assert results
    assert all(r.get("language") == "en" for r in results)
    assert {r.get("name") for r in results} == {"Game in English"}


# --------------------------------------------------------------------------
# get_filter_options surfaces the new axes
# --------------------------------------------------------------------------
def test_filter_options_include_content_type_and_language(search_service):
    """The dynamic-filter mechanism now exposes content_type + language."""
    options = search_service.db.get_filter_options()
    assert "content_type" in options
    assert "language" in options
    assert "joc" in options["content_type"]
    assert set(options["language"]) == {"ro", "en"}
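
The behaviour these search tests guard can be summarised as: non-game rows are hidden by default, and the exclusion is lifted only by an explicit `content_type` filter or by a non-game category. A minimal in-memory sketch of that rule is given below. The constants `NON_GAME` and `NON_GAME_CATEGORIES` and the function `visible` are assumptions drawn from the fixture values; the real `SearchService` builds SQL and may decide "non-game category" differently.

```python
# Assumed values, inferred from the fixture rows used in the tests.
NON_GAME = ("reteta", "cantec", "ceremonie")
NON_GAME_CATEGORIES = ("retete", "cantece-ceremonii")


def visible(rows, filters=None):
    """Hypothetical in-memory version of the default-exclusion rule."""
    filters = filters or {}
    # The exclusion is lifted by an explicit content_type filter or by
    # picking a category that only holds non-game content.
    lift = (
        "content_type" in filters
        or filters.get("category") in NON_GAME_CATEGORIES
    )
    result = []
    for row in rows:
        if any(row.get(key) != value for key, value in filters.items()):
            continue  # fails one of the equality filters
        if not lift and row.get("content_type") in NON_GAME:
            continue  # hidden by default
        result.append(row)
    return result
```

Deciding `lift` once, before the loop, keeps the rule in a single place, so a filter such as `language` cannot accidentally re-expose recipes or songs.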


@@ -0,0 +1,156 @@
# -*- coding: utf-8 -*-
"""
Tests for scripts/validate_extractions.py.

Covers: schema rejection, the source_excerpt hallucination check, the content
of the generated re-extraction prompt, and the manifest `rejected` marking.
"""
import json
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent.parent
SCRIPTS_DIR = REPO_ROOT / "scripts"
for _p in (str(REPO_ROOT), str(SCRIPTS_DIR)):
    if _p not in sys.path:
        sys.path.insert(0, _p)

import validate_extractions as ve  # noqa: E402


# --------------------------------------------------------------------------
# helpers
# --------------------------------------------------------------------------
def _ext_activity(**over):
    base = dict(
        name="Jocul testului",
        description="O activitate de echipa in aer liber.",
        category="team-building",
        content_type="joc",
        language="ro",
        extraction_confidence="high",
        source_excerpt="ancora din bucata sursa",
        page_reference="page 1",
    )
    base.update(over)
    return base


def _write_extraction(extracted_dir, chunk_key, activities, header_extra=None):
    extracted_dir.mkdir(parents=True, exist_ok=True)
    header = {
        "source_hash": "hash1234deadbeef",
        "schema_version": "1.0",
        "prompt_version": "1.0",
        "chunk_range": "pages 1-20",
        "source_id": "src01",
        "chunk_key": chunk_key,
    }
    if header_extra:
        header.update(header_extra)
    payload = {"header": header, "activities": activities}
    (extracted_dir / f"{chunk_key}.json").write_text(
        json.dumps(payload, ensure_ascii=False), encoding="utf-8"
    )


def _write_chunk(chunks_dir, source_id, chunk_key, text):
    d = chunks_dir / source_id
    d.mkdir(parents=True, exist_ok=True)
    (d / f"{chunk_key}.txt").write_text(text, encoding="utf-8")


# --------------------------------------------------------------------------
# tests
# --------------------------------------------------------------------------
def test_valid_file_passes(tmp_path):
    extracted = tmp_path / "extracted"
    chunks = tmp_path / "chunks"
    excerpt = "ancora din bucata sursa apare aici"
    _write_extraction(extracted, "src01.part01", [_ext_activity(source_excerpt=excerpt)])
    _write_chunk(chunks, "src01", "src01.part01", f"--- PAGE 1 ---\n{excerpt}\n")

    report = ve.run(extracted, chunks, tmp_path / "manifest.json")
    assert report["valid"] == 1
    assert report["rejected"] == 0


def test_schema_invalid_file_rejected(tmp_path):
    extracted = tmp_path / "extracted"
    chunks = tmp_path / "chunks"
    extracted.mkdir(parents=True)
    (extracted / "src01.part01.json").write_text(
        json.dumps({"header": {}, "activities": [{"name": "x"}]}), encoding="utf-8"
    )

    report = ve.run(extracted, chunks, tmp_path / "manifest.json")
    assert report["rejected"] == 1
    prompt = extracted / "_reextract" / "src01.part01.prompt.md"
    assert prompt.exists()


def test_hallucinated_excerpt_rejected(tmp_path):
    extracted = tmp_path / "extracted"
    chunks = tmp_path / "chunks"
    _write_extraction(
        extracted, "src01.part01",
        [_ext_activity(source_excerpt="citat complet inventat care nu exista qqqq")],
    )
    _write_chunk(chunks, "src01", "src01.part01",
                 "--- PAGE 1 ---\ntext complet diferit despre altceva.\n")

    report = ve.run(extracted, chunks, tmp_path / "manifest.json")
    assert report["rejected"] == 1
    errors = report["rejected_chunks"][0]["errors"]
    assert any("hallucination" in e for e in errors)


def test_reextraction_prompt_content(tmp_path):
    extracted = tmp_path / "extracted"
    chunks = tmp_path / "chunks"
    _write_extraction(
        extracted, "src01.part01",
        [_ext_activity(source_excerpt="citat inventat care nu exista zzzz")],
    )
    _write_chunk(chunks, "src01", "src01.part01",
                 "--- PAGE 1 ---\ntext despre cu totul altceva aici.\n")

    ve.run(extracted, chunks, tmp_path / "manifest.json")
    prompt = (extracted / "_reextract" / "src01.part01.prompt.md").read_text(
        encoding="utf-8"
    )
    assert "src01.part01" in prompt
    assert "REJECTED" in prompt
    assert "verbatim" in prompt
    assert "data/extracted/src01.part01.json" in prompt


def test_manifest_marks_chunk_rejected(tmp_path):
    extracted = tmp_path / "extracted"
    chunks = tmp_path / "chunks"
    manifest_path = tmp_path / "manifest.json"
    manifest_path.write_text(
        json.dumps({"chunks": {"src01.part01": {"state": "done",
                    "chunk_file": "chunks/src01/src01.part01.txt"}}}),
        encoding="utf-8",
    )
    _write_extraction(
        extracted, "src01.part01",
        [_ext_activity(source_excerpt="citat fabricat absent vvvv")],
    )
    _write_chunk(chunks, "src01", "src01.part01",
                 "--- PAGE 1 ---\nun continut neinrudit.\n")

    ve.run(extracted, chunks, manifest_path)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    assert manifest["chunks"]["src01.part01"]["state"] == "rejected"


def test_build_reextraction_prompt_lists_errors():
    prompt = ve.build_reextraction_prompt(
        "abc.part03", "data/chunks/abc/abc.part03.txt",
        ["header: 'source_hash' is a required property"],
    )
    assert "abc.part03" in prompt
    assert "source_hash" in prompt
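
The hallucination tests above reject an extraction whose `source_excerpt` cannot be found in the chunk it claims to come from. The simplest form of such a check is a whitespace-insensitive substring test, sketched below. The name `excerpt_is_grounded` and the exact normalization are assumptions; the real validator may tolerate small OCR differences with a fuzzy match instead of requiring a verbatim hit.

```python
import re


def excerpt_is_grounded(excerpt, chunk_text):
    """Hypothetical check: does the excerpt occur verbatim in the chunk?

    Case and runs of whitespace (including line breaks) are ignored, so an
    excerpt that spans a wrapped line in the source still matches.
    """
    def squash(text):
        return re.sub(r"\s+", " ", text).strip().lower()

    needle = squash(excerpt)
    # An empty excerpt proves nothing, so it is treated as not grounded.
    return bool(needle) and needle in squash(chunk_text)
```

For example, with the chunk text from the first test, `excerpt_is_grounded("ancora din bucata sursa", chunk)` holds, while the invented quotes used in the rejection tests do not match.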