Compare commits
7 Commits
e0080edf85...f7a37f91ec
| Author | SHA1 | Date |
|---|---|---|
| | f7a37f91ec | |
| | d6971e47f8 | |
| | bcfb6841eb | |
| | 46d9592a55 | |
| | 09999ccd40 | |
| | 3d9f266696 | |
| | 66ae831c36 | |
12
.gitignore
vendored
@@ -165,9 +165,14 @@ cython_debug/
*.db.backup
*.db.bak
*.db.tmp
*.db.prefreeze*
*.sqlite.backup
*.sqlite3.backup

# Agent runtime locks
.claude/scheduled_tasks.lock
.claude/*.lock

# Temporary files
*.tmp
*.backup
@@ -179,6 +184,13 @@ data/sources/
data/chunks/
data/extracted/

# Enrichment pipeline intermediates (LLM output; final result lands in data/activities.db)
data/enrichment_prompts/
data/enrichment_parts/
data/enrichment_batches/
data/enrichment_wf/
data/enrichment.json

# Keep main production database, the hand-written index, and committed golden set
!data/activities.db
!data/INDEX_MASTER_JOCURI_ACTIVITATI.md

71
ENRICHMENT_PILOT.md
Normal file
@@ -0,0 +1,71 @@
# Enrichment PILOT: sign-off required before full-corpus scaling

**Date:** 2026-05-29. Pilot covers **34 activities** (the STOP gate from `HANDOFF.md`
step 3, guarding ~6–8k LLM calls across the full corpus).

## Pipeline integrity (all green)

| Hop | Expected | Actual |
|-----|----------|--------|
| prompts emitted | 34 | 34 |
| part files on disk (valid JSON, key matches filename) | 34 | 34 |
| `enrichment.json` entries after `--collect` | 34 | 34 |
| rebuild overlay: `matched` / `orphaned` | 34 / 0 | **34 / 0** |

No leak at any hop. `orphaned 0` confirms the content_key the rebuild computes
matches what `run_enrichment` emitted (no dedup rep-selection drift).
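
The "part files on disk" hop in the table can be re-checked with a few lines of Python. This is only a sketch: it assumes each part is a JSON object whose `content_key` field equals the filename stem, and the function name `check_parts` is hypothetical (it is not a script in this repository).

```python
import json
from pathlib import Path


def check_parts(parts_dir):
    """Return (valid, broken) lists of part filenames found in parts_dir.

    Assumption: a part is valid when it parses as a JSON object and its
    `content_key` equals the filename stem.
    """
    valid, broken = [], []
    for path in sorted(Path(parts_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # Unreadable or unparseable file: count it as broken.
            broken.append(path.name)
            continue
        if isinstance(data, dict) and data.get("content_key") == path.stem:
            valid.append(path.name)
        else:
            broken.append(path.name)
    return valid, broken
```

For the pilot, `check_parts("data/enrichment_parts")` would be expected to report 34 valid parts and none broken if the table above holds.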

## Pilot composition

Deliberately mixed to exercise BOTH operations (corpus is 7076 EN / 2465 RO, so
en→ro translation is the dominant + highest-risk path):

- **26** rows from `teambuilding_corbu`, all Romanian → **ro→ro polish**
- **8** rows from `d3959920_outdoor_games`, all English → **en→ro translation**

Result: ~7 genuine en→ro translations + ~27 ro→ro polish.

## Field population (stated vs estimated)

```
age_group_max    :  0 stated / 30 estimated
age_group_min    :  0 / 34
duration_max     :  3 / 29
duration_min     :  4 / 28
indoor_outdoor   : 12 / 22
participants_max :  0 / 24
participants_min :  4 / 30
space_needed     :  2 / 32
```

Almost everything is estimated, because sources rarely state ages or durations explicitly.
The pipeline marks every inferred field in `estimated_fields`, and the UI shows an
`(estimat)` marker, so estimates are transparent to end users.
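
A stated-versus-estimated tally like the one above can be derived from the overlay entries alone. The sketch below assumes each entry is a dict holding the filter fields plus an `estimated_fields` list; the helper name `stated_vs_estimated` is hypothetical and the real QA report in `build_database.py` may count differently.

```python
from collections import Counter

# Filter fields from the table above.
FILTER_FIELDS = [
    "age_group_max", "age_group_min", "duration_max", "duration_min",
    "indoor_outdoor", "participants_max", "participants_min", "space_needed",
]


def stated_vs_estimated(entries):
    """Map each filter field to a (stated, estimated) pair of counts.

    A populated field counts as estimated when it is listed in the entry's
    `estimated_fields`, otherwise as stated; empty (None) fields are skipped.
    """
    stated, estimated = Counter(), Counter()
    for entry in entries:
        flagged = set(entry.get("estimated_fields") or [])
        for name in FILTER_FIELDS:
            if entry.get(name) is None:
                continue
            if name in flagged:
                estimated[name] += 1
            else:
                stated[name] += 1
    return {name: (stated[name], estimated[name]) for name in FILTER_FIELDS}
```

Skipping empty fields would explain why some rows in the table sum to fewer than 34.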

## What to evaluate (the three sign-off axes)

1. **Translation fidelity (en→ro)**: e.g. *Labels → Etichete*, *Ships in a Fog →
   Nave în ceață*, *Spot the Colours → Găsește culorile*. Game rules preserved,
   no moralizing added, proper terms kept.
2. **Description fidelity / expansion**: ro→ro rows fold in setup/material detail
   that IS in the source chunk (e.g. *Găsește-ți fratele și sora* adds "carton A6"
   + "la semnal, toți încep simultan"; *Ce-mi place?* folds in the character-traits
   discussion). No invented steps observed.
3. **Estimation plausibility**: mostly reasonable. **Weak spots to judge:** a few
   age ranges are very wide/defaulted (e.g. *Găsește-ți fratele și sora* → age
   10–99). If wide age defaults are unacceptable, tighten the ENRICHMENT_PROMPT
   guidance before scaling.
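
To judge the third axis in bulk rather than by eye, the wide or defaulted age ranges can be listed automatically. This is a minimal sketch: the 40-year threshold and the function name `wide_age_ranges` are assumptions chosen for illustration, and rows are assumed to be dicts with `name`, `age_group_min` and `age_group_max`.

```python
def wide_age_ranges(rows, max_span=40):
    """Return (name, min_age, max_age) for rows whose age span exceeds max_span.

    Rows missing either bound are skipped rather than guessed at.
    """
    flagged = []
    for row in rows:
        low = row.get("age_group_min")
        high = row.get("age_group_max")
        if low is None or high is None:
            continue
        if high - low > max_span:
            flagged.append((row.get("name"), low, high))
    return flagged
```

A row with ages 10–99 is flagged, while a row with ages 8–12 is not.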

## Inspect the data yourself

```bash
sqlite3 data/activities.db "select name, name_ro, language, indoor_outdoor, space_needed, estimated_fields from activities where name_ro is not null;"
# raw overlay:        data/enrichment.json  (34 entries)
# per-activity parts: data/enrichment_parts/*.json
```

## After sign-off (do NOT auto-proceed)

Scale in waves of ~8–16 Sonnet subagents over the rest of the corpus
(`run_enrichment.py` is additive and resumable: it skips already-enriched keys),
then run `--collect`, then the final `build_database.py --rebuild --enrichment`.
276
HANDOFF.md
Normal file
@@ -0,0 +1,276 @@
# HANDOFF: Enrichment PILOT done; STOP at user sign-off gate before full-corpus scaling

**Snapshot:** 2026-05-29 (updated). Executing plan `enumerated-petting-badger.md`
(bilingual index + enrichment + new filters + source download).

**>>> CURRENT STATE (2026-05-29): user SIGNED OFF on the pilot → full-corpus
enrichment SCALING IN PROGRESS via 8 PARALLEL background Workflows on Sonnet.
Parts on disk (`data/enrichment_parts/<key>.json`) = the durable checkpoint. <<<**

Two earlier single-workflow runs were stopped: the first ran on Opus by mistake
(workflow `agent()` inherits the main-loop model unless `model:'sonnet'` is passed;
now fixed). Measured rate: ~4.7 keys/min single-shard, ~17 keys/min at 3 shards (linear,
no real rate-limit errors; the "429" hits in transcripts are line numbers in chunk
text, not API errors). Concurrency is capped at 2 PER workflow (`nproc`=4 →
`min(16,cores-2)`), so parallelism = run many workflows.

**8 shard scripts: `data/enrichment_wf/shard_0.js` … `shard_7.js`**, each owns a
disjoint batch range of `data/enrichment_batches/batch_NNNN.txt` (780 batches × ~12
keys = 9357 keys; ranges: s0 [0,98) s1 [98,196) s2 [196,294) s3 [294,392) s4 [392,490)
s5 [490,588) s6 [588,686) s7 [686,780)). Each agent is IDEMPOTENT (skips keys whose
part already exists + parses), so re-launching any shard is safe. Run IDs:
s0 `wf_3c314d06-01c` · s1 `wf_ecc7d151-a11` · s2 `wf_4156be35-748` ·
s3 `wf_fa16abee-17a` · s4 `wf_a0f595b8-8fe` · s5 `wf_b3505593-09a` ·
s6 `wf_ad0d731e-12e` · s7 `wf_a919a99b-1d2`.
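
The shard ranges and the per-workflow concurrency cap quoted above follow from simple arithmetic. The sketch below reproduces both; the function names are hypothetical, and the ceiling-division split is an assumption that happens to match the listed ranges (the real shard scripts may compute them differently).

```python
import math


def shard_ranges(n_batches, n_shards):
    """Split batch indices [0, n_batches) into n_shards half-open ranges.

    Every shard gets ceil(n_batches / n_shards) batches except the last
    ones, which take whatever remains.
    """
    size = math.ceil(n_batches / n_shards)
    return [
        (min(k * size, n_batches), min((k + 1) * size, n_batches))
        for k in range(n_shards)
    ]


def workflow_concurrency(cores):
    """Concurrency cap as quoted in the note: min(16, cores - 2)."""
    return min(16, cores - 2)


print(shard_ranges(780, 8))
print(workflow_concurrency(4))
```

With 780 batches and 8 shards the chunk size is 98, so the last shard ends up with 94 batches, which is why s7 covers [686,780).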

### ▶ RESUME HERE (2026-06-01: THROTTLED CRON SYSTEM now drives enrichment)

**Enrichment progress: 3074 / 9541 done (32.2%), ~6467 missing.** Nothing running.
Parts on disk (`data/enrichment_parts/*.json`) are the durable, idempotent checkpoint.

**A paced wave system now runs FULLY HEADLESS via the LXC's OS cron; NO Claude
session required.** Fixes the "always runs to exhaustion" bug: each wave caps at
~75% of a 5h window and the next window is reached by time (cron).

ARCHITECTURE: OS cron → `scripts/enrich_wave.sh` → one `claude -p` per batch,
PAR-way parallel (OS-level, NOT the Workflow tool, which can't be used headless:
`claude -p` is one-shot and would exit before background workflows finish). Each
headless `claude -p` reads a batch file and writes `data/enrichment_parts/<key>.json`.

- **`scripts/enrichment_wave.py`** (prepares a bounded wave, no LLM):
  - `--status`: read-only progress (done / missing / pct / corrupt count).
  - `--prepare --keys 700 --no-shards`: drop corrupt parts; take FIRST 700
    sorted-missing keys; write batch files for ONLY those; print
    `WAVE: PREPARED …` or `WAVE: COMPLETE`. `--no-shards` = batch files only
    (the headless path). (Without `--no-shards` it also regenerates Workflow shard
    JS from `data/enrichment_wf/shard.js.tmpl`; only needed for the old Workflow path.)
- **`scripts/enrich_wave.sh [KEYS] [PAR]`** (the headless orchestrator, run by cron):
  flock-guarded (waves never overlap); `--prepare`; if `WAVE: COMPLETE` → `--collect` +
  `--rebuild` and stop; else `xargs -P PAR` one `claude -p` per batch
  (`--allowedTools Bash(python3:*),Read,Write,Bash(cat:*),Bash(ls:*)`, `</dev/null`).
  Logs to `/workspace/.claude-logs/enrich_<ts>.log`. Detects + logs "WINDOW EXHAUSTED".
- **OS crontab (user `claude`, `crontab -l` to view):** two night fires
  `20 23 * * *` and `50 0 * * *` UTC (= 02:20 & 03:50 EEST). Timed AFTER the
  live-confirmed **23:00 UTC usage-window reset** so both land in the fresh post-reset
  window (user asleep → safe to use it fully; two 700-caps top out at the window's
  ~950 capacity). Self-healing: a fire into an exhausted window is a harmless no-op
  (`claude -p` prints "session limit", writes nothing) and those keys retry next fire.
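
The `--status` and `--prepare` behaviour described in the list above can be sketched as follows. This is not the code of `scripts/enrichment_wave.py`: the function names are hypothetical, and the file layout (`<key>.prompt.md` prompts, `<key>.json` parts) is taken from the paths mentioned in this note.

```python
import json
from pathlib import Path

PROMPT_SUFFIX = ".prompt.md"


def wave_status(prompts_dir, parts_dir):
    """Count done / missing / corrupt keys and list the keys still to do."""
    keys = sorted(
        p.name[: -len(PROMPT_SUFFIX)]
        for p in Path(prompts_dir).glob("*" + PROMPT_SUFFIX)
    )
    done, missing, corrupt = [], [], []
    for key in keys:
        part = Path(parts_dir) / (key + ".json")
        if not part.exists():
            missing.append(key)
            continue
        try:
            json.loads(part.read_text(encoding="utf-8"))
            done.append(key)
        except ValueError:
            # Part exists but does not parse: treat it as work to redo.
            corrupt.append(key)
    pct = round(100.0 * len(done) / len(keys), 1) if keys else 0.0
    return {
        "done": len(done),
        "missing": len(missing),
        "corrupt": len(corrupt),
        "pct": pct,
        "todo": sorted(missing + corrupt),
    }


def next_wave(todo_keys, cap=700):
    """Take the first `cap` keys in sorted order, as one bounded wave."""
    return sorted(todo_keys)[:cap]
```

Because the wave is always the first `cap` keys of the sorted to-do list, a fire that writes nothing simply leaves the same keys at the front for the next fire.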

**Auth caveat:** headless `claude -p` uses the OAuth token in
`~/.claude/.credentials.json` (verified working). If it ever expires and can't refresh
non-interactively, cron fires fail with auth errors → user must `claude` login once.

**Manual fallback (one wave, any time, no session needed):**
```bash
/workspace/game-library/scripts/enrich_wave.sh 700 6   # runs a full wave now
# or step-by-step:
python3 scripts/enrichment_wave.py --status            # progress
python3 scripts/run_enrichment.py --collect && python3 scripts/build_database.py --rebuild   # at WAVE: COMPLETE
# gate: rebuild must print  enrichment {N} (matched N, orphaned 0)
```

**Control:** `crontab -e` to retime/disable; `crontab -r` removes all. Tune `--keys`
(KEYS arg) up to drain faster, down if logs show "WINDOW EXHAUSTED" early. One full
window ≈ 950 keys ≈ 100%.

**Hard facts learned:**
- Workflow concurrency is capped at **2 per workflow** (`nproc`=4 → `min(16,cores-2)`); parallelism = run many workflow processes. 3 shards measured ~17 keys/min (linear, no real rate-limit).
- Workflow `agent()` inherits the **main-loop model unless `model:'sonnet'` is passed**; the FIRST run silently used Opus, so always pass model.
- The full corpus does **NOT fit in one 5h usage window**; it needs SEVERAL windows. Parallelism only cuts wall-clock inside a window, not total token budget.
- Main-session token drain was **polling** (sleep/grep loops), NOT launching workflows. Launch + wait-for-notification only.
- StructuredOutput failures appear when a window exhausts mid-run. They are harmless: idempotent skip + the regenerate-from-missing reconcile recover every dropped key.

(prev note) Earlier STOPPED at 593/9541 after hitting 92% of the 5h Anthropic
usage window (resets 23:00 UTC). KEY LESSON: the full corpus does NOT fit in one 5h
window; 6.2% + the session's other work already used ~92%. Enrichment must be spread
across MANY 5h windows (parallelism only cuts wall-clock inside a window, not total
token budget). Resume model: after each window reset, regenerate batches from
currently-missing, relaunch a bounded number of shards, stop before the window
exhausts. Idempotent shards + parts-on-disk make this safe to repeat. Consider a
cron/scheduled job that runs a bounded wave each reset.

**To regenerate batches from currently-missing + relaunch a shard** (reconcile):
```bash
python3 - <<'PY'
import glob, os
BATCH = 12
SUFFIX = '.prompt.md'
# keys with a prompt on disk but no part yet (strip the full '.prompt.md' suffix)
missing = sorted(
    os.path.basename(p)[:-len(SUFFIX)]
    for p in glob.glob('data/enrichment_prompts/*' + SUFFIX)
    if not os.path.exists('data/enrichment_parts/' + os.path.basename(p)[:-len(SUFFIX)] + '.json'))
for old in glob.glob('data/enrichment_batches/batch_*.txt'):
    os.remove(old)
batches = [missing[i:i + BATCH] for i in range(0, len(missing), BATCH)]
for n, keys in enumerate(batches):
    with open(f'data/enrichment_batches/batch_{n:04d}.txt', 'w') as fh:
        fh.write('\n'.join(keys) + '\n')
print('missing', len(missing), 'batches', len(batches))
PY
# then edit START/END in the shard files to cover the new batch count and re-invoke Workflow({scriptPath: 'data/enrichment_wf/shard_K.js'})
```

### Resume / completion procedure (do this when the workflow finishes, or to continue a new session)

The pipeline is RESUMABLE: parts on disk are truth; re-running regenerates work only for missing keys.

1. **Rebuild the batch list from what's still missing** (prompt exists, part absent), then re-run the workflow for the gap:
   ```bash
   # regenerate batch files for missing keys (script below already lives in shell history; logic:
   #   for each data/enrichment_prompts/<key>.prompt.md with no data/enrichment_parts/<key>.json,
   #   split into data/enrichment_batches/batch_NNNN.txt of 12)
   ```
   The workflow script is at
   `.../workflows/scripts/enrich-corpus-wf_440c0a2f-17f.js` (nBatches hardcoded → update it to the new batch count, or it defaults to 793). Re-invoke with `{scriptPath: ...}`.
2. **Reconcile loop** (expect 2–3 passes; some parts WILL drop: flaky agents, a stray quote that slips re-validation): repeat step 1 until `missing == 0`.
3. **Collect + final rebuild ONCE at the end** (don't rebuild after every wave; 9541 rows is wasted work):
   ```bash
   python3 scripts/run_enrichment.py --collect      # robust: repairs stray-quote parts, skips+reports truly-broken
   python3 scripts/build_database.py --rebuild      # picks up --enrichment by default
   ```
   **Gate:** rebuild must print `enrichment {entries} (matched {entries}, orphaned 0)`. Done-criterion is the reconcile counts converging: `emitted == parts-on-disk == entries == matched`.
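
The gate can also be checked mechanically instead of by reading the rebuild output. The sketch below assumes the report contains a line of the form `enrichment 34 (matched 34, orphaned 0)`, with plain integers in place of the placeholders; the exact output format of `build_database.py` is not shown in this note, so the regular expression and the name `gate_passes` are assumptions.

```python
import re

# Assumed report shape: "enrichment <entries> (matched <n>, orphaned <m>)".
GATE_RE = re.compile(
    r"enrichment\s+(\d+)\s+\(matched\s+(\d+),\s*orphaned\s+(\d+)\)"
)


def gate_passes(report_text):
    """True when every overlay entry matched and nothing was orphaned."""
    found = GATE_RE.search(report_text)
    if found is None:
        # No recognisable report line: fail closed.
        return False
    entries, matched, orphaned = (int(group) for group in found.groups())
    return entries == matched and orphaned == 0
```

Failing closed on a missing line keeps an unexpected change in the report format from silently passing the gate.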

### ⚠ FREEZE IS NOW LOCKED
Enrichment content_keys depend on the current freeze. The earlier "re-freezing is safe"
note is **INVERTED** now: do NOT re-extract or re-freeze `data/extracted/` until the
final `--rebuild`, or content_keys drift and the overlay orphans.

## Where we are

| Step (plan Part C) | Status |
|--------------------|--------|
| 1. Finish extraction | **DONE**: 582 chunks extracted & valid (7 re-extracted this session); 6 content-filter-blocked, accepted as missing |
| 2. Land code Part A1–A4 (model/schema/merge) | **DONE & committed** |
| 2b. Code Part A5–A8 (UI/search/download) | **DONE & committed** |
| 2c. Code Part B2–B4 (enrichment pipeline) | **DONE & committed** |
| 3. Freeze rebuild (freezes content_keys) | **DONE**: `data/activities.db` = **9541 activities** (re-frozen with the 7 chunks) |
| Part D tests | **DONE**: `tests/test_enrichment.py`, 99 pass total |
| 4. Enrichment pilot → **STOP for user sign-off** | **DONE: 34 activities enriched (26 ro-polish + 8 en→ro), pipeline 34/34 matched, 0 orphaned. AWAITING SIGN-OFF.** |
| 5. Final rebuild `--enrichment` | not started (post sign-off) |

## The 7 re-extracted chunks (this session)

Re-extracted via Sonnet subagents, all valid JSON, re-frozen into the corpus.
One (`d297a434…part01`) had an activity named "Eu" (<3 chars, schema-rejected);
renamed faithfully to "Eu sunt..." (matches the source affirmation). The 6
content-filter-blocked chunks remain accepted as missing.

Everything is committed except whatever this session leaves dirty. `data/extracted/*.json`
is gitignored (575 files on disk, durable across /clear).

## The 13 missing chunks (out of 588)

**6 content-filter-blocked** (Anthropic safety; accept as missing, marginal loss):
- `87850302_dragon_sleepdeprived.part73 / .part85 / .part94` (camp song lyrics)
- `c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96`

**7 need RE-EXTRACTION** (their malformed-original JSON was destroyed; see "json_repair
incident" below; re-extract once the subagent session limit resets, ~5pm UTC):
```
3f9c8232_teambuilding_corbu_29092023.part01
5f959f85_scoli_fara_bullying.part02
83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01
d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04
d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05
e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03
```
Re-extract these (Sonnet subagents, one Agent call each, the per-chunk prompt is at
`data/chunks/_prompts/<key>.prompt.md`), then **re-run the freeze rebuild** so they join
the corpus before enrichment. Re-freezing is safe now: enrichment has NOT run, so no
overlay keys depend on the current freeze yet.

## The json_repair incident (important: root cause + what was fixed)

Subagents **systematically emit unescaped ASCII `"` inside string values** (Romanian
text like `„Unu"` uses a closing `"` that terminates the JSON string early). ~34 files
were affected.

First repair attempt used the `json_repair` lib. **It truncates**: on a stray quote it
ends the string and reinterprets the trailing text as a new key, silently dropping the
rest of the value and injecting garbage keys. Schema `additionalProperties:false` caught
the garbage-key cases (8 files dropped at rebuild), but the truncation that didn't create
an extra key slipped through. Applying json_repair output to disk also **overwrote the
malformed originals** for those 8 → originals lost → those (now 7, one recovered) need
re-extraction.

**Fix:** `scripts/repair_extractions.py` was rewritten to use a faithful char-scanner
(`escape_stray_quotes`) that **escapes** stray quotes (`\"`) instead of splitting on them,
validates against the real schema, and only replaces a valid top-level file when the
repaired version carries **strictly more text** (a length guard that catches truncated
json_repair output while leaving genuine extractions untouched). Re-running it cleanly
repaired the affected files; the final freeze had **0 schema-rejected, 0 invalid**.
`json_repair` is no longer used anywhere. Do NOT reintroduce it.
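
The idea behind a char-scanner that escapes stray quotes, rather than splitting on them, can be illustrated in a few lines. This is not the repository's `escape_stray_quotes`; it is a simplified sketch built on one assumed heuristic: inside a string, a `"` is treated as the real closing quote only when the next non-whitespace character is `,`, `}`, `]`, `:` or the end of input. A stray quote that happens to be followed by one of those characters would be misclassified.

```python
import json


def escape_stray_quotes_sketch(text):
    """Escape double quotes inside JSON strings that do not end the string."""
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
            i += 1
        elif ch == "\\" and i + 1 < n:
            # Already-escaped character: copy the pair through untouched.
            out.append(ch + text[i + 1])
            i += 2
        elif ch == '"':
            # Look past whitespace to see what follows this quote.
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if j >= n or text[j] in ",}]:":
                in_string = False
                out.append(ch)
            else:
                out.append('\\"')
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


broken = '{"name": "„Unu" și doi", "n": 1}'
print(json.loads(escape_stray_quotes_sketch(broken)))
```

Nothing is dropped: the repaired string keeps every character of the original value, which is the property the length guard described above relies on.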

`build_database.py` does NOT depend on the repair script (the "DB regenerable from
data/extracted/" invariant holds: plain `json.loads` only).

## What the code does now (all committed)

**Part A, plumbing (corpus-independent):**
- `app/models/database.py`: new columns `name_ro/description_ro/rules_ro/variations_ro,
  indoor_outdoor, space_needed, estimated_fields(JSON), source_id, source_ids(JSON),
  chunk_key`; FTS5 indexes the 4 `*_ro` columns (CREATE + all 3 triggers, kept in sync);
  indexes on `indoor_outdoor`/`space_needed`; `search_activities` gained `indoor_outdoor`
  and `space_needed` equality kwargs; `_update_category_counts` feeds both new axes into
  the categories table so dropdowns populate.
- `app/models/activity.py`: new fields + `to_dict`/`from_dict`; helpers `get_display_name`
  / `get_display_description` / `get_display_rules` / `get_display_variations`
  (RO-primary, EN fallback), `has_translation`, `is_estimated(field)`,
  `get_indoor_outdoor_display`, `get_space_needed_display`.
- `app/config_taxonomy.py`: `INDOOR_OUTDOOR`, `SPACE_NEEDED` enums + RO labels +
  `normalize_indoor_outdoor` / `normalize_space_needed` (None on unrecognised, no
  fallback: never fabricate a value) + display-name helpers.
- `scripts/build_database.py`: `dict_to_activity` sets `source_id`+`chunk_key`;
  `merge_cluster` unions `source_ids` and carries rep's `source_id`/`chunk_key` but
  **never** touches enrichment fields (those are applied post-dedup).
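
The RO-primary, EN-fallback display helpers listed above amount to a small pattern. The sketch below shows it on a cut-down, hypothetical `ActivitySketch` class; the real `Activity` model has more fields, and its `has_translation` may check more than the name.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ActivitySketch:
    """Hypothetical stand-in for the Activity model (illustration only)."""

    name: str
    name_ro: Optional[str] = None
    estimated_fields: List[str] = field(default_factory=list)

    def get_display_name(self) -> str:
        # Romanian text wins when present; otherwise fall back to the original.
        return self.name_ro or self.name

    def has_translation(self) -> bool:
        return bool(self.name_ro)

    def is_estimated(self, field_name: str) -> bool:
        # True when enrichment inferred the value instead of reading it.
        return field_name in self.estimated_fields
```

A template can then call `get_display_name()` unconditionally and use `is_estimated("age_group_min")` to decide whether to render the `(estimat)` marker.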

**Part A, UI/search:**
- `app/services/search.py`: `_map_filters_to_db_fields` maps `indoor_outdoor`/
  `space_needed` to DB equality filters.
- `app/web/routes.py`: new `/source/<id>` download route, **shipped DARK behind
  `SOURCE_DOWNLOAD_ENABLED` (default false; copyright exposure, user opts in)**; resolves
  `source_file` under `CORPUS_DIR` via `send_from_directory` (traversal-safe, 404s for
  web-mirror sources). `DISPLAY_NAMES` extended with both new axes.
- `app/config.py`: `SOURCE_DOWNLOAD_ENABLED`, `CORPUS_DIR`.
- Templates: `index.html`/`results.html` have the 2 new dropdowns; cards use display
  helpers + `(estimat)` markers; `activity.html` is RO-primary with a collapsible
  "Text original" section, indoor/space cards, estimat markers, and the download link
  (only when the flag is on). `main.css` has `.estimated` / `.original-text` styles.

**Part B, enrichment pipeline (built, not yet run):**
- `scripts/build_database.py`: `load_enrichment` + `apply_enrichment(activities, enrichment)`
  applied **right after** `apply_review_decisions`, on the post-dedup list, keyed on
  `import_common.content_key(normalized_name, language, _normalize_text(description))`
  (reused verbatim). CLI `--enrichment` (default `data/enrichment.json`). QA report prints
  `enrichment {entries, matched, orphaned}` + per-field **stated vs estimated** counts.
  Translated/expanded text is NOT re-validated against source (by design).
- `scripts/run_enrichment.py`: reads the rebuilt DB, computes each row's content_key,
  skips rows already in `data/enrichment_parts/<key>.json` (resumable), emits one prompt
  per activity to `data/enrichment_prompts/` (current EN fields + source chunk text via
  `find_chunk_text`). Pilot scoping: `--source <substr>` and/or `--limit N`. `--collect`
  merges parts → `data/enrichment.json`.
- `scripts/ENRICHMENT_PROMPT.md`: single-pass rules: translate faithfully, expand
  `description_ro` ONLY from chunk text, mark inferred filter fields in `estimated_fields`,
  fixed enum vocab, output `data/enrichment_parts/<content_key>.json` including `content_key`.
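
How a keyed overlay produces the `entries / matched / orphaned` numbers can be shown with a toy version. Everything here is an assumption made for illustration: `content_key_sketch` is a stand-in hash (the real key comes from `import_common.content_key`, whose implementation is not shown), and activities are plain dicts rather than model objects.

```python
import hashlib


def content_key_sketch(name, language, description):
    """Hypothetical stable key built from normalised name, language and text."""
    raw = "\x1f".join([
        name.strip().lower(),
        language,
        " ".join(description.split()).lower(),
    ])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def apply_enrichment_sketch(activities, enrichment):
    """Overlay enrichment entries onto activities and report the counts.

    An entry is matched when some activity computes the same key, and
    orphaned when no activity does (for example after a re-freeze).
    """
    matched = set()
    for act in activities:
        key = content_key_sketch(act["name"], act["language"], act["description"])
        overlay = enrichment.get(key)
        if overlay is None:
            continue
        act.update({k: v for k, v in overlay.items() if k != "content_key"})
        matched.add(key)
    orphaned = set(enrichment) - matched
    return {
        "entries": len(enrichment),
        "matched": len(matched),
        "orphaned": len(orphaned),
    }
```

Because the key is derived from the activity text itself, any change to the frozen text changes the key, which is why a re-freeze after enrichment turns existing overlay entries into orphans.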

## Exact next steps

1. **Re-extract the 7 chunks** above (after session-limit reset). Verify each writes valid
   JSON (`python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"`).
   If any come back malformed, `python3 scripts/repair_extractions.py --apply` (faithful now).
2. **Re-freeze:** `python3 scripts/build_database.py --rebuild`; confirm 0 schema-rejected,
   note the new total (~9418 + the 7 chunks' activities).
3. **Enrichment PILOT** (plan B5, the STOP gate guarding 6–8k LLM calls):
   - Pick one source, e.g. `python3 scripts/run_enrichment.py --source teambuilding_corbu`
     (or `--limit 30`). This writes prompts to `data/enrichment_prompts/`.
   - Launch a small wave of Sonnet subagents on those prompts (each writes
     `data/enrichment_parts/<key>.json`).
   - `python3 scripts/run_enrichment.py --collect` → `data/enrichment.json`.
   - `python3 scripts/build_database.py --rebuild` (picks up `--enrichment` by default).
   - **STOP. Hand the user translation-quality + estimation-plausibility +
     description-fidelity samples and get sign-off BEFORE scaling to the full corpus.** Do not
     auto-proceed past this gate.
4. After sign-off: scale enrichment in waves of ~8–16 Sonnet subagents, `--collect`,
   final `--rebuild --enrichment`.

## Verify / run

- Tests: `python3 -m pytest tests/ -q` → 99 pass.
- App: `SOURCE_DOWNLOAD_ENABLED` is false by default (download link hidden). Set it true
  only if the user accepts the copyright exposure of serving original files.
- `data/activities.db.bak` is the pre-this-freeze backup.
@@ -22,6 +22,18 @@ class Config:
    # Search settings
    SEARCH_RESULTS_LIMIT = int(os.environ.get('SEARCH_RESULTS_LIMIT', '100'))
    FTS_ENABLED = True

    # Source-file download (plan A6). Shipped DARK by default: serving the
    # original PDFs/books carries a copyright exposure the user must opt into.
    # The /source/<id> route 404s entirely while this is false; the UI hides
    # the download link. Enable with SOURCE_DOWNLOAD_ENABLED=true.
    SOURCE_DOWNLOAD_ENABLED = (
        os.environ.get('SOURCE_DOWNLOAD_ENABLED', 'false').lower() == 'true'
    )
    # Root of the original corpus. source_file values are relative to this.
    CORPUS_DIR = os.environ.get('CORPUS_DIR') or str(
        Path(__file__).parent.parent / 'data' / 'carti-camp-jocuri'
    )

    @staticmethod
    def ensure_directories():

313
app/config_taxonomy.py
Normal file
@@ -0,0 +1,313 @@
"""
Controlled category taxonomy for game-library.

Single source of truth for activity categories. The DB stores the *slug*;
the UI displays the Romanian name. `category` (thematic domain) and
`content_type` (form of the content) are INDEPENDENT axes (see plan §2).
"""

import unicodedata
import re
from typing import Dict, List, Optional

# --- Categories (thematic domain) --------------------------------------------
# slug -> Romanian display name. ~16 fixed slugs; `altele` is the mandatory
# fallback and MUST always be present.
CATEGORIES: Dict[str, str] = {
    "jocuri-cercetasesti": "Jocuri cercetășești",
    "team-building": "Team-building",
    "icebreakers": "Icebreakers / spargerea gheții",
    "camp-outdoor": "Tabără și activități în aer liber",
    "wide-games": "Wide games / jocuri de teren",
    "orientare": "Orientare",
    "prim-ajutor": "Prim ajutor",
    "escape-room-puzzle": "Escape room și puzzle",
    "creative-stem": "Creativitate și STEM",
    "sports-active": "Sport și activități fizice",
    "cantece-ceremonii": "Cântece și ceremonii",
    "retete": "Rețete",
    "supravietuire": "Supraviețuire",
    "integrare-incluziune": "Integrare și incluziune",
    "conflict-empatie": "Conflict și empatie",
    "altele": "Altele",
}

# Mandatory fallback slug.
FALLBACK_CATEGORY = "altele"

# Ordered list of valid slugs.
CATEGORY_SLUGS: List[str] = list(CATEGORIES.keys())

# --- Content type (form of the content) --------------------------------------
# Independent axis from `category`. The UI default search excludes the
# non-game content types (see plan §6).
CONTENT_TYPES: Dict[str, str] = {
    "joc": "Joc",
    "activitate": "Activitate",
    "reteta": "Rețetă",
    "cantec": "Cântec",
    "ceremonie": "Ceremonie",
}

CONTENT_TYPE_SLUGS: List[str] = list(CONTENT_TYPES.keys())

# Content types considered "non-game", excluded from the default UI search.
NON_GAME_CONTENT_TYPES: List[str] = ["reteta", "cantec", "ceremonie"]

DEFAULT_CONTENT_TYPE = "activitate"

# --- Aliases -----------------------------------------------------------------
# Map of normalized arbitrary strings -> canonical slug. Keys are already
# diacritic-stripped, lowercased and hyphenated (see _slugify). This catches
# legacy / messy values from the old DB and common English/Romanian variants.
_CATEGORY_ALIASES: Dict[str, str] = {
    # legacy junk
    "general-activity": "altele",
    "general": "altele",
    "educational": "creative-stem",
    "d": "altele",
    "a": "altele",
    "b": "altele",
    "c": "altele",
    # scouting
    "cercetasie": "jocuri-cercetasesti",
    "cercetasesti": "jocuri-cercetasesti",
    "scout": "jocuri-cercetasesti",
    "scouting": "jocuri-cercetasesti",
    "scout-games": "jocuri-cercetasesti",
    "jocuri-cercetasesti": "jocuri-cercetasesti",
    # team building
    "teambuilding": "team-building",
    "team": "team-building",
    "cooperare": "team-building",
    # icebreakers
    "icebreaker": "icebreakers",
    "spargerea-ghetii": "icebreakers",
    "cunoastere": "icebreakers",
    "energizers": "icebreakers",
    "energizer": "icebreakers",
    # camp / outdoor
    "camp": "camp-outdoor",
    "tabara": "camp-outdoor",
    "outdoor": "camp-outdoor",
    "aer-liber": "camp-outdoor",
    # wide games
    "wide-game": "wide-games",
    "jocuri-de-teren": "wide-games",
    "joc-de-teren": "wide-games",
    "big-games": "wide-games",
    # orientare
    "orienteering": "orientare",
    "navigatie": "orientare",
    # prim ajutor
    "first-aid": "prim-ajutor",
    "primul-ajutor": "prim-ajutor",
    # escape room / puzzle
    "escape-room": "escape-room-puzzle",
    "escaperoom": "escape-room-puzzle",
    "puzzle": "escape-room-puzzle",
    "puzzles": "escape-room-puzzle",
    "ghicitori": "escape-room-puzzle",
    # creative / stem
    "creative": "creative-stem",
    "creativitate": "creative-stem",
    "stem": "creative-stem",
    "arts-and-crafts": "creative-stem",
    "craft": "creative-stem",
    "crafts": "creative-stem",
    "stiinta": "creative-stem",
    # sports
    "sport": "sports-active",
    "sports": "sports-active",
    "sportive": "sports-active",
    "active": "sports-active",
    "miscare": "sports-active",
    "physical": "sports-active",
    # songs / ceremonies
    "cantece": "cantece-ceremonii",
    "cantec": "cantece-ceremonii",
    "songs": "cantece-ceremonii",
    "ceremonii": "cantece-ceremonii",
    "ceremonie": "cantece-ceremonii",
    "ceremony": "cantece-ceremonii",
    # recipes
    "reteta": "retete",
    "recipe": "retete",
    "recipes": "retete",
    "cooking": "retete",
    "gatit": "retete",
    # survival
    "survival": "supravietuire",
    "supravietuire": "supravietuire",
    # inclusion
    "integrare": "integrare-incluziune",
    "incluziune": "integrare-incluziune",
    "inclusion": "integrare-incluziune",
    # conflict / empathy
    "conflict": "conflict-empatie",
    "empatie": "conflict-empatie",
    "empathy": "conflict-empatie",
    "rezolvarea-conflictelor": "conflict-empatie",
    # fallback
    "altele": "altele",
    "other": "altele",
    "others": "altele",
    "misc": "altele",
}


def _slugify(value: str) -> str:
    """Lowercase, strip diacritics, collapse non-alphanumerics to hyphens."""
    if not value:
        return ""
    # Decompose accents (ă -> a, ș -> s, ț -> t, etc.)
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
    ascii_str = ascii_str.lower().strip()
    ascii_str = re.sub(r"[^a-z0-9]+", "-", ascii_str)
    return ascii_str.strip("-")


def normalize_category(value: str) -> str:
    """Map an arbitrary string to a valid category slug.

    Returns one of CATEGORY_SLUGS, falling back to `altele` for anything
    unrecognised or empty.
    """
    if not value:
        return FALLBACK_CATEGORY
    slug = _slugify(str(value))
    if not slug:
        return FALLBACK_CATEGORY
    # Exact slug match.
    if slug in CATEGORIES:
        return slug
    # Alias match.
    if slug in _CATEGORY_ALIASES:
        return _CATEGORY_ALIASES[slug]
    return FALLBACK_CATEGORY


def normalize_content_type(value: str) -> str:
    """Map an arbitrary string to a valid content_type slug.

    Returns one of CONTENT_TYPE_SLUGS, falling back to `activitate`.
    """
    if not value:
        return DEFAULT_CONTENT_TYPE
    slug = _slugify(str(value))
    if slug in CONTENT_TYPES:
        return slug
    # Light alias handling for plural / English forms.
    aliases = {
        "jocuri": "joc",
        "game": "joc",
        "games": "joc",
        "activitati": "activitate",
        "activity": "activitate",
        "retete": "reteta",
        "recipe": "reteta",
        "cantece": "cantec",
        "song": "cantec",
        "ceremonii": "ceremonie",
        "ceremony": "ceremonie",
    }
    return aliases.get(slug, DEFAULT_CONTENT_TYPE)
|
||||
|
||||
|
||||
# --- Indoor / outdoor (enrichment axis) --------------------------------------
|
||||
# Where the activity is run. Inferred during enrichment when the source is
|
||||
# silent — such inferences are flagged in `estimated_fields`. slug -> RO label.
|
||||
INDOOR_OUTDOOR: Dict[str, str] = {
|
||||
"indoor": "Interior",
|
||||
"outdoor": "Exterior",
|
||||
"either": "Interior sau exterior",
|
||||
}
|
||||
|
||||
# --- Space needed (enrichment axis) ------------------------------------------
|
||||
# Rough footprint the activity requires. slug -> RO label.
|
||||
SPACE_NEEDED: Dict[str, str] = {
|
||||
"mic": "Spațiu mic",
|
||||
"mediu": "Spațiu mediu",
|
||||
"mare": "Spațiu mare",
|
||||
}
|
||||
|
||||
# Aliases for robustness against LLM output variation. Keys are _slugify'd.
|
||||
_INDOOR_OUTDOOR_ALIASES: Dict[str, str] = {
|
||||
"interior": "indoor",
|
||||
"inside": "indoor",
|
||||
"in": "indoor",
|
||||
"exterior": "outdoor",
|
||||
"outside": "outdoor",
|
||||
"out": "outdoor",
|
||||
"aer-liber": "outdoor",
|
||||
"both": "either",
|
||||
"any": "either",
|
||||
"ambele": "either",
|
||||
"interior-exterior": "either",
|
||||
"indoor-outdoor": "either",
|
||||
}
|
||||
|
||||
_SPACE_NEEDED_ALIASES: Dict[str, str] = {
|
||||
"small": "mic",
|
||||
"redus": "mic",
|
||||
"putin": "mic",
|
||||
"medium": "mediu",
|
||||
"moderat": "mediu",
|
||||
"large": "mare",
|
||||
"big": "mare",
|
||||
"mult": "mare",
|
||||
"spatiu-mic": "mic",
|
||||
"spatiu-mediu": "mediu",
|
||||
"spatiu-mare": "mare",
|
||||
}
|
||||
|
||||
|
||||
def normalize_indoor_outdoor(value: str) -> Optional[str]:
|
||||
"""Map an arbitrary string to an indoor_outdoor slug, or None.
|
||||
|
||||
Unlike categories, this has NO mandatory fallback: an unrecognised or
|
||||
empty value yields None (field simply absent), so we never fabricate a
|
||||
location the enrichment did not assert.
|
||||
"""
|
||||
if not value:
|
||||
return None
|
||||
slug = _slugify(str(value))
|
||||
if slug in INDOOR_OUTDOOR:
|
||||
return slug
|
||||
return _INDOOR_OUTDOOR_ALIASES.get(slug)
|
||||
|
||||
|
||||
def normalize_space_needed(value: str) -> Optional[str]:
|
||||
"""Map an arbitrary string to a space_needed slug, or None (no fallback)."""
|
||||
if not value:
|
||||
return None
|
||||
slug = _slugify(str(value))
|
||||
if slug in SPACE_NEEDED:
|
||||
return slug
|
||||
return _SPACE_NEEDED_ALIASES.get(slug)
|
||||
|
||||
|
||||
def indoor_outdoor_display_name(slug: str) -> str:
|
||||
"""RO display name for an indoor_outdoor slug."""
|
||||
return INDOOR_OUTDOOR.get(slug, slug)
|
||||
|
||||
|
||||
def space_needed_display_name(slug: str) -> str:
|
||||
"""RO display name for a space_needed slug."""
|
||||
return SPACE_NEEDED.get(slug, slug)
|
||||
|
||||
|
||||
def is_valid_category(slug: str) -> bool:
|
||||
"""True if `slug` is a valid category slug."""
|
||||
return slug in CATEGORIES
|
||||
|
||||
|
||||
def category_display_name(slug: str) -> str:
|
||||
"""Romanian display name for a slug (fallback to the slug itself)."""
|
||||
return CATEGORIES.get(slug, slug)
|
||||
|
||||
|
||||
def content_type_display_name(slug: str) -> str:
|
||||
"""Romanian display name for a content_type slug."""
|
||||
return CONTENT_TYPES.get(slug, slug)
|
||||
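The normalizers in this file share one pattern: slugify the input, accept an exact slug, then try an alias table, and either fall back to a default (categories, content types) or return `None` (enrichment axes). The sketch below reproduces that pattern in isolation for the space axis. The names `slugify`, `SPACE_ALIASES` and `normalize_space` are illustrative stand-ins, not the module's own identifiers, and the alias table is a small subset chosen for the example.

```python
import re
import unicodedata
from typing import Dict, Optional

# Illustrative subset of the slug -> label and alias tables (assumed values).
SPACE_NEEDED: Dict[str, str] = {
    "mic": "Spațiu mic",
    "mediu": "Spațiu mediu",
    "mare": "Spațiu mare",
}
SPACE_ALIASES: Dict[str, str] = {
    "small": "mic",
    "large": "mare",
    "spatiu-mic": "mic",
}


def slugify(value: str) -> str:
    """Lowercase, strip diacritics, collapse non-alphanumerics to hyphens."""
    decomposed = unicodedata.normalize("NFKD", value or "")
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    collapsed = re.sub(r"[^a-z0-9]+", "-", stripped.lower().strip())
    return collapsed.strip("-")


def normalize_space(value: object) -> Optional[str]:
    """Return a valid slug, or None when the value is empty or unrecognised."""
    if not value:
        return None
    slug = slugify(str(value))
    if slug in SPACE_NEEDED:
        return slug
    # No default here: an unknown value stays absent instead of being guessed.
    return SPACE_ALIASES.get(slug)
```

Returning `None` rather than a default keeps the axis empty when the LLM output is unusable, which matches the stated intent of never fabricating a value the enrichment did not assert.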
@@ -5,6 +5,22 @@ Activity data model for INDEX-SISTEM-JOCURI v2.0
|
||||
from dataclasses import dataclass, field
|
||||
from typing import List, Optional, Dict, Any
|
||||
import json
|
||||
import re
|
||||
import unicodedata
|
||||
|
||||
|
||||
def normalize_name(name: str) -> str:
|
||||
"""Diacritic-free, lowercased, whitespace-collapsed form of a name.
|
||||
|
||||
Used as the exact-match key for dedup grouping (see plan §4).
|
||||
"""
|
||||
if not name:
|
||||
return ""
|
||||
decomposed = unicodedata.normalize("NFKD", name)
|
||||
ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
|
||||
ascii_str = ascii_str.lower().strip()
|
||||
ascii_str = re.sub(r"\s+", " ", ascii_str)
|
||||
return ascii_str
|
||||
|
||||
@dataclass
|
||||
class Activity:
|
||||
@@ -19,10 +35,19 @@ class Activity:
|
||||
# Categories
|
||||
category: str = ""
|
||||
subcategory: Optional[str] = None
|
||||
|
||||
# content_type is an axis INDEPENDENT of category:
|
||||
# one of joc/activitate/reteta/cantec/ceremonie (see config_taxonomy).
|
||||
content_type: Optional[str] = None
|
||||
|
||||
# Source information
|
||||
source_file: str = ""
|
||||
page_reference: Optional[str] = None
|
||||
# source_files: every source the activity was seen in. Held as a list in
# memory and JSON-encoded when written to the DB (see to_dict). `source_file`
# (singular) stays as the primary/original source; build_database (Lane C)
# accumulates the full list here on dedup-merge.
|
||||
source_files: List[str] = field(default_factory=list)
|
||||
# Short verbatim quote from the source — anti-hallucination anchor.
|
||||
source_excerpt: Optional[str] = None
|
||||
|
||||
# Age and participants
|
||||
age_group_min: Optional[int] = None
|
||||
@@ -44,11 +69,41 @@ class Activity:
|
||||
keywords: Optional[str] = None
|
||||
tags: List[str] = field(default_factory=list)
|
||||
popularity_score: int = 0
|
||||
|
||||
|
||||
# Extraction / language metadata
|
||||
language: Optional[str] = None # 'ro' / 'en'
|
||||
normalized_name: Optional[str] = None # dedup key; auto-derived from name
|
||||
extraction_confidence: Optional[str] = None # 'high' / 'med' / 'low'
|
||||
needs_review: int = 0
|
||||
|
||||
# Enrichment overlay (applied at build time from data/enrichment.json; see
|
||||
# plan Part B). Bilingual: the EN/source text stays in name/description/...
|
||||
# and the Romanian rendering lands in the *_ro twins. Absent fields leave
|
||||
# the underlying DB value untouched.
|
||||
name_ro: Optional[str] = None
|
||||
description_ro: Optional[str] = None
|
||||
rules_ro: Optional[str] = None
|
||||
variations_ro: Optional[str] = None
|
||||
indoor_outdoor: Optional[str] = None # slug: indoor / outdoor / either
|
||||
space_needed: Optional[str] = None # slug: mic / mediu / mare
|
||||
# Names of fields whose value was INFERRED by enrichment (source was
|
||||
# silent) rather than stated in the source — surfaced as "(estimat)" in UI.
|
||||
estimated_fields: List[str] = field(default_factory=list)
|
||||
|
||||
# Source provenance for the download route + enrichment keying.
|
||||
source_id: Optional[str] = None # e.g. "876d1a2d_marcaje_turistice"
|
||||
source_ids: List[str] = field(default_factory=list) # all source_ids merged
|
||||
chunk_key: Optional[str] = None # e.g. "<source_id>.part01"
|
||||
|
||||
# Database fields
|
||||
id: Optional[int] = None
|
||||
created_at: Optional[str] = None
|
||||
updated_at: Optional[str] = None
|
||||
|
||||
def __post_init__(self):
|
||||
"""Derive normalized_name from name when not explicitly provided."""
|
||||
if not self.normalized_name:
|
||||
self.normalized_name = normalize_name(self.name)
|
||||
|
||||
def to_dict(self) -> Dict[str, Any]:
|
||||
"""Convert activity to dictionary for database storage"""
|
||||
@@ -59,8 +114,11 @@ class Activity:
|
||||
'variations': self.variations,
|
||||
'category': self.category,
|
||||
'subcategory': self.subcategory,
|
||||
'content_type': self.content_type,
|
||||
'source_file': self.source_file,
|
||||
'source_files': json.dumps(self.source_files) if self.source_files else None,
|
||||
'page_reference': self.page_reference,
|
||||
'source_excerpt': self.source_excerpt,
|
||||
'age_group_min': self.age_group_min,
|
||||
'age_group_max': self.age_group_max,
|
||||
'participants_min': self.participants_min,
|
||||
@@ -73,7 +131,21 @@ class Activity:
|
||||
'difficulty_level': self.difficulty_level,
|
||||
'keywords': self.keywords,
|
||||
'tags': json.dumps(self.tags) if self.tags else None,
|
||||
-'popularity_score': self.popularity_score
+'popularity_score': self.popularity_score,
|
||||
'language': self.language,
|
||||
'normalized_name': self.normalized_name or normalize_name(self.name),
|
||||
'extraction_confidence': self.extraction_confidence,
|
||||
'needs_review': self.needs_review,
|
||||
'name_ro': self.name_ro,
|
||||
'description_ro': self.description_ro,
|
||||
'rules_ro': self.rules_ro,
|
||||
'variations_ro': self.variations_ro,
|
||||
'indoor_outdoor': self.indoor_outdoor,
|
||||
'space_needed': self.space_needed,
|
||||
'estimated_fields': json.dumps(self.estimated_fields) if self.estimated_fields else None,
|
||||
'source_id': self.source_id,
|
||||
'source_ids': json.dumps(self.source_ids) if self.source_ids else None,
|
||||
'chunk_key': self.chunk_key,
|
||||
}
|
||||
|
||||
@classmethod
|
||||
@@ -86,7 +158,30 @@ class Activity:
|
||||
tags = json.loads(data['tags'])
|
||||
except (json.JSONDecodeError, TypeError):
|
||||
tags = []
|
||||
|
||||
|
||||
# source_files may arrive as a JSON string (DB) or a list (extraction)
|
||||
source_files = data.get('source_files')
|
||||
if isinstance(source_files, str):
|
||||
try:
|
||||
source_files = json.loads(source_files)
|
||||
except (json.JSONDecodeError, TypeError):
|
||||
source_files = []
|
||||
elif source_files is None:
|
||||
source_files = []
|
||||
|
||||
# estimated_fields / source_ids: JSON string (DB) or list (in-memory)
|
||||
def _json_list(value):
|
||||
if isinstance(value, str):
|
||||
try:
|
||||
parsed = json.loads(value)
|
||||
return parsed if isinstance(parsed, list) else []
|
||||
except (json.JSONDecodeError, TypeError):
|
||||
return []
|
||||
return list(value) if value else []
|
||||
|
||||
estimated_fields = _json_list(data.get('estimated_fields'))
|
||||
source_ids = _json_list(data.get('source_ids'))
|
||||
|
||||
return cls(
|
||||
id=data.get('id'),
|
||||
name=data.get('name', ''),
|
||||
@@ -95,8 +190,11 @@ class Activity:
|
||||
variations=data.get('variations'),
|
||||
category=data.get('category', ''),
|
||||
subcategory=data.get('subcategory'),
|
||||
content_type=data.get('content_type'),
|
||||
source_file=data.get('source_file', ''),
|
||||
source_files=source_files,
|
||||
page_reference=data.get('page_reference'),
|
||||
source_excerpt=data.get('source_excerpt'),
|
||||
age_group_min=data.get('age_group_min'),
|
||||
age_group_max=data.get('age_group_max'),
|
||||
participants_min=data.get('participants_min'),
|
||||
@@ -110,6 +208,20 @@ class Activity:
|
||||
keywords=data.get('keywords'),
|
||||
tags=tags,
|
||||
popularity_score=data.get('popularity_score', 0),
|
||||
language=data.get('language'),
|
||||
normalized_name=data.get('normalized_name'),
|
||||
extraction_confidence=data.get('extraction_confidence'),
|
||||
needs_review=data.get('needs_review', 0) or 0,
|
||||
name_ro=data.get('name_ro'),
|
||||
description_ro=data.get('description_ro'),
|
||||
rules_ro=data.get('rules_ro'),
|
||||
variations_ro=data.get('variations_ro'),
|
||||
indoor_outdoor=data.get('indoor_outdoor'),
|
||||
space_needed=data.get('space_needed'),
|
||||
estimated_fields=estimated_fields,
|
||||
source_id=data.get('source_id'),
|
||||
source_ids=source_ids,
|
||||
chunk_key=data.get('chunk_key'),
|
||||
created_at=data.get('created_at'),
|
||||
updated_at=data.get('updated_at')
|
||||
)
|
||||
@@ -150,4 +262,44 @@ class Activity:
|
||||
return self.materials_category
|
||||
elif self.materials_list:
|
||||
return self.materials_list[:100] + "..." if len(self.materials_list) > 100 else self.materials_list
|
||||
-return "nu specificate"
+return "nu specificate"
|
||||
|
||||
# --- Enrichment / bilingual display helpers ------------------------------
|
||||
def get_display_name(self) -> str:
|
||||
"""Romanian name when enriched, else the original."""
|
||||
return self.name_ro or self.name
|
||||
|
||||
def get_display_description(self) -> str:
|
||||
"""Romanian description when enriched, else the original."""
|
||||
return self.description_ro or self.description
|
||||
|
||||
def get_display_rules(self) -> Optional[str]:
|
||||
"""Romanian rules when enriched, else the original."""
|
||||
return self.rules_ro or self.rules
|
||||
|
||||
def get_display_variations(self) -> Optional[str]:
|
||||
"""Romanian variations when enriched, else the original."""
|
||||
return self.variations_ro or self.variations
|
||||
|
||||
def has_translation(self) -> bool:
|
||||
"""True if any Romanian enrichment text is present."""
|
||||
return bool(self.name_ro or self.description_ro
|
||||
or self.rules_ro or self.variations_ro)
|
||||
|
||||
def is_estimated(self, field_name: str) -> bool:
|
||||
"""True if `field_name` was inferred by enrichment (source was silent)."""
|
||||
return field_name in (self.estimated_fields or [])
|
||||
|
||||
def get_indoor_outdoor_display(self) -> Optional[str]:
|
||||
"""RO label for indoor_outdoor, or None when unset."""
|
||||
if not self.indoor_outdoor:
|
||||
return None
|
||||
from app.config_taxonomy import indoor_outdoor_display_name
|
||||
return indoor_outdoor_display_name(self.indoor_outdoor)
|
||||
|
||||
def get_space_needed_display(self) -> Optional[str]:
|
||||
"""RO label for space_needed, or None when unset."""
|
||||
if not self.space_needed:
|
||||
return None
|
||||
from app.config_taxonomy import space_needed_display_name
|
||||
return space_needed_display_name(self.space_needed)
|
||||
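The list-valued fields added here (`estimated_fields`, `source_ids`) are stored as JSON text and read back tolerantly, while the `*_ro` fields take precedence over the original text at display time. The following is a minimal sketch of that round trip, assuming a cut-down stand-in class. `MiniActivity`, `to_row` and `from_row` are hypothetical names for illustration and are not part of the diff.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


def json_list(value: Any) -> List[Any]:
    """Accept a JSON string (DB) or a list (in-memory); anything else -> []."""
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except (json.JSONDecodeError, TypeError):
            return []
        return parsed if isinstance(parsed, list) else []
    return list(value) if value else []


@dataclass
class MiniActivity:
    name: str
    name_ro: Optional[str] = None
    estimated_fields: List[str] = field(default_factory=list)

    def to_row(self) -> Dict[str, Any]:
        """Serialise for storage; an empty list is stored as NULL."""
        return {
            "name": self.name,
            "name_ro": self.name_ro,
            "estimated_fields": (
                json.dumps(self.estimated_fields) if self.estimated_fields else None
            ),
        }

    @classmethod
    def from_row(cls, row: Dict[str, Any]) -> "MiniActivity":
        return cls(
            name=row.get("name", ""),
            name_ro=row.get("name_ro"),
            estimated_fields=json_list(row.get("estimated_fields")),
        )

    def display_name(self) -> str:
        """Romanian name when enriched, else the original."""
        return self.name_ro or self.name

    def is_estimated(self, field_name: str) -> bool:
        return field_name in self.estimated_fields
```

One consequence of the `or` fallback is that an empty-string translation is treated the same as a missing one, so the original text is shown in both cases.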
@@ -30,6 +30,8 @@ class DatabaseManager:
|
||||
"""Initialize database with v2.0 schema"""
|
||||
with self._get_connection() as conn:
|
||||
# Main activities table
|
||||
# NOTE: schema is rebuilt from scratch (plan §6) — no in-place
|
||||
# migration. The old DB is deleted and recreated by build_database.
|
||||
conn.execute("""
|
||||
CREATE TABLE IF NOT EXISTS activities (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
@@ -39,9 +41,12 @@ class DatabaseManager:
|
||||
variations TEXT,
|
||||
category TEXT NOT NULL,
|
||||
subcategory TEXT,
|
||||
content_type TEXT,
|
||||
source_file TEXT NOT NULL,
|
||||
source_files TEXT,
|
||||
page_reference TEXT,
|
||||
|
||||
source_excerpt TEXT,
|
||||
|
||||
-- Structured parameters
|
||||
age_group_min INTEGER,
|
||||
age_group_max INTEGER,
|
||||
@@ -49,26 +54,47 @@ class DatabaseManager:
|
||||
participants_max INTEGER,
|
||||
duration_min INTEGER,
|
||||
duration_max INTEGER,
|
||||
|
||||
|
||||
-- Categories for filtering
|
||||
materials_category TEXT,
|
||||
materials_list TEXT,
|
||||
skills_developed TEXT,
|
||||
difficulty_level TEXT,
|
||||
|
||||
|
||||
-- Metadata
|
||||
keywords TEXT,
|
||||
tags TEXT,
|
||||
popularity_score INTEGER DEFAULT 0,
|
||||
|
||||
-- Extraction / language metadata
|
||||
language TEXT,
|
||||
normalized_name TEXT,
|
||||
extraction_confidence TEXT,
|
||||
needs_review INTEGER DEFAULT 0,
|
||||
|
||||
-- Enrichment overlay (bilingual + inferred filters; Part B)
|
||||
name_ro TEXT,
|
||||
description_ro TEXT,
|
||||
rules_ro TEXT,
|
||||
variations_ro TEXT,
|
||||
indoor_outdoor TEXT,
|
||||
space_needed TEXT,
|
||||
estimated_fields TEXT,
|
||||
source_id TEXT,
|
||||
source_ids TEXT,
|
||||
chunk_key TEXT,
|
||||
|
||||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
|
||||
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
|
||||
)
|
||||
""")
|
||||
|
||||
|
||||
# FTS5 virtual table for search
|
||||
conn.execute("""
|
||||
CREATE VIRTUAL TABLE IF NOT EXISTS activities_fts USING fts5(
|
||||
name, description, rules, variations, keywords,
|
||||
materials_list, skills_developed,
|
||||
name_ro, description_ro, rules_ro, variations_ro,
|
||||
content='activities',
|
||||
content_rowid='id'
|
||||
)
|
||||
@@ -92,6 +118,9 @@ class DatabaseManager:
|
||||
"CREATE INDEX IF NOT EXISTS idx_activities_age ON activities(age_group_min, age_group_max)",
|
||||
"CREATE INDEX IF NOT EXISTS idx_activities_participants ON activities(participants_min, participants_max)",
|
||||
"CREATE INDEX IF NOT EXISTS idx_activities_duration ON activities(duration_min, duration_max)",
|
||||
"CREATE INDEX IF NOT EXISTS idx_activities_normalized_name ON activities(normalized_name)",
|
||||
"CREATE INDEX IF NOT EXISTS idx_activities_indoor_outdoor ON activities(indoor_outdoor)",
|
||||
"CREATE INDEX IF NOT EXISTS idx_activities_space_needed ON activities(space_needed)",
|
||||
"CREATE INDEX IF NOT EXISTS idx_categories_type ON categories(type)"
|
||||
]
|
||||
|
||||
@@ -102,24 +131,42 @@ class DatabaseManager:
|
||||
conn.execute("""
|
||||
CREATE TRIGGER IF NOT EXISTS activities_fts_insert AFTER INSERT ON activities
|
||||
BEGIN
|
||||
-INSERT INTO activities_fts(rowid, name, description, rules, variations, keywords)
-VALUES (new.id, new.name, new.description, new.rules, new.variations, new.keywords);
+INSERT INTO activities_fts(rowid, name, description, rules, variations,
+    keywords, materials_list, skills_developed,
+    name_ro, description_ro, rules_ro, variations_ro)
+VALUES (new.id, new.name, new.description, new.rules, new.variations,
+    new.keywords, new.materials_list, new.skills_developed,
+    new.name_ro, new.description_ro, new.rules_ro, new.variations_ro);
|
||||
END
|
||||
""")
|
||||
|
||||
|
||||
conn.execute("""
|
||||
CREATE TRIGGER IF NOT EXISTS activities_fts_delete AFTER DELETE ON activities
|
||||
BEGIN
|
||||
-DELETE FROM activities_fts WHERE rowid = old.id;
+INSERT INTO activities_fts(activities_fts, rowid, name, description, rules,
+    variations, keywords, materials_list, skills_developed,
+    name_ro, description_ro, rules_ro, variations_ro)
+VALUES ('delete', old.id, old.name, old.description, old.rules,
+    old.variations, old.keywords, old.materials_list, old.skills_developed,
+    old.name_ro, old.description_ro, old.rules_ro, old.variations_ro);
|
||||
END
|
||||
""")
|
||||
|
||||
|
||||
conn.execute("""
|
||||
CREATE TRIGGER IF NOT EXISTS activities_fts_update AFTER UPDATE ON activities
|
||||
BEGIN
|
||||
-DELETE FROM activities_fts WHERE rowid = old.id;
-INSERT INTO activities_fts(rowid, name, description, rules, variations, keywords)
-VALUES (new.id, new.name, new.description, new.rules, new.variations, new.keywords);
+INSERT INTO activities_fts(activities_fts, rowid, name, description, rules,
+    variations, keywords, materials_list, skills_developed,
+    name_ro, description_ro, rules_ro, variations_ro)
+VALUES ('delete', old.id, old.name, old.description, old.rules,
+    old.variations, old.keywords, old.materials_list, old.skills_developed,
+    old.name_ro, old.description_ro, old.rules_ro, old.variations_ro);
+INSERT INTO activities_fts(rowid, name, description, rules, variations,
+    keywords, materials_list, skills_developed,
+    name_ro, description_ro, rules_ro, variations_ro)
+VALUES (new.id, new.name, new.description, new.rules, new.variations,
+    new.keywords, new.materials_list, new.skills_developed,
+    new.name_ro, new.description_ro, new.rules_ro, new.variations_ro);
|
||||
END
|
||||
""")
|
||||
|
||||
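The trigger changes above switch from a plain `DELETE` on the FTS table to the FTS5 `'delete'` command, which is how an external-content index (`content='activities'`) is kept in sync: the old column values are passed back so the matching index entries can be removed. Below is a reduced, self-contained sketch of the same scheme on an in-memory database. The `items` table, its two columns and the helper names are assumptions made for the example, and it requires an SQLite build with FTS5 enabled.

```python
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, name_ro TEXT);
CREATE VIRTUAL TABLE items_fts USING fts5(
    name, name_ro, content='items', content_rowid='id'
);
CREATE TRIGGER items_fts_insert AFTER INSERT ON items BEGIN
    INSERT INTO items_fts(rowid, name, name_ro)
    VALUES (new.id, new.name, new.name_ro);
END;
CREATE TRIGGER items_fts_delete AFTER DELETE ON items BEGIN
    INSERT INTO items_fts(items_fts, rowid, name, name_ro)
    VALUES ('delete', old.id, old.name, old.name_ro);
END;
CREATE TRIGGER items_fts_update AFTER UPDATE ON items BEGIN
    INSERT INTO items_fts(items_fts, rowid, name, name_ro)
    VALUES ('delete', old.id, old.name, old.name_ro);
    INSERT INTO items_fts(rowid, name, name_ro)
    VALUES (new.id, new.name, new.name_ro);
END;
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def search(conn: sqlite3.Connection, term: str) -> List[int]:
    """Return rowids of items whose indexed text matches `term`."""
    rows = conn.execute(
        "SELECT rowid FROM items_fts WHERE items_fts MATCH ? ORDER BY rowid",
        (term,),
    )
    return [r[0] for r in rows]


def clear(conn: sqlite3.Connection) -> None:
    """Empty the content table, then reset the index with 'delete-all'."""
    conn.execute("DELETE FROM items")
    conn.execute("INSERT INTO items_fts(items_fts) VALUES('delete-all')")
```

Indexing the `*_ro` columns alongside the originals means a search term in either language can hit the same row, which appears to be the purpose of widening the FTS column list in this diff.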
@@ -179,11 +226,17 @@ class DatabaseManager:
|
||||
"""Update category usage counts"""
|
||||
categories_to_update = [
|
||||
('category', activity.category),
|
||||
('content_type', activity.content_type),
|
||||
('language', activity.language),
|
||||
('age_group', activity.get_age_range_display()),
|
||||
('participants', activity.get_participants_display()),
|
||||
('duration', activity.get_duration_display()),
|
||||
('materials', activity.get_materials_display()),
|
||||
('difficulty', activity.difficulty_level),
|
||||
# Enrichment axes — slugs stored as value; UI maps to RO via
|
||||
# DISPLAY_NAMES. Without these the new dropdowns would be empty.
|
||||
('indoor_outdoor', activity.indoor_outdoor),
|
||||
('space_needed', activity.space_needed),
|
||||
]
|
||||
|
||||
for cat_type, cat_value in categories_to_update:
|
||||
@@ -210,6 +263,8 @@ class DatabaseManager:
|
||||
duration_max: Optional[int] = None,
|
||||
materials_category: Optional[str] = None,
|
||||
difficulty_level: Optional[str] = None,
|
||||
indoor_outdoor: Optional[str] = None,
|
||||
space_needed: Optional[str] = None,
|
||||
limit: int = 100) -> List[Dict[str, Any]]:
|
||||
"""Enhanced search with FTS5 and filters"""
|
||||
|
||||
@@ -267,7 +322,15 @@ class DatabaseManager:
|
||||
if difficulty_level:
|
||||
base_query += " AND difficulty_level = ?"
|
||||
params.append(difficulty_level)
|
||||
|
||||
|
||||
if indoor_outdoor:
|
||||
base_query += " AND indoor_outdoor = ?"
|
||||
params.append(indoor_outdoor)
|
||||
|
||||
if space_needed:
|
||||
base_query += " AND space_needed = ?"
|
||||
params.append(space_needed)
|
||||
|
||||
# Add ordering and limit
|
||||
query = f"{base_query} {order_clause} LIMIT ?"
|
||||
params.append(limit)
|
||||
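The two new filters follow the existing pattern in `search_activities`: each optional argument appends an `AND column = ?` clause and a bound parameter, so values never get interpolated into the SQL string. A minimal sketch of that pattern follows, using a hypothetical three-column `activities` table and a hypothetical `build_filter_query` helper in place of the real method.

```python
import sqlite3
from typing import Any, List, Optional, Tuple


def build_filter_query(
    indoor_outdoor: Optional[str] = None,
    space_needed: Optional[str] = None,
    limit: int = 100,
) -> Tuple[str, List[Any]]:
    """Assemble a parameterised query; unset filters add no clause."""
    query = "SELECT name FROM activities WHERE 1=1"
    params: List[Any] = []
    if indoor_outdoor:
        query += " AND indoor_outdoor = ?"
        params.append(indoor_outdoor)
    if space_needed:
        query += " AND space_needed = ?"
        params.append(space_needed)
    query += " ORDER BY name LIMIT ?"
    params.append(limit)
    return query, params


def demo(indoor_outdoor: Optional[str] = None,
         space_needed: Optional[str] = None) -> List[str]:
    """Run the query against a tiny in-memory table (sample rows are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE activities (name TEXT, indoor_outdoor TEXT, space_needed TEXT)"
    )
    conn.executemany(
        "INSERT INTO activities VALUES (?, ?, ?)",
        [
            ("Tag", "outdoor", "mare"),
            ("Chess", "indoor", "mic"),
            ("Relay", "outdoor", "mare"),
            ("Story circle", "either", None),
        ],
    )
    sql, params = build_filter_query(indoor_outdoor, space_needed)
    return [row[0] for row in conn.execute(sql, params)]
```

Note that an exact-match filter on `outdoor` does not return rows tagged `either`, and rows with a NULL axis are excluded by any filter on that axis. The diff does not say whether that is the intended behaviour, so it may be worth confirming before the full-corpus run.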
@@ -332,8 +395,11 @@ class DatabaseManager:
|
||||
def clear_database(self):
|
||||
"""Clear all data from database"""
|
||||
with self._get_connection() as conn:
|
||||
# Deleting from activities fires the delete trigger, which removes
|
||||
# the matching FTS rows. The explicit 'delete-all' command then
|
||||
# guarantees the external-content FTS index is fully cleared.
|
||||
conn.execute("DELETE FROM activities")
|
||||
-conn.execute("DELETE FROM activities_fts")
+conn.execute("INSERT INTO activities_fts(activities_fts) VALUES('delete-all')")
|
||||
conn.execute("DELETE FROM categories")
|
||||
conn.commit()
|
||||
|
||||
|
||||
@@ -2,8 +2,6 @@
|
||||
Services for INDEX-SISTEM-JOCURI v2.0
|
||||
"""
|
||||
|
||||
-from .parser import IndexMasterParser
-from .indexer import ActivityIndexer
|
||||
from .search import SearchService
|
||||
|
||||
-__all__ = ['IndexMasterParser', 'ActivityIndexer', 'SearchService']
+__all__ = ['SearchService']
|
||||
|
||||
@@ -1,248 +0,0 @@
|
||||
"""
|
||||
Activity indexer service for INDEX-SISTEM-JOCURI v2.0
|
||||
Coordinates parsing and database indexing
|
||||
"""
|
||||
|
||||
from typing import List, Dict, Any
|
||||
from pathlib import Path
|
||||
from app.models.database import DatabaseManager
|
||||
from app.models.activity import Activity
|
||||
from app.services.parser import IndexMasterParser
|
||||
import time
|
||||
|
||||
class ActivityIndexer:
|
||||
"""Service for indexing activities from INDEX_MASTER into database"""
|
||||
|
||||
def __init__(self, db_manager: DatabaseManager, index_master_path: str):
|
||||
"""Initialize indexer with database manager and INDEX_MASTER path"""
|
||||
self.db = db_manager
|
||||
self.parser = IndexMasterParser(index_master_path)
|
||||
self.indexing_stats = {}
|
||||
|
||||
def index_all_activities(self, clear_existing: bool = False) -> Dict[str, Any]:
|
||||
"""Index all activities from INDEX_MASTER into database"""
|
||||
|
||||
print("🚀 Starting activity indexing process...")
|
||||
start_time = time.time()
|
||||
|
||||
# Clear existing data if requested
|
||||
if clear_existing:
|
||||
print("🗑️ Clearing existing database...")
|
||||
self.db.clear_database()
|
||||
|
||||
# Parse activities from INDEX_MASTER
|
||||
print("📖 Parsing INDEX_MASTER file...")
|
||||
activities = self.parser.parse_all_categories()
|
||||
|
||||
if not activities:
|
||||
print("❌ No activities were parsed!")
|
||||
return {'success': False, 'error': 'No activities parsed'}
|
||||
|
||||
# Filter valid activities
|
||||
valid_activities = []
|
||||
for activity in activities:
|
||||
if self.parser.validate_activity_completeness(activity):
|
||||
valid_activities.append(activity)
|
||||
else:
|
||||
print(f"⚠️ Skipping incomplete activity: {activity.name[:50]}...")
|
||||
|
||||
print(f"✅ Validated {len(valid_activities)} activities out of {len(activities)} parsed")
|
||||
|
||||
if len(valid_activities) < 100:
|
||||
print(f"⚠️ Warning: Only {len(valid_activities)} valid activities found. Expected 500+")
|
||||
|
||||
# Bulk insert into database
|
||||
print("💾 Inserting activities into database...")
|
||||
try:
|
||||
inserted_count = self.db.bulk_insert_activities(valid_activities)
|
||||
|
||||
# Rebuild FTS index for optimal search performance
|
||||
print("🔍 Rebuilding search index...")
|
||||
self.db.rebuild_fts_index()
|
||||
|
||||
end_time = time.time()
|
||||
indexing_time = end_time - start_time
|
||||
|
||||
# Generate final statistics (with error handling)
|
||||
try:
|
||||
stats = self._generate_indexing_stats(valid_activities, indexing_time)
|
||||
stats['inserted_count'] = inserted_count
|
||||
stats['success'] = True
|
||||
except Exception as e:
|
||||
print(f"⚠️ Error generating statistics: {e}")
|
||||
stats = {
|
||||
'success': True,
|
||||
'inserted_count': inserted_count,
|
||||
'indexing_time_seconds': indexing_time,
|
||||
'error': f'Stats generation failed: {str(e)}'
|
||||
}
|
||||
|
||||
print(f"✅ Indexing complete! {inserted_count} activities indexed in {indexing_time:.2f}s")
|
||||
|
||||
# Verify database state (with error handling)
|
||||
try:
|
||||
db_stats = self.db.get_statistics()
|
||||
print(f"📊 Database now contains {db_stats['total_activities']} activities")
|
||||
except Exception as e:
|
||||
print(f"⚠️ Error getting database statistics: {e}")
|
||||
print(f"📊 Database insertion completed, statistics unavailable")
|
||||
|
||||
return stats
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Error during database insertion: {e}")
|
||||
return {'success': False, 'error': str(e)}
|
||||
|
||||
def index_specific_category(self, category_code: str) -> Dict[str, Any]:
|
||||
"""Index activities from a specific category only"""
|
||||
|
||||
print(f"🎯 Indexing specific category: {category_code}")
|
||||
|
||||
# Load content and parse specific category
|
||||
if not self.parser.load_content():
|
||||
return {'success': False, 'error': 'Could not load INDEX_MASTER'}
|
||||
|
||||
category_name = self.parser.category_mapping.get(category_code)
|
||||
if not category_name:
|
||||
return {'success': False, 'error': f'Unknown category code: {category_code}'}
|
||||
|
||||
activities = self.parser.parse_category_section(category_code, category_name)
|
||||
|
||||
if not activities:
|
||||
return {'success': False, 'error': f'No activities found in category {category_code}'}
|
||||
|
||||
# Filter valid activities
|
||||
valid_activities = [a for a in activities if self.parser.validate_activity_completeness(a)]
|
||||
|
||||
try:
|
||||
inserted_count = self.db.bulk_insert_activities(valid_activities)
|
||||
return {
|
||||
'success': True,
|
||||
'category': category_name,
|
||||
'inserted_count': inserted_count,
|
||||
'total_parsed': len(activities),
|
||||
'valid_activities': len(valid_activities)
|
||||
}
|
||||
except Exception as e:
|
||||
return {'success': False, 'error': str(e)}
|
||||
|
||||
def _generate_indexing_stats(self, activities: List[Activity], indexing_time: float) -> Dict[str, Any]:
|
||||
"""Generate comprehensive indexing statistics"""
|
||||
|
||||
# Get parser statistics
|
||||
parser_stats = self.parser.get_parsing_statistics()
|
||||
|
||||
# Calculate additional metrics
|
||||
categories = {}
|
||||
age_ranges = {}
|
||||
durations = {}
|
||||
materials = {}
|
||||
|
||||
for activity in activities:
|
||||
# Category breakdown
|
||||
if activity.category in categories:
|
||||
categories[activity.category] += 1
|
||||
else:
|
||||
categories[activity.category] = 1
|
||||
|
||||
# Age range analysis (with safety check)
|
||||
try:
|
||||
age_key = activity.get_age_range_display() or "nespecificat"
|
||||
age_ranges[age_key] = age_ranges.get(age_key, 0) + 1
|
||||
except Exception as e:
|
||||
print(f"Warning: Error getting age range for activity {activity.name}: {e}")
|
||||
age_ranges["nespecificat"] = age_ranges.get("nespecificat", 0) + 1
|
||||
|
||||
# Duration analysis (with safety check)
|
||||
try:
|
||||
duration_key = activity.get_duration_display() or "nespecificat"
|
||||
durations[duration_key] = durations.get(duration_key, 0) + 1
|
||||
except Exception as e:
|
||||
print(f"Warning: Error getting duration for activity {activity.name}: {e}")
|
||||
durations["nespecificat"] = durations.get("nespecificat", 0) + 1
|
||||
|
||||
# Materials analysis (with safety check)
|
||||
try:
|
||||
materials_key = activity.get_materials_display() or "nespecificat"
|
||||
materials[materials_key] = materials.get(materials_key, 0) + 1
|
||||
except Exception as e:
|
||||
print(f"Warning: Error getting materials for activity {activity.name}: {e}")
|
||||
materials["nespecificat"] = materials.get("nespecificat", 0) + 1
|
||||
|
||||
return {
|
||||
'indexing_time_seconds': indexing_time,
|
||||
'parsing_stats': parser_stats,
|
||||
'distribution': {
|
||||
'categories': categories,
|
||||
'age_ranges': age_ranges,
|
||||
'durations': durations,
|
||||
'materials': materials
|
||||
},
|
||||
'quality_metrics': {
|
||||
'completion_rate': parser_stats.get('completion_rate', 0),
|
||||
'average_description_length': parser_stats.get('average_description_length', 0),
|
||||
'activities_with_metadata': sum(1 for a in activities if a.age_group_min or a.participants_min or a.duration_min)
|
||||
}
|
||||
}
|
||||
|
||||
def verify_indexing_quality(self) -> Dict[str, Any]:
|
||||
"""Verify the quality of indexed data"""
|
||||
|
||||
try:
|
||||
# Get database statistics
|
||||
db_stats = self.db.get_statistics()
|
||||
|
||||
# Check for minimum activity count
|
||||
total_activities = db_stats['total_activities']
|
||||
            meets_minimum = total_activities >= 500

            # Check category distribution
            categories = db_stats.get('categories', {})
            category_coverage = len(categories)

            # Sample some activities to check quality
            sample_activities = self.db.search_activities(limit=10)

            quality_issues = []
            for activity in sample_activities:
                if not activity.get('description') or len(activity['description']) < 10:
                    quality_issues.append(f"Activity {activity.get('name', 'Unknown')} has insufficient description")

                if not activity.get('category'):
                    quality_issues.append(f"Activity {activity.get('name', 'Unknown')} missing category")

            return {
                'total_activities': total_activities,
                'meets_minimum_requirement': meets_minimum,
                'minimum_target': 500,
                'category_coverage': category_coverage,
                'expected_categories': len(self.parser.category_mapping),
                'quality_issues': quality_issues,
                'quality_score': max(0, 100 - len(quality_issues) * 10),
                'database_stats': db_stats
            }

        except Exception as e:
            return {'error': str(e), 'quality_score': 0}

    def get_indexing_progress(self) -> Dict[str, Any]:
        """Get current indexing progress and status"""
        try:
            db_stats = self.db.get_statistics()

            # Calculate progress towards 500+ activities goal
            total_activities = db_stats['total_activities']
            target_activities = 500
            progress_percentage = min(100, (total_activities / target_activities) * 100)

            return {
                'current_activities': total_activities,
                'target_activities': target_activities,
                'progress_percentage': progress_percentage,
                'status': 'completed' if total_activities >= target_activities else 'in_progress',
                'categories_indexed': list(db_stats.get('categories', {}).keys()),
                'database_size_mb': db_stats.get('database_size_bytes', 0) / (1024 * 1024)
            }

        except Exception as e:
            return {'error': str(e), 'status': 'error'}
|
||||
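The two figures reported by the service above reduce to small formulas. The following is a minimal sketch, assuming the 500-activity target and the 10-point penalty per issue visible in the code; the function names `progress_percentage` and `quality_score` are illustrative and do not exist in the codebase.

```python
TARGET_ACTIVITIES = 500  # target value taken from the service code above


def progress_percentage(total_activities: int, target: int = TARGET_ACTIVITIES) -> float:
    """Share of the target reached, capped at 100."""
    return min(100, (total_activities / target) * 100)


def quality_score(issue_count: int) -> int:
    """Start at 100, subtract 10 per detected issue, never drop below 0."""
    return max(0, 100 - issue_count * 10)


if __name__ == "__main__":
    print(progress_percentage(125))
    print(progress_percentage(900))
    print(quality_score(3))
```

Because the quality check samples only 10 activities and each can raise up to two issues, the score is a spot check of the sample rather than a measure of the whole database.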
@@ -1,340 +0,0 @@
|
||||
"""
|
||||
Advanced parser for INDEX_MASTER_JOCURI_ACTIVITATI.md
|
||||
Extracts 500+ individual activities with full details
|
||||
"""
|
||||
|
||||
import re
|
||||
from pathlib import Path
|
||||
from typing import List, Dict, Optional, Tuple
|
||||
from app.models.activity import Activity
|
||||
|
||||
class IndexMasterParser:
|
||||
"""Advanced parser for extracting real activities from INDEX_MASTER"""
|
||||
|
||||
def __init__(self, index_file_path: str):
|
||||
"""Initialize parser with INDEX_MASTER file path"""
|
||||
self.index_file_path = Path(index_file_path)
|
||||
self.content = ""
|
||||
self.activities = []
|
||||
|
||||
# Category mapping for main sections (exact match from file)
|
||||
self.category_mapping = {
|
||||
'[A]': 'JOCURI CERCETĂȘEȘTI ȘI SCOUT',
|
||||
'[B]': 'TEAM BUILDING ȘI COMUNICARE',
|
||||
'[C]': 'CAMPING ȘI ACTIVITĂȚI EXTERIOR',
|
||||
'[D]': 'ESCAPE ROOM ȘI PUZZLE-URI',
|
||||
'[E]': 'ORIENTARE ȘI BUSOLE',
|
||||
'[F]': 'PRIMUL AJUTOR ȘI SIGURANȚA',
|
||||
'[G]': 'ACTIVITĂȚI EDUCAȚIONALE',
|
||||
'[H]': 'RESURSE SPECIALE'
|
||||
}
|
||||
|
||||
def load_content(self) -> bool:
|
||||
"""Load and validate INDEX_MASTER content"""
|
||||
try:
|
||||
if not self.index_file_path.exists():
|
||||
print(f"❌ INDEX_MASTER file not found: {self.index_file_path}")
|
||||
return False
|
||||
|
||||
with open(self.index_file_path, 'r', encoding='utf-8') as f:
|
||||
self.content = f.read()
|
||||
|
||||
if len(self.content) < 1000: # Sanity check
|
||||
print(f"⚠️ INDEX_MASTER file seems too small: {len(self.content)} chars")
|
||||
return False
|
||||
|
||||
print(f"✅ Loaded INDEX_MASTER: {len(self.content)} characters")
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Error loading INDEX_MASTER: {e}")
|
||||
return False
|
||||
|
||||
def parse_all_categories(self) -> List[Activity]:
|
||||
"""Parse all categories and extract individual activities"""
|
||||
if not self.load_content():
|
||||
return []
|
||||
|
||||
print("🔍 Starting comprehensive parsing of INDEX_MASTER...")
|
||||
|
||||
# Parse each main category
|
||||
for category_code, category_name in self.category_mapping.items():
|
||||
print(f"\n📂 Processing category {category_code}: {category_name}")
|
||||
category_activities = self.parse_category_section(category_code, category_name)
|
||||
self.activities.extend(category_activities)
|
||||
print(f" ✅ Extracted {len(category_activities)} activities")
|
||||
|
||||
print(f"\n🎯 Total activities extracted: {len(self.activities)}")
|
||||
return self.activities
|
||||
|
||||
def parse_category_section(self, category_code: str, category_name: str) -> List[Activity]:
|
||||
"""Parse a specific category section"""
|
||||
activities = []
|
||||
|
||||
# Find the category section - exact pattern match
|
||||
# Look for the actual section, not the table of contents
|
||||
pattern = rf"^## {re.escape(category_code)} {re.escape(category_name)}\s*$"
|
||||
matches = list(re.finditer(pattern, self.content, re.MULTILINE | re.IGNORECASE))
|
||||
|
||||
if not matches:
|
||||
print(f" ⚠️ Category section not found: {category_code}")
|
||||
return activities
|
||||
|
||||
# Take the last match (should be the actual section, not TOC)
|
||||
match = matches[-1]
|
||||
print(f" 📍 Found section at position {match.start()}")
|
||||
|
||||
# Extract content until next main category or end
|
||||
start_pos = match.end()
|
||||
|
||||
# Find next main category (look for complete header)
|
||||
next_category_pattern = r"^## \[[A-H]\] [A-ZĂÂÎȘȚ]"
|
||||
next_match = re.search(next_category_pattern, self.content[start_pos:], re.MULTILINE)
|
||||
|
||||
if next_match:
|
||||
end_pos = start_pos + next_match.start()
|
||||
section_content = self.content[start_pos:end_pos]
|
||||
else:
|
||||
section_content = self.content[start_pos:]
|
||||
|
||||
# Parse subsections within the category
|
||||
activities.extend(self._parse_subsections(section_content, category_name))
|
||||
|
||||
return activities
|
||||
|
||||
def _parse_subsections(self, section_content: str, category_name: str) -> List[Activity]:
|
||||
"""Parse subsections within a category"""
|
||||
activities = []
|
||||
|
||||
# Find all subsections (### markers)
|
||||
subsection_pattern = r"^### (.+?)$"
|
||||
subsections = re.finditer(subsection_pattern, section_content, re.MULTILINE)
|
||||
|
||||
subsection_list = list(subsections)
|
||||
|
||||
for i, subsection in enumerate(subsection_list):
|
||||
subsection_title = subsection.group(1).strip()
|
||||
subsection_start = subsection.end()
|
||||
|
||||
# Find end of subsection
|
||||
if i + 1 < len(subsection_list):
|
||||
subsection_end = subsection_list[i + 1].start()
|
||||
else:
|
||||
subsection_end = len(section_content)
|
||||
|
||||
subsection_text = section_content[subsection_start:subsection_end]
|
||||
|
||||
# Parse individual games in this subsection
|
||||
subsection_activities = self._parse_games_in_subsection(
|
||||
subsection_text, category_name, subsection_title
|
||||
)
|
||||
activities.extend(subsection_activities)
|
||||
|
||||
return activities
|
||||
|
||||
    def _parse_games_in_subsection(self, subsection_text: str, category_name: str, subsection_title: str) -> List[Activity]:
        """Parse individual games within a subsection"""
        activities = []

        # Look for "Exemple de jocuri:" sections
        examples_pattern = r"\*\*Exemple de jocuri:\*\*\s*\n(.*?)(?=\n\*\*|$)"
        examples_matches = re.finditer(examples_pattern, subsection_text, re.DOTALL)

        for examples_match in examples_matches:
            examples_text = examples_match.group(1)

            # Extract individual games (numbered list)
            game_pattern = r"^(\d+)\.\s*\*\*(.+?)\*\*\s*-\s*(.+?)$"
            games = re.finditer(game_pattern, examples_text, re.MULTILINE)

            for game_match in games:
                game_number = game_match.group(1)
                game_name = game_match.group(2).strip()
                game_description = game_match.group(3).strip()

                # Extract metadata from subsection
                metadata = self._extract_subsection_metadata(subsection_text)

                # Create activity
                activity = Activity(
                    name=game_name,
                    description=game_description,
                    category=category_name,
                    subcategory=subsection_title,
                    # Plain string: the original f-string had no placeholders
                    source_file="INDEX_MASTER_JOCURI_ACTIVITATI.md",
                    page_reference=f"{category_name} > {subsection_title} > #{game_number}",
                    **metadata
                )

                activities.append(activity)

        # Also extract from direct activity descriptions without "Exemple de jocuri"
        activities.extend(self._parse_direct_activities(subsection_text, category_name, subsection_title))

        return activities
|
||||
|
||||
def _extract_subsection_metadata(self, subsection_text: str) -> Dict:
|
||||
"""Extract metadata from subsection text"""
|
||||
metadata = {}
|
||||
|
||||
# Extract participants info
|
||||
participants_pattern = r"\*\*Participanți:\*\*\s*(.+?)(?:\n|\*\*)"
|
||||
participants_match = re.search(participants_pattern, subsection_text)
|
||||
if participants_match:
|
||||
participants_text = participants_match.group(1).strip()
|
||||
participants = self._parse_participants(participants_text)
|
||||
metadata.update(participants)
|
||||
|
||||
# Extract duration
|
||||
duration_pattern = r"\*\*Durata:\*\*\s*(.+?)(?:\n|\*\*)"
|
||||
duration_match = re.search(duration_pattern, subsection_text)
|
||||
if duration_match:
|
||||
duration_text = duration_match.group(1).strip()
|
||||
duration = self._parse_duration(duration_text)
|
||||
metadata.update(duration)
|
||||
|
||||
# Extract materials
|
||||
materials_pattern = r"\*\*Materiale:\*\*\s*(.+?)(?:\n|\*\*)"
|
||||
materials_match = re.search(materials_pattern, subsection_text)
|
||||
if materials_match:
|
||||
materials_text = materials_match.group(1).strip()
|
||||
metadata['materials_list'] = materials_text
|
||||
metadata['materials_category'] = self._categorize_materials(materials_text)
|
||||
|
||||
# Extract keywords
|
||||
keywords_pattern = r"\*\*Cuvinte cheie:\*\*\s*(.+?)(?:\n|\*\*)"
|
||||
keywords_match = re.search(keywords_pattern, subsection_text)
|
||||
if keywords_match:
|
||||
metadata['keywords'] = keywords_match.group(1).strip()
|
||||
|
||||
return metadata
|
||||
|
||||
    def _parse_participants(self, participants_text: str) -> Dict:
        """Parse participants information"""
        result = {}

        # Look for number ranges like "8-30 copii" or "5-15 persoane"
        range_pattern = r"(\d+)-(\d+)"
        range_match = re.search(range_pattern, participants_text)

        if range_match:
            result['participants_min'] = int(range_match.group(1))
            result['participants_max'] = int(range_match.group(2))
        else:
            # Look for open-ended minimums like "10+" (a bare number without "+" is not matched)
            number_pattern = r"(\d+)\+"
            number_match = re.search(number_pattern, participants_text)
            if number_match:
                result['participants_min'] = int(number_match.group(1))

        # Extract age information
        age_pattern = r"(\d+)-(\d+)\s*ani"
        age_match = re.search(age_pattern, participants_text)
        if age_match:
            result['age_group_min'] = int(age_match.group(1))
            result['age_group_max'] = int(age_match.group(2))

        return result

    def _parse_duration(self, duration_text: str) -> Dict:
        """Parse duration information"""
        result = {}

        # Look for time ranges like "5-20 minute" or "15-30min"
        range_pattern = r"(\d+)-(\d+)\s*(?:minute|min)"
        range_match = re.search(range_pattern, duration_text)

        if range_match:
            result['duration_min'] = int(range_match.group(1))
            result['duration_max'] = int(range_match.group(2))
        else:
            # Look for single duration
            single_pattern = r"(\d+)\+?\s*(?:minute|min)"
            single_match = re.search(single_pattern, duration_text)
            if single_match:
                result['duration_min'] = int(single_match.group(1))

        return result
|
||||
|
||||
def _categorize_materials(self, materials_text: str) -> str:
|
||||
"""Categorize materials into simple categories"""
|
||||
materials_lower = materials_text.lower()
|
||||
|
||||
if any(word in materials_lower for word in ['fără', 'nu necesare', 'nimic', 'minime']):
|
||||
return 'Fără materiale'
|
||||
elif any(word in materials_lower for word in ['hârtie', 'creion', 'marker', 'simple']):
|
||||
return 'Materiale simple'
|
||||
elif any(word in materials_lower for word in ['computer', 'proiector', 'echipament', 'complexe']):
|
||||
return 'Materiale complexe'
|
||||
else:
|
||||
return 'Materiale variate'
|
||||
|
||||
def _parse_direct_activities(self, subsection_text: str, category_name: str, subsection_title: str) -> List[Activity]:
|
||||
"""Parse activities that are described directly without 'Exemple de jocuri' section"""
|
||||
activities = []
|
||||
|
||||
# Look for activity descriptions in sections that don't have "Exemple de jocuri"
|
||||
if "**Exemple de jocuri:**" not in subsection_text:
|
||||
# Try to extract from file descriptions
|
||||
file_pattern = r"\*\*Fișier:\*\*\s*`([^`]+)`.*?\*\*(.+?)\*\*"
|
||||
file_matches = re.finditer(file_pattern, subsection_text, re.DOTALL)
|
||||
|
||||
for file_match in file_matches:
|
||||
file_name = file_match.group(1)
|
||||
description_part = file_match.group(2)
|
||||
|
||||
# Create a general activity for this file
|
||||
activity = Activity(
|
||||
name=f"Activități din {file_name}",
|
||||
description=f"Colecție de activități din fișierul {file_name}. {description_part[:200]}...",
|
||||
category=category_name,
|
||||
subcategory=subsection_title,
|
||||
source_file=file_name,
|
||||
page_reference=f"{category_name} > {subsection_title}",
|
||||
**self._extract_subsection_metadata(subsection_text)
|
||||
)
|
||||
|
||||
activities.append(activity)
|
||||
|
||||
return activities
|
||||
|
||||
def validate_activity_completeness(self, activity: Activity) -> bool:
|
||||
"""Validate that an activity has all necessary fields"""
|
||||
required_fields = ['name', 'description', 'category', 'source_file']
|
||||
|
||||
for field in required_fields:
|
||||
if not getattr(activity, field) or not getattr(activity, field).strip():
|
||||
return False
|
||||
|
||||
# Check minimum description length
|
||||
if len(activity.description) < 10:
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
def get_parsing_statistics(self) -> Dict:
|
||||
"""Get statistics about the parsing process"""
|
||||
if not self.activities:
|
||||
return {'total_activities': 0}
|
||||
|
||||
category_counts = {}
|
||||
valid_activities = 0
|
||||
|
||||
for activity in self.activities:
|
||||
# Count by category
|
||||
if activity.category in category_counts:
|
||||
category_counts[activity.category] += 1
|
||||
else:
|
||||
category_counts[activity.category] = 1
|
||||
|
||||
# Count valid activities
|
||||
if self.validate_activity_completeness(activity):
|
||||
valid_activities += 1
|
||||
|
||||
return {
|
||||
'total_activities': len(self.activities),
|
||||
'valid_activities': valid_activities,
|
||||
'completion_rate': (valid_activities / len(self.activities)) * 100 if self.activities else 0,
|
||||
'category_breakdown': category_counts,
|
||||
'average_description_length': sum(len(a.description) for a in self.activities) / len(self.activities) if self.activities else 0
|
||||
}
|
||||
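To make the removed parser's behaviour easier to review, here is a minimal sketch that exercises three of its regular expressions in isolation. The patterns are copied from the code above; the sample text and the helper names `extract_games` and `parse_duration_range` are hypothetical and exist only for this illustration.

```python
import re
from typing import List, Optional, Tuple

# Patterns copied verbatim from the parser above.
GAME_PATTERN = r"^(\d+)\.\s*\*\*(.+?)\*\*\s*-\s*(.+?)$"
DURATION_RANGE = r"(\d+)-(\d+)\s*(?:minute|min)"
PARTICIPANTS_RANGE = r"(\d+)-(\d+)"

# Hypothetical sample, written only to exercise the patterns.
SAMPLE = (
    "1. **Vânătoarea de comori** - Echipele caută indicii ascunse.\n"
    "2. **Nodul uman** - Grupul se descâlcește fără a da drumul mâinilor."
)


def extract_games(text: str) -> List[Tuple[int, str, str]]:
    """Return (number, name, description) for each numbered bold entry."""
    return [
        (int(m.group(1)), m.group(2).strip(), m.group(3).strip())
        for m in re.finditer(GAME_PATTERN, text, re.MULTILINE)
    ]


def parse_duration_range(text: str) -> Optional[Tuple[int, int]]:
    """Return (min, max) minutes for inputs like '5-20 minute', else None."""
    m = re.search(DURATION_RANGE, text)
    return (int(m.group(1)), int(m.group(2))) if m else None


if __name__ == "__main__":
    for game in extract_games(SAMPLE):
        print(game)
    print(parse_duration_range("15-30min"))
```

One consequence worth noting: the participants range pattern carries no unit, so a text containing only an age range such as "8-12 ani" also matches it and would be stored as a participant count.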
@@ -5,8 +5,19 @@ Enhanced search with FTS5 and intelligent filtering
|
||||
|
||||
from typing import List, Dict, Any, Optional
|
||||
from app.models.database import DatabaseManager
|
||||
from app.config_taxonomy import NON_GAME_CONTENT_TYPES
|
||||
import re
|
||||
|
||||
# Category slugs that are themselves "non-game" — selecting one of these as a
|
||||
# category filter also lifts the default non-game content_type exclusion.
|
||||
NON_GAME_CATEGORIES = {"retete", "cantece-ceremonii"}
|
||||
|
||||
# When a Python-side post-filter is active the DB LIMIT is applied *before*
|
||||
# filtering, so we over-fetch to still satisfy the caller's `limit`.
|
||||
_OVERSCAN_FACTOR = 5
|
||||
_OVERSCAN_CAP = 2000
|
||||
|
||||
|
||||
class SearchService:
|
||||
"""Enhanced search service with intelligent query processing"""
|
||||
|
||||
@@ -24,22 +35,72 @@ class SearchService:
|
||||
|
||||
if filters is None:
|
||||
filters = {}
|
||||
|
||||
|
||||
# Process and normalize search text
|
||||
processed_search = self._process_search_text(search_text)
|
||||
|
||||
|
||||
# Map web filters to database fields
|
||||
db_filters = self._map_filters_to_db_fields(filters)
|
||||
|
||||
|
||||
# content_type and language are filtered in Python: the DB layer does
|
||||
# not expose them as query parameters. The DEFAULT search excludes the
|
||||
# non-game content types (rețete / cântece / ceremonii) — they surface
|
||||
# only when the user explicitly filters that content_type, or picks a
|
||||
# non-game category. See plan §6.
|
||||
content_type, exclude_non_game = self._resolve_content_type_filter(filters)
|
||||
language = (filters.get('language') or '').strip().lower() or None
|
||||
post_filtering = bool(content_type or exclude_non_game or language)
|
||||
|
||||
# Over-fetch when post-filtering so the final list can still reach `limit`.
|
||||
fetch_limit = min(limit * _OVERSCAN_FACTOR, _OVERSCAN_CAP) if post_filtering else limit
|
||||
|
||||
# Perform database search
|
||||
results = self.db.search_activities(
|
||||
search_text=processed_search,
|
||||
**db_filters,
|
||||
limit=limit
|
||||
limit=fetch_limit
|
||||
)
|
||||
|
||||
# Post-process results for relevance and ranking
|
||||
return self._post_process_results(results, processed_search, filters)
|
||||
|
||||
# Apply content_type / language post-filters
|
||||
results = self._apply_content_type_filter(results, content_type, exclude_non_game)
|
||||
if language:
|
||||
results = [r for r in results
|
||||
if (r.get('language') or '').strip().lower() == language]
|
||||
|
||||
# Post-process results for relevance and ranking, then honour `limit`
|
||||
results = self._post_process_results(results, processed_search, filters)
|
||||
return results[:limit]
|
||||
|
||||
    def _resolve_content_type_filter(self, filters: Dict[str, str]):
        """Determine the content_type post-filter.

        Returns (explicit_content_type | None, exclude_non_game: bool):
        - an explicit `content_type` filter → that value, no exclusion;
        - a `category` filter on a non-game category → no exclusion;
        - otherwise → default search, exclude non-game content types.
        """
        content_type = (filters.get('content_type') or '').strip()
        if content_type:
            return content_type, False
        category = (filters.get('category') or '').strip()
        if category in NON_GAME_CATEGORIES:
            return None, False
        return None, True

    def _apply_content_type_filter(self,
                                   results: List[Dict[str, Any]],
                                   content_type: Optional[str],
                                   exclude_non_game: bool) -> List[Dict[str, Any]]:
        """Filter results by content_type (explicit include vs default exclude)."""
        if content_type:
            return [r for r in results
                    if (r.get('content_type') or '') == content_type]
        if exclude_non_game:
            # Rows with NULL/unknown content_type are kept; only the known
            # non-game types are dropped from the default search.
            return [r for r in results
                    if (r.get('content_type') or '') not in NON_GAME_CONTENT_TYPES]
        return results
|
||||
|
||||
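The filter resolution and over-fetch logic introduced in this hunk can be tried standalone. This is a minimal sketch, not the service itself: the members of `NON_GAME_CONTENT_TYPES` are assumed placeholders (the real set is imported from `app.config_taxonomy` and is not shown in this diff), and `fetch_limit` is a hypothetical helper that wraps the inline expression from `search_activities()`.

```python
from typing import Any, Dict, List, Optional, Tuple

# ASSUMPTION: placeholder slugs; the real values live in app.config_taxonomy.
NON_GAME_CONTENT_TYPES = {"reteta", "cantec", "ceremonie"}
NON_GAME_CATEGORIES = {"retete", "cantece-ceremonii"}
_OVERSCAN_FACTOR = 5
_OVERSCAN_CAP = 2000


def resolve_content_type_filter(filters: Dict[str, str]) -> Tuple[Optional[str], bool]:
    """Return (explicit content_type or None, exclude_non_game)."""
    content_type = (filters.get("content_type") or "").strip()
    if content_type:
        return content_type, False
    category = (filters.get("category") or "").strip()
    if category in NON_GAME_CATEGORIES:
        return None, False
    return None, True


def apply_content_type_filter(results: List[Dict[str, Any]],
                              content_type: Optional[str],
                              exclude_non_game: bool) -> List[Dict[str, Any]]:
    """Explicit include wins; otherwise drop known non-game types if asked."""
    if content_type:
        return [r for r in results if (r.get("content_type") or "") == content_type]
    if exclude_non_game:
        return [r for r in results
                if (r.get("content_type") or "") not in NON_GAME_CONTENT_TYPES]
    return results


def fetch_limit(limit: int, post_filtering: bool) -> int:
    """Over-fetch when a Python-side filter will shrink the result set."""
    return min(limit * _OVERSCAN_FACTOR, _OVERSCAN_CAP) if post_filtering else limit


if __name__ == "__main__":
    rows = [{"id": 1, "content_type": "reteta"}, {"id": 2, "content_type": None}]
    content_type, exclude = resolve_content_type_filter({})
    print(apply_content_type_filter(rows, content_type, exclude))
    print(fetch_limit(20, True))
```

Since the default search always sets `exclude_non_game`, it always over-fetches. The factor of 5 is a heuristic: if more than four fifths of the fetched rows are filtered out, the final list can still be shorter than `limit`.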
def _process_search_text(self, search_text: Optional[str]) -> Optional[str]:
|
||||
"""Process and enhance search text for better FTS5 results"""
|
||||
@@ -83,10 +144,16 @@ class SearchService:
|
||||
if not filter_value or not filter_value.strip():
|
||||
continue
|
||||
|
||||
# content_type / language are NOT database query params — they are
|
||||
# applied as Python post-filters in search_activities(). Skip them
|
||||
# here so they never reach DatabaseManager.search_activities().
|
||||
if filter_key in ('content_type', 'language'):
|
||||
continue
|
||||
|
||||
# Map filter types to database fields
|
||||
if filter_key == 'category':
|
||||
db_filters['category'] = filter_value
|
||||
|
||||
|
||||
elif filter_key == 'age_group':
|
||||
# Parse age range (e.g., "5-8 ani", "12+ ani")
|
||||
age_match = re.search(r'(\d+)(?:-(\d+))?\s*ani?', filter_value)
|
||||
@@ -133,7 +200,14 @@ class SearchService:
|
||||
|
||||
elif filter_key == 'difficulty':
|
||||
db_filters['difficulty_level'] = filter_value
|
||||
|
||||
|
||||
elif filter_key == 'indoor_outdoor':
|
||||
# Equality filter on the slug column (mirror difficulty).
|
||||
db_filters['indoor_outdoor'] = filter_value
|
||||
|
||||
elif filter_key == 'space_needed':
|
||||
db_filters['space_needed'] = filter_value
|
||||
|
||||
# Handle any other custom filters
|
||||
else:
|
||||
# Generic filter handling - try to match against keywords or tags
|
||||
@@ -177,21 +251,22 @@ class SearchService:
|
||||
boost_score = 0
|
||||
|
||||
# Check name matches (highest priority)
|
||||
name_lower = result.get('name', '').lower()
|
||||
# NB: use `or ''` — nullable columns come back as None, not ''.
|
||||
name_lower = (result.get('name') or '').lower()
|
||||
for term in search_terms:
|
||||
if term in name_lower:
|
||||
boost_score += 10
|
||||
if name_lower.startswith(term):
|
||||
boost_score += 5 # Extra boost for name starts with term
|
||||
|
||||
|
||||
# Check description matches
|
||||
desc_lower = result.get('description', '').lower()
|
||||
desc_lower = (result.get('description') or '').lower()
|
||||
for term in search_terms:
|
||||
if term in desc_lower:
|
||||
boost_score += 3
|
||||
|
||||
|
||||
# Check keywords matches
|
||||
keywords_lower = result.get('keywords', '').lower()
|
||||
keywords_lower = (result.get('keywords') or '').lower()
|
||||
for term in search_terms:
|
||||
if term in keywords_lower:
|
||||
boost_score += 5
|
||||
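The switch from `result.get('name', '')` to `(result.get('name') or '')` matters because the default passed to `dict.get` is used only when the key is absent, not when the stored value is `None`. A minimal sketch follows, using the boost weights visible in this hunk; `boost_score` is a hypothetical standalone helper, and the real method may apply further adjustments not shown here.

```python
from typing import Any, Dict, List


def boost_score(result: Dict[str, Any], search_terms: List[str]) -> int:
    """Score a row with the weights visible in the hunk above."""
    score = 0
    # `or ''` guards against keys that exist but hold None (NULL columns).
    name_lower = (result.get("name") or "").lower()
    desc_lower = (result.get("description") or "").lower()
    keywords_lower = (result.get("keywords") or "").lower()
    for term in search_terms:
        if term in name_lower:
            score += 10
            if name_lower.startswith(term):
                score += 5  # extra boost when the name starts with the term
        if term in desc_lower:
            score += 3
        if term in keywords_lower:
            score += 5
    return score


if __name__ == "__main__":
    row = {"name": "Nodul uman", "description": None, "keywords": "nod, echipă"}
    # The default is ignored because the key exists with a None value.
    print(row.get("description", "") is None)
    print(boost_score(row, ["nod"]))
```

With the old form, the `None` description in this row would raise `AttributeError` on `.lower()`.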
@@ -280,11 +355,14 @@ class SearchService:
|
||||
return []
|
||||
|
||||
try:
|
||||
# Search for activities that match the partial query
|
||||
# Search for activities that match the partial query.
|
||||
# Over-fetch then drop non-game content types so autocomplete
|
||||
# mirrors the default search (no rețete / cântece / ceremonii).
|
||||
results = self.db.search_activities(
|
||||
search_text=f'"{partial_query}"',
|
||||
limit=limit * 2
|
||||
limit=limit * 6
|
||||
)
|
||||
results = self._apply_content_type_filter(results, None, True)
|
||||
|
||||
suggestions = []
|
||||
seen = set()
|
||||
|
||||
@@ -705,4 +705,30 @@ body {
|
||||
box-shadow: none;
|
||||
border: 1px solid #ddd;
|
||||
}
|
||||
}
|
||||
|
||||
/* Enrichment markers (plan Part A7) */
|
||||
.estimated {
|
||||
color: #8a6d3b;
|
||||
font-style: italic;
|
||||
font-size: 0.85em;
|
||||
font-weight: normal;
|
||||
}
|
||||
|
||||
.original-text > summary {
|
||||
cursor: pointer;
|
||||
color: #555;
|
||||
user-select: none;
|
||||
}
|
||||
|
||||
.original-text .original-content {
|
||||
margin-top: 0.75rem;
|
||||
padding-left: 1rem;
|
||||
border-left: 3px solid #e0e0e0;
|
||||
color: #555;
|
||||
}
|
||||
|
||||
.download-hint {
|
||||
color: #888;
|
||||
font-size: 0.85em;
|
||||
}
|
||||
@@ -8,14 +8,20 @@
|
||||
<nav class="breadcrumb">
|
||||
<a href="{{ url_for('main.index') }}">Căutare</a>
|
||||
<span class="breadcrumb-separator">»</span>
|
||||
<span class="breadcrumb-current">{{ activity.name }}</span>
|
||||
<span class="breadcrumb-current">{{ activity.get_display_name() }}</span>
|
||||
</nav>
|
||||
|
||||
<!-- Activity header -->
|
||||
<header class="activity-detail-header">
|
||||
<div class="activity-title-section">
|
||||
<h1 class="activity-detail-title">{{ activity.name }}</h1>
|
||||
<span class="activity-category-badge">{{ activity.category }}</span>
|
||||
<h1 class="activity-detail-title">{{ activity.get_display_name() }}</h1>
|
||||
<span class="activity-category-badge">{{ display_names.get(activity.category, activity.category) }}</span>
|
||||
{% if activity.content_type %}
|
||||
<span class="activity-content-type-badge">{{ display_names.get(activity.content_type, activity.content_type) }}</span>
|
||||
{% endif %}
|
||||
{% if activity.needs_review %}
|
||||
<span class="activity-badge needs-review" title="Această activitate necesită verificare">⚠ De verificat</span>
|
||||
{% endif %}
|
||||
</div>
|
||||
|
||||
{% if activity.subcategory %}
|
||||
@@ -25,27 +31,46 @@
|
||||
|
||||
<!-- Activity content -->
|
||||
<div class="activity-detail-content">
|
||||
<!-- Main description -->
|
||||
<!-- Main description (Romanian-primary, falls back to original) -->
|
||||
<section class="activity-section">
|
||||
<h2 class="section-title">Descriere</h2>
|
||||
<div class="activity-description">{{ activity.description }}</div>
|
||||
<div class="activity-description">{{ activity.get_display_description() }}</div>
|
||||
</section>
|
||||
|
||||
<!-- Rules and variations -->
|
||||
{% if activity.rules %}
|
||||
{% if activity.get_display_rules() %}
|
||||
<section class="activity-section">
|
||||
<h2 class="section-title">Reguli</h2>
|
||||
<div class="activity-rules">{{ activity.rules }}</div>
|
||||
<div class="activity-rules">{{ activity.get_display_rules() }}</div>
|
||||
</section>
|
||||
{% endif %}
|
||||
|
||||
{% if activity.variations %}
|
||||
{% if activity.get_display_variations() %}
|
||||
<section class="activity-section">
|
||||
<h2 class="section-title">Variații</h2>
|
||||
<div class="activity-variations">{{ activity.variations }}</div>
|
||||
<div class="activity-variations">{{ activity.get_display_variations() }}</div>
|
||||
</section>
|
||||
{% endif %}
|
||||
|
||||
<!-- Original (pre-translation) text, collapsed by default -->
|
||||
{% if activity.has_translation() %}
|
||||
<details class="activity-section original-text">
|
||||
<summary class="section-title">Text original ({{ display_names.get(activity.language, activity.language or 'sursă') }})</summary>
|
||||
<div class="original-content">
|
||||
<h3 class="metadata-title">{{ activity.name }}</h3>
|
||||
<div class="activity-description">{{ activity.description }}</div>
|
||||
{% if activity.rules %}
|
||||
<h4 class="metadata-title">Reguli</h4>
|
||||
<div class="activity-rules">{{ activity.rules }}</div>
|
||||
{% endif %}
|
||||
{% if activity.variations %}
|
||||
<h4 class="metadata-title">Variații</h4>
|
||||
<div class="activity-variations">{{ activity.variations }}</div>
|
||||
{% endif %}
|
||||
</div>
|
||||
</details>
|
||||
{% endif %}
|
||||
|
||||
<!-- Metadata grid -->
|
||||
<section class="activity-section">
|
||||
<h2 class="section-title">Detalii activitate</h2>
|
||||
@@ -53,21 +78,35 @@
|
||||
{% if activity.get_age_range_display() != "toate vârstele" %}
|
||||
<div class="metadata-card">
|
||||
<h3 class="metadata-title">Grupa de vârstă</h3>
|
||||
<p class="metadata-value">{{ activity.get_age_range_display() }}</p>
|
||||
<p class="metadata-value">{{ activity.get_age_range_display() }}{% if activity.is_estimated('age_group_min') or activity.is_estimated('age_group_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
{% if activity.get_participants_display() != "orice număr" %}
|
||||
<div class="metadata-card">
|
||||
<h3 class="metadata-title">Participanți</h3>
|
||||
<p class="metadata-value">{{ activity.get_participants_display() }}</p>
|
||||
<p class="metadata-value">{{ activity.get_participants_display() }}{% if activity.is_estimated('participants_min') or activity.is_estimated('participants_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
{% if activity.get_duration_display() != "durată variabilă" %}
|
||||
<div class="metadata-card">
|
||||
<h3 class="metadata-title">Durata</h3>
|
||||
<p class="metadata-value">{{ activity.get_duration_display() }}</p>
|
||||
<p class="metadata-value">{{ activity.get_duration_display() }}{% if activity.is_estimated('duration_min') or activity.is_estimated('duration_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
{% if activity.get_indoor_outdoor_display() %}
|
||||
<div class="metadata-card">
|
||||
<h3 class="metadata-title">Interior / exterior</h3>
|
||||
<p class="metadata-value">{{ activity.get_indoor_outdoor_display() }}{% if activity.is_estimated('indoor_outdoor') %} <em class="estimated">(estimat)</em>{% endif %}</p>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
{% if activity.get_space_needed_display() %}
|
||||
<div class="metadata-card">
|
||||
<h3 class="metadata-title">Spațiu necesar</h3>
|
||||
<p class="metadata-value">{{ activity.get_space_needed_display() }}{% if activity.is_estimated('space_needed') %} <em class="estimated">(estimat)</em>{% endif %}</p>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
@@ -119,9 +158,15 @@
|
||||
<h2 class="section-title">Informații sursă</h2>
|
||||
<div class="source-info">
|
||||
{% if activity.source_file %}
|
||||
{% if config.SOURCE_DOWNLOAD_ENABLED %}
|
||||
<p><strong>Fișier sursă:</strong>
|
||||
<a href="{{ url_for('main.source_download', activity_id=activity.id) }}">{{ activity.source_file }}</a>
|
||||
<span class="download-hint">(descarcă)</span></p>
|
||||
{% else %}
|
||||
<p><strong>Fișier sursă:</strong> {{ activity.source_file }}</p>
|
||||
{% endif %}
|
||||
|
||||
{% endif %}
|
||||
|
||||
{% if activity.page_reference %}
|
||||
<p><strong>Referință:</strong> {{ activity.page_reference }}</p>
|
||||
{% endif %}
|
||||
|
||||
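The template above relies on model helpers (`get_display_name()`, `get_display_description()`, `has_translation()`, `is_estimated()`) whose implementation is not part of this section. The sketch below shows one way such helpers could behave, given the "Romanian-primary, falls back to original" comment in the template. Every field name here (`name_ro`, `description_ro`, `estimated_fields`) is an assumption made for illustration, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ActivitySketch:
    # HYPOTHETICAL fields: the real Activity model is not shown in this diff.
    name: str
    description: str
    name_ro: Optional[str] = None
    description_ro: Optional[str] = None
    estimated_fields: Set[str] = field(default_factory=set)

    def get_display_name(self) -> str:
        """Prefer the Romanian text, fall back to the original."""
        return self.name_ro or self.name

    def get_display_description(self) -> str:
        return self.description_ro or self.description

    def has_translation(self) -> bool:
        """True when any translated text exists, so the original is worth showing."""
        return bool(self.name_ro or self.description_ro)

    def is_estimated(self, field_name: str) -> bool:
        """True when the value was inferred during enrichment rather than sourced."""
        return field_name in self.estimated_fields


if __name__ == "__main__":
    activity = ActivitySketch(
        name="Human Knot",
        description="Untangle without letting go.",
        name_ro="Nodul uman",
        estimated_fields={"duration_min"},
    )
    print(activity.get_display_name())
    print(activity.is_estimated("duration_min"))
```

Under these assumptions, the template's "(estimat)" marker appears only for fields listed in `estimated_fields`, and the collapsed "Text original" block renders only when `has_translation()` is true.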
@@ -36,7 +36,31 @@
|
||||
<select name="category" id="category" class="filter-select">
|
||||
<option value="">Toate categoriile</option>
|
||||
{% for category in filters.category %}
|
||||
<option value="{{ category }}">{{ category }}</option>
|
||||
<option value="{{ category }}">{{ display_names.get(category, category) }}</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
{% if filters.content_type %}
|
||||
<div class="filter-group">
|
||||
<label for="content_type" class="filter-label">Tip conținut</label>
|
||||
<select name="content_type" id="content_type" class="filter-select">
|
||||
<option value="">Doar jocuri și activități</option>
|
||||
{% for content_type in filters.content_type %}
|
||||
<option value="{{ content_type }}">{{ display_names.get(content_type, content_type) }}</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
{% if filters.language %}
|
||||
<div class="filter-group">
|
||||
<label for="language" class="filter-label">Limbă</label>
|
||||
<select name="language" id="language" class="filter-select">
|
||||
<option value="">Toate limbile</option>
|
||||
{% for language in filters.language %}
|
||||
<option value="{{ language }}">{{ display_names.get(language, language) }}</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
</div>
|
||||
@@ -101,6 +125,30 @@
|
||||
</select>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
{% if filters.indoor_outdoor %}
|
||||
<div class="filter-group">
|
||||
<label for="indoor_outdoor" class="filter-label">Interior / exterior</label>
|
||||
<select name="indoor_outdoor" id="indoor_outdoor" class="filter-select">
|
||||
<option value="">Oriunde</option>
|
||||
{% for io in filters.indoor_outdoor %}
|
||||
<option value="{{ io }}">{{ display_names.get(io, io) }}</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
{% if filters.space_needed %}
|
||||
<div class="filter-group">
|
||||
<label for="space_needed" class="filter-label">Spațiu necesar</label>
|
||||
<select name="space_needed" id="space_needed" class="filter-select">
|
||||
<option value="">Orice spațiu</option>
|
||||
{% for sp in filters.space_needed %}
|
||||
<option value="{{ sp }}">{{ display_names.get(sp, sp) }}</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
</div>
|
||||
{% endif %}
|
||||
{% endif %}
|
||||
</div>
|
||||
|
||||
|
||||
@@ -24,7 +24,29 @@
|
||||
<option value="">Toate categoriile</option>
|
||||
{% for category in filters.category %}
|
||||
<option value="{{ category }}" {% if applied_filters.category == category %}selected{% endif %}>
|
||||
{{ category }}
|
||||
{{ display_names.get(category, category) }}
|
||||
</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
{% endif %}
|
||||
|
||||
{% if filters.content_type %}
|
||||
<select name="content_type" class="filter-select compact">
|
||||
<option value="">Doar jocuri și activități</option>
|
||||
{% for content_type in filters.content_type %}
|
||||
<option value="{{ content_type }}" {% if applied_filters.content_type == content_type %}selected{% endif %}>
|
||||
{{ display_names.get(content_type, content_type) }}
|
||||
</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
{% endif %}
|
||||
|
||||
{% if filters.language %}
|
||||
<select name="language" class="filter-select compact">
|
||||
<option value="">Toate limbile</option>
|
||||
{% for language in filters.language %}
|
||||
<option value="{{ language }}" {% if applied_filters.language == language %}selected{% endif %}>
|
||||
{{ display_names.get(language, language) }}
|
||||
</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
@@ -63,6 +85,28 @@
|
||||
</select>
|
||||
{% endif %}
|
||||
|
||||
{% if filters.indoor_outdoor %}
|
||||
<select name="indoor_outdoor" class="filter-select compact">
|
||||
<option value="">Oriunde</option>
|
||||
{% for io in filters.indoor_outdoor %}
|
||||
<option value="{{ io }}" {% if applied_filters.indoor_outdoor == io %}selected{% endif %}>
|
||||
{{ display_names.get(io, io) }}
|
||||
</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
{% endif %}
|
||||
|
||||
{% if filters.space_needed %}
|
||||
<select name="space_needed" class="filter-select compact">
|
||||
<option value="">Orice spațiu</option>
|
||||
{% for sp in filters.space_needed %}
|
||||
<option value="{{ sp }}" {% if applied_filters.space_needed == sp %}selected{% endif %}>
|
||||
{{ display_names.get(sp, sp) }}
|
||||
</option>
|
||||
{% endfor %}
|
||||
</select>
|
||||
{% endif %}
|
||||
|
||||
<button type="button" class="btn btn-secondary btn-sm" onclick="clearFilters()">
|
||||
Resetează
|
||||
</button>
|
||||
@@ -106,31 +150,46 @@
|
||||
<header class="activity-header">
|
||||
<h3 class="activity-title">
|
||||
<a href="{{ url_for('main.activity_detail', activity_id=activity.id) }}">
|
||||
{{ activity.name }}
|
||||
{{ activity.get_display_name() }}
|
||||
</a>
|
||||
</h3>
|
||||
<span class="activity-category">{{ activity.category }}</span>
|
||||
<span class="activity-category">{{ display_names.get(activity.category, activity.category) }}</span>
|
||||
{% if activity.needs_review %}
|
||||
<span class="activity-badge needs-review" title="Această activitate necesită verificare">⚠ De verificat</span>
|
||||
{% endif %}
|
||||
</header>
|
||||
|
||||
<div class="activity-content">
|
||||
<p class="activity-description">{{ activity.description }}</p>
|
||||
|
||||
<p class="activity-description">{{ activity.get_display_description() }}</p>
|
||||
|
||||
<div class="activity-metadata">
|
||||
{% if activity.get_age_range_display() != "toate vârstele" %}
|
||||
<span class="metadata-item">
|
||||
<strong>Vârsta:</strong> {{ activity.get_age_range_display() }}
|
||||
<strong>Vârsta:</strong> {{ activity.get_age_range_display() }}{% if activity.is_estimated('age_group_min') or activity.is_estimated('age_group_max') %} <em class="estimated">(estimat)</em>{% endif %}
|
||||
</span>
|
||||
{% endif %}
|
||||
|
||||
{% if activity.get_participants_display() != "orice număr" %}
|
||||
<span class="metadata-item">
|
||||
<strong>Participanți:</strong> {{ activity.get_participants_display() }}
|
||||
<strong>Participanți:</strong> {{ activity.get_participants_display() }}{% if activity.is_estimated('participants_min') or activity.is_estimated('participants_max') %} <em class="estimated">(estimat)</em>{% endif %}
|
||||
</span>
|
||||
{% endif %}
|
||||
|
||||
{% if activity.get_duration_display() != "durată variabilă" %}
|
||||
<span class="metadata-item">
|
||||
<strong>Durata:</strong> {{ activity.get_duration_display() }}
|
||||
<strong>Durata:</strong> {{ activity.get_duration_display() }}{% if activity.is_estimated('duration_min') or activity.is_estimated('duration_max') %} <em class="estimated">(estimat)</em>{% endif %}
|
||||
</span>
|
||||
{% endif %}
|
||||
|
||||
{% if activity.get_indoor_outdoor_display() %}
|
||||
<span class="metadata-item">
|
||||
<strong>Loc:</strong> {{ activity.get_indoor_outdoor_display() }}{% if activity.is_estimated('indoor_outdoor') %} <em class="estimated">(estimat)</em>{% endif %}
|
||||
</span>
|
||||
{% endif %}
|
||||
|
||||
{% if activity.get_space_needed_display() %}
|
||||
<span class="metadata-item">
|
||||
<strong>Spațiu:</strong> {{ activity.get_space_needed_display() }}{% if activity.is_estimated('space_needed') %} <em class="estimated">(estimat)</em>{% endif %}
|
||||
</span>
|
||||
{% endif %}
|
||||
|
||||
@@ -143,7 +202,11 @@
|
||||
|
||||
{% if activity.source_file %}
|
||||
<div class="activity-source">
|
||||
{% if config.SOURCE_DOWNLOAD_ENABLED %}
|
||||
<small>Sursă: <a href="{{ url_for('main.source_download', activity_id=activity.id) }}">{{ activity.source_file }}</a></small>
|
||||
{% else %}
|
||||
<small>Sursă: {{ activity.source_file }}</small>
|
||||
{% endif %}
|
||||
</div>
|
||||
{% endif %}
|
||||
</div>
|
||||
|
||||
@@ -3,15 +3,28 @@ Flask routes for INDEX-SISTEM-JOCURI v2.0
|
||||
Clean, minimalist web interface with dynamic filters
|
||||
"""
|
||||
|
||||
from flask import Blueprint, request, render_template, jsonify, current_app
|
||||
from flask import (
|
||||
Blueprint, request, render_template, jsonify, current_app,
|
||||
send_from_directory,
|
||||
)
|
||||
from app.models.database import DatabaseManager
|
||||
from app.models.activity import Activity
|
||||
from app.services.search import SearchService
|
||||
import os
|
||||
from pathlib import Path
|
||||
from app.config_taxonomy import (
|
||||
CATEGORIES, CONTENT_TYPES, INDOOR_OUTDOOR, SPACE_NEEDED,
|
||||
)
|
||||
|
||||
bp = Blueprint('main', __name__)
|
||||
|
||||
# Slug -> Romanian display name. Category, content_type, indoor_outdoor and
|
||||
# space_needed slugs never collide, so a single flat map is enough for the UI
|
||||
# filter labels.
|
||||
LANGUAGE_NAMES = {'ro': 'Română', 'en': 'Engleză'}
|
||||
DISPLAY_NAMES = {
|
||||
**CATEGORIES, **CONTENT_TYPES, **INDOOR_OUTDOOR, **SPACE_NEEDED,
|
||||
**LANGUAGE_NAMES,
|
||||
}
|
||||
|
||||
# Initialize database manager (will be configured in application factory)
|
||||
def get_db_manager():
|
||||
"""Get database manager instance"""
|
||||
@@ -36,15 +49,17 @@ def index():
|
||||
# Get database statistics for the interface
|
||||
stats = db.get_statistics()
|
||||
|
||||
return render_template('index.html',
|
||||
return render_template('index.html',
|
||||
filters=filter_options,
|
||||
display_names=DISPLAY_NAMES,
|
||||
stats=stats)
|
||||
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error loading main page: {e}")
|
||||
# Fallback with empty filters
|
||||
return render_template('index.html',
|
||||
return render_template('index.html',
|
||||
filters={},
|
||||
display_names=DISPLAY_NAMES,
|
||||
stats={'total_activities': 0})
|
||||
|
||||
@bp.route('/search', methods=['GET', 'POST'])
|
||||
@@ -82,8 +97,9 @@ def search():
|
||||
search_query=search_query,
|
||||
applied_filters=filters,
|
||||
filters=filter_options,
|
||||
display_names=DISPLAY_NAMES,
|
||||
results_count=len(activities))
|
||||
|
||||
|
||||
except Exception as e:
|
||||
print(f"Search error: {e}")
|
||||
return render_template('results.html',
|
||||
@@ -91,6 +107,7 @@ def search():
|
||||
search_query='',
|
||||
applied_filters={},
|
||||
filters={},
|
||||
display_names=DISPLAY_NAMES,
|
||||
results_count=0,
|
||||
error=str(e))
|
||||
|
||||
@@ -121,12 +138,51 @@ def activity_detail(activity_id):
|
||||
|
||||
return render_template('activity.html',
|
||||
activity=activity,
|
||||
display_names=DISPLAY_NAMES,
|
||||
similar_activities=similar_activities)
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error loading activity {activity_id}: {e}")
|
||||
return render_template('404.html'), 404
|
||||
|
||||
@bp.route('/source/<int:activity_id>')
|
||||
def source_download(activity_id):
|
||||
"""Download the original source file for an activity (plan A6).
|
||||
|
||||
Shipped DARK: returns 404 unless SOURCE_DOWNLOAD_ENABLED is set (copyright
|
||||
exposure — the user opts in). Resolves the activity's `source_file` under
|
||||
CORPUS_DIR. send_from_directory does the safe-join and blocks traversal;
|
||||
web-mirror / extension-less sources that are not real files 404 gracefully.
|
||||
"""
|
||||
if not current_app.config.get('SOURCE_DOWNLOAD_ENABLED', False):
|
||||
return render_template('404.html'), 404
|
||||
try:
|
||||
db = get_db_manager()
|
||||
activity_data = db.get_activity_by_id(activity_id)
|
||||
if not activity_data:
|
||||
return render_template('404.html'), 404
|
||||
|
||||
source_file = (activity_data.get('source_file') or '').strip()
|
||||
if not source_file:
|
||||
return render_template('404.html'), 404
|
||||
|
||||
corpus_dir = current_app.config.get('CORPUS_DIR')
|
||||
if not corpus_dir:
|
||||
return render_template('404.html'), 404
|
||||
try:
|
||||
# send_from_directory rejects path traversal and missing files with
|
||||
# a 404 (NotFound) — no manual safe_join needed.
|
||||
return send_from_directory(
|
||||
corpus_dir, source_file, as_attachment=True
|
||||
)
|
||||
except Exception:
|
||||
# Missing file / web-mirror source with no on-disk original.
|
||||
return render_template('404.html'), 404
|
||||
except Exception as e:
|
||||
print(f"Source download error for {activity_id}: {e}")
|
||||
return render_template('404.html'), 404
|
||||
|
||||
|
||||
@bp.route('/health')
|
||||
def health_check():
|
||||
"""Health check endpoint for Docker"""
|
||||
|
||||
Binary file not shown.
379
data/review_decisions.json
Normal file
@@ -0,0 +1,379 @@
|
||||
{
|
||||
"96560c4dee400911e01ec5bf6f2460f2b6008bdb": "merge",
|
||||
"7b93e6fbec6dab9c43862f4104d2f0e83784ccca": "merge",
|
||||
"2765d81e6572977b411f9521871a7a3e557cab6b": "merge",
|
||||
"bdd723ba8575ec94e6ed9e036ea62512ef643b40": "merge",
|
||||
"78f0c433f6da30bb922ef1f8637a747410ac3528": "merge",
|
||||
"0d3a5f2519d69e04558431d5990de8b1fb59a9bc": "merge",
|
||||
"5abca033e2bd9f49a7b65cad2e357c7ea7e2236d": "merge",
|
||||
"379999f9a603dfe4cb2316687a2266e17001ea8a": "merge",
|
||||
"90a16119eadd2daa1ecb8b897c20875e54c5934a": "merge",
|
||||
"d75b4131ece700309957aa17dae01f1feb9b3c35": "merge",
|
||||
"0d0ccd256e8a02e45128b9dcc196b49ce6b5ec0d": "merge",
|
||||
"db8afcd1ae6e0525dce1403caccf417de1423c68": "merge",
|
||||
"ec157c89b9c3ff0691fc92b9ca0aa586f93c4880": "merge",
|
||||
"67ce4ec87bcc33aab599a1c93c930e93c7762ae4": "merge",
|
||||
"952cbbbd28453b93bae97de9edc473bb9525bac9": "merge",
|
||||
"7a1acb62bb01bcd568fd67699da2523b293ce8a6": "merge",
|
||||
"043c6dcc5a4d80f43ad9308e454e4ec287321a26": "merge",
|
||||
"d10374d745aeb6abe1c0cc827baf98b311933196": "merge",
|
||||
"bdce7856cee7f98e187b35b3ad806782e738f007": "merge",
|
||||
"58103e94a6020d75e2d79fc1877b08b15e2a3d67": "merge",
|
||||
"dec54f2d3015f73eea7558fb52d2107fc43b7937": "merge",
|
||||
"d4b1e9dca9f3bdd4302f311ee96e6b39e96d5b1f": "merge",
|
||||
"93b16e984bc1b98de7ec9403df61522e53ee26e6": "merge",
|
||||
"6202ef279e70c0850369575f8dd9a6a338006a82": "merge",
|
||||
"02b83749ab358c1c591f134d4d252a72843fb443": "merge",
|
||||
"903af17ce988c40ab64c855a8b1f8e989d36883d": "merge",
|
||||
"e0d6a73c57fcd151bbc4d7b7727aa8d6e1a3301c": "merge",
|
||||
"63583f4d78dfe919ed195910a917d8cfe52ebe09": "merge",
|
||||
"37306f964eaa9a123e40aaa7b642520ce3be0349": "merge",
|
||||
"e355dd178be6d4589c227a047f2083e55806c4ae": "merge",
|
||||
"3d99eecf4ad2e034ad9cac724cfb9f5d57bfd45f": "merge",
|
||||
"e7c893a786fba566ea0cd5c8ee2a27e280560fd1": "merge",
|
||||
"cf7a7b6db84a7070e9495f6ff2970d8860621885": "merge",
|
||||
"a7157217540041a575cd3909a8bd7fe9345932ef": "merge",
|
||||
"15f50b3eb8baeac6cf58df3938ddcda2ae0c3224": "merge",
|
||||
"3c977df90998d069d7bba6a11d2bc7a891f5b45d": "merge",
|
||||
"7b4361f2e3d9f017d78513bebfb0f2aa7243f987": "merge",
|
||||
"d386aa70c890c3538b5f6b201e135f7ea9613ae9": "merge",
|
||||
"1da5be5acb36f222a67999de864cbcbcebe8d876": "merge",
|
||||
"7bf9319e4c0322747fd2e35971d7f7932e835eef": "merge",
|
||||
"29d7060cf880d7db900d6fb64cf2cb3874794b88": "merge",
|
||||
"9832cc0a6625f24354d6dfee3527086d042dd52b": "merge",
|
||||
"12ea5a54fb5f123441d8fae1451f6cdd02f572bb": "merge",
|
||||
"0638d8c1c81fe5c52ee29823be9f0e162f9aed12": "merge",
|
||||
"a2d1962c3c3f1f471f33a093291119d67890d26e": "merge",
|
||||
"ca7437967bbe393da5f5f463e359711929d6ba00": "merge",
|
||||
"6637ee7fd0ae519f87e4374ac6d8d3df12a0c0e2": "merge",
|
||||
"b7a8104fa06e2e2491f5699eefa5dbc8ffb26209": "merge",
|
||||
"21bf2790f28f654d0ccd38d99e8bfdd9876f2d10": "merge",
|
||||
"1e6a13a18d750a8b20457f71bfe64950a4c50f22": "merge",
|
||||
"299a31f4187c38a9ff84773d4ed1a3eccfe6904c": "merge",
|
||||
"a013365c84b79fac919ea70a2c2820881bf05e82": "merge",
|
||||
"dfdc05810d2e84b93a2edd64cf8138aeb9a6ccf3": "merge",
|
||||
"cf88b32fe2c2c09907d1813a1fefe9b17754d9fb": "merge",
|
||||
"210ba3d41c7424428365e7ed33cf86c9819f1467": "merge",
|
||||
"a458a6f96eccddd5b36392f38ef5c7dd16a238cd": "merge",
|
||||
"e4198708e6d69d69ce8c7cc14d5c1e4757944808": "merge",
|
||||
"6e3c6b282675e787c8726c0d4cc536357f6b1438": "merge",
|
||||
"6370503a78ecaacf6e8738fe3419d2e4a4b280ec": "merge",
|
||||
"5e6bbf4ea08efd5187954c5bb40d67a0f4ec223c": "merge",
|
||||
"d49a48f1dbf890d9c802c0c7e0e7e38a8b28645f": "merge",
|
||||
"668ca80cde843b5bcfc1e3c3599b114893481c6b": "merge",
|
||||
"2d85fa7ad793b0fec8066b1b45ae51da7f09fae1": "merge",
|
||||
"1416073acad77bef155224b1f8005f778554aada": "merge",
|
||||
"ddb18e2eeb058b1dbbcc1afd444256a82fccc346": "merge",
|
||||
"f1f5fb5b2f73f4035098d5af23da8ca2f20f54ca": "merge",
|
||||
"d3f4d626827e6798675b55a4785c2829ef93ef69": "merge",
|
||||
"8e83a64a6aa030d6d02d3d736c3fffe8249033dd": "merge",
|
||||
"d93a71716835cd1f8856abe093580f1e6aeab023": "merge",
|
||||
"98cc404cdd51987d3a6d15e88269c2978286e80a": "merge",
|
||||
"4d31104a3cf321dcd831e354a6d2556cffeafbd0": "merge",
|
||||
"b0e6d245954401f9a766addafb2dde59b5bf5fe9": "merge",
|
||||
"722f70854b1aff75d059410ebfdcf99fa15fe828": "merge",
|
||||
"3c58885138b0e9da95fcb0ed7906339cd104e168": "merge",
|
||||
"daa05e337e1fc8f02c4679a1da21dd74deafda0c": "merge",
|
||||
"f5ab25d06273404abfef1a55a1288f518d668638": "merge",
|
||||
"7484f446f8c8e59e6f248a465ea74b60f87102a0": "merge",
|
||||
"d072ac1c088c2fccf40536706d6433ecac0e865a": "merge",
|
||||
"9c8c02b5aa8e2ca3fcd6a91b46883beeadfa77a1": "merge",
|
||||
"81fed4a423c79dee6a8c7a3b82b4873e0003597c": "merge",
|
||||
"de6d50103958f692181a2c8a7d8aff51a9e3c6a3": "merge",
|
||||
"d1efa3c2cf23ee5cf845716a94af6e0c9ee2e3e0": "merge",
|
||||
"d762147d9a4bd258a874ac9d899e712fa08e3865": "merge",
|
||||
"1b28f43ef78eb3031c172f8c1a7c62834b902945": "merge",
|
||||
"28dc9dde42db1dbf1fe826219b0cc2bb5b203304": "merge",
|
||||
"3dc8037f2ed01e025c39dae2900ad40b5ccbd546": "merge",
|
||||
"80d95a9a86341093da7a2266225f1e8b016bfdc1": "merge",
|
||||
"7ec81e486f1da7c383c20837f06d162241ccf720": "merge",
|
||||
"f653b01b650d420eeb1a581e3d2ce3823b7000f2": "merge",
|
||||
"45b0184b1c666135ef903680935ccca6a638ee8a": "merge",
|
||||
"9562ba2777d2ed8b3c590beac7697d2ea731a2f9": "merge",
|
||||
"37d803fce5601c7f4e924368b07a74cd03df7317": "merge",
|
||||
"9a92c8f1ef4120ad82993b095dad3b3dbd1c2e5f": "merge",
|
||||
"4265795c137dd7229c4a593beab00d8401225afd": "merge",
|
||||
"d308fbe7842da27ee2e3204f700afd9364dde327": "merge",
|
||||
"3aaa23c16b88060510826b497c6433f010db7bb0": "merge",
|
||||
"d827b842f11c0752ffe8cddb6c372b8deab01924": "merge",
|
||||
"0232a13fd5589c4e1cbfe23c13a8b7b863b665eb": "merge",
|
||||
"1b7d6c88a61344bb2033d5de64d913acf2d527f9": "merge",
|
||||
"d9d8104bd70c4c03b6841ccbc07d078b9bc22c81": "merge",
|
||||
"ad2251d37f8911b33e2b5227d2e7acb0533fb10a": "merge",
|
||||
"7b1ec96916a6b5cac294e05ffc46c42a821076dc": "merge",
|
||||
"7c29838be6491258e5e3d169d931412adc8d2f01": "merge",
|
||||
"c050c15f2d3e0a2cf8d1145d3a9050d29d2f540f": "merge",
|
||||
"3d6731ad69adec25662c431c5d42c7d4a1934c89": "merge",
|
||||
"d07cb8c2e9d0c4a192b3966698e6671d21ea982b": "merge",
|
||||
"218e30c5d34937563d5d328e51d983a33611a1f5": "merge",
|
||||
"4547df7094cd3f8791adc07996d3c3264d1b8ff4": "merge",
|
||||
"7de994f358426f4c3e48b54196e4c0c6c71d60c4": "merge",
|
||||
"60473b807e6a9c940dacc726d3270efe8b6a0c5b": "merge",
|
||||
"7db9d39d8d2baf52c6e586b95472699f42c9c98f": "merge",
|
||||
"2409751dda5ae660a28b230694e0bc2bd2207133": "merge",
|
||||
"b525c8c36de926f6b891d613df9835b06bc6be59": "merge",
|
||||
"533ba4716e2ed1154e3ce6323aa899ea148abdf5": "merge",
|
||||
"b815d67ecd6ac3a64a6f792ffddc63a462754100": "merge",
|
||||
"6d0c3759ee029bb6d3fcd693dca9753f5186ecf3": "merge",
|
||||
"50ae80ee932281e9f4424c71f4d2607ce842737b": "merge",
|
||||
"7507208a9ca4f83a3634904810bdcd6a1633058f": "merge",
|
||||
"8614f97398b7b50fb50ae8405f8695f2960489d9": "merge",
|
||||
"9d8779922c44bc4a805447283b8b037569131b18": "merge",
|
||||
"965d619a9809c4f3831e308cf6214ddcebb9f835": "merge",
|
||||
"7f99acd1fef4a139c96911773d188f5be94ddbb5": "merge",
|
||||
"c5478f32f290c116b8a801736d2832a4ae1fafa5": "merge",
|
||||
"d589902ef6b25634da0fd89c00e3bbb44e8d868b": "merge",
|
||||
"9b3887e6d292ba2d3642689538231fb1a4d1dc08": "merge",
|
||||
"a2571504a7e62563aadb34799f40882afcd56ec8": "merge",
|
||||
"29f4b1f18f23495691ca5d6bc33150f36612a16f": "merge",
|
||||
"be8529c39e659ba00ba30ea4ebbb05510ceaa684": "merge",
|
||||
"3113a3f8ab0795d3dfcbf4b0dc64f5887bfcfeb6": "merge",
|
||||
"53e4ecab7d2fc23efa1f75a93b7da1fbe46fb38a": "merge",
|
||||
"aec818b52ddc3de410cb56692e95493a0661ac40": "merge",
|
||||
"19cd0d05e21206203498eda94b4ba8b777d360f2": "merge",
|
||||
"b596ae2fa7fd7dd059095b989f8477fc0023ef2a": "merge",
|
||||
"68e69967ef4f7437f21373b52f77a3701f61054d": "merge",
|
||||
"d72598ee9abdfd3d5c07b1e6f3e9fa11424b569f": "merge",
|
||||
"d420186d189d783fde3d29908ade3221eac109df": "merge",
|
||||
"7a3c7465e4221511994a2b2ebf6aa5c0b3353b0c": "merge",
|
||||
"c19d94c259c4bb55cda1198b96c8e723ca48904b": "merge",
|
||||
"f7222a7fa8b9930b0aef8e377e820423bd891b7a": "merge",
|
||||
"aeda2c5a5aa5c0ae8b33f72f81a76e0617b45474": "merge",
|
||||
"3c75d378d76756f56f2c5827d0adeaa6c0c88e04": "merge",
|
||||
"9bd53fc3c7b669d84cfe1f9778791ca32c1df621": "merge",
|
||||
"e065420daad8abd3e6908381775c9ab5aeb804c0": "merge",
|
||||
"107bd809c7ca5d24b1ea2d0e59843b6386be43ad": "merge",
|
||||
"32457099f22bfff1d9ab4f24c87768c893e295c6": "merge",
|
||||
"965038421cb0d65fc937cbf4f99b487732807fe0": "merge",
|
||||
"7b0e5e6a8e49a9f3594cdf17559cfff52c1a329a": "merge",
|
||||
"fa88682c3eab4a5fed4bb64f5a9b5733018d12f8": "merge",
|
||||
"4cddf1c9b6424b9d66cf6adf2e631dda8a92b88f": "merge",
|
||||
"0733103d3ec92e1bc7a05f7a6be53b9ea19ea1c9": "merge",
|
||||
"30d28ba19a6ece36d32b54ca30c0a71fa4b04dcf": "merge",
|
||||
"91ca8e066bfd41514ed775e2cbe485d4d74ee85c": "merge",
|
||||
"8b952bbf8e45e128e8a5b224a3c3da7d166327b1": "merge",
|
||||
"5ae4f3199d76f8805e38174425f70fd7bb99296c": "merge",
|
||||
"3ac614a52102ecd7448d4ebd47641264b3feba99": "merge",
|
||||
"e25061e58c6d66b88e89635d670d5360ebf2bc6d": "merge",
|
||||
"277434fe9d2179b91d2cfbd660a16f95ba590031": "merge",
|
||||
"8ddcb8e0885cd61847206cad017e1d65b002e798": "merge",
|
||||
"bcda860319366bab0690195d2d9665f72492e19e": "merge",
|
||||
"d9ed88849aa59a5ba31798854a25843815cc2d3a": "merge",
|
||||
"eae46f64b96fff8db16751cf271698a54336d7d3": "merge",
|
||||
"cfd5e3814f129e59c011acce3ef97d3a95b21465": "merge",
|
||||
"c7929c37f1367d52bd529e671e2c951dbb80f618": "merge",
|
||||
"b5137e14a8914e9007fc78b48b9c8ac5060d3885": "merge",
|
||||
"38373f03cc19db9363c9be6dd1df60f8c0344a3a": "merge",
|
||||
"8e2784d1d0abf50c83d280dc3c50d51bcc8bb947": "merge",
|
||||
"9322e84a111ea8d6aacae5998d742b1ef7689870": "merge",
|
||||
"2ab3a81c5d889f7ed58d72a4c42e54728955a3b1": "merge",
|
||||
"3c87b2d1c3df5d3cfe920e992805d55f840b023f": "merge",
|
||||
"3f14169b7cce95214f874f5dc7acf214a1576e09": "merge",
|
||||
"bfbd7c57a96b5a66970d28f23ec65e71c3d0af63": "merge",
|
||||
"56841bf251e2631d31accd7b44ad1223ef55116e": "merge",
|
||||
"e4a907fb74246d5af8201df32ab1db0ffcf10582": "merge",
|
||||
"af373139d280bc4b36b31a3e917a442d441f76c4": "merge",
|
||||
"6d69ce5b4cb5dae80f7fc6fdd74092624d7e629b": "merge",
|
||||
"5d93d92c1ed5ec1b9eea859a8d10dd5542f6bf0a": "merge",
|
||||
"0d3655929d5a20efade4fdebdbe1e51a857aa4a8": "merge",
|
||||
"c6f5a99b9970977b87527e68a79856957266ec85": "merge",
|
||||
"ac8798c7b1bda482f86770a1152b33e1eb09936f": "merge",
|
||||
"da588a2ebac1e9fce8a00635be058aa08a534006": "merge",
|
||||
"b1b5f3e3195a835e110c12826b0f68aa6c5dfda8": "merge",
|
||||
"2cbfda9b32acb9e3353046193185ae459b6baade": "merge",
|
||||
"4530e7f505b85a4bc418f8e204fd2dce3e4bcd51": "merge",
|
||||
"d0f306635d8df199e336616980ff8367f23a86c6": "merge",
|
||||
"043651de409eaa56f0f5f02ccb36acb4f9379c7c": "merge",
|
||||
"78febd51e41d71f721454190ffe4defd3be64cf2": "merge",
|
||||
"878234eaaa6a82da2a07cbddf7f7a9312b7aee40": "merge",
|
||||
"d2e283086b32df1b74da9ff6bac014867f3e02b9": "merge",
|
||||
"8494ded48880d67c8857860f485a428c85ddb17c": "merge",
|
||||
"d99eab2896ee0c15d96ce9008387f1caae4992d5": "merge",
|
||||
"996d708cc4d0a7532f24889ed8fa33565f1252db": "merge",
|
||||
"bea98344fcc400338f45fa7aea05c1c716ff2909": "merge",
|
||||
"7ed550a014b20ddb6d406ae470fc564501fcdd2c": "merge",
|
||||
"783f22275c9c238507aa5903ec8b0d99e3a4a348": "merge",
|
||||
"d32c5eea6b7e682bc998ae968b835436784c8501": "merge",
|
||||
"e1aae6922f87193a5d0c8efdeabfd55e80ba13c1": "merge",
|
||||
"9aa988ee82d5a2ce587b2cb4ad304c1449985322": "merge",
|
||||
"e11c9441f3fb67f4c4fed049c015fb6eb8b73ca7": "merge",
|
||||
"fdc0642b90b7afb0520294d90a73060ed5211cde": "merge",
|
||||
"29c466523090f5f957dd5feb7a10a8ee839d6a68": "merge",
|
||||
"987e658e3f40c874da5e237a635ea3eba63520af": "merge",
|
||||
"ef677e0935c55d3da20872e27328eddbae34b5b9": "merge",
|
||||
"b34cbc1ac68e6e0cc61f273d030520cb0f6639fb": "merge",
|
||||
"7769d0a01f93ed6437128827ba88987807aa7154": "merge",
|
||||
"5abd16d445171cdc30752149066755886ac627b7": "merge",
|
||||
"b3da0c637ac87ff1032213ce007d365af512ce33": "merge",
|
||||
"ea4065e3f76043bad07c6e6aff17c342abf59668": "merge",
|
||||
"c3e40ebe130c4603359654c228ee8f3518ea6b46": "merge",
|
||||
"0a98a18116f3384808d9472d370e9c65984f10f7": "merge",
|
||||
"057f5aab9399ceab6ac1d8b59da831e952e53dbd": "merge",
|
||||
"2d41a71e7b7db81fbc2757b814d9189c67cda4cf": "merge",
|
||||
"c27d096ed3d5d94e1f9c1574eff52635c78ae282": "merge",
|
||||
"b4eed2c25fd8260393df142bef5d4a964f3bd10f": "merge",
|
||||
"f6bd9d75156541527e66fab5e474c1422ad9dbf9": "merge",
|
||||
"a449eb060de8beba56dd0dbafde9fee4f65288c9": "merge",
|
||||
"cde603b1d5c0ea111641e21abe2e75a49d157de4": "merge",
|
||||
"c3b0799baf3248e581be8a9cecf35351a7c4a363": "merge",
|
||||
"2c08a477dd2717c07da9f2bbaf1940403d0b8f86": "merge",
|
||||
"a55f668e2623aa50abded937ff4400382e80d90a": "merge",
|
||||
"abb24637bfd92e7212e9dafc9a7e5ad14d7cc8a4": "merge",
|
||||
"b862dd55a70d645da81f6c0cdd921c8728674633": "merge",
|
||||
"6197261b63f675835c3290be0f10ced10400e326": "merge",
|
||||
"74c82141e805ebaab913cea281416883849cc771": "merge",
|
||||
"1ce689262c53a212ad8f94a778ea4bf80bc62910": "merge",
|
||||
"05d9af704c387479856a2390b81a27fed4413285": "merge",
|
||||
"492c25ec0c4667bbe01177f689f3c51b8f31190d": "merge",
|
||||
"ec79b6e520a87ca24ae5265aff1b39a87dcadf82": "merge",
|
||||
"eb5d32991120f20c2ff1c16e92b31c78aa3dcd22": "merge",
|
||||
"e2dc20814e229ddef4569ca4d405c0b553313d1a": "merge",
|
||||
"9e37a3e0e6336d306c263bd4739e9231cb8221dd": "merge",
|
||||
"2c541c5bf5b6a93301fc8ee089228a09b8581959": "merge",
|
||||
"f5b717b5da236e42a840093e4562fe2896f44a20": "merge",
|
||||
"04a3b2dcba0a00ee833e2e9c4a5f032f2cf80742": "merge",
|
||||
"f8452066c036fa7c09b4acd03de4f0c661a6bec6": "merge",
|
||||
"b595ffe402b14a1e927c2f0fe9554f581354999d": "merge",
|
||||
"786ab65d5e3c51246f1c457b77eff154860bf959": "merge",
|
||||
"8b6d467df7b0161820cf698a33c665217c597d04": "merge",
|
||||
"264d317f0a0d92e82c2943ca9365367382902c27": "merge",
|
||||
"6d95efd3025e3175aa75c1aa7ed7ec7d1ac3d1a1": "merge",
|
||||
"165c5c31e2fb155e00a2664fcca334c25356af69": "merge",
|
||||
"998d2db54bde3578f8e7dbaa5e4612f20836e5d0": "merge",
|
||||
"c764f15e260fc5f2ac445e8216e27130fa9af9e7": "merge",
|
||||
"0120182e18a57cd9d936bff9bf74675bb9233cdb": "merge",
|
||||
"09c39c6b32699cf61aa71fa6689dd676b5e803a9": "merge",
|
||||
"1549346e5b321eaf1d767f673e642cac11d22700": "merge",
|
||||
"cd7a074efdf2dd6e9272e05cb88a8f8da08a8a8b": "merge",
|
||||
"f418edf54719e46b2707f0c6ec9a47da38ab099a": "merge",
|
||||
"9e20edd1cc156c387a6fd4b47f0495a5d1d5d969": "merge",
|
||||
"dac94f8716a6f192fcfb660453b3074e55b60c3f": "merge",
|
||||
"c62767464497b98bbab2fbf724ce069b749c0d95": "merge",
|
||||
"37bc745042745d9984d9323f83de2bb6f2af0d49": "merge",
|
||||
"3df2f823828b92a35075fb207844cf788b8bfc09": "merge",
|
||||
"153cbe257b938edc3bda1aeea3729c53abc81d57": "merge",
|
||||
"0da73ba76f0e83a24cb7ec86ad426fc79f0be00e": "merge",
|
||||
"517597920bde6cfea71154e6a13fa4a9bcb18db2": "merge",
|
||||
"283bd504ae23412dfcf671c7d6febdf42232383c": "merge",
|
||||
"4d889b00c2811ade234b312f72d725ffab3971e7": "merge",
|
||||
"c5e723a923eb6c0d07f34e05c30465a862b3fd5c": "merge",
|
||||
"9e9afe3963dcbfcb1191110145772373c49739dd": "merge",
|
||||
"4a238771093ed7936603426b9a19355fa48a086a": "merge",
|
||||
"641291a504ecc051c99fc385151f68f6f2c7711a": "merge",
|
||||
"d6e47126fe3754fde729dcb7777cc25165e57dba": "merge",
|
||||
"812cdf28743b2f01cceedca18cf7e237f6f38c1b": "merge",
|
||||
"de2ce5f515b0d9647320bd71d8d013c68fb48bc4": "merge",
|
||||
"14ffc9df5acc2537d3cca6322d097363d7106d52": "merge",
|
||||
"5941736673d484453b49edc863a9dfa669f4cdb2": "merge",
|
||||
"d68039d71f1bc980a8cca9176df5fd672575db46": "merge",
|
||||
"5e8667cd2a0862a7b528779267ea074d88501494": "merge",
|
||||
"7b6c4999ce6d158f72aba749543f5c8a1cbfb08b": "merge",
|
||||
"ace457ccf21922424aa38a0a29d5aa7dbd4e7fc1": "merge",
|
||||
"5707d00d0d3b646d83541c9205405f5e41ad07b0": "merge",
|
||||
"261e67f706670c9f81e7b2c6d0ec1fc934c4109a": "merge",
|
||||
"eec31103367b39bc1b4db3d0bce21650ad8f4382": "merge",
|
||||
"39880f589acdfeb25866adf3246a76df08798f19": "merge",
|
||||
"65ced2d78530d25626bcf18c4005a0aae5ed62c2": "merge",
|
||||
"48cf968689bc72fa3ae82ce65a7b0943489f0154": "merge",
|
||||
"4f2e11bbb3805459f2eacf219e5ec03f1d28cb35": "merge",
|
||||
"0ea6dbabece8f3e5ea404f61c30c377f60a472e5": "merge",
|
||||
"7ede6afb1141d462955d620a82608841220798b7": "merge",
|
||||
"9ff9d8bf1215f2d98d3707653c09bcacf4e296de": "merge",
|
||||
"b2e944340a3cccf5c1c399aa814e2fabe1ac864d": "merge",
|
||||
"ce4d859cbca0368652db3168769a4e6ccac8f91a": "merge",
|
||||
"6f657a344ac4233f86f0cf9ff11aa0aed4be94e0": "merge",
|
||||
"b7101934d892ce071b407567acf6804e41c9b89c": "merge",
|
||||
"2b5d28295e73e25c51359577e55470b59a4c283a": "merge",
|
||||
"f0c9de980532b717b361439c4ae7d58e4b6a49ea": "merge",
|
||||
"13bd864c337a32bc7207d44c4f1af0ce58f15072": "merge",
|
||||
"710c6b384f10cc08c9a2116614e7736d035e65ae": "merge",
|
||||
"7109848fc1fc60cba60a51e3cb0a514d3c90eebc": "merge",
|
||||
"801e5080975fda8bab6c060efb2e7f3ac85d8182": "merge",
|
||||
"a4910a93356eea624e5a08fe05634b94c0dd0285": "merge",
|
||||
"7ff8ff8d36da93bbeade44b70f6b944ea32135d0": "merge",
|
||||
"02402c6f4cde79fc0d4db36d033d4ea504d97c90": "merge",
|
||||
"09c196193a16b47338b3e5446882ffa8ce3f8b16": "merge",
|
||||
"2d427628ac15e8d04daa564e4c0f1a3aa7f93126": "merge",
|
||||
"d65181509712e219fbe71407f0cd24845f306cd1": "merge",
|
||||
"a82a00d0170e01857e493ed08f20d9b2133a8791": "merge",
|
||||
"a1c17e0481ed6b2a3bc68f8bbf135173f636c2f6": "merge",
|
||||
"1cefb01dff5305eba60b9f19e1acd5087bd8b60a": "merge",
|
||||
"2f766fa9b5c81f85fa225f5e9573785fd0ddf5eb": "merge",
|
||||
"12e822ee75c1f3a0d59cbb8e76370a7b7dd0e4dc": "merge",
|
||||
"6e285cebbe113587482274cb3d51eb4ad8139643": "merge",
|
||||
"f79b4edf703c01e6a5b7a75c43b691a2de7c6ca4": "merge",
|
||||
"cc3901d8e10db1561222e2b39c62391298a14d82": "merge",
|
||||
"21ee92f1adb0d480c328db4ded13728e890c0ea0": "merge",
|
||||
"7b95c6dbd4990cbf4400e7ae89e2d530e31baa4e": "merge",
|
||||
"978078ce3a4aaa7c425cc31e3dfece248a4f3a51": "merge",
|
||||
"569a61c811956afb56325ede193d1d24a2d28013": "merge",
|
||||
"77c9f803050c33ef23e6e428fbc98d5bde14ab2b": "merge",
|
||||
"02dc4ec08801ab168b2838f3322a3a477ee179be": "merge",
|
||||
"f804ddaa38c0d16757bd82f4aa58292da39a427c": "merge",
|
||||
"62cc44865952495586686578bfb905b0026a762f": "merge",
|
||||
"ae10bcefb1c772dd8898f56a6590a4342a880539": "merge",
|
||||
"59427406b24625474bcefc245fd1f27b30bc2c81": "merge",
|
||||
"3c7b681ea2f999c7581bf38d4c8e2a5486f8e00f": "merge",
|
||||
"2c17be0a20d9399ab3b7c491080aee118301d4e1": "merge",
|
||||
"072d8163161c207aafa8fff050ad93b7cafa815d": "merge",
|
||||
"881b8d6623cf104d25bef36a17f89af27d0a6e8e": "merge",
|
||||
"41875d183c0ef7d3a71477ae0d0f4bea0763812c": "merge",
|
||||
"ec1cfbe8410b6569a010a708f81ab9768e882cf1": "merge",
|
||||
"bd6faef7a688ea78b5017c19442e302f0337a3f1": "merge",
|
||||
"225823b9e7e36d1d9e2e8ff0dbce20f3de6dc040": "merge",
|
||||
"35d2a6b775147a9b60212d43810338231a239865": "merge",
|
||||
"a6484f51b8be1c8b1898464e2a60b7b4496afd23": "merge",
|
||||
"058172d888c1d39d7af7dc95fb6808bcde2a9a33": "merge",
|
||||
"8d0a231bab26e73c7ed7e908e049fb48f4ee333f": "merge",
|
||||
"96230c861c8a18100f40daa6a1d6e2ea91778707": "merge",
|
||||
"a3f4effb653fdc612188a260ee04e6fd58213d41": "merge",
|
||||
"3adb67d39baa74d72ca6fd926ab000b0e5b50007": "merge",
|
||||
"e21b4aaf288b2223783f39db5383c39fdd32773f": "merge",
|
||||
"885d004c25647a9561922fc2d5f2c2a7fe4f91f4": "merge",
|
||||
"53c8b120a142ebcafbec64ca3c9040afa39f4d02": "merge",
|
||||
"3e5be3f2f9b83d0506d261b45913d2cd5bd1e36a": "merge",
|
||||
"ce65391f84fecb5d44d7de447d87bb1ef64a60bf": "merge",
|
||||
"a39abb5b691049421ebb8720c0d5305a95bb244e": "merge",
|
||||
"a6065db7c298ecb714b86f930bec5988d4466603": "merge",
|
||||
"607aae23587abc641177812447a45099973c41a9": "merge",
|
||||
"c94ec26898c0d2bc0b76a42f690bb3722c01fc43": "merge",
|
||||
"d31c729d0d9dbef75416af90502bf75194e6e28a": "merge",
|
||||
"28442303003251487bbf9376acda1e6f35c1ea59": "merge",
|
||||
"1b263555acc9753b4de67191c636c9d403a69c98": "merge",
|
||||
"88aaa450db0dd1121d47ff9f605e727efbdcaab7": "merge",
|
||||
"d9e19f0a3a5dea8a04d5b2ad241f2ef9c1f06f59": "merge",
|
||||
"16e379debd6d37512791ee9333079c669df15575": "merge",
|
||||
"f51a06926de6189cc24abd4e67a325aaf1ad464a": "merge",
|
||||
"9b8e2100c116ded02740c4cb777a5f157faccdb0": "merge",
|
||||
"f34a747406f51b6531f04c0d14242226503051de": "merge",
|
||||
"db2dedf1eed6cba52eb672be03855792e85da808": "merge",
|
||||
"b6259a36610f0fd926ee6778b779fe538b9a510d": "merge",
|
||||
"717e057c3fccf73ff4c574c7fe925b16058d8809": "merge",
|
||||
"9b6e897d7f92a031ced4241f81f8b1a16f1dcebe": "merge",
|
||||
"7c9434809e1fb66093267cb386ede23690217c5d": "merge",
|
||||
"8d37e5a1fe1ead57a6fd6358f3dc734a661d69c3": "merge",
|
||||
"b8336b4760340e35f4e3160b8e9c6bbe1da9cae1": "merge",
|
||||
"1c3f0ea4fc7e4c995d8a4926081335a8c5f68dce": "merge",
|
||||
"837abf0a5cce27a14c20cd5a21af867f8174be43": "merge",
|
||||
"a1c63a7e1777c43cbf07834d8ad5afc05321445b": "merge",
|
||||
"22c80d3819f0537048ef7d5d5ec9097713fc8dac": "merge",
|
||||
"3eebe3845a49ee6374433d49c75d203021533a88": "merge",
|
||||
"deba84d9ce4b353cd6c332fa1ce130541f9c3ceb": "merge",
|
||||
"2a6f99c6ea2519768459228412712bd5e53ade15": "merge",
|
||||
"469559bb434980822263f6b4986c50b6acba5cf0": "merge",
|
||||
"97caa33b8724da030a27366e497db1848a38b202": "merge",
|
||||
"68fdeb90581c9b5f0ebad6bb83210afd427deee7": "merge",
|
||||
"cbcf7732216210ec544f18a676118c8ad7183695": "merge",
|
||||
"5854cac43108e0ec536ae892a31d53c961386b6c": "merge",
|
||||
"acb7e01a2f4e6516bcdd9d9f41f7993d81ae944c": "merge",
|
||||
"97bac2c4905f6a4d8b21feb0bd7639222ff14b03": "merge",
|
||||
"c29cea5412dc33461b3a57339e324d85cc9d8d2c": "merge",
|
||||
"e8301271a18a1914cea7bf4ed31f38db7aba2dc7": "merge",
|
||||
"a1192f7fa21c7525312de0620d64411918ca6bc7": "merge",
|
||||
"8d9b61b5c93dc9506e9beeb545d28a6f0605b0ce": "merge",
|
||||
"d2d22d4f31fbfa79cd7bc21d804a26cf2733615e": "merge",
|
||||
"a4c98e31776765ce53ab27a4992e7e3797f0aa53": "merge",
|
||||
"f8cafe342cb100e754258b7b8ea1d6f830b1939b": "merge",
|
||||
"5872e950aa89f5211554bb41a4fea2983547fabc": "merge",
|
||||
"26fa80ae816ebdf833bb53b111214dd7d6fcef14": "merge"
|
||||
}
|
||||
107
scripts/ENRICHMENT_PROMPT.md
Normal file
@@ -0,0 +1,107 @@
# SUBAGENT — Activity enrichment

You are a subagent in the game-library enrichment pipeline. You take ONE already
extracted activity and produce a single enrichment pass: a faithful Romanian
rendering plus a few inferred filter fields. You do **one** activity per prompt.

This is **not** re-extraction. The activity text already exists and is trusted.
Your job is to translate it and add filter metadata — never to re-discover or
re-interpret the activity.

## Your task

The prompt gives you two blocks:

1. **Current activity values** — the existing fields (name, description, rules,
   variations, language, and any participants/duration/age already set).
2. **Source chunk text** — the original passage the activity came from. This is
   your ground truth for any expansion. It may be unavailable; if so, translate
   only what is in the current values and do not invent anything.

Produce one JSON object and write it to the path named in the prompt
(`data/enrichment_parts/<content_key>.json`). It MUST contain the exact
`content_key` string from the prompt.
|
||||
|
||||
## Rules
|
||||
|
||||
### Translation (always)
|
||||
- Translate `name`, `description`, `rules`, `variations` into natural, fluent
|
||||
Romanian → `name_ro`, `description_ro`, `rules_ro`, `variations_ro`.
|
||||
- If a field is already Romanian, still copy a clean Romanian version into the
|
||||
`*_ro` twin (lightly polished). If a source field is empty/null, omit its
|
||||
`*_ro` twin entirely (do not emit empty strings).
|
||||
- Translate faithfully. Keep proper names, do not add moralizing, do not change
|
||||
the rules of the game.
|
||||
|
||||
### Description expansion (constrained)
|
||||
- You MAY make `description_ro` richer than a literal translation — but ONLY
|
||||
using detail that is actually present in the **source chunk text**. Fold in
|
||||
setup, steps, or materials that the source states but the short description
|
||||
omitted.
|
||||
- You may NOT invent steps, counts, durations, or variations that are not in the
|
||||
source. If the source is thin, the translation stays thin. Hallucinated
|
||||
expansion is the one unacceptable failure here.
|
||||
|
||||
### Inferred filter fields (mark when inferred)
|
||||
Fill these when you can, using the source text first, then reasonable inference:
|
||||
|
||||
- `indoor_outdoor`: one of `indoor`, `outdoor`, `either`.
|
||||
- `space_needed`: one of `mic`, `mediu`, `mare` (small / medium / large area).
|
||||
- `participants_min`, `participants_max`: integers (people).
|
||||
- `duration_min`, `duration_max`: integers (minutes).
|
||||
- `age_group_min`, `age_group_max`: integers (years).
|
||||
|
||||
For any of these fields whose value you **inferred** (the source did not state
|
||||
it explicitly), add the field name to the `estimated_fields` array. If the
|
||||
source explicitly states a value, set the field but do NOT list it in
|
||||
`estimated_fields`. Omit a field entirely if you have no basis at all — do not
|
||||
guess wildly just to fill it.
|
||||
|
||||
Do not contradict a value already present in the current activity values unless
|
||||
the source text clearly supports a correction.
|
||||
|
||||
## Enum vocabulary (fixed — use these exact slugs)
|
||||
|
||||
- `indoor_outdoor`: `indoor` | `outdoor` | `either`
|
||||
- `space_needed`: `mic` | `mediu` | `mare`
|
||||
|
||||
## Output format
|
||||
|
||||
Write exactly one JSON object to `data/enrichment_parts/<content_key>.json`:
|
||||
|
||||
```json
|
||||
{
|
||||
"content_key": "<the exact key from the prompt>",
|
||||
"name_ro": "…",
|
||||
"description_ro": "…",
|
||||
"rules_ro": "…",
|
||||
"variations_ro": "…",
|
||||
"indoor_outdoor": "outdoor",
|
||||
"space_needed": "mediu",
|
||||
"participants_min": 6,
|
||||
"participants_max": 20,
|
||||
"duration_min": 15,
|
||||
"duration_max": 30,
|
||||
"age_group_min": 8,
|
||||
"age_group_max": 14,
|
||||
"estimated_fields": ["space_needed", "duration_min", "duration_max"]
|
||||
}
|
||||
```
|
||||
|
||||
Include only the fields you actually fill. Always include `content_key` and
|
||||
`estimated_fields` (use `[]` if nothing was inferred). Output valid JSON only —
|
||||
no commentary, no markdown fences in the file itself.
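
The output rules above can be checked mechanically. The following is a minimal sketch, not part of the pipeline: `check_enrichment_part` and its messages are hypothetical names chosen for illustration, and the checks only mirror the rules stated in this prompt (matching `content_key`, `estimated_fields` always present, fixed enum slugs, integer filter fields, no empty `*_ro` strings).

```python
import json

# Vocabulary and field names copied from the rules above.
INDOOR_OUTDOOR = {"indoor", "outdoor", "either"}
SPACE_NEEDED = {"mic", "mediu", "mare"}
TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
INT_FIELDS = (
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
)


def check_enrichment_part(raw: str, expected_key: str) -> list[str]:
    """Hypothetical helper: list the problems in one part file (empty = OK)."""
    try:
        part = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(part, dict):
        return ["top-level value is not a JSON object"]

    problems: list[str] = []
    if part.get("content_key") != expected_key:
        problems.append("content_key does not match the prompt")

    estimated = part.get("estimated_fields")
    if not isinstance(estimated, list):
        problems.append("estimated_fields must be a list (use [] if empty)")
        estimated = []

    for fld in TEXT_FIELDS:
        if fld in part and not (isinstance(part[fld], str) and part[fld].strip()):
            problems.append(f"{fld} is empty; omit it instead")
    for fld in INT_FIELDS:
        # bool is a subclass of int in Python, so reject it explicitly
        if fld in part and (isinstance(part[fld], bool) or not isinstance(part[fld], int)):
            problems.append(f"{fld} must be an integer")
    for fld, allowed in (("indoor_outdoor", INDOOR_OUTDOOR), ("space_needed", SPACE_NEEDED)):
        if fld in part and not (isinstance(part[fld], str) and part[fld] in allowed):
            problems.append(f"{fld} is not a known slug")
    for fld in estimated:
        if not isinstance(fld, str) or fld not in part:
            problems.append(f"estimated_fields lists {fld!r}, which is not set")
    return problems
```

An empty list means the part follows the format; anything else names the rule that was broken, which is easier to act on than a bare parse failure.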

### CRITICAL — escape quotes inside string values

Any ASCII double-quote (`"`, U+0022) inside a string value MUST be escaped as
`\"`. Your Romanian text is full of `„cuvânt"` — written raw, the closing ASCII
`"` terminates the JSON string early and the whole file fails to parse (and your
enrichment for this activity is silently lost). Either keep the typographic
marks (`„ ”`) or escape every literal ASCII `"`. After writing, re-read the file
and confirm it parses as valid JSON.
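
If the part file is produced by a script rather than typed by hand, a JSON serializer handles the escaping. The sketch below assumes Python's standard `json` module; `dump_part` and the key `example-key` are placeholders for illustration only.

```python
import json


def dump_part(part: dict) -> str:
    """Serialize one enrichment part (hypothetical helper)."""
    # ensure_ascii=False keeps ă, î, ș, ț and the typographic „ readable;
    # json.dumps still escapes every ASCII double quote inside string values.
    return json.dumps(part, ensure_ascii=False, indent=2)


part = {
    "content_key": "example-key",  # placeholder, not a real content_key
    "description_ro": 'Grupul cântă „Unu" în cor.',
    "estimated_fields": [],
}
text = dump_part(part)

# Re-reading the serialized text is the "confirm it parses" step.
assert json.loads(text) == part
```

The round-trip assertion at the end is the same check the rule asks for: parse what was written and compare it with what was meant.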

## Report

After writing the file, report in under 30 words: the activity name and which
fields you estimated.
113
scripts/SUBAGENT_PROMPT.md
Normal file
@@ -0,0 +1,113 @@
# SUBAGENT — Activity extraction

You are a subagent in the game-library extraction pipeline. You extract
educational activities (games, team-building, scouting, recipes, songs,
ceremonies) from one chunk of a source document into structured JSON.

## Your task

1. **Read ONLY the chunk you were assigned.** Do not read other chunks, other
   files, or the original document. The chunk is a `.txt` file with
   `--- PAGE N ---` markers.
2. Identify **every distinct activity** in the chunk.
3. For each activity, fill the schema in `scripts/activity_schema.json`.
4. Write the result to `data/extracted/<chunk_key>.json`.

## What counts as "a distinct activity"

A distinct activity is a self-contained game/activity/recipe/song/ceremony with
its own name and a real description of how to do it. It is NOT:

- a bare mention or a cross-reference with no description — **skip it**;
- a sub-variant of an activity already extracted — fold it into `variations`;
- a heading, a table of contents entry, or running page chrome.

If the same activity is split across a page boundary inside your chunk, treat it
as **one** activity and combine the text.

## Output format

The file is one JSON object: a `header` plus an `activities` array.

```json
{
  "header": {
    "source_id": "<set from the prompt>",
    "chunk_key": "<set from the prompt>",
    "source_hash": "<set from the prompt>",
    "schema_version": "1.0",
    "prompt_version": "1.0",
    "chunk_range": "pages 1-20"
  },
  "activities": [ ... ]
}
```

## Rules for each activity

- **`name`** — the activity's real name (≥3 characters).
- **`description`** — real prose describing the activity. No hard length limit,
  but it must actually describe what happens.
- **`rules`** — how it is played / carried out, if the source gives rules.
- **`category`** — exactly one taxonomy slug (see the `enum` in the schema):
  `jocuri-cercetasesti`, `team-building`, `icebreakers`, `camp-outdoor`,
  `wide-games`, `orientare`, `prim-ajutor`, `escape-room-puzzle`,
  `creative-stem`, `sports-active`, `cantece-ceremonii`, `retete`,
  `supravietuire`, `integrare-incluziune`, `conflict-empatie`, `altele`.
  When unsure, use `altele`.
- **`content_type`** — the FORM of the content, independent of category:
  `joc`, `activitate`, `reteta`, `cantec`, or `ceremonie`.
- **`language`** — `ro` or `en` (the language the activity is written in).
- **`source_excerpt`** — **MANDATORY.** A short quote (one or two sentences)
  copied **verbatim** from the chunk. This is the anti-hallucination anchor: it
  is checked as a fuzzy substring of the chunk, and invented quotes are
  rejected.
- **`page_reference`** — **MANDATORY.** The `--- PAGE N ---` marker(s) the
  activity came from, e.g. `"page 14"` or `"pages 14-15"`.
- **`extraction_confidence`** — `high`, `med`, or `low`. Use `low` when the
  source text for the activity is thin or ambiguous.

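
To make the `source_excerpt` rule concrete, here is a rough sketch of a fuzzy-substring check. It is an assumption-laden stand-in: the build script's docstring describes the real check as rapidfuzz `partial_ratio >= 90`, whereas this version uses only the standard library's `difflib` with a sliding window, and the name `excerpt_in_chunk` and the `0.9` threshold are illustrative.

```python
from difflib import SequenceMatcher


def excerpt_in_chunk(excerpt: str, chunk: str, threshold: float = 0.9) -> bool:
    """Approximate 'is this quote really in the chunk?' (stdlib stand-in)."""
    # Collapse whitespace so line breaks in the chunk do not defeat the match.
    excerpt = " ".join(excerpt.split())
    chunk = " ".join(chunk.split())
    if not excerpt:
        return False
    if excerpt in chunk:
        return True  # verbatim copy: the case the prompt asks for

    # Otherwise slide an excerpt-sized window over the chunk and keep the
    # best similarity ratio; small OCR-style differences can still pass.
    window = len(excerpt)
    step = max(1, window // 4)
    best = 0.0
    for start in range(0, max(1, len(chunk) - window + 1), step):
        candidate = chunk[start:start + window]
        best = max(best, SequenceMatcher(None, excerpt, candidate).ratio())
    return best >= threshold
```

A verbatim quote passes on the plain substring test; a quote that was paraphrased or invented has no window in the chunk that resembles it closely and is rejected.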

## Never invent data

- Do **not** invent ages, participant counts, or durations. If the source does
  not state them, leave those fields `null`.
- Do **not** paraphrase the `source_excerpt` — copy it character for character.
- Better to extract fewer activities accurately than to pad the output.

## Escaping quotes inside JSON strings (CRITICAL)

Any ASCII double-quote (`"`, U+0022) that appears **inside a string value** must
be written escaped as `\"`. This is the single most common way these extractions
break: Romanian source text uses typographic quotes like `„cuvânt"` where the
closing mark is a plain ASCII `"`. Written raw, it terminates the JSON string
early and corrupts the whole file. So:

- `"description": "grupul cântă „Unu\" în cor"` ← correct (inner `"` escaped)
- `"description": "grupul cântă „Unu" în cor"` ← BROKEN (unescaped `"`)

Prefer keeping the source's typographic quotes (`„ ”`), but whenever a literal
ASCII `"` lands inside a value, escape it. After writing, re-read the file and
confirm it parses as valid JSON.
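
The two examples above can be verified directly. This short sketch wraps each one in braces and feeds it to Python's standard `json` parser; `parses` is a hypothetical helper name.

```python
import json

# The BROKEN and correct lines from the list above, wrapped in an object.
broken = '{"description": "grupul cântă „Unu" în cor"}'
fixed = '{"description": "grupul cântă „Unu\\" în cor"}'


def parses(text: str) -> bool:
    """Return True when the text is valid JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

`parses(broken)` is false because the string ends at the raw `"` after `Unu`, while `parses(fixed)` is true and the decoded description contains the literal quote again.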

## Writing large outputs in batches (IMPORTANT)

A single Write tool call has a hard ~32K output-token limit. Dense chunks
(50+ activities) will exceed this. If you estimate >30 activities, write the
file **incrementally**:

1. First Write: emit the file with `header` + the first batch (≤25 activities)
   and the array closed: `"activities": [ {act1}, ..., {act25} ] }`.
2. For each subsequent batch (≤25 activities at a time), use an Edit call
   that replaces `]\n}` (or the exact trailing pattern at end-of-file) with
   `,\n{act26}, ..., {act50}\n]\n}`. Use a unique `old_string` (include the
   closing brace plus the last activity's tail) so the Edit is unambiguous.
3. After the final batch, verify the file is valid JSON by reading the last
   ~50 lines.

This keeps each tool call under the output-token cap.
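
The batching steps above amount to a string operation on the end of the file. The sketch below simulates them in Python on an in-memory string; it is illustrative only, since the subagent uses Write and Edit tool calls rather than this code, and `first_write`, `append_batch`, and `TAIL` are hypothetical names.

```python
import json

TAIL = "\n]\n}"  # closing pattern kept at end-of-file after every step


def _serialize(batch: list[dict]) -> str:
    return ",\n".join(json.dumps(a, ensure_ascii=False) for a in batch)


def first_write(header: dict, batch: list[dict]) -> str:
    """Step 1: header plus the first batch, with the array already closed."""
    if not batch:
        raise ValueError("the first batch must contain at least one activity")
    return (
        '{\n"header": ' + json.dumps(header, ensure_ascii=False)
        + ',\n"activities": [\n' + _serialize(batch) + TAIL
    )


def append_batch(text: str, batch: list[dict]) -> str:
    """Step 2: replace the trailing pattern with ',' + new batch + pattern."""
    if not text.endswith(TAIL):
        raise ValueError("file does not end with the expected closing pattern")
    if not batch:
        return text
    # Only the final occurrence is touched, which is what a unique
    # old_string guarantees for the Edit call.
    return text[: -len(TAIL)] + ",\n" + _serialize(batch) + TAIL
```

Because the array is closed after every step, the file is valid JSON at each point, so an interrupted run still leaves a parseable (if incomplete) extraction.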

## Before you finish

- Every activity has a non-empty `source_excerpt` and `page_reference`.
- The file validates against `scripts/activity_schema.json`.
- You only used text from your assigned chunk.
110
scripts/activity_schema.json
Normal file
@@ -0,0 +1,110 @@
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Game-library extraction output",
  "description": "One subagent output file: a header carrying provenance/version metadata plus the list of activities extracted from a single chunk.",
  "type": "object",
  "required": ["header", "activities"],
  "additionalProperties": false,
  "properties": {
    "header": {
      "type": "object",
      "required": ["source_hash", "schema_version", "prompt_version", "chunk_range"],
      "additionalProperties": true,
      "properties": {
        "source_hash": {"type": "string", "minLength": 8},
        "schema_version": {"type": "string"},
        "prompt_version": {"type": "string"},
        "chunk_range": {"type": "string"},
        "source_id": {"type": ["string", "null"]},
        "chunk_key": {"type": ["string", "null"]}
      }
    },
    "activities": {
      "type": "array",
      "items": {"$ref": "#/definitions/activity"}
    }
  },
  "definitions": {
    "activity": {
      "type": "object",
      "required": [
        "name",
        "description",
        "category",
        "content_type",
        "language",
        "extraction_confidence",
        "source_excerpt",
        "page_reference"
      ],
      "additionalProperties": false,
      "properties": {
        "name": {"type": "string", "minLength": 3},
        "description": {"type": "string", "minLength": 1},
        "rules": {"type": ["string", "null"]},
        "variations": {"type": ["string", "null"]},
        "category": {
          "type": "string",
          "enum": [
            "jocuri-cercetasesti",
            "team-building",
            "icebreakers",
            "camp-outdoor",
            "wide-games",
            "orientare",
            "prim-ajutor",
            "escape-room-puzzle",
            "creative-stem",
            "sports-active",
            "cantece-ceremonii",
            "retete",
            "supravietuire",
            "integrare-incluziune",
            "conflict-empatie",
            "altele"
          ]
        },
        "subcategory": {"type": ["string", "null"]},
        "content_type": {
          "type": "string",
          "enum": ["joc", "activitate", "reteta", "cantec", "ceremonie"]
        },
        "language": {"type": "string", "enum": ["ro", "en"]},
        "extraction_confidence": {
          "type": "string",
          "enum": ["high", "med", "low"]
        },
        "source_excerpt": {"type": "string", "minLength": 1},
        "page_reference": {"type": "string", "minLength": 1},
        "source_file": {"type": ["string", "null"]},
        "age_group_min": {"type": ["integer", "null"], "minimum": 0},
        "age_group_max": {"type": ["integer", "null"], "minimum": 0},
        "participants_min": {"type": ["integer", "null"], "minimum": 0},
        "participants_max": {"type": ["integer", "null"], "minimum": 0},
        "duration_min": {"type": ["integer", "null"], "minimum": 0},
        "duration_max": {"type": ["integer", "null"], "minimum": 0},
        "materials_category": {"type": ["string", "null"]},
        "materials_list": {
          "type": ["array", "null"],
          "items": {"type": "string"}
        },
        "skills_developed": {
          "type": ["array", "null"],
          "items": {"type": "string"}
        },
        "difficulty_level": {
          "type": ["string", "null"],
          "enum": ["usor", "mediu", "dificil", null]
        },
        "keywords": {
          "type": ["array", "null"],
          "items": {"type": "string"}
        },
        "tags": {
          "type": ["array", "null"],
          "items": {"type": "string"}
        }
      }
    }
  }
}
780
scripts/build_database.py
Normal file
@@ -0,0 +1,780 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
build_database.py — build data/activities.db from the subagent extraction JSON.

Replaces the old import_claude_activities.py. Pipeline (plan §4):

  1. `--rebuild` builds into data/activities.db.tmp; on success the live DB is
     backed up to data/activities.db.bak and the tmp file is swapped in with an
     atomic os.replace. A mid-build crash leaves the live DB untouched.
  2. Every data/extracted/*.json is validated against scripts/activity_schema.json;
     invalid files are moved to data/extracted/_rejected/ with an error log.
  2b. Each source_excerpt must appear as a fuzzy substring (rapidfuzz
      partial_ratio >= 90) of its source chunk — non-matches are hallucinations
      and the activity is dropped (logged to _rejected/).
  3. `category` is normalized to a valid taxonomy slug (fallback `altele`).
  4. Dedup (D5): group by exact normalized_name, never across languages; within a
     group rapidfuzz on descriptions — >=85 auto-merge, 60-85 borderline (keep
     both, needs_review), <60 separate variants.
  5. data/review_decisions.json is applied before insert.
  6. Bulk insert into the tmp DB, populate the categories table, rebuild FTS.
  7. A QA report is printed.

Usage:
    python scripts/build_database.py --rebuild
"""

from __future__ import annotations

import argparse
import json
import os
import shutil
import sys
from collections import defaultdict
from pathlib import Path
from typing import Any, Optional

SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
    if _p not in sys.path:
        sys.path.insert(0, _p)

from app.config_taxonomy import (  # noqa: E402
    category_display_name,
    normalize_category,
    normalize_content_type,
)
from app.models.activity import Activity  # noqa: E402
from app.models.database import DatabaseManager  # noqa: E402
from import_common import (  # noqa: E402
    DEFAULT_SCHEMA_PATH,
    content_key,
    excerpt_matches,
    find_chunk_text,
    iter_extraction_files,
    load_schema,
    normalize_name,
    source_path_for,
)

# dedup thresholds (rapidfuzz token_sort_ratio, 0..100 scale)
AUTO_MERGE_THRESHOLD = 85.0
BORDERLINE_THRESHOLD = 60.0


# --------------------------------------------------------------------------
# extraction dict -> Activity
# --------------------------------------------------------------------------
def _csv(value: Any) -> Optional[str]:
    """Schema arrays -> comma string for the (TEXT) DB columns."""
    if value is None:
        return None
    if isinstance(value, str):
        return value.strip() or None
    if isinstance(value, (list, tuple)):
        parts = [str(v).strip() for v in value if str(v).strip()]
        return ", ".join(parts) or None
    return str(value)


def _split_csv(value: Optional[str]) -> list[str]:
    if not value:
        return []
    return [p.strip() for p in str(value).split(",") if p.strip()]


def dict_to_activity(
    adict: dict,
    source_file: str,
    source_id: Optional[str] = None,
    chunk_key: Optional[str] = None,
) -> Activity:
    """Build an Activity from one extraction-JSON activity object."""
    tags = adict.get("tags") or []
    if isinstance(tags, str):
        tags = _split_csv(tags)

    source_files = adict.get("source_files") or []
    if isinstance(source_files, str):
        source_files = _split_csv(source_files)
    if source_file and source_file not in source_files:
        source_files = [source_file, *source_files]

    return Activity(
        source_id=source_id,
        source_ids=[source_id] if source_id else [],
        chunk_key=chunk_key,
        name=(adict.get("name") or "").strip(),
        description=(adict.get("description") or "").strip(),
        rules=adict.get("rules"),
        variations=adict.get("variations"),
        category=normalize_category(adict.get("category", "")),
        subcategory=adict.get("subcategory"),
        content_type=normalize_content_type(adict.get("content_type", "")),
        source_file=source_file,
        source_files=list(source_files),
        page_reference=adict.get("page_reference"),
        source_excerpt=adict.get("source_excerpt"),
        age_group_min=adict.get("age_group_min"),
        age_group_max=adict.get("age_group_max"),
        participants_min=adict.get("participants_min"),
        participants_max=adict.get("participants_max"),
        duration_min=adict.get("duration_min"),
        duration_max=adict.get("duration_max"),
        materials_category=adict.get("materials_category"),
        materials_list=_csv(adict.get("materials_list")),
        skills_developed=_csv(adict.get("skills_developed")),
        difficulty_level=adict.get("difficulty_level"),
        keywords=_csv(adict.get("keywords")),
        tags=list(tags),
        language=adict.get("language"),
        extraction_confidence=adict.get("extraction_confidence"),
    )


# --------------------------------------------------------------------------
# step 3 — category normalization is done in dict_to_activity; a non-taxonomy
# value silently falls back to `altele`. This logs the substitutions.
# --------------------------------------------------------------------------
def log_category_fallbacks(raw_pairs: list[tuple[str, str]]) -> list[str]:
    """raw_pairs = (original, slug); return human-readable fallback messages."""
    msgs = []
    for original, slug in raw_pairs:
        if slug == "altele" and normalize_name(original or "") not in ("", "altele"):
            msgs.append(f"category '{original}' -> altele (not in taxonomy)")
    return msgs


# --------------------------------------------------------------------------
# step 4 — dedup
# --------------------------------------------------------------------------
def _longest(*values: Optional[str]) -> Optional[str]:
    best: Optional[str] = None
    for v in values:
        if v and (best is None or len(v) > len(best)):
            best = v
    return best


def _union_csv(values: list[Optional[str]]) -> Optional[str]:
    seen: list[str] = []
    for value in values:
        for item in _split_csv(value):
            if item not in seen:
                seen.append(item)
    return ", ".join(seen) or None


def merge_cluster(cluster: list[Activity]) -> Activity:
    """Collapse a cluster of duplicate activities into one merged Activity."""
    if len(cluster) == 1:
        return cluster[0]

    # representative = the one with the longest description
    rep = max(cluster, key=lambda a: len(a.description or ""))
    merged = Activity(
        name=rep.name,
        description=_longest(*(a.description for a in cluster)) or rep.description,
        rules=_longest(*(a.rules for a in cluster)),
        variations=_longest(*(a.variations for a in cluster)),
        category=rep.category,
        subcategory=rep.subcategory,
        content_type=rep.content_type,
        source_file=rep.source_file,
        page_reference=rep.page_reference,
        source_excerpt=rep.source_excerpt,
        age_group_min=rep.age_group_min,
        age_group_max=rep.age_group_max,
        participants_min=rep.participants_min,
        participants_max=rep.participants_max,
        duration_min=rep.duration_min,
        duration_max=rep.duration_max,
        materials_category=rep.materials_category,
        materials_list=_union_csv([a.materials_list for a in cluster]),
        skills_developed=_union_csv([a.skills_developed for a in cluster]),
        difficulty_level=rep.difficulty_level,
        keywords=_union_csv([a.keywords for a in cluster]),
        language=rep.language,
        extraction_confidence=rep.extraction_confidence,
    )
    # union of tags
    tags: list[str] = []
    for a in cluster:
        for t in a.tags or []:
            if t not in tags:
                tags.append(t)
    merged.tags = tags
    # accumulate every source the activity was seen in
    sources: list[str] = []
    for a in cluster:
        for s in [a.source_file, *(a.source_files or [])]:
            if s and s not in sources:
                sources.append(s)
    merged.source_files = sources
    # source provenance: keep rep's chunk_key/source_id as primary, union the
    # source_ids for the download route. Enrichment fields (name_ro,
    # description_ro, indoor_outdoor, ...) are intentionally NOT carried here:
    # enrichment is applied AFTER dedup (plan Part B2), keyed on the merged
    # row's content_key, so merging must not pre-populate them.
    merged.source_id = rep.source_id
    merged.chunk_key = rep.chunk_key
    source_ids: list[str] = []
    for a in cluster:
        for sid in [a.source_id, *(a.source_ids or [])]:
            if sid and sid not in source_ids:
                source_ids.append(sid)
    merged.source_ids = source_ids
    # popularity_score++ per merged duplicate (plan §4)
    merged.popularity_score = max(a.popularity_score for a in cluster) + (len(cluster) - 1)
    return merged


def dedup_activities(activities: list[Activity]) -> tuple[list[Activity], dict]:
    """
    Dedup per plan D5.

    Groups by (normalized_name, language) — different languages are NEVER
    merged. Within a group, descriptions are clustered with rapidfuzz:
      >= 85  -> same cluster (auto-merge)
      60-85  -> borderline: kept as separate clusters, both flagged needs_review
      < 60   -> separate variants
    """
    from rapidfuzz import fuzz

    groups: dict[tuple, list[Activity]] = defaultdict(list)
    for act in activities:
        key = (act.normalized_name or normalize_name(act.name), act.language)
        groups[key].append(act)

    result: list[Activity] = []
    stats = {"input": len(activities), "auto_merged": 0, "borderline": 0, "output": 0}

    for members in groups.values():
        clusters: list[list[Activity]] = []
        borderline_idx: set[int] = set()

        for act in members:
            best_idx, best_score = -1, -1.0
            borderline_here: list[int] = []
            for idx, cluster in enumerate(clusters):
                score = fuzz.token_sort_ratio(
                    act.description or "", cluster[0].description or ""
                )
                if score >= AUTO_MERGE_THRESHOLD:
                    if score > best_score:
                        best_idx, best_score = idx, score
                elif score >= BORDERLINE_THRESHOLD:
                    borderline_here.append(idx)
            if best_idx >= 0:
                clusters[best_idx].append(act)
            else:
                clusters.append([act])
                new_idx = len(clusters) - 1
                for bidx in borderline_here:
                    borderline_idx.add(bidx)
                    borderline_idx.add(new_idx)

        for idx, cluster in enumerate(clusters):
            merged = merge_cluster(cluster)
            if len(cluster) > 1:
                stats["auto_merged"] += len(cluster) - 1
            if idx in borderline_idx:
                merged.needs_review = 1
                stats["borderline"] += 1
            result.append(merged)

    stats["output"] = len(result)
    return result, stats


# --------------------------------------------------------------------------
# step 5 — review decisions
# --------------------------------------------------------------------------
def load_review_decisions(path: Path) -> dict:
    if path and path.is_file():
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            if isinstance(data, dict):
                return data
        except (json.JSONDecodeError, OSError):
            pass
    return {}


def apply_review_decisions(
    activities: list[Activity], decisions: dict
) -> tuple[list[Activity], dict]:
    """
    Apply data/review_decisions.json (plan §5c).

    Keyed by the stable content_key. A decision of `drop` removes the row;
    `keep-separate` / `merge` clear needs_review (the user has resolved it).
    Rows with no decision keep needs_review and resurface in the queue.
    """
    kept: list[Activity] = []
    stats = {"dropped": 0, "resolved": 0}
    for act in activities:
        key = content_key(
            act.normalized_name or normalize_name(act.name),
            act.language,
            act.description or "",
        )
        entry = decisions.get(key)
        decision = entry.get("decision") if isinstance(entry, dict) else entry
        if decision == "drop":
            stats["dropped"] += 1
            continue
        if decision in ("keep-separate", "merge"):
            act.needs_review = 0
            stats["resolved"] += 1
        kept.append(act)
    return kept, stats


# --------------------------------------------------------------------------
# step 5b — enrichment overlay (plan Part B)
# --------------------------------------------------------------------------
# Translation / inferred-filter fields written by run_enrichment.py. Applied
# AFTER dedup + review decisions, keyed on the same stable content_key, so the
# overlay survives rebuilds as long as extraction text is frozen.
_ENRICHMENT_TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
_ENRICHMENT_INT_FIELDS = (
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
)


def load_enrichment(path: Path) -> dict:
    """Load data/enrichment.json (flat map content_key -> field dict)."""
    if path and path.is_file():
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            if isinstance(data, dict):
                return data
        except (json.JSONDecodeError, OSError):
            pass
    return {}


def apply_enrichment(activities: list[Activity], enrichment: dict) -> dict:
    """
    Overlay enrichment fields onto the post-dedup activity list (plan B2).

    Keyed by content_key. Only fields PRESENT in an entry are written; absent
    fields leave the underlying DB value untouched. indoor_outdoor /
    space_needed are normalized to slugs (None on unrecognised). Inferred
    fields are recorded in `estimated_fields`. Translated / expanded text is
    NOT re-validated against the source here — expansion fidelity is the
    enrichment prompt's responsibility (plan B2 comment).

    Returns {entries, matched, orphaned, fields_stated, fields_estimated}.
    """
    from app.config_taxonomy import normalize_indoor_outdoor, normalize_space_needed

    matched_keys: set[str] = set()
    fields_stated: dict[str, int] = defaultdict(int)
    fields_estimated: dict[str, int] = defaultdict(int)

    for act in activities:
        key = content_key(
            act.normalized_name or normalize_name(act.name),
            act.language,
            act.description or "",
        )
        entry = enrichment.get(key)
        if not isinstance(entry, dict):
            continue
        matched_keys.add(key)

        estimated = set(entry.get("estimated_fields") or [])

        # bilingual text twins
        for fld in _ENRICHMENT_TEXT_FIELDS:
            val = entry.get(fld)
            if isinstance(val, str) and val.strip():
                setattr(act, fld, val.strip())

        # inferred / clarified structured numeric fields
        for fld in _ENRICHMENT_INT_FIELDS:
            if entry.get(fld) is not None:
                try:
                    setattr(act, fld, int(entry[fld]))
                except (TypeError, ValueError):
                    pass

        # enum filters — normalized to slug, dropped if unrecognised
        if entry.get("indoor_outdoor") is not None:
            slug = normalize_indoor_outdoor(entry["indoor_outdoor"])
            if slug:
                act.indoor_outdoor = slug
        if entry.get("space_needed") is not None:
            slug = normalize_space_needed(entry["space_needed"])
            if slug:
                act.space_needed = slug

        act.estimated_fields = sorted(estimated)

        # QA tally: stated vs estimated population, per field
        for fld in (*_ENRICHMENT_INT_FIELDS, "indoor_outdoor", "space_needed"):
            if entry.get(fld) is None:
                continue
            if fld in estimated:
                fields_estimated[fld] += 1
            else:
                fields_stated[fld] += 1

    return {
        "entries": len(enrichment),
        "matched": len(matched_keys),
        "orphaned": len(enrichment) - len(matched_keys),
        "fields_stated": dict(fields_stated),
        "fields_estimated": dict(fields_estimated),
    }


# --------------------------------------------------------------------------
# golden-set recall (plan §7)
# --------------------------------------------------------------------------
def _golden_names(data: Any) -> list[str]:
    items = data.get("activities", data) if isinstance(data, dict) else data
    names: list[str] = []
    for item in items or []:
        if isinstance(item, str):
            names.append(item)
        elif isinstance(item, dict) and item.get("name"):
            names.append(item["name"])
    return names


def golden_recall(golden_dir: Path, activities: list[Activity]) -> Optional[dict]:
    if not golden_dir or not golden_dir.is_dir():
        return None
    found = {normalize_name(a.name) for a in activities}
    expected, hits = 0, 0
    for gf in sorted(golden_dir.glob("*.json")):
        try:
            data = json.loads(gf.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError):
            continue
        for name in _golden_names(data):
            expected += 1
            if normalize_name(name) in found:
                hits += 1
    if expected == 0:
        return None
    return {"expected": expected, "found": hits, "recall": round(hits / expected, 3)}


# --------------------------------------------------------------------------
|
||||
# load + validate + excerpt-check the extraction files
|
||||
# --------------------------------------------------------------------------
|
||||
def collect_activities(
|
||||
extracted_dir: Path,
|
||||
chunks_dir: Path,
|
||||
sources_dir: Path,
|
||||
schema: dict,
|
||||
) -> dict:
|
||||
"""Validate, excerpt-check and convert every extraction file."""
|
||||
rejected_dir = extracted_dir / "_rejected"
|
||||
activities: list[Activity] = []
|
||||
report = {
|
||||
"files_total": 0,
|
||||
"files_valid": 0,
|
||||
"files_rejected_schema": 0,
|
||||
"activities_raw": 0,
|
||||
"activities_hallucinated": 0,
|
||||
"category_fallbacks": [],
|
||||
}
|
||||
raw_categories: list[tuple[str, str]] = []
|
||||
|
||||
from import_common import chunk_key_for # local import to avoid clutter
|
||||
|
||||
for json_path in iter_extraction_files(extracted_dir):
|
||||
report["files_total"] += 1
|
||||
try:
|
||||
data = json.loads(json_path.read_text(encoding="utf-8"))
|
||||
except json.JSONDecodeError as exc:
|
||||
_reject_file(json_path, rejected_dir, [f"invalid JSON: {exc}"])
|
||||
report["files_rejected_schema"] += 1
|
||||
continue
|
||||
|
||||
from import_common import validate_extraction
|
||||
|
||||
errors = validate_extraction(data, schema)
|
||||
if errors:
|
||||
_reject_file(json_path, rejected_dir, errors)
|
||||
report["files_rejected_schema"] += 1
|
||||
continue
|
||||
report["files_valid"] += 1
|
||||
|
||||
header = data.get("header", {})
|
||||
chunk_text = find_chunk_text(json_path, header, chunks_dir)
|
||||
chunk_key = chunk_key_for(json_path, header)
|
||||
source_id = header.get("source_id") or chunk_key.rsplit(".part", 1)[0]
|
||||
fallback_source = (
|
||||
source_path_for(source_id, sources_dir) or source_id or json_path.stem
|
||||
)
|
||||
|
||||
hallucinated: list[dict] = []
|
||||
for adict in data.get("activities", []):
|
||||
report["activities_raw"] += 1
|
||||
excerpt = adict.get("source_excerpt") or ""
|
||||
# If the chunk text is unavailable the excerpt cannot be verified: keep the
# activity anyway. It is still counted under activities_raw in the QA report.
|
||||
if chunk_text is not None and not excerpt_matches(excerpt, chunk_text):
|
||||
hallucinated.append(adict)
|
||||
report["activities_hallucinated"] += 1
|
||||
continue
|
||||
src = adict.get("source_file") or fallback_source
|
||||
raw_categories.append((adict.get("category", ""), normalize_category(adict.get("category", ""))))
|
||||
activities.append(dict_to_activity(adict, src, source_id, chunk_key))
|
||||
|
||||
if hallucinated:
|
||||
_log_hallucinations(json_path, rejected_dir, hallucinated)
|
||||
|
||||
report["category_fallbacks"] = log_category_fallbacks(raw_categories)
|
||||
report["activities"] = activities
|
||||
return report
|
||||
|
||||
|
||||
def _reject_file(json_path: Path, rejected_dir: Path, errors: list[str]) -> None:
|
||||
rejected_dir.mkdir(parents=True, exist_ok=True)
|
||||
dest = rejected_dir / json_path.name
|
||||
shutil.move(str(json_path), str(dest))
|
||||
log = rejected_dir / f"{json_path.stem}.errors.txt"
|
||||
log.write_text(
|
||||
f"REJECTED (schema validation): {json_path.name}\n\n"
|
||||
+ "\n".join(f" - {e}" for e in errors)
|
||||
+ "\n",
|
||||
encoding="utf-8",
|
||||
)
|
||||
|
||||
|
||||
def _log_hallucinations(
|
||||
json_path: Path, rejected_dir: Path, hallucinated: list[dict]
|
||||
) -> None:
|
||||
rejected_dir.mkdir(parents=True, exist_ok=True)
|
||||
log = rejected_dir / f"{json_path.stem}.hallucinations.txt"
|
||||
lines = [f"DROPPED activities (source_excerpt not found in chunk): {json_path.name}", ""]
|
||||
for a in hallucinated:
|
||||
lines.append(f" - {a.get('name')!r}")
|
||||
lines.append(f" excerpt: {a.get('source_excerpt')!r}")
|
||||
log.write_text("\n".join(lines) + "\n", encoding="utf-8")
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# DB write + atomic swap
|
||||
# --------------------------------------------------------------------------
|
||||
def _enrich_category_display_names(db_path: Path) -> None:
|
||||
"""Give the categories table proper Romanian display names for slugs."""
|
||||
import sqlite3
|
||||
|
||||
conn = sqlite3.connect(db_path)
|
||||
try:
|
||||
rows = conn.execute(
|
||||
"SELECT value FROM categories WHERE type = 'category'"
|
||||
).fetchall()
|
||||
for (slug,) in rows:
|
||||
conn.execute(
|
||||
"UPDATE categories SET display_name = ? WHERE type='category' AND value = ?",
|
||||
(category_display_name(slug), slug),
|
||||
)
|
||||
conn.commit()
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
|
||||
def write_database(db_tmp_path: Path, activities: list[Activity]) -> None:
|
||||
"""Create a fresh tmp DB, bulk insert, populate categories, rebuild FTS."""
|
||||
if db_tmp_path.exists():
|
||||
db_tmp_path.unlink()
|
||||
db = DatabaseManager(str(db_tmp_path))
|
||||
db.bulk_insert_activities(activities)
|
||||
_enrich_category_display_names(db_tmp_path)
|
||||
db.rebuild_fts_index()
|
||||
|
||||
|
||||
def atomic_swap(db_tmp_path: Path, db_path: Path) -> Optional[Path]:
|
||||
"""Back up the live DB then atomically swap the tmp file in."""
|
||||
backup: Optional[Path] = None
|
||||
if db_path.exists():
|
||||
backup = db_path.with_suffix(db_path.suffix + ".bak")
|
||||
shutil.copy2(db_path, backup)
|
||||
os.replace(db_tmp_path, db_path)
|
||||
return backup
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# orchestration
|
||||
# --------------------------------------------------------------------------
|
||||
def rebuild(
|
||||
*,
|
||||
extracted_dir: Path,
|
||||
chunks_dir: Path,
|
||||
sources_dir: Path,
|
||||
db_path: Path,
|
||||
decisions_path: Optional[Path] = None,
|
||||
enrichment_path: Optional[Path] = None,
|
||||
schema_path: Path = DEFAULT_SCHEMA_PATH,
|
||||
golden_dir: Optional[Path] = None,
|
||||
do_swap: bool = True,
|
||||
) -> dict:
|
||||
"""
|
||||
Full rebuild. Everything is built into <db_path>.tmp; the live DB is only
|
||||
touched by the final atomic swap, so a crash anywhere above leaves it intact.
|
||||
"""
|
||||
extracted_dir = Path(extracted_dir)
|
||||
db_path = Path(db_path)
|
||||
db_tmp_path = db_path.with_suffix(db_path.suffix + ".tmp")
|
||||
|
||||
schema = load_schema(schema_path)
|
||||
collected = collect_activities(extracted_dir, Path(chunks_dir), Path(sources_dir), schema)
|
||||
activities: list[Activity] = collected.pop("activities")
|
||||
|
||||
deduped, dedup_stats = dedup_activities(activities)
|
||||
|
||||
decisions = load_review_decisions(Path(decisions_path)) if decisions_path else {}
|
||||
final, decision_stats = apply_review_decisions(deduped, decisions)
|
||||
|
||||
# Enrichment overlay — applied immediately after review decisions, on the
|
||||
# post-dedup list, keyed on the same stable content_key (plan B2).
|
||||
enrichment = load_enrichment(Path(enrichment_path)) if enrichment_path else {}
|
||||
enrichment_stats = apply_enrichment(final, enrichment)
|
||||
|
||||
try:
|
||||
write_database(db_tmp_path, final)
|
||||
backup = atomic_swap(db_tmp_path, db_path) if do_swap else None
|
||||
except Exception:
|
||||
if db_tmp_path.exists():
|
||||
db_tmp_path.unlink()
|
||||
raise
|
||||
|
||||
report = {
|
||||
**collected,
|
||||
"dedup": dedup_stats,
|
||||
"decisions": decision_stats,
|
||||
"enrichment": enrichment_stats,
|
||||
"final_count": len(final),
|
||||
"backup": str(backup) if backup else None,
|
||||
"swapped": do_swap,
|
||||
"qa": _qa_report(final, collected, golden_dir),
|
||||
}
|
||||
return report
|
||||
|
||||
|
||||
def _qa_report(
|
||||
activities: list[Activity], collected: dict, golden_dir: Optional[Path]
|
||||
) -> dict:
|
||||
per_category: dict[str, int] = defaultdict(int)
|
||||
per_content_type: dict[str, int] = defaultdict(int)
|
||||
confidence: dict[str, int] = defaultdict(int)
|
||||
with_rules = 0
|
||||
for a in activities:
|
||||
per_category[a.category] += 1
|
||||
per_content_type[a.content_type or "?"] += 1
|
||||
confidence[a.extraction_confidence or "?"] += 1
|
||||
if a.rules and a.rules.strip():
|
||||
with_rules += 1
|
||||
raw = collected.get("activities_raw", 0)
|
||||
hallucinated = collected.get("activities_hallucinated", 0)
|
||||
return {
|
||||
"total": len(activities),
|
||||
"per_category": dict(per_category),
|
||||
"per_content_type": dict(per_content_type),
|
||||
"extraction_confidence": dict(confidence),
|
||||
"pct_with_rules": round(100 * with_rules / len(activities), 1) if activities else 0.0,
|
||||
"needs_review": sum(1 for a in activities if a.needs_review),
|
||||
"hallucination_rate": round(100 * hallucinated / raw, 2) if raw else 0.0,
|
||||
"golden_recall": golden_recall(Path(golden_dir), activities) if golden_dir else None,
|
||||
}
|
||||
|
||||
|
||||
def print_report(report: dict) -> None:
|
||||
qa = report["qa"]
|
||||
print("=" * 60)
|
||||
print("BUILD DATABASE — QA REPORT")
|
||||
print("=" * 60)
|
||||
print(f"extraction files : {report['files_total']} "
|
||||
f"(valid {report['files_valid']}, schema-rejected {report['files_rejected_schema']})")
|
||||
print(f"activities raw : {report['activities_raw']}")
|
||||
print(f" hallucinated drop : {report['activities_hallucinated']} "
|
||||
f"({qa['hallucination_rate']}%)")
|
||||
d = report["dedup"]
|
||||
print(f"dedup : {d['input']} -> {d['output']} "
|
||||
f"(auto-merged {d['auto_merged']}, borderline {d['borderline']})")
|
||||
print(f"review decisions : dropped {report['decisions']['dropped']}, "
|
||||
f"resolved {report['decisions']['resolved']}")
|
||||
enr = report.get("enrichment")
|
||||
if enr and enr.get("entries"):
|
||||
print(f"enrichment : {enr['entries']} entries "
|
||||
f"(matched {enr['matched']}, orphaned {enr['orphaned']})")
|
||||
stated, estimated = enr.get("fields_stated", {}), enr.get("fields_estimated", {})
|
||||
all_fields = sorted(set(stated) | set(estimated))
|
||||
if all_fields:
|
||||
print(" field population : (stated / estimated)")
|
||||
for fld in all_fields:
|
||||
print(f" {fld:<18}: {stated.get(fld, 0)} / {estimated.get(fld, 0)}")
|
||||
print(f"final inserted : {report['final_count']}")
|
||||
print(f"% with rules : {qa['pct_with_rules']}")
|
||||
print(f"needs_review rows : {qa['needs_review']}")
|
||||
print("per category :")
|
||||
for slug, n in sorted(qa["per_category"].items(), key=lambda kv: -kv[1]):
|
||||
print(f" {slug:<24}: {n}")
|
||||
print("per content_type :")
|
||||
for ct, n in sorted(qa["per_content_type"].items(), key=lambda kv: -kv[1]):
|
||||
print(f" {ct:<24}: {n}")
|
||||
print("extraction_confidence:")
|
||||
for c, n in sorted(qa["extraction_confidence"].items()):
|
||||
print(f" {c:<24}: {n}")
|
||||
if qa["golden_recall"]:
|
||||
g = qa["golden_recall"]
|
||||
print(f"golden recall : {g['found']}/{g['expected']} = {g['recall']}")
|
||||
if report["category_fallbacks"]:
|
||||
print("category fallbacks :")
|
||||
for msg in report["category_fallbacks"]:
|
||||
print(f" {msg}")
|
||||
if report["backup"]:
|
||||
print(f"live DB backed up to : {report['backup']}")
|
||||
print("=" * 60)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# CLI
|
||||
# --------------------------------------------------------------------------
|
||||
def main(argv: Optional[list[str]] = None) -> int:
|
||||
parser = argparse.ArgumentParser(description="Build activities.db from extraction JSON.")
|
||||
parser.add_argument("--rebuild", action="store_true",
|
||||
help="rebuild the database from scratch (only mode supported)")
|
||||
parser.add_argument("--extracted", default="data/extracted")
|
||||
parser.add_argument("--chunks", default="data/chunks")
|
||||
parser.add_argument("--sources", default="data/sources")
|
||||
parser.add_argument("--db", default="data/activities.db")
|
||||
parser.add_argument("--decisions", default="data/review_decisions.json")
|
||||
parser.add_argument("--enrichment", default="data/enrichment.json")
|
||||
parser.add_argument("--golden", default="data/golden")
|
||||
parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
|
||||
args = parser.parse_args(argv)
|
||||
|
||||
if not args.rebuild:
|
||||
parser.error("only --rebuild is supported (full rebuild, no incremental merge)")
|
||||
|
||||
report = rebuild(
|
||||
extracted_dir=Path(args.extracted),
|
||||
chunks_dir=Path(args.chunks),
|
||||
sources_dir=Path(args.sources),
|
||||
db_path=Path(args.db),
|
||||
decisions_path=Path(args.decisions),
|
||||
enrichment_path=Path(args.enrichment),
|
||||
schema_path=Path(args.schema),
|
||||
golden_dir=Path(args.golden),
|
||||
)
|
||||
print_report(report)
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
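The backup-then-swap step performed by `atomic_swap` in `build_database.py` can be tried in isolation. The following is a minimal sketch under stated assumptions, not the script's own code: the helper name `swap_in`, the temporary directory, and the file contents are hypothetical; only `shutil.copy2` and `os.replace` mirror the calls used above.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def swap_in(tmp_path: Path, live_path: Path) -> Optional[Path]:
    """Copy the live file to <name>.bak (if it exists), then rename tmp over it."""
    backup: Optional[Path] = None
    if live_path.exists():
        backup = live_path.with_suffix(live_path.suffix + ".bak")
        shutil.copy2(live_path, backup)
    # os.replace overwrites the destination with a single rename, which is
    # atomic when both paths live on the same filesystem.
    os.replace(tmp_path, live_path)
    return backup


# Hypothetical stand-ins for data/activities.db and its .tmp sibling.
with tempfile.TemporaryDirectory() as d:
    live = Path(d) / "activities.db"
    tmp = Path(d) / "activities.db.tmp"
    live.write_text("old", encoding="utf-8")
    tmp.write_text("new", encoding="utf-8")
    backup = swap_in(tmp, live)
    result = (
        live.read_text(encoding="utf-8"),    # content now served from the live path
        backup.read_text(encoding="utf-8"),  # previous content kept in the .bak copy
        tmp.exists(),                        # the tmp file is consumed by the rename
    )
```

Because the new database is built entirely in the `.tmp` file, a failure before the rename leaves the live file as it was, which is the property the `rebuild` docstring relies on.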
251
scripts/chunk_sources.py
Normal file
@@ -0,0 +1,251 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
chunk_sources.py — split normalized data/sources/*.txt into ~20-page chunks
|
||||
for subagent extraction, and maintain data/chunks/manifest.json.
|
||||
|
||||
Paginated text → ~20-page chunks, ~4-page overlap (plan D8).
|
||||
Unpaginated text → ~10000-word windows, ~2000-word overlap.
|
||||
|
||||
The manifest is a cache derived from the filesystem + per-chunk state. Re-running
|
||||
this script is idempotent: existing chunk states (pending/assigned/done/rejected)
|
||||
survive as long as the source content hash is unchanged.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
if str(SCRIPT_DIR) not in sys.path:
|
||||
sys.path.insert(0, str(SCRIPT_DIR))
|
||||
|
||||
from extract_common import content_hash, split_pages # noqa: E402
|
||||
|
||||
SCHEMA_VERSION = "1.0"
|
||||
PAGES_PER_CHUNK = 20
|
||||
PAGE_OVERLAP = 4
|
||||
WORD_WINDOW = 10_000
|
||||
WORD_OVERLAP = 2_000
|
||||
|
||||
VALID_STATES = {"pending", "assigned", "done", "rejected"}
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# header parsing
|
||||
# --------------------------------------------------------------------------
|
||||
def parse_source(text: str) -> tuple[dict, str]:
|
||||
"""Split a normalized source file into (header_dict, body)."""
|
||||
lines = text.splitlines()
|
||||
header: dict = {}
|
||||
body_start = 0
|
||||
in_header = True
|
||||
for i, line in enumerate(lines):
|
||||
if line.startswith("--- PAGE "):
|
||||
body_start = i
|
||||
break
|
||||
if not in_header:
|
||||
continue
|
||||
if set(line.strip()) == {"="} and line.strip():
|
||||
body_start = i + 1
|
||||
in_header = False # header ends at the rule line
|
||||
continue
|
||||
if ":" in line:
|
||||
key, _, val = line.partition(":")
|
||||
header[key.strip()] = val.strip()
|
||||
body = "\n".join(lines[body_start:])
|
||||
return header, body
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# chunking — pure functions
|
||||
# --------------------------------------------------------------------------
|
||||
def chunk_pages(
|
||||
pages: list[tuple[int, str]],
|
||||
pages_per_chunk: int = PAGES_PER_CHUNK,
|
||||
overlap: int = PAGE_OVERLAP,
|
||||
) -> list[dict]:
|
||||
"""
|
||||
Split an ordered list of (page_no, text) into overlapping chunks.
|
||||
|
||||
stride = pages_per_chunk - overlap, so consecutive chunks share `overlap` pages.
An activity spanning at most overlap + 1 pages (5 with the defaults) therefore
appears whole in at least one chunk.
|
||||
"""
|
||||
if not pages:
|
||||
return []
|
||||
stride = max(1, pages_per_chunk - overlap)
|
||||
chunks: list[dict] = []
|
||||
i = 0
|
||||
n = len(pages)
|
||||
while i < n:
|
||||
window = pages[i : i + pages_per_chunk]
|
||||
first, last = window[0][0], window[-1][0]
|
||||
text = "".join(
|
||||
f"\n--- PAGE {num} ---\n{txt}\n" for num, txt in window
|
||||
)
|
||||
chunks.append(
|
||||
{"page_start": first, "page_end": last,
|
||||
"chunk_range": f"pages {first}-{last}", "text": text}
|
||||
)
|
||||
if i + pages_per_chunk >= n:
|
||||
break
|
||||
i += stride
|
||||
return chunks
|
||||
|
||||
|
||||
def chunk_words(
|
||||
text: str, window: int = WORD_WINDOW, overlap: int = WORD_OVERLAP
|
||||
) -> list[dict]:
|
||||
"""Split unpaginated text into overlapping word windows."""
|
||||
words = text.split()
|
||||
if not words:
|
||||
return []
|
||||
stride = max(1, window - overlap)
|
||||
chunks: list[dict] = []
|
||||
i = 0
|
||||
n = len(words)
|
||||
while i < n:
|
||||
seg = words[i : i + window]
|
||||
chunks.append(
|
||||
{"word_start": i, "word_end": i + len(seg),
|
||||
"chunk_range": f"words {i}-{i + len(seg)}", "text": " ".join(seg)}
|
||||
)
|
||||
if i + window >= n:
|
||||
break
|
||||
i += stride
|
||||
return chunks
|
||||
|
||||
|
||||
def make_chunks(source_text: str) -> list[dict]:
|
||||
"""Chunk one normalized source file. Picks page- or word-windowing."""
|
||||
_, body = parse_source(source_text)
|
||||
pages = split_pages(body)
|
||||
if pages:
|
||||
return chunk_pages(pages)
|
||||
return chunk_words(body)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# manifest
|
||||
# --------------------------------------------------------------------------
|
||||
def _empty_manifest() -> dict:
|
||||
return {"schema_version": SCHEMA_VERSION, "chunks": {}}
|
||||
|
||||
|
||||
def load_manifest(manifest_path: Path) -> dict:
|
||||
if manifest_path.exists():
|
||||
try:
|
||||
data = json.loads(manifest_path.read_text(encoding="utf-8"))
|
||||
data.setdefault("schema_version", SCHEMA_VERSION)
|
||||
data.setdefault("chunks", {})
|
||||
return data
|
||||
except (json.JSONDecodeError, OSError):
|
||||
pass
|
||||
return _empty_manifest()
|
||||
|
||||
|
||||
def save_manifest(manifest: dict, manifest_path: Path) -> None:
|
||||
manifest_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
manifest_path.write_text(
|
||||
json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8"
|
||||
)
|
||||
|
||||
|
||||
def chunk_source_file(
|
||||
source_path: Path, chunks_dir: Path, manifest: dict
|
||||
) -> list[str]:
|
||||
"""
|
||||
Chunk one data/sources/<id>.txt → data/chunks/<id>/<id>.partNN.txt and
|
||||
register every chunk in `manifest`. Preserves prior state when the source
|
||||
content hash is unchanged. Returns the list of chunk keys written.
|
||||
"""
|
||||
source_id = source_path.stem
|
||||
text = source_path.read_text(encoding="utf-8", errors="replace")
|
||||
src_hash = content_hash(text)
|
||||
chunks = make_chunks(text)
|
||||
|
||||
out_dir = chunks_dir / source_id
|
||||
out_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
written: list[str] = []
|
||||
for idx, chunk in enumerate(chunks, 1):
|
||||
key = f"{source_id}.part{idx:02d}"
|
||||
chunk_file = out_dir / f"{key}.txt"
|
||||
chunk_file.write_text(chunk["text"], encoding="utf-8")
|
||||
|
||||
prior = manifest["chunks"].get(key)
|
||||
# preserve state only if the source content is unchanged
|
||||
if prior and prior.get("source_hash") == src_hash and \
|
||||
prior.get("state") in VALID_STATES:
|
||||
state = prior["state"]
|
||||
else:
|
||||
state = "pending"
|
||||
|
||||
manifest["chunks"][key] = {
|
||||
"source_id": source_id,
|
||||
"source_hash": src_hash,
|
||||
"part": idx,
|
||||
"chunk_range": chunk["chunk_range"],
|
||||
"chunk_file": str(chunk_file.relative_to(chunks_dir.parent)),
|
||||
"expected_json": f"{key}.json",
|
||||
"state": state,
|
||||
}
|
||||
written.append(key)
|
||||
return written
|
||||
|
||||
|
||||
def prune_stale(manifest: dict, live_keys: set[str]) -> list[str]:
|
||||
"""Drop manifest entries whose chunk no longer exists on disk."""
|
||||
stale = [k for k in manifest["chunks"] if k not in live_keys]
|
||||
for k in stale:
|
||||
del manifest["chunks"][k]
|
||||
return stale
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# CLI
|
||||
# --------------------------------------------------------------------------
|
||||
def run(sources_dir: Path, chunks_dir: Path) -> dict:
|
||||
"""Chunk every *.txt in sources_dir. Returns a summary dict."""
|
||||
manifest_path = chunks_dir / "manifest.json"
|
||||
manifest = load_manifest(manifest_path)
|
||||
|
||||
live_keys: set[str] = set()
|
||||
source_files = sorted(sources_dir.glob("*.txt"))
|
||||
for src in source_files:
|
||||
live_keys.update(chunk_source_file(src, chunks_dir, manifest))
|
||||
|
||||
stale = prune_stale(manifest, live_keys)
|
||||
save_manifest(manifest, manifest_path)
|
||||
|
||||
states: dict[str, int] = {}
|
||||
for meta in manifest["chunks"].values():
|
||||
states[meta["state"]] = states.get(meta["state"], 0) + 1
|
||||
return {
|
||||
"sources": len(source_files),
|
||||
"chunks": len(live_keys),
|
||||
"pruned": len(stale),
|
||||
"states": states,
|
||||
}
|
||||
|
||||
|
||||
def main(argv: list[str] | None = None) -> int:
|
||||
parser = argparse.ArgumentParser(description="Chunk normalized sources.")
|
||||
parser.add_argument("--sources", default="data/sources", help="sources dir")
|
||||
parser.add_argument("--chunks", default="data/chunks", help="chunks output dir")
|
||||
args = parser.parse_args(argv)
|
||||
|
||||
summary = run(Path(args.sources), Path(args.chunks))
|
||||
print(f"sources processed : {summary['sources']}")
|
||||
print(f"chunks written : {summary['chunks']}")
|
||||
print(f"stale pruned : {summary['pruned']}")
|
||||
for state, count in sorted(summary["states"].items()):
|
||||
print(f" {state:<10}: {count}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
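The windowing arithmetic in `chunk_pages` is easier to follow with concrete numbers. The sketch below is a simplified illustration, not part of `chunk_sources.py`: the helper `page_windows` is hypothetical, and it assumes pages are numbered consecutively from 1, returning only the page ranges rather than chunk text.

```python
from __future__ import annotations


def page_windows(n_pages: int, per_chunk: int = 20, overlap: int = 4) -> list[tuple[int, int]]:
    """Return 1-based (first, last) page ranges using the same loop shape as chunk_pages."""
    stride = max(1, per_chunk - overlap)
    windows: list[tuple[int, int]] = []
    i = 0
    while i < n_pages:
        last = min(i + per_chunk, n_pages)
        windows.append((i + 1, last))
        # Stop once the current window already reaches the final page, so no
        # short trailing chunk made only of overlap pages is emitted.
        if i + per_chunk >= n_pages:
            break
        i += stride
    return windows


windows = page_windows(50)
print(windows)
```

For a 50-page source with the default 20-page chunks and 4-page overlap the stride is 16, so the windows start at pages 1, 17 and 33, and each neighbouring pair shares four pages.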
@@ -1,54 +0,0 @@
|
||||
# TEMPLATE FOR EXTRACTING ACTIVITIES WITH CLAUDE

## Instructions for Claude Code:

For each PDF/DOC, use the following extraction format:

### 1. Read the file:
```
Claude, please read the file: [FILE_PATH]
```

### 2. Extract the activities using this JSON template:
```json
{
  "source_file": "[FILE_NAME]",
  "activities": [
    {
      "name": "Name of the activity",
      "description": "Full description of the activity",
      "rules": "Rules of the game/activity",
      "variations": "Variants or adaptations",
      "category": "[A-H] based on type",
      "age_group_min": 6,
      "age_group_max": 14,
      "participants_min": 4,
      "participants_max": 20,
      "duration_min": 10,
      "duration_max": 30,
      "materials_list": "List of required materials",
      "skills_developed": "Skills developed",
      "difficulty_level": "Ușor/Mediu/Dificil",
      "keywords": "comma-separated keywords",
      "tags": "relevant tags"
    }
  ]
}
```

### 3. Save to a file:
After extraction, save the JSON to: `/scripts/extracted_activities/[FILE_NAME].json`

### 4. Processing priorities:

**TOP PRIORITY (process these first):**
1. 1000 Fantastic Scout Games.pdf
2. Cartea Mare a jocurilor.pdf
3. 160-de-activitati-dinamice-jocuri-pentru-team-building.pdf
4. 101 Ways to Create an Unforgettable Camp Experience.pdf
5. 151 Awesome Summer Camp Nature Activities.pdf

**Focus categories:**
- [A] Scout Games
- [C] Camping & Outdoor Activities
- [G] Educational Activities
|
||||
@@ -1,164 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
DATABASE SETUP SCRIPT - INDEX-SISTEM-JOCURI
|
||||
|
||||
Script for recreating the databases listed in .gitignore
Uses the DatabaseManager classes for consistency
|
||||
|
||||
Usage:
|
||||
python scripts/create_databases.py
|
||||
python scripts/create_databases.py --clear-existing
|
||||
"""
|
||||
|
||||
import sys
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
|
||||
# Add src to path so we can import our modules
|
||||
sys.path.append(str(Path(__file__).parent.parent / 'src'))
|
||||
|
||||
from database import DatabaseManager
|
||||
from game_library_manager import GameLibraryManager
|
||||
|
||||
def create_main_database(db_path: str = "data/activities.db", clear: bool = False):
|
||||
"""Create the main activities database"""
|
||||
db_file = Path(db_path)
|
||||
|
||||
if clear and db_file.exists():
|
||||
print(f"🗑️ Removing existing database: {db_path}")
|
||||
db_file.unlink()
|
||||
|
||||
print(f"📊 Creating main database: {db_path}")
|
||||
db = DatabaseManager(db_path)
|
||||
|
||||
# Test the database
|
||||
try:
|
||||
stats = db.get_statistics()
|
||||
print(f"✅ Database created successfully: {stats['total_activities']} activities")
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Error creating database: {e}")
|
||||
return False
|
||||
|
||||
def create_game_library_database(db_path: str = "data/game_library.db", clear: bool = False):
|
||||
"""Create the legacy game library database"""
|
||||
db_file = Path(db_path)
|
||||
|
||||
if clear and db_file.exists():
|
||||
print(f"🗑️ Removing existing database: {db_path}")
|
||||
db_file.unlink()
|
||||
|
||||
print(f"📊 Creating game library database: {db_path}")
|
||||
manager = GameLibraryManager(db_path)
|
||||
|
||||
print(f"✅ Game library database created successfully")
|
||||
return True
|
||||
|
||||
def create_test_database(db_path: str = "data/test_activities.db", clear: bool = False):
|
||||
"""Create the test database"""
|
||||
db_file = Path(db_path)
|
||||
|
||||
if clear and db_file.exists():
|
||||
print(f"🗑️ Removing existing database: {db_path}")
|
||||
db_file.unlink()
|
||||
|
||||
print(f"📊 Creating test database: {db_path}")
|
||||
db = DatabaseManager(db_path)
|
||||
|
||||
# Add some test data
|
||||
test_activity = {
|
||||
'title': 'Test Activity - Setup Script',
|
||||
'description': 'This is a test activity created by the setup script',
|
||||
'file_path': 'test/sample.txt',
|
||||
'file_type': 'TXT',
|
||||
'category': 'test',
|
||||
'age_group': '8-12 ani',
|
||||
'participants': '5-10 persoane',
|
||||
'duration': '15-30min',
|
||||
'materials': 'Fără materiale',
|
||||
'tags': '["test", "setup"]',
|
||||
'source_text': 'Sample test content for verification'
|
||||
}
|
||||
|
||||
try:
|
||||
db.insert_activity(test_activity)
|
||||
stats = db.get_statistics()
|
||||
print(f"✅ Test database created with sample data: {stats['total_activities']} activities")
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Error creating test database: {e}")
|
||||
return False
|
||||
|
||||
def ensure_data_directory():
|
||||
"""Ensure the data directory exists"""
|
||||
data_dir = Path("data")
|
||||
if not data_dir.exists():
|
||||
print(f"📁 Creating data directory: {data_dir}")
|
||||
data_dir.mkdir(parents=True)
|
||||
else:
|
||||
print(f"📁 Data directory exists: {data_dir}")
|
||||
|
||||
def main():
|
||||
"""Main setup function"""
|
||||
parser = argparse.ArgumentParser(description='Create databases for INDEX-SISTEM-JOCURI')
|
||||
parser.add_argument('--clear-existing', '-c', action='store_true',
|
||||
help='Remove existing databases before creating new ones')
|
||||
parser.add_argument('--main-only', action='store_true',
|
||||
help='Create only the main activities database')
|
||||
parser.add_argument('--test-only', action='store_true',
|
||||
help='Create only the test database')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
print("🚀 DATABASE SETUP - INDEX-SISTEM-JOCURI")
|
||||
print("=" * 50)
|
||||
|
||||
# Ensure data directory exists
|
||||
ensure_data_directory()
|
||||
|
||||
success_count = 0
|
||||
total_count = 0
|
||||
|
||||
if args.test_only:
|
||||
total_count = 1
|
||||
if create_test_database(clear=args.clear_existing):
|
||||
success_count += 1
|
||||
elif args.main_only:
|
||||
total_count = 1
|
||||
if create_main_database(clear=args.clear_existing):
|
||||
success_count += 1
|
||||
else:
|
||||
# Create all databases
|
||||
databases = [
|
||||
("Main activities", lambda: create_main_database(clear=args.clear_existing)),
|
||||
("Game library", lambda: create_game_library_database(clear=args.clear_existing)),
|
||||
("Test activities", lambda: create_test_database(clear=args.clear_existing))
|
||||
]
|
||||
|
||||
total_count = len(databases)
|
||||
|
||||
for name, create_func in databases:
|
||||
print(f"\n📂 Creating {name} database...")
|
||||
try:
|
||||
if create_func():
|
||||
success_count += 1
|
||||
except Exception as e:
|
||||
print(f"❌ Failed to create {name} database: {e}")
|
||||
|
||||
print("\n" + "=" * 50)
|
||||
print(f"🎯 SUMMARY: {success_count}/{total_count} databases created successfully")
|
||||
|
||||
if success_count == total_count:
|
||||
print("✅ All databases ready!")
|
||||
print("\nNext steps:")
|
||||
print("1. Run indexer: cd src && python indexer.py --clear-db")
|
||||
print("2. Start web app: cd src && python app.py")
|
||||
else:
|
||||
print("⚠️ Some databases failed to create. Check errors above.")
|
||||
return 1
|
||||
|
||||
return 0
|
||||
|
||||
if __name__ == '__main__':
|
||||
sys.exit(main())
|
||||
102
scripts/enrich_wave.sh
Executable file
@@ -0,0 +1,102 @@
|
||||
#!/bin/bash
|
||||
# ============================================================================
|
||||
# enrich_wave.sh — ONE throttled enrichment wave, fully headless (no Claude
|
||||
# session). Designed to be run by the LXC's OS cron at night.
|
||||
#
|
||||
# - Prepares a bounded wave (first N missing keys) via enrichment_wave.py.
|
||||
# - Runs ONE `claude -p` per batch file, PAR batches concurrently (OS-level
|
||||
# parallelism — no Workflow tool, no 2-per-workflow cap, no session needed).
|
||||
# - When the backlog is empty, runs --collect + --rebuild and stops.
|
||||
#
|
||||
# Throttle = --keys (default 700, about 75% of the ~950 keys that fit in a 5h usage window).
|
||||
# A single flock guarantees waves never overlap.
|
||||
#
|
||||
# Usage: scripts/enrich_wave.sh [KEYS] [PAR]
|
||||
# KEYS = max keys this wave (default 700)
|
||||
# PAR = concurrent claude -p (default 6)
|
||||
# ============================================================================
|
||||
set -uo pipefail
|
||||
|
||||
REPO="/workspace/game-library"
|
||||
LOG_DIR="/workspace/.claude-logs"
|
||||
LOCK="/tmp/enrich_wave.lock"
|
||||
KEYS="${1:-700}"
|
||||
PAR="${2:-6}"
|
||||
MAX_TURNS=100
|
||||
|
||||
# --- environment (cron has a minimal env) ---------------------------------- #
|
||||
export HOME="${HOME:-/home/claude}"
|
||||
[ -f "$HOME/.nvm/nvm.sh" ] && . "$HOME/.nvm/nvm.sh" >/dev/null 2>&1
|
||||
export PATH="$HOME/.nvm/versions/node/v20.19.6/bin:/usr/local/bin:/usr/bin:/bin:$PATH"
|
||||
|
||||
mkdir -p "$LOG_DIR"
|
||||
TS="$(date +%Y%m%d_%H%M%S)"
|
||||
LOG="$LOG_DIR/enrich_${TS}.log"
|
||||
|
||||
log() { echo "[$(date '+%H:%M:%S')] $*" | tee -a "$LOG"; }
|
||||
|
||||
# --- single-instance lock: skip if a wave is still running ----------------- #
exec 9>"$LOCK"
if ! flock -n 9; then
  log "another wave holds the lock; exiting."
  exit 0
fi

cd "$REPO" || { log "cannot cd $REPO"; exit 1; }
command -v claude >/dev/null 2>&1 || { log "claude CLI not on PATH"; exit 1; }

log "=== enrichment wave start (keys=$KEYS par=$PAR) ==="

# --- 1) prepare bounded wave (batch files only) ---------------------------- #
PREP="$(python3 scripts/enrichment_wave.py --prepare --keys "$KEYS" --no-shards 2>&1)"
PREP_RC=$?
echo "$PREP" | tee -a "$LOG"
# A failed prepare leaves the previous wave's batch files on disk; do not rerun them.
[ "$PREP_RC" -eq 0 ] || { log "prepare failed (exit $PREP_RC); aborting wave."; exit 1; }

if echo "$PREP" | grep -q "WAVE: COMPLETE"; then
  log "backlog empty -> collect + rebuild"
  python3 scripts/run_enrichment.py --collect >>"$LOG" 2>&1
  python3 scripts/build_database.py --rebuild >>"$LOG" 2>&1
  grep -E "enrichment .*matched" "$LOG" | tail -1 | tee -a "$LOG"
  log "=== ENRICHMENT COMPLETE ==="
  exit 0
fi

# --- 2) per-batch headless enrichment, PAR-way parallel -------------------- #
read -r -d '' BATCH_PROMPT <<'EOP'
You are an enrichment subagent in the game-library pipeline. Working dir: /workspace/game-library.

Read `scripts/ENRICHMENT_PROMPT.md` FIRST — it defines the rules and output format EXACTLY (translate faithfully to Romanian; expand description_ro ONLY from the source chunk text; mark inferred filter fields in estimated_fields; fixed enum vocab).

Your batch file is __BATCHFILE__ — it lists content_keys, one per line. For EACH key:
1. IDEMPOTENT SKIP: if `data/enrichment_parts/<key>.json` already exists AND parses as valid JSON, SKIP it (do not rewrite).
2. Otherwise read its prompt `data/enrichment_prompts/<key>.prompt.md`, produce the enrichment JSON per ENRICHMENT_PROMPT.md, and write it to `data/enrichment_parts/<key>.json` (MUST include the exact "content_key": "<key>").
3. Validate it parses: python3 -c "import json;json.load(open('data/enrichment_parts/<key>.json'))".

CRITICAL — JSON quote escaping: any literal ASCII double-quote inside a string value MUST be escaped as \". Romanian text uses „cuvant" where the closing mark is a plain ASCII " — written raw it breaks the JSON. Either keep the typographic „ " marks or escape every ASCII ". Re-read and re-validate each file; fix any that fail.

Work through EVERY key in the batch file. If a key's prompt is missing, skip it and continue. When done, reply with one line: the count written and skipped.
EOP

export REPO LOG MAX_TURNS BATCH_PROMPT
run_one() {
  local bf="$1"
  local name; name="$(basename "$bf")"
  local prompt="${BATCH_PROMPT/__BATCHFILE__/$bf}"
  cd "$REPO" || return 1
  timeout 1200 claude -p "$prompt" \
    --allowedTools "Bash(python3:*),Read,Write,Bash(cat:*),Bash(ls:*)" \
    --max-turns "$MAX_TURNS" </dev/null >>"$LOG.$name.out" 2>&1
  # Capture the status now: the $(date) substitution below resets $? before it is read.
  local rc=$?
  echo "[$(date '+%H:%M:%S')] done $name (exit $rc)" >>"$LOG"
}
export -f run_one

BATCHES=(data/enrichment_batches/batch_*.txt)
log "launching ${#BATCHES[@]} batches, $PAR concurrent..."
printf '%s\n' "${BATCHES[@]}" | xargs -P "$PAR" -I{} bash -c 'run_one "$@"' _ {}

# --- 3) summary ------------------------------------------------------------ #
if grep -qi "session limit\|usage limit" "$LOG".batch_*.out 2>/dev/null; then
  log "WINDOW EXHAUSTED (usage limit hit mid-wave) — unfinished keys retry next fire."
fi
STATUS="$(python3 scripts/enrichment_wave.py --status 2>&1 | grep -E 'good|missing|done')"
echo "$STATUS" | tee -a "$LOG"
log "=== wave done ==="
|
||||
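The quote-escaping warning in the batch prompt above can be illustrated with a minimal Python sketch. The sample sentence and the key `k1` are made up for illustration; the point is only that a raw ASCII double quote inside a JSON string value ends the string early, while serialising through `json.dumps` emits the escaped form.

```python
import json

# Hypothetical part content: the opening mark is the typographic „ but the
# closing mark is a plain ASCII double quote.
description = 'Jocul „Vânătoarea" începe la semnal.'

# Written raw into the file, the ASCII quote terminates the JSON string early.
raw = '{"content_key": "k1", "description_ro": "' + description + '"}'
try:
    json.loads(raw)
    raw_parses = True
except json.JSONDecodeError:
    raw_parses = False

# Serialising through json.dumps escapes the quote and keeps the diacritics.
part = {"content_key": "k1", "description_ro": description}
encoded = json.dumps(part, ensure_ascii=False)
roundtrip = json.loads(encoded)

print(raw_parses)         # False
print(roundtrip == part)  # True
```

This is why the prompt asks each subagent to re-validate every part file with `json.load` after writing it: a part that fails to parse is treated as not done and is retried in a later wave.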
294
scripts/enrichment_wave.py
Normal file
@@ -0,0 +1,294 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
enrichment_wave.py: throttled, window-paced wave preparation for the corpus
enrichment pipeline.

The enrichment backlog (~9541 keys) does NOT fit in one 5-hour Anthropic usage
window. Launching all remaining batches at once always runs the window to
EXHAUSTION (the "subagent completed without calling StructuredOutput" signature),
consuming 100% and blocking other work. There is no readable real-time window
meter, so pacing must be BLIND: cap each wave to a fixed KEY COUNT (sized to
~75% of the empirical window capacity of ~950 keys), and let an external
scheduler (cron, every 6h) space waves across windows.

This script encapsulates the reconcile + bounded-wave preparation that used to
live as ad-hoc inline Python. It does NOT call the LLM and does NOT launch
workflows; it only prepares files on disk and prints what to launch.

Modes:
  --status                       read-only: print done / missing / pct
  --prepare --keys N --shards K  drop corrupt parts; take the FIRST N missing
                                 keys (sorted, deterministic); write batch
                                 files for ONLY those; regenerate K shard JS
                                 files covering exactly those batches; print
                                 machine-greppable WAVE:/SHARD: lines.
  --prepare --keys N --no-shards same, but write batch files only and skip the
                                 shard JS files (bash/headless path).

Idempotency: a key is "done" iff data/enrichment_parts/<key>.json exists AND
parses as a JSON object. Re-running --prepare with the same args is
deterministic (same sorted first-N keys), so a re-fire never reshuffles work.
Parts on disk are the durable checkpoint.

Output contract (parsed by the cron wave-runner):
  WAVE: COMPLETE                       -> backlog empty; run collect+rebuild
  WAVE: PREPARED keys=.. batches=.. shards=.. remaining_after=..
  SHARD: data/enrichment_wf/shard_0.js -> one line per workflow to launch
  ...

Usage:
  python3 scripts/enrichment_wave.py --status
  python3 scripts/enrichment_wave.py --prepare --keys 700 --shards 8
"""

from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path

SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent

PROMPT_SUFFIX = ".prompt.md"
PART_SUFFIX = ".json"
BATCH_SIZE_DEFAULT = 12
KEYS_DEFAULT = 700
SHARDS_DEFAULT = 8

# Resolved relative to REPO_ROOT so the script works from any cwd.
DEF_PROMPTS = "data/enrichment_prompts"
DEF_PARTS = "data/enrichment_parts"
DEF_BATCHES = "data/enrichment_batches"
DEF_WF = "data/enrichment_wf"
TEMPLATE_NAME = "shard.js.tmpl"


# --------------------------------------------------------------------------- #
# Helpers
# --------------------------------------------------------------------------- #
def _abs(p: str) -> Path:
    q = Path(p)
    return q if q.is_absolute() else (REPO_ROOT / q)


def part_ok(path: Path) -> bool:
    """A part counts as done iff it parses as a JSON object."""
    try:
        # Close the handle explicitly; a scan touches thousands of part files.
        with open(path, encoding="utf-8") as fh:
            return isinstance(json.load(fh), dict)
    except Exception:
        return False


def corrupt_parts(parts_dir: Path) -> list[Path]:
    return [p for p in parts_dir.glob("*" + PART_SUFFIX) if not part_ok(p)]


def compute_missing(prompts_dir: Path, parts_dir: Path) -> list[str]:
    """Keys whose prompt exists but whose part is absent. Sorted = deterministic."""
    missing = []
    for pr in prompts_dir.glob("*" + PROMPT_SUFFIX):
        key = pr.name[: -len(PROMPT_SUFFIX)]
        if not (parts_dir / (key + PART_SUFFIX)).exists():
            missing.append(key)
    return sorted(missing)


def count_done(prompts_dir: Path, parts_dir: Path) -> tuple[int, int]:
    """(good_parts_with_prompt, total_prompts)."""
    total = 0
    good = 0
    for pr in prompts_dir.glob("*" + PROMPT_SUFFIX):
        total += 1
        key = pr.name[: -len(PROMPT_SUFFIX)]
        part = parts_dir / (key + PART_SUFFIX)
        if part.exists() and part_ok(part):
            good += 1
    return good, total


def write_batches(keys: list[str], batches_dir: Path, size: int) -> int:
    """Replace all batch_*.txt with fresh files of <= size keys. Returns NB."""
    batches_dir.mkdir(parents=True, exist_ok=True)
    for old in batches_dir.glob("batch_*.txt"):
        old.unlink()
    nb = 0
    for i in range(0, len(keys), size):
        chunk = keys[i : i + size]
        (batches_dir / f"batch_{nb:04d}.txt").write_text(
            "\n".join(chunk) + "\n", encoding="utf-8"
        )
        nb += 1
    return nb


def shard_ranges(nb: int, k: int) -> list[tuple[int, int]]:
    """Split [0,nb) into k contiguous, disjoint, total-covering ranges.

    Even distribution: the first (nb % k) shards carry one extra batch. When
    nb < k the trailing ranges are empty [x,x) and are dropped by the caller.
    """
    if nb <= 0 or k <= 0:
        return []
    base, extra = divmod(nb, k)
    ranges = []
    start = 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


def render_shard(template: str, shard: int, start: int, end: int, nshards: int) -> str:
    return (
        template.replace("__SHARD__", str(shard))
        .replace("__START__", str(start))
        .replace("__END__", str(end))
        .replace("__NSHARDS__", str(nshards))
    )


def write_shards(ranges: list[tuple[int, int]], template: str, wf_dir: Path) -> list[Path]:
    """Delete stale shard_*.js, then write one per NON-EMPTY range. Returns paths."""
    wf_dir.mkdir(parents=True, exist_ok=True)
    for old in wf_dir.glob("shard_*.js"):
        old.unlink()
    non_empty = [(i, s, e) for i, (s, e) in enumerate(ranges) if e > s]
    nshards = len(non_empty)
    paths = []
    # Re-index shards 0..nshards-1 so labels/meta stay contiguous even if some
    # trailing ranges were empty (tiny final wave with fewer batches than K).
    for new_idx, (_, s, e) in enumerate(non_empty):
        path = wf_dir / f"shard_{new_idx}.js"
        path.write_text(
            render_shard(template, new_idx, s, e, nshards), encoding="utf-8"
        )
        paths.append(path)
    return paths


def rel(path: Path) -> str:
    try:
        return str(path.relative_to(REPO_ROOT))
    except ValueError:
        return str(path)


# --------------------------------------------------------------------------- #
# Modes
# --------------------------------------------------------------------------- #
def cmd_status(prompts_dir: Path, parts_dir: Path) -> int:
    good, total = count_done(prompts_dir, parts_dir)
    parts_on_disk = len(list(parts_dir.glob("*" + PART_SUFFIX)))
    bad = len(corrupt_parts(parts_dir))
    missing = total - good
    pct = (100.0 * good / total) if total else 0.0
    print("=== enrichment status ===")
    print(f"prompts (universe) : {total}")
    print(f"parts on disk : {parts_on_disk}")
    print(f"good (done) : {good}")
    print(f"corrupt parts : {bad} (reported only; --prepare drops them)")
    print(f"missing : {missing}")
    print(f"done : {pct:.1f}%")
    if total:
        print(f"WAVE: {'COMPLETE' if missing == 0 else 'PENDING'} missing={missing}")
    return 0


def cmd_prepare(
    prompts_dir: Path,
    parts_dir: Path,
    batches_dir: Path,
    wf_dir: Path,
    keys: int,
    shards: int,
    batch_size: int,
    make_shards: bool = True,
) -> int:
    template = ""
    if make_shards:
        template_path = wf_dir / TEMPLATE_NAME
        if not template_path.is_file():
            print(f"ERROR: missing shard template {rel(template_path)}", file=sys.stderr)
            return 2
        template = template_path.read_text(encoding="utf-8")

    # 1) drop corrupt parts (only mutation to parts/)
    dropped = 0
    for p in corrupt_parts(parts_dir):
        p.unlink()
        dropped += 1

    # 2) compute missing (deterministic)
    missing = compute_missing(prompts_dir, parts_dir)

    # 3) empty -> COMPLETE sentinel, no files written
    if not missing:
        print(f"dropped_corrupt={dropped}")
        print("WAVE: COMPLETE")
        return 0

    # 4) clamp to first N
    take = missing[:keys]

    # 5) batches for ONLY those keys
    nb = write_batches(take, batches_dir, batch_size)

    # 6) shard scripts covering exactly those batches (skipped on the bash path)
    paths = []
    if make_shards:
        ranges = shard_ranges(nb, shards)
        paths = write_shards(ranges, template, wf_dir)

    remaining_after = len(missing) - len(take)
    print(f"dropped_corrupt={dropped}")
    print(
        f"WAVE: PREPARED keys={len(take)} batches={nb} "
        f"shards={len(paths)} remaining_after={remaining_after}"
    )
    for p in paths:
        print(f"SHARD: {rel(p)}")
    return 0


def main(argv=None) -> int:
    ap = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("--status", action="store_true", help="read-only progress report")
    ap.add_argument("--prepare", action="store_true", help="prepare one bounded wave")
    ap.add_argument("--keys", type=int, default=KEYS_DEFAULT, help=f"max keys this wave (default {KEYS_DEFAULT})")
    ap.add_argument("--shards", type=int, default=SHARDS_DEFAULT, help=f"workflow shards (default {SHARDS_DEFAULT})")
    ap.add_argument("--batch-size", type=int, default=BATCH_SIZE_DEFAULT, help=f"keys per batch (default {BATCH_SIZE_DEFAULT})")
    ap.add_argument("--no-shards", action="store_true", help="prepare batch files only; skip shard JS generation (bash/headless path)")
    ap.add_argument("--prompts", default=DEF_PROMPTS)
    ap.add_argument("--parts", default=DEF_PARTS)
    ap.add_argument("--batches", default=DEF_BATCHES)
    ap.add_argument("--wf-dir", default=DEF_WF)
    args = ap.parse_args(argv)

    prompts_dir = _abs(args.prompts)
    parts_dir = _abs(args.parts)
    batches_dir = _abs(args.batches)
    wf_dir = _abs(args.wf_dir)

    if not prompts_dir.is_dir():
        print(f"ERROR: prompts dir not found: {rel(prompts_dir)}", file=sys.stderr)
        return 2
    parts_dir.mkdir(parents=True, exist_ok=True)

    if args.keys < 1 or args.shards < 1 or args.batch_size < 1:
        print("ERROR: --keys/--shards/--batch-size must be >= 1", file=sys.stderr)
        return 2

    if args.prepare:
        return cmd_prepare(
            prompts_dir, parts_dir, batches_dir, wf_dir,
            args.keys, args.shards, args.batch_size,
            make_shards=not args.no_shards,
        )
    # default to status
    return cmd_status(prompts_dir, parts_dir)


if __name__ == "__main__":
    raise SystemExit(main())
|
||||
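The even-split rule documented in `shard_ranges` is easiest to check on concrete numbers. The sketch below restates the function in condensed form so it runs on its own. The inputs are illustrative: 59 is the batch count that the default wave of 700 keys produces at 12 keys per batch, and the other two cases are arbitrary.

```python
from __future__ import annotations


def shard_ranges(nb: int, k: int) -> list[tuple[int, int]]:
    """Split [0, nb) into k contiguous ranges; the first nb % k get one extra."""
    if nb <= 0 or k <= 0:
        return []
    base, extra = divmod(nb, k)
    ranges, start = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


# 10 batches over 4 shards: 10 % 4 == 2, so the first two shards get 3 batches.
even = shard_ranges(10, 4)

# Default wave: ceil(700 / 12) == 59 batches over 8 shards.
wave_sizes = [end - start for start, end in shard_ranges(59, 8)]

# Tiny final wave: fewer batches than shards leaves empty trailing ranges,
# which the caller filters out before writing shard files.
tiny = [(s, e) for s, e in shard_ranges(3, 8) if e > s]

print(even)        # [(0, 3), (3, 6), (6, 8), (8, 10)]
print(wave_sizes)  # [8, 8, 8, 7, 7, 7, 7, 7]
print(tiny)        # [(0, 1), (1, 2), (2, 3)]
```

Because the ranges are contiguous and disjoint, every batch index lands in exactly one shard, which is what lets a re-fired wave regenerate the same shard files without reshuffling work.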
361
scripts/extract_common.py
Normal file
@@ -0,0 +1,361 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
extract_common.py: single home for per-format text extraction.

Every extractor returns a plain text *body* with synthetic page markers
(`--- PAGE N ---`). The file-level header (`SOURCE:` / `CONVERTED:`) is added
by normalize_sources.py, not here.

Critical fix vs. the old pdf_to_text_converter.py: there is NO `max_pages` cap.
Large books are extracted in full.
"""

from __future__ import annotations

import hashlib
import importlib
import os
import re
import shutil
import subprocess
import tempfile
import zipfile
from pathlib import Path
from typing import Callable

PAGE_MARKER_RE = re.compile(r"^--- PAGE (\d+) ---\s*$", re.MULTILINE)

# paragraphs per synthetic page for paginated-by-flow formats (docx)
DOCX_PARAS_PER_PAGE = 40

# formats we deliberately ignore (epub duplicates existing PDFs; plan §1)
IGNORED_EXTENSIONS = {".epub"}

# obvious junk filenames skipped during a walk
JUNK_NAMES = {"desktop.ini", "linkuri-jocuri.txt"}
JUNK_SUFFIXES = {".bak", ".tmp", ".ini"}


# --------------------------------------------------------------------------
# page assembly helpers
# --------------------------------------------------------------------------
def join_pages(pages: list[str], start: int = 1) -> str:
    """Join a list of page texts into a body string with `--- PAGE N ---`."""
    out: list[str] = []
    for i, text in enumerate(pages, start):
        out.append(f"\n--- PAGE {i} ---\n{(text or '').strip()}\n")
    return "".join(out)


def split_pages(body: str) -> list[tuple[int, str]]:
    """Inverse of join_pages: parse a body into [(page_number, text), ...]."""
    matches = list(PAGE_MARKER_RE.finditer(body))
    if not matches:
        return []
    pages: list[tuple[int, str]] = []
    for idx, m in enumerate(matches):
        num = int(m.group(1))
        seg_start = m.end()
        seg_end = matches[idx + 1].start() if idx + 1 < len(matches) else len(body)
        pages.append((num, body[seg_start:seg_end].strip()))
    return pages


def count_page_markers(body: str) -> int:
    return len(PAGE_MARKER_RE.findall(body))


# --------------------------------------------------------------------------
# format detection
# --------------------------------------------------------------------------
FORMAT_BY_EXT = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".doc": "doc",
    ".pptx": "pptx",
    ".ppt": "pptx",
    ".htm": "html",
    ".html": "html",
    ".zip": "zip",
    ".epub": "epub",
    ".txt": "txt",
}


def detect_format(path: str | os.PathLike) -> str:
    """Return a format key for a path based on its extension."""
    ext = Path(path).suffix.lower()
    return FORMAT_BY_EXT.get(ext, "unknown")


def is_junk(path: str | os.PathLike) -> bool:
    p = Path(path)
    name = p.name.lower()
    if name in JUNK_NAMES:
        return True
    if name.startswith("readme") and p.suffix.lower() == ".md":
        return True
    if p.suffix.lower() in JUNK_SUFFIXES:
        return True
    return False


# --------------------------------------------------------------------------
# content hashing + near-duplicate elimination
# --------------------------------------------------------------------------
def _normalize_for_hash(text: str) -> str:
    return re.sub(r"\s+", " ", (text or "")).strip().lower()


def content_hash(text: str) -> str:
    """Stable SHA1 of whitespace-normalized, lowercased text, used for exact-dup detection."""
    return hashlib.sha1(_normalize_for_hash(text).encode("utf-8")).hexdigest()


def near_duplicate_ratio(a: str, b: str) -> float:
    """Similarity score in [0, 100] between two texts (rapidfuzz token_sort_ratio)."""
    from rapidfuzz import fuzz

    return fuzz.token_sort_ratio(_normalize_for_hash(a), _normalize_for_hash(b))


def dedupe_texts(
    items: list[tuple[str, str]], threshold: float = 95.0
) -> list[tuple[str, str]]:
    """
    Drop exact and near-duplicate texts from a list of (key, text) pairs.

    Used for HTML mirror pages (print copies, repeated index/footer pages).
    Keeps the first occurrence; O(n) on exact hash, O(n*k) fuzzy only against
    already-kept items.
    """
    kept: list[tuple[str, str]] = []
    seen_hashes: set[str] = set()
    for key, text in items:
        h = content_hash(text)
        if h in seen_hashes:
            continue
        if any(near_duplicate_ratio(text, kt) >= threshold for _, kt in kept):
            continue
        seen_hashes.add(h)
        kept.append((key, text))
    return kept


# --------------------------------------------------------------------------
# preflight dependency check
# --------------------------------------------------------------------------
REQUIRED_PYTHON_MODULES = {
    "pdfplumber": "pdfplumber",
    "PyPDF2": "pypdf2",
    "docx": "python-docx",
    "pptx": "python-pptx",
    "bs4": "beautifulsoup4",
    "lxml": "lxml",
    "jsonschema": "jsonschema",
    "rapidfuzz": "rapidfuzz",
    "chardet": "chardet",
}


def preflight(check_ocr: bool = False) -> dict:
    """
    Check system + Python dependencies before a long normalization run.

    Returns {'ok': bool, 'missing_python': [...], 'missing_system': [...],
    'warnings': [...]}. libreoffice is a *warning* (only .doc needs it),
    tesseract only checked when check_ocr=True.
    """
    missing_python: list[str] = []
    for module, pip_name in REQUIRED_PYTHON_MODULES.items():
        try:
            importlib.import_module(module)
        except ImportError:
            missing_python.append(pip_name)

    warnings: list[str] = []
    missing_system: list[str] = []

    if not (shutil.which("libreoffice") or shutil.which("soffice")):
        warnings.append("libreoffice not found — legacy .doc files cannot be converted")

    if check_ocr and not shutil.which("tesseract"):
        missing_system.append("tesseract (OCR requested but not installed)")

    return {
        "ok": not missing_python and not missing_system,
        "missing_python": missing_python,
        "missing_system": missing_system,
        "warnings": warnings,
    }


# --------------------------------------------------------------------------
# per-format extractors
# --------------------------------------------------------------------------
def extract_pdf(path: str | os.PathLike) -> str:
    """PDF → body. pdfplumber primary, PyPDF2 fallback. No page cap."""
    path = str(path)
    try:
        return _extract_pdf_pdfplumber(path)
    except Exception:
        return _extract_pdf_pypdf2(path)


def _extract_pdf_pdfplumber(path: str) -> str:
    import pdfplumber

    pages: list[str] = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:  # ALL pages, no max_pages
            try:
                pages.append(page.extract_text() or "")
            except Exception:
                pages.append("")
    return join_pages(pages)


def _extract_pdf_pypdf2(path: str) -> str:
    import PyPDF2

    pages: list[str] = []
    with open(path, "rb") as fh:
        reader = PyPDF2.PdfReader(fh)
        for page in reader.pages:  # ALL pages, no max_pages
            try:
                pages.append(page.extract_text() or "")
            except Exception:
                pages.append("")
    return join_pages(pages)


def extract_docx(path: str | os.PathLike) -> str:
    """docx → body. Synthetic page marker every DOCX_PARAS_PER_PAGE paragraphs."""
    import docx

    document = docx.Document(str(path))
    paragraphs = [p.text for p in document.paragraphs]
    pages: list[str] = []
    for i in range(0, max(len(paragraphs), 1), DOCX_PARAS_PER_PAGE):
        chunk = paragraphs[i : i + DOCX_PARAS_PER_PAGE]
        pages.append("\n".join(chunk))
    return join_pages(pages)


def extract_doc(path: str | os.PathLike) -> str:
    """
    Legacy .doc → body via `libreoffice --headless --convert-to docx`.

    Raises RuntimeError if libreoffice is unavailable; the caller marks the
    resulting source `needs_review` regardless (conversion is imperfect).
    """
    soffice = shutil.which("libreoffice") or shutil.which("soffice")
    if not soffice:
        raise RuntimeError("libreoffice/soffice not available — cannot convert .doc")

    src = Path(path).resolve()
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(
            [soffice, "--headless", "--convert-to", "docx", "--outdir", tmp, str(src)],
            check=True,
            capture_output=True,
            timeout=300,
        )
        converted = Path(tmp) / (src.stem + ".docx")
        if not converted.exists():
            raise RuntimeError(f"libreoffice produced no output for {src.name}")
        return extract_docx(converted)


def extract_pptx(path: str | os.PathLike) -> str:
    """pptx → body. One page per slide: title + body text + speaker notes."""
    from pptx import Presentation

    presentation = Presentation(str(path))
    pages: list[str] = []
    for slide in presentation.slides:
        parts: list[str] = []
        for shape in slide.shapes:
            if shape.has_text_frame and shape.text_frame.text.strip():
                parts.append(shape.text_frame.text.strip())
        if slide.has_notes_slide:
            # notes_text_frame is None when the notes slide has no notes placeholder
            frame = slide.notes_slide.notes_text_frame
            notes = frame.text.strip() if frame is not None else ""
            if notes:
                parts.append(f"[NOTES] {notes}")
        pages.append("\n".join(parts))
    return join_pages(pages)


def extract_html(path: str | os.PathLike) -> str:
    """HTML mirror page → body. Strips nav/script/style/footer/header/aside."""
    import chardet
    from bs4 import BeautifulSoup

    raw = Path(path).read_bytes()
    enc = chardet.detect(raw).get("encoding") or "utf-8"
    soup = BeautifulSoup(raw.decode(enc, errors="replace"), "lxml")

    for tag in soup(["script", "style", "nav", "footer", "header", "aside", "noscript"]):
        tag.decompose()
    # also drop common chrome by ARIA role
    for tag in soup.find_all(attrs={"role": ["navigation", "banner", "contentinfo"]}):
        tag.decompose()

    text = soup.get_text(separator="\n")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return join_pages(["\n".join(lines)])


def extract_zip(path: str | os.PathLike) -> str:
    """
    zip → body. Unzips into a temp dir and recurses on every extractable inner
    file. Inner files are page-renumbered into one continuous body.
    """
    path = str(path)
    pages: list[str] = []
    with tempfile.TemporaryDirectory() as tmp:
        try:
            with zipfile.ZipFile(path) as zf:
                zf.extractall(tmp)
        except zipfile.BadZipFile:
            return ""
        for inner in sorted(Path(tmp).rglob("*")):
            if not inner.is_file() or is_junk(inner):
                continue
            fmt = detect_format(inner)
            if fmt in ("unknown", "epub", "zip"):
                # nested zips handled by recursion below
                if fmt == "zip":
                    body = extract_zip(inner)
                    pages.extend(t for _, t in split_pages(body))
                continue
            try:
                body = extract_file(inner)
            except Exception:
                continue
            pages.extend(t for _, t in split_pages(body))
    return join_pages(pages)


EXTRACTORS: dict[str, Callable[[str | os.PathLike], str]] = {
    "pdf": extract_pdf,
    "docx": extract_docx,
    "doc": extract_doc,
    "pptx": extract_pptx,
    "html": extract_html,
    "zip": extract_zip,
}


def extract_file(path: str | os.PathLike) -> str:
    """Dispatch a single file to the right extractor. Returns a page-marked body."""
    fmt = detect_format(path)
    if fmt == "txt":
        body = Path(path).read_text(encoding="utf-8", errors="replace")
        # already paginated? pass through; else wrap as one page
        return body if count_page_markers(body) else join_pages([body])
    extractor = EXTRACTORS.get(fmt)
    if extractor is None:
        raise ValueError(f"No extractor for format '{fmt}': {path}")
    return extractor(path)
|
||||
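The page-marker contract that every extractor in `extract_common.py` shares can be exercised in isolation. The sketch below restates `join_pages` and `split_pages` in condensed form so it runs without the rest of the module. The sample texts are invented, and the merge step imitates how `extract_zip` renumbers the pages of inner files into one continuous body.

```python
import re

PAGE_MARKER_RE = re.compile(r"^--- PAGE (\d+) ---\s*$", re.MULTILINE)


def join_pages(pages, start=1):
    """Concatenate page texts, each preceded by a `--- PAGE N ---` marker."""
    return "".join(
        f"\n--- PAGE {i} ---\n{(text or '').strip()}\n"
        for i, text in enumerate(pages, start)
    )


def split_pages(body):
    """Parse a marked body back into [(page_number, text), ...]."""
    matches = list(PAGE_MARKER_RE.finditer(body))
    pages = []
    for idx, m in enumerate(matches):
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(body)
        pages.append((int(m.group(1)), body[m.end():end].strip()))
    return pages


first = join_pages(["Alpha", "Beta"])
second = join_pages(["Gamma"])

# Merge two bodies the way extract_zip does: keep the texts, drop the old numbers.
texts = [t for _, t in split_pages(first)] + [t for _, t in split_pages(second)]
merged = join_pages(texts)

print(split_pages(first))   # [(1, 'Alpha'), (2, 'Beta')]
print(split_pages(merged))  # [(1, 'Alpha'), (2, 'Beta'), (3, 'Gamma')]
```

Renumbering on merge means page numbers in a zip-derived body refer to positions in the combined output, not to pages of the individual inner files.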
@@ -1,424 +0,0 @@
#!/usr/bin/env python3
"""
HTML Activity Extractor - processes 1876 HTML files
Automatically extracts activities using pattern recognition
"""

import os
import re
import json
from pathlib import Path
from bs4 import BeautifulSoup
import chardet
from typing import List, Dict, Optional
import sqlite3
from datetime import datetime

class HTMLActivityExtractor:
    def __init__(self, db_path='data/activities.db'):
        self.db_path = db_path
        # Patterns for detecting activities in Romanian
        self.activity_patterns = {
            'title_patterns': [
                r'(?i)(joc|activitate|exerci[t]iu|team[\s-]?building|energizer|ice[\s-]?breaker)[\s:]+([^\.]{5,100})',
                r'(?i)<h[1-6][^>]*>([^<]*(?:joc|activitate|exerci[t]iu)[^<]*)</h[1-6]>',
                r'(?i)<strong>([^<]*(?:joc|activitate|exerci[t]iu)[^<]*)</strong>',
                r'(?i)^[\d]+\.?\s*([A-Z][^\.]{10,100}(?:joc|activitate|exerci[t]iu)[^\.]{0,50})$',
            ],
            'description_markers': [
                'descriere', 'reguli', 'cum se joac[a]', 'instructiuni',
                'obiectiv', 'desfasurare', 'explicatie', 'mod de joc'
            ],
            'materials_markers': [
                'materiale', 'necesare', 'echipament', 'ce avem nevoie',
                'se folosesc', 'trebuie sa avem', 'dotari'
            ],
            'age_patterns': [
                r'(?i)v[âa]rst[a][\s:]+(\d+)[\s-]+(\d+)',
                r'(?i)(\d+)[\s-]+(\d+)\s*ani',
                r'(?i)pentru\s+(\d+)[\s-]+(\d+)\s*ani',
                r'(?i)categoria?\s*(?:de\s*)?v[âa]rst[a][\s:]+(\d+)[\s-]+(\d+)',
            ],
            'participants_patterns': [
                r'(?i)(\d+)[\s-]+(\d+)\s*(?:participan[t]i|juc[a]tori|persoane|copii)',
                r'(?i)num[a]r\s*(?:de\s*)?(?:participan[t]i|juc[a]tori)[\s:]+(\d+)[\s-]+(\d+)',
                r'(?i)grup\s*de\s*(\d+)[\s-]+(\d+)',
            ],
            'duration_patterns': [
                r'(?i)durat[a][\s:]+(\d+)[\s-]+(\d+)\s*(?:minute|min)',
                r'(?i)timp[\s:]+(\d+)[\s-]+(\d+)\s*(?:minute|min)',
                r'(?i)(\d+)[\s-]+(\d+)\s*minute',
            ]
        }

        # Predefined categories based on the existing system
        self.categories = {
            '[A]': ['joc', 'joaca', 'distractie', 'amuzament'],
            '[B]': ['aventura', 'explorare', 'descoperire'],
            '[C]': ['camping', 'tabara', 'excursie', 'drumetie'],
            '[D]': ['foc', 'flacara', 'lumina'],
            '[E]': ['noduri', 'frânghii', 'sfori', 'legare'],
            '[F]': ['bushcraft', 'supravietuire', 'survival'],
            '[G]': ['educatie', 'educativ', 'invatare', 'scoala'],
            '[H]': ['orientare', 'busola', 'harta', 'navigare']
        }

    def detect_encoding(self, file_path):
        """Detect the file's encoding"""
        with open(file_path, 'rb') as f:
            result = chardet.detect(f.read())
            return result['encoding'] or 'utf-8'

    def extract_from_html(self, html_path: str) -> List[Dict]:
        """Extract activities from a single HTML file"""
        activities = []

        try:
            # Detect encoding and read
            encoding = self.detect_encoding(html_path)
            with open(html_path, 'r', encoding=encoding, errors='ignore') as f:
                content = f.read()

            soup = BeautifulSoup(content, 'lxml')

            # Method 1: look for activity lists
            activities.extend(self._extract_from_lists(soup, html_path))

            # Method 2: look for activities in headings
            activities.extend(self._extract_from_headings(soup, html_path))

            # Method 3: look for patterns in the text
            activities.extend(self._extract_from_patterns(soup, html_path))

            # Method 4: look in tables
            activities.extend(self._extract_from_tables(soup, html_path))

        except Exception as e:
            print(f"Error processing {html_path}: {e}")

        return activities

def _extract_from_lists(self, soup, source_file):
|
||||
"""Extract activities from HTML lists (ul, ol)."""
|
||||
activities = []
|
||||
|
||||
for list_elem in soup.find_all(['ul', 'ol']):
|
||||
# Check whether the list appears to contain activities
|
||||
list_text = list_elem.get_text().lower()
|
||||
if any(marker in list_text for marker in ['joc', 'activitate', 'exercitiu']):
|
||||
for li in list_elem.find_all('li'):
|
||||
text = li.get_text(strip=True)
|
||||
if len(text) > 20: # At least 20 characters for a valid activity
|
||||
activity = self._create_activity_from_text(text, source_file)
|
||||
if activity:
|
||||
activities.append(activity)
|
||||
|
||||
return activities
|
||||
|
||||
def _extract_from_headings(self, soup, source_file):
|
||||
"""Extract activities based on headings."""
|
||||
activities = []
|
||||
|
||||
for heading in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
|
||||
heading_text = heading.get_text(strip=True)
|
||||
|
||||
# Check whether the heading contains keywords
|
||||
if any(keyword in heading_text.lower() for keyword in ['joc', 'activitate', 'exercitiu']):
|
||||
# Look for the description in the following sibling elements
|
||||
description = ""
|
||||
next_elem = heading.find_next_sibling()
|
||||
|
||||
while next_elem and next_elem.name not in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
|
||||
if next_elem.name in ['p', 'div', 'ul']:
|
||||
description += next_elem.get_text(strip=True) + " "
|
||||
if len(description) > 500: # Cap the description length
|
||||
break
|
||||
next_elem = next_elem.find_next_sibling()
|
||||
|
||||
if description:
|
||||
activity = {
|
||||
'name': heading_text[:200],
|
||||
'description': description[:1000],
|
||||
'source_file': str(source_file),
|
||||
'category': self._detect_category(heading_text + " " + description)
|
||||
}
|
||||
activities.append(activity)
|
||||
|
||||
return activities
|
||||
|
||||
def _extract_from_patterns(self, soup, source_file):
|
||||
"""Extract activities using pattern matching."""
|
||||
activities = []
|
||||
text = soup.get_text()
|
||||
|
||||
# Look for activity patterns
|
||||
for pattern in self.activity_patterns['title_patterns']:
|
||||
matches = re.finditer(pattern, text, re.MULTILINE)
|
||||
for match in matches:
|
||||
title = match.group(match.lastindex) if match.lastindex else match.group(0)  # lastindex is None, not 0, when no group matched
|
||||
if len(title) > 10:
|
||||
# Take the context around the match
|
||||
start = max(0, match.start() - 200)
|
||||
end = min(len(text), match.end() + 500)
|
||||
context = text[start:end]
|
||||
|
||||
activity = self._create_activity_from_text(context, source_file, title)
|
||||
if activity:
|
||||
activities.append(activity)
|
||||
|
||||
return activities
|
||||
|
||||
def _extract_from_tables(self, soup, source_file):
|
||||
"""Extract activities from tables."""
|
||||
activities = []
|
||||
|
||||
for table in soup.find_all('table'):
|
||||
rows = table.find_all('tr')
|
||||
if len(rows) > 1: # At least a header and one data row
|
||||
# Detect the relevant columns
|
||||
headers = [th.get_text(strip=True).lower() for th in rows[0].find_all(['th', 'td'])]
|
||||
|
||||
for row in rows[1:]:
|
||||
cells = row.find_all(['td'])
|
||||
if cells:
|
||||
activity_data = {}
|
||||
for i, cell in enumerate(cells):
|
||||
if i < len(headers):
|
||||
activity_data[headers[i]] = cell.get_text(strip=True)
|
||||
|
||||
# Build an activity from the table row
|
||||
if any(key in activity_data for key in ['joc', 'activitate', 'nume', 'titlu']):
|
||||
activity = self._create_activity_from_table_data(activity_data, source_file)
|
||||
if activity:
|
||||
activities.append(activity)
|
||||
|
||||
return activities
|
||||
|
||||
def _create_activity_from_text(self, text, source_file, title=None):
|
||||
"""Build an activity dictionary from text."""
|
||||
if not text or len(text) < 30:
|
||||
return None
|
||||
|
||||
activity = {
|
||||
'name': title or text[:100].split('.')[0].strip(),
|
||||
'description': text[:1000],
|
||||
'source_file': str(source_file),
|
||||
'category': self._detect_category(text),
|
||||
'keywords': self._extract_keywords(text),
|
||||
'created_at': datetime.now().isoformat()
|
||||
}
|
||||
|
||||
# Extract additional metadata
|
||||
activity.update(self._extract_metadata(text))
|
||||
|
||||
return activity
|
||||
|
||||
def _create_activity_from_table_data(self, data, source_file):
|
||||
"""Build an activity from table data."""
|
||||
activity = {
|
||||
'source_file': str(source_file),
|
||||
'created_at': datetime.now().isoformat()
|
||||
}
|
||||
|
||||
# Map table fields to DB fields
|
||||
field_mapping = {
|
||||
'nume': 'name', 'titlu': 'name', 'joc': 'name', 'activitate': 'name',
|
||||
'descriere': 'description', 'detalii': 'description', 'explicatie': 'description',
|
||||
'materiale': 'materials_list', 'echipament': 'materials_list',
|
||||
'varsta': 'age_group_min', 'categoria': 'category',
|
||||
'participanti': 'participants_min', 'numar': 'participants_min',
|
||||
'durata': 'duration_min', 'timp': 'duration_min'
|
||||
}
|
||||
|
||||
for table_field, db_field in field_mapping.items():
|
||||
if table_field in data:
|
||||
activity[db_field] = data[table_field]
|
||||
|
||||
# Minimal validation
|
||||
if 'name' in activity and len(activity.get('name', '')) > 5:
|
||||
return activity
|
||||
|
||||
return None
|
||||
|
||||
def _extract_metadata(self, text):
|
||||
"""Extract metadata from text using patterns."""
|
||||
metadata = {}
|
||||
|
||||
# Extract the age range
|
||||
for pattern in self.activity_patterns['age_patterns']:
|
||||
match = re.search(pattern, text)
|
||||
if match:
|
||||
metadata['age_group_min'] = int(match.group(1))
|
||||
metadata['age_group_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
|
||||
break
|
||||
|
||||
# Extract the number of participants
|
||||
for pattern in self.activity_patterns['participants_patterns']:
|
||||
match = re.search(pattern, text)
|
||||
if match:
|
||||
metadata['participants_min'] = int(match.group(1))
|
||||
metadata['participants_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
|
||||
break
|
||||
|
||||
# Extract the duration
|
||||
for pattern in self.activity_patterns['duration_patterns']:
|
||||
match = re.search(pattern, text)
|
||||
if match:
|
||||
metadata['duration_min'] = int(match.group(1))
|
||||
metadata['duration_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
|
||||
break
|
||||
|
||||
# Extract materials
|
||||
materials = []
|
||||
text_lower = text.lower()
|
||||
for marker in self.activity_patterns['materials_markers']:
|
||||
idx = text_lower.find(marker)
|
||||
if idx != -1:
|
||||
# Take the 200 characters starting at the marker
|
||||
materials_text = text[idx:idx+200]
|
||||
# Extract the list items
|
||||
items = re.findall(r'[-"]\s*([^\n\-"]+)', materials_text)  # '-' escaped: unescaped it formed the range \n-" and excluded spaces
|
||||
if items:
|
||||
materials.extend(items)
|
||||
|
||||
if materials:
|
||||
metadata['materials_list'] = ', '.join(materials[:10]) # At most 10 materials
|
||||
|
||||
return metadata
|
||||
|
||||
def _detect_category(self, text):
|
||||
"""Detect the activity category based on keywords."""
|
||||
text_lower = text.lower()
|
||||
|
||||
for category, keywords in self.categories.items():
|
||||
if any(keyword in text_lower for keyword in keywords):
|
||||
return category
|
||||
|
||||
return '[A]' # Default: the games category
|
||||
|
||||
def _extract_keywords(self, text):
|
||||
"""Extract keywords from text."""
|
||||
keywords = []
|
||||
text_lower = text.lower()
|
||||
|
||||
# List of relevant keywords
|
||||
keyword_list = [
|
||||
'cooperare', 'competitie', 'echipa', 'creativitate', 'miscare',
|
||||
'strategie', 'comunicare', 'incredere', 'coordonare', 'atentie',
|
||||
'reflexe', 'logica', 'imaginatie', 'muzica', 'dans', 'sport',
|
||||
'natura', 'mediu', 'stiinta', 'matematica', 'limba', 'cultura'
|
||||
]
|
||||
|
||||
for keyword in keyword_list:
|
||||
if keyword in text_lower:
|
||||
keywords.append(keyword)
|
||||
|
||||
return ', '.join(keywords[:5]) # At most 5 keywords
|
||||
|
||||
def save_to_database(self, activities):
|
||||
"""Save the activities to the database."""
|
||||
conn = sqlite3.connect(self.db_path)
|
||||
cursor = conn.cursor()
|
||||
|
||||
saved_count = 0
|
||||
duplicate_count = 0
|
||||
|
||||
for activity in activities:
|
||||
try:
|
||||
# Check for duplicates
|
||||
cursor.execute(
|
||||
"SELECT id FROM activities WHERE name = ? AND source_file = ?",
|
||||
(activity.get('name'), activity.get('source_file'))
|
||||
)
|
||||
|
||||
if cursor.fetchone():
|
||||
duplicate_count += 1
|
||||
continue
|
||||
|
||||
# Prepare the values for the insert
|
||||
columns = []
|
||||
values = []
|
||||
placeholders = []
|
||||
|
||||
for key, value in activity.items():
|
||||
if key != 'created_at': # Skip created_at, it has default
|
||||
columns.append(key)
|
||||
values.append(value)
|
||||
placeholders.append('?')
|
||||
|
||||
# Insert into the DB
|
||||
query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
|
||||
cursor.execute(query, values)
|
||||
saved_count += 1
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error saving activity: {e}")
|
||||
continue
|
||||
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
return saved_count, duplicate_count
|
||||
|
||||
def process_all_html_files(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
|
||||
"""Process all HTML files in the given directory."""
|
||||
base_path = Path(base_path)
|
||||
html_files = list(base_path.rglob("*.html"))
|
||||
html_files.extend(list(base_path.rglob("*.htm")))
|
||||
|
||||
print(f"Found {len(html_files)} HTML files to process")
|
||||
|
||||
all_activities = []
|
||||
processed = 0
|
||||
errors = 0
|
||||
|
||||
for i, html_file in enumerate(html_files):
|
||||
try:
|
||||
activities = self.extract_from_html(str(html_file))
|
||||
all_activities.extend(activities)
|
||||
processed += 1
|
||||
|
||||
# Progress update
|
||||
if (i + 1) % 100 == 0:
|
||||
print(f"Progress: {i+1}/{len(html_files)} files processed, {len(all_activities)} activities found")
|
||||
# Save batch to DB
|
||||
if all_activities:
|
||||
saved, dupes = self.save_to_database(all_activities)
|
||||
print(f"Batch saved: {saved} new activities, {dupes} duplicates skipped")
|
||||
all_activities = [] # Clear buffer
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error processing {html_file}: {e}")
|
||||
errors += 1
|
||||
|
||||
# Save remaining activities
|
||||
if all_activities:
|
||||
saved, dupes = self.save_to_database(all_activities)
|
||||
print(f"Final batch saved: {saved} new activities, {dupes} duplicates skipped")
|
||||
|
||||
print(f"\nProcessing complete!")
|
||||
print(f"Files processed: {processed}")
|
||||
print(f"Errors: {errors}")
|
||||
|
||||
return processed, errors
|
||||
|
||||
# Main entry point for testing
|
||||
if __name__ == "__main__":
|
||||
extractor = HTMLActivityExtractor()
|
||||
|
||||
# Test on a sample file first
|
||||
print("Testing on sample file first...")
|
||||
# Find some HTML files for the test
|
||||
test_files = list(Path('/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri').rglob("*.html"))[:3]
|
||||
|
||||
for test_file in test_files:
|
||||
print(f"\nTesting: {test_file}")
|
||||
activities = extractor.extract_from_html(str(test_file))
|
||||
print(f"Found {len(activities)} activities")
|
||||
if activities:
|
||||
print(f"Sample activity: {activities[0]['name'][:50]}...")
|
||||
|
||||
# Ask whether to continue with full processing
|
||||
response = input("\nContinue with full processing? (y/n): ")
|
||||
if response.lower() == 'y':
|
||||
extractor.process_all_html_files()
|
||||
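Two regex details in the removed extractor are easy to trip over. The sketch below shows them in isolation, assuming only the standard `re` module. The helper name `pick_title` and the sample strings are made up for illustration and are not part of the original script.

```python
import re


def pick_title(match: re.Match) -> str:
    """Return the last capturing group if there is one, else the whole match."""
    # Match.lastindex is None (never 0) when no group took part in the match,
    # so a truthiness test covers the "pattern has no groups" case.
    return match.group(match.lastindex) if match.lastindex else match.group(0)


no_groups = re.search(r"Joc \w+", "Joc Vanatoarea de comori")
one_group = re.search(r"Joc:\s*(\w+)", "Joc: Stafeta")

title_without_group = pick_title(no_groups)
title_with_group = pick_title(one_group)

# Inside a character class, an unescaped '-' between two characters is a range.
# '\n' to '"' covers code points 10..34, which silently includes the space.
range_class = re.findall(r'[^\n-"]+', "sfoara lunga")
literal_class = re.findall(r'[^\n\-"]+', "sfoara lunga")
```

With the hyphen escaped, a multi-word item such as `sfoara lunga` stays in one piece instead of being split at every space.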
@@ -1,78 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Import activities extracted by Claude from JSON files
|
||||
"""
|
||||
|
||||
import json
|
||||
import sqlite3
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
|
||||
class ClaudeActivityImporter:
|
||||
def __init__(self, db_path='data/activities.db'):
|
||||
self.db_path = db_path
|
||||
self.json_dir = Path('scripts/extracted_activities')
|
||||
self.json_dir.mkdir(exist_ok=True)
|
||||
|
||||
def import_json_file(self, json_path):
|
||||
"""Import activities from a single JSON file"""
|
||||
with open(json_path, 'r', encoding='utf-8') as f:
|
||||
data = json.load(f)
|
||||
|
||||
source_file = data.get('source_file', str(json_path))
|
||||
activities = data.get('activities', [])
|
||||
|
||||
conn = sqlite3.connect(self.db_path)
|
||||
cursor = conn.cursor()
|
||||
|
||||
imported = 0
|
||||
for activity in activities:
|
||||
try:
|
||||
# Add source file and timestamp
|
||||
activity['source_file'] = source_file
|
||||
activity['created_at'] = datetime.now().isoformat()
|
||||
|
||||
# Prepare insert
|
||||
columns = list(activity.keys())
|
||||
values = list(activity.values())
|
||||
placeholders = ['?' for _ in values]
|
||||
|
||||
# Check for duplicate
|
||||
cursor.execute(
|
||||
"SELECT id FROM activities WHERE name = ? AND source_file = ?",
|
||||
(activity.get('name'), source_file)
|
||||
)
|
||||
|
||||
if not cursor.fetchone():
|
||||
query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
|
||||
cursor.execute(query, values)
|
||||
imported += 1
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error importing activity: {e}")
|
||||
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
print(f"Imported {imported} activities from {json_path.name}")
|
||||
return imported
|
||||
|
||||
def import_all_json_files(self):
|
||||
"""Import all JSON files from the extracted_activities directory"""
|
||||
json_files = list(self.json_dir.glob("*.json"))
|
||||
|
||||
if not json_files:
|
||||
print("No JSON files found in extracted_activities directory")
|
||||
return 0
|
||||
|
||||
total_imported = 0
|
||||
for json_file in json_files:
|
||||
imported = self.import_json_file(json_file)
|
||||
total_imported += imported
|
||||
|
||||
print(f"\nTotal imported: {total_imported} activities from {len(json_files)} files")
|
||||
return total_imported
|
||||
|
||||
if __name__ == "__main__":
|
||||
importer = ClaudeActivityImporter()
|
||||
importer.import_all_json_files()
|
||||
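Both removed importers deduplicate with a SELECT followed by an INSERT. One possible alternative, sketched here under the assumption that the `activities` table could carry a UNIQUE constraint on `(name, source_file)`, is to let SQLite reject the duplicates. The reduced two-column schema and the `insert_unique` helper are hypothetical; the real table has many more columns.

```python
import sqlite3

# Hypothetical, reduced schema used only for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS activities (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    source_file TEXT NOT NULL,
    UNIQUE (name, source_file)
)
"""


def insert_unique(conn: sqlite3.Connection, rows: list[tuple[str, str]]) -> int:
    """Insert rows, skipping (name, source_file) pairs that already exist."""
    before = conn.total_changes
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO activities (name, source_file) VALUES (?, ?)",
            rows,
        )
    # total_changes only counts rows that were actually inserted.
    return conn.total_changes - before


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
rows = [("Stafeta", "a.html"), ("Stafeta", "a.html"), ("Stafeta", "b.html")]
first_run = insert_unique(conn, rows)
second_run = insert_unique(conn, rows)
```

Keeping the column list fixed in the statement also avoids building SQL from dictionary keys, which the removed code did with an f-string.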
179
scripts/import_common.py
Normal file
@@ -0,0 +1,179 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
import_common.py — shared helpers for the import / validation side of the
|
||||
extraction pipeline (Lane C).
|
||||
|
||||
Used by build_database.py and validate_extractions.py:
|
||||
* JSON-schema validation of subagent extraction files,
|
||||
* the anti-hallucination source_excerpt substring check (E5),
|
||||
* locating the source chunk that an extraction file came from,
|
||||
* the stable content key used by the needs_review queue.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
import json
|
||||
import re
|
||||
import unicodedata
|
||||
from pathlib import Path
|
||||
from typing import Any, Optional
|
||||
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
REPO_ROOT = SCRIPT_DIR.parent
|
||||
|
||||
DEFAULT_SCHEMA_PATH = SCRIPT_DIR / "activity_schema.json"
|
||||
|
||||
# rapidfuzz.partial_ratio is on a 0..100 scale; an excerpt counts as a real
|
||||
# quote from the source when it scores at least this against the chunk text.
|
||||
EXCERPT_MATCH_THRESHOLD = 90.0
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# schema validation
|
||||
# --------------------------------------------------------------------------
|
||||
def load_schema(schema_path: str | Path = DEFAULT_SCHEMA_PATH) -> dict:
|
||||
"""Load the activity JSON schema produced by Lane A."""
|
||||
return json.loads(Path(schema_path).read_text(encoding="utf-8"))
|
||||
|
||||
|
||||
def validate_extraction(data: Any, schema: dict) -> list[str]:
|
||||
"""
|
||||
Validate one parsed extraction file against `schema`.
|
||||
|
||||
Returns a list of human-readable error strings; empty list == valid.
|
||||
"""
|
||||
import jsonschema
|
||||
|
||||
validator = jsonschema.Draft7Validator(schema)
|
||||
errors: list[str] = []
|
||||
for err in sorted(validator.iter_errors(data), key=lambda e: list(e.path)):
|
||||
location = "/".join(str(p) for p in err.path) or "<root>"
|
||||
errors.append(f"{location}: {err.message}")
|
||||
return errors
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# excerpt verification (E5 — anti-hallucination)
|
||||
# --------------------------------------------------------------------------
|
||||
def _normalize_text(text: str) -> str:
|
||||
return re.sub(r"\s+", " ", (text or "")).strip().lower()
|
||||
|
||||
|
||||
def excerpt_score(excerpt: str, chunk_text: str) -> float:
|
||||
"""Best fuzzy-substring score (0..100) of `excerpt` inside `chunk_text`."""
|
||||
from rapidfuzz import fuzz
|
||||
|
||||
if not excerpt or not chunk_text:
|
||||
return 0.0
|
||||
return float(fuzz.partial_ratio(_normalize_text(excerpt), _normalize_text(chunk_text)))
|
||||
|
||||
|
||||
def excerpt_matches(
|
||||
excerpt: str, chunk_text: str, threshold: float = EXCERPT_MATCH_THRESHOLD
|
||||
) -> bool:
|
||||
"""True when `excerpt` appears (fuzzily) as a substring of `chunk_text`."""
|
||||
return excerpt_score(excerpt, chunk_text) >= threshold
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# locating the source chunk an extraction file came from
|
||||
# --------------------------------------------------------------------------
|
||||
def chunk_key_for(json_path: Path, header: Optional[dict]) -> str:
|
||||
"""
|
||||
Resolve the chunk key for an extraction file.
|
||||
|
||||
Prefers the explicit `chunk_key` in the header, otherwise falls back to the
|
||||
JSON file stem (extraction files are named `<chunk_key>.json`).
|
||||
"""
|
||||
if header and header.get("chunk_key"):
|
||||
return str(header["chunk_key"])
|
||||
return json_path.stem
|
||||
|
||||
|
||||
def source_id_for(chunk_key: str, header: Optional[dict]) -> str:
|
||||
"""Resolve the source id; `<source_id>.partNN` → `<source_id>`."""
|
||||
if header and header.get("source_id"):
|
||||
return str(header["source_id"])
|
||||
# chunk keys look like "<source_id>.partNN"
|
||||
return chunk_key.rsplit(".part", 1)[0]
|
||||
|
||||
|
||||
def find_chunk_text(
|
||||
json_path: Path, header: Optional[dict], chunks_dir: Path
|
||||
) -> Optional[str]:
|
||||
"""
|
||||
Return the text of the source chunk for an extraction file, or None.
|
||||
|
||||
Looks for data/chunks/<source_id>/<chunk_key>.txt, then falls back to a
|
||||
recursive glob on the chunk key.
|
||||
"""
|
||||
chunk_key = chunk_key_for(json_path, header)
|
||||
source_id = source_id_for(chunk_key, header)
|
||||
|
||||
candidate = chunks_dir / source_id / f"{chunk_key}.txt"
|
||||
if candidate.is_file():
|
||||
return candidate.read_text(encoding="utf-8", errors="replace")
|
||||
|
||||
matches = list(chunks_dir.rglob(f"{chunk_key}.txt"))
|
||||
if matches:
|
||||
return matches[0].read_text(encoding="utf-8", errors="replace")
|
||||
return None
|
||||
|
||||
|
||||
def source_path_for(source_id: str, sources_dir: Path) -> Optional[str]:
|
||||
"""
|
||||
Read the original `SOURCE:` path from a normalized source header.
|
||||
|
||||
data/sources/<source_id>.txt starts with a `SOURCE: <relative path>` line.
|
||||
"""
|
||||
src_file = sources_dir / f"{source_id}.txt"
|
||||
if not src_file.is_file():
|
||||
return None
|
||||
try:
|
||||
with src_file.open(encoding="utf-8", errors="replace") as fh:
|
||||
for line in fh:
|
||||
if line.startswith("SOURCE:"):
|
||||
return line.split(":", 1)[1].strip()
|
||||
if line.startswith("=") or line.startswith("--- PAGE "):
|
||||
break
|
||||
except OSError:
|
||||
return None
|
||||
return None
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# stable content key for the needs_review queue (plan §5c)
|
||||
# --------------------------------------------------------------------------
|
||||
def normalize_name(name: str) -> str:
|
||||
"""Diacritic-free, lowercased, whitespace-collapsed name (dedup key)."""
|
||||
if not name:
|
||||
return ""
|
||||
decomposed = unicodedata.normalize("NFKD", name)
|
||||
ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
|
||||
return re.sub(r"\s+", " ", ascii_str.lower().strip())
|
||||
|
||||
|
||||
def content_key(normalized_name: str, language: Optional[str], description: str) -> str:
|
||||
"""
|
||||
Stable hash identifying a row for the review queue.
|
||||
|
||||
Only borderline-kept-separate rows and legacy `.doc` rows ever carry
|
||||
needs_review, and neither is auto-merged — so their (normalized_name,
|
||||
language, description) triple is stable across rebuilds.
|
||||
"""
|
||||
payload = f"{normalized_name}\x1f{language or ''}\x1f{_normalize_text(description)}"
|
||||
return hashlib.sha1(payload.encode("utf-8")).hexdigest()
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# iteration
|
||||
# --------------------------------------------------------------------------
|
||||
def iter_extraction_files(extracted_dir: Path):
|
||||
"""Yield every *.json directly under `extracted_dir` (skips _rejected/)."""
|
||||
if not extracted_dir.is_dir():
|
||||
return
|
||||
for path in sorted(extracted_dir.glob("*.json")):
|
||||
if path.is_file():
|
||||
yield path
|
||||
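To see how the review-queue key behaves, the sketch below restates `normalize_name` and `content_key` from this file in a standalone form (with `_normalize_text` renamed to `normalize_text`) and feeds them made-up sample rows. The sample names and descriptions are illustrative only.

```python
import hashlib
import re
import unicodedata
from typing import Optional


def normalize_text(text: str) -> str:
    # Collapse whitespace and lowercase; diacritics are kept.
    return re.sub(r"\s+", " ", (text or "")).strip().lower()


def normalize_name(name: str) -> str:
    # Strip diacritics via NFKD decomposition, then lowercase and collapse spaces.
    if not name:
        return ""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", ascii_str.lower().strip())


def content_key(normalized_name: str, language: Optional[str], description: str) -> str:
    # \x1f (unit separator) keeps the three fields from running together.
    payload = f"{normalized_name}\x1f{language or ''}\x1f{normalize_text(description)}"
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


name_a = normalize_name("Frânghii  și Noduri")
key_a = content_key(name_a, "ro", "Leagă  două sfori.")
key_b = content_key(normalize_name("franghii si noduri"), "ro", "leagă două SFORI.")
key_c = content_key(name_a, "en", "Leagă  două sfori.")
```

Differences in case, spacing, or diacritics in the name leave the key unchanged, while a different language value produces a different key.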
255
scripts/normalize_sources.py
Normal file
@@ -0,0 +1,255 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
normalize_sources.py — walk data/carti-camp-jocuri/ and write data/sources/<id>.txt.
|
||||
|
||||
Output files keep the existing header format:
|
||||
|
||||
SOURCE: <original relative path>
|
||||
CONVERTED: <iso date>
|
||||
FORMAT: <pdf|docx|doc|pptx|html-mirror|zip>
|
||||
NEEDS_REVIEW: <reason> (optional — legacy .doc conversions)
|
||||
==================================================
|
||||
|
||||
--- PAGE 1 ---
|
||||
...
|
||||
|
||||
Each source gets a stable id = <8-hex hash of relative path>_<sanitized stem>,
|
||||
so two files with the same name in different folders never collide.
|
||||
|
||||
The pipeline is script-only: this normalizes formats, it does NOT run extraction.
|
||||
Run `--check-deps` before a long job.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import datetime as _dt
|
||||
import hashlib
|
||||
import re
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
if str(SCRIPT_DIR) not in sys.path:
|
||||
sys.path.insert(0, str(SCRIPT_DIR))
|
||||
|
||||
from extract_common import ( # noqa: E402
|
||||
count_page_markers,
|
||||
dedupe_texts,
|
||||
detect_format,
|
||||
extract_file,
|
||||
extract_html,
|
||||
is_junk,
|
||||
join_pages,
|
||||
preflight,
|
||||
split_pages,
|
||||
)
|
||||
|
||||
HEADER_RULE = "=" * 50
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# stable source id
|
||||
# --------------------------------------------------------------------------
|
||||
def sanitize_stem(stem: str) -> str:
|
||||
s = re.sub(r"[^\w]+", "_", stem, flags=re.UNICODE).strip("_").lower()
|
||||
return s[:60] or "source"
|
||||
|
||||
|
||||
def stable_id(relative_path: str | Path) -> str:
|
||||
"""Collision-proof id derived from the path relative to the corpus root."""
|
||||
rel = str(relative_path).replace("\\", "/")
|
||||
digest = hashlib.sha1(rel.encode("utf-8")).hexdigest()[:8]
|
||||
stem = sanitize_stem(Path(rel).stem)
|
||||
return f"{digest}_{stem}"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# header
|
||||
# --------------------------------------------------------------------------
|
||||
def build_header(
|
||||
source_rel: str, fmt: str, needs_review: str | None = None
|
||||
) -> str:
|
||||
today = _dt.date.today().isoformat()
|
||||
lines = [
|
||||
f"SOURCE: {source_rel}",
|
||||
f"CONVERTED: {today}",
|
||||
f"FORMAT: {fmt}",
|
||||
]
|
||||
if needs_review:
|
||||
lines.append(f"NEEDS_REVIEW: {needs_review}")
|
||||
lines.append(HEADER_RULE)
|
||||
return "\n".join(lines) + "\n\n"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# mirror-site directories
|
||||
# --------------------------------------------------------------------------
|
||||
MIRROR_PAGE_EXTS = {".html", ".htm"}
|
||||
|
||||
|
||||
def is_mirror_dir(path: Path) -> bool:
|
||||
"""A directory counts as a site mirror if it contains HTML pages."""
|
||||
if not path.is_dir():
|
||||
return False
|
||||
if path.name.endswith("_files"):
|
||||
return False
|
||||
return any(
|
||||
p.suffix.lower() in MIRROR_PAGE_EXTS
|
||||
for p in path.rglob("*")
|
||||
if p.is_file()
|
||||
)
|
||||
|
||||
|
||||
def normalize_mirror(mirror_dir: Path) -> str:
|
||||
"""Extract every HTML page in a mirror dir, dedupe near-duplicates, join."""
|
||||
pages: list[tuple[str, str]] = []
|
||||
for html in sorted(mirror_dir.rglob("*")):
|
||||
if not html.is_file() or html.suffix.lower() not in MIRROR_PAGE_EXTS:
|
||||
continue
|
||||
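if any(part.endswith("_files") for part in html.relative_to(mirror_dir).parts):  # match "<page>_files" asset dirs, not only a dir literally named "_files"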
|
||||
continue
|
||||
try:
|
||||
body = extract_html(html)
|
||||
except Exception:
|
||||
continue
|
||||
text = "\n".join(t for _, t in split_pages(body))
|
||||
if text.strip():
|
||||
pages.append((str(html.relative_to(mirror_dir)), text))
|
||||
pages = dedupe_texts(pages)
|
||||
return join_pages([t for _, t in pages])
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# one source
|
||||
# --------------------------------------------------------------------------
|
||||
def normalize_one(
|
||||
path: Path, corpus_root: Path, out_dir: Path
|
||||
) -> dict | None:
|
||||
"""
|
||||
Normalize a single file or mirror directory → data/sources/<id>.txt.
|
||||
|
||||
Returns a result dict, or None if the entry was skipped (junk / ignored).
|
||||
"""
|
||||
rel = path.relative_to(corpus_root)
|
||||
sid = stable_id(rel)
|
||||
|
||||
if path.is_dir():
|
||||
if not is_mirror_dir(path):
|
||||
return None
|
||||
fmt, needs_review = "html-mirror", None
|
||||
body = normalize_mirror(path)
|
||||
else:
|
||||
if is_junk(path):
|
||||
return None
|
||||
fmt = detect_format(path)
|
||||
if fmt in ("unknown", "epub", "txt"):
|
||||
return None # epub duplicates PDFs; txt is not a source format here
|
||||
needs_review = "legacy .doc conversion is imperfect" if fmt == "doc" else None
|
||||
try:
|
||||
body = extract_file(path)
|
||||
except Exception as exc: # noqa: BLE001
|
||||
return {"id": sid, "source": str(rel), "status": "error", "error": str(exc)}
|
||||
|
||||
if not body.strip():
|
||||
return {"id": sid, "source": str(rel), "status": "empty"}
|
||||
|
||||
out_path = out_dir / f"{sid}.txt"
|
||||
out_path.write_text(build_header(str(rel), fmt, needs_review) + body,
|
||||
encoding="utf-8")
|
||||
return {
|
||||
"id": sid,
|
||||
"source": str(rel),
|
||||
"status": "ok",
|
||||
"format": fmt,
|
||||
"pages": count_page_markers(body),
|
||||
"needs_review": bool(needs_review),
|
||||
}
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# walk
|
||||
# --------------------------------------------------------------------------
|
||||
def iter_corpus_entries(corpus_root: Path):
|
||||
"""Yield top-level files and mirror directories under the corpus root."""
|
||||
for entry in sorted(corpus_root.iterdir()):
|
||||
if entry.name.startswith("."):
|
||||
continue
|
||||
if entry.is_dir():
|
||||
if is_mirror_dir(entry):
|
||||
yield entry
|
||||
else:
|
||||
yield entry
|
||||
|
||||
|
||||
def run(corpus_root: Path, out_dir: Path) -> dict:
|
||||
out_dir.mkdir(parents=True, exist_ok=True)
|
||||
results: list[dict] = []
|
||||
for entry in iter_corpus_entries(corpus_root):
|
||||
res = normalize_one(entry, corpus_root, out_dir)
|
||||
if res is not None:
|
||||
results.append(res)
|
||||
summary = {
|
||||
"total": len(results),
|
||||
"ok": sum(1 for r in results if r["status"] == "ok"),
|
||||
"errors": sum(1 for r in results if r["status"] == "error"),
|
||||
"empty": sum(1 for r in results if r["status"] == "empty"),
|
||||
"needs_review": sum(1 for r in results if r.get("needs_review")),
|
||||
"results": results,
|
||||
}
|
||||
return summary
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# CLI
|
||||
# --------------------------------------------------------------------------
|
||||
def print_preflight(report: dict) -> int:
|
||||
print("Dependency preflight")
|
||||
print("--------------------")
|
||||
if report["missing_python"]:
|
||||
print(" MISSING Python packages: " + ", ".join(report["missing_python"]))
|
||||
else:
|
||||
print(" Python packages: OK")
|
||||
if report["missing_system"]:
|
||||
print(" MISSING system tools : " + ", ".join(report["missing_system"]))
|
||||
for w in report["warnings"]:
|
||||
print(f" WARNING: {w}")
|
||||
print(" => " + ("READY" if report["ok"] else "NOT READY — install the above"))
|
||||
return 0 if report["ok"] else 1
|
||||
|
||||
|
||||
def main(argv: list[str] | None = None) -> int:
|
||||
parser = argparse.ArgumentParser(description="Normalize mixed sources to .txt")
|
||||
parser.add_argument("--corpus", default="data/carti-camp-jocuri",
|
||||
help="corpus root to walk")
|
||||
parser.add_argument("--out", default="data/sources", help="output directory")
|
||||
parser.add_argument("--check-deps", action="store_true",
|
||||
help="run dependency preflight and exit")
|
||||
parser.add_argument("--ocr", action="store_true",
|
||||
help="include OCR (tesseract) in the preflight check")
|
||||
args = parser.parse_args(argv)
|
||||
|
||||
if args.check_deps:
|
||||
return print_preflight(preflight(check_ocr=args.ocr))
|
||||
|
||||
report = preflight(check_ocr=args.ocr)
|
||||
if report["missing_python"]:
|
||||
print_preflight(report)
|
||||
return 1
|
||||
for w in report["warnings"]:
|
||||
print(f"WARNING: {w}")
|
||||
|
||||
summary = run(Path(args.corpus), Path(args.out))
|
||||
print(f"normalized : {summary['ok']}/{summary['total']}")
|
||||
print(f"errors : {summary['errors']}")
|
||||
print(f"empty : {summary['empty']}")
|
||||
print(f"needs_review: {summary['needs_review']}")
|
||||
for r in summary["results"]:
|
||||
if r["status"] != "ok":
|
||||
print(f" [{r['status']}] {r['source']}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
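The id scheme can be checked in isolation. The sketch below restates `sanitize_stem` and `stable_id` from this file, with one assumption: it uses `PurePosixPath` instead of `Path` so the result does not depend on the host platform. The sample paths are invented.

```python
import hashlib
import re
from pathlib import PurePosixPath


def sanitize_stem(stem: str) -> str:
    # Replace runs of non-word characters with "_" and cap the length.
    s = re.sub(r"[^\w]+", "_", stem, flags=re.UNICODE).strip("_").lower()
    return s[:60] or "source"


def stable_id(relative_path: str) -> str:
    # Normalize separators so Windows and POSIX paths hash identically.
    rel = str(relative_path).replace("\\", "/")
    digest = hashlib.sha1(rel.encode("utf-8")).hexdigest()[:8]
    stem = sanitize_stem(PurePosixPath(rel).stem)
    return f"{digest}_{stem}"


id_a = stable_id("carti/Jocuri.pdf")
id_b = stable_id("arhiva/Jocuri.pdf")
id_windows = stable_id("carti\\Jocuri.pdf")
```

Two files with the same name in different folders share the `_jocuri` suffix but get different hash prefixes, since the digest covers the whole relative path. With only 8 hex characters a collision is very unlikely rather than impossible.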
@@ -1,143 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
PDF Mass Conversion to Text for Activity Extraction
|
||||
Handles all PDF sizes efficiently with multiple fallback methods
|
||||
"""
|
||||
|
||||
import os
|
||||
import json
|
||||
from pathlib import Path
|
||||
import PyPDF2
|
||||
import pdfplumber
|
||||
from typing import List, Dict
|
||||
import logging
|
||||
|
||||
class PDFConverter:
|
||||
def __init__(self, max_pages=50):
|
||||
self.max_pages = max_pages
|
||||
self.conversion_stats = {}
|
||||
|
||||
def convert_pdf_to_text(self, pdf_path: str) -> str:
|
||||
"""Convert PDF to text using multiple methods with fallbacks"""
|
||||
try:
|
||||
# Method 1: pdfplumber (best for tables and layout)
|
||||
return self._convert_with_pdfplumber(pdf_path)
|
||||
except Exception as e:
|
||||
print(f"pdfplumber failed for {pdf_path}: {e}")
|
||||
|
||||
try:
|
||||
# Method 2: PyPDF2 (fallback)
|
||||
return self._convert_with_pypdf2(pdf_path)
|
||||
except Exception as e2:
|
||||
print(f"PyPDF2 also failed for {pdf_path}: {e2}")
|
||||
return ""
|
||||
|
||||
def _convert_with_pdfplumber(self, pdf_path: str) -> str:
|
||||
"""Primary conversion method using pdfplumber"""
|
||||
text_content = ""
|
||||
|
||||
with pdfplumber.open(pdf_path) as pdf:
|
||||
total_pages = len(pdf.pages)
|
||||
pages_to_process = min(total_pages, self.max_pages)
|
||||
|
||||
print(f" Converting {pdf_path}: {pages_to_process}/{total_pages} pages")
|
||||
|
||||
for i, page in enumerate(pdf.pages[:pages_to_process]):
|
||||
try:
|
||||
page_text = page.extract_text()
|
||||
if page_text:
|
||||
text_content += f"\n--- PAGE {i+1} ---\n"
|
||||
text_content += page_text
|
||||
text_content += "\n"
|
||||
except Exception as e:
|
||||
print(f" Error on page {i+1}: {e}")
|
||||
continue
|
||||
|
||||
self.conversion_stats[pdf_path] = {
|
||||
'method': 'pdfplumber',
|
||||
'pages_processed': pages_to_process,
|
||||
'total_pages': total_pages,
|
||||
'success': True,
|
||||
'text_length': len(text_content)
|
||||
}
|
||||
|
||||
return text_content
|
||||
|
||||
def _convert_with_pypdf2(self, pdf_path: str) -> str:
|
||||
"""Fallback conversion method using PyPDF2"""
|
||||
text_content = ""
|
||||
|
||||
with open(pdf_path, 'rb') as file:
|
||||
reader = PyPDF2.PdfReader(file)
|
||||
total_pages = len(reader.pages)
|
||||
pages_to_process = min(total_pages, self.max_pages)
|
||||
|
||||
print(f" Converting {pdf_path} (fallback): {pages_to_process}/{total_pages} pages")
|
||||
|
||||
for i in range(pages_to_process):
|
||||
try:
|
||||
page = reader.pages[i]
|
||||
page_text = page.extract_text()
|
||||
if page_text:
|
||||
text_content += f"\n--- PAGE {i+1} ---\n"
|
||||
text_content += page_text
|
||||
text_content += "\n"
|
||||
except Exception as e:
|
||||
print(f" Error on page {i+1}: {e}")
|
||||
continue
|
||||
|
||||
self.conversion_stats[pdf_path] = {
|
||||
'method': 'PyPDF2',
|
||||
'pages_processed': pages_to_process,
|
||||
'total_pages': total_pages,
|
||||
'success': True,
|
||||
'text_length': len(text_content)
|
||||
}
|
||||
|
||||
return text_content
|
||||
|
||||
def convert_all_pdfs(self, pdf_directory: str, output_directory: str):
|
||||
"""Convert all PDFs in directory to text files"""
|
||||
pdf_files = list(Path(pdf_directory).glob("**/*.pdf"))
|
||||
|
||||
print(f"🔄 Converting {len(pdf_files)} PDF files to text...")
|
||||
|
||||
os.makedirs(output_directory, exist_ok=True)
|
||||
|
||||
for i, pdf_path in enumerate(pdf_files):
|
||||
print(f"\n[{i+1}/{len(pdf_files)}] Processing {pdf_path.name}...")
|
||||
|
||||
# Convert to text
|
||||
text_content = self.convert_pdf_to_text(str(pdf_path))
|
||||
|
||||
if text_content.strip():
|
||||
# Save as text file
|
||||
output_file = Path(output_directory) / f"{pdf_path.stem}.txt"
|
||||
with open(output_file, 'w', encoding='utf-8') as f:
|
||||
f.write(f"SOURCE: {pdf_path}\n")
|
||||
f.write(f"CONVERTED: 2025-01-11\n")
|
||||
f.write("="*50 + "\n\n")
|
||||
f.write(text_content)
|
||||
|
||||
print(f" ✅ Saved: {output_file}")
|
||||
else:
|
||||
print(f" ❌ No text extracted from {pdf_path.name}")
|
||||
|
||||
# Save conversion statistics
|
||||
stats_file = Path(output_directory) / "conversion_stats.json"
|
||||
with open(stats_file, 'w', encoding='utf-8') as f:
|
||||
json.dump(self.conversion_stats, f, indent=2, ensure_ascii=False)
|
||||
|
||||
print(f"\n🎉 PDF conversion complete! Check {output_directory}")
|
||||
return len([f for f in self.conversion_stats.values() if f['success']])
|
||||
|
||||
# Usage
|
||||
if __name__ == "__main__":
|
||||
converter = PDFConverter(max_pages=50)
|
||||
|
||||
# Convert all PDFs
|
||||
pdf_dir = "/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri"
|
||||
output_dir = "/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri/INDEX-SISTEM-JOCURI/converted_pdfs"
|
||||
|
||||
converted_count = converter.convert_all_pdfs(pdf_dir, output_dir)
|
||||
print(f"Final result: {converted_count} PDFs successfully converted")
|
||||
244
scripts/repair_extractions.py
Normal file
@@ -0,0 +1,244 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
repair_extractions.py — one-shot repair of malformed extraction JSON.
|
||||
|
||||
Subagents systematically emit unescaped ASCII double-quotes inside string
|
||||
values (Romanian text like „Unu" uses a closing " that terminates the JSON
|
||||
string early). Re-extraction reproduces the bug, so we repair instead.
|
||||
|
||||
IMPORTANT — why NOT json_repair: json_repair "recovers" an unescaped quote by
|
||||
ending the string at the stray quote and reinterpreting the trailing text as a
|
||||
new key, which (a) TRUNCATES the value and (b) injects garbage keys. The
|
||||
truncation is silent (the field is still non-empty) and slips past a naive
|
||||
presence check. So we use a faithful char-scanner that ESCAPES stray quotes
|
||||
(\\") instead of splitting on them, then validate the result against the real
|
||||
activity schema (additionalProperties:false also catches any residual split).
|
||||
|
||||
This is an OFFLINE maintenance tool. build_database.py must NOT depend on it —
|
||||
the "DB regenerable from data/extracted/" invariant requires plain valid JSON on
|
||||
disk. We write clean JSON back to data/extracted/ and the build reads vanilla
|
||||
json.
|
||||
|
||||
Source selection (faithful recovery needs the ORIGINAL malformed text):
|
||||
* a chunk is a candidate when a MALFORMED original exists — either the
|
||||
top-level data/extracted/<key>.json is itself invalid, or a malformed
|
||||
original sits in data/extracted/_rejected/<key>.json.
|
||||
* the malformed original is preferred as the repair source.
|
||||
* chunks whose only artifact is already-valid JSON (e.g. a prior json_repair
|
||||
output that lost the original) are NOT silently "repaired" — if such a chunk
|
||||
has no valid top-level file it is reported as needing RE-EXTRACTION.
|
||||
|
||||
Usage:
|
||||
python scripts/repair_extractions.py # report only (dry run)
|
||||
python scripts/repair_extractions.py --apply # write repaired JSON
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import glob
|
||||
import json
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
REPO_ROOT = SCRIPT_DIR.parent
|
||||
EXTRACTED = REPO_ROOT / "data" / "extracted"
|
||||
REJECTED = EXTRACTED / "_rejected"
|
||||
|
||||
if str(SCRIPT_DIR) not in __import__("sys").path:
|
||||
__import__("sys").path.insert(0, str(SCRIPT_DIR))
|
||||
from import_common import DEFAULT_SCHEMA_PATH, load_schema, validate_extraction # noqa: E402
|
||||
|
||||
|
||||
def escape_stray_quotes(s: str) -> str:
    """Escape ASCII double-quotes that occur INSIDE a JSON string value.

    A `"` inside a string is treated as a real string-close only when the next
    non-whitespace char is structural (`,` `}` `]` `:`) or EOF; otherwise it is
    content and is escaped to `\\"`. This preserves the full value instead of
    truncating it (the json_repair failure mode).
    """
    out: list[str] = []
    in_str = False
    esc = False
    n = len(s)
    i = 0
    while i < n:
        c = s[i]
        if esc:
            out.append(c)
            esc = False
            i += 1
            continue
        if c == "\\":
            out.append(c)
            esc = True
            i += 1
            continue
        if c == '"':
            if not in_str:
                in_str = True
                out.append(c)
            else:
                j = i + 1
                while j < n and s[j] in " \t\r\n":
                    j += 1
                nxt = s[j] if j < n else ""
                if nxt in ",}]:" or nxt == "":
                    in_str = False
                    out.append(c)
                else:
                    out.append('\\"')  # content quote → escape, keep value whole
            i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)
|
||||
|
||||
|
||||
def _is_valid_json(path: Path) -> bool:
    try:
        json.loads(path.read_text(encoding="utf-8"))
        return True
    except (json.JSONDecodeError, OSError):
        return False


def _malformed_source(key: str) -> Optional[Path]:
    """Return the malformed-original file for a chunk, preferring top-level."""
    live = EXTRACTED / f"{key}.json"
    if live.exists() and not _is_valid_json(live):
        return live
    rej = REJECTED / f"{key}.json"
    if rej.exists() and not _is_valid_json(rej):
        return rej
    return None
|
||||
|
||||
|
||||
def _candidate_keys() -> tuple[dict[str, Path], list[str]]:
|
||||
"""
|
||||
(repair_candidates, needs_reextraction).
|
||||
|
||||
repair_candidates: key -> malformed source file (faithfully repairable).
|
||||
needs_reextraction: chunks with no malformed original AND no valid
|
||||
top-level file (their original was lost) — must be re-extracted.
|
||||
"""
|
||||
keys = set()
|
||||
for fn in glob.glob(str(EXTRACTED / "*.json")):
|
||||
keys.add(Path(fn).stem)
|
||||
for fn in glob.glob(str(REJECTED / "*.json")):
|
||||
keys.add(Path(fn).stem)
|
||||
|
||||
candidates: dict[str, Path] = {}
|
||||
needs_reextraction: list[str] = []
|
||||
for key in sorted(keys):
|
||||
# A malformed original anywhere is faithfully repairable, and is the
|
||||
# source of truth even if a (json_repair-produced, possibly truncated)
|
||||
# valid top-level file exists — escaping the original never truncates,
|
||||
# so re-repairing from it is always >= the json_repair output.
|
||||
src = _malformed_source(key)
|
||||
if src is not None:
|
||||
candidates[key] = src
|
||||
continue
|
||||
live = EXTRACTED / f"{key}.json"
|
||||
if live.exists() and _is_valid_json(live):
|
||||
continue # genuinely-valid extraction, nothing to do
|
||||
# no valid top-level and no malformed original to repair from
|
||||
needs_reextraction.append(key)
|
||||
return candidates, needs_reextraction
|
||||
|
||||
|
||||
def repair(apply: bool) -> int:
|
||||
schema = load_schema(DEFAULT_SCHEMA_PATH)
|
||||
candidates, needs_reextraction = _candidate_keys()
|
||||
|
||||
print("=" * 64)
|
||||
print(f"REPAIR EXTRACTIONS ({'APPLY' if apply else 'dry run'})")
|
||||
print("=" * 64)
|
||||
print(f"repair candidates: {len(candidates)}")
|
||||
|
||||
def _textlen(data: dict) -> int:
|
||||
total = 0
|
||||
for a in data.get("activities", []):
|
||||
if isinstance(a, dict):
|
||||
for v in a.values():
|
||||
if isinstance(v, str):
|
||||
total += len(v)
|
||||
return total
|
||||
|
||||
ok = 0
|
||||
kept_toplevel = 0
|
||||
still_bad: list[str] = []
|
||||
schema_fail: list[tuple[str, str]] = []
|
||||
|
||||
for key, src in candidates.items():
|
||||
live = EXTRACTED / f"{key}.json"
|
||||
live_valid = live.exists() and _is_valid_json(live)
|
||||
|
||||
raw = src.read_text(encoding="utf-8")
|
||||
fixed = escape_stray_quotes(raw)
|
||||
try:
|
||||
data = json.loads(fixed)
|
||||
except json.JSONDecodeError as exc:
|
||||
if live_valid:
|
||||
kept_toplevel += 1 # genuine top-level is fine; stale _rejected
|
||||
else:
|
||||
still_bad.append(f"{key}: still invalid after escape ({exc})")
|
||||
continue
|
||||
errors = validate_extraction(data, schema)
|
||||
if errors:
|
||||
if live_valid:
|
||||
kept_toplevel += 1
|
||||
else:
|
||||
schema_fail.append((key, errors[0]))
|
||||
print(f" {key[:50]:<50} SCHEMA-FAIL: {errors[0][:40]}")
|
||||
continue
|
||||
|
||||
# Faithfulness guard: only replace a valid top-level when the escaped
|
||||
# repair carries STRICTLY more text (i.e. the top-level was a truncated
|
||||
# json_repair output). Genuine extractions are kept untouched.
|
||||
if live_valid:
|
||||
try:
|
||||
live_data = json.loads(live.read_text(encoding="utf-8"))
|
||||
except json.JSONDecodeError:
|
||||
live_data = {}
|
||||
if _textlen(data) <= _textlen(live_data):
|
||||
kept_toplevel += 1
|
||||
continue
|
||||
|
||||
n = len(data.get("activities", []))
|
||||
print(f" {key[:50]:<50} {n:>3} acts REPAIR")
|
||||
if apply:
|
||||
live.write_text(
|
||||
json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
|
||||
)
|
||||
ok += 1
|
||||
|
||||
print("-" * 64)
|
||||
print(f"repaired: {ok} | kept genuine top-level: {kept_toplevel} | "
|
||||
f"schema-fail: {len(schema_fail)} | still-bad: {len(still_bad)} | "
|
||||
f"needs re-extraction: {len(needs_reextraction)}")
|
||||
for key, err in schema_fail:
|
||||
print(f" ⚠ schema {key}: {err[:60]}")
|
||||
for msg in still_bad:
|
||||
print(f" ✘ {msg}")
|
||||
for key in needs_reextraction:
|
||||
print(f" ↻ re-extract: {key}")
|
||||
if not apply:
|
||||
print("\nDry run — re-run with --apply to write repaired JSON.")
|
||||
print("=" * 64)
|
||||
return 0
|
||||
|
||||
|
||||
def main(argv: Optional[list[str]] = None) -> int:
|
||||
parser = argparse.ArgumentParser(description="Repair malformed extraction JSON.")
|
||||
parser.add_argument("--apply", action="store_true",
|
||||
help="write repaired JSON (default: dry run)")
|
||||
args = parser.parse_args(argv)
|
||||
return repair(args.apply)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
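To make the repair rule in `repair_extractions.py` concrete, here is a minimal, self-contained sketch of what the scanner accepts and what it cannot fix. The scanner is a compact restatement of `escape_stray_quotes` from the file above; the helper name `try_repair` and both sample strings are illustrative assumptions, not part of the repository.

```python
import json
from typing import Optional


def escape_stray_quotes(s: str) -> str:
    """Compact copy of the scanner: escape quotes that are string content."""
    out = []
    in_str = esc = False
    i, n = 0, len(s)
    while i < n:
        c = s[i]
        if esc:                        # char after a backslash: copy verbatim
            out.append(c)
            esc = False
        elif c == "\\":
            out.append(c)
            esc = True
        elif c == '"' and not in_str:  # opening quote of a string
            in_str = True
            out.append(c)
        elif c == '"':
            # Look past whitespace: a structural char (or EOF) means real close.
            j = i + 1
            while j < n and s[j] in " \t\r\n":
                j += 1
            nxt = s[j] if j < n else ""
            if nxt == "" or nxt in ",}]:":
                in_str = False
                out.append(c)
            else:
                out.append('\\"')      # content quote: escape, keep value whole
        else:
            out.append(c)
        i += 1
    return "".join(out)


def try_repair(raw: str) -> Optional[dict]:
    """Hypothetical helper: parsed object, or None if still invalid."""
    try:
        return json.loads(escape_stray_quotes(raw))
    except json.JSONDecodeError:
        return None


# A closing Romanian quote followed by more text is content, so it is escaped.
repairable = '{"name": "Jocul „Unu" și doi", "language": "ro"}'
fixed = try_repair(repairable)

# A content quote directly followed by a comma looks structural to the
# heuristic, so the text is left unchanged and still fails to parse.
ambiguous = '{"name": "„Unu", doi"}'
unfixed = try_repair(ambiguous)

print(fixed)
print(unfixed)
```

In the script itself, the second kind of input ends up in the `still-bad` report, or is caught by schema validation if the split text happens to parse.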
145
scripts/review_queue.py
Normal file
@@ -0,0 +1,145 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
review_queue.py — CLI for the needs_review lifecycle (plan §5c).
|
||||
|
||||
Rows land in the queue when dedup leaves a borderline pair separate, or when a
|
||||
legacy `.doc` source was converted imperfectly. Each row has a stable content
|
||||
key; a decision written here is stored in data/review_decisions.json (git
|
||||
tracked) and re-applied by build_database.py on every rebuild, so the queue
|
||||
never resurfaces a resolved row.
|
||||
|
||||
Commands:
|
||||
python scripts/review_queue.py list
|
||||
python scripts/review_queue.py resolve <id> <merge|keep-separate|drop>
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sqlite3
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
REPO_ROOT = SCRIPT_DIR.parent
|
||||
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
|
||||
if _p not in sys.path:
|
||||
sys.path.insert(0, _p)
|
||||
|
||||
from import_common import content_key, normalize_name # noqa: E402
|
||||
|
||||
VALID_DECISIONS = ("merge", "keep-separate", "drop")
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# review_decisions.json
|
||||
# --------------------------------------------------------------------------
|
||||
def load_decisions(path: Path) -> dict:
    if path.is_file():
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            if isinstance(data, dict):
                return data
        except (json.JSONDecodeError, OSError):
            pass
    return {}


def save_decisions(decisions: dict, path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(decisions, indent=2, ensure_ascii=False, sort_keys=True),
        encoding="utf-8",
    )
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# queue
|
||||
# --------------------------------------------------------------------------
|
||||
def list_queue(db_path: Path) -> list[dict]:
    """Return every needs_review row in the current DB, with its content key."""
    if not db_path.is_file():
        return []
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        rows = conn.execute(
            "SELECT name, normalized_name, language, description "
            "FROM activities WHERE needs_review = 1 ORDER BY normalized_name"
        ).fetchall()
    except sqlite3.OperationalError:
        return []
    finally:
        conn.close()

    out = []
    for row in rows:
        norm = row["normalized_name"] or normalize_name(row["name"])
        key = content_key(norm, row["language"], row["description"] or "")
        out.append({
            "id": key,
            "name": row["name"],
            "language": row["language"],
            "description": row["description"] or "",
        })
    return out
|
||||
|
||||
|
||||
def resolve(decisions_path: Path, content_id: str, decision: str) -> dict:
    """Record a decision for a content key in review_decisions.json."""
    if decision not in VALID_DECISIONS:
        raise ValueError(
            f"invalid decision {decision!r}; expected one of {VALID_DECISIONS}"
        )
    decisions = load_decisions(decisions_path)
    decisions[content_id] = {"decision": decision}
    save_decisions(decisions, decisions_path)
    return decisions
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# CLI
|
||||
# --------------------------------------------------------------------------
|
||||
def main(argv: Optional[list[str]] = None) -> int:
|
||||
parser = argparse.ArgumentParser(description="needs_review queue CLI")
|
||||
parser.add_argument("--db", default="data/activities.db")
|
||||
parser.add_argument("--decisions", default="data/review_decisions.json")
|
||||
sub = parser.add_subparsers(dest="command", required=True)
|
||||
|
||||
sub.add_parser("list", help="list rows currently flagged needs_review")
|
||||
|
||||
p_resolve = sub.add_parser("resolve", help="record a decision for a row")
|
||||
p_resolve.add_argument("id", help="content id from `list`")
|
||||
p_resolve.add_argument("decision", choices=VALID_DECISIONS)
|
||||
|
||||
args = parser.parse_args(argv)
|
||||
|
||||
if args.command == "list":
|
||||
rows = list_queue(Path(args.db))
|
||||
if not rows:
|
||||
print("review queue is empty.")
|
||||
return 0
|
||||
print(f"{len(rows)} row(s) need review:\n")
|
||||
for r in rows:
|
||||
desc = r["description"][:80].replace("\n", " ")
|
||||
print(f" id : {r['id']}")
|
||||
print(f" name : {r['name']} [{r['language']}]")
|
||||
print(f" desc : {desc}")
|
||||
print(f" -> review_queue.py resolve {r['id']} <merge|keep-separate|drop>")
|
||||
print()
|
||||
return 0
|
||||
|
||||
if args.command == "resolve":
|
||||
resolve(Path(args.decisions), args.id, args.decision)
|
||||
print(f"recorded: {args.id} -> {args.decision}")
|
||||
print(f"written to {args.decisions} (applied on next build_database --rebuild)")
|
||||
return 0
|
||||
|
||||
return 1
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
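The decision file written by `resolve` is a flat JSON map from content key to `{"decision": ...}`, saved with sorted keys so that diffs in git stay stable. A small sketch of that round trip, assuming only the behaviour visible in `load_decisions`, `save_decisions` and `resolve` above; the function name `record_decision` and the ids `key-a` / `key-b` are placeholders for illustration.

```python
import json
import tempfile
from pathlib import Path

VALID_DECISIONS = ("merge", "keep-separate", "drop")


def record_decision(path: Path, content_id: str, decision: str) -> dict:
    """Load the decisions map, set one entry, and write it back sorted."""
    if decision not in VALID_DECISIONS:
        raise ValueError(f"invalid decision {decision!r}")
    if path.is_file():
        decisions = json.loads(path.read_text(encoding="utf-8"))
    else:
        decisions = {}
    decisions[content_id] = {"decision": decision}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(decisions, indent=2, ensure_ascii=False, sort_keys=True),
        encoding="utf-8",
    )
    return decisions


with tempfile.TemporaryDirectory() as tmp:
    decisions_path = Path(tmp) / "data" / "review_decisions.json"
    # Placeholder ids; real ids are the content keys printed by `list`.
    record_decision(decisions_path, "key-b", "merge")
    record_decision(decisions_path, "key-a", "drop")
    # Resolving the same id again overwrites the earlier decision.
    record_decision(decisions_path, "key-b", "keep-separate")
    on_disk = decisions_path.read_text(encoding="utf-8")
    stored = json.loads(on_disk)

print(on_disk)
```

Because the map is keyed on content rather than on a row id, a decision survives a rebuild as long as the row's name, language and description are unchanged.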
289
scripts/run_enrichment.py
Normal file
@@ -0,0 +1,289 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
run_enrichment.py — enrichment orchestrator (plan Part B3).
|
||||
|
||||
Mirror of run_extraction.py, on the *other* side of the rebuild. It reads the
|
||||
already-rebuilt data/activities.db, and for every activity emits one subagent
|
||||
prompt asking for a single bilingual + inferred-filter enrichment pass. Like
|
||||
extraction, this script does NOT call the LLM — the interactive Claude Code
|
||||
orchestrator launches waves of subagents on the emitted prompts.
|
||||
|
||||
Keying is the crux (plan §"Cheia de keying"): each row's overlay is keyed on
|
||||
import_common.content_key(normalized_name, language, _normalize_text(description))
|
||||
— the SAME function build_database uses to apply the overlay. The key is stable
|
||||
only while the extraction text is frozen, so enrichment runs AFTER the freezing
|
||||
rebuild.
|
||||
|
||||
Modes:
|
||||
(default) emit one prompt per activity that has no enrichment part yet
|
||||
(resumable: data/enrichment_parts/<key>.json present => skip)
|
||||
--collect merge data/enrichment_parts/*.json -> data/enrichment.json
|
||||
|
||||
Pilot scoping (plan B5): --source <source_id substring> and/or --limit N narrow
|
||||
the emitted prompts to a single source / category for the sign-off pilot.
|
||||
|
||||
Usage:
|
||||
python scripts/run_enrichment.py --source teambuilding_corbu # pilot
|
||||
python scripts/run_enrichment.py # all rows
|
||||
python scripts/run_enrichment.py --collect # merge parts
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sqlite3
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
REPO_ROOT = SCRIPT_DIR.parent
|
||||
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
|
||||
if _p not in sys.path:
|
||||
sys.path.insert(0, _p)
|
||||
|
||||
from import_common import ( # noqa: E402
|
||||
content_key,
|
||||
find_chunk_text,
|
||||
normalize_name,
|
||||
)
|
||||
from repair_extractions import escape_stray_quotes # noqa: E402
|
||||
|
||||
ENRICHMENT_PROMPT = SCRIPT_DIR / "ENRICHMENT_PROMPT.md"
|
||||
|
||||
# Columns pulled from the DB into the prompt as the "current value" context.
|
||||
_DB_COLUMNS = (
|
||||
"id", "name", "description", "rules", "variations",
|
||||
"category", "content_type", "language", "normalized_name",
|
||||
"page_reference", "source_id", "chunk_key",
|
||||
"participants_min", "participants_max",
|
||||
"duration_min", "duration_max",
|
||||
"age_group_min", "age_group_max",
|
||||
)
|
||||
|
||||
# How much source-chunk text to inline. Chunks are page-sized; cap so a dense
|
||||
# chunk does not blow the prompt up, but keep enough to ground the expansion.
|
||||
_CHUNK_TEXT_CAP = 12000
|
||||
|
||||
|
||||
def _fetch_rows(db_path: Path, source_substr: Optional[str]) -> list[dict]:
|
||||
conn = sqlite3.connect(db_path)
|
||||
conn.row_factory = sqlite3.Row
|
||||
try:
|
||||
cols = ", ".join(_DB_COLUMNS)
|
||||
sql = f"SELECT {cols} FROM activities"
|
||||
params: list = []
|
||||
if source_substr:
|
||||
sql += " WHERE (source_id LIKE ? OR chunk_key LIKE ?)"
|
||||
params = [f"%{source_substr}%", f"%{source_substr}%"]
|
||||
sql += " ORDER BY source_id, id"
|
||||
return [dict(r) for r in conn.execute(sql, params).fetchall()]
|
||||
finally:
|
||||
conn.close()
|
||||
|
||||
|
||||
def _row_content_key(row: dict) -> str:
|
||||
return content_key(
|
||||
row.get("normalized_name") or normalize_name(row.get("name") or ""),
|
||||
row.get("language"),
|
||||
row.get("description") or "",
|
||||
)
|
||||
|
||||
|
||||
def _chunk_text_for_row(row: dict, chunks_dir: Path) -> Optional[str]:
|
||||
"""Locate the source-chunk text via the row's chunk_key / source_id."""
|
||||
header = {"chunk_key": row.get("chunk_key"), "source_id": row.get("source_id")}
|
||||
if not header["chunk_key"]:
|
||||
return None
|
||||
# find_chunk_text resolves from the header when chunk_key is present;
|
||||
# the json_path arg is only a fallback, so a synthetic path is fine.
|
||||
text = find_chunk_text(Path(f"{row['chunk_key']}.json"), header, chunks_dir)
|
||||
if text and len(text) > _CHUNK_TEXT_CAP:
|
||||
text = text[:_CHUNK_TEXT_CAP] + "\n…[chunk truncated]…"
|
||||
return text
|
||||
|
||||
|
||||
def _current_fields_block(row: dict) -> str:
|
||||
"""The activity's current DB values, as a compact JSON block for context."""
|
||||
fields = {
|
||||
"name": row.get("name"),
|
||||
"description": row.get("description"),
|
||||
"rules": row.get("rules"),
|
||||
"variations": row.get("variations"),
|
||||
"category": row.get("category"),
|
||||
"content_type": row.get("content_type"),
|
||||
"language": row.get("language"),
|
||||
"participants_min": row.get("participants_min"),
|
||||
"participants_max": row.get("participants_max"),
|
||||
"duration_min": row.get("duration_min"),
|
||||
"duration_max": row.get("duration_max"),
|
||||
"age_group_min": row.get("age_group_min"),
|
||||
"age_group_max": row.get("age_group_max"),
|
||||
}
|
||||
return json.dumps(fields, ensure_ascii=False, indent=2)
|
||||
|
||||
|
||||
def emit_enrichment_prompt(
|
||||
row: dict, key: str, chunks_dir: Path, prompts_dir: Path
|
||||
) -> Path:
|
||||
"""Write the subagent enrichment prompt for one activity."""
|
||||
chunk_text = _chunk_text_for_row(row, chunks_dir)
|
||||
source_block = (
|
||||
chunk_text if chunk_text is not None
|
||||
else "[source chunk text unavailable — translate only what is given "
|
||||
"above; do NOT invent steps, and mark any inferred filter field "
|
||||
"as estimated]"
|
||||
)
|
||||
part_path = f"data/enrichment_parts/{key}.json"
|
||||
text = "\n".join([
|
||||
f"# ENRICHMENT — activity `{row.get('name')}` (id {row.get('id')})",
|
||||
"",
|
||||
f"Follow the rules in `{ENRICHMENT_PROMPT.relative_to(REPO_ROOT)}` EXACTLY.",
|
||||
"Single pass. Translate faithfully to Romanian; expand description_ro "
|
||||
"ONLY from the source chunk text below; mark inferred filter fields in "
|
||||
"`estimated_fields`.",
|
||||
"",
|
||||
f"Write the result JSON to: `{part_path}`",
|
||||
f'It MUST include `"content_key": "{key}"`.',
|
||||
f'Page reference: {row.get("page_reference") or "?"}',
|
||||
"",
|
||||
"## Current activity values (the text to translate / enrich)",
|
||||
"```json",
|
||||
_current_fields_block(row),
|
||||
"```",
|
||||
"",
|
||||
"## Source chunk text (ground description_ro expansion in THIS only)",
|
||||
"```",
|
||||
source_block,
|
||||
"```",
|
||||
"",
|
||||
])
|
||||
prompts_dir.mkdir(parents=True, exist_ok=True)
|
||||
out = prompts_dir / f"{key}.prompt.md"
|
||||
out.write_text(text, encoding="utf-8")
|
||||
return out
|
||||
|
||||
|
||||
def collect_enrichment(parts_dir: Path, out_path: Path) -> dict:
    """Merge data/enrichment_parts/*.json into one flat content_key map."""
    merged: dict = {}
    bad: list[str] = []
    repaired: list[str] = []
    if parts_dir.is_dir():
        for part in sorted(parts_dir.glob("*.json")):
            # Read inside its own try so an unreadable part is recorded as bad
            # instead of aborting the whole collect.
            try:
                raw = part.read_text(encoding="utf-8")
            except OSError:
                bad.append(part.name)
                continue
            try:
                data = json.loads(raw)
            except json.JSONDecodeError:
                # Enrichment subagents hit the same unescaped-ASCII-quote bug as
                # extraction (description_ro is full of Romanian „…"). Repair by
                # escaping rather than dropping the activity's enrichment.
                try:
                    data = json.loads(escape_stray_quotes(raw))
                    repaired.append(part.name)
                except json.JSONDecodeError:
                    bad.append(part.name)
                    continue
            if not isinstance(data, dict):
                bad.append(part.name)
                continue
            key = data.get("content_key") or part.stem
            entry = {k: v for k, v in data.items() if k != "content_key"}
            merged[key] = entry
    out_path.write_text(
        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return {"entries": len(merged), "repaired": repaired,
            "bad_parts": bad, "out": str(out_path)}
|
||||
|
||||
|
||||
def run_emit(
    *,
    db_path: Path,
    chunks_dir: Path,
    parts_dir: Path,
    prompts_dir: Path,
    source_substr: Optional[str],
    limit: Optional[int],
) -> dict:
    rows = _fetch_rows(db_path, source_substr)
    emitted, skipped = 0, 0
    for row in rows:
        key = _row_content_key(row)
        if (parts_dir / f"{key}.json").is_file():
            skipped += 1
            continue
        emit_enrichment_prompt(row, key, chunks_dir, prompts_dir)
        emitted += 1
        if limit and emitted >= limit:
            break
    return {
        "rows": len(rows),
        "emitted": emitted,
        "skipped_done": skipped,
        "prompts_dir": str(prompts_dir),
    }
|
||||
|
||||
|
||||
def main(argv: Optional[list[str]] = None) -> int:
|
||||
parser = argparse.ArgumentParser(description="Enrichment orchestrator.")
|
||||
parser.add_argument("--db", default="data/activities.db")
|
||||
parser.add_argument("--chunks", default="data/chunks")
|
||||
parser.add_argument("--parts", default="data/enrichment_parts")
|
||||
parser.add_argument("--prompts", default="data/enrichment_prompts")
|
||||
parser.add_argument("--out", default="data/enrichment.json")
|
||||
parser.add_argument("--source", default=None,
|
||||
help="only rows whose source_id/chunk_key contains this (pilot)")
|
||||
parser.add_argument("--limit", type=int, default=None,
|
||||
help="cap emitted prompts (pilot)")
|
||||
parser.add_argument("--collect", action="store_true",
|
||||
help="merge enrichment parts into the overlay JSON")
|
||||
args = parser.parse_args(argv)
|
||||
|
||||
print("=" * 60)
|
||||
print("ENRICHMENT ORCHESTRATOR")
|
||||
print("=" * 60)
|
||||
|
||||
if args.collect:
|
||||
result = collect_enrichment(Path(args.parts), Path(args.out))
|
||||
print(f"collected : {result['entries']} entries -> {result['out']}")
|
||||
if result["repaired"]:
|
||||
print(f"repaired : {len(result['repaired'])} parts (unescaped-quote fix)")
|
||||
if result["bad_parts"]:
|
||||
print(f"bad parts : {len(result['bad_parts'])} (skipped)")
|
||||
for name in result["bad_parts"]:
|
||||
print(f" - {name}")
|
||||
print("Run build_database.py --rebuild to apply the overlay.")
|
||||
print("=" * 60)
|
||||
return 0
|
||||
|
||||
summary = run_emit(
|
||||
db_path=Path(args.db),
|
||||
chunks_dir=Path(args.chunks),
|
||||
parts_dir=Path(args.parts),
|
||||
prompts_dir=Path(args.prompts),
|
||||
source_substr=args.source,
|
||||
limit=args.limit,
|
||||
)
|
||||
print(f"rows in DB : {summary['rows']}"
|
||||
+ (f" (filtered by '{args.source}')" if args.source else ""))
|
||||
print(f"already enriched : {summary['skipped_done']}")
|
||||
print(f"prompts emitted : {summary['emitted']}")
|
||||
if summary["emitted"]:
|
||||
print(f"prompts dir : {summary['prompts_dir']}/")
|
||||
print("Launch waves of ~8-16 Sonnet subagents on those prompts, each "
|
||||
"writing data/enrichment_parts/<key>.json, then run "
|
||||
"run_enrichment.py --collect and build_database.py --rebuild.")
|
||||
else:
|
||||
print("Nothing to emit — run --collect then build_database.py --rebuild.")
|
||||
print("=" * 60)
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
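The `--collect` step in `run_enrichment.py` folds the per-activity part files into one overlay map: the `content_key` field becomes the map key (falling back to the filename stem) and is dropped from the entry itself. A minimal sketch of that merge, assuming well-formed parts only; the function name `merge_parts`, the keys `abc123` / `def456` and the field values are placeholders, and the quote-repair and bad-part bookkeeping of the real function are left out.

```python
import json
import tempfile
from pathlib import Path


def merge_parts(parts_dir: Path) -> dict:
    """Merge every part file into one map keyed on content_key."""
    merged = {}
    for part in sorted(parts_dir.glob("*.json")):
        data = json.loads(part.read_text(encoding="utf-8"))
        if not isinstance(data, dict):
            continue  # non-object parts are skipped in this sketch
        key = data.get("content_key") or part.stem  # fall back to filename
        merged[key] = {k: v for k, v in data.items() if k != "content_key"}
    return merged


with tempfile.TemporaryDirectory() as tmp:
    parts = Path(tmp)
    # Placeholder parts; real ones are written by the enrichment subagents.
    (parts / "abc123.json").write_text(
        json.dumps({"content_key": "abc123", "description_ro": "Exemplu"}),
        encoding="utf-8",
    )
    (parts / "def456.json").write_text(
        json.dumps({"description_ro": "Fără cheie"}),
        encoding="utf-8",
    )
    overlay = merge_parts(parts)

print(json.dumps(overlay, ensure_ascii=False, indent=2))
```

Keeping the key out of the entry means the overlay has a single source of truth for identity, which is what lets the rebuild report `matched` versus `orphaned` counts by key lookup alone.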
@@ -1,50 +1,140 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
Main extraction orchestrator
|
||||
Runs the entire extraction process
|
||||
run_extraction.py — extraction orchestrator (plan §3).
|
||||
|
||||
The pipeline is script-only up to the LLM step: this script normalizes the
|
||||
corpus, chunks the normalized sources, and emits one subagent prompt per
|
||||
`pending` chunk. It does NOT run the extraction itself — that step is the
|
||||
interactive Claude Code orchestrator launching waves of subagents.
|
||||
|
||||
Steps:
|
||||
1. normalize data/carti-camp-jocuri/ -> data/sources/*.txt
|
||||
2. chunk data/sources/*.txt -> data/chunks/<id>/*.txt + manifest.json
|
||||
3. emit one prompt per `pending` chunk -> data/chunks/_prompts/*.md
|
||||
4. report how many chunks remain `pending`
|
||||
|
||||
Usage:
|
||||
python scripts/run_extraction.py
|
||||
python scripts/run_extraction.py --skip-normalize # re-chunk only
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
from unified_processor import UnifiedProcessor
|
||||
from import_claude_activities import ClaudeActivityImporter
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
REPO_ROOT = SCRIPT_DIR.parent
|
||||
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
|
||||
if _p not in sys.path:
|
||||
sys.path.insert(0, _p)
|
||||
|
||||
import chunk_sources # noqa: E402
|
||||
import normalize_sources # noqa: E402
|
||||
|
||||
SUBAGENT_PROMPT = SCRIPT_DIR / "SUBAGENT_PROMPT.md"
|
||||
|
||||
|
||||
def emit_chunk_prompt(chunk_key: str, meta: dict, prompts_dir: Path) -> Path:
|
||||
"""Write the subagent prompt for one pending chunk."""
|
||||
chunk_file = meta.get("chunk_file", f"data/chunks/<id>/{chunk_key}.txt")
|
||||
expected_json = meta.get("expected_json", f"{chunk_key}.json")
|
||||
text = "\n".join([
|
||||
f"# EXTRACTION — chunk `{chunk_key}`",
|
||||
"",
|
||||
f"Read ONLY this chunk: `{chunk_file}`",
|
||||
f"Chunk range: {meta.get('chunk_range', '?')}",
|
||||
"",
|
||||
f"Follow the rules in `{SUBAGENT_PROMPT.relative_to(REPO_ROOT)}`.",
|
||||
"Identify every distinct activity, fill the schema "
|
||||
"(`scripts/activity_schema.json`), and write the result to:",
|
||||
"",
|
||||
f" data/extracted/{expected_json}",
|
||||
"",
|
||||
"Header fields to set: "
|
||||
f'source_id="{meta.get("source_id", "")}", chunk_key="{chunk_key}", '
|
||||
f'source_hash="{meta.get("source_hash", "")}".',
|
||||
"",
|
||||
])
|
||||
prompts_dir.mkdir(parents=True, exist_ok=True)
|
||||
out = prompts_dir / f"{chunk_key}.prompt.md"
|
||||
out.write_text(text, encoding="utf-8")
|
||||
return out
|
||||
|
||||
|
||||
def run(
|
||||
*,
|
||||
corpus_root: Path,
|
||||
sources_dir: Path,
|
||||
chunks_dir: Path,
|
||||
skip_normalize: bool = False,
|
||||
) -> dict:
|
||||
summary: dict = {}
|
||||
|
||||
if not skip_normalize:
|
||||
norm = normalize_sources.run(corpus_root, sources_dir)
|
||||
summary["normalized"] = {"ok": norm["ok"], "total": norm["total"],
|
||||
"errors": norm["errors"]}
|
||||
|
||||
chunk_summary = chunk_sources.run(sources_dir, chunks_dir)
|
||||
summary["chunks"] = chunk_summary
|
||||
|
||||
manifest_path = chunks_dir / "manifest.json"
|
||||
manifest = chunk_sources.load_manifest(manifest_path)
|
||||
prompts_dir = chunks_dir / "_prompts"
|
||||
|
||||
pending = {k: m for k, m in manifest["chunks"].items()
|
||||
if m.get("state") == "pending"}
|
||||
for key, meta in sorted(pending.items()):
|
||||
emit_chunk_prompt(key, meta, prompts_dir)
|
||||
|
||||
states: dict[str, int] = {}
|
||||
for m in manifest["chunks"].values():
|
||||
states[m.get("state", "?")] = states.get(m.get("state", "?"), 0) + 1
|
||||
summary["states"] = states
|
||||
summary["pending"] = len(pending)
|
||||
summary["prompts_dir"] = str(prompts_dir)
|
||||
return summary
|
||||
|
||||
|
||||
def main(argv: Optional[list[str]] = None) -> int:
|
||||
parser = argparse.ArgumentParser(description="Extraction orchestrator.")
|
||||
parser.add_argument("--corpus", default="data/carti-camp-jocuri")
|
||||
parser.add_argument("--sources", default="data/sources")
|
||||
parser.add_argument("--chunks", default="data/chunks")
|
||||
parser.add_argument("--skip-normalize", action="store_true",
|
||||
help="skip normalization, re-chunk existing sources only")
|
||||
args = parser.parse_args(argv)
|
||||
|
||||
summary = run(
|
||||
corpus_root=Path(args.corpus),
|
||||
sources_dir=Path(args.sources),
|
||||
chunks_dir=Path(args.chunks),
|
||||
skip_normalize=args.skip_normalize,
|
||||
)
|
||||
|
||||
print("=" * 60)
|
||||
print("EXTRACTION ORCHESTRATOR")
|
||||
print("=" * 60)
|
||||
if "normalized" in summary:
|
||||
n = summary["normalized"]
|
||||
print(f"normalized : {n['ok']}/{n['total']} (errors {n['errors']})")
|
||||
print(f"chunks : {summary['chunks']['chunks']}")
|
||||
for state, count in sorted(summary["states"].items()):
|
||||
print(f" {state:<10}: {count}")
|
||||
print(f"\npending chunks remaining : {summary['pending']}")
|
||||
if summary["pending"]:
|
||||
print(f"subagent prompts written : {summary['prompts_dir']}/")
|
||||
print("Launch waves of ~5-10 subagents on those prompts, then run "
|
||||
"validate_extractions.py and build_database.py --rebuild.")
|
||||
else:
|
||||
print("All chunks extracted — run build_database.py --rebuild.")
|
||||
print("=" * 60)
|
||||
return 0
|
||||
|
||||
def main():
|
||||
print("="*60)
|
||||
print("ACTIVITY EXTRACTION SYSTEM")
|
||||
print("Strategy S8: Hybrid Claude + Scripts")
|
||||
print("="*60)
|
||||
|
||||
# Step 1: Run automated extraction
|
||||
print("\nSTEP 1: Automated Extraction")
|
||||
print("-"*40)
|
||||
processor = UnifiedProcessor()
|
||||
processor.process_automated_formats()
|
||||
|
||||
# Step 2: Wait for Claude processing
|
||||
print("\n" + "="*60)
|
||||
print("STEP 2: Manual Claude Processing Required")
|
||||
print("-"*40)
|
||||
print("Please process PDF/DOC files with Claude using the template.")
|
||||
print("Files are listed in: pdf_doc_for_claude.txt")
|
||||
print("Save extracted activities as JSON in: scripts/extracted_activities/")
|
||||
print("="*60)
|
||||
|
||||
response = input("\nHave you completed Claude processing? (y/n): ")
|
||||
|
||||
if response.lower() == 'y':
|
||||
# Step 3: Import Claude-extracted activities
|
||||
print("\nSTEP 3: Importing Claude-extracted activities")
|
||||
print("-"*40)
|
||||
importer = ClaudeActivityImporter()
|
||||
importer.import_all_json_files()
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("EXTRACTION COMPLETE!")
|
||||
print("="*60)
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
raise SystemExit(main())
|
||||
|
||||
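The state tally and `pending` selection performed by `run` above can be tried in isolation. This is a minimal sketch, assuming only the manifest shape visible in the diff (`{"chunks": {key: {"state": ...}}}`); the helper name `summarize_manifest` and the chunk keys are hypothetical and do not come from the repository.

```python
from collections import Counter


def summarize_manifest(manifest: dict) -> dict:
    """Tally chunk states and list the chunk keys still waiting for extraction."""
    chunks = manifest.get("chunks", {})
    # Entries without a "state" are counted under "?", as in the orchestrator.
    states = Counter(meta.get("state", "?") for meta in chunks.values())
    pending = sorted(
        key for key, meta in chunks.items() if meta.get("state") == "pending"
    )
    return {"states": dict(states), "pending": pending}


# Hypothetical manifest: keys and states are made up for illustration.
example_manifest = {
    "chunks": {
        "src01_p001-020": {"state": "pending"},
        "src01_p020-039": {"state": "rejected"},
        "src02_p001-020": {"state": "pending"},
        "src03_p001-020": {},
    }
}

summary = summarize_manifest(example_manifest)
print(summary)
```

Sorting the pending keys keeps the prompt emission order stable between runs, which is what `sorted(pending.items())` achieves in the script.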
@@ -1,197 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
Text/Markdown Activity Extractor
|
||||
Processes TXT and MD files to extract activities
|
||||
"""
|
||||
|
||||
import re
|
||||
from pathlib import Path
|
||||
from typing import List, Dict
|
||||
import sqlite3
|
||||
from datetime import datetime
|
||||
|
||||
class TextActivityExtractor:
|
||||
def __init__(self, db_path='data/activities.db'):
|
||||
self.db_path = db_path
|
||||
self.activity_patterns = {
|
||||
'section_headers': [
|
||||
r'^#{1,6}\s*(.+)$', # Markdown headers
|
||||
r'^([A-Z][^\.]{10,100})$', # Plain titles
|
||||
r'^\d+\.\s*(.+)$', # Numbered lists
|
||||
r'^[•\-\*]\s*(.+)$', # Bullet points
|
||||
],
|
||||
'activity_markers': [
|
||||
'joc:', 'activitate:', 'exercitiu:', 'team building:',
|
||||
'nume:', 'titlu:', 'denumire:'
|
||||
]
|
||||
}
|
||||
|
||||
def extract_from_text(self, file_path: str) -> List[Dict]:
|
||||
"""Extract activities from a text/markdown file"""
|
||||
activities = []
|
||||
|
||||
try:
|
||||
with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
|
||||
content = f.read()
|
||||
|
||||
# Method 1: look for markdown sections
|
||||
if file_path.endswith('.md'):
|
||||
activities.extend(self._extract_from_markdown(content, file_path))
|
||||
|
||||
# Method 2: look for general patterns
|
||||
activities.extend(self._extract_from_patterns(content, file_path))
|
||||
|
||||
# Method 3: look for structured text blocks
|
||||
activities.extend(self._extract_from_blocks(content, file_path))
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error processing {file_path}: {e}")
|
||||
|
||||
return activities
|
||||
|
||||
def _extract_from_markdown(self, content, source_file):
|
||||
"""Extract activities from markdown format"""
|
||||
activities = []
|
||||
lines = content.split('\n')
|
||||
|
||||
current_activity = None
|
||||
current_content = []
|
||||
|
||||
for line in lines:
|
||||
# Check whether the line is an activity header
|
||||
if re.match(r'^#{1,3}\s*(.+)', line):
|
||||
# Save the previous activity, if any
|
||||
if current_activity and current_content:
|
||||
current_activity['description'] = '\n'.join(current_content[:20]) # Max 20 lines
|
||||
activities.append(current_activity)
|
||||
|
||||
# Check whether the new header is an activity
|
||||
header_text = re.sub(r'^#{1,3}\s*', '', line)
|
||||
if any(marker in header_text.lower() for marker in ['joc', 'activitate', 'exercitiu']):
|
||||
current_activity = {
|
||||
'name': header_text[:200],
|
||||
'source_file': str(source_file),
|
||||
'category': '[A]'
|
||||
}
|
||||
current_content = []
|
||||
else:
|
||||
current_activity = None
|
||||
|
||||
elif current_activity:
|
||||
# Append content to the current activity
|
||||
if line.strip():
|
||||
current_content.append(line)
|
||||
|
||||
# Save the last activity
|
||||
if current_activity and current_content:
|
||||
current_activity['description'] = '\n'.join(current_content[:20])
|
||||
activities.append(current_activity)
|
||||
|
||||
return activities
|
||||
|
||||
def _extract_from_patterns(self, content, source_file):
|
||||
"""Extract using pattern matching"""
|
||||
activities = []
|
||||
|
||||
# Look for activity-specific markers
|
||||
for marker in self.activity_patterns['activity_markers']:
|
||||
pattern = re.compile(f'{re.escape(marker)}\\s*(.+?)(?=\\n\\n|{re.escape(marker)}|$)',
|
||||
re.IGNORECASE | re.DOTALL)
|
||||
matches = pattern.finditer(content)
|
||||
|
||||
for match in matches:
|
||||
activity_text = match.group(1)
|
||||
if len(activity_text) > 20:
|
||||
activity = {
|
||||
'name': activity_text.split('\n')[0][:200],
|
||||
'description': activity_text[:1000],
|
||||
'source_file': str(source_file),
|
||||
'category': '[A]'
|
||||
}
|
||||
activities.append(activity)
|
||||
|
||||
return activities
|
||||
|
||||
def _extract_from_blocks(self, content, source_file):
|
||||
"""Extract from separate text blocks"""
|
||||
activities = []
|
||||
|
||||
# Split into blocks separated by blank lines
|
||||
blocks = re.split(r'\n\s*\n', content)
|
||||
|
||||
for block in blocks:
|
||||
if len(block) > 50: # Minimum 50 characters
|
||||
lines = block.strip().split('\n')
|
||||
first_line = lines[0].strip()
|
||||
|
||||
# Check whether the block looks like an activity
|
||||
if any(keyword in first_line.lower() for keyword in ['joc', 'activitate', 'exercitiu']):
|
||||
activity = {
|
||||
'name': first_line[:200],
|
||||
'description': block[:1000],
|
||||
'source_file': str(source_file),
|
||||
'category': '[A]'
|
||||
}
|
||||
activities.append(activity)
|
||||
|
||||
return activities
|
||||
|
||||
def save_to_database(self, activities):
|
||||
"""Save to the database"""
|
||||
conn = sqlite3.connect(self.db_path)
|
||||
cursor = conn.cursor()
|
||||
|
||||
saved_count = 0
|
||||
|
||||
for activity in activities:
|
||||
try:
|
||||
# Check for duplicates
|
||||
cursor.execute(
|
||||
"SELECT id FROM activities WHERE name = ? AND source_file = ?",
|
||||
(activity.get('name'), activity.get('source_file'))
|
||||
)
|
||||
|
||||
if not cursor.fetchone():
|
||||
columns = list(activity.keys())
|
||||
values = list(activity.values())
|
||||
placeholders = ['?' for _ in values]
|
||||
|
||||
query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
|
||||
cursor.execute(query, values)
|
||||
saved_count += 1
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error saving: {e}")
|
||||
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
return saved_count
|
||||
|
||||
def process_all_text_files(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
|
||||
"""Process all text and markdown files"""
|
||||
base_path = Path(base_path)
|
||||
|
||||
text_files = list(base_path.rglob("*.txt"))
|
||||
md_files = list(base_path.rglob("*.md"))
|
||||
all_files = text_files + md_files
|
||||
|
||||
print(f"Found {len(all_files)} text/markdown files")
|
||||
|
||||
all_activities = []
|
||||
|
||||
for file_path in all_files:
|
||||
activities = self.extract_from_text(str(file_path))
|
||||
all_activities.extend(activities)
|
||||
print(f"Processed {file_path.name}: {len(activities)} activities")
|
||||
|
||||
# Save to database
|
||||
saved = self.save_to_database(all_activities)
|
||||
print(f"\nTotal saved: {saved} activities from {len(all_files)} files")
|
||||
|
||||
return len(all_files), saved
|
||||
|
||||
if __name__ == "__main__":
|
||||
extractor = TextActivityExtractor()
|
||||
extractor.process_all_text_files()
|
||||
@@ -1,151 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Unified Activity Processor
|
||||
Orchestrates all extractors for complete processing
|
||||
"""
|
||||
|
||||
import time
|
||||
from pathlib import Path
|
||||
from html_extractor import HTMLActivityExtractor
|
||||
from text_extractor import TextActivityExtractor
|
||||
import sqlite3
|
||||
|
||||
class UnifiedProcessor:
|
||||
def __init__(self, db_path='data/activities.db'):
|
||||
self.db_path = db_path
|
||||
self.html_extractor = HTMLActivityExtractor(db_path)
|
||||
self.text_extractor = TextActivityExtractor(db_path)
|
||||
self.stats = {
|
||||
'html_processed': 0,
|
||||
'text_processed': 0,
|
||||
'pdf_to_process': 0,
|
||||
'doc_to_process': 0,
|
||||
'total_activities': 0,
|
||||
'start_time': None,
|
||||
'end_time': None
|
||||
}
|
||||
|
||||
def get_current_activity_count(self):
|
||||
"""Get the current number of activities in the DB"""
|
||||
conn = sqlite3.connect(self.db_path)
|
||||
cursor = conn.cursor()
|
||||
cursor.execute("SELECT COUNT(*) FROM activities")
|
||||
count = cursor.fetchone()[0]
|
||||
conn.close()
|
||||
return count
|
||||
|
||||
def count_files_to_process(self, base_path):
|
||||
"""Count the files that need to be processed"""
|
||||
base_path = Path(base_path)
|
||||
|
||||
counts = {
|
||||
'html': len(list(base_path.rglob("*.html"))) + len(list(base_path.rglob("*.htm"))),
|
||||
'txt': len(list(base_path.rglob("*.txt"))),
|
||||
'md': len(list(base_path.rglob("*.md"))),
|
||||
'pdf': len(list(base_path.rglob("*.pdf"))),
|
||||
'doc': len(list(base_path.rglob("*.doc"))),
|
||||
'docx': len(list(base_path.rglob("*.docx")))
|
||||
}
|
||||
|
||||
return counts
|
||||
|
||||
def process_automated_formats(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
|
||||
"""Process all formats that can be automated"""
|
||||
print("="*60)
|
||||
print("UNIFIED ACTIVITY PROCESSOR - AUTOMATED PHASE")
|
||||
print("="*60)
|
||||
|
||||
self.stats['start_time'] = time.time()
|
||||
initial_count = self.get_current_activity_count()
|
||||
|
||||
# Show initial statistics
|
||||
file_counts = self.count_files_to_process(base_path)
|
||||
print(f"\nFiles to process:")
|
||||
for format, count in file_counts.items():
|
||||
print(f" {format.upper()}: {count} files")
|
||||
print(f"\nCurrent activities in database: {initial_count}")
|
||||
print("-"*60)
|
||||
|
||||
# PHASE 1: HTML processing (top priority - high volume)
|
||||
print("\n[1/2] Processing HTML files...")
|
||||
print("-"*40)
|
||||
html_processed, html_errors = self.html_extractor.process_all_html_files(base_path)
|
||||
self.stats['html_processed'] = html_processed
|
||||
|
||||
# PHASE 2: Text/MD processing
|
||||
print("\n[2/2] Processing Text/Markdown files...")
|
||||
print("-"*40)
|
||||
text_processed, text_saved = self.text_extractor.process_all_text_files(base_path)
|
||||
self.stats['text_processed'] = text_processed
|
||||
|
||||
# Final statistics
|
||||
self.stats['end_time'] = time.time()
|
||||
final_count = self.get_current_activity_count()
|
||||
self.stats['total_activities'] = final_count - initial_count
|
||||
|
||||
# Identify the files that need manual processing
|
||||
self.stats['pdf_to_process'] = file_counts['pdf']
|
||||
self.stats['doc_to_process'] = file_counts['doc'] + file_counts['docx']
|
||||
|
||||
self.print_summary()
|
||||
self.save_pdf_doc_list(base_path)
|
||||
|
||||
def print_summary(self):
|
||||
"""Print the processing summary"""
|
||||
print("\n" + "="*60)
|
||||
print("PROCESSING SUMMARY")
|
||||
print("="*60)
|
||||
|
||||
duration = self.stats['end_time'] - self.stats['start_time']
|
||||
|
||||
print(f"\nAutomated Processing Results:")
|
||||
print(f" HTML files processed: {self.stats['html_processed']}")
|
||||
print(f" Text/MD files processed: {self.stats['text_processed']}")
|
||||
print(f" New activities added: {self.stats['total_activities']}")
|
||||
print(f" Processing time: {duration:.1f} seconds")
|
||||
|
||||
print(f"\nFiles requiring Claude processing:")
|
||||
print(f" PDF files: {self.stats['pdf_to_process']}")
|
||||
print(f" DOC/DOCX files: {self.stats['doc_to_process']}")
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("NEXT STEPS:")
|
||||
print("1. Review the file 'pdf_doc_for_claude.txt' for manual processing")
|
||||
print("2. Use Claude to extract activities from PDF/DOC files")
|
||||
print("3. Focus on largest PDF files first (highest activity density)")
|
||||
print("="*60)
|
||||
|
||||
def save_pdf_doc_list(self, base_path):
|
||||
"""Save the list of PDF/DOC files for processing with Claude"""
|
||||
base_path = Path(base_path)
|
||||
|
||||
pdf_files = sorted(base_path.rglob("*.pdf"), key=lambda p: p.stat().st_size, reverse=True)
|
||||
doc_files = list(base_path.rglob("*.doc"))
|
||||
docx_files = list(base_path.rglob("*.docx"))
|
||||
|
||||
with open('pdf_doc_for_claude.txt', 'w', encoding='utf-8') as f:
|
||||
f.write("PDF/DOC FILES FOR CLAUDE PROCESSING\n")
|
||||
f.write("="*60 + "\n")
|
||||
f.write("Files sorted by size (largest first = likely more activities)\n\n")
|
||||
|
||||
f.write("TOP PRIORITY PDF FILES (process these first):\n")
|
||||
f.write("-"*40 + "\n")
|
||||
for i, pdf in enumerate(pdf_files[:20], 1):
|
||||
size_mb = pdf.stat().st_size / (1024*1024)
|
||||
f.write(f"{i}. {pdf.name} ({size_mb:.1f} MB)\n")
|
||||
f.write(f" Path: {pdf}\n\n")
|
||||
|
||||
if len(pdf_files) > 20:
|
||||
f.write(f"\n... and {len(pdf_files)-20} more PDF files\n\n")
|
||||
|
||||
f.write("\nDOC/DOCX FILES:\n")
|
||||
f.write("-"*40 + "\n")
|
||||
for doc in doc_files + docx_files:
|
||||
size_kb = doc.stat().st_size / 1024
|
||||
f.write(f"- {doc.name} ({size_kb:.1f} KB)\n")
|
||||
|
||||
print(f"\nPDF/DOC list saved to: pdf_doc_for_claude.txt")
|
||||
|
||||
if __name__ == "__main__":
|
||||
processor = UnifiedProcessor()
|
||||
processor.process_automated_formats()
|
||||
208
scripts/validate_extractions.py
Normal file
@@ -0,0 +1,208 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
validate_extractions.py — validate every data/extracted/*.json (plan §5b).
|
||||
|
||||
For each extraction file it runs two checks:
|
||||
1. JSON-schema validation against scripts/activity_schema.json,
|
||||
2. the source_excerpt anti-hallucination check (each excerpt must be a fuzzy
|
||||
substring of the chunk it came from).
|
||||
|
||||
For every failing chunk it:
|
||||
* writes the exact re-extraction prompt to data/extracted/_reextract/<chunk>.prompt.md,
|
||||
* marks the chunk `rejected` in data/chunks/manifest.json.
|
||||
|
||||
The orchestrator then re-launches subagents only on the `rejected` chunks; the
|
||||
loop repeats until nothing is rejected.
|
||||
|
||||
Usage:
|
||||
python scripts/validate_extractions.py
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
REPO_ROOT = SCRIPT_DIR.parent
|
||||
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
|
||||
if _p not in sys.path:
|
||||
sys.path.insert(0, _p)
|
||||
|
||||
from import_common import ( # noqa: E402
|
||||
DEFAULT_SCHEMA_PATH,
|
||||
chunk_key_for,
|
||||
excerpt_matches,
|
||||
excerpt_score,
|
||||
find_chunk_text,
|
||||
iter_extraction_files,
|
||||
load_schema,
|
||||
validate_extraction,
|
||||
)
|
||||
|
||||
SUBAGENT_PROMPT = SCRIPT_DIR / "SUBAGENT_PROMPT.md"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# re-extraction prompt
|
||||
# --------------------------------------------------------------------------
|
||||
def build_reextraction_prompt(
|
||||
chunk_key: str, chunk_file: Optional[str], errors: list[str]
|
||||
) -> str:
|
||||
"""The exact prompt to hand a subagent to re-extract a rejected chunk."""
|
||||
chunk_ref = chunk_file or f"data/chunks/<source_id>/{chunk_key}.txt"
|
||||
lines = [
|
||||
f"# RE-EXTRACTION — chunk `{chunk_key}`",
|
||||
"",
|
||||
"The previous extraction for this chunk was **REJECTED**. Reasons:",
|
||||
"",
|
||||
]
|
||||
lines += [f"- {e}" for e in errors]
|
||||
lines += [
|
||||
"",
|
||||
"## What to do",
|
||||
"",
|
||||
f"1. Read ONLY this chunk: `{chunk_ref}`",
|
||||
f"2. Follow the extraction rules in `{SUBAGENT_PROMPT.relative_to(REPO_ROOT)}`.",
|
||||
"3. Fix every problem listed above. In particular:",
|
||||
" - every `source_excerpt` must be copied **verbatim** from the chunk",
|
||||
" (it is checked as a fuzzy substring — invented quotes are rejected);",
|
||||
" - `source_excerpt` and `page_reference` are mandatory on every activity;",
|
||||
" - the output must validate against `scripts/activity_schema.json`.",
|
||||
f"4. Overwrite the extraction file `data/extracted/{chunk_key}.json`.",
|
||||
"",
|
||||
]
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# manifest
|
||||
# --------------------------------------------------------------------------
|
||||
def load_manifest(manifest_path: Path) -> dict:
|
||||
if manifest_path.is_file():
|
||||
try:
|
||||
data = json.loads(manifest_path.read_text(encoding="utf-8"))
|
||||
data.setdefault("chunks", {})
|
||||
return data
|
||||
except (json.JSONDecodeError, OSError):
|
||||
pass
|
||||
return {"chunks": {}}
|
||||
|
||||
|
||||
def save_manifest(manifest: dict, manifest_path: Path) -> None:
|
||||
manifest_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
manifest_path.write_text(
|
||||
json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8"
|
||||
)
|
||||
|
||||
|
||||
def mark_rejected(manifest: dict, chunk_key: str) -> None:
|
||||
"""Flip a chunk to `rejected` in the manifest (creating the entry if new)."""
|
||||
entry = manifest["chunks"].get(chunk_key, {})
|
||||
entry["state"] = "rejected"
|
||||
manifest["chunks"][chunk_key] = entry
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# validation
|
||||
# --------------------------------------------------------------------------
|
||||
def validate_file(json_path: Path, schema: dict, chunks_dir: Path) -> list[str]:
|
||||
"""Return the list of errors for one extraction file (empty == valid)."""
|
||||
try:
|
||||
data = json.loads(json_path.read_text(encoding="utf-8"))
|
||||
except json.JSONDecodeError as exc:
|
||||
return [f"invalid JSON: {exc}"]
|
||||
|
||||
errors = validate_extraction(data, schema)
|
||||
if errors:
|
||||
return errors
|
||||
|
||||
header = data.get("header", {})
|
||||
chunk_text = find_chunk_text(json_path, header, chunks_dir)
|
||||
if chunk_text is None:
|
||||
return [f"source chunk not found for {chunk_key_for(json_path, header)}"]
|
||||
|
||||
for adict in data.get("activities", []):
|
||||
excerpt = adict.get("source_excerpt") or ""
|
||||
if not excerpt_matches(excerpt, chunk_text):
|
||||
score = excerpt_score(excerpt, chunk_text)
|
||||
errors.append(
|
||||
f"activity {adict.get('name')!r}: source_excerpt not found in "
|
||||
f"chunk (best match {score:.0f}/100) — possible hallucination"
|
||||
)
|
||||
return errors
|
||||
|
||||
|
||||
def run(
|
||||
extracted_dir: Path,
|
||||
chunks_dir: Path,
|
||||
manifest_path: Path,
|
||||
schema_path: Path = DEFAULT_SCHEMA_PATH,
|
||||
) -> dict:
|
||||
schema = load_schema(schema_path)
|
||||
manifest = load_manifest(manifest_path)
|
||||
reextract_dir = extracted_dir / "_reextract"
|
||||
|
||||
report = {"total": 0, "valid": 0, "rejected": 0, "rejected_chunks": []}
|
||||
for json_path in iter_extraction_files(extracted_dir):
|
||||
report["total"] += 1
|
||||
errors = validate_file(json_path, schema, chunks_dir)
|
||||
if not errors:
|
||||
report["valid"] += 1
|
||||
continue
|
||||
|
||||
report["rejected"] += 1
|
||||
try:
|
||||
data = json.loads(json_path.read_text(encoding="utf-8"))
|
||||
header = data.get("header", {})
|
||||
except json.JSONDecodeError:
|
||||
header = {}
|
||||
chunk_key = chunk_key_for(json_path, header)
|
||||
chunk_file = None
|
||||
meta = manifest["chunks"].get(chunk_key)
|
||||
if meta:
|
||||
chunk_file = meta.get("chunk_file")
|
||||
|
||||
reextract_dir.mkdir(parents=True, exist_ok=True)
|
||||
prompt = build_reextraction_prompt(chunk_key, chunk_file, errors)
|
||||
(reextract_dir / f"{chunk_key}.prompt.md").write_text(prompt, encoding="utf-8")
|
||||
|
||||
mark_rejected(manifest, chunk_key)
|
||||
report["rejected_chunks"].append({"chunk": chunk_key, "errors": errors})
|
||||
|
||||
save_manifest(manifest, manifest_path)
|
||||
return report
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# CLI
|
||||
# --------------------------------------------------------------------------
|
||||
def main(argv: Optional[list[str]] = None) -> int:
|
||||
parser = argparse.ArgumentParser(description="Validate extraction JSON files.")
|
||||
parser.add_argument("--extracted", default="data/extracted")
|
||||
parser.add_argument("--chunks", default="data/chunks")
|
||||
parser.add_argument("--manifest", default="data/chunks/manifest.json")
|
||||
parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
|
||||
args = parser.parse_args(argv)
|
||||
|
||||
report = run(
|
||||
Path(args.extracted), Path(args.chunks), Path(args.manifest), Path(args.schema)
|
||||
)
|
||||
print(f"extraction files : {report['total']}")
|
||||
print(f" valid : {report['valid']}")
|
||||
print(f" rejected : {report['rejected']}")
|
||||
for item in report["rejected_chunks"]:
|
||||
print(f" [rejected] {item['chunk']}")
|
||||
for err in item["errors"]:
|
||||
print(f" - {err}")
|
||||
if report["rejected"]:
|
||||
print(f"\nRe-extraction prompts written to {args.extracted}/_reextract/")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
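The anti-hallucination check in `validate_file` relies on `excerpt_matches` and `excerpt_score` from `import_common`, which this diff does not show. The sketch below is one possible way to score an excerpt as a fuzzy substring using only the standard library. The function names, the sliding-window approach and the threshold of 85 are assumptions for illustration, not the repository's implementation.

```python
from difflib import SequenceMatcher


def fuzzy_excerpt_score(excerpt: str, chunk_text: str) -> float:
    """Best similarity (0-100) between the excerpt and any window of the chunk."""
    # Collapse whitespace and case so line wrapping does not cause rejections.
    needle = " ".join(excerpt.split()).lower()
    haystack = " ".join(chunk_text.split()).lower()
    if not needle:
        return 0.0
    if needle in haystack:
        return 100.0
    window = len(needle)
    if len(haystack) <= window:
        return 100.0 * SequenceMatcher(None, needle, haystack).ratio()
    # Slide a window of the excerpt's length across the chunk (assumed step).
    step = max(1, window // 4)
    best = 0.0
    for start in range(0, len(haystack) - window + 1, step):
        candidate = haystack[start:start + window]
        best = max(best, SequenceMatcher(None, needle, candidate).ratio())
    return 100.0 * best


def fuzzy_excerpt_matches(excerpt: str, chunk_text: str, threshold: float = 85.0) -> bool:
    """True when the excerpt is close enough to some part of the chunk."""
    return fuzzy_excerpt_score(excerpt, chunk_text) >= threshold


chunk = "Jucatorii cauta indicii ascunse in tabara.\nEchipa castigatoare primeste un premiu."
verbatim_ok = fuzzy_excerpt_matches("cauta   indicii\nascunse", chunk)
invented_ok = fuzzy_excerpt_matches("zzzz qqqq xxxx wwww", chunk)
print(verbatim_ok, invented_ok)
```

A verbatim quote passes even when its whitespace differs from the chunk, while a quote that shares no text with the chunk scores far below the threshold and would be reported as a possible hallucination.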
114
tests/conftest.py
Normal file
@@ -0,0 +1,114 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
Shared pytest fixtures for the extraction-pipeline tests.
|
||||
|
||||
scripts/ is not a package, so it is added to sys.path here. Synthetic fixtures
|
||||
(PDF, docx, zip, HTML) are generated at runtime — no binary blobs in the repo.
|
||||
"""
|
||||
|
||||
import sys
|
||||
import zipfile
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parent.parent
|
||||
SCRIPTS_DIR = REPO_ROOT / "scripts"
|
||||
if str(SCRIPTS_DIR) not in sys.path:
|
||||
sys.path.insert(0, str(SCRIPTS_DIR))
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# synthetic PDF — deliberately large to pin the "no max_pages" regression
|
||||
# --------------------------------------------------------------------------
|
||||
@pytest.fixture
|
||||
def big_pdf(tmp_path):
|
||||
"""A 60-page PDF; each page carries a unique 'PDFMARK-<n>' token."""
|
||||
from reportlab.pdfgen import canvas
|
||||
from reportlab.lib.pagesizes import letter
|
||||
|
||||
path = tmp_path / "big.pdf"
|
||||
c = canvas.Canvas(str(path), pagesize=letter)
|
||||
for n in range(1, 61):
|
||||
c.drawString(72, 720, f"PDFMARK-{n} synthetic activity page number {n}")
|
||||
c.drawString(72, 700, "Acest joc educativ se joaca in echipa.")
|
||||
c.showPage()
|
||||
c.save()
|
||||
return path
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# synthetic docx — 100 paragraphs => 3 synthetic pages at 40 paras/page
|
||||
# --------------------------------------------------------------------------
|
||||
@pytest.fixture
|
||||
def sample_docx(tmp_path):
|
||||
import docx
|
||||
|
||||
path = tmp_path / "sample.docx"
|
||||
document = docx.Document()
|
||||
for i in range(100):
|
||||
document.add_paragraph(f"Paragraf {i}: continut joc team-building.")
|
||||
document.save(str(path))
|
||||
return path
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# synthetic HTML mirror page — with nav/script/footer chrome to strip
|
||||
# --------------------------------------------------------------------------
|
||||
HTML_WITH_NAV = """<!doctype html>
|
||||
<html><head><title>Joc</title>
|
||||
<style>.x{color:red}</style>
|
||||
<script>var tracking = 1;</script>
|
||||
</head><body>
|
||||
<nav><a href="/">Home</a><a href="/games">Games</a></nav>
|
||||
<header>Site Banner Junk</header>
|
||||
<main>
|
||||
<h1>Vanatoarea de comori</h1>
|
||||
<p>Acesta este un joc real de orientare pentru cercetasi.</p>
|
||||
<p>Jucatorii cauta indicii ascunse in tabara.</p>
|
||||
</main>
|
||||
<footer>Copyright 2024 - toate drepturile rezervate</footer>
|
||||
</body></html>
|
||||
"""
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def html_with_nav(tmp_path):
|
||||
path = tmp_path / "page.html"
|
||||
path.write_text(HTML_WITH_NAV, encoding="utf-8")
|
||||
return path
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# synthetic zip — contains a docx and a stray junk file
|
||||
# --------------------------------------------------------------------------
|
||||
@pytest.fixture
|
||||
def sample_zip(tmp_path, sample_docx):
|
||||
path = tmp_path / "archive.zip"
|
||||
with zipfile.ZipFile(path, "w") as zf:
|
||||
zf.write(sample_docx, arcname="inner/sample.docx")
|
||||
zf.writestr("desktop.ini", "junk")
|
||||
return path
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# synthetic normalized source — paginated, with an activity straddling a
|
||||
# page boundary so the chunker overlap can be verified.
|
||||
# --------------------------------------------------------------------------
|
||||
@pytest.fixture
|
||||
def paginated_source(tmp_path):
|
||||
"""A 50-page normalized source. An activity spans the page 20/21 boundary."""
|
||||
lines = ["SOURCE: synthetic/test.pdf", "CONVERTED: 2026-05-19",
|
||||
"FORMAT: pdf", "=" * 50, ""]
|
||||
for n in range(1, 51):
|
||||
lines.append(f"--- PAGE {n} ---")
|
||||
if n == 20:
|
||||
lines.append("ACTIVITY-START jocul podului care traverseaza pagina")
|
||||
elif n == 21:
|
||||
lines.append("continuare a jocului podului ACTIVITY-END")
|
||||
else:
|
||||
lines.append(f"continut obisnuit pe pagina {n}")
|
||||
lines.append("")
|
||||
path = tmp_path / "src_paginated.txt"
|
||||
path.write_text("\n".join(lines), encoding="utf-8")
|
||||
return path
|
||||
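The `paginated_source` fixture exists so that an activity straddling the page 20/21 boundary can be used to verify chunk overlap. The real `chunk_sources` module is not part of this section, so the following is only a sketch of how page-based chunking with overlap could work. The `--- PAGE n ---` marker format comes from the fixture, while `split_pages`, `chunk_pages`, the 20-page chunk size and the one-page overlap are assumptions.

```python
import re

# Page marker format taken from the paginated_source fixture.
PAGE_MARKER = re.compile(r"^--- PAGE (\d+) ---$", re.MULTILINE)


def split_pages(text: str) -> dict[int, str]:
    """Map page number -> page body, using the '--- PAGE n ---' markers."""
    pages: dict[int, str] = {}
    matches = list(PAGE_MARKER.finditer(text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pages[int(match.group(1))] = text[match.end():end].strip()
    return pages


def chunk_pages(page_numbers: list[int], pages_per_chunk: int = 20,
                overlap: int = 1) -> list[list[int]]:
    """Group pages into chunks; consecutive chunks share `overlap` pages."""
    if not 0 <= overlap < pages_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than pages_per_chunk")
    chunks: list[list[int]] = []
    step = pages_per_chunk - overlap
    start = 0
    while start < len(page_numbers):
        chunks.append(page_numbers[start:start + pages_per_chunk])
        if start + pages_per_chunk >= len(page_numbers):
            break  # the last chunk already reaches the final page
        start += step
    return chunks


# Synthetic 50-page source, shaped like the fixture but with made-up content.
source = "\n".join(f"--- PAGE {n} ---\ncontent {n}\n" for n in range(1, 51))
pages = split_pages(source)
chunks = chunk_pages(sorted(pages))
print([(c[0], c[-1]) for c in chunks])
```

With a one-page overlap, page 20 closes the first chunk and opens the second, so an activity that starts on page 20 and ends on page 21 appears whole in at least one chunk.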
3
tests/fixtures/.gitkeep
vendored
Normal file
@@ -0,0 +1,3 @@
|
||||
# Test fixtures (synthetic PDF/docx/zip/HTML) are generated at runtime by
|
||||
# tests/conftest.py — no binary blobs are committed. This file only preserves
|
||||
# the directory in git.
|
||||
334
tests/test_build_database.py
Normal file
@@ -0,0 +1,334 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
Tests for scripts/build_database.py — the import / dedup / swap side.
|
||||
|
||||
Covers: category -> slug + `altele` fallback; dedup across all three threshold
|
||||
bands; EN != RO never merged; field combination on merge; atomic swap with a
|
||||
simulated mid-build crash; the source_excerpt substring check.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parent.parent
|
||||
SCRIPTS_DIR = REPO_ROOT / "scripts"
|
||||
for _p in (str(REPO_ROOT), str(SCRIPTS_DIR)):
|
||||
if _p not in sys.path:
|
||||
sys.path.insert(0, _p)
|
||||
|
||||
import build_database as bd # noqa: E402
|
||||
from app.models.activity import Activity # noqa: E402
|
||||
from app.models.database import DatabaseManager # noqa: E402
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# helpers
|
||||
# --------------------------------------------------------------------------
|
||||
def _activity(**over):
|
||||
base = dict(
|
||||
name="Jocul testului",
|
||||
description="O activitate de echipa in aer liber.",
|
||||
category="team-building",
|
||||
content_type="joc",
|
||||
language="ro",
|
||||
extraction_confidence="high",
|
||||
)
|
||||
base.update(over)
|
||||
return Activity(**base)
|
||||
|
||||
|
||||
def _ext_activity(**over):
|
||||
"""A schema-valid extraction-JSON activity object."""
|
||||
base = dict(
|
||||
name="Jocul testului",
|
||||
description="O activitate de echipa in aer liber.",
|
||||
category="team-building",
|
||||
content_type="joc",
|
||||
language="ro",
|
||||
extraction_confidence="high",
|
||||
source_excerpt="ANCHOR-EXCERPT despre jocul testului",
|
||||
page_reference="page 1",
|
||||
)
|
||||
base.update(over)
|
||||
return base
|
||||
|
||||
|
||||
def _write_extraction(extracted_dir, chunk_key, activities, source_id="src01"):
|
||||
extracted_dir.mkdir(parents=True, exist_ok=True)
|
||||
payload = {
|
||||
"header": {
|
||||
"source_hash": "hash1234deadbeef",
|
||||
"schema_version": "1.0",
|
||||
"prompt_version": "1.0",
|
||||
"chunk_range": "pages 1-20",
|
||||
"source_id": source_id,
|
||||
"chunk_key": chunk_key,
|
||||
},
|
||||
"activities": activities,
|
||||
}
|
||||
(extracted_dir / f"{chunk_key}.json").write_text(
|
||||
json.dumps(payload, ensure_ascii=False), encoding="utf-8"
|
||||
)
|
||||
|
||||
|
||||
def _write_chunk(chunks_dir, source_id, chunk_key, text):
|
||||
d = chunks_dir / source_id
|
||||
d.mkdir(parents=True, exist_ok=True)
|
||||
(d / f"{chunk_key}.txt").write_text(text, encoding="utf-8")
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# step 3 — category normalization
|
||||
# --------------------------------------------------------------------------
|
||||
def test_category_alias_mapped_to_slug():
|
||||
act = bd.dict_to_activity(_ext_activity(category="teambuilding"), "s.txt")
|
||||
assert act.category == "team-building"
|
||||
|
||||
|
||||
def test_unknown_category_falls_back_to_altele():
|
||||
act = bd.dict_to_activity(_ext_activity(category="zzz-not-a-category"), "s.txt")
|
||||
assert act.category == "altele"
|
||||
|
||||
|
||||
def test_content_type_normalized():
|
||||
act = bd.dict_to_activity(_ext_activity(content_type="games"), "s.txt")
|
||||
assert act.content_type == "joc"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# step 4 — dedup, three bands
|
||||
# --------------------------------------------------------------------------
|
||||
def test_dedup_auto_merge_identical_descriptions():
|
||||
""">= 85 similar -> a single merged row."""
|
||||
a = _activity(description="copiii formeaza echipe si traverseaza terenul")
|
||||
b = _activity(description="copiii formeaza echipe si traverseaza terenul")
|
||||
out, stats = bd.dedup_activities([a, b])
|
||||
assert len(out) == 1
|
||||
assert stats["auto_merged"] == 1
|
||||
assert out[0].needs_review == 0
|
||||
|
||||
|
||||
def test_dedup_borderline_keeps_both_and_flags_needs_review():
|
||||
"""60-85 similar -> both kept, both flagged needs_review."""
|
||||
from rapidfuzz import fuzz
|
||||
|
||||
d1 = "alpha beta gamma delta epsilon"
|
||||
d2 = "alpha beta gamma delta epsilon zeta eta theta iota"
|
||||
score = fuzz.token_sort_ratio(d1, d2)
|
||||
assert 60.0 <= score < 85.0, f"precondition: score={score} not borderline"
|
||||
|
||||
a = _activity(description=d1)
|
||||
b = _activity(description=d2)
|
||||
out, stats = bd.dedup_activities([a, b])
|
||||
assert len(out) == 2
|
||||
assert stats["borderline"] == 2
|
||||
assert all(act.needs_review == 1 for act in out)
|
||||
|
||||
|
||||
def test_dedup_low_similarity_kept_as_separate_variants():
|
||||
"""< 60 similar -> separate variants, no needs_review."""
|
||||
from rapidfuzz import fuzz
|
||||
|
||||
d1 = "alpha beta gamma delta epsilon"
|
||||
d2 = "quebec romeo sierra tango uniform victor whiskey"
|
||||
assert fuzz.token_sort_ratio(d1, d2) < 60.0
|
||||
|
||||
a = _activity(description=d1)
|
||||
b = _activity(description=d2)
|
||||
out, stats = bd.dedup_activities([a, b])
|
||||
assert len(out) == 2
|
||||
assert stats["auto_merged"] == 0
|
||||
assert all(act.needs_review == 0 for act in out)
|
||||
|
||||
|
||||
def test_dedup_never_merges_across_languages():
|
||||
"""Same name + same description but EN vs RO -> two distinct rows."""
|
||||
desc = "children form teams and cross the field"
|
||||
ro = _activity(name="Cursa", description=desc, language="ro")
|
||||
en = _activity(name="Cursa", description=desc, language="en")
|
||||
out, stats = bd.dedup_activities([ro, en])
|
||||
assert len(out) == 2
|
||||
assert stats["auto_merged"] == 0
|
||||
langs = {a.language for a in out}
|
||||
assert langs == {"ro", "en"}
|
||||
|
||||
|
||||
def test_merge_combines_fields():
|
||||
"""On merge: longest description/rules, union materials, accumulated sources."""
|
||||
desc = "copiii formeaza echipe si traverseaza terenul cu obstacole"
|
||||
a = _activity(
|
||||
description=desc,
|
||||
rules="regula scurta",
|
||||
materials_list="franghie, esarfa",
|
||||
source_file="a.txt",
|
||||
keywords="echipa",
|
||||
)
|
||||
b = _activity(
|
||||
description=desc,
|
||||
rules="o regula mult mai lunga si mai detaliata pentru joc",
|
||||
materials_list="busola, esarfa",
|
||||
source_file="b.txt",
|
||||
keywords="cooperare",
|
||||
)
|
||||
out, _ = bd.dedup_activities([a, b])
|
||||
assert len(out) == 1
|
||||
merged = out[0]
|
||||
assert merged.rules == "o regula mult mai lunga si mai detaliata pentru joc"
|
||||
mats = set(m.strip() for m in merged.materials_list.split(","))
|
||||
assert mats == {"franghie", "esarfa", "busola"}
|
||||
assert set(merged.source_files) == {"a.txt", "b.txt"}
|
||||
assert merged.popularity_score == 1
|
||||
assert set(k.strip() for k in merged.keywords.split(",")) == {"echipa", "cooperare"}
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# step 5 — review decisions
|
||||
# --------------------------------------------------------------------------
|
||||
def test_review_decision_drop_removes_row():
|
||||
from import_common import content_key, normalize_name
|
||||
|
||||
a = _activity(description="o descriere de test")
|
||||
key = content_key(normalize_name(a.name), a.language, a.description)
|
||||
kept, stats = bd.apply_review_decisions([a], {key: {"decision": "drop"}})
|
||||
assert kept == []
|
||||
assert stats["dropped"] == 1
|
||||
|
||||
|
||||
def test_review_decision_keep_separate_clears_needs_review():
|
||||
from import_common import content_key, normalize_name
|
||||
|
||||
a = _activity(description="o descriere de test")
|
||||
a.needs_review = 1
|
||||
key = content_key(normalize_name(a.name), a.language, a.description)
|
||||
kept, stats = bd.apply_review_decisions([a], {key: {"decision": "keep-separate"}})
|
||||
assert len(kept) == 1 and kept[0].needs_review == 0
|
||||
assert stats["resolved"] == 1
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# step 2b — source_excerpt hallucination check
|
||||
# --------------------------------------------------------------------------
|
||||
def test_hallucinated_excerpt_activity_dropped(tmp_path):
|
||||
extracted = tmp_path / "extracted"
|
||||
chunks = tmp_path / "chunks"
|
||||
sources = tmp_path / "sources"
|
||||
|
||||
good = _ext_activity(
|
||||
name="Joc real", source_excerpt="textul real apare in bucata sursa"
|
||||
)
|
||||
bad = _ext_activity(
|
||||
name="Joc inventat",
|
||||
source_excerpt="acest citat nu exista nicaieri in sursa originala xyzzy",
|
||||
)
|
||||
_write_extraction(extracted, "src01.part01", [good, bad])
|
||||
_write_chunk(
|
||||
chunks, "src01", "src01.part01",
|
||||
"--- PAGE 1 ---\ntextul real apare in bucata sursa pentru jocul real.\n",
|
||||
)
|
||||
|
||||
from import_common import load_schema
|
||||
|
||||
schema = load_schema()
|
||||
res = bd.collect_activities(extracted, chunks, sources, schema)
|
||||
names = {a.name for a in res["activities"]}
|
||||
assert names == {"Joc real"}
|
||||
assert res["activities_hallucinated"] == 1
|
||||
assert (extracted / "_rejected").exists()
|
||||
|
||||
|
||||
def test_schema_invalid_file_moved_to_rejected(tmp_path):
|
||||
extracted = tmp_path / "extracted"
|
||||
chunks = tmp_path / "chunks"
|
||||
sources = tmp_path / "sources"
|
||||
extracted.mkdir(parents=True)
|
||||
|
||||
# missing required header keys + bad activity
|
||||
(extracted / "bad.json").write_text(
|
||||
json.dumps({"header": {}, "activities": [{"name": "x"}]}),
|
||||
encoding="utf-8",
|
||||
)
|
||||
from import_common import load_schema
|
||||
|
||||
res = bd.collect_activities(extracted, chunks, sources, load_schema())
|
||||
assert res["files_rejected_schema"] == 1
|
||||
assert not (extracted / "bad.json").exists()
|
||||
assert (extracted / "_rejected" / "bad.json").exists()
|
||||
assert (extracted / "_rejected" / "bad.errors.txt").exists()
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# end-to-end rebuild + atomic swap
|
||||
# --------------------------------------------------------------------------
|
||||
def _setup_corpus(tmp_path):
|
||||
extracted = tmp_path / "extracted"
|
||||
chunks = tmp_path / "chunks"
|
||||
sources = tmp_path / "sources"
|
||||
excerpt = "jocul testului este o activitate de echipa"
|
||||
_write_extraction(
|
||||
extracted, "src01.part01",
|
||||
[_ext_activity(source_excerpt=excerpt)],
|
||||
)
|
||||
_write_chunk(chunks, "src01", "src01.part01",
|
||||
f"--- PAGE 1 ---\n{excerpt} in aer liber.\n")
|
||||
return extracted, chunks, sources
|
||||
|
||||
|
||||
def test_rebuild_creates_database(tmp_path):
|
||||
extracted, chunks, sources = _setup_corpus(tmp_path)
|
||||
db_path = tmp_path / "activities.db"
|
||||
|
||||
report = bd.rebuild(
|
||||
extracted_dir=extracted, chunks_dir=chunks, sources_dir=sources,
|
||||
db_path=db_path,
|
||||
)
|
||||
assert db_path.exists()
|
||||
assert report["final_count"] == 1
|
||||
|
||||
db = DatabaseManager(str(db_path))
|
||||
rows = db.search_activities()
|
||||
assert len(rows) == 1
|
||||
assert rows[0]["category"] == "team-building"
|
||||
|
||||
|
||||
def test_atomic_swap_keeps_live_db_intact_on_crash(tmp_path, monkeypatch):
|
||||
"""A mid-build crash must leave the live DB byte-identical."""
|
||||
extracted, chunks, sources = _setup_corpus(tmp_path)
|
||||
db_path = tmp_path / "activities.db"
|
||||
|
||||
# a pre-existing live DB with sentinel content
|
||||
live = DatabaseManager(str(db_path))
|
||||
live.insert_activity(_activity(name="Sentinel viu"))
|
||||
before = db_path.read_bytes()
|
||||
|
||||
def boom(self, *a, **k):
|
||||
raise RuntimeError("simulated mid-build crash")
|
||||
|
||||
monkeypatch.setattr(DatabaseManager, "bulk_insert_activities", boom)
|
||||
|
||||
with pytest.raises(RuntimeError, match="simulated mid-build crash"):
|
||||
bd.rebuild(
|
||||
extracted_dir=extracted, chunks_dir=chunks, sources_dir=sources,
|
||||
db_path=db_path,
|
||||
)
|
||||
|
||||
# live DB untouched, tmp cleaned up
|
||||
assert db_path.read_bytes() == before
|
||||
assert not (tmp_path / "activities.db.tmp").exists()
|
||||
|
||||
|
||||
def test_rebuild_backs_up_live_db(tmp_path):
|
||||
extracted, chunks, sources = _setup_corpus(tmp_path)
|
||||
db_path = tmp_path / "activities.db"
|
||||
DatabaseManager(str(db_path)).insert_activity(_activity(name="Vechi"))
|
||||
|
||||
report = bd.rebuild(
|
||||
extracted_dir=extracted, chunks_dir=chunks, sources_dir=sources,
|
||||
db_path=db_path,
|
||||
)
|
||||
assert report["backup"] is not None
|
||||
assert Path(report["backup"]).exists()
|
||||
assert os.path.basename(report["backup"]) == "activities.db.bak"
|
||||
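Note on the atomic-swap tests above: the body of `bd.rebuild` is not part of this diff, so the following is only a minimal sketch of the pattern those tests appear to exercise (build into a sibling `.tmp` file, back up the live file, swap on success, clean up on failure). The names `atomic_rebuild_sketch` and `demo_crash_keeps_live_db` are hypothetical, and the `.tmp` / `.bak` suffixes are taken from the assertions in the tests rather than from the real implementation.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Tuple


def atomic_rebuild_sketch(db_path: Path, build: Callable[[Path], None]) -> None:
    """Hypothetical helper: build into a sibling .tmp file, swap only on success."""
    tmp_path = db_path.with_name(db_path.name + ".tmp")
    bak_path = db_path.with_name(db_path.name + ".bak")
    try:
        build(tmp_path)                      # may raise mid-build
        if db_path.exists():
            shutil.copy2(db_path, bak_path)  # keep a copy of the previous live DB
        os.replace(tmp_path, db_path)        # atomic when both paths share a filesystem
    finally:
        # After a successful swap the tmp file no longer exists; after a crash
        # this removes the half-written file so the live DB is never touched.
        if tmp_path.exists():
            tmp_path.unlink()


def demo_crash_keeps_live_db() -> Tuple[bool, bool]:
    """Simulate a mid-build crash; return (live_unchanged, tmp_removed)."""
    with tempfile.TemporaryDirectory() as d:
        live = Path(d) / "activities.db"
        live.write_bytes(b"sentinel")

        def crashing_build(tmp: Path) -> None:
            tmp.write_bytes(b"half-written")
            raise RuntimeError("simulated mid-build crash")

        try:
            atomic_rebuild_sketch(live, crashing_build)
        except RuntimeError:
            pass
        tmp_gone = not (Path(d) / "activities.db.tmp").exists()
        return live.read_bytes() == b"sentinel", tmp_gone
```

The swap goes last on purpose: every step that can fail runs against the temporary file, so a crash at any earlier point leaves the live database byte-identical, which is what `test_atomic_swap_keeps_live_db_intact_on_crash` checks.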
183
tests/test_chunk_sources.py
Normal file
@@ -0,0 +1,183 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""Tests for scripts/chunk_sources.py."""
|
||||
|
||||
import json
|
||||
|
||||
import chunk_sources as cs
|
||||
import normalize_sources as ns
|
||||
|
||||
|
||||
def _pages(n):
|
||||
return [(i, f"text-{i}") for i in range(1, n + 1)]
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# header parsing
|
||||
# --------------------------------------------------------------------------
|
||||
def test_parse_source_splits_header_and_body(paginated_source):
|
||||
text = paginated_source.read_text(encoding="utf-8")
|
||||
header, body = cs.parse_source(text)
|
||||
assert header["FORMAT"] == "pdf"
|
||||
assert body.lstrip().startswith("--- PAGE 1 ---")
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# page chunking
|
||||
# --------------------------------------------------------------------------
|
||||
def test_chunk_pages_basic_split():
|
||||
chunks = cs.chunk_pages(_pages(50), pages_per_chunk=20, overlap=4)
|
||||
# stride 16: starts at pages 1, 17, 33, ...
|
||||
assert chunks[0]["page_start"] == 1 and chunks[0]["page_end"] == 20
|
||||
assert chunks[1]["page_start"] == 17
|
||||
assert chunks[-1]["page_end"] == 50
|
||||
|
||||
|
||||
def test_chunk_pages_have_overlap():
|
||||
chunks = cs.chunk_pages(_pages(50), pages_per_chunk=20, overlap=4)
|
||||
overlap = chunks[0]["page_end"] - chunks[1]["page_start"] + 1
|
||||
assert overlap == 4
|
||||
|
||||
|
||||
def test_chunk_pages_short_document_single_chunk():
|
||||
chunks = cs.chunk_pages(_pages(8), pages_per_chunk=20, overlap=4)
|
||||
assert len(chunks) == 1
|
||||
assert chunks[0]["page_start"] == 1 and chunks[0]["page_end"] == 8
|
||||
|
||||
|
||||
def test_chunk_pages_empty():
|
||||
assert cs.chunk_pages([]) == []
|
||||
|
||||
|
||||
def test_activity_at_page_boundary_intact_in_one_chunk(paginated_source):
|
||||
"""An activity straddling the page 20/21 boundary must appear whole in >=1 chunk."""
|
||||
text = paginated_source.read_text(encoding="utf-8")
|
||||
chunks = cs.make_chunks(text)
|
||||
full = [
|
||||
c for c in chunks
|
||||
if "ACTIVITY-START" in c["text"] and "ACTIVITY-END" in c["text"]
|
||||
]
|
||||
assert full, "activity spanning a page boundary was split across all chunks"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# word-window chunking for unpaginated text
|
||||
# --------------------------------------------------------------------------
|
||||
def test_chunk_words_window_and_overlap():
|
||||
text = " ".join(f"w{i}" for i in range(25_000))
|
||||
chunks = cs.chunk_words(text, window=10_000, overlap=2_000)
|
||||
assert len(chunks) == 3 # stride 8000 over 25000 words
|
||||
first = chunks[0]["text"].split()
|
||||
second = chunks[1]["text"].split()
|
||||
assert first[8_000:10_000] == second[0:2_000] # 2000-word overlap
|
||||
|
||||
|
||||
def test_make_chunks_unpaginated_uses_word_windows():
|
||||
body = "cuvant " * 15_000
|
||||
text = "SOURCE: x\nFORMAT: txt\n" + "=" * 50 + "\n\n" + body
|
||||
chunks = cs.make_chunks(text)
|
||||
assert len(chunks) >= 2
|
||||
assert chunks[0]["chunk_range"].startswith("words")
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# stable source ids — anti-collision
|
||||
# --------------------------------------------------------------------------
|
||||
def test_stable_id_same_stem_different_path_no_collision():
|
||||
a = ns.stable_id("camp/games/scout.pdf")
|
||||
b = ns.stable_id("school/lessons/scout.pdf")
|
||||
assert a != b
|
||||
assert a.endswith("_scout") and b.endswith("_scout")
|
||||
|
||||
|
||||
def test_stable_id_deterministic():
|
||||
assert ns.stable_id("a/b/c.pdf") == ns.stable_id("a/b/c.pdf")
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# manifest registry + idempotency
|
||||
# --------------------------------------------------------------------------
|
||||
def test_run_writes_chunks_and_manifest(paginated_source, tmp_path):
|
||||
sources_dir = tmp_path / "sources"
|
||||
sources_dir.mkdir()
|
||||
(sources_dir / paginated_source.name).write_text(
|
||||
paginated_source.read_text(encoding="utf-8"), encoding="utf-8"
|
||||
)
|
||||
chunks_dir = tmp_path / "chunks"
|
||||
|
||||
summary = cs.run(sources_dir, chunks_dir)
|
||||
assert summary["sources"] == 1
|
||||
assert summary["chunks"] >= 2
|
||||
|
||||
manifest = json.loads((chunks_dir / "manifest.json").read_text(encoding="utf-8"))
|
||||
assert manifest["chunks"]
|
||||
for key, meta in manifest["chunks"].items():
|
||||
assert meta["state"] == "pending"
|
||||
assert meta["expected_json"] == f"{key}.json"
|
||||
assert (chunks_dir.parent / meta["chunk_file"]).exists()
|
||||
|
||||
|
||||
def test_manifest_idempotent_preserves_state(paginated_source, tmp_path):
|
||||
sources_dir = tmp_path / "sources"
|
||||
sources_dir.mkdir()
|
||||
(sources_dir / paginated_source.name).write_text(
|
||||
paginated_source.read_text(encoding="utf-8"), encoding="utf-8"
|
||||
)
|
||||
chunks_dir = tmp_path / "chunks"
|
||||
manifest_path = chunks_dir / "manifest.json"
|
||||
|
||||
cs.run(sources_dir, chunks_dir)
|
||||
|
||||
# orchestrator marks one chunk done
|
||||
manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
|
||||
first_key = next(iter(manifest["chunks"]))
|
||||
n_before = len(manifest["chunks"])
|
||||
manifest["chunks"][first_key]["state"] = "done"
|
||||
manifest_path.write_text(json.dumps(manifest), encoding="utf-8")
|
||||
|
||||
# re-run: 'done' must survive, no chunk added or lost
|
||||
cs.run(sources_dir, chunks_dir)
|
||||
manifest2 = json.loads(manifest_path.read_text(encoding="utf-8"))
|
||||
assert len(manifest2["chunks"]) == n_before
|
||||
assert manifest2["chunks"][first_key]["state"] == "done"
|
||||
assert all(
|
||||
m["state"] in ("pending", "done") for m in manifest2["chunks"].values()
|
||||
)
|
||||
|
||||
|
||||
def test_manifest_resets_state_when_source_changes(paginated_source, tmp_path):
|
||||
sources_dir = tmp_path / "sources"
|
||||
sources_dir.mkdir()
|
||||
src = sources_dir / paginated_source.name
|
||||
src.write_text(paginated_source.read_text(encoding="utf-8"), encoding="utf-8")
|
||||
chunks_dir = tmp_path / "chunks"
|
||||
manifest_path = chunks_dir / "manifest.json"
|
||||
|
||||
cs.run(sources_dir, chunks_dir)
|
||||
manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
|
||||
first_key = next(iter(manifest["chunks"]))
|
||||
manifest["chunks"][first_key]["state"] = "done"
|
||||
manifest_path.write_text(json.dumps(manifest), encoding="utf-8")
|
||||
|
||||
# mutate the source content -> hash changes -> state resets
|
||||
src.write_text(src.read_text(encoding="utf-8") + "\n--- PAGE 51 ---\nextra\n",
|
||||
encoding="utf-8")
|
||||
cs.run(sources_dir, chunks_dir)
|
||||
manifest2 = json.loads(manifest_path.read_text(encoding="utf-8"))
|
||||
assert manifest2["chunks"][first_key]["state"] == "pending"
|
||||
|
||||
|
||||
def test_prune_stale_removes_orphan_entries(paginated_source, tmp_path):
|
||||
sources_dir = tmp_path / "sources"
|
||||
sources_dir.mkdir()
|
||||
src = sources_dir / paginated_source.name
|
||||
src.write_text(paginated_source.read_text(encoding="utf-8"), encoding="utf-8")
|
||||
chunks_dir = tmp_path / "chunks"
|
||||
|
||||
cs.run(sources_dir, chunks_dir)
|
||||
# delete the source -> its chunks become stale
|
||||
src.unlink()
|
||||
summary = cs.run(sources_dir, chunks_dir)
|
||||
assert summary["chunks"] == 0
|
||||
assert summary["pruned"] >= 1
|
||||
manifest = json.loads((chunks_dir / "manifest.json").read_text(encoding="utf-8"))
|
||||
assert manifest["chunks"] == {}
|
||||
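Note on the page-chunking tests above: `cs.chunk_pages` itself is not shown in this diff, so the block below is a hedged sketch of overlapping page windows that would satisfy the asserted behaviour (stride = `pages_per_chunk - overlap`, last chunk ends on the final page, short documents yield one chunk). The function name `chunk_pages_sketch` and the exact dict layout are assumptions for illustration only.

```python
from typing import Dict, List, Tuple


def chunk_pages_sketch(
    pages: List[Tuple[int, str]],
    pages_per_chunk: int = 20,
    overlap: int = 4,
) -> List[Dict]:
    """Hypothetical overlapping-window chunker over (page_number, text) pairs."""
    if not 0 <= overlap < pages_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than pages_per_chunk")
    stride = pages_per_chunk - overlap
    chunks: List[Dict] = []
    start = 0
    while start < len(pages):
        window = pages[start:start + pages_per_chunk]
        chunks.append({
            "page_start": window[0][0],
            "page_end": window[-1][0],
            "text": "\n".join(text for _, text in window),
        })
        # Stop once this window already reaches the last page; otherwise the
        # next iteration would emit a chunk made only of overlap pages.
        if start + pages_per_chunk >= len(pages):
            break
        start += stride
    return chunks
```

With the defaults and a 50-page input, the windows start at pages 1, 17 and 33, so an activity that straddles the page 20/21 boundary sits entirely inside the second window (pages 17 to 36).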
231
tests/test_enrichment.py
Normal file
@@ -0,0 +1,231 @@
|
||||
"""
|
||||
Tests for the enrichment overlay (plan Part B) and the new filter axes /
|
||||
bilingual display helpers (plan Part A).
|
||||
|
||||
Covers:
|
||||
* config_taxonomy.normalize_indoor_outdoor / normalize_space_needed
|
||||
* build_database.apply_enrichment keying, field application, estimated tally
|
||||
* DatabaseManager indoor_outdoor / space_needed equality filters
|
||||
* FTS5 indexing of the *_ro columns
|
||||
* Activity bilingual display helpers
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
||||
import pytest
|
||||
|
||||
PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
|
||||
if PROJECT_ROOT not in sys.path:
|
||||
sys.path.insert(0, PROJECT_ROOT)
|
||||
SCRIPTS = os.path.join(PROJECT_ROOT, "scripts")
|
||||
if SCRIPTS not in sys.path:
|
||||
sys.path.insert(0, SCRIPTS)
|
||||
|
||||
from app.models.activity import Activity # noqa: E402
|
||||
from app.models.database import DatabaseManager # noqa: E402
|
||||
from app.config_taxonomy import ( # noqa: E402
|
||||
normalize_indoor_outdoor,
|
||||
normalize_space_needed,
|
||||
)
|
||||
from import_common import content_key, normalize_name # noqa: E402
|
||||
from build_database import apply_enrichment # noqa: E402
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# taxonomy normalizers
|
||||
# --------------------------------------------------------------------------
|
||||
@pytest.mark.parametrize("raw,expected", [
|
||||
("indoor", "indoor"),
|
||||
("Outdoor", "outdoor"),
|
||||
("either", "either"),
|
||||
("interior", "indoor"),
|
||||
("aer liber", "outdoor"),
|
||||
("both", "either"),
|
||||
("", None),
|
||||
("nonsense", None),
|
||||
(None, None),
|
||||
])
|
||||
def test_normalize_indoor_outdoor(raw, expected):
|
||||
assert normalize_indoor_outdoor(raw) == expected
|
||||
|
||||
|
||||
@pytest.mark.parametrize("raw,expected", [
|
||||
("mic", "mic"),
|
||||
("MEDIU", "mediu"),
|
||||
("mare", "mare"),
|
||||
("small", "mic"),
|
||||
("large", "mare"),
|
||||
("", None),
|
||||
("huge", None),
|
||||
(None, None),
|
||||
])
|
||||
def test_normalize_space_needed(raw, expected):
|
||||
assert normalize_space_needed(raw) == expected
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# apply_enrichment
|
||||
# --------------------------------------------------------------------------
|
||||
def _activity(name="Joc de test", description="O descriere de test.", language="ro"):
|
||||
return Activity(
|
||||
name=name, description=description, category="team-building",
|
||||
content_type="joc", source_file="t.txt", language=language,
|
||||
)
|
||||
|
||||
|
||||
def _key_for(act: Activity) -> str:
|
||||
return content_key(
|
||||
act.normalized_name or normalize_name(act.name),
|
||||
act.language,
|
||||
act.description or "",
|
||||
)
|
||||
|
||||
|
||||
def test_apply_enrichment_matches_and_applies_fields():
|
||||
act = _activity()
|
||||
key = _key_for(act)
|
||||
enrichment = {
|
||||
key: {
|
||||
"name_ro": "Joc de test (RO)",
|
||||
"description_ro": "Descriere îmbogățită în română.",
|
||||
"indoor_outdoor": "outdoor",
|
||||
"space_needed": "mediu",
|
||||
"participants_min": 4,
|
||||
"participants_max": 12,
|
||||
"estimated_fields": ["space_needed", "participants_min", "participants_max"],
|
||||
}
|
||||
}
|
||||
stats = apply_enrichment([act], enrichment)
|
||||
|
||||
assert act.name_ro == "Joc de test (RO)"
|
||||
assert act.description_ro == "Descriere îmbogățită în română."
|
||||
assert act.indoor_outdoor == "outdoor"
|
||||
assert act.space_needed == "mediu"
|
||||
assert act.participants_min == 4 and act.participants_max == 12
|
||||
assert set(act.estimated_fields) == {"space_needed", "participants_min", "participants_max"}
|
||||
|
||||
assert stats["entries"] == 1
|
||||
assert stats["matched"] == 1
|
||||
assert stats["orphaned"] == 0
|
||||
# indoor_outdoor stated, space_needed estimated
|
||||
assert stats["fields_stated"].get("indoor_outdoor") == 1
|
||||
assert stats["fields_estimated"].get("space_needed") == 1
|
||||
|
||||
|
||||
def test_apply_enrichment_orphan_entry_counted():
|
||||
act = _activity()
|
||||
enrichment = {"deadbeef" * 5: {"name_ro": "nu se potrivește"}}
|
||||
stats = apply_enrichment([act], enrichment)
|
||||
assert stats["matched"] == 0
|
||||
assert stats["orphaned"] == 1
|
||||
assert act.name_ro is None # untouched
|
||||
|
||||
|
||||
def test_apply_enrichment_absent_fields_leave_value_untouched():
|
||||
act = _activity()
|
||||
act.participants_min = 5
|
||||
key = _key_for(act)
|
||||
# entry only translates name; participants must be preserved
|
||||
apply_enrichment([act], {key: {"name_ro": "Tradus"}})
|
||||
assert act.participants_min == 5
|
||||
assert act.name_ro == "Tradus"
|
||||
|
||||
|
||||
def test_apply_enrichment_drops_unrecognised_enum():
|
||||
act = _activity()
|
||||
key = _key_for(act)
|
||||
apply_enrichment([act], {key: {"indoor_outdoor": "spaceship", "space_needed": "mic"}})
|
||||
assert act.indoor_outdoor is None # unrecognised → dropped
|
||||
assert act.space_needed == "mic"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# DB equality filters + FTS on *_ro
|
||||
# --------------------------------------------------------------------------
|
||||
@pytest.fixture
|
||||
def db(tmp_path):
|
||||
return DatabaseManager(str(tmp_path / "enrich.db"))
|
||||
|
||||
|
||||
def _insert(db, **overrides):
|
||||
base = dict(
|
||||
name="Activitate", description="desc", category="camp-outdoor",
|
||||
content_type="joc", source_file="t.txt", language="ro",
|
||||
)
|
||||
base.update(overrides)
|
||||
return db.insert_activity(Activity(**base))
|
||||
|
||||
|
||||
def test_indoor_outdoor_equality_filter(db):
|
||||
_insert(db, name="In casa", indoor_outdoor="indoor")
|
||||
_insert(db, name="Afara", indoor_outdoor="outdoor")
|
||||
res = db.search_activities(indoor_outdoor="outdoor")
|
||||
assert len(res) == 1
|
||||
assert res[0]["name"] == "Afara"
|
||||
|
||||
|
||||
def test_space_needed_equality_filter(db):
|
||||
_insert(db, name="Mic", space_needed="mic")
|
||||
_insert(db, name="Mare", space_needed="mare")
|
||||
res = db.search_activities(space_needed="mare")
|
||||
assert len(res) == 1
|
||||
assert res[0]["name"] == "Mare"
|
||||
|
||||
|
||||
def test_fts_indexes_name_ro(db):
|
||||
_insert(db, name="Treasure Hunt", name_ro="Vânătoarea de comori")
|
||||
# term only present in the Romanian twin
|
||||
res = db.search_activities(search_text="comori")
|
||||
assert len(res) == 1
|
||||
assert res[0]["name"] == "Treasure Hunt"
|
||||
|
||||
|
||||
def test_fts_indexes_description_ro(db):
|
||||
_insert(db, name="Game", description="english desc",
|
||||
description_ro="o activitate de cooperare")
|
||||
res = db.search_activities(search_text="cooperare")
|
||||
assert len(res) == 1
|
||||
|
||||
|
||||
def test_ro_columns_round_trip(db):
|
||||
aid = _insert(
|
||||
db, name="X", name_ro="X-ro", description_ro="d-ro",
|
||||
rules_ro="r-ro", variations_ro="v-ro",
|
||||
indoor_outdoor="either", space_needed="mediu",
|
||||
estimated_fields=["duration_min"], source_id="src1",
|
||||
source_ids=["src1", "src2"], chunk_key="src1.part01",
|
||||
)
|
||||
row = db.get_activity_by_id(aid)
|
||||
loaded = Activity.from_dict(row)
|
||||
assert loaded.name_ro == "X-ro"
|
||||
assert loaded.indoor_outdoor == "either"
|
||||
assert loaded.space_needed == "mediu"
|
||||
assert loaded.estimated_fields == ["duration_min"]
|
||||
assert loaded.source_ids == ["src1", "src2"]
|
||||
assert loaded.chunk_key == "src1.part01"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# display helpers
|
||||
# --------------------------------------------------------------------------
|
||||
def test_display_helpers_prefer_ro_with_fallback():
|
||||
act = _activity(name="Original", description="Original desc")
|
||||
assert act.get_display_name() == "Original" # no translation yet
|
||||
assert act.get_display_description() == "Original desc"
|
||||
act.name_ro = "Tradus"
|
||||
act.description_ro = "Descriere tradusă"
|
||||
assert act.get_display_name() == "Tradus"
|
||||
assert act.get_display_description() == "Descriere tradusă"
|
||||
assert act.has_translation() is True
|
||||
|
||||
|
||||
def test_is_estimated_and_axis_displays():
|
||||
act = _activity()
|
||||
act.indoor_outdoor = "outdoor"
|
||||
act.space_needed = "mare"
|
||||
act.estimated_fields = ["space_needed"]
|
||||
assert act.get_indoor_outdoor_display() == "Exterior"
|
||||
assert act.get_space_needed_display() == "Spațiu mare"
|
||||
assert act.is_estimated("space_needed") is True
|
||||
assert act.is_estimated("indoor_outdoor") is False
|
||||
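Note on the `apply_enrichment` tests above: the real `content_key` and `apply_enrichment` are not included in this diff, so this is only a sketch of a keyed overlay with `matched` / `orphaned` accounting. `content_key_sketch` (a SHA-1 over name, language and description) and `apply_overlay_sketch` are hypothetical stand-ins, plain dicts replace the `Activity` model, and the real function additionally validates enum values and tallies stated versus estimated fields.

```python
import hashlib
from typing import Dict, List


def content_key_sketch(name: str, language: str, description: str) -> str:
    """Hypothetical stable key; the project's real content_key may differ."""
    raw = "|".join([name.strip().lower(), language, description.strip()])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def apply_overlay_sketch(
    activities: List[dict], enrichment: Dict[str, dict]
) -> Dict[str, int]:
    """Copy enrichment fields onto matching activities and count the outcome."""
    matched_keys = set()
    for act in activities:
        key = content_key_sketch(
            act["name"], act["language"], act.get("description", "")
        )
        entry = enrichment.get(key)
        if entry is None:
            continue
        matched_keys.add(key)
        # Only fields present in the entry are written, so values the
        # enrichment does not mention are left untouched.
        for field, value in entry.items():
            act[field] = value
    return {
        "entries": len(enrichment),
        "matched": len(matched_keys),
        "orphaned": len(enrichment) - len(matched_keys),
    }
```

An entry counts as orphaned when its key matches no activity in the rebuild, which is the drift condition the pilot report's `matched` / `orphaned` row is meant to surface.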
177
tests/test_extract_common.py
Normal file
@@ -0,0 +1,177 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""Tests for scripts/extract_common.py."""
|
||||
|
||||
import shutil
|
||||
import zipfile
|
||||
|
||||
import pytest
|
||||
|
||||
import extract_common as ec
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# format detection
|
||||
# --------------------------------------------------------------------------
|
||||
def test_detect_format():
|
||||
assert ec.detect_format("a/b/file.PDF") == "pdf"
|
||||
assert ec.detect_format("x.docx") == "docx"
|
||||
assert ec.detect_format("x.doc") == "doc"
|
||||
assert ec.detect_format("x.pptx") == "pptx"
|
||||
assert ec.detect_format("x.html") == "html"
|
||||
assert ec.detect_format("x.zip") == "zip"
|
||||
assert ec.detect_format("x.epub") == "epub"
|
||||
assert ec.detect_format("x.xyz") == "unknown"
|
||||
|
||||
|
||||
def test_is_junk():
|
||||
assert ec.is_junk("some/desktop.ini")
|
||||
assert ec.is_junk("notes.bak")
|
||||
assert ec.is_junk("README.md")
|
||||
assert not ec.is_junk("1000 Scout Games.pdf")
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
# PDF: the critical "no max_pages" regression
# --------------------------------------------------------------------------
def test_pdf_extracts_all_60_pages(big_pdf):
    body = ec.extract_pdf(big_pdf)
    # the old converter capped at 50 pages; page 60 must be present now
    assert "--- PAGE 60 ---" in body
    assert "PDFMARK-60" in body
    assert ec.count_page_markers(body) == 60


def test_pdf_does_not_truncate_mid_document(big_pdf):
    body = ec.extract_pdf(big_pdf)
    pages = ec.split_pages(body)
    assert pages[-1][0] == 60  # last marker is the real last page


# --------------------------------------------------------------------------
# page join / split round-trip
# --------------------------------------------------------------------------
def test_join_split_round_trip():
    body = ec.join_pages(["alpha", "beta", "gamma"])
    pages = ec.split_pages(body)
    assert [n for n, _ in pages] == [1, 2, 3]
    assert [t for _, t in pages] == ["alpha", "beta", "gamma"]


def test_split_pages_no_markers_returns_empty():
    assert ec.split_pages("plain text with no markers") == []


# --------------------------------------------------------------------------
# docx: synthetic page markers
# --------------------------------------------------------------------------
def test_docx_synthetic_page_markers(sample_docx):
    body = ec.extract_docx(sample_docx)
    # 100 paragraphs / 40 per page => 3 pages
    assert ec.count_page_markers(body) == 3
    assert "Paragraf 99" in body


# --------------------------------------------------------------------------
# HTML mirror: nav/script/footer stripped
# --------------------------------------------------------------------------
def test_html_strips_chrome(html_with_nav):
    body = ec.extract_html(html_with_nav)
    assert "Vanatoarea de comori" in body
    assert "joc real de orientare" in body
    # chrome must be gone
    assert "tracking" not in body
    assert "Site Banner Junk" not in body
    assert "toate drepturile rezervate" not in body
    assert "Games" not in body


# --------------------------------------------------------------------------
# content hash + near-duplicate elimination
# --------------------------------------------------------------------------
def test_content_hash_ignores_whitespace():
    assert ec.content_hash("hello world") == ec.content_hash("hello world\n")
    assert ec.content_hash("hello world") != ec.content_hash("goodbye world")


def test_dedupe_exact_duplicates():
    items = [("a", "joc identic"), ("b", "joc identic"), ("c", "alt joc")]
    kept = ec.dedupe_texts(items)
    assert [k for k, _ in kept] == ["a", "c"]


def test_dedupe_near_duplicates():
    base = "Vanatoarea de comori este un joc de orientare pentru cercetasi in tabara."
    # near-duplicate: only a short suffix differs, so it must score above
    # the 85.0 threshold passed below
    near = base + " Pagina printata."
    items = [("orig", base), ("print", near), ("other", "Cu totul alt continut diferit aici.")]
    kept = ec.dedupe_texts(items, threshold=85.0)
    keys = [k for k, _ in kept]
    assert "orig" in keys
    assert "print" not in keys
    assert "other" in keys


# --------------------------------------------------------------------------
# zip recursion
# --------------------------------------------------------------------------
def test_zip_recurses_into_inner_files(sample_zip):
    body = ec.extract_zip(sample_zip)
    assert "Paragraf 0" in body
    assert ec.count_page_markers(body) > 0


def test_zip_bad_archive_returns_empty(tmp_path):
    bad = tmp_path / "broken.zip"
    bad.write_text("not a zip", encoding="utf-8")
    assert ec.extract_zip(bad) == ""


def test_nested_zip(tmp_path, sample_zip):
    outer = tmp_path / "outer.zip"
    with zipfile.ZipFile(outer, "w") as zf:
        zf.write(sample_zip, arcname="nested/archive.zip")
    body = ec.extract_zip(outer)
    assert "Paragraf 0" in body


# --------------------------------------------------------------------------
# preflight
# --------------------------------------------------------------------------
def test_preflight_python_packages_present():
    report = ec.preflight()
    # all required packages are installed in the test environment
    assert report["missing_python"] == []


def test_preflight_reports_libreoffice_state():
    report = ec.preflight()
    has_lo = bool(shutil.which("libreoffice") or shutil.which("soffice"))
    if has_lo:
        assert all("libreoffice" not in w for w in report["warnings"])
    else:
        assert any("libreoffice" in w for w in report["warnings"])


def test_preflight_ocr_flag():
    report = ec.preflight(check_ocr=True)
    if not shutil.which("tesseract"):
        assert any("tesseract" in m for m in report["missing_system"])


# --------------------------------------------------------------------------
# legacy .doc: skipped unless libreoffice is installed
# --------------------------------------------------------------------------
@pytest.mark.skipif(
    not (shutil.which("libreoffice") or shutil.which("soffice")),
    reason="libreoffice not installed",
)
def test_doc_conversion(tmp_path, sample_docx):
    doc_path = tmp_path / "legacy.doc"
    shutil.copy(sample_docx, doc_path)  # smoke test of the docx path
    body = ec.extract_doc(doc_path)
    assert ec.count_page_markers(body) >= 1


def test_doc_without_libreoffice_raises(tmp_path, monkeypatch):
    monkeypatch.setattr(ec.shutil, "which", lambda _: None)
    with pytest.raises(RuntimeError):
        ec.extract_doc(tmp_path / "whatever.doc")
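Reviewer note: the page-marker tests in this file depend on the `--- PAGE n ---` convention and on `join_pages` / `split_pages` / `count_page_markers` being inverses of each other. The module under test is not shown in this part of the diff, so the following is only a minimal sketch of how such helpers might behave, assuming one marker per line and 1-based page numbers. The function bodies are hypothetical and are not the project's `ec` implementation.

```python
import re

# Assumption: a marker occupies a whole line, exactly as asserted in the tests.
PAGE_MARKER = re.compile(r"^--- PAGE (\d+) ---$", re.MULTILINE)


def join_pages(pages):
    """Hypothetical helper: prefix each page text with a 1-based marker line."""
    return "\n".join(
        f"--- PAGE {number} ---\n{text}"
        for number, text in enumerate(pages, start=1)
    )


def split_pages(body):
    """Hypothetical helper: return [(page_number, text), ...]; [] if no markers."""
    matches = list(PAGE_MARKER.finditer(body))
    pages = []
    for index, match in enumerate(matches):
        # A page runs from the end of its marker to the start of the next one.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(body)
        pages.append((int(match.group(1)), body[match.end():end].strip("\n")))
    return pages


def count_page_markers(body):
    """Hypothetical helper: number of marker lines in the body."""
    return len(PAGE_MARKER.findall(body))
```

Under these assumptions a body with no marker lines splits to an empty list, which is the behaviour `test_split_pages_no_markers_returns_empty` checks for.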
139
tests/test_fts.py
Normal file
@@ -0,0 +1,139 @@
"""
Integration tests for the FTS5 search index.

Confirms that materials_list and skills_developed are indexed by FTS5 and kept
in sync by the insert / update / delete triggers (plan §6, §7).
"""

import os
import sys
import json

import pytest

# Make the project root importable when pytest is run from anywhere.
PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

from app.models.activity import Activity  # noqa: E402
from app.models.database import DatabaseManager  # noqa: E402


@pytest.fixture
def db(tmp_path):
    """A fresh DatabaseManager backed by a temporary SQLite file."""
    return DatabaseManager(str(tmp_path / "test_activities.db"))


def _make_activity(**overrides):
    base = dict(
        name="Vânătoarea de comori",
        description="O activitate de echipă în aer liber.",
        category="camp-outdoor",
        content_type="joc",
        source_file="test.txt",
        language="ro",
    )
    base.update(overrides)
    return Activity(**base)


def test_search_by_materials_list(db):
    """A term that only appears in materials_list returns the activity."""
    activity = _make_activity(materials_list="frânghie, eșarfă, busolă")
    db.insert_activity(activity)

    results = db.search_activities(search_text="busolă")
    assert len(results) == 1
    assert results[0]["name"] == "Vânătoarea de comori"


def test_search_by_skills_developed(db):
    """A term that only appears in skills_developed returns the activity."""
    activity = _make_activity(skills_developed="comunicare, leadership, rabdare")
    db.insert_activity(activity)

    results = db.search_activities(search_text="leadership")
    assert len(results) == 1
    assert results[0]["name"] == "Vânătoarea de comori"


def test_term_absent_from_indexed_columns_no_hit(db):
    """A term present in no indexed column yields no hit (control)."""
    db.insert_activity(_make_activity(materials_list="frânghie"))
    assert db.search_activities(search_text="zzzunlikelyterm") == []


def test_delete_trigger_removes_from_fts(db):
    """Deleting an activity removes it from the FTS index (delete trigger)."""
    activity = _make_activity(materials_list="catalige")
    activity_id = db.insert_activity(activity)
    assert len(db.search_activities(search_text="catalige")) == 1

    with db._get_connection() as conn:
        conn.execute("DELETE FROM activities WHERE id = ?", (activity_id,))
        conn.commit()

    assert db.search_activities(search_text="catalige") == []


def test_update_trigger_resyncs_fts(db):
    """Updating materials_list re-syncs the FTS index (update trigger)."""
    activity = _make_activity(materials_list="creioane")
    activity_id = db.insert_activity(activity)
    assert len(db.search_activities(search_text="creioane")) == 1

    with db._get_connection() as conn:
        conn.execute(
            "UPDATE activities SET materials_list = ? WHERE id = ?",
            ("acuarele", activity_id),
        )
        conn.commit()

    # Old term gone, new term found.
    assert db.search_activities(search_text="creioane") == []
    assert len(db.search_activities(search_text="acuarele")) == 1


def test_rebuild_fts_index(db):
    """rebuild_fts_index keeps materials_list / skills_developed searchable."""
    db.insert_activity(_make_activity(skills_developed="orientare"))
    db.rebuild_fts_index()
    assert len(db.search_activities(search_text="orientare")) == 1


def test_new_schema_columns_round_trip(db):
    """New activity columns persist and load back via from_dict."""
    activity = _make_activity(
        source_files=["a.txt", "b.txt"],
        source_excerpt="Citat scurt din sursă.",
        extraction_confidence="high",
        needs_review=1,
        normalized_name="vanatoarea de comori",
    )
    activity_id = db.insert_activity(activity)

    row = db.get_activity_by_id(activity_id)
    assert row["content_type"] == "joc"
    assert row["language"] == "ro"
    assert row["extraction_confidence"] == "high"
    assert row["needs_review"] == 1
    assert row["normalized_name"] == "vanatoarea de comori"
    assert json.loads(row["source_files"]) == ["a.txt", "b.txt"]
    assert row["source_excerpt"] == "Citat scurt din sursă."

    loaded = Activity.from_dict(row)
    assert loaded.source_files == ["a.txt", "b.txt"]
    assert loaded.content_type == "joc"


def test_normalized_name_auto_derived(db):
    """normalized_name is auto-derived from name when not provided."""
    activity = Activity(
        name="Ștafetă cu Obstacole",
        description="desc",
        category="sports-active",
        source_file="t.txt",
    )
    assert activity.normalized_name == "stafeta cu obstacole"
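Reviewer note: the trigger tests in `tests/test_fts.py` assume the FTS5 index follows every insert, update and delete on `activities`. The schema itself is not part of this section, so the sketch below only illustrates the general SQLite pattern (an external-content FTS5 table plus three triggers) under assumed names. The table `activities_fts`, the trigger names, the reduced column set and the `open_index` / `search` helpers are all hypothetical, and the sketch requires an SQLite build with FTS5 enabled.

```python
import sqlite3

# Hypothetical, reduced schema: the real DatabaseManager schema is not shown here.
SCHEMA = """
CREATE TABLE activities (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    materials_list TEXT,
    skills_developed TEXT
);
CREATE VIRTUAL TABLE activities_fts USING fts5(
    name, materials_list, skills_developed,
    content='activities', content_rowid='id'
);
CREATE TRIGGER activities_ai AFTER INSERT ON activities BEGIN
    INSERT INTO activities_fts(rowid, name, materials_list, skills_developed)
    VALUES (new.id, new.name, new.materials_list, new.skills_developed);
END;
CREATE TRIGGER activities_ad AFTER DELETE ON activities BEGIN
    INSERT INTO activities_fts(activities_fts, rowid, name, materials_list, skills_developed)
    VALUES ('delete', old.id, old.name, old.materials_list, old.skills_developed);
END;
CREATE TRIGGER activities_au AFTER UPDATE ON activities BEGIN
    INSERT INTO activities_fts(activities_fts, rowid, name, materials_list, skills_developed)
    VALUES ('delete', old.id, old.name, old.materials_list, old.skills_developed);
    INSERT INTO activities_fts(rowid, name, materials_list, skills_developed)
    VALUES (new.id, new.name, new.materials_list, new.skills_developed);
END;
"""


def open_index():
    """Create an in-memory database with the sketch schema and triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def search(conn, term):
    """Return the names of activities whose indexed columns match the term."""
    rows = conn.execute(
        "SELECT name FROM activities WHERE id IN "
        "(SELECT rowid FROM activities_fts WHERE activities_fts MATCH ?) "
        "ORDER BY id",
        (term,),
    )
    return [row[0] for row in rows]
```

With an external-content table the update trigger has to issue the special `'delete'` insert with the old column values before inserting the new ones, otherwise the old term can remain in the index; that is the failure `test_update_trigger_resyncs_fts` would catch.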
140
tests/test_search.py
Normal file
@@ -0,0 +1,140 @@
# -*- coding: utf-8 -*-
"""
CRITICAL REGRESSION TEST (plan §6, §7).

`search.py` changed the result sets of /search and /api/search: the default
search now EXCLUDES the non-game content types (rețetă / cântec / ceremonie),
which surface only when the user explicitly filters that content_type or picks
a non-game category. This test guards that behaviour.
"""

import pytest

from app.models.activity import Activity
from app.models.database import DatabaseManager
from app.services.search import SearchService
from app.config_taxonomy import NON_GAME_CONTENT_TYPES


# --------------------------------------------------------------------------
# fixtures
# --------------------------------------------------------------------------
def _activity(name, content_type, category="altele", language="ro"):
    return Activity(
        name=name,
        description=f"Descriere pentru {name}, un conținut de tip {content_type}.",
        category=category,
        content_type=content_type,
        language=language,
        source_file="test/fixture.txt",
    )


@pytest.fixture
def search_service(tmp_path):
    """A SearchService over a temp DB seeded with one row per content_type."""
    db = DatabaseManager(str(tmp_path / "activities.db"))
    db.clear_database()
    db.bulk_insert_activities([
        _activity("Vanatoarea de comori", "joc", category="wide-games"),
        _activity("Cercul de cunoastere", "activitate", category="icebreakers"),
        _activity("Reteta de paine la ceaun", "reteta", category="retete"),
        _activity("Cantecul de tabara", "cantec", category="cantece-ceremonii"),
        _activity("Ceremonia de inchidere", "ceremonie", category="cantece-ceremonii"),
        _activity("Game in English", "joc", category="wide-games", language="en"),
    ])
    return SearchService(db)


def _content_types(results):
    return {r.get("content_type") for r in results}


# --------------------------------------------------------------------------
# the regression: default search excludes non-game content types
# --------------------------------------------------------------------------
def test_default_search_excludes_non_game_content(search_service):
    """No filters → rețete / cântece / ceremonii must NOT appear."""
    results = search_service.search_activities()
    types = _content_types(results)

    assert types, "default search returned nothing"
    for non_game in NON_GAME_CONTENT_TYPES:
        assert non_game not in types, (
            f"default search leaked non-game content_type '{non_game}'"
        )
    # game content is still present
    assert "joc" in types
    assert "activitate" in types


def test_default_search_with_text_excludes_non_game(search_service):
    """A text query still excludes non-game content by default."""
    results = search_service.search_activities(search_text="conținut")
    assert NON_GAME_CONTENT_TYPES[0] not in _content_types(results)


# --------------------------------------------------------------------------
# explicit content_type filter INCLUDES the non-game rows
# --------------------------------------------------------------------------
def test_explicit_content_type_filter_includes_non_game(search_service):
    """Filtering content_type=reteta returns exactly the rețete."""
    results = search_service.search_activities(filters={"content_type": "reteta"})
    types = _content_types(results)

    assert types == {"reteta"}, f"expected only rețete, got {types}"
    assert len(results) == 1


def test_explicit_content_type_filter_for_cantec(search_service):
    results = search_service.search_activities(filters={"content_type": "cantec"})
    assert _content_types(results) == {"cantec"}


# --------------------------------------------------------------------------
# a non-game CATEGORY filter also lifts the exclusion
# --------------------------------------------------------------------------
def test_non_game_category_filter_includes_non_game(search_service):
    """Picking category=cantece-ceremonii surfaces cântece + ceremonii."""
    results = search_service.search_activities(
        filters={"category": "cantece-ceremonii"})
    types = _content_types(results)

    assert "cantec" in types
    assert "ceremonie" in types


def test_game_category_filter_still_excludes_non_game(search_service):
    """A normal (game) category filter keeps the non-game exclusion."""
    results = search_service.search_activities(filters={"category": "wide-games"})
    types = _content_types(results)
    for non_game in NON_GAME_CONTENT_TYPES:
        assert non_game not in types


# --------------------------------------------------------------------------
# language filter
# --------------------------------------------------------------------------
def test_language_filter_ro(search_service):
    results = search_service.search_activities(filters={"language": "ro"})
    assert results
    assert all(r.get("language") == "ro" for r in results)


def test_language_filter_en(search_service):
    results = search_service.search_activities(filters={"language": "en"})
    assert results
    assert all(r.get("language") == "en" for r in results)
    assert {r.get("name") for r in results} == {"Game in English"}


# --------------------------------------------------------------------------
# get_filter_options surfaces the new axes
# --------------------------------------------------------------------------
def test_filter_options_include_content_type_and_language(search_service):
    """The dynamic-filter mechanism now exposes content_type + language."""
    options = search_service.db.get_filter_options()
    assert "content_type" in options
    assert "language" in options
    assert "joc" in options["content_type"]
    assert set(options["language"]) == {"ro", "en"}
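Reviewer note: the rule that `tests/test_search.py` guards can be stated compactly: non-game rows are hidden unless the caller filters on `content_type` explicitly or picks a non-game category. The sketch below restates that rule as a pure function over dict rows, as an illustration only. `filter_results` and `NON_GAME_CATEGORIES` are hypothetical names, and the tuple values are assumptions read off the fixtures in the test file, not the contents of `app.config_taxonomy` or the logic in `SearchService`.

```python
# Assumed values, taken from the fixtures in the test file; the real constants
# live in app.config_taxonomy and may differ.
NON_GAME_CONTENT_TYPES = ("reteta", "cantec", "ceremonie")
NON_GAME_CATEGORIES = ("retete", "cantece-ceremonii")


def filter_results(rows, filters=None):
    """Hypothetical restatement of the default-exclusion rule.

    rows    -- list of dicts with at least 'content_type' and 'category'
    filters -- optional dict of exact-match column filters
    """
    filters = filters or {}
    # The exclusion is lifted by an explicit content_type filter or by a
    # non-game category filter.
    lift_exclusion = (
        "content_type" in filters
        or filters.get("category") in NON_GAME_CATEGORIES
    )
    kept = []
    for row in rows:
        if any(row.get(column) != value for column, value in filters.items()):
            continue  # fails an explicit filter
        if not lift_exclusion and row.get("content_type") in NON_GAME_CONTENT_TYPES:
            continue  # hidden by default
        kept.append(row)
    return kept
```

A game category such as `wide-games` does not lift the exclusion, which mirrors `test_game_category_filter_still_excludes_non_game`.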
156
tests/test_validate_extractions.py
Normal file
@@ -0,0 +1,156 @@
# -*- coding: utf-8 -*-
"""
Tests for scripts/validate_extractions.py.

Covers: schema rejection, the source_excerpt hallucination check, the content
of the generated re-extraction prompt, and the manifest `rejected` marking.
"""

import json
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent.parent
SCRIPTS_DIR = REPO_ROOT / "scripts"
for _p in (str(REPO_ROOT), str(SCRIPTS_DIR)):
    if _p not in sys.path:
        sys.path.insert(0, _p)

import validate_extractions as ve  # noqa: E402


# --------------------------------------------------------------------------
# helpers
# --------------------------------------------------------------------------
def _ext_activity(**over):
    base = dict(
        name="Jocul testului",
        description="O activitate de echipa in aer liber.",
        category="team-building",
        content_type="joc",
        language="ro",
        extraction_confidence="high",
        source_excerpt="ancora din bucata sursa",
        page_reference="page 1",
    )
    base.update(over)
    return base


def _write_extraction(extracted_dir, chunk_key, activities, header_extra=None):
    extracted_dir.mkdir(parents=True, exist_ok=True)
    header = {
        "source_hash": "hash1234deadbeef",
        "schema_version": "1.0",
        "prompt_version": "1.0",
        "chunk_range": "pages 1-20",
        "source_id": "src01",
        "chunk_key": chunk_key,
    }
    if header_extra:
        header.update(header_extra)
    payload = {"header": header, "activities": activities}
    (extracted_dir / f"{chunk_key}.json").write_text(
        json.dumps(payload, ensure_ascii=False), encoding="utf-8"
    )


def _write_chunk(chunks_dir, source_id, chunk_key, text):
    d = chunks_dir / source_id
    d.mkdir(parents=True, exist_ok=True)
    (d / f"{chunk_key}.txt").write_text(text, encoding="utf-8")


# --------------------------------------------------------------------------
# tests
# --------------------------------------------------------------------------
def test_valid_file_passes(tmp_path):
    extracted = tmp_path / "extracted"
    chunks = tmp_path / "chunks"
    excerpt = "ancora din bucata sursa apare aici"
    _write_extraction(extracted, "src01.part01", [_ext_activity(source_excerpt=excerpt)])
    _write_chunk(chunks, "src01", "src01.part01", f"--- PAGE 1 ---\n{excerpt}\n")

    report = ve.run(extracted, chunks, tmp_path / "manifest.json")
    assert report["valid"] == 1
    assert report["rejected"] == 0


def test_schema_invalid_file_rejected(tmp_path):
    extracted = tmp_path / "extracted"
    chunks = tmp_path / "chunks"
    extracted.mkdir(parents=True)
    (extracted / "src01.part01.json").write_text(
        json.dumps({"header": {}, "activities": [{"name": "x"}]}), encoding="utf-8"
    )

    report = ve.run(extracted, chunks, tmp_path / "manifest.json")
    assert report["rejected"] == 1
    prompt = extracted / "_reextract" / "src01.part01.prompt.md"
    assert prompt.exists()


def test_hallucinated_excerpt_rejected(tmp_path):
    extracted = tmp_path / "extracted"
    chunks = tmp_path / "chunks"
    _write_extraction(
        extracted, "src01.part01",
        [_ext_activity(source_excerpt="citat complet inventat care nu exista qqqq")],
    )
    _write_chunk(chunks, "src01", "src01.part01",
                 "--- PAGE 1 ---\ntext complet diferit despre altceva.\n")

    report = ve.run(extracted, chunks, tmp_path / "manifest.json")
    assert report["rejected"] == 1
    errors = report["rejected_chunks"][0]["errors"]
    assert any("hallucination" in e for e in errors)


def test_reextraction_prompt_content(tmp_path):
    extracted = tmp_path / "extracted"
    chunks = tmp_path / "chunks"
    _write_extraction(
        extracted, "src01.part01",
        [_ext_activity(source_excerpt="citat inventat care nu exista zzzz")],
    )
    _write_chunk(chunks, "src01", "src01.part01",
                 "--- PAGE 1 ---\ntext despre cu totul altceva aici.\n")

    ve.run(extracted, chunks, tmp_path / "manifest.json")
    prompt = (extracted / "_reextract" / "src01.part01.prompt.md").read_text(
        encoding="utf-8"
    )
    assert "src01.part01" in prompt
    assert "REJECTED" in prompt
    assert "verbatim" in prompt
    assert "data/extracted/src01.part01.json" in prompt


def test_manifest_marks_chunk_rejected(tmp_path):
    extracted = tmp_path / "extracted"
    chunks = tmp_path / "chunks"
    manifest_path = tmp_path / "manifest.json"
    manifest_path.write_text(
        json.dumps({"chunks": {"src01.part01": {"state": "done",
                                                "chunk_file": "chunks/src01/src01.part01.txt"}}}),
        encoding="utf-8",
    )
    _write_extraction(
        extracted, "src01.part01",
        [_ext_activity(source_excerpt="citat fabricat absent vvvv")],
    )
    _write_chunk(chunks, "src01", "src01.part01",
                 "--- PAGE 1 ---\nun continut neinrudit.\n")

    ve.run(extracted, chunks, manifest_path)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    assert manifest["chunks"]["src01.part01"]["state"] == "rejected"


def test_build_reextraction_prompt_lists_errors():
    prompt = ve.build_reextraction_prompt(
        "abc.part03", "data/chunks/abc/abc.part03.txt",
        ["header: 'source_hash' is a required property"],
    )
    assert "abc.part03" in prompt
    assert "source_hash" in prompt
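Reviewer note: the hallucination tests in `tests/test_validate_extractions.py` treat a `source_excerpt` as trustworthy only when it can be found in the chunk it claims to come from. How `validate_extractions.py` performs that comparison is not visible in this section, so the sketch below shows just one plausible approach, assuming a whitespace-insensitive and case-insensitive substring match. `excerpt_in_chunk` and `_normalize` are hypothetical names, and the real script may well use a fuzzy match instead.

```python
import re


def _normalize(text):
    """Collapse whitespace runs and lowercase, so line wraps do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def excerpt_in_chunk(excerpt, chunk_text):
    """Hypothetical check: True when the excerpt occurs verbatim in the chunk.

    An empty excerpt is treated as unverifiable and therefore fails.
    """
    needle = _normalize(excerpt)
    return bool(needle) and needle in _normalize(chunk_text)
```

Normalizing both sides keeps an excerpt that was re-wrapped across lines from being flagged, while an excerpt that shares no passage with the chunk still fails.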