diff --git a/HANDOFF.md b/HANDOFF.md index 61e0cda..429e708 100644 --- a/HANDOFF.md +++ b/HANDOFF.md @@ -1,135 +1,143 @@ -# HANDOFF — Faza 1 extraction in progress +# HANDOFF — Faza 1 extraction DONE, code landed, DB frozen; next = re-extract 7 + enrichment pilot -**Last updated:** 2026-05-20, commit `3d9f266` (pilot complete) + uncommitted Faza 1 work. +**Snapshot:** 2026-05-29. Executing plan `enumerated-petting-badger.md` (bilingual +index + enrichment + new filters + source download). -## State of play +## Where we are -Faza 0 (pilot) is **complete and committed**. Faza 1 (full corpus) is **in progress at 10.9%**. +| Step (plan Part C) | Status | +|--------------------|--------| +| 1. Finish extraction | **DONE** — 575/588 chunks extracted & valid; 13 missing (see below) | +| 2. Land code Part A1–A4 (model/schema/merge) | **DONE & committed** | +| 2b. Code Part A5–A8 (UI/search/download) | **DONE & committed** | +| 2c. Code Part B2–B4 (enrichment pipeline) | **DONE & committed** | +| 3. Freeze rebuild (freezes content_keys) | **DONE** — `data/activities.db` = **9418 activities** | +| Part D tests | **DONE** — `tests/test_enrichment.py`, 99 pass total | +| 4. Enrichment pilot → **STOP for user sign-off** | **NOT STARTED** (this is the gate) | +| 5. Final rebuild `--enrichment` | not started | -| Phase | Status | DB rows | Tests | -|-------|--------|---------|-------| -| Faza 0 pilot (5 files) | committed (`3d9f266`) | 1751 in `data/activities.db` | 71 passed | -| Faza 1 extraction | **64/588 chunks done**, 1949 activities in `data/extracted/*.json` (not yet imported to DB) | — | — | +Everything is committed except whatever this session leaves dirty. `data/extracted/*.json` +is gitignored (575 files on disk, durable across /clear). -## What "Faza 1" is +## The 13 missing chunks (out of 588) -Process the full 96-source corpus (was 116 raw files; some are duplicates/empty zips/skipped junk) through the LLM-subagent extraction pipeline. Same code path as the pilot. 
Two large mirror dirs dominate the chunk count: - -- `87850302_dragon_sleepdeprived` — 116 chunks (full dragon.sleepdeprived.ca mirror) -- `c3162825_resource_pack__learning_by_playing_catalunya_...` — 97 chunks (the catalunya mirror) - -Combined they are 213/588 = 36% of all remaining chunks. - -## Critical recommendation for the next session - -**Use Sonnet 4.6 for subagent extractions, not Opus.** Opus was used through chunks 1-64 and burned through a 5-hour rate-limit window faster than this scale needs. Sonnet has 200K context which is plenty for the ~25k-token prompt + ~20k-token output of a single chunk extraction. The task is structured JSON extraction with a fixed schema — no complex reasoning needed. - -The Agent tool's `model` parameter takes `"sonnet"`. Pass it on every Agent call below. - -## Resuming — exact mechanical steps - -1. **Verify state.** - ```bash - cd /workspace/game-library - ls data/extracted/*.json | wc -l # should be 64 (or higher if more done) - ls data/chunks/_prompts/ | wc -l # 588 — the full prompt set - git log --oneline -3 # 3d9f266 must be HEAD or earlier - ``` - -2. **Find what's still pending.** Compare prompts to extracted files: - ```bash - ls data/chunks/_prompts/ | sed 's/\.prompt\.md$//' | sort > /tmp/all.txt - ls data/extracted/*.json 2>/dev/null | sed 's|.*/||;s|\.json$||' | sort > /tmp/done.txt - comm -23 /tmp/all.txt /tmp/done.txt > /tmp/pending.txt - wc -l /tmp/pending.txt # how many remain - head /tmp/pending.txt # what's next - ``` - -3. **Launch waves of 16 subagents in parallel.** One Agent call per chunk. Single message with 16 Agent tool calls. Use this exact template (substitute ``): - - ``` - Agent( - description: "Extract ", - subagent_type: "general-purpose", - model: "sonnet", ← critical - prompt: "Working directory: /workspace/game-library. Extraction subagent — read data/chunks/_prompts/.prompt.md and follow it EXACTLY. 
Apply rules from scripts/SUBAGENT_PROMPT.md and schema from scripts/activity_schema.json. Write the JSON. Set language per chunk content ('ro' or 'en'). Report under 40 words." - ) - ``` - - The per-chunk prompt file is fully self-contained — it points to the right chunk, sets source_id/source_hash/chunk_key, and references the rules + schema. The subagent just follows it. - -4. **After every wave**, briefly check progress and continue: - ```bash - ls data/extracted/*.json | wc -l - ``` - Repeat step 3 with the next 16 pending chunks. If an agent reports `"You've hit your limit · resets ..."` AND `tool_uses: 5` with `total_tokens: 0`, check whether the JSON was written anyway — agents often persist the file before the limit hit. Only re-launch if the JSON is missing. - -5. **When all 588 chunks are done**, finalize: - ```bash - python3 scripts/validate_extractions.py # any chunks marked rejected go to data/extracted/_reextract/ - # re-extract any rejected chunks (same template, prompt from _reextract/) - python3 scripts/build_database.py --rebuild - # if many borderline needs_review rows: - python3 -c " - import sys; sys.path.insert(0,'scripts') - from import_common import content_key, normalize_name - import sqlite3, json - conn = sqlite3.connect('data/activities.db') - conn.row_factory = sqlite3.Row - rows = list(conn.execute('SELECT name, normalized_name, language, description FROM activities WHERE needs_review=1')) - d = {content_key(r['normalized_name'] or normalize_name(r['name']), r['language'], r['description'] or ''): 'merge' for r in rows} - json.dump(d, open('data/review_decisions.json','w'), indent=2) - print(f'{len(d)} merge decisions') - " - python3 scripts/build_database.py --rebuild # apply decisions - python3 -m pytest tests/ -q # 71 should pass - git add data/activities.db data/review_decisions.json - git commit -m "Faza 1: full corpus extraction" - ``` - -## Code reference — what each script does - -- `scripts/normalize_sources.py --corpus 
data/carti-camp-jocuri --out data/sources` → produces 96 `data/sources/.txt` files with `--- PAGE N ---` markers. **Done. Don't re-run.** -- `scripts/chunk_sources.py --sources data/sources --chunks data/chunks` → splits each into ~20pg chunks with 4pg overlap, writes `data/chunks//.partNN.txt` and updates `data/chunks/manifest.json`. **Done. Don't re-run unless sources change.** -- `scripts/run_extraction.py` → regenerates the per-chunk prompts in `data/chunks/_prompts/`. **Done. Don't re-run unless schema/prompt changes.** -- `scripts/SUBAGENT_PROMPT.md` — extraction rules (what subagents follow). -- `scripts/activity_schema.json` — JSON schema each extraction must validate against. -- `scripts/validate_extractions.py` — per-file schema check + fuzzy `source_excerpt` substring check; writes re-extraction prompts to `data/extracted/_reextract/` for rejected chunks; marks chunks `rejected` in manifest. -- `scripts/build_database.py --rebuild` — validates every `data/extracted/*.json` against schema, drops per-activity hallucinations, dedup, applies `data/review_decisions.json`, atomic swap into `data/activities.db`. -- `scripts/review_queue.py list|resolve ` — CLI for borderline-dedup decisions; persisted in `data/review_decisions.json`. - -## Pilot lessons that apply - -- **~1.07% hallucinated drops** at pilot scale (well below the 2% threshold). Caused by source_excerpts straddling `--- PAGE N ---` markers. Re-extraction with verbatim within-page quotes fixed all 13 affected chunks. Expect similar rate at Faza 1 scale (~10-30 chunks may need re-extraction). -- **Borderline dedup queue** (369 rows in pilot) — same-name activities re-extracted from chunk overlap with slightly-different prose. Bulk-merge is the right call: same normalized_name + same language + 60-85% desc similarity → merge takes the longest fields. Use the snippet in step 5 above. 
-- **OCR not needed.** The candidate scanned PDF (`07.Cartea_Mare`) extracted 151 pages of real text via pdfplumber alone. Skip OCR for v1. - -## Files not yet committed (uncommitted in this session) - -- `data/sources/` — all 96 normalized `.txt` files (in `.gitignore`, don't try to commit them) -- `data/chunks/` — all 588 chunks + manifest (in `.gitignore`) -- `data/extracted/` — 64 JSON files so far (in `.gitignore`) -- `data/activities.db` — **still the pilot's 1751-row DB.** Will be rebuilt after Faza 1 finishes. - -The schema, all scripts, all tests, and the pilot DB are already committed at `3d9f266`. No code changes are needed for Faza 1 — just data. - -## Status snapshot (as of handoff) +**6 content-filter-blocked** (Anthropic safety; accept as missing — marginal loss): +- `87850302_dragon_sleepdeprived.part73 / .part85 / .part94` (camp song lyrics) +- `c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96` +**7 need RE-EXTRACTION** (their malformed-original JSON was destroyed — see "json_repair +incident" below; re-extract once the subagent session limit resets, ~5pm UTC): ``` -chunks done : 64 / 588 (10.9%) -activities so far : 1949 -remaining chunks : 524 -largest pending sources: - 87850302_dragon_sleepdeprived 116 chunks (full dragon mirror) - c3162825_resource_pack__learning_by_playing 97 chunks (catalunya mirror) - 4da6431e_cub_scout_leader_how_to_book 18 - 4a765782_1000_fantastic_scout_games 18 (re-extract; was in pilot) - bee67427_the_big_book_of_conflict_resolution 15 - e3bd0953_1001_idei_pentru_o_educatie_timp 14 - d5e51389_09_culegere_de_jocuri_si_povestiri 13 - ce4b48f1_impact_culegere_de_jocuri_si_povest 13 - 193fdd94_ghid_de_integrare_a_persoanelor_vul 12 (in progress) - 779f4fa0_ghidul_animatorului_855_de_jocuri 11 +3f9c8232_teambuilding_corbu_29092023.part01 +5f959f85_scoli_fara_bullying.part02 +83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04 
+d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01 +d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04 +d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05 +e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03 ``` +Re-extract these (Sonnet subagents, one Agent call each, the per-chunk prompt is at +`data/chunks/_prompts/.prompt.md`), then **re-run the freeze rebuild** so they join +the corpus before enrichment. Re-freezing is safe now — enrichment has NOT run, so no +overlay keys depend on the current freeze yet. -In a fresh session: `cat HANDOFF.md`, then go straight to step 3 above. +## The json_repair incident (important — root cause + what was fixed) + +Subagents **systematically emit unescaped ASCII `"` inside string values** (Romanian +text like `„Unu"` uses a closing `"` that terminates the JSON string early). ~34 files +were affected. + +First repair attempt used the `json_repair` lib. **It truncates**: on a stray quote it +ends the string and reinterprets the trailing text as a new key, silently dropping the +rest of the value and injecting garbage keys. Schema `additionalProperties:false` caught +the garbage-key cases (8 files dropped at rebuild), but the truncation that didn't create +an extra key slipped through. Applying json_repair output to disk also **overwrote the +malformed originals** for those 8 → originals lost → those (now 7, one recovered) need +re-extraction. + +**Fix:** `scripts/repair_extractions.py` was rewritten to use a faithful char-scanner +(`escape_stray_quotes`) that **escapes** stray quotes (`\"`) instead of splitting on them, +validates against the real schema, and only replaces a valid top-level file when the +repaired version carries **strictly more text** (a length guard that catches truncated +json_repair output while leaving genuine extractions untouched). 
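A minimal sketch of the escape-instead-of-split idea described above. It is not the code in `scripts/repair_extractions.py`: the function body and the closing-quote heuristic are assumptions made for illustration (a quote is taken to close a string only when the next non-whitespace character is `,`, `}`, `]`, `:` or the input ends).

```python
import json


def escape_stray_quotes(raw: str) -> str:
    """Escape double quotes that sit inside a JSON string value.

    Hypothetical heuristic, not the project's actual rule: a quote closes the
    current string only if the next non-whitespace character is structural
    (one of , } ] :) or the input ends. Any other quote is treated as stray
    and emitted as \\" so the rest of the value is kept.
    """
    out = []
    in_string = False
    i = 0
    n = len(raw)
    while i < n:
        ch = raw[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
        elif ch == "\\" and i + 1 < n:
            # Copy an existing escape sequence through untouched.
            out.append(ch)
            out.append(raw[i + 1])
            i += 1
        elif ch == '"':
            # Look ahead past whitespace to decide: closing quote or stray?
            j = i + 1
            while j < n and raw[j] in " \t\r\n":
                j += 1
            if j == n or raw[j] in ",}]:":
                in_string = False
                out.append(ch)
            else:
                out.append('\\"')
        else:
            out.append(ch)
        i += 1
    return "".join(out)


# Romanian opening quote „ paired with an ASCII closing quote, as in the incident.
broken = '{"name": "Jocul „Unu" și „Doi" pentru copii", "language": "ro"}'
fixed = json.loads(escape_stray_quotes(broken))
```

The heuristic can still misjudge a stray quote that happens to be followed by a comma or colon, which is one reason to keep a schema check and a length guard around any repair step like this.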
Re-running it cleanly +repaired the affected files; the final freeze had **0 schema-rejected, 0 invalid**. +`json_repair` is no longer used anywhere. Do NOT reintroduce it. + +`build_database.py` does NOT depend on the repair script (the "DB regenerable from +data/extracted/" invariant holds — plain `json.loads` only). + +## What the code does now (all committed) + +**Part A — plumbing (corpus-independent):** +- `app/models/database.py`: new columns `name_ro/description_ro/rules_ro/variations_ro, + indoor_outdoor, space_needed, estimated_fields(JSON), source_id, source_ids(JSON), + chunk_key`; FTS5 indexes the 4 `*_ro` columns (CREATE + all 3 triggers — kept in sync); + indexes on `indoor_outdoor`/`space_needed`; `search_activities` gained `indoor_outdoor` + and `space_needed` equality kwargs; `_update_category_counts` feeds both new axes into + the categories table so dropdowns populate. +- `app/models/activity.py`: new fields + `to_dict`/`from_dict`; helpers `get_display_name` + / `get_display_description` / `get_display_rules` / `get_display_variations` + (RO-primary, EN fallback), `has_translation`, `is_estimated(field)`, + `get_indoor_outdoor_display`, `get_space_needed_display`. +- `app/config_taxonomy.py`: `INDOOR_OUTDOOR`, `SPACE_NEEDED` enums + RO labels + + `normalize_indoor_outdoor` / `normalize_space_needed` (None on unrecognised, no + fallback — never fabricate a value) + display-name helpers. +- `scripts/build_database.py`: `dict_to_activity` sets `source_id`+`chunk_key`; + `merge_cluster` unions `source_ids` and carries rep's `source_id`/`chunk_key` but + **never** touches enrichment fields (those are applied post-dedup). + +**Part A — UI/search:** +- `app/services/search.py`: `_map_filters_to_db_fields` maps `indoor_outdoor`/ + `space_needed` to DB equality filters. 
+- `app/web/routes.py`: new `/source/` download route — **shipped DARK behind + `SOURCE_DOWNLOAD_ENABLED` (default false; copyright exposure, user opts in)**; resolves + `source_file` under `CORPUS_DIR` via `send_from_directory` (traversal-safe, 404s for + web-mirror sources). `DISPLAY_NAMES` extended with both new axes. +- `app/config.py`: `SOURCE_DOWNLOAD_ENABLED`, `CORPUS_DIR`. +- Templates: `index.html`/`results.html` have the 2 new dropdowns; cards use display + helpers + `(estimat)` markers; `activity.html` is RO-primary with a collapsible + "Text original" section, indoor/space cards, estimat markers, and the download link + (only when the flag is on). `main.css` has `.estimated` / `.original-text` styles. + +**Part B — enrichment pipeline (built, not yet run):** +- `scripts/build_database.py`: `load_enrichment` + `apply_enrichment(activities, enrichment)` + applied **right after** `apply_review_decisions`, on the post-dedup list, keyed on + `import_common.content_key(normalized_name, language, _normalize_text(description))` + (reused verbatim). CLI `--enrichment` (default `data/enrichment.json`). QA report prints + `enrichment {entries, matched, orphaned}` + per-field **stated vs estimated** counts. + Translated/expanded text is NOT re-validated against source (by design). +- `scripts/run_enrichment.py`: reads the rebuilt DB, computes each row's content_key, + skips rows already in `data/enrichment_parts/.json` (resumable), emits one prompt + per activity to `data/enrichment_prompts/` (current EN fields + source chunk text via + `find_chunk_text`). Pilot scoping: `--source ` and/or `--limit N`. `--collect` + merges parts → `data/enrichment.json`. +- `scripts/ENRICHMENT_PROMPT.md`: single-pass rules — translate faithfully, expand + `description_ro` ONLY from chunk text, mark inferred filter fields in `estimated_fields`, + fixed enum vocab, output `data/enrichment_parts/.json` including `content_key`. + +## Exact next steps + +1. 
**Re-extract the 7 chunks** above (after session-limit reset). Verify each writes valid + JSON (`python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"`). + If any come back malformed, `python3 scripts/repair_extractions.py --apply` (faithful now). +2. **Re-freeze:** `python3 scripts/build_database.py --rebuild` — confirm 0 schema-rejected, + note the new total (~9418 + the 7 chunks' activities). +3. **Enrichment PILOT** (plan B5 — the STOP gate guarding 6–8k LLM calls): + - Pick one source, e.g. `python3 scripts/run_enrichment.py --source teambuilding_corbu` + (or `--limit 30`). This writes prompts to `data/enrichment_prompts/`. + - Launch a small wave of Sonnet subagents on those prompts (each writes + `data/enrichment_parts/.json`). + - `python3 scripts/run_enrichment.py --collect` → `data/enrichment.json`. + - `python3 scripts/build_database.py --rebuild` (picks up `--enrichment` by default). + - **STOP. Hand the user translation-quality + estimation-plausibility + description- + fidelity samples and get sign-off BEFORE scaling to the full corpus.** Do not + auto-proceed past this gate. +4. After sign-off: scale enrichment in waves of ~8–16 Sonnet subagents, `--collect`, + final `--rebuild --enrichment`. + +## Verify / run + +- Tests: `python3 -m pytest tests/ -q` → 99 pass. +- App: `SOURCE_DOWNLOAD_ENABLED` is false by default (download link hidden). Set it true + only if the user accepts the copyright exposure of serving original files. +- `data/activities.db.bak` is the pre-this-freeze backup. diff --git a/app/config.py b/app/config.py index 942a4be..bd384a3 100644 --- a/app/config.py +++ b/app/config.py @@ -22,6 +22,18 @@ class Config: # Search settings SEARCH_RESULTS_LIMIT = int(os.environ.get('SEARCH_RESULTS_LIMIT', '100')) FTS_ENABLED = True + + # Source-file download (plan A6). Shipped DARK by default: serving the + # original PDFs/books carries a copyright exposure the user must opt into. 
+ # The /source/ route 404s entirely while this is false; the UI hides + # the download link. Enable with SOURCE_DOWNLOAD_ENABLED=true. + SOURCE_DOWNLOAD_ENABLED = ( + os.environ.get('SOURCE_DOWNLOAD_ENABLED', 'false').lower() == 'true' + ) + # Root of the original corpus. source_file values are relative to this. + CORPUS_DIR = os.environ.get('CORPUS_DIR') or str( + Path(__file__).parent.parent / 'data' / 'carti-camp-jocuri' + ) @staticmethod def ensure_directories(): diff --git a/app/config_taxonomy.py b/app/config_taxonomy.py index 2e8db25..efaacf5 100644 --- a/app/config_taxonomy.py +++ b/app/config_taxonomy.py @@ -8,7 +8,7 @@ the UI displays the Romanian name. `category` (thematic domain) and import unicodedata import re -from typing import Dict, List +from typing import Dict, List, Optional # --- Categories (thematic domain) -------------------------------------------- # slug -> Romanian display name. ~16 fixed slugs; `altele` is the mandatory @@ -215,6 +215,89 @@ def normalize_content_type(value: str) -> str: return aliases.get(slug, DEFAULT_CONTENT_TYPE) +# --- Indoor / outdoor (enrichment axis) -------------------------------------- +# Where the activity is run. Inferred during enrichment when the source is +# silent — such inferences are flagged in `estimated_fields`. slug -> RO label. +INDOOR_OUTDOOR: Dict[str, str] = { + "indoor": "Interior", + "outdoor": "Exterior", + "either": "Interior sau exterior", +} + +# --- Space needed (enrichment axis) ------------------------------------------ +# Rough footprint the activity requires. slug -> RO label. +SPACE_NEEDED: Dict[str, str] = { + "mic": "Spațiu mic", + "mediu": "Spațiu mediu", + "mare": "Spațiu mare", +} + +# Aliases for robustness against LLM output variation. Keys are _slugify'd. 
+_INDOOR_OUTDOOR_ALIASES: Dict[str, str] = { + "interior": "indoor", + "inside": "indoor", + "in": "indoor", + "exterior": "outdoor", + "outside": "outdoor", + "out": "outdoor", + "aer-liber": "outdoor", + "both": "either", + "any": "either", + "ambele": "either", + "interior-exterior": "either", + "indoor-outdoor": "either", +} + +_SPACE_NEEDED_ALIASES: Dict[str, str] = { + "small": "mic", + "redus": "mic", + "putin": "mic", + "medium": "mediu", + "moderat": "mediu", + "large": "mare", + "big": "mare", + "mult": "mare", + "spatiu-mic": "mic", + "spatiu-mediu": "mediu", + "spatiu-mare": "mare", +} + + +def normalize_indoor_outdoor(value: str) -> Optional[str]: + """Map an arbitrary string to an indoor_outdoor slug, or None. + + Unlike categories, this has NO mandatory fallback: an unrecognised or + empty value yields None (field simply absent), so we never fabricate a + location the enrichment did not assert. + """ + if not value: + return None + slug = _slugify(str(value)) + if slug in INDOOR_OUTDOOR: + return slug + return _INDOOR_OUTDOOR_ALIASES.get(slug) + + +def normalize_space_needed(value: str) -> Optional[str]: + """Map an arbitrary string to a space_needed slug, or None (no fallback).""" + if not value: + return None + slug = _slugify(str(value)) + if slug in SPACE_NEEDED: + return slug + return _SPACE_NEEDED_ALIASES.get(slug) + + +def indoor_outdoor_display_name(slug: str) -> str: + """RO display name for an indoor_outdoor slug.""" + return INDOOR_OUTDOOR.get(slug, slug) + + +def space_needed_display_name(slug: str) -> str: + """RO display name for a space_needed slug.""" + return SPACE_NEEDED.get(slug, slug) + + def is_valid_category(slug: str) -> bool: """True if `slug` is a valid category slug.""" return slug in CATEGORIES diff --git a/app/models/activity.py b/app/models/activity.py index b2bbf18..b259f0b 100644 --- a/app/models/activity.py +++ b/app/models/activity.py @@ -76,6 +76,25 @@ class Activity: extraction_confidence: Optional[str] = None # 
'high' / 'med' / 'low' needs_review: int = 0 + # Enrichment overlay (applied at build time from data/enrichment.json; see + # plan Part B). Bilingual: the EN/source text stays in name/description/... + # and the Romanian rendering lands in the *_ro twins. Absent fields leave + # the underlying DB value untouched. + name_ro: Optional[str] = None + description_ro: Optional[str] = None + rules_ro: Optional[str] = None + variations_ro: Optional[str] = None + indoor_outdoor: Optional[str] = None # slug: indoor / outdoor / either + space_needed: Optional[str] = None # slug: mic / mediu / mare + # Names of fields whose value was INFERRED by enrichment (source was + # silent) rather than stated in the source — surfaced as "(estimat)" in UI. + estimated_fields: List[str] = field(default_factory=list) + + # Source provenance for the download route + enrichment keying. + source_id: Optional[str] = None # e.g. "876d1a2d_marcaje_turistice" + source_ids: List[str] = field(default_factory=list) # all source_ids merged + chunk_key: Optional[str] = None # e.g. 
".part01" + # Database fields id: Optional[int] = None created_at: Optional[str] = None @@ -117,6 +136,16 @@ class Activity: 'normalized_name': self.normalized_name or normalize_name(self.name), 'extraction_confidence': self.extraction_confidence, 'needs_review': self.needs_review, + 'name_ro': self.name_ro, + 'description_ro': self.description_ro, + 'rules_ro': self.rules_ro, + 'variations_ro': self.variations_ro, + 'indoor_outdoor': self.indoor_outdoor, + 'space_needed': self.space_needed, + 'estimated_fields': json.dumps(self.estimated_fields) if self.estimated_fields else None, + 'source_id': self.source_id, + 'source_ids': json.dumps(self.source_ids) if self.source_ids else None, + 'chunk_key': self.chunk_key, } @classmethod @@ -140,6 +169,19 @@ class Activity: elif source_files is None: source_files = [] + # estimated_fields / source_ids: JSON string (DB) or list (in-memory) + def _json_list(value): + if isinstance(value, str): + try: + parsed = json.loads(value) + return parsed if isinstance(parsed, list) else [] + except (json.JSONDecodeError, TypeError): + return [] + return list(value) if value else [] + + estimated_fields = _json_list(data.get('estimated_fields')) + source_ids = _json_list(data.get('source_ids')) + return cls( id=data.get('id'), name=data.get('name', ''), @@ -170,6 +212,16 @@ class Activity: normalized_name=data.get('normalized_name'), extraction_confidence=data.get('extraction_confidence'), needs_review=data.get('needs_review', 0) or 0, + name_ro=data.get('name_ro'), + description_ro=data.get('description_ro'), + rules_ro=data.get('rules_ro'), + variations_ro=data.get('variations_ro'), + indoor_outdoor=data.get('indoor_outdoor'), + space_needed=data.get('space_needed'), + estimated_fields=estimated_fields, + source_id=data.get('source_id'), + source_ids=source_ids, + chunk_key=data.get('chunk_key'), created_at=data.get('created_at'), updated_at=data.get('updated_at') ) @@ -210,4 +262,44 @@ class Activity: return self.materials_category 
elif self.materials_list: return self.materials_list[:100] + "..." if len(self.materials_list) > 100 else self.materials_list - return "nu specificate" \ No newline at end of file + return "nu specificate" + + # --- Enrichment / bilingual display helpers ------------------------------ + def get_display_name(self) -> str: + """Romanian name when enriched, else the original.""" + return self.name_ro or self.name + + def get_display_description(self) -> str: + """Romanian description when enriched, else the original.""" + return self.description_ro or self.description + + def get_display_rules(self) -> Optional[str]: + """Romanian rules when enriched, else the original.""" + return self.rules_ro or self.rules + + def get_display_variations(self) -> Optional[str]: + """Romanian variations when enriched, else the original.""" + return self.variations_ro or self.variations + + def has_translation(self) -> bool: + """True if any Romanian enrichment text is present.""" + return bool(self.name_ro or self.description_ro + or self.rules_ro or self.variations_ro) + + def is_estimated(self, field_name: str) -> bool: + """True if `field_name` was inferred by enrichment (source was silent).""" + return field_name in (self.estimated_fields or []) + + def get_indoor_outdoor_display(self) -> Optional[str]: + """RO label for indoor_outdoor, or None when unset.""" + if not self.indoor_outdoor: + return None + from app.config_taxonomy import indoor_outdoor_display_name + return indoor_outdoor_display_name(self.indoor_outdoor) + + def get_space_needed_display(self) -> Optional[str]: + """RO label for space_needed, or None when unset.""" + if not self.space_needed: + return None + from app.config_taxonomy import space_needed_display_name + return space_needed_display_name(self.space_needed) \ No newline at end of file diff --git a/app/models/database.py b/app/models/database.py index 816c403..a0075df 100644 --- a/app/models/database.py +++ b/app/models/database.py @@ -72,6 +72,18 @@ 
class DatabaseManager: extraction_confidence TEXT, needs_review INTEGER DEFAULT 0, + -- Enrichment overlay (bilingual + inferred filters; Part B) + name_ro TEXT, + description_ro TEXT, + rules_ro TEXT, + variations_ro TEXT, + indoor_outdoor TEXT, + space_needed TEXT, + estimated_fields TEXT, + source_id TEXT, + source_ids TEXT, + chunk_key TEXT, + created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ) @@ -82,6 +94,7 @@ class DatabaseManager: CREATE VIRTUAL TABLE IF NOT EXISTS activities_fts USING fts5( name, description, rules, variations, keywords, materials_list, skills_developed, + name_ro, description_ro, rules_ro, variations_ro, content='activities', content_rowid='id' ) @@ -106,6 +119,8 @@ class DatabaseManager: "CREATE INDEX IF NOT EXISTS idx_activities_participants ON activities(participants_min, participants_max)", "CREATE INDEX IF NOT EXISTS idx_activities_duration ON activities(duration_min, duration_max)", "CREATE INDEX IF NOT EXISTS idx_activities_normalized_name ON activities(normalized_name)", + "CREATE INDEX IF NOT EXISTS idx_activities_indoor_outdoor ON activities(indoor_outdoor)", + "CREATE INDEX IF NOT EXISTS idx_activities_space_needed ON activities(space_needed)", "CREATE INDEX IF NOT EXISTS idx_categories_type ON categories(type)" ] @@ -117,9 +132,11 @@ class DatabaseManager: CREATE TRIGGER IF NOT EXISTS activities_fts_insert AFTER INSERT ON activities BEGIN INSERT INTO activities_fts(rowid, name, description, rules, variations, - keywords, materials_list, skills_developed) + keywords, materials_list, skills_developed, + name_ro, description_ro, rules_ro, variations_ro) VALUES (new.id, new.name, new.description, new.rules, new.variations, - new.keywords, new.materials_list, new.skills_developed); + new.keywords, new.materials_list, new.skills_developed, + new.name_ro, new.description_ro, new.rules_ro, new.variations_ro); END """) @@ -127,9 +144,11 @@ class DatabaseManager: CREATE TRIGGER IF NOT EXISTS 
activities_fts_delete AFTER DELETE ON activities BEGIN INSERT INTO activities_fts(activities_fts, rowid, name, description, rules, - variations, keywords, materials_list, skills_developed) + variations, keywords, materials_list, skills_developed, + name_ro, description_ro, rules_ro, variations_ro) VALUES ('delete', old.id, old.name, old.description, old.rules, - old.variations, old.keywords, old.materials_list, old.skills_developed); + old.variations, old.keywords, old.materials_list, old.skills_developed, + old.name_ro, old.description_ro, old.rules_ro, old.variations_ro); END """) @@ -137,13 +156,17 @@ class DatabaseManager: CREATE TRIGGER IF NOT EXISTS activities_fts_update AFTER UPDATE ON activities BEGIN INSERT INTO activities_fts(activities_fts, rowid, name, description, rules, - variations, keywords, materials_list, skills_developed) + variations, keywords, materials_list, skills_developed, + name_ro, description_ro, rules_ro, variations_ro) VALUES ('delete', old.id, old.name, old.description, old.rules, - old.variations, old.keywords, old.materials_list, old.skills_developed); + old.variations, old.keywords, old.materials_list, old.skills_developed, + old.name_ro, old.description_ro, old.rules_ro, old.variations_ro); INSERT INTO activities_fts(rowid, name, description, rules, variations, - keywords, materials_list, skills_developed) + keywords, materials_list, skills_developed, + name_ro, description_ro, rules_ro, variations_ro) VALUES (new.id, new.name, new.description, new.rules, new.variations, - new.keywords, new.materials_list, new.skills_developed); + new.keywords, new.materials_list, new.skills_developed, + new.name_ro, new.description_ro, new.rules_ro, new.variations_ro); END """) @@ -210,6 +233,10 @@ class DatabaseManager: ('duration', activity.get_duration_display()), ('materials', activity.get_materials_display()), ('difficulty', activity.difficulty_level), + # Enrichment axes — slugs stored as value; UI maps to RO via + # DISPLAY_NAMES. 
Without these the new dropdowns would be empty. + ('indoor_outdoor', activity.indoor_outdoor), + ('space_needed', activity.space_needed), ] for cat_type, cat_value in categories_to_update: @@ -236,6 +263,8 @@ class DatabaseManager: duration_max: Optional[int] = None, materials_category: Optional[str] = None, difficulty_level: Optional[str] = None, + indoor_outdoor: Optional[str] = None, + space_needed: Optional[str] = None, limit: int = 100) -> List[Dict[str, Any]]: """Enhanced search with FTS5 and filters""" @@ -293,7 +322,15 @@ class DatabaseManager: if difficulty_level: base_query += " AND difficulty_level = ?" params.append(difficulty_level) - + + if indoor_outdoor: + base_query += " AND indoor_outdoor = ?" + params.append(indoor_outdoor) + + if space_needed: + base_query += " AND space_needed = ?" + params.append(space_needed) + # Add ordering and limit query = f"{base_query} {order_clause} LIMIT ?" params.append(limit) diff --git a/app/services/search.py b/app/services/search.py index 2a64261..8189d2c 100644 --- a/app/services/search.py +++ b/app/services/search.py @@ -200,7 +200,14 @@ class SearchService: elif filter_key == 'difficulty': db_filters['difficulty_level'] = filter_value - + + elif filter_key == 'indoor_outdoor': + # Equality filter on the slug column (mirror difficulty). 
+ db_filters['indoor_outdoor'] = filter_value + + elif filter_key == 'space_needed': + db_filters['space_needed'] = filter_value + # Handle any other custom filters else: # Generic filter handling - try to match against keywords or tags diff --git a/app/static/css/main.css b/app/static/css/main.css index 3ee6075..e08abb9 100644 --- a/app/static/css/main.css +++ b/app/static/css/main.css @@ -705,4 +705,30 @@ body { box-shadow: none; border: 1px solid #ddd; } +} + +/* Enrichment markers (plan Part A7) */ +.estimated { + color: #8a6d3b; + font-style: italic; + font-size: 0.85em; + font-weight: normal; +} + +.original-text > summary { + cursor: pointer; + color: #555; + user-select: none; +} + +.original-text .original-content { + margin-top: 0.75rem; + padding-left: 1rem; + border-left: 3px solid #e0e0e0; + color: #555; +} + +.download-hint { + color: #888; + font-size: 0.85em; } \ No newline at end of file diff --git a/app/templates/activity.html b/app/templates/activity.html index d865f0a..0951f8b 100644 --- a/app/templates/activity.html +++ b/app/templates/activity.html @@ -8,13 +8,13 @@
-

{{ activity.name }}

+

{{ activity.get_display_name() }}

{{ display_names.get(activity.category, activity.category) }} {% if activity.content_type %} {{ display_names.get(activity.content_type, activity.content_type) }} @@ -31,27 +31,46 @@
- +

Descriere

-
{{ activity.description }}
+
{{ activity.get_display_description() }}
- {% if activity.rules %} + {% if activity.get_display_rules() %}

Reguli

-
{{ activity.rules }}
+
{{ activity.get_display_rules() }}
{% endif %} - {% if activity.variations %} + {% if activity.get_display_variations() %}

Variații

-
{{ activity.variations }}
+
{{ activity.get_display_variations() }}
{% endif %} + + {% if activity.has_translation() %} +
+ Text original ({{ display_names.get(activity.language, activity.language or 'sursă') }}) +
+ +
{{ activity.description }}
+ {% if activity.rules %} + +
{{ activity.rules }}
+ {% endif %} + {% if activity.variations %} + +
{{ activity.variations }}
+ {% endif %} +
+
+ {% endif %} +

Detalii activitate

@@ -59,21 +78,35 @@ {% if activity.get_age_range_display() != "toate vârstele" %} {% endif %} {% if activity.get_participants_display() != "orice număr" %} {% endif %} {% if activity.get_duration_display() != "durată variabilă" %} + {% endif %} + + {% if activity.get_indoor_outdoor_display() %} + + {% endif %} + + {% if activity.get_space_needed_display() %} + {% endif %} @@ -125,9 +158,15 @@

Informații sursă

{% if activity.source_file %} + {% if config.SOURCE_DOWNLOAD_ENABLED %} +

Fișier sursă: + {{ activity.source_file }} + (descarcă)

+ {% else %}

Fișier sursă: {{ activity.source_file }}

{% endif %} - + {% endif %} + {% if activity.page_reference %}

Referință: {{ activity.page_reference }}

{% endif %} diff --git a/app/templates/index.html b/app/templates/index.html index 7baffeb..f8c6690 100644 --- a/app/templates/index.html +++ b/app/templates/index.html @@ -125,6 +125,30 @@
{% endif %} + + {% if filters.indoor_outdoor %} +
+ + +
+ {% endif %} + + {% if filters.space_needed %} +
+ + +
+ {% endif %} {% endif %}
diff --git a/app/templates/results.html b/app/templates/results.html index f06166d..3f2ff65 100644 --- a/app/templates/results.html +++ b/app/templates/results.html @@ -85,6 +85,28 @@ {% endif %} + {% if filters.indoor_outdoor %} + + {% endif %} + + {% if filters.space_needed %} + + {% endif %} + @@ -128,7 +150,7 @@

- {{ activity.name }} + {{ activity.get_display_name() }}

{{ display_names.get(activity.category, activity.category) }} @@ -138,24 +160,36 @@
-

{{ activity.description }}

- +

{{ activity.get_display_description() }}

+ diff --git a/app/web/routes.py b/app/web/routes.py index 56fd7ca..34eb250 100644 --- a/app/web/routes.py +++ b/app/web/routes.py @@ -3,20 +3,27 @@ Flask routes for INDEX-SISTEM-JOCURI v2.0 Clean, minimalist web interface with dynamic filters """ -from flask import Blueprint, request, render_template, jsonify, current_app +from flask import ( + Blueprint, request, render_template, jsonify, current_app, + send_from_directory, +) from app.models.database import DatabaseManager from app.models.activity import Activity from app.services.search import SearchService -from app.config_taxonomy import CATEGORIES, CONTENT_TYPES -import os -from pathlib import Path +from app.config_taxonomy import ( + CATEGORIES, CONTENT_TYPES, INDOOR_OUTDOOR, SPACE_NEEDED, +) bp = Blueprint('main', __name__) -# Slug -> Romanian display name. Category and content_type slugs never collide, -# so a single flat map is enough for the UI filter labels. +# Slug -> Romanian display name. Category, content_type, indoor_outdoor and +# space_needed slugs never collide, so a single flat map is enough for the UI +# filter labels. LANGUAGE_NAMES = {'ro': 'Română', 'en': 'Engleză'} -DISPLAY_NAMES = {**CATEGORIES, **CONTENT_TYPES, **LANGUAGE_NAMES} +DISPLAY_NAMES = { + **CATEGORIES, **CONTENT_TYPES, **INDOOR_OUTDOOR, **SPACE_NEEDED, + **LANGUAGE_NAMES, +} # Initialize database manager (will be configured in application factory) def get_db_manager(): @@ -138,6 +145,44 @@ def activity_detail(activity_id): print(f"Error loading activity {activity_id}: {e}") return render_template('404.html'), 404 +@bp.route('/source/') +def source_download(activity_id): + """Download the original source file for an activity (plan A6). + + Shipped DARK: returns 404 unless SOURCE_DOWNLOAD_ENABLED is set (copyright + exposure — the user opts in). Resolves the activity's `source_file` under + CORPUS_DIR. 
send_from_directory does the safe-join and blocks traversal; + web-mirror / extension-less sources that are not real files 404 gracefully. + """ + if not current_app.config.get('SOURCE_DOWNLOAD_ENABLED', False): + return render_template('404.html'), 404 + try: + db = get_db_manager() + activity_data = db.get_activity_by_id(activity_id) + if not activity_data: + return render_template('404.html'), 404 + + source_file = (activity_data.get('source_file') or '').strip() + if not source_file: + return render_template('404.html'), 404 + + corpus_dir = current_app.config.get('CORPUS_DIR') + if not corpus_dir: + return render_template('404.html'), 404 + try: + # send_from_directory rejects path traversal and missing files with + # a 404 (NotFound) — no manual safe_join needed. + return send_from_directory( + corpus_dir, source_file, as_attachment=True + ) + except Exception: + # Missing file / web-mirror source with no on-disk original. + return render_template('404.html'), 404 + except Exception as e: + print(f"Source download error for {activity_id}: {e}") + return render_template('404.html'), 404 + + @bp.route('/health') def health_check(): """Health check endpoint for Docker""" diff --git a/data/activities.db b/data/activities.db index f4548ab..6996c82 100644 Binary files a/data/activities.db and b/data/activities.db differ diff --git a/scripts/ENRICHMENT_PROMPT.md b/scripts/ENRICHMENT_PROMPT.md new file mode 100644 index 0000000..6c63232 --- /dev/null +++ b/scripts/ENRICHMENT_PROMPT.md @@ -0,0 +1,98 @@ +# SUBAGENT — Activity enrichment + +You are a subagent in the game-library enrichment pipeline. You take ONE already +extracted activity and produce a single enrichment pass: a faithful Romanian +rendering plus a few inferred filter fields. You do **one** activity per prompt. + +This is **not** re-extraction. The activity text already exists and is trusted. +Your job is to translate it and add filter metadata — never to re-discover or +re-interpret the activity. 
+ +## Your task + +The prompt gives you two blocks: + +1. **Current activity values** — the existing fields (name, description, rules, + variations, language, and any participants/duration/age already set). +2. **Source chunk text** — the original passage the activity came from. This is + your ground truth for any expansion. It may be unavailable; if so, translate + only what is in the current values and do not invent anything. + +Produce one JSON object and write it to the path named in the prompt +(`data/enrichment_parts/.json`). It MUST contain the exact +`content_key` string from the prompt. + +## Rules + +### Translation (always) +- Translate `name`, `description`, `rules`, `variations` into natural, fluent + Romanian → `name_ro`, `description_ro`, `rules_ro`, `variations_ro`. +- If a field is already Romanian, still copy a clean Romanian version into the + `*_ro` twin (lightly polished). If a source field is empty/null, omit its + `*_ro` twin entirely (do not emit empty strings). +- Translate faithfully. Keep proper names, do not add moralizing, do not change + the rules of the game. + +### Description expansion (constrained) +- You MAY make `description_ro` richer than a literal translation — but ONLY + using detail that is actually present in the **source chunk text**. Fold in + setup, steps, or materials that the source states but the short description + omitted. +- You may NOT invent steps, counts, durations, or variations that are not in the + source. If the source is thin, the translation stays thin. Hallucinated + expansion is the one unacceptable failure here. + +### Inferred filter fields (mark when inferred) +Fill these when you can, using the source text first, then reasonable inference: + +- `indoor_outdoor`: one of `indoor`, `outdoor`, `either`. +- `space_needed`: one of `mic`, `mediu`, `mare` (small / medium / large area). +- `participants_min`, `participants_max`: integers (people). +- `duration_min`, `duration_max`: integers (minutes). 
+- `age_group_min`, `age_group_max`: integers (years). + +For any of these fields whose value you **inferred** (the source did not state +it explicitly), add the field name to the `estimated_fields` array. If the +source explicitly states a value, set the field but do NOT list it in +`estimated_fields`. Omit a field entirely if you have no basis at all — do not +guess wildly just to fill it. + +Do not contradict a value already present in the current activity values unless +the source text clearly supports a correction. + +## Enum vocabulary (fixed — use these exact slugs) + +- `indoor_outdoor`: `indoor` | `outdoor` | `either` +- `space_needed`: `mic` | `mediu` | `mare` + +## Output format + +Write exactly one JSON object to `data/enrichment_parts/.json`: + +```json +{ + "content_key": "", + "name_ro": "…", + "description_ro": "…", + "rules_ro": "…", + "variations_ro": "…", + "indoor_outdoor": "outdoor", + "space_needed": "mediu", + "participants_min": 6, + "participants_max": 20, + "duration_min": 15, + "duration_max": 30, + "age_group_min": 8, + "age_group_max": 14, + "estimated_fields": ["space_needed", "duration_min", "duration_max"] +} +``` + +Include only the fields you actually fill. Always include `content_key` and +`estimated_fields` (use `[]` if nothing was inferred). Output valid JSON only — +no commentary, no markdown fences in the file itself. + +## Report + +After writing the file, report in under 30 words: the activity name and which +fields you estimated. diff --git a/scripts/SUBAGENT_PROMPT.md b/scripts/SUBAGENT_PROMPT.md index 79c3e9c..feb0574 100644 --- a/scripts/SUBAGENT_PROMPT.md +++ b/scripts/SUBAGENT_PROMPT.md @@ -74,6 +74,23 @@ The file is one JSON object: a `header` plus an `activities` array. - Do **not** paraphrase the `source_excerpt` — copy it character for character. - Better to extract fewer activities accurately than to pad the output. 
+## Writing large outputs in batches (IMPORTANT) + +A single Write tool call has a hard ~32K output-token limit. Dense chunks +(50+ activities) will exceed this. If you estimate >30 activities, write the +file **incrementally**: + +1. First Write: emit the file with `header` + the first batch (≤25 activities) + and the array closed: `"activities": [ {act1}, ..., {act25} ] }`. +2. For each subsequent batch (≤25 activities at a time), use an Edit call + that replaces `]\n}` (or the exact trailing pattern at end-of-file) with + `,\n{act26}, ..., {act50}\n]\n}`. Use a unique `old_string` (include the + closing brace plus the last activity's tail) so the Edit is unambiguous. +3. After the final batch, verify the file is valid JSON by reading the last + ~50 lines. + +This keeps each tool call under the output-token cap. + ## Before you finish - Every activity has a non-empty `source_excerpt` and `page_reference`. diff --git a/scripts/build_database.py b/scripts/build_database.py index d7276be..928d35f 100644 --- a/scripts/build_database.py +++ b/scripts/build_database.py @@ -86,7 +86,12 @@ def _split_csv(value: Optional[str]) -> list[str]: return [p.strip() for p in str(value).split(",") if p.strip()] -def dict_to_activity(adict: dict, source_file: str) -> Activity: +def dict_to_activity( + adict: dict, + source_file: str, + source_id: Optional[str] = None, + chunk_key: Optional[str] = None, +) -> Activity: """Build an Activity from one extraction-JSON activity object.""" tags = adict.get("tags") or [] if isinstance(tags, str): @@ -99,6 +104,9 @@ def dict_to_activity(adict: dict, source_file: str) -> Activity: source_files = [source_file, *source_files] return Activity( + source_id=source_id, + source_ids=[source_id] if source_id else [], + chunk_key=chunk_key, name=(adict.get("name") or "").strip(), description=(adict.get("description") or "").strip(), rules=adict.get("rules"), @@ -206,6 +214,19 @@ def merge_cluster(cluster: list[Activity]) -> Activity: if s and s not 
in sources: sources.append(s) merged.source_files = sources + # source provenance: keep rep's chunk_key/source_id as primary, union the + # source_ids for the download route. Enrichment fields (name_ro, + # description_ro, indoor_outdoor, ...) are intentionally NOT carried here: + # enrichment is applied AFTER dedup (plan Part B2), keyed on the merged + # row's content_key, so merging must not pre-populate them. + merged.source_id = rep.source_id + merged.chunk_key = rep.chunk_key + source_ids: list[str] = [] + for a in cluster: + for sid in [a.source_id, *(a.source_ids or [])]: + if sid and sid not in source_ids: + source_ids.append(sid) + merged.source_ids = source_ids # popularity_score++ per merged duplicate (plan §4) merged.popularity_score = max(a.popularity_score for a in cluster) + (len(cluster) - 1) return merged @@ -313,6 +334,108 @@ def apply_review_decisions( return kept, stats +# -------------------------------------------------------------------------- +# step 5b — enrichment overlay (plan Part B) +# -------------------------------------------------------------------------- +# Translation / inferred-filter fields written by run_enrichment.py. Applied +# AFTER dedup + review decisions, keyed on the same stable content_key, so the +# overlay survives rebuilds as long as extraction text is frozen. 
+_ENRICHMENT_TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro") +_ENRICHMENT_INT_FIELDS = ( + "participants_min", "participants_max", + "duration_min", "duration_max", + "age_group_min", "age_group_max", +) + + +def load_enrichment(path: Path) -> dict: + """Load data/enrichment.json (flat map content_key -> field dict).""" + if path and path.is_file(): + try: + data = json.loads(path.read_text(encoding="utf-8")) + if isinstance(data, dict): + return data + except (json.JSONDecodeError, OSError): + pass + return {} + + +def apply_enrichment(activities: list[Activity], enrichment: dict) -> dict: + """ + Overlay enrichment fields onto the post-dedup activity list (plan B2). + + Keyed by content_key. Only fields PRESENT in an entry are written; absent + fields leave the underlying DB value untouched. indoor_outdoor / + space_needed are normalized to slugs (None on unrecognised). Inferred + fields are recorded in `estimated_fields`. Translated / expanded text is + NOT re-validated against the source here — expansion fidelity is the + enrichment prompt's responsibility (plan B2 comment). + + Returns {entries, matched, orphaned, fields_stated, fields_estimated}. 
+ """ + from app.config_taxonomy import normalize_indoor_outdoor, normalize_space_needed + + matched_keys: set[str] = set() + fields_stated: dict[str, int] = defaultdict(int) + fields_estimated: dict[str, int] = defaultdict(int) + + for act in activities: + key = content_key( + act.normalized_name or normalize_name(act.name), + act.language, + act.description or "", + ) + entry = enrichment.get(key) + if not isinstance(entry, dict): + continue + matched_keys.add(key) + + estimated = set(entry.get("estimated_fields") or []) + + # bilingual text twins + for fld in _ENRICHMENT_TEXT_FIELDS: + val = entry.get(fld) + if isinstance(val, str) and val.strip(): + setattr(act, fld, val.strip()) + + # inferred / clarified structured numeric fields + for fld in _ENRICHMENT_INT_FIELDS: + if entry.get(fld) is not None: + try: + setattr(act, fld, int(entry[fld])) + except (TypeError, ValueError): + pass + + # enum filters — normalized to slug, dropped if unrecognised + if entry.get("indoor_outdoor") is not None: + slug = normalize_indoor_outdoor(entry["indoor_outdoor"]) + if slug: + act.indoor_outdoor = slug + if entry.get("space_needed") is not None: + slug = normalize_space_needed(entry["space_needed"]) + if slug: + act.space_needed = slug + + act.estimated_fields = sorted(estimated) + + # QA tally: stated vs estimated population, per field + for fld in (*_ENRICHMENT_INT_FIELDS, "indoor_outdoor", "space_needed"): + if entry.get(fld) is None: + continue + if fld in estimated: + fields_estimated[fld] += 1 + else: + fields_stated[fld] += 1 + + return { + "entries": len(enrichment), + "matched": len(matched_keys), + "orphaned": len(enrichment) - len(matched_keys), + "fields_stated": dict(fields_stated), + "fields_estimated": dict(fields_estimated), + } + + # -------------------------------------------------------------------------- # golden-set recall (plan §7) # -------------------------------------------------------------------------- @@ -390,9 +513,8 @@ def collect_activities( 
header = data.get("header", {}) chunk_text = find_chunk_text(json_path, header, chunks_dir) - source_id = header.get("source_id") or chunk_key_for(json_path, header).rsplit( - ".part", 1 - )[0] + chunk_key = chunk_key_for(json_path, header) + source_id = header.get("source_id") or chunk_key.rsplit(".part", 1)[0] fallback_source = ( source_path_for(source_id, sources_dir) or source_id or json_path.stem ) @@ -409,7 +531,7 @@ def collect_activities( continue src = adict.get("source_file") or fallback_source raw_categories.append((adict.get("category", ""), normalize_category(adict.get("category", "")))) - activities.append(dict_to_activity(adict, src)) + activities.append(dict_to_activity(adict, src, source_id, chunk_key)) if hallucinated: _log_hallucinations(json_path, rejected_dir, hallucinated) @@ -496,6 +618,7 @@ def rebuild( sources_dir: Path, db_path: Path, decisions_path: Optional[Path] = None, + enrichment_path: Optional[Path] = None, schema_path: Path = DEFAULT_SCHEMA_PATH, golden_dir: Optional[Path] = None, do_swap: bool = True, @@ -517,6 +640,11 @@ def rebuild( decisions = load_review_decisions(Path(decisions_path)) if decisions_path else {} final, decision_stats = apply_review_decisions(deduped, decisions) + # Enrichment overlay — applied immediately after review decisions, on the + # post-dedup list, keyed on the same stable content_key (plan B2). 
+ enrichment = load_enrichment(Path(enrichment_path)) if enrichment_path else {} + enrichment_stats = apply_enrichment(final, enrichment) + try: write_database(db_tmp_path, final) backup = atomic_swap(db_tmp_path, db_path) if do_swap else None @@ -529,6 +657,7 @@ def rebuild( **collected, "dedup": dedup_stats, "decisions": decision_stats, + "enrichment": enrichment_stats, "final_count": len(final), "backup": str(backup) if backup else None, "swapped": do_swap, @@ -579,6 +708,16 @@ def print_report(report: dict) -> None: f"(auto-merged {d['auto_merged']}, borderline {d['borderline']})") print(f"review decisions : dropped {report['decisions']['dropped']}, " f"resolved {report['decisions']['resolved']}") + enr = report.get("enrichment") + if enr and enr.get("entries"): + print(f"enrichment : {enr['entries']} entries " + f"(matched {enr['matched']}, orphaned {enr['orphaned']})") + stated, estimated = enr.get("fields_stated", {}), enr.get("fields_estimated", {}) + all_fields = sorted(set(stated) | set(estimated)) + if all_fields: + print(" field population : (stated / estimated)") + for fld in all_fields: + print(f" {fld:<18}: {stated.get(fld, 0)} / {estimated.get(fld, 0)}") print(f"final inserted : {report['final_count']}") print(f"% with rules : {qa['pct_with_rules']}") print(f"needs_review rows : {qa['needs_review']}") @@ -615,6 +754,7 @@ def main(argv: Optional[list[str]] = None) -> int: parser.add_argument("--sources", default="data/sources") parser.add_argument("--db", default="data/activities.db") parser.add_argument("--decisions", default="data/review_decisions.json") + parser.add_argument("--enrichment", default="data/enrichment.json") parser.add_argument("--golden", default="data/golden") parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH)) args = parser.parse_args(argv) @@ -628,6 +768,7 @@ def main(argv: Optional[list[str]] = None) -> int: sources_dir=Path(args.sources), db_path=Path(args.db), decisions_path=Path(args.decisions), + 
enrichment_path=Path(args.enrichment), schema_path=Path(args.schema), golden_dir=Path(args.golden), ) diff --git a/scripts/repair_extractions.py b/scripts/repair_extractions.py new file mode 100644 index 0000000..b00b6a2 --- /dev/null +++ b/scripts/repair_extractions.py @@ -0,0 +1,244 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +repair_extractions.py — one-shot repair of malformed extraction JSON. + +Subagents systematically emit unescaped ASCII double-quotes inside string +values (Romanian text like „Unu" uses a closing " that terminates the JSON +string early). Re-extraction reproduces the bug, so we repair instead. + +IMPORTANT — why NOT json_repair: json_repair "recovers" an unescaped quote by +ending the string at the stray quote and reinterpreting the trailing text as a +new key, which (a) TRUNCATES the value and (b) injects garbage keys. The +truncation is silent (the field is still non-empty) and slips past a naive +presence check. So we use a faithful char-scanner that ESCAPES stray quotes +(\\") instead of splitting on them, then validate the result against the real +activity schema (additionalProperties:false also catches any residual split). + +This is an OFFLINE maintenance tool. build_database.py must NOT depend on it — +the "DB regenerable from data/extracted/" invariant requires plain valid JSON on +disk. We write clean JSON back to data/extracted/ and the build reads vanilla +json. + +Source selection (faithful recovery needs the ORIGINAL malformed text): + * a chunk is a candidate when a MALFORMED original exists — either the + top-level data/extracted/.json is itself invalid, or a malformed + original sits in data/extracted/_rejected/.json. + * the malformed original is preferred as the repair source. + * chunks whose only artifact is already-valid JSON (e.g. a prior json_repair + output that lost the original) are NOT silently "repaired" — if such a chunk + has no valid top-level file it is reported as needing RE-EXTRACTION. 
+ +Usage: + python scripts/repair_extractions.py # report only (dry run) + python scripts/repair_extractions.py --apply # write repaired JSON +""" + +from __future__ import annotations + +import argparse +import glob +import json +from pathlib import Path +from typing import Optional + +SCRIPT_DIR = Path(__file__).resolve().parent +REPO_ROOT = SCRIPT_DIR.parent +EXTRACTED = REPO_ROOT / "data" / "extracted" +REJECTED = EXTRACTED / "_rejected" + +if str(SCRIPT_DIR) not in __import__("sys").path: + __import__("sys").path.insert(0, str(SCRIPT_DIR)) +from import_common import DEFAULT_SCHEMA_PATH, load_schema, validate_extraction # noqa: E402 + + +def escape_stray_quotes(s: str) -> str: + """Escape ASCII double-quotes that occur INSIDE a JSON string value. + + A `"` inside a string is treated as a real string-close only when the next + non-whitespace char is structural (`,` `}` `]` `:`) or EOF; otherwise it is + content and is escaped to `\\"`. This preserves the full value instead of + truncating it (the json_repair failure mode). 
+ """ + out: list[str] = [] + in_str = False + esc = False + n = len(s) + i = 0 + while i < n: + c = s[i] + if esc: + out.append(c) + esc = False + i += 1 + continue + if c == "\\": + out.append(c) + esc = True + i += 1 + continue + if c == '"': + if not in_str: + in_str = True + out.append(c) + else: + j = i + 1 + while j < n and s[j] in " \t\r\n": + j += 1 + nxt = s[j] if j < n else "" + if nxt in ",}]:" or nxt == "": + in_str = False + out.append(c) + else: + out.append('\\"') # content quote → escape, keep value whole + i += 1 + continue + out.append(c) + i += 1 + return "".join(out) + + +def _is_valid_json(path: Path) -> bool: + try: + json.loads(path.read_text(encoding="utf-8")) + return True + except (json.JSONDecodeError, OSError): + return False + + +def _malformed_source(key: str) -> Optional[Path]: + """Return the malformed-original file for a chunk, preferring top-level.""" + live = EXTRACTED / f"{key}.json" + if live.exists() and not _is_valid_json(live): + return live + rej = REJECTED / f"{key}.json" + if rej.exists() and not _is_valid_json(rej): + return rej + return None + + +def _candidate_keys() -> tuple[dict[str, Path], list[str]]: + """ + (repair_candidates, needs_reextraction). + + repair_candidates: key -> malformed source file (faithfully repairable). + needs_reextraction: chunks with no malformed original AND no valid + top-level file (their original was lost) — must be re-extracted. + """ + keys = set() + for fn in glob.glob(str(EXTRACTED / "*.json")): + keys.add(Path(fn).stem) + for fn in glob.glob(str(REJECTED / "*.json")): + keys.add(Path(fn).stem) + + candidates: dict[str, Path] = {} + needs_reextraction: list[str] = [] + for key in sorted(keys): + # A malformed original anywhere is faithfully repairable, and is the + # source of truth even if a (json_repair-produced, possibly truncated) + # valid top-level file exists — escaping the original never truncates, + # so re-repairing from it is always >= the json_repair output. 
+ src = _malformed_source(key) + if src is not None: + candidates[key] = src + continue + live = EXTRACTED / f"{key}.json" + if live.exists() and _is_valid_json(live): + continue # genuinely-valid extraction, nothing to do + # no valid top-level and no malformed original to repair from + needs_reextraction.append(key) + return candidates, needs_reextraction + + +def repair(apply: bool) -> int: + schema = load_schema(DEFAULT_SCHEMA_PATH) + candidates, needs_reextraction = _candidate_keys() + + print("=" * 64) + print(f"REPAIR EXTRACTIONS ({'APPLY' if apply else 'dry run'})") + print("=" * 64) + print(f"repair candidates: {len(candidates)}") + + def _textlen(data: dict) -> int: + total = 0 + for a in data.get("activities", []): + if isinstance(a, dict): + for v in a.values(): + if isinstance(v, str): + total += len(v) + return total + + ok = 0 + kept_toplevel = 0 + still_bad: list[str] = [] + schema_fail: list[tuple[str, str]] = [] + + for key, src in candidates.items(): + live = EXTRACTED / f"{key}.json" + live_valid = live.exists() and _is_valid_json(live) + + raw = src.read_text(encoding="utf-8") + fixed = escape_stray_quotes(raw) + try: + data = json.loads(fixed) + except json.JSONDecodeError as exc: + if live_valid: + kept_toplevel += 1 # genuine top-level is fine; stale _rejected + else: + still_bad.append(f"{key}: still invalid after escape ({exc})") + continue + errors = validate_extraction(data, schema) + if errors: + if live_valid: + kept_toplevel += 1 + else: + schema_fail.append((key, errors[0])) + print(f" {key[:50]:<50} SCHEMA-FAIL: {errors[0][:40]}") + continue + + # Faithfulness guard: only replace a valid top-level when the escaped + # repair carries STRICTLY more text (i.e. the top-level was a truncated + # json_repair output). Genuine extractions are kept untouched. 
+ if live_valid: + try: + live_data = json.loads(live.read_text(encoding="utf-8")) + except json.JSONDecodeError: + live_data = {} + if _textlen(data) <= _textlen(live_data): + kept_toplevel += 1 + continue + + n = len(data.get("activities", [])) + print(f" {key[:50]:<50} {n:>3} acts REPAIR") + if apply: + live.write_text( + json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8" + ) + ok += 1 + + print("-" * 64) + print(f"repaired: {ok} | kept genuine top-level: {kept_toplevel} | " + f"schema-fail: {len(schema_fail)} | still-bad: {len(still_bad)} | " + f"needs re-extraction: {len(needs_reextraction)}") + for key, err in schema_fail: + print(f" ⚠ schema {key}: {err[:60]}") + for msg in still_bad: + print(f" ✘ {msg}") + for key in needs_reextraction: + print(f" ↻ re-extract: {key}") + if not apply: + print("\nDry run — re-run with --apply to write repaired JSON.") + print("=" * 64) + return 0 + + +def main(argv: Optional[list[str]] = None) -> int: + parser = argparse.ArgumentParser(description="Repair malformed extraction JSON.") + parser.add_argument("--apply", action="store_true", + help="write repaired JSON (default: dry run)") + args = parser.parse_args(argv) + return repair(args.apply) + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/run_enrichment.py b/scripts/run_enrichment.py new file mode 100644 index 0000000..dcf434a --- /dev/null +++ b/scripts/run_enrichment.py @@ -0,0 +1,270 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- +""" +run_enrichment.py — enrichment orchestrator (plan Part B3). + +Mirror of run_extraction.py, on the *other* side of the rebuild. It reads the +already-rebuilt data/activities.db, and for every activity emits one subagent +prompt asking for a single bilingual + inferred-filter enrichment pass. Like +extraction, this script does NOT call the LLM — the interactive Claude Code +orchestrator launches waves of subagents on the emitted prompts. 
+ +Keying is the crux (plan §"Cheia de keying"): each row's overlay is keyed on +import_common.content_key(normalized_name, language, _normalize_text(description)) +— the SAME function build_database uses to apply the overlay. The key is stable +only while the extraction text is frozen, so enrichment runs AFTER the freezing +rebuild. + +Modes: + (default) emit one prompt per activity that has no enrichment part yet + (resumable: data/enrichment_parts/.json present => skip) + --collect merge data/enrichment_parts/*.json -> data/enrichment.json + +Pilot scoping (plan B5): --source and/or --limit N narrow +the emitted prompts to a single source / category for the sign-off pilot. + +Usage: + python scripts/run_enrichment.py --source teambuilding_corbu # pilot + python scripts/run_enrichment.py # all rows + python scripts/run_enrichment.py --collect # merge parts +""" + +from __future__ import annotations + +import argparse +import json +import sqlite3 +import sys +from pathlib import Path +from typing import Optional + +SCRIPT_DIR = Path(__file__).resolve().parent +REPO_ROOT = SCRIPT_DIR.parent +for _p in (str(SCRIPT_DIR), str(REPO_ROOT)): + if _p not in sys.path: + sys.path.insert(0, _p) + +from import_common import ( # noqa: E402 + content_key, + find_chunk_text, + normalize_name, +) + +ENRICHMENT_PROMPT = SCRIPT_DIR / "ENRICHMENT_PROMPT.md" + +# Columns pulled from the DB into the prompt as the "current value" context. +_DB_COLUMNS = ( + "id", "name", "description", "rules", "variations", + "category", "content_type", "language", "normalized_name", + "page_reference", "source_id", "chunk_key", + "participants_min", "participants_max", + "duration_min", "duration_max", + "age_group_min", "age_group_max", +) + +# How much source-chunk text to inline. Chunks are page-sized; cap so a dense +# chunk does not blow the prompt up, but keep enough to ground the expansion. 
+_CHUNK_TEXT_CAP = 12000 + + +def _fetch_rows(db_path: Path, source_substr: Optional[str]) -> list[dict]: + conn = sqlite3.connect(db_path) + conn.row_factory = sqlite3.Row + try: + cols = ", ".join(_DB_COLUMNS) + sql = f"SELECT {cols} FROM activities" + params: list = [] + if source_substr: + sql += " WHERE (source_id LIKE ? OR chunk_key LIKE ?)" + params = [f"%{source_substr}%", f"%{source_substr}%"] + sql += " ORDER BY source_id, id" + return [dict(r) for r in conn.execute(sql, params).fetchall()] + finally: + conn.close() + + +def _row_content_key(row: dict) -> str: + return content_key( + row.get("normalized_name") or normalize_name(row.get("name") or ""), + row.get("language"), + row.get("description") or "", + ) + + +def _chunk_text_for_row(row: dict, chunks_dir: Path) -> Optional[str]: + """Locate the source-chunk text via the row's chunk_key / source_id.""" + header = {"chunk_key": row.get("chunk_key"), "source_id": row.get("source_id")} + if not header["chunk_key"]: + return None + # find_chunk_text resolves from the header when chunk_key is present; + # the json_path arg is only a fallback, so a synthetic path is fine. 
+ text = find_chunk_text(Path(f"{row['chunk_key']}.json"), header, chunks_dir) + if text and len(text) > _CHUNK_TEXT_CAP: + text = text[:_CHUNK_TEXT_CAP] + "\n…[chunk truncated]…" + return text + + +def _current_fields_block(row: dict) -> str: + """The activity's current DB values, as a compact JSON block for context.""" + fields = { + "name": row.get("name"), + "description": row.get("description"), + "rules": row.get("rules"), + "variations": row.get("variations"), + "category": row.get("category"), + "content_type": row.get("content_type"), + "language": row.get("language"), + "participants_min": row.get("participants_min"), + "participants_max": row.get("participants_max"), + "duration_min": row.get("duration_min"), + "duration_max": row.get("duration_max"), + "age_group_min": row.get("age_group_min"), + "age_group_max": row.get("age_group_max"), + } + return json.dumps(fields, ensure_ascii=False, indent=2) + + +def emit_enrichment_prompt( + row: dict, key: str, chunks_dir: Path, prompts_dir: Path +) -> Path: + """Write the subagent enrichment prompt for one activity.""" + chunk_text = _chunk_text_for_row(row, chunks_dir) + source_block = ( + chunk_text if chunk_text is not None + else "[source chunk text unavailable — translate only what is given " + "above; do NOT invent steps, and mark any inferred filter field " + "as estimated]" + ) + part_path = f"data/enrichment_parts/{key}.json" + text = "\n".join([ + f"# ENRICHMENT — activity `{row.get('name')}` (id {row.get('id')})", + "", + f"Follow the rules in `{ENRICHMENT_PROMPT.relative_to(REPO_ROOT)}` EXACTLY.", + "Single pass. 
Translate faithfully to Romanian; expand description_ro " + "ONLY from the source chunk text below; mark inferred filter fields in " + "`estimated_fields`.", + "", + f"Write the result JSON to: `{part_path}`", + f'It MUST include `"content_key": "{key}"`.', + f'Page reference: {row.get("page_reference") or "?"}', + "", + "## Current activity values (the text to translate / enrich)", + "```json", + _current_fields_block(row), + "```", + "", + "## Source chunk text (ground description_ro expansion in THIS only)", + "```", + source_block, + "```", + "", + ]) + prompts_dir.mkdir(parents=True, exist_ok=True) + out = prompts_dir / f"{key}.prompt.md" + out.write_text(text, encoding="utf-8") + return out + + +def collect_enrichment(parts_dir: Path, out_path: Path) -> dict: + """Merge data/enrichment_parts/*.json into one flat content_key map.""" + merged: dict = {} + bad: list[str] = [] + if parts_dir.is_dir(): + for part in sorted(parts_dir.glob("*.json")): + try: + data = json.loads(part.read_text(encoding="utf-8")) + except (json.JSONDecodeError, OSError): + bad.append(part.name) + continue + key = data.get("content_key") or part.stem + entry = {k: v for k, v in data.items() if k != "content_key"} + merged[key] = entry + out_path.write_text( + json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8" + ) + return {"entries": len(merged), "bad_parts": bad, "out": str(out_path)} + + +def run_emit( + *, + db_path: Path, + chunks_dir: Path, + parts_dir: Path, + prompts_dir: Path, + source_substr: Optional[str], + limit: Optional[int], +) -> dict: + rows = _fetch_rows(db_path, source_substr) + emitted, skipped = 0, 0 + for row in rows: + key = _row_content_key(row) + if (parts_dir / f"{key}.json").is_file(): + skipped += 1 + continue + emit_enrichment_prompt(row, key, chunks_dir, prompts_dir) + emitted += 1 + if limit and emitted >= limit: + break + return { + "rows": len(rows), + "emitted": emitted, + "skipped_done": skipped, + "prompts_dir": str(prompts_dir), + } 
+ + +def main(argv: Optional[list[str]] = None) -> int: + parser = argparse.ArgumentParser(description="Enrichment orchestrator.") + parser.add_argument("--db", default="data/activities.db") + parser.add_argument("--chunks", default="data/chunks") + parser.add_argument("--parts", default="data/enrichment_parts") + parser.add_argument("--prompts", default="data/enrichment_prompts") + parser.add_argument("--out", default="data/enrichment.json") + parser.add_argument("--source", default=None, + help="only rows whose source_id/chunk_key contains this (pilot)") + parser.add_argument("--limit", type=int, default=None, + help="cap emitted prompts (pilot)") + parser.add_argument("--collect", action="store_true", + help="merge enrichment parts into the overlay JSON") + args = parser.parse_args(argv) + + print("=" * 60) + print("ENRICHMENT ORCHESTRATOR") + print("=" * 60) + + if args.collect: + result = collect_enrichment(Path(args.parts), Path(args.out)) + print(f"collected : {result['entries']} entries -> {result['out']}") + if result["bad_parts"]: + print(f"bad parts : {len(result['bad_parts'])} (skipped)") + for name in result["bad_parts"]: + print(f" - {name}") + print("Run build_database.py --rebuild to apply the overlay.") + print("=" * 60) + return 0 + + summary = run_emit( + db_path=Path(args.db), + chunks_dir=Path(args.chunks), + parts_dir=Path(args.parts), + prompts_dir=Path(args.prompts), + source_substr=args.source, + limit=args.limit, + ) + print(f"rows in DB : {summary['rows']}" + + (f" (filtered by '{args.source}')" if args.source else "")) + print(f"already enriched : {summary['skipped_done']}") + print(f"prompts emitted : {summary['emitted']}") + if summary["emitted"]: + print(f"prompts dir : {summary['prompts_dir']}/") + print("Launch waves of ~8-16 Sonnet subagents on those prompts, each " + "writing data/enrichment_parts/.json, then run " + "run_enrichment.py --collect and build_database.py --rebuild.") + else: + print("Nothing to emit — run --collect 
then build_database.py --rebuild.") + print("=" * 60) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/tests/test_enrichment.py b/tests/test_enrichment.py new file mode 100644 index 0000000..3c0e943 --- /dev/null +++ b/tests/test_enrichment.py @@ -0,0 +1,231 @@ +""" +Tests for the enrichment overlay (plan Part B) and the new filter axes / +bilingual display helpers (plan Part A). + +Covers: + * config_taxonomy.normalize_indoor_outdoor / normalize_space_needed + * build_database.apply_enrichment keying, field application, estimated tally + * DatabaseManager indoor_outdoor / space_needed equality filters + * FTS5 indexing of the *_ro columns + * Activity bilingual display helpers +""" + +import os +import sys + +import pytest + +PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), "..")) +if PROJECT_ROOT not in sys.path: + sys.path.insert(0, PROJECT_ROOT) +SCRIPTS = os.path.join(PROJECT_ROOT, "scripts") +if SCRIPTS not in sys.path: + sys.path.insert(0, SCRIPTS) + +from app.models.activity import Activity # noqa: E402 +from app.models.database import DatabaseManager # noqa: E402 +from app.config_taxonomy import ( # noqa: E402 + normalize_indoor_outdoor, + normalize_space_needed, +) +from import_common import content_key, normalize_name # noqa: E402 +from build_database import apply_enrichment # noqa: E402 + + +# -------------------------------------------------------------------------- +# taxonomy normalizers +# -------------------------------------------------------------------------- +@pytest.mark.parametrize("raw,expected", [ + ("indoor", "indoor"), + ("Outdoor", "outdoor"), + ("either", "either"), + ("interior", "indoor"), + ("aer liber", "outdoor"), + ("both", "either"), + ("", None), + ("nonsense", None), + (None, None), +]) +def test_normalize_indoor_outdoor(raw, expected): + assert normalize_indoor_outdoor(raw) == expected + + +@pytest.mark.parametrize("raw,expected", [ + ("mic", "mic"), + ("MEDIU", "mediu"), 
+ ("mare", "mare"), + ("small", "mic"), + ("large", "mare"), + ("", None), + ("huge", None), + (None, None), +]) +def test_normalize_space_needed(raw, expected): + assert normalize_space_needed(raw) == expected + + +# -------------------------------------------------------------------------- +# apply_enrichment +# -------------------------------------------------------------------------- +def _activity(name="Joc de test", description="O descriere de test.", language="ro"): + return Activity( + name=name, description=description, category="team-building", + content_type="joc", source_file="t.txt", language=language, + ) + + +def _key_for(act: Activity) -> str: + return content_key( + act.normalized_name or normalize_name(act.name), + act.language, + act.description or "", + ) + + +def test_apply_enrichment_matches_and_applies_fields(): + act = _activity() + key = _key_for(act) + enrichment = { + key: { + "name_ro": "Joc de test (RO)", + "description_ro": "Descriere îmbogățită în română.", + "indoor_outdoor": "outdoor", + "space_needed": "mediu", + "participants_min": 4, + "participants_max": 12, + "estimated_fields": ["space_needed", "participants_min", "participants_max"], + } + } + stats = apply_enrichment([act], enrichment) + + assert act.name_ro == "Joc de test (RO)" + assert act.description_ro == "Descriere îmbogățită în română." 
+ assert act.indoor_outdoor == "outdoor" + assert act.space_needed == "mediu" + assert act.participants_min == 4 and act.participants_max == 12 + assert set(act.estimated_fields) == {"space_needed", "participants_min", "participants_max"} + + assert stats["entries"] == 1 + assert stats["matched"] == 1 + assert stats["orphaned"] == 0 + # indoor_outdoor stated, space_needed estimated + assert stats["fields_stated"].get("indoor_outdoor") == 1 + assert stats["fields_estimated"].get("space_needed") == 1 + + +def test_apply_enrichment_orphan_entry_counted(): + act = _activity() + enrichment = {"deadbeef" * 5: {"name_ro": "nu se potrivește"}} + stats = apply_enrichment([act], enrichment) + assert stats["matched"] == 0 + assert stats["orphaned"] == 1 + assert act.name_ro is None # untouched + + +def test_apply_enrichment_absent_fields_leave_value_untouched(): + act = _activity() + act.participants_min = 5 + key = _key_for(act) + # entry only translates name; participants must be preserved + apply_enrichment([act], {key: {"name_ro": "Tradus"}}) + assert act.participants_min == 5 + assert act.name_ro == "Tradus" + + +def test_apply_enrichment_drops_unrecognised_enum(): + act = _activity() + key = _key_for(act) + apply_enrichment([act], {key: {"indoor_outdoor": "spaceship", "space_needed": "mic"}}) + assert act.indoor_outdoor is None # unrecognised → dropped + assert act.space_needed == "mic" + + +# -------------------------------------------------------------------------- +# DB equality filters + FTS on *_ro +# -------------------------------------------------------------------------- +@pytest.fixture +def db(tmp_path): + return DatabaseManager(str(tmp_path / "enrich.db")) + + +def _insert(db, **overrides): + base = dict( + name="Activitate", description="desc", category="camp-outdoor", + content_type="joc", source_file="t.txt", language="ro", + ) + base.update(overrides) + return db.insert_activity(Activity(**base)) + + +def test_indoor_outdoor_equality_filter(db): + 
_insert(db, name="In casa", indoor_outdoor="indoor") + _insert(db, name="Afara", indoor_outdoor="outdoor") + res = db.search_activities(indoor_outdoor="outdoor") + assert len(res) == 1 + assert res[0]["name"] == "Afara" + + +def test_space_needed_equality_filter(db): + _insert(db, name="Mic", space_needed="mic") + _insert(db, name="Mare", space_needed="mare") + res = db.search_activities(space_needed="mare") + assert len(res) == 1 + assert res[0]["name"] == "Mare" + + +def test_fts_indexes_name_ro(db): + _insert(db, name="Treasure Hunt", name_ro="Vânătoarea de comori") + # term only present in the Romanian twin + res = db.search_activities(search_text="comori") + assert len(res) == 1 + assert res[0]["name"] == "Treasure Hunt" + + +def test_fts_indexes_description_ro(db): + _insert(db, name="Game", description="english desc", + description_ro="o activitate de cooperare") + res = db.search_activities(search_text="cooperare") + assert len(res) == 1 + + +def test_ro_columns_round_trip(db): + aid = _insert( + db, name="X", name_ro="X-ro", description_ro="d-ro", + rules_ro="r-ro", variations_ro="v-ro", + indoor_outdoor="either", space_needed="mediu", + estimated_fields=["duration_min"], source_id="src1", + source_ids=["src1", "src2"], chunk_key="src1.part01", + ) + row = db.get_activity_by_id(aid) + loaded = Activity.from_dict(row) + assert loaded.name_ro == "X-ro" + assert loaded.indoor_outdoor == "either" + assert loaded.space_needed == "mediu" + assert loaded.estimated_fields == ["duration_min"] + assert loaded.source_ids == ["src1", "src2"] + assert loaded.chunk_key == "src1.part01" + + +# -------------------------------------------------------------------------- +# display helpers +# -------------------------------------------------------------------------- +def test_display_helpers_prefer_ro_with_fallback(): + act = _activity(name="Original", description="Original desc") + assert act.get_display_name() == "Original" # no translation yet + assert 
act.get_display_description() == "Original desc" + act.name_ro = "Tradus" + act.description_ro = "Descriere tradusă" + assert act.get_display_name() == "Tradus" + assert act.get_display_description() == "Descriere tradusă" + assert act.has_translation() is True + + +def test_is_estimated_and_axis_displays(): + act = _activity() + act.indoor_outdoor = "outdoor" + act.space_needed = "mare" + act.estimated_fields = ["space_needed"] + assert act.get_indoor_outdoor_display() == "Exterior" + assert act.get_space_needed_display() == "Spațiu mare" + assert act.is_estimated("space_needed") is True + assert act.is_estimated("indoor_outdoor") is False
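
The keying rule from the `run_enrichment.py` docstring (the overlay is keyed on a content key, which stays valid only while the extraction text is frozen) and the `--collect` merge can be sketched in isolation. This is a minimal sketch under stated assumptions, not the project's implementation: `sketch_content_key` is a hypothetical stand-in for `import_common.content_key`, whose real body is not shown in this diff, and the SHA-1 digest and separator are illustrative choices. `merge_parts` and `demo` are likewise hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sketch_content_key(normalized_name: str, language: str, description: str) -> str:
    """Hypothetical stand-in for import_common.content_key (real body not shown)."""
    # Assumption: the key is a digest over name, language and description text.
    payload = "\x1f".join([normalized_name, language or "", description or ""])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def merge_parts(parts_dir: Path) -> dict:
    """Merge one-JSON-per-activity parts into a flat {content_key: fields} map."""
    merged: dict = {}
    for part in sorted(parts_dir.glob("*.json")):
        try:
            data = json.loads(part.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError):
            continue  # unreadable or malformed part: skip it
        if not isinstance(data, dict):
            continue  # choice of this sketch: ignore parts that are not JSON objects
        key = data.get("content_key") or part.stem
        merged[key] = {k: v for k, v in data.items() if k != "content_key"}
    return merged


def demo() -> dict:
    """Show that editing the description changes the key, orphaning the overlay entry."""
    key = sketch_content_key("joc de test", "ro", "O descriere de test.")
    drifted = sketch_content_key("joc de test", "ro", "O descriere de test, editata.")
    with tempfile.TemporaryDirectory() as tmp:
        parts = Path(tmp)
        (parts / f"{key}.json").write_text(
            json.dumps({"content_key": key, "name_ro": "Joc de test (RO)"}),
            encoding="utf-8",
        )
        (parts / "broken.json").write_text("{not json", encoding="utf-8")
        overlay = merge_parts(parts)
    return {"key": key, "drifted": drifted, "overlay": overlay}
```

Under these assumptions, an entry written for `key` no longer matches once the description text drifts, which is why the docstring orders enrichment after the freezing rebuild. The malformed part is skipped rather than aborting the merge, mirroring the `bad_parts` handling in `collect_enrichment`.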