Faza 1 complete: bilingual+enrichment plumbing, UI/filters, frozen DB

Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await
re-extraction). DB rebuilt and frozen at 9418 activities — content_keys
are now stable for the enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor,
  space_needed, estimated_fields, source_id/source_ids/chunk_key columns;
  FTS5 indexes the 4 *_ro columns across CREATE + all 3 triggers; new equality
  filters + category counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*,
  is_estimated, axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers
  (None on unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns,
  RO-primary rendering with "(estimat)" markers and collapsible original
  text, and a /source/<id> download route shipped DARK behind
  SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster
  unions source_ids without touching enrichment fields.
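
The "CREATE + all 3 triggers" point matters because an external-content FTS5
table only sees what its triggers feed it. Below is a minimal sketch of that
pattern, assuming a cut-down two-column table; the trigger names and the
sample row are illustrative and this is not the project's database.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activities (id INTEGER PRIMARY KEY, name TEXT, name_ro TEXT);

-- External-content FTS5 index: it stores no text itself, so every
-- indexed column must appear in the CREATE and in each trigger.
CREATE VIRTUAL TABLE activities_fts USING fts5(
    name, name_ro, content='activities', content_rowid='id'
);

CREATE TRIGGER demo_fts_insert AFTER INSERT ON activities BEGIN
    INSERT INTO activities_fts(rowid, name, name_ro)
    VALUES (new.id, new.name, new.name_ro);
END;

CREATE TRIGGER demo_fts_delete AFTER DELETE ON activities BEGIN
    INSERT INTO activities_fts(activities_fts, rowid, name, name_ro)
    VALUES ('delete', old.id, old.name, old.name_ro);
END;

CREATE TRIGGER demo_fts_update AFTER UPDATE ON activities BEGIN
    INSERT INTO activities_fts(activities_fts, rowid, name, name_ro)
    VALUES ('delete', old.id, old.name, old.name_ro);
    INSERT INTO activities_fts(rowid, name, name_ro)
    VALUES (new.id, new.name, new.name_ro);
END;
""")

# Row arrives without a translation, then the overlay fills name_ro.
conn.execute("INSERT INTO activities(name) VALUES ('Capture the Flag')")
conn.execute("UPDATE activities SET name_ro = 'Cucerirea steagului' WHERE id = 1")

hits = [row[0] for row in conn.execute(
    "SELECT rowid FROM activities_fts WHERE activities_fts MATCH 'steagului'"
)]
```

If the update trigger omitted name_ro, the Romanian text would be stored in
the table but never become searchable.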

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on
  content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect),
  ENRICHMENT_PROMPT.md.
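
To make the overlay and its QA counters concrete, here is a rough sketch of
the idea. It assumes activities are plain dicts that already carry a
precomputed "content_key"; the real apply_enrichment works on Activity
objects and computes the key itself, so treat the names below as
hypothetical.

```python
from collections import Counter

# Fields the overlay may set (taken from the column list in this commit).
OVERLAY_FIELDS = (
    "name_ro", "description_ro", "rules_ro", "variations_ro",
    "indoor_outdoor", "space_needed",
)


def apply_enrichment(activities, enrichment):
    """Overlay enrichment entries onto post-dedup activities, in place."""
    stated, estimated = Counter(), Counter()
    matched_keys = set()
    for activity in activities:
        entry = enrichment.get(activity["content_key"])
        if entry is None:
            continue  # not enriched yet: leave the row untouched
        matched_keys.add(activity["content_key"])
        inferred = set(entry.get("estimated_fields") or [])
        applied_inferred = []
        for field in OVERLAY_FIELDS:
            value = entry.get(field)
            if value is None:
                continue  # an absent field leaves the existing value alone
            activity[field] = value
            if field in inferred:
                estimated[field] += 1
                applied_inferred.append(field)
            else:
                stated[field] += 1
        activity["estimated_fields"] = applied_inferred
    report = {
        "entries": len(enrichment),
        "matched": len(matched_keys),
        # Entries whose key matches no activity, e.g. after a re-freeze.
        "orphaned": len(set(enrichment) - matched_keys),
    }
    return report, stated, estimated
```

A non-zero "orphaned" count is the signal that content_keys drifted between
the freeze and the enrichment run.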

Repair: scripts/repair_extractions.py fixes the subagents' systematic
unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never
truncates) + schema validation + a strictly-more-text guard. json_repair was
tried first, truncated silently, and is NOT used. build_database has no repair
dependency.
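
A sketch of the char-scanner idea follows. The closing-quote heuristic (a
quote ends the string only when the next non-blank character is one of
, } ] : or end of input) is an assumption made for illustration, as is the
total_text helper; the actual scripts/repair_extractions.py may decide
differently.

```python
import json


def escape_stray_quotes(text: str) -> str:
    """Escape ASCII quotes that appear inside JSON string values."""
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
        elif ch == "\\":
            # Copy an existing escape sequence through untouched.
            out.append(text[i:i + 2])
            i += 1
        elif ch == '"':
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if j == n or text[j] in ",}]:":
                in_string = False      # structural close of the string
                out.append(ch)
            else:
                out.append('\\"')      # stray quote: escape, never truncate
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def total_text(obj) -> int:
    """Total characters held in string values (for a more-text guard)."""
    if isinstance(obj, str):
        return len(obj)
    if isinstance(obj, dict):
        return sum(total_text(v) for v in obj.values())
    if isinstance(obj, list):
        return sum(total_text(v) for v in obj)
    return 0


broken = '{"name": "„Unu" și doi", "n": 1}'
fixed = json.loads(escape_stray_quotes(broken))
```

The heuristic misreads a stray quote that happens to be followed by a comma,
which is one reason to pair the scanner with schema validation and a text
length comparison via something like total_text.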

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude Agent
2026-05-29 18:10:13 +00:00
parent 46d9592a55
commit bcfb6841eb
18 changed files with 1579 additions and 167 deletions

@@ -1,135 +1,143 @@
-# HANDOFF — Faza 1 extraction in progress
-**Last updated:** 2026-05-20, commit `3d9f266` (pilot complete) + uncommitted Faza 1 work.
-## State of play
-Faza 0 (pilot) is **complete and committed**. Faza 1 (full corpus) is **in progress at 10.9%**.
-| Phase | Status | DB rows | Tests |
-|-------|--------|---------|-------|
-| Faza 0 pilot (5 files) | committed (`3d9f266`) | 1751 in `data/activities.db` | 71 passed |
-| Faza 1 extraction | **64/588 chunks done**, 1949 activities in `data/extracted/*.json` (not yet imported to DB) | — | — |
-## What "Faza 1" is
-Process the full 96-source corpus (was 116 raw files; some are duplicates/empty zips/skipped junk) through the LLM-subagent extraction pipeline. Same code path as the pilot. Two large mirror dirs dominate the chunk count:
-- `87850302_dragon_sleepdeprived` — 116 chunks (full dragon.sleepdeprived.ca mirror)
-- `c3162825_resource_pack__learning_by_playing_catalunya_...` — 97 chunks (the catalunya mirror)
-Combined they are 213/588 = 36% of all remaining chunks.
+# HANDOFF — Faza 1 extraction DONE, code landed, DB frozen; next = re-extract 7 + enrichment pilot
+**Snapshot:** 2026-05-29. Executing plan `enumerated-petting-badger.md` (bilingual
+index + enrichment + new filters + source download).
+## Where we are
+| Step (plan Part C) | Status |
+|--------------------|--------|
+| 1. Finish extraction | **DONE** — 575/588 chunks extracted & valid; 13 missing (see below) |
+| 2. Land code Part A1-A4 (model/schema/merge) | **DONE & committed** |
+| 2b. Code Part A5-A8 (UI/search/download) | **DONE & committed** |
+| 2c. Code Part B2-B4 (enrichment pipeline) | **DONE & committed** |
+| 3. Freeze rebuild (freezes content_keys) | **DONE** - `data/activities.db` = **9418 activities** |
+| Part D tests | **DONE** - `tests/test_enrichment.py`, 99 pass total |
+| 4. Enrichment pilot → **STOP for user sign-off** | **NOT STARTED** (this is the gate) |
+| 5. Final rebuild `--enrichment` | not started |
+Everything is committed except whatever this session leaves dirty. `data/extracted/*.json`
+is gitignored (575 files on disk, durable across /clear).
+## The 13 missing chunks (out of 588)
+**6 content-filter-blocked** (Anthropic safety; accept as missing — marginal loss):
+- `87850302_dragon_sleepdeprived.part73 / .part85 / .part94` (camp song lyrics)
+- `c3162825_resource_pack__learning_by_playing_catalunya_…part94 / .part95 / .part96`
## Critical recommendation for the next session
**Use Sonnet 4.6 for subagent extractions, not Opus.** Opus was used through chunks 1-64 and burned through a 5-hour rate-limit window faster than this scale needs. Sonnet has 200K context which is plenty for the ~25k-token prompt + ~20k-token output of a single chunk extraction. The task is structured JSON extraction with a fixed schema — no complex reasoning needed.
The Agent tool's `model` parameter takes `"sonnet"`. Pass it on every Agent call below.
## Resuming — exact mechanical steps
1. **Verify state.**
```bash
cd /workspace/game-library
ls data/extracted/*.json | wc -l # should be 64 (or higher if more done)
ls data/chunks/_prompts/ | wc -l # 588 — the full prompt set
git log --oneline -3 # 3d9f266 must be HEAD or earlier
```
2. **Find what's still pending.** Compare prompts to extracted files:
```bash
ls data/chunks/_prompts/ | sed 's/\.prompt\.md$//' | sort > /tmp/all.txt
ls data/extracted/*.json 2>/dev/null | sed 's|.*/||;s|\.json$||' | sort > /tmp/done.txt
comm -23 /tmp/all.txt /tmp/done.txt > /tmp/pending.txt
wc -l /tmp/pending.txt # how many remain
head /tmp/pending.txt # what's next
```
3. **Launch waves of 16 subagents in parallel.** One Agent call per chunk. Single message with 16 Agent tool calls. Use this exact template (substitute `<CHUNK_KEY>`):
```
Agent(
description: "Extract <CHUNK_KEY>",
subagent_type: "general-purpose",
model: "sonnet", ← critical
prompt: "Working directory: /workspace/game-library. Extraction subagent — read data/chunks/_prompts/<CHUNK_KEY>.prompt.md and follow it EXACTLY. Apply rules from scripts/SUBAGENT_PROMPT.md and schema from scripts/activity_schema.json. Write the JSON. Set language per chunk content ('ro' or 'en'). Report under 40 words."
)
```
The per-chunk prompt file is fully self-contained — it points to the right chunk, sets source_id/source_hash/chunk_key, and references the rules + schema. The subagent just follows it.
4. **After every wave**, briefly check progress and continue:
```bash
ls data/extracted/*.json | wc -l
```
Repeat step 3 with the next 16 pending chunks. If an agent reports `"You've hit your limit · resets ..."` AND `tool_uses: 5` with `total_tokens: 0`, check whether the JSON was written anyway — agents often persist the file before the limit hit. Only re-launch if the JSON is missing.
5. **When all 588 chunks are done**, finalize:
```bash
python3 scripts/validate_extractions.py # any chunks marked rejected go to data/extracted/_reextract/
# re-extract any rejected chunks (same template, prompt from _reextract/)
python3 scripts/build_database.py --rebuild
# if many borderline needs_review rows:
python3 -c "
import sys; sys.path.insert(0,'scripts')
from import_common import content_key, normalize_name
import sqlite3, json
conn = sqlite3.connect('data/activities.db')
conn.row_factory = sqlite3.Row
rows = list(conn.execute('SELECT name, normalized_name, language, description FROM activities WHERE needs_review=1'))
d = {content_key(r['normalized_name'] or normalize_name(r['name']), r['language'], r['description'] or ''): 'merge' for r in rows}
json.dump(d, open('data/review_decisions.json','w'), indent=2)
print(f'{len(d)} merge decisions')
"
python3 scripts/build_database.py --rebuild # apply decisions
python3 -m pytest tests/ -q # 71 should pass
git add data/activities.db data/review_decisions.json
git commit -m "Faza 1: full corpus extraction"
```
## Code reference — what each script does
- `scripts/normalize_sources.py --corpus data/carti-camp-jocuri --out data/sources` → produces 96 `data/sources/<id>.txt` files with `--- PAGE N ---` markers. **Done. Don't re-run.**
- `scripts/chunk_sources.py --sources data/sources --chunks data/chunks` → splits each into ~20pg chunks with 4pg overlap, writes `data/chunks/<id>/<id>.partNN.txt` and updates `data/chunks/manifest.json`. **Done. Don't re-run unless sources change.**
- `scripts/run_extraction.py` → regenerates the per-chunk prompts in `data/chunks/_prompts/`. **Done. Don't re-run unless schema/prompt changes.**
- `scripts/SUBAGENT_PROMPT.md` — extraction rules (what subagents follow).
- `scripts/activity_schema.json` — JSON schema each extraction must validate against.
- `scripts/validate_extractions.py` — per-file schema check + fuzzy `source_excerpt` substring check; writes re-extraction prompts to `data/extracted/_reextract/` for rejected chunks; marks chunks `rejected` in manifest.
- `scripts/build_database.py --rebuild` — validates every `data/extracted/*.json` against schema, drops per-activity hallucinations, dedup, applies `data/review_decisions.json`, atomic swap into `data/activities.db`.
- `scripts/review_queue.py list|resolve <id> <merge|keep-separate|drop>` — CLI for borderline-dedup decisions; persisted in `data/review_decisions.json`.
## Pilot lessons that apply
- **~1.07% hallucinated drops** at pilot scale (well below the 2% threshold). Caused by source_excerpts straddling `--- PAGE N ---` markers. Re-extraction with verbatim within-page quotes fixed all 13 affected chunks. Expect similar rate at Faza 1 scale (~10-30 chunks may need re-extraction).
- **Borderline dedup queue** (369 rows in pilot) — same-name activities re-extracted from chunk overlap with slightly-different prose. Bulk-merge is the right call: same normalized_name + same language + 60-85% desc similarity → merge takes the longest fields. Use the snippet in step 5 above.
- **OCR not needed.** The candidate scanned PDF (`07.Cartea_Mare`) extracted 151 pages of real text via pdfplumber alone. Skip OCR for v1.
## Files not yet committed (uncommitted in this session)
- `data/sources/` — all 96 normalized `.txt` files (in `.gitignore`, don't try to commit them)
- `data/chunks/` — all 588 chunks + manifest (in `.gitignore`)
- `data/extracted/` — 64 JSON files so far (in `.gitignore`)
- `data/activities.db` — **still the pilot's 1751-row DB.** Will be rebuilt after Faza 1 finishes.
The schema, all scripts, all tests, and the pilot DB are already committed at `3d9f266`. No code changes are needed for Faza 1 — just data.
-## Status snapshot (as of handoff)
-```
-chunks done : 64 / 588 (10.9%)
-activities so far : 1949
-remaining chunks : 524
-largest pending sources:
-87850302_dragon_sleepdeprived 116 chunks (full dragon mirror)
-c3162825_resource_pack__learning_by_playing 97 chunks (catalunya mirror)
-4da6431e_cub_scout_leader_how_to_book 18
-4a765782_1000_fantastic_scout_games 18 (re-extract; was in pilot)
-bee67427_the_big_book_of_conflict_resolution 15
-e3bd0953_1001_idei_pentru_o_educatie_timp 14
-d5e51389_09_culegere_de_jocuri_si_povestiri 13
-ce4b48f1_impact_culegere_de_jocuri_si_povest 13
-193fdd94_ghid_de_integrare_a_persoanelor_vul 12 (in progress)
-779f4fa0_ghidul_animatorului_855_de_jocuri 11
-```
-In a fresh session: `cat HANDOFF.md`, then go straight to step 3 above.
+**7 need RE-EXTRACTION** (their malformed-original JSON was destroyed — see "json_repair
+incident" below; re-extract once the subagent session limit resets, ~5pm UTC):
+```
+3f9c8232_teambuilding_corbu_29092023.part01
+5f959f85_scoli_fara_bullying.part02
+83057f6e_31_scurta_incursiune_printre_jocurile_copilariei_asociatia_c.part04
+d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part01
+d297a434_07_cartea_mare_a_jocurilor_salvati_copiii_suedia_unicef.part04
+d5e51389_09_culegere_de_jocuri_si_povestiri_impact_noi_orizonturi.part05
+e3bd0953_02_1001_idei_pentru_o_educatie_timpurie_de_calitate_minister.part03
+```
+Re-extract these (Sonnet subagents, one Agent call each, the per-chunk prompt is at
+`data/chunks/_prompts/<key>.prompt.md`), then **re-run the freeze rebuild** so they join
+the corpus before enrichment. Re-freezing is safe now — enrichment has NOT run, so no
+overlay keys depend on the current freeze yet.
+## The json_repair incident (important — root cause + what was fixed)
Subagents **systematically emit unescaped ASCII `"` inside string values** (Romanian
text like `„Unu"` uses a closing `"` that terminates the JSON string early). ~34 files
were affected.
First repair attempt used the `json_repair` lib. **It truncates**: on a stray quote it
ends the string and reinterprets the trailing text as a new key, silently dropping the
rest of the value and injecting garbage keys. Schema `additionalProperties:false` caught
the garbage-key cases (8 files dropped at rebuild), but the truncation that didn't create
an extra key slipped through. Applying json_repair output to disk also **overwrote the
malformed originals** for those 8 → originals lost → those (now 7, one recovered) need
re-extraction.
**Fix:** `scripts/repair_extractions.py` was rewritten to use a faithful char-scanner
(`escape_stray_quotes`) that **escapes** stray quotes (`\"`) instead of splitting on them,
validates against the real schema, and only replaces a valid top-level file when the
repaired version carries **strictly more text** (a length guard that catches truncated
json_repair output while leaving genuine extractions untouched). Re-running it cleanly
repaired the affected files; the final freeze had **0 schema-rejected, 0 invalid**.
`json_repair` is no longer used anywhere. Do NOT reintroduce it.
`build_database.py` does NOT depend on the repair script (the "DB regenerable from
data/extracted/" invariant holds — plain `json.loads` only).
## What the code does now (all committed)
**Part A — plumbing (corpus-independent):**
- `app/models/database.py`: new columns `name_ro/description_ro/rules_ro/variations_ro,
indoor_outdoor, space_needed, estimated_fields(JSON), source_id, source_ids(JSON),
chunk_key`; FTS5 indexes the 4 `*_ro` columns (CREATE + all 3 triggers — kept in sync);
indexes on `indoor_outdoor`/`space_needed`; `search_activities` gained `indoor_outdoor`
and `space_needed` equality kwargs; `_update_category_counts` feeds both new axes into
the categories table so dropdowns populate.
- `app/models/activity.py`: new fields + `to_dict`/`from_dict`; helpers `get_display_name`
/ `get_display_description` / `get_display_rules` / `get_display_variations`
(RO-primary, EN fallback), `has_translation`, `is_estimated(field)`,
`get_indoor_outdoor_display`, `get_space_needed_display`.
- `app/config_taxonomy.py`: `INDOOR_OUTDOOR`, `SPACE_NEEDED` enums + RO labels +
`normalize_indoor_outdoor` / `normalize_space_needed` (None on unrecognised, no
fallback — never fabricate a value) + display-name helpers.
- `scripts/build_database.py`: `dict_to_activity` sets `source_id`+`chunk_key`;
`merge_cluster` unions `source_ids` and carries rep's `source_id`/`chunk_key` but
**never** touches enrichment fields (those are applied post-dedup).
**Part A — UI/search:**
- `app/services/search.py`: `_map_filters_to_db_fields` maps `indoor_outdoor`/
`space_needed` to DB equality filters.
- `app/web/routes.py`: new `/source/<id>` download route — **shipped DARK behind
`SOURCE_DOWNLOAD_ENABLED` (default false; copyright exposure, user opts in)**; resolves
`source_file` under `CORPUS_DIR` via `send_from_directory` (traversal-safe, 404s for
web-mirror sources). `DISPLAY_NAMES` extended with both new axes.
- `app/config.py`: `SOURCE_DOWNLOAD_ENABLED`, `CORPUS_DIR`.
- Templates: `index.html`/`results.html` have the 2 new dropdowns; cards use display
helpers + `(estimat)` markers; `activity.html` is RO-primary with a collapsible
"Text original" section, indoor/space cards, estimat markers, and the download link
(only when the flag is on). `main.css` has `.estimated` / `.original-text` styles.
**Part B — enrichment pipeline (built, not yet run):**
- `scripts/build_database.py`: `load_enrichment` + `apply_enrichment(activities, enrichment)`
applied **right after** `apply_review_decisions`, on the post-dedup list, keyed on
`import_common.content_key(normalized_name, language, _normalize_text(description))`
(reused verbatim). CLI `--enrichment` (default `data/enrichment.json`). QA report prints
`enrichment {entries, matched, orphaned}` + per-field **stated vs estimated** counts.
Translated/expanded text is NOT re-validated against source (by design).
- `scripts/run_enrichment.py`: reads the rebuilt DB, computes each row's content_key,
skips rows already in `data/enrichment_parts/<key>.json` (resumable), emits one prompt
per activity to `data/enrichment_prompts/` (current EN fields + source chunk text via
`find_chunk_text`). Pilot scoping: `--source <substr>` and/or `--limit N`. `--collect`
merges parts → `data/enrichment.json`.
- `scripts/ENRICHMENT_PROMPT.md`: single-pass rules — translate faithfully, expand
`description_ro` ONLY from chunk text, mark inferred filter fields in `estimated_fields`,
fixed enum vocab, output `data/enrichment_parts/<content_key>.json` including `content_key`.
## Exact next steps
1. **Re-extract the 7 chunks** above (after session-limit reset). Verify each writes valid
JSON (`python3 -c "import json,glob; [json.loads(open(f).read()) for f in glob.glob('data/extracted/*.json')]"`).
If any come back malformed, `python3 scripts/repair_extractions.py --apply` (faithful now).
2. **Re-freeze:** `python3 scripts/build_database.py --rebuild` — confirm 0 schema-rejected,
note the new total (~9418 + the 7 chunks' activities).
3. **Enrichment PILOT** (plan B5 — the STOP gate guarding 68k LLM calls):
- Pick one source, e.g. `python3 scripts/run_enrichment.py --source teambuilding_corbu`
(or `--limit 30`). This writes prompts to `data/enrichment_prompts/`.
- Launch a small wave of Sonnet subagents on those prompts (each writes
`data/enrichment_parts/<key>.json`).
- `python3 scripts/run_enrichment.py --collect` → `data/enrichment.json`.
- `python3 scripts/build_database.py --rebuild` (picks up `--enrichment` by default).
- **STOP. Hand the user translation-quality + estimation-plausibility + description-
fidelity samples and get sign-off BEFORE scaling to the full corpus.** Do not
auto-proceed past this gate.
4. After sign-off: scale enrichment in waves of ~8-16 Sonnet subagents, `--collect`,
final `--rebuild --enrichment`.
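
The one-liner in step 1 stops at the first malformed file and does not say which one it was. A small sketch that reports every failing file instead (the function name `find_invalid_json` is hypothetical and not part of the repo; it uses plain `json.loads` only, like the build):

```python
import json
from pathlib import Path


def find_invalid_json(directory):
    """Return (file name, error message) for every *.json that fails to parse."""
    failures = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            # Keep scanning so one bad file does not hide the others.
            failures.append((path.name, str(exc)))
    return failures
```

Calling `find_invalid_json("data/extracted")` after the re-extraction wave should return an empty list before the re-freeze is attempted.
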
## Verify / run
- Tests: `python3 -m pytest tests/ -q` → 99 pass.
- App: `SOURCE_DOWNLOAD_ENABLED` is false by default (download link hidden). Set it true
only if the user accepts the copyright exposure of serving original files.
- `data/activities.db.bak` is the pre-this-freeze backup.

@@ -22,6 +22,18 @@ class Config:
# Search settings
SEARCH_RESULTS_LIMIT = int(os.environ.get('SEARCH_RESULTS_LIMIT', '100'))
FTS_ENABLED = True
# Source-file download (plan A6). Shipped DARK by default: serving the
# original PDFs/books carries a copyright exposure the user must opt into.
# The /source/<id> route 404s entirely while this is false; the UI hides
# the download link. Enable with SOURCE_DOWNLOAD_ENABLED=true.
SOURCE_DOWNLOAD_ENABLED = (
os.environ.get('SOURCE_DOWNLOAD_ENABLED', 'false').lower() == 'true'
)
# Root of the original corpus. source_file values are relative to this.
CORPUS_DIR = os.environ.get('CORPUS_DIR') or str(
Path(__file__).parent.parent / 'data' / 'carti-camp-jocuri'
)
@staticmethod
def ensure_directories():

@@ -8,7 +8,7 @@ the UI displays the Romanian name. `category` (thematic domain) and
import unicodedata
import re
-from typing import Dict, List
+from typing import Dict, List, Optional
# --- Categories (thematic domain) --------------------------------------------
# slug -> Romanian display name. ~16 fixed slugs; `altele` is the mandatory
@@ -215,6 +215,89 @@ def normalize_content_type(value: str) -> str:
return aliases.get(slug, DEFAULT_CONTENT_TYPE)
# --- Indoor / outdoor (enrichment axis) --------------------------------------
# Where the activity is run. Inferred during enrichment when the source is
# silent — such inferences are flagged in `estimated_fields`. slug -> RO label.
INDOOR_OUTDOOR: Dict[str, str] = {
"indoor": "Interior",
"outdoor": "Exterior",
"either": "Interior sau exterior",
}
# --- Space needed (enrichment axis) ------------------------------------------
# Rough footprint the activity requires. slug -> RO label.
SPACE_NEEDED: Dict[str, str] = {
"mic": "Spațiu mic",
"mediu": "Spațiu mediu",
"mare": "Spațiu mare",
}
# Aliases for robustness against LLM output variation. Keys are _slugify'd.
_INDOOR_OUTDOOR_ALIASES: Dict[str, str] = {
"interior": "indoor",
"inside": "indoor",
"in": "indoor",
"exterior": "outdoor",
"outside": "outdoor",
"out": "outdoor",
"aer-liber": "outdoor",
"both": "either",
"any": "either",
"ambele": "either",
"interior-exterior": "either",
"indoor-outdoor": "either",
}
_SPACE_NEEDED_ALIASES: Dict[str, str] = {
"small": "mic",
"redus": "mic",
"putin": "mic",
"medium": "mediu",
"moderat": "mediu",
"large": "mare",
"big": "mare",
"mult": "mare",
"spatiu-mic": "mic",
"spatiu-mediu": "mediu",
"spatiu-mare": "mare",
}
def normalize_indoor_outdoor(value: str) -> Optional[str]:
"""Map an arbitrary string to an indoor_outdoor slug, or None.
Unlike categories, this has NO mandatory fallback: an unrecognised or
empty value yields None (field simply absent), so we never fabricate a
location the enrichment did not assert.
"""
if not value:
return None
slug = _slugify(str(value))
if slug in INDOOR_OUTDOOR:
return slug
return _INDOOR_OUTDOOR_ALIASES.get(slug)
def normalize_space_needed(value: str) -> Optional[str]:
"""Map an arbitrary string to a space_needed slug, or None (no fallback)."""
if not value:
return None
slug = _slugify(str(value))
if slug in SPACE_NEEDED:
return slug
return _SPACE_NEEDED_ALIASES.get(slug)
def indoor_outdoor_display_name(slug: str) -> str:
"""RO display name for an indoor_outdoor slug."""
return INDOOR_OUTDOOR.get(slug, slug)
def space_needed_display_name(slug: str) -> str:
"""RO display name for a space_needed slug."""
return SPACE_NEEDED.get(slug, slug)
def is_valid_category(slug: str) -> bool:
"""True if `slug` is a valid category slug."""
return slug in CATEGORIES

@@ -76,6 +76,25 @@ class Activity:
extraction_confidence: Optional[str] = None # 'high' / 'med' / 'low'
needs_review: int = 0
# Enrichment overlay (applied at build time from data/enrichment.json; see
# plan Part B). Bilingual: the EN/source text stays in name/description/...
# and the Romanian rendering lands in the *_ro twins. Absent fields leave
# the underlying DB value untouched.
name_ro: Optional[str] = None
description_ro: Optional[str] = None
rules_ro: Optional[str] = None
variations_ro: Optional[str] = None
indoor_outdoor: Optional[str] = None # slug: indoor / outdoor / either
space_needed: Optional[str] = None # slug: mic / mediu / mare
# Names of fields whose value was INFERRED by enrichment (source was
# silent) rather than stated in the source — surfaced as "(estimat)" in UI.
estimated_fields: List[str] = field(default_factory=list)
# Source provenance for the download route + enrichment keying.
source_id: Optional[str] = None # e.g. "876d1a2d_marcaje_turistice"
source_ids: List[str] = field(default_factory=list) # all source_ids merged
chunk_key: Optional[str] = None # e.g. "<source_id>.part01"
# Database fields
id: Optional[int] = None
created_at: Optional[str] = None
@@ -117,6 +136,16 @@ class Activity:
'normalized_name': self.normalized_name or normalize_name(self.name),
'extraction_confidence': self.extraction_confidence,
'needs_review': self.needs_review,
'name_ro': self.name_ro,
'description_ro': self.description_ro,
'rules_ro': self.rules_ro,
'variations_ro': self.variations_ro,
'indoor_outdoor': self.indoor_outdoor,
'space_needed': self.space_needed,
'estimated_fields': json.dumps(self.estimated_fields) if self.estimated_fields else None,
'source_id': self.source_id,
'source_ids': json.dumps(self.source_ids) if self.source_ids else None,
'chunk_key': self.chunk_key,
}
@classmethod
@@ -140,6 +169,19 @@ class Activity:
elif source_files is None:
source_files = []
# estimated_fields / source_ids: JSON string (DB) or list (in-memory)
def _json_list(value):
if isinstance(value, str):
try:
parsed = json.loads(value)
return parsed if isinstance(parsed, list) else []
except (json.JSONDecodeError, TypeError):
return []
return list(value) if value else []
estimated_fields = _json_list(data.get('estimated_fields'))
source_ids = _json_list(data.get('source_ids'))
return cls(
id=data.get('id'),
name=data.get('name', ''),
@@ -170,6 +212,16 @@ class Activity:
normalized_name=data.get('normalized_name'),
extraction_confidence=data.get('extraction_confidence'),
needs_review=data.get('needs_review', 0) or 0,
name_ro=data.get('name_ro'),
description_ro=data.get('description_ro'),
rules_ro=data.get('rules_ro'),
variations_ro=data.get('variations_ro'),
indoor_outdoor=data.get('indoor_outdoor'),
space_needed=data.get('space_needed'),
estimated_fields=estimated_fields,
source_id=data.get('source_id'),
source_ids=source_ids,
chunk_key=data.get('chunk_key'),
created_at=data.get('created_at'),
updated_at=data.get('updated_at')
)
@@ -210,4 +262,44 @@ class Activity:
return self.materials_category
elif self.materials_list:
return self.materials_list[:100] + "..." if len(self.materials_list) > 100 else self.materials_list
return "nu specificate"
# --- Enrichment / bilingual display helpers ------------------------------
def get_display_name(self) -> str:
"""Romanian name when enriched, else the original."""
return self.name_ro or self.name
def get_display_description(self) -> str:
"""Romanian description when enriched, else the original."""
return self.description_ro or self.description
def get_display_rules(self) -> Optional[str]:
"""Romanian rules when enriched, else the original."""
return self.rules_ro or self.rules
def get_display_variations(self) -> Optional[str]:
"""Romanian variations when enriched, else the original."""
return self.variations_ro or self.variations
def has_translation(self) -> bool:
"""True if any Romanian enrichment text is present."""
return bool(self.name_ro or self.description_ro
or self.rules_ro or self.variations_ro)
def is_estimated(self, field_name: str) -> bool:
"""True if `field_name` was inferred by enrichment (source was silent)."""
return field_name in (self.estimated_fields or [])
def get_indoor_outdoor_display(self) -> Optional[str]:
"""RO label for indoor_outdoor, or None when unset."""
if not self.indoor_outdoor:
return None
from app.config_taxonomy import indoor_outdoor_display_name
return indoor_outdoor_display_name(self.indoor_outdoor)
def get_space_needed_display(self) -> Optional[str]:
"""RO label for space_needed, or None when unset."""
if not self.space_needed:
return None
from app.config_taxonomy import space_needed_display_name
return space_needed_display_name(self.space_needed)

@@ -72,6 +72,18 @@ class DatabaseManager:
extraction_confidence TEXT,
needs_review INTEGER DEFAULT 0,
-- Enrichment overlay (bilingual + inferred filters; Part B)
name_ro TEXT,
description_ro TEXT,
rules_ro TEXT,
variations_ro TEXT,
indoor_outdoor TEXT,
space_needed TEXT,
estimated_fields TEXT,
source_id TEXT,
source_ids TEXT,
chunk_key TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
@@ -82,6 +94,7 @@ class DatabaseManager:
CREATE VIRTUAL TABLE IF NOT EXISTS activities_fts USING fts5(
name, description, rules, variations, keywords,
materials_list, skills_developed,
name_ro, description_ro, rules_ro, variations_ro,
content='activities',
content_rowid='id'
)
@@ -106,6 +119,8 @@ class DatabaseManager:
"CREATE INDEX IF NOT EXISTS idx_activities_participants ON activities(participants_min, participants_max)",
"CREATE INDEX IF NOT EXISTS idx_activities_duration ON activities(duration_min, duration_max)",
"CREATE INDEX IF NOT EXISTS idx_activities_normalized_name ON activities(normalized_name)",
"CREATE INDEX IF NOT EXISTS idx_activities_indoor_outdoor ON activities(indoor_outdoor)",
"CREATE INDEX IF NOT EXISTS idx_activities_space_needed ON activities(space_needed)",
"CREATE INDEX IF NOT EXISTS idx_categories_type ON categories(type)"
]
@@ -117,9 +132,11 @@ class DatabaseManager:
CREATE TRIGGER IF NOT EXISTS activities_fts_insert AFTER INSERT ON activities
BEGIN
INSERT INTO activities_fts(rowid, name, description, rules, variations,
-keywords, materials_list, skills_developed)
+keywords, materials_list, skills_developed,
+name_ro, description_ro, rules_ro, variations_ro)
VALUES (new.id, new.name, new.description, new.rules, new.variations,
-new.keywords, new.materials_list, new.skills_developed);
+new.keywords, new.materials_list, new.skills_developed,
+new.name_ro, new.description_ro, new.rules_ro, new.variations_ro);
END
""")
@@ -127,9 +144,11 @@ class DatabaseManager:
CREATE TRIGGER IF NOT EXISTS activities_fts_delete AFTER DELETE ON activities
BEGIN
INSERT INTO activities_fts(activities_fts, rowid, name, description, rules,
-variations, keywords, materials_list, skills_developed)
+variations, keywords, materials_list, skills_developed,
+name_ro, description_ro, rules_ro, variations_ro)
VALUES ('delete', old.id, old.name, old.description, old.rules,
-old.variations, old.keywords, old.materials_list, old.skills_developed);
+old.variations, old.keywords, old.materials_list, old.skills_developed,
+old.name_ro, old.description_ro, old.rules_ro, old.variations_ro);
END
""")
@@ -137,13 +156,17 @@ class DatabaseManager:
 CREATE TRIGGER IF NOT EXISTS activities_fts_update AFTER UPDATE ON activities
 BEGIN
     INSERT INTO activities_fts(activities_fts, rowid, name, description, rules,
-        variations, keywords, materials_list, skills_developed)
+        variations, keywords, materials_list, skills_developed,
+        name_ro, description_ro, rules_ro, variations_ro)
     VALUES ('delete', old.id, old.name, old.description, old.rules,
-        old.variations, old.keywords, old.materials_list, old.skills_developed);
+        old.variations, old.keywords, old.materials_list, old.skills_developed,
+        old.name_ro, old.description_ro, old.rules_ro, old.variations_ro);
     INSERT INTO activities_fts(rowid, name, description, rules, variations,
-        keywords, materials_list, skills_developed)
+        keywords, materials_list, skills_developed,
+        name_ro, description_ro, rules_ro, variations_ro)
     VALUES (new.id, new.name, new.description, new.rules, new.variations,
-        new.keywords, new.materials_list, new.skills_developed);
+        new.keywords, new.materials_list, new.skills_developed,
+        new.name_ro, new.description_ro, new.rules_ro, new.variations_ro);
 END
 """)
@@ -210,6 +233,10 @@ class DatabaseManager:
             ('duration', activity.get_duration_display()),
             ('materials', activity.get_materials_display()),
             ('difficulty', activity.difficulty_level),
+            # Enrichment axes — slugs stored as value; UI maps to RO via
+            # DISPLAY_NAMES. Without these the new dropdowns would be empty.
+            ('indoor_outdoor', activity.indoor_outdoor),
+            ('space_needed', activity.space_needed),
         ]
         for cat_type, cat_value in categories_to_update:
@@ -236,6 +263,8 @@ class DatabaseManager:
                           duration_max: Optional[int] = None,
                           materials_category: Optional[str] = None,
                           difficulty_level: Optional[str] = None,
+                          indoor_outdoor: Optional[str] = None,
+                          space_needed: Optional[str] = None,
                           limit: int = 100) -> List[Dict[str, Any]]:
         """Enhanced search with FTS5 and filters"""
@@ -293,7 +322,15 @@ class DatabaseManager:
         if difficulty_level:
             base_query += " AND difficulty_level = ?"
             params.append(difficulty_level)
+        if indoor_outdoor:
+            base_query += " AND indoor_outdoor = ?"
+            params.append(indoor_outdoor)
+        if space_needed:
+            base_query += " AND space_needed = ?"
+            params.append(space_needed)
         # Add ordering and limit
         query = f"{base_query} {order_clause} LIMIT ?"
         params.append(limit)
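
The two new equality filters follow the same pattern as `difficulty_level`: each one appends a parameterised clause only when a value is supplied. A minimal, self-contained sketch of that pattern is below. The table layout, the sample rows, and the helper names `build_filter_query` and `demo_search` are assumptions made for illustration; they are not the project's actual schema or API.

```python
import sqlite3
from typing import Any, List, Optional, Tuple


def build_filter_query(
    indoor_outdoor: Optional[str] = None,
    space_needed: Optional[str] = None,
    limit: int = 100,
) -> Tuple[str, List[Any]]:
    """Append one parameterised equality clause per filter that is set."""
    query = "SELECT name FROM activities WHERE 1=1"
    params: List[Any] = []
    if indoor_outdoor:
        query += " AND indoor_outdoor = ?"
        params.append(indoor_outdoor)
    if space_needed:
        query += " AND space_needed = ?"
        params.append(space_needed)
    query += " ORDER BY name LIMIT ?"
    params.append(limit)
    return query, params


def demo_search(**filters: Any) -> List[str]:
    """Run the query against a throwaway in-memory table (illustrative rows)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE activities "
            "(name TEXT, indoor_outdoor TEXT, space_needed TEXT)"
        )
        conn.executemany(
            "INSERT INTO activities VALUES (?, ?, ?)",
            [
                ("Charades", "indoor", "mic"),
                ("Relay", "outdoor", "mediu"),
                ("Riddles", None, None),  # not enriched yet: NULL on both axes
                ("Tag", "outdoor", "mare"),
            ],
        )
        query, params = build_filter_query(**filters)
        return [row[0] for row in conn.execute(query, params)]
    finally:
        conn.close()
```

One consequence worth keeping in mind: a row whose column is NULL never matches an equality clause, so activities that have not been enriched drop out of the results as soon as either filter is selected.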


@@ -200,7 +200,14 @@ class SearchService:
                 elif filter_key == 'difficulty':
                     db_filters['difficulty_level'] = filter_value
+                elif filter_key == 'indoor_outdoor':
+                    # Equality filter on the slug column (mirror difficulty).
+                    db_filters['indoor_outdoor'] = filter_value
+                elif filter_key == 'space_needed':
+                    db_filters['space_needed'] = filter_value
                 # Handle any other custom filters
                 else:
                     # Generic filter handling - try to match against keywords or tags


@@ -705,4 +705,30 @@ body {
box-shadow: none; box-shadow: none;
border: 1px solid #ddd; border: 1px solid #ddd;
} }
}
/* Enrichment markers (plan Part A7) */
.estimated {
color: #8a6d3b;
font-style: italic;
font-size: 0.85em;
font-weight: normal;
}
.original-text > summary {
cursor: pointer;
color: #555;
user-select: none;
}
.original-text .original-content {
margin-top: 0.75rem;
padding-left: 1rem;
border-left: 3px solid #e0e0e0;
color: #555;
}
.download-hint {
color: #888;
font-size: 0.85em;
} }


@@ -8,13 +8,13 @@
<nav class="breadcrumb"> <nav class="breadcrumb">
<a href="{{ url_for('main.index') }}">Căutare</a> <a href="{{ url_for('main.index') }}">Căutare</a>
<span class="breadcrumb-separator">»</span> <span class="breadcrumb-separator">»</span>
<span class="breadcrumb-current">{{ activity.name }}</span> <span class="breadcrumb-current">{{ activity.get_display_name() }}</span>
</nav> </nav>
<!-- Activity header --> <!-- Activity header -->
<header class="activity-detail-header"> <header class="activity-detail-header">
<div class="activity-title-section"> <div class="activity-title-section">
<h1 class="activity-detail-title">{{ activity.name }}</h1> <h1 class="activity-detail-title">{{ activity.get_display_name() }}</h1>
<span class="activity-category-badge">{{ display_names.get(activity.category, activity.category) }}</span> <span class="activity-category-badge">{{ display_names.get(activity.category, activity.category) }}</span>
{% if activity.content_type %} {% if activity.content_type %}
<span class="activity-content-type-badge">{{ display_names.get(activity.content_type, activity.content_type) }}</span> <span class="activity-content-type-badge">{{ display_names.get(activity.content_type, activity.content_type) }}</span>
@@ -31,27 +31,46 @@
<!-- Activity content --> <!-- Activity content -->
<div class="activity-detail-content"> <div class="activity-detail-content">
<!-- Main description --> <!-- Main description (Romanian-primary, falls back to original) -->
<section class="activity-section"> <section class="activity-section">
<h2 class="section-title">Descriere</h2> <h2 class="section-title">Descriere</h2>
<div class="activity-description">{{ activity.description }}</div> <div class="activity-description">{{ activity.get_display_description() }}</div>
</section> </section>
<!-- Rules and variations --> <!-- Rules and variations -->
{% if activity.rules %} {% if activity.get_display_rules() %}
<section class="activity-section"> <section class="activity-section">
<h2 class="section-title">Reguli</h2> <h2 class="section-title">Reguli</h2>
<div class="activity-rules">{{ activity.rules }}</div> <div class="activity-rules">{{ activity.get_display_rules() }}</div>
</section> </section>
{% endif %} {% endif %}
{% if activity.variations %} {% if activity.get_display_variations() %}
<section class="activity-section"> <section class="activity-section">
<h2 class="section-title">Variații</h2> <h2 class="section-title">Variații</h2>
<div class="activity-variations">{{ activity.variations }}</div> <div class="activity-variations">{{ activity.get_display_variations() }}</div>
</section> </section>
{% endif %} {% endif %}
<!-- Original (pre-translation) text, collapsed by default -->
{% if activity.has_translation() %}
<details class="activity-section original-text">
<summary class="section-title">Text original ({{ display_names.get(activity.language, activity.language or 'sursă') }})</summary>
<div class="original-content">
<h3 class="metadata-title">{{ activity.name }}</h3>
<div class="activity-description">{{ activity.description }}</div>
{% if activity.rules %}
<h4 class="metadata-title">Reguli</h4>
<div class="activity-rules">{{ activity.rules }}</div>
{% endif %}
{% if activity.variations %}
<h4 class="metadata-title">Variații</h4>
<div class="activity-variations">{{ activity.variations }}</div>
{% endif %}
</div>
</details>
{% endif %}
<!-- Metadata grid --> <!-- Metadata grid -->
<section class="activity-section"> <section class="activity-section">
<h2 class="section-title">Detalii activitate</h2> <h2 class="section-title">Detalii activitate</h2>
@@ -59,21 +78,35 @@
{% if activity.get_age_range_display() != "toate vârstele" %} {% if activity.get_age_range_display() != "toate vârstele" %}
<div class="metadata-card"> <div class="metadata-card">
<h3 class="metadata-title">Grupa de vârstă</h3> <h3 class="metadata-title">Grupa de vârstă</h3>
<p class="metadata-value">{{ activity.get_age_range_display() }}</p> <p class="metadata-value">{{ activity.get_age_range_display() }}{% if activity.is_estimated('age_group_min') or activity.is_estimated('age_group_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
</div> </div>
{% endif %} {% endif %}
{% if activity.get_participants_display() != "orice număr" %} {% if activity.get_participants_display() != "orice număr" %}
<div class="metadata-card"> <div class="metadata-card">
<h3 class="metadata-title">Participanți</h3> <h3 class="metadata-title">Participanți</h3>
<p class="metadata-value">{{ activity.get_participants_display() }}</p> <p class="metadata-value">{{ activity.get_participants_display() }}{% if activity.is_estimated('participants_min') or activity.is_estimated('participants_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
</div> </div>
{% endif %} {% endif %}
{% if activity.get_duration_display() != "durată variabilă" %} {% if activity.get_duration_display() != "durată variabilă" %}
<div class="metadata-card"> <div class="metadata-card">
<h3 class="metadata-title">Durata</h3> <h3 class="metadata-title">Durata</h3>
<p class="metadata-value">{{ activity.get_duration_display() }}</p> <p class="metadata-value">{{ activity.get_duration_display() }}{% if activity.is_estimated('duration_min') or activity.is_estimated('duration_max') %} <em class="estimated">(estimat)</em>{% endif %}</p>
</div>
{% endif %}
{% if activity.get_indoor_outdoor_display() %}
<div class="metadata-card">
<h3 class="metadata-title">Interior / exterior</h3>
<p class="metadata-value">{{ activity.get_indoor_outdoor_display() }}{% if activity.is_estimated('indoor_outdoor') %} <em class="estimated">(estimat)</em>{% endif %}</p>
</div>
{% endif %}
{% if activity.get_space_needed_display() %}
<div class="metadata-card">
<h3 class="metadata-title">Spațiu necesar</h3>
<p class="metadata-value">{{ activity.get_space_needed_display() }}{% if activity.is_estimated('space_needed') %} <em class="estimated">(estimat)</em>{% endif %}</p>
</div> </div>
{% endif %} {% endif %}
@@ -125,9 +158,15 @@
<h2 class="section-title">Informații sursă</h2> <h2 class="section-title">Informații sursă</h2>
<div class="source-info"> <div class="source-info">
{% if activity.source_file %} {% if activity.source_file %}
{% if config.SOURCE_DOWNLOAD_ENABLED %}
<p><strong>Fișier sursă:</strong>
<a href="{{ url_for('main.source_download', activity_id=activity.id) }}">{{ activity.source_file }}</a>
<span class="download-hint">(descarcă)</span></p>
{% else %}
<p><strong>Fișier sursă:</strong> {{ activity.source_file }}</p> <p><strong>Fișier sursă:</strong> {{ activity.source_file }}</p>
{% endif %} {% endif %}
{% endif %}
{% if activity.page_reference %} {% if activity.page_reference %}
<p><strong>Referință:</strong> {{ activity.page_reference }}</p> <p><strong>Referință:</strong> {{ activity.page_reference }}</p>
{% endif %} {% endif %}


@@ -125,6 +125,30 @@
</select> </select>
</div> </div>
{% endif %} {% endif %}
{% if filters.indoor_outdoor %}
<div class="filter-group">
<label for="indoor_outdoor" class="filter-label">Interior / exterior</label>
<select name="indoor_outdoor" id="indoor_outdoor" class="filter-select">
<option value="">Oriunde</option>
{% for io in filters.indoor_outdoor %}
<option value="{{ io }}">{{ display_names.get(io, io) }}</option>
{% endfor %}
</select>
</div>
{% endif %}
{% if filters.space_needed %}
<div class="filter-group">
<label for="space_needed" class="filter-label">Spațiu necesar</label>
<select name="space_needed" id="space_needed" class="filter-select">
<option value="">Orice spațiu</option>
{% for sp in filters.space_needed %}
<option value="{{ sp }}">{{ display_names.get(sp, sp) }}</option>
{% endfor %}
</select>
</div>
{% endif %}
{% endif %} {% endif %}
</div> </div>


@@ -85,6 +85,28 @@
</select> </select>
{% endif %} {% endif %}
{% if filters.indoor_outdoor %}
<select name="indoor_outdoor" class="filter-select compact">
<option value="">Oriunde</option>
{% for io in filters.indoor_outdoor %}
<option value="{{ io }}" {% if applied_filters.indoor_outdoor == io %}selected{% endif %}>
{{ display_names.get(io, io) }}
</option>
{% endfor %}
</select>
{% endif %}
{% if filters.space_needed %}
<select name="space_needed" class="filter-select compact">
<option value="">Orice spațiu</option>
{% for sp in filters.space_needed %}
<option value="{{ sp }}" {% if applied_filters.space_needed == sp %}selected{% endif %}>
{{ display_names.get(sp, sp) }}
</option>
{% endfor %}
</select>
{% endif %}
<button type="button" class="btn btn-secondary btn-sm" onclick="clearFilters()"> <button type="button" class="btn btn-secondary btn-sm" onclick="clearFilters()">
Resetează Resetează
</button> </button>
@@ -128,7 +150,7 @@
<header class="activity-header"> <header class="activity-header">
<h3 class="activity-title"> <h3 class="activity-title">
<a href="{{ url_for('main.activity_detail', activity_id=activity.id) }}"> <a href="{{ url_for('main.activity_detail', activity_id=activity.id) }}">
{{ activity.name }} {{ activity.get_display_name() }}
</a> </a>
</h3> </h3>
<span class="activity-category">{{ display_names.get(activity.category, activity.category) }}</span> <span class="activity-category">{{ display_names.get(activity.category, activity.category) }}</span>
@@ -138,24 +160,36 @@
</header> </header>
<div class="activity-content"> <div class="activity-content">
<p class="activity-description">{{ activity.description }}</p> <p class="activity-description">{{ activity.get_display_description() }}</p>
<div class="activity-metadata"> <div class="activity-metadata">
{% if activity.get_age_range_display() != "toate vârstele" %} {% if activity.get_age_range_display() != "toate vârstele" %}
<span class="metadata-item"> <span class="metadata-item">
<strong>Vârsta:</strong> {{ activity.get_age_range_display() }} <strong>Vârsta:</strong> {{ activity.get_age_range_display() }}{% if activity.is_estimated('age_group_min') or activity.is_estimated('age_group_max') %} <em class="estimated">(estimat)</em>{% endif %}
</span> </span>
{% endif %} {% endif %}
{% if activity.get_participants_display() != "orice număr" %} {% if activity.get_participants_display() != "orice număr" %}
<span class="metadata-item"> <span class="metadata-item">
<strong>Participanți:</strong> {{ activity.get_participants_display() }} <strong>Participanți:</strong> {{ activity.get_participants_display() }}{% if activity.is_estimated('participants_min') or activity.is_estimated('participants_max') %} <em class="estimated">(estimat)</em>{% endif %}
</span> </span>
{% endif %} {% endif %}
{% if activity.get_duration_display() != "durată variabilă" %} {% if activity.get_duration_display() != "durată variabilă" %}
<span class="metadata-item"> <span class="metadata-item">
<strong>Durata:</strong> {{ activity.get_duration_display() }} <strong>Durata:</strong> {{ activity.get_duration_display() }}{% if activity.is_estimated('duration_min') or activity.is_estimated('duration_max') %} <em class="estimated">(estimat)</em>{% endif %}
</span>
{% endif %}
{% if activity.get_indoor_outdoor_display() %}
<span class="metadata-item">
<strong>Loc:</strong> {{ activity.get_indoor_outdoor_display() }}{% if activity.is_estimated('indoor_outdoor') %} <em class="estimated">(estimat)</em>{% endif %}
</span>
{% endif %}
{% if activity.get_space_needed_display() %}
<span class="metadata-item">
<strong>Spațiu:</strong> {{ activity.get_space_needed_display() }}{% if activity.is_estimated('space_needed') %} <em class="estimated">(estimat)</em>{% endif %}
</span> </span>
{% endif %} {% endif %}
@@ -168,7 +202,11 @@
{% if activity.source_file %} {% if activity.source_file %}
<div class="activity-source"> <div class="activity-source">
{% if config.SOURCE_DOWNLOAD_ENABLED %}
<small>Sursă: <a href="{{ url_for('main.source_download', activity_id=activity.id) }}">{{ activity.source_file }}</a></small>
{% else %}
<small>Sursă: {{ activity.source_file }}</small> <small>Sursă: {{ activity.source_file }}</small>
{% endif %}
</div> </div>
{% endif %} {% endif %}
</div> </div>


@@ -3,20 +3,27 @@ Flask routes for INDEX-SISTEM-JOCURI v2.0
 Clean, minimalist web interface with dynamic filters
 """
-from flask import Blueprint, request, render_template, jsonify, current_app
+from flask import (
+    Blueprint, request, render_template, jsonify, current_app,
+    send_from_directory,
+)
 from app.models.database import DatabaseManager
 from app.models.activity import Activity
 from app.services.search import SearchService
-from app.config_taxonomy import CATEGORIES, CONTENT_TYPES
-import os
-from pathlib import Path
+from app.config_taxonomy import (
+    CATEGORIES, CONTENT_TYPES, INDOOR_OUTDOOR, SPACE_NEEDED,
+)
 bp = Blueprint('main', __name__)
-# Slug -> Romanian display name. Category and content_type slugs never collide,
-# so a single flat map is enough for the UI filter labels.
+# Slug -> Romanian display name. Category, content_type, indoor_outdoor and
+# space_needed slugs never collide, so a single flat map is enough for the UI
+# filter labels.
 LANGUAGE_NAMES = {'ro': 'Română', 'en': 'Engleză'}
-DISPLAY_NAMES = {**CATEGORIES, **CONTENT_TYPES, **LANGUAGE_NAMES}
+DISPLAY_NAMES = {
+    **CATEGORIES, **CONTENT_TYPES, **INDOOR_OUTDOOR, **SPACE_NEEDED,
+    **LANGUAGE_NAMES,
+}
 # Initialize database manager (will be configured in application factory)
 def get_db_manager():
@@ -138,6 +145,44 @@ def activity_detail(activity_id):
print(f"Error loading activity {activity_id}: {e}") print(f"Error loading activity {activity_id}: {e}")
return render_template('404.html'), 404 return render_template('404.html'), 404
@bp.route('/source/<int:activity_id>')
def source_download(activity_id):
"""Download the original source file for an activity (plan A6).
Shipped DARK: returns 404 unless SOURCE_DOWNLOAD_ENABLED is set (copyright
exposure — the user opts in). Resolves the activity's `source_file` under
CORPUS_DIR. send_from_directory does the safe-join and blocks traversal;
web-mirror / extension-less sources that are not real files 404 gracefully.
"""
if not current_app.config.get('SOURCE_DOWNLOAD_ENABLED', False):
return render_template('404.html'), 404
try:
db = get_db_manager()
activity_data = db.get_activity_by_id(activity_id)
if not activity_data:
return render_template('404.html'), 404
source_file = (activity_data.get('source_file') or '').strip()
if not source_file:
return render_template('404.html'), 404
corpus_dir = current_app.config.get('CORPUS_DIR')
if not corpus_dir:
return render_template('404.html'), 404
try:
# send_from_directory rejects path traversal and missing files with
# a 404 (NotFound) — no manual safe_join needed.
return send_from_directory(
corpus_dir, source_file, as_attachment=True
)
except Exception:
# Missing file / web-mirror source with no on-disk original.
return render_template('404.html'), 404
except Exception as e:
print(f"Source download error for {activity_id}: {e}")
return render_template('404.html'), 404
@bp.route('/health') @bp.route('/health')
def health_check(): def health_check():
"""Health check endpoint for Docker""" """Health check endpoint for Docker"""

Binary file not shown.


@@ -0,0 +1,98 @@
# SUBAGENT — Activity enrichment
You are a subagent in the game-library enrichment pipeline. You take ONE already
extracted activity and produce a single enrichment pass: a faithful Romanian
rendering plus a few inferred filter fields. You do **one** activity per prompt.
This is **not** re-extraction. The activity text already exists and is trusted.
Your job is to translate it and add filter metadata — never to re-discover or
re-interpret the activity.
## Your task
The prompt gives you two blocks:
1. **Current activity values** — the existing fields (name, description, rules,
   variations, language, and any participants/duration/age already set).
2. **Source chunk text** — the original passage the activity came from. This is
   your ground truth for any expansion. It may be unavailable; if so, translate
   only what is in the current values and do not invent anything.
Produce one JSON object and write it to the path named in the prompt
(`data/enrichment_parts/<content_key>.json`). It MUST contain the exact
`content_key` string from the prompt.
## Rules
### Translation (always)
- Translate `name`, `description`, `rules`, `variations` into natural, fluent
Romanian → `name_ro`, `description_ro`, `rules_ro`, `variations_ro`.
- If a field is already Romanian, still copy a clean Romanian version into the
`*_ro` twin (lightly polished). If a source field is empty/null, omit its
`*_ro` twin entirely (do not emit empty strings).
- Translate faithfully. Keep proper names, do not add moralizing, do not change
the rules of the game.
### Description expansion (constrained)
- You MAY make `description_ro` richer than a literal translation — but ONLY
using detail that is actually present in the **source chunk text**. Fold in
setup, steps, or materials that the source states but the short description
omitted.
- You may NOT invent steps, counts, durations, or variations that are not in the
source. If the source is thin, the translation stays thin. Hallucinated
expansion is the one unacceptable failure here.
### Inferred filter fields (mark when inferred)
Fill these when you can, using the source text first, then reasonable inference:
- `indoor_outdoor`: one of `indoor`, `outdoor`, `either`.
- `space_needed`: one of `mic`, `mediu`, `mare` (small / medium / large area).
- `participants_min`, `participants_max`: integers (people).
- `duration_min`, `duration_max`: integers (minutes).
- `age_group_min`, `age_group_max`: integers (years).
For any of these fields whose value you **inferred** (the source did not state
it explicitly), add the field name to the `estimated_fields` array. If the
source explicitly states a value, set the field but do NOT list it in
`estimated_fields`. Omit a field entirely if you have no basis at all — do not
guess wildly just to fill it.
Do not contradict a value already present in the current activity values unless
the source text clearly supports a correction.
## Enum vocabulary (fixed — use these exact slugs)
- `indoor_outdoor`: `indoor` | `outdoor` | `either`
- `space_needed`: `mic` | `mediu` | `mare`
## Output format
Write exactly one JSON object to `data/enrichment_parts/<content_key>.json`:
```json
{
  "content_key": "<the exact key from the prompt>",
  "name_ro": "…",
  "description_ro": "…",
  "rules_ro": "…",
  "variations_ro": "…",
  "indoor_outdoor": "outdoor",
  "space_needed": "mediu",
  "participants_min": 6,
  "participants_max": 20,
  "duration_min": 15,
  "duration_max": 30,
  "age_group_min": 8,
  "age_group_max": 14,
  "estimated_fields": ["space_needed", "duration_min", "duration_max"]
}
```
Include only the fields you actually fill. Always include `content_key` and
`estimated_fields` (use `[]` if nothing was inferred). Output valid JSON only —
no commentary, no markdown fences in the file itself.
## Report
After writing the file, report in under 30 words: the activity name and which
fields you estimated.
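
The output rules above lend themselves to a mechanical check before a part file is collected. The following is a minimal sketch of such a check, not the pipeline's actual validator: the function name `validate_enrichment` and the wording of the problem messages are hypothetical, and the field lists are copied from this prompt.

```python
import json
from typing import List

INDOOR_OUTDOOR = {"indoor", "outdoor", "either"}
SPACE_NEEDED = {"mic", "mediu", "mare"}
TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
INT_FIELDS = (
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
)
INFERABLE = set(INT_FIELDS) | {"indoor_outdoor", "space_needed"}


def validate_enrichment(raw: str, expected_key: str) -> List[str]:
    """Return a list of rule violations; an empty list means the file passes."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(obj, dict):
        return ["top-level value is not a JSON object"]

    problems: List[str] = []
    if obj.get("content_key") != expected_key:
        problems.append("content_key does not match the prompt")

    estimated = obj.get("estimated_fields")
    if not isinstance(estimated, list):
        problems.append("estimated_fields is missing or not a list")
        estimated = []

    # Empty text twins must be omitted, not emitted as empty strings.
    for fld in TEXT_FIELDS:
        if fld in obj and not (isinstance(obj[fld], str) and obj[fld].strip()):
            problems.append(f"{fld} is empty; omit it instead")

    # bool is a subclass of int in Python, so reject it explicitly.
    for fld in INT_FIELDS:
        if fld in obj and (isinstance(obj[fld], bool) or not isinstance(obj[fld], int)):
            problems.append(f"{fld} is not an integer")

    if "indoor_outdoor" in obj and obj["indoor_outdoor"] not in INDOOR_OUTDOOR:
        problems.append("indoor_outdoor is not one of the fixed slugs")
    if "space_needed" in obj and obj["space_needed"] not in SPACE_NEEDED:
        problems.append("space_needed is not one of the fixed slugs")

    # Every estimated field must be an inferable field that was actually set.
    for fld in estimated:
        if not isinstance(fld, str) or fld not in INFERABLE:
            problems.append(f"estimated_fields lists an unknown field: {fld!r}")
        elif fld not in obj:
            problems.append(f"estimated_fields lists {fld} but the field is absent")
    return problems
```

A check like this only covers the shape of the output. It cannot tell whether `description_ro` stays faithful to the source chunk, which remains the responsibility of the prompt itself.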


@@ -74,6 +74,23 @@ The file is one JSON object: a `header` plus an `activities` array.
- Do **not** paraphrase the `source_excerpt` — copy it character for character. - Do **not** paraphrase the `source_excerpt` — copy it character for character.
- Better to extract fewer activities accurately than to pad the output. - Better to extract fewer activities accurately than to pad the output.
## Writing large outputs in batches (IMPORTANT)
A single Write tool call has a hard ~32K output-token limit. Dense chunks
(50+ activities) will exceed this. If you estimate >30 activities, write the
file **incrementally**:
1. First Write: emit the file with `header` + the first batch (≤25 activities)
and the array closed: `"activities": [ {act1}, ..., {act25} ] }`.
2. For each subsequent batch (≤25 activities at a time), use an Edit call whose
   `old_string` is the end of the file: the tail of the last activity followed
   by the closing `]\n}` (or the exact trailing pattern the file ends with).
   The `new_string` repeats that same tail and then adds
   `,\n{act26}, ..., {act50}\n]\n}`. Including the last activity's tail keeps
   `old_string` unique, so the Edit is unambiguous.
3. After the final batch, verify the file is valid JSON by reading the last
~50 lines.
This keeps each tool call under the output-token cap.
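
The batching steps above can be sketched in Python to show why the file stays valid JSON after every edit. This is an illustration only: `first_write`, `apply_edit`, and `append_batch` are hypothetical helpers that imitate the Write and Edit tool calls on an in-memory string, and the one-activity-per-line layout is an assumption.

```python
import json
from typing import Dict, List


def first_write(header: Dict, batch: List[Dict]) -> str:
    """Emit header + first batch with the array and object already closed."""
    items = ",\n".join(json.dumps(a, ensure_ascii=False) for a in batch)
    return (
        '{\n"header": ' + json.dumps(header, ensure_ascii=False)
        + ',\n"activities": [\n' + items + "\n]\n}"
    )


def apply_edit(text: str, old: str, new: str) -> str:
    """Imitate an Edit call: refuse to act unless `old` occurs exactly once."""
    if text.count(old) != 1:
        raise ValueError("old_string must match exactly once")
    return text.replace(old, new)


def append_batch(text: str, last_activity: Dict, batch: List[Dict]) -> str:
    """Splice a new batch in before the closing `]` and `}`."""
    tail = json.dumps(last_activity, ensure_ascii=False)
    extra = ",\n".join(json.dumps(a, ensure_ascii=False) for a in batch)
    # old_string = last activity + closing brackets, which makes it unique;
    # new_string repeats the last activity so nothing is lost.
    return apply_edit(text, tail + "\n]\n}", tail + ",\n" + extra + "\n]\n}")
```

Running `json.loads` on the result after the final batch is the programmatic equivalent of the visual check in step 3.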
## Before you finish ## Before you finish
- Every activity has a non-empty `source_excerpt` and `page_reference`. - Every activity has a non-empty `source_excerpt` and `page_reference`.


@@ -86,7 +86,12 @@ def _split_csv(value: Optional[str]) -> list[str]:
     return [p.strip() for p in str(value).split(",") if p.strip()]
-def dict_to_activity(adict: dict, source_file: str) -> Activity:
+def dict_to_activity(
+    adict: dict,
+    source_file: str,
+    source_id: Optional[str] = None,
+    chunk_key: Optional[str] = None,
+) -> Activity:
     """Build an Activity from one extraction-JSON activity object."""
     tags = adict.get("tags") or []
     if isinstance(tags, str):
@@ -99,6 +104,9 @@ def dict_to_activity(adict: dict, source_file: str) -> Activity:
source_files = [source_file, *source_files] source_files = [source_file, *source_files]
return Activity( return Activity(
source_id=source_id,
source_ids=[source_id] if source_id else [],
chunk_key=chunk_key,
name=(adict.get("name") or "").strip(), name=(adict.get("name") or "").strip(),
description=(adict.get("description") or "").strip(), description=(adict.get("description") or "").strip(),
rules=adict.get("rules"), rules=adict.get("rules"),
@@ -206,6 +214,19 @@ def merge_cluster(cluster: list[Activity]) -> Activity:
if s and s not in sources: if s and s not in sources:
sources.append(s) sources.append(s)
merged.source_files = sources merged.source_files = sources
# source provenance: keep rep's chunk_key/source_id as primary, union the
# source_ids for the download route. Enrichment fields (name_ro,
# description_ro, indoor_outdoor, ...) are intentionally NOT carried here:
# enrichment is applied AFTER dedup (plan Part B2), keyed on the merged
# row's content_key, so merging must not pre-populate them.
merged.source_id = rep.source_id
merged.chunk_key = rep.chunk_key
source_ids: list[str] = []
for a in cluster:
for sid in [a.source_id, *(a.source_ids or [])]:
if sid and sid not in source_ids:
source_ids.append(sid)
merged.source_ids = source_ids
# popularity_score++ per merged duplicate (plan §4) # popularity_score++ per merged duplicate (plan §4)
merged.popularity_score = max(a.popularity_score for a in cluster) + (len(cluster) - 1) merged.popularity_score = max(a.popularity_score for a in cluster) + (len(cluster) - 1)
return merged return merged
@@ -313,6 +334,108 @@ def apply_review_decisions(
return kept, stats return kept, stats
# --------------------------------------------------------------------------
# step 5b — enrichment overlay (plan Part B)
# --------------------------------------------------------------------------
# Translation / inferred-filter fields written by run_enrichment.py. Applied
# AFTER dedup + review decisions, keyed on the same stable content_key, so the
# overlay survives rebuilds as long as extraction text is frozen.
_ENRICHMENT_TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
_ENRICHMENT_INT_FIELDS = (
"participants_min", "participants_max",
"duration_min", "duration_max",
"age_group_min", "age_group_max",
)
def load_enrichment(path: Path) -> dict:
"""Load data/enrichment.json (flat map content_key -> field dict)."""
if path and path.is_file():
try:
data = json.loads(path.read_text(encoding="utf-8"))
if isinstance(data, dict):
return data
except (json.JSONDecodeError, OSError):
pass
return {}
def apply_enrichment(activities: list[Activity], enrichment: dict) -> dict:
"""
Overlay enrichment fields onto the post-dedup activity list (plan B2).
Keyed by content_key. Only fields PRESENT in an entry are written; absent
fields leave the underlying DB value untouched. indoor_outdoor /
space_needed are normalized to slugs (None on unrecognised). Inferred
fields are recorded in `estimated_fields`. Translated / expanded text is
NOT re-validated against the source here — expansion fidelity is the
enrichment prompt's responsibility (plan B2 comment).
Returns {entries, matched, orphaned, fields_stated, fields_estimated}.
"""
from app.config_taxonomy import normalize_indoor_outdoor, normalize_space_needed
matched_keys: set[str] = set()
fields_stated: dict[str, int] = defaultdict(int)
fields_estimated: dict[str, int] = defaultdict(int)
for act in activities:
key = content_key(
act.normalized_name or normalize_name(act.name),
act.language,
act.description or "",
)
entry = enrichment.get(key)
if not isinstance(entry, dict):
continue
matched_keys.add(key)
estimated = set(entry.get("estimated_fields") or [])
# bilingual text twins
for fld in _ENRICHMENT_TEXT_FIELDS:
val = entry.get(fld)
if isinstance(val, str) and val.strip():
setattr(act, fld, val.strip())
# inferred / clarified structured numeric fields
for fld in _ENRICHMENT_INT_FIELDS:
if entry.get(fld) is not None:
try:
setattr(act, fld, int(entry[fld]))
except (TypeError, ValueError):
pass
# enum filters — normalized to slug, dropped if unrecognised
if entry.get("indoor_outdoor") is not None:
slug = normalize_indoor_outdoor(entry["indoor_outdoor"])
if slug:
act.indoor_outdoor = slug
if entry.get("space_needed") is not None:
slug = normalize_space_needed(entry["space_needed"])
if slug:
act.space_needed = slug
act.estimated_fields = sorted(estimated)
# QA tally: stated vs estimated population, per field
for fld in (*_ENRICHMENT_INT_FIELDS, "indoor_outdoor", "space_needed"):
if entry.get(fld) is None:
continue
if fld in estimated:
fields_estimated[fld] += 1
else:
fields_stated[fld] += 1
return {
"entries": len(enrichment),
"matched": len(matched_keys),
"orphaned": len(enrichment) - len(matched_keys),
"fields_stated": dict(fields_stated),
"fields_estimated": dict(fields_estimated),
}
# -------------------------------------------------------------------------- # --------------------------------------------------------------------------
# golden-set recall (plan §7) # golden-set recall (plan §7)
# -------------------------------------------------------------------------- # --------------------------------------------------------------------------
@@ -390,9 +513,8 @@ def collect_activities(
         header = data.get("header", {})
         chunk_text = find_chunk_text(json_path, header, chunks_dir)
-        source_id = header.get("source_id") or chunk_key_for(json_path, header).rsplit(
-            ".part", 1
-        )[0]
+        chunk_key = chunk_key_for(json_path, header)
+        source_id = header.get("source_id") or chunk_key.rsplit(".part", 1)[0]
         fallback_source = (
             source_path_for(source_id, sources_dir) or source_id or json_path.stem
         )
@@ -409,7 +531,7 @@ def collect_activities(
                 continue
             src = adict.get("source_file") or fallback_source
             raw_categories.append((adict.get("category", ""), normalize_category(adict.get("category", ""))))
-            activities.append(dict_to_activity(adict, src))
+            activities.append(dict_to_activity(adict, src, source_id, chunk_key))
         if hallucinated:
             _log_hallucinations(json_path, rejected_dir, hallucinated)
@@ -496,6 +618,7 @@ def rebuild(
     sources_dir: Path,
     db_path: Path,
     decisions_path: Optional[Path] = None,
+    enrichment_path: Optional[Path] = None,
     schema_path: Path = DEFAULT_SCHEMA_PATH,
     golden_dir: Optional[Path] = None,
     do_swap: bool = True,
@@ -517,6 +640,11 @@ def rebuild(
     decisions = load_review_decisions(Path(decisions_path)) if decisions_path else {}
     final, decision_stats = apply_review_decisions(deduped, decisions)
+    # Enrichment overlay: applied immediately after review decisions, on the
+    # post-dedup list, keyed on the same stable content_key (plan B2).
+    enrichment = load_enrichment(Path(enrichment_path)) if enrichment_path else {}
+    enrichment_stats = apply_enrichment(final, enrichment)
     try:
         write_database(db_tmp_path, final)
         backup = atomic_swap(db_tmp_path, db_path) if do_swap else None
@@ -529,6 +657,7 @@ def rebuild(
         **collected,
         "dedup": dedup_stats,
         "decisions": decision_stats,
+        "enrichment": enrichment_stats,
         "final_count": len(final),
         "backup": str(backup) if backup else None,
         "swapped": do_swap,
@@ -579,6 +708,16 @@ def print_report(report: dict) -> None:
           f"(auto-merged {d['auto_merged']}, borderline {d['borderline']})")
     print(f"review decisions : dropped {report['decisions']['dropped']}, "
           f"resolved {report['decisions']['resolved']}")
+    enr = report.get("enrichment")
+    if enr and enr.get("entries"):
+        print(f"enrichment : {enr['entries']} entries "
+              f"(matched {enr['matched']}, orphaned {enr['orphaned']})")
+        stated, estimated = enr.get("fields_stated", {}), enr.get("fields_estimated", {})
+        all_fields = sorted(set(stated) | set(estimated))
+        if all_fields:
+            print(" field population : (stated / estimated)")
+            for fld in all_fields:
+                print(f" {fld:<18}: {stated.get(fld, 0)} / {estimated.get(fld, 0)}")
     print(f"final inserted : {report['final_count']}")
     print(f"% with rules : {qa['pct_with_rules']}")
     print(f"needs_review rows : {qa['needs_review']}")
@@ -615,6 +754,7 @@ def main(argv: Optional[list[str]] = None) -> int:
     parser.add_argument("--sources", default="data/sources")
     parser.add_argument("--db", default="data/activities.db")
     parser.add_argument("--decisions", default="data/review_decisions.json")
+    parser.add_argument("--enrichment", default="data/enrichment.json")
     parser.add_argument("--golden", default="data/golden")
     parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
     args = parser.parse_args(argv)
@@ -628,6 +768,7 @@ def main(argv: Optional[list[str]] = None) -> int:
         sources_dir=Path(args.sources),
         db_path=Path(args.db),
         decisions_path=Path(args.decisions),
+        enrichment_path=Path(args.enrichment),
         schema_path=Path(args.schema),
         golden_dir=Path(args.golden),
     )
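
The overlay step wired into `rebuild` above can be sketched in isolation. This is a simplified illustration, not the code in `build_database.py`: activities are plain dicts, `key_of` is a hypothetical stand-in for `import_common.content_key` (whose real hashing is not shown in this diff), and the tally counts every applied field, whereas the real one only counts the integer fields plus the two axes.

```python
from __future__ import annotations

import hashlib
from collections import Counter


def key_of(activity: dict) -> str:
    """Hypothetical stand-in for import_common.content_key (assumption)."""
    raw = "|".join([activity["name"].lower(), activity["language"], activity["description"]])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def apply_overlay(activities: list[dict], enrichment: dict[str, dict]) -> dict:
    """Apply entries to matching activities; tally stated vs estimated fields."""
    matched: set[str] = set()
    stated: Counter = Counter()
    estimated_tally: Counter = Counter()
    for act in activities:
        key = key_of(act)
        entry = enrichment.get(key)
        if entry is None:
            continue
        matched.add(key)
        estimated = set(entry.get("estimated_fields", []))
        for fld, value in entry.items():
            if fld == "estimated_fields" or value is None:
                continue  # absent values never overwrite existing ones
            act[fld] = value
            (estimated_tally if fld in estimated else stated)[fld] += 1
        act["estimated_fields"] = sorted(estimated)
    return {
        "entries": len(enrichment),
        "matched": len(matched),
        "orphaned": len(enrichment) - len(matched),
        "fields_stated": dict(stated),
        "fields_estimated": dict(estimated_tally),
    }


acts = [{"name": "Joc de test", "language": "ro", "description": "O descriere."}]
overlay = {
    key_of(acts[0]): {
        "space_needed": "mediu",
        "indoor_outdoor": "outdoor",
        "estimated_fields": ["space_needed"],
    },
    "no-such-key": {"name_ro": "orfan"},  # matches nothing, so it counts as orphaned
}
stats = apply_overlay(acts, overlay)
print(stats)
```

Keying on content rather than on a row id is what lets the overlay survive a rebuild, provided the extraction text stays frozen.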


@@ -0,0 +1,244 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
repair_extractions.py — one-shot repair of malformed extraction JSON.
Subagents systematically emit unescaped ASCII double-quotes inside string
values (Romanian text like „Unu" uses a closing " that terminates the JSON
string early). Re-extraction reproduces the bug, so we repair instead.
IMPORTANT — why NOT json_repair: json_repair "recovers" an unescaped quote by
ending the string at the stray quote and reinterpreting the trailing text as a
new key, which (a) TRUNCATES the value and (b) injects garbage keys. The
truncation is silent (the field is still non-empty) and slips past a naive
presence check. So we use a faithful char-scanner that ESCAPES stray quotes
(\\") instead of splitting on them, then validate the result against the real
activity schema (additionalProperties:false also catches any residual split).
This is an OFFLINE maintenance tool. build_database.py must NOT depend on it —
the "DB regenerable from data/extracted/" invariant requires plain valid JSON on
disk. We write clean JSON back to data/extracted/ and the build reads vanilla
json.
Source selection (faithful recovery needs the ORIGINAL malformed text):
* a chunk is a candidate when a MALFORMED original exists — either the
top-level data/extracted/<key>.json is itself invalid, or a malformed
original sits in data/extracted/_rejected/<key>.json.
* the malformed original is preferred as the repair source.
* chunks whose only artifact is already-valid JSON (e.g. a prior json_repair
output that lost the original) are NOT silently "repaired" — if such a chunk
has no valid top-level file it is reported as needing RE-EXTRACTION.
Usage:
python scripts/repair_extractions.py # report only (dry run)
python scripts/repair_extractions.py --apply # write repaired JSON
"""
from __future__ import annotations

import argparse
import glob
import json
import sys
from pathlib import Path
from typing import Optional

SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
EXTRACTED = REPO_ROOT / "data" / "extracted"
REJECTED = EXTRACTED / "_rejected"

# Make the sibling import_common module importable when run as a script.
if str(SCRIPT_DIR) not in sys.path:
    sys.path.insert(0, str(SCRIPT_DIR))

from import_common import DEFAULT_SCHEMA_PATH, load_schema, validate_extraction  # noqa: E402


def escape_stray_quotes(s: str) -> str:
    """Escape ASCII double-quotes that occur INSIDE a JSON string value.

    A `"` inside a string is treated as a real string-close only when the next
    non-whitespace char is structural (`,` `}` `]` `:`) or EOF; otherwise it is
    content and is escaped to `\\"`. This preserves the full value instead of
    truncating it (the json_repair failure mode).
    """
    out: list[str] = []
    in_str = False
    esc = False
    n = len(s)
    i = 0
    while i < n:
        c = s[i]
        if esc:
            # Character following a backslash: copy verbatim.
            out.append(c)
            esc = False
            i += 1
            continue
        if c == "\\":
            out.append(c)
            esc = True
            i += 1
            continue
        if c == '"':
            if not in_str:
                in_str = True
                out.append(c)
            else:
                # Look ahead past whitespace to decide: close or content?
                j = i + 1
                while j < n and s[j] in " \t\r\n":
                    j += 1
                nxt = s[j] if j < n else ""
                if nxt in ",}]:" or nxt == "":
                    in_str = False
                    out.append(c)
                else:
                    out.append('\\"')  # content quote → escape, keep value whole
            i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)


def _is_valid_json(path: Path) -> bool:
    try:
        json.loads(path.read_text(encoding="utf-8"))
        return True
    except (json.JSONDecodeError, OSError):
        return False


def _malformed_source(key: str) -> Optional[Path]:
    """Return the malformed-original file for a chunk, preferring top-level."""
    live = EXTRACTED / f"{key}.json"
    if live.exists() and not _is_valid_json(live):
        return live
    rej = REJECTED / f"{key}.json"
    if rej.exists() and not _is_valid_json(rej):
        return rej
    return None
def _candidate_keys() -> tuple[dict[str, Path], list[str]]:
"""
(repair_candidates, needs_reextraction).
repair_candidates: key -> malformed source file (faithfully repairable).
needs_reextraction: chunks with no malformed original AND no valid
top-level file (their original was lost) — must be re-extracted.
"""
keys = set()
for fn in glob.glob(str(EXTRACTED / "*.json")):
keys.add(Path(fn).stem)
for fn in glob.glob(str(REJECTED / "*.json")):
keys.add(Path(fn).stem)
candidates: dict[str, Path] = {}
needs_reextraction: list[str] = []
for key in sorted(keys):
# A malformed original anywhere is faithfully repairable, and is the
# source of truth even if a (json_repair-produced, possibly truncated)
# valid top-level file exists — escaping the original never truncates,
# so re-repairing from it is always >= the json_repair output.
src = _malformed_source(key)
if src is not None:
candidates[key] = src
continue
live = EXTRACTED / f"{key}.json"
if live.exists() and _is_valid_json(live):
continue # genuinely-valid extraction, nothing to do
# no valid top-level and no malformed original to repair from
needs_reextraction.append(key)
return candidates, needs_reextraction
def repair(apply: bool) -> int:
schema = load_schema(DEFAULT_SCHEMA_PATH)
candidates, needs_reextraction = _candidate_keys()
print("=" * 64)
print(f"REPAIR EXTRACTIONS ({'APPLY' if apply else 'dry run'})")
print("=" * 64)
print(f"repair candidates: {len(candidates)}")
def _textlen(data: dict) -> int:
total = 0
for a in data.get("activities", []):
if isinstance(a, dict):
for v in a.values():
if isinstance(v, str):
total += len(v)
return total
ok = 0
kept_toplevel = 0
still_bad: list[str] = []
schema_fail: list[tuple[str, str]] = []
for key, src in candidates.items():
live = EXTRACTED / f"{key}.json"
live_valid = live.exists() and _is_valid_json(live)
raw = src.read_text(encoding="utf-8")
fixed = escape_stray_quotes(raw)
try:
data = json.loads(fixed)
except json.JSONDecodeError as exc:
if live_valid:
kept_toplevel += 1 # genuine top-level is fine; stale _rejected
else:
still_bad.append(f"{key}: still invalid after escape ({exc})")
continue
errors = validate_extraction(data, schema)
if errors:
if live_valid:
kept_toplevel += 1
else:
schema_fail.append((key, errors[0]))
print(f" {key[:50]:<50} SCHEMA-FAIL: {errors[0][:40]}")
continue
# Faithfulness guard: only replace a valid top-level when the escaped
# repair carries STRICTLY more text (i.e. the top-level was a truncated
# json_repair output). Genuine extractions are kept untouched.
if live_valid:
try:
live_data = json.loads(live.read_text(encoding="utf-8"))
except json.JSONDecodeError:
live_data = {}
if _textlen(data) <= _textlen(live_data):
kept_toplevel += 1
continue
n = len(data.get("activities", []))
print(f" {key[:50]:<50} {n:>3} acts REPAIR")
if apply:
live.write_text(
json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
)
ok += 1
print("-" * 64)
print(f"repaired: {ok} | kept genuine top-level: {kept_toplevel} | "
f"schema-fail: {len(schema_fail)} | still-bad: {len(still_bad)} | "
f"needs re-extraction: {len(needs_reextraction)}")
for key, err in schema_fail:
print(f" ⚠ schema {key}: {err[:60]}")
for msg in still_bad:
print(f"{msg}")
for key in needs_reextraction:
print(f" ↻ re-extract: {key}")
if not apply:
print("\nDry run — re-run with --apply to write repaired JSON.")
print("=" * 64)
return 0
def main(argv: Optional[list[str]] = None) -> int:
parser = argparse.ArgumentParser(description="Repair malformed extraction JSON.")
parser.add_argument("--apply", action="store_true",
help="write repaired JSON (default: dry run)")
args = parser.parse_args(argv)
return repair(args.apply)
if __name__ == "__main__":
raise SystemExit(main())
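
The close-or-content rule described in the `escape_stray_quotes` docstring can be exercised on a small input. The block below is a condensed restatement of that rule written for illustration (the sample strings are made up), and it also shows a limit that follows directly from the heuristic: a content quote that happens to be followed by a comma is taken as a real close, so the result still fails to parse. In `repair()` that case lands in the still-bad list.

```python
import json

STRUCTURAL = ",}]:"


def escape_stray_quotes(s: str) -> str:
    """Condensed sketch of the scanner: escape quotes that are string content."""
    out = []
    in_str = False
    esc = False
    for i, c in enumerate(s):
        if esc:
            out.append(c)  # char after a backslash is copied verbatim
            esc = False
        elif c == "\\":
            out.append(c)
            esc = True
        elif c == '"' and in_str:
            # First non-whitespace char after the quote ("" at EOF).
            nxt = s[i + 1:].lstrip(" \t\r\n")[:1]
            if nxt == "" or nxt in STRUCTURAL:
                in_str = False
                out.append(c)
            else:
                out.append('\\"')
        elif c == '"':
            in_str = True
            out.append(c)
        else:
            out.append(c)
    return "".join(out)


def parses(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


raw = '{"name": "Jocul „Unu" este simplu", "n": 1}'
fixed = json.loads(escape_stray_quotes(raw))

# Limit of the heuristic: the stray quote before the comma looks like a close.
ambiguous = '{"rules": "say "go", then run"}'
ambiguous_ok = parses(escape_stray_quotes(ambiguous))

print(parses(raw), fixed["name"], ambiguous_ok)
```

Because the scanner only ever adds backslashes, a successful parse carries the full original text, which is what the strictly-more-text guard in `repair()` relies on.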

scripts/run_enrichment.py (new file, 270 lines)

@@ -0,0 +1,270 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
run_enrichment.py — enrichment orchestrator (plan Part B3).
Mirror of run_extraction.py, on the *other* side of the rebuild. It reads the
already-rebuilt data/activities.db, and for every activity emits one subagent
prompt asking for a single bilingual + inferred-filter enrichment pass. Like
extraction, this script does NOT call the LLM — the interactive Claude Code
orchestrator launches waves of subagents on the emitted prompts.
Keying is the crux (plan §"Cheia de keying"): each row's overlay is keyed on
import_common.content_key(normalized_name, language, _normalize_text(description))
— the SAME function build_database uses to apply the overlay. The key is stable
only while the extraction text is frozen, so enrichment runs AFTER the freezing
rebuild.
Modes:
(default) emit one prompt per activity that has no enrichment part yet
(resumable: data/enrichment_parts/<key>.json present => skip)
--collect merge data/enrichment_parts/*.json -> data/enrichment.json
Pilot scoping (plan B5): --source <source_id substring> and/or --limit N narrow
the emitted prompts to a single source / category for the sign-off pilot.
Usage:
python scripts/run_enrichment.py --source teambuilding_corbu # pilot
python scripts/run_enrichment.py # all rows
python scripts/run_enrichment.py --collect # merge parts
"""
from __future__ import annotations
import argparse
import json
import sqlite3
import sys
from pathlib import Path
from typing import Optional
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
if _p not in sys.path:
sys.path.insert(0, _p)
from import_common import ( # noqa: E402
content_key,
find_chunk_text,
normalize_name,
)
ENRICHMENT_PROMPT = SCRIPT_DIR / "ENRICHMENT_PROMPT.md"
# Columns pulled from the DB into the prompt as the "current value" context.
_DB_COLUMNS = (
"id", "name", "description", "rules", "variations",
"category", "content_type", "language", "normalized_name",
"page_reference", "source_id", "chunk_key",
"participants_min", "participants_max",
"duration_min", "duration_max",
"age_group_min", "age_group_max",
)
# How much source-chunk text to inline. Chunks are page-sized; cap so a dense
# chunk does not blow the prompt up, but keep enough to ground the expansion.
_CHUNK_TEXT_CAP = 12000
def _fetch_rows(db_path: Path, source_substr: Optional[str]) -> list[dict]:
conn = sqlite3.connect(db_path)
conn.row_factory = sqlite3.Row
try:
cols = ", ".join(_DB_COLUMNS)
sql = f"SELECT {cols} FROM activities"
params: list = []
if source_substr:
sql += " WHERE (source_id LIKE ? OR chunk_key LIKE ?)"
params = [f"%{source_substr}%", f"%{source_substr}%"]
sql += " ORDER BY source_id, id"
return [dict(r) for r in conn.execute(sql, params).fetchall()]
finally:
conn.close()
def _row_content_key(row: dict) -> str:
return content_key(
row.get("normalized_name") or normalize_name(row.get("name") or ""),
row.get("language"),
row.get("description") or "",
)
def _chunk_text_for_row(row: dict, chunks_dir: Path) -> Optional[str]:
"""Locate the source-chunk text via the row's chunk_key / source_id."""
header = {"chunk_key": row.get("chunk_key"), "source_id": row.get("source_id")}
if not header["chunk_key"]:
return None
# find_chunk_text resolves from the header when chunk_key is present;
# the json_path arg is only a fallback, so a synthetic path is fine.
text = find_chunk_text(Path(f"{row['chunk_key']}.json"), header, chunks_dir)
if text and len(text) > _CHUNK_TEXT_CAP:
text = text[:_CHUNK_TEXT_CAP] + "\n…[chunk truncated]…"
return text
def _current_fields_block(row: dict) -> str:
"""The activity's current DB values, as a compact JSON block for context."""
fields = {
"name": row.get("name"),
"description": row.get("description"),
"rules": row.get("rules"),
"variations": row.get("variations"),
"category": row.get("category"),
"content_type": row.get("content_type"),
"language": row.get("language"),
"participants_min": row.get("participants_min"),
"participants_max": row.get("participants_max"),
"duration_min": row.get("duration_min"),
"duration_max": row.get("duration_max"),
"age_group_min": row.get("age_group_min"),
"age_group_max": row.get("age_group_max"),
}
return json.dumps(fields, ensure_ascii=False, indent=2)
def emit_enrichment_prompt(
row: dict, key: str, chunks_dir: Path, prompts_dir: Path
) -> Path:
"""Write the subagent enrichment prompt for one activity."""
chunk_text = _chunk_text_for_row(row, chunks_dir)
source_block = (
chunk_text if chunk_text is not None
else "[source chunk text unavailable — translate only what is given "
"above; do NOT invent steps, and mark any inferred filter field "
"as estimated]"
)
part_path = f"data/enrichment_parts/{key}.json"
text = "\n".join([
f"# ENRICHMENT — activity `{row.get('name')}` (id {row.get('id')})",
"",
f"Follow the rules in `{ENRICHMENT_PROMPT.relative_to(REPO_ROOT)}` EXACTLY.",
"Single pass. Translate faithfully to Romanian; expand description_ro "
"ONLY from the source chunk text below; mark inferred filter fields in "
"`estimated_fields`.",
"",
f"Write the result JSON to: `{part_path}`",
f'It MUST include `"content_key": "{key}"`.',
f'Page reference: {row.get("page_reference") or "?"}',
"",
"## Current activity values (the text to translate / enrich)",
"```json",
_current_fields_block(row),
"```",
"",
"## Source chunk text (ground description_ro expansion in THIS only)",
"```",
source_block,
"```",
"",
])
prompts_dir.mkdir(parents=True, exist_ok=True)
out = prompts_dir / f"{key}.prompt.md"
out.write_text(text, encoding="utf-8")
return out


def collect_enrichment(parts_dir: Path, out_path: Path) -> dict:
    """Merge data/enrichment_parts/*.json into one flat content_key map."""
    merged: dict = {}
    bad: list[str] = []
    if parts_dir.is_dir():
        for part in sorted(parts_dir.glob("*.json")):
            try:
                data = json.loads(part.read_text(encoding="utf-8"))
            except (json.JSONDecodeError, OSError):
                bad.append(part.name)
                continue
            # A part that parses but is not a JSON object (e.g. a list) cannot
            # carry a content_key; report it as bad instead of crashing.
            if not isinstance(data, dict):
                bad.append(part.name)
                continue
            key = data.get("content_key") or part.stem
            entry = {k: v for k, v in data.items() if k != "content_key"}
            merged[key] = entry
    out_path.write_text(
        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return {"entries": len(merged), "bad_parts": bad, "out": str(out_path)}


def run_emit(
    *,
    db_path: Path,
    chunks_dir: Path,
    parts_dir: Path,
    prompts_dir: Path,
    source_substr: Optional[str],
    limit: Optional[int],
) -> dict:
    rows = _fetch_rows(db_path, source_substr)
    emitted, skipped = 0, 0
    for row in rows:
        key = _row_content_key(row)
        # Resumable: a part file on disk means this activity is already done.
        if (parts_dir / f"{key}.json").is_file():
            skipped += 1
            continue
        emit_enrichment_prompt(row, key, chunks_dir, prompts_dir)
        emitted += 1
        if limit and emitted >= limit:
            break
    return {
        "rows": len(rows),
        "emitted": emitted,
        "skipped_done": skipped,
        "prompts_dir": str(prompts_dir),
    }
def main(argv: Optional[list[str]] = None) -> int:
parser = argparse.ArgumentParser(description="Enrichment orchestrator.")
parser.add_argument("--db", default="data/activities.db")
parser.add_argument("--chunks", default="data/chunks")
parser.add_argument("--parts", default="data/enrichment_parts")
parser.add_argument("--prompts", default="data/enrichment_prompts")
parser.add_argument("--out", default="data/enrichment.json")
parser.add_argument("--source", default=None,
help="only rows whose source_id/chunk_key contains this (pilot)")
parser.add_argument("--limit", type=int, default=None,
help="cap emitted prompts (pilot)")
parser.add_argument("--collect", action="store_true",
help="merge enrichment parts into the overlay JSON")
args = parser.parse_args(argv)
print("=" * 60)
print("ENRICHMENT ORCHESTRATOR")
print("=" * 60)
if args.collect:
result = collect_enrichment(Path(args.parts), Path(args.out))
print(f"collected : {result['entries']} entries -> {result['out']}")
if result["bad_parts"]:
print(f"bad parts : {len(result['bad_parts'])} (skipped)")
for name in result["bad_parts"]:
print(f" - {name}")
print("Run build_database.py --rebuild to apply the overlay.")
print("=" * 60)
return 0
summary = run_emit(
db_path=Path(args.db),
chunks_dir=Path(args.chunks),
parts_dir=Path(args.parts),
prompts_dir=Path(args.prompts),
source_substr=args.source,
limit=args.limit,
)
print(f"rows in DB : {summary['rows']}"
+ (f" (filtered by '{args.source}')" if args.source else ""))
print(f"already enriched : {summary['skipped_done']}")
print(f"prompts emitted : {summary['emitted']}")
if summary["emitted"]:
print(f"prompts dir : {summary['prompts_dir']}/")
print("Launch waves of ~8-16 Sonnet subagents on those prompts, each "
"writing data/enrichment_parts/<key>.json, then run "
"run_enrichment.py --collect and build_database.py --rebuild.")
else:
print("Nothing to emit — run --collect then build_database.py --rebuild.")
print("=" * 60)
return 0
if __name__ == "__main__":
raise SystemExit(main())
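
The resume-then-collect flow of this script can be sketched without the DB or the prompt template. The following is a minimal sketch under assumptions: keys are given as plain strings rather than computed by `content_key`, and `pending_keys` and `collect` are hypothetical names, not functions from `run_enrichment.py`.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def pending_keys(keys: list[str], parts_dir: Path) -> list[str]:
    """Keys that still need a prompt: no <key>.json part on disk yet."""
    return [k for k in keys if not (parts_dir / f"{k}.json").is_file()]


def collect(parts_dir: Path) -> dict[str, dict]:
    """Merge part files into one map keyed on content_key (file stem as fallback)."""
    merged: dict[str, dict] = {}
    for part in sorted(parts_dir.glob("*.json")):
        data = json.loads(part.read_text(encoding="utf-8"))
        key = data.pop("content_key", None) or part.stem
        merged[key] = data
    return merged


with tempfile.TemporaryDirectory() as tmp:
    parts = Path(tmp)
    # One subagent has already written its part; "k2" is still outstanding.
    (parts / "k1.json").write_text(
        json.dumps({"content_key": "k1", "name_ro": "Unu"}), encoding="utf-8"
    )
    todo = pending_keys(["k1", "k2"], parts)
    merged = collect(parts)

print(todo, merged)
```

Using one part file per key keeps each subagent's write independent, so an interrupted wave can be resumed by simply re-running the emit step.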

tests/test_enrichment.py (new file, 231 lines)

@@ -0,0 +1,231 @@
"""
Tests for the enrichment overlay (plan Part B) and the new filter axes /
bilingual display helpers (plan Part A).
Covers:
* config_taxonomy.normalize_indoor_outdoor / normalize_space_needed
* build_database.apply_enrichment keying, field application, estimated tally
* DatabaseManager indoor_outdoor / space_needed equality filters
* FTS5 indexing of the *_ro columns
* Activity bilingual display helpers
"""
import os
import sys
import pytest
PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
if PROJECT_ROOT not in sys.path:
sys.path.insert(0, PROJECT_ROOT)
SCRIPTS = os.path.join(PROJECT_ROOT, "scripts")
if SCRIPTS not in sys.path:
sys.path.insert(0, SCRIPTS)
from app.models.activity import Activity # noqa: E402
from app.models.database import DatabaseManager # noqa: E402
from app.config_taxonomy import ( # noqa: E402
normalize_indoor_outdoor,
normalize_space_needed,
)
from import_common import content_key, normalize_name # noqa: E402
from build_database import apply_enrichment # noqa: E402
# --------------------------------------------------------------------------
# taxonomy normalizers
# --------------------------------------------------------------------------
@pytest.mark.parametrize("raw,expected", [
("indoor", "indoor"),
("Outdoor", "outdoor"),
("either", "either"),
("interior", "indoor"),
("aer liber", "outdoor"),
("both", "either"),
("", None),
("nonsense", None),
(None, None),
])
def test_normalize_indoor_outdoor(raw, expected):
assert normalize_indoor_outdoor(raw) == expected
@pytest.mark.parametrize("raw,expected", [
("mic", "mic"),
("MEDIU", "mediu"),
("mare", "mare"),
("small", "mic"),
("large", "mare"),
("", None),
("huge", None),
(None, None),
])
def test_normalize_space_needed(raw, expected):
assert normalize_space_needed(raw) == expected
# --------------------------------------------------------------------------
# apply_enrichment
# --------------------------------------------------------------------------
def _activity(name="Joc de test", description="O descriere de test.", language="ro"):
return Activity(
name=name, description=description, category="team-building",
content_type="joc", source_file="t.txt", language=language,
)
def _key_for(act: Activity) -> str:
return content_key(
act.normalized_name or normalize_name(act.name),
act.language,
act.description or "",
)
def test_apply_enrichment_matches_and_applies_fields():
act = _activity()
key = _key_for(act)
enrichment = {
key: {
"name_ro": "Joc de test (RO)",
"description_ro": "Descriere îmbogățită în română.",
"indoor_outdoor": "outdoor",
"space_needed": "mediu",
"participants_min": 4,
"participants_max": 12,
"estimated_fields": ["space_needed", "participants_min", "participants_max"],
}
}
stats = apply_enrichment([act], enrichment)
assert act.name_ro == "Joc de test (RO)"
assert act.description_ro == "Descriere îmbogățită în română."
assert act.indoor_outdoor == "outdoor"
assert act.space_needed == "mediu"
assert act.participants_min == 4 and act.participants_max == 12
assert set(act.estimated_fields) == {"space_needed", "participants_min", "participants_max"}
assert stats["entries"] == 1
assert stats["matched"] == 1
assert stats["orphaned"] == 0
# indoor_outdoor stated, space_needed estimated
assert stats["fields_stated"].get("indoor_outdoor") == 1
assert stats["fields_estimated"].get("space_needed") == 1
def test_apply_enrichment_orphan_entry_counted():
act = _activity()
enrichment = {"deadbeef" * 5: {"name_ro": "nu se potrivește"}}
stats = apply_enrichment([act], enrichment)
assert stats["matched"] == 0
assert stats["orphaned"] == 1
assert act.name_ro is None # untouched
def test_apply_enrichment_absent_fields_leave_value_untouched():
act = _activity()
act.participants_min = 5
key = _key_for(act)
# entry only translates name; participants must be preserved
apply_enrichment([act], {key: {"name_ro": "Tradus"}})
assert act.participants_min == 5
assert act.name_ro == "Tradus"
def test_apply_enrichment_drops_unrecognised_enum():
act = _activity()
key = _key_for(act)
apply_enrichment([act], {key: {"indoor_outdoor": "spaceship", "space_needed": "mic"}})
assert act.indoor_outdoor is None # unrecognised → dropped
assert act.space_needed == "mic"
# --------------------------------------------------------------------------
# DB equality filters + FTS on *_ro
# --------------------------------------------------------------------------
@pytest.fixture
def db(tmp_path):
return DatabaseManager(str(tmp_path / "enrich.db"))
def _insert(db, **overrides):
base = dict(
name="Activitate", description="desc", category="camp-outdoor",
content_type="joc", source_file="t.txt", language="ro",
)
base.update(overrides)
return db.insert_activity(Activity(**base))
def test_indoor_outdoor_equality_filter(db):
_insert(db, name="In casa", indoor_outdoor="indoor")
_insert(db, name="Afara", indoor_outdoor="outdoor")
res = db.search_activities(indoor_outdoor="outdoor")
assert len(res) == 1
assert res[0]["name"] == "Afara"
def test_space_needed_equality_filter(db):
_insert(db, name="Mic", space_needed="mic")
_insert(db, name="Mare", space_needed="mare")
res = db.search_activities(space_needed="mare")
assert len(res) == 1
assert res[0]["name"] == "Mare"
def test_fts_indexes_name_ro(db):
_insert(db, name="Treasure Hunt", name_ro="Vânătoarea de comori")
# term only present in the Romanian twin
res = db.search_activities(search_text="comori")
assert len(res) == 1
assert res[0]["name"] == "Treasure Hunt"
def test_fts_indexes_description_ro(db):
_insert(db, name="Game", description="english desc",
description_ro="o activitate de cooperare")
res = db.search_activities(search_text="cooperare")
assert len(res) == 1
def test_ro_columns_round_trip(db):
aid = _insert(
db, name="X", name_ro="X-ro", description_ro="d-ro",
rules_ro="r-ro", variations_ro="v-ro",
indoor_outdoor="either", space_needed="mediu",
estimated_fields=["duration_min"], source_id="src1",
source_ids=["src1", "src2"], chunk_key="src1.part01",
)
row = db.get_activity_by_id(aid)
loaded = Activity.from_dict(row)
assert loaded.name_ro == "X-ro"
assert loaded.indoor_outdoor == "either"
assert loaded.space_needed == "mediu"
assert loaded.estimated_fields == ["duration_min"]
assert loaded.source_ids == ["src1", "src2"]
assert loaded.chunk_key == "src1.part01"
# --------------------------------------------------------------------------
# display helpers
# --------------------------------------------------------------------------
def test_display_helpers_prefer_ro_with_fallback():
act = _activity(name="Original", description="Original desc")
assert act.get_display_name() == "Original" # no translation yet
assert act.get_display_description() == "Original desc"
act.name_ro = "Tradus"
act.description_ro = "Descriere tradusă"
assert act.get_display_name() == "Tradus"
assert act.get_display_description() == "Descriere tradusă"
assert act.has_translation() is True
def test_is_estimated_and_axis_displays():
act = _activity()
act.indoor_outdoor = "outdoor"
act.space_needed = "mare"
act.estimated_fields = ["space_needed"]
assert act.get_indoor_outdoor_display() == "Exterior"
assert act.get_space_needed_display() == "Spațiu mare"
assert act.is_estimated("space_needed") is True
assert act.is_estimated("indoor_outdoor") is False
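
The display-helper tests above pin down a "Romanian first, original as fallback" contract. One way such helpers could be written is sketched below. `ActivitySketch` is a hypothetical stand-in, since the real `Activity` model in `app/models/activity.py` is not part of this section and may be implemented differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ActivitySketch:
    """Hypothetical stand-in for the Activity model (assumption)."""

    name: str
    description: str
    name_ro: Optional[str] = None
    description_ro: Optional[str] = None
    estimated_fields: list[str] = field(default_factory=list)

    def get_display_name(self) -> str:
        # Prefer the Romanian twin; fall back to the original text.
        return self.name_ro or self.name

    def get_display_description(self) -> str:
        return self.description_ro or self.description

    def has_translation(self) -> bool:
        return bool(self.name_ro or self.description_ro)

    def is_estimated(self, field_name: str) -> bool:
        # True when the enrichment pass inferred this field rather than read it.
        return field_name in self.estimated_fields


act = ActivitySketch(name="Original", description="Original desc")
before = act.get_display_name()
act.name_ro = "Tradus"
act.estimated_fields = ["space_needed"]
after = act.get_display_name()
print(before, after, act.is_estimated("space_needed"))
```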