Phase 1 complete: bilingual+enrichment plumbing, UI/filters, frozen DB

Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await
re-extraction). DB rebuilt and frozen at 9418 activities — content_keys
are now stable for the enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor,
  space_needed, estimated_fields, source_id/source_ids/chunk_key columns;
  FTS5 indexes the 4 *_ro columns across CREATE + all 3 triggers; new equality
  filters + category counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*,
  is_estimated, axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers
  (None on unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns,
  RO-primary rendering with "(estimat)" markers and collapsible original
  text, and a /source/<id> download route shipped DARK behind
  SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster
  unions source_ids without touching enrichment fields.

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on
  content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect),
  ENRICHMENT_PROMPT.md.

Repair: scripts/repair_extractions.py fixes the subagents' systematic
unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never
truncates) + schema validation + a strictly-more-text guard. json_repair was
tried first, truncated silently, and is NOT used. build_database has no repair
dependency.

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude Agent
2026-05-29 18:10:13 +00:00
parent 46d9592a55
commit bcfb6841eb
18 changed files with 1579 additions and 167 deletions


@@ -0,0 +1,98 @@
# SUBAGENT — Activity enrichment
You are a subagent in the game-library enrichment pipeline. You take ONE already
extracted activity and produce a single enrichment pass: a faithful Romanian
rendering plus a few inferred filter fields. You do **one** activity per prompt.
This is **not** re-extraction. The activity text already exists and is trusted.
Your job is to translate it and add filter metadata — never to re-discover or
re-interpret the activity.
## Your task
The prompt gives you two blocks:
1. **Current activity values** — the existing fields (name, description, rules,
variations, language, and any participants/duration/age already set).
2. **Source chunk text** — the original passage the activity came from. This is
your ground truth for any expansion. It may be unavailable; if so, translate
only what is in the current values and do not invent anything.

Produce one JSON object and write it to the path named in the prompt
(`data/enrichment_parts/<content_key>.json`). It MUST contain the exact
`content_key` string from the prompt.
## Rules
### Translation (always)
- Translate `name`, `description`, `rules`, `variations` into natural, fluent
Romanian → `name_ro`, `description_ro`, `rules_ro`, `variations_ro`.
- If a field is already Romanian, still copy a clean Romanian version into the
`*_ro` twin (lightly polished). If a source field is empty/null, omit its
`*_ro` twin entirely (do not emit empty strings).
- Translate faithfully. Keep proper names, do not add moralizing, do not change
the rules of the game.
### Description expansion (constrained)
- You MAY make `description_ro` richer than a literal translation — but ONLY
using detail that is actually present in the **source chunk text**. Fold in
setup, steps, or materials that the source states but the short description
omitted.
- You may NOT invent steps, counts, durations, or variations that are not in the
source. If the source is thin, the translation stays thin. Hallucinated
expansion is the one unacceptable failure here.
### Inferred filter fields (mark when inferred)
Fill these when you can, using the source text first, then reasonable inference:
- `indoor_outdoor`: one of `indoor`, `outdoor`, `either`.
- `space_needed`: one of `mic`, `mediu`, `mare` (small / medium / large area).
- `participants_min`, `participants_max`: integers (people).
- `duration_min`, `duration_max`: integers (minutes).
- `age_group_min`, `age_group_max`: integers (years).

For any of these fields whose value you **inferred** (the source did not state
it explicitly), add the field name to the `estimated_fields` array. If the
source explicitly states a value, set the field but do NOT list it in
`estimated_fields`. Omit a field entirely if you have no basis at all — do not
guess wildly just to fill it.

Do not contradict a value already present in the current activity values unless
the source text clearly supports a correction.
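
The marking rule above can be sanity-checked mechanically. The following is a minimal sketch, not part of the pipeline: `check_estimated_fields` and `INFERABLE_FIELDS` are hypothetical names, and it assumes that every entry in `estimated_fields` should name a filter field that the part actually sets.

```python
# Hypothetical helper (not in the repository): checks that every name listed
# in `estimated_fields` is an inferable filter field that the part also sets.
INFERABLE_FIELDS = {
    "indoor_outdoor", "space_needed",
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
}


def check_estimated_fields(part: dict) -> list[str]:
    """Return human-readable problems; an empty list means consistent."""
    estimated = part.get("estimated_fields")
    if not isinstance(estimated, list):
        return ["estimated_fields must be a list (use [] if nothing was inferred)"]
    problems = []
    for name in estimated:
        if name not in INFERABLE_FIELDS:
            problems.append(f"{name}: not an inferable filter field")
        elif part.get(name) is None:
            problems.append(f"{name}: marked estimated but no value was set")
    return problems


part = {
    "content_key": "example-key",  # placeholder key for illustration
    "space_needed": "mediu",
    "duration_min": 15,
    "estimated_fields": ["space_needed", "duration_max"],
}
print(check_estimated_fields(part))
```

Here `duration_max` is flagged because it is listed as estimated without a value, while the stated `duration_min` correctly stays out of `estimated_fields`.
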
## Enum vocabulary (fixed — use these exact slugs)
- `indoor_outdoor`: `indoor` | `outdoor` | `either`
- `space_needed`: `mic` | `mediu` | `mare`
## Output format
Write exactly one JSON object to `data/enrichment_parts/<content_key>.json`:
```json
{
"content_key": "<the exact key from the prompt>",
"name_ro": "…",
"description_ro": "…",
"rules_ro": "…",
"variations_ro": "…",
"indoor_outdoor": "outdoor",
"space_needed": "mediu",
"participants_min": 6,
"participants_max": 20,
"duration_min": 15,
"duration_max": 30,
"age_group_min": 8,
"age_group_max": 14,
"estimated_fields": ["space_needed", "duration_min", "duration_max"]
}
```
Include only the fields you actually fill. Always include `content_key` and
`estimated_fields` (use `[]` if nothing was inferred). Output valid JSON only —
no commentary, no markdown fences in the file itself.
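
A reviewer could check a written part against the format above with something like the sketch below. It is an illustration under assumptions rather than the project's validator: `validate_part` is a hypothetical name, and only the rules stated in this prompt are checked (matching `content_key`, `estimated_fields` present, no empty `*_ro` twins, fixed enum slugs).

```python
import json

# Enum slugs and twin names copied from the prompt above.
INDOOR_OUTDOOR = ("indoor", "outdoor", "either")
SPACE_NEEDED = ("mic", "mediu", "mare")
TEXT_TWINS = ("name_ro", "description_ro", "rules_ro", "variations_ro")


def validate_part(raw: str, expected_key: str) -> list[str]:
    """Return format problems for one part file's text (empty list = OK)."""
    try:
        part = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Markdown fences or commentary in the file end up here.
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(part, dict):
        return ["top level must be a single JSON object"]

    problems = []
    if part.get("content_key") != expected_key:
        problems.append("content_key does not match the prompt")
    if "estimated_fields" not in part:
        problems.append("estimated_fields missing (use [] if nothing was inferred)")
    for fld in TEXT_TWINS:
        # Empty twins should be omitted, not emitted as "" or whitespace.
        if fld in part and not str(part[fld] or "").strip():
            problems.append(f"{fld} is empty; omit it instead")
    for fld, allowed in (("indoor_outdoor", INDOOR_OUTDOOR),
                         ("space_needed", SPACE_NEEDED)):
        if fld in part and part[fld] not in allowed:
            problems.append(f"{fld}: {part[fld]!r} is not one of {list(allowed)}")
    return problems


good = '{"content_key": "k1", "name_ro": "Leapșa", "indoor_outdoor": "outdoor", "estimated_fields": []}'
bad = '{"content_key": "k1", "name_ro": " ", "space_needed": "large"}'
print(validate_part(good, "k1"))
print(validate_part(bad, "k1"))
```

The second call reports three problems: the missing `estimated_fields`, the whitespace-only `name_ro`, and the non-slug `space_needed` value.
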
## Report
After writing the file, report in under 30 words: the activity name and which
fields you estimated.


@@ -74,6 +74,23 @@ The file is one JSON object: a `header` plus an `activities` array.
- Do **not** paraphrase the `source_excerpt` — copy it character for character.
- Better to extract fewer activities accurately than to pad the output.
## Writing large outputs in batches (IMPORTANT)
A single Write tool call has a hard ~32K output-token limit. Dense chunks
(50+ activities) will exceed this. If you estimate >30 activities, write the
file **incrementally**:
1. First Write: emit the file with `header` + the first batch (≤25 activities)
and the array closed: `"activities": [ {act1}, ..., {act25} ] }`.
2. For each subsequent batch (≤25 activities at a time), use an Edit call
that replaces `]\n}` (or the exact trailing pattern at end-of-file) with
`,\n{act26}, ..., {act50}\n]\n}`. Use a unique `old_string` (include the
last activity's tail plus the closing bracket and brace) so the Edit is
unambiguous, and repeat that tail at the start of the replacement so no
text is lost.
3. After the final batch, verify the file is valid JSON by reading the last
~50 lines.

This keeps each tool call under the output-token cap.
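
The append step can be modeled as a plain string replacement on the file's tail. The sketch below is only an illustration of that idea: `append_batch` is a hypothetical helper, the real workflow uses Write and Edit tool calls, and the `\n]\n}` tail is an assumption about how the first Write laid out the file. A full `json.loads` parse is a stronger final check than reading the last lines by eye.

```python
import json


def append_batch(file_text: str, new_activities: list[dict]) -> str:
    """Append activities by rewriting only the closing `]` + `}` tail."""
    tail = "\n]\n}"  # assumed end-of-file pattern written by the first batch
    if not new_activities:
        return file_text
    if not file_text.endswith(tail):
        raise ValueError("file does not end with the expected closing pattern")
    batch = ",\n".join(json.dumps(a, ensure_ascii=False) for a in new_activities)
    # Drop the old tail, add a comma + the new batch, then close the array again.
    return file_text[: -len(tail)] + ",\n" + batch + tail


first = '{\n"header": {"source_id": "demo"},\n"activities": [\n{"name": "act1"}\n]\n}'
grown = append_batch(first, [{"name": "act2"}, {"name": "act3"}])

data = json.loads(grown)  # raises json.JSONDecodeError if the file is broken
print(len(data["activities"]))
```

Each call only ever emits one batch of text, which is the property the batching rule relies on.
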
## Before you finish
- Every activity has a non-empty `source_excerpt` and `page_reference`.


@@ -86,7 +86,12 @@ def _split_csv(value: Optional[str]) -> list[str]:
     return [p.strip() for p in str(value).split(",") if p.strip()]
-def dict_to_activity(adict: dict, source_file: str) -> Activity:
+def dict_to_activity(
+    adict: dict,
+    source_file: str,
+    source_id: Optional[str] = None,
+    chunk_key: Optional[str] = None,
+) -> Activity:
     """Build an Activity from one extraction-JSON activity object."""
     tags = adict.get("tags") or []
     if isinstance(tags, str):
@@ -99,6 +104,9 @@ def dict_to_activity(adict: dict, source_file: str) -> Activity:
        source_files = [source_file, *source_files]
    return Activity(
        source_id=source_id,
        source_ids=[source_id] if source_id else [],
        chunk_key=chunk_key,
        name=(adict.get("name") or "").strip(),
        description=(adict.get("description") or "").strip(),
        rules=adict.get("rules"),
@@ -206,6 +214,19 @@ def merge_cluster(cluster: list[Activity]) -> Activity:
            if s and s not in sources:
                sources.append(s)
    merged.source_files = sources
    # source provenance: keep rep's chunk_key/source_id as primary, union the
    # source_ids for the download route. Enrichment fields (name_ro,
    # description_ro, indoor_outdoor, ...) are intentionally NOT carried here:
    # enrichment is applied AFTER dedup (plan Part B2), keyed on the merged
    # row's content_key, so merging must not pre-populate them.
    merged.source_id = rep.source_id
    merged.chunk_key = rep.chunk_key
    source_ids: list[str] = []
    for a in cluster:
        for sid in [a.source_id, *(a.source_ids or [])]:
            if sid and sid not in source_ids:
                source_ids.append(sid)
    merged.source_ids = source_ids
    # popularity_score++ per merged duplicate (plan §4)
    merged.popularity_score = max(a.popularity_score for a in cluster) + (len(cluster) - 1)
    return merged
@@ -313,6 +334,108 @@ def apply_review_decisions(
    return kept, stats
# --------------------------------------------------------------------------
# step 5b — enrichment overlay (plan Part B)
# --------------------------------------------------------------------------
# Translation / inferred-filter fields written by run_enrichment.py. Applied
# AFTER dedup + review decisions, keyed on the same stable content_key, so the
# overlay survives rebuilds as long as extraction text is frozen.
_ENRICHMENT_TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
_ENRICHMENT_INT_FIELDS = (
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
)
def load_enrichment(path: Path) -> dict:
    """Load data/enrichment.json (flat map content_key -> field dict)."""
    if path and path.is_file():
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            if isinstance(data, dict):
                return data
        except (json.JSONDecodeError, OSError):
            pass
    return {}
def apply_enrichment(activities: list[Activity], enrichment: dict) -> dict:
    """
    Overlay enrichment fields onto the post-dedup activity list (plan B2).
    Keyed by content_key. Only fields PRESENT in an entry are written; absent
    fields leave the underlying DB value untouched. indoor_outdoor /
    space_needed are normalized to slugs (None on unrecognised). Inferred
    fields are recorded in `estimated_fields`. Translated / expanded text is
    NOT re-validated against the source here — expansion fidelity is the
    enrichment prompt's responsibility (plan B2 comment).
    Returns {entries, matched, orphaned, fields_stated, fields_estimated}.
    """
    from app.config_taxonomy import normalize_indoor_outdoor, normalize_space_needed
    matched_keys: set[str] = set()
    fields_stated: dict[str, int] = defaultdict(int)
    fields_estimated: dict[str, int] = defaultdict(int)
    for act in activities:
        key = content_key(
            act.normalized_name or normalize_name(act.name),
            act.language,
            act.description or "",
        )
        entry = enrichment.get(key)
        if not isinstance(entry, dict):
            continue
        matched_keys.add(key)
        estimated = set(entry.get("estimated_fields") or [])
        # bilingual text twins
        for fld in _ENRICHMENT_TEXT_FIELDS:
            val = entry.get(fld)
            if isinstance(val, str) and val.strip():
                setattr(act, fld, val.strip())
        # inferred / clarified structured numeric fields
        for fld in _ENRICHMENT_INT_FIELDS:
            if entry.get(fld) is not None:
                try:
                    setattr(act, fld, int(entry[fld]))
                except (TypeError, ValueError):
                    pass
        # enum filters — normalized to slug, dropped if unrecognised
        if entry.get("indoor_outdoor") is not None:
            slug = normalize_indoor_outdoor(entry["indoor_outdoor"])
            if slug:
                act.indoor_outdoor = slug
        if entry.get("space_needed") is not None:
            slug = normalize_space_needed(entry["space_needed"])
            if slug:
                act.space_needed = slug
        act.estimated_fields = sorted(estimated)
        # QA tally: stated vs estimated population, per field
        for fld in (*_ENRICHMENT_INT_FIELDS, "indoor_outdoor", "space_needed"):
            if entry.get(fld) is None:
                continue
            if fld in estimated:
                fields_estimated[fld] += 1
            else:
                fields_stated[fld] += 1
    return {
        "entries": len(enrichment),
        "matched": len(matched_keys),
        "orphaned": len(enrichment) - len(matched_keys),
        "fields_stated": dict(fields_stated),
        "fields_estimated": dict(fields_estimated),
    }
# --------------------------------------------------------------------------
# golden-set recall (plan §7)
# --------------------------------------------------------------------------
@@ -390,9 +513,8 @@ def collect_activities(
         header = data.get("header", {})
         chunk_text = find_chunk_text(json_path, header, chunks_dir)
-        source_id = header.get("source_id") or chunk_key_for(json_path, header).rsplit(
-            ".part", 1
-        )[0]
+        chunk_key = chunk_key_for(json_path, header)
+        source_id = header.get("source_id") or chunk_key.rsplit(".part", 1)[0]
         fallback_source = (
             source_path_for(source_id, sources_dir) or source_id or json_path.stem
         )
@@ -409,7 +531,7 @@ def collect_activities(
                 continue
             src = adict.get("source_file") or fallback_source
             raw_categories.append((adict.get("category", ""), normalize_category(adict.get("category", ""))))
-            activities.append(dict_to_activity(adict, src))
+            activities.append(dict_to_activity(adict, src, source_id, chunk_key))
         if hallucinated:
             _log_hallucinations(json_path, rejected_dir, hallucinated)
@@ -496,6 +618,7 @@ def rebuild(
    sources_dir: Path,
    db_path: Path,
    decisions_path: Optional[Path] = None,
    enrichment_path: Optional[Path] = None,
    schema_path: Path = DEFAULT_SCHEMA_PATH,
    golden_dir: Optional[Path] = None,
    do_swap: bool = True,
@@ -517,6 +640,11 @@ def rebuild(
    decisions = load_review_decisions(Path(decisions_path)) if decisions_path else {}
    final, decision_stats = apply_review_decisions(deduped, decisions)
    # Enrichment overlay — applied immediately after review decisions, on the
    # post-dedup list, keyed on the same stable content_key (plan B2).
    enrichment = load_enrichment(Path(enrichment_path)) if enrichment_path else {}
    enrichment_stats = apply_enrichment(final, enrichment)
    try:
        write_database(db_tmp_path, final)
        backup = atomic_swap(db_tmp_path, db_path) if do_swap else None
@@ -529,6 +657,7 @@ def rebuild(
        **collected,
        "dedup": dedup_stats,
        "decisions": decision_stats,
        "enrichment": enrichment_stats,
        "final_count": len(final),
        "backup": str(backup) if backup else None,
        "swapped": do_swap,
@@ -579,6 +708,16 @@ def print_report(report: dict) -> None:
f"(auto-merged {d['auto_merged']}, borderline {d['borderline']})")
    print(f"review decisions : dropped {report['decisions']['dropped']}, "
          f"resolved {report['decisions']['resolved']}")
    enr = report.get("enrichment")
    if enr and enr.get("entries"):
        print(f"enrichment : {enr['entries']} entries "
              f"(matched {enr['matched']}, orphaned {enr['orphaned']})")
        stated, estimated = enr.get("fields_stated", {}), enr.get("fields_estimated", {})
        all_fields = sorted(set(stated) | set(estimated))
        if all_fields:
            print(" field population : (stated / estimated)")
            for fld in all_fields:
                print(f" {fld:<18}: {stated.get(fld, 0)} / {estimated.get(fld, 0)}")
    print(f"final inserted : {report['final_count']}")
    print(f"% with rules : {qa['pct_with_rules']}")
    print(f"needs_review rows : {qa['needs_review']}")
@@ -615,6 +754,7 @@ def main(argv: Optional[list[str]] = None) -> int:
    parser.add_argument("--sources", default="data/sources")
    parser.add_argument("--db", default="data/activities.db")
    parser.add_argument("--decisions", default="data/review_decisions.json")
    parser.add_argument("--enrichment", default="data/enrichment.json")
    parser.add_argument("--golden", default="data/golden")
    parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
    args = parser.parse_args(argv)
@@ -628,6 +768,7 @@ def main(argv: Optional[list[str]] = None) -> int:
        sources_dir=Path(args.sources),
        db_path=Path(args.db),
        decisions_path=Path(args.decisions),
        enrichment_path=Path(args.enrichment),
        schema_path=Path(args.schema),
        golden_dir=Path(args.golden),
    )
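
The overlay semantics described in `apply_enrichment` (write only the fields an entry actually carries, count unmatched entries as orphaned) can be illustrated with a reduced stand-in. This is a sketch under assumptions, not the project's code: `Row` and `overlay` are hypothetical, and the key is stored on the row instead of being recomputed with `content_key` as the real function does.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Row:
    """Hypothetical stand-in for Activity, reduced to a few fields."""
    key: str
    name: str
    name_ro: Optional[str] = None
    duration_min: Optional[int] = None
    estimated_fields: list = field(default_factory=list)


def overlay(rows: list, enrichment: dict) -> dict:
    """Apply enrichment entries by key; absent fields stay untouched."""
    matched = set()
    for row in rows:
        entry = enrichment.get(row.key)
        if not isinstance(entry, dict):
            continue  # no entry for this row: leave it exactly as it is
        matched.add(row.key)
        text = entry.get("name_ro")
        if isinstance(text, str) and text.strip():
            row.name_ro = text.strip()
        if entry.get("duration_min") is not None:
            try:
                row.duration_min = int(entry["duration_min"])
            except (TypeError, ValueError):
                pass  # unparseable number: keep the existing value
        row.estimated_fields = sorted(set(entry.get("estimated_fields") or []))
    return {
        "entries": len(enrichment),
        "matched": len(matched),
        "orphaned": len(enrichment) - len(matched),
    }


rows = [Row("k1", "Tag", duration_min=10), Row("k2", "Relay")]
enrichment = {
    "k1": {"name_ro": "Leapșa", "estimated_fields": []},
    "k9": {"name_ro": "Orfan"},  # key matches no row, so it counts as orphaned
}
stats = overlay(rows, enrichment)
print(stats)
print(rows[0].name_ro, rows[0].duration_min, rows[1].name_ro)
```

The entry for `k1` carries no `duration_min`, so the existing value of 10 survives, and a nonzero orphaned count would be the signal that keys drifted after the freeze.
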


@@ -0,0 +1,244 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
repair_extractions.py — one-shot repair of malformed extraction JSON.
Subagents systematically emit unescaped ASCII double-quotes inside string
values (Romanian text like „Unu" uses a closing " that terminates the JSON
string early). Re-extraction reproduces the bug, so we repair instead.
IMPORTANT — why NOT json_repair: json_repair "recovers" an unescaped quote by
ending the string at the stray quote and reinterpreting the trailing text as a
new key, which (a) TRUNCATES the value and (b) injects garbage keys. The
truncation is silent (the field is still non-empty) and slips past a naive
presence check. So we use a faithful char-scanner that ESCAPES stray quotes
(\\") instead of splitting on them, then validate the result against the real
activity schema (additionalProperties:false also catches any residual split).
This is an OFFLINE maintenance tool. build_database.py must NOT depend on it —
the "DB regenerable from data/extracted/" invariant requires plain valid JSON on
disk. We write clean JSON back to data/extracted/ and the build reads vanilla
json.
Source selection (faithful recovery needs the ORIGINAL malformed text):
* a chunk is a candidate when a MALFORMED original exists — either the
top-level data/extracted/<key>.json is itself invalid, or a malformed
original sits in data/extracted/_rejected/<key>.json.
* the malformed original is preferred as the repair source.
* chunks whose only artifact is already-valid JSON (e.g. a prior json_repair
output that lost the original) are NOT silently "repaired" — if such a chunk
has no valid top-level file it is reported as needing RE-EXTRACTION.
Usage:
python scripts/repair_extractions.py # report only (dry run)
python scripts/repair_extractions.py --apply # write repaired JSON
"""
from __future__ import annotations
import argparse
import glob
import json
from pathlib import Path
from typing import Optional
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
EXTRACTED = REPO_ROOT / "data" / "extracted"
REJECTED = EXTRACTED / "_rejected"
if str(SCRIPT_DIR) not in __import__("sys").path:
    __import__("sys").path.insert(0, str(SCRIPT_DIR))
from import_common import DEFAULT_SCHEMA_PATH, load_schema, validate_extraction # noqa: E402
def escape_stray_quotes(s: str) -> str:
    """Escape ASCII double-quotes that occur INSIDE a JSON string value.
    A `"` inside a string is treated as a real string-close only when the next
    non-whitespace char is structural (`,` `}` `]` `:`) or EOF; otherwise it is
    content and is escaped to `\\"`. This preserves the full value instead of
    truncating it (the json_repair failure mode).
    """
    out: list[str] = []
    in_str = False
    esc = False
    n = len(s)
    i = 0
    while i < n:
        c = s[i]
        if esc:
            out.append(c)
            esc = False
            i += 1
            continue
        if c == "\\":
            out.append(c)
            esc = True
            i += 1
            continue
        if c == '"':
            if not in_str:
                in_str = True
                out.append(c)
            else:
                j = i + 1
                while j < n and s[j] in " \t\r\n":
                    j += 1
                nxt = s[j] if j < n else ""
                if nxt in ",}]:" or nxt == "":
                    in_str = False
                    out.append(c)
                else:
                    out.append('\\"')  # content quote → escape, keep value whole
            i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)
def _is_valid_json(path: Path) -> bool:
    try:
        json.loads(path.read_text(encoding="utf-8"))
        return True
    except (json.JSONDecodeError, OSError):
        return False
def _malformed_source(key: str) -> Optional[Path]:
    """Return the malformed-original file for a chunk, preferring top-level."""
    live = EXTRACTED / f"{key}.json"
    if live.exists() and not _is_valid_json(live):
        return live
    rej = REJECTED / f"{key}.json"
    if rej.exists() and not _is_valid_json(rej):
        return rej
    return None
def _candidate_keys() -> tuple[dict[str, Path], list[str]]:
    """
    (repair_candidates, needs_reextraction).
    repair_candidates: key -> malformed source file (faithfully repairable).
    needs_reextraction: chunks with no malformed original AND no valid
    top-level file (their original was lost) — must be re-extracted.
    """
    keys = set()
    for fn in glob.glob(str(EXTRACTED / "*.json")):
        keys.add(Path(fn).stem)
    for fn in glob.glob(str(REJECTED / "*.json")):
        keys.add(Path(fn).stem)
    candidates: dict[str, Path] = {}
    needs_reextraction: list[str] = []
    for key in sorted(keys):
        # A malformed original anywhere is faithfully repairable, and is the
        # source of truth even if a (json_repair-produced, possibly truncated)
        # valid top-level file exists — escaping the original never truncates,
        # so re-repairing from it is always >= the json_repair output.
        src = _malformed_source(key)
        if src is not None:
            candidates[key] = src
            continue
        live = EXTRACTED / f"{key}.json"
        if live.exists() and _is_valid_json(live):
            continue  # genuinely-valid extraction, nothing to do
        # no valid top-level and no malformed original to repair from
        needs_reextraction.append(key)
    return candidates, needs_reextraction
def repair(apply: bool) -> int:
    schema = load_schema(DEFAULT_SCHEMA_PATH)
    candidates, needs_reextraction = _candidate_keys()
    print("=" * 64)
    print(f"REPAIR EXTRACTIONS ({'APPLY' if apply else 'dry run'})")
    print("=" * 64)
    print(f"repair candidates: {len(candidates)}")
    def _textlen(data: dict) -> int:
        total = 0
        for a in data.get("activities", []):
            if isinstance(a, dict):
                for v in a.values():
                    if isinstance(v, str):
                        total += len(v)
        return total
    ok = 0
    kept_toplevel = 0
    still_bad: list[str] = []
    schema_fail: list[tuple[str, str]] = []
    for key, src in candidates.items():
        live = EXTRACTED / f"{key}.json"
        live_valid = live.exists() and _is_valid_json(live)
        raw = src.read_text(encoding="utf-8")
        fixed = escape_stray_quotes(raw)
        try:
            data = json.loads(fixed)
        except json.JSONDecodeError as exc:
            if live_valid:
                kept_toplevel += 1  # genuine top-level is fine; stale _rejected
            else:
                still_bad.append(f"{key}: still invalid after escape ({exc})")
            continue
        errors = validate_extraction(data, schema)
        if errors:
            if live_valid:
                kept_toplevel += 1
            else:
                schema_fail.append((key, errors[0]))
                print(f" {key[:50]:<50} SCHEMA-FAIL: {errors[0][:40]}")
            continue
        # Faithfulness guard: only replace a valid top-level when the escaped
        # repair carries STRICTLY more text (i.e. the top-level was a truncated
        # json_repair output). Genuine extractions are kept untouched.
        if live_valid:
            try:
                live_data = json.loads(live.read_text(encoding="utf-8"))
            except json.JSONDecodeError:
                live_data = {}
            if _textlen(data) <= _textlen(live_data):
                kept_toplevel += 1
                continue
        n = len(data.get("activities", []))
        print(f" {key[:50]:<50} {n:>3} acts REPAIR")
        if apply:
            live.write_text(
                json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
            )
        ok += 1
    print("-" * 64)
    print(f"repaired: {ok} | kept genuine top-level: {kept_toplevel} | "
          f"schema-fail: {len(schema_fail)} | still-bad: {len(still_bad)} | "
          f"needs re-extraction: {len(needs_reextraction)}")
    for key, err in schema_fail:
        print(f" ⚠ schema {key}: {err[:60]}")
    for msg in still_bad:
        print(f"{msg}")
    for key in needs_reextraction:
        print(f" ↻ re-extract: {key}")
    if not apply:
        print("\nDry run — re-run with --apply to write repaired JSON.")
    print("=" * 64)
    return 0
def main(argv: Optional[list[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Repair malformed extraction JSON.")
    parser.add_argument("--apply", action="store_true",
                        help="write repaired JSON (default: dry run)")
    args = parser.parse_args(argv)
    return repair(args.apply)
if __name__ == "__main__":
    raise SystemExit(main())
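
To make the failure mode concrete, the sketch below runs a condensed restatement of the scanner on a small malformed sample. The condensed function is written for this illustration (same lookahead rule as `escape_stray_quotes` in this file, compressed so the example runs alone), and the sample text is made up. Note the heuristic's limit: a content quote that happens to be followed by `,` `}` `]` or `:` would still be read as a string close, which is presumably why the script validates against the schema afterwards.

```python
import json


def escape_stray_quotes(s: str) -> str:
    """Condensed restatement of the scanner, repeated so the demo runs alone."""
    out, in_str, esc = [], False, False
    for i, c in enumerate(s):
        if esc:                       # char after a backslash: copy verbatim
            out.append(c)
            esc = False
        elif c == "\\":
            out.append(c)
            esc = True
        elif c == '"' and not in_str:
            out.append(c)             # opening quote of a string
            in_str = True
        elif c == '"':
            rest = s[i + 1:].lstrip(" \t\r\n")
            if rest == "" or rest[0] in ",}]:":
                out.append(c)         # structural char follows: real close
                in_str = False
            else:
                out.append('\\"')     # content quote: escape, keep value whole
        else:
            out.append(c)
    return "".join(out)


# Made-up sample: the closing ASCII quote after „Unu ends the string early.
broken = '{"name": "Jocul „Unu" și doi", "rules": "Spune „Stop" tare."}'

try:
    json.loads(broken)
    parsed_raw = True
except json.JSONDecodeError:
    parsed_raw = False

fixed = json.loads(escape_stray_quotes(broken))
print(parsed_raw)
print(fixed["name"])
print(fixed["rules"])
```

The raw text does not parse, while the escaped text parses with both values intact, stray quotes included.
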

scripts/run_enrichment.py Normal file

@@ -0,0 +1,270 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
run_enrichment.py — enrichment orchestrator (plan Part B3).
Mirror of run_extraction.py, on the *other* side of the rebuild. It reads the
already-rebuilt data/activities.db, and for every activity emits one subagent
prompt asking for a single bilingual + inferred-filter enrichment pass. Like
extraction, this script does NOT call the LLM — the interactive Claude Code
orchestrator launches waves of subagents on the emitted prompts.
Keying is the crux (plan §"Cheia de keying"): each row's overlay is keyed on
import_common.content_key(normalized_name, language, _normalize_text(description))
— the SAME function build_database uses to apply the overlay. The key is stable
only while the extraction text is frozen, so enrichment runs AFTER the freezing
rebuild.
Modes:
(default) emit one prompt per activity that has no enrichment part yet
(resumable: data/enrichment_parts/<key>.json present => skip)
--collect merge data/enrichment_parts/*.json -> data/enrichment.json
Pilot scoping (plan B5): --source <source_id substring> and/or --limit N narrow
the emitted prompts to a single source / category for the sign-off pilot.
Usage:
python scripts/run_enrichment.py --source teambuilding_corbu # pilot
python scripts/run_enrichment.py # all rows
python scripts/run_enrichment.py --collect # merge parts
"""
from __future__ import annotations
import argparse
import json
import sqlite3
import sys
from pathlib import Path
from typing import Optional
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
    if _p not in sys.path:
        sys.path.insert(0, _p)
from import_common import (  # noqa: E402
    content_key,
    find_chunk_text,
    normalize_name,
)
ENRICHMENT_PROMPT = SCRIPT_DIR / "ENRICHMENT_PROMPT.md"
# Columns pulled from the DB into the prompt as the "current value" context.
_DB_COLUMNS = (
    "id", "name", "description", "rules", "variations",
    "category", "content_type", "language", "normalized_name",
    "page_reference", "source_id", "chunk_key",
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
)
# How much source-chunk text to inline. Chunks are page-sized; cap so a dense
# chunk does not blow the prompt up, but keep enough to ground the expansion.
_CHUNK_TEXT_CAP = 12000
def _fetch_rows(db_path: Path, source_substr: Optional[str]) -> list[dict]:
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        cols = ", ".join(_DB_COLUMNS)
        sql = f"SELECT {cols} FROM activities"
        params: list = []
        if source_substr:
            sql += " WHERE (source_id LIKE ? OR chunk_key LIKE ?)"
            params = [f"%{source_substr}%", f"%{source_substr}%"]
        sql += " ORDER BY source_id, id"
        return [dict(r) for r in conn.execute(sql, params).fetchall()]
    finally:
        conn.close()
def _row_content_key(row: dict) -> str:
    return content_key(
        row.get("normalized_name") or normalize_name(row.get("name") or ""),
        row.get("language"),
        row.get("description") or "",
    )
def _chunk_text_for_row(row: dict, chunks_dir: Path) -> Optional[str]:
    """Locate the source-chunk text via the row's chunk_key / source_id."""
    header = {"chunk_key": row.get("chunk_key"), "source_id": row.get("source_id")}
    if not header["chunk_key"]:
        return None
    # find_chunk_text resolves from the header when chunk_key is present;
    # the json_path arg is only a fallback, so a synthetic path is fine.
    text = find_chunk_text(Path(f"{row['chunk_key']}.json"), header, chunks_dir)
    if text and len(text) > _CHUNK_TEXT_CAP:
        text = text[:_CHUNK_TEXT_CAP] + "\n…[chunk truncated]…"
    return text
def _current_fields_block(row: dict) -> str:
    """The activity's current DB values, as a compact JSON block for context."""
    fields = {
        "name": row.get("name"),
        "description": row.get("description"),
        "rules": row.get("rules"),
        "variations": row.get("variations"),
        "category": row.get("category"),
        "content_type": row.get("content_type"),
        "language": row.get("language"),
        "participants_min": row.get("participants_min"),
        "participants_max": row.get("participants_max"),
        "duration_min": row.get("duration_min"),
        "duration_max": row.get("duration_max"),
        "age_group_min": row.get("age_group_min"),
        "age_group_max": row.get("age_group_max"),
    }
    return json.dumps(fields, ensure_ascii=False, indent=2)
def emit_enrichment_prompt(
    row: dict, key: str, chunks_dir: Path, prompts_dir: Path
) -> Path:
    """Write the subagent enrichment prompt for one activity."""
    chunk_text = _chunk_text_for_row(row, chunks_dir)
    source_block = (
        chunk_text if chunk_text is not None
        else "[source chunk text unavailable — translate only what is given "
             "above; do NOT invent steps, and mark any inferred filter field "
             "as estimated]"
    )
    part_path = f"data/enrichment_parts/{key}.json"
    text = "\n".join([
        f"# ENRICHMENT — activity `{row.get('name')}` (id {row.get('id')})",
        "",
        f"Follow the rules in `{ENRICHMENT_PROMPT.relative_to(REPO_ROOT)}` EXACTLY.",
        "Single pass. Translate faithfully to Romanian; expand description_ro "
        "ONLY from the source chunk text below; mark inferred filter fields in "
        "`estimated_fields`.",
        "",
        f"Write the result JSON to: `{part_path}`",
        f'It MUST include `"content_key": "{key}"`.',
        f'Page reference: {row.get("page_reference") or "?"}',
        "",
        "## Current activity values (the text to translate / enrich)",
        "```json",
        _current_fields_block(row),
        "```",
        "",
        "## Source chunk text (ground description_ro expansion in THIS only)",
        "```",
        source_block,
        "```",
        "",
    ])
    prompts_dir.mkdir(parents=True, exist_ok=True)
    out = prompts_dir / f"{key}.prompt.md"
    out.write_text(text, encoding="utf-8")
    return out
def collect_enrichment(parts_dir: Path, out_path: Path) -> dict:
    """Merge data/enrichment_parts/*.json into one flat content_key map."""
    merged: dict = {}
    bad: list[str] = []
    if parts_dir.is_dir():
        for part in sorted(parts_dir.glob("*.json")):
            try:
                data = json.loads(part.read_text(encoding="utf-8"))
            except (json.JSONDecodeError, OSError):
                bad.append(part.name)
                continue
            if not isinstance(data, dict):
                # valid JSON but not an object (e.g. a list): .get() would raise
                bad.append(part.name)
                continue
            key = data.get("content_key") or part.stem
            entry = {k: v for k, v in data.items() if k != "content_key"}
            merged[key] = entry
    out_path.write_text(
        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return {"entries": len(merged), "bad_parts": bad, "out": str(out_path)}
def run_emit(
    *,
    db_path: Path,
    chunks_dir: Path,
    parts_dir: Path,
    prompts_dir: Path,
    source_substr: Optional[str],
    limit: Optional[int],
) -> dict:
    rows = _fetch_rows(db_path, source_substr)
    emitted, skipped = 0, 0
    for row in rows:
        key = _row_content_key(row)
        if (parts_dir / f"{key}.json").is_file():
            skipped += 1
            continue
        emit_enrichment_prompt(row, key, chunks_dir, prompts_dir)
        emitted += 1
        if limit and emitted >= limit:
            break
    return {
        "rows": len(rows),
        "emitted": emitted,
        "skipped_done": skipped,
        "prompts_dir": str(prompts_dir),
    }
def main(argv: Optional[list[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Enrichment orchestrator.")
    parser.add_argument("--db", default="data/activities.db")
    parser.add_argument("--chunks", default="data/chunks")
    parser.add_argument("--parts", default="data/enrichment_parts")
    parser.add_argument("--prompts", default="data/enrichment_prompts")
    parser.add_argument("--out", default="data/enrichment.json")
    parser.add_argument("--source", default=None,
                        help="only rows whose source_id/chunk_key contains this (pilot)")
    parser.add_argument("--limit", type=int, default=None,
                        help="cap emitted prompts (pilot)")
    parser.add_argument("--collect", action="store_true",
                        help="merge enrichment parts into the overlay JSON")
    args = parser.parse_args(argv)
    print("=" * 60)
    print("ENRICHMENT ORCHESTRATOR")
    print("=" * 60)
    if args.collect:
        result = collect_enrichment(Path(args.parts), Path(args.out))
        print(f"collected : {result['entries']} entries -> {result['out']}")
        if result["bad_parts"]:
            print(f"bad parts : {len(result['bad_parts'])} (skipped)")
            for name in result["bad_parts"]:
                print(f" - {name}")
        print("Run build_database.py --rebuild to apply the overlay.")
        print("=" * 60)
        return 0
    summary = run_emit(
        db_path=Path(args.db),
        chunks_dir=Path(args.chunks),
        parts_dir=Path(args.parts),
        prompts_dir=Path(args.prompts),
        source_substr=args.source,
        limit=args.limit,
    )
    print(f"rows in DB : {summary['rows']}"
          + (f" (filtered by '{args.source}')" if args.source else ""))
    print(f"already enriched : {summary['skipped_done']}")
    print(f"prompts emitted : {summary['emitted']}")
    if summary["emitted"]:
        print(f"prompts dir : {summary['prompts_dir']}/")
        print("Launch waves of ~8-16 Sonnet subagents on those prompts, each "
              "writing data/enrichment_parts/<key>.json, then run "
              "run_enrichment.py --collect and build_database.py --rebuild.")
    else:
        print("Nothing to emit — run --collect then build_database.py --rebuild.")
    print("=" * 60)
    return 0
if __name__ == "__main__":
    raise SystemExit(main())
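
The resume-and-collect behaviour can be exercised in isolation with a temporary directory. The sketch below is an illustration under assumptions: `pending_keys` and `collect` are hypothetical reductions of `run_emit` and `collect_enrichment`, and the keys `k1` to `k3` are placeholders. It highlights one consequence of the file-presence check: a part that exists but is malformed is skipped by the merge and is also treated as already done by the emit step, so it would need to be deleted before its prompt is emitted again.

```python
import json
import tempfile
from pathlib import Path


def pending_keys(all_keys: list, parts_dir: Path) -> list:
    """Keys that still need a prompt: no part file on disk yet."""
    return [k for k in all_keys if not (parts_dir / f"{k}.json").is_file()]


def collect(parts_dir: Path):
    """Merge part files into one key -> fields map; report unreadable parts."""
    merged, bad = {}, []
    for part in sorted(parts_dir.glob("*.json")):
        try:
            data = json.loads(part.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError):
            bad.append(part.name)
            continue
        if not isinstance(data, dict):
            bad.append(part.name)
            continue
        key = data.get("content_key") or part.stem
        merged[key] = {k: v for k, v in data.items() if k != "content_key"}
    return merged, bad


with tempfile.TemporaryDirectory() as tmp:
    parts = Path(tmp)
    good = {"content_key": "k1", "name_ro": "Leapșa", "estimated_fields": []}
    (parts / "k1.json").write_text(json.dumps(good, ensure_ascii=False), encoding="utf-8")
    (parts / "k2.json").write_text("{not json", encoding="utf-8")  # malformed part

    todo = pending_keys(["k1", "k2", "k3"], parts)
    merged, bad = collect(parts)

print(todo)
print(bad)
print(merged)
```

Only `k3` is still pending, `k2.json` is reported as bad, and the merged map holds the single valid entry under `k1`.
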