Faza 1 complete: bilingual+enrichment plumbing, UI/filters, frozen DB
Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await re-extraction). DB rebuilt and frozen at 9418 activities; content_keys are now stable for the enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor, space_needed, estimated_fields, source_id/source_ids/chunk_key columns; FTS5 indexes the 4 *_ro columns across CREATE + all 3 triggers; new equality filters + category counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*, is_estimated, axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers (None on unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns, RO-primary rendering with "(estimat)" markers and collapsible original text, and a /source/<id> download route shipped DARK behind SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster unions source_ids without touching enrichment fields.

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect), ENRICHMENT_PROMPT.md.

Repair: scripts/repair_extractions.py fixes the subagents' systematic unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never truncates) + schema validation + a strictly-more-text guard. json_repair was tried first, truncated silently, and is NOT used. build_database has no repair dependency.

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
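The message says the new normalizers return None on unrecognised input instead of fabricating a value. A minimal sketch of that behaviour follows. The slugs come from the enum vocabulary in ENRICHMENT_PROMPT.md; the alias table and the function body are assumptions for illustration, not the contents of config_taxonomy.py.

```python
from typing import Optional

# Canonical slugs, as listed in the enum vocabulary of ENRICHMENT_PROMPT.md.
SPACE_NEEDED = ("mic", "mediu", "mare")

# Hypothetical alias table: these mappings are illustrative assumptions.
_SPACE_ALIASES = {"small": "mic", "medium": "mediu", "large": "mare"}


def normalize_space_needed(value: object) -> Optional[str]:
    """Return a canonical slug, or None when the value is not recognised."""
    if not isinstance(value, str):
        return None
    slug = value.strip().lower()
    slug = _SPACE_ALIASES.get(slug, slug)
    # Unknown input is dropped rather than mapped to a "closest" slug.
    return slug if slug in SPACE_NEEDED else None
```

Returning None lets the caller leave the column empty, so a bad enrichment value never shows up as a filterable category.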
98
scripts/ENRICHMENT_PROMPT.md
Normal file
@@ -0,0 +1,98 @@
|
||||
# SUBAGENT — Activity enrichment
|
||||
|
||||
You are a subagent in the game-library enrichment pipeline. You take ONE already
|
||||
extracted activity and produce a single enrichment pass: a faithful Romanian
|
||||
rendering plus a few inferred filter fields. You do **one** activity per prompt.
|
||||
|
||||
This is **not** re-extraction. The activity text already exists and is trusted.
|
||||
Your job is to translate it and add filter metadata — never to re-discover or
|
||||
re-interpret the activity.
|
||||
|
||||
## Your task
|
||||
|
||||
The prompt gives you two blocks:
|
||||
|
||||
1. **Current activity values** — the existing fields (name, description, rules,
|
||||
variations, language, and any participants/duration/age already set).
|
||||
2. **Source chunk text** — the original passage the activity came from. This is
|
||||
your ground truth for any expansion. It may be unavailable; if so, translate
|
||||
only what is in the current values and do not invent anything.
|
||||
|
||||
Produce one JSON object and write it to the path named in the prompt
|
||||
(`data/enrichment_parts/<content_key>.json`). It MUST contain the exact
|
||||
`content_key` string from the prompt.
|
||||
|
||||
## Rules
|
||||
|
||||
### Translation (always)
|
||||
- Translate `name`, `description`, `rules`, `variations` into natural, fluent
|
||||
Romanian → `name_ro`, `description_ro`, `rules_ro`, `variations_ro`.
|
||||
- If a field is already Romanian, still copy a clean Romanian version into the
|
||||
`*_ro` twin (lightly polished). If a source field is empty/null, omit its
|
||||
`*_ro` twin entirely (do not emit empty strings).
|
||||
- Translate faithfully. Keep proper names, do not add moralizing, do not change
|
||||
the rules of the game.
|
||||
|
||||
### Description expansion (constrained)
|
||||
- You MAY make `description_ro` richer than a literal translation — but ONLY
|
||||
using detail that is actually present in the **source chunk text**. Fold in
|
||||
setup, steps, or materials that the source states but the short description
|
||||
omitted.
|
||||
- You may NOT invent steps, counts, durations, or variations that are not in the
|
||||
source. If the source is thin, the translation stays thin. Hallucinated
|
||||
expansion is the one unacceptable failure here.
|
||||
|
||||
### Inferred filter fields (mark when inferred)
|
||||
Fill these when you can, using the source text first, then reasonable inference:
|
||||
|
||||
- `indoor_outdoor`: one of `indoor`, `outdoor`, `either`.
|
||||
- `space_needed`: one of `mic`, `mediu`, `mare` (small / medium / large area).
|
||||
- `participants_min`, `participants_max`: integers (people).
|
||||
- `duration_min`, `duration_max`: integers (minutes).
|
||||
- `age_group_min`, `age_group_max`: integers (years).
|
||||
|
||||
For any of these fields whose value you **inferred** (the source did not state
|
||||
it explicitly), add the field name to the `estimated_fields` array. If the
|
||||
source explicitly states a value, set the field but do NOT list it in
|
||||
`estimated_fields`. Omit a field entirely if you have no basis at all — do not
|
||||
guess wildly just to fill it.
|
||||
|
||||
Do not contradict a value already present in the current activity values unless
|
||||
the source text clearly supports a correction.
|
||||
|
||||
## Enum vocabulary (fixed — use these exact slugs)
|
||||
|
||||
- `indoor_outdoor`: `indoor` | `outdoor` | `either`
|
||||
- `space_needed`: `mic` | `mediu` | `mare`
|
||||
|
||||
## Output format
|
||||
|
||||
Write exactly one JSON object to `data/enrichment_parts/<content_key>.json`:
|
||||
|
||||
```json
{
  "content_key": "<the exact key from the prompt>",
  "name_ro": "…",
  "description_ro": "…",
  "rules_ro": "…",
  "variations_ro": "…",
  "indoor_outdoor": "outdoor",
  "space_needed": "mediu",
  "participants_min": 6,
  "participants_max": 20,
  "duration_min": 15,
  "duration_max": 30,
  "age_group_min": 8,
  "age_group_max": 14,
  "estimated_fields": ["space_needed", "duration_min", "duration_max"]
}
```
|
||||
|
||||
Include only the fields you actually fill. Always include `content_key` and
|
||||
`estimated_fields` (use `[]` if nothing was inferred). Output valid JSON only —
|
||||
no commentary, no markdown fences in the file itself.
|
||||
|
||||
## Report
|
||||
|
||||
After writing the file, report in under 30 words: the activity name and which
|
||||
fields you estimated.
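The output rules above are mechanical enough to check before a part is merged. The sketch below is one possible pre-merge check, assuming the field lists from the prompt; `check_enrichment_part` is a hypothetical helper and is not part of this commit.

```python
import json

TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
INT_FIELDS = (
    "participants_min", "participants_max",
    "duration_min", "duration_max",
    "age_group_min", "age_group_max",
)
ENUMS = {
    "indoor_outdoor": {"indoor", "outdoor", "either"},
    "space_needed": {"mic", "mediu", "mare"},
}


def check_enrichment_part(raw: str, expected_key: str) -> list[str]:
    """Return a list of problems; an empty list means the part looks well-formed."""
    try:
        part = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(part, dict):
        return ["top level is not a JSON object"]

    problems: list[str] = []
    if part.get("content_key") != expected_key:
        problems.append("content_key missing or different from the prompt")
    if not isinstance(part.get("estimated_fields"), list):
        problems.append("estimated_fields must be a list (use [] if nothing was inferred)")
    for fld in TEXT_FIELDS:
        # Empty twins must be omitted, not emitted as empty strings.
        if fld in part and not (isinstance(part[fld], str) and part[fld].strip()):
            problems.append(f"{fld} is empty; omit it instead")
    for fld in INT_FIELDS:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if fld in part and (isinstance(part[fld], bool) or not isinstance(part[fld], int)):
            problems.append(f"{fld} must be an integer")
    for fld, allowed in ENUMS.items():
        if fld in part and (not isinstance(part[fld], str) or part[fld] not in allowed):
            problems.append(f"{fld} is not one of {sorted(allowed)}")
    return problems
```

A check like this only covers shape. It cannot detect hallucinated expansion, which the prompt names as the one unacceptable failure.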
|
||||
@@ -74,6 +74,23 @@ The file is one JSON object: a `header` plus an `activities` array.
|
||||
- Do **not** paraphrase the `source_excerpt` — copy it character for character.
|
||||
- Better to extract fewer activities accurately than to pad the output.
|
||||
|
||||
## Writing large outputs in batches (IMPORTANT)
|
||||
|
||||
A single Write tool call has a hard ~32K output-token limit. Dense chunks
|
||||
(50+ activities) will exceed this. If you estimate >30 activities, write the
|
||||
file **incrementally**:
|
||||
|
||||
1. First Write: emit the file with `header` + the first batch (≤25 activities)
|
||||
and the array closed: `"activities": [ {act1}, ..., {act25} ] }`.
|
||||
2. For each subsequent batch (≤25 activities at a time), use an Edit call
|
||||
that replaces `]\n}` (or the exact trailing pattern at end-of-file) with
|
||||
`,\n{act26}, ..., {act50}\n]\n}`. Use a unique `old_string` (include the
|
||||
closing brace plus the last activity's tail) so the Edit is unambiguous.
|
||||
3. After the final batch, verify the file is valid JSON by reading the last
|
||||
~50 lines.
|
||||
|
||||
This keeps each tool call under the output-token cap.
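The splice described in the steps above can be modelled as a pure string operation. This is a sketch under the assumption that `activities` is the last member of the object, as in the template shown in step 1; `append_batch` is a hypothetical name, and the real workflow uses the Write/Edit tools rather than this function.

```python
import json


def append_batch(file_text: str, batch: list[dict]) -> str:
    """Splice more activities in front of the closing `]` and `}` of the file."""
    if not batch:
        return file_text
    body = file_text.rstrip()
    if not body.endswith("}"):
        raise ValueError("file does not end with the closing brace")
    inner = body[:-1].rstrip()
    if not inner.endswith("]"):
        raise ValueError("activities array is not the last member")
    head = inner[:-1].rstrip()
    items = ",\n".join(json.dumps(a, ensure_ascii=False) for a in batch)
    # No leading comma when the array was still empty.
    sep = "" if head.endswith("[") else ",\n"
    return f"{head}{sep}{items}\n]\n}}"


first = json.dumps({"header": {"chunk": "c1"}, "activities": [{"name": "a1"}]})
grown = append_batch(first, [{"name": "a2"}, {"name": "a3"}])
names = [a["name"] for a in json.loads(grown)["activities"]]
```

Parsing the result after every batch, as in the last two lines, is a stricter version of the "read the last ~50 lines" check in step 3.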
|
||||
|
||||
## Before you finish
|
||||
|
||||
- Every activity has a non-empty `source_excerpt` and `page_reference`.
|
||||
|
||||
@@ -86,7 +86,12 @@ def _split_csv(value: Optional[str]) -> list[str]:
|
||||
return [p.strip() for p in str(value).split(",") if p.strip()]
|
||||
|
||||
|
||||
-def dict_to_activity(adict: dict, source_file: str) -> Activity:
+def dict_to_activity(
+    adict: dict,
+    source_file: str,
+    source_id: Optional[str] = None,
+    chunk_key: Optional[str] = None,
+) -> Activity:
|
||||
"""Build an Activity from one extraction-JSON activity object."""
|
||||
tags = adict.get("tags") or []
|
||||
if isinstance(tags, str):
|
||||
@@ -99,6 +104,9 @@ def dict_to_activity(adict: dict, source_file: str) -> Activity:
|
||||
source_files = [source_file, *source_files]
|
||||
|
||||
return Activity(
|
||||
source_id=source_id,
|
||||
source_ids=[source_id] if source_id else [],
|
||||
chunk_key=chunk_key,
|
||||
name=(adict.get("name") or "").strip(),
|
||||
description=(adict.get("description") or "").strip(),
|
||||
rules=adict.get("rules"),
|
||||
@@ -206,6 +214,19 @@ def merge_cluster(cluster: list[Activity]) -> Activity:
|
||||
if s and s not in sources:
|
||||
sources.append(s)
|
||||
merged.source_files = sources
|
||||
# source provenance: keep rep's chunk_key/source_id as primary, union the
|
||||
# source_ids for the download route. Enrichment fields (name_ro,
|
||||
# description_ro, indoor_outdoor, ...) are intentionally NOT carried here:
|
||||
# enrichment is applied AFTER dedup (plan Part B2), keyed on the merged
|
||||
# row's content_key, so merging must not pre-populate them.
|
||||
merged.source_id = rep.source_id
|
||||
merged.chunk_key = rep.chunk_key
|
||||
source_ids: list[str] = []
|
||||
for a in cluster:
|
||||
for sid in [a.source_id, *(a.source_ids or [])]:
|
||||
if sid and sid not in source_ids:
|
||||
source_ids.append(sid)
|
||||
merged.source_ids = source_ids
|
||||
# popularity_score++ per merged duplicate (plan §4)
|
||||
merged.popularity_score = max(a.popularity_score for a in cluster) + (len(cluster) - 1)
|
||||
return merged
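The provenance merge in this hunk reduces to an order-preserving union plus a popularity bump. The sketch below restates both on plain dicts, which is an assumption made to keep it self-contained; the real code operates on `Activity` objects.

```python
def union_source_ids(cluster: list[dict]) -> list[str]:
    """Order-preserving union of every source id seen in the cluster."""
    merged: list[str] = []
    for act in cluster:
        for sid in [act.get("source_id"), *(act.get("source_ids") or [])]:
            if sid and sid not in merged:
                merged.append(sid)
    return merged


def merged_popularity(scores: list[int]) -> int:
    """Highest score in the cluster plus one per merged duplicate."""
    return max(scores) + (len(scores) - 1)


cluster = [
    {"source_id": "book_a", "source_ids": ["book_a"]},
    {"source_id": "book_b", "source_ids": ["book_b", "book_a"]},
    {"source_id": None, "source_ids": None},
]
ids = union_source_ids(cluster)
```

A list is used instead of a set so the representative's source stays first.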
|
||||
@@ -313,6 +334,108 @@ def apply_review_decisions(
|
||||
return kept, stats
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# step 5b — enrichment overlay (plan Part B)
|
||||
# --------------------------------------------------------------------------
|
||||
# Translation / inferred-filter fields written by run_enrichment.py. Applied
|
||||
# AFTER dedup + review decisions, keyed on the same stable content_key, so the
|
||||
# overlay survives rebuilds as long as extraction text is frozen.
|
||||
_ENRICHMENT_TEXT_FIELDS = ("name_ro", "description_ro", "rules_ro", "variations_ro")
|
||||
_ENRICHMENT_INT_FIELDS = (
|
||||
"participants_min", "participants_max",
|
||||
"duration_min", "duration_max",
|
||||
"age_group_min", "age_group_max",
|
||||
)
|
||||
|
||||
|
||||
def load_enrichment(path: Path) -> dict:
    """Load data/enrichment.json (flat map content_key -> field dict)."""
    if path and path.is_file():
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            if isinstance(data, dict):
                return data
        except (json.JSONDecodeError, OSError):
            pass
    return {}
|
||||
|
||||
|
||||
def apply_enrichment(activities: list[Activity], enrichment: dict) -> dict:
    """
    Overlay enrichment fields onto the post-dedup activity list (plan B2).

    Keyed by content_key. Only fields PRESENT in an entry are written; absent
    fields leave the underlying DB value untouched. indoor_outdoor /
    space_needed are normalized to slugs (None on unrecognised). Inferred
    fields are recorded in `estimated_fields`. Translated / expanded text is
    NOT re-validated against the source here; expansion fidelity is the
    enrichment prompt's responsibility (plan B2 comment).

    Returns {entries, matched, orphaned, fields_stated, fields_estimated}.
    """
    from app.config_taxonomy import normalize_indoor_outdoor, normalize_space_needed

    matched_keys: set[str] = set()
    fields_stated: dict[str, int] = defaultdict(int)
    fields_estimated: dict[str, int] = defaultdict(int)

    for act in activities:
        key = content_key(
            act.normalized_name or normalize_name(act.name),
            act.language,
            act.description or "",
        )
        entry = enrichment.get(key)
        if not isinstance(entry, dict):
            continue
        matched_keys.add(key)

        estimated = set(entry.get("estimated_fields") or [])

        # bilingual text twins
        for fld in _ENRICHMENT_TEXT_FIELDS:
            val = entry.get(fld)
            if isinstance(val, str) and val.strip():
                setattr(act, fld, val.strip())

        # inferred / clarified structured numeric fields
        for fld in _ENRICHMENT_INT_FIELDS:
            if entry.get(fld) is not None:
                try:
                    setattr(act, fld, int(entry[fld]))
                except (TypeError, ValueError):
                    pass

        # enum filters: normalized to slug, dropped if unrecognised
        if entry.get("indoor_outdoor") is not None:
            slug = normalize_indoor_outdoor(entry["indoor_outdoor"])
            if slug:
                act.indoor_outdoor = slug
        if entry.get("space_needed") is not None:
            slug = normalize_space_needed(entry["space_needed"])
            if slug:
                act.space_needed = slug

        act.estimated_fields = sorted(estimated)

        # QA tally: stated vs estimated population, per field
        for fld in (*_ENRICHMENT_INT_FIELDS, "indoor_outdoor", "space_needed"):
            if entry.get(fld) is None:
                continue
            if fld in estimated:
                fields_estimated[fld] += 1
            else:
                fields_stated[fld] += 1

    return {
        "entries": len(enrichment),
        "matched": len(matched_keys),
        "orphaned": len(enrichment) - len(matched_keys),
        "fields_stated": dict(fields_stated),
        "fields_estimated": dict(fields_estimated),
    }
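The overlay rule in `apply_enrichment` (write only what the entry contains, leave everything else alone) is easier to see on a toy model. The sketch below uses a hypothetical `MiniActivity` dataclass in place of the real `Activity` and covers only three fields.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MiniActivity:  # hypothetical stand-in for the app's Activity model
    name: str
    duration_min: Optional[int] = None
    duration_max: Optional[int] = None
    name_ro: Optional[str] = None
    estimated_fields: list[str] = field(default_factory=list)


def overlay(act: MiniActivity, entry: dict) -> None:
    """Write only the fields present in `entry`; leave everything else alone."""
    text = entry.get("name_ro")
    if isinstance(text, str) and text.strip():
        act.name_ro = text.strip()
    for fld in ("duration_min", "duration_max"):
        if entry.get(fld) is not None:
            try:
                setattr(act, fld, int(entry[fld]))
            except (TypeError, ValueError):
                pass  # unparseable value: keep what the DB already had
    act.estimated_fields = sorted(set(entry.get("estimated_fields") or []))


act = MiniActivity(name="Tag", duration_min=10, duration_max=20)
overlay(act, {"name_ro": " Leapșa ", "duration_max": "30",
              "estimated_fields": ["duration_max"]})
```

After the call, `duration_min` still holds the extracted value because the entry did not mention it, while `duration_max` is overwritten and flagged as estimated.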
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# golden-set recall (plan §7)
|
||||
# --------------------------------------------------------------------------
|
||||
@@ -390,9 +513,8 @@ def collect_activities(
|
||||
|
||||
header = data.get("header", {})
|
||||
chunk_text = find_chunk_text(json_path, header, chunks_dir)
|
||||
-source_id = header.get("source_id") or chunk_key_for(json_path, header).rsplit(
-    ".part", 1
-)[0]
+chunk_key = chunk_key_for(json_path, header)
+source_id = header.get("source_id") or chunk_key.rsplit(".part", 1)[0]
|
||||
fallback_source = (
|
||||
source_path_for(source_id, sources_dir) or source_id or json_path.stem
|
||||
)
|
||||
@@ -409,7 +531,7 @@ def collect_activities(
|
||||
continue
|
||||
src = adict.get("source_file") or fallback_source
|
||||
raw_categories.append((adict.get("category", ""), normalize_category(adict.get("category", ""))))
|
||||
-activities.append(dict_to_activity(adict, src))
+activities.append(dict_to_activity(adict, src, source_id, chunk_key))
|
||||
|
||||
if hallucinated:
|
||||
_log_hallucinations(json_path, rejected_dir, hallucinated)
|
||||
@@ -496,6 +618,7 @@ def rebuild(
|
||||
sources_dir: Path,
|
||||
db_path: Path,
|
||||
decisions_path: Optional[Path] = None,
|
||||
enrichment_path: Optional[Path] = None,
|
||||
schema_path: Path = DEFAULT_SCHEMA_PATH,
|
||||
golden_dir: Optional[Path] = None,
|
||||
do_swap: bool = True,
|
||||
@@ -517,6 +640,11 @@ def rebuild(
|
||||
decisions = load_review_decisions(Path(decisions_path)) if decisions_path else {}
|
||||
final, decision_stats = apply_review_decisions(deduped, decisions)
|
||||
|
||||
# Enrichment overlay — applied immediately after review decisions, on the
|
||||
# post-dedup list, keyed on the same stable content_key (plan B2).
|
||||
enrichment = load_enrichment(Path(enrichment_path)) if enrichment_path else {}
|
||||
enrichment_stats = apply_enrichment(final, enrichment)
|
||||
|
||||
try:
|
||||
write_database(db_tmp_path, final)
|
||||
backup = atomic_swap(db_tmp_path, db_path) if do_swap else None
|
||||
@@ -529,6 +657,7 @@ def rebuild(
|
||||
**collected,
|
||||
"dedup": dedup_stats,
|
||||
"decisions": decision_stats,
|
||||
"enrichment": enrichment_stats,
|
||||
"final_count": len(final),
|
||||
"backup": str(backup) if backup else None,
|
||||
"swapped": do_swap,
|
||||
@@ -579,6 +708,16 @@ def print_report(report: dict) -> None:
|
||||
f"(auto-merged {d['auto_merged']}, borderline {d['borderline']})")
|
||||
print(f"review decisions : dropped {report['decisions']['dropped']}, "
|
||||
f"resolved {report['decisions']['resolved']}")
|
||||
enr = report.get("enrichment")
|
||||
if enr and enr.get("entries"):
|
||||
print(f"enrichment : {enr['entries']} entries "
|
||||
f"(matched {enr['matched']}, orphaned {enr['orphaned']})")
|
||||
stated, estimated = enr.get("fields_stated", {}), enr.get("fields_estimated", {})
|
||||
all_fields = sorted(set(stated) | set(estimated))
|
||||
if all_fields:
|
||||
print(" field population : (stated / estimated)")
|
||||
for fld in all_fields:
|
||||
print(f" {fld:<18}: {stated.get(fld, 0)} / {estimated.get(fld, 0)}")
|
||||
print(f"final inserted : {report['final_count']}")
|
||||
print(f"% with rules : {qa['pct_with_rules']}")
|
||||
print(f"needs_review rows : {qa['needs_review']}")
|
||||
@@ -615,6 +754,7 @@ def main(argv: Optional[list[str]] = None) -> int:
|
||||
parser.add_argument("--sources", default="data/sources")
|
||||
parser.add_argument("--db", default="data/activities.db")
|
||||
parser.add_argument("--decisions", default="data/review_decisions.json")
|
||||
parser.add_argument("--enrichment", default="data/enrichment.json")
|
||||
parser.add_argument("--golden", default="data/golden")
|
||||
parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
|
||||
args = parser.parse_args(argv)
|
||||
@@ -628,6 +768,7 @@ def main(argv: Optional[list[str]] = None) -> int:
|
||||
sources_dir=Path(args.sources),
|
||||
db_path=Path(args.db),
|
||||
decisions_path=Path(args.decisions),
|
||||
enrichment_path=Path(args.enrichment),
|
||||
schema_path=Path(args.schema),
|
||||
golden_dir=Path(args.golden),
|
||||
)
|
||||
|
||||
244
scripts/repair_extractions.py
Normal file
@@ -0,0 +1,244 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
repair_extractions.py — one-shot repair of malformed extraction JSON.
|
||||
|
||||
Subagents systematically emit unescaped ASCII double-quotes inside string
|
||||
values (Romanian text like „Unu" uses a closing " that terminates the JSON
|
||||
string early). Re-extraction reproduces the bug, so we repair instead.
|
||||
|
||||
IMPORTANT — why NOT json_repair: json_repair "recovers" an unescaped quote by
|
||||
ending the string at the stray quote and reinterpreting the trailing text as a
|
||||
new key, which (a) TRUNCATES the value and (b) injects garbage keys. The
|
||||
truncation is silent (the field is still non-empty) and slips past a naive
|
||||
presence check. So we use a faithful char-scanner that ESCAPES stray quotes
|
||||
(\\") instead of splitting on them, then validate the result against the real
|
||||
activity schema (additionalProperties:false also catches any residual split).
|
||||
|
||||
This is an OFFLINE maintenance tool. build_database.py must NOT depend on it —
|
||||
the "DB regenerable from data/extracted/" invariant requires plain valid JSON on
|
||||
disk. We write clean JSON back to data/extracted/ and the build reads vanilla
|
||||
json.
|
||||
|
||||
Source selection (faithful recovery needs the ORIGINAL malformed text):
|
||||
* a chunk is a candidate when a MALFORMED original exists — either the
|
||||
top-level data/extracted/<key>.json is itself invalid, or a malformed
|
||||
original sits in data/extracted/_rejected/<key>.json.
|
||||
* the malformed original is preferred as the repair source.
|
||||
* chunks whose only artifact is already-valid JSON (e.g. a prior json_repair
|
||||
output that lost the original) are NOT silently "repaired" — if such a chunk
|
||||
has no valid top-level file it is reported as needing RE-EXTRACTION.
|
||||
|
||||
Usage:
|
||||
python scripts/repair_extractions.py # report only (dry run)
|
||||
python scripts/repair_extractions.py --apply # write repaired JSON
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import glob
|
||||
import json
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
REPO_ROOT = SCRIPT_DIR.parent
|
||||
EXTRACTED = REPO_ROOT / "data" / "extracted"
|
||||
REJECTED = EXTRACTED / "_rejected"
|
||||
|
||||
if str(SCRIPT_DIR) not in __import__("sys").path:
|
||||
__import__("sys").path.insert(0, str(SCRIPT_DIR))
|
||||
from import_common import DEFAULT_SCHEMA_PATH, load_schema, validate_extraction # noqa: E402
|
||||
|
||||
|
||||
def escape_stray_quotes(s: str) -> str:
    """Escape ASCII double-quotes that occur INSIDE a JSON string value.

    A `"` inside a string is treated as a real string-close only when the next
    non-whitespace char is structural (`,` `}` `]` `:`) or EOF; otherwise it is
    content and is escaped to `\\"`. This preserves the full value instead of
    truncating it (the json_repair failure mode).
    """
    out: list[str] = []
    in_str = False
    esc = False
    n = len(s)
    i = 0
    while i < n:
        c = s[i]
        if esc:
            out.append(c)
            esc = False
            i += 1
            continue
        if c == "\\":
            out.append(c)
            esc = True
            i += 1
            continue
        if c == '"':
            if not in_str:
                in_str = True
                out.append(c)
            else:
                j = i + 1
                while j < n and s[j] in " \t\r\n":
                    j += 1
                nxt = s[j] if j < n else ""
                if nxt in ",}]:" or nxt == "":
                    in_str = False
                    out.append(c)
                else:
                    out.append('\\"')  # content quote → escape, keep value whole
            i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)
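To see the scanner's effect on the Romanian-quote case from the module docstring, the sketch below restates it in condensed form and runs it on a small malformed document. The condensed function is a demonstration under the same heuristic, not a replacement for the one above, and the sample string is invented for illustration.

```python
import json


def escape_stray_quotes(s: str) -> str:
    """Condensed restatement of the scanner above, for demonstration only."""
    out, in_str, esc = [], False, False
    for i, c in enumerate(s):
        if esc:                        # char after a backslash: copy verbatim
            out.append(c)
            esc = False
        elif c == "\\":
            out.append(c)
            esc = True
        elif c == '"' and not in_str:  # opening quote of a string
            in_str = True
            out.append(c)
        elif c == '"':
            rest = s[i + 1:].lstrip(" \t\r\n")
            if rest[:1] in (",", "}", "]", ":", ""):
                in_str = False         # followed by structure or EOF: real close
                out.append(c)
            else:
                out.append('\\"')      # followed by content: escape it
        else:
            out.append(c)
    return "".join(out)


broken = '{"name": "Jocul „Unu" și „Doi" pe rând"}'
fixed = escape_stray_quotes(broken)
data = json.loads(fixed)
```

One limit of the heuristic: a content quote that happens to be followed by a comma or colon is read as a real close. Such a file should still fail JSON parsing or schema validation afterwards, so it is reported, not silently accepted.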
|
||||
|
||||
|
||||
def _is_valid_json(path: Path) -> bool:
    try:
        json.loads(path.read_text(encoding="utf-8"))
        return True
    except (json.JSONDecodeError, OSError):
        return False


def _malformed_source(key: str) -> Optional[Path]:
    """Return the malformed-original file for a chunk, preferring top-level."""
    live = EXTRACTED / f"{key}.json"
    if live.exists() and not _is_valid_json(live):
        return live
    rej = REJECTED / f"{key}.json"
    if rej.exists() and not _is_valid_json(rej):
        return rej
    return None
|
||||
|
||||
|
||||
def _candidate_keys() -> tuple[dict[str, Path], list[str]]:
|
||||
"""
|
||||
(repair_candidates, needs_reextraction).
|
||||
|
||||
repair_candidates: key -> malformed source file (faithfully repairable).
|
||||
needs_reextraction: chunks with no malformed original AND no valid
|
||||
top-level file (their original was lost) — must be re-extracted.
|
||||
"""
|
||||
keys = set()
|
||||
for fn in glob.glob(str(EXTRACTED / "*.json")):
|
||||
keys.add(Path(fn).stem)
|
||||
for fn in glob.glob(str(REJECTED / "*.json")):
|
||||
keys.add(Path(fn).stem)
|
||||
|
||||
candidates: dict[str, Path] = {}
|
||||
needs_reextraction: list[str] = []
|
||||
for key in sorted(keys):
|
||||
# A malformed original anywhere is faithfully repairable, and is the
|
||||
# source of truth even if a (json_repair-produced, possibly truncated)
|
||||
# valid top-level file exists — escaping the original never truncates,
|
||||
# so re-repairing from it is always >= the json_repair output.
|
||||
src = _malformed_source(key)
|
||||
if src is not None:
|
||||
candidates[key] = src
|
||||
continue
|
||||
live = EXTRACTED / f"{key}.json"
|
||||
if live.exists() and _is_valid_json(live):
|
||||
continue # genuinely-valid extraction, nothing to do
|
||||
# no valid top-level and no malformed original to repair from
|
||||
needs_reextraction.append(key)
|
||||
return candidates, needs_reextraction
|
||||
|
||||
|
||||
def repair(apply: bool) -> int:
|
||||
schema = load_schema(DEFAULT_SCHEMA_PATH)
|
||||
candidates, needs_reextraction = _candidate_keys()
|
||||
|
||||
print("=" * 64)
|
||||
print(f"REPAIR EXTRACTIONS ({'APPLY' if apply else 'dry run'})")
|
||||
print("=" * 64)
|
||||
print(f"repair candidates: {len(candidates)}")
|
||||
|
||||
def _textlen(data: dict) -> int:
|
||||
total = 0
|
||||
for a in data.get("activities", []):
|
||||
if isinstance(a, dict):
|
||||
for v in a.values():
|
||||
if isinstance(v, str):
|
||||
total += len(v)
|
||||
return total
|
||||
|
||||
ok = 0
|
||||
kept_toplevel = 0
|
||||
still_bad: list[str] = []
|
||||
schema_fail: list[tuple[str, str]] = []
|
||||
|
||||
for key, src in candidates.items():
|
||||
live = EXTRACTED / f"{key}.json"
|
||||
live_valid = live.exists() and _is_valid_json(live)
|
||||
|
||||
raw = src.read_text(encoding="utf-8")
|
||||
fixed = escape_stray_quotes(raw)
|
||||
try:
|
||||
data = json.loads(fixed)
|
||||
except json.JSONDecodeError as exc:
|
||||
if live_valid:
|
||||
kept_toplevel += 1 # genuine top-level is fine; stale _rejected
|
||||
else:
|
||||
still_bad.append(f"{key}: still invalid after escape ({exc})")
|
||||
continue
|
||||
errors = validate_extraction(data, schema)
|
||||
if errors:
|
||||
if live_valid:
|
||||
kept_toplevel += 1
|
||||
else:
|
||||
schema_fail.append((key, errors[0]))
|
||||
print(f" {key[:50]:<50} SCHEMA-FAIL: {errors[0][:40]}")
|
||||
continue
|
||||
|
||||
# Faithfulness guard: only replace a valid top-level when the escaped
|
||||
# repair carries STRICTLY more text (i.e. the top-level was a truncated
|
||||
# json_repair output). Genuine extractions are kept untouched.
|
||||
if live_valid:
|
||||
try:
|
||||
live_data = json.loads(live.read_text(encoding="utf-8"))
|
||||
except json.JSONDecodeError:
|
||||
live_data = {}
|
||||
if _textlen(data) <= _textlen(live_data):
|
||||
kept_toplevel += 1
|
||||
continue
|
||||
|
||||
n = len(data.get("activities", []))
|
||||
print(f" {key[:50]:<50} {n:>3} acts REPAIR")
|
||||
if apply:
|
||||
live.write_text(
|
||||
json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
|
||||
)
|
||||
ok += 1
|
||||
|
||||
print("-" * 64)
|
||||
print(f"repaired: {ok} | kept genuine top-level: {kept_toplevel} | "
|
||||
f"schema-fail: {len(schema_fail)} | still-bad: {len(still_bad)} | "
|
||||
f"needs re-extraction: {len(needs_reextraction)}")
|
||||
for key, err in schema_fail:
|
||||
print(f" ⚠ schema {key}: {err[:60]}")
|
||||
for msg in still_bad:
|
||||
print(f" ✘ {msg}")
|
||||
for key in needs_reextraction:
|
||||
print(f" ↻ re-extract: {key}")
|
||||
if not apply:
|
||||
print("\nDry run — re-run with --apply to write repaired JSON.")
|
||||
print("=" * 64)
|
||||
return 0
|
||||
|
||||
|
||||
def main(argv: Optional[list[str]] = None) -> int:
|
||||
parser = argparse.ArgumentParser(description="Repair malformed extraction JSON.")
|
||||
parser.add_argument("--apply", action="store_true",
|
||||
help="write repaired JSON (default: dry run)")
|
||||
args = parser.parse_args(argv)
|
||||
return repair(args.apply)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
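The faithfulness guard in `repair()` comes down to a single comparison: a valid top-level file is replaced only when the escaped repair carries strictly more text. A minimal sketch of that decision follows, with hypothetical names `text_len` and `should_replace` and invented sample data.

```python
from typing import Optional


def text_len(data: dict) -> int:
    """Total length of all string values across the activities array."""
    return sum(
        len(v)
        for a in data.get("activities", [])
        if isinstance(a, dict)
        for v in a.values()
        if isinstance(v, str)
    )


def should_replace(repaired: dict, live: Optional[dict]) -> bool:
    """Replace when no valid live file exists or the repair has strictly more text."""
    return live is None or text_len(repaired) > text_len(live)


# A truncated json_repair-style output versus the escaped repair of the same chunk.
truncated = {"activities": [{"name": "Unu", "description": "Jocul „Unu"}]}
repaired = {"activities": [{"name": "Unu", "description": 'Jocul „Unu" se joacă în cerc.'}]}
```

Here `should_replace(repaired, truncated)` is true, while comparing a file with itself is false, so genuine extractions are never rewritten.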
|
||||
270
scripts/run_enrichment.py
Normal file
@@ -0,0 +1,270 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
run_enrichment.py — enrichment orchestrator (plan Part B3).
|
||||
|
||||
Mirror of run_extraction.py, on the *other* side of the rebuild. It reads the
|
||||
already-rebuilt data/activities.db, and for every activity emits one subagent
|
||||
prompt asking for a single bilingual + inferred-filter enrichment pass. Like
|
||||
extraction, this script does NOT call the LLM — the interactive Claude Code
|
||||
orchestrator launches waves of subagents on the emitted prompts.
|
||||
|
||||
Keying is the crux (plan §"Cheia de keying"): each row's overlay is keyed on
|
||||
import_common.content_key(normalized_name, language, _normalize_text(description))
|
||||
— the SAME function build_database uses to apply the overlay. The key is stable
|
||||
only while the extraction text is frozen, so enrichment runs AFTER the freezing
|
||||
rebuild.
|
||||
|
||||
Modes:
|
||||
(default) emit one prompt per activity that has no enrichment part yet
|
||||
(resumable: data/enrichment_parts/<key>.json present => skip)
|
||||
--collect merge data/enrichment_parts/*.json -> data/enrichment.json
|
||||
|
||||
Pilot scoping (plan B5): --source <source_id substring> and/or --limit N narrow
|
||||
the emitted prompts to a single source / category for the sign-off pilot.
|
||||
|
||||
Usage:
|
||||
python scripts/run_enrichment.py --source teambuilding_corbu # pilot
|
||||
python scripts/run_enrichment.py # all rows
|
||||
python scripts/run_enrichment.py --collect # merge parts
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sqlite3
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
REPO_ROOT = SCRIPT_DIR.parent
|
||||
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
|
||||
if _p not in sys.path:
|
||||
sys.path.insert(0, _p)
|
||||
|
||||
from import_common import ( # noqa: E402
|
||||
content_key,
|
||||
find_chunk_text,
|
||||
normalize_name,
|
||||
)
|
||||
|
||||
ENRICHMENT_PROMPT = SCRIPT_DIR / "ENRICHMENT_PROMPT.md"
|
||||
|
||||
# Columns pulled from the DB into the prompt as the "current value" context.
|
||||
_DB_COLUMNS = (
|
||||
"id", "name", "description", "rules", "variations",
|
||||
"category", "content_type", "language", "normalized_name",
|
||||
"page_reference", "source_id", "chunk_key",
|
||||
"participants_min", "participants_max",
|
||||
"duration_min", "duration_max",
|
||||
"age_group_min", "age_group_max",
|
||||
)
|
||||
|
||||
# How much source-chunk text to inline. Chunks are page-sized; cap so a dense
|
||||
# chunk does not blow the prompt up, but keep enough to ground the expansion.
|
||||
_CHUNK_TEXT_CAP = 12000
|
||||
|
||||
|
||||
def _fetch_rows(db_path: Path, source_substr: Optional[str]) -> list[dict]:
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        cols = ", ".join(_DB_COLUMNS)
        sql = f"SELECT {cols} FROM activities"
        params: list = []
        if source_substr:
            sql += " WHERE (source_id LIKE ? OR chunk_key LIKE ?)"
            params = [f"%{source_substr}%", f"%{source_substr}%"]
        sql += " ORDER BY source_id, id"
        return [dict(r) for r in conn.execute(sql, params).fetchall()]
    finally:
        conn.close()
|
||||
|
||||
|
||||
def _row_content_key(row: dict) -> str:
|
||||
return content_key(
|
||||
row.get("normalized_name") or normalize_name(row.get("name") or ""),
|
||||
row.get("language"),
|
||||
row.get("description") or "",
|
||||
)
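`import_common.content_key` is not shown in this diff, so the sketch below is purely hypothetical. It illustrates why a key derived from name, language and description stays stable only while the extraction text is frozen. The hash choice, the separator and the normalizer are all assumptions.

```python
import hashlib
import re
import unicodedata


def _normalize_text(text: str) -> str:
    """Hypothetical normalizer: NFKC, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text or "").lower()
    return re.sub(r"\s+", " ", text).strip()


def sketch_content_key(normalized_name: str, language: str, description: str) -> str:
    """Hypothetical stand-in for import_common.content_key (not shown here)."""
    payload = "\x1f".join(
        [normalized_name or "", language or "", _normalize_text(description)]
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:16]


k1 = sketch_content_key("leapsa", "ro", "Un copil  prinde pe ceilalți.")
k2 = sketch_content_key("leapsa", "ro", "un copil prinde pe ceilalți.")
k3 = sketch_content_key("leapsa", "ro", "Un copil prinde pe toți ceilalți.")
```

In this sketch `k1` and `k2` match because only case and spacing differ, while `k3` differs because the wording changed. In that situation the enrichment entry would no longer match its activity and would be counted as orphaned.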
|
||||
|
||||
|
||||
def _chunk_text_for_row(row: dict, chunks_dir: Path) -> Optional[str]:
    """Locate the source-chunk text via the row's chunk_key / source_id."""
    header = {"chunk_key": row.get("chunk_key"), "source_id": row.get("source_id")}
    if not header["chunk_key"]:
        return None
    # find_chunk_text resolves from the header when chunk_key is present;
    # the json_path arg is only a fallback, so a synthetic path is fine.
    text = find_chunk_text(Path(f"{row['chunk_key']}.json"), header, chunks_dir)
    if text and len(text) > _CHUNK_TEXT_CAP:
        text = text[:_CHUNK_TEXT_CAP] + "\n…[chunk truncated]…"
    return text
|
||||
|
||||
|
||||
def _current_fields_block(row: dict) -> str:
    """The activity's current DB values, as a compact JSON block for context."""
    fields = {
        "name": row.get("name"),
        "description": row.get("description"),
        "rules": row.get("rules"),
        "variations": row.get("variations"),
        "category": row.get("category"),
        "content_type": row.get("content_type"),
        "language": row.get("language"),
        "participants_min": row.get("participants_min"),
        "participants_max": row.get("participants_max"),
        "duration_min": row.get("duration_min"),
        "duration_max": row.get("duration_max"),
        "age_group_min": row.get("age_group_min"),
        "age_group_max": row.get("age_group_max"),
    }
    return json.dumps(fields, ensure_ascii=False, indent=2)


def emit_enrichment_prompt(
    row: dict, key: str, chunks_dir: Path, prompts_dir: Path
) -> Path:
    """Write the subagent enrichment prompt for one activity."""
    chunk_text = _chunk_text_for_row(row, chunks_dir)
    source_block = (
        chunk_text if chunk_text is not None
        else "[source chunk text unavailable — translate only what is given "
             "above; do NOT invent steps, and mark any inferred filter field "
             "as estimated]"
    )
    # NOTE: this path is fixed and does not follow a --parts override, while
    # run_emit's resume check looks in parts_dir; keep the two in sync.
    part_path = f"data/enrichment_parts/{key}.json"
    text = "\n".join([
        f"# ENRICHMENT — activity `{row.get('name')}` (id {row.get('id')})",
        "",
        f"Follow the rules in `{ENRICHMENT_PROMPT.relative_to(REPO_ROOT)}` EXACTLY.",
        "Single pass. Translate faithfully to Romanian; expand description_ro "
        "ONLY from the source chunk text below; mark inferred filter fields in "
        "`estimated_fields`.",
        "",
        f"Write the result JSON to: `{part_path}`",
        f'It MUST include `"content_key": "{key}"`.',
        f'Page reference: {row.get("page_reference") or "?"}',
        "",
        "## Current activity values (the text to translate / enrich)",
        "```json",
        _current_fields_block(row),
        "```",
        "",
        "## Source chunk text (ground description_ro expansion in THIS only)",
        "```",
        source_block,
        "```",
        "",
    ])
    prompts_dir.mkdir(parents=True, exist_ok=True)
    out = prompts_dir / f"{key}.prompt.md"
    out.write_text(text, encoding="utf-8")
    return out


def collect_enrichment(parts_dir: Path, out_path: Path) -> dict:
    """Merge data/enrichment_parts/*.json into one flat content_key map."""
    merged: dict = {}
    bad: list[str] = []
    if parts_dir.is_dir():
        for part in sorted(parts_dir.glob("*.json")):
            try:
                data = json.loads(part.read_text(encoding="utf-8"))
            except (ValueError, OSError):
                # ValueError covers both json.JSONDecodeError and a
                # UnicodeDecodeError raised by read_text on non-UTF-8 bytes.
                bad.append(part.name)
                continue
            # A part must be a JSON object; valid JSON of another type (list,
            # string, ...) would otherwise crash on .get() below.
            if not isinstance(data, dict):
                bad.append(part.name)
                continue
            key = data.get("content_key") or part.stem
            entry = {k: v for k, v in data.items() if k != "content_key"}
            merged[key] = entry
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return {"entries": len(merged), "bad_parts": bad, "out": str(out_path)}


def run_emit(
    *,
    db_path: Path,
    chunks_dir: Path,
    parts_dir: Path,
    prompts_dir: Path,
    source_substr: Optional[str],
    limit: Optional[int],
) -> dict:
    """Emit one prompt per activity that has no part file yet (resumable)."""
    rows = _fetch_rows(db_path, source_substr)
    emitted, skipped = 0, 0
    for row in rows:
        key = _row_content_key(row)
        # An existing part file marks the activity as done, whatever it contains.
        if (parts_dir / f"{key}.json").is_file():
            skipped += 1
            continue
        emit_enrichment_prompt(row, key, chunks_dir, prompts_dir)
        emitted += 1
        # A limit of None or 0 means no cap.
        if limit and emitted >= limit:
            break
    return {
        "rows": len(rows),
        "emitted": emitted,
        "skipped_done": skipped,
        "prompts_dir": str(prompts_dir),
    }


def main(argv: Optional[list[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Enrichment orchestrator.")
    parser.add_argument("--db", default="data/activities.db")
    parser.add_argument("--chunks", default="data/chunks")
    parser.add_argument("--parts", default="data/enrichment_parts")
    parser.add_argument("--prompts", default="data/enrichment_prompts")
    parser.add_argument("--out", default="data/enrichment.json")
    parser.add_argument("--source", default=None,
                        help="only rows whose source_id/chunk_key contains this (pilot)")
    parser.add_argument("--limit", type=int, default=None,
                        help="cap emitted prompts (pilot)")
    parser.add_argument("--collect", action="store_true",
                        help="merge enrichment parts into the overlay JSON")
    args = parser.parse_args(argv)

    print("=" * 60)
    print("ENRICHMENT ORCHESTRATOR")
    print("=" * 60)

    if args.collect:
        result = collect_enrichment(Path(args.parts), Path(args.out))
        print(f"collected : {result['entries']} entries -> {result['out']}")
        if result["bad_parts"]:
            print(f"bad parts : {len(result['bad_parts'])} (skipped)")
            for name in result["bad_parts"]:
                print(f" - {name}")
        print("Run build_database.py --rebuild to apply the overlay.")
        print("=" * 60)
        return 0

    summary = run_emit(
        db_path=Path(args.db),
        chunks_dir=Path(args.chunks),
        parts_dir=Path(args.parts),
        prompts_dir=Path(args.prompts),
        source_substr=args.source,
        limit=args.limit,
    )
    print(f"rows in DB : {summary['rows']}"
          + (f" (filtered by '{args.source}')" if args.source else ""))
    print(f"already enriched : {summary['skipped_done']}")
    print(f"prompts emitted : {summary['emitted']}")
    if summary["emitted"]:
        print(f"prompts dir : {summary['prompts_dir']}/")
        print("Launch waves of ~8-16 Sonnet subagents on those prompts, each "
              "writing data/enrichment_parts/<key>.json, then run "
              "run_enrichment.py --collect and build_database.py --rebuild.")
    else:
        print("Nothing to emit — run --collect then build_database.py --rebuild.")
    print("=" * 60)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
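
The resume check in `run_emit` and the merge in `collect_enrichment` can be exercised without a database. The sketch below is a standalone illustration, not part of the commit: `pending_keys`, `merge_parts`, and `demo` are hypothetical names, and the sample keys and field values are made up. It only assumes the part-file layout described above (`<parts_dir>/<content_key>.json`).

```python
import json
import tempfile
from pathlib import Path


def pending_keys(keys, parts_dir):
    """Keys with no <key>.json part yet (same existence test as run_emit)."""
    return [k for k in keys if not (parts_dir / f"{k}.json").is_file()]


def merge_parts(parts_dir):
    """Flat content_key map from every readable part, plus unreadable names."""
    merged, bad = {}, []
    for part in sorted(parts_dir.glob("*.json")):
        try:
            data = json.loads(part.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError):
            bad.append(part.name)
            continue
        # Fall back to the file stem when the part omits content_key.
        key = data.get("content_key") or part.stem
        merged[key] = {k: v for k, v in data.items() if k != "content_key"}
    return merged, bad


def demo():
    """Hypothetical data: one good part, one malformed part, one missing part."""
    with tempfile.TemporaryDirectory() as tmp:
        parts = Path(tmp)
        good = {"content_key": "k1", "name_ro": "Leapsa"}
        (parts / "k1.json").write_text(json.dumps(good), encoding="utf-8")
        (parts / "k2.json").write_text("{not json", encoding="utf-8")
        todo = pending_keys(["k1", "k2", "k3"], parts)
        merged, bad = merge_parts(parts)
        return todo, merged, bad


if __name__ == "__main__":
    print(demo())
```

In this sketch `demo()` returns `(["k3"], {"k1": {"name_ro": "Leapsa"}}, ["k2.json"])`. The malformed `k2.json` is reported as a bad part but is not listed as pending, because the resume check only tests that the file exists. With that behavior, a bad part would likely need to be deleted before its prompt is emitted again.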