Rebuild extraction pipeline infrastructure (Phase 0 prep)

Implements the approved plan to replace the broken regex/index-master extraction with an LLM-subagent pipeline. Four parallel lanes:

Lane A: scripts/extract_common.py (PDF/docx/doc/pptx/html/zip, no max_pages truncation), normalize_sources.py, chunk_sources.py (~20pg chunks + overlap, manifest registry), activity_schema.json.

Lane B: app/config_taxonomy.py (16 fixed category slugs), schema rebuilt from scratch in app/models/ with content_type, language, source_files, source_excerpt, normalized_name, extraction_confidence, needs_review; FTS5 + 3 triggers extended with materials_list and skills_developed.

Lane C: build_database.py (--rebuild, atomic swap, schema + fuzzy source_excerpt validation, dedup with needs_review band), validate_extractions.py, review_queue.py, new run_extraction.py orchestrator, SUBAGENT_PROMPT.md.

Lane D: search.py content_type/language filters (default search excludes non-game content), E7 schema-compat audit; fixed a NULL keywords AttributeError in _boost_search_relevance.

Removes 8 orphaned/dead scripts and app/services/parser.py + indexer.py. Adds tests/ (70 passing, 1 skipped: libreoffice absent).

Note: Lane D made one additive edit to app/models/database.py (_update_category_counts) to surface content_type/language in get_filter_options, outside its nominal lane boundary but after Lane B completed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 17:43:38 +00:00
parent e0080edf85
commit 66ae831c36
37 changed files with 4101 additions and 1881 deletions
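The atomic swap named in the commit message (build into a temporary file, back up the live database, then replace it) can be sketched with the standard library alone. This is a minimal illustration, not the code from build_database.py: `swap_in` and the throwaway file contents are hypothetical names chosen for the example.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def swap_in(tmp_path: Path, live_path: Path) -> Optional[Path]:
    """Back up live_path (if it exists), then replace it with tmp_path."""
    backup = None
    if live_path.exists():
        # activities.db -> activities.db.bak
        backup = live_path.with_suffix(live_path.suffix + ".bak")
        shutil.copy2(live_path, backup)
    # On POSIX a successful os.replace is an atomic rename; it can fail if
    # the two paths are on different filesystems.
    os.replace(tmp_path, live_path)
    return backup


with tempfile.TemporaryDirectory() as workdir:
    live = Path(workdir) / "activities.db"
    tmp = Path(workdir) / "activities.db.tmp"
    live.write_text("old build")   # stand-in for the live database
    tmp.write_text("new build")    # stand-in for the freshly built database

    backup = swap_in(tmp, live)

    # Collect the outcome while the temporary directory still exists.
    result = (live.read_text(), backup.read_text(), tmp.exists())
    print(result)
```

Keeping the temporary file next to the live one is what lets the final step be a plain rename, so a crash before that step leaves the live file as it was.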
--- a/scripts/SUBAGENT_PROMPT.md
+++ b/scripts/SUBAGENT_PROMPT.md
@@ -0,0 +1,81 @@
+# SUBAGENT — Activity extraction
+
+You are a subagent in the game-library extraction pipeline. You extract
+educational activities (games, team-building, scouting, recipes, songs,
+ceremonies) from one chunk of a source document into structured JSON.
+
+## Your task
+
+1. **Read ONLY the chunk you were assigned.** Do not read other chunks, other
+   files, or the original document. The chunk is a `.txt` file with
+   `--- PAGE N ---` markers.
+2. Identify **every distinct activity** in the chunk.
+3. For each activity, fill the schema in `scripts/activity_schema.json`.
+4. Write the result to `data/extracted/<chunk_key>.json`.
+
+## What counts as "a distinct activity"
+
+A distinct activity is a self-contained game/activity/recipe/song/ceremony with
+its own name and a real description of how to do it. It is NOT:
+
+- a bare mention or a cross-reference with no description — **skip it**;
+- a sub-variant of an activity already extracted — fold it into `variations`;
+- a heading, a table of contents entry, or running page chrome.
+
+If the same activity is split across a page boundary inside your chunk, treat it
+as **one** activity and combine the text.
+
+## Output format
+
+The file is one JSON object: a `header` plus an `activities` array.
+
+```json
+{
+  "header": {
+    "source_id": "<set from the prompt>",
+    "chunk_key": "<set from the prompt>",
+    "source_hash": "<set from the prompt>",
+    "schema_version": "1.0",
+    "prompt_version": "1.0",
+    "chunk_range": "pages 1-20"
+  },
+  "activities": [ ... ]
+}
+```
+
+## Rules for each activity
+
+- **`name`** — the activity's real name (≥3 characters).
+- **`description`** — real prose describing the activity. No hard length limit,
+  but it must actually describe what happens.
+- **`rules`** — how it is played / carried out, if the source gives rules.
+- **`category`** — exactly one taxonomy slug (see the `enum` in the schema):
+  `jocuri-cercetasesti`, `team-building`, `icebreakers`, `camp-outdoor`,
+  `wide-games`, `orientare`, `prim-ajutor`, `escape-room-puzzle`,
+  `creative-stem`, `sports-active`, `cantece-ceremonii`, `retete`,
+  `supravietuire`, `integrare-incluziune`, `conflict-empatie`, `altele`.
+  When unsure, use `altele`.
+- **`content_type`** — the FORM of the content, independent of category:
+  `joc`, `activitate`, `reteta`, `cantec`, or `ceremonie`.
+- **`language`** — `ro` or `en` (the language the activity is written in).
+- **`source_excerpt`** — **MANDATORY.** A short quote (one or two sentences)
+  copied **verbatim** from the chunk. This is the anti-hallucination anchor: it
+  is checked as a fuzzy substring of the chunk, and invented quotes are
+  rejected.
+- **`page_reference`** — **MANDATORY.** The `--- PAGE N ---` marker(s) the
+  activity came from, e.g. `"page 14"` or `"pages 14-15"`.
+- **`extraction_confidence`** — `high`, `med`, or `low`. Use `low` when the
+  source text for the activity is thin or ambiguous.
+
+## Never invent data
+
+- Do **not** invent ages, participant counts, or durations. If the source does
+  not state them, leave those fields `null`.
+- Do **not** paraphrase the `source_excerpt` — copy it character for character.
+- Better to extract fewer activities accurately than to pad the output.
+
+## Before you finish
+
+- Every activity has a non-empty `source_excerpt` and `page_reference`.
+- The file validates against `scripts/activity_schema.json`.
+- You only used text from your assigned chunk.
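To make the `source_excerpt` and `page_reference` rules concrete, the sketch below splits a chunk on its `--- PAGE N ---` markers and reports which pages contain a given excerpt. It is an illustration under assumptions: it uses exact matching after whitespace normalization rather than the fuzzy check the pipeline applies, and `split_pages`, `pages_containing`, and the sample chunk are hypothetical.

```python
import re
from typing import Dict, List

# Matches a marker line such as "--- PAGE 14 ---".
PAGE_MARKER = re.compile(r"^--- PAGE (\d+) ---$", re.MULTILINE)


def split_pages(chunk_text: str) -> Dict[int, str]:
    """Map each page number to the text between its marker and the next one."""
    pages: Dict[int, str] = {}
    matches = list(PAGE_MARKER.finditer(chunk_text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(chunk_text)
        pages[int(match.group(1))] = chunk_text[match.end():end]
    return pages


def _squash(text: str) -> str:
    """Collapse all runs of whitespace so line wrapping does not matter."""
    return " ".join(text.split())


def pages_containing(excerpt: str, chunk_text: str) -> List[int]:
    """Return the page numbers whose text contains the excerpt verbatim."""
    needle = _squash(excerpt)
    if not needle:
        return []
    return [n for n, body in split_pages(chunk_text).items() if needle in _squash(body)]


# Hypothetical two-page chunk, used only for this example.
chunk = """--- PAGE 14 ---
Capture the Flag
Two teams hide a flag in their own territory.
--- PAGE 15 ---
Knot Relay
Each scout ties one knot and runs back.
"""

found = pages_containing("Two teams hide a flag in their own territory.", chunk)
invented = pages_containing("Three teams share a single flag.", chunk)
print(found, invented)
```

A copied quote resolves to a page, while an invented one resolves to none. An excerpt that straddles a page boundary would not be found by this per-page lookup, which is one reason a whole-chunk fuzzy check is the more forgiving choice.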
--- a/scripts/activity_schema.json
+++ b/scripts/activity_schema.json
@@ -0,0 +1,110 @@
+{
+  "$schema": "http://json-schema.org/draft-07/schema#",
+  "title": "Game-library extraction output",
+  "description": "One subagent output file: a header carrying provenance/version metadata plus the list of activities extracted from a single chunk.",
+  "type": "object",
+  "required": ["header", "activities"],
+  "additionalProperties": false,
+  "properties": {
+    "header": {
+      "type": "object",
+      "required": ["source_hash", "schema_version", "prompt_version", "chunk_range"],
+      "additionalProperties": true,
+      "properties": {
+        "source_hash": {"type": "string", "minLength": 8},
+        "schema_version": {"type": "string"},
+        "prompt_version": {"type": "string"},
+        "chunk_range": {"type": "string"},
+        "source_id": {"type": ["string", "null"]},
+        "chunk_key": {"type": ["string", "null"]}
+      }
+    },
+    "activities": {
+      "type": "array",
+      "items": {"$ref": "#/definitions/activity"}
+    }
+  },
+  "definitions": {
+    "activity": {
+      "type": "object",
+      "required": [
+        "name",
+        "description",
+        "category",
+        "content_type",
+        "language",
+        "extraction_confidence",
+        "source_excerpt",
+        "page_reference"
+      ],
+      "additionalProperties": false,
+      "properties": {
+        "name": {"type": "string", "minLength": 3},
+        "description": {"type": "string", "minLength": 1},
+        "rules": {"type": ["string", "null"]},
+        "variations": {"type": ["string", "null"]},
+        "category": {
+          "type": "string",
+          "enum": [
+            "jocuri-cercetasesti",
+            "team-building",
+            "icebreakers",
+            "camp-outdoor",
+            "wide-games",
+            "orientare",
+            "prim-ajutor",
+            "escape-room-puzzle",
+            "creative-stem",
+            "sports-active",
+            "cantece-ceremonii",
+            "retete",
+            "supravietuire",
+            "integrare-incluziune",
+            "conflict-empatie",
+            "altele"
+          ]
+        },
+        "subcategory": {"type": ["string", "null"]},
+        "content_type": {
+          "type": "string",
+          "enum": ["joc", "activitate", "reteta", "cantec", "ceremonie"]
+        },
+        "language": {"type": "string", "enum": ["ro", "en"]},
+        "extraction_confidence": {
+          "type": "string",
+          "enum": ["high", "med", "low"]
+        },
+        "source_excerpt": {"type": "string", "minLength": 1},
+        "page_reference": {"type": "string", "minLength": 1},
+        "source_file": {"type": ["string", "null"]},
+        "age_group_min": {"type": ["integer", "null"], "minimum": 0},
+        "age_group_max": {"type": ["integer", "null"], "minimum": 0},
+        "participants_min": {"type": ["integer", "null"], "minimum": 0},
+        "participants_max": {"type": ["integer", "null"], "minimum": 0},
+        "duration_min": {"type": ["integer", "null"], "minimum": 0},
+        "duration_max": {"type": ["integer", "null"], "minimum": 0},
+        "materials_category": {"type": ["string", "null"]},
+        "materials_list": {
+          "type": ["array", "null"],
+          "items": {"type": "string"}
+        },
+        "skills_developed": {
+          "type": ["array", "null"],
+          "items": {"type": "string"}
+        },
+        "difficulty_level": {
+          "type": ["string", "null"],
+          "enum": ["usor", "mediu", "dificil", null]
+        },
+        "keywords": {
+          "type": ["array", "null"],
+          "items": {"type": "string"}
+        },
+        "tags": {
+          "type": ["array", "null"],
+          "items": {"type": "string"}
+        }
+      }
+    }
+  }
+}
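As a rough picture of what the activity definition above enforces, the sketch below hand-checks a few of its constraints: the required fields, three of the enums, and the minimum name length. It is only an illustration. The pipeline validates against the JSON Schema file itself, `check_activity` is a hypothetical helper, and the `category` enum and nullable fields are left out for brevity.

```python
from typing import List

# Copied from the "required" list of the activity definition.
REQUIRED = [
    "name", "description", "category", "content_type",
    "language", "extraction_confidence", "source_excerpt", "page_reference",
]

# A subset of the schema's enums (category omitted for brevity).
ENUMS = {
    "content_type": {"joc", "activitate", "reteta", "cantec", "ceremonie"},
    "language": {"ro", "en"},
    "extraction_confidence": {"high", "med", "low"},
}


def check_activity(activity: dict) -> List[str]:
    """Return human-readable problems; an empty list means these checks passed."""
    errors = []
    for field in REQUIRED:
        if field not in activity:
            errors.append(f"missing required field: {field}")
    for field, allowed in ENUMS.items():
        if field in activity and activity[field] not in allowed:
            errors.append(f"{field}: {activity[field]!r} not in {sorted(allowed)}")
    name = activity.get("name")
    if isinstance(name, str) and len(name) < 3:
        errors.append("name: shorter than 3 characters")
    return errors


# Hypothetical activity, used only for this example.
good = {
    "name": "Capture the Flag",
    "description": "Two teams try to steal each other's flag.",
    "category": "wide-games",
    "content_type": "joc",
    "language": "en",
    "extraction_confidence": "high",
    "source_excerpt": "Two teams hide a flag in their own territory.",
    "page_reference": "page 14",
}

# Same activity with an unsupported language and no page reference.
bad = dict(good, language="fr")
del bad["page_reference"]

good_errors = check_activity(good)
bad_errors = check_activity(bad)
print(good_errors, bad_errors)
```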
--- a/scripts/build_database.py
+++ b/scripts/build_database.py
@@ -0,0 +1,639 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+build_database.py — build data/activities.db from the subagent extraction JSON.
+
+Replaces the old import_claude_activities.py. Pipeline (plan §4):
+
+  1. `--rebuild` builds into data/activities.db.tmp; on success the live DB is
+     backed up to data/activities.db.bak and the tmp file is swapped in with an
+     atomic os.replace. A mid-build crash leaves the live DB untouched.
+  2. Every data/extracted/*.json is validated against scripts/activity_schema.json;
+     invalid files are moved to data/extracted/_rejected/ with an error log.
+  2b. Each source_excerpt must appear as a fuzzy substring (rapidfuzz
+     partial_ratio >= 90) of its source chunk — non-matches are hallucinations
+     and the activity is dropped (logged to _rejected/).
+  3. `category` is normalized to a valid taxonomy slug (fallback `altele`).
+  4. Dedup (D5): group by exact normalized_name, never across languages; within a
+     group, rapidfuzz token_sort_ratio on descriptions: >=85 auto-merge, 60 to <85
+     borderline (keep both, flag needs_review), <60 separate variants.
+  5. data/review_decisions.json is applied before insert.
+  6. Bulk insert into the tmp DB, populate the categories table, rebuild FTS.
+  7. A QA report is printed.
+
+Usage:
+    python scripts/build_database.py --rebuild
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import shutil
+import sys
+from collections import defaultdict
+from pathlib import Path
+from typing import Any, Optional
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
+    if _p not in sys.path:
+        sys.path.insert(0, _p)
+
+from app.config_taxonomy import (  # noqa: E402
+    category_display_name,
+    normalize_category,
+    normalize_content_type,
+)
+from app.models.activity import Activity  # noqa: E402
+from app.models.database import DatabaseManager  # noqa: E402
+from import_common import (  # noqa: E402
+    DEFAULT_SCHEMA_PATH,
+    content_key,
+    excerpt_matches,
+    find_chunk_text,
+    iter_extraction_files,
+    load_schema,
+    normalize_name,
+    source_path_for,
+)
+
+# dedup thresholds (rapidfuzz token_sort_ratio, 0..100 scale)
+AUTO_MERGE_THRESHOLD = 85.0
+BORDERLINE_THRESHOLD = 60.0
+
+
+# --------------------------------------------------------------------------
+# extraction dict -> Activity
+# --------------------------------------------------------------------------
+def _csv(value: Any) -> Optional[str]:
+    """Schema arrays -> comma string for the (TEXT) DB columns."""
+    if value is None:
+        return None
+    if isinstance(value, str):
+        return value.strip() or None
+    if isinstance(value, (list, tuple)):
+        parts = [str(v).strip() for v in value if str(v).strip()]
+        return ", ".join(parts) or None
+    return str(value)
+
+
+def _split_csv(value: Optional[str]) -> list[str]:
+    if not value:
+        return []
+    return [p.strip() for p in str(value).split(",") if p.strip()]
+
+
+def dict_to_activity(adict: dict, source_file: str) -> Activity:
+    """Build an Activity from one extraction-JSON activity object."""
+    tags = adict.get("tags") or []
+    if isinstance(tags, str):
+        tags = _split_csv(tags)
+
+    source_files = adict.get("source_files") or []
+    if isinstance(source_files, str):
+        source_files = _split_csv(source_files)
+    if source_file and source_file not in source_files:
+        source_files = [source_file, *source_files]
+
+    return Activity(
+        name=(adict.get("name") or "").strip(),
+        description=(adict.get("description") or "").strip(),
+        rules=adict.get("rules"),
+        variations=adict.get("variations"),
+        category=normalize_category(adict.get("category", "")),
+        subcategory=adict.get("subcategory"),
+        content_type=normalize_content_type(adict.get("content_type", "")),
+        source_file=source_file,
+        source_files=list(source_files),
+        page_reference=adict.get("page_reference"),
+        source_excerpt=adict.get("source_excerpt"),
+        age_group_min=adict.get("age_group_min"),
+        age_group_max=adict.get("age_group_max"),
+        participants_min=adict.get("participants_min"),
+        participants_max=adict.get("participants_max"),
+        duration_min=adict.get("duration_min"),
+        duration_max=adict.get("duration_max"),
+        materials_category=adict.get("materials_category"),
+        materials_list=_csv(adict.get("materials_list")),
+        skills_developed=_csv(adict.get("skills_developed")),
+        difficulty_level=adict.get("difficulty_level"),
+        keywords=_csv(adict.get("keywords")),
+        tags=list(tags),
+        language=adict.get("language"),
+        extraction_confidence=adict.get("extraction_confidence"),
+    )
+
+
+# --------------------------------------------------------------------------
+# step 3 — category normalization is done in dict_to_activity; a non-taxonomy
+# value silently falls back to `altele`. This logs the substitutions.
+# --------------------------------------------------------------------------
+def log_category_fallbacks(raw_pairs: list[tuple[str, str]]) -> list[str]:
+    """raw_pairs = (original, slug); return human-readable fallback messages."""
+    msgs = []
+    for original, slug in raw_pairs:
+        if slug == "altele" and normalize_name(original or "") not in ("", "altele"):
+            msgs.append(f"category '{original}' -> altele (not in taxonomy)")
+    return msgs
+
+
+# --------------------------------------------------------------------------
+# step 4 — dedup
+# --------------------------------------------------------------------------
+def _longest(*values: Optional[str]) -> Optional[str]:
+    best: Optional[str] = None
+    for v in values:
+        if v and (best is None or len(v) > len(best)):
+            best = v
+    return best
+
+
+def _union_csv(values: list[Optional[str]]) -> Optional[str]:
+    seen: list[str] = []
+    for value in values:
+        for item in _split_csv(value):
+            if item not in seen:
+                seen.append(item)
+    return ", ".join(seen) or None
+
+
+def merge_cluster(cluster: list[Activity]) -> Activity:
+    """Collapse a cluster of duplicate activities into one merged Activity."""
+    if len(cluster) == 1:
+        return cluster[0]
+
+    # representative = the one with the longest description
+    rep = max(cluster, key=lambda a: len(a.description or ""))
+    merged = Activity(
+        name=rep.name,
+        description=_longest(*(a.description for a in cluster)) or rep.description,
+        rules=_longest(*(a.rules for a in cluster)),
+        variations=_longest(*(a.variations for a in cluster)),
+        category=rep.category,
+        subcategory=rep.subcategory,
+        content_type=rep.content_type,
+        source_file=rep.source_file,
+        page_reference=rep.page_reference,
+        source_excerpt=rep.source_excerpt,
+        age_group_min=rep.age_group_min,
+        age_group_max=rep.age_group_max,
+        participants_min=rep.participants_min,
+        participants_max=rep.participants_max,
+        duration_min=rep.duration_min,
+        duration_max=rep.duration_max,
+        materials_category=rep.materials_category,
+        materials_list=_union_csv([a.materials_list for a in cluster]),
+        skills_developed=_union_csv([a.skills_developed for a in cluster]),
+        difficulty_level=rep.difficulty_level,
+        keywords=_union_csv([a.keywords for a in cluster]),
+        language=rep.language,
+        extraction_confidence=rep.extraction_confidence,
+    )
+    # union of tags
+    tags: list[str] = []
+    for a in cluster:
+        for t in a.tags or []:
+            if t not in tags:
+                tags.append(t)
+    merged.tags = tags
+    # accumulate every source the activity was seen in
+    sources: list[str] = []
+    for a in cluster:
+        for s in [a.source_file, *(a.source_files or [])]:
+            if s and s not in sources:
+                sources.append(s)
+    merged.source_files = sources
+    # popularity_score++ per merged duplicate (plan §4)
+    merged.popularity_score = max(a.popularity_score for a in cluster) + (len(cluster) - 1)
+    return merged
+
+
+def dedup_activities(activities: list[Activity]) -> tuple[list[Activity], dict]:
+    """
+    Dedup per plan D5.
+
+    Groups by (normalized_name, language) — different languages are NEVER
+    merged. Within a group, descriptions are clustered with rapidfuzz:
+      >= 85     -> same cluster (auto-merge)
+      60 to <85 -> borderline: kept as separate clusters, both flagged needs_review
+      < 60      -> separate variants
+    """
+    from rapidfuzz import fuzz
+
+    groups: dict[tuple, list[Activity]] = defaultdict(list)
+    for act in activities:
+        key = (act.normalized_name or normalize_name(act.name), act.language)
+        groups[key].append(act)
+
+    result: list[Activity] = []
+    stats = {"input": len(activities), "auto_merged": 0, "borderline": 0, "output": 0}
+
+    for members in groups.values():
+        clusters: list[list[Activity]] = []
+        borderline_idx: set[int] = set()
+
+        for act in members:
+            best_idx, best_score = -1, -1.0
+            borderline_here: list[int] = []
+            for idx, cluster in enumerate(clusters):
+                score = fuzz.token_sort_ratio(
+                    act.description or "", cluster[0].description or ""
+                )
+                if score >= AUTO_MERGE_THRESHOLD:
+                    if score > best_score:
+                        best_idx, best_score = idx, score
+                elif score >= BORDERLINE_THRESHOLD:
+                    borderline_here.append(idx)
+            if best_idx >= 0:
+                clusters[best_idx].append(act)
+            else:
+                clusters.append([act])
+                new_idx = len(clusters) - 1
+                for bidx in borderline_here:
+                    borderline_idx.add(bidx)
+                    borderline_idx.add(new_idx)
+
+        for idx, cluster in enumerate(clusters):
+            merged = merge_cluster(cluster)
+            if len(cluster) > 1:
+                stats["auto_merged"] += len(cluster) - 1
+            if idx in borderline_idx:
+                merged.needs_review = 1
+                stats["borderline"] += 1
+            result.append(merged)
+
+    stats["output"] = len(result)
+    return result, stats
+
+
+# --------------------------------------------------------------------------
+# step 5 — review decisions
+# --------------------------------------------------------------------------
+def load_review_decisions(path: Path) -> dict:
+    if path and path.is_file():
+        try:
+            data = json.loads(path.read_text(encoding="utf-8"))
+            if isinstance(data, dict):
+                return data
+        except (json.JSONDecodeError, OSError):
+            pass
+    return {}
+
+
+def apply_review_decisions(
+    activities: list[Activity], decisions: dict
+) -> tuple[list[Activity], dict]:
+    """
+    Apply data/review_decisions.json (plan §5c).
+
+    Keyed by the stable content_key. A decision of `drop` removes the row;
+    `keep-separate` / `merge` clear needs_review (the user has resolved it).
+    Rows with no decision keep needs_review and resurface in the queue.
+    """
+    kept: list[Activity] = []
+    stats = {"dropped": 0, "resolved": 0}
+    for act in activities:
+        key = content_key(
+            act.normalized_name or normalize_name(act.name),
+            act.language,
+            act.description or "",
+        )
+        entry = decisions.get(key)
+        decision = entry.get("decision") if isinstance(entry, dict) else entry
+        if decision == "drop":
+            stats["dropped"] += 1
+            continue
+        if decision in ("keep-separate", "merge"):
+            act.needs_review = 0
+            stats["resolved"] += 1
+        kept.append(act)
+    return kept, stats
+
+
+# --------------------------------------------------------------------------
+# golden-set recall (plan §7)
+# --------------------------------------------------------------------------
+def _golden_names(data: Any) -> list[str]:
+    items = data.get("activities", data) if isinstance(data, dict) else data
+    names: list[str] = []
+    for item in items or []:
+        if isinstance(item, str):
+            names.append(item)
+        elif isinstance(item, dict) and item.get("name"):
+            names.append(item["name"])
+    return names
+
+
+def golden_recall(golden_dir: Path, activities: list[Activity]) -> Optional[dict]:
+    if not golden_dir or not golden_dir.is_dir():
+        return None
+    found = {normalize_name(a.name) for a in activities}
+    expected, hits = 0, 0
+    for gf in sorted(golden_dir.glob("*.json")):
+        try:
+            data = json.loads(gf.read_text(encoding="utf-8"))
+        except (json.JSONDecodeError, OSError):
+            continue
+        for name in _golden_names(data):
+            expected += 1
+            if normalize_name(name) in found:
+                hits += 1
+    if expected == 0:
+        return None
+    return {"expected": expected, "found": hits, "recall": round(hits / expected, 3)}
+
+
+# --------------------------------------------------------------------------
+# load + validate + excerpt-check the extraction files
+# --------------------------------------------------------------------------
+def collect_activities(
+    extracted_dir: Path,
+    chunks_dir: Path,
+    sources_dir: Path,
+    schema: dict,
+) -> dict:
+    """Validate, excerpt-check and convert every extraction file."""
+    rejected_dir = extracted_dir / "_rejected"
+    activities: list[Activity] = []
+    report = {
+        "files_total": 0,
+        "files_valid": 0,
+        "files_rejected_schema": 0,
+        "activities_raw": 0,
+        "activities_hallucinated": 0,
+        "category_fallbacks": [],
+    }
+    raw_categories: list[tuple[str, str]] = []
+
+    # local imports, done once rather than on every loop iteration
+    from import_common import chunk_key_for, validate_extraction
+
+    for json_path in iter_extraction_files(extracted_dir):
+        report["files_total"] += 1
+        try:
+            data = json.loads(json_path.read_text(encoding="utf-8"))
+        except json.JSONDecodeError as exc:
+            _reject_file(json_path, rejected_dir, [f"invalid JSON: {exc}"])
+            report["files_rejected_schema"] += 1
+            continue
+
+        # schema validation: any error rejects the whole file
+        errors = validate_extraction(data, schema)
+        if errors:
+            _reject_file(json_path, rejected_dir, errors)
+            report["files_rejected_schema"] += 1
+            continue
+        report["files_valid"] += 1
+
+        header = data.get("header", {})
+        chunk_text = find_chunk_text(json_path, header, chunks_dir)
+        source_id = header.get("source_id") or chunk_key_for(json_path, header).rsplit(
+            ".part", 1
+        )[0]
+        fallback_source = (
+            source_path_for(source_id, sources_dir) or source_id or json_path.stem
+        )
+
+        hallucinated: list[dict] = []
+        for adict in data.get("activities", []):
+            report["activities_raw"] += 1
+            excerpt = adict.get("source_excerpt") or ""
+            # If the chunk text is unavailable the excerpt cannot be verified:
+            # keep the activity; it is still counted under activities_raw.
+            if chunk_text is not None and not excerpt_matches(excerpt, chunk_text):
+                hallucinated.append(adict)
+                report["activities_hallucinated"] += 1
+                continue
+            src = adict.get("source_file") or fallback_source
+            raw_categories.append((adict.get("category", ""), normalize_category(adict.get("category", ""))))
+            activities.append(dict_to_activity(adict, src))
+
+        if hallucinated:
+            _log_hallucinations(json_path, rejected_dir, hallucinated)
+
+    report["category_fallbacks"] = log_category_fallbacks(raw_categories)
+    report["activities"] = activities
+    return report
+
+
+def _reject_file(json_path: Path, rejected_dir: Path, errors: list[str]) -> None:
+    rejected_dir.mkdir(parents=True, exist_ok=True)
+    dest = rejected_dir / json_path.name
+    shutil.move(str(json_path), str(dest))
+    log = rejected_dir / f"{json_path.stem}.errors.txt"
+    log.write_text(
+        f"REJECTED (schema validation): {json_path.name}\n\n"
+        + "\n".join(f"  - {e}" for e in errors)
+        + "\n",
+        encoding="utf-8",
+    )
+
+
+def _log_hallucinations(
+    json_path: Path, rejected_dir: Path, hallucinated: list[dict]
+) -> None:
+    rejected_dir.mkdir(parents=True, exist_ok=True)
+    log = rejected_dir / f"{json_path.stem}.hallucinations.txt"
+    lines = [f"DROPPED activities (source_excerpt not found in chunk): {json_path.name}", ""]
+    for a in hallucinated:
+        lines.append(f"  - {a.get('name')!r}")
+        lines.append(f"    excerpt: {a.get('source_excerpt')!r}")
+    log.write_text("\n".join(lines) + "\n", encoding="utf-8")
+
+
+# --------------------------------------------------------------------------
+# DB write + atomic swap
+# --------------------------------------------------------------------------
+def _enrich_category_display_names(db_path: Path) -> None:
+    """Give the categories table proper Romanian display names for slugs."""
+    import sqlite3
+
+    conn = sqlite3.connect(db_path)
+    try:
+        rows = conn.execute(
+            "SELECT value FROM categories WHERE type = 'category'"
+        ).fetchall()
+        for (slug,) in rows:
+            conn.execute(
+                "UPDATE categories SET display_name = ? WHERE type='category' AND value = ?",
+                (category_display_name(slug), slug),
+            )
+        conn.commit()
+    finally:
+        conn.close()
+
+
+def write_database(db_tmp_path: Path, activities: list[Activity]) -> None:
+    """Create a fresh tmp DB, bulk insert, populate categories, rebuild FTS."""
+    if db_tmp_path.exists():
+        db_tmp_path.unlink()
+    db = DatabaseManager(str(db_tmp_path))
+    db.bulk_insert_activities(activities)
+    _enrich_category_display_names(db_tmp_path)
+    db.rebuild_fts_index()
+
+
+def atomic_swap(db_tmp_path: Path, db_path: Path) -> Optional[Path]:
+    """Back up the live DB then atomically swap the tmp file in."""
+    backup: Optional[Path] = None
+    if db_path.exists():
+        backup = db_path.with_suffix(db_path.suffix + ".bak")
+        shutil.copy2(db_path, backup)
+    os.replace(db_tmp_path, db_path)
+    return backup
+
+
+# --------------------------------------------------------------------------
+# orchestration
+# --------------------------------------------------------------------------
+def rebuild(
+    *,
+    extracted_dir: Path,
+    chunks_dir: Path,
+    sources_dir: Path,
+    db_path: Path,
+    decisions_path: Optional[Path] = None,
+    schema_path: Path = DEFAULT_SCHEMA_PATH,
+    golden_dir: Optional[Path] = None,
+    do_swap: bool = True,
+) -> dict:
+    """
+    Full rebuild. Everything is built into <db_path>.tmp; the live DB is only
+    touched by the final atomic swap, so a crash anywhere above leaves it intact.
+    """
+    extracted_dir = Path(extracted_dir)
+    db_path = Path(db_path)
+    db_tmp_path = db_path.with_suffix(db_path.suffix + ".tmp")
+
+    schema = load_schema(schema_path)
+    collected = collect_activities(extracted_dir, Path(chunks_dir), Path(sources_dir), schema)
+    activities: list[Activity] = collected.pop("activities")
+
+    deduped, dedup_stats = dedup_activities(activities)
+
+    decisions = load_review_decisions(Path(decisions_path)) if decisions_path else {}
+    final, decision_stats = apply_review_decisions(deduped, decisions)
+
+    try:
+        write_database(db_tmp_path, final)
+        backup = atomic_swap(db_tmp_path, db_path) if do_swap else None
+    except Exception:
+        if db_tmp_path.exists():
+            db_tmp_path.unlink()
+        raise
+
+    report = {
+        **collected,
+        "dedup": dedup_stats,
+        "decisions": decision_stats,
+        "final_count": len(final),
+        "backup": str(backup) if backup else None,
+        "swapped": do_swap,
+        "qa": _qa_report(final, collected, golden_dir),
+    }
+    return report
+
+
+def _qa_report(
+    activities: list[Activity], collected: dict, golden_dir: Optional[Path]
+) -> dict:
+    per_category: dict[str, int] = defaultdict(int)
+    per_content_type: dict[str, int] = defaultdict(int)
+    confidence: dict[str, int] = defaultdict(int)
+    with_rules = 0
+    for a in activities:
+        per_category[a.category] += 1
+        per_content_type[a.content_type or "?"] += 1
+        confidence[a.extraction_confidence or "?"] += 1
+        if a.rules and a.rules.strip():
+            with_rules += 1
+    raw = collected.get("activities_raw", 0)
+    hallucinated = collected.get("activities_hallucinated", 0)
+    return {
+        "total": len(activities),
+        "per_category": dict(per_category),
+        "per_content_type": dict(per_content_type),
+        "extraction_confidence": dict(confidence),
+        "pct_with_rules": round(100 * with_rules / len(activities), 1) if activities else 0.0,
+        "needs_review": sum(1 for a in activities if a.needs_review),
+        "hallucination_rate": round(100 * hallucinated / raw, 2) if raw else 0.0,
+        "golden_recall": golden_recall(Path(golden_dir), activities) if golden_dir else None,
+    }
+
+
+def print_report(report: dict) -> None:
+    qa = report["qa"]
+    print("=" * 60)
+    print("BUILD DATABASE — QA REPORT")
+    print("=" * 60)
+    print(f"extraction files     : {report['files_total']} "
+          f"(valid {report['files_valid']}, schema-rejected {report['files_rejected_schema']})")
+    print(f"activities raw       : {report['activities_raw']}")
+    print(f"  hallucinated drop  : {report['activities_hallucinated']} "
+          f"({qa['hallucination_rate']}%)")
+    d = report["dedup"]
+    print(f"dedup                : {d['input']} -> {d['output']} "
+          f"(auto-merged {d['auto_merged']}, borderline {d['borderline']})")
+    print(f"review decisions     : dropped {report['decisions']['dropped']}, "
+          f"resolved {report['decisions']['resolved']}")
+    print(f"final inserted       : {report['final_count']}")
+    print(f"% with rules         : {qa['pct_with_rules']}")
+    print(f"needs_review rows    : {qa['needs_review']}")
+    print("per category         :")
+    for slug, n in sorted(qa["per_category"].items(), key=lambda kv: -kv[1]):
+        print(f"  {slug:<24}: {n}")
+    print("per content_type     :")
+    for ct, n in sorted(qa["per_content_type"].items(), key=lambda kv: -kv[1]):
+        print(f"  {ct:<24}: {n}")
+    print("extraction_confidence:")
+    for c, n in sorted(qa["extraction_confidence"].items()):
+        print(f"  {c:<24}: {n}")
+    if qa["golden_recall"]:
+        g = qa["golden_recall"]
+        print(f"golden recall        : {g['found']}/{g['expected']} = {g['recall']}")
+    if report["category_fallbacks"]:
+        print("category fallbacks   :")
+        for msg in report["category_fallbacks"]:
+            print(f"  {msg}")
+    if report["backup"]:
+        print(f"live DB backed up to : {report['backup']}")
+    print("=" * 60)
+
+
+# --------------------------------------------------------------------------
+# CLI
+# --------------------------------------------------------------------------
+def main(argv: Optional[list[str]] = None) -> int:
+    parser = argparse.ArgumentParser(description="Build activities.db from extraction JSON.")
+    parser.add_argument("--rebuild", action="store_true",
+                        help="rebuild the database from scratch (only mode supported)")
+    parser.add_argument("--extracted", default="data/extracted")
+    parser.add_argument("--chunks", default="data/chunks")
+    parser.add_argument("--sources", default="data/sources")
+    parser.add_argument("--db", default="data/activities.db")
+    parser.add_argument("--decisions", default="data/review_decisions.json")
+    parser.add_argument("--golden", default="data/golden")
+    parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
+    args = parser.parse_args(argv)
+
+    if not args.rebuild:
+        parser.error("only --rebuild is supported (full rebuild, no incremental merge)")
+
+    report = rebuild(
+        extracted_dir=Path(args.extracted),
+        chunks_dir=Path(args.chunks),
+        sources_dir=Path(args.sources),
+        db_path=Path(args.db),
+        decisions_path=Path(args.decisions),
+        schema_path=Path(args.schema),
+        golden_dir=Path(args.golden),
+    )
+    print_report(report)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
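
The two rates in the QA report (`pct_with_rules`, `hallucination_rate`) both guard against a zero denominator before dividing. A minimal sketch of that pattern, assuming a hypothetical helper named `pct` that is not part of this commit:

```python
def pct(part: int, whole: int, digits: int = 1) -> float:
    """Return `part` as a percentage of `whole`, or 0.0 when `whole` is zero.

    Hypothetical helper for illustration only; the commit inlines this logic.
    """
    return round(100 * part / whole, digits) if whole else 0.0


# Empty inputs yield 0.0 instead of raising ZeroDivisionError.
rules_rate = pct(3, 4)             # share of activities with rules, 1 decimal
hallucination_rate = pct(1, 3, 2)  # share of raw activities dropped, 2 decimals
empty_rate = pct(0, 0)
```

Folding both computations into one helper would keep the rounding precision explicit at each call site, although the inline form in the diff behaves the same way.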
--- a/scripts/chunk_sources.py
+++ b/scripts/chunk_sources.py
@@ -0,0 +1,251 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+chunk_sources.py — split normalized data/sources/*.txt into ~20-page chunks
+for subagent extraction, and maintain data/chunks/manifest.json.
+
+Paginated text  → ~20-page chunks, ~4-page overlap (plan D8).
+Unpaginated text → ~10000-word windows, ~2000-word overlap.
+
+The manifest is a cache derived from the filesystem + per-chunk state. Re-running
+this script is idempotent: existing chunk states (pending/assigned/done/rejected)
+survive as long as the source content hash is unchanged.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+if str(SCRIPT_DIR) not in sys.path:
+    sys.path.insert(0, str(SCRIPT_DIR))
+
+from extract_common import content_hash, split_pages  # noqa: E402
+
+SCHEMA_VERSION = "1.0"
+PAGES_PER_CHUNK = 20
+PAGE_OVERLAP = 4
+WORD_WINDOW = 10_000
+WORD_OVERLAP = 2_000
+
+VALID_STATES = {"pending", "assigned", "done", "rejected"}
+
+
+# --------------------------------------------------------------------------
+# header parsing
+# --------------------------------------------------------------------------
+def parse_source(text: str) -> tuple[dict, str]:
+    """Split a normalized source file into (header_dict, body)."""
+    lines = text.splitlines()
+    header: dict = {}
+    body_start = 0
+    in_header = True
+    for i, line in enumerate(lines):
+        if line.startswith("--- PAGE "):
+            body_start = i
+            break
+        if not in_header:
+            continue
+        if set(line.strip()) == {"="} and line.strip():
+            body_start = i + 1
+            in_header = False  # header ends at the rule line
+            continue
+        if ":" in line:
+            key, _, val = line.partition(":")
+            header[key.strip()] = val.strip()
+    body = "\n".join(lines[body_start:])
+    return header, body
+
+
+# --------------------------------------------------------------------------
+# chunking — pure functions
+# --------------------------------------------------------------------------
+def chunk_pages(
+    pages: list[tuple[int, str]],
+    pages_per_chunk: int = PAGES_PER_CHUNK,
+    overlap: int = PAGE_OVERLAP,
+) -> list[dict]:
+    """
+    Split an ordered list of (page_no, text) into overlapping chunks.
+
+    stride = pages_per_chunk - overlap, so consecutive chunks share `overlap`
+    pages: an activity spanning up to overlap + 1 pages is whole in some chunk.
+    """
+    if not pages:
+        return []
+    stride = max(1, pages_per_chunk - overlap)
+    chunks: list[dict] = []
+    i = 0
+    n = len(pages)
+    while i < n:
+        window = pages[i : i + pages_per_chunk]
+        first, last = window[0][0], window[-1][0]
+        text = "".join(
+            f"\n--- PAGE {num} ---\n{txt}\n" for num, txt in window
+        )
+        chunks.append(
+            {"page_start": first, "page_end": last,
+             "chunk_range": f"pages {first}-{last}", "text": text}
+        )
+        if i + pages_per_chunk >= n:
+            break
+        i += stride
+    return chunks
+
+
+def chunk_words(
+    text: str, window: int = WORD_WINDOW, overlap: int = WORD_OVERLAP
+) -> list[dict]:
+    """Split unpaginated text into overlapping word windows."""
+    words = text.split()
+    if not words:
+        return []
+    stride = max(1, window - overlap)
+    chunks: list[dict] = []
+    i = 0
+    n = len(words)
+    while i < n:
+        seg = words[i : i + window]
+        chunks.append(
+            {"word_start": i, "word_end": i + len(seg),
+             "chunk_range": f"words {i}-{i + len(seg)}", "text": " ".join(seg)}
+        )
+        if i + window >= n:
+            break
+        i += stride
+    return chunks
+
+
+def make_chunks(source_text: str) -> list[dict]:
+    """Chunk one normalized source file. Picks page- or word-windowing."""
+    _, body = parse_source(source_text)
+    pages = split_pages(body)
+    if pages:
+        return chunk_pages(pages)
+    return chunk_words(body)
+
+
+# --------------------------------------------------------------------------
+# manifest
+# --------------------------------------------------------------------------
+def _empty_manifest() -> dict:
+    return {"schema_version": SCHEMA_VERSION, "chunks": {}}
+
+
+def load_manifest(manifest_path: Path) -> dict:
+    if manifest_path.exists():
+        try:
+            data = json.loads(manifest_path.read_text(encoding="utf-8"))
+            data.setdefault("schema_version", SCHEMA_VERSION)
+            data.setdefault("chunks", {})
+            return data
+        except (json.JSONDecodeError, OSError):
+            pass
+    return _empty_manifest()
+
+
+def save_manifest(manifest: dict, manifest_path: Path) -> None:
+    manifest_path.parent.mkdir(parents=True, exist_ok=True)
+    manifest_path.write_text(
+        json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8"
+    )
+
+
+def chunk_source_file(
+    source_path: Path, chunks_dir: Path, manifest: dict
+) -> list[str]:
+    """
+    Chunk one data/sources/<id>.txt → data/chunks/<id>/<id>.partNN.txt and
+    register every chunk in `manifest`. Preserves prior state when the source
+    content hash is unchanged. Returns the list of chunk keys written.
+    """
+    source_id = source_path.stem
+    text = source_path.read_text(encoding="utf-8", errors="replace")
+    src_hash = content_hash(text)
+    chunks = make_chunks(text)
+
+    out_dir = chunks_dir / source_id
+    out_dir.mkdir(parents=True, exist_ok=True)
+
+    written: list[str] = []
+    for idx, chunk in enumerate(chunks, 1):
+        key = f"{source_id}.part{idx:02d}"
+        chunk_file = out_dir / f"{key}.txt"
+        chunk_file.write_text(chunk["text"], encoding="utf-8")
+
+        prior = manifest["chunks"].get(key)
+        # preserve state only if the source content is unchanged
+        if prior and prior.get("source_hash") == src_hash and \
+                prior.get("state") in VALID_STATES:
+            state = prior["state"]
+        else:
+            state = "pending"
+
+        manifest["chunks"][key] = {
+            "source_id": source_id,
+            "source_hash": src_hash,
+            "part": idx,
+            "chunk_range": chunk["chunk_range"],
+            "chunk_file": str(chunk_file.relative_to(chunks_dir.parent)),
+            "expected_json": f"{key}.json",
+            "state": state,
+        }
+        written.append(key)
+    return written
+
+
+def prune_stale(manifest: dict, live_keys: set[str]) -> list[str]:
+    """Drop manifest entries whose key is not in `live_keys` (not regenerated this run)."""
+    stale = [k for k in manifest["chunks"] if k not in live_keys]
+    for k in stale:
+        del manifest["chunks"][k]
+    return stale
+
+
+# --------------------------------------------------------------------------
+# CLI
+# --------------------------------------------------------------------------
+def run(sources_dir: Path, chunks_dir: Path) -> dict:
+    """Chunk every *.txt in sources_dir. Returns a summary dict."""
+    manifest_path = chunks_dir / "manifest.json"
+    manifest = load_manifest(manifest_path)
+
+    live_keys: set[str] = set()
+    source_files = sorted(sources_dir.glob("*.txt"))
+    for src in source_files:
+        live_keys.update(chunk_source_file(src, chunks_dir, manifest))
+
+    stale = prune_stale(manifest, live_keys)
+    save_manifest(manifest, manifest_path)
+
+    states: dict[str, int] = {}
+    for meta in manifest["chunks"].values():
+        states[meta["state"]] = states.get(meta["state"], 0) + 1
+    return {
+        "sources": len(source_files),
+        "chunks": len(live_keys),
+        "pruned": len(stale),
+        "states": states,
+    }
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(description="Chunk normalized sources.")
+    parser.add_argument("--sources", default="data/sources", help="sources dir")
+    parser.add_argument("--chunks", default="data/chunks", help="chunks output dir")
+    args = parser.parse_args(argv)
+
+    summary = run(Path(args.sources), Path(args.chunks))
+    print(f"sources processed : {summary['sources']}")
+    print(f"chunks written    : {summary['chunks']}")
+    print(f"stale pruned      : {summary['pruned']}")
+    for state, count in sorted(summary["states"].items()):
+        print(f"  {state:<10}: {count}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
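
A minimal sketch of the windowing arithmetic behind `chunk_pages`, reduced to page numbers only. `chunk_windows` and `fits_in_some_window` are hypothetical helpers, not part of this commit; they assume the same defaults (20 pages per chunk, 4 pages of overlap) and use 1-based inclusive page ranges.

```python
def chunk_windows(
    n_pages: int, pages_per_chunk: int = 20, overlap: int = 4
) -> list[tuple[int, int]]:
    """Return (first_page, last_page) for each chunk of an n_pages document."""
    stride = max(1, pages_per_chunk - overlap)
    windows: list[tuple[int, int]] = []
    i = 0
    while i < n_pages:
        last = min(i + pages_per_chunk, n_pages)
        windows.append((i + 1, last))
        if i + pages_per_chunk >= n_pages:
            break  # this window already reaches the final page
        i += stride
    return windows


def fits_in_some_window(first: int, last: int, windows: list[tuple[int, int]]) -> bool:
    """True if pages first..last lie entirely inside at least one window."""
    return any(lo <= first and last <= hi for lo, hi in windows)


windows = chunk_windows(50)
```

With these defaults, an activity covering pages 16-20 or 17-21 (five pages, i.e. `overlap + 1`) sits whole inside one window, while one covering pages 16-21 does not fit in any single window.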
--- a/scripts/claude_extraction_template.md
+++ b/scripts/claude_extraction_template.md
@@ -1,54 +0,0 @@
-# TEMPLATE FOR ACTIVITY EXTRACTION WITH CLAUDE
-
-## Instructions for Claude Code:
-
-For each PDF/DOC, use the following extraction format:
-
-### 1. Read the file:
-```
-Claude, please read the file: [CALE_FISIER]
-```
-
-### 2. Extract the activities using this JSON template:
-```json
-{
-  "source_file": "[NUME_FISIER]",
-  "activities": [
-    {
-      "name": "Activity name",
-      "description": "Full description of the activity",
-      "rules": "Rules of the game/activity",
-      "variations": "Variants or adaptations",
-      "category": "[A-H] based on type",
-      "age_group_min": 6,
-      "age_group_max": 14,
-      "participants_min": 4,
-      "participants_max": 20,
-      "duration_min": 10,
-      "duration_max": 30,
-      "materials_list": "List of required materials",
-      "skills_developed": "Skills developed",
-      "difficulty_level": "Ușor/Mediu/Dificil",
-      "keywords": "comma-separated keywords",
-      "tags": "relevant tags"
-    }
-  ]
-}
-```
-
-### 3. Save to file:
-After extraction, save the JSON to: `/scripts/extracted_activities/[NUME_FISIER].json`
-
-### 4. Processing priorities:
-
-**TOP PRIORITY (process these first):**
-1. 1000 Fantastic Scout Games.pdf
-2. Cartea Mare a jocurilor.pdf
-3. 160-de-activitati-dinamice-jocuri-pentru-team-building.pdf
-4. 101 Ways to Create an Unforgettable Camp Experience.pdf
-5. 151 Awesome Summer Camp Nature Activities.pdf
-
-**Focus categories:**
- [A] Jocuri Cercetășești
- [C] Camping & Activități Exterior
- [G] Activități Educaționale
--- a/scripts/create_databases.py
+++ b/scripts/create_databases.py
@@ -1,164 +0,0 @@
-#!/usr/bin/env python3
-# -*- coding: utf-8 -*-
-"""
-DATABASE SETUP SCRIPT - INDEX-SISTEM-JOCURI
-
-Script for recreating the databases listed in .gitignore
-Uses the DatabaseManager classes for consistency
-
-Usage:
-    python scripts/create_databases.py
-    python scripts/create_databases.py --clear-existing
-"""
-
-import sys
-import argparse
-from pathlib import Path
-
-# Add src to path so we can import our modules
-sys.path.append(str(Path(__file__).parent.parent / 'src'))
-
-from database import DatabaseManager
-from game_library_manager import GameLibraryManager
-
-def create_main_database(db_path: str = "data/activities.db", clear: bool = False):
-    """Create the main activities database"""
-    db_file = Path(db_path)
-    
-    if clear and db_file.exists():
-        print(f"🗑️  Removing existing database: {db_path}")
-        db_file.unlink()
-    
-    print(f"📊 Creating main database: {db_path}")
-    db = DatabaseManager(db_path)
-    
-    # Test the database
-    try:
-        stats = db.get_statistics()
-        print(f"✅ Database created successfully: {stats['total_activities']} activities")
-        return True
-    except Exception as e:
-        print(f"❌ Error creating database: {e}")
-        return False
-
-def create_game_library_database(db_path: str = "data/game_library.db", clear: bool = False):
-    """Create the legacy game library database"""
-    db_file = Path(db_path)
-    
-    if clear and db_file.exists():
-        print(f"🗑️  Removing existing database: {db_path}")
-        db_file.unlink()
-    
-    print(f"📊 Creating game library database: {db_path}")
-    manager = GameLibraryManager(db_path)
-    
-    print(f"✅ Game library database created successfully")
-    return True
-
-def create_test_database(db_path: str = "data/test_activities.db", clear: bool = False):
-    """Create the test database"""
-    db_file = Path(db_path)
-    
-    if clear and db_file.exists():
-        print(f"🗑️  Removing existing database: {db_path}")
-        db_file.unlink()
-    
-    print(f"📊 Creating test database: {db_path}")
-    db = DatabaseManager(db_path)
-    
-    # Add some test data
-    test_activity = {
-        'title': 'Test Activity - Setup Script',
-        'description': 'This is a test activity created by the setup script',
-        'file_path': 'test/sample.txt',
-        'file_type': 'TXT',
-        'category': 'test',
-        'age_group': '8-12 ani',
-        'participants': '5-10 persoane',
-        'duration': '15-30min',
-        'materials': 'Fără materiale',
-        'tags': '["test", "setup"]',
-        'source_text': 'Sample test content for verification'
-    }
-    
-    try:
-        db.insert_activity(test_activity)
-        stats = db.get_statistics()
-        print(f"✅ Test database created with sample data: {stats['total_activities']} activities")
-        return True
-    except Exception as e:
-        print(f"❌ Error creating test database: {e}")
-        return False
-
-def ensure_data_directory():
-    """Ensure the data directory exists"""
-    data_dir = Path("data")
-    if not data_dir.exists():
-        print(f"📁 Creating data directory: {data_dir}")
-        data_dir.mkdir(parents=True)
-    else:
-        print(f"📁 Data directory exists: {data_dir}")
-
-def main():
-    """Main setup function"""
-    parser = argparse.ArgumentParser(description='Create databases for INDEX-SISTEM-JOCURI')
-    parser.add_argument('--clear-existing', '-c', action='store_true',
-                       help='Remove existing databases before creating new ones')
-    parser.add_argument('--main-only', action='store_true',
-                       help='Create only the main activities database')
-    parser.add_argument('--test-only', action='store_true',
-                       help='Create only the test database')
-    
-    args = parser.parse_args()
-    
-    print("🚀 DATABASE SETUP - INDEX-SISTEM-JOCURI")
-    print("=" * 50)
-    
-    # Ensure data directory exists
-    ensure_data_directory()
-    
-    success_count = 0
-    total_count = 0
-    
-    if args.test_only:
-        total_count = 1
-        if create_test_database(clear=args.clear_existing):
-            success_count += 1
-    elif args.main_only:
-        total_count = 1
-        if create_main_database(clear=args.clear_existing):
-            success_count += 1
-    else:
-        # Create all databases
-        databases = [
-            ("Main activities", lambda: create_main_database(clear=args.clear_existing)),
-            ("Game library", lambda: create_game_library_database(clear=args.clear_existing)),
-            ("Test activities", lambda: create_test_database(clear=args.clear_existing))
-        ]
-        
-        total_count = len(databases)
-        
-        for name, create_func in databases:
-            print(f"\n📂 Creating {name} database...")
-            try:
-                if create_func():
-                    success_count += 1
-            except Exception as e:
-                print(f"❌ Failed to create {name} database: {e}")
-    
-    print("\n" + "=" * 50)
-    print(f"🎯 SUMMARY: {success_count}/{total_count} databases created successfully")
-    
-    if success_count == total_count:
-        print("✅ All databases ready!")
-        print("\nNext steps:")
-        print("1. Run indexer: cd src && python indexer.py --clear-db")
-        print("2. Start web app: cd src && python app.py")
-    else:
-        print("⚠️  Some databases failed to create. Check errors above.")
-        return 1
-    
-    return 0
-
-if __name__ == '__main__':
-    sys.exit(main())
--- a/scripts/extract_common.py
+++ b/scripts/extract_common.py
@@ -0,0 +1,361 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+extract_common.py — single home for per-format text extraction.
+
+Every extractor returns a plain text *body* with synthetic page markers
+(`--- PAGE N ---`). The file-level header (`SOURCE:` / `CONVERTED:`) is added
+by normalize_sources.py, not here.
+
+Critical fix vs. the old pdf_to_text_converter.py: there is NO `max_pages` cap.
+Large books are extracted in full.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import importlib
+import os
+import re
+import shutil
+import subprocess
+import tempfile
+import zipfile
+from pathlib import Path
+from typing import Callable
+
+PAGE_MARKER_RE = re.compile(r"^--- PAGE (\d+) ---\s*$", re.MULTILINE)
+
+# paragraphs per synthetic page for paginated-by-flow formats (docx)
+DOCX_PARAS_PER_PAGE = 40
+
+# formats we deliberately ignore (epub duplicates existing PDFs — plan §1)
+IGNORED_EXTENSIONS = {".epub"}
+
+# obvious junk filenames skipped during a walk
+JUNK_NAMES = {"desktop.ini", "linkuri-jocuri.txt"}
+JUNK_SUFFIXES = {".bak", ".tmp", ".ini"}
+
+
+# --------------------------------------------------------------------------
+# page assembly helpers
+# --------------------------------------------------------------------------
+def join_pages(pages: list[str], start: int = 1) -> str:
+    """Join a list of page texts into a body string with `--- PAGE N ---`."""
+    out: list[str] = []
+    for i, text in enumerate(pages, start):
+        out.append(f"\n--- PAGE {i} ---\n{(text or '').strip()}\n")
+    return "".join(out)
+
+
+def split_pages(body: str) -> list[tuple[int, str]]:
+    """Inverse of join_pages: parse a body into [(page_number, text), ...]."""
+    matches = list(PAGE_MARKER_RE.finditer(body))
+    if not matches:
+        return []
+    pages: list[tuple[int, str]] = []
+    for idx, m in enumerate(matches):
+        num = int(m.group(1))
+        seg_start = m.end()
+        seg_end = matches[idx + 1].start() if idx + 1 < len(matches) else len(body)
+        pages.append((num, body[seg_start:seg_end].strip()))
+    return pages
+
+
+def count_page_markers(body: str) -> int:
+    return len(PAGE_MARKER_RE.findall(body))
+
+
+# --------------------------------------------------------------------------
+# format detection
+# --------------------------------------------------------------------------
+FORMAT_BY_EXT = {
+    ".pdf": "pdf",
+    ".docx": "docx",
+    ".doc": "doc",
+    ".pptx": "pptx",
+    ".ppt": "pptx",  # NOTE: python-pptx reads only .pptx; legacy .ppt will raise in extract_pptx
+    ".htm": "html",
+    ".html": "html",
+    ".zip": "zip",
+    ".epub": "epub",
+    ".txt": "txt",
+}
+
+
+def detect_format(path: str | os.PathLike) -> str:
+    """Return a format key for a path based on its extension."""
+    ext = Path(path).suffix.lower()
+    return FORMAT_BY_EXT.get(ext, "unknown")
+
+
+def is_junk(path: str | os.PathLike) -> bool:
+    p = Path(path)
+    name = p.name.lower()
+    if name in JUNK_NAMES:
+        return True
+    if name.startswith("readme") and p.suffix.lower() == ".md":
+        return True
+    if p.suffix.lower() in JUNK_SUFFIXES:
+        return True
+    return False
+
+
+# --------------------------------------------------------------------------
+# content hashing + near-duplicate elimination
+# --------------------------------------------------------------------------
+def _normalize_for_hash(text: str) -> str:
+    return re.sub(r"\s+", " ", (text or "")).strip().lower()
+
+
+def content_hash(text: str) -> str:
+    """Stable SHA1 of whitespace-normalized text — used for exact-dup detection."""
+    return hashlib.sha1(_normalize_for_hash(text).encode("utf-8")).hexdigest()
+
+
+def near_duplicate_ratio(a: str, b: str) -> float:
+    """Similarity score in [0, 100] between two texts (rapidfuzz `fuzz.token_sort_ratio`)."""
+    from rapidfuzz import fuzz
+
+    return fuzz.token_sort_ratio(_normalize_for_hash(a), _normalize_for_hash(b))
+
+
+def dedupe_texts(
+    items: list[tuple[str, str]], threshold: float = 95.0
+) -> list[tuple[str, str]]:
+    """
+    Drop exact and near-duplicate texts from a list of (key, text) pairs.
+
+    Used for HTML mirror pages (print copies, repeated index/footer pages).
+    Keeps the first occurrence; O(n) on exact hash, O(n*k) fuzzy only against
+    already-kept items.
+    """
+    kept: list[tuple[str, str]] = []
+    seen_hashes: set[str] = set()
+    for key, text in items:
+        h = content_hash(text)
+        if h in seen_hashes:
+            continue
+        if any(near_duplicate_ratio(text, kt) >= threshold for _, kt in kept):
+            continue
+        seen_hashes.add(h)
+        kept.append((key, text))
+    return kept
+
+
+# --------------------------------------------------------------------------
+# preflight dependency check
+# --------------------------------------------------------------------------
+REQUIRED_PYTHON_MODULES = {
+    "pdfplumber": "pdfplumber",
+    "PyPDF2": "pypdf2",
+    "docx": "python-docx",
+    "pptx": "python-pptx",
+    "bs4": "beautifulsoup4",
+    "lxml": "lxml",
+    "jsonschema": "jsonschema",
+    "rapidfuzz": "rapidfuzz",
+    "chardet": "chardet",
+}
+
+
+def preflight(check_ocr: bool = False) -> dict:
+    """
+    Check system + Python dependencies before a long normalization run.
+
+    Returns {'ok': bool, 'missing_python': [...], 'missing_system': [...],
+             'warnings': [...]}.  libreoffice is a *warning* (only .doc needs it),
+             tesseract only checked when check_ocr=True.
+    """
+    missing_python: list[str] = []
+    for module, pip_name in REQUIRED_PYTHON_MODULES.items():
+        try:
+            importlib.import_module(module)
+        except ImportError:
+            missing_python.append(pip_name)
+
+    warnings: list[str] = []
+    missing_system: list[str] = []
+
+    if not (shutil.which("libreoffice") or shutil.which("soffice")):
+        warnings.append("libreoffice not found — legacy .doc files cannot be converted")
+
+    if check_ocr and not shutil.which("tesseract"):
+        missing_system.append("tesseract (OCR requested but not installed)")
+
+    return {
+        "ok": not missing_python and not missing_system,
+        "missing_python": missing_python,
+        "missing_system": missing_system,
+        "warnings": warnings,
+    }
+
+
+# --------------------------------------------------------------------------
+# per-format extractors
+# --------------------------------------------------------------------------
+def extract_pdf(path: str | os.PathLike) -> str:
+    """PDF → body. pdfplumber primary, PyPDF2 fallback. No page cap."""
+    path = str(path)
+    try:
+        return _extract_pdf_pdfplumber(path)
+    except Exception:
+        return _extract_pdf_pypdf2(path)
+
+
+def _extract_pdf_pdfplumber(path: str) -> str:
+    import pdfplumber
+
+    pages: list[str] = []
+    with pdfplumber.open(path) as pdf:
+        for page in pdf.pages:  # ALL pages — no max_pages
+            try:
+                pages.append(page.extract_text() or "")
+            except Exception:
+                pages.append("")
+    return join_pages(pages)
+
+
+def _extract_pdf_pypdf2(path: str) -> str:
+    import PyPDF2
+
+    pages: list[str] = []
+    with open(path, "rb") as fh:
+        reader = PyPDF2.PdfReader(fh)
+        for page in reader.pages:  # ALL pages — no max_pages
+            try:
+                pages.append(page.extract_text() or "")
+            except Exception:
+                pages.append("")
+    return join_pages(pages)
+
+
+def extract_docx(path: str | os.PathLike) -> str:
+    """docx → body. Synthetic page marker every DOCX_PARAS_PER_PAGE paragraphs."""
+    import docx
+
+    document = docx.Document(str(path))
+    paragraphs = [p.text for p in document.paragraphs]
+    pages: list[str] = []
+    for i in range(0, max(len(paragraphs), 1), DOCX_PARAS_PER_PAGE):
+        chunk = paragraphs[i : i + DOCX_PARAS_PER_PAGE]
+        pages.append("\n".join(chunk))
+    return join_pages(pages)
+
+
+def extract_doc(path: str | os.PathLike) -> str:
+    """
+    Legacy .doc → body via `libreoffice --headless --convert-to docx`.
+
+    Raises RuntimeError if libreoffice is unavailable — the caller marks the
+    resulting source `needs_review` regardless (conversion is imperfect).
+    """
+    soffice = shutil.which("libreoffice") or shutil.which("soffice")
+    if not soffice:
+        raise RuntimeError("libreoffice/soffice not available — cannot convert .doc")
+
+    src = Path(path).resolve()
+    with tempfile.TemporaryDirectory() as tmp:
+        subprocess.run(
+            [soffice, "--headless", "--convert-to", "docx", "--outdir", tmp, str(src)],
+            check=True,
+            capture_output=True,
+            timeout=300,
+        )
+        converted = Path(tmp) / (src.stem + ".docx")
+        if not converted.exists():
+            raise RuntimeError(f"libreoffice produced no output for {src.name}")
+        return extract_docx(converted)
+
+
+def extract_pptx(path: str | os.PathLike) -> str:
+    """pptx → body. One page per slide: title + body text + speaker notes."""
+    from pptx import Presentation
+
+    presentation = Presentation(str(path))
+    pages: list[str] = []
+    for slide in presentation.slides:
+        parts: list[str] = []
+        for shape in slide.shapes:
+            if shape.has_text_frame and shape.text_frame.text.strip():
+                parts.append(shape.text_frame.text.strip())
+        if slide.has_notes_slide:
+            notes = slide.notes_slide.notes_text_frame.text.strip()
+            if notes:
+                parts.append(f"[NOTES] {notes}")
+        pages.append("\n".join(parts))
+    return join_pages(pages)
+
+
+def extract_html(path: str | os.PathLike) -> str:
+    """HTML mirror page → body. Strips nav/script/style/footer/header/aside."""
+    import chardet
+    from bs4 import BeautifulSoup
+
+    raw = Path(path).read_bytes()
+    enc = chardet.detect(raw).get("encoding") or "utf-8"
+    soup = BeautifulSoup(raw.decode(enc, errors="replace"), "lxml")
+
+    for tag in soup(["script", "style", "nav", "footer", "header", "aside", "noscript"]):
+        tag.decompose()
+    # also drop common page chrome identified by ARIA role
+    for tag in soup.find_all(attrs={"role": ["navigation", "banner", "contentinfo"]}):
+        tag.decompose()
+
+    text = soup.get_text(separator="\n")
+    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
+    return join_pages(["\n".join(lines)])
+
+
+def extract_zip(path: str | os.PathLike) -> str:
+    """
+    zip → body. Unzips into a temp dir and recurses on every extractable inner
+    file. Inner files are page-renumbered into one continuous body.
+    """
+    path = str(path)
+    pages: list[str] = []
+    with tempfile.TemporaryDirectory() as tmp:
+        try:
+            with zipfile.ZipFile(path) as zf:
+                zf.extractall(tmp)
+        except zipfile.BadZipFile:
+            return ""
+        for inner in sorted(Path(tmp).rglob("*")):
+            if not inner.is_file() or is_junk(inner):
+                continue
+            fmt = detect_format(inner)
+            if fmt in ("unknown", "epub", "zip"):
+                # recurse into nested zips; unknown and epub files are skipped
+                if fmt == "zip":
+                    body = extract_zip(inner)
+                    pages.extend(t for _, t in split_pages(body))
+                continue
+            try:
+                body = extract_file(inner)
+            except Exception:
+                continue
+            pages.extend(t for _, t in split_pages(body))
+    return join_pages(pages)
+
+
+EXTRACTORS: dict[str, Callable[[str | os.PathLike], str]] = {
+    "pdf": extract_pdf,
+    "docx": extract_docx,
+    "doc": extract_doc,
+    "pptx": extract_pptx,
+    "html": extract_html,
+    "zip": extract_zip,
+}
+
+
+def extract_file(path: str | os.PathLike) -> str:
+    """Dispatch a single file to the right extractor. Returns a page-marked body."""
+    fmt = detect_format(path)
+    if fmt == "txt":
+        body = Path(path).read_text(encoding="utf-8", errors="replace")
+        # already paginated? pass through; else wrap as one page
+        return body if count_page_markers(body) else join_pages([body])
+    extractor = EXTRACTORS.get(fmt)
+    if extractor is None:
+        raise ValueError(f"No extractor for format '{fmt}': {path}")
+    return extractor(path)
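
A minimal round-trip sketch of the `--- PAGE N ---` marker convention that every extractor shares. The two functions below are simplified copies of `join_pages` and `split_pages` from the diff, reproduced so the example is self-contained; the sample page texts are made up for illustration.

```python
import re

PAGE_MARKER_RE = re.compile(r"^--- PAGE (\d+) ---\s*$", re.MULTILINE)


def join_pages(pages: list[str], start: int = 1) -> str:
    """Join page texts into one body, prefixing each with a page marker."""
    return "".join(
        f"\n--- PAGE {i} ---\n{(text or '').strip()}\n"
        for i, text in enumerate(pages, start)
    )


def split_pages(body: str) -> list[tuple[int, str]]:
    """Parse a marked body back into [(page_number, text), ...]."""
    matches = list(PAGE_MARKER_RE.finditer(body))
    pages: list[tuple[int, str]] = []
    for idx, m in enumerate(matches):
        # A page runs from the end of its marker to the start of the next one.
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(body)
        pages.append((int(m.group(1)), body[m.end():end].strip()))
    return pages


# An empty page (e.g. a scanned page with no text layer) keeps its number.
body = join_pages(["first page", "", "third page"])
pages = split_pages(body)
```

Keeping empty pages in the sequence means page numbers in the normalized text stay aligned with the source document, which is what lets `chunk_range` values such as `pages 17-36` be traced back to the original file.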
--- a/scripts/html_extractor.py
+++ b/scripts/html_extractor.py
@@ -1,424 +0,0 @@
-#!/usr/bin/env python3
-"""
-HTML Activity Extractor - processes 1876 HTML files
-Automatically extracts activities using pattern recognition
-"""
-
-import os
-import re
-import json
-from pathlib import Path
-from bs4 import BeautifulSoup
-import chardet
-from typing import List, Dict, Optional
-import sqlite3
-from datetime import datetime
-
-class HTMLActivityExtractor:
-    def __init__(self, db_path='data/activities.db'):
-        self.db_path = db_path
-        # Patterns for detecting activities in Romanian
-        self.activity_patterns = {
-            'title_patterns': [
-                r'(?i)(joc|activitate|exerci[t]iu|team[\s-]?building|energizer|ice[\s-]?breaker)[\s:]+([^\.]{5,100})',
-                r'(?i)<h[1-6][^>]*>([^<]*(?:joc|activitate|exerci[t]iu)[^<]*)</h[1-6]>',
-                r'(?i)<strong>([^<]*(?:joc|activitate|exerci[t]iu)[^<]*)</strong>',
-                r'(?i)^[\d]+\.?\s*([A-Z][^\.]{10,100}(?:joc|activitate|exerci[t]iu)[^\.]{0,50})$',
-            ],
-            'description_markers': [
-                'descriere', 'reguli', 'cum se joac[a]', 'instructiuni', 
-                'obiectiv', 'desfasurare', 'explicatie', 'mod de joc'
-            ],
-            'materials_markers': [
-                'materiale', 'necesare', 'echipament', 'ce avem nevoie',
-                'se folosesc', 'trebuie sa avem', 'dotari'
-            ],
-            'age_patterns': [
-                r'(?i)v[âa]rst[a][\s:]+(\d+)[\s-]+(\d+)',
-                r'(?i)(\d+)[\s-]+(\d+)\s*ani',
-                r'(?i)pentru\s+(\d+)[\s-]+(\d+)\s*ani',
-                r'(?i)categoria?\s*(?:de\s*)?v[âa]rst[a][\s:]+(\d+)[\s-]+(\d+)',
-            ],
-            'participants_patterns': [
-                r'(?i)(\d+)[\s-]+(\d+)\s*(?:participan[t]i|juc[a]tori|persoane|copii)',
-                r'(?i)num[a]r\s*(?:de\s*)?(?:participan[t]i|juc[a]tori)[\s:]+(\d+)[\s-]+(\d+)',
-                r'(?i)grup\s*de\s*(\d+)[\s-]+(\d+)',
-            ],
-            'duration_patterns': [
-                r'(?i)durat[a][\s:]+(\d+)[\s-]+(\d+)\s*(?:minute|min)',
-                r'(?i)timp[\s:]+(\d+)[\s-]+(\d+)\s*(?:minute|min)',
-                r'(?i)(\d+)[\s-]+(\d+)\s*minute',
-            ]
-        }
-        
-        # Predefined categories based on the existing system
-        self.categories = {
-            '[A]': ['joc', 'joaca', 'distractie', 'amuzament'],
-            '[B]': ['aventura', 'explorare', 'descoperire'],
-            '[C]': ['camping', 'tabara', 'excursie', 'drumetie'],
-            '[D]': ['foc', 'flacara', 'lumina'],
-            '[E]': ['noduri', 'frânghii', 'sfori', 'legare'],
-            '[F]': ['bushcraft', 'supravietuire', 'survival'],
-            '[G]': ['educatie', 'educativ', 'invatare', 'scoala'],
-            '[H]': ['orientare', 'busola', 'harta', 'navigare']
-        }
-    
-    def detect_encoding(self, file_path):
-        """Detect the file's encoding"""
-        with open(file_path, 'rb') as f:
-            result = chardet.detect(f.read())
-        return result['encoding'] or 'utf-8'
-    
-    def extract_from_html(self, html_path: str) -> List[Dict]:
-        """Extract activities from a single HTML file"""
-        activities = []
-        
-        try:
-            # Detect the encoding and read the file
-            encoding = self.detect_encoding(html_path)
-            with open(html_path, 'r', encoding=encoding, errors='ignore') as f:
-                content = f.read()
-            
-            soup = BeautifulSoup(content, 'lxml')
-            
-            # Method 1: look for lists of activities
-            activities.extend(self._extract_from_lists(soup, html_path))
-            
-            # Method 2: look for activities in headings
-            activities.extend(self._extract_from_headings(soup, html_path))
-            
-            # Method 3: look for patterns in the text
-            activities.extend(self._extract_from_patterns(soup, html_path))
-            
-            # Method 4: look in tables
-            activities.extend(self._extract_from_tables(soup, html_path))
-            
-        except Exception as e:
-            print(f"Error processing {html_path}: {e}")
-        
-        return activities
-    
-    def _extract_from_lists(self, soup, source_file):
-        """Extract activities from HTML lists (ul, ol)"""
-        activities = []
-        
-        for list_elem in soup.find_all(['ul', 'ol']):
-            # Check whether the list appears to contain activities
-            list_text = list_elem.get_text().lower()
-            if any(marker in list_text for marker in ['joc', 'activitate', 'exercitiu']):
-                for li in list_elem.find_all('li'):
-                    text = li.get_text(strip=True)
-                    if len(text) > 20:  # Minimum 20 characters for a valid activity
-                        activity = self._create_activity_from_text(text, source_file)
-                        if activity:
-                            activities.append(activity)
-        
-        return activities
-    
-    def _extract_from_headings(self, soup, source_file):
-        """Extract activities based on headings"""
-        activities = []
-        
-        for heading in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
-            heading_text = heading.get_text(strip=True)
-            
-            # Check whether the heading contains keywords
-            if any(keyword in heading_text.lower() for keyword in ['joc', 'activitate', 'exercitiu']):
-                # Look for the description in the following elements
-                description = ""
-                next_elem = heading.find_next_sibling()
-                
-                while next_elem and next_elem.name not in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
-                    if next_elem.name in ['p', 'div', 'ul']:
-                        description += next_elem.get_text(strip=True) + " "
-                        if len(description) > 500:  # Limit the description length
-                            break
-                    next_elem = next_elem.find_next_sibling()
-                
-                if description:
-                    activity = {
-                        'name': heading_text[:200],
-                        'description': description[:1000],
-                        'source_file': str(source_file),
-                        'category': self._detect_category(heading_text + " " + description)
-                    }
-                    activities.append(activity)
-        
-        return activities
-    
-    def _extract_from_patterns(self, soup, source_file):
-        """Extract activities using pattern matching"""
-        activities = []
-        text = soup.get_text()
-        
-        # Look for activity patterns
-        for pattern in self.activity_patterns['title_patterns']:
-            matches = re.finditer(pattern, text, re.MULTILINE)
-            for match in matches:
-                title = match.group(0) if match.lastindex == 0 else match.group(match.lastindex)
-                if len(title) > 10:
-                    # Extract the context around the match
-                    start = max(0, match.start() - 200)
-                    end = min(len(text), match.end() + 500)
-                    context = text[start:end]
-                    
-                    activity = self._create_activity_from_text(context, source_file, title)
-                    if activity:
-                        activities.append(activity)
-        
-        return activities
-    
-    def _extract_from_tables(self, soup, source_file):
-        """Extract activities from tables"""
-        activities = []
-        
-        for table in soup.find_all('table'):
-            rows = table.find_all('tr')
-            if len(rows) > 1:  # At least a header and one data row
-                # Detect the relevant columns
-                headers = [th.get_text(strip=True).lower() for th in rows[0].find_all(['th', 'td'])]
-                
-                for row in rows[1:]:
-                    cells = row.find_all(['td'])
-                    if cells:
-                        activity_data = {}
-                        for i, cell in enumerate(cells):
-                            if i < len(headers):
-                                activity_data[headers[i]] = cell.get_text(strip=True)
-                        
-                        # Create an activity from the table data
-                        if any(key in activity_data for key in ['joc', 'activitate', 'nume', 'titlu']):
-                            activity = self._create_activity_from_table_data(activity_data, source_file)
-                            if activity:
-                                activities.append(activity)
-        
-        return activities
-    
-    def _create_activity_from_text(self, text, source_file, title=None):
-        """Create an activity dictionary from text"""
-        if not text or len(text) < 30:
-            return None
-        
-        activity = {
-            'name': title or text[:100].split('.')[0].strip(),
-            'description': text[:1000],
-            'source_file': str(source_file),
-            'category': self._detect_category(text),
-            'keywords': self._extract_keywords(text),
-            'created_at': datetime.now().isoformat()
-        }
-        
-        # Extract additional metadata
-        activity.update(self._extract_metadata(text))
-        
-        return activity
-    
-    def _create_activity_from_table_data(self, data, source_file):
-        """Create an activity from table data"""
-        activity = {
-            'source_file': str(source_file),
-            'created_at': datetime.now().isoformat()
-        }
-        
-        # Map table fields to DB fields
-        field_mapping = {
-            'nume': 'name', 'titlu': 'name', 'joc': 'name', 'activitate': 'name',
-            'descriere': 'description', 'detalii': 'description', 'explicatie': 'description',
-            'materiale': 'materials_list', 'echipament': 'materials_list',
-            'varsta': 'age_group_min', 'categoria': 'category',
-            'participanti': 'participants_min', 'numar': 'participants_min',
-            'durata': 'duration_min', 'timp': 'duration_min'
-        }
-        
-        for table_field, db_field in field_mapping.items():
-            if table_field in data:
-                activity[db_field] = data[table_field]
-        
-        # Minimal validation
-        if 'name' in activity and len(activity.get('name', '')) > 5:
-            return activity
-        
-        return None
-    
-    def _extract_metadata(self, text):
-        """Extract metadata from text using patterns"""
-        metadata = {}
-        
-        # Extract the age range
-        for pattern in self.activity_patterns['age_patterns']:
-            match = re.search(pattern, text)
-            if match:
-                metadata['age_group_min'] = int(match.group(1))
-                metadata['age_group_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
-                break
-        
-        # Extract the number of participants
-        for pattern in self.activity_patterns['participants_patterns']:
-            match = re.search(pattern, text)
-            if match:
-                metadata['participants_min'] = int(match.group(1))
-                metadata['participants_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
-                break
-        
-        # Extract the duration
-        for pattern in self.activity_patterns['duration_patterns']:
-            match = re.search(pattern, text)
-            if match:
-                metadata['duration_min'] = int(match.group(1))
-                metadata['duration_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
-                break
-        
-        # Extract the materials
-        materials = []
-        text_lower = text.lower()
-        for marker in self.activity_patterns['materials_markers']:
-            idx = text_lower.find(marker)
-            if idx != -1:
-                # Extract 200 characters starting at the marker
-                materials_text = text[idx:idx+200]
-                # Extract the items from the list
-                items = re.findall(r'[-"]\s*([^\n-"]+)', materials_text)
-                if items:
-                    materials.extend(items)
-        
-        if materials:
-            metadata['materials_list'] = ', '.join(materials[:10])  # At most 10 materials
-        
-        return metadata
-    
-    def _detect_category(self, text):
-        """Detect the activity's category based on keywords"""
-        text_lower = text.lower()
-        
-        for category, keywords in self.categories.items():
-            if any(keyword in text_lower for keyword in keywords):
-                return category
-        
-        return '[A]'  # Default: the games category
-    
-    def _extract_keywords(self, text):
-        """Extract keywords from text"""
-        keywords = []
-        text_lower = text.lower()
-        
-        # List of relevant keywords
-        keyword_list = [
-            'cooperare', 'competitie', 'echipa', 'creativitate', 'miscare',
-            'strategie', 'comunicare', 'incredere', 'coordonare', 'atentie',
-            'reflexe', 'logica', 'imaginatie', 'muzica', 'dans', 'sport',
-            'natura', 'mediu', 'stiinta', 'matematica', 'limba', 'cultura'
-        ]
-        
-        for keyword in keyword_list:
-            if keyword in text_lower:
-                keywords.append(keyword)
-        
-        return ', '.join(keywords[:5])  # At most 5 keywords
-    
-    def save_to_database(self, activities):
-        """Save the activities to the database"""
-        conn = sqlite3.connect(self.db_path)
-        cursor = conn.cursor()
-        
-        saved_count = 0
-        duplicate_count = 0
-        
-        for activity in activities:
-            try:
-                # Check for duplicates
-                cursor.execute(
-                    "SELECT id FROM activities WHERE name = ? AND source_file = ?",
-                    (activity.get('name'), activity.get('source_file'))
-                )
-                
-                if cursor.fetchone():
-                    duplicate_count += 1
-                    continue
-                
-                # Prepare the values for the insert
-                columns = []
-                values = []
-                placeholders = []
-                
-                for key, value in activity.items():
-                    if key != 'created_at':  # Skip created_at, it has default
-                        columns.append(key)
-                        values.append(value)
-                        placeholders.append('?')
-                
-                # Insert into the DB
-                query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
-                cursor.execute(query, values)
-                saved_count += 1
-                
-            except Exception as e:
-                print(f"Error saving activity: {e}")
-                continue
-        
-        conn.commit()
-        conn.close()
-        
-        return saved_count, duplicate_count
-    
-    def process_all_html_files(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
-        """Process all HTML files in the given directory"""
-        base_path = Path(base_path)
-        html_files = list(base_path.rglob("*.html"))
-        html_files.extend(list(base_path.rglob("*.htm")))
-        
-        print(f"Found {len(html_files)} HTML files to process")
-        
-        all_activities = []
-        processed = 0
-        errors = 0
-        
-        for i, html_file in enumerate(html_files):
-            try:
-                activities = self.extract_from_html(str(html_file))
-                all_activities.extend(activities)
-                processed += 1
-                
-                # Progress update
-                if (i + 1) % 100 == 0:
-                    print(f"Progress: {i+1}/{len(html_files)} files processed, {len(all_activities)} activities found")
-                    # Save batch to DB
-                    if all_activities:
-                        saved, dupes = self.save_to_database(all_activities)
-                        print(f"Batch saved: {saved} new activities, {dupes} duplicates skipped")
-                        all_activities = []  # Clear buffer
-                
-            except Exception as e:
-                print(f"Error processing {html_file}: {e}")
-                errors += 1
-        
-        # Save remaining activities
-        if all_activities:
-            saved, dupes = self.save_to_database(all_activities)
-            print(f"Final batch saved: {saved} new activities, {dupes} duplicates skipped")
-        
-        print(f"\nProcessing complete!")
-        print(f"Files processed: {processed}")
-        print(f"Errors: {errors}")
-        
-        return processed, errors
-
-# Main function for testing
-if __name__ == "__main__":
-    extractor = HTMLActivityExtractor()
-    
-    # Test on a sample file first
-    print("Testing on sample file first...")
-    # Find an HTML file to test on
-    test_files = list(Path('/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri').rglob("*.html"))[:3]
-    
-    for test_file in test_files:
-        print(f"\nTesting: {test_file}")
-        activities = extractor.extract_from_html(str(test_file))
-        print(f"Found {len(activities)} activities")
-        if activities:
-            print(f"Sample activity: {activities[0]['name'][:50]}...")
-    
-    # Ask whether to continue with the full processing
-    response = input("\nContinue with full processing? (y/n): ")
-    if response.lower() == 'y':
-        extractor.process_all_html_files()
--- a/scripts/import_claude_activities.py
+++ b/scripts/import_claude_activities.py
@@ -1,78 +0,0 @@
-#!/usr/bin/env python3
-"""
-Import activities extracted by Claude from JSON files
-"""
-
-import json
-import sqlite3
-from pathlib import Path
-from datetime import datetime
-
-class ClaudeActivityImporter:
-    def __init__(self, db_path='data/activities.db'):
-        self.db_path = db_path
-        self.json_dir = Path('scripts/extracted_activities')
-        self.json_dir.mkdir(exist_ok=True)
-    
-    def import_json_file(self, json_path):
-        """Import activities from a single JSON file"""
-        with open(json_path, 'r', encoding='utf-8') as f:
-            data = json.load(f)
-        
-        source_file = data.get('source_file', str(json_path))
-        activities = data.get('activities', [])
-        
-        conn = sqlite3.connect(self.db_path)
-        cursor = conn.cursor()
-        
-        imported = 0
-        for activity in activities:
-            try:
-                # Add source file and timestamp
-                activity['source_file'] = source_file
-                activity['created_at'] = datetime.now().isoformat()
-                
-                # Prepare insert
-                columns = list(activity.keys())
-                values = list(activity.values())
-                placeholders = ['?' for _ in values]
-                
-                # Check for duplicate
-                cursor.execute(
-                    "SELECT id FROM activities WHERE name = ? AND source_file = ?",
-                    (activity.get('name'), source_file)
-                )
-                
-                if not cursor.fetchone():
-                    query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
-                    cursor.execute(query, values)
-                    imported += 1
-                    
-            except Exception as e:
-                print(f"Error importing activity: {e}")
-        
-        conn.commit()
-        conn.close()
-        
-        print(f"Imported {imported} activities from {json_path.name}")
-        return imported
-    
-    def import_all_json_files(self):
-        """Import all JSON files from the extracted_activities directory"""
-        json_files = list(self.json_dir.glob("*.json"))
-        
-        if not json_files:
-            print("No JSON files found in extracted_activities directory")
-            return 0
-        
-        total_imported = 0
-        for json_file in json_files:
-            imported = self.import_json_file(json_file)
-            total_imported += imported
-        
-        print(f"\nTotal imported: {total_imported} activities from {len(json_files)} files")
-        return total_imported
-
-if __name__ == "__main__":
-    importer = ClaudeActivityImporter()
-    importer.import_all_json_files()
--- a/scripts/import_common.py
+++ b/scripts/import_common.py
@@ -0,0 +1,179 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+import_common.py — shared helpers for the import / validation side of the
+extraction pipeline (Lane C).
+
+Used by build_database.py and validate_extractions.py:
+  * JSON-schema validation of subagent extraction files,
+  * the anti-hallucination source_excerpt substring check (E5),
+  * locating the source chunk that an extraction file came from,
+  * the stable content key used by the needs_review queue.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import json
+import re
+import unicodedata
+from pathlib import Path
+from typing import Any, Optional
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+
+DEFAULT_SCHEMA_PATH = SCRIPT_DIR / "activity_schema.json"
+
+# rapidfuzz.partial_ratio is on a 0..100 scale; an excerpt counts as a real
+# quote from the source when it scores at least this against the chunk text.
+EXCERPT_MATCH_THRESHOLD = 90.0
+
+
+# --------------------------------------------------------------------------
+# schema validation
+# --------------------------------------------------------------------------
+def load_schema(schema_path: str | Path = DEFAULT_SCHEMA_PATH) -> dict:
+    """Load the activity JSON schema produced by Lane A."""
+    return json.loads(Path(schema_path).read_text(encoding="utf-8"))
+
+
+def validate_extraction(data: Any, schema: dict) -> list[str]:
+    """
+    Validate one parsed extraction file against `schema`.
+
+    Returns a list of human-readable error strings; empty list == valid.
+    """
+    import jsonschema
+
+    validator = jsonschema.Draft7Validator(schema)
+    errors: list[str] = []
+    for err in sorted(validator.iter_errors(data), key=lambda e: list(e.path)):
+        location = "/".join(str(p) for p in err.path) or "<root>"
+        errors.append(f"{location}: {err.message}")
+    return errors
+
+
+# --------------------------------------------------------------------------
+# excerpt verification (E5 — anti-hallucination)
+# --------------------------------------------------------------------------
+def _normalize_text(text: str) -> str:
+    return re.sub(r"\s+", " ", (text or "")).strip().lower()
+
+
+def excerpt_score(excerpt: str, chunk_text: str) -> float:
+    """Best fuzzy-substring score (0..100) of `excerpt` inside `chunk_text`."""
+    from rapidfuzz import fuzz
+
+    if not excerpt or not chunk_text:
+        return 0.0
+    return float(fuzz.partial_ratio(_normalize_text(excerpt), _normalize_text(chunk_text)))
+
+
+def excerpt_matches(
+    excerpt: str, chunk_text: str, threshold: float = EXCERPT_MATCH_THRESHOLD
+) -> bool:
+    """True when `excerpt` appears (fuzzily) as a substring of `chunk_text`."""
+    return excerpt_score(excerpt, chunk_text) >= threshold
+
+
+# --------------------------------------------------------------------------
+# locating the source chunk an extraction file came from
+# --------------------------------------------------------------------------
+def chunk_key_for(json_path: Path, header: Optional[dict]) -> str:
+    """
+    Resolve the chunk key for an extraction file.
+
+    Prefers the explicit `chunk_key` in the header, otherwise falls back to the
+    JSON file stem (extraction files are named `<chunk_key>.json`).
+    """
+    if header and header.get("chunk_key"):
+        return str(header["chunk_key"])
+    return json_path.stem
+
+
+def source_id_for(chunk_key: str, header: Optional[dict]) -> str:
+    """Resolve the source id; `<source_id>.partNN` → `<source_id>`."""
+    if header and header.get("source_id"):
+        return str(header["source_id"])
+    # chunk keys look like "<source_id>.partNN"
+    return chunk_key.rsplit(".part", 1)[0]
+
+
+def find_chunk_text(
+    json_path: Path, header: Optional[dict], chunks_dir: Path
+) -> Optional[str]:
+    """
+    Return the text of the source chunk for an extraction file, or None.
+
+    Looks for data/chunks/<source_id>/<chunk_key>.txt, then falls back to a
+    recursive glob on the chunk key.
+    """
+    chunk_key = chunk_key_for(json_path, header)
+    source_id = source_id_for(chunk_key, header)
+
+    candidate = chunks_dir / source_id / f"{chunk_key}.txt"
+    if candidate.is_file():
+        return candidate.read_text(encoding="utf-8", errors="replace")
+
+    matches = list(chunks_dir.rglob(f"{chunk_key}.txt"))
+    if matches:
+        return matches[0].read_text(encoding="utf-8", errors="replace")
+    return None
+
+
+def source_path_for(source_id: str, sources_dir: Path) -> Optional[str]:
+    """
+    Read the original `SOURCE:` path from a normalized source header.
+
+    data/sources/<source_id>.txt starts with a `SOURCE: <relative path>` line.
+    """
+    src_file = sources_dir / f"{source_id}.txt"
+    if not src_file.is_file():
+        return None
+    try:
+        with src_file.open(encoding="utf-8", errors="replace") as fh:
+            for line in fh:
+                if line.startswith("SOURCE:"):
+                    return line.split(":", 1)[1].strip()
+                if line.startswith(("=", "--- PAGE ")):
+                    break
+    except OSError:
+        return None
+    return None
+
+
+# --------------------------------------------------------------------------
+# stable content key for the needs_review queue (plan §5c)
+# --------------------------------------------------------------------------
+def normalize_name(name: str) -> str:
+    """Diacritic-free, lowercased, whitespace-collapsed name (dedup key)."""
+    if not name:
+        return ""
+    decomposed = unicodedata.normalize("NFKD", name)
+    ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
+    return re.sub(r"\s+", " ", ascii_str.lower().strip())
+
+
+def content_key(normalized_name: str, language: Optional[str], description: str) -> str:
+    """
+    Stable hash identifying a row for the review queue.
+
+    Only borderline-kept-separate rows and legacy `.doc` rows ever carry
+    needs_review, and neither is auto-merged — so their (normalized_name,
+    language, description) triple is stable across rebuilds.
+    """
+    payload = f"{normalized_name}\x1f{language or ''}\x1f{_normalize_text(description)}"
+    return hashlib.sha1(payload.encode("utf-8")).hexdigest()
+
+
+# --------------------------------------------------------------------------
+# iteration
+# --------------------------------------------------------------------------
+def iter_extraction_files(extracted_dir: Path):
+    """Yield every *.json directly under `extracted_dir` (non-recursive, so _rejected/ is skipped)."""
+    if not extracted_dir.is_dir():
+        return
+    for path in sorted(extracted_dir.glob("*.json")):
+        if path.is_file():
+            yield path
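
Reviewer note: the review-queue key above depends on two normalization steps, which are easier to judge with concrete input. The sketch below reproduces `normalize_name` and `content_key` from this diff, with `_normalize_text` renamed to `normalize_text`. The sample activity name and descriptions are invented for illustration and do not come from the corpus.

```python
import hashlib
import re
import unicodedata
from typing import Optional


def normalize_text(text: str) -> str:
    """Collapse whitespace and lowercase, as _normalize_text does in the diff."""
    return re.sub(r"\s+", " ", (text or "")).strip().lower()


def normalize_name(name: str) -> str:
    """Strip diacritics via NFKD decomposition, lowercase, collapse spaces."""
    if not name:
        return ""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", ascii_str.lower().strip())


def content_key(normalized_name: str, language: Optional[str], description: str) -> str:
    """SHA-1 over the three fields, joined with the ASCII unit separator."""
    payload = f"{normalized_name}\x1f{language or ''}\x1f{normalize_text(description)}"
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Invented sample values, for illustration only.
name = normalize_name("  Vânătoarea   de Comori ")
key_a = content_key(name, "ro", "Copiii caută  indicii.\n")
key_b = content_key(name, "ro", "copiii caută indicii.")
key_c = content_key(name, None, "copiii caută indicii.")
```

Here `name` becomes `vanatoarea de comori`, and `key_a` equals `key_b` because the descriptions differ only in case and whitespace. `key_c` differs because the language field is part of the payload. Note that the description keeps its diacritics, so a description re-extracted without them would produce a different key.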
--- a/scripts/normalize_sources.py
+++ b/scripts/normalize_sources.py
@@ -0,0 +1,255 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+normalize_sources.py — walk data/carti-camp-jocuri/ and write data/sources/<id>.txt.
+
+Output files keep the existing header format:
+
+    SOURCE: <original relative path>
+    CONVERTED: <iso date>
+    FORMAT: <pdf|docx|doc|pptx|html-mirror|zip>
+    NEEDS_REVIEW: <reason>            (optional — legacy .doc conversions)
+    ==================================================
+
+    --- PAGE 1 ---
+    ...
+
+Each source gets a stable id = <8-hex hash of relative path>_<sanitized stem>,
+so two files with the same name in different folders never collide.
+
+The pipeline is script-only: this script normalizes formats; it does NOT run extraction.
+Run it with `--check-deps` before a long job.
+"""
+
+from __future__ import annotations
+
+import argparse
+import datetime as _dt
+import hashlib
+import re
+import sys
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+if str(SCRIPT_DIR) not in sys.path:
+    sys.path.insert(0, str(SCRIPT_DIR))
+
+from extract_common import (  # noqa: E402
+    count_page_markers,
+    dedupe_texts,
+    detect_format,
+    extract_file,
+    extract_html,
+    is_junk,
+    join_pages,
+    preflight,
+    split_pages,
+)
+
+HEADER_RULE = "=" * 50
+
+
+# --------------------------------------------------------------------------
+# stable source id
+# --------------------------------------------------------------------------
+def sanitize_stem(stem: str) -> str:
+    s = re.sub(r"[^\w]+", "_", stem, flags=re.UNICODE).strip("_").lower()
+    return s[:60] or "source"
+
+
+def stable_id(relative_path: str | Path) -> str:
+    """Collision-resistant id derived from the path relative to the corpus root."""
+    rel = str(relative_path).replace("\\", "/")
+    digest = hashlib.sha1(rel.encode("utf-8")).hexdigest()[:8]
+    stem = sanitize_stem(Path(rel).stem)
+    return f"{digest}_{stem}"
+
+
+# --------------------------------------------------------------------------
+# header
+# --------------------------------------------------------------------------
+def build_header(
+    source_rel: str, fmt: str, needs_review: str | None = None
+) -> str:
+    today = _dt.date.today().isoformat()
+    lines = [
+        f"SOURCE: {source_rel}",
+        f"CONVERTED: {today}",
+        f"FORMAT: {fmt}",
+    ]
+    if needs_review:
+        lines.append(f"NEEDS_REVIEW: {needs_review}")
+    lines.append(HEADER_RULE)
+    return "\n".join(lines) + "\n\n"
+
+
+# --------------------------------------------------------------------------
+# mirror-site directories
+# --------------------------------------------------------------------------
+MIRROR_PAGE_EXTS = {".html", ".htm"}
+
+
+def is_mirror_dir(path: Path) -> bool:
+    """A directory counts as a site mirror if it contains HTML pages."""
+    if not path.is_dir():
+        return False
+    if path.name.endswith("_files"):
+        return False
+    return any(
+        p.suffix.lower() in MIRROR_PAGE_EXTS
+        for p in path.rglob("*")
+        if p.is_file()
+    )
+
+
+def normalize_mirror(mirror_dir: Path) -> str:
+    """Extract every HTML page in a mirror dir, dedupe near-duplicates, join."""
+    pages: list[tuple[str, str]] = []
+    for html in sorted(mirror_dir.rglob("*")):
+        if not html.is_file() or html.suffix.lower() not in MIRROR_PAGE_EXTS:
+            continue
+        if any(part.endswith("_files") for part in html.relative_to(mirror_dir).parts):
+            continue
+        try:
+            body = extract_html(html)
+        except Exception:
+            continue
+        text = "\n".join(t for _, t in split_pages(body))
+        if text.strip():
+            pages.append((str(html.relative_to(mirror_dir)), text))
+    pages = dedupe_texts(pages)
+    return join_pages([t for _, t in pages])
+
+
+# --------------------------------------------------------------------------
+# one source
+# --------------------------------------------------------------------------
+def normalize_one(
+    path: Path, corpus_root: Path, out_dir: Path
+) -> dict | None:
+    """
+    Normalize a single file or mirror directory → data/sources/<id>.txt.
+
+    Returns a result dict, or None if the entry was skipped (junk / ignored).
+    """
+    rel = path.relative_to(corpus_root)
+    sid = stable_id(rel)
+
+    if path.is_dir():
+        if not is_mirror_dir(path):
+            return None
+        fmt, needs_review = "html-mirror", None
+        body = normalize_mirror(path)
+    else:
+        if is_junk(path):
+            return None
+        fmt = detect_format(path)
+        if fmt in ("unknown", "epub", "txt"):
+            return None  # epub duplicates PDFs; txt is not a source format here
+        needs_review = "legacy .doc conversion is imperfect" if fmt == "doc" else None
+        try:
+            body = extract_file(path)
+        except Exception as exc:  # noqa: BLE001
+            return {"id": sid, "source": str(rel), "status": "error", "error": str(exc)}
+
+    if not body.strip():
+        return {"id": sid, "source": str(rel), "status": "empty"}
+
+    out_path = out_dir / f"{sid}.txt"
+    out_path.write_text(build_header(str(rel), fmt, needs_review) + body,
+                        encoding="utf-8")
+    return {
+        "id": sid,
+        "source": str(rel),
+        "status": "ok",
+        "format": fmt,
+        "pages": count_page_markers(body),
+        "needs_review": bool(needs_review),
+    }
+
+
+# --------------------------------------------------------------------------
+# walk
+# --------------------------------------------------------------------------
+def iter_corpus_entries(corpus_root: Path):
+    """Yield top-level files and mirror directories under the corpus root."""
+    for entry in sorted(corpus_root.iterdir()):
+        if entry.name.startswith("."):
+            continue
+        if entry.is_dir():
+            if is_mirror_dir(entry):
+                yield entry
+        else:
+            yield entry
+
+
+def run(corpus_root: Path, out_dir: Path) -> dict:
+    out_dir.mkdir(parents=True, exist_ok=True)
+    results: list[dict] = []
+    for entry in iter_corpus_entries(corpus_root):
+        res = normalize_one(entry, corpus_root, out_dir)
+        if res is not None:
+            results.append(res)
+    summary = {
+        "total": len(results),
+        "ok": sum(1 for r in results if r["status"] == "ok"),
+        "errors": sum(1 for r in results if r["status"] == "error"),
+        "empty": sum(1 for r in results if r["status"] == "empty"),
+        "needs_review": sum(1 for r in results if r.get("needs_review")),
+        "results": results,
+    }
+    return summary
+
+
+# --------------------------------------------------------------------------
+# CLI
+# --------------------------------------------------------------------------
+def print_preflight(report: dict) -> int:
+    print("Dependency preflight")
+    print("--------------------")
+    if report["missing_python"]:
+        print("  MISSING Python packages: " + ", ".join(report["missing_python"]))
+    else:
+        print("  Python packages: OK")
+    if report["missing_system"]:
+        print("  MISSING system tools  : " + ", ".join(report["missing_system"]))
+    for w in report["warnings"]:
+        print(f"  WARNING: {w}")
+    print("  => " + ("READY" if report["ok"] else "NOT READY — install the above"))
+    return 0 if report["ok"] else 1
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(description="Normalize mixed sources to .txt")
+    parser.add_argument("--corpus", default="data/carti-camp-jocuri",
+                        help="corpus root to walk")
+    parser.add_argument("--out", default="data/sources", help="output directory")
+    parser.add_argument("--check-deps", action="store_true",
+                        help="run dependency preflight and exit")
+    parser.add_argument("--ocr", action="store_true",
+                        help="include OCR (tesseract) in the preflight check")
+    args = parser.parse_args(argv)
+
+    if args.check_deps:
+        return print_preflight(preflight(check_ocr=args.ocr))
+
+    report = preflight(check_ocr=args.ocr)
+    if report["missing_python"]:
+        print_preflight(report)
+        return 1
+    for w in report["warnings"]:
+        print(f"WARNING: {w}")
+
+    summary = run(Path(args.corpus), Path(args.out))
+    print(f"normalized : {summary['ok']}/{summary['total']}")
+    print(f"errors     : {summary['errors']}")
+    print(f"empty      : {summary['empty']}")
+    print(f"needs_review: {summary['needs_review']}")
+    for r in summary["results"]:
+        if r["status"] != "ok":
+            print(f"  [{r['status']}] {r['source']}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
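
The summary that `run` builds from the per-entry result dicts can be reproduced in isolation. The sketch below is a minimal, self-contained illustration that assumes only the result shape returned by `normalize_one` (a `status` of `ok`, `error`, or `empty`, plus an optional `needs_review` flag). The helper name `summarize` and the sample file names are hypothetical and not part of the repository.

```python
from collections import Counter


def summarize(results: list[dict]) -> dict:
    """Tally result dicts by status, mirroring the counters built in run()."""
    by_status = Counter(r["status"] for r in results)
    return {
        "total": len(results),
        "ok": by_status["ok"],
        "errors": by_status["error"],
        "empty": by_status["empty"],
        "needs_review": sum(1 for r in results if r.get("needs_review")),
        "results": results,
    }


# Hypothetical sample data, for illustration only.
sample = [
    {"id": "a1", "source": "games.pdf", "status": "ok", "needs_review": False},
    {"id": "b2", "source": "old.doc", "status": "ok", "needs_review": True},
    {"id": "c3", "source": "scan.pdf", "status": "empty"},
    {"id": "d4", "source": "broken.docx", "status": "error", "error": "bad zip"},
]
summary = summarize(sample)
print(f"normalized : {summary['ok']}/{summary['total']}")
```

A `Counter` returns 0 for a status that never occurs, so the sketch needs no special case for a run with no errors or no empty sources.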
--- a/scripts/pdf_extractor.py
+++ b/scripts/pdf_extractor.py
--- a/scripts/pdf_to_text_converter.py
+++ b/scripts/pdf_to_text_converter.py
@@ -1,143 +0,0 @@
-#!/usr/bin/env python3
-"""
-PDF Mass Conversion to Text for Activity Extraction
-Handles all PDF sizes efficiently with multiple fallback methods
-"""
-
-import os
-import json
-from pathlib import Path
-import PyPDF2
-import pdfplumber
-from typing import List, Dict
-import logging
-
-class PDFConverter:
-    def __init__(self, max_pages=50):
-        self.max_pages = max_pages
-        self.conversion_stats = {}
-    
-    def convert_pdf_to_text(self, pdf_path: str) -> str:
-        """Convert PDF to text using multiple methods with fallbacks"""
-        try:
-            # Method 1: pdfplumber (best for tables and layout)
-            return self._convert_with_pdfplumber(pdf_path)
-        except Exception as e:
-            print(f"pdfplumber failed for {pdf_path}: {e}")
-            
-            try:
-                # Method 2: PyPDF2 (fallback)
-                return self._convert_with_pypdf2(pdf_path)
-            except Exception as e2:
-                print(f"PyPDF2 also failed for {pdf_path}: {e2}")
-                return ""
-    
-    def _convert_with_pdfplumber(self, pdf_path: str) -> str:
-        """Primary conversion method using pdfplumber"""
-        text_content = ""
-        
-        with pdfplumber.open(pdf_path) as pdf:
-            total_pages = len(pdf.pages)
-            pages_to_process = min(total_pages, self.max_pages)
-            
-            print(f"  Converting {pdf_path}: {pages_to_process}/{total_pages} pages")
-            
-            for i, page in enumerate(pdf.pages[:pages_to_process]):
-                try:
-                    page_text = page.extract_text()
-                    if page_text:
-                        text_content += f"\n--- PAGE {i+1} ---\n"
-                        text_content += page_text
-                        text_content += "\n"
-                except Exception as e:
-                    print(f"    Error on page {i+1}: {e}")
-                    continue
-        
-        self.conversion_stats[pdf_path] = {
-            'method': 'pdfplumber',
-            'pages_processed': pages_to_process,
-            'total_pages': total_pages,
-            'success': True,
-            'text_length': len(text_content)
-        }
-        
-        return text_content
-    
-    def _convert_with_pypdf2(self, pdf_path: str) -> str:
-        """Fallback conversion method using PyPDF2"""
-        text_content = ""
-        
-        with open(pdf_path, 'rb') as file:
-            reader = PyPDF2.PdfReader(file)
-            total_pages = len(reader.pages)
-            pages_to_process = min(total_pages, self.max_pages)
-            
-            print(f"  Converting {pdf_path} (fallback): {pages_to_process}/{total_pages} pages")
-            
-            for i in range(pages_to_process):
-                try:
-                    page = reader.pages[i]
-                    page_text = page.extract_text()
-                    if page_text:
-                        text_content += f"\n--- PAGE {i+1} ---\n"
-                        text_content += page_text
-                        text_content += "\n"
-                except Exception as e:
-                    print(f"    Error on page {i+1}: {e}")
-                    continue
-        
-        self.conversion_stats[pdf_path] = {
-            'method': 'PyPDF2',
-            'pages_processed': pages_to_process,
-            'total_pages': total_pages,
-            'success': True,
-            'text_length': len(text_content)
-        }
-        
-        return text_content
-    
-    def convert_all_pdfs(self, pdf_directory: str, output_directory: str):
-        """Convert all PDFs in directory to text files"""
-        pdf_files = list(Path(pdf_directory).glob("**/*.pdf"))
-        
-        print(f"🔄 Converting {len(pdf_files)} PDF files to text...")
-        
-        os.makedirs(output_directory, exist_ok=True)
-        
-        for i, pdf_path in enumerate(pdf_files):
-            print(f"\n[{i+1}/{len(pdf_files)}] Processing {pdf_path.name}...")
-            
-            # Convert to text
-            text_content = self.convert_pdf_to_text(str(pdf_path))
-            
-            if text_content.strip():
-                # Save as text file
-                output_file = Path(output_directory) / f"{pdf_path.stem}.txt"
-                with open(output_file, 'w', encoding='utf-8') as f:
-                    f.write(f"SOURCE: {pdf_path}\n")
-                    f.write(f"CONVERTED: 2025-01-11\n")
-                    f.write("="*50 + "\n\n")
-                    f.write(text_content)
-                
-                print(f"  ✅ Saved: {output_file}")
-            else:
-                print(f"  ❌ No text extracted from {pdf_path.name}")
-        
-        # Save conversion statistics
-        stats_file = Path(output_directory) / "conversion_stats.json"
-        with open(stats_file, 'w', encoding='utf-8') as f:
-            json.dump(self.conversion_stats, f, indent=2, ensure_ascii=False)
-        
-        print(f"\n🎉 PDF conversion complete! Check {output_directory}")
-        return len([f for f in self.conversion_stats.values() if f['success']])
-
-# Usage
-if __name__ == "__main__":
-    converter = PDFConverter(max_pages=50)
-    
-    # Convert all PDFs
-    pdf_dir = "/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri"
-    output_dir = "/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri/INDEX-SISTEM-JOCURI/converted_pdfs"
-    
-    converted_count = converter.convert_all_pdfs(pdf_dir, output_dir)
-    print(f"Final result: {converted_count} PDFs successfully converted")
--- a/scripts/review_queue.py
+++ b/scripts/review_queue.py
@@ -0,0 +1,145 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+review_queue.py — CLI for the needs_review lifecycle (plan §5c).
+
+Rows land in the queue when dedup leaves a borderline pair separate, or when a
+legacy `.doc` source was converted imperfectly. Each row has a stable content
+key; a decision written here is stored in data/review_decisions.json
+(git-tracked) and re-applied by build_database.py on every rebuild, so the queue
+never resurfaces a resolved row.
+
+Commands:
+    python scripts/review_queue.py list
+    python scripts/review_queue.py resolve <id> <merge|keep-separate|drop>
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sqlite3
+import sys
+from pathlib import Path
+from typing import Optional
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
+    if _p not in sys.path:
+        sys.path.insert(0, _p)
+
+from import_common import content_key, normalize_name  # noqa: E402
+
+VALID_DECISIONS = ("merge", "keep-separate", "drop")
+
+
+# --------------------------------------------------------------------------
+# review_decisions.json
+# --------------------------------------------------------------------------
+def load_decisions(path: Path) -> dict:
+    if path.is_file():
+        try:
+            data = json.loads(path.read_text(encoding="utf-8"))
+            if isinstance(data, dict):
+                return data
+        except (json.JSONDecodeError, OSError):
+            pass  # unreadable or corrupt file: fall back to an empty decision set
+    return {}
+
+
+def save_decisions(decisions: dict, path: Path) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(
+        json.dumps(decisions, indent=2, ensure_ascii=False, sort_keys=True),
+        encoding="utf-8",
+    )
+
+
+# --------------------------------------------------------------------------
+# queue
+# --------------------------------------------------------------------------
+def list_queue(db_path: Path) -> list[dict]:
+    """Return every needs_review row in the current DB, with its content key."""
+    if not db_path.is_file():
+        return []
+    conn = sqlite3.connect(db_path)
+    conn.row_factory = sqlite3.Row
+    try:
+        rows = conn.execute(
+            "SELECT name, normalized_name, language, description "
+            "FROM activities WHERE needs_review = 1 ORDER BY normalized_name"
+        ).fetchall()
+    except sqlite3.OperationalError:
+        return []
+    finally:
+        conn.close()
+
+    out = []
+    for row in rows:
+        norm = row["normalized_name"] or normalize_name(row["name"])
+        key = content_key(norm, row["language"], row["description"] or "")
+        out.append({
+            "id": key,
+            "name": row["name"],
+            "language": row["language"],
+            "description": row["description"] or "",
+        })
+    return out
+
+
+def resolve(decisions_path: Path, content_id: str, decision: str) -> dict:
+    """Record a decision for a content key in review_decisions.json."""
+    if decision not in VALID_DECISIONS:
+        raise ValueError(
+            f"invalid decision {decision!r}; expected one of {VALID_DECISIONS}"
+        )
+    decisions = load_decisions(decisions_path)
+    decisions[content_id] = {"decision": decision}
+    save_decisions(decisions, decisions_path)
+    return decisions
+
+
+# --------------------------------------------------------------------------
+# CLI
+# --------------------------------------------------------------------------
+def main(argv: Optional[list[str]] = None) -> int:
+    parser = argparse.ArgumentParser(description="needs_review queue CLI")
+    parser.add_argument("--db", default="data/activities.db")
+    parser.add_argument("--decisions", default="data/review_decisions.json")
+    sub = parser.add_subparsers(dest="command", required=True)
+
+    sub.add_parser("list", help="list rows currently flagged needs_review")
+
+    p_resolve = sub.add_parser("resolve", help="record a decision for a row")
+    p_resolve.add_argument("id", help="content id from `list`")
+    p_resolve.add_argument("decision", choices=VALID_DECISIONS)
+
+    args = parser.parse_args(argv)
+
+    if args.command == "list":
+        rows = list_queue(Path(args.db))
+        if not rows:
+            print("review queue is empty.")
+            return 0
+        print(f"{len(rows)} row(s) need review:\n")
+        for r in rows:
+            desc = r["description"][:80].replace("\n", " ")
+            print(f"  id   : {r['id']}")
+            print(f"  name : {r['name']}  [{r['language']}]")
+            print(f"  desc : {desc}")
+            print(f"  -> review_queue.py resolve {r['id']} <merge|keep-separate|drop>")
+            print()
+        return 0
+
+    if args.command == "resolve":
+        resolve(Path(args.decisions), args.id, args.decision)
+        print(f"recorded: {args.id} -> {args.decision}")
+        print(f"written to {args.decisions} (applied on next build_database --rebuild)")
+        return 0
+
+    return 1
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
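
The module docstring says that recorded decisions keep resolved rows out of the queue on later rebuilds. How `build_database.py` applies them is not shown in this part of the diff, so the sketch below only illustrates the idea, assuming decisions are keyed by the same content id that `list` prints. `unresolved` is a hypothetical helper, and the rows and ids are made-up sample data.

```python
import json
import tempfile
from pathlib import Path


def unresolved(rows: list[dict], decisions: dict) -> list[dict]:
    """Return queue rows whose content id has no recorded decision."""
    return [row for row in rows if row["id"] not in decisions]


# Hypothetical queue rows, for illustration only.
rows = [
    {"id": "key-1", "name": "Capture the Flag", "language": "en"},
    {"id": "key-2", "name": "Steal the Flag", "language": "en"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "review_decisions.json"
    # Same on-disk shape that resolve() writes: {content_id: {"decision": ...}}
    path.write_text(
        json.dumps({"key-1": {"decision": "keep-separate"}}),
        encoding="utf-8",
    )
    decisions = json.loads(path.read_text(encoding="utf-8"))

remaining = unresolved(rows, decisions)
print([row["id"] for row in remaining])
```

Because the key is derived from content rather than from a database row id, a decision of this kind would survive a full rebuild as long as the name, language, and description stay unchanged.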
--- a/scripts/run_extraction.py
+++ b/scripts/run_extraction.py
@@ -1,50 +1,140 @@
 #!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 """
-Main extraction orchestrator
-Runs the entire extraction process
+run_extraction.py — extraction orchestrator (plan §3).
+
+The pipeline is script-only up to the LLM step: this script normalizes the
+corpus, chunks the normalized sources, and emits one subagent prompt per
+`pending` chunk. It does NOT run the extraction itself — that step is the
+interactive Claude Code orchestrator launching waves of subagents.
+
+Steps:
+  1. normalize  data/carti-camp-jocuri/ -> data/sources/*.txt
+  2. chunk      data/sources/*.txt      -> data/chunks/<id>/*.txt + manifest.json
+  3. emit       one prompt per `pending` chunk -> data/chunks/_prompts/*.md
+  4. report     how many chunks remain `pending`
+
+Usage:
+    python scripts/run_extraction.py
+    python scripts/run_extraction.py --skip-normalize   # re-chunk only
 """

+from __future__ import annotations
+
+import argparse
 import sys
-import time
 from pathlib import Path
+from typing import Optional

-from unified_processor import UnifiedProcessor
-from import_claude_activities import ClaudeActivityImporter
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
+    if _p not in sys.path:
+        sys.path.insert(0, _p)
+
+import chunk_sources  # noqa: E402
+import normalize_sources  # noqa: E402
+
+SUBAGENT_PROMPT = SCRIPT_DIR / "SUBAGENT_PROMPT.md"
+
+
+def emit_chunk_prompt(chunk_key: str, meta: dict, prompts_dir: Path) -> Path:
+    """Write the subagent prompt for one pending chunk."""
+    chunk_file = meta.get("chunk_file", f"data/chunks/<id>/{chunk_key}.txt")
+    expected_json = meta.get("expected_json", f"{chunk_key}.json")
+    text = "\n".join([
+        f"# EXTRACTION — chunk `{chunk_key}`",
+        "",
+        f"Read ONLY this chunk: `{chunk_file}`",
+        f"Chunk range: {meta.get('chunk_range', '?')}",
+        "",
+        f"Follow the rules in `{SUBAGENT_PROMPT.relative_to(REPO_ROOT)}`.",
+        "Identify every distinct activity, fill the schema "
+        "(`scripts/activity_schema.json`), and write the result to:",
+        "",
+        f"    data/extracted/{expected_json}",
+        "",
+        "Header fields to set: "
+        f'source_id="{meta.get("source_id", "")}", chunk_key="{chunk_key}", '
+        f'source_hash="{meta.get("source_hash", "")}".',
+        "",
+    ])
+    prompts_dir.mkdir(parents=True, exist_ok=True)
+    out = prompts_dir / f"{chunk_key}.prompt.md"
+    out.write_text(text, encoding="utf-8")
+    return out
+
+
+def run(
+    *,
+    corpus_root: Path,
+    sources_dir: Path,
+    chunks_dir: Path,
+    skip_normalize: bool = False,
+) -> dict:
+    summary: dict = {}
+
+    if not skip_normalize:
+        norm = normalize_sources.run(corpus_root, sources_dir)
+        summary["normalized"] = {"ok": norm["ok"], "total": norm["total"],
+                                 "errors": norm["errors"]}
+
+    chunk_summary = chunk_sources.run(sources_dir, chunks_dir)
+    summary["chunks"] = chunk_summary
+
+    manifest_path = chunks_dir / "manifest.json"
+    manifest = chunk_sources.load_manifest(manifest_path)
+    prompts_dir = chunks_dir / "_prompts"
+
+    pending = {k: m for k, m in manifest["chunks"].items()
+               if m.get("state") == "pending"}
+    for key, meta in sorted(pending.items()):
+        emit_chunk_prompt(key, meta, prompts_dir)
+
+    states: dict[str, int] = {}
+    for m in manifest["chunks"].values():
+        states[m.get("state", "?")] = states.get(m.get("state", "?"), 0) + 1
+    summary["states"] = states
+    summary["pending"] = len(pending)
+    summary["prompts_dir"] = str(prompts_dir)
+    return summary
+
+
+def main(argv: Optional[list[str]] = None) -> int:
+    parser = argparse.ArgumentParser(description="Extraction orchestrator.")
+    parser.add_argument("--corpus", default="data/carti-camp-jocuri")
+    parser.add_argument("--sources", default="data/sources")
+    parser.add_argument("--chunks", default="data/chunks")
+    parser.add_argument("--skip-normalize", action="store_true",
+                        help="skip normalization, re-chunk existing sources only")
+    args = parser.parse_args(argv)
+
+    summary = run(
+        corpus_root=Path(args.corpus),
+        sources_dir=Path(args.sources),
+        chunks_dir=Path(args.chunks),
+        skip_normalize=args.skip_normalize,
+    )
+
+    print("=" * 60)
+    print("EXTRACTION ORCHESTRATOR")
+    print("=" * 60)
+    if "normalized" in summary:
+        n = summary["normalized"]
+        print(f"normalized : {n['ok']}/{n['total']} (errors {n['errors']})")
+    print(f"chunks     : {summary['chunks']['chunks']}")
+    for state, count in sorted(summary["states"].items()):
+        print(f"  {state:<10}: {count}")
+    print(f"\npending chunks remaining : {summary['pending']}")
+    if summary["pending"]:
+        print(f"subagent prompts written : {summary['prompts_dir']}/")
+        print("Launch waves of ~5-10 subagents on those prompts, then run "
+              "validate_extractions.py and build_database.py --rebuild.")
+    else:
+        print("All chunks extracted — run build_database.py --rebuild.")
+    print("=" * 60)
+    return 0

-def main():
-    print("="*60)
-    print("ACTIVITY EXTRACTION SYSTEM")
-    print("Strategy S8: Hybrid Claude + Scripts")
-    print("="*60)
-    
-    # Step 1: Run automated extraction
-    print("\nSTEP 1: Automated Extraction")
-    print("-"*40)
-    processor = UnifiedProcessor()
-    processor.process_automated_formats()
-    
-    # Step 2: Wait for Claude processing
-    print("\n" + "="*60)
-    print("STEP 2: Manual Claude Processing Required")
-    print("-"*40)
-    print("Please process PDF/DOC files with Claude using the template.")
-    print("Files are listed in: pdf_doc_for_claude.txt")
-    print("Save extracted activities as JSON in: scripts/extracted_activities/")
-    print("="*60)
-    
-    response = input("\nHave you completed Claude processing? (y/n): ")
-    
-    if response.lower() == 'y':
-        # Step 3: Import Claude-extracted activities
-        print("\nSTEP 3: Importing Claude-extracted activities")
-        print("-"*40)
-        importer = ClaudeActivityImporter()
-        importer.import_all_json_files()
-    
-    print("\n" + "="*60)
-    print("EXTRACTION COMPLETE!")
-    print("="*60)

 if __name__ == "__main__":
-    main()
+    raise SystemExit(main())
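
The pending filter and per-state tally in the new `run` depend only on the manifest layout, assumed here to be `{"chunks": {chunk_key: {"state": ...}}}` as the code reads it. This is a minimal sketch with a hypothetical in-memory manifest. The chunk keys and the `done` state name are made up for illustration, since only `pending` appears in this diff.

```python
from collections import Counter

# Hypothetical manifest, for illustration only.
manifest = {
    "chunks": {
        "src1_001": {"state": "pending"},
        "src1_002": {"state": "done"},
        "src2_001": {"state": "pending"},
        "src2_002": {},  # no state recorded: counted under "?"
    }
}

# Chunks that still need a subagent prompt.
pending = {
    key: meta
    for key, meta in manifest["chunks"].items()
    if meta.get("state") == "pending"
}

# One pass over the manifest gives the per-state counts.
states = Counter(meta.get("state", "?") for meta in manifest["chunks"].values())

for state, count in sorted(states.items()):
    print(f"  {state:<10}: {count}")
print(f"pending chunks remaining : {len(pending)}")
```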
--- a/scripts/text_extractor.py
+++ b/scripts/text_extractor.py
@@ -1,197 +0,0 @@
-#!/usr/bin/env python3
-# -*- coding: utf-8 -*-
-"""
-Text/Markdown Activity Extractor
-Processes TXT and MD files for activity extraction
-"""
-
-import re
-from pathlib import Path
-from typing import List, Dict
-import sqlite3
-from datetime import datetime
-
-class TextActivityExtractor:
-    def __init__(self, db_path='data/activities.db'):
-        self.db_path = db_path
-        self.activity_patterns = {
-            'section_headers': [
-                r'^#{1,6}\s*(.+)$',  # Markdown headers
-                r'^([A-Z][^\.]{10,100})$',  # Plain titles
-                r'^\d+\.\s*(.+)$',  # Numbered lists
-                r'^[•\-\*]\s*(.+)$',  # Bullet points
-            ],
-            'activity_markers': [
-                'joc:', 'activitate:', 'exercitiu:', 'team building:',
-                'nume:', 'titlu:', 'denumire:'
-            ]
-        }
-    
-    def extract_from_text(self, file_path: str) -> List[Dict]:
-        """Extract activities from a text/markdown file"""
-        activities = []
-        
-        try:
-            with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
-                content = f.read()
-            
-            # Method 1: look for markdown sections
-            if file_path.endswith('.md'):
-                activities.extend(self._extract_from_markdown(content, file_path))
-            
-            # Method 2: look for general patterns
-            activities.extend(self._extract_from_patterns(content, file_path))
-            
-            # Method 3: look for structured text blocks
-            activities.extend(self._extract_from_blocks(content, file_path))
-            
-        except Exception as e:
-            print(f"Error processing {file_path}: {e}")
-        
-        return activities
-    
-    def _extract_from_markdown(self, content, source_file):
-        """Extract activities from markdown format"""
-        activities = []
-        lines = content.split('\n')
-        
-        current_activity = None
-        current_content = []
-        
-        for line in lines:
-            # Check whether this line is an activity header
-            if re.match(r'^#{1,3}\s*(.+)', line):
-                # Save the previous activity, if any
-                if current_activity and current_content:
-                    current_activity['description'] = '\n'.join(current_content[:20])  # Max 20 lines
-                    activities.append(current_activity)
-                
-                # Check whether the new header is an activity
-                header_text = re.sub(r'^#{1,3}\s*', '', line)
-                if any(marker in header_text.lower() for marker in ['joc', 'activitate', 'exercitiu']):
-                    current_activity = {
-                        'name': header_text[:200],
-                        'source_file': str(source_file),
-                        'category': '[A]'
-                    }
-                    current_content = []
-                else:
-                    current_activity = None
-            
-            elif current_activity:
-                # Append content to the current activity
-                if line.strip():
-                    current_content.append(line)
-        
-        # Save the last activity
-        if current_activity and current_content:
-            current_activity['description'] = '\n'.join(current_content[:20])
-            activities.append(current_activity)
-        
-        return activities
-    
-    def _extract_from_patterns(self, content, source_file):
-        """Extract using pattern matching"""
-        activities = []
-        
-        # Look for activity-specific markers
-        for marker in self.activity_patterns['activity_markers']:
-            pattern = re.compile(f'{re.escape(marker)}\\s*(.+?)(?=\\n\\n|{re.escape(marker)}|$)', 
-                               re.IGNORECASE | re.DOTALL)
-            matches = pattern.finditer(content)
-            
-            for match in matches:
-                activity_text = match.group(1)
-                if len(activity_text) > 20:
-                    activity = {
-                        'name': activity_text.split('\n')[0][:200],
-                        'description': activity_text[:1000],
-                        'source_file': str(source_file),
-                        'category': '[A]'
-                    }
-                    activities.append(activity)
-        
-        return activities
-    
-    def _extract_from_blocks(self, content, source_file):
-        """Extract from separate text blocks"""
-        activities = []
-        
-        # Split into blocks separated by blank lines
-        blocks = re.split(r'\n\s*\n', content)
-        
-        for block in blocks:
-            if len(block) > 50:  # Minimum 50 characters
-                lines = block.strip().split('\n')
-                first_line = lines[0].strip()
-                
-                # Check whether the block looks like an activity
-                if any(keyword in first_line.lower() for keyword in ['joc', 'activitate', 'exercitiu']):
-                    activity = {
-                        'name': first_line[:200],
-                        'description': block[:1000],
-                        'source_file': str(source_file),
-                        'category': '[A]'
-                    }
-                    activities.append(activity)
-        
-        return activities
-    
-    def save_to_database(self, activities):
-        """Save to the database"""
-        conn = sqlite3.connect(self.db_path)
-        cursor = conn.cursor()
-        
-        saved_count = 0
-        
-        for activity in activities:
-            try:
-                # Check for duplicates
-                cursor.execute(
-                    "SELECT id FROM activities WHERE name = ? AND source_file = ?",
-                    (activity.get('name'), activity.get('source_file'))
-                )
-                
-                if not cursor.fetchone():
-                    columns = list(activity.keys())
-                    values = list(activity.values())
-                    placeholders = ['?' for _ in values]
-                    
-                    query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
-                    cursor.execute(query, values)
-                    saved_count += 1
-                    
-            except Exception as e:
-                print(f"Error saving: {e}")
-        
-        conn.commit()
-        conn.close()
-        
-        return saved_count
-    
-    def process_all_text_files(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
-        """Process all text and markdown files"""
-        base_path = Path(base_path)
-        
-        text_files = list(base_path.rglob("*.txt"))
-        md_files = list(base_path.rglob("*.md"))
-        all_files = text_files + md_files
-        
-        print(f"Found {len(all_files)} text/markdown files")
-        
-        all_activities = []
-        
-        for file_path in all_files:
-            activities = self.extract_from_text(str(file_path))
-            all_activities.extend(activities)
-            print(f"Processed {file_path.name}: {len(activities)} activities")
-        
-        # Save to database
-        saved = self.save_to_database(all_activities)
-        print(f"\nTotal saved: {saved} activities from {len(all_files)} files")
-        
-        return len(all_files), saved
-
-if __name__ == "__main__":
-    extractor = TextActivityExtractor()
-    extractor.process_all_text_files()
--- a/scripts/unified_processor.py
+++ b/scripts/unified_processor.py
@@ -1,151 +0,0 @@
-#!/usr/bin/env python3
-"""
-Unified Activity Processor
-Orchestrates all extractors for complete processing
-"""
-
-import time
-from pathlib import Path
-from html_extractor import HTMLActivityExtractor
-from text_extractor import TextActivityExtractor
-import sqlite3
-
-class UnifiedProcessor:
-    def __init__(self, db_path='data/activities.db'):
-        self.db_path = db_path
-        self.html_extractor = HTMLActivityExtractor(db_path)
-        self.text_extractor = TextActivityExtractor(db_path)
-        self.stats = {
-            'html_processed': 0,
-            'text_processed': 0,
-            'pdf_to_process': 0,
-            'doc_to_process': 0,
-            'total_activities': 0,
-            'start_time': None,
-            'end_time': None
-        }
-    
-    def get_current_activity_count(self):
-        """Get the current number of activities in the DB"""
-        conn = sqlite3.connect(self.db_path)
-        cursor = conn.cursor()
-        cursor.execute("SELECT COUNT(*) FROM activities")
-        count = cursor.fetchone()[0]
-        conn.close()
-        return count
-    
-    def count_files_to_process(self, base_path):
-        """Count the files that need to be processed"""
-        base_path = Path(base_path)
-        
-        counts = {
-            'html': len(list(base_path.rglob("*.html"))) + len(list(base_path.rglob("*.htm"))),
-            'txt': len(list(base_path.rglob("*.txt"))),
-            'md': len(list(base_path.rglob("*.md"))),
-            'pdf': len(list(base_path.rglob("*.pdf"))),
-            'doc': len(list(base_path.rglob("*.doc"))),
-            'docx': len(list(base_path.rglob("*.docx")))
-        }
-        
-        return counts
-    
-    def process_automated_formats(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
-        """Process all formats that can be automated"""
-        print("="*60)
-        print("UNIFIED ACTIVITY PROCESSOR - AUTOMATED PHASE")
-        print("="*60)
-        
-        self.stats['start_time'] = time.time()
-        initial_count = self.get_current_activity_count()
-        
-        # Show initial statistics
-        file_counts = self.count_files_to_process(base_path)
-        print(f"\nFiles to process:")
-        for format, count in file_counts.items():
-            print(f"  {format.upper()}: {count} files")
-        print(f"\nCurrent activities in database: {initial_count}")
-        print("-"*60)
-        
-        # PHASE 1: HTML processing (top priority, high volume)
-        print("\n[1/2] Processing HTML files...")
-        print("-"*40)
-        html_processed, html_errors = self.html_extractor.process_all_html_files(base_path)
-        self.stats['html_processed'] = html_processed
-        
-        # PHASE 2: Text/MD processing
-        print("\n[2/2] Processing Text/Markdown files...")
-        print("-"*40)
-        text_processed, text_saved = self.text_extractor.process_all_text_files(base_path)
-        self.stats['text_processed'] = text_processed
-        
-        # Final statistics
-        self.stats['end_time'] = time.time()
-        final_count = self.get_current_activity_count()
-        self.stats['total_activities'] = final_count - initial_count
-        
-        # Identify the files that require manual processing
-        self.stats['pdf_to_process'] = file_counts['pdf']
-        self.stats['doc_to_process'] = file_counts['doc'] + file_counts['docx']
-        
-        self.print_summary()
-        self.save_pdf_doc_list(base_path)
-    
-    def print_summary(self):
-        """Print the processing summary."""
-        print("\n" + "="*60)
-        print("PROCESSING SUMMARY")
-        print("="*60)
-        
-        duration = self.stats['end_time'] - self.stats['start_time']
-        
-        print(f"\nAutomated Processing Results:")
-        print(f"  HTML files processed: {self.stats['html_processed']}")
-        print(f"  Text/MD files processed: {self.stats['text_processed']}")
-        print(f"  New activities added: {self.stats['total_activities']}")
-        print(f"  Processing time: {duration:.1f} seconds")
-        
-        print(f"\nFiles requiring Claude processing:")
-        print(f"  PDF files: {self.stats['pdf_to_process']}")
-        print(f"  DOC/DOCX files: {self.stats['doc_to_process']}")
-        
-        print("\n" + "="*60)
-        print("NEXT STEPS:")
-        print("1. Review the file 'pdf_doc_for_claude.txt' for manual processing")
-        print("2. Use Claude to extract activities from PDF/DOC files")
-        print("3. Focus on largest PDF files first (highest activity density)")
-        print("="*60)
-    
-    def save_pdf_doc_list(self, base_path):
-        """Save the list of PDF/DOC files for processing with Claude."""
-        base_path = Path(base_path)
-        
-        pdf_files = sorted(base_path.rglob("*.pdf"), key=lambda p: p.stat().st_size, reverse=True)
-        doc_files = list(base_path.rglob("*.doc"))
-        docx_files = list(base_path.rglob("*.docx"))
-        
-        with open('pdf_doc_for_claude.txt', 'w', encoding='utf-8') as f:
-            f.write("PDF/DOC FILES FOR CLAUDE PROCESSING\n")
-            f.write("="*60 + "\n")
-            f.write("Files sorted by size (largest first = likely more activities)\n\n")
-            
-            f.write("TOP PRIORITY PDF FILES (process these first):\n")
-            f.write("-"*40 + "\n")
-            for i, pdf in enumerate(pdf_files[:20], 1):
-                size_mb = pdf.stat().st_size / (1024*1024)
-                f.write(f"{i}. {pdf.name} ({size_mb:.1f} MB)\n")
-                f.write(f"   Path: {pdf}\n\n")
-            
-            if len(pdf_files) > 20:
-                f.write(f"\n... and {len(pdf_files)-20} more PDF files\n\n")
-            
-            f.write("\nDOC/DOCX FILES:\n")
-            f.write("-"*40 + "\n")
-            for doc in doc_files + docx_files:
-                size_kb = doc.stat().st_size / 1024
-                f.write(f"- {doc.name} ({size_kb:.1f} KB)\n")
-        
-        print(f"\nPDF/DOC list saved to: pdf_doc_for_claude.txt")
-
-if __name__ == "__main__":
-    processor = UnifiedProcessor()
-    processor.process_automated_formats()
--- a/scripts/validate_extractions.py
+++ b/scripts/validate_extractions.py
@@ -0,0 +1,208 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+validate_extractions.py — validate every data/extracted/*.json (plan §5b).
+
+For each extraction file it runs two checks:
+  1. JSON-schema validation against scripts/activity_schema.json,
+  2. the source_excerpt anti-hallucination check (each excerpt must be a fuzzy
+     substring of the chunk it came from).
+
+For every failing chunk it:
+  * writes the exact re-extraction prompt to data/extracted/_reextract/<chunk>.prompt.md,
+  * marks the chunk `rejected` in data/chunks/manifest.json.
+
+The orchestrator then re-launches subagents only on the `rejected` chunks; the
+loop repeats until nothing is rejected.
+
+Usage:
+    python scripts/validate_extractions.py
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+from typing import Optional
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
+    if _p not in sys.path:
+        sys.path.insert(0, _p)
+
+from import_common import (  # noqa: E402
+    DEFAULT_SCHEMA_PATH,
+    chunk_key_for,
+    excerpt_matches,
+    excerpt_score,
+    find_chunk_text,
+    iter_extraction_files,
+    load_schema,
+    validate_extraction,
+)
+
+SUBAGENT_PROMPT = SCRIPT_DIR / "SUBAGENT_PROMPT.md"
+
+
+# --------------------------------------------------------------------------
+# re-extraction prompt
+# --------------------------------------------------------------------------
+def build_reextraction_prompt(
+    chunk_key: str, chunk_file: Optional[str], errors: list[str]
+) -> str:
+    """The exact prompt to hand a subagent to re-extract a rejected chunk."""
+    chunk_ref = chunk_file or f"data/chunks/<source_id>/{chunk_key}.txt"
+    lines = [
+        f"# RE-EXTRACTION — chunk `{chunk_key}`",
+        "",
+        "The previous extraction for this chunk was **REJECTED**. Reasons:",
+        "",
+    ]
+    lines += [f"- {e}" for e in errors]
+    lines += [
+        "",
+        "## What to do",
+        "",
+        f"1. Read ONLY this chunk: `{chunk_ref}`",
+        f"2. Follow the extraction rules in `{SUBAGENT_PROMPT.relative_to(REPO_ROOT)}`.",
+        "3. Fix every problem listed above. In particular:",
+        "   - every `source_excerpt` must be copied **verbatim** from the chunk",
+        "     (it is checked as a fuzzy substring — invented quotes are rejected);",
+        "   - `source_excerpt` and `page_reference` are mandatory on every activity;",
+        "   - the output must validate against `scripts/activity_schema.json`.",
+        f"4. Overwrite the extraction file `data/extracted/{chunk_key}.json`.",
+        "",
+    ]
+    return "\n".join(lines)
+
+
+# --------------------------------------------------------------------------
+# manifest
+# --------------------------------------------------------------------------
+def load_manifest(manifest_path: Path) -> dict:
+    if manifest_path.is_file():
+        try:
+            data = json.loads(manifest_path.read_text(encoding="utf-8"))
+            data.setdefault("chunks", {})
+            return data
+        except (json.JSONDecodeError, OSError):
+            pass
+    return {"chunks": {}}
+
+
+def save_manifest(manifest: dict, manifest_path: Path) -> None:
+    manifest_path.parent.mkdir(parents=True, exist_ok=True)
+    manifest_path.write_text(
+        json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8"
+    )
+
+
+def mark_rejected(manifest: dict, chunk_key: str) -> None:
+    """Flip a chunk to `rejected` in the manifest (creating the entry if new)."""
+    entry = manifest["chunks"].get(chunk_key, {})
+    entry["state"] = "rejected"
+    manifest["chunks"][chunk_key] = entry
+
+
+# --------------------------------------------------------------------------
+# validation
+# --------------------------------------------------------------------------
+def validate_file(json_path: Path, schema: dict, chunks_dir: Path) -> list[str]:
+    """Return the list of errors for one extraction file (empty == valid)."""
+    try:
+        data = json.loads(json_path.read_text(encoding="utf-8"))
+    except json.JSONDecodeError as exc:
+        return [f"invalid JSON: {exc}"]
+
+    errors = validate_extraction(data, schema)
+    if errors:
+        return errors
+
+    header = data.get("header", {})
+    chunk_text = find_chunk_text(json_path, header, chunks_dir)
+    if chunk_text is None:
+        return [f"source chunk not found for {chunk_key_for(json_path, header)}"]
+
+    for adict in data.get("activities", []):
+        excerpt = adict.get("source_excerpt") or ""
+        if not excerpt_matches(excerpt, chunk_text):
+            score = excerpt_score(excerpt, chunk_text)
+            errors.append(
+                f"activity {adict.get('name')!r}: source_excerpt not found in "
+                f"chunk (best match {score:.0f}/100) — possible hallucination"
+            )
+    return errors
+
+
+def run(
+    extracted_dir: Path,
+    chunks_dir: Path,
+    manifest_path: Path,
+    schema_path: Path = DEFAULT_SCHEMA_PATH,
+) -> dict:
+    schema = load_schema(schema_path)
+    manifest = load_manifest(manifest_path)
+    reextract_dir = extracted_dir / "_reextract"
+
+    report = {"total": 0, "valid": 0, "rejected": 0, "rejected_chunks": []}
+    for json_path in iter_extraction_files(extracted_dir):
+        report["total"] += 1
+        errors = validate_file(json_path, schema, chunks_dir)
+        if not errors:
+            report["valid"] += 1
+            continue
+
+        report["rejected"] += 1
+        try:
+            data = json.loads(json_path.read_text(encoding="utf-8"))
+            header = data.get("header", {})
+        except json.JSONDecodeError:
+            header = {}
+        chunk_key = chunk_key_for(json_path, header)
+        chunk_file = None
+        meta = manifest["chunks"].get(chunk_key)
+        if meta:
+            chunk_file = meta.get("chunk_file")
+
+        reextract_dir.mkdir(parents=True, exist_ok=True)
+        prompt = build_reextraction_prompt(chunk_key, chunk_file, errors)
+        (reextract_dir / f"{chunk_key}.prompt.md").write_text(prompt, encoding="utf-8")
+
+        mark_rejected(manifest, chunk_key)
+        report["rejected_chunks"].append({"chunk": chunk_key, "errors": errors})
+
+    save_manifest(manifest, manifest_path)
+    return report
+
+
+# --------------------------------------------------------------------------
+# CLI
+# --------------------------------------------------------------------------
+def main(argv: Optional[list[str]] = None) -> int:
+    parser = argparse.ArgumentParser(description="Validate extraction JSON files.")
+    parser.add_argument("--extracted", default="data/extracted")
+    parser.add_argument("--chunks", default="data/chunks")
+    parser.add_argument("--manifest", default="data/chunks/manifest.json")
+    parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
+    args = parser.parse_args(argv)
+
+    report = run(
+        Path(args.extracted), Path(args.chunks), Path(args.manifest), Path(args.schema)
+    )
+    print(f"extraction files : {report['total']}")
+    print(f"  valid          : {report['valid']}")
+    print(f"  rejected       : {report['rejected']}")
+    for item in report["rejected_chunks"]:
+        print(f"  [rejected] {item['chunk']}")
+        for err in item["errors"]:
+            print(f"      - {err}")
+    if report["rejected"]:
+        print(f"\nRe-extraction prompts written to {args.extracted}/_reextract/")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
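
Note on the excerpt check: `validate_file` relies on `excerpt_matches` and `excerpt_score` from `import_common`, whose bodies are not part of this file's diff. The sketch below shows one way such a fuzzy-substring check could work, using only the standard library. The function names with a `_sketch` suffix, the whitespace/case normalization, the sliding-window strategy, and the threshold of 85 are all assumptions for illustration and may differ from the real helpers; only the 0 to 100 score scale is taken from the error message above.

```python
import re
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def excerpt_score_sketch(excerpt: str, chunk_text: str) -> float:
    """Best similarity (0-100) between the excerpt and any window of the chunk.

    Hypothetical stand-in for import_common.excerpt_score.
    """
    needle = _normalize(excerpt)
    haystack = _normalize(chunk_text)
    if not needle or not haystack:
        return 0.0
    if needle in haystack:
        return 100.0  # exact (normalized) substring

    window = len(needle)
    if len(haystack) <= window:
        matcher = SequenceMatcher(None, needle, haystack, autojunk=False)
        return 100.0 * matcher.ratio()

    # Slide a window the size of the excerpt across the chunk and keep the
    # best ratio. The step is a speed/accuracy trade-off (assumed value).
    step = max(1, window // 4)
    best = 0.0
    for start in range(0, len(haystack) - window + 1, step):
        candidate = haystack[start:start + window]
        ratio = SequenceMatcher(None, needle, candidate, autojunk=False).ratio()
        best = max(best, ratio)
    return 100.0 * best


def excerpt_matches_sketch(
    excerpt: str, chunk_text: str, threshold: float = 85.0
) -> bool:
    """True when the excerpt is (approximately) present in the chunk.

    Hypothetical stand-in for import_common.excerpt_matches; the threshold
    is an assumed value, not the one used by the pipeline.
    """
    return excerpt_score_sketch(excerpt, chunk_text) >= threshold


if __name__ == "__main__":
    chunk = (
        "The players form a circle and pass the ball clockwise "
        "until the music stops."
    )
    print(excerpt_matches_sketch("Form a  circle\nand pass the ball", chunk))
    print(excerpt_matches_sketch("Each team builds a raft from bamboo poles", chunk))
```

Normalizing before comparison lets an excerpt survive line wrapping and case changes introduced by PDF extraction, while a quote that appears nowhere in the chunk scores well below the threshold and is reported as a possible hallucination.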