Rebuild extraction pipeline infrastructure (Phase 0 prep)

Implements the approved plan to replace the broken regex/index-master
extraction with an LLM-subagent pipeline. Four parallel lanes:

Lane A — scripts/extract_common.py (PDF/docx/doc/pptx/html/zip, no
  max_pages truncation), normalize_sources.py, chunk_sources.py
  (~20pg chunks + overlap, manifest registry), activity_schema.json.
Lane B — app/config_taxonomy.py (16 fixed category slugs), schema
  rebuilt from scratch in app/models/ with content_type, language,
  source_files, source_excerpt, normalized_name, extraction_confidence,
  needs_review; FTS5 + 3 triggers extended with materials_list and
  skills_developed.
Lane C — build_database.py (--rebuild, atomic swap, schema + fuzzy
  source_excerpt validation, dedup with needs_review band),
  validate_extractions.py, review_queue.py, new run_extraction.py
  orchestrator, SUBAGENT_PROMPT.md.
Lane D — search.py content_type/language filters (default search
  excludes non-game content), E7 schema-compat audit; fixed a NULL
  keywords AttributeError in _boost_search_relevance.

Removes 8 orphaned/dead scripts and app/services/parser.py +
indexer.py. Adds tests/ (70 passing, 1 skipped — libreoffice absent).

Note: Lane D made one additive edit to app/models/database.py
(_update_category_counts) to surface content_type/language in
get_filter_options, outside its nominal lane boundary but after
Lane B completed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude Agent
2026-05-19 17:43:38 +00:00
parent e0080edf85
commit 66ae831c36
37 changed files with 4101 additions and 1881 deletions


@@ -0,0 +1,81 @@
# SUBAGENT — Activity extraction
You are a subagent in the game-library extraction pipeline. You extract
educational activities (games, team-building, scouting, recipes, songs,
ceremonies) from one chunk of a source document into structured JSON.
## Your task
1. **Read ONLY the chunk you were assigned.** Do not read other chunks, other
files, or the original document. The chunk is a `.txt` file with
`--- PAGE N ---` markers.
2. Identify **every distinct activity** in the chunk.
3. For each activity, fill the schema in `scripts/activity_schema.json`.
4. Write the result to `data/extracted/<chunk_key>.json`.
## What counts as "a distinct activity"
A distinct activity is a self-contained game/activity/recipe/song/ceremony with
its own name and a real description of how to do it. It is NOT:
- a bare mention or a cross-reference with no description — **skip it**;
- a sub-variant of an activity already extracted — fold it into `variations`;
- a heading, a table of contents entry, or running page chrome.
If the same activity is split across a page boundary inside your chunk, treat it
as **one** activity and combine the text.
## Output format
The file is one JSON object: a `header` plus an `activities` array.
```json
{
  "header": {
    "source_id": "<set from the prompt>",
    "chunk_key": "<set from the prompt>",
    "source_hash": "<set from the prompt>",
    "schema_version": "1.0",
    "prompt_version": "1.0",
    "chunk_range": "pages 1-20"
  },
  "activities": [ ... ]
}
```
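
As a rough illustration of the envelope above, the following Python sketch assembles a `header` plus `activities` object and writes it to `<out_dir>/<chunk_key>.json`. The helper names `build_output` and `write_output` and all header values are hypothetical; they are not part of the pipeline shown in this commit.

```python
import json
import tempfile
from pathlib import Path


def build_output(header: dict, activities: list[dict]) -> dict:
    """Assemble the header + activities envelope described above."""
    return {"header": header, "activities": activities}


def write_output(out_dir: Path, chunk_key: str, payload: dict) -> Path:
    """Write the payload to <out_dir>/<chunk_key>.json as UTF-8 JSON."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{chunk_key}.json"
    out_path.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path


# Hypothetical values; in a real run they are set from the prompt.
header = {
    "source_id": "demo-source",
    "chunk_key": "demo-source.part01",
    "source_hash": "0123456789abcdef",
    "schema_version": "1.0",
    "prompt_version": "1.0",
    "chunk_range": "pages 1-20",
}
payload = build_output(header, [])  # an empty activities list is still a valid file

with tempfile.TemporaryDirectory() as tmp:
    written = write_output(Path(tmp) / "extracted", header["chunk_key"], payload)
    roundtrip = json.loads(written.read_text(encoding="utf-8"))
```

`ensure_ascii=False` keeps Romanian diacritics readable in the output file instead of escaping them.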
## Rules for each activity
- **`name`** — the activity's real name (≥3 characters).
- **`description`** — real prose describing the activity. No hard length limit,
but it must actually describe what happens.
- **`rules`** — how it is played / carried out, if the source gives rules.
- **`category`** — exactly one taxonomy slug (see the `enum` in the schema):
`jocuri-cercetasesti`, `team-building`, `icebreakers`, `camp-outdoor`,
`wide-games`, `orientare`, `prim-ajutor`, `escape-room-puzzle`,
`creative-stem`, `sports-active`, `cantece-ceremonii`, `retete`,
`supravietuire`, `integrare-incluziune`, `conflict-empatie`, `altele`.
When unsure, use `altele`.
- **`content_type`** — the FORM of the content, independent of category:
`joc`, `activitate`, `reteta`, `cantec`, or `ceremonie`.
- **`language`** — `ro` or `en` (the language the activity is written in).
- **`source_excerpt`** — **MANDATORY.** A short quote (one or two sentences)
copied **verbatim** from the chunk. This is the anti-hallucination anchor: it
is checked as a fuzzy substring of the chunk, and invented quotes are
rejected.
- **`page_reference`** — **MANDATORY.** The `--- PAGE N ---` marker(s) the
activity came from, e.g. `"page 14"` or `"pages 14-15"`.
- **`extraction_confidence`** — `high`, `med`, or `low`. Use `low` when the
source text for the activity is thin or ambiguous.
## Never invent data
- Do **not** invent ages, participant counts, or durations. If the source does
not state them, leave those fields `null`.
- Do **not** paraphrase the `source_excerpt` — copy it character for character.
- Better to extract fewer activities accurately than to pad the output.
## Before you finish
- Every activity has a non-empty `source_excerpt` and `page_reference`.
- The file validates against `scripts/activity_schema.json`.
- You only used text from your assigned chunk.
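
The `source_excerpt` rule above can be pictured with a small self-check. This is only a sketch: the build step is described as using a rapidfuzz `partial_ratio >= 90` test, while the code below substitutes the standard library's `difflib` as a stand-in, so its scores are not numerically identical. The function names and the sample chunk text are hypothetical.

```python
import difflib
import re


def normalize_ws(text: str) -> str:
    """Collapse whitespace runs and lowercase, so line wraps do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def excerpt_found(excerpt: str, chunk_text: str, threshold: float = 0.90) -> bool:
    """Return True if the excerpt occurs (approximately) inside the chunk.

    Stand-in for a fuzzy-substring test: exact containment first, then a
    sliding window of the excerpt's length scored with SequenceMatcher.
    """
    needle = normalize_ws(excerpt)
    haystack = normalize_ws(chunk_text)
    if not needle:
        return False
    if needle in haystack:
        return True
    step = max(1, len(needle) // 4)
    last_start = max(1, len(haystack) - len(needle) + 1)
    best = 0.0
    for start in range(0, last_start, step):
        window = haystack[start:start + len(needle)]
        best = max(best, difflib.SequenceMatcher(None, needle, window).ratio())
    return best >= threshold


# Hypothetical chunk content, for illustration only.
chunk = "--- PAGE 14 ---\nThe players form a circle and\npass the ball to the left.\n"
copied = excerpt_found("The players form a circle and pass the ball", chunk)
invented = excerpt_found("Each team builds a tower out of paper cups", chunk)
```

A verbatim quote survives line wrapping in the chunk, while an invented sentence scores far below the threshold and would be rejected.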


@@ -0,0 +1,110 @@
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "Game-library extraction output",
"description": "One subagent output file: a header carrying provenance/version metadata plus the list of activities extracted from a single chunk.",
"type": "object",
"required": ["header", "activities"],
"additionalProperties": false,
"properties": {
"header": {
"type": "object",
"required": ["source_hash", "schema_version", "prompt_version", "chunk_range"],
"additionalProperties": true,
"properties": {
"source_hash": {"type": "string", "minLength": 8},
"schema_version": {"type": "string"},
"prompt_version": {"type": "string"},
"chunk_range": {"type": "string"},
"source_id": {"type": ["string", "null"]},
"chunk_key": {"type": ["string", "null"]}
}
},
"activities": {
"type": "array",
"items": {"$ref": "#/definitions/activity"}
}
},
"definitions": {
"activity": {
"type": "object",
"required": [
"name",
"description",
"category",
"content_type",
"language",
"extraction_confidence",
"source_excerpt",
"page_reference"
],
"additionalProperties": false,
"properties": {
"name": {"type": "string", "minLength": 3},
"description": {"type": "string", "minLength": 1},
"rules": {"type": ["string", "null"]},
"variations": {"type": ["string", "null"]},
"category": {
"type": "string",
"enum": [
"jocuri-cercetasesti",
"team-building",
"icebreakers",
"camp-outdoor",
"wide-games",
"orientare",
"prim-ajutor",
"escape-room-puzzle",
"creative-stem",
"sports-active",
"cantece-ceremonii",
"retete",
"supravietuire",
"integrare-incluziune",
"conflict-empatie",
"altele"
]
},
"subcategory": {"type": ["string", "null"]},
"content_type": {
"type": "string",
"enum": ["joc", "activitate", "reteta", "cantec", "ceremonie"]
},
"language": {"type": "string", "enum": ["ro", "en"]},
"extraction_confidence": {
"type": "string",
"enum": ["high", "med", "low"]
},
"source_excerpt": {"type": "string", "minLength": 1},
"page_reference": {"type": "string", "minLength": 1},
"source_file": {"type": ["string", "null"]},
"age_group_min": {"type": ["integer", "null"], "minimum": 0},
"age_group_max": {"type": ["integer", "null"], "minimum": 0},
"participants_min": {"type": ["integer", "null"], "minimum": 0},
"participants_max": {"type": ["integer", "null"], "minimum": 0},
"duration_min": {"type": ["integer", "null"], "minimum": 0},
"duration_max": {"type": ["integer", "null"], "minimum": 0},
"materials_category": {"type": ["string", "null"]},
"materials_list": {
"type": ["array", "null"],
"items": {"type": "string"}
},
"skills_developed": {
"type": ["array", "null"],
"items": {"type": "string"}
},
"difficulty_level": {
"type": ["string", "null"],
"enum": ["usor", "mediu", "dificil", null]
},
"keywords": {
"type": ["array", "null"],
"items": {"type": "string"}
},
"tags": {
"type": ["array", "null"],
"items": {"type": "string"}
}
}
}
}
}
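
To make the `required` list and the `enum` constraints of the activity definition concrete, here is a hand-rolled Python sketch that checks only those two keywords plus the `name` minimum length. It is not a JSON Schema validator and does not stand in for whatever validator the pipeline actually uses; the `category` enum is left out for brevity, and `quick_check` and the sample objects are hypothetical.

```python
REQUIRED_FIELDS = (
    "name", "description", "category", "content_type",
    "language", "extraction_confidence", "source_excerpt", "page_reference",
)

# Enum values copied from the schema above (category omitted for brevity).
ENUMS = {
    "content_type": {"joc", "activitate", "reteta", "cantec", "ceremonie"},
    "language": {"ro", "en"},
    "extraction_confidence": {"high", "med", "low"},
}


def quick_check(activity: dict) -> list[str]:
    """Return a list of problems; an empty list means the subset check passed."""
    errors: list[str] = []
    for field in REQUIRED_FIELDS:
        value = activity.get(field)
        if not isinstance(value, str) or not value:
            errors.append(f"{field}: missing or empty")
    for field, allowed in ENUMS.items():
        value = activity.get(field)
        if isinstance(value, str) and value and value not in allowed:
            errors.append(f"{field}: {value!r} not in enum")
    name = activity.get("name")
    if isinstance(name, str) and 0 < len(name) < 3:
        errors.append("name: shorter than 3 characters")
    return errors


# Hypothetical sample objects, for illustration only.
good = {
    "name": "Circle pass",
    "description": "Players stand in a circle and pass a ball.",
    "category": "icebreakers",
    "content_type": "joc",
    "language": "en",
    "extraction_confidence": "high",
    "source_excerpt": "Players stand in a circle and pass a ball.",
    "page_reference": "page 14",
}
bad = {**good, "language": "fr", "page_reference": ""}

good_errors = quick_check(good)
bad_errors = quick_check(bad)
```

Optional numeric fields such as `age_group_min` are nullable in the schema, so leaving them `null` never produces an error in a check like this.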

scripts/build_database.py

@@ -0,0 +1,639 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
build_database.py — build data/activities.db from the subagent extraction JSON.
Replaces the old import_claude_activities.py. Pipeline (plan §4):
1. `--rebuild` builds into data/activities.db.tmp; on success the live DB is
backed up to data/activities.db.bak and the tmp file is swapped in with an
atomic os.replace. A mid-build crash leaves the live DB untouched.
2. Every data/extracted/*.json is validated against scripts/activity_schema.json;
invalid files are moved to data/extracted/_rejected/ with an error log.
2b. Each source_excerpt must appear as a fuzzy substring (rapidfuzz
partial_ratio >= 90) of its source chunk — non-matches are hallucinations
and the activity is dropped (logged to _rejected/).
3. `category` is normalized to a valid taxonomy slug (fallback `altele`).
4. Dedup (D5): group by exact normalized_name, never across languages; within a
group rapidfuzz on descriptions — >=85 auto-merge, 60-85 borderline (keep
both, needs_review), <60 separate variants.
5. data/review_decisions.json is applied before insert.
6. Bulk insert into the tmp DB, populate the categories table, rebuild FTS.
7. A QA report is printed.
Usage:
python scripts/build_database.py --rebuild
"""
from __future__ import annotations
import argparse
import json
import os
import shutil
import sys
from collections import defaultdict
from pathlib import Path
from typing import Any, Optional
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
    if _p not in sys.path:
        sys.path.insert(0, _p)
from app.config_taxonomy import ( # noqa: E402
category_display_name,
normalize_category,
normalize_content_type,
)
from app.models.activity import Activity # noqa: E402
from app.models.database import DatabaseManager # noqa: E402
from import_common import ( # noqa: E402
DEFAULT_SCHEMA_PATH,
content_key,
excerpt_matches,
find_chunk_text,
iter_extraction_files,
load_schema,
normalize_name,
source_path_for,
)
# dedup thresholds (rapidfuzz token_sort_ratio, 0..100 scale)
AUTO_MERGE_THRESHOLD = 85.0
BORDERLINE_THRESHOLD = 60.0
# --------------------------------------------------------------------------
# extraction dict -> Activity
# --------------------------------------------------------------------------
def _csv(value: Any) -> Optional[str]:
    """Schema arrays -> comma string for the (TEXT) DB columns."""
    if value is None:
        return None
    if isinstance(value, str):
        return value.strip() or None
    if isinstance(value, (list, tuple)):
        parts = [str(v).strip() for v in value if str(v).strip()]
        return ", ".join(parts) or None
    return str(value)


def _split_csv(value: Optional[str]) -> list[str]:
    if not value:
        return []
    return [p.strip() for p in str(value).split(",") if p.strip()]
def dict_to_activity(adict: dict, source_file: str) -> Activity:
"""Build an Activity from one extraction-JSON activity object."""
tags = adict.get("tags") or []
if isinstance(tags, str):
tags = _split_csv(tags)
source_files = adict.get("source_files") or []
if isinstance(source_files, str):
source_files = _split_csv(source_files)
if source_file and source_file not in source_files:
source_files = [source_file, *source_files]
return Activity(
name=(adict.get("name") or "").strip(),
description=(adict.get("description") or "").strip(),
rules=adict.get("rules"),
variations=adict.get("variations"),
category=normalize_category(adict.get("category", "")),
subcategory=adict.get("subcategory"),
content_type=normalize_content_type(adict.get("content_type", "")),
source_file=source_file,
source_files=list(source_files),
page_reference=adict.get("page_reference"),
source_excerpt=adict.get("source_excerpt"),
age_group_min=adict.get("age_group_min"),
age_group_max=adict.get("age_group_max"),
participants_min=adict.get("participants_min"),
participants_max=adict.get("participants_max"),
duration_min=adict.get("duration_min"),
duration_max=adict.get("duration_max"),
materials_category=adict.get("materials_category"),
materials_list=_csv(adict.get("materials_list")),
skills_developed=_csv(adict.get("skills_developed")),
difficulty_level=adict.get("difficulty_level"),
keywords=_csv(adict.get("keywords")),
tags=list(tags),
language=adict.get("language"),
extraction_confidence=adict.get("extraction_confidence"),
)
# --------------------------------------------------------------------------
# step 3 — category normalization is done in dict_to_activity; a non-taxonomy
# value silently falls back to `altele`. This logs the substitutions.
# --------------------------------------------------------------------------
def log_category_fallbacks(raw_pairs: list[tuple[str, str]]) -> list[str]:
    """raw_pairs = (original, slug); return human-readable fallback messages."""
    msgs = []
    for original, slug in raw_pairs:
        if slug == "altele" and normalize_name(original or "") not in ("", "altele"):
            msgs.append(f"category '{original}' -> altele (not in taxonomy)")
    return msgs
# --------------------------------------------------------------------------
# step 4 — dedup
# --------------------------------------------------------------------------
def _longest(*values: Optional[str]) -> Optional[str]:
    best: Optional[str] = None
    for v in values:
        if v and (best is None or len(v) > len(best)):
            best = v
    return best


def _union_csv(values: list[Optional[str]]) -> Optional[str]:
    seen: list[str] = []
    for value in values:
        for item in _split_csv(value):
            if item not in seen:
                seen.append(item)
    return ", ".join(seen) or None
def merge_cluster(cluster: list[Activity]) -> Activity:
"""Collapse a cluster of duplicate activities into one merged Activity."""
if len(cluster) == 1:
return cluster[0]
# representative = the one with the longest description
rep = max(cluster, key=lambda a: len(a.description or ""))
merged = Activity(
name=rep.name,
description=_longest(*(a.description for a in cluster)) or rep.description,
rules=_longest(*(a.rules for a in cluster)),
variations=_longest(*(a.variations for a in cluster)),
category=rep.category,
subcategory=rep.subcategory,
content_type=rep.content_type,
source_file=rep.source_file,
page_reference=rep.page_reference,
source_excerpt=rep.source_excerpt,
age_group_min=rep.age_group_min,
age_group_max=rep.age_group_max,
participants_min=rep.participants_min,
participants_max=rep.participants_max,
duration_min=rep.duration_min,
duration_max=rep.duration_max,
materials_category=rep.materials_category,
materials_list=_union_csv([a.materials_list for a in cluster]),
skills_developed=_union_csv([a.skills_developed for a in cluster]),
difficulty_level=rep.difficulty_level,
keywords=_union_csv([a.keywords for a in cluster]),
language=rep.language,
extraction_confidence=rep.extraction_confidence,
)
# union of tags
tags: list[str] = []
for a in cluster:
for t in a.tags or []:
if t not in tags:
tags.append(t)
merged.tags = tags
# accumulate every source the activity was seen in
sources: list[str] = []
for a in cluster:
for s in [a.source_file, *(a.source_files or [])]:
if s and s not in sources:
sources.append(s)
merged.source_files = sources
# popularity_score++ per merged duplicate (plan §4)
merged.popularity_score = max(a.popularity_score for a in cluster) + (len(cluster) - 1)
return merged
def dedup_activities(activities: list[Activity]) -> tuple[list[Activity], dict]:
    """
    Dedup per plan D5.

    Groups by (normalized_name, language); different languages are NEVER
    merged. Within a group, each description is scored with rapidfuzz
    token_sort_ratio against the first member of every existing cluster:
        >= 85       -> same cluster (auto-merge)
        60 to < 85  -> borderline: kept as separate clusters, both flagged needs_review
        < 60        -> separate variants
    """
    from rapidfuzz import fuzz

    groups: dict[tuple, list[Activity]] = defaultdict(list)
    for act in activities:
        key = (act.normalized_name or normalize_name(act.name), act.language)
        groups[key].append(act)

    result: list[Activity] = []
    stats = {"input": len(activities), "auto_merged": 0, "borderline": 0, "output": 0}

    for members in groups.values():
        clusters: list[list[Activity]] = []
        borderline_idx: set[int] = set()
        for act in members:
            best_idx, best_score = -1, -1.0
            borderline_here: list[int] = []
            for idx, cluster in enumerate(clusters):
                score = fuzz.token_sort_ratio(
                    act.description or "", cluster[0].description or ""
                )
                if score >= AUTO_MERGE_THRESHOLD:
                    if score > best_score:
                        best_idx, best_score = idx, score
                elif score >= BORDERLINE_THRESHOLD:
                    borderline_here.append(idx)
            if best_idx >= 0:
                clusters[best_idx].append(act)
            else:
                # no auto-merge target: open a new cluster and flag it together
                # with every borderline neighbour
                clusters.append([act])
                new_idx = len(clusters) - 1
                for bidx in borderline_here:
                    borderline_idx.add(bidx)
                    borderline_idx.add(new_idx)
        for idx, cluster in enumerate(clusters):
            merged = merge_cluster(cluster)
            if len(cluster) > 1:
                stats["auto_merged"] += len(cluster) - 1
            if idx in borderline_idx:
                merged.needs_review = 1
                stats["borderline"] += 1
            result.append(merged)

    stats["output"] = len(result)
    return result, stats
# --------------------------------------------------------------------------
# step 5 — review decisions
# --------------------------------------------------------------------------
def load_review_decisions(path: Path) -> dict:
    if path and path.is_file():
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            if isinstance(data, dict):
                return data
        except (json.JSONDecodeError, OSError):
            # unreadable or malformed decisions file: fall back to no decisions
            pass
    return {}
def apply_review_decisions(
    activities: list[Activity], decisions: dict
) -> tuple[list[Activity], dict]:
    """
    Apply data/review_decisions.json (plan §5c).

    Keyed by the stable content_key. A decision of `drop` removes the row;
    `keep-separate` / `merge` clear needs_review (the user has resolved it).
    Rows with no decision keep needs_review and resurface in the queue.
    """
    kept: list[Activity] = []
    stats = {"dropped": 0, "resolved": 0}
    for act in activities:
        key = content_key(
            act.normalized_name or normalize_name(act.name),
            act.language,
            act.description or "",
        )
        entry = decisions.get(key)
        decision = entry.get("decision") if isinstance(entry, dict) else entry
        if decision == "drop":
            stats["dropped"] += 1
            continue
        if decision in ("keep-separate", "merge"):
            act.needs_review = 0
            stats["resolved"] += 1
        kept.append(act)
    return kept, stats
# --------------------------------------------------------------------------
# golden-set recall (plan §7)
# --------------------------------------------------------------------------
def _golden_names(data: Any) -> list[str]:
items = data.get("activities", data) if isinstance(data, dict) else data
names: list[str] = []
for item in items or []:
if isinstance(item, str):
names.append(item)
elif isinstance(item, dict) and item.get("name"):
names.append(item["name"])
return names
def golden_recall(golden_dir: Path, activities: list[Activity]) -> Optional[dict]:
if not golden_dir or not golden_dir.is_dir():
return None
found = {normalize_name(a.name) for a in activities}
expected, hits = 0, 0
for gf in sorted(golden_dir.glob("*.json")):
try:
data = json.loads(gf.read_text(encoding="utf-8"))
except (json.JSONDecodeError, OSError):
continue
for name in _golden_names(data):
expected += 1
if normalize_name(name) in found:
hits += 1
if expected == 0:
return None
return {"expected": expected, "found": hits, "recall": round(hits / expected, 3)}
# --------------------------------------------------------------------------
# load + validate + excerpt-check the extraction files
# --------------------------------------------------------------------------
def collect_activities(
    extracted_dir: Path,
    chunks_dir: Path,
    sources_dir: Path,
    schema: dict,
) -> dict:
    """Validate, excerpt-check and convert every extraction file."""
    # local imports, done once here rather than on every loop iteration
    from import_common import chunk_key_for, validate_extraction

    rejected_dir = extracted_dir / "_rejected"
    activities: list[Activity] = []
    report = {
        "files_total": 0,
        "files_valid": 0,
        "files_rejected_schema": 0,
        "activities_raw": 0,
        "activities_hallucinated": 0,
        "category_fallbacks": [],
    }
    raw_categories: list[tuple[str, str]] = []

    for json_path in iter_extraction_files(extracted_dir):
        report["files_total"] += 1
        try:
            data = json.loads(json_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            _reject_file(json_path, rejected_dir, [f"invalid JSON: {exc}"])
            report["files_rejected_schema"] += 1
            continue

        errors = validate_extraction(data, schema)
        if errors:
            _reject_file(json_path, rejected_dir, errors)
            report["files_rejected_schema"] += 1
            continue
        report["files_valid"] += 1

        header = data.get("header", {})
        chunk_text = find_chunk_text(json_path, header, chunks_dir)
        source_id = header.get("source_id") or chunk_key_for(json_path, header).rsplit(
            ".part", 1
        )[0]
        fallback_source = (
            source_path_for(source_id, sources_dir) or source_id or json_path.stem
        )

        hallucinated: list[dict] = []
        for adict in data.get("activities", []):
            report["activities_raw"] += 1
            excerpt = adict.get("source_excerpt") or ""
            # If the chunk text is unavailable the excerpt cannot be verified:
            # the activity is kept unverified (it is already counted under
            # activities_raw above).
            if chunk_text is not None and not excerpt_matches(excerpt, chunk_text):
                hallucinated.append(adict)
                report["activities_hallucinated"] += 1
                continue
            src = adict.get("source_file") or fallback_source
            raw_category = adict.get("category", "")
            raw_categories.append((raw_category, normalize_category(raw_category)))
            activities.append(dict_to_activity(adict, src))
        if hallucinated:
            _log_hallucinations(json_path, rejected_dir, hallucinated)

    report["category_fallbacks"] = log_category_fallbacks(raw_categories)
    report["activities"] = activities
    return report
def _reject_file(json_path: Path, rejected_dir: Path, errors: list[str]) -> None:
rejected_dir.mkdir(parents=True, exist_ok=True)
dest = rejected_dir / json_path.name
shutil.move(str(json_path), str(dest))
log = rejected_dir / f"{json_path.stem}.errors.txt"
log.write_text(
f"REJECTED (schema validation): {json_path.name}\n\n"
+ "\n".join(f" - {e}" for e in errors)
+ "\n",
encoding="utf-8",
)
def _log_hallucinations(
json_path: Path, rejected_dir: Path, hallucinated: list[dict]
) -> None:
rejected_dir.mkdir(parents=True, exist_ok=True)
log = rejected_dir / f"{json_path.stem}.hallucinations.txt"
lines = [f"DROPPED activities (source_excerpt not found in chunk): {json_path.name}", ""]
for a in hallucinated:
lines.append(f" - {a.get('name')!r}")
lines.append(f" excerpt: {a.get('source_excerpt')!r}")
log.write_text("\n".join(lines) + "\n", encoding="utf-8")
# --------------------------------------------------------------------------
# DB write + atomic swap
# --------------------------------------------------------------------------
def _enrich_category_display_names(db_path: Path) -> None:
"""Give the categories table proper Romanian display names for slugs."""
import sqlite3
conn = sqlite3.connect(db_path)
try:
rows = conn.execute(
"SELECT value FROM categories WHERE type = 'category'"
).fetchall()
for (slug,) in rows:
conn.execute(
"UPDATE categories SET display_name = ? WHERE type='category' AND value = ?",
(category_display_name(slug), slug),
)
conn.commit()
finally:
conn.close()
def write_database(db_tmp_path: Path, activities: list[Activity]) -> None:
    """Create a fresh tmp DB, bulk insert, populate categories, rebuild FTS."""
    if db_tmp_path.exists():
        db_tmp_path.unlink()
    db = DatabaseManager(str(db_tmp_path))
    db.bulk_insert_activities(activities)
    _enrich_category_display_names(db_tmp_path)
    db.rebuild_fts_index()


def atomic_swap(db_tmp_path: Path, db_path: Path) -> Optional[Path]:
    """Back up the live DB then atomically swap the tmp file in."""
    backup: Optional[Path] = None
    if db_path.exists():
        backup = db_path.with_suffix(db_path.suffix + ".bak")
        shutil.copy2(db_path, backup)
    os.replace(db_tmp_path, db_path)
    return backup
# --------------------------------------------------------------------------
# orchestration
# --------------------------------------------------------------------------
def rebuild(
    *,
    extracted_dir: Path,
    chunks_dir: Path,
    sources_dir: Path,
    db_path: Path,
    decisions_path: Optional[Path] = None,
    schema_path: Path = DEFAULT_SCHEMA_PATH,
    golden_dir: Optional[Path] = None,
    do_swap: bool = True,
) -> dict:
    """
    Full rebuild. Everything is built into <db_path>.tmp; the live DB is only
    touched by the final atomic swap, so a crash anywhere above leaves it intact.
    """
    extracted_dir = Path(extracted_dir)
    db_path = Path(db_path)
    db_tmp_path = db_path.with_suffix(db_path.suffix + ".tmp")

    schema = load_schema(schema_path)
    collected = collect_activities(extracted_dir, Path(chunks_dir), Path(sources_dir), schema)
    activities: list[Activity] = collected.pop("activities")

    deduped, dedup_stats = dedup_activities(activities)
    decisions = load_review_decisions(Path(decisions_path)) if decisions_path else {}
    final, decision_stats = apply_review_decisions(deduped, decisions)

    try:
        write_database(db_tmp_path, final)
        backup = atomic_swap(db_tmp_path, db_path) if do_swap else None
    except Exception:
        if db_tmp_path.exists():
            db_tmp_path.unlink()
        raise

    report = {
        **collected,
        "dedup": dedup_stats,
        "decisions": decision_stats,
        "final_count": len(final),
        "backup": str(backup) if backup else None,
        "swapped": do_swap,
        "qa": _qa_report(final, collected, golden_dir),
    }
    return report
def _qa_report(
activities: list[Activity], collected: dict, golden_dir: Optional[Path]
) -> dict:
per_category: dict[str, int] = defaultdict(int)
per_content_type: dict[str, int] = defaultdict(int)
confidence: dict[str, int] = defaultdict(int)
with_rules = 0
for a in activities:
per_category[a.category] += 1
per_content_type[a.content_type or "?"] += 1
confidence[a.extraction_confidence or "?"] += 1
if a.rules and a.rules.strip():
with_rules += 1
raw = collected.get("activities_raw", 0)
hallucinated = collected.get("activities_hallucinated", 0)
return {
"total": len(activities),
"per_category": dict(per_category),
"per_content_type": dict(per_content_type),
"extraction_confidence": dict(confidence),
"pct_with_rules": round(100 * with_rules / len(activities), 1) if activities else 0.0,
"needs_review": sum(1 for a in activities if a.needs_review),
"hallucination_rate": round(100 * hallucinated / raw, 2) if raw else 0.0,
"golden_recall": golden_recall(Path(golden_dir), activities) if golden_dir else None,
}
def print_report(report: dict) -> None:
qa = report["qa"]
print("=" * 60)
print("BUILD DATABASE — QA REPORT")
print("=" * 60)
print(f"extraction files : {report['files_total']} "
f"(valid {report['files_valid']}, schema-rejected {report['files_rejected_schema']})")
print(f"activities raw : {report['activities_raw']}")
print(f" hallucinated drop : {report['activities_hallucinated']} "
f"({qa['hallucination_rate']}%)")
d = report["dedup"]
print(f"dedup : {d['input']} -> {d['output']} "
f"(auto-merged {d['auto_merged']}, borderline {d['borderline']})")
print(f"review decisions : dropped {report['decisions']['dropped']}, "
f"resolved {report['decisions']['resolved']}")
print(f"final inserted : {report['final_count']}")
print(f"% with rules : {qa['pct_with_rules']}")
print(f"needs_review rows : {qa['needs_review']}")
print("per category :")
for slug, n in sorted(qa["per_category"].items(), key=lambda kv: -kv[1]):
print(f" {slug:<24}: {n}")
print("per content_type :")
for ct, n in sorted(qa["per_content_type"].items(), key=lambda kv: -kv[1]):
print(f" {ct:<24}: {n}")
print("extraction_confidence:")
for c, n in sorted(qa["extraction_confidence"].items()):
print(f" {c:<24}: {n}")
if qa["golden_recall"]:
g = qa["golden_recall"]
print(f"golden recall : {g['found']}/{g['expected']} = {g['recall']}")
if report["category_fallbacks"]:
print("category fallbacks :")
for msg in report["category_fallbacks"]:
print(f" {msg}")
if report["backup"]:
print(f"live DB backed up to : {report['backup']}")
print("=" * 60)
# --------------------------------------------------------------------------
# CLI
# --------------------------------------------------------------------------
def main(argv: Optional[list[str]] = None) -> int:
parser = argparse.ArgumentParser(description="Build activities.db from extraction JSON.")
parser.add_argument("--rebuild", action="store_true",
help="rebuild the database from scratch (only mode supported)")
parser.add_argument("--extracted", default="data/extracted")
parser.add_argument("--chunks", default="data/chunks")
parser.add_argument("--sources", default="data/sources")
parser.add_argument("--db", default="data/activities.db")
parser.add_argument("--decisions", default="data/review_decisions.json")
parser.add_argument("--golden", default="data/golden")
parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
args = parser.parse_args(argv)
if not args.rebuild:
parser.error("only --rebuild is supported (full rebuild, no incremental merge)")
report = rebuild(
extracted_dir=Path(args.extracted),
chunks_dir=Path(args.chunks),
sources_dir=Path(args.sources),
db_path=Path(args.db),
decisions_path=Path(args.decisions),
schema_path=Path(args.schema),
golden_dir=Path(args.golden),
)
print_report(report)
return 0
if __name__ == "__main__":
raise SystemExit(main())
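
The three dedup bands used by `dedup_activities` can be sketched in isolation. The thresholds below are the ones defined in the file; the similarity function is an assumption: it imitates the idea behind rapidfuzz's `token_sort_ratio` (sort the tokens, then compare) with the standard library's `difflib`, so its scores will not match rapidfuzz exactly. The names `token_sort_similarity` and `band` are hypothetical.

```python
import difflib

AUTO_MERGE_THRESHOLD = 85.0
BORDERLINE_THRESHOLD = 60.0


def token_sort_similarity(a: str, b: str) -> float:
    """difflib stand-in for a token-sort ratio on a 0..100 scale."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return 100.0 * difflib.SequenceMatcher(None, sorted_a, sorted_b).ratio()


def band(score: float) -> str:
    """Map a similarity score to the dedup decision band."""
    if score >= AUTO_MERGE_THRESHOLD:
        return "auto-merge"
    if score >= BORDERLINE_THRESHOLD:
        return "borderline"
    return "separate"


# Same words in a different order compare as identical after token sorting.
reordered = token_sort_similarity(
    "pass the ball around the circle", "around the circle pass the ball"
)
decision = band(reordered)
```

Note that a score of exactly 85 auto-merges and a score of exactly 60 is borderline, because both comparisons use `>=`.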

scripts/chunk_sources.py

@@ -0,0 +1,251 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
chunk_sources.py — split normalized data/sources/*.txt into ~20-page chunks
for subagent extraction, and maintain data/chunks/manifest.json.
Paginated text → ~20-page chunks, ~4-page overlap (plan D8).
Unpaginated text → ~10000-word windows, ~2000-word overlap.
The manifest is a cache derived from the filesystem + per-chunk state. Re-running
this script is idempotent: existing chunk states (pending/assigned/done/rejected)
survive as long as the source content hash is unchanged.
"""
from __future__ import annotations
import argparse
import json
import sys
from pathlib import Path
SCRIPT_DIR = Path(__file__).resolve().parent
if str(SCRIPT_DIR) not in sys.path:
    sys.path.insert(0, str(SCRIPT_DIR))
from extract_common import content_hash, split_pages # noqa: E402
SCHEMA_VERSION = "1.0"
PAGES_PER_CHUNK = 20
PAGE_OVERLAP = 4
WORD_WINDOW = 10_000
WORD_OVERLAP = 2_000
VALID_STATES = {"pending", "assigned", "done", "rejected"}
# --------------------------------------------------------------------------
# header parsing
# --------------------------------------------------------------------------
def parse_source(text: str) -> tuple[dict, str]:
    """Split a normalized source file into (header_dict, body)."""
    lines = text.splitlines()
    header: dict = {}
    body_start = 0
    in_header = True
    for i, line in enumerate(lines):
        if line.startswith("--- PAGE "):
            # the first page marker always starts the body
            body_start = i
            break
        if not in_header:
            continue
        # a line made only of "=" characters (never true for an empty line)
        if set(line.strip()) == {"="}:
            body_start = i + 1
            in_header = False  # header ends at the rule line
            continue
        if ":" in line:
            key, _, val = line.partition(":")
            header[key.strip()] = val.strip()
    body = "\n".join(lines[body_start:])
    return header, body
# --------------------------------------------------------------------------
# chunking — pure functions
# --------------------------------------------------------------------------
def chunk_pages(
    pages: list[tuple[int, str]],
    pages_per_chunk: int = PAGES_PER_CHUNK,
    overlap: int = PAGE_OVERLAP,
) -> list[dict]:
    """
    Split an ordered list of (page_no, text) into overlapping chunks.

    stride = pages_per_chunk - overlap, so consecutive chunks share `overlap`
    pages. Any activity spanning at most overlap + 1 consecutive pages
    therefore appears whole in at least one chunk.
    """
    if not pages:
        return []
    stride = max(1, pages_per_chunk - overlap)
    chunks: list[dict] = []
    i = 0
    n = len(pages)
    while i < n:
        window = pages[i : i + pages_per_chunk]
        first, last = window[0][0], window[-1][0]
        text = "".join(
            f"\n--- PAGE {num} ---\n{txt}\n" for num, txt in window
        )
        chunks.append(
            {"page_start": first, "page_end": last,
             "chunk_range": f"pages {first}-{last}", "text": text}
        )
        # stop once this window reaches the last page
        if i + pages_per_chunk >= n:
            break
        i += stride
    return chunks
def chunk_words(
    text: str, window: int = WORD_WINDOW, overlap: int = WORD_OVERLAP
) -> list[dict]:
    """Split unpaginated text into overlapping word windows."""
    words = text.split()
    if not words:
        return []
    stride = max(1, window - overlap)
    chunks: list[dict] = []
    i = 0
    n = len(words)
    while i < n:
        seg = words[i : i + window]
        chunks.append(
            {"word_start": i, "word_end": i + len(seg),
             "chunk_range": f"words {i}-{i + len(seg)}", "text": " ".join(seg)}
        )
        if i + window >= n:
            break
        i += stride
    return chunks
def make_chunks(source_text: str) -> list[dict]:
"""Chunk one normalized source file. Picks page- or word-windowing."""
_, body = parse_source(source_text)
pages = split_pages(body)
if pages:
return chunk_pages(pages)
return chunk_words(body)
# --------------------------------------------------------------------------
# manifest
# --------------------------------------------------------------------------
def _empty_manifest() -> dict:
    return {"schema_version": SCHEMA_VERSION, "chunks": {}}


def load_manifest(manifest_path: Path) -> dict:
    if manifest_path.exists():
        try:
            data = json.loads(manifest_path.read_text(encoding="utf-8"))
            data.setdefault("schema_version", SCHEMA_VERSION)
            data.setdefault("chunks", {})
            return data
        except (json.JSONDecodeError, OSError):
            pass
    return _empty_manifest()


def save_manifest(manifest: dict, manifest_path: Path) -> None:
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    manifest_path.write_text(
        json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8"
    )
def chunk_source_file(
    source_path: Path, chunks_dir: Path, manifest: dict
) -> list[str]:
    """
    Chunk one data/sources/<id>.txt → data/chunks/<id>/<id>.partNN.txt and
    register every chunk in `manifest`. Preserves prior state when the source
    content hash is unchanged. Returns the list of chunk keys written.
    """
    source_id = source_path.stem
    text = source_path.read_text(encoding="utf-8", errors="replace")
    src_hash = content_hash(text)
    chunks = make_chunks(text)
    out_dir = chunks_dir / source_id
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[str] = []
    for idx, chunk in enumerate(chunks, 1):
        key = f"{source_id}.part{idx:02d}"
        chunk_file = out_dir / f"{key}.txt"
        chunk_file.write_text(chunk["text"], encoding="utf-8")
        prior = manifest["chunks"].get(key)
        # preserve state only if the source content is unchanged
        if prior and prior.get("source_hash") == src_hash and \
                prior.get("state") in VALID_STATES:
            state = prior["state"]
        else:
            state = "pending"
        manifest["chunks"][key] = {
            "source_id": source_id,
            "source_hash": src_hash,
            "part": idx,
            "chunk_range": chunk["chunk_range"],
            "chunk_file": str(chunk_file.relative_to(chunks_dir.parent)),
            "expected_json": f"{key}.json",
            "state": state,
        }
        written.append(key)
    return written
def prune_stale(manifest: dict, live_keys: set[str]) -> list[str]:
    """Drop manifest entries whose chunk no longer exists on disk."""
    stale = [k for k in manifest["chunks"] if k not in live_keys]
    for k in stale:
        del manifest["chunks"][k]
    return stale
# --------------------------------------------------------------------------
# CLI
# --------------------------------------------------------------------------
def run(sources_dir: Path, chunks_dir: Path) -> dict:
    """Chunk every *.txt in sources_dir. Returns a summary dict."""
    manifest_path = chunks_dir / "manifest.json"
    manifest = load_manifest(manifest_path)
    live_keys: set[str] = set()
    source_files = sorted(sources_dir.glob("*.txt"))
    for src in source_files:
        live_keys.update(chunk_source_file(src, chunks_dir, manifest))
    stale = prune_stale(manifest, live_keys)
    save_manifest(manifest, manifest_path)
    states: dict[str, int] = {}
    for meta in manifest["chunks"].values():
        states[meta["state"]] = states.get(meta["state"], 0) + 1
    return {
        "sources": len(source_files),
        "chunks": len(live_keys),
        "pruned": len(stale),
        "states": states,
    }


def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(description="Chunk normalized sources.")
    parser.add_argument("--sources", default="data/sources", help="sources dir")
    parser.add_argument("--chunks", default="data/chunks", help="chunks output dir")
    args = parser.parse_args(argv)
    summary = run(Path(args.sources), Path(args.chunks))
    print(f"sources processed : {summary['sources']}")
    print(f"chunks written : {summary['chunks']}")
    print(f"stale pruned : {summary['pruned']}")
    for state, count in sorted(summary["states"].items()):
        print(f" {state:<10}: {count}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
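
The overlapping-window logic in `chunk_pages` can be illustrated with a stripped-down sketch that tracks only page numbers. The helper name `page_windows` and the values `pages_per_chunk=4`, `overlap=1` are illustrative assumptions, not the module's actual defaults (`PAGES_PER_CHUNK` and `PAGE_OVERLAP` are defined outside this excerpt).

```python
def page_windows(page_numbers, pages_per_chunk, overlap):
    """Return (first_page, last_page) for each overlapping window."""
    # Same stride rule as chunk_pages: never advance by less than one page.
    stride = max(1, pages_per_chunk - overlap)
    windows = []
    i = 0
    while i < len(page_numbers):
        window = page_numbers[i : i + pages_per_chunk]
        windows.append((window[0], window[-1]))
        # Stop once a window already reaches the final page.
        if i + pages_per_chunk >= len(page_numbers):
            break
        i += stride
    return windows


# Hypothetical 10-page source with 4-page chunks sharing 1 page.
windows = page_windows(list(range(1, 11)), pages_per_chunk=4, overlap=1)
print(windows)
```

With these assumed values each chunk repeats the last page of the previous one, so an activity that starts at the bottom of page 4 and ends on page 5 is fully contained in the second window.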


@@ -1,54 +0,0 @@
# TEMPLATE FOR EXTRACTING ACTIVITIES WITH CLAUDE
## Instructions for Claude Code:
For each PDF/DOC, use the following extraction format:
### 1. Read the file:
```
Claude, please read the file: [FILE_PATH]
```
### 2. Extract the activities using this JSON template:
```json
{
  "source_file": "[FILE_NAME]",
  "activities": [
    {
      "name": "Activity name",
      "description": "Full description of the activity",
      "rules": "Rules of the game/activity",
      "variations": "Variants or adaptations",
      "category": "[A-H] based on type",
      "age_group_min": 6,
      "age_group_max": 14,
      "participants_min": 4,
      "participants_max": 20,
      "duration_min": 10,
      "duration_max": 30,
      "materials_list": "List of required materials",
      "skills_developed": "Skills developed",
      "difficulty_level": "Ușor/Mediu/Dificil",
      "keywords": "comma-separated keywords",
      "tags": "relevant tags"
    }
  ]
}
```
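
A payload shaped like the template above could be sanity-checked before it is saved. The sketch below is only an illustration: the function name `check_extraction` and the choice of `REQUIRED_KEYS` are assumptions, since the template does not say which fields are mandatory.

```python
import json

# Hypothetical minimum; the template does not mark any field as required.
REQUIRED_KEYS = {"name", "description", "category"}
RANGE_PAIRS = [
    ("age_group_min", "age_group_max"),
    ("participants_min", "participants_max"),
    ("duration_min", "duration_max"),
]


def check_extraction(payload: dict) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    if not payload.get("source_file"):
        problems.append("missing source_file")
    for idx, activity in enumerate(payload.get("activities", [])):
        missing = REQUIRED_KEYS - activity.keys()
        if missing:
            problems.append(f"activity {idx}: missing {sorted(missing)}")
        # A min value above its max is almost certainly an extraction slip.
        for low, high in RANGE_PAIRS:
            if low in activity and high in activity and activity[low] > activity[high]:
                problems.append(f"activity {idx}: {low} > {high}")
    return problems


sample = json.loads("""
{"source_file": "example.pdf",
 "activities": [
   {"name": "Knot relay", "description": "Teams tie knots in turn.",
    "category": "[E]", "age_group_min": 14, "age_group_max": 6}
 ]}
""")
problems = check_extraction(sample)
print(problems)
```

Returning a list of problems instead of raising on the first one lets a reviewer see every issue in a file at once.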
### 3. Save to file:
After extraction, save the JSON to: `/scripts/extracted_activities/[FILE_NAME].json`
### 4. Processing priorities:
**TOP PRIORITY (process these first):**
1. 1000 Fantastic Scout Games.pdf
2. Cartea Mare a jocurilor.pdf
3. 160-de-activitati-dinamice-jocuri-pentru-team-building.pdf
4. 101 Ways to Create an Unforgettable Camp Experience.pdf
5. 151 Awesome Summer Camp Nature Activities.pdf
**Focus categories:**
- [A] Scouting Games
- [C] Camping & Outdoor Activities
- [G] Educational Activities


@@ -1,164 +0,0 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
DATABASE SETUP SCRIPT - INDEX-SISTEM-JOCURI
Script for recreating the databases excluded via .gitignore
Uses the DatabaseManager classes for consistency
Usage:
python scripts/create_databases.py
python scripts/create_databases.py --clear-existing
"""
import sys
import argparse
from pathlib import Path
# Add src to path so we can import our modules
sys.path.append(str(Path(__file__).parent.parent / 'src'))
from database import DatabaseManager
from game_library_manager import GameLibraryManager
def create_main_database(db_path: str = "data/activities.db", clear: bool = False):
"""Create the main activities database"""
db_file = Path(db_path)
if clear and db_file.exists():
print(f"🗑️ Removing existing database: {db_path}")
db_file.unlink()
print(f"📊 Creating main database: {db_path}")
db = DatabaseManager(db_path)
# Test the database
try:
stats = db.get_statistics()
print(f"✅ Database created successfully: {stats['total_activities']} activities")
return True
except Exception as e:
print(f"❌ Error creating database: {e}")
return False
def create_game_library_database(db_path: str = "data/game_library.db", clear: bool = False):
"""Create the legacy game library database"""
db_file = Path(db_path)
if clear and db_file.exists():
print(f"🗑️ Removing existing database: {db_path}")
db_file.unlink()
print(f"📊 Creating game library database: {db_path}")
manager = GameLibraryManager(db_path)
print(f"✅ Game library database created successfully")
return True
def create_test_database(db_path: str = "data/test_activities.db", clear: bool = False):
"""Create the test database"""
db_file = Path(db_path)
if clear and db_file.exists():
print(f"🗑️ Removing existing database: {db_path}")
db_file.unlink()
print(f"📊 Creating test database: {db_path}")
db = DatabaseManager(db_path)
# Add some test data
test_activity = {
'title': 'Test Activity - Setup Script',
'description': 'This is a test activity created by the setup script',
'file_path': 'test/sample.txt',
'file_type': 'TXT',
'category': 'test',
'age_group': '8-12 ani',
'participants': '5-10 persoane',
'duration': '15-30min',
'materials': 'Fără materiale',
'tags': '["test", "setup"]',
'source_text': 'Sample test content for verification'
}
try:
db.insert_activity(test_activity)
stats = db.get_statistics()
print(f"✅ Test database created with sample data: {stats['total_activities']} activities")
return True
except Exception as e:
print(f"❌ Error creating test database: {e}")
return False
def ensure_data_directory():
"""Ensure the data directory exists"""
data_dir = Path("data")
if not data_dir.exists():
print(f"📁 Creating data directory: {data_dir}")
data_dir.mkdir(parents=True)
else:
print(f"📁 Data directory exists: {data_dir}")
def main():
"""Main setup function"""
parser = argparse.ArgumentParser(description='Create databases for INDEX-SISTEM-JOCURI')
parser.add_argument('--clear-existing', '-c', action='store_true',
help='Remove existing databases before creating new ones')
parser.add_argument('--main-only', action='store_true',
help='Create only the main activities database')
parser.add_argument('--test-only', action='store_true',
help='Create only the test database')
args = parser.parse_args()
print("🚀 DATABASE SETUP - INDEX-SISTEM-JOCURI")
print("=" * 50)
# Ensure data directory exists
ensure_data_directory()
success_count = 0
total_count = 0
if args.test_only:
total_count = 1
if create_test_database(clear=args.clear_existing):
success_count += 1
elif args.main_only:
total_count = 1
if create_main_database(clear=args.clear_existing):
success_count += 1
else:
# Create all databases
databases = [
("Main activities", lambda: create_main_database(clear=args.clear_existing)),
("Game library", lambda: create_game_library_database(clear=args.clear_existing)),
("Test activities", lambda: create_test_database(clear=args.clear_existing))
]
total_count = len(databases)
for name, create_func in databases:
print(f"\n📂 Creating {name} database...")
try:
if create_func():
success_count += 1
except Exception as e:
print(f"❌ Failed to create {name} database: {e}")
print("\n" + "=" * 50)
print(f"🎯 SUMMARY: {success_count}/{total_count} databases created successfully")
if success_count == total_count:
print("✅ All databases ready!")
print("\nNext steps:")
print("1. Run indexer: cd src && python indexer.py --clear-db")
print("2. Start web app: cd src && python app.py")
else:
print("⚠️ Some databases failed to create. Check errors above.")
return 1
return 0
if __name__ == '__main__':
sys.exit(main())

361
scripts/extract_common.py Normal file

@@ -0,0 +1,361 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
extract_common.py — single home for per-format text extraction.
Every extractor returns a plain text *body* with synthetic page markers
(`--- PAGE N ---`). The file-level header (`SOURCE:` / `CONVERTED:`) is added
by normalize_sources.py, not here.
Critical fix vs. the old pdf_to_text_converter.py: there is NO `max_pages` cap.
Large books are extracted in full.
"""
from __future__ import annotations
import hashlib
import importlib
import os
import re
import shutil
import subprocess
import tempfile
import zipfile
from pathlib import Path
from typing import Callable
PAGE_MARKER_RE = re.compile(r"^--- PAGE (\d+) ---\s*$", re.MULTILINE)
# paragraphs per synthetic page for paginated-by-flow formats (docx)
DOCX_PARAS_PER_PAGE = 40
# formats we deliberately ignore (epub duplicates existing PDFs — plan §1)
IGNORED_EXTENSIONS = {".epub"}
# obvious junk filenames skipped during a walk
JUNK_NAMES = {"desktop.ini", "linkuri-jocuri.txt"}
JUNK_SUFFIXES = {".bak", ".tmp", ".ini"}
# --------------------------------------------------------------------------
# page assembly helpers
# --------------------------------------------------------------------------
def join_pages(pages: list[str], start: int = 1) -> str:
    """Join a list of page texts into a body string with `--- PAGE N ---`."""
    out: list[str] = []
    for i, text in enumerate(pages, start):
        out.append(f"\n--- PAGE {i} ---\n{(text or '').strip()}\n")
    return "".join(out)


def split_pages(body: str) -> list[tuple[int, str]]:
    """Inverse of join_pages: parse a body into [(page_number, text), ...]."""
    matches = list(PAGE_MARKER_RE.finditer(body))
    if not matches:
        return []
    pages: list[tuple[int, str]] = []
    for idx, m in enumerate(matches):
        num = int(m.group(1))
        seg_start = m.end()
        seg_end = matches[idx + 1].start() if idx + 1 < len(matches) else len(body)
        pages.append((num, body[seg_start:seg_end].strip()))
    return pages


def count_page_markers(body: str) -> int:
    return len(PAGE_MARKER_RE.findall(body))
# --------------------------------------------------------------------------
# format detection
# --------------------------------------------------------------------------
FORMAT_BY_EXT = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".doc": "doc",
    ".pptx": "pptx",
    # legacy binary .ppt is routed to the pptx extractor; python-pptx reads
    # only the .pptx format, so extraction of these files is expected to fail
    ".ppt": "pptx",
    ".htm": "html",
    ".html": "html",
    ".zip": "zip",
    ".epub": "epub",
    ".txt": "txt",
}


def detect_format(path: str | os.PathLike) -> str:
    """Return a format key for a path based on its extension."""
    ext = Path(path).suffix.lower()
    return FORMAT_BY_EXT.get(ext, "unknown")


def is_junk(path: str | os.PathLike) -> bool:
    p = Path(path)
    name = p.name.lower()
    if name in JUNK_NAMES:
        return True
    if name.startswith("readme") and p.suffix.lower() == ".md":
        return True
    if p.suffix.lower() in JUNK_SUFFIXES:
        return True
    return False
# --------------------------------------------------------------------------
# content hashing + near-duplicate elimination
# --------------------------------------------------------------------------
def _normalize_for_hash(text: str) -> str:
    return re.sub(r"\s+", " ", (text or "")).strip().lower()


def content_hash(text: str) -> str:
    """Stable SHA1 of whitespace-normalized text — used for exact-dup detection."""
    return hashlib.sha1(_normalize_for_hash(text).encode("utf-8")).hexdigest()


def near_duplicate_ratio(a: str, b: str) -> float:
    """Similarity score in [0, 100] between two texts (rapidfuzz token ratio)."""
    from rapidfuzz import fuzz

    return fuzz.token_sort_ratio(_normalize_for_hash(a), _normalize_for_hash(b))


def dedupe_texts(
    items: list[tuple[str, str]], threshold: float = 95.0
) -> list[tuple[str, str]]:
    """
    Drop exact and near-duplicate texts from a list of (key, text) pairs.
    Used for HTML mirror pages (print copies, repeated index/footer pages).
    Keeps the first occurrence; O(n) on exact hash, O(n*k) fuzzy only against
    already-kept items.
    """
    kept: list[tuple[str, str]] = []
    seen_hashes: set[str] = set()
    for key, text in items:
        h = content_hash(text)
        if h in seen_hashes:
            continue
        if any(near_duplicate_ratio(text, kt) >= threshold for _, kt in kept):
            continue
        seen_hashes.add(h)
        kept.append((key, text))
    return kept
# --------------------------------------------------------------------------
# preflight dependency check
# --------------------------------------------------------------------------
REQUIRED_PYTHON_MODULES = {
    "pdfplumber": "pdfplumber",
    "PyPDF2": "pypdf2",
    "docx": "python-docx",
    "pptx": "python-pptx",
    "bs4": "beautifulsoup4",
    "lxml": "lxml",
    "jsonschema": "jsonschema",
    "rapidfuzz": "rapidfuzz",
    "chardet": "chardet",
}


def preflight(check_ocr: bool = False) -> dict:
    """
    Check system + Python dependencies before a long normalization run.
    Returns {'ok': bool, 'missing_python': [...], 'missing_system': [...],
    'warnings': [...]}. libreoffice is a *warning* (only .doc needs it),
    tesseract only checked when check_ocr=True.
    """
    missing_python: list[str] = []
    for module, pip_name in REQUIRED_PYTHON_MODULES.items():
        try:
            importlib.import_module(module)
        except ImportError:
            missing_python.append(pip_name)
    warnings: list[str] = []
    missing_system: list[str] = []
    if not (shutil.which("libreoffice") or shutil.which("soffice")):
        warnings.append("libreoffice not found — legacy .doc files cannot be converted")
    if check_ocr and not shutil.which("tesseract"):
        missing_system.append("tesseract (OCR requested but not installed)")
    return {
        "ok": not missing_python and not missing_system,
        "missing_python": missing_python,
        "missing_system": missing_system,
        "warnings": warnings,
    }
# --------------------------------------------------------------------------
# per-format extractors
# --------------------------------------------------------------------------
def extract_pdf(path: str | os.PathLike) -> str:
    """PDF → body. pdfplumber primary, PyPDF2 fallback. No page cap."""
    path = str(path)
    try:
        return _extract_pdf_pdfplumber(path)
    except Exception:
        return _extract_pdf_pypdf2(path)


def _extract_pdf_pdfplumber(path: str) -> str:
    import pdfplumber

    pages: list[str] = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:  # ALL pages — no max_pages
            try:
                pages.append(page.extract_text() or "")
            except Exception:
                pages.append("")
    return join_pages(pages)


def _extract_pdf_pypdf2(path: str) -> str:
    import PyPDF2

    pages: list[str] = []
    with open(path, "rb") as fh:
        reader = PyPDF2.PdfReader(fh)
        for page in reader.pages:  # ALL pages — no max_pages
            try:
                pages.append(page.extract_text() or "")
            except Exception:
                pages.append("")
    return join_pages(pages)
def extract_docx(path: str | os.PathLike) -> str:
    """docx → body. Synthetic page marker every DOCX_PARAS_PER_PAGE paragraphs."""
    import docx

    document = docx.Document(str(path))
    paragraphs = [p.text for p in document.paragraphs]
    pages: list[str] = []
    for i in range(0, max(len(paragraphs), 1), DOCX_PARAS_PER_PAGE):
        chunk = paragraphs[i : i + DOCX_PARAS_PER_PAGE]
        pages.append("\n".join(chunk))
    return join_pages(pages)


def extract_doc(path: str | os.PathLike) -> str:
    """
    Legacy .doc → body via `libreoffice --headless --convert-to docx`.
    Raises RuntimeError if libreoffice is unavailable — the caller marks the
    resulting source `needs_review` regardless (conversion is imperfect).
    """
    soffice = shutil.which("libreoffice") or shutil.which("soffice")
    if not soffice:
        raise RuntimeError("libreoffice/soffice not available — cannot convert .doc")
    src = Path(path).resolve()
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(
            [soffice, "--headless", "--convert-to", "docx", "--outdir", tmp, str(src)],
            check=True,
            capture_output=True,
            timeout=300,
        )
        converted = Path(tmp) / (src.stem + ".docx")
        if not converted.exists():
            raise RuntimeError(f"libreoffice produced no output for {src.name}")
        return extract_docx(converted)
def extract_pptx(path: str | os.PathLike) -> str:
    """pptx → body. One page per slide: title + body text + speaker notes."""
    from pptx import Presentation

    presentation = Presentation(str(path))
    pages: list[str] = []
    for slide in presentation.slides:
        parts: list[str] = []
        for shape in slide.shapes:
            if shape.has_text_frame and shape.text_frame.text.strip():
                parts.append(shape.text_frame.text.strip())
        if slide.has_notes_slide:
            notes = slide.notes_slide.notes_text_frame.text.strip()
            if notes:
                parts.append(f"[NOTES] {notes}")
        pages.append("\n".join(parts))
    return join_pages(pages)


def extract_html(path: str | os.PathLike) -> str:
    """HTML mirror page → body. Strips nav/script/style/footer/header/aside."""
    import chardet
    from bs4 import BeautifulSoup

    raw = Path(path).read_bytes()
    enc = chardet.detect(raw).get("encoding") or "utf-8"
    soup = BeautifulSoup(raw.decode(enc, errors="replace"), "lxml")
    for tag in soup(["script", "style", "nav", "footer", "header", "aside", "noscript"]):
        tag.decompose()
    # also drop common page chrome identified by ARIA role
    for tag in soup.find_all(attrs={"role": ["navigation", "banner", "contentinfo"]}):
        tag.decompose()
    text = soup.get_text(separator="\n")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return join_pages(["\n".join(lines)])
def extract_zip(path: str | os.PathLike) -> str:
    """
    zip → body. Unzips into a temp dir and recurses on every extractable inner
    file. Inner files are page-renumbered into one continuous body.
    """
    path = str(path)
    pages: list[str] = []
    with tempfile.TemporaryDirectory() as tmp:
        try:
            with zipfile.ZipFile(path) as zf:
                zf.extractall(tmp)
        except zipfile.BadZipFile:
            return ""
        for inner in sorted(Path(tmp).rglob("*")):
            if not inner.is_file() or is_junk(inner):
                continue
            fmt = detect_format(inner)
            if fmt in ("unknown", "epub", "zip"):
                # nested zips handled by recursion below
                if fmt == "zip":
                    body = extract_zip(inner)
                    pages.extend(t for _, t in split_pages(body))
                continue
            try:
                body = extract_file(inner)
            except Exception:
                continue
            pages.extend(t for _, t in split_pages(body))
    return join_pages(pages)
EXTRACTORS: dict[str, Callable[[str | os.PathLike], str]] = {
    "pdf": extract_pdf,
    "docx": extract_docx,
    "doc": extract_doc,
    "pptx": extract_pptx,
    "html": extract_html,
    "zip": extract_zip,
}


def extract_file(path: str | os.PathLike) -> str:
    """Dispatch a single file to the right extractor. Returns a page-marked body."""
    fmt = detect_format(path)
    if fmt == "txt":
        body = Path(path).read_text(encoding="utf-8", errors="replace")
        # already paginated? pass through; else wrap as one page
        return body if count_page_markers(body) else join_pages([body])
    extractor = EXTRACTORS.get(fmt)
    if extractor is None:
        raise ValueError(f"No extractor for format '{fmt}': {path}")
    return extractor(path)
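
The page-marker contract between `join_pages` and `split_pages`, and the whitespace-insensitive behaviour of `content_hash`, can be checked with a short standalone sketch. The functions below are condensed restatements of the ones in this file so that the example runs on its own; the sample page texts are made up.

```python
import hashlib
import re

PAGE_MARKER_RE = re.compile(r"^--- PAGE (\d+) ---\s*$", re.MULTILINE)


def join_pages(pages, start=1):
    """Concatenate page texts, each preceded by a `--- PAGE N ---` marker."""
    return "".join(
        f"\n--- PAGE {i} ---\n{(text or '').strip()}\n"
        for i, text in enumerate(pages, start)
    )


def split_pages(body):
    """Parse a marked body back into [(page_number, text), ...]."""
    matches = list(PAGE_MARKER_RE.finditer(body))
    pages = []
    for idx, m in enumerate(matches):
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(body)
        pages.append((int(m.group(1)), body[m.end():end].strip()))
    return pages


def content_hash(text):
    """SHA1 of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text or "").strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Made-up page texts: surrounding whitespace is stripped, empty pages survive.
body = join_pages(["Capture the flag", "  Knot relay  ", ""])
round_trip = split_pages(body)
same_hash = content_hash("Knot  Relay\n") == content_hash("knot relay")
print(round_trip, same_hash)
```

Because empty pages keep their marker, page numbers stay aligned with the source document even when a page yields no text.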


@@ -1,424 +0,0 @@
#!/usr/bin/env python3
"""
HTML Activity Extractor - processes 1876 HTML files
Automatically extracts activities using pattern recognition
"""
import os
import re
import json
from pathlib import Path
from bs4 import BeautifulSoup
import chardet
from typing import List, Dict, Optional
import sqlite3
from datetime import datetime
class HTMLActivityExtractor:
def __init__(self, db_path='data/activities.db'):
self.db_path = db_path
# Patterns for detecting activities in Romanian
self.activity_patterns = {
'title_patterns': [
r'(?i)(joc|activitate|exerci[t]iu|team[\s-]?building|energizer|ice[\s-]?breaker)[\s:]+([^\.]{5,100})',
r'(?i)<h[1-6][^>]*>([^<]*(?:joc|activitate|exerci[t]iu)[^<]*)</h[1-6]>',
r'(?i)<strong>([^<]*(?:joc|activitate|exerci[t]iu)[^<]*)</strong>',
r'(?i)^[\d]+\.?\s*([A-Z][^\.]{10,100}(?:joc|activitate|exerci[t]iu)[^\.]{0,50})$',
],
'description_markers': [
'descriere', 'reguli', 'cum se joac[a]', 'instructiuni',
'obiectiv', 'desfasurare', 'explicatie', 'mod de joc'
],
'materials_markers': [
'materiale', 'necesare', 'echipament', 'ce avem nevoie',
'se folosesc', 'trebuie sa avem', 'dotari'
],
'age_patterns': [
r'(?i)v[âa]rst[a][\s:]+(\d+)[\s-]+(\d+)',
r'(?i)(\d+)[\s-]+(\d+)\s*ani',
r'(?i)pentru\s+(\d+)[\s-]+(\d+)\s*ani',
r'(?i)categoria?\s*(?:de\s*)?v[âa]rst[a][\s:]+(\d+)[\s-]+(\d+)',
],
'participants_patterns': [
r'(?i)(\d+)[\s-]+(\d+)\s*(?:participan[t]i|juc[a]tori|persoane|copii)',
r'(?i)num[a]r\s*(?:de\s*)?(?:participan[t]i|juc[a]tori)[\s:]+(\d+)[\s-]+(\d+)',
r'(?i)grup\s*de\s*(\d+)[\s-]+(\d+)',
],
'duration_patterns': [
r'(?i)durat[a][\s:]+(\d+)[\s-]+(\d+)\s*(?:minute|min)',
r'(?i)timp[\s:]+(\d+)[\s-]+(\d+)\s*(?:minute|min)',
r'(?i)(\d+)[\s-]+(\d+)\s*minute',
]
}
# Predefined categories based on the existing system
self.categories = {
'[A]': ['joc', 'joaca', 'distractie', 'amuzament'],
'[B]': ['aventura', 'explorare', 'descoperire'],
'[C]': ['camping', 'tabara', 'excursie', 'drumetie'],
'[D]': ['foc', 'flacara', 'lumina'],
'[E]': ['noduri', 'frânghii', 'sfori', 'legare'],
'[F]': ['bushcraft', 'supravietuire', 'survival'],
'[G]': ['educatie', 'educativ', 'invatare', 'scoala'],
'[H]': ['orientare', 'busola', 'harta', 'navigare']
}
def detect_encoding(self, file_path):
"""Detect the file's encoding"""
with open(file_path, 'rb') as f:
result = chardet.detect(f.read())
return result['encoding'] or 'utf-8'
def extract_from_html(self, html_path: str) -> List[Dict]:
"""Extract activities from a single HTML file"""
activities = []
try:
# Detect the encoding and read the file
encoding = self.detect_encoding(html_path)
with open(html_path, 'r', encoding=encoding, errors='ignore') as f:
content = f.read()
soup = BeautifulSoup(content, 'lxml')
# Method 1: look for lists of activities
activities.extend(self._extract_from_lists(soup, html_path))
# Method 2: look for activities in headings
activities.extend(self._extract_from_headings(soup, html_path))
# Method 3: look for patterns in the text
activities.extend(self._extract_from_patterns(soup, html_path))
# Method 4: look in tables
activities.extend(self._extract_from_tables(soup, html_path))
except Exception as e:
print(f"Error processing {html_path}: {e}")
return activities
def _extract_from_lists(self, soup, source_file):
"""Extract activities from HTML lists (ul, ol)"""
activities = []
for list_elem in soup.find_all(['ul', 'ol']):
# Check whether the list appears to contain activities
list_text = list_elem.get_text().lower()
if any(marker in list_text for marker in ['joc', 'activitate', 'exercitiu']):
for li in list_elem.find_all('li'):
text = li.get_text(strip=True)
if len(text) > 20: # At least 20 characters for a valid activity
activity = self._create_activity_from_text(text, source_file)
if activity:
activities.append(activity)
return activities
def _extract_from_headings(self, soup, source_file):
"""Extract activities based on headings"""
activities = []
for heading in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
heading_text = heading.get_text(strip=True)
# Check whether the heading contains keywords
if any(keyword in heading_text.lower() for keyword in ['joc', 'activitate', 'exercitiu']):
# Look for the description in the following elements
description = ""
next_elem = heading.find_next_sibling()
while next_elem and next_elem.name not in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
if next_elem.name in ['p', 'div', 'ul']:
description += next_elem.get_text(strip=True) + " "
if len(description) > 500: # Limit the description length
break
next_elem = next_elem.find_next_sibling()
if description:
activity = {
'name': heading_text[:200],
'description': description[:1000],
'source_file': str(source_file),
'category': self._detect_category(heading_text + " " + description)
}
activities.append(activity)
return activities
def _extract_from_patterns(self, soup, source_file):
"""Extract activities using pattern matching"""
activities = []
text = soup.get_text()
# Look for activity patterns
for pattern in self.activity_patterns['title_patterns']:
matches = re.finditer(pattern, text, re.MULTILINE)
for match in matches:
title = match.group(0) if match.lastindex == 0 else match.group(match.lastindex)
if len(title) > 10:
# Extract context around the match
start = max(0, match.start() - 200)
end = min(len(text), match.end() + 500)
context = text[start:end]
activity = self._create_activity_from_text(context, source_file, title)
if activity:
activities.append(activity)
return activities
def _extract_from_tables(self, soup, source_file):
"""Extract activities from tables"""
activities = []
for table in soup.find_all('table'):
rows = table.find_all('tr')
if len(rows) > 1: # At least a header and one data row
# Detect the relevant columns
headers = [th.get_text(strip=True).lower() for th in rows[0].find_all(['th', 'td'])]
for row in rows[1:]:
cells = row.find_all(['td'])
if cells:
activity_data = {}
for i, cell in enumerate(cells):
if i < len(headers):
activity_data[headers[i]] = cell.get_text(strip=True)
# Create an activity from the table data
if any(key in activity_data for key in ['joc', 'activitate', 'nume', 'titlu']):
activity = self._create_activity_from_table_data(activity_data, source_file)
if activity:
activities.append(activity)
return activities
def _create_activity_from_text(self, text, source_file, title=None):
"""Create an activity dictionary from text"""
if not text or len(text) < 30:
return None
activity = {
'name': title or text[:100].split('.')[0].strip(),
'description': text[:1000],
'source_file': str(source_file),
'category': self._detect_category(text),
'keywords': self._extract_keywords(text),
'created_at': datetime.now().isoformat()
}
# Extract additional metadata
activity.update(self._extract_metadata(text))
return activity
def _create_activity_from_table_data(self, data, source_file):
"""Create an activity from table data"""
activity = {
'source_file': str(source_file),
'created_at': datetime.now().isoformat()
}
# Map table fields to DB fields
field_mapping = {
'nume': 'name', 'titlu': 'name', 'joc': 'name', 'activitate': 'name',
'descriere': 'description', 'detalii': 'description', 'explicatie': 'description',
'materiale': 'materials_list', 'echipament': 'materials_list',
'varsta': 'age_group_min', 'categoria': 'category',
'participanti': 'participants_min', 'numar': 'participants_min',
'durata': 'duration_min', 'timp': 'duration_min'
}
for table_field, db_field in field_mapping.items():
if table_field in data:
activity[db_field] = data[table_field]
# Minimal validation
if 'name' in activity and len(activity.get('name', '')) > 5:
return activity
return None
def _extract_metadata(self, text):
"""Extract metadata from text using patterns"""
metadata = {}
# Extract the age range
for pattern in self.activity_patterns['age_patterns']:
match = re.search(pattern, text)
if match:
metadata['age_group_min'] = int(match.group(1))
metadata['age_group_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
break
# Extract the number of participants
for pattern in self.activity_patterns['participants_patterns']:
match = re.search(pattern, text)
if match:
metadata['participants_min'] = int(match.group(1))
metadata['participants_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
break
# Extract the duration
for pattern in self.activity_patterns['duration_patterns']:
match = re.search(pattern, text)
if match:
metadata['duration_min'] = int(match.group(1))
metadata['duration_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
break
# Extract materials
materials = []
text_lower = text.lower()
for marker in self.activity_patterns['materials_markers']:
idx = text_lower.find(marker)
if idx != -1:
# Take the 200 characters starting at the marker
materials_text = text[idx:idx+200]
# Extract items from the list
items = re.findall(r'[-"]\s*([^\n-"]+)', materials_text)
if items:
materials.extend(items)
if materials:
metadata['materials_list'] = ', '.join(materials[:10]) # At most 10 materials
return metadata
def _detect_category(self, text):
"""Detect the activity's category based on keywords"""
text_lower = text.lower()
for category, keywords in self.categories.items():
if any(keyword in text_lower for keyword in keywords):
return category
return '[A]' # Default to the games category
def _extract_keywords(self, text):
"""Extract keywords from text"""
keywords = []
text_lower = text.lower()
# List of relevant keywords
keyword_list = [
'cooperare', 'competitie', 'echipa', 'creativitate', 'miscare',
'strategie', 'comunicare', 'incredere', 'coordonare', 'atentie',
'reflexe', 'logica', 'imaginatie', 'muzica', 'dans', 'sport',
'natura', 'mediu', 'stiinta', 'matematica', 'limba', 'cultura'
]
for keyword in keyword_list:
if keyword in text_lower:
keywords.append(keyword)
return ', '.join(keywords[:5]) # At most 5 keywords
def save_to_database(self, activities):
"""Save the activities to the database"""
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
saved_count = 0
duplicate_count = 0
for activity in activities:
try:
# Check for duplicates
cursor.execute(
"SELECT id FROM activities WHERE name = ? AND source_file = ?",
(activity.get('name'), activity.get('source_file'))
)
if cursor.fetchone():
duplicate_count += 1
continue
# Prepare the values for the insert
columns = []
values = []
placeholders = []
for key, value in activity.items():
if key != 'created_at': # Skip created_at, it has default
columns.append(key)
values.append(value)
placeholders.append('?')
# Insert into the DB
query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
cursor.execute(query, values)
saved_count += 1
except Exception as e:
print(f"Error saving activity: {e}")
continue
conn.commit()
conn.close()
return saved_count, duplicate_count
def process_all_html_files(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
"""Process all HTML files in the given directory"""
base_path = Path(base_path)
html_files = list(base_path.rglob("*.html"))
html_files.extend(list(base_path.rglob("*.htm")))
print(f"Found {len(html_files)} HTML files to process")
all_activities = []
processed = 0
errors = 0
for i, html_file in enumerate(html_files):
try:
activities = self.extract_from_html(str(html_file))
all_activities.extend(activities)
processed += 1
# Progress update
if (i + 1) % 100 == 0:
print(f"Progress: {i+1}/{len(html_files)} files processed, {len(all_activities)} activities found")
# Save batch to DB
if all_activities:
saved, dupes = self.save_to_database(all_activities)
print(f"Batch saved: {saved} new activities, {dupes} duplicates skipped")
all_activities = [] # Clear buffer
except Exception as e:
print(f"Error processing {html_file}: {e}")
errors += 1
# Save remaining activities
if all_activities:
saved, dupes = self.save_to_database(all_activities)
print(f"Final batch saved: {saved} new activities, {dupes} duplicates skipped")
print(f"\nProcessing complete!")
print(f"Files processed: {processed}")
print(f"Errors: {errors}")
return processed, errors
# Main function for testing
if __name__ == "__main__":
extractor = HTMLActivityExtractor()
# Test on a sample file first
print("Testing on sample file first...")
# Find an HTML file for testing
test_files = list(Path('/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri').rglob("*.html"))[:3]
for test_file in test_files:
print(f"\nTesting: {test_file}")
activities = extractor.extract_from_html(str(test_file))
print(f"Found {len(activities)} activities")
if activities:
print(f"Sample activity: {activities[0]['name'][:50]}...")
# Ask whether to continue with full processing
response = input("\nContinue with full processing? (y/n): ")
if response.lower() == 'y':
extractor.process_all_html_files()


@@ -1,78 +0,0 @@
#!/usr/bin/env python3
"""
Import activities extracted by Claude from JSON files
"""
import json
import sqlite3
from pathlib import Path
from datetime import datetime
class ClaudeActivityImporter:
def __init__(self, db_path='data/activities.db'):
self.db_path = db_path
self.json_dir = Path('scripts/extracted_activities')
self.json_dir.mkdir(exist_ok=True)
def import_json_file(self, json_path):
"""Import activities from a single JSON file"""
with open(json_path, 'r', encoding='utf-8') as f:
data = json.load(f)
source_file = data.get('source_file', str(json_path))
activities = data.get('activities', [])
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
imported = 0
for activity in activities:
try:
# Add source file and timestamp
activity['source_file'] = source_file
activity['created_at'] = datetime.now().isoformat()
# Prepare insert
columns = list(activity.keys())
values = list(activity.values())
placeholders = ['?' for _ in values]
# Check for duplicate
cursor.execute(
"SELECT id FROM activities WHERE name = ? AND source_file = ?",
(activity.get('name'), source_file)
)
if not cursor.fetchone():
query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
cursor.execute(query, values)
imported += 1
except Exception as e:
print(f"Error importing activity: {e}")
conn.commit()
conn.close()
print(f"Imported {imported} activities from {json_path.name}")
return imported
def import_all_json_files(self):
"""Import all JSON files from the extracted_activities directory"""
json_files = list(self.json_dir.glob("*.json"))
if not json_files:
print("No JSON files found in extracted_activities directory")
return 0
total_imported = 0
for json_file in json_files:
imported = self.import_json_file(json_file)
total_imported += imported
print(f"\nTotal imported: {total_imported} activities from {len(json_files)} files")
return total_imported
if __name__ == "__main__":
importer = ClaudeActivityImporter()
importer.import_all_json_files()

179
scripts/import_common.py Normal file

@@ -0,0 +1,179 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
import_common.py — shared helpers for the import / validation side of the
extraction pipeline (Lane C).
Used by build_database.py and validate_extractions.py:
* JSON-schema validation of subagent extraction files,
* the anti-hallucination source_excerpt substring check (E5),
* locating the source chunk that an extraction file came from,
* the stable content key used by the needs_review queue.
"""
from __future__ import annotations
import hashlib
import json
import re
import unicodedata
from pathlib import Path
from typing import Any, Optional
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
DEFAULT_SCHEMA_PATH = SCRIPT_DIR / "activity_schema.json"
# rapidfuzz.partial_ratio is on a 0..100 scale; an excerpt counts as a real
# quote from the source when it scores at least this against the chunk text.
EXCERPT_MATCH_THRESHOLD = 90.0
# --------------------------------------------------------------------------
# schema validation
# --------------------------------------------------------------------------
def load_schema(schema_path: str | Path = DEFAULT_SCHEMA_PATH) -> dict:
"""Load the activity JSON schema produced by Lane A."""
return json.loads(Path(schema_path).read_text(encoding="utf-8"))
def validate_extraction(data: Any, schema: dict) -> list[str]:
"""
Validate one parsed extraction file against `schema`.
Returns a list of human-readable error strings; empty list == valid.
"""
import jsonschema
validator = jsonschema.Draft7Validator(schema)
errors: list[str] = []
for err in sorted(validator.iter_errors(data), key=lambda e: list(e.path)):
location = "/".join(str(p) for p in err.path) or "<root>"
errors.append(f"{location}: {err.message}")
return errors
# --------------------------------------------------------------------------
# excerpt verification (E5 — anti-hallucination)
# --------------------------------------------------------------------------
def _normalize_text(text: str) -> str:
return re.sub(r"\s+", " ", (text or "")).strip().lower()
def excerpt_score(excerpt: str, chunk_text: str) -> float:
"""Best fuzzy-substring score (0..100) of `excerpt` inside `chunk_text`."""
from rapidfuzz import fuzz
if not excerpt or not chunk_text:
return 0.0
return float(fuzz.partial_ratio(_normalize_text(excerpt), _normalize_text(chunk_text)))
def excerpt_matches(
excerpt: str, chunk_text: str, threshold: float = EXCERPT_MATCH_THRESHOLD
) -> bool:
"""True when `excerpt` appears (fuzzily) as a substring of `chunk_text`."""
return excerpt_score(excerpt, chunk_text) >= threshold
# --------------------------------------------------------------------------
# locating the source chunk an extraction file came from
# --------------------------------------------------------------------------
def chunk_key_for(json_path: Path, header: Optional[dict]) -> str:
"""
Resolve the chunk key for an extraction file.
Prefers the explicit `chunk_key` in the header, otherwise falls back to the
JSON file stem (extraction files are named `<chunk_key>.json`).
"""
if header and header.get("chunk_key"):
return str(header["chunk_key"])
return json_path.stem
def source_id_for(chunk_key: str, header: Optional[dict]) -> str:
"""Resolve the source id; `<source_id>.partNN` → `<source_id>`."""
if header and header.get("source_id"):
return str(header["source_id"])
# chunk keys look like "<source_id>.partNN"
return chunk_key.rsplit(".part", 1)[0]
def find_chunk_text(
json_path: Path, header: Optional[dict], chunks_dir: Path
) -> Optional[str]:
"""
Return the text of the source chunk for an extraction file, or None.
Looks for data/chunks/<source_id>/<chunk_key>.txt, then falls back to a
recursive glob on the chunk key.
"""
chunk_key = chunk_key_for(json_path, header)
source_id = source_id_for(chunk_key, header)
candidate = chunks_dir / source_id / f"{chunk_key}.txt"
if candidate.is_file():
return candidate.read_text(encoding="utf-8", errors="replace")
matches = list(chunks_dir.rglob(f"{chunk_key}.txt"))
if matches:
return matches[0].read_text(encoding="utf-8", errors="replace")
return None
def source_path_for(source_id: str, sources_dir: Path) -> Optional[str]:
"""
Read the original `SOURCE:` path from a normalized source header.
data/sources/<source_id>.txt starts with a `SOURCE: <relative path>` line.
"""
src_file = sources_dir / f"{source_id}.txt"
if not src_file.is_file():
return None
try:
with src_file.open(encoding="utf-8", errors="replace") as fh:
for line in fh:
if line.startswith("SOURCE:"):
return line.split(":", 1)[1].strip()
if line.startswith("=") or line.startswith("--- PAGE "):
break
except OSError:
return None
return None
# --------------------------------------------------------------------------
# stable content key for the needs_review queue (plan §5c)
# --------------------------------------------------------------------------
def normalize_name(name: str) -> str:
"""Diacritic-free, lowercased, whitespace-collapsed name (dedup key)."""
if not name:
return ""
decomposed = unicodedata.normalize("NFKD", name)
ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
return re.sub(r"\s+", " ", ascii_str.lower().strip())
def content_key(normalized_name: str, language: Optional[str], description: str) -> str:
"""
Stable hash identifying a row for the review queue.
Only borderline-kept-separate rows and legacy `.doc` rows ever carry
needs_review, and neither is auto-merged — so their (normalized_name,
language, description) triple is stable across rebuilds.
"""
payload = f"{normalized_name}\x1f{language or ''}\x1f{_normalize_text(description)}"
return hashlib.sha1(payload.encode("utf-8")).hexdigest()
# --------------------------------------------------------------------------
# iteration
# --------------------------------------------------------------------------
def iter_extraction_files(extracted_dir: Path):
"""Yield every *.json directly under `extracted_dir` (skips _rejected/)."""
if not extracted_dir.is_dir():
return
for path in sorted(extracted_dir.glob("*.json")):
if path.is_file():
yield path
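
As a rough illustration of how the review-queue key in this file behaves, the sketch below restates `normalize_name` and `content_key` in standalone form so it runs on its own. The activity names and descriptions are made-up sample data, not rows from the corpus. It shows that diacritics, letter case and extra whitespace in the name do not change the resulting key.

```python
import hashlib
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Diacritic-free, lowercased, whitespace-collapsed name."""
    if not name:
        return ""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", ascii_str.lower().strip())


def content_key(normalized_name: str, language, description: str) -> str:
    """SHA-1 over name, language and whitespace-normalized description."""
    desc = re.sub(r"\s+", " ", (description or "")).strip().lower()
    payload = f"{normalized_name}\x1f{language or ''}\x1f{desc}"
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Hypothetical sample rows: the same activity with different spelling and spacing.
key_a = content_key(normalize_name("Vânătoarea  de Comori"), "ro", "Echipele caută\nindicii.")
key_b = content_key(normalize_name("vanatoarea de comori"), "ro", "echipele caută indicii.")
same_key = key_a == key_b
```

Note that only the name is stripped of diacritics; the description is merely lowercased and whitespace-collapsed, so a description re-typed without diacritics would produce a different key.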


@@ -0,0 +1,255 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
normalize_sources.py — walk data/carti-camp-jocuri/ and write data/sources/<id>.txt.
Output files keep the existing header format:
SOURCE: <original relative path>
CONVERTED: <iso date>
FORMAT: <pdf|docx|doc|pptx|html-mirror|zip>
NEEDS_REVIEW: <reason> (optional — legacy .doc conversions)
==================================================
--- PAGE 1 ---
...
Each source gets a stable id = <8-hex hash of relative path>_<sanitized stem>,
so two files with the same name in different folders never collide.
The pipeline is script-only: this normalizes formats, it does NOT run extraction.
Run `--check-deps` before a long job.
"""
from __future__ import annotations
import argparse
import datetime as _dt
import hashlib
import re
import sys
from pathlib import Path
SCRIPT_DIR = Path(__file__).resolve().parent
if str(SCRIPT_DIR) not in sys.path:
sys.path.insert(0, str(SCRIPT_DIR))
from extract_common import ( # noqa: E402
count_page_markers,
dedupe_texts,
detect_format,
extract_file,
extract_html,
is_junk,
join_pages,
preflight,
split_pages,
)
HEADER_RULE = "=" * 50
# --------------------------------------------------------------------------
# stable source id
# --------------------------------------------------------------------------
def sanitize_stem(stem: str) -> str:
s = re.sub(r"[^\w]+", "_", stem, flags=re.UNICODE).strip("_").lower()
return s[:60] or "source"
def stable_id(relative_path: str | Path) -> str:
"""Collision-proof id derived from the path relative to the corpus root."""
rel = str(relative_path).replace("\\", "/")
digest = hashlib.sha1(rel.encode("utf-8")).hexdigest()[:8]
stem = sanitize_stem(Path(rel).stem)
return f"{digest}_{stem}"
# --------------------------------------------------------------------------
# header
# --------------------------------------------------------------------------
def build_header(
source_rel: str, fmt: str, needs_review: str | None = None
) -> str:
today = _dt.date.today().isoformat()
lines = [
f"SOURCE: {source_rel}",
f"CONVERTED: {today}",
f"FORMAT: {fmt}",
]
if needs_review:
lines.append(f"NEEDS_REVIEW: {needs_review}")
lines.append(HEADER_RULE)
return "\n".join(lines) + "\n\n"
# --------------------------------------------------------------------------
# mirror-site directories
# --------------------------------------------------------------------------
MIRROR_PAGE_EXTS = {".html", ".htm"}
def is_mirror_dir(path: Path) -> bool:
"""A directory counts as a site mirror if it contains HTML pages."""
if not path.is_dir():
return False
if path.name.endswith("_files"):
return False
return any(
p.suffix.lower() in MIRROR_PAGE_EXTS
for p in path.rglob("*")
if p.is_file()
)
def normalize_mirror(mirror_dir: Path) -> str:
"""Extract every HTML page in a mirror dir, dedupe near-duplicates, join."""
pages: list[tuple[str, str]] = []
for html in sorted(mirror_dir.rglob("*")):
if not html.is_file() or html.suffix.lower() not in MIRROR_PAGE_EXTS:
continue
if "_files" in html.parts:
continue
try:
body = extract_html(html)
except Exception:
continue
text = "\n".join(t for _, t in split_pages(body))
if text.strip():
pages.append((str(html.relative_to(mirror_dir)), text))
pages = dedupe_texts(pages)
return join_pages([t for _, t in pages])
# --------------------------------------------------------------------------
# one source
# --------------------------------------------------------------------------
def normalize_one(
path: Path, corpus_root: Path, out_dir: Path
) -> dict | None:
"""
Normalize a single file or mirror directory → data/sources/<id>.txt.
Returns a result dict, or None if the entry was skipped (junk / ignored).
"""
rel = path.relative_to(corpus_root)
sid = stable_id(rel)
if path.is_dir():
if not is_mirror_dir(path):
return None
fmt, needs_review = "html-mirror", None
body = normalize_mirror(path)
else:
if is_junk(path):
return None
fmt = detect_format(path)
if fmt in ("unknown", "epub", "txt"):
return None # epub duplicates PDFs; txt is not a source format here
needs_review = "legacy .doc conversion is imperfect" if fmt == "doc" else None
try:
body = extract_file(path)
except Exception as exc: # noqa: BLE001
return {"id": sid, "source": str(rel), "status": "error", "error": str(exc)}
if not body.strip():
return {"id": sid, "source": str(rel), "status": "empty"}
out_path = out_dir / f"{sid}.txt"
out_path.write_text(build_header(str(rel), fmt, needs_review) + body,
encoding="utf-8")
return {
"id": sid,
"source": str(rel),
"status": "ok",
"format": fmt,
"pages": count_page_markers(body),
"needs_review": bool(needs_review),
}
# --------------------------------------------------------------------------
# walk
# --------------------------------------------------------------------------
def iter_corpus_entries(corpus_root: Path):
"""Yield top-level files and mirror directories under the corpus root."""
for entry in sorted(corpus_root.iterdir()):
if entry.name.startswith("."):
continue
if entry.is_dir():
if is_mirror_dir(entry):
yield entry
else:
yield entry
def run(corpus_root: Path, out_dir: Path) -> dict:
out_dir.mkdir(parents=True, exist_ok=True)
results: list[dict] = []
for entry in iter_corpus_entries(corpus_root):
res = normalize_one(entry, corpus_root, out_dir)
if res is not None:
results.append(res)
summary = {
"total": len(results),
"ok": sum(1 for r in results if r["status"] == "ok"),
"errors": sum(1 for r in results if r["status"] == "error"),
"empty": sum(1 for r in results if r["status"] == "empty"),
"needs_review": sum(1 for r in results if r.get("needs_review")),
"results": results,
}
return summary
# --------------------------------------------------------------------------
# CLI
# --------------------------------------------------------------------------
def print_preflight(report: dict) -> int:
print("Dependency preflight")
print("--------------------")
if report["missing_python"]:
print(" MISSING Python packages: " + ", ".join(report["missing_python"]))
else:
print(" Python packages: OK")
if report["missing_system"]:
print(" MISSING system tools : " + ", ".join(report["missing_system"]))
for w in report["warnings"]:
print(f" WARNING: {w}")
print(" => " + ("READY" if report["ok"] else "NOT READY — install the above"))
return 0 if report["ok"] else 1
def main(argv: list[str] | None = None) -> int:
parser = argparse.ArgumentParser(description="Normalize mixed sources to .txt")
parser.add_argument("--corpus", default="data/carti-camp-jocuri",
help="corpus root to walk")
parser.add_argument("--out", default="data/sources", help="output directory")
parser.add_argument("--check-deps", action="store_true",
help="run dependency preflight and exit")
parser.add_argument("--ocr", action="store_true",
help="include OCR (tesseract) in the preflight check")
args = parser.parse_args(argv)
if args.check_deps:
return print_preflight(preflight(check_ocr=args.ocr))
report = preflight(check_ocr=args.ocr)
if report["missing_python"]:
print_preflight(report)
return 1
for w in report["warnings"]:
print(f"WARNING: {w}")
summary = run(Path(args.corpus), Path(args.out))
print(f"normalized : {summary['ok']}/{summary['total']}")
print(f"errors : {summary['errors']}")
print(f"empty : {summary['empty']}")
print(f"needs_review: {summary['needs_review']}")
for r in summary["results"]:
if r["status"] != "ok":
print(f" [{r['status']}] {r['source']}")
return 0
if __name__ == "__main__":
raise SystemExit(main())
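
To make the source-id scheme above concrete, here is a minimal standalone sketch that copies `sanitize_stem` and `stable_id` from this file and applies them to two hypothetical corpus paths (the folder and file names are invented for illustration). Two files sharing a name in different folders get the same readable suffix but a different hash prefix.

```python
import hashlib
import re
from pathlib import Path


def sanitize_stem(stem: str) -> str:
    """Lowercase the stem and replace runs of non-word characters with '_'."""
    s = re.sub(r"[^\w]+", "_", stem, flags=re.UNICODE).strip("_").lower()
    return s[:60] or "source"


def stable_id(relative_path) -> str:
    """8-hex SHA-1 prefix of the relative path, plus the sanitized stem."""
    rel = str(relative_path).replace("\\", "/")
    digest = hashlib.sha1(rel.encode("utf-8")).hexdigest()[:8]
    return f"{digest}_{sanitize_stem(Path(rel).stem)}"


# Hypothetical paths: same file name in two different folders.
id_a = stable_id("scouting/Jocuri de tabara (2019).pdf")
id_b = stable_id("archive/Jocuri de tabara (2019).pdf")

# Windows-style separators are normalized before hashing.
id_a_windows = stable_id("scouting\\Jocuri de tabara (2019).pdf")
```

With an 8-hex prefix a collision between two different paths is very unlikely rather than strictly impossible, which is worth keeping in mind if the corpus grows substantially.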


@@ -1,143 +0,0 @@
#!/usr/bin/env python3
"""
PDF Mass Conversion to Text for Activity Extraction
Handles all PDF sizes efficiently with multiple fallback methods
"""
import os
import json
from pathlib import Path
import PyPDF2
import pdfplumber
from typing import List, Dict
import logging
class PDFConverter:
def __init__(self, max_pages=50):
self.max_pages = max_pages
self.conversion_stats = {}
def convert_pdf_to_text(self, pdf_path: str) -> str:
"""Convert PDF to text using multiple methods with fallbacks"""
try:
# Method 1: pdfplumber (best for tables and layout)
return self._convert_with_pdfplumber(pdf_path)
except Exception as e:
print(f"pdfplumber failed for {pdf_path}: {e}")
try:
# Method 2: PyPDF2 (fallback)
return self._convert_with_pypdf2(pdf_path)
except Exception as e2:
print(f"PyPDF2 also failed for {pdf_path}: {e2}")
return ""
def _convert_with_pdfplumber(self, pdf_path: str) -> str:
"""Primary conversion method using pdfplumber"""
text_content = ""
with pdfplumber.open(pdf_path) as pdf:
total_pages = len(pdf.pages)
pages_to_process = min(total_pages, self.max_pages)
print(f" Converting {pdf_path}: {pages_to_process}/{total_pages} pages")
for i, page in enumerate(pdf.pages[:pages_to_process]):
try:
page_text = page.extract_text()
if page_text:
text_content += f"\n--- PAGE {i+1} ---\n"
text_content += page_text
text_content += "\n"
except Exception as e:
print(f" Error on page {i+1}: {e}")
continue
self.conversion_stats[pdf_path] = {
'method': 'pdfplumber',
'pages_processed': pages_to_process,
'total_pages': total_pages,
'success': True,
'text_length': len(text_content)
}
return text_content
def _convert_with_pypdf2(self, pdf_path: str) -> str:
"""Fallback conversion method using PyPDF2"""
text_content = ""
with open(pdf_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
total_pages = len(reader.pages)
pages_to_process = min(total_pages, self.max_pages)
print(f" Converting {pdf_path} (fallback): {pages_to_process}/{total_pages} pages")
for i in range(pages_to_process):
try:
page = reader.pages[i]
page_text = page.extract_text()
if page_text:
text_content += f"\n--- PAGE {i+1} ---\n"
text_content += page_text
text_content += "\n"
except Exception as e:
print(f" Error on page {i+1}: {e}")
continue
self.conversion_stats[pdf_path] = {
'method': 'PyPDF2',
'pages_processed': pages_to_process,
'total_pages': total_pages,
'success': True,
'text_length': len(text_content)
}
return text_content
def convert_all_pdfs(self, pdf_directory: str, output_directory: str):
"""Convert all PDFs in directory to text files"""
pdf_files = list(Path(pdf_directory).glob("**/*.pdf"))
print(f"🔄 Converting {len(pdf_files)} PDF files to text...")
os.makedirs(output_directory, exist_ok=True)
for i, pdf_path in enumerate(pdf_files):
print(f"\n[{i+1}/{len(pdf_files)}] Processing {pdf_path.name}...")
# Convert to text
text_content = self.convert_pdf_to_text(str(pdf_path))
if text_content.strip():
# Save as text file
output_file = Path(output_directory) / f"{pdf_path.stem}.txt"
with open(output_file, 'w', encoding='utf-8') as f:
f.write(f"SOURCE: {pdf_path}\n")
f.write(f"CONVERTED: 2025-01-11\n")
f.write("="*50 + "\n\n")
f.write(text_content)
print(f" ✅ Saved: {output_file}")
else:
print(f" ❌ No text extracted from {pdf_path.name}")
# Save conversion statistics
stats_file = Path(output_directory) / "conversion_stats.json"
with open(stats_file, 'w', encoding='utf-8') as f:
json.dump(self.conversion_stats, f, indent=2, ensure_ascii=False)
print(f"\n🎉 PDF conversion complete! Check {output_directory}")
return len([f for f in self.conversion_stats.values() if f['success']])
# Usage
if __name__ == "__main__":
converter = PDFConverter(max_pages=50)
# Convert all PDFs
pdf_dir = "/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri"
output_dir = "/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri/INDEX-SISTEM-JOCURI/converted_pdfs"
converted_count = converter.convert_all_pdfs(pdf_dir, output_dir)
print(f"Final result: {converted_count} PDFs successfully converted")

145
scripts/review_queue.py Normal file

@@ -0,0 +1,145 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
review_queue.py — CLI for the needs_review lifecycle (plan §5c).
Rows land in the queue when dedup leaves a borderline pair separate, or when a
legacy `.doc` source was converted imperfectly. Each row has a stable content
key; a decision written here is stored in data/review_decisions.json (git
tracked) and re-applied by build_database.py on every rebuild, so the queue
never resurfaces a resolved row.
Commands:
python scripts/review_queue.py list
python scripts/review_queue.py resolve <id> <merge|keep-separate|drop>
"""
from __future__ import annotations
import argparse
import json
import sqlite3
import sys
from pathlib import Path
from typing import Optional
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
if _p not in sys.path:
sys.path.insert(0, _p)
from import_common import content_key, normalize_name # noqa: E402
VALID_DECISIONS = ("merge", "keep-separate", "drop")
# --------------------------------------------------------------------------
# review_decisions.json
# --------------------------------------------------------------------------
def load_decisions(path: Path) -> dict:
if path.is_file():
try:
data = json.loads(path.read_text(encoding="utf-8"))
if isinstance(data, dict):
return data
except (json.JSONDecodeError, OSError):
pass
return {}
def save_decisions(decisions: dict, path: Path) -> None:
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(
json.dumps(decisions, indent=2, ensure_ascii=False, sort_keys=True),
encoding="utf-8",
)
# --------------------------------------------------------------------------
# queue
# --------------------------------------------------------------------------
def list_queue(db_path: Path) -> list[dict]:
"""Return every needs_review row in the current DB, with its content key."""
if not db_path.is_file():
return []
conn = sqlite3.connect(db_path)
conn.row_factory = sqlite3.Row
try:
rows = conn.execute(
"SELECT name, normalized_name, language, description "
"FROM activities WHERE needs_review = 1 ORDER BY normalized_name"
).fetchall()
except sqlite3.OperationalError:
return []
finally:
conn.close()
out = []
for row in rows:
norm = row["normalized_name"] or normalize_name(row["name"])
key = content_key(norm, row["language"], row["description"] or "")
out.append({
"id": key,
"name": row["name"],
"language": row["language"],
"description": row["description"] or "",
})
return out
def resolve(decisions_path: Path, content_id: str, decision: str) -> dict:
"""Record a decision for a content key in review_decisions.json."""
if decision not in VALID_DECISIONS:
raise ValueError(
f"invalid decision {decision!r}; expected one of {VALID_DECISIONS}"
)
decisions = load_decisions(decisions_path)
decisions[content_id] = {"decision": decision}
save_decisions(decisions, decisions_path)
return decisions
# --------------------------------------------------------------------------
# CLI
# --------------------------------------------------------------------------
def main(argv: Optional[list[str]] = None) -> int:
parser = argparse.ArgumentParser(description="needs_review queue CLI")
parser.add_argument("--db", default="data/activities.db")
parser.add_argument("--decisions", default="data/review_decisions.json")
sub = parser.add_subparsers(dest="command", required=True)
sub.add_parser("list", help="list rows currently flagged needs_review")
p_resolve = sub.add_parser("resolve", help="record a decision for a row")
p_resolve.add_argument("id", help="content id from `list`")
p_resolve.add_argument("decision", choices=VALID_DECISIONS)
args = parser.parse_args(argv)
if args.command == "list":
rows = list_queue(Path(args.db))
if not rows:
print("review queue is empty.")
return 0
print(f"{len(rows)} row(s) need review:\n")
for r in rows:
desc = r["description"][:80].replace("\n", " ")
print(f" id : {r['id']}")
print(f" name : {r['name']} [{r['language']}]")
print(f" desc : {desc}")
print(f" -> review_queue.py resolve {r['id']} <merge|keep-separate|drop>")
print()
return 0
if args.command == "resolve":
resolve(Path(args.decisions), args.id, args.decision)
print(f"recorded: {args.id} -> {args.decision}")
print(f"written to {args.decisions} (applied on next build_database --rebuild)")
return 0
return 1
if __name__ == "__main__":
raise SystemExit(main())
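
The decision file round-trip described above can be sketched in isolation as follows. This is a simplified stand-in for `load_decisions` / `save_decisions` / `resolve`, not the script itself: `record_decision` is a hypothetical helper name, the content keys are placeholders, and the file lives in a temporary directory rather than `data/review_decisions.json`.

```python
import json
import tempfile
from pathlib import Path

VALID_DECISIONS = ("merge", "keep-separate", "drop")


def record_decision(path: Path, content_id: str, decision: str) -> dict:
    """Load the decisions file (if any), set one entry, write it back."""
    if decision not in VALID_DECISIONS:
        raise ValueError(f"invalid decision {decision!r}")
    decisions = json.loads(path.read_text(encoding="utf-8")) if path.is_file() else {}
    decisions[content_id] = {"decision": decision}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(decisions, indent=2, sort_keys=True), encoding="utf-8")
    return decisions


with tempfile.TemporaryDirectory() as tmp:
    decisions_file = Path(tmp) / "review_decisions.json"
    record_decision(decisions_file, "a" * 40, "keep-separate")  # placeholder key
    record_decision(decisions_file, "b" * 40, "drop")           # placeholder key
    stored = json.loads(decisions_file.read_text(encoding="utf-8"))
```

Because the file is keyed by content hash and written with sorted keys, repeated runs should produce stable diffs when the file is tracked in git.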


@@ -1,50 +1,140 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Main extraction orchestrator
Runs the whole extraction process
run_extraction.py — extraction orchestrator (plan §3).
The pipeline is script-only up to the LLM step: this script normalizes the
corpus, chunks the normalized sources, and emits one subagent prompt per
`pending` chunk. It does NOT run the extraction itself — that step is the
interactive Claude Code orchestrator launching waves of subagents.
Steps:
1. normalize data/carti-camp-jocuri/ -> data/sources/*.txt
2. chunk data/sources/*.txt -> data/chunks/<id>/*.txt + manifest.json
3. emit one prompt per `pending` chunk -> data/chunks/_prompts/*.md
4. report how many chunks remain `pending`
Usage:
python scripts/run_extraction.py
python scripts/run_extraction.py --skip-normalize # re-chunk only
"""
from __future__ import annotations
import argparse
import sys
import time
from pathlib import Path
from typing import Optional
from unified_processor import UnifiedProcessor
from import_claude_activities import ClaudeActivityImporter
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
if _p not in sys.path:
sys.path.insert(0, _p)
import chunk_sources # noqa: E402
import normalize_sources # noqa: E402
SUBAGENT_PROMPT = SCRIPT_DIR / "SUBAGENT_PROMPT.md"
def emit_chunk_prompt(chunk_key: str, meta: dict, prompts_dir: Path) -> Path:
"""Write the subagent prompt for one pending chunk."""
chunk_file = meta.get("chunk_file", f"data/chunks/<id>/{chunk_key}.txt")
expected_json = meta.get("expected_json", f"{chunk_key}.json")
text = "\n".join([
f"# EXTRACTION — chunk `{chunk_key}`",
"",
f"Read ONLY this chunk: `{chunk_file}`",
f"Chunk range: {meta.get('chunk_range', '?')}",
"",
f"Follow the rules in `{SUBAGENT_PROMPT.relative_to(REPO_ROOT)}`.",
"Identify every distinct activity, fill the schema "
"(`scripts/activity_schema.json`), and write the result to:",
"",
f" data/extracted/{expected_json}",
"",
"Header fields to set: "
f'source_id="{meta.get("source_id", "")}", chunk_key="{chunk_key}", '
f'source_hash="{meta.get("source_hash", "")}".',
"",
])
prompts_dir.mkdir(parents=True, exist_ok=True)
out = prompts_dir / f"{chunk_key}.prompt.md"
out.write_text(text, encoding="utf-8")
return out
def run(
*,
corpus_root: Path,
sources_dir: Path,
chunks_dir: Path,
skip_normalize: bool = False,
) -> dict:
summary: dict = {}
if not skip_normalize:
norm = normalize_sources.run(corpus_root, sources_dir)
summary["normalized"] = {"ok": norm["ok"], "total": norm["total"],
"errors": norm["errors"]}
chunk_summary = chunk_sources.run(sources_dir, chunks_dir)
summary["chunks"] = chunk_summary
manifest_path = chunks_dir / "manifest.json"
manifest = chunk_sources.load_manifest(manifest_path)
prompts_dir = chunks_dir / "_prompts"
pending = {k: m for k, m in manifest["chunks"].items()
if m.get("state") == "pending"}
for key, meta in sorted(pending.items()):
emit_chunk_prompt(key, meta, prompts_dir)
states: dict[str, int] = {}
for m in manifest["chunks"].values():
states[m.get("state", "?")] = states.get(m.get("state", "?"), 0) + 1
summary["states"] = states
summary["pending"] = len(pending)
summary["prompts_dir"] = str(prompts_dir)
return summary
def main(argv: Optional[list[str]] = None) -> int:
parser = argparse.ArgumentParser(description="Extraction orchestrator.")
parser.add_argument("--corpus", default="data/carti-camp-jocuri")
parser.add_argument("--sources", default="data/sources")
parser.add_argument("--chunks", default="data/chunks")
parser.add_argument("--skip-normalize", action="store_true",
help="skip normalization, re-chunk existing sources only")
args = parser.parse_args(argv)
summary = run(
corpus_root=Path(args.corpus),
sources_dir=Path(args.sources),
chunks_dir=Path(args.chunks),
skip_normalize=args.skip_normalize,
)
print("=" * 60)
print("EXTRACTION ORCHESTRATOR")
print("=" * 60)
if "normalized" in summary:
n = summary["normalized"]
print(f"normalized : {n['ok']}/{n['total']} (errors {n['errors']})")
print(f"chunks : {summary['chunks']['chunks']}")
for state, count in sorted(summary["states"].items()):
print(f" {state:<10}: {count}")
print(f"\npending chunks remaining : {summary['pending']}")
if summary["pending"]:
print(f"subagent prompts written : {summary['prompts_dir']}/")
print("Launch waves of ~5-10 subagents on those prompts, then run "
"validate_extractions.py and build_database.py --rebuild.")
else:
print("All chunks extracted — run build_database.py --rebuild.")
print("=" * 60)
return 0
def main():
print("="*60)
print("ACTIVITY EXTRACTION SYSTEM")
print("Strategy S8: Hybrid Claude + Scripts")
print("="*60)
# Step 1: Run automated extraction
print("\nSTEP 1: Automated Extraction")
print("-"*40)
processor = UnifiedProcessor()
processor.process_automated_formats()
# Step 2: Wait for Claude processing
print("\n" + "="*60)
print("STEP 2: Manual Claude Processing Required")
print("-"*40)
print("Please process PDF/DOC files with Claude using the template.")
print("Files are listed in: pdf_doc_for_claude.txt")
print("Save extracted activities as JSON in: scripts/extracted_activities/")
print("="*60)
response = input("\nHave you completed Claude processing? (y/n): ")
if response.lower() == 'y':
# Step 3: Import Claude-extracted activities
print("\nSTEP 3: Importing Claude-extracted activities")
print("-"*40)
importer = ClaudeActivityImporter()
importer.import_all_json_files()
print("\n" + "="*60)
print("EXTRACTION COMPLETE!")
print("="*60)
if __name__ == "__main__":
main()
raise SystemExit(main())
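
The pending-chunk selection and state tally performed by `run()` can be illustrated with a small in-memory manifest. The manifest below is hypothetical: the real one is written by `chunk_sources.py`, and apart from `pending` the state name used here (`extracted`) is an assumption made for the example. `collections.Counter` stands in for the manual dictionary counting in the script.

```python
from collections import Counter

# Hypothetical manifest; chunk keys follow the "<source_id>.partNN" pattern.
manifest = {
    "chunks": {
        "ab12cd34_book.part01": {"state": "pending"},
        "ab12cd34_book.part02": {"state": "extracted"},  # assumed state name
        "ef56ab78_guide.part01": {"state": "pending"},
        "ef56ab78_guide.part02": {},  # missing state is tallied as "?"
    }
}

# Chunks that still need a subagent prompt, in deterministic order.
pending = sorted(
    key for key, meta in manifest["chunks"].items()
    if meta.get("state") == "pending"
)

# Per-state counts, mirroring the summary printed by the orchestrator.
states = Counter(meta.get("state", "?") for meta in manifest["chunks"].values())
```

Only chunks whose state is exactly `pending` get a prompt, so re-running the orchestrator after a partial wave should not re-emit prompts for chunks already marked otherwise.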


@@ -1,197 +0,0 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Text/Markdown Activity Extractor
Processes TXT and MD files for activity extraction
"""
import re
from pathlib import Path
from typing import List, Dict
import sqlite3
from datetime import datetime
class TextActivityExtractor:
def __init__(self, db_path='data/activities.db'):
self.db_path = db_path
self.activity_patterns = {
'section_headers': [
r'^#{1,6}\s*(.+)$', # Markdown headers
r'^([A-Z][^\.]{10,100})$', # Simple titles
r'^\d+\.\s*(.+)$', # Numbered lists
r'^[•\-\*]\s*(.+)$', # Bullet points
],
'activity_markers': [
'joc:', 'activitate:', 'exercitiu:', 'team building:',
'nume:', 'titlu:', 'denumire:'
]
}
def extract_from_text(self, file_path: str) -> List[Dict]:
"""Extract activities from a text/markdown file"""
activities = []
try:
with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
content = f.read()
# Metoda 1: Cauta sectiuni markdown
if file_path.endswith('.md'):
activities.extend(self._extract_from_markdown(content, file_path))
# Metoda 2: Cauta pattern-uri generale
activities.extend(self._extract_from_patterns(content, file_path))
# Metoda 3: Cauta blocuri de text structurate
activities.extend(self._extract_from_blocks(content, file_path))
except Exception as e:
print(f"Error processing {file_path}: {e}")
return activities
def _extract_from_markdown(self, content, source_file):
"""Extract activities from markdown format"""
activities = []
lines = content.split('\n')
current_activity = None
current_content = []
for line in lines:
# Check whether this line is an activity header
if re.match(r'^#{1,3}\s*(.+)', line):
# Save the previous activity, if there is one
if current_activity and current_content:
current_activity['description'] = '\n'.join(current_content[:20]) # Max 20 lines
activities.append(current_activity)
# Check whether the new header is an activity
header_text = re.sub(r'^#{1,3}\s*', '', line)
if any(marker in header_text.lower() for marker in ['joc', 'activitate', 'exercitiu']):
current_activity = {
'name': header_text[:200],
'source_file': str(source_file),
'category': '[A]'
}
current_content = []
else:
current_activity = None
elif current_activity:
# Append content to the current activity
if line.strip():
current_content.append(line)
# Save the last activity
if current_activity and current_content:
current_activity['description'] = '\n'.join(current_content[:20])
activities.append(current_activity)
return activities
def _extract_from_patterns(self, content, source_file):
"""Extract using pattern matching"""
activities = []
# Look for activity-specific markers
for marker in self.activity_patterns['activity_markers']:
pattern = re.compile(f'{re.escape(marker)}\\s*(.+?)(?=\\n\\n|{re.escape(marker)}|$)',
re.IGNORECASE | re.DOTALL)
matches = pattern.finditer(content)
for match in matches:
activity_text = match.group(1)
if len(activity_text) > 20:
activity = {
'name': activity_text.split('\n')[0][:200],
'description': activity_text[:1000],
'source_file': str(source_file),
'category': '[A]'
}
activities.append(activity)
return activities
def _extract_from_blocks(self, content, source_file):
"""Extract from separate text blocks"""
activities = []
# Split into blocks separated by blank lines
blocks = re.split(r'\n\s*\n', content)
for block in blocks:
if len(block) > 50: # Minimum 50 characters
lines = block.strip().split('\n')
first_line = lines[0].strip()
# Check whether the block looks like an activity
if any(keyword in first_line.lower() for keyword in ['joc', 'activitate', 'exercitiu']):
activity = {
'name': first_line[:200],
'description': block[:1000],
'source_file': str(source_file),
'category': '[A]'
}
activities.append(activity)
return activities
def save_to_database(self, activities):
"""Save to the database"""
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
saved_count = 0
for activity in activities:
try:
# Check for duplicates
cursor.execute(
"SELECT id FROM activities WHERE name = ? AND source_file = ?",
(activity.get('name'), activity.get('source_file'))
)
if not cursor.fetchone():
columns = list(activity.keys())
values = list(activity.values())
placeholders = ['?' for _ in values]
query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
cursor.execute(query, values)
saved_count += 1
except Exception as e:
print(f"Error saving: {e}")
conn.commit()
conn.close()
return saved_count
def process_all_text_files(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
"""Process all text and markdown files"""
base_path = Path(base_path)
text_files = list(base_path.rglob("*.txt"))
md_files = list(base_path.rglob("*.md"))
all_files = text_files + md_files
print(f"Found {len(all_files)} text/markdown files")
all_activities = []
for file_path in all_files:
activities = self.extract_from_text(str(file_path))
all_activities.extend(activities)
print(f"Processed {file_path.name}: {len(activities)} activities")
# Save to database
saved = self.save_to_database(all_activities)
print(f"\nTotal saved: {saved} activities from {len(all_files)} files")
return len(all_files), saved
if __name__ == "__main__":
extractor = TextActivityExtractor()
extractor.process_all_text_files()

View File

@@ -1,151 +0,0 @@
#!/usr/bin/env python3
"""
Unified Activity Processor
Orchestrates all extractors for complete processing
"""
import time
from pathlib import Path
from html_extractor import HTMLActivityExtractor
from text_extractor import TextActivityExtractor
import sqlite3
class UnifiedProcessor:
def __init__(self, db_path='data/activities.db'):
self.db_path = db_path
self.html_extractor = HTMLActivityExtractor(db_path)
self.text_extractor = TextActivityExtractor(db_path)
self.stats = {
'html_processed': 0,
'text_processed': 0,
'pdf_to_process': 0,
'doc_to_process': 0,
'total_activities': 0,
'start_time': None,
'end_time': None
}
def get_current_activity_count(self):
"""Get the current number of activities in the DB"""
conn = sqlite3.connect(self.db_path)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM activities")
count = cursor.fetchone()[0]
conn.close()
return count
def count_files_to_process(self, base_path):
"""Count the files that need to be processed"""
base_path = Path(base_path)
counts = {
'html': len(list(base_path.rglob("*.html"))) + len(list(base_path.rglob("*.htm"))),
'txt': len(list(base_path.rglob("*.txt"))),
'md': len(list(base_path.rglob("*.md"))),
'pdf': len(list(base_path.rglob("*.pdf"))),
'doc': len(list(base_path.rglob("*.doc"))),
'docx': len(list(base_path.rglob("*.docx")))
}
return counts
def process_automated_formats(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
"""Process all formats that can be automated"""
print("="*60)
print("UNIFIED ACTIVITY PROCESSOR - AUTOMATED PHASE")
print("="*60)
self.stats['start_time'] = time.time()
initial_count = self.get_current_activity_count()
# Show initial statistics
file_counts = self.count_files_to_process(base_path)
print(f"\nFiles to process:")
for format, count in file_counts.items():
print(f" {format.upper()}: {count} files")
print(f"\nCurrent activities in database: {initial_count}")
print("-"*60)
# PHASE 1: HTML processing (top priority, high volume)
print("\n[1/2] Processing HTML files...")
print("-"*40)
html_processed, html_errors = self.html_extractor.process_all_html_files(base_path)
self.stats['html_processed'] = html_processed
# PHASE 2: Text/MD processing
print("\n[2/2] Processing Text/Markdown files...")
print("-"*40)
text_processed, text_saved = self.text_extractor.process_all_text_files(base_path)
self.stats['text_processed'] = text_processed
# Final statistics
self.stats['end_time'] = time.time()
final_count = self.get_current_activity_count()
self.stats['total_activities'] = final_count - initial_count
# Identify the files that need manual processing
self.stats['pdf_to_process'] = file_counts['pdf']
self.stats['doc_to_process'] = file_counts['doc'] + file_counts['docx']
self.print_summary()
self.save_pdf_doc_list(base_path)
def print_summary(self):
"""Print the processing summary"""
print("\n" + "="*60)
print("PROCESSING SUMMARY")
print("="*60)
duration = self.stats['end_time'] - self.stats['start_time']
print(f"\nAutomated Processing Results:")
print(f" HTML files processed: {self.stats['html_processed']}")
print(f" Text/MD files processed: {self.stats['text_processed']}")
print(f" New activities added: {self.stats['total_activities']}")
print(f" Processing time: {duration:.1f} seconds")
print(f"\nFiles requiring Claude processing:")
print(f" PDF files: {self.stats['pdf_to_process']}")
print(f" DOC/DOCX files: {self.stats['doc_to_process']}")
print("\n" + "="*60)
print("NEXT STEPS:")
print("1. Review the file 'pdf_doc_for_claude.txt' for manual processing")
print("2. Use Claude to extract activities from PDF/DOC files")
print("3. Focus on largest PDF files first (highest activity density)")
print("="*60)
def save_pdf_doc_list(self, base_path):
"""Save the list of PDF/DOC files for Claude processing"""
base_path = Path(base_path)
pdf_files = sorted(base_path.rglob("*.pdf"), key=lambda p: p.stat().st_size, reverse=True)
doc_files = list(base_path.rglob("*.doc"))
docx_files = list(base_path.rglob("*.docx"))
with open('pdf_doc_for_claude.txt', 'w', encoding='utf-8') as f:
f.write("PDF/DOC FILES FOR CLAUDE PROCESSING\n")
f.write("="*60 + "\n")
f.write("Files sorted by size (largest first = likely more activities)\n\n")
f.write("TOP PRIORITY PDF FILES (process these first):\n")
f.write("-"*40 + "\n")
for i, pdf in enumerate(pdf_files[:20], 1):
size_mb = pdf.stat().st_size / (1024*1024)
f.write(f"{i}. {pdf.name} ({size_mb:.1f} MB)\n")
f.write(f" Path: {pdf}\n\n")
if len(pdf_files) > 20:
f.write(f"\n... and {len(pdf_files)-20} more PDF files\n\n")
f.write("\nDOC/DOCX FILES:\n")
f.write("-"*40 + "\n")
for doc in doc_files + docx_files:
size_kb = doc.stat().st_size / 1024
f.write(f"- {doc.name} ({size_kb:.1f} KB)\n")
print(f"\nPDF/DOC list saved to: pdf_doc_for_claude.txt")
if __name__ == "__main__":
processor = UnifiedProcessor()
processor.process_automated_formats()

View File

@@ -0,0 +1,208 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
validate_extractions.py: validate every data/extracted/*.json (plan §5b).

For each extraction file it runs two checks:
  1. JSON-schema validation against scripts/activity_schema.json,
  2. the source_excerpt anti-hallucination check (each excerpt must be a fuzzy
     substring of the chunk it came from).

For every failing chunk it:
  * writes the exact re-extraction prompt to data/extracted/_reextract/<chunk>.prompt.md,
  * marks the chunk `rejected` in data/chunks/manifest.json.

The orchestrator then re-launches subagents only on the `rejected` chunks; the
loop repeats until nothing is rejected.

Usage:
    python scripts/validate_extractions.py
"""
from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import Optional

SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
    if _p not in sys.path:
        sys.path.insert(0, _p)

from import_common import (  # noqa: E402
    DEFAULT_SCHEMA_PATH,
    chunk_key_for,
    excerpt_matches,
    excerpt_score,
    find_chunk_text,
    iter_extraction_files,
    load_schema,
    validate_extraction,
)

SUBAGENT_PROMPT = SCRIPT_DIR / "SUBAGENT_PROMPT.md"
# --------------------------------------------------------------------------
# re-extraction prompt
# --------------------------------------------------------------------------


def build_reextraction_prompt(
    chunk_key: str, chunk_file: Optional[str], errors: list[str]
) -> str:
    """The exact prompt to hand a subagent to re-extract a rejected chunk."""
    chunk_ref = chunk_file or f"data/chunks/<source_id>/{chunk_key}.txt"
    lines = [
        f"# RE-EXTRACTION — chunk `{chunk_key}`",
        "",
        "The previous extraction for this chunk was **REJECTED**. Reasons:",
        "",
    ]
    lines += [f"- {e}" for e in errors]
    lines += [
        "",
        "## What to do",
        "",
        f"1. Read ONLY this chunk: `{chunk_ref}`",
        f"2. Follow the extraction rules in `{SUBAGENT_PROMPT.relative_to(REPO_ROOT)}`.",
        "3. Fix every problem listed above. In particular:",
        " - every `source_excerpt` must be copied **verbatim** from the chunk",
        " (it is checked as a fuzzy substring — invented quotes are rejected);",
        " - `source_excerpt` and `page_reference` are mandatory on every activity;",
        " - the output must validate against `scripts/activity_schema.json`.",
        f"4. Overwrite the extraction file `data/extracted/{chunk_key}.json`.",
        "",
    ]
    return "\n".join(lines)
# --------------------------------------------------------------------------
# manifest
# --------------------------------------------------------------------------


def load_manifest(manifest_path: Path) -> dict:
    if manifest_path.is_file():
        try:
            data = json.loads(manifest_path.read_text(encoding="utf-8"))
            data.setdefault("chunks", {})
            return data
        except (json.JSONDecodeError, OSError):
            pass
    return {"chunks": {}}


def save_manifest(manifest: dict, manifest_path: Path) -> None:
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    manifest_path.write_text(
        json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8"
    )


def mark_rejected(manifest: dict, chunk_key: str) -> None:
    """Flip a chunk to `rejected` in the manifest (creating the entry if new)."""
    entry = manifest["chunks"].get(chunk_key, {})
    entry["state"] = "rejected"
    manifest["chunks"][chunk_key] = entry
# --------------------------------------------------------------------------
# validation
# --------------------------------------------------------------------------


def validate_file(json_path: Path, schema: dict, chunks_dir: Path) -> list[str]:
    """Return the list of errors for one extraction file (empty == valid)."""
    try:
        data = json.loads(json_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    errors = validate_extraction(data, schema)
    if errors:
        return errors
    header = data.get("header", {})
    chunk_text = find_chunk_text(json_path, header, chunks_dir)
    if chunk_text is None:
        return [f"source chunk not found for {chunk_key_for(json_path, header)}"]
    for adict in data.get("activities", []):
        excerpt = adict.get("source_excerpt") or ""
        if not excerpt_matches(excerpt, chunk_text):
            score = excerpt_score(excerpt, chunk_text)
            errors.append(
                f"activity {adict.get('name')!r}: source_excerpt not found in "
                f"chunk (best match {score:.0f}/100) — possible hallucination"
            )
    return errors
def run(
    extracted_dir: Path,
    chunks_dir: Path,
    manifest_path: Path,
    schema_path: Path = DEFAULT_SCHEMA_PATH,
) -> dict:
    schema = load_schema(schema_path)
    manifest = load_manifest(manifest_path)
    reextract_dir = extracted_dir / "_reextract"
    report = {"total": 0, "valid": 0, "rejected": 0, "rejected_chunks": []}
    for json_path in iter_extraction_files(extracted_dir):
        report["total"] += 1
        errors = validate_file(json_path, schema, chunks_dir)
        if not errors:
            report["valid"] += 1
            continue
        report["rejected"] += 1
        try:
            data = json.loads(json_path.read_text(encoding="utf-8"))
            header = data.get("header", {})
        except json.JSONDecodeError:
            header = {}
        chunk_key = chunk_key_for(json_path, header)
        chunk_file = None
        meta = manifest["chunks"].get(chunk_key)
        if meta:
            chunk_file = meta.get("chunk_file")
        reextract_dir.mkdir(parents=True, exist_ok=True)
        prompt = build_reextraction_prompt(chunk_key, chunk_file, errors)
        (reextract_dir / f"{chunk_key}.prompt.md").write_text(prompt, encoding="utf-8")
        mark_rejected(manifest, chunk_key)
        report["rejected_chunks"].append({"chunk": chunk_key, "errors": errors})
    save_manifest(manifest, manifest_path)
    return report
# --------------------------------------------------------------------------
# CLI
# --------------------------------------------------------------------------


def main(argv: Optional[list[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Validate extraction JSON files.")
    parser.add_argument("--extracted", default="data/extracted")
    parser.add_argument("--chunks", default="data/chunks")
    parser.add_argument("--manifest", default="data/chunks/manifest.json")
    parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
    args = parser.parse_args(argv)
    report = run(
        Path(args.extracted), Path(args.chunks), Path(args.manifest), Path(args.schema)
    )
    print(f"extraction files : {report['total']}")
    print(f" valid : {report['valid']}")
    print(f" rejected : {report['rejected']}")
    for item in report["rejected_chunks"]:
        print(f" [rejected] {item['chunk']}")
        for err in item["errors"]:
            print(f" - {err}")
    if report["rejected"]:
        print(f"\nRe-extraction prompts written to {args.extracted}/_reextract/")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
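
The `source_excerpt` check in validate_extractions.py calls `excerpt_matches` and `excerpt_score` from `import_common`, which this part of the diff does not show. As a rough illustration only, a fuzzy-substring check of that kind could be sketched with the standard library's `difflib`. The function names, the whitespace/case normalization, the sliding-window strategy, and the 85-point threshold below are all assumptions made for the sketch, not the project's actual implementation.

```python
import difflib
import re


def _normalize(text: str) -> str:
    """Lower-case and collapse whitespace runs so layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def sketch_excerpt_score(excerpt: str, chunk_text: str) -> float:
    """Best similarity (0-100) between the excerpt and a same-length window of the chunk.

    Hypothetical stand-in for import_common.excerpt_score.
    """
    needle, haystack = _normalize(excerpt), _normalize(chunk_text)
    if not needle or not haystack:
        return 0.0
    if needle in haystack:
        # Verbatim quote (modulo case and whitespace): perfect score.
        return 100.0
    width = len(needle)
    if width >= len(haystack):
        matcher = difflib.SequenceMatcher(None, needle, haystack, autojunk=False)
        return 100.0 * matcher.ratio()
    # Slide a window over the chunk; a coarse step keeps the scan cheap.
    step = max(1, width // 4)
    best = 0.0
    for start in range(0, len(haystack) - width + 1, step):
        window = haystack[start:start + width]
        matcher = difflib.SequenceMatcher(None, needle, window, autojunk=False)
        best = max(best, matcher.ratio())
    return 100.0 * best


def sketch_excerpt_matches(
    excerpt: str, chunk_text: str, threshold: float = 85.0
) -> bool:
    """True when the excerpt is close enough to some part of the chunk.

    The threshold is an assumed value, chosen only for this example.
    """
    return sketch_excerpt_score(excerpt, chunk_text) >= threshold


if __name__ == "__main__":
    chunk = "Jocul incepe cu toti copiii asezati in cerc.\nFiecare copil spune un nume."
    print(sketch_excerpt_matches("jocul   incepe cu toti copiii", chunk))
    print(sketch_excerpt_matches("The wizard casts a fireball at the dragon", chunk))
```

A check shaped like this tolerates small OCR or whitespace differences while still rejecting quotes that appear nowhere in the chunk, which is the behavior the re-extraction prompt describes. The coarse window step trades some accuracy for speed, so a real implementation would likely want a finer scan or a dedicated fuzzy-matching library.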