Rebuild extraction pipeline infrastructure (Phase 0 prep)

Implements the approved plan to replace the broken regex/index-master extraction with an LLM-subagent pipeline. Four parallel lanes:

Lane A: scripts/extract_common.py (PDF/docx/doc/pptx/html/zip, no max_pages truncation), normalize_sources.py, chunk_sources.py (~20-page chunks + overlap, manifest registry), activity_schema.json.

Lane B: app/config_taxonomy.py (16 fixed category slugs), schema rebuilt from scratch in app/models/ with content_type, language, source_files, source_excerpt, normalized_name, extraction_confidence, needs_review; FTS5 + 3 triggers extended with materials_list and skills_developed.

Lane C: build_database.py (--rebuild, atomic swap, schema + fuzzy source_excerpt validation, dedup with needs_review band), validate_extractions.py, review_queue.py, new run_extraction.py orchestrator, SUBAGENT_PROMPT.md.

Lane D: search.py content_type/language filters (default search excludes non-game content), E7 schema-compat audit; fixed a NULL keywords AttributeError in _boost_search_relevance.

Removes 8 orphaned/dead scripts and app/services/parser.py + indexer.py. Adds tests/ (70 passing, 1 skipped because libreoffice is absent).

Note: Lane D made one additive edit to app/models/database.py (_update_category_counts) to surface content_type/language in get_filter_options, outside its nominal lane boundary but after Lane B completed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

81
scripts/SUBAGENT_PROMPT.md
Normal file
@@ -0,0 +1,81 @@
# SUBAGENT: Activity extraction

You are a subagent in the game-library extraction pipeline. You extract
educational activities (games, team-building, scouting, recipes, songs,
ceremonies) from one chunk of a source document into structured JSON.

## Your task

1. **Read ONLY the chunk you were assigned.** Do not read other chunks, other
   files, or the original document. The chunk is a `.txt` file with
   `--- PAGE N ---` markers.
2. Identify **every distinct activity** in the chunk.
3. For each activity, fill the schema in `scripts/activity_schema.json`.
4. Write the result to `data/extracted/<chunk_key>.json`.
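
The chunk layout from step 1 can be pictured with a small sketch. It assumes each marker sits on its own line in exactly the form `--- PAGE N ---`; `split_pages` is a hypothetical helper for illustration and is not one of the pipeline's scripts.

```python
import re

# Assumption: a marker occupies a whole line, e.g. "--- PAGE 14 ---".
PAGE_MARKER = re.compile(r"^--- PAGE (\d+) ---[ \t]*$", re.MULTILINE)


def split_pages(chunk_text: str) -> dict[int, str]:
    """Map each page number to the text between its marker and the next one."""
    pages: dict[int, str] = {}
    matches = list(PAGE_MARKER.finditer(chunk_text))
    for i, match in enumerate(matches):
        # A page ends where the next marker starts, or at the end of the chunk.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(chunk_text)
        pages[int(match.group(1))] = chunk_text[match.end():end].strip()
    return pages


sample = (
    "--- PAGE 14 ---\n"
    "Capture the Flag. Two teams each hide a flag\n"
    "--- PAGE 15 ---\n"
    "in their own territory and try to steal the other one.\n"
)
pages = split_pages(sample)
```

Keeping the page number next to the text makes it easy to fill `page_reference` later without rereading the chunk.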

## What counts as "a distinct activity"

A distinct activity is a self-contained game/activity/recipe/song/ceremony with
its own name and a real description of how to do it. It is NOT:

- a bare mention or a cross-reference with no description: **skip it**;
- a sub-variant of an activity already extracted: fold it into `variations`;
- a heading, a table of contents entry, or running page chrome.

If the same activity is split across a page boundary inside your chunk, treat it
as **one** activity and combine the text.
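
A minimal sketch of that page-boundary case, assuming the pieces of the activity have already been located on their pages. Both `combine_across_pages` and `format_page_reference` are hypothetical helpers; the output strings follow the `"page 14"` / `"pages 14-15"` examples given for `page_reference`.

```python
def format_page_reference(page_numbers: list[int]) -> str:
    """Render the pages an activity was found on as a page_reference string."""
    if not page_numbers:
        raise ValueError("an activity must come from at least one page")
    first, last = min(page_numbers), max(page_numbers)
    return f"page {first}" if first == last else f"pages {first}-{last}"


def combine_across_pages(parts: list[str]) -> str:
    """Join the pieces of one activity that was split by a page break."""
    return " ".join(part.strip() for part in parts if part.strip())


single = format_page_reference([14])
spanning = format_page_reference([14, 15])
combined = combine_across_pages(
    ["Two teams each hide a flag", "in their own territory."]
)
```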

## Output format

The file is one JSON object: a `header` plus an `activities` array.

```json
{
  "header": {
    "source_id": "<set from the prompt>",
    "chunk_key": "<set from the prompt>",
    "source_hash": "<set from the prompt>",
    "schema_version": "1.0",
    "prompt_version": "1.0",
    "chunk_range": "pages 1-20"
  },
  "activities": [ ... ]
}
```
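
As an illustration of that shape, the sketch below assembles the object in Python and serializes it. The header values used here (`demo-source`, `demo-source.part01`, the hash) are placeholders invented for the example; the real values come from the prompt.

```python
import json


def build_output(source_id: str, chunk_key: str, source_hash: str,
                 chunk_range: str, activities: list[dict]) -> dict:
    """Assemble the header-plus-activities object described above."""
    return {
        "header": {
            "source_id": source_id,
            "chunk_key": chunk_key,
            "source_hash": source_hash,
            "schema_version": "1.0",
            "prompt_version": "1.0",
            "chunk_range": chunk_range,
        },
        "activities": activities,
    }


# Placeholder values for illustration only.
document = build_output(
    "demo-source", "demo-source.part01", "0123456789abcdef",
    "pages 1-20", activities=[],
)
# ensure_ascii=False keeps Romanian diacritics readable in the output file.
serialized = json.dumps(document, ensure_ascii=False, indent=2)
```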

## Rules for each activity

- **`name`**: the activity's real name (≥3 characters).
- **`description`**: real prose describing the activity. No hard length limit,
  but it must actually describe what happens.
- **`rules`**: how it is played / carried out, if the source gives rules.
- **`category`**: exactly one taxonomy slug (see the `enum` in the schema):
  `jocuri-cercetasesti`, `team-building`, `icebreakers`, `camp-outdoor`,
  `wide-games`, `orientare`, `prim-ajutor`, `escape-room-puzzle`,
  `creative-stem`, `sports-active`, `cantece-ceremonii`, `retete`,
  `supravietuire`, `integrare-incluziune`, `conflict-empatie`, `altele`.
  When unsure, use `altele`.
- **`content_type`**: the FORM of the content, independent of category:
  `joc`, `activitate`, `reteta`, `cantec`, or `ceremonie`.
- **`language`**: `ro` or `en` (the language the activity is written in).
- **`source_excerpt`**: **MANDATORY.** A short quote (one or two sentences)
  copied **verbatim** from the chunk. This is the anti-hallucination anchor: it
  is checked as a fuzzy substring of the chunk, and invented quotes are
  rejected.
- **`page_reference`**: **MANDATORY.** The `--- PAGE N ---` marker(s) the
  activity came from, e.g. `"page 14"` or `"pages 14-15"`.
- **`extraction_confidence`**: `high`, `med`, or `low`. Use `low` when the
source text for the activity is thin or ambiguous.
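
To see why an invented quote fails the `source_excerpt` check, here is a rough sketch of a fuzzy-substring test. It uses the standard library's `difflib` as a stand-in; the checker the pipeline actually runs is not shown in this prompt, so `excerpt_found` and its 0.9 cutoff are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def _squash(text: str) -> str:
    """Collapse whitespace runs so line wrapping does not affect matching."""
    return " ".join(text.split())


def excerpt_found(excerpt: str, chunk_text: str, threshold: float = 0.9) -> bool:
    """Return True if some window of the chunk is close to the excerpt."""
    needle, haystack = _squash(excerpt), _squash(chunk_text)
    if not needle:
        return False
    if needle in haystack:  # exact quote: the cheapest and expected case
        return True
    window = len(needle)
    best = 0.0
    # Slide a window of the excerpt's length over the chunk, keep the best score.
    for start in range(max(1, len(haystack) - window + 1)):
        candidate = haystack[start:start + window]
        best = max(best, SequenceMatcher(None, needle, candidate).ratio())
    return best >= threshold


chunk = "--- PAGE 3 ---\nThe players stand in a circle\nand pass the ball clockwise."
verbatim = excerpt_found("The players stand in a circle and pass the ball", chunk)
invented = excerpt_found("Everyone hops on one leg until the whistle blows", chunk)
```

A quote copied character for character always takes the exact-match path, which is why copying beats paraphrasing.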

## Never invent data

- Do **not** invent ages, participant counts, or durations. If the source does
  not state them, leave those fields `null`.
- Do **not** paraphrase the `source_excerpt`; copy it character for character.
- Better to extract fewer activities accurately than to pad the output.

## Before you finish

- Every activity has a non-empty `source_excerpt` and `page_reference`.
- The file validates against `scripts/activity_schema.json`.
- You only used text from your assigned chunk.
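
Two of these checks lend themselves to a quick self-check before writing the file. A minimal sketch using only the standard library; `precheck` is a hypothetical helper and does not replace validation against `scripts/activity_schema.json`.

```python
MANDATORY_FIELDS = ("source_excerpt", "page_reference")


def precheck(activities: list[dict], chunk_text: str) -> list[str]:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems: list[str] = []
    for index, activity in enumerate(activities):
        label = activity.get("name") or f"activity #{index}"
        for field in MANDATORY_FIELDS:
            value = activity.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{label}: missing {field}")
        excerpt = activity.get("source_excerpt")
        # Exact containment is stricter than a fuzzy check, so passing here
        # is a conservative signal that the quote really is in the chunk.
        if isinstance(excerpt, str) and excerpt.strip() and excerpt not in chunk_text:
            problems.append(f"{label}: source_excerpt is not in the assigned chunk")
    return problems


chunk = "--- PAGE 7 ---\nPlayers form two lines facing each other."
good = {"name": "Two Lines", "source_excerpt": "Players form two lines",
        "page_reference": "page 7"}
bad = {"name": "Mystery Game", "source_excerpt": "", "page_reference": None}
problems = precheck([good, bad], chunk)
```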
110
scripts/activity_schema.json
Normal file
@@ -0,0 +1,110 @@
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Game-library extraction output",
  "description": "One subagent output file: a header carrying provenance/version metadata plus the list of activities extracted from a single chunk.",
  "type": "object",
  "required": ["header", "activities"],
  "additionalProperties": false,
  "properties": {
    "header": {
      "type": "object",
      "required": ["source_hash", "schema_version", "prompt_version", "chunk_range"],
      "additionalProperties": true,
      "properties": {
        "source_hash": {"type": "string", "minLength": 8},
        "schema_version": {"type": "string"},
        "prompt_version": {"type": "string"},
        "chunk_range": {"type": "string"},
        "source_id": {"type": ["string", "null"]},
        "chunk_key": {"type": ["string", "null"]}
      }
    },
    "activities": {
      "type": "array",
      "items": {"$ref": "#/definitions/activity"}
    }
  },
  "definitions": {
    "activity": {
      "type": "object",
      "required": [
        "name",
        "description",
        "category",
        "content_type",
        "language",
        "extraction_confidence",
        "source_excerpt",
        "page_reference"
      ],
      "additionalProperties": false,
      "properties": {
        "name": {"type": "string", "minLength": 3},
        "description": {"type": "string", "minLength": 1},
        "rules": {"type": ["string", "null"]},
        "variations": {"type": ["string", "null"]},
        "category": {
          "type": "string",
          "enum": [
            "jocuri-cercetasesti",
            "team-building",
            "icebreakers",
            "camp-outdoor",
            "wide-games",
            "orientare",
            "prim-ajutor",
            "escape-room-puzzle",
            "creative-stem",
            "sports-active",
            "cantece-ceremonii",
            "retete",
            "supravietuire",
            "integrare-incluziune",
            "conflict-empatie",
            "altele"
          ]
        },
        "subcategory": {"type": ["string", "null"]},
        "content_type": {
          "type": "string",
          "enum": ["joc", "activitate", "reteta", "cantec", "ceremonie"]
        },
        "language": {"type": "string", "enum": ["ro", "en"]},
        "extraction_confidence": {
          "type": "string",
          "enum": ["high", "med", "low"]
        },
        "source_excerpt": {"type": "string", "minLength": 1},
        "page_reference": {"type": "string", "minLength": 1},
        "source_file": {"type": ["string", "null"]},
        "age_group_min": {"type": ["integer", "null"], "minimum": 0},
        "age_group_max": {"type": ["integer", "null"], "minimum": 0},
        "participants_min": {"type": ["integer", "null"], "minimum": 0},
        "participants_max": {"type": ["integer", "null"], "minimum": 0},
        "duration_min": {"type": ["integer", "null"], "minimum": 0},
        "duration_max": {"type": ["integer", "null"], "minimum": 0},
        "materials_category": {"type": ["string", "null"]},
        "materials_list": {
          "type": ["array", "null"],
          "items": {"type": "string"}
        },
        "skills_developed": {
          "type": ["array", "null"],
          "items": {"type": "string"}
        },
        "difficulty_level": {
          "type": ["string", "null"],
          "enum": ["usor", "mediu", "dificil", null]
        },
        "keywords": {
          "type": ["array", "null"],
          "items": {"type": "string"}
        },
        "tags": {
          "type": ["array", "null"],
          "items": {"type": "string"}
        }
      }
    }
  }
}
639
scripts/build_database.py
Normal file
@@ -0,0 +1,639 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
build_database.py: build data/activities.db from the subagent extraction JSON.

Replaces the old import_claude_activities.py. Pipeline (plan §4):

  1. `--rebuild` builds into data/activities.db.tmp; on success the live DB is
     backed up to data/activities.db.bak and the tmp file is swapped in with an
     atomic os.replace. A mid-build crash leaves the live DB untouched.
  2. Every data/extracted/*.json is validated against scripts/activity_schema.json;
     invalid files are moved to data/extracted/_rejected/ with an error log.
  2b. Each source_excerpt must appear as a fuzzy substring (rapidfuzz
     partial_ratio >= 90) of its source chunk; non-matches are treated as
     hallucinations and the activity is dropped (logged to _rejected/).
  3. `category` is normalized to a valid taxonomy slug (fallback `altele`).
  4. Dedup (D5): group by exact normalized_name, never across languages; within a
     group rapidfuzz on descriptions: >=85 auto-merge, 60-85 borderline (keep
     both, needs_review), <60 separate variants.
  5. data/review_decisions.json is applied before insert.
  6. Bulk insert into the tmp DB, populate the categories table, rebuild FTS.
  7. A QA report is printed.

Usage:
    python scripts/build_database.py --rebuild
"""

from __future__ import annotations

import argparse
import json
import os
import shutil
import sys
from collections import defaultdict
from pathlib import Path
from typing import Any, Optional

SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
    if _p not in sys.path:
        sys.path.insert(0, _p)

from app.config_taxonomy import (  # noqa: E402
    category_display_name,
    normalize_category,
    normalize_content_type,
)
from app.models.activity import Activity  # noqa: E402
from app.models.database import DatabaseManager  # noqa: E402
from import_common import (  # noqa: E402
    DEFAULT_SCHEMA_PATH,
    content_key,
    excerpt_matches,
    find_chunk_text,
    iter_extraction_files,
    load_schema,
    normalize_name,
    source_path_for,
)

# dedup thresholds (rapidfuzz token_sort_ratio, 0..100 scale)
AUTO_MERGE_THRESHOLD = 85.0
BORDERLINE_THRESHOLD = 60.0


# --------------------------------------------------------------------------
# extraction dict -> Activity
# --------------------------------------------------------------------------
def _csv(value: Any) -> Optional[str]:
    """Schema arrays -> comma string for the (TEXT) DB columns."""
    if value is None:
        return None
    if isinstance(value, str):
        return value.strip() or None
    if isinstance(value, (list, tuple)):
        parts = [str(v).strip() for v in value if str(v).strip()]
        return ", ".join(parts) or None
    return str(value)


def _split_csv(value: Optional[str]) -> list[str]:
    if not value:
        return []
    return [p.strip() for p in str(value).split(",") if p.strip()]


def dict_to_activity(adict: dict, source_file: str) -> Activity:
    """Build an Activity from one extraction-JSON activity object."""
    tags = adict.get("tags") or []
    if isinstance(tags, str):
        tags = _split_csv(tags)

    source_files = adict.get("source_files") or []
    if isinstance(source_files, str):
        source_files = _split_csv(source_files)
    if source_file and source_file not in source_files:
        source_files = [source_file, *source_files]

    return Activity(
        name=(adict.get("name") or "").strip(),
        description=(adict.get("description") or "").strip(),
        rules=adict.get("rules"),
        variations=adict.get("variations"),
        category=normalize_category(adict.get("category", "")),
        subcategory=adict.get("subcategory"),
        content_type=normalize_content_type(adict.get("content_type", "")),
        source_file=source_file,
        source_files=list(source_files),
        page_reference=adict.get("page_reference"),
        source_excerpt=adict.get("source_excerpt"),
        age_group_min=adict.get("age_group_min"),
        age_group_max=adict.get("age_group_max"),
        participants_min=adict.get("participants_min"),
        participants_max=adict.get("participants_max"),
        duration_min=adict.get("duration_min"),
        duration_max=adict.get("duration_max"),
        materials_category=adict.get("materials_category"),
        materials_list=_csv(adict.get("materials_list")),
        skills_developed=_csv(adict.get("skills_developed")),
        difficulty_level=adict.get("difficulty_level"),
        keywords=_csv(adict.get("keywords")),
        tags=list(tags),
        language=adict.get("language"),
        extraction_confidence=adict.get("extraction_confidence"),
    )


# --------------------------------------------------------------------------
# step 3: category normalization is done in dict_to_activity; a non-taxonomy
# value silently falls back to `altele`. This logs the substitutions.
# --------------------------------------------------------------------------
def log_category_fallbacks(raw_pairs: list[tuple[str, str]]) -> list[str]:
    """raw_pairs = (original, slug); return human-readable fallback messages."""
    msgs = []
    for original, slug in raw_pairs:
        if slug == "altele" and normalize_name(original or "") not in ("", "altele"):
            msgs.append(f"category '{original}' -> altele (not in taxonomy)")
    return msgs


# --------------------------------------------------------------------------
# step 4: dedup
# --------------------------------------------------------------------------
def _longest(*values: Optional[str]) -> Optional[str]:
    best: Optional[str] = None
    for v in values:
        if v and (best is None or len(v) > len(best)):
            best = v
    return best


def _union_csv(values: list[Optional[str]]) -> Optional[str]:
    seen: list[str] = []
    for value in values:
        for item in _split_csv(value):
            if item not in seen:
                seen.append(item)
    return ", ".join(seen) or None


def merge_cluster(cluster: list[Activity]) -> Activity:
    """Collapse a cluster of duplicate activities into one merged Activity."""
    if len(cluster) == 1:
        return cluster[0]

    # representative = the one with the longest description
    rep = max(cluster, key=lambda a: len(a.description or ""))
    merged = Activity(
        name=rep.name,
        description=_longest(*(a.description for a in cluster)) or rep.description,
        rules=_longest(*(a.rules for a in cluster)),
        variations=_longest(*(a.variations for a in cluster)),
        category=rep.category,
        subcategory=rep.subcategory,
        content_type=rep.content_type,
        source_file=rep.source_file,
        page_reference=rep.page_reference,
        source_excerpt=rep.source_excerpt,
        age_group_min=rep.age_group_min,
        age_group_max=rep.age_group_max,
        participants_min=rep.participants_min,
        participants_max=rep.participants_max,
        duration_min=rep.duration_min,
        duration_max=rep.duration_max,
        materials_category=rep.materials_category,
        materials_list=_union_csv([a.materials_list for a in cluster]),
        skills_developed=_union_csv([a.skills_developed for a in cluster]),
        difficulty_level=rep.difficulty_level,
        keywords=_union_csv([a.keywords for a in cluster]),
        language=rep.language,
        extraction_confidence=rep.extraction_confidence,
    )
    # union of tags
    tags: list[str] = []
    for a in cluster:
        for t in a.tags or []:
            if t not in tags:
                tags.append(t)
    merged.tags = tags
    # accumulate every source the activity was seen in
    sources: list[str] = []
    for a in cluster:
        for s in [a.source_file, *(a.source_files or [])]:
            if s and s not in sources:
                sources.append(s)
    merged.source_files = sources
    # popularity_score++ per merged duplicate (plan §4)
    merged.popularity_score = max(a.popularity_score for a in cluster) + (len(cluster) - 1)
    return merged


def dedup_activities(activities: list[Activity]) -> tuple[list[Activity], dict]:
    """
    Dedup per plan D5.

    Groups by (normalized_name, language); different languages are NEVER
    merged. Within a group, descriptions are clustered with rapidfuzz:
      >= 85  -> same cluster (auto-merge)
      60-85  -> borderline: kept as separate clusters, both flagged needs_review
      < 60   -> separate variants
    """
    from rapidfuzz import fuzz

    groups: dict[tuple, list[Activity]] = defaultdict(list)
    for act in activities:
        key = (act.normalized_name or normalize_name(act.name), act.language)
        groups[key].append(act)

    result: list[Activity] = []
    stats = {"input": len(activities), "auto_merged": 0, "borderline": 0, "output": 0}

    for members in groups.values():
        clusters: list[list[Activity]] = []
        borderline_idx: set[int] = set()

        for act in members:
            best_idx, best_score = -1, -1.0
            borderline_here: list[int] = []
            for idx, cluster in enumerate(clusters):
                score = fuzz.token_sort_ratio(
                    act.description or "", cluster[0].description or ""
                )
                if score >= AUTO_MERGE_THRESHOLD:
                    if score > best_score:
                        best_idx, best_score = idx, score
                elif score >= BORDERLINE_THRESHOLD:
                    borderline_here.append(idx)
            if best_idx >= 0:
                clusters[best_idx].append(act)
            else:
                clusters.append([act])
                new_idx = len(clusters) - 1
                for bidx in borderline_here:
                    borderline_idx.add(bidx)
                    borderline_idx.add(new_idx)

        for idx, cluster in enumerate(clusters):
            merged = merge_cluster(cluster)
            if len(cluster) > 1:
                stats["auto_merged"] += len(cluster) - 1
            if idx in borderline_idx:
                merged.needs_review = 1
                stats["borderline"] += 1
            result.append(merged)

    stats["output"] = len(result)
    return result, stats


# --------------------------------------------------------------------------
# step 5: review decisions
# --------------------------------------------------------------------------
def load_review_decisions(path: Path) -> dict:
    if path and path.is_file():
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            if isinstance(data, dict):
                return data
        except (json.JSONDecodeError, OSError):
            pass
    return {}


def apply_review_decisions(
    activities: list[Activity], decisions: dict
) -> tuple[list[Activity], dict]:
    """
    Apply data/review_decisions.json (plan §5c).

    Keyed by the stable content_key. A decision of `drop` removes the row;
    `keep-separate` / `merge` clear needs_review (the user has resolved it).
    Rows with no decision keep needs_review and resurface in the queue.
    """
    kept: list[Activity] = []
    stats = {"dropped": 0, "resolved": 0}
    for act in activities:
        key = content_key(
            act.normalized_name or normalize_name(act.name),
            act.language,
            act.description or "",
        )
        entry = decisions.get(key)
        decision = entry.get("decision") if isinstance(entry, dict) else entry
        if decision == "drop":
            stats["dropped"] += 1
            continue
        if decision in ("keep-separate", "merge"):
            act.needs_review = 0
            stats["resolved"] += 1
        kept.append(act)
    return kept, stats


# --------------------------------------------------------------------------
# golden-set recall (plan §7)
# --------------------------------------------------------------------------
def _golden_names(data: Any) -> list[str]:
    items = data.get("activities", data) if isinstance(data, dict) else data
    names: list[str] = []
    for item in items or []:
        if isinstance(item, str):
            names.append(item)
        elif isinstance(item, dict) and item.get("name"):
            names.append(item["name"])
    return names


def golden_recall(golden_dir: Path, activities: list[Activity]) -> Optional[dict]:
    if not golden_dir or not golden_dir.is_dir():
        return None
    found = {normalize_name(a.name) for a in activities}
    expected, hits = 0, 0
    for gf in sorted(golden_dir.glob("*.json")):
        try:
            data = json.loads(gf.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError):
            continue
        for name in _golden_names(data):
            expected += 1
            if normalize_name(name) in found:
                hits += 1
    if expected == 0:
        return None
    return {"expected": expected, "found": hits, "recall": round(hits / expected, 3)}


# --------------------------------------------------------------------------
# load + validate + excerpt-check the extraction files
# --------------------------------------------------------------------------
def collect_activities(
    extracted_dir: Path,
    chunks_dir: Path,
    sources_dir: Path,
    schema: dict,
) -> dict:
    """Validate, excerpt-check and convert every extraction file."""
    # Local imports, done once here rather than on every loop iteration.
    from import_common import chunk_key_for, validate_extraction

    rejected_dir = extracted_dir / "_rejected"
    activities: list[Activity] = []
    report = {
        "files_total": 0,
        "files_valid": 0,
        "files_rejected_schema": 0,
        "activities_raw": 0,
        "activities_hallucinated": 0,
        "category_fallbacks": [],
    }
    raw_categories: list[tuple[str, str]] = []

    for json_path in iter_extraction_files(extracted_dir):
        report["files_total"] += 1
        try:
            data = json.loads(json_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            _reject_file(json_path, rejected_dir, [f"invalid JSON: {exc}"])
            report["files_rejected_schema"] += 1
            continue

        errors = validate_extraction(data, schema)
        if errors:
            _reject_file(json_path, rejected_dir, errors)
            report["files_rejected_schema"] += 1
            continue
        report["files_valid"] += 1

        header = data.get("header", {})
        chunk_text = find_chunk_text(json_path, header, chunks_dir)
        source_id = header.get("source_id") or chunk_key_for(json_path, header).rsplit(
            ".part", 1
        )[0]
        fallback_source = (
            source_path_for(source_id, sources_dir) or source_id or json_path.stem
        )

        hallucinated: list[dict] = []
        for adict in data.get("activities", []):
            report["activities_raw"] += 1
            excerpt = adict.get("source_excerpt") or ""
            # If the chunk text is unavailable the excerpt cannot be verified:
            # the activity is kept, and it is still counted in activities_raw.
            if chunk_text is not None and not excerpt_matches(excerpt, chunk_text):
                hallucinated.append(adict)
                report["activities_hallucinated"] += 1
                continue
            src = adict.get("source_file") or fallback_source
            raw_category = adict.get("category", "")
            raw_categories.append((raw_category, normalize_category(raw_category)))
            activities.append(dict_to_activity(adict, src))

        if hallucinated:
            _log_hallucinations(json_path, rejected_dir, hallucinated)

    report["category_fallbacks"] = log_category_fallbacks(raw_categories)
    report["activities"] = activities
    return report


def _reject_file(json_path: Path, rejected_dir: Path, errors: list[str]) -> None:
    rejected_dir.mkdir(parents=True, exist_ok=True)
    dest = rejected_dir / json_path.name
    shutil.move(str(json_path), str(dest))
    log = rejected_dir / f"{json_path.stem}.errors.txt"
    log.write_text(
        f"REJECTED (schema validation): {json_path.name}\n\n"
        + "\n".join(f" - {e}" for e in errors)
        + "\n",
        encoding="utf-8",
    )


def _log_hallucinations(
    json_path: Path, rejected_dir: Path, hallucinated: list[dict]
) -> None:
    rejected_dir.mkdir(parents=True, exist_ok=True)
    log = rejected_dir / f"{json_path.stem}.hallucinations.txt"
    lines = [f"DROPPED activities (source_excerpt not found in chunk): {json_path.name}", ""]
    for a in hallucinated:
        lines.append(f" - {a.get('name')!r}")
        lines.append(f" excerpt: {a.get('source_excerpt')!r}")
    log.write_text("\n".join(lines) + "\n", encoding="utf-8")


# --------------------------------------------------------------------------
# DB write + atomic swap
# --------------------------------------------------------------------------
def _enrich_category_display_names(db_path: Path) -> None:
    """Give the categories table proper Romanian display names for slugs."""
    import sqlite3

    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT value FROM categories WHERE type = 'category'"
        ).fetchall()
        for (slug,) in rows:
            conn.execute(
                "UPDATE categories SET display_name = ? WHERE type='category' AND value = ?",
                (category_display_name(slug), slug),
            )
        conn.commit()
    finally:
        conn.close()


def write_database(db_tmp_path: Path, activities: list[Activity]) -> None:
    """Create a fresh tmp DB, bulk insert, populate categories, rebuild FTS."""
    if db_tmp_path.exists():
        db_tmp_path.unlink()
    db = DatabaseManager(str(db_tmp_path))
    db.bulk_insert_activities(activities)
    _enrich_category_display_names(db_tmp_path)
    db.rebuild_fts_index()


def atomic_swap(db_tmp_path: Path, db_path: Path) -> Optional[Path]:
    """Back up the live DB then atomically swap the tmp file in."""
    backup: Optional[Path] = None
    if db_path.exists():
        backup = db_path.with_suffix(db_path.suffix + ".bak")
        shutil.copy2(db_path, backup)
    os.replace(db_tmp_path, db_path)
    return backup


# --------------------------------------------------------------------------
# orchestration
# --------------------------------------------------------------------------
def rebuild(
    *,
    extracted_dir: Path,
    chunks_dir: Path,
    sources_dir: Path,
    db_path: Path,
    decisions_path: Optional[Path] = None,
    schema_path: Path = DEFAULT_SCHEMA_PATH,
    golden_dir: Optional[Path] = None,
    do_swap: bool = True,
) -> dict:
    """
    Full rebuild. Everything is built into <db_path>.tmp; the live DB is only
    touched by the final atomic swap, so a crash anywhere above leaves it intact.
    """
    extracted_dir = Path(extracted_dir)
    db_path = Path(db_path)
    db_tmp_path = db_path.with_suffix(db_path.suffix + ".tmp")

    schema = load_schema(schema_path)
    collected = collect_activities(extracted_dir, Path(chunks_dir), Path(sources_dir), schema)
    activities: list[Activity] = collected.pop("activities")

    deduped, dedup_stats = dedup_activities(activities)

    decisions = load_review_decisions(Path(decisions_path)) if decisions_path else {}
    final, decision_stats = apply_review_decisions(deduped, decisions)

    try:
        write_database(db_tmp_path, final)
        backup = atomic_swap(db_tmp_path, db_path) if do_swap else None
    except Exception:
        if db_tmp_path.exists():
            db_tmp_path.unlink()
        raise

    report = {
        **collected,
        "dedup": dedup_stats,
        "decisions": decision_stats,
        "final_count": len(final),
        "backup": str(backup) if backup else None,
        "swapped": do_swap,
        "qa": _qa_report(final, collected, golden_dir),
    }
    return report


def _qa_report(
    activities: list[Activity], collected: dict, golden_dir: Optional[Path]
) -> dict:
    per_category: dict[str, int] = defaultdict(int)
    per_content_type: dict[str, int] = defaultdict(int)
    confidence: dict[str, int] = defaultdict(int)
    with_rules = 0
    for a in activities:
        per_category[a.category] += 1
        per_content_type[a.content_type or "?"] += 1
        confidence[a.extraction_confidence or "?"] += 1
        if a.rules and a.rules.strip():
            with_rules += 1
    raw = collected.get("activities_raw", 0)
    hallucinated = collected.get("activities_hallucinated", 0)
    return {
        "total": len(activities),
        "per_category": dict(per_category),
        "per_content_type": dict(per_content_type),
        "extraction_confidence": dict(confidence),
        "pct_with_rules": round(100 * with_rules / len(activities), 1) if activities else 0.0,
        "needs_review": sum(1 for a in activities if a.needs_review),
        "hallucination_rate": round(100 * hallucinated / raw, 2) if raw else 0.0,
        "golden_recall": golden_recall(Path(golden_dir), activities) if golden_dir else None,
    }


def print_report(report: dict) -> None:
    qa = report["qa"]
    print("=" * 60)
    print("BUILD DATABASE — QA REPORT")
    print("=" * 60)
    print(f"extraction files : {report['files_total']} "
          f"(valid {report['files_valid']}, schema-rejected {report['files_rejected_schema']})")
    print(f"activities raw : {report['activities_raw']}")
    print(f" hallucinated drop : {report['activities_hallucinated']} "
          f"({qa['hallucination_rate']}%)")
    d = report["dedup"]
    print(f"dedup : {d['input']} -> {d['output']} "
          f"(auto-merged {d['auto_merged']}, borderline {d['borderline']})")
    print(f"review decisions : dropped {report['decisions']['dropped']}, "
          f"resolved {report['decisions']['resolved']}")
    print(f"final inserted : {report['final_count']}")
    print(f"% with rules : {qa['pct_with_rules']}")
    print(f"needs_review rows : {qa['needs_review']}")
    print("per category :")
    for slug, n in sorted(qa["per_category"].items(), key=lambda kv: -kv[1]):
        print(f" {slug:<24}: {n}")
    print("per content_type :")
    for ct, n in sorted(qa["per_content_type"].items(), key=lambda kv: -kv[1]):
        print(f" {ct:<24}: {n}")
    print("extraction_confidence:")
    for c, n in sorted(qa["extraction_confidence"].items()):
        print(f" {c:<24}: {n}")
    if qa["golden_recall"]:
        g = qa["golden_recall"]
        print(f"golden recall : {g['found']}/{g['expected']} = {g['recall']}")
    if report["category_fallbacks"]:
        print("category fallbacks :")
        for msg in report["category_fallbacks"]:
            print(f" {msg}")
    if report["backup"]:
        print(f"live DB backed up to : {report['backup']}")
    print("=" * 60)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# CLI
|
||||
# --------------------------------------------------------------------------
|
||||
def main(argv: Optional[list[str]] = None) -> int:
|
||||
parser = argparse.ArgumentParser(description="Build activities.db from extraction JSON.")
|
||||
parser.add_argument("--rebuild", action="store_true",
|
||||
help="rebuild the database from scratch (only mode supported)")
|
||||
parser.add_argument("--extracted", default="data/extracted")
|
||||
parser.add_argument("--chunks", default="data/chunks")
|
||||
parser.add_argument("--sources", default="data/sources")
|
||||
parser.add_argument("--db", default="data/activities.db")
|
||||
parser.add_argument("--decisions", default="data/review_decisions.json")
|
||||
parser.add_argument("--golden", default="data/golden")
|
||||
parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
|
||||
args = parser.parse_args(argv)
|
||||
|
||||
if not args.rebuild:
|
||||
parser.error("only --rebuild is supported (full rebuild, no incremental merge)")
|
||||
|
||||
report = rebuild(
|
||||
extracted_dir=Path(args.extracted),
|
||||
chunks_dir=Path(args.chunks),
|
||||
sources_dir=Path(args.sources),
|
||||
db_path=Path(args.db),
|
||||
decisions_path=Path(args.decisions),
|
||||
schema_path=Path(args.schema),
|
||||
golden_dir=Path(args.golden),
|
||||
)
|
||||
print_report(report)
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
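Review note: `main` rejects any invocation without `--rebuild` through `parser.error`, which prints a usage message to stderr and exits with status 2. The sketch below reproduces that behavior in isolation; `parse` and `exit_code` are illustrative names, not part of `build_database.py`.

```python
import argparse


def parse(argv: list) -> argparse.Namespace:
    """Parse a reduced version of the CLI that only knows --rebuild."""
    parser = argparse.ArgumentParser(prog="build_database.py")
    parser.add_argument("--rebuild", action="store_true")
    args = parser.parse_args(argv)
    if not args.rebuild:
        # argparse prints the message to stderr and raises SystemExit(2)
        parser.error("only --rebuild is supported")
    return args


def exit_code(argv: list) -> int:
    """Return the process exit status the CLI would produce for argv."""
    try:
        parse(argv)
    except SystemExit as exc:
        return exc.code
    return 0
```

Callers that script the rebuild can therefore tell a usage error (status 2) apart from a successful run (status 0).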
251
scripts/chunk_sources.py
Normal file
@@ -0,0 +1,251 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
chunk_sources.py — split normalized data/sources/*.txt into ~20-page chunks
|
||||
for subagent extraction, and maintain data/chunks/manifest.json.
|
||||
|
||||
Paginated text → ~20-page chunks, ~4-page overlap (plan D8).
|
||||
Unpaginated text → ~10000-word windows, ~2000-word overlap.
|
||||
|
||||
The manifest is a cache derived from the filesystem + per-chunk state. Re-running
|
||||
this script is idempotent: existing chunk states (pending/assigned/done/rejected)
|
||||
survive as long as the source content hash is unchanged.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
if str(SCRIPT_DIR) not in sys.path:
|
||||
sys.path.insert(0, str(SCRIPT_DIR))
|
||||
|
||||
from extract_common import content_hash, split_pages # noqa: E402
|
||||
|
||||
SCHEMA_VERSION = "1.0"
|
||||
PAGES_PER_CHUNK = 20
|
||||
PAGE_OVERLAP = 4
|
||||
WORD_WINDOW = 10_000
|
||||
WORD_OVERLAP = 2_000
|
||||
|
||||
VALID_STATES = {"pending", "assigned", "done", "rejected"}
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# header parsing
|
||||
# --------------------------------------------------------------------------
|
||||
def parse_source(text: str) -> tuple[dict, str]:
    """Split a normalized source file into (header_dict, body)."""
    lines = text.splitlines()
    header: dict = {}
    body_start = 0
    in_header = True
    for i, line in enumerate(lines):
        if line.startswith("--- PAGE "):
            body_start = i
            break
        if not in_header:
            continue
        if set(line.strip()) == {"="} and line.strip():
            body_start = i + 1
            in_header = False  # header ends at the rule line
            continue
        if ":" in line:
            key, _, val = line.partition(":")
            header[key.strip()] = val.strip()
    body = "\n".join(lines[body_start:])
    return header, body
|
||||
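Review note: `parse_source` expects `KEY: value` header lines closed by a rule line made only of `=` characters. The sample file below is an assumption built from that logic and from the `SOURCE:` / `CONVERTED:` keys named in `extract_common.py`; the real header written by `normalize_sources.py` may carry other fields. `parse_header` is a simplified stand-in that stops at the rule line and does not look for the first page marker.

```python
# Hypothetical normalized source file; field values are made up for illustration.
SAMPLE = (
    "SOURCE: books/games.pdf\n"
    "CONVERTED: 2024-01-01\n"
    + "=" * 40 + "\n"
    "\n"
    "--- PAGE 1 ---\n"
    "First page text\n"
)


def parse_header(text: str) -> tuple:
    """Split text into (header_dict, body) at the first all-'=' rule line."""
    header: dict = {}
    lines = text.splitlines()
    body_start = 0
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped and set(stripped) == {"="}:
            body_start = i + 1
            break
        if ":" in line:
            # partition splits on the first colon only, so values may contain ':'
            key, _, val = line.partition(":")
            header[key.strip()] = val.strip()
    return header, "\n".join(lines[body_start:])
```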
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# chunking — pure functions
|
||||
# --------------------------------------------------------------------------
|
||||
def chunk_pages(
    pages: list[tuple[int, str]],
    pages_per_chunk: int = PAGES_PER_CHUNK,
    overlap: int = PAGE_OVERLAP,
) -> list[dict]:
    """
    Split an ordered list of (page_no, text) into overlapping chunks.

    stride = pages_per_chunk - overlap (clamped to at least 1). With
    overlap < pages_per_chunk, consecutive chunks share `overlap` pages, so an
    activity spanning at most overlap + 1 pages appears whole in at least one chunk.
    """
    if not pages:
        return []
    stride = max(1, pages_per_chunk - overlap)
    chunks: list[dict] = []
    i = 0
    n = len(pages)
    while i < n:
        window = pages[i : i + pages_per_chunk]
        first, last = window[0][0], window[-1][0]
        text = "".join(
            f"\n--- PAGE {num} ---\n{txt}\n" for num, txt in window
        )
        chunks.append(
            {"page_start": first, "page_end": last,
             "chunk_range": f"pages {first}-{last}", "text": text}
        )
        if i + pages_per_chunk >= n:
            break
        i += stride
    return chunks
|
||||
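Review note: the windowing arithmetic in `chunk_pages` is easiest to check on page numbers alone. The sketch below mirrors the loop for a document whose pages are numbered contiguously from 1 (an assumption; the real function reads page numbers from the markers). `page_windows` is a hypothetical helper, not part of the commit.

```python
def page_windows(n_pages: int, per_chunk: int = 20, overlap: int = 4) -> list:
    """Return inclusive 1-based (first, last) page ranges, one per chunk."""
    stride = max(1, per_chunk - overlap)
    ranges = []
    start = 0
    while start < n_pages:
        end = min(start + per_chunk, n_pages)
        ranges.append((start + 1, end))
        if start + per_chunk >= n_pages:
            break  # this window already reaches the last page
        start += stride
    return ranges
```

With the defaults, a 50-page source yields the ranges 1-20, 17-36 and 33-50: each chunk repeats the last four pages of the previous one, and the final chunk is shorter instead of producing a tiny trailing chunk.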
|
||||
|
||||
def chunk_words(
    text: str, window: int = WORD_WINDOW, overlap: int = WORD_OVERLAP
) -> list[dict]:
    """Split unpaginated text into overlapping word windows."""
    words = text.split()
    if not words:
        return []
    stride = max(1, window - overlap)
    chunks: list[dict] = []
    i = 0
    n = len(words)
    while i < n:
        seg = words[i : i + window]
        chunks.append(
            {"word_start": i, "word_end": i + len(seg),
             "chunk_range": f"words {i}-{i + len(seg)}", "text": " ".join(seg)}
        )
        if i + window >= n:
            break
        i += stride
    return chunks


def make_chunks(source_text: str) -> list[dict]:
    """Chunk one normalized source file. Picks page- or word-windowing."""
    _, body = parse_source(source_text)
    pages = split_pages(body)
    if pages:
        return chunk_pages(pages)
    return chunk_words(body)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# manifest
|
||||
# --------------------------------------------------------------------------
|
||||
def _empty_manifest() -> dict:
|
||||
return {"schema_version": SCHEMA_VERSION, "chunks": {}}
|
||||
|
||||
|
||||
def load_manifest(manifest_path: Path) -> dict:
|
||||
if manifest_path.exists():
|
||||
try:
|
||||
data = json.loads(manifest_path.read_text(encoding="utf-8"))
|
||||
data.setdefault("schema_version", SCHEMA_VERSION)
|
||||
data.setdefault("chunks", {})
|
||||
return data
|
||||
except (json.JSONDecodeError, OSError):
|
||||
pass
|
||||
return _empty_manifest()
|
||||
|
||||
|
||||
def save_manifest(manifest: dict, manifest_path: Path) -> None:
|
||||
manifest_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
manifest_path.write_text(
|
||||
json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8"
|
||||
)
|
||||
|
||||
|
||||
def chunk_source_file(
    source_path: Path, chunks_dir: Path, manifest: dict
) -> list[str]:
    """
    Chunk one data/sources/<id>.txt → data/chunks/<id>/<id>.partNN.txt and
    register every chunk in `manifest`. Preserves prior state when the source
    content hash is unchanged. Returns the list of chunk keys written.
    """
    source_id = source_path.stem
    text = source_path.read_text(encoding="utf-8", errors="replace")
    src_hash = content_hash(text)
    chunks = make_chunks(text)

    out_dir = chunks_dir / source_id
    out_dir.mkdir(parents=True, exist_ok=True)

    written: list[str] = []
    for idx, chunk in enumerate(chunks, 1):
        key = f"{source_id}.part{idx:02d}"
        chunk_file = out_dir / f"{key}.txt"
        chunk_file.write_text(chunk["text"], encoding="utf-8")

        prior = manifest["chunks"].get(key)
        # preserve state only if the source content is unchanged
        if prior and prior.get("source_hash") == src_hash and \
                prior.get("state") in VALID_STATES:
            state = prior["state"]
        else:
            state = "pending"

        manifest["chunks"][key] = {
            "source_id": source_id,
            "source_hash": src_hash,
            "part": idx,
            "chunk_range": chunk["chunk_range"],
            "chunk_file": str(chunk_file.relative_to(chunks_dir.parent)),
            "expected_json": f"{key}.json",
            "state": state,
        }
        written.append(key)
    return written
|
||||
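Review note: the idempotency promised in the module docstring comes down to one rule inside `chunk_source_file`: a chunk keeps its prior state only when the source hash matches and the stored state is a known one. A minimal sketch of that rule as a pure function, where `next_state` is a hypothetical name introduced for illustration:

```python
VALID_STATES = {"pending", "assigned", "done", "rejected"}


def next_state(prior, src_hash: str) -> str:
    """Return the state a chunk should carry after re-chunking its source."""
    if (
        prior
        and prior.get("source_hash") == src_hash
        and prior.get("state") in VALID_STATES
    ):
        return prior["state"]  # unchanged source: keep pending/assigned/done/rejected
    return "pending"  # new chunk, changed source, or corrupt state
```

Any edit to the source file changes its hash, so every chunk of that source goes back to `pending`, even chunks whose own text did not change.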
|
||||
|
||||
def prune_stale(manifest: dict, live_keys: set[str]) -> list[str]:
    """
    Drop manifest entries whose key was not (re)written in this run.

    Only the manifest is pruned; chunk files already on disk are left untouched.
    """
    stale = [k for k in manifest["chunks"] if k not in live_keys]
    for k in stale:
        del manifest["chunks"][k]
    return stale
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# CLI
|
||||
# --------------------------------------------------------------------------
|
||||
def run(sources_dir: Path, chunks_dir: Path) -> dict:
|
||||
"""Chunk every *.txt in sources_dir. Returns a summary dict."""
|
||||
manifest_path = chunks_dir / "manifest.json"
|
||||
manifest = load_manifest(manifest_path)
|
||||
|
||||
live_keys: set[str] = set()
|
||||
source_files = sorted(sources_dir.glob("*.txt"))
|
||||
for src in source_files:
|
||||
live_keys.update(chunk_source_file(src, chunks_dir, manifest))
|
||||
|
||||
stale = prune_stale(manifest, live_keys)
|
||||
save_manifest(manifest, manifest_path)
|
||||
|
||||
states: dict[str, int] = {}
|
||||
for meta in manifest["chunks"].values():
|
||||
states[meta["state"]] = states.get(meta["state"], 0) + 1
|
||||
return {
|
||||
"sources": len(source_files),
|
||||
"chunks": len(live_keys),
|
||||
"pruned": len(stale),
|
||||
"states": states,
|
||||
}
|
||||
|
||||
|
||||
def main(argv: list[str] | None = None) -> int:
|
||||
parser = argparse.ArgumentParser(description="Chunk normalized sources.")
|
||||
parser.add_argument("--sources", default="data/sources", help="sources dir")
|
||||
parser.add_argument("--chunks", default="data/chunks", help="chunks output dir")
|
||||
args = parser.parse_args(argv)
|
||||
|
||||
summary = run(Path(args.sources), Path(args.chunks))
|
||||
print(f"sources processed : {summary['sources']}")
|
||||
print(f"chunks written : {summary['chunks']}")
|
||||
print(f"stale pruned : {summary['pruned']}")
|
||||
for state, count in sorted(summary["states"].items()):
|
||||
print(f" {state:<10}: {count}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
@@ -1,54 +0,0 @@
|
||||
# TEMPLATE FOR ACTIVITY EXTRACTION WITH CLAUDE

## Instructions for Claude Code:

For each PDF/DOC, use the following extraction format:

### 1. Read the file:
```
Claude, please read the file: [FILE_PATH]
```

### 2. Extract the activities using this JSON template:
```json
{
  "source_file": "[FILE_NAME]",
  "activities": [
    {
      "name": "Activity name",
      "description": "Full description of the activity",
      "rules": "Rules of the game/activity",
      "variations": "Variants or adaptations",
      "category": "[A-H] based on type",
      "age_group_min": 6,
      "age_group_max": 14,
      "participants_min": 4,
      "participants_max": 20,
      "duration_min": 10,
      "duration_max": 30,
      "materials_list": "List of required materials",
      "skills_developed": "Skills developed",
      "difficulty_level": "Easy/Medium/Hard",
      "keywords": "comma-separated keywords",
      "tags": "relevant tags"
    }
  ]
}
```

### 3. Save to a file:
After extraction, save the JSON to: `/scripts/extracted_activities/[FILE_NAME].json`

### 4. Processing priorities:

**TOP PRIORITY (process these first):**
1. 1000 Fantastic Scout Games.pdf
2. Cartea Mare a jocurilor.pdf
3. 160-de-activitati-dinamice-jocuri-pentru-team-building.pdf
4. 101 Ways to Create an Unforgettable Camp Experience.pdf
5. 151 Awesome Summer Camp Nature Activities.pdf

**Focus categories:**
- [A] Scouting Games
- [C] Camping & Outdoor Activities
- [G] Educational Activities
||||
@@ -1,164 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
DATABASE SETUP SCRIPT - INDEX-SISTEM-JOCURI
|
||||
|
||||
Script for recreating the databases that are excluded via .gitignore
|
||||
Uses the DatabaseManager classes for consistency
|
||||
|
||||
Usage:
|
||||
python scripts/create_databases.py
|
||||
python scripts/create_databases.py --clear-existing
|
||||
"""
|
||||
|
||||
import sys
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
|
||||
# Add src to path so we can import our modules
|
||||
sys.path.append(str(Path(__file__).parent.parent / 'src'))
|
||||
|
||||
from database import DatabaseManager
|
||||
from game_library_manager import GameLibraryManager
|
||||
|
||||
def create_main_database(db_path: str = "data/activities.db", clear: bool = False):
|
||||
"""Create the main activities database"""
|
||||
db_file = Path(db_path)
|
||||
|
||||
if clear and db_file.exists():
|
||||
print(f"🗑️ Removing existing database: {db_path}")
|
||||
db_file.unlink()
|
||||
|
||||
print(f"📊 Creating main database: {db_path}")
|
||||
db = DatabaseManager(db_path)
|
||||
|
||||
# Test the database
|
||||
try:
|
||||
stats = db.get_statistics()
|
||||
print(f"✅ Database created successfully: {stats['total_activities']} activities")
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Error creating database: {e}")
|
||||
return False
|
||||
|
||||
def create_game_library_database(db_path: str = "data/game_library.db", clear: bool = False):
|
||||
"""Create the legacy game library database"""
|
||||
db_file = Path(db_path)
|
||||
|
||||
if clear and db_file.exists():
|
||||
print(f"🗑️ Removing existing database: {db_path}")
|
||||
db_file.unlink()
|
||||
|
||||
print(f"📊 Creating game library database: {db_path}")
|
||||
manager = GameLibraryManager(db_path)
|
||||
|
||||
print(f"✅ Game library database created successfully")
|
||||
return True
|
||||
|
||||
def create_test_database(db_path: str = "data/test_activities.db", clear: bool = False):
|
||||
"""Create the test database"""
|
||||
db_file = Path(db_path)
|
||||
|
||||
if clear and db_file.exists():
|
||||
print(f"🗑️ Removing existing database: {db_path}")
|
||||
db_file.unlink()
|
||||
|
||||
print(f"📊 Creating test database: {db_path}")
|
||||
db = DatabaseManager(db_path)
|
||||
|
||||
# Add some test data
|
||||
test_activity = {
|
||||
'title': 'Test Activity - Setup Script',
|
||||
'description': 'This is a test activity created by the setup script',
|
||||
'file_path': 'test/sample.txt',
|
||||
'file_type': 'TXT',
|
||||
'category': 'test',
|
||||
'age_group': '8-12 ani',
|
||||
'participants': '5-10 persoane',
|
||||
'duration': '15-30min',
|
||||
'materials': 'Fără materiale',
|
||||
'tags': '["test", "setup"]',
|
||||
'source_text': 'Sample test content for verification'
|
||||
}
|
||||
|
||||
try:
|
||||
db.insert_activity(test_activity)
|
||||
stats = db.get_statistics()
|
||||
print(f"✅ Test database created with sample data: {stats['total_activities']} activities")
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Error creating test database: {e}")
|
||||
return False
|
||||
|
||||
def ensure_data_directory():
|
||||
"""Ensure the data directory exists"""
|
||||
data_dir = Path("data")
|
||||
if not data_dir.exists():
|
||||
print(f"📁 Creating data directory: {data_dir}")
|
||||
data_dir.mkdir(parents=True)
|
||||
else:
|
||||
print(f"📁 Data directory exists: {data_dir}")
|
||||
|
||||
def main():
|
||||
"""Main setup function"""
|
||||
parser = argparse.ArgumentParser(description='Create databases for INDEX-SISTEM-JOCURI')
|
||||
parser.add_argument('--clear-existing', '-c', action='store_true',
|
||||
help='Remove existing databases before creating new ones')
|
||||
parser.add_argument('--main-only', action='store_true',
|
||||
help='Create only the main activities database')
|
||||
parser.add_argument('--test-only', action='store_true',
|
||||
help='Create only the test database')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
print("🚀 DATABASE SETUP - INDEX-SISTEM-JOCURI")
|
||||
print("=" * 50)
|
||||
|
||||
# Ensure data directory exists
|
||||
ensure_data_directory()
|
||||
|
||||
success_count = 0
|
||||
total_count = 0
|
||||
|
||||
if args.test_only:
|
||||
total_count = 1
|
||||
if create_test_database(clear=args.clear_existing):
|
||||
success_count += 1
|
||||
elif args.main_only:
|
||||
total_count = 1
|
||||
if create_main_database(clear=args.clear_existing):
|
||||
success_count += 1
|
||||
else:
|
||||
# Create all databases
|
||||
databases = [
|
||||
("Main activities", lambda: create_main_database(clear=args.clear_existing)),
|
||||
("Game library", lambda: create_game_library_database(clear=args.clear_existing)),
|
||||
("Test activities", lambda: create_test_database(clear=args.clear_existing))
|
||||
]
|
||||
|
||||
total_count = len(databases)
|
||||
|
||||
for name, create_func in databases:
|
||||
print(f"\n📂 Creating {name} database...")
|
||||
try:
|
||||
if create_func():
|
||||
success_count += 1
|
||||
except Exception as e:
|
||||
print(f"❌ Failed to create {name} database: {e}")
|
||||
|
||||
print("\n" + "=" * 50)
|
||||
print(f"🎯 SUMMARY: {success_count}/{total_count} databases created successfully")
|
||||
|
||||
if success_count == total_count:
|
||||
print("✅ All databases ready!")
|
||||
print("\nNext steps:")
|
||||
print("1. Run indexer: cd src && python indexer.py --clear-db")
|
||||
print("2. Start web app: cd src && python app.py")
|
||||
else:
|
||||
print("⚠️ Some databases failed to create. Check errors above.")
|
||||
return 1
|
||||
|
||||
return 0
|
||||
|
||||
if __name__ == '__main__':
|
||||
sys.exit(main())
|
||||
361
scripts/extract_common.py
Normal file
@@ -0,0 +1,361 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
extract_common.py — single home for per-format text extraction.
|
||||
|
||||
Every extractor returns a plain text *body* with synthetic page markers
|
||||
(`--- PAGE N ---`). The file-level header (`SOURCE:` / `CONVERTED:`) is added
|
||||
by normalize_sources.py, not here.
|
||||
|
||||
Critical fix vs. the old pdf_to_text_converter.py: there is NO `max_pages` cap.
|
||||
Large books are extracted in full.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
import importlib
|
||||
import os
|
||||
import re
|
||||
import shutil
|
||||
import subprocess
|
||||
import tempfile
|
||||
import zipfile
|
||||
from pathlib import Path
|
||||
from typing import Callable
|
||||
|
||||
PAGE_MARKER_RE = re.compile(r"^--- PAGE (\d+) ---\s*$", re.MULTILINE)
|
||||
|
||||
# paragraphs per synthetic page for paginated-by-flow formats (docx)
|
||||
DOCX_PARAS_PER_PAGE = 40
|
||||
|
||||
# formats we deliberately ignore (epub duplicates existing PDFs — plan §1)
|
||||
IGNORED_EXTENSIONS = {".epub"}
|
||||
|
||||
# obvious junk filenames skipped during a walk
|
||||
JUNK_NAMES = {"desktop.ini", "linkuri-jocuri.txt"}
|
||||
JUNK_SUFFIXES = {".bak", ".tmp", ".ini"}
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# page assembly helpers
|
||||
# --------------------------------------------------------------------------
|
||||
def join_pages(pages: list[str], start: int = 1) -> str:
    """Join a list of page texts into a body string with `--- PAGE N ---`."""
    out: list[str] = []
    for i, text in enumerate(pages, start):
        out.append(f"\n--- PAGE {i} ---\n{(text or '').strip()}\n")
    return "".join(out)


def split_pages(body: str) -> list[tuple[int, str]]:
    """Inverse of join_pages: parse a body into [(page_number, text), ...]."""
    matches = list(PAGE_MARKER_RE.finditer(body))
    if not matches:
        return []
    pages: list[tuple[int, str]] = []
    for idx, m in enumerate(matches):
        num = int(m.group(1))
        seg_start = m.end()
        seg_end = matches[idx + 1].start() if idx + 1 < len(matches) else len(body)
        pages.append((num, body[seg_start:seg_end].strip()))
    return pages


def count_page_markers(body: str) -> int:
    return len(PAGE_MARKER_RE.findall(body))
|
||||
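Review note: `join_pages` and `split_pages` are meant to be inverses, which is what lets `extract_zip` renumber inner files. The sketch below copies the two helpers from the diff into a standalone snippet so the round trip can be checked, including an empty page and a non-default start number.

```python
import re

PAGE_MARKER_RE = re.compile(r"^--- PAGE (\d+) ---\s*$", re.MULTILINE)


def join_pages(pages: list, start: int = 1) -> str:
    """Concatenate page texts, each preceded by a `--- PAGE N ---` marker."""
    return "".join(
        f"\n--- PAGE {i} ---\n{(text or '').strip()}\n"
        for i, text in enumerate(pages, start)
    )


def split_pages(body: str) -> list:
    """Parse a marked body back into [(page_number, text), ...]."""
    matches = list(PAGE_MARKER_RE.finditer(body))
    pages = []
    for idx, m in enumerate(matches):
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(body)
        pages.append((int(m.group(1)), body[m.end():end].strip()))
    return pages
```

The round trip is exact only up to surrounding whitespace, because both directions call `strip()` on page text.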
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# format detection
|
||||
# --------------------------------------------------------------------------
|
||||
FORMAT_BY_EXT = {
|
||||
".pdf": "pdf",
|
||||
".docx": "docx",
|
||||
".doc": "doc",
|
||||
".pptx": "pptx",
|
||||
".ppt": "pptx",
|
||||
".htm": "html",
|
||||
".html": "html",
|
||||
".zip": "zip",
|
||||
".epub": "epub",
|
||||
".txt": "txt",
|
||||
}
|
||||
|
||||
|
||||
def detect_format(path: str | os.PathLike) -> str:
|
||||
"""Return a format key for a path based on its extension."""
|
||||
ext = Path(path).suffix.lower()
|
||||
return FORMAT_BY_EXT.get(ext, "unknown")
|
||||
|
||||
|
||||
def is_junk(path: str | os.PathLike) -> bool:
|
||||
p = Path(path)
|
||||
name = p.name.lower()
|
||||
if name in JUNK_NAMES:
|
||||
return True
|
||||
if name.startswith("readme") and p.suffix.lower() == ".md":
|
||||
return True
|
||||
if p.suffix.lower() in JUNK_SUFFIXES:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# content hashing + near-duplicate elimination
|
||||
# --------------------------------------------------------------------------
|
||||
def _normalize_for_hash(text: str) -> str:
    return re.sub(r"\s+", " ", (text or "")).strip().lower()


def content_hash(text: str) -> str:
    """Stable SHA1 of whitespace-normalized, lowercased text; used for exact-dup detection."""
    return hashlib.sha1(_normalize_for_hash(text).encode("utf-8")).hexdigest()


def near_duplicate_ratio(a: str, b: str) -> float:
    """Similarity score in [0, 100] between two texts (rapidfuzz token_sort_ratio)."""
    from rapidfuzz import fuzz

    return fuzz.token_sort_ratio(_normalize_for_hash(a), _normalize_for_hash(b))


def dedupe_texts(
    items: list[tuple[str, str]], threshold: float = 95.0
) -> list[tuple[str, str]]:
    """
    Drop exact and near-duplicate texts from a list of (key, text) pairs.

    Used for HTML mirror pages (print copies, repeated index/footer pages).
    Keeps the first occurrence. The exact-hash check is O(n) overall; the fuzzy
    check compares each item only against the k already-kept items, O(n*k).
    """
    kept: list[tuple[str, str]] = []
    seen_hashes: set[str] = set()
    for key, text in items:
        h = content_hash(text)
        if h in seen_hashes:
            continue
        if any(near_duplicate_ratio(text, kt) >= threshold for _, kt in kept):
            continue
        seen_hashes.add(h)
        kept.append((key, text))
    return kept
|
||||
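Review note: the two-stage deduplication (exact hash first, fuzzy comparison only against kept items) can be tried without installing rapidfuzz. The sketch below swaps in `difflib.SequenceMatcher` from the standard library as the similarity measure. This is a stand-in, not the commit's method: its scores differ from `fuzz.token_sort_ratio`, so the 95.0 threshold is not directly comparable.

```python
import hashlib
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, as _normalize_for_hash does."""
    return re.sub(r"\s+", " ", text or "").strip().lower()


def content_hash(text: str) -> str:
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 100]; NOT equivalent to rapidfuzz token_sort_ratio."""
    return 100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def dedupe(items: list, threshold: float = 95.0) -> list:
    """Keep the first occurrence of each text; drop exact and near duplicates."""
    kept = []
    seen = set()
    for key, text in items:
        h = content_hash(text)
        if h in seen:
            continue  # cheap exact match on normalized text
        if any(similarity(text, kept_text) >= threshold for _, kept_text in kept):
            continue  # expensive fuzzy match, only against kept items
        seen.add(h)
        kept.append((key, text))
    return kept
```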
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# preflight dependency check
|
||||
# --------------------------------------------------------------------------
|
||||
REQUIRED_PYTHON_MODULES = {
|
||||
"pdfplumber": "pdfplumber",
|
||||
"PyPDF2": "pypdf2",
|
||||
"docx": "python-docx",
|
||||
"pptx": "python-pptx",
|
||||
"bs4": "beautifulsoup4",
|
||||
"lxml": "lxml",
|
||||
"jsonschema": "jsonschema",
|
||||
"rapidfuzz": "rapidfuzz",
|
||||
"chardet": "chardet",
|
||||
}
|
||||
|
||||
|
||||
def preflight(check_ocr: bool = False) -> dict:
    """
    Check system + Python dependencies before a long normalization run.

    Returns {'ok': bool, 'missing_python': [...], 'missing_system': [...],
    'warnings': [...]}. libreoffice is a *warning* (only .doc needs it),
    tesseract only checked when check_ocr=True.
    """
    missing_python: list[str] = []
    for module, pip_name in REQUIRED_PYTHON_MODULES.items():
        try:
            importlib.import_module(module)
        except ImportError:
            missing_python.append(pip_name)

    warnings: list[str] = []
    missing_system: list[str] = []

    if not (shutil.which("libreoffice") or shutil.which("soffice")):
        warnings.append("libreoffice not found — legacy .doc files cannot be converted")

    if check_ocr and not shutil.which("tesseract"):
        missing_system.append("tesseract (OCR requested but not installed)")

    return {
        "ok": not missing_python and not missing_system,
        "missing_python": missing_python,
        "missing_system": missing_system,
        "warnings": warnings,
    }
|
||||
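Review note: `preflight` relies on two standard-library probes, `importlib.import_module` for Python packages and `shutil.which` for executables. A small sketch of each probe in isolation; `missing_modules` and `first_on_path` are hypothetical helper names, and the module and binary names in the usage lines are deliberately nonexistent.

```python
import importlib
import shutil


def missing_modules(modules: dict) -> list:
    """Return the pip names of modules that cannot be imported."""
    missing = []
    for module, pip_name in modules.items():
        try:
            importlib.import_module(module)
        except ImportError:  # ModuleNotFoundError is a subclass
            missing.append(pip_name)
    return missing


def first_on_path(*names: str):
    """Return the path of the first executable found on PATH, else None."""
    for name in names:
        found = shutil.which(name)
        if found:
            return found
    return None


print(missing_modules({"json": "json", "no_such_module_xyz_123": "no-such-package"}))
print(first_on_path("no_such_binary_xyz_123"))
```

Keying the dictionary by import name and storing the pip name as the value matters because the two differ for several packages in the diff (`docx` vs `python-docx`, `bs4` vs `beautifulsoup4`).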
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# per-format extractors
|
||||
# --------------------------------------------------------------------------
|
||||
def extract_pdf(path: str | os.PathLike) -> str:
|
||||
"""PDF → body. pdfplumber primary, PyPDF2 fallback. No page cap."""
|
||||
path = str(path)
|
||||
try:
|
||||
return _extract_pdf_pdfplumber(path)
|
||||
except Exception:
|
||||
return _extract_pdf_pypdf2(path)
|
||||
|
||||
|
||||
def _extract_pdf_pdfplumber(path: str) -> str:
|
||||
import pdfplumber
|
||||
|
||||
pages: list[str] = []
|
||||
with pdfplumber.open(path) as pdf:
|
||||
for page in pdf.pages: # ALL pages — no max_pages
|
||||
try:
|
||||
pages.append(page.extract_text() or "")
|
||||
except Exception:
|
||||
pages.append("")
|
||||
return join_pages(pages)
|
||||
|
||||
|
||||
def _extract_pdf_pypdf2(path: str) -> str:
|
||||
import PyPDF2
|
||||
|
||||
pages: list[str] = []
|
||||
with open(path, "rb") as fh:
|
||||
reader = PyPDF2.PdfReader(fh)
|
||||
for page in reader.pages: # ALL pages — no max_pages
|
||||
try:
|
||||
pages.append(page.extract_text() or "")
|
||||
except Exception:
|
||||
pages.append("")
|
||||
return join_pages(pages)
|
||||
|
||||
|
||||
def extract_docx(path: str | os.PathLike) -> str:
    """docx → body. Synthetic page marker every DOCX_PARAS_PER_PAGE paragraphs."""
    import docx

    document = docx.Document(str(path))
    paragraphs = [p.text for p in document.paragraphs]
    pages: list[str] = []
    for i in range(0, max(len(paragraphs), 1), DOCX_PARAS_PER_PAGE):
        chunk = paragraphs[i : i + DOCX_PARAS_PER_PAGE]
        pages.append("\n".join(chunk))
    return join_pages(pages)
|
||||
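Review note: a .docx file has no fixed pages, so `extract_docx` invents them by grouping paragraphs. The grouping can be sketched without python-docx by working on a plain list of strings; `paginate` is a hypothetical name for the loop body shown in the diff.

```python
def paginate(paragraphs: list, per_page: int = 40) -> list:
    """Group paragraphs into synthetic pages of `per_page` paragraphs each."""
    pages = []
    # max(..., 1) guarantees one (empty) page even for an empty document
    for i in range(0, max(len(paragraphs), 1), per_page):
        pages.append("\n".join(paragraphs[i : i + per_page]))
    return pages
```

For example, 95 paragraphs produce three synthetic pages of 40, 40 and 15 paragraphs, and an empty document still produces a single empty page, so downstream page-based chunking always has at least one marker to work with.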
|
||||
|
||||
def extract_doc(path: str | os.PathLike) -> str:
|
||||
"""
|
||||
Legacy .doc → body via `libreoffice --headless --convert-to docx`.
|
||||
|
||||
Raises RuntimeError if libreoffice is unavailable — the caller marks the
|
||||
resulting source `needs_review` regardless (conversion is imperfect).
|
||||
"""
|
||||
soffice = shutil.which("libreoffice") or shutil.which("soffice")
|
||||
if not soffice:
|
||||
raise RuntimeError("libreoffice/soffice not available — cannot convert .doc")
|
||||
|
||||
src = Path(path).resolve()
|
||||
with tempfile.TemporaryDirectory() as tmp:
|
||||
subprocess.run(
|
||||
[soffice, "--headless", "--convert-to", "docx", "--outdir", tmp, str(src)],
|
||||
check=True,
|
||||
capture_output=True,
|
||||
timeout=300,
|
||||
)
|
||||
converted = Path(tmp) / (src.stem + ".docx")
|
||||
if not converted.exists():
|
||||
raise RuntimeError(f"libreoffice produced no output for {src.name}")
|
||||
return extract_docx(converted)
|
||||
|
||||
|
||||
def extract_pptx(path: str | os.PathLike) -> str:
|
||||
"""pptx → body. One page per slide: title + body text + speaker notes."""
|
||||
from pptx import Presentation
|
||||
|
||||
presentation = Presentation(str(path))
|
||||
pages: list[str] = []
|
||||
for slide in presentation.slides:
|
||||
parts: list[str] = []
|
||||
for shape in slide.shapes:
|
||||
if shape.has_text_frame and shape.text_frame.text.strip():
|
||||
parts.append(shape.text_frame.text.strip())
|
||||
if slide.has_notes_slide:
|
||||
notes = slide.notes_slide.notes_text_frame.text.strip()
|
||||
if notes:
|
||||
parts.append(f"[NOTES] {notes}")
|
||||
pages.append("\n".join(parts))
|
||||
return join_pages(pages)
|
||||
|
||||
|
||||
def extract_html(path: str | os.PathLike) -> str:
    """HTML mirror page → body. Strips nav/script/style/footer/header/aside/noscript."""
    import chardet
    from bs4 import BeautifulSoup

    raw = Path(path).read_bytes()
    enc = chardet.detect(raw).get("encoding") or "utf-8"
    soup = BeautifulSoup(raw.decode(enc, errors="replace"), "lxml")

    for tag in soup(["script", "style", "nav", "footer", "header", "aside", "noscript"]):
        tag.decompose()
    # also drop common page chrome identified by its ARIA role
    for tag in soup.find_all(attrs={"role": ["navigation", "banner", "contentinfo"]}):
        tag.decompose()

    text = soup.get_text(separator="\n")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return join_pages(["\n".join(lines)])
|
||||
|
||||
|
||||
def extract_zip(path: str | os.PathLike) -> str:
    """
    zip → body. Unzips into a temp dir and recurses on every extractable inner
    file. Inner files are page-renumbered into one continuous body.
    """
    path = str(path)
    pages: list[str] = []
    with tempfile.TemporaryDirectory() as tmp:
        try:
            with zipfile.ZipFile(path) as zf:
                zf.extractall(tmp)
        except zipfile.BadZipFile:
            return ""
        for inner in sorted(Path(tmp).rglob("*")):
            if not inner.is_file() or is_junk(inner):
                continue
            fmt = detect_format(inner)
            if fmt in ("unknown", "epub", "zip"):
                # nested zips are recursed into here; unknown and epub files are skipped
                if fmt == "zip":
                    body = extract_zip(inner)
                    pages.extend(t for _, t in split_pages(body))
                continue
            try:
                body = extract_file(inner)
            except Exception:
                continue
            pages.extend(t for _, t in split_pages(body))
    return join_pages(pages)
|
||||
|
||||
|
||||
EXTRACTORS: dict[str, Callable[[str | os.PathLike], str]] = {
    "pdf": extract_pdf,
    "docx": extract_docx,
    "doc": extract_doc,
    "pptx": extract_pptx,
    "html": extract_html,
    "zip": extract_zip,
}


def extract_file(path: str | os.PathLike) -> str:
    """Dispatch a single file to the right extractor. Returns a page-marked body."""
    fmt = detect_format(path)
    if fmt == "txt":
        body = Path(path).read_text(encoding="utf-8", errors="replace")
        # already paginated? pass through; else wrap as one page
        return body if count_page_markers(body) else join_pages([body])
    extractor = EXTRACTORS.get(fmt)
    if extractor is None:
        raise ValueError(f"No extractor for format '{fmt}': {path}")
    return extractor(path)
|
||||
@@ -1,424 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
HTML Activity Extractor - processes 1876 HTML files
|
||||
Automatically extracts activities using pattern recognition
|
||||
"""
|
||||
|
||||
import os
|
||||
import re
|
||||
import json
|
||||
from pathlib import Path
|
||||
from bs4 import BeautifulSoup
|
||||
import chardet
|
||||
from typing import List, Dict, Optional
|
||||
import sqlite3
|
||||
from datetime import datetime
|
||||
|
||||
class HTMLActivityExtractor:
|
||||
def __init__(self, db_path='data/activities.db'):
|
||||
self.db_path = db_path
|
||||
# Patterns for detecting activities in Romanian
|
||||
self.activity_patterns = {
|
||||
'title_patterns': [
|
||||
r'(?i)(joc|activitate|exerci[t]iu|team[\s-]?building|energizer|ice[\s-]?breaker)[\s:]+([^\.]{5,100})',
|
||||
r'(?i)<h[1-6][^>]*>([^<]*(?:joc|activitate|exerci[t]iu)[^<]*)</h[1-6]>',
|
||||
r'(?i)<strong>([^<]*(?:joc|activitate|exerci[t]iu)[^<]*)</strong>',
|
||||
r'(?i)^[\d]+\.?\s*([A-Z][^\.]{10,100}(?:joc|activitate|exerci[t]iu)[^\.]{0,50})$',
|
||||
],
|
||||
'description_markers': [
|
||||
'descriere', 'reguli', 'cum se joac[a]', 'instructiuni',
|
||||
'obiectiv', 'desfasurare', 'explicatie', 'mod de joc'
|
||||
],
|
||||
'materials_markers': [
|
||||
'materiale', 'necesare', 'echipament', 'ce avem nevoie',
|
||||
'se folosesc', 'trebuie sa avem', 'dotari'
|
||||
],
|
||||
'age_patterns': [
|
||||
r'(?i)v[âa]rst[ăa][\s:]+(\d+)[\s-]+(\d+)',
|
||||
r'(?i)(\d+)[\s-]+(\d+)\s*ani',
|
||||
r'(?i)pentru\s+(\d+)[\s-]+(\d+)\s*ani',
|
||||
r'(?i)categoria?\s*(?:de\s*)?v[âa]rst[ăa][\s:]+(\d+)[\s-]+(\d+)',
|
||||
],
|
||||
'participants_patterns': [
|
||||
r'(?i)(\d+)[\s-]+(\d+)\s*(?:participan[t]i|juc[a]tori|persoane|copii)',
|
||||
r'(?i)num[a]r\s*(?:de\s*)?(?:participan[t]i|juc[a]tori)[\s:]+(\d+)[\s-]+(\d+)',
|
||||
r'(?i)grup\s*de\s*(\d+)[\s-]+(\d+)',
|
||||
],
|
||||
'duration_patterns': [
|
||||
r'(?i)durat[a][\s:]+(\d+)[\s-]+(\d+)\s*(?:minute|min)',
|
||||
r'(?i)timp[\s:]+(\d+)[\s-]+(\d+)\s*(?:minute|min)',
|
||||
r'(?i)(\d+)[\s-]+(\d+)\s*minute',
|
||||
]
|
||||
}
|
||||
|
||||
# Predefined categories based on the existing system
|
||||
self.categories = {
|
||||
'[A]': ['joc', 'joaca', 'distractie', 'amuzament'],
|
||||
'[B]': ['aventura', 'explorare', 'descoperire'],
|
||||
'[C]': ['camping', 'tabara', 'excursie', 'drumetie'],
|
||||
'[D]': ['foc', 'flacara', 'lumina'],
|
||||
'[E]': ['noduri', 'frânghii', 'sfori', 'legare'],
|
||||
'[F]': ['bushcraft', 'supravietuire', 'survival'],
|
||||
'[G]': ['educatie', 'educativ', 'invatare', 'scoala'],
|
||||
'[H]': ['orientare', 'busola', 'harta', 'navigare']
|
||||
}
|
||||
|
||||
def detect_encoding(self, file_path):
"""Detect the file's encoding."""
|
||||
with open(file_path, 'rb') as f:
|
||||
result = chardet.detect(f.read())
|
||||
return result['encoding'] or 'utf-8'
|
||||
|
||||
def extract_from_html(self, html_path: str) -> List[Dict]:
"""Extract activities from a single HTML file."""
|
||||
activities = []
|
||||
|
||||
try:
|
||||
# Detect the encoding and read the file
|
||||
encoding = self.detect_encoding(html_path)
|
||||
with open(html_path, 'r', encoding=encoding, errors='ignore') as f:
|
||||
content = f.read()
|
||||
|
||||
soup = BeautifulSoup(content, 'lxml')
|
||||
|
||||
# Method 1: look for lists of activities
|
||||
activities.extend(self._extract_from_lists(soup, html_path))
|
||||
|
||||
# Method 2: look for activities in headings
|
||||
activities.extend(self._extract_from_headings(soup, html_path))
|
||||
|
||||
# Method 3: look for patterns in the text
|
||||
activities.extend(self._extract_from_patterns(soup, html_path))
|
||||
|
||||
# Method 4: look in tables
|
||||
activities.extend(self._extract_from_tables(soup, html_path))
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error processing {html_path}: {e}")
|
||||
|
||||
return activities
|
||||
|
||||
def _extract_from_lists(self, soup, source_file):
"""Extract activities from HTML lists (ul, ol)."""
|
||||
activities = []
|
||||
|
||||
for list_elem in soup.find_all(['ul', 'ol']):
|
||||
# Check whether the list appears to contain activities
|
||||
list_text = list_elem.get_text().lower()
|
||||
if any(marker in list_text for marker in ['joc', 'activitate', 'exercitiu']):
|
||||
for li in list_elem.find_all('li'):
|
||||
text = li.get_text(strip=True)
|
||||
if len(text) > 20:  # More than 20 characters for a valid activity
|
||||
activity = self._create_activity_from_text(text, source_file)
|
||||
if activity:
|
||||
activities.append(activity)
|
||||
|
||||
return activities
|
||||
|
||||
def _extract_from_headings(self, soup, source_file):
"""Extract activities based on headings."""
|
||||
activities = []
|
||||
|
||||
for heading in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
|
||||
heading_text = heading.get_text(strip=True)
|
||||
|
||||
# Check whether the heading contains keywords
|
||||
if any(keyword in heading_text.lower() for keyword in ['joc', 'activitate', 'exercitiu']):
|
||||
# Look for the description in the following elements
|
||||
description = ""
|
||||
next_elem = heading.find_next_sibling()
|
||||
|
||||
while next_elem and next_elem.name not in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
|
||||
if next_elem.name in ['p', 'div', 'ul']:
|
||||
description += next_elem.get_text(strip=True) + " "
|
||||
if len(description) > 500:  # Description length limit
|
||||
break
|
||||
next_elem = next_elem.find_next_sibling()
|
||||
|
||||
if description:
|
||||
activity = {
|
||||
'name': heading_text[:200],
|
||||
'description': description[:1000],
|
||||
'source_file': str(source_file),
|
||||
'category': self._detect_category(heading_text + " " + description)
|
||||
}
|
||||
activities.append(activity)
|
||||
|
||||
return activities
|
||||
|
||||
def _extract_from_patterns(self, soup, source_file):
"""Extract activities using pattern matching."""
|
||||
activities = []
|
||||
text = soup.get_text()
|
||||
|
||||
# Look for activity patterns
|
||||
for pattern in self.activity_patterns['title_patterns']:
|
||||
matches = re.finditer(pattern, text, re.MULTILINE)
|
||||
for match in matches:
|
||||
title = match.group(match.lastindex or 0)  # lastindex is None (never 0) when the pattern has no groups
|
||||
if len(title) > 10:
|
||||
# Extract context around the match
|
||||
start = max(0, match.start() - 200)
|
||||
end = min(len(text), match.end() + 500)
|
||||
context = text[start:end]
|
||||
|
||||
activity = self._create_activity_from_text(context, source_file, title)
|
||||
if activity:
|
||||
activities.append(activity)
|
||||
|
||||
return activities
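
Review note: `re.Match.lastindex` is `None`, not `0`, when a pattern has no capturing groups, so a comparison such as `lastindex == 0` never succeeds. The snippet below is a small sketch of a fallback to the whole match; `last_group_or_whole` is a hypothetical helper name and is not part of this repository.

```python
import re


def last_group_or_whole(match: re.Match) -> str:
    """Return the last matched capturing group, or the whole match if none."""
    # `lastindex or 0` maps None to group 0, which is the entire match.
    return match.group(match.lastindex or 0)


no_groups = re.search(r"\d+", "abc 42")
two_groups = re.search(r"(joc)\s+(\w+)", "joc tare")
```

All title patterns in this extractor do contain groups, so the fallback only matters if a group-less pattern is ever added.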
|
||||
|
||||
def _extract_from_tables(self, soup, source_file):
"""Extract activities from tables."""
|
||||
activities = []
|
||||
|
||||
for table in soup.find_all('table'):
|
||||
rows = table.find_all('tr')
|
||||
if len(rows) > 1:  # At least a header and one data row
|
||||
# Detect the relevant columns
|
||||
headers = [th.get_text(strip=True).lower() for th in rows[0].find_all(['th', 'td'])]
|
||||
|
||||
for row in rows[1:]:
|
||||
cells = row.find_all(['td'])
|
||||
if cells:
|
||||
activity_data = {}
|
||||
for i, cell in enumerate(cells):
|
||||
if i < len(headers):
|
||||
activity_data[headers[i]] = cell.get_text(strip=True)
|
||||
|
||||
# Create an activity from the table data
|
||||
if any(key in activity_data for key in ['joc', 'activitate', 'nume', 'titlu']):
|
||||
activity = self._create_activity_from_table_data(activity_data, source_file)
|
||||
if activity:
|
||||
activities.append(activity)
|
||||
|
||||
return activities
|
||||
|
||||
def _create_activity_from_text(self, text, source_file, title=None):
"""Create an activity dictionary from text."""
|
||||
if not text or len(text) < 30:
|
||||
return None
|
||||
|
||||
activity = {
|
||||
'name': title or text[:100].split('.')[0].strip(),
|
||||
'description': text[:1000],
|
||||
'source_file': str(source_file),
|
||||
'category': self._detect_category(text),
|
||||
'keywords': self._extract_keywords(text),
|
||||
'created_at': datetime.now().isoformat()
|
||||
}
|
||||
|
||||
# Extract additional metadata
|
||||
activity.update(self._extract_metadata(text))
|
||||
|
||||
return activity
|
||||
|
||||
def _create_activity_from_table_data(self, data, source_file):
"""Create an activity from table data."""
|
||||
activity = {
|
||||
'source_file': str(source_file),
|
||||
'created_at': datetime.now().isoformat()
|
||||
}
|
||||
|
||||
# Map table fields to DB fields
|
||||
field_mapping = {
|
||||
'nume': 'name', 'titlu': 'name', 'joc': 'name', 'activitate': 'name',
|
||||
'descriere': 'description', 'detalii': 'description', 'explicatie': 'description',
|
||||
'materiale': 'materials_list', 'echipament': 'materials_list',
|
||||
'varsta': 'age_group_min', 'categoria': 'category',
|
||||
'participanti': 'participants_min', 'numar': 'participants_min',
|
||||
'durata': 'duration_min', 'timp': 'duration_min'
|
||||
}
|
||||
|
||||
for table_field, db_field in field_mapping.items():
|
||||
if table_field in data:
|
||||
activity[db_field] = data[table_field]
|
||||
|
||||
# Minimal validation
|
||||
if 'name' in activity and len(activity.get('name', '')) > 5:
|
||||
return activity
|
||||
|
||||
return None
|
||||
|
||||
def _extract_metadata(self, text):
"""Extract metadata from text using patterns."""
|
||||
metadata = {}
|
||||
|
||||
# Extract the age range
|
||||
for pattern in self.activity_patterns['age_patterns']:
|
||||
match = re.search(pattern, text)
|
||||
if match:
|
||||
metadata['age_group_min'] = int(match.group(1))
|
||||
metadata['age_group_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
|
||||
break
|
||||
|
||||
# Extract the number of participants
|
||||
for pattern in self.activity_patterns['participants_patterns']:
|
||||
match = re.search(pattern, text)
|
||||
if match:
|
||||
metadata['participants_min'] = int(match.group(1))
|
||||
metadata['participants_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
|
||||
break
|
||||
|
||||
# Extract the duration
|
||||
for pattern in self.activity_patterns['duration_patterns']:
|
||||
match = re.search(pattern, text)
|
||||
if match:
|
||||
metadata['duration_min'] = int(match.group(1))
|
||||
metadata['duration_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
|
||||
break
|
||||
|
||||
# Extract materials
|
||||
materials = []
|
||||
text_lower = text.lower()
|
||||
for marker in self.activity_patterns['materials_markers']:
|
||||
idx = text_lower.find(marker)
|
||||
if idx != -1:
|
||||
# Take the 200 characters starting at the marker
|
||||
materials_text = text[idx:idx+200]
|
||||
# Extract the list items
|
||||
items = re.findall(r'[-•]\s*([^\n•-]+)', materials_text)
|
||||
if items:
|
||||
materials.extend(items)
|
||||
|
||||
if materials:
|
||||
metadata['materials_list'] = ', '.join(materials[:10])  # At most 10 materials
|
||||
|
||||
return metadata
|
||||
|
||||
def _detect_category(self, text):
"""Detect the activity's category based on keywords."""
|
||||
text_lower = text.lower()
|
||||
|
||||
for category, keywords in self.categories.items():
|
||||
if any(keyword in text_lower for keyword in keywords):
|
||||
return category
|
||||
|
||||
return '[A]'  # Default: the games category
|
||||
|
||||
def _extract_keywords(self, text):
"""Extract keywords from text."""
|
||||
keywords = []
|
||||
text_lower = text.lower()
|
||||
|
||||
# List of relevant keywords
|
||||
keyword_list = [
|
||||
'cooperare', 'competitie', 'echipa', 'creativitate', 'miscare',
|
||||
'strategie', 'comunicare', 'incredere', 'coordonare', 'atentie',
|
||||
'reflexe', 'logica', 'imaginatie', 'muzica', 'dans', 'sport',
|
||||
'natura', 'mediu', 'stiinta', 'matematica', 'limba', 'cultura'
|
||||
]
|
||||
|
||||
for keyword in keyword_list:
|
||||
if keyword in text_lower:
|
||||
keywords.append(keyword)
|
||||
|
||||
return ', '.join(keywords[:5])  # At most 5 keywords
|
||||
|
||||
def save_to_database(self, activities):
"""Save the activities to the database."""
|
||||
conn = sqlite3.connect(self.db_path)
|
||||
cursor = conn.cursor()
|
||||
|
||||
saved_count = 0
|
||||
duplicate_count = 0
|
||||
|
||||
for activity in activities:
|
||||
try:
|
||||
# Check for duplicates
|
||||
cursor.execute(
|
||||
"SELECT id FROM activities WHERE name = ? AND source_file = ?",
|
||||
(activity.get('name'), activity.get('source_file'))
|
||||
)
|
||||
|
||||
if cursor.fetchone():
|
||||
duplicate_count += 1
|
||||
continue
|
||||
|
||||
# Prepare the values for the insert
|
||||
columns = []
|
||||
values = []
|
||||
placeholders = []
|
||||
|
||||
for key, value in activity.items():
|
||||
if key != 'created_at': # Skip created_at, it has default
|
||||
columns.append(key)
|
||||
values.append(value)
|
||||
placeholders.append('?')
|
||||
|
||||
# Insert into the DB
|
||||
query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
|
||||
cursor.execute(query, values)
|
||||
saved_count += 1
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error saving activity: {e}")
|
||||
continue
|
||||
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
return saved_count, duplicate_count
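
Review note: `save_to_database` interpolates dictionary keys straight into the `INSERT` statement, so any unexpected key becomes part of the SQL text. A minimal sketch of the same insert restricted to a column whitelist follows. `ALLOWED_COLUMNS`, `insert_activity` and the table layout in `demo` are assumptions made for illustration (the real `activities` schema is not shown in this diff), and the row values are sample data.

```python
import sqlite3

# Hypothetical whitelist; the real column set lives in the project's schema.
ALLOWED_COLUMNS = ("name", "description", "source_file", "category")


def insert_activity(conn: sqlite3.Connection, activity: dict) -> int:
    """Insert one activity using only whitelisted column names."""
    columns = [c for c in ALLOWED_COLUMNS if c in activity]
    if not columns:
        raise ValueError("activity has no recognised columns")
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({placeholders})"
    # Values are still bound as parameters; only vetted names reach the SQL text.
    cursor = conn.execute(sql, [activity[c] for c in columns])
    return cursor.lastrowid


def demo() -> list[tuple]:
    """Insert a row whose dict also carries a hostile key, then read it back."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE activities (id INTEGER PRIMARY KEY, name TEXT, "
        "description TEXT, source_file TEXT, category TEXT)"
    )
    insert_activity(
        conn,
        {"name": "Leapfrog", "category": "[A]", "name); DROP TABLE activities; --": "x"},
    )
    rows = conn.execute("SELECT name, category FROM activities").fetchall()
    conn.close()
    return rows
```

The hostile key is simply ignored because it is not in the whitelist, while the values keep going through `?` placeholders as in the original code.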
|
||||
|
||||
def process_all_html_files(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
"""Process all HTML files in the specified directory."""
|
||||
base_path = Path(base_path)
|
||||
html_files = list(base_path.rglob("*.html"))
|
||||
html_files.extend(list(base_path.rglob("*.htm")))
|
||||
|
||||
print(f"Found {len(html_files)} HTML files to process")
|
||||
|
||||
all_activities = []
|
||||
processed = 0
|
||||
errors = 0
|
||||
|
||||
for i, html_file in enumerate(html_files):
|
||||
try:
|
||||
activities = self.extract_from_html(str(html_file))
|
||||
all_activities.extend(activities)
|
||||
processed += 1
|
||||
|
||||
# Progress update
|
||||
if (i + 1) % 100 == 0:
|
||||
print(f"Progress: {i+1}/{len(html_files)} files processed, {len(all_activities)} activities found")
|
||||
# Save batch to DB
|
||||
if all_activities:
|
||||
saved, dupes = self.save_to_database(all_activities)
|
||||
print(f"Batch saved: {saved} new activities, {dupes} duplicates skipped")
|
||||
all_activities = [] # Clear buffer
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error processing {html_file}: {e}")
|
||||
errors += 1
|
||||
|
||||
# Save remaining activities
|
||||
if all_activities:
|
||||
saved, dupes = self.save_to_database(all_activities)
|
||||
print(f"Final batch saved: {saved} new activities, {dupes} duplicates skipped")
|
||||
|
||||
print(f"\nProcessing complete!")
|
||||
print(f"Files processed: {processed}")
|
||||
print(f"Errors: {errors}")
|
||||
|
||||
return processed, errors
|
||||
|
||||
# Main function for testing
|
||||
if __name__ == "__main__":
|
||||
extractor = HTMLActivityExtractor()
|
||||
|
||||
# Test on a sample file first
|
||||
print("Testing on sample file first...")
|
||||
# Find an HTML file for testing
|
||||
test_files = list(Path('/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri').rglob("*.html"))[:3]
|
||||
|
||||
for test_file in test_files:
|
||||
print(f"\nTesting: {test_file}")
|
||||
activities = extractor.extract_from_html(str(test_file))
|
||||
print(f"Found {len(activities)} activities")
|
||||
if activities:
|
||||
print(f"Sample activity: {activities[0]['name'][:50]}...")
|
||||
|
||||
# Ask whether to continue with full processing
|
||||
response = input("\nContinue with full processing? (y/n): ")
|
||||
if response.lower() == 'y':
|
||||
extractor.process_all_html_files()
|
||||
@@ -1,78 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Import activities extracted by Claude from JSON files
|
||||
"""
|
||||
|
||||
import json
|
||||
import sqlite3
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
|
||||
class ClaudeActivityImporter:
|
||||
def __init__(self, db_path='data/activities.db'):
|
||||
self.db_path = db_path
|
||||
self.json_dir = Path('scripts/extracted_activities')
|
||||
self.json_dir.mkdir(exist_ok=True)
|
||||
|
||||
def import_json_file(self, json_path):
|
||||
"""Import activities from a single JSON file"""
|
||||
with open(json_path, 'r', encoding='utf-8') as f:
|
||||
data = json.load(f)
|
||||
|
||||
source_file = data.get('source_file', str(json_path))
|
||||
activities = data.get('activities', [])
|
||||
|
||||
conn = sqlite3.connect(self.db_path)
|
||||
cursor = conn.cursor()
|
||||
|
||||
imported = 0
|
||||
for activity in activities:
|
||||
try:
|
||||
# Add source file and timestamp
|
||||
activity['source_file'] = source_file
|
||||
activity['created_at'] = datetime.now().isoformat()
|
||||
|
||||
# Prepare insert
|
||||
columns = list(activity.keys())
|
||||
values = list(activity.values())
|
||||
placeholders = ['?' for _ in values]
|
||||
|
||||
# Check for duplicate
|
||||
cursor.execute(
|
||||
"SELECT id FROM activities WHERE name = ? AND source_file = ?",
|
||||
(activity.get('name'), source_file)
|
||||
)
|
||||
|
||||
if not cursor.fetchone():
|
||||
query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
|
||||
cursor.execute(query, values)
|
||||
imported += 1
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error importing activity: {e}")
|
||||
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
print(f"Imported {imported} activities from {json_path.name}")
|
||||
return imported
|
||||
|
||||
def import_all_json_files(self):
|
||||
"""Import all JSON files from the extracted_activities directory"""
|
||||
json_files = list(self.json_dir.glob("*.json"))
|
||||
|
||||
if not json_files:
|
||||
print("No JSON files found in extracted_activities directory")
|
||||
return 0
|
||||
|
||||
total_imported = 0
|
||||
for json_file in json_files:
|
||||
imported = self.import_json_file(json_file)
|
||||
total_imported += imported
|
||||
|
||||
print(f"\nTotal imported: {total_imported} activities from {len(json_files)} files")
|
||||
return total_imported
|
||||
|
||||
if __name__ == "__main__":
|
||||
importer = ClaudeActivityImporter()
|
||||
importer.import_all_json_files()
|
||||
179
scripts/import_common.py
Normal file
@@ -0,0 +1,179 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
import_common.py — shared helpers for the import / validation side of the
|
||||
extraction pipeline (Lane C).
|
||||
|
||||
Used by build_database.py and validate_extractions.py:
|
||||
* JSON-schema validation of subagent extraction files,
|
||||
* the anti-hallucination source_excerpt substring check (E5),
|
||||
* locating the source chunk that an extraction file came from,
|
||||
* the stable content key used by the needs_review queue.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
import json
|
||||
import re
|
||||
import unicodedata
|
||||
from pathlib import Path
|
||||
from typing import Any, Optional
|
||||
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
REPO_ROOT = SCRIPT_DIR.parent
|
||||
|
||||
DEFAULT_SCHEMA_PATH = SCRIPT_DIR / "activity_schema.json"
|
||||
|
||||
# rapidfuzz.partial_ratio is on a 0..100 scale; an excerpt counts as a real
|
||||
# quote from the source when it scores at least this against the chunk text.
|
||||
EXCERPT_MATCH_THRESHOLD = 90.0
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# schema validation
|
||||
# --------------------------------------------------------------------------
|
||||
def load_schema(schema_path: str | Path = DEFAULT_SCHEMA_PATH) -> dict:
|
||||
"""Load the activity JSON schema produced by Lane A."""
|
||||
return json.loads(Path(schema_path).read_text(encoding="utf-8"))
|
||||
|
||||
|
||||
def validate_extraction(data: Any, schema: dict) -> list[str]:
|
||||
"""
|
||||
Validate one parsed extraction file against `schema`.
|
||||
|
||||
Returns a list of human-readable error strings; empty list == valid.
|
||||
"""
|
||||
import jsonschema
|
||||
|
||||
validator = jsonschema.Draft7Validator(schema)
|
||||
errors: list[str] = []
|
||||
for err in sorted(validator.iter_errors(data), key=lambda e: list(e.path)):
|
||||
location = "/".join(str(p) for p in err.path) or "<root>"
|
||||
errors.append(f"{location}: {err.message}")
|
||||
return errors
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# excerpt verification (E5 — anti-hallucination)
|
||||
# --------------------------------------------------------------------------
|
||||
def _normalize_text(text: str) -> str:
|
||||
return re.sub(r"\s+", " ", (text or "")).strip().lower()
|
||||
|
||||
|
||||
def excerpt_score(excerpt: str, chunk_text: str) -> float:
|
||||
"""Best fuzzy-substring score (0..100) of `excerpt` inside `chunk_text`."""
|
||||
from rapidfuzz import fuzz
|
||||
|
||||
if not excerpt or not chunk_text:
|
||||
return 0.0
|
||||
return float(fuzz.partial_ratio(_normalize_text(excerpt), _normalize_text(chunk_text)))
|
||||
|
||||
|
||||
def excerpt_matches(
|
||||
excerpt: str, chunk_text: str, threshold: float = EXCERPT_MATCH_THRESHOLD
|
||||
) -> bool:
|
||||
"""True when `excerpt` appears (fuzzily) as a substring of `chunk_text`."""
|
||||
return excerpt_score(excerpt, chunk_text) >= threshold
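
Review note: the excerpt check relies on `rapidfuzz.fuzz.partial_ratio`, which scores the best alignment of the shorter string inside the longer one. The sketch below approximates that idea with the standard library only, by sliding a window over the chunk and scoring each position with `difflib.SequenceMatcher`. It is a stand-in for illustration, not rapidfuzz's algorithm: `partial_score` and `normalize` are hypothetical names, the scores will not match rapidfuzz exactly, and the loop is quadratic, so it suits short inputs only.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase, as _normalize_text does in the diff."""
    return re.sub(r"\s+", " ", text or "").strip().lower()


def partial_score(excerpt: str, chunk: str) -> float:
    """Best 0..100 similarity of `excerpt` against any same-length window of `chunk`."""
    needle, haystack = normalize(excerpt), normalize(chunk)
    if not needle or not haystack:
        return 0.0
    if len(needle) >= len(haystack):
        return 100.0 * SequenceMatcher(None, needle, haystack, autojunk=False).ratio()
    best = 0.0
    for start in range(len(haystack) - len(needle) + 1):
        window = haystack[start:start + len(needle)]
        ratio = SequenceMatcher(None, needle, window, autojunk=False).ratio()
        best = max(best, ratio)
        if best == 1.0:  # exact substring found, no better score possible
            break
    return 100.0 * best
```

An excerpt quoted verbatim scores 100 regardless of line breaks or case, a single typo stays above the 90 threshold used here, and unrelated text falls well below it.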
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# locating the source chunk an extraction file came from
|
||||
# --------------------------------------------------------------------------
|
||||
def chunk_key_for(json_path: Path, header: Optional[dict]) -> str:
|
||||
"""
|
||||
Resolve the chunk key for an extraction file.
|
||||
|
||||
Prefers the explicit `chunk_key` in the header, otherwise falls back to the
|
||||
JSON file stem (extraction files are named `<chunk_key>.json`).
|
||||
"""
|
||||
if header and header.get("chunk_key"):
|
||||
return str(header["chunk_key"])
|
||||
return json_path.stem
|
||||
|
||||
|
||||
def source_id_for(chunk_key: str, header: Optional[dict]) -> str:
|
||||
"""Resolve the source id; `<source_id>.partNN` → `<source_id>`."""
|
||||
if header and header.get("source_id"):
|
||||
return str(header["source_id"])
|
||||
# chunk keys look like "<source_id>.partNN"
|
||||
return chunk_key.rsplit(".part", 1)[0]
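
Review note: `rsplit(".part", 1)` also cuts keys that merely contain `.part` without a numeric suffix. Ids produced by `stable_id` contain no dots, so this should only matter for keys that come from elsewhere. A stricter variant, sketched under that assumption, anchors on `.part` followed by digits at the end of the key; `strip_part_suffix` is a hypothetical name.

```python
import re

# Matches only a trailing ".part" followed by one or more digits.
_PART_SUFFIX = re.compile(r"\.part\d+$")


def strip_part_suffix(chunk_key: str) -> str:
    """Drop a trailing `.partNN` suffix; leave every other key untouched."""
    return _PART_SUFFIX.sub("", chunk_key)


# Behaviour of the plain rsplit used in the diff, for comparison.
rsplit_result = "notes.partial".rsplit(".part", 1)[0]
```

Both approaches agree on well-formed keys such as `<source_id>.part03`; they differ only on keys like `notes.partial`, which the regex leaves intact.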
|
||||
|
||||
|
||||
def find_chunk_text(
|
||||
json_path: Path, header: Optional[dict], chunks_dir: Path
|
||||
) -> Optional[str]:
|
||||
"""
|
||||
Return the text of the source chunk for an extraction file, or None.
|
||||
|
||||
Looks for data/chunks/<source_id>/<chunk_key>.txt, then falls back to a
|
||||
recursive glob on the chunk key.
|
||||
"""
|
||||
chunk_key = chunk_key_for(json_path, header)
|
||||
source_id = source_id_for(chunk_key, header)
|
||||
|
||||
candidate = chunks_dir / source_id / f"{chunk_key}.txt"
|
||||
if candidate.is_file():
|
||||
return candidate.read_text(encoding="utf-8", errors="replace")
|
||||
|
||||
matches = list(chunks_dir.rglob(f"{chunk_key}.txt"))
|
||||
if matches:
|
||||
return matches[0].read_text(encoding="utf-8", errors="replace")
|
||||
return None
|
||||
|
||||
|
||||
def source_path_for(source_id: str, sources_dir: Path) -> Optional[str]:
|
||||
"""
|
||||
Read the original `SOURCE:` path from a normalized source header.
|
||||
|
||||
data/sources/<source_id>.txt starts with a `SOURCE: <relative path>` line.
|
||||
"""
|
||||
src_file = sources_dir / f"{source_id}.txt"
|
||||
if not src_file.is_file():
|
||||
return None
|
||||
try:
|
||||
with src_file.open(encoding="utf-8", errors="replace") as fh:
|
||||
for line in fh:
|
||||
if line.startswith("SOURCE:"):
|
||||
return line.split(":", 1)[1].strip()
|
||||
if line.startswith("=") or line.startswith("--- PAGE "):
|
||||
break
|
||||
except OSError:
|
||||
return None
|
||||
return None
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# stable content key for the needs_review queue (plan §5c)
|
||||
# --------------------------------------------------------------------------
|
||||
def normalize_name(name: str) -> str:
|
||||
"""Diacritic-free, lowercased, whitespace-collapsed name (dedup key)."""
|
||||
if not name:
|
||||
return ""
|
||||
decomposed = unicodedata.normalize("NFKD", name)
|
||||
ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
|
||||
return re.sub(r"\s+", " ", ascii_str.lower().strip())
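
Review note: a short usage sketch of the dedup key. The function body is copied from the diff above so the example runs on its own; the sample names are made up. Note that NFKD only strips combining marks, so letters without a decomposition (for example `ø`) survive, and the result is not guaranteed to be pure ASCII despite the variable name.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Diacritic-free, lowercased, whitespace-collapsed name (copied from the diff)."""
    if not name:
        return ""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_str = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", ascii_str.lower().strip())


examples = {
    "  Vânătoarea   de COMORI ": normalize_name("  Vânătoarea   de COMORI "),
    "Ștafeta cu mingi": normalize_name("Ștafeta cu mingi"),
    "Ørsted": normalize_name("Ørsted"),
}
```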
|
||||
|
||||
|
||||
def content_key(normalized_name: str, language: Optional[str], description: str) -> str:
|
||||
"""
|
||||
Stable hash identifying a row for the review queue.
|
||||
|
||||
Only borderline-kept-separate rows and legacy `.doc` rows ever carry
|
||||
needs_review, and neither is auto-merged — so their (normalized_name,
|
||||
language, description) triple is stable across rebuilds.
|
||||
"""
|
||||
payload = f"{normalized_name}\x1f{language or ''}\x1f{_normalize_text(description)}"
|
||||
return hashlib.sha1(payload.encode("utf-8")).hexdigest()
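
Review note: the `\x1f` unit separator in the payload keeps field boundaries unambiguous. A minimal sketch of why that matters: without a separator, different field splits can concatenate to the same string and therefore to the same hash. `joined_key` and `naive_key` are hypothetical helpers for illustration only.

```python
import hashlib


def joined_key(*fields: str) -> str:
    """SHA-1 over fields joined with the ASCII unit separator."""
    return hashlib.sha1("\x1f".join(fields).encode("utf-8")).hexdigest()


def naive_key(*fields: str) -> str:
    """SHA-1 over plainly concatenated fields (collision-prone)."""
    return hashlib.sha1("".join(fields).encode("utf-8")).hexdigest()
```

`naive_key("ab", "c")` and `naive_key("a", "bc")` hash the same bytes, while the separated variant tells them apart as long as the fields themselves never contain `\x1f`.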
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# iteration
|
||||
# --------------------------------------------------------------------------
|
||||
def iter_extraction_files(extracted_dir: Path):
|
||||
"""Yield every *.json directly under `extracted_dir` (skips _rejected/)."""
|
||||
if not extracted_dir.is_dir():
|
||||
return
|
||||
for path in sorted(extracted_dir.glob("*.json")):
|
||||
if path.is_file():
|
||||
yield path
|
||||
255
scripts/normalize_sources.py
Normal file
@@ -0,0 +1,255 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
normalize_sources.py — walk data/carti-camp-jocuri/ and write data/sources/<id>.txt.
|
||||
|
||||
Output files keep the existing header format:
|
||||
|
||||
SOURCE: <original relative path>
|
||||
CONVERTED: <iso date>
|
||||
FORMAT: <pdf|docx|doc|pptx|html-mirror|zip>
|
||||
NEEDS_REVIEW: <reason> (optional — legacy .doc conversions)
|
||||
==================================================
|
||||
|
||||
--- PAGE 1 ---
|
||||
...
|
||||
|
||||
Each source gets a stable id = <8-hex hash of relative path>_<sanitized stem>,
|
||||
so two files with the same name in different folders never collide.
|
||||
|
||||
The pipeline is script-only: this normalizes formats, it does NOT run extraction.
|
||||
Run `--check-deps` before a long job.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import datetime as _dt
|
||||
import hashlib
|
||||
import re
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
if str(SCRIPT_DIR) not in sys.path:
|
||||
sys.path.insert(0, str(SCRIPT_DIR))
|
||||
|
||||
from extract_common import ( # noqa: E402
|
||||
count_page_markers,
|
||||
dedupe_texts,
|
||||
detect_format,
|
||||
extract_file,
|
||||
extract_html,
|
||||
is_junk,
|
||||
join_pages,
|
||||
preflight,
|
||||
split_pages,
|
||||
)
|
||||
|
||||
HEADER_RULE = "=" * 50
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# stable source id
|
||||
# --------------------------------------------------------------------------
|
||||
def sanitize_stem(stem: str) -> str:
|
||||
s = re.sub(r"[^\w]+", "_", stem, flags=re.UNICODE).strip("_").lower()
|
||||
return s[:60] or "source"
|
||||
|
||||
|
||||
def stable_id(relative_path: str | Path) -> str:
|
||||
"""Collision-proof id derived from the path relative to the corpus root."""
|
||||
rel = str(relative_path).replace("\\", "/")
|
||||
digest = hashlib.sha1(rel.encode("utf-8")).hexdigest()[:8]
|
||||
stem = sanitize_stem(Path(rel).stem)
|
||||
return f"{digest}_{stem}"
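
Review note: a usage sketch of the id scheme. Both functions are copied from the diff so the example is self-contained; the sample paths are made up. It illustrates the two properties the module docstring claims: same-named files in different folders get different ids, and Windows-style separators hash the same as forward slashes.

```python
import hashlib
import re
from pathlib import Path


def sanitize_stem(stem: str) -> str:
    s = re.sub(r"[^\w]+", "_", stem, flags=re.UNICODE).strip("_").lower()
    return s[:60] or "source"


def stable_id(relative_path) -> str:
    """Id derived from the path relative to the corpus root (copied from the diff)."""
    rel = str(relative_path).replace("\\", "/")
    digest = hashlib.sha1(rel.encode("utf-8")).hexdigest()[:8]
    stem = sanitize_stem(Path(rel).stem)
    return f"{digest}_{stem}"


id_a = stable_id("tabara/Jocuri de echipa.pdf")
id_b = stable_id("arhiva/Jocuri de echipa.pdf")
```

Because the digest covers the whole relative path, moving or renaming a source file changes its id, so downstream chunk and extraction files keyed by the old id would no longer line up.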
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# header
|
||||
# --------------------------------------------------------------------------
|
||||
def build_header(
|
||||
source_rel: str, fmt: str, needs_review: str | None = None
|
||||
) -> str:
|
||||
today = _dt.date.today().isoformat()
|
||||
lines = [
|
||||
f"SOURCE: {source_rel}",
|
||||
f"CONVERTED: {today}",
|
||||
f"FORMAT: {fmt}",
|
||||
]
|
||||
if needs_review:
|
||||
lines.append(f"NEEDS_REVIEW: {needs_review}")
|
||||
lines.append(HEADER_RULE)
|
||||
return "\n".join(lines) + "\n\n"
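
Review note: the header written by `build_header` is read back elsewhere in this commit (`source_path_for` scans for the `SOURCE:` line). The sketch below parses every header field using the same stop conditions; `parse_header` is a hypothetical helper, and the sample text, including its date, is invented for the example.

```python
HEADER_RULE = "=" * 50


def parse_header(text: str) -> dict[str, str]:
    """Collect `KEY: value` lines until the rule line or the first page marker."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        if line.startswith("=") or line.startswith("--- PAGE "):
            break
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    return fields


sample = (
    "SOURCE: books/games: vol 1.pdf\n"
    "CONVERTED: 2024-01-01\n"
    "FORMAT: pdf\n"
    + HEADER_RULE
    + "\n\n--- PAGE 1 ---\nTitle: not a header\n"
)
```

Splitting on the first colon only keeps paths that themselves contain a colon intact, and stopping at the rule line prevents body text such as `Title: ...` from being mistaken for a header field.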
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# mirror-site directories
|
||||
# --------------------------------------------------------------------------
|
||||
MIRROR_PAGE_EXTS = {".html", ".htm"}
|
||||
|
||||
|
||||
def is_mirror_dir(path: Path) -> bool:
|
||||
"""A directory counts as a site mirror if it contains HTML pages."""
|
||||
if not path.is_dir():
|
||||
return False
|
||||
if path.name.endswith("_files"):
|
||||
return False
|
||||
return any(
|
||||
p.suffix.lower() in MIRROR_PAGE_EXTS
|
||||
for p in path.rglob("*")
|
||||
if p.is_file()
|
||||
)
|
||||
|
||||
|
||||
def normalize_mirror(mirror_dir: Path) -> str:
|
||||
"""Extract every HTML page in a mirror dir, dedupe near-duplicates, join."""
|
||||
pages: list[tuple[str, str]] = []
|
||||
for html in sorted(mirror_dir.rglob("*")):
|
||||
if not html.is_file() or html.suffix.lower() not in MIRROR_PAGE_EXTS:
|
||||
continue
|
||||
if "_files" in html.parts:
|
||||
continue
|
||||
try:
|
||||
body = extract_html(html)
|
||||
except Exception:
|
||||
continue
|
||||
text = "\n".join(t for _, t in split_pages(body))
|
||||
if text.strip():
|
||||
pages.append((str(html.relative_to(mirror_dir)), text))
|
||||
pages = dedupe_texts(pages)
|
||||
return join_pages([t for _, t in pages])
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# one source
|
||||
# --------------------------------------------------------------------------
|
||||
def normalize_one(
|
||||
path: Path, corpus_root: Path, out_dir: Path
|
||||
) -> dict | None:
|
||||
"""
|
||||
Normalize a single file or mirror directory → data/sources/<id>.txt.
|
||||
|
||||
Returns a result dict, or None if the entry was skipped (junk / ignored).
|
||||
"""
|
||||
rel = path.relative_to(corpus_root)
|
||||
sid = stable_id(rel)
|
||||
|
||||
if path.is_dir():
|
||||
if not is_mirror_dir(path):
|
||||
return None
|
||||
fmt, needs_review = "html-mirror", None
|
||||
body = normalize_mirror(path)
|
||||
else:
|
||||
if is_junk(path):
|
||||
return None
|
||||
fmt = detect_format(path)
|
||||
if fmt in ("unknown", "epub", "txt"):
|
||||
return None # epub duplicates PDFs; txt is not a source format here
|
||||
needs_review = "legacy .doc conversion is imperfect" if fmt == "doc" else None
|
||||
try:
|
||||
body = extract_file(path)
|
||||
except Exception as exc: # noqa: BLE001
|
||||
return {"id": sid, "source": str(rel), "status": "error", "error": str(exc)}
|
||||
|
||||
if not body.strip():
|
||||
return {"id": sid, "source": str(rel), "status": "empty"}
|
||||
|
||||
out_path = out_dir / f"{sid}.txt"
|
||||
out_path.write_text(build_header(str(rel), fmt, needs_review) + body,
|
||||
encoding="utf-8")
|
||||
return {
|
||||
"id": sid,
|
||||
"source": str(rel),
|
||||
"status": "ok",
|
||||
"format": fmt,
|
||||
"pages": count_page_markers(body),
|
||||
"needs_review": bool(needs_review),
|
||||
}
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# walk
|
||||
# --------------------------------------------------------------------------
|
||||
def iter_corpus_entries(corpus_root: Path):
    """Yield top-level files and mirror directories under the corpus root."""
    for entry in sorted(corpus_root.iterdir()):
        if entry.name.startswith("."):
            continue
        if entry.is_dir():
            # Directories are yielded only when they are HTML mirrors.
            if is_mirror_dir(entry):
                yield entry
        else:
            yield entry
|
||||
|
||||
|
||||
def run(corpus_root: Path, out_dir: Path) -> dict:
    out_dir.mkdir(parents=True, exist_ok=True)
    results: list[dict] = []
    for entry in iter_corpus_entries(corpus_root):
        res = normalize_one(entry, corpus_root, out_dir)
        if res is not None:
            results.append(res)
    summary = {
        "total": len(results),
        "ok": sum(1 for r in results if r["status"] == "ok"),
        "errors": sum(1 for r in results if r["status"] == "error"),
        "empty": sum(1 for r in results if r["status"] == "empty"),
        "needs_review": sum(1 for r in results if r.get("needs_review")),
        "results": results,
    }
    return summary
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# CLI
|
||||
# --------------------------------------------------------------------------
|
||||
def print_preflight(report: dict) -> int:
    print("Dependency preflight")
    print("--------------------")
    if report["missing_python"]:
        print(" MISSING Python packages: " + ", ".join(report["missing_python"]))
    else:
        print(" Python packages: OK")
    if report["missing_system"]:
        print(" MISSING system tools : " + ", ".join(report["missing_system"]))
    for w in report["warnings"]:
        print(f" WARNING: {w}")
    print(" => " + ("READY" if report["ok"] else "NOT READY — install the above"))
    return 0 if report["ok"] else 1
|
||||
|
||||
|
||||
def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(description="Normalize mixed sources to .txt")
    parser.add_argument("--corpus", default="data/carti-camp-jocuri",
                        help="corpus root to walk")
    parser.add_argument("--out", default="data/sources", help="output directory")
    parser.add_argument("--check-deps", action="store_true",
                        help="run dependency preflight and exit")
    parser.add_argument("--ocr", action="store_true",
                        help="include OCR (tesseract) in the preflight check")
    args = parser.parse_args(argv)

    if args.check_deps:
        return print_preflight(preflight(check_ocr=args.ocr))

    report = preflight(check_ocr=args.ocr)
    if report["missing_python"]:
        print_preflight(report)
        return 1
    for w in report["warnings"]:
        print(f"WARNING: {w}")

    summary = run(Path(args.corpus), Path(args.out))
    print(f"normalized : {summary['ok']}/{summary['total']}")
    print(f"errors : {summary['errors']}")
    print(f"empty : {summary['empty']}")
    print(f"needs_review: {summary['needs_review']}")
    for r in summary["results"]:
        if r["status"] != "ok":
            print(f" [{r['status']}] {r['source']}")
    return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
    raise SystemExit(main())
|
||||
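The summary that `run` builds above can be exercised without touching the filesystem. The following is a minimal sketch, assuming result dicts shaped like the ones `normalize_one` returns; `summarize` and the sample file names are hypothetical and not part of this commit.

```python
from collections import Counter


def summarize(results: list[dict]) -> dict:
    """Tally normalize_one-style result dicts by status (hypothetical helper)."""
    by_status = Counter(r["status"] for r in results)
    return {
        "total": len(results),
        "ok": by_status["ok"],
        "errors": by_status["error"],
        "empty": by_status["empty"],
        # "error" and "empty" results carry no needs_review key, hence .get()
        "needs_review": sum(1 for r in results if r.get("needs_review")),
    }


# Illustrative inputs only; ids and sources are made up.
sample = [
    {"id": "a1", "source": "book.pdf", "status": "ok", "format": "pdf",
     "pages": 12, "needs_review": False},
    {"id": "b2", "source": "old.doc", "status": "ok", "format": "doc",
     "pages": 3, "needs_review": True},
    {"id": "c3", "source": "scan.pdf", "status": "empty"},
    {"id": "d4", "source": "broken.docx", "status": "error", "error": "bad zip"},
]
summary = summarize(sample)
print(summary)
```

A `Counter` returns 0 for statuses that never occur, so the sketch needs no special case for an empty corpus.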
@@ -1,143 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
PDF Mass Conversion to Text for Activity Extraction
|
||||
Handles all PDF sizes efficiently with multiple fallback methods
|
||||
"""
|
||||
|
||||
import os
|
||||
import json
|
||||
from pathlib import Path
|
||||
import PyPDF2
|
||||
import pdfplumber
|
||||
from typing import List, Dict
|
||||
import logging
|
||||
|
||||
class PDFConverter:
|
||||
def __init__(self, max_pages=50):
|
||||
self.max_pages = max_pages
|
||||
self.conversion_stats = {}
|
||||
|
||||
def convert_pdf_to_text(self, pdf_path: str) -> str:
|
||||
"""Convert PDF to text using multiple methods with fallbacks"""
|
||||
try:
|
||||
# Method 1: pdfplumber (best for tables and layout)
|
||||
return self._convert_with_pdfplumber(pdf_path)
|
||||
except Exception as e:
|
||||
print(f"pdfplumber failed for {pdf_path}: {e}")
|
||||
|
||||
try:
|
||||
# Method 2: PyPDF2 (fallback)
|
||||
return self._convert_with_pypdf2(pdf_path)
|
||||
except Exception as e2:
|
||||
print(f"PyPDF2 also failed for {pdf_path}: {e2}")
|
||||
return ""
|
||||
|
||||
def _convert_with_pdfplumber(self, pdf_path: str) -> str:
|
||||
"""Primary conversion method using pdfplumber"""
|
||||
text_content = ""
|
||||
|
||||
with pdfplumber.open(pdf_path) as pdf:
|
||||
total_pages = len(pdf.pages)
|
||||
pages_to_process = min(total_pages, self.max_pages)
|
||||
|
||||
print(f" Converting {pdf_path}: {pages_to_process}/{total_pages} pages")
|
||||
|
||||
for i, page in enumerate(pdf.pages[:pages_to_process]):
|
||||
try:
|
||||
page_text = page.extract_text()
|
||||
if page_text:
|
||||
text_content += f"\n--- PAGE {i+1} ---\n"
|
||||
text_content += page_text
|
||||
text_content += "\n"
|
||||
except Exception as e:
|
||||
print(f" Error on page {i+1}: {e}")
|
||||
continue
|
||||
|
||||
self.conversion_stats[pdf_path] = {
|
||||
'method': 'pdfplumber',
|
||||
'pages_processed': pages_to_process,
|
||||
'total_pages': total_pages,
|
||||
'success': True,
|
||||
'text_length': len(text_content)
|
||||
}
|
||||
|
||||
return text_content
|
||||
|
||||
def _convert_with_pypdf2(self, pdf_path: str) -> str:
|
||||
"""Fallback conversion method using PyPDF2"""
|
||||
text_content = ""
|
||||
|
||||
with open(pdf_path, 'rb') as file:
|
||||
reader = PyPDF2.PdfReader(file)
|
||||
total_pages = len(reader.pages)
|
||||
pages_to_process = min(total_pages, self.max_pages)
|
||||
|
||||
print(f" Converting {pdf_path} (fallback): {pages_to_process}/{total_pages} pages")
|
||||
|
||||
for i in range(pages_to_process):
|
||||
try:
|
||||
page = reader.pages[i]
|
||||
page_text = page.extract_text()
|
||||
if page_text:
|
||||
text_content += f"\n--- PAGE {i+1} ---\n"
|
||||
text_content += page_text
|
||||
text_content += "\n"
|
||||
except Exception as e:
|
||||
print(f" Error on page {i+1}: {e}")
|
||||
continue
|
||||
|
||||
self.conversion_stats[pdf_path] = {
|
||||
'method': 'PyPDF2',
|
||||
'pages_processed': pages_to_process,
|
||||
'total_pages': total_pages,
|
||||
'success': True,
|
||||
'text_length': len(text_content)
|
||||
}
|
||||
|
||||
return text_content
|
||||
|
||||
def convert_all_pdfs(self, pdf_directory: str, output_directory: str):
|
||||
"""Convert all PDFs in directory to text files"""
|
||||
pdf_files = list(Path(pdf_directory).glob("**/*.pdf"))
|
||||
|
||||
print(f"🔄 Converting {len(pdf_files)} PDF files to text...")
|
||||
|
||||
os.makedirs(output_directory, exist_ok=True)
|
||||
|
||||
for i, pdf_path in enumerate(pdf_files):
|
||||
print(f"\n[{i+1}/{len(pdf_files)}] Processing {pdf_path.name}...")
|
||||
|
||||
# Convert to text
|
||||
text_content = self.convert_pdf_to_text(str(pdf_path))
|
||||
|
||||
if text_content.strip():
|
||||
# Save as text file
|
||||
output_file = Path(output_directory) / f"{pdf_path.stem}.txt"
|
||||
with open(output_file, 'w', encoding='utf-8') as f:
|
||||
f.write(f"SOURCE: {pdf_path}\n")
|
||||
f.write(f"CONVERTED: 2025-01-11\n")
|
||||
f.write("="*50 + "\n\n")
|
||||
f.write(text_content)
|
||||
|
||||
print(f" ✅ Saved: {output_file}")
|
||||
else:
|
||||
print(f" ❌ No text extracted from {pdf_path.name}")
|
||||
|
||||
# Save conversion statistics
|
||||
stats_file = Path(output_directory) / "conversion_stats.json"
|
||||
with open(stats_file, 'w', encoding='utf-8') as f:
|
||||
json.dump(self.conversion_stats, f, indent=2, ensure_ascii=False)
|
||||
|
||||
print(f"\n🎉 PDF conversion complete! Check {output_directory}")
|
||||
return len([f for f in self.conversion_stats.values() if f['success']])
|
||||
|
||||
# Usage
|
||||
if __name__ == "__main__":
|
||||
converter = PDFConverter(max_pages=50)
|
||||
|
||||
# Convert all PDFs
|
||||
pdf_dir = "/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri"
|
||||
output_dir = "/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri/INDEX-SISTEM-JOCURI/converted_pdfs"
|
||||
|
||||
converted_count = converter.convert_all_pdfs(pdf_dir, output_dir)
|
||||
print(f"Final result: {converted_count} PDFs successfully converted")
|
||||
145
scripts/review_queue.py
Normal file
@@ -0,0 +1,145 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
review_queue.py — CLI for the needs_review lifecycle (plan §5c).
|
||||
|
||||
Rows land in the queue when dedup leaves a borderline pair separate, or when a
|
||||
legacy `.doc` source was converted imperfectly. Each row has a stable content
|
||||
key; a decision written here is stored in data/review_decisions.json (git
|
||||
tracked) and re-applied by build_database.py on every rebuild, so the queue
|
||||
never resurfaces a resolved row.
|
||||
|
||||
Commands:
|
||||
python scripts/review_queue.py list
|
||||
python scripts/review_queue.py resolve <id> <merge|keep-separate|drop>
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sqlite3
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
    if _p not in sys.path:
        sys.path.insert(0, _p)
|
||||
|
||||
from import_common import content_key, normalize_name # noqa: E402
|
||||
|
||||
VALID_DECISIONS = ("merge", "keep-separate", "drop")
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# review_decisions.json
|
||||
# --------------------------------------------------------------------------
|
||||
def load_decisions(path: Path) -> dict:
    if path.is_file():
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            if isinstance(data, dict):
                return data
        except (json.JSONDecodeError, OSError):
            pass
    return {}


def save_decisions(decisions: dict, path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(decisions, indent=2, ensure_ascii=False, sort_keys=True),
        encoding="utf-8",
    )
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# queue
|
||||
# --------------------------------------------------------------------------
|
||||
def list_queue(db_path: Path) -> list[dict]:
    """Return every needs_review row in the current DB, with its content key."""
    if not db_path.is_file():
        return []
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        rows = conn.execute(
            "SELECT name, normalized_name, language, description "
            "FROM activities WHERE needs_review = 1 ORDER BY normalized_name"
        ).fetchall()
    except sqlite3.OperationalError:
        return []
    finally:
        conn.close()

    out = []
    for row in rows:
        norm = row["normalized_name"] or normalize_name(row["name"])
        key = content_key(norm, row["language"], row["description"] or "")
        out.append({
            "id": key,
            "name": row["name"],
            "language": row["language"],
            "description": row["description"] or "",
        })
    return out
|
||||
|
||||
|
||||
def resolve(decisions_path: Path, content_id: str, decision: str) -> dict:
    """Record a decision for a content key in review_decisions.json."""
    if decision not in VALID_DECISIONS:
        raise ValueError(
            f"invalid decision {decision!r}; expected one of {VALID_DECISIONS}"
        )
    decisions = load_decisions(decisions_path)
    decisions[content_id] = {"decision": decision}
    save_decisions(decisions, decisions_path)
    return decisions
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------
|
||||
# CLI
|
||||
# --------------------------------------------------------------------------
|
||||
def main(argv: Optional[list[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="needs_review queue CLI")
    parser.add_argument("--db", default="data/activities.db")
    parser.add_argument("--decisions", default="data/review_decisions.json")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("list", help="list rows currently flagged needs_review")

    p_resolve = sub.add_parser("resolve", help="record a decision for a row")
    p_resolve.add_argument("id", help="content id from `list`")
    p_resolve.add_argument("decision", choices=VALID_DECISIONS)

    args = parser.parse_args(argv)

    if args.command == "list":
        rows = list_queue(Path(args.db))
        if not rows:
            print("review queue is empty.")
            return 0
        print(f"{len(rows)} row(s) need review:\n")
        for r in rows:
            desc = r["description"][:80].replace("\n", " ")
            print(f" id : {r['id']}")
            print(f" name : {r['name']} [{r['language']}]")
            print(f" desc : {desc}")
            print(f" -> review_queue.py resolve {r['id']} <merge|keep-separate|drop>")
            print()
        return 0

    if args.command == "resolve":
        resolve(Path(args.decisions), args.id, args.decision)
        print(f"recorded: {args.id} -> {args.decision}")
        print(f"written to {args.decisions} (applied on next build_database --rebuild)")
        return 0

    return 1
|
||||
|
||||
|
||||
if __name__ == "__main__":
    raise SystemExit(main())
|
||||
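The docstring of `review_queue.py` says decisions are keyed by a stable content key and re-applied by `build_database.py` on every rebuild. The sketch below illustrates that idea only. The real `content_key` lives in `import_common` and is not shown in this diff, so the hash used here is a stand-in, and the effect assigned to each decision (`drop` and `merge` remove the row, `keep-separate` clears the flag) is an assumption, as is the helper name `apply_decisions`.

```python
import hashlib


def content_key(normalized_name: str, language: str, description: str) -> str:
    """Hypothetical stand-in for import_common.content_key: a short stable hash."""
    raw = "\x1f".join((normalized_name, language or "", description.strip()))
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:16]


def apply_decisions(rows: list[dict], decisions: dict) -> list[dict]:
    """Re-apply stored decisions to freshly built rows (assumed semantics)."""
    out = []
    for row in rows:
        key = content_key(row["normalized_name"], row["language"], row["description"])
        decision = decisions.get(key, {}).get("decision")
        if decision in ("drop", "merge"):
            continue  # assumed: dropped, or folded into its near-duplicate
        if decision == "keep-separate":
            row = {**row, "needs_review": 0}  # resolved, so it leaves the queue
        out.append(row)
    return out


# Made-up rows; only the first one has a recorded decision.
rows = [
    {"normalized_name": "capture the flag", "language": "en",
     "description": "Two teams guard a flag.", "needs_review": 1},
    {"normalized_name": "steal the flag", "language": "en",
     "description": "Teams raid each other's base.", "needs_review": 1},
]
first_key = content_key("capture the flag", "en", "Two teams guard a flag.")
decisions = {first_key: {"decision": "keep-separate"}}
rebuilt = apply_decisions(rows, decisions)
still_queued = [r["normalized_name"] for r in rebuilt if r["needs_review"]]
print(still_queued)
```

Because the key is derived from content rather than from a database row id, a decision survives a full `--rebuild` as long as the name, language, and description of the row do not change.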
@@ -1,50 +1,140 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
Main extraction orchestrator
|
||||
Runs the entire extraction process
|
||||
run_extraction.py — extraction orchestrator (plan §3).
|
||||
|
||||
The pipeline is script-only up to the LLM step: this script normalizes the
|
||||
corpus, chunks the normalized sources, and emits one subagent prompt per
|
||||
`pending` chunk. It does NOT run the extraction itself — that step is the
|
||||
interactive Claude Code orchestrator launching waves of subagents.
|
||||
|
||||
Steps:
|
||||
1. normalize data/carti-camp-jocuri/ -> data/sources/*.txt
|
||||
2. chunk data/sources/*.txt -> data/chunks/<id>/*.txt + manifest.json
|
||||
3. emit one prompt per `pending` chunk -> data/chunks/_prompts/*.md
|
||||
4. report how many chunks remain `pending`
|
||||
|
||||
Usage:
|
||||
python scripts/run_extraction.py
|
||||
python scripts/run_extraction.py --skip-normalize # re-chunk only
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
from unified_processor import UnifiedProcessor
|
||||
from import_claude_activities import ClaudeActivityImporter
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
    if _p not in sys.path:
        sys.path.insert(0, _p)
|
||||
|
||||
import chunk_sources # noqa: E402
|
||||
import normalize_sources # noqa: E402
|
||||
|
||||
SUBAGENT_PROMPT = SCRIPT_DIR / "SUBAGENT_PROMPT.md"
|
||||
|
||||
|
||||
def emit_chunk_prompt(chunk_key: str, meta: dict, prompts_dir: Path) -> Path:
    """Write the subagent prompt for one pending chunk."""
    chunk_file = meta.get("chunk_file", f"data/chunks/<id>/{chunk_key}.txt")
    expected_json = meta.get("expected_json", f"{chunk_key}.json")
    text = "\n".join([
        f"# EXTRACTION — chunk `{chunk_key}`",
        "",
        f"Read ONLY this chunk: `{chunk_file}`",
        f"Chunk range: {meta.get('chunk_range', '?')}",
        "",
        f"Follow the rules in `{SUBAGENT_PROMPT.relative_to(REPO_ROOT)}`.",
        "Identify every distinct activity, fill the schema "
        "(`scripts/activity_schema.json`), and write the result to:",
        "",
        f" data/extracted/{expected_json}",
        "",
        "Header fields to set: "
        f'source_id="{meta.get("source_id", "")}", chunk_key="{chunk_key}", '
        f'source_hash="{meta.get("source_hash", "")}".',
        "",
    ])
    prompts_dir.mkdir(parents=True, exist_ok=True)
    out = prompts_dir / f"{chunk_key}.prompt.md"
    out.write_text(text, encoding="utf-8")
    return out
|
||||
|
||||
|
||||
def run(
    *,
    corpus_root: Path,
    sources_dir: Path,
    chunks_dir: Path,
    skip_normalize: bool = False,
) -> dict:
    summary: dict = {}

    if not skip_normalize:
        norm = normalize_sources.run(corpus_root, sources_dir)
        summary["normalized"] = {"ok": norm["ok"], "total": norm["total"],
                                 "errors": norm["errors"]}

    chunk_summary = chunk_sources.run(sources_dir, chunks_dir)
    summary["chunks"] = chunk_summary

    manifest_path = chunks_dir / "manifest.json"
    manifest = chunk_sources.load_manifest(manifest_path)
    prompts_dir = chunks_dir / "_prompts"

    pending = {k: m for k, m in manifest["chunks"].items()
               if m.get("state") == "pending"}
    for key, meta in sorted(pending.items()):
        emit_chunk_prompt(key, meta, prompts_dir)

    states: dict[str, int] = {}
    for m in manifest["chunks"].values():
        states[m.get("state", "?")] = states.get(m.get("state", "?"), 0) + 1
    summary["states"] = states
    summary["pending"] = len(pending)
    summary["prompts_dir"] = str(prompts_dir)
    return summary
|
||||
|
||||
|
||||
def main(argv: Optional[list[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Extraction orchestrator.")
    parser.add_argument("--corpus", default="data/carti-camp-jocuri")
    parser.add_argument("--sources", default="data/sources")
    parser.add_argument("--chunks", default="data/chunks")
    parser.add_argument("--skip-normalize", action="store_true",
                        help="skip normalization, re-chunk existing sources only")
    args = parser.parse_args(argv)

    summary = run(
        corpus_root=Path(args.corpus),
        sources_dir=Path(args.sources),
        chunks_dir=Path(args.chunks),
        skip_normalize=args.skip_normalize,
    )

    print("=" * 60)
    print("EXTRACTION ORCHESTRATOR")
    print("=" * 60)
    if "normalized" in summary:
        n = summary["normalized"]
        print(f"normalized : {n['ok']}/{n['total']} (errors {n['errors']})")
    print(f"chunks : {summary['chunks']['chunks']}")
    for state, count in sorted(summary["states"].items()):
        print(f" {state:<10}: {count}")
    print(f"\npending chunks remaining : {summary['pending']}")
    if summary["pending"]:
        print(f"subagent prompts written : {summary['prompts_dir']}/")
        print("Launch waves of ~5-10 subagents on those prompts, then run "
              "validate_extractions.py and build_database.py --rebuild.")
    else:
        print("All chunks extracted — run build_database.py --rebuild.")
    print("=" * 60)
    return 0
|
||||
|
||||
def main():
|
||||
print("="*60)
|
||||
print("ACTIVITY EXTRACTION SYSTEM")
|
||||
print("Strategy S8: Hybrid Claude + Scripts")
|
||||
print("="*60)
|
||||
|
||||
# Step 1: Run automated extraction
|
||||
print("\nSTEP 1: Automated Extraction")
|
||||
print("-"*40)
|
||||
processor = UnifiedProcessor()
|
||||
processor.process_automated_formats()
|
||||
|
||||
# Step 2: Wait for Claude processing
|
||||
print("\n" + "="*60)
|
||||
print("STEP 2: Manual Claude Processing Required")
|
||||
print("-"*40)
|
||||
print("Please process PDF/DOC files with Claude using the template.")
|
||||
print("Files are listed in: pdf_doc_for_claude.txt")
|
||||
print("Save extracted activities as JSON in: scripts/extracted_activities/")
|
||||
print("="*60)
|
||||
|
||||
response = input("\nHave you completed Claude processing? (y/n): ")
|
||||
|
||||
if response.lower() == 'y':
|
||||
# Step 3: Import Claude-extracted activities
|
||||
print("\nSTEP 3: Importing Claude-extracted activities")
|
||||
print("-"*40)
|
||||
importer = ClaudeActivityImporter()
|
||||
importer.import_all_json_files()
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("EXTRACTION COMPLETE!")
|
||||
print("="*60)
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
raise SystemExit(main())
|
||||
|
||||
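The pending-chunk selection and state tally in the new `run` rely on a manifest shaped like `{"chunks": {<chunk_key>: {"state": ...}}}`. The real manifest is owned by `chunk_sources.py`, which is not shown in this part of the diff, so the sketch below uses a made-up manifest: the chunk keys, the `chunk_range` values, and the `extracted` state are assumptions for illustration, while `pending` and `rejected` are the states named in the sources above.

```python
from collections import Counter

# Hypothetical manifest; only the overall {"chunks": {...}} shape is taken
# from run_extraction.py.
manifest = {
    "chunks": {
        "src01_c001": {"state": "extracted", "chunk_range": "1-20"},
        "src01_c002": {"state": "pending", "chunk_range": "19-40"},
        "src02_c001": {"state": "rejected", "chunk_range": "1-20"},
        "src02_c002": {"chunk_range": "19-33"},  # missing state is tallied as "?"
    }
}

# Same filter as run(): only chunks explicitly marked pending get a prompt.
pending = {k: m for k, m in manifest["chunks"].items()
           if m.get("state") == "pending"}

# Same tally as run(), written with Counter instead of a manual dict.
states = Counter(m.get("state", "?") for m in manifest["chunks"].values())

print(sorted(pending))
print(dict(states))
```

Note that under this filter a `rejected` chunk does not get a new prompt from `run_extraction.py`; per the `validate_extractions.py` docstring, re-extraction prompts for rejected chunks are written by the validator instead.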
@@ -1,197 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
Text/Markdown Activity Extractor
|
||||
Processes TXT and MD files for activity extraction
|
||||
"""
|
||||
|
||||
import re
|
||||
from pathlib import Path
|
||||
from typing import List, Dict
|
||||
import sqlite3
|
||||
from datetime import datetime
|
||||
|
||||
class TextActivityExtractor:
|
||||
def __init__(self, db_path='data/activities.db'):
|
||||
self.db_path = db_path
|
||||
self.activity_patterns = {
|
||||
'section_headers': [
|
||||
r'^#{1,6}\s*(.+)$', # Markdown headers
|
||||
r'^([A-Z][^\.]{10,100})$', # Plain titles
|
||||
r'^\d+\.\s*(.+)$', # Numbered lists
|
||||
r'^[•\-\*]\s*(.+)$', # Bullet points
|
||||
],
|
||||
'activity_markers': [
|
||||
'joc:', 'activitate:', 'exercitiu:', 'team building:',
|
||||
'nume:', 'titlu:', 'denumire:'
|
||||
]
|
||||
}
|
||||
|
||||
def extract_from_text(self, file_path: str) -> List[Dict]:
|
||||
"""Extract activities from a text/markdown file"""
|
||||
activities = []
|
||||
|
||||
try:
|
||||
with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
|
||||
content = f.read()
|
||||
|
||||
# Method 1: look for markdown sections
|
||||
if file_path.endswith('.md'):
|
||||
activities.extend(self._extract_from_markdown(content, file_path))
|
||||
|
||||
# Method 2: look for general patterns
|
||||
activities.extend(self._extract_from_patterns(content, file_path))
|
||||
|
||||
# Method 3: look for structured text blocks
|
||||
activities.extend(self._extract_from_blocks(content, file_path))
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error processing {file_path}: {e}")
|
||||
|
||||
return activities
|
||||
|
||||
def _extract_from_markdown(self, content, source_file):
|
||||
"""Extrage activitati din format markdown"""
|
||||
activities = []
|
||||
lines = content.split('\n')
|
||||
|
||||
current_activity = None
|
||||
current_content = []
|
||||
|
||||
for line in lines:
|
||||
# Check whether this line is an activity header
|
||||
if re.match(r'^#{1,3}\s*(.+)', line):
|
||||
# Save the previous activity, if any
|
||||
if current_activity and current_content:
|
||||
current_activity['description'] = '\n'.join(current_content[:20]) # Max 20 lines
|
||||
activities.append(current_activity)
|
||||
|
||||
# Check whether the new header is an activity
|
||||
header_text = re.sub(r'^#{1,3}\s*', '', line)
|
||||
if any(marker in header_text.lower() for marker in ['joc', 'activitate', 'exercitiu']):
|
||||
current_activity = {
|
||||
'name': header_text[:200],
|
||||
'source_file': str(source_file),
|
||||
'category': '[A]'
|
||||
}
|
||||
current_content = []
|
||||
else:
|
||||
current_activity = None
|
||||
|
||||
elif current_activity:
|
||||
# Append content to the current activity
|
||||
if line.strip():
|
||||
current_content.append(line)
|
||||
|
||||
# Save the last activity
|
||||
if current_activity and current_content:
|
||||
current_activity['description'] = '\n'.join(current_content[:20])
|
||||
activities.append(current_activity)
|
||||
|
||||
return activities
|
||||
|
||||
def _extract_from_patterns(self, content, source_file):
|
||||
"""Extrage folosind pattern matching"""
|
||||
activities = []
|
||||
|
||||
# Look for activity-specific markers
|
||||
for marker in self.activity_patterns['activity_markers']:
|
||||
pattern = re.compile(f'{re.escape(marker)}\\s*(.+?)(?=\\n\\n|{re.escape(marker)}|$)',
|
||||
re.IGNORECASE | re.DOTALL)
|
||||
matches = pattern.finditer(content)
|
||||
|
||||
for match in matches:
|
||||
activity_text = match.group(1)
|
||||
if len(activity_text) > 20:
|
||||
activity = {
|
||||
'name': activity_text.split('\n')[0][:200],
|
||||
'description': activity_text[:1000],
|
||||
'source_file': str(source_file),
|
||||
'category': '[A]'
|
||||
}
|
||||
activities.append(activity)
|
||||
|
||||
return activities
|
||||
|
||||
def _extract_from_blocks(self, content, source_file):
|
||||
"""Extrage din blocuri de text separate"""
|
||||
activities = []
|
||||
|
||||
# Split into blocks separated by blank lines
|
||||
blocks = re.split(r'\n\s*\n', content)
|
||||
|
||||
for block in blocks:
|
||||
if len(block) > 50: # Minimum 50 characters
|
||||
lines = block.strip().split('\n')
|
||||
first_line = lines[0].strip()
|
||||
|
||||
# Check whether the block looks like an activity
|
||||
if any(keyword in first_line.lower() for keyword in ['joc', 'activitate', 'exercitiu']):
|
||||
activity = {
|
||||
'name': first_line[:200],
|
||||
'description': block[:1000],
|
||||
'source_file': str(source_file),
|
||||
'category': '[A]'
|
||||
}
|
||||
activities.append(activity)
|
||||
|
||||
return activities
|
||||
|
||||
def save_to_database(self, activities):
|
||||
"""Salveaza in baza de date"""
|
||||
conn = sqlite3.connect(self.db_path)
|
||||
cursor = conn.cursor()
|
||||
|
||||
saved_count = 0
|
||||
|
||||
for activity in activities:
|
||||
try:
|
||||
# Check for duplicates
|
||||
cursor.execute(
|
||||
"SELECT id FROM activities WHERE name = ? AND source_file = ?",
|
||||
(activity.get('name'), activity.get('source_file'))
|
||||
)
|
||||
|
||||
if not cursor.fetchone():
|
||||
columns = list(activity.keys())
|
||||
values = list(activity.values())
|
||||
placeholders = ['?' for _ in values]
|
||||
|
||||
query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
|
||||
cursor.execute(query, values)
|
||||
saved_count += 1
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error saving: {e}")
|
||||
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
return saved_count
|
||||
|
||||
def process_all_text_files(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
|
||||
"""Proceseaza toate fisierele text si markdown"""
|
||||
base_path = Path(base_path)
|
||||
|
||||
text_files = list(base_path.rglob("*.txt"))
|
||||
md_files = list(base_path.rglob("*.md"))
|
||||
all_files = text_files + md_files
|
||||
|
||||
print(f"Found {len(all_files)} text/markdown files")
|
||||
|
||||
all_activities = []
|
||||
|
||||
for file_path in all_files:
|
||||
activities = self.extract_from_text(str(file_path))
|
||||
all_activities.extend(activities)
|
||||
print(f"Processed {file_path.name}: {len(activities)} activities")
|
||||
|
||||
# Save to database
|
||||
saved = self.save_to_database(all_activities)
|
||||
print(f"\nTotal saved: {saved} activities from {len(all_files)} files")
|
||||
|
||||
return len(all_files), saved
|
||||
|
||||
if __name__ == "__main__":
|
||||
extractor = TextActivityExtractor()
|
||||
extractor.process_all_text_files()
|
||||
@@ -1,151 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Unified Activity Processor
|
||||
Orchestrates all extractors for complete processing
|
||||
"""
|
||||
|
||||
import time
|
||||
from pathlib import Path
|
||||
from html_extractor import HTMLActivityExtractor
|
||||
from text_extractor import TextActivityExtractor
|
||||
import sqlite3
|
||||
|
||||
class UnifiedProcessor:
|
||||
def __init__(self, db_path='data/activities.db'):
|
||||
self.db_path = db_path
|
||||
self.html_extractor = HTMLActivityExtractor(db_path)
|
||||
self.text_extractor = TextActivityExtractor(db_path)
|
||||
self.stats = {
|
||||
'html_processed': 0,
|
||||
'text_processed': 0,
|
||||
'pdf_to_process': 0,
|
||||
'doc_to_process': 0,
|
||||
'total_activities': 0,
|
||||
'start_time': None,
|
||||
'end_time': None
|
||||
}
|
||||
|
||||
def get_current_activity_count(self):
|
||||
"""Obine numrul curent de activiti din DB"""
|
||||
conn = sqlite3.connect(self.db_path)
|
||||
cursor = conn.cursor()
|
||||
cursor.execute("SELECT COUNT(*) FROM activities")
|
||||
count = cursor.fetchone()[0]
|
||||
conn.close()
|
||||
return count
|
||||
|
||||
def count_files_to_process(self, base_path):
|
||||
"""Numr fiierele care trebuie procesate"""
|
||||
base_path = Path(base_path)
|
||||
|
||||
counts = {
|
||||
'html': len(list(base_path.rglob("*.html"))) + len(list(base_path.rglob("*.htm"))),
|
||||
'txt': len(list(base_path.rglob("*.txt"))),
|
||||
'md': len(list(base_path.rglob("*.md"))),
|
||||
'pdf': len(list(base_path.rglob("*.pdf"))),
|
||||
'doc': len(list(base_path.rglob("*.doc"))),
|
||||
'docx': len(list(base_path.rglob("*.docx")))
|
||||
}
|
||||
|
||||
return counts
|
||||
|
||||
def process_automated_formats(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
|
||||
"""Proceseaz toate formatele care pot fi automatizate"""
|
||||
print("="*60)
|
||||
print("UNIFIED ACTIVITY PROCESSOR - AUTOMATED PHASE")
|
||||
print("="*60)
|
||||
|
||||
self.stats['start_time'] = time.time()
|
||||
initial_count = self.get_current_activity_count()
|
||||
|
||||
# Display initial statistics
|
||||
file_counts = self.count_files_to_process(base_path)
|
||||
print(f"\nFiles to process:")
|
||||
for format, count in file_counts.items():
|
||||
print(f" {format.upper()}: {count} files")
|
||||
print(f"\nCurrent activities in database: {initial_count}")
|
||||
print("-"*60)
|
||||
|
||||
# PHASE 1: HTML processing (highest priority - large volume)
|
||||
print("\n[1/2] Processing HTML files...")
|
||||
print("-"*40)
|
||||
html_processed, html_errors = self.html_extractor.process_all_html_files(base_path)
|
||||
self.stats['html_processed'] = html_processed
|
||||
|
||||
# PHASE 2: Text/MD processing
|
||||
print("\n[2/2] Processing Text/Markdown files...")
|
||||
print("-"*40)
|
||||
text_processed, text_saved = self.text_extractor.process_all_text_files(base_path)
|
||||
self.stats['text_processed'] = text_processed
|
||||
|
||||
# Final statistics
|
||||
self.stats['end_time'] = time.time()
|
||||
final_count = self.get_current_activity_count()
|
||||
self.stats['total_activities'] = final_count - initial_count
|
||||
|
||||
# Identify the files that require manual processing
|
||||
self.stats['pdf_to_process'] = file_counts['pdf']
|
||||
self.stats['doc_to_process'] = file_counts['doc'] + file_counts['docx']
|
||||
|
||||
self.print_summary()
|
||||
self.save_pdf_doc_list(base_path)
|
||||
|
||||
def print_summary(self):
|
||||
"""Afieaz rezumatul procesrii"""
|
||||
print("\n" + "="*60)
|
||||
print("PROCESSING SUMMARY")
|
||||
print("="*60)
|
||||
|
||||
duration = self.stats['end_time'] - self.stats['start_time']
|
||||
|
||||
print(f"\nAutomated Processing Results:")
|
||||
print(f" HTML files processed: {self.stats['html_processed']}")
|
||||
print(f" Text/MD files processed: {self.stats['text_processed']}")
|
||||
print(f" New activities added: {self.stats['total_activities']}")
|
||||
print(f" Processing time: {duration:.1f} seconds")
|
||||
|
||||
print(f"\nFiles requiring Claude processing:")
|
||||
print(f" PDF files: {self.stats['pdf_to_process']}")
|
||||
print(f" DOC/DOCX files: {self.stats['doc_to_process']}")
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("NEXT STEPS:")
|
||||
print("1. Review the file 'pdf_doc_for_claude.txt' for manual processing")
|
||||
print("2. Use Claude to extract activities from PDF/DOC files")
|
||||
print("3. Focus on largest PDF files first (highest activity density)")
|
||||
print("="*60)
|
||||
|
||||
    def save_pdf_doc_list(self, base_path):
        """Save the list of PDF/DOC files for processing with Claude"""
        base_path = Path(base_path)

        pdf_files = sorted(base_path.rglob("*.pdf"), key=lambda p: p.stat().st_size, reverse=True)
        doc_files = list(base_path.rglob("*.doc"))
        docx_files = list(base_path.rglob("*.docx"))

        with open('pdf_doc_for_claude.txt', 'w', encoding='utf-8') as f:
            f.write("PDF/DOC FILES FOR CLAUDE PROCESSING\n")
            f.write("="*60 + "\n")
            f.write("Files sorted by size (largest first = likely more activities)\n\n")

            f.write("TOP PRIORITY PDF FILES (process these first):\n")
            f.write("-"*40 + "\n")
            for i, pdf in enumerate(pdf_files[:20], 1):
                size_mb = pdf.stat().st_size / (1024*1024)
                f.write(f"{i}. {pdf.name} ({size_mb:.1f} MB)\n")
                f.write(f" Path: {pdf}\n\n")

            if len(pdf_files) > 20:
                f.write(f"\n... and {len(pdf_files)-20} more PDF files\n\n")

            f.write("\nDOC/DOCX FILES:\n")
            f.write("-"*40 + "\n")
            for doc in doc_files + docx_files:
                size_kb = doc.stat().st_size / 1024
                f.write(f"- {doc.name} ({size_kb:.1f} KB)\n")

        print("\nPDF/DOC list saved to: pdf_doc_for_claude.txt")

if __name__ == "__main__":
    processor = UnifiedProcessor()
    processor.process_automated_formats()
208
scripts/validate_extractions.py
Normal file
@@ -0,0 +1,208 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
validate_extractions.py — validate every data/extracted/*.json (plan §5b).

For each extraction file it runs two checks:
  1. JSON-schema validation against scripts/activity_schema.json,
  2. the source_excerpt anti-hallucination check (each excerpt must be a fuzzy
     substring of the chunk it came from).

For every failing chunk it:
  * writes the exact re-extraction prompt to data/extracted/_reextract/<chunk>.prompt.md,
  * marks the chunk `rejected` in data/chunks/manifest.json.

The orchestrator then re-launches subagents only on the `rejected` chunks; the
loop repeats until nothing is rejected.

Usage:
    python scripts/validate_extractions.py
"""

from __future__ import annotations

import argparse
import json
import sys
from pathlib import Path
from typing import Optional

SCRIPT_DIR = Path(__file__).resolve().parent
REPO_ROOT = SCRIPT_DIR.parent
for _p in (str(SCRIPT_DIR), str(REPO_ROOT)):
    if _p not in sys.path:
        sys.path.insert(0, _p)

from import_common import (  # noqa: E402
    DEFAULT_SCHEMA_PATH,
    chunk_key_for,
    excerpt_matches,
    excerpt_score,
    find_chunk_text,
    iter_extraction_files,
    load_schema,
    validate_extraction,
)

SUBAGENT_PROMPT = SCRIPT_DIR / "SUBAGENT_PROMPT.md"


# --------------------------------------------------------------------------
# re-extraction prompt
# --------------------------------------------------------------------------
def build_reextraction_prompt(
    chunk_key: str, chunk_file: Optional[str], errors: list[str]
) -> str:
    """The exact prompt to hand a subagent to re-extract a rejected chunk."""
    chunk_ref = chunk_file or f"data/chunks/<source_id>/{chunk_key}.txt"
    lines = [
        f"# RE-EXTRACTION — chunk `{chunk_key}`",
        "",
        "The previous extraction for this chunk was **REJECTED**. Reasons:",
        "",
    ]
    lines += [f"- {e}" for e in errors]
    lines += [
        "",
        "## What to do",
        "",
        f"1. Read ONLY this chunk: `{chunk_ref}`",
        f"2. Follow the extraction rules in `{SUBAGENT_PROMPT.relative_to(REPO_ROOT)}`.",
        "3. Fix every problem listed above. In particular:",
        " - every `source_excerpt` must be copied **verbatim** from the chunk",
        " (it is checked as a fuzzy substring — invented quotes are rejected);",
        " - `source_excerpt` and `page_reference` are mandatory on every activity;",
        " - the output must validate against `scripts/activity_schema.json`.",
        f"4. Overwrite the extraction file `data/extracted/{chunk_key}.json`.",
        "",
    ]
    return "\n".join(lines)


# --------------------------------------------------------------------------
# manifest
# --------------------------------------------------------------------------
def load_manifest(manifest_path: Path) -> dict:
    if manifest_path.is_file():
        try:
            data = json.loads(manifest_path.read_text(encoding="utf-8"))
            data.setdefault("chunks", {})
            return data
        except (json.JSONDecodeError, OSError):
            pass
    return {"chunks": {}}


def save_manifest(manifest: dict, manifest_path: Path) -> None:
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    manifest_path.write_text(
        json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8"
    )


def mark_rejected(manifest: dict, chunk_key: str) -> None:
    """Flip a chunk to `rejected` in the manifest (creating the entry if new)."""
    entry = manifest["chunks"].get(chunk_key, {})
    entry["state"] = "rejected"
    manifest["chunks"][chunk_key] = entry


# --------------------------------------------------------------------------
# validation
# --------------------------------------------------------------------------
def validate_file(json_path: Path, schema: dict, chunks_dir: Path) -> list[str]:
    """Return the list of errors for one extraction file (empty == valid)."""
    try:
        data = json.loads(json_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    errors = validate_extraction(data, schema)
    if errors:
        return errors

    header = data.get("header", {})
    chunk_text = find_chunk_text(json_path, header, chunks_dir)
    if chunk_text is None:
        return [f"source chunk not found for {chunk_key_for(json_path, header)}"]

    for adict in data.get("activities", []):
        excerpt = adict.get("source_excerpt") or ""
        if not excerpt_matches(excerpt, chunk_text):
            score = excerpt_score(excerpt, chunk_text)
            errors.append(
                f"activity {adict.get('name')!r}: source_excerpt not found in "
                f"chunk (best match {score:.0f}/100) — possible hallucination"
            )
    return errors


def run(
    extracted_dir: Path,
    chunks_dir: Path,
    manifest_path: Path,
    schema_path: Path = DEFAULT_SCHEMA_PATH,
) -> dict:
    schema = load_schema(schema_path)
    manifest = load_manifest(manifest_path)
    reextract_dir = extracted_dir / "_reextract"

    report = {"total": 0, "valid": 0, "rejected": 0, "rejected_chunks": []}
    for json_path in iter_extraction_files(extracted_dir):
        report["total"] += 1
        errors = validate_file(json_path, schema, chunks_dir)
        if not errors:
            report["valid"] += 1
            continue

        report["rejected"] += 1
        try:
            data = json.loads(json_path.read_text(encoding="utf-8"))
            # A schema-invalid file may be valid JSON without being an object
            # (e.g. a bare list), so guard the .get() call.
            header = data.get("header", {}) if isinstance(data, dict) else {}
        except json.JSONDecodeError:
            header = {}
        chunk_key = chunk_key_for(json_path, header)
        chunk_file = None
        meta = manifest["chunks"].get(chunk_key)
        if meta:
            chunk_file = meta.get("chunk_file")

        reextract_dir.mkdir(parents=True, exist_ok=True)
        prompt = build_reextraction_prompt(chunk_key, chunk_file, errors)
        (reextract_dir / f"{chunk_key}.prompt.md").write_text(prompt, encoding="utf-8")

        mark_rejected(manifest, chunk_key)
        report["rejected_chunks"].append({"chunk": chunk_key, "errors": errors})

    save_manifest(manifest, manifest_path)
    return report


# --------------------------------------------------------------------------
# CLI
# --------------------------------------------------------------------------
def main(argv: Optional[list[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Validate extraction JSON files.")
    parser.add_argument("--extracted", default="data/extracted")
    parser.add_argument("--chunks", default="data/chunks")
    parser.add_argument("--manifest", default="data/chunks/manifest.json")
    parser.add_argument("--schema", default=str(DEFAULT_SCHEMA_PATH))
    args = parser.parse_args(argv)

    report = run(
        Path(args.extracted), Path(args.chunks), Path(args.manifest), Path(args.schema)
    )
    print(f"extraction files : {report['total']}")
    print(f" valid : {report['valid']}")
    print(f" rejected : {report['rejected']}")
    for item in report["rejected_chunks"]:
        print(f" [rejected] {item['chunk']}")
        for err in item["errors"]:
            print(f" - {err}")
    if report["rejected"]:
        print(f"\nRe-extraction prompts written to {args.extracted}/_reextract/")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
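
Reviewer note on the source_excerpt check: `excerpt_matches` and `excerpt_score` are imported from `import_common`, whose implementation is not part of this file. The sketch below is only an illustration of what a "fuzzy substring" test of this kind could look like, assuming whitespace/case normalization and a sliding-window comparison with the standard library's `difflib`. The function names, the window step, and the 85-point threshold are all hypothetical and are not taken from the commit.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse every whitespace run to a single space."""
    return " ".join(text.lower().split())


def fuzzy_excerpt_score(excerpt: str, chunk_text: str) -> float:
    """Score (0-100) how closely `excerpt` matches some window of the chunk.

    Hypothetical stand-in for import_common.excerpt_score.
    """
    needle = normalize(excerpt)
    haystack = normalize(chunk_text)
    if not needle or not haystack:
        return 0.0
    if needle in haystack:
        return 100.0  # exact substring after normalization

    # Slide a window the size of the excerpt across the chunk and keep the
    # best similarity ratio; the step size trades accuracy for speed.
    window = len(needle)
    step = max(1, window // 4)
    best = 0.0
    for start in range(0, max(1, len(haystack) - window + 1), step):
        candidate = haystack[start:start + window]
        ratio = SequenceMatcher(None, needle, candidate, autojunk=False).ratio()
        best = max(best, ratio)
    return best * 100.0


def fuzzy_excerpt_matches(excerpt: str, chunk_text: str, threshold: float = 85.0) -> bool:
    """Hypothetical stand-in for import_common.excerpt_matches."""
    return fuzzy_excerpt_score(excerpt, chunk_text) >= threshold


if __name__ == "__main__":
    chunk = "The players form a circle and pass the ball to the left until the music stops."
    print(fuzzy_excerpt_matches("players form a  circle and\npass the ball", chunk))
    print(fuzzy_excerpt_matches("Each team bakes bread over a campfire at midnight.", chunk))
```

A check along these lines tolerates line-wrap and spacing differences introduced by PDF extraction while still rejecting quotes that have no close counterpart in the chunk. The real threshold and matching strategy live in `import_common` and may differ.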