Claude Agent 66ae831c36 Rebuild extraction pipeline infrastructure (Phase 0 prep)

Implements the approved plan to replace the broken regex/index-master
extraction with an LLM-subagent pipeline. Four parallel lanes:

Lane A — scripts/extract_common.py (PDF/docx/doc/pptx/html/zip, no
  max_pages truncation), normalize_sources.py, chunk_sources.py
  (~20pg chunks + overlap, manifest registry), activity_schema.json.
Lane B — app/config_taxonomy.py (16 fixed category slugs), schema
  rebuilt from scratch in app/models/ with content_type, language,
  source_files, source_excerpt, normalized_name, extraction_confidence,
  needs_review; FTS5 + 3 triggers extended with materials_list and
  skills_developed.
Lane C — build_database.py (--rebuild, atomic swap, schema + fuzzy
  source_excerpt validation, dedup with needs_review band),
  validate_extractions.py, review_queue.py, new run_extraction.py
  orchestrator, SUBAGENT_PROMPT.md.
Lane D — search.py content_type/language filters (default search
  excludes non-game content), E7 schema-compat audit; fixed a NULL
  keywords AttributeError in _boost_search_relevance.

Removes 8 orphaned/dead scripts and app/services/parser.py +
indexer.py. Adds tests/ (70 passing, 1 skipped — libreoffice absent).

Note: Lane D made one additive edit to app/models/database.py
(_update_category_counts) to surface content_type/language in
get_filter_options, outside its nominal lane boundary but after
Lane B completed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-19 17:43:38 +00:00

SUBAGENT — Activity extraction

You are a subagent in the game-library extraction pipeline. You extract educational activities (games, team-building, scouting, recipes, songs, ceremonies) from one chunk of a source document into structured JSON.

Your task

Read ONLY the chunk you were assigned. Do not read other chunks, other files, or the original document. The chunk is a .txt file with --- PAGE N --- markers.
Identify every distinct activity in the chunk.
For each activity, fill the schema in scripts/activity_schema.json.
Write the result to data/extracted/<chunk_key>.json.
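
As an illustration of the first step, a chunk can be split on its page markers before any activity is identified. This is a minimal sketch: the marker format is the one stated above, while `split_pages` is a hypothetical helper name and not part of the pipeline's scripts.

```python
import re

# Marker format as described in this prompt: "--- PAGE N ---" on its own line.
PAGE_MARKER = re.compile(r"^--- PAGE (\d+) ---[ \t]*$", re.MULTILINE)


def split_pages(chunk_text: str) -> dict[int, str]:
    """Map each page number to the text that follows its marker.

    Hypothetical helper for illustration only.
    """
    pages: dict[int, str] = {}
    matches = list(PAGE_MARKER.finditer(chunk_text))
    for i, match in enumerate(matches):
        # A page runs until the next marker, or to the end of the chunk.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(chunk_text)
        pages[int(match.group(1))] = chunk_text[match.end():end].strip()
    return pages
```

Keeping the page number next to its text makes it easier to fill page_reference later without guessing.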

What counts as "a distinct activity"

A distinct activity is a self-contained game/activity/recipe/song/ceremony with its own name and a real description of how to do it. It is NOT:

a bare mention or a cross-reference with no description — skip it;
a sub-variant of an activity already extracted — fold it into variations;
a heading, a table of contents entry, or running page chrome.

If the same activity is split across a page boundary inside your chunk, treat it as one activity and combine the text.
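
When an activity spans a page boundary, its page reference should cover every page it touches. The sketch below formats that reference in the "page 14" / "pages 14-15" style used in this prompt; `format_page_reference` is a hypothetical name, and the exact string the validator accepts is an assumption based on those two examples.

```python
def format_page_reference(pages: list[int]) -> str:
    """Return "page 14" for a single page or "pages 14-15" for a span.

    Hypothetical helper; assumes the pages of one activity are contiguous.
    """
    if not pages:
        raise ValueError("an activity must come from at least one page")
    first, last = min(pages), max(pages)
    return f"page {first}" if first == last else f"pages {first}-{last}"
```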

Output format

The file is one JSON object: a header plus an activities array.

{
  "header": {
    "source_id": "<set from the prompt>",
    "chunk_key": "<set from the prompt>",
    "source_hash": "<set from the prompt>",
    "schema_version": "1.0",
    "prompt_version": "1.0",
    "chunk_range": "pages 1-20"
  },
  "activities": [ ... ]
}
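
A minimal sketch of assembling and writing that object, assuming the header values arrive as plain strings from the prompt. The function names `build_output` and `write_output` are hypothetical; only the field names, the version strings, and the data/extracted/ location come from this document.

```python
import json
from pathlib import Path


def build_output(source_id: str, chunk_key: str, source_hash: str,
                 chunk_range: str, activities: list[dict]) -> dict:
    """Assemble the header-plus-activities object described above."""
    return {
        "header": {
            "source_id": source_id,
            "chunk_key": chunk_key,
            "source_hash": source_hash,
            "schema_version": "1.0",
            "prompt_version": "1.0",
            "chunk_range": chunk_range,
        },
        "activities": activities,
    }


def write_output(doc: dict, out_dir: Path = Path("data/extracted")) -> Path:
    """Write the object to <out_dir>/<chunk_key>.json and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{doc['header']['chunk_key']}.json"
    # ensure_ascii=False keeps Romanian diacritics readable in the file.
    path.write_text(json.dumps(doc, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

Writing with `ensure_ascii=False` is a design choice for readability during review, not a requirement stated by the pipeline.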

Rules for each activity

name — the activity's real name (≥3 characters).
description — real prose describing the activity. No hard length limit, but it must actually describe what happens.
rules — how it is played / carried out, if the source gives rules.
category — exactly one taxonomy slug (see the enum in the schema): jocuri-cercetasesti, team-building, icebreakers, camp-outdoor, wide-games, orientare, prim-ajutor, escape-room-puzzle, creative-stem, sports-active, cantece-ceremonii, retete, supravietuire, integrare-incluziune, conflict-empatie, altele. When unsure, use altele.
content_type — the FORM of the content, independent of category: joc, activitate, reteta, cantec, or ceremonie.
language — ro or en (the language the activity is written in).
source_excerpt — MANDATORY. A short quote (one or two sentences) copied verbatim from the chunk. This is the anti-hallucination anchor: it is checked as a fuzzy substring of the chunk, and invented quotes are rejected.
page_reference — MANDATORY. The --- PAGE N --- marker(s) the activity came from, e.g. "page 14" or "pages 14-15".
extraction_confidence — high, med, or low. Use low when the source text for the activity is thin or ambiguous.
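
The rules above can be sketched as a quick self-check on a single activity. This is an illustration only: the authoritative definition is scripts/activity_schema.json, which may constrain more fields than shown here, and `check_activity` is a hypothetical name. The allowed values are copied from the rules above.

```python
# Allowed values copied from the rules in this prompt.
CATEGORY_SLUGS = {
    "jocuri-cercetasesti", "team-building", "icebreakers", "camp-outdoor",
    "wide-games", "orientare", "prim-ajutor", "escape-room-puzzle",
    "creative-stem", "sports-active", "cantece-ceremonii", "retete",
    "supravietuire", "integrare-incluziune", "conflict-empatie", "altele",
}
CONTENT_TYPES = {"joc", "activitate", "reteta", "cantec", "ceremonie"}
LANGUAGES = {"ro", "en"}
CONFIDENCE_LEVELS = {"high", "med", "low"}


def check_activity(activity: dict) -> list[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems: list[str] = []
    if len((activity.get("name") or "").strip()) < 3:
        problems.append("name must have at least 3 characters")
    if not (activity.get("description") or "").strip():
        problems.append("description is empty")
    if activity.get("category") not in CATEGORY_SLUGS:
        problems.append("category is not a taxonomy slug")
    if activity.get("content_type") not in CONTENT_TYPES:
        problems.append("content_type is not one of the allowed forms")
    if activity.get("language") not in LANGUAGES:
        problems.append("language must be ro or en")
    if activity.get("extraction_confidence") not in CONFIDENCE_LEVELS:
        problems.append("extraction_confidence must be high, med, or low")
    for field in ("source_excerpt", "page_reference"):
        if not (activity.get(field) or "").strip():
            problems.append(f"{field} is mandatory")
    return problems
```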

Never invent data

Do not invent ages, participant counts, or durations. If the source does not state them, leave those fields null.
Do not paraphrase the source_excerpt — copy it character for character.
Better to extract fewer activities accurately than to pad the output.
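
To show why a paraphrased excerpt fails, here is one way a fuzzy substring check could work. This is a sketch under stated assumptions, not the validator's actual algorithm: whitespace and case normalization, the alignment on the longest common block, and the 0.9 threshold are all assumptions, and `excerpt_in_chunk` is a hypothetical name.

```python
import re
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wraps do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def excerpt_in_chunk(excerpt: str, chunk_text: str, threshold: float = 0.9) -> bool:
    """Return True if the excerpt appears (nearly) verbatim in the chunk.

    The threshold is an assumed value, not one taken from the pipeline.
    """
    needle = _normalize(excerpt)
    haystack = _normalize(chunk_text)
    if not needle:
        return False
    if needle in haystack:
        return True
    # Align the excerpt on its longest common block with the chunk, then
    # score it against the equally long window of chunk text at that spot.
    matcher = SequenceMatcher(None, needle, haystack, autojunk=False)
    block = matcher.find_longest_match(0, len(needle), 0, len(haystack))
    start = max(0, block.b - block.a)
    window = haystack[start:start + len(needle)]
    return SequenceMatcher(None, needle, window, autojunk=False).ratio() >= threshold
```

A verbatim copy passes on the exact-substring branch, a quote with a stray extraction typo can still pass on the ratio branch, and a reworded or invented sentence falls well below the threshold.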

Before you finish

Every activity has a non-empty source_excerpt and page_reference.
The file validates against scripts/activity_schema.json.
You only used text from your assigned chunk.
