Commit Graph

6 Commits

Author SHA1 Message Date
Claude Agent
f7a37f91ec Headless cron enrichment system + progress checkpoint at 32%
OS cron fires enrich_wave.sh twice nightly (after the 23:00 UTC reset); each wave
caps at ~700 keys (~75% of the window) via enrichment_wave.py --prepare. Fully
headless: one claude -p per batch via xargs, flock-guarded, idempotent.
DB updated to 9541 activities; .gitignore covers enrichment intermediates.
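
The guard-and-cap behaviour described above can be sketched in Python roughly as follows. This is an illustration only, not the contents of enrich_wave.sh or enrichment_wave.py: the names `single_instance` and `prepare_wave`, the lock path, and the in-memory key lists are hypothetical, and the lock uses `fcntl.flock` (POSIX only) as a stand-in for the `flock` command.

```python
import fcntl
import os
from contextlib import contextmanager

WAVE_CAP = 700  # per-wave key cap quoted in the commit message


@contextmanager
def single_instance(lock_path):
    """Hold an exclusive, non-blocking lock for the duration of one wave.

    Raises BlockingIOError if another wave already holds the lock.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


def prepare_wave(all_keys, done_keys, cap=WAVE_CAP):
    """Return the next batch of pending keys, skipping finished ones.

    Re-running with the same done_keys yields the same batch, which is
    what makes a repeated cron firing harmless (idempotent).
    """
    done = set(done_keys)
    pending = [key for key in all_keys if key not in done]
    return pending[:cap]
```

A second cron firing that overlaps a running wave would fail fast at `single_instance` instead of processing the same keys twice.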

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 21:26:35 +00:00
Claude Agent
bcfb6841eb Phase 1 complete: bilingual+enrichment plumbing, UI/filters, frozen DB
Extraction finished (575/588 chunks; 6 content-filter-blocked, 7 await
re-extraction). DB rebuilt and frozen at 9418 activities — content_keys
are now stable for the enrichment overlay.

Part A (plumbing + UI):
- database.py: name_ro/description_ro/rules_ro/variations_ro, indoor_outdoor,
  space_needed, estimated_fields, source_id/source_ids/chunk_key columns;
  FTS5 indexes the 4 *_ro columns across CREATE + all 3 triggers; new equality
  filters + category counts for both axes.
- activity.py: new fields + bilingual display helpers (get_display_*,
  is_estimated, axis displays).
- config_taxonomy.py: INDOOR_OUTDOOR/SPACE_NEEDED enums + normalizers
  (None on unrecognised, no fabrication).
- search.py / routes.py / config.py / templates / css: new dropdowns,
  RO-primary rendering with "(estimat)" markers and collapsible original
  text, and a /source/<id> download route shipped DARK behind
  SOURCE_DOWNLOAD_ENABLED (copyright opt-in).
- build_database.py: source_id/chunk_key in dict_to_activity; merge_cluster
  unions source_ids without touching enrichment fields.
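
The "None on unrecognised, no fabrication" rule for the normalizers can be sketched like this. The enum values and aliases below are assumptions made for illustration; the real INDOOR_OUTDOOR values in config_taxonomy.py are not shown in this commit, and `normalize_indoor_outdoor` is a hypothetical name.

```python
# Hypothetical enum values; the real ones live in config_taxonomy.py.
INDOOR_OUTDOOR = ("indoor", "outdoor", "both")

# Hypothetical aliases mapped onto the canonical values.
_ALIASES = {"inside": "indoor", "outside": "outdoor"}


def normalize_indoor_outdoor(value):
    """Map free text onto a known enum value, or return None.

    Unrecognised input is never coerced to a default, so the database
    only ever stores values that the source or the enrichment stated.
    """
    if not isinstance(value, str):
        return None
    key = value.strip().lower()
    key = _ALIASES.get(key, key)
    return key if key in INDOOR_OUTDOOR else None
```

Returning None instead of a fallback such as "both" keeps missing data visibly missing, which is what lets the filters and category counts stay honest.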

Part B (enrichment pipeline, built not yet run):
- build_database.py: load_enrichment + apply_enrichment (post-dedup, keyed on
  content_key) + --enrichment CLI + stated-vs-estimated QA.
- run_enrichment.py (resumable, --source/--limit pilot scoping, --collect),
  ENRICHMENT_PROMPT.md.
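
A minimal sketch of a post-dedup overlay keyed on content_key is shown below. It is not the build_database.py implementation: activities are plain dicts here, and the policy that source-stated values win while enrichment-filled fields are recorded in estimated_fields is an assumption about how the stated-vs-estimated split works.

```python
def apply_enrichment(activities, enrichment):
    """Overlay enrichment records onto deduplicated activities.

    activities: list of dicts, each with a stable "content_key".
    enrichment: dict mapping content_key -> {field: value}.
    Returns the number of activities that had a matching record.
    """
    applied = 0
    for activity in activities:
        extra = enrichment.get(activity["content_key"])
        if extra is None:
            continue  # no enrichment for this key; leave the row untouched
        estimated = list(activity.get("estimated_fields", []))
        for field, value in extra.items():
            # Assumption: a value stated by the source is never overwritten.
            if activity.get(field) is None:
                activity[field] = value
                if field not in estimated:
                    estimated.append(field)
        activity["estimated_fields"] = estimated
        applied += 1
    return applied
```

Because the overlay is keyed on content_key, it depends on those keys staying stable, which is why the DB was frozen before enrichment.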

Repair: scripts/repair_extractions.py fixes the subagents' systematic
unescaped-ASCII-quote bug with a faithful char-scanner (escapes, never
truncates) + schema validation + a strictly-more-text guard. json_repair was
tried first, truncated silently, and is NOT used. build_database has no repair
dependency.
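
The character-scanner idea can be illustrated with the sketch below. It is an assumption-laden approximation, not scripts/repair_extractions.py: the closing-quote heuristic (a quote ends a string only when followed by `,`, `}`, `]`, `:` or end of input), the function names, and the simple length guard are all hypothetical, and schema validation is omitted.

```python
import json

# Characters that may legally follow a closing quote in JSON.
STRUCTURAL_FOLLOWERS = {",", "}", "]", ":"}


def escape_inner_quotes(raw):
    """Escape ASCII quotes that appear inside JSON string values."""
    out = []
    in_string = False
    i, n = 0, len(raw)
    while i < n:
        ch = raw[i]
        if not in_string:
            if ch == '"':
                in_string = True
            out.append(ch)
            i += 1
        elif ch == "\\" and i + 1 < n:
            out.append(raw[i:i + 2])  # keep existing escape pairs intact
            i += 2
        elif ch == '"':
            j = i + 1
            while j < n and raw[j] in " \t\r\n":
                j += 1
            if j == n or raw[j] in STRUCTURAL_FOLLOWERS:
                in_string = False
                out.append(ch)  # a genuine closing quote
            else:
                out.append('\\"')  # a stray inner quote: escape, never drop
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def repair_json_text(raw):
    """Repair, check that no text was lost, then parse."""
    repaired = escape_inner_quotes(raw)
    if len(repaired) < len(raw):
        raise ValueError("repair lost text")
    return json.loads(repaired)
```

For example, `repair_json_text('{"name": "Play "tag" outside", "players": 6}')` parses with the inner quotes preserved in the name. The heuristic misfires when an inner quote is itself followed by a comma or colon, so a real repair step still needs validation after parsing.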

Tests: tests/test_enrichment.py added; 99 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-29 18:10:13 +00:00
Claude Agent
09999ccd40 Phase 0 follow-ups: re-extract 13 chunks, resolve 377 needs_review
- Re-extracted the 13 chunks with paraphrased source_excerpts
  (root cause: the original excerpts straddled --- PAGE N --- markers,
  so rapidfuzz partial_ratio scored them only 75-90/100). Re-extraction
  used verbatim within-page quotes; all now score 100/100.
- Hallucinated drops: 19 -> 0.
- Bulk-resolved all 377 borderline-dedup needs_review pairs with the
  "merge" resolution, which clears the badge but leaves both rows in
  place. The pairs came from overlapping chunks re-extracting the same
  activity with slightly different prose.
- Final DB: 1751 activities (was 1732).
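
Why a page marker lowers the excerpt score can be shown with a small standard-library stand-in. This sketch uses `difflib.SequenceMatcher` over same-length windows; it is not rapidfuzz's partial_ratio and will not reproduce its exact numbers, and the sample text and the name `best_window_score` are made up for illustration.

```python
from difflib import SequenceMatcher


def best_window_score(excerpt, source):
    """Best 0-100 similarity between excerpt and any same-length window of source."""
    width = len(excerpt)
    if width == 0 or width > len(source):
        return 0.0
    best = 0.0
    for start in range(len(source) - width + 1):
        window = source[start:start + width]
        ratio = SequenceMatcher(None, excerpt, window, autojunk=False).ratio()
        best = max(best, ratio)
    return 100.0 * best


# Hypothetical extracted text with a page marker in the middle of a sentence.
source = "Players form a circle and pass\n--- PAGE 12 ---\nthe ball to the left."

within_page = "Players form a circle"  # verbatim, stays on one page
straddling = "form a circle and pass the ball to the left"  # skips the marker
```

Here `best_window_score(within_page, source)` is exactly 100.0, while the straddling excerpt scores below 100 because no window of the source matches it once the marker text intervenes.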

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:59:36 +00:00
Claude Agent
3d9f266696 Phase 0 pilot: rebuild activities.db from 5-file extraction
61 chunks × LLM subagent extraction yielded 1780 raw activities;
build_database dedup + hallucination check yielded 1732 in DB.

Pilot metrics vs plan acceptance thresholds:
- hallucinated drops      : 19/1780 = 1.07%  (threshold ≤ 2%)
- schema-rejected files   : 0/61, i.e. 61/61 valid (threshold ≥ 90% valid)
- chunks needing re-extract: 13/61 (paraphrased excerpts scoring 75-90/100)
- % with rules            : 99.9%
- extraction_confidence high: 1712/1732 = 98.8%

OCR decision: NOT NEEDED. The Cartea_Mare scanned-PDF candidate
extracted 151 pages / 38k words of real text via pdfplumber alone.

Pilot files:
- 1000 Fantastic Scout Games (EN, 278pg, 18 chunks → 946 activities)
- dragon.sleepdeprived.ca/games mirror (EN, 498pg, 31 chunks → 531)
- Cartea Mare a Jocurilor (RO, 151pg, 10 chunks → 284)
- Activităţi şi jocuri ... .doc (RO, 7pg, 1 chunk → 19, needs_review)
- Amazing Race templates zip (graphics only, 0 activities — expected)

The old activities.db was backed up to .bak before atomic swap.
tests/ still green (71 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 07:43:42 +00:00
a19ddf0b71 Refactor extraction system and reorganize project structure
- Remove obsolete documentation files (DEPLOYMENT.md, PLAN_IMPLEMENTARE_S8_DETALIAT.md, README.md)
- Add comprehensive extraction pipeline with multiple format support (PDF, HTML, text)
- Implement Claude-based activity extraction with structured templates
- Update dependencies and Docker configuration
- Reorganize scripts directory with modular extraction components
- Move example documentation to appropriate location

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-09-11 23:32:37 +03:00
4f83b8e73c Complete v2.0 transformation: Production-ready Flask application
Major Changes:
- Migrated from prototype to production architecture
- Implemented modular Flask app with models/services/web layers
- Added Docker containerization with docker-compose
- Switched to Pipenv for dependency management
- Built advanced parser extracting 63 real activities from INDEX_MASTER
- Implemented SQLite FTS5 full-text search
- Created minimalist, responsive web interface
- Added comprehensive documentation and deployment guides

Technical Improvements:
- Clean separation of concerns (models, services, web)
- Enhanced database schema with FTS5 indexing
- Dynamic filters populated from real data
- Production-ready configuration management
- Security best practices implementation
- Health monitoring and API endpoints

Removed Legacy Files:
- Old src/ directory structure
- Static requirements.txt (replaced by Pipfile)
- Test and debug files
- Temporary cache files

Current Status:
- 63 activities indexed across 8 categories
- Full-text search operational
- Docker deployment ready
- Production documentation complete

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-09-11 00:23:47 +03:00