feat(ocr): Add docTR OCR engine with metrics infrastructure

Add docTR as primary OCR engine with 2-tier sequential processing,
OCR metrics tracking, and simplified engine selection.

Features:
- docTR OCR engine with light+medium preprocessing tiers
- doctr_plus mode with early exit optimization (~65% fast path)
- OCR metrics dashboard with per-engine statistics
- User OCR preference persistence
- Parallel worker pool for OCR processing
- Cross-validation for extraction quality
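
The two-tier flow with early exit listed above can be sketched roughly as follows. This is an illustration only, not the code added by this commit: `run_tiers`, `TierResult`, the `OcrPass` callable shape, and the 0.9 confidence threshold are all hypothetical. Only the idea of running the light tier first and falling back to the medium tier comes from the description.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence, Tuple

# Hypothetical shape of one OCR pass: image in, (text, confidence in [0, 1]) out.
OcrPass = Callable[[Any], Tuple[str, float]]


@dataclass
class TierResult:
    tier: str
    text: str
    confidence: float


def run_tiers(
    image: Any,
    tiers: Sequence[Tuple[str, OcrPass]],
    threshold: float = 0.9,  # assumed cut-off, not taken from the commit
) -> TierResult:
    """Run tiers in order and stop at the first result that meets the threshold."""
    if not tiers:
        raise ValueError("at least one tier is required")
    best: Optional[TierResult] = None
    for name, ocr_pass in tiers:
        text, confidence = ocr_pass(image)
        result = TierResult(name, text, confidence)
        # Remember the most confident result in case no tier clears the threshold.
        if best is None or result.confidence > best.confidence:
            best = result
        if result.confidence >= threshold:
            break  # early exit: the remaining, slower tiers are skipped
    return best
```

With tiers ordered `[("light", ...), ("medium", ...)]`, a confident light-tier result returns immediately, which is the fast path; otherwise the medium tier runs and the more confident of the two results is returned.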

Engine options: tesseract, doctr, doctr_plus (recommended), paddleocr
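
As a rough sketch of how these engine options might be resolved from the environment, assuming the `OCR_ENABLE_PADDLEOCR`, `OCR_ENABLE_TESSERACT`, `OCR_ACTIVE_ENGINES` and `OCR_DEFAULT_ENGINE` variables from the env example in this commit. The helper names (`_flag`, `resolve_engines`), the accepted truthy spellings, and the choice to raise on an unavailable default are assumptions for illustration, not the project's actual behaviour.

```python
import os
from typing import List, Mapping, Optional, Tuple

KNOWN_ENGINES = ("tesseract", "doctr", "doctr_plus", "paddleocr")


def _flag(env: Mapping[str, str], name: str, default: bool = True) -> bool:
    """Parse a boolean env var; the accepted spellings are an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


def resolve_engines(env: Optional[Mapping[str, str]] = None) -> Tuple[str, List[str]]:
    """Return (default_engine, active_engines) after applying the enable flags."""
    env = os.environ if env is None else env

    # Engines switched off to save memory are dropped from the active list.
    disabled = set()
    if not _flag(env, "OCR_ENABLE_PADDLEOCR"):
        disabled.add("paddleocr")
    if not _flag(env, "OCR_ENABLE_TESSERACT"):
        disabled.add("tesseract")

    raw_active = env.get("OCR_ACTIVE_ENGINES", ",".join(KNOWN_ENGINES))
    requested = [e.strip() for e in raw_active.split(",") if e.strip()]
    unknown = [e for e in requested if e not in KNOWN_ENGINES]
    if unknown:
        raise ValueError(f"unknown OCR engines: {unknown}")

    active = [e for e in requested if e not in disabled]
    default = env.get("OCR_DEFAULT_ENGINE", "doctr_plus").strip()
    if default not in active:
        raise ValueError(f"default engine {default!r} is not active: {active}")
    return default, active
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to exercise, for example `resolve_engines({"OCR_ENABLE_PADDLEOCR": "false"})`.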

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-02 05:37:16 +02:00
parent 74f7aefc26
commit 495790411f
75 changed files with 23349 additions and 1311 deletions


@@ -112,6 +112,40 @@ DATA_ENTRY_SQLITE_DATABASE_PATH=data/receipts/receipts.db
DATA_ENTRY_UPLOAD_PATH=data/receipts/uploads
DATA_ENTRY_MAX_UPLOAD_SIZE_MB=10
# ============================================================================
# OCR ENGINE CONFIGURATION
# ============================================================================
# Control which OCR engines are loaded at startup.
# Disabling engines saves memory but limits available OCR modes.
# Enable/disable PaddleOCR (set to 'false' to save ~800MB RAM)
# When disabled: 'paddleocr' engine unavailable
OCR_ENABLE_PADDLEOCR=true
# Enable/disable Tesseract (set to 'false' to save ~50MB RAM)
# When disabled: 'tesseract' engine unavailable
OCR_ENABLE_TESSERACT=true
# Default OCR engine when not specified in request
# Options: tesseract, doctr, doctr_plus, paddleocr
# Recommended: doctr_plus (2-tier sequential with early exit, ~7.5s avg)
OCR_DEFAULT_ENGINE=doctr_plus
# Active OCR engines shown in frontend dropdown (comma-separated)
# Options: tesseract, doctr, doctr_plus, paddleocr
# doctr_plus: 73.3% perfect, 7.5s avg, 65% fast path (recommended)
# doctr: 63.3% perfect, simpler and faster
OCR_ACTIVE_ENGINES=tesseract,doctr,doctr_plus,paddleocr
# OCR Worker Pool Configuration
# Number of parallel OCR workers (each loads ~1GB for docTR)
# Recommended: 2 for 8GB RAM, 3 for 16GB RAM
OCR_WORKERS=2
# Max tasks per worker before it is restarted (0 = never restart, which avoids the 40-60s warmup after each restart)
# Set to 0 for testing, 10-20 for production (periodic restarts limit the impact of memory leaks)
OCR_MAX_TASKS_PER_CHILD=0
# ============================================================================
# TELEGRAM MODULE - BOT CONFIGURATION (REQUIRED for Telegram features)
# ============================================================================