# OCR Data Extraction Validation System - Summary

**Spec Location:** `/mnt/e/proiecte/roa2web/.auto-build/specs/bon-ocr-validation/spec.md`
**Created:** 2025-12-30
**Complexity:** High (2-3 days)
**Priority:** Critical (P0 - Production Bug)

---

## Problem

Production OCR extracts wrong values because Heavy preprocessing causes digit concatenation on clear PDFs:

- **Light OCR (98%):** 85.99 LEI ✅
- **Heavy OCR (88%):** 859,762.16 LEI ❌ (10,000x error!)
- **Final Result:** 859,762.16 LEI ❌ (wrong source chosen)

---

## Solution

### 4-Layer Validation System

1. **Absolute Sanity Checks**
   - Amount: 0.01 - 100,000 RON
   - Date: not future, not older than 10 years
   - CUI: 6-10 digits + Mod 11 checksum
2. **Cross-Field Validation**
   - TVA: 5-24% of TOTAL
   - CARD + NUMERAR = TOTAL (±0.02)
   - Σ(TVA entries) = TVA TOTAL (±0.02)
3. **Inter-OCR Consistency**
   - Flag if values differ >10x
   - Prefer validation-passing values
4. **Auto-Correction**
   - Use payment sum if TOTAL wrong
   - Recalculate TOTAL from TVA if needed

### Replace Heavy with Medium OCR

- **Remove:** Heavy preprocessing (causes digit concatenation)
- **Add:** Medium preprocessing (moderate enhancements, no binarization)
- **Keep:** Light (step 1), Tesseract (step 3)

### Enhanced CUI Extraction

- Romanian CIF Mod 11 checksum validation
- OCR-tolerant patterns (spaces, C1F errors)
- Format normalization (always add RO prefix)

---

## Key Requirements

- ✅ **Non-blocking warnings** - Allow save with warnings
- ✅ **Manual review flag** - `needs_manual_review=TRUE` when confidence < 85%
- ✅ **Cross-validation** - Payment sum & TVA sum checks
- ✅ **Apply to new uploads only** - No reprocessing

---

## Critical Files (11 total)

### Files to CREATE (3)

1. **`backend/modules/data_entry/services/ocr/validation.py`** (~400 lines)
   - `ValidationRule` base class
   - `AmountRangeRule`, `TVARatioRule`, `PaymentSumRule`, `CUIChecksumRule`
   - `OCRValidationEngine` orchestrator
2. **`backend/modules/data_entry/tests/test_ocr_validation.py`** (~300 lines)
   - Unit tests for validation rules (>90% coverage)
   - 20+ test cases
3. **`backend/modules/data_entry/tests/test_ocr_validation_integration.py`** (~200 lines)
   - Integration tests with real receipts
   - Five-Holding production case test

### Files to MODIFY (6)

1. **`backend/modules/data_entry/services/ocr_service.py`** (~200 lines modified)
   - Replace `_merge_extractions()` with validation-aware logic
   - Replace Heavy with Medium OCR (line ~130)
   - Add validation engine call (line ~204)
2. **`backend/modules/data_entry/services/ocr_extractor.py`** (~80 lines modified)
   - Add validation fields to `ExtractionResult` dataclass
   - Fix CLIENT CUI patterns (OCR-tolerant)
   - Add CUI normalization & Mod 11 checksum validation
3. **`backend/modules/data_entry/services/image_preprocessor.py`** (~80 lines added)
   - Add `preprocess_medium()` method
   - Mark `preprocess_heavy()` as deprecated
4. **`backend/modules/data_entry/routers/ocr.py`** (~40 lines modified)
   - Update response with validation warnings
   - Add `needs_manual_review` flag
5. **`backend/modules/data_entry/schemas/ocr.py`** (~20 lines added)
   - Add `ValidationWarning` schema
   - Add validation fields to `ExtractionData`
6. **`backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py`** (~30 lines)
   - Add `needs_manual_review` column (nullable BOOLEAN)

### Frontend Files (2 - optional for Phase 1)

1. **`src/modules/data-entry/views/receipts/ReceiptCreateView.vue`**
   - Display validation warnings section
   - Show manual review badge
2. **`src/modules/data-entry/components/ocr/OCRPreview.vue`**
   - Show inter-OCR consistency warning

---

## Acceptance Criteria

### Critical (Must Pass)

- ✅ **AC-1:** Five-Holding receipt extracts 85.99 (NOT 859,762.16)
- ✅ **AC-2:** Save button works with warnings (not blocked)
- ✅ **AC-3:** CARD + NUMERAR = TOTAL validation
- ✅ **AC-4:** Σ(TVA entries) = TVA TOTAL validation
- ✅ **AC-5:** CUI Mod 11 checksum validation

### Test Coverage

- **Unit tests:** 20+ test cases, >90% coverage
- **Integration tests:** 10+ real receipt tests
- **Manual testing:** 6 scenarios (Five-Holding, faded receipt, payment methods, etc.)

---

## Implementation Priority

### Day 1: Core Validation

1. Create `ocr/validation.py` module
2. Implement 7 validation rules
3. Write unit tests
4. ✅ Checkpoint: All unit tests pass

### Day 2: OCR Integration

1. Add `preprocess_medium()` method
2. Update `_merge_extractions()` with validation
3. Update API schemas
4. Add database migration
5. ✅ Checkpoint: Five-Holding receipt works

### Day 3: Testing & Polish

1. Write integration tests
2. Update frontend components
3. Manual testing
4. Bug fixes
5. ✅ Checkpoint: Production-ready

---

## Risks & Mitigations

| Risk | Mitigation |
|------|------------|
| Medium OCR still causes errors | Tesseract fallback + validation catches issues |
| CUI validation too strict | Warning only (not error), allow override |
| Performance impact | Validation <10ms (negligible vs. OCR time) |
| Breaking API changes | Add new fields, keep existing unchanged |

---

## Tech Stack Integration

### Backend Patterns (CLAUDE.md compliant)

- ✅ SQLModel + Alembic migrations
- ✅ Pydantic v2 schemas
- ✅ Service layer pattern (logic in services, not routers)
- ✅ Type hints + docstrings

### Frontend Patterns (CLAUDE.md compliant)

- ✅ Vue 3 Composition API
- ✅ PrimeVue components
- ✅ Shared CSS patterns (`.roa-card`, `.roa-metric`)
- ✅ No `:deep()` selectors

### Testing Patterns

- ✅ pytest for backend
- ✅ >90% coverage target
- ✅ Integration tests with real data

---

## Next Steps

1. **Review specification** → `/mnt/e/proiecte/roa2web/.auto-build/specs/bon-ocr-validation/spec.md`
2. **Create feature branch** → `feature/bon-ocr-validation`
3. **Implement Phase 1** → Validation engine + tests (Day 1)
4. **Implement Phase 2** → OCR integration (Day 2)
5. **Implement Phase 3** → Frontend + testing (Day 3)
6. **Deploy to staging** → Test with production receipts
7. **Monitor for 1 week** → Verify no regressions
8. **Deploy to production** → Roll out gradually

---

**Estimated Completion:** 2026-01-02 (3 working days)
**Status:** Ready for Implementation
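
---

## Appendix: Validation Sketch

The checks listed under "4-Layer Validation System" and "Enhanced CUI Extraction" can be sketched roughly as below. This is a minimal illustration, not the planned `validation.py` module: the function names, the use of `Decimal`, and the return shape of `pick_total` are assumptions made for this example, and the sample CUI values were chosen only to exercise the checksum. The bounds, tolerance, and 10x threshold come from this summary; the weight string `753217532` is the standard key for the Romanian CIF check digit. Requires Python 3.9+.

```python
from decimal import Decimal
from typing import Optional

# Bounds, tolerance and threshold taken from the summary.
AMOUNT_MIN = Decimal("0.01")
AMOUNT_MAX = Decimal("100000")
SUM_TOLERANCE = Decimal("0.02")
CUI_KEY = "753217532"  # weights of the Romanian CIF check-digit scheme


def amount_in_range(amount: Decimal) -> bool:
    """Layer 1: absolute sanity check on a monetary amount (RON)."""
    return AMOUNT_MIN <= amount <= AMOUNT_MAX


def payments_match_total(card: Decimal, cash: Decimal, total: Decimal) -> bool:
    """Layer 2: CARD + NUMERAR must equal TOTAL within the tolerance."""
    return abs(card + cash - total) <= SUM_TOLERANCE


def differs_over_10x(a: Decimal, b: Decimal) -> bool:
    """Layer 3: flag two OCR readings that differ by more than 10x."""
    low, high = sorted((abs(a), abs(b)))
    if low == 0:
        return high != 0
    return high / low > 10


def cui_checksum_ok(cui: str) -> bool:
    """Validate a CUI's check digit, tolerating spaces and an RO prefix."""
    digits = cui.upper().replace(" ", "").removeprefix("RO")
    if not (digits.isascii() and digits.isdigit()) or not 6 <= len(digits) <= 10:
        return False
    # Left-pad the body (everything but the check digit) to nine digits.
    body, check = digits[:-1].zfill(9), int(digits[-1])
    weighted = sum(int(d) * int(k) for d, k in zip(body, CUI_KEY))
    # (sum * 10) mod 11, where a remainder of 10 maps to check digit 0.
    return (weighted * 10) % 11 % 10 == check


def pick_total(light: Decimal, heavy: Decimal) -> tuple[Optional[Decimal], bool]:
    """Hypothetical merge step: return (chosen value, inconsistency flag).

    Prefers the first candidate that passes the range check and reports
    whether the two OCR readings disagree by more than 10x.
    """
    valid = [v for v in (light, heavy) if amount_in_range(v)]
    chosen = valid[0] if valid else None
    return chosen, differs_over_10x(light, heavy)


if __name__ == "__main__":
    # The production case from the Problem section.
    print(pick_total(Decimal("85.99"), Decimal("859762.16")))
    print(cui_checksum_ok("RO 14399840"))
```

With the Five-Holding values, the Heavy reading fails the range check, so the Light reading is chosen and the pair is flagged as inconsistent; the flag is a warning only, in line with the non-blocking requirement.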