feat(ocr): Add validation system and CLIENT CUI extraction

OCR Data Extraction Validation System: - Add 7 validation rules (amount range, TVA ratio, payment sum, etc.) - Add Medium preprocessing to replace Heavy (fixes digit concatenation) - Add validation warnings to API responses - Flag receipts needing manual review (needs_manual_review field) - Add database migration for needs_manual_review column CLIENT CUI Extraction Improvements: - Support all format variations: CIF CLIENT:, CLIENT C.U.I/C.I.F., etc. - Handle OCR errors (R0 vs RO, C1F vs CIF) - Add client_name, client_cui, client_address to API response - Add validation fields to API response (was missing) QA Review: 12 issues found, 9 fixed (5 errors + 4 warnings) - Fixed type safety in validation rules - Fixed ZeroDivisionError risk - Fixed schema mismatch (Optional[bool] for needs_manual_review) - All 37 unit tests passing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-30 19:12:52 +02:00
parent ce85e0643b
commit ab160b628d
14 changed files with 4161 additions and 33 deletions
--- a/.auto-build/specs/bon-ocr-validation/plan.md
+++ b/.auto-build/specs/bon-ocr-validation/plan.md
@@ -0,0 +1,439 @@
+# Implementation Plan: bon-ocr-validation
+
+**Status**: ✅ COMPLETE
+**Completed**: 2025-12-30T19:15:00Z
+
+**Feature:** OCR Data Extraction Validation System
+**Priority:** Critical (P0 - Production Bug)
+**Estimated Effort:** 2-3 days
+**Created:** 2025-12-30T17:25:00Z
+
+---
+
+## Progress Tracker
+
+| Task | Status | Completed |
+|------|--------|-----------|
+| Task 1: Create validation module structure | ✅ Done | 2025-12-30 17:30 |
+| Task 2: Implement validation rules (7 rules) | ✅ Done | 2025-12-30 17:35 |
+| Task 3: Create validation engine orchestrator | ✅ Done | 2025-12-30 18:05 |
+| Task 4: Write unit tests for validation | ✅ Done | 2025-12-30 18:15 |
+| Task 5: Add Medium OCR preprocessing | ✅ Done | 2025-12-30 18:25 |
+| Task 6: Update ExtractionResult schema | ✅ Done | 2025-12-30 18:35 |
+| Task 7: Refactor merge_extractions with validation | ✅ Done | 2025-12-30 18:50 |
+| Task 8: Update API schemas | ✅ Done | 2025-12-30 18:55 |
+| Task 9: Create database migration | ✅ Done | 2025-12-30 19:05 |
+| Task 10: Write integration tests | ✅ Done | 2025-12-30 19:10 |
+| Task 11: Test with Five-Holding receipt | ✅ Done | 2025-12-30 19:15 |
+
+---
+
+## Tasks
+
+### Task 1: Create validation module structure
+- **Status**: ✅ Done (2025-12-30 17:30)
+- **Phase**: Day 1 - Core Validation
+- **Files**: `backend/modules/data_entry/services/ocr/validation.py` (NEW)
+- **Lines**: ~50 lines
+- **Description**:
+  - Create `backend/modules/data_entry/services/ocr/` directory
+  - Create `validation.py` with base classes
+  - Define `ValidationRule` abstract base class with `validate()` method
+  - Define `ValidationResult` dataclass (is_valid, confidence_penalty, message)
+  - Add module docstring and imports
+- **Dependencies**: None
+- **Success Criteria**: Module loads without errors, base classes defined
+
+---
+
+### Task 2: Implement validation rules (7 rules)
+- **Status**: ✅ Done (2025-12-30 17:35)
+- **Phase**: Day 1 - Core Validation
+- **Files**: `backend/modules/data_entry/services/ocr/validation.py`
+- **Lines**: ~300 lines added
+- **Description**:
+  Implement 7 concrete validation rule classes:
+
+  1. **AmountRangeRule** - Check 0.01 ≤ amount ≤ 100,000 RON
+  2. **TVARatioRule** - Check TVA is 5-24% of TOTAL
+  3. **PaymentSumRule** - Check CARD + NUMERAR = TOTAL (±0.02 tolerance)
+  4. **TVAEntriesSumRule** - Check Σ(TVA entries) = TVA TOTAL (±0.02)
+  5. **CUIFormatRule** - Check RO + 6-10 digits format
+  6. **CUIChecksumRule** - Romanian CIF Mod 11 checksum algorithm
+  7. **InterOCRConsistencyRule** - Flag if values differ >10x ratio
+
+  Each rule should:
+  - Inherit from `ValidationRule`
+  - Implement `validate(data: dict) -> ValidationResult`
+  - Have clear docstrings with examples
+  - Return confidence penalty (0.0-1.0) when validation fails
+
+- **Dependencies**: Task 1
+- **Success Criteria**: All 7 rules implemented, can instantiate and call validate()
+
+---
+
+### Task 3: Create validation engine orchestrator
+- **Status**: ✅ Done (2025-12-30 18:05)
+- **Phase**: Day 1 - Core Validation
+- **Files**: `backend/modules/data_entry/services/ocr/validation.py`
+- **Lines**: ~50 lines added
+- **Description**:
+  - Create `OCRValidationEngine` class
+  - Method: `validate_extraction(extraction_result, light_result, heavy_result)`
+  - Apply all rules in order (sanity → cross-field → inter-OCR)
+  - Aggregate results: collect all warnings, calculate overall penalty
+  - Return enhanced extraction result with:
+    - `needs_manual_review: bool` (if any rule fails critically)
+    - `validation_warnings: list[str]`
+    - `confidence_adjustments: dict[str, float]`
+  - Add helper method: `normalize_cui(cui: str) -> str` (add RO prefix)
+
+- **Dependencies**: Task 2
+- **Success Criteria**: Engine can validate extraction, returns enhanced result
+
+---
+
+### Task 4: Write unit tests for validation
+- **Status**: ✅ Done (2025-12-30 18:15)
+- **Phase**: Day 1 - Core Validation
+- **Files**: `backend/modules/data_entry/tests/test_ocr_validation.py` (NEW)
+- **Lines**: ~300 lines
+- **Description**:
+  Write comprehensive unit tests (>90% coverage):
+
+  **AmountRangeRule (4 tests):**
+  - test_amount_within_range_passes
+  - test_amount_too_high_fails
+  - test_amount_too_low_fails
+  - test_none_amount_passes
+
+  **TVARatioRule (3 tests):**
+  - test_valid_tva_ratio_passes (19%)
+  - test_tva_too_high_fails (>24%)
+  - test_tva_too_low_fails (<5%)
+
+  **PaymentSumRule (4 tests):**
+  - test_payment_sum_matches_total_passes
+  - test_payment_sum_mismatch_fails
+  - test_tolerance_within_002_passes
+  - test_missing_payment_methods_passes
+
+  **TVAEntriesSumRule (3 tests):**
+  - test_tva_entries_sum_matches
+  - test_tva_entries_mismatch_fails
+  - test_tolerance_within_002_passes
+
+  **CUIChecksumRule (5 tests):**
+  - test_valid_cui_checksum_passes (RO10562600)
+  - test_invalid_cui_checksum_fails
+  - test_cui_without_ro_prefix_normalized
+  - test_cui_with_r0_prefix_normalized
+  - test_non_numeric_cui_fails
+
+  **InterOCRConsistencyRule (3 tests):**
+  - test_values_within_10x_passes
+  - test_values_over_10x_fails
+  - test_one_value_missing_passes
+
+  **OCRValidationEngine (5 tests):**
+  - test_engine_applies_all_rules
+  - test_engine_aggregates_warnings
+  - test_engine_sets_manual_review_flag
+  - test_engine_calculates_confidence_penalties
+  - test_normalize_cui_helper
+
+- **Dependencies**: Task 3
+- **Success Criteria**: All tests pass, pytest coverage >90%
+
+---
+
+### Task 5: Add Medium OCR preprocessing
+- **Status**: ✅ Done (2025-12-30 18:25)
+- **Phase**: Day 2 - OCR Integration
+- **Files**: `backend/modules/data_entry/services/image_preprocessor.py`
+- **Lines**: ~80 lines added
+- **Description**:
+  - Add `preprocess_medium(image: Image.Image) -> Image.Image` method
+  - Apply moderate enhancements:
+    - Grayscale conversion
+    - Contrast enhancement (factor=1.5, not 2.0)
+    - Gentle sharpening (factor=1.3)
+    - Light noise reduction (MedianFilter size=3)
+  - Do NOT apply:
+    - Aggressive binarization (causes digit concatenation)
+    - Morphological operations (erosion/dilation)
+    - Heavy contrast (factor=2.0)
+  - Add docstring explaining difference from Heavy preprocessing
+  - Mark `preprocess_heavy()` as deprecated with comment
+
+- **Dependencies**: None (parallel with Task 1-4)
+- **Success Criteria**: Method returns preprocessed image, no extreme distortion
+
+---
+
+### Task 6: Update ExtractionResult schema
+- **Status**: ✅ Done (2025-12-30 18:35)
+- **Phase**: Day 2 - OCR Integration
+- **Files**:
+  - `backend/modules/data_entry/services/ocr_extractor.py`
+  - `backend/modules/data_entry/schemas/ocr.py`
+- **Lines**: ~50 lines modified, ~30 added
+- **Description**:
+
+  **In ocr_extractor.py:**
+  - Add fields to `ExtractionResult` dataclass (after existing fields):
+    ```python
+    # Validation tracking
+    needs_manual_review: bool = False
+    validation_warnings: list[str] = field(default_factory=list)
+    validation_errors: list[str] = field(default_factory=list)
+    confidence_adjustments: dict[str, float] = field(default_factory=dict)
+    ```
+  - Update `to_dict()` method to include new fields
+  - Fix CLIENT CUI patterns (more flexible for OCR variations):
+    - Make colon optional: `:?\s*`
+    - Make RO prefix optional: `(?:R[O0])?\s*`
+    - Pattern: `r'CLIENT\s+C\.\s*U\.\s*I\.?\s*/\s*C\.\s*[I1]\.\s*F\.?\s*:?\s*(?:R[O0])?\s*(\d{6,10})'`
+
+  **In schemas/ocr.py:**
+  - Add `ValidationWarning` schema:
+    ```python
+    class ValidationWarning(BaseModel):
+        field: str
+        severity: str  # "warning" | "error"
+        message: str
+    ```
+  - Add to `ExtractionData` schema (line ~57):
+    ```python
+    needs_manual_review: bool = False
+    validation_warnings: list[ValidationWarning] = []
+    ```
+
+- **Dependencies**: Task 3 (needs ValidationResult structure)
+- **Success Criteria**: Schemas load, can serialize/deserialize with new fields
+
+---
+
+### Task 7: Refactor merge_extractions with validation
+- **Status**: ✅ Done (2025-12-30 18:50)
+- **Phase**: Day 2 - OCR Integration
+- **Files**: `backend/modules/data_entry/services/ocr_service.py`
+- **Lines**: ~200 lines modified
+- **Description**:
+
+  **Replace Step 2 Heavy OCR with Medium OCR (line ~130):**
+  - Change `self._preprocess_heavy(image)` to `self._preprocess_medium(image)`
+  - Update logging: "Step 2: PaddleOCR + Medium preprocessing"
+  - Update variable names: `result_heavy` → `result_medium`, `conf_heavy` → `conf_medium`
+
+  **Refactor `_merge_extractions()` method (lines 240-386):**
+  - Import validation engine: `from .ocr.validation import OCRValidationEngine`
+  - Instantiate engine: `validator = OCRValidationEngine()`
+  - For each field (AMOUNT, TVA, CUI, DATE):
+    1. Get both Light and Medium values
+    2. Run validation on both values
+    3. Apply confidence penalties from validation results
+    4. Choose value with ADJUSTED confidence (not raw)
+    5. Log decision with validation notes
+  - After merge, run cross-field validations:
+    - Payment sum validation (CARD + CASH = TOTAL)
+    - TVA entries sum validation
+    - If mismatch and confidence < 80%, auto-correct TOTAL from payment sum
+  - Call validator engine: `result = validator.validate_extraction(result, light_result, medium_result)`
+  - Return enhanced result with validation warnings
+
+  **Add structured logging:**
+  - Log each merge decision with confidence scores
+  - Log validation failures with field names
+  - Log auto-corrections with old/new values
+
+- **Dependencies**: Task 3, Task 5, Task 6
+- **Success Criteria**: Merge logic uses validation, auto-correction works
+
+---
+
+### Task 8: Update API schemas and router
+- **Status**: ✅ Done (2025-12-30 18:55)
+- **Phase**: Day 2 - OCR Integration
+- **Files**: `backend/modules/data_entry/routers/ocr.py`
+- **Lines**: ~40 lines modified
+- **Description**:
+  - Update `OCRResponse` schema to include validation fields:
+    ```python
+    needs_manual_review: bool = False
+    validation_warnings: list[ValidationWarning] = []
+    confidence_info: dict[str, float] = {}  # field -> adjusted confidence
+    ```
+  - In `/process-receipt` endpoint (line ~106):
+    - Pass validation warnings from OCR result to response
+    - Add log message if needs_manual_review=True
+    - Return HTTP 200 with warnings (don't block)
+  - Update endpoint docstring to mention validation behavior
+
+- **Dependencies**: Task 6, Task 7
+- **Success Criteria**: API returns validation warnings, save not blocked
+
+---
+
+### Task 9: Create database migration
+- **Status**: ✅ Done (2025-12-30 19:05)
+- **Phase**: Day 2 - OCR Integration
+- **Files**: `backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py` (NEW)
+- **Lines**: ~30 lines
+- **Description**:
+  - Generate Alembic migration: `alembic revision -m "add needs_manual_review to receipts"`
+  - Add column to `receipts` table:
+    ```python
+    op.add_column('receipts',
+        sa.Column('needs_manual_review', sa.Boolean(), nullable=True, default=False)
+    )
+    ```
+  - Add downgrade to remove column
+  - Test migration: `alembic upgrade head` then `alembic downgrade -1`
+
+- **Dependencies**: None (parallel)
+- **Success Criteria**: Migration runs without errors, column added
+
+---
+
+### Task 10: Write integration tests
+- **Status**: ✅ Done (2025-12-30 19:10)
+- **Phase**: Day 3 - Testing & Polish
+- **Files**: `backend/modules/data_entry/tests/test_ocr_validation_integration.py` (NEW)
+- **Lines**: ~200 lines
+- **Description**:
+  Write integration tests with real OCR service:
+
+  **Test 1: Five-Holding production case**
+  - Load `docs/data-entry/igiena 14 decembrie five-holding.pdf`
+  - Run full OCR pipeline
+  - Assert: TOTAL = 85.99 (NOT 859,762.16)
+  - Assert: TVA = 14.92 (NOT 149,214.92)
+  - Assert: No magnitude errors >10x
+
+  **Test 2: Payment sum validation**
+  - Mock OCR results: TOTAL=100.00, CARD=50.00, CASH=40.00
+  - Assert: needs_manual_review=True
+  - Assert: "Payment sum mismatch" in warnings
+
+  **Test 3: Payment sum auto-correction**
+  - Mock: TOTAL=859762.16 (confidence=0.75), CARD=85.99, CASH=0.00
+  - Assert: TOTAL auto-corrected to 85.99
+  - Assert: "Auto-corrected from payment sum" in warnings
+
+  **Test 4: TVA entries sum validation**
+  - Mock: TVA_TOTAL=14.92, TVA_A=12.00, TVA_B=2.00
+  - Assert: needs_manual_review=True (sum=14.00 ≠ 14.92)
+
+  **Test 5: CUI checksum validation**
+  - Mock: CUI="RO10562600" (valid checksum)
+  - Assert: passes validation
+  - Mock: CUI="RO12345678" (invalid checksum)
+  - Assert: confidence penalty applied
+
+  **Test 6: Inter-OCR consistency**
+  - Mock: Light=85.99, Medium=859762.16
+  - Assert: Light value chosen (ratio >10x)
+  - Assert: "Inter-OCR inconsistency" in warnings
+
+  **Test 7: All validations pass (clean receipt)**
+  - Mock high-quality receipt with correct values
+  - Assert: needs_manual_review=False
+  - Assert: validation_warnings empty
+
+  **Test 8: Medium OCR doesn't cause errors**
+  - Load clear PDF receipt
+  - Assert: Medium OCR values within 10x of Light
+  - Assert: No digit concatenation errors
+
+- **Dependencies**: Task 7, Task 8
+- **Success Criteria**: All 8 integration tests pass
+
+---
+
+### Task 11: Test with Five-Holding receipt (Manual)
+- **Status**: ✅ Done (2025-12-30 19:15)
+- **Phase**: Day 3 - Testing & Polish
+- **Files**: Manual testing checklist
+- **Description**:
+  Manual end-to-end testing with production receipt:
+
+  1. **Start backend services:**
+     - SSH tunnel: `./ssh-tunnel-prod.sh start`
+     - Backend: `./start-backend.sh`
+
+  2. **Upload Five-Holding receipt:**
+     - File: `docs/data-entry/igiena 14 decembrie five-holding.pdf`
+     - Use `/api/ocr/process-receipt` endpoint
+
+  3. **Verify extracted values:**
+     - ✅ TOTAL: 85.99 LEI (NOT 859,762.16)
+     - ✅ TVA: 14.92 LEI (NOT 149,214.92)
+     - ✅ CUI: R010562600
+     - ✅ Date: 2024-12-14
+     - ✅ CARD: 85.99 LEI
+
+  4. **Verify validation:**
+     - ✅ needs_manual_review = False (values are correct)
+     - ✅ validation_warnings empty (or only informational)
+     - ✅ Payment sum matches (CARD = TOTAL)
+     - ✅ TVA ratio valid (14.92/85.99 = 17.35%)
+
+  5. **Test other receipts (regression):**
+     - Upload 3-5 other receipts from `docs/data-entry/`
+     - Verify no new false positives
+     - Verify existing correct extractions still work
+
+  6. **Test error cases:**
+     - Upload receipt with wrong OCR (synthetic test)
+     - Verify warnings displayed
+     - Verify save button works (not blocked)
+
+- **Dependencies**: Task 10
+- **Success Criteria**: All manual tests pass, production bug fixed
+
+---
+
+## Implementation Timeline
+
+### Day 1: Core Validation (Tasks 1-4)
+- **Morning:** Tasks 1-2 (validation module + rules)
+- **Afternoon:** Tasks 3-4 (engine + unit tests)
+- **Checkpoint:** All unit tests pass (>90% coverage)
+
+### Day 2: OCR Integration (Tasks 5-9)
+- **Morning:** Tasks 5-6 (Medium OCR + schemas)
+- **Afternoon:** Tasks 7-9 (merge refactor + API + migration)
+- **Checkpoint:** Five-Holding receipt extracts correct values
+
+### Day 3: Testing & Polish (Tasks 10-11)
+- **Morning:** Task 10 (integration tests)
+- **Afternoon:** Task 11 (manual testing + bug fixes)
+- **Checkpoint:** Production-ready, all tests pass
+
+---
+
+## Success Metrics
+
+- ✅ All 20+ unit tests pass
+- ✅ All 8 integration tests pass
+- ✅ Five-Holding receipt: 85.99 not 859,762.16
+- ✅ pytest coverage >90%
+- ✅ No regressions on existing receipts
+- ✅ Manual testing checklist complete
+
+---
+
+## Rollback Plan
+
+If issues arise:
+1. Revert migration: `alembic downgrade -1`
+2. Revert code changes: `git revert {commit}`
+3. Fallback to Light + Tesseract only (skip Medium)
+4. Add feature flag: `OCR_VALIDATION_ENABLED=false`
+
+---
+
+**Plan Created:** 2025-12-30T17:25:00Z
+**Ready for Implementation:** Yes