roa2web-service-auto/.auto-build/specs/bon-ocr-validation/plan.md
Marius Mutu ab160b628d feat(ocr): Add validation system and CLIENT CUI extraction
OCR Data Extraction Validation System:
- Add 7 validation rules (amount range, TVA ratio, payment sum, etc.)
- Add Medium preprocessing to replace Heavy (fixes digit concatenation)
- Add validation warnings to API responses
- Flag receipts needing manual review (needs_manual_review field)
- Add database migration for needs_manual_review column

CLIENT CUI Extraction Improvements:
- Support all format variations: CIF CLIENT:, CLIENT C.U.I/C.I.F., etc.
- Handle OCR errors (R0 vs RO, C1F vs CIF)
- Add client_name, client_cui, client_address to API response
- Add validation fields to API response (was missing)

QA Review: 12 issues found, 9 fixed (5 errors + 4 warnings)
- Fixed type safety in validation rules
- Fixed ZeroDivisionError risk
- Fixed schema mismatch (Optional[bool] for needs_manual_review)
- All 37 unit tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-30 19:12:52 +02:00


Implementation Plan: bon-ocr-validation

Status: COMPLETE
Completed: 2025-12-30T19:15:00Z

Feature: OCR Data Extraction Validation System
Priority: Critical (P0 - Production Bug)
Estimated Effort: 2-3 days
Created: 2025-12-30T17:25:00Z


Progress Tracker

| Task | Status | Completed |
|------|--------|-----------|
| Task 1: Create validation module structure | Done | 2025-12-30 17:30 |
| Task 2: Implement validation rules (7 rules) | Done | 2025-12-30 17:35 |
| Task 3: Create validation engine orchestrator | Done | 2025-12-30 18:05 |
| Task 4: Write unit tests for validation | Done | 2025-12-30 18:15 |
| Task 5: Add Medium OCR preprocessing | Done | 2025-12-30 18:25 |
| Task 6: Update ExtractionResult schema | Done | 2025-12-30 18:35 |
| Task 7: Refactor merge_extractions with validation | Done | 2025-12-30 18:50 |
| Task 8: Update API schemas | Done | 2025-12-30 18:55 |
| Task 9: Create database migration | Done | 2025-12-30 19:05 |
| Task 10: Write integration tests | Done | 2025-12-30 19:10 |
| Task 11: Test with Five-Holding receipt | Done | 2025-12-30 19:15 |

Tasks

Task 1: Create validation module structure

  • Status: Done (2025-12-30 17:30)
  • Phase: Day 1 - Core Validation
  • Files: backend/modules/data_entry/services/ocr/validation.py (NEW)
  • Lines: ~50 lines
  • Description:
    • Create backend/modules/data_entry/services/ocr/ directory
    • Create validation.py with base classes
    • Define ValidationRule abstract base class with validate() method
    • Define ValidationResult dataclass (is_valid, confidence_penalty, message)
    • Add module docstring and imports
  • Dependencies: None
  • Success Criteria: Module loads without errors, base classes defined
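
The base classes described in Task 1 could look roughly like the sketch below. It is a minimal illustration built only from the names given in the description (ValidationRule, ValidationResult, is_valid, confidence_penalty, message); the default values and the AlwaysPassRule placeholder are assumptions, not the module's actual contents.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ValidationResult:
    """Outcome of one rule, using the field names from the task description."""
    is_valid: bool
    confidence_penalty: float = 0.0  # assumed scale: 0.0 = no penalty, 1.0 = full
    message: str = ""


class ValidationRule(ABC):
    """Base class that every concrete rule inherits from."""

    @abstractmethod
    def validate(self, data: dict) -> ValidationResult:
        """Inspect extracted receipt data and report the outcome."""


class AlwaysPassRule(ValidationRule):
    # Hypothetical placeholder rule, present only to show the contract.
    def validate(self, data: dict) -> ValidationResult:
        return ValidationResult(is_valid=True)
```

Because validate() is abstract, instantiating ValidationRule directly raises TypeError, which catches rules that forget to implement it.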

Task 2: Implement validation rules (7 rules)

  • Status: Done (2025-12-30 17:35)

  • Phase: Day 1 - Core Validation

  • Files: backend/modules/data_entry/services/ocr/validation.py

  • Lines: ~300 lines added

  • Description: Implement 7 concrete validation rule classes:

    1. AmountRangeRule - Check 0.01 ≤ amount ≤ 100,000 RON
    2. TVARatioRule - Check TVA is 5-24% of TOTAL
    3. PaymentSumRule - Check CARD + NUMERAR = TOTAL (±0.02 tolerance)
    4. TVAEntriesSumRule - Check Σ(TVA entries) = TVA TOTAL (±0.02)
    5. CUIFormatRule - Check RO + 6-10 digits format
    6. CUIChecksumRule - Romanian CIF Mod 11 checksum algorithm
    7. InterOCRConsistencyRule - Flag if values differ >10x ratio

    Each rule should:

    • Inherit from ValidationRule
    • Implement validate(data: dict) -> ValidationResult
    • Have clear docstrings with examples
    • Return confidence penalty (0.0-1.0) when validation fails
  • Dependencies: Task 1

  • Success Criteria: All 7 rules implemented, can instantiate and call validate()
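
As an illustration of rules 1 to 3, here is a hedged sketch of AmountRangeRule, TVARatioRule and PaymentSumRule. The thresholds come from the list above; the dictionary keys ("total", "tva", "card", "numerar") and the penalty values are assumptions chosen for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ValidationResult:
    is_valid: bool
    confidence_penalty: float = 0.0
    message: str = ""


class ValidationRule(ABC):
    @abstractmethod
    def validate(self, data: dict) -> ValidationResult: ...


class AmountRangeRule(ValidationRule):
    """Check 0.01 <= amount <= 100,000 RON; a missing amount passes."""
    MIN_AMOUNT, MAX_AMOUNT = 0.01, 100_000.0

    def validate(self, data: dict) -> ValidationResult:
        amount = data.get("total")  # key name is an assumption
        if amount is None or self.MIN_AMOUNT <= amount <= self.MAX_AMOUNT:
            return ValidationResult(True)
        return ValidationResult(False, 0.5, f"Amount {amount} outside plausible range")


class TVARatioRule(ValidationRule):
    """Check that TVA is 5-24% of TOTAL."""

    def validate(self, data: dict) -> ValidationResult:
        total, tva = data.get("total"), data.get("tva")
        if not total or tva is None:  # also avoids dividing by zero
            return ValidationResult(True)
        ratio = tva / total
        if 0.05 <= ratio <= 0.24:
            return ValidationResult(True)
        return ValidationResult(False, 0.3, f"TVA ratio {ratio:.2%} outside 5-24%")


class PaymentSumRule(ValidationRule):
    """Check CARD + NUMERAR = TOTAL within a 0.02 tolerance."""
    TOLERANCE = 0.02

    def validate(self, data: dict) -> ValidationResult:
        total = data.get("total")
        card, cash = data.get("card"), data.get("numerar")
        if total is None or (card is None and cash is None):
            return ValidationResult(True)  # nothing to compare against
        paid = (card or 0.0) + (cash or 0.0)
        # Small epsilon so float rounding does not reject an exact 0.02 gap.
        if abs(paid - total) <= self.TOLERANCE + 1e-9:
            return ValidationResult(True)
        return ValidationResult(False, 0.4, f"Payment sum mismatch: {paid:.2f} vs {total:.2f}")
```

Treating missing values as a pass mirrors the test names in Task 4 (test_none_amount_passes, test_missing_payment_methods_passes): a rule only penalizes data it can actually check.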


Task 3: Create validation engine orchestrator

  • Status: Done (2025-12-30 18:05)

  • Phase: Day 1 - Core Validation

  • Files: backend/modules/data_entry/services/ocr/validation.py

  • Lines: ~50 lines added

  • Description:

    • Create OCRValidationEngine class
    • Method: validate_extraction(extraction_result, light_result, heavy_result)
    • Apply all rules in order (sanity → cross-field → inter-OCR)
    • Aggregate results: collect all warnings, calculate overall penalty
    • Return enhanced extraction result with:
      • needs_manual_review: bool (if any rule fails critically)
      • validation_warnings: list[str]
      • confidence_adjustments: dict[str, float]
    • Add helper method: normalize_cui(cui: str) -> str (add RO prefix)
  • Dependencies: Task 2

  • Success Criteria: Engine can validate extraction, returns enhanced result
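
The aggregation step can be sketched as follows. This is a simplified illustration, not the engine itself: it validates a single dict rather than the three-argument signature above, and the EngineOutput container, the (field, rule) pairing and the CRITICAL_PENALTY threshold are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    is_valid: bool
    confidence_penalty: float = 0.0
    message: str = ""


@dataclass
class EngineOutput:
    # Hypothetical container for the three fields listed in the task.
    needs_manual_review: bool = False
    validation_warnings: list[str] = field(default_factory=list)
    confidence_adjustments: dict[str, float] = field(default_factory=dict)


class OCRValidationEngine:
    # Assumption: a single penalty at or above this counts as a critical failure.
    CRITICAL_PENALTY = 0.4

    def __init__(self, rules):
        # rules: list of (field_name, rule) pairs, already ordered
        # sanity -> cross-field -> inter-OCR.
        self.rules = rules

    @staticmethod
    def normalize_cui(cui: str) -> str:
        """Strip whitespace, repair an OCR'd 'R0' prefix, add 'RO' if missing."""
        cleaned = "".join(cui.split()).upper()
        if cleaned.startswith("R0"):
            cleaned = "RO" + cleaned[2:]
        return cleaned if cleaned.startswith("RO") else "RO" + cleaned

    def validate_extraction(self, data: dict) -> EngineOutput:
        out = EngineOutput()
        for field_name, rule in self.rules:
            result = rule.validate(data)
            if result.is_valid:
                continue
            out.validation_warnings.append(result.message)
            # Accumulate penalties per field, capped at 1.0.
            previous = out.confidence_adjustments.get(field_name, 0.0)
            out.confidence_adjustments[field_name] = min(
                1.0, previous + result.confidence_penalty
            )
            if result.confidence_penalty >= self.CRITICAL_PENALTY:
                out.needs_manual_review = True
        return out


class TotalCeilingRule:
    # Hypothetical stand-in rule used only to exercise the engine.
    def validate(self, data: dict) -> ValidationResult:
        ok = data.get("total", 0) <= 100_000
        return ValidationResult(ok, 0.0 if ok else 0.5, "" if ok else "Amount out of range")


engine = OCRValidationEngine([("total", TotalCeilingRule())])
report = engine.validate_extraction({"total": 859762.16})
print(report.needs_manual_review, report.validation_warnings)
# prints: True ['Amount out of range']
```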


Task 4: Write unit tests for validation

  • Status: Done (2025-12-30 18:15)

  • Phase: Day 1 - Core Validation

  • Files: backend/modules/data_entry/tests/test_ocr_validation.py (NEW)

  • Lines: ~300 lines

  • Description: Write comprehensive unit tests (>90% coverage):

    AmountRangeRule (4 tests):

    • test_amount_within_range_passes
    • test_amount_too_high_fails
    • test_amount_too_low_fails
    • test_none_amount_passes

    TVARatioRule (3 tests):

    • test_valid_tva_ratio_passes (19%)
    • test_tva_too_high_fails (>24%)
    • test_tva_too_low_fails (<5%)

    PaymentSumRule (4 tests):

    • test_payment_sum_matches_total_passes
    • test_payment_sum_mismatch_fails
    • test_tolerance_within_002_passes
    • test_missing_payment_methods_passes

    TVAEntriesSumRule (3 tests):

    • test_tva_entries_sum_matches
    • test_tva_entries_mismatch_fails
    • test_tolerance_within_002_passes

    CUIChecksumRule (5 tests):

    • test_valid_cui_checksum_passes (RO10562600)
    • test_invalid_cui_checksum_fails
    • test_cui_without_ro_prefix_normalized
    • test_cui_with_r0_prefix_normalized
    • test_non_numeric_cui_fails

    InterOCRConsistencyRule (3 tests):

    • test_values_within_10x_passes
    • test_values_over_10x_fails
    • test_one_value_missing_passes

    OCRValidationEngine (5 tests):

    • test_engine_applies_all_rules
    • test_engine_aggregates_warnings
    • test_engine_sets_manual_review_flag
    • test_engine_calculates_confidence_penalties
    • test_normalize_cui_helper
  • Dependencies: Task 3

  • Success Criteria: All tests pass, pytest coverage >90%
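
To make the CUIChecksumRule test cases concrete, the sketch below pairs a standalone checksum function with pytest-style tests named after the list above. It uses the commonly documented Romanian CIF control key 753217532 (weighted sum, times 10, modulo 11, with 10 mapped to 0); the function names and the 2-10 digit length bound are assumptions, and the project's rule class may be structured differently.

```python
import re

# Weights of the Romanian CIF control key 753217532.
_CIF_KEY = (7, 5, 3, 2, 1, 7, 5, 3, 2)


def strip_cui_prefix(raw: str) -> str:
    """Drop whitespace and an optional RO/R0 prefix, leaving the digit part."""
    cleaned = "".join(raw.split()).upper()
    return re.sub(r"^R[O0]", "", cleaned)


def cui_checksum_ok(raw: str) -> bool:
    digits = strip_cui_prefix(raw)
    # Length bound is an assumption for this sketch.
    if not re.fullmatch(r"[0-9]{2,10}", digits):
        return False
    # Last digit is the control digit; the rest is left-padded to 9 digits.
    body, control = digits[:-1].zfill(9), int(digits[-1])
    weighted = sum(int(d) * w for d, w in zip(body, _CIF_KEY))
    expected = (weighted * 10) % 11
    return (0 if expected == 10 else expected) == control


def test_valid_cui_checksum_passes():
    assert cui_checksum_ok("RO10562600")


def test_invalid_cui_checksum_fails():
    assert not cui_checksum_ok("RO12345678")


def test_cui_without_ro_prefix_normalized():
    assert cui_checksum_ok("10562600")


def test_cui_with_r0_prefix_normalized():
    assert cui_checksum_ok("R010562600")


def test_non_numeric_cui_fails():
    assert not cui_checksum_ok("RO10A62600")
```

For RO10562600 the weighted sum of 001056260 is 78, and 780 mod 11 is 10, which maps to the control digit 0, so the CUI passes.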


Task 5: Add Medium OCR preprocessing

  • Status: Done (2025-12-30 18:25)

  • Phase: Day 2 - OCR Integration

  • Files: backend/modules/data_entry/services/image_preprocessor.py

  • Lines: ~80 lines added

  • Description:

    • Add preprocess_medium(image: Image.Image) -> Image.Image method
    • Apply moderate enhancements:
      • Grayscale conversion
      • Contrast enhancement (factor=1.5, not 2.0)
      • Gentle sharpening (factor=1.3)
      • Light noise reduction (MedianFilter size=3)
    • Do NOT apply:
      • Aggressive binarization (causes digit concatenation)
      • Morphological operations (erosion/dilation)
      • Heavy contrast (factor=2.0)
    • Add docstring explaining difference from Heavy preprocessing
    • Mark preprocess_heavy() as deprecated with comment
  • Dependencies: None (parallel with Tasks 1-4)

  • Success Criteria: Method returns preprocessed image, no extreme distortion
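
A minimal sketch of the Medium pipeline using Pillow, written as a free function rather than a method. The factors and filter size are the ones listed above; the order of the steps follows the bullet order and is an assumption, as is the absence of any resizing or deskewing.

```python
from PIL import Image, ImageEnhance, ImageFilter, ImageOps


def preprocess_medium(image: Image.Image) -> Image.Image:
    """Moderate cleanup: no binarization and no erosion/dilation."""
    gray = ImageOps.grayscale(image)                              # grayscale conversion
    contrasted = ImageEnhance.Contrast(gray).enhance(1.5)         # moderate contrast
    sharpened = ImageEnhance.Sharpness(contrasted).enhance(1.3)   # gentle sharpening
    return sharpened.filter(ImageFilter.MedianFilter(size=3))     # light noise reduction
```

The result stays a grayscale image of the same size, so thin gaps between digits are softened rather than thresholded away, which is the failure mode attributed to Heavy preprocessing.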


Task 6: Update ExtractionResult schema

  • Status: Done (2025-12-30 18:35)

  • Phase: Day 2 - OCR Integration

  • Files:

    • backend/modules/data_entry/services/ocr_extractor.py
    • backend/modules/data_entry/schemas/ocr.py
  • Lines: ~50 lines modified, ~30 added

  • Description:

    In ocr_extractor.py:

    • Add fields to ExtractionResult dataclass (after existing fields):
      # Validation tracking
      needs_manual_review: bool = False
      validation_warnings: list[str] = field(default_factory=list)
      validation_errors: list[str] = field(default_factory=list)
      confidence_adjustments: dict[str, float] = field(default_factory=dict)
      
    • Update to_dict() method to include new fields
    • Fix CLIENT CUI patterns (more flexible for OCR variations):
      • Make colon optional: :?\s*
      • Make RO prefix optional: (?:R[O0])?\s*
      • Pattern: r'CLIENT\s+C\.\s*U\.\s*I\.?\s*/\s*C\.\s*[I1]\.\s*F\.?\s*:?\s*(?:R[O0])?\s*(\d{6,10})'

    In schemas/ocr.py:

    • Add ValidationWarning schema:
      class ValidationWarning(BaseModel):
          field: str
          severity: str  # "warning" | "error"
          message: str
      
    • Add to ExtractionData schema (line ~57):
      needs_manual_review: bool = False
      validation_warnings: list[ValidationWarning] = []
      
  • Dependencies: Task 3 (needs ValidationResult structure)

  • Success Criteria: Schemas load, can serialize/deserialize with new fields
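
The CLIENT CUI pattern from Task 6 can be exercised on a few OCR variations as shown below. The pattern is copied verbatim from the description; the extract_client_cui helper and the sample strings are hypothetical and exist only for this illustration.

```python
import re

CLIENT_CUI_PATTERN = re.compile(
    r'CLIENT\s+C\.\s*U\.\s*I\.?\s*/\s*C\.\s*[I1]\.\s*F\.?\s*:?\s*(?:R[O0])?\s*(\d{6,10})'
)


def extract_client_cui(text: str):
    """Return the client CUI normalized to 'RO<digits>', or None if absent."""
    match = CLIENT_CUI_PATTERN.search(text)
    return f"RO{match.group(1)}" if match else None


samples = [
    "CLIENT C.U.I./C.I.F.: RO10562600",   # canonical form
    "CLIENT C.U.I/C.1.F. R0 10562600",    # OCR read I as 1 and O as 0, no colon
    "CLIENT C.U.I./C.I.F. 10562600",      # no RO prefix
]
for line in samples:
    print(extract_client_cui(line))
```

All three samples yield RO10562600, because the optional prefix group consumes either RO or R0 before the digits are captured.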


Task 7: Refactor merge_extractions with validation

  • Status: Done (2025-12-30 18:50)

  • Phase: Day 2 - OCR Integration

  • Files: backend/modules/data_entry/services/ocr_service.py

  • Lines: ~200 lines modified

  • Description:

    Replace Step 2 Heavy OCR with Medium OCR (line ~130):

    • Change self._preprocess_heavy(image) to self._preprocess_medium(image)
    • Update logging: "Step 2: PaddleOCR + Medium preprocessing"
    • Update variable names: result_heavy → result_medium, conf_heavy → conf_medium

    Refactor _merge_extractions() method (lines 240-386):

    • Import validation engine: from .ocr.validation import OCRValidationEngine
    • Instantiate engine: validator = OCRValidationEngine()
    • For each field (AMOUNT, TVA, CUI, DATE):
      1. Get both Light and Medium values
      2. Run validation on both values
      3. Apply confidence penalties from validation results
      4. Choose value with ADJUSTED confidence (not raw)
      5. Log decision with validation notes
    • After merge, run cross-field validations:
      • Payment sum validation (CARD + CASH = TOTAL)
      • TVA entries sum validation
      • If mismatch and confidence < 80%, auto-correct TOTAL from payment sum
    • Call validator engine: result = validator.validate_extraction(result, light_result, medium_result)
    • Return enhanced result with validation warnings

    Add structured logging:

    • Log each merge decision with confidence scores
    • Log validation failures with field names
    • Log auto-corrections with old/new values
  • Dependencies: Task 3, Task 5, Task 6

  • Success Criteria: Merge logic uses validation, auto-correction works
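
The two core decisions in the merge, choosing by adjusted confidence and auto-correcting TOTAL from the payment sum, can be sketched as below. The Candidate class, the multiplicative way a penalty reduces confidence, and the function names are assumptions; only the 80% threshold and the ±0.02 tolerance come from this plan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    # Hypothetical holder for one OCR pass's reading of a field.
    value: Optional[float]
    confidence: float      # raw OCR confidence, 0.0-1.0
    penalty: float = 0.0   # validation penalty, 0.0-1.0

    @property
    def adjusted(self) -> float:
        # Assumption: penalties scale confidence down multiplicatively.
        return self.confidence * (1.0 - self.penalty)


def pick(light: Candidate, medium: Candidate) -> Candidate:
    """Choose the reading with the higher ADJUSTED confidence."""
    if light.value is None:
        return medium
    if medium.value is None:
        return light
    return light if light.adjusted >= medium.adjusted else medium


def autocorrect_total(total: Candidate, card: float, cash: float,
                      tolerance: float = 0.02, threshold: float = 0.80):
    """Return (total, warnings); replace TOTAL with CARD + CASH only when
    they disagree and TOTAL's adjusted confidence is below the threshold."""
    paid = round(card + cash, 2)
    warnings = []
    if total.value is not None and abs(paid - total.value) <= tolerance + 1e-9:
        return total.value, warnings
    if total.adjusted < threshold and paid > 0:
        warnings.append(f"Auto-corrected from payment sum: {total.value} -> {paid}")
        return paid, warnings
    warnings.append(f"Payment sum mismatch: {paid} vs {total.value}")
    return total.value, warnings
```

With a Medium reading of 859762.16 at raw confidence 0.90 and a 0.5 range penalty, the adjusted score drops to 0.45, so a Light reading of 85.99 at 0.70 wins even though its raw confidence is lower.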


Task 8: Update API schemas and router

  • Status: Done (2025-12-30 18:55)

  • Phase: Day 2 - OCR Integration

  • Files: backend/modules/data_entry/routers/ocr.py

  • Lines: ~40 lines modified

  • Description:

    • Update OCRResponse schema to include validation fields:
      needs_manual_review: bool = False
      validation_warnings: list[ValidationWarning] = []
      confidence_info: dict[str, float] = {}  # field -> adjusted confidence
      
    • In /process-receipt endpoint (line ~106):
      • Pass validation warnings from OCR result to response
      • Add log message if needs_manual_review=True
      • Return HTTP 200 with warnings (don't block)
    • Update endpoint docstring to mention validation behavior
  • Dependencies: Task 6, Task 7

  • Success Criteria: API returns validation warnings, save not blocked


Task 9: Create database migration

  • Status: Done (2025-12-30 19:05)

  • Phase: Day 2 - OCR Integration

  • Files: backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py (NEW)

  • Lines: ~30 lines

  • Description:

    • Generate Alembic migration: alembic revision -m "add needs_manual_review to receipts"
    • Add column to receipts table:
      op.add_column('receipts',
          sa.Column('needs_manual_review', sa.Boolean(), nullable=True, default=False)
      )
      
    • Add downgrade to remove column
    • Test migration: alembic upgrade head then alembic downgrade -1
  • Dependencies: None (parallel)

  • Success Criteria: Migration runs without errors, column added


Task 10: Write integration tests

  • Status: Done (2025-12-30 19:10)

  • Phase: Day 3 - Testing & Polish

  • Files: backend/modules/data_entry/tests/test_ocr_validation_integration.py (NEW)

  • Lines: ~200 lines

  • Description: Write integration tests with real OCR service:

    Test 1: Five-Holding production case

    • Load docs/data-entry/igiena 14 decembrie five-holding.pdf
    • Run full OCR pipeline
    • Assert: TOTAL = 85.99 (NOT 859,762.16)
    • Assert: TVA = 14.92 (NOT 149,214.92)
    • Assert: No magnitude errors >10x

    Test 2: Payment sum validation

    • Mock OCR results: TOTAL=100.00, CARD=50.00, CASH=40.00
    • Assert: needs_manual_review=True
    • Assert: "Payment sum mismatch" in warnings

    Test 3: Payment sum auto-correction

    • Mock: TOTAL=859762.16 (confidence=0.75), CARD=85.99, CASH=0.00
    • Assert: TOTAL auto-corrected to 85.99
    • Assert: "Auto-corrected from payment sum" in warnings

    Test 4: TVA entries sum validation

    • Mock: TVA_TOTAL=14.92, TVA_A=12.00, TVA_B=2.00
    • Assert: needs_manual_review=True (sum=14.00 ≠ 14.92)

    Test 5: CUI checksum validation

    • Mock: CUI="RO10562600" (valid checksum)
    • Assert: passes validation
    • Mock: CUI="RO12345678" (invalid checksum)
    • Assert: confidence penalty applied

    Test 6: Inter-OCR consistency

    • Mock: Light=85.99, Medium=859762.16
    • Assert: Light value chosen (ratio >10x)
    • Assert: "Inter-OCR inconsistency" in warnings

    Test 7: All validations pass (clean receipt)

    • Mock high-quality receipt with correct values
    • Assert: needs_manual_review=False
    • Assert: validation_warnings empty

    Test 8: Medium OCR doesn't cause errors

    • Load clear PDF receipt
    • Assert: Medium OCR values within 10x of Light
    • Assert: No digit concatenation errors
  • Dependencies: Task 7, Task 8

  • Success Criteria: All 8 integration tests pass
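
The assertions behind Tests 4 and 6 reduce to two small checks, sketched here as pure functions with pytest-style tests. This is a stand-in for illustration only: the real integration tests run against the OCR service and mocks, and the helper names below are hypothetical.

```python
def magnitude_ratio(a, b):
    """Ratio of the larger to the smaller value, or None when either
    value is missing or non-positive."""
    if a is None or b is None or a <= 0 or b <= 0:
        return None
    return max(a, b) / min(a, b)


def inter_ocr_consistent(light, medium, max_ratio=10.0):
    """True when the two readings are within max_ratio of each other,
    or when there is nothing to compare."""
    ratio = magnitude_ratio(light, medium)
    return ratio is None or ratio <= max_ratio


def tva_entries_match(tva_total, entries, tolerance=0.02):
    """True when the per-rate TVA entries add up to the TVA total."""
    return abs(sum(entries) - tva_total) <= tolerance + 1e-9


def test_inter_ocr_inconsistency_detected():
    # Test 6: Light=85.99 vs Medium=859762.16 differ by far more than 10x.
    assert not inter_ocr_consistent(85.99, 859762.16)


def test_one_value_missing_passes():
    assert inter_ocr_consistent(85.99, None)


def test_tva_entries_sum_mismatch():
    # Test 4: 12.00 + 2.00 = 14.00, which is not 14.92.
    assert not tva_entries_match(14.92, [12.00, 2.00])


def test_tva_entries_sum_matches():
    assert tva_entries_match(14.92, [12.92, 2.00])
```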


Task 11: Test with Five-Holding receipt (Manual)

  • Status: Done (2025-12-30 19:15)

  • Phase: Day 3 - Testing & Polish

  • Files: Manual testing checklist

  • Description: Manual end-to-end testing with production receipt:

    1. Start backend services:

      • SSH tunnel: ./ssh-tunnel-prod.sh start
      • Backend: ./start-backend.sh
    2. Upload Five-Holding receipt:

      • File: docs/data-entry/igiena 14 decembrie five-holding.pdf
      • Use /api/ocr/process-receipt endpoint
    3. Verify extracted values:

      • TOTAL: 85.99 LEI (NOT 859,762.16)
      • TVA: 14.92 LEI (NOT 149,214.92)
      • CUI: RO10562600
      • Date: 2024-12-14
      • CARD: 85.99 LEI
    4. Verify validation:

      • needs_manual_review = False (values are correct)
      • validation_warnings empty (or only informational)
      • Payment sum matches (CARD = TOTAL)
      • TVA ratio valid (14.92/85.99 = 17.35%)
    5. Test other receipts (regression):

      • Upload 3-5 other receipts from docs/data-entry/
      • Verify no new false positives
      • Verify existing correct extractions still work
    6. Test error cases:

      • Upload a synthetic receipt that produces incorrect OCR values
      • Verify warnings displayed
      • Verify save button works (not blocked)
  • Dependencies: Task 10

  • Success Criteria: All manual tests pass, production bug fixed


Implementation Timeline

Day 1: Core Validation (Tasks 1-4)

  • Morning: Tasks 1-2 (validation module + rules)
  • Afternoon: Tasks 3-4 (engine + unit tests)
  • Checkpoint: All unit tests pass (>90% coverage)

Day 2: OCR Integration (Tasks 5-9)

  • Morning: Tasks 5-6 (Medium OCR + schemas)
  • Afternoon: Tasks 7-9 (merge refactor + API + migration)
  • Checkpoint: Five-Holding receipt extracts correct values

Day 3: Testing & Polish (Tasks 10-11)

  • Morning: Task 10 (integration tests)
  • Afternoon: Task 11 (manual testing + bug fixes)
  • Checkpoint: Production-ready, all tests pass

Success Metrics

  • All 20+ unit tests pass
  • All 8 integration tests pass
  • Five-Holding receipt: 85.99 not 859,762.16
  • pytest coverage >90%
  • No regressions on existing receipts
  • Manual testing checklist complete

Rollback Plan

If issues arise:

  1. Revert migration: alembic downgrade -1
  2. Revert code changes: git revert {commit}
  3. Fall back to Light + Tesseract only (skip Medium)
  4. Add feature flag: OCR_VALIDATION_ENABLED=false
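
For step 4, a feature flag read from the environment might look like the sketch below. Only the variable name OCR_VALIDATION_ENABLED comes from this plan; the helper name, the default of "enabled", and the accepted "off" spellings are assumptions.

```python
import os


def validation_enabled(env=None) -> bool:
    """Read OCR_VALIDATION_ENABLED; validation stays on unless the value
    is an explicit 'off' spelling."""
    env = os.environ if env is None else env
    raw = env.get("OCR_VALIDATION_ENABLED", "true")
    return raw.strip().lower() not in {"false", "0", "no", "off"}


# Example: the merge step would skip the validation engine when disabled.
if validation_enabled({"OCR_VALIDATION_ENABLED": "false"}):
    print("running validation")
else:
    print("validation skipped")
```

Defaulting to enabled means that forgetting to set the variable keeps the production fix active, and rollback requires a deliberate change.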

Plan Created: 2025-12-30T17:25:00Z
Ready for Implementation: Yes