roa2web-service-auto/.auto-build/specs/bon-ocr-validation/SUMMARY.md
Marius Mutu ab160b628d feat(ocr): Add validation system and CLIENT CUI extraction
OCR Data Extraction Validation System:
- Add 7 validation rules (amount range, TVA ratio, payment sum, etc.)
- Add Medium preprocessing to replace Heavy (fixes digit concatenation)
- Add validation warnings to API responses
- Flag receipts needing manual review (needs_manual_review field)
- Add database migration for needs_manual_review column

CLIENT CUI Extraction Improvements:
- Support all format variations: CIF CLIENT:, CLIENT C.U.I/C.I.F., etc.
- Handle OCR errors (R0 vs RO, C1F vs CIF)
- Add client_name, client_cui, client_address to API response
- Add validation fields to API response (was missing)

QA Review: 12 issues found, 9 fixed (5 errors + 4 warnings)
- Fixed type safety in validation rules
- Fixed ZeroDivisionError risk
- Fixed schema mismatch (Optional[bool] for needs_manual_review)
- All 37 unit tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-30 19:12:52 +02:00


OCR Data Extraction Validation System - Summary

Spec Location: /mnt/e/proiecte/roa2web/.auto-build/specs/bon-ocr-validation/spec.md
Created: 2025-12-30
Complexity: High (2-3 days)
Priority: Critical (P0 - Production Bug)


Problem

Production OCR extracts wrong values because Heavy preprocessing causes digit concatenation on clear PDFs, and the merge step then picks the wrong source:

  • Light OCR (98%): 85.99 LEI
  • Heavy OCR (88%): 859,762.16 LEI (10,000x error!)
  • Final Result: 859,762.16 LEI (wrong source chosen)

Solution

4-Layer Validation System

  1. Absolute Sanity Checks

    • Amount: 0.01 - 100,000 RON
    • Date: not future, not older than 10 years
    • CUI: 6-10 digits + Mod 11 checksum
  2. Cross-Field Validation

    • TVA: 5-24% of TOTAL
    • CARD + NUMERAR = TOTAL (±0.02)
    • Σ(TVA entries) = TVA TOTAL (±0.02)
  3. Inter-OCR Consistency

    • Flag if values differ >10x
    • Prefer validation-passing values
  4. Auto-Correction

    • Use payment sum if TOTAL wrong
    • Recalculate TOTAL from TVA if needed
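
The cross-field and inter-OCR checks listed above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function names (`payment_sum_ok`, `tva_sum_ok`, `differs_over_10x`, `choose_total`) are hypothetical, amounts are assumed to be `Decimal` values, and candidates are assumed to arrive in the order the merge step would otherwise prefer them.

```python
from decimal import Decimal
from typing import Optional

TOLERANCE = Decimal("0.02")       # the ±0.02 tolerance from the spec
MIN_AMOUNT = Decimal("0.01")      # sanity range: 0.01 - 100,000 RON
MAX_AMOUNT = Decimal("100000")


def amount_in_range(total: Decimal) -> bool:
    """Layer 1: absolute sanity check on the amount."""
    return MIN_AMOUNT <= total <= MAX_AMOUNT


def payment_sum_ok(card: Decimal, numerar: Decimal, total: Decimal) -> bool:
    """Layer 2: CARD + NUMERAR must equal TOTAL within the tolerance."""
    return abs(card + numerar - total) <= TOLERANCE


def tva_sum_ok(tva_entries: list[Decimal], tva_total: Decimal) -> bool:
    """Layer 2: the sum of TVA entries must equal TVA TOTAL within the tolerance."""
    return abs(sum(tva_entries, Decimal("0")) - tva_total) <= TOLERANCE


def differs_over_10x(a: Decimal, b: Decimal) -> bool:
    """Layer 3: flag two readings of the same field whose ratio exceeds 10."""
    low, high = sorted((abs(a), abs(b)))
    if low == 0:
        return high != 0          # avoid dividing by zero
    return high / low > 10


def choose_total(candidates: list[Decimal]) -> Optional[Decimal]:
    """Layer 3: prefer the first candidate that passes the range check."""
    for value in candidates:
        if amount_in_range(value):
            return value
    return None


# The Five-Holding case from the Problem section:
heavy, light = Decimal("859762.16"), Decimal("85.99")
print(differs_over_10x(heavy, light))   # True
print(choose_total([heavy, light]))     # 85.99
```

Using `Decimal` rather than `float` keeps the ±0.02 comparison exact for monetary values.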

Replace Heavy with Medium OCR

  • Remove: Heavy preprocessing (causes digit concatenation)
  • Add: Medium preprocessing (moderate enhancements, no binarization)
  • Keep: Light (step 1), Tesseract (step 3)

Enhanced CUI Extraction

  • Romanian CIF Mod 11 checksum validation
  • OCR-tolerant patterns (spaces, C1F errors)
  • Format normalization (always add RO prefix)
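
As an illustration of the checksum and normalization steps, the sketch below uses the commonly documented Romanian CIF control key `753217532`. The function names `normalize_cui` and `is_valid_cui` are hypothetical, the 6-10 digit length rule is taken from the sanity checks above, and the only OCR error handled here is the `R0` for `RO` prefix confusion.

```python
import re
from typing import Optional

CONTROL_KEY = "753217532"  # weights applied to the digits before the check digit


def normalize_cui(raw: str) -> Optional[str]:
    """Strip separators, tolerate 'R0' for 'RO', and always return an RO-prefixed CUI."""
    text = re.sub(r"[\s.\-]", "", raw.upper())
    text = re.sub(r"^R[O0]", "", text)      # drop the prefix, including the OCR variant
    if not re.fullmatch(r"[0-9]{6,10}", text):
        return None
    return "RO" + text


def is_valid_cui(digits: str) -> bool:
    """Mod 11 checksum over the digits of a CUI (without the RO prefix)."""
    if not re.fullmatch(r"[0-9]{6,10}", digits):
        return False
    body, check = digits[:-1], int(digits[-1])
    padded = body.zfill(9)                  # left-pad to align with the 9-digit key
    total = sum(int(d) * int(k) for d, k in zip(padded, CONTROL_KEY))
    remainder = (total * 10) % 11
    expected = 0 if remainder == 10 else remainder
    return expected == check


normalized = normalize_cui("R0 14399840")
print(normalized)                                    # RO14399840
print(is_valid_cui(normalized[2:]))                  # True
```

Per the Risks section, a failed checksum would produce a warning rather than reject the receipt.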

Key Requirements

  • Non-blocking warnings - Allow save with warnings
  • Manual review flag - needs_manual_review=TRUE when confidence < 85%
  • Cross-validation - Payment sum & TVA sum checks
  • Apply to new uploads only - No reprocessing
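
The first two requirements interact: warnings are reported but never block the save, while the review flag depends only on confidence. A minimal sketch, assuming confidence is expressed as a percentage and using the hypothetical names `SaveDecision` and `decide`:

```python
from dataclasses import dataclass, field

REVIEW_THRESHOLD = 85.0  # percent; below this the receipt is flagged


@dataclass
class SaveDecision:
    can_save: bool
    needs_manual_review: bool
    warnings: list[str] = field(default_factory=list)


def decide(confidence_pct: float, warnings: list[str]) -> SaveDecision:
    # Warnings are passed through to the caller; they never block the save.
    return SaveDecision(
        can_save=True,
        needs_manual_review=confidence_pct < REVIEW_THRESHOLD,
        warnings=list(warnings),
    )


decision = decide(84.9, ["TVA ratio out of range"])
print(decision.can_save, decision.needs_manual_review)  # True True
```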


Critical Files (11 total)

Files to CREATE (3)

  1. backend/modules/data_entry/services/ocr/validation.py (~400 lines)

    • ValidationRule base class
    • AmountRangeRule, TVARatioRule, PaymentSumRule, CUIChecksumRule
    • OCRValidationEngine orchestrator
  2. backend/modules/data_entry/tests/test_ocr_validation.py (~300 lines)

    • Unit tests for validation rules (>90% coverage)
    • 20+ test cases
  3. backend/modules/data_entry/tests/test_ocr_validation_integration.py (~200 lines)

    • Integration tests with real receipts
    • Five-Holding production case test
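
To make the planned shape of validation.py concrete, here is one possible sketch of the rule/engine split. Only the class names `ValidationRule`, `AmountRangeRule`, `TVARatioRule`, and `OCRValidationEngine` come from this summary; the `check()` method, the `ValidationWarning` dataclass, and the dict-based input with `total` and `tva_total` keys are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class ValidationWarning:
    rule: str
    message: str


class ValidationRule(ABC):
    name = "rule"

    @abstractmethod
    def check(self, data: dict) -> Optional[ValidationWarning]:
        """Return a warning if the rule is violated, otherwise None."""


class AmountRangeRule(ValidationRule):
    name = "amount_range"

    def check(self, data: dict) -> Optional[ValidationWarning]:
        total = data.get("total")
        if total is None:
            return None  # nothing to validate
        if not (Decimal("0.01") <= total <= Decimal("100000")):
            return ValidationWarning(self.name, f"TOTAL {total} outside 0.01-100,000 RON")
        return None


class TVARatioRule(ValidationRule):
    name = "tva_ratio"

    def check(self, data: dict) -> Optional[ValidationWarning]:
        total, tva = data.get("total"), data.get("tva_total")
        if total is None or tva is None or total == 0:
            return None  # the zero guard avoids a ZeroDivisionError
        ratio = tva / total
        if not (Decimal("0.05") <= ratio <= Decimal("0.24")):
            return ValidationWarning(self.name, f"TVA is {ratio:.2%} of TOTAL")
        return None


class OCRValidationEngine:
    def __init__(self, rules: list[ValidationRule]):
        self.rules = list(rules)

    def validate(self, data: dict) -> list[ValidationWarning]:
        # Run every rule and collect the warnings; nothing is raised.
        return [w for rule in self.rules if (w := rule.check(data)) is not None]


engine = OCRValidationEngine([AmountRangeRule(), TVARatioRule()])
bad = engine.validate({"total": Decimal("859762.16"), "tva_total": Decimal("13.73")})
print([w.rule for w in bad])  # ['amount_range', 'tva_ratio']
```

Returning warnings instead of raising exceptions matches the non-blocking requirement: the caller decides what to do with them.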

Files to MODIFY (6)

  1. backend/modules/data_entry/services/ocr_service.py (~200 lines modified)

    • Replace _merge_extractions() with validation-aware logic
    • Replace Heavy with Medium OCR (line ~130)
    • Add validation engine call (line ~204)
  2. backend/modules/data_entry/services/ocr_extractor.py (~80 lines modified)

    • Add validation fields to ExtractionResult dataclass
    • Fix CLIENT CUI patterns (OCR-tolerant)
    • Add CUI normalization & Mod 11 checksum validation
  3. backend/modules/data_entry/services/image_preprocessor.py (~80 lines added)

    • Add preprocess_medium() method
    • Mark preprocess_heavy() as deprecated
  4. backend/modules/data_entry/routers/ocr.py (~40 lines modified)

    • Update response with validation warnings
    • Add needs_manual_review flag
  5. backend/modules/data_entry/schemas/ocr.py (~20 lines added)

    • Add ValidationWarning schema
    • Add validation fields to ExtractionData
  6. backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py (~30 lines)

    • Add needs_manual_review column (nullable BOOLEAN)
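
Marking `preprocess_heavy()` as deprecated could be done with the standard `warnings` module. The sketch below is a stand-in, not the real `image_preprocessor.py`: the class name `ImagePreprocessor` is assumed, the body of `preprocess_medium()` is a placeholder because this summary does not list the actual enhancement steps, and delegating the deprecated method to the Medium path is just one option.

```python
import warnings


class ImagePreprocessor:
    """Hypothetical stand-in; real image handling is elided."""

    def preprocess_medium(self, image):
        # Placeholder: the real method would apply moderate enhancements
        # without binarization. Here the image is returned unchanged.
        return image

    def preprocess_heavy(self, image):
        warnings.warn(
            "preprocess_heavy() is deprecated; use preprocess_medium()",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this method
        )
        # Assumption: route deprecated callers to the Medium path.
        return self.preprocess_medium(image)
```

Any remaining caller of the Heavy path then surfaces in test runs, since pytest reports `DeprecationWarning` in its warnings summary.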

Frontend Files (2 - optional for Phase 1)

  1. src/modules/data-entry/views/receipts/ReceiptCreateView.vue

    • Display validation warnings section
    • Show manual review badge
  2. src/modules/data-entry/components/ocr/OCRPreview.vue

    • Show inter-OCR consistency warning

Acceptance Criteria

Critical (Must Pass)

  • AC-1: Five-Holding receipt extracts 85.99 (NOT 859,762.16)
  • AC-2: Save button works with warnings (not blocked)
  • AC-3: CARD + NUMERAR = TOTAL validation
  • AC-4: Σ(TVA entries) = TVA TOTAL validation
  • AC-5: CUI Mod 11 checksum validation

Test Coverage

  • Unit tests: 20+ test cases, >90% coverage
  • Integration tests: 10+ real receipt tests
  • Manual testing: 6 scenarios (Five-Holding, faded receipt, payment methods, etc.)

Implementation Priority

Day 1: Core Validation

  1. Create ocr/validation.py module
  2. Implement 7 validation rules
  3. Write unit tests
  4. Checkpoint: All unit tests pass

Day 2: OCR Integration

  1. Add preprocess_medium() method
  2. Update _merge_extractions() with validation
  3. Update API schemas
  4. Add database migration
  5. Checkpoint: Five-Holding receipt works

Day 3: Testing & Polish

  1. Write integration tests
  2. Update frontend components
  3. Manual testing
  4. Bug fixes
  5. Checkpoint: Production-ready

Risks & Mitigations

| Risk | Mitigation |
|---|---|
| Medium OCR still causes errors | Tesseract fallback + validation catches issues |
| CUI validation too strict | Warning only (not error), allow override |
| Performance impact | Validation <10ms (negligible vs. OCR time) |
| Breaking API changes | Add new fields, keep existing unchanged |

Tech Stack Integration

Backend Patterns (CLAUDE.md compliant)

  • SQLModel + Alembic migrations
  • Pydantic v2 schemas
  • Service layer pattern (logic in services, not routers)
  • Type hints + docstrings

Frontend Patterns (CLAUDE.md compliant)

  • Vue 3 Composition API
  • PrimeVue components
  • Shared CSS patterns (.roa-card, .roa-metric)
  • No :deep() selectors

Testing Patterns

  • pytest for backend
  • >90% coverage target
  • Integration tests with real data

Next Steps

  1. Review specification → /mnt/e/proiecte/roa2web/.auto-build/specs/bon-ocr-validation/spec.md
  2. Create feature branch → feature/bon-ocr-validation
  3. Implement Phase 1 → Validation engine + tests (Day 1)
  4. Implement Phase 2 → OCR integration (Day 2)
  5. Implement Phase 3 → Frontend + testing (Day 3)
  6. Deploy to staging → Test with production receipts
  7. Monitor for 1 week → Verify no regressions
  8. Deploy to production → Roll out gradually

Estimated Completion: 2026-01-02 (3 working days)
Status: Ready for Implementation