OCR Data Extraction Validation System - Summary
Spec Location: /mnt/e/proiecte/roa2web/.auto-build/specs/bon-ocr-validation/spec.md
Created: 2025-12-30
Complexity: High (2-3 days)
Priority: Critical (P0 - Production Bug)
Problem
Production OCR extracts wrong values due to Heavy preprocessing causing digit concatenation on clear PDFs:
- Light OCR (98%): 85.99 LEI ✅
- Heavy OCR (88%): 859,762.16 LEI ❌ (10,000x error!)
- Final Result: 859,762.16 LEI ❌ (wrong source chosen)
Solution
4-Layer Validation System
- Absolute Sanity Checks
  - Amount: 0.01 - 100,000 RON
  - Date: not future, not older than 10 years
  - CUI: 6-10 digits + Mod 11 checksum
- Cross-Field Validation
  - TVA: 5-24% of TOTAL
  - CARD + NUMERAR = TOTAL (±0.02)
  - Σ(TVA entries) = TVA TOTAL (±0.02)
- Inter-OCR Consistency
  - Flag if values differ >10x
  - Prefer validation-passing values
- Auto-Correction
  - Use payment sum if TOTAL wrong
  - Recalculate TOTAL from TVA if needed
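The cross-field and inter-OCR checks above can be sketched in a few lines. This is a minimal illustration, not the spec's implementation: the function names (payment_sum_matches, tva_sum_matches, tva_ratio_ok, differ_more_than_10x, choose_total) are hypothetical, amounts are assumed to arrive as Decimal values, and only the thresholds (±0.02, 5-24%, >10x, 0.01 - 100,000 RON) come from this summary.

```python
from decimal import Decimal
from typing import Iterable, Optional, Sequence

# Thresholds taken from the summary; all names below are hypothetical.
TOLERANCE = Decimal("0.02")
MIN_AMOUNT = Decimal("0.01")
MAX_AMOUNT = Decimal("100000")


def payment_sum_matches(total: Decimal, card: Decimal, numerar: Decimal) -> bool:
    """CARD + NUMERAR = TOTAL, within ±0.02."""
    return abs(card + numerar - total) <= TOLERANCE


def tva_sum_matches(entries: Iterable[Decimal], tva_total: Decimal) -> bool:
    """Sum of the TVA entries = TVA TOTAL, within ±0.02."""
    return abs(sum(entries, Decimal("0")) - tva_total) <= TOLERANCE


def tva_ratio_ok(total: Decimal, tva_total: Decimal) -> bool:
    """TVA must be 5-24% of TOTAL; a non-positive TOTAL fails instead of dividing by zero."""
    if total <= 0:
        return False
    return Decimal("0.05") <= tva_total / total <= Decimal("0.24")


def differ_more_than_10x(a: Decimal, b: Decimal) -> bool:
    """Inter-OCR consistency: flag two readings that differ by more than 10x."""
    low, high = sorted((abs(a), abs(b)))
    if low == 0:
        return high != 0
    return high / low > 10


def choose_total(candidates: Sequence[Decimal]) -> Optional[Decimal]:
    """Prefer the first candidate that passes the amount sanity check."""
    for value in candidates:
        if MIN_AMOUNT <= value <= MAX_AMOUNT:
            return value
    return None
```

With the readings from the Problem section, differ_more_than_10x(Decimal("85.99"), Decimal("859762.16")) flags the pair, and choose_total skips 859,762.16 because it exceeds the 100,000 RON ceiling, leaving 85.99.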
Replace Heavy with Medium OCR
- Remove: Heavy preprocessing (causes digit concatenation)
- Add: Medium preprocessing (moderate enhancements, no binarization)
- Keep: Light (step 1), Tesseract (step 3)
Enhanced CUI Extraction
- Romanian CIF Mod 11 checksum validation
- OCR-tolerant patterns (spaces, C1F errors)
- Format normalization (always add RO prefix)
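As a rough sketch of the normalization and checksum steps, assuming the commonly documented Romanian CUI control-key scheme (weights 7, 5, 3, 2, 1, 7, 5, 3, 2 applied to the zero-padded body, then sum × 10 mod 11, with 10 mapped to 0). The function names normalize_cui and has_valid_checksum are hypothetical, and the 6-10 digit limit is the one stated in the sanity checks above.

```python
import re
from typing import Optional

# Control-key weights for the Mod 11 check (assumed scheme, see lead-in).
_CONTROL_KEY = (7, 5, 3, 2, 1, 7, 5, 3, 2)


def normalize_cui(raw: str) -> Optional[str]:
    """Return the CUI as 'RO' + digits, or None if it cannot be normalized.

    Tolerates spaces, dots and hyphens, and the OCR confusion R0 for RO.
    """
    compact = re.sub(r"[\s.\-]", "", raw.upper())
    compact = re.sub(r"^R[O0]", "", compact)  # drop an RO / R0 prefix if present
    if not re.fullmatch(r"[0-9]{6,10}", compact):
        return None
    return "RO" + compact


def has_valid_checksum(cui: str) -> bool:
    """Check the last digit against the Mod 11 control digit."""
    digits = cui[2:] if cui.startswith("RO") else cui
    if not re.fullmatch(r"[0-9]{2,10}", digits):
        return False
    body, control = digits[:-1].zfill(9), int(digits[-1])
    total = sum(int(d) * k for d, k in zip(body, _CONTROL_KEY))
    expected = (total * 10) % 11
    if expected == 10:
        expected = 0
    return expected == control
```

A caller would typically normalize first and run the checksum only on a non-None result, emitting a warning rather than an error when the checksum fails, in line with the non-blocking requirement below.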
Key Requirements
✅ Non-blocking warnings - Allow save with warnings
✅ Manual review flag - needs_manual_review=TRUE when confidence < 85%
✅ Cross-validation - Payment sum & TVA sum checks
✅ Apply to new uploads only - No reprocessing
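The first two requirements can be illustrated with a small decision helper. This is only a sketch under assumptions: ReviewDecision and decide_review are hypothetical names, confidence is assumed to be a fraction between 0 and 1, and the flag depends on confidence alone because this summary does not say whether warnings also trigger review.

```python
from dataclasses import dataclass, field
from typing import List

# 85% threshold from the requirement above, expressed as a fraction.
REVIEW_CONFIDENCE_THRESHOLD = 0.85


@dataclass
class ReviewDecision:
    can_save: bool
    needs_manual_review: bool
    warnings: List[str] = field(default_factory=list)


def decide_review(confidence: float, warnings: List[str]) -> ReviewDecision:
    """Warnings never block saving; low confidence only sets the review flag."""
    return ReviewDecision(
        can_save=True,
        needs_manual_review=confidence < REVIEW_CONFIDENCE_THRESHOLD,
        warnings=list(warnings),
    )
```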
Critical Files (10 total)
Files to CREATE (3)
- backend/modules/data_entry/services/ocr/validation.py (~400 lines)
  - ValidationRule base class
  - AmountRangeRule, TVARatioRule, PaymentSumRule, CUIChecksumRule
  - OCRValidationEngine orchestrator
- backend/modules/data_entry/tests/test_ocr_validation.py (~300 lines)
  - Unit tests for validation rules (>90% coverage)
  - 20+ test cases
- backend/modules/data_entry/tests/test_ocr_validation_integration.py (~200 lines)
  - Integration tests with real receipts
  - Five-Holding production case test
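One possible shape for the validation module is sketched below. The class names ValidationRule, AmountRangeRule and OCRValidationEngine come from this summary, but the interface (a check() method returning a warning string or None, and a validate() method returning a list of warnings) is an assumption made for illustration, as is the "total" field key.

```python
from abc import ABC, abstractmethod
from decimal import Decimal
from typing import Dict, Iterable, List, Optional


class ValidationRule(ABC):
    """Base class: each rule inspects the extracted fields (assumed interface)."""

    @abstractmethod
    def check(self, fields: Dict[str, object]) -> Optional[str]:
        """Return a warning message, or None if the rule passes."""


class AmountRangeRule(ValidationRule):
    """Sanity check: TOTAL must lie in 0.01 - 100,000 RON."""

    MIN = Decimal("0.01")
    MAX = Decimal("100000")

    def check(self, fields: Dict[str, object]) -> Optional[str]:
        total = fields.get("total")  # hypothetical field key
        if not isinstance(total, Decimal):
            return "total is missing or not a Decimal"
        if not (self.MIN <= total <= self.MAX):
            return f"total {total} is outside {self.MIN} - {self.MAX} RON"
        return None


class OCRValidationEngine:
    """Orchestrator: runs every rule and collects non-blocking warnings."""

    def __init__(self, rules: Iterable[ValidationRule]) -> None:
        self.rules = list(rules)

    def validate(self, fields: Dict[str, object]) -> List[str]:
        warnings = []
        for rule in self.rules:
            message = rule.check(fields)
            if message is not None:
                warnings.append(message)
        return warnings
```

Returning messages instead of raising keeps every rule non-blocking, and the explicit isinstance check avoids comparing a missing or mistyped value against the range.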
Files to MODIFY (6)
- backend/modules/data_entry/services/ocr_service.py (~200 lines modified)
  - Replace _merge_extractions() with validation-aware logic
  - Replace Heavy with Medium OCR (line ~130)
  - Add validation engine call (line ~204)
- backend/modules/data_entry/services/ocr_extractor.py (~80 lines modified)
  - Add validation fields to ExtractionResult dataclass
  - Fix CLIENT CUI patterns (OCR-tolerant)
  - Add CUI normalization & Mod 11 checksum validation
- backend/modules/data_entry/services/image_preprocessor.py (~80 lines added)
  - Add preprocess_medium() method
  - Mark preprocess_heavy() as deprecated
- backend/modules/data_entry/routers/ocr.py (~40 lines modified)
  - Update response with validation warnings
  - Add needs_manual_review flag
- backend/modules/data_entry/schemas/ocr.py (~20 lines added)
  - Add ValidationWarning schema
  - Add validation fields to ExtractionData
- backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py (~30 lines)
  - Add needs_manual_review column (nullable BOOLEAN)
Frontend Files (2 - optional for Phase 1)
- src/modules/data-entry/views/receipts/ReceiptCreateView.vue
  - Display validation warnings section
  - Show manual review badge
- src/modules/data-entry/components/ocr/OCRPreview.vue
  - Show inter-OCR consistency warning
Acceptance Criteria
Critical (Must Pass)
✅ AC-1: Five-Holding receipt extracts 85.99 (NOT 859,762.16)
✅ AC-2: Save button works with warnings (not blocked)
✅ AC-3: CARD + NUMERAR = TOTAL validation
✅ AC-4: Σ(TVA entries) = TVA TOTAL validation
✅ AC-5: CUI Mod 11 checksum validation
Test Coverage
- Unit tests: 20+ test cases, >90% coverage
- Integration tests: 10+ real receipt tests
- Manual testing: 6 scenarios (Five-Holding, faded receipt, payment methods, etc.)
Implementation Priority
Day 1: Core Validation
- Create ocr/validation.py module
- Implement 7 validation rules
- Write unit tests
- ✅ Checkpoint: All unit tests pass
Day 2: OCR Integration
- Add preprocess_medium() method
- Update _merge_extractions() with validation
- Update API schemas
- Add database migration
- ✅ Checkpoint: Five-Holding receipt works
Day 3: Testing & Polish
- Write integration tests
- Update frontend components
- Manual testing
- Bug fixes
- ✅ Checkpoint: Production-ready
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Medium OCR still causes errors | Tesseract fallback + validation catches issues |
| CUI validation too strict | Warning only (not error), allow override |
| Performance impact | Validation <10ms (negligible vs. OCR time) |
| Breaking API changes | Add new fields, keep existing unchanged |
Tech Stack Integration
Backend Patterns (CLAUDE.md compliant)
- ✅ SQLModel + Alembic migrations
- ✅ Pydantic v2 schemas
- ✅ Service layer pattern (logic in services, not routers)
- ✅ Type hints + docstrings
Frontend Patterns (CLAUDE.md compliant)
- ✅ Vue 3 Composition API
- ✅ PrimeVue components
- ✅ Shared CSS patterns (.roa-card, .roa-metric)
- ✅ No :deep() selectors
Testing Patterns
- ✅ pytest for backend
- ✅ >90% coverage target
- ✅ Integration tests with real data
Next Steps
- Review specification → /mnt/e/proiecte/roa2web/.auto-build/specs/bon-ocr-validation/spec.md
- Create feature branch → feature/bon-ocr-validation
- Implement Phase 1 → Validation engine + tests (Day 1)
- Implement Phase 2 → OCR integration (Day 2)
- Implement Phase 3 → Frontend + testing (Day 3)
- Deploy to staging → Test with production receipts
- Monitor for 1 week → Verify no regressions
- Deploy to production → Roll out gradually
Estimated Completion: 2026-01-02 (3 working days)
Status: Ready for Implementation