roa2web-service-auto/.auto-build/specs/bon-ocr-validation/SUMMARY.md
Marius Mutu ab160b628d feat(ocr): Add validation system and CLIENT CUI extraction
OCR Data Extraction Validation System:
- Add 7 validation rules (amount range, TVA ratio, payment sum, etc.)
- Add Medium preprocessing to replace Heavy (fixes digit concatenation)
- Add validation warnings to API responses
- Flag receipts needing manual review (needs_manual_review field)
- Add database migration for needs_manual_review column

CLIENT CUI Extraction Improvements:
- Support all format variations: CIF CLIENT:, CLIENT C.U.I/C.I.F., etc.
- Handle OCR errors (R0 vs RO, C1F vs CIF)
- Add client_name, client_cui, client_address to API response
- Add validation fields to API response (was missing)

QA Review: 12 issues found, 9 fixed (5 errors + 4 warnings)
- Fixed type safety in validation rules
- Fixed ZeroDivisionError risk
- Fixed schema mismatch (Optional[bool] for needs_manual_review)
- All 37 unit tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-30 19:12:52 +02:00


# OCR Data Extraction Validation System - Summary
**Spec Location:** `/mnt/e/proiecte/roa2web/.auto-build/specs/bon-ocr-validation/spec.md`
**Created:** 2025-12-30
**Complexity:** High (2-3 days)
**Priority:** Critical (P0 - Production Bug)

---
## Problem
In production, OCR extracts wrong values because Heavy preprocessing causes digit concatenation on clear PDFs:
- **Light OCR (98%):** 85.99 LEI ✅
- **Heavy OCR (88%):** 859,762.16 LEI ❌ (10,000x error!)
- **Final Result:** 859,762.16 LEI ❌ (wrong source chosen)
---
## Solution
### 4-Layer Validation System
1. **Absolute Sanity Checks**
   - Amount: 0.01 - 100,000 RON
   - Date: not future, not older than 10 years
   - CUI: 6-10 digits + Mod 11 checksum
2. **Cross-Field Validation**
   - TVA: 5-24% of TOTAL
   - CARD + NUMERAR = TOTAL (±0.02)
   - Σ(TVA entries) = TVA TOTAL (±0.02)
3. **Inter-OCR Consistency**
   - Flag if values differ >10x
   - Prefer validation-passing values
4. **Auto-Correction**
   - Use payment sum if TOTAL wrong
   - Recalculate TOTAL from TVA if needed
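
Layers 1, 2 and 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's `validation.py`: the `Receipt` container, the warning codes, and the rule that the payment sum replaces TOTAL only when TOTAL is out of range are all hypothetical. The TVA and CARD values in the usage lines are invented for illustration; only the two totals come from the production case.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

TOLERANCE = Decimal("0.02")  # ±0.02 RON, as listed in the cross-field checks
MIN_AMOUNT, MAX_AMOUNT = Decimal("0.01"), Decimal("100000")


@dataclass
class Receipt:
    """Hypothetical container for extracted fields (not the real ExtractionResult)."""
    total: Decimal
    tva_total: Decimal
    card: Decimal
    numerar: Decimal
    receipt_date: date


def amount_in_range(amount: Decimal) -> bool:
    return MIN_AMOUNT <= amount <= MAX_AMOUNT


def validate(receipt: Receipt, today: date) -> list[str]:
    """Return warning codes; an empty list means every check passed."""
    warnings: list[str] = []

    # Layer 1: absolute sanity checks
    if not amount_in_range(receipt.total):
        warnings.append("amount_range")
    try:
        oldest = today.replace(year=today.year - 10)
    except ValueError:  # today is Feb 29 and the target year is not a leap year
        oldest = today.replace(year=today.year - 10, day=28)
    if not (oldest <= receipt.receipt_date <= today):
        warnings.append("date_range")

    # Layer 2: cross-field checks
    if receipt.total > 0:  # guard against division by zero
        ratio = receipt.tva_total / receipt.total
        if not (Decimal("0.05") <= ratio <= Decimal("0.24")):
            warnings.append("tva_ratio")
    if abs(receipt.card + receipt.numerar - receipt.total) > TOLERANCE:
        warnings.append("payment_sum")
    return warnings


def auto_correct_total(receipt: Receipt) -> Decimal:
    """Layer 4 (assumed policy): use CARD + NUMERAR when TOTAL is implausible."""
    payment_sum = receipt.card + receipt.numerar
    if not amount_in_range(receipt.total) and amount_in_range(payment_sum):
        return payment_sum
    return receipt.total


# Usage: the concatenated total from the production case, with illustrative TVA/CARD values.
bad = Receipt(Decimal("859762.16"), Decimal("13.73"), Decimal("85.99"),
              Decimal("0"), date(2025, 12, 29))
print(validate(bad, today=date(2025, 12, 30)))  # ['amount_range', 'tva_ratio', 'payment_sum']
print(auto_correct_total(bad))                  # 85.99
```

`Decimal` is used instead of `float` so that the ±0.02 tolerance comparison is not affected by binary rounding.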
### Replace Heavy with Medium OCR
- **Remove:** Heavy preprocessing (causes digit concatenation)
- **Add:** Medium preprocessing (moderate enhancements, no binarization)
- **Keep:** Light (step 1), Tesseract (step 3)
### Enhanced CUI Extraction
- Romanian CIF Mod 11 checksum validation
- OCR-tolerant patterns (spaces, C1F errors)
- Format normalization (always add RO prefix)
---
## Key Requirements
- **Non-blocking warnings** - Allow save with warnings
- **Manual review flag** - `needs_manual_review=TRUE` when confidence < 85%
- **Cross-validation** - Payment sum & TVA sum checks
- **Apply to new uploads only** - No reprocessing
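
A small sketch of how the first two requirements might fit together. Only `needs_manual_review` and the 85% threshold come from this summary; `OcrOutcome`, `to_response`, and the `validation_warnings` key are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

REVIEW_CONFIDENCE_THRESHOLD = 0.85  # "confidence < 85%" from the requirement


@dataclass
class OcrOutcome:
    """Hypothetical result object for one processed receipt."""
    confidence: float  # expressed as a fraction, 0.0 to 1.0
    warnings: list[str] = field(default_factory=list)

    @property
    def needs_manual_review(self) -> bool:
        return self.confidence < REVIEW_CONFIDENCE_THRESHOLD


def to_response(outcome: OcrOutcome) -> dict:
    """Report warnings next to the data instead of raising, so saving is never blocked."""
    return {
        "validation_warnings": outcome.warnings,
        "needs_manual_review": outcome.needs_manual_review,
    }


print(to_response(OcrOutcome(confidence=0.80, warnings=["payment_sum"])))
# {'validation_warnings': ['payment_sum'], 'needs_manual_review': True}
```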
---
## Critical Files (11 total)
### Files to CREATE (3)
1. **`backend/modules/data_entry/services/ocr/validation.py`** (~400 lines)
   - `ValidationRule` base class
   - `AmountRangeRule`, `TVARatioRule`, `PaymentSumRule`, `CUIChecksumRule`
   - `OCRValidationEngine` orchestrator
2. **`backend/modules/data_entry/tests/test_ocr_validation.py`** (~300 lines)
   - Unit tests for validation rules (>90% coverage)
   - 20+ test cases
3. **`backend/modules/data_entry/tests/test_ocr_validation_integration.py`** (~200 lines)
   - Integration tests with real receipts
   - Five-Holding production case test
### Files to MODIFY (6)
1. **`backend/modules/data_entry/services/ocr_service.py`** (~200 lines modified)
   - Replace `_merge_extractions()` with validation-aware logic
   - Replace Heavy with Medium OCR (line ~130)
   - Add validation engine call (line ~204)
2. **`backend/modules/data_entry/services/ocr_extractor.py`** (~80 lines modified)
   - Add validation fields to `ExtractionResult` dataclass
   - Fix CLIENT CUI patterns (OCR-tolerant)
   - Add CUI normalization & Mod 11 checksum validation
3. **`backend/modules/data_entry/services/image_preprocessor.py`** (~80 lines added)
   - Add `preprocess_medium()` method
   - Mark `preprocess_heavy()` as deprecated
4. **`backend/modules/data_entry/routers/ocr.py`** (~40 lines modified)
   - Update response with validation warnings
   - Add `needs_manual_review` flag
5. **`backend/modules/data_entry/schemas/ocr.py`** (~20 lines added)
   - Add `ValidationWarning` schema
   - Add validation fields to `ExtractionData`
6. **`backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py`** (~30 lines)
   - Add `needs_manual_review` column (nullable BOOLEAN)
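
The validation-aware merge planned for `ocr_service.py` might look something like the sketch below. It is not the actual `_merge_extractions()`: `pick_total`, `values_disagree`, the candidate tuple layout, and the tie-break by confidence are assumptions made for illustration. It combines the two inter-OCR rules from the Solution section (flag values that differ by more than 10x, prefer values that pass validation).

```python
from decimal import Decimal
from typing import Callable, Optional

Candidate = tuple[str, Decimal, float]  # (source name, extracted total, confidence)


def values_disagree(a: Decimal, b: Decimal, factor: Decimal = Decimal("10")) -> bool:
    """True when two amounts differ by more than `factor` times."""
    low, high = sorted((abs(a), abs(b)))
    if low == 0:
        return high != 0  # avoid dividing by zero
    return high / low > factor


def pick_total(
    candidates: list[Candidate],
    is_valid: Callable[[Decimal], bool],
) -> tuple[Optional[Decimal], bool]:
    """Return (chosen total, inconsistent flag) across OCR passes."""
    values = [value for _, value, _ in candidates]
    inconsistent = any(
        values_disagree(a, b) for i, a in enumerate(values) for b in values[i + 1:]
    )
    # Prefer candidates that pass validation; fall back to all of them otherwise.
    pool = [c for c in candidates if is_valid(c[1])] or candidates
    if not pool:
        return None, inconsistent
    best = max(pool, key=lambda c: c[2])  # assumed tie-break: highest confidence
    return best[1], inconsistent


def in_range(amount: Decimal) -> bool:
    return Decimal("0.01") <= amount <= Decimal("100000")


# Usage with the two totals from the production case.
passes = [("light", Decimal("85.99"), 0.98), ("heavy", Decimal("859762.16"), 0.88)]
print(pick_total(passes, in_range))  # (Decimal('85.99'), True)
```

Filtering by validity before comparing confidence means a high-confidence but implausible value cannot win over a plausible one.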
### Frontend Files (2 - optional for Phase 1)
1. **`src/modules/data-entry/views/receipts/ReceiptCreateView.vue`**
   - Display validation warnings section
   - Show manual review badge
2. **`src/modules/data-entry/components/ocr/OCRPreview.vue`**
   - Show inter-OCR consistency warning
---
## Acceptance Criteria
### Critical (Must Pass)
- **AC-1:** Five-Holding receipt extracts 85.99 (NOT 859,762.16)
- **AC-2:** Save button works with warnings (not blocked)
- **AC-3:** CARD + NUMERAR = TOTAL validation
- **AC-4:** Σ(TVA entries) = TVA TOTAL validation
- **AC-5:** CUI Mod 11 checksum validation
### Test Coverage
- **Unit tests:** 20+ test cases, >90% coverage
- **Integration tests:** 10+ real receipt tests
- **Manual testing:** 6 scenarios (Five-Holding, faded receipt, payment methods, etc.)
---
## Implementation Priority
### Day 1: Core Validation
1. Create `ocr/validation.py` module
2. Implement 7 validation rules
3. Write unit tests
4. ✅ Checkpoint: All unit tests pass
### Day 2: OCR Integration
1. Add `preprocess_medium()` method
2. Update `_merge_extractions()` with validation
3. Update API schemas
4. Add database migration
5. ✅ Checkpoint: Five-Holding receipt works
### Day 3: Testing & Polish
1. Write integration tests
2. Update frontend components
3. Manual testing
4. Bug fixes
5. ✅ Checkpoint: Production-ready
---
## Risks & Mitigations
| Risk | Mitigation |
|------|------------|
| Medium OCR still causes errors | Tesseract fallback + validation catches issues |
| CUI validation too strict | Warning only (not error), allow override |
| Performance impact | Validation <10ms (negligible vs. OCR time) |
| Breaking API changes | Add new fields, keep existing unchanged |
---
## Tech Stack Integration
### Backend Patterns (CLAUDE.md compliant)
- SQLModel + Alembic migrations
- Pydantic v2 schemas
- Service layer pattern (logic in services, not routers)
- Type hints + docstrings
### Frontend Patterns (CLAUDE.md compliant)
- Vue 3 Composition API
- PrimeVue components
- Shared CSS patterns (`.roa-card`, `.roa-metric`)
- No `:deep()` selectors
### Testing Patterns
- pytest for backend
- >90% coverage target
- Integration tests with real data
---
## Next Steps
1. **Review specification** → `/mnt/e/proiecte/roa2web/.auto-build/specs/bon-ocr-validation/spec.md`
2. **Create feature branch** → `feature/bon-ocr-validation`
3. **Implement Phase 1** → Validation engine + tests (Day 1)
4. **Implement Phase 2** → OCR integration (Day 2)
5. **Implement Phase 3** → Frontend + testing (Day 3)
6. **Deploy to staging** → Test with production receipts
7. **Monitor for 1 week** → Verify no regressions
8. **Deploy to production** → Roll out gradually
---
**Estimated Completion:** 2026-01-02 (3 working days)
**Status:** Ready for Implementation