# OCR Data Extraction Validation System - Summary

**Spec Location:** `/mnt/e/proiecte/roa2web/.auto-build/specs/bon-ocr-validation/spec.md`
**Created:** 2025-12-30
**Complexity:** High (2-3 days)
**Priority:** Critical (P0 - Production Bug)

---

## Problem

In production, Heavy preprocessing causes digit concatenation on clear PDFs, so OCR extracts wrong values:

- **Light OCR (98% confidence):** 85.99 LEI ✅
- **Heavy OCR (88% confidence):** 859,762.16 LEI ❌ (~10,000x error!)
- **Final Result:** 859,762.16 LEI ❌ (wrong source chosen)

---

## Solution

### 4-Layer Validation System

1. **Absolute Sanity Checks**
   - Amount: 0.01 - 100,000 RON
   - Date: not future, not older than 10 years
   - CUI: 6-10 digits + Mod 11 checksum

2. **Cross-Field Validation**
   - TVA: 5-24% of TOTAL
   - CARD + NUMERAR = TOTAL (±0.02)
   - Σ(TVA entries) = TVA TOTAL (±0.02)

3. **Inter-OCR Consistency**
   - Flag if values differ >10x
   - Prefer validation-passing values

4. **Auto-Correction**
   - Use payment sum if TOTAL wrong
   - Recalculate TOTAL from TVA if needed

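
The layer 1 and layer 2 checks above can be sketched as plain predicates. This is a minimal illustration, not the project's code: the function names are hypothetical, `Decimal` is an assumed choice for money values, and the thresholds are the ones listed in this summary. The CUI check is left out here because it needs its own algorithm.

```python
from datetime import date
from decimal import Decimal
from typing import List

TOLERANCE = Decimal("0.02")  # the ±0.02 tolerance from the summary


def check_amount(total: Decimal) -> bool:
    """Layer 1: TOTAL must fall within 0.01 - 100,000 RON."""
    return Decimal("0.01") <= total <= Decimal("100000")


def check_date(receipt_date: date, today: date) -> bool:
    """Layer 1: not in the future and not older than 10 years."""
    try:
        oldest = today.replace(year=today.year - 10)
    except ValueError:
        # today is Feb 29 and the year 10 years back is not a leap year
        oldest = today.replace(year=today.year - 10, day=28)
    return oldest <= receipt_date <= today


def check_tva_ratio(tva: Decimal, total: Decimal) -> bool:
    """Layer 2: TVA must be 5-24% of TOTAL (guards against a zero TOTAL)."""
    if total <= 0:
        return False
    return Decimal("0.05") <= tva / total <= Decimal("0.24")


def check_payment_sum(card: Decimal, cash: Decimal, total: Decimal) -> bool:
    """Layer 2: CARD + NUMERAR must equal TOTAL within the tolerance."""
    return abs(card + cash - total) <= TOLERANCE


def check_tva_sum(entries: List[Decimal], tva_total: Decimal) -> bool:
    """Layer 2: the TVA entries must add up to TVA TOTAL within the tolerance."""
    return abs(sum(entries, Decimal("0")) - tva_total) <= TOLERANCE
```

With the figures from the Problem section, `check_amount(Decimal("859762.16"))` fails while `check_amount(Decimal("85.99"))` passes, which is the signal the later layers rely on.
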
### Replace Heavy with Medium OCR

- **Remove:** Heavy preprocessing (causes digit concatenation)
- **Add:** Medium preprocessing (moderate enhancements, no binarization)
- **Keep:** Light (step 1), Tesseract (step 3)

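
This summary does not say which enhancements Medium applies, so the following is only a toy illustration of the "no binarization" distinction. It works on nested lists of grayscale values rather than real images, and both function names are hypothetical. A real `preprocess_medium()` would use an image library.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale pixel values


def binarize(img: Gray, threshold: int = 128) -> Gray:
    """Hard threshold: every pixel becomes 0 or 255, so grey levels are lost."""
    return [[255 if p >= threshold else 0 for p in row] for row in img]


def stretch_contrast(img: Gray) -> Gray:
    """Linear contrast stretch to the full 0-255 range, keeping grey levels."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        # Flat image: nothing to stretch, return a copy unchanged.
        return [row[:] for row in img]
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in img]
```

For the sample `[[100, 120], [140, 160]]`, the stretch keeps four distinct levels, while the hard threshold collapses them to two.
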
### Enhanced CUI Extraction

- Romanian CIF Mod 11 checksum validation
- OCR-tolerant patterns (spaces, C1F errors)
- Format normalization (always add RO prefix)

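
A minimal sketch of the normalization and checksum steps, under stated assumptions. The function names are hypothetical. The check digit follows the commonly described scheme with control key `753217532` (weighted sum, times 10, mod 11, with 10 mapped to 0) and should be verified against the spec. The 6-10 digit bound comes from this summary. Label matching (`C1F` vs `CIF`) is not covered. The sample value `12345674` is synthetic, chosen only because it satisfies the checksum.

```python
import re
from typing import Optional

CIF_KEY = "753217532"  # assumed control key, right-aligned against the CUI body


def normalize_cui(raw: str) -> Optional[str]:
    """Return 'RO' + digits, tolerating spaces, case, and the R0/RO OCR confusion."""
    text = re.sub(r"\s+", "", raw.upper())
    match = re.fullmatch(r"(?:R[O0])?(\d{6,10})", text)
    return f"RO{match.group(1)}" if match else None


def has_valid_checksum(cui: str) -> bool:
    """Check the last digit of a CUI against the Mod 11 control digit."""
    digits = cui[2:] if cui.startswith("RO") else cui
    if not digits.isdigit() or not 6 <= len(digits) <= 10:
        return False
    body, check = digits[:-1], int(digits[-1])
    padded = body.rjust(9, "0")  # left-pad the body to the key length
    total = sum(int(d) * int(k) for d, k in zip(padded, CIF_KEY))
    expected = (total * 10) % 11
    return (0 if expected == 10 else expected) == check
```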

---

## Key Requirements

- ✅ **Non-blocking warnings** - Allow save with warnings
- ✅ **Manual review flag** - `needs_manual_review=TRUE` when confidence < 85%
- ✅ **Cross-validation** - Payment sum & TVA sum checks
- ✅ **Apply to new uploads only** - No reprocessing

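
The first two requirements can be sketched as follows. This is an illustration only: `ExtractionOutcome`, `can_save`, and the fields of `ValidationWarning` are hypothetical, and plain dataclasses stand in for the project's schemas. Only the 85% threshold and the non-blocking behaviour come from this summary.

```python
from dataclasses import dataclass, field
from typing import List

REVIEW_THRESHOLD = 0.85  # confidence below this sets needs_manual_review


@dataclass
class ValidationWarning:
    rule: str      # assumed field: which rule raised the warning
    message: str   # assumed field: human-readable explanation


@dataclass
class ExtractionOutcome:
    confidence: float
    warnings: List[ValidationWarning] = field(default_factory=list)

    @property
    def needs_manual_review(self) -> bool:
        """True when OCR confidence is below 85%."""
        return self.confidence < REVIEW_THRESHOLD

    @property
    def can_save(self) -> bool:
        """Warnings are informational only, so saving is never blocked."""
        return True
```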

---

## Critical Files (11 total)

### Files to CREATE (3)

1. **`backend/modules/data_entry/services/ocr/validation.py`** (~400 lines)
   - `ValidationRule` base class
   - `AmountRangeRule`, `TVARatioRule`, `PaymentSumRule`, `CUIChecksumRule`
   - `OCRValidationEngine` orchestrator

2. **`backend/modules/data_entry/tests/test_ocr_validation.py`** (~300 lines)
   - Unit tests for validation rules (>90% coverage)
   - 20+ test cases

3. **`backend/modules/data_entry/tests/test_ocr_validation_integration.py`** (~200 lines)
   - Integration tests with real receipts
   - Five-Holding production case test

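
One possible shape for the rule/engine split named above, as a sketch only. The class names come from this summary, but the `check()` and `validate()` signatures, the dictionary keys, and the string-based warnings are assumptions. Rules return a message instead of raising, which keeps warnings non-blocking.

```python
from abc import ABC, abstractmethod
from decimal import Decimal
from typing import Dict, Iterable, List, Optional


class ValidationRule(ABC):
    name = "rule"

    @abstractmethod
    def check(self, data: Dict[str, Decimal]) -> Optional[str]:
        """Return a warning message, or None if the rule passes or does not apply."""


class AmountRangeRule(ValidationRule):
    name = "amount_range"

    def check(self, data: Dict[str, Decimal]) -> Optional[str]:
        total = data.get("total")
        if total is None:
            return None  # nothing to check
        if not Decimal("0.01") <= total <= Decimal("100000"):
            return f"TOTAL {total} is outside 0.01 - 100,000 RON"
        return None


class PaymentSumRule(ValidationRule):
    name = "payment_sum"

    def check(self, data: Dict[str, Decimal]) -> Optional[str]:
        if any(key not in data for key in ("card", "cash", "total")):
            return None  # rule does not apply without all three fields
        if abs(data["card"] + data["cash"] - data["total"]) > Decimal("0.02"):
            return "CARD + NUMERAR does not match TOTAL"
        return None


class OCRValidationEngine:
    def __init__(self, rules: Iterable[ValidationRule]) -> None:
        self.rules = list(rules)

    def validate(self, data: Dict[str, Decimal]) -> List[str]:
        """Run every rule and collect warnings; never raises on bad values."""
        warnings = []
        for rule in self.rules:
            message = rule.check(data)
            if message:
                warnings.append(f"{rule.name}: {message}")
        return warnings
```

Adding another rule then means adding one subclass and passing it to the engine, without touching the orchestration code.
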
### Files to MODIFY (6)

1. **`backend/modules/data_entry/services/ocr_service.py`** (~200 lines modified)
   - Replace `_merge_extractions()` with validation-aware logic
   - Replace Heavy with Medium OCR (line ~130)
   - Add validation engine call (line ~204)

2. **`backend/modules/data_entry/services/ocr_extractor.py`** (~80 lines modified)
   - Add validation fields to `ExtractionResult` dataclass
   - Fix CLIENT CUI patterns (OCR-tolerant)
   - Add CUI normalization & Mod 11 checksum validation

3. **`backend/modules/data_entry/services/image_preprocessor.py`** (~80 lines added)
   - Add `preprocess_medium()` method
   - Mark `preprocess_heavy()` as deprecated

4. **`backend/modules/data_entry/routers/ocr.py`** (~40 lines modified)
   - Update response with validation warnings
   - Add `needs_manual_review` flag

5. **`backend/modules/data_entry/schemas/ocr.py`** (~20 lines added)
   - Add `ValidationWarning` schema
   - Add validation fields to `ExtractionData`

6. **`backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py`** (~30 lines)
   - Add `needs_manual_review` column (nullable BOOLEAN)

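
The validation-aware merge planned for `_merge_extractions()` could look roughly like this. It is a sketch under assumptions: `Candidate` and `merge_totals` are hypothetical names, only the TOTAL field is merged, and "validation-passing" is reduced to the amount range check. The 10x threshold and the preference for validation-passing values come from this summary.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Tuple


@dataclass
class Candidate:
    source: str        # e.g. "light", "medium", "tesseract"
    total: Decimal
    confidence: float


def in_range(total: Decimal) -> bool:
    """Amount range check from the sanity layer (0.01 - 100,000 RON)."""
    return Decimal("0.01") <= total <= Decimal("100000")


def merge_totals(candidates: List[Candidate]) -> Tuple[Candidate, bool]:
    """Pick a TOTAL and report whether the sources disagree by more than 10x."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    positive = [c.total for c in candidates if c.total > 0]
    inconsistent = bool(positive) and max(positive) > 10 * min(positive)
    # Prefer candidates that pass validation; fall back to all of them.
    valid = [c for c in candidates if in_range(c.total)]
    pool = valid or candidates
    best = max(pool, key=lambda c: c.confidence)
    return best, inconsistent
```

With the two values from the Problem section, the 859,762.16 candidate fails the range check, so 85.99 is selected regardless of confidence and the pair is flagged as inconsistent.
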
### Frontend Files (2 - optional for Phase 1)

1. **`src/modules/data-entry/views/receipts/ReceiptCreateView.vue`**
   - Display validation warnings section
   - Show manual review badge

2. **`src/modules/data-entry/components/ocr/OCRPreview.vue`**
   - Show inter-OCR consistency warning

---

## Acceptance Criteria

### Critical (Must Pass)

- ✅ **AC-1:** Five-Holding receipt extracts 85.99 (NOT 859,762.16)
- ✅ **AC-2:** Save button works with warnings (not blocked)
- ✅ **AC-3:** CARD + NUMERAR = TOTAL validation
- ✅ **AC-4:** Σ(TVA entries) = TVA TOTAL validation
- ✅ **AC-5:** CUI Mod 11 checksum validation

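
AC-1 and AC-3 meet in the auto-correction layer ("use payment sum if TOTAL wrong"). The sketch below makes one conservative assumption that this summary does not state: TOTAL counts as wrong only when it fails the amount range check, and it is replaced only when the payment sum itself passes. `autocorrect_total` is a hypothetical name. Recalculating TOTAL from TVA is omitted because it needs a TVA rate, which is not specified here.

```python
from decimal import Decimal
from typing import Tuple


def in_range(amount: Decimal) -> bool:
    """Amount range check (0.01 - 100,000 RON)."""
    return Decimal("0.01") <= amount <= Decimal("100000")


def autocorrect_total(total: Decimal, card: Decimal, cash: Decimal) -> Tuple[Decimal, bool]:
    """Return (total, corrected); falls back to CARD + NUMERAR when TOTAL is implausible."""
    if in_range(total):
        return total, False  # plausible TOTAL is left alone
    payments = card + cash
    if in_range(payments):
        return payments, True  # payment sum replaces the implausible TOTAL
    return total, False  # nothing trustworthy to correct with
```

A corrected value would still be a natural candidate for a validation warning, so the user can see that the stored TOTAL was derived rather than read.
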
### Test Coverage

- **Unit tests:** 20+ test cases, >90% coverage
- **Integration tests:** 10+ real receipt tests
- **Manual testing:** 6 scenarios (Five-Holding, faded receipt, payment methods, etc.)

---

## Implementation Priority

### Day 1: Core Validation
1. Create `ocr/validation.py` module
2. Implement 7 validation rules
3. Write unit tests
4. ✅ Checkpoint: All unit tests pass

### Day 2: OCR Integration
1. Add `preprocess_medium()` method
2. Update `_merge_extractions()` with validation
3. Update API schemas
4. Add database migration
5. ✅ Checkpoint: Five-Holding receipt works

### Day 3: Testing & Polish
1. Write integration tests
2. Update frontend components
3. Manual testing
4. Bug fixes
5. ✅ Checkpoint: Production-ready

---

## Risks & Mitigations

| Risk | Mitigation |
|------|------------|
| Medium OCR still causes errors | Tesseract fallback + validation catches issues |
| CUI validation too strict | Warning only (not error), allow override |
| Performance impact | Validation <10ms (negligible vs. OCR time) |
| Breaking API changes | Add new fields, keep existing unchanged |

---

## Tech Stack Integration

### Backend Patterns (CLAUDE.md compliant)
- ✅ SQLModel + Alembic migrations
- ✅ Pydantic v2 schemas
- ✅ Service layer pattern (logic in services, not routers)
- ✅ Type hints + docstrings

### Frontend Patterns (CLAUDE.md compliant)
- ✅ Vue 3 Composition API
- ✅ PrimeVue components
- ✅ Shared CSS patterns (`.roa-card`, `.roa-metric`)
- ✅ No `:deep()` selectors

### Testing Patterns
- ✅ pytest for backend
- ✅ >90% coverage target
- ✅ Integration tests with real data

---

## Next Steps

1. **Review specification** → `/mnt/e/proiecte/roa2web/.auto-build/specs/bon-ocr-validation/spec.md`
2. **Create feature branch** → `feature/bon-ocr-validation`
3. **Implement Phase 1** → Validation engine + tests (Day 1)
4. **Implement Phase 2** → OCR integration (Day 2)
5. **Implement Phase 3** → Frontend + testing (Day 3)
6. **Deploy to staging** → Test with production receipts
7. **Monitor for 1 week** → Verify no regressions
8. **Deploy to production** → Roll out gradually

---

**Estimated Completion:** 2026-01-02 (3 working days)
**Status:** Ready for Implementation