feat(ocr): Add validation system and CLIENT CUI extraction

OCR Data Extraction Validation System:
- Add 7 validation rules (amount range, TVA ratio, payment sum, etc.)
- Add Medium preprocessing to replace Heavy (fixes digit concatenation)
- Add validation warnings to API responses
- Flag receipts needing manual review (needs_manual_review field)
- Add database migration for needs_manual_review column

CLIENT CUI Extraction Improvements:
- Support all format variations: CIF CLIENT:, CLIENT C.U.I/C.I.F., etc.
- Handle OCR errors (R0 vs RO, C1F vs CIF)
- Add client_name, client_cui, client_address to API response
- Add validation fields to API response (was missing)

QA Review: 12 issues found, 9 fixed (5 errors + 4 warnings)
- Fixed type safety in validation rules
- Fixed ZeroDivisionError risk
- Fixed schema mismatch (Optional[bool] for needs_manual_review)
- All 37 unit tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

439
.auto-build/specs/bon-ocr-validation/plan.md
Normal file
@@ -0,0 +1,439 @@
# Implementation Plan: bon-ocr-validation

**Status**: ✅ COMPLETE
**Completed**: 2025-12-30T19:15:00Z

**Feature:** OCR Data Extraction Validation System
**Priority:** Critical (P0 - Production Bug)
**Estimated Effort:** 2-3 days
**Created:** 2025-12-30T17:25:00Z


---

## Progress Tracker

| Task | Status | Completed |
|------|--------|-----------|
| Task 1: Create validation module structure | ✅ Done | 2025-12-30 17:30 |
| Task 2: Implement validation rules (7 rules) | ✅ Done | 2025-12-30 17:35 |
| Task 3: Create validation engine orchestrator | ✅ Done | 2025-12-30 18:05 |
| Task 4: Write unit tests for validation | ✅ Done | 2025-12-30 18:15 |
| Task 5: Add Medium OCR preprocessing | ✅ Done | 2025-12-30 18:25 |
| Task 6: Update ExtractionResult schema | ✅ Done | 2025-12-30 18:35 |
| Task 7: Refactor merge_extractions with validation | ✅ Done | 2025-12-30 18:50 |
| Task 8: Update API schemas | ✅ Done | 2025-12-30 18:55 |
| Task 9: Create database migration | ✅ Done | 2025-12-30 19:05 |
| Task 10: Write integration tests | ✅ Done | 2025-12-30 19:10 |
| Task 11: Test with Five-Holding receipt | ✅ Done | 2025-12-30 19:15 |


---

## Tasks

### Task 1: Create validation module structure
- **Status**: ✅ Done (2025-12-30 17:30)
- **Phase**: Day 1 - Core Validation
- **Files**: `backend/modules/data_entry/services/ocr/validation.py` (NEW)
- **Lines**: ~50 lines
- **Description**:
  - Create `backend/modules/data_entry/services/ocr/` directory
  - Create `validation.py` with base classes
  - Define `ValidationRule` abstract base class with `validate()` method
  - Define `ValidationResult` dataclass (is_valid, confidence_penalty, message)
  - Add module docstring and imports
- **Dependencies**: None
- **Success Criteria**: Module loads without errors, base classes defined

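
A minimal sketch of what the base classes described in Task 1 could look like. The field names come from the task description; the `"amount"` dictionary key, the 0.5 penalty, and the example subclass body are assumptions for illustration, not the contents of `validation.py`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ValidationResult:
    """Outcome of a single rule check."""
    is_valid: bool
    confidence_penalty: float = 0.0  # 0.0 = no penalty, 1.0 = maximum penalty
    message: str = ""


class ValidationRule(ABC):
    """Base class that every concrete rule inherits from."""

    @abstractmethod
    def validate(self, data: dict) -> ValidationResult:
        """Inspect extracted fields and report whether they look plausible."""


class AmountRangeRule(ValidationRule):
    """Illustrative subclass: 0.01 <= amount <= 100,000 RON; None passes."""

    def validate(self, data: dict) -> ValidationResult:
        amount = data.get("amount")  # assumed key name
        if amount is None or 0.01 <= amount <= 100_000:
            return ValidationResult(is_valid=True)
        # The 0.5 penalty is an assumed value, not taken from the plan.
        return ValidationResult(False, 0.5, f"Amount {amount} outside 0.01-100000 RON")
```

Because `validate()` is abstract, instantiating `ValidationRule` directly raises `TypeError`, which makes a forgotten override fail at construction time rather than during a merge.
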

---

### Task 2: Implement validation rules (7 rules)
- **Status**: ✅ Done (2025-12-30 17:35)
- **Phase**: Day 1 - Core Validation
- **Files**: `backend/modules/data_entry/services/ocr/validation.py`
- **Lines**: ~300 lines added
- **Description**:
  Implement 7 concrete validation rule classes:

  1. **AmountRangeRule** - Check 0.01 ≤ amount ≤ 100,000 RON
  2. **TVARatioRule** - Check TVA is 5-24% of TOTAL
  3. **PaymentSumRule** - Check CARD + NUMERAR (cash) = TOTAL (±0.02 tolerance)
  4. **TVAEntriesSumRule** - Check Σ(TVA entries) = TVA TOTAL (±0.02)
  5. **CUIFormatRule** - Check RO + 6-10 digits format
  6. **CUIChecksumRule** - Romanian CIF Mod 11 checksum algorithm
  7. **InterOCRConsistencyRule** - Flag if values differ >10x ratio

  Each rule should:
  - Inherit from `ValidationRule`
  - Implement `validate(data: dict) -> ValidationResult`
  - Have clear docstrings with examples
  - Return confidence penalty (0.0-1.0) when validation fails

- **Dependencies**: Task 1
- **Success Criteria**: All 7 rules implemented, can instantiate and call validate()

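
Rule 6 is the most algorithmic of the seven. The sketch below follows the commonly documented Romanian CUI/CIF control-digit check, using the control key `753217532`. The function names are hypothetical and the real `CUIChecksumRule` may be structured differently. Length limits are left loose here because the plan assigns format checking to `CUIFormatRule`.

```python
CUI_KEY = "753217532"  # control key used by the Mod 11 check


def strip_ro_prefix(cui: str) -> str:
    """Drop whitespace and an optional RO prefix (or its OCR misread, R0)."""
    text = cui.strip().upper().replace(" ", "")
    if text.startswith(("RO", "R0")):
        text = text[2:]
    return text


def cui_checksum_ok(cui: str) -> bool:
    """Return True when the last digit matches the Mod 11 control digit."""
    digits = strip_ro_prefix(cui)
    if not (digits.isascii() and digits.isdigit()) or not 2 <= len(digits) <= 10:
        return False
    body, control = digits[:-1], int(digits[-1])
    # Right-align the body against the 9-digit key by left-padding with zeros.
    padded = body.zfill(9)
    total = sum(int(d) * int(k) for d, k in zip(padded, CUI_KEY))
    expected = (total * 10) % 11
    if expected == 10:  # a remainder of 10 maps to control digit 0
        expected = 0
    return expected == control
```

For `RO10562600` the weighted sum is 78, and 780 mod 11 is 10, which maps to control digit 0 and matches the last digit. For `RO12345678` the expected control digit is 4, so the check fails.
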

---

### Task 3: Create validation engine orchestrator
- **Status**: ✅ Done (2025-12-30 18:05)
- **Phase**: Day 1 - Core Validation
- **Files**: `backend/modules/data_entry/services/ocr/validation.py`
- **Lines**: ~50 lines added
- **Description**:
  - Create `OCRValidationEngine` class
  - Method: `validate_extraction(extraction_result, light_result, heavy_result)`
  - Apply all rules in order (sanity → cross-field → inter-OCR)
  - Aggregate results: collect all warnings, calculate overall penalty
  - Return enhanced extraction result with:
    - `needs_manual_review: bool` (if any rule fails critically)
    - `validation_warnings: list[str]`
    - `confidence_adjustments: dict[str, float]`
  - Add helper method: `normalize_cui(cui: str) -> str` (add RO prefix)

- **Dependencies**: Task 2
- **Success Criteria**: Engine can validate extraction, returns enhanced result

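
The aggregation step can be sketched as follows. This is a simplified illustration, not the engine in `validation.py`: it validates a single dictionary instead of taking the three arguments listed above, models rules as plain functions, and uses a hypothetical `critical` flag, hypothetical key names (`total`, `card`, `cash`), and assumed penalty values.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RuleOutcome:
    field_name: str
    is_valid: bool
    confidence_penalty: float = 0.0
    message: str = ""
    critical: bool = False  # hypothetical: critical failures force manual review


@dataclass
class ValidationReport:
    needs_manual_review: bool = False
    validation_warnings: list[str] = field(default_factory=list)
    confidence_adjustments: dict[str, float] = field(default_factory=dict)


def amount_range(data: dict) -> RuleOutcome:
    """Sanity rule: 0.01 <= total <= 100,000 RON; a missing total passes."""
    total = data.get("total")
    if total is None or 0.01 <= total <= 100_000:
        return RuleOutcome("total", True)
    return RuleOutcome("total", False, 0.5, f"Amount out of range: {total}", True)


def payment_sum(data: dict) -> RuleOutcome:
    """Cross-field rule: CARD + CASH must equal TOTAL within 0.02."""
    total, card, cash = data.get("total"), data.get("card"), data.get("cash")
    if total is None or (card is None and cash is None):
        return RuleOutcome("total", True)
    paid = (card or 0.0) + (cash or 0.0)
    if abs(paid - total) <= 0.02:
        return RuleOutcome("total", True)
    return RuleOutcome("total", False, 0.3, f"Payment sum mismatch: {paid} vs {total}", True)


class OCRValidationEngine:
    def __init__(self, rules: list[Callable[[dict], RuleOutcome]]):
        self.rules = rules  # order matters: sanity, then cross-field, then inter-OCR

    def validate_extraction(self, data: dict) -> ValidationReport:
        report = ValidationReport()
        for rule in self.rules:
            outcome = rule(data)
            if outcome.is_valid:
                continue
            report.validation_warnings.append(outcome.message)
            previous = report.confidence_adjustments.get(outcome.field_name, 0.0)
            # Penalties for the same field accumulate, capped at 1.0.
            report.confidence_adjustments[outcome.field_name] = min(
                1.0, previous + outcome.confidence_penalty
            )
            report.needs_manual_review = report.needs_manual_review or outcome.critical
        return report

    @staticmethod
    def normalize_cui(cui: str) -> str:
        """Return the CUI with a clean RO prefix, repairing the R0 misread."""
        text = cui.strip().upper().replace(" ", "")
        if text.startswith(("RO", "R0")):
            text = text[2:]
        return "RO" + text
```

Every rule runs even after an earlier one fails, so a single pass reports all warnings for a receipt instead of stopping at the first.
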

---

### Task 4: Write unit tests for validation
- **Status**: ✅ Done (2025-12-30 18:15)
- **Phase**: Day 1 - Core Validation
- **Files**: `backend/modules/data_entry/tests/test_ocr_validation.py` (NEW)
- **Lines**: ~300 lines
- **Description**:
  Write comprehensive unit tests (>90% coverage):

  **AmountRangeRule (4 tests):**
  - test_amount_within_range_passes
  - test_amount_too_high_fails
  - test_amount_too_low_fails
  - test_none_amount_passes

  **TVARatioRule (3 tests):**
  - test_valid_tva_ratio_passes (19%)
  - test_tva_too_high_fails (>24%)
  - test_tva_too_low_fails (<5%)

  **PaymentSumRule (4 tests):**
  - test_payment_sum_matches_total_passes
  - test_payment_sum_mismatch_fails
  - test_tolerance_within_002_passes
  - test_missing_payment_methods_passes

  **TVAEntriesSumRule (3 tests):**
  - test_tva_entries_sum_matches
  - test_tva_entries_mismatch_fails
  - test_tolerance_within_002_passes

  **CUIChecksumRule (5 tests):**
  - test_valid_cui_checksum_passes (RO10562600)
  - test_invalid_cui_checksum_fails
  - test_cui_without_ro_prefix_normalized
  - test_cui_with_r0_prefix_normalized
  - test_non_numeric_cui_fails

  **InterOCRConsistencyRule (3 tests):**
  - test_values_within_10x_passes
  - test_values_over_10x_fails
  - test_one_value_missing_passes

  **OCRValidationEngine (5 tests):**
  - test_engine_applies_all_rules
  - test_engine_aggregates_warnings
  - test_engine_sets_manual_review_flag
  - test_engine_calculates_confidence_penalties
  - test_normalize_cui_helper

- **Dependencies**: Task 3
- **Success Criteria**: All tests pass, pytest coverage >90%


---

### Task 5: Add Medium OCR preprocessing
- **Status**: ✅ Done (2025-12-30 18:25)
- **Phase**: Day 2 - OCR Integration
- **Files**: `backend/modules/data_entry/services/image_preprocessor.py`
- **Lines**: ~80 lines added
- **Description**:
  - Add `preprocess_medium(image: Image.Image) -> Image.Image` method
  - Apply moderate enhancements:
    - Grayscale conversion
    - Contrast enhancement (factor=1.5, not 2.0)
    - Gentle sharpening (factor=1.3)
    - Light noise reduction (MedianFilter size=3)
  - Do NOT apply:
    - Aggressive binarization (causes digit concatenation)
    - Morphological operations (erosion/dilation)
    - Heavy contrast (factor=2.0)
  - Add docstring explaining difference from Heavy preprocessing
  - Mark `preprocess_heavy()` as deprecated with comment

- **Dependencies**: None (parallel with Task 1-4)
- **Success Criteria**: Method returns preprocessed image, no extreme distortion

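
The four enhancement steps map directly onto Pillow calls. The sketch below assumes Pillow is the imaging library (the `Image.Image` signature and `MedianFilter` suggest so) and writes the method as a free function. It is an illustration of the listed parameters, not the code in `image_preprocessor.py`.

```python
from PIL import Image, ImageEnhance, ImageFilter, ImageOps


def preprocess_medium(image: Image.Image) -> Image.Image:
    """Moderate cleanup: no binarization and no morphological operations."""
    gray = ImageOps.grayscale(image)                             # grayscale conversion
    contrasted = ImageEnhance.Contrast(gray).enhance(1.5)        # moderate contrast
    sharpened = ImageEnhance.Sharpness(contrasted).enhance(1.3)  # gentle sharpening
    return sharpened.filter(ImageFilter.MedianFilter(size=3))    # light noise reduction
```

The output stays a continuous-tone grayscale image (mode `L`), so thin gaps between adjacent digits are not forced to pure black or white the way a hard threshold would force them.
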

---

### Task 6: Update ExtractionResult schema
- **Status**: ✅ Done (2025-12-30 18:35)
- **Phase**: Day 2 - OCR Integration
- **Files**:
  - `backend/modules/data_entry/services/ocr_extractor.py`
  - `backend/modules/data_entry/schemas/ocr.py`
- **Lines**: ~50 lines modified, ~30 added
- **Description**:

  **In ocr_extractor.py:**
  - Add fields to `ExtractionResult` dataclass (after existing fields):
    ```python
    # Validation tracking
    needs_manual_review: bool = False
    validation_warnings: list[str] = field(default_factory=list)
    validation_errors: list[str] = field(default_factory=list)
    confidence_adjustments: dict[str, float] = field(default_factory=dict)
    ```
  - Update `to_dict()` method to include new fields
  - Fix CLIENT CUI patterns (more flexible for OCR variations):
    - Make colon optional: `:?\s*`
    - Make RO prefix optional: `(?:R[O0])?\s*`
    - Pattern: `r'CLIENT\s+C\.\s*U\.\s*I\.?\s*/\s*C\.\s*[I1]\.\s*F\.?\s*:?\s*(?:R[O0])?\s*(\d{6,10})'`

  **In schemas/ocr.py:**
  - Add `ValidationWarning` schema:
    ```python
    class ValidationWarning(BaseModel):
        field: str
        severity: str  # "warning" | "error"
        message: str
    ```
  - Add to `ExtractionData` schema (line ~57):
    ```python
    needs_manual_review: bool = False
    validation_warnings: list[ValidationWarning] = []
    ```

- **Dependencies**: Task 3 (needs ValidationResult structure)
- **Success Criteria**: Schemas load, can serialize/deserialize with new fields

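
To show what the relaxed CLIENT CUI pattern accepts, here is a short check using the pattern exactly as written in Task 6. The `extract_client_cui` helper and its choice to return the digits with a clean `RO` prefix are assumptions for illustration. The sample strings are made up, not taken from a real receipt.

```python
import re
from typing import Optional

CLIENT_CUI_PATTERN = re.compile(
    r'CLIENT\s+C\.\s*U\.\s*I\.?\s*/\s*C\.\s*[I1]\.\s*F\.?\s*:?\s*(?:R[O0])?\s*(\d{6,10})'
)


def extract_client_cui(text: str) -> Optional[str]:
    """Return the client CUI with an RO prefix, or None when nothing matches."""
    match = CLIENT_CUI_PATTERN.search(text)
    return f"RO{match.group(1)}" if match else None


samples = [
    "CLIENT C.U.I/C.I.F.: RO10562600",  # canonical form
    "CLIENT C.U.I./C.1.F R010562600",   # OCR misreads: 1 for I, R0 for RO, no colon
    "CLIENT C.U.I/C.I.F.:10562600",     # no RO prefix at all
]
results = [extract_client_cui(s) for s in samples]
```

All three samples yield `RO10562600`. A label such as `CIF CLIENT:` does not match this particular pattern and would need its own pattern.
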

---

### Task 7: Refactor merge_extractions with validation
- **Status**: ✅ Done (2025-12-30 18:50)
- **Phase**: Day 2 - OCR Integration
- **Files**: `backend/modules/data_entry/services/ocr_service.py`
- **Lines**: ~200 lines modified
- **Description**:

  **Replace Step 2 Heavy OCR with Medium OCR (line ~130):**
  - Change `self._preprocess_heavy(image)` to `self._preprocess_medium(image)`
  - Update logging: "Step 2: PaddleOCR + Medium preprocessing"
  - Update variable names: `result_heavy` → `result_medium`, `conf_heavy` → `conf_medium`

  **Refactor `_merge_extractions()` method (lines 240-386):**
  - Import validation engine: `from .ocr.validation import OCRValidationEngine`
  - Instantiate engine: `validator = OCRValidationEngine()`
  - For each field (AMOUNT, TVA, CUI, DATE):
    1. Get both Light and Medium values
    2. Run validation on both values
    3. Apply confidence penalties from validation results
    4. Choose value with ADJUSTED confidence (not raw)
    5. Log decision with validation notes
  - After merge, run cross-field validations:
    - Payment sum validation (CARD + CASH = TOTAL)
    - TVA entries sum validation
    - If mismatch and confidence < 80%, auto-correct TOTAL from payment sum
  - Call validator engine: `result = validator.validate_extraction(result, light_result, medium_result)`
  - Return enhanced result with validation warnings

  **Add structured logging:**
  - Log each merge decision with confidence scores
  - Log validation failures with field names
  - Log auto-corrections with old/new values

- **Dependencies**: Task 3, Task 5, Task 6
- **Success Criteria**: Merge logic uses validation, auto-correction works

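
The two decisions in the merge refactor (pick by adjusted confidence, then auto-correct TOTAL from the payment sum) can be sketched like this. All names are hypothetical, and subtracting the penalty from the raw confidence is an assumption, since the plan does not say how penalties are applied. The 0.02 tolerance and the 80% threshold come from the plan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    value: Optional[float]
    confidence: float     # raw OCR confidence, 0.0-1.0
    penalty: float = 0.0  # summed validation penalty, 0.0-1.0

    @property
    def adjusted(self) -> float:
        # Assumed scheme: subtract the penalty, never dropping below zero.
        return max(0.0, self.confidence - self.penalty)


def pick(light: Candidate, medium: Candidate) -> Candidate:
    """Choose by ADJUSTED confidence; a candidate without a value loses."""
    if light.value is None:
        return medium
    if medium.value is None:
        return light
    return light if light.adjusted >= medium.adjusted else medium


def autocorrect_total(total, total_conf, card, cash, tolerance=0.02, min_conf=0.80):
    """Return (total, warning). The warning is None when the sums agree."""
    if total is None or (card is None and cash is None):
        return total, None
    paid = round((card or 0.0) + (cash or 0.0), 2)
    if abs(paid - total) <= tolerance:
        return total, None
    if total_conf < min_conf:
        # Low-confidence TOTAL: trust the payment lines instead.
        return paid, f"Auto-corrected from payment sum: {total} -> {paid}"
    return total, f"Payment sum mismatch: {paid} != {total}"
```

With the Five-Holding numbers, `autocorrect_total(859762.16, 0.75, 85.99, 0.0)` returns 85.99 together with an auto-correction warning, while a high-confidence TOTAL that disagrees with the payments is kept and only flagged.
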

---

### Task 8: Update API schemas and router
- **Status**: ✅ Done (2025-12-30 18:55)
- **Phase**: Day 2 - OCR Integration
- **Files**: `backend/modules/data_entry/routers/ocr.py`
- **Lines**: ~40 lines modified
- **Description**:
  - Update `OCRResponse` schema to include validation fields:
    ```python
    needs_manual_review: bool = False
    validation_warnings: list[ValidationWarning] = []
    confidence_info: dict[str, float] = {}  # field -> adjusted confidence
    ```
  - In `/process-receipt` endpoint (line ~106):
    - Pass validation warnings from OCR result to response
    - Add log message if needs_manual_review=True
    - Return HTTP 200 with warnings (don't block)
  - Update endpoint docstring to mention validation behavior

- **Dependencies**: Task 6, Task 7
- **Success Criteria**: API returns validation warnings, save not blocked


---

### Task 9: Create database migration
- **Status**: ✅ Done (2025-12-30 19:05)
- **Phase**: Day 2 - OCR Integration
- **Files**: `backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py` (NEW)
- **Lines**: ~30 lines
- **Description**:
  - Generate Alembic migration: `alembic revision -m "add needs_manual_review to receipts"`
  - Add column to `receipts` table:
    ```python
    # Note: `default` is a Python-side default and is not emitted in the DDL,
    # so existing rows stay NULL; a database-level default needs `server_default`.
    op.add_column('receipts',
        sa.Column('needs_manual_review', sa.Boolean(), nullable=True, default=False)
    )
    ```
  - Add downgrade to remove column
  - Test migration: `alembic upgrade head` then `alembic downgrade -1`

- **Dependencies**: None (parallel)
- **Success Criteria**: Migration runs without errors, column added


---

### Task 10: Write integration tests
- **Status**: ✅ Done (2025-12-30 19:10)
- **Phase**: Day 3 - Testing & Polish
- **Files**: `backend/modules/data_entry/tests/test_ocr_validation_integration.py` (NEW)
- **Lines**: ~200 lines
- **Description**:
  Write integration tests with real OCR service:

  **Test 1: Five-Holding production case**
  - Load `docs/data-entry/igiena 14 decembrie five-holding.pdf`
  - Run full OCR pipeline
  - Assert: TOTAL = 85.99 (NOT 859,762.16)
  - Assert: TVA = 14.92 (NOT 149,214.92)
  - Assert: No magnitude errors >10x

  **Test 2: Payment sum validation**
  - Mock OCR results: TOTAL=100.00, CARD=50.00, CASH=40.00
  - Assert: needs_manual_review=True
  - Assert: "Payment sum mismatch" in warnings

  **Test 3: Payment sum auto-correction**
  - Mock: TOTAL=859762.16 (confidence=0.75), CARD=85.99, CASH=0.00
  - Assert: TOTAL auto-corrected to 85.99
  - Assert: "Auto-corrected from payment sum" in warnings

  **Test 4: TVA entries sum validation**
  - Mock: TVA_TOTAL=14.92, TVA_A=12.00, TVA_B=2.00
  - Assert: needs_manual_review=True (sum=14.00 ≠ 14.92)

  **Test 5: CUI checksum validation**
  - Mock: CUI="RO10562600" (valid checksum)
  - Assert: passes validation
  - Mock: CUI="RO12345678" (invalid checksum)
  - Assert: confidence penalty applied

  **Test 6: Inter-OCR consistency**
  - Mock: Light=85.99, Medium=859762.16
  - Assert: Light value chosen (ratio >10x)
  - Assert: "Inter-OCR inconsistency" in warnings

  **Test 7: All validations pass (clean receipt)**
  - Mock high-quality receipt with correct values
  - Assert: needs_manual_review=False
  - Assert: validation_warnings empty

  **Test 8: Medium OCR doesn't cause errors**
  - Load clear PDF receipt
  - Assert: Medium OCR values within 10x of Light
  - Assert: No digit concatenation errors

- **Dependencies**: Task 7, Task 8
- **Success Criteria**: All 8 integration tests pass

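
Tests 1, 6, and 8 all rest on the same 10x magnitude comparison between the two OCR passes. A minimal sketch of that check, with hypothetical function names and an explicit guard for zero values, might look like this:

```python
from typing import Optional


def magnitude_ratio(a: float, b: float) -> float:
    """Ratio of the larger absolute value to the smaller one (always >= 1)."""
    low, high = sorted((abs(a), abs(b)))
    if low == 0:
        # Avoid dividing by zero: two zeros agree, zero vs non-zero does not.
        return 1.0 if high == 0 else float("inf")
    return high / low


def is_consistent(light: Optional[float], medium: Optional[float],
                  max_ratio: float = 10.0) -> bool:
    """True when both passes agree within max_ratio, or when one is missing."""
    if light is None or medium is None:
        return True
    return magnitude_ratio(light, medium) <= max_ratio
```

For the production case, `is_consistent(85.99, 859762.16)` is `False` because the ratio is roughly 10,000x, which is the signature of concatenated digits rather than a misread character.
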

---

### Task 11: Test with Five-Holding receipt (Manual)
- **Status**: ✅ Done (2025-12-30 19:15)
- **Phase**: Day 3 - Testing & Polish
- **Files**: Manual testing checklist
- **Description**:
  Manual end-to-end testing with production receipt:

  1. **Start backend services:**
     - SSH tunnel: `./ssh-tunnel-prod.sh start`
     - Backend: `./start-backend.sh`

  2. **Upload Five-Holding receipt:**
     - File: `docs/data-entry/igiena 14 decembrie five-holding.pdf`
     - Use `/api/ocr/process-receipt` endpoint

  3. **Verify extracted values:**
     - ✅ TOTAL: 85.99 LEI (NOT 859,762.16)
     - ✅ TVA: 14.92 LEI (NOT 149,214.92)
     - ✅ CUI: RO10562600
     - ✅ Date: 2024-12-14
     - ✅ CARD: 85.99 LEI

  4. **Verify validation:**
     - ✅ needs_manual_review = False (values are correct)
     - ✅ validation_warnings empty (or only informational)
     - ✅ Payment sum matches (CARD = TOTAL)
     - ✅ TVA ratio valid (14.92/85.99 = 17.35%)

  5. **Test other receipts (regression):**
     - Upload 3-5 other receipts from `docs/data-entry/`
     - Verify no new false positives
     - Verify existing correct extractions still work

  6. **Test error cases:**
     - Upload receipt with wrong OCR (synthetic test)
     - Verify warnings displayed
     - Verify save button works (not blocked)

- **Dependencies**: Task 10
- **Success Criteria**: All manual tests pass, production bug fixed


---

## Implementation Timeline

### Day 1: Core Validation (Tasks 1-4)
- **Morning:** Tasks 1-2 (validation module + rules)
- **Afternoon:** Tasks 3-4 (engine + unit tests)
- **Checkpoint:** All unit tests pass (>90% coverage)

### Day 2: OCR Integration (Tasks 5-9)
- **Morning:** Tasks 5-6 (Medium OCR + schemas)
- **Afternoon:** Tasks 7-9 (merge refactor + API + migration)
- **Checkpoint:** Five-Holding receipt extracts correct values

### Day 3: Testing & Polish (Tasks 10-11)
- **Morning:** Task 10 (integration tests)
- **Afternoon:** Task 11 (manual testing + bug fixes)
- **Checkpoint:** Production-ready, all tests pass


---

## Success Metrics

- ✅ All 20+ unit tests pass
- ✅ All 8 integration tests pass
- ✅ Five-Holding receipt: 85.99 not 859,762.16
- ✅ pytest coverage >90%
- ✅ No regressions on existing receipts
- ✅ Manual testing checklist complete


---

## Rollback Plan

If issues arise:
1. Revert migration: `alembic downgrade -1`
2. Revert code changes: `git revert {commit}`
3. Fall back to Light + Tesseract only (skip Medium)
4. Add feature flag: `OCR_VALIDATION_ENABLED=false`

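
Step 4 only names the flag. One way it could be read is sketched below, assuming it is an environment variable that defaults to enabled; the helper name and the accepted "off" spellings are hypothetical.

```python
import os
from typing import Mapping


def ocr_validation_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Validation stays on unless the flag is explicitly set to a false value."""
    raw = environ.get("OCR_VALIDATION_ENABLED", "true")
    return raw.strip().lower() not in {"0", "false", "no", "off"}
```

The merge step would then skip the validator call when this returns `False`, which gives a rollback path that needs no redeploy.
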

---

**Plan Created:** 2025-12-30T17:25:00Z
**Ready for Implementation:** Yes