OCR Data Extraction Validation System:

- Add 7 validation rules (amount range, TVA ratio, payment sum, etc.)
- Add Medium preprocessing to replace Heavy (fixes digit concatenation)
- Add validation warnings to API responses
- Flag receipts needing manual review (needs_manual_review field)
- Add database migration for needs_manual_review column

CLIENT CUI Extraction Improvements:

- Support all format variations: CIF CLIENT:, CLIENT C.U.I/C.I.F., etc.
- Handle OCR errors (R0 vs RO, C1F vs CIF)
- Add client_name, client_cui, client_address to API response
- Add validation fields to API response (was missing)

QA Review: 12 issues found, 9 fixed (5 errors + 4 warnings)

- Fixed type safety in validation rules
- Fixed ZeroDivisionError risk
- Fixed schema mismatch (Optional[bool] for needs_manual_review)
- All 37 unit tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

# Feature Specification: OCR Data Extraction Validation System

**Feature ID:** bon-ocr-validation
**Priority:** Critical (P0 - Production Bug)
**Complexity:** High
**Estimated Effort:** 2-3 days
**Created:** 2025-12-30
**Module:** Data Entry (backend/modules/data_entry/)

---

## Overview

Fix a critical OCR data extraction issue in which PaddleOCR Heavy preprocessing (88% confidence) overwrites correct Light OCR (98% confidence) data with garbage values, causing roughly 10,000x magnitude errors in production receipts.

**Value Proposition:** Prevent incorrect financial data from entering the system, reduce manual corrections, improve user trust in OCR accuracy.

---

## Problem Statement

### Current Behavior (BROKEN)

The OCR processing pipeline (`backend/modules/data_entry/services/ocr_service.py`) uses a 3-step adaptive approach:
1. **Step 1:** PaddleOCR + Light preprocessing (fast, high confidence)
2. **Step 2:** PaddleOCR + Heavy preprocessing (faded receipts)
3. **Step 3:** Tesseract (complement missing fields only)

**Critical Bug:** The `_merge_extractions()` method (lines 240-386) picks values based on OCR confidence scores alone, WITHOUT validating the actual extracted values.

### Real Production Example (Five-Holding Receipt)

| Field | Light OCR (98%) ✅ | Heavy OCR (88%) ❌ | Final Result ❌ |
|-------|-------------------|-------------------|-----------------|
| **TOTAL** | 85.99 LEI | 859,762.16 LEI | **859,762.16** (10,000x error!) |
| **TVA** | 14.92 LEI | 149,214.92 LEI | **149,214.92** (10,000x error!) |
| **CUI** | R010562600 | (not found) | R010562600 |
| **Date** | 2025-10-11 | 2025-10-11 | 2025-10-11 |
| **Confidence** | 98% | 88% | 88% (wrong source!) |

**Root Cause:** Heavy preprocessing causes digit concatenation on high-quality PDFs. The binarization and morphological operations (lines 153-164 in `image_preprocessor.py`) merge adjacent numbers, creating garbage values.

### Impact

- **Data Integrity:** Incorrect amounts enter the accounting system
- **User Trust:** Users lose confidence in OCR accuracy
- **Manual Work:** Requires manual verification of ALL OCR extractions
- **Financial Risk:** Wrong amounts could be approved without review

---

## User Stories

### 1. As a user uploading a clear PDF receipt
**I want** OCR to extract correct values from the first pass
**So that** I don't have to manually correct obvious errors

**Acceptance Criteria:**
- Light OCR correctly extracts 85.99 LEI (not 859,762.16)
- Heavy OCR is skipped when Light OCR confidence >= 90%
- No 10,000x magnitude errors

### 2. As a user submitting a receipt with warnings
**I want** to be able to save receipts with validation warnings
**So that** I can submit for review even if OCR isn't perfect

**Acceptance Criteria:**
- Save button works with warnings (not blocked)
- Receipt marked with `needs_manual_review=True`
- Warnings displayed clearly in UI

### 3. As a supervisor reviewing receipts
**I want** to see which receipts need manual review
**So that** I can prioritize validation efforts

**Acceptance Criteria:**
- Filter by "Needs Review" flag
- Validation warnings shown in detail view
- Clear indication of which fields are suspicious

### 4. As a system validating cross-field data
**I want** to validate CARD + NUMERAR = TOTAL
**So that** payment methods match the total amount

**Acceptance Criteria:**
- Cross-validation: sum of payment methods = TOTAL (±0.02 RON tolerance)
- If mismatch, flag for review
- Auto-correct TOTAL from payment sum if confidence < 80%

### 5. As a system validating TVA entries
**I want** to validate Σ(TVA entries) = TVA TOTAL
**So that** individual TVA lines match the total TVA

**Acceptance Criteria:**
- Cross-validation: sum of TVA entries = TVA TOTAL (±0.02 RON tolerance)
- TVA rate validation (5-24% of TOTAL)
- If mismatch, flag for review

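The TVA cross-check in user story 5 is not spelled out elsewhere in this spec, so here is a minimal sketch of how it might look. The function name `tva_entries_mismatch` and its signature are assumptions made for illustration; only the ±0.02 RON tolerance comes from the acceptance criteria.

```python
from decimal import Decimal
from typing import List, Optional

TOLERANCE = Decimal("0.02")  # ±0.02 RON, from the acceptance criteria


def tva_entries_mismatch(
    entries: List[Decimal], tva_total: Optional[Decimal]
) -> Optional[Decimal]:
    """Return sum(entries) - tva_total when it exceeds the tolerance, else None.

    Hypothetical helper: returns None as well when there is nothing to compare.
    """
    if tva_total is None or not entries:
        return None
    diff = sum(entries, Decimal("0")) - tva_total
    return diff if abs(diff) > TOLERANCE else None
```

A non-`None` result would translate into a warning and set the review flag; the size of the difference can go into the warning message.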
---

## Functional Requirements

### Core Requirements (Must-Have)

#### 1. Multi-Layer Validation Pipeline

**FR-1.1:** Absolute value sanity checks
- Amount range: 0.01 - 100,000 RON
- Max 2 decimal places
- Date: not in future, not older than 10 years (2015+)
- CUI: 6-10 digits, valid Mod 11 checksum

**FR-1.2:** Cross-field correlation validation
- TVA: 5-24% of TOTAL amount (Romanian rates: 5%, 9%, 11%, 19%, 21%)
- Payment methods: CARD + NUMERAR = TOTAL (±0.02 RON tolerance)
- Inter-OCR consistency: flag if values differ >10x between engines

**FR-1.3:** Auto-correction logic
- If TOTAL is obviously wrong (>10x payment sum), use payment sum
- If TVA > TOTAL, recalculate TOTAL from TVA using reverse formula
- Preserve high-confidence values from Light OCR over low-confidence Heavy OCR

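As an illustration of FR-1.2 and FR-1.3, the payment-sum check and the "obviously wrong TOTAL" test could be sketched as below. This is a sketch under assumptions, not the spec's `PaymentSumRule`: the helper names and the `card` / `cash` parameters (standing for CARD and NUMERAR) are hypothetical, and a missing payment method is assumed to count as zero.

```python
from decimal import Decimal
from typing import Optional

TOLERANCE = Decimal("0.02")  # ±0.02 RON, from FR-1.2


def check_payment_sum(
    total: Optional[Decimal],
    card: Optional[Decimal],
    cash: Optional[Decimal],
) -> Optional[Decimal]:
    """Return the payment sum if it disagrees with TOTAL, else None.

    Hypothetical helper: a missing payment method is treated as zero.
    """
    if total is None or (card is None and cash is None):
        return None  # nothing to compare
    payment_sum = (card or Decimal("0")) + (cash or Decimal("0"))
    if abs(payment_sum - total) <= TOLERANCE:
        return None  # consistent within tolerance
    return payment_sum


def total_is_obviously_wrong(total: Decimal, payment_sum: Decimal) -> bool:
    """FR-1.3: TOTAL is more than 10x the payment sum."""
    return payment_sum > 0 and total > payment_sum * 10
```

For the Five-Holding receipt, CARD 50 + NUMERAR 35.99 against a TOTAL of 859,762.16 yields a mismatch of 85.99, and the 10x test marks the TOTAL as the value to replace.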
**FR-1.4:** Validation result structure
```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class ValidationResult:
    is_valid: bool
    warnings: List[ValidationWarning]  # Non-blocking issues
    errors: List[ValidationError]      # Blocking issues (none for now)
    corrected_fields: Dict[str, Any]   # Auto-corrected values
    needs_manual_review: bool          # Flag for supervisor
```

#### 2. Replace Heavy with Medium OCR

**FR-2.1:** Retire `preprocess_heavy()` from the pipeline (the method itself stays, marked deprecated)
- Current Heavy: aggressive binarization causes digit concatenation
- Reason: Destroys high-quality PDFs while trying to recover faded receipts

**FR-2.2:** Add `preprocess_medium()` method
- Moderate contrast enhancement (CLAHE clipLimit=2.0)
- Light denoising (fastNlMeansDenoising h=6)
- NO binarization, NO morphological operations
- Preserve text boundaries on clear images

**FR-2.3:** Update OCR pipeline
- Step 1: Light preprocessing (unchanged)
- Step 2: **Medium** preprocessing (replaces Heavy)
- Step 3: Tesseract (unchanged)

#### 3. Enhanced CUI Extraction

**FR-3.1:** Romanian CIF validation algorithm
- Implement Mod 11 checksum validation
- Control digit formula: `(sum(digit[i] * weight[i]) * 10) % 11`, where a result of 10 maps to control digit 0
- Weights: `[7, 5, 3, 2, 1, 7, 5, 3, 2]`, right-aligned with the digits before the control digit (left-pad those digits with zeros to 9)

**FR-3.2:** CUI format normalization
- Always add "RO" prefix if missing
- Remove spaces, dashes, dots
- Validate length: 6-10 digits

**FR-3.3:** Improved regex patterns
```python
# Add OCR-tolerant patterns (current patterns are too strict)
CUI_OCR_TOLERANT_PATTERNS = [
    # The RO/R0 prefix is optional as a whole, so a bare "CIF: 10562600" also matches
    r'CIF[:\s]*(?:R[O0])?\s*(\d[\d\s]{5,9})',  # Spaces in CUI
    r'C[I1]F[:\s]*(\d[\d\s]{6,10})',           # C1F (I→1 OCR error)
    r'C\.?\s*[I1]\.?\s*F\.?[:\s]*(\d+)',       # C. I. F. (spaced)
]
```

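To show how such patterns might be applied, here is a small sketch that tries each pattern in order and strips whitespace from the captured digits. The function `extract_cui_digits` and the local `PATTERNS` list are hypothetical names for this example; case-insensitive matching and the 6-10 digit length check (from FR-3.2) are assumptions about how the extractor would use them.

```python
import re
from typing import Optional

# Local copy of the FR-3.3 patterns, for illustration only
PATTERNS = [
    r'CIF[:\s]*(?:R[O0])?\s*(\d[\d\s]{5,9})',
    r'C[I1]F[:\s]*(\d[\d\s]{6,10})',
    r'C\.?\s*[I1]\.?\s*F\.?[:\s]*(\d+)',
]


def extract_cui_digits(text: str) -> Optional[str]:
    """Return the first 6-10 digit CUI candidate found in OCR text, else None."""
    for pattern in PATTERNS:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            # OCR may insert spaces inside the number; drop them before the length check
            digits = re.sub(r'\s', '', match.group(1))
            if 6 <= len(digits) <= 10:
                return digits
    return None
```

Both `CIF: RO10562600` and the OCR-damaged `CIF: R010562600` yield `10562600` here, because the prefix group consumes either `RO` or `R0` before the digits are captured.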
#### 4. User Requirements Integration

**FR-4.1:** Non-blocking validation warnings
- Save button enabled even with warnings
- User can override and submit
- Warnings displayed clearly in UI

**FR-4.2:** Manual review flag
- Database field: `receipts.needs_manual_review` (BOOLEAN)
- Set to `TRUE` if:
  - Any validation warning present
  - Overall confidence < 85%
  - Cross-validation fails

**FR-4.3:** Apply to new uploads only
- No reprocessing of existing receipts
- Validation runs on OCR extraction (POST /api/ocr/extract)
- Migration: add column with default NULL (not FALSE)

### Secondary Requirements (Nice-to-Have)

**FR-S1:** Validation confidence scoring
- Each validation rule contributes to score
- Overall validation confidence: weighted average
- Display in UI alongside OCR confidence

**FR-S2:** Validation rule configurability
- Move hardcoded thresholds to config
- Allow per-company customization
- Admin UI to adjust rules

---

## Technical Requirements

### Files to Create

#### 1. `backend/modules/data_entry/services/ocr/validation.py`
**Purpose:** Validation utilities and rule engine
**Size:** ~400 lines
**Key Classes:**
- `ValidationRule` (base class)
- `AmountRangeRule`, `TVARatioRule`, `PaymentSumRule`, `InterOCRConsistencyRule`, `CUIChecksumRule`, `DateValidityRule`
- `OCRValidationEngine` (orchestrator)

**Example:**
```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from decimal import Decimal
from typing import Any, Dict, List, Optional

# ExtractionResult comes from ocr_extractor.py; ValidationResult is the FR-1.4 dataclass

@dataclass
class ValidationWarning:
    """Non-blocking validation warning."""
    field: str
    rule: str
    message: str
    severity: str  # 'low', 'medium', 'high'
    suggested_value: Optional[Any] = None

class ValidationRule(ABC):
    """Base validation rule."""
    @abstractmethod
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        pass

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Return corrected field values; rules without corrections return {}."""
        return {}

class AmountRangeRule(ValidationRule):
    """Validate amount is in reasonable range (0.01 - 100,000 RON)."""
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        # Compare with None explicitly: Decimal('0.00') is falsy but must still be flagged
        if extraction.amount is not None:
            if extraction.amount < Decimal('0.01'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} is too small (< 0.01 RON)',
                    severity='high'
                ))
            elif extraction.amount > Decimal('100000'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} exceeds limit (> 100,000 RON)',
                    severity='high'
                ))
        return warnings

class OCRValidationEngine:
    """Orchestrate all validation rules."""
    def __init__(self):
        self.rules = [
            AmountRangeRule(),
            TVARatioRule(),
            PaymentSumRule(),
            InterOCRConsistencyRule(),
            CUIChecksumRule(),
            DateValidityRule(),
        ]

    def validate(self, extraction: ExtractionResult) -> ValidationResult:
        """Run all validation rules and return result."""
        all_warnings = []
        corrected_fields = {}

        for rule in self.rules:
            warnings = rule.validate(extraction)
            all_warnings.extend(warnings)

            # Apply auto-corrections
            corrections = rule.auto_correct(extraction)
            corrected_fields.update(corrections)

        needs_review = (
            len(all_warnings) > 0 or
            extraction.overall_confidence < 0.85
        )

        return ValidationResult(
            is_valid=True,  # Never block (warnings only)
            warnings=all_warnings,
            errors=[],
            corrected_fields=corrected_fields,
            needs_manual_review=needs_review
        )
```

#### 2. `backend/modules/data_entry/tests/test_ocr_validation.py`
**Purpose:** Unit tests for validation rules
**Size:** ~300 lines
**Coverage Target:** >90%

**Test Cases:**
- `test_amount_range_valid()` - 85.99 RON passes
- `test_amount_range_too_high()` - 859,762.16 fails
- `test_tva_ratio_valid()` - 14.92/85.99 ≈ 17.4% passes
- `test_tva_ratio_too_high()` - 149,214.92/859,762.16 ≈ 17.4%: the ratio looks valid, but the amounts are wrong
- `test_payment_sum_matches()` - CARD 50 + NUMERAR 35.99 = TOTAL 85.99
- `test_cui_checksum_valid()` - R010562600 passes Mod 11
- `test_cui_checksum_invalid()` - R010562601 fails Mod 11
- `test_inter_ocr_consistency()` - 85.99 vs 859,762.16 = 10,000x flag

#### 3. `backend/modules/data_entry/tests/test_ocr_validation_integration.py`
**Purpose:** Integration tests with full OCR pipeline
**Size:** ~200 lines

**Test Cases:**
- `test_five_holding_receipt()` - Real production case (85.99 not 859,762.16)
- `test_clear_pdf_uses_light_ocr()` - High-quality PDF skips Heavy
- `test_faded_receipt_uses_medium_ocr()` - Thermal receipt uses Medium
- `test_validation_warnings_in_response()` - API returns warnings
- `test_manual_review_flag_set()` - Flag set when confidence < 85%

### Files to Modify

#### 1. `backend/modules/data_entry/services/ocr_service.py`
**Changes:** ~200 lines modified, ~100 lines added

**Key Modifications:**

**A. Replace `_merge_extractions()` (lines 240-386) with validation-aware version:**
```python
def _merge_extractions(
    self,
    light: Optional[ExtractionResult],
    medium: Optional[ExtractionResult]  # Renamed from 'tesseract'
) -> ExtractionResult:
    """
    Merge extractions with VALIDATION-AWARE logic.

    NEW Strategy:
    1. Run validation on both extractions
    2. Prefer extraction with FEWER warnings (not just higher confidence)
    3. For each field, pick value that passes validation
    4. Flag inter-OCR inconsistencies (>10x difference)
    """
    from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine

    # Either pass may be missing; with at most one result there is nothing to merge
    if light is None or medium is None:
        return light or medium or ExtractionResult()

    validator = OCRValidationEngine()

    # Validate both extractions
    light_validation = validator.validate(light)
    medium_validation = validator.validate(medium)

    result = ExtractionResult()

    # === AMOUNT (with validation check) ===
    # The truthiness test also excludes 0, so the ratio below cannot divide by zero
    if light.amount and medium.amount:
        # Check for 10x inconsistency
        ratio = max(light.amount, medium.amount) / min(light.amount, medium.amount)
        if ratio > 10:
            print(f"[Merge] WARNING: Inter-OCR inconsistency: {light.amount} vs {medium.amount} ({ratio:.0f}x)", flush=True)
            # Prefer value that passes validation
            light_warnings = [w for w in light_validation.warnings if w.field == 'amount']
            medium_warnings = [w for w in medium_validation.warnings if w.field == 'amount']

            # On a tie keep Light, the higher-confidence first pass (FR-1.3)
            if len(light_warnings) <= len(medium_warnings):
                result.amount = light.amount
                result.confidence_amount = light.confidence_amount
                print(f"[Merge] Using Light OCR amount: {light.amount} (fewer or equal warnings)", flush=True)
            else:
                result.amount = medium.amount
                result.confidence_amount = medium.confidence_amount
                print(f"[Merge] Using Medium OCR amount: {medium.amount} (fewer warnings)", flush=True)
        else:
            # Normal merge: prefer higher confidence
            if light.confidence_amount >= medium.confidence_amount:
                result.amount = light.amount
                result.confidence_amount = light.confidence_amount
            else:
                result.amount = medium.amount
                result.confidence_amount = medium.confidence_amount
    elif light.amount:
        result.amount = light.amount
        result.confidence_amount = light.confidence_amount
    elif medium.amount:
        result.amount = medium.amount
        result.confidence_amount = medium.confidence_amount

    # ... (similar logic for other fields)

    return result
```

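The ratio test inside the merge can be isolated into a small helper, which also makes the zero and missing-value cases explicit. This is a sketch only: `inter_ocr_ratio` and `is_inconsistent` are hypothetical function names (the first mirrors the `inter_ocr_ratio` field added later), and treating non-positive amounts as "not comparable" is an assumption.

```python
from decimal import Decimal
from typing import Optional


def inter_ocr_ratio(a: Optional[Decimal], b: Optional[Decimal]) -> Optional[Decimal]:
    """Ratio of the larger to the smaller amount, or None if not comparable."""
    if a is None or b is None or a <= 0 or b <= 0:
        return None  # avoids ZeroDivisionError and meaningless negative ratios
    return max(a, b) / min(a, b)


def is_inconsistent(
    a: Optional[Decimal], b: Optional[Decimal], threshold: Decimal = Decimal("10")
) -> bool:
    """True when both amounts exist and differ by more than `threshold` times."""
    ratio = inter_ocr_ratio(a, b)
    return ratio is not None and ratio > threshold
```

For the production example, 85.99 against 859,762.16 gives a ratio just under 10,000, far above the 10x threshold from FR-1.2.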
**B. Add `preprocess_medium()` call (replace Heavy):**
```python
# Line ~130: Replace preprocess_heavy with preprocess_medium
print("=" * 60, flush=True)
print("[OCR] STEP 2: PaddleOCR + Medium preprocessing", flush=True)
print("=" * 60, flush=True)
medium_img = self.preprocessor.preprocess_medium(image)  # NEW

try:
    paddle_medium = self.ocr_engine._paddle_recognize(medium_img)
    # ... rest of processing
```

**C. Add validation to final result:**
```python
# Line ~204: Add validation before returning
if extraction:
    extraction = self._final_validation(extraction)

    # NEW: Run validation engine
    from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine
    validator = OCRValidationEngine()
    validation_result = validator.validate(extraction)

    # Apply auto-corrections
    for field, value in validation_result.corrected_fields.items():
        setattr(extraction, field, value)

    # Store validation warnings (add to ExtractionResult)
    extraction.validation_warnings = validation_result.warnings
    extraction.needs_manual_review = validation_result.needs_manual_review
```

#### 2. `backend/modules/data_entry/services/ocr_extractor.py`
**Changes:** ~50 lines modified, ~30 lines added

**Key Modifications:**

**A. Add validation fields to `ExtractionResult` (lines 10-50):**
```python
@dataclass
class ExtractionResult:
    """Structured extraction result from receipt."""
    # ... existing fields ...

    # NEW: Validation results
    # Holds ValidationWarning objects from validation.py (the router reads w.field, w.rule, ...)
    validation_warnings: List["ValidationWarning"] = field(default_factory=list)
    needs_manual_review: bool = False  # Flag for supervisor review

    # NEW: Inter-OCR comparison data
    inter_ocr_ratio: Optional[float] = None  # Ratio between Light/Medium values
    inter_ocr_source_used: Optional[str] = None  # 'light' or 'medium'
```

**B. Fix CLIENT CUI patterns (lines 253-272):**
```python
# Current patterns are too strict - add OCR-tolerant versions
CLIENT_CUI_PATTERNS = [
    # ... existing patterns ...

    # NEW: OCR-tolerant patterns
    (r'CLIENT\s+C[I1UO]F\s*[:/]?\s*(?:R[O0])?(\d[\d\s]{5,9})', 0.96),  # Spaces in CUI
    (r'C[I1]F\s+CLIENT\s*[:/]?\s*(?:R[O0])?(\d[\d\s]{5,9})', 0.96),  # Reversed format
    (r'CLIENT.*?(?:R[O0])?(\d{6,10})\s*\n', 0.90),  # CUI on next line
]
```

**C. Add CUI normalization and validation:**
```python
def _normalize_cui(self, cui: Optional[str]) -> Optional[str]:
    """Normalize CUI format and validate checksum."""
    if not cui:
        return None

    # Drop a leading RO prefix first, including the OCR misread "R0",
    # so its zero is not mistaken for a CUI digit
    cui = re.sub(r'^\s*R[O0]\s*', '', cui, flags=re.IGNORECASE)

    # Remove non-digits
    digits = re.sub(r'\D', '', cui)

    # Validate length
    if not (6 <= len(digits) <= 10):
        return None

    # Validate Mod 11 checksum (Romanian CIF algorithm)
    if not self._validate_cui_checksum(digits):
        print(f"[CUI Validation] Invalid checksum: {digits}", flush=True)
        return None

    # Add RO prefix
    return f"RO{digits}"

def _validate_cui_checksum(self, digits: str) -> bool:
    """Validate Romanian CIF Mod 11 checksum."""
    if len(digits) < 2:
        return False

    # Weights 7, 5, 3, 2, 1, 7, 5, 3, 2, right-aligned with the digits before the control digit
    weights = [7, 5, 3, 2, 1, 7, 5, 3, 2]

    # Get control digit (last digit)
    control = int(digits[-1])

    # Calculate checksum (all digits except last)
    digits_to_check = digits[:-1].zfill(9)  # Left-pad with zeros to line up with the weights
    checksum = sum(int(d) * w for d, w in zip(digits_to_check, weights))

    # Multiply by 10, then take Mod 11; a remainder of 10 maps to control digit 0
    remainder = (checksum * 10) % 11
    expected_control = 0 if remainder == 10 else remainder

    return control == expected_control
```

#### 3. `backend/modules/data_entry/services/image_preprocessor.py`
**Changes:** ~80 lines added

**Key Modifications:**

**A. Add `preprocess_medium()` method (after line 166):**
```python
def preprocess_medium(self, image: np.ndarray) -> np.ndarray:
    """
    Medium preprocessing for MIXED-QUALITY images.
    Balance between Light (too gentle) and Heavy (too aggressive).

    Use cases:
    - Moderately faded receipts
    - Photos with uneven lighting
    - Scans with slight blur

    Preprocessing steps:
    - Moderate contrast enhancement (CLAHE clipLimit=2.0)
    - Light denoising (fastNlMeansDenoising h=6)
    - Gentle sharpening
    - NO binarization (preserves text boundaries)
    - NO morphological operations (avoids digit concatenation)
    """
    # 0. Add safety padding
    image = self._add_safety_padding(image)

    # 1. Grayscale
    if len(image.shape) == 3:
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    else:
        gray = image.copy()

    # 2. Scale (same as Light)
    height, width = gray.shape
    max_side = max(height, width)
    if max_side > 4000:
        scale = 4000 / max_side
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
        height, width = gray.shape

    if width < 1500:
        scale = 1500 / width
        new_width = int(width * scale)
        new_height = int(height * scale)
        if max(new_width, new_height) > 4000:
            # Cap against the current size; dividing by the already-upscaled size
            # would turn the factor into a downscale
            scale = 4000 / max(width, height)
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

    # 3. Deskew
    gray = self._deskew(gray)

    # 4. Moderate contrast enhancement
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)

    # 5. Light denoising (less aggressive than Heavy)
    denoised = cv2.fastNlMeansDenoising(enhanced, h=6, templateWindowSize=7, searchWindowSize=15)

    # 6. Gentle sharpening
    gaussian = cv2.GaussianBlur(denoised, (0, 0), 1.0)
    sharpened = cv2.addWeighted(denoised, 1.3, gaussian, -0.3, 0)

    # NO binarization, NO morphological operations
    # This preserves text boundaries and avoids digit concatenation
    return sharpened
```

**B. Mark `preprocess_heavy()` as deprecated:**
```python
def preprocess_heavy(self, image: np.ndarray) -> np.ndarray:
    """
    Heavy preprocessing for FADED thermal receipts.

    ⚠️ DEPRECATED: Use preprocess_medium() instead.
    Heavy preprocessing causes digit concatenation on clear PDFs.
    Kept for backward compatibility only.
    """
    # ... existing code (unchanged)
```

#### 4. `backend/modules/data_entry/routers/ocr.py`
**Changes:** ~40 lines modified

**Key Modifications:**

**A. Update `ExtractionData` schema instantiation (lines 106-128):**
```python
# Add validation warnings to response
validation_warnings_list = [
    {
        'field': w.field,
        'rule': w.rule,
        'message': w.message,
        'severity': w.severity,
        'suggested_value': w.suggested_value
    }
    for w in result.validation_warnings
] if hasattr(result, 'validation_warnings') else []

data = ExtractionData(
    # ... existing fields ...

    # NEW: Validation fields
    validation_warnings=validation_warnings_list,
    needs_manual_review=getattr(result, 'needs_manual_review', False),
    inter_ocr_ratio=getattr(result, 'inter_ocr_ratio', None),
    inter_ocr_source_used=getattr(result, 'inter_ocr_source_used', None),
)
```

#### 5. `backend/modules/data_entry/schemas/ocr.py`
**Changes:** ~20 lines added

**Key Modifications:**

**A. Add validation fields to `ExtractionData` (after line 57):**
```python
class ValidationWarning(BaseModel):
    """Validation warning from OCR extraction."""
    field: str = Field(description="Field name (e.g., 'amount', 'tva_total')")
    rule: str = Field(description="Rule name (e.g., 'amount_range', 'tva_ratio')")
    message: str = Field(description="Human-readable warning message")
    severity: str = Field(description="Severity: 'low', 'medium', 'high'")
    suggested_value: Optional[Any] = Field(default=None, description="Suggested corrected value")

class ExtractionData(BaseModel):
    """Extracted receipt data from OCR."""
    # ... existing fields ...

    # NEW: Validation results
    validation_warnings: List[ValidationWarning] = Field(default_factory=list, description="Validation warnings")
    needs_manual_review: bool = Field(default=False, description="Flag for supervisor review")
    inter_ocr_ratio: Optional[float] = Field(default=None, description="Ratio between OCR engines (>10 = inconsistent)")
    inter_ocr_source_used: Optional[str] = Field(default=None, description="OCR engine used: 'light' or 'medium'")
```

#### 6. Database Migration: `backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py`
**Purpose:** Add `needs_manual_review` column to `receipts` table
**Size:** ~30 lines (Alembic migration)

```python
"""Add needs_manual_review flag to receipts

Revision ID: XXX
Create Date: 2025-12-30
"""
from alembic import op
import sqlalchemy as sa

revision = 'XXX'
down_revision = 'YYY'  # Previous migration
branch_labels = None
depends_on = None

def upgrade():
    # Add column with default NULL (not FALSE)
    # NULL = not validated yet (old receipts)
    # FALSE = validated, no review needed
    # TRUE = validated, needs review
    op.add_column('receipts', sa.Column('needs_manual_review', sa.Boolean(), nullable=True))

def downgrade():
    op.drop_column('receipts', 'needs_manual_review')
```

### Frontend Integration Points

#### 1. `src/modules/data-entry/views/receipts/ReceiptCreateView.vue`
**Changes:** Display validation warnings below OCR results

**Example:**
```vue
<template>
  <div class="ocr-results">
    <!-- Existing OCR fields -->

    <!-- NEW: Validation warnings section -->
    <div v-if="ocrData.validation_warnings?.length > 0" class="validation-warnings">
      <h4>
        <i class="pi pi-exclamation-triangle" />
        Avertismente Validare ({{ ocrData.validation_warnings.length }})
      </h4>
      <ul>
        <li
          v-for="(warning, idx) in ocrData.validation_warnings"
          :key="idx"
          :class="`severity-${warning.severity}`"
        >
          <strong>{{ warning.field }}:</strong> {{ warning.message }}
          <span v-if="warning.suggested_value" class="suggestion">
            (sugestie: {{ warning.suggested_value }})
          </span>
        </li>
      </ul>
    </div>

    <!-- NEW: Manual review badge -->
    <div v-if="ocrData.needs_manual_review" class="manual-review-badge">
      <i class="pi pi-flag" />
      Necesită verificare manuală
    </div>
  </div>
</template>

<style scoped>
.validation-warnings {
  margin-top: 1rem;
  padding: 1rem;
  background: #fff3cd;
  border-left: 4px solid #ffc107;
}

.validation-warnings li.severity-low {
  color: #666;
}

.validation-warnings li.severity-medium {
  color: #f57c00;
}

.validation-warnings li.severity-high {
  color: #d32f2f;
  font-weight: bold;
}

.manual-review-badge {
  margin-top: 0.5rem;
  padding: 0.5rem 1rem;
  background: #fff3cd;
  border-radius: 4px;
  display: inline-flex;
  align-items: center;
  gap: 0.5rem;
}
</style>
```

#### 2. `src/modules/data-entry/components/ocr/OCRPreview.vue`
**Changes:** Add inter-OCR consistency indicator

**Example:**
```vue
<template>
  <div class="ocr-preview">
    <!-- Existing fields -->

    <!-- NEW: Inter-OCR consistency warning -->
    <div v-if="ocrData.inter_ocr_ratio && ocrData.inter_ocr_ratio > 10" class="ocr-consistency-warning">
      <i class="pi pi-exclamation-circle" />
      Inconsistență detectată între motoarele OCR ({{ Math.round(ocrData.inter_ocr_ratio) }}x diferență).
      <br />
      <small>Valorile folosite provin din: {{ ocrData.inter_ocr_source_used }}</small>
    </div>
  </div>
</template>
```

---

## Design Decisions

### 1. Why Validation Warnings Instead of Errors?

**Decision:** Use non-blocking warnings instead of blocking errors.

**Rationale:**
- User requirement: "Allow save with warnings"
- OCR will never be 100% perfect
- Users can override incorrect extractions
- Supervisor review catches issues before approval

**Trade-off:** Risk of bad data entering system vs. user frustration with blocked submissions.

**Mitigation:** The manual review flag routes flagged receipts to a supervisor before approval.

### 2. Why Replace Heavy with Medium OCR?

**Decision:** Replace Heavy preprocessing with Medium preprocessing in the pipeline.

**Rationale:**
- **Heavy causes digit concatenation** on clear PDFs (production evidence)
- Binarization destroys text boundaries on high-quality images
- Morphological operations merge adjacent numbers (85.99 → 859,762.16)

**Analysis of Heavy Preprocessing (lines 153-164 in `image_preprocessor.py`):**
```python
# 7. Adaptive thresholding (binarization) - PROBLEM!
binary = cv2.adaptiveThreshold(
    sharpened, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    blockSize=11, C=5  # Block size can merge nearby digits
)

# 8. Morphological operations - COMPOUNDS THE PROBLEM!
kernel_close = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
result = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel_close)
# MORPH_CLOSE fills small gaps → merges adjacent numbers
```

**Alternative Considered:** Keep Heavy but add safeguards. **Rejected:** Too risky, no benefit for clear PDFs.

### 3. Why Romanian CIF Mod 11 Validation?

**Decision:** Implement CIF checksum validation algorithm.

**Rationale:**
- Romanian CIFs have built-in checksum (last digit)
- Validates extracted CUI is mathematically correct
- Catches OCR digit errors (10562600 vs 10562601)

**Algorithm:** Mod 11 checksum
- Weights: [7, 5, 3, 2, 1, 7, 5, 3, 2], right-aligned with the digits that precede the control digit (left-pad those digits with zeros to 9)
- Formula: `(sum(digit[i] * weight[i]) * 10) % 11`
- Control digit: remainder (0 if remainder=10)

**Example:** RO10562600
- Digits before the control digit, padded to 9: 0,0,1,0,5,6,2,6,0; control digit: [0]
- Checksum: 0×7 + 0×5 + 1×3 + 0×2 + 5×1 + 6×7 + 2×5 + 6×3 + 0×2 = 0+0+3+0+5+42+10+18+0 = 78
- (78 × 10) % 11 = 780 % 11 = 10 → control digit 0, which matches the last digit → **VALID**

**Note:** Some older CIFs may not have checksums (pre-2000). Validation is permissive (warning, not error).

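The worked example can be checked with a few lines of standalone Python. This is a minimal sketch of the checksum as described above, not the service code: `cif_control_digit` and `is_valid_cif` are names chosen for this example, and the 2-10 digit bound is an assumption (the spec itself restricts extracted CUIs to 6-10 digits).

```python
CIF_WEIGHTS = (7, 5, 3, 2, 1, 7, 5, 3, 2)


def cif_control_digit(body: str) -> int:
    """Control digit for the CIF digits that precede it (at most 9 digits)."""
    padded = body.zfill(len(CIF_WEIGHTS))  # right-align the digits with the weights
    total = sum(int(d) * w for d, w in zip(padded, CIF_WEIGHTS))
    remainder = (total * 10) % 11
    return 0 if remainder == 10 else remainder


def is_valid_cif(digits: str) -> bool:
    """True when the last digit equals the control digit of the preceding ones."""
    if not (digits.isascii() and digits.isdigit() and 2 <= len(digits) <= 10):
        return False
    return cif_control_digit(digits[:-1]) == int(digits[-1])


print(is_valid_cif("10562600"))  # control digit matches
print(is_valid_cif("10562601"))  # one-digit OCR error is caught
```

For `10562600` the weighted sum is 78, `(78 * 10) % 11` is 10, and 10 maps to control digit 0, so the first call reports a valid CIF while the second does not.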
### 4. Why Apply to New Uploads Only?

**Decision:** Don't reprocess existing receipts.

**Rationale:**
- Migration impact: ~500 existing receipts in DB
- Reprocessing cost: OCR is slow (~2-5s per receipt)
- Risk: May change existing approved data
- Benefit: Minimal (old receipts already reviewed)

**Implementation:** Migration adds column with default NULL (not FALSE).

---

## Validation Rules Specification
|
||
|
||
### 1. Amount Range Validation
|
||
|
||
**Rule:** Amount must be between 0.01 and 100,000 RON.
|
||
|
||
**Implementation:**
|
||
```python
|
||
class AmountRangeRule(ValidationRule):
|
||
def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
|
||
warnings = []
|
||
if extraction.amount:
|
||
if extraction.amount < Decimal('0.01'):
|
||
warnings.append(ValidationWarning(
|
||
field='amount',
|
||
rule='amount_range',
|
||
message=f'Amount {extraction.amount} is too small (< 0.01 RON)',
|
||
severity='high'
|
||
))
|
||
elif extraction.amount > Decimal('100000'):
|
||
warnings.append(ValidationWarning(
|
||
field='amount',
|
||
rule='amount_range',
|
||
message=f'Amount {extraction.amount} exceeds limit (> 100,000 RON)',
|
||
severity='high'
|
||
))
|
||
|
||
# Check decimal places
|
||
decimal_places = abs(extraction.amount.as_tuple().exponent)
|
||
if decimal_places > 2:
|
||
warnings.append(ValidationWarning(
|
||
field='amount',
|
||
rule='decimal_places',
|
||
message=f'Amount has {decimal_places} decimal places (max 2)',
|
||
severity='medium',
|
||
suggested_value=extraction.amount.quantize(Decimal('0.01'))
|
||
))
|
||
return warnings
|
||
```
|
||
|
||
**Test Cases:**
|
||
- 0.00 RON → Warning (too small)
|
||
- 0.01 RON → Valid
|
||
- 85.99 RON → Valid
|
||
- 100,000 RON → Valid
|
||
- 100,001 RON → Warning (too large)
|
||
- 859,762.16 RON → Warning (too large)
|
||
- 85.999 RON → Warning (too many decimals)
|
||
|
||
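The test cases above can be exercised with a stripped-down, standalone version of the range check. This is only a sketch: `check_amount` is a hypothetical helper that returns rule names instead of `ValidationWarning` objects, so it runs without the rest of the validation module.

```python
from decimal import Decimal
from typing import List

MIN_AMOUNT = Decimal("0.01")
MAX_AMOUNT = Decimal("100000")


def check_amount(amount: Decimal) -> List[str]:
    """Return the names of the rules that the amount violates."""
    violations = []
    if amount < MIN_AMOUNT or amount > MAX_AMOUNT:
        violations.append("amount_range")
    # as_tuple().exponent is negative for fractional digits, e.g. -3 for 85.999
    exponent = amount.as_tuple().exponent
    if isinstance(exponent, int) and -exponent > 2:
        violations.append("decimal_places")
    return violations


for raw in ["0.00", "0.01", "85.99", "100000", "100001", "859762.16", "85.999"]:
    print(raw, check_amount(Decimal(raw)))
```

Note that `Decimal("0.00")` is falsy in Python, so a truthiness check on the amount would silently skip the zero case; the sketch compares values directly instead.
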
### 2. TVA Ratio Validation

**Rule:** TVA must be 5-24% of TOTAL amount.

**Implementation:**
```python
from decimal import Decimal
from typing import List


class TVARatioRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        # Both values must be present and non-zero (also guards the division)
        if extraction.tva_total and extraction.amount:
            # TVA cannot be greater than TOTAL
            if extraction.tva_total > extraction.amount:
                warnings.append(ValidationWarning(
                    field='tva_total',
                    rule='tva_greater_than_total',
                    message=f'TVA ({extraction.tva_total}) cannot be greater than TOTAL ({extraction.amount})',
                    severity='high',
                    suggested_value=None  # Will be auto-corrected by service
                ))
            else:
                # Check ratio
                ratio = extraction.tva_total / extraction.amount * Decimal('100')
                if ratio < Decimal('5'):
                    warnings.append(ValidationWarning(
                        field='tva_total',
                        rule='tva_ratio_low',
                        message=f'TVA is {ratio:.1f}% of total (expected 5-24%)',
                        severity='medium'
                    ))
                elif ratio > Decimal('24'):
                    warnings.append(ValidationWarning(
                        field='tva_total',
                        rule='tva_ratio_high',
                        message=f'TVA is {ratio:.1f}% of total (expected 5-24%)',
                        severity='high'
                    ))
        return warnings
```

**Test Cases:**
- TVA=14.92, TOTAL=85.99 → 17.4% → Valid
- TVA=149,214.92, TOTAL=859,762.16 → 17.4% → Ratio passes although both values are wrong (caught by amount_range)
- TVA=4.00, TOTAL=100.00 → 4% → Warning (too low)
- TVA=100.00, TOTAL=85.99 → 116% → Warning (impossible!)

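The percentages in these test cases can be reproduced with a short standalone calculation. This is a sketch under the assumption that amounts are `Decimal` values; `tva_ratio_percent` is a hypothetical helper, not a function from the module.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional


def tva_ratio_percent(tva_total: Decimal, amount: Decimal) -> Optional[Decimal]:
    """Return TVA as a percentage of TOTAL, or None when TOTAL is zero."""
    if amount == 0:
        return None  # nothing to divide by
    ratio = tva_total / amount * Decimal("100")
    return ratio.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


print(tva_ratio_percent(Decimal("14.92"), Decimal("85.99")))   # 17.4
print(tva_ratio_percent(Decimal("4.00"), Decimal("100.00")))   # 4.0
print(tva_ratio_percent(Decimal("100.00"), Decimal("85.99")))  # 116.3
```

The garbage pair 149,214.92 / 859,762.16 yields the same ratio as the correct pair, which is why the ratio rule alone cannot catch a magnitude error and must be combined with the amount range rule.
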
### 3. Payment Sum Validation

**Rule:** CARD + NUMERAR must equal TOTAL (±0.02 RON tolerance).

**Implementation:**
```python
class PaymentSumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.payment_methods and extraction.amount:
            payment_sum = sum(pm['amount'] for pm in extraction.payment_methods)
            difference = abs(payment_sum - extraction.amount)

            if difference > Decimal('0.02'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='payment_sum_mismatch',
                    message=f'Payment methods sum ({payment_sum}) ≠ TOTAL ({extraction.amount}), diff={difference}',
                    severity='high',
                    suggested_value=payment_sum
                ))
        return warnings

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Auto-correct TOTAL from payment sum if confidence < 80%."""
        corrections = {}
        if extraction.payment_methods and extraction.amount:
            payment_sum = sum(pm['amount'] for pm in extraction.payment_methods)
            difference = abs(payment_sum - extraction.amount)

            if difference > Decimal('0.02') and extraction.confidence_amount < 0.80:
                corrections['amount'] = payment_sum
                print(f"[Auto-Correct] TOTAL corrected: {extraction.amount} → {payment_sum} (from payment methods)", flush=True)
        return corrections
```

**Test Cases:**
- CARD=50, NUMERAR=35.99, TOTAL=85.99 → Valid
- CARD=50, NUMERAR=35.97, TOTAL=85.99 → Diff=0.02 → Valid (tolerance)
- CARD=50, NUMERAR=35.00, TOTAL=85.99 → Diff=0.99 → Warning

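The tolerance boundary in these test cases is easiest to see in isolation. The following sketch assumes payment amounts are `Decimal` values; `payments_match_total` and `TOLERANCE` are hypothetical names used only for illustration.

```python
from decimal import Decimal
from typing import List

TOLERANCE = Decimal("0.02")


def payments_match_total(payments: List[Decimal], total: Decimal) -> bool:
    """True when the payment amounts add up to TOTAL within the tolerance."""
    payment_sum = sum(payments, Decimal("0"))
    return abs(payment_sum - total) <= TOLERANCE


total = Decimal("85.99")
print(payments_match_total([Decimal("50"), Decimal("35.99")], total))  # True
print(payments_match_total([Decimal("50"), Decimal("35.97")], total))  # True
print(payments_match_total([Decimal("50"), Decimal("35.00")], total))  # False
```

Using `Decimal` rather than `float` matters at the boundary: a difference of exactly 0.02 is represented exactly, so the edge case is decided by the comparison and not by binary rounding noise.
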
### 4. TVA Entries Sum Validation

**Rule:** Σ(TVA entries) must equal TVA TOTAL (±0.02 RON tolerance).

**Implementation:**
```python
class TVAEntriesSumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.tva_entries and extraction.tva_total:
            entries_sum = sum(e['amount'] for e in extraction.tva_entries)
            difference = abs(entries_sum - extraction.tva_total)

            if difference > Decimal('0.02'):
                warnings.append(ValidationWarning(
                    field='tva_total',
                    rule='tva_entries_sum_mismatch',
                    message=f'TVA entries sum ({entries_sum}) ≠ TVA TOTAL ({extraction.tva_total}), diff={difference}',
                    severity='medium',
                    suggested_value=entries_sum
                ))
        return warnings

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Use entries sum as TVA TOTAL if mismatch."""
        corrections = {}
        if extraction.tva_entries and extraction.tva_total:
            entries_sum = sum(e['amount'] for e in extraction.tva_entries)
            difference = abs(entries_sum - extraction.tva_total)

            if difference > Decimal('0.02'):
                corrections['tva_total'] = entries_sum
                print(f"[Auto-Correct] TVA TOTAL corrected: {extraction.tva_total} → {entries_sum} (from entries)", flush=True)
        return corrections
```

**Test Cases:**
- Entries=[A:19%:14.92], TOTAL=14.92 → Valid
- Entries=[A:19%:10.00, B:9%:4.92], TOTAL=14.92 → Valid
- Entries=[A:19%:14.92], TOTAL=14.94 → Diff=0.02 → Valid (tolerance)
- Entries=[A:19%:14.92], TOTAL=15.00 → Diff=0.08 → Warning

### 5. Inter-OCR Consistency Validation

**Rule:** Flag if values differ >10x between OCR engines.

**Implementation:**
```python
class InterOCRConsistencyRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        """Check the ratio that the merge step computed and stored on the extraction."""
        warnings = []
        # The attribute is only present when two OCR passes were merged
        ratio = getattr(extraction, 'inter_ocr_ratio', None)
        if ratio and ratio > 10:
            warnings.append(ValidationWarning(
                field='amount',
                rule='inter_ocr_inconsistency',
                message=f'Large inconsistency between OCR engines ({ratio:.0f}x difference)',
                severity='high'
            ))
        return warnings
```

**Test Cases:**
- Light=85.99, Medium=86.00 → Ratio=1.00 → Valid
- Light=85.99, Medium=90.00 → Ratio=1.05 → Valid
- Light=85.99, Medium=859.76 → Ratio=10.00 → Valid (edge case)
- Light=85.99, Medium=859,762.16 → Ratio=10,000 → Warning!

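This section does not show how `inter_ocr_ratio` is computed during the merge. One plausible definition, offered here purely as an assumption, is the larger amount divided by the smaller one, with a guard for missing or zero values. The function name below is hypothetical.

```python
from decimal import Decimal
from typing import Optional


def inter_ocr_ratio(
    light: Optional[Decimal], medium: Optional[Decimal]
) -> Optional[Decimal]:
    """Return larger/smaller of the two amounts, or None if not comparable."""
    if light is None or medium is None:
        return None
    low, high = sorted((abs(light), abs(medium)))
    if low == 0:
        return None  # cannot form a ratio; avoids division by zero
    return high / low


ratio = inter_ocr_ratio(Decimal("85.99"), Decimal("859762.16"))
print(ratio > 10)  # True
print(inter_ocr_ratio(Decimal("85.99"), Decimal("86.00")) > 10)  # False
print(inter_ocr_ratio(Decimal("0"), Decimal("85.99")))  # None
```

Dividing the larger value by the smaller makes the ratio symmetric, so the rule fires whether the suspicious value came from the Light or the Medium pass.
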
### 6. CUI Checksum Validation

**Rule:** Validate Romanian CIF Mod 11 checksum.

**Implementation:**
```python
import re
from typing import List


class CUIChecksumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.cui:
            # Normalize CUI: keep digits only (drops the RO prefix and separators)
            digits = re.sub(r'\D', '', extraction.cui)

            # Validate length
            if not (6 <= len(digits) <= 10):
                warnings.append(ValidationWarning(
                    field='cui',
                    rule='cui_length',
                    message=f'CUI length invalid: {len(digits)} digits (expected 6-10)',
                    severity='medium'
                ))
                return warnings

            # Validate Mod 11 checksum
            if not self._validate_checksum(digits):
                warnings.append(ValidationWarning(
                    field='cui',
                    rule='cui_checksum',
                    message=f'CUI checksum invalid: {extraction.cui} (failed Mod 11 validation)',
                    severity='medium'  # Medium: some old CIFs don't have checksums
                ))
        return warnings

    def _validate_checksum(self, digits: str) -> bool:
        """Romanian CIF Mod 11 checksum validation."""
        if len(digits) < 2:
            return False

        weights = [7, 5, 3, 2, 1, 7, 5, 3, 2]
        control = int(digits[-1])
        # Left-pad the body to 9 digits so the weights align from the right
        digits_to_check = digits[:-1].zfill(9)

        checksum = sum(int(d) * w for d, w in zip(digits_to_check, weights))
        # The weighted sum is multiplied by 10 before taking the remainder
        remainder = (checksum * 10) % 11
        expected = 0 if remainder == 10 else remainder

        return control == expected
```

**Test Cases:**
- RO10562600 → Valid (checksum matches)
- RO11201891 → Valid (checksum matches)
- RO12345678 → Warning (invalid checksum)
- RO1234 → Warning (too short)

### 7. Date Validity Validation

**Rule:** Date must not be in the future (beyond a 1-day clock-skew tolerance) and must not be older than 10 years.

**Implementation:**
```python
from datetime import date, timedelta
from typing import List


class DateValidityRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.receipt_date:
            today = date.today()

            # Check future date (1 day of tolerance for clock skew)
            if extraction.receipt_date > today + timedelta(days=1):
                warnings.append(ValidationWarning(
                    field='receipt_date',
                    rule='date_future',
                    message=f'Date is in the future: {extraction.receipt_date}',
                    severity='high'
                ))

            # Check too old (10 years); Feb 29 falls back to Feb 28
            try:
                cutoff_date = today.replace(year=today.year - 10)
            except ValueError:
                cutoff_date = today.replace(year=today.year - 10, day=28)
            if extraction.receipt_date < cutoff_date:
                warnings.append(ValidationWarning(
                    field='receipt_date',
                    rule='date_too_old',
                    message=f'Date is older than 10 years: {extraction.receipt_date}',
                    severity='medium'
                ))
        return warnings
```

**Test Cases** (with today = 2025-12-30):
- 2025-12-30 (today) → Valid
- 2025-10-11 → Valid
- 2025-12-31 → Valid (within 1-day tolerance)
- 2026-01-01 → Warning (future)
- 2015-12-30 → Valid (exactly 10 years)
- 2014-12-31 → Warning (too old)

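Computing the 10-year cutoff with `date.replace` has one edge case worth spelling out: on 29 February, the target year is usually not a leap year and `replace` raises `ValueError`. A minimal sketch of a safe helper follows; `years_ago` is a hypothetical name, and mapping 29 February to 28 February is one reasonable convention among several.

```python
from datetime import date


def years_ago(day: date, years: int) -> date:
    """Return the same calendar day `years` earlier; Feb 29 maps to Feb 28."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:  # Feb 29 in a target year that is not a leap year
        return day.replace(year=day.year - years, day=28)


print(years_ago(date(2025, 12, 30), 10))  # 2015-12-30
print(years_ago(date(2028, 2, 29), 10))   # 2018-02-28
```
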
---

## Acceptance Criteria

### Critical Success Criteria (Must Pass)

✅ **AC-1:** Five-Holding receipt extracts correct values
- **Given:** Production PDF receipt (Five-Holding, 85.99 LEI)
- **When:** OCR processes with new validation
- **Then:**
  - TOTAL = 85.99 LEI (NOT 859,762.16)
  - TVA = 14.92 LEI (NOT 149,214.92)
  - CUI = RO10562600
  - Overall confidence >= 90%

✅ **AC-2:** Save works with validation warnings
- **Given:** Receipt with low confidence (75%)
- **When:** User clicks Save
- **Then:**
  - Warnings displayed in UI
  - Save button enabled
  - Receipt saved with `needs_manual_review=TRUE`

✅ **AC-3:** Cross-validation: CARD + NUMERAR = TOTAL
- **Given:** Receipt with CARD=50, NUMERAR=35.99
- **When:** OCR extracts TOTAL=85.89 (off by 0.10, outside the ±0.02 tolerance)
- **Then:**
  - Warning displayed: "Payment sum (85.99) ≠ TOTAL (85.89)"
  - Suggested value: 85.99
  - Auto-corrected if confidence < 80%

✅ **AC-4:** Cross-validation: Σ(TVA entries) = TVA TOTAL
- **Given:** Receipt with TVA A=10.00, TVA B=4.92
- **When:** OCR extracts TVA TOTAL=14.82 (off by 0.10, outside the ±0.02 tolerance)
- **Then:**
  - Warning displayed: "TVA entries sum (14.92) ≠ TVA TOTAL (14.82)"
  - Auto-corrected to 14.92

✅ **AC-5:** CUI Mod 11 validation works
- **Given:** Receipt with CUI RO10562600
- **When:** OCR processes
- **Then:**
  - CUI validated against Mod 11 checksum
  - If invalid, warning displayed
  - Format normalized to "RO" prefix

### Secondary Criteria (Nice-to-Have)

🔲 **AC-S1:** Medium OCR performs better than Heavy
- **Given:** 10 clear PDF receipts
- **When:** Processed with Light → Medium → Tesseract
- **Then:**
  - No 10x magnitude errors
  - Average confidence >= 90%
  - Processing time < 5s

🔲 **AC-S2:** Validation warnings show in UI
- **Given:** Receipt with 3 validation warnings
- **When:** OCR completes
- **Then:**
  - Warning section displayed
  - Each warning shows: field, message, severity
  - Suggested values displayed if available

---

## Testing Strategy

### Unit Tests (~300 lines)

**File:** `backend/modules/data_entry/tests/test_ocr_validation.py`

**Test Coverage:**
```python
# Amount validation
test_amount_range_valid()
test_amount_range_too_small()
test_amount_range_too_large()
test_amount_decimal_places()

# TVA validation
test_tva_ratio_valid()
test_tva_ratio_too_low()
test_tva_ratio_too_high()
test_tva_greater_than_total()
test_tva_entries_sum_matches()
test_tva_entries_sum_mismatch()

# Payment validation
test_payment_sum_matches()
test_payment_sum_mismatch_within_tolerance()
test_payment_sum_mismatch_auto_corrected()

# CUI validation
test_cui_checksum_valid()
test_cui_checksum_invalid()
test_cui_length_invalid()
test_cui_normalization()

# Date validation
test_date_valid()
test_date_future()
test_date_too_old()

# Inter-OCR consistency
test_inter_ocr_consistency_valid()
test_inter_ocr_consistency_10x_difference()

# Validation engine
test_validation_engine_no_warnings()
test_validation_engine_multiple_warnings()
test_validation_engine_auto_corrections()
test_needs_manual_review_flag()
```

### Integration Tests (~200 lines)

**File:** `backend/modules/data_entry/tests/test_ocr_validation_integration.py`

**Test Coverage:**
```python
# Real receipts
test_five_holding_receipt()  # Production case (85.99 not 859,762.16)
test_omv_receipt()  # Clear PDF, Light OCR only
test_kaufland_receipt()  # Faded thermal, Medium OCR
test_mega_image_receipt()  # Multiple TVA entries

# OCR pipeline
test_light_ocr_high_confidence_skips_medium()
test_light_ocr_low_confidence_runs_medium()
test_medium_ocr_replaces_heavy()
test_validation_runs_after_merge()

# API responses
test_api_returns_validation_warnings()
test_api_returns_needs_manual_review_flag()
test_api_returns_inter_ocr_ratio()
test_api_auto_corrects_amount_from_payments()

# Edge cases
test_no_ocr_engines_available()
test_pdf_with_multiple_pages()
test_receipt_with_no_tva()
test_receipt_with_no_payment_methods()
```

### Manual Testing Checklist

1. **Upload Five-Holding receipt PDF** (production case)
   - [ ] Verify TOTAL = 85.99 (not 859,762.16)
   - [ ] Verify TVA = 14.92 (not 149,214.92)
   - [ ] Verify no validation warnings
   - [ ] Verify overall confidence >= 90%

2. **Upload faded thermal receipt photo**
   - [ ] Verify Medium OCR used (not Heavy)
   - [ ] Verify readable text extracted
   - [ ] Verify no digit concatenation

3. **Upload receipt with payment methods**
   - [ ] Verify CARD + NUMERAR displayed
   - [ ] Verify sum matches TOTAL
   - [ ] If mismatch, verify warning displayed

4. **Upload receipt with multiple TVA entries**
   - [ ] Verify all TVA entries extracted
   - [ ] Verify sum matches TVA TOTAL
   - [ ] If mismatch, verify warning displayed

5. **Submit receipt with warnings**
   - [ ] Verify Save button enabled
   - [ ] Verify warnings displayed in UI
   - [ ] Verify `needs_manual_review` flag set

6. **Filter receipts by "Needs Review"**
   - [ ] Verify filter shows flagged receipts
   - [ ] Verify supervisor can review

---

## Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| **Medium OCR still causes errors** | Medium | High | Keep Tesseract as Step 3 fallback; validation catches issues |
| **CUI Mod 11 validation too strict** | Medium | Low | Use warning (not error); allow override; some old CIFs don't have checksums |
| **Validation rules too permissive** | Low | Medium | Start conservative, tune based on production data |
| **Validation rules too strict** | Medium | Low | Non-blocking warnings allow user override |
| **Performance impact** | Low | Low | Validation is fast (<10ms); OCR dominates processing time |
| **Breaking changes to API** | Low | High | Add new fields, keep existing fields unchanged; frontend optional |
| **Database migration issues** | Low | Medium | Use NULL default (not FALSE); test on staging first |

---

## Out of Scope

**Explicitly NOT included in this feature:**

1. ❌ **Reprocessing existing receipts** - Only new uploads validated
2. ❌ **Machine learning OCR improvements** - Use existing PaddleOCR/Tesseract
3. ❌ **Custom OCR training** - Generic models only
4. ❌ **Approval workflow changes** - Validation is separate from approval
5. ❌ **Automatic approval** - Always requires supervisor review
6. ❌ **Advanced validation rules** - Only basic sanity checks
7. ❌ **Multi-currency support** - RON only for now
8. ❌ **Historical receipt validation** - Phase 2 feature
9. ❌ **OCR confidence tuning** - Accept engine defaults
10. ❌ **Frontend validation logic** - Backend only (frontend displays)

---

## Open Questions

### Q1: Should we keep Heavy preprocessing as fallback?

**Answer:** No. Remove it completely. Evidence shows it does more harm than good on clear PDFs. Medium preprocessing handles mixed-quality images better.

### Q2: What tolerance for payment sum validation?

**Answer:** ±0.02 RON (2 bani). Romanian receipts use 2 decimal places, so this absorbs rounding errors.

### Q3: Should CUI validation be blocking or warning?

**Answer:** Warning only. Some old Romanian CIFs (pre-2000) don't have Mod 11 checksums. Also, OCR may extract digits incorrectly.

### Q4: What if Light OCR has high confidence but wrong values?

**Answer:** Validation catches this. If Light OCR extracts 859,762.16 with 98% confidence, the amount_range rule flags it (>100,000 limit) and the user sees a warning.

### Q5: Should we reprocess existing receipts with new validation?

**Answer:** No. Too risky and time-consuming. Apply to new uploads only. If a user wants to re-validate an old receipt, they can re-upload it.

### Q6: What about receipts with no payment methods?

**Answer:** No validation warning. Not all receipts show a CARD/NUMERAR breakdown (especially older thermal receipts). Only validate if payment methods are extracted.

### Q7: Should validation auto-correct or just warn?

**Answer:** Both. Auto-correct obvious errors (TOTAL from payment sum if confidence < 80%). Warn for ambiguous cases. Never silently change high-confidence values.

### Q8: How should receipts dated in the future (clock skew) be handled?

**Answer:** Warning only (not error). Allow up to 1 day in the future (24h tolerance) for clock skew. Beyond that, warn the user.

---

## Estimated Complexity

**Overall:** High
**Justification:**

- **File Count:** 6 modified, 3 created, 1 migration = 10 files
- **Line Changes:** ~1,135 lines (400 new validation, 300 tests, 200 integration tests, 235 modifications)
- **Risk Level:** Medium (core OCR pipeline changes, but validation is additive)
- **Testing:** ~40 new test cases (26 unit and 16 integration tests listed under Testing Strategy), manual testing required
- **Dependencies:** None (uses existing OCR engines)
- **Complexity Factors:**
  - Multi-layer validation logic
  - Romanian CIF checksum algorithm
  - Cross-field validation dependencies
  - Inter-OCR comparison logic
  - Auto-correction logic
  - Frontend integration
  - Database migration

**Estimated Effort:** 2-3 days
- Day 1: Validation engine + unit tests
- Day 2: OCR pipeline integration + medium preprocessing
- Day 3: Frontend integration + manual testing + bug fixes

---

## Dependencies

### External Libraries
- ✅ `cv2` (OpenCV) - Already installed
- ✅ `numpy` - Already installed
- ✅ `paddleocr` - Already installed
- ✅ `tesseract` - Already installed
- ✅ `pydantic` - Already installed
- ✅ `sqlalchemy` - Already installed

### Internal Modules
- ✅ `backend/modules/data_entry/services/ocr_service.py`
- ✅ `backend/modules/data_entry/services/ocr_extractor.py`
- ✅ `backend/modules/data_entry/services/image_preprocessor.py`
- ✅ `backend/modules/data_entry/routers/ocr.py`
- ✅ `backend/modules/data_entry/schemas/ocr.py`
- ✅ `backend/modules/data_entry/db/models/receipt.py`

### Database Schema Changes
- ✅ Add `needs_manual_review` column to `receipts` table (nullable BOOLEAN)
- ✅ Alembic migration required

---

## Implementation Notes

### Priority Order (Recommended)

1. **Phase 1: Core Validation (Day 1)**
   - Create `ocr/validation.py` module
   - Implement validation rules (amount, TVA, payment, CUI, date)
   - Write unit tests
   - **Checkpoint:** All tests pass

2. **Phase 2: OCR Integration (Day 2 Morning)**
   - Add `preprocess_medium()` to image_preprocessor
   - Update `_merge_extractions()` with validation-aware logic
   - Remove/deprecate `preprocess_heavy()`
   - **Checkpoint:** Five-Holding receipt extracts correctly

3. **Phase 3: API Updates (Day 2 Afternoon)**
   - Update `ExtractionResult` dataclass with validation fields
   - Update API schemas (ocr.py, routers/ocr.py)
   - Add database migration
   - **Checkpoint:** API returns validation warnings

4. **Phase 4: Integration Testing (Day 3 Morning)**
   - Write integration tests
   - Test with real receipts (Five-Holding, OMV, Kaufland)
   - **Checkpoint:** All integration tests pass

5. **Phase 5: Frontend & Polish (Day 3 Afternoon)**
   - Update Vue components to display warnings
   - Add "Needs Review" filter
   - Manual testing
   - Bug fixes
   - **Checkpoint:** Production-ready

### Code Quality Standards

- ✅ Type hints for all functions
- ✅ Docstrings for all public methods
- ✅ Unit test coverage >90%
- ✅ Integration tests for critical paths
- ✅ Print statements for debugging (will be converted to logging later)
- ✅ Follow existing code patterns (SQLModel, Pydantic v2, FastAPI)

### Performance Considerations

- **Validation overhead:** <10ms per receipt (negligible vs. OCR time)
- **Medium preprocessing:** Similar speed to Heavy (~500ms)
- **Database migration:** Non-blocking (adds NULL column)
- **Frontend impact:** Minimal (only displays warnings)

---

## Related Documentation

### Project Context
- **CLAUDE.md:** Data Entry module instructions
- **docs/data-entry/DATA-ENTRY-MODULE.md:** Module architecture
- **docs/ARCHITECTURE-DECISIONS.md:** Ultrathin monolith rationale

### Technical References
- **Romanian CIF validation:** https://ro.wikipedia.org/wiki/Cod_de_identificare_fiscal%C4%83
- **OpenCV preprocessing:** https://docs.opencv.org/4.x/d7/d4d/tutorial_py_thresholding.html
- **PaddleOCR docs:** https://github.com/PaddlePaddle/PaddleOCR

### Similar Features
- **Payment methods extraction:** Already implemented in `ocr_extractor.py:1361`
- **TVA entries extraction:** Already implemented in `ocr_extractor.py:820`
- **Cross-validation logic:** Pattern from `_cross_validate_and_calculate_amount` (lines 468-557)

---

## Summary

This specification provides a comprehensive solution to fix critical OCR data extraction issues in the Data Entry module. The multi-layer validation system ensures data integrity while maintaining user flexibility through non-blocking warnings.

**Key Benefits:**
- ✅ Prevents 10,000x magnitude errors (85.99 vs 859,762.16)
- ✅ Validates cross-field dependencies (payment sum, TVA sum)
- ✅ Improves CUI extraction with Mod 11 checksum
- ✅ Replaces problematic Heavy OCR with Medium preprocessing
- ✅ Non-blocking warnings preserve user workflow
- ✅ Manual review flag helps supervisors prioritize

**Next Steps:**
1. Review and approve specification
2. Create feature branch: `feature/bon-ocr-validation`
3. Implement Phase 1 (validation engine)
4. Continue with Phases 2-5
5. Deploy to staging for testing
6. Monitor production for 1 week before full rollout

---

**Document Version:** 1.0
**Last Updated:** 2025-12-30
**Status:** Ready for Implementation
**Estimated Completion:** 2026-01-02 (3 working days)