roa2web-service-auto/.auto-build/specs/bon-ocr-validation/spec.md
Marius Mutu ab160b628d feat(ocr): Add validation system and CLIENT CUI extraction
OCR Data Extraction Validation System:
- Add 7 validation rules (amount range, TVA ratio, payment sum, etc.)
- Add Medium preprocessing to replace Heavy (fixes digit concatenation)
- Add validation warnings to API responses
- Flag receipts needing manual review (needs_manual_review field)
- Add database migration for needs_manual_review column

CLIENT CUI Extraction Improvements:
- Support all format variations: CIF CLIENT:, CLIENT C.U.I/C.I.F., etc.
- Handle OCR errors (R0 vs RO, C1F vs CIF)
- Add client_name, client_cui, client_address to API response
- Add validation fields to API response (was missing)

QA Review: 12 issues found, 9 fixed (5 errors + 4 warnings)
- Fixed type safety in validation rules
- Fixed ZeroDivisionError risk
- Fixed schema mismatch (Optional[bool] for needs_manual_review)
- All 37 unit tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-30 19:12:52 +02:00


# Feature Specification: OCR Data Extraction Validation System
**Feature ID:** bon-ocr-validation
**Priority:** Critical (P0 - Production Bug)
**Complexity:** High
**Estimated Effort:** 2-3 days
**Created:** 2025-12-30
**Module:** Data Entry (backend/modules/data_entry/)
---
## Overview
Fix critical OCR data extraction issue where PaddleOCR Heavy preprocessing (88% confidence) overwrites correct Light OCR (98% confidence) data with garbage values, causing 10,000x magnitude errors in production receipts.
**Value Proposition:** Prevent incorrect financial data from entering the system, reduce manual corrections, improve user trust in OCR accuracy.
---
## Problem Statement
### Current Behavior (BROKEN)
The OCR processing pipeline (`backend/modules/data_entry/services/ocr_service.py`) uses a 3-step adaptive approach:
1. **Step 1:** PaddleOCR + Light preprocessing (fast, high confidence)
2. **Step 2:** PaddleOCR + Heavy preprocessing (faded receipts)
3. **Step 3:** Tesseract (complement missing fields only)
**Critical Bug:** The `_merge_extractions()` method (lines 240-386) blindly prefers higher OCR confidence scores WITHOUT validating actual extracted values.
### Real Production Example (Five-Holding Receipt)
| Field | Light OCR (98%) ✅ | Heavy OCR (88%) ❌ | Final Result ❌ |
|-------|-------------------|-------------------|-----------------|
| **TOTAL** | 85.99 LEI | 859,762.16 LEI | **859,762.16** (10,000x error!) |
| **TVA** | 14.92 LEI | 149,214.92 LEI | **149,214.92** (10,000x error!) |
| **CUI** | R010562600 | (not found) | R010562600 |
| **Date** | 2025-10-11 | 2025-10-11 | 2025-10-11 |
| **Confidence** | 98% | 88% | 88% (wrong source!) |
**Root Cause:** Heavy preprocessing causes digit concatenation on high-quality PDFs. The binarization and morphological operations (lines 153-164 in `image_preprocessor.py`) merge adjacent numbers, creating garbage values.
### Impact
- **Data Integrity:** Incorrect amounts enter accounting system
- **User Trust:** Users lose confidence in OCR accuracy
- **Manual Work:** Requires manual verification of ALL OCR extractions
- **Financial Risk:** Wrong amounts could be approved without review
---
## User Stories
### 1. As a user uploading a clear PDF receipt
**I want** OCR to extract correct values from the first pass
**So that** I don't have to manually correct obvious errors
**Acceptance Criteria:**
- Light OCR correctly extracts 85.99 LEI (not 859,762.16)
- Heavy OCR is skipped when Light OCR confidence >= 90%
- No 10,000x magnitude errors
### 2. As a user submitting a receipt with warnings
**I want** to be able to save receipts with validation warnings
**So that** I can submit for review even if OCR isn't perfect
**Acceptance Criteria:**
- Save button works with warnings (not blocked)
- Receipt marked with `needs_manual_review=True`
- Warnings displayed clearly in UI
### 3. As a supervisor reviewing receipts
**I want** to see which receipts need manual review
**So that** I can prioritize validation efforts
**Acceptance Criteria:**
- Filter by "Needs Review" flag
- Validation warnings shown in detail view
- Clear indication of which fields are suspicious
### 4. As a system validating cross-field data
**I want** to validate CARD + NUMERAR = TOTAL
**So that** payment methods match the total amount
**Acceptance Criteria:**
- Cross-validation: sum of payment methods = TOTAL (±0.02 RON tolerance)
- If mismatch, flag for review
- Auto-correct TOTAL from payment sum if confidence < 80%
### 5. As a system validating TVA entries
**I want** to validate Σ(TVA entries) = TVA TOTAL
**So that** individual TVA lines match the total TVA
**Acceptance Criteria:**
- Cross-validation: sum of TVA entries = TVA TOTAL (±0.02 RON tolerance)
- TVA rate validation (5-24% of TOTAL)
- If mismatch, flag for review
---
## Functional Requirements
### Core Requirements (Must-Have)
#### 1. Multi-Layer Validation Pipeline
**FR-1.1:** Absolute value sanity checks
- Amount range: 0.01 - 100,000 RON
- Max 2 decimal places
- Date: not in future, not older than 10 years (2015+)
- CUI: 6-10 digits, valid Mod 11 checksum
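The date bounds above can be sketched as a small helper. This is only an illustration: the function name `date_warning` and the year-based reading of "not older than 10 years (2015+)" are assumptions, not part of the existing code.

```python
from datetime import date
from typing import Optional

MAX_RECEIPT_AGE_YEARS = 10  # from FR-1.1


def date_warning(receipt_date: date, today: Optional[date] = None) -> Optional[str]:
    """Return a warning message, or None if the date looks plausible.

    Hypothetical helper: compares calendar years only, so with a reference
    date in 2025 every date from 2015-01-01 onward is accepted.
    """
    today = today or date.today()
    if receipt_date > today:
        return f"Date {receipt_date} is in the future"
    oldest_year = today.year - MAX_RECEIPT_AGE_YEARS
    if receipt_date.year < oldest_year:
        return f"Date {receipt_date} is older than {MAX_RECEIPT_AGE_YEARS} years"
    return None
```

Passing `today` explicitly keeps the check deterministic in unit tests.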
**FR-1.2:** Cross-field correlation validation
- TVA: 5-24% of TOTAL amount (Romanian rates: 5%, 9%, 11%, 19%, 21%)
- Payment methods: CARD + NUMERAR = TOTAL (±0.02 RON tolerance)
- Inter-OCR consistency: flag if values differ >10x between engines
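To see how the 5-24% band relates to the nominal rates, the sketch below computes TVA as a share of a VAT-inclusive TOTAL. It assumes that TOTAL on the receipt is the gross amount and that a single rate applies; `tva_share_of_total` is a hypothetical helper name.

```python
from decimal import Decimal


def tva_share_of_total(rate_percent: Decimal) -> Decimal:
    """TVA as a percentage of a VAT-inclusive TOTAL for one nominal rate.

    Assumption: TOTAL = net + TVA, so the share is rate / (100 + rate).
    """
    return rate_percent / (Decimal(100) + rate_percent) * Decimal(100)


if __name__ == "__main__":
    for rate in (5, 9, 11, 19, 21):
        share = tva_share_of_total(Decimal(rate))
        print(f"rate {rate:>2}% -> TVA is {share:.2f}% of TOTAL")
```

Under these assumptions the Five-Holding receipt (14.92 / 85.99, about 17.35%) sits close to the 21% rate, while a receipt taxed only at 5% would land slightly below the 5% lower bound (about 4.76%), which may be worth checking against the rule's limits.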
**FR-1.3:** Auto-correction logic
- If TOTAL is obviously wrong (>10x payment sum), use payment sum
- If TVA > TOTAL, recalculate TOTAL from TVA using reverse formula
- Preserve high-confidence values from Light OCR over low-confidence Heavy OCR
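The first auto-correction rule can be illustrated with a minimal sketch. The function name `correct_total` is hypothetical, and the TVA-based reverse formula is left out because the spec does not define it here.

```python
from decimal import Decimal
from typing import Optional


def correct_total(total: Decimal, payment_sum: Optional[Decimal]) -> Decimal:
    """Fall back to the payment sum when TOTAL is more than 10x larger.

    Hypothetical helper: returns TOTAL unchanged when no usable payment
    sum is available or when the two values are within a factor of 10.
    """
    if payment_sum is not None and payment_sum > 0 and total > payment_sum * 10:
        return payment_sum
    return total
```

For the production example, `correct_total(Decimal("859762.16"), Decimal("85.99"))` would yield `Decimal("85.99")`.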
**FR-1.4:** Validation result structure
```python
@dataclass
class ValidationResult:
    is_valid: bool
    warnings: List[ValidationWarning]  # Non-blocking issues
    errors: List[ValidationError]      # Blocking issues (none for now)
    corrected_fields: Dict[str, Any]   # Auto-corrected values
    needs_manual_review: bool          # Flag for supervisor
```
#### 2. Replace Heavy with Medium OCR
**FR-2.1:** Remove `preprocess_heavy()` from the OCR pipeline (the method itself is kept but marked deprecated)
- Current Heavy: aggressive binarization causes digit concatenation
- Reason: Destroys high-quality PDFs while trying to recover faded receipts
**FR-2.2:** Add `preprocess_medium()` method
- Moderate contrast enhancement (CLAHE clipLimit=2.0)
- Light denoising (fastNlMeansDenoising h=6)
- NO binarization, NO morphological operations
- Preserve text boundaries on clear images
**FR-2.3:** Update OCR pipeline
- Step 1: Light preprocessing (unchanged)
- Step 2: **Medium** preprocessing (replaces Heavy)
- Step 3: Tesseract (unchanged)
#### 3. Enhanced CUI Extraction
**FR-3.1:** Romanian CIF validation algorithm
- Implement Mod 11 checksum validation
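- Control digit formula: `(sum(digit[i] * weight[i]) * 10) % 11` (a result of 10 maps to control digit 0)
- Weights: `[7, 5, 3, 2, 1, 7, 5, 3, 2]`, right-aligned against the digits that precede the control digit (left-pad with zeros to 9 digits)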
**FR-3.2:** CUI format normalization
- Always add "RO" prefix if missing
- Remove spaces, dashes, dots
- Validate length: 6-10 digits
**FR-3.3:** Improved regex patterns
```python
# Add OCR-tolerant patterns (current patterns are too strict)
CUI_OCR_TOLERANT_PATTERNS = [
    r'CIF[:\s]*R[O0]?\s*(\d[\d\s]{5,9})',   # Spaces in CUI
    r'C[I1]F[:\s]*(\d[\d\s]{6,10})',        # C1F (I→1 OCR error)
    r'C\.?\s*[I1]\.?\s*F\.?[:\s]*(\d+)',    # C. I. F. (spaced)
]
```
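A short usage sketch for the patterns above, showing how captured digits could be cleaned of embedded whitespace. The helper `extract_cui_digits` and the sample strings are illustrative assumptions, not existing code.

```python
import re
from typing import Optional

CUI_OCR_TOLERANT_PATTERNS = [
    r'CIF[:\s]*R[O0]?\s*(\d[\d\s]{5,9})',   # Spaces in CUI
    r'C[I1]F[:\s]*(\d[\d\s]{6,10})',        # C1F (I→1 OCR error)
    r'C\.?\s*[I1]\.?\s*F\.?[:\s]*(\d+)',    # C. I. F. (spaced)
]


def extract_cui_digits(text: str) -> Optional[str]:
    """Try each pattern in order and return the digits without whitespace."""
    for pattern in CUI_OCR_TOLERANT_PATTERNS:
        match = re.search(pattern, text)
        if match:
            # The character class [\d\s] may capture spaces or a trailing newline.
            return re.sub(r'\s', '', match.group(1))
    return None
```

The extracted digits would still need length and checksum validation before being trusted.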
#### 4. User Requirements Integration
**FR-4.1:** Non-blocking validation warnings
- Save button enabled even with warnings
- User can override and submit
- Warnings displayed clearly in UI
**FR-4.2:** Manual review flag
- Database field: `receipts.needs_manual_review` (BOOLEAN)
- Set to `TRUE` if:
- Any validation warning present
- Overall confidence < 85%
- Cross-validation fails
**FR-4.3:** Apply to new uploads only
- No reprocessing of existing receipts
- Validation runs on OCR extraction (POST /api/ocr/extract)
- Migration: add column with default NULL (not FALSE)
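The flag conditions from FR-4.2 and the three-state column semantics from FR-4.3 can be sketched as follows. Both function names and the label strings are hypothetical; only the 85% threshold and the NULL/FALSE/TRUE meanings come from this spec.

```python
from typing import Optional, Sequence

REVIEW_CONFIDENCE_THRESHOLD = 0.85  # from FR-4.2


def compute_needs_manual_review(
    warnings: Sequence[object],
    overall_confidence: float,
    cross_validation_ok: bool = True,
) -> bool:
    """True if any FR-4.2 condition holds."""
    return (
        bool(warnings)
        or overall_confidence < REVIEW_CONFIDENCE_THRESHOLD
        or not cross_validation_ok
    )


def review_status_label(flag: Optional[bool]) -> str:
    """Interpret the nullable column: NULL means the receipt predates validation."""
    if flag is None:
        return "not validated"
    return "needs review" if flag else "no review needed"
```

Keeping NULL distinct from FALSE lets the "Needs Review" filter ignore receipts that were never validated.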
### Secondary Requirements (Nice-to-Have)
**FR-S1:** Validation confidence scoring
- Each validation rule contributes to score
- Overall validation confidence: weighted average
- Display in UI alongside OCR confidence
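One possible shape for the weighted average in FR-S1 is sketched below. The per-rule scores in the range 0.0 to 1.0, the default weight of 1.0, and the function name are all assumptions, since the spec does not fix them.

```python
from typing import Dict


def validation_confidence(
    rule_scores: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Weighted average of per-rule scores; rules without a weight count as 1.0."""
    total_weight = sum(weights.get(name, 1.0) for name in rule_scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(
        score * weights.get(name, 1.0) for name, score in rule_scores.items()
    )
    return weighted / total_weight
```

For example, scores `{'amount_range': 1.0, 'tva_ratio': 0.5}` with weights `{'amount_range': 3, 'tva_ratio': 1}` give (3 × 1.0 + 1 × 0.5) / 4.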
**FR-S2:** Validation rule configurability
- Move hardcoded thresholds to config
- Allow per-company customization
- Admin UI to adjust rules
---
## Technical Requirements
### Files to Create
#### 1. `backend/modules/data_entry/services/ocr/validation.py`
**Purpose:** Validation utilities and rule engine
**Size:** ~400 lines
**Key Classes:**
- `ValidationRule` (base class)
- `AmountRangeRule`, `TVARatioRule`, `PaymentSumRule`, `CUIChecksumRule`
- `OCRValidationEngine` (orchestrator)
**Example:**
```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from decimal import Decimal
from typing import Any, Dict, List, Optional


@dataclass
class ValidationWarning:
    """Non-blocking validation warning."""
    field: str
    rule: str
    message: str
    severity: str  # 'low', 'medium', 'high'
    suggested_value: Optional[Any] = None


class ValidationRule(ABC):
    """Base validation rule."""

    @abstractmethod
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        pass

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Return corrected field values; rules without corrections return {}."""
        return {}


class AmountRangeRule(ValidationRule):
    """Validate amount is in reasonable range (0.01 - 100,000 RON)."""

    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        # Compare against None so that an extracted 0.00 is still checked
        if extraction.amount is not None:
            if extraction.amount < Decimal('0.01'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} is too small (< 0.01 RON)',
                    severity='high'
                ))
            elif extraction.amount > Decimal('100000'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} exceeds limit (> 100,000 RON)',
                    severity='high'
                ))
        return warnings


class OCRValidationEngine:
    """Orchestrate all validation rules."""

    def __init__(self):
        self.rules = [
            AmountRangeRule(),
            TVARatioRule(),
            PaymentSumRule(),
            TVAEntriesSumRule(),  # Defined under "Validation Rules Specification"
            InterOCRConsistencyRule(),
            CUIChecksumRule(),
            DateValidityRule(),
        ]

    def validate(self, extraction: ExtractionResult) -> ValidationResult:
        """Run all validation rules and return result."""
        all_warnings = []
        corrected_fields = {}
        for rule in self.rules:
            warnings = rule.validate(extraction)
            all_warnings.extend(warnings)
            # Apply auto-corrections (base class default is a no-op)
            corrections = rule.auto_correct(extraction)
            corrected_fields.update(corrections)

        needs_review = (
            len(all_warnings) > 0 or
            extraction.overall_confidence < 0.85
        )
        return ValidationResult(
            is_valid=True,  # Never block (warnings only)
            warnings=all_warnings,
            errors=[],
            corrected_fields=corrected_fields,
            needs_manual_review=needs_review
        )
```
#### 2. `backend/modules/data_entry/tests/test_ocr_validation.py`
**Purpose:** Unit tests for validation rules
**Size:** ~300 lines
**Coverage Target:** >90%
**Test Cases:**
- `test_amount_range_valid()` - 85.99 RON passes
- `test_amount_range_too_high()` - 859,762.16 fails
- `test_tva_ratio_valid()` - 14.92/85.99 = 17.3% passes
- `test_tva_ratio_too_high()` - 149,214.92/859,762.16 = 17.3% but amounts wrong
- `test_payment_sum_matches()` - CARD 50 + NUMERAR 35.99 = TOTAL 85.99
- `test_cui_checksum_valid()` - R010562600 passes Mod 11
- `test_cui_checksum_invalid()` - R010562601 fails Mod 11
- `test_inter_ocr_consistency()` - 85.99 vs 859,762.16 = 10,000x flag
#### 3. `backend/modules/data_entry/tests/test_ocr_validation_integration.py`
**Purpose:** Integration tests with full OCR pipeline
**Size:** ~200 lines
**Test Cases:**
- `test_five_holding_receipt()` - Real production case (85.99 not 859,762.16)
- `test_clear_pdf_uses_light_ocr()` - High-quality PDF skips Heavy
- `test_faded_receipt_uses_medium_ocr()` - Thermal receipt uses Medium
- `test_validation_warnings_in_response()` - API returns warnings
- `test_manual_review_flag_set()` - Flag set when confidence < 85%
### Files to Modify
#### 1. `backend/modules/data_entry/services/ocr_service.py`
**Changes:** ~200 lines modified, ~100 lines added
**Key Modifications:**
**A. Replace `_merge_extractions()` (lines 240-386) with validation-aware version:**
```python
def _merge_extractions(
    self,
    light: Optional[ExtractionResult],
    medium: Optional[ExtractionResult]  # Renamed from 'tesseract'
) -> ExtractionResult:
    """
    Merge extractions with VALIDATION-AWARE logic.

    NEW Strategy:
    1. Run validation on both extractions
    2. Prefer extraction with FEWER warnings (not just higher confidence)
    3. For each field, pick value that passes validation
    4. Flag inter-OCR inconsistencies (>10x difference)
    """
    from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine

    # Nothing to merge if one side is missing (avoids attribute access on None)
    if light is None or medium is None:
        return light or medium or ExtractionResult()

    validator = OCRValidationEngine()

    # Validate both extractions
    light_validation = validator.validate(light)
    medium_validation = validator.validate(medium)

    result = ExtractionResult()

    # === AMOUNT (with validation check) ===
    if light.amount and medium.amount:
        # Check for 10x inconsistency
        ratio = max(light.amount, medium.amount) / min(light.amount, medium.amount)
        if ratio > 10:
            print(f"[Merge] WARNING: Inter-OCR inconsistency: {light.amount} vs {medium.amount} ({ratio:.0f}x)", flush=True)
            # Prefer value that passes validation
            light_warnings = [w for w in light_validation.warnings if w.field == 'amount']
            medium_warnings = [w for w in medium_validation.warnings if w.field == 'amount']
            # On a tie, keep Light (the first, higher-confidence pass)
            if len(light_warnings) <= len(medium_warnings):
                result.amount = light.amount
                result.confidence_amount = light.confidence_amount
                print(f"[Merge] Using Light OCR amount: {light.amount} (fewer or equal warnings)", flush=True)
            else:
                result.amount = medium.amount
                result.confidence_amount = medium.confidence_amount
                print(f"[Merge] Using Medium OCR amount: {medium.amount} (fewer warnings)", flush=True)
        else:
            # Normal merge: prefer higher confidence
            if light.confidence_amount >= medium.confidence_amount:
                result.amount = light.amount
                result.confidence_amount = light.confidence_amount
            else:
                result.amount = medium.amount
                result.confidence_amount = medium.confidence_amount
    elif light.amount:
        result.amount = light.amount
        result.confidence_amount = light.confidence_amount
    elif medium.amount:
        result.amount = medium.amount
        result.confidence_amount = medium.confidence_amount

    # ... (similar logic for other fields)
    return result
```
**B. Add `preprocess_medium()` call (replace Heavy):**
```python
# Line ~130: Replace preprocess_heavy with preprocess_medium
print("=" * 60, flush=True)
print("[OCR] STEP 2: PaddleOCR + Medium preprocessing", flush=True)
print("=" * 60, flush=True)
medium_img = self.preprocessor.preprocess_medium(image)  # NEW
try:
    paddle_medium = self.ocr_engine._paddle_recognize(medium_img)
    # ... rest of processing
```
**C. Add validation to final result:**
```python
# Line ~204: Add validation before returning
if extraction:
    extraction = self._final_validation(extraction)

    # NEW: Run validation engine
    from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine
    validator = OCRValidationEngine()
    validation_result = validator.validate(extraction)

    # Apply auto-corrections
    for field, value in validation_result.corrected_fields.items():
        setattr(extraction, field, value)

    # Store validation warnings (add to ExtractionResult)
    extraction.validation_warnings = validation_result.warnings
    extraction.needs_manual_review = validation_result.needs_manual_review
```
#### 2. `backend/modules/data_entry/services/ocr_extractor.py`
**Changes:** ~50 lines modified, ~30 lines added
**Key Modifications:**
**A. Add validation fields to `ExtractionResult` (lines 10-50):**
```python
@dataclass
class ExtractionResult:
    """Structured extraction result from receipt."""
    # ... existing fields ...

    # NEW: Validation results
    # Holds ValidationWarning objects (the router reads .field, .rule, etc.)
    validation_warnings: List[Any] = field(default_factory=list)
    needs_manual_review: bool = False  # Flag for supervisor review

    # NEW: Inter-OCR comparison data
    inter_ocr_ratio: Optional[float] = None  # Ratio between Light/Medium values
    inter_ocr_source_used: Optional[str] = None  # 'light' or 'medium'
```
**B. Fix CLIENT CUI patterns (lines 253-272):**
```python
# Current patterns are too strict - add OCR-tolerant versions
CLIENT_CUI_PATTERNS = [
    # ... existing patterns ...
    # NEW: OCR-tolerant patterns
    (r'CLIENT\s+C[I1UO]F\s*[:/]?\s*(?:R[O0])?(\d[\d\s]{5,9})', 0.96),  # Spaces in CUI
    (r'C[I1]F\s+CLIENT\s*[:/]?\s*(?:R[O0])?(\d[\d\s]{5,9})', 0.96),    # Reversed format
    (r'CLIENT.*?(?:R[O0])?(\d{6,10})\s*\n', 0.90),                     # CUI on next line
]
```
**C. Add CUI normalization and validation:**
```python
def _normalize_cui(self, cui: Optional[str]) -> Optional[str]:
    """Normalize CUI format and validate checksum."""
    if not cui:
        return None

    # Remove non-digits
    digits = re.sub(r'\D', '', cui)

    # Validate length
    if not (6 <= len(digits) <= 10):
        return None

    # Validate Mod 11 checksum (Romanian CIF algorithm)
    if not self._validate_cui_checksum(digits):
        print(f"[CUI Validation] Invalid checksum: {digits}", flush=True)
        return None

    # Add RO prefix
    return f"RO{digits}"


def _validate_cui_checksum(self, digits: str) -> bool:
    """Validate Romanian CIF Mod 11 checksum."""
    if len(digits) < 2:
        return False

    # Test key 7, 5, 3, 2, 1, 7, 5, 3, 2, right-aligned against the
    # digits that precede the control digit
    weights = [7, 5, 3, 2, 1, 7, 5, 3, 2]

    # Get control digit (last digit)
    control = int(digits[-1])

    # Left-pad with zeros so the weights line up from the right
    digits_to_check = digits[:-1].zfill(9)
    checksum = sum(int(d) * w for d, w in zip(digits_to_check, weights))

    # Multiply by 10, then Mod 11; a result of 10 maps to control digit 0
    remainder = (checksum * 10) % 11
    expected_control = 0 if remainder == 10 else remainder
    return control == expected_control
```
#### 3. `backend/modules/data_entry/services/image_preprocessor.py`
**Changes:** ~80 lines added
**Key Modifications:**
**A. Add `preprocess_medium()` method (after line 166):**
```python
def preprocess_medium(self, image: np.ndarray) -> np.ndarray:
    """
    Medium preprocessing for MIXED-QUALITY images.

    Balance between Light (too gentle) and Heavy (too aggressive).

    Use cases:
    - Moderately faded receipts
    - Photos with uneven lighting
    - Scans with slight blur

    Preprocessing steps:
    - Moderate contrast enhancement (CLAHE clipLimit=2.0)
    - Light denoising (fastNlMeansDenoising h=6)
    - Gentle sharpening
    - NO binarization (preserves text boundaries)
    - NO morphological operations (avoids digit concatenation)
    """
    # 0. Add safety padding
    image = self._add_safety_padding(image)

    # 1. Grayscale
    if len(image.shape) == 3:
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    else:
        gray = image.copy()

    # 2. Scale (same as Light)
    height, width = gray.shape
    max_side = max(height, width)
    if max_side > 4000:
        scale = 4000 / max_side
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
        height, width = gray.shape
    if width < 1500:
        scale = 1500 / width
        new_width = int(width * scale)
        new_height = int(height * scale)
        if max(new_width, new_height) > 4000:
            # Cap the factor relative to the CURRENT size so the longest
            # side ends at 4000 (dividing by the projected size would shrink it)
            scale = 4000 / max(width, height)
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

    # 3. Deskew
    gray = self._deskew(gray)

    # 4. Moderate contrast enhancement
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)

    # 5. Light denoising (less aggressive than Heavy)
    denoised = cv2.fastNlMeansDenoising(enhanced, h=6, templateWindowSize=7, searchWindowSize=15)

    # 6. Gentle sharpening (unsharp mask)
    gaussian = cv2.GaussianBlur(denoised, (0, 0), 1.0)
    sharpened = cv2.addWeighted(denoised, 1.3, gaussian, -0.3, 0)

    # NO binarization, NO morphological operations
    # This preserves text boundaries and avoids digit concatenation
    return sharpened
```
**B. Mark `preprocess_heavy()` as deprecated:**
```python
def preprocess_heavy(self, image: np.ndarray) -> np.ndarray:
    """
    Heavy preprocessing for FADED thermal receipts.

    ⚠️ DEPRECATED: Use preprocess_medium() instead.
    Heavy preprocessing causes digit concatenation on clear PDFs.
    Kept for backward compatibility only.
    """
    # ... existing code (unchanged)
```
#### 4. `backend/modules/data_entry/routers/ocr.py`
**Changes:** ~40 lines modified
**Key Modifications:**
**A. Update `ExtractionData` schema instantiation (lines 106-128):**
```python
# Add validation warnings to response
validation_warnings_list = [
    {
        'field': w.field,
        'rule': w.rule,
        'message': w.message,
        'severity': w.severity,
        'suggested_value': w.suggested_value
    }
    for w in result.validation_warnings
] if hasattr(result, 'validation_warnings') else []

data = ExtractionData(
    # ... existing fields ...
    # NEW: Validation fields
    validation_warnings=validation_warnings_list,
    needs_manual_review=getattr(result, 'needs_manual_review', False),
    inter_ocr_ratio=getattr(result, 'inter_ocr_ratio', None),
    inter_ocr_source_used=getattr(result, 'inter_ocr_source_used', None),
)
```
#### 5. `backend/modules/data_entry/schemas/ocr.py`
**Changes:** ~20 lines added
**Key Modifications:**
**A. Add validation fields to `ExtractionData` (after line 57):**
```python
class ValidationWarning(BaseModel):
    """Validation warning from OCR extraction."""
    field: str = Field(description="Field name (e.g., 'amount', 'tva_total')")
    rule: str = Field(description="Rule name (e.g., 'amount_range', 'tva_ratio')")
    message: str = Field(description="Human-readable warning message")
    severity: str = Field(description="Severity: 'low', 'medium', 'high'")
    suggested_value: Optional[Any] = Field(default=None, description="Suggested corrected value")


class ExtractionData(BaseModel):
    """Extracted receipt data from OCR."""
    # ... existing fields ...

    # NEW: Validation results
    validation_warnings: List[ValidationWarning] = Field(default_factory=list, description="Validation warnings")
    needs_manual_review: bool = Field(default=False, description="Flag for supervisor review")
    inter_ocr_ratio: Optional[float] = Field(default=None, description="Ratio between OCR engines (>10 = inconsistent)")
    inter_ocr_source_used: Optional[str] = Field(default=None, description="OCR engine used: 'light' or 'medium'")
```
#### 6. Database Migration: `backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py`
**Purpose:** Add `needs_manual_review` column to `receipts` table
**Size:** ~30 lines (Alembic migration)
```python
"""Add needs_manual_review flag to receipts
Revision ID: XXX
Create Date: 2025-12-30
"""
from alembic import op
import sqlalchemy as sa

revision = 'XXX'
down_revision = 'YYY'  # Previous migration
branch_labels = None
depends_on = None


def upgrade():
    # Add column with default NULL (not FALSE)
    # NULL = not validated yet (old receipts)
    # FALSE = validated, no review needed
    # TRUE = validated, needs review
    op.add_column('receipts', sa.Column('needs_manual_review', sa.Boolean(), nullable=True))


def downgrade():
    op.drop_column('receipts', 'needs_manual_review')
```
### Frontend Integration Points
#### 1. `src/modules/data-entry/views/receipts/ReceiptCreateView.vue`
**Changes:** Display validation warnings below OCR results
**Example:**
```vue
<template>
<div class="ocr-results">
<!-- Existing OCR fields -->
<!-- NEW: Validation warnings section -->
<div v-if="ocrData.validation_warnings?.length > 0" class="validation-warnings">
<h4>
<i class="pi pi-exclamation-triangle" />
Avertismente Validare ({{ ocrData.validation_warnings.length }})
</h4>
<ul>
<li
v-for="(warning, idx) in ocrData.validation_warnings"
:key="idx"
:class="`severity-${warning.severity}`"
>
<strong>{{ warning.field }}:</strong> {{ warning.message }}
<span v-if="warning.suggested_value" class="suggestion">
(sugestie: {{ warning.suggested_value }})
</span>
</li>
</ul>
</div>
<!-- NEW: Manual review badge -->
<div v-if="ocrData.needs_manual_review" class="manual-review-badge">
<i class="pi pi-flag" />
Necesită verificare manuală
</div>
</div>
</template>
<style scoped>
.validation-warnings {
margin-top: 1rem;
padding: 1rem;
background: #fff3cd;
border-left: 4px solid #ffc107;
}
.validation-warnings li.severity-low {
color: #666;
}
.validation-warnings li.severity-medium {
color: #f57c00;
}
.validation-warnings li.severity-high {
color: #d32f2f;
font-weight: bold;
}
.manual-review-badge {
margin-top: 0.5rem;
padding: 0.5rem 1rem;
background: #fff3cd;
border-radius: 4px;
display: inline-flex;
align-items: center;
gap: 0.5rem;
}
</style>
```
#### 2. `src/modules/data-entry/components/ocr/OCRPreview.vue`
**Changes:** Add inter-OCR consistency indicator
**Example:**
```vue
<template>
<div class="ocr-preview">
<!-- Existing fields -->
<!-- NEW: Inter-OCR consistency warning -->
<div v-if="ocrData.inter_ocr_ratio && ocrData.inter_ocr_ratio > 10" class="ocr-consistency-warning">
<i class="pi pi-exclamation-circle" />
Inconsistență detectată între motoarele OCR ({{ Math.round(ocrData.inter_ocr_ratio) }}x diferență).
<br />
<small>Valorile folosite provin din: {{ ocrData.inter_ocr_source_used }}</small>
</div>
</div>
</template>
```
---
## Design Decisions
### 1. Why Validation Warnings Instead of Errors?
**Decision:** Use non-blocking warnings instead of blocking errors.
**Rationale:**
- User requirement: "Allow save with warnings"
- OCR will never be 100% perfect
- Users can override incorrect extractions
- Supervisor review catches issues before approval
**Trade-off:** Risk of bad data entering system vs. user frustration with blocked submissions.
**Mitigation:** Manual review flag ensures supervisor catches issues.
### 2. Why Replace Heavy with Medium OCR?
**Decision:** Remove Heavy preprocessing, add Medium preprocessing.
**Rationale:**
- **Heavy causes digit concatenation** on clear PDFs (production evidence)
- Binarization destroys text boundaries on high-quality images
- Morphological operations merge adjacent numbers (85.99 → 859,762.16)
**Analysis of Heavy Preprocessing (lines 153-164 in `image_preprocessor.py`):**
```python
# 7. Adaptive thresholding (binarization) - PROBLEM!
binary = cv2.adaptiveThreshold(
    sharpened, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    blockSize=11, C=5  # Block size can merge nearby digits
)

# 8. Morphological operations - COMPOUNDS THE PROBLEM!
kernel_close = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
result = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel_close)
# MORPH_CLOSE fills small gaps → merges adjacent numbers
```
**Alternative Considered:** Keep Heavy but add safeguards. **Rejected:** Too risky, no benefit for clear PDFs.
### 3. Why Romanian CIF Mod 11 Validation?
**Decision:** Implement CIF checksum validation algorithm.
**Rationale:**
- Romanian CIFs have built-in checksum (last digit)
- Validates extracted CUI is mathematically correct
- Catches OCR digit errors (10562600 vs 10562601)
**Algorithm:** Mod 11 checksum
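- Weights: [7, 5, 3, 2, 1, 7, 5, 3, 2], right-aligned against the digits before the control digit (left-pad with zeros to 9 digits)
- Formula: `(sum(digit[i] * weight[i]) * 10) % 11`
- Control digit: the result (0 if the result is 10)
**Example:** RO10562600
- Digits before the control digit: 1,0,5,6,2,6,0 (padded: 0,0,1,0,5,6,2,6,0); control digit: [0]
- Weighted sum: 0×7 + 0×5 + 1×3 + 0×2 + 5×1 + 6×7 + 2×5 + 6×3 + 0×2 = 0+0+3+0+5+42+10+18+0 = 78
- (78 × 10) % 11 = 780 % 11 = 10 → control digit 0, which matches the last digit → **VALID** (consistent with `test_cui_checksum_valid()`)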
**Note:** Some older CIFs may not have checksums (pre-2000). Validation is permissive (warning, not error).
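A standalone sketch of the check follows, assuming the commonly documented form of the algorithm: test key 753217532 right-aligned against the digits before the control digit, the weighted sum multiplied by 10, then taken modulo 11, with a result of 10 mapped to 0. The function names are hypothetical and independent of the service code.

```python
CIF_TEST_KEY = "753217532"


def cif_control_digit(body: str) -> int:
    """Compute the control digit for the digits that precede it."""
    padded = body.zfill(len(CIF_TEST_KEY))  # right-align against the key
    total = sum(int(d) * int(k) for d, k in zip(padded, CIF_TEST_KEY))
    result = (total * 10) % 11
    return 0 if result == 10 else result


def is_valid_cif(digits: str) -> bool:
    """True if the last digit matches the computed control digit."""
    if not digits.isdigit() or not (2 <= len(digits) <= 10):
        return False
    return cif_control_digit(digits[:-1]) == int(digits[-1])
```

Under these assumptions `is_valid_cif("10562600")` holds and `is_valid_cif("10562601")` does not, in line with the expectations of `test_cui_checksum_valid()` and `test_cui_checksum_invalid()`.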
### 4. Why Apply to New Uploads Only?
**Decision:** Don't reprocess existing receipts.
**Rationale:**
- Migration impact: ~500 existing receipts in DB
- Reprocessing cost: OCR is slow (~2-5s per receipt)
- Risk: May change existing approved data
- Benefit: Minimal (old receipts already reviewed)
**Implementation:** Migration adds column with default NULL (not FALSE).
---
## Validation Rules Specification
### 1. Amount Range Validation
**Rule:** Amount must be between 0.01 and 100,000 RON.
**Implementation:**
```python
class AmountRangeRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        # Compare against None so that an extracted 0.00 is still checked
        if extraction.amount is not None:
            if extraction.amount < Decimal('0.01'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} is too small (< 0.01 RON)',
                    severity='high'
                ))
            elif extraction.amount > Decimal('100000'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} exceeds limit (> 100,000 RON)',
                    severity='high'
                ))

            # Check decimal places (a positive exponent means no fractional digits)
            decimal_places = max(0, -extraction.amount.as_tuple().exponent)
            if decimal_places > 2:
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='decimal_places',
                    message=f'Amount has {decimal_places} decimal places (max 2)',
                    severity='medium',
                    suggested_value=extraction.amount.quantize(Decimal('0.01'))
                ))
        return warnings
```
**Test Cases:**
- 0.00 RON → Warning (too small)
- 0.01 RON → Valid
- 85.99 RON → Valid
- 100,000 RON → Valid
- 100,001 RON → Warning (too large)
- 859,762.16 RON → Warning (too large)
- 85.999 RON → Warning (too many decimals)
### 2. TVA Ratio Validation
**Rule:** TVA must be 5-24% of TOTAL amount.
**Implementation:**
```python
class TVARatioRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.tva_total and extraction.amount:
            # TVA cannot be greater than TOTAL
            if extraction.tva_total > extraction.amount:
                warnings.append(ValidationWarning(
                    field='tva_total',
                    rule='tva_greater_than_total',
                    message=f'TVA ({extraction.tva_total}) cannot be greater than TOTAL ({extraction.amount})',
                    severity='high',
                    suggested_value=None  # Will be auto-corrected by service
                ))
            else:
                # Check ratio
                ratio = extraction.tva_total / extraction.amount * Decimal('100')
                if ratio < Decimal('5'):
                    warnings.append(ValidationWarning(
                        field='tva_total',
                        rule='tva_ratio_low',
                        message=f'TVA is {ratio:.1f}% of total (expected 5-24%)',
                        severity='medium'
                    ))
                elif ratio > Decimal('24'):
                    warnings.append(ValidationWarning(
                        field='tva_total',
                        rule='tva_ratio_high',
                        message=f'TVA is {ratio:.1f}% of total (expected 5-24%)',
                        severity='high'
                    ))
        return warnings
```
**Test Cases:**
- TVA=14.92, TOTAL=85.99 → 17.3% → Valid
- TVA=149,214.92, TOTAL=859,762.16 → 17.3% → Both values wrong (caught by amount_range)
- TVA=4.00, TOTAL=100.00 → 4% → Warning (too low)
- TVA=100.00, TOTAL=85.99 → 116% → Warning (impossible!)
### 3. Payment Sum Validation
**Rule:** CARD + NUMERAR must equal TOTAL (±0.02 RON tolerance).
**Implementation:**
```python
class PaymentSumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.payment_methods and extraction.amount:
            payment_sum = sum(pm['amount'] for pm in extraction.payment_methods)
            difference = abs(payment_sum - extraction.amount)
            if difference > Decimal('0.02'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='payment_sum_mismatch',
                    message=f'Payment methods sum ({payment_sum}) ≠ TOTAL ({extraction.amount}), diff={difference}',
                    severity='high',
                    suggested_value=payment_sum
                ))
        return warnings

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Auto-correct TOTAL from payment sum if confidence < 80%."""
        corrections = {}
        if extraction.payment_methods and extraction.amount:
            payment_sum = sum(pm['amount'] for pm in extraction.payment_methods)
            difference = abs(payment_sum - extraction.amount)
            if difference > Decimal('0.02') and extraction.confidence_amount < 0.80:
                corrections['amount'] = payment_sum
                print(f"[Auto-Correct] TOTAL corrected: {extraction.amount} → {payment_sum} (from payment methods)", flush=True)
        return corrections
```
**Test Cases:**
- CARD=50, NUMERAR=35.99, TOTAL=85.99 → Valid
- CARD=50, NUMERAR=35.97, TOTAL=85.99 → Diff=0.02 → Valid (tolerance)
- CARD=50, NUMERAR=35.00, TOTAL=85.99 → Diff=0.99 → Warning
### 4. TVA Entries Sum Validation
**Rule:** Σ(TVA entries) must equal TVA TOTAL (±0.02 RON tolerance).
**Implementation:**
```python
class TVAEntriesSumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.tva_entries and extraction.tva_total:
            entries_sum = sum(e['amount'] for e in extraction.tva_entries)
            difference = abs(entries_sum - extraction.tva_total)
            if difference > Decimal('0.02'):
                warnings.append(ValidationWarning(
                    field='tva_total',
                    rule='tva_entries_sum_mismatch',
                    message=f'TVA entries sum ({entries_sum}) ≠ TVA TOTAL ({extraction.tva_total}), diff={difference}',
                    severity='medium',
                    suggested_value=entries_sum
                ))
        return warnings

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Use entries sum as TVA TOTAL if mismatch."""
        corrections = {}
        if extraction.tva_entries and extraction.tva_total:
            entries_sum = sum(e['amount'] for e in extraction.tva_entries)
            difference = abs(entries_sum - extraction.tva_total)
            if difference > Decimal('0.02'):
                corrections['tva_total'] = entries_sum
                print(f"[Auto-Correct] TVA TOTAL corrected: {extraction.tva_total} → {entries_sum} (from entries)", flush=True)
        return corrections
```
**Test Cases:**
- Entries=[A:19%:14.92], TOTAL=14.92 → Valid
- Entries=[A:19%:10.00, B:9%:4.92], TOTAL=14.92 → Valid
- Entries=[A:19%:14.92], TOTAL=14.94 → Diff=0.02 → Valid (tolerance)
- Entries=[A:19%:14.92], TOTAL=15.00 → Diff=0.08 → Warning
### 5. Inter-OCR Consistency Validation
**Rule:** Flag if values differ >10x between OCR engines.
**Implementation:**
```python
class InterOCRConsistencyRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        """This rule is applied during merge, stores ratio in extraction."""
        warnings = []
        if hasattr(extraction, 'inter_ocr_ratio') and extraction.inter_ocr_ratio:
            if extraction.inter_ocr_ratio > 10:
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='inter_ocr_inconsistency',
                    message=f'Large inconsistency between OCR engines ({extraction.inter_ocr_ratio:.0f}x difference)',
                    severity='high'
                ))
        return warnings
```
**Test Cases:**
- Light=85.99, Medium=86.00 → Ratio=1.00 → Valid
- Light=85.99, Medium=90.00 → Ratio=1.05 → Valid
- Light=85.99, Medium=859.76 → Ratio=10.00 → Valid (edge case)
- Light=85.99, Medium=859,762.16 → Ratio=10,000 → Warning!
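The rule reads `inter_ocr_ratio` but this section does not show how the merge step computes it. One possible sketch follows, assuming the ratio is the larger amount divided by the smaller and that missing or non-positive amounts yield no ratio at all. The function name and signature are hypothetical.

```python
from decimal import Decimal
from typing import Optional


def inter_ocr_ratio(light: Optional[Decimal], other: Optional[Decimal]) -> Optional[Decimal]:
    """Return max/min of two positive amounts, or None when a ratio is undefined."""
    if light is None or other is None:
        return None
    if light <= 0 or other <= 0:
        # Avoids ZeroDivisionError and meaningless ratios for zero/negative amounts.
        return None
    high, low = max(light, other), min(light, other)
    return high / low


ratio = inter_ocr_ratio(Decimal("85.99"), Decimal("859762.16"))
print(ratio is not None and ratio > 10)  # the production case exceeds the 10x threshold
```

Dividing the larger value by the smaller makes the ratio independent of which engine produced the inflated number, so a single `> 10` comparison covers both directions.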
### 6. CUI Checksum Validation
**Rule:** Validate Romanian CIF Mod 11 checksum.
**Implementation:**
```python
import re


class CUIChecksumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.cui:
            # Normalize CUI: keep digits only (drops the "RO" prefix and separators)
            digits = re.sub(r'\D', '', extraction.cui)
            # Validate length
            if not (6 <= len(digits) <= 10):
                warnings.append(ValidationWarning(
                    field='cui',
                    rule='cui_length',
                    message=f'CUI length invalid: {len(digits)} digits (expected 6-10)',
                    severity='medium'
                ))
                return warnings
            # Validate Mod 11 checksum
            if not self._validate_checksum(digits):
                warnings.append(ValidationWarning(
                    field='cui',
                    rule='cui_checksum',
                    message=f'CUI checksum invalid: {extraction.cui} (failed Mod 11 validation)',
                    severity='medium'  # Medium: some old CIFs don't have checksums
                ))
        return warnings

    def _validate_checksum(self, digits: str) -> bool:
        """Romanian CIF Mod 11 checksum validation."""
        if len(digits) < 2:
            return False
        weights = [7, 5, 3, 2, 1, 7, 5, 3, 2]
        control = int(digits[-1])
        # Left-pad the body so it lines up with the 9-digit key
        digits_to_check = digits[:-1].zfill(9)
        checksum = sum(int(d) * w for d, w in zip(digits_to_check, weights))
        # Control digit = (weighted sum * 10) mod 11, with 10 mapped to 0
        remainder = (checksum * 10) % 11
        expected = 0 if remainder == 10 else remainder
        return control == expected
```
**Test Cases:**
- RO10562600 → Valid (checksum passes)
- RO11201891 → Valid (checksum passes)
- RO12345678 → Warning (invalid checksum)
- RO1234 → Warning (too short)
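As a worked check of these cases, here is a standalone sketch of the usual CIF control-digit procedure: right-align the body digits against the key 753217532, sum the products, multiply by 10, take the result modulo 11, and map 10 to 0. The function name is hypothetical, and the length bounds here are deliberately loose because the 6-10 digit check is handled separately in the rule.

```python
import re

# Control key, applied to the body digits left-padded to nine positions.
WEIGHTS = (7, 5, 3, 2, 1, 7, 5, 3, 2)


def cui_checksum_ok(cui: str) -> bool:
    """Check the control (last) digit of a CUI given with or without the RO prefix."""
    text = re.sub(r"^\s*RO", "", cui.strip(), flags=re.IGNORECASE)
    digits = re.sub(r"\D", "", text)
    if len(digits) < 2 or len(digits) > 10:
        return False
    body, control = digits[:-1].zfill(9), int(digits[-1])
    total = sum(int(d) * w for d, w in zip(body, WEIGHTS))
    expected = (total * 10) % 11
    if expected == 10:
        expected = 0
    return control == expected


for sample in ("RO10562600", "RO11201891", "RO12345678"):
    print(sample, cui_checksum_ok(sample))
```

For `10562600` the weighted sum is 78, and 780 mod 11 is 10, which maps to control digit 0. For `11201891` the sum is 54, and 540 mod 11 is 1. A leading zero introduced by an `R0` misread does not change the result, because the body is left-padded with zeros anyway.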
### 7. Date Validity Validation
**Rule:** Date must not be in future, not older than 10 years.
**Implementation:**
```python
from datetime import date


class DateValidityRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.receipt_date:
            today = date.today()
            # Check future date
            if extraction.receipt_date > today:
                warnings.append(ValidationWarning(
                    field='receipt_date',
                    rule='date_future',
                    message=f'Date is in the future: {extraction.receipt_date}',
                    severity='high'
                ))
            # Check too old (10 years). Feb 29 has no counterpart ten years
            # earlier, so fall back to Feb 28 instead of raising ValueError.
            try:
                cutoff_date = today.replace(year=today.year - 10)
            except ValueError:
                cutoff_date = today.replace(year=today.year - 10, day=28)
            if extraction.receipt_date < cutoff_date:
                warnings.append(ValidationWarning(
                    field='receipt_date',
                    rule='date_too_old',
                    message=f'Date is older than 10 years: {extraction.receipt_date}',
                    severity='medium'
                ))
        return warnings
```
**Test Cases:**
- 2025-12-30 (today) → Valid
- 2025-10-11 → Valid
- 2026-01-01 → Warning (future)
- 2015-12-31 → Valid (just inside the 10-year window; cutoff is 2015-12-30)
- 2014-12-31 → Warning (too old)
---
## Acceptance Criteria
### Critical Success Criteria (Must Pass)
**AC-1:** Five-Holding receipt extracts correct values
- **Given:** Production PDF receipt (Five-Holding, 85.99 LEI)
- **When:** OCR processes with new validation
- **Then:**
- TOTAL = 85.99 LEI (NOT 859,762.16)
- TVA = 14.92 LEI (NOT 149,214.92)
- CUI = RO10562600
- Overall confidence >= 90%
**AC-2:** Save works with validation warnings
- **Given:** Receipt with low confidence (75%)
- **When:** User clicks Save
- **Then:**
- Warnings displayed in UI
- Save button enabled
- Receipt saved with `needs_manual_review=TRUE`
**AC-3:** Cross-validation: CARD + NUMERAR = TOTAL
- **Given:** Receipt with CARD=50, NUMERAR=35.99
- **When:** OCR extracts TOTAL=85.96 (off by 0.03, beyond the 0.02 tolerance)
- **Then:**
- Warning displayed: "Payment sum (85.99) ≠ TOTAL (85.96)"
- Suggested value: 85.99
- Auto-corrected if confidence < 80%
**AC-4:** Cross-validation: Σ(TVA entries) = TVA TOTAL
- **Given:** Receipt with TVA A=10.00, TVA B=4.92
- **When:** OCR extracts TVA TOTAL=14.89 (off by 0.03, beyond the 0.02 tolerance)
- **Then:**
- Warning displayed: "TVA entries sum (14.92) ≠ TVA TOTAL (14.89)"
- Auto-corrected to 14.92
**AC-5:** CUI Mod 11 validation works
- **Given:** Receipt with CUI RO10562600
- **When:** OCR processes
- **Then:**
- CUI validated against Mod 11 checksum
- If invalid, warning displayed
- Format normalized to "RO" prefix
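The normalization step in AC-5 could look roughly like the sketch below. It assumes that a leading `R0` (letter R, digit zero) is an OCR misread of `RO`, that spaces are noise, and that the output always carries the `RO` prefix. `normalize_cui` is a hypothetical helper, not a function from `ocr_extractor.py`.

```python
import re
from typing import Optional


def normalize_cui(raw: str) -> Optional[str]:
    """Return 'RO' + digits, or None when no plausible digit run is found."""
    text = raw.upper().replace(" ", "")
    # Accept an optional prefix, treating 'R0' (zero) as the OCR misread of 'RO'.
    match = re.match(r"^(?:RO|R0)?(\d{2,10})$", text)
    if match is None:
        return None
    return "RO" + match.group(1)


for raw in ("R010562600", "ro 10562600", "10562600", "CIF?"):
    print(repr(raw), "->", normalize_cui(raw))
```

Returning `None` rather than a best guess keeps unparseable input out of the checksum step, where it would only produce a misleading warning.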
### Secondary Criteria (Nice-to-Have)
🔲 **AC-S1:** Medium OCR performs better than Heavy
- **Given:** 10 clear PDF receipts
- **When:** Processed with Light → Medium → Tesseract
- **Then:**
- No 10x magnitude errors
- Average confidence >= 90%
- Processing time < 5s
🔲 **AC-S2:** Validation warnings show in UI
- **Given:** Receipt with 3 validation warnings
- **When:** OCR completes
- **Then:**
- Warning section displayed
- Each warning shows: field, message, severity
- Suggested values displayed if available
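To make the shape of that UI payload concrete, here is a hedged sketch of how warnings might be serialized for the API response. The `ValidationWarning` dataclass below is a stand-in that mirrors the keyword arguments used by the rules in this spec; the real class and the `warnings_payload` helper name are assumptions.

```python
from dataclasses import asdict, dataclass
from decimal import Decimal
from typing import Any, Dict, List, Optional


@dataclass
class ValidationWarning:
    """Stand-in mirroring the keyword arguments used by the validation rules."""
    field: str
    rule: str
    message: str
    severity: str
    suggested_value: Optional[Decimal] = None


def warnings_payload(warnings: List[ValidationWarning]) -> List[Dict[str, Any]]:
    """Build JSON-friendly dicts; omit suggested_value when a rule offers none."""
    payload = []
    for warning in warnings:
        item = asdict(warning)
        if item["suggested_value"] is None:
            del item["suggested_value"]
        else:
            # Decimal is not JSON-serializable by default, so send it as a string.
            item["suggested_value"] = str(item["suggested_value"])
        payload.append(item)
    return payload


sample = [
    ValidationWarning("amount", "payment_sum_mismatch", "Payment sum differs from TOTAL", "high", Decimal("85.99")),
    ValidationWarning("cui", "cui_checksum", "CUI checksum invalid", "medium"),
]
print(warnings_payload(sample))
```

Sending the suggested value as a string preserves the two decimal places exactly, which a float round-trip through JSON would not guarantee.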
---
## Testing Strategy
### Unit Tests (~300 lines)
**File:** `backend/modules/data_entry/tests/test_ocr_validation.py`
**Test Coverage:**
```python
# Amount validation
test_amount_range_valid()
test_amount_range_too_small()
test_amount_range_too_large()
test_amount_decimal_places()
# TVA validation
test_tva_ratio_valid()
test_tva_ratio_too_low()
test_tva_ratio_too_high()
test_tva_greater_than_total()
test_tva_entries_sum_matches()
test_tva_entries_sum_mismatch()
# Payment validation
test_payment_sum_matches()
test_payment_sum_mismatch_within_tolerance()
test_payment_sum_mismatch_auto_corrected()
# CUI validation
test_cui_checksum_valid()
test_cui_checksum_invalid()
test_cui_length_invalid()
test_cui_normalization()
# Date validation
test_date_valid()
test_date_future()
test_date_too_old()
# Inter-OCR consistency
test_inter_ocr_consistency_valid()
test_inter_ocr_consistency_10x_difference()
# Validation engine
test_validation_engine_no_warnings()
test_validation_engine_multiple_warnings()
test_validation_engine_auto_corrections()
test_needs_manual_review_flag()
```
### Integration Tests (~200 lines)
**File:** `backend/modules/data_entry/tests/test_ocr_validation_integration.py`
**Test Coverage:**
```python
# Real receipts
test_five_holding_receipt() # Production case (85.99 not 859,762.16)
test_omv_receipt() # Clear PDF, Light OCR only
test_kaufland_receipt() # Faded thermal, Medium OCR
test_mega_image_receipt() # Multiple TVA entries
# OCR pipeline
test_light_ocr_high_confidence_skips_medium()
test_light_ocr_low_confidence_runs_medium()
test_medium_ocr_replaces_heavy()
test_validation_runs_after_merge()
# API responses
test_api_returns_validation_warnings()
test_api_returns_needs_manual_review_flag()
test_api_returns_inter_ocr_ratio()
test_api_auto_corrects_amount_from_payments()
# Edge cases
test_no_ocr_engines_available()
test_pdf_with_multiple_pages()
test_receipt_with_no_tva()
test_receipt_with_no_payment_methods()
```
### Manual Testing Checklist
1. **Upload Five-Holding receipt PDF** (production case)
- [ ] Verify TOTAL = 85.99 (not 859,762.16)
- [ ] Verify TVA = 14.92 (not 149,214.92)
- [ ] Verify no validation warnings
- [ ] Verify overall confidence >= 90%
2. **Upload faded thermal receipt photo**
- [ ] Verify Medium OCR used (not Heavy)
- [ ] Verify readable text extracted
- [ ] Verify no digit concatenation
3. **Upload receipt with payment methods**
- [ ] Verify CARD + NUMERAR displayed
- [ ] Verify sum matches TOTAL
- [ ] If mismatch, verify warning displayed
4. **Upload receipt with multiple TVA entries**
- [ ] Verify all TVA entries extracted
- [ ] Verify sum matches TVA TOTAL
- [ ] If mismatch, verify warning displayed
5. **Submit receipt with warnings**
- [ ] Verify Save button enabled
- [ ] Verify warnings displayed in UI
- [ ] Verify `needs_manual_review` flag set
6. **Filter receipts by "Needs Review"**
- [ ] Verify filter shows flagged receipts
- [ ] Verify supervisor can review
---
## Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| **Medium OCR still causes errors** | Medium | High | Keep Tesseract as Step 3 fallback; validation catches issues |
| **CUI Mod 11 validation too strict** | Medium | Low | Use warning (not error); allow override; some old CIFs don't have checksums |
| **Validation rules too permissive** | Low | Medium | Start conservative, tune based on production data |
| **Validation rules too strict** | Medium | Low | Non-blocking warnings allow user override |
| **Performance impact** | Low | Low | Validation is fast (<10ms); OCR dominates processing time |
| **Breaking changes to API** | Low | High | Add new fields, keep existing fields unchanged; frontend optional |
| **Database migration issues** | Low | Medium | Use NULL default (not FALSE); test on staging first |
---
## Out of Scope
**Explicitly NOT included in this feature:**
1. **Reprocessing existing receipts** - Only new uploads validated
2. **Machine learning OCR improvements** - Use existing PaddleOCR/Tesseract
3. **Custom OCR training** - Generic models only
4. **Approval workflow changes** - Validation is separate from approval
5. **Automatic approval** - Always requires supervisor review
6. **Advanced validation rules** - Only basic sanity checks
7. **Multi-currency support** - RON only for now
8. **Historical receipt validation** - Phase 2 feature
9. **OCR confidence tuning** - Accept engine defaults
10. **Frontend validation logic** - Backend only (frontend displays)
---
## Open Questions
### Q1: Should we keep Heavy preprocessing as fallback?
**Answer:** No. Remove completely. Evidence shows it causes more harm than good on clear PDFs. Medium preprocessing handles mixed-quality images better.
### Q2: What tolerance for payment sum validation?
**Answer:** ±0.02 RON (2 bani). Romanian receipts use 2 decimal places, so this tolerance absorbs rounding errors without hiding real mismatches.
### Q3: Should CUI validation be blocking or warning?
**Answer:** Warning only. Some old Romanian CIFs (pre-2000) don't have Mod 11 checksums. Also, OCR may extract digits incorrectly.
### Q4: What if Light OCR has high confidence but wrong values?
**Answer:** Validation catches this. If Light OCR extracts 859,762.16 with 98% confidence, amount_range rule flags it (>100,000 limit). User sees warning.
### Q5: Should we reprocess existing receipts with new validation?
**Answer:** No. Too risky and time-consuming. Apply to new uploads only. If user wants to re-validate old receipt, they can re-upload.
### Q6: What about receipts with no payment methods?
**Answer:** No validation warning. Not all receipts show CARD/NUMERAR breakdown (especially older thermal receipts). Only validate if payment methods are extracted.
### Q7: Should validation auto-correct or just warn?
**Answer:** Both. Auto-correct obvious errors (TOTAL from payment sum if confidence < 80%). Warn for ambiguous cases. Never silently change high-confidence values.
### Q8: How to handle receipts from future (clock skew)?
**Answer:** Warning only (not error). Allow up to 1 day in the future (24h tolerance) for clock skew. Beyond that, warn user.
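The `DateValidityRule` shown earlier compares against `today` directly, so the one-day allowance would need an explicit offset. A minimal sketch of that comparison, with a hypothetical helper name and `today` passed in so the behaviour is testable:

```python
from datetime import date, timedelta


def is_future_beyond_tolerance(receipt_date: date, today: date, tolerance_days: int = 1) -> bool:
    """True when the receipt date is later than today plus the allowed skew."""
    return receipt_date > today + timedelta(days=tolerance_days)


today = date(2025, 12, 30)
print(is_future_beyond_tolerance(date(2025, 12, 31), today))  # one day ahead: tolerated
print(is_future_beyond_tolerance(date(2026, 1, 1), today))    # two days ahead: flagged
```

Injecting `today` instead of calling `date.today()` inside the helper keeps the unit tests deterministic.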
---
## Estimated Complexity
**Overall:** High
**Justification:**
- **File Count:** 6 modified, 3 created, 1 migration = 10 files
- **Line Changes:** ~1,135 lines (400 new validation, 300 tests, 200 integration tests, 235 modifications)
- **Risk Level:** Medium (core OCR pipeline changes, but validation is additive)
- **Testing:** ~40 new test cases (26 unit and 16 integration tests are listed under Testing Strategy), manual testing required
- **Dependencies:** None (uses existing OCR engines)
- **Complexity Factors:**
- Multi-layer validation logic
- Romanian CIF checksum algorithm
- Cross-field validation dependencies
- Inter-OCR comparison logic
- Auto-correction logic
- Frontend integration
- Database migration
**Estimated Effort:** 2-3 days
- Day 1: Validation engine + unit tests
- Day 2: OCR pipeline integration + medium preprocessing
- Day 3: Frontend integration + manual testing + bug fixes
---
## Dependencies
### External Libraries
- `cv2` (OpenCV) - Already installed
- `numpy` - Already installed
- `paddleocr` - Already installed
- `tesseract` - Already installed
- `pydantic` - Already installed
- `sqlalchemy` - Already installed
### Internal Modules
- `backend/modules/data_entry/services/ocr_service.py`
- `backend/modules/data_entry/services/ocr_extractor.py`
- `backend/modules/data_entry/services/image_preprocessor.py`
- `backend/modules/data_entry/routers/ocr.py`
- `backend/modules/data_entry/schemas/ocr.py`
- `backend/modules/data_entry/db/models/receipt.py`
### Database Schema Changes
- Add `needs_manual_review` column to `receipts` table (nullable BOOLEAN)
- Alembic migration required
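The effect of adding the column as nullable can be illustrated with the standard-library `sqlite3` module. This is only a sketch of the schema change's behaviour, not the Alembic revision itself: the production database engine is not stated here, and every column other than `needs_manual_review` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE receipts (id INTEGER PRIMARY KEY, amount TEXT)")
conn.execute("INSERT INTO receipts (amount) VALUES ('85.99')")

# Nullable column with no default: rows that predate the migration stay NULL,
# which reads as "never validated" rather than "validated and clean".
conn.execute("ALTER TABLE receipts ADD COLUMN needs_manual_review BOOLEAN")

# A receipt saved after the migration with warnings attached.
conn.execute("INSERT INTO receipts (amount, needs_manual_review) VALUES ('859762.16', 1)")

rows = conn.execute("SELECT id, needs_manual_review FROM receipts ORDER BY id").fetchall()
flagged = conn.execute("SELECT id FROM receipts WHERE needs_manual_review = 1").fetchall()
print(rows)
print(flagged)
```

Keeping old rows at NULL means the "Needs Review" filter can match on `TRUE` alone without pulling in receipts that were never run through validation.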
---
## Implementation Notes
### Priority Order (Recommended)
1. **Phase 1: Core Validation (Day 1)**
- Create `ocr/validation.py` module
- Implement validation rules (amount, TVA, payment, CUI, date)
- Write unit tests
- **Checkpoint:** All tests pass
2. **Phase 2: OCR Integration (Day 2 Morning)**
- Add `preprocess_medium()` to image_preprocessor
- Update `_merge_extractions()` with validation-aware logic
- Remove/deprecate `preprocess_heavy()`
- **Checkpoint:** Five-Holding receipt extracts correctly
3. **Phase 3: API Updates (Day 2 Afternoon)**
- Update `ExtractionResult` dataclass with validation fields
- Update API schemas (ocr.py, routers/ocr.py)
- Add database migration
- **Checkpoint:** API returns validation warnings
4. **Phase 4: Integration Testing (Day 3 Morning)**
- Write integration tests
- Test with real receipts (Five-Holding, OMV, Kaufland)
- **Checkpoint:** All integration tests pass
5. **Phase 5: Frontend & Polish (Day 3 Afternoon)**
- Update Vue components to display warnings
- Add "Needs Review" filter
- Manual testing
- Bug fixes
- **Checkpoint:** Production-ready
### Code Quality Standards
- ✅ Type hints for all functions
- ✅ Docstrings for all public methods
- ✅ Unit test coverage >90%
- ✅ Integration tests for critical paths
- ✅ Print statements for debugging (will be converted to logging later)
- ✅ Follow existing code patterns (SQLModel, Pydantic v2, FastAPI)
### Performance Considerations
- **Validation overhead:** <10ms per receipt (negligible vs. OCR time)
- **Medium preprocessing:** Similar speed to Heavy (~500ms)
- **Database migration:** Non-blocking (adds NULL column)
- **Frontend impact:** Minimal (only displays warnings)
---
## Related Documentation
### Project Context
- **CLAUDE.md:** Data Entry module instructions
- **docs/data-entry/DATA-ENTRY-MODULE.md:** Module architecture
- **docs/ARCHITECTURE-DECISIONS.md:** Ultrathin monolith rationale
### Technical References
- **Romanian CIF validation:** https://ro.wikipedia.org/wiki/Cod_de_identificare_fiscal%C4%83
- **OpenCV preprocessing:** https://docs.opencv.org/4.x/d7/d4d/tutorial_py_thresholding.html
- **PaddleOCR docs:** https://github.com/PaddlePaddle/PaddleOCR
### Similar Features
- **Payment methods extraction:** Already implemented in `ocr_extractor.py:1361`
- **TVA entries extraction:** Already implemented in `ocr_extractor.py:820`
- **Cross-validation logic:** Pattern from `_cross_validate_and_calculate_amount` (lines 468-557)
---
## Summary
This specification provides a comprehensive solution to fix critical OCR data extraction issues in the Data Entry module. The multi-layer validation system ensures data integrity while maintaining user flexibility through non-blocking warnings.
**Key Benefits:**
- Prevents 10,000x magnitude errors (85.99 vs 859,762.16)
- Validates cross-field dependencies (payment sum, TVA sum)
- Improves CUI extraction with Mod 11 checksum
- Replaces problematic Heavy OCR with Medium preprocessing
- Non-blocking warnings preserve user workflow
- Manual review flag helps supervisors prioritize
**Next Steps:**
1. Review and approve specification
2. Create feature branch: `feature/bon-ocr-validation`
3. Implement Phase 1 (validation engine)
4. Continue with Phases 2-5
5. Deploy to staging for testing
6. Monitor production for 1 week before full rollout
---
**Document Version:** 1.0
**Last Updated:** 2025-12-30
**Status:** Ready for Implementation
**Estimated Completion:** 2026-01-02 (3 working days)