Feature Specification: OCR Data Extraction Validation System
Feature ID: bon-ocr-validation
Priority: Critical (P0 - Production Bug)
Complexity: High
Estimated Effort: 2-3 days
Created: 2025-12-30
Module: Data Entry (backend/modules/data_entry/)
Overview
Fix critical OCR data extraction issue where PaddleOCR Heavy preprocessing (88% confidence) overwrites correct Light OCR (98% confidence) data with garbage values, causing 10,000x magnitude errors in production receipts.
Value Proposition: Prevent incorrect financial data from entering the system, reduce manual corrections, improve user trust in OCR accuracy.
Problem Statement
Current Behavior (BROKEN)
The OCR processing pipeline (backend/modules/data_entry/services/ocr_service.py) uses a 3-step adaptive approach:
- Step 1: PaddleOCR + Light preprocessing (fast, high confidence)
- Step 2: PaddleOCR + Heavy preprocessing (faded receipts)
- Step 3: Tesseract (complement missing fields only)
Critical Bug: The _merge_extractions() method (lines 240-386) blindly prefers higher OCR confidence scores WITHOUT validating actual extracted values.
Real Production Example (Five-Holding Receipt)
| Field | Light OCR (98%) ✅ | Heavy OCR (88%) ❌ | Final Result ❌ |
|---|---|---|---|
| TOTAL | 85.99 LEI | 859,762.16 LEI | 859,762.16 (10,000x error!) |
| TVA | 14.92 LEI | 149,214.92 LEI | 149,214.92 (10,000x error!) |
| CUI | R010562600 | (not found) | R010562600 |
| Date | 2025-10-11 | 2025-10-11 | 2025-10-11 |
| Confidence | 98% | 88% | 88% (wrong source!) |
Root Cause: Heavy preprocessing causes digit concatenation on high-quality PDFs. The binarization and morphological operations (lines 153-164 in image_preprocessor.py) merge adjacent numbers, creating garbage values.
Impact
- Data Integrity: Incorrect amounts enter accounting system
- User Trust: Users lose confidence in OCR accuracy
- Manual Work: Requires manual verification of ALL OCR extractions
- Financial Risk: Wrong amounts could be approved without review
User Stories
1. As a user uploading a clear PDF receipt
I want OCR to extract correct values on the first pass
So that I don't have to manually correct obvious errors
Acceptance Criteria:
- Light OCR correctly extracts 85.99 LEI (not 859,762.16)
- Heavy OCR is skipped when Light OCR confidence >= 90%
- No 10,000x magnitude errors
2. As a user submitting a receipt with warnings
I want to be able to save receipts with validation warnings
So that I can submit for review even if OCR isn't perfect
Acceptance Criteria:
- Save button works with warnings (not blocked)
- Receipt marked with needs_manual_review=True
- Warnings displayed clearly in UI
3. As a supervisor reviewing receipts
I want to see which receipts need manual review
So that I can prioritize validation efforts
Acceptance Criteria:
- Filter by "Needs Review" flag
- Validation warnings shown in detail view
- Clear indication of which fields are suspicious
4. As a system validating cross-field data
I want to validate CARD + NUMERAR = TOTAL
So that payment methods match the total amount
Acceptance Criteria:
- Cross-validation: sum of payment methods = TOTAL (±0.02 RON tolerance)
- If mismatch, flag for review
- Auto-correct TOTAL from payment sum if confidence < 80%
5. As a system validating TVA entries
I want to validate Σ(TVA entries) = TVA TOTAL
So that individual TVA lines match the total TVA
Acceptance Criteria:
- Cross-validation: sum of TVA entries = TVA TOTAL (±0.02 RON tolerance)
- TVA rate validation (5-24% of TOTAL)
- If mismatch, flag for review
Functional Requirements
Core Requirements (Must-Have)
1. Multi-Layer Validation Pipeline
FR-1.1: Absolute value sanity checks
- Amount range: 0.01 - 100,000 RON
- Max 2 decimal places
- Date: not in future, not older than 10 years (2015+)
- CUI: 6-10 digits, valid Mod 11 checksum
FR-1.2: Cross-field correlation validation
- TVA: 5-24% of TOTAL amount (Romanian rates: 5%, 9%, 11%, 19%, 21%)
- Payment methods: CARD + NUMERAR = TOTAL (±0.02 RON tolerance)
- Inter-OCR consistency: flag if values differ >10x between engines
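Because TVA is included in the receipt TOTAL, a nominal rate r corresponds to a share of r / (1 + r) of TOTAL rather than r itself. The sketch below computes that share for the rates listed above. The helper name `tva_share_of_total` is illustrative and not part of the codebase; it assumes TOTAL is the gross amount and that a single rate applies to the whole receipt.

```python
from decimal import Decimal


def tva_share_of_total(rate_percent: Decimal) -> Decimal:
    """Return TVA as a percentage of the gross TOTAL for a nominal rate (hypothetical helper)."""
    rate = rate_percent / Decimal("100")
    share = rate / (Decimal("1") + rate) * Decimal("100")
    return share.quantize(Decimal("0.01"))


for rate in ("5", "9", "11", "19", "21"):
    print(f"rate {rate}% -> {tva_share_of_total(Decimal(rate))}% of TOTAL")
```

Under these assumptions a receipt taxed only at 5% sits near 4.76% of TOTAL, slightly below the 5% lower bound, so the threshold may need a small margin.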
FR-1.3: Auto-correction logic
- If TOTAL is obviously wrong (>10x payment sum), use payment sum
- If TVA > TOTAL, recalculate TOTAL from TVA using reverse formula
- Preserve high-confidence values from Light OCR over low-confidence Heavy OCR
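The "reverse formula" is not spelled out in this spec. One plausible reading, assuming a single TVA rate and a gross TOTAL, is TOTAL = TVA × (1 + r) / r. The function name and the rounding mode below are assumptions made for illustration only.

```python
from decimal import Decimal, ROUND_HALF_UP


def total_from_tva(tva: Decimal, rate_percent: Decimal) -> Decimal:
    """Recover a gross TOTAL from the TVA amount, assuming one nominal rate (hypothetical helper)."""
    rate = rate_percent / Decimal("100")
    if rate <= 0:
        raise ValueError("rate must be positive")
    total = tva * (Decimal("1") + rate) / rate
    # Round to 2 decimals to match the amount format required by FR-1.1
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(total_from_tva(Decimal("21.00"), Decimal("21")))
```

Receipts that mix several rates cannot be reconstructed this way, so a recalculated TOTAL should still be flagged for manual review.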
FR-1.4: Validation result structure
@dataclass
class ValidationResult:
    is_valid: bool
    warnings: List[ValidationWarning]  # Non-blocking issues
    errors: List[ValidationError]      # Blocking issues (none for now)
    corrected_fields: Dict[str, Any]   # Auto-corrected values
    needs_manual_review: bool          # Flag for supervisor
2. Replace Heavy with Medium OCR
FR-2.1: Remove preprocess_heavy() from the OCR pipeline (the method itself stays, marked deprecated)
- Current Heavy: aggressive binarization causes digit concatenation
- Reason: Destroys high-quality PDFs while trying to recover faded receipts
FR-2.2: Add preprocess_medium() method
- Moderate contrast enhancement (CLAHE clipLimit=2.0)
- Light denoising (fastNlMeansDenoising h=6)
- NO binarization, NO morphological operations
- Preserve text boundaries on clear images
FR-2.3: Update OCR pipeline
- Step 1: Light preprocessing (unchanged)
- Step 2: Medium preprocessing (replaces Heavy)
- Step 3: Tesseract (unchanged)
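The control flow of the first two steps can be sketched as follows. This is a minimal illustration, not the service code: `run_adaptive`, the `Extraction` tuple, and the callables are hypothetical, and the 0.90 threshold comes from the acceptance criterion that Step 2 is skipped when Light OCR confidence is at least 90%.

```python
from typing import Callable, Optional, Tuple

# (extracted fields, overall confidence in the range 0..1)
Extraction = Tuple[dict, float]


def run_adaptive(
    light: Callable[[], Extraction],
    medium: Callable[[], Extraction],
    skip_threshold: float = 0.90,
) -> Tuple[Extraction, Optional[Extraction]]:
    """Run Light first and only run Medium when Light is not confident enough."""
    light_result = light()
    if light_result[1] >= skip_threshold:
        # Clear document: Step 2 is skipped entirely
        return light_result, None
    return light_result, medium()
```

Passing the passes as callables keeps the expensive Medium step lazy, so it is never executed for clear PDFs.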
3. Enhanced CUI Extraction
FR-3.1: Romanian CIF validation algorithm
- Implement Mod 11 checksum validation
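- Control digit formula: (sum(digit[i] * weight[i]) * 10) % 11, where a result of 10 maps to control digit 0
- Weights: [7, 5, 3, 2, 1, 7, 5, 3, 2] (aligned right-to-left, i.e. the digits before the control digit are zero-padded to 9 digits)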
FR-3.2: CUI format normalization
- Always add "RO" prefix if missing
- Remove spaces, dashes, dots
- Validate length: 6-10 digits
FR-3.3: Improved regex patterns
# Add OCR-tolerant patterns (current patterns are too strict)
CUI_OCR_TOLERANT_PATTERNS = [
    r'CIF[:\s]*R[O0]?\s*(\d[\d\s]{5,9})',   # Spaces in CUI
    r'C[I1]F[:\s]*(\d[\d\s]{6,10})',        # C1F (I→1 OCR error)
    r'C\.?\s*[I1]\.?\s*F\.?[:\s]*(\d+)',    # C. I. F. (spaced)
]
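A short usage sketch for these patterns follows. The helper `extract_cui_digits` and the sample strings are made up for illustration; the patterns are copied from the list above and applied in order with `re.search`.

```python
import re
from typing import Optional

CUI_OCR_TOLERANT_PATTERNS = [
    r'CIF[:\s]*R[O0]?\s*(\d[\d\s]{5,9})',   # Spaces in CUI
    r'C[I1]F[:\s]*(\d[\d\s]{6,10})',        # C1F (I→1 OCR error)
    r'C\.?\s*[I1]\.?\s*F\.?[:\s]*(\d+)',    # C. I. F. (spaced)
]


def extract_cui_digits(text: str) -> Optional[str]:
    """Return the first CUI candidate as bare digits, or None (hypothetical helper)."""
    for pattern in CUI_OCR_TOLERANT_PATTERNS:
        match = re.search(pattern, text)
        if match:
            # Drop the whitespace that OCR inserted between digit groups
            return re.sub(r'\s', '', match.group(1))
    return None


for sample in ("CIF: RO 10 562 600", "C1F: 10562600", "C. I. F.: 10562600"):
    print(sample, "->", extract_cui_digits(sample))
```

The returned digits still need the length and checksum validation from FR-3.1 and FR-3.2 before they are trusted.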
4. User Requirements Integration
FR-4.1: Non-blocking validation warnings
- Save button enabled even with warnings
- User can override and submit
- Warnings displayed clearly in UI
FR-4.2: Manual review flag
- Database field: receipts.needs_manual_review (BOOLEAN)
- Set to TRUE if:
  - Any validation warning present
  - Overall confidence < 85%
  - Cross-validation fails
FR-4.3: Apply to new uploads only
- No reprocessing of existing receipts
- Validation runs on OCR extraction (POST /api/ocr/extract)
- Migration: add column with default NULL (not FALSE)
Secondary Requirements (Nice-to-Have)
FR-S1: Validation confidence scoring
- Each validation rule contributes to score
- Overall validation confidence: weighted average
- Display in UI alongside OCR confidence
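A weighted average of per-rule scores could look like the sketch below. The rule names reuse identifiers from this spec, but the weights, the 0..1 score convention, and the function name are assumptions chosen only to make the arithmetic concrete.

```python
from typing import Dict

# Hypothetical weights: higher means the rule matters more for the overall score
RULE_WEIGHTS: Dict[str, float] = {
    "amount_range": 3.0,
    "tva_ratio": 2.0,
    "payment_sum": 2.0,
    "cui_checksum": 1.0,
}


def validation_confidence(rule_scores: Dict[str, float]) -> float:
    """Weighted average of rule scores (1.0 = passed, 0.0 = failed)."""
    known = [name for name in rule_scores if name in RULE_WEIGHTS]
    total_weight = sum(RULE_WEIGHTS[name] for name in known)
    if total_weight == 0:
        # No known rule ran, so there is nothing to average
        return 0.0
    weighted = sum(rule_scores[name] * RULE_WEIGHTS[name] for name in known)
    return weighted / total_weight
```

For example, a receipt that fails only `payment_sum` would score (3 + 2 + 0 + 1) / 8 = 0.75 with these weights.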
FR-S2: Validation rule configurability
- Move hardcoded thresholds to config
- Allow per-company customization
- Admin UI to adjust rules
Technical Requirements
Files to Create
1. backend/modules/data_entry/services/ocr/validation.py
Purpose: Validation utilities and rule engine
Size: ~400 lines
Key Classes:
- ValidationRule (base class)
- AmountRangeRule, TVARatioRule, PaymentSumRule, CUIChecksumRule
- OCRValidationEngine (orchestrator)
Example:
from abc import ABC, abstractmethod
from dataclasses import dataclass
from decimal import Decimal
from typing import Any, Dict, List, Optional

from backend.modules.data_entry.services.ocr_extractor import ExtractionResult

@dataclass
class ValidationWarning:
    """Non-blocking validation warning."""
    field: str
    rule: str
    message: str
    severity: str  # 'low', 'medium', 'high'
    suggested_value: Optional[Any] = None

class ValidationRule(ABC):
    """Base validation rule."""
    @abstractmethod
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        pass

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Default: no corrections. Rules that can fix a field override this."""
        return {}

class AmountRangeRule(ValidationRule):
    """Validate amount is in reasonable range (0.01 - 100,000 RON)."""
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        # Compare against None: a bare truthiness test would skip an extracted 0.00
        if extraction.amount is not None:
            if extraction.amount < Decimal('0.01'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} is too small (< 0.01 RON)',
                    severity='high'
                ))
            elif extraction.amount > Decimal('100000'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} exceeds limit (> 100,000 RON)',
                    severity='high'
                ))
        return warnings

class OCRValidationEngine:
    """Orchestrate all validation rules."""
    def __init__(self):
        self.rules = [
            AmountRangeRule(),
            TVARatioRule(),
            PaymentSumRule(),
            TVAEntriesSumRule(),
            InterOCRConsistencyRule(),
            CUIChecksumRule(),
            DateValidityRule(),
        ]

    def validate(self, extraction: ExtractionResult) -> ValidationResult:
        """Run all validation rules and return result."""
        all_warnings = []
        corrected_fields = {}
        for rule in self.rules:
            warnings = rule.validate(extraction)
            all_warnings.extend(warnings)
            # Apply auto-corrections (empty dict for rules without any)
            corrections = rule.auto_correct(extraction)
            corrected_fields.update(corrections)
        confidence = extraction.overall_confidence
        needs_review = (
            len(all_warnings) > 0 or
            (confidence is not None and confidence < 0.85)
        )
        return ValidationResult(
            is_valid=True,  # Never block (warnings only)
            warnings=all_warnings,
            errors=[],
            corrected_fields=corrected_fields,
            needs_manual_review=needs_review
        )
2. backend/modules/data_entry/tests/test_ocr_validation.py
Purpose: Unit tests for validation rules
Size: ~300 lines
Coverage Target: >90%
Test Cases:
- test_amount_range_valid() - 85.99 RON passes
- test_amount_range_too_high() - 859,762.16 fails
- test_tva_ratio_valid() - 14.92/85.99 = 17.4% passes
- test_tva_ratio_too_high() - 149,214.92/859,762.16 = 17.4% but amounts wrong
- test_payment_sum_matches() - CARD 50 + NUMERAR 35.99 = TOTAL 85.99
- test_cui_checksum_valid() - R010562600 passes Mod 11
- test_cui_checksum_invalid() - R010562601 fails Mod 11
- test_inter_ocr_consistency() - 85.99 vs 859,762.16 = 10,000x flag
3. backend/modules/data_entry/tests/test_ocr_validation_integration.py
Purpose: Integration tests with full OCR pipeline
Size: ~200 lines
Test Cases:
- test_five_holding_receipt() - Real production case (85.99 not 859,762.16)
- test_clear_pdf_uses_light_ocr() - High-quality PDF skips Step 2 (Medium, formerly Heavy)
- test_faded_receipt_uses_medium_ocr() - Thermal receipt uses Medium
- test_validation_warnings_in_response() - API returns warnings
- test_manual_review_flag_set() - Flag set when confidence < 85%
Files to Modify
1. backend/modules/data_entry/services/ocr_service.py
Changes: ~200 lines modified, ~100 lines added
Key Modifications:
A. Replace _merge_extractions() (lines 240-386) with validation-aware version:
def _merge_extractions(
    self,
    light: Optional[ExtractionResult],
    medium: Optional[ExtractionResult]  # Renamed from 'tesseract'
) -> ExtractionResult:
    """
    Merge extractions with VALIDATION-AWARE logic.
    NEW Strategy:
    1. Run validation on both extractions
    2. Prefer extraction with FEWER warnings (not just higher confidence)
    3. For each field, pick value that passes validation
    4. Flag inter-OCR inconsistencies (>10x difference)
    """
    from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine
    validator = OCRValidationEngine()

    # Either pass may be missing: return the other one instead of dereferencing None
    if light is None:
        return medium if medium is not None else ExtractionResult()
    if medium is None:
        return light

    # Validate both extractions
    light_validation = validator.validate(light)
    medium_validation = validator.validate(medium)
    result = ExtractionResult()

    # === AMOUNT (with validation check) ===
    if light.amount and medium.amount:
        # Both amounts are non-zero here, so the division cannot raise ZeroDivisionError
        ratio = max(light.amount, medium.amount) / min(light.amount, medium.amount)
        result.inter_ocr_ratio = float(ratio)  # Read later by InterOCRConsistencyRule
        if ratio > 10:
            print(f"[Merge] WARNING: Inter-OCR inconsistency: {light.amount} vs {medium.amount} ({ratio:.0f}x)", flush=True)
            # Prefer the value with fewer warnings; on a tie keep Light (FR-1.3)
            light_warnings = [w for w in light_validation.warnings if w.field == 'amount']
            medium_warnings = [w for w in medium_validation.warnings if w.field == 'amount']
            use_light = len(light_warnings) <= len(medium_warnings)
        else:
            # Normal merge: prefer higher confidence
            use_light = light.confidence_amount >= medium.confidence_amount
        source = light if use_light else medium
        result.amount = source.amount
        result.confidence_amount = source.confidence_amount
        result.inter_ocr_source_used = 'light' if use_light else 'medium'
        print(f"[Merge] Using {result.inter_ocr_source_used} OCR amount: {result.amount}", flush=True)
    elif light.amount:
        result.amount = light.amount
        result.confidence_amount = light.confidence_amount
    elif medium.amount:
        result.amount = medium.amount
        result.confidence_amount = medium.confidence_amount
    # ... (similar logic for other fields)
    return result
B. Add preprocess_medium() call (replace Heavy):
# Line ~130: Replace preprocess_heavy with preprocess_medium
print("=" * 60, flush=True)
print("[OCR] STEP 2: PaddleOCR + Medium preprocessing", flush=True)
print("=" * 60, flush=True)
medium_img = self.preprocessor.preprocess_medium(image)  # NEW
try:
    paddle_medium = self.ocr_engine._paddle_recognize(medium_img)
    # ... rest of processing
C. Add validation to final result:
# Line ~204: Add validation before returning
if extraction:
    extraction = self._final_validation(extraction)
    # NEW: Run validation engine
    from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine
    validator = OCRValidationEngine()
    validation_result = validator.validate(extraction)
    # Apply auto-corrections
    for field, value in validation_result.corrected_fields.items():
        setattr(extraction, field, value)
    # Store validation warnings (add to ExtractionResult)
    extraction.validation_warnings = validation_result.warnings
    extraction.needs_manual_review = validation_result.needs_manual_review
2. backend/modules/data_entry/services/ocr_extractor.py
Changes: ~50 lines modified, ~30 lines added
Key Modifications:
A. Add validation fields to ExtractionResult (lines 10-50):
@dataclass
class ExtractionResult:
    """Structured extraction result from receipt."""
    # ... existing fields ...
    # NEW: Validation results
    validation_warnings: List[Any] = field(default_factory=list)  # ValidationWarning objects from the engine
    needs_manual_review: bool = False  # Flag for supervisor review
    # NEW: Inter-OCR comparison data
    inter_ocr_ratio: Optional[float] = None  # Ratio between Light/Medium values
    inter_ocr_source_used: Optional[str] = None  # 'light' or 'medium'
B. Fix CLIENT CUI patterns (lines 253-272):
# Current patterns are too strict - add OCR-tolerant versions
CLIENT_CUI_PATTERNS = [
    # ... existing patterns ...
    # NEW: OCR-tolerant patterns
    (r'CLIENT\s+C[I1UO]F\s*[:/]?\s*(?:R[O0])?(\d[\d\s]{5,9})', 0.96),  # Spaces in CUI
    (r'C[I1]F\s+CLIENT\s*[:/]?\s*(?:R[O0])?(\d[\d\s]{5,9})', 0.96),    # Reversed format
    (r'CLIENT.*?(?:R[O0])?(\d{6,10})\s*\n', 0.90),                     # CUI on next line
]
C. Add CUI normalization and validation:
def _normalize_cui(self, cui: Optional[str]) -> Optional[str]:
    """Normalize CUI format and validate checksum."""
    if not cui:
        return None
    # Remove non-digits
    digits = re.sub(r'\D', '', cui)
    # Validate length
    if not (6 <= len(digits) <= 10):
        return None
    # Validate Mod 11 checksum (Romanian CIF algorithm)
    if not self._validate_cui_checksum(digits):
        print(f"[CUI Validation] Invalid checksum: {digits}", flush=True)
        return None
    # Add RO prefix
    return f"RO{digits}"

def _validate_cui_checksum(self, digits: str) -> bool:
    """Validate Romanian CIF Mod 11 checksum."""
    if len(digits) < 2:
        return False
    # Weights for the 9-digit, zero-padded body (all digits except the control digit)
    weights = [7, 5, 3, 2, 1, 7, 5, 3, 2]
    # Get control digit (last digit)
    control = int(digits[-1])
    # Calculate weighted sum (all digits except last)
    digits_to_check = digits[:-1].zfill(9)  # Pad with zeros if needed
    checksum = sum(int(d) * w for d, w in zip(digits_to_check, weights))
    # Control digit = (weighted sum * 10) mod 11, with a result of 10 mapped to 0
    remainder = (checksum * 10) % 11
    expected_control = 0 if remainder == 10 else remainder
    return control == expected_control
3. backend/modules/data_entry/services/image_preprocessor.py
Changes: ~80 lines added
Key Modifications:
A. Add preprocess_medium() method (after line 166):
def preprocess_medium(self, image: np.ndarray) -> np.ndarray:
    """
    Medium preprocessing for MIXED-QUALITY images.
    Balance between Light (too gentle) and Heavy (too aggressive).
    Use cases:
    - Moderately faded receipts
    - Photos with uneven lighting
    - Scans with slight blur
    Preprocessing steps:
    - Moderate contrast enhancement (CLAHE clipLimit=2.0)
    - Light denoising (fastNlMeansDenoising h=6)
    - Gentle sharpening
    - NO binarization (preserves text boundaries)
    - NO morphological operations (avoids digit concatenation)
    """
    # 0. Add safety padding
    image = self._add_safety_padding(image)
    # 1. Grayscale
    if len(image.shape) == 3:
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    else:
        gray = image.copy()
    # 2. Scale (same as Light)
    height, width = gray.shape
    max_side = max(height, width)
    if max_side > 4000:
        scale = 4000 / max_side
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
        height, width = gray.shape
    if width < 1500:
        scale = 1500 / width
        new_width = int(width * scale)
        new_height = int(height * scale)
        if max(new_width, new_height) > 4000:
            scale = 4000 / max(new_width, new_height)
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)
    # 3. Deskew
    gray = self._deskew(gray)
    # 4. Moderate contrast enhancement
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)
    # 5. Light denoising (less aggressive than Heavy)
    denoised = cv2.fastNlMeansDenoising(enhanced, h=6, templateWindowSize=7, searchWindowSize=15)
    # 6. Gentle sharpening
    gaussian = cv2.GaussianBlur(denoised, (0, 0), 1.0)
    sharpened = cv2.addWeighted(denoised, 1.3, gaussian, -0.3, 0)
    # NO binarization, NO morphological operations
    # This preserves text boundaries and avoids digit concatenation
    return sharpened
B. Mark preprocess_heavy() as deprecated:
def preprocess_heavy(self, image: np.ndarray) -> np.ndarray:
    """
    Heavy preprocessing for FADED thermal receipts.
    ⚠️ DEPRECATED: Use preprocess_medium() instead.
    Heavy preprocessing causes digit concatenation on clear PDFs.
    Kept for backward compatibility only.
    """
    # ... existing code (unchanged)
4. backend/modules/data_entry/routers/ocr.py
Changes: ~40 lines modified
Key Modifications:
A. Update ExtractionData schema instantiation (lines 106-128):
# Add validation warnings to response
validation_warnings_list = [
    {
        'field': w.field,
        'rule': w.rule,
        'message': w.message,
        'severity': w.severity,
        'suggested_value': w.suggested_value
    }
    for w in result.validation_warnings
] if hasattr(result, 'validation_warnings') else []
data = ExtractionData(
    # ... existing fields ...
    # NEW: Validation fields
    validation_warnings=validation_warnings_list,
    needs_manual_review=getattr(result, 'needs_manual_review', False),
    inter_ocr_ratio=getattr(result, 'inter_ocr_ratio', None),
    inter_ocr_source_used=getattr(result, 'inter_ocr_source_used', None),
)
5. backend/modules/data_entry/schemas/ocr.py
Changes: ~20 lines added
Key Modifications:
A. Add validation fields to ExtractionData (after line 57):
class ValidationWarning(BaseModel):
    """Validation warning from OCR extraction."""
    field: str = Field(description="Field name (e.g., 'amount', 'tva_total')")
    rule: str = Field(description="Rule name (e.g., 'amount_range', 'tva_ratio')")
    message: str = Field(description="Human-readable warning message")
    severity: str = Field(description="Severity: 'low', 'medium', 'high'")
    suggested_value: Optional[Any] = Field(default=None, description="Suggested corrected value")

class ExtractionData(BaseModel):
    """Extracted receipt data from OCR."""
    # ... existing fields ...
    # NEW: Validation results
    validation_warnings: List[ValidationWarning] = Field(default_factory=list, description="Validation warnings")
    needs_manual_review: bool = Field(default=False, description="Flag for supervisor review")
    inter_ocr_ratio: Optional[float] = Field(default=None, description="Ratio between OCR engines (>10 = inconsistent)")
    inter_ocr_source_used: Optional[str] = Field(default=None, description="OCR engine used: 'light' or 'medium'")
6. Database Migration: backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py
Purpose: Add needs_manual_review column to receipts table
Size: ~30 lines (Alembic migration)
"""Add needs_manual_review flag to receipts
Revision ID: XXX
Create Date: 2025-12-30
"""
from alembic import op
import sqlalchemy as sa

revision = 'XXX'
down_revision = 'YYY'  # Previous migration
branch_labels = None
depends_on = None

def upgrade():
    # Add column with default NULL (not FALSE)
    # NULL = not validated yet (old receipts)
    # FALSE = validated, no review needed
    # TRUE = validated, needs review
    op.add_column('receipts', sa.Column('needs_manual_review', sa.Boolean(), nullable=True))

def downgrade():
    op.drop_column('receipts', 'needs_manual_review')
Frontend Integration Points
1. src/modules/data-entry/views/receipts/ReceiptCreateView.vue
Changes: Display validation warnings below OCR results
Example:
<template>
<div class="ocr-results">
<!-- Existing OCR fields -->
<!-- NEW: Validation warnings section -->
<div v-if="ocrData.validation_warnings?.length > 0" class="validation-warnings">
<h4>
<i class="pi pi-exclamation-triangle" />
Avertismente Validare ({{ ocrData.validation_warnings.length }})
</h4>
<ul>
<li
v-for="(warning, idx) in ocrData.validation_warnings"
:key="idx"
:class="`severity-${warning.severity}`"
>
<strong>{{ warning.field }}:</strong> {{ warning.message }}
<span v-if="warning.suggested_value" class="suggestion">
(sugestie: {{ warning.suggested_value }})
</span>
</li>
</ul>
</div>
<!-- NEW: Manual review badge -->
<div v-if="ocrData.needs_manual_review" class="manual-review-badge">
<i class="pi pi-flag" />
Necesită verificare manuală
</div>
</div>
</template>
<style scoped>
.validation-warnings {
margin-top: 1rem;
padding: 1rem;
background: #fff3cd;
border-left: 4px solid #ffc107;
}
.validation-warnings li.severity-low {
color: #666;
}
.validation-warnings li.severity-medium {
color: #f57c00;
}
.validation-warnings li.severity-high {
color: #d32f2f;
font-weight: bold;
}
.manual-review-badge {
margin-top: 0.5rem;
padding: 0.5rem 1rem;
background: #fff3cd;
border-radius: 4px;
display: inline-flex;
align-items: center;
gap: 0.5rem;
}
</style>
2. src/modules/data-entry/components/ocr/OCRPreview.vue
Changes: Add inter-OCR consistency indicator
Example:
<template>
<div class="ocr-preview">
<!-- Existing fields -->
<!-- NEW: Inter-OCR consistency warning -->
<div v-if="ocrData.inter_ocr_ratio && ocrData.inter_ocr_ratio > 10" class="ocr-consistency-warning">
<i class="pi pi-exclamation-circle" />
Inconsistență detectată între motoarele OCR ({{ Math.round(ocrData.inter_ocr_ratio) }}x diferență).
<br />
<small>Valorile folosite provin din: {{ ocrData.inter_ocr_source_used }}</small>
</div>
</div>
</template>
Design Decisions
1. Why Validation Warnings Instead of Errors?
Decision: Use non-blocking warnings instead of blocking errors.
Rationale:
- User requirement: "Allow save with warnings"
- OCR will never be 100% perfect
- Users can override incorrect extractions
- Supervisor review catches issues before approval
Trade-off: Risk of bad data entering system vs. user frustration with blocked submissions.
Mitigation: Manual review flag ensures supervisor catches issues.
2. Why Replace Heavy with Medium OCR?
Decision: Remove Heavy preprocessing, add Medium preprocessing.
Rationale:
- Heavy causes digit concatenation on clear PDFs (production evidence)
- Binarization destroys text boundaries on high-quality images
- Morphological operations merge adjacent numbers (85.99 → 859,762.16)
Analysis of Heavy Preprocessing (lines 153-164 in image_preprocessor.py):
# 7. Adaptive thresholding (binarization) - PROBLEM!
binary = cv2.adaptiveThreshold(
sharpened, 255,
cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
cv2.THRESH_BINARY,
blockSize=11, C=5 # Block size can merge nearby digits
)
# 8. Morphological operations - COMPOUNDS THE PROBLEM!
kernel_close = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
result = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel_close)
# MORPH_CLOSE fills small gaps → merges adjacent numbers
Alternative Considered: Keep Heavy but add safeguards. Rejected: Too risky, no benefit for clear PDFs.
3. Why Romanian CIF Mod 11 Validation?
Decision: Implement CIF checksum validation algorithm.
Rationale:
- Romanian CIFs have built-in checksum (last digit)
- Validates extracted CUI is mathematically correct
- Catches OCR digit errors (10562600 vs 10562601)
Algorithm: Mod 11 checksum
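- Weights: [7, 5, 3, 2, 1, 7, 5, 3, 2] (aligned right-to-left, i.e. the digits before the control digit are zero-padded to 9 digits)
- Formula: (sum(digit[i] * weight[i]) * 10) % 11
- Control digit: remainder (0 if remainder=10)
Example: RO10562600
- Digits (zero-padded, control digit in brackets): 0,0,1,0,5,6,2,6,0,[0]
- Checksum: 0×7 + 0×5 + 1×3 + 0×2 + 5×1 + 6×7 + 2×5 + 6×3 + 0×2 = 0+0+3+0+5+42+10+18+0 = 78
- (78 × 10) % 11 = 780 % 11 = 10 → control digit 0, which matches the last digit → VALID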
Note: Some older CIFs may not have checksums (pre-2000). Validation is permissive (warning, not error).
4. Why Apply to New Uploads Only?
Decision: Don't reprocess existing receipts.
Rationale:
- Migration impact: ~500 existing receipts in DB
- Reprocessing cost: OCR is slow (~2-5s per receipt)
- Risk: May change existing approved data
- Benefit: Minimal (old receipts already reviewed)
Implementation: Migration adds column with default NULL (not FALSE).
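Because the column is nullable, callers have to treat it as a three-state value rather than a plain boolean. A minimal sketch of that interpretation follows; the function name and the returned labels are hypothetical.

```python
from typing import Optional


def review_status(needs_manual_review: Optional[bool]) -> str:
    """Map the nullable needs_manual_review column to a display status (hypothetical labels)."""
    if needs_manual_review is None:
        # Receipt predates the validation system and was never checked
        return "not_validated"
    return "needs_review" if needs_manual_review else "validated_ok"


for value in (None, False, True):
    print(value, "->", review_status(value))
```

The explicit `is None` test matters here: a truthiness check would silently merge old, unvalidated receipts with receipts that passed validation.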
Validation Rules Specification
1. Amount Range Validation
Rule: Amount must be between 0.01 and 100,000 RON.
Implementation:
class AmountRangeRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        # Compare against None: a bare truthiness test would skip 0.00,
        # which the test cases below expect to be flagged
        if extraction.amount is not None:
            if extraction.amount < Decimal('0.01'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} is too small (< 0.01 RON)',
                    severity='high'
                ))
            elif extraction.amount > Decimal('100000'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} exceeds limit (> 100,000 RON)',
                    severity='high'
                ))
            # Check decimal places (a non-negative exponent means no fractional digits)
            decimal_places = max(0, -extraction.amount.as_tuple().exponent)
            if decimal_places > 2:
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='decimal_places',
                    message=f'Amount has {decimal_places} decimal places (max 2)',
                    severity='medium',
                    suggested_value=extraction.amount.quantize(Decimal('0.01'))
                ))
        return warnings
Test Cases:
- 0.00 RON → Warning (too small)
- 0.01 RON → Valid
- 85.99 RON → Valid
- 100,000 RON → Valid
- 100,001 RON → Warning (too large)
- 859,762.16 RON → Warning (too large)
- 85.999 RON → Warning (too many decimals)
2. TVA Ratio Validation
Rule: TVA must be 5-24% of TOTAL amount.
Implementation:
class TVARatioRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.tva_total and extraction.amount:
            # TVA cannot be greater than TOTAL
            if extraction.tva_total > extraction.amount:
                warnings.append(ValidationWarning(
                    field='tva_total',
                    rule='tva_greater_than_total',
                    message=f'TVA ({extraction.tva_total}) cannot be greater than TOTAL ({extraction.amount})',
                    severity='high',
                    suggested_value=None  # Will be auto-corrected by service
                ))
            else:
                # Check ratio
                ratio = extraction.tva_total / extraction.amount * Decimal('100')
                if ratio < Decimal('5'):
                    warnings.append(ValidationWarning(
                        field='tva_total',
                        rule='tva_ratio_low',
                        message=f'TVA is {ratio:.1f}% of total (expected 5-24%)',
                        severity='medium'
                    ))
                elif ratio > Decimal('24'):
                    warnings.append(ValidationWarning(
                        field='tva_total',
                        rule='tva_ratio_high',
                        message=f'TVA is {ratio:.1f}% of total (expected 5-24%)',
                        severity='high'
                    ))
        return warnings
Test Cases:
- TVA=14.92, TOTAL=85.99 → 17.4% → Valid
- TVA=149,214.92, TOTAL=859,762.16 → 17.4% → Both values wrong (caught by amount_range)
- TVA=4.00, TOTAL=100.00 → 4% → Warning (too low)
- TVA=100.00, TOTAL=85.99 → 116% → Warning (impossible!)
3. Payment Sum Validation
Rule: CARD + NUMERAR must equal TOTAL (±0.02 RON tolerance).
Implementation:
class PaymentSumRule(ValidationRule):
    @staticmethod
    def _payment_sum(extraction: ExtractionResult) -> Decimal:
        # Coerce each amount to Decimal so float or str values cannot break the arithmetic
        return sum(
            (Decimal(str(pm['amount'])) for pm in extraction.payment_methods),
            Decimal('0'),
        )

    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.payment_methods and extraction.amount:
            payment_sum = self._payment_sum(extraction)
            difference = abs(payment_sum - extraction.amount)
            if difference > Decimal('0.02'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='payment_sum_mismatch',
                    message=f'Payment methods sum ({payment_sum}) ≠ TOTAL ({extraction.amount}), diff={difference}',
                    severity='high',
                    suggested_value=payment_sum
                ))
        return warnings

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Auto-correct TOTAL from payment sum if confidence < 80%."""
        corrections = {}
        if extraction.payment_methods and extraction.amount:
            payment_sum = self._payment_sum(extraction)
            difference = abs(payment_sum - extraction.amount)
            # A missing confidence value is not treated as low confidence
            confidence = extraction.confidence_amount
            low_confidence = confidence is not None and confidence < 0.80
            if difference > Decimal('0.02') and low_confidence:
                corrections['amount'] = payment_sum
                print(f"[Auto-Correct] TOTAL corrected: {extraction.amount} → {payment_sum} (from payment methods)", flush=True)
        return corrections
Test Cases:
- CARD=50, NUMERAR=35.99, TOTAL=85.99 → Valid
- CARD=50, NUMERAR=35.97, TOTAL=85.99 → Diff=0.02 → Valid (tolerance)
- CARD=50, NUMERAR=35.00, TOTAL=85.99 → Diff=0.99 → Warning
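The Diff=0.02 edge case above only behaves predictably with exact decimal arithmetic, since binary floats cannot represent values such as 35.97 exactly. The following sketch shows the tolerance check in isolation; `payment_sum_matches` is an illustrative helper, not the rule implementation.

```python
from decimal import Decimal
from typing import Iterable, Union

Number = Union[str, int, float, Decimal]
TOLERANCE = Decimal("0.02")  # ±0.02 RON, as required by the rule


def payment_sum_matches(payments: Iterable[Number], total: Number) -> bool:
    """Return True when the payments add up to total within the tolerance."""
    # str() first, so a float like 35.97 becomes Decimal('35.97') rather than its binary expansion
    payment_sum = sum((Decimal(str(p)) for p in payments), Decimal("0"))
    return abs(payment_sum - Decimal(str(total))) <= TOLERANCE


print(payment_sum_matches(["50", "35.99"], "85.99"))  # exact match
print(payment_sum_matches(["50", "35.97"], "85.99"))  # difference is exactly 0.02
print(payment_sum_matches(["50", "35.00"], "85.99"))  # difference is 0.99
```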
4. TVA Entries Sum Validation
Rule: Σ(TVA entries) must equal TVA TOTAL (±0.02 RON tolerance).
Implementation:
class TVAEntriesSumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.tva_entries and extraction.tva_total:
            entries_sum = sum(e['amount'] for e in extraction.tva_entries)
            difference = abs(entries_sum - extraction.tva_total)
            if difference > Decimal('0.02'):
                warnings.append(ValidationWarning(
                    field='tva_total',
                    rule='tva_entries_sum_mismatch',
                    message=f'TVA entries sum ({entries_sum}) ≠ TVA TOTAL ({extraction.tva_total}), diff={difference}',
                    severity='medium',
                    suggested_value=entries_sum
                ))
        return warnings

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Use entries sum as TVA TOTAL if mismatch."""
        corrections = {}
        if extraction.tva_entries and extraction.tva_total:
            entries_sum = sum(e['amount'] for e in extraction.tva_entries)
            difference = abs(entries_sum - extraction.tva_total)
            if difference > Decimal('0.02'):
                corrections['tva_total'] = entries_sum
                print(f"[Auto-Correct] TVA TOTAL corrected: {extraction.tva_total} → {entries_sum} (from entries)", flush=True)
        return corrections
Test Cases:
- Entries=[A:19%:14.92], TOTAL=14.92 → Valid
- Entries=[A:19%:10.00, B:9%:4.92], TOTAL=14.92 → Valid
- Entries=[A:19%:14.92], TOTAL=14.94 → Diff=0.02 → Valid (tolerance)
- Entries=[A:19%:14.92], TOTAL=15.00 → Diff=0.08 → Warning
5. Inter-OCR Consistency Validation
Rule: Flag if values differ >10x between OCR engines.
Implementation:
class InterOCRConsistencyRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        """This rule is applied during merge, stores ratio in extraction."""
        warnings = []
        if hasattr(extraction, 'inter_ocr_ratio') and extraction.inter_ocr_ratio:
            if extraction.inter_ocr_ratio > 10:
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='inter_ocr_inconsistency',
                    message=f'Large inconsistency between OCR engines ({extraction.inter_ocr_ratio:.0f}x difference)',
                    severity='high'
                ))
        return warnings
Test Cases:
- Light=85.99, Medium=86.00 → Ratio=1.00 → Valid
- Light=85.99, Medium=90.00 → Ratio=1.05 → Valid
- Light=85.99, Medium=859.76 → Ratio=10.00 → Valid (edge case)
- Light=85.99, Medium=859,762.16 → Ratio=10,000 → Warning!
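The rule above only reads a stored ratio; how that ratio is computed during merge is not shown in this section. A minimal sketch, assuming the ratio is the larger amount divided by the smaller one (the function names `inter_ocr_ratio` and `is_inconsistent` are hypothetical):

```python
from decimal import Decimal
from typing import Optional


def inter_ocr_ratio(light: Optional[Decimal], medium: Optional[Decimal]) -> Optional[Decimal]:
    """Return larger/smaller of two amounts, or None when no ratio can be formed."""
    if light is None or medium is None:
        return None
    low, high = sorted((abs(light), abs(medium)))
    if low == 0:
        # A zero amount cannot be compared by ratio; this also avoids ZeroDivisionError.
        return None
    return high / low


def is_inconsistent(ratio: Optional[Decimal], threshold: Decimal = Decimal("10")) -> bool:
    """Strictly greater than the threshold, matching the `> 10` check in the rule."""
    return ratio is not None and ratio > threshold


# The edge case from the test list stays just under the threshold.
print(is_inconsistent(inter_ocr_ratio(Decimal("85.99"), Decimal("859.76"))))
print(is_inconsistent(inter_ocr_ratio(Decimal("85.99"), Decimal("859762.16"))))
```

Returning None for missing or zero amounts means the rule simply produces no warning in those cases, which is one possible design; flagging them separately would be another.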
6. CUI Checksum Validation
Rule: Validate Romanian CIF Mod 11 checksum.
Implementation:
class CUIChecksumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.cui:
            # Normalize CUI: keep digits only (drops the RO prefix and separators)
            digits = re.sub(r'\D', '', extraction.cui)
            # Validate length
            if not (6 <= len(digits) <= 10):
                warnings.append(ValidationWarning(
                    field='cui',
                    rule='cui_length',
                    message=f'CUI length invalid: {len(digits)} digits (expected 6-10)',
                    severity='medium'
                ))
                return warnings
            # Validate Mod 11 checksum
            if not self._validate_checksum(digits):
                warnings.append(ValidationWarning(
                    field='cui',
                    rule='cui_checksum',
                    message=f'CUI checksum invalid: {extraction.cui} (failed Mod 11 validation)',
                    severity='medium'  # Medium: some old CIFs don't have checksums
                ))
        return warnings

    def _validate_checksum(self, digits: str) -> bool:
        """Romanian CIF Mod 11 checksum validation."""
        if len(digits) < 2:
            return False
        weights = [7, 5, 3, 2, 1, 7, 5, 3, 2]
        control = int(digits[-1])
        # Left-pad the body to 9 digits so it lines up with the weights
        digits_to_check = digits[:-1].zfill(9)
        checksum = sum(int(d) * w for d, w in zip(digits_to_check, weights))
        # The weighted sum is multiplied by 10 before taking the remainder
        remainder = (checksum * 10) % 11
        expected = 0 if remainder == 10 else remainder
        return control == expected
Test Cases:
- R010562600 → Valid (Mod 11 checksum passes)
- R011201891 → Valid (Mod 11 checksum passes)
- R012345678 → Warning (invalid checksum)
- R01234 → Warning (too short)
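The checksum cases above can be checked by hand with a standalone function. This is a sketch of the commonly documented CIF control-digit calculation (weights 7-5-3-2-1-7-5-3-2, weighted sum multiplied by 10, remainder modulo 11, with 10 mapped to 0); the function name `cif_checksum_ok` is hypothetical and the length rule (6-10 digits) is deliberately left out:

```python
import re

CIF_WEIGHTS = (7, 5, 3, 2, 1, 7, 5, 3, 2)


def cif_checksum_ok(raw: str) -> bool:
    """Check the control digit of a Romanian CIF given as free text."""
    digits = re.sub(r"\D", "", raw)  # drops "RO", spaces and other separators
    if not 2 <= len(digits) <= 10:
        return False
    body, control = digits[:-1].zfill(9), int(digits[-1])
    total = sum(int(d) * w for d, w in zip(body, CIF_WEIGHTS))
    expected = (total * 10) % 11
    if expected == 10:
        expected = 0
    return control == expected


# Worked example for 10562600: body 001056260 gives
# 1*3 + 5*1 + 6*7 + 2*5 + 6*3 = 78; 780 % 11 = 10, which maps to 0 (the last digit).
for candidate in ("RO10562600", "R010562600", "RO11201891", "RO12345678"):
    print(candidate, cif_checksum_ok(candidate))
```

Because the body is left-padded with zeros, an OCR misread of "RO" as "R0" adds a leading zero that does not change the result.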
7. Date Validity Validation
Rule: Date must not be in future, not older than 10 years.
Implementation:
from datetime import date, timedelta

class DateValidityRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.receipt_date:
            today = date.today()
            # Check future date (1-day tolerance for clock skew, see Q8)
            if extraction.receipt_date > today + timedelta(days=1):
                warnings.append(ValidationWarning(
                    field='receipt_date',
                    rule='date_future',
                    message=f'Date is in the future: {extraction.receipt_date}',
                    severity='high'
                ))
            # Check too old (10 years); Feb 29 falls back to Feb 28 in a non-leap year
            try:
                cutoff_date = today.replace(year=today.year - 10)
            except ValueError:
                cutoff_date = today.replace(year=today.year - 10, day=28)
            if extraction.receipt_date < cutoff_date:
                warnings.append(ValidationWarning(
                    field='receipt_date',
                    rule='date_too_old',
                    message=f'Date is older than 10 years: {extraction.receipt_date}',
                    severity='medium'
                ))
        return warnings
Test Cases:
- 2025-12-30 (today) → Valid
- 2025-10-11 → Valid
- 2026-01-01 → Warning (future)
- 2015-12-31 → Valid (just inside the 10-year window)
- 2014-12-31 → Warning (too old)
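The date cases above assume a reference date of 2025-12-30. A minimal sketch that makes the reference date an explicit argument so the cases can be replayed; `years_before` and `date_warnings` are hypothetical helpers, and `future_tolerance_days` defaults to the one-day clock-skew allowance discussed under Q8:

```python
from datetime import date, timedelta
from typing import List


def years_before(day: date, years: int) -> date:
    """Same calendar day `years` earlier; Feb 29 falls back to Feb 28."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        # Only reachable for Feb 29 when the target year is not a leap year.
        return day.replace(year=day.year - years, day=28)


def date_warnings(receipt_date: date, today: date, future_tolerance_days: int = 1) -> List[str]:
    """Return the names of the date rules that the receipt date violates."""
    rules = []
    if receipt_date > today + timedelta(days=future_tolerance_days):
        rules.append("date_future")
    if receipt_date < years_before(today, 10):
        rules.append("date_too_old")
    return rules


today = date(2025, 12, 30)
for d in (date(2025, 12, 30), date(2026, 1, 1), date(2015, 12, 31), date(2014, 12, 31)):
    print(d, date_warnings(d, today))
```

Passing `today` in rather than calling `date.today()` inside the function keeps the unit tests deterministic.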
Acceptance Criteria
Critical Success Criteria (Must Pass)
✅ AC-1: Five-Holding receipt extracts correct values
- Given: Production PDF receipt (Five-Holding, 85.99 LEI)
- When: OCR processes with new validation
- Then:
- TOTAL = 85.99 LEI (NOT 859,762.16)
- TVA = 14.92 LEI (NOT 149,214.92)
- CUI = RO10562600
- Overall confidence >= 90%
✅ AC-2: Save works with validation warnings
- Given: Receipt with low confidence (75%)
- When: User clicks Save
- Then:
- Warnings displayed in UI
- Save button enabled
- Receipt saved with needs_manual_review=TRUE
✅ AC-3: Cross-validation: CARD + NUMERAR = TOTAL
- Given: Receipt with CARD=50, NUMERAR=35.99
- When: OCR extracts TOTAL=85.94 (off by 0.05, beyond the ±0.02 tolerance)
- Then:
- Warning displayed: "Payment sum (85.99) ≠ TOTAL (85.94)"
- Suggested value: 85.99
- Auto-corrected if confidence < 80%
✅ AC-4: Cross-validation: Σ(TVA entries) = TVA TOTAL
- Given: Receipt with TVA A=10.00, TVA B=4.92
- When: OCR extracts TVA TOTAL=14.89 (off by 0.03, beyond the 0.02 tolerance)
- Then:
- Warning displayed: "TVA entries sum (14.92) ≠ TVA TOTAL (14.89)"
- Auto-corrected to 14.92
✅ AC-5: CUI Mod 11 validation works
- Given: Receipt with CUI R010562600
- When: OCR processes
- Then:
- CUI validated against Mod 11 checksum
- If invalid, warning displayed
- Format normalized to "RO" prefix
Secondary Criteria (Nice-to-Have)
🔲 AC-S1: Medium OCR performs better than Heavy
- Given: 10 clear PDF receipts
- When: Processed with Light → Medium → Tesseract
- Then:
- No 10x magnitude errors
- Average confidence >= 90%
- Processing time < 5s
🔲 AC-S2: Validation warnings show in UI
- Given: Receipt with 3 validation warnings
- When: OCR completes
- Then:
- Warning section displayed
- Each warning shows: field, message, severity
- Suggested values displayed if available
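To make the warning shape in AC-S2 concrete, here is a sketch of how warnings might be serialized for the API response. The `ValidationWarning` fields mirror the constructor calls used by the rules in this spec; the payload key `validation_warnings`, the helper name, and the choice to set `needs_manual_review` whenever any warning exists are assumptions for illustration:

```python
from dataclasses import asdict, dataclass
from decimal import Decimal
from typing import Any, Dict, List, Optional


@dataclass
class ValidationWarning:
    field: str
    rule: str
    message: str
    severity: str
    suggested_value: Optional[Decimal] = None


def build_validation_payload(warnings: List[ValidationWarning]) -> Dict[str, Any]:
    """Turn warnings into JSON-friendly dicts plus a review flag (assumed policy)."""
    items = []
    for warning in warnings:
        item = asdict(warning)
        if item["suggested_value"] is not None:
            # Serialize Decimal as a string so no precision is lost in JSON.
            item["suggested_value"] = str(item["suggested_value"])
        items.append(item)
    return {"validation_warnings": items, "needs_manual_review": bool(items)}


payload = build_validation_payload([
    ValidationWarning(
        field="tva_total",
        rule="tva_entries_sum_mismatch",
        message="TVA entries sum (14.92) ≠ TVA TOTAL (15.00)",
        severity="medium",
        suggested_value=Decimal("14.92"),
    )
])
print(payload)
```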
Testing Strategy
Unit Tests (~300 lines)
File: backend/modules/data_entry/tests/test_ocr_validation.py
Test Coverage:
# Amount validation
test_amount_range_valid()
test_amount_range_too_small()
test_amount_range_too_large()
test_amount_decimal_places()
# TVA validation
test_tva_ratio_valid()
test_tva_ratio_too_low()
test_tva_ratio_too_high()
test_tva_greater_than_total()
test_tva_entries_sum_matches()
test_tva_entries_sum_mismatch()
# Payment validation
test_payment_sum_matches()
test_payment_sum_mismatch_within_tolerance()
test_payment_sum_mismatch_auto_corrected()
# CUI validation
test_cui_checksum_valid()
test_cui_checksum_invalid()
test_cui_length_invalid()
test_cui_normalization()
# Date validation
test_date_valid()
test_date_future()
test_date_too_old()
# Inter-OCR consistency
test_inter_ocr_consistency_valid()
test_inter_ocr_consistency_10x_difference()
# Validation engine
test_validation_engine_no_warnings()
test_validation_engine_multiple_warnings()
test_validation_engine_auto_corrections()
test_needs_manual_review_flag()
Integration Tests (~200 lines)
File: backend/modules/data_entry/tests/test_ocr_validation_integration.py
Test Coverage:
# Real receipts
test_five_holding_receipt() # Production case (85.99 not 859,762.16)
test_omv_receipt() # Clear PDF, Light OCR only
test_kaufland_receipt() # Faded thermal, Medium OCR
test_mega_image_receipt() # Multiple TVA entries
# OCR pipeline
test_light_ocr_high_confidence_skips_medium()
test_light_ocr_low_confidence_runs_medium()
test_medium_ocr_replaces_heavy()
test_validation_runs_after_merge()
# API responses
test_api_returns_validation_warnings()
test_api_returns_needs_manual_review_flag()
test_api_returns_inter_ocr_ratio()
test_api_auto_corrects_amount_from_payments()
# Edge cases
test_no_ocr_engines_available()
test_pdf_with_multiple_pages()
test_receipt_with_no_tva()
test_receipt_with_no_payment_methods()
Manual Testing Checklist
- Upload Five-Holding receipt PDF (production case)
  - Verify TOTAL = 85.99 (not 859,762.16)
  - Verify TVA = 14.92 (not 149,214.92)
  - Verify no validation warnings
  - Verify overall confidence >= 90%
- Upload faded thermal receipt photo
  - Verify Medium OCR used (not Heavy)
  - Verify readable text extracted
  - Verify no digit concatenation
- Upload receipt with payment methods
  - Verify CARD + NUMERAR displayed
  - Verify sum matches TOTAL
  - If mismatch, verify warning displayed
- Upload receipt with multiple TVA entries
  - Verify all TVA entries extracted
  - Verify sum matches TVA TOTAL
  - If mismatch, verify warning displayed
- Submit receipt with warnings
  - Verify Save button enabled
  - Verify warnings displayed in UI
  - Verify needs_manual_review flag set
- Filter receipts by "Needs Review"
  - Verify filter shows flagged receipts
  - Verify supervisor can review
Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Medium OCR still causes errors | Medium | High | Keep Tesseract as Step 3 fallback; validation catches issues |
| CUI Mod 11 validation too strict | Medium | Low | Use warning (not error); allow override; some old CIFs don't have checksums |
| Validation rules too permissive | Low | Medium | Start conservative, tune based on production data |
| Validation rules too strict | Medium | Low | Non-blocking warnings allow user override |
| Performance impact | Low | Low | Validation is fast (<10ms); OCR dominates processing time |
| Breaking changes to API | Low | High | Add new fields, keep existing fields unchanged; frontend optional |
| Database migration issues | Low | Medium | Use NULL default (not FALSE); test on staging first |
Out of Scope
Explicitly NOT included in this feature:
- ❌ Reprocessing existing receipts - Only new uploads validated
- ❌ Machine learning OCR improvements - Use existing PaddleOCR/Tesseract
- ❌ Custom OCR training - Generic models only
- ❌ Approval workflow changes - Validation is separate from approval
- ❌ Automatic approval - Always requires supervisor review
- ❌ Advanced validation rules - Only basic sanity checks
- ❌ Multi-currency support - RON only for now
- ❌ Historical receipt validation - Phase 2 feature
- ❌ OCR confidence tuning - Accept engine defaults
- ❌ Frontend validation logic - Backend only (frontend displays)
Open Questions
Q1: Should we keep Heavy preprocessing as fallback?
Answer: No. Remove completely. Evidence shows it causes more harm than good on clear PDFs. Medium preprocessing handles mixed-quality images better.
Q2: What tolerance for payment sum validation?
Answer: ±0.02 RON (2 cents). Romanian receipts use 2 decimal places. This handles rounding errors.
Q3: Should CUI validation be blocking or warning?
Answer: Warning only. Some old Romanian CIFs (pre-2000) don't have Mod 11 checksums. Also, OCR may extract digits incorrectly.
Q4: What if Light OCR has high confidence but wrong values?
Answer: Validation catches this. If Light OCR extracts 859,762.16 with 98% confidence, amount_range rule flags it (>100,000 limit). User sees warning.
Q5: Should we reprocess existing receipts with new validation?
Answer: No. Too risky and time-consuming. Apply to new uploads only. If user wants to re-validate old receipt, they can re-upload.
Q6: What about receipts with no payment methods?
Answer: No validation warning. Not all receipts show CARD/NUMERAR breakdown (especially older thermal receipts). Only validate if payment methods are extracted.
Q7: Should validation auto-correct or just warn?
Answer: Both. Auto-correct obvious errors (TOTAL from payment sum if confidence < 80%). Warn for ambiguous cases. Never silently change high-confidence values.
Q8: How to handle receipts from future (clock skew)?
Answer: Warning only (not error). Allow up to 1 day in future (±24h tolerance) for clock skew. Beyond that, warn user.
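The answers to Q2 and Q7 combine into a small decision: differences within ±0.02 RON are ignored, larger differences are auto-corrected only when OCR confidence is below 80%, and otherwise only a warning is raised. A minimal sketch under those assumptions (the function name, the returned labels, and confidence expressed as a 0-1 fraction are hypothetical):

```python
from decimal import Decimal
from typing import Optional, Tuple

TOLERANCE = Decimal("0.02")   # Q2: ±0.02 RON
AUTO_CORRECT_BELOW = 0.80     # Q7: auto-correct only below 80% confidence


def reconcile_total(total: Decimal, payment_sum: Decimal, confidence: float) -> Tuple[Decimal, Optional[str]]:
    """Return (value to keep, action), where action is None, 'auto_corrected' or 'warning'."""
    if abs(payment_sum - total) <= TOLERANCE:
        return total, None
    if confidence < AUTO_CORRECT_BELOW:
        # Low-confidence TOTAL: trust the payment sum instead.
        return payment_sum, "auto_corrected"
    # High-confidence values are never changed silently; the user sees a warning.
    return total, "warning"


print(reconcile_total(Decimal("85.98"), Decimal("85.99"), 0.75))
print(reconcile_total(Decimal("85.94"), Decimal("85.99"), 0.75))
print(reconcile_total(Decimal("85.94"), Decimal("85.99"), 0.95))
```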
Estimated Complexity
Overall: High
Justification:
- File Count: 6 modified, 3 created, 1 migration = 10 files
- Line Changes: ~1,135 lines (400 new validation, 300 tests, 200 integration tests, 235 modifications)
- Risk Level: Medium (core OCR pipeline changes, but validation is additive)
- Testing: ~42 new test cases (26 unit, 16 integration, as listed under Testing Strategy), manual testing required
- Dependencies: None (uses existing OCR engines)
- Complexity Factors:
- Multi-layer validation logic
- Romanian CIF checksum algorithm
- Cross-field validation dependencies
- Inter-OCR comparison logic
- Auto-correction logic
- Frontend integration
- Database migration
Estimated Effort: 2-3 days
- Day 1: Validation engine + unit tests
- Day 2: OCR pipeline integration + medium preprocessing
- Day 3: Frontend integration + manual testing + bug fixes
Dependencies
External Libraries
- ✅ cv2 (OpenCV) - Already installed
- ✅ numpy - Already installed
- ✅ paddleocr - Already installed
- ✅ tesseract - Already installed
- ✅ pydantic - Already installed
- ✅ sqlalchemy - Already installed
Internal Modules
- ✅ backend/modules/data_entry/services/ocr_service.py
- ✅ backend/modules/data_entry/services/ocr_extractor.py
- ✅ backend/modules/data_entry/services/image_preprocessor.py
- ✅ backend/modules/data_entry/routers/ocr.py
- ✅ backend/modules/data_entry/schemas/ocr.py
- ✅ backend/modules/data_entry/db/models/receipt.py
Database Schema Changes
- ✅ Add needs_manual_review column to receipts table (nullable BOOLEAN)
- ✅ Alembic migration required
Implementation Notes
Priority Order (Recommended)
- Phase 1: Core Validation (Day 1)
  - Create ocr/validation.py module
  - Implement validation rules (amount, TVA, payment, CUI, date)
  - Write unit tests
  - Checkpoint: All tests pass
- Phase 2: OCR Integration (Day 2 Morning)
  - Add preprocess_medium() to image_preprocessor
  - Update _merge_extractions() with validation-aware logic
  - Remove/deprecate preprocess_heavy()
  - Checkpoint: Five-Holding receipt extracts correctly
- Phase 3: API Updates (Day 2 Afternoon)
  - Update ExtractionResult dataclass with validation fields
  - Update API schemas (ocr.py, routers/ocr.py)
  - Add database migration
  - Checkpoint: API returns validation warnings
- Phase 4: Integration Testing (Day 3 Morning)
  - Write integration tests
  - Test with real receipts (Five-Holding, OMV, Kaufland)
  - Checkpoint: All integration tests pass
- Phase 5: Frontend & Polish (Day 3 Afternoon)
  - Update Vue components to display warnings
  - Add "Needs Review" filter
  - Manual testing
  - Bug fixes
  - Checkpoint: Production-ready
Code Quality Standards
- ✅ Type hints for all functions
- ✅ Docstrings for all public methods
- ✅ Unit test coverage >90%
- ✅ Integration tests for critical paths
- ✅ Print statements for debugging (will be converted to logging later)
- ✅ Follow existing code patterns (SQLModel, Pydantic v2, FastAPI)
Performance Considerations
- Validation overhead: <10ms per receipt (negligible vs. OCR time)
- Medium preprocessing: Similar speed to Heavy (~500ms)
- Database migration: Non-blocking (adds NULL column)
- Frontend impact: Minimal (only displays warnings)
Related Documentation
Project Context
- CLAUDE.md: Data Entry module instructions
- docs/data-entry/DATA-ENTRY-MODULE.md: Module architecture
- docs/ARCHITECTURE-DECISIONS.md: Ultrathin monolith rationale
Technical References
- Romanian CIF validation: https://ro.wikipedia.org/wiki/Cod_de_identificare_fiscal%C4%83
- OpenCV preprocessing: https://docs.opencv.org/4.x/d7/d4d/tutorial_py_thresholding.html
- PaddleOCR docs: https://github.com/PaddlePaddle/PaddleOCR
Similar Features
- Payment methods extraction: Already implemented in ocr_extractor.py:1361
- TVA entries extraction: Already implemented in ocr_extractor.py:820
- Cross-validation logic: Pattern from _cross_validate_and_calculate_amount (lines 468-557)
Summary
This specification provides a comprehensive solution to fix critical OCR data extraction issues in the Data Entry module. The multi-layer validation system ensures data integrity while maintaining user flexibility through non-blocking warnings.
Key Benefits:
- ✅ Prevents 10,000x magnitude errors (85.99 vs 859,762.16)
- ✅ Validates cross-field dependencies (payment sum, TVA sum)
- ✅ Improves CUI extraction with Mod 11 checksum
- ✅ Replaces problematic Heavy OCR with Medium preprocessing
- ✅ Non-blocking warnings preserve user workflow
- ✅ Manual review flag helps supervisors prioritize
Next Steps:
- Review and approve specification
- Create feature branch: feature/bon-ocr-validation
- Implement Phase 1 (validation engine)
- Continue with Phases 2-5
- Deploy to staging for testing
- Monitor production for 1 week before full rollout
Document Version: 1.0 Last Updated: 2025-12-30 Status: Ready for Implementation Estimated Completion: 2026-01-02 (3 working days)