roa2web-service-auto/.auto-build/specs/bon-ocr-validation/spec.md
Marius Mutu ab160b628d feat(ocr): Add validation system and CLIENT CUI extraction
OCR Data Extraction Validation System:
- Add 7 validation rules (amount range, TVA ratio, payment sum, etc.)
- Add Medium preprocessing to replace Heavy (fixes digit concatenation)
- Add validation warnings to API responses
- Flag receipts needing manual review (needs_manual_review field)
- Add database migration for needs_manual_review column

CLIENT CUI Extraction Improvements:
- Support all format variations: CIF CLIENT:, CLIENT C.U.I/C.I.F., etc.
- Handle OCR errors (R0 vs RO, C1F vs CIF)
- Add client_name, client_cui, client_address to API response
- Add validation fields to API response (was missing)

QA Review: 12 issues found, 9 fixed (5 errors + 4 warnings)
- Fixed type safety in validation rules
- Fixed ZeroDivisionError risk
- Fixed schema mismatch (Optional[bool] for needs_manual_review)
- All 37 unit tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-30 19:12:52 +02:00


Feature Specification: OCR Data Extraction Validation System

Feature ID: bon-ocr-validation
Priority: Critical (P0 - Production Bug)
Complexity: High
Estimated Effort: 2-3 days
Created: 2025-12-30
Module: Data Entry (backend/modules/data_entry/)


Overview

Fix critical OCR data extraction issue where PaddleOCR Heavy preprocessing (88% confidence) overwrites correct Light OCR (98% confidence) data with garbage values, causing 10,000x magnitude errors in production receipts.

Value Proposition: Prevent incorrect financial data from entering the system, reduce manual corrections, improve user trust in OCR accuracy.


Problem Statement

Current Behavior (BROKEN)

The OCR processing pipeline (backend/modules/data_entry/services/ocr_service.py) uses a 3-step adaptive approach:

  1. Step 1: PaddleOCR + Light preprocessing (fast, high confidence)
  2. Step 2: PaddleOCR + Heavy preprocessing (faded receipts)
  3. Step 3: Tesseract (complement missing fields only)

Critical Bug: The _merge_extractions() method (lines 240-386) blindly prefers higher OCR confidence scores WITHOUT validating actual extracted values.

Real Production Example (Five-Holding Receipt)

| Field | Light OCR (98%) | Heavy OCR (88%) | Final Result |
|---|---|---|---|
| TOTAL | 85.99 LEI | 859,762.16 LEI | 859,762.16 (10,000x error!) |
| TVA | 14.92 LEI | 149,214.92 LEI | 149,214.92 (10,000x error!) |
| CUI | R010562600 | (not found) | R010562600 |
| Date | 2025-10-11 | 2025-10-11 | 2025-10-11 |
| Confidence | 98% | 88% | 88% (wrong source!) |

Root Cause: Heavy preprocessing causes digit concatenation on high-quality PDFs. The binarization and morphological operations (lines 153-164 in image_preprocessor.py) merge adjacent numbers, creating garbage values.
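
The size of the discrepancy in the table can be checked directly. The sketch below is illustrative only: `magnitude_ratio` is a hypothetical helper name, not a function from the codebase, and the two amounts are the TOTAL values from the production example.

```python
from decimal import Decimal

def magnitude_ratio(a: Decimal, b: Decimal) -> Decimal:
    """Return how many times larger the bigger value is than the smaller one."""
    lo, hi = sorted((abs(a), abs(b)))
    if lo == 0:
        raise ValueError("cannot compare against a zero amount")
    return hi / lo

light_total = Decimal("85.99")      # Light OCR value from the table
heavy_total = Decimal("859762.16")  # Heavy OCR value from the table

ratio = magnitude_ratio(light_total, heavy_total)
print(f"Light vs Heavy TOTAL differ by about {ratio:.0f}x")
```

Any ratio of this size is far beyond the 10x threshold used later for inter-OCR consistency, so a plain ratio check would already have flagged this receipt.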

Impact

  • Data Integrity: Incorrect amounts enter accounting system
  • User Trust: Users lose confidence in OCR accuracy
  • Manual Work: Requires manual verification of ALL OCR extractions
  • Financial Risk: Wrong amounts could be approved without review

User Stories

1. As a user uploading a clear PDF receipt

I want OCR to extract correct values from the first pass
So that I don't have to manually correct obvious errors

Acceptance Criteria:

  • Light OCR correctly extracts 85.99 LEI (not 859,762.16)
  • Heavy OCR is skipped when Light OCR confidence >= 90%
  • No 10,000x magnitude errors

2. As a user submitting a receipt with warnings

I want to be able to save receipts with validation warnings
So that I can submit for review even if OCR isn't perfect

Acceptance Criteria:

  • Save button works with warnings (not blocked)
  • Receipt marked with needs_manual_review=True
  • Warnings displayed clearly in UI

3. As a supervisor reviewing receipts

I want to see which receipts need manual review
So that I can prioritize validation efforts

Acceptance Criteria:

  • Filter by "Needs Review" flag
  • Validation warnings shown in detail view
  • Clear indication of which fields are suspicious

4. As a system validating cross-field data

I want to validate CARD + NUMERAR = TOTAL
So that payment methods match the total amount

Acceptance Criteria:

  • Cross-validation: sum of payment methods = TOTAL (±0.02 RON tolerance)
  • If mismatch, flag for review
  • Auto-correct TOTAL from payment sum if confidence < 80%

5. As a system validating TVA entries

I want to validate Σ(TVA entries) = TVA TOTAL
So that individual TVA lines match the total TVA

Acceptance Criteria:

  • Cross-validation: sum of TVA entries = TVA TOTAL (±0.02 RON tolerance)
  • TVA rate validation (5-24% of TOTAL)
  • If mismatch, flag for review

Functional Requirements

Core Requirements (Must-Have)

1. Multi-Layer Validation Pipeline

FR-1.1: Absolute value sanity checks

  • Amount range: 0.01 - 100,000 RON
  • Max 2 decimal places
  • Date: not in future, not older than 10 years (2015+)
  • CUI: 6-10 digits, valid Mod 11 checksum
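
The date check in FR-1.1 can be sketched as follows. This is a minimal illustration, not the implementation of `DateValidityRule`; the function name `date_is_plausible` and the constant name are assumptions introduced here.

```python
from datetime import date
from typing import Optional

MAX_RECEIPT_AGE_YEARS = 10  # "not older than 10 years" from FR-1.1

def date_is_plausible(receipt_date: date, today: Optional[date] = None) -> bool:
    """True when the date is not in the future and at most 10 years old."""
    today = today or date.today()
    if receipt_date > today:
        return False
    try:
        oldest = today.replace(year=today.year - MAX_RECEIPT_AGE_YEARS)
    except ValueError:
        # today is 29 February and the target year is not a leap year
        oldest = today.replace(year=today.year - MAX_RECEIPT_AGE_YEARS, day=28)
    return receipt_date >= oldest

reference_day = date(2025, 12, 30)  # the spec's creation date, used as "today"
print(date_is_plausible(date(2025, 10, 11), reference_day))
```

Passing `today` explicitly keeps the check deterministic in unit tests.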

FR-1.2: Cross-field correlation validation

  • TVA: 5-24% of TOTAL amount (Romanian rates: 5%, 9%, 11%, 19%, 21%)
  • Payment methods: CARD + NUMERAR = TOTAL (±0.02 RON tolerance)
  • Inter-OCR consistency: flag if values differ >10x between engines
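
To relate the listed TVA rates to the "5-24% of TOTAL" window, it helps to convert each rate into its share of a VAT-inclusive total, which is rate / (1 + rate). The sketch below assumes that TOTAL on the receipt includes TVA; the function name is hypothetical.

```python
from decimal import Decimal

# Rates listed in FR-1.2
RATES = [Decimal("0.05"), Decimal("0.09"), Decimal("0.11"),
         Decimal("0.19"), Decimal("0.21")]

def tva_share_of_total(rate: Decimal) -> Decimal:
    """Percentage of a VAT-inclusive total that is TVA, for a single rate."""
    return (rate / (1 + rate) * 100).quantize(Decimal("0.01"))

for rate in RATES:
    print(f"rate {rate * 100:.0f}% -> TVA is {tva_share_of_total(rate)}% of TOTAL")
```

Under this assumption the 21% rate gives about 17.36%, which agrees with the production example (14.92 / 85.99), while the 5% rate gives about 4.76%, slightly below the 5% lower bound, so a boundary test for that rate may be worthwhile.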

FR-1.3: Auto-correction logic

  • If TOTAL is obviously wrong (>10x payment sum), use payment sum
  • If TVA > TOTAL, recalculate TOTAL from TVA using reverse formula
  • Preserve high-confidence values from Light OCR over low-confidence Heavy OCR
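
The spec does not spell out the "reverse formula". One plausible reading, assuming a single TVA rate and a VAT-inclusive TOTAL, is TOTAL = TVA × (1 + rate) / rate. The sketch below illustrates that reading only; `total_from_tva` is a hypothetical name.

```python
from decimal import Decimal, ROUND_HALF_UP

def total_from_tva(tva: Decimal, rate: Decimal) -> Decimal:
    """Estimate a VAT-inclusive TOTAL from the TVA amount and a single rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    total = tva * (1 + rate) / rate
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Production example: TVA 14.92 at an assumed 21% rate
estimate = total_from_tva(Decimal("14.92"), Decimal("0.21"))
print(estimate)
```

Because the TVA printed on a receipt is already rounded, the estimate can differ from the printed TOTAL by a few bani, so a result like this is better treated as a suggested value than as an exact replacement.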

FR-1.4: Validation result structure

@dataclass
class ValidationResult:
    is_valid: bool
    warnings: List[ValidationWarning]  # Non-blocking issues
    errors: List[ValidationError]      # Blocking issues (none for now)
    corrected_fields: Dict[str, Any]   # Auto-corrected values
    needs_manual_review: bool          # Flag for supervisor

2. Replace Heavy with Medium OCR

FR-2.1: Remove preprocess_heavy() from the OCR pipeline (the method itself is kept, marked deprecated)

  • Current Heavy: aggressive binarization causes digit concatenation
  • Reason: Destroys high-quality PDFs while trying to recover faded receipts

FR-2.2: Add preprocess_medium() method

  • Moderate contrast enhancement (CLAHE clipLimit=2.0)
  • Light denoising (fastNlMeansDenoising h=6)
  • NO binarization, NO morphological operations
  • Preserve text boundaries on clear images

FR-2.3: Update OCR pipeline

  • Step 1: Light preprocessing (unchanged)
  • Step 2: Medium preprocessing (replaces Heavy)
  • Step 3: Tesseract (unchanged)

3. Enhanced CUI Extraction

FR-3.1: Romanian CIF validation algorithm

  • Implement Mod 11 checksum validation
  • Control digit formula: (sum(digit[i] * weight[i]) * 10) % 11, with a result of 10 mapped to 0
  • Weights: [7, 5, 3, 2, 1, 7, 5, 3, 2], right-aligned against the digits that precede the control digit
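
A standalone sketch of the checksum, as the Romanian CIF control digit is commonly described: the digits before the control digit are left-padded to 9, multiplied by the key 753217532, and the weighted sum is multiplied by 10 before taking the remainder modulo 11. The function names are assumptions for illustration.

```python
CIF_KEY = "753217532"  # test key, right-aligned against the padded digits

def cif_control_digit(body: str) -> int:
    """Control digit for the CIF digits that precede it (1 to 9 digits)."""
    if not body.isdigit() or not 1 <= len(body) <= 9:
        raise ValueError("expected 1 to 9 digits")
    padded = body.zfill(9)
    weighted_sum = sum(int(d) * int(k) for d, k in zip(padded, CIF_KEY))
    remainder = (weighted_sum * 10) % 11
    return 0 if remainder == 10 else remainder

def cif_is_valid(digits: str) -> bool:
    """True when the last digit equals the computed control digit."""
    if not digits.isdigit() or not 2 <= len(digits) <= 10:
        return False
    return cif_control_digit(digits[:-1]) == int(digits[-1])

print(cif_is_valid("10562600"), cif_is_valid("10562601"))
```

With this formulation the CUI from the production example passes and the variant with a changed last digit fails, which matches the unit test cases listed later in the spec.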

FR-3.2: CUI format normalization

  • Always add "RO" prefix if missing
  • Remove spaces, dashes, dots
  • Validate length: 6-10 digits

FR-3.3: Improved regex patterns

# Add OCR-tolerant patterns (current patterns are too strict)
CUI_OCR_TOLERANT_PATTERNS = [
    r'CIF[:\s]*(?:R[O0])?\s*(\d[\d\s]{5,9})',  # Optional RO/R0 prefix, spaces in CUI
    r'C[I1]F[:\s]*(\d[\d\s]{6,10})',        # C1F (I→1 OCR error)
    r'C\.?\s*[I1]\.?\s*F\.?[:\s]*(\d+)',   # C. I. F. (spaced)
]
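
A quick way to exercise OCR-tolerant patterns of this kind is to run them against a few sample strings. The sketch below uses patterns in the same spirit, with the whole RO/R0 prefix optional; `find_cui_digits` is a hypothetical helper and the sample strings are made up for illustration.

```python
import re
from typing import Optional

PATTERNS = [
    r'CIF[:\s]*(?:R[O0])?\s*(\d[\d\s]{5,9})',  # optional RO/R0 prefix, spaces in CUI
    r'C[I1]F[:\s]*(\d[\d\s]{6,10})',            # C1F (I read as 1)
    r'C\.?\s*[I1]\.?\s*F\.?[:\s]*(\d+)',        # C. I. F. (dotted or spaced)
]

def find_cui_digits(text: str) -> Optional[str]:
    """Return the first CUI-like digit run, with internal whitespace removed."""
    for pattern in PATTERNS:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return re.sub(r'\s', '', match.group(1))
    return None

for sample in ["CIF: RO10562600", "CIF R0 1056 2600", "C1F: 10562600", "C.I.F.: 10562600"]:
    print(sample, "->", find_cui_digits(sample))
```

One caveat when testing: `\s` inside the capture group also matches newlines, so digits at the start of the following line can be absorbed into the match. Restricting the class to digits and plain spaces, or validating the checksum afterwards, guards against that.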

4. User Requirements Integration

FR-4.1: Non-blocking validation warnings

  • Save button enabled even with warnings
  • User can override and submit
  • Warnings displayed clearly in UI

FR-4.2: Manual review flag

  • Database field: receipts.needs_manual_review (BOOLEAN)
  • Set to TRUE if:
    • Any validation warning present
    • Overall confidence < 85%
    • Cross-validation fails
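
The three conditions in FR-4.2 combine with a simple OR. A minimal sketch, assuming warnings arrive as a sequence and confidence as a float between 0 and 1; the function name and the handling of a missing confidence are assumptions, not part of the spec.

```python
from typing import Optional, Sequence

REVIEW_CONFIDENCE_THRESHOLD = 0.85  # "Overall confidence < 85%"

def needs_manual_review(
    warnings: Sequence[object],
    overall_confidence: Optional[float],
    cross_validation_failed: bool = False,
) -> bool:
    """Return True when any FR-4.2 condition holds."""
    if warnings or cross_validation_failed:
        return True
    if overall_confidence is None:
        # Assumption: a missing confidence is treated as low confidence
        return True
    return overall_confidence < REVIEW_CONFIDENCE_THRESHOLD

print(needs_manual_review([], 0.98), needs_manual_review([], 0.84))
```

Keeping the threshold in a named constant makes it straightforward to move into configuration later (see FR-S2).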

FR-4.3: Apply to new uploads only

  • No reprocessing of existing receipts
  • Validation runs on OCR extraction (POST /api/ocr/extract)
  • Migration: add column with default NULL (not FALSE)

Secondary Requirements (Nice-to-Have)

FR-S1: Validation confidence scoring

  • Each validation rule contributes to score
  • Overall validation confidence: weighted average
  • Display in UI alongside OCR confidence
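
One way to read "weighted average" in FR-S1 is sketched below. The per-rule scores, the weights, and the function name are all hypothetical; the spec does not define them.

```python
from typing import Dict

def validation_confidence(rule_scores: Dict[str, float],
                          weights: Dict[str, float]) -> float:
    """Weighted average of per-rule scores; rules without a weight count as 1.0."""
    total_weight = sum(weights.get(name, 1.0) for name in rule_scores)
    if total_weight == 0:
        return 0.0  # no rules ran, or all weights are zero
    weighted = sum(score * weights.get(name, 1.0)
                   for name, score in rule_scores.items())
    return weighted / total_weight

example_scores = {"amount_range": 1.0, "tva_ratio": 0.5}
example_weights = {"amount_range": 3.0, "tva_ratio": 1.0}
print(validation_confidence(example_scores, example_weights))
```

Guarding the zero-weight case avoids a ZeroDivisionError when no rule produced a score.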

FR-S2: Validation rule configurability

  • Move hardcoded thresholds to config
  • Allow per-company customization
  • Admin UI to adjust rules
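
A possible shape for configurable thresholds is sketched below, using the values that appear elsewhere in this spec as defaults. The class name, field names, the override table, and company id 42 are hypothetical.

```python
from dataclasses import dataclass, replace
from decimal import Decimal
from typing import Dict

@dataclass(frozen=True)
class ValidationThresholds:
    amount_min: Decimal = Decimal("0.01")
    amount_max: Decimal = Decimal("100000")
    tva_ratio_min_pct: Decimal = Decimal("5")
    tva_ratio_max_pct: Decimal = Decimal("24")
    sum_tolerance: Decimal = Decimal("0.02")
    review_confidence: float = 0.85

DEFAULTS = ValidationThresholds()

# Hypothetical per-company overrides, keyed by company id
COMPANY_OVERRIDES: Dict[int, Dict[str, object]] = {
    42: {"amount_max": Decimal("500000")},
}

def thresholds_for(company_id: int) -> ValidationThresholds:
    """Defaults with any per-company overrides applied."""
    return replace(DEFAULTS, **COMPANY_OVERRIDES.get(company_id, {}))

print(thresholds_for(42).amount_max, thresholds_for(1).amount_max)
```

A frozen dataclass keeps the defaults immutable, so an override for one company cannot leak into another.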

Technical Requirements

Files to Create

1. backend/modules/data_entry/services/ocr/validation.py

Purpose: Validation utilities and rule engine
Size: ~400 lines
Key Classes:

  • ValidationRule (base class)
  • AmountRangeRule, TVARatioRule, PaymentSumRule, CUIChecksumRule
  • OCRValidationEngine (orchestrator)

Example:

@dataclass
class ValidationWarning:
    """Non-blocking validation warning."""
    field: str
    rule: str
    message: str
    severity: str  # 'low', 'medium', 'high'
    suggested_value: Optional[Any] = None

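class ValidationRule(ABC):
    """Base validation rule."""
    @abstractmethod
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        pass

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Return field corrections; rules without auto-correction return none."""
        return {}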

class AmountRangeRule(ValidationRule):
    """Validate amount is in reasonable range (0.01 - 100,000 RON)."""
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.amount is not None:  # 0.00 must still be checked
            if extraction.amount < Decimal('0.01'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} is too small (< 0.01 RON)',
                    severity='high'
                ))
            elif extraction.amount > Decimal('100000'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} exceeds limit (> 100,000 RON)',
                    severity='high'
                ))
        return warnings

class OCRValidationEngine:
    """Orchestrate all validation rules."""
    def __init__(self):
        self.rules = [
            AmountRangeRule(),
            TVARatioRule(),
            PaymentSumRule(),
            TVAEntriesSumRule(),
            InterOCRConsistencyRule(),
            CUIChecksumRule(),
            DateValidityRule(),
        ]

    def validate(self, extraction: ExtractionResult) -> ValidationResult:
        """Run all validation rules and return result."""
        all_warnings = []
        corrected_fields = {}

        for rule in self.rules:
            warnings = rule.validate(extraction)
            all_warnings.extend(warnings)

            # Apply auto-corrections
            corrections = rule.auto_correct(extraction)
            corrected_fields.update(corrections)

        needs_review = (
            len(all_warnings) > 0 or
            extraction.overall_confidence < 0.85
        )

        return ValidationResult(
            is_valid=True,  # Never block (warnings only)
            warnings=all_warnings,
            errors=[],
            corrected_fields=corrected_fields,
            needs_manual_review=needs_review
        )

2. backend/modules/data_entry/tests/test_ocr_validation.py

Purpose: Unit tests for validation rules
Size: ~300 lines
Coverage Target: >90%

Test Cases:

  • test_amount_range_valid() - 85.99 RON passes
  • test_amount_range_too_high() - 859,762.16 fails
  • test_tva_ratio_valid() - 14.92/85.99 ≈ 17.4% passes
  • test_tva_ratio_too_high() - 149,214.92/859,762.16 ≈ 17.4%, so the ratio alone passes even though both amounts are wrong (caught by the amount range rule)
  • test_payment_sum_matches() - CARD 50 + NUMERAR 35.99 = TOTAL 85.99
  • test_cui_checksum_valid() - R010562600 passes Mod 11
  • test_cui_checksum_invalid() - R010562601 fails Mod 11
  • test_inter_ocr_consistency() - 85.99 vs 859,762.16 = 10,000x flag

3. backend/modules/data_entry/tests/test_ocr_validation_integration.py

Purpose: Integration tests with full OCR pipeline
Size: ~200 lines

Test Cases:

  • test_five_holding_receipt() - Real production case (85.99 not 859,762.16)
  • test_clear_pdf_uses_light_ocr() - High-quality PDF skips Medium (formerly Heavy)
  • test_faded_receipt_uses_medium_ocr() - Thermal receipt uses Medium
  • test_validation_warnings_in_response() - API returns warnings
  • test_manual_review_flag_set() - Flag set when confidence < 85%

Files to Modify

1. backend/modules/data_entry/services/ocr_service.py

Changes: ~200 lines modified, ~100 lines added

Key Modifications:

A. Replace _merge_extractions() (lines 240-386) with validation-aware version:

def _merge_extractions(
    self,
    light: Optional[ExtractionResult],
    medium: Optional[ExtractionResult]  # Renamed from 'tesseract'
) -> ExtractionResult:
    """
    Merge extractions with VALIDATION-AWARE logic.

    NEW Strategy:
    1. Run validation on both extractions
    2. Prefer extraction with FEWER warnings (not just higher confidence)
    3. For each field, pick value that passes validation
    4. Flag inter-OCR inconsistencies (>10x difference)
    """
    from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine

    validator = OCRValidationEngine()

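    # Nothing to merge when one pass produced no extraction
    if light is None:
        return medium if medium is not None else ExtractionResult()
    if medium is None:
        return light

    # Validate both extractions
    light_validation = validator.validate(light)
    medium_validation = validator.validate(medium)

    result = ExtractionResult()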

    # === AMOUNT (with validation check) ===
    if light.amount and medium.amount:
        # Check for 10x inconsistency
        ratio = max(light.amount, medium.amount) / min(light.amount, medium.amount)
        result.inter_ocr_ratio = float(ratio)
        if ratio > 10:
            print(f"[Merge] WARNING: Inter-OCR inconsistency: {light.amount} vs {medium.amount} ({ratio:.0f}x)", flush=True)
            # Prefer the value with fewer validation warnings; ties go to Light
            light_warnings = [w for w in light_validation.warnings if w.field == 'amount']
            medium_warnings = [w for w in medium_validation.warnings if w.field == 'amount']

            if len(light_warnings) <= len(medium_warnings):
                result.amount = light.amount
                result.confidence_amount = light.confidence_amount
                result.inter_ocr_source_used = 'light'
                print(f"[Merge] Using Light OCR amount: {light.amount} (no more warnings than Medium)", flush=True)
            else:
                result.amount = medium.amount
                result.confidence_amount = medium.confidence_amount
                result.inter_ocr_source_used = 'medium'
                print(f"[Merge] Using Medium OCR amount: {medium.amount} (fewer warnings)", flush=True)
        else:
            # Normal merge: prefer higher confidence
            if light.confidence_amount >= medium.confidence_amount:
                result.amount = light.amount
                result.confidence_amount = light.confidence_amount
            else:
                result.amount = medium.amount
                result.confidence_amount = medium.confidence_amount
    elif light.amount:
        result.amount = light.amount
        result.confidence_amount = light.confidence_amount
    elif medium.amount:
        result.amount = medium.amount
        result.confidence_amount = medium.confidence_amount

    # ... (similar logic for other fields)

    return result

B. Add preprocess_medium() call (replace Heavy):

# Line ~130: Replace preprocess_heavy with preprocess_medium
print("=" * 60, flush=True)
print("[OCR] STEP 2: PaddleOCR + Medium preprocessing", flush=True)
print("=" * 60, flush=True)
medium_img = self.preprocessor.preprocess_medium(image)  # NEW

try:
    paddle_medium = self.ocr_engine._paddle_recognize(medium_img)
    # ... rest of processing

C. Add validation to final result:

# Line ~204: Add validation before returning
if extraction:
    extraction = self._final_validation(extraction)

    # NEW: Run validation engine
    from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine
    validator = OCRValidationEngine()
    validation_result = validator.validate(extraction)

    # Apply auto-corrections
    for field, value in validation_result.corrected_fields.items():
        setattr(extraction, field, value)

    # Store validation warnings (add to ExtractionResult)
    extraction.validation_warnings = validation_result.warnings
    extraction.needs_manual_review = validation_result.needs_manual_review

2. backend/modules/data_entry/services/ocr_extractor.py

Changes: ~50 lines modified, ~30 lines added

Key Modifications:

A. Add validation fields to ExtractionResult (lines 10-50):

@dataclass
class ExtractionResult:
    """Structured extraction result from receipt."""
    # ... existing fields ...

    # NEW: Validation results
    validation_warnings: List[Any] = field(default_factory=list)  # ValidationWarning objects from the engine
    needs_manual_review: bool = False  # Flag for supervisor review

    # NEW: Inter-OCR comparison data
    inter_ocr_ratio: Optional[float] = None  # Ratio between Light/Medium values
    inter_ocr_source_used: Optional[str] = None  # 'light' or 'medium'

B. Fix CLIENT CUI patterns (lines 253-272):

# Current patterns are too strict - add OCR-tolerant versions
CLIENT_CUI_PATTERNS = [
    # ... existing patterns ...

    # NEW: OCR-tolerant patterns
    (r'CLIENT\s+C[I1UO]F\s*[:/]?\s*(?:R[O0])?(\d[\d\s]{5,9})', 0.96),  # Spaces in CUI
    (r'C[I1]F\s+CLIENT\s*[:/]?\s*(?:R[O0])?(\d[\d\s]{5,9})', 0.96),    # Reversed format
    (r'CLIENT.*?(?:R[O0])?(\d{6,10})\s*\n', 0.90),                      # CUI on next line
]

C. Add CUI normalization and validation:

def _normalize_cui(self, cui: Optional[str]) -> Optional[str]:
    """Normalize CUI format and validate checksum."""
    if not cui:
        return None

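    # Drop the country prefix first (RO, or R0 when OCR misreads the O),
    # so a misread "0" is not kept as a leading digit; then remove non-digits
    digits = re.sub(r'^\s*R[O0]', '', cui, flags=re.IGNORECASE)
    digits = re.sub(r'\D', '', digits)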

    # Validate length
    if not (6 <= len(digits) <= 10):
        return None

    # Validate Mod 11 checksum (Romanian CIF algorithm)
    if not self._validate_cui_checksum(digits):
        print(f"[CUI Validation] Invalid checksum: {digits}", flush=True)
        return None

    # Add RO prefix
    return f"RO{digits}"

def _validate_cui_checksum(self, digits: str) -> bool:
    """Validate Romanian CIF Mod 11 checksum."""
    if not digits.isdigit() or not (2 <= len(digits) <= 10):
        return False

    # Test key 753217532, right-aligned against the digits before the control digit
    weights = [7, 5, 3, 2, 1, 7, 5, 3, 2]

    # Get control digit (last digit)
    control = int(digits[-1])

    # Weighted sum of all digits except the last, left-padded with zeros to 9
    digits_to_check = digits[:-1].zfill(9)
    checksum = sum(int(d) * w for d, w in zip(digits_to_check, weights))

    # Multiply by 10 before taking Mod 11; a remainder of 10 maps to 0
    remainder = (checksum * 10) % 11
    expected_control = 0 if remainder == 10 else remainder

    return control == expected_control

3. backend/modules/data_entry/services/image_preprocessor.py

Changes: ~80 lines added

Key Modifications:

A. Add preprocess_medium() method (after line 166):

def preprocess_medium(self, image: np.ndarray) -> np.ndarray:
    """
    Medium preprocessing for MIXED-QUALITY images.
    Balance between Light (too gentle) and Heavy (too aggressive).

    Use cases:
    - Moderately faded receipts
    - Photos with uneven lighting
    - Scans with slight blur

    Preprocessing steps:
    - Moderate contrast enhancement (CLAHE clipLimit=2.0)
    - Light denoising (fastNlMeansDenoising h=6)
    - Gentle sharpening
    - NO binarization (preserves text boundaries)
    - NO morphological operations (avoids digit concatenation)
    """
    # 0. Add safety padding
    image = self._add_safety_padding(image)

    # 1. Grayscale
    if len(image.shape) == 3:
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    else:
        gray = image.copy()

    # 2. Scale (same as Light)
    height, width = gray.shape
    max_side = max(height, width)
    if max_side > 4000:
        scale = 4000 / max_side
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
        height, width = gray.shape

    if width < 1500:
        scale = 1500 / width
        new_width = int(width * scale)
        new_height = int(height * scale)
        if max(new_width, new_height) > 4000:
            # Cap relative to the current size, not the already-scaled size
            scale = 4000 / max(width, height)
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

    # 3. Deskew
    gray = self._deskew(gray)

    # 4. Moderate contrast enhancement
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)

    # 5. Light denoising (less aggressive than Heavy)
    denoised = cv2.fastNlMeansDenoising(enhanced, h=6, templateWindowSize=7, searchWindowSize=15)

    # 6. Gentle sharpening
    gaussian = cv2.GaussianBlur(denoised, (0, 0), 1.0)
    sharpened = cv2.addWeighted(denoised, 1.3, gaussian, -0.3, 0)

    # NO binarization, NO morphological operations
    # This preserves text boundaries and avoids digit concatenation
    return sharpened

B. Mark preprocess_heavy() as deprecated:

def preprocess_heavy(self, image: np.ndarray) -> np.ndarray:
    """
    Heavy preprocessing for FADED thermal receipts.

    ⚠️ DEPRECATED: Use preprocess_medium() instead.
    Heavy preprocessing causes digit concatenation on clear PDFs.
    Kept for backward compatibility only.
    """
    # ... existing code (unchanged)

4. backend/modules/data_entry/routers/ocr.py

Changes: ~40 lines modified

Key Modifications:

A. Update ExtractionData schema instantiation (lines 106-128):

# Add validation warnings to response
validation_warnings_list = [
    {
        'field': w.field,
        'rule': w.rule,
        'message': w.message,
        'severity': w.severity,
        'suggested_value': w.suggested_value
    }
    for w in result.validation_warnings
] if hasattr(result, 'validation_warnings') else []

data = ExtractionData(
    # ... existing fields ...

    # NEW: Validation fields
    validation_warnings=validation_warnings_list,
    needs_manual_review=getattr(result, 'needs_manual_review', False),
    inter_ocr_ratio=getattr(result, 'inter_ocr_ratio', None),
    inter_ocr_source_used=getattr(result, 'inter_ocr_source_used', None),
)

5. backend/modules/data_entry/schemas/ocr.py

Changes: ~20 lines added

Key Modifications:

A. Add validation fields to ExtractionData (after line 57):

class ValidationWarning(BaseModel):
    """Validation warning from OCR extraction."""
    field: str = Field(description="Field name (e.g., 'amount', 'tva_total')")
    rule: str = Field(description="Rule name (e.g., 'amount_range', 'tva_ratio')")
    message: str = Field(description="Human-readable warning message")
    severity: str = Field(description="Severity: 'low', 'medium', 'high'")
    suggested_value: Optional[Any] = Field(default=None, description="Suggested corrected value")

class ExtractionData(BaseModel):
    """Extracted receipt data from OCR."""
    # ... existing fields ...

    # NEW: Validation results
    validation_warnings: List[ValidationWarning] = Field(default=[], description="Validation warnings")
    needs_manual_review: bool = Field(default=False, description="Flag for supervisor review")
    inter_ocr_ratio: Optional[float] = Field(default=None, description="Ratio between OCR engines (>10 = inconsistent)")
    inter_ocr_source_used: Optional[str] = Field(default=None, description="OCR engine used: 'light' or 'medium'")

6. Database Migration: backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py

Purpose: Add needs_manual_review column to receipts table
Size: ~30 lines (Alembic migration)

"""Add needs_manual_review flag to receipts

Revision ID: XXX
Create Date: 2025-12-30
"""
from alembic import op
import sqlalchemy as sa

revision = 'XXX'
down_revision = 'YYY'  # Previous migration
branch_labels = None
depends_on = None

def upgrade():
    # Add column with default NULL (not FALSE)
    # NULL = not validated yet (old receipts)
    # FALSE = validated, no review needed
    # TRUE = validated, needs review
    op.add_column('receipts', sa.Column('needs_manual_review', sa.Boolean(), nullable=True))

def downgrade():
    op.drop_column('receipts', 'needs_manual_review')
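
Because the column is nullable, code that reads it has three states to handle rather than two. A minimal sketch of that mapping; the function name and the status strings are hypothetical.

```python
from typing import Optional

def review_status(needs_manual_review: Optional[bool]) -> str:
    """Map the nullable column to the three states described in the migration."""
    if needs_manual_review is None:
        return "not_validated"   # NULL: receipt predates the validation system
    if needs_manual_review:
        return "needs_review"    # TRUE: validated, flagged for a supervisor
    return "validated_ok"        # FALSE: validated, no review needed

print(review_status(None), review_status(False), review_status(True))
```

Testing with `is None` first matters here: a plain truthiness check would treat NULL and FALSE as the same state.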

Frontend Integration Points

1. src/modules/data-entry/views/receipts/ReceiptCreateView.vue

Changes: Display validation warnings below OCR results

Example:

<template>
  <div class="ocr-results">
    <!-- Existing OCR fields -->

    <!-- NEW: Validation warnings section -->
    <div v-if="ocrData.validation_warnings?.length > 0" class="validation-warnings">
      <h4>
        <i class="pi pi-exclamation-triangle" />
        Avertismente Validare ({{ ocrData.validation_warnings.length }})
      </h4>
      <ul>
        <li
          v-for="(warning, idx) in ocrData.validation_warnings"
          :key="idx"
          :class="`severity-${warning.severity}`"
        >
          <strong>{{ warning.field }}:</strong> {{ warning.message }}
          <span v-if="warning.suggested_value" class="suggestion">
            (sugestie: {{ warning.suggested_value }})
          </span>
        </li>
      </ul>
    </div>

    <!-- NEW: Manual review badge -->
    <div v-if="ocrData.needs_manual_review" class="manual-review-badge">
      <i class="pi pi-flag" />
      Necesită verificare manuală
    </div>
  </div>
</template>

<style scoped>
.validation-warnings {
  margin-top: 1rem;
  padding: 1rem;
  background: #fff3cd;
  border-left: 4px solid #ffc107;
}

.validation-warnings li.severity-low {
  color: #666;
}

.validation-warnings li.severity-medium {
  color: #f57c00;
}

.validation-warnings li.severity-high {
  color: #d32f2f;
  font-weight: bold;
}

.manual-review-badge {
  margin-top: 0.5rem;
  padding: 0.5rem 1rem;
  background: #fff3cd;
  border-radius: 4px;
  display: inline-flex;
  align-items: center;
  gap: 0.5rem;
}
</style>

2. src/modules/data-entry/components/ocr/OCRPreview.vue

Changes: Add inter-OCR consistency indicator

Example:

<template>
  <div class="ocr-preview">
    <!-- Existing fields -->

    <!-- NEW: Inter-OCR consistency warning -->
    <div v-if="ocrData.inter_ocr_ratio && ocrData.inter_ocr_ratio > 10" class="ocr-consistency-warning">
      <i class="pi pi-exclamation-circle" />
      Inconsistență detectată între motoarele OCR ({{ Math.round(ocrData.inter_ocr_ratio) }}x diferență).
      <br />
      <small>Valorile folosite provin din: {{ ocrData.inter_ocr_source_used }}</small>
    </div>
  </div>
</template>

Design Decisions

1. Why Validation Warnings Instead of Errors?

Decision: Use non-blocking warnings instead of blocking errors.

Rationale:

  • User requirement: "Allow save with warnings"
  • OCR will never be 100% perfect
  • Users can override incorrect extractions
  • Supervisor review catches issues before approval

Trade-off: Risk of bad data entering system vs. user frustration with blocked submissions.

Mitigation: Manual review flag ensures supervisor catches issues.

2. Why Replace Heavy with Medium OCR?

Decision: Remove Heavy preprocessing, add Medium preprocessing.

Rationale:

  • Heavy causes digit concatenation on clear PDFs (production evidence)
  • Binarization destroys text boundaries on high-quality images
  • Morphological operations merge adjacent numbers (85.99 → 859,762.16)

Analysis of Heavy Preprocessing (lines 153-164 in image_preprocessor.py):

# 7. Adaptive thresholding (binarization) - PROBLEM!
binary = cv2.adaptiveThreshold(
    sharpened, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    blockSize=11, C=5  # Block size can merge nearby digits
)

# 8. Morphological operations - COMPOUNDS THE PROBLEM!
kernel_close = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
result = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel_close)
# MORPH_CLOSE fills small gaps → merges adjacent numbers
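
The gap-filling effect of a morphological close can be seen on a toy, one-dimensional row of pixels. This is a pure-Python illustration with a 3-pixel window, not OpenCV's 2×2 kernel, and it treats 1 as foreground; which polarity counts as foreground in the real pipeline depends on the binarized image.

```python
from typing import List

def dilate(row: List[int]) -> List[int]:
    """Each pixel becomes the max of itself and its immediate neighbours."""
    n = len(row)
    return [max(row[max(i - 1, 0):min(i + 2, n)]) for i in range(n)]

def erode(row: List[int]) -> List[int]:
    """Each pixel becomes the min of itself and its immediate neighbours."""
    n = len(row)
    return [min(row[max(i - 1, 0):min(i + 2, n)]) for i in range(n)]

def close(row: List[int]) -> List[int]:
    """Morphological close: dilation followed by erosion."""
    return erode(dilate(row))

narrow_gap = [0, 0, 1, 1, 0, 1, 1, 0, 0]        # two strokes, 1 pixel apart
wide_gap = [0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0]    # two strokes, 3 pixels apart

print(close(narrow_gap))  # the 1-pixel gap is filled, strokes merge
print(close(wide_gap))    # the 3-pixel gap survives
```

Features that sit closer together than the structuring element are fused, which is why tightly spaced characters are at risk after this step.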

Alternative Considered: Keep Heavy but add safeguards. Rejected: Too risky, no benefit for clear PDFs.

3. Why Romanian CIF Mod 11 Validation?

Decision: Implement CIF checksum validation algorithm.

Rationale:

  • Romanian CIFs have built-in checksum (last digit)
  • Validates extracted CUI is mathematically correct
  • Catches OCR digit errors (10562600 vs 10562601)

Algorithm: Mod 11 checksum

  • Weights: [7, 5, 3, 2, 1, 7, 5, 3, 2], right-aligned against the digits that precede the control digit (left-pad those digits with zeros to 9)
  • Formula: (sum(digit[i] * weight[i]) * 10) % 11
  • Control digit: the remainder (0 if the remainder is 10)

Example: RO10562600

  • Digits (padded): 0,0,1,0,5,6,2,6,0,[0]
  • Weighted sum: 0×7 + 0×5 + 1×3 + 0×2 + 5×1 + 6×7 + 2×5 + 6×3 + 0×2 = 0+0+3+0+5+42+10+18+0 = 78
  • (78 × 10) % 11 = 780 % 11 = 10 → control digit 0, which matches the last digit → VALID

Note: Some older CIFs may not have checksums (pre-2000). Validation is permissive (warning, not error).

4. Why Apply to New Uploads Only?

Decision: Don't reprocess existing receipts.

Rationale:

  • Migration impact: ~500 existing receipts in DB
  • Reprocessing cost: OCR is slow (~2-5s per receipt)
  • Risk: May change existing approved data
  • Benefit: Minimal (old receipts already reviewed)

Implementation: Migration adds column with default NULL (not FALSE).


Validation Rules Specification

1. Amount Range Validation

Rule: Amount must be between 0.01 and 100,000 RON.

Implementation:

class AmountRangeRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.amount is not None:  # 0.00 must still be checked
            if extraction.amount < Decimal('0.01'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} is too small (< 0.01 RON)',
                    severity='high'
                ))
            elif extraction.amount > Decimal('100000'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} exceeds limit (> 100,000 RON)',
                    severity='high'
                ))

            # Check decimal places (a positive exponent means no fractional digits)
            decimal_places = max(0, -extraction.amount.as_tuple().exponent)
            if decimal_places > 2:
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='decimal_places',
                    message=f'Amount has {decimal_places} decimal places (max 2)',
                    severity='medium',
                    suggested_value=extraction.amount.quantize(Decimal('0.01'))
                ))
        return warnings

Test Cases:

  • 0.00 RON → Warning (too small)
  • 0.01 RON → Valid
  • 85.99 RON → Valid
  • 100,000 RON → Valid
  • 100,001 RON → Warning (too large)
  • 859,762.16 RON → Warning (too large)
  • 85.999 RON → Warning (too many decimals)

2. TVA Ratio Validation

Rule: TVA must be 5-24% of TOTAL amount.

Implementation:

class TVARatioRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.tva_total and extraction.amount:
            # TVA cannot be greater than TOTAL
            if extraction.tva_total > extraction.amount:
                warnings.append(ValidationWarning(
                    field='tva_total',
                    rule='tva_greater_than_total',
                    message=f'TVA ({extraction.tva_total}) cannot be greater than TOTAL ({extraction.amount})',
                    severity='high',
                    suggested_value=None  # Will be auto-corrected by service
                ))
            else:
                # Check ratio
                ratio = extraction.tva_total / extraction.amount * Decimal('100')
                if ratio < Decimal('5'):
                    warnings.append(ValidationWarning(
                        field='tva_total',
                        rule='tva_ratio_low',
                        message=f'TVA is {ratio:.1f}% of total (expected 5-24%)',
                        severity='medium'
                    ))
                elif ratio > Decimal('24'):
                    warnings.append(ValidationWarning(
                        field='tva_total',
                        rule='tva_ratio_high',
                        message=f'TVA is {ratio:.1f}% of total (expected 5-24%)',
                        severity='high'
                    ))
        return warnings

Test Cases:

  • TVA=14.92, TOTAL=85.99 → 17.4% → Valid
  • TVA=149,214.92, TOTAL=859,762.16 → 17.4% → Ratio passes although both values are wrong (caught by amount_range)
  • TVA=4.00, TOTAL=100.00 → 4% → Warning (too low)
  • TVA=100.00, TOTAL=85.99 → 116% → Warning (impossible!)
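
The arithmetic behind these cases can be reproduced with a standalone function. This is a sketch under stated assumptions: `classify_tva` is a hypothetical helper, and only the rule names and the 5-24% band come from the rule above.

```python
from decimal import Decimal
from typing import Optional


def classify_tva(tva: Decimal, total: Decimal) -> Optional[str]:
    """Return the name of the violated rule, or None if the pair looks plausible."""
    if total <= 0:
        return None  # nothing to compare against; also avoids division by zero
    if tva > total:
        return "tva_greater_than_total"
    ratio = tva / total * Decimal("100")
    if ratio < Decimal("5"):
        return "tva_ratio_low"
    if ratio > Decimal("24"):
        return "tva_ratio_high"
    return None
```

Note that the inflated pair (149,214.92 / 859,762.16) keeps the same ratio as the correct pair, so a ratio check alone cannot catch a magnitude error that affects both fields; that is why the amount range rule is needed as well.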

3. Payment Sum Validation

Rule: CARD + NUMERAR must equal TOTAL (±0.02 RON tolerance).

Implementation:

class PaymentSumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.payment_methods and extraction.amount:
            payment_sum = sum(pm['amount'] for pm in extraction.payment_methods)
            difference = abs(payment_sum - extraction.amount)

            if difference > Decimal('0.02'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='payment_sum_mismatch',
                    message=f'Payment methods sum ({payment_sum}) ≠ TOTAL ({extraction.amount}), diff={difference}',
                    severity='high',
                    suggested_value=payment_sum
                ))
        return warnings

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Auto-correct TOTAL from payment sum if confidence < 80%."""
        corrections = {}
        if extraction.payment_methods and extraction.amount:
            payment_sum = sum(pm['amount'] for pm in extraction.payment_methods)
            difference = abs(payment_sum - extraction.amount)

            if difference > Decimal('0.02') and extraction.confidence_amount < 0.80:
                corrections['amount'] = payment_sum
                print(f"[Auto-Correct] TOTAL corrected: {extraction.amount} → {payment_sum} (from payment methods)", flush=True)
        return corrections

Test Cases:

  • CARD=50, NUMERAR=35.99, TOTAL=85.99 → Valid
  • CARD=50, NUMERAR=35.97, TOTAL=85.99 → Diff=0.02 → Valid (tolerance)
  • CARD=50, NUMERAR=35.00, TOTAL=85.99 → Diff=0.99 → Warning
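
The tolerance comparison is sensitive to how the amounts are represented. The sketch below assumes the extracted amounts may arrive as `int`, `float`, `str`, or `Decimal` and coerces them before comparing; `payment_sum_matches` is a hypothetical helper, not part of the spec's rule class.

```python
from decimal import Decimal
from typing import Iterable

TOLERANCE = Decimal("0.02")  # tolerance stated in the rule above


def payment_sum_matches(amounts: Iterable[object], total: object) -> bool:
    """True when the payment amounts add up to total within TOLERANCE."""
    # Going through str() keeps a float such as 35.97 from carrying binary noise
    # into the Decimal arithmetic.
    payment_sum = sum((Decimal(str(a)) for a in amounts), Decimal("0"))
    return abs(payment_sum - Decimal(str(total))) <= TOLERANCE
```

Starting `sum` from `Decimal("0")` keeps the result a `Decimal` even for an empty list, and avoids the `TypeError` that mixing `float` and `Decimal` operands would raise.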

4. TVA Entries Sum Validation

Rule: Σ(TVA entries) must equal TVA TOTAL (±0.02 RON tolerance).

Implementation:

class TVAEntriesSumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.tva_entries and extraction.tva_total:
            entries_sum = sum(e['amount'] for e in extraction.tva_entries)
            difference = abs(entries_sum - extraction.tva_total)

            if difference > Decimal('0.02'):
                warnings.append(ValidationWarning(
                    field='tva_total',
                    rule='tva_entries_sum_mismatch',
                    message=f'TVA entries sum ({entries_sum}) ≠ TVA TOTAL ({extraction.tva_total}), diff={difference}',
                    severity='medium',
                    suggested_value=entries_sum
                ))
        return warnings

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Use entries sum as TVA TOTAL if mismatch."""
        corrections = {}
        if extraction.tva_entries and extraction.tva_total:
            entries_sum = sum(e['amount'] for e in extraction.tva_entries)
            difference = abs(entries_sum - extraction.tva_total)

            if difference > Decimal('0.02'):
                corrections['tva_total'] = entries_sum
                print(f"[Auto-Correct] TVA TOTAL corrected: {extraction.tva_total} → {entries_sum} (from entries)", flush=True)
        return corrections

Test Cases:

  • Entries=[A:19%:14.92], TOTAL=14.92 → Valid
  • Entries=[A:19%:10.00, B:9%:4.92], TOTAL=14.92 → Valid
  • Entries=[A:19%:14.92], TOTAL=14.94 → Diff=0.02 → Valid (tolerance)
  • Entries=[A:19%:14.92], TOTAL=15.00 → Diff=0.08 → Warning

5. Inter-OCR Consistency Validation

Rule: Flag if values differ >10x between OCR engines.

Implementation:

class InterOCRConsistencyRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        """Check the inter-OCR ratio that the merge step computed and stored on the extraction."""
        warnings = []
        if hasattr(extraction, 'inter_ocr_ratio') and extraction.inter_ocr_ratio:
            if extraction.inter_ocr_ratio > 10:
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='inter_ocr_inconsistency',
                    message=f'Large inconsistency between OCR engines ({extraction.inter_ocr_ratio:.0f}x difference)',
                    severity='high'
                ))
        return warnings

Test Cases:

  • Light=85.99, Medium=86.00 → Ratio=1.00 → Valid
  • Light=85.99, Medium=90.00 → Ratio=1.05 → Valid
  • Light=85.99, Medium=859.76 → Ratio=10.00 → Valid (edge case)
  • Light=85.99, Medium=859,762.16 → Ratio=10,000 → Warning!
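
The rule reads `inter_ocr_ratio` but the section does not show how the merge step derives it. One plausible definition, offered as an assumption rather than the actual merge logic, is the larger value divided by the smaller one, so the result is symmetric and always at least 1. The function name below is hypothetical.

```python
from decimal import Decimal
from typing import Optional


def compute_inter_ocr_ratio(light: Optional[Decimal], medium: Optional[Decimal]) -> Optional[Decimal]:
    """Ratio of the larger to the smaller amount, or None when not comparable."""
    if light is None or medium is None or light <= 0 or medium <= 0:
        return None  # a missing or non-positive value gives no meaningful ratio
    return max(light, medium) / min(light, medium)
```

With this definition, 85.99 versus 859.76 yields a ratio just under 10 and passes, while 85.99 versus 859,762.16 yields roughly 10,000 and is flagged.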

6. CUI Checksum Validation

Rule: Validate Romanian CIF Mod 11 checksum.

Implementation:

class CUIChecksumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.cui:
            # Normalize CUI
            digits = re.sub(r'\D', '', extraction.cui)

            # Validate length
            if not (6 <= len(digits) <= 10):
                warnings.append(ValidationWarning(
                    field='cui',
                    rule='cui_length',
                    message=f'CUI length invalid: {len(digits)} digits (expected 6-10)',
                    severity='medium'
                ))
                return warnings

            # Validate Mod 11 checksum
            if not self._validate_checksum(digits):
                warnings.append(ValidationWarning(
                    field='cui',
                    rule='cui_checksum',
                    message=f'CUI checksum invalid: {extraction.cui} (failed Mod 11 validation)',
                    severity='medium'  # Medium: some old CIFs don't have checksums
                ))
        return warnings

    def _validate_checksum(self, digits: str) -> bool:
        """Romanian CIF Mod 11 checksum validation (control key 753217532)."""
        if len(digits) < 2:
            return False

        weights = [7, 5, 3, 2, 1, 7, 5, 3, 2]
        control = int(digits[-1])
        # Left-pad the body to 9 digits so the key aligns with the rightmost digit
        digits_to_check = digits[:-1].zfill(9)

        checksum = sum(int(d) * w for d, w in zip(digits_to_check, weights))
        # The weighted sum is multiplied by 10 before taking the remainder
        remainder = (checksum * 10) % 11
        expected = 0 if remainder == 10 else remainder

        return control == expected

Test Cases:

  • R010562600 → Valid (weighted sum 78, 780 mod 11 = 10 → control digit 0)
  • R011201891 → Valid (weighted sum 54, 540 mod 11 = 1 → control digit 1)
  • R012345678 → Warning (invalid checksum)
  • R01234 → Warning (too short)
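
The control-digit check can also be tried in isolation. The sketch below assumes the commonly documented procedure for Romanian CIF numbers: weight the body digits with the key 753217532 aligned to the right, multiply the sum by 10, take the remainder modulo 11, and map a remainder of 10 to 0. `cif_checksum_ok` is a hypothetical standalone helper.

```python
import re

CIF_KEY = "753217532"  # control key, aligned to the rightmost body digit


def cif_checksum_ok(cui: str) -> bool:
    """Validate the control digit of a CIF given with or without a prefix."""
    digits = re.sub(r"\D", "", cui)  # drop "RO", spaces and other non-digits
    if not 2 <= len(digits) <= 10:
        return False
    body, control = digits[:-1].zfill(9), int(digits[-1])
    total = sum(int(d) * int(k) for d, k in zip(body, CIF_KEY))
    # (total * 10) % 11 lies in 0..10; the final % 10 maps 10 to 0.
    return (total * 10) % 11 % 10 == control
```

Because the body is left-padded with zeros, a stray leading zero from an OCR misread of the prefix does not change the weighted sum.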

7. Date Validity Validation

Rule: Date must not be in the future and must not be older than 10 years.

Implementation:

class DateValidityRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.receipt_date:
            today = date.today()

            # Check future date
            if extraction.receipt_date > today:
                warnings.append(ValidationWarning(
                    field='receipt_date',
                    rule='date_future',
                    message=f'Date is in the future: {extraction.receipt_date}',
                    severity='high'
                ))

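            # Check too old (10 years); Feb 29 falls back to Feb 28 in a non-leap year
            try:
                cutoff_date = today.replace(year=today.year - 10)
            except ValueError:
                cutoff_date = today.replace(year=today.year - 10, day=28)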
            if extraction.receipt_date < cutoff_date:
                warnings.append(ValidationWarning(
                    field='receipt_date',
                    rule='date_too_old',
                    message=f'Date is older than 10 years: {extraction.receipt_date}',
                    severity='medium'
                ))
        return warnings

Test Cases:

  • 2025-12-30 (today) → Valid
  • 2025-10-11 → Valid
  • 2026-01-01 → Warning (future)
  • 2015-12-31 → Valid (one day inside the 10-year cutoff of 2015-12-30)
  • 2014-12-31 → Warning (too old)
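
These cases only hold when "today" is 2025-12-30, so a test needs a way to pin the current date. One option, sketched here as an assumption rather than the spec's design, is a pure function that receives `today` as a parameter; `check_receipt_date` is a hypothetical name, while the rule names match the ones used above.

```python
from datetime import date
from typing import List


def check_receipt_date(receipt_date: date, today: date, max_age_years: int = 10) -> List[str]:
    """Return the names of the date rules that receipt_date violates."""
    try:
        cutoff = today.replace(year=today.year - max_age_years)
    except ValueError:
        # today is Feb 29 and the target year is not a leap year
        cutoff = today.replace(year=today.year - max_age_years, day=28)
    problems = []
    if receipt_date > today:
        problems.append("date_future")
    if receipt_date < cutoff:
        problems.append("date_too_old")
    return problems
```

The rule class could then call such a function with `date.today()`, while unit tests pass a fixed date.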

Acceptance Criteria

Critical Success Criteria (Must Pass)

AC-1: Five-Holding receipt extracts correct values

  • Given: Production PDF receipt (Five-Holding, 85.99 LEI)
  • When: OCR processes with new validation
  • Then:
    • TOTAL = 85.99 LEI (NOT 859,762.16)
    • TVA = 14.92 LEI (NOT 149,214.92)
    • CUI = RO10562600 (normalized from the OCR reading R010562600)
    • Overall confidence >= 90%

AC-2: Save works with validation warnings

  • Given: Receipt with low confidence (75%)
  • When: User clicks Save
  • Then:
    • Warnings displayed in UI
    • Save button enabled
    • Receipt saved with needs_manual_review=TRUE

AC-3: Cross-validation: CARD + NUMERAR = TOTAL

  • Given: Receipt with CARD=50, NUMERAR=35.99
  • When: OCR extracts TOTAL=85.89 (off by 0.10, beyond the ±0.02 tolerance)
  • Then:
    • Warning displayed: "Payment sum (85.99) ≠ TOTAL (85.89)"
    • Suggested value: 85.99
    • Auto-corrected if confidence < 80%

AC-4: Cross-validation: Σ(TVA entries) = TVA TOTAL

  • Given: Receipt with TVA A=10.00, TVA B=4.92
  • When: OCR extracts TVA TOTAL=14.82 (off by 0.10, beyond the ±0.02 tolerance)
  • Then:
    • Warning displayed: "TVA entries sum (14.92) ≠ TVA TOTAL (14.82)"
    • Auto-corrected to 14.92

AC-5: CUI Mod 11 validation works

  • Given: Receipt with CUI R010562600
  • When: OCR processes
  • Then:
    • CUI validated against Mod 11 checksum
    • If invalid, warning displayed
    • Format normalized to "RO" prefix

Secondary Criteria (Nice-to-Have)

🔲 AC-S1: Medium OCR performs better than Heavy

  • Given: 10 clear PDF receipts
  • When: Processed with Light → Medium → Tesseract
  • Then:
    • No 10x magnitude errors
    • Average confidence >= 90%
    • Processing time < 5s

🔲 AC-S2: Validation warnings show in UI

  • Given: Receipt with 3 validation warnings
  • When: OCR completes
  • Then:
    • Warning section displayed
    • Each warning shows: field, message, severity
    • Suggested values displayed if available

Testing Strategy

Unit Tests (~300 lines)

File: backend/modules/data_entry/tests/test_ocr_validation.py

Test Coverage:

# Amount validation
test_amount_range_valid()
test_amount_range_too_small()
test_amount_range_too_large()
test_amount_decimal_places()

# TVA validation
test_tva_ratio_valid()
test_tva_ratio_too_low()
test_tva_ratio_too_high()
test_tva_greater_than_total()
test_tva_entries_sum_matches()
test_tva_entries_sum_mismatch()

# Payment validation
test_payment_sum_matches()
test_payment_sum_mismatch_within_tolerance()
test_payment_sum_mismatch_auto_corrected()

# CUI validation
test_cui_checksum_valid()
test_cui_checksum_invalid()
test_cui_length_invalid()
test_cui_normalization()

# Date validation
test_date_valid()
test_date_future()
test_date_too_old()

# Inter-OCR consistency
test_inter_ocr_consistency_valid()
test_inter_ocr_consistency_10x_difference()

# Validation engine
test_validation_engine_no_warnings()
test_validation_engine_multiple_warnings()
test_validation_engine_auto_corrections()
test_needs_manual_review_flag()

Integration Tests (~200 lines)

File: backend/modules/data_entry/tests/test_ocr_validation_integration.py

Test Coverage:

# Real receipts
test_five_holding_receipt()           # Production case (85.99 not 859,762.16)
test_omv_receipt()                    # Clear PDF, Light OCR only
test_kaufland_receipt()               # Faded thermal, Medium OCR
test_mega_image_receipt()             # Multiple TVA entries

# OCR pipeline
test_light_ocr_high_confidence_skips_medium()
test_light_ocr_low_confidence_runs_medium()
test_medium_ocr_replaces_heavy()
test_validation_runs_after_merge()

# API responses
test_api_returns_validation_warnings()
test_api_returns_needs_manual_review_flag()
test_api_returns_inter_ocr_ratio()
test_api_auto_corrects_amount_from_payments()

# Edge cases
test_no_ocr_engines_available()
test_pdf_with_multiple_pages()
test_receipt_with_no_tva()
test_receipt_with_no_payment_methods()

Manual Testing Checklist

  1. Upload Five-Holding receipt PDF (production case)

    • Verify TOTAL = 85.99 (not 859,762.16)
    • Verify TVA = 14.92 (not 149,214.92)
    • Verify no validation warnings
    • Verify overall confidence >= 90%
  2. Upload faded thermal receipt photo

    • Verify Medium OCR used (not Heavy)
    • Verify readable text extracted
    • Verify no digit concatenation
  3. Upload receipt with payment methods

    • Verify CARD + NUMERAR displayed
    • Verify sum matches TOTAL
    • If mismatch, verify warning displayed
  4. Upload receipt with multiple TVA entries

    • Verify all TVA entries extracted
    • Verify sum matches TVA TOTAL
    • If mismatch, verify warning displayed
  5. Submit receipt with warnings

    • Verify Save button enabled
    • Verify warnings displayed in UI
    • Verify needs_manual_review flag set
  6. Filter receipts by "Needs Review"

    • Verify filter shows flagged receipts
    • Verify supervisor can review

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Medium OCR still causes errors | Medium | High | Keep Tesseract as Step 3 fallback; validation catches issues |
| CUI Mod 11 validation too strict | Medium | Low | Use warning (not error); allow override; some old CIFs don't have checksums |
| Validation rules too permissive | Low | Medium | Start conservative, tune based on production data |
| Validation rules too strict | Medium | Low | Non-blocking warnings allow user override |
| Performance impact | Low | Low | Validation is fast (<10ms); OCR dominates processing time |
| Breaking changes to API | Low | High | Add new fields, keep existing fields unchanged; frontend optional |
| Database migration issues | Low | Medium | Use NULL default (not FALSE); test on staging first |

Out of Scope

Explicitly NOT included in this feature:

  1. Reprocessing existing receipts - Only new uploads validated
  2. Machine learning OCR improvements - Use existing PaddleOCR/Tesseract
  3. Custom OCR training - Generic models only
  4. Approval workflow changes - Validation is separate from approval
  5. Automatic approval - Always requires supervisor review
  6. Advanced validation rules - Only basic sanity checks
  7. Multi-currency support - RON only for now
  8. Historical receipt validation - Phase 2 feature
  9. OCR confidence tuning - Accept engine defaults
  10. Frontend validation logic - Backend only (frontend displays)

Open Questions

Q1: Should we keep Heavy preprocessing as fallback?

Answer: No. Remove completely. Evidence shows it causes more harm than good on clear PDFs. Medium preprocessing handles mixed-quality images better.

Q2: What tolerance for payment sum validation?

Answer: ±0.02 RON (2 bani). Romanian receipts use 2 decimal places, so this tolerance absorbs rounding errors without hiding real mismatches.

Q3: Should CUI validation be blocking or warning?

Answer: Warning only. Some old Romanian CIFs (pre-2000) don't have Mod 11 checksums. Also, OCR may extract digits incorrectly.

Q4: What if Light OCR has high confidence but wrong values?

Answer: Validation catches this. If Light OCR extracts 859,762.16 with 98% confidence, amount_range rule flags it (>100,000 limit). User sees warning.

Q5: Should we reprocess existing receipts with new validation?

Answer: No. Too risky and time-consuming. Apply to new uploads only. If user wants to re-validate old receipt, they can re-upload.

Q6: What about receipts with no payment methods?

Answer: No validation warning. Not all receipts show CARD/NUMERAR breakdown (especially older thermal receipts). Only validate if payment methods are extracted.

Q7: Should validation auto-correct or just warn?

Answer: Both. Auto-correct obvious errors (TOTAL from payment sum if confidence < 80%). Warn for ambiguous cases. Never silently change high-confidence values.
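
The policy in this answer can be written as a small decision function. This is a sketch under stated assumptions: `resolve_total` and the action labels are hypothetical, the 0.80 threshold and ±0.02 tolerance come from the rules above, and an unknown confidence is treated as "do not auto-correct".

```python
from decimal import Decimal
from typing import Optional, Tuple

TOLERANCE = Decimal("0.02")
AUTO_CORRECT_BELOW = 0.80  # confidence threshold quoted in the answer


def resolve_total(total: Decimal, payment_sum: Decimal,
                  confidence: Optional[float]) -> Tuple[Decimal, str]:
    """Return (value_to_store, action); action is 'keep', 'warn' or 'auto_correct'."""
    if abs(payment_sum - total) <= TOLERANCE:
        return total, "keep"
    if confidence is not None and confidence < AUTO_CORRECT_BELOW:
        return payment_sum, "auto_correct"
    # High or unknown confidence: never change the value silently, only warn.
    return total, "warn"
```

Keeping the decision in one place makes the "never silently change high-confidence values" guarantee easy to test.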

Q8: How to handle receipts from future (clock skew)?

Answer: Warning only (not error). Allow up to 1 day in future (±24h tolerance) for clock skew. Beyond that, warn user.
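
A one-day tolerance could be expressed as follows. This is a minimal sketch, assuming the tolerance is applied to calendar dates rather than timestamps; `is_future_dated` and `FUTURE_TOLERANCE` are hypothetical names.

```python
from datetime import date, timedelta

FUTURE_TOLERANCE = timedelta(days=1)  # "up to 1 day in future" from the answer


def is_future_dated(receipt_date: date, today: date) -> bool:
    """True when the receipt date lies beyond today plus the tolerance."""
    return receipt_date > today + FUTURE_TOLERANCE
```

With `today` set to 2025-12-30, a receipt dated 2025-12-31 passes and one dated 2026-01-01 triggers the warning.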


Estimated Complexity

Overall: High

Justification:

  • File Count: 6 modified, 3 created, 1 migration = 10 files
  • Line Changes: ~1,135 lines (400 new validation, 300 tests, 200 integration tests, 235 modifications)
  • Risk Level: Medium (core OCR pipeline changes, but validation is additive)
  • Testing: roughly 40 new test cases (26 unit and 16 integration tests listed under Testing Strategy), manual testing required
  • Dependencies: None (uses existing OCR engines)
  • Complexity Factors:
    • Multi-layer validation logic
    • Romanian CIF checksum algorithm
    • Cross-field validation dependencies
    • Inter-OCR comparison logic
    • Auto-correction logic
    • Frontend integration
    • Database migration

Estimated Effort: 2-3 days

  • Day 1: Validation engine + unit tests
  • Day 2: OCR pipeline integration + medium preprocessing
  • Day 3: Frontend integration + manual testing + bug fixes

Dependencies

External Libraries

  • cv2 (OpenCV) - Already installed
  • numpy - Already installed
  • paddleocr - Already installed
  • tesseract - Already installed
  • pydantic - Already installed
  • sqlalchemy - Already installed

Internal Modules

  • backend/modules/data_entry/services/ocr_service.py
  • backend/modules/data_entry/services/ocr_extractor.py
  • backend/modules/data_entry/services/image_preprocessor.py
  • backend/modules/data_entry/routers/ocr.py
  • backend/modules/data_entry/schemas/ocr.py
  • backend/modules/data_entry/db/models/receipt.py

Database Schema Changes

  • Add needs_manual_review column to receipts table (nullable BOOLEAN)
  • Alembic migration required
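
The effect of adding a nullable column without a default can be illustrated outside Alembic. The sketch below uses an in-memory SQLite database purely for demonstration; the `receipts` table and `needs_manual_review` column come from this spec, while the `amount` column and the sample rows are assumptions, and the real migration would target the project's own database through Alembic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE receipts (id INTEGER PRIMARY KEY, amount TEXT)")
conn.execute("INSERT INTO receipts (amount) VALUES ('85.99')")  # pre-existing row

# Nullable column with no default: existing rows read back as NULL
# ("not yet validated") instead of FALSE.
conn.execute("ALTER TABLE receipts ADD COLUMN needs_manual_review BOOLEAN")
conn.execute(
    "INSERT INTO receipts (amount, needs_manual_review) VALUES ('859762.16', 1)"
)

rows = conn.execute(
    "SELECT id, needs_manual_review FROM receipts ORDER BY id"
).fetchall()
flagged = conn.execute(
    "SELECT id FROM receipts WHERE needs_manual_review = 1"
).fetchall()
```

Rows that predate the migration stay distinguishable from rows that were validated and found clean, which is the reason for preferring NULL over FALSE as the default.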

Implementation Notes

  1. Phase 1: Core Validation (Day 1)

    • Create ocr/validation.py module
    • Implement validation rules (amount, TVA, payment, CUI, date)
    • Write unit tests
    • Checkpoint: All tests pass
  2. Phase 2: OCR Integration (Day 2 Morning)

    • Add preprocess_medium() to image_preprocessor
    • Update _merge_extractions() with validation-aware logic
    • Remove/deprecate preprocess_heavy()
    • Checkpoint: Five-Holding receipt extracts correctly
  3. Phase 3: API Updates (Day 2 Afternoon)

    • Update ExtractionResult dataclass with validation fields
    • Update API schemas (ocr.py, routers/ocr.py)
    • Add database migration
    • Checkpoint: API returns validation warnings
  4. Phase 4: Integration Testing (Day 3 Morning)

    • Write integration tests
    • Test with real receipts (Five-Holding, OMV, Kaufland)
    • Checkpoint: All integration tests pass
  5. Phase 5: Frontend & Polish (Day 3 Afternoon)

    • Update Vue components to display warnings
    • Add "Needs Review" filter
    • Manual testing
    • Bug fixes
    • Checkpoint: Production-ready

Code Quality Standards

  • Type hints for all functions
  • Docstrings for all public methods
  • Unit test coverage >90%
  • Integration tests for critical paths
  • Print statements for debugging (will be converted to logging later)
  • Follow existing code patterns (SQLModel, Pydantic v2, FastAPI)

Performance Considerations

  • Validation overhead: <10ms per receipt (negligible vs. OCR time)
  • Medium preprocessing: Similar speed to Heavy (~500ms)
  • Database migration: Non-blocking (adds NULL column)
  • Frontend impact: Minimal (only displays warnings)

Project Context

  • CLAUDE.md: Data Entry module instructions
  • docs/data-entry/DATA-ENTRY-MODULE.md: Module architecture
  • docs/ARCHITECTURE-DECISIONS.md: Ultrathin monolith rationale

Technical References

Similar Features

  • Payment methods extraction: Already implemented in ocr_extractor.py:1361
  • TVA entries extraction: Already implemented in ocr_extractor.py:820
  • Cross-validation logic: Pattern from _cross_validate_and_calculate_amount (lines 468-557)

Summary

This specification provides a comprehensive solution to fix critical OCR data extraction issues in the Data Entry module. The multi-layer validation system ensures data integrity while maintaining user flexibility through non-blocking warnings.

Key Benefits:

  • Prevents 10,000x magnitude errors (85.99 vs 859,762.16)
  • Validates cross-field dependencies (payment sum, TVA sum)
  • Improves CUI extraction with Mod 11 checksum
  • Replaces problematic Heavy OCR with Medium preprocessing
  • Non-blocking warnings preserve user workflow
  • Manual review flag helps supervisors prioritize

Next Steps:

  1. Review and approve specification
  2. Create feature branch: feature/bon-ocr-validation
  3. Implement Phase 1 (validation engine)
  4. Continue with Phases 2-5
  5. Deploy to staging for testing
  6. Monitor production for 1 week before full rollout

Document Version: 1.0
Last Updated: 2025-12-30
Status: Ready for Implementation
Estimated Completion: 2026-01-02 (3 working days)